Read and write Word doc files using POI
The HWPF module of Apache POI is dedicated to reading and writing Word doc files. In HWPF, an HWPFDocument represents a Word doc document. HWPFDocument involves several concepts:
Range : Represents a range of the document, which can be the entire document, a section (Section), a paragraph (Paragraph), or a run of text with common attributes (CharacterRun).
Section : A section of a Word document. A Word document can consist of multiple sections.
Paragraph : A paragraph of a Word document. A section can consist of multiple paragraphs.
CharacterRun : A run of text whose characters share the same properties. A paragraph can consist of multiple CharacterRuns.
Table : A table.
TableRow : The row corresponding to the table.
TableCell : The cell corresponding to the table.
Section, Paragraph, CharacterRun, and Table all inherit from Range.
1 Read Word doc file
In day-to-day applications we rarely read information from Word files; writing content into them is far more common. There are two main ways to read data from a Word doc file with POI: through WordExtractor and through HWPFDocument. Internally, WordExtractor itself obtains its information through HWPFDocument.
1.1 Read files through WordExtractor
When reading a file with WordExtractor, we can only read the text content of the document and some document-level properties. The properties of the content itself cannot be read this way; for those, use HWPFDocument. Here is an example of reading a file with WordExtractor:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hpsf.DocumentSummaryInformation;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.junit.Test;

public class HwpfTest {

    @SuppressWarnings("deprecation")
    @Test
    public void testReadByExtractor() throws Exception {
        InputStream is = new FileInputStream("D://test.doc");
        WordExtractor extractor = new WordExtractor(is);
        // Output all text of the Word document
        System.out.println(extractor.getText());
        System.out.println(extractor.getTextFromPieces());
        // Output the content of the header
        System.out.println("Header:" + extractor.getHeaderText());
        // Output the content of the footer
        System.out.println("Footer:" + extractor.getFooterText());
        // Output the metadata of the current Word document, including the author, modification time, etc.
        System.out.println(extractor.getMetadataTextExtractor().getText());
        // Get the text of each paragraph
        String[] paraTexts = extractor.getParagraphText();
        for (int i = 0; i < paraTexts.length; i++) {
            System.out.println("Paragraph " + (i + 1) + " : " + paraTexts[i]);
        }
        // Output the summary information of the current Word document
        this.printInfo(extractor.getSummaryInformation());
        // Output the document summary information of the current Word document
        this.printInfo(extractor.getDocSummaryInformation());
        this.closeStream(is);
    }

    /**
     * Output SummaryInformation
     * @param info
     */
    private void printInfo(SummaryInformation info) {
        // Author
        System.out.println(info.getAuthor());
        // Character count
        System.out.println(info.getCharCount());
        // Number of pages
        System.out.println(info.getPageCount());
        // Title
        System.out.println(info.getTitle());
        // Subject
        System.out.println(info.getSubject());
    }

    /**
     * Output DocumentSummaryInformation
     * @param info
     */
    private void printInfo(DocumentSummaryInformation info) {
        // Category
        System.out.println(info.getCategory());
        // Company
        System.out.println(info.getCompany());
    }

    /**
     * Close input stream
     * @param is
     */
    private void closeStream(InputStream is) {
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

}

1.2 Read files through HWPFDocument
HWPFDocument represents the Word document itself and is more powerful than WordExtractor. Through it, we can read tables, lists, and other elements of the document, and we can also add, modify, and delete content. These changes are only held in the HWPFDocument, however: what we change is the in-memory HWPFDocument, not the file on disk. To make the changes take effect, call the write method of HWPFDocument to output the modified document to a given output stream. That can be an output stream to the original file, to a new file (equivalent to Save As), or any other output stream. Here is an example of reading a file through HWPFDocument:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Bookmark;
import org.apache.poi.hwpf.usermodel.Bookmarks;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Section;
import org.apache.poi.hwpf.usermodel.Table;
import org.apache.poi.hwpf.usermodel.TableCell;
import org.apache.poi.hwpf.usermodel.TableIterator;
import org.apache.poi.hwpf.usermodel.TableRow;
import org.junit.Test;

public class HwpfTest {

    @Test
    public void testReadByDoc() throws Exception {
        InputStream is = new FileInputStream("D://test.doc");
        HWPFDocument doc = new HWPFDocument(is);
        // Output bookmark information
        this.printInfo(doc.getBookmarks());
        // Output text
        System.out.println(doc.getDocumentText());
        Range range = doc.getRange();
        // this.insertInfo(range);
        this.printInfo(range);
        // Read the tables
        this.readTable(range);
        // Read the lists
        this.readList(range);
        // Delete a range
        Range r = new Range(2, 5, doc);
        r.delete(); // Deleted in memory only; write the document back to save the change
        // Write the current HWPFDocument to the output stream
        try (OutputStream os = new FileOutputStream("D://test.doc")) {
            doc.write(os);
        }
        this.closeStream(is);
    }

    /**
     * Close the input stream
     * @param is
     */
    private void closeStream(InputStream is) {
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Output bookmark information
     * @param bookmarks
     */
    private void printInfo(Bookmarks bookmarks) {
        int count = bookmarks.getBookmarksCount();
        System.out.println("Number of bookmarks: " + count);
        Bookmark bookmark;
        for (int i = 0; i < count; i++) {
            bookmark = bookmarks.getBookmark(i);
            System.out.println("Bookmark " + (i + 1) + " name: " + bookmark.getName());
            System.out.println("start position: " + bookmark.getStart());
            System.out.println("end position: " + bookmark.getEnd());
        }
    }

    /**
     * Read the tables
     * Each carriage return represents a paragraph, so in a table each cell contains
     * at least one paragraph, and each row ends with a paragraph.
     * @param range
     */
    private void readTable(Range range) {
        // Iterate over the tables within the range
        TableIterator tableIter = new TableIterator(range);
        Table table;
        TableRow row;
        TableCell cell;
        while (tableIter.hasNext()) {
            table = tableIter.next();
            int rowNum = table.numRows();
            for (int j = 0; j < rowNum; j++) {
                row = table.getRow(j);
                int cellNum = row.numCells();
                for (int k = 0; k < cellNum; k++) {
                    cell = row.getCell(k);
                    // Output cell text
                    System.out.println(cell.text().trim());
                }
            }
        }
    }

    /**
     * Read the lists
     * @param range
     */
    private void readList(Range range) {
        int num = range.numParagraphs();
        Paragraph para;
        for (int i = 0; i < num; i++) {
            para = range.getParagraph(i);
            if (para.isInList()) {
                System.out.println("list: " + para.text());
            }
        }
    }

    /**
     * Output Range
     * @param range
     */
    private void printInfo(Range range) {
        // Get the number of paragraphs
        int paraNum = range.numParagraphs();
        System.out.println(paraNum);
        for (int i = 0; i < paraNum; i++) {
            // this.insertInfo(range.getParagraph(i));
            System.out.println("Paragraph " + (i + 1) + ": " + range.getParagraph(i).text());
            if (i == (paraNum - 1)) {
                // Insert text after the last paragraph (in memory only)
                this.insertInfo(range.getParagraph(i));
            }
        }
        int secNum = range.numSections();
        System.out.println(secNum);
        Section section;
        for (int i = 0; i < secNum; i++) {
            section = range.getSection(i);
            System.out.println(section.getMarginLeft());
            System.out.println(section.getMarginRight());
            System.out.println(section.getMarginTop());
            System.out.println(section.getMarginBottom());
            System.out.println(section.getPageHeight());
            System.out.println(section.text());
        }
    }

    /**
     * Insert content into a Range; it is only written into memory
     * @param range
     */
    private void insertInfo(Range range) {
        range.insertAfter("Hello");
    }

}

2 Write Word doc file
To write a Word doc file with POI, we must first have a doc file: writing goes through HWPFDocument, and an HWPFDocument must be created from an existing doc file. The usual approach is to prepare a doc file with blank content on disk and create an HWPFDocument from it. We can then add new content to the HWPFDocument and write it to another doc file, which amounts to generating a Word doc file with POI.
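When the output should land on an existing path, one cautious option is to write to a temporary file first and only then move it over the target, so that a failed write does not leave a truncated document behind. The sketch below uses only the JDK; the `SafeSave` class and its `StreamWriter` callback are hypothetical names introduced for illustration and are not part of POI.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Sketch: write to a temporary file first, then move it over the target. */
class SafeSave {

    /** Hypothetical callback type: anything that can write itself to an output stream. */
    interface StreamWriter {
        void writeTo(OutputStream os) throws IOException;
    }

    static void save(Path target, StreamWriter writer) throws IOException {
        // Create the temp file next to the target so the final move stays in one directory
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "save-", ".tmp");
        try {
            try (OutputStream os = Files.newOutputStream(temp)) {
                writer.writeTo(os);
            }
            // The target is only replaced after the write has completed
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Cleans up after a failed write; does nothing after a successful move
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempDirectory("safe-save").resolve("out.doc");
        save(target, os -> os.write("demo bytes".getBytes(StandardCharsets.UTF_8)));
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

With an HWPFDocument named doc, a call such as SafeSave.save(path, doc::write) should fit this callback, since write takes an OutputStream.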
In practice, the Word files we generate usually belong to a fixed type: the format stays the same and only some fields differ. So we do not have to build the whole content of the file through HWPFDocument. Instead, create a Word document on disk whose content is what the generated file should contain, and write the variable parts as placeholders such as "${paramName}". To generate a file from some data, obtain an HWPFDocument from this template, call the replaceText() method of Range to replace each placeholder with its value, and write the HWPFDocument to a new output stream. This approach is the more common one in practice because it reduces our workload and leaves the formatting to the template, where it is easier to get right. Let's work through an example based on this method.
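Before turning to POI, the substitution scheme itself can be sketched on a plain string. This is a minimal illustration using only the JDK; `TemplateFiller` and `fill` are hypothetical names, and the sample text is made up for the demo.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of ${name} placeholder substitution on a plain string (no POI involved). */
class TemplateFiller {

    /** Replaces every "${key}" in the template with the matching value from the map. */
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // String.replace substitutes literally, so "$" and "{" need no escaping
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("appleAmt", "100.00");
        values.put("bananaAmt", "200.00");
        values.put("totalAmt", "300.00");
        System.out.println(
                fill("Apple: ${appleAmt}, Banana: ${bananaAmt}, Total: ${totalAmt}", values));
    }
}
```

With a Word template, the same loop would call replaceText("${" + key + "}", value) on the document's Range for each entry instead of String.replace. Placeholders with no matching key are left unchanged.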
Suppose we have some variable data and need to generate a Word doc file from it in the following format:
Following the description above, the first step is to create a doc file in that format as a template, with the following content:
With such a template, we can create the corresponding HWPFDocument, replace each variable with its value, and then write the HWPFDocument to the target output stream. Below is the corresponding code.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.junit.Test;

public class HwpfTest {

    @Test
    public void testWrite() throws Exception {
        String templatePath = "D://word//template.doc";
        InputStream is = new FileInputStream(templatePath);
        HWPFDocument doc = new HWPFDocument(is);
        Range range = doc.getRange();
        // Replace ${reportDate} within the range with the current date
        range.replaceText("${reportDate}", new SimpleDateFormat("yyyy-MM-dd").format(new Date()));
        range.replaceText("${appleAmt}", "100.00");
        range.replaceText("${bananaAmt}", "200.00");
        range.replaceText("${totalAmt}", "300.00");
        OutputStream os = new FileOutputStream("D://word//write.doc");
        // Write the document to the output stream
        doc.write(os);
        this.closeStream(os);
        this.closeStream(is);
    }

    /**
     * Close input stream
     * @param is
     */
    private void closeStream(InputStream is) {
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Close output stream
     * @param os
     */
    private void closeStream(OutputStream os) {
        if (os != null) {
            try {
                os.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

}

(Note: This article is based on POI 3.9)