Preface (Background introduction):
Apache POI is an open source project of the Apache Software Foundation for processing Office documents; it can create and parse files in Word, Excel, and PowerPoint formats.
POI offers two components for Word documents: HWPF (.doc) and XWPF (.docx). If you have used both, you already know how painful it is to parse Word documents in Java.
The two biggest problems are:
First, the two components share no common parent class or interface (XSSF and HSSF next door look on with contempt), so you cannot program against a single interface for both formats;
Second, the official API offers no way to ask where a picture sits in the document. You can get all the pictures, but you cannot tell where each one belongs, so you cannot put them back in the right place later.
For the first point, I had no choice but to look at other technologies such as jacob and docx4j for an alternative, but docx4j seems to handle only 2007-format documents (.docx).
For the second point, this article gives my solution. In fact, that is the reason I wrote it.
Note: If you are in a hurry, just read Chapter 2 and Chapter 3.
1. Background knowledge
1. The two Word formats correspond to two different storage methods
As we all know, Word documents come in two storage formats: doc and docx.
doc: commonly called the Word 2003 format; it stores data in binary and is not the focus of today's discussion.
docx: the Word 2007 format; it stores content and formatting as XML.
You may ask: how can a file that ends in .docx be XML?
It's simple: pick any docx file, right-click it, open it with a compression tool, and you will see a directory structure like this:
So you think docx is a single document, but in fact it is just a compressed archive. (docx:?_?)
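If you would rather check this from code than from a compression tool, here is a minimal sketch that lists the entries of a docx package using only the JDK's java.util.zip. The class name DocxEntries is a hypothetical helper made up for this illustration; it is not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: lists the entries of a .docx package, which is a ZIP archive. */
final class DocxEntries {

    /** Returns the entry names in archive order, for example "word/document.xml". */
    static List<String> list(InputStream docx) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

Calling DocxEntries.list(new FileInputStream(path)) on a real document should print names such as word/document.xml, plus files under word/media when the document contains pictures.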
2. How XML is defined in Word documents:
The previous example showed that a docx document is a compressed archive whose data is described in XML. So how exactly is the content of a Word document defined?
For reasons of space, I will not describe the whole archive here. I will only briefly introduce one file and one folder:
The first is the document.xml file in the word directory, which defines the content of the entire document;
The second is the media folder in the word directory. As the name suggests, it holds the multimedia content of the document:
Figure 3: word/document.xml (defines the document content)
Figure 4: Contents of the word/media folder
The following are some key parts of document.xml:
A: Overall document structure:
<w:document mc:Ignorable="w14 w15 wp14"
    xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
    xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
    xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
    xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
    xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
    xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
    xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
    xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
    xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
    xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData">
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="2"/>
        <w:keepNext w:val="0"/>
        <w:keepLines w:val="0"/>
        <w:widowControl/>
        <w:suppressLineNumbers w:val="0"/>
        <w:pBdr>
          <w:top w:color="auto" w:space="0" w:sz="0" w:val="none"/>
          <w:left w:color="auto" w:space="0" w:sz="0" w:val="none"/>
          <w:bottom w:color="auto" w:space="0" w:sz="0" w:val="none"/>
          <w:right w:color="auto" w:space="0" w:sz="0" w:val="none"/>
        </w:pBdr>
B: Document paragraph content:
<w:p>
  <w:pPr>
    <w:pStyle w:val="2"/>
    <w:keepNext w:val="0"/>
    <w:keepLines w:val="0"/>
    <w:widowControl/>
    <w:suppressLineNumbers w:val="0"/>
    <w:pBdr>
      <w:top w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:left w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:bottom w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:right w:color="auto" w:space="0" w:sz="0" w:val="none"/>
    </w:pBdr>
    <w:shd w:fill="FAFAFA" w:val="clear"/>
    <w:spacing w:after="150" w:afterAutospacing="0" w:before="150" w:beforeAutospacing="0" w:line="378" w:lineRule="atLeast"/>
    <w:ind w:firstLine="0" w:left="0" w:right="0"/>
    <w:rPr>
      <w:rFonts w:ascii="Verdana" w:cs="Verdana" w:hAnsi="Verdana" w:hint="default"/>
      <w:i w:val="0"/>
      <w:caps w:val="0"/>
      <w:color w:val="404040"/>
      <w:spacing w:val="0"/>
      <w:sz w:val="21"/>
      <w:szCs w:val="21"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="Verdana" w:cs="Verdana" w:hAnsi="Verdana" w:hint="default"/>
      <w:i w:val="0"/>
      <w:caps w:val="0"/>
      <w:color w:val="404040"/>
      <w:spacing w:val="0"/>
      <w:sz w:val="21"/>
      <w:szCs w:val="21"/>
      <w:bdr w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:shd w:fill="FAFAFA" w:val="clear"/>
    </w:rPr>
    <w:t>Author: Brian Dear</w:t>
  </w:r>
</w:p>
C: Image content definition:
<w:r>
  <w:rPr>
    <w:rFonts w:ascii="Verdana" w:cs="Verdana" w:hAnsi="Verdana" w:hint="default"/>
    <w:i w:val="0"/>
    <w:caps w:val="0"/>
    <w:color w:val="404040"/>
    <w:spacing w:val="0"/>
    <w:sz w:val="21"/>
    <w:szCs w:val="21"/>
    <w:bdr w:color="auto" w:space="0" w:sz="0" w:val="none"/>
    <w:shd w:fill="FAFAFA" w:val="clear"/>
  </w:rPr>
  <w:drawing>
    <wp:inline distT="0" distB="0" distL="114300" distR="114300">
      <wp:extent cx="5543550" cy="5543550"/>
      <wp:effectExtent b="0" l="0" r="0" t="0"/>
      <wp:docPr descr="IMG_256" id="1" name="Picture 1"/>
      <wp:cNvGraphicFramePr>
        <a:graphicFrameLocks noChangeAspect="1" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"/>
      </wp:cNvGraphicFramePr>
      <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
        <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
          <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:nvPicPr>
              <pic:cNvPr descr="IMG_256" id="1" name="Picture 1"/>
              <pic:cNvPicPr>
                <a:picLocks noChangeAspect="1"/>
              </pic:cNvPicPr>
            </pic:nvPicPr>
            <pic:blipFill>
              <a:blip r:embed="rId4"/>
              <a:stretch>
                <a:fillRect/>
              </a:stretch>
            </pic:blipFill>
            <pic:spPr>
              <a:xfrm>
                <a:off x="0" y="0"/>
                <a:ext cx="5543550" cy="5543550"/>
              </a:xfrm>
              <a:prstGeom prst="rect">
                <a:avLst/>
              </a:prstGeom>
              <a:noFill/>
              <a:ln w="9525">
                <a:noFill/>
              </a:ln>
            </pic:spPr>
          </pic:pic>
        </a:graphicData>
      </a:graphic>
    </wp:inline>
  </w:drawing>
</w:r>
If you are interested, study the three XML excerpts above. I will go straight to the conclusions:
Word document schema namespace: xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
<w:document>: the root node; marks the start of the entire document
<w:body>: child of the document node; the main content of the document
<w:p>: child of body; a paragraph, the same thing as a paragraph in the Word document
<w:r>: child of the paragraph element; a run, that is, a stretch of text within the paragraph that shares the same formatting
<w:t>: child of the run element; the text content of the document
<w:drawing>: child of the run element; defines a picture:
<wp:inline>: child of the drawing element; I have not studied its specific uses in depth.
<a:graphic>: defines the image content
<pic:blipFill>: a descendant of the graphic element that holds the index of the image content (the r:embed attribute of its <a:blip> child, "rId4" in excerpt C). POI can fetch the image resource from this index, and it is the key to locating images in the document.
Overall, when XWPF parses a docx document, it parses the XML, keeps all the nodes, and converts them into more convenient properties exposed through its API.
So we can use the interfaces POI provides to get the document content, parse the underlying data ourselves, and find out which paragraph a picture is in. You can also find out which run element the picture follows.
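To make the element hierarchy concrete before bringing POI in, here is a minimal sketch that uses only the JDK's DOM parser to map each paragraph to the r:embed indexes found inside it. BlipLocator is a hypothetical name chosen for this illustration. It is a simplification under stated assumptions: it counts every <w:p> in document order, including paragraphs nested in tables, so its paragraph numbers may differ from the list returned by XWPFDocument.getParagraphs().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: finds which paragraph of a document.xml holds which image index. */
final class BlipLocator {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String A = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Maps a 0-based paragraph number to the r:embed values found inside that paragraph. */
    static Map<Integer, List<String>> locate(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            Map<Integer, List<String>> result = new LinkedHashMap<>();
            NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element paragraph = (Element) paragraphs.item(i);
                NodeList blips = paragraph.getElementsByTagNameNS(A, "blip");
                List<String> ids = new ArrayList<>();
                for (int j = 0; j < blips.getLength(); j++) {
                    String id = ((Element) blips.item(j)).getAttributeNS(R, "embed");
                    if (!id.isEmpty()) {
                        ids.add(id);
                    }
                }
                if (!ids.isEmpty()) {
                    result.put(i, ids);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document.xml", e);
        }
    }
}
```

Paragraphs without pictures are left out of the map, so the keys are exactly the paragraph numbers where a picture occurs.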
2. Implementation
package com.szdfhx.reportStatistic.util;

import com.microsoft.schemas.vml.CTShape;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.drawingml.x2006.main.CTGraphicalObject;
import org.openxmlformats.schemas.drawingml.x2006.picture.CTPicture;
import org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTInline;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDrawing;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTObject;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;

import java.util.ArrayList;
import java.util.List;

public class XWPFUtils {

    // Get the indexes of all images in a paragraph
    public static List<String> readImageInParagraph(XWPFParagraph paragraph) {
        // Image indexes
        List<String> imageBundleList = new ArrayList<String>();
        // All XWPFRun objects in the paragraph
        List<XWPFRun> runList = paragraph.getRuns();
        for (XWPFRun run : runList) {
            // XWPFRun is POI's own wrapper built after parsing the XML element;
            // it cannot be traversed as XML, so get the underlying CTR
            CTR ctr = run.getCTR();
            // Cursor for traversing the child elements
            XmlCursor c = ctr.newCursor();
            // Select all child elements
            c.selectPath("./*");
            while (c.toNextSelection()) {
                XmlObject o = c.getObject();
                // Image stored in the <w:drawing> form, represented by CTDrawing
                if (o instanceof CTDrawing) {
                    CTDrawing drawing = (CTDrawing) o;
                    CTInline[] ctInlines = drawing.getInlineArray();
                    for (CTInline ctInline : ctInlines) {
                        CTGraphicalObject graphic = ctInline.getGraphic();
                        XmlCursor cursor = graphic.getGraphicData().newCursor();
                        cursor.selectPath("./*");
                        while (cursor.toNextSelection()) {
                            XmlObject xmlObject = cursor.getObject();
                            // Child element in the <pic:pic> form
                            if (xmlObject instanceof CTPicture) {
                                CTPicture picture = (CTPicture) xmlObject;
                                // Read the r:embed attribute of <a:blip>
                                imageBundleList.add(picture.getBlipFill().getBlip().getEmbed());
                            }
                        }
                    }
                }
                // Image stored in the <w:object> form, represented by CTObject
                if (o instanceof CTObject) {
                    CTObject object = (CTObject) o;
                    XmlCursor w = object.newCursor();
                    w.selectPath("./*");
                    while (w.toNextSelection()) {
                        XmlObject xmlObject = w.getObject();
                        if (xmlObject instanceof CTShape) {
                            CTShape shape = (CTShape) xmlObject;
                            // Skip shapes that carry no image data
                            if (shape.sizeOfImagedataArray() > 0) {
                                imageBundleList.add(shape.getImagedataArray(0).getId2());
                            }
                        }
                    }
                }
            }
        }
        return imageBundleList;
    }
}

First, a word about how XWPF wraps the XML elements:
<w:document> corresponds to the XWPFDocument class
<w:p> corresponds to the XWPFParagraph class
<w:r> corresponds to the XWPFRun class
The wrapping basically stops at the run level. A run can have many kinds of child elements, and nothing below it is wrapped or defined any further.
So all we can do is convert each XWPFRun into the object for its XML definition, a CTR, and then use the CTR to read the contents of the run element and get the image index.
The second thing to cover is how the XML elements themselves are represented:
POI parses the XML with Apache XMLBeans. Without going deep into that technology, you need to understand two key points:
1: Every element in the XML document is wrapped by XMLBeans and implements the XmlObject interface, so that type can hold any child element you retrieve;
2: Elements are traversed with an XmlCursor. Which elements are visited is controlled by the path passed to the cursor's selectPath method; the path "./*" selects the child elements;
So code like the following traverses the child elements of the current element and checks the type of each one:
CTR ctr = run.getCTR();
// Cursor for traversing the child elements
XmlCursor c = ctr.newCursor();
// Select all child elements
c.selectPath("./*");
while (c.toNextSelection()) {
    XmlObject o = c.getObject();
    // Image stored in the <w:drawing> form, represented by CTDrawing
    if (o instanceof CTDrawing) {
        CTDrawing drawing = (CTDrawing) o;
        ...
    }
}

Finally, you may wonder: doesn't the <w:drawing> element already define an image?
So what is the second condition for?
if (o instanceof CTObject) { CTObject object = (CTObject) o; ... }
You have probably guessed it.
That's right! Besides <w:drawing>, the XML in a docx document can also define an image with <w:object>.
Why only these two?
Because when I parsed with only the first form, I found that some pictures were missing, and that is how I found the second one... Maybe there are more than two? I don't know; so far this has been enough for me.
Perhaps you, being smart, have run into more cases in practice?
If so, the XML parsing approach described above should let you read them correctly and get the index value you want.
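As a cross-check that does not depend on POI, here is a minimal JDK-only sketch that walks a fragment of document.xml and collects the image indexes of both forms in document order: the r:embed attribute of <a:blip> (the <w:drawing> form) and the r:id attribute of <v:imagedata> (the <w:object> form, which is what getId2() reads in the code above). ImageRefs is a hypothetical name, and the sketch assumes the fragment declares its namespaces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper: collects image indexes from both the drawing and the object form. */
final class ImageRefs {
    static final String A = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String V = "urn:schemas-microsoft-com:vml";
    static final String R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Returns the image indexes found in the XML fragment, in document order. */
    static List<String> collect(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> ids = new ArrayList<>();
            walk(doc.getDocumentElement(), ids);
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML fragment", e);
        }
    }

    /** Depth-first walk, so the indexes come out in the order they appear in the document. */
    private static void walk(Element element, List<String> ids) {
        String ns = element.getNamespaceURI();
        String name = element.getLocalName();
        String id = "";
        if (A.equals(ns) && "blip".equals(name)) {
            id = element.getAttributeNS(R, "embed");   // <w:drawing> form
        } else if (V.equals(ns) && "imagedata".equals(name)) {
            id = element.getAttributeNS(R, "id");      // <w:object> form
        }
        if (!id.isEmpty()) {
            ids.add(id);
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, ids);
            }
        }
    }
}
```

Because the walk matches on element names rather than on a fixed parent, it would also pick up these two elements if they turned up under a container other than <w:drawing> or <w:object>.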
To broaden this a little: if POI lacks some other API, can we implement it through XML parsing too? That is something we need to explore in practice. I believe time will give us the answer.
Okay, now that we have the index value, how do we get the image resource?
POI provides a ready-made method:
The XWPFDocument class has getPictureDataByID(String blipID);
it returns an XWPFPictureData object, which is the image resource.
For details, refer to the relevant blog posts and the API documentation; I will not cover them here.
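For the curious, the index is a relationship id. In a docx package the file word/_rels/document.xml.rels maps each id to a target such as media/image1.png. The following JDK-only sketch reads that mapping directly; ImageRelationships is a hypothetical name, and the sketch only illustrates the package layout, it does not claim to show how POI implements getPictureDataByID.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads image relationships from word/_rels/document.xml.rels. */
final class ImageRelationships {
    static final String PKG = "http://schemas.openxmlformats.org/package/2006/relationships";

    /** Maps a relationship id such as "rId4" to its target such as "media/image1.png". */
    static Map<String, String> parse(String relsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(relsXml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> targets = new LinkedHashMap<>();
            NodeList relationships = doc.getElementsByTagNameNS(PKG, "Relationship");
            for (int i = 0; i < relationships.getLength(); i++) {
                Element relationship = (Element) relationships.item(i);
                // Keep image relationships only; styles, settings and so on are skipped
                if (relationship.getAttribute("Type").endsWith("/image")) {
                    targets.put(relationship.getAttribute("Id"), relationship.getAttribute("Target"));
                }
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse relationships file", e);
        }
    }
}
```

The targets are relative to the word directory, which is why an index like rId4 ends up pointing at a file in word/media.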
3. Test:
Test code using JUnit 4:
package com.szdfhx.reportStatistic.util;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.junit.Test;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public class XWPFUtilsTest {

    @Test
    public void readImageInParagraph() throws IOException {
        try (InputStream in = new FileInputStream("D://Document//My Blog//Java parsing word, get the image location in the document//Example.docx");
             XWPFDocument xwpfDocument = new XWPFDocument(in)) {
            List<XWPFParagraph> paragraphList = xwpfDocument.getParagraphs();
            System.out.println("Image index\t|Image name\t|Text of the paragraph before the image\t");
            System.out.println("------------------------------------------------------------");
            for (int i = 0; i < paragraphList.size(); i++) {
                List<String> imageBundleList = XWPFUtils.readImageInParagraph(paragraphList.get(i));
                if (!imageBundleList.isEmpty()) {
                    for (String pictureId : imageBundleList) {
                        XWPFPictureData pictureData = xwpfDocument.getPictureDataByID(pictureId);
                        String imageName = pictureData.getFileName();
                        // The first paragraph has no predecessor
                        String lastParagraphText = i > 0 ? paragraphList.get(i - 1).getParagraphText() : "";
                        System.out.println(pictureId + "\t|" + imageName + "\t|" + lastParagraphText);
                    }
                }
            }
        }
    }
}

Results:
Printing the image name shows that I really obtained the corresponding resource. If you remember the earlier section, you will notice that the image name is in fact the file name of an image in the word/media folder.
The binary data of the image is available from the XWPFPictureData object's getData() method, so you can save it to a database or a local folder!
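Saving to a local folder could look like the minimal sketch below, which assumes you already hold the byte array from getData() and the name from getFileName(). ImageSaver and its save method are hypothetical names for illustration; only JDK classes are used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: writes image bytes (for example from getData()) into a folder. */
final class ImageSaver {

    /** Writes data to dir/fileName, creating dir if needed, and returns the written path. */
    static Path save(Path dir, String fileName, byte[] data) {
        try {
            Files.createDirectories(dir);
            // Keep only the last path segment so the name cannot point outside dir
            Path target = dir.resolve(Paths.get(fileName).getFileName());
            return Files.write(target, data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With POI in place, a call might read ImageSaver.save(Paths.get("images"), pictureData.getFileName(), pictureData.getData()), where the images folder is an arbitrary choice.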
4. Other notes:
With that, the second problem raised at the beginning is solved.
So what about the first problem?
If your system does not need to be fast, my advice is to convert the doc document to docx first and then parse it. I understood POI to offer an API for the conversion, but verify that against the POI version you use.
If performance matters, you have to write two sets of parsing methods.
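If you do end up with two sets of methods, one way to keep the calling code tidy is to decide up front which set applies. The following is a minimal sketch, assuming the file extension is trustworthy; WordFormat and fromFileName are hypothetical names, and the actual HWPF and XWPF parsing is left out.

```java
import java.util.Locale;

/** Hypothetical dispatch: pick the parsing strategy from the file extension. */
enum WordFormat {
    DOC,   // binary format, to be parsed with HWPF
    DOCX;  // XML-in-ZIP format, to be parsed with XWPF

    /** Returns the format for a file name, ignoring case; rejects anything else. */
    static WordFormat fromFileName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        // Check ".docx" first for clarity, although ".docx" does not end with ".doc"
        if (lower.endsWith(".docx")) {
            return DOCX;
        }
        if (lower.endsWith(".doc")) {
            return DOC;
        }
        throw new IllegalArgumentException("Not a Word document: " + fileName);
    }
}
```

The caller can then switch on the result and hand the file to the matching parser, which at least gives the two code paths a single entry point.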
So... how do you get the relative position of an image in a doc-format Word document?
I don't know... Or maybe you can tell me?
That is everything I have to share about getting the position of pictures in a Word document with Java. I hope it serves as a useful reference.