Preface
POI is a well-known library for reading and writing Microsoft's documents under Apache. Many people should have used POI when exporting reports, creating word documents, and reading them. POI does bring great convenience to these operations. One of the tools I've made recently is to read word and excel files in my computer.
POI structure description
Package name description
HSSF provides the ability to read and write Microsoft Excel XLS format archives.
XSSF provides the ability to read and write Microsoft Excel OOXML XLSX format archives.
HWPF provides the ability to read and write Microsoft Word DOC format archives.
HSLF provides the ability to read and write Microsoft PowerPoint format archives.
HDGF provides the ability to read Microsoft Visio format archives.
HPBF provides the ability to read archives in Microsoft Publisher format.
HSMF provides the function of reading Microsoft Outlook format archives.
Here are some of the pitfalls encountered in both word and excel:
Word
For word files, all I need is to extract the text in the main text in the file. So you can create a method to read the doc or docx file:
private static String readDoc(String filePath, InputStream is) { String text= ""; try { if (filePath.endsWith("doc")) { WordExtractor ex = new WordExtractor(is); text = ex.getText(); ex.close(); is.close(); } else if(filePath.endsWith("docx")) { XWPFDocument doc = new XWPFDocument(is); XWPFWordExtractor extractor = new XWPFWordExtractor(doc); text = extractor.getText(); extractor.close(); is.close(); } } catch (Exception e) { logger.error(filePath, e); } finally { if (is != null) { is.close(); } } return text; }In theory, this code should be valid for reading most doc or docx files. But!!! I found a strange problem, that is, when my code reads certain doc files, it often gives such an exception:
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents.
What does this exception mean? In simple terms, the file you opened is not a doc file, and you should use the method of reading docx to read it. But what we are clearly opening is a file with the suffix doc!
In fact, doc and docx are essentially different. doc is OLE2 type, while docx is OOXML type. If you open a docx file with a compressed file, you will find some folders:
In essence, the docx file is a zip file that contains some xml files. Therefore, although some docx files are not large in size, the xml files inside are indeed relatively large, which is why it consumes a lot of memory when reading some docx files that don't seem to be very large.
Then I opened this doc file using a compressed file. As expected, its internals are as shown in the picture above, so essentially we can think of it as a docx file. Maybe it's because it's saved in some compatibility mode, which leads to such a scam problem. So, now we can judge whether a file is doc or docx based on the suffix name, which is unreliable.
To be honest, I don't think this is a rare problem. But I didn't find anything about this on Google. how to know whether a file is .docx or .doc format from Apache POI This example is to use ZipInputStream to determine whether a file is a docx file:
boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;
But I don't think this is a good way, because I have to build a ZipInpuStream, which is obviously not good. In addition, this operation seems to affect InputStream, so you will have problems reading normal doc files. Or you use the File object to determine whether it is a zip file. But this is not a good way either, because I also need to read the doc or docx file in the compressed file, so my input must be Inputstream, so this option is not OK either. I talked with a group of foreigners on stackoverflow for most of the time. Sometimes I really doubted the ability of these foreigners to understand, but in the end, a big shot gave me a solution that made me ecstatic, FileMagic. This is a new feature added to POI 3.17:
public enum FileMagic { /** OLE2 / BIFF8+ stream used for Office 97 and higher documents */ OLE2(HeaderBlockConstants._signature), /** OOXML / ZIP stream */ OOXML(OOXML_FILE_HEADER), /** XML file */ XML(RAW_XML_FILE_HEADER), /** BIFF2 raw stream - for Excel 2 */ BIFF2(new byte[]{ 0x09, 0x00, // sid=0x0009 0x04, 0x00, // size=0x0004 0x00, 0x00, // unused 0x70, 0x00 // 0x70 = multiple values }), /** BIFF3 raw stream - for Excel 3 */ BIFF3(new byte[]{ 0x09, 0x02, // sid=0x0209 0x06, 0x00, // size=0x0006 0x00, 0x00, // unused 0x70, 0x00 // 0x70 = multiple values }), /** BIFF4 raw stream - for Excel 4 */ BIFF4(new byte[]{ 0x09, 0x04, // sid=0x0409 0x06, 0x00, // size=0x0006 0x00, 0x00, // unused 0x70, 0x00 // 0x70 = multiple values },new byte[]{ 0x09, 0x04, // sid=0x0409 0x06, 0x00, // size=0x0006 0x00, 0x00, // unused 0x00, 0x01 }), /** Old MS Write raw stream */ MSWRITE( new byte[]{0x31, (byte)0xbe, 0x00, 0x00 }, new byte[]{0x32, (byte)0xbe, 0x00, 0x00 }), /** RTF document */ RTF("{//rtf"), /** PDF document */ PDF("%PDF"), // keep UNKNOWN always as last enum! /** UNKNOWN magic */ UNKNOWN(new byte[0]); final byte[][] magic; FileMagic(long magic) { this.magic = new byte[1][8]; LittleEndian.putLong(this.magic[0], 0, magic); } FileMagic(byte[]... magic) { this.magic = magic; } FileMagic(String magic) { this(magic.getBytes(LocaleUtil.CHARSET_1252)); } public static FileMagic valueOf(byte[] magic) { for (FileMagic fm : values()) { int i=0; boolean found = true; for (byte[] ma : fm.magic) { for (byte m : ma) { byte d = magic[i++]; if (!(d == m || (m == 0x70 && (d == 0x10 || d == 0x20 || d == 0x40)))) { found = false; break; } } if (found) { return fm; } } } return UNKNOWN; } /** * Get the file magic of the supplied InputStream (which MUST * support mark and reset).<p> * * If unsure if your InputStream does support mark / reset, * use {@link #prepareToCheckMagic(InputStream)} to wrap it and make * sure to always use that, and not the original!<p> * * Even if this method returns {@link FileMagic#UNKNOWN} it could potentially mean, * that the ZIP stream has leading junk bytes * * @param inp An InputStream which supports either mark/reset */ public static FileMagic valueOf(InputStream inp) throws IOException { if (!inp.markSupported()) { throw new IOException("getFileMagic() only operates on streams which support mark(int)"); } // Grab the first 8 bytes byte[] data = IOUtils.peekFirst8Bytes(inp); return FileMagic.valueOf(data); } /** * Checks if an {@link InputStream} can be reset (ie used for checking the header magic) and wraps it if not * * @param stream stream to be checked for wrapping * @return a mark enabled stream */ public static InputStream prepareToCheckMagic(InputStream stream) { if (stream.markSupported()) { return stream; } // we used to process the data via a PushbackInputStream, but user code could provide a too small one // so we use a BufferedInputStream instead now return new BufferedInputStream(stream); }}Here is the main code, which mainly determines the file type based on the first 8 bytes of InputStream. There is no way to think that this is the most elegant solution. At the beginning, I was actually thinking that the first few bytes of the compressed file seemed to be defined by a different one, magicmumber. Because FileMagic's dependencies are compatible with version 3.16, I just need to add this class, so the correct way for us to read the word file now is:
private static String readDoc (String filePath, InputStream is) { String text= ""; is = FileMagic.prepareToCheckMagic(is); try { if (FileMagic.valueOf(is) == FileMagic.OLE2) { WordExtractor ex = new WordExtractor(is); text = ex.getText(); ex.close(); } else if(FileMagic.valueOf(is) == FileMagic.OOXML) { XWPFDocument doc = new XWPFDocument(is); XWPFWordExtractor extractor = new XWPFWordExtractor(doc); text = extractor.getText(); extractor.close(); } } catch (Exception e) { logger.error("for file " + filePath, e); } finally { if (is != null) { is.close(); } } return text; } Excel
For the excel article, I won’t look for comparisons between the previous plan and the current plan. I'll give me the best practices now:
@SuppressWarnings("deprecation" ) private static String readExcel(String filePath, InputStream inp) throws Exception { Workbook wb; StringBuilder sb = new StringBuilder(); try { if (filePath.endsWith(".xls")) { wb = new HSSFWorkbook(inp); } else { wb = StreamingReader.builder() .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10) .bufferSize(4096) // buffer size to use when reading InputStream to file (defaults to 1024) .open(inp); // InputStream or File for XLSX file (required) } sb = readSheet(wb, sb, filePath.endsWith(".xls")); wb.close(); } catch (OLE2NotOfficeXmlFileException e) { logger.error(filePath, e); } finally { if (inp != null) { inp.close(); } } return sb.toString(); } private static String readExcelByFile(String filepath, File file) { Workbook wb; StringBuilder sb = new StringBuilder(); try { if (filepath.endsWith(".xls")) { wb = WorkbookFactory.create(file); } else { wb = StreamingReader.builder() .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10) .bufferSize(4096) // buffer size to use when reading InputStream to file (defaults to 1024) .open(file); // InputStream or File for XLSX file (required) } sb = readSheet(wb, sb, filepath.endsWith(".xls")); wb.close(); } catch (Exception e) { logger.error(filepath, e); } return sb.toString(); } private static StringBuilder readSheet(Workbook wb, StringBuilder sb, boolean isXls) throws Exception { for (Sheet sheet: wb) { for (Row r: sheet) { for (Cell cell: r) { if (cell.getCellType() == Cell.CELL_TYPE_STRING) { sb.append(cell.getStringCellValue()); sb.append(" "); } else if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) { if (isXls) { DataFormatter formatter = new DataFormatter(); sb.append(formatCellValue(cell)); } else { sb.append(cell.getStringCellValue()); } sb.append(" "); } } } } return sb; }In fact, for excel read, the biggest problem my tool faces is memory overflow. Often, reading certain particularly large excel files will cause a memory overflow problem. Later I finally found an excellent tool excel-streaming-reader, which can streamline xlsx files and split some particularly large files into small files to read.
Another optimization is that in the scenario where File objects can be used, I use File objects to read files instead of using InputStream to read them, because using InputStream requires all of them to be loaded into memory, so this is very memory-consuming.
Finally, my little trick is to use cell.getCellType to reduce the amount of data, because I only need to get some text and numbers string content.
The above are some of my explorations and discoveries when reading files using POI, and I hope it will be helpful to you. The above examples are also applied in one of my tools everywhere (this tool can mainly help you search the full text of content on your computer). If you are interested, you can take a look. Welcome to star or pr.
Summarize
The above is the entire content of this article. I hope that the content of this article has certain reference value for everyone's study or work. If you have any questions, you can leave a message to communicate. Thank you for your support to Wulin.com.