Best Practices for Reading Word and Excel Files with POI: A Tutorial

Author: Eve Cole  Updated: 2025-07-20 23:32:01

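Preface

POI is Apache's well-known library for reading and writing Microsoft Office documents. Many people have probably used it to export reports or to create and read Word documents, and it does make those tasks much easier. A tool I built recently reads the Word and Excel files on a computer.

POI structure overview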

Package   Description
HSSF      Reads and writes Microsoft Excel XLS files.
XSSF      Reads and writes Microsoft Excel OOXML XLSX files.
HWPF      Reads and writes Microsoft Word DOC files.
HSLF      Reads and writes Microsoft PowerPoint files.
HDGF      Reads Microsoft Visio files.
HPBF      Reads Microsoft Publisher files.
HSMF      Reads Microsoft Outlook files.

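Below I walk through the pitfalls I ran into, first with Word and then with Excel.

Word

For Word files, all I need is the body text of the document. So I started with a method that reads either a doc or a docx file: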

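    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    // First attempt: pick the parser from the file extension.
    // `logger` is a field of the enclosing class.
    private static String readDoc(String filePath, InputStream is) {
        String text = "";
        try {
            if (filePath.endsWith(".doc")) {
                WordExtractor ex = new WordExtractor(is);
                text = ex.getText();
                ex.close();
            } else if (filePath.endsWith(".docx")) {
                XWPFDocument doc = new XWPFDocument(is);
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                text = extractor.getText();
                extractor.close();
            }
        } catch (Exception e) {
            logger.error(filePath, e);
        } finally {
            // close() can itself throw IOException, so it is handled here
            try {
                if (is != null) {
                    is.close();
                }
            } catch (IOException e) {
                logger.error(filePath, e);
            }
        }
        return text;
    }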

In theory this code should work for most doc and docx files. But I ran into a strange problem: when reading certain doc files, it often threw this exception:

 org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents.

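What does this exception mean? Put simply, the file you opened is not really a doc file, and you should read it with the docx code path. But the file we opened clearly has a .doc extension!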

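In fact doc and docx are fundamentally different: doc is an OLE2 container, while docx is OOXML. If you open a docx file with an archive tool, you will find a set of folders and XML parts, typically _rels, docProps and word alongside [Content_Types].xml.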

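In essence a docx file is a ZIP archive containing XML files. A docx that looks small on disk can therefore hold XML that is quite large once decompressed, which is why reading some seemingly small docx files consumes a lot of memory.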
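The ZIP structure can also be inspected without an archive tool. The following is a minimal sketch using only the JDK's java.util.zip, not code from the original tool: the class name DocxEntryLister and the tiny in-memory sample archive are made up for illustration, and a real docx contains more parts than the three written here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: lists the entry names inside a ZIP-based document. */
final class DocxEntryLister {

    /** Returns the names of all entries found in the given ZIP stream, in archive order. */
    static List<String> listEntries(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny stand-in archive with docx-like part names (illustrative only). */
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            String[] parts = {"[Content_Types].xml", "_rels/.rels", "word/document.xml"};
            for (String name : parts) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<xml/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Print every part name found in the sample archive.
        for (String name : listEntries(new ByteArrayInputStream(sampleArchive()))) {
            System.out.println(name);
        }
    }
}
```

Pointing listEntries at a FileInputStream for a real docx should print its parts in the same way; for a genuine OLE2 doc file the list comes back empty, because no ZIP entry is found.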

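I then opened the problematic doc file with the archive tool and, sure enough, it had exactly this ZIP layout inside, so for practical purposes it is a docx file. It was probably saved in some compatibility mode, which produced this nasty trap. The upshot is that the file extension is not a reliable way to tell doc from docx.

Honestly, I don't think this is a rare problem, yet I found almost nothing about it on Google. The Stack Overflow question "how to know whether a file is .docx or .doc format from Apache POI" suggests using ZipInputStream to check whether the file is a docx: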

 boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;

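But I don't think this is a good approach, because I would have to construct a ZipInputStream just for the check. The check also appears to consume bytes from the InputStream, so reading a genuine doc file afterwards runs into problems. Alternatively you could use a File object to test whether the file is a ZIP archive, but that doesn't work for me either: I also need to read doc and docx files that sit inside compressed archives, so my input has to be an InputStream. After a long back-and-forth on Stack Overflow, during which I sometimes doubted whether my question was being understood at all, someone finally offered a solution that delighted me: FileMagic, a feature newly added in POI 3.17: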

    // Excerpt from org.apache.poi.poifs.filesystem.FileMagic (POI 3.17); imports omitted.
    public enum FileMagic {
        /** OLE2 / BIFF8+ stream used for Office 97 and higher documents */
        OLE2(HeaderBlockConstants._signature),
        /** OOXML / ZIP stream */
        OOXML(OOXML_FILE_HEADER),
        /** XML file */
        XML(RAW_XML_FILE_HEADER),
        /** BIFF2 raw stream - for Excel 2 */
        BIFF2(new byte[]{
                0x09, 0x00, // sid=0x0009
                0x04, 0x00, // size=0x0004
                0x00, 0x00, // unused
                0x70, 0x00  // 0x70 = multiple values
        }),
        /** BIFF3 raw stream - for Excel 3 */
        BIFF3(new byte[]{
                0x09, 0x02, // sid=0x0209
                0x06, 0x00, // size=0x0006
                0x00, 0x00, // unused
                0x70, 0x00  // 0x70 = multiple values
        }),
        /** BIFF4 raw stream - for Excel 4 */
        BIFF4(new byte[]{
                0x09, 0x04, // sid=0x0409
                0x06, 0x00, // size=0x0006
                0x00, 0x00, // unused
                0x70, 0x00  // 0x70 = multiple values
        }, new byte[]{
                0x09, 0x04, // sid=0x0409
                0x06, 0x00, // size=0x0006
                0x00, 0x00, // unused
                0x00, 0x01
        }),
        /** Old MS Write raw stream */
        MSWRITE(
                new byte[]{0x31, (byte)0xbe, 0x00, 0x00 },
                new byte[]{0x32, (byte)0xbe, 0x00, 0x00 }),
        /** RTF document */
        RTF("{\\rtf"),
        /** PDF document */
        PDF("%PDF"),
        // keep UNKNOWN always as last enum!
        /** UNKNOWN magic */
        UNKNOWN(new byte[0]);

        final byte[][] magic;

        FileMagic(long magic) {
            this.magic = new byte[1][8];
            LittleEndian.putLong(this.magic[0], 0, magic);
        }

        FileMagic(byte[]... magic) {
            this.magic = magic;
        }

        FileMagic(String magic) {
            this(magic.getBytes(LocaleUtil.CHARSET_1252));
        }

        public static FileMagic valueOf(byte[] magic) {
            for (FileMagic fm : values()) {
                int i = 0;
                boolean found = true;
                for (byte[] ma : fm.magic) {
                    for (byte m : ma) {
                        byte d = magic[i++];
                        if (!(d == m || (m == 0x70 && (d == 0x10 || d == 0x20 || d == 0x40)))) {
                            found = false;
                            break;
                        }
                    }
                    if (found) {
                        return fm;
                    }
                }
            }
            return UNKNOWN;
        }

        /**
         * Get the file magic of the supplied InputStream (which MUST
         * support mark and reset).<p>
         *
         * If unsure if your InputStream does support mark / reset,
         * use {@link #prepareToCheckMagic(InputStream)} to wrap it and make
         * sure to always use that, and not the original!<p>
         *
         * Even if this method returns {@link FileMagic#UNKNOWN} it could potentially mean,
         * that the ZIP stream has leading junk bytes
         *
         * @param inp An InputStream which supports either mark/reset
         */
        public static FileMagic valueOf(InputStream inp) throws IOException {
            if (!inp.markSupported()) {
                throw new IOException("getFileMagic() only operates on streams which support mark(int)");
            }

            // Grab the first 8 bytes
            byte[] data = IOUtils.peekFirst8Bytes(inp);

            return FileMagic.valueOf(data);
        }

        /**
         * Checks if an {@link InputStream} can be reseted (ie used for checking the header magic) and wraps it if not
         *
         * @param stream stream to be checked for wrapping
         * @return a mark enabled stream
         */
        public static InputStream prepareToCheckMagic(InputStream stream) {
            if (stream.markSupported()) {
                return stream;
            }
            // we used to process the data via a PushbackInputStream, but user code could provide a too small one
            // so we use a BufferedInputStream instead now
            return new BufferedInputStream(stream);
        }
    }
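The idea behind this check can be reproduced with the JDK alone, which may help to see why it does not disturb the stream. What follows is a simplified sketch and not the POI implementation: MagicSniffer and its Kind enum are hypothetical names, and only two signatures are checked (OLE2: D0 CF 11 E0 A1 B1 1A E1, and the ZIP local file header 50 4B 03 04).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical, simplified stand-in for the idea behind FileMagic. */
final class MagicSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    private static final int[] OLE2_SIG = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    private static final int[] ZIP_SIG = {0x50, 0x4B, 0x03, 0x04};

    /** Wraps the stream if it cannot rewind; callers must keep using the returned stream. */
    static InputStream rewindable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Peeks at up to 8 leading bytes, then resets so a parser still sees the whole file. */
    static Kind sniff(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        int filled = 0;
        try {
            in.mark(head.length);
            int n;
            while (filled < head.length
                    && (n = in.read(head, filled, head.length - filled)) > 0) {
                filled += n;
            }
            in.reset(); // rewind to the marked position
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (startsWith(head, filled, OLE2_SIG)) return Kind.OLE2;
        if (startsWith(head, filled, ZIP_SIG)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    /** True if the first bytes of data (of which only `length` are valid) equal the signature. */
    private static boolean startsWith(byte[] data, int length, int[] signature) {
        if (length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Made-up bytes that merely start with the ZIP signature.
        byte[] fakeDocx = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00, 0x08};
        InputStream in = rewindable(new ByteArrayInputStream(fakeDocx));
        System.out.println(sniff(in)); // ZIP
        System.out.println(in.read()); // 80 (0x50): the stream was rewound
    }
}
```

Because the bytes are read between mark and reset, the same stream can be handed to a parser afterwards. That is the difference from the ZipInputStream check, which leaves the stream positioned somewhere past the header.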

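The FileMagic source shown above is the core of the class: it determines the file type from the first 8 bytes of the InputStream, which is without doubt the most elegant solution. Early on I had in fact been thinking along the same lines, since the first few bytes of a compressed file seem to be defined differently from other formats: the magic number. Because FileMagic's dependencies are compatible with version 3.16, I only needed to add this one class to my project. The correct way to read a Word file is therefore: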

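    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.poifs.filesystem.FileMagic;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    // `logger` is a field of the enclosing class.
    private static String readDoc(String filePath, InputStream is) {
        String text = "";
        // Wrap the stream if necessary so that mark/reset works for the header check.
        is = FileMagic.prepareToCheckMagic(is);
        try {
            // Decide by content, not by extension.
            FileMagic magic = FileMagic.valueOf(is);
            if (magic == FileMagic.OLE2) {
                WordExtractor ex = new WordExtractor(is);
                text = ex.getText();
                ex.close();
            } else if (magic == FileMagic.OOXML) {
                XWPFDocument doc = new XWPFDocument(is);
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                text = extractor.getText();
                extractor.close();
            }
        } catch (Exception e) {
            logger.error("for file " + filePath, e);
        } finally {
            // close() can itself throw IOException, so it is handled here
            try {
                if (is != null) {
                    is.close();
                }
            } catch (IOException e) {
                logger.error("for file " + filePath, e);
            }
        }
        return text;
    }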

Excel

For Excel I won't compare my earlier approach with the current one; here is simply what I now consider my best approach:

    import java.io.File;
    import java.io.InputStream;
    import com.monitorjbl.xlsx.StreamingReader;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;
    import org.apache.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException;
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.DataFormatter;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.usermodel.WorkbookFactory;

    // `logger` is a field of the enclosing class.
    private static String readExcel(String filePath, InputStream inp) throws Exception {
        Workbook wb;
        StringBuilder sb = new StringBuilder();
        try {
            if (filePath.endsWith(".xls")) {
                wb = new HSSFWorkbook(inp);
            } else {
                wb = StreamingReader.builder()
                        .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
                        .bufferSize(4096)   // buffer size to use when reading InputStream to file (defaults to 1024)
                        .open(inp);         // InputStream or File for XLSX file (required)
            }
            sb = readSheet(wb, sb, filePath.endsWith(".xls"));
            wb.close();
        } catch (OLE2NotOfficeXmlFileException e) {
            logger.error(filePath, e);
        } finally {
            if (inp != null) {
                inp.close();
            }
        }
        return sb.toString();
    }

    private static String readExcelByFile(String filepath, File file) {
        Workbook wb;
        StringBuilder sb = new StringBuilder();
        try {
            if (filepath.endsWith(".xls")) {
                wb = WorkbookFactory.create(file);
            } else {
                wb = StreamingReader.builder()
                        .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
                        .bufferSize(4096)   // buffer size to use when reading InputStream to file (defaults to 1024)
                        .open(file);        // InputStream or File for XLSX file (required)
            }
            sb = readSheet(wb, sb, filepath.endsWith(".xls"));
            wb.close();
        } catch (Exception e) {
            logger.error(filepath, e);
        }
        return sb.toString();
    }

    // Cell.CELL_TYPE_* are the POI 3.x constants; newer POI versions use the CellType enum.
    @SuppressWarnings("deprecation")
    private static StringBuilder readSheet(Workbook wb, StringBuilder sb, boolean isXls) throws Exception {
        // One formatter is enough; it is only used for numeric cells in xls workbooks.
        DataFormatter formatter = new DataFormatter();
        for (Sheet sheet : wb) {
            for (Row r : sheet) {
                for (Cell cell : r) {
                    if (cell.getCellType() == Cell.CELL_TYPE_STRING) {
                        sb.append(cell.getStringCellValue());
                        sb.append(" ");
                    } else if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) {
                        if (isXls) {
                            sb.append(formatter.formatCellValue(cell));
                        } else {
                            // the streaming reader returns numeric content as a string
                            sb.append(cell.getStringCellValue());
                        }
                        sb.append(" ");
                    }
                }
            }
        }
        return sb;
    }

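For Excel, the biggest problem my tool faced was running out of memory: reading certain very large Excel files regularly ended in an out-of-memory error. Eventually I found an excellent library, excel-streaming-reader, which reads xlsx files in a streaming fashion and keeps only a limited window of rows in memory (the rowCacheSize above) instead of loading the whole workbook at once.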
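To make the streaming idea concrete: a worksheet in an xlsx file is an XML part inside the ZIP, and a pull parser can walk it cell by cell without building the whole document in memory. The sketch below uses the JDK's StAX API on a hand-written sheet fragment. It illustrates the general technique only and makes no claim about the internals of excel-streaming-reader; SheetTextStreamer and the sample XML are invented for the example, and shared strings (which real workbooks usually store in a separate part) are ignored.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical example: pulls cell text out of sheet XML one event at a time. */
final class SheetTextStreamer {

    /** Streams through sheet XML and collects the text of every <v> and <t> element. */
    static String extractText(String sheetXml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newFactory();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs needed here
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(sheetXml));
            boolean inValue = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    // <v> holds a cell value, <t> holds inline string text
                    inValue = name.equals("v") || name.equals("t");
                } else if (event == XMLStreamConstants.CHARACTERS && inValue) {
                    out.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (inValue) {
                        out.append(' '); // separate values with a space
                    }
                    inValue = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed sheet XML", e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String xml = "<worksheet><sheetData>"
                + "<row r=\"1\"><c r=\"A1\" t=\"inlineStr\"><is><t>total</t></is></c>"
                + "<c r=\"B1\"><v>42</v></c></row>"
                + "</sheetData></worksheet>";
        System.out.println(extractText(xml)); // total 42
    }
}
```

Only the current parser event is held at any time, so memory use stays roughly flat regardless of how many rows the sheet has, which is the property that matters for very large files.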

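Another optimization: wherever a File object is available, I read from the File rather than from an InputStream, because reading from an InputStream requires loading the entire content into memory, which is very memory-hungry.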
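One way to build intuition for this difference: with a file on disk, a ZIP reader can use the archive's central directory to jump straight to the one entry it needs, whereas a plain stream can only be consumed from the beginning. The sketch below shows that with the JDK's ZipFile. It is an analogy for file-based access in general, not a description of what POI or excel-streaming-reader do internally; the class name, the temp file and the entry contents are made up, and readAllBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Hypothetical example: random access to a single entry of a ZIP file on disk. */
final class ZipEntryPicker {

    /** Reads one named entry; returns null if the archive has no such entry. */
    static String readEntry(Path zipPath, String entryName) {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName); // looked up via the central directory
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a small stand-in archive to a temp file (illustrative content only). */
    static Path writeSample() {
        try {
            Path path = Files.createTempFile("sample", ".xlsx");
            path.toFile().deleteOnExit();
            try (OutputStream out = Files.newOutputStream(path);
                 ZipOutputStream zip = new ZipOutputStream(out)) {
                zip.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));
                zip.write("<worksheet>big sheet</worksheet>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
                zip.putNextEntry(new ZipEntry("xl/workbook.xml"));
                zip.write("<workbook/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Only the requested entry is decompressed; sheet1.xml is never read.
        System.out.println(readEntry(writeSample(), "xl/workbook.xml"));
    }
}
```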

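Finally, one small trick is to check cell.getCellType to cut down the amount of data, since I only need the string content of text and numeric cells.

That sums up what I explored and found while reading files with POI; I hope it helps. These examples are taken from one of my tools, everywhere (it lets you run full-text content searches on your computer). Take a look if you are interested; stars and PRs are welcome.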
