This guide uses a Java example to show how to read tables from a PDF document. The main classes and methods used are listed below for reference:
1. PdfDocument class: represents a PDF document model.
2. PdfDocument.loadFromFile(String filename) method: loads a PDF document.
3. PdfTableExtractor class: represents the PDF table extractor.
4. PdfTable class: defines a PDF table.
5. PdfTableExtractor.extractTable(int pageIndex) method: extracts the tables from a page.
6. PdfTable.getText(int rowIndex, int columnIndex) method: gets the text in a cell.
7. FileWriter.write() method: writes the extracted table text to a .txt file.
Tools / Materials
IntelliJ IDEA 2018 (JDK 1.8.0)
A PDF test document
PDF JAR package: Spire.PDF for Java, version 4.10.2
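Importing the JAR
1. Download the JAR package to your machine and unzip it. Then import it manually with the following steps:
2. Locate the jar file in the local path, click OK, and add it to the list.
3. Once it has been added, tick its checkbox and click Apply to finish importing the JAR.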
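A quick way to confirm that the import worked is to check whether one of the library's classes can be found on the classpath. The following is a minimal sketch, not part of the original steps: the class name `ClasspathCheck` and the helper `isOnClasspath` are made up for illustration, and `com.spire.pdf.PdfDocument` is taken from the imports used in the code of this guide.

```java
// Sketch: check whether a class can be found on the current classpath.
class ClasspathCheck {

    // Returns true if the named class can be located (it is not initialized).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // PdfDocument is the class the extraction code in this guide relies on.
        String className = "com.spire.pdf.PdfDocument";
        if (isOnClasspath(className)) {
            System.out.println(className + " found");
        } else {
            System.out.println(className + " NOT found; check the JAR import");
        }
    }
}
```

If the class is reported as not found, the jar was most likely not added to the module that contains the code being run.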
Java code
1. The complete code:

import com.spire.pdf.*;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractTable {
    public static void main(String[] args) throws IOException {
        // Load the PDF document
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("test.pdf");

        // Collects the text of every extracted cell
        StringBuilder builder = new StringBuilder();

        // Extract the tables page by page
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        for (int page = 0; page < pdf.getPages().getCount(); page++) {
            PdfTable[] tableLists = extractor.extractTable(page);
            if (tableLists != null && tableLists.length > 0) {
                for (PdfTable table : tableLists) {
                    int row = table.getRowCount();
                    int column = table.getColumnCount();
                    for (int i = 0; i < row; i++) {
                        for (int j = 0; j < column; j++) {
                            // Cells of one row are separated by a space
                            String text = table.getText(i, j);
                            builder.append(text + " ");
                        }
                        // One output line per table row
                        builder.append("\r\n");
                    }
                }
            }
        }

        // Write the extracted table content to a txt file;
        // try-with-resources closes the writer even if writing fails
        try (FileWriter fileWriter = new FileWriter("ExtractedTable.txt")) {
            fileWriter.write(builder.toString());
        }
    }
}
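The code above joins the cells of a row with a single space, so a cell that itself contains spaces can no longer be told apart from its neighbours in the output. If the result needs to be read back as a table, one option is to write comma-separated values instead. The sketch below is an illustration only: `CsvSketch`, `escapeCell` and `toCsv` are hypothetical names, and the `String[][]` input is assumed to have been filled from `table.getText(i, j)` in the extraction loop.

```java
// Sketch: turn rows of cell text into CSV, quoting cells where needed.
class CsvSketch {

    // Quotes a cell if it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled, as in common CSV practice.
    static String escapeCell(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Builds one CSV line per row, terminated with \r\n like the original code.
    static String toCsv(String[][] rows) {
        StringBuilder builder = new StringBuilder();
        for (String[] row : rows) {
            for (int j = 0; j < row.length; j++) {
                if (j > 0) {
                    builder.append(',');
                }
                builder.append(escapeCell(row[j]));
            }
            builder.append("\r\n");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for extracted cell text.
        String[][] rows = {
            {"Name", "Unit price"},
            {"Widget, large", "12.50"}
        };
        System.out.print(toCsv(rows));
    }
}
```

With this approach the output file would typically be named with a .csv extension so that spreadsheet tools open it directly.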
2. Run the code to generate the txt file (ExtractedTable.txt), which contains the text read from the tables.
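If non-ASCII text (for example Chinese) looks garbled in the generated file, the cause may be the encoding: on JDK 1.8, `FileWriter` encodes with the platform default charset. A possible alternative, sketched here under that assumption, is to write the text explicitly as UTF-8 with `Files.newBufferedWriter`. The names `Utf8WriteSketch` and `writeUtf8` are hypothetical, and the sample string is made up.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: write text to a file with an explicit UTF-8 encoding.
class Utf8WriteSketch {

    // Writes the content as UTF-8; try-with-resources closes the writer.
    static void writeUtf8(Path target, String content) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Paths.get("ExtractedTable.txt");
        // \u8868\u683c is the Chinese word for "table", written with Unicode
        // escapes so the source file compiles regardless of its own encoding.
        writeUtf8(target, "\u8868\u683c A1 B1 \r\n");

        // Read the file back as UTF-8 to check the round trip.
        String back = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.print(back);
    }
}
```

In the extraction code, `writeUtf8(Paths.get("ExtractedTable.txt"), builder.toString())` would take the place of the `FileWriter` block.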