Extracting text from a PDF with PDFBox 2.0

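I am trying to extract text using PDFBox 2.0. I want to get the font size of specific characters and the position rectangle of each such character on the page. I implemented this in PDFBox 1.6 using PDFTextStripper: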

PDFParser parser = new PDFParser(is); 
    try{ 
     parser.parse(); 
    }catch(IOException e){ 

    } 
    COSDocument cosDoc = parser.getDocument(); 
    PDDocument pdd = new PDDocument(cosDoc); 
    final StringBuffer extractedText = new StringBuffer(); 
    PDFTextStripper textStripper = new PDFTextStripper(){ 
     @Override 
     protected void processTextPosition(TextPosition text) { 
      extractedText.append(text.getCharacter()); 
      logger.debug("text position: "+text.toString()); 
     } 
    }; 
    textStripper.setSuppressDuplicateOverlappingText(false); 
    for(int pageNum = 0;pageNum<pdd.getNumberOfPages();pageNum++){ 
     PDPage page = (PDPage) pdd.getDocumentCatalog().getAllPages().get(pageNum); 
     textStripper.processStream(page, page.findResources(), page.getContents().getStream()); 
    } 
    pdd.close();

However, the processStream method has been removed in PDFBox 2.0. How can I achieve the same thing with PDFBox 2.0?

I have tried the following:

 PDDocument pdd = PDDocument.load(inputStream); 
     PDFTextStripper textStripper = new PDFTextStripper(){ 
      @Override 
      protected void processTextPosition(TextPosition text){ 
       int pos = PDFdocument.length(); 
       String textadded = text.getUnicode(); 
       Range range = new Range(pos,pos+textadded.length()); 
       int pagenr = this.getCurrentPageNo(); 
       Rectangle2D rect = new Rectangle2D.Float(text.getX(),text.getY(),text.getWidth(),text.getHeight()); 
      } 
     }; 
     textStripper.setSuppressDuplicateOverlappingText(false); 
     for(int pageNum = 0;pageNum<pdd.getNumberOfPages();pageNum++){ 
      PDPage page = (PDPage) pdd.getDocumentCatalog().getPages().get(pageNum); 
      textStripper.processPage(page); 
     } 
     pdd.close();

The processTextPosition(TextPosition text) method is never called. Any suggestions would be very welcome.

Source

2016-02-29 Dieudonné

Please look at the DrawPrintTextLocations example in the source code; it does what you apparently want to do. It uses the writeString() call.

Thanks, that example is exactly what I was looking for.

The DrawPrintTextLocations example suggested by @tilmanhausherr provided the solution to my problem.

The parser is started with the following code (inputStream is the input stream from the URL of the PDF file):

PDDocument pdd = null;
    try {
        pdd = PDDocument.load(inputStream);
        PDFParserTextStripper stripper = new PDFParserTextStripper(pdd);
        stripper.setSortByPosition(true);
        for (int i = 0; i < pdd.getNumberOfPages(); i++) {
            stripper.stripPage(i);
        }
    } catch (IOException e) {
        // throw error
    } finally {
        if (pdd != null) {
            try {
                pdd.close();
            } catch (IOException e) {
                // ignore a failure while closing
            }
        }
    }

This code uses a custom subclass of PDFTextStripper:

class PDFParserTextStripper extends PDFTextStripper {

    public PDFParserTextStripper(PDDocument pdd) throws IOException {
        super();
        document = pdd; // inherited field, read by stripPage below
    }

    // pageNr is zero-based; setStartPage/setEndPage expect one-based page numbers.
    public void stripPage(int pageNr) throws IOException {
        this.setStartPage(pageNr + 1);
        this.setEndPage(pageNr + 1);
        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        writeText(document, dummy); // This call starts the parsing process and calls writeString repeatedly.
    }

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj() + " fs=" + text.getFontSizeInPt() + " xscale=" + text.getXScale() + " height=" + text.getHeightDir() + " space=" + text.getWidthOfSpace() + " width=" + text.getWidthDirAdj() + " ] " + text.getUnicode());
        }
    }

}

Source

2016-03-02 09:29:36

This works very well, thanks. Why the PDFRenderer & PDPage objects? – Darajan

@Darajan You are right. They are probably leftovers from earlier attempts. I will remove them from the answer.

@Dieudonné Can you point me in the right direction? Where is the "PDFdocument" class?

Here is an implementation that uses @tilmanhausherr's suggestion:

import java.io.ByteArrayOutputStream; 
import java.io.File; 
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.OutputStreamWriter; 
import java.io.Writer; 
import java.util.List; 
import org.apache.pdfbox.pdmodel.PDDocument; 
import org.apache.pdfbox.text.PDFTextStripper; 
import org.apache.pdfbox.text.TextPosition; 

class PDFParserTextStripper extends PDFTextStripper 
{ 
    public PDFParserTextStripper(PDDocument pdd) throws IOException 
    { 
     super(); 
     document = pdd; 
    } 

    public void stripPage(int pageNr) throws IOException 
    { 
     this.setStartPage(pageNr+1); 
     this.setEndPage(pageNr+1); 
     Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream()); 
     writeText(document,dummy); // This call starts the parsing process and calls writeString repeatedly. 
    } 

    @Override 
    protected void writeString(String string,List<TextPosition> textPositions) throws IOException 
    { 
     for (TextPosition text : textPositions) { 
      System.out.println("String[" + text.getXDirAdj()+","+text.getYDirAdj()+" fs="+text.getFontSizeInPt()+" xscale="+text.getXScale()+" height="+text.getHeightDir()+" space="+text.getWidthOfSpace()+" width="+text.getWidthDirAdj()+" ] "+text.getUnicode()); 
     } 
    } 

    public static void extractText(InputStream inputStream) 
    { 
     PDDocument pdd = null; 

     try 
     { 
      pdd = PDDocument.load(inputStream); 
      PDFParserTextStripper stripper = new PDFParserTextStripper(pdd); 
      stripper.setSortByPosition(true); 
      for (int i=0; i<pdd.getNumberOfPages(); i++) 
      { 
       stripper.stripPage(i); 
      } 
     } 
     catch (IOException e) 
     { 
      // throw error 
     } 
     finally 
     { 
      if (pdd != null) 
      { 
       try 
       { 
        pdd.close(); 
       } 
       catch (IOException e) 
       { 

       } 
      } 
     } 
    } 

    public static void main(String[] args) throws IOException 
    { 
     File f = new File("C:\\PathToYourPDF\\pdfFile.pdf"); 
     FileInputStream fis = null; 

     try 
     { 
      fis = new FileInputStream(f); 
      extractText(fis); 
     } 
     catch(IOException e) 
     { 
      e.printStackTrace(); 
     } 
     finally 
     { 
      try 
      { 
       if(fis != null) 
        fis.close(); 
      } 
      catch(IOException ex) 
      { 
       ex.printStackTrace(); 
      } 
     } 
    } 
}
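
The writeString overrides above only print each value. Since the question asks for the font size and position rectangle of specific characters, one option is to collect the values instead of printing them. The following is a minimal sketch under stated assumptions: Glyph and GlyphCollector are hypothetical helper names, not part of PDFBox, and the rectangle is built from x, y, width and height exactly as in the question's own attempt, so whether y marks the top edge or the baseline depends on which TextPosition getter supplies it.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One extracted character (or ligature) with its layout data. Hypothetical helper. */
final class Glyph {
    final String unicode;            // text of this glyph, may be more than one char
    final int page;                  // page number as reported by the stripper
    final int start;                 // offset of this glyph in the collected text
    final float fontSizePt;          // font size in points
    final Rectangle2D.Float bounds;  // position rectangle on the page

    Glyph(String unicode, int page, int start, float fontSizePt, Rectangle2D.Float bounds) {
        this.unicode = unicode;
        this.page = page;
        this.start = start;
        this.fontSizePt = fontSizePt;
        this.bounds = bounds;
    }
}

/** Accumulates glyphs together with the running text. Hypothetical helper. */
final class GlyphCollector {
    private final StringBuilder text = new StringBuilder();
    private final List<Glyph> glyphs = new ArrayList<>();

    /** Records one glyph; empty or null text is skipped. */
    void add(String unicode, int page, float x, float y,
             float width, float height, float fontSizePt) {
        if (unicode == null || unicode.isEmpty()) {
            return;
        }
        Rectangle2D.Float bounds = new Rectangle2D.Float(x, y, width, height);
        glyphs.add(new Glyph(unicode, page, text.length(), fontSizePt, bounds));
        text.append(unicode);
    }

    /** The text collected so far, in the order glyphs were added. */
    String text() {
        return text.toString();
    }

    /** Read-only view of the collected glyphs. */
    List<Glyph> glyphs() {
        return Collections.unmodifiableList(glyphs);
    }

    /** Returns the glyph covering the given text offset, or null if there is none. */
    Glyph glyphAt(int offset) {
        for (Glyph g : glyphs) {
            if (offset >= g.start && offset < g.start + g.unicode.length()) {
                return g;
            }
        }
        return null;
    }
}
```

Inside writeString, the println call could then be replaced by something like collector.add(text.getUnicode(), getCurrentPageNo(), text.getXDirAdj(), text.getYDirAdj(), text.getWidthDirAdj(), text.getHeightDir(), text.getFontSizeInPt()), where collector is a field of the stripper subclass. After all pages are processed, glyphAt(offset) looks up the font size and rectangle for a character at a given position in the collected text.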

Source

2017-05-12 19:19:45 user4332758
