PDF paragraph or text block locations

I want to retrieve the rectangles that make up the paragraphs and/or text blocks on a PDF page.

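I have looked at iTextSharp and DataLogics.

The best I have managed so far is to locate a single word. However, I need to know whether those words belong to the same text block.

I am using C#. Does anyone have any ideas?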

2009-04-15 Dave

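This is in Java, but it shows how to get the content of a PDF page and then pull a value out of that content by index.

I'm not sure, but you can probably do something similar in C#. Get the content and print it out.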

// imports assume iText 2.x package names (com.lowagie); newer versions use com.itextpdf
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.RandomAccessFileOrArray;
import java.io.ByteArrayOutputStream;

//create a new reader from the source file
PdfReader reader = new PdfReader(fileName);
//create the file array
RandomAccessFileOrArray raf = new RandomAccessFileOrArray(fileName);
//get the raw content stream of page 1 of the source file
byte[] bContent = reader.getPageContent(1, raf);
ByteArrayOutputStream bs = new ByteArrayOutputStream();
bs.write(bContent);
//create a string of the contents of the page in order to get the data needed
String contentOf1099 = bs.toString();
if (debug)
{
    System.err.println("contentOf1099 = " + contentOf1099);
}
//get the value based off an index: the 11 characters after the first "("
//that follows the positioning operator "155 664 Td"
String value = contentOf1099.substring(
    contentOf1099.indexOf("(", contentOf1099.indexOf("155 664 Td")) + 1,
    contentOf1099.indexOf("(", contentOf1099.indexOf("155 664 Td ")) + 12);
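
One way to read that last line, as a rough sketch rather than a definitive version: find the text-positioning operator, find the first "(" after it, and take a fixed number of characters from just past that parenthesis. The class name `ContentStreamValue`, the method `valueAfter`, and the sample content string used to try it are all made up for illustration; a real content stream may be compressed, encoded, or split across several operators, in which case this simple search will not work.

```java
final class ContentStreamValue {

    /**
     * Returns up to `length` characters that follow the first "(" found
     * after `marker`, or null if the marker or the parenthesis is missing.
     * Hypothetical helper, not part of iText.
     */
    static String valueAfter(String content, String marker, int length) {
        // 1. locate the positioning operator, e.g. "155 664 Td"
        int markerAt = content.indexOf(marker);
        if (markerAt < 0) {
            return null;
        }
        // 2. locate the "(" that opens the string drawn at that position
        int open = content.indexOf("(", markerAt);
        if (open < 0) {
            return null;
        }
        // 3. take `length` characters starting just after the "("
        int start = open + 1;
        int end = Math.min(start + length, content.length());
        return content.substring(start, end);
    }
}
```

With a made-up page content such as `BT /F1 10 Tf 155 664 Td (ABC-1234567) Tj ET`, calling `ContentStreamValue.valueAfter(content, "155 664 Td", 11)` picks out the 11 characters inside the parentheses.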


2009-04-15 19:39:54 northpole

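birdlips, the last line is really giving me trouble. Could you break it down for me? – Dave 2009-04-15 20:39:32

Unless it is a structured PDF, this information does not exist. A PDF is a set of drawString commands at given positions; there are no paragraph or space markers. You need to work this out from the text positions.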


2009-04-16 06:38:30

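Extract the coordinates of every word on the page, then try to group them together.

The first thing to do is split the words into lines. To do this, loop over all the words against all the other words, and group together those whose y0 is less than the other's y1 and whose y1 is greater than the other's y0 (that is, words whose vertical extents overlap). These are the lines.

Then you need to split your lines into paragraphs. The x position of the start of a line should be within 1/25 of the page width of the start of the other line, and the distance between the lines' y coordinates should be less than the line height. These are the paragraphs.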
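
The two grouping steps could be sketched roughly as follows. This is only an illustration under several assumptions: word boxes have already been extracted by some PDF library, coordinates are in PDF user space (y grows upward, so y0 is the bottom and y1 the top), the "distance between lines" is taken to mean the vertical gap between the bottom of one line and the top of the next, and the page has a single column. The names `TextBlockGrouper`, `Box`, `groupIntoLines`, and `groupIntoParagraphs` are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TextBlockGrouper {

    /** Bounding box plus the words it contains; starts empty and grows via absorb(). */
    static final class Box {
        double x0 = Double.POSITIVE_INFINITY, y0 = Double.POSITIVE_INFINITY;
        double x1 = Double.NEGATIVE_INFINITY, y1 = Double.NEGATIVE_INFINITY;
        final List<String> text = new ArrayList<>();

        static Box word(String s, double x0, double y0, double x1, double y1) {
            Box b = new Box();
            b.x0 = x0; b.y0 = y0; b.x1 = x1; b.y1 = y1;
            b.text.add(s);
            return b;
        }

        /** Enlarges this box to cover `o` and appends its words. */
        void absorb(Box o) {
            x0 = Math.min(x0, o.x0); y0 = Math.min(y0, o.y0);
            x1 = Math.max(x1, o.x1); y1 = Math.max(y1, o.y1);
            text.addAll(o.text);
        }
    }

    /** Groups words whose vertical extents overlap; returns lines ordered top to bottom. */
    static List<Box> groupIntoLines(List<Box> words) {
        List<Box> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble(b -> b.x0)); // left to right
        List<Box> lines = new ArrayList<>();
        for (Box w : sorted) {
            Box target = null;
            for (Box line : lines) {
                // overlap test from the answer: y0 < other.y1 and y1 > other.y0
                if (w.y0 < line.y1 && w.y1 > line.y0) { target = line; break; }
            }
            if (target == null) { target = new Box(); lines.add(target); }
            target.absorb(w);
        }
        lines.sort(Comparator.comparingDouble((Box b) -> b.y1).reversed());
        return lines;
    }

    /** Merges consecutive lines (top to bottom) that start near the same x and sit close together. */
    static List<Box> groupIntoParagraphs(List<Box> lines, double pageWidth) {
        List<Box> paragraphs = new ArrayList<>();
        Box current = null;
        Box prev = null;
        for (Box line : lines) {
            boolean sameParagraph = prev != null
                    && Math.abs(line.x0 - prev.x0) <= pageWidth / 25.0
                    && (prev.y0 - line.y1) < (line.y1 - line.y0); // gap < line height
            if (!sameParagraph) { current = new Box(); paragraphs.add(current); }
            current.absorb(line);
            prev = line;
        }
        return paragraphs;
    }
}
```

Each resulting `Box` carries both the rectangle (x0, y0, x1, y1) and the words inside it, so the paragraph rectangles can be read straight off the returned list. The greedy line assignment is a simplification of the all-pairs comparison described above; skewed text or multi-column layouts would need a more careful grouping.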


2012-01-05 11:57:56 Alasdair
