2008-10-13

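How do I get character offset information from a PDF document?

I'm trying to implement search-result highlighting for PDFs in a web application. I have the original PDFs and small PNG versions of them that are used in the search results. Essentially, I'm looking for an API like this: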

pdf_document.find_offsets('somestring') 
# => { top: 501, left: 100, bottom: 520, right: 150 }, { ... another box ... }, ... 

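I know it's possible to get this information out of a PDF, because Apple's Preview.app does it.

I need something that runs on Linux, ideally open source. I'm aware that you can do this with Acrobat on Windows.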

Hi quackingduck, were you able to find an answer? If so, please post it here. – Thushan 2009-03-10 13:11:07

Answers

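CAM::PDF can do the geometry part fairly well, but it sometimes has trouble with the string matching. The technique would be something like the following lightly tested code: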

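use strict;
use warnings;
use CAM::PDF;

# Walk every page and print each text fragment with its lower-left corner.
my $pdf = CAM::PDF->new('my.pdf') or die $CAM::PDF::errstr;
for my $pagenum (1 .. $pdf->numPages) {
    my $pagetree = $pdf->getPageContentTree($pagenum)
        or die "could not parse the content of page $pagenum";
    # traverse() hands back the renderer object that collected the blocks.
    my @text = $pagetree->traverse('MyRenderer')->getTextBlocks;
    for my $textblock (@text) {
        print "text '$textblock->{str}' at ",
              "($textblock->{left},$textblock->{bottom})\n";
    }
}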

package MyRenderer; 
use base 'CAM::PDF::GS'; 

sub new { 
    my ($pkg, @args) = @_; 
    my $self = $pkg->SUPER::new(@args); 
    $self->{refs}->{text} = []; 
    return $self; 
} 
sub getTextBlocks { 
    my ($self) = @_; 
    return @{$self->{refs}->{text}}; 
} 
sub renderText {
    my ($self, $string, $width) = @_;
    # Position of the current text origin in device coordinates.
    my ($x, $y) = $self->textToDevice(0, 0);
    push @{$self->{refs}->{text}}, {
        str    => $string,
        left   => $x,
        bottom => $y,
        right  => $x + $width,
        #top   => $y + ???,   # needs the font height
    };
    return;
}

The output looks something like this:

text 'E' at (52.08,704.16) 
text 'm' at (73.62096,704.16) 
text 'p' at (113.58936,704.16) 
text 'lo' at (140.49648,704.16) 
text 'y' at (181.19904,704.16) 
text 'e' at (204.43584,704.16) 
text 'e' at (230.93808,704.16) 
text ' N' at (257.44032,704.16) 
text 'a' at (294.6504,704.16) 
text 'm' at (320.772,704.16) 
text 'e' at (360.7416,704.16) 
text 'Employee Name' at (56.4,124.56) 
text 'Employee Title' at (56.4,114.24) 
text 'Company Name' at (56.4,103.92) 

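As you can see from that output, the string matching will be a little tedious, because a single word may be split into several fragments, but the geometry is straightforward (except for the font height).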
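The tedious part, matching a search string that is spread over several fragments, could be sketched roughly like this. It is only a sketch under some assumptions: `find_offsets` is a hypothetical helper (named after the API wished for in the question), not part of CAM::PDF; it expects blocks with the `str`, `left`, `bottom` and `right` keys collected by the renderer in the answer; it assumes that fragments of one line share the same `bottom` value to within rounding, and that the characters inside one fragment are equally wide. The sample blocks are made up for illustration and do not come from a real PDF.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CAM::PDF: returns one box per match of
# $needle in a list of text blocks ({ str, left, bottom, right }).
sub find_offsets {
    my ($needle, @blocks) = @_;
    return if $needle eq q{};

    # Group fragments by baseline; rounding absorbs small float noise.
    my %lines;
    for my $block (@blocks) {
        my $key = sprintf '%.1f', $block->{bottom};
        push @{ $lines{$key} }, $block;
    }

    my @boxes;
    for my $key (sort { $b <=> $a } keys %lines) {    # highest baseline first
        my @line = sort { $a->{left} <=> $b->{left} } @{ $lines{$key} };

        # Join the fragments and record the x range of every character,
        # ASSUMING equal-width characters inside a fragment.
        my $text = q{};
        my (@start_x, @end_x);
        for my $block (@line) {
            my $len = length $block->{str};
            next if !$len;
            my $step = ($block->{right} - $block->{left}) / $len;
            for my $i (0 .. $len - 1) {
                push @start_x, $block->{left} + $i * $step;
                push @end_x,   $block->{left} + ($i + 1) * $step;
            }
            $text .= $block->{str};
        }

        # Report every occurrence, including overlapping ones.
        my $pos = 0;
        while (($pos = index($text, $needle, $pos)) >= 0) {
            my $last = $pos + length($needle) - 1;
            push @boxes, {
                left   => $start_x[$pos],
                right  => $end_x[$last],
                bottom => $line[0]{bottom},
            };
            $pos++;
        }
    }
    return @boxes;
}

# Made-up sample data: every character is 10 units wide.
my @blocks = (
    { str => 'Em',             left => 10,  right => 30,  bottom => 700 },
    { str => 'ploy',           left => 30,  right => 70,  bottom => 700 },
    { str => 'ee N',           left => 70,  right => 110, bottom => 700 },
    { str => 'ame',            left => 110, right => 140, bottom => 700 },
    { str => 'Employee Title', left => 10,  right => 150, bottom => 680 },
);

for my $box (find_offsets('Name', @blocks)) {
    printf "left=%s right=%s bottom=%s\n", @{$box}{qw(left right bottom)};
}
# prints: left=100 right=140 bottom=700
```

Rounding `bottom` to one decimal is a crude way of grouping fragments into lines, and two values that straddle a rounding boundary will end up on different lines, so a real implementation would probably want a proper tolerance. The boxes still have no `top`, for the same reason as above: the font height is missing.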