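Parsing a PDF in Perl

I want to extract some information from a PDF. I am trying to use getpdftext.pl from the CAM::PDF module. When I run $~ getpdftext.pl sample.pdf, it prints the text of the PDF to stdout.

But I want to write this text to a file and parse the required fields in Perl. Could someone please guide me on how to do this?

However, when I try to call pdftotext.pl from my Perl script, I get a No such file error.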

# Program to extract text from a PDF and save it in a text file

use PDF; 

use CAM::PDF; 

use CAM::PDF::PageText; 

use warnings; 

use IPC::System::Simple qw(system capture); 

$filein = 'sample.pdf';                 
$fileout = 'output1.txt'; 

open OUT, ">$fileout" or die "error: $!"; 

open IN, "getpdftext.pl $filein" or die "error :$!" ; 

while(<IN>) 
{ 
    print OUT $fileout; 
}


2011-10-06 sandyutd

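See perldoc -f open. You want to take the output stream of an external command and use it as an input stream in your Perl script. That is what the -| mode is for: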

open my $IN, '-|', "getpdftext.pl $filein" or die $!; 
while (<$IN>) { 
    ... 
}
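
A minimal sketch that extends this to also write the captured output to a file, as the question asks. In the question's code the open had no pipe mode, so Perl looked for a file literally named "getpdftext.pl sample.pdf", which explains the No such file error. The helper name save_command_output is made up for illustration. It uses the list form of the piped open, so the command runs without a shell (on Windows this form needs Perl 5.22 or later), and it assumes getpdftext.pl is executable and on your PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and copy its standard output to a file.
# Returns the number of lines written.
sub save_command_output {
    my ($outfile, @cmd) = @_;

    # List-form piped open: @cmd is executed directly, without a shell.
    open(my $in, '-|', @cmd)
        or die "cannot run $cmd[0]: $!";
    open(my $out, '>', $outfile)
        or die "cannot write $outfile: $!";

    my $lines = 0;
    while (my $line = <$in>) {
        print {$out} $line;
        $lines++;
    }

    close($out) or die "cannot close $outfile: $!";
    # close on a pipe returns false if the command exited non-zero.
    close($in)  or die "$cmd[0] failed (exit status " . ($? >> 8) . ")";
    return $lines;
}

# Assumed usage, with the file names from the question:
# save_command_output('output1.txt', 'getpdftext.pl', 'sample.pdf');
```

Once the text is in output1.txt, it can be read back line by line, or the loop body can be changed to match the fields directly instead of printing them.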


2011-10-06 22:34:57 mob

Thanks mob, the -| option helped. – sandyutd

It might be easier to make getpdftext.pl itself do what you want.

Using the code from getpdftext.pl, this (untested code) should write the text of the PDF to a text file.

use strict;
use warnings;
use CAM::PDF;

my $filein  = 'sample.pdf';
my $fileout = 'output1.txt';

# Open the PDF and the output file
my $doc = CAM::PDF->new($filein) || die "$CAM::PDF::errstr\n";
open my $fo, '>', $fileout or die "error: $!";

# Write the text of each page in turn
foreach my $p (1 .. $doc->numPages()) {
    my $str = $doc->getPageText($p);
    if (defined $str) {
        CAM::PDF->asciify(\$str);
        print $fo $str;
    }
}

close $fo;


2011-10-06 22:38:46 AFresh1

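Thanks a lot. – sandyutd

You're welcome. If you like, you can also use the text directly instead of printing it to a file, possibly by changing 'open my $fo ...' to 'my $docstr = "";' and 'print $fo $str;' to '$docstr .= $str;', and then using $docstr in place of the 'close $fo;'. – AFresh1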
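
The change described in this comment could look roughly like the sketch below, followed by one possible way to pull fields out of the collected text. The subroutine names pdf_to_string and parse_fields, and the "Label: value" line layout, are assumptions for illustration; the patterns you actually need depend on what getPageText returns for your PDF.

```perl
use strict;
use warnings;

# Collect the text of every page into one string instead of a file.
sub pdf_to_string {
    my ($filein) = @_;
    no warnings 'once';    # $CAM::PDF::errstr is only named once here
    require CAM::PDF;      # loaded only when this sub is called
    my $doc = CAM::PDF->new($filein) or die "$CAM::PDF::errstr\n";

    my $docstr = '';
    foreach my $p (1 .. $doc->numPages()) {
        my $str = $doc->getPageText($p);
        next unless defined $str;
        CAM::PDF->asciify(\$str);
        $docstr .= $str;
    }
    return $docstr;
}

# Hypothetical parser: assumes each wanted field sits on its own line
# as "Label: value". Returns a hash reference of label => value.
sub parse_fields {
    my ($text, @labels) = @_;
    my %fields;
    for my $label (@labels) {
        if ($text =~ /^[ \t]*\Q$label\E[ \t]*:[ \t]*(.*?)[ \t]*$/m) {
            $fields{$label} = $1;
        }
    }
    return \%fields;
}

# Assumed usage; 'Name' and 'Date' are made-up labels:
# my $fields = parse_fields(pdf_to_string('sample.pdf'), 'Name', 'Date');
# print "$_ = $fields->{$_}\n" for sort keys %$fields;
```

Keeping the parsing in its own subroutine means the patterns can be tried on a plain string before a real PDF is involved.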
