Extracting text from a PDF in C#

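I'm looking for a way to extract text from a PDF and use it in my program. I did some research online and found a few libraries that do the job. However, they are not free, or the free versions come with limitations.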

So I'm looking for a free library. I came across iTextSharp, but I don't know where to start. Can you help me out?

2012-02-29 jorne

Note that iTextSharp is not free software either. – Bobrovsky 2012-03-01 17:05:16

See the documentation and resources: - http://api.itextpdf.com/ - http://stackoverflow.com/questions/3365986/documentation-for-itextsharp – 2012-02-29 14:23:42

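Something like this should work for you. You do have to keep an eye on it, though: function names tend to change between iTextSharp releases, which is a bit annoying. Lol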

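using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static string GetPDFText(string pdfPath)
{
    PdfReader reader = new PdfReader(pdfPath);
    try
    {
        StringWriter output = new StringWriter();
        // Page numbers in iTextSharp are 1-based
        for (int i = 1; i <= reader.NumberOfPages; i++)
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
        return output.ToString();
    }
    finally
    {
        reader.Close(); // release the underlying file handle
    }
}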

2012-02-29 14:32:01 Dave

Nice, thanks! One question remains: if the PDF contains images, will that be a problem, or will it try to read them too? – jorne 2012-03-01 10:52:53

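It should be fine if the document contains images. To extract the images, you would need to inspect each PdfObject in the object collection. This code only extracts the text :) – Dave 2012-03-01 19:25:27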

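iTextSharp is open source, but the licensing model changed after version 4.1.6. The old license (LGPL/MPL) was much less strict, whereas the new one (AGPL) requires you to buy a commercial license if you use it commercially and do not want to release your source code. This may or may not affect you.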

Below is the most basic form of text extraction, using version 5.1.2.0:

//Full path to the file to read 
string fileToRead = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), @"file1.pdf"); 
//Bind a PdfReader to our file 
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(fileToRead); 
//Extract all of the text from the first page 
string allPage1Text = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1); 
//That's it! 
Console.Write(allPage1Text);
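
The snippet above only reads page 1. As a rough sketch of how it might be extended to every page, assuming an iTextSharp 5.x release: the class name `PdfTextDemo` and the helper `ExtractAllText` are made up for illustration, and `LocationTextExtractionStrategy` is just one possible strategy choice (it orders text chunks by their position on the page). `System.IO.Path` is written out in full because some later 5.x releases also define a `Path` type in `iTextSharp.text.pdf.parser`.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextDemo
{
    // Hypothetical helper: concatenates the extracted text of every page.
    public static string ExtractAllText(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            StringBuilder text = new StringBuilder();
            // Page numbers in iTextSharp are 1-based
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Create a fresh strategy for each page so that text
                // collected on one page is not carried into the next
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
            return text.ToString();
        }
        finally
        {
            // Release the underlying file handle
            reader.Close();
        }
    }

    public static void Main()
    {
        // Same sample file as in the snippet above
        string fileToRead = System.IO.Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "file1.pdf");
        Console.Write(ExtractAllText(fileToRead));
    }
}
```

Scanned PDFs that contain only page images will most likely come back as empty strings, since none of this performs OCR.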

2012-02-29 14:29:33
