How to programmatically search a PDF document in C#

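I need to search a PDF file to see whether a certain string is present. The string in question is definitely encoded as text (i.e. it is not an image or anything like that). I tried just searching the file as if it were plain text, but that does not work.

Is it possible to do this? Are there any libraries for .NET 2.0 that will extract/decode all the text from a PDF file for me?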

2009-02-20 Nathan

There are a few libraries available for this. Check out http://www.codeproject.com/KB/cs/PDFToText.aspx and http://itextsharp.sourceforge.net/

It takes a bit of effort, but it is possible.
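As a rough sketch of what the iTextSharp route might look like, assuming the iTextSharp 5.x API (PdfReader and PdfTextExtractor): the class name PdfSearch and the method FindPages are made up for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfSearch
{
    // Hypothetical helper: returns the 1-based numbers of the pages
    // whose extracted text contains the search string (case-insensitive).
    public static List<int> FindPages(string path, string needle)
    {
        List<int> hits = new List<int>();
        PdfReader reader = new PdfReader(path);
        try
        {
            // iTextSharp numbers pages starting at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page);
                if (text.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    hits.Add(page);
                }
            }
        }
        finally
        {
            // Release the file handle even if extraction throws.
            reader.Close();
        }
        return hits;
    }
}
```

Calling PdfSearch.FindPages(@"C:\docs\sample.pdf", "characters") (the path is only a placeholder) would return the matching page numbers. How well this works depends on how the PDF was produced; a phrase that the extractor splits across lines may not match as a single string.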

2009-02-20 01:26:46 volatilsis

+1 for iTextSharp. It should be able to do what you need. – jeremcc 2009-02-20 03:10:58

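The vast majority of the time it is not possible to search the contents of a PDF directly by opening it in Notepad. Even in the minority of cases where it is (depending on how the PDF was constructed), you will only ever be able to search for individual words, because of the way PDF handles text internally.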
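To make that concrete, here is a small sketch of the naive approach. RawPdfProbe and ContainsLiteral are hypothetical names, and the sample text-showing fragment in the usage note is hand-written for illustration rather than taken from a real file. In real PDFs the page content is usually stored in compressed streams as well, so in most cases not even the fragments are visible.

```csharp
using System;
using System.IO;
using System.Text;

public static class RawPdfProbe
{
    // Returns true only if the search string appears literally in the file's raw bytes.
    public static bool ContainsLiteral(string path, string needle)
    {
        byte[] raw = File.ReadAllBytes(path);
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost when decoding.
        string asText = Encoding.GetEncoding("ISO-8859-1").GetString(raw);
        return asText.IndexOf(needle, StringComparison.Ordinal) >= 0;
    }
}
```

If an uncompressed page draws its text with something like [(Hel) -20 (lo world)] TJ, the reader sees "Hello world", but ContainsLiteral finds "world" and misses "Hello", because the word is split into separately positioned pieces.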

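My company has a commercial solution that lets you extract text from PDF files. I have included some sample code below, as shown on this page, that demonstrates how to search the text of a PDF file for a specific string.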

using System;
using System.IO;
using QuickPDFDLL0718;

namespace QPLConsoleApp
{
    public class QPL
    {
        public static void Main()
        {
            // This example uses the DLL edition of Quick PDF Library.
            // Create an instance of the class and give it the path to the DLL.
            PDFLibrary QP = new PDFLibrary("QuickPDFDLL0718.dll");

            // Check if the DLL was loaded successfully.
            if (QP.LibraryLoaded())
            {
                // Insert the license key here and check that it is accepted.
                if (QP.UnlockKey("...") == 1)
                {
                    QP.LoadFromFile(@"C:\Program Files\Quick PDF Library\DLL\GettingStarted.pdf");

                    int iPageCount = QP.PageCount();
                    int PageNumber = 1;
                    int MatchesFound = 0;

                    while (PageNumber <= iPageCount)
                    {
                        QP.SelectPage(PageNumber);
                        string PageText = QP.GetPageText(3);

                        // Write the extracted page text to a temporary file.
                        using (StreamWriter TempFile = new StreamWriter(QP.GetTempPath() + "temp" + PageNumber + ".txt"))
                        {
                            TempFile.Write(PageText);
                        }

                        // Read it back line by line and split each line into comma-separated fields.
                        string[] lines = File.ReadAllLines(QP.GetTempPath() + "temp" + PageNumber + ".txt");
                        string[][] grid = new string[lines.Length][];

                        for (int i = 0; i < lines.Length; i++)
                        {
                            grid[i] = lines[i].Split(',');
                        }

                        foreach (string[] line in grid)
                        {
                            // Skip lines with fewer than 12 fields so that line[11] cannot throw.
                            if (line.Length <= 11)
                            {
                                continue;
                            }

                            // This sample reads the text from field 11 of each line.
                            string FindMatch = line[11];

                            // Update this string to the word that you're searching for.
                            // It can be one or more words (e.g. "sunday" or "last sunday").
                            if (FindMatch.Contains("characters"))
                            {
                                Console.WriteLine("Success! Word match found on page: " + PageNumber);
                                MatchesFound++;
                            }
                        }
                        PageNumber++;
                    }

                    if (MatchesFound == 0)
                    {
                        Console.WriteLine("Sorry! No matches found.");
                    }
                    else
                    {
                        Console.WriteLine();
                        Console.WriteLine("Total: " + MatchesFound + " matches found!");
                    }
                    Console.ReadLine();
                }
            }
        }
    }
}

2010-03-29 11:30:32 Rowan

You can use the Docotic.Pdf library to search for text in PDF files.

Here is some sample code:

static void searchForText(string path, string text)
{
    using (PdfDocument pdf = new PdfDocument(path))
    {
        for (int i = 0; i < pdf.Pages.Count; i++)
        {
            string pageText = pdf.Pages[i].GetText();
            int index = pageText.IndexOf(text, 0, StringComparison.CurrentCultureIgnoreCase);
            if (index != -1)
                // i is a zero-based index, so report the 1-based page number.
                Console.WriteLine("'{0}' found on page {1}", text, i + 1);
        }
    }
}

The library can also extract formatted and plain text from a whole document or from any single page.

Disclaimer: I work for Bit Miracle, the vendor of the library.

2012-01-21 22:13:20 Bobrovsky
