2012-08-01 98 views

Answers

0

It is not easy to tell where a page ends in a Word document, although Word does write w:lastRenderedPageBreak elements into the files it creates.
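To see whether a given file carries those hints at all, something like the following might help. It is only a sketch: the class and method names are made up for illustration, and it assumes you have already extracted the XML of the main document part (word/document.xml inside the .docx package) into a string. Keep in mind that the elements record where pages broke the last time Word laid the document out, so they can be stale or missing.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any Word API
static class RenderedPageBreaks {
    // WordprocessingML main namespace, conventionally bound to the "w" prefix
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts the w:lastRenderedPageBreak elements in the given document XML
    public static int Count(string documentXml) {
        return XDocument.Parse(documentXml)
            .Descendants(W + "lastRenderedPageBreak")
            .Count();
    }
}
```

A count of zero means the file gives you nothing to go on, which is one more reason to prefer explicit markers.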

It would be best to have your OCR program insert some kind of marker into the document between each converted block of text.

Then, depending on what type of Word document it is, use an appropriate tool to process the file.
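Once the markers are in place and you have the document text as a string, the splitting itself is simple. A minimal sketch, assuming a marker string of "[[PAGE-BREAK]]" (a placeholder I made up; use whatever your OCR program can be configured to emit) and a hypothetical helper class:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only
static class MarkerSplitter {
    // Assumed marker text; not something any OCR tool emits by default
    public const string Marker = "[[PAGE-BREAK]]";

    // Splits the text on the marker, trims each block and drops empty ones
    public static IList<string> SplitBlocks(string text) {
        var blocks = new List<string>();
        foreach (var part in text.Split(new[] { Marker }, StringSplitOptions.None)) {
            var trimmed = part.Trim();
            if (trimmed.Length > 0) {
                blocks.Add(trimmed);
            }
        }
        return blocks;
    }
}
```

Pick a marker that cannot occur in the scanned text itself, otherwise blocks will be split in the wrong places.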

3

If you have Word installed, you can use the Word object model to process Word documents from C#.

First, add a reference to the Word object model. Right-click the project, then Add Reference... -> COM -> Microsoft Word 14.0 Object Library (or similar, depending on your version of Word).

Then you can use the following code:

using Microsoft.Office.Interop.Word;
//for older versions of Word use:
//using Word;

namespace WordSplitter {
    class Program {
        static void Main(string[] args) {
            //Create a new instance of Word
            var app = new Application();

            //Show the Word instance.
            //If the code runs too slowly, you can show the application at the end of the program
            //Make sure it works properly first; otherwise, you'll get an error in a hidden window
            //(If it still runs too slowly, there are a few other ways to reduce screen updating)
            app.Visible = true;

            //We need a reference to the source document
            //It should be possible to get a reference to an open Word document, but I haven't tried it
            var doc = app.Documents.Open(@"path\to\file.doc");
            //(Can also use .docx)

            //The explicit cast keeps this compiling whether Information is typed as dynamic or object
            int pageCount = (int)doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];

            //We'll hold the start position of each page here
            int pageStart = 0;

            for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
                //This Range object will contain each page.
                var page = doc.Range(pageStart);

                //Generally, the end of the current page is 1 character before the start of the next.
                //However, we need to handle the last page -- since there is no next page, the
                //GoTo method will move to the *start* of the last page.
                if (currentPageIndex < pageCount) {
                    //page.GoTo returns a new Range object, leaving the page object unaffected
                    page.End = page.GoTo(
                        What: WdGoToItem.wdGoToPage,
                        Which: WdGoToDirection.wdGoToAbsolute,
                        Count: currentPageIndex + 1
                    ).Start - 1;
                } else {
                    page.End = doc.Range().End;
                }
                pageStart = page.End + 1;

                //Copy and paste the contents of the Range into a new document
                page.Copy();
                var doc2 = app.Documents.Add();
                doc2.Range().Paste();
            }
        }
    }
}

Reference: Word Object Model Overview on MSDN
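If the end-of-page arithmetic in the loop is hard to follow, it may help to look at it without Word in the picture. The sketch below is not part of the Word API; PageBounds and Compute are hypothetical names, and it assumes you already know the character offset at which each page starts. It applies the same rule as the loop: a page ends one character before the next page starts, and the last page ends at the end of the document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that mirrors the loop's boundary rule on plain integers
static class PageBounds {
    // pageStarts[i] is the offset where page i+1 begins; docEnd is the end of the document
    public static List<Tuple<int, int>> Compute(IList<int> pageStarts, int docEnd) {
        var result = new List<Tuple<int, int>>();
        for (int i = 0; i < pageStarts.Count; i++) {
            bool isLastPage = i == pageStarts.Count - 1;
            // Last page: there is no next page, so fall back to the document end
            int end = isLastPage ? docEnd : pageStarts[i + 1] - 1;
            result.Add(Tuple.Create(pageStarts[i], end));
        }
        return result;
    }
}
```

For page starts 0, 120 and 300 in a document that ends at 450, this yields (0, 119), (120, 299) and (300, 450).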

+0

Thanks, dear @ZevSpitz – Iman 2012-08-03 08:11:23

+0

This is a perfect starting point for creating something useful. – 2012-10-16 15:12:45

4

The same approach as the other answer, but written as an iterator (IEnumerable<Range>) exposed through an extension method on Document.

using System.Collections.Generic;
using Microsoft.Office.Interop.Word;

static class PagesExtension {
    //Lazily yields one Range per page of the document
    public static IEnumerable<Range> Pages(this Document doc) {
        int pageCount = (int)doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];
        int pageStart = 0;
        for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
            var page = doc.Range(pageStart);
            if (currentPageIndex < pageCount) {
                //page.GoTo returns a new Range object, leaving the page object unaffected
                page.End = page.GoTo(
                    What: WdGoToItem.wdGoToPage,
                    Which: WdGoToDirection.wdGoToAbsolute,
                    Count: currentPageIndex + 1
                ).Start - 1;
            } else {
                //There is no next page, so the last page runs to the end of the document
                page.End = doc.Range().End;
            }
            pageStart = page.End + 1;
            yield return page;
        }
    }
}

The main code then ends up looking like this:

static void Main(string[] args) {
    var app = new Application();
    app.Visible = true;
    var doc = app.Documents.Open(@"path\to\source\document");
    foreach (var page in doc.Pages()) {
        page.Copy();
        var doc2 = app.Documents.Add();
        doc2.Range().Paste();
    }
}