使用/ CCITTFaxDecode过滤器从PDF中提取图像

我有一个从扫描软件生成的pdf。 pdf每页有1个TIFF图像。我想从每个页面中提取TIFF图像。使用/ CCITTFaxDecode过滤器从PDF中提取图像

我正在使用iTextSharp，并且我已经成功找到了这些图像，并且可以从PdfReader.GetStreamBytesRaw方法中取回原始字节。问题是，就像我之前发现的那样，iTextSharp不包含PdfReader.CCITTFaxDecode方法。

我还知道什么？即使没有iTextSharp的，我可以在记事本中打开PDF，并找到/Filter /CCITTFaxDecode流，我从/DecodeParams知道它是用CCITTFaxDecode组4

有谁在那里知道我怎么能得到CCITTFaxDecode过滤图像出来的我PDF？

干杯， Kahu

来源

2010-04-14 Kahu

的情况下，任何人有兴趣在一个单独iTextSharp的解决方案，还有这其他链接（http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx），它建议使用iTextSharp 5xx中引入的新功能叫做解析器 - 它确实有效。信贷到@kuunjinbo 这只是在我的情况下，我不得不使用ImageConverter的结果来创建一个位图（不知道为什么） – 2016-11-18 11:35:42

或许你可以尝试用解压的pdftk的PDF？语法是

pdftk infile.pdf output uncompressed.pdf uncompress

我没有CCITTFax编码的pdf，所以我无法测试它。

来源

2010-04-21 15:55:24 topskip

这个解压缩命令似乎没有与libtiff COMPRESSION字段相关。我在一个带有CCITTFaxdecode图像的pdf上试了这个，它似乎没有什么区别。 – nealmcb 2011-12-09 19:49:49

这个库... http://www.bitmiracle.com/libtiff/和下面这个例子应该让你的存在方式

string filter = pd.Get(PdfName.FILTER).ToString(); 
string width = pd.Get(PdfName.WIDTH).ToString(); 
string height = pd.Get(PdfName.HEIGHT).ToString(); 
string bpp = pd.Get(PdfName.BITSPERCOMPONENT).ToString(); 

switch (filter) 
{ 
    case "/CCITTFaxDecode": 

     byte[] data = PdfReader.GetStreamBytesRaw((PRStream)pdfStream); 
     int tiff = TIFFOpen("example.tif", "w"); 
     TIFFSetField(tiff, (uint)BitMiracle.LibTiff.Classic.TiffTag.IMAGEWIDTH,(uint)Int32.Parse(width)); 
     TIFFSetField(tiff, (uint)BitMiracle.LibTiff.Classic.TiffTag.IMAGEHEIGHT, (uint)Int32.Parse(height)); 
     TIFFSetField(tiff, (uint)BitMiarcle.LibTiff.Classic.TiffTag.COMPRESSION, (uint)BitMiracle.Libtiff.Classic.Compression.CCITTFAX4); 
     TIFFSetField(tiff, (uint)BitMiracle.LibTiff.Classic.TiffTag.BITSPERSAMPLE, (uint)Int32.Parse(bpp)); 
     TIFFSetField(tiff, (uint)BitMiarcle.Libtiff.Classic.TiffTag.SAMPLESPERPIXEL,1); 

     IntPtr pointer = Marshal.AllocHGlobal(data.length); 
     Marshal.copy(data, 0, pointer, data.length); 
     TIFFWriteRawStrip(tiff, 0, pointer, data.length); 
     TIFFClose(tiff); 

     break; 




     break; 

}

来源

2010-05-24 15:44:28 vbcrlfuser

和NuGet在这里：[BitMiracle.LibTiff.NET]（https://www.nuget.org/packages/BitMiracle.LibTiff.NET/）。 – 2017-05-26 12:19:32

事实上，vbcrlfuser的回答也帮我99％，但是代码不是为当前版本完全正确BitMiracle.LibTiff.NET，我可以下载它。在当前版本中，等效代码如下所示：

using iTextSharp.text.pdf; 
using BitMiracle.LibTiff.Classic; 

... 
     Tiff tiff = Tiff.Open("C:\\test.tif", "w"); 
     tiff.SetField(TiffTag.IMAGEWIDTH, UInt32.Parse(pd.Get(PdfName.WIDTH).ToString())); 
     tiff.SetField(TiffTag.IMAGELENGTH, UInt32.Parse(pd.Get(PdfName.HEIGHT).ToString())); 
     tiff.SetField(TiffTag.COMPRESSION, Compression.CCITTFAX4); 
     tiff.SetField(TiffTag.BITSPERSAMPLE, UInt32.Parse(pd.Get(PdfName.BITSPERCOMPONENT).ToString())); 
     tiff.SetField(TiffTag.SAMPLESPERPIXEL, 1); 
     tiff.WriteRawStrip(0, raw, raw.Length); 
     tiff.Close();

使用上面的代码，我终于了在C有效的TIFF文件：\ test.tif。谢谢vbcrlfuser！

来源

2010-08-29 08:09:12

什么是“生”？你能展示完整的例子吗？ – 2014-09-20 17:08:42

B.K.这可能有点晚了，但我猜想是这样。请参阅vbcrlfuser的答案... byte [] data = PdfReader.GetStreamBytesRaw（（PRStream）pdfStream）; （与原始使用方式相同） – planty182 2015-10-13 09:32:43

这里是Python实现：

import PyPDF2 
import struct 

""" 
Links: 
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf 
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items 
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python 
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter 
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html 
""" 


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4): 
    tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h' 
    return struct.pack(tiff_header_struct, 
         b'II', # Byte order indication: Little indian 
         42, # Version number (always 42) 
         8, # Offset to first IFD 
         8, # Number of tags in IFD 
         256, 4, 1, width, # ImageWidth, LONG, 1, width 
         257, 4, 1, height, # ImageLength, LONG, 1, lenght 
         258, 3, 1, 1, # BitsPerSample, SHORT, 1, 1 
         259, 3, 1, CCITT_group, # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding 
         262, 3, 1, 0, # Threshholding, SHORT, 1, 0 = WhiteIsZero 
         273, 4, 1, struct.calcsize(tiff_header_struct), # StripOffsets, LONG, 1, len of header 
         278, 4, 1, height, # RowsPerStrip, LONG, 1, lenght 
         279, 4, 1, img_size, # StripByteCounts, LONG, 1, size of image 
         0 # last IFD 
         ) 

pdf_filename = 'scan.pdf' 
pdf_file = open(pdf_filename, 'rb') 
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file) 
for i in range(0, cond_scan_reader.getNumPages()): 
    page = cond_scan_reader.getPage(i) 
    xObject = page['/Resources']['/XObject'].getObject() 
    for obj in xObject: 
     if xObject[obj]['/Subtype'] == '/Image': 
      """ 
      The CCITTFaxDecode filter decodes image data that has been encoded using 
      either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is 
      designed to achieve efficient compression of monochrome (1 bit per pixel) image 
      data at relatively low resolutions, and so is useful only for bitmap image data, not 
      for color images, grayscale images, or general data. 

      K < 0 --- Pure two-dimensional encoding (Group 4) 
      K = 0 --- Pure one-dimensional encoding (Group 3, 1-D) 
      K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D) 
      """ 
      if xObject[obj]['/Filter'] == '/CCITTFaxDecode': 
       if xObject[obj]['/DecodeParms']['/K'] == -1: 
        CCITT_group = 4 
       else: 
        CCITT_group = 3 
       width = xObject[obj]['/Width'] 
       height = xObject[obj]['/Height'] 
       data = xObject[obj]._data # sorry, getData() does not work for CCITTFaxDecode 
       img_size = len(data) 
       tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group) 
       img_name = obj[1:] + '.tiff' 
       with open(img_name, 'wb') as img_file: 
        img_file.write(tiff_header + data) 
       # 
       # import io 
       # from PIL import Image 
       # im = Image.open(io.BytesIO(tiff_header + data)) 
pdf_file.close()

来源

2016-01-01 10:26:38

根据标签和问题主体，使用iTextSharp。因此，* python实现*并不回答这个问题。 – mkl 2016-01-01 13:33:30

根据TIFF规范（[link]（http://www.itu.int/itudoc/itu-t/com16/tiff-fx/docs/tiff6.html）），我认为你的变量'tiff_header_struct'应该阅读''''+'2s'+'H'+'L'+'H'+'HHLL'* 8 +'L''。特别注意最后的'L''。 – Taar 2016-12-19 12:37:59

它已被写入extension此。它是如此简单

PdfDictionary item; 
if (item.IsImage()) { 
    Image image = item.ToImage(); 
}

来源

2017-12-19 06:50:03

使用/ CCITTFaxDecode过滤器从PDF中提取图像

回答

相关问题