How to decompress (inflate) a PDF stream

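I'm working with the 2016-W4 PDF, which has two large streams along with a bunch of other objects and smaller streams. I'm trying to inflate the streams so I can work with the source data, but I'm struggling: all I get are corrupt input and invalid checksum errors.

I've written a test script to help debug, and pulled some smaller streams out of the file to test with.

Here are 2 streams from the original PDF, along with their length objects:

Stream 1: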

149 0 obj 
<< /Length 150 0 R /Filter /FlateDecode /Type /XObject /Subtype /Form /FormType 
1 /BBox [0 0 8 8] /Resources 151 0 R >> 
stream 
x+TT(T0B ,JUWÈS0Ð37±402V(NFJSþ¶ 
« 
endstream 
endobj 
150 0 obj 
42 
endobj

Stream 2:

142 0 obj 
<< /Length 143 0 R /Filter /FlateDecode /Type /XObject /Subtype /Form /FormType 
1 /BBox [0 0 0 0] /Resources 144 0 R >> 
stream 
x+Tçã 
endstream 
endobj 
143 0 obj 
11 
endobj

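I copied just the stream contents into a new file in Vim (excluding the carriage returns after stream and before endstream).

I've tried both:

compress/flate (RFC 1951), after removing the first 2 bytes (CMF, FLG)
compress/zlib (RFC 1950)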
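
For reference, the two formats differ only in framing: a zlib stream (RFC 1950) is a 2-byte header (CMF, FLG), followed by the raw deflate data (RFC 1951) and a 4-byte Adler-32 trailer. The following is a minimal sketch of the header check; `parseZlibHeader` is a hypothetical helper name chosen for illustration, not something from the Go standard library.

```go
package main

import "fmt"

// parseZlibHeader is an illustrative helper (not a standard library function).
// It decodes the two zlib header bytes described in RFC 1950.
func parseZlibHeader(cmf, flg byte) (method, window int, ok bool) {
	method = int(cmf & 0x0F) // CM: low 4 bits, 8 means deflate
	cinfo := int(cmf >> 4)   // CINFO: high 4 bits, log2(window size) - 8
	window = 1 << (cinfo + 8)
	// FCHECK (low 5 bits of FLG) makes CMF*256+FLG a multiple of 31.
	ok = method == 8 && cinfo <= 7 && (int(cmf)*256+int(flg))%31 == 0
	return method, window, ok
}

func main() {
	// 120, 1 are the first two bytes of both streams in the question.
	method, window, ok := parseZlibHeader(120, 1)
	fmt.Println(method, window, ok) // 8 32768 true
}
```

Both streams start with 120, 1, so the header itself is valid; whatever goes wrong happens later in the data.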

I converted the streams to []byte for the program below:

package main 

import (
    "bytes"
    "compress/flate"
    "compress/zlib"
    "fmt"
    "io"
    "os"
)

var (
    flateReaderFn = func(r io.Reader) (io.ReadCloser, error) { return flate.NewReader(r), nil } 
    zlibReaderFn = func(r io.Reader) (io.ReadCloser, error) { return zlib.NewReader(r) } 
) 

func deflate(b []byte, skip, length int, newReader func(io.Reader) (io.ReadCloser, error)) { 
    // rfc-1950 
    // -------- 
    // First 2 bytes 
    // [120, 1] - CMF, FLG 
    // 
    // CMF: 120
    //  0111 1000
    //  ↑    ↑
    //  |    CM(8) = deflate compression method
    //  CINFO(7) = 32k LZ77 window size
    // 
    // FLG: 1 
    //  0001 ← FCHECK 
    //   (CMF*256 + FLG) % 31 == 0 
    //    120 * 256 + 1 = 30721 
    //        30721 % 31 == 0 

    stream := bytes.NewReader(b[skip:length])
    r, err := newReader(stream)
    if err != nil {
        fmt.Println("\nfailed to create reader,", err)
        return
    }
    // Close the reader on every return path, not only on success.
    defer r.Close()

    n, err := io.Copy(os.Stdout, r)
    if err != nil {
        if n > 0 {
            fmt.Print("\n")
        }
        fmt.Println("\nfailed to write contents from reader,", err)
        return
    }
    fmt.Printf("%d bytes written\n", n)
} 

func main() { 
    //readerFn, skip := flateReaderFn, 2 // compress/flate RFC-1951, ignore first 2 bytes 
    readerFn, skip := zlibReaderFn, 0 // compress/zlib RFC-1950, ignore nothing 

    //                        ⤹ This is where the error occurs: `flate: corrupt input before offset 19`. 
    stream1 := []byte{120, 1, 43, 84, 8, 84, 40, 84, 48, 0, 66, 11, 32, 44, 74, 85, 8, 87, 195, 136, 83, 48, 195, 144, 51, 55, 194, 177, 52, 48, 50, 86, 40, 78, 70, 194, 150, 74, 83, 8, 4, 0, 195, 190, 194, 182, 10, 194, 171, 10} 
    stream2 := []byte{120, 1, 43, 84, 8, 4, 0, 1, 195, 167, 0, 195, 163, 10} 

    fmt.Println("----------------------------------------\nStream 1:") 
    deflate(stream1, skip, 42, readerFn) // flate: corrupt input before offset 19 

    fmt.Println("----------------------------------------\nStream 2:") 
    deflate(stream2, skip, 11, readerFn) // invalid checksum 
}

I'm sure I'm doing something wrong somewhere, I just can't quite see it.

(The PDF does open in a viewer.)

2017-02-20 Justin

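Are you sure Vim showed you, and copied, the correct bytes? You should get the data from a hex editor (e.g. check [hecate](https://github.com/evanmiller/hecate)). – icza

@icza - If you want to post that as an answer, I'll give you credit =) – Justin

Binary data should never be copied out of, or saved from, a text editor. There may be cases where this happens to work, and that only adds fuel to the fire.

The data you eventually "dug out" of the PDF is most likely not identical to the actual data in the PDF. You should take the data from a hex editor (e.g. try hecate for something new), or write a simple app that saves it (one that strictly handles the file as binary).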
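
As a rough sketch of such a "simple app", assuming Go 1.16 or later for `os.ReadFile`: the program below reads the file as raw bytes and hex-dumps the given number of bytes after the first `stream` keyword. `rawStream` is a hypothetical helper; a real tool would parse the object and resolve its /Length reference instead of taking the length on the command line.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"os"
	"strconv"
)

// rawStream returns exactly `length` bytes following the first "stream"
// keyword and its end-of-line marker. Illustrative only: it does not parse
// PDF objects, it just looks for the first keyword in the file.
func rawStream(pdf []byte, length int) ([]byte, error) {
	i := bytes.Index(pdf, []byte("stream"))
	if i < 0 {
		return nil, fmt.Errorf("no stream keyword found")
	}
	rest := pdf[i+len("stream"):]
	// The keyword is followed by CRLF or LF, which is not part of the data.
	if bytes.HasPrefix(rest, []byte("\r\n")) {
		rest = rest[2:]
	} else if bytes.HasPrefix(rest, []byte("\n")) {
		rest = rest[1:]
	}
	if length < 0 || len(rest) < length {
		return nil, fmt.Errorf("cannot take %d bytes from stream", length)
	}
	return rest[:length], nil
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: dump <file.pdf> <length>")
		return
	}
	length, err := strconv.Atoi(os.Args[2])
	if err != nil {
		fmt.Println("bad length:", err)
		return
	}
	pdf, err := os.ReadFile(os.Args[1]) // raw bytes, no text decoding
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	data, err := rawStream(pdf, length)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Print(hex.Dump(data))
}
```

The byte values printed by `hex.Dump` can then be pasted into a `[]byte` literal without any editor touching them.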

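Hint #1:

The binary data you show is spread across multiple lines. Binary data does not contain carriage returns; that is a text control. If line breaks appear, it means the editor did interpret the data as text, so some codes/characters were "consumed" to start new lines. Multiple sequences may be interpreted as the same newline (e.g. \n, \r\n). By excluding them you have already lost data; by including them you may already have a different sequence. And if the data was interpreted and displayed as text, more problems can arise, because there are more control characters, and some characters may not appear at all when displayed.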
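
To make this concrete with the bytes from the question: pairs such as 195, 167 look like the UTF-8 encoding of single bytes above 127 (231 here, which displays as "ç"). The sketch below assumes that this re-encoding, plus one trailing newline added by the editor, is the only damage; that is an inference from the pasted bytes, not something the editor guarantees, and in general such damage cannot be undone. `undoUTF8` and `inflateZlib` are hypothetical helper names.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"unicode/utf8"
)

// undoUTF8 maps every UTF-8 encoded code point up to U+00FF back to a single
// byte. It reports false if the input holds anything else, including
// invalid UTF-8 (which decodes to U+FFFD).
func undoUTF8(b []byte) ([]byte, bool) {
	out := make([]byte, 0, len(b))
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r > 0xFF {
			return nil, false
		}
		out = append(out, byte(r))
		b = b[size:]
	}
	return out, true
}

// inflateZlib decompresses a zlib (RFC 1950) stream into a string.
func inflateZlib(b []byte) (string, error) {
	r, err := zlib.NewReader(bytes.NewReader(b))
	if err != nil {
		return "", err
	}
	defer r.Close()
	data, err := io.ReadAll(r)
	return string(data), err
}

func main() {
	// One byte (231) becomes two once it is treated as a character.
	fmt.Println([]byte(string(rune(0xE7)))) // [195 167]

	// Stream 2 exactly as pasted in the question (14 bytes, /Length says 11).
	pasted := []byte{120, 1, 43, 84, 8, 4, 0, 1, 195, 167, 0, 195, 163, 10}
	fixed, ok := undoUTF8(pasted)
	fixed = bytes.TrimSuffix(fixed, []byte{'\n'}) // newline added by the editor
	fmt.Println(len(fixed), ok)                   // 11 true

	text, err := inflateZlib(fixed)
	fmt.Printf("%q %v\n", text, err) // "q Q" <nil>
}
```

After the reversal the length matches the declared /Length of 11 and the Adler-32 checksum verifies, which supports the diagnosis for this particular stream.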

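Hint #2:

When flateReaderFn is used, decoding the 2nd example succeeds (it completes without an error). This means you are "barking up the right tree", but success depends on what the actual data is and on how badly it was "distorted" by the text editor.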
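
A small sketch of why the two readers can disagree on the same input: compress/flate only decodes the deflate blocks, while compress/zlib also verifies the Adler-32 trailer. With stream 2 as pasted in the question, the deflate bytes appear to be intact and only the trailer is altered. The helper names here are hypothetical.

```go
package main

import (
	"bytes"
	"compress/flate"
	"compress/zlib"
	"fmt"
	"io"
)

// readAll drains r and returns the decoded text together with any error.
func readAll(r io.ReadCloser) (string, error) {
	defer r.Close()
	data, err := io.ReadAll(r)
	return string(data), err
}

// inflateRaw decodes raw deflate data (RFC 1951); no checksum is involved.
func inflateRaw(b []byte) (string, error) {
	return readAll(flate.NewReader(bytes.NewReader(b)))
}

// inflateZlib decodes a zlib stream (RFC 1950), including the checksum check.
func inflateZlib(b []byte) (string, error) {
	r, err := zlib.NewReader(bytes.NewReader(b))
	if err != nil {
		return "", err
	}
	return readAll(r)
}

func main() {
	// Stream 2 as pasted in the question, using the same slices as its main().
	stream2 := []byte{120, 1, 43, 84, 8, 4, 0, 1, 195, 167, 0, 195, 163, 10}

	text, err := inflateRaw(stream2[2:11]) // skip CMF and FLG
	fmt.Printf("flate: %q, err=%v\n", text, err)

	_, err = inflateZlib(stream2[:11]) // fails on the altered trailer
	fmt.Printf("zlib:  err=%v\n", err)
}
```

A clean result from the raw reader therefore says nothing about the trailer bytes, and a stream whose deflate bytes were altered (as seems to be the case for stream 1) fails with either reader.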

2017-02-21 20:23:49 icza

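OK, confession time...

I was so caught up in trying to understand deflate that I completely overlooked the fact that Vim wasn't saving the stream contents correctly into the new file. So I spent quite a bit of time reading the RFCs and digging through the internals of the Go compress/... packages, assuming the problem was in my code.

Shortly after posting my question, I tried reading the PDF as a whole, finding the stream/endstream positions, and pushing that through the decompressor. As soon as I saw the content scroll across the screen, I realized my dumb mistake.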
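
That approach might look roughly like the sketch below. It is not the code I actually ran, just a minimal illustration that assumes every stream is FlateDecode (zlib) data and that the bytes "endstream" never occur inside a compressed stream; `inflateStreams` is a hypothetical helper name, and a robust tool would use each object's /Length and /Filter instead.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"os"
)

// inflateStreams scans raw PDF bytes for stream ... endstream pairs and
// returns the contents of every stream that decodes as zlib data.
// Streams that fail to decode are silently skipped.
func inflateStreams(pdf []byte) [][]byte {
	var out [][]byte
	for {
		i := bytes.Index(pdf, []byte("stream"))
		if i < 0 {
			return out
		}
		pdf = pdf[i+len("stream"):]
		// Skip the end-of-line marker that follows the keyword.
		if bytes.HasPrefix(pdf, []byte("\r\n")) {
			pdf = pdf[2:]
		} else if bytes.HasPrefix(pdf, []byte("\n")) {
			pdf = pdf[1:]
		}
		j := bytes.Index(pdf, []byte("endstream"))
		if j < 0 {
			return out
		}
		raw := pdf[:j]
		pdf = pdf[j+len("endstream"):]

		r, err := zlib.NewReader(bytes.NewReader(raw))
		if err != nil {
			continue // not zlib data (other filter or uncompressed)
		}
		data, err := io.ReadAll(r)
		r.Close()
		if err == nil {
			out = append(out, data)
		}
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: inflate <file.pdf>")
		return
	}
	pdf, err := os.ReadFile(os.Args[1]) // the whole file, as raw bytes
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for i, s := range inflateStreams(pdf) {
		fmt.Printf("--- stream %d (%d bytes) ---\n%s\n", i, len(s), s)
	}
}
```

Because the bytes never pass through a text editor, the line break before endstream is simply left unread after the checksum and does no harm.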

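+1 @icza, that was exactly my problem.

It worked out well in the end, because I now understand the whole process much better than I would have if it had just worked from the start.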

2017-02-21 18:34:42 Justin
