2017-03-09 130 views
2

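PDFBox 2.0.4: XFA to text error

I get the following errors when trying to convert a PDF (XFA) to a string. The errors started appearing when I switched from PDFBox 1.8.12 to PDFBox 2.0.4.

Here is the log: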

Mar 09, 2017 7:16:07 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray 
WARNING: Corrupt object reference at offset 779916 
Mar 09, 2017 7:16:07 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray 
WARNING: Corrupt object reference at offset 780049 
Mar 09, 2017 7:16:07 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray 
WARNING: Corrupt object reference at offset 780074 
java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 at offset 780074 
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:951) 
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:651) 
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:866) 
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:150) 
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:274) 
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:207) 
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:854) 
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:772) 
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:741) 
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:672) 
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:632) 
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:217) 
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252) 

java.io.IOException: Wrong type of referenced length object COSObject{7, 0}: COSDictionary 
    at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:907) 
    at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:949) 
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:780) 
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:741) 
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:672) 
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:632) 
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:217) 
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252) 
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:966) 
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:922) 
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:870) 

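I read the migration guide and now use load instead of loadNonSeq, since PDFBox handles that internally now.

Any suggestions on how to resolve these errors?

Edit: Error#1 Error#2

Edit #2: @TilmanHausherr I checked your theory. I opened the file in Sublime, removed the extra whitespace at the beginning, and saved it. I then got the following error: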

org.apache.pdfbox.filter.FlateFilter decode 
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException 
java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back 
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82) 
    at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69) 
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162) 
    at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:56) 
    at org.apache.pdfbox.pdfparser.COSParser.parseXrefStream(COSParser.java:2075) 
    at org.apache.pdfbox.pdfparser.COSParser.parseXrefObjStream(COSParser.java:348) 
    at org.apache.pdfbox.pdfparser.COSParser.parseXref(COSParser.java:303) 
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:194) 
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252) 
    at utils.PDFManager.PDFToText(PDFManager.java:280) 
    at processing.charge.CertificateUtils.getCertificateTypeFromFile(CertificateUtils.java:56) 
    at processing.charge.CertificateUtils.getCertificateType(CertificateUtils.java:48) 
    at processing.Controller.getDocumentType(Controller.java:110) 
    at processing.Controller.insertIntoDb(Controller.java:43) 
    at Test.main(Test.java:203) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147) 
Caused by: java.util.zip.DataFormatException: invalid distance too far back 
    at java.util.zip.Inflater.inflateBytes(Native Method) 
    at java.util.zip.Inflater.inflate(Inflater.java:259) 
    at java.util.zip.Inflater.inflate(Inflater.java:280) 
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107) 
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:64) 
    ... 19 more 
Mar 09, 2017 11:07:22 PM org.apache.pdfbox.filter.FlateFilter decode 
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException 
java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back 
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82) 
    at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69) 
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162) 
    at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:56) 
    at org.apache.pdfbox.pdfparser.COSParser.parseXrefStream(COSParser.java:2075) 
    at org.apache.pdfbox.pdfparser.COSParser.parseXrefObjStream(COSParser.java:348) 
    at org.apache.pdfbox.pdfparser.COSParser.parseXref(COSParser.java:303) 
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:194) 
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252) 
    at utils.PDFManager.PDFToText(PDFManager.java:280) 
    at processing.charge.CertificateUtils.getCertificateTypeFromFile(CertificateUtils.java:56) 
    at processing.charge.CertificateUtils.getCertificateType(CertificateUtils.java:49) 
    at processing.Controller.getDocumentType(Controller.java:110) 
    at processing.Controller.insertIntoDb(Controller.java:43) 
    at Test.main(Test.java:203) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147) 
Caused by: java.util.zip.DataFormatException: invalid distance too far back 
    at java.util.zip.Inflater.inflateBytes(Native Method) 
    at java.util.zip.Inflater.inflate(Inflater.java:259) 
    at java.util.zip.Inflater.inflate(Inflater.java:280) 
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107) 
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:64) 

Also, to verify your theory, I opened another file (one that works fine) in Sublime, and it has the same spaces, tabs, and CRs.

Working File
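A text editor may not show or preserve raw bytes faithfully, so one way to compare the two files is to read the leading bytes directly. The following is a minimal sketch, not part of PDFBox: the class `PdfHeaderProbe` and the method `headerOffset` are hypothetical names, and the 1024-byte search window is an arbitrary assumption. It reports how many bytes precede the `%PDF-` header.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a PDFBox class): locates the "%PDF-" header.
public class PdfHeaderProbe {

    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the first "%PDF-" within the first `limit`
     * bytes of the stream, or -1 if the marker is not found there.
     */
    public static int headerOffset(InputStream in, int limit) throws IOException {
        byte[] head = new byte[limit];
        int len = 0;
        int n;
        // Fill the buffer until it is full or the stream ends.
        while (len < limit && (n = in.read(head, len, limit - len)) != -1) {
            len += n;
        }
        // Naive byte-wise search for the marker.
        for (int i = 0; i + MARKER.length <= len; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (head[i + j] != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for a clean file and a file with leading CR/LF/tab.
        byte[] clean = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] padded = "\r\n\t%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(headerOffset(new ByteArrayInputStream(clean), 1024));  // prints 0
        System.out.println(headerOffset(new ByteArrayInputStream(padded), 1024)); // prints 3
    }
}
```

For a real file, the stream could come from `java.nio.file.Files.newInputStream(path)`. An offset greater than 0 would mean that extra bytes sit in front of the header.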

+2

I removed the itext tag because the question is not about iText. That makes your comment superfluous, @bruno.lowagie. :) –

+0

@TilmanHausherr I have added links to the PDFs. Please check them. Thanks – Mayank

+1

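Both files are invalid. They can be opened with PDFBox after removing the whitespace characters (CRs and tabs) at the beginning with NOTEPAD++. Did you receive the files like this, or is this the fault of a broken web server? I opened an issue with your files: https://issues.apache.org/jira/browse/PDFBOX-3714 –

Answer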

2

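As discussed in the comments, the files have whitespace (CRs and tabs) before the start of the PDF header. You can remove it with NOTEPAD++ (or any editor that can edit binary files), or, if all your files have this defect, write a short piece of code that opens an input stream, swallows bytes until it hits "%", and then copies everything from there on to an output stream.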
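The "swallow bytes until `%`" idea can be sketched roughly as follows. This is an illustration using only the Java standard library, not PDFBox code: `PdfHeaderTrimmer` and `stripLeadingJunk` are hypothetical names, and the sketch assumes the first `%` in the stream really is the start of the `%PDF-` header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a PDFBox class): drops bytes in front of the first '%'.
public class PdfHeaderTrimmer {

    /**
     * Copies `in` to `out`, skipping every byte that precedes the first '%'.
     * Returns the number of skipped bytes. If the stream contains no '%',
     * nothing is written.
     */
    public static long stripLeadingJunk(InputStream in, OutputStream out) throws IOException {
        long skipped = 0;
        int b;
        // Swallow bytes until the first '%' or the end of the stream.
        while ((b = in.read()) != -1 && b != '%') {
            skipped++;
        }
        if (b == -1) {
            return skipped; // no '%' found, so there is nothing to copy
        }
        out.write(b); // keep the '%' itself
        // Copy the rest of the stream unchanged.
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file with CR, LF and two tabs before the header.
        byte[] broken = "\r\n\t\t%PDF-1.7\nrest of file".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream fixed = new ByteArrayOutputStream();
        long skipped = stripLeadingJunk(new ByteArrayInputStream(broken), fixed);
        System.out.println("skipped " + skipped + " bytes"); // prints "skipped 4 bytes"
        System.out.println(fixed.toString("ISO-8859-1").startsWith("%PDF-")); // prints "true"
    }
}
```

For real files, the streams could come from `java.nio.file.Files.newInputStream` and `Files.newOutputStream`, and the cleaned copy could then be passed to `PDDocument.load`. Working on raw bytes this way avoids any re-encoding that a text editor might apply when saving a binary file.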

I also opened the issue PDFBOX-3714.

Update: This has been fixed in 2.0.5, which is now available.

+0

I added an idea to [PDFBOX-3714](https://issues.apache.org/jira/browse/PDFBOX-3714). – mkl

+0

Thanks. This is exactly what I was looking for. It looks like this bug was introduced in 2.0.4 and fixed in 2.0.5. – VHS