2010-10-04 107 views
1

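Guys, I already posted a question earlier about the pyPdf Python tool. Please don't mark this one as a duplicate, because here I get the UnicodeEncodeError shown below when reading a PDF file with pyPdf.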

import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]))
f.close()

# or print contents to the standard out stream
print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")

To be specific, I get this error with the first PDF file: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). With this PDF, http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf, I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)

How can I resolve this?

+0

Are you sure you executed exactly the code above? `u"\xe7".encode("ascii", "xmlcharrefreplace")` correctly returns "&#231;". With "xmlcharrefreplace", encoding should not fail for valid Unicode characters. – AndiDog 2010-10-04 15:48:34

Answers

2

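I tried it myself and got the same result. Ignore my comment; I had not noticed that you are also writing the output to a file. That is where the problem is:

f.write(convertPdf2String(sys.argv[1]))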

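convertPdf2String returns a Unicode string, but file.write can only write bytes, so the call to f.write tries to convert the Unicode string automatically using the ASCII encoding. Since the PDF obviously contains non-ASCII characters, that conversion fails. So it should be something like this: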

f.write(convertPdf2String(sys.argv[1]).encode("utf-8")) 
# or 
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace")) 
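
The difference between the two encodings can be seen on a short string. This is only a minimal sketch, assuming Python 3 (the code in this thread is Python 2, where `unicode` plays the role that `str` plays here). The sample text and variable names are illustrative and not taken from the PDF.

```python
# Assumption: Python 3, where str is Unicode text and bytes is raw data.
# Illustrative sample (a Latin word plus a Hindi word), not text from the PDF.
text = "fran\xe7ais \u0939\u093f\u0928\u094d\u0926\u0940"

# Strict ASCII encoding raises on the first non-ASCII character,
# which is the same failure the implicit conversion in f.write runs into.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# UTF-8 can represent this text directly, so encoding succeeds and round-trips.
utf8_bytes = text.encode("utf-8")

# xmlcharrefreplace stays within ASCII by emitting &#NNN; references instead.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

print(strict_ok)
print(utf8_bytes.decode("utf-8") == text)
print(ascii_bytes)
```

UTF-8 keeps the text readable in any editor that understands the encoding, while the character-reference form is pure ASCII but has to be unescaped before it reads as text again.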

Edit:

Working source code, with only one line changed.

# Execute with "Hindi_Book.pdf" in the same directory 
import sys 
import pyPdf 

def convertPdf2String(path): 
    content = "" 
    # load PDF file 
    pdf = pyPdf.PdfFileReader(file(path, "rb")) 
    # iterate pages 
    for i in range(0, pdf.getNumPages()): 
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces 
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split()) 
    return content 

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+') 
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace")) 
f.close() 

# or print contents to the standard out stream 
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace") 
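
An alternative to calling `.encode()` at every write is to open the output file with an explicit encoding, so that the file object performs the conversion. The following is a rough sketch rather than part of the tested answer, assuming `io.open` (available in Python 2.6+ and Python 3). `write_text`, the sample string, and the temporary path are hypothetical names used only for illustration.

```python
import io
import os
import tempfile

def write_text(path, text, encoding="utf-8"):
    """Hypothetical helper: write Unicode text, letting the file object encode it."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)

# Illustrative stand-in for the Unicode string returned by convertPdf2String.
sample = u"fran\xe7ais"

out_path = os.path.join(tempfile.mkdtemp(), "a.txt")
write_text(out_path, sample)

# Read the raw bytes back to confirm the text was stored as UTF-8.
with io.open(out_path, "rb") as f:
    raw = f.read()
print(raw == sample.encode("utf-8"))
```

With this approach the Unicode string is passed to `write` unchanged, and the choice of encoding is made once, where the file is opened.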
+0

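@AndiDog: I tried both of them initially and could not get them to work. My original goal is just to read the PDF content from the command line, and I don't want to use xpdf for that. – Hulk 2010-10-04 19:03:54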

+0

@Hulk: I tested what I wrote in my answer on the same PDF file. Are you saying it doesn't work for you? – AndiDog 2010-10-04 20:13:51

+0

@AndiDog: It is still the same error. I tried with both statements. – Hulk 2010-10-05 05:18:08