Python 3: parsing a PDF from the web

I am trying to fetch a PDF from a web page, parse it, and print the result to the screen using PyPDF2. I got it working without problems using the following code:

import requests
import PyPDF2

# buildurl, jornal, date and page come from the rest of the script (not shown here)
with open("foo.pdf", "wb") as f:
    f.write(requests.get(buildurl(jornal, date, page)).content)
pdfFileObj = open('foo.pdf', "rb")
pdf_reader = PyPDF2.PdfFileReader(pdfFileObj)
page_obj = pdf_reader.getPage(0)
print(page_obj.extractText())

Writing to a file just so I can read it back seems wasteful, though, so I figured I would simply cut out the middleman:

pdf_reader = PyPDF2.PdfFileReader(requests.get(buildurl(jornal, date, page)).content) 
page_obj = pdf_reader.getPage(0) 
print(page_obj.extractText())

However, this gives me an AttributeError: 'bytes' object has no attribute 'seek'. How can I feed the PDF from requests directly into PyPDF2?

2016-07-30 Bernardo Meurer

You have to wrap the returned content in a file-like object:

import io 

pdf_content = io.BytesIO(requests.get(buildurl(jornal, date, page)).content) 
pdf_reader = PyPDF2.PdfFileReader(pdf_content)
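
To illustrate why the wrapper helps, here is a minimal sketch that does not need PyPDF2 or a network connection. The payload `raw` is a made-up placeholder standing in for `requests.get(...).content`; it is not a valid PDF, so it only demonstrates the file-like interface, not actual parsing.

```python
import io

# Hypothetical placeholder for requests.get(...).content (not a valid PDF).
raw = b"%PDF-1.4 placeholder"

# A bytes object has no file API, which is what triggers the AttributeError.
bytes_has_seek = hasattr(raw, "seek")

# io.BytesIO exposes the same bytes as an in-memory binary stream.
stream = io.BytesIO(raw)
stream_has_seek = hasattr(stream, "seek")

header = stream.read(8)  # read the first 8 bytes
stream.seek(0)           # rewind so a parser would start from the beginning

print(bytes_has_seek, stream_has_seek, header)
# prints: False True b'%PDF-1.4'
```

A reader that expects a file object only needs methods such as `read`, `seek` and `tell`, so an in-memory stream can stand in for the temporary file.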

2016-07-30 21:03:11

Use io to fake a file (in Python 3):

import io 

output = io.BytesIO() 
output.write(requests.get(buildurl(jornal, date, page)).content) 
output.seek(0) 
pdf_reader = PyPDF2.PdfFileReader(output)

I have not tested this in your environment, but I did test this simple example and it works:

import io 

output = io.BytesIO() 
output.write(bytes("hello world","ascii")) 
output.seek(0) 
print(output.read())

Output:

b'hello world'
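
As a side note, the write-then-seek approach and passing the bytes straight to the `io.BytesIO` constructor should give the same result. The sketch below compares the two using the same sample bytes as above; the variable names are only for illustration.

```python
import io

data = bytes("hello world", "ascii")

# Variant 1: pass the bytes to the constructor; the position starts at 0.
direct = io.BytesIO(data)

# Variant 2: write into an empty buffer; the position ends up after the data.
written = io.BytesIO()
written.write(data)
before_rewind = written.read()  # nothing left to read at the end of the buffer
written.seek(0)                 # rewind to the start
after_rewind = written.read()

print(before_rewind, after_rewind, direct.read())
# prints: b'' b'hello world' b'hello world'
```

The `seek(0)` call matters only in the second variant, because writing leaves the position at the end of the buffer.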

2016-07-30 21:00:22

Sorry, I forgot to mention that I need this to be Python 3 compatible.
