Reading a PDF file line by line using Python: I am using the code below to read a PDF file, but it does not read anything. What could be the reason?

>>> import os 

>>> from PyPDF2 import PdfFileReader, PdfFileWriter 

>>> path = "/Users/Rahul/Desktop/Dfiles/" 

>>> dirs = os.listdir(path) 

>>> directory = "/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf" 

>>> f = open(directory, 'rb') 

>>> reader = PdfFileReader(f) 

>>> contents = reader.getPage(0).extractText().split('\n') 

>>> f.close() 

>>> print contents

The output is [u''] instead of the file's contents.


2017-07-08 Rahul Pipalia

Does it work for page numbers other than 0? Are you sure the PDF contains text, and not just images or graphics? – mkrieger1
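Both checks suggested in this comment can be scripted. The following is a minimal sketch, not code from the thread: `pages_with_text` and `extract_page_texts` are hypothetical helper names, and the second one assumes the legacy PyPDF2 API (versions before 3.0) that the question uses. If no page index is reported, the PDF most likely holds scanned images, and text extraction alone will not help.

```python
def pages_with_text(page_texts):
    """Return the indices of pages whose extracted text is not blank."""
    return [i for i, text in enumerate(page_texts) if text and text.strip()]


def extract_page_texts(pdf_path):
    """Collect extractText() output for every page (legacy PyPDF2 < 3.0 API)."""
    # Imported lazily; assumes PyPDF2 < 3.0 is installed.
    from PyPDF2 import PdfFileReader

    with open(pdf_path, "rb") as f:
        reader = PdfFileReader(f)
        return [reader.getPage(i).extractText()
                for i in range(reader.getNumPages())]


# Example with stand-in strings instead of a real PDF:
sample = [u"", u"  \n", u"Invoice 106\nTotal: 34"]
print(pages_with_text(sample))  # [2]
```

With a real file, `pages_with_text(extract_page_texts(path))` would show whether any page other than 0 yields text.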

Maybe this can help you read the PDF.

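import pyPdf

def getPDFContent(path):
    """Return the text of all pages as one whitespace-normalised string."""
    content = ""
    # open() replaces the Python 2-only file() builtin
    with open(path, "rb") as p:
        pdf_content = pyPdf.PdfFileReader(p)
        # Loop over the actual page count instead of a hard-coded 10
        for i in range(pdf_content.getNumPages()):
            content += pdf_content.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse all runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content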
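Note that the `" ".join(...)` step in this function collapses every run of whitespace, including newlines, so the result is one long line. If the goal is to read the text line by line, as the question asks, one option is to clean each line separately. A minimal sketch, with `clean_lines` as a hypothetical helper name that is not part of pyPdf:

```python
def clean_lines(text):
    """Split extracted text into lines, normalising whitespace within each line."""
    lines = []
    for raw in text.replace(u"\xa0", u" ").splitlines():
        line = " ".join(raw.split())  # collapse runs of spaces and tabs
        if line:                      # skip lines that are empty after cleaning
            lines.append(line)
    return lines


print(clean_lines(u"Name:\xa0 Rahul\n\n  Page   1 \n"))
# ['Name: Rahul', 'Page 1']
```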


2017-07-08 04:16:20


Hello Rahul Pipalia,

If PyPDF2 is not installed in your Python environment, install it first and then use the module.

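Installation steps on Ubuntu (installing python-pypdf)

First, open a terminal.
Then type sudo apt-get install python-pypdf
If import PyPDF2 still fails afterwards, pip install PyPDF2 installs the package directly.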
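After installing, it may be worth confirming which PDF module the interpreter can actually see. A small sketch for Python 3 using only the standard library; `is_installed` is a hypothetical helper name:

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


for name in ("PyPDF2", "pyPdf"):
    print(name, "installed" if is_installed(name) else "missing")
```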

Solution to your problem

Try the code below:

# Import the library (legacy PyPDF2 < 3.0 API)
import PyPDF2

# Name of the PDF file to read, including the ".pdf" extension.
# PDFs are binary files, so open the file in 'rb' mode.
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Index of the page to read. Pages are numbered from 0,
# so valid values are 0 .. number_of_pages - 1.
page_number = 0
page = read_pdf.getPage(page_number)

page_content = page.extractText()

# Display the content of the page
print(page_content)

pdf_file.close()

Download the sample PDF from the link below and try this code: https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1

I hope my answer is helpful.
If you have any questions, please leave a comment.


2017-07-08 04:35:01

Hello Rahul Pipalia ... –

If my answer helped, please accept it.. –

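I think you need to specify the drive letter; it is missing from your path. For example, "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried it, and I could read the file without any problem.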
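Whether the path is the problem can be checked before PyPDF2 is involved at all. A minimal sketch using only the standard library; `describe_path` is a hypothetical helper, and the path at the bottom is the one from the question:

```python
import os


def describe_path(path):
    """Report whether a path exists, is a regular file, and is non-empty."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a file"
    if os.path.getsize(path) == 0:
        return "empty file"
    return "ok"


print(describe_path("/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf"))
```

A wrong path would normally make `open()` raise an error rather than produce `[u'']`, so an "ok" result here points back to the content of the PDF itself.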

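Alternatively, if you do not know exactly which directory the file is in and want to find its path with the os module, you can try the following: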

from PyPDF2 import PdfFileReader 
import os 

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

directory = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/') 

f = open(directory, 'rb') 

reader = PdfFileReader(f) 

contents = reader.getPage(0).extractText().split('\n') 

f.close() 

print(contents)

The find function comes from Nadia Alramli's answer to "Find a file in python".
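One caveat with this approach: `find` returns `None` when no file matches, and `open(None, 'rb')` then fails with a `TypeError` that does not mention the missing file. A sketch of a variant that reports the problem directly; `find_or_fail` is a hypothetical name, not part of the linked answer:

```python
import os


def find_or_fail(name, path):
    """Return the full path of the first file called `name` under `path`."""
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)
    # Nothing matched: raise instead of silently returning None
    raise IOError("%s not found under %s" % (name, path))
```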


2017-10-03 17:04:54 Ahaha

import re
import PyPDF2

pdfFileObj = open('E://drive-download-20171015T225604Z-001/test_case/test2/try/xyz.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print("Number of pages:-" + str(pdfReader.numPages))
num = pdfReader.numPages
i = 0
while i < num:
    pageObj = pdfReader.getPage(i)
    text = pageObj.extractText()
    text1 = text.lower()
    # Iterate over lines; looping over the string itself yields single characters
    for line in text1.splitlines():
        if re.search("abc", line):
            print(line)
    i = i + 1
pdfFileObj.close()

I use this to go through a PDF page by page, search each page for key terms, and process the matches further.
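The search step can be separated from the PDF reading so that it is easy to try out on plain strings. A minimal sketch: `find_matching_lines` is a hypothetical helper that takes a list of page texts (for example, one `extractText()` result per page) and returns the page index, line index, and text of every line containing the keyword, ignoring case.

```python
import re


def find_matching_lines(page_texts, keyword):
    """Return (page_index, line_index, line) for each line containing keyword."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    matches = []
    for page_index, text in enumerate(page_texts):
        for line_index, line in enumerate(text.splitlines()):
            if pattern.search(line):
                matches.append((page_index, line_index, line))
    return matches


pages = ["Intro\nThe ABC method", "nothing here", "abc again\nend"]
for match in find_matching_lines(pages, "abc"):
    print(match)
# (0, 1, 'The ABC method')
# (2, 0, 'abc again')
```

Matching case-insensitively keeps the original capitalisation in the output, which lowercasing the whole page first would lose.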


2018-01-23 12:47:56
