PDF to text file conversion

I have a main folder with 3000 subfolders, each containing 2 PDFs. I wrote the code below to convert the PDFs to text files.

all.subfolders <- list.dirs("#path to main folder", full.names = TRUE)

sapply(all.subfolders[-1], function(x) {
  file <- list.files(x, full.names = TRUE)
  lapply(file, function(x) system(paste('"C:\\Program Files (x86)\\xpdfbin-win-3.03\\bin64\\pdftotext.exe"', paste0('"', x, '"')), wait = FALSE))
})

However, a few of the PDFs could not be converted to text. How can I get those files in a list or something similar? Please help.

Source

2017-10-09 Jain Arihant

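Why can't these PDFs be converted? Did you get an error message? Perhaps these PDF files do not contain any text?

They do contain text, but I think the PDFs are scanned documents, which is why they cannot be converted. I didn't get any error message. After running the command, I found the converted files in their respective folders.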

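My reputation is not high enough to comment, so please excuse me for posting this as an answer even though it is not really one. Your PDF files may be protected so that text cannot be extracted. Open the documents in a PDF viewer and try to copy text from them; this may fail because of the protection. If you have permission to extract and process the text, you could consider converting the files to images (e.g., via ImageMagick) and applying OCR to the images (e.g., via tesseract). To get started, you might refer to, for example, the following script: https://gist.github.com/benmarwick/11333467.

In response to your comment about how to identify the files that have not been converted, you could use the following approach. I hope this is what you were looking for.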
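As a rough sketch of the image-plus-OCR route, the helper below only builds the two command lines and does not run anything. It assumes that ImageMagick's `convert` (which needs Ghostscript to read PDFs) and the `tesseract` command-line tool are installed and on the PATH. The function name `ocr_commands` and the 300 dpi density are illustrative choices, not taken from the linked gist.

```r
# Build the shell commands for an image-based OCR fallback.
# Assumptions: ImageMagick ("convert") and tesseract are on the PATH;
# ocr_commands() is a hypothetical helper name used for illustration.
ocr_commands <- function(pdf, density = 300) {
  base <- sub("\\.pdf$", "", pdf, ignore.case = TRUE)
  tiff <- paste0(base, ".tiff")
  c(
    # Rasterise the PDF pages into a (multi-page) TIFF
    convert = paste("convert -density", density, shQuote(pdf), shQuote(tiff)),
    # OCR the TIFF; tesseract appends ".txt" to the output base name
    tesseract = paste("tesseract", shQuote(tiff), shQuote(base))
  )
}

cmds <- ocr_commands("scan.pdf")
# To actually run them (requires both tools to be installed):
# for (cmd in cmds) system(cmd)
```

On newer ImageMagick versions the executable is called `magick` instead of `convert`, so the first command may need adjusting.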

#retrieve all file paths
#note that you can use recursive = TRUE to avoid looping over directories yourself
allfiles <- list.files("C:/.../mydirectory", full.names = TRUE, recursive = TRUE)

#split filepaths into a set of pdf and txt files
#txt files will, of course, only be the files that have been converted
pdffiles <- allfiles[grep("\\.pdf$", allfiles)]
txtfiles <- allfiles[grep("\\.txt$", allfiles)]

#remove file ending (dot escaped and pattern anchored, so only the extension is stripped)
pdffiles <- sub("\\.pdf$", "", pdffiles)
txtfiles <- sub("\\.txt$", "", txtfiles)

#check which files have not been converted
notconverted <- setdiff(pdffiles, txtfiles)

#if needed, file ending can be added again
#e.g. for copying the unconverted files into a separate directory or so
#(guarded, because paste0() on an empty vector would return ".pdf")
if (length(notconverted) > 0) notconverted <- paste0(notconverted, ".pdf")
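To try this kind of check without touching the real data, here is a small self-contained sketch that runs in a temporary directory. All file and folder names are made up. It also lists the affected subfolders and flags `.txt` files that contain nothing but whitespace, on the assumption that pdftotext writes little more than page breaks for scanned, image-only pages; `has_no_text` is a hypothetical helper name.

```r
# Demo tree in a temporary directory (all names are made up)
root <- tempfile("pdf_demo")
dir.create(file.path(root, "sub1"), recursive = TRUE)
dir.create(file.path(root, "sub2"))
file.create(file.path(root, "sub1", c("a.pdf", "b.pdf")))
file.create(file.path(root, "sub2", "c.pdf"))
writeLines("some text", file.path(root, "sub1", "a.txt"))  # converted, has text
cat("\f", file = file.path(root, "sub2", "c.txt"))         # only a page break

allfiles <- list.files(root, full.names = TRUE, recursive = TRUE)
pdffiles <- grep("\\.pdf$", allfiles, value = TRUE)
txtfiles <- grep("\\.txt$", allfiles, value = TRUE)

# PDFs without a .txt file next to them
expected <- sub("\\.pdf$", ".txt", pdffiles)
notconverted <- pdffiles[!(expected %in% txtfiles)]

# Subfolders that contain at least one unconverted PDF
badfolders <- unique(dirname(notconverted))

# TRUE if a file holds no non-whitespace characters (assumed sign of a scan)
has_no_text <- function(f) {
  !any(grepl("[^[:space:]]", readLines(f, warn = FALSE), useBytes = TRUE))
}
emptytxt <- txtfiles[vapply(txtfiles, has_no_text, logical(1))]
```

Indexing `pdffiles` directly avoids having to strip and re-append the file ending. With 3000 subfolders, `badfolders` and `emptytxt` give two short lists to inspect instead of opening every folder by hand.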

Source

2017-10-09 14:14:20

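The immediate problem is finding the PDFs/subfolders that were not converted. I have 3000 subfolders, so I cannot check each folder manually.

Thank you very much for your time and the code. It solved the problem of finding the unconverted files. Thanks.

I'm glad I could help. Please consider accepting the answer to close this thread.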
