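Convert a two-column text document to a single line for text mining
I used pdftools to extract text from a PDF and saved the result as a txt file.
Is there an efficient way to convert a two-column txt file into a single-column file?
Here is an example of what I have: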
Alice was beginning to get very      into the book her sister was reading,
tired of sitting by her sister       but it had no pictures or conversations
on the bank, and of having nothing   in it, `and what is the use of a book,'
to do: once or twice she had peeped  thought Alice `without pictures or conversation?'
instead of:
Alice was beginning to get very tired of sitting by her sister on the bank, and
of having nothing to do: once or twice she had peeped into the book her sister was
reading, but it had no pictures or conversations in it, `and what is the use of a
book,' thought Alice `without pictures or conversation?'
Based on "Extract Text from Two-Column PDF with R", I slightly modified the function to get:
library(readr)

# Collapse runs of whitespace and strip leading/trailing whitespace.
trim = function(x) gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", x, perl = TRUE)

QTD_COLUMNS = 2

read_text = function(text) {
  result = ''
  # Get the positions of every " " on the page.
  lstops = gregexpr(pattern = " ", text)
  # Keep the positions where a space occurs most frequently.
  stops = as.integer(names(sort(table(unlist(lstops)), decreasing = TRUE)[1:2]))
  # Slice based on the specified number of columns (this can be improved).
  for (i in seq(1, QTD_COLUMNS, by = 1)) {
    temp_result = sapply(text, function(x) {
      start = 1
      stop = stops[i]
      if (i > 1)
        start = stops[i - 1] + 1
      if (i == QTD_COLUMNS) # last column, read until end.
        stop = nchar(x) + 1
      substr(x, start = start, stop = stop)
    }, USE.NAMES = FALSE)
    temp_result = trim(temp_result)
    result = append(result, temp_result)
  }
  result
}

txt = read_lines("alice_in_wonderland.txt")
result = ''
for (i in 1:length(txt)) {
  page = txt[i]
  t1 = unlist(strsplit(page, "\n"))
  maxSize = max(nchar(t1))
  t1 = paste0(t1, strrep(" ", maxSize - nchar(t1)))
  result = append(result, read_text(t1))
}
result
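However, I had no luck with some files. I would like to know whether there is a more general or better regular expression to achieve this result.
Thanks a lot in advance!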
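I would be tempted to look for a non-PDF option. If you want to use that particular story, there is a plain-text version here: http://www.gutenberg.org/files/11/11-0.txt. Otherwise, look for another PDF-to-text conversion tool that converts to single-column output. – neilfws
Looks like a fixed-width file - if the two columns always have a constant width, `dat <- read.fwf(file, widths = c(37,48), stringsAsFactors = FALSE)` will give you a good start. – thelatemail
[Saved my sanity](https://www.nu42.com/2014/09/scraping-pdf-documents-without-losing.html) when I realized `pdftohtml` has a very useful XML output mode. –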
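The fixed-width idea from the comments can be sketched in base R as follows. This is only a minimal illustration, not a general solution: the sample lines are the excerpt from the question, and the left-column width of 37 characters is an assumption borrowed from the `read.fwf` suggestion. The names `left`, `right`, `page`, `width_left`, `col1`, `col2` and `single` are hypothetical. It uses plain `substr()` instead of `read.fwf`, which sidesteps any quote handling for the apostrophes in the text.

```r
# Hypothetical sample: the two-column excerpt from the question.
left <- c("Alice was beginning to get very",
          "tired of sitting by her sister",
          "on the bank, and of having nothing",
          "to do: once or twice she had peeped")
right <- c("into the book her sister was reading,",
           "but it had no pictures or conversations",
           "in it, `and what is the use of a book,'",
           "thought Alice `without pictures or conversation?'")

# Assumed width of the left column (not detected automatically).
width_left <- 37

# Rebuild the two-column page by padding the left column with spaces.
page <- paste0(sprintf(paste0("%-", width_left, "s"), left), right)

# Cut every line at the assumed column boundary and trim the padding.
col1 <- trimws(substr(page, 1, width_left))
col2 <- trimws(substr(page, width_left + 1, nchar(page)))

# Read the left column top to bottom, then the right column.
single <- paste(c(col1, col2), collapse = " ")
print(single)
```

With real files the boundary would have to be found per page, for example by looking for the character position that is a space on most lines, which is what the `read_text` function in the question attempts.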