2017-06-01 96 views
1

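Converting a two-column text document to a single column for text mining

I used pdftools to extract the text from a PDF and saved the result as a txt file.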

Is there an efficient way to convert a two-column txt file into a single-column file?

Here is an example of what I have (first block) and what I want instead (second block):

Alice was beginning to get very  into the book her sister was reading, 
tired of sitting by her sister  but it had no pictures or conversations 
on the bank, and of having nothing in it, `and what is the use of a book,' 
to do: once or twice she had peeped thought Alice `without pictures or conversation?` 

Alice was beginning to get very tired of sitting by her sister on the bank, and 
of having nothing to do: once or twice she had peeped into the book her sister was 
reading, but it had no pictures or conversations in it, `and what is the use of a 
book,' thought Alice `without pictures or conversation?' 

Based on Extract Text from Two-Column PDF with R, I modified the function slightly to get:

library(readr)  
trim = function (x) gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", x, perl=TRUE) 

QTD_COLUMNS = 2 

read_text = function(text) {
    result = ''
    # Get the positions of every " " on the page.
    lstops = gregexpr(pattern = " ", text)
    # Keep the two most frequent space positions as candidate column stops.
    stops = as.integer(names(sort(table(unlist(lstops)), decreasing = TRUE)[1:2]))
    # Slice based on the specified number of columns (this can be improved).
    for (i in seq(1, QTD_COLUMNS, by = 1)) {
        temp_result = sapply(text, function(x) {
            start = 1
            stop = stops[i]
            if (i > 1)
                start = stops[i - 1] + 1
            if (i == QTD_COLUMNS) # last column: read until the end
                stop = nchar(x) + 1
            substr(x, start = start, stop = stop)
        }, USE.NAMES = FALSE)
        temp_result = trim(temp_result)
        result = append(result, temp_result)
    }
    result
}

txt = read_lines("alice_in_wonderland.txt") 

result = '' 

for (i in 1:length(txt)) { 
    page = txt[i] 
    t1 = unlist(strsplit(page, "\n"))  
    maxSize = max(nchar(t1)) 
    t1 = paste0(t1,strrep(" ", maxSize-nchar(t1))) 
    result = append(result,read_text(t1)) 
} 

result 

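However, this has not worked for some files. I would like to know whether there is a more general or better regular expression to achieve this result.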

Many thanks in advance!

+0

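I would be tempted to find a non-PDF option. If you want to use that particular story, there is a plain-text version here: http://www.gutenberg.org/files/11/11-0.txt. Otherwise, look for another PDF-to-text conversion tool that produces one-column output. – neilfws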

+1

Looks like a fixed-width file - if the two columns always have constant widths, `dat <- read.fwf(file, widths = c(37,48), stringsAsFactors = FALSE)` will give you a good start. – thelatemail

+1

It [saved my sanity](https://www.nu42.com/2014/09/scraping-pdf-documents-without-losing.html) to realize that `pdftohtml` has a very useful XML output mode. –

Answers

0

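With a fixed-width left column, we can split each line into its first 37 characters and the remainder, and append these to strings for the left and right columns. For example, using a regular expression: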

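use warnings;
use strict;

my $file = 'two_column.txt';
open my $fh, '<', $file or die "Can't open $file: $!";

# Accumulate the left and right columns separately.
my ($left_col, $right_col) = ('', '');

while (<$fh>)
{
    # The first 37 characters go to the left column, the rest to the right;
    # lines shorter than 37 characters do not match and are skipped.
    my ($left, $right) = /(.{37})(.*)/ or next;

    # Squeeze trailing whitespace on the left into a single space.
    $left =~ s/\s*$/ /;

    $left_col .= $left;
    $right_col .= $right;
}
close $fh;

print $left_col, $right_col, "\n";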

This prints the whole text. Alternatively, join the columns: my $text = $left_col . $right_col;

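(.{37}) matches any character (.) exactly 37 times ({37}) and captures the match with (); (.*) captures everything that remains. The regex returns both captures, which are assigned to $left and $right. Trailing whitespace in $left is squeezed into a single space. Both pieces are then appended (.=) to their column strings.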

Or on the command line:

perl -wne' 
    ($l, $r) = /(.{37})(.*)/; $l =~ s/\s*$/ /; $cL .= $l; $cR .= $r; 
    }{ print $cL,$cR,"\n" 
' two_column.txt 

where }{ effectively starts an END block, which runs just before exit, after all lines have been processed.
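Since the question is about R, the same fixed-width split can be sketched there as well. This is only a minimal sketch: it assumes the left column is padded to exactly 37 characters, and the helper name `merge_columns` and the re-padded sample lines are made up for illustration.

```r
# Hypothetical helper (not from the answer): cut every line at a fixed
# width, then read the left column top to bottom followed by the right one.
merge_columns <- function(lines, left_width = 37) {
  left  <- trimws(substr(lines, 1, left_width))
  right <- trimws(substring(lines, left_width + 1))
  parts <- c(left, right)
  # Drop empty pieces (e.g. lines with no right column) before joining.
  paste(parts[nzchar(parts)], collapse = " ")
}

# Made-up sample: the question's text, re-padded so the left column is 37 wide.
left_txt <- c("Alice was beginning to get very",
              "tired of sitting by her sister",
              "on the bank, and of having nothing",
              "to do: once or twice she had peeped")
right_txt <- c("into the book her sister was reading,",
               "but it had no pictures or conversations",
               "in it, `and what is the use of a book,'",
               "thought Alice `without pictures or conversation?'")
lines <- paste0(left_txt, strrep(" ", 37 - nchar(left_txt)), right_txt)

merged <- merge_columns(lines)
cat(merged, "\n")
```

Unlike the regex, `substr()` does not require a line to be at least 37 characters long, so short lines are kept rather than skipped.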

+0

@pachamaltese I edited a bit for clarity and added a few statements. – zdim

0

This looks like a fixed-width file, provided the two columns always have constant widths:

dat <- read.fwf(textConnection(txt), widths=c(37,48), stringsAsFactors=FALSE) 
gsub("\\s+", " ", paste(unlist(dat), collapse=" ")) 

This puts everything into one long string:

[1] "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?"
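When the column widths are not known in advance, one possible approach (not part of the answer above) is to look for the gutter: the widest run of character positions that are blank on every line of a page. The sketch below assumes each page is processed separately, that no heading or footer spans both columns, and that the page has enough lines for the gutter to stand out; `find_gutter` and the sample lines are hypothetical.

```r
# Hypothetical helper: return the last position of the widest run of
# character positions that are blank on every line (the "gutter").
find_gutter <- function(lines) {
  lines  <- sub("\\s+$", "", lines)                   # ignore trailing blanks
  width  <- max(nchar(lines))
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))
  chars  <- do.call(rbind, strsplit(padded, ""))      # one row per line
  blank  <- colSums(chars != " ") == 0                # blank on all lines?
  runs   <- rle(blank)
  ends   <- cumsum(runs$lengths)
  is_gap <- which(runs$values)
  if (length(is_gap) == 0) return(NA_integer_)        # no shared blank column
  ends[is_gap[which.max(runs$lengths[is_gap])]]
}

# Made-up sample: the question's text with the left column padded to 37.
left_txt <- c("Alice was beginning to get very",
              "tired of sitting by her sister",
              "on the bank, and of having nothing",
              "to do: once or twice she had peeped")
right_txt <- c("into the book her sister was reading,",
               "but it had no pictures or conversations",
               "in it, `and what is the use of a book,'",
               "thought Alice `without pictures or conversation?'")
lines <- paste0(left_txt, strrep(" ", 37 - nchar(left_txt)), right_txt)

gutter <- find_gutter(lines)                          # 37 for this sample
pieces <- trimws(c(substr(lines, 1, gutter), substring(lines, gutter + 1)))
text   <- paste(pieces[nzchar(pieces)], collapse = " ")
cat(text, "\n")
```

The detected position could equally be passed to `read.fwf()` as the first width, although the right-hand width would still need to cover the longest line to avoid truncation.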