2017-10-07 78 views
0

RStudio V1.0.153 or R: text mining to an Excel project

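This is going to be a long post, so I would appreciate anyone with the patience to read it through and offer suggestions. I am building a database of ~110 observations, and parts of it will require data that is unfortunately only available in PDF format. I am new to R, but thought I would take a crack at this. I would rather try this approach than manually enter data by going through hundreds of pages of PDFs.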

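Here is the source data in PDF format: PDF Pathology Report. The target Excel format looks like this: Sample Excel Format. Basically, my goal is to get the "meat" of this path report off the bone as easily as possible. I understand, though, that some cleanup will always be necessary!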

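So far, I have converted the PDF to PNG using an open-source website, then used the tesseract package, which returned a single character string that I assigned to the object "path". Then I used the tokenizers package: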

words <- tokenize_words(X, lowercase = TRUE) 

dput(words) 
c("appropriate", "controls", "specimen", "1", "2", "old", "liver", 
    "explant", "posit", "ve", "for", "malignancy", "hepatocellular", 
    "carcinoma", "see", "synoptic", "report", "below", "advanced", 
    "stage", "chronic", "liver", "disease", "fibrosis", "staging" 
) 

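I just don't know where to go from here. Perhaps functions from the tm package could be used to pull out phrases of interest, along with the 3-4 words following each phrase that describe it?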

Any advice would be greatly appreciated!

+1

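Have you already looked at the 'pdftools' package, which allows parsing PDFs into R? It is from rOpenSci, and there is a good overview of their tools for text analysis. See the blog post: https://ropensci.org/blog/blog/2017/06/13/ropensci_text_tools – cderv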

Answer

1

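I don't know of a specific tool, but what you describe, "pull out phrases of interest and the 3-4 words following the phrase", is fairly easy to do with regular expressions:
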
# words <- tokenize_words(X, lowercase = TRUE) 
words <- 
    c("appropriate", "controls", "specimen", "1", "2", "old", "liver", 
    "explant", "posit", "ve", "for", "malignancy", "hepatocellular", 
    "carcinoma", "see", "synoptic", "report", "below", "advanced", 
    "stage", "chronic", "liver", "disease", "fibrosis", "staging" 
) 


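f <- function(x, phrase, n_words = 3L, upto = NULL) {
    # collapse the tokens back into a single space-separated string
    x <- paste0(x, collapse = ' ')
    # pattern for one word plus any trailing whitespace
    word <- '\\b\\w+\\b\\s*'

    # capture either the words between `phrase` and `upto`, or at most
    # n_words words after `phrase`; the trailing `|.` matches every other
    # character so that gsub replaces it with nothing
    p <- if (!is.null(upto))
        sprintf('(?:%s)\\s*((%s)+)%s|.', phrase, word, upto)
    else sprintf('(?:%s)\\s*((%s){1,%s})|.', phrase, word, n_words)

    # keep only the captured words and strip surrounding whitespace
    trimws(gsub(p, '\\1', x))
}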

paste0(words, collapse = ' ') 
# "appropriate controls specimen 1 2 old liver explant posit ve for malignancy 
# hepatocellular carcinoma see synoptic report below advanced stage chronic 
# liver disease fibrosis staging" 

f(words, 'carcinoma') 
# [1] "see synoptic report" 

f(words, 'old liver', 10) 
# [1] "explant posit ve for malignancy hepatocellular carcinoma see synoptic report" 

f(words, 'old liver', upto = 'carcinoma') 
# [1] "explant posit ve for malignancy hepatocellular" 

where n_words is the number of words returned after phrase is matched; upto will basically return everything between phrase and upto
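If the end goal is one spreadsheet row per report, the same idea can be applied to several anchor phrases at once. The following is only a rough sketch and not part of the function above: it swaps the regular expression for plain token matching, and the helper name words_after, the column names, and the anchor phrases are all assumptions made up for illustration. It uses base R only and writes a CSV file, which Excel can open.

```r
# Hypothetical helper (an assumption, not the regex function above):
# return up to n_words tokens following the first occurrence of `phrase`.
words_after <- function(tokens, phrase, n_words = 3L) {
  key <- strsplit(phrase, "\\s+")[[1]]   # phrase split into its own tokens
  k <- length(key)
  if (length(tokens) < k) return(NA_character_)

  # positions where all tokens of the phrase match consecutively
  starts <- seq_len(length(tokens) - k + 1L)
  hit <- Filter(function(i) all(tokens[i:(i + k - 1L)] == key), starts)
  if (length(hit) == 0L) return(NA_character_)   # phrase not found

  from <- hit[1] + k
  if (from > length(tokens)) return("")          # phrase is the last thing
  to <- min(from + n_words - 1L, length(tokens))
  paste(tokens[from:to], collapse = " ")
}

words <-
  c("appropriate", "controls", "specimen", "1", "2", "old", "liver",
    "explant", "posit", "ve", "for", "malignancy", "hepatocellular",
    "carcinoma", "see", "synoptic", "report", "below", "advanced",
    "stage", "chronic", "liver", "disease", "fibrosis", "staging")

# Column names and anchor phrases are illustrative assumptions only
fields <- c(specimen = "old liver",
            after_carcinoma = "carcinoma",
            stage = "advanced stage")

# One row per report, one column per anchor phrase
report_row <- as.data.frame(
  lapply(fields, function(p) words_after(words, p)),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV; replace with a real path when using this
out_file <- tempfile(fileext = ".csv")
write.csv(report_row, out_file, row.names = FALSE)

words_after(words, "carcinoma")
# [1] "see synoptic report"
```

Looping this over all ~110 reports and combining the rows with rbind would give a table close to the Excel layout, although OCR artifacts such as "posit ve" would still need to be cleaned up by hand or with extra rules.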