RStudio or R: text mining to Excel project

RStudio V1.0.153

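This is going to be a long post, so I would appreciate anyone with the patience to read it through and offer suggestions. I am building a database of ~110 observations, and part of it requires data that is unfortunately only available in PDF format. I am new to R but thought I would take a stab at this. I would rather try it this way than manually enter data from hundreds of pages of the PDFs of interest.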

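The data in PDF format is here: "PDF Pathology Report", and the target Excel format is shown here: "Sample Excel Format". Basically, my goal is to get the "meat" of this pathology report out of the PDF as easily as possible. I do understand that some cleanup will always be necessary!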

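So far, I have converted the PDF to PNG using an open-source website and then used the tesseract package, which returned a single character string that I assigned to the object "path". Then I used the tokenizers package: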

library(tokenizers)
# X holds the OCR text string described above
words <- tokenize_words(X, lowercase = TRUE)

dput(words) 
c("appropriate", "controls", "specimen", "1", "2", "old", "liver", 
    "explant", "posit", "ve", "for", "malignancy", "hepatocellular", 
    "carcinoma", "see", "synoptic", "report", "below", "advanced", 
    "stage", "chronic", "liver", "disease", "fibrosis", "staging" 
)

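I am just not sure where to go from here. Perhaps this could be used to cull out phrases of interest, plus the 3-4 words following each phrase that hold the description of interest, using functions from the tm package?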

Any advice would be greatly appreciated!

Source

2017-10-07 MeeraWhy

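Have you looked at the 'pdftools' package, which lets you parse a PDF into R? It is from rOpenSci, and they have a good overview of tools for text analysis. See the blog post: https://ropensci.org/blog/blog/2017/06/13/ropensci_text_tools – cderv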
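To make the comment's suggestion concrete, here is a minimal sketch of reading a PDF's text layer directly with pdftools instead of going through PNG and OCR. It assumes the pdftools package is installed; the helper name `read_report_text` and the file name `pathology_report.pdf` are hypothetical. If the reports are scanned images with no text layer, this will likely return empty strings and an OCR step such as the tesseract one is still needed.

```r
# Hypothetical helper: read every page of a PDF and join the pages into one string.
# Assumes the pdftools package is installed.
read_report_text <- function(pdf_path) {
  pages <- pdftools::pdf_text(pdf_path)  # character vector, one element per page
  paste(pages, collapse = "\n")
}

# "pathology_report.pdf" is a placeholder file name, not a file from the question.
pdf_path <- "pathology_report.pdf"
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_path)) {
  report_text <- read_report_text(pdf_path)
  # report_text can now take the place of the OCR string before tokenizing
  print(substr(report_text, 1, 200))
}
```

The resulting string could then be tokenized the same way as the OCR output in the question.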

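I don't know of a specific tool for this, but what you describe, culling out phrases of interest and the 3-4 words following them, is easy to do with regular expressions: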

# words <- tokenize_words(X, lowercase = TRUE) 
words <- 
    c("appropriate", "controls", "specimen", "1", "2", "old", "liver", 
    "explant", "posit", "ve", "for", "malignancy", "hepatocellular", 
    "carcinoma", "see", "synoptic", "report", "below", "advanced", 
    "stage", "chronic", "liver", "disease", "fibrosis", "staging" 
) 


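# Return the words that follow `phrase` in the token vector `x`.
#   n_words: maximum number of words to return after the phrase
#   upto:    if given, return everything between `phrase` and `upto` instead
f <- function(x, phrase, n_words = 3L, upto = NULL) {
    x <- paste0(x, collapse = ' ')
    word <- '\\b\\w+\\b\\s*'  # one word plus any trailing whitespace

    # the trailing `|.` matches every other character so gsub() deletes it
    p <- if (!is.null(upto)) {
        sprintf('(?:%s)\\s*((%s)+)%s|.', phrase, word, upto)
    } else {
        sprintf('(?:%s)\\s*((%s){1,%s})|.', phrase, word, n_words)
    }

    trimws(gsub(p, '\\1', x))
}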

paste0(words, collapse = ' ') 
# "appropriate controls specimen 1 2 old liver explant posit ve for malignancy 
# hepatocellular carcinoma see synoptic report below advanced stage chronic 
# liver disease fibrosis staging" 

f(words, 'carcinoma') 
# [1] "see synoptic report" 

f(words, 'old liver', 10) 
# [1] "explant posit ve for malignancy hepatocellular carcinoma see synoptic report" 

f(words, 'old liver', upto = 'carcinoma') 
# [1] "explant posit ve for malignancy hepatocellular"

where n_words is the maximum number of words to return after phrase is matched, and upto returns everything between phrase and upto.
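Since the end goal is a spreadsheet, one possible next step is to call f() once per field and collect the results into a one-row data frame that Excel can open as CSV. This is only a sketch under assumptions: the column names, the choice of phrases, and the output file name are hypothetical, and f() is repeated from the answer above so the block runs on its own.

```r
# f() copied from the answer above so this sketch is self-contained.
f <- function(x, phrase, n_words = 3L, upto = NULL) {
  x <- paste0(x, collapse = ' ')
  word <- '\\b\\w+\\b\\s*'
  p <- if (!is.null(upto)) {
    sprintf('(?:%s)\\s*((%s)+)%s|.', phrase, word, upto)
  } else {
    sprintf('(?:%s)\\s*((%s){1,%s})|.', phrase, word, n_words)
  }
  trimws(gsub(p, '\\1', x))
}

words <- c("appropriate", "controls", "specimen", "1", "2", "old", "liver",
           "explant", "posit", "ve", "for", "malignancy", "hepatocellular",
           "carcinoma", "see", "synoptic", "report", "below", "advanced",
           "stage", "chronic", "liver", "disease", "fibrosis", "staging")

# Hypothetical mapping: spreadsheet column, phrase to search for, words to keep.
fields <- data.frame(
  column  = c("diagnosis_note", "fibrosis_note"),
  phrase  = c("carcinoma", "chronic liver disease"),
  n_words = c(3L, 2L),
  stringsAsFactors = FALSE
)

# Run f() once per row of the mapping.
fields$value <- mapply(function(p, n) f(words, p, n),
                       fields$phrase, fields$n_words, USE.NAMES = FALSE)

# One row per report, one column per extracted field.
report_row <- as.data.frame(as.list(setNames(fields$value, fields$column)),
                            stringsAsFactors = FALSE)

# Write a CSV that Excel can open; the temp-dir file name is a placeholder.
out_file <- file.path(tempdir(), "pathology_extract.csv")
write.csv(report_row, out_file, row.names = FALSE)
```

With ~110 reports, the same mapping could be applied to each report's token vector and the resulting rows combined with rbind() before writing.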

Source

2017-10-07 13:56:59 rawr
