I want to analyze a big (n = 500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing the stopwords, as this would result in many useless bigrams; in another case, I have to preprocess the text with language-specific procedures.

I would like this sequence to be implemented when creating a dfm step by step with quanteda:
1) remove the punctuation and the numbers
2) remove the stopwords (i.e. before the tokenization, to avoid useless tokens)
3) tokenize using unigrams and bigrams
4) create the dfm

My attempt:
> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))
> class(text.corpus)
[1] "corpus" "list"
> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") :
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"
# This is how I would theoretically continue:
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))
Bonus question: how do I remove sparse tokens in quanteda? (i.e. the equivalent of removeSparseTerms() in tm.)
UPDATE In light of @Ken's answer, here is the code to proceed step by step with quanteda:
library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’
1) Remove custom punctuation and numbers. E.g., notice the "\n" in the ie2010 corpus:
text.corpus <- ie2010Corpus
texts(text.corpus)[1] # use texts() to extract the text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is
texts(text.corpus)[1] <- gsub("\\s", " ", text.corpus[1]) # replace all whitespace characters (incl. \n, \t, \r...) with spaces
texts(text.corpus)[1]
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e
A further note on why one may prefer to preprocess: my current corpus is in Italian, a language that connects articles to words with an apostrophe. Thus, a straight dfm() can lead to inexact tokenization. E.g.:

broken.tokens <- dfm(corpus(c("L'abile presidente Renzi. Un'abile mossa di Berlusconi"), removePunct=TRUE))

will produce two separate tokens for the same word ("un'abile" and "l'abile"), hence the need for an additional step with gsub() here.
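One hedged way to do that gsub() step (my own sketch, not part of the original post; the bare-apostrophe replacement is an assumption about what is acceptable for this corpus) is to replace the apostrophes with spaces before building the dfm:

```r
library(quanteda)

# Sketch: split elided Italian articles on the apostrophe before dfm(),
# so that "L'abile" and "Un'abile" both yield the token "abile".
txt <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"
txt <- gsub("'", " ", txt)  # "L abile presidente Renzi. Un abile mossa ..."
fixed.tokens <- dfm(corpus(txt), removePunct = TRUE)
```

The leftover articles "l" and "un" then still have to be removed as stopwords, which is the subject of point 2) below.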
2) In quanteda it is not possible to remove the stopwords directly in the text before the tokenization. In my previous example, "l" and "un" have to be removed so as not to produce misleading bigrams. This can be handled in tm with tm_map(..., removeWords).
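A possible workaround (my own sketch, not an official quanteda API; the regex is an assumption tailored to these two articles) is to strip the elided articles from the raw text with gsub() before tokenizing, mimicking what tm_map(..., removeWords) would do:

```r
# Remove the elided articles "l'" and "un'" (case-insensitively) from the raw
# text before tokenization, so no misleading bigrams are produced.
txt <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"
txt <- gsub("(?i)\\b(l|un)'", "", txt, perl = TRUE)
# txt is now "abile presidente Renzi. abile mossa di Berlusconi"
```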
3) Tokenization:
token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)
4) Create the dfm:
dfm <- dfm(token)
5) Remove sparse features:
dfm <- trim(dfm, minCount = 5)
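Note that trim() belongs to the old 0.9.x API shown above; in more recent quanteda releases the same sparse-feature removal is done with dfm_trim() (the exact parameter name has changed across versions, so treat this as a sketch):

```r
# Equivalent step in current quanteda: keep only features occurring >= 5 times.
dfm <- dfm_trim(dfm, min_termfreq = 5)
```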
To sum up the answer: one can proceed step by step in quanteda using the texts() function. – 000andy8484