Efficiently deriving a term co-occurrence matrix from Google Ngrams

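I need to use the lexical data from Google Books N-grams to construct a (sparse!) term co-occurrence matrix (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a range of lexical statistics and serve as input to vector semantics methods (GloVe, LSA, LDA).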

For reference, the Google Books (v2) dataset is formatted as follows (tab-separated):

ngram  year match_count volume_count 
some word 1999 32    12   # example bigram

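The problem, of course, is that this data is enormous. However, I only need a subset covering a few decades (about 20 years' worth of ngrams), and I am happy with a context window of up to 2 (i.e. using the trigram corpus). I have a few ideas, but none seems particularly good.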

-Idea 1- was initially more or less this:

# preprocessing (pseudo) 
for file in trigram-files: 
    download $file 
    filter $lines where 'year' tag matches one of years of interest 
    find the frequency of each of those ngrams (match_count) 
    cat those $lines * $match_count >> file2 
    # (write the same line x times according to the match_count tag) 
    remove $file 

# tcm construction (using R) 
grams <- readLines("file2") # read lines from file2 (one ngram per line)
library(text2vec) 
# treat lines (ngrams) as documents to avoid unrelated ngram overlap 
it   <- itoken(grams) 
vocab  <- create_vocabulary(it) 
vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2) 
tcm  <- create_tcm(it, vectorizer) # nice and sparse

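However, I have a hunch this might not be the best solution. The ngram data files already contain co-occurrence data in the form of n-grams, and there is a tag that gives the frequency. I have a feeling there should be a more direct way.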

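-Idea 2- I was also thinking of cat'ing each filtered ngram into the new file only once (instead of replicating it match_count times), then creating an empty tcm, and then looping over the whole (year-filtered) ngram dataset and recording the instances (using the match_count tag) where any two words co-occur, to populate the tcm. But the data is big, and this kind of loop would probably take ages.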
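The "write each ngram only once" step can be sketched in R with data.table: filter to the years of interest, then sum match_count per ngram so that each ngram is kept as a single row with its total frequency. This is only a minimal sketch; the toy rows, the `raw` table, and the `years_of_interest` window are made up for illustration and are not part of the original question.

```r
library(data.table)

# Toy rows in the Google Books (v2) layout: ngram, year, match_count, volume_count.
# The values are invented for illustration only.
raw <- data.table(
  ngram        = c("as we know", "as we know", "i know you", "as we know"),
  year         = c(1999L, 2000L, 1999L, 1950L),
  match_count  = c(32, 10, 54, 7),
  volume_count = c(12L, 5L, 20L, 3L)
)

years_of_interest <- 1990:2009  # hypothetical 20-year window

# Keep only the years of interest and collapse to one row per ngram,
# carrying the summed frequency instead of duplicating lines.
ngram_counts <- raw[year %in% years_of_interest,
                    .(match_count = sum(match_count)),
                    keyby = ngram]
print(ngram_counts)
```

For the real files, the toy table could presumably be replaced by something like `fread(path, sep = "\t", header = FALSE, quote = "", col.names = c("ngram", "year", "match_count", "volume_count"))`, processed one file at a time so that only the aggregated counts stay in memory.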

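-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given that most entries are 0s), and (if I understand it correctly) it simply loops through everything (and I assume a Python loop over this much data would be super slow), so it seems to be aimed at rather smaller subsets of the data.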

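Edit -Idea 4- I came across this old SO question asking about using Hadoop and Hive for a similar task, with a short answer containing a broken link and a comment about MapReduce (I am familiar with none of these, so I don't know where to start).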

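Still, I cannot be the first one who needs to tackle such a task, given the popularity of the Ngram dataset and of (non-word2vec) distributional semantics methods that operate on tcm or dtm input; hence: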

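...the question: what would be a more reasonable/efficient way to construct a term-term co-occurrence matrix from Google Books Ngram data? (Be it a variation of the proposed ideas or something completely different; R preferred but not required.)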

Source

2017-01-25 user3554004

Can you give an example of how you would count co-occurrences for tri-grams? What should it look like? –

Well, using the (possibly naive) ngrams-as-documents approach, it would be something like 'x <- list(c("this", "is", "example"), c("example", "it", "is")); it <- itoken(x); vocab <- create_vocabulary(it); vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2); tcm <- create_tcm(it, vectorizer); print(tcm)', but this feels like a roundabout process (books/documents -> ngrams -> import ngrams as documents -> create skip-grams from ngrams -> create_tcm), while the ngrams basically already state the co-occurrence, and the data gives the number of times any ngram occurred – user3554004

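I'll give you an idea of how this could be done. It can be improved in several places. I intentionally wrote it in "spaghetti style" so it is easier to follow, but it can be generalized to more than tri-grams.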

library(data.table)
library(magrittr) # provides %>%

ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words (one column per tri-gram, one row per position)
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = TRUE) %>% simplify2array()

# vocab here is the vocabulary from this chunk, but you may first want
# to create the vocabulary from the whole corpus of ngrams and filter out
# uninteresting/rare words

# as.vector() so that unique() deduplicates words rather than matrix rows
vocab = unique(as.vector(tokens_matrix))
# convert char matrix to integer matrix for faster downstream calculations
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)

ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]

dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note here 0.5 - discount for more distant word - we follow text2vec discount of 1/distance
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]

# bind by position: the column names differ between the three tables
dt = rbindlist(list(dt_12, dt_13, dt_23), use.names = FALSE)
# "reduce" by word indices again - sum pair co-occurrences which were in different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]

# giveCsparse = FALSE requests triplet format; newer Matrix versions prefer repr = "T"
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt, dims = rep(length(vocab), 2), index1 = TRUE,
        giveCsparse = FALSE, check = FALSE, dimnames = list(vocab, vocab))
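The answer mentions that the approach can be generalized beyond tri-grams. One possible way to do that, sketched here under the assumption that every ngram in a chunk has the same length n, is to enumerate all position pairs (i < j) and weight each pair by 1/(j - i), in line with the 1/distance discount used above. The names `tokens`, `ids`, `pos_pairs`, and `pairs` are illustrative and do not come from the original answer.

```r
library(data.table)
library(Matrix)

ngram_dt <- data.table(ngram = c("as we know", "i know you"),
                       match_count = c(32, 54))

# One row per ngram, one column per position (assumes a fixed n per chunk).
tokens <- do.call(rbind, strsplit(ngram_dt$ngram, " ", fixed = TRUE))
vocab  <- unique(as.vector(tokens))
ids    <- matrix(match(tokens, vocab), nrow = nrow(tokens))
n      <- ncol(ids)

# Every position pair (i < j); each contributes match_count / distance.
pos_pairs <- combn(n, 2)
pairs <- rbindlist(lapply(seq_len(ncol(pos_pairs)), function(k) {
  i <- pos_pairs[1, k]
  j <- pos_pairs[2, k]
  data.table(w1 = ids[, i], w2 = ids[, j],
             cnt = ngram_dt$match_count / (j - i))
}))

# Sum contributions of the same word pair coming from different ngrams.
pairs <- pairs[, .(cnt = sum(cnt)), keyby = .(w1, w2)]

tcm <- sparseMatrix(i = pairs$w1, j = pairs$w2, x = pairs$cnt,
                    dims = rep(length(vocab), 2),
                    dimnames = list(vocab, vocab))
print(tcm)
```

With n = 3 this produces the same three pair tables as the hand-written version (positions 1-2, 1-3, and 2-3), so it can be checked against that code on a small chunk before being run on real data.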

Source

2017-01-25 18:33:17
