Most efficient way to index words in a document?

This came up in another question, but I thought it best to ask it separately. Given a large list of sentences (on the order of 100 thousand):

[ 
"This is sentence 1 as an example", 
"This is sentence 1 as another example", 
"This is sentence 2", 
"This is sentence 3 as another example ", 
"This is sentence 4" 
]

What is the best way to write the following function?

def GetSentences(word1, word2, position): 
    return ""

Given two words, word1 and word2, and a position, the function should return the list of all sentences that satisfy that constraint. For example:

GetSentences("sentence", "another", 3)

should return sentences 1 and 3, as sentence indices. My current approach uses a dictionary like this:

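from collections import defaultdict

# word -> word2 -> distance key -> list of sentence indices
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # Pair each word with itself and with every word after it
        for i, word2 in enumerate(words[index:]):
            Index[word][word2][i+1].append(sentenceIndex)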

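But this quickly blows everything out of proportion: for a dataset of 130 MB, my 48 GB of RAM is exhausted in less than 5 minutes. I somehow get the feeling that this is a common problem, but I can't find any references on how to solve it efficiently. Any suggestions on how to approach this?

Source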

2011-11-05 Legend

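Just to clarify: is 'position' the distance between the two words in the sentence? – misha

@misha: Yes, that's correct. – Legend

Having two "sentence 1" entries is confusing. Does it match the second "1" and not the first? – shookster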

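Use a database to store the values.

First, add all the sentences to a table (they should have IDs). You could call it, e.g., sentences.
Second, create a table of the words contained in all the sentences (call it, e.g., words, and give every word an ID), and store the connections between sentence records and word records in a separate table (call it, e.g., sentences_words; it should have two columns, preferably word_id and sentence_id).
When searching for sentences that contain all of the given words, your job is then simplified:
1. First, find the records in the words table whose word is exactly one of those you are looking for. The query could look like this: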
```
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3'); 
```
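2. Second, find the sentence_id values in the sentences_words table that have the required word_id values (those corresponding to the words from the words table). The initial query could look like this: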
```
SELECT `sentence_id`, `word_id` FROM `sentences_words` 
WHERE `word_id` IN ([here goes list of words' ids]); 
```
  This can be simplified to:
```
SELECT `sentence_id`, `word_id` FROM `sentences_words` 
WHERE `word_id` IN (
    SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3') 
); 
```
3. Filter the result within Python so that you return only the sentence_id values that have all the required word_id values.
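Step 3 could be sketched in Python roughly as follows. The `rows` list stands in for the result of the second query with made-up IDs, and the function name `sentences_with_all_words` is hypothetical, not part of the original answer.

```python
from collections import defaultdict

def sentences_with_all_words(rows, required_word_ids):
    """Return the sentence_ids whose rows cover every required word_id.

    rows: iterable of (sentence_id, word_id) pairs, e.g. as fetched
    from the sentences_words query.
    """
    required = set(required_word_ids)
    found = defaultdict(set)  # sentence_id -> required word_ids seen so far
    for sentence_id, word_id in rows:
        if word_id in required:
            found[sentence_id].add(word_id)
    return sorted(sid for sid, ids in found.items() if ids == required)

# Made-up rows: sentences 1 and 3 contain both words, sentence 2 only one.
rows = [(1, 10), (1, 20), (2, 10), (3, 20), (3, 10)]
print(sentences_with_all_words(rows, [10, 20]))  # prints [1, 3]
```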

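This is basically a solution based on storing a large amount of data in the form best suited to it: a database.

EDIT:

If you will only ever search for two words, you can do more (almost everything) on the DBMS side.
Since you also need the position difference, you should store each word's position in a third column of the sentences_words table (call it position), and when searching for the matching words, compute the difference of this value between the two words.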
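A minimal sketch of this schema using Python's built-in sqlite3 module might look like the following. The table and column names follow the answer; the in-memory database, the index name `idx_sw_word`, the `get_sentences` helper and the choice to require word2 to come after word1 are assumptions made for illustration.

```python
import sqlite3

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

conn = sqlite3.connect(":memory:")  # assumption: in-memory DB for the demo
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, sentence TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE sentences_words (
        sentence_id INTEGER, word_id INTEGER, position INTEGER
    );
    CREATE INDEX idx_sw_word
        ON sentences_words (word_id, sentence_id, position);
""")

for sentence_id, sentence in enumerate(sentences):
    conn.execute("INSERT INTO sentences VALUES (?, ?)", (sentence_id, sentence))
    for position, word in enumerate(sentence.split()):
        # Insert the word only if it is not in the words table yet.
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT INTO sentences_words VALUES (?, ?, ?)",
            (sentence_id, word_id, position),
        )

def get_sentences(word1, word2, position):
    """Sentence ids where word2 occurs `position` words after word1."""
    query = """
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b ON b.sentence_id = a.sentence_id
        JOIN words AS w1 ON w1.id = a.word_id
        JOIN words AS w2 ON w2.id = b.word_id
        WHERE w1.word = ? AND w2.word = ?
          AND b.position - a.position = ?
        ORDER BY a.sentence_id
    """
    return [row[0] for row in conn.execute(query, (word1, word2, position))]

print(get_sentences("sentence", "another", 3))  # prints [1, 3]
```

The self-join on sentences_words lets the database compute the position difference, so only the matching sentence ids come back to Python.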

Source

2011-11-05 01:20:02 Tadeck

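+1 Thank you very much for your time. I think I will go with this. I am considering SQLite for the moment, falling back to MySQL if that doesn't work out. – Legend

@Legend: Thanks. I believe that if the database won't be used by several users at the same time, SQLite is a very good fit. If only one user will use it, SQLite is the best option in my opinion, so I fully agree with your choice. – Tadeck

I came back to thank you once again. There is a lot of truth in the saying "use the right tool for the right job" :) The time to build the collocations has gone from X hours (X > 12, and it never finished because it ran out of memory!) to 1 hour now using SQLite, and it's not even heavy! – Legend

Here is how I did it in Python. If this needs to be done more than once, though, a DBMS is the right tool for the job. Still, this seems to work fine for me with a million rows.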

sentences = [ 
    "This is sentence 1 as an example", 
    "This is sentence 1 as another example", 
    "This is sentence 2", 
    "This is sentence 3 as another example ", 
    "This is sentence 4" 
    ] 

sentences = sentences * 200 * 1000 

sentencesProcessed = [] 

def preprocess(): 
    global sentences 
    global sentencesProcessed 
    # may want to do a regex split on whitespace 
    sentencesProcessed = [sentence.split(" ") for sentence in sentences] 

    # can deallocate sentences now 
    sentences = None 


def GetSentences(word1, word2, position):
    results = []
    for sentenceIndex, sentence in enumerate(sentencesProcessed):
        # word2 must sit `position` words after word1, so word1 can only
        # appear in the first len(sentence) - position slots
        for wordIndex, word in enumerate(sentence[:-position]):
            if word == word1 and sentence[wordIndex + position] == word2:
                results.append(sentenceIndex)
    return results

def main():
    preprocess()
    results = GetSentences("sentence", "another", 3)
    print("Got", len(results), "results")

if __name__ == "__main__": 
    main()

Source

2011-11-05 02:06:28 shookster

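+1 Thank you for this approach. I actually tested it and found it to be very fast for a one-off query. However, when I tried running multiple queries, the lookup time was too high, which is expected since there is no index. It's definitely an interesting approach, though. Thanks. – Legend

@Legend: Yes, it scans the whole dataset on every query. I just wanted to give it a try :-) – shookster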
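One way to avoid rescanning every sentence on each query while staying in memory is a positional inverted index: each word maps to the sentences and positions where it occurs, so memory grows with the number of word occurrences rather than with the number of word pairs. The following is only a sketch of that general technique, not something proposed in the thread; `build_index` and `get_sentences` are hypothetical names, and word2 is assumed to come after word1.

```python
from collections import defaultdict

def build_index(sentences):
    """Map each word to {sentence_index: set of positions}."""
    index = defaultdict(lambda: defaultdict(set))
    for sentence_index, sentence in enumerate(sentences):
        for position, word in enumerate(sentence.split()):
            index[word][sentence_index].add(position)
    return index

def get_sentences(index, word1, word2, position):
    """Indices of sentences where word2 occurs `position` words after word1."""
    first = index.get(word1, {})
    second = index.get(word2, {})
    results = []
    # Only sentences that contain both words need to be examined.
    for sentence_index in sorted(first.keys() & second.keys()):
        second_positions = second[sentence_index]
        if any(p + position in second_positions for p in first[sentence_index]):
            results.append(sentence_index)
    return results

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
index = build_index(sentences)
print(get_sentences(index, "sentence", "another", 3))  # prints [1, 3]
```

Python's per-object overhead is still considerable, so for a dataset of the size discussed here the database approach may well remain the better fit.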
