Python: merging dictionaries while adding values but preserving the other fields

I have a text file in the following format:

word_form root_form morphological_form frequency 
word_form root_form morphological_form frequency 
word_form root_form morphological_form frequency

... 1 million entries

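But some of the word_forms contain an apostrophe (') and others do not, so I would like to count them as instances of the same word. That is, I would like to merge two lines like these: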

cup'board cup  blabla 12 
cupboard cup  blabla2 10

into this one (with the frequencies added):

cupboard cup  blabla2 22

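I am looking for a solution in Python 2.7. My first idea was to read the text file and store the words with an apostrophe and the words without one in two different dictionaries, then go over the words from the apostrophe dictionary and test whether each one is already in the dictionary without apostrophes: if it is, add up the frequencies; if not, simply add the line with the apostrophe removed. Here is my code: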

class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Yields the lines of a file one at a time for a single reading, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line

def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words without apostrophe and one for words with apostrophe.

    Works in a reasonable time.
    This step can be done writing line by line, avoiding all storage in memory.'''
    word_dict = {}
    word_dict_striped = {}

    # We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:

            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'","")
                    items[2] = items[2].replace("\+Apos", "")

                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0] : Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0] : Lemma(items)})

    return word_dict, word_dict_striped

def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merges them by adding the count of their frequencies if there is a common key.

    Does not run in reasonable time on the whole list.'''

    with open('word_compiled_dict.txt', 'wb') as f:

        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write("%s\t%s\t%s\t%s\n" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                # no apostrophe-free entry exists: add the stripped lemma as it is
                word_dict[word] = word_dict_striped[word]

    print "Number of words: ",
    print(len(word_dict))

    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq

    return word_dict

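This solution runs in a reasonable time up to the point where the two dictionaries are stored, whether I write the two text files line by line to avoid any storage or keep them as dictionary objects in the program. But the merge of the two dictionaries never finishes!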
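One plausible reason for the slow merge, offered as an assumption rather than a measured diagnosis: in Python 2.7, dict.keys() returns a new list, so a test such as word in word_dict.keys() builds and scans a list on every iteration, while word in word_dict is a single hash lookup. The sketch below uses small made-up dictionaries (not the real data) and list(...) to imitate the Python 2.7 behaviour; it only shows that both tests select the same words.

```python
# Hypothetical toy data; the real dictionaries hold about a million entries.
word_dict = {"cupboard": 10, "table": 3}
word_dict_striped = {"cupboard": 12, "oclock": 5}

# Slow pattern: list(...) mimics Python 2.7, where .keys() returns a list
# and "in" has to scan it linearly on every iteration.
slow_hits = [w for w in word_dict_striped if w in list(word_dict.keys())]

# Fast pattern: "in" on the dict itself is a hash lookup.
fast_hits = [w for w in word_dict_striped if w in word_dict]

print(slow_hits == fast_hits)
```

With a million keys on each side, the first pattern does on the order of a million list scans, which would fit the observation that the merge never finishes.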

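The dictionary update function would work, but it overwrites one frequency count instead of adding the two. I have seen some solutions for merging dictionaries with addition that use Counter: "Python: Elegantly merge dictionaries with sum() of values", "Merge and sum of two dictionaries", "How to sum dict elements", "How to merge two Python dictionaries in a single expression?", "Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?". But they seem to work only when the dictionaries have the form (word, count), whereas I want to carry the other fields along in the dictionary as well.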
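A minimal sketch of a merge that adds the counts while keeping the other fields, assuming tab-separated lines with exactly four fields. The function name merge_lines and the choice to keep the root and morphology of the apostrophe-free line are assumptions made for illustration; the "+Apos" clean-up from the code above is left out.

```python
def merge_lines(lines):
    """Merge lines whose word forms differ only by apostrophes."""
    merged = {}  # stripped word form -> [word_form, root, morph, freq]
    for line in lines:
        word_form, root, morph, freq = line.rstrip("\n").split("\t")
        key = word_form.replace("'", "")
        if key in merged:
            # Same word seen before: add the frequencies.
            merged[key][3] += int(freq)
            # Assumption: prefer the fields of the apostrophe-free line.
            if "'" not in word_form:
                merged[key][1] = root
                merged[key][2] = morph
        else:
            merged[key] = [key, root, morph, int(freq)]
    return merged

sample = ["cup'board\tcup\tblabla\t12\n", "cupboard\tcup\tblabla2\t10\n"]
result = merge_lines(sample)
print(result["cupboard"])
```

Because every line is handled with one dictionary lookup, this works in a single pass and copes with any number of lines per word.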

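I am open to any ideas or to reframing the problem, since my goal is to run this program only once to obtain the merged list from this text file. Thanks in advance!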

Source

2016-12-05 hajoki

Can't you simply remove all the apostrophes by replacing them with the empty string? Like this: word_form = items[0].replace("'", "") –

But then I would have two lines for the same word, and the frequencies would not be added, right? – hajoki

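For a given word, are there at most two lines to combine, or possibly more? Do the lines that need combining have to be next to each other? And if two lines are to be combined, is everything else (apart from the count) guaranteed to be the same? – Iluvatar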

Here is something that does more or less what you want. Just change the file names at the top. It does not modify the original file.

input_file_name = "input.txt" 
output_file_name = "output.txt" 

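def custom_comp(s1, s2):
    # Order lines by word form with apostrophes removed;
    # on a tie, the form with an apostrophe sorts first.
    word1 = s1.split()[0]
    word2 = s2.split()[0]
    stripped1 = word1.translate(None, "'")
    stripped2 = word2.translate(None, "'")

    if stripped1 > stripped2:
        return 1
    elif stripped1 < stripped2:
        return -1
    elif ("'" in word1) == ("'" in word2):
        return 0
    elif "'" in word1:
        return -1
    else:
        return 1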

def get_word(line): 
    return line.split()[0].translate(None, "'") 

def get_num(line): 
    return int(line.split()[-1]) 

print "Reading file and sorting..." 

lines = [] 
with open(input_file_name, 'r') as f: 
    for line in sorted(f, cmp=custom_comp):
        lines.append(line)

print "File read and sorted" 

combined_lines = [] 

print "Combining entries..." 

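i = 0
while i < len(lines) - 1:
    if get_word(lines[i]) == get_word(lines[i+1]):
        # Same word: add the counts and keep the fields of the second line
        total = get_num(lines[i]) + get_num(lines[i+1])
        new_parts = lines[i+1].split()
        new_parts[-1] = str(total)
        combined_lines.append(" ".join(new_parts))
        i += 2
    else:
        combined_lines.append(lines[i].strip())
        i += 1

# The loop stops one line short; keep the last line if it was not merged
if i == len(lines) - 1:
    combined_lines.append(lines[i].strip())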

print "Entries combined" 
print "Writing to file..." 

with open(output_file_name, 'w+') as f:
    for line in combined_lines:
        f.write(line + "\n")

print "Finished"

It sorts the words and changes the spacing slightly. If that matters, let me know and it can be adjusted.

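Another thing is that it sorts the whole thing. For only a million lines it probably won't take too long, but again, let me know if that is a problem.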
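As a side note on the sorting step, the same ordering can probably be expressed with a key function instead of a comparison function. The cmp= argument of sorted exists only in Python 2, while key= is available in both Python 2.7 and Python 3. The sketch below is an illustration with made-up lines, and the name sort_key is hypothetical.

```python
def sort_key(line):
    """Sort by the word form without apostrophes; apostrophe forms first on ties."""
    word = line.split()[0]
    # False sorts before True, so a word containing an apostrophe
    # comes before its apostrophe-free twin.
    return (word.replace("'", ""), "'" not in word)

lines = [
    "cupboard cup blabla2 10\n",
    "apple apple x 3\n",
    "cup'board cup blabla 12\n",
]
ordered = sorted(lines, key=sort_key)
for line in ordered:
    print(line.strip())
```

A key function is evaluated once per line rather than once per comparison, which tends to help on large inputs.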

Source

2016-12-05 14:39:47 Iluvatar

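Thank you very much for your answer, it takes less than a minute! I modified it a little so that apostrophe entries are also inserted when there is no apostrophe-free entry to merge them with, and I realised I have to run the program several times, because there are some cases where more than two lines need to be merged (my bad, I did not know there were any), but having a finished program changes everything! – hajoki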
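For the case of more than two lines per word, one possible approach is to group adjacent lines with itertools.groupby, so that a single run merges any number of them. This is a sketch under assumptions, not the code hajoki actually used: the input must already be sorted so that equal stripped words are adjacent, the helper names strip_word and combine are hypothetical, and unmatched apostrophe entries are written with the apostrophe removed.

```python
from itertools import groupby

def strip_word(line):
    """Return the word form of a line with apostrophes removed."""
    return line.split()[0].replace("'", "")

def combine(sorted_lines):
    """Merge every run of adjacent lines that share the same stripped word."""
    out = []
    for word, group in groupby(sorted_lines, key=strip_word):
        group = list(group)
        total = sum(int(l.split()[-1]) for l in group)
        # Keep the fields of the last line in the run (the apostrophe-free
        # one when apostrophe forms are sorted first).
        parts = group[-1].split()
        parts[0] = word
        parts[-1] = str(total)
        out.append(" ".join(parts))
    return out

sorted_lines = [
    "cup'board cup blabla 12\n",
    "cupboard cup blabla2 10\n",
    "cupboard cup blabla3 5\n",
    "o'clock clock y 4\n",
]
combined = combine(sorted_lines)
for line in combined:
    print(line)
```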
