2016-12-05 61 views

Python: merging dictionaries while adding values but conserving other fields

I have a text file in the following format:

word_form root_form morphological_form frequency 
word_form root_form morphological_form frequency 
word_form root_form morphological_form frequency 

... (about 1 million entries)

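But some of the word_forms contain an apostrophe (') and others do not, so I would like to count them as instances of the same word. That is, I would like to merge two lines like these: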

cup'board cup  blabla 12 
cupboard cup  blabla2 10 

into this one (with the frequencies added):

cupboard cup  blabla2 22 

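I'm looking for a solution in Python 2.7 to do this. My first idea was to read the text file, store the words with an apostrophe and the words without one in two different dictionaries, then go over the words from the apostrophe dictionary and test whether each one is already in the dictionary without apostrophes: if it is, update the frequency; if not, simply add the line with the apostrophe removed. Here is my code: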

class Lemma: 
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus""" 
    def __init__(self, lop): 
        self.word_form = lop[0] 
        self.root = lop[1] 
        self.morph = lop[2] 
        self.freq = int(lop[3]) 

def Reader(filename): 
    """Yields the lines of a file one at a time, so the whole file is never held in memory""" 
    with open(filename) as f: 
        for line in f: 
            yield line 

def get_word_dict(filename): 
    '''Separates the word list into two dictionaries, one for words without apostrophe and one for words with apostrophe. 
    Works in a reasonable time. 
    This step can be done writing line by line, avoiding all storage in memory.''' 
    word_dict = {} 
    word_dict_striped = {} 

    # We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe 
    with open('word_dict.txt', 'wb') as f: 
        with open('word_dict_striped.txt', 'wb') as g: 

            reader = Reader(filename) 
            for line in reader: 
                items = line.split("\t") 
                word_form = items[0] 
                if "'" in word_form: 
                    # we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped 
                    items[0] = word_form.replace("'", "") 
                    # str.replace matches literally, so the "+" needs no backslash 
                    items[2] = items[2].replace("+Apos", "") 

                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3])) 
                    word_dict_striped.update({items[0] : Lemma(items)}) 
                else: 
                    # we just add the lemma to the dictionary word_dict 
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3])) 
                    word_dict.update({items[0] : Lemma(items)}) 

    return word_dict, word_dict_striped 

def merge_word_dict(word_dict, word_dict_striped): 
    '''Takes two dictionaries and merges them by adding the count of their frequencies if there is a common key. 
    Does not run in reasonable time on the whole list.''' 

    with open('word_compiled_dict.txt', 'wb') as f: 

        for word in word_dict_striped.keys(): 
            if word in word_dict.keys(): 
                word_dict[word].freq += word_dict_striped[word].freq 
                f.write("%s\t%s\t%s\t%s\n" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq)) 
            else: 
                # assign the Lemma directly; dict.update() expects a mapping, not a Lemma 
                word_dict[word] = word_dict_striped[word] 

    print "Number of words: ", 
    print(len(word_dict)) 

    for x in word_dict: 
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq 

    return word_dict 

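This solution runs in a reasonable time up to the point where the two dictionaries are stored, whether I write the two text files line by line to avoid any storage or keep them as dictionary objects in the program. But the merging of the two dictionaries never finishes!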
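One plausible cause of the slow merge, assuming the code runs under Python 2.7 as stated: there, `dict.keys()` builds a new list on every call, so `word in word_dict.keys()` scans that list linearly and the loop becomes quadratic in the number of words. Testing membership on the dict itself is a hash lookup. A minimal sketch with toy data follows; the `Entry` class, `merge_entries`, and the sample values are illustrative stand-ins, not the real corpus or the original `Lemma` class.

```python
class Entry(object):
    """Hypothetical stand-in for the Lemma class in the question."""
    def __init__(self, word_form, root, morph, freq):
        self.word_form = word_form
        self.root = root
        self.morph = morph
        self.freq = int(freq)

def merge_entries(plain, stripped):
    """Add the entries of `stripped` into `plain`, summing frequencies on shared keys."""
    for word, entry in stripped.items():
        if word in plain:            # hash lookup on the dict itself, no .keys() list
            plain[word].freq += entry.freq
        else:
            plain[word] = entry      # plain assignment instead of dict.update()
    return plain

# Toy data (made up for illustration)
plain = {"cupboard": Entry("cupboard", "cup", "blabla2", 10)}
stripped = {"cupboard": Entry("cupboard", "cup", "blabla", 12),
            "oclock": Entry("oclock", "clock", "blabla3", 3)}
merged = merge_entries(plain, stripped)
```

The sketch avoids `print` so that it runs unchanged on Python 2.7 and Python 3.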

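The dictionary "update" function would work, but it overwrites one frequency count instead of adding the two. I have seen some solutions for merging dictionaries with addition that use Counter:

Python: Elegantly merge dictionaries with sum() of values
Merge and sum of two dictionaries
How to sum dict elements
How to merge two Python dictionaries in a single expression?
Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?

But they seem to work only when the dictionaries have the form (word, count), whereas I also want to keep the other fields in the dictionary.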
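One way to keep the other fields while still adding counts is to store a small record per key and add only the frequency field. The sketch below does this in a single pass over the lines. It assumes exactly four tab-separated fields per line, and that the fields of the apostrophe-free line should win, as in the cupboard example; `merge_lines` is a hypothetical name, and the sketch does not strip "+Apos" from the morphological field.

```python
def merge_lines(lines):
    """Merge tab-separated lines whose word forms differ only by apostrophes."""
    merged = {}   # stripped word form -> [word_form, root, morph, freq]
    for line in lines:
        word_form, root, morph, freq = line.rstrip("\n").split("\t")
        key = word_form.replace("'", "")
        if key not in merged:
            merged[key] = [key, root, morph, int(freq)]
        else:
            merged[key][3] += int(freq)
            if "'" not in word_form:
                # prefer the fields of the apostrophe-free line (assumption)
                merged[key][1], merged[key][2] = root, morph
    return merged

sample = ["cup'board\tcup\tblabla\t12\n", "cupboard\tcup\tblabla2\t10\n"]
result = merge_lines(sample)
```

Because every line is folded into one dictionary as it is read, any number of variants of the same word are combined, not just pairs.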

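I'm open to any ideas or reformulations of the problem, since my goal is to run this program only once to obtain the merged list from this text file. Thanks in advance!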

Couldn't you simply replace all the apostrophes with an empty string to remove them? Like this: `word_form = items[0].replace("'", "")`

But then I would have two lines for the same word, and the frequencies would not be added, right? – hajoki

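For a given word, are at most two lines to be combined, or possibly more? Do the lines to be combined have to be next to each other? If two lines are to be combined, is everything else (apart from the count) guaranteed to be the same? – Iluvatar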

Answers

Here is something that does more or less what you want. Just change the file names at the top. It does not modify the original file.

input_file_name = "input.txt" 
output_file_name = "output.txt" 

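def custom_comp(s1, s2): 
    # compare two lines by their first field with apostrophes removed 
    word1 = s1.split()[0] 
    word2 = s2.split()[0] 
    stripped1 = word1.translate(None, "'") 
    stripped2 = word2.translate(None, "'") 

    if stripped1 > stripped2: 
        return 1 
    elif stripped1 < stripped2: 
        return -1 
    else: 
        # same word once apostrophes are removed: put the apostrophe form first 
        if "'" in word1: 
            return -1 
        else: 
            return 1 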

def get_word(line): 
    return line.split()[0].translate(None, "'") 

def get_num(line): 
    return int(line.split()[-1]) 

print "Reading file and sorting..." 

with open(input_file_name, 'r') as f: 
    lines = sorted(f, cmp=custom_comp) 

print "File read and sorted" 

combined_lines = [] 

print "Combining entries..." 

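i = 0 
while i < len(lines): 
    # merge only when a next line exists and both lines share the same stripped word; 
    # looping up to len(lines) keeps the last line from being dropped 
    if i + 1 < len(lines) and get_word(lines[i]) == get_word(lines[i+1]): 
        total = get_num(lines[i]) + get_num(lines[i+1]) 
        new_parts = lines[i+1].split() 
        new_parts[-1] = str(total) 
        combined_lines.append(" ".join(new_parts)) 
        i += 2 
    else: 
        combined_lines.append(lines[i].strip()) 
        i += 1 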

print "Entries combined" 
print "Writing to file..." 

with open(output_file_name, 'w+') as f: 
    for line in combined_lines: 
        f.write(line + "\n") 

print "Finished" 

It sorts the text and changes the spacing slightly. If that matters, let me know and it can be adjusted.

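Another thing is that it sorts the whole thing. For only a million lines it probably won't take too long, but again, let me know if that's a problem.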
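The ordering expressed through `cmp=` can also be written as a sort key, which is computed once per line rather than once per comparison and also works on Python 3, where `cmp=` no longer exists. A sketch, assuming the same ordering is wanted (by apostrophe-stripped word, with the apostrophe variant first); `sort_key` and the sample lines are hypothetical:

```python
def sort_key(line):
    """Order by apostrophe-stripped word; apostrophe variants sort first."""
    word = line.split()[0]
    # False sorts before True, so a word containing "'" comes first among equals
    return (word.replace("'", ""), "'" not in word)

lines = ["cupboard cup blabla2 10\n",
         "apple apple blabla3 4\n",
         "cup'board cup blabla 12\n"]
ordered = sorted(lines, key=sort_key)
```

With this key, `cup'board` lands directly before `cupboard`, so the apostrophe-free line is still the second of each pair.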

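Thank you very much for your answer, it runs in less than a minute! I modified it slightly so that apostrophe entries are also inserted when there is no apostrophe-free entry to merge them with, and I realized that I have to run the program several times, because there are some cases where more than two lines need to be merged (my bad, I didn't know there were any), but having a finished program changes everything! – hajoki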
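For the case of more than two lines per word, one option that avoids repeated runs is to group each run of adjacent lines with `itertools.groupby`. The sketch below assumes the lines are already sorted so that variants of a word are adjacent and the apostrophe-free line, if any, comes last in its run. It also assumes that an apostrophe entry with no partner should be written with its apostrophe removed. `stripped_word` and `combine_sorted` are hypothetical names.

```python
from itertools import groupby

def stripped_word(line):
    """First field of the line with apostrophes removed."""
    return line.split()[0].replace("'", "")

def combine_sorted(lines):
    """Merge each run of adjacent lines sharing a stripped word into one line."""
    combined = []
    for word, group in groupby(lines, key=stripped_word):
        group = list(group)
        total = sum(int(g.split()[-1]) for g in group)
        parts = group[-1].split()      # keep the fields of the last line in the run
        parts[0] = word                # write the word without apostrophes
        parts[-1] = str(total)
        combined.append(" ".join(parts))
    return combined

lines = ["apple apple blabla3 4\n",
         "cup'board cup blabla 12\n",
         "cup'boar'd cup blabla4 1\n",
         "cupboard cup blabla2 10\n"]
combined = combine_sorted(lines)
```

Since `groupby` only groups consecutive items, the sorting step is still required before calling `combine_sorted`.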