Python: Merging dictionaries while adding values but conserving other fields
I have a text file in the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
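... 10,000 entries
But some of the word_forms contain an apostrophe (') while others do not, so I would like to count them as instances of the same word. That is, I would like to merge two lines like these: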
cup'board cup blabla 12
cupboard cup blabla2 10
into this one (with the frequencies added):
cupboard cup blabla2 22
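I am looking for a solution in Python 2.7. My first idea was to read the text file and store the words with an apostrophe and the words without one in two different dictionaries, then go over the words from the apostrophe dictionary and test whether each one is already in the dictionary without apostrophes. If it is, add the frequencies; if not, simply add the line with the apostrophe removed. Here is my code: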
class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Yields the lines of a file one at a time, so the whole file is never held in memory"""
    with open(filename) as f:
        for line in f:
            yield line

def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words with apostrophe and one for words without apostrophe.
    Works in a reasonable time.
    This step can be done writing line by line, avoiding all storage in memory.'''
    word_dict = {}
    word_dict_striped = {}
    # We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'","")
                    items[2] = items[2].replace("\+Apos", "")
                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0] : Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0] : Lemma(items)})
    return word_dict, word_dict_striped

def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merges them by adding the count of their frequencies if there is a common key.
    Does not run in reasonable time on the whole list.'''
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                # freq is an int here, so the newline has to be written explicitly
                f.write("%s\t%s\t%s\t%s\n" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                # update() needs a mapping, not a bare Lemma object
                word_dict.update({word : word_dict_striped[word]})
    print "Number of words: ",
    print(len(word_dict))
    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
    return word_dict

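This solution runs in a reasonable time up to the point where the two dictionaries are stored, whether I write two text files line by line to avoid holding anything in memory or keep them as dict objects in the program. But the merge of the two dictionaries never ends!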
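One likely contributor to the slow merge, in Python 2.7 specifically: `word_dict.keys()` builds a full list on every call, so `word in word_dict.keys()` scans that list linearly for each word, whereas `word in word_dict` is a hash lookup. The following is a minimal sketch of the same merge with direct membership tests. The name `merge_records` and the plain-list records `[word_form, root, morph, freq]` are simplified stand-ins for the `Lemma` objects, not part of the code above, and the sample entries are made up.

```python
def merge_records(word_dict, word_dict_striped):
    """Merge word_dict_striped into word_dict, summing frequencies on shared keys.

    Both arguments map word_form -> [word_form, root, morph, freq],
    a hypothetical stand-in for the Lemma class.
    """
    for word, record in word_dict_striped.items():
        if word in word_dict:  # hash lookup, no intermediate key list
            word_dict[word][3] += record[3]
        else:
            word_dict[word] = record
    return word_dict


# Made-up sample entries, for illustration only.
plain = {"cupboard": ["cupboard", "cup", "blabla2", 10]}
stripped = {"cupboard": ["cupboard", "cup", "blabla", 12],
            "oclock": ["oclock", "clock", "blabla3", 4]}
merged = merge_records(plain, stripped)
```

The fields of the record already present in `word_dict` are kept, and only the frequency changes.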
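The dictionary `update` method would work, but it overwrites one frequency count instead of adding the two. I have seen some solutions for merging dictionaries with addition, using Counter:
Python: Elegantly merge dictionaries with sum() of values
Merge and sum of two dictionaries
How to sum dict elements
How to merge two Python dictionaries in a single expression?
Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?
But they seem to work only when the dictionaries have the form (word, count), whereas I want to carry over the other fields in the dictionary as well.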
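A Counter is not strictly needed to keep the other fields. As a minimal sketch, not tested on the full file, a single dictionary keyed by the apostrophe-free word form can hold the whole record and add only the frequency. The function name `merge_lines` is hypothetical, the input is assumed to be tab-separated with four fields per line, and when two lines collide the root and morphological form of the apostrophe-free line are kept, matching the cupboard example.

```python
def merge_lines(lines):
    """Return {stripped_word_form: [word_form, root, morph, freq]}.

    Assumes tab-separated lines with exactly four fields; other lines are skipped.
    """
    merged = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        word_form, root, morph, freq = fields
        has_apostrophe = "'" in word_form
        key = word_form.replace("'", "")
        if key not in merged:
            merged[key] = [key, root, morph, int(freq)]
        else:
            merged[key][3] += int(freq)
            if not has_apostrophe:
                # prefer the root/morph of the apostrophe-free line
                merged[key][1], merged[key][2] = root, morph
    return merged


sample = ["cup'board\tcup\tblabla\t12\n", "cupboard\tcup\tblabla2\t10\n"]
result = merge_lines(sample)
print("\t".join(str(field) for field in result["cupboard"]))
```

For the two sample lines this yields the record cupboard, cup, blabla2, 22. Because the file is read once and each line costs one dictionary lookup, no second merge pass is needed.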
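I am open to any ideas or to reframing the problem, since my goal is to have this program run only once to obtain the merged list in a text file. Thank you in advance!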
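Couldn't you simply remove all the apostrophes by replacing them with an empty string? Like this: `word_form = items[0].replace("'", "")` –
But then I would have two lines for the same word, and their frequencies would not be added, right? – hajoki
For a given word, can at most two lines be combined, or possibly more? Do the lines that must be combined need to be adjacent to each other? If two lines are to be combined, is everything else (except the count) guaranteed to be the same? – Iluvatar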
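Some of these questions can be checked against the data itself. A rough diagnostic sketch, assuming tab-separated lines with four fields: group the lines by apostrophe-free word form and report the groups that contain more than two lines or disagree on the root. `find_unusual_groups` is a hypothetical helper name, and the sketch does not check adjacency.

```python
from collections import defaultdict


def find_unusual_groups(lines):
    """Return {stripped_word_form: rows} for groups needing a closer look."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        key = fields[0].replace("'", "")
        groups[key].append(fields)

    unusual = {}
    for key, rows in groups.items():
        roots = set(row[1] for row in rows)
        # flag groups with more than two lines, or with conflicting roots
        if len(rows) > 2 or len(roots) > 1:
            unusual[key] = rows
    return unusual


sample = ["cup'board\tcup\tblabla\t12\n", "cupboard\tcup\tblabla2\t10\n"]
report = find_unusual_groups(sample)
```

An empty result for the whole file would suggest that each word has at most two variants with the same root.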