2011-08-23

Extracting items with Python: I have a text file made up of blocks of five tab-delimited lines:

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

Within each block, the DESCRIPTION and SENTENCE columns are identical. The data of interest is in the ITEMS column, which differs for each line of the block and has the following format:

word1, word2, word3 

...and so on.

For each 5-line block, I need to count the frequency of word1, word2, etc. in ITEMS. For example, if the first 5-line block is as follows:

1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3 

1 \t DESCRIPTION \t SENTENCE \t word1, word2 

1 \t DESCRIPTION \t SENTENCE \t word4 

1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3 

1 \t DESCRIPTION \t SENTENCE \t word1, word2 

then the correct output for this 5-line block would be:

1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1) 

That is: the block number, followed by the sentence, followed by the frequency counts of the words.
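As a point of reference, the per-block counting described above can be sketched with `collections.Counter` (Python 3; the rows below are stand-ins for the real data):

```python
from collections import Counter

# Stand-in for one 5-line block; ITEMS is the tab-separated field at index 3.
block = [
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2",
    "1\tDESCRIPTION\tSENTENCE\tword4",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2",
]

counts = Counter()
for line in block:
    items = line.split('\t')[3]                    # ITEMS column
    counts.update(w.strip() for w in items.split(','))

print(counts)  # word1 and word2 occur 4 times, word3 twice, word4 once
```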

I have some code that extracts five-line blocks and counts word frequencies within a block, but I'm stuck on the task of isolating each block, getting its word frequencies, moving on to the next block, and so on:

from itertools import groupby 

def GetFrequencies(file): 
    file_contents = open(file).readlines() # file as list 
    """use zip to get the entire file as a list of 5-line chunk tuples""" 
    five_line_increments = zip(*[iter(file_contents)]*5) 
    for chunk in five_line_increments:  # for each 5-line chunk... 
        for sentence in chunk:          # ...and for each sentence in that chunk 
            words = sentence.split('\t')[3].split()  # get the ITEMS column at index 3 
            words_no_comma = [x.strip(',') for x in words]  # get rid of the commas 
            words_no_ws = [x.strip(' ') for x in words_no_comma]  # get rid of the whitespace left by the removed commas 

    """STUCK HERE. The idea originally was to take the words lists for 
    each chunk and combine them to create a big list, 'collection', and 
    feed this into the for-loop below.""" 

    # collection is a big list containing all of the words in the ITEMS
    # section of the chunk, e.g. ['word1', 'word2', 'word3', 'word1', ...]
    for key, group in groupby(collection): 
        print key, len(list(group)), 
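For what it's worth, the `groupby` idea in the question can also be completed by keying on the block number (column 0) rather than zipping fixed-size chunks, so blocks of any length work. A sketch (Python 3, assuming the file layout shown above; `frequencies_by_block` is a hypothetical helper name):

```python
from itertools import groupby
from collections import Counter

def frequencies_by_block(lines):
    # Group consecutive lines that share the same block number (column 0).
    for block_id, group in groupby(lines, key=lambda l: l.split('\t')[0]):
        rows = [l.rstrip('\n').split('\t') for l in group]
        sentence = rows[0][2]          # SENTENCE is identical within a block
        counts = Counter()
        for row in rows:
            counts.update(w.strip() for w in row[3].split(','))
        yield block_id, sentence, counts

lines = [
    "1\tDESC\tSENT_A\tword1, word2\n",
    "1\tDESC\tSENT_A\tword1\n",
    "2\tDESC\tSENT_B\tword3\n",
]
for block_id, sentence, counts in frequencies_by_block(lines):
    print(block_id, sentence, dict(counts))
```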

Answers

Answer (0 votes)

Editing your code a little, I think this does what you want it to do:

file_contents = open(file).readlines() # file as list 
"""use zip to get the entire file as a list of 5-line chunk tuples""" 
five_line_increments = zip(*[iter(file_contents)]*5) 
for chunk in five_line_increments:  # for each 5-line chunk... 
    word_freq = {}                  # word frequencies for each chunk 
    for sentence in chunk:          # ...and for each sentence in that chunk 
        words = sentence.split('\t')[3].strip('\n').split(', ')  # get the ITEMS column at index 3 as a list of words 
        for word in words: 
            if word not in word_freq: 
                word_freq[word] = 1 
            else: 
                word_freq[word] += 1 

    print word_freq 

Output:

{'word4': 1, 'word1': 4, 'word3': 2, 'word2': 4} 

This does the job nicely. Thanks! – Renklauf
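As an aside, the manual dictionary counting in the answer above can also be written with `collections.Counter`, which handles the "not in / += 1" bookkeeping for you. A minimal sketch, assuming the same tab-delimited line format (the sample line is hypothetical):

```python
from collections import Counter

# One hypothetical line in the same format as the question's data.
sentence = "1\tDESCRIPTION\tSENTENCE\tword1, word2, word1\n"

# Same parsing as the answer above: take the ITEMS column and split on ', '.
words = sentence.split('\t')[3].strip('\n').split(', ')
word_freq = Counter(words)
print(word_freq)  # word1 counted twice, word2 once
```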

Answer (0 votes)

To summarize: you want to append words to 'collection' if they are not "DESCRIPTION" or "SENTENCE"? Try this:

for word in words_no_ws: 
    if word not in ("DESCRIPTION", "SENTENCE"): 
        collection.append(word) 
Answer (1 vote)

Using Python 2.7:

#!/usr/bin/env python 

import collections 

chunks = {} 

with open('input') as fd: 
    for line in fd: 
        line = line.split() 
        if not line: 
            continue 
        if line[0] not in chunks: 
            chunks[line[0]] = [line[2]]       # remember the SENTENCE for this block 
        for i in line[3:]:                    # count the ITEMS on every line, including the first 
            chunks[line[0]].append(i.replace(',', '')) 

for k, v in chunks.iteritems(): 
    counter = collections.Counter(v[1:]) 
    print k, v[0], counter 

Output:

1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1}) 

Can't update to 2.7 because of a time crunch, but this is the code. – Renklauf

Answer (1 vote)

There is a CSV parser in the standard library that can handle the input splitting for you:

import csv 
import collections 

def GetFrequencies(file_in): 
    sentences = dict() 
    # csv.reader is not a context manager, so the with-statement goes on open() 
    with open(file_in, 'rb') as f: 
        csv_file = csv.reader(f, delimiter='\t') 
        for line in csv_file: 
            sentence = line[0] 
            if sentence not in sentences: 
                sentences[sentence] = collections.Counter() 
            sentences[sentence].update([x.strip(' ') for x in line[3].split(',')]) 
    return sentences 
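For illustration, here is how the same csv-based idea behaves on a small in-memory sample (Python 3 sketch; `io.StringIO` stands in for the real file, and the two sample lines are hypothetical):

```python
import csv
import io
from collections import Counter

# Two hypothetical tab-delimited lines in the question's format.
data = "1\tDESC\tSENT\tword1, word2\n1\tDESC\tSENT\tword1\n"
reader = csv.reader(io.StringIO(data), delimiter='\t')

sentences = {}
for line in reader:
    key = line[0]  # block number in column 0
    sentences.setdefault(key, Counter()).update(
        x.strip() for x in line[3].split(','))

print(sentences)  # one Counter per block, keyed by block number
```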