2011-08-23

Extracting items with Python: I have a text file made up of blocks of five tab-delimited lines:

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

1 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

2 \t DESCRIPTION \t SENTENCE \t ITEMS 

Within each block, the DESCRIPTION and SENTENCE columns are identical. The data of interest is in the ITEMS column, which differs for each line of the block and has the following format:

word1, word2, word3 

...and so on.

For each 5-line block, I need to count the frequency of word1, word2, etc. in ITEMS. For example, if the first 5-line block is as follows:

1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3 

1 \t DESCRIPTION \t SENTENCE \t word1, word2 

1 \t DESCRIPTION \t SENTENCE \t word4 

1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3 

1 \t DESCRIPTION \t SENTENCE \t word1, word2 

then the correct output for this 5-line block would be:

1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1) 

That is: the block number, followed by the sentence, followed by the frequency counts of the words.
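As a point of reference, the per-block counting described above can be sketched with `collections.Counter` (Python 3; the rows below are stand-ins for the real data):

```python
from collections import Counter

# Stand-in for one 5-line block; ITEMS is the tab-separated field at index 3.
block = [
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2",
    "1\tDESCRIPTION\tSENTENCE\tword4",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2",
]

counts = Counter()
for line in block:
    items = line.split('\t')[3]                    # ITEMS column
    counts.update(w.strip() for w in items.split(','))

print(counts)  # word1 and word2 occur 4 times, word3 twice, word4 once
```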

I have some code that extracts five-line blocks and counts word frequencies within a block, but I'm stuck on the task of isolating each block, getting its word frequencies, moving on to the next block, and so on:

from itertools import groupby 

def GetFrequencies(file): 
    file_contents = open(file).readlines() # file as list 
    """use zip to get the entire file as a list of 5-line chunk tuples""" 
    five_line_increments = zip(*[iter(file_contents)]*5) 
    for chunk in five_line_increments:  # for each 5-line chunk... 
        for sentence in chunk:          # ...and for each sentence in that chunk 
            words = sentence.split('\t')[3].split()  # get the ITEMS column at index 3 
            words_no_comma = [x.strip(',') for x in words]  # get rid of the commas 
            words_no_ws = [x.strip(' ') for x in words_no_comma]  # get rid of the whitespace left by the removed commas 

    """STUCK HERE. The idea originally was to take the words lists for 
    each chunk and combine them to create a big list, 'collection', and 
    feed this into the for-loop below.""" 

    # collection is a big list containing all of the words in the ITEMS
    # section of the chunk, e.g. ['word1', 'word2', 'word3', 'word1', ...]
    for key, group in groupby(collection): 
        print key, len(list(group)), 
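For what it's worth, the `groupby` idea in the question can also be completed by keying on the block number (column 0) rather than zipping fixed-size chunks, so blocks of any length work. A sketch (Python 3, assuming the file layout shown above; `frequencies_by_block` is a hypothetical helper name):

```python
from itertools import groupby
from collections import Counter

def frequencies_by_block(lines):
    # Group consecutive lines that share the same block number (column 0).
    for block_id, group in groupby(lines, key=lambda l: l.split('\t')[0]):
        rows = [l.rstrip('\n').split('\t') for l in group]
        sentence = rows[0][2]          # SENTENCE is identical within a block
        counts = Counter()
        for row in rows:
            counts.update(w.strip() for w in row[3].split(','))
        yield block_id, sentence, counts

lines = [
    "1\tDESC\tSENT_A\tword1, word2\n",
    "1\tDESC\tSENT_A\tword1\n",
    "2\tDESC\tSENT_B\tword3\n",
]
for block_id, sentence, counts in frequencies_by_block(lines):
    print(block_id, sentence, dict(counts))
```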

Answers

Answer (0 votes)

Editing your code a little, I think this does what you want it to do:

file_contents = open(file).readlines() # file as list 
"""use zip to get the entire file as a list of 5-line chunk tuples""" 
five_line_increments = zip(*[iter(file_contents)]*5) 
for chunk in five_line_increments:  # for each 5-line chunk... 
    word_freq = {}                  # word frequencies for each chunk 
    for sentence in chunk:          # ...and for each sentence in that chunk 
        words = sentence.split('\t')[3].strip('\n').split(', ')  # get the ITEMS column at index 3 as a list of words 
        for word in words: 
            if word not in word_freq: 
                word_freq[word] = 1 
            else: 
                word_freq[word] += 1 

    print word_freq 

Output:

{'word4': 1, 'word1': 4, 'word3': 2, 'word2': 4} 

This does the job nicely. Thanks! – Renklauf
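As an aside, the manual dictionary counting in the answer above can also be written with `collections.Counter`, which handles the "not in / += 1" bookkeeping for you. A minimal sketch, assuming the same tab-delimited line format (the sample line is hypothetical):

```python
from collections import Counter

# One hypothetical line in the same format as the question's data.
sentence = "1\tDESCRIPTION\tSENTENCE\tword1, word2, word1\n"

# Same parsing as the answer above: take the ITEMS column and split on ', '.
words = sentence.split('\t')[3].strip('\n').split(', ')
word_freq = Counter(words)
print(word_freq)  # word1 counted twice, word2 once
```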

Answer (0 votes)

To summarize: you want to append words to 'collection' if they are not "DESCRIPTION" or "SENTENCE"? Try this:

for word in words_no_ws: 
    if word not in ("DESCRIPTION", "SENTENCE"): 
        collection.append(word) 
Answer (1 vote)

Using Python 2.7:

#!/usr/bin/env python 

import collections 

chunks = {} 

with open('input') as fd: 
    for line in fd: 
        line = line.split() 
        if not line: 
            continue 
        if line[0] not in chunks: 
            chunks[line[0]] = [line[2]]       # remember the SENTENCE for this block 
        for i in line[3:]:                    # count the ITEMS on every line, including the first 
            chunks[line[0]].append(i.replace(',', '')) 

for k, v in chunks.iteritems(): 
    counter = collections.Counter(v[1:]) 
    print k, v[0], counter 

Output:

1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1}) 

Can't update to 2.7 because of a time crunch, but this is the code. – Renklauf

Answer (1 vote)

There is a CSV parser in the standard library that can handle the input splitting for you:

import csv 
import collections 

def GetFrequencies(file_in): 
    sentences = dict() 
    # csv.reader is not a context manager, so the with-statement goes on open() 
    with open(file_in, 'rb') as f: 
        csv_file = csv.reader(f, delimiter='\t') 
        for line in csv_file: 
            sentence = line[0] 
            if sentence not in sentences: 
                sentences[sentence] = collections.Counter() 
            sentences[sentence].update([x.strip(' ') for x in line[3].split(',')]) 
    return sentences 
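For illustration, here is how the same csv-based idea behaves on a small in-memory sample (Python 3 sketch; `io.StringIO` stands in for the real file, and the two sample lines are hypothetical):

```python
import csv
import io
from collections import Counter

# Two hypothetical tab-delimited lines in the question's format.
data = "1\tDESC\tSENT\tword1, word2\n1\tDESC\tSENT\tword1\n"
reader = csv.reader(io.StringIO(data), delimiter='\t')

sentences = {}
for line in reader:
    key = line[0]  # block number in column 0
    sentences.setdefault(key, Counter()).update(
        x.strip() for x in line[3].split(','))

print(sentences)  # one Counter per block, keyed by block number
```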