2011-04-11

How do I increment a tuple value and search for a string in a loop in Python?

I have this code:

arfffile = [] 

inputed = raw_input("Enter Evaluation for name including file extension...") 

reader = open(inputed, 'r') 

verses = [] 

for line in reader: 
    verses.append(line) 

for line in verses: 
    if line.split('@') == "@": 
        verses.pop(line) 


numclusters = int(raw_input("Enter the number of clusters")) 

clusters = {} 

for i in range(1,numclusters+1): 
    clusters["cluster"+str(i)] = 0 



print clusters 
# If verse belongs to a cluster, increment the cluster count by one in the dictionary value. 
for verse in verses: 
    for k in clusters: 
        if k in verse: 
            clusters[k] += 1 
        else: 
            print "not in" 

print clusters 

yeslist = [] 

for verse in verses: 
    for k in clusters: 
        if k not in yeslist: 
            yeslist.append((k,0)) 
        elif k in yeslist: 
            print "already in" + k 


for verse in verses: 
    for k in clusters: 
        if k in verse and "Yes" in verse: 
            yeslist.append(yeslist.index(k), +1) 


    # iterate through dictionary and iterate through the lines 
    # need to read in file line by line, 



    # if "yes" and cluster x increment cluster 
    # need to work out percentage of positive verses in each cluster. 

An example of the ARFF file is:

@relation tester999.arff_clustered 

@attribute Instance_number numeric 
@attribute allah numeric 
@attribute day numeric 
@attribute lord numeric 
@attribute people numeric 
@attribute earth numeric 
@attribute men numeric 
@attribute truth numeric 
@attribute verily numeric 
@attribute chapter numeric 
@attribute verse numeric 
@attribute CLASS {Yes,No} 
@attribute Cluster {cluster1,cluster2,cluster3} 

@data 
0,1,0,0,0,0,0,0,0,1,1,No,cluster3 
1,1,0,0,0,0,0,0,0,1,2,No,cluster3 
2,0,0,0,0,0,0,0,0,1,3,No,cluster3 
3,0,1,0,0,0,1,0,0,1,4,No,cluster3 
4,0,0,0,0,0,0,0,0,1,5,No,cluster3 
5,0,0,0,0,0,0,0,0,1,6,No,cluster3 
6,0,0,0,0,0,0,0,0,1,7,No,cluster3 
7,0,0,0,0,0,0,0,0,2,1,No,cluster3 
8,1,0,0,0,0,0,0,0,2,2,No,cluster3 
9,0,0,0,0,0,0,0,0,2,3,No,cluster3 
10,0,0,0,0,0,0,0,0,2,4,No,cluster3 
11,0,0,1,0,0,0,0,0,2,5,No,cluster2 

So the program reads in lines of data such as:

0,1,0,0,0,0,0,0,0,1,1,No,cluster3 

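I have also built a dictionary that records how many clusters there are in the data file. In this example there are 3: cluster1, cluster2 and cluster3. The code adds each cluster as a key (represented as a string) to the dictionary "clusters". I then iterate over all the verses and check each line to see which cluster it belongs to, counting the lines per cluster.

My next step is to count, for each cluster, the number of lines in which "Yes" occurs. So if, say, 10 lines in a cluster contain the string "Yes", the code should be able to count those occurrences.

The code I have so far for this is here: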

for verse in verses: 
    for k in clusters: 
        if k in verse and "Yes" in verse: 
            yeslist.append(yeslist.index(k), +1) 

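I have basically created a list of tuples called "yeslist" with values like this: [(cluster1, 0), (cluster2, 3)]

So for each line (represented as a string), I want to check whether "Yes" is in it and, if so, check which cluster it belongs to and increment that tuple's value by 1.

I am having a hard time working out the logic for how to do this... can anyone help?

Thanks.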

+1

And what is the short version of the question? – 2011-04-11 17:45:58

+0

I'm pretty sure tuples are immutable. – DTing 2011-04-11 18:22:43
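To illustrate the point about immutability: a tuple cannot be changed in place, so "incrementing" an entry in a list of tuples really means replacing the whole tuple. Below is a minimal sketch, assuming a made-up starting `yeslist` and a hypothetical helper named `increment` (neither comes from the question). It is written so that it should also run on Python 3.

```python
# Hypothetical starting state, in the same shape as the question's yeslist.
yeslist = [("cluster1", 0), ("cluster2", 3)]

def increment(pairs, name):
    """Replace the (name, count) tuple for `name` with (name, count + 1)."""
    for i, (key, count) in enumerate(pairs):
        if key == name:
            pairs[i] = (key, count + 1)  # build a new tuple; the old one is not modified
            return
    pairs.append((name, 1))  # name not seen yet: start its count at 1

increment(yeslist, "cluster2")

# Assigning to an element of a tuple directly fails with TypeError.
try:
    yeslist[0][1] += 1
    immutable = False
except TypeError:
    immutable = True
```

Scanning the list on every update is linear in the number of clusters, which is one reason a dictionary keyed by cluster name is usually the simpler choice here.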

Answers

1
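import collections 

inputed = raw_input("Enter Evaluation for name including file extension...") 

reader = open(inputed, 'r') 

# Keep only data rows: skip blank lines and ARFF header lines starting with '@'. 
verses = [ line.strip() for line in reader if line.strip() and line[0] != '@' ] 

reader.close() 

cluster_count = collections.defaultdict(int) 
yes_count = collections.defaultdict(int) 

# Each data row ends with "...,CLASS,Cluster", so keep (cluster, class) per row. 
verse_infos = [ (split_verse[-1], split_verse[-2]) for split_verse 
                in (verse.split(",") for verse in verses) ] 

for verse in verse_infos: 
    cluster_count[verse[0]] += 1 
    # The CLASS attribute is declared as {Yes,No}, so match the capitalised form. 
    if verse[1] == 'Yes': 
        yes_count[verse[0]] += 1 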

You end up with two dictionaries:

cluster_count : keys = cluster#, values = count 
yes_count  : keys = cluster#, values = #yes 

And if you really want a list of tuples:

yes_tuples = sorted(yes_count.iteritems())
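The question's comments also mention working out the percentage of "Yes" verses in each cluster. A rough sketch of that step follows. It works on an in-memory list of rows instead of a file; the rows marked as made up are invented for illustration, and `yes_percentages` is a hypothetical helper name. It uses `collections.Counter` in place of `defaultdict(int)` and should run on Python 2.7 and Python 3.

```python
from collections import Counter

# Rows in the question's format. The first two are from the sample file;
# the three "Yes" rows are made up for illustration.
rows = [
    "0,1,0,0,0,0,0,0,0,1,1,No,cluster3",
    "11,0,0,1,0,0,0,0,0,2,5,No,cluster2",
    "12,0,0,0,0,0,0,0,0,2,6,Yes,cluster2",
    "13,0,0,0,0,0,0,0,0,2,7,Yes,cluster3",
    "14,0,0,0,0,0,0,0,0,2,8,Yes,cluster3",
]

def yes_percentages(data_rows):
    """Return a sorted list of (cluster, percent of rows labelled Yes)."""
    cluster_count = Counter()
    yes_count = Counter()
    for row in data_rows:
        fields = row.strip().split(",")
        label, cluster = fields[-2], fields[-1]  # last two columns: CLASS, Cluster
        cluster_count[cluster] += 1
        if label == "Yes":
            yes_count[cluster] += 1
    # A Counter returns 0 for clusters that never had a Yes row.
    return sorted(
        (cluster, 100.0 * yes_count[cluster] / total)
        for cluster, total in cluster_count.items()
    )

percentages = yes_percentages(rows)
for cluster, pct in percentages:
    print("%s: %.1f%% Yes" % (cluster, pct))
```

Comparing the CLASS column exactly, instead of testing `"Yes" in verse` on the whole line, avoids false matches if another column ever contains that text.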