2014-11-21 68 views
-1

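Get a matrix from items in two files

I have two files from which I want to build the following matrix of presence (1) and absence (0). If a fileB item (or col1, I'm not sure which input is best here) matches an item in cols 2-4, a score of "1" is recorded; otherwise "0" is recorded.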

File A:

col1 col2 col3 col4 
esd dus esd muq 
uum uum dus esd 
dus esd uum dus 
muq muq muq uum 

File B:

esd 
uum 
dus 
muq 

My attempt:

out_file=open("out.txt", "w")
for itemA in open("fileA", "r") as file1:
    file2=open("fileB", "r")
    for row in file2:
        for col in file2:
            if itemA==file2[row][col]:
                out_file.write(int(1))
            else:
                out_file.write(int(0))

Expected output:

col1 col2 col3 
esd 0 1 0 
uum 1 0 0 
dus 0 0 1 
muq 1 1 0 

Help with the Python code would be much appreciated.

+0

What is the actual output of your code? – boh 2014-11-21 14:40:38

+0

Use pandas. http://pandas.pydata.org/ – acushner 2014-11-21 14:42:45

+0

@boh: Looking at the code, my guess would be a syntax error ;) – Wolph 2014-11-21 14:48:56

Answers

1

Does something like this work for you?

with open('a.txt') as fh:
    for line in fh:
        cols = line.split()
        key = cols[0]
        print(key, end=' ')
        for col in cols[1:]:
            # Print 1 if they are the same, 0 otherwise
            print(int(col == key), end=' ')

        # Newline
        print()

With a.txt containing:

esd dus esd muq 
uum uum dus esd 
dus esd uum dus 
muq muq muq uum 

Output:

esd 0 1 0 
uum 1 0 0 
dus 0 0 1 
muq 1 1 0 
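
For reuse or testing, the same comparison could be wrapped in a small function that accepts any iterable of lines. This is only a sketch: the name `presence_rows` is made up for illustration, and it assumes whitespace-separated columns with the key in the first column.

```python
def presence_rows(lines):
    """Yield one output line per non-empty input line.

    Hypothetical helper: the first token is the key, and every later
    token becomes 1 if it equals the key, 0 otherwise.
    """
    for line in lines:
        cols = line.split()
        if not cols:  # skip blank lines
            continue
        key = cols[0]
        flags = [str(int(col == key)) for col in cols[1:]]
        yield ' '.join([key] + flags)


# Sample rows taken from the question's file A.
sample = [
    'esd dus esd muq',
    'uum uum dus esd',
    'dus esd uum dus',
    'muq muq muq uum',
]
result = list(presence_rows(sample))
print('\n'.join(result))
```

An open file object can be passed in place of `sample`, since iterating over a file also yields lines.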
1

You don't need file B if the first item on each line of the file is the thing you are looking for.

result = []
with open('input.txt') as in_file:
    for line in in_file:
        tokens = line.split()
        if not tokens:          # Skip blank lines.
            continue
        seek = tokens[0]        # We seek occurrences of the first token in the row.
        row = [seek]            # This list stores pieces of output.
        for item in tokens[1:]:
            if item == seek:
                row.append('1') # Note that these are strings, not integers.
            else:               # You might like to replace them with other
                row.append('0') # values such as 'Y'/'N' or 'T'/'F'.
        result.append(row)
lines = [' '.join(row) for row in result]  # Making lines of output.
text = '\n'.join(lines)                    # Gluing the lines together.
print(text)                                # Printing for verification.
with open('output.txt', 'w') as out_file:  # Then writing to file.
    out_file.write(text + '\n')

The code above takes this input:

esd dus esd muq 
uum uum dus esd 
dus esd uum dus 
muq muq muq uum 

and produces this output:

esd 0 1 0 
uum 1 0 0 
dus 0 0 1 
muq 1 1 0 
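
If other labels such as 'Y'/'N' are preferred, the per-row step might be factored out with the labels as parameters. A minimal sketch, where `mark_row`, `hit` and `miss` are names invented for this example:

```python
def mark_row(tokens, hit='1', miss='0'):
    """Return [key, label, label, ...] for one row of tokens.

    Hypothetical helper: tokens[0] is the key; each later token is
    replaced by `hit` if it equals the key and by `miss` otherwise.
    """
    seek = tokens[0]
    return [seek] + [hit if item == seek else miss for item in tokens[1:]]


rows = [
    'esd dus esd muq'.split(),
    'muq muq muq uum'.split(),
]
for tokens in rows:
    print(' '.join(mark_row(tokens, hit='Y', miss='N')))
```

The labels stay strings so that `' '.join` works without any conversion.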
0

If the entries in B don't have to match the first column of A, you can call `next` on either file object to read the two files in step:

fileA = 'fileA.tsv'
fileB = 'fileB.tsv'
outfilename = 'outfile.tsv'

with open(fileA) as fa, open(fileB) as fb, open(outfilename, 'w') as outfile:
    for line in fb:
        key = line.strip()
        corresp_a_line = next(fa)  # advance fileA in step with fileB
        fields = corresp_a_line.split()
        outfile.write(fields[0])  # write column 1
        for field in fields[1:]:
            # 1 if the field equals the fileB entry, 0 otherwise
            outfile.write("\t{}".format(int(field == key)))
        outfile.write("\n")
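
If the two files might differ in length, `zip` is one alternative: it pairs the lines and stops at the shorter input, where a bare `next` call would raise `StopIteration`. The sketch below runs on in-memory text via `io.StringIO` so it can be tried without creating files. `match_against_keys` is a hypothetical name, and the comparison uses `==`, i.e. it assumes exact matches are wanted.

```python
import io


def match_against_keys(file_a, file_b):
    """Pair line i of file_b (the key) with line i of file_a.

    Hypothetical helper. Returns tab-separated rows: column 1 of
    file_a, then 1/0 for each remaining column depending on whether
    it equals the key exactly.
    """
    out = []
    for a_line, b_line in zip(file_a, file_b):  # stops at the shorter input
        fields = a_line.split()
        if not fields:  # skip blank lines in file_a
            continue
        key = b_line.strip()
        flags = [str(int(field == key)) for field in fields[1:]]
        out.append('\t'.join([fields[0]] + flags))
    return out


# In-memory stand-ins for the question's file A and file B.
file_a = io.StringIO(
    'esd dus esd muq\nuum uum dus esd\ndus esd uum dus\nmuq muq muq uum\n'
)
file_b = io.StringIO('esd\nuum\ndus\nmuq\n')
rows = match_against_keys(file_a, file_b)
print('\n'.join(rows))
```

Real file objects from `open` can be passed in the same way, since the function only iterates over its arguments.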