Using pandas to order every other line

I have a program whose output is a tab-delimited file that looks like this:

marker A B C 
Bin_1 1 2 1 
marker C G H B T 
Bin_2 3 1 1 1 2 
marker B H T Z Y A C 
Bin_3 1 1 2 1 3 4 5

I want to sort it so that it looks like this:

marker A B C G H T Y Z 
Bin_1 1 2 1 0 0 0 0 0 
Bin_2 0 1 3 1 1 2 0 0 
Bin_3 4 1 5 0 1 2 3 1

This is what I have so far:

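import pandas as pd 
from collections import OrderedDict 
df = pd.read_csv('markers.txt',header=None,sep='\t') 
x = [list(row) for row in df.values]  # a real list, so len() and indexing also work on Python 3 
list_of_dicts = [] 
s = 0  # index of the current marker (header) row 
e = 1  # index of the matching bin (data) row 
g = len(x)+1 
while e < g: 
    new_dict = OrderedDict(zip(x[s],x[e])) 
    list_of_dicts.append(new_dict) 
    s += 2 
    e += 2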

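At first I was going to turn these into dictionaries, then do some counting and re-create a DataFrame, but that seems to take a lot of time and memory for such a simple task. Any suggestions for a better way to approach this?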


2017-03-01 Elle

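import pandas as pd 
with open('markers.txt') as f:  # split every line on whitespace 
    lines = [line.split() for line in f] 
# pair each marker line with the bin line below it, keyed by bin name 
dicts = {b[0]: pd.Series(dict(zip(m[1:], b[1:]))) 
         for m, b in zip(lines[::2], lines[1::2])} 
pd.concat(dicts).unstack(fill_value=0) 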

     A B C G H T Y Z 
Bin_1 1 2 1 0 0 0 0 0 
Bin_2 0 1 3 1 1 2 0 0 
Bin_3 4 1 5 0 1 2 3 1


2017-03-01 07:37:22 piRSquared

Not the most elegant thing in the world, but...

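# Assumes df holds one whole line of the file per row, in column 0 
headers = df.iloc[::2][0].apply(lambda x: x.split()[1:])  # marker names, minus the leading "marker" 
data = df.iloc[1::2][0].apply(lambda x: x.split()[1:])  # counts, minus the leading bin name 

result = [] 
for h, d in zip(headers.values, data.values): 
    result.append(pd.Series(d, index=h))  # one Series per bin, indexed by marker 
pd.concat(result, axis=1).fillna(0).T  # markers are aligned as rows, then transposed into columns 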

    A B C G H T Y Z 
0 1 2 1 0 0 0 0 0 
1 0 1 3 1 1 2 0 0 
2 4 1 5 0 1 2 3 1
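
The result above is indexed 0, 1, 2 rather than by bin name. Here is a minimal sketch of one way to keep the names, using the same one-Series-per-bin idea. The `raw` string and the names `rows` and `result` are only for illustration; the sample data from the question is inlined so the snippet runs without a file, and the counts are assumed to be integers.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained
raw = (
    "marker\tA\tB\tC\n"
    "Bin_1\t1\t2\t1\n"
    "marker\tC\tG\tH\tB\tT\n"
    "Bin_2\t3\t1\t1\t1\t2\n"
    "marker\tB\tH\tT\tZ\tY\tA\tC\n"
    "Bin_3\t1\t1\t2\t1\t3\t4\t5\n"
)

lines = [line.split() for line in io.StringIO(raw)]

rows = {}
for header, values in zip(lines[::2], lines[1::2]):
    # header[0] is the literal "marker"; values[0] is the bin name
    rows[values[0]] = pd.Series([int(v) for v in values[1:]], index=header[1:])

# Build with bins as columns, transpose so bins become rows,
# fill the holes with 0 and sort the marker columns alphabetically
result = pd.DataFrame(rows).T.fillna(0).astype(int).sort_index(axis=1)
result.columns.name = "marker"
print(result)
```

Keying the dict by bin name is what carries the names through to the index, so no separate renaming step is needed.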


2017-03-01 05:24:13 dataflow

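The point is that when you "append" DataFrames, the result is a DataFrame whose columns are the union of the input columns, with NaN (or whatever you choose) in the holes. So: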

$ cat test.py 
import pandas as pd 

frames = [] 
with open('/tmp/foo.tsv') as markers: 
    while True: 
        line = markers.readline()  # marker (header) line 
        if not line: 
            break 
        columns = line.strip().split('\t') 
        data = markers.readline().strip().split('\t')  # matching bin line 
        frames.append(pd.DataFrame(data=[data], columns=columns)) 

# DataFrame.append was removed in pandas 2.0; pd.concat takes the same union of columns 
frame = pd.concat(frames, sort=True).fillna(0) 

print(frame) 
$ python test.py < /tmp/foo.tsv 
    A B C G H T Y Z marker 
0 1 2 1 0 0 0 0 0 Bin_1 
0 0 1 3 1 1 2 0 0 Bin_2 
0 4 1 5 0 1 2 3 1 Bin_3

If you aren't using pandas anywhere else, this may (or may not) be overkill. But if you're already using it, I think it's perfectly reasonable.
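
To see the column-union behaviour on its own, here is a minimal sketch with two tiny made-up frames (the names `a`, `b`, `combined` and `filled` are only for illustration). It uses `pd.concat`, which does an outer join on columns by default.

```python
import pandas as pd

# Two one-row frames that share only some of their columns
a = pd.DataFrame([{"marker": "Bin_1", "A": 1, "B": 2}])
b = pd.DataFrame([{"marker": "Bin_2", "B": 1, "G": 3}])

# Columns of the result are the union of the inputs; cells that
# had no value in the original frame come out as NaN
combined = pd.concat([a, b], ignore_index=True)
print(combined)

# Replace the holes with 0, as in the answer above
filled = combined.fillna(0)
print(filled)
```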


2017-03-01 05:25:42

Why not process the data into a dictionary as you read it, and then construct the DataFrame:

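>>> with open(...) as f: 
...     d = {} 
...     for marker, bins in zip(f, f):  # zip(f, f) reads two lines per iteration 
...         z = zip(marker.split(), bins.split()) 
...         _, bin_name = next(z)  # first pair is ('marker', 'Bin_N') 
...         d[bin_name] = dict(z) 
... 
>>> pd.DataFrame(d).fillna(0).T 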
     A B C G H T Y Z 
Bin_1 1 2 1 0 0 0 0 0 
Bin_2 0 1 3 1 1 2 0 0 
Bin_3 4 1 5 0 1 2 3 1

If you really need the axis name on the columns:

>>> pd.DataFrame(d).fillna(0).rename_axis('marker').T 
marker A B C G H T Y Z 
Bin_1 1 2 1 0 0 0 0 0 
Bin_2 0 1 3 1 1 2 0 0 
Bin_3 4 1 5 0 1 2 3 1
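
Along the same lines, pandas can also build the frame directly from a list of dicts, which is close to what the code in the question was already collecting. This is a sketch under a few assumptions: the sample data is inlined through `io.StringIO` instead of a real file, the counts are integers, and the names `records` and `table` are made up for the example.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained
raw = (
    "marker\tA\tB\tC\n"
    "Bin_1\t1\t2\t1\n"
    "marker\tC\tG\tH\tB\tT\n"
    "Bin_2\t3\t1\t1\t1\t2\n"
    "marker\tB\tH\tT\tZ\tY\tA\tC\n"
    "Bin_3\t1\t1\t2\t1\t3\t4\t5\n"
)

records = []
stream = io.StringIO(raw)
for header, values in zip(stream, stream):  # two lines per iteration
    h, v = header.split(), values.split()
    record = {"marker": v[0]}  # bin name goes under the 'marker' key
    record.update(zip(h[1:], (int(x) for x in v[1:])))
    records.append(record)

# One row per dict; keys missing from a dict become NaN, then 0
table = (
    pd.DataFrame(records)
    .set_index("marker")
    .fillna(0)
    .astype(int)
    .sort_index(axis=1)
)
print(table)
```

Only one DataFrame is constructed at the end, so nothing is copied repeatedly while reading.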


2017-03-01 05:30:33 AChampion
