pandas read_table with multiple column definitions

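I have a code that generates text data, with diagnostic output appended to a single text file as the run progresses. Depending on my settings, different measurements are taken, and the relevant header line is written at the start of each run. The output looks something like this: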

# time diagnostic_1, diagnostic_2 
0.3 0.25376334 0.07494259 
1.7 0.3407481 0.03018158 
2.2 0.45349798 0.85539953 
3.4 0.22368132 0.52276335 
4.8 0.17906047 0.40659944 
# time diagnostic_1, diagnostic_3 
3.4 0.65968555 0.67085918 
4.8 0.2122165 0.80855038 
5.1 0.96943873 0.41903639 
6.8 0.16242912 0.91949807 
7.0 0.68513815 0.22881037 
8.8 0.83304083 0.02394251 
9.2 0.01699944 0.58386401 
# time diagnostic_2, diagnostic_3 
8 0.79595325 0.8913367 
9 0.46277533 0.47859048 
10 0.30773957 0.64765873 
11 0.19077614 0.39109832 
12 0.0020474 0.44365015

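Is there a way to make pandas.read_table return after reading a specified string, rather than after a specified number of rows? My current workaround is to do a first pass with grep to find where the splits are, and then read each block with numpy.loadtxt.

This spits out the data frames with the information I want: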

from subprocess import check_output
import numpy as np
import pandas as pd
from itertools import cycle

fname = 'foo'
headerrows = [int(s.split(b':')[0])
              for s in check_output(['grep', '-on', '^#', fname]).split()]
# -1 to the range, because the header row is read separately
limiters = [range(a, b - 1) for a, b in zip(headerrows[:-1], headerrows[1:])]
limiters += [cycle([True])]

nameses = [['t', 'diagnostic_1', 'diagnostic_2'],
           ['t', 'diagnostic_1', 'diagnostic_3'],
           ['t', 'diagnostic_2', 'diagnostic_3']]
dat = []
with open(fname, 'r') as fobj:
    for names, limit in zip(nameses, limiters):
        line = fobj.readline()
        dat.append(pd.DataFrame(np.loadtxt((s for i, s in zip(limit, fobj))),
                                columns=names))
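
For comparison, the same split can probably be done in a single pass in plain Python, without grep. This is only a sketch under some assumptions, not the approach used above: `read_blocks` is a hypothetical helper, it takes the column names from each `#` line (so the first column ends up named `time` rather than `t`), and it assumes every data line holds only whitespace-separated numbers.

```python
import io
import pandas as pd


def read_blocks(lines):
    """Return one DataFrame per '#' header line found in `lines`."""
    blocks = []  # list of (column names, data rows) pairs
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('#'):
            # names may be separated by spaces, commas, or both
            names = line.lstrip('#').replace(',', ' ').split()
            blocks.append((names, []))
        elif blocks:
            # data row: belongs to the most recent header
            # (rows before the first header are ignored)
            blocks[-1][1].append([float(x) for x in line.split()])
    return [pd.DataFrame(rows, columns=names) for names, rows in blocks]


# A few lines in the format shown above
sample = """# time diagnostic_1, diagnostic_2
0.3 0.25376334 0.07494259
1.7 0.3407481 0.03018158
# time diagnostic_1, diagnostic_3
3.4 0.65968555 0.67085918
"""
frames = read_blocks(io.StringIO(sample))
for frame in frames:
    print(frame)
```

For a real file, an open file object can be passed in place of the `io.StringIO` wrapper, since the helper only iterates over lines.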

Source

2016-01-22 Elliot

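Here is the complete script to load the arrays. The monkey business of updating and dropping columns is necessary to keep the combined index. retval.merge(dset, how='outer') gives the same columns, but with an integer index.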

from subprocess import check_output
import numpy as np
import pandas as pd
from itertools import cycle

fname = 'foo'
headerrows = [int(s.split(b':')[0])
              for s in check_output(['grep', '-on', '^#', fname]).split()]
# subtract one because the header row is read separately
limiters = [range(a, b - 1) for a, b in zip(headerrows[:-1], headerrows[1:])]
limiters += [cycle([True])]

nameses = [['t', 'diagnostic_1', 'diagnostic_2'],
           ['t', 'diagnostic_1', 'diagnostic_3'],
           ['t', 'diagnostic_2', 'diagnostic_3']]

retval = None
with open(fname, 'r') as fobj:
    for names, limit in zip(nameses, limiters):
        fobj.readline()  # skip the header row
        dset = pd.DataFrame(np.loadtxt((row for i, row in zip(limit, fobj))),
                            columns=names)
        dset.set_index('t', inplace=True)
        if retval is None:
            # first block: nothing to merge into yet
            retval = dset
            continue
        # merge in the new dataset; clashing columns from dset get a '_' suffix
        retval = retval.merge(dset, how='outer',
                              left_index=True, right_index=True,
                              suffixes=('', '_'))
        for col in [c for c in retval.columns if not c.endswith('_')]:
            upd = col + '_'
            if upd in retval.columns:
                # prefer the new values and fall back to the existing ones;
                # assigning back avoids relying on an in-place update of
                # retval[col], which may act on a copy
                retval[col] = retval[upd].combine_first(retval[col])
                retval.drop(upd, axis=1, inplace=True)
print(retval)
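
The merge-and-update loop above can possibly be expressed more compactly with `DataFrame.combine_first`, which takes the union of the index and columns and fills missing values in one frame from the other. This is a swapped-in technique, not the script above: `combine_blocks` is a hypothetical helper, and it assumes every frame is already indexed by `t` and that later blocks should win where values overlap.

```python
from functools import reduce
import pandas as pd


def combine_blocks(frames):
    """Fold time-indexed frames together; later frames win on overlap."""
    return reduce(lambda acc, new: new.combine_first(acc), frames)


# Overlapping rows taken from the first two blocks of the sample output
first = pd.DataFrame({'diagnostic_1': [0.22368132, 0.17906047],
                      'diagnostic_2': [0.52276335, 0.40659944]},
                     index=pd.Index([3.4, 4.8], name='t'))
second = pd.DataFrame({'diagnostic_1': [0.65968555, 0.2122165, 0.96943873],
                       'diagnostic_3': [0.67085918, 0.80855038, 0.41903639]},
                      index=pd.Index([3.4, 4.8, 5.1], name='t'))

merged = combine_blocks([first, second])
print(merged)
```

Times present in only one block keep NaN in the columns that block did not measure, which matches the outer-merge behaviour of the script.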

Source

2016-01-22 14:55:19 Elliot
