2013-11-29 56 views
2

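I am trying to write a simple script that converts a CSV output file from a Fortran code into a pandas DataFrame, so that I can reshape the data out of its long format and do further analysis. The CSV has two columns, but it consists of multiple appended blocks of data, each of shape [N,2] (each sample name has the format RN_x). I came up with the following code, but the resulting DataFrame object does not allow any analysis. I have also attached a sample file (greatly shortened from the original). By the way, the first column in the data file is meant to be a date, but in the output it is a number corresponding to the day in the simulation. Any advice would be appreciated, using either numpy or pandas.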

import numpy as np 
import pandas as pd 
import csv as csv 
readdata = csv.reader(open('C:/data/Test.csv', 'r')) 
data = [] 
for row in readdata: 
    data.append(row) 
a = np.array(data).reshape(11,-1, order = 'F') 
col = a[0,:4].reshape(4) 
row = pd.Index(a[4:,0:1].reshape(7)) 
b = a[4:,5:] 
df = pd.DataFrame(b, index = row, columns = col) 

Sample:

RN_48865, 
1,Observed 
1,0 
259,Computed 
1,0.000014 
91,0.000014 
182,0.000014 
274,0.000014 
366,0.000014 
457,0.000014 
548,0.000014 
RN_7445, 
1,Observed 
1,0 
259,Computed 
1,0.000013 
91,0.000013 
182,0.000013 
274,0.000013 
366,0.000013 
457,0.000013 
548,0.000013 
RN_9288, 
1,Observed 
1,0 
259,Computed 
1,0.000011 
91,0.000011 
182,0.000011 
274,0.000011 
366,0.000011 
457,0.000011 
548,0.000011 
RN_10955, 
1,Observed 
1,0 
259,Computed 
1,0.000014 
91,0.000014 
182,0.000014 
274,0.000014 
366,0.000014 
457,0.000014 
548,0.000014 

Example output:

Index,RN_48865,RN_7445,RN_9288,RN_10955 
1,0.000014,0.000013,0.000011,0.000014 
91,0.000014,0.000013,0.000011,0.000014 
182,0.000014,0.000013,0.000011,0.000014 
274,0.000014,0.000013,0.000011,0.000014 
366,0.000014,0.000013,0.000011,0.000014 
457,0.000014,0.000013,0.000011,0.000014 
548,0.000014,0.000013,0.000011,0.000014 
+0

So, what is the question? – cyborg

+0

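Sorry, that was not clear. How do I turn the long file into a DataFrame whose index is the first data column (parsed into dates by treating the number as days relative to a base date, e.g. 1 January 1995), whose values come from the second column, and whose column labels are the multiple "RN_x" labels? The original long file has repeated blocks of data representing the output at different "locations". I want to be able to analyse the statistics for each location. – user2989613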

+0

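I don't understand "populated with data from the second column, and multiple 'RN_x' labels as column labels." Why don't you simply show the data (with '\n's)? – cyborg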

Answers

1

You are actually asking several questions. This is what I could understand from the desired output:

source="""RN_48865, 
    1,Observed 
    1,0 
    259,Computed 
    1,0.000014 
    91,0.000014 
    182,0.000014 
    274,0.000014 
    366,0.000014 
    457,0.000014 
    548,0.000014 
    RN_7445, 
    1,Observed 
    1,0 
    259,Computed 
    1,0.000013 
    91,0.000013 
    182,0.000013 
    274,0.000013 
    366,0.000013 
    457,0.000013 
    548,0.000013 
    RN_9288, 
    1,Observed 
    1,0 
    259,Computed 
    1,0.000011 
    91,0.000011 
    182,0.000011 
    274,0.000011 
    366,0.000011 
    457,0.000011 
    548,0.000011 
    RN_10955, 
    1,Observed 
    1,0 
    259,Computed 
    1,0.000014 
    91,0.000014 
    182,0.000014 
    274,0.000014 
    366,0.000014 
    457,0.000014 
    548,0.000014 
""" 
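import pandas as pd
import numpy as np
from io import StringIO

# Read the two-column text; column 0 holds block names and day numbers as strings.
df = pd.read_csv(StringIO(source), header=None)
# Row positions at which a new "RN_x" block starts.
rns = np.where(df[0].apply(lambda x: x.lstrip().startswith('RN_')))[0]
# Every block is assumed to have the same number of rows.
length = rns[1] - rns[0]
# The first 4 rows of each block (name, Observed, "1,0", Computed) are not data.
index = df[0].iloc[4:length].str.strip().astype(int).values
cols = df[0][::length].apply(lambda x: x.lstrip()).values
result_df = pd.DataFrame(index=index)
for name, start in zip(cols, range(0, len(df), length)):
    # Convert the values to floats so that numeric analysis works on the result.
    result_df[name] = df[1].iloc[start + 4 : start + length].astype(float).values
print(result_df)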

Output:

 RN_48865 RN_7445 RN_9288 RN_10955 
1 0.000014 0.000013 0.000011 0.000014 
91 0.000014 0.000013 0.000011 0.000014 
182 0.000014 0.000013 0.000011 0.000014 
274 0.000014 0.000013 0.000011 0.000014 
366 0.000014 0.000013 0.000011 0.000014 
457 0.000014 0.000013 0.000011 0.000014 
548 0.000014 0.000013 0.000011 0.000014 
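If the blocks do not all have the same number of rows, the fixed `length` stride above no longer lines up. A minimal sketch of a line-by-line alternative follows; the function name `parse_blocks` and the `skip=3` default (the three non-data rows after each RN_x line in the sample) are assumptions for illustration, not part of the original code.

```python
import io
import pandas as pd

def parse_blocks(text, skip=3):
    """Build one column per RN_x block, indexed by day number.

    `skip` is the assumed number of non-data rows that follow each
    RN_x line (Observed, "1,0" and Computed in the sample file).
    """
    columns = {}
    name, rows_seen = None, 0
    for line in io.StringIO(text):
        first, _, second = line.strip().partition(',')
        if not first:
            continue  # ignore blank lines
        if first.startswith('RN_'):
            # Start of a new block: remember its name and reset the row counter.
            name, rows_seen = first, 0
            columns[name] = {}
        elif name is not None:
            rows_seen += 1
            if rows_seen > skip:
                columns[name][int(first)] = float(second)
    # Days missing from a shorter block end up as NaN after alignment.
    return pd.DataFrame(columns).sort_index()

sample = """RN_48865,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
RN_7445,
1,Observed
1,0
259,Computed
1,0.000013
91,0.000013
182,0.000013
"""
wide = parse_blocks(sample)
print(wide)
print(wide.describe())  # per-location summary statistics
```

Because the values are stored as floats, `describe()`, `mean()` and similar methods work per column, which is the per-location analysis asked about in the comments.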

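For the dates, use:

import pandas
from datetime import timedelta

pandas.read_csv('file',
    parse_dates=[0], # 0th column
    # note: date_parser is deprecated in pandas 2.0 and later
    date_parser=lambda x: pandas.Timestamp('1995-1-1') + timedelta(int(x)))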
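For this particular file the first column also contains the RN_x names, so it may be simpler to convert the day numbers after the wide table has been built. The following is a sketch under assumptions: the small `wide` frame is made-up stand-in data, the base date 1995-01-01 is taken from the example in the comments, and day n is mapped to the base date plus n days, as in the lambda above.

```python
import pandas as pd

# Hypothetical wide table indexed by simulation day number.
wide = pd.DataFrame(
    {"RN_48865": [0.000014, 0.000014], "RN_7445": [0.000013, 0.000013]},
    index=[1, 91],
)

base = pd.Timestamp("1995-01-01")  # assumed base date
# Turn the integer day numbers into a DatetimeIndex: base + n days.
wide.index = base + pd.to_timedelta(wide.index, unit="D")
print(wide)
```

If day 1 should instead be the base date itself, subtract one from the day numbers before the conversion.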
+0

Thanks, this helps with one part. As user cyborg pointed out, my question was not clear. – user2989613

+0

Great. Thank you very much. It looks like I went wrong there. Still a lot to learn. – user2989613