2016-11-09 87 views

Pandas: reading a text file generated from dataframe.to_string

I have a text file containing this table:

    Ion TheoWavelength   Blended_Set 
Line_Label                                    
H1_4340A Hgamma_5_2  4340.471    None 
He1_4472A  HeI_4471  4471.479    None 
He2_4686A HeII_4686  4685.710    None 
Ar4_4711A  [ArIV]  4711.000    None 
Ar4_4740A  [ArIV]  4740.000    None 
H1_4861A  Hbeta_4_2  4862.683    None 

The table was generated from a pandas DataFrame using dataframe.to_string, and the resulting unicode variable was then saved to the file.

I would like to use a pandas function to create a DataFrame from this file:

import pandas as pd 
df = pd.read_csv('my_table_file.txt', delim_whitespace = True, header = 0, index_col = 0) 

But I get this error:

Traceback (most recent call last): 
    File 
    df = pd.read_csv(table, delim_whitespace = True, header = 0, index_col = 0) 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f 
    return _read(filepath_or_buffer, kwds) 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 325, in _read 
    return parser.read() 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 815, in read 
    ret = self._engine.read(nrows) 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1314, in read 
    data = self._reader.read(nrows) 
    File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748) 
    File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003) 
    File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731) 
    File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602) 
    File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325) 
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4 

I believe this happens because the name of the index column sits on a line of its own.
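A quick way to check this diagnosis, as a minimal sketch: count the whitespace-separated fields on each line. The `table_text` string is a hypothetical stand-in for the first lines of the file.

```python
# Hypothetical stand-in for the first lines of the text file.
table_text = """\
    Ion TheoWavelength   Blended_Set
Line_Label
H1_4340A Hgamma_5_2  4340.471    None
He1_4472A  HeI_4471  4471.479    None
"""

# Count the whitespace-separated fields on every line.
field_counts = [len(line.split()) for line in table_text.splitlines()]
print(field_counts)  # [3, 1, 4, 4]
```

The header has 3 fields, the index label line has 1, and the data rows have 4, which matches the message "Expected 3 fields in line 3, saw 4".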

Is there any way to avoid this problem, or to export the table without this label?

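P.S. I tried exporting the table with dataframe.to_csv, but as far as I know it does not let you control the formatting of the columns when they have different dtypes.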
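One possible way to export the table without the label, as a sketch rather than a definitive fix: `DataFrame.to_string` accepts `index_names=False`, which leaves out the separate index-name line. The small DataFrame below is a stand-in built from a few rows of the table above, and `restored` is a name chosen for illustration.

```python
from io import StringIO

import pandas as pd

# Stand-in for the original DataFrame (a few rows copied from the table).
df = pd.DataFrame(
    {
        "Ion": ["Hgamma_5_2", "HeI_4471", "[ArIV]"],
        "TheoWavelength": [4340.471, 4471.479, 4711.0],
        "Blended_Set": [None, None, None],
    },
    index=pd.Index(["H1_4340A", "He1_4472A", "Ar4_4711A"], name="Line_Label"),
)

# index_names=False omits the separate "Line_Label" line from the output.
text = df.to_string(index_names=False)

# The header now has one field fewer than the data rows, so read_csv
# takes the first column as the index.
restored = pd.read_csv(StringIO(text), sep=r"\s+")
restored.index.name = "Line_Label"  # the name is not stored in the text
print(restored)
```

The index name has to be set again by hand, because it is no longer part of the exported text.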

Answers

1

In this case I would use the HDF5 format - it will take care of your index.

Besides that, it is faster than CSV, lets you select data conditionally (as with an SQL database), supports compression, etc.

Demo:

In [2]: df 
Out[2]: 
        Ion TheoWavelength Blended_Set 
Line_Label 
H1_4340A Hgamma_5_2  4340.471  None 
He1_4472A  HeI_4471  4471.479  None 
He2_4686A HeII_4686  4685.710  None 
Ar4_4711A  [ArIV]  4711.000  None 
Ar4_4740A  [ArIV]  4740.000  None 
H1_4861A  Hbeta_4_2  4862.683  None 

In [3]: df.to_hdf('d:/temp/myhdf.h5', 'df', format='t', data_columns=True) 

In [4]: x = pd.read_hdf('d:/temp/myhdf.h5', 'df') 

In [5]: x 
Out[5]: 
        Ion TheoWavelength Blended_Set 
Line_Label 
H1_4340A Hgamma_5_2  4340.471  None 
He1_4472A  HeI_4471  4471.479  None 
He2_4686A HeII_4686  4685.710  None 
Ar4_4711A  [ArIV]  4711.000  None 
Ar4_4740A  [ArIV]  4740.000  None 
H1_4861A  Hbeta_4_2  4862.683  None 

You can even query your HDF5 file like an SQL DB:

In [20]: x2 = pd.read_hdf('d:/temp/myhdf.h5', 'df', where="TheoWavelength > 4500 and Ion == '[ArIV]'") 

In [21]: x2 
Out[21]: 
       Ion TheoWavelength Blended_Set 
Line_Label 
Ar4_4711A [ArIV]   4711.0  None 
Ar4_4740A [ArIV]   4740.0  None 

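Thank you very much for your reply. The SQL functionality is very interesting and it works nicely... However, in this case it must be a text file. I managed to make it work by setting a comment in "read_csv" for any line starting with "L" (which is not an issue in this data). I tried to use ignore_rows, but it does not work if you set the index column... which is weird... – Delosari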
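The comment-based workaround described in this comment might look roughly like the sketch below. It is an assumption about the exact call: it leaves out `index_col` and relies on `read_csv` taking the first column as the index when the header has one field fewer than the data rows. The `table_text` string is a hypothetical stand-in for the file.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the text file.
table_text = """\
    Ion TheoWavelength   Blended_Set
Line_Label
H1_4340A Hgamma_5_2  4340.471    None
Ar4_4711A  [ArIV]  4711.000    None
"""

# comment="L" discards everything from the first "L" on each line, so the
# "Line_Label" line becomes empty and is skipped. This is only safe because
# no other field in this data contains a capital "L".
df = pd.read_csv(StringIO(table_text), sep=r"\s+", comment="L")
df.index.name = "Line_Label"
print(df)
```

Note that `comment` applies to the character wherever it occurs, not only at the start of a line, so this trick would truncate any row that contains a capital "L".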

0

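Consider Python's built-in StringIO, a class of the io module in Python 3 (in Python 2, StringIO is its own module), which reads text from a scalar string. Call it inside pandas' read_table(), then take the headers from the first lines of the string content: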

from io import StringIO 
import pandas as pd 

data = ''' 
        Ion TheoWavelength   Blended_Set 
Line_Label 
H1_4340A Hgamma_5_2  4340.471    None 
He1_4472A  HeI_4471  4471.479    None 
He2_4686A HeII_4686  4685.710    None 
Ar4_4711A  [ArIV]  4711.000    None 
Ar4_4740A  [ArIV]  4740.000    None 
H1_4861A  Hbeta_4_2  4862.683    None 
''' 

df = pd.read_table(StringIO(data), sep=r"\s+", header=None, skiprows=3, index_col=0) 

headers = [item for line in data.split('\n')[0:3] for item in line.split()][0:4] 
df.columns = headers[0:3] 
df.index.name = headers[3] 

If you need to read from a file, use read_table on the file, then read the text file again to extract the headers:

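# skiprows=3 assumes the file starts with a blank line, like the string above; 
# use skiprows=2 if the column header is on the first line. 
df = pd.read_table("DataframeToString.txt", sep=r"\s+", header=None, skiprows=3, index_col=0) 

# Read the same file again and split it into whitespace-separated tokens. 
data = [] 
with open("DataframeToString.txt", 'r') as f: 
    data.append(f.read().split()) 

# The first three tokens are the column names, the fourth is the index name. 
df.index.name = data[0][3] 
df.columns = data[0][0:3] 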

print(df) 
#     Ion TheoWavelength Blended_Set 
# Line_Label           
# H1_4340A Hgamma_5_2  4340.471  None 
# He1_4472A  HeI_4471  4471.479  None 
# He2_4686A HeII_4686  4685.710  None 
# Ar4_4711A  [ArIV]  4711.000  None 
# Ar4_4740A  [ArIV]  4740.000  None 
# H1_4861A  Hbeta_4_2  4862.683  None 

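Thank you very much for your reply, but one question: if the "data" is in a text file, do you need to open the file twice (for example with readlines), or can it be done directly? – Delosari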
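One way to avoid opening the file twice, as a sketch: read the whole text once, take the headers from its first two non-blank lines, and pass the remaining lines to pandas through StringIO. The helper name `read_to_string_table` is hypothetical, and the code assumes the layout shown in the question (one header line, one index-name line, then the data rows).

```python
from io import StringIO

import pandas as pd


def read_to_string_table(text):
    """Parse text in the layout produced by DataFrame.to_string (hypothetical helper)."""
    # Drop blank lines so a leading empty line does not shift the header.
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()     # column names
    index_name = lines[1].strip()  # index name on its own line
    body = "\n".join(lines[2:])    # data rows only
    df = pd.read_csv(StringIO(body), sep=r"\s+", header=None, index_col=0)
    df.columns = columns
    df.index.name = index_name
    return df


# Stand-in for the file contents; with a real file, pass f.read() instead.
sample = """
    Ion TheoWavelength   Blended_Set
Line_Label
H1_4340A Hgamma_5_2  4340.471    None
H1_4861A  Hbeta_4_2  4862.683    None
"""
df = read_to_string_table(sample)
print(df)
```

With a file on disk, a single `with open(path) as f: df = read_to_string_table(f.read())` is enough, so the file is opened only once.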