2017-10-28 226 views

How to set the correct parameters when reading a CSV file (Python, pandas)

Training data: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Test data: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test

import numpy as np 
import pandas as pd 

train_data = pd.read_csv('adult.data.txt',sep= ',', header= None) 
test_data = pd.read_csv('adult.test.txt',sep= ',', header= None) 

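When I do this, reading the test data raises an error, but reading the training data does not, even though both files have the same layout: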

Traceback (most recent call last):
  File "dtree.py", line 61, in <module>
    dtree()
  File "dtree.py", line 12, in dtree
    test_data = pd.read_csv('adult.test.txt',sep= ',', header= None)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 285, in _read
    return parser.read()
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1197, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
  File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)
  File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
  File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 15

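So I changed the call for test_data to header=0 and it runs, but the result has only 1 column instead of 15 as in train_data. This causes problems because test_data.values gives only the last column, unlike train_data.values.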

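I noticed two differences between the test and training data. In the test file every line ends with a full stop, while the training file has nothing there, and the first line of the test file is not a data entry as it is in the training file. Is one of these causing the problem? How do I work around them?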

Answers


The pandas.read_csv() function has a parameter for this:

skiprows : list-like or integer or callable, default None 

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. 

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2]. 

You can find more about it in the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
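As a rough illustration of the three forms that skiprows accepts, here is a minimal sketch. The in-memory SAMPLE text and the variable names are made up for the example and are not part of the adult dataset.

```python
import io
import pandas as pd

# Hypothetical in-memory sample: one junk line followed by three data rows.
SAMPLE = "junk line\n1,a\n2,b\n3,c\n"

# Integer: skip that many lines from the start of the file.
by_count = pd.read_csv(io.StringIO(SAMPLE), header=None, skiprows=1)

# List-like: skip exactly these 0-indexed line numbers.
by_list = pd.read_csv(io.StringIO(SAMPLE), header=None, skiprows=[0, 2])

# Callable: called with each line index; returning True skips that line.
by_callable = pd.read_csv(
    io.StringIO(SAMPLE), header=None, skiprows=lambda i: i in [0, 2]
)

print(by_count)
print(by_list)
print(by_callable)
```

For a file with a single unwanted line at the top, the integer form is usually the simplest choice.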

The first line of the file is:

|1x3 Cross validator

It should be interpreted neither as a header nor as a row of the DataFrame.

You should try reading your file with:

test_data = pd.read_csv('adult.test.txt',sep= ',', header= None,skiprows=1)
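The question also mentions the full stop at the end of every test line, which skiprows does not touch. One possible way to handle both differences is sketched below. It assumes the label is the last of the 15 columns and that every field is followed by a comma and a space. The two-row SAMPLE is an illustrative stand-in for adult.test.txt so the snippet runs without the file. skipinitialspace is an extra read_csv option not mentioned above.

```python
import io
import pandas as pd

# Illustrative stand-in for adult.test.txt: a non-data first line,
# then rows whose last field ends with a period (assumed layout).
SAMPLE = (
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
    "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, "
    "Husband, White, Male, 0, 0, 50, United-States, <=50K.\n"
)

test_data = pd.read_csv(
    io.StringIO(SAMPLE),    # replace with 'adult.test.txt' for the real file
    header=None,
    skiprows=1,             # drop the non-data first line
    skipinitialspace=True,  # drop the space that follows each comma
)

# The label is assumed to be the last column (index 14);
# strip the trailing period so it matches the training labels.
test_data[14] = test_data[14].str.rstrip(".")

print(test_data.shape)
print(test_data[14].tolist())
```

Passing skipinitialspace=True when reading the training file as well should keep the string values in the two frames comparable.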