2017-07-24 84 views
1

pandas read_csv: header/skiprows not working

This is my first time asking a question here, so apologies if the formatting is bad; please let me know how I can improve my question.

I'm trying to better understand the header and skiprows parameters of the pandas.read_csv() function.

Here is an example of the raw data I want to read in Python:

MiniSonde 5 43656 
"Log File Name : lwrhyp_deploy_20170104" 
"Setup Date (MMDDYY) : 010417" 
"Setup Time (HHMMSS) : 114539" 
"Starting Date (MMDDYY) : 010417" 
"Starting Time (HHMMSS) : 140000" 
"Stopping Date (MMDDYY) : 123169" 
"Stopping Time (HHMMSS) : 235959" 
"Interval (HHMMSS) : 010000" 
"Sensor warmup (HHMMSS) : 000100" 
"Circltr warmup (HHMMSS) : 000030" 


"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt","" 
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts","" 

01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,"" 
01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,"" 
01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,"" 
01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,"" 
01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,"" 
01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,"" 
01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,"" 
01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,"" 
01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,"" 
01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,"" 
01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,"" 

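I am trying to use the line that begins with "Date" or "MMDDYY" as my header row. When I open the raw data in a text editor, the line corresponding to "Date" is line 14, which would be line 13 in zero-indexed Python land.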

I used the following code, thinking it would skip the first 12 lines and start reading data at line 13:

test = pd.read_csv(filepath, skiprows=12, skip_blank_lines=True) 

But it produced this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte 

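After a lot of trial and error I found that the following code produces the kind of result I'm after, but I don't understand why it works: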

test = pd.read_csv(filepath, skiprows=[14], header=11, skip_blank_lines=True) 

I don't understand how read_csv counts lines. Am I wrong that the header row is on line 13 rather than line 11? The code only works if skiprows=[14]; why is that?

As a side note, is there a way to prevent the blank columns present in the raw data from being read into the data frame?

+0

I don't think you need skiprows if you use header=11 – jacoblaw

+0

There is an "ø" character in the file, which may be causing the problem. Try this: test = pd.read_csv(filepath, encoding='ISO-8859-1') –

+0

@VenkateshDurgumahanthi - Changing the encoding did the trick! Can you elaborate on why this works? –
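On why the encoding change helps: the byte named in the traceback, 0xf8, is the ISO-8859-1 (Latin-1) encoding of "ø", which appears in the units row as "øC". The short standard-library sketch below only illustrates that one point. It assumes the file was written in ISO-8859-1, which the traceback suggests but the question does not state, and the variable names are illustrative.

```python
# The two bytes that "øC" occupies if the file is stored as ISO-8859-1.
raw = b"\xf8C"

# UTF-8 never uses 0xf8 as the first byte of a character, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
    utf8_error = ""
except UnicodeDecodeError as exc:
    utf8_ok = False
    utf8_error = str(exc)

# ISO-8859-1 maps every single byte to one character, so the same bytes decode.
latin1_text = raw.decode("ISO-8859-1")

print(utf8_ok)
print(utf8_error)
print(latin1_text)
```

read_csv defaults to UTF-8, so it hits the same failure as soon as it has to decode that row; passing encoding='ISO-8859-1' tells it to use the second decoding instead.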

Answers

0

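First, skiprows isn't doing what you think it is here. When you give it a list as input, it skips exactly those lines while parsing the file. For what you want, just use header.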

Second, pandas zero-indexes the lines of the file.

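Third, when you have skip_blank_lines=True, it re-indexes the lines of the file before considering the header value. So in your example, it does not index the blank lines 11 and 12 before your header (nor the blank line after the header rows). Keeping in mind that pandas zero-indexes the lines of the file, we can see how header=11 lines up with the header: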

line # : content
0:MiniSonde 5 43656 
1:"Log File Name : lwrhyp_deploy_20170104" 
2:"Setup Date (MMDDYY) : 010417" 
3:"Setup Time (HHMMSS) : 114539" 
4:"Starting Date (MMDDYY) : 010417" 
5:"Starting Time (HHMMSS) : 140000" 
6:"Stopping Date (MMDDYY) : 123169" 
7:"Stopping Time (HHMMSS) : 235959" 
8:"Interval (HHMMSS) : 010000" 
9:"Sensor warmup (HHMMSS) : 000100" 
10:"Circltr warmup (HHMMSS) : 000030" 


11:"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt","" 
12:"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts","" 

13:01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,"" 
14:01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,"" 
15:01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,"" 
16:01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,"" 
17:01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,"" 
18:01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,"" 
19:01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,"" 
20:01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,"" 
21:01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,"" 
22:01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,"" 
23:01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,"" 
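
To make the two numberings concrete, here is a small standard-library sketch (no pandas involved) that counts the sample file's lines both ways. The raw_lines list is an abbreviated stand-in for the file that keeps only the start of each line, and the helper name position is made up for illustration. The reading that a skiprows list refers to raw file positions is an assumption based on the behaviour reported in the question, not something this answer states.

```python
# Abbreviated stand-in for the sample file: only the start of each line is kept,
# and "" marks a blank line. The layout mirrors the raw data in the question.
raw_lines = (
    ["MiniSonde 5 43656"]
    + ['"metadata ..."'] * 10   # the ten quoted setup lines
    + ["", ""]                  # two blank lines
    + ['"Date",...', '"MMDDYY",...']
    + [""]                      # blank line after the units row
    + ["01/04/17,..."] * 11     # eleven data rows
)

def position(lines, prefix):
    """Return the 0-based index of the first line starting with prefix."""
    return next(i for i, line in enumerate(lines) if line.startswith(prefix))

# Numbering of the file as stored on disk (blank lines counted).
raw_date = position(raw_lines, '"Date"')
raw_units = position(raw_lines, '"MMDDYY"')

# Numbering after blank lines are dropped, as in the listing above.
non_blank = [line for line in raw_lines if line.strip()]
kept_date = position(non_blank, '"Date"')
kept_units = position(non_blank, '"MMDDYY"')

print(raw_date, raw_units)    # 13 14
print(kept_date, kept_units)  # 11 12
```

Under that assumption, skiprows=[14] removes the units row, which is the only line containing the "ø" byte, and that would explain why the UnicodeDecodeError disappeared with that call while header=11 still pointed at the "Date" row.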
+0

Why do I have to have skiprows=[14] for the code to work? And how can I make the line beginning with 'MMDDYY' the header row? –

+0

I didn't need it, so I'm not sure why that is. 'df = pd.read_csv("test.txt", header=11, skip_blank_lines=True)' works fine for me. – jacoblaw
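
For anyone landing here later, a hedged sketch that pulls the thread together: read the file with an explicit encoding, use header=11 for the "Date" row, skip the units row, and then drop the blank columns asked about in the question. The sample bytes are an abbreviated stand-in for the logger file (fewer columns and rows, placeholder metadata lines), and ISO-8859-1 is assumed from the 0xf8 byte in the traceback. Filtering on the "Unnamed" prefix relies on pandas naming header cells that are empty "Unnamed: N".

```python
import io

import pandas as pd

# Abbreviated stand-in for the logger file, encoded as ISO-8859-1 so that the
# units row contains the byte 0xf8, as in the question.
sample = (
    "MiniSonde 5 43656\n"
    + "".join(f'"Metadata line {n}"\n' for n in range(1, 11))
    + "\n\n"
    + '"Date","Time","","Temp","","SpCond",""\n'
    + '"MMDDYY","HHMMSS","","\xf8C","","mS/cm",""\n'
    + "\n"
    + '01/04/17,14:00:00,"",7.97,"",.0691,""\n'
    + '01/04/17,15:00:00,"",7.9,"",.0692,""\n'
).encode("ISO-8859-1")

df = pd.read_csv(
    io.BytesIO(sample),
    encoding="ISO-8859-1",   # assumed encoding of the real file
    header=11,               # "Date" row, counted with blank lines skipped
    skiprows=[14],           # units row, counted in raw file lines
    skip_blank_lines=True,
)

# Header cells that are empty come back as "Unnamed: N"; drop those columns.
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]

print(df)
```

With the encoding set, skiprows=[14] is no longer needed to avoid the error; it is kept here only so the units row does not end up as the first data row. To use the 'MMDDYY' row as the header instead, header=12 without skiprows should select it, though I have not checked that against the real file.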