2016-04-22 113 views
2

Pandas: read_csv for a "space-delimited" file

I have the following file.txt (abridged):

SICcode  Catcode  Category        SICname  MultSIC 
0111  A1500  Wheat, corn, soybeans and cash grain  Wheat  X 
0112  A1600  Other commodities (incl rice, peanuts)  Rice  X 
0115  A1500  Wheat, corn, soybeans and cash grain  Corn  X 
0116  A1500  Wheat, corn, soybeans and cash grain  Soybeans  X 
0119  A1500  Wheat, corn, soybeans and cash grain  Cash grains, NEC  X 
0131  A1100  Cotton  Cotton  X 
0132  A1300  Tobacco & Tobacco products     Tobacco  X 

I am having trouble reading it into a pandas DataFrame. I tried pd.read_csv with the options engine='python', sep='Tab', but it returns the file in a single column:

SICcode Catcode Category SICname MultSIC 
0 0111 A1500 Wheat, corn, soybeans... 
1 0112 A1600 Other commodities (in... 
2 0115 A1500 Wheat, corn, soybeans... 
3 0116 A1500 Wheat, corn, soybeans... 

I then tried opening the file in Gnumeric with "Tab" as the separator, but it also read the file as a single column. Does anyone have an idea?

Answers

3

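If df = pd.read_csv('file.txt', sep='\t') returns a DataFrame with a single column, then file.txt evidently does not use tabs as separators. Your data may be separated by spaces only. In that case, you could try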

df = pd.read_csv('file.txt', sep=r'\s{2,}', engine='python')

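This uses the regex pattern \s{2,} as the separator. The pattern matches 2 or more whitespace characters, so the single spaces inside a field such as "Cash grains, NEC" do not split it.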

In [8]: df 
Out[8]: 
    SICcode Catcode        Category   SICname \ 
0  111 A1500 Wheat, corn, soybeans and cash grain    Wheat 
1  112 A1600 Other commodities (incl rice, peanuts)    Rice 
2  115 A1500 Wheat, corn, soybeans and cash grain    Corn 
3  116 A1500 Wheat, corn, soybeans and cash grain   Soybeans 
4  119 A1500 Wheat, corn, soybeans and cash grain Cash grains, NEC 
5  131 A1100         Cotton   Cotton 
6  132 A1300    Tobacco & Tobacco products   Tobacco 

    MultSIC 
0  X 
1  X 
2  X 
3  X 
4  X 
5  X 
6  X 
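The regex-separator approach above can be tried on a small in-memory sample. This is only a sketch: the sample string is a hypothetical stand-in for file.txt, and dtype=str is an optional addition (not part of the answer) that keeps the leading zeros in SICcode, which the output above shows being dropped.

```python
import io

import pandas as pd

# Hypothetical stand-in for file.txt: fields are separated by 2+ spaces,
# while single spaces occur inside some fields.
sample = (
    "SICcode  Catcode  Category  SICname  MultSIC\n"
    "0111  A1500  Wheat, corn, soybeans and cash grain  Wheat  X\n"
    "0119  A1500  Wheat, corn, soybeans and cash grain  Cash grains, NEC  X\n"
    "0131  A1100  Cotton  Cotton  X\n"
)

# A regex separator is handled by the Python parsing engine.
# dtype=str (an assumption about what is wanted) stops pandas from
# converting "0111" to the integer 111.
df = pd.read_csv(io.StringIO(sample), sep=r"\s{2,}", engine="python", dtype=str)

print(df.shape)
print(df.loc[1, "SICname"])
```

If numeric columns are wanted instead, dropping dtype=str lets pandas infer the types again.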

If that does not work, please post the output of print(repr(open('file.txt', 'rb').read(100))). This will show us an unambiguous representation of the first 100 bytes of file.txt.
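As a rough illustration of what to look for in those bytes, here is a minimal sketch. guess_delimiter is a hypothetical helper, not a pandas function, and the byte string stands in for the real result of open('file.txt', 'rb').read(100).

```python
import re


def guess_delimiter(raw: bytes) -> str:
    """Hypothetical helper: say whether a byte sample looks tab- or space-delimited."""
    if b"\t" in raw:
        return "tab"
    if re.search(rb" {2,}", raw):
        return "multiple spaces"
    return "unknown"


# Stand-in for open('file.txt', 'rb').read(100); the real bytes may differ.
raw = b"SICcode  Catcode  Category        SICname  MultSIC \n0111  A1500  Wheat"

print(repr(raw))             # a tab would show up as \t in this output
print(guess_delimiter(raw))
```

A literal \t in the repr output means sep='\t' should work; runs of spaces point to a regex separator such as \s{2,}.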

+0

@unutbu: It worked! Thank you! –

1

I think you could try adding sep="\t" to read_csv, if the data in the csv file is separated by tabs.

import pandas as pd 

df = pd.read_csv('test/a.csv', sep="\t") 
print(df)
    SICcode Catcode        Category   SICname \ 
0  111 A1500 Wheat, corn, soybeans and cash grain    Wheat 
1  112 A1600 Other commodities (incl rice, peanuts)    Rice 
2  115 A1500 Wheat, corn, soybeans and cash grain    Corn 
3  116 A1500 Wheat, corn, soybeans and cash grain   Soybeans 
4  119 A1500 Wheat, corn, soybeans and cash grain Cash grains, NEC 
5  131 A1100         Cotton   Cotton 
6  132 A1300    Tobacco & Tobacco products   Tobacco 

    MultSIC 
0  X 
1  X 
2  X 
3  X 
4  X 
5  X 
6  X 
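To see why the delimiter matters here, the following sketch runs the same sep="\t" call on two hypothetical in-memory samples, one tab-separated and one separated by spaces. Both sample strings are made up for illustration and are not the asker's file.

```python
import io

import pandas as pd

# Two hypothetical samples with the same fields but different delimiters.
tab_text = "SICcode\tCatcode\tSICname\n0111\tA1500\tWheat\n"
space_text = "SICcode  Catcode  SICname\n0111  A1500  Wheat\n"

tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")
space_df = pd.read_csv(io.StringIO(space_text), sep="\t")

# With real tabs the fields are split into separate columns;
# without them each whole line lands in one column.
print(tab_df.shape)
print(space_df.shape)
```

A single-column result from sep="\t" is therefore a hint that the file contains no tab characters.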
+0

I tried your suggestion, and it returns the df in a single column. I will try changing the extension from 'txt' to 'csv'. –

+1

OK, I think you could open the file in an editor, for example Notepad++, and check the column separator. Is it tabs? Or only spaces? – jezrael

+0

It is a tab that is 8 characters wide. –