How to read a CSV file


I have data stored in a CSV file in the following format:

892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q 
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S 
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q 
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S 
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S 
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S

The data type of each column:

1. int
2. int
3. String
4. String
5. float
6. int
7. int
8. float
9. float
10. String
11. String

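The first column (892, 893, ..., 897) should be stored in an int array. The third column, e.g. "Wilkes, Mrs. James (Ellen Needs)", should be stored as a string. However, the strings in the third column do not have a fixed length, so I do not know the maximum length of the values stored in that column.

This is what I tried: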

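import csv
import numpy as np

with open('trainData.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)  # skip the header row

    data = []
    for row in csv_file_object:
        data.append(row)

data = np.array(data)  # every cell is still a string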

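However, the code above reads every column as a string, even though many of the columns are not strings. On the other hand, if I use genfromtxt, the third column is a problem because it contains commas inside double quotes.

I want each column stored with its own data type, i.e. the first column should be stored as int.

The array I expect: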
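To make the quoted-comma problem concrete, here is a minimal Python 3 sketch. It assumes two of the sample rows are held in an in-memory string (the name `sample` is made up for illustration) and compares a naive split on commas with `csv.reader`, which respects the double quotes.

```python
import csv
import io

# Two sample rows from the question, held in memory instead of a file.
sample = (
    '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S\n'
)

first_line = sample.splitlines()[0]

# A plain split also cuts at the comma inside the quoted name.
naive_fields = first_line.split(",")

# csv.reader treats the quoted name as a single field.
parsed_rows = list(csv.reader(io.StringIO(sample)))

print(len(naive_fields))    # 12
print(len(parsed_rows[0]))  # 11
print(parsed_rows[0][2])    # Kelly, Mr. James
```

The parsed fields are still all strings (the empty cell is `''`), so a type conversion step is needed afterwards.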

892 3 "Kelly, Mr. James" male 34.5 0 0 330911 7.8292 NaN Q 
893 3 "Wilkes, Mrs. James (Ellen Needs)" female 47 1 0 363272 7 NaN S 
894 2 "Myles, Mr. Thomas Francis" male 62 0 0 240276 9.6875 NaN Q 
895 3 "Wirz, Mr. Albert" male 27 0 0 315154 8.6625 NaN S 
896 3 "Hirvonen, Mrs. Alexander (Helga E Lindqvist)" female 22 1 1 3101298 12.2875 NaN S 
897 3 "Svensson, Mr. Johan Cervin" male 14 0 0 7538 9.225 NaN S

As you can see, where data is unavailable, NaN or an equivalent should be used.

How should I read the CSV file?

Source

2015-07-21 caren vanderlee

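pandas.read_csv('data.csv', dtypes=[int, int, str])? – mbatchkarov

@mbatchkarov I don't know pandas. Can I get the expected result as an **array** or **matrix**? Could you write this up as an answer?

@mbatchkarov Hey, how should I use it? The first row is the header.

You can do this more easily with the pandas library, like this: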

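import pandas as pd

# Give every column an explicit type; empty cells are read as NaN.
df = pd.read_csv("trainData.csv",
                 dtype={'col1': int, 'col2': int, 'col3': str, 'col4': str,
                        'col5': float, 'col6': int, 'col7': int, 'col8': float,
                        'col9': float, 'col10': str, 'col11': str})
rows = [list(row) for row in df.values]  # convert the DataFrame to a list of lists
print(rows)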

Output:

[[892, 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, 330911.0, 7.8292, nan, 'Q'], 
[893, 3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', 47.0, 1, 0, 363272.0, 7.0, nan, 'S'], 
[894, 2, 'Myles, Mr. Thomas Francis', 'male', 62.0, 0, 0, 240276.0, 9.6875, nan, 'Q'], 
[895, 3, 'Wirz, Mr. Albert', 'male', 27.0, 0, 0, 315154.0, 8.6625, nan, 'S'], 
[896, 3, 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)', 'female', 22.0, 1, 1, 3101298.0, 12.2875, nan, 'S'], 
[897, 3, 'Svensson, Mr. Johan Cervin', 'male', 14.0, 0, 0, 7538.0, 9.225, nan, 'S']]

The CSV file should look like this, with the column names as the first row:

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11 
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q 
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S 
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q 
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S 
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S 
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S

You can learn more about pandas here: http://pandas.pydata.org/pandas-docs/stable/tutorials.html

Source
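If the file has no header row, one option is to supply the column names yourself. The following is a minimal Python 3 sketch, not part of the original answer: it assumes the data is in an in-memory string called `sample` and reuses the placeholder names `col1` to `col11`. `header=None` and `names` are existing `read_csv` parameters.

```python
import io
import pandas as pd

# Two headerless sample rows, held in memory instead of a file.
sample = (
    '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S\n'
)

# Placeholder column names col1 .. col11 (made up, as in the answer above).
names = ["col%d" % i for i in range(1, 12)]
dtypes = {"col1": int, "col2": int, "col3": str, "col4": str,
          "col5": float, "col6": int, "col7": int, "col8": float,
          "col9": float, "col10": str, "col11": str}

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(io.StringIO(sample), header=None, names=names, dtype=dtypes)

print(df.dtypes)
print(df["col10"].isna().all())  # the empty cells were read as missing values
```

For a real file, replace `io.StringIO(sample)` with the file path.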

2015-07-21 11:05:54

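How can I access the first element, i.e. 892, of the pandas DataFrame? I tried df[0:0] and df[0][0] but got an error.

'print(df.loc[0, "col1"])', where 0 is the index and "col1" is the column name, or 'print(df["col1"].values[0])' @carenvanderlee

Thank you so much. I am a little confused. If the 0th element is data, i.e. 892, how can I get it from 'df'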
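As a small illustration of the element lookup discussed in these comments, here is a Python 3 sketch. It assumes a shortened three-column sample with the placeholder names `col1` to `col3`. With string column names, `df[0]` looks for a column labelled `0` and raises a `KeyError`, which is why the label-based `.loc` or position-based `.iloc` indexers are used instead.

```python
import io
import pandas as pd

# Shortened sample with a header row (column names are placeholders).
sample = (
    'col1,col2,col3\n'
    '892,3,"Kelly, Mr. James"\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)"\n'
)
df = pd.read_csv(io.StringIO(sample))

first_by_label = df.loc[0, "col1"]        # row label 0, column "col1"
first_by_position = df.iloc[0, 0]         # first row, first column
first_from_column = df["col1"].values[0]  # first value of the column's array

print(first_by_label, first_by_position, first_from_column)
```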

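I'm not sure how well I understood you, but I think this will work for you.

I implemented two additional functions to decide whether a string is a float or an integer.

If the string is empty, I store None, though you can change that to whatever you like.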

import csv

def isfloat(x):
    """Return True if the string x can be parsed as a float."""
    try:
        float(x)
    except ValueError:
        return False
    return True

def isint(x):
    """Return True if the string x represents a whole number."""
    try:
        a = float(x)
        b = int(a)
    except (ValueError, OverflowError):  # 'nan' and 'inf' cannot become int
        return False
    return a == b


data = []
with open('trainData.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    # If the file starts with a header row, skip it first:
    # header = next(csv_file_object)
    for row in csv_file_object:
        for index, cell in enumerate(row):
            if not cell:  # cell == ''
                row[index] = None  # you can change the value to whatever you like
            elif isint(cell):
                row[index] = int(float(cell))  # int('7.0') would raise ValueError
            elif isfloat(cell):
                row[index] = float(cell)
        data.append(row)

print(data)

Output:

[[892, 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, 330911, 7.8292, None, 'Q'], 
[893, 3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', 47, 1, 0, 363272, 7, None, 'S'], 
[894, 2, 'Myles, Mr. Thomas Francis', 'male', 62, 0, 0, 240276, 9.6875, None, 'Q'], 
[895, 3, 'Wirz, Mr. Albert', 'male', 27, 0, 0, 315154, 8.6625, None, 'S'], 
[896, 3, 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)', 'female', 22, 1, 1, 3101298, 12.2875, None, 'S'], 
[897, 3, 'Svensson, Mr. Johan Cervin', 'male', 14, 0, 0, 7538, 9.225, None, 'S']]
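If the goal is a NumPy array with one type per column, the rows can also be converted with an explicit converter per column and stored in a structured array. This is only a sketch under assumptions: the field names `col1` to `col11` and the helper `convert` are made up for illustration, and the data is held in an in-memory string. Object (`'O'`) fields hold strings of any length, which sidesteps the unknown maximum length of the name column. An empty cell becomes NaN, so it would fail in an integer column; such a column would have to be declared as float instead.

```python
import csv
import io
import numpy as np

# Two sample rows, held in memory instead of a file.
sample = (
    '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S\n'
)

# One converter per column, in the order listed in the question.
converters = [int, int, str, str, float, int, int, float, float, str, str]

def convert(cell, converter):
    """Return NaN for an empty cell, otherwise the converted value."""
    return float("nan") if cell == "" else converter(cell)

rows = [
    tuple(convert(cell, conv) for cell, conv in zip(row, converters))
    for row in csv.reader(io.StringIO(sample))
]

# Structured dtype: 'i8' = 64-bit int, 'f8' = 64-bit float, 'O' = Python object.
dtype = [("col1", "i8"), ("col2", "i8"), ("col3", "O"), ("col4", "O"),
         ("col5", "f8"), ("col6", "i8"), ("col7", "i8"), ("col8", "f8"),
         ("col9", "f8"), ("col10", "O"), ("col11", "O")]
table = np.array(rows, dtype=dtype)

print(table["col1"])     # integer column
print(table["col3"][1])  # variable-length string
```

Each column is then available by name, for example `table["col5"]` for the float column.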

Source

2015-07-21 10:11:20

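I've edited it; could you check again?

@carenvanderlee, did my answer not solve your problem?

You said, "I'm not sure how well I understood you, but I think this will work for you."

I assume you are using pandas, since the question is tagged pandas. Read the file like this: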

df = pd.read_csv('test.txt', skiprows=0, index_col=0, 
      names='city_type name sex weight has_cat has_dog bank_balance body_fat_index car_mileage car_type'.split())

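You will get a DataFrame like this: [image of the resulting DataFrame]

I took the liberty of making up the column names.

Once you have read the data into a DataFrame, you can do all sorts of magic with it; take a look at the pandas tutorials (they are great). Here is an example: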

df.bank_balance.describe() 

count          6.000000
mean      726408.166667
std      1170522.652019
min         7538.000000
25%       258995.500000
50%       323032.500000
75%       355181.750000
max      3101298.000000
Name: bank_balance, dtype: float64

Source

2015-07-21 11:06:06 mbatchkarov
