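How to populate a multi-row array from a CSV file using Python pandas

CSV column headers - Year, Model, Trim, Result

The values coming in from the CSV file look like this -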

Year | Model | Trim | Result 

2012 | Camry | SR5 | 1 
2014 | Tacoma | SR5 | 1 
2014 | Camry | XLE | 0 
etc..

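The dataset has 2500+ rows containing more than 200 unique models.

All values are then converted to numerical values for analysis.

Here the input is the first 3 columns of the CSV file and the output is the fourth column, Result.

Here is my script: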

import pandas as pd 
import numpy as np

c1 = [] 
c2 = [] 
c3 = [] 
input = [] 
output = [] 

# read in the csv file containing 4 columns 
df = pd.read_csv('success.csv') 
# NOTE: convert_objects was removed in pandas 1.0, and its return value is discarded here
df.convert_objects(convert_numeric=True)
df.fillna(0, inplace=True) 

# convert string values to numerical values 
def handle_non_numerical_data(df):
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}

        def convert_to_int(val):
            return text_digit_vals[val]

        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1

            df[column] = list(map(convert_to_int, df[column]))

    return df

df = handle_non_numerical_data(df) 

# extract each column to insert into input array later 
c1.append(df['Year']) 
c2.append(df['Model']) 
c3.append(df['Trim']) 

# create input array containing the first 3 columns of the csv file
input = np.column_stack((c1, c2, c3))
output.append(df['Result'])

This works fine, except that append only picks up 1 value. Should I use extend instead, since it seems to add to the end of the array?

UPDATE

Essentially all of this works great. My problem is creating the input array: I want the array to consist of 3 columns - Year, Model, Trim.

input = ([['Year'], ['Model'], ['Trim']],[['Year'], ['Model'], ['Trim']]...)

I can only seem to stack one value on top of another, rather than getting them in sequence..

What I currently get -

input = ([['Year'], ['Year'], ['Year']].., [['Model'], ['Model'], ['Model']]..[['Trim'], ['Trim'], ['Trim']]...)

Source

2017-02-21 Ryan D

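I am struggling to understand the question. Could you rephrase it, or add an example of the current and expected behavior? – Marat

It is unclear what you are doing, since we know nothing about your CSV. You should try to give an example of the input and the expected output. In this case, that means explaining why the result of 'pd.read_csv' is not acceptable. I suspect that whatever you are trying to accomplish can be done in a more straightforward way. –

Sorry, I have tried to update the question to explain my problem better. Basically, I cannot arrange the 3 arrays side by side in one array without stacking them on top of each other –

To elaborate on my comment, suppose you have some DataFrame consisting of non-integer values: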

>>> df = pd.DataFrame([[np.random.choice(list('abcdefghijklmnop')) for _ in range(3)] for _ in range(10)]) 
>>> df
   0  1  2
0  j  p  j
1  d  g  b
2  n  m  f
3  o  b  j
4  h  c  a
5  p  m  n
6  c  c  l
7  o  d  e
8  b  g  h
9  h  o  k

And there is also an output column:

>>> df['output'] = np.random.randint(0,2,10) 
>>> df
   0  1  2  output
0  j  p  j       0
1  d  g  b       0
2  n  m  f       1
3  o  b  j       1
4  h  c  a       1
5  p  m  n       0
6  c  c  l       1
7  o  d  e       0
8  b  g  h       1
9  h  o  k       0

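To convert all the string values to integers, use np.unique with return_inverse=True. The inverse is the array you need; just remember that you have to reshape it (since np.unique will have flattened it):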

>>> unique, inverse = np.unique(df.iloc[:,:3].values, return_inverse=True)
>>> unique
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
       'o', 'p'], dtype=object)
>>> inverse
array([ 8, 14,  8,  3,  6,  1, 12, 11,  5, 13,  1,  8,  7,  2,  0, 14, 11,
       12,  2,  2, 10, 13,  3,  4,  1,  6,  7,  7, 13,  9])
>>> input = inverse.reshape(df.shape[0], df.shape[1] - 1)
>>> input
array([[ 8, 14,  8],
       [ 3,  6,  1],
       [12, 11,  5],
       [13,  1,  8],
       [ 7,  2,  0],
       [14, 11, 12],
       [ 2,  2, 10],
       [13,  3,  4],
       [ 1,  6,  7],
       [ 7, 13,  9]])
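
If the feature columns mix integers and strings, as Year, Model and Trim do in the question, encoding each column separately may be easier to work with than one shared np.unique call: np.unique sorts its input, which can fail on an object array that mixes integers and strings, and a per-column encoding gives each column its own code space. The following is a minimal sketch that swaps in pd.factorize for np.unique. The sample rows and the names features, encoded, lookups and X are assumptions made for illustration, not part of the original script.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the question's CSV (values are illustrative).
df = pd.DataFrame({
    "Year": [2012, 2014, 2014],
    "Model": ["Camry", "Tacoma", "Camry"],
    "Trim": ["SR5", "SR5", "XLE"],
    "Result": [1, 1, 0],
})

features = ["Year", "Model", "Trim"]
encoded = df[features].copy()
lookups = {}  # column name -> array of original labels, indexed by code

for col in features:
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        # pd.factorize returns integer codes plus the unique labels,
        # numbered in order of first appearance.
        codes, uniques = pd.factorize(encoded[col])
        encoded[col] = codes
        lookups[col] = np.asarray(uniques)

# One row per CSV row, one column per feature.
X = encoded.to_numpy()
print(X)

# Going back from codes to labels for a single column.
print(lookups["Model"][X[:, 1]])
```

Numeric columns such as Year are left untouched here, and the lookups dictionary plays the same role as the unique array above when you need the original labels again.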

You can always go back:

>>> unique[input]
array([['j', 'p', 'j'],
       ['d', 'g', 'b'],
       ['n', 'm', 'f'],
       ['o', 'b', 'j'],
       ['h', 'c', 'a'],
       ['p', 'm', 'n'],
       ['c', 'c', 'l'],
       ['o', 'd', 'e'],
       ['b', 'g', 'h'],
       ['h', 'o', 'k']], dtype=object)

To get the output array, again, you just take the appropriate column using the df's .values, since these are already numpy arrays!

>>> output = df['output'].values 
>>> output 
array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
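
The same idea applies to the input side of the question. As a rough sketch, assuming the frame has already been converted to numbers and uses the question's column names, selecting the three columns directly gives one row per CSV row, whereas list.append adds the whole column as a single element. The sample values and the names X, y and X_alt are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-numeric frame standing in for the encoded CSV.
df = pd.DataFrame({
    "Year": [2012, 2014, 2014],
    "Model": [0, 1, 0],
    "Trim": [0, 0, 1],
    "Result": [1, 1, 0],
})

# list.append stores the whole Series as one element of the list.
c1 = []
c1.append(df["Year"])
print(len(c1))

# Selecting the three columns keeps one row per CSV row: shape (n_rows, 3).
X = df[["Year", "Model", "Trim"]].to_numpy()
y = df["Result"].to_numpy()
print(X.shape, y.shape)

# Equivalent construction from separate 1-D columns.
X_alt = np.column_stack((df["Year"], df["Model"], df["Trim"]))
```

Passing 1-D columns to np.column_stack places them side by side, which is the [Year, Model, Trim] row layout the question asks for.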

You may want to reshape it, depending on the library you are going to use for the analysis (sklearn, SciPy, etc.):

>>> output.reshape(output.size, 1)
array([[0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0]])
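
A small sketch of the same reshape, using -1 so NumPy infers the row count, plus the way back to a flat array. The names column and flat are illustrative only.

```python
import numpy as np

output = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])

# -1 lets NumPy infer the number of rows from the array size.
column = output.reshape(-1, 1)

# ravel flattens the column back to a 1-D array.
flat = column.ravel()

print(column.shape, flat.shape)
```

Which shape you need depends on the library: some functions expect a 1-D target, others a single-column 2-D array.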

Source

2017-02-21 03:59:59

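Thanks for the explanation! Sorry, I forgot to mention that the dataset has 2500+ rows containing more than 200 unique models. Will having that many unique models make a difference? –

@RyanD No, it should not be a problem. –

Cool, I will try it first thing in the morning and report back. Thanks a mil! –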
