2016-10-10 45 views
I'm trying to load a CSV data file whose rows look like this:
ACCEPT,[email protected],t,[email protected],0,UK,3600000,3,1475917200000,1475920800000,MON,9,0,0,0 

I load it in the following way, with an explicit NumPy dtype, and get an "invalid index" error:

from numpy import genfromtxt

dataset = genfromtxt('./training_set.csv', delimiter=',', dtype='a20, a20, a20, a8, i8, a20, i8, i8, i8, i8, a3, i8, i8, i8, i8')
print(dataset)
target = [x[0] for x in dataset]
train = [x[1:] for x in dataset]

On the last line above, I get an error:

--------------------------------------------------------------------------- 
IndexError        Traceback (most recent call last) 
<ipython-input-66-5d58edf06039> in <module>() 
     4 print(dataset) 
     5 target = [x[0] for x in dataset] 
----> 6 train = [x[1:] for x in dataset] 
     7 
     8 #rf = RandomForestClassifier(n_estimators=100) 

<ipython-input-66-5d58edf06039> in <listcomp>(.0) 
     4 print(dataset) 
     5 target = [x[0] for x in dataset] 
----> 6 train = [x[1:] for x in dataset] 
     7 
     8 #rf = RandomForestClassifier(n_estimators=100) 

IndexError: invalid index 

How do I handle this?


'dataset' is a 1-D structured array. You access its fields by name, not by column number or slice. – hpaulj

Answers

1

With that dtype you have created a structured array: it is 1-D with a compound dtype.

I have a sample structured array from another question:

In [26]: data 
Out[26]: 
array([(b'1Q11', 252.0, 0.0166), (b'2Q11', 212.4, 0.0122), 
     (b'3Q11', 425.9, 0.0286), (b'4Q11', 522.3, 0.0322), 
     (b'1Q12', 263.2, 0.0185), (b'2Q12', 238.6, 0.0131), 
     ... 
     (b'1Q14', 264.5, 0.0179), (b'2Q14', 211.2, 0.0116)], 
     dtype=[('Qtrs', 'S4'), ('Y', '<f8'), ('X', '<f8')]) 

A single record is:

In [27]: data[0] 
Out[27]: (b'1Q11', 252.0, 0.0166) 

While I can access an element within the record by number, the record does not accept a slice:

In [36]: data[0][1] 
Out[36]: 252.0 
In [37]: data[0][1:] 
.... 
IndexError: invalid index 
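
For anyone who wants to reproduce this without the original file, here is a minimal sketch. It assumes a two-row array with the same dtype as the sample above; the exact wording of the IndexError message may differ between NumPy versions, so the code only checks the exception type. The last line shows a name-based substitute for record[1:].

```python
import numpy as np

# Two records with the same dtype as the sample array above.
data = np.array(
    [(b"1Q11", 252.0, 0.0166), (b"2Q11", 212.4, 0.0122)],
    dtype=[("Qtrs", "S4"), ("Y", "<f8"), ("X", "<f8")],
)

record = data[0]      # a numpy.void scalar, not a tuple or a 1-D row
second = record[1]    # integer indexing into a record works

try:
    record[1:]        # slicing a record does not
    slice_failed = False
except IndexError:
    slice_failed = True

# Name-based substitute for record[1:]: walk the field names instead.
tail = [record[name] for name in record.dtype.names[1:]]
```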

The preferred way to access elements of a structured record is by field name:

In [38]: data[0]['X'] 
Out[38]: 0.0166 

The same name gives me that field across all records:

In [39]: data['X'] 
Out[39]: 
array([ 0.0166, 0.0122, 0.0286, ... 0.0116]) 

Reading several fields requires a list of field names (and is wordier than a 2-D slice):

In [42]: data.dtype.names[1:] 
Out[42]: ('Y', 'X') 

In [44]: data[list(data.dtype.names[1:])] 
Out[44]: 
array([(252.0, 0.0166), (212.4, 0.0122),... (211.2, 0.0116)], 
     dtype=[('Y', '<f8'), ('X', '<f8')]) 
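
The two access patterns above can be sketched on a small array as follows. The three rows are copied from the sample; the final column_stack step is my own assumption about what you might want next (an ordinary 2-D float array) and is not part of the original session.

```python
import numpy as np

data = np.array(
    [(b"1Q11", 252.0, 0.0166), (b"2Q11", 212.4, 0.0122), (b"3Q11", 425.9, 0.0286)],
    dtype=[("Qtrs", "S4"), ("Y", "<f8"), ("X", "<f8")],
)

x = data["X"]                              # one field: a plain 1-D float array
rest_names = list(data.dtype.names[1:])    # ['Y', 'X']
rest = data[rest_names]                    # several fields: still a structured array

# Assumption: for numeric work an ordinary 2-D array is more convenient.
# Stacking the individual fields works because both fields are floats.
rest_2d = np.column_stack([data[name] for name in rest_names])
```

Note that this only makes sense when the selected fields share a numeric type; mixed string and integer fields would have to stay structured.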

===============

With your sample line (replicated 3 times) I can load:

In [53]: dataset=np.genfromtxt(txt,dtype=None,delimiter=',') 
In [54]: dataset 
Out[54]: 
array([ (b'ACCEPT', b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0), 
     (b'ACCEPT', b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0), 
     (b'ACCEPT', b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0)], 
     dtype=[('f0', 'S6'), ('f1', 'S15'), ('f2', 'S1'), ('f3', 'S8'), ('f4', '<i4'), ('f5', 'S2'), ('f6', '<i4'), ('f7', '<i4'), ('f8', '<i8'), ('f9', '<i8'), ('f10', 'S3'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4')]) 

dtype=None produces a dtype similar to your explicit one, with the string widths and integer sizes inferred from the data.

To get the output you want (as arrays rather than lists):

target = dataset['f0'] 
names=dataset.dtype.names[1:] 
train = dataset[list(names)] 
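
Put together as a self-contained sketch, this looks roughly like the following. The rows are hypothetical stand-ins shaped like your sample line (the addresses are placeholders), and I pass encoding="utf-8" so that the string fields come back as str rather than bytes; that is a choice of this sketch, not something the session above did.

```python
import io
import numpy as np

# Hypothetical rows shaped like the sample line; the addresses are placeholders.
txt = io.StringIO(
    "ACCEPT,user1@example.com,t,peer1@example.com,0,UK,3600000,3,1475917200000,1475920800000,MON,9,0,0,0\n"
    "REJECT,user2@example.com,t,peer2@example.com,0,UK,3600000,3,1475917200000,1475920800000,TUE,9,0,0,0\n"
    "ACCEPT,user3@example.com,t,peer3@example.com,0,UK,3600000,3,1475917200000,1475920800000,WED,9,0,0,0\n"
)

# dtype=None lets genfromtxt infer one type per column; fields are named f0..f14.
dataset = np.genfromtxt(txt, dtype=None, delimiter=",", encoding="utf-8")

target = dataset["f0"]                            # first column, by name
feature_names = list(dataset.dtype.names[1:])     # 'f1' .. 'f14'
train = dataset[feature_names]                    # remaining columns, still structured
```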

=====================

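You can also refine the dtype to make the task simpler. Define 2 fields, with the second one containing most of the CSV columns. genfromtxt handles this kind of dtype nesting, as long as the total number of fields is correct.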

In [106]: dt=[('target','a20'), 
     ('train','a20, a20, a8, i8, a20, i8, i8, i8, i8, a3, i8, i8, i8, i8')] 
In [107]: dataset=np.genfromtxt(txt,dtype=dt,delimiter=',') 
In [108]: dataset 
Out[108]: 
array([ (b'ACCEPT', (b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0)), 
...], 
     dtype=[('target', 'S20'), ('train', [('f0', 'S20'), ('f1', 'S20'), ('f2', 'S8'), ('f3', '<i8'), ('f4', 'S20'), ('f5', '<i8'), ('f6', '<i8'), ('f7', '<i8'), ('f8', '<i8'), ('f9', 'S3'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', '<i8')])]) 

Now I only need to select the 2 top-level fields:

In [109]: dataset['target'] 
Out[109]: 
array([b'ACCEPT', b'ACCEPT', b'ACCEPT'], 
     dtype='|S20') 

In [110]: dataset['train'] 
Out[110]: 
array([ (b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0), 
...], 
     dtype=[('f0', 'S20'), ('f1', 'S20'), ...]) 
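
A runnable sketch of the nested-dtype load, under a few assumptions of mine: the rows are hypothetical placeholders, 'U20' replaces 'a20' so the strings come back as str, all string fields are given width 20 so the placeholder addresses are not truncated, and encoding="utf-8" is passed explicitly.

```python
import io
import numpy as np

# Hypothetical rows shaped like the sample line; the addresses are placeholders.
txt = io.StringIO(
    "ACCEPT,user1@example.com,t,peer1@example.com,0,UK,3600000,3,1475917200000,1475920800000,MON,9,0,0,0\n"
    "REJECT,user2@example.com,t,peer2@example.com,0,UK,3600000,3,1475917200000,1475920800000,TUE,9,0,0,0\n"
)

# Two top-level fields: 1 column for the label, 14 columns for the features.
dt = [
    ("target", "U20"),
    ("train", "U20, U20, U20, i8, U20, i8, i8, i8, i8, U3, i8, i8, i8, i8"),
]
dataset = np.genfromtxt(txt, dtype=dt, delimiter=",", encoding="utf-8")

target = dataset["target"]    # 1-D array of labels
train = dataset["train"]      # structured array with fields f0..f13
```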

I could nest further, collecting the i8 columns into groups of 4:

dt=[('target','a20'), ('train','a20, a20, a8, i8, a20, (4,)i8, a3, (4,)i8')] 
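
I did not run that last dtype in the session above, so treat the following as a sketch of what the grouping should give. It makes the same assumptions as before (hypothetical placeholder rows, 'U' string fields instead of 'a', explicit encoding). Each (4,)i8 field absorbs four consecutive integer columns.

```python
import io
import numpy as np

# Hypothetical rows shaped like the sample line; the addresses are placeholders.
txt = io.StringIO(
    "ACCEPT,user1@example.com,t,peer1@example.com,0,UK,3600000,3,1475917200000,1475920800000,MON,9,0,0,0\n"
    "REJECT,user2@example.com,t,peer2@example.com,0,UK,3600000,3,1475917200000,1475920800000,TUE,9,0,0,0\n"
)

# Inside 'train': f5 and f7 are sub-arrays holding 4 integer columns each.
dt = [
    ("target", "U20"),
    ("train", "U20, U20, U20, i8, U20, (4,)i8, U3, (4,)i8"),
]
dataset = np.genfromtxt(txt, dtype=dt, delimiter=",", encoding="utf-8")

train = dataset["train"]
first_group = train["f5"]     # shape (n_rows, 4): an ordinary 2-D integer array
second_group = train["f7"]    # shape (n_rows, 4)
```

The appeal of this layout is that the numeric groups can be sliced like any 2-D array, for example first_group[:, 2:].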
1
In [42]: dataset = np.genfromtxt('./np_inf.txt', delimiter=',', dtype='a20, a20, a20, a8, i8, a20, i8, i8, i8, i8, a3, i8, i8, i8, i8')

In [43]: [x[0] for x in dataset] 
Out[43]: ['ACCEPT', 'ACCEPT', 'ACCEPT'] 

The problem is that the entries of dataset are of the not very useful type np.void. It obviously does not allow slicing, but you can iterate over it:

In [56]: type(dataset[0]) 
Out[56]: numpy.void 

In [57]: len(dataset[0]) 
Out[57]: 15 

In [58]: z = [[y for j, y in enumerate(x) if j > 0] for x in dataset] 

In [59]: z[0] 
Out[59]: 
['[email protected]', 
't', 
'[email protected]', 
0, 
'UK', 
3600000, 
3, 
1475917200000, 
1475920800000, 
'MON', 
9, 
0, 
0, 
0] 
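
If plain lists really are what you want, a shorter route than the enumerate loop is tolist(), which turns each record into an ordinary Python tuple that can be sliced. A minimal sketch, assuming a cut-down four-field array with made-up values rather than the full 15-column file:

```python
import numpy as np

# Cut-down stand-in for the dataset: label, country, duration, weekday.
dataset = np.array(
    [("ACCEPT", "UK", 3600000, "MON"), ("REJECT", "UK", 3600000, "TUE")],
    dtype="U20, U20, i8, U3",
)

rows = dataset.tolist()                    # list of plain Python tuples
target = [row[0] for row in rows]          # first column
train = [list(row[1:]) for row in rows]    # tuples can be sliced, unlike np.void
```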

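But you would probably be better off converting the array to a structured dtype that suits the task rather than working with lists.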

Better still, consider using pandas and pd.read_csv.
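
A minimal sketch of the pandas route, assuming the file has no header row (hence header=None, which gives integer column labels 0..14). The rows are hypothetical placeholders shaped like the sample line.

```python
import io
import pandas as pd

# Hypothetical rows shaped like the sample line; the addresses are placeholders.
csv_text = (
    "ACCEPT,user1@example.com,t,peer1@example.com,0,UK,3600000,3,1475917200000,1475920800000,MON,9,0,0,0\n"
    "REJECT,user2@example.com,t,peer2@example.com,0,UK,3600000,3,1475917200000,1475920800000,TUE,9,0,0,0\n"
    "ACCEPT,user3@example.com,t,peer3@example.com,0,UK,3600000,3,1475917200000,1475920800000,WED,9,0,0,0\n"
)

# header=None: the file has no header line, so columns are labelled 0..14.
df = pd.read_csv(io.StringIO(csv_text), header=None)

target = df[0]            # first column as a Series
train = df.iloc[:, 1:]    # positional slicing works here, unlike on np.void records
```

With a real file you would pass the path instead of the StringIO object.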