2015-02-09 74 views

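h5py: how to read selected rows of an hdf5 file?

Is it possible to read a given set of rows from an hdf5 file without loading the whole file? I have quite large hdf5 files containing many datasets. Here is an example of what I have in mind to reduce time and memory usage: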

#! /usr/bin/env python 

import numpy as np 
import h5py 

infile = 'field1.87.hdf5' 
f = h5py.File(infile,'r') 
group = f['Data'] 

mdisk = group['mdisk'].value 

val = 2.*pow(10.,10.) 
ind = np.where(mdisk>val)[0] 

m = group['mcold'][ind] 
print m 

ind does not give consecutive rows; the selected rows are scattered.

The code above fails, even though it follows the standard way of slicing an hdf5 dataset. The error message I get is:

Traceback (most recent call last): 
    File "./read_rows.py", line 17, in <module> 
    m = group['mcold'][ind] 
    File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/dataset.py", line 425, in __getitem__ 
    selection = sel.select(self.shape, args, dsid=self.id) 
    File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 71, in select 
    sel[arg] 
    File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 209, in __getitem__ 
    raise TypeError("PointSelection __getitem__ only works with bool arrays") 
TypeError: PointSelection __getitem__ only works with bool arrays 

Saying that it 'fails' without showing the error message, or what goes wrong, is a big no-no here. – hpaulj 2015-02-09 21:27:42


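You are loading the whole 'mdisk' array into memory. I'd have to dig into the documentation to determine how much of 'mcold' gets loaded. It may depend on whether 'ind' is a compact slice or values scattered throughout the array. – hpaulj 2015-02-09 21:32:32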

Answer

3

I have a sample h5py file:

data = f['data'] 
# <HDF5 dataset "data": shape (3, 6), type "<i4"> 
# is arange(18).reshape(3,6) 
ind=np.where(data[:]%2)[0] 
# array([0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=int32) 
data[ind] # getitem only works with boolean arrays error 
data[ind.tolist()] # can't read data (Dataset: Read failed) error 

This last error is caused by the repeated values in the list.

But indexing with a list of unique values works fine:

In [150]: data[[0,2]] 
Out[150]: 
array([[ 0, 1, 2, 3, 4, 5], 
     [12, 13, 14, 15, 16, 17]]) 

In [151]: data[:,[0,3,5]] 
Out[151]: 
array([[ 0, 3, 5], 
     [ 6, 9, 11], 
     [12, 15, 17]]) 

So does indexing with an array, as long as it is applied to the right dimension:

In [157]: data[ind[[0,3,6]],:] 
Out[157]: 
array([[ 0, 1, 2, 3, 4, 5], 
     [ 6, 7, 8, 9, 10, 11], 
     [12, 13, 14, 15, 16, 17]]) 
In [165]: f['data'][:2,np.array([0,3,5])] 
Out[165]: 
array([[ 0, 3, 5], 
     [ 6, 9, 11]]) 
In [166]: f['data'][[0,1],np.array([0,3,5])] 
# error about only one indexing array allowed 

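So if the indexing is right (unique values that match the array dimensions), it should work. My simple example does not test how much of the array is loaded. The documentation reads as though the selected elements are picked from the file without loading the whole array into memory.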
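As a minimal sketch of the "unique values" point, assuming a throwaway file shaped like the sample above (the temporary path and the names rows, inverse and expanded are made up for illustration): one possible way around the repeated-index error is to read each needed row once with sorted, unique indices and then restore the repeats in memory with NumPy.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small throwaway file shaped like the sample in the answer.
# The path is hypothetical; any writable location would do.
path = os.path.join(tempfile.mkdtemp(), "sample.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(18).reshape(3, 6))

with h5py.File(path, "r") as f:
    data = f["data"]
    # Row index of every odd element; contains repeats: 0,0,0,1,1,1,2,2,2
    ind = np.where(data[:] % 2)[0]
    # np.unique returns the sorted unique row numbers, plus the mapping
    # that rebuilds `ind` from them.
    rows, inverse = np.unique(ind, return_inverse=True)
    # Index the dataset with an increasing, duplicate-free list.
    unique_rows = data[rows.tolist(), :]
    # Restore the repeated rows on the in-memory NumPy array.
    expanded = unique_rows[inverse]
```

Here expanded has one row per entry of ind, while the dataset itself is only ever indexed with a list that has no duplicates.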


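Yes! Thank you. It was indeed a problem of matching the array dimensions. In the example code above, it is enough to change the where statement to: ind = (mdisk > val) – VGP 2015-02-10 10:24:09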
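The fix from this comment can be sketched roughly as follows. The group and dataset names follow the question, but the temporary file and its values are invented for illustration, and the code is written for Python 3 and a recent h5py rather than the 2.3.1 release shown in the traceback.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in file: names follow the question, values are made up.
path = os.path.join(tempfile.mkdtemp(), "field_demo.hdf5")
with h5py.File(path, "w") as f:
    g = f.create_group("Data")
    g.create_dataset("mdisk", data=np.array([1e9, 3e10, 5e9, 4e10, 2.5e10]))
    g.create_dataset("mcold", data=np.array([10.0, 20.0, 30.0, 40.0, 50.0]))

with h5py.File(path, "r") as f:
    group = f["Data"]
    mdisk = group["mdisk"][...]   # note: this still reads all of mdisk
    val = 2.0e10
    mask = mdisk > val            # boolean array, same shape as the dataset
    m = group["mcold"][mask]      # select with the boolean mask
    # Alternative: increasing, duplicate-free integer indices as a list.
    idx = np.flatnonzero(mask)
    m_by_index = group["mcold"][idx.tolist()]

print(m)
```

Both selections should return the same elements of mcold. As noted in the comments on the question, mdisk is still read in full in order to build the mask.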