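Creating a numpy structured array from a list of strings

I'm working on a Python utility to pull data from the Tycho 2 star catalogue. One of the functions I'm working on queries the catalogue and returns all of the information for a given star id (or set of star ids).

I currently do this by looping through the lines of the catalogue file and trying to parse a line into a numpy structured array if it was queried. (Note: if there is a better way to do this, please let me know, even though that is not the point of this question. I'm doing it this way because the catalogue is too big to load into memory all at once.)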
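A minimal sketch of that line-by-line filtering, assuming the star id is the text before the first | on each line. The function name iter_matching_lines and the two-line sample are made up for illustration and are not part of any library:

```python
import io


def iter_matching_lines(lines, wanted_ids):
    """Yield catalogue lines whose leading 'TYC1 TYC2 TYC3' field is in wanted_ids."""
    for line in lines:
        # Everything before the first '|' is treated as the star id (assumption).
        star_id = line.split("|", 1)[0].strip()
        if star_id in wanted_ids:
            yield line.rstrip("\n")


# Stand-in for an open catalogue file (hypothetical, shortened sample lines).
sample = io.StringIO("0002 00038 1| | 3.64121230\n0002 00040 1| | 4.10000000\n")
matches = list(iter_matching_lines(sample, {"0002 00038 1"}))
print(matches)
```

Because the file object is iterated lazily, only one line is held in memory at a time.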

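Anyway, once I've identified a record I want to keep, I run into a problem: I can't figure out how to parse it into a structured array.

For example, say the record I want to keep is: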

record = '0002 00038 1| | 3.64121230| 1.08701186| 14.1| -23.0| 69| 82| 1.8| 1.9|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |   | 3.64117944| 1.08706861|1.83|1.73| 81.0|104.7| | 0.0'

Now, I'm trying to parse this into a numpy structured array with this dtype:

 dform = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]), 
      ('pflag', str), 
      ('starBearing', [('rightAscension', float), ('declination', float)]), 
      ('properMotion', [('rightAscension', float), ('declination', float)]), 
      ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]), 
      ('meanEpoch', [('rightAscension', float), ('declination', float)]), 
      ('numPos', int), 
      ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]), 
      ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]), 
      ('starProximity', int), 
      ('tycho1flag', str), 
      ('hipparcosNumber', str), 
      ('observedPos', [('rightAscension', float), ('declination', float)]), 
      ('observedEpoch', [('rightAscension', float), ('declination', float)]), 
      ('observedError', [('rightAscension', float), ('declination', float)]), 
      ('solutionType', str), 
      ('correlation', float)]

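This seems like it should be fairly simple to do, but everything I've tried fails.

I tried two np.genfromtxt calls that pass a list of delimiters. Both of them give me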

{TypeError}cannot perform accumulate with flexible type

which makes no sense, because it shouldn't be doing any accumulation.

I've also tried

np.array(re.split('\|| ',record),dtype=dform)

which complains

{TypeError}a bytes-like object is required, not 'str'

and another variant

np.array([x.encode() for x in re.split('\|| ',record)],dtype=dform)

which doesn't throw an error, but certainly doesn't return the right result either:

[ ((842018864, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)...

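So how do I do this? I think genfromtxt is the way to go (especially since data will occasionally be missing), but I don't understand why it isn't working. Is this something I'll just have to write my own parser for?

Source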

2015-12-21 Andrew

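Hmm... your 'record' has 32 fields, but 'dform' only has 17. How does that fit? I can imagine it working with 'np.genfromtxt(BytesIO(record.encode()), dtype=dform, delimiter='|')', but as it stands it seems ill-defined. –

Since I don't have enough to recreate the problem, try switching 'str' to 'np.str_' – TriHard8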

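Sorry that this answer is long and rambling, but that is what it took to figure out what was going on. In particular, the complexity of the dtype was hidden by its length.

I get the TypeError: cannot perform accumulate with flexible type error when I try your list of delimiters. The traceback shows that the error occurs in LineSplitter. Without digging into the details, the delimiter should be a single character (or the default 'whitespace').

From the genfromtxt docs:

delimiter : str, int, or sequence, optional. The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.

The genfromtxt splitter is a little more powerful than the string .split that loadtxt uses, but it is not as general as an re splitter.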
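To illustrate the two delimiter forms the docs describe, here is a small sketch with made-up input lines. A single character splits on that character; a sequence is read as field widths in characters, which is presumably why a sequence of strings ends in an accumulate error rather than acting as alternative separators:

```python
import numpy as np

# One separator character: split every line on '|'.
lines = ["1|2.5|3", "4|5.5|6"]
by_char = np.genfromtxt(lines, delimiter="|")

# A sequence of integers: fixed-width fields of 2 and then 3 characters.
fixed_lines = ["12345", "67890"]
by_width = np.genfromtxt(fixed_lines, delimiter=(2, 3))

print(by_char)
print(by_width)
```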

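As for the {TypeError}a bytes-like object is required, not 'str': you specify dtype 'str' for several fields. That is a byte string, whereas record is a unicode string (in Py3). But you already realized that, with BytesIO(record.encode()).

I like to test genfromtxt cases with: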

record = b'....' 
np.genfromtxt([record], ....)

or, better,

records = b"""one line 
two line
three line 
""" 
np.genfromtxt(records.splitlines(), ....)

If I let genfromtxt deduce the field types and use just one delimiter, I get 32 fields:

In [19]: A=np.genfromtxt([record],dtype=None,delimiter='|') 
In [20]: len(A.dtype) 
Out[20]: 32 
In [21]: A 
Out[21]: 
array((b'0002 00038 1', False, 3.6412123, 1.08701186, 14.1, -23.0, 69, 82, 1.8, 1.9, 1968.56, 1957.3, 3, 1.0, 3.0, 0.9, 3.0, 12.444, 0.213, 11.907, 0.189, 999, False, False, 3.64117944, 1.08706861, 1.83, 1.73, 81.0, 104.7, False, 0.0), 
     dtype=[('f0', 'S12'), ('f1', '?'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ... ('f26', '<f8'), ('f27', '<f8'), ('f28', '<f8'), ('f29', '<f8'), ('f30', '?'), ('f31', '<f8')])

Once we get the whole bytes and delimiter issue worked out,

np.array([x for x in re.split(b'\|| ',record)],dtype=dform)

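does run. I now see that your dform is complex, with nested compound fields.

But to define a structured array, you give it a list of records, e.g.

np.array([(record1...), (record2...), ....], dtype=[(field1),(field2),...])

Here you are trying to create a single record. I can wrap your list in a tuple, but then its length does not match what dform expects. If you count all the subfields, dform needs 34 values, and we can't supply those with just one flat tuple.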
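As a sketch of what np.array does accept for a nested dtype: each record is a tuple whose nesting mirrors the dtype, with one inner tuple per compound field. The dtype below is a shortened, three-field version of dform, with 'U1' used for the string field as an assumption:

```python
import numpy as np

# Shortened version of dform: two compound fields and one string field.
small_dtype = np.dtype([
    ("starid", [("TYC1", int), ("TYC2", int), ("TYC3", int)]),
    ("pflag", "U1"),
    ("starBearing", [("rightAscension", float), ("declination", float)]),
])

# The record is a tuple of three items, two of which are tuples themselves.
rec = ((2, 38, 1), "", (3.6412123, 1.08701186))
arr = np.array([rec], dtype=small_dtype)

print(arr["starid"]["TYC2"])
print(arr["starBearing"]["declination"])
```

So a flat list of strings from a split would first have to be converted and regrouped into this nested shape.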

I've never tried to create an array from such a complex dtype, so I'm fishing around for ways to make it work.

In [41]: np.zeros((1,),dform) 
Out[41]: 
array([ ((0, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)], 
     dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), ('pflag', '<U'), ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]), ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]), ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), ('meanEpoch', ....('solutionType', '<U'), ('correlation', '<f8')]) 

In [64]: for name in A.dtype.names: 
    print(A[name].dtype) 
    ....:  
[('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')] 
<U1 
[('rightAscension', '<f8'), ('declination', '<f8')] 
[('rightAscension', '<f8'), ('declination', '<f8')] 
[('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')] 
[('rightAscension', '<f8'), ('declination', '<f8')] 
int32 
[('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')] 
[('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])] 
int32 
<U1 
<U1 
[('rightAscension', '<f8'), ('declination', '<f8')] 
[('rightAscension', '<f8'), ('declination', '<f8')] 
[('rightAscension', '<f8'), ('declination', '<f8')] 
<U1 
float64

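I count 34 primitive dtype fields. Most are 'scalar', some are compound with 2-4 terms, and one has a further level of nesting.

If I replace the first two separating spaces with |, record.split(b'|') gives me 34 strings.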
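That replacement can be sketched as follows, using the record from the question as a str rather than bytes. The count argument of str.replace limits the change to the first two spaces, which are the ones inside the star id:

```python
record = ('0002 00038 1| | 3.64121230| 1.08701186| 14.1| -23.0| 69| 82| 1.8| 1.9'
          '|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |   '
          '| 3.64117944| 1.08706861|1.83|1.73| 81.0|104.7| | 0.0')

# Turn 'TYC1 TYC2 TYC3' into 'TYC1|TYC2|TYC3'; later spaces are left alone.
normalised = record.replace(' ', '|', 2)
fields = normalised.split('|')

print(len(fields))
print(fields[:3])
```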

Let's try that in genfromtxt:

In [79]: np.genfromtxt([record],delimiter='|',dtype=dform) 
Out[79]: 
array(((2, 38, 1), '', (3.6412123, 1.08701186), (14.1, -23.0), 
    (69, 82, 1.8, 1.9), (1968.56, 1957.3), 3, (1.0, 3.0, 0.9, 3.0), 
    ((12.444, 0.213), (11.907, 0.189)), 999, '', '', 
    (3.64117944, 1.08706861), (1.83, 1.73), (81.0, 104.7), '', 0.0), 
     dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), 
('pflag', '<U'), 
('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]), 
('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]), 
('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
('meanEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]), 
('numPos', '<i4'), 
('fitGoodness', [('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
('magnitude', [('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]), 
('starProximity', '<i4'), ('tycho1flag', '<U'), ('hipparcosNumber', '<U'), 
('observedPos', [('rightAscension', '<f8'), ('declination', '<f8')]), 
('observedEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]), 
('observedError', [('rightAscension', '<f8'), ('declination', '<f8')]), ('solutionType', '<U'), ('correlation', '<f8')])

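This looks almost reasonable. genfromtxt can actually split the values up among the compound fields. That is more than I'd want to try with np.array().

So if you get the delimiter and bytes/unicode issues worked out, genfromtxt can handle this mess.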
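A reduced sketch of that behaviour, assuming the star id has already been rewritten with | separators. It uses only two numeric compound fields from dform, and the second input line is invented for illustration:

```python
import numpy as np

# Two compound fields taken from dform; the string fields are left out here.
nested = [
    ("starid", [("TYC1", int), ("TYC2", int), ("TYC3", int)]),
    ("starBearing", [("rightAscension", float), ("declination", float)]),
]

# Five flat columns per line; genfromtxt distributes them over the subfields.
lines = [
    "0002|00038|1| 3.64121230| 1.08701186",
    "0002|00040|1| 4.10000000| 2.20000000",  # hypothetical second record
]
stars = np.genfromtxt(lines, delimiter="|", dtype=nested)

print(stars["starid"]["TYC2"])
print(stars["starBearing"]["declination"])
```

The number of columns on each line has to equal the number of primitive subfields in the dtype (five here, 34 for the full dform).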

Source

2015-12-22 01:17:04 hpaulj

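Thanks for taking the time to write this out. It's really helpful. The problem is (as you stated) that genfromtxt can only handle one delimiter. Apart from one other issue, which I'll raise in a separate question, most of it now seems to be working. – Andrew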
