2017-07-31 65 views
2

How do I parse the arrays in different datasets? I have the following datasets:

File 1:

<Molecular Orbital Primitive Coefficients> 
<MO Number> 
1 
</MO Number> 
4.224609607748e+00 4.085857782359e+00 1.273383604708e+00 -6.802974691818e-03 
9.099528133406e-03 6.867550219273e-03 5.859231188647e-03 3.684441849425e-03 
5.836775773317e-04 -2.316776085880e-16 -1.456850991492e-16 -2.307897076406e-17 
4.140895678156e-03 2.603906355541e-03 4.125025757803e-04 -1.739011495381e-03 
-1.681896173898e-03 -5.241735641835e-04 -1.739011375813e-03 -1.681896058258e-03 
-5.241735281434e-04 
<MO Number> 
2 
</MO Number> 
-9.785273892788e-01 -9.463889258321e-01 -2.949481372149e-01 -1.974411643609e-01 
2.640935048539e-01 1.993153249903e-01 2.392564397119e-01 1.504508715968e-01 
2.383394930083e-02 8.865383702284e-16 5.574791243465e-16 8.831407252698e-17 
1.690897356483e-01 1.063281646128e-01 1.684417017817e-02 4.608108515392e-02 
4.456761845182e-02 1.388977974599e-02 4.608108208174e-02 4.456761548054e-02 
1.388977881997e-02 
</Molecular Orbital Primitive Coefficients> 

File 2:

<Molecular Orbital Primitive Coefficients> 
<MO Number> 
1 
</MO Number> 
3.299451113326e-02 6.087754902119e-02 9.880244651376e-02 1.066781206974e-01 
6.773109582562e-02 1.104778461514e-02 -2.156994392623e-02 3.071021124268e-17 
1.072251279194e-16 -1.396334606969e-02 -2.002731618626e-16 -9.993341885751e-17 
<MO Number> 
2 
</MO Number> 
-2.009498358678e-04 -3.707687449719e-04 -6.017466156746e-04 -9.474065009358e-02 
3.917924760214e-01 -1.299844008310e-01 1.579980866207e-01 -2.827902468319e-15 
1.152587596877e-15 -2.310895197449e-01 2.213502483059e-15 -1.048685827923e-15 
<MO Number> 
3 
</MO Number> 
-1.763944008217e-17 -3.254619757728e-17 -5.282150804455e-17 -3.109320915001e-16 
-9.633800372448e-16 -1.118676262789e-17 -1.336368133403e-15 -1.286598202313e+00 
-1.412088253954e+00 2.299271905206e-15 1.305465570574e+00 1.432795875849e+00 
3.494418486873e-16 -1.710573251253e-01 -1.877416268172e-01 -7.134748738863e-16 
</Molecular Orbital Primitive Coefficients> 

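The size of the arrays and the number of arrays in this kind of dataset vary between files (i.e., some files may have 70 arrays, and therefore 70 MO numbers, while others have 10). I am trying to write a function that parses the data between the MO Number headers into arrays. This is what I have so far: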

import numpy as np

def function3(start, end):
    """Read MO information."""
    config_found = False
    var = []
    # f is the file handle opened in the with block below
    for line in f:
        if line.strip() == end:
            config_found = False
        elif config_found:
            i = line.rstrip()
            var.append(i)
        elif line.strip() == start:
            config_found = True
    var1 = [elem.strip() for elem in var]
    var2 = var1[1:-1]  # drop the MO Number tag lines around the data
    var3 = np.array([line.split() for line in var2])
    var3 = np.asarray([list(map(float, item)) for item in var3])
    return var3

v = {}  # parsed arrays, keyed by name
m = {'start1':'1','end1':'2',
     'start2':'2','end2':''}
with open(filename, 'r') as f:
    v['monumber1']=function3(m['start1'],m['end1'])
    v['monumber2']=function3(m['start2'],m['end2'])

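The problem is that I would need to set these variables 70 times for some files! Also, the start and end variables for the final array do not work for all files. Is there a different way to approach this?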

Thanks!

+2

With regex and numpy... –

+1

Adding the 'regex' and 'numpy' tags might help! –

+1

Does your data source suggest a standard for reading this? The use of <> suggests an XML model, but only loosely. – hpaulj

Answers

1

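Based on Vinicius's comment, I tried a regular expression; please see whether it helps. Using the read() method is generally not recommended, but since there is not much data in this example, I used it.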

import re 

x = [] 
with open(filename, 'r') as fh: 
    # the optional leading minus keeps the sign of negative coefficients
    x = re.findall(r'-?\d\.\d+e[-+]\d+', fh.read()) 

# list() so the result is a list on Python 3, where map() is lazy
out = list(map(float, x)) 

Hope this helps; based on your comments, the above worked for me. The output for file 2 is as follows:

[0.03299451113326, 0.06087754902119, 0.09880244651376, 0.1066781206974, 0.06773109582562, 0.01104778461514, -0.02156994392623, 3.071021124268e-17, 1.072251279194e-16, -0.01396334606969, -2.002731618626e-16, -9.993341885751e-17, -0.0002009498358678, -0.0003707687449719, -0.0006017466156746, -0.09474065009358, 0.3917924760214, -0.129984400831, 0.1579980866207, -2.827902468319e-15, 1.152587596877e-15, -0.2310895197449, 2.213502483059e-15, -1.048685827923e-15, -1.763944008217e-17, -3.254619757728e-17, -5.282150804455e-17, -3.109320915001e-16, -9.633800372448e-16, -1.118676262789e-17, -1.336368133403e-15, -1.286598202313, -1.412088253954, 2.299271905206e-15, 1.305465570574, 1.432795875849, 3.494418486873e-16, -0.1710573251253, -0.1877416268172, -7.134748738863e-16] 
+0

Not quite... there is other data in the file, so I need to use the headers to parse each array into a separate variable. But thank you! – pennypeat

+0

That can also be done with a regex; I will try to get the right one. Is there a better header than 1, 2..70 that I can use? You're welcome –

+0

Yes, you can! – pennypeat
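
For reference, here is a minimal sketch of one way to meet the requirement discussed in the comments: one array per MO number, however many blocks a file contains, with all data outside the coefficients section ignored. It assumes every tag sits on its own line, as in the two sample files. The names parse_mo_coefficients and sample are hypothetical and used only for illustration.

```python
SECTION_OPEN = "<Molecular Orbital Primitive Coefficients>"
SECTION_CLOSE = "</Molecular Orbital Primitive Coefficients>"
MO_OPEN = "<MO Number>"
MO_CLOSE = "</MO Number>"


def parse_mo_coefficients(lines):
    """Return {MO number: flat list of floats} for every MO block in lines."""
    blocks = {}          # MO number -> coefficients collected so far
    in_section = False   # inside the coefficients section?
    in_header = False    # between <MO Number> and </MO Number>?
    current = None       # MO number whose values are being collected

    for raw in lines:
        line = raw.strip()
        if line == SECTION_OPEN:
            in_section = True
        elif line == SECTION_CLOSE:
            in_section = False
            current = None
        elif not in_section or not line:
            continue  # unrelated data or blank line
        elif line == MO_OPEN:
            in_header = True
        elif line == MO_CLOSE:
            in_header = False
        elif in_header:
            # the only line inside the header is the MO number itself
            current = int(line)
            blocks[current] = []
        elif current is not None:
            # rows may have different lengths, so keep one flat list per MO
            blocks[current].extend(float(tok) for tok in line.split())
    return blocks


# Shortened sample in the same layout as the files above (assumed format)
sample = """\
unrelated data 1.000000000000e+00
<Molecular Orbital Primitive Coefficients>
<MO Number>
1
</MO Number>
4.224609607748e+00 4.085857782359e+00
-5.241735281434e-04
<MO Number>
2
</MO Number>
-9.785273892788e-01 8.865383702284e-16 1.388977881997e-02
</Molecular Orbital Primitive Coefficients>
"""

mo = parse_mo_coefficients(sample.splitlines())
print(sorted(mo))   # [1, 2]
print(len(mo[1]))   # 3
```

An open file handle can be passed in place of sample.splitlines(), so the file is read line by line instead of with read(). Each list can be wrapped in np.asarray if NumPy arrays are needed. Collecting a flat list per MO sidesteps the short final row in each block (for example, the single value on the last line of each block in File 1).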