A more efficient way to search large XML files in Python

I have two XML files (A and B), each with about 90k records.

The files have the following format:

<trips> 
    <trip id="" speed=""/> 
       . 
       . 
       . 
       . 
</trips>

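I need to compare the speed attributes of trips that have the same id attribute in the two files. However, the ids do not appear in the same order in both files, so, for example, the following will not work: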

from xml.dom import minidom

A = minidom.parse('A.xml')
B = minidom.parse('B.xml')

triplistA = A.getElementsByTagName('trip')
triplistB = B.getElementsByTagName('trip')

for i in range(len(triplistA)):  # A and B have the same number of trip tags
    tripA = triplistA[i]
    tripB = triplistB[i]

    # get the speed from tripA and tripB and compare, then do something

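This means I have to search through file B to find the matching id, and only then can I compare the speeds. In the worst case this takes on the order of n^2 comparisons, which is very slow for 90k records.

My idea was to remove the record from file B after matching a pair of trips, so that searching B takes less time in the next iteration. I tried removing nodes with minidom, but that took even longer, so I used ElementTree to remove the nodes instead.

This is what I have now: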

from xml.dom import minidom
import xml.etree.ElementTree as ET

A = minidom.parse('A.xml')
triplist = A.getElementsByTagName('trip')
B = ET.parse('B.xml')
rootB = B.getroot()

for tripA in triplist:
    for tripB in rootB.findall('trip'):
        if tripB.get('id') == tripA.attributes['id'].value:
            # take speed from both nodes and do something
            rootB.remove(tripB)
            break

The process did get faster over time as the number of nodes in file B decreased, but it still took an hour and a half to complete.
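A rough count suggests why removing matched nodes helps only modestly. The sketch below assumes n = 90,000 trips per file and the worst case where each match is found at the end of what remains of B; it counts id comparisons only and ignores the cost of `findall` and `remove` themselves.

```python
n = 90_000  # assumed number of trips in each file

# Worst case without removal: every trip in A scans all of B.
full_scan = n * n

# Worst case with removal: B shrinks by one node per match,
# so the scans cost n + (n - 1) + ... + 1 comparisons.
with_removal = n * (n + 1) // 2

print(full_scan)     # 8100000000
print(with_removal)  # 4050045000
```

Removal roughly halves the work, but the total still grows quadratically with the number of records.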

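My project requires me to run this comparison many times, and after the speeds are compared there is another half-hour step (a simulation, so that part of the time is unavoidable). So I would like to know whether there is a more efficient way to search large XML files.

Thanks in advance, everyone.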

Source

2016-06-09 Yi Xu Chee

Convert either of the trees into a dictionary keyed by id, then compare them:

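# A and B are the minidom documents parsed in the question
trips_a = {}
for trip in A.getElementsByTagName('trip'):
    # key by the id value and store the speed, so each lookup is a single dict access
    trips_a[trip.attributes['id'].value] = trip.attributes['speed'].value
for trip in B.getElementsByTagName('trip'):
    trip_value_from_B = trip.attributes['speed'].value
    trip_value_from_A = trips_a[trip.attributes['id'].value]
    # Do something with trip_value_from_A and trip_value_from_B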
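The same idea can be sketched with ElementTree alone. This is a minimal illustration, not the answerer's code: the inline sample XML and the helper names `speeds_by_id` and `compare_speeds` are made up for the example, and it assumes every trip has numeric `id` and `speed` attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for A.xml and B.xml.
XML_A = '<trips><trip id="1" speed="10.0"/><trip id="2" speed="12.5"/></trips>'
XML_B = '<trips><trip id="2" speed="12.5"/><trip id="1" speed="11.0"/></trips>'


def speeds_by_id(root):
    """Map each trip id to its speed, converted to float."""
    return {trip.get('id'): float(trip.get('speed')) for trip in root.iter('trip')}


def compare_speeds(root_a, root_b):
    """Return speed(B) - speed(A) for every id that appears in both trees."""
    speeds_a = speeds_by_id(root_a)  # one pass over A
    diffs = {}
    for trip in root_b.iter('trip'):  # one pass over B
        trip_id = trip.get('id')
        if trip_id in speeds_a:  # dictionary lookup, no scan of A
            diffs[trip_id] = float(trip.get('speed')) - speeds_a[trip_id]
    return diffs


differences = compare_speeds(ET.fromstring(XML_A), ET.fromstring(XML_B))
print(differences)  # {'2': 0.0, '1': 1.0}
```

For the real files, `ET.parse('A.xml').getroot()` and `ET.parse('B.xml').getroot()` would replace the `ET.fromstring` calls. Each file is traversed once, so the work grows linearly with the number of trips instead of quadratically.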

Source

2016-06-09 04:46:00 VBB

It works like a charm! Thank you very much =)
