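Handling multiple nodes when parsing XML with Python

For an assignment, I need to parse a 2-million-line XML file and load the data into a MySQL database. Since we use a Python environment with sqlite in class, I am trying to parse the file with Python. Keep in mind that I am just learning Python, so everything is new!

I have tried several times, but I keep failing and getting more frustrated. To save time, I am testing my code on just a small piece of the full XML, shown here: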

<pub> 
<ID>7</ID> 
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title> 
<year>2003</year> 
<booktitle>AVBPA</booktitle> 
<pages>895-902</pages> 
<authors> 
    <author>J. K. Schneider</author> 
    <author>C. E. Richardson</author> 
    <author>F. W. Kiefer</author> 
    <author>Venu Govindaraju</author> 
</authors> 
</pub>

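First attempt

Here I successfully pull the data out of every tag, except when there are multiple authors under the <authors> tag. I am trying to loop through each node inside the authors tag, count them, build a temporary array of those authors, and then insert them into my database with SQL. I get 15 for the number of authors, but there are clearly only 4! How do I fix this?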

from xml.dom import minidom 

xmldoc= minidom.parse("test.xml") 

pub = xmldoc.getElementsByTagName("pub")[0] 
ID = pub.getElementsByTagName("ID")[0].firstChild.data 
title = pub.getElementsByTagName("title")[0].firstChild.data 
year = pub.getElementsByTagName("year")[0].firstChild.data 
booktitle = pub.getElementsByTagName("booktitle")[0].firstChild.data 
pages = pub.getElementsByTagName("pages")[0].firstChild.data 
authors = pub.getElementsByTagName("authors")[0] 
author = authors.getElementsByTagName("author")[0].firstChild.data 
num_authors = len(author) 
print("Number of authors: ", num_authors) 

print(ID) 
print(title) 
print(year) 
print(booktitle) 
print(pages) 
print(author)

Source

2017-04-23 douglasrcjames

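Note that what you are getting here is the number of characters in the first author's name, because the code restricts the result to the first author only (index 0) and then takes its length: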

author = authors.getElementsByTagName("author")[0].firstChild.data 
num_authors = len(author) 
print("Number of authors: ", num_authors)

Simply don't restrict the result, so that you get all of the authors:

author = authors.getElementsByTagName("author") 
num_authors = len(author) 
print("Number of authors: ", num_authors)
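As a quick check of where the 15 comes from, here is a minimal sketch that compares the two lengths. It assumes an inline copy of the sample's <authors> element (parsed with minidom.parseString) in place of test.xml; the variable names first_name and all_authors are only illustrative.

```python
from xml.dom import minidom

# Inline copy of the <authors> element from the question's sample,
# used here instead of reading test.xml.
xml_text = """<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>"""

authors = minidom.parseString(xml_text).documentElement

# Indexing with [0] selects one element; .firstChild.data is its text (a string).
first_name = authors.getElementsByTagName("author")[0].firstChild.data

# Without the index, the result is the list of all <author> elements.
all_authors = authors.getElementsByTagName("author")

print(len(first_name))   # 15: characters in "J. K. Schneider"
print(len(all_authors))  # 4: number of <author> elements
```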

You can use a list comprehension to get a list of all the author names, rather than the author elements:

author = [a.firstChild.data for a in authors.getElementsByTagName("author")] 
print(author) 
# [u'J. K. Schneider', u'C. E. Richardson', u'F. W. Kiefer', u'Venu Govindaraju']
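Putting the pieces together, the sketch below loops over every <pub> element and stores one row per author. Several things here are assumptions rather than anything stated in the thread: Python 3, a wrapper root element named <pubs> (the real file's root is not shown), an in-memory sqlite3 database standing in for the real one, the table names pub and pub_author, and the helper text_of.

```python
import sqlite3
from xml.dom import minidom

# The sample record from the question, wrapped in a hypothetical <pubs> root.
xml_text = """<pubs>
<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>
</pubs>"""


def text_of(parent, tag):
    """Return the text of the first <tag> under parent, or None if absent or empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.data


# Hypothetical schema: one table for publications, one row per (publication, author).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pub (id INTEGER PRIMARY KEY, title TEXT, "
             "year INTEGER, booktitle TEXT, pages TEXT)")
conn.execute("CREATE TABLE pub_author (pub_id INTEGER, name TEXT)")

doc = minidom.parseString(xml_text)
for pub in doc.getElementsByTagName("pub"):
    pub_id = int(text_of(pub, "ID"))
    conn.execute(
        "INSERT INTO pub VALUES (?, ?, ?, ?, ?)",
        (pub_id, text_of(pub, "title"), int(text_of(pub, "year")),
         text_of(pub, "booktitle"), text_of(pub, "pages")),
    )
    # All author names for this publication, however many there are.
    names = [a.firstChild.data for a in pub.getElementsByTagName("author")]
    conn.executemany(
        "INSERT INTO pub_author VALUES (?, ?)",
        [(pub_id, name) for name in names],
    )
conn.commit()

rows = conn.execute(
    "SELECT name FROM pub_author WHERE pub_id = 7 ORDER BY rowid"
).fetchall()
print(rows)
```

Keeping authors in a separate table avoids guessing a maximum number of authors per publication. With a different database driver, the placeholder syntax in the INSERT statements may differ from the ? used by sqlite3.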

Source

2017-04-23 06:27:02 har07

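I knew I needed to access each item in the array, but I wasn't sure about the syntax. Thank you very much! – douglasrcjames

Hey @har07, I've made progress, but it seems some of my XML data is "bad"... I have a name containing the special character "í", which appears as "&iacute;" in the XML file. How do I handle these special language characters in Python? The error I get is "ExpatError: undefined entity". – douglasrcjames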
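One possible way around the undefined-entity error, offered as a sketch rather than the thread's accepted solution: XML itself only predefines &amp;, &lt;, &gt;, &quot; and &apos;, so HTML names such as &iacute; are unknown to the parser unless a DTD declares them. The code below rewrites HTML named entities as numeric character references before parsing. It assumes Python 3; the helper replace_html_entities and the sample name "Martín" are made up for illustration.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# Entities that XML already understands and that must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_html_entities(text):
    """Rewrite HTML named entities (e.g. &iacute;) as numeric references (&#237;)."""
    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own or unknown entities as they are
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", substitute, text)


# Made-up author name containing an HTML entity, standing in for the real data.
raw = "<author>Mart&iacute;n</author>"

cleaned = replace_html_entities(raw)
name = minidom.parseString(cleaned).documentElement.firstChild.data
print(cleaned)
print(name)
```

For a file on disk, the same function could be applied to the text read from the file before handing it to minidom.parseString; the file's encoding would need to be known for that step.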
