2017-06-19 87 views
0

我正在学习Python,并试图从任何XML文件中提取所有标签和相应值的列表。这是我的代码到目前为止。使用Python将XML转换为标签和值列表

def ParseXml(XmlFile): 
    try: 
     parser = etree.XMLParser(remove_blank_text=True, compact=True) 
     tree = ET.parse(XmlFile, parser) 
     root = tree.getroot() 

     ListOfTags, ListOfValues, ListOfAttribs = [], [], [] 
     for elem in root.iter('*'): 
      Tag = elem.tag 
      ListOfTags.append(Tag) 

      value = elem.text 
      if value is not None: 
       ListOfValues.append(value) 
      else: 
       ListOfValues.append('') 

      attrib = elem.attrib 
      if attrib: 
       ListOfAttribs.append([attrib]) 
      else: 
       ListOfAttribs.append([]) 
     print('%s File parsed successfully' % XmlFile) 
     return (ListOfTags, ListOfValues, ListOfAttribs) 

    except Exception as e: 
     print('Error while parsing XMLs : %s : %s' % (type(e), e)) 
     return ([], [], []) 

对于像这样的XML输入:

<?xml version="1.0" encoding="UTF-8"?> 
<Application Version="2.01"> 
    <UserAuthRequest> 
     <VendorApp> 
      <AppName>SING</AppName> 
     </VendorApp> 
    </UserAuthRequest> 
    <ApplicationRequest ID="12-123-AH"> 
     <GUID>ABD45129-PD1212-121DFL</GUID> 
     <Type tc="200">Streaming</Type> 
     <File></File> 
     <FileExtension VendorCode="200"> 
      <Result> 
       <ResultCode tc="1">Success</ResultCode> 
      </Result> 
     </FileExtension> 
    </ApplicationRequest> 
</Application> 

此输出的标记,值和属性多个列表。这工作正常。

['Application', 'UserAuthRequest', 'VendorApp', 'AppName', 'ApplicationRequest', 'GUID', 'Type', 'File', 'FileExtension', 'Result', 'ResultCode'] 
['', '', '', 'SING', '', 'ABD45129-PD1212-121DFL', 'Streaming', '', '', '', 'Success'] 
[[{'Version': '2.01'}], [], [], [], [{'ID': '12-123-AH'}], [], [{'tc': '200'}], [], [{'VendorCode': '200'}], [], [{'tc': '1'}]] 

但我的问题是,我需要标签,包括父母和孩子的标签。像下面的实际输出我靶向:

['Application', 'UserAuthRequest', 'UserAuthRequest.VendorApp', 'UserAuthRequest.VendorApp.AppName', 'ApplicationRequest', 'ApplicationRequest.GUID', 'ApplicationRequest.Type', 'ApplicationRequest.File', 'ApplicationRequest.File.FileExtension', 'ApplicationRequest.File.FileExtension.Result', 'ApplicationRequest.File.FileExtension.Result.ResultCode'] 

我如何做到这一点与Python?还是有其他的替代方法来做到这一点?

+1

你尝试过使用BeautifulSoup吗? – snapcrack

+0

我在某处读到它与lxml类似的地方。是否有可能使用BeautifulSoup获得所需的输出?如果是这样,怎么样? – Naveen

+0

目标输出似乎不一致,至少对于根节点的孩子来说;他们应该是'Application.UserAuthRequest'和'Application.ApplicationRequest'。另外,_xml_中没有'ApplicationRequest.File。*'。 – CristiFati

回答

0

下面是仅使用[Python]: xml.etree.ElementTree — The ElementTree XML API递归的方法:

import xml.etree.ElementTree as ET 


def parse_node(node, ancestor_string=""): 
    #print(type(node), dir(node)) 

    if ancestor_string: 
     node_string = ".".join([ancestor_string, node.tag]) 
    else: 
     node_string = node.tag 
    tag_list = [node_string] 
    text = node.text 
    if text: 
     text_list = [text.strip()] 
    else: 
     text_list = [""] 
    attr_list = [node.attrib] 
    for child_node in list(node): 
     child_tag_list, child_text_list, child_attr_list = parse_node(child_node, ancestor_string=node_string) 
     tag_list.extend(child_tag_list) 
     text_list.extend(child_text_list) 
     attr_list.extend(child_attr_list) 
    return tag_list, text_list, attr_list 


def parse_xml(file_name): 
    tree = ET.parse("test.xml") 
    root_node = tree.getroot() 
    tags, texts, attrs = parse_node(root_node) 
    print(tags) 
    print(texts) 
    print(attrs) 


def main(): 
    parse_xml("a.xml") 


if __name__ == "__main__": 
    main() 

注意

  • 的想法是“记住路径“中的xml树。这是通过parse_nodeancestor_string的说法,这是计算了树中的每个节点,并传递给它(直接)孩子做
  • 的命名来自一个不同的问题,因为[Python]: PEP 8 -- Style Guide for Python Code考虑
  • 在1 ST一目了然,它似乎是有两个函数(mainparse_xml),其中一个只是调用别的,只增加了嵌套的无用的水平,但它是一个好的做法,我习惯了
  • 我纠正的属性列表。相反,含有单字典中的每个内部列表列表的列表,返回字典列表

输出(我和的Python 2.7的Python 3.5运行脚本):

['Application', 'Application.UserAuthRequest', 'Application.UserAuthRequest.VendorApp', 'Application.UserAuthRequest.VendorApp.AppName', 'Application.ApplicationRequest', 'Application.ApplicationRequest.GUID', 'Application.ApplicationRequest.Type', 'Application.ApplicationRequest.File', 'Application.ApplicationRequest.FileExtension', 'Application.ApplicationRequest.FileExtension.Result', 'Application.ApplicationRequest.FileExtension.Result.ResultCode'] 
['', '', '', 'SING', '', 'ABD45129-PD1212-121DFL', 'Streaming', '', '', '', 'Success'] 
[{'Version': '2.01'}, {}, {}, {}, {'ID': '12-123-AH'}, {}, {'tc': '200'}, {}, {'VendorCode': '200'}, {}, {'tc': '1'}] 
0

我相信这是你所需要的:

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

soup = BeautifulSoup(yourlinkhere, 'lxml') 

lst = [] 

for tag in soup.findChildren(): 
    if tag.child: 
     lst.append(str(tag.name) + '.' + str(tag.child)) 
    else: 
     lst.append(tag.name)