2017-04-13 99 views
1

Parsing strange XML with Python?

I have this odd XML that I'm trying to parse, and even after reading up on it I'm still having problems.

I want to parse the NIST CVE database, which is only available as XML. Here is a sample of it.

<?xml version='1.0' encoding='UTF-8'?> 
<nvd xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.1" xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2" xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:patch="http://scap.nist.gov/schema/patch/0.1" xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0" xmlns:cpe-lang="http://cpe.mitre.org/language/2.0" nvd_xml_version="2.0" pub_date="2017-04-12T18:00:08" xsi:schemaLocation="http://scap.nist.gov/schema/patch/0.1 https://scap.nist.gov/schema/nvd/patch_0.1.xsd http://scap.nist.gov/schema/feed/vulnerability/2.0 https://scap.nist.gov/schema/nvd/nvd-cve-feed_2.0.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd"> 
    <entry id="CVE-2013-7450"> 
    <vuln:vulnerable-configuration id="http://nvd.nist.gov/"> 
     <cpe-lang:logical-test operator="OR" negate="false"> 
     <cpe-lang:fact-ref name="cpe:/a:pulp_project:pulp:2.2.1-1"/> 
     </cpe-lang:logical-test> 
    </vuln:vulnerable-configuration> 
    <vuln:vulnerable-software-list> 
     <vuln:product>cpe:/a:pulp_project:pulp:2.2.1-1</vuln:product> 
    </vuln:vulnerable-software-list> 
    <vuln:cve-id>CVE-2013-7450</vuln:cve-id> 
    <vuln:published-datetime>2017-04-03T11:59:00.143-04:00</vuln:published-datetime> 
    <vuln:last-modified-datetime>2017-04-11T10:01:04.323-04:00</vuln:last-modified-datetime> 
    <vuln:cvss> 
     <cvss:base_metrics> 
     <cvss:score>5.0</cvss:score> 
     <cvss:access-vector>NETWORK</cvss:access-vector> 
     <cvss:access-complexity>LOW</cvss:access-complexity> 
     <cvss:authentication>NONE</cvss:authentication> 
     <cvss:confidentiality-impact>NONE</cvss:confidentiality-impact> 
     <cvss:integrity-impact>PARTIAL</cvss:integrity-impact> 
     <cvss:availability-impact>NONE</cvss:availability-impact> 
     <cvss:source>http://nvd.nist.gov</cvss:source> 
     <cvss:generated-on-datetime>2017-04-11T09:43:13.623-04:00</cvss:generated-on-datetime> 
     </cvss:base_metrics> 
    </vuln:cvss> 
    <vuln:cwe id="CWE-295"/> 
    <vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY"> 
     <vuln:source>MLIST</vuln:source> 
     <vuln:reference href="http://www.openwall.com/lists/oss-security/2016/04/18/11" xml:lang="en">[oss-security] 20160418 CVE-2013-7450: Pulp &lt; 2.3.0 distributed the same CA key to all users</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY"> 
     <vuln:source>MLIST</vuln:source> 
     <vuln:reference href="http://www.openwall.com/lists/oss-security/2016/04/18/5" xml:lang="en">[oss-security] 20160418 Re: CVE request - Pulp &lt; 2.3.0 shipped the same authentication CA key/cert to all users</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY"> 
     <vuln:source>MLIST</vuln:source> 
     <vuln:reference href="http://www.openwall.com/lists/oss-security/2016/05/20/1" xml:lang="en">[oss-security] 20160519 Pulp 2.8.3 Released to address multiple CVEs</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="PATCH"> 
     <vuln:source>CONFIRM</vuln:source> 
     <vuln:reference href="https://bugzilla.redhat.com/show_bug.cgi?id=1003326" xml:lang="en">https://bugzilla.redhat.com/show_bug.cgi?id=1003326</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="PATCH"> 
     <vuln:source>CONFIRM</vuln:source> 
     <vuln:reference href="https://bugzilla.redhat.com/show_bug.cgi?id=1328345" xml:lang="en">https://bugzilla.redhat.com/show_bug.cgi?id=1328345</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY"> 
     <vuln:source>CONFIRM</vuln:source> 
     <vuln:reference href="https://github.com/pulp/pulp/pull/627" xml:lang="en">https://github.com/pulp/pulp/pull/627</vuln:reference> 
    </vuln:references> 
    <vuln:summary>Pulp before 2.3.0 uses the same the same certificate authority key and certificate for all installations.</vuln:summary> 
    </entry> 
</nvd>

I tried parsing it with ET (ElementTree), but I get some odd output...

For example, when I use this,

from xml.etree import ElementTree

with open('/tmp/nvdcve-2.0-modified 2.xml', 'rt') as f:
    tree = ElementTree.parse(f)
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)

my output looks like this...

{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry {'id': 'CVE-2007-6759'} 

What makes it confusing is that if I want to iterate over it, I seem to need to do...

for child in root.iter('{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry'): 

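If I do that, though, I have no idea what the child's children are, or anything else about them.

For example, I'm trying to pull out vuln:cve-id, each individual cvss:base_metrics value (score, access vector), vuln:summary, and vuln:product.

Basically, I'm trying to download the "XML feed" from the NIST website every hour and load the updates into a local MySQL database, so that I can also query it locally when doing vulnerability assessments in my environment. Figuring out how to iterate over this XML stuff is confusing as hell. I thought about converting it to JSON, but since there is no 1:1 XML/JSON conversion, that seems like an unnecessary extra step that could cause problems.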

Answers

1

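Yes, XML with namespaces has to be handled a little differently. Here is another solution that sticks with the ElementTree API.

With this library's namespace handling, wherever you see vuln:summary you need to look up the vuln namespace in the root element's attributes and then refer to the element as {http://scap.nist.gov/schema/vulnerability/0.4}summary for it to work.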

import xml.etree.ElementTree as ET

tree = ET.parse('nvdcve-2.0-Modified.xml')
root = tree.getroot()
# default namespace is given by xmlns attribute of root element, still must be provided
for entry in root.findall('{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry'):
    product_list = []
    metric_list = []
    # just use the element's id attribute
    cve_id = entry.get('id')

    # findtext returns None instead of raising if an entry has no summary
    summary = entry.findtext('{http://scap.nist.gov/schema/vulnerability/0.4}summary')

    software = entry.find('{http://scap.nist.gov/schema/vulnerability/0.4}vulnerable-software-list')
    if software is not None:
        for sw in software.findall('{http://scap.nist.gov/schema/vulnerability/0.4}product'):
            product_list.append(sw.text)

    metrics = entry.find('{http://scap.nist.gov/schema/vulnerability/0.4}cvss')
    if metrics is not None:
        for metric in metrics.find('{http://scap.nist.gov/schema/cvss-v2/0.2}base_metrics').findall('*'):
            # we don't know the element name, but can get it with the tag property
            metric_list.append(metric.tag.replace('{http://scap.nist.gov/schema/cvss-v2/0.2}', '') + ': ' + metric.text)

    print(cve_id, summary, product_list, metric_list)
    # save to database!
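
If repeating the full {…} form gets tedious, one possible shortcut is that ElementTree's find, findall and findtext also accept a prefix-to-URI mapping. The sketch below is only an illustration: the NS dictionary, the 'feed' prefix, and the cut-down SAMPLE document (with a shortened summary) are assumptions of mine, while the two namespace URIs are the ones from the feed.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the NVD feed (illustrative only, summary shortened).
SAMPLE = """<nvd xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
     xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4">
  <entry id="CVE-2013-7450">
    <vuln:vulnerable-software-list>
      <vuln:product>cpe:/a:pulp_project:pulp:2.2.1-1</vuln:product>
    </vuln:vulnerable-software-list>
    <vuln:summary>Pulp before 2.3.0 uses the same key.</vuln:summary>
  </entry>
</nvd>"""

# Prefixes are local to this code; only the URIs have to match the document.
NS = {
    'feed': 'http://scap.nist.gov/schema/feed/vulnerability/2.0',
    'vuln': 'http://scap.nist.gov/schema/vulnerability/0.4',
}

root = ET.fromstring(SAMPLE)
results = []
for entry in root.findall('feed:entry', NS):
    # default='' avoids a None when an entry has no summary
    summary = entry.findtext('vuln:summary', default='', namespaces=NS)
    products = [p.text for p in
                entry.findall('vuln:vulnerable-software-list/vuln:product', NS)]
    results.append((entry.get('id'), summary, products))

print(results)
```

The default namespace needs a made-up prefix here ('feed') because a bare name in the search path refers to an element in no namespace at all.
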
+0

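Great, thank you. I'm not familiar with namespaces, this is my first time working with XML, and it's super confusing. I usually just use JSON. – Mallachar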

+0

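One last question, if I may: how would I go about getting cvss:score specifically? I know I can do metric_list[0], but what if, instead of pulling all the base metrics, I want to pull just that one? Would I do another nested for loop? – Mallachar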

+0

Just look at the existing code, but replace 'findall('*')' with the specific element you're looking for. – miken32
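
A minimal sketch of what that suggestion could look like, with no extra loop needed. The helper name get_score, the VULN and CVSS constants, and the one-entry SAMPLE are hypothetical and only serve the illustration; the namespace URIs are taken from the feed.

```python
import xml.etree.ElementTree as ET

VULN = '{http://scap.nist.gov/schema/vulnerability/0.4}'
CVSS = '{http://scap.nist.gov/schema/cvss-v2/0.2}'

# Minimal stand-in for one <entry> from the feed (illustrative only).
SAMPLE = """<entry xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
       xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4"
       xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2" id="CVE-2013-7450">
  <vuln:cvss>
    <cvss:base_metrics>
      <cvss:score>5.0</cvss:score>
      <cvss:access-vector>NETWORK</cvss:access-vector>
    </cvss:base_metrics>
  </vuln:cvss>
</entry>"""


def get_score(entry):
    """Return the CVSS base score as a float, or None if the entry has none."""
    # One path with fully qualified tags, walking cvss -> base_metrics -> score.
    text = entry.findtext(VULN + 'cvss/' + CVSS + 'base_metrics/' + CVSS + 'score')
    return float(text) if text else None


entry = ET.fromstring(SAMPLE)
score = get_score(entry)
print(score)
```

Returning None for a missing score is a design choice for this sketch; not every entry in the feed necessarily carries a cvss block, which is also why the answer's code checks for None.
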

2

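This is a namespaced XML document, so you need to address the nodes using their respective namespaces.

The namespaces used in the document are defined at the top of the document and are mapped to so-called namespace prefixes: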

xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0" 
xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2" 
xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4" 
... 

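So the prefix vuln is mapped to "http://scap.nist.gov/schema/vulnerability/0.4", for example.

The one without a prefix is called the default namespace. It applies to all nodes that don't use an explicit namespace prefix (such as the root node nvd and the entry nodes).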
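
To make the effect of the default namespace concrete, here is a small sketch using the standard library's ElementTree. The FEED_NS constant and the two-element document are assumptions for illustration; only the URI comes from the feed.

```python
import xml.etree.ElementTree as ET

FEED_NS = 'http://scap.nist.gov/schema/feed/vulnerability/2.0'

# nvd and entry carry no prefix, so both fall into the default namespace.
doc = ET.fromstring(
    '<nvd xmlns="%s"><entry id="CVE-2013-7450"/></nvd>' % FEED_NS
)

# The parsed tag includes the namespace URI in braces.
print(doc.tag)

# A bare name searches for 'entry' in no namespace, so nothing is found.
bare = doc.findall('entry')
# The fully qualified name matches the element.
qualified = doc.findall('{%s}entry' % FEED_NS)
print(len(bare), len(qualified))
```

This is why unprefixed elements like entry still have to be addressed with a namespace when searching.
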


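So, to address these elements you need to use either the fully qualified namespace or an appropriate namespace prefix (which, in your code, you can map differently from how it is mapped in the document being parsed).

Here is an example of doing that using lxml (and XPath expressions):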

from lxml import etree

NSMAP = {
    'n': 'http://scap.nist.gov/schema/feed/vulnerability/2.0',
    'cpe-lang': 'http://cpe.mitre.org/language/2.0',
    'cvss': 'http://scap.nist.gov/schema/cvss-v2/0.2',
    'patch': 'http://scap.nist.gov/schema/patch/0.1',
    'scap-core': 'http://scap.nist.gov/schema/scap-core/0.1',
    'vuln': 'http://scap.nist.gov/schema/vulnerability/0.4',
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
}


def normalized_tag(node):
    # strip the '{namespace-uri}' part from the tag, leaving the local name
    return node.tag.replace('{%s}' % node.nsmap[node.prefix], '')


# pass the path directly; lxml opens and closes the file itself
root = etree.parse('nvdcve.xml').getroot()


entries = root.xpath('//n:nvd/n:entry', namespaces=NSMAP)
for entry in entries:
    print("Entry: %r" % entry.attrib['id'])

    # CVE ID
    cve_id = entry.xpath('./vuln:cve-id/text()', namespaces=NSMAP)[0]
    print("    CVE ID: %r" % cve_id)

    # Base Metrics
    metrics = entry.xpath('./vuln:cvss/cvss:base_metrics/*', namespaces=NSMAP)
    print("    Base Metrics:")
    for metric in metrics:
        metric_name = normalized_tag(metric)
        metric_value = metric.text
        print("    %s: %s" % (metric_name, metric_value))

    # Summary
    summary = entry.xpath('./vuln:summary/text()', namespaces=NSMAP)[0]
    print("    Summary: %s" % summary)

    # Products
    products = entry.xpath('./vuln:vulnerable-software-list/vuln:product',
                           namespaces=NSMAP)
    for product in products:
        print("    Product: %s" % product.text)

Output:

Entry: 'CVE-2013-7450' 
    CVE ID: 'CVE-2013-7450' 
    Base Metrics: 
    score: 5.0 
    access-vector: NETWORK 
    access-complexity: LOW 
    authentication: NONE 
    confidentiality-impact: NONE 
    integrity-impact: PARTIAL 
    availability-impact: NONE 
    source: http://nvd.nist.gov 
    generated-on-datetime: 2017-04-11T09:43:13.623-04:00 
    Summary: Pulp before 2.3.0 uses the same the same certificate authority key and certificate for all installations. 
    Product: cpe:/a:pulp_project:pulp:2.2.1-1 

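For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML Namespaces.

For more information on XPath syntax, see for example the XPath Syntax page in the W3Schools XPath Tutorial.

To get started with XPath, it can also be very helpful to fiddle with your document in one of the many XPath testers. In addition, the Firebug plugin for Firefox or the Google Chrome inspector let you display an XPath (one of many possible) for a selected element.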

+0

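Ah, good to know, thank you. I didn't realize namespaces existed or that things worked this way. Trying to use the ET tutorial for this was confusing by comparison. Thanks! – Mallachar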