Parsing XML attrib with Python: splitting on ' and " characters

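I'm working with the NVD XML and trying to parse and split it so it can eventually go into a database. The problem I'm running into is that the parsed XML attrib values are quoted with either ' or ", and I can't split the strings when the two are mixed. I've included the code and the entry it currently fails on. The expected output is: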

product,america's_first_federal_credit_union,america's_first_fcu_mobile_banking

Code

#!/usr/bin/env python 
import os 
import sys 
import time 
from subprocess import call 
import xml.etree.ElementTree 
import re 

range_from = 2017 
range_to = 2017 

def process_entry(entry): 
    cve = entry.attrib.get("name") 
    print cve 
    cpes = get_cpes_affected(entry) 


def get_cpes_affected(entry): 
    child = [] 
    for e in entry.iter(): 
     if "}prod" in e.tag: 
      print e.attrib 
      print unichr(34) 
      if unichr(34) in e.attrib: 
       print "hey yo" 
       child.append("product," + str(e.attrib).split('"')[1] + "," + str(e.attrib).split('"')[3]) 
      else: 
       child.append("product," + str(e.attrib).split("'")[3] + "," + str(e.attrib).split("'")[7]) 
      #print e.tag, e.attrib 
     if "'prev'" in e.attrib: 
      child.append("version," + str(e.attrib).split("'")[7] + "," + str(e.attrib).split("'")[3]) 
     if "}vers" in e.tag and "'prev'" not in e.attrib: 
      child.append("version," + str(e.attrib).split("'")[3] + ",") 
      #print e.tag, e.attrib 
    for derp in child: 
     print derp 

for i in range(range_from, range_to+1): 
    os.system("wget -O tmp.zip https://nvd.nist.gov/download/nvdcve-%i.xml.zip" % i) 
    os.system("unzip -o tmp.zip") 
    e = xml.etree.ElementTree.parse('nvdcve-%i.xml' % i).getroot() 

    for entry in e: 
     process_entry(entry)

XML being parsed

<entry type="CVE" name="CVE-2017-5916" seq="2017-5916" published="2017-05-05" modified="2017-05-16" severity="Medium" CVSS_version="2.0" CVSS_score="4.3" CVSS_base_score="4.3" CVSS_impact_subscore="2.9" CVSS_exploit_subscore="8.6" CVSS_vector="(AV:N/AC:M/Au:N/C:P/I:N/A:N)"> 
<desc> 
    <descript source="cve">The America's First Federal Credit Union (FCU) Mobile Banking app 3.1.0 for iOS does not verify X.509 certificates from SSL servers, which allows man-in-the-middle attackers to spoof servers and obtain sensitive information via a crafted certificate.</descript> 
</desc> 
<loss_types> 
    <conf/> 
</loss_types> 
<range> 
    <network/> 
</range> 
<refs> 
    <ref source="MISC" url="https://medium.com/@chronic_9612/follow-up-76-popular-apps-confirmed-vulnerable-to-silent-interception-of-tls-protected-data-64185035029f" adv="1">https://medium.com/@chronic_9612/follow-up-76-popular-apps-confirmed-vulnerable-to-silent-interception-of-tls-protected-data-64185035029f</ref> 
</refs> 
<vuln_soft> 
    <prod name="america's_first_fcu_mobile_banking" vendor="america's_first_federal_credit_union"> 
    <vers num="3.1.0" prev="1" edition=":~~~iphone_os~~"/> 
    </prod> 
</vuln_soft>

Entry it fails on

{'vendor': "america's_first_federal_credit_union", 'name': "america's_first_fcu_mobile_banking"}

Example of an XML entry that it is able to split without a problem

{'vendor': 'emirates_nbd_bank_p.j.s.c', 'name': 'emirates_nbd_ksa'}

Sorry, I forgot to include the error:

Traceback (most recent call last): 
    File "prev-version-load.py", line 49, in <module> 
    process_entry(entry) 
    File "prev-version-load.py", line 18, in process_entry 
    cpes = get_cpes_affected(entry) 
    File "prev-version-load.py", line 33, in get_cpes_affected 
    child.append("product," + str(e.attrib).split("'")[3] + "," + str(e.attrib).split("'")[7]) 
IndexError: list index out of range

Source

2017-10-05 Adthrawn

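And the error you're getting is...?

Are you using lxml?

What output are you trying to get? Calling str() on a dict and then trying to parse the result is almost certainly not what you want to do...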

This has nothing to do with parsing XML; it is about how you format the output.

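Unlike shell scripting, where almost everything is a string and you can fiddle with strings to get the output you want, Python is an object-oriented language, and objects in Python have types. In particular, e.attrib is a dictionary, and string operations are the wrong tool for a dictionary: str(e.attrib) only gives you the dict's printed representation, and that representation switches to double quotes around any value containing an apostrophe, which is why splitting on ' runs out of pieces for this entry.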
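To make the point concrete, here is a minimal sketch. The two dictionaries are hand-written stand-ins for e.attrib, copied from the entries in the question, and product_line is a hypothetical helper name. It shows that splitting the printed dict on ' yields a different number of pieces depending on the data, while key lookups are unaffected by quoting.

```python
# Hand-written stand-ins for e.attrib, taken from the two entries in the question.
plain = {'vendor': 'emirates_nbd_bank_p.j.s.c', 'name': 'emirates_nbd_ksa'}
quoted = {'vendor': "america's_first_federal_credit_union",
          'name': "america's_first_fcu_mobile_banking"}

# repr() of a string uses double quotes when the value contains an apostrophe,
# so the number of pieces after splitting on "'" differs between the entries.
plain_pieces = str(plain).split("'")
quoted_pieces = str(quoted).split("'")
print("pieces: {} vs {}".format(len(plain_pieces), len(quoted_pieces)))


def product_line(attrib):
    """Build the CSV-style line by key lookup; quoting never matters here."""
    return "product,{},{}".format(attrib['vendor'], attrib['name'])


print(product_line(plain))
print(product_line(quoted))
```

The second entry produces only 7 pieces, so index 7 does not exist, which matches the IndexError in the traceback.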

Instead of what I think you are trying to do, I suggest using ElementTree's findall() method. For example, I think this is what you actually want to do:

#!/usr/bin/env python 
from xml.etree import ElementTree as ET 

range_from = 2017 
range_to = 2017 

def process_entry(entry): 
    cve = entry.attrib.get("name") 
    print cve 
    cpes = get_cpes_affected(entry) 


def get_cpes_affected(entry): 
    prods = entry.findall('nvd:vuln_soft/nvd:prod', namespaces=namespaces) 
    for prod in prods: 
     print prod.attrib 
     print '"' 
    for prod in prods: 
     print "product,{},{}".format(prod.attrib['vendor'], prod.attrib['name']) 
     for vers in prod.findall('nvd:vers', namespaces=namespaces): 
      if vers.get('edition'): 
       print "version,{},".format(vers.attrib['edition']) 
      elif vers.get('prev') == '1': 
       print "version,{},".format(vers.attrib['prev']) 
      else: 
       print "version,{},".format(vers.attrib['num']) 


namespaces = {'nvd': 'http://nvd.nist.gov/feeds/cve/1.2'} 
# OPTIONAL: registering namespace is useful for outputting XML with ET.tostring()/ET.dump() 
#for prefix, ns in namespaces.items(): 
# ET.register_namespace(prefix, ns) 

for i in range(range_from, range_to+1): 
    e = ET.parse('nvdcve-%i.xml' % i).getroot() 

    for entry in e: 
     process_entry(entry)
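For trying the findall() approach without downloading the feed, here is a self-contained sketch that parses a trimmed copy of the entry from the question. Several things are assumptions: the root element name in SAMPLE, the NS name, returning a list instead of printing, and the layout of the version line (number, then the prev flag), which the question does not specify.

```python
import xml.etree.ElementTree as ET

# Namespace URI as used above; check the xmlns on your file's root element.
NS = {'nvd': 'http://nvd.nist.gov/feeds/cve/1.2'}

# Trimmed stand-in for the feed, containing only the entry from the question.
SAMPLE = """<nvd xmlns="http://nvd.nist.gov/feeds/cve/1.2">
  <entry type="CVE" name="CVE-2017-5916">
    <vuln_soft>
      <prod name="america's_first_fcu_mobile_banking" vendor="america's_first_federal_credit_union">
        <vers num="3.1.0" prev="1" edition=":~~~iphone_os~~"/>
      </prod>
    </vuln_soft>
  </entry>
</nvd>"""


def get_cpes_affected(entry):
    """Return 'product,...' and 'version,...' lines for one <entry>."""
    lines = []
    for prod in entry.findall('nvd:vuln_soft/nvd:prod', NS):
        lines.append("product,{},{}".format(prod.get('vendor'), prod.get('name')))
        for vers in prod.findall('nvd:vers', NS):
            # Assumed layout: version number, then the 'prev' flag (empty if absent).
            lines.append("version,{},{}".format(vers.get('num'), vers.get('prev', '')))
    return lines


root = ET.fromstring(SAMPLE)
for entry in root.findall('nvd:entry', NS):
    print(entry.get('name'))
    for line in get_cpes_affected(entry):
        print(line)
```

Because the values come straight from the attribute dictionary, the apostrophes in the vendor and product names need no special handling.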

Source

2017-10-05 17:46:01

Yes, this is what I was originally trying to do and couldn't get working, so I fell back to the weird, not-quite-working thing I'm doing now. – Adthrawn

Consider replacing...

if "}prod" in e.tag: 
    print unichr(34) 
    if unichr(34) in e.attrib: 
     print "hey yo" 
     child.append("product," + str(e.attrib).split('"')[1] + "," + str(e.attrib).split('"')[3]) 
    else: 
     child.append("product," + str(e.attrib).split("'")[3] + "," + str(e.attrib).split("'")[7]) 
    #print e.tag, e.attrib 
if "'prev'" in e.attrib: 
    child.append("version," + str(e.attrib).split("'")[7] + "," + str(e.attrib).split("'")[3]) 
if "}vers" in e.tag and "'prev'" not in e.attrib: 
    child.append("version," + str(e.attrib).split("'")[3] + ",")

With ...

reg=r"\"|'(?=[^\"]*')|'(?=\W*\")" 
if "prod" in e.tag: 
    #print(re.split(reg,str(e.attrib))) 
    child.append("product," + re.split(reg,str(e.attrib))[3] + "," + re.split(reg,str(e.attrib))[7]) 
    #print e.tag, e.attrib 
if "prev" in e.attrib: 
    child.append("version," + re.split(reg,str(e.attrib))[7] + "," + re.split(reg,str(e.attrib))[3]) 
if "vers" in e.tag and "prev" not in e.attrib: 
    child.append("version," + re.split(reg,str(e.attrib))[3] + ",")

Let me know if this works and I'll explain.
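As a rough reading of the pattern: it splits on every double quote, on a single quote that is followed later by another single quote with no double quote in between, and on a single quote followed only by non-word characters and then a double quote. The apostrophe inside america's matches none of these, so it stays in the value. The sketch below applies the pattern to literal copies of the two attribute strings from the question; the variable names are only for illustration.

```python
import re

# The pattern from the snippet above.
reg = r"\"|'(?=[^\"]*')|'(?=\W*\")"

# Literal copies of the two attribute strings shown in the question.
quoted = """{'vendor': "america's_first_federal_credit_union", 'name': "america's_first_fcu_mobile_banking"}"""
plain = "{'vendor': 'emirates_nbd_bank_p.j.s.c', 'name': 'emirates_nbd_ksa'}"

quoted_parts = re.split(reg, quoted)
plain_parts = re.split(reg, plain)

# Index 3 is the first value and index 7 the second, as in the snippet above.
print(quoted_parts[3])
print(quoted_parts[7])
print(plain_parts[3])
print(plain_parts[7])
```

One thing this exposes: in the all-single-quote string the final ' has no quote after it, so neither lookahead matches and plain_parts[7] comes back as emirates_nbd_ksa'} with the trailing characters attached. That fragility is a good reason to read the values from the dictionary directly.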

UPDATE

A better solution is as follows:

if "prod" in e.tag: 
    # vendor first, then name, to match the expected output in the question 
    child.append("product," + e.attrib['vendor'] + "," + e.attrib['name']) 
if "prev" in e.attrib: 
    child.append("version," + e.attrib['prev'] + "," + e.attrib['num']) 
if "vers" in e.tag and "prev" not in e.attrib: 
    child.append("version," + e.attrib['num'] + ",")

A working example using the XML you gave is here, covering all three cases: your code, my original solution, and the updated solution.

Source

2017-10-05 16:50:40 kaza

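Ah, the second solution is much better than the hodgepodge I was attempting. I had tried to use XPath to pull the fields apart without the ifs, but I couldn't get it to work. – Adthrawn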

@Adthrawn: stdlib's xml.etree supports only a limited subset of XPath. If you want full XPath, you should use [lxml](https://pypi.python.org/pypi/lxml)'s [etree](http://lxml.de/tutorial.html).
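For reference, the standard library's ElementTree does accept a small subset of XPath in find() and findall(), including descendant searches and attribute predicates, which may be enough to pull the fields apart without the ifs. A minimal sketch, assuming a trimmed copy of the entry with the namespace removed for brevity (with the real feed the tag names would need the namespace prefix):

```python
import xml.etree.ElementTree as ET

# Trimmed, namespace-free stand-in for the entry from the question.
SAMPLE = """<entry name="CVE-2017-5916">
  <vuln_soft>
    <prod name="america's_first_fcu_mobile_banking" vendor="america's_first_federal_credit_union">
      <vers num="3.1.0" prev="1" edition=":~~~iphone_os~~"/>
    </prod>
  </vuln_soft>
</entry>"""

entry = ET.fromstring(SAMPLE)

# './/prod' searches all descendants, so no checks on tag names are needed.
products = [(p.get('vendor'), p.get('name')) for p in entry.findall('.//prod')]

# '[@prev]' is an attribute-presence predicate from the supported XPath subset.
versions_with_prev = [v.get('num') for v in entry.findall('.//vers[@prev]')]

print(products)
print(versions_with_prev)
```

Anything beyond this subset, such as XPath functions or axes, is where lxml becomes the better choice.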
