2012-08-03
import xml.dom.minidom 

content = """ 
<urlset xmlns="http://www.google.com/schemas/sitemap/0.90"> 
    <url> 
    <loc>http://www.domain.com/</loc> 
    <lastmod>2011-01-27T23:55:42+01:00</lastmod> 
    <changefreq>daily</changefreq> 
    <priority>0.5</priority> 
    </url> 
    <url> 
    <loc>http://www.domain.com/page1.html</loc> 
    <lastmod>2011-01-26T17:24:27+01:00</lastmod> 
    <changefreq>daily</changefreq> 
    <priority>0.5</priority> 
    </url> 
    <url> 
    <loc>http://www.domain.com/page2.html</loc> 
    <lastmod>2011-01-26T15:35:07+01:00</lastmod> 
    <changefreq>daily</changefreq> 
    <priority>0.5</priority> 
    </url> 
</urlset> 
""" 

xml = xml.dom.minidom.parseString(content) 
urlset = xml.getElementsByTagName("urlset")[0] 
url = urlset.getElementsByTagName("url") 

for i in range(0, url.length): 
    loc = url[i].getElementsByTagName("loc")[0].childNodes[0].nodeValue 
    lastmod = url[i].getElementsByTagName("lastmod")[0].childNodes[0].nodeValue 
    changefreq = url[i].getElementsByTagName("changefreq")[0].childNodes[0].nodeValue 
    priority = url[i].getElementsByTagName("priority")[0].childNodes[0].nodeValue 
    print "%s, %s, %s, %s" % (loc, lastmod, changefreq, priority) 

Parsing XML to get a node's value: isn't there a simpler way to get the value of a node than this?

loc = url[i].getElementsByTagName("loc")[0].childNodes[0].nodeValue 

Answers

0

There may be a better way to get a node's value, but this is at least a cleaner alternative in which you don't repeat yourself:

import xml.dom.minidom 

content = """ 
<urlset xmlns="http://www.google.com/schemas/sitemap/0.90"> 
    <url> 
    <loc>http://www.domain.com/</loc> 
    <lastmod>2011-01-27T23:55:42+01:00</lastmod> 
    <changefreq>daily</changefreq> 
    <priority>0.5</priority> 
    </url> 
    <url> 
    <loc>http://www.domain.com/page1.html</loc> 
    <lastmod>2011-01-26T17:24:27+01:00</lastmod> 
    <changefreq>daily</changefreq> 
    <priority>0.5</priority> 
    </url> 
    <url> 
    <loc>http://www.domain.com/page2.html</loc> 
    <lastmod>2011-01-26T15:35:07+01:00</lastmod> 
    <changefreq>daily</changefreq> 
    <priority>0.5</priority> 
    </url> 
</urlset> 
""" 

def get_first_node_val(obj, tag): 
    return obj.getElementsByTagName(tag)[0].childNodes[0].nodeValue 

xml = xml.dom.minidom.parseString(content) 
urlset = xml.getElementsByTagName("urlset")[0] 
urls = urlset.getElementsByTagName("url") 

for url in urls: 
    loc = get_first_node_val(url, "loc") 
    lastmod = get_first_node_val(url, "lastmod") 
    changefreq = get_first_node_val(url, "changefreq") 
    priority = get_first_node_val(url, "priority") 
    print "%s, %s, %s, %s" % (loc, lastmod, changefreq, priority) 
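
The helper above raises an IndexError when a tag is missing or when an element is empty. A minimal sketch of a more forgiving variant follows, assuming Python 3 and a shortened sample document; `first_text`, `read_urls`, `FIELDS`, and `SAMPLE` are hypothetical names used only for illustration, not part of the original answer.

```python
import xml.dom.minidom

# Shortened sample (assumption): the second <url> lacks some tags
# and has an empty <priority> element.
SAMPLE = """<urlset xmlns="http://www.google.com/schemas/sitemap/0.90">
  <url>
    <loc>http://www.domain.com/</loc>
    <lastmod>2011-01-27T23:55:42+01:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.domain.com/page1.html</loc>
    <priority></priority>
  </url>
</urlset>"""

FIELDS = ("loc", "lastmod", "changefreq", "priority")


def first_text(parent, tag, default=None):
    """Return the value of the first child node of the first <tag> element."""
    matches = parent.getElementsByTagName(tag)
    if matches.length == 0:
        return default  # the tag does not occur under parent
    first_child = matches[0].firstChild
    if first_child is None:
        return default  # empty element, e.g. <priority></priority>
    return first_child.nodeValue


def read_urls(text):
    """Parse the document and return one tuple of field values per <url>."""
    doc = xml.dom.minidom.parseString(text)
    return [
        tuple(first_text(url, field) for field in FIELDS)
        for url in doc.getElementsByTagName("url")
    ]


rows = read_urls(SAMPLE)
for row in rows:
    print(", ".join(str(value) for value in row))
```

Returning a default instead of raising is a design choice; if a missing tag should count as an error in your data, the original helper's behaviour may be preferable.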
0

Does this work: loc = getElementsByTagName("loc")[i].innerHTML

+0

That's not Python. – anjanesh 2012-08-03 07:19:25

0

Why don't you use firstChild?

loc = url[i].getElementsByTagName("loc").firstChild.nodeValue 
+0

Traceback (most recent call last): File "script.py", line 31, in <module> loc = url[i].getElementsByTagName("loc").firstChild.nodeValue AttributeError: 'NodeList' object has no attribute 'firstChild' – anjanesh 2012-08-03 07:58:35

+0

from xml.dom.minidom import Node .. did you import Node? – 2012-08-03 08:23:35
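
The AttributeError in the comment above comes from calling firstChild on the list returned by getElementsByTagName rather than on one element of it; importing Node does not change that. A minimal sketch, assuming Python 3 and a one-element sample document made up for illustration:

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<url><loc>http://www.domain.com/</loc></url>"
)

# getElementsByTagName returns a NodeList (a list of elements),
# and the list itself has no firstChild attribute.
locs = doc.getElementsByTagName("loc")
has_first_child = hasattr(locs, "firstChild")

# Index into the list first, then ask that element for its first child.
first_loc = locs[0]
loc_value = first_loc.firstChild.nodeValue

print(has_first_child, loc_value)
```

So firstChild can stand in for childNodes[0], but the [0] after getElementsByTagName("loc") is still needed.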

0

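Adding extra functionality to "get_first_node_val" so that it accepts XML elements that share the same tag name. For example, the following contains two loc elements.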

<url> 
<loc>http://domain.com/</loc> 
<loc>http://sub.domain.com</loc> 
<lastmod>2011-01-27T23:55:42+01:00</lastmod> 
<changefreq>daily</changefreq> 
<priority>0.5</priority> 
</url> 


def get_first_node_val(obj, tag): 
    # Collect one {tag: value} dict for every matching element 
    element = [] 
    for node in obj.getElementsByTagName(tag): 
        element.append({tag: node.childNodes[0].nodeValue}) 
    return element 

Output

[{'loc': u'http://domain.com/'}, {'loc': u'http://sub.domain.com'}], [{'lastmod': u'2011-01-27T23:55:42+01:00'}], [{'changefreq': u'daily'}], [{'priority': u'0.5'}]
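
If a list of one-key dicts is awkward to work with, the same values can be grouped into a single dict that maps each tag to a list of strings. This is only a sketch of an alternative shape, assuming Python 3 (so the values print without the u prefix); `text_values`, `record`, and `SAMPLE` are hypothetical names.

```python
import xml.dom.minidom

# The <url> element with two <loc> children from the example above.
SAMPLE = """<url>
<loc>http://domain.com/</loc>
<loc>http://sub.domain.com</loc>
<lastmod>2011-01-27T23:55:42+01:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.5</priority>
</url>"""


def text_values(parent, tag):
    """Return the first-child value of every <tag> element, skipping empty ones."""
    return [
        node.firstChild.nodeValue
        for node in parent.getElementsByTagName(tag)
        if node.firstChild is not None
    ]


url = xml.dom.minidom.parseString(SAMPLE).documentElement
record = {
    tag: text_values(url, tag)
    for tag in ("loc", "lastmod", "changefreq", "priority")
}
print(record)
```

With this shape, record["loc"] holds both URLs and a tag that never occurs maps to an empty list.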