2016-08-14 55 views

I am trying to parse HTML files in Python. One of them looks like this:

<ol> 
    <li> 
    <div class="c1"> 
     <span class="s1">hi</span> 
     " hello " 
     <span class="s2">world!</span> 
    </div> 
    </li> 
    <li> 
    <div class="c2"> 
     <span class="s3">abc</span> 
     " def ghijkl " 
     <span class="s1">mno</span> 
     " pqr!" 
    </div> 
    </li> 
</ol> 

I tried the following code:

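from lxml import html

# `code` is assumed to be an HTTP response object (for example from requests.get)
tree = html.fromstring(code.content)
sol = tree.xpath('//ol//text()')
for x in sol:
    print(x)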
Parsing all the text inside a tag using lxml

The result I get is:

hi 
hello 
world! 
abc 
def ghijkl 
mno 
pqr! 

What can I do to get all the text of each <li> tag on a single line? That is, I want the output to be:

hi hello world! 
abc def ghijkl mno pqr! 

Answers

$ cat a.py 
from lxml import etree 

xml = """<ol> 
    <li> 
    <div class="c1"> 
     <span class="s1">hi</span> 
     " hello " 
     <span class="s2">world!</span> 
    </div> 
    </li> 
    <li> 
    <div class="c2"> 
     <span class="s3">abc</span> 
     " def ghijkl " 
     <span class="s1">mno</span> 
     " pqr!" 
    </div> 
    </li> 
</ol>""" 

tree = etree.fromstring(xml) 
sol = tree.xpath('//ol//li') 
for a in sol: 
    print(" ".join([t.strip() for t in a.itertext()]).strip())

$ python a.py 
hi " hello " world! 
abc " def ghijkl " mno " pqr!" 
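
To see why the strip-and-join step is needed, it may help to look at what itertext() actually yields for one li. The following is only a sketch on a trimmed copy of the markup above; li_text is a hypothetical helper name, not part of lxml, and it drops the whitespace-only fragments before joining instead of stripping the joined string afterwards.

```python
from lxml import etree

# Trimmed copy of the first <li> from the markup above.
xml = """<ol>
    <li>
    <div class="c1">
     <span class="s1">hi</span>
     " hello "
     <span class="s2">world!</span>
    </div>
    </li>
</ol>"""


def li_text(li):
    """Join the non-empty text fragments under one element with single spaces.

    Hypothetical helper for illustration, not an lxml function.
    """
    fragments = (t.strip() for t in li.itertext())
    return " ".join(f for f in fragments if f)


root = etree.fromstring(xml)
first_li = root.xpath("//ol/li")[0]

# itertext() yields every text and tail string in document order,
# including the whitespace-only strings that sit between tags.
raw = list(first_li.itertext())
print(raw)
print(li_text(first_li))
```

Printing raw shows the whitespace-only strings mixed in with the real text, which is why a plain join without strip() would produce stray spaces and newlines.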

You can select each li and apply normalize-space to it:

from lxml import html 
h = """<ol> 
    <li> 
    <div class="c1"> 
     <span class="s1">hi</span> 
     " hello " 
     <span class="s2">world!</span> 
    </div> 
    </li> 
    <li> 
    <div class="c2"> 
     <span class="s3">abc</span> 
     " def ghijkl " 
     <span class="s1">mno</span> 
     " pqr!" 
    </div> 
    </li> 
</ol>""" 


tree = html.fromstring(h) 

for li in tree.xpath("//ol/li"): 
    print(li.xpath("normalize-space(.)")) 

Which gives you:

hi " hello " world! 
abc " def ghijkl " mno " pqr!"
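
Both outputs above still contain the double quotes, because they are literally part of the text in this snippet, while the output asked for in the question has none. If the quotes are never wanted, one option is to remove them before collapsing the whitespace. This is a minimal sketch under that assumption: clean_li is a hypothetical helper, and text_content() is the lxml.html element method that returns an element's text together with that of its descendants.

```python
from lxml import html

h = """<ol>
    <li>
    <div class="c1">
     <span class="s1">hi</span>
     " hello "
     <span class="s2">world!</span>
    </div>
    </li>
    <li>
    <div class="c2">
     <span class="s3">abc</span>
     " def ghijkl "
     <span class="s1">mno</span>
     " pqr!"
    </div>
    </li>
</ol>"""


def clean_li(li):
    # Hypothetical helper: drop literal double quotes (an assumption about
    # what the asker wants), then collapse whitespace runs to single spaces.
    text = li.text_content().replace('"', "")
    return " ".join(text.split())


tree = html.fromstring(h)
lines = [clean_li(li) for li in tree.xpath("//ol/li")]
for line in lines:
    print(line)
```

If the real page does not contain the quote characters, the replace() call simply does nothing and the split/join step does the same job as normalize-space.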