2016-06-08 138 views
0

Navigating an HTML table with lxml

I have some HTML that looks like this:

<html> 
<body> 
<table cellpadding="0" cellspacing="0" border="0" width="100%"> 
     <tr> 
      <td align="left" colspan="4"> 
     <!-- BEGIN NEXT PREV LINKS --> 
     <table cellspacing="2" cellpadding="0" border="0"> 
     <tr> 
      <td align="left"><font style="color:gray">Previous</font>&nbsp;</td> 
      <td align="center" colspan="2" nowrap><b>1-100 of 273 employees</b></td> 
      <td align="right">&nbsp;<a href="">Next</a></td> 
     </tr> 
     <tr> 
      <td align="left" colspan="2"><font style="color:gray">First Page</font></td> 
      <td align="right" colspan="2">&nbsp;&nbsp;<a href="">Last Page</a></td> 
     </tr> 
     </table> 
     <!-- END NEXT PREV LINKS --> 
</td>  
     <td colspan="9" align="right"> 
     <a href="">Add Checked to Favorites</a>&nbsp; 
    <br> 
     <a href="">Add Checked to Excluded</a>&nbsp; 
    </td> 
     </tr> 
     <tr> 
<td rowspan="2"></td><td rowspan="2"></td>  <td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href=""/></td> 
     <td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href="">Position</a></b></td> 
     <td colspan="2" align="center" valign="bottom" height="16"><b>Ratings</b><br><img src="/images/shim_333333.gif" width="130" height="1" alt="" hspace="5"></td>  <td rowspan="2">&nbsp;&nbsp;&nbsp;</td>  <td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href="">Birth&nbsp;Date</a></b></td> 
     <td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href="">States</a></b></td> 
     <td rowspan="2">&nbsp;</td><td rowspan="2"></td> <td rowspan="2" colspan="3" align="right" valign="bottom"><a href="">Clear&nbsp;All</a>&nbsp;</td>  </tr> 
     <tr> 
     <td align="center"><b><a href="">In-State<br>Rating</a></b></td> 
     <td align="center"><b><a href="">Out of State<br>Rating</a></b></td> 
     </tr> 
     <tr> 
      <td colspan="13" valign="bottom"><img src="/images/shim.gif" width="100%" height="1" alt=""></td> 
     </tr>  <tr> 
     <td align="right" colspan=13><img src="/images/shim_dddddd.gif" width="100%" height="1" border="0" alt=""></td> 
     </tr>  <tr > 
     <td></td><td><b style="">X</b></td> 
     <td nowrap><p><a href="">Cruise, Tom</a>&nbsp;</p></td> 
     <td nowrap>Actor&nbsp;</td> 
     <td align="center"><img src="/images/stars_2_sm_green.gif" alt="instate&#13;Recommendation&#13;Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td> 
     <td align="center"><img src="/images/stars_4_sm.gif" alt="Summary&#13;Estimate&#13;Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td> 
     <td>&nbsp;</td> 
     <td nowrap>1948&nbsp;</td> 
     <td nowrap>CA</td> 
     <td></td><td></td> 

       <td>&nbsp;</td> 
     <td align="right"><input type="checkbox" name="employee_cb" value="198720" style="height:15px"></td> 
     </tr>  <tr> 
     <td align="right" colspan=13><img src="/images/shim_dddddd.gif" width="100%" height="1" border="0" alt=""></td> 
     </tr>  <tr > 
     <td><b style="">X</b></td><td></td> 
     <td nowrap><p><a href="">Schwarzenegger, Arnold</a>&nbsp;</p></td> 
     <td nowrap>Governor&nbsp;</td> 
     <td align="center"><img src="/images/ohuohausd.jpg" alt="instate&#13;Recommendation&#13;Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td> 
     <td align="center"><img src="/images/ohuohausd.jpg" alt="Summary&#13;Estimate&#13;Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td> 
     <td>&nbsp;</td> 
     <td nowrap>No Current Date&nbsp;</td> 
     <td nowrap>-</td> 
     <td></td><td></td> 

       <td>&nbsp;</td> 
     <td align="right"><input type="checkbox" name="employee_cb" value="61184" style="height:15px"></td> 
     </tr>  <tr > 
     <td><b style="">X</b></td><td></td> 
     <td nowrap><p><a href="">Obama, Barack</a>&nbsp;</p></td> 
     <td nowrap>President&nbsp;</td> 
     <td align="center"><img src="/images/ohuohausd.jpg" alt="instate&#13;Recommendation&#13;Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td> 
     <td align="center"><img src="/images/ohuohausd.jpg" alt="Summary&#13;Estimate&#13;Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td> 
     <td>&nbsp;</td> 
     <td nowrap>No Current Date&nbsp;</td> 
     <td nowrap>-</td> 
     <td></td><td></td> 

       <td>&nbsp;</td> 
     <td align="right"><input type="checkbox" name="employee_cb" value="225747" style="height:15px"></td> 
     </tr> 
     <tr height="15"> 
     <td align="right" colspan="14"> 
     <!-- BEGIN NEXT PREV LINKS --> 
     <table cellspacing="2" cellpadding="0" border="0"> 
     <tr> 
      <td align="left"><font style="color:gray">Previous</font>&nbsp;</td> 
      <td align="center" colspan="2" nowrap><b>1-100 of 273 employees</b></td> 
      <td align="right">&nbsp;<a href="">Next</a></td> 
     </tr> 
     <tr> 
      <td align="left" colspan="2"><font style="color:gray">First Page</font></td> 
      <td align="right" colspan="2">&nbsp;&nbsp;<a href="">Last Page</a></td> 
     </tr> 
     </table> 
     <!-- END NEXT PREV LINKS --> 

     </td> 
     </tr>  <tr> 
    <td colspan="12" valign="bottom" nowrap><br> 
     <b style="">X</bfdgdfgb style="">X</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br> 
    <b style="c">X</b>dfgfdg<b style="">X</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br> <b style="">F</b>: A dsd "<b style="">F</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br> 
     &nbsp;&nbsp;&nbsp;&nbsp;dfgdfg"<b style="">F</b>"Lorem ipsum dolor sit amet, consectetur adipiscing elit<br> 
    <b style="">E</b>gfhbgdfg"<b style="">E</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit 
    </td> 
     </tr><tr><td colspan="20"> 
<table cellpadding="0" cellspacing="0" border="0" width="100%" align="center"> 
    <tr> 
    <td colspan="2"><img src="/images/shim.gif" width="100%" height="5" alt=""></td> 
    </tr> 
    <tr> 
    <td valign="top">States:&nbsp;</td> 
    <td>CA=California; ND=North&nbsp;Dakota</td> 
    </tr>  
</table> 
</td></tr>  
</table></body> 
</html> 

Looking at similar questions, I was able to put together the following (note that in the full HTML the table is always at index 17):

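import lxml.html as lh

# Read the saved page and close the file when done
with open("employeetest.htm", 'r') as f:
    data = f.read()

root = lh.fromstring(data)

# In the full page the table of interest is always the one at index 17
rows = root.xpath("//table")[17].findall("tr")
data = list()
for row in rows:
    # One list per row, holding the text of every direct child cell
    data.append([c.text_content() for c in row.getchildren()])
print(data)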

This produces a very messy list. My ultimate goal is to get

[['Cruise, Tom', 'Actor', '1948', 'CA'], ['Schwarzenegger, Arnold', 'Governor', 'No Current Date', '-'], ...]    

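However, all the other information in the table produces a lot of odd elements. I know I can clean up the \xa0 in the results by replacing it with a space. I'm not sure how to navigate any further. Thanks!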

+0

The data is not in any table. Also, what does the '...' represent in your expected output? –

+0

Maybe I'm wrong, but isn't it contained in a '' and '' element? The '...' is meant to represent the same pattern continuing. – sundorer

+0

Yes, I missed the opening tag. So the output you want is the three sublists in your question? –

Answers

1

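I'm not sure what the ... in your expected output is supposed to be, but to get the data for the first three sublists you can narrow the search to tr elements that contain a td with text and with nowrap as its only attribute: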

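from lxml import html 

# h is assumed to hold the HTML from the question as a string
root = html.fromstring(h) 
# Keep only rows with a td that has text and exactly one attribute, nowrap
rows = root.xpath("//tr[td[@nowrap and text() and count(@*)=1]]") 

for row in rows: 
    # All text nodes found under the nowrap cells of this row
    print(row.xpath(".//td[@nowrap]//text()")) 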

Output:

['Cruise, Tom', u'\xa0', u'Actor\xa0', u'1948\xa0', 'CA'] 
['Schwarzenegger, Arnold', u'\xa0', u'Governor\xa0', u'No Current Date\xa0', '-'] 
['Obama, Barack', u'\xa0', u'President\xa0', u'No Current Date\xa0', '-'] 
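
The lists above still contain the non-breaking spaces (`\xa0`) mentioned in the question. A minimal sketch of one way to tidy them, assuming a hypothetical helper named `clean_row` (it is not part of lxml) applied to each printed list:

```python
def clean_row(items):
    """Replace non-breaking spaces, trim each cell, and drop empty cells."""
    cleaned = (item.replace(u"\xa0", u" ").strip() for item in items)
    return [item for item in cleaned if item]


# One row exactly as printed in the output above
raw = [u"Cruise, Tom", u"\xa0", u"Actor\xa0", u"1948\xa0", u"CA"]
print(clean_row(raw))
# On Python 3 this prints: ['Cruise, Tom', 'Actor', '1948', 'CA']
```

Calling `clean_row(row.xpath(".//td[@nowrap]//text()"))` inside the loop would then give one cleaned list per row.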
+0

Thanks! This works well. I just needed to change the encoding (for my purposes), in case anyone else comes across this and needs that information. – sundorer

1

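You will have to walk the HTML document and use more precise XPath expressions. You also face the challenge that the related data sits in different elements, which calls for two XPath expressions. It then takes some manipulation to bring the related results together: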

import lxml.etree as et 

with open("employeetest.htm",'r') as f: 
    # Strip the non-breaking-space entities before parsing
    text = f.read().replace('&nbsp;', '') 
root = et.HTML(text) 

# XPATH LISTS (W/ RELATED ITEMS) 
items1 = root.xpath("//td/p/a/text()") 
items2 = root.xpath("//td[p/a/text()]/following-sibling::td/text()") 

# NUMBER OF ITEMS IN items2 RELATED TO EACH NAME IN items1 
r = len(items2) // len(items1) 

# ITERATE OVER THE NAMES AND APPEND THE MATCHING SLICE OF RELATED ITEMS 
data = [] 

for i, name in enumerate(items1): 
    inner = [name] 
    for j in items2[i*r:(i+1)*r]: # SLICE r ITEMS PER NAME 
        inner.append(j) 

    data.append(inner) 

print(data) 
# [['Cruise, Tom', 'Actor', '1948', 'CA'], 
# ['Schwarzenegger, Arnold', 'Governor', 'No Current Date', '-'], 
# ['Obama, Barack', 'President', 'No Current Date', '-']]
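
Pairing two flat XPath result lists depends on every row contributing the same number of text nodes. A sketch of a per-row alternative, assuming a trimmed-down stand-in for the page (`SAMPLE`) and a hypothetical helper `extract_rows`, neither of which appears in the original posts:

```python
from lxml import html

# Trimmed-down stand-in for the page in the question (an assumption for
# illustration; the real page has more cells and rows).
SAMPLE = """
<table>
  <tr><td colspan="13"><img src="/images/shim.gif"></td></tr>
  <tr>
    <td></td><td><b>X</b></td>
    <td nowrap><p><a href="">Cruise, Tom</a>&nbsp;</p></td>
    <td nowrap>Actor&nbsp;</td>
    <td>&nbsp;</td>
    <td nowrap>1948&nbsp;</td>
    <td nowrap>CA</td>
  </tr>
  <tr>
    <td><b>X</b></td><td></td>
    <td nowrap><p><a href="">Schwarzenegger, Arnold</a>&nbsp;</p></td>
    <td nowrap>Governor&nbsp;</td>
    <td>&nbsp;</td>
    <td nowrap>No Current Date&nbsp;</td>
    <td nowrap>-</td>
  </tr>
</table>
"""


def extract_rows(markup):
    """Return one list of cleaned cell texts per data row."""
    root = html.fromstring(markup)
    result = []
    # Only rows that contain a name link (td/p/a) are treated as data rows.
    for tr in root.xpath("//tr[td/p/a]"):
        cells = []
        for td in tr.xpath("./td[@nowrap]"):
            # text_content() joins nested text; remove non-breaking spaces.
            cells.append(td.text_content().replace(u"\xa0", u" ").strip())
        result.append(cells)
    return result


rows = extract_rows(SAMPLE)
print(rows)
```

Because each row is processed on its own, a row with a missing or extra cell does not shift the values of the rows after it.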