2017-04-07 86 views

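Can't extract all spans with a matching class or ID

This may be something silly, but I want to write a simple scraper to grab the listings from this site: https://online.ncat.nsw.gov.au/Hearing/HearingList.aspx?LocationCode=2000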

Eventually it will run over every LocationCode, but this is a sample page.

I want to extract the <span> heading and the table data for each date.

The data generally has this form:

<span id="lblSubHeader1242017" class="clsGridItem">1:15 PM Wednesday, 12 Apr 2017 at Room 15.6 Level 15, 66 Goulburn st </span> 
<hr /> 
<table id="dg1242017"> 
    <tr class="clsGridItem"> 
     <td width="15%">RT 17/11111</td> 
     <td width="30%">Name of party</td> 
     <td width="55%">Name of party</td> 
    </tr> 
    ... 
</table> 

It's crude, but I can grab the table data fairly well with code like this:

import requests
from lxml import html

page = requests.get('https://online.ncat.nsw.gov.au/Hearing/HearingList.aspx?LocationCode=2000')
tree = html.fromstring(page.content)
events = tree.xpath('//table//td/text()')

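But when I try to grab the spans outside the tables, so that I have the location and date information, with something like either of these: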

days = tree.xpath('//span[starts-with(@id,"lbl")]/text()') 

days = tree.xpath('//span[@class="clsGridItem"]/text()')

I only get the following two results:

days: ['There are no matters listed in SYDNEY today', 'There are no matters listed in SYDNEY today'] 

These are the two spans about two-thirds of the way down the page:

<span id="lbl1442017" style="font-weight:bold;">SYDNEY: Friday, 14 Apr 2017</span><br /><br /><span id="lblError1442017" class="clsGridItem">There are no matters listed in SYDNEY today</span><br /><br /><br /><span id="lbl1742017" style="font-weight:bold;">SYDNEY: Monday, 17 Apr 2017</span><br /><br /><span id="lblError1742017" class="clsGridItem">There are no matters listed in SYDNEY today</span> 

Can anyone explain what I am doing wrong?

Why are the other spans being skipped?

Answers


You can use the code below to get the text content of every <span class="clsGridItem">:

days = tree.xpath('//span[@class="clsGridItem"]//text()') 

But I don't know why //span[@class="clsGridItem"]/text() doesn't work, since it should be applicable as well...
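
One possible explanation, which is only a guess and is not confirmed for the live page: /text() returns only text nodes that are direct children of the span, while //text() returns text nodes at any depth. If the real markup wraps the heading text in a child element (the excerpt in the question does not show this, so treat it as an assumption), /text() would skip those spans. A minimal sketch on made-up markup, where the <b> wrapper is hypothetical:

```python
from lxml import html

# Hypothetical markup, NOT copied from the real page: the first span wraps
# its text in a <b> child, the second span holds its text directly.
SAMPLE = """
<div>
  <span id="lblSubHeader1242017" class="clsGridItem"><b>1:15 PM Wednesday, 12 Apr 2017</b></span>
  <span id="lblError1442017" class="clsGridItem">There are no matters listed in SYDNEY today</span>
</div>
"""

tree = html.fromstring(SAMPLE)

# /text() selects only text nodes that are direct children of each span,
# so the span whose text sits inside <b> contributes nothing.
direct = tree.xpath('//span[@class="clsGridItem"]/text()')

# //text() selects text nodes at any depth below each span.
nested = tree.xpath('//span[@class="clsGridItem"]//text()')

# text_content() joins all descendant text of one element into one string,
# which keeps exactly one result per span.
joined = [s.text_content().strip()
          for s in tree.xpath('//span[@class="clsGridItem"]')]

print(direct)
print(nested)
print(joined)
```

To check whether this applies to the real page, printing html.tostring() for one of the missing spans would show whether its text is nested inside another element.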