2016-09-16 50 views
0

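I am trying to grab a div from the MTA info page. When I fetch the HTML and parse it with BeautifulSoup, some of the data seems to be missing: a class is absent from the BeautifulSoup output.

Here is the code I have so far: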

from bs4 import BeautifulSoup 
import urllib # access the web 

# SUBWAY STATUS PROJECT 
userURL = "http://www.mta.info" # MTA SITE 

htmlfile = urllib.urlopen(userURL) # open the URL; returns a file-like response object (Python 2)
htmldoc = htmlfile.read() # read the response body as a string

soup = BeautifulSoup(htmldoc, 'html.parser')  

subChart = soup.find(id = 'subwayDiv') 

print subChart 

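I am using print only to make sure I am getting all of the data. I found that some of the information I am trying to grab is missing. If I look at the page myself, I can see that I am missing a class that shows the subway status.

I am very new to programming, so please excuse my ignorance.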

+0

They are created by ajax rather than being ordinary static HTML, so try another approach. – kiviak

Answers

0

In the subChart variable, look for elements with the class subwayCategory and store the value of their id attribute. For example, from the data:

<div style="float: left; width: 220px; border-bottom: 1px solid #7B7B98; padding: 4px 0;"> 
<div class="span-11"><img alt="1 2 3 Subway" class="subwayIcon_123" src="http://www.mta.info/sites/all/modules/custom/servicestatus/images/img_trans.gif"/></div> 
<div class="subwayCategory" id="123" style="margin-top: 4px;"></div> 

The div with class subwayCategory has the id 123. Now make a request to http://www.mta.info/status/subway/{ID}, replacing {ID} with the ID number you want.
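The extraction step described above can be sketched as follows. This is only an illustration under stated assumptions: it is written for Python 3, it parses a copy of the fragment quoted in this answer rather than the live page, and the helper name status_urls is hypothetical. The URL pattern is the one given in this answer; a comment on this answer reports that requesting it does not work, so the sketch only shows how to collect the ids and build the URLs.

```python
from bs4 import BeautifulSoup

# Fragment copied from the answer above. The style attribute of the outer
# div is shortened and the div is closed here so the sample is well-formed.
SAMPLE_HTML = """
<div style="float: left; width: 220px;">
<div class="span-11"><img alt="1 2 3 Subway" class="subwayIcon_123" src="http://www.mta.info/sites/all/modules/custom/servicestatus/images/img_trans.gif"/></div>
<div class="subwayCategory" id="123" style="margin-top: 4px;"></div>
</div>
"""


def status_urls(html):
    """Return one status URL per div with class subwayCategory (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    ids = [
        div["id"]
        for div in soup.find_all("div", class_="subwayCategory")
        if div.has_attr("id")
    ]
    # URL pattern taken from the answer; whether the server accepts it is not verified.
    return ["http://www.mta.info/status/subway/{}".format(i) for i in ids]


urls = status_urls(SAMPLE_HTML)
print(urls)
```

Note that the subwayCategory div in the fragment has no text content, which is consistent with the status being filled in later by a script rather than being present in the static HTML.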

+0

This does not work. Try it in a browser or in code. –

0

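The data is fetched through an ajax request. You can get the information as JSON; you need to pass a timestamp, which you can get with time.time(), and then just parse it with the json library: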

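from time import time 
from json import load, loads 
from pprint import pprint as pp 
import urllib 

# The current Unix timestamp is passed as the last path segment. 
url = "http://www.mta.info/service_status_json/{}".format(int(time())) 

# load() decodes the response; its result is a JSON string, 
# which loads() then decodes into a dict. 
json_dict = loads(load(urllib.urlopen(url))) 

pp(json_dict) 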

I won't add all of the output because there is far too much, but using "BT" we get:

{u'line': [{u'Date': {}, 
      u'Time': {}, 
      u'name': u'Bronx-Whitestone', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': {}, 
      u'Time': {}, 
      u'name': u'Cross Bay', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': {}, 
      u'Time': {}, 
      u'name': u'Henry Hudson', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': u'09/16/2016', 
      u'Time': u' 5:57AM', 
      u'name': u'Hugh L. Carey', 
      u'status': u'SERVICE CHANGE', 
      u'text': u"     <span class='TitleServiceChange' >Service Change</span>     <span class='DateStyle'>     &nbsp;Posted:&nbsp;09/16/2016&nbsp; 5:57AM     </span><br/><br/>     HLC - HOV Lane Open 6 AM to 10 AM. Two-Way Operations in effect. Three (3) lanes Manhattan-bound. One (1) lane Brooklyn-bound.    <br/><br/>    "}, 
      {u'Date': {}, 
      u'Time': {}, 
      u'name': u'Marine Parkway', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': u'09/16/2016', 
      u'Time': u' 5:57AM', 
      u'name': u'Queens Midtown', 
      u'status': u'SERVICE CHANGE', 
      u'text': u"     <span class='TitleServiceChange' >Service Change</span>     <span class='DateStyle'>     &nbsp;Posted:&nbsp;09/16/2016&nbsp; 5:57AM     </span><br/><br/>     QMT - HOV Lane Open 6 AM to 10 AM. Two-Way Operation in effect. Three (3) lanes Manhattan bound. One (1) lane Queens bound.    <br/><br/>         <span class='TitlePlannedWork' >Planned Work</span>     <br/>     <P style='MARGIN: 0in 0in 0pt'><SPAN style=''Times New Roman';2016; Queens-Midtown Tunnel downtown exit; One lane closed. Use 37<SUP>th</SUP></FONT><FONT size=3> St tunnel exit for access to 2</FONT><SUP><FONT size=3>nd</FONT></SUP><FONT size=3> Ave. Motorists should allow extra time and may wish to use an alternate route if possible' Drivers should expect delays and plan accordingly. Motorists can sign up for MTA e-mail or text alerts at </FONT><SPAN style='COLOR: blue'><A href='http://www.mta.info/'><SPAN style='COLOR: #0563c1'><FONT size=3>www.mta.info</FONT></SPAN></A><FONT size=3> </FONT></SPAN><FONT size=3>and check the Bridges and Tunnels homepage or Facebook page for the latest information on this planned work.</FONT></FONT></SPAN></P>    <br/><br/>         <span class='TitlePlannedWork' >Planned Work</span>     <br/>     QMT- MANHATTAN PLAZA WORK REQUIRES CLOSURE OF 'CROSSTOWN' LANES FOR 2 MONTHS. CUSTOMERS SEEKING A CROSSTOWN MANHATTAN ROUTE USE THE UPTOWN LANES; EXPECT DELAYS.    <br/><br/>    "}, 
      {u'Date': u'08/15/2016', 
      u'Time': u' 3:56PM', 
      u'name': u'Robert F. Kennedy', 
      u'status': u'PLANNED WORK', 
      u'text': u"     <span class='TitlePlannedWork' >Planned Work</span>     <br/>     <P style='MARGIN: 0in 0in 0pt'><SPAN style='COLOR: #1f497d'><FONT size=3 face=Calibri>Starting Monday, August 15, 2016 and through early 2018, one lane will be closed on the Queens-to-Manhattan ramp at the Robert F. Kennedy Bridge for roadway rehabilitation. In addition, overnight on Thursday, August 18 and Friday, August 19, there will be a series of intermittent FULL ramp closures, lasting 15-20 minutes each.</FONT></SPAN></P>    <br/><br/>    "}, 
      {u'Date': {}, 
      u'Time': {}, 
      u'name': u'Throgs Neck', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': u'09/16/2016', 
      u'Time': u' 5:28AM', 
      u'name': u'Verrazano-Narrows', 
      u'status': u'PLANNED WORK', 
      u'text': u"     <span class='TitlePlannedWork' >Planned Work</span>     <br/>     VNB: PLANNED WORK; S. I. BOUND LOWER LEVEL - ONE LANE CLOSED; EXPECT DELAYS.    <br/><br/>    "}]} 

So you just need to go through the dictionary and pick out what you want.
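Going through the dictionary might look like the sketch below. It is written for Python 3 and runs against a small hand-trimmed sample in the same shape as the output above, not against the live endpoint. The function name disrupted_lines is hypothetical, the text values are shortened, and the assumption is that each section is a dict with a "line" list, as in the "BT" output shown.

```python
# Trimmed sample in the shape of the output shown above (text values shortened).
# Assumption: each section is a dict holding a "line" list of entries.
sample = {
    "line": [
        {"Date": {}, "Time": {}, "name": "Bronx-Whitestone",
         "status": "GOOD SERVICE", "text": {}},
        {"Date": "09/16/2016", "Time": " 5:57AM", "name": "Hugh L. Carey",
         "status": "SERVICE CHANGE", "text": "HLC - HOV Lane Open 6 AM to 10 AM."},
        {"Date": "09/16/2016", "Time": " 5:28AM", "name": "Verrazano-Narrows",
         "status": "PLANNED WORK", "text": "VNB: PLANNED WORK"},
    ]
}


def disrupted_lines(section):
    """Return (name, status, posted) for every entry not in GOOD SERVICE."""
    results = []
    for line in section.get("line", []):
        if line.get("status") == "GOOD SERVICE":
            continue
        # Empty fields arrive as {} rather than "", so normalise them first.
        date = line.get("Date") or ""
        time_ = line.get("Time") or ""
        posted = "{} {}".format(date, time_.strip()).strip()
        results.append((line["name"], line["status"], posted))
    return results


for name, status, posted in disrupted_lines(sample):
    print("{}: {} (posted {})".format(name, status, posted))
```

The `or ""` normalisation is there because, in the output above, entries in good service carry empty dicts for Date, Time and text instead of strings.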

+0

Thanks, I'll try this out when I get home! –