2017-01-09

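Storing values captured by Beautiful Soup in a dictionary and then accessing those values

I am learning Beautiful Soup and dictionaries in Python. I am following a short Beautiful Soup tutorial from Stanford University, found here: http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html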

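Since access to the website is forbidden, I stored the text provided in the tutorial in a string and then converted that string into a soup object. The printed output is as follows: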

print(soup_string) 

<html><body><div class="ec_statements"><div id="legalert_title"><a  
href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators- 
Urging-Them-to-Support-Cloture-and-Final-Passage-of-the-Paycheck- 
Fairness-Act-S.2199">'Letter to Senators Urging Them to Support Cloture  
and Final Passage of the Paycheck Fairness Act (S.2199) 
</a> 
</div> 
<div id="legalert_date"> 
September 10, 2014 
</div> 
</div> 
<div class="ec_statements"> 
<div id="legalert_title"> 
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to- 
Representatives-Urging-Them-to-Vote-on-the-Highway-Trust-Fund-Bill"> 
Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill 
</a> 
</div> 
<div id="legalert_date"> 
     July 30, 2014 
     </div> 
</div> 
<div class="ec_statements"> 
<div id="legalert_title"> 
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-Urging-Them-to-Vote-No-on-the-Legislation-Providing-Supplemental-Appropriations-for-the-Fiscal-Year-Ending-Sept.-30-2014"> 
     Letter to Representatives Urging Them to Vote No on the Legislation Providing Supplemental Appropriations for the Fiscal Year Ending Sept. 30, 2014 
     </a> 
</div> 
<div id="legalert_date"> 
     July 30, 2014 
     </div> 
</div> 
<div class="ec_statements"> 
<div id="legalert_title"> 
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-Urging-Them-to-Vote-Yes- 
      on-the-Motion-to-Proceed-to-the-Emergency-Supplemental-Appropriations-Act-of-2014-S.2648"></a></div></div></body></html> 

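At some point the tutor captures all the elements in the soup object that have the tag "div" and class_="ec_statements":

letters = soup_string.find_all("div", class_="ec_statements") 

Then the tutor says:

"We'll go through all of the items in our letters collection and, for each one, pull out the name and make it a key in our dictionary. The value will be another dictionary, but we haven't found the contents for the other items yet, so we'll just create an empty dictionary object."

The code is as follows: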

lobbying = {} 
for element in letters: 
    lobbying[element.a.get_text()] = {} 

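However, when I print the lobbying dictionary, I find that the key and value for the last element ("Letter-to-Senators-Urging-Them-to-Vote-Yes-on-the-Motion-to-Proceed-to-the-Emergency-Supplemental-Appropriations-Act-of-2014-S.2648") are missing. Instead, there is an empty dictionary with no key assigned.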

for key, value in lobbying.iteritems(): 
    print key, value 

{} 

     Letter to Representatives Urging Them to Vote No on the Legislation Providing Supplemental Appropriations for the Fiscal Year Ending Sept. 30, 2014 
     {} 

     Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill 
     {} 
'Letter to Senators Urging Them to Support Cloture and Final Passage of the Paycheck Fairness Act (S.2199) 
     {} 

How do you explain this? Any suggestions would be greatly appreciated.

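The last 'div' has no text, so it created an element with an empty string as its key. That is what you are reading as "an empty dictionary with no key assigned". – furas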

BTW: at least use `print ">", key, "<"` and you will see that your key is an empty string, or that it contains only spaces, tabs, and newlines. – furas
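
The point made in the comments can be checked without Beautiful Soup. The following is a minimal sketch, assuming a dictionary built the same way as `lobbying`; the sample keys are made up for illustration and are not taken from the tutorial. It uses `repr()` so that an empty or whitespace-only key becomes visible when printed.

```python
# Hypothetical stand-in for the `lobbying` dictionary from the question.
lobbying = {
    "": {},                                    # key from an <a> tag with no text
    "\n     Letter to Representatives\n": {},  # key with surrounding whitespace
}

for key, value in lobbying.items():
    # repr() shows quotes and escape sequences, so an empty key prints as ''
    print(repr(key), value)

# An empty string is an ordinary, valid dictionary key.
has_empty_key = "" in lobbying
print(has_empty_key)
```

Printing the bare key hides the difference between "no key" and "a key that is an empty string"; `repr()` removes that ambiguity.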

Answers

The <a> element in the last <div class="ec_statements"> doesn't have any text in it:

<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Senators-Urging-Them-to-Vote-Yes- 
      on-the-Motion-to-Proceed-to-the-Emergency-Supplemental-Appropriations-Act-of-2014-S.2648"> 
</a> 

Compare this with one of the other divs above:

<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to- 
Representatives-Urging-Them-to-Vote-on-the-Highway-Trust-Fund-Bill"> 
Letter to Representatives Urging Them to Vote on the Highway Trust Fund Bill 
</a> 

As you can see, in the second example there is text after the <a> tag and before the </a> tag. In the first example, there is no such text.
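
The difference between the two cases can be reproduced on a small scale. This is a sketch only, assuming the `bs4` package is installed and using a shortened, made-up HTML snippet (the `href` values are placeholders, not the tutorial's). It shows that `get_text()` returns an empty string for an `<a>` tag with no text.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup: one link with text, one without.
html = """
<div class="ec_statements"><div id="legalert_title">
<a href="/alerts/with-text">Letter with text</a></div></div>
<div class="ec_statements"><div id="legalert_title">
<a href="/alerts/without-text"></a></div></div>
"""

soup = BeautifulSoup(html, "html.parser")
letters = soup.find_all("div", class_="ec_statements")

# get_text() returns '' when the tag contains no text at all.
texts = [element.a.get_text() for element in letters]
print(texts)
```

The second entry in `texts` is the empty string, which is exactly the key that ends up in the dictionary.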

You call element.a.get_text() to generate the key, but for the last element the <a> tag has no text content: <a ...></a>
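
One possible way to avoid the empty key is sketched below. It is not part of the tutorial: falling back to the link's `href` when the text is empty is an assumption made here for illustration, and the HTML snippet is a shortened, made-up sample. It assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup: one link with text, one without.
html = """
<div class="ec_statements"><div id="legalert_title">
<a href="/alerts/with-text">
     Letter with text
</a></div></div>
<div class="ec_statements"><div id="legalert_title">
<a href="/alerts/without-text"></a></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

lobbying = {}
for element in soup.find_all("div", class_="ec_statements"):
    link = element.a
    # strip=True removes the leading and trailing whitespace around the text
    title = link.get_text(strip=True)
    if not title:
        # Assumption: use the href as the key when the link has no text
        title = link.get("href", "")
    lobbying[title] = {}

for key, value in lobbying.items():
    print(repr(key), value)
```

Skipping such elements entirely with `continue` would work as well; which choice is right depends on whether the entry without text should appear in the dictionary at all.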