2016-09-21

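Extracting data from the title tag using BeautifulSoup?

I want to fetch a link's HTML and extract its title using Python's BeautifulSoup library. Basically, the whole title tag is: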

<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title> 

The data I want to extract is the text between the &quot; entities, which is just this: Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3). I tried:

import urllib.error 
import urllib.request 

from bs4 import BeautifulSoup 

link = "https://twitter.com/ImaanZHazir/status/778560899061780481" 
try: 
    titles = [] 
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'}) 
    h = urllib.request.urlopen(r).read() 
    data = BeautifulSoup(h, "html.parser") 
    for i in data.find_all("title"): 
        titles.append(i.text) 
        print(titles[0])  # prints the whole title, not just the quoted part 
except urllib.error.HTTPError as err: 
    pass  # HTTP errors are silently ignored here 

I also tried:

for i in data.find_all("title.&quot"): 

for i in data.find_all("title>&quot"): 

for i in data.find_all("&quot"): 

and

for i in data.find_all("quot"): 

But none of them work.

I would expect BeautifulSoup to convert `&quot;` into `"`, so you only need to look for `"`. – zvone

@zvone What is that? Do you mean something like `title<">`? – Amar

Answers

0

Just split the text on the colon:

In [1]: h = """<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>""" 

In [2]: from bs4 import BeautifulSoup 

In [3]: soup = BeautifulSoup(h, "lxml") 

In [4]: print(soup.title.text.split(": ", 1)[1]) 
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)" 
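The output above still carries the surrounding double quotes. A minimal sketch of the same split with the quotes stripped follows; the helper name `tweet_text_from_title` is made up for illustration, and it assumes the title has the "<name> on Twitter: ..." shape shown in the question.

```python
def tweet_text_from_title(title):
    """Return the part after the first ': ' with surrounding double quotes removed."""
    # maxsplit=1 keeps any later ": " inside the tweet itself intact
    parts = title.split(": ", 1)
    if len(parts) < 2:
        return None  # title does not have the assumed "<name> on Twitter: ..." shape
    # strip('"') removes every leading and trailing double quote
    return parts[1].strip('"')


title = ('Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial and '
         'military support to dictators in Latin America during the cold war. '
         'REALLY, AMERICA? (3)"')
print(tweet_text_from_title(title))
```

Note that `strip('"')` would also remove a quote that belongs to the tweet if the tweet itself begins or ends with one.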

Actually, looking at the page, you don't need to split at all; the text is in the p tag inside div.js-tweet-text-container:

In [8]: import requests 

In [9]: from bs4 import BeautifulSoup 


In [10]: soup = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml") 


In [11]: print(soup.select_one("div.js-tweet-text-container p").text) 
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3) 

In [12]: print(soup.title.text.split(": ", 1)[1]) 
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)" 

So you can get the same result either way.
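Both routes can be combined into one function that prefers the tweet container and falls back to the title. This is only a sketch: `sample_html` is a made-up, heavily simplified stand-in for the real page, and `tweet_text` is a hypothetical helper name. `select_one` returns `None` when nothing matches, so the result is checked before use.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page contains far more structure.
sample_html = """
<html><head><title>Someone on Twitter: &quot;Hello, world&quot;</title></head>
<body><div class="js-tweet-text-container"><p>Hello, world</p></div></body></html>
"""


def tweet_text(html):
    """Return the tweet text from the container, else from the title, else None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("div.js-tweet-text-container p")
    if node is not None:
        return node.get_text()
    # Fall back to the part of the title after the first ": "
    if soup.title is not None and soup.title.string:
        parts = soup.title.string.split(": ", 1)
        if len(parts) == 2:
            return parts[1].strip('"')
    return None


print(tweet_text(sample_html))
```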

Caunnungham, this works! Thanks for pointing it out. `print(soup.select_one("div.js-tweet-text-container p").text)` – Amar

0

Once you have parsed the HTML:

data = BeautifulSoup(h,"html.parser") 

find the title like this:

title = data.find("title").string # this is without <title> tag 

Now find the two quotation marks (") in the string. There are many ways to do this. I would use a regular expression:

import re 
match = re.search(r'".*"', title) 
if match: 
    print(match.group(0))  # the quoted text, including both quotation marks 

You never search for &quot; or any other &NAME; sequence, because BeautifulSoup converts them into the actual characters they represent.
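A small offline check of that claim, using a made-up title rather than the real page: after parsing, the string holds a literal `"`, and `find_all("quot")` finds nothing because `find_all` with a plain string looks for tags with that name, not for entities.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; no network access needed.
html = '<title>Someone on Twitter: &quot;Hello, world&quot;</title>'
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string
print(title)                    # the entities are already decoded here

# There is no <quot> tag in the document, so this list is empty.
quot_tags = soup.find_all("quot")
print(quot_tags)
```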

Edit:

A regular expression that does not capture the quotes is:

re.search(r'(?<=").*(?=")', title) 
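To compare the two patterns from this answer side by side, here is a sketch run against a shortened version of the title (the shortening is mine). A plain capturing group, added as a third variant that is not part of the answer above, gives the same text as the lookaround form.

```python
import re

# Shortened version of the title from the question, for readability.
title = 'Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib. REALLY, AMERICA? (3)"'

# Matches from the first quote to the last quote, quotes included.
with_quotes = re.search(r'".*"', title).group(0)

# Lookbehind/lookahead assert the quotes without making them part of the match.
lookaround = re.search(r'(?<=").*(?=")', title).group(0)

# Alternative: match the quotes but return only the parenthesised group.
grouped = re.search(r'"(.*)"', title).group(1)

print(with_quotes)
print(lookaround)
print(grouped)
```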
0

Here is a simple, complete example that uses a regular expression to extract the text inside the quotes:

import urllib.request  # a bare "import urllib" does not reliably load the request submodule 
import re 
from bs4 import BeautifulSoup 

link = "https://twitter.com/ImaanZHazir/status/778560899061780481" 

r = urllib.request.urlopen(link) 
soup = BeautifulSoup(r, "html.parser") 
title = soup.title.string 
quote = re.match(r'^.*\"(.*)\"', title)  # None if the title contains no quoted text 
print(quote.group(1)) 

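What happens here is that, after fetching the page source and finding the title, we apply a regular expression to the title to extract the text inside the quotes.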

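We tell the regex to match any number of characters from the start of the string (^.*) up to the opening quote (\"), and then to capture the text between it and the closing quote (the second \").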

Then we print the captured text by telling Python to print the first captured group (the part between the parentheses in the regex).
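Two edge cases are worth checking with this pattern. Because the leading `^.*` is greedy, a tweet that itself contains quotes yields only the text between the last pair, and a title without quotes makes `re.match` return `None`. The sketch below uses a made-up title and a hypothetical helper, `extract_quoted`, to show one way to handle both.

```python
import re


def extract_quoted(title):
    """Return the text between the first and last double quote, or None."""
    match = re.search(r'"(.*)"', title)
    return match.group(1) if match else None


# Made-up title whose tweet text contains its own quotes.
nested = 'Someone on Twitter: "He said "hi" to me"'

# The greedy prefix ^.* consumes everything up to the second-to-last quote.
greedy_prefix = re.match(r'^.*\"(.*)\"', nested).group(1)

print(repr(greedy_prefix))
print(extract_quoted(nested))
print(extract_quoted('a title without quotes'))
```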

More on match objects in Python's re module: https://docs.python.org/3/library/re.html#match-objects
