BeautifulSoup: extract the text that is not inside a given tag

I have the following variable, header, which is equal to:

<p>Andrew Anglin<br/> 
<strong>Daily Stormer</strong><br/> 
February 11, 2017</p>

I want to extract only the date, February 11, 2017, from this variable. How can I do that in Python using BeautifulSoup?

Source

2017-02-11 mel

You need to provide more of the HTML. This can even be done without BeautifulSoup: html.split('\n')[-1][:-4] – MYGz

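If you know that the date is always the last text node in the header variable, you can access the .contents property and take the last element of the returned list: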

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html, 'html.parser') 
header = soup.find('p') 

header.contents[-1].strip() 
> February 11, 2017
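To see why the last element is the date, it can help to look at what .contents actually holds. The following is a minimal, self-contained sketch; the html string is copied from the question, and the variable names children and last_text are only illustrative:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed to be the full input).
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# .contents is a list of the direct children of <p>:
# tags (<br/>, <strong>) interleaved with bare text nodes.
children = header.contents
for child in children:
    print(type(child).__name__, repr(child))

# The date is the last child; strip() removes the leading newline.
last_text = children[-1].strip()
print(last_text)
```

This only works while the date stays the last child of the paragraph; if the markup changes, the index has to change with it.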

Alternatively, as MYGz pointed out in the comments, you could split the text on newlines and retrieve the last element of the list:

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html, 'html.parser') 
header = soup.find('p') 

header.text.split('\n')[-1] 
> February 11, 2017

If you don't know the position of the date text node, another option is to parse out any matching string:

from bs4 import BeautifulSoup 
import re 

soup = BeautifulSoup(html, 'html.parser') 
header = soup.find('p') 

re.findall(r'\w+ \d{1,2}, \d{4}', header.text)[0] 
> February 11, 2017
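Note that indexing the result of re.findall with [0] raises an IndexError when no date is present. A hedged sketch of a more defensive variant follows; the helper name find_date and the slightly stricter pattern (letters only for the month) are assumptions made for illustration, not part of the original answer:

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

# Month name, day, comma, four-digit year, e.g. "February 11, 2017".
DATE_PATTERN = re.compile(r'[A-Za-z]+ \d{1,2}, \d{4}')


def find_date(markup):
    """Return the first date-like string in the first <p>, or None."""
    header = BeautifulSoup(markup, 'html.parser').find('p')
    if header is None:
        return None
    match = DATE_PATTERN.search(header.get_text())
    return match.group(0) if match else None


print(find_date(html))
print(find_date('<p>No date here</p>'))
```

Returning None instead of raising leaves it to the caller to decide how a missing date should be handled.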

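However, as your title implies, if you only want to retrieve text nodes that are not wrapped in an element tag, you can use the following, which filters out the elements:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

text_nodes = [e.strip() for e in header if not e.name and e.strip()]

This returns the following, since the first text node is not wrapped: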

> ['Andrew Anglin', 'February 11, 2017']
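The same filtering can probably also be expressed with find_all. This is a sketch that assumes a Beautiful Soup version that accepts the string argument (older releases used text instead); the variable name direct_strings is illustrative:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# string=True matches text nodes only; recursive=False restricts the
# search to direct children, so text inside <strong> is skipped.
direct_strings = header.find_all(string=True, recursive=False)

# Drop whitespace-only nodes (the newlines between tags).
text_nodes = [s.strip() for s in direct_strings if s.strip()]
print(text_nodes)
```

"Daily Stormer" is absent from the result because it lives inside the strong element rather than directly under the paragraph.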

Of course, you can also combine the last two options and parse out the date strings in the returned text nodes:

from bs4 import BeautifulSoup 
import re 

soup = BeautifulSoup(html, 'html.parser') 
header = soup.find('p') 

for node in header:
    if not node.name and node.strip():
        match = re.findall(r'^\w+ \d{1,2}, \d{4}$', node.strip())
        if match:
            print(match[0])

> February 11, 2017
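If this combined approach is needed in more than one place, it could be wrapped in a small function. The following is a sketch under stated assumptions: the function name unwrapped_dates is hypothetical, and an explicit isinstance check against NavigableString is used in place of the node.name test:

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample markup copied from the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

# Anchored so the whole text node must be a date, as in the loop above.
DATE_PATTERN = re.compile(r'^\w+ \d{1,2}, \d{4}$')


def unwrapped_dates(markup):
    """Collect date strings from text nodes directly under the first <p>."""
    header = BeautifulSoup(markup, 'html.parser').find('p')
    if header is None:
        return []
    dates = []
    for node in header.children:
        # Only bare text nodes qualify; Tag children are skipped.
        if isinstance(node, NavigableString):
            text = node.strip()
            if DATE_PATTERN.match(text):
                dates.append(text)
    return dates


print(unwrapped_dates(html))
```

A date wrapped in its own tag, such as <strong>February 11, 2017</strong>, is deliberately not returned, which matches the intent of the question title.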

Source

2017-02-11 16:41:36
