Python BeautifulSoup: extracting content from p tags inside a div in an HTML file prints nothing

I want to extract some data from my Selenium test report HTML file. All I get printed to the PyCharm console is a blank result. I want to get all the data from the p tags, which sit under a div tag.

The HTML snippet is:

<div class='heading'> 
<h1>Test Report</h1> 
<p class='attribute'><strong>Start Time:</strong> 2016-08-12 11:57:33</p> 
<p class='attribute'><strong>Duration:</strong> 0:48:09.007000</p> 
<p class='attribute'><strong>Status:</strong> Pass 75</p> 

<p class='description'>Selenium - ClearCore 501 Regression edit project automated test</p> 
</div>

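As a first attempt I tried to get the Start Time, to see whether I could print the value to the console. Nothing useful is printed. I would also like to get the description: Selenium - ClearCore 501 Regression edit project automated test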

My code is:

from bs4 import BeautifulSoup 

def extract_data_from_report_htmltestrunner(): 
    filename = (r"C:\share\ClearCore501_Automated_GUI_TestReport.html") 
    html_report_part = open(filename,'r') 
    soup = BeautifulSoup(html_report_part, "html.parser") 
    div_heading = soup.find('div', {'class': 'heading'}) 
    p = div_heading.find('p', text='Start Time:') 
    print "test" 
    print p

I have also added:

if __name__ == "__main__": 
    extract_data_from_report_htmltestrunner()

The output I get now is:

test 
None

What am I doing wrong, please?

Thanks, Riaz

Source

2016-08-12 Riaz Ladhani

[*If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None*](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string) – styvane

The text is in the strong tag, not the p tag, so find the strong tag and call .parent to get the p tag:

In [10]: html = """<div class='heading'> 
    ....: <h1>Test Report</h1> 
    ....: <p class='attribute'><strong>Start Time:</strong> 2016-08-12 11:57:33</p> 
    ....: <p class='attribute'><strong>Duration:</strong> 0:48:09.007000</p> 
    ....: <p class='attribute'><strong>Status:</strong> Pass 75</p> 
    ....: 
    ....: <p class='description'>Selenium - ClearCore 501 Regression edit project automated test</p> 
    ....: </div>""" 

In [11]: from bs4 import BeautifulSoup 

In [12]: soup = BeautifulSoup(html, "html.parser") 

In [13]: div_heading = soup.find('div', {'class': 'heading'}) 

In [14]: p = div_heading.find('strong', text='Start Time:').parent 

In [15]: print p 
<p class="attribute"><strong>Start Time:</strong> 2016-08-12 11:57:33</p>
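
To see why the original call returned None, here is a minimal sketch. It assumes Python 3 and Beautiful Soup 4.4.0 or later, where the `text` argument is also available under the name `string`. A text filter is compared against a tag's `.string`, and `.string` is None for the p tag because it has two children.

```python
from bs4 import BeautifulSoup

html = "<p class='attribute'><strong>Start Time:</strong> 2016-08-12 11:57:33</p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
strong = p.find("strong")

# The <p> has two children (a <strong> tag and a text node), so .string is None.
print(p.string)

# The <strong> has exactly one child, a text node, so .string is that text.
print(strong.string)

# A string filter is matched against .string, so the <p> cannot match...
p_match = soup.find("p", string="Start Time:")
print(p_match)

# ...but the <strong> can.
strong_match = soup.find("strong", string="Start Time:")
print(strong_match)
```
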

To get the description, use the class name:

In [16]: div_heading.find("p", class_="description") 
Out[16]: <p class="description">Selenium - ClearCore 501 Regression edit project automated test</p> 
In [17]: div_heading.find("p", class_="description").text 
Out[17]: u'Selenium - ClearCore 501 Regression edit project automated test'

If you just want the date, call p.find(text=True, recursive=False) so you don't get text from any of the children.

In [18]: p = div_heading.find('strong', text='Start Time:').parent 

In [19]: p.find(text=True, recursive=False) 
Out[19]: u' 2016-08-12 11:57:33' 
In [20]: p.text 
Out[20]: u'Start Time: 2016-08-12 11:57:33'

You can see the difference between the two approaches above. Just calling .text on the strong tag would only give you u'Start Time:':

In [21]: div_heading.find('strong', text='Start Time:').text 
Out[21]: u'Start Time:'
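
Putting the pieces together, the following is a rough sketch of how every attribute row and the description could be collected in one pass. It assumes Python 3 and Beautiful Soup 4.4.0 or later; `parse_heading` is a hypothetical helper name, not something from the report or the library, and the HTML is inlined here instead of being read from the report file.

```python
from bs4 import BeautifulSoup

html = """<div class='heading'>
<h1>Test Report</h1>
<p class='attribute'><strong>Start Time:</strong> 2016-08-12 11:57:33</p>
<p class='attribute'><strong>Duration:</strong> 0:48:09.007000</p>
<p class='attribute'><strong>Status:</strong> Pass 75</p>

<p class='description'>Selenium - ClearCore 501 Regression edit project automated test</p>
</div>"""


def parse_heading(html_text):
    """Hypothetical helper: return (attributes dict, description string)."""
    soup = BeautifulSoup(html_text, "html.parser")
    heading = soup.find("div", class_="heading")

    attributes = {}
    for p in heading.find_all("p", class_="attribute"):
        # The label lives inside <strong>; drop the trailing colon.
        label = p.find("strong").get_text(strip=True).rstrip(":")
        # recursive=False looks only at direct children of <p>, so this is
        # the text node after <strong>, not the label inside it.
        value = p.find(string=True, recursive=False).strip()
        attributes[label] = value

    description = heading.find("p", class_="description").get_text(strip=True)
    return attributes, description


attributes, description = parse_heading(html)
print(attributes)
print(description)
```

For the real report, the same function could be given the file's contents, for example the result of reading the file inside a `with open(...)` block.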

Source

2016-08-12 14:47:03

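Thanks, this is very helpful. If I want to get the text Start Time: from the strong tag together with its value, how can I do that? The output I would like is "Start Time: 2016-08-12 11:57:33" –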

It would be p.find(text=True, recursive=False) –

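@RiazLadhani, look at the second-to-last code snippet: calling *p.text* gives you all the text, including the text from the children; recursive=False only takes the text that belongs directly to the parent. –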
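
As a small illustration of that last comment, the sketch below (again assuming Python 3 and Beautiful Soup 4.4.0 or later) gets the label and value together from the p tag. `get_text()` is the method behind `.text`; the separator and `strip` arguments are optional and only normalise the whitespace between the pieces.

```python
from bs4 import BeautifulSoup

html = "<p class='attribute'><strong>Start Time:</strong> 2016-08-12 11:57:33</p>"
soup = BeautifulSoup(html, "html.parser")

# Find the label in <strong>, then step up to the enclosing <p>.
p = soup.find("strong", string="Start Time:").parent

# All text under <p>, children included.
full_text = p.get_text()
print(full_text)

# Same pieces, each stripped and joined with a single space.
normalised = p.get_text(" ", strip=True)
print(normalised)
```
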
