2017-10-13 53 views

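Parsing text from certain HTML elements using Selenium

From what I have seen so far, if a page's source is obtained through Selenium, the text or anything else needed can be parsed out of it with bs4 or lxml, whether or not the page relies on JavaScript. My question, however, is how to parse content from specific HTML elements by passing them through Selenium first and then using the bs4 or lxml library. Taking the elements pasted below, this is how I would proceed with bs4 or lxml: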

html=''' 
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue'; 
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;"> 
     <td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td> 
</tr> 
''' 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(html,"lxml") 
#rest of the code here 

from lxml.html import fromstring 
tree = fromstring(html)   
#rest of the code here 

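Now, how can I pass the HTML fragment pasted above through Selenium and then apply the bs4 library to it? driver.page_source does not seem to fit, because it only applies when the source comes from a loaded web page.

To be more specific, if I want to use something like the following, how can it be done?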

from selenium import webdriver 
driver = webdriver.Chrome() 

element_html = driver-------(html) #this "html" is the above pasted one 
print(element_html) 

Answer


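driver.page_source gives you the complete HTML source of the page at a particular moment. Since you have an element instance, though, you can get its outerHTML with the .get_attribute() method: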

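from selenium.webdriver.common.by import By

# Selenium 4 removed find_element_by_id; use find_element(By.ID, ...) instead
element = driver.find_element(By.ID, "some_id")
element_html = element.get_attribute("outerHTML")

soup = BeautifulSoup(element_html, "lxml")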
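
If the HTML exists only as a pasted string rather than on a live page, one possible approach (an assumption on my part, not something stated in this thread) is to load it into the browser through a data: URL and then read outerHTML as above. The helper names fragment_to_data_url and outer_html_via_selenium are hypothetical. The fragment is wrapped in a table because a browser would otherwise drop a bare tr element; the browser also adds a tbody and re-serializes attributes, so the returned markup may differ slightly from the pasted text.

```python
from urllib.parse import quote


def fragment_to_data_url(fragment):
    """Wrap an HTML fragment in a minimal page and encode it as a data: URL.

    The <table> wrapper is an assumption that suits a <tr> fragment;
    other fragments may need a different wrapper or none at all.
    """
    page = (
        "<!DOCTYPE html><html><body><table>"
        + fragment
        + "</table></body></html>"
    )
    return "data:text/html;charset=utf-8," + quote(page)


def outer_html_via_selenium(fragment, css_selector="tr"):
    """Load the fragment in Chrome and return the matched element's outerHTML."""
    # Imported here so the URL helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(fragment_to_data_url(fragment))
        element = driver.find_element(By.CSS_SELECTOR, css_selector)
        return element.get_attribute("outerHTML")
    finally:
        driver.quit()


# Usage (needs Selenium and a Chrome driver installed):
# element_html = outer_html_via_selenium(html)
# soup = BeautifulSoup(element_html, "lxml")
```

Since the string is already in hand, parsing it directly with bs4 or lxml is simpler; going through the browser is only worth it when the browser's own rendering of the fragment matters.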

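As for extracting the span element's source from the onmouseover attribute: I would first parse the tr element with BeautifulSoup, get the onmouseover attribute, and then use a regular expression to extract the HTML value from the Tip() function call. Then, re-parse the span HTML with BeautifulSoup: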

import re 

from bs4 import BeautifulSoup 

html=''' 
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue'; 
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;"> 
     <td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td> 
</tr> 
''' 

soup = BeautifulSoup(html, "lxml") 
mouse_over = soup.tr['onmouseover'] 

span = re.search(r"Tip\('(.*?)'\)", mouse_over).group(1) 
span_soup = BeautifulSoup(span, "lxml") 
print(span_soup.get_text()) 

Prints:

License: 20-214767 (Validity: 21/05/2022)20C-214769 (Validity: 21/05/2022)21-214768 (Validity: 21/05/2022) 
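
The printed text runs the three license entries together because the BR tags contribute no text. If the entries are wanted separately, a minimal sketch using only the standard library could collect each text node as its own item. The names TextChunks and extract_tip_chunks are hypothetical, and the mouse_over string below is a shortened stand-in for the real attribute value.

```python
import re
from html.parser import HTMLParser

# Shortened stand-in for the onmouseover value from the example above.
mouse_over = (
    "this.style.color='White';"
    "Tip('<span Style=Color:Red>License: "
    "<BR />20-214767 (Validity: 21/05/2022)"
    "<BR />20C-214769 (Validity: 21/05/2022)"
    "<BR />21-214768 (Validity: 21/05/2022)</span>');"
)


class TextChunks(HTMLParser):
    """Collect every non-empty text node as a separate list item."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_tip_chunks(attr_value):
    """Return the text pieces inside the first Tip('...') call, or []."""
    match = re.search(r"Tip\('(.*?)'\)", attr_value, re.DOTALL)
    if match is None:
        return []
    parser = TextChunks()
    parser.feed(match.group(1))
    parser.close()
    return parser.chunks


chunks = extract_tip_chunks(mouse_over)
label, licenses = chunks[0], chunks[1:]
print(label)
for entry in licenses:
    print(entry)
```

Here the first chunk is the "License:" label and the remaining chunks are the individual license entries, one per BR-separated segment.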

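Thank you for your answer, sir alecxe. The fault was mine for not stating clearly what I was expecting. Now it makes more sense and matches what I was after. Thanks. – SIM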


As you may have noticed, the 'span' tag in the HTML element pasted above sits inside JavaScript, which is why I was after this kind of usage. Thanks again. – SIM


@Topto Ah, now I see that it is located inside the 'onmouseover' attribute. I will provide an example of how to extract it with bs4, give me a minute. – alecxe