2017-11-11 89 views

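Request the fully JavaScript-rendered HTML source from a website and find all iframe tags

I am trying to use Selenium and BeautifulSoup to retrieve every iframe tag from a website. The problem is that I am not getting all of the iframes: BS4 does not search the inner HTML documents embedded in the page, and I don't believe the JavaScript in the HTML is being executed, so some HTML elements may never get rendered. Is there a web scraping tool that lets me request a URL, retrieve the complete JS-rendered HTML, and then search the DOM for every tag matching iframe, even those inside the inner HTML?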

Basically, I can see all the tags I want in the Chrome inspector, but they do not show up in the list returned by BS4's find_all('iframe').

Here is my code:

from bs4 import BeautifulSoup
import requests
from selenium import webdriver

browser = webdriver.Chrome('C:/Users/G/chromedriver.exe')
browser.get("https://reddit.com")

HTML = browser.page_source
innerHTML = browser.execute_script("return document.body.innerHTML")

page = BeautifulSoup(innerHTML, 'html.parser')
for iframe in page.find_all('iframe'):
    print(iframe)

browser.close()

Answers


You can get all of the iframe tags using Selenium alone, with the following code block:

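from selenium import webdriver

# Written against Selenium 3. Selenium 4 removed the find_elements_by_* helpers;
# the equivalent call there is browser.find_elements(By.TAG_NAME, "iframe"),
# with By imported from selenium.webdriver.common.by.
browser = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
browser.get("https://reddit.com")
# Locate the iframes in the rendered DOM in three equivalent ways.
frames_tag = browser.find_elements_by_tag_name("iframe")
frames_xpath = browser.find_elements_by_xpath("//iframe")
frames_css = browser.find_elements_by_css_selector("iframe")
print("Frames detected through iframe tag are %s" % frames_tag)
print("Frames detected through xpath are %s" % frames_xpath)
print("Frames detected through css are %s" % frames_css)
browser.quit()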

The output on my console:

Frames detected through iframe tag are [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ead39d06-0e39-4b40-9425-a86a1fe88d4f")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="1ce10f29-a620-4ce6-90e1-9da563046c70")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ba8493e4-8246-47a0-9ed4-3f51b8c0f133")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="19c0f134-c243-47bd-96d1-6b06ff66a011")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="95d78fa6-fb4f-4b7c-89c5-9b85965f0e4c")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="e6d2d931-1f35-432f-8825-052e244fe798")>] 
Frames detected through xpath are [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ead39d06-0e39-4b40-9425-a86a1fe88d4f")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="1ce10f29-a620-4ce6-90e1-9da563046c70")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ba8493e4-8246-47a0-9ed4-3f51b8c0f133")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="19c0f134-c243-47bd-96d1-6b06ff66a011")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="95d78fa6-fb4f-4b7c-89c5-9b85965f0e4c")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="e6d2d931-1f35-432f-8825-052e244fe798")>] 
Frames detected through css are [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ead39d06-0e39-4b40-9425-a86a1fe88d4f")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="1ce10f29-a620-4ce6-90e1-9da563046c70")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ba8493e4-8246-47a0-9ed4-3f51b8c0f133")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="19c0f134-c243-47bd-96d1-6b06ff66a011")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="95d78fa6-fb4f-4b7c-89c5-9b85965f0e4c")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="e6d2d931-1f35-432f-8825-052e244fe798")>] 
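
The lists above only show WebElement object representations, which say little about each frame. As a minimal sketch, the helper below reads a few attributes from each element so the frames can be told apart. The name describe_frames and the default attribute list are illustrative assumptions, not part of Selenium; the only Selenium call used is WebElement.get_attribute.

```python
def describe_frames(frames, attributes=("id", "name", "src")):
    """Return one dict of attribute values per iframe element.

    `frames` is assumed to be a list of Selenium WebElements, such as
    frames_tag from the snippet above. Attributes an element does not
    have come back as None.
    """
    return [
        {name: frame.get_attribute(name) for name in attributes}
        for frame in frames
    ]
```

Calling describe_frames(frames_tag) before browser.quit() would give one dictionary per iframe, which is easier to scan than the raw element list.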

Thanks, this worked. I did need to add a sleep before calling find_elements_by_tag_name, and I found more iframes that way. Anyway, do you know how to request the inner HTML created by an iframe? – user8922432
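
One possible approach to the follow-up question, offered as a sketch rather than a confirmed answer from this thread: each iframe holds its own document, so the driver has to switch into the frame before page_source returns that frame's HTML. The function name collect_frame_sources and the index-path keys are hypothetical; the Selenium calls used are find_elements, switch_to.frame, switch_to.parent_frame and page_source. The sleep (or an explicit wait) mentioned above would still need to happen before calling it.

```python
TAG_NAME = "tag name"  # string value of selenium.webdriver.common.by.By.TAG_NAME


def collect_frame_sources(driver, path=()):
    """Map each iframe's index path, e.g. (0, 2), to the HTML of its document.

    Assumes `driver` is a Selenium WebDriver currently focused on the
    document to search. Nested iframes are visited recursively.
    """
    sources = {}
    count = len(driver.find_elements(TAG_NAME, "iframe"))
    for index in range(count):
        # Re-locate on every pass in case the page changed its DOM meanwhile.
        frames = driver.find_elements(TAG_NAME, "iframe")
        if index >= len(frames):
            break  # frames were removed while walking the page
        driver.switch_to.frame(frames[index])
        child_path = path + (index,)
        sources[child_path] = driver.page_source  # HTML of the frame's document
        sources.update(collect_frame_sources(driver, child_path))
        driver.switch_to.parent_frame()  # step back out before the next sibling
    return sources
```

With a real browser this would be called as collect_frame_sources(browser) after browser.get(...), and each returned HTML string could then be handed to BeautifulSoup if further parsing is wanted.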