2017-08-02 75 views

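Dynamic text scraping

I have followed many tutorials on scraping JavaScript-rendered pages, but I really can't manage to get the numbers out of this table: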

http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html

I tried the latest Sentdex tutorial, using this code:

import sys

import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication


class Page(QWebEnginePage):
    def __init__(self, url):
        # A QApplication must exist before any Qt web object is created.
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        # Block here until Callable() quits the event loop.
        self.app.exec_()

    def _on_load_finished(self):
        # toHtml() is asynchronous and returns None; the HTML arrives in the callback.
        self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    page = Page('http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    tableSup = soup.find_all("td", {"class": "col2 yellowBack"})
    print(tableSup)


if __name__ == '__main__':
    main()

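It looks like I'm off target. Everyone always talks about scripts tied to text that appears in the page source but then disappears from the tag text in Beautiful Soup, but I really can't find the script associated with the values in the main table of the page above.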

Any advice on where I should direct my research?

Answer


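Note that the table you want to scrape is inside an iframe. You should make a request to that iframe's URL and then scrape the table from the response. I found the iframe URL by simply inspecting the element. An example using requests looks like this: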

from bs4 import BeautifulSoup
import requests

# URL of the iframe that actually contains the table (found by inspecting the element).
iframe = "https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWQEqN6Sq2pe6I0o/TehV5qd"
html = requests.get(iframe).text
soup = BeautifulSoup(html, 'html.parser')

# Select the cells whose class attribute is "col2 yellowBack".
column = soup.find_all("td", {"class": "col2 yellowBack"})
values = [row.string for row in column]
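
The iframe URL here was found by hand. If you would rather look it up in code, a minimal sketch could list the src of every iframe in the outer page's HTML. The IframeCollector class, the find_iframe_urls helper, and the sample markup are hypothetical names made up for this illustration, and the sketch assumes the iframe tag is present in the static HTML; if the page injects the iframe with JavaScript, it will not find it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> tag in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative URLs against the page that embeds the iframe.
                self.sources.append(urljoin(self.base_url, src))


def find_iframe_urls(html, base_url):
    """Return the absolute URLs of all iframes found in html."""
    collector = IframeCollector(base_url)
    collector.feed(html)
    return collector.sources


# Hypothetical sample markup, not the real WSJ page.
sample = (
    '<html><body>'
    '<iframe src="/embed/table?id=1"></iframe>'
    '<iframe></iframe>'
    '</body></html>'
)
iframe_urls = find_iframe_urls(sample, "http://example.com/page.html")
print(iframe_urls)
```

This version uses only the standard library so it runs without extra installs. With Beautiful Soup already imported, soup.find_all("iframe") gives you the same tags to read src from.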

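It looks like you are interested in the values from that column, so the values list from the requests example is the desired output: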

>>> values 
['56.37', '107.75', 'n.a.', '95.99', 'n.a.', '56.00', '52.32', '234.85', '81.21', '40.72', '76.29', '19.90', 'n.a.', '92.41', '12.83', '62.19', '78.28', '60.51', '4995.58', '92.99', '67.56', '175.24', '58.71', '82.14', '57.75', '46.86', '22.95', '70.06', '150.16', '6793.46', '31.07', '34.31', '50.39'] 
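
The scraped values are strings, and some cells hold the placeholder 'n.a.' instead of a number. If you need numbers, one possible way to convert them is sketched below. The to_float helper is a hypothetical name introduced for this example, mapping placeholders to None is an assumption about what you want, and the sample list repeats only a few of the values shown above.

```python
def to_float(cell):
    """Return the cell text as a float, or None for placeholders such as 'n.a.'."""
    try:
        # Drop thousands separators, in case a value is formatted like '1,234.5'.
        return float(cell.replace(",", ""))
    except (AttributeError, ValueError):
        # AttributeError: cell is None (row.string can be None for nested tags).
        # ValueError: cell is not numeric, e.g. 'n.a.'.
        return None


# A few of the scraped values, used here as sample input.
values = ['56.37', '107.75', 'n.a.', '4995.58']
numbers = [to_float(v) for v in values]
print(numbers)
```

Keeping None in place of missing values preserves the row positions, so the numbers still line up with the other columns of the table.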

Awesome! Thank you very much. I noticed