2017-07-29

Trying to scrape a table with pandas from Selenium results

I want to scrape a table from a JavaScript-rendered website using pandas. To do this I use Selenium to navigate to the page I want first. I am able to print the table as text (as the commented-out lines in the script show), but I would also like to be able to work with the table in pandas. My script is attached below; I hope someone can help me figure this out.

import time 
from selenium import webdriver 
import pandas as pd 

chrome_path = r"Path to chrome driver" 
driver = webdriver.Chrome(chrome_path) 
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02' 

page = driver.get(url) 
time.sleep(2) 


driver.find_element_by_xpath('//*[@id="bursa_boards"]/option[2]').click() 


driver.find_element_by_xpath('//*[@id="bursa_sectors"]/option[11]').click() 
time.sleep(2) 

driver.find_element_by_xpath('//*[@id="bm_equity_price_search"]').click() 
time.sleep(5) 

target = driver.find_elements_by_id('bm_equities_prices_table') 
##for data in target: 
## print (data.text) 

for data in target: 
    dfs = pd.read_html(target,match = '+') 
for df in dfs: 
    print (df) 

Running the script above, I get the following error:

Traceback (most recent call last): 
    File "E:\Coding\Python\BS_Bursa Properties\Selenium_Pandas_Bursa Properties.py", line 29, in <module> 
    dfs = pd.read_html(target,match = '+') 
    File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 906, in read_html 
    keep_default_na=keep_default_na) 
    File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 728, in _parse 
    compiled_match = re.compile(match) # you can pass a compiled regex here 
    File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 233, in compile 
    return _compile(pattern, flags) 
    File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 301, in _compile 
    p = sre_compile.compile(pattern, flags) 
    File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_compile.py", line 562, in compile 
    p = sre_parse.parse(p, flags) 
    File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 855, in parse 
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0) 
    File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 416, in _parse_sub 
    not nested and not items)) 
    File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 616, in _parse 
    source.tell() - here + len(this)) 
sre_constants.error: nothing to repeat at position 0 
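The root of that traceback is that `read_html`'s `match` argument is compiled as a regular expression, and `'+'` on its own is an invalid pattern: `+` means "repeat the previous token", and at position 0 there is no previous token. A literal plus sign has to be escaped. (Separately, the script passes `target`, a list of Selenium WebElements, to `pd.read_html`, which expects a URL, file-like object, or HTML string.) A minimal sketch of the regex point:

```python
import re

# '+' alone is an invalid regex: nothing precedes the repeat operator.
try:
    re.compile('+')
except re.error as e:
    print('invalid pattern:', e)

# Escaping makes it a literal character, which is what match='+' intended.
pattern = re.compile(re.escape('+'))
print(bool(pattern.search('Chg +25.00')))
```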

I also tried passing the URL to pd.read_html directly, but it returned a "No tables found" error. The URL is: http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page=1

Answer

You can use the following code:

import time 
from selenium import webdriver 
import pandas as pd 

chrome_path = r"Path to chrome driver" 
driver = webdriver.Chrome(chrome_path) 
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02' 

page = driver.get(url) 
time.sleep(2) 

df = pd.read_html(driver.page_source)[0] 
print(df.head()) 

This is the table you get as output:

No Code Name Rem Last Done LACP Chg % Chg Vol ('00) Buy Vol ('00) Buy Sell Sell Vol ('00) High Low 
0 1 5284CB LCTITAN-CB s 0.025 0.020 0.005 +25.00 406550 19878 0.020 0.025 106630 0.025 0.015 
1 2 1201 SUMATEC [S] s 0.050 0.050 - - 389354 43815 0.050 0.055 187301 0.055 0.050 
2 3 5284 LCTITAN [S] s 4.470 4.700 -0.230 -4.89 367335 430 4.470 4.480 34 4.780 4.140 
3 4 0176 KRONO [S] - 0.875 0.805 0.070 +8.70 300473 3770 0.870 0.875 797 0.900 0.775 
4 5 5284CE LCTITAN-CE s 0.130 0.135 -0.005 -3.70 292379 7214 0.125 0.130 50 0.155 0.100 

To get the data from all of the pages, you can scrape the remaining pages the same way and combine the results with df.append.
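The combining step can be sketched like this. The page fetching is stubbed out with in-memory frames; in the real script each per-page frame would come from `pd.read_html(driver.page_source)[0]` after navigating to the next results page. `pd.concat` is used here as the idiomatic way to stitch a list of frames together (it does the same job as repeated `df.append` calls):

```python
import pandas as pd

# Stand-ins for the per-page tables; in the real script each of these
# would be pd.read_html(driver.page_source)[0] for one results page.
page_dfs = [
    pd.DataFrame({'Code': ['5284', '1201'], 'Last Done': [4.470, 0.050]}),
    pd.DataFrame({'Code': ['0176', '5284CE'], 'Last Done': [0.875, 0.130]}),
]

# Concatenate all pages into one frame; ignore_index renumbers the rows
# so they do not restart at 0 for every page.
all_pages = pd.concat(page_dfs, ignore_index=True)
print(all_pages)
```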

Thank you very much for pointing out the solution. Your suggestion works great! Would you mind explaining what the `[0]` in `read_html` does? I tried searching the `read_html` documentation for it but couldn't find any explanation. –

Because two tables are returned, and the one you want is the first. You can look at the two different tables via `df[0]` and `df[1]`. – ksai

I see, two different tables. How do I know how many tables were returned? –
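For reference, `pd.read_html` always returns a *list* of DataFrames, one per `<table>` element it finds, so `len()` answers the question in the last comment. A small sketch using an inline HTML string (rather than the live Bursa page) in place of `driver.page_source`:

```python
import pandas as pd

# Two minimal <table> elements stand in for the live page's markup.
html = """
<table><tr><th>Code</th></tr><tr><td>5284</td></tr></table>
<table><tr><th>Name</th></tr><tr><td>LCTITAN</td></tr></table>
"""

tables = pd.read_html(html)  # a list of DataFrames, one per <table>
print(len(tables))           # how many tables were found
print(tables[0])             # the first table, i.e. the [0] in the answer
```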