2016-09-25 215 views

Getting column headers from multiple HTML 'tbody' elements

I need to get the column headers from the second tbody at this URL:

http://bepi.mpob.gov.my/index.php/statistics/price/daily.html

Specifically, I want to see "September, October", and so on.

I get the following error:

runfile('C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py', wdir='C:/Python27/Lib/site-packages/xy/workspace/webscrape') 
Traceback (most recent call last): 

    File "<ipython-input-8-ab4005f51fa3>", line 1, in <module> 
    runfile('C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py', wdir='C:/Python27/Lib/site-packages/xy/workspace/webscrape') 

    File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile 
    execfile(filename, namespace) 

    File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile 
    exec(compile(scripttext, filename, 'exec'), glob, loc) 

    File "C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py", line 26, in <module> 
    soup.findAll('tbody', limit=2)[1].findAll('tr').findAll('th')] 

IndexError: list index out of range 

Could someone please help me out here? I would be forever grateful!

I have posted my code below:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://bepi.mpob.gov.my/index.php/statistics/price/daily.html"

r = requests.get(url)

soup = BeautifulSoup(r.text, 'lxml')

column_headers = [th.getText() for th in
                  soup.findAll('tbody', limit=2)[1].findAll('tr').findAll('th')]

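Do you mean you only need the contents of the month select element, or do you actually need to click "View Price" and parse the "Summary on MPOB Daily FFB Reference Price by Region" table? Thanks – alecxe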


I need to click "View Price". The table I need to parse is "Peninsular Malaysia: Summary of Local Prices of RBD P. Oil, RBD P. Olein & RBD P. Stearin".

Answer


When you click the "View Price" button, a POST request is sent to the http://bepi.mpob.gov.my/admin2/price_local_daily_view3.php endpoint. Simulate this POST request and parse the resulting HTML:

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    # Load the form page first so the session picks up any cookies it sets
    session.get("http://bepi.mpob.gov.my/index.php/statistics/price/daily.html")

    # Submit the form: "tahun" is the year and "bulan" the month (Malay field names)
    response = session.post("http://bepi.mpob.gov.my/admin2/price_local_daily_view3.php", data={
        "tahun": "2016",
        "bulan": "9",
        "Submit2222": "View Price"
    })
    soup = BeautifulSoup(response.content, 'lxml')

    # The third row of the results table holds the column headers
    table = soup.find("table", id="hor-zebra")
    headers = [td.get_text() for td in table.find_all("tr")[2].find_all("td")]
    print(headers)

This prints the table's headers:

[u'Tarikh', u'September', u'October', u'November', u'December', u'September', u'October', u'November', u'December', u'September', u'October', u'November', u'December'] 
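As a side note on the original error: the expression in the question indexes `[1]` on the result of `findAll('tbody', limit=2)`, which raises `IndexError` whenever the fetched HTML contains fewer than two `tbody` elements. One common cause, though not confirmed for this page, is that browsers insert `tbody` into the DOM while parsers such as lxml only see what is in the raw markup. Even with two `tbody` elements, calling `.findAll('th')` on the list returned by `.findAll('tr')` would fail, since that method belongs to a single element, not to a result list. Below is a minimal sketch of a more defensive lookup. The helper name `row_texts` and the `SAMPLE_HTML` string are hypothetical stand-ins for illustration; the real response markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the POST response; the real markup may differ.
SAMPLE_HTML = """
<table id="hor-zebra">
  <tr><td>Title</td></tr>
  <tr><td>Group</td></tr>
  <tr><td>Tarikh</td><td>September</td><td>October</td></tr>
</table>
"""


def row_texts(html, table_id, row_index):
    """Return the stripped cell texts of one table row, or [] if it is missing."""
    soup = BeautifulSoup(html, "html.parser")

    # find() returns None when no matching table exists, so check before using it
    table = soup.find("table", id=table_id)
    if table is None:
        return []

    # find_all() returns a list of rows; pick one row before searching its cells
    rows = table.find_all("tr")
    if not 0 <= row_index < len(rows):
        return []

    # Accept both td and th, since header cells may use either tag
    return [cell.get_text(strip=True) for cell in rows[row_index].find_all(["td", "th"])]


headers = row_texts(SAMPLE_HTML, "hor-zebra", 2)
print(headers)
```

Returning an empty list instead of raising makes it easier to tell "the table was not in the response" apart from a parsing bug; with the real site you would pass `response.content` in place of `SAMPLE_HTML`.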

This works perfectly! Thank you!