
Scraping multiple pages in one Beautiful Soup script - getting the same results

I am trying to parse a table with Beautiful Soup in Python 2.7.

Parsing the first table works and produces the expected results. The second loop iteration produces exactly the same results as the first.
Other details:

  • If I manually open the URL that the second loop builds, I get the page I expect to scrape. It refreshes with a slight delay.
  • I have used this pattern on other sites, and the loop works as expected.

Here is the script:

import urllib2
import csv
from bs4 import BeautifulSoup # latest version bs4

week = raw_input("Which week?")
week = str(week)
data = []
first = "http://fantasy.nfl.com/research/projections#researchProjections=researchProjections%2C%2Fresearch%2Fprojections%253Foffset%253D"
middle = "%2526position%253DO%2526sort%253DprojectedPts%2526statCategory%253DprojectedStats%2526statSeason%253D2015%2526statType%253DweekProjectedStats%2526statWeek%253D"
last = "%2Creplace"

for page_num in range(1, 3):
    page_mult = (page_num - 1) * 25 + 1
    next = str(page_mult)
    url = first + next + middle + week + last
    print url # I added this in order to check my output

    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    table = soup.find('table', attrs={'class': 'tableType-player hasGroups'})
    table_body = table.find('tbody')

    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele]) # Get rid of empty values

    b = open('NFLtable.csv', 'w')
    a = csv.writer(b)
    a.writerows(data)
    b.close()

print data

I'm not sure I understand: you want to scrape two pages, but your script only gets one? –


Sort of. I am getting the first page fine. On the second iteration of the loop it builds the second URL correctly, but it returns the results of the first page again. So if there are 25 records per page, I get 50 records in the CSV, but the first 25 are identical to the last 25. If I iterate outside of the second loop, I get the first 25 records again. – ztomazin

Answer


They are using AJAX requests to fetch the additional results; the actual page data comes back as a JSON response with the table HTML as one of the values. That is also why your original loop kept returning the first page: everything after the # in your URLs is a fragment, which is handled client-side and never sent to the server, so urllib2 fetched the same base page on every iteration.
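You can see this with the standard library's urlparse module (a quick sketch; the URL here is truncated for readability):

import urlparse  # Python 2 standard library

url = ("http://fantasy.nfl.com/research/projections"
       "#researchProjections=researchProjections%2C%2Fresearch%2Fprojections...")
parts = urlparse.urlparse(url)
print parts.path      # '/research/projections' - the only part the server sees
print parts.fragment  # offset, week, etc. all live here and stay client-side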

I modified your code a bit, give it a try:

import urllib2
import csv
from bs4 import BeautifulSoup # latest version bs4
import json

week = raw_input("Which week?") 
week = str(week) 
data = [] 
url_format = "http://fantasy.nfl.com/research/projections?offset={offset}&position=O&sort=projectedPts&statCategory=projectedStats&statSeason=2015&statType=weekProjectedStats&statWeek={week}" 

for page_num in range(1, 3): 
    page_mult = (page_num - 1) * 25 + 1 
    url = url_format.format(week=week, offset=page_mult) 
    print url # I added this in order to check my output 

    request = urllib2.Request(url, headers={'Ajax-Request': 'researchProjections'}) 
    raw_json = urllib2.urlopen(request).read() 
    parsed_json = json.loads(raw_json) 
    html = parsed_json['content'] 

    soup = BeautifulSoup(html, "html.parser") 
    table = soup.find('table', attrs={'class': 'tableType-player hasGroups'}) 
    table_body = table.find('tbody') 

    rows = table_body.find_all('tr') 
    for row in rows: 
     cols = row.find_all('td') 
     cols = [ele.text.strip() for ele in cols] 
     data.append([ele for ele in cols if ele]) # Get rid of empty values 

print data 

I tested with week = 4.
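If you still want the CSV file from your original script, write it once after the loop finishes, so each iteration does not overwrite the file. A minimal sketch reusing the data list built above:

import csv

# Write every collected row in one go after scraping.
# Opening NFLtable.csv with mode 'w' inside the loop, as in the original
# script, truncates the file on each iteration.
with open('NFLtable.csv', 'wb') as f:  # binary mode for csv on Python 2
    writer = csv.writer(f)
    writer.writerows(data)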


Thank you for the help - this works. I was not aware of the difference between scraping the rendered page and calling the underlying request directly to make sure I scrape the correct data. – ztomazin