2011-10-25 48 views
1

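Python Web Scraping; Beautiful Soup

This was covered in this post: Python web scraping involving HTML tags with attributes

But I haven't been able to do something similar for this web page: http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland

The values I want to scrape: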

<td class="price city-2"> 
                 NZ$15.62 
             <span style="white-space:nowrap;">(AU$12.10)</span> 
                </td> 
    <td class="price city-1"> 
                 AU$15.82 
           </td> 

Basically price city-2 and price city-1 (NZ$15.62 and AU$15.82)

What I have so far:

import urllib2 

from BeautifulSoup import BeautifulSoup 

url = "http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland?" 
page = urllib2.urlopen(url) 

soup = BeautifulSoup(page) 

price2 = soup.findAll('td', attrs = {'class':'price city-2'}) 
price1 = soup.findAll('td', attrs = {'class':'price city-1'}) 

for price in price2: 
    print price 

for price in price1: 
    print price 

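Ideally, I would also like comma-separated values. From:

<th colspan="3" class="clickable">Food</th>

I would extract 'Food', and from:

<td class="item-name">Daily menu in the business district</td>

'Daily menu in the business district', followed by the price city-2 and price city-1 values.

So the printed output would be:

Food, Daily menu in the business district, NZ$15.62, AU$15.82

Thanks!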

Answers

2

I find BeautifulSoup awkward to use. Here is a version based on the webscraping module:

from webscraping import common, download, xpath 

# download html 
D = download.Download() 
html = D.get('http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland') 

# extract data 
items = xpath.search(html, '//td[@class="item-name"]') 
city1_prices = xpath.search(html, '//td[@class="price city-1"]') 
city2_prices = xpath.search(html, '//td[@class="price city-2"]') 

# display and format 
for item, city1_price, city2_price in zip(items, city1_prices, city2_prices): 
    print item.strip(), city1_price.strip(), common.remove_tags(city2_price, False).strip() 

Output:

Daily menu in the business district AU$15.82 NZ$15.62
Combo meal in fast food restaurant (Big Mac Meal or similar) AU$7.40 NZ$8.16
1/2 Kg (1 lb.) of chicken breast AU$6.07 NZ$10.25
...

+0

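Thanks Richard. I installed webscraping, but when I run your code I get this error from the package: ------- from webscraping import common, download, xpath File "C:\Python27\lib\site-packages\webscraping\download.py", line 649 cache= ^ SyntaxError: invalid syntax ------- Did I mess up the installation? I just downloaded the zip file and put it in the site-packages folder in the Python directory.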

+0

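*Update* It looks like there was an extra carriage return in the download.py file. It seems to work now. Thanks. Can this also be used to extract links? I didn't see any examples in the documentation. Is this right? ---- xpath.parse(html, '/html/body/ul[2]/li[@class="info"]/a/@href')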

+0

Yes, XPath can be used to extract links – hoju
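
As a rough illustration of the link extraction discussed in these comments, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of the webscraping module. It assumes well-formed markup (ElementTree is an XML parser and will reject much real-world HTML). Its limited XPath subset cannot select attributes directly (there is no /@href step), so the href is read from each matched element. The sample markup and the function name extract_links are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the target page).
SAMPLE = """
<html><body>
  <ul><li class="other"><a href="/skip">skip</a></li></ul>
  <ul>
    <li class="info"><a href="/first">First</a></li>
    <li class="info"><a href="/second">Second</a></li>
    <li class="other"><a href="/third">Third</a></li>
  </ul>
</body></html>
"""


def extract_links(markup):
    """Return hrefs of <a> elements inside <li class="info"> in the second <ul> of <body>."""
    root = ET.fromstring(markup)
    # Positions are 1-based, as in XPath: ul[2] is the second <ul> child of <body>.
    anchors = root.findall("./body/ul[2]/li[@class='info']/a")
    return [a.get("href") for a in anchors]


if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

For pages that are not well-formed, a forgiving HTML parser would be needed before any XPath-style query can be applied.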

0

If you load the target page's HTML into a variable htmlsource, this pyparsing web scraper will format the data in the requested CSV format:

from pyparsing import * 

th,thEnd = makeHTMLTags("th") 
thCategory = th.setParseAction(withAttribute(**{'class':'clickable', 'colspan':'3'})) 
category = thCategory.suppress() + SkipTo(thEnd)('category') + thEnd 

# set up tag recognizers, with specialized patterns based on class attribute 
td, tdEnd = makeHTMLTags("td") 
tdWithClass = lambda cls : td.copy().setParseAction(withAttribute(**{'class':cls})) 
itemTd = tdWithClass('item-name') 
price1Td = tdWithClass('price city-1') 
price2Td = tdWithClass('price city-2') 

# define some currencies 
currency = oneOf("NZ$ AU$ US$ SG$").setName("currency") 

# define a currency amount as a real number 
amount = Regex(r'\d+,\d{3}|\d+(\.\d+)?').setParseAction(lambda t:float(t[0].replace(',',''))) 

# define the format of a city value 
cityval = Group((price1Td | price2Td) + currency("currency") + amount("amt") + SkipTo(tdEnd) + tdEnd) 

# define a comparison item, including item name and item cost in city1 and city2 
comparison = Group(itemTd + SkipTo(tdEnd)("item") + tdEnd + (cityval*2)("valuedata")) 

# attach a parse action to clean up automated token naming 
def assignPriceTags(t):
    for v in t[0].valuedata:
        if v['class'] == 'price city-1':
            t[0]['price1'] = v
        else:
            t[0]['price2'] = v

    # remove extraneous results names created by makeHTMLTags
    for tg in 'class tag startTd endTd empty'.split():
        del t[0][tg]
        for v in t[0].valuedata:
            del v[tg]
    del t[0]['valuedata']
comparison.setParseAction(assignPriceTags)


currentcategory = ''
for compdata in (category|comparison).searchString(htmlsource):
    if 'category' in compdata:
        currentcategory = compdata.category
        continue
    compdata = compdata[0]
    #~ print compdata.dump()
    print "%s, %s, %s%s, %s%s" % (currentcategory, compdata.item,
        compdata.price1.currency, compdata.price1.amt,
        compdata.price2.currency, compdata.price2.amt)

Prints:

Food, Daily menu in the business district, AU$15.82, NZ$15.62 
Food, Combo meal in fast food restaurant (Big Mac Meal or similar), AU$7.4, NZ$7.91 
Food, 1/2 Kg (1 lb.) of chicken breast, AU$6.07, NZ$10.25 
Food, 1 liter (1 qt.) of whole fat milk, AU$1.8, NZ$2.65 
Food, 500 gr (16 oz.) of local cheese, AU$5.99, NZ$7.2 
Food, 1 kg (2 lb.) of apples, AU$4.29, NZ$3.46 
Food, 2 kg (4,5 lb.) of potatoes, AU$4.31, NZ$5.29 
Food, 0.5 l (16 oz) beer in the supermarket, AU$4.12, NZ$4.36 
Food, 2 liters of Coca-Cola, AU$3.07, NZ$2.64 
Food, bread for 2 people for 1 day, AU$2.32, NZ$1.93 
Housing, monthly rent for a 85 m2 (900 Sqft) furnished apartment in expensive area of the city, AU$1766.0, NZ$2034.0 
Housing, Internet 8MB (1 month), AU$49.0, NZ$61.0 
Housing, 40” flat screen TV, AU$865.0, NZ$1041.0 
Housing, utilities 1 month (heating, electricity, gas ...), AU$211.0, NZ$170.0 
Clothes, 1 pair of Levis 501, AU$119.0, NZ$123.0 
Clothes, 1 summer dress in a chain store (Zara, H&M, ...), AU$63.0, NZ$50.0 
Clothes, 1 pair of Adidas trainers, AU$142.0, NZ$166.0 
Clothes, 1 pair of average business shoes, AU$130.0, NZ$133.0 
Transportation, Volkswagen Golf 2.0 TDI 140 CV 6 vel. (or equivalent), with no extras, new, AU$28321.0, NZ$45574.0 
Transportation, 1 liter (1/4 gallon) of gas, AU$1.43, NZ$2.13 
Transportation, monthly ticket public transport, AU$110.0, NZ$138.0 
Personal Care, medicine against cold for 6 days (Frenadol, Coldrex, ...), AU$14.27, NZ$17.85 
Personal Care, 1 box of 32 tampons (Tampax, OB, ...), AU$5.51, NZ$7.71 
Personal Care, 4 rolls of toilet paper, AU$3.57, NZ$3.07 
Personal Care, Tube of toothpaste, AU$3.37, NZ$3.39 
Personal Care, Standard men's haircut in expat area of the city, AU$27.0, NZ$27.0 
Entertainment, 2 tickets to the movies, AU$33.0, NZ$30.0 
Entertainment, 2 tickets to the theater (best available seats), AU$163.0, NZ$139.0 
Entertainment, dinner out for two in Italian restaurant with wine and dessert, AU$100.0, NZ$100.0 
Entertainment, basic dinner out for two in neighborhood pub, AU$46.0, NZ$46.0 
Entertainment, 1 cocktail drink in downtown club, AU$14.31, NZ$14.38 
Entertainment, 1 beer in neighbourhood pub, AU$4.69, NZ$6.72 
Entertainment, iPod nano 8GB (6th generation), AU$176.0, NZ$252.0 
Entertainment, 1 min. of prepaid mobile tariff (no discounts or plans), AU$1.14, NZ$0.84 
Entertainment, 1 month of gym in business district, AU$90.0, NZ$91.0 
Entertainment, 1 package of Marlboro cigarretes, AU$15.97, NZ$14.47 
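
Note that the amounts above lose their trailing zeros (AU$7.4 rather than AU$7.40) because the parse action converts each match to a float and %s prints the float's default representation. One possible way to keep two decimals, sketched with the standard library only, is to parse the matched text into a Decimal and format it explicitly. The helper names parse_amount and format_price are hypothetical; the pattern is the one from the answer's Regex, applied here with fullmatch.

```python
import re
from decimal import Decimal

# Same pattern as the answer's Regex: a thousands-separated integer or a plain decimal.
AMOUNT_RE = re.compile(r'\d+,\d{3}|\d+(\.\d+)?')


def parse_amount(text):
    """Convert a matched amount such as '1,766' or '7.40' to a Decimal."""
    if not AMOUNT_RE.fullmatch(text):
        raise ValueError("not a recognised amount: %r" % text)
    return Decimal(text.replace(',', ''))


def format_price(currency, amount):
    """Render a currency prefix and an amount with exactly two decimal places."""
    return "{}{:.2f}".format(currency, amount)


if __name__ == "__main__":
    print(format_price("AU$", parse_amount("7.40")))
    print(format_price("NZ$", parse_amount("1,766")))
```

Whole-number prices come out with two decimals as well (for example 1766.00), which may or may not be what is wanted for the rent and car rows.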
+0

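I appreciate Paul's response, and it certainly gets the result I'm after; however, I had hoped for a reply using Beautiful Soup, only because I have already invested some time in it and am a programming noob. This answer involves a lot of code that is over my head, and I was hoping to be able to reproduce it to some extent with what I have already learned. I know different problems call for different solutions, so this may well be something I want to learn too. Thanks!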
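
For readers in the same position, the row-by-row logic can also be sketched with nothing but the standard library's html.parser, which may be easier to follow than a full grammar. This is not a Beautiful Soup solution and has not been run against the live page. It assumes the structure shown in the question: a th with class "clickable" for the category, then td cells with classes "item-name", "price city-2" and "price city-1". The class name PriceTableParser, the function scrape and the sample markup are made up for the example, and unclosed nested tags such as a bare br inside a cell are not handled.

```python
from html.parser import HTMLParser

# Hypothetical sample built from the fragments quoted in the question.
SAMPLE = """
<table>
<tr><th colspan="3" class="clickable">Food</th></tr>
<tr>
  <td class="item-name">Daily menu in the business district</td>
  <td class="price city-2">
      NZ$15.62
      <span style="white-space:nowrap;">(AU$12.10)</span>
  </td>
  <td class="price city-1">
      AU$15.82
  </td>
</tr>
</table>
"""


class PriceTableParser(HTMLParser):
    """Collect (category, item, city-2 price, city-1 price) rows."""

    # Map the class attribute of the interesting cells to a field name.
    FIELDS = {
        "clickable": "category",
        "item-name": "item",
        "price city-2": "price2",
        "price city-1": "price1",
    }

    def __init__(self):
        super().__init__()
        self.rows = []
        self.category = ""
        self.current = {}   # fields gathered for the row in progress
        self.field = None   # field being captured, or None
        self.depth = 0      # nesting depth inside the captured cell
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.field is not None:
            self.depth += 1  # nested tag such as <span>; its text is skipped
            return
        if tag in ("td", "th"):
            self.field = self.FIELDS.get(dict(attrs).get("class"))
            self.depth = 0
            self.text = []

    def handle_data(self, data):
        # Keep only text directly inside the cell, not inside nested tags.
        if self.field is not None and self.depth == 0:
            self.text.append(data)

    def handle_endtag(self, tag):
        if self.field is None:
            return
        if self.depth > 0:
            self.depth -= 1
            return
        value = " ".join("".join(self.text).split())  # collapse whitespace
        if self.field == "category":
            self.category = value
        else:
            self.current[self.field] = value
            if len(self.current) == 3:  # item and both prices have been seen
                self.rows.append((self.category, self.current["item"],
                                  self.current["price2"], self.current["price1"]))
                self.current = {}
        self.field = None


def scrape(html):
    """Return one 'category, item, price2, price1' string per table row."""
    parser = PriceTableParser()
    parser.feed(html)
    return [", ".join(row) for row in parser.rows]


if __name__ == "__main__":
    for line in scrape(SAMPLE):
        print(line)
```

Joining with ", " matches the output format asked for in the question. If the result has to be loaded as real CSV, csv.writer would be the safer choice, since some item names on the page contain commas.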