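Scraper unable to print all results

I have written a script in Python to scrape the "name" and "phone" of five items from craigslist. The problem is that when I run the script it gives only three results instead of five. More specifically, the first two links have no additional (contact info) link on their pages, so they do not need a further request to open another page. However, those two links never get past the "if ano_page_link:" statement in my second function, so they are never printed. How can I fix this flaw so that the scraper prints all five results, whether or not a phone number is available?
The script I have tried: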
import re
import requests
from lxml import html

base = "http://bangalore.craigslist.co.in"
url_list = [
    'http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html',
    'http://bangalore.craigslist.co.in/reb/d/prestige-sunnyside/6259128505.html',
    'http://bangalore.craigslist.co.in/reb/d/jayanagar-2nd-block-4000-sft/6221720477.html',
    'http://bangalore.craigslist.co.in/reb/d/prestige-ozone-type-3-r-villa/6259928614.html',
    'http://bangalore.craigslist.co.in/reb/d/zed-homes-3-bedroom-flat-for/6257075793.html'
]

def get_link(medium_link):
    response = requests.get(medium_link).text
    tree = html.fromstring(response)
    try:
        name = tree.cssselect('span#titletextonly')[0].text
    except IndexError:
        name = ""
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
    parse_doc(name, link)

def parse_doc(title, ano_page_link):
    if ano_page_link:
        page = requests.get(ano_page_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)

if __name__ == '__main__':
    for link in url_list:
        get_link(link)
The result I am getting:
Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364
The result I am expecting:
A Flat is for sale at Cooke Town
Prestige Sunnyside
Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364
Are you defining the functions inside the 'for' loop? Why? – Andersson
Sorry, sir. I shouldn't have. I did that only for this demo. – SIM
Modified as you suggested, Mr. Andersson. – SIM
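
One possible fix, offered as a minimal sketch rather than a definitive answer: since print(title, tel) appears to sit inside the "if ano_page_link:" block, listings without a contact link are skipped entirely. Giving tel a default of "" and printing outside the "if" should let every title through. The extract_phone helper and the fetch parameter are hypothetical names introduced here so the sketch runs without network access; in the original script fetch would presumably be something like lambda url: requests.get(url).text.

```python
import re

# Same pattern as in the question: the first run of ten digits.
PHONE_RE = re.compile(r'\d{10}')


def extract_phone(page_text):
    """Return the first 10-digit run in page_text, or "" if none is found."""
    match = PHONE_RE.search(page_text)
    return match.group(0) if match else ""


def parse_doc(title, ano_page_link, fetch):
    """Print the title with its phone number, or the title alone.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page text; it is injected so the function can be tried offline.
    """
    tel = ""  # default when there is no contact-info link
    if ano_page_link:
        tel = extract_phone(fetch(ano_page_link))
    # The print now runs for every listing, not only those with a link.
    print(f"{title} {tel}".rstrip())
    return title, tel


if __name__ == '__main__':
    # Stand-in pages instead of real requests (made-up URL, for illustration).
    fake_pages = {"http://example.invalid/contact": "Call 9845012673 today"}
    parse_doc("Prestige Sunnyside", "", fake_pages.get)
    parse_doc("Jayanagar 2nd Block, 4000 sft Plot for Sale",
              "http://example.invalid/contact", fake_pages.get)
```

With this shape, get_link can stay as it is apart from passing the extra fetch argument, and the empty link it produces on IndexError simply results in a title printed without a number.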