2016-12-24 43 views
0

Copy strings and write them to a file. I don't really know Python, and I've researched a lot, but this is the best I could come up with for scraping a webpage in Python:

import urllib2
import re

file = open('C:\Users\Sadiq\Desktop\IdList.txt', 'w')

for a in range(1,11):
    s = str(a)
    url='http://fanpagelist.com/category/top_users/view/list/sort/fans/page%s' + s
    page = urllib2.urlopen(url).read()
    for x in range(1,21):
        id = re.search('php?id=(.+?)"',page)
        file.write(id)
file.close()

My code is trying to copy the ID numbers, which appear on the webpage like this:

href="/like_box.php?id=6679099553"

I just want to write the numbers to a txt file, one per line. There are 10 pages I want to scrape, and I only want the first 20 IDs from each page. But when I run my code, it shows a 403 error. How can I do this?

This is the full error:

C:\Users\Sadiq\Desktop>extractId.py 
Traceback (most recent call last): 
File "C:\Users\Sadiq\Desktop\extractId.py", line 7, in <module> 
page = urllib2.urlopen(url).read() 
File "C:\Python27\lib\urllib2.py", line 154, in urlopen 
return opener.open(url, data, timeout) 
File "C:\Python27\lib\urllib2.py", line 437, in open 
response = meth(req, response) 
File "C:\Python27\lib\urllib2.py", line 550, in http_response 
'http', request, response, code, msg, hdrs) 
File "C:\Python27\lib\urllib2.py", line 475, in error 
return self._call_chain(*args) 
File "C:\Python27\lib\urllib2.py", line 409, in _call_chain 
result = func(*args) 
File "C:\Python27\lib\urllib2.py", line 558, in http_error_default 
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) 
urllib2.HTTPError: HTTP Error 403: Forbidden 
+0

Print the url and you'll see it's not correct. If you use `+` then you don't need `%s`. To concatenate two strings you need `"A" + "B"` or `"A%s" % "B"` – furas
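To see the difference furas is pointing out, a quick sketch using the URL from the question (no network access needed, this only builds the strings):

```python
# the format string from the question
base = 'http://fanpagelist.com/category/top_users/view/list/sort/fans/page%s'
page_num = str(3)

# '+' just appends, so the literal '%s' stays in the URL
wrong = base + page_num   # ends with 'page%s3'

# '%' substitutes the value into the placeholder
right = base % page_num   # ends with 'page3'

print(wrong)
print(right)
```

The server receives the literal `%s` in the first case, which is why the request fails.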

+0

btw: `write()` doesn't add `"\n"`, so you need `write(id + "\n")` – furas
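A minimal sketch of what the missing newline does, using an in-memory buffer instead of a real file (the sample IDs are made up in the shape the question shows):

```python
import io

ids = ['6679099553', '1234567890']  # sample IDs, shaped like those in the question

# without '\n' the IDs run together on one line
buf = io.StringIO()
for i in ids:
    buf.write(i)
buf_no_newline = buf.getvalue()   # '66790995531234567890'

# with '\n' each ID lands on its own line
buf = io.StringIO()
for i in ids:
    buf.write(i + '\n')
buf_newlines = buf.getvalue()     # '6679099553\n1234567890\n'
```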

+0

Thanks, but it still doesn't help. I'm still getting the same error –

Answers

0

Try BeautifulSoup for HTML scraping:

from requests import request
from bs4 import BeautifulSoup as bs


with open('C:\Users\Sadiq\Desktop\IdList.txt', 'w') as out:
    for page in range(1,11):
        url='http://fanpagelist.com/category/top_users/view/list/sort/fans/page%d' % page # no need to convert 'page' to string
        html = request('GET', url).text # requests module easier to use
        soup = bs(html, 'html.parser')
        for a in soup.findAll('a', {'class':"like_box"})[:20]: # all links ('a') with class "like_box", first 20 only
            out.write(a['href'].split('=')[1] + '\n')
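The `split('=')[1]` step can be checked on its own against an href in the shape the question shows:

```python
# sample href, shaped like the one in the question
href = '/like_box.php?id=6679099553'

# everything after the '=' is the ID
fan_id = href.split('=')[1]
print(fan_id)  # -> '6679099553'
```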
+0

I get this error: `C:\Users\Sadiq\Desktop>extractId.py` Traceback (most recent call last): File "extractId.py", line 9, in <module> soup = bs(html, 'lxml') File "build\bdist.win-amd64\egg\bs4\__init__.py", line 165, in __init__ bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? –

+0

Try removing `'lxml'`: `soup = bs(html)` –

+0

That didn't work. It told me (it printed a message as if it were a person, very strange!) that I should pass `html.parser` next to `html`. But this code is writing the IDs from all of the pages, and I only want the first 20 from each. Could you change your answer to do that, and fix the soup line accordingly? Thanks in advance! –

0

Don't scrape with plain regular expressions; use an HTML parser like Beautiful Soup.

Also, I think your error comes from how you build your URL. Use the `%` operator to substitute the variable in, not `+`, which appends.

from bs4 import BeautifulSoup
import urllib2

for a in range(1,11):
    s = str(a)
    url='http://fanpagelist.com/category/top_users/view/list/sort/fans/page%s' % s
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    # find all links where the href contains 'php?id='
    # Note: you can also use css selectors or beautifulsoup's regex support to do this
    valid_links = []
    for link in soup.find_all('a', href=True):
        if 'php?id=' in link['href']:  # test the href attribute, not the tag itself
            valid_links.append(link['href'])
    print valid_links
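The filtering step can be sketched without a live page; the hrefs below are made-up samples shaped like the one the question shows:

```python
# hypothetical hrefs as they might appear among the page's links
hrefs = [
    '/like_box.php?id=6679099553',
    '/about',
    '/like_box.php?id=1234567890',
]

# keep only the hrefs that carry an id, then pull the number out
valid_links = [h for h in hrefs if 'php?id=' in h]
ids = [h.split('=')[1] for h in valid_links]
print(ids)  # -> ['6679099553', '1234567890']
```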
+0

I'm still getting the same error –