Python - make a script loop until a condition is met, using a different proxy address for each loop

I am the definition of a noob. I know almost nothing about Python and am looking for help. I can read code well enough to change variables to suit my needs, but when I try to do something the original code wasn't made for... I get lost.

So here's the deal: I found a craigslist (CL) flagging script that originally searched every CL site and flagged posts containing a specific keyword (it was written to flag all posts that mention Scientology).

I changed it to search only the CL sites in my general area (15 sites instead of 437), and changed the specific keyword it looks for. I want to automatically flag someone who keeps spamming and is hard to filter out, because I do a lot of business on CL through email.

I would like the script to loop until it can no longer find posts that meet the criteria, changing the proxy server after each loop. Also, where inside the script do the proxy/IP addresses go?

I look forward to your replies.

Here is the changed code I have:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
from twill.commands import * # gives us go()

areas = ['sfbay', 'chico', 'fresno', 'goldcountry', 'humboldt', 'mendocino', 'modesto', 'monterey', 'redding', 'reno', 'sacramento', 'siskiyou', 'stockton', 'yubasutter', 'reno']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]

    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        spam = 'https://post.craigslist.org/flag?flagCode=15&postingID='+num # url for flagging as spam
        go(spam) # flag it

print 'Checking ' + str(len(areas)) + ' areas...'

for area in ['http://' + a + '.craigslist.org/' for a in areas]:
    ujam = area + 'search/?query=james+"916+821+0590"+&catAbb=hhh'
    udre = area + 'search/?query="DRE+%23+01902542+"&catAbb=hhh'
    try:
        jam = urllib.urlopen(ujam).read()
        dre = urllib.urlopen(udre).read()
    except:
        print 'tl;dr error for ' + area

    if 'Found: ' in jam:
        print 'Found results for "James 916 821 0590" in ' + area
        expunge(ujam, area)
        print 'All "James 916 821 0590" listings marked as spam for area'

    if 'Found: ' in dre:
        print 'Found results for "DRE # 01902542" in ' + area
        expunge(udre, area)
        print 'All "DRE # 01902542" listings marked as spam for area'
If you only use 'go', import only 'go': 'from twill.commands import go' – askewchan 2013-02-19 21:17:19

ImportError: No module named go – 2013-02-19 22:18:10

Strange: http://twill.idyll.org/python-api.html says: 'from twill.commands import go' – askewchan 2013-02-19 22:24:07

Answers


You can create a continuous loop like this:

while True: 
    if condition : 
     break 

Itertools has tricks for repetition: http://docs.python.org/2/library/itertools.html

In particular, check out itertools.count and itertools.cycle.

(These are just pointers in the right direction. You could work out a solution from either one, or even both.)
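For instance, a minimal sketch combining the two ideas (the proxy addresses and the check_and_flag helper are placeholders here, not part of the original script):

import itertools

proxies = itertools.cycle(['108.60.219.136:8080',
                           '198.144.186.98:3128',
                           '66.55.153.226:8080'])

while True:
    proxy = next(proxies)          # a different proxy on every pass
    found = check_and_flag(proxy)  # hypothetical helper: True if anything was flagged
    if not found:
        break                      # stop once a full sweep finds nothing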

Sorry, I don't understand it.. I tried adding repeat() into the code, but I keep getting Traceback (most recent call last): File "/home/quonundrum/Desktop/CL.py", line 43, in <module> repeat('spam', 4) NameError: name 'repeat' is not defined – 2013-02-19 21:57:40

'import itertools as it' then call 'it.repeat()' – askewchan 2013-02-19 22:27:31

I've tried it.repeat('go', 4), it.repeat('go(spam)', 4), it.repeat('expunge'), it.repeat('ujam').. and a whole bunch of others... it doesn't repeat, but it doesn't give any errors either – 2013-02-19 22:54:55
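(For what it's worth, itertools.repeat never calls anything; it just yields the same value over and over, which would explain why nothing appeared to happen. A tiny illustration, with go and spam as in the script above:)

import itertools as it

list(it.repeat('go', 4))     # ['go', 'go', 'go', 'go'] -- the string repeated, no call made
for _ in it.repeat(None, 4):
    go(spam)                 # to repeat an action, loop and call it yourself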


I made some changes to your code. As far as I can tell, the function expunge already loops through all the results on a page, so I'm not sure what you need to loop over, but at the end there is an example of how to check whether results were found (there is no loop to break out of, though).

No idea how to change the proxy/IP, sorry.

By the way, you had 'reno' twice.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
from twill.commands import go

areas = ['sfbay', 'chico', 'fresno', 'goldcountry', 'humboldt',
         'mendocino', 'modesto', 'monterey', 'redding', 'reno',
         'sacramento', 'siskiyou', 'stockton', 'yubasutter']
queries = ['james+"916+821+0590"','"DRE+%23+01902542"']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]

    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        spam = 'https://post.craigslist.org/flag?flagCode=15&postingID='+num # url for flagging as spam
        go(spam) # flag it

print 'Checking ' + str(len(areas)) + ' areas...'

for area in areas:
    for query in queries:
        qurl = 'http://' + area + '.craigslist.org/search/?query=' + query + '+&catAbb=hhh'
        try:
            q = urllib.urlopen(qurl).read()
        except:
            print 'tl;dr error for {} in {}'.format(query, area)
            break

        if 'Found: ' in q:
            print 'Found results for {} in {}'.format(query, area)
            expunge(qurl, area)
            print 'All {} listings marked as spam for area'.format(query)
        elif 'Nothing found for that search' in q:
            print 'No results for {} in {}'.format(query, area)
            break
        else:
            break
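If you want the sweep itself to repeat until nothing is found, one sketch (untested, reusing the names above):

any_found = True
while any_found:                # repeat the whole sweep until one pass flags nothing
    any_found = False
    for area in areas:
        for query in queries:
            qurl = 'http://' + area + '.craigslist.org/search/?query=' + query + '+&catAbb=hhh'
            try:
                q = urllib.urlopen(qurl).read()
            except:
                continue
            if 'Found: ' in q:
                any_found = True
                expunge(qurl, area)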
Cool, that looks a lot better. Is there a way to make it keep running until it doesn't get any more results? – 2013-02-19 23:57:53

You mean you expect the results pages to change while the program is running? – askewchan 2013-02-20 00:02:55

It shows what it found/flagged in the shell, so I was wondering if the script could keep running until there are no more results for the keywords being searched (i.e. all results keep getting re-flagged until they are removed). – 2013-02-20 00:27:52


I made some changes... not sure how well they work, but I'm not getting any errors. Please let me know if you spot anything wrong/missing. - Thanks

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib, urllib2

# NOTE: each install_opener() call replaces the global opener, so after
# these five calls only the proxy5 opener is actually used by urllib2.
proxy = urllib2.ProxyHandler({'https': '108.60.219.136:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
proxy2 = urllib2.ProxyHandler({'https': '198.144.186.98:3128'})
opener2 = urllib2.build_opener(proxy2)
urllib2.install_opener(opener2)
proxy3 = urllib2.ProxyHandler({'https': '66.55.153.226:8080'})
opener3 = urllib2.build_opener(proxy3)
urllib2.install_opener(opener3)
proxy4 = urllib2.ProxyHandler({'https': '173.213.113.111:8080'})
opener4 = urllib2.build_opener(proxy4)
urllib2.install_opener(opener4)
proxy5 = urllib2.ProxyHandler({'https': '198.154.114.118:3128'})
opener5 = urllib2.build_opener(proxy5)
urllib2.install_opener(opener5)

areas = ['sfbay', 'chico', 'fresno', 'goldcountry', 'humboldt',
         'mendocino', 'modesto', 'monterey', 'redding', 'reno',
         'sacramento', 'siskiyou', 'stockton', 'yubasutter']
queries = ['james+"916+821+0590"','"DRE+%23+01902542"']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]

    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        # urllib2.urlopen() already performs the request through the installed
        # opener, so the original go() calls -- which passed the response
        # object to twill instead of a URL -- are dropped here.
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=15&postingID='+num) # flag it
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=28&postingID='+num) # flag it
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=16&postingID='+num) # flag it

print 'Checking ' + str(len(areas)) + ' areas...'

for area in areas:
    for query in queries:
        qurl = 'http://' + area + '.craigslist.org/search/?query=' + query + '+&catAbb=hhh'
        try:
            q = urllib.urlopen(qurl).read()
        except:
            print 'tl;dr error for {} in {}'.format(query, area)
            break

        if 'Found: ' in q:
            print 'Found results for {} in {}'.format(query, area)
            expunge(qurl, area)
            print 'All {} listings marked as spam for {}'.format(query, area)
            print ''
            print ''
        elif 'Nothing found for that search' in q:
            print 'No results for {} in {}'.format(query, area)
            print ''
            print ''
            break
        else:
            break
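Since install_opener() keeps only the most recent opener, one alternative (a sketch, not from the original post) is to build the openers once and pick one per request, without installing any of them globally:

import random

proxy_list = ['108.60.219.136:8080', '198.144.186.98:3128',
              '66.55.153.226:8080', '173.213.113.111:8080',
              '198.154.114.118:3128']
openers = [urllib2.build_opener(urllib2.ProxyHandler({'https': p}))
           for p in proxy_list]

opener = random.choice(openers)  # or itertools.cycle(openers) for round-robin
page = opener.open(qurl).read()  # this one request goes through the chosen proxy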
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib, urllib2

# NOTE: as above, only the last install_opener() call (proxy5) takes effect.
proxy = urllib2.ProxyHandler({'https': '108.60.219.136:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
proxy2 = urllib2.ProxyHandler({'https': '198.144.186.98:3128'})
opener2 = urllib2.build_opener(proxy2)
urllib2.install_opener(opener2)
proxy3 = urllib2.ProxyHandler({'https': '66.55.153.226:8080'})
opener3 = urllib2.build_opener(proxy3)
urllib2.install_opener(opener3)
proxy4 = urllib2.ProxyHandler({'https': '173.213.113.111:8080'})
opener4 = urllib2.build_opener(proxy4)
urllib2.install_opener(opener4)
proxy5 = urllib2.ProxyHandler({'https': '198.154.114.118:3128'})
opener5 = urllib2.build_opener(proxy5)
urllib2.install_opener(opener5)

areas = ['capecod']
queries = ['rent','rental','home','year','falmouth','lease','credit','tenant','apartment','bedroom','bed','bath']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]

    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        # urllib2.urlopen() performs the flag request itself; the broken
        # go(response) calls are dropped here as in the previous version.
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=15&postingID='+num) # flag it
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=28&postingID='+num) # flag it
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=16&postingID='+num) # flag it

print 'Checking ' + str(len(areas)) + ' areas...'

for area in areas:
    for query in queries:
        qurl = 'http://' + area + '.craigslist.org/search/?query=' + query + '+&catAbb=hhh'
        try:
            q = urllib.urlopen(qurl).read()
        except:
            print 'tl;dr error for {} in {}'.format(query, area)
            break

        if 'Found: ' in q:
            print 'Found results for {} in {}'.format(query, area)
            expunge(qurl, area)
            print 'All {} listings marked as spam for {}'.format(query, area)
            print ''
            print ''
        elif 'Nothing found for that search' in q:
            print 'No results for {} in {}'.format(query, area)
            print ''
            print ''
            break
        else:
            break