2014-09-12 136 views

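Python: Index out of range error

The following code keeps giving me IndexError: list index out of range, reported on line 012:

print (aTweet + '~' + timeSource[x] + '~' + keyWord[i])

Does this have anything to do with the keyWord[i] term? I understand that "index out of range" usually means an index was supplied for which no list element exists. Does that mean the error might actually lie in this section: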

if (len(splitSource) > 20):
    max_range = 19
else:
    max_range = len(splitSource)

Code for reference:

import re 
from re import sub 
import time 
import cookielib 
from cookielib import CookieJar 
import urllib2 
from urllib2 import urlopen 
import difflib 
import sys 

cj = CookieJar() 
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) 
opener.addheaders = [('User-agent', 'Mozilla/5.0')] 

keyWord = ["Scotch"] 

def main():
    i = 0
    while i < len(keyWord):
        startingLink = 'https://twitter.com/search/realtime?q=' + keyWord[i]
        tUrl = startingLink + '&src=hash'

        oldTwit = []
        newTwit = []

        howSimAr = [.5, .5, .5, .5, .5]

        sourceCode = opener.open(tUrl).read()
        splitSource = re.findall(r'<p class="js-tweet-text tweet-text">(.*?)</p>', sourceCode)
        timeSource = re.findall(r'js-nav" title="(.*?)"', sourceCode)

        if (len(splitSource) > 20):
            max_range = 19
        else:
            max_range = len(splitSource)

        print ''
        print ''
        print ''
        ##print 'Keyword: ' + keyWord[i]
        print ''

        for x in range(0, max_range):
            aTweet = re.sub(r'<.*?>', '', splitSource[x])
            print (aTweet + '~' + timeSource[x] + '~' + keyWord[i])
            #print ';'
            newTwit.append(aTweet)

##  comparison = difflib.SequenceMatcher(None, newTwit, oldTwit)
##  howSim = comparison.ratio()
##  print ';'
##  print 'This selection is',howSim,'similar to the past'
##  howSimAr.append(howSim)
##  howSimAr.remove(howSimAr[0])
##
##  waitMultiplier = reduce(lambda x, y: x+y, howSimAr)/len(howSimAr)
##
##  print ''
##  print 'The current similarity array:',howSimAr
##  print 'Our current Multiplier:', waitMultiplier

        oldTwit = [None]
        for eachItem in newTwit:
            oldTwit.append(eachItem)

        newTwit = [None]

        time.sleep(2)
        x = 0
        i = i + 1

## except Exception, e: 
##  print str(e) 
##  print 'errored in the main try' 
main() 
You are indexing 'timeSource' with 'x', but the range of 'x' is determined by the length of 'splitSource' (via 'max_range'). If 'splitSource' is longer than 'timeSource' (contains more elements), this will fail. – 2014-09-12 15:05:15

@Tom That makes sense. Would it be better to create another variable? – 2014-09-12 15:09:36

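It is not clear to me what the relationship between 'splitSource' and 'timeSource' is, or what your code is trying to do. Both seem to be related to tweets, but I don't know what data you expect. For example, when you search for the keyword "Scotch", how many items do you expect in 'splitSource', and how many in 'timeSource'? – 2014-09-12 15:19:25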
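
The mismatch described in the first comment can be reproduced without any network access. Below is a minimal sketch, assuming invented stand-in lists (the tweet texts and timestamps are made up) and written for Python 3 rather than the Python 2 used in the question. The helper name pair_tweets is hypothetical. Pairing the two lists with zip() stops at the shorter one, so no index can run past the end of either list:

```python
# Hypothetical stand-ins for the two re.findall() results in the question.
splitSource = ["first tweet", "second tweet", "third tweet"]
timeSource = ["10:01"]  # shorter list: timeSource[1] would raise IndexError


def pair_tweets(texts, times, limit=20):
    """Pair each tweet text with a timestamp, stopping at the shorter list."""
    # zip() ends as soon as either input is exhausted,
    # so neither list is ever indexed out of range.
    return list(zip(texts, times))[:limit]


pairs = pair_tweets(splitSource, timeSource)
for text, stamp in pairs:
    print(text + '~' + stamp)
```

This only avoids the exception. If one of the lists is unexpectedly short or empty, the output is silently truncated, so the underlying cause still has to be found.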

Answer

There are zero occurrences of js-nav" title=" in the source of the Twitter search page, so the second regular expression finds nothing. In fact, adding

print "len(timeSource) =", len(timeSource) 
print "max_range =", max_range 

for x in range (0, max_range): 

will show:

len(timeSource) = 0 
max_range = 20 
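
Whatever you are trying to achieve, you would be better off using HTMLParser or something similar to work with the HTML rather than re. That will make it much easier to ensure that timeSource[x] and splitSource[x] both belong to the same tweet x.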
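
As a rough illustration of the HTMLParser suggestion, here is a minimal sketch written for Python 3, where the module is html.parser (in Python 2 it is HTMLParser). The class name TweetTextParser and the sample markup are invented for this example, and the real page layout is not known here. It collects the text of every <p> element whose class list contains tweet-text, including text inside nested tags:

```python
from html.parser import HTMLParser  # the module is named HTMLParser in Python 2


class TweetTextParser(HTMLParser):
    """Collect the text of every <p> whose class list contains 'tweet-text'."""

    def __init__(self):
        super().__init__()
        self.tweets = []    # finished tweet texts
        self._parts = None  # text pieces of the <p> being read, else None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "tweet-text" in classes:
            self._parts = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            self.tweets.append("".join(self._parts))
            self._parts = None


# Invented sample markup, used only to exercise the parser.
sample = ('<div><p class="js-tweet-text tweet-text">Good <b>Scotch</b> tonight</p>'
          '<p class="other">ignored</p>'
          '<p class="js-tweet-text tweet-text">Second one</p></div>')

parser = TweetTextParser()
parser.feed(sample)
parser.close()
print(parser.tweets)
```

A timestamp could be recorded in the same pass, for example by storing it when the enclosing element of a tweet is opened, so that text and time are captured together instead of being matched up by index afterwards.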