2016-05-23 63 views
1

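My Python script parses titles and links from multiple RSS feeds. I store the titles in a list, and I want to make sure I never print a duplicate item. How can I do that? How do I tell Python not to print an item that is already in the list?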

#!/usr/bin/python 
from twitter import * 
from goose import Goose 
import feedparser 
import time 
from pyshorteners import Shortener 
import pause 
import newspaper 

dr = feedparser.parse("http://www.darkreading.com/rss_simple.asp")
sm = feedparser.parse("http://www.securitymagazine.com/rss/topic/2654-cyber-tactics.rss")

dr_posts = [
    "CISO Playbook: Games of War & Cyber Defenses",
    "SWIFT Confirms Cyber Heist At Second Bank; Researchers Tie Malware Code to Sony Hack",
    "The 10 Worst Vulnerabilities of The Last 10 Years",
    "GhostShell Leaks Data From 32 Sites In 'Light Hacktivism' Campaign",
    "OPM Breach: 'Cyber Sprint' Response More Like A Marathon",
    "Survey: Customers Lose Trust In Brands After A Data Breach",
    "Domain Abuse Sinks 'Anchors Of Trust'",
    "The 10 Worst Vulnerabilities of The Last 10 Years",
]

sm_posts = ["10 Steps to Building a Better Cybersecurity Plan"]

x = 1 

while True:
    try:
        # Dark Reading: read the entry at the current index.
        drtitle = dr.entries[x]["title"]
        drlink = dr.entries[x]["link"]
        if drtitle in dr_posts:
            # Already posted: skip ahead to the next entry instead.
            x += 1
            drtitle = dr.entries[x]["title"]
            drlink = dr.entries[x]["link"]
            print(drtitle + "\n" + drlink)
            dr_posts.append(drtitle)
            x -= 1
            pause.seconds(10)
        else:
            print(drtitle + "\n" + drlink)
            dr_posts.append(drtitle)
            pause.seconds(10)

        # Security Magazine: same check against its own list.
        smtitle = sm.entries[x]["title"]
        smlink = sm.entries[x]["link"]
        if smtitle in sm_posts:
            x += 1
            smtitle = sm.entries[x]["title"]
            smlink = sm.entries[x]["link"]
            print(smtitle + "\n" + smlink)
            sm_posts.append(smtitle)
            pause.seconds(10)
        else:
            print(smtitle + "\n" + smlink)
            sm_posts.append(smtitle)
            x += 1
            pause.seconds(10)

    except IndexError:
        # The index ran past the end of one of the feeds.
        print("FAILURE")
        break

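For now I just have it skip the entry. That will become a problem, because if the RSS feed contains more duplicates, I will still end up printing more duplicates.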

Answers

2

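You can use the set data structure, since its uniqueness property does the work for you. Basically, convert your list to a set and then back to a list, which guarantees that the list contains only unique values. Note that a set is unordered, so the original order of the list is not preserved.

If you have a list l, you can make it unique with: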

l = list(set(l)) 
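
Applied to the feed loop in the question, the same idea could look roughly like the sketch below: keep the already-printed titles in a set and skip any entry whose title is in it. This is only a minimal sketch, not the asker's or the answerer's code. The helper names `unique_in_order` and `new_entries` and the `sample_entries` data are hypothetical; the sample list of dicts merely stands in for the `entries` list returned by feedparser.

```python
def unique_in_order(items):
    """Return items without duplicates, keeping the first occurrence of each."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def new_entries(entries, seen_titles):
    """Return (title, link) pairs not seen before and record their titles."""
    fresh = []
    for entry in entries:
        title = entry["title"]
        if title in seen_titles:
            continue  # already printed earlier, so skip it
        seen_titles.add(title)
        fresh.append((title, entry["link"]))
    return fresh


# Hypothetical sample data standing in for feedparser's `entries` list.
sample_entries = [
    {"title": "Post A", "link": "http://example.com/a"},
    {"title": "Post B", "link": "http://example.com/b"},
    {"title": "Post A", "link": "http://example.com/a"},
]

seen = {"Post B"}  # titles assumed to have been printed on an earlier run
for title, link in new_entries(sample_entries, seen):
    print(title + "\n" + link)
```

Because every title is checked against the set before printing, it does not matter how many duplicates a feed contains, and membership tests on a set stay fast as the history grows.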
+0

Thanks! This really helped me! – Frank

+0

No problem, man! Glad it helped. Feel free to mark my answer as accepted (the checkmark). –

0

If you don't want to print duplicate links, you can use a Counter or a defaultdict:

from collections import defaultdict

sm_posts = defaultdict(int)
sm_posts[sm_links] += 1  # sm_links is the link of the current feed entry
print(list(sm_posts.keys()))  # prints all the unique links

The advantage is that you can also get the number of times a link was repeated with:

sm_posts[sm_links]  # the number of times this link was seen

Give it a try.
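
The counting approach from this answer can also be sketched with `collections.Counter`, which builds the counts in one step. The `links` list below is hypothetical sample data standing in for the `link` field of each feed entry, and the variable names are only illustrative.

```python
from collections import Counter

# Hypothetical links standing in for the `link` field of each feed entry.
links = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",
]

# Count how many times each link occurs.
link_counts = Counter(links)

# The keys are the unique links, in first-seen order on Python 3.7+.
unique_links = list(link_counts)

# Links that occurred more than once, with their counts.
repeated = {link: n for link, n in link_counts.items() if n > 1}

for link in unique_links:
    print(link)
```

Looking up a link that was never counted returns 0 rather than raising a `KeyError`, so the counts can be queried without first checking for membership.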