
Beautiful Soup for Wikipedia — I can't get this script to work to scrape information from a series of Wikipedia pages.

What I want to do is iterate over a series of wiki URLs and pull out the page links on Wikipedia portal category pages (e.g. https://en.wikipedia.org/wiki/Category:Electronic_design).
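
For a single category page this part is straightforward (a minimal sketch in the same Python 2 style as my script below, using the example category above):

from urllib import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Category:Electronic_design"
soup = BeautifulSoup(urlopen(url).read())
pages_div = soup.find("div", {"id": "mw-pages"})  # the "Pages in category" section
for link in pages_div.find_all("a"):
    print link.get("href"), link.get_text()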

I know that all of the wiki pages I'm going through have a page links section. However, when I try to iterate through them I get this error:

Traceback (most recent call last):
  File "./wiki_parent.py", line 37, in <module>
    cleaned = pages.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'



The first part of the file I read in looks like this:

1 Category:Abrahamic_mythology 
2 Category:Abstraction 
3 Category:Academic_disciplines 
4 Category:Activism 
5 Category:Activists 
6 Category:Actors 
7 Category:Aerobics 
8 Category:Aerospace_engineering 
9 Category:Aesthetics 

and it is stored in the port_ID dictionary like this:

{1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3: 'Category:Academic_disciplines', 4: 'Category:Activism', 5: 'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics', 8: 'Category:Aerospace_engineering', 9: 'Category:Aesthetics', 10: 'Category:Agnosticism', 11: 'Category:Agriculture'...}

The desired output is:

parent_num, page_ID, page_num 

I realize the code is a little hackish, but I'm just trying to get it working:

#!/usr/bin/env python
# Python 2
import os, re, nltk
from bs4 import BeautifulSoup
from urllib import urlopen

url = "https://en.wikipedia.org/wiki/" + 'Category:Furniture'  # (overwritten in the loop below)

rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'

# Build port_ID: number every file whose name looks like 'Category:Name'
reg = re.compile('[\w]+:[\w]+')
number = 1
port_ID = {}
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if reg.match(file):
            port_ID[number] = file
            number += 1

test_file = open('test_file.csv', 'w')

for key, value in port_ID.iteritems():
    # Fetch the category page and grab its "Pages in category" section
    url = "https://en.wikipedia.org/wiki/" + str(value)
    raw = urlopen(url).read()
    soup = BeautifulSoup(raw)
    pages = soup.find("div", {"id": "mw-pages"})
    cleaned = pages.get_text()      # line 37: this is where the error occurs
    cleaned = cleaned.encode('utf-8')
    pages = cleaned.split('\n')
    pages = pages[4:-2]             # trim the section's header/footer lines
    test = port_ID.items()[0]

    page_ID = 1
    for item in pages:
        test_file.write('%s %s %s\n' % (test[0], item, page_ID))
        page_ID += 1
    page_ID = 1

Well, then somewhere in your code, pages is getting bound to None –


You might want to double-check the way you're using soup.find() – nthall


Sorry, I'm a pretty new coder, but what does it mean in this case that pages is bound to None, and is there an easy way to fix it? – jdv12

Answer


You are scraping several pages in a loop, but some of them may not have a <div id="mw-pages"> tag at all. That is why you get the AttributeError on the line:

cleaned = pages.get_text() 

You can guard against it with an if check, like:

if pages: 
    # do stuff 
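
Applied to your loop, the check would look something like this (just a sketch based on your posted code; categories without that section are skipped, and I'm using the loop's own key in place of port_ID.items()[0], which seems to be what you intended):

for key, value in port_ID.iteritems():
    url = "https://en.wikipedia.org/wiki/" + str(value)
    soup = BeautifulSoup(urlopen(url).read())
    pages = soup.find("div", {"id": "mw-pages"})
    if pages is None:
        continue  # no "Pages in category" section on this page; skip it
    lines = pages.get_text().encode('utf-8').split('\n')[4:-2]
    page_ID = 1
    for item in lines:
        test_file.write('%s %s %s\n' % (key, item, page_ID))
        page_ID += 1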

Or you can avoid it with a try-except block, like:

try: 
    cleaned = pages.get_text() 
    # do stuff 
except AttributeError as e: 
    # do something 
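
Of the two, the explicit if check is usually the cleaner option here, since a blanket except AttributeError can also hide unrelated attribute typos inside the try block.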

Thank you so much! – jdv12