2016-12-14 82 views
-2

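I wrote a Python 3 program with bs4 that successfully fetches the subcategories of a Wikipedia category. I can see the results when I print them, but I cannot write them to a file. How do I write the same output that I print to a file?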

from bs4 import BeautifulSoup 
import requests 
import csv 

url = 'https://en.wikipedia.org/wiki/Category:proprietary software' 
content = requests.get(url).content 
soup = BeautifulSoup(content,'lxml') 
noOFsubcategories = soup.find('p') 
print('------------------------------------------------------------------') 
print(noOFsubcategories.text+'------------------------------------------------------------------') 
tag = soup.find('div', {'class' : 'mw-category'}) 
links = tag.findAll('a') 
#print(links) 

counter = 1 
for link in links: 
    print (str(counter) + " " + link.text) 
    counter = counter + 1 

with open('subcategories.csv', 'a') as f: 
    f.write(links) 
+0

Could you describe the problem more precisely? What goes wrong? What do you expect? – jonrsharpe

+0

When I run the code above in Python 3, the output file is empty. That is why I posted this question. –

+0

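I solved the problem with the answers below, and as a learner I have taken note. Sorry for my poor English; I rarely type in anything other than my own language. –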
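
A note on why the file ends up empty: `f.write()` accepts only a string, and `links` is a list-like result set, so the call most likely raises a `TypeError` after `open()` has already created the file. The sketch below reproduces this with a plain list standing in for the scraped links; the sample data and the temporary path are assumptions for illustration only.

```python
import os
import tempfile

# Hypothetical stand-in for the scraped links (the real object is a bs4 result set).
links = ["Freeware", "Shareware"]

# Write into a temporary directory so the example leaves no stray files behind.
path = os.path.join(tempfile.mkdtemp(), "subcategories.csv")

error = None
try:
    with open(path, "a") as f:
        f.write(links)  # write() accepts only str, so a list raises TypeError
except TypeError as exc:
    error = exc

print(type(error).__name__)   # name of the exception that was raised
print(os.path.getsize(path))  # size of the file that open() created
```

The file exists because `open()` succeeded, but nothing was written to it before the exception.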

Answers

2

Just a small change: move the write inside the loop, so that each iteration writes one link to the file.

counter = 1 
for link in links: 
    print (str(counter) + " " + link.text) 
    counter = counter + 1 
    with open('subcategories.csv', 'a') as f: 
        f.write(link['href'].split(':')[1]+'\n') 

Output (when the full `link['href']` is written, without the `split`):

/wiki/Category:Formerly_proprietary_software 
/wiki/Category:Freeware 
/wiki/Category:Oracle_software 
/wiki/Category:Proprietary_cross-platform_software 
/wiki/Category:Proprietary_database_management_systems 
/wiki/Category:Proprietary_operating_systems 
/wiki/Category:Proprietary_version_control_systems 
/wiki/Category:Proprietary_wiki_software 
/wiki/Category:Shareware 
/wiki/Category:VMware 
/wiki/Category:Warez 

Better:

# do not need to open file in each loop, just put it above loop 
counter = 1 
with open('subcategories.csv', 'a') as f: 
    for link in links: 
        print(str(counter) + " " + link.text) 
        counter = counter + 1 
        f.write(link['href']+'\n') 
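
One thing to keep in mind with mode `'a'`: every run of the script appends to the existing file, so running it twice leaves duplicate lines. If a fresh file per run is wanted, mode `'w'` is an alternative. A minimal sketch of the difference, assuming hypothetical sample names and a temporary path (the helper functions `save` and `line_count` are made up for this illustration):

```python
import os
import tempfile

names = ["Freeware", "Shareware"]  # hypothetical sample data
path = os.path.join(tempfile.mkdtemp(), "subcategories.csv")


def save(mode):
    """Write every name on its own line, opening the file once in the given mode."""
    with open(path, mode) as f:
        for name in names:
            f.write(name + "\n")


def line_count():
    """Return the number of lines currently in the file."""
    with open(path) as f:
        return sum(1 for _ in f)


save("a")
save("a")                   # a second run in append mode adds the lines again
appended = line_count()

save("w")                   # write mode truncates the file first
overwritten = line_count()

print(appended, overwritten)
```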
+0

How can I get the output above without the "/wiki/Category:" prefix? –

+0

I have updated my code. –

+0

Wow! How simple! I used this to clean up the data: filedata = filedata.replace('/wiki/Category:', ''). Thanks a lot, brother! –
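
For reference, a slightly more defensive way to drop the prefix is to remove it only when it appears at the start of the string, whereas `replace()` removes every occurrence anywhere in the text. This is only a sketch: the `hrefs` list is sample data and `strip_prefix` is a hypothetical helper, not part of bs4.

```python
# Sample hrefs in the shape shown in the output above.
hrefs = [
    "/wiki/Category:Freeware",
    "/wiki/Category:Proprietary_wiki_software",
]

PREFIX = "/wiki/Category:"


def strip_prefix(href, prefix=PREFIX):
    """Return href without the leading prefix; leave other strings untouched."""
    if href.startswith(prefix):
        return href[len(prefix):]
    return href


names = [strip_prefix(h) for h in hrefs]
print(names)  # ['Freeware', 'Proprietary_wiki_software']
```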

0

First build a list of [index, link text] rows, then use csv.writer to write them to the CSV file. Note the use of enumerate() below:

links = [[index, a.get_text()] for index, a in enumerate(tag.find_all('a'), start=1)] 

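# newline='' is what the csv module documentation recommends for files passed to csv.writer 
with open('subcategories.csv', 'a', newline='') as f: 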
    writer = csv.writer(f) 
    writer.writerows(links) 

You can also improve how you locate the subcategories by using a single CSS selector:

soup.select("div.mw-category a") 

The complete code I am running:

import csv 

from bs4 import BeautifulSoup 
import requests 


url = 'https://en.wikipedia.org/wiki/Category:proprietary software' 
content = requests.get(url).content 
soup = BeautifulSoup(content, 'lxml') 
noOFsubcategories = soup.find('p') 

tag = soup.find('div', {'class': 'mw-category'}) 

links = [[index, a.get_text()] for index, a in enumerate(tag.find_all('a'), start=1)] 

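# newline='' is what the csv module documentation recommends for files passed to csv.writer 
with open('subcategories.csv', 'a', newline='') as f: 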
    writer = csv.writer(f) 
    writer.writerows(links) 

The contents of subcategories.csv after running this code:

1,Formerly free software 
2,Formerly proprietary software 
3,Freeware 
4,Oracle software 
5,Proprietary cross-platform software 
6,Proprietary database management systems 
7,Proprietary operating systems 
8,Proprietary version control systems 
9,Proprietary wiki software 
10,Shareware 
11,VMware 
12,Warez 
+0

But the first line of my output file looks like this: "F,o,r,m,e,r,l,y,,f,r,e,e,s,o,f,t,w,a,r,e". How can I avoid the commas? –

+0

@info-farmer Are you sure you used `writerows()`? – alecxe

+0

@info-farmer I have updated the answer with the complete code I am running. Hope that helps. – alecxe
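
The thread does not confirm what caused the comma-separated letters, but one plausible cause is passing a bare string to `writerow()`: the csv writer iterates over whatever it receives as a row, so a string is split into one field per character. A small sketch, using an in-memory buffer and made-up sample rows instead of a real file:

```python
import csv
import io

# Hypothetical rows in the same [index, text] shape as the answer above.
rows = [[1, "Formerly free software"], [2, "Freeware"]]

# writerows() takes an iterable of rows; each row is a sequence of fields.
good = io.StringIO()
csv.writer(good).writerows(rows)

# writerow() with a bare string treats every character as a separate field.
bad = io.StringIO()
csv.writer(bad).writerow("Freeware")

print(good.getvalue())
print(bad.getvalue())
```

Wrapping the string in a list, as in `writer.writerow([text])`, keeps it as a single field.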
