2017-06-29

I am scraping data from Wikipedia, and so far it works: I can print the results to the terminal, but I cannot write them to a CSV file the way I need. :-/ The code is long, but I am pasting it here anyway in the hope that someone can help me write the data to CSV.

import csv
import requests
from bs4 import BeautifulSoup


def spider():
    url = 'https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9F-_und_Mittelst%C3%A4dte_in_Deutschland'
    code = requests.get(url).text  # read the page source as a unicode string
    soup = BeautifulSoup(code, "lxml")  # create BeautifulSoup object

    table = soup.find(text="Rang").find_parent("table")
    for row in table.find_all("tr")[1:]:
        partial_url = row.find_all('a')[0].attrs['href']
        full_url = "https://de.wikipedia.org" + partial_url
        get_single_item_data(full_url)  # visit each city's own page


def get_single_item_data(item_url):
    page = requests.get(item_url).text  # read the page source as a unicode string
    soup = BeautifulSoup(page, "lxml")  # create BeautifulSoup object

    def getInfoBoxBasisDaten(s):
        # matches the "Basisdaten" header cell of the infobox
        return str(s) == 'Basisdaten' and s.parent.name == 'th'

    basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

    basisdaten_list = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:',
                       'Einwohner:', 'Bevölkerungsdichte:', 'Postleitzahl',
                       'Vorwahl:', 'Kfz-Kennzeichen:', 'Gemeindeschlüssel:',
                       'Stadtgliederung:', 'Adresse', 'Anschrift', 'Webpräsenz:',
                       'Website:', 'Bürgermeister', 'Bürgermeisterin',
                       'Oberbürgermeister', 'Oberbürgermeisterin']

    with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = basisdaten_list
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=';',
                                quotechar='|', quoting=csv.QUOTE_MINIMAL,
                                extrasaction='ignore')
        writer.writeheader()

        for i in basisdaten_list:
            wanted = i
            current = basisdaten.parent.parent.next_sibling
            while True:
                if not current.name:
                    current = current.next_sibling
                    continue
                if wanted in current.text:
                    items = current.find_all('td')
                    print(items[0].get_text())
                    print(items[1].get_text())
                    writer.writerow({i: items[1].get_text()})

                if '<th ' in str(current):
                    break
                current = current.next_sibling


spider()

The output is wrong in two ways. The cells are in their correct places, but only one city is written to the file; all the others are missing. It looks like this:

[screenshot: actual CSV output]

But it should look like this, with all the other cities included:

[screenshot: expected CSV output]

What exactly is wrong with the output? –

I added a screenshot. You can easily test it with the code; it runs on Python 3.6. – saitam

Answer


"...only one city is written...": You call get_single_item_data once per city. Inside that function, the statement with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile: opens the output file in write mode under the same name, so each call overwrites everything the previous call wrote.
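A minimal sketch of that restructuring: open the file (and create the DictWriter) once in the caller, and pass the writer into the per-city function. To keep the sketch self-contained, two hypothetical pre-parsed city dictionaries stand in for the live requests/BeautifulSoup scrape, and the output goes to an in-memory buffer instead of staedte.csv:

```python
import csv
import io


def spider(writer):
    # In the real script this would loop over the Wikipedia table and call
    # get_single_item_data(full_url, writer) for each city; here two
    # hypothetical pre-parsed cities illustrate the structure.
    cities = [
        {'Bundesland': 'Bayern', 'Einwohner:': '1.450.381'},
        {'Bundesland': 'Hessen', 'Einwohner:': '736.414'},
    ]
    for city_data in cities:
        get_single_item_data(city_data, writer)


def get_single_item_data(city_data, writer):
    # No open() here: the writer was created once by the caller,
    # so earlier cities are no longer overwritten.
    writer.writerow(city_data)


fieldnames = ['Bundesland', 'Einwohner:']
out = io.StringIO()  # stands in for open('staedte.csv', 'w', newline='', encoding='utf-8')
writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=';',
                        extrasaction='ignore')
writer.writeheader()
spider(writer)
```

With a real file you would wrap the last five lines in a single with open(...) block in spider's caller; every row then lands in the same file.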

On each variable being written to its own row: in the writer.writerow call you write a single variable's value per row. What you need to do instead is create a dictionary for the values before you start looking them up on the page. As you find values on the page, push them into the dictionary under their field names. Then, once all available values have been collected, call writer.writerow once with the whole dictionary.
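That accumulate-then-write pattern can be sketched like this, with hypothetical label/value pairs standing in for what the infobox loop would actually scrape:

```python
import csv
import io

# Hypothetical scraped (label, value) pairs from one city's infobox;
# the real values would come from the BeautifulSoup loop.
scraped = [('Bundesland', 'Bayern'),
           ('Höhe:', '519 m'),
           ('Einwohner:', '1.450.381')]

fieldnames = ['Bundesland', 'Höhe:', 'Einwohner:']

# Accumulate all values for this city into one dictionary first...
row = {}
for label, value in scraped:
    if label in fieldnames:
        row[label] = value

# ...then write the city as a single CSV row.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=';',
                        extrasaction='ignore')
writer.writeheader()
writer.writerow(row)  # one row per city, after all values are collected
```

Fields that are missing from a given city's infobox simply stay out of the dictionary, and DictWriter leaves those cells empty.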