2015-09-07

I have the following table on a website which I am extracting with BeautifulSoup. This is the URL (I have also attached a picture of the table).

Ideally I would like to have each company on one row in the CSV, but I am getting them spread over different rows. See the attached picture.


I would like to have it laid out like field "D", but instead I am getting it in A1, A2, A3, ...

This is the code I used for the extraction:

import csv

import requests
from bs4 import BeautifulSoup


def _writeInCSV(text):
    print "Writing in CSV File"
    with open('sara.csv', 'wb') as csvfile:
        #spamwriter = csv.writer(csvfile, delimiter='\t', quotechar='\n', quoting=csv.QUOTE_MINIMAL)
        spamwriter = csv.writer(csvfile, delimiter='\t', quotechar="\n")

        for item in text:
            spamwriter.writerow([item])

read_list = []
initial_list = []

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

#gdata_even = soup.find_all("td", {"class": "ms-rteTableEvenRow-3"})

gdata_even = soup.find_all("td", {"class": "ms-rteTable-default"})

for item in gdata_even:
    print item.text.encode("utf-8")
    initial_list.append(item.text.encode("utf-8"))
    print ""

_writeInCSV(initial_list)

Can someone help?


It would be even better if I could copy the whole table to CSV, but I am struggling with how to do that. – Nant

Answers


The idea here is:

  • read the header cells from the table
  • read all other rows from the table
  • zip each data row's cells with the headers, producing a list of dictionaries
  • use csv.DictWriter() to dump it to CSV

Implementation:

import csv 
from pprint import pprint 

from bs4 import BeautifulSoup 
import requests 

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register" 
soup = BeautifulSoup(requests.get(url).content, "html.parser") 

rows = soup.select("table.ms-rteTable-default tr") 
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("td")] 

data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")])) 
     for row in rows[1:]] 

# see what the data looks like at this point 
pprint(data) 

with open('sara.csv', 'wb') as csvfile:
    spamwriter = csv.DictWriter(csvfile, headers, delimiter='\t', quotechar="\n")

    for row in data:
        spamwriter.writerow(row)
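To see how the zip-with-headers step produces one row per company, here is a minimal, self-contained sketch (in Python 3 syntax; the sample rows are made up and stand in for the parsed `<td>` texts, and the output goes to an in-memory buffer instead of `sara.csv`):

```python
import csv
import io

# Hypothetical sample data standing in for the scraped table cells.
headers = ["Company", "Dividend", "Bonus"]
rows = [
    ["Nigerian Breweries Plc", "N3.50", "Nil"],
    ["Forte Oil Plc", "N2.50", "1 for 5"],
]

# zip() pairs each cell with its column header; dict() turns those
# pairs into one record per company.
records = [dict(zip(headers, row)) for row in rows]

# DictWriter emits one CSV line per record, keyed by the header names.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=headers, delimiter="\t")
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

Each dictionary holds one complete company record, so `DictWriter` writes exactly one line per company rather than one line per cell.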

Since @alecxe has already provided an amazing answer, here is another approach using the pandas library.

import pandas as pd 

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register" 
tables = pd.read_html(url) 

tb1 = tables[0] # Get the first table. 
tb1.columns = tb1.iloc[0] # Assign the first row as header. 
tb1 = tb1.iloc[1:] # Drop the first row. 
tb1.reset_index(drop=True, inplace=True) # Reset the index. 

print tb1.head() # Print first 5 rows. 
# tb1.to_csv("table1.csv") # Export to CSV file. 
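The header-promotion trick above (`tb1.columns = tb1.iloc[0]` followed by dropping row 0) can be tried offline without fetching the URL. A small sketch with a made-up frame standing in for `tables[0]`, since `read_html` returns the header row as ordinary data:

```python
import pandas as pd

# Toy frame standing in for tables[0]: row 0 holds the column names.
raw = pd.DataFrame([
    ["Company", "Dividend"],
    ["Nigerian Breweries Plc", "N3.50"],
    ["Forte Oil Plc", "N2.50"],
])

raw.columns = raw.iloc[0]                  # promote the first row to header
tb = raw.iloc[1:].reset_index(drop=True)   # drop it from the data

print(tb)
```

After these two lines the frame is indexed from 0 again and can be exported with `tb.to_csv(...)` just like `tb1` above.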

Result:

In [5]: runfile('C:/Users/.../.spyder2/temp.py', wdir='C:/Users/.../.spyder2') 
0     Company  Dividend Bonus  Closure of Register \ 
0 Nigerian Breweries Plc   N3.50  Nil 5th - 11th March 2015 
1   Forte Oil Plc   N2.50 1 for 5 1st – 7th April 2015 
2   Nestle Nigeria   N17.50  Nil   27th April 2015 
3  Greif Nigeria Plc  60 kobo  Nil 25th - 27th March 2015 
4  Guaranty Bank Plc N1.50 (final)  Nil   17th March 2015 

0   AGM Date  Payment Date 
0  13th May 2015 14th May 2015 
1 15th April 2015 22nd April 2015 
2  11th May 2015 12th May 2015 
3 28th April 2015  5th May 2015 
4  31st March 2015 31st March 2015 

In [6]: 

I get the error: C:\Python27\python.exe C:/Users/Anant/XetraWebBot/Test/ReadCSV.py Traceback (most recent call last): File "C:/Users/Anant/XetraWebBot/Test/ReadCSV.py", line 4, in <module> tables = pd.read_html(url) AttributeError: 'module' object has no attribute 'read_html' – Nant


Most likely you don't have an up-to-date pandas, or you are missing the html5lib module. Be forewarned: pandas can simplify table scraping, as you can see above, but setting it up can be quite problematic unless you use a distribution like Anaconda (which is what I used for the above). – Manhattan