2015-09-07

I have the following table on a website which I am extracting with BeautifulSoup. This is the URL (I have also attached a picture of the table).

Ideally I would like to have each company on one row in the CSV, but I am getting them spread over different rows. See the attached picture.


I would like to have it laid out like field "D", but instead I am getting it in A1, A2, A3, ...

This is the code I used for the extraction:

import csv

import requests
from bs4 import BeautifulSoup


def _writeInCSV(text):
    print "Writing in CSV File"
    with open('sara.csv', 'wb') as csvfile:
        #spamwriter = csv.writer(csvfile, delimiter='\t', quotechar='\n', quoting=csv.QUOTE_MINIMAL)
        spamwriter = csv.writer(csvfile, delimiter='\t', quotechar="\n")

        for item in text:
            spamwriter.writerow([item])

read_list = []
initial_list = []

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

#gdata_even = soup.find_all("td", {"class": "ms-rteTableEvenRow-3"})

gdata_even = soup.find_all("td", {"class": "ms-rteTable-default"})

for item in gdata_even:
    print item.text.encode("utf-8")
    initial_list.append(item.text.encode("utf-8"))
    print ""

_writeInCSV(initial_list)

Can someone help?


It would be even better if I could copy the whole table to CSV, but I am struggling with how to do that. – Nant

Answers


The idea here is:

  • read the header cells from the table
  • read all other rows from the table
  • zip each data row's cells with the headers, producing a list of dictionaries
  • use csv.DictWriter() to dump it to CSV

Implementation:

import csv 
from pprint import pprint 

from bs4 import BeautifulSoup 
import requests 

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register" 
soup = BeautifulSoup(requests.get(url).content, "html.parser") 

rows = soup.select("table.ms-rteTable-default tr") 
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("td")] 

data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")])) 
     for row in rows[1:]] 

# see what the data looks like at this point 
pprint(data) 

with open('sara.csv', 'wb') as csvfile:
    spamwriter = csv.DictWriter(csvfile, headers, delimiter='\t', quotechar="\n")

    for row in data:
        spamwriter.writerow(row)
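To see how the zip-with-headers step produces one row per company, here is a minimal, self-contained sketch (in Python 3 syntax; the sample rows are made up and stand in for the parsed `<td>` texts, and the output goes to an in-memory buffer instead of `sara.csv`):

```python
import csv
import io

# Hypothetical sample data standing in for the scraped table cells.
headers = ["Company", "Dividend", "Bonus"]
rows = [
    ["Nigerian Breweries Plc", "N3.50", "Nil"],
    ["Forte Oil Plc", "N2.50", "1 for 5"],
]

# zip() pairs each cell with its column header; dict() turns those
# pairs into one record per company.
records = [dict(zip(headers, row)) for row in rows]

# DictWriter emits one CSV line per record, keyed by the header names.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=headers, delimiter="\t")
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

Each dictionary holds one complete company record, so `DictWriter` writes exactly one line per company rather than one line per cell.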

Since @alecxe has already provided an amazing answer, here is another approach using the pandas library.

import pandas as pd 

url = "http://www.nse.com.ng/Issuers-section/corporate-disclosures/corporate-actions/closure-of-register" 
tables = pd.read_html(url) 

tb1 = tables[0] # Get the first table. 
tb1.columns = tb1.iloc[0] # Assign the first row as header. 
tb1 = tb1.iloc[1:] # Drop the first row. 
tb1.reset_index(drop=True, inplace=True) # Reset the index. 

print tb1.head() # Print first 5 rows. 
# tb1.to_csv("table1.csv") # Export to CSV file. 
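The header-promotion trick above (`tb1.columns = tb1.iloc[0]` followed by dropping row 0) can be tried offline without fetching the URL. A small sketch with a made-up frame standing in for `tables[0]`, since `read_html` returns the header row as ordinary data:

```python
import pandas as pd

# Toy frame standing in for tables[0]: row 0 holds the column names.
raw = pd.DataFrame([
    ["Company", "Dividend"],
    ["Nigerian Breweries Plc", "N3.50"],
    ["Forte Oil Plc", "N2.50"],
])

raw.columns = raw.iloc[0]                  # promote the first row to header
tb = raw.iloc[1:].reset_index(drop=True)   # drop it from the data

print(tb)
```

After these two lines the frame is indexed from 0 again and can be exported with `tb.to_csv(...)` just like `tb1` above.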

Result:

In [5]: runfile('C:/Users/.../.spyder2/temp.py', wdir='C:/Users/.../.spyder2') 
0     Company  Dividend Bonus  Closure of Register \ 
0 Nigerian Breweries Plc   N3.50  Nil 5th - 11th March 2015 
1   Forte Oil Plc   N2.50 1 for 5 1st – 7th April 2015 
2   Nestle Nigeria   N17.50  Nil   27th April 2015 
3  Greif Nigeria Plc  60 kobo  Nil 25th - 27th March 2015 
4  Guaranty Bank Plc N1.50 (final)  Nil   17th March 2015 

0   AGM Date  Payment Date 
0  13th May 2015 14th May 2015 
1 15th April 2015 22nd April 2015 
2  11th May 2015 12th May 2015 
3 28th April 2015  5th May 2015 
4  31st March 2015 31st March 2015 

In [6]: 

I get the error: C:\Python27\python.exe C:/Users/Anant/XetraWebBot/Test/ReadCSV.py Traceback (most recent call last): File "C:/Users/Anant/XetraWebBot/Test/ReadCSV.py", line 4, in <module> tables = pd.read_html(url) AttributeError: 'module' object has no attribute 'read_html' – Nant


Most likely you don't have an up-to-date pandas, or you are missing the html5lib module. Be forewarned: pandas can simplify table scraping, as you can see above, but setting it up can be quite problematic unless you use a distribution like Anaconda (which is what I used for the above). – Manhattan