2016-06-10

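Python - BeautifulSoup - formatting data to write to MySQL

I'm new to Python and to programming, and I'm hoping someone can help. I'd like to know how to format the following so that it can be written to MySQL. The code below: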

from bs4 import BeautifulSoup
import urllib.request

url = "http://www.footballoutsiders.com/stats/qb"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
# Grab the stats table and flatten it to one block of text
table = soup.find('table', attrs={'class': 'stats'})
td = table.get_text()

produces this:

C.Palmer 
ARI 
1,698 
1 
1,755 
1 
34.4% 
1 
36.0% 
82.2 
1 
557 
4,495 
5,310 
35 
2 
2 
11 
64.5% 
16/253 

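I'm struggling to loop through each row and build something in the format '(col1, col2, col3, col4, ...)', which I believe I could then load into MySQL with the following fields.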

Player 
Team 
DYAR 
Rk 
YAR 
Rk 
DVOA 
Rk 
VOA 
QBR 
Rk 
Passes 
Yards 
EYds 
TD 
FK 
FL 
INT 
C% 
DPI 

Answer


You need to get the tr tags and then extract the tds from each one:

from bs4 import BeautifulSoup, Tag 
import urllib.request 

url = "http://www.footballoutsiders.com/stats/qb" 
page = urllib.request.urlopen(url) 
soup = BeautifulSoup(page.read(),"html.parser") 
table = soup.select_one("table.stats") 

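# Header cells from the first row, as plain text
cols = [td.text for td in table.find("tr").find_all("td")]

# One list of cell texts per remaining row; the Tag check skips the bare strings between cells
rows = [[td.text for td in row if isinstance(td, Tag)] for row in table.select("tr + tr")]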
print(cols) 
print(rows) 

which will give you:

['Player', 'Team', 'DYAR', 'Rk', 'YAR', 'Rk', 'DVOA', 'Rk', 'VOA', 'QBR', 'Rk', 'Passes', 'Yards', 'EYds', 'TD', 'FK', 'FL', 'INT', 'C%', 'DPI'] 
[['C.Palmer', 'ARI', '1,698', '1', '1,755', '1', '34.4%', '1', '36.0%', '82.2', '1', '557', '4,495', '5,310', '35', '2', '2', '11', '64.5%', '16/253'], ['T.Brady', 'NE', '1,312', '2', '1,269', '2', '19.5%', '5', '18.5%', '64.4', '11', '660', '4,536', '5,234', '36', '3', '2', '7', '64.6%', '10/233'], ['R.Wilson', 'SEA', '1,190', '3', '1,159', '4', '24.3%', '3', '23.4%', '74.9', '4', '526', '3,744', '4,311', '34', '1', '2', '8', '68.7%', '2/30'], ['A.Dalton', 'CIN', '1,135', '4', '1,059', '6', '31.7%', '2', '28.9%', '73.1', '5', '409', '3,103', '3,676', '26', '2', '2', '7', '66.1%', '5/107'], ['B.Roethlisberger', 'PIT', '1,114', '5', '1,056', '7', '22.1%', '4', '20.4%', '76.9', '2', '489', '3,809', '4,214', '21', '0', '0', '16', '68.2%', '14/297'], ['D.Brees', 'NO', '1,111', '6', '1,184', '3', '15.8%', '7', '17.5%', '75.5', '3', '657', '4,625', '4,838', '32', '2', '2', '11', '68.6%', '1/37'], ['K.Cousins', 'WAS', '1,023', '7', '1,125', '5', '16.9%', '6', '19.7%', '70.1', '6', '570', '3,979', '4,329', '29', '2', '3', '11', '69.8%', '6/112'], ['P.Rivers', 'SD', '847', '8', '780', '9', '7.8%', '11', '6.3%', '59.4', '20', '704', '4,503', '4,755', '29', '2', '2', '13', '66.5%', '8/102'], ['M.Stafford', 'DET', '804', '9', '637', '12', '8.0%', '10', '4.1%', '62.6', '14', '637', '3,979', '4,467', '32', '1', '2', '13', '67.6%', '6/89'], ['J.Cutler', 'CHI', '659', '10', '556', '13', '8.6%', '9', '5.5%', '60.7', '17', '512', '3,499', '3,587', '21', '1', '5', '11', '64.5%', '9/191'], ['C.Newton', 'CAR', '630', '11', '855', '8', '7.6%', '12', '14.3%', '66.0', '9', '526', '3,558', '3,567', '35', '0', '1', '10', '60.0%', '3/64'], ['D.Carr', 'OAK', '582', '12', '428', '17', '4.1%', '13', '0.0%', '49.2', '26', '603', '3,763', '3,871', '32', '3', '1', '13', '61.3%', '8/165'], ['R.Fitzpatrick', 'NYJ', '542', '13', '670', '10', '3.5%', '14', '7.0%', '63.6', '12', '583', '3,779', '3,714', '31', '2', '2', '15', '59.6%', '4/88'], ['Player', 'Team', 'DYAR', 'Rk', 'YAR', 'Rk', 'DVOA', 'Rk', 
'VOA', 'QBR', 'Rk', 'Passes', 'Yards', 'EYds', 'TD', 'FK', 'FL', 'INT', 'C%', 'DPI'], ['T.Taylor', 'BUF', '536', '14', '486', '16', '9.8%', '8', '7.9%', '67.8', '7', '417', '2,810', '2,790', '20', '5', '1', '6', '63.7%', '3/98'], ['A.Smith', 'KC', '468', '15', '359', '19', '3.0%', '15', '-0.3%', '66.5', '8', '516', '3,243', '3,303', '20', '2', '0', '7', '65.5%', '4/90'], ['J.Winston', 'TB', '467', '16', '495', '15', '2.1%', '16', '2.9%', '58.6', '21', '560', '3,828', '3,480', '22', '2', '1', '15', '58.9%', '6/153'], ['A.Rodgers', 'GB', '406', '17', '258', '20', '-1.0%', '17', '-4.7%', '64.9', '10', '617', '3,489', '3,802', '31', '4', '4', '8', '60.9%', '14/364'], ['E.Manning', 'NYG', '404', '18', '535', '14', '-1.9%', '19', '1.1%', '60.5', '18', '646', '4,225', '4,063', '35', '5', '3', '14', '62.9%', '13/192'], ['M.Ryan', 'ATL', '389', '19', '669', '11', '-1.9%', '18', '4.8%', '61.8', '15', '647', '4,372', '3,911', '21', '6', '5', '15', '66.4%', '6/74'], ['B.Hoyer', 'HOU', '201', '20', '372', '18', '-3.0%', '20', '3.9%', '59.6', '19', '394', '2,400', '2,266', '19', '4', '2', '5', '61.0%', '1/10'], ['T.Bridgewater', 'MIN', '187', '21', '93', '25', '-5.1%', '22', '-8.1%', '62.7', '13', '491', '2,923', '2,745', '14', '3', '2', '9', '65.3%', '6/116'], ['B.Osweiler', 'DEN', '153', '22', '140', '22', '-3.2%', '21', '-3.9%', '48.8', '27', '299', '1,813', '1,765', '10', '2', '1', '6', '61.8%', '3/58'], ['J.McCown', 'CLE', '110', '23', '81', '26', '-5.8%', '23', '-7.2%', '53.9', '25', '315', '1,963', '1,804', '12', '2', '4', '4', '63.7%', '1/6'], ['S.Bradford', 'PHI', '107', '24', '221', '21', '-8.2%', '24', '-5.0%', '41.8', '34', '560', '3,512', '3,027', '19', '4', '1', '14', '65.3%', '11/176'], ['B.Bortles', 'JAC', '54', '25', '100', '24', '-9.9%', '25', '-8.8%', '46.4', '30', '657', '4,089', '3,549', '35', '7', '4', '18', '58.9%', '8/145'], ['R.Tannehill', 'MIA', '20', '26', '67', '27', '-10.6%', '27', '-9.4%', '43.2', '32', '632', '3,817', '3,204', '24', '5', '5', '12', 
'62.2%', '7/109'], ['Player', 'Team', 'DYAR', 'Rk', 'YAR', 'Rk', 'DVOA', 'Rk', 'VOA', 'QBR', 'Rk', 'Passes', 'Yards', 'EYds', 'TD', 'FK', 'FL', 'INT', 'C%', 'DPI'], ['J.Flacco', 'BAL', '17', '27', '55', '29', '-10.5%', '26', '-9.1%', '40.9', '35', '428', '2,637', '2,213', '14', '3', '2', '12', '65.2%', '3/31'], ['R.Mallett', '2TM', '-33', '28', '-127', '31', '-13.2%', '29', '-19.2%', '55.1', '22', '248', '1,282', '1,227', '5', '0', '0', '6', '56.7%', '1/11'], ['M.Hasselbeck', 'IND', '-41', '29', '57', '28', '-13.4%', '30', '-8.0%', '55.1', '23', '273', '1,588', '1,424', '9', '1', '2', '5', '61.3%', '5/52'], ['M.Mariota', 'TEN', '-53', '30', '123', '23', '-13.2%', '28', '-6.4%', '61.0', '16', '407', '2,551', '2,017', '19', '3', '6', '10', '62.5%', '10/120'], ['B.Gabbert', 'SF', '-85', '31', '-118', '30', '-15.6%', '31', '-17.4%', '42.6', '33', '308', '1,866', '1,416', '10', '2', '1', '7', '63.1%', '3/28'], ['J.Manziel', 'CLE', '-105', '32', '-179', '32', '-18.4%', '33', '-23.4%', '54.7', '24', '241', '1,350', '1,048', '7', '3', '3', '5', '58.4%', '1/2'], ['A.Luck', 'IND', '-126', '33', '-189', '33', '-17.5%', '32', '-20.6%', '47.6', '28', '306', '1,783', '1,435', '15', '2', '0', '12', '55.9%', '1/18'], ['M.Cassel', 'DAL', '-172', '34', '-210', '34', '-23.7%', '35', '-26.5%', '33.7', '36', '216', '1,190', '866', '5', '2', '0', '6', '58.9%', '4/32'], ['C.Kaepernick', 'SF', '-182', '35', '-249', '35', '-21.5%', '34', '-25.3%', '47.1', '29', '273', '1,446', '1,168', '6', '1', '1', '5', '59.0%', '1/28'], ['P.Manning', 'DEN', '-326', '36', '-317', '36', '-25.8%', '36', '-25.4%', '45.0', '31', '346', '2,156', '1,339', '9', '1', '0', '17', '60.2%', '4/35'], ['N.Foles', 'STL', '-353', '37', '-428', '37', '-27.9%', '37', '-31.4%', '30.0', '37', '350', '1,944', '1,219', '7', '2', '2', '10', '56.9%', '4/109']] 

The cols header row is actually repeated inside the table, so we also need to filter those rows out:

rows = [[td.text for td in row if isinstance(td, Tag)] for row in table.select("tr + tr") if row.find("td").text != "Player"] 

Also be aware that Rk is repeated in cols. To insert everything, all you need is a simple insert statement run over the rows.
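Because Rk appears four times, the header names cannot be used as MySQL column names directly. The following is a minimal sketch of one way to number the repeats. The helper name `unique_columns` is hypothetical and not part of the original answer. Names such as INT and C% would still need renaming or quoting, which this sketch does not attempt.

```python
from collections import Counter


def unique_columns(cols):
    """Return cols with every repeated name numbered: Rk -> Rk1, Rk2, ..."""
    totals = Counter(cols)   # how often each name occurs overall
    seen = Counter()         # how often each name has occurred so far
    result = []
    for name in cols:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}{seen[name]}")
        else:
            result.append(name)
    return result


cols = ['Player', 'Team', 'DYAR', 'Rk', 'YAR', 'Rk', 'DVOA', 'Rk', 'VOA', 'QBR',
        'Rk', 'Passes', 'Yards', 'EYds', 'TD', 'FK', 'FL', 'INT', 'C%', 'DPI']
print(unique_columns(cols))
```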

Once your table has been created, you just need something like:

import mysql.connector

# Connect to the NFL database (replace the placeholder credentials)
cnx = mysql.connector.connect(user='xxxxxxx', database='NFL', password="xxxxxxxxx")
cursor = cnx.cursor()
sql = ("INSERT INTO stats "
       "(Player,Team,DYAR,Rk1,YAR,Rk2,DVOA,Rk3,VOA,QBR,Rk4,Passes,Yards,EYds,TD,FK,FL,I,perc,DPI) "
       "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")

# Insert every scraped row in one call, then commit and clean up
cursor.executemany(sql, rows)
cnx.commit()
cursor.close()
cnx.close()

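Those are just the column names I used. You can use whatever names you like; just pass whichever columns you use and make sure they line up correctly with the values.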
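One way to keep columns and values aligned is to generate the placeholders from the column list and check each row's length before inserting. The sketch below does that and also converts cells such as '1,698' and '34.4%' to numbers. Whether that conversion is wanted depends on the column types of your table, which the answer does not specify. The helper names `build_insert`, `clean_cell` and `prepare_rows` are hypothetical.

```python
import re


def build_insert(table, columns):
    """Build a parameterised INSERT with one %s placeholder per column."""
    col_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO `{table}` ({col_list}) VALUES ({placeholders})"


def clean_cell(text):
    """Turn '1,698' into 1698 and '34.4%' into 34.4; leave other text as is."""
    stripped = text.strip()
    candidate = stripped.replace(",", "").rstrip("%")
    if re.fullmatch(r"-?\d+", candidate):
        return int(candidate)
    if re.fullmatch(r"-?\d+\.\d+", candidate):
        return float(candidate)
    return stripped


def prepare_rows(rows, columns):
    """Check every row has one value per column, then clean each cell."""
    prepared = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} values, got {len(row)}: {row}")
        prepared.append(tuple(clean_cell(cell) for cell in row))
    return prepared


# Column names as used in the INSERT statement above
columns = ["Player", "Team", "DYAR", "Rk1", "YAR", "Rk2", "DVOA", "Rk3", "VOA",
           "QBR", "Rk4", "Passes", "Yards", "EYds", "TD", "FK", "FL", "I", "perc", "DPI"]
# First scraped row, copied from the output above
rows = [['C.Palmer', 'ARI', '1,698', '1', '1,755', '1', '34.4%', '1', '36.0%', '82.2',
         '1', '557', '4,495', '5,310', '35', '2', '2', '11', '64.5%', '16/253']]

sql = build_insert("stats", columns)
prepared = prepare_rows(rows, columns)
print(sql)
print(prepared[0])
```

The resulting `sql` and `prepared` can be passed to `cursor.executemany(sql, prepared)` in place of the hand-written statement and raw rows.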


This works great, thank you very much. One more question to help me understand: what is the purpose of 'Tag' in the import and in the rows definition? – user6450004


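Some of the elements are basically just strings; checking that we have a Tag skips those unwanted elements. If you remove the if check you will see what I mean.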
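To make that concrete, here is a small sketch using a made-up HTML fragment (not the real page). Iterating over a tr yields its direct children, which include the whitespace between the cells as NavigableString objects as well as the td tags themselves.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical fragment: newlines between the cells, as in typical page markup
html = "<table><tr>\n<td>C.Palmer</td>\n<td>ARI</td>\n</tr></table>"
soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# Every direct child of the row, including the whitespace strings
children = list(row)
kinds = [type(child).__name__ for child in children]

# Only the real cells survive the isinstance check
cells = [child.text for child in children if isinstance(child, Tag)]

print(kinds)
print(cells)
```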