2014-09-19 147 views

How do I correctly encode possibly Chinese text in Python?

I am scraping the following link:

http://www.footballcornersta.com/en/league.php?select=all&league=%E8%8B%B1%E8%B6%85&year=2014&month=1&Submit=Submit 

The page source contains the following string, which lists all the available options for the league menu:

ls_main = [['E','ENG PR','英超'],['E','ENG FAC','英足总杯'],['E','ENG Champ','英冠'],['E','ENG D1','英甲'],['I','ITA D1','意甲'],['I','ITA D2','意乙'],['S','SPA D1','西甲'],['S','SPA D2','西乙'],['G','GER D1','德甲'],['G','GER D2','德乙'],['F','FRA D1','法甲'],['F','FRA D2','法乙'],['S','SCO PR','苏超'],['R','RUS PR','俄超'],['T','TUR PR','土超'],['B','BRA D1','巴西甲'],['U','USA MLS','美职联'],['A','ARG D1','阿根甲'],['J','JP D1','日职业'],['J','JP D2','日职乙'],['A','AUS D1','澳A联'],['K','KOR D1','韩K联'],['C','CHN PR','中超'],['E','EURO Cup','欧洲杯'],['I','Italy Supe','意超杯'],['K','KOR K3','K3联'],['C','CHN D1','中甲'],['D','DEN D2-E','丹乙东'],['D','DEN D2-W','丹乙西'],['D','DEN D1','丹甲'],['D','DEN PR','丹超'],['U','UKR U21','乌克兰U21'],['U','UD2','乌克甲'],['U','UKR D1','乌克超'],['U','Uzber D1','乌兹超'],['U','URU D1','乌拉甲'],['U','UZB D2','乌茲甲'],['I','ISR D2','以色列乙'],['I','ISR D1','以色列甲'],['I','ISR PR','以色列超'],['I','Iraq L','伊拉联'],['I','Ira D1','伊朗甲'],['I','IRA P','伊朗联'],['R','RUS D2C','俄乙中'],['R','RUS D2U','俄乙乌'],['R','RUS D2S','俄乙南'],['R','RUS D2W','俄乙西'],['R','RUS RL','俄后赛'],['R','RUS D1','俄甲'],['R','RUS PR','俄超'],['B','BUL D1','保甲'],['C','CRO D1','克甲'],['I','ICE PR','冰岛超'],['G','GHA PL','加纳超'],['H','Hun U19','匈U19'],['H','HUN D2E','匈乙东'],['H','HUN D2W','匈乙西'],['H','HUN D1','匈甲'],['N','NIR IFAC','北爱冠'],['N','NIRE PR','北爱超'],['S','SAfrica D1','南非甲'],['S','SAfrica NSLP','南非超'],['L','LUX D1','卢森甲'],['I','IDN PR','印尼超'],['I','IND D1','印度甲'],['G','GUAT D1','危地甲'],['E','ECU D1','厄甲'],['F','Friendly','友谊赛'],['K','KAZ D1','哈萨超'],['C','COL D2','哥伦乙'],['C','COL C','哥伦杯'],['C','COL D1','哥伦甲'],['C','COS D1','哥斯甲'],['T','TUR U23','土A2青'],['T','TUR D3L1','土丙1'],['T','TUR D3L2','土丙2'],['T','TUR D3L3','土丙3'],['T','TUR2BK','土乙白'],['T','TUR2BB','土乙红'],['T','TUR D1','土甲'],['E','EGY PR','埃及超'],['S','Serbia D2','塞尔乙'],['S','Serbia 1','塞尔联'],['C','CYP D2','塞浦乙'],['C','CYP D1','塞浦甲'],['M','MEX U20','墨西U20'],['M','Mex D2','墨西乙'],['M','MEX D1','墨西联'],['A','AUT D3E','奥丙东'],['A','AUT D3C','奥丙中'],['A','AUT D3W','奥丙西'],['A','AUT D2','奥乙'],['A','AUT 
D1','奥甲'],['V','VEN D1','委超'],['W','WAL D2','威甲'],['W','WAL D2CA','威联盟'],['W','WAL D1','威超'],['A','Ang D1','安哥甲'],['N','NIG P','尼日超'],['P','PAR D1','巴拉甲'],['B','BRA D2','巴西乙'],['B','BRA CP','巴锦赛'],['G','GRE D3N','希丙北'],['G','GRE D3S','希丙南'],['G','GRE D2','希乙'],['G','GRE D1','希甲'],['G','GER U17','德U17'],['G','GER U19','德U19'],['G','GER D3','德丙'],['G','GER RN','德北联'],['G','GER RS','德南联'],['G','GER RW','德西联'],['I','ITA D3A','意丙A'],['I','ITA D3B','意丙B'],['I','ITA D3C1','意丙C1'],['I','ITA D3C2','意丙C2'],['I','ITA CP U20','意青U20'],['E','EST D3','愛沙丙'],['N','NOR D2-A','挪乙A'],['N','NOR D2-B','挪乙B'],['N','NOR D2-C','挪乙C'],['N','NOR D2-D','挪乙D'],['N','NORC','挪威杯'],['N','NOR D1','挪甲'],['N','NOR PR','挪超'],['C','CZE D3','捷丙'],['C','CZE MSFL','捷丙M'],['C','CZE D2','捷乙'],['C','CZE U19','捷克U19'],['C','CZE D1','捷克甲'],['M','Mol D2','摩尔乙'],['M','MOL D1','摩尔甲'],['M','MOR D2','摩洛哥乙'],['M','MOR D1','摩洛超'],['S','Slovakia D3E','斯丙東'],['S','Slovakia D3W','斯丙西'],['S','Slovakia D2','斯伐乙'],['S','Slovakia D1','斯伐甲'],['S','Slovenia D1','斯洛甲'],['S','SIN D1','新加联'],['J','JL3','日丙联'],['C','CHI D2','智乙'],['C','CHI D1','智甲'],['G','Geo','格鲁甲'],['G','GEO PR','格鲁超'],['U','UEFA CL','欧冠杯'],['U','UEFA SC','欧霸杯'],['B','BEL D3A','比丙A'],['B','BEL D3B','比丙B'],['B','BEL D2','比乙'],['B','BEL W1','比女甲'],['B','BEL C','比杯'],['B','BEL D1','比甲'],['S','SAU D2','沙地甲'],['S','SAU D1','沙地联'],['F','FRA D4A','法丁A'],['F','FRA D4B','法丁B'],['F','FRA D4C','法丁C'],['F','FRA D4D','法丁D'],['F','FRA D3','法丙'],['F','FRA U19','法国U19'],['F','FRA C','法国杯'],['P','POL D2E','波乙東'],['P','POL D2W','波乙西'],['P','POL D2','波兰乙'],['P','POL D1','波兰甲'],['B','BOS D1','波斯甲'],['P','POL YL','波青联'],['T','THA D1','泰甲'],['T','THA PL','泰超'],['H','HON D1','洪都甲'],['A','Aus BP','澳布超'],['E','EST D1','爱沙甲'],['I','IRE D1','爱甲'],['I','IRE PR','爱超'],['B','BOL D1','玻利甲'],['F','Friendly','球会赛'],['S','SWI D1','瑞士甲'],['S','SWI PR','瑞士超'],['S','SWE D2','瑞甲'],['S','SWE D1','瑞超'],['B','BLR D2','白俄甲'],['B','BLR D1','白俄超'],['P','Peru D1','秘鲁甲'],['T','TUN D2','突尼乙'],['T','Tun 
D1','突尼甲'],['R','ROM D2G1','罗乙1'],['R','ROM D2G2','罗乙2'],['R','ROM D1','罗甲'],['L','LIBERT C','自由杯'],['F','FIN D2','芬甲'],['F','FIN D1','芬超'],['S','SCO D3','苏丙'],['S','SUD PL','苏丹超'],['S','SCO D2','苏乙'],['S','SCO D1','苏甲'],['S','SCO HL','苏高联'],['E','ENG D2','英乙'],['E','ENG RyPR','英依超'],['E','ENG UP','英北超'],['E','ENG SP','英南超'],['E','ENG Trophy','英挑杯'],['E','ENG Con','英非'],['E','ENG CN','英非北'],['H','HOL D2','荷乙'],['H','HOL Yl','荷青甲'],['S','SV D1','萨尔超'],['P','POR U19','葡U19'],['P','POR D1','葡甲'],['P','POR PR','葡超'],['S','SPA D3B1','西丙1'],['S','SPA D3B2','西丙2'],['S','SPA D3B3','西丙3'],['S','SPA D3B4','西丙4'],['S','SPA Futsal','西內足'],['S','SPA W1','西女超'],['B','BRA CC','里州赛'],['A','Arg D2M1','阿乙M1'],['A','Arg D2M2','阿乙M2'],['A','Arg D2M3','阿乙M3'],['A','ALG D2','阿及乙'],['A','ALG D1','阿及甲'],['A','AZE D1','阿塞甲'],['A','ALB D1','阿巴超'],['A','ARG D2','阿根乙'],['U','UAE D2','阿联乙'],['K','KOR NL','韩联盟'],['F','FYRM D2','马其乙'],['M','MacedoniaFyr','马其甲'],['M','MAS D1','马来超'],['M','MON D2','黑山乙'],['M','MON D1','黑山甲'],['F','FCWC','世冠杯'],['W','World Cup','世界杯'],['F','FIFAWYC','世青杯'],['C','CWPL','中女超'],['C','CFC','中足协杯'],['D','DEN C','丹麦杯'],['A','Asia CL','亚冠杯'],['A','AFC','亚洲杯'],['R','Rus Cup','俄罗斯杯'],['H','HUN C','匈杯'],['N','NIR C','北爱杯'],['T','TUR C','土杯'],['T','Tenno Hai','天皇杯'],['W','WWC','女世杯'],['I','ITA Cup','意杯'],['G','GER C','德国杯'],['J','JPN LC','日联杯'],['S','SCO FAC','苏足总杯'],['E','ENG JPT','英锦赛'],['E','ENG FAC','足总杯'],['C','CAF NC','非洲杯'],['K','K-LC','韩联杯'],['H','HK D1','香港甲']]; 

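The link to the page I scrape contains the third item of each entry (the Chinese league name), but when I copy the link it turns into the percent-encoded form shown above.

I am not sure which encoding is involved.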

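import re 

html = 'source of page'  # placeholder for the downloaded page source
# Grab the ls_main array literal up to the first semicolon.
matches = re.findall(r'ls_main = \[\[.*?;', html)[0] 
# Placeholder codec: the page's actual encoding is the open question.
matches = matches.decode('unknown encoding').encode('utf-8') 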

How do I put the original characters into the link string?

I am using Python 2.7.

Which version of Python are you using? – ashwinjv 2014-09-19 01:02:56

The URL already contains UTF-8 encoded text. Or did I misunderstand your question? – 2014-09-19 01:03:40

Answers


Percent-encoding (%XX) can be done with urllib.quote:

>>> import urllib 
>>> urllib.quote('英冠') 
'%E8%8B%B1%E5%86%A0' 

>>> urllib.quote(u'英冠'.encode('utf-8')) # with explicit utf-8 encoding. 
'%E8%8B%B1%E5%86%A0' 

To get the original string back, use urllib.unquote:

>>> urllib.unquote('%E8%8B%B1%E5%86%A0') 
'\xe8\x8b\xb1\xe5\x86\xa0' 
>>> print(urllib.unquote('%E8%8B%B1%E5%86%A0')) 
英冠 
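
The '\xe8\x8b\xb1\xe5\x86\xa0' result above is a byte string in Python 2; it only looks right when printed because the terminal interprets those bytes as UTF-8. If a real text (unicode) object is needed, the bytes can be decoded explicitly. Here is a minimal sketch, assuming the percent-encoded value is UTF-8 (as it is in the URL from the question). The helper name percent_decode_to_text is hypothetical, and the import fallback is only there so the same snippet runs on both Python 2 and Python 3:

```python
try:
    from urllib import unquote          # Python 2
except ImportError:
    from urllib.parse import unquote    # Python 3


def percent_decode_to_text(encoded, encoding='utf-8'):
    """Percent-decode `encoded` and return text (unicode), not raw bytes.

    Hypothetical helper; assumes the escaped bytes are in `encoding`.
    """
    raw = unquote(encoded)
    # On Python 2, unquote() of a byte string returns a byte string,
    # so decode it. On Python 3 the result is already str.
    if isinstance(raw, bytes):
        raw = raw.decode(encoding)
    return raw


league = percent_decode_to_text('%E8%8B%B1%E5%86%A0')
```

Here league should be the two-character text string u'\u82f1\u51a0' (英冠) on either Python version.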

In Python 3.x, use urllib.parse.quote and urllib.parse.unquote:

>>> import urllib.parse 
>>> urllib.parse.quote('英冠', encoding='utf-8') 
'%E8%8B%B1%E5%86%A0' 
>>> urllib.parse.unquote('%E8%8B%B1%E5%86%A0', encoding='utf-8') 
'英冠'
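
Putting the pieces together for the scraping task in the question, the following is a rough Python 3 sketch rather than a definitive solution. It assumes the page is served as UTF-8 and that the ls_main array happens to be a valid Python literal (single-quoted strings, no line breaks inside them). The query parameter names are copied from the URL in the question; the function names parse_leagues and league_url, and the short sample string, are made up for illustration.

```python
import ast
import re
from urllib.parse import urlencode

BASE_URL = 'http://www.footballcornersta.com/en/league.php'


def parse_leagues(html_bytes, encoding='utf-8'):
    """Return the ls_main entries as a list of [letter, code, name] lists.

    Assumption: the page bytes are in `encoding` (UTF-8 by default).
    """
    html = html_bytes.decode(encoding)
    match = re.search(r'ls_main\s*=\s*(\[\[.*?\]\])\s*;', html, re.DOTALL)
    if match is None:
        raise ValueError('ls_main not found in page source')
    # Assumption: the JavaScript array is also a valid Python literal.
    return ast.literal_eval(match.group(1))


def league_url(name, year, month):
    """Build the league URL; urlencode percent-encodes `name` as UTF-8."""
    params = [('select', 'all'), ('league', name), ('year', year),
              ('month', month), ('Submit', 'Submit')]
    return BASE_URL + '?' + urlencode(params)


# Tiny stand-in for the downloaded page source (made-up sample).
sample = "var ls_main = [['E','ENG PR','英超'],['E','ENG Champ','英冠']];".encode('utf-8')
leagues = parse_leagues(sample)
urls = [league_url(name, 2014, 1) for _, _, name in leagues]
```

With this sample, urls[0] should match the link given in the question. On Python 2.7 the same idea applies, but the league name would need to be an explicitly UTF-8 encoded byte string before it is passed to urllib.urlencode or urllib.quote.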