2014-09-03 51 views
0

Python: can urlparse be used to parse the domain from a cgi-bin URL?

I have the following input string:

/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278 

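I want to get the domain with urlparse(), but in this case the netloc attribute comes back as an empty string.

How can I extract the domain (ideally without the www)?

Desired output: some-super-domain.de

Please note: sometimes the input string above contains no www.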

+0

What is your expected output? – 2014-09-03 09:07:34

+0

some-super-domain.de – nottinhill 2014-09-03 09:22:22

Answers

0
www\.(.*?)\/ 

This works. See the demo.

http://regex101.com/r/pP3pN1/18

+0

It works on your demo page, but not in my script, e.g.: domain = re.match('^www\.(.*?)\/$', line) – nottinhill 2014-09-03 09:23:05

+1

@SirBenBenji You need to make the regex a raw string: domain = re.match(r'^www\.(.*?)\/', line), and remove the $ at the end. – Jerry 2014-09-03 09:26:03

+1

This works: re.findall(r"www\.(.*?)\/", x) – vks 2014-09-03 09:27:31
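
The behaviour discussed in these comments comes down to where each function looks: re.match only tries the pattern at the very start of the string, while re.search and re.findall scan the whole string. A minimal sketch in Python 3, using the input string from the question (the names `line` and `pattern` are only illustrative):

```python
import re

line = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de'
        '/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')

pattern = r'www\.(.*?)/'

# re.match anchors at position 0; the string starts with '/cgi-bin', so nothing matches.
anchored = re.match(pattern, line)

# re.search moves forward through the string until the pattern matches somewhere.
found = re.search(pattern, line)
domain = found.group(1) if found else None

print(anchored)
print(domain)
```

This also suggests why adding ^ and $ to the pattern makes things worse here: the host sits in the middle of the string, not at either end.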

1

I don't think urlparse gives you what you want here. You can use this instead:

import re
m = re.search(r'(?<=www\.)[a-zA-Z\-]+\.[a-zA-Z]+', s)
print(m.group(0))

Result:

some-super-domain.de 

Try it HERE

If you use urlparse, the result looks like this:

s='/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278' 

from urlparse import urlparse  # Python 2; on Python 3 use: from urllib.parse import urlparse
o = urlparse(s)
print(o)

Result:

ParseResult(scheme='', netloc='', path='/cgi-bin/ivw/CP/dbb_ug_sp', params='', query='r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278', fragment='') 

So with this result you can reach the domain through o.query, but that is not what you want, because it contains extra characters!

>>> print(o.query)
r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278
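
The query string shown above can be decoded further instead of being searched with a regex. The following is a minimal sketch in Python 3, not part of the original answer: it assumes the embedded URL always arrives in the `r` parameter, and the helper name `embedded_domain` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def embedded_domain(cgi_url, param='r'):
    """Return the host of the URL carried in query parameter `param`, without a leading 'www.'."""
    query = urlparse(cgi_url).query
    # parse_qs splits on '&' first and then decodes the percent-escapes in each value.
    values = parse_qs(query).get(param)
    if not values:
        return None
    host = urlparse(values[0]).hostname
    if host is None:
        return None
    # Strip the prefix only when it is actually present.
    return host[4:] if host.startswith('www.') else host

s = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de'
     '/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')

print(embedded_domain(s))
```

Because the prefix is checked rather than assumed, the same call should also cover inputs where the embedded URL has no www.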
1

Try this code, it works fine:

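from urlparse import urlparse  # Python 2; on Python 3 use urllib.parse
import urllib

url = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'
# Keep only the embedded URL, which starts at the first 'http'
url = url[url.find('http'):]
# Decode the percent-escapes (%3A becomes ':', and so on)
url = urllib.unquote(url).decode('utf8')
result = urlparse(url)
domain = result.netloc
# str.find() returns -1, never None, when nothing is found, so test the prefix directly
if domain.startswith('www.'):
    domain = domain[4:]
print(domain)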
+0

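Awesome! Exactly what I was looking for. Any idea how to handle the case where this error occurs: return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 50: invalid continuation byte – nottinhill 2014-09-03 10:01:12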
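
One possible way to handle that error: it means the percent-decoded bytes are not valid UTF-8. Byte 0xe4 is "ä" in Latin-1, which hints (but does not prove) that the URL was encoded as Latin-1. A minimal sketch in Python 3 that tries UTF-8 first and falls back to Latin-1; the helper name `unquote_lenient` and the sample URL are made up for illustration:

```python
from urllib.parse import unquote_to_bytes

def unquote_lenient(quoted, fallback='latin-1'):
    """Percent-decode `quoted`, preferring UTF-8 and falling back to another codec."""
    raw = unquote_to_bytes(quoted)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this decode cannot fail.
        return raw.decode(fallback)

# Hypothetical input: %E4 is 'ä' in Latin-1 but an invalid sequence in UTF-8.
sample = 'http%3A//www.some-super-domain.de/forum/search.php%3Fq%3Dm%E4dchen'

print(unquote_lenient(sample))
```

The fallback is a guess about the sender's encoding. If the real encoding is unknown, the decoded text may be wrong outside the host part, although the host itself is usually plain ASCII.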

0

You can try the code below, which uses a variable-length lookbehind:

>>> import regex  # third-party module; the standard re module rejects variable-length lookbehind
>>> s = "/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278"
>>> m = regex.search(r'(?<=https?[^/]*//www\.)[^/]*', s).group() 
>>> m 
'some-super-domain.de' 

OR

>>> import re
>>> m = re.search(r'(?<=www\.)[^/]*', s).group()
>>> m 
'some-super-domain.de' 
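
Both patterns above require "www." to be present, while the question notes that it is sometimes missing. A minimal sketch in Python 3 that makes the prefix optional, assuming the "//" after the scheme is never percent-encoded in the input (the names `HOST_RE` and `host_without_www` are only illustrative):

```python
import re

# '//' marks the start of the authority part; 'www.' is optional; the host ends at '/', '?' or '#'.
HOST_RE = re.compile(r'//(?:www\.)?([^/?#]+)')

def host_without_www(text):
    """Return the first host found after '//' in `text`, minus any leading 'www.'."""
    match = HOST_RE.search(text)
    return match.group(1) if match else None

with_www = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de'
            '/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')
without_www = with_www.replace('www.', '')

print(host_without_www(with_www))
print(host_without_www(without_www))
```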
0
import urlparse 
import urllib 

HTTP_PREFIX = 'http://' 
URI = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278' 

# Unquote the HTTP quoted URI 
unquoted_uri = urllib.unquote(URI) 

# Parse the URI to get just the URL in the query 
queryurl = HTTP_PREFIX + unquoted_uri.split(HTTP_PREFIX)[-1] 

# netloc is the hostname you were looking for (here it still includes the 'www.' prefix)
parsed_hostname = urlparse.urlparse(queryurl).netloc