2016-08-14 82 views
1

Extracting the domain name from a URL

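Yet another request about parsing URLs, but the examples I have found are mostly incomplete or theoretical. I would like something that definitely works in Perl.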

I have the following URLs:

https://vimdoc.sourceforge.net/htmldoc/pattern.html 
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html 
http://www.catonmat.net/download/perl1line.txt 
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet 
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM 
http://www.gnu.org/software/coreutils/manual/coreutils.html 
http://www.catonmat.net/download/perl1line.txt 
https://feedly.com/i/my 
http://vimhelp.appspot.com/ 
https://git-scm.com/doc 
https://read.amazon.com/ 
https://github.com/netsamir/following 
https://scotch.io/ 
https://servicios.dgi.gub.uy/ 
https://sourcemaking.com/ 
https://stackedit.io/editor 
https://stripe.com/be 
https://toolbelt.heroku.com/ 
https://training.github.com/ 
https://vimeo.com/54505525 
https://vimeo.com/tag:drew+neil 
https://web.whatsapp.com/ 
https://www.ctan.org/ 
https://www.eff.org/ 
https://www.mybeluga.com/ 
https://www.solveforx.com/ 
https://www.symynd.com/ 
https://www.symynd.com/# 
https://www.tizen.org/ 
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS 

I am trying to extract only the domain name. For example:

linksyssmartwifi.com 
amazon.com 
github.com 

I tried with Perl and Vim but could not get it done. My best approximation is the following:

perl -pe 's!(^https?\://.*[\.](.+\..+?)/.*$)!$1 -- [$2] !g' all_urls_sorted.txt 

Some URLs are parsed correctly (see the []), others are not:

https://sites.google.com/site/steveyegge2/singleton-considered-stupid -- [google.com] 
https://sourcemaking.com/ 
https://stackedit.io/editor 
https://stripe.com/be 
https://toolbelt.heroku.com/ -- [heroku.com] 
https://training.github.com/ -- [github.com] 
https://vimeo.com/54505525 
https://vimeo.com/tag:drew+neil 
https://web.whatsapp.com/ -- [whatsapp.com] 
https://wiki.haskell.org/GHC -- [haskell.org] 

As my tests show, URLs whose domain starts directly after the // (in https?://), that is, with no subdomain, are left out.

If you know how to solve this, I would be very glad.

Thanks

Answers

5

Use the URI module:

#!/usr/bin/env perl 

use strict; 
use warnings; 
use v5.10; 

use URI; 

while (<DATA>) { 
    chomp; 
    my $uri = URI->new($_); 
    my $host = $uri->host; 
    my ($domain) = $host =~ m/([^.]+\.[^.]+$)/; 
    say $domain; 
} 

__DATA__ 
https://vimdoc.sourceforge.net/htmldoc/pattern.html 
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html 
http://www.catonmat.net/download/perl1line.txt 
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet 
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM 
http://www.gnu.org/software/coreutils/manual/coreutils.html 
http://www.catonmat.net/download/perl1line.txt 
https://feedly.com/i/my 
http://vimhelp.appspot.com/ 
https://git-scm.com/doc 
https://read.amazon.com/ 
https://github.com/netsamir/following 
https://scotch.io/ 
https://servicios.dgi.gub.uy/ 
https://sourcemaking.com/ 
https://stackedit.io/editor 
https://stripe.com/be 
https://toolbelt.heroku.com/ 
https://training.github.com/ 
https://vimeo.com/54505525 
https://vimeo.com/tag:drew+neil 
https://web.whatsapp.com/ 
https://www.ctan.org/ 
https://www.eff.org/ 
https://www.mybeluga.com/ 
https://www.solveforx.com/ 
https://www.symynd.com/ 
https://www.symynd.com/# 
https://www.tizen.org/ 
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS 

Output:

sourceforge.net 
linksyssmartwifi.com 
catonmat.net 
github.com 
google.com 
gnu.org 
catonmat.net 
feedly.com 
appspot.com 
git-scm.com 
amazon.com 
github.com 
scotch.io 
gub.uy 
sourcemaking.com 
stackedit.io 
stripe.com 
heroku.com 
github.com 
vimeo.com 
vimeo.com 
whatsapp.com 
ctan.org 
eff.org 
mybeluga.com 
solveforx.com 
symynd.com 
symynd.com 
tizen.org 
workforall.net 
+2

What about "www.bbc.co.uk"? – Borodin

+1

That is what `Domain::PublicSuffix` tries to do. – ernix
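To illustrate the point raised in these comments, here is a minimal sketch of suffix-aware extraction in plain Perl. The `%multi_label_suffix` table and the `registrable_domain` helper are hypothetical names introduced only for this example, and the table is a tiny hand-picked sample rather than a real public suffix list; a module such as `Domain::PublicSuffix` would be the more complete route.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical, hand-picked sample of multi-label public suffixes.
# A real list (for example the Public Suffix List) is far longer.
my %multi_label_suffix = map { $_ => 1 } qw(co.uk org.uk gub.uy com.au);

# Return the registrable domain for a host name: the public suffix
# plus one label to its left.
sub registrable_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    return $host if @labels < 2;

    # Keep three labels when the last two form a known suffix, else two.
    my $last_two = join '.', @labels[-2, -1];
    my $keep = ($multi_label_suffix{$last_two} && @labels >= 3) ? 3 : 2;
    return join '.', @labels[-$keep .. -1];
}

print registrable_domain($_), "\n"
    for qw(www.bbc.co.uk servicios.dgi.gub.uy read.amazon.com);
```

With this sample table, the three hosts print as `bbc.co.uk`, `dgi.gub.uy` and `amazon.com`. Any suffix missing from the table falls back to the last two labels, which is the same behaviour as the answer above.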

3

My best approximation uses URI::URL:

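use URI::URL;

# @filecontents is assumed to hold one URL per element, already chomped.
foreach my $uri (@filecontents) {
    my $uriobj = URI::URL->new($uri);
    my $host = $uriobj->host;
    my @parts = split /\./, $host;
    # Join the last two labels with a dot, e.g. "github.com".
    print "$uri -- $parts[-2].$parts[-1]\n";
}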

Hope that helps.

+2

URI::URL exists for backwards compatibility. For new code, use [URI](https://metacpan.org/pod/URI) instead. – Schwern

1

A regex solution would be:

//(?:[^./]+[.])*([^/.]+[.][^/.]+)/ 

If the trailing slash is optional, just add a ?:

//(?:[^./]+[.])*([^/.]+[.][^/.]+)/? 

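This should be used with the global modifier and with a delimiter other than /.

Essentially, it looks between the // and the next /.

Any extra subdomains are consumed by the non-capturing group (?:[^./]+[.])*. The main domain falls into the capture group ([^/.]+[.][^/.]+).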
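As a hedged illustration of how this pattern might be applied from Perl, the sketch below compiles it with `qr{}` so the slashes need no escaping. The `domain_of` helper is a hypothetical name, and the trailing `(?:/|\z)` is an addition of this sketch (not part of the answer) so that a URL with nothing after the host still matches. It assumes one URL per string and does not handle ports, userinfo or query strings that follow the host directly.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The answer's pattern with {} delimiters. The final (?:/|\z) is an
# assumption of this sketch: accept a slash or the end of the string.
my $domain_re = qr{//(?:[^./]+[.])*([^/.]+[.][^/.]+)(?:/|\z)};

# Return the last two labels of the host, or undef when nothing matches.
sub domain_of {
    my ($url) = @_;
    return $url =~ $domain_re ? $1 : undef;
}

for my $url (qw(https://read.amazon.com/ https://git-scm.com
                https://vimeo.com/tag:drew+neil)) {
    my $domain = domain_of($url);
    print "$url -- [", (defined $domain ? $domain : 'no match'), "]\n";
}
```

Because the subdomain group may match zero times, hosts without a subdomain such as `vimeo.com` are captured as well, which is the case the regex in the question missed.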

+0

What if the host name is not followed by a /? – ysth

+0

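I have tested this solution and it works inline, since it is pure Perl with a regex. It fails when there is no / after the host name. The (?:[^./]+[.])* part is what makes the difference from my own solution. Thank you. –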

+0

@SamirSadek Adjusting for an optional trailing slash is very simple: just add a '?' (see the edit). – Laurel