Extract the domain name from a URL

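Yet another request about parsing URLs, but the examples I have found are mostly incomplete or theoretical. I want to settle on something that actually works in Perl.

I have the following URLs: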

https://vimdoc.sourceforge.net/htmldoc/pattern.html 
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html 
http://www.catonmat.net/download/perl1line.txt 
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet 
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM 
http://www.gnu.org/software/coreutils/manual/coreutils.html 
http://www.catonmat.net/download/perl1line.txt 
https://feedly.com/i/my 
http://vimhelp.appspot.com/ 
https://git-scm.com/doc 
https://read.amazon.com/ 
https://github.com/netsamir/following 
https://scotch.io/ 
https://servicios.dgi.gub.uy/ 
https://sourcemaking.com/ 
https://stackedit.io/editor 
https://stripe.com/be 
https://toolbelt.heroku.com/ 
https://training.github.com/ 
https://vimeo.com/54505525 
https://vimeo.com/tag:drew+neil 
https://web.whatsapp.com/ 
https://www.ctan.org/ 
https://www.eff.org/ 
https://www.mybeluga.com/ 
https://www.solveforx.com/ 
https://www.symynd.com/ 
https://www.symynd.com/# 
https://www.tizen.org/ 
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS

I am trying to extract only the domain name. For example:

linksyssmartwifi.com 
amazon.com 
github.com

I tried with Perl and with Vim, but could not get it done. My best approximation is the following:

perl -pe 's!(^https?\://.*[\.](.+\..+?)/.*$)!$1 -- [$2] !g' all_urls_sorted.txt

Some of the URLs are parsed correctly (see the []), others are not:

https://sites.google.com/site/steveyegge2/singleton-considered-stupid -- [google.com] 
https://sourcemaking.com/ 
https://stackedit.io/editor 
https://stripe.com/be 
https://toolbelt.heroku.com/ -- [heroku.com] 
https://training.github.com/ -- [github.com] 
https://vimeo.com/54505525 
https://vimeo.com/tag:drew+neil 
https://web.whatsapp.com/ -- [whatsapp.com] 
https://wiki.haskell.org/GHC -- [haskell.org]

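As my tests show, URLs in which the domain starts directly after the // (of https?://), with no subdomain in front of it, are left unmatched.

If you know how to fix this, I would be glad to hear it.

Thanks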

Source

2016-08-14 Samir Sadek

Use the URI module:

#!/usr/bin/env perl 

use strict; 
use warnings; 
use v5.10; 

use URI; 

while (<DATA>) { 
    chomp; 
    my $uri = URI->new($_); 
    my $host = $uri->host; 
    my ($domain) = $host =~ m/([^.]+\.[^.]+$)/; 
    say $domain; 
} 

__DATA__ 
https://vimdoc.sourceforge.net/htmldoc/pattern.html 
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html 
http://www.catonmat.net/download/perl1line.txt 
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet 
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM 
http://www.gnu.org/software/coreutils/manual/coreutils.html 
http://www.catonmat.net/download/perl1line.txt 
https://feedly.com/i/my 
http://vimhelp.appspot.com/ 
https://git-scm.com/doc 
https://read.amazon.com/ 
https://github.com/netsamir/following 
https://scotch.io/ 
https://servicios.dgi.gub.uy/ 
https://sourcemaking.com/ 
https://stackedit.io/editor 
https://stripe.com/be 
https://toolbelt.heroku.com/ 
https://training.github.com/ 
https://vimeo.com/54505525 
https://vimeo.com/tag:drew+neil 
https://web.whatsapp.com/ 
https://www.ctan.org/ 
https://www.eff.org/ 
https://www.mybeluga.com/ 
https://www.solveforx.com/ 
https://www.symynd.com/ 
https://www.symynd.com/# 
https://www.tizen.org/ 
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS

Output:

sourceforge.net 
linksyssmartwifi.com 
catonmat.net 
github.com 
google.com 
gnu.org 
catonmat.net 
feedly.com 
appspot.com 
git-scm.com 
amazon.com 
github.com 
scotch.io 
gub.uy 
sourcemaking.com 
stackedit.io 
stripe.com 
heroku.com 
github.com 
vimeo.com 
vimeo.com 
whatsapp.com 
ctan.org 
eff.org 
mybeluga.com 
solveforx.com 
symynd.com 
symynd.com 
tizen.org 
workforall.net

Source

2016-08-14 19:39:47 Miller

What about "www.bbc.co.uk"? – Borodin

That is what Domain::PublicSuffix tries to do. – ernix
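To illustrate the point raised in these comments, here is a minimal sketch of the suffix-list idea using core Perl only. The %SUFFIX table and the registrable_domain helper are hypothetical names invented for this example, and the table is a tiny hand-picked sample, not the real Public Suffix List. It is not how Domain::PublicSuffix is implemented, only the general approach of matching the longest known suffix and keeping one more label.

```perl
use strict;
use warnings;

# ASSUMPTION: a tiny hand-picked sample of public suffixes, for illustration.
# The real Public Suffix List has thousands of entries plus wildcard rules.
my %SUFFIX = map { $_ => 1 } qw(com net org io uk co.uk uy gub.uy);

# Hypothetical helper: return the longest known suffix plus the one label
# to its left, or undef when no suffix in the table applies.
sub registrable_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    # Candidate suffixes are tried from longest to shortest.
    for my $i (1 .. $#labels) {
        my $suffix = join '.', @labels[$i .. $#labels];
        return join '.', @labels[$i - 1 .. $#labels] if $SUFFIX{$suffix};
    }
    return;
}

for my $host (qw(www.bbc.co.uk read.amazon.com servicios.dgi.gub.uy)) {
    my $domain = registrable_domain($host);
    printf "%s -> %s\n", $host, defined $domain ? $domain : 'unknown';
}
```

With this sample table, www.bbc.co.uk yields bbc.co.uk rather than co.uk, and servicios.dgi.gub.uy yields dgi.gub.uy rather than gub.uy. The result is only as good as the suffix table behind it.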

My best approximation uses URI::URL:

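use URI::URL;

# @filecontents is expected to hold one URL per element
foreach my $uri (@filecontents) { 
    my $uriobj = URI::URL->new($uri); 
    my $host = $uriobj->host; 
    my @parts = split /\./, $host; 
    # join the last two labels with a dot, e.g. github.com
    print "$uri -- $parts[-2].$parts[-1]\n"; 
}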

Hope that helps.

Source

2016-08-14 19:43:22 hd1

URI::URL exists for backwards compatibility. For new code, use [URI](https://metacpan.org/pod/URI) instead. – Schwern

A regex solution is:

//(?:[^./]+[.])*([^/.]+[.][^/.]+)/

If the trailing slash is optional, just add a ?:

//(?:[^./]+[.])*([^/.]+[.][^/.]+)/?

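This should be used with the global modifier and with a delimiter other than /.

Essentially, it looks at what lies between the // and the next /.

Any extra subdomains are consumed by the non-capturing group (?:[^./]+[.])*. The main domain falls into the capture group ([^/.]+[.][^/.]+).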
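As a rough sketch of how this pattern can be applied in a script, the snippet below wraps it in a small helper. The name last_two_labels and the sample URLs are made up for this example. It uses m{...} as the delimiter so the slashes need no escaping, and since it handles one URL per call it does not need the global modifier.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last two dot-separated labels of the host
# part of a URL, or undef when the pattern does not match.
sub last_two_labels {
    my ($url) = @_;
    # The trailing /? makes the slash after the host optional.
    my ($domain) = $url =~ m{//(?:[^./]+[.])*([^/.]+[.][^/.]+)/?};
    return $domain;
}

for my $url (qw(https://training.github.com/ https://stripe.com/be http://example.org)) {
    my $domain = last_two_labels($url);
    print "$url -- [", (defined $domain ? $domain : 'no match'), "]\n";
}
```

Like the two-label approach in the other answers, this returns co.uk for a host such as www.bbc.co.uk, so it only fits inputs where the registrable domain is always the last two labels.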

Source

2016-08-14 19:44:15 Laurel

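What happens if there is no / after the host name? – ysth

I have tested this solution and it works as a one-liner, since it is pure Perl with a regex. It fails when there is no / after the host name. The (?:[^./]+[.])* is what makes the difference compared with my own solution. Thanks. –

@SamirSadek Adjusting for an optional trailing slash is very simple, just add a '?' (see the edit). – Laurel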
