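Pattern does not remove special characters from a website

2014-12-03

I am taking a URL as user input, fetching the page, and then printing the other pages that the website links to. The package I am using is: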

LWP::Simple 

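I take the link as user input from the command line and store it in a variable; I get it from $ARGV[0]. I then create another variable and call get on the variable that holds the website. Next I create an array variable and apply the regular expression

/\shref="?([^\s>"]+)/gi;

to the variable that holds the result of the get function. Then I run a foreach loop over the array to print the results.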

However, although it prints the links, it also ends up printing stand-alone special characters such as / and # when nothing follows them.

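So if there is something like /blabalbla, it prints it. But if there are only stand-alone special characters (for example /, \ or #), it prints those too. Is there any way I can modify the regex so that special characters that are not followed by a string are not printed? I am new to Perl and not good at regular expressions.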
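For illustration, the behaviour described above can be reproduced without any network access. The sketch below is not the original program, which was never posted; the HTML fragment and the variable names are assumptions. It applies the same regular expression to a string and then uses grep to drop any capture made up entirely of /, \ and # characters.

```perl
use strict;
use warnings;

# Hypothetical HTML fragment standing in for the result of get($url)
my $html = join ' ',
    '<a href="/">Home</a>',
    '<a href="#">Top</a>',
    '<a href="/news/">News</a>',
    '<a href=/sport>Sport</a>';

# The regex from the question: in list context with /g it returns every capture
my @links = $html =~ /\shref="?([^\s>"]+)/gi;

# Keep a link only if it contains at least one character other than / \ #
my @kept = grep { m{[^/\\#]} } @links;

print "$_\n" for @kept;
```

Run as is, this prints /news/ and /sport; the bare / and # captures are filtered out. Note that a filter like this also drops a legitimate link to the site root, so whether it is appropriate depends on what the output is used for.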

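I can't help unless you show your code, a *real example* of a URL, and the corresponding output. Your regex certainly doesn't match isolated characters like that, and I think it is more likely that you are misusing the regex. – Borodin 2014-12-03 22:07:12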

What do you mean by "followed by a string"? – ikegami 2014-12-03 22:11:49

@Borodin - Here it is: http://www.google.com/imghp?hl=zh-CN&tab=wi http://maps.google.com/maps?hl=zh-CN&tab=wl https://play.google.com/?hl=en&tab=w8 \ There are more links in the output, but I removed them to fit in the comment. This is using google.com. See the end. – user2128074 2014-12-03 22:14:58

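Answer

I can't help with your specific problem without further information, but in the meantime I suggest you take a look at HTML::LinkExtor, which was written for this purpose.

Here is some example code with its output. It lists only <a> elements that have an href attribute.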

use strict;
use warnings;
use 5.010;

use LWP;
use HTML::LinkExtor;

# Fetch the page
my $ua = LWP::UserAgent->new;
my $resp = $ua->get('http://www.bbc.co.uk/');

# Passing a base URL makes the extractor return absolute URLs
my $extor = HTML::LinkExtor->new(undef, $resp->base);
$extor->parse($resp->decoded_content);

# Each link is an array ref: [ $tag, $attr_name => $url, ... ]
for my $link ($extor->links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' and $attr{href};
    say $attr{href};
}

Output

http://m.bbc.co.uk 
http://www.bbc.co.uk/ 
http://www.bbc.co.uk/#h4discoveryzone 
http://www.bbc.co.uk/accessibility/ 
https://ssl.bbc.co.uk/id/status 
http://www.bbc.co.uk/news/ 
http://www.bbc.com/news/ 
http://www.bbc.co.uk/sport/ 
http://www.bbc.co.uk/weather/ 
http://shop.bbc.com/ 
http://www.bbc.com/earth/ 
http://www.bbc.com/travel/ 
http://www.bbc.com/capital/ 
http://www.bbc.co.uk/iplayer/ 
http://www.bbc.com/culture/ 
http://www.bbc.com/autos/ 
http://www.bbc.com/future/ 
http://www.bbc.co.uk/tv/ 
http://www.bbc.co.uk/radio/ 
http://www.bbc.co.uk/cbbc/ 
http://www.bbc.co.uk/cbeebies/ 
http://www.bbc.co.uk/arts/ 
http://www.bbc.co.uk/ww1/ 
http://www.bbc.co.uk/food/ 
http://www.bbc.co.uk/history/ 
http://www.bbc.co.uk/learning/ 
http://www.bbc.co.uk/music/ 
http://www.bbc.co.uk/science/ 
http://www.bbc.co.uk/nature/ 
http://www.bbc.com/earth/ 
http://www.bbc.co.uk/local/ 
http://www.bbc.co.uk/travel/ 
http://www.bbc.co.uk/a-z/ 
http://www.bbc.co.uk/#orb-footer 
http://search.bbc.co.uk/search 
http://www.bbc.co.uk/privacy/cookies/managing/cookie-settings.html 
http://www.bbc.co.uk/locator/default/desktop/en-GB?ptrt=%2F 
http://www.bbc.co.uk/# 
http://www.bbc.co.uk/# 
http://www.bbc.co.uk/weather/2643743?day=0 
http://www.bbc.co.uk/weather/2643743?day=0 
http://www.bbc.co.uk/weather/2643743?day=1 
http://www.bbc.co.uk/weather/2643743?day=1 
http://www.bbc.co.uk/weather/2643743?day=2 
http://www.bbc.co.uk/weather/2643743?day=2 
http://www.bbc.co.uk/locator/default/desktop/en-GB?ptrt=%2F 
http://www.bbc.co.uk/weather/2643743 
http://www.bbc.co.uk/news/science-environment-30311816 
http://www.bbc.co.uk/news/science-environment-30311822 
http://www.bbc.co.uk/news/science-environment-30311818 
http://www.bbc.co.uk/news/magazine-30282261 
http://www.bbc.co.uk/news/science-environment-30311816 
http://www.bbc.co.uk/news/uk-politics-30291460 
http://www.bbc.co.uk/news/ 
http://www.bbc.co.uk/news/uk-england-kent-30319549 
http://www.bbc.co.uk/news/world-europe-30306106 
http://www.bbc.co.uk/news/world-europe-30306992 
http://www.bbc.co.uk/news/uk-30306145 
http://www.bbc.co.uk/news/local/ 
http://www.bbc.co.uk/news/england/london/ 
http://www.bbc.co.uk/news/uk-england-london-30308694 
http://www.bbc.co.uk/news/uk-england-london-30315650 
http://www.bbc.co.uk/news/uk-england-london-30321504 
http://www.bbc.co.uk/sport/live/football/29959148 
http://www.bbc.co.uk/sport/0/ 
http://www.bbc.co.uk/sport/live/snooker/29618359 
http://www.bbc.co.uk/sport/football/30204433 
http://www.bbc.co.uk/sport/cricket/30308980 
http://www.bbc.co.uk/sport/football/30204434 
http://www.bbc.co.uk/sport/0/football/ 
http://www.bbc.co.uk/sport/football/30204459 
http://www.bbc.co.uk/sport/football/30204511 
http://www.bbc.co.uk/sport/football/28647040 
http://www.bbc.co.uk/?dzf=sport 
http://www.bbc.co.uk/?dzf=entertainment 
http://www.bbc.co.uk/?dzf=bbcnow 
http://www.bbc.co.uk/?dzf=entertainment 
http://www.bbc.co.uk/?dzf=news 
http://www.bbc.co.uk/?dzf=lifestyle 
http://www.bbc.co.uk/?dzf=knowledge 
http://www.bbc.co.uk/?dzf=sport 
http://www.bbc.co.uk/news/ 
http://www.bbc.com/news/ 
http://www.bbc.co.uk/sport/ 
http://www.bbc.co.uk/weather/ 
http://shop.bbc.com/ 
http://www.bbc.com/earth/ 
http://www.bbc.com/travel/ 
http://www.bbc.com/capital/ 
http://www.bbc.co.uk/iplayer/ 
http://www.bbc.com/culture/ 
http://www.bbc.com/autos/ 
http://www.bbc.com/future/ 
http://www.bbc.co.uk/tv/ 
http://www.bbc.co.uk/radio/ 
http://www.bbc.co.uk/cbbc/ 
http://www.bbc.co.uk/cbeebies/ 
http://www.bbc.co.uk/arts/ 
http://www.bbc.co.uk/ww1/ 
http://www.bbc.co.uk/food/ 
http://www.bbc.co.uk/history/ 
http://www.bbc.co.uk/learning/ 
http://www.bbc.co.uk/music/ 
http://www.bbc.co.uk/science/ 
http://www.bbc.co.uk/nature/ 
http://www.bbc.com/earth/ 
http://www.bbc.co.uk/local/ 
http://www.bbc.co.uk/travel/ 
http://www.bbc.co.uk/a-z/ 
http://www.bbc.co.uk/ 
http://www.bbc.co.uk/terms/ 
http://www.bbc.co.uk/aboutthebbc/ 
http://www.bbc.co.uk/privacy/ 
http://www.bbc.co.uk/privacy/cookies/about 
http://www.bbc.co.uk/accessibility/ 
http://www.bbc.co.uk/guidance/ 
http://www.bbc.co.uk/contact/ 
http://www.bbc.co.uk/bbctrust/ 
http://www.bbc.co.uk/complaints/ 
http://www.bbc.co.uk/help/web/links/ 

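Thanks, I know I can always count on you to solve my problems. I have also provided further details above; sorry I didn't include them at the start :) – user2128074 2014-12-03 22:21:01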

How would this be used to take user input? As in, the user has to supply a website themselves. – user2128074 2014-12-03 22:22:48

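@user2128074: In the usual way: use 'chomp(my $url = <>)' to get the URL from the terminal, and then use it in 'my $resp = $ua->get($url)'. Don't you want to get your original program working? I am sure it would help you to understand Perl regexes, and I am fairly certain that the problem is in your Perl code rather than in the regex, if only you would show it. – Borodin 2014-12-03 22:24:59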
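
Combining the suggestion in this comment with the code from the answer gives roughly the following. It is a sketch under stated assumptions, not anything posted in the thread: the helper name extract_links and the prompt text are made up for illustration, and <STDIN> is used in place of <> so that command-line arguments are not treated as file names.

```perl
use strict;
use warnings;
use 5.010;

use LWP::UserAgent;
use HTML::LinkExtor;

# Hypothetical helper: return the absolute href of every <a> element in $html
sub extract_links {
    my ($html, $base) = @_;

    my $extor = HTML::LinkExtor->new(undef, $base);
    $extor->parse($html);
    $extor->eof;

    my @hrefs;
    for my $link ($extor->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' and $attr{href};
        push @hrefs, "$attr{href}";    # stringify the URI object
    }
    return @hrefs;
}

# Prompt on STDERR so that STDOUT carries only the links
print STDERR 'URL: ';
my $url = <STDIN>;

# Do nothing further if no line was supplied (for example, STDIN is closed)
if (defined $url) {
    chomp $url;

    my $ua   = LWP::UserAgent->new;
    my $resp = $ua->get($url);
    die 'Request failed: ', $resp->status_line, "\n" unless $resp->is_success;

    say for extract_links($resp->decoded_content, $resp->base);
}
```

Keeping the extraction in its own subroutine means it can be tried on a literal HTML string with a base URL of your choosing, without fetching anything.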