How do I parse Google Images URLs with Ruby and Nokogiri?

I'm trying to build an array of all the image files on a Google Images results page. How can I parse the Google Images URLs with Ruby and Nokogiri?

I want a regular expression that pulls out everything after "imgurl=" and stops before "&amp;", as seen in this HTML:

<a href="http://www.google.com/imgres?imgurl=http://www.trendytree.com/old-world- christmas/images/20031chapel20031-silent-night-chapel.jpg&amp;imgrefurl=http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html&amp;usg=__YJdf3xc4ydSfLQa9tYnAzavKHYQ=&amp;h=400&amp;w=400&amp;sz=58&amp;hl=en&amp;start=19&amp;zoom=1&amp;tbnid=ajDcsGGs0tgE9M:&amp;tbnh=124&amp;tbnw=124&amp;ei=qagfUbXmHKfv0QHI3oG4CQ&amp;itbs=1&amp;sa=X&amp;ved=0CE4QrQMwEg"><img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"></a><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>

I think I can do this with a regular expression, but I can't find a way to search my parsed document with a regex, and I haven't found any solution so far.


2013-02-16 Jake Schievink

str = '<a href="http://www.google.com/imgres?imgurl=http://www.trendytree.com/old-world-  christmas/images/20031chapel20031-silent-night-chapel.jpg&amp;imgrefurl=http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html&amp;usg=__YJdf3xc4ydSfLQa9tYnAzavKHYQ=&amp;h=400&amp;w=400&amp;sz=58&amp;hl=en&amp;start=19&amp;zoom=1&amp;tbnid=ajDcsGGs0tgE9M:&amp;tbnh=124&amp;tbnw=124&amp;ei=qagfUbXmHKfv0QHI3oG4CQ&amp;itbs=1&amp;sa=X&amp;ved=0CE4QrQMwEg"><img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"></a><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>' 
str.split('imgurl=')[1].split('&amp')[0] 
#=> "http://www.trendytree.com/old-world-  christmas/images/20031chapel20031-silent-night-chapel.jpg"

Is this what you're looking for?


2013-02-16 16:24:50

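To get all of them, you'd want to do something like this:

require 'nokogiri'
require 'open-uri'

# get all links
url = 'some-google-images-url'
links = Nokogiri::HTML(URI.open(url)).css('a')

# get regex match or nil on desired img
# (to_s guards against anchors that have no href attribute)
img_urls = links.map { |a| a['href'].to_s[/imgurl=(.*?)&/, 1] }

# get rid of nils
img_urls = img_urls.compact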

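The regex you want is /imgurl=(.*?)&/ because you need a non-greedy match of the image URL between imgurl= and the next &. Otherwise a greedy .* would swallow everything up to the last & in the string.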
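To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The href below is a shortened, hypothetical stand-in for the real Google URL, not something taken from the question.

```ruby
# Hypothetical, shortened href used only for illustration.
href = 'http://www.google.com/imgres?imgurl=http://example.com/a.jpg' \
       '&imgrefurl=http://example.com/page.html&h=400'

lazy   = href[/imgurl=(.*?)&/, 1] # non-greedy: stops at the first "&"
greedy = href[/imgurl=(.*)&/, 1]  # greedy: runs on to the last "&"

puts lazy   # http://example.com/a.jpg
puts greedy # http://example.com/a.jpg&imgrefurl=http://example.com/page.html
```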


2013-02-16 16:43:11 AJcodez

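The problem with using a regex is that it assumes too much about the order of the parameters in the URL. If the order changes, or the & after the value disappears, the regex stops working.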
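A quick sketch of that failure mode, using two hypothetical hrefs that carry the same parameters in a different order (neither comes from the question):

```ruby
# Hypothetical hrefs: same parameters, different order.
imgurl_first = 'http://www.google.com/imgres?imgurl=http://example.com/a.jpg&h=400'
imgurl_last  = 'http://www.google.com/imgres?h=400&imgurl=http://example.com/a.jpg'

pattern = /imgurl=(.*?)&/

found   = imgurl_first[pattern, 1] # "&" follows the value, so this matches
missing = imgurl_last[pattern, 1]  # no "&" after the value, so this is nil

p found   # "http://example.com/a.jpg"
p missing # nil
```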

Instead, parse the URL and then pull the values out of the query string:

# encoding: UTF-8 

require 'nokogiri' 
require 'cgi' 
require 'uri' 

doc = Nokogiri::HTML.parse('<a href="http://www.google.com/imgres?imgurl=http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg&amp;imgrefurl=http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html&amp;usg=__YJdf3xc4ydSfLQa9tYnAzavKHYQ=&amp;h=400&amp;w=400&amp;sz=58&amp;hl=en&amp;start=19&amp;zoom=1&amp;tbnid=ajDcsGGs0tgE9M:&amp;tbnh=124&amp;tbnw=124&amp;ei=qagfUbXmHKfv0QHI3oG4CQ&amp;itbs=1&amp;sa=X&amp;ved=0CE4QrQMwEg"><img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"></a><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>') 

doc.search('a').each do |a|
    query_params = CGI::parse(URI(a['href']).query) 
    puts query_params['imgurl'] 
end

Which outputs:

http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg

Both URI and CGI are used because URI's decode_www_form raises an exception when it tries to decode this query.
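The exception mentioned above may be specific to older Ruby releases. As a sketch only, and assuming a more recent Ruby in which URI.decode_www_form accepts ASCII query strings like this one, URI alone might be enough. The href is a shortened, hypothetical example.

```ruby
require 'uri'

# Hypothetical, shortened href used only for illustration.
href = 'http://www.google.com/imgres?imgurl=http://example.com/a.jpg&h=400&w=400'

# decode_www_form returns an array of [key, value] pairs; to_h turns it into a hash.
params = URI.decode_www_form(URI(href).query).to_h

puts params['imgurl'] # http://example.com/a.jpg
```

If this raises on your Ruby version, the CGI-based code above is the safer choice.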

I've also been known to turn a query string into a hash with something like:

Hash[URI(a['href']).query.split('&').map{ |p| p.split('=') }]

Which would return:

 
{"imgurl"=> 
    "http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg", 
"imgrefurl"=> 
    "http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html", 
"usg"=>"__YJdf3xc4ydSfLQa9tYnAzavKHYQ", 
"h"=>"400", 
"w"=>"400", 
"sz"=>"58", 
"hl"=>"en", 
"start"=>"19", 
"zoom"=>"1", 
"tbnid"=>"ajDcsGGs0tgE9M:", 
"tbnh"=>"124", 
"tbnw"=>"124", 
"ei"=>"qagfUbXmHKfv0QHI3oG4CQ", 
"itbs"=>"1", 
"sa"=>"X", 
"ved"=>"0CE4QrQMwEg"}
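One caveat with the one-liner: it splits on every "=", so a trailing "=" is dropped (as with usg in the output), and a value with "=" in the middle would produce a three-element pair that Hash[] rejects. A slightly more defensive sketch, where query_to_hash is a hypothetical helper name and the href is made up for illustration:

```ruby
require 'uri'

# Hypothetical helper: turn the query string of an href into a hash.
def query_to_hash(href)
  query = URI(href).query.to_s
  pairs = query.split('&').map do |pair|
    # Limit of 2 splits on the first "=" only, so "=" inside a value survives.
    key, value = pair.split('=', 2)
    [key, URI.decode_www_form_component(value.to_s)]
  end
  pairs.to_h
end

# Hypothetical, shortened href used only for illustration.
params = query_to_hash('http://www.google.com/imgres?imgurl=http://example.com/a.jpg&usg=__abc=&h=400')

puts params['imgurl'] # http://example.com/a.jpg
puts params['usg']    # __abc=
```

Note that, unlike the one-liner, this keeps the trailing "=" on usg.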


2013-02-17 03:50:21
