2010-01-13 72 views
2

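How to read someone else's forum

My friend has a forum that is full of posts containing information. Sometimes she wants to review the posts in her forum and draw conclusions from them. At the moment she reviews posts by clicking through the forum, building a not necessarily accurate picture of the data (in her head), and drawing her conclusions from that. My thought today was that I could bang out a quick Ruby script to parse the necessary HTML and give her a real idea of what the data says.

Today I used Ruby's net/http library for the first time, and I ran into a problem. My browser has no trouble viewing my friend's forum, but the call Net::HTTP.new("forumname.net") produces the following error:

No connection could be made because the target machine actively refused it. - connect(2)

Googling that error, I learned that it has to do with MySQL (or something like it) not wanting nosy guys like me poking around in there remotely, for security reasons. That makes sense to me, but it makes me wonder: how does my browser get to poke around my friend's forum when my little Ruby script gets no poking rights? Is there a way for the script to tell the server that it is not a threat, and that I only want read access, not write access?

Thanks guys,

z.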

+1

Are you trying to access her database directly, or just the HTML pages? You seem a bit confused about how ports work: "I learned that it has to do with MySQL" – 2010-01-13 21:06:41

+0

I am! Maybe my problem has to do with "proxies"? I tried calling address = Net::HTTP.new("sitename.net"); for some sites I get HTML back, and for other sites I get that error message. – Ziggy 2010-01-13 21:12:37

+0

Changed the tags on your question, since it really seems to be about getting the httpwebrequest to work – Veger 2010-01-13 22:36:51

Answers

6

Scraping a web site? Use mechanize:

#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'

# Older mechanize releases named this class WWW::Mechanize; current ones use Mechanize.
agent = Mechanize.new
page = agent.get("http://xkcd.com")
# Follow links by their visible text until we reach the thread.
page = page.link_with(:text=>'Forums').click
page = page.link_with(:text=>'Mathematics').click
page = page.link_with(:text=>'Math Books').click
# puts page.parser.to_html # If you want to see the html you just got
# Each post lives in a <div class="postbody"> element.
posts = page.parser.xpath("//div[@class='postbody']")
posts.each do |post|
    title = post.at_xpath('h3//text()').to_s
    author = post.at_xpath("p[@class='author']//a//text()").to_s
    # Join every text node inside the post content, one per line.
    body = post.xpath("div[@class='content']//text()").collect do |div|
        div.to_s
    end.join("\n")
    puts '-' * 40
    puts "title: #{title}"
    puts "author: #{author}"
    puts "body:", body
end

The first part of the output:

---------------------------------------- 
title: Math Books 
author: Cleverbeans 
body: 
This is now the official thread for questions about math books at any level, fr\ 
om high school through advanced college courses. 
I'm looking for a good vector calculus text to brush up on what I've forgotten.\ 
We used Stewart's Multivariable Calculus as a baseline but I was unable to pur\ 
chase the text for financial reasons at the time. I figured some things may hav\ 
e changed in the last 12 years, so if anyone can suggest some good texts on thi\ 
s subject I'd appreciate it. 
---------------------------------------- 
title: Re: Multivariable Calculus Text? 
author: ThomasS 
body: 
The textbooks go up in price and new pretty pictures appear. However, Calculus \ 
really hasn't changed all that much. 
If you don't mind a certain lack of pretty pictures, you might try something li\ 
ke Widder's Advanced Calculus from Dover. it is much easier to carry around tha\ 
n Stewart. It is also written in a style that a mathematician might consider no\ 
rmal. If you think that you might want to move on to real math at some point, i\ 
t might serve as an introduction to the associated style of writing. 
+0

oooooh, maybe this is what I'm after. Mechanize, you say; Ruby gems, you say: I am filled with curiosity! Off I go... to Google! – Ziggy 2010-01-15 01:57:00

+0

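It's not just me; three other people say so too. It's a moral imperative! You'll find the rdoc pages for mechanize a bit sparse, and xpath can be daunting at first, but you'll be well pleased if you learn how to use them. Scraping the web is much quicker this way. Any questions, just bring them to SO. – 2010-01-15 02:10:56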

1

Some sites can only be reached through the "www" subdomain, so that may be causing the problem.

To create a GET request, you would use the Get class:

require 'net/http' 

url = URI.parse('http://www.forum.site/') 
req = Net::HTTP::Get.new(url.path) 
res = Net::HTTP.start(url.host, url.port) {|http| 
    http.request(req) 
} 
puts res.body 

You may also need to set the user agent at some point, passed as an option:

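# The header hash is the second argument to Get.new; keep the value on one line.
req = Net::HTTP::Get.new(url.path, {
    'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1'
})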
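
For reference, the two snippets in this answer can be combined and checked without touching the network. This is only a minimal sketch: `build_get` is a hypothetical helper name, `www.forum.site` is the placeholder host from above, and it assumes the URL is a full `http://` URL.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a GET request that carries a browser-like
# User-Agent header. Nothing is sent over the network here.
def build_get(url_string, user_agent)
  uri = URI.parse(url_string)
  # request_uri returns "/" when the URL has no path, so the request
  # path is never empty. It is only defined for http/https URLs.
  Net::HTTP::Get.new(uri.request_uri, { 'User-Agent' => user_agent })
end

ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1'
req = build_get('http://www.forum.site/', ua)

puts req['User-Agent']  # the header that would be sent
puts req.path           # the path that would be requested
```

Inspecting `req` like this before calling `http.request(req)` is a cheap way to confirm the header is actually set.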
+0

Yes, the user agent was my guess too. – GalacticCowboy 2010-01-13 21:19:12

+0

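So here is the problem. When I run code like this with, for example, url = URI.parse('forums.xkcd.com'), I get the error message about "the target machine actively refused it". That is the problem I want to overcome. – Ziggy 2010-01-13 21:24:31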

+0

@Ziggy, with the URL http://forums.xkcd.com/ (the trailing slash and the protocol are important) it should work – jspcal 2010-01-13 21:36:18
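
To see why the protocol and trailing slash matter, it can help to look at what `URI.parse` returns in each case. The sketch below uses only the standard `uri` library and makes no network calls. Without a scheme, the whole string is treated as a path and the host is `nil`, which likely explains the refused connection: Net::HTTP is left with no remote host to connect to.

```ruby
require 'uri'

bare     = URI.parse('forums.xkcd.com')          # no protocol
full     = URI.parse('http://forums.xkcd.com/')  # protocol and trailing slash
no_slash = URI.parse('http://forums.xkcd.com')   # protocol, no trailing slash

p bare.host             # nil
p bare.port             # nil
p bare.path             # "forums.xkcd.com"

p full.host             # "forums.xkcd.com"
p full.port             # 80
p full.path             # "/"

p no_slash.path         # ""  (an empty request path is not valid)
p no_slash.request_uri  # "/" (safe to pass to Net::HTTP::Get.new)
```

Using `url.request_uri` instead of `url.path` when building the request sidesteps the trailing-slash issue, and it also keeps any query string.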