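I'm trying to parse an HTML file with Hpricot and Ruby, but I'm having trouble extracting "free-floating" text that isn't enclosed in tags such as <p></p>.
How can I use Hpricot to extract text that is separated only by <br /> tags from a web page?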
require 'hpricot'
text = <<SOME_TEXT
<a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
SOME_TEXT
parsed = Hpricot(text)
parsed = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed
The result I'm hoping for is:
<br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
But what I'm getting is:
<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>
How can I get Hpricot to return the text nodes as well, i.e. line 1, line 2, and so on?