Here's how I'd go about doing it:
require 'nokogiri'

doc = Nokogiri::XML(open('/Users/gferguson/smithsonian-events.xml'))
namespaces = doc.collect_namespaces # hash of every namespace declared in the document

entries = doc.search('entry').map { |entry|
  entry_title = entry.at('title').text
  # Both times are attributes of the namespaced <gd:when> node
  entry_time_start, entry_time_end = ['startTime', 'endTime'].map { |attr|
    entry.at('gd|when', namespaces)[attr]
  }
  entry_notes = entry.at('gc|notes', namespaces).text
  {
    title: entry_title,
    start_time: entry_time_start,
    end_time: entry_time_end,
    notes: entry_notes
  }
}
When run, this leaves entries as an array of hashes:
require 'awesome_print'
ap entries[0, 3]
# >> [
# >> [0] {
# >> :title => "Conservation Clinics",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T17:00:00Z",
# >> :notes => "Have questions about the condition of a painting, frame, drawing,\n print, or object that you own? Our conservators are available by\n appointment to consult with you about the preservation of your art.\n \n To request an appointment or to learn more,\n e-mail [email protected] and specify CLINIC in the subject line."
# >> },
# >> [1] {
# >> :title => "Castle Highlights Tour",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T14:45:00Z",
# >> :notes => "Did you know that the Castle is the Smithsonian’s first and oldest building? Join us as one of our dynamic volunteer docents takes you on a tour to explore the highlights of the Smithsonian Castle. Come learn about the founding and early history of the Smithsonian; its original benefactor, James Smithson; and the incredible history and architecture of the Castle. Here is your opportunity to discover the treasured stories revealed within James Smithson's crypt, the Gre...
# >> },
# >> [2] {
# >> :title => "Exhibition Interpreters/Navigators (throughout the day)",
# >> :start_time => "2016-11-09T15:00:00Z",
# >> :end_time => "2016-11-09T15:00:00Z",
# >> :notes => "Museum volunteer interpreters welcome visitors, answer questions, and help visitors navigate exhibitions. Interpreters may be stationed in several of the following exhibitions at various times throughout the day, subject to volunteer interpreter availability. <ul> \t<li><em>The David H. Koch Hall of Human Origins: What Does it Mean to be Human?</em></li> \t<li><em>The Sant Ocean Hall</em></li> </ul>"
# >> }
# >> ]
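I didn't try to gather the specific information you asked for because event_name doesn't exist, and what you're doing is very generic and easily done once you understand a few rules.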
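XML is generally very repetitive because it represents tables of data. The "cells" of the table might vary, but there is repetition you can use to help you. In this code, doc.search('entry') loops over the <entry> nodes. From there it's easy to look inside each one to find the information needed.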
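XML uses namespaces to help avoid tag-name collisions. At first those seem really hard, but Nokogiri provides the collect_namespaces method for documents, which returns a hash of all namespaces in the document. If you're looking for a namespaced tag, pass that hash as the second parameter.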
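Nokogiri allows us to use XPath and CSS for selectors. I almost always go with CSS for readability. ns|tag is the format that tells Nokogiri to use a CSS-based namespaced tag. Again, pass it the hash of namespaces in the document and Nokogiri will do the rest.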
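If you're familiar with working with Nokogiri, you'll see the above code is very similar to the normal code used to pull the content of <td> cells out of <tr> rows in an HTML <table>.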
You should be able to modify that code to gather the data you need without risking namespace collisions.
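You provided a URL to pure XML, but you are trying to find HTML in it. There is no HTML in the document. – Aleksey
Then how do I extract the content using Nokogiri? @Aleksey – Ajay
Don't use full selectors like '/html/body/div[2]/div[2]/div[1]/h3/a/span'. They're very error-prone. Instead, find the shortest path to the desired node and use that. That way, if the document layout changes, the selector will still work. As it stands, if the page changes at all, your code will break. –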