R: read and parse HTML into a list

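I have been trying to read and parse a bit of HTML to obtain a list of conditions for animals at an animal shelter. I'm sure my inexperience with HTML parsing isn't helping, but I seem to be getting nowhere fast.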

Here is a snippet of the HTML:

<select multiple="true" name="asilomarCondition" id="asilomarCondition"> 

    <option value="101"> 
     Behavior- Aggression, Confrontational-Toward People (mild) 
     - 
     TM</option> 
.... 
</select>

There is only one <select ...> tag; all the rest are <option value=x> tags.

I have been using the XML library. I can remove the newlines and tabs, but I have had no success removing the tags:

conditions.html <- paste(readLines("Data/evalconditions.txt"), collapse="\n") 
conditions.text <- gsub('[\t\n]',"",conditions.html)

As a final result, I would like a list of all the conditions that I can process further for later use as factor names:

Behavior- Aggression, Confrontational-Toward People (mild)-TM 
Behavior- Aggression, Confrontational-Toward People (moderate/severe)-UU 
...

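I don't know whether I need the XML library (or another library), or whether a gsub pattern would be sufficient (either way, I need to figure out how to use it).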
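On the gsub question, here is a minimal base R sketch, assuming the file content looks exactly like the snippet above. The HTML is inlined as a stand-in for Data/evalconditions.txt, and the variable names options.raw and conditions are only illustrative. Regular expressions are fragile on real HTML (nested tags, entities, attributes containing ">"), so treat this as a quick check rather than a robust parser.

```r
# Inline copy of the snippet from the question
# (a stand-in for reading Data/evalconditions.txt)
conditions.html <- paste(
  '<select multiple="true" name="asilomarCondition" id="asilomarCondition">',
  '    <option value="101">',
  '     Behavior- Aggression, Confrontational-Toward People (mild)',
  '     -',
  '     TM</option>',
  '</select>',
  sep = "\n"
)

# Extract every <option ...>...</option> element.
# (?s) lets "." match newlines; ".*?" is lazy so each option is matched separately.
m <- gregexpr("(?s)<option[^>]*>.*?</option>", conditions.html, perl = TRUE)
options.raw <- regmatches(conditions.html, m)[[1]]

# Strip the tags, then drop each line break together with the whitespace around it
conditions <- gsub("<[^>]+>", "", options.raw)
conditions <- trimws(gsub("\\s*\n\\s*", "", conditions))

print(conditions)
# [1] "Behavior- Aggression, Confrontational-Toward People (mild)-TM"
```

Extracting the option elements first, before collapsing whitespace, keeps each condition as a separate element of the vector instead of one long string.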

Source

2016-08-11 kimbekaw

Can you point to the full URL with that select box, or expand the snippet? – hrbrmstr

I find the rvest package easier to use. If you can provide a link to the website, someone can write up a solution for you. – Dave2e

It is HTML. It's a select list in a form, @alistaire – hrbrmstr

Here is a start using the rvest package:

library(rvest)
# read the HTML page
page <- read_html("test.html")
# get the text from the "option" nodes and trim the surrounding whitespace
nodes <- trimws(html_text(html_nodes(page, "option")))

# the text still contains line breaks and indentation from the page layout;
# drop each line break together with the whitespace around it,
# then collapse any remaining runs of spaces into a single space
nodes <- gsub("\\s*\n\\s*", "", nodes)
nodes <- gsub(" {2,}", " ", nodes)

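The vector nodes should be the result you asked for. This example is based on the limited sample provided above, so the actual page may produce unexpected results.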
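To check the approach without the real page, the sketch below runs rvest on an inline copy of the snippet from the question. The second option is reconstructed from the desired output listed in the question, and its value "102" is hypothetical. The names labels, values and conditions are illustrative only, and the real page may need further clean-up.

```r
library(rvest)

# Inline copy of the snippet from the question. The second <option> is
# reconstructed from the desired output; its value "102" is hypothetical.
html <- '<select multiple="true" name="asilomarCondition" id="asilomarCondition">
    <option value="101">
     Behavior- Aggression, Confrontational-Toward People (mild)
     -
     TM</option>
    <option value="102">
     Behavior- Aggression, Confrontational-Toward People (moderate/severe)
     -
     UU</option>
</select>'

# read_html() also accepts literal HTML, not only a file path or URL
page <- read_html(html)

# Restrict the selection to options inside the select list from the question
opts <- html_nodes(page, "#asilomarCondition option")

# Option text with line breaks and surrounding whitespace removed
labels <- gsub("\\s*\n\\s*", "", trimws(html_text(opts)))

# The value attribute of each option, in case the codes are needed later
values <- html_attr(opts, "value")

# Use the cleaned labels as factor levels, keeping the order from the page
conditions <- factor(labels, levels = labels)
print(levels(conditions))
```

Passing levels = labels keeps the conditions in page order; by default factor() would sort the levels alphabetically.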

Source

2016-08-12 23:02:04 Dave2e

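Thanks, @Dave2e! This works perfectly! I still have some extra characters to clean up, but that is easy to handle with your example. On to the rest of the data cleaning! :o – kimbekaw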
