2017-04-12 34 views
How to extract hyperlinks that satisfy a condition in R

I have the following situation:

library(XML) # provides htmlTreeParse, xpathSApply and xmlGetAttr
# Download the page source to a local file
download.file("http://stats.espncricinfo.com/ci/engine/records/team/match_results_year.html?class=2;id=6;type=team", 
       "dataDictionary.html") 
docHtml = htmlTreeParse("dataDictionary.html", useInternalNodes = TRUE) # parse the saved page
links <- xpathSApply(docHtml, path = "//a", xmlGetAttr, "href") # href attribute of every <a> element

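Now I need to extract, from the page source, the hyperlinks that look like "/ci/engine/records/team/match_results.html?class=2;id=*". Here the * stands for any value; the links that satisfy this condition must be stored in another variable. Any help?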

Answer

You can get all of the links you are interested in with grep:

GoodLinks = grep("/ci/engine/records/team/match_results.html\\?class=2;id", links) 

If you only want the id field, you can process the detected links with sub:

sub(".*id=(\\d+).*", "\\1", links[GoodLinks]) 
[1] "1974" "1975" "1976" "1978" "1979" "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989" "1990" 
[17] "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" 
[33] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
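
To try the same two steps without downloading the page, here is a minimal sketch on a small hand-written vector. The vector `sample_links` and the exact shape of its entries (including the `;type=year` suffix) are assumptions made for illustration, not values taken from the page. It also uses `fixed = TRUE` and `value = TRUE`, which is a variation on the call above rather than the same call: the pattern is matched literally, and the matching strings are returned instead of their positions.

```r
# Hypothetical href values standing in for the scraped `links` vector
sample_links <- c(
  "/ci/engine/records/team/match_results.html?class=2;id=1974;type=year",
  "/ci/content/records/index.html",
  "/ci/engine/records/team/match_results.html?class=2;id=2017;type=year",
  "http://www.espncricinfo.com/"
)

# fixed = TRUE treats the pattern as a literal string, so "?" and "." need no escaping;
# value = TRUE returns the matching strings rather than their indices
good_links <- grep("/ci/engine/records/team/match_results.html?class=2;id=",
                   sample_links, fixed = TRUE, value = TRUE)

# Keep only the digits that follow "id=" in each matching link
ids <- sub(".*id=(\\d+).*", "\\1", good_links)
print(ids)
```

With `value = TRUE` the result can be stored directly in another variable, as the question asks, without indexing back into `links`.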