Scraping from one URL to another in R

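My question is about getting R to read URL links. The example I use is for illustration only. Suppose I have the following web page (chosen at random) that I want to read: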

https://www.mcdb.ucla.edu/faculty

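It lists professors with a URL link for each one. I want to build a script that can read pages like this one, visit each URL link, and search each professor's publications for certain keywords.

I currently have a script that scans a single page for certain keywords, which I post below.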

library(rvest)
library(dplyr)
library(stringr)

# Read the raw HTML of one faculty page, one element per line
prof <- readLines("https://www.mcdb.ucla.edu/faculty/jsadams")

# One row per line of the page
text_df <- tibble(text = prof)

keywords <- c("nonskeletal", "antimicrobial response")

# Keep the lines that contain any of the keywords
text_df %>%
  filter(str_detect(text, str_c(keywords, collapse = "|")))

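This should return publications 1, 2 and 4 under the "Selected Publications" section of the professor's page.

Now I am trying to get R to read every professor's page from the faculty listing (https://www.mcdb.ucla.edu/faculty) and check whether each professor has publications containing the keywords above.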

Read: https://www.mcdb.ucla.edu/faculty
Visit each link and read each faculty page:
Return if "keywords" = TRUE:
List the professors or publications that have the "keywords" in the text:

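I have been able to do this for each individual page, but I would prefer a loop or a function so that I don't have to copy and paste each professor's page URL every time.

Just a slight disclaimer: I have no affiliation with UCLA or the professors on the website; the professor URL I chose simply happened to be the first one listed on the faculty page.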

Source

2017-10-05 user113156

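I would do it as follows. This is "quick and dirty" code, but hopefully it provides a basis for something better.

First, you need the right selector to get the faculty names and the links to their pages. Use that information to create a data frame: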

library(dplyr)
library(rvest)
library(tidytext)

page <- read_html("https://www.mcdb.ucla.edu/faculty")

# Anchor tags in the first table: one per faculty member
table1 <- page %>%
  html_nodes(xpath = "//table[1]/tr/td/a")

# The link text is the faculty member's name
names <- table1 %>%
  html_text()

# The href attribute is the relative link to their page
links <- table1 %>%
  html_attr("href")

data1 <- data.frame(name = names, href = links, stringsAsFactors = FALSE)
head(data1)

               name              href
1        John Adams  /faculty/jsadams
2    Utpal Banerjee /faculty/banerjee
3 Siobhan Braybrook /faculty/siobhanb
4     Jau-Nian Chen   /faculty/chenjn
5     Amander Clark   /faculty/clarka
6       Daniel Cohn    /faculty/dcohn
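
The href values are relative paths, so they need a base URL before they can be fetched. Here is a minimal sketch of the same extraction on an inline HTML fragment, so it runs without network access. The fragment is made up for illustration and the real page markup may differ; `faculty` and `anchors` are names chosen here. It uses `html_attr("href")` to pull just the link target and `xml2::url_absolute()` to resolve it against the site root.

```r
library(rvest)
library(xml2)

# Hypothetical fragment standing in for the faculty table (not the real markup)
html <- read_html('<table>
  <tr><td><a href="/faculty/jsadams">John Adams</a></td></tr>
  <tr><td><a href="/faculty/banerjee">Utpal Banerjee</a></td></tr>
</table>')

# Every link inside a cell of the first table
anchors <- html_nodes(html, xpath = "//table[1]//td/a")

faculty <- data.frame(
  name = html_text(anchors),
  # Resolve each relative href against the site root
  url = url_absolute(html_attr(anchors, "href"), "https://www.mcdb.ucla.edu/"),
  stringsAsFactors = FALSE
)
```

With absolute URLs in the data frame, a per-page function could call `read_html()` on the `url` column directly instead of pasting the host name onto each href.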

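Next, you need a function that takes a value from the href column, fetches the faculty member's page, and looks for the keywords. I took a different approach, using tidytext to break all the publications into single words and then counting the rows where any keyword occurs. This means "antimicrobial response" has to be two separate words, so you may want to do something different.

The function returns a count greater than 0 if any of the keywords are present.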

get_pubs <- function(href) {
  # Fetch the faculty member's page from its relative link
  page <- read_html(paste0("https://www.mcdb.ucla.edu", href))

  # One row per publication paragraph
  pubs <- data.frame(
    text = page %>%
      html_nodes("div.mcdb-faculty-pubs p") %>%
      html_text(),
    stringsAsFactors = FALSE
  )

  # Split into single words (lowercased by default), then count keyword hits
  pubs %>%
    unnest_tokens(word, text) %>%
    filter(word %in% c("nonskeletal", "antimicrobial", "response")) %>%
    nrow()
}
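
If splitting "antimicrobial response" into two words is too loose, one alternative is to match whole phrases against each publication string. The sketch below is not the approach used above: it swaps tokenization for a case-insensitive regular expression built with stringr. The publication strings are made up, standing in for the text that `html_text()` would return, and keywords containing regex metacharacters would need escaping first.

```r
library(stringr)

# Made-up publication strings standing in for the scraped paragraph text
pubs <- c(
  "Example paper on the antimicrobial response in cells.",
  "Example review of Nonskeletal actions of a vitamin.",
  "Example paper on an unrelated topic."
)
keywords <- c("nonskeletal", "antimicrobial response")

# One case-insensitive pattern that matches any keyword or whole phrase
pattern <- regex(str_c(keywords, collapse = "|"), ignore_case = TRUE)

# Publications that mention at least one keyword, and how many there are
matched <- pubs[str_detect(pubs, pattern)]
n_matched <- length(matched)
```

Keeping `matched` as well as the count makes it possible to list the publications themselves, not just flag the faculty member.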

Now you can apply the function to each href:

data1 <- data1 %>% 
    mutate(count = sapply(href, function(x) get_pubs(x)))
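
When looping over many pages, a single failed request would abort the whole `sapply()` call. A hedged sketch of one way to guard against that is shown here: a wrapper that pauses between requests and returns `NA` on error. `safely_count` and `fake_get_pubs` are hypothetical names; the fake function stands in for a real page fetcher so the example runs offline.

```r
# Hypothetical stand-in for a page fetcher: fails for one href to mimic a broken page
fake_get_pubs <- function(href) {
  if (href == "/faculty/missing") stop("HTTP error 404")
  nchar(href)
}

# Wrap a per-page function so a failure yields NA instead of stopping the loop
safely_count <- function(href, fetch, pause = 0) {
  Sys.sleep(pause)  # optional delay between requests, to go easy on the server
  tryCatch(fetch(href), error = function(e) NA_integer_)
}

hrefs <- c("/faculty/jsadams", "/faculty/missing")

# vapply enforces that every result is a single integer
counts <- vapply(hrefs, safely_count, integer(1),
                 fetch = fake_get_pubs, USE.NAMES = FALSE)
```

Rows with `NA` can then be retried or inspected separately, and `filter(count > 0)` drops them automatically.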

Which faculty members have at least one of the keywords in their publications?

data1 %>% 
    filter(count > 0) 

               name               href count
1        John Adams   /faculty/jsadams     9
2         Arjun Deb      /faculty/adeb     1
3     Tracy Johnson /faculty/tljohnson     1
4       Chentao Lin      /faculty/clin     1
5      Jeffrey Long /faculty/jeffalong     1
6 Matteo Pellegrini   /faculty/matteop     1

Source

2017-10-05 23:15:46 neilfws
