2017-04-27 58 views

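Collecting data using R - multiple URLs

I have a data frame with several columns and rows. Some cells contain information, while others are filled with NA and should be replaced with certain data.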

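Each row represents a particular instrument, and the columns contain various details of the instrument in that row. The last column of the data frame holds a URL for each instrument, which will then be used to fetch the data for the empty columns: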

  Issuer  NIN or ISIN           Type Nominal Value # of Bonds Issue Volume Start Date End Date
1   NBRK KZW1KD079112 discount notes            NA         NA           NA         NA       NA
2   NBRK KZW1KD079146 discount notes            NA         NA           NA         NA       NA
3   NBRK KZW1KD079153 discount notes            NA         NA           NA         NA       NA
4   NBRK KZW1KD089137 discount notes            NA         NA           NA         NA       NA

URL 
1 http://www.kase.kz/en/gsecs/show/NTK007_1911 
2 http://www.kase.kz/en/gsecs/show/NTK007_1914 
3 http://www.kase.kz/en/gsecs/show/NTK007_1915 
4 http://www.kase.kz/en/gsecs/show/NTK008_1913 

For example, with the following code I get the details for the first instrument, in the row NBRK KZW1KD079112:

sp = readHTMLTable(newd$URL[[1]]) 
sp[[4]] 

which gives the following:

                  V1                    V2 
1          Trading code               NTK007_1911 
2        List of securities               official 
3        System of quotation                price 
4        Unit of quotation         nominal value percentage fraction 
5        Quotation currency                 KZT 
6        Quotation accuracy              4 characters 
7      Trade lists admission date               04/21/17 
8        Trade opening date               04/24/17 
9      Trade lists exclusion date               04/28/17 
10          Security                <NA> 
11          Bond's name short-term notes of the National Bank of the Republic of Kazakhstan 
12           NSIN              KZW1KD079112 
13     Currency of issue and service                 KZT 
14    Nominal value in issue's currency                100.00 
15      Number of registered bonds              1,929,319,196 
16      Number of bonds outstanding              1,929,319,196 
17        Issue volume, KZT              192,931,919,600 
18 Settlement basis (days in month/days in year)              actual/365 
19      Date of circulation start               04/21/17 
20       Circulation term, days                 7 
21    Register fixation date at maturity               04/27/17 
22      Principal repayment date               04/28/17 
23         Paying agent       Central securities depository JSC (Almaty) 
24          Registrar       Central securities depository JSC (Almaty) 

From this, I will keep only:

14    Nominal value in issue's currency                100.00 
16      Number of bonds outstanding              1,929,319,196 
17        Issue volume, KZT              192,931,919,600 
19      Date of circulation start               04/21/17 
22      Principal repayment date               04/28/17 

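Then I would copy the required data into the initial data frame and move on to the next row... The data frame consists of more than 100 rows and will keep changing.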

I would appreciate any help.

UPDATE:

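It looks like the data I need is not always in sp[[4]]. Sometimes it is in sp[[7]], and in the future it may be a completely different table. Is there a way to search the scraped tables for the information and identify the specific table that can then be used to collect the data?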

sp = readHTMLTable(newd$URL[[1]]) 
sp[[4]] 

Answer

library(XML) 
library(reshape2) 
library(dplyr) 

name = c(
"NBRK KZW1KD079112 discount notes",           
"NBRK KZW1KD079146 discount notes",           
"NBRK KZW1KD079153 discount notes",           
"NBRK KZW1KD089137 discount notes")           

URL = c(
"http://www.kase.kz/en/gsecs/show/NTK007_1911", 
"http://www.kase.kz/en/gsecs/show/NTK007_1914", 
"http://www.kase.kz/en/gsecs/show/NTK007_1915", 
"http://www.kase.kz/en/gsecs/show/NTK008_1913") 

# data 
instruments <- data.frame(name, URL, stringsAsFactors = FALSE) 

# define the columns wanted and the mapping to desired name 
# extend to all wanted columns 
wanted <- c("Nominal value in issue's currency" = "Nominal Value", 
      "Number of bonds outstanding" = "# of Bonds Issue") 

# function returns a data frame of wanted columns for given URL 
getValues <- function (name, url) { 
    # get the table and rename columns 
    sp = readHTMLTable(url, stringsAsFactors = FALSE) 
    df <- sp[[4]] 
    names(df) <- c("full_name", "value") 

    # filter and remap wanted columns 
    result <- df[df$full_name %in% names(wanted),] 
    result$column_name <- sapply(result$full_name, function(x) {wanted[[x]]}) 

    # add the identifier to every row 
    result$name <- name 
    return (result[,c("name", "column_name", "value")]) 
} 

# invoke function for each name/URL pair - returns list of data frames 
columns <- apply(instruments[,c("name", "URL")], 1, function(x) {getValues(x[["name"]], x[["URL"]])}) 

# bind using dplyr::bind_rows to make a tall data frame 
tall <- bind_rows(columns) 

# make wide using dcast from reshape2 
wide <- dcast(tall, name ~ column_name, value.var = "value") 

wide 

#        name # of Bonds Issue Nominal Value 
# 1 NBRK KZW1KD079112 discount notes 1,929,319,196  100.00 
# 2 NBRK KZW1KD079146 discount notes 1,575,000,000  100.00 
# 3 NBRK KZW1KD079153 discount notes  701,390,693  100.00 
# 4 NBRK KZW1KD089137 discount notes 1,380,368,000  100.00 
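
The values in the output above are still character strings with thousands separators, so they would likely need converting before any arithmetic. A minimal sketch, assuming the column names from the output above; the helper `to_number` is a hypothetical name, and the sample rows are copied from that output rather than fetched:

```r
# Sample of the wide result shown above (values copied from the output).
wide <- data.frame(
  name = c("NBRK KZW1KD079112 discount notes",
           "NBRK KZW1KD079146 discount notes"),
  `# of Bonds Issue` = c("1,929,319,196", "1,575,000,000"),
  `Nominal Value` = c("100.00", "100.00"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: strip thousands separators, then convert to numeric.
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

wide[["# of Bonds Issue"]] <- to_number(wide[["# of Bonds Issue"]])
wide[["Nominal Value"]]    <- to_number(wide[["Nominal Value"]])

# Dates on the page look like "04/21/17", i.e. month/day/two-digit year.
start_date <- as.Date("04/21/17", format = "%m/%d/%y")
```

The same `as.Date` call should apply to the "Date of circulation start" and "Principal repayment date" values once they are added to the `wanted` mapping.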


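Thanks a lot anyway. Your code works, but when I try to run it for all instruments I get the following error: `Error in $<-.data.frame(*tmp*, "name", value = "KZW1KD919127") : replacement has 1 row, data has 0`. Any idea why this happens? – AK88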


Hmm... I just finished checking this particular instrument and found that the data I need is not in `sp[[4]]` but in `sp[[7]]`. Is there some way to account for cases like this? – AK88
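
One way to avoid depending on the table's position is to pick the table by its content. The following is only a sketch: `find_table` is a hypothetical helper, and the list `sp` is a hand-made stand-in for what `readHTMLTable` returns, so nothing is downloaded here.

```r
# Return the first table whose first column contains all the wanted labels,
# or NULL if no table matches. `tables` is a list such as readHTMLTable() returns.
find_table <- function(tables, labels) {
  for (tbl in tables) {
    if (is.data.frame(tbl) && ncol(tbl) >= 2 &&
        all(labels %in% as.character(tbl[[1]]))) {
      return(tbl)
    }
  }
  NULL
}

# Stand-in for readHTMLTable() output: the wanted table sits at position 3 here.
sp <- list(
  NULL,
  data.frame(V1 = c("a", "b"), V2 = c("1", "2"), stringsAsFactors = FALSE),
  data.frame(V1 = c("Trading code", "Nominal value in issue's currency",
                    "Number of bonds outstanding"),
             V2 = c("NTK007_1911", "100.00", "1,929,319,196"),
             stringsAsFactors = FALSE)
)

labels <- c("Nominal value in issue's currency", "Number of bonds outstanding")
df <- find_table(sp, labels)                      # the third table
not_found <- find_table(sp, "No such label")      # NULL: nothing matched
```

Inside `getValues`, `sp[[4]]` could then be replaced by `find_table(sp, names(wanted))`, returning early when the result is `NULL`. That would also sidestep the "replacement has 1 row, data has 0" error, which arises when the filtered table is empty and `result$name <- name` is assigned to zero rows.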


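A not-so-nice hack would be something like `df <- if_else(name == "foo", sp[[4]], sp[[7]])`. I generally prefer `dplyr`'s `if_else` over `ifelse` because it preserves class. A better approach is to learn to use `library(rvest)`, since it supports CSS selectors in its `html_nodes` function, which can target an `id` attribute in the HTML rather than a position. – epi99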
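
To illustrate the rvest suggestion, here is a small sketch that selects a table by `id` with a CSS selector. The inline HTML and the id `bond-details` are made up for the example; the real page would have to be inspected to see whether its tables carry an `id` at all.

```r
library(rvest)

# Tiny stand-in page; the id "bond-details" is hypothetical.
page <- read_html('
  <html><body>
    <table id="other"><tr><td>x</td><td>y</td></tr></table>
    <table id="bond-details">
      <tr><td>NSIN</td><td>KZW1KD079112</td></tr>
      <tr><td>Circulation term, days</td><td>7</td></tr>
    </table>
  </body></html>')

# Select the table by its id attribute instead of its position on the page.
node <- html_node(page, "#bond-details")
details <- as.data.frame(html_table(node, header = FALSE))
names(details) <- c("full_name", "value")
```

With a real URL, `read_html(url)` would replace the inline string, and the rest of `getValues` could work on `details` in the same way it works on `sp[[4]]`.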