2014-12-02 75 views

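Merging code values for categories into a dataset in R

I have a political donation dataset that stores industry categories as alphanumeric codes. A separate text file lists how these alphanumeric codes translate into an industry name, a sector name, and an industry category name.

For example, "A1200" is the Sugar cane category, in the Crop Production industry of the Agribusiness sector. I'd like to know how to pair each alphanumeric code with its sector, industry, and category values, each in a separate column.

Right now, the code values dataset looks like this: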

Catcode  Catname     Catorder  Industry         Sector
A1200    Sugar cane  A01       Crop Production  Agribusiness

and this is the industry donation dataset:

Business name  Amount donated  Year  Category
Sarah Farms    1000            2010  A1200

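The categories dataset has about 444 rows and the donation dataset has about 1M rows. How would I go about getting the donation dataset to look like this? The category code would be the common key.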

Catcode  Catname     Catorder  Industry         Sector        Business name  Amount donated  Year  Category
A1200    Sugar cane  A01       Crop Production  Agribusiness  Sarah Farms    1000            2010  A1200

I'm fairly new to these forums, so if there is a better way to ask this question, please let me know. Thanks for your help!


Try the 'merge()' function with the 'by.x' and 'by.y' arguments. Also see http://stackoverflow.com/q/5963269/946850 for ways to improve the question. – krlmlr 2014-12-02 02:51:06

Answers


If speed is an issue, you may want to use data.table or dplyr. Here I have modified your sample data a little to give you some ideas.

df1 <- data.frame(Catcode = c("A1200", "B1500", "C1800"), 
        Catname = c("Sugar", "Salty", "Butter"), 
        Catorder = c("cane A01", "cane A01", "cane A01"), 
        Industry = c("Crop Production", "Crop Production", "Crop Production"), 
        Sector = c("Agribusiness", "Agribusiness", "Agribusiness"), 
        stringsAsFactors = FALSE) 

# Catcode Catname Catorder  Industry  Sector 
#1 A1200 Sugar cane A01 Crop Production Agribusiness 
#2 B1500 Salty cane A01 Crop Production Agribusiness 
#3 C1800 Butter cane A01 Crop Production Agribusiness 

df2 <- data.frame(BusinessName = c("Sarah Farms", "Ben Farms"), 
        AmountDonated = c(100, 200), 
        Year = c(2010, 2010), 
        Category = c("A1200", "B1500"), 
        stringsAsFactors = FALSE) 

# BusinessName AmountDonated Year Category 
#1 Sarah Farms   100 2010 A1200 
#2 Ben Farms   200 2010 B1500 

library(dplyr) 
library(data.table) 

# 1) dplyr option 
# Catcode C1800 will be dropped since it does not exist in both data frames. 
inner_join(df1, df2, by = c("Catcode" = "Category")) 

#  Catcode Catname Catorder  Industry  Sector BusinessName AmountDonated Year 
#1 A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms   100 2010 
#2 B1500 Salty cane A01 Crop Production Agribusiness Ben Farms   200 2010 

# Catcode C1800 remains, with NA in the columns that come from df2.
left_join(df1, df2, by = c("Catcode" = "Category")) 

#  Catcode Catname Catorder  Industry  Sector BusinessName AmountDonated Year 
#1 A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms   100 2010 
#2 B1500 Salty cane A01 Crop Production Agribusiness Ben Farms   200 2010 
#3 C1800 Butter cane A01 Crop Production Agribusiness   <NA>   NA NA 

# 2) data.table option 
# Convert data.frame to data.table 
setDT(df1) 
setDT(df2) 

#Set columns for merge 
setkey(df1, "Catcode") 
setkey(df2, "Category") 

df1[df2] 

# Catcode Catname Catorder  Industry  Sector BusinessName AmountDonated Year 
#1: A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms   100 2010 
#2: B1500 Salty cane A01 Crop Production Agribusiness Ben Farms   200 2010 

df2[df1] 
# BusinessName AmountDonated Year Category Catname Catorder  Industry  Sector 
#1: Sarah Farms   100 2010 A1200 Sugar cane A01 Crop Production Agribusiness 
#2: Ben Farms   200 2010 B1500 Salty cane A01 Crop Production Agribusiness 
#3:   NA   NA NA C1800 Butter cane A01 Crop Production Agribusiness 
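
Since the question wants every donation row kept with the category details attached, it may help to put the donation data on the left side of the join. Below is a minimal dplyr sketch of that idea. The data frames `donations` and `codes`, the "Unknown Co" row, and the code "Z9999" are made up for illustration and are not from the original data. `anti_join()` is used here only as one possible way to list donations whose code has no match.

```r
library(dplyr)

# Hypothetical miniature lookup table (stands in for the ~444-row code file).
codes <- data.frame(Catcode  = c("A1200", "B1500"),
                    Industry = c("Crop Production", "Crop Production"),
                    Sector   = c("Agribusiness", "Agribusiness"),
                    stringsAsFactors = FALSE)

# Hypothetical miniature donation table; "Z9999" is an invented code
# that deliberately has no match in the lookup table.
donations <- data.frame(BusinessName  = c("Sarah Farms", "Ben Farms", "Unknown Co"),
                        AmountDonated = c(100, 200, 50),
                        Year          = c(2010, 2010, 2010),
                        Category      = c("A1200", "B1500", "Z9999"),
                        stringsAsFactors = FALSE)

# Keep every donation row; unmatched codes get NA in the lookup columns.
joined <- left_join(donations, codes, by = c("Category" = "Catcode"))

# List the donations whose category code is missing from the lookup table.
unmatched <- anti_join(donations, codes, by = c("Category" = "Catcode"))

print(joined)
print(unmatched)
```

With the donation table on the left, the result has one row per donation as long as each code appears only once in the lookup table, which seems to be the case for a code list like the one described.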

I think you are asking how to write the query, aren't you?

SELECT *
FROM code_values AS a        -- placeholder name: your code values table
LEFT JOIN donations AS b     -- placeholder name: your industry donation table
  ON a.Catcode = b.Category;

As krlmlr suggested:

> merge(df1, df2, by.x = "Catcode", by.y = "Category", all = TRUE) 
    Catcode Catname Catorder  Industry  Sector Business_name Amount_donated Year 
1 A1200 Sugar_cane  A01 Crop_Production Agribusiness Sarah_Farms   1000 2010 

But avoid spaces in column names and values. I replaced them with _.
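
To make the two points above concrete, here is a rough base R sketch that renames columns containing spaces and then merges with the donation table as `x`. The object names `codes` and `donations`, the "Unknown Co" row, and the code "Z9999" are assumptions made up for this example. `all.x = TRUE` is used instead of `all = TRUE` so that only donation rows are guaranteed to be kept.

```r
# Lookup row taken from the question.
codes <- data.frame(Catcode  = "A1200",
                    Catname  = "Sugar cane",
                    Catorder = "A01",
                    Industry = "Crop Production",
                    Sector   = "Agribusiness",
                    stringsAsFactors = FALSE)

# Donation rows; the second row and its code "Z9999" are invented.
# check.names = FALSE keeps the original headers with spaces for now.
donations <- data.frame("Business name"  = c("Sarah Farms", "Unknown Co"),
                        "Amount donated" = c(1000, 50),
                        Year             = c(2010, 2010),
                        Category         = c("A1200", "Z9999"),
                        check.names = FALSE,
                        stringsAsFactors = FALSE)

# Replace spaces in the column names with underscores.
names(donations) <- gsub(" ", "_", names(donations), fixed = TRUE)

# Keep all donation rows (x); lookup columns are NA where no code matches.
merged <- merge(donations, codes,
                by.x = "Category", by.y = "Catcode",
                all.x = TRUE)

print(merged)
```

In this sketch the values keep their spaces, because the data frames are built directly rather than read from a whitespace-delimited file.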
