2017-10-13 103 views
1

Compare two data frames based on common columns

I have two CSV files:

File 1:

SN CY Year Month Day Hour Lat Lon 
196101 1 1961 1 14 12 8.3 134.7 
196101 1 1961 1 14 18 8.8 133.4 
196101 1 1961 1 15 0 9.1 132.5 
196101 1 1961 1 15 6 9.3 132.2 
196101 1 1961 1 15 12 9.5 132 
196101 1 1961 1 15 18 9.9 131.8 

File 2:

Year Month Day RR Hour Lat Lon 
1961 1 14 0 0 14.0917 121.055 
1961 1 14 0 6 14.0917 121.055 
1961 1 14 0 12 14.0917 121.055 
1961 1 14 0 18 14.0917 121.055 
1961 1 15 0 0 14.0917 121.055 
1961 1 15 0 6 14.0917 121.055 

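I want to add another column to file2 and put "TRUE" in it if the row in file2 also exists in file1, that is, if both rows have the same Year, Month, Day and Hour, and "FALSE" otherwise. Then I want to save the result as a CSV file.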

Desired output:

Year Month Day RR Hour Lat Lon  com 
1961 1 14 0 0 14.0917 121.055 FALSE 
1961 1 14 0 6 14.0917 121.055 FALSE 
1961 1 14 0 12 14.0917 121.055 TRUE 
1961 1 14 0 18 14.0917 121.055 TRUE 
1961 1 15 0 0 14.0917 121.055 TRUE 
1961 1 15 0 6 14.0917 121.055 TRUE 

Here is my script:

jtwc <- read.csv("file1.csv",header=T,sep=",") 
stn <- read.csv("file2.csv",header=T,sep=",") 

if ((jtwc$Year == "stn$YY") & (jtwc$Month == "stn$MM") & (jtwc$Day == "stn$DD") &(jtwc$Hour == "stn$HH")){ 
stn$com <- "TRUE" 
} else { 
stn$com <- "FALSE" 
} 
write.csv(stn,file="test.csv",row.names=T) 

This gives an error:

In if ((jtwc$Year == "stn$YY") & (jtwc$Month == "stn$MM") & (jtwc$Day == :the condition has length > 1 and only the first element will be used 
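The message appears because `if` expects a single TRUE/FALSE value, while `jtwc$Year == "stn$YY"` compares a whole column against the literal string "stn$YY" and returns one value per row. (From R 4.2.0 onward a condition of length greater than one is an error rather than a warning.) Note also that file2 names its columns Year, Month, Day and Hour, not YY, MM, DD and HH. A minimal vectorised sketch, using sample rows copied from the two files above and a pasted key that is only one of several possible approaches:

```r
# Sample rows copied from the two files in the question
jtwc <- data.frame(
  Year = 1961, Month = 1,
  Day  = c(14, 14, 15, 15, 15, 15),
  Hour = c(12, 18, 0, 6, 12, 18)
)
stn <- data.frame(
  Year = 1961, Month = 1,
  Day  = c(14, 14, 14, 14, 15, 15),
  RR   = 0,
  Hour = c(0, 6, 12, 18, 0, 6),
  Lat  = 14.0917, Lon = 121.055
)

# Build one text key per row from the four shared columns
key_cols <- c("Year", "Month", "Day", "Hour")
jtwc_key <- do.call(paste, c(jtwc[key_cols], sep = "-"))
stn_key  <- do.call(paste, c(stn[key_cols], sep = "-"))

# %in% is vectorised: it yields one TRUE/FALSE per row of stn
stn$com <- stn_key %in% jtwc_key
stn$com
```

For these sample rows the new column is FALSE for the first two rows of stn and TRUE for the remaining four.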

Please provide a reproducible example, for instance by posting the output of dput(head(YOURDATA)). –

Answers

1

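A quick and dirty solution using data.table:

  1. Use fread to read the files.
  2. Select the needed columns from file1 (since you are only interested in file2).
  3. Use merge.
  4. Where a row of the merged result has no match from file1, set com to FALSE.

Code: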

library(data.table) 
result <- merge(fread("file2.csv"), 
       fread("file1.csv")[, .(Year, Month, Day, Hour, com = TRUE)], 
       all.x = TRUE)[is.na(com), com := FALSE] 

result 
    Year Month Day Hour RR  Lat  Lon com 
1: 1961  1 14 0 0 14.0917 121.055 FALSE 
2: 1961  1 14 6 0 14.0917 121.055 FALSE 
3: 1961  1 14 12 0 14.0917 121.055 TRUE 
4: 1961  1 14 18 0 14.0917 121.055 TRUE 
5: 1961  1 15 0 0 14.0917 121.055 TRUE 
6: 1961  1 15 6 0 14.0917 121.055 TRUE 
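The same four steps can also be sketched in base R, which additionally covers the "save as CSV" part of the question. This is an illustrative variant under stated assumptions, not the answer's own method: the data frames are built inline from the rows shown in the question (in practice they would come from `read.csv("file1.csv")` and `read.csv("file2.csv")`), and the names `marker` and `out_file` are made up for the example.

```r
# Sample rows copied from the question
file1 <- data.frame(
  SN = 196101, CY = 1, Year = 1961, Month = 1,
  Day  = c(14, 14, 15, 15, 15, 15),
  Hour = c(12, 18, 0, 6, 12, 18),
  Lat  = c(8.3, 8.8, 9.1, 9.3, 9.5, 9.9),
  Lon  = c(134.7, 133.4, 132.5, 132.2, 132, 131.8)
)
file2 <- data.frame(
  Year = 1961, Month = 1,
  Day  = c(14, 14, 14, 14, 15, 15),
  RR   = 0,
  Hour = c(0, 6, 12, 18, 0, 6),
  Lat  = 14.0917, Lon = 121.055
)

key_cols <- c("Year", "Month", "Day", "Hour")

# Step 2: keep only the key columns of file1 and mark them as present;
# unique() stops duplicated keys from multiplying rows in the merge
marker <- unique(file1[key_cols])
marker$com <- TRUE

# Step 3: left join onto file2 (all.x = TRUE keeps every file2 row)
result <- merge(file2, marker, by = key_cols, all.x = TRUE)

# Step 4: rows without a match in file1 get FALSE
result$com[is.na(result$com)] <- FALSE

# Save as CSV; a temporary path stands in for "test.csv" from the question
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file, row.names = FALSE)
```

Passing `row.names = FALSE` keeps the row index out of the file, which differs from the `row.names=T` used in the question's script.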

Thank you very much! – Lyndz

3

You can also use dplyr/tidyverse:

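library(tidyverse) 
# d1 and d2 are assumed to hold file1 and file2, e.g. read in like this 
d1 <- read.csv("file1.csv") 
d2 <- read.csv("file2.csv") 
d2 %>% 
    # Lon from file1 serves only as a marker: it is NA where no row matched 
    left_join(select(d1, Year, Month, Day, Hour, Com=Lon)) %>% 
    mutate(Com=ifelse(is.na(Com), FALSE, TRUE)) 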

Joining, by = c("Year", "Month", "Day", "Hour") 
    Year Month Day RR Hour  Lat  Lon Com 
1 1961  1 14 0 0 14.0917 121.055 FALSE 
2 1961  1 14 0 6 14.0917 121.055 FALSE 
3 1961  1 14 0 12 14.0917 121.055 TRUE 
4 1961  1 14 0 18 14.0917 121.055 TRUE 
5 1961  1 15 0 0 14.0917 121.055 TRUE 
6 1961  1 15 0 6 14.0917 121.055 TRUE 

Thank you very much for your help! This works too. – Lyndz