2017-03-06

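How to exclude non-numeric columns in a dplyr statement

I am trying to solve the following problem: find the mean of every numeric column for each value of Pop_Size_Group. I need an efficient way to exclude any non-numeric variables.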

Here is what I have so far:

library(dplyr)
df <- tbl_df(Demographics)

df %>% 
    group_by(Pop_Size_Group) %>% 
    summarise_each(funs(mean(., na.rm = TRUE))) 

The code produces this:

> df <- tbl_df(Demographics) 
> df %>% 
+ group_by(Pop_Size_Group) %>% 
+ summarise_each(funs(mean(., na.rm = TRUE))) 

# A tibble: 3 × 18 
    Pop_Size_Group County_name State Region_num Location Square_miles Population Pct_Age18_to_34 Pct_65_or_over 
      <chr>  <lgl> <lgl>  <dbl> <lgl>  <dbl>  <dbl>   <dbl>   <dbl> 
1   Large   NA NA 2.492958  NA 1239.3099 847193.0  28.96338  12.06197 
2   Medium   NA NA 2.465409  NA  861.3711 224348.6  28.30252  12.31572 
3   Small   NA NA 2.424460  NA 1045.1871 121956.6  28.46906  12.11295 
# ... with 9 more variables: Num_physicians <dbl>, Num_hospital_beds <dbl>, Num_serious_crimes <dbl>, 
# Pct_High_Sch_grads <dbl>, Pct_Bachelors <dbl>, Pct_below_poverty <dbl>, Pct_unemployed <dbl>, 
# Per_cap_income <dbl>, Total_personal_income <dbl> 


Warning messages: 
1: In mean.default(c("Los_Angeles", "Cook", "Harris", "San_Diego", : 
    argument is not numeric or logical: returning NA 
2: In mean.default(c("Pulaski", "Guilford", "Solano", "York", "Berks", : 
    argument is not numeric or logical: returning NA 
3: In mean.default(c("Bibb", "Onslow", "Jackson", "Schenectady", "Rock_Island", : 
    argument is not numeric or logical: returning NA 
4: In mean.default(c("CA", "IL", "TX", "CA", "CA", "NY", "AZ", "MI", : 
    argument is not numeric or logical: returning NA 
5: In mean.default(c("AR", "NC", "CA", "PA", "PA", "NH", "TN", "FL", : 
    argument is not numeric or logical: returning NA 
6: In mean.default(c("GA", "NC", "MI", "NY", "IL", "OH", "CA", "ME", : 
    argument is not numeric or logical: returning NA 
7: In mean.default(c("West", "East", "West", "West", "West", "East", : 
    argument is not numeric or logical: returning NA 
8: In mean.default(c("West", "East", "West", "East", "East", "East", : 
    argument is not numeric or logical: returning NA 
9: In mean.default(c("East", "East", "East", "East", "East", "East", : 
    argument is not numeric or logical: returning NA 

Here is the output of glimpse(df) for reference:

> glimpse(df) 
Observations: 440 
Variables: 18 
$ County_name   <chr> "Los_Angeles", "Cook", "Harris", "San_Diego", "Orange", "Kings", "Maricopa", "W... 
$ State     <chr> "CA", "IL", "TX", "CA", "CA", "NY", "AZ", "MI", "FL", "TX", "PA", "WA", "CA", "... 
$ Region_num   <int> 4, 2, 3, 4, 4, 1, 4, 2, 3, 3, 1, 4, 4, 4, 2, 1, 1, 1, 1, 4, 3, 3, 4, 3, 2, 4, 2... 
$ Location    <chr> "West", "East", "West", "West", "West", "East", "West", "East", "East", "West",... 
$ Square_miles   <int> 4060, 946, 1729, 4205, 790, 71, 9204, 614, 1945, 880, 135, 2126, 1291, 20062, 4... 
$ Population   <int> 8863164, 5105067, 2818199, 2498016, 2410556, 2300664, 2122101, 2111687, 1937094... 
$ Pop_Size_Group  <chr> "Large", "Large", "Large", "Large", "Large", "Large", "Large", "Large", "Large"... 
$ Pct_Age18_to_34  <dbl> 32.1, 29.2, 31.3, 33.5, 32.6, 28.3, 29.2, 27.4, 27.1, 32.6, 29.1, 30.1, 32.6, 3... 
$ Pct_65_or_over  <dbl> 9.7, 12.4, 7.1, 10.9, 9.2, 12.4, 12.5, 12.5, 13.9, 8.2, 15.2, 11.1, 8.7, 8.8, 1... 
$ Num_physicians  <int> 23677, 15153, 7553, 5905, 6062, 4861, 4320, 3823, 6274, 4718, 6641, 5280, 4101,... 
$ Num_hospital_beds  <int> 27700, 21550, 12449, 6179, 6369, 8942, 6104, 9490, 8840, 6934, 10494, 4009, 334... 
$ Num_serious_crimes <int> 688936, 436936, 253526, 173821, 144524, 680966, 177593, 193978, 244725, 214258,... 
$ Pct_High_Sch_grads <dbl> 70.0, 73.4, 74.9, 81.9, 81.2, 63.7, 81.5, 70.0, 65.0, 77.1, 64.3, 88.2, 82.0, 7... 
$ Pct_Bachelors   <dbl> 22.3, 22.8, 25.4, 25.3, 27.8, 16.6, 22.1, 13.7, 18.8, 26.3, 15.2, 32.8, 32.6, 1... 
$ Pct_below_poverty  <dbl> 11.6, 11.1, 12.5, 8.1, 5.2, 19.5, 8.8, 16.9, 14.2, 10.4, 16.1, 5.0, 5.0, 10.3, ... 
$ Pct_unemployed  <dbl> 8.0, 7.2, 5.7, 6.1, 4.8, 9.5, 4.9, 10.0, 8.7, 6.1, 8.0, 4.6, 5.5, 8.0, 5.5, 7.3... 
$ Per_cap_income  <int> 20786, 21729, 19517, 19588, 24400, 16803, 18042, 17461, 17823, 21001, 16721, 23... 
$ Total_personal_income <int> 184230, 110928, 55003, 48931, 58818, 38658, 38287, 36872, 34525, 38911, 26512, ... 

Here is a link to the data: example data

+0

Add 'na.omit()' to your dplyr pipe. –

+1

I can't reproduce this because the 'Demographics' dataset isn't available, but in general you can use 'summarise_if(is.numeric, mean(., na.rm = TRUE))' – Mislav

+0

I have updated the post and added a link to example data. Apologies for not doing so initially. – Tommy

Answers

5

You can use dplyr's select_if function:

df %>% select_if(is.numeric) 
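As a minimal sketch of what select_if(is.numeric) keeps, here is a small made-up table (the name toy and its values are assumptions for illustration, not the Demographics data). Note that a character grouping column such as Pop_Size_Group is dropped along with the other non-numeric columns when the data frame is ungrouped.

```r
library(dplyr)

# Hypothetical stand-in for the Demographics data (values are invented)
toy <- tibble(
  Pop_Size_Group = c("Large", "Small"),
  State          = c("CA", "GA"),
  Population     = c(100L, 10L),
  Pct_unemployed = c(8.0, 5.0)
)

# Keep only the columns for which is.numeric() returns TRUE
numeric_only <- toy %>% select_if(is.numeric)

names(numeric_only)
```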

Or, as Mislav's comment suggests, go straight to the summary with summarise_if:

df %>% 
    group_by(Pop_Size_Group) %>% 
    summarise_if(is.numeric, mean, na.rm = TRUE) 
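The grouped summary can be tried on a small made-up table; the sketch below assumes a hypothetical toy data frame rather than the real Demographics data. It also shows an across()/where() spelling, which should be equivalent if your dplyr is version 1.0.0 or later.

```r
library(dplyr)

# Hypothetical stand-in for the Demographics data (values are invented)
toy <- tibble(
  Pop_Size_Group = c("Large", "Large", "Small", "Small"),
  State          = c("CA", "IL", "GA", "NC"),
  Population     = c(100, 300, 10, NA),
  Pct_unemployed = c(8.0, 7.2, 5.0, 6.0)
)

# Group first, then average only the numeric columns;
# extra arguments such as na.rm = TRUE are passed on to mean()
by_group <- toy %>%
  group_by(Pop_Size_Group) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)

# Same idea written with across() and where() (dplyr >= 1.0.0)
by_group_across <- toy %>%
  group_by(Pop_Size_Group) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

by_group
```

The character column State is left out of the result, so mean() is never called on text and the warnings from the question do not appear.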
+0

I tried that initially but hit a roadblock. If I select the numeric columns first, how can I then group by Pop_Size_Group? – Tommy

+1

Group first, then use 'summarise_if' instead of 'select_if'. – neilfws

+0

Sorry @neilfws, but I'm new to R and not sure what I'm doing here. I tried: df %>% group_by(Pop_Size_Group) %>% summarise_if(is.numeric()) – Tommy
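
The attempt in the last comment likely fails for two reasons: is.numeric() with parentheses calls the function immediately with no argument, and no summary function is supplied. A minimal sketch on a made-up table (toy, group, label and value are hypothetical names) contrasts the failing call with the form given in the answer, where the predicate is passed as a bare function name followed by the function to apply.

```r
library(dplyr)

# Hypothetical example data
toy <- tibble(
  group = c("a", "a", "b"),
  label = c("x", "y", "z"),
  value = c(1, 3, 10)
)

# The attempted call: is.numeric() is evaluated with no argument and errors
attempt <- tryCatch(
  toy %>% group_by(group) %>% summarise_if(is.numeric()),
  error = function(e) "error"
)

# Working form: predicate without parentheses, then the summary function
fixed <- toy %>%
  group_by(group) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)

fixed
```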