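Look up strings using substrings in R

2017-11-10

I have a long list of strings that share substrings. The list comes from event-stream data, so there are thousands of rows, but I will simplify it for this example using pets: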

+--------------------------------+
| Pets                           |
+--------------------------------+
| "one calico cat that's smart"  |
| "German Shepard dog"           |
| "A Chameleon that is a Lizard" |
| "a cute tabby cat"             |
| "the fish guppy"               |
| "Lizard Gecko"                 |
| "German Shepard dog"           |
| "Budgie Bird"                  |
| "Canary Bird in a coal mine"   |
| "a chihuahua dog"              |
+--------------------------------+
dput output: structure(list(Pets = structure(c(8L, 6L, 1L, 3L, 9L, 7L, 6L, 4L, 5L, 2L),.Label = c("A Chameleon that is a Lizard", "a chihuahua dog", "a cute tabby cat", "Budgie Bird", "Canary Bird in a coal mine", "German Shepard dog", "Lizard Gecko", "one calico cat that's smart", "the fish guppy"), class = "factor")), .Names = "Pets", row.names = c(NA, -10L), class = "data.frame") 

I want to add information based on the general type of pet (dog, cat, etc.). I have a key table that holds this information:

+----------+----------------+
| key      | classification |
+----------+----------------+
| "dog"    | "canine"       |
| "cat"    | "feline"       |
| "lizard" | "reptile"      |
| "bird"   | "avian"        |
| "fish"   | "fish"         |
+----------+----------------+
dput output: structure(list(key = structure(c(3L, 2L, 5L, 1L, 4L), .Label = c("bird", "cat", "dog", "fish", "lizard"), class = "factor"), classification = structure(c(2L, 3L, 5L, 1L, 4L), .Label = c("avian", "canine", "feline", "fish", "reptile"), class = "factor")), .Names = c("key", "classification"), row.names = c(NA, -5L), class = "data.frame") 

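How do I use the "long strings" in the Pets table to look up the relevant classification in the key table? The catch is that my lookup strings contain the substrings found in the key table, rather than matching them exactly.

I started with grepl like this: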

key[grepl(pets[1,1], key[ , 2]), ] 

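But this doesn't work, because "calico cat" is not in the key list, although "cat" is. The result I'm looking for would be "feline". (Note: I can't simply switch things around, because in my own code this sits inside an apply function that loops over every row in the data. So instead of pets[1,1] it is pets[n,1]. In the end I intend to cbind the results onto the event-stream data for further analysis.)

I'm having trouble wrapping my head around how to do this. Any suggestions?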
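One way to read the problem: grepl() takes the pattern first and the text second, so the short key should be the pattern and the long pet string the text. The following is a minimal base R sketch under that reading, not a definitive solution. `classify_pet` is a hypothetical helper name, and the data are trimmed copies of the tables above, stored as character columns rather than factors.

```r
# Trimmed copies of the question's tables (strings, not factors).
pets <- data.frame(
  Pets = c("one calico cat that's smart", "German Shepard dog", "Lizard Gecko"),
  stringsAsFactors = FALSE
)
key <- data.frame(
  key = c("dog", "cat", "lizard", "bird", "fish"),
  classification = c("canine", "feline", "reptile", "avian", "fish"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: test every key as a pattern against one pet string
# and return the classification of the first key that matches, or NA.
classify_pet <- function(pet) {
  hit <- vapply(key$key, grepl, logical(1), x = pet, ignore.case = TRUE)
  if (any(hit)) key$classification[which(hit)[1]] else NA_character_
}

# Works one string at a time, so it also fits inside an apply-style loop.
result <- vapply(pets$Pets, classify_pet, character(1), USE.NAMES = FALSE)
print(cbind(pets, classification = result))
```

Because this is plain substring matching, a key such as "cat" would also match inside a longer word like "caterpillar".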

+0

It seems the key is always the second word of each "long string". Is that a reasonable assumption? – useR

+0

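Unfortunately, no. The strings have anywhere from a few to several different words. I only know that the "key" word is in there somewhere. – JoeM05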

+1

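Then you should provide a long string that does not fit this assumption. Also, please provide your dataset by copying and pasting the output of 'dput(my_data)' into your question, rather than the way you currently have it formatted. – useR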

Answers

2

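You can do this kind of thing very easily with the fuzzyjoin package.

Here you can use regex_left_join, which works like a normal left join (such as dplyr::left_join), except that the criterion for matching rows is a regular-expression match, similar to stringr::str_detect.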

library(tibble)
library(fuzzyjoin)

# Text column: the long strings to classify
pets <- tribble(
  ~pets,
  "one calico cat that's smart",
  "German Shepard dog",
  "A Chameleon that is a Lizard",
  "a cute tabby cat",
  "the fish guppy",
  "Lizard Gecko",
  "German Shepard dog",
  "Budgie Bird",
  "Canary Bird in a coal mine",
  "a chihuahua dog"
)

# Lookup table: each key is used as a regular expression
key <- tribble(
  ~key,     ~classification,
  "dog",    "canine",
  "cat",    "feline",
  "lizard", "reptile",
  "bird",   "avian",
  "fish",   "fish"
)

# A row matches when the regex in key$key is found in pets$pets
regex_left_join(pets, key, by = c("pets" = "key"), ignore_case = TRUE)

#> # A tibble: 10 x 3
#>    pets                         key    classification
#>    <chr>                        <chr>  <chr>
#>  1 one calico cat that's smart  cat    feline
#>  2 German Shepard dog           dog    canine
#>  3 A Chameleon that is a Lizard lizard reptile
#>  4 a cute tabby cat             cat    feline
#>  5 the fish guppy               fish   fish
#>  6 Lizard Gecko                 lizard reptile
#>  7 German Shepard dog           dog    canine
#>  8 Budgie Bird                  bird   avian
#>  9 Canary Bird in a coal mine   bird   avian
#> 10 a chihuahua dog              dog    canine
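
To make the join semantics concrete, here is a rough base R sketch of the same idea. It is not how fuzzyjoin is implemented, only an illustration: every key is tested against every string, each hit becomes one output row, and strings with no hit are kept with NA. The strings "cat and dog" and "pet rock" are made up to show a multiple match and a non-match.

```r
pets <- c("a cute tabby cat", "cat and dog", "pet rock")
key <- data.frame(
  key = c("dog", "cat"),
  classification = c("canine", "feline"),
  stringsAsFactors = FALSE
)

# Logical matrix: one row per pet string, one column per key.
hits <- vapply(
  key$key,
  function(k) grepl(k, pets, ignore.case = TRUE),
  logical(length(pets))
)

# One output row per (string, key) hit.
idx <- which(hits, arr.ind = TRUE)
matched <- data.frame(
  pets = pets[idx[, "row"]],
  key = key$key[idx[, "col"]],
  classification = key$classification[idx[, "col"]],
  stringsAsFactors = FALSE
)

# Left-join behaviour: strings without any hit are kept, with NA columns.
no_hit <- rowSums(hits) == 0
unmatched <- data.frame(
  pets = pets[no_hit],
  key = rep(NA_character_, sum(no_hit)),
  classification = rep(NA_character_, sum(no_hit)),
  stringsAsFactors = FALSE
)

joined <- rbind(matched, unmatched)
print(joined)
```

A string that contains two keys therefore appears twice in the result, which is worth checking for before binding the output back onto the original data.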
+0

This works. Handy library, thank you. – JoeM05

1

You can extract the key from each pet string and then look it up in the table (the data are defined below):

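# Build one alternation pattern from all keys: dog|cat|lizard|bird|fish
Pattern = paste(KeyTable$key, collapse="|")
# Wrap it so that sub() replaces the whole string with the captured key
Pattern = paste0(".*(", Pattern, ").*")
Type = tolower(sub(Pattern, "\\1", Pets, ignore.case=TRUE))
# Look up each extracted key in the table
KeyTable$classification[match(Type, KeyTable$key)]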
[1] "feline" "canine" "reptile" "feline" "feline" "canine" "fish" 
[8] "reptile" "canine" "avian" "avian" "canine" 

Data

KeyTable = read.table(text="key classification 
dog canine 
cat feline 
lizard reptile 
bird avian  
fish fish", 
header=TRUE, stringsAsFactors=FALSE) 

Pets = c("calico cat", 
"Shepard dog" , 
"Chameleon Lizard", 
"calico cat", 
"tabby cat", 
"chihuahua dog", 
"guppy fish", 
"Gecko Lizard", 
"Shepard dog", 
"Budgie Bird", 
"Canary Bird" , 
"chihuahua dog") 
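
One caveat of a plain alternation pattern is that it also matches inside longer words, for example "cat" inside "caterpillar". A possible variation, sketched here with a made-up third string, wraps the alternation in \b word boundaries and extracts the first whole-word hit with regexpr() and regmatches(). This is an alternative to the sub() call above, not the answerer's code.

```r
KeyTable <- data.frame(
  key = c("dog", "cat", "lizard", "bird", "fish"),
  classification = c("canine", "feline", "reptile", "avian", "fish"),
  stringsAsFactors = FALSE
)
# The third string is hypothetical and should NOT be classified as feline.
Pets <- c("calico cat", "Gecko Lizard", "hungry caterpillar")

# \b(dog|cat|lizard|bird|fish)\b : keys must appear as whole words.
Pattern <- paste0("\\b(", paste(KeyTable$key, collapse = "|"), ")\\b")

# regexpr() returns -1 where nothing matched; regmatches() returns only the hits.
m <- regexpr(Pattern, Pets, ignore.case = TRUE, perl = TRUE)
Type <- rep(NA_character_, length(Pets))
Type[m > 0] <- tolower(regmatches(Pets, m))

Classification <- KeyTable$classification[match(Type, KeyTable$key)]
print(data.frame(Pets, Type, Classification))
```

Strings with no whole-word key end up as NA instead of carrying their full text into the lookup.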
1

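Here is another approach, using hashmap:

library(hashmap)
library(dplyr)   # %>%, mutate, select, bind_cols
library(tidyr)   # separate_rows

# Map each key to its classification
hash_table = hashmap(Lookup$key, Lookup$classification)

# Split each string into words, look every word up, keep only the hits.
# bind_cols assumes exactly one key word per string.
Pets %>%
    separate_rows(Pets, sep = " ") %>%
    mutate(class = hash_table[[tolower(Pets)]]) %>%
    na.omit() %>%
    select(Key = Pets, class) %>%
    bind_cols(Pets, .)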

Result:

> hash_table 
## (character) => (character) 
##  [fish] => [fish]  
##  [bird] => [avian]  
## [lizard] => [reptile] 
##  [cat] => [feline] 
##  [dog] => [canine] 

                           Pets    Key   class
1   one calico cat that's smart    cat  feline
2            German Shepard dog    dog  canine
3  A Chameleon that is a Lizard Lizard reptile
4              a cute tabby cat    cat  feline
5                the fish guppy   fish    fish
6                  Lizard Gecko Lizard reptile
7            German Shepard dog    dog  canine
8                   Budgie Bird   Bird   avian
9    Canary Bird in a coal mine   Bird   avian
10              a chihuahua dog    dog  canine

Data:

Pets = structure(list(Pets = c("one calico cat that's smart", "German Shepard dog", 
           "A Chameleon that is a Lizard", "a cute tabby cat", "the fish guppy", 
           "Lizard Gecko", "German Shepard dog", "Budgie Bird", "Canary Bird in a coal mine", 
           "a chihuahua dog")), .Names = "Pets", row.names = c(NA, -10L), class = "data.frame") 


Lookup = structure(list(key = c("dog", "cat", "lizard", "bird", "fish"), 
         classification = c("canine", "feline", "reptile", "avian", 
         "fish")), class = "data.frame", .Names = c("key", "classification" 
        ), row.names = c(NA, -5L))
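
If the hashmap package is not available, a named character vector can play the same role in base R. The sketch below is an alternative to the code above rather than the answerer's method. `classify_words` is a hypothetical helper, the data are trimmed copies of the tables above, and the approach assumes each key appears as a whole space-separated word.

```r
Lookup <- data.frame(
  key = c("dog", "cat", "lizard", "bird", "fish"),
  classification = c("canine", "feline", "reptile", "avian", "fish"),
  stringsAsFactors = FALSE
)
Pets <- data.frame(
  Pets = c("one calico cat that's smart", "the fish guppy", "Lizard Gecko"),
  stringsAsFactors = FALSE
)

# Named vector used as a lookup table: names are keys, values are classes.
lookup_vec <- setNames(Lookup$classification, Lookup$key)

# Hypothetical helper: split a string into lowercase words and return the
# classification of the first word that is a key, or NA if none is.
classify_words <- function(s) {
  words <- tolower(strsplit(s, " ", fixed = TRUE)[[1]])
  found <- lookup_vec[words]        # NA for words that are not keys
  found <- found[!is.na(found)]
  if (length(found) > 0) unname(found[1]) else NA_character_
}

Pets$class <- vapply(Pets$Pets, classify_words, character(1), USE.NAMES = FALSE)
print(Pets)
```

Unlike the pipeline above, this keeps one result per input string even when a string contains no key or more than one.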