2017-07-24 123 views
2

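How to replace misspelled words in a table with the correct values, using R

I have a table that contains some misspelled strings. Take the following as an example: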

table$Status returns these values:

"alive" "sic" "alive" "sick" "alive" "si" "alive" "ali" "alv" 
"dead" "alive" "alive" "alive" "al" "dead" "dead" "de" "dead" 
"dead" "dea" "dead" "al" "dead" "de" "al" "de" "sick" 
"dead" "alive" 

I would like to end up with only alive, sick, and dead, as in the example below:

"alive" "sick" "alive" "sick" "alive" "sick" "alive" "alive" "alive" 
"dead" "alive" "alive" "alive" "alive" "dead" "dead" "dead" "dead" 
"dead" "dead" "dead" "alive" "dead" "dead" "alive" "dead" "sick" 
"dead" "alive" 

I know that the RecordLinkage package has this function, which returns the similarity between two strings like these:

levenshteinSim("al", "alive") 

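So I could compare each value against every other value and keep the best similarity. I also know that table(Table$Status) gives me the count of each value, and the most frequent values should be the correct ones.

But here is my problem: how can I compare all of them with each other and replace the values in my table? If anyone knows a simple way to do this, it would be very helpful.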
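The idea described above can be sketched in base R. This is only a rough illustration: `lev_sim` is a hypothetical helper built on `utils::adist`, and it assumes the similarity is computed as 1 minus the edit distance divided by the length of the longer string, which is how `levenshteinSim` is usually described.

```r
# Sample values copied from the question
status <- c("alive", "sic", "alive", "sick", "alive", "si", "alive", "ali", "alv",
            "dead", "alive", "alive", "alive", "al", "dead", "dead", "de", "dead",
            "dead", "dea", "dead", "al", "dead", "de", "al", "de", "sick",
            "dead", "alive")

# Count each distinct value, most frequent first
freq <- sort(table(status), decreasing = TRUE)
print(freq)

# Hypothetical helper: normalized Levenshtein similarity in [0, 1],
# computed as 1 - edit distance / length of the longer string
lev_sim <- function(a, b) {
  1 - adist(a, b) / outer(nchar(a), nchar(b), pmax)
}

# Compare every distinct value with every other distinct value
values <- names(freq)
sim <- lev_sim(values, values)
dimnames(sim) <- list(values, values)
print(round(sim, 2))
```

Note that in this sample "sick" occurs only twice while "al" and "de" each occur three times, so frequency alone would not single out "sick" as a correct value.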

Answers

1
library(data.table)

table <- data.table(Status = c("alive", "sic", "alive", "sick", "alive", "si", "de", "al"))

# Map each value by its prefix; anything not starting with "al" or "si" becomes "dead"
table[, Status2 := ifelse(Status %like% "^al", "alive",
                   ifelse(Status %like% "^si", "sick", "dead"))]

UPDATE

A more general solution:

library(data.table)

table <- data.table(Status = c("alive", "sic", "alive", "sick", "alive", "si", "de", "al"))

# Known-good values that every entry should be mapped onto
correct_values <- c("alive", "sick", "dead")

for (i in seq_len(nrow(table))) {
  string <- table[i, Status]
  best <- 0
  to_replace <- string  # keep the original value if nothing matches
  for (j in correct_values) {
    # Similarity = number of distinct characters shared by the two strings
    similarity <- length(Reduce(intersect, strsplit(c(string, j), split = "")))
    if (similarity > best) {
      best <- similarity
      to_replace <- j
    }
  }
  table[i, Status := to_replace]
}

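Here I assume that you know which values are the correct ones (you enter them by hand in correct_values). Each value in the Status column is replaced by the value in correct_values with which it shares the most characters.

I hope it helps!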
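As a variation on the loop above, the matching can be done on the distinct values only, which should scale better to large tables. The sketch below swaps the shared-character count for a normalized edit-distance similarity based on base R's `adist`. This is a substitution, not the method used in the answer, and `nearest_value` is a hypothetical helper name.

```r
status <- c("alive", "sic", "alive", "sick", "alive", "si", "de", "al")
correct_values <- c("alive", "sick", "dead")

# Hypothetical helper: map each element of x to the most similar entry in choices
nearest_value <- function(x, choices) {
  ux <- unique(x)  # score each distinct value only once
  # Normalized similarity: 1 - edit distance / length of the longer string
  sim <- 1 - adist(ux, choices) / outer(nchar(ux), nchar(choices), pmax)
  # which.max returns the first maximum, so ties go to the earlier choice
  best <- choices[apply(sim, 1, which.max)]
  best[match(x, ux)]  # expand back to the original length and order
}

fixed <- nearest_value(status, correct_values)
print(fixed)

# Values that were changed, i.e. the ones treated as misspellings
print(setdiff(unique(status), correct_values))
```

The correct values still have to be supplied by hand; the sketch only avoids the row-by-row loop.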

+0

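This works, but it is very specific to my example. What happens when I have a table with 10,000 values? How would I know which ones are the misspelled words?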

+0

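@quant I would suggest using 'dplyr::case_when' over nested 'ifelse'. @ProgrammerMan If it were less specific, there would be no way to determine what 'al' means. 'alive' or 'all'? Maybe 'ale'? Ofc you should use fuzzy matching rather than matching on the first symbols, but you still have to provide the full-text patterns to compare against.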
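A minimal sketch of what the 'dplyr::case_when' suggestion might look like, assuming the same prefix rules as the first answer. Unlike the nested 'ifelse', this version leaves unmatched values as NA instead of defaulting them to "dead", which is a deliberate change for illustration.

```r
library(dplyr)

status <- c("alive", "sic", "alive", "sick", "alive", "si", "de", "al")

# Each condition is checked in order; the first one that is TRUE wins
status2 <- case_when(
  startsWith(status, "al") ~ "alive",
  startsWith(status, "si") ~ "sick",
  startsWith(status, "de") ~ "dead",
  TRUE ~ NA_character_  # no rule matched: flag for manual review
)
print(status2)
```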

+0

@quant Thank you very much, this works!