Fuzzy match a row in one column against the same row in the next column

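I want to find information in one column based on another column. One column holds a few words (a name) and the other holds a complete sentence, and I want to know whether the name occurs in the sentence. Sometimes the name is not written exactly the same way, so I cannot use the SQL LIKE function. I therefore think fuzzy matching combined with some kind of "like" function would help. The data looks like this: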

Names     Sentences 
Airplanes Sarl   Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.    100% ownership of Kidco.Ltd. is the mother company. 
Popsi Co.    Cola Inc. is 50% share of PopsiCo which is part of LaLo.

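The data has about 2000 rows. I need logic that decides whether Airplanes Sarl really is in its sentence or not, and that also works for Kidco Ltd., which appears in its sentence as 'Kidco.Ltd'.

To keep things simple, I do not need to search every sentence in the column. I only need to take Kidco Ltd. and search for it in the same row of the data frame.

I have already tried this in Python with: df.apply(lambda s: fuzz.ratio(s['Names'], s['Sentences']), axis=1)

But I got a lot of unicode/ascii errors, so I gave up and would like to try it in R. Any suggestions on how to do this in R? I have seen answers on Stack Overflow that fuzzy match a name against all sentences in a column, which is different from what I want.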

Source

2017-05-29 Probs

Could you link the answer that fuzzy matches everything?

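Since your table is small, you could try the Levenshtein distance. Say d is the distance, n1 is the number of characters in col1 and n2 is the number of characters in col2. If the name is not in the sentence at all, d should be close to n2; if the name is contained in the sentence, d should be n2-n1. You would then define a cutoff on how far d may exceed n2-n1, and I think that could work well.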
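For anyone who would rather stay in Python, the idea in this comment can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the commenter's code: the helper names (`levenshtein`, `excess_distance`, `is_match`) and the default cutoff of 2 are made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def excess_distance(name: str, sentence: str) -> int:
    """d - (n2 - n1): how far the distance exceeds the pure length difference."""
    return levenshtein(name, sentence) - (len(sentence) - len(name))


def is_match(name: str, sentence: str, cutoff: int = 2) -> bool:
    """Hypothetical decision rule: a small excess means the name is (nearly) contained."""
    if not sentence:  # an empty or missing sentence cannot contain a name
        return False
    return excess_distance(name, sentence) <= cutoff


rows = [
    ("Airplanes Sarl", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."),
    ("Popsi Co.", "Cola Inc. is 50% share of PopsiCo which is part of LaLo."),
    ("some company", "It is a truth universally acknowledged..."),
]
for name, sentence in rows:
    print(name, excess_distance(name, sentence), is_match(name, sentence))
```

The excess is 0 whenever the characters of the name appear in order somewhere in the sentence, which is also the weakness of the approach: a long sentence can absorb a short name by accident, so the cutoff needs checking against real data.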

Maybe try tokenization + phonetic matching:

library(RecordLinkage) 
library(quanteda) 
df <- read.table(header=T, sep=";", text=" 
Names     ;Sentences 
Airplanes Sarl   ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.    ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.    ;100% ownership of Kidco.Ltd. is the mother company. 
Popsi Co.    ;Cola Inc. is 50% share of PopsiCo which is part of LaLo. 
Popsi Co.    ;Cola Inc. is 50% share of Popsi Co which is part of LaLo.") 
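f <- soundex # phonetic encoder from RecordLinkage
# tokenize() is the API of the quanteda release this answer was written against;
# later releases replaced it with tokens() and tokens_ngrams()
tokens <- tokenize(as.character(df$Sentences), ngrams = 1:2) # 2-grams to catch "Popsi Co"
tokens <- lapply(tokens, f) # encode every token of every sentence
mapply(is.element, soundex(df$Names), tokens) # is the name's code among the codes of its own row?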
# A614 K324 K324 P122 P122 
# TRUE FALSE TRUE TRUE TRUE
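The same tokenize-then-encode idea can be sketched in plain Python, which also sidesteps the accent problem by stripping diacritics first. This is a rough illustration and not a port of RecordLinkage or quanteda: `fold`, `soundex`, `ngrams` and `name_in_sentence` are hypothetical helpers, the Soundex below is a hand-written American Soundex, and sentences are split on whitespace only.

```python
import unicodedata

# Consonant groups of American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def fold(text: str) -> str:
    """Lower-case, strip accents, keep ASCII letters only ("Sàrl" -> "sarl")."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha())


def soundex(text: str) -> str:
    """Four-character Soundex code of the folded text, or "" if it has no letters."""
    letters = fold(text)
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":  # h and w do not separate equal codes
            continue
        digit = CODES.get(ch, "")  # vowels map to "" and reset the run
        if digit and digit != previous:
            code += digit
        previous = digit
    return (code + "000")[:4]


def ngrams(tokens, sizes=(1, 2)):
    """All runs of 1 and 2 consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n])
            for n in sizes for i in range(len(tokens) - n + 1)]


def name_in_sentence(name: str, sentence: str) -> bool:
    """True if some 1- or 2-gram of the sentence shares the name's Soundex code."""
    target = soundex(name)
    if not target:
        return False
    return target in {soundex(gram) for gram in ngrams(sentence.split())}


print(soundex("Airplanes Sarl"), soundex("Kidco Ltd."), soundex("Popsi Co."))
print(name_in_sentence("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."))
```

Soundex keeps only the first letter and three digits, so long names that merely start alike collide; with around 2000 rows it is worth spot-checking the positives.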

Source

2017-05-29 15:03:20 lukeA

Here is a solution using the method I suggested in the comments. It works well on this example:

library("stringdist") 

df <- as.data.frame(matrix(c("Airplanes Sarl","Airplanes-Sàrl is part of Airplanes-Group Sarl.", 
          "Kidco Ltd.","100% ownership of Kidco.Ltd. is the mother company.", 
          "Popsi Co.","Cola Inc. is 50% share of PopsiCo which is part of LaLo.", 
          "some company","It is a truth universally acknowledged...", 
          "Hello world",list(NULL)), 
        ncol=2,byrow=TRUE,dimnames=list(NULL,c("Names","Sentences"))),stringsAsFactors=FALSE) 

null_elements <- which(sapply(df$Sentences,is.null)) 
df$Sentences[null_elements] <- "" # replacing NULLs to avoid errors 
df$dist <- mapply(stringdist,df$Names,df$Sentences) 
df$n2 <- nchar(df$Sentences) 
df$n1 <- nchar(df$Names) 
df$match_quality <- df$dist-(df$n2-df$n1) 
cutoff <- 2 
df$match <- df$match_quality <= cutoff 
df$Sentences[null_elements] <- list(NULL) # setting null elements back to initial value 
df$match[null_elements] <- NA # optional, set to FALSE otherwise, as it will prevent some false positives if Names is shorter than cutoff 

#            Names                                                Sentences dist n2 n1 match_quality match
# 1 Airplanes Sarl          Airplanes-Sàrl is part of Airplanes-Group Sarl.   33 47 14             0  TRUE
# 2     Kidco Ltd.      100% ownership of Kidco.Ltd. is the mother company.   42 51 10             1  TRUE
# 3      Popsi Co. Cola Inc. is 50% share of PopsiCo which is part of LaLo.   48 56  9             1  TRUE
# 4   some company                It is a truth universally acknowledged...   36 41 12             7 FALSE
# 5    Hello world                                                     NULL   11  0 11            22    NA

Source

2017-05-29 15:52:28

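Moody_Mudskipper, really good answer! However, if the data in 'Sentences' is NULL, it reports a TRUE match. You can try it with the example you provided: put anything in 'Names' and leave 'Sentences' empty. – Probs

I think it should work properly now. In my case I did not get a TRUE match but an error when the sentence was NULL, so let me know whether it works.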
