2017-05-29 102 views
2

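Fuzzy-match a row in one column against the same row in the next column

I want to look up information in one column based on another column. I have some words in one column and complete sentences in the other, and I want to know whether each word is found in its sentence. Sometimes the words are not written identically, so I can't use the SQL LIKE function. I therefore think fuzzy matching combined with some kind of "like" function would help. The data looks like this: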

Names     Sentences 
Airplanes Sarl   Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.    100% ownership of Kidco.Ltd. is the mother company. 
Popsi Co.    Cola Inc. is 50% share of PopsiCo which is part of LaLo. 

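The data has about 2000 rows. I need logic that determines whether Airplanes Sarl really appears in the sentence or not. The same applies to Kidco Ltd., which appears in the sentence as 'Kidco.Ltd'.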

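For simplicity, I don't need to search every sentence in the column. I only need to look for Kidco Ltd. in the sentence on the same row of the data frame.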

I have already tried this in Python with: df.apply(lambda s: fuzz.ratio(s['Names'], s['Sentences']), axis=1)

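But I got a lot of unicode/ascii errors, so I gave up and want to try it in R. Any suggestions on how to do this in R? I have seen answers on Stack Overflow that fuzzy-match every sentence in a column against every name, which is different from what I want.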
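For readers who hit the same unicode/ascii errors, here is a minimal Python sketch of a row-wise check that normalizes both strings first. It is not the fuzz.ratio approach from the question: it strips accents and punctuation with the standard library and then tests plain containment, so it only tolerates differences in accents, case, spacing and punctuation. The helper names `normalize` and `name_in_sentence` are made up for this illustration.

```python
import unicodedata

def normalize(text):
    """Strip accents, lowercase, and keep only letters and digits."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch.lower()
        for ch in decomposed
        if ch.isalnum() and not unicodedata.combining(ch)
    )

def name_in_sentence(name, sentence):
    """True if the normalized name occurs inside the normalized sentence."""
    key = normalize(name)
    return bool(key) and key in normalize(sentence)

# Each name is compared only with the sentence on its own row.
rows = [
    ("Airplanes Sarl", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."),
    ("Popsi Co.", "Cola Inc. is 50% share of PopsiCo which is part of LaLo."),
]
results = [name_in_sentence(name, sentence) for name, sentence in rows]
print(results)
```

Because all spaces are removed, a short name could match across a word boundary, so this is best treated as a first filter rather than a final answer.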

+0

Could you point us to the answer that fuzzy-matches everything? –

+0

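Since your table is small, you could try Levenshtein distance. Say d is the distance, n1 the number of characters in col1 and n2 the number of characters in col2. If the name is not in the sentence at all, the distance should be closer to n2; if it is in the sentence, the distance should be n2-n1. You would then define a cutoff, and I think it could work well. –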
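The arithmetic in this comment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the commenter's code: `levenshtein` is a plain dynamic-programming implementation written for this example, `match_quality` is a hypothetical helper computing d - (n2 - n1), and the cutoff of 2 is an arbitrary choice that would need tuning on real data.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def match_quality(name, sentence):
    """Edits needed beyond the n2 - n1 insertions that pad the name to the sentence."""
    return levenshtein(name, sentence) - (len(sentence) - len(name))

CUTOFF = 2  # assumed threshold, chosen only for this example

rows = [
    ("Airplanes Sarl", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."),
    ("Popsi Co.", "Cola Inc. is 50% share of PopsiCo which is part of LaLo."),
    ("some company", "It is a truth universally acknowledged..."),
]
qualities = [match_quality(name, sentence) for name, sentence in rows]
matches = [quality <= CUTOFF for quality in qualities]
print(list(zip(qualities, matches)))
```

A quality of 0 means the characters of the name appear in order somewhere in the sentence. Each extra point is one character of the name that had to be substituted or dropped, which is why a small cutoff separates the first three rows from the last one.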

Answers

2

Maybe try tokenization + phonetic matching:

library(RecordLinkage) 
library(quanteda) 
df <- read.table(header=T, sep=";", text=" 
Names     ;Sentences 
Airplanes Sarl   ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.    ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.    ;100% ownership of Kidco.Ltd. is the mother company. 
Popsi Co.    ;Cola Inc. is 50% share of PopsiCo which is part of LaLo. 
Popsi Co.    ;Cola Inc. is 50% share of Popsi Co which is part of LaLo.") 
f <- soundex 
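# Note: tokenize() comes from older quanteda releases; newer versions provide tokens() and tokens_ngrams() instead.
tokens <- tokenize(as.character(df$Sentences), ngrams = 1:2) # 2-grams to catch "Popsi Co" 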
tokens <- lapply(tokens, f) 
mapply(is.element, soundex(df$Names), tokens) 
# A614 K324 K324 P122 P122 
# TRUE FALSE TRUE TRUE TRUE 
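The same idea (Soundex codes of 1-grams and 2-grams, then a membership test) can be sketched in standard-library Python. This is an approximation for illustration only: the `soundex` function below is a hand-written American Soundex, not the RecordLinkage implementation, and the whitespace tokenizer is far simpler than quanteda's, so results on other data may differ from the R code.

```python
import unicodedata

# Soundex digit for each consonant group; vowels, h, w and y carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(text):
    """Four-character Soundex code; accents, digits and punctuation are ignored."""
    folded = unicodedata.normalize("NFKD", text).lower()
    letters = [ch for ch in folded if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            if digit != prev:       # collapse runs of the same digit
                code += digit
            prev = digit
        elif ch not in "hw":        # vowels break a run, h and w do not
            prev = ""
        if len(code) == 4:
            break
    return code.ljust(4, "0")

def ngrams(words, sizes=(1, 2)):
    """All word n-grams of the given sizes, joined with a space."""
    return [" ".join(words[i:i + n])
            for n in sizes for i in range(len(words) - n + 1)]

def phonetic_match(name, sentence):
    """True if the name's Soundex code equals that of any 1- or 2-gram."""
    key = soundex(name)
    return bool(key) and key in {soundex(g) for g in ngrams(sentence.split())}

rows = [
    ("Airplanes Sarl", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."),
    ("Popsi Co.", "Cola Inc. is 50% share of PopsiCo which is part of LaLo."),
    ("Popsi Co.", "Cola Inc. is 50% share of Popsi Co which is part of LaLo."),
]
results = [phonetic_match(name, sentence) for name, sentence in rows]
print(results)
```

Soundex keeps only the first letter and three digits, so long names that share a prefix will collide. On larger data a stricter second check on the matching n-gram may be worth adding.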
1

Here is a solution using the method I proposed in the comments. It works well on this example:

library("stringdist") 

df <- as.data.frame(matrix(c("Airplanes Sarl","Airplanes-Sàrl is part of Airplanes-Group Sarl.", 
          "Kidco Ltd.","100% ownership of Kidco.Ltd. is the mother company.", 
          "Popsi Co.","Cola Inc. is 50% share of PopsiCo which is part of LaLo.", 
          "some company","It is a truth universally acknowledged...", 
          "Hello world",list(NULL)), 
        ncol=2,byrow=TRUE,dimnames=list(NULL,c("Names","Sentences"))),stringsAsFactors=FALSE) 

null_elements <- which(sapply(df$Sentences,is.null)) 
df$Sentences[null_elements] <- "" # replacing NULLs to avoid errors 
df$dist <- mapply(stringdist,df$Names,df$Sentences) 
df$n2 <- nchar(df$Sentences) 
df$n1 <- nchar(df$Names) 
df$match_quality <- df$dist-(df$n2-df$n1) 
cutoff <- 2 
df$match <- df$match_quality <= cutoff 
df$Sentences[null_elements] <- list(NULL) # setting null elements back to initial value 
df$match[null_elements] <- NA # optional, set to FALSE otherwise, as it will prevent some false positives if Names is shorter than cutoff 

#            Names                                                Sentences dist n2 n1 match_quality match
# 1 Airplanes Sarl          Airplanes-Sàrl is part of Airplanes-Group Sarl.   33 47 14             0  TRUE
# 2     Kidco Ltd.      100% ownership of Kidco.Ltd. is the mother company.   42 51 10             1  TRUE
# 3      Popsi Co. Cola Inc. is 50% share of PopsiCo which is part of LaLo.   48 56  9             1  TRUE
# 4   some company                It is a truth universally acknowledged...   36 41 12             7 FALSE
# 5    Hello world                                                     NULL   11  0 11            22    NA
+0

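Moody_Mudskipper, really nice answer! However, if the data in 'Sentences' is NULL, it reports a TRUE match. You can try it with the example you provided: put anything in 'Names' and leave 'Sentences' empty. – Probs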

+0

I think it should work properly now. I did not get a TRUE match in my case, but I did get an error when the sentence was NULL. Let me know if it works. –
