2017-05-08 62 views

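Edit address strings by adding missing words in R

I have a list of addresses that are not fully formatted. Most have the same basic structure, but about one in five was not entered correctly.

df1 contains 24 addresses, each of which is a single string. My goal is to find the addresses that appear to be missing a word or number, and to add it to each string at the position where it most likely belongs.

My approach is to count how many times each unique word/number appears in the data frame. Words that appear in 80% or more of the rows are identified as words that need to be added to every address. Any missing word needs to be added in the "correct" position, based on the format of the addresses that contain all address elements.

I can identify the words that need to be added, but I have not found a way to add a word to each string where it is absent, nor a way to make sure it is added at the correct position in the string. This is further complicated because in my real dataset the address format is not constant across areas: in this example the building number and road name are the third and fourth address elements, but elsewhere they may be the first, second, third, and so on. So the solution I am working towards also needs to be dynamic.

Here is my sample dataset: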

df1 <- data.frame(V1=c("apt 23 5 roadname cityville b11abc", "apt 47 5 roadname cityville b11abc", "apt 24 roadname cityville b11abc", "apt 3 roadname cityville b11abc", "apt 44 5 roadname cityville b11abc", "apt 88 5 roadname cityville b11abc", "apt 7 5 roadname cityville b11abc", "apt 41 5 roadname cityville b11abc", "apt 55 5 roadname cityville b11abc", "apt 19 5 roadname cityville b11abc", "85 5 roadname cityville b11abc", "apt 12 roadname cityville b11abc", "apt 452 5 roadname cityville b11abc", "apt 1 5 roadname cityville b11abc", "99 5 roadname cityville b11abc", "apt 73 5 roadname cityville b11abc", "74 roadname cityville b11abc", "apt 75 5 roadname cityville b11abc", "apt 63 5 roadname cityville b11abc", "apt 48 5 roadname cityville b11abc", "apt 123 5 roadname cityville b11abc", "apt 56 5 roadname cityville b11abc", "6 5 roadname cityville b11abc", "apt 2 6 roadname cityville b11abc"), stringsAsFactors = F) 

This is the approach I use to identify the words that need to be added:

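# Count how often each word occurs across all addresses
df1_words <- as.data.frame(table(unlist(strsplit(df1$V1, " "))))
# Keep the words that occur at least as many times as 80% of the row count
df1_words_80 <- subset(df1_words, Freq >= round(nrow(df1)/100*80))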
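
Once the frequent words are known, it may help to list, row by row, which of them are absent. The following is only a sketch on a hypothetical five-row sample in the same format as df1; the names `addresses`, `tokens`, `frequent` and `missing_by_row` are illustrative, and unlike the code above it counts each word at most once per row.

```r
# Hypothetical five-row sample in the same format as df1
addresses <- c(
  "apt 23 5 roadname cityville b11abc",
  "apt 47 5 roadname cityville b11abc",
  "apt 24 roadname cityville b11abc",
  "apt 44 5 roadname cityville b11abc",
  "apt 88 5 roadname cityville b11abc"
)

# Split every address into its words
tokens <- strsplit(addresses, " ", fixed = TRUE)

# Words present in at least 80% of rows (each word counted once per row)
word_counts <- table(unlist(lapply(tokens, unique)))
frequent <- names(word_counts)[word_counts >= round(length(addresses) * 0.8)]

# For each row, the frequent words it does not contain
missing_by_row <- lapply(tokens, function(x) setdiff(frequent, x))
```

Here only the third address is flagged, and the word it lacks is "5". This says what is missing, not where it belongs.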

This is the output I am after:

df2 <- data.frame(V1=c("apt 23 5 roadname cityville b11abc", "apt 47 5 roadname cityville b11abc", "apt 24 5 roadname cityville b11abc", "apt 3 5 roadname cityville b11abc", "apt 44 5 roadname cityville b11abc", "apt 88 5 roadname cityville b11abc", "apt 7 5 roadname cityville b11abc", "apt 41 5 roadname cityville b11abc", "apt 55 5 roadname cityville b11abc", "apt 19 5 roadname cityville b11abc", "apt 85 5 roadname cityville b11abc", "apt 12 5 roadname cityville b11abc", "apt 452 5 roadname cityville b11abc", "apt 1 5 roadname cityville b11abc", "apt 99 5 roadname cityville b11abc", "apt 73 5 roadname cityville b11abc", "apt 74 5 roadname cityville b11abc", "apt 75 5 roadname cityville b11abc", "apt 63 5 roadname cityville b11abc", "apt 48 5 roadname cityville b11abc", "apt 123 5 roadname cityville b11abc", "apt 56 5 roadname cityville b11abc", "apt 6 5 roadname cityville b11abc", "apt 2 6 roadname cityville b11abc"), stringsAsFactors = F) 

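Edit: After applying ikop's solution to a real dataset, I ran into a problem when the list contains addresses of different lengths. I think the problem is that for some short addresses (e.g. ones containing 5 words) the code tries to insert frequent words that are usually found at positions 6, 7, 8, 9, etc., which is not possible and therefore produces an error. I can think of two solutions: counting positions backwards rather than forwards, or, probably the simpler option (and the one I think best suits my particular needs), simply ignoring rows that contain very short strings.

The problem I ran into can be reproduced by running ikop's solution on df3: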

df3 <- data.frame(V1=c("apt really long name 23 5 roadname cityville b11abc", "apt really long name 47 5 roadname cityville b11abc", "apt really long name 24 roadname cityville b11abc", "apt 3 roadname cityville b11abc", "apt really long name 44 5 roadname cityville b11abc", "apt really long name 88 5 roadname cityville b11abc", "apt really long name 7 5 roadname cityville b11abc", "apt really long name 41 5 roadname cityville b11abc", "apt really long name 55 5 roadname cityville b11abc", "apt really long name 19 5 roadname cityville b11abc", "85 5 roadname cityville b11abc", "apt really long name 12 roadname cityville b11abc", "apt really long name 452 5 roadname cityville b11abc", "apt really long name 1 5 roadname cityville b11abc", "99 5 roadname cityville b11abc", "apt really long name 73 5 roadname cityville b11abc", "74 roadname cityville b11abc", "apt 75 5 roadname cityville b11abc", "apt really long name 63 5 roadname cityville b11abc", "apt really long name 48 5 roadname cityville b11abc", "apt really long name 123 5 roadname cityville b11abc", "apt really long name 56 5 roadname cityville b11abc", "6 5 roadname cityville b11abc", "apt really long name 2 6 roadname cityville b11abc"), stringsAsFactors = F) 
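
The "count backwards" idea mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not a tested fix for the real data: `insert_from_end` is a hypothetical helper, and the position (4th from the end for "5") is supplied by hand rather than derived from the data. It only suits words near the end of the address, whose distance from the end is stable.

```r
# Insert `word` so that it becomes the `pos_from_end`-th token counted
# from the end, unless that slot already holds `word`.
insert_from_end <- function(x, word, pos_from_end) {
  n <- length(x)
  idx <- n - pos_from_end + 1            # index of that slot in x
  if (idx >= 1 && x[idx] == word) {
    return(x)                            # word already in place
  }
  # The new word must end up with (pos_from_end - 1) tokens after it
  keep_after <- min(pos_from_end - 1, n)
  head_part <- x[seq_len(n - keep_after)]
  tail_part <- x[seq_len(keep_after) + (n - keep_after)]
  c(head_part, word, tail_part)
}

short_row <- c("74", "roadname", "cityville", "b11abc")
long_row  <- c("apt", "really", "long", "name", "24",
               "roadname", "cityville", "b11abc")
ok_row    <- c("apt", "75", "5", "roadname", "cityville", "b11abc")

fixed_short <- insert_from_end(short_row, "5", 4)
fixed_long  <- insert_from_end(long_row, "5", 4)
fixed_ok    <- insert_from_end(ok_row, "5", 4)
```

Both the short and the long row receive "5" directly before "roadname", and the row that already has it is left alone. A row with a different building number (such as "apt 2 6 roadname ...") would still get an extra "5", so this does not address that case.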

Answers

Here is a hacky solution that will get you most of the way there.

## For each word that appears in at least 80% of the rows compute 
## the most frequent position it appears in: 
library(dplyr) 
splitList <- strsplit(df1$V1, " ") 
wordVec <- unique(unlist(splitList)) 
wordFrequencyDf <- lapply(wordVec, function(theWord){
    ## Total number of occurrences of the word
    freqWord <- sum(unlist(splitList) == theWord)
    ## Every position at which the word occurs, across all rows
    posVec <- unlist(lapply(splitList, function(x) which(x == theWord)))
    mostFreqPos <- sort(table(posVec), decreasing = TRUE)[1] %>% names %>% as.numeric
    data.frame(theWord, freqWord, mostFreqPos)
  }) %>%
  do.call('rbind', .) %>%
  dplyr::mutate(theWord = as.character(theWord)) %>%
  dplyr::filter(freqWord >= round(nrow(df1)*0.8)) %>%
  dplyr::arrange(mostFreqPos)

## Now loop over those words and insert the word in the relevant 
## position if necessary: 
for (ii in seq_along(wordFrequencyDf$theWord)){
  splitList <- lapply(splitList, function(x){
    relPos <- wordFrequencyDf$mostFreqPos[ii]
    ## Note: x[relPos] is NA when the row has fewer than relPos words,
    ## in which case this comparison stops with an error
    if (x[relPos] != wordFrequencyDf$theWord[ii]){
      if (relPos == 1){
        strBefore <- NULL
      } else {
        strBefore <- x[1:(relPos-1)]
      }
      if (relPos > length(x)){
        strAfter <- NULL
      } else {
        strAfter <- x[relPos:length(x)]
      }
      x <- c(strBefore, wordFrequencyDf$theWord[ii], strAfter)
    }
    x
  })
}

## Paste list together into a single string again: 
df2 <- data.frame(V1 = sapply(splitList, function(x) paste(x, collapse = " "))) 

Result:

df2 
#                V1 
# 1        apt 23 5 roadname cityville b11abc 
# 2        apt 47 5 roadname cityville b11abc 
# 3        apt 24 5 roadname cityville b11abc 
# 4        apt 3 5 roadname cityville b11abc 
# 5        apt 44 5 roadname cityville b11abc 
# 6        apt 88 5 roadname cityville b11abc 
# 7        apt 7 5 roadname cityville b11abc 
# 8        apt 41 5 roadname cityville b11abc 
# 9        apt 55 5 roadname cityville b11abc 
# 10       apt 19 5 roadname cityville b11abc 
# 11       apt 85 5 roadname cityville b11abc 
# 12       apt 12 5 roadname cityville b11abc 
# 13       apt 452 5 roadname cityville b11abc 
# 14        apt 1 5 roadname cityville b11abc 
# 15       apt 99 5 roadname cityville b11abc 
# 16       apt 73 5 roadname cityville b11abc 
# 17       apt 74 5 roadname cityville b11abc 
# 18       apt 75 5 roadname cityville b11abc 
# 19       apt 63 5 roadname cityville b11abc 
# 20       apt 48 5 roadname cityville b11abc 
# 21       apt 123 5 roadname cityville b11abc 
# 22       apt 56 5 roadname cityville b11abc 
# 23        apt 6 5 roadname cityville b11abc 
# 24 apt 2 5 roadname cityville b11abc 6 roadname cityville b11abc 

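As you can see, the approach fails on the last row. The original row does not have a "5" at position 3 (as the code expects). The building number is not actually missing, though; the string simply contains a different building number. The code nevertheless interprets this as a missing building number and inserts "5" at position 3, after which every later frequent word is out of position and gets inserted again as well.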
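
One possible guard against the last-row failure, sketched under a strong assumption: a complete address has a known number of words (`target_len`, 6 in the sample data), so a row that already has that many words is treated as complete even if the frequent word is absent. `insert_if_short` is a hypothetical helper, not part of the code above, and the assumption will not hold for data where complete addresses vary in length.

```r
rows <- list(
  c("apt", "24", "roadname", "cityville", "b11abc"),      # building number missing
  c("apt", "2", "6", "roadname", "cityville", "b11abc")   # different building number
)
target_len <- 6   # assumed length of a complete address

# Insert `word` at position `pos` only when the row is shorter than a
# complete address and does not already contain the word.
insert_if_short <- function(x, word, pos, target_len) {
  if (length(x) >= target_len || word %in% x) {
    return(x)
  }
  append(x, word, after = min(pos - 1, length(x)))
}

fixed <- lapply(rows, insert_if_short, word = "5", pos = 3,
                target_len = target_len)
sapply(fixed, paste, collapse = " ")
```

With this check the first row gains the "5" at position 3, while the row with building number 6 is left unchanged.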

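Many thanks for this. I have applied it to my real dataset and it works almost all of the time. However, a problem arises when the address list contains strings of different lengths. I tried to edit your solution to basically ignore rows containing very short strings, but without much luck. I have added an example to the question that demonstrates the error I get. – Chris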
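
The error described here most likely comes from indexing past the end of a short row: `x[relPos]` is then `NA`, and `if (NA != "word")` stops with "missing value where TRUE/FALSE needed". A minimal sketch of the "ignore very short rows" option follows; `insert_word` and the cutoff `min_len` are hypothetical, and the cutoff of 8 is a judgment call for the df3-style layout, not something derived from the data.

```r
# Insert `word` at position `pos`, leaving rows shorter than `min_len`
# untouched so that x[pos] is never read past the end of the row.
insert_word <- function(x, word, pos, min_len) {
  stopifnot(min_len >= pos)
  if (length(x) < min_len) {
    return(x)                      # too short: skip this row
  }
  if (identical(x[pos], word)) {
    return(x)                      # word already in place
  }
  append(x, word, after = pos - 1)
}

long_row  <- c("apt", "really", "long", "name", "24",
               "roadname", "cityville", "b11abc")
short_row <- c("apt", "3", "roadname", "cityville", "b11abc")
mid_row   <- c("apt", "75", "5", "roadname", "cityville", "b11abc")

fixed_long  <- insert_word(long_row, "5", pos = 6, min_len = 8)
fixed_short <- insert_word(short_row, "5", pos = 6, min_len = 8)
fixed_mid   <- insert_word(mid_row, "5", pos = 6, min_len = 8)
```

The long row gets "5" at position 6, and the two shorter rows pass through unchanged instead of raising an error. Setting `min_len` only to `pos` would avoid the error too, but would then insert a second "5" into a row like `mid_row`.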