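Extract specific strings using regular expressions
I have a data frame that contains strings and their POS tags. I want to extract specific strings by filtering on particular POS tags.
As a simple example, I want to extract the strings whose tags are based on "NN-NN-NN" and "VB-JJ-NN".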
df <- data.frame(word = c("abrasion process management",
                          "slurries comprise abrasive",
                          "slurry compositions comprise ",
                          "keep high polishing",
                          "improved superabrasive grit",
                          "using ceriacoated silica",
                          "and grinding",
                          "for cmp",
                          "and grinding for"),
                 pos_tag = c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB-JJ-NN",
                             "VBN-JJ-NN", "VBG-JJ-NN", "CC-VBG", "IN-NN", "CC-VBG-IN"))
> df
                           word   pos_tag
1   abrasion process management  NN-NN-NN
2    slurries comprise abrasive NNS-NN-NN
3 slurry compositions comprise  NN-NNS-NN
4           keep high polishing  VB-JJ-NN
5   improved superabrasive grit VBN-JJ-NN
6      using ceriacoated silica VBG-JJ-NN
7                  and grinding    CC-VBG
8                       for cmp     IN-NN
9              and grinding for CC-VBG-IN
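I tried defining my patterns with regular expressions, but I don't think this is an efficient way to define them. Is there a more efficient approach?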
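library(magrittr)  # provides the %>% pipe used below

pos <- c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB.-JJ-NN", "VB-JJ-NN")
# Anchor each pattern and join them into a single alternation
pos2 <- paste0("^", pos, "\\w*$", collapse = "|")
sort_string <- df[grep(pos2, df$pos_tag), ] %>%
  unique()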
Here is what I want to get:
                           word   pos_tag
1   abrasion process management  NN-NN-NN
2    slurries comprise abrasive NNS-NN-NN
3 slurry compositions comprise  NN-NNS-NN
4           keep high polishing  VB-JJ-NN
5   improved superabrasive grit VBN-JJ-NN
6      using ceriacoated silica VBG-JJ-NN
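One possible way to shorten the pattern list is sketched below. It is only an illustration and rests on assumptions about what counts as a match: the first two noun slots may be NN or NNS, the last tag must be exactly NN, and VB may be followed by at most one uppercase letter (VBN, VBG, and so on). Under those assumptions it would also accept "NNS-NNS-NN", which is not in the list above. The names pattern and result are made up for this example.

```r
# Sample data mirroring the data frame above
df <- data.frame(
  word = c("abrasion process management", "slurries comprise abrasive",
           "slurry compositions comprise ", "keep high polishing",
           "improved superabrasive grit", "using ceriacoated silica",
           "and grinding", "for cmp", "and grinding for"),
  pos_tag = c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB-JJ-NN",
              "VBN-JJ-NN", "VBG-JJ-NN", "CC-VBG", "IN-NN", "CC-VBG-IN"),
  stringsAsFactors = FALSE
)

# Assumption: one anchored alternation covers both tag families.
#   NNS?-NNS?-NN     noun-noun-noun, optional plural "S" on the first two slots
#   VB[A-Z]?-JJ-NN   any single-letter verb variant, then adjective, then noun
pattern <- "^(NNS?-NNS?-NN|VB[A-Z]?-JJ-NN)$"

# grepl() returns a logical vector, so it can index the rows directly
result <- df[grepl(pattern, df$pos_tag), ]
print(result)
```

Because the pattern is anchored with ^ and $, shorter tags such as "IN-NN" and longer ones such as "CC-VBG-IN" are not picked up by accident.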
Your expected output includes 'NNS-NN-NN'; the pattern is not clear. – akrun
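This question is not very clear. Let me see if I understand: you want to take the "i"-th element of word and match it against the "i"-th element of pos_tag, writing the rows from 1 to "i" to a file or the console, where "i" is the loop index. You also want to print the row numbers. Is that what you want? – Heto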