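Elastic tokenize into all words combinations

Given the input "quick brown fox jumped", I want to create every possible combination of its words. For example, the string would be tokenized into: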
[
"quick", "quick brown", "quick fox", "quick jumped",
"brown", "brown quick", "brown fox", "brown jumped",
...,
"jumped quick", "jumped brown", "jumped fox", "jumped"
]
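To make the target concrete, the desired token set can be sketched outside Elasticsearch in plain Python. This is only an illustration: the function name `word_combinations` and the `max_words` parameter are hypothetical, and it assumes that only single words and ordered pairs of distinct words are wanted, since the example above shows nothing longer than two words.

```python
from itertools import permutations


def word_combinations(text, max_words=2):
    """Return every ordered arrangement of 1..max_words distinct words.

    Hypothetical helper for illustration; not an Elasticsearch API.
    """
    words = text.split()
    tokens = []
    for size in range(1, max_words + 1):
        # permutations() yields ordered selections, so both
        # "quick brown" and "brown quick" are produced.
        for combo in permutations(words, size):
            tokens.append(" ".join(combo))
    return tokens


tokens = word_combinations("quick brown fox jumped")
print(len(tokens))  # 4 single words + 12 ordered pairs = 16
```

The order of the tokens differs from the listing above, but the set is the same: each word on its own plus each word paired with every other word, in both orders.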
I can use the shingle token filter for this, but it only creates new tokens by concatenating adjacent terms, so I end up with:
[
"quick", "quick brown", "quick brown fox", "quick brown fox jumped",
"brown", "brown fox", "brown fox jumped",
"fox", "fox jumped",
"jumped"
]
This is a step in the right direction, but not what I am looking for.
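The adjacent-only behaviour described above can be mimicked in a few lines of Python, which shows why a pair such as "quick fox" never appears. This is a rough sketch and not Elasticsearch's implementation: the function name `shingles` and the `max_size` parameter are hypothetical, and `max_size=4` is chosen only so that the output matches the listing above.

```python
def shingles(text, max_size=4):
    """Join runs of adjacent words, up to max_size words per token.

    Hypothetical helper that mimics the listed shingle output.
    """
    words = text.split()
    tokens = []
    for start in range(len(words)):
        # Only contiguous slices are taken, so non-adjacent words
        # can never end up in the same token.
        last = min(start + max_size, len(words))
        for end in range(start + 1, last + 1):
            tokens.append(" ".join(words[start:end]))
    return tokens


result = shingles("quick brown fox jumped")
print("quick fox" in result)  # False: "quick" and "fox" are not adjacent
```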
Can you explain your use case? – Val
@Val Long story short: not only the single terms (["quick", "brown", "fox", "jumped"]) but also combinations of those words/terms –