Matching part of a URL

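I have two indices: one contains "document" objects with _id=<url of the document>, e.g. http://site/folder/document_name.doc; the other contains "folder" objects with _id=<url of the folder>, e.g. http://site/folder

In my node.js script I need to match documents to folders, i.e. I search for all folder URLs and then, for each of them, look for the documents whose URL starts with the folder URL

I can't seem to build the right query, one that returns all documents whose _id starts with http://site/folder

Any ideas?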


2016-06-07 Andrey

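I think the better solution is not to use _id for this problem.

Instead, index a field named path (or whatever name you want; the example below calls it url), and look at using the Path Hierarchy Tokenizer together with some creative token filters.

This way, you can use Elasticsearch/Lucene to tokenize the URLs.

For example, https://site/folder is tokenized as two tokens: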

site
site/folder

You can then find any file or folder contained in the site folder by searching for the token: site.

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "http_dropper": {
          "type": "pattern_replace",
          "pattern": "^https?:/{0,}(.*)",
          "replacement": "$1"
        },
        "empty_dropper": {
          "type": "length",
          "min": 1
        },
        "qs_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)[?].*",
          "replacement": "$1"
        },
        "trailing_slash_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)/+$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "url": {
          "tokenizer": "path_hierarchy",
          "filter": [
            "http_dropper",
            "qs_dropper",
            "trailing_slash_dropper",
            "empty_dropper",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "url": {
          "type": "string",
          "analyzer": "url"
        },
        "type": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

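You may or may not want the trailing_slash_dropper that I included. It may also be worthwhile to add the lowercase token filter, but that could make some URL tokens fundamentally incorrect (e.g., mysite.com/bucket/AaDsaAe31AcxX may really care about the case of those characters). You can take the analyzer for a test drive with the _analyze endpoint: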

GET /test/_analyze?analyzer=url&text=http://test.com/text/a/?value=xyz&abc=value

Note: I use Sense, so it does the URL encoding for me. This produces three tokens:

{
  "tokens": [
    {
      "token": "test.com",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text/a",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
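
To make the token stream above easier to reason about, here is a rough in-memory imitation of the url analyzer in plain JavaScript. It is only a sketch: the function names pathHierarchy and analyzeUrl are made up for illustration, nothing here calls Elasticsearch, and the real tokenizer and filters may differ in edge cases (for example, several trailing slashes).

```javascript
// Hypothetical helper: imitates the path_hierarchy tokenizer with its default
// "/" delimiter by emitting every prefix that ends just before a delimiter,
// followed by the full input string.
function pathHierarchy(text) {
  const tokens = [];
  for (let i = 1; i < text.length; i++) {
    if (text[i] === '/') tokens.push(text.slice(0, i));
  }
  tokens.push(text);
  return tokens;
}

// Hypothetical helper: applies the same filters, in the same order, as the
// "url" analyzer from the index settings.
function analyzeUrl(text) {
  const filtered = pathHierarchy(text)
    .map((t) => t.replace(/^https?:\/{0,}(.*)/, '$1')) // http_dropper
    .map((t) => t.replace(/(.*)[?].*/, '$1'))          // qs_dropper
    .map((t) => t.replace(/(.*)\/+$/, '$1'))           // trailing_slash_dropper
    .filter((t) => t.length >= 1);                     // empty_dropper
  return [...new Set(filtered)];                       // unique
}

console.log(analyzeUrl('http://test.com/text/a/?value=xyz&abc=value'));
// [ 'test.com', 'test.com/text', 'test.com/text/a' ]
```

The raw tokens http: and http:/ become empty strings after http_dropper, which is why empty_dropper is needed, and the last two raw tokens collapse into the same value, which is why unique is there.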

Tying it together:

POST /test/type 
{ 
    "type" : "dir", 
    "url" : "https://site" 
} 

POST /test/type 
{ 
    "type" : "dir", 
    "url" : "https://site/folder" 
} 

POST /test/type 
{ 
    "type" : "file", 
    "url" : "http://site/folder/document_name.doc" 
} 

POST /test/type 
{ 
    "type" : "file", 
    "url" : "http://other/site/folder/document_name.doc" 
} 

POST /test/type 
{ 
    "type" : "file", 
    "url" : "http://other_site/folder/document_name.doc" 
} 

POST /test/type 
{ 
    "type" : "file", 
    "url" : "http://site/mirror/document_name.doc" 
} 

GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "url": "http://site/folder"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "type": "file"
          }
        }
      ]
    }
  }
}

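It's important to test this so that you can see what matches, and the order of those matches. Naturally, this finds the document that you expect it to find (and puts it at the top!), but it also finds some other documents that you might not expect, like http://site/mirror/document_name.doc, because it shares the base token: site. There are a number of strategies you can use to exclude those documents, if excluding them is important.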
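
The unexpected hit can be reproduced without a cluster. The sketch below is an illustration only: the token lists are written out by hand as the url analyzer would produce them for the four sample files, and the two filters imitate, in a simplified way that ignores scoring, a match query (which analyzes its input and ORs the resulting tokens by default) and a term query (which looks up one exact token).

```javascript
// Tokens each sample file would be indexed with under the "url" analyzer.
// Hand-written for illustration; this is not an Elasticsearch call.
const indexed = {
  'http://site/folder/document_name.doc':
    ['site', 'site/folder', 'site/folder/document_name.doc'],
  'http://other/site/folder/document_name.doc':
    ['other', 'other/site', 'other/site/folder', 'other/site/folder/document_name.doc'],
  'http://other_site/folder/document_name.doc':
    ['other_site', 'other_site/folder', 'other_site/folder/document_name.doc'],
  'http://site/mirror/document_name.doc':
    ['site', 'site/mirror', 'site/mirror/document_name.doc'],
};

// The match query text "http://site/folder" is analyzed into these tokens.
const queryTokens = ['site', 'site/folder'];

// match-like behaviour: a document is a hit if it shares ANY query token.
const matchHits = Object.keys(indexed).filter((url) =>
  indexed[url].some((token) => queryTokens.includes(token))
);

// term-like behaviour: a document is a hit only if it has this exact token.
const termHits = Object.keys(indexed).filter((url) =>
  indexed[url].includes('site/folder')
);

console.log(matchHits); // the site/folder file and the site/mirror file
console.log(termHits);  // only the site/folder file
```

Under these assumptions, requiring the exact token site/folder is one way to keep the mirror file out of the results.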

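You can take advantage of your tokenization to perform Google-like result filtering, similar to how you can restrict a Google search to a specific domain:

match query site:elastic.co

You could then parse out (manually) the site:elastic.co and take elastic.co as the bounding URL: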

{
  "term": {
    "url": "elastic.co"
  }
}

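Note that this is different from searching the URL. You are explicitly saying "only include documents that contain this exact token in their url". You can go further with site:elastic.co/blog and so on, because that exact token exists. However, it is important to note that if you were to try site:elastic.co/blog/, then no documents would be found, because that token cannot exist given the token filters.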
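
A minimal node.js sketch of the manual parsing step might look like the following. The function name parseSiteFilter and the shape of its return value are assumptions made for illustration. Stripping trailing slashes from the site: value is also an addition that is not part of the answer; it is one possible way to avoid the site:elastic.co/blog/ case described above.

```javascript
// Hypothetical helper: splits a search string into free text and an optional
// term clause built from a "site:" prefix.
function parseSiteFilter(input) {
  const m = /(?:^|\s)site:(\S+)/.exec(input);
  if (!m) {
    return { text: input.trim(), filter: null };
  }

  // Assumption: drop trailing slashes so the value can equal an indexed token
  // (the analyzer's trailing_slash_dropper never leaves one behind).
  const site = m[1].replace(/\/+$/, '');

  // Everything except the "site:..." part remains the free-text query.
  const rest = input.slice(0, m.index) + ' ' + input.slice(m.index + m[0].length);
  const text = rest.replace(/\s+/g, ' ').trim();

  return { text, filter: { term: { url: site } } };
}

console.log(parseSiteFilter('match query site:elastic.co'));
// { text: 'match query', filter: { term: { url: 'elastic.co' } } }
```

The returned filter object can be placed in the filter array of a bool query, next to the term clause on type, while text is used for whatever free-text query the application runs.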


2016-06-07 22:47:04 pickypg

Great answer, thanks - I got it working with this! – Andrey
