2013-09-24 46 views

langid UpdateRequestProcessor only maps the first field

I want to use Solr's langid UpdateRequestProcessor. Here is my configuration:

<updateRequestProcessorChain name="languages"> 
    <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory"> 
     <lst name="invariants"> 
      <str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str> 
      <str name="langid.whitelist">en,fr</str> 
      <str name="langid.fallback">en</str> 
      <str name="langid.langField">detectedlang</str> 
      <bool name="langid.map">true</bool> 
      <bool name="langid.map.keepOrig">false</bool> 
     </lst> 
    </processor> 
    <processor class="solr.RunUpdateProcessorFactory" /> 
</updateRequestProcessorChain> 

My fields look like this:

<fields> 
    <field name="_root_" type="string" indexed="true" stored="false"/> 
    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/> 

    <field name="id" type="string" indexed="true" stored="true" required="true" /> 

    <!-- raw fields from sql db --> 
    <field name="expertise_id" type="int" indexed="true" stored="true" /> 
    <field name="person_id" type="int" indexed="true" stored="true" /> 
    <field name="mod_date" type="date" indexed="true" stored="true" /> 
    <field name="lang" type="string" indexed="true" stored="true" /> 
    <field name="focus" type="text_general" indexed="true" stored="true" /> 
    <field name="expertise" type="text_general" indexed="true" stored="true" /> 
    <field name="platforms" type="text_general" indexed="true" stored="true" /> 
    <field name="partners" type="text_general" indexed="true" stored="true" /> 
    <field name="participation" type="text_general" indexed="true" stored="true" /> 
    <field name="additional" type="text_general" indexed="true" stored="true" /> 
    <field name="tag" type="text_general" termVectors="true" multiValued="true" />  
    <field name="facet_tag" type="string" stored="false" indexed="false" docValues="true" multiValued="true" default=""/> 

    <!-- language detected by solr --> 
    <field name="detectedlang" type="string" indexed="true" stored="true" /> 

    <!-- defined locale fields --> 
    <dynamicField name="*_en" type="text_en" indexed="true" stored="true" /> 
    <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" /> 

    <copyField source="tag" target="facet_tag"/> 

</fields> 

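When I run an update or a dataimport, I know the "languages" update chain is being used, because focus is mapped to focus_en and detectedlang is set. However, none of the other fields listed in langid.fl are mapped. Why?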

An example update request:

{ 
    "additional": "here is some other information about me.", 
    "expertise_id": "10000", 
    "id": "foo_10000", 
    "focus": "this is my new focus. It is very exciting. When I am done I expect to be super experienced." 
} 

Here is the result of a query for expertise_id=10000. Note that additional was not moved to additional_en:

"response":{"numFound":1,"start":0,"docs":[ 
     { 
     "additional":"here is some other information about me.", 
     "expertise_id":10000, 
     "id":"foo_10000", 
     "detectedlang":"en", 
     "focus_en":"this is my new focus. It is very exciting. When I am done I expect to be super experienced.", 
     "_version_":1447088846110982144}] 
    } 
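
A quick way to see which fields the chain skipped is to compare the returned document against the names in langid.fl. The following is only a minimal sketch: the helper name `unmapped_fields` is hypothetical, and it assumes the mapped field is named `<field>_<detectedlang>`, as the `focus_en` result above suggests.

```python
# Document as returned by the query above (copied from the response).
doc = {
    "additional": "here is some other information about me.",
    "expertise_id": 10000,
    "id": "foo_10000",
    "detectedlang": "en",
    "focus_en": "this is my new focus. It is very exciting. "
                "When I am done I expect to be super experienced.",
    "_version_": 1447088846110982144,
}

# Field names from langid.fl in the configuration.
LANGID_FIELDS = ["focus", "expertise", "platforms",
                 "partners", "participation", "additional"]


def unmapped_fields(document, fields, lang_field="detectedlang"):
    """Return fields still present under their original name, with no
    '<field>_<lang>' counterpart in the document (hypothetical helper)."""
    lang = document.get(lang_field)
    return [
        name for name in fields
        if name in document and f"{name}_{lang}" not in document
    ]


skipped = unmapped_fields(doc, LANGID_FIELDS)
print(skipped)
```

For the document above this reports `additional`, which matches the symptom described: the field is present, but it was never renamed.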

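See https://wiki.apache.org/solr/LanguageDetection#Caveats. "Because these implementations use an n-gram based approach to detection, they can easily fail to detect the language on particularly short input." Have you tried using longer text? – arun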


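@arun: To test the idea that length might be the problem, I just added a document in which all of the mapped fields contain the same 200-word English text. 'focus' was mapped to 'focus_en'. None of the others were mapped. – dnagirl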


@dnagirl, did you find a solution? – forguta

Answer


It turned out the problem was a syntax error. This line:

<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str> 

must be written as:

<str name="langid.fl">focus,expertise,platforms,partners,participation,additional</str> 

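The docs state that the field list should be comma- or space-separated values. Apparently a comma followed by a space breaks things (even though that works fine in other Solr contexts, for example in a requestHandler, on which langid.fl is supposedly modeled). I also tried the space-separated syntax, but it did not solve my problem.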
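
The symptom is consistent with the list being split on commas without trimming whitespace, although that is an assumption about the processor's internals and not something confirmed here. The sketch below is not Solr code. It only illustrates how such a split would leave `focus` as the sole name that matches a document field; the function names are hypothetical.

```python
# Illustration only: NOT Solr's implementation. It assumes the field list
# is split on ',' with no trimming, which would explain the behaviour seen.
RAW_FL = "focus, expertise, platforms, partners, participation, additional"

# Fields present in the example update request.
document_fields = {"focus", "additional", "expertise_id", "id"}


def split_on_comma_only(value):
    """Hypothetical naive parser: split on ',' and keep whitespace as-is."""
    return value.split(",")


def split_and_trim(value):
    """Tolerant parser: split on ',' and strip whitespace around each name."""
    return [name.strip() for name in value.split(",") if name.strip()]


naive = split_on_comma_only(RAW_FL)
tolerant = split_and_trim(RAW_FL)

# Only names that exactly equal a document field can be mapped.
naive_matches = [name for name in naive if name in document_fields]
tolerant_matches = [name for name in tolerant if name in document_fields]

print(naive)             # every name after the first keeps a leading space
print(naive_matches)
print(tolerant_matches)
```

Under that assumption only the first entry has no leading space, so only `focus` matches a real field name. Writing the list with bare commas sidesteps the issue regardless of how the value is parsed.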

I hope this helps someone.


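Hmm, I was going to post that yesterday as a comment suggesting the next thing for you to try, but I thought it was too silly, so I didn't post it :). – arun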