2017-08-25

I am using a TMX file (an XML file for translation data) as my source in Logstash to index data into Elasticsearch. How can I parse a TMX file in Logstash?

A sample TMX file looks like this:

<?xml version="1.0" encoding="UTF-8"?> 
<tmx version="1.4"> 
    <header creationtool="ModernMT - modernmt.eu" creationtoolversion="1.0" datatype="plaintext" o-tmf="ModernMT" segtype="sentence" adminlang="en-us" srclang="en-GB"/> 
    <body> 
    <tu srclang="en-GB" datatype="plaintext" creationdate="20121019T114713Z"> 
    <tuv xml:lang="en-GB"> 
    <seg>The purpose of the standard is to establish and define the requirements for the provision of quality services by translation service providers.</seg> 
    </tuv> 
    <tuv xml:lang="it"> 
    <seg>L'obiettivo dello standard è stabilire e definire i requisiti affinché i fornitori di servizi di traduzione garantiscano servizi di qualità.</seg> 
    </tuv> 
</tu> 
<tu srclang="en-GB" datatype="plaintext" creationdate="20111223T112746Z"> 
    <tuv xml:lang="en-GB"> 
    <seg>With 1,800 experienced and qualified resources translating regularly into over 200 language combinations, you can count on us for high quality professional translation services.</seg> 
    </tuv> 
    <tuv xml:lang="it"> 
    <seg>Abbiamo 1.800 professionisti esperti e qualificati che traducono regolarmente in oltre 200 combinazioni linguistiche; perciò, se cercate la qualità, potete contare su di noi.</seg> 
    </tuv> 
</tu> 
<tu srclang="en-GB" datatype="plaintext" creationdate="20111223T112746Z"> 
    <tuv xml:lang="en-GB"> 
    <seg>Access our section of useful links</seg> 
    </tuv> 
    <tuv xml:lang="it"> 
    <seg>Da qui potrete accedere a una sezione che propone link a siti che possono essere di vostro interesse</seg> 
    </tuv> 
</tu> 

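What I need is to treat each <tu> block as one event, with the two <tuv> blocks inside it used as data fields. The data in the first <tuv> block should be indexed in ES as the source-language field, and the data in the second <tuv> block as the target-language field.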

A TMX document can contain more than 10,000 <tuv> blocks.

I am having trouble with the xml filter. My config currently looks like this:

input { 
    file { 
     path => "/en-gb_pt-pt/81384/81384.xml" 
      start_position => "beginning" 
     codec => multiline { 
       pattern => "<tu>" 
        negate => "true" 
        what => "previous" 
     } 
    } 
} 

filter { 
    xml { 
     source => "message" 
      target => "xml_content" 
      xpath => [ "//seg", "seg" ] 
    } 
} 

output { 
    stdout { 
      #codec => json 
      codec => rubydebug 
    } 
} 

Here is part of my index template:

"segment": { 
      "_parent": { 
       "type": "tm" 
      }, 
      "_routing": { 
       "required": "true" 
      }, 
      "properties": { 
       "@timestamp": { 
        "type": "date", 
        "format": "strict_date_optional_time||epoch_millis" 
       }, 
       "@version": { 
        "type": "string" 
       }, 
       "source": { 
        "type": "string", 
        "store": "true", 
        "fields": { 
         "length": { 
          "type":  "token_count", 
          "analyzer": "standard" 
         } 
        } 
       }, 
       "target": { 
        "type": "string", 
        "store": "true", 
        "fields": { 
         "length": { 
          "type":  "token_count", 
          "analyzer": "standard" 
         } 
        } 
       } 
      } 
     } 

Answer

I would suggest a simpler approach: use the grok or dissect filter instead of the xml filter.

filter { 
    dissect { 
     mapping => { "message" => "%{}<seg>%{src}</seg>%{}<seg>%{trg}</seg>%{}" } 
    } 
    mutate { 
     remove_field => ["message"] 
    } 
} 

With this, you get:

{ 
      "path" => "/en-gb_pt-pt/81384/81384.xml", 
    "@timestamp" => 2017-08-25T15:07:34.567Z, 
      "src" => "The purpose of the standard is to establish and define the requirements for the provision of quality services by translation service providers.", 
     "@version" => "1", 
      "host" => "my_host", 
      "trg" => "L'obiettivo dello standard è stabilire e definire i requisiti affinché i fornitori di servizi di traduzione garantiscano servizi di qualità.", 
      "tags" => [ 
     [0] "multiline" 
    ] 
}
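
One thing to watch in the input section of the question: every opening tag in the sample carries attributes (`<tu srclang=...>`), so the literal multiline pattern "<tu>" never matches a line. With `negate => true` and `what => "previous"`, that means every line is appended to the previous one, and the file is not split into one event per <tu> block. Below is a minimal sketch of a pipeline that starts a new event at each opening <tu tag. It is untested against the real file and rests on a few assumptions: the `<tu[ >]` pattern, the 2-second `auto_flush_interval`, the `drop` of events without a <seg>, and the field names `source` and `target` (chosen to match the index template in the question) are my own choices, not something from the original config.

```
input {
    file {
        path => "/en-gb_pt-pt/81384/81384.xml"
        start_position => "beginning"
        codec => multiline {
            # Assumption: a new event starts at "<tu " or "<tu>".
            # "<tuv" does not match because "v" is neither a space nor ">".
            pattern => "<tu[ >]"
            negate => true
            what => "previous"
            # Assumption: flush the pending event after 2 idle seconds,
            # otherwise the last <tu> block waits for a next matching line.
            auto_flush_interval => 2
        }
    }
}

filter {
    # The first event holds the XML declaration, <header> and <body>
    # but no <seg>, so it is discarded.
    if [message] !~ /<seg>/ {
        drop { }
    }
    dissect {
        # First <seg> goes to "source", second <seg> to "target".
        mapping => { "message" => "%{}<seg>%{source}</seg>%{}<seg>%{target}</seg>%{}" }
    }
    mutate {
        remove_field => ["message"]
    }
}

output {
    stdout {
        codec => rubydebug
    }
}
```

Each event should then carry one `source`/`target` pair. The `_parent` and `_routing` requirements of the template are not handled here and would still need to be set in the elasticsearch output.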