2014-08-29

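Our product uses the ASCII folding token filter, and our customers are asking for specifics about it. In particular, they want the mapping from Unicode characters to their ASCII equivalents. While I'm sure most conversions are obvious (e.g. ü = u), there are some "tricky" ones, such as ß, which I believe is converted to "ss". What is the mapping from Unicode characters to the first 127 ASCII characters in ElasticSearch's ASCII folding token filter?

I've searched Google but haven't been able to find a definitive mapping. Is there somewhere I can get this information?

Thanks for your help, Eric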

[Relevant test code](https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_2/lucene/src/test/org/apache/lucene/analysis/TestASCIIFoldingFilter.java) – 2014-08-30 04:35:27

Answer

You can just read the source code for ASCIIFoldingFilter.

A sample from the source:

    case '\u00C0': // À [LATIN CAPITAL LETTER A WITH GRAVE]
    case '\u00C1': // Á [LATIN CAPITAL LETTER A WITH ACUTE]
    case '\u00C2': // Â [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
    case '\u00C3': // Ã [LATIN CAPITAL LETTER A WITH TILDE]
    case '\u00C4': // Ä [LATIN CAPITAL LETTER A WITH DIAERESIS]
    case '\u00C5': // Å [LATIN CAPITAL LETTER A WITH RING ABOVE]
    case '\u0100': // Ā [LATIN CAPITAL LETTER A WITH MACRON]
    case '\u0102': // Ă [LATIN CAPITAL LETTER A WITH BREVE]
    case '\u0104': // Ą [LATIN CAPITAL LETTER A WITH OGONEK]
    case '\u018F': // Ə http://en.wikipedia.org/wiki/Schwa [LATIN CAPITAL LETTER SCHWA]
    case '\u01CD': // Ǎ [LATIN CAPITAL LETTER A WITH CARON]
    case '\u01DE': // Ǟ [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
    case '\u01E0': // Ǡ [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
    case '\u01FA': // Ǻ [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
    case '\u0200': // Ȁ [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
    case '\u0202': // Ȃ [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
    case '\u0226': // Ȧ [LATIN CAPITAL LETTER A WITH DOT ABOVE]
    case '\u023A': // Ⱥ [LATIN CAPITAL LETTER A WITH STROKE]
    case '\u1D00': // ᴀ [LATIN LETTER SMALL CAPITAL A]
    case '\u1E00': // Ḁ [LATIN CAPITAL LETTER A WITH RING BELOW]
    case '\u1EA0': // Ạ [LATIN CAPITAL LETTER A WITH DOT BELOW]
    case '\u1EA2': // Ả [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
    case '\u1EA4': // Ấ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
    case '\u1EA6': // Ầ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
    case '\u1EA8': // Ẩ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
    case '\u1EAA': // Ẫ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
    case '\u1EAC': // Ậ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
    case '\u1EAE': // Ắ [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
    case '\u1EB0': // Ằ [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
    case '\u1EB2': // Ẳ [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
    case '\u1EB4': // Ẵ [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
    case '\u1EB6': // Ặ [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
    case '\u24B6': // Ⓐ [CIRCLED LATIN CAPITAL LETTER A]
    case '\uFF21': // Ａ [FULLWIDTH LATIN CAPITAL LETTER A]
        output[outputPos++] = 'A';
        break;

As you can see, it does nothing for Greek and Cyrillic letters, let alone other scripts.

Also, as you guessed, ß is converted to ss:

    case '\u00DF': // ß [LATIN SMALL LETTER SHARP S]
        output[outputPos++] = 's';
        output[outputPos++] = 's';
        break;
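
To make the pattern in these excerpts concrete, here is a minimal, self-contained sketch of the same switch-based folding idea. It is not the Lucene code: `AsciiFoldSketch` and `fold` are hypothetical names, and only a handful of mappings are included, assuming the behavior described above (one-to-one replacements, the ß to "ss" expansion, and unmapped characters passed through unchanged).

```java
// Simplified illustration only; this is NOT the Lucene implementation.
// AsciiFoldSketch and fold are hypothetical names, and the mapping below
// is a tiny subset chosen for demonstration.
final class AsciiFoldSketch {

    static String fold(String input) {
        // One input char may expand to several output chars (sharp s -> "ss"),
        // so a growable buffer stands in for a pre-sized output array.
        StringBuilder output = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < '\u0080') {
                // Already ASCII: copy through unchanged.
                output.append(c);
                continue;
            }
            switch (c) {
                case '\u00C0': // LATIN CAPITAL LETTER A WITH GRAVE
                case '\u00C4': // LATIN CAPITAL LETTER A WITH DIAERESIS
                case '\u00C5': // LATIN CAPITAL LETTER A WITH RING ABOVE
                    output.append('A');
                    break;
                case '\u00FC': // LATIN SMALL LETTER U WITH DIAERESIS
                    output.append('u');
                    break;
                case '\u00DF': // LATIN SMALL LETTER SHARP S
                    output.append("ss");
                    break;
                default:
                    // No mapping in this sketch (e.g. Greek, Cyrillic): keep as-is.
                    output.append(c);
                    break;
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Stra\u00DFe"));            // Strasse
        System.out.println(fold("\u00C4rger \u00FCber"));   // Arger uber
        // GREEK CAPITAL LETTER OMEGA is left untouched.
        System.out.println(fold("\u03A9").equals("\u03A9")); // true
    }
}
```

For a definitive mapping to hand to customers, the switch statement in the `ASCIIFoldingFilter` source for the Lucene version your cluster ships with remains the authoritative list; the sketch only shows how to read it.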