From RFC 5646 / BCP 47:
    Language-Tag  = langtag             ; normal language tags
                  / privateuse          ; private use tag
                  / grandfathered       ; grandfathered tags

    langtag       = language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse]

    language      = 2*3ALPHA            ; shortest ISO 639 code
                    ["-" extlang]       ; sometimes followed by
                                        ; extended language subtags
                  / 4ALPHA              ; or reserved for future use
                  / 5*8ALPHA            ; or registered language subtag

    privateuse    = "x" 1*("-" (1*8alphanum))

    grandfathered = irregular           ; non-redundant tags registered
                  / regular             ; during the RFC 3066 era
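As a rough illustration of the `langtag` production, here is a minimal Python sketch that turns it into a regular expression. The definitions of `script`, `region`, `variant`, `extension` and `extlang` are not part of the excerpt; they are filled in from RFC 5646 itself. The names `LANGTAG_RE` and `parse_langtag` are made up for this example. The pattern only checks syntax: it does not consult the IANA registry and does not cover `privateuse`-only or `grandfathered` tags.

```python
import re

# Sketch of the 'langtag' production only (case-insensitive).
LANGTAG_RE = re.compile(
    r"""
    ^
    (?P<language>
        [a-z]{2,3}(?:-[a-z]{3}){0,3}    # 2*3ALPHA plus up to three extlang subtags
      | [a-z]{4}                        # reserved for future use
      | [a-z]{5,8}                      # registered language subtag
    )
    (?:-(?P<script>[a-z]{4}))?                               # 4ALPHA
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                      # 2ALPHA / 3DIGIT
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)   # zero or more variants
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)     # singleton + subtags
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?               # trailing private use
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_langtag(tag):
    """Return a dict of subtags, or None if `tag` is not a 'langtag'."""
    match = LANGTAG_RE.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # The primary language is the first piece of the 'language' group;
    # anything after it is an extlang subtag.
    parts["primary"] = parts["language"].split("-")[0].lower()
    return parts
```

For example, `parse_langtag("zh-Hant-TW")` yields a primary language of `zh`, script `Hant` and region `TW`, while `parse_langtag("x-private")` and `parse_langtag("en-GB-oed")` both return `None` because neither matches `langtag`.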
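It looks like the first segment of most BCP 47 tags should be a valid ISO 639 code, although it may not be the three-letter variant. A few kinds of BCP 47 language tag do not start with an ISO 639 code: those beginning with x- or i-, plus a number of legacy tags that match the grandfathered part of the grammar: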
    irregular     = "en-GB-oed"         ; irregular tags do not match
                  / "sgn-BE-FR"         ; also includes i- prefixed codes
                  / "sgn-BE-NL"
                  / "sgn-CH-DE"

    regular       = "art-lojban"        ; these tags match the 'langtag'
                  / "cel-gaulish"       ; production, but their subtags
                  / "no-bok"            ; are not extended language
                  / "no-nyn"            ; or variant subtags: their meaning
                  / "zh-guoyu"          ; is defined by their registration
                  / "zh-hakka"          ; and all of these are deprecated
                  / "zh-min"            ; in favor of a more modern
                  / "zh-min-nan"        ; subtag or sequence of subtags
                  / "zh-xiang"
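One way to screen out these special cases before looking at the first subtag is sketched below. It is only an illustration: the sets contain just the tags quoted in the excerpt above, `classify_special` is a hypothetical helper name, and treating every `i-` prefix as grandfathered is a simplification (RFC 5646 registers a fixed list of `i-` tags).

```python
# Only the tags quoted in the excerpt; RFC 5646 lists further
# irregular tags, most of them starting with "i-".
IRREGULAR = {"en-gb-oed", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de"}
REGULAR = {
    "art-lojban", "cel-gaulish", "no-bok", "no-nyn", "zh-guoyu",
    "zh-hakka", "zh-min", "zh-min-nan", "zh-xiang",
}


def classify_special(tag):
    """Return 'privateuse', 'grandfathered', or None for ordinary tags."""
    lowered = tag.lower()  # BCP 47 tags are case-insensitive
    if lowered.startswith("x-"):
        return "privateuse"
    # Simplification: any "i-" prefix is treated as grandfathered.
    if lowered.startswith("i-") or lowered in IRREGULAR or lowered in REGULAR:
        return "grandfathered"
    return None
```

With this in place, `classify_special("zh-min-nan")` and `classify_special("i-klingon")` both report `'grandfathered'`, while an ordinary tag such as `en-US` gives `None` and can go on to normal parsing.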
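A good start would be something like the following:

    def extract_iso_code(bcp_identifier):
        # partition() also copes with tags that contain no hyphen, e.g. "en"
        language, _, _ = bcp_identifier.partition('-')
        if 2 <= len(language) <= 3:
            # this is a valid ISO-639 code or is grandfathered
            return language
        # handle non-ISO codes
        raise ValueError(bcp_identifier)

The conversion from the 2-character variant to the 3-character variant should be easy to handle, since the mapping is well known.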
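To make the 2-to-3 character conversion concrete, here is a small sketch. The table is a tiny hand-picked subset for illustration, not a complete mapping, and the names `ISO_639_1_TO_3` and `to_three_letter` are invented for this example. The 3-letter values are the ISO 639-3 codes; note that ISO 639-2/B uses different codes for some of these languages (for example `ger` rather than `deu`), so pick the target standard deliberately.

```python
# Tiny illustrative subset of the ISO 639-1 -> ISO 639-3 mapping.
# A real table would be loaded from the published ISO 639 code lists.
ISO_639_1_TO_3 = {
    "de": "deu",
    "en": "eng",
    "es": "spa",
    "fr": "fra",
    "ja": "jpn",
    "zh": "zho",
}


def to_three_letter(code):
    """Return the 3-letter form of a 2- or 3-letter language code."""
    code = code.lower()
    if len(code) == 3:
        return code  # assumed to be a 3-letter code already; not validated
    try:
        return ISO_639_1_TO_3[code]
    except KeyError:
        raise ValueError(f"no 3-letter mapping known for {code!r}") from None
```

Here `to_three_letter("EN")` gives `"eng"`, a 3-letter input such as `"deu"` is passed through unchanged, and an unknown 2-letter code raises `ValueError`.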
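Thanks! This will make parsing easier. Do you have a reference for this fact? – 2014-09-28 14:20:50
@AdamMatan: Added the RFC reference for you. – 2014-09-28 14:28:05
ISO 639 is not just 639-1 and 639-2. A 3-character ISO 639 language code can also be a 639-3 code. – 2015-02-01 20:20:35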