2016-11-24 153 views
2

Decode UTF8 entities to UTF8 in C++

I have a string with UTF8 entities (I am not sure that is the right name for them):

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441"; 

How can I convert it to something more readable? I spent a few hours digging through the std::codecvt documentation, using g++ with C++11 support, but got nowhere:

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441"; 

wstring_convert<codecvt_utf8_utf16<char16_t>,char16_t> convert; 
string dest = convert.to_bytes(std); 

This produces a nightmare of an error trace that starts with:

error: no matching function for call to ‘std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>::to_bytes(std::string&) 

I hope there is another way.

Answers

0

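What you are looking at are not entities but code points. You are defining the characters through Unicode escape sequences, and the compiler automatically converts them to UTF-8. The typical way to convert that to UTF-16 and back looks like this: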

static std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter; 
std::string ws2s(const std::wstring &wstr) { 
    std::string narrow = converter.to_bytes(wstr); 
    return narrow; 
} 

std::wstring s2ws(const std::string &str) { 
    std::wstring wide = converter.from_bytes(str); 
    return wide; 
} 

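Of course, you cannot convert the original string into another string of the same type (std::string) and expect the escapes to survive as 16-bit values, because a char cannot hold such characters. That is why the compiler converts the UTF-16 codes to UTF-8 in the first place.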
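To see what the compiler does with the escapes, here is a minimal sketch. It assumes the execution character set is UTF-8 (the default for GCC and Clang; MSVC needs /utf-8). The function names compiled_escapes and raw_escapes are made up for illustration.

```cpp
#include <string>

// Assumption: the execution character set is UTF-8.
// The compiler processes the \u escapes, so the literal already holds
// the UTF-8 bytes of the six Cyrillic letters.
inline std::string compiled_escapes() {
    return "\u0418\u043d\u0434\u0435\u043a\u0441";
}

// With escaped backslashes the compiler leaves the text alone: the string
// holds the characters '\', 'u', '0', '4', '1', '8', ... verbatim.
inline std::string raw_escapes() {
    return "\\u0418\\u043d\\u0434\\u0435\\u043a\\u0441";
}
```

Under that assumption compiled_escapes().size() is 12 (six letters at two bytes each), while raw_escapes().size() is 36 (six escapes at six characters each). Only the second form needs decoding at run time.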

+0

I'm pretty sure those functions have no clue about the '\u' notation. – tadman

+0

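They don't need to. The compiler takes care of that, because it processes Unicode escape sequences in string literals. If the OP wanted to keep the raw Unicode escape sequences in the string, he would write '\\u0418' and so on (and my answer would be different). –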

2

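First, you are using std::wstring_convert backwards. You have a UTF-8 encoded std::string that you want to convert to a wide Unicode string. You get the compiler error because to_bytes() does not accept a std::string as input. It takes a std::wstring_convert::wide_string (which in your case is std::u16string, because you specialized on char16_t), so you need to call from_bytes() instead of to_bytes():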

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441"; 

std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert; 
std::u16string dest = convert.from_bytes(std); 
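For completeness, a self-contained version of that conversion with the headers it needs. This is only a sketch: utf8_to_utf16 is a name chosen here for illustration, and std::wstring_convert and <codecvt> are deprecated since C++17, so newer compilers may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes in, UTF-16 code units out.
// from_bytes() throws std::range_error if the input is not valid UTF-8.
inline std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);
}
```

Passing in the twelve UTF-8 bytes of the string from the question gives back six char16_t code units, the first of which is 0x0418.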

Now, that being said, section 9 of the JSON specification states:

9 String

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.

\" represents the quotation mark character (U+0022).

\\ represents the reverse solidus character (U+005C).

\/ represents the solidus character (U+002F).

\b represents the backspace character (U+0008).

\f represents the form feed character (U+000C).

\n represents the line feed character (U+000A).

\r represents the carriage return character (U+000D).

\t represents the character tabulation character (U+0009).

So, for example, a string containing only a single reverse solidus character may be represented as " \\ ".

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u , followed by four hexadecimal digits that encode the code point. Hexadecimal digits can be digits (U+0030 through U+0039) or the hexadecimal letters A through F in uppercase (U+0041 through U+0046) or lowercase (U+0061 through U+0066). So, for example, a string containing only a single reverse solidus character may be represented as " \u005C ".

The following four cases all produce the same result:

" \u002F "

" \u002f "

" \/ "

" / "

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as " \uD834\uDD1E ".

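The raw JSON data itself may be encoded in UTF-8 (the most common encoding), UTF-16, and so on, but regardless of the encoding used, the character sequence "\u0418\u043d\u0434\u0435\u043a\u0441" represents the UTF-16 code unit sequence U+0418 U+043d U+0434 U+0435 U+043a U+0441, which is the Unicode string "Индекс".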

If you use an actual JSON parser (such as JSON for Modern C++, jsoncpp, RapidJSON, etc.), it will parse the UTF-16 code unit values for you and return a readable Unicode string.

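However, if you are processing the JSON data manually, you have to decode any \x and \uXXXX escape sequences yourself. std::wstring_convert cannot do that for you. It can only convert the JSON std::string to a std::wstring/std::u16string, if that makes the data easier to parse. You still have to parse the content of the JSON separately.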
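As a rough illustration of what decoding by hand involves, here is a minimal sketch that turns \uXXXX escapes (including surrogate pairs) into UTF-8. The names parse_hex4, append_utf8 and decode_u_escapes are hypothetical helpers, not part of any library. The sketch leaves all other JSON escapes untouched and does not validate the surrounding JSON.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: read exactly four hex digits starting at s[pos].
inline unsigned long parse_hex4(const std::string& s, std::size_t pos) {
    if (pos + 4 > s.size()) throw std::invalid_argument("truncated \\u escape");
    unsigned long value = 0;
    for (std::size_t i = pos; i < pos + 4; ++i) {
        const char c = s[i];
        value <<= 4;
        if (c >= '0' && c <= '9') value |= static_cast<unsigned long>(c - '0');
        else if (c >= 'a' && c <= 'f') value |= static_cast<unsigned long>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') value |= static_cast<unsigned long>(c - 'A' + 10);
        else throw std::invalid_argument("bad hex digit in \\u escape");
    }
    return value;
}

// Hypothetical helper: append one Unicode code point to out as UTF-8.
inline void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Replace every \uXXXX escape with UTF-8 and copy everything else unchanged.
inline std::string decode_u_escapes(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != '\\' || i + 1 >= in.size()) {
            out += in[i++];                 // ordinary character
            continue;
        }
        if (in[i + 1] != 'u') {             // another escape such as \\ or \n:
            out += in[i];                   // copy both characters verbatim so
            out += in[i + 1];               // that "\\u0041" is not misread
            i += 2;
            continue;
        }
        unsigned long cp = parse_hex4(in, i + 2);
        i += 6;
        if (cp >= 0xD800 && cp <= 0xDBFF) { // high surrogate: a low one must follow
            if (i + 1 >= in.size() || in[i] != '\\' || in[i + 1] != 'u')
                throw std::invalid_argument("unpaired high surrogate");
            const unsigned long low = parse_hex4(in, i + 2);
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::invalid_argument("invalid low surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            i += 6;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Fed the 36-character text \u0418\u043d\u0434\u0435\u043a\u0441, decode_u_escapes returns the 12 UTF-8 bytes of "Индекс". A real JSON parser does this and much more, so prefer one whenever you can.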

Afterwards, if needed, you can use std::wstring_convert to convert any extracted std::wstring/std::u16string strings back to UTF-8 to save memory.
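A short sketch of that reverse direction, under the same caveats as before: utf16_to_utf8 is an illustrative name, and std::wstring_convert is deprecated since C++17.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-16 code units in, UTF-8 bytes out.
// to_bytes() throws std::range_error on invalid input such as a lone surrogate.
inline std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);
}
```

Here to_bytes() is the right call, because the input is the wide string and the output is the byte string.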

+0

I would be happy to use JSON for Modern C++, but when I try to parse the JSON with it I get an error: what(): parse error - unexpected '�'. The code is just: auto j3 = json::parse(json_string); –

+0

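'�' is the Unicode code point U+FFFD REPLACEMENT CHARACTER (UTF-8 byte sequence 0xEF 0xBF 0xBD). Your JSON data was probably put through a charset conversion using the wrong charset before it was passed to the JSON parser. The "JSON for Modern C++" parser only accepts valid UTF-8 input. –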
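One way to check for this before handing the data to a parser is to look for that byte sequence directly. The sketch below is only a diagnostic aid, and find_replacement_char is a name invented here.

```cpp
#include <cstddef>
#include <string>

// Returns the byte offset of the first U+FFFD (bytes EF BF BD) in utf8,
// or std::string::npos if the sequence does not occur.
inline std::size_t find_replacement_char(const std::string& utf8) {
    return utf8.find("\xEF\xBF\xBD");
}
```

A result other than std::string::npos suggests the text was already damaged by an earlier conversion step, so the fix belongs there rather than in the parsing code.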