2013-03-09 81 views

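Problem converting from UTF-16 to UTF-8

I'm trying to convert a UTF-16 string (obtained from a JSString in SpiderMonkey 19) to a UTF-8 string. I think the conversion itself works, but for some reason the conversion routine adds two extra bytes for every Unicode (non-ASCII) character. I'm pretty sure I'm doing something wrong; I have tried different encodings, with no good results. This is what I'm getting now: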

// UTF-16 string "áéíóúñ aeiou", this is the string being converted 
// (you can find "aeiou" after \x20\x00, where \x61\x00 is "a") 
\xC3\x00\xA1\x00\xC3\x00\xA9\x00\xC3\x00\xAD\x00\xC3\x00\xB3\x00\xC3\x00\xBA\x00\xC3\x00\xB1\x00\x20\x00\x61\x00\x65\x00\x69\x00\x6F\x00\x75\x00\x6E\x00 

// UTF-8 string, test string, taken from: 
// const char* cmp = "áéíóúñ aeiou" 
// This is the result I'm looking for. 
\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba\xc3\xb1 aeiou 

// UTF-8 string I'm getting after iconv(utf16, utf8) 
\xc3\x83\xc2\xa1\xc3\x83\xc2\xa9\xc3\x83\xc2\xad\xc3\x83\xc2\xb3\xc3\x83\xc2\xba\xc3\x83\xc2\xb1 aeioun 

As you can see, there are two extra bytes (\x83\xc2) inside every non-ASCII character. Does anyone know why?
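A quick way to check where the extra bytes come from is to decode the UTF-16LE dump by hand. The sketch below is a minimal, hand-rolled UTF-16LE to UTF-8 encoder. The names `appendUtf8` and `utf16leToUtf8` are made up for illustration, and it does not reject lone surrogates. It is not the iconv implementation, only the standard encoding rules written out.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one code point (assumed <= U+10FFFF) to `out` as UTF-8.
static void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert raw UTF-16LE bytes to UTF-8.
// Lone surrogates are passed through unvalidated (a simplification).
std::string utf16leToUtf8(const std::string& bytes) {
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Little-endian: low byte first, high byte second.
        std::uint32_t cp = static_cast<unsigned char>(bytes[i]) |
                           (static_cast<unsigned char>(bytes[i + 1]) << 8);
        // Combine a high surrogate with the following low surrogate.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < bytes.size()) {
            std::uint32_t low = static_cast<unsigned char>(bytes[i + 2]) |
                                (static_cast<unsigned char>(bytes[i + 3]) << 8);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

Under these rules, the first four input bytes `\xC3\x00\xA1\x00` are the code units U+00C3 and U+00A1, and `utf16leToUtf8(std::string("\xC3\x00\xA1\x00", 4))` yields `\xc3\x83\xc2\xa1`, the same bytes iconv returned. A real "á" is U+00E1, which is `\xE1\x00` in UTF-16LE and encodes to the expected `\xc3\xa1`. This suggests the conversion itself is correct and that the source string already holds UTF-8 bytes stored one per 16-bit unit.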

This is my conversion routine:

shared_ptr<char> convertToUTF8(char* utf16string, size_t len) { 
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE"); 
    char* utf8; 
    size_t utf8len; 

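    // UTF-8 can need up to 3 bytes for each 2-byte UTF-16 code unit;
    // the extra byte passed to calloc keeps the result NUL-terminated.
    utf8len = (len / 2) * 3;
    utf8 = (char *)calloc(utf8len + 1, 1);
    // Memory from calloc must be released with free(), not delete.
    shared_ptr<char> outptr(utf8, free);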

    size_t converted = iconv(cd, &utf16string, &len, &utf8, &utf8len); 
    if (converted == (size_t)-1) {
        fprintf(stderr, "iconv failed\n");
        switch (errno) {
            case EILSEQ:
                fprintf(stderr, "Invalid multibyte sequence.\n");
                break;
            case EINVAL:
                fprintf(stderr, "Incomplete multibyte sequence.\n");
                break;
            case E2BIG:
                fprintf(stderr, "No more room (iconv).\n");
                break;
            default:
                fprintf(stderr, "Error: %s.\n", strerror(errno));
                break;
        }
        outptr = NULL;
    }
    iconv_close(cd); 
    assert(outptr); 
    return outptr; 
} 

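I also tried the solution from this other question, but I got exactly the same result. Any idea why iconv adds the two extra bytes? How can I make the result match the manually created UTF-8 string?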

Edit: test strings

Answers


Why not just use "UTF-16" instead of hard-coding "UTF-16LE"? According to `man iconv_open`, there seem to be six different encoding names for UTF-16:

UTF-16// UTF-16BE// UTF-16LE// UTF16// UTF16BE// UTF16LE//

I don't have any experience with iconv, though. I have used the following function to convert a JSString to a gchar *:

gchar* gtweet_jsengine_jsval2gchar(GtweetTwitterClient *self, jsval value) 
{ 
    JSContext *jscontext = NULL; 
    JSString *string = NULL; 
    GError *error = NULL; 
    gunichar2 *utf16_string = NULL; 
    gsize utf16_length = 0; 
    glong rlen = 0; 
    glong wlen = 0; 
    gchar *ret = NULL; 

    jscontext = self->priv->jscontext; 
    JS_BeginRequest(jscontext); 
    string = JS_ValueToString(jscontext, value); 
    utf16_string = (gunichar2 *) JS_GetStringCharsAndLength(jscontext, string, &utf16_length); 
    ret = g_utf16_to_utf8(utf16_string, utf16_length, &rlen, &wlen, &error); 
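    if(error)
    {
        g_printerr("%s: %d: %s [rlen: %ld wlen: %ld]\n", g_quark_to_string(error->domain), error->code, error->message, rlen, wlen);
        g_error_free(error);      /* the GError is owned by the caller */
        JS_EndRequest(jscontext); /* balance JS_BeginRequest on the error path too */
        return NULL;
    }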
    JS_EndRequest(jscontext); 
    return ret; 
}
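
Note that with the input shown in the question, g_utf16_to_utf8 should produce the same doubled bytes as iconv, because the units \xC3\x00 and \xA1\x00 are the valid characters U+00C3 and U+00A1 in their own right. If the JSString really holds UTF-8 bytes stored one per 16-bit unit (an assumption based on the dump, not something confirmed here), one workaround is to narrow each unit back to a byte instead of re-encoding. A minimal sketch, where `narrowUnitsToBytes` is a hypothetical helper and the units are taken to be plain `uint16_t` values such as those returned by JS_GetStringCharsAndLength:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: treat each 16-bit unit as one original byte.
// Returns false (and clears `out`) if any unit does not fit in a byte,
// which means the string is genuine UTF-16 and must be converted normally.
bool narrowUnitsToBytes(const std::uint16_t* units, std::size_t count,
                        std::string& out) {
    out.clear();
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        if (units[i] > 0xFF) {
            out.clear();
            return false;
        }
        out += static_cast<char>(units[i]);
    }
    return true;
}
```

For the units 0x00C3, 0x00A1 this gives back the two bytes \xc3\xa1, which is the UTF-8 the question expects. The more durable fix is probably wherever the JSString is created, so that it holds U+00E1 rather than U+00C3 U+00A1 in the first place.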