Extracting double-byte characters/substrings from a UTF-8 formatted string

I'm trying to extract emojis and other special characters from strings for further processing (e.g. a string contains '😅' as one of its characters).

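But neither `string.charAt(i)` nor `string.substring(i, i+1)` works for me. The original string is in UTF-8 format, which means the escaped form of the emoji above is encoded as '\uD83D\uDE05'. That is why I get '?' (\uD83D) and '?' (\uDE05) at this position instead of the emoji, so it occupies two positions when iterating over the string.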

Does anyone have a solution to this problem?

Asked 2015-06-14 by conidium

For UTF-16 encoding, use `str.getBytes("UTF-16");` – Cyrbil

You need to use **code points** rather than `char`s. Emoji don't fit in a 16-bit `char`. See [How does Java 16 bit chars support Unicode?](http://stackoverflow.com/questions/1941613/how-does-java-16-bit-chars-support-unicode) and [How can I iterate through the Unicode codepoints of a Java String?](http://stackoverflow.com/questions/1527856/how-can-i-iterate-through-the-unicode-codepoints-of-a-java-string). –
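To illustrate the comment above, here is a minimal sketch of how a single emoji occupies two `char` values in a Java `String`. The class name `SurrogateDemo` and its helper methods are made up for this example; the `String` and `Character` calls are standard JDK methods.

```java
final class SurrogateDemo {
    // U+1F605 written as its UTF-16 surrogate pair, so the source file encoding does not matter.
    static final String EMOJI = "\uD83D\uDE05";

    // Number of 16-bit char units in the string (what charAt indexes into).
    static int countCharUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "a" + EMOJI + "b";
        System.out.println(countCharUnits(text));   // 4
        System.out.println(countCodePoints(text));  // 3
        // charAt returns the two halves of the pair separately.
        System.out.println(Character.isHighSurrogate(text.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(text.charAt(2)));  // true
        // codePointAt combines the pair into one value.
        System.out.println(Integer.toHexString(text.codePointAt(1)));  // 1f605
    }
}
```

Each half of the pair is not a printable character on its own, which is consistent with the '?' output described in the question.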

@cyrbil How does that help? –

Thanks to John Kugelman for the help. The solution now looks like this:

    for (int codePoint : codePoints(string)) {
        char[] chars = Character.toChars(codePoint);
        System.out.println(codePoint + " : " + String.copyValueOf(chars));
    }

with the `codePoints(String string)` method looking like this:

    // Requires: import java.util.Iterator;
    // Wraps a String so a for-each loop can walk its code points.
    private static Iterable<Integer> codePoints(final String string) {
        return new Iterable<Integer>() {
            public Iterator<Integer> iterator() {
                return new Iterator<Integer>() {
                    int nextIndex = 0;

                    public boolean hasNext() {
                        return nextIndex < string.length();
                    }

                    public Integer next() {
                        int result = string.codePointAt(nextIndex);
                        // Advance by 1 or 2 chars, depending on whether this is a surrogate pair.
                        nextIndex += Character.charCount(result);
                        return result;
                    }

                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }
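As an alternative sketch, assuming Java 8 or later is available (the answer does not say which version it targets), `String.codePoints()` returns an `IntStream` of code points, so the hand-written iterator is not strictly needed. The class name `CodePointSplitter` and the method `split` are hypothetical names chosen for this example.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CodePointSplitter {
    // Splits text into one String per code point, so a surrogate pair stays together.
    static List<String> split(String text) {
        return text.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The emoji U+1F605 is given as a surrogate pair escape.
        for (String piece : split("a\uD83D\uDE05b")) {
            System.out.println(piece.codePointAt(0) + " : " + piece);
        }
    }
}
```

Returning each code point as a `String` rather than a `char` keeps emoji intact for further processing, since a `String` can hold both halves of the pair.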

Answered 2015-06-15 06:24:24 by conidium
