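How do I convert a string containing Russian Cyrillic letters?

String artist - I don't know which encoding it is in

Ïåñíÿ ïðî íàäåæäó - in Russian this is "Песня про надежду"

For example, I run the string through http://code.google.com/p/juniversalchardet/

Code: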

String GetEncoding(String text) throws IOException {
    byte[] buf = new byte[4096];
    InputStream fis = new ByteArrayInputStream(text.getBytes());
    UniversalDetector detector = new UniversalDetector(null);

    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    return encoding;
}

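and then convert:

new String(text.getBytes(encoding), "cp1251"); - but this does not work.

If I use UTF-16,

new String(text.getBytes("UTF-16"), "cp1251") returns "юяПесняпронадежду", where the spaces are not real space characters.

Edit:

This is how I first read the bytes: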

byte[] abyFrameData = new byte[iTagSize]; 
oID3DIS.readFully(abyFrameData); 
ByteArrayInputStream oFrameBAIS = new ByteArrayInputStream(abyFrameData);

new String(abyFrameData, "????");

2011-05-16 Mediator

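How did you get the String text parameter? Perhaps the problem is related to how you create the input for the detector. Java strings are always UTF-16, so some character conversion has already happened here. – stevevls 2011-05-16 12:06:37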

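'new String(text.getBytes("UTF-16"), "cp1251")' does not do what you think it does. It takes an existing string, retrieves its bytes as UTF-16, and then tries to create a new string by assuming those bytes are CP1251. That is guaranteed to be wrong. – Anon 2011-05-16 12:12:39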
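The effect described in this comment can be reproduced with the standard library alone. The following is a minimal sketch, not code from the thread: the class and method names are made up for illustration, and it assumes the JRE provides the windows-1251 charset (the same charset the question calls cp1251). The input is the first word of the garbled title from the question, written with Unicode escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongConversionDemo {
    // Hypothetical helper: repeats the conversion attempted in the question.
    static String wrongConversion(String text) {
        // Encoding as UTF-16 emits a big-endian byte order mark (FE FF)
        // followed by two bytes per char.
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        // Decoding as windows-1251 turns every single byte into one char.
        return new String(utf16, Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        // First word of the garbled title: U+00CF U+00E5 U+00F1 U+00ED U+00FF
        String garbled = "\u00CF\u00E5\u00F1\u00ED\u00FF";
        for (char c : wrongConversion(garbled).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The first two chars (U+044E, U+044F, i.e. "юя") come from the byte order mark, and each letter is preceded by U+0000, which would explain the "spaces" that are not real space characters.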

@stevevls, hmm, Java strings are always UTF-16, not Unicode http://download.oracle.com/javase/tutorial/i18n/text/index.html – mKorbel 2011-05-16 12:15:16

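Java strings are UTF-16. All other encodings can be represented using byte sequences. To decode character data, you must provide the encoding when you first create the String. If you have a corrupted String, it is already too late.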
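As a rough illustration of that point (a sketch, not part of the original answer; the class name and sample bytes are chosen for the example, and windows-1251 is assumed to be available in the JRE), the same five bytes give three different strings depending on the charset supplied at creation time:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeAtCreationDemo {
    // Assumption: the raw bytes hold windows-1251 text.
    static final Charset CP1251 = Charset.forName("windows-1251");

    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // windows-1251 bytes of a five-letter Cyrillic word
        byte[] raw = { (byte) 0xCF, (byte) 0xE5, (byte) 0xF1, (byte) 0xED, (byte) 0xFF };

        String correct = decode(raw, CP1251);                       // Cyrillic letters
        String garbled = decode(raw, StandardCharsets.ISO_8859_1);  // Latin-1 look-alikes
        String lost = decode(raw, StandardCharsets.US_ASCII);       // replacement chars only

        System.out.println(correct.equals("\u041F\u0435\u0441\u043D\u044F"));
        System.out.println(garbled.equals("\u00CF\u00E5\u00F1\u00ED\u00FF"));
        System.out.println(lost.equals("\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"));
    }
}
```

The US-ASCII result shows the "too late" case: once every byte has been replaced by U+FFFD, the original bytes cannot be recovered from the String.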

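Assuming ID3, the specification defines the encoding rules. For example, ID3v2.4.0 might restrict the encodings used via an extended header:

q - Text encoding restrictions

    0    No restrictions
    1    Strings are only encoded with ISO-8859-1 [ISO-8859-1] or
         UTF-8 [UTF-8].

Encoding handling is defined further down the document:

If nothing else is said, strings, including numeric strings and URLs, are represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented in frame descriptions as <text string>, or <full text string> if newlines are allowed. If nothing else is said, the newline character is forbidden. In ISO-8859-1 a newline is represented, when allowed, with $0A only.

Frames that allow different types of text encoding contain a text encoding description byte. Possible encodings: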
$00 ISO-8859-1 [ISO-8859-1]. Terminated with $00. 
$01 UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All 
     strings in the same frame SHALL have the same byteorder. 
     Terminated with $00 00. 
$02 UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM. 
     Terminated with $00 00. 
$03 UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with 
     $00. 

Use a transcoding class such as InputStreamReader or (more likely in this case) the String(byte[],Charset) constructor to decode the data. See also Java: a rough guide to character encoding.
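For the stream-based route, a minimal sketch might look like the following. It is not from the original answer: the class name, the helper, and the sample bytes are illustrative, and the charset is assumed to be known in advance.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderDecodeDemo {
    // Reads the whole stream, decoding its bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] chunk = new char[256];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                out.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // UTF-8 bytes of two Cyrillic letters (U+041F U+0435)
        byte[] utf8 = { (byte) 0xD0, (byte) 0x9F, (byte) 0xD0, (byte) 0xB5 };
        String text = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(text.equals("\u041F\u0435"));
    }
}
```

A reader suits data that is consumed as a stream; for a frame that has already been read into a byte array, the String constructor is the shorter option.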

Parsing the string components of ID3v2.4.0 data structures would look something like this:

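//untested code
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

public String parseID3String(DataInputStream in) throws IOException {
    // Index = text encoding description byte ($00..$03)
    String[] encodings = { "ISO-8859-1", "UTF-16", "UTF-16BE", "UTF-8" };
    // readUnsignedByte() throws EOFException instead of returning -1
    String encoding = encodings[in.readUnsignedByte()];
    // UTF-16 variants end with $00 00, the others with $00
    byte[] terminator =
        encoding.startsWith("UTF-16") ? new byte[2] : new byte[1];
    byte[] buf = terminator.clone();
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    while (true) {
        in.readFully(buf);
        // Stop before the terminator so it is not decoded into the result
        if (Arrays.equals(terminator, buf)) break;
        buffer.write(buf);
    }
    return new String(buffer.toByteArray(), encoding);
}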
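To try this approach without an MP3 file, a hand-built field can be fed through a parser of the same shape. The sketch below is self-contained and makes some assumptions of its own: the class name and sample bytes are invented for the example, the field is assumed to start with the encoding byte and end with the matching terminator, and this version stops before appending the terminator to the decoded text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class Id3StringDemo {
    // Index = text encoding description byte ($00..$03)
    static final String[] ENCODINGS = { "ISO-8859-1", "UTF-16", "UTF-16BE", "UTF-8" };

    // Parses one encoding byte, the text, then the terminator for that encoding.
    static String parse(byte[] field) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(field))) {
            String encoding = ENCODINGS[in.readUnsignedByte()];
            byte[] terminator = encoding.startsWith("UTF-16") ? new byte[2] : new byte[1];
            byte[] unit = terminator.clone();
            ByteArrayOutputStream text = new ByteArrayOutputStream();
            while (true) {
                in.readFully(unit);
                if (Arrays.equals(terminator, unit)) break; // terminator is not text
                text.write(unit);
            }
            return new String(text.toByteArray(), encoding);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // $03 = UTF-8, then U+041F U+0435, then $00
        byte[] utf8Field = { 0x03, (byte) 0xD0, (byte) 0x9F, (byte) 0xD0, (byte) 0xB5, 0x00 };
        // $01 = UTF-16 with BOM (FF FE = little-endian), same text, then $00 00
        byte[] utf16Field = { 0x01, (byte) 0xFF, (byte) 0xFE, 0x1F, 0x04, 0x35, 0x04, 0x00, 0x00 };
        System.out.println(parse(utf8Field).equals(parse(utf16Field)));
    }
}
```

Both fields carry the same two letters, so the two decoded strings should compare equal even though the byte layouts differ.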

2011-05-16 13:03:50 McDowell

I have read this... but I don't understand it. I have edited my post. – Mediator 2011-05-16 15:14:03

This works for me:

byte[] bytes = s.getBytes("ISO-8859-1"); 
UniversalDetector encDetector = new UniversalDetector(null); 
encDetector.handleData(bytes, 0, bytes.length); 
encDetector.dataEnd(); 
String encoding = encDetector.getDetectedCharset(); 
if (encoding != null) s = new String(bytes, encoding);
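The reason this kind of repair can work is that ISO-8859-1 maps each byte value 0x00-0xFF to the code point with the same number, so getBytes("ISO-8859-1") gives back the original bytes unchanged. A minimal sketch without the detector follows; it assumes the string was mis-decoded as ISO-8859-1 and that the real encoding is windows-1251 (both are assumptions, and the class name is illustrative).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Assumption: windows-1251 bytes were decoded as ISO-8859-1 by mistake.
    static String repair(String garbled) {
        // Lossless for ISO-8859-1 mojibake: one char back to one original byte
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        // The garbled title from the question, written as Unicode escapes
        String garbled = "\u00CF\u00E5\u00F1\u00ED\u00FF \u00EF\u00F0\u00EE "
                + "\u00ED\u00E0\u00E4\u00E5\u00E6\u00E4\u00F3";
        System.out.println(repair(garbled));
    }
}
```

If the string was garbled through some other charset, or already contains replacement characters, this round trip does not restore the text.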

2014-05-07 06:11:55 Nik
