2011-03-23 91 views
2

How to correctly decode this string in Java

http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru 

I am trying to decode the string above. When I use URLDecoder.decode() I get the following error:

java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u0" 

Thanks, Dave

+1

The URL was not encoded correctly to begin with. – 2011-03-23 16:32:32

+0

@Johan If it is part of a larger URL (such as http://foo.com/?url=<the string above>), it could be, but otherwise, agreed – 2011-03-23 16:35:17

+0

@Johan, why not? @Daniel, exactly my thought: http://www.google.com/search?q=http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru – OscarRyz 2011-03-23 16:35:35

Answers

2

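According to Wikipedia, "there exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value". It continues: "This behavior is not specified by any RFC and has been rejected by the W3C".

Your URL contains these tokens, and Java's URLDecoder implementation does not support them.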
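To see the failure in isolation, here is a minimal sketch. The class name `PercentUDemo` and the helper `isRejected` are made up for illustration, and the sample inputs are shortened fragments of the string from the question. It only checks whether URLDecoder throws IllegalArgumentException, since the exact message text may differ between JDK versions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentUDemo {
    /** Returns true if URLDecoder rejects the input with IllegalArgumentException. */
    public static boolean isRejected(String encoded) {
        try {
            URLDecoder.decode(encoded, "UTF-8");
            return false;
        } catch (IllegalArgumentException e) {
            // Thrown when "%" is not followed by two hex digits, e.g. "%u0".
            return true;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should not be reachable.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isRejected("q%3Dla+mer"));    // standard escapes only: false
        System.out.println(isRejected("btnG%3D%u0420")); // contains %uXXXX: true
    }
}
```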

2

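The %uXXXX encoding is non-standard and was in fact rejected by the W3C, so naturally URLDecoder does not understand it.

You could write a small function that fixes it by replacing %uXXYY with %XX%YY in your encoded string. Then you can process and decode the fixed string as usual.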
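One possible shape for such a helper is sketched below. It is a variant of the idea rather than a literal %XX%YY rewrite: each %uXXXX token is turned directly into its UTF-16 code unit, and the text between tokens is passed to URLDecoder. The names `PercentUDecoder` and `decode` are hypothetical, and treating the standard %XX escapes as UTF-8 is an assumption about the input.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PercentUDecoder {
    // A non-standard escape: "%u" followed by exactly four hex digits.
    private static final Pattern PERCENT_U = Pattern.compile("%u([0-9A-Fa-f]{4})");

    public static String decode(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = PERCENT_U.matcher(input);
        int last = 0;
        while (m.find()) {
            // Decode the standard part that precedes this %uXXXX token.
            out.append(standardDecode(input.substring(last, m.start())));
            // Append the UTF-16 code unit the token stands for.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(standardDecode(input.substring(last)));
        return out.toString();
    }

    // Assumption: standard %XX escapes carry UTF-8 bytes.
    private static String standardDecode(String part) {
        try {
            return URLDecoder.decode(part, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

Decoding the tokens separately means a character produced by a %uXXXX token (for example a literal "+" or "%") is never decoded a second time. A truncated token such as a trailing "%u04" is not matched by the pattern and would still make URLDecoder throw.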

1

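We started with Vartec's solution but found additional problems. This solution works for UTF-16, but it can be changed to return UTF-8. All the replace calls are left in for clarity; you can read more at http://www.cogniteam.com/wiki/index.php?title=DecodeEncodeJavaScript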

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public static String unescape(String escaped) throws UnsupportedEncodingException
{
    // Widen every standard %XX escape to %u00XX so that the whole string uses
    // one scheme and the UTF-16 byte stream won't be malformed.
    // Note: only digits and upper-case hex letters are matched here, so
    // lower-case escapes such as %e2 are not widened.
    String str = escaped.replaceAll("%0", "%u000");
    str = str.replaceAll("%1", "%u001");
    str = str.replaceAll("%2", "%u002");
    str = str.replaceAll("%3", "%u003");
    str = str.replaceAll("%4", "%u004");
    str = str.replaceAll("%5", "%u005");
    str = str.replaceAll("%6", "%u006");
    str = str.replaceAll("%7", "%u007");
    str = str.replaceAll("%8", "%u008");
    str = str.replaceAll("%9", "%u009");
    str = str.replaceAll("%A", "%u00A");
    str = str.replaceAll("%B", "%u00B");
    str = str.replaceAll("%C", "%u00C");
    str = str.replaceAll("%D", "%u00D");
    str = str.replaceAll("%E", "%u00E");
    str = str.replaceAll("%F", "%u00F");

    // Split each 4-digit %uXXYY into two 2-digit escapes, %XX%YY, so that
    // URLDecoder.decode won't fail. The limit of -1 keeps empty pieces, so
    // arr[0] always exists even when the input consists of "%u" alone.
    String[] arr = str.split("%u", -1);
    StringBuilder sb = new StringBuilder(arr[0]);
    for (int i = 1; i < arr.length; i++) {
        if (!arr[i].isEmpty()) {
            sb.append('%').append(arr[i].substring(0, 2));
            sb.append('%').append(arr[i].substring(2));
        }
    }

    // Decode the byte pairs as UTF-16 (big-endian when there is no byte-order mark).
    return URLDecoder.decode(sb.toString(), "UTF-16");
}
1

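After taking a good look at the solution proposed by @ariy, I created a Java-based solution that is also resilient against encoded characters that have been chopped in two (i.e. half of the encoded character is missing). This happens in my use case, where I need to decode long URLs that are sometimes chopped at a length of 2000 characters. See What is the maximum length of a URL in different browsers?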

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Pattern;

public class Utils {

    private static final Pattern validStandard      = Pattern.compile("%([0-9A-Fa-f]{2})");
    private static final Pattern choppedStandard    = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
    private static final Pattern validNonStandard   = Pattern.compile("%u([0-9A-Fa-f][0-9A-Fa-f])([0-9A-Fa-f][0-9A-Fa-f])");
    private static final Pattern choppedNonStandard = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");

    public static String resilientUrlDecode(String input) {
        String cookedInput = input;

        if (cookedInput.indexOf('%') > -1) {
            // Widen every standard %XX escape to the two-byte form %00%XX so it can
            // be decoded as UTF-16. Each byte is treated as the code point U+00XX,
            // so multi-byte UTF-8 sequences are not reassembled into one character.
            cookedInput = validStandard.matcher(cookedInput).replaceAll("%00%$1");

            // Discard chopped encoded char at the end of the line (there is no way to know what it was)
            cookedInput = choppedStandard.matcher(cookedInput).replaceAll("");

            // Handle non standard (rejected by W3C) encoding that is used anyway by some
            // See: https://stackoverflow.com/a/5408655/114196
            if (cookedInput.contains("%u")) {
                // Rewrite every non-standard %uXXYY as the two-byte form %XX%YY.
                cookedInput = validNonStandard.matcher(cookedInput).replaceAll("%$1%$2");

                // Discard chopped encoded char at the end of the line
                cookedInput = choppedNonStandard.matcher(cookedInput).replaceAll("");
            }
        }

        try {
            return URLDecoder.decode(cookedInput, "UTF-16");
        } catch (UnsupportedEncodingException e) {
            // Will never happen: the encoding is hardcoded and UTF-16 is always available
            return null;
        }
    }
}
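The chopped-tail handling can also be looked at on its own. The sketch below merges the two "chopped" patterns above into a single expression. The class name `ChoppedTail` and the method `stripChoppedTail` are hypothetical, and it only removes an incomplete escape at the end of the input without decoding anything.

```java
import java.util.regex.Pattern;

public class ChoppedTail {
    // An incomplete escape at the end of the input: "%" or "%X" (standard),
    // or "%u" followed by zero to three hex digits (non-standard).
    private static final Pattern CHOPPED =
            Pattern.compile("%(?:[0-9A-Fa-f]?|u[0-9A-Fa-f]{0,3})$");

    /** Removes a truncated trailing escape, leaving complete escapes untouched. */
    public static String stripChoppedTail(String input) {
        return CHOPPED.matcher(input).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripChoppedTail("abc%2"));     // abc
        System.out.println(stripChoppedTail("abc%u04"));   // abc
        System.out.println(stripChoppedTail("abc%20"));    // abc%20
        System.out.println(stripChoppedTail("abc%u0420")); // abc%u0420
    }
}
```

Stripping the tail before any other rewriting keeps the step independent of how the complete escapes are decoded afterwards.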