2010-05-24 183 views
8

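Filtering illegal XML characters in Java

The XML specification defines a subset of Unicode characters that are allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets

How can I filter these characters out of a String in Java?

A simple test case: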

Assert.equals("", filterIllegalXML(""+Character.valueOf((char) 2))) 
+0

Why are you getting these "illegal" XML characters? And once you find them, what do you want to do with them? Remove them? Replace them? – 2010-05-24 13:11:59

+0

@RH: Ignoring them is enough. The best solution would be to remove them and get some kind of report, so that I can log a warning. – 2010-05-24 13:15:47

+0

In case anyone is wondering, I used Xerces' 'XMLChar', as ZZ Coder suggested. You can find the whole method here: http://pastebin.com/6Vbm1zuC – 2010-05-25 06:15:58

Answers

5

Finding all the invalid XML characters is not trivial. You need to call, or reimplement, XMLChar.isInvalid() from Xerces:

http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm
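
For readers who would rather not depend on Xerces, the "reimplement" route can be sketched roughly as below. This is not the Xerces code; it is a minimal illustration that checks each code point against the Char production of XML 1.0 and collects what it removes, so a warning can be logged. The class name `XmlCharFilter` and the `removed` list parameter are assumptions made for this example; `filterIllegalXML` is the name used in the question's test case.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlCharFilter {

    /** True if the code point matches the XML 1.0 Char production. */
    public static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the input without illegal code points.
     * Every removed code point is appended to 'removed' (hypothetical report list).
     */
    public static String filterIllegalXML(String in, List<Integer> removed) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt keeps valid surrogate pairs together; a lone
            // surrogate comes back as its own value and is rejected.
            int cp = in.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                removed.add(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Integer> removed = new ArrayList<>();
        String cleaned = filterIllegalXML("a" + (char) 2 + "b", removed);
        System.out.println(cleaned + " removed=" + removed); // prints "ab removed=[2]"
    }
}
```

The returned list is only one way to produce the report asked for in the comments; a callback or a counter would work just as well.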

+0

+1, nice find.. – Bozho 2010-05-24 13:53:04

+0

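That class is very relevant [read: hard to understand, though thanks anyway for its machine-generated part], and it also requires instantiating and pre-populating a 64K chars array... – rogerdpack 2014-12-09 21:16:49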

0

Use StringEscapeUtils.escapeXml(xml) from commons-lang. It escapes the characters rather than filtering them.

+2

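I have already used this method to escape entities (e.g. '<' to '&lt;'), but that is something different. The method does not seem to filter any illegal characters. My 'test case' fails. – 2010-05-24 13:06:37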

+2

Show the test case. – Bozho 2010-05-24 13:09:25

+0

As stated above: 'assertEquals("", StringEscapeUtils.escapeXml("" + Character.valueOf((char) 2)));' – 2010-05-24 13:14:00

1

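This page includes an example of a Java method that strips out invalid XML characters by testing whether each character is within the spec, although it does not check for highly discouraged characters.

By the way, escaping the characters is not a solution, because the XML 1.0 and 1.1 specs do not allow the invalid characters in escaped form either.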
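
The point about escaped forms can be checked with the XML parser that ships with the JDK. The sketch below is only an illustration: the class name `EscapedControlCharDemo` and the helper `parses` are made up for this example. It feeds two small documents to a `DocumentBuilder`, one with a reference to a legal character (tab) and one with a reference to U+0002.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EscapedControlCharDemo {

    /** Returns true if the JDK's default XML parser accepts the document. */
    public static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well-formed, e.g. a reference to an illegal character
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<a>&#9;</a>")); // tab is a legal character
        System.out.println(parses("<a>&#2;</a>")); // U+0002 is not legal in XML 1.0
    }
}
```

With the default parser the first document should be accepted and the second rejected, which is why such characters have to be removed (or encoded in some application-specific way) rather than escaped.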

+1

The link is dead... it looks like this may be the new URL? http://benjchristensen.com/2008/02/07/how-to-strip-invalid-xml-characters/ – Michael 2012-01-27 15:05:32

+0

Updated the link, thanks – 2012-01-28 01:03:32

0

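Here is a solution that takes care of the raw character as well as the escaped character in the stream, and that works in front of StAX or SAX. It needs to be extended for other invalid characters, but you get the idea: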

import java.io.BufferedReader; 
import java.io.File; 
import java.io.FileInputStream; 
import java.io.FileOutputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.io.OutputStreamWriter; 
import java.io.Reader; 
import java.io.UnsupportedEncodingException; 
import java.io.Writer; 

import org.apache.commons.io.IOUtils; 
import org.apache.xerces.util.XMLChar; 

public class IgnoreIllegalCharactersXmlReader extends Reader { 

    private final BufferedReader underlyingReader; 
    private StringBuilder buffer = new StringBuilder(4096); 
    private boolean eos = false; 

    public IgnoreIllegalCharactersXmlReader(final InputStream is) throws UnsupportedEncodingException { 
     underlyingReader = new BufferedReader(new InputStreamReader(is, "UTF-8")); 
    } 

    private void fillBuffer() throws IOException { 
     final String line = underlyingReader.readLine(); 
     if (line == null) { 
      eos = true; 
      return; 
     } 
     buffer.append(line); 
     buffer.append('\n'); 
    } 

    @Override 
    public int read(char[] cbuf, int off, int len) throws IOException { 
     if(buffer.length() == 0 && eos) { 
      return -1; 
     } 
     int satisfied = 0; 
     int currentOffset = off; 
     // Buffer whole lines until the request can be satisfied or the input ends.
     while (!eos && buffer.length() < len) {
      fillBuffer();
     }
     while (satisfied < len && buffer.length() > 0) {
      char ch = buffer.charAt(0);
      final char nextCh = buffer.length() > 1 ? buffer.charAt(1) : '\0';
      if (ch == '&' && nextCh == '#') {
       // Input is buffered line by line, so a character reference is
       // assumed to sit entirely on one line.
       // Drop the escaped form of an illegal character (only &#2; here);
       // the length check keeps an unterminated "&#" from reading past the buffer.
       if (buffer.length() >= 4 && "&#2;".contentEquals(buffer.subSequence(0, 4))) {
        buffer.delete(0, 4);
        continue;
       }
      }
      // Copy legal raw characters through; illegal ones are skipped.
      if (XMLChar.isValid(ch)) {
       satisfied++;
       cbuf[currentOffset++] = ch;
      }
      buffer.deleteCharAt(0);
     }
     return satisfied; 
    } 

    @Override 
    public void close() throws IOException { 
     underlyingReader.close(); 
    } 

    public static void main(final String[] args) { 
     final File file = new File(
    <XML>); 
     final File outFile = new File(file.getParentFile(), file.getName() 
    .replace(".xml", ".cleaned.xml")); 
     Reader r = null; 
     Writer w = null; 
     try { 
      r = new IgnoreIllegalCharactersXmlReader(new FileInputStream(file)); 
      w = new OutputStreamWriter(new FileOutputStream(outFile),"UTF-8"); 
      IOUtils.copyLarge(r, w); 
      w.flush(); 
     } catch (Exception e) { 
      e.printStackTrace(); 
     } finally { 
      IOUtils.closeQuietly(r); 
      IOUtils.closeQuietly(w); 
     } 
    } 
} 
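
If pulling in Xerces and commons-io is not an option, the same wrap-the-Reader idea can be approximated with the JDK alone. The following is a rough sketch, not a drop-in replacement for the class above: it only drops raw illegal characters (escaped references such as `&#2;` are left alone), it passes surrogates through unchecked, and the names `XmlCleaningReader`, `isAllowed` and `clean` are invented for this example.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class XmlCleaningReader extends FilterReader {

    public XmlCleaningReader(Reader in) {
        super(in);
    }

    /** XML 1.0 check per UTF-16 unit; surrogates are passed through unchecked. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int kept;
        do {
            int n = super.read(cbuf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the allowed characters to the front of the chunk.
            kept = 0;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(cbuf[i])) {
                    cbuf[off + kept++] = cbuf[i];
                }
            }
        } while (kept == 0); // read again if the whole chunk was illegal
        return kept;
    }

    /** Convenience helper for strings (hypothetical, for demonstration only). */
    public static String clean(String s) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[1024];
        try (Reader r = new XmlCleaningReader(new StringReader(s))) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

An instance can be handed to a parser in the usual way, for example wrapped in an `org.xml.sax.InputSource`, so the parser never sees the raw illegal characters.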
0

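Loosely based on a comment in the link from Stephen C's answer, and on Wikipedia's page for the XML 1.1 spec, here is a Java method that shows how to remove illegal characters using regular expression replace: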

boolean isAllValidXmlChars(String s) {
    // xml 1.1 spec http://en.wikipedia.org/wiki/Valid_characters_in_XML
    // matches() tests the whole string, so the class is repeated with '*'
    if (!s.matches("[\\u0001-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]*")) {
        // at least one character is outside the valid ranges
        return false;
    }
    // (?s) lets '.' span line breaks, so ".*[...].*" means "contains"
    if (s.matches("(?s).*[\\u0001-\\u0008\\u000b-\\u000c\\u000E-\\u001F\\u007F-\\u0084\\u0086-\\u009F].*")) {
        // contains a control character
        return false;
    }

    // "Characters allowed but discouraged"
    if (s.matches(
        "(?s).*[\\uFDD0-\\uFDEF\\x{1FFFE}-\\x{1FFFF}\\x{2FFFE}-\\x{2FFFF}\\x{3FFFE}-\\x{3FFFF}\\x{4FFFE}-\\x{4FFFF}\\x{5FFFE}-\\x{5FFFF}\\x{6FFFE}-\\x{6FFFF}\\x{7FFFE}-\\x{7FFFF}\\x{8FFFE}-\\x{8FFFF}\\x{9FFFE}-\\x{9FFFF}\\x{AFFFE}-\\x{AFFFF}\\x{BFFFE}-\\x{BFFFF}\\x{CFFFE}-\\x{CFFFF}\\x{DFFFE}-\\x{DFFFF}\\x{EFFFE}-\\x{EFFFF}\\x{FFFFE}-\\x{FFFFF}\\x{10FFFE}-\\x{10FFFF}].*"
    )) {
        return false;
    }

    return true;
}
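
The method above only reports whether a string is clean. For the actual removal, a regex replace along the following lines might do. This is a sketch under a few assumptions: it targets the Char production of XML 1.0 (the spec linked in the question) rather than XML 1.1, it does not touch the "discouraged" ranges, and the names `RegexXmlStripper` and `stripIllegalXml10` are made up for the example.

```java
import java.util.regex.Pattern;

public class RegexXmlStripper {

    // Negated class: anything that is NOT a legal XML 1.0 character.
    // Java regex works on code points, so valid surrogate pairs fall into
    // the supplementary range and lone surrogates are matched and removed.
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Returns the input with every illegal XML 1.0 character removed. */
    public static String stripIllegalXml10(String s) {
        return ILLEGAL_XML10.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "a" + (char) 2 + "b";
        System.out.println(stripIllegalXml10(dirty)); // prints "ab"
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would do.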