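As suggested in the comments, reconsider using regex directly on HTML/XML documents, since these are not regular languages. Instead, use regex on the parsed text/value content, but not to transform the document itself.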
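To illustrate the distinction, here is a minimal sketch that parses the XML first and only then applies a regex to the text value of each element. The class name `RegexOnParsedText`, the method `findAddresses`, and the simple dotted-quad pattern are assumptions made for illustration; they are not part of the original question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RegexOnParsedText {

    // Hypothetical pattern: a dotted-quad address (octet ranges are not validated)
    private static final Pattern IPV4 =
            Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b");

    // Let the XML parser handle the markup, then run the regex on text content only
    public static List<String> findAddresses(String xml, String tagName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName(tagName);

        List<String> found = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Matcher m = IPV4.matcher(nodes.item(i).getTextContent());
            while (m.find()) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ct><c>http://192.168.105.213</c><l></l>"
                   + "<l>http://192.168.105.213</l></ct>";
        // Collects the address found in the non-empty <l> element
        System.out.println(findAddresses(xml, "l"));
    }
}
```

The regex never sees a tag, so nesting, attributes, and empty elements are the parser's problem rather than the pattern's.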
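A great tool for manipulating XML is XSLT, the transformation language and sibling to XPath. Java ships with a built-in XSLT 1.0 processor and can also call external processors (Xalan, Saxon, etc.). Consider the following setup: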
XSLT Script (save as the .xsl file used below; the script removes empty nodes)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
    <xsl:strip-space elements="*"/>

    <!-- Identity Transform to Copy Document as is -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Empty Template to Remove Such Nodes -->
    <xsl:template match="*[.='']"/>

</xsl:transform>
Java Code
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class XMLTransform {
    public static void main(String[] args) throws IOException, SAXException,
            ParserConfigurationException, TransformerException {

        // Paths to the input XML, the XSLT script, and the output file
        String inputXML = "path/to/Input.xml";
        String xslFile = "path/to/XSLT/Script.xsl";
        String outputXML = "path/to/Output.xml";

        // Load the XSLT script
        Source xslt = new StreamSource(new File(xslFile));

        // Parse the XML into a DOM; namespace awareness lets XSLT see namespaces
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        docFactory.setNamespaceAware(true);
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        Document doc = docBuilder.parse(new File(inputXML));

        // XSLT transformation with pretty print
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(xslt);
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.STANDALONE, "yes");
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // Xalan-specific property that sets the indent width
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

        // Run the transformation and write the result to the output file
        DOMSource source = new DOMSource(doc);
        StreamResult result = new StreamResult(new File(outputXML));
        transformer.transform(source, result);
    }
}
Output
<ct>
    <c>http://192.168.105.213</c>
    <l>http://192.168.105.213</l>
    <l>http://192.168.105.213</l>
    <o>http://192.168.105.213</o>
</ct>
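To try the stylesheet without creating files, the same transformation can be run on strings. This is only a sketch: the class name `InMemoryTransform`, the `transform` helper, and the sample input (assumed to be the un-prefixed form of the document, with one empty `<o>` element) are illustrative and not taken from the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class InMemoryTransform {

    // Same two templates as the script above: identity copy plus an empty template
    public static final String XSLT =
          "<xsl:transform xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"*[.='']\"/>"
        + "</xsl:transform>";

    // Assumed sample input containing one empty <o> element
    public static final String SAMPLE =
          "<ct><c>http://192.168.105.213</c><l>http://192.168.105.213</l><o></o>"
        + "<l>http://192.168.105.213</l><o>http://192.168.105.213</o></ct>";

    // Apply an XSLT script to an XML string and return the serialized result
    public static String transform(String xslt, String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The empty <o> element should be absent from the printed result
        System.out.println(transform(XSLT, SAMPLE));
    }
}
```

The empty template wins over the identity template because a pattern with a predicate has a higher default priority than `node()`, so no explicit `priority` attribute is needed.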
NAMESPACES
When working with namespaces, such as in the XML below:
<prefix:ct xmlns:prefix="http://www.example.com">
    <c>http://192.168.105.213</c>
    <l>http://192.168.105.213</l>
    <o></o>
    <l>http://192.168.105.213</l>
    <o>http://192.168.105.213</o>
</prefix:ct>
Use the following XSLT, with the namespace declared in the header and an added template:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:prefix="http://www.example.com">
    <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
    <xsl:strip-space elements="*"/>

    <!-- Identity Transform -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Retain Namespace Prefix (the match must use the qualified name) -->
    <xsl:template match="prefix:ct">
        <xsl:element name='prefix:{local-name()}' namespace='http://www.example.com'>
            <xsl:copy-of select="namespace::*"/>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:element>
    </xsl:template>

    <!-- Remove Empty Nodes -->
    <xsl:template match="*[.='']"/>

</xsl:transform>
Output
<prefix:ct xmlns:prefix="http://www.example.com">
    <c>http://192.168.105.213</c>
    <l>http://192.168.105.213</l>
    <l>http://192.168.105.213</l>
    <o>http://192.168.105.213</o>
</prefix:ct>
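When the input is parsed into a DOM before being handed to the transformer, the parser generally has to be namespace-aware for prefixed names in match patterns to work. The sketch below shows that setup on strings. The class name `NamespaceAwareTransform` and the narrowed pattern `prefix:ct/*[.='']` are assumptions chosen to show a qualified name in a match pattern; they are a variation on the script above, not a copy of it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceAwareTransform {

    // The prefix is bound in the stylesheet header and used in the match pattern
    public static final String XSLT =
          "<xsl:transform xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'"
        + " xmlns:prefix='http://www.example.com'>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"prefix:ct/*[.='']\"/>"
        + "</xsl:transform>";

    public static final String SAMPLE =
          "<prefix:ct xmlns:prefix='http://www.example.com'>"
        + "<c>http://192.168.105.213</c><l>http://192.168.105.213</l><o></o>"
        + "<l>http://192.168.105.213</l><o>http://192.168.105.213</o></prefix:ct>";

    public static String transform(String xslt, String xml) throws Exception {
        // Namespace awareness is off by default on DocumentBuilderFactory
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The root keeps its prefix and the empty <o> child should be gone
        System.out.println(transform(XSLT, SAMPLE));
    }
}
```

Here the identity template's `xsl:copy` carries the element's name and namespace nodes over, so the root keeps its prefix without any extra template.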
Please, do not use regex to parse XML. Never. See http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – vanje
@vanje I like this answer better: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags –
@Thomas: Yes, you're right. – vanje