A Detailed Performance Comparison of XML Parsing with XPath in Java

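I recently worked on a small project that involved parsing XML files. The notes below summarize what I learned about the technology while using it.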

1 Four Ways to Parse XML Files

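There are four classic ways to parse XML files. The two basic approaches are SAX and DOM: SAX parses the document as a stream of events, while DOM parses it into a tree that mirrors the document structure. JDOM was built on top of these to cut down the amount of code that DOM and SAX require; it follows the 80/20 rule (the Pareto principle) and greatly reduces the code you have to write. JDOM is generally a good fit when the required functionality is simple, such as parsing and creating documents. Under the hood, however, JDOM still relies on SAX (most commonly), DOM, or Xalan documents. The fourth option is DOM4J, an excellent Java XML API that combines strong performance, rich features, and great ease of use, and it is open source. More and more Java software uses DOM4J to read and write XML; notably, even Sun's JAXM uses it.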
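To make the difference between the two basic approaches concrete, here is a minimal sketch that reads the same document once with SAX and once with DOM, using only the parsers bundled with the JDK. The class name SaxVsDomSketch and the sample document are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxVsDomSketch {
  // Hypothetical sample document, used only for this illustration.
  static final String XML = "<books><book>SAX</book><book>DOM</book></books>";

  static InputStream input() {
    return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
  }

  // SAX: the parser pushes events to a handler; no tree is kept in memory.
  static List<String> readWithSax() {
    final List<String> titles = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      private StringBuilder text; // non-null only while inside a <book> element

      @Override
      public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("book".equals(qName)) {
          text = new StringBuilder();
        }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        if (text != null) {
          text.append(ch, start, length);
        }
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        if ("book".equals(qName)) {
          titles.add(text.toString());
          text = null;
        }
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser().parse(input(), handler);
    } catch (Exception e) {
      throw new IllegalStateException("SAX parsing failed", e);
    }
    return titles;
  }

  // DOM: the whole document is loaded into a tree that can then be walked.
  static List<String> readWithDom() {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(input());
      NodeList books = doc.getElementsByTagName("book");
      List<String> titles = new ArrayList<>();
      for (int i = 0; i < books.getLength(); i++) {
        titles.add(books.item(i).getTextContent());
      }
      return titles;
    } catch (Exception e) {
      throw new IllegalStateException("DOM parsing failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("SAX: " + readWithSax());
    System.out.println("DOM: " + readWithDom());
  }
}
```

Both methods return the same two titles; the SAX version has to track its own state between events, which is the kind of code JDOM and DOM4J aim to spare you.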

2 A Brief Introduction to XPath

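XPath is a language for finding information in an XML document. It is used to navigate through, and iterate over, the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions, so an understanding of XPath is the foundation for many advanced XML applications. XPath plays a role much like SQL for databases, or like jQuery selectors: it lets developers conveniently pull out the parts of a document they need. DOM4J also supports XPath.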
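As a small illustration of navigating by element and by attribute, the sketch below evaluates a few XPath expressions with the JDK's built-in javax.xml.xpath API (not DOM4J). The class name XPathIntroSketch and the sample document are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathIntroSketch {
  // Hypothetical sample document, used only for this illustration.
  static final String XML =
      "<library>"
      + "<book lang=\"en\"><title>Alpha</title></book>"
      + "<book lang=\"zh\"><title>Beta</title></book>"
      + "</library>";

  // Evaluates the expression against XML and returns the string value of the first match
  // (an empty string if nothing matches).
  static String select(String expression) {
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
      return xpath.evaluate(expression, new InputSource(new StringReader(XML)));
    } catch (XPathExpressionException e) {
      throw new IllegalArgumentException("Cannot evaluate XPath: " + expression, e);
    }
  }

  public static void main(String[] args) {
    // Navigate by position: the title of the second book.
    System.out.println(select("/library/book[2]/title"));
    // Filter by attribute, much like a WHERE clause in SQL.
    System.out.println(select("//book[@lang='en']/title"));
    // Select an attribute rather than an element.
    System.out.println(select("/library/book[1]/@lang"));
  }
}
```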

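3 Using XPath with DOM4J

To parse XML documents with XPath in DOM4J, you first need to add two JAR files to the project:

dom4j-1.6.1.jar: the DOM4J package, available from http://sourceforge.net/projects/dom4j/;

jaxen-xx.xx.jar: if this JAR is missing, an exception is usually thrown (java.lang.NoClassDefFoundError: org/jaxen/JaxenException); available from http://www.jaxen.org/releases.html.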

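3.1 Interference from Namespaces

When processing XML files converted from Excel or other formats, XPath queries often return nothing. This is usually caused by namespaces. Take the XML file below as an example: a simple query with XPath = "//Workbook/Worksheet/Table/Row[1]/Cell[1]/Data[1]" normally returns no result. The cause is the default namespace declaration (xmlns="urn:schemas-microsoft-com:office:spreadsheet"): every element in the file belongs to that namespace, whereas an unprefixed name in an XPath 1.0 expression only matches elements that are in no namespace.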

<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
 <Worksheet ss:Name="Sheet1">
  <Table ss:ExpandedColumnCount="81" ss:ExpandedRowCount="687" x:FullColumns="1" x:FullRows="1" ss:DefaultColumnWidth="52.5" ss:DefaultRowHeight="15.5625">
   <Row ss:AutoFitHeight="0">
    <Cell>
     <Data ss:Type="String">敲代码的耗子</Data>
    </Cell>
   </Row>
   <Row ss:AutoFitHeight="0">
    <Cell>
     <Data ss:Type="String">Sunny</Data>
    </Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>
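The effect can be reproduced in a few lines. The sketch below uses the JDK's namespace-aware DOM parser and javax.xml.xpath rather than DOM4J, on a trimmed copy of the sample; the class name NamespaceInterferenceSketch is made up for this illustration. The unprefixed path finds nothing, while a path that matches on local-name() finds the cell text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceInterferenceSketch {
  // Trimmed copy of the SpreadsheetML sample: one row, one cell.
  static final String XML =
      "<Workbook xmlns=\"urn:schemas-microsoft-com:office:spreadsheet\""
      + " xmlns:ss=\"urn:schemas-microsoft-com:office:spreadsheet\">"
      + "<Worksheet ss:Name=\"Sheet1\"><Table>"
      + "<Row><Cell><Data ss:Type=\"String\">Sunny</Data></Cell></Row>"
      + "</Table></Worksheet></Workbook>";

  // Parses XML with namespace awareness on and returns the string value of the
  // first node matched by the expression (empty string if there is no match).
  static String select(String expression) {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
      return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    } catch (Exception e) {
      throw new IllegalStateException("Query failed: " + expression, e);
    }
  }

  public static void main(String[] args) {
    // Unprefixed names only match elements in no namespace, so this prints an empty result.
    String plain = select("//Workbook/Worksheet/Table/Row[1]/Cell[1]/Data[1]");
    System.out.println("plain path      -> [" + plain + "]");
    // Matching on the local name ignores the namespace and finds the cell.
    String local = select(
        "//*[local-name()='Row'][1]/*[local-name()='Cell'][1]/*[local-name()='Data'][1]");
    System.out.println("local-name path -> [" + local + "]");
  }
}
```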

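3.2 Parsing Namespaced XML Files with XPath

Method 1 (read1()): use XPath's built-in local-name() and namespace-uri() functions to specify the node name and namespace you want. The resulting XPath expressions are cumbersome to write.

Method 2 (read2()): set the namespaces on the XPath object with setNamespaceURIs().

Method 3 (read3()): set the namespaces on the DocumentFactory with setXPathNamespaceURIs(). With methods 2 and 3 the XPath expressions are comparatively simple to write.

Method 4 (read4()): the same as method 3, but with a different XPath expression (see the code). Its purpose is to check whether the expression, specifically how complete the path is, affects query efficiency. (Methods 1 to 4 all parse the XML file with DOM4J and XPath.)

Method 5 (read5()): parse the XML file with DOM and XPath, mainly to test the performance difference.

Nothing explains this better than code, so here it is.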


package XPath;

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.XPath;
import org.dom4j.io.SAXReader;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
 * Queries a namespaced XML file with XPath, using DOM4J (read1 to read4) and DOM (read5).
 * @author hao
 */
public class TestDom4jXpath {
  public static void main(String[] args) {
    read1();
    read2();
    read3();
    read4(); // same as read3(), but with a different XPath expression
    read5();
  }

  public static void read1() {
    // Method 1: use local-name() and namespace-uri() inside the XPath expression.
    try {
      long startTime = System.currentTimeMillis();
      SAXReader reader = new SAXReader();
      // Classpath resource names use '/' as the separator, not a backslash.
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      Document doc = reader.read(in);
      // Fully qualified alternative that also checks the namespace URI:
      // String xpath = "//*[local-name()='Workbook' and namespace-uri()='urn:schemas-microsoft-com:office:spreadsheet']"
      //     + "/*[local-name()='Worksheet']"
      //     + "/*[local-name()='Table']"
      //     + "/*[local-name()='Row'][4]"
      //     + "/*[local-name()='Cell'][3]"
      //     + "/*[local-name()='Data'][1]";
      String xpath = "//*[local-name()='Row'][4]/*[local-name()='Cell'][3]/*[local-name()='Data'][1]";
      System.err.println("=====use local-name() and namespace-uri() in XPath====");
      System.err.println("XPath: " + xpath);
      @SuppressWarnings("unchecked")
      List<Element> list = doc.selectNodes(xpath);
      for (Object o : list) {
        Element e = (Element) o;
        String show = e.getStringValue();
        System.out.println("show = " + show);
      }
      // Report the elapsed time once, after all results have been printed.
      long endTime = System.currentTimeMillis();
      System.out.println("Elapsed time: " + (endTime - startTime) + "ms");
    } catch (DocumentException e) {
      e.printStackTrace();
    }
  }

  public static void read2() {
    // Method 2: bind a prefix to the namespace on the XPath object with setNamespaceURIs().
    try {
      long startTime = System.currentTimeMillis();
      // "Workbook" serves as the prefix for the spreadsheet namespace in the expression below.
      Map<String, String> map = new HashMap<String, String>();
      map.put("Workbook", "urn:schemas-microsoft-com:office:spreadsheet");
      SAXReader reader = new SAXReader();
      // Classpath resource names use '/' as the separator, not a backslash.
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      Document doc = reader.read(in);
      String xpath = "//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]";
      System.err.println("=====use setNamespaceURIs() to set xpath namespace====");
      System.err.println("XPath: " + xpath);
      XPath x = doc.createXPath(xpath);
      x.setNamespaceURIs(map);
      @SuppressWarnings("unchecked")
      List<Element> list = x.selectNodes(doc);
      for (Object o : list) {
        Element e = (Element) o;
        String show = e.getStringValue();
        System.out.println("show = " + show);
      }
      // Report the elapsed time once, after all results have been printed.
      long endTime = System.currentTimeMillis();
      System.out.println("Elapsed time: " + (endTime - startTime) + "ms");
    } catch (DocumentException e) {
      e.printStackTrace();
    }
  }

  public static void read3() {
    // Method 3: register the namespace on the DocumentFactory with setXPathNamespaceURIs().
    try {
      long startTime = System.currentTimeMillis();
      // "Workbook" serves as the prefix for the spreadsheet namespace in the expression below.
      Map<String, String> map = new HashMap<String, String>();
      map.put("Workbook", "urn:schemas-microsoft-com:office:spreadsheet");
      SAXReader reader = new SAXReader();
      // Classpath resource names use '/' as the separator, not a backslash.
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      reader.getDocumentFactory().setXPathNamespaceURIs(map);
      Document doc = reader.read(in);
      String xpath = "//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]";
      System.err.println("=====use setXPathNamespaceURIs() to set DocumentFactory() namespace====");
      System.err.println("XPath: " + xpath);
      @SuppressWarnings("unchecked")
      List<Element> list = doc.selectNodes(xpath);
      for (Object o : list) {
        Element e = (Element) o;
        String show = e.getStringValue();
        System.out.println("show = " + show);
      }
      // Report the elapsed time once, after all results have been printed.
      long endTime = System.currentTimeMillis();
      System.out.println("Elapsed time: " + (endTime - startTime) + "ms");
    } catch (DocumentException e) {
      e.printStackTrace();
    }
  }

  public static void read4() {
    // Method 4: same as read3(), but with a more complete XPath expression.
    try {
      long startTime = System.currentTimeMillis();
      // "Workbook" serves as the prefix for the spreadsheet namespace in the expression below.
      Map<String, String> map = new HashMap<String, String>();
      map.put("Workbook", "urn:schemas-microsoft-com:office:spreadsheet");
      SAXReader reader = new SAXReader();
      // Classpath resource names use '/' as the separator, not a backslash.
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      reader.getDocumentFactory().setXPathNamespaceURIs(map);
      Document doc = reader.read(in);
      String xpath = "//Workbook:Worksheet/Workbook:Table/Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]";
      System.err.println("=====use setXPathNamespaceURIs() to set DocumentFactory() namespace====");
      System.err.println("XPath: " + xpath);
      @SuppressWarnings("unchecked")
      List<Element> list = doc.selectNodes(xpath);
      for (Object o : list) {
        Element e = (Element) o;
        String show = e.getStringValue();
        System.out.println("show = " + show);
      }
      // Report the elapsed time once, after all results have been printed.
      long endTime = System.currentTimeMillis();
      System.out.println("Elapsed time: " + (endTime - startTime) + "ms");
    } catch (DocumentException e) {
      e.printStackTrace();
    }
  }

  public static void read5() {
    // Method 5: W3C DOM with the JDK's javax.xml.xpath API.
    try {
      long startTime = System.currentTimeMillis();
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      // Namespace awareness is off, so elements can be matched by their plain names.
      dbf.setNamespaceAware(false);
      DocumentBuilder builder = dbf.newDocumentBuilder();
      // Classpath resource names use '/' as the separator, not a backslash.
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      org.w3c.dom.Document doc = builder.parse(in);
      XPathFactory factory = XPathFactory.newInstance();
      javax.xml.xpath.XPath x = factory.newXPath();
      // Select the first Data element in the third Cell of the fourth Row.
      String xpath = "//Workbook/Worksheet/Table/Row[4]/Cell[3]/Data[1]";
      System.err.println("=====Dom XPath====");
      System.err.println("XPath: " + xpath);
      XPathExpression expr = x.compile(xpath);
      // NODESET returns every match as a NodeList; casting a single NODE result
      // to NodeList depends on the DOM implementation in use.
      NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
      for (int i = 0; i < nodes.getLength(); i++) {
        // getNodeValue() is null for element nodes, so read the text content instead.
        System.out.println("show = " + nodes.item(i).getTextContent());
      }
      // Report the elapsed time once, after all results have been printed.
      long endTime = System.currentTimeMillis();
      System.out.println("Elapsed time: " + (endTime - startTime) + "ms");
    } catch (XPathExpressionException e) {
      e.printStackTrace();
    } catch (ParserConfigurationException e) {
      e.printStackTrace();
    } catch (SAXException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    } 
  }
}
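read5() sidesteps the namespace by parsing with namespace awareness switched off. For comparison, the JDK's XPath API can also bind a prefix to a namespace URI, much as read2() and read3() do in DOM4J. The following is only a sketch of that idea: the class name JdkNamespaceContextSketch, the prefix "ss", and the two-row sample document are assumptions made for this example, and this variant was not part of the measurements below.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class JdkNamespaceContextSketch {
  static final String SS_URI = "urn:schemas-microsoft-com:office:spreadsheet";

  // Hypothetical two-row document in the spreadsheet namespace.
  static final String XML =
      "<Workbook xmlns=\"" + SS_URI + "\"><Worksheet><Table>"
      + "<Row><Cell><Data>Hao</Data></Cell></Row>"
      + "<Row><Cell><Data>Sunny</Data></Cell></Row>"
      + "</Table></Worksheet></Workbook>";

  // Minimal context that knows a single prefix, "ss".
  static final NamespaceContext CONTEXT = new NamespaceContext() {
    @Override
    public String getNamespaceURI(String prefix) {
      if (prefix == null) {
        throw new IllegalArgumentException("prefix must not be null");
      }
      return "ss".equals(prefix) ? SS_URI : XMLConstants.NULL_NS_URI;
    }

    @Override
    public String getPrefix(String namespaceURI) {
      return SS_URI.equals(namespaceURI) ? "ss" : null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
      String prefix = getPrefix(namespaceURI);
      return prefix == null
          ? Collections.<String>emptyIterator()
          : Collections.singleton(prefix).iterator();
    }
  };

  // Evaluates the expression on a namespace-aware DOM with the prefix binding in place.
  static String select(String expression) {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      xpath.setNamespaceContext(CONTEXT);
      return xpath.evaluate(expression, doc);
    } catch (Exception e) {
      throw new IllegalStateException("Query failed: " + expression, e);
    }
  }

  public static void main(String[] args) {
    // Prefixed names resolve through CONTEXT and match the namespaced elements.
    System.out.println(select("//ss:Row[2]/ss:Cell[1]/ss:Data[1]"));
  }
}
```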

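3.3 Performance Comparison of the Methods

To compare the parsing performance of the methods, the experiment ran 10 rounds of tests on an XML file (XXX.xml) of more than 6 MB and more than 70,000 lines. The results are as follows: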


Figure 1 XPath performance comparison

Method	Average run time	XPath expression
read1()	1663ms	//*[local-name()='Row'][4]/*[local-name()='Cell'][3]/*[local-name()='Data'][1]
read2()	2184ms	//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]
read3()	601ms	//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]
read4()	472ms	//Workbook:Worksheet/Workbook:Table/Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]
read5()	1094ms	//Workbook/Worksheet/Table/Row[4]/Cell[3]/Data[1]
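The article does not show how the averages were collected. One way to average a method over several rounds is sketched below; the class name AverageTimingSketch and the helper averageMillis are hypothetical, and the workload in main is only a placeholder for a call such as read1().

```java
class AverageTimingSketch {
  // Runs the task the given number of times and returns the mean wall-clock
  // time per run in milliseconds.
  static double averageMillis(Runnable task, int rounds) {
    if (rounds <= 0) {
      throw new IllegalArgumentException("rounds must be positive");
    }
    long totalNanos = 0;
    for (int i = 0; i < rounds; i++) {
      long start = System.nanoTime();
      task.run();
      totalNanos += System.nanoTime() - start;
    }
    return totalNanos / 1_000_000.0 / rounds;
  }

  public static void main(String[] args) {
    // Placeholder workload; in the experiment above this would be e.g. TestDom4jXpath::read1.
    Runnable workload = () -> {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < 100_000; i++) {
        sb.append(i);
      }
    };
    System.out.printf("average over 10 rounds: %.3f ms%n", averageMillis(workload, 10));
  }
}
```

Timing each method in its own loop also reduces the influence of one-off costs such as class loading, which would otherwise be charged to whichever method happens to run first.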