Invariant Properties

Notebook: Common XML Tasks

My notebooks contain code snippets that I use fairly regularly, but not often enough to remember off the top of my head.  Searching Google can be misleading, since the oldest material often has the most links.

A quick review: the Document Object Model (DOM) is a parse tree of the entire document. Browsers see their HTML (and XML) content as DOM objects. Many operations are easiest in a DOM, but this must be weighed against the comparatively heavy resources a DOM requires versus a SAX parser.
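As a rough illustration of what "parse tree of the entire document" buys you, here is a minimal sketch that parses a small string and jumps straight to an element by position. The class name DomSketch, the helper textOf, and the sample XML are made up for this example; only the JAXP and org.w3c.dom calls are standard.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomSketch {
    /** Parse an XML string into a DOM; the whole tree is held in memory. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Return the text content of the index-th element with the given tag name. */
    static String textOf(Document doc, String tag, int index) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return ((Element) nodes.item(index)).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<notes><note>first</note><note>second</note></notes>");
        // Random access: go directly to the second <note> ...
        System.out.println(textOf(doc, "note", 1));
        // ... and then back to the root, without re-reading the input.
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The convenience comes from the tree staying in memory, which is also where the cost comes from on large documents.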

A SAX (Simple API for XML) parser is much lighter than a DOM, and is a lot more powerful than you might expect. But it requires thinking in streaming mode: the parser reports events as it reads the input and keeps nothing for you to go back to.
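To make "streaming mode" concrete, here is a small sketch that counts elements by name as they go past. The names SaxSketch, CountingHandler, and countElements are invented for the example; it uses SAXParserFactory and DefaultHandler, which are standard, and assumes the input fits in a String only to keep the example short.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {
    /** Counts elements by name as they stream past; nothing else is retained. */
    static class CountingHandler extends DefaultHandler {
        final Map<String, Integer> counts = new LinkedHashMap<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            // Called once per opening tag, in document order.
            counts.merge(qName, 1, Integer::sum);
        }
    }

    static Map<String, Integer> countElements(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.counts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<notes><note>first</note><note>second</note></notes>"));
    }
}
```

Any state you need later (here, the running counts) has to be kept by the handler itself, since the parser never hands back a tree.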

DOM methods:

import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

private static final DocumentBuilderFactory dbFactory;
private static final TransformerFactory tFactory;

// static initialization
// The factories are created once and shared.  Note that the JAXP docs do not
// guarantee that factory implementations are thread-safe, so synchronize on
// them (or keep one per thread) if that matters.  The builders and
// transformers they produce are not thread-safe.
static {
   try {
      dbFactory = DocumentBuilderFactory.newInstance();
      tFactory = TransformerFactory.newInstance();
   } catch (FactoryConfigurationError | TransformerFactoryConfigurationError e) {
      // unable to get a document builder factory or a transformer factory
      throw new ExceptionInInitializerError(e);
   }
}

/**
 * Create an empty DOM.
 */
public Document newDocument() throws ParserConfigurationException{
  DocumentBuilder builder = dbFactory.newDocumentBuilder();
  return builder.newDocument();
}

/**
 * Serialize a DOM.
 * <p>
 * You can do this "more efficiently" by walking the DOM yourself and
 * writing to a StringBuilder but that is usually not as robust
 * as using a standard DOM serializer.  For instance, how do you handle
 * namespaces?  UTF-16?  Knowing that you need to break a CDATA section
 * into multiple pieces because there's a "]]>" in the content?
 */
public String serialize(Document doc) throws TransformerException {
   StringWriter out = new StringWriter();
   Transformer tf = tFactory.newTransformer();
   tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
   tf.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
   tf.setOutputProperty(OutputKeys.INDENT, "no");
   //tf.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "content");
   tf.transform(new DOMSource(doc), new StreamResult(out));

   return out.toString();
}
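
For reference, here is a self-contained sketch that builds a tiny DOM and serializes it the same way, so you can see the serializer doing the escaping for you. The class name SerializeSketch and the element and attribute names are made up for the example, and it omits the XML declaration only to keep the output short.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeSketch {
    /** Build a two-element document and return it as a string. */
    static String build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("notebook");
        doc.appendChild(root);

        Element entry = doc.createElement("entry");
        entry.setAttribute("topic", "xml");
        // Raw text with markup characters; the serializer escapes them.
        entry.setTextContent("a < b & c");
        root.appendChild(entry);

        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        tf.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build());
    }
}
```

A hand-rolled StringBuilder walk would have to get that escaping right on its own, which is the robustness argument made in the comment above.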

SAX methods:

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;

/**
 * Sample class that handles SAX callback methods.
 */
private static class MyHandler extends DefaultHandler2 {
   // parse a document
   // a SAXParseException (a subclass of SAXException) gives line and column number.
   void parse(File file) throws SAXException, IOException {
      Reader r = null;
      try {
         r = new FileReader(file);
         MyHandler h = new MyHandler();
         // XMLReaderFactory is deprecated since Java 9; SAXParserFactory is the usual replacement.
         XMLReader reader = XMLReaderFactory.createXMLReader();
         reader.setContentHandler(h);
         // XmlEntityReader is not defined in this notebook; pass r directly if you do not have it.
         reader.parse(new InputSource(new XmlEntityReader(r)));
      } finally {
         if (r != null) {
            r.close();
         }
      }
   }
}