My notebooks contain code snippets that I use fairly regularly, but not often enough to remember off the top of my head. Google can be misleading here, since the oldest material often has the most links.
A quick review: the Document Object Model (DOM) is a parse tree of the entire document. Browsers see their HTML (and XML) content as DOM objects. Many operations are easiest in a DOM, but this must be weighed against the comparatively heavy resources a DOM requires, since the whole tree is held in memory, whereas a SAX parser only streams events.
A SAX (Simple API for XML) parser is much lighter than a DOM, and is a lot more powerful than you might expect. But it requires thinking in streaming mode: you see each element once, as it goes by, and must keep any state you need yourself.
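To make "thinking in streaming mode" concrete, here is a minimal sketch of a handler that collects the text of every title element. The class name TitleCollector, the element name, and the sample XML are all made up for illustration; the point is only that the handler has to remember where it is between callbacks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the text content of <title> elements.
public class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    // Non-null only while the parser is inside a <title> element.
    private StringBuilder current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append instead of assigning.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(current.toString());
            current = null;
        }
    }

    public static List<String> collect(String xml) throws Exception {
        TitleCollector handler = new TitleCollector();
        // The default factory is not namespace-aware, so qName is the raw tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>Dune</title></book>"
                + "<book><title>Emma</title></book></books>";
        System.out.println(collect(xml)); // prints [Dune, Emma]
    }
}
```

The same job in a DOM would be a one-line lookup, but the whole document would have to be in memory first. This sketch does not handle a title nested inside another title.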
DOM methods:
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
// DomUtils is a placeholder name for whatever class holds these helpers.
public class DomUtils {
    private static final DocumentBuilderFactory dbFactory;
    private static final TransformerFactory tFactory;

    // static initialization
    // JAXP does not guarantee that factory implementations are thread-safe,
    // so guard concurrent calls if these are shared across threads.
    // The objects they produce are not thread-safe either.
    static {
        try {
            dbFactory = DocumentBuilderFactory.newInstance();
            tFactory = TransformerFactory.newInstance();
        } catch (FactoryConfigurationError e) {
            // unable to get a document builder factory
            throw new ExceptionInInitializerError(e);
        }
    }

    /**
     * Create an empty DOM.
     */
    public Document newDocument() throws ParserConfigurationException {
        DocumentBuilder builder = dbFactory.newDocumentBuilder();
        return builder.newDocument();
    }

    /**
     * Serialize a DOM.
     * <p>
     * You can do this "more efficiently" by walking the DOM yourself and
     * writing to a StringBuilder but that is usually not as robust
     * as using a standard DOM serializer. For instance, how do you handle
     * namespaces? UTF-16? Knowing that you need to break a CDATA section
     * into multiple pieces because there's a "]]>" in the content?
     */
    public String serialize(Document doc) throws TransformerConfigurationException, TransformerException {
        StringWriter out = new StringWriter();
        Transformer tf = tFactory.newTransformer();
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        tf.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        tf.setOutputProperty(OutputKeys.INDENT, "no");
        //tf.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "content");
        // serialize the document passed in (the original referred to an undefined "d")
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
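As a quick illustration of why a standard serializer is worth the overhead, here is a small sketch that builds a one-element DOM and serializes it the same way as above. The class name DomRoundTrip and the element and attribute names are invented for the example. It omits the XML declaration only to keep the output short.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class: build a tiny DOM and serialize it to a String.
public class DomRoundTrip {
    static String build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        // <note lang="en"> with text that contains XML-special characters.
        Element root = doc.createElement("note");
        root.setAttribute("lang", "en");
        root.appendChild(doc.createTextNode("a < b & c"));
        doc.appendChild(root);

        StringWriter out = new StringWriter();
        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The serializer escapes '<' and '&' in the text node for us.
        System.out.println(build());
    }
}
```

The text node goes in as raw characters and comes out escaped, which is exactly the kind of detail a hand-rolled StringBuilder walk tends to get wrong.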
SAX methods:
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;
/**
 * Sample class that handles SAX callback methods.
 */
private static class MyHandler extends DefaultHandler2 {
    // Override startElement(), characters(), endElement(), etc. here.

    // parse a document
    // a SAXParseException gives line and column number.
    void parse(File file) throws SAXException, IOException {
        // try-with-resources closes the reader even if parsing fails
        try (Reader r = new FileReader(file)) {
            MyHandler h = new MyHandler();
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setContentHandler(h);
            // XmlEntityReader is not a JDK class; it is assumed to be a custom
            // Reader wrapper defined elsewhere. Pass r directly if you have none.
            reader.parse(new InputSource(new XmlEntityReader(r)));
        }
    }
}
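The comment above notes that a SAXParseException gives the line and column number. Here is a minimal sketch of pulling those out when a document is not well-formed. The class name ParseErrorLocation and the "line:column" return format are my own choices for the example, and the exact column reported depends on the parser implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: report where a parse failed.
public class ParseErrorLocation {
    /** Returns "line:column" of the first fatal error, or null if the input parses cleanly. */
    static String locate(String xml) throws Exception {
        try {
            // DefaultHandler.fatalError() rethrows, so the error reaches us here.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        // <b> is never closed, so the parser fails at the mismatched </a>.
        System.out.println(locate("<a>\n<b>\n</a>"));
    }
}
```

In a real handler you would more likely log the location together with e.getMessage() and e.getSystemId() than return a string.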