Lesson 20 continued: Processing XML With the DOM

To start our discussion, download Lesson-20-Example-01 and import it to your Eclipse workspace.

To get a clean compile, you may need to add commons-lang…jar from the Libs project to the build classpath. (Click here for a refresher on how to add JARs to the build classpath.)

Now try running the class BroadwayReaderDOM. Your output should look something like this:

Title: Mame Playwright: Jerome Lawrence,Robert E. Lee Theatre: Winter Garden Composer: Jerry Herman Lyricist: Jerry Herman Opening: 05/24/1966 Closing: 01/03/1970 Previews: 5 Performances: 1508
Title: A Chorus Line Playwright: James Kirkwood.,Nicholas Dante Theatre: Shubert Composer: Marvin Hamlisch Lyricist: Edward Kleban Opening: 07/25/1975 Closing: (null) Previews: (null) Performances: 6137
...

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<broadway>
    <show title="Kismet">
        <theatre>Ziegfeld</theatre>
        <playwrights>
            <playwright>Charles Lederer</playwright>
            <playwright>Luther Davis</playwright>
        </playwrights>
        <composers>
            <composer>Robert Wright</composer>
            <composer>George Forrest</composer>
            <composer>Aleksandr Borodin</composer>
        </composers>
        <lyricists>
            <lyricist>Robert Wright</lyricist>
            <lyricist>George Forrest</lyricist>
        </lyricists>
        <opening>12/03/1953</opening>
        <closing>04/23/1955</closing>
        <performances>583</performances>
    </show>
    <show title="The Odd Couple">
        <theatre>Plymouth</theatre>
        <playwrights>
            <playwright>Neil Simon</playwright>
        </playwrights>
        <composers/>
        <lyricists/>
        <opening>03/10/1965</opening>
        <closing>07/02/1967</closing>
        <previews>2</previews>
        <performances>964</performances>
    </show>
...
</broadway>

That is, the program prints the contents of our input file, followed by an XML re-creation of it with the <show> tags in reverse order.

Let’s dive into BroadwayReaderDOM to see what we’ve done.

The input to our program is this file, data/broadway.xml, and it looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<broadway>
  <show title="Mame">
    <theatre>Winter Garden</theatre>
    <playwrights>
      <playwright>Jerome Lawrence</playwright>
      <playwright>Robert E. Lee</playwright>
    </playwrights>
    <composers>
      <composer>Jerry Herman</composer>
    </composers>
    <lyricists>
      <lyricist>Jerry Herman</lyricist>
    </lyricists>
    <opening>05/24/1966</opening>
    <closing>01/03/1970</closing>
    <previews>5</previews>
    <performances>1508</performances>
  </show>
  <show title="A Chorus Line">
    <theatre>Shubert</theatre>
    <playwrights>
      <playwright>James Kirkwood.</playwright>
      <playwright>Nicholas Dante</playwright>
    </playwrights>
    <composers>
      <composer>Marvin Hamlisch</composer>
    </composers>
    <lyricists>
      <lyricist>Edward Kleban</lyricist>
    </lyricists>
    <opening>07/25/1975</opening>
    <closing />
    <performances>6137</performances>
  </show>
  ...
</broadway>

Reading XML With the DOM

The objectives of our project are to read, then write, our XML file using the Document Object Model (DOM).

Our project reads each <show> tag from the input and stores it in a list of POJOs of class ShowPOJO. (When we use the DOM to read and write the file, it doesn’t matter that our POJO has a different name from the tag, but on the next page it becomes very important. Stay tuned!)

So open up BroadwayReaderDOM.java and follow along.

First, notice that we have two variables defined at the instance level (they could just as well be static, since they don’t change for the life of the run):

    private DocumentBuilderFactory factory = null;
    private DocumentBuilder builder = null;

These classes are in package javax.xml.parsers. Our constructor initializes them:

    factory = DocumentBuilderFactory.newInstance();
    builder = factory.newDocumentBuilder();

Not surprisingly, DocumentBuilderFactory.newInstance() returns an instance of DocumentBuilderFactory, and newDocumentBuilder() returns an instance of DocumentBuilder, and it’s the latter that can parse XML text into a Document.
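
A DocumentBuilder isn’t limited to files. As a minimal sketch (the class name ParseStringDemo and its parse() helper are made up for illustration and are not part of the lesson’s project), here is the same factory-and-builder sequence parsing XML text held in a String:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the lesson's project.
class ParseStringDemo {
    // Parses XML text held in memory into a DOM Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document document = parse("<broadway><show title=\"Mame\"/></broadway>");
        // The document element is the outermost tag.
        System.out.println(document.getDocumentElement().getTagName());
    }
}
```

Running main() should print the name of the root tag, broadway.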

Method read()

The read() method starts like this:

private void read(String fileName) throws Exception {
    File showFile = new File(fileName);
    Document document = builder.parse(showFile);

    // nodeList contains all the <show> nodes.
    NodeList nodeList = document.getElementsByTagName("show");

NodeList is an ordered collection of Nodes. Despite the name, it is not a java.util.List (and not an array either): you access it through getLength() and item(index). The last statement gets all the elements named “show” in the document. Node is an interface representing nearly anything within the document. Many interfaces extend Node:

Attr
Comment
Element
Entity
Notation
Text
and more

The Node interface defines constants (Node.ELEMENT_NODE, Node.TEXT_NODE, and so on) that you can compare with the value returned by getNodeType() to determine the kind of Node you’re dealing with.

    // Process each <show> in turn.
    for (int index = 0; index < nodeList.getLength(); index++) {
        Node node = nodeList.item(index);
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element showElement = (Element) node;

            // The title is an Attribute of the <show> node.
            String title = showElement.getAttribute(ShowPOJO.FIELD_TITLE);

Here we’re processing each of the Nodes in nodeList. First, we make sure it’s an Element, and if so we cast it to an Element variable for convenience.
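
The node-type check matters most with getChildNodes(), which returns every kind of child, whereas getElementsByTagName() returns only Elements. The following sketch (NodeTypeDemo and describe() are hypothetical names used only for illustration) labels each child of a small <show> tag by comparing getNodeType() with the Node constants:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the lesson's project.
class NodeTypeDemo {
    // Maps a node's type constant to a readable label.
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                return "element";
            case Node.TEXT_NODE:
                return "text";
            case Node.COMMENT_NODE:
                return "comment";
            default:
                return "other";
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<show><!-- a hit --><theatre>Shubert</theatre>\n</show>";
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // The children of <show> here: a comment, an element, and whitespace text.
        NodeList children = document.getDocumentElement().getChildNodes();
        for (int index = 0; index < children.getLength(); index++) {
            System.out.println(describe(children.item(index)));
        }
    }
}
```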

The <show> tag has an attribute: title. The getAttribute() method retrieves its value; FIELD_TITLE is one of the field name constants we’ve defined in ShowPOJO.
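
One detail worth knowing: getAttribute() returns an empty String, not null, when the attribute is absent, and hasAttribute() tells the two cases apart. A minimal sketch (AttributeDemo, parseRoot(), and the “year” attribute are hypothetical, chosen only for this example):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the lesson's project.
class AttributeDemo {
    // Parses XML text and returns its outermost Element.
    static Element parseRoot(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element show = parseRoot("<show title=\"Mame\"/>");
        System.out.println(show.getAttribute("title"));          // Mame
        System.out.println(show.getAttribute("year").isEmpty()); // true: absent attribute gives ""
        System.out.println(show.hasAttribute("year"));           // false
    }
}
```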

// The remaining fields are elements nested within the <show> node.
String theatre = getNodeStringValue(ShowPOJO.FIELD_THEATRE, showElement);
List<String> playwrightList = getListValue(ShowPOJO.FIELD_PLAYWRIGHTS, ShowPOJO.FIELD_PLAYWRIGHT, showElement);

Next, we’ll get the value of the <theatre> tag, and the list of <playwright> tag values–there may be more than one playwright. To help with these tasks, we’ve defined two methods.

// This method returns the text contained within an Element.
// Note that getElementsByTagName() always returns a NodeList,
// because the DOM assumes there might be more than one match.
// We're interested in only the first, even if there are more.
private String getNodeStringValue(String elementName, Element element) {
    String returnValue = null;
    NodeList list = element.getElementsByTagName(elementName);
    if (list.getLength() > 0) {
        Node node = list.item(0);
        returnValue = node.getTextContent();
    }

    return returnValue;
}

Our getNodeStringValue() method receives the name of the tag from which we want a text value (“theatre” on the first call), and the Element within which the desired tag can be found–a <show> tag here. Notice how we call getElementsByTagName() as we did on the Document as a whole, but this time we’re getting tags within the <show> tag only.

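getElementsByTagName() can’t return a single Node: it always returns a NodeList, even when only one element matches, which is why we take item(0). It also searches all descendants of the Element, not just its direct children.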
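
It helps to see what this approach returns for a tag that is present but empty, like <closing /> in the input, compared with a tag that is absent altogether. The sketch below re-creates the first-match logic in a hypothetical class (FirstValueDemo and firstValue() are illustrative names, not the lesson’s code):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the lesson's project.
class FirstValueDemo {
    // Same idea as getNodeStringValue(): take the first match, if any.
    static String firstValue(String elementName, Element element) {
        NodeList list = element.getElementsByTagName(elementName);
        return list.getLength() > 0 ? list.item(0).getTextContent() : null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<show title=\"A Chorus Line\">"
                + "<theatre>Shubert</theatre><closing /></show>";
        Element show = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();

        System.out.println(firstValue("theatre", show));           // Shubert
        System.out.println(firstValue("closing", show).isEmpty()); // true: tag present but empty
        System.out.println(firstValue("previews", show) == null);  // true: tag absent
    }
}
```

So code that converts the text further has to be ready for both null and the empty String.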

private List<String> getListValue(String parentName, String childName, Element element) {
    List<String> returnValue = new ArrayList<String>();

    NodeList parentList = element.getElementsByTagName(parentName);

    for (int parentIndex = 0; parentIndex < parentList.getLength(); parentIndex++) {
        Node parentNode = parentList.item(parentIndex);
        NodeList childList = parentNode.getChildNodes();
        for (int childIndex = 0; childIndex < childList.getLength(); childIndex++) {
            Node childNode = childList.item(childIndex);
            if (childNode.getChildNodes().getLength() > 0) {
                returnValue.add(childNode.getTextContent());
            }
        }
    }

    return returnValue;
}

Our getListValue() method isn’t so different from getNodeStringValue(), except that it first gets a list of parent nodes of a given name (“playwrights”) and gets the text from all the children of each parent. Since the only tag we have within <playwrights> is <playwright>, we don’t worry about the names of the nodes we’re retrieving, so we can use the DOM’s getChildNodes() method to retrieve a NodeList from which we’ll add text contents to a List. The getChildNodes().getLength() > 0 test skips the whitespace Text nodes that sit between the tags, because a Text node has no children of its own.
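
When the input is indented, getChildNodes() also returns the whitespace between tags as Text nodes. As an alternative sketch (not the lesson’s code; ChildElementsDemo and childElementValues() are hypothetical names), you can filter on the node type directly:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the lesson's project.
class ChildElementsDemo {
    // Collects the text of each child Element, skipping Text and Comment nodes.
    static List<String> childElementValues(Element parent) {
        List<String> values = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int index = 0; index < children.getLength(); index++) {
            Node child = children.item(index);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                values.add(child.getTextContent());
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<playwrights>\n"
                + "  <playwright>Jerome Lawrence</playwright>\n"
                + "  <playwright>Robert E. Lee</playwright>\n"
                + "</playwrights>";
        Element playwrights = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();

        // Three whitespace Text nodes plus two Elements.
        System.out.println(playwrights.getChildNodes().getLength()); // 5
        System.out.println(childElementValues(playwrights)); // [Jerome Lawrence, Robert E. Lee]
    }
}
```

Unlike a test on the number of children, this version would also include an empty child tag, contributing an empty String to the list.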

We’ve also got methods to parse the text contents from a tag into Dates and Integers.
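
Those methods aren’t shown on this page, so the following is only a sketch of how such helpers might look. The names parseDate() and parseInteger() and the choice to return null for missing or empty text are assumptions, although the “(null)” values in the sample output suggest something similar. The date pattern MM/dd/yyyy is the one used in the data.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helpers, not the lesson's actual methods.
class ValueParsingDemo {
    // Returns null for missing or blank text; otherwise parses MM/dd/yyyy.
    static Date parseDate(String text) throws ParseException {
        if (text == null || text.trim().isEmpty()) {
            return null;
        }
        SimpleDateFormat format = new SimpleDateFormat("MM/dd/yyyy");
        format.setLenient(false); // reject impossible dates such as 13/45/1966
        return format.parse(text.trim());
    }

    // Returns null for missing or blank text; otherwise parses a whole number.
    static Integer parseInteger(String text) {
        if (text == null || text.trim().isEmpty()) {
            return null;
        }
        return Integer.valueOf(text.trim());
    }

    public static void main(String[] args) throws Exception {
        Date opening = parseDate("05/24/1966");
        System.out.println(new SimpleDateFormat("MM/dd/yyyy").format(opening)); // 05/24/1966
        System.out.println(parseInteger("1508")); // 1508
        System.out.println(parseDate(""));        // null (empty tag)
        System.out.println(parseInteger(null));   // null (absent tag)
    }
}
```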

You can see how the rest of the code in read() is used to acquire text contents from the tags within <show> and build ShowPOJO instances which are stored in a List. Done!

Writing XML With the DOM

So now that we’ve got a list of ShowPOJO instances, let’s write their contents out again in XML format.

// Write a new XML file containing the list of Shows,
// in reverse order.
private void write() throws Exception {
    Document doc = builder.newDocument();
    Element rootElement = doc.createElement("broadway");
    doc.appendChild(rootElement);

The first step is to get a new Document from the DocumentBuilder, just as when reading, except that we call newDocument() rather than parse() because there’s no input source.

We start by creating rootElement and providing the tag name “broadway,” then appending it to the Document. rootElement is the outermost tag in the file. From here on, it’s just a matter of creating other elements for the <show> tags and appending them to rootElement, then appending Elements contained within each <show> Element to that.

    for (int index = showList.size() - 1; index >= 0; index--) {
        ShowPOJO show = showList.get(index);
        Element showElement = doc.createElement("show");
        showElement.setAttribute(ShowPOJO.FIELD_TITLE, show.getTitle());
        rootElement.appendChild(showElement);

Here we’re creating an Element for the <show> tag, setting its title attribute, and appending it to the root. (Remember that the values starting with “ShowPOJO.FIELD_” are constants set to the corresponding tag names.) If we were to create the XML from what we’ve got so far, it would look something like this:

<broadway>
   <show title="Mame" />
</broadway>

Next, we add the tags for other Elements of ShowPOJO.

        createChild(ShowPOJO.FIELD_THEATRE, show.getTheatre(), doc, showElement);

Our createChild() method has two overloads that append a child Element. The first receives a tag name, the value to assign to the tag, the Document, and the Element to which the new child will be appended. (The Document is needed only because it supplies the createElement() method.)

// This method creates a child Element of the specific node name
// and value, and appends it to the parent Element.
private void createChild(String nodeName, Object value, Document doc, Element parent) {
    // If value is null, don't create an Element at all.
    if (value == null) {
        return;
    }
    String valueStr = "";

    // Determine the String representation of value.
    if (value instanceof String) {
        valueStr = (String) value;
    }
    if (value instanceof Date) {
        SimpleDateFormat fmt = new SimpleDateFormat("MM/dd/yyyy");
        valueStr = fmt.format((Date) value);
    }
    if (value instanceof Integer) {
        valueStr = value.toString();
    }

    // Create the Element, set its text value, and append it
    // to the parent.
    Element child = doc.createElement(nodeName);
    child.setTextContent(valueStr);
    parent.appendChild(child);
}

At this point, our XML would look like this:

<broadway>
   <show title="Mame">
       <theatre>Winter Garden</theatre>
   </show>
</broadway>

For appending lists of values, we have a second overload.

        createChild(ShowPOJO.FIELD_PLAYWRIGHTS, ShowPOJO.FIELD_PLAYWRIGHT, show.getPlaywrightList(), doc,
                    showElement);

This passes the grouping tag name (“playwrights”), the inner tag name (“playwright”), the list of values to appear in the tags, the Document, and the Element to which the new tag will be appended.

// This is an overload that creates nodes from a List of values.
private void createChild(String parentName, String childName, List<String> valueList, Document doc, Element root) {
    Element parentElement = doc.createElement(parentName);
    root.appendChild(parentElement);
    for (String value : valueList) {
        createChild(childName, value, doc, parentElement);
    }
}

In this overload, we’ve created an Element to attach to <show>, then appended each value in the list to that element.

And now our XML looks like this:

<broadway>
   <show title="Mame">
       <theatre>Winter Garden</theatre>
       <playwrights>
          <playwright>Jerome Lawrence</playwright>
          <playwright>Robert E. Lee</playwright>
       </playwrights>
   </show>
</broadway>
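
One consequence of the list overload is that the grouping Element is appended before the loop runs, so an empty list still yields an empty grouping tag. That would explain the <composers/> and <lyricists/> tags in the sample output for The Odd Couple. A minimal sketch of that behavior, using a hypothetical stand-in (EmptyListDemo and appendList() are illustrative names, not the lesson’s code):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the lesson's project.
class EmptyListDemo {
    // Mirrors the list overload: one grouping Element, one child per value.
    static Element appendList(String parentName, String childName, List<String> values,
                              Document doc, Element root) {
        Element parent = doc.createElement(parentName);
        root.appendChild(parent);
        for (String value : values) {
            Element child = doc.createElement(childName);
            child.setTextContent(value);
            parent.appendChild(child);
        }
        return parent;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element show = doc.createElement("show");
        doc.appendChild(show);

        Element playwrights = appendList("playwrights", "playwright",
                Arrays.asList("Neil Simon"), doc, show);
        Element composers = appendList("composers", "composer",
                Collections.<String>emptyList(), doc, show);

        System.out.println(playwrights.getChildNodes().getLength()); // 1
        System.out.println(composers.hasChildNodes());               // false: an empty grouping tag
    }
}
```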

Once all the other tags have been appended, we can finally write the Document to an output stream. In our example, once the stream is written, we return its contents by calling toString().

// Return XML from the supplied Document.
private String writeXml(Document doc) throws TransformerException {
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();

    // Set a property that "pretty prints" the output.
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");

    DOMSource source = new DOMSource(doc);
    OutputStream output = new ByteArrayOutputStream();
    StreamResult result = new StreamResult(output);

    // Write the XML.
    transformer.transform(source, result);

    return output.toString();
}

Here are the steps to create XML from a Document.

Create a TransformerFactory, and from that create a Transformer. A Transformer transforms a source into a result.
In our example, we’ve set a property on the Transformer to add line breaks and spacing to the XML it will ultimately create. Of course, this is optional.
Create a DOMSource from our Document. A DOMSource “acts as a holder for a transformation Source tree in the form of a Document Object Model (DOM) tree” according to the documentation.
Create an OutputStream, in this case a ByteArrayOutputStream that writes to an in-memory byte array, and a StreamResult from it. (StreamResult instances may also be directed at Files or Writers.)
Use the Transformer to convert the DOMSource created from the Document to XML in the OutputStream via the StreamResult.
Finally, return the OutputStream contents using toString(), which decodes the bytes using the platform’s default character set.
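
The steps above can be sketched end to end as follows. This is a variation rather than the lesson’s writeXml(): it assumes a StringWriter as the destination (StreamResult accepts a Writer), which avoids converting bytes back to characters. The class name WriteToStringDemo and the buildSample() helper are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the lesson's project.
class WriteToStringDemo {
    // Serializes a Document to indented XML text.
    static String toXml(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");

        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    // Builds a tiny <broadway> document with one <show>.
    static Document buildSample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("broadway");
        doc.appendChild(root);

        Element show = doc.createElement("show");
        show.setAttribute("title", "Mame");
        root.appendChild(show);

        Element theatre = doc.createElement("theatre");
        theatre.setTextContent("Winter Garden");
        show.appendChild(theatre);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(buildSample()));
    }
}
```

The exact indentation of the result can vary between Java versions, so it’s safer not to depend on it.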

What You Need to Know

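The Document Object Model (DOM) represents an XML document as an in-memory tree of Nodes. A DocumentBuilder parses XML into a Document, and a Transformer writes a Document back out as XML.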

Next: Processing XML with JAXB