Help! My XML File Is Too Big To Fit Into Memory!

You may remember that one of the requirements for well-formed XML is that the whole document be enclosed in a single paired tag. And up to now, with both the DOM and JAXB, we’ve been taking in an entire XML file in one go.

But data files are frequently much larger than can be held in memory at one time. Fortunately, there’s no rule that says that the Document you build with the DOM, or the tag either unmarshalled or marshalled with JAXB, must be the outermost one.

Breaking Off Chunks

Key to processing a huge XML input file is picking off and parsing inner tags one by one. This is surprisingly easy.

(You may recognize this technique from Lesson 9.)

Take a look at class BroadwayReaderDOMSingles in Lesson-20-Example-01. In it, you’ll see a modified read() method.

// Read the XML and store it.
private void read(String fileName) throws Exception {
    File showFile = new File(fileName);

    // Establish a Reader to read the file, and a StringBuilder
    // to hold its contents. The try-with-resources statement
    // closes the Reader when we're done.
    StringBuilder buf = new StringBuilder();
    try (Reader rdr = new FileReader(showFile)) {
        // Read into the buffer until we hit a <show> tag.
        // Then delete all characters that preceded it and
        // read until we have a </show> tag. Present the
        // enclosed characters to the parser.
        String tag;
        while ((tag = extractShowTag(rdr, buf)) != null) {
            addToPojoList(tag);
        }
    }
}

The engine behind this code is the extractShowTag() method, which finds and returns the next <show> tag in the input.

private static final String TAG_SHOW_START = "<show";
private static final String TAG_SHOW_END = "</show>";

// Get the next <show> tag from the input.
private String extractShowTag(Reader rdr, StringBuilder buf) throws IOException {
    String returnString = null;
    int start = -1;

    // Read more into the buffer until a start tag appears
    // or until the input is exhausted.
    char[] input = new char[60];
    while ((start = buf.indexOf(TAG_SHOW_START)) < 0) {
        int charsRead = rdr.read(input);
        if (charsRead < 0) {
            break;
        }
        // Append only the characters actually read; the array
        // may be only partly filled.
        buf.append(input, 0, charsRead);
    }

    // If no start tag is found, return a null.
    if (start < 0) {
        return null;
    }

    // Remove the characters up to the start tag.
    buf.delete(0, start);

    // Now search for the end tag.
    int end = -1;
    while ((end = buf.indexOf(TAG_SHOW_END)) < 0) {
        int charsRead = rdr.read(input);
        if (charsRead < 0) {
            break;
        }
        // Append only the characters actually read.
        buf.append(input, 0, charsRead);
    }

    if (end < 0) {
        return null;
    }

    // Break off the complete <show> tag and return it.
    end += TAG_SHOW_END.length();
    returnString = buf.substring(0, end);
    buf.delete(0, end);
    return returnString;
}

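Notice how the StringBuilder, buf, persists between calls to extractShowTag(). Any characters read past the end of one <show> tag stay in the buffer and are picked up by the next call.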
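
To watch the chunking work in isolation, here is a minimal, self-contained sketch that runs the same search-and-trim loop over a String instead of a file. The class name ShowChunker, the helper fillUntil(), the method extractAll(), and the sample XML are all made up for illustration; only the technique comes from the lesson's example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the lesson's example project.
public class ShowChunker {
    private static final String TAG_SHOW_START = "<show";
    private static final String TAG_SHOW_END = "</show>";

    // Read until marker appears in buf. Return its index, or -1 if
    // the input runs out first.
    private static int fillUntil(Reader rdr, StringBuilder buf, String marker)
            throws IOException {
        char[] input = new char[60];
        int index;
        while ((index = buf.indexOf(marker)) < 0) {
            int charsRead = rdr.read(input);
            if (charsRead < 0) {
                return -1;
            }
            // Append only the characters actually read.
            buf.append(input, 0, charsRead);
        }
        return index;
    }

    // Return the next complete <show>...</show> fragment, or null
    // when no further complete fragment is available.
    public static String nextChunk(Reader rdr, StringBuilder buf) throws IOException {
        int start = fillUntil(rdr, buf, TAG_SHOW_START);
        if (start < 0) {
            return null;
        }
        buf.delete(0, start);

        int end = fillUntil(rdr, buf, TAG_SHOW_END);
        if (end < 0) {
            return null;
        }
        end += TAG_SHOW_END.length();
        String chunk = buf.substring(0, end);
        buf.delete(0, end);
        return chunk;
    }

    // Collect every <show> fragment found in the supplied XML text.
    public static List<String> extractAll(String xml) {
        List<String> chunks = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        try (Reader rdr = new StringReader(xml)) {
            String chunk;
            while ((chunk = nextChunk(rdr, buf)) != null) {
                chunks.add(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><broadway>"
                + "<show title=\"First\"><theater>One</theater></show>"
                + "<show title=\"Second\"><theater>Two</theater></show>"
                + "</broadway>";
        for (String chunk : extractAll(xml)) {
            System.out.println(chunk);
        }
    }
}
```

Bear in mind that a plain text search like this is only as smart as its search strings: it would also match a tag whose name merely begins with "show", and it does not handle a self-closing <show/> tag. For input whose layout you control, that is usually an acceptable trade.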

Now the beauty part. When using DOM to parse the input, we don’t have to change anything. We still call document.getElementsByTagName("show"); the only difference is that the returned NodeList will contain only one element.
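
As a rough illustration of that point, the sketch below parses one extracted fragment on its own and asks for its <show> elements. The class name SingleShowDom, the method parseShows(), and the "title" attribute name are assumptions made for this example; the lesson's own classes may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the lesson's example project.
public class SingleShowDom {

    // Parse a single <show> fragment as a complete document and
    // return the NodeList of <show> elements it contains.
    public static NodeList parseShows(String fragment) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document =
                    builder.parse(new InputSource(new StringReader(fragment)));
            return document.getElementsByTagName("show");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // "title" is an assumed attribute name for this sketch.
        String fragment = "<show title=\"Example\"><theater>One</theater></show>";
        NodeList shows = parseShows(fragment);
        System.out.println("Number of <show> elements: " + shows.getLength());
        Element show = (Element) shows.item(0);
        System.out.println("Title: " + show.getAttribute("title"));
    }
}
```

Because the fragment has <show> as its outermost tag, the parser treats it as a well-formed document in its own right.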

The change for JAXB is only a little more complicated. Since our outermost tag will now be <show> instead of <broadway>, we change the JAXBContext.newInstance() call in the constructor from

jaxbContext = JAXBContext.newInstance(Broadway.class);

to

jaxbContext = JAXBContext.newInstance(Show.class);

And instead of

Broadway broadway = (Broadway) jaxbUnmarshaller.unmarshal(xmlSource);

and processing the list of Show instances within broadway, we code

Show show = (Show) jaxbUnmarshaller.unmarshal(xmlSource);

and process the single instance.

Building the Output

Just as we process one <show> tag at a time for input, we create one <show> tag at a time for output.

But though we could simply ignore the outer <broadway> tag when reading, we have to provide it ourselves when writing.

Fortunately, this is easy!

Creating Inner XML with DOM

private static final String LINE_SEPARATOR = System.getProperty("line.separator");

// Format the list of ShowPOJOs in XML, surrounded by
// the outer <broadway> tags, and write to the output
// stream one show at a time.
private void write(OutputStream out) throws Exception {
    // Since we're not having the DOM write the outer tags,
    // we need to write the XML declaration here. This code
    // will accomplish that.
    {
        Document doc = builder.newDocument();
        
        // Write the document to the output stream with no tags.
        // Only the XML declaration will be written.
        out.write(formatXml(doc, false).getBytes());
    }
        
    out.write(LINE_SEPARATOR.getBytes());
    out.write("<broadway>".getBytes());
    out.write(LINE_SEPARATOR.getBytes());

    for (int index = showList.size() - 1; index >= 0; index--) {
        Document doc = builder.newDocument();

        ShowPOJO show = showList.get(index);
        Element rootElement = doc.createElement("show");
        rootElement.setAttribute(ShowPOJO.FIELD_TITLE, show.getTitle());
        doc.appendChild(rootElement);
   ...
        out.write(formatXml(doc, true).getBytes());
    }

    // Write the closing outer tag.
    out.write("</broadway>".getBytes());
}

Notice how we can ask the DOM to format an XML declaration for us. Next, we write the <broadway> tag to the output ourselves. Then we add <show> tags by creating a new Document for each, and using method formatXml() to format XML from it. Finally, we write a closing </broadway> tag, and we’re done.

// Return XML from the supplied Document.
private String formatXml(Document doc, boolean omitXmlDeclaration)
        throws TransformerException {

    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    // When formatting an inner tag, we don't want the XML
    // declaration to appear. We'd also like "pretty print."
    if (omitXmlDeclaration) {
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // Set a property that "pretty prints" the output.
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    }

    DOMSource source = new DOMSource(doc);
    OutputStream output = new ByteArrayOutputStream();
    StreamResult result = new StreamResult(output);

    // Write the XML.
    transformer.transform(source, result);
      
    return output.toString();
}

The process of transforming a Document to XML is unchanged. But we’ve added a provision for suppressing the XML declaration lest it preface each <show> tag.
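
The whole output pattern can be sketched in one small, self-contained class. This is only an approximation under stated assumptions: the class name BroadwayFragmentWriter, the write() signature, and the "title" attribute name are hypothetical, and for simplicity the sketch writes the XML declaration as a literal string rather than asking the DOM to format it, as the lesson's code does.

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; not part of the lesson's example project.
public class BroadwayFragmentWriter {

    // Write the declaration and the outer <broadway> tags by hand,
    // and one <show> fragment per title in between.
    public static void write(List<String> titles, PrintWriter out) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer();

            // Every fragment is an inner tag, so suppress the
            // declaration and ask for pretty printing.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");

            // Assumption: a literal declaration stands in for the
            // lesson's DOM-generated one.
            out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            out.println("<broadway>");

            for (String title : titles) {
                // A fresh Document per show keeps memory use flat.
                Document doc = builder.newDocument();
                Element show = doc.createElement("show");
                show.setAttribute("title", title);
                doc.appendChild(show);

                transformer.transform(new DOMSource(doc), new StreamResult(out));
                out.println();
            }

            out.println("</broadway>");
            out.flush();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
    }

    public static void main(String[] args) {
        write(Arrays.asList("First", "Second"), new PrintWriter(System.out));
    }
}
```

Since each fragment goes to the output as soon as it is formatted, only one <show> Document needs to exist in memory at any moment.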

Creating Inner XML with JAXB

For JAXB, we’ve already created our JAXBContext based on the <show> tag instead of the <broadway> tag.

As with the DOM, we need to write the XML declaration and the outer <broadway> tags in Java code.

There are only two things to change from before. One is to realize that having populated a Show instance, rather than adding it to a Broadway instance, we marshal it individually and add it to the output. The second is that before marshalling, we make this call to stop an XML declaration from appearing before each <show> tag:

    // This statement suppresses the XML declaration, which
    // should appear only at the beginning of the output.
    jaxbMarshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);

What You Need To Know

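  • Handling big XML files is easy: read, parse, and write one inner tag at a time, and supply the XML declaration and the outer tag yourself.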

Next topic: Regular Expressions