Monday, 29 May 2017

Java: Read Large XML by Using StAX

The following code shows how to read an XML file in Java. It uses the StAX API, which reads XML files sequentially. If reading a large XML file gives you an OutOfMemoryError, the code below should solve the problem. Because it reads the file sequentially instead of loading the whole document into memory, it can process very large XML files, such as 10 GB or 20 GB. In that sense, it is a scalable solution!

Problem


From the following XML file, get the "id" and the first "thetext" of each item. This is related to the question of how to get the first elements of an XML file.

In the following section, I will give the complete code to parse the XML file with StAX and explain it briefly.


<?xml version="1.0" encoding="UTF-8"?>
<config>
<item id="1">
<mode>1</mode>
<long_desc isprivate="0">
<who name="Andy">andy@ch.ibm.com</who>
<bug_when>2001-10-10 21:34:46 -0400</bug_when>
<thetext> Setup a project</thetext>
</long_desc>

<long_desc isprivate="0">
<who name="Mike">mike@ch.ibm.com</who>
<bug_when>2001-10-10 21:34:46 -0400</bug_when>
<thetext>- Setup</thetext>
</long_desc>
<long_desc isprivate="0">
<who name="Gary">gary@ch.ibm.com</who>
<bug_when>2001-10-10 21:34:46 -0400</bug_when>
<thetext>project</thetext>
</long_desc>

</item>

<item id="2">
<mode>2</mode>
<long_desc isprivate="0">
<who name="John">john@ch.ibm.com</who>
<bug_when>2001-10-10 21:34:46 -0400</bug_when>
<thetext> Setup a project</thetext>
</long_desc>

<long_desc isprivate="0">
<who name="Bill">bill@ch.ibm.com</who>
<bug_when>2001-10-10 21:34:46 -0400</bug_when>
<thetext>- Setup</thetext>
</long_desc>

<long_desc isprivate="0">
<who name="Rick">rick@ch.ibm.com</who>
<bug_when>2001-10-10 21:34:46 -0400</bug_when>
<thetext>project</thetext>
</long_desc>

</item>
</config>

Solution


Complete Code:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Remembers the first "thetext" seen inside the current item
class Item {
    private String firstText = null;

    public void setFirstText(String str) {
        firstText = str;
    }

    // Returns null until a first "thetext" has been stored
    public String getFirstText() {
        return firstText;
    }
}

public class Main {
    public static void main(String[] args) throws IOException,
            XMLStreamException, FactoryConfigurationError {
        // First create a new XMLInputFactory
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();

        // Deliver the text of an element as one Characters event
        inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

        // Set up a new eventReader; try-with-resources closes the file
        try (InputStream in = new FileInputStream("/usa/xiwang/Desktop/config")) {
            XMLEventReader eventReader = inputFactory.createXMLEventReader(in);

            Item item = null;

            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.nextEvent();

                if (event.isStartElement()) {
                    StartElement startElement = event.asStartElement();
                    String name = startElement.getName().getLocalPart();

                    // reach the start of an item
                    if (name.equals("item")) {
                        item = new Item();
                        System.out.println("--start of an item");
                        // attribute
                        Iterator<Attribute> attributes = startElement.getAttributes();
                        while (attributes.hasNext()) {
                            Attribute attribute = attributes.next();
                            if (attribute.getName().getLocalPart().equals("id")) {
                                System.out.println("id = " + attribute.getValue());
                            }
                        }
                    }

                    // data: only the first "thetext" of the current item
                    if (name.equals("thetext") && item != null
                            && item.getFirstText() == null) {
                        event = eventReader.nextEvent();
                        // an empty element has no Characters event
                        String text = event.isCharacters()
                                ? event.asCharacters().getData() : "";
                        System.out.println("thetext: " + text);
                        item.setFirstText(text);
                        continue;
                    }
                }

                // reach the end of an item
                if (event.isEndElement()) {
                    EndElement endElement = event.asEndElement();
                    // compare strings with equals(), not ==
                    if (endElement.getName().getLocalPart().equals("item")) {
                        System.out.println("--end of an item\n");
                        item = null;
                    }
                }
            }
            eventReader.close();
        }
    }
}

The way to get only the first "thetext" content is to create an object when the start of an "item" element is read, and to assign the "thetext" content to one of its members only while that member is still empty. This makes sure that only the first "thetext" data is stored.
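
The same "keep only the first value" idea can also be tried without a file on disk. The sketch below is a hypothetical variation, not the program above: the class name FirstTextDemo, the method firstTexts, and the flag textSeen are made up for illustration, and it assumes that each "thetext" element contains plain text only. It parses XML from a Reader and returns a map from each item's id to its first "thetext".

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper for illustration; not part of the original program.
class FirstTextDemo {

    // Returns item id -> first "thetext" of that item, in document order.
    static Map<String, String> firstTexts(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Deliver the text of an element as one Characters event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = factory.createXMLEventReader(xml);

        Map<String, String> result = new LinkedHashMap<>();
        String currentId = null;   // id of the item being read, null outside an item
        boolean textSeen = false;  // true once the current item's first thetext is stored

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                String name = start.getName().getLocalPart();
                if (name.equals("item")) {
                    Attribute id = start.getAttributeByName(new QName("id"));
                    currentId = (id == null) ? "" : id.getValue();
                    textSeen = false;
                } else if (name.equals("thetext") && currentId != null && !textSeen) {
                    // The event after the start tag carries the text, if there is any.
                    XMLEvent next = reader.nextEvent();
                    String text = next.isCharacters() ? next.asCharacters().getData() : "";
                    result.put(currentId, text);
                    textSeen = true;
                }
            } else if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals("item")) {
                currentId = null;
            }
        }
        reader.close();
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<config>"
                + "<item id=\"1\"><long_desc><thetext>first</thetext></long_desc>"
                + "<long_desc><thetext>second</thetext></long_desc></item>"
                + "<item id=\"2\"><long_desc><thetext>third</thetext></long_desc></item>"
                + "</config>";
        System.out.println(firstTexts(new StringReader(xml)));
    }
}
```

Run on the small two-item string in main, this prints {1=first, 2=third}. A boolean flag replaces the Item object here only to keep the sketch short; the rule for skipping later "thetext" elements is the same.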

In brief, the StAX API is convenient to use, but it still takes some time to understand how the event-driven model works and why it uses less memory.
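
To get a feel for why this style of parsing needs little memory, here is a minimal sketch that uses the other half of StAX, the cursor API (XMLStreamReader), instead of the event reader used above. The class name ItemCounter and the method countItems are hypothetical. The loop looks at one parse event at a time and keeps only a counter, so no document tree is ever built.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
class ItemCounter {

    // Counts <item> elements without building a tree of the document.
    static int countItems(Reader xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() moves the cursor forward; this code keeps nothing from earlier events
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("item")) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<config><item id=\"1\"/><item id=\"2\"/><item id=\"3\"/></config>";
        System.out.println(countItems(new StringReader(xml)));
    }
}
```

The only state that grows with this code is the integer counter, which is why the same loop should behave the same way on a file of a few kilobytes or many gigabytes; the parser's own internal buffer is an implementation detail and is not measured here.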