Parsing and Processing Large XML Documents with Digester Rules
XML is commonly used for integration with third-party applications or web services, especially those running on non-Java platforms. When the code runs in a managed environment (e.g., a J2EE container) under a large number of concurrent client requests, it is very important to reduce the use of runtime resources and to minimize the performance impact of components that process XML. Of course, this must be profiled carefully, but to keep memory requirements low, in most cases it is not a good idea to handle XML using in-memory representations such as DOM or JDOM.
Applications based on SAX or the new StAX APIs can process documents iteratively during parsing. The SAX API is very mature, and is part of the standard JAXP API and supported by many tools and frameworks. It also allows you to chain handlers together in order to implement sophisticated transformations and processing rules.
SAX is based on an event-driven model, where a
parser or previous filter in a chain calls a provided ContentHandler
instance for each parsing event (such as the start or end of elements).
That is why the ContentHandler implementation has to keep the current
state of processing, and that makes implementation quite complex and
difficult to maintain. However, the Jakarta Digester
component provides an extendable ContentHandler implementation
that can help to separate processing logic from the parser.
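To make the state-keeping burden concrete, here is a minimal sketch of a hand-written handler that uses only the JDK's SAX API. The class name RowCollectingHandler and the <row>/<value> element names are assumptions chosen for illustration; they are not part of the article's source code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <value> inside a <row>.
class RowCollectingHandler extends DefaultHandler {
    // State that the handler has to maintain by hand between callbacks.
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;    // non-null only while inside <row>
    private StringBuilder currentText;  // non-null only while inside <value>

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("row".equals(qName)) {
            currentRow = new ArrayList<>();
        } else if ("value".equals(qName) && currentRow != null) {
            currentText = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so append.
        if (currentText != null) {
            currentText.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("value".equals(qName) && currentText != null) {
            currentRow.add(currentText.toString());
            currentText = null;
        } else if ("row".equals(qName) && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    // Convenience entry point; wraps checked exceptions for brevity.
    static List<List<String>> parse(String xml) {
        RowCollectingHandler handler = new RowCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
        return handler.rows;
    }
}
```

Even for two element types, the handler needs two pieces of mutable state and branching in every callback. Each additional element adds more of both, which is the complexity that per-element rules are meant to isolate.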
Let's take a simple example. Imagine a raw-database-reporting or export/import tool that must be able to load an arbitrarily large number of rows into a database from a large XML document.
The core class of the Digester framework is Digester, which implements
SAX's ContentHandler,
and provides an internal stack. The stack can be used to store
intermediate data during processing.
Here is a simple DBLoader class that illustrates typical
usage of Digester for loading an XML document
from a given Reader.
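import java.io.Reader;
import java.sql.Connection;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.digester.Digester;
import org.apache.commons.digester.RuleSet;
import org.xml.sax.SAXException;

public class DBLoader {
  private Digester digester;

  public DBLoader( RuleSet ruleSet) {
    digester = new Digester();
    digester.addRuleSet(ruleSet);
  }

  public void load( Connection connection,
      Reader reader) throws DBLoaderException {
    // Root processing context, visible to all rules.
    Map ctx = new HashMap();
    ctx.put("CONNECTION", connection);
    digester.push(ctx);
    try {
      digester.parse( reader);
    } catch( SAXException ex) {
      // Unwrap the exception thrown by a rule, if any.
      Exception ex2 = ex.getException()==null ?
          ex : ex.getException();
      throw new DBLoaderException(ex2);
    } catch( Exception ex) {
      throw new DBLoaderException(ex);
    } finally {
      digester.clear();
    }
  }
}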
The Digester instance is initialized with a RuleSet,
which defines a set of rules and their mapping to XML elements, as required by the
processing logic. For a complex RuleSet, this initialization could be an expensive operation, and
because rules do not keep state information, the Digester instance
is configured once and then reused for multiple calls.
Note that a single Digester instance can't be used from multiple threads.
A map with processing context attributes is pushed into Digester's stack before parsing.
It can be pre-populated with some properties required for processing, properties
from the runtime info, configuration files, etc. In the example above, the JDBC connection is stored within this context. The context map is used within processing rules to capture information
from an input XML file. It could also be used to store processing errors or
collect processing results.
Digester automatically wraps any exceptions from the rules into
SAXException (even RuntimeExceptions), so it is necessary to catch
SAXException and unwrap exceptions thrown from the processing code.
Note that the clear() method is called after processing in order
to clean up the Digester instance, and release resources if the
XML processing has been terminated.
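The unwrapping step can be tried without Digester. The following sketch uses only the JDK's SAX parser; the class name UnwrapSketch and the <bad> element are assumptions made for illustration. The handler embeds an IllegalStateException in a SAXException, which is how the text above describes Digester treating exceptions from rules, and the caller recovers the original exception with getException().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class UnwrapSketch {
    // Returns the embedded exception when there is one, otherwise the SAXException itself.
    static Exception unwrap(SAXException ex) {
        return ex.getException() == null ? ex : ex.getException();
    }

    // Parses the document with a handler that fails on <bad> (a made-up element)
    // and returns the unwrapped failure, or null if parsing succeeded.
    static Exception parseAndUnwrap(String xml) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                    Attributes atts) throws SAXException {
                if ("bad".equals(qName)) {
                    // Stand-in for a processing rule that throws.
                    throw new SAXException(new IllegalStateException("rule failed"));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return null;
        } catch (SAXException ex) {
            return unwrap(ex);
        } catch (Exception ex) {
            return ex;
        }
    }
}
```

Unwrapping keeps the application-level exception (here the IllegalStateException) visible to callers instead of burying it inside a parser exception.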
RuleSet
Actual processing logic is defined in Digester's rules. A collection of rules
for a particular XML format can be grouped into the RuleSet.
The RuleSet must implement addRuleInstances(), which
should add all rules into the Digester instance.
Usually, Digester is configured with predefined common rules or loads these rules
from an XML-based configuration file. However, to get better control of XML processing,
it is better to implement custom rules.
For an illustration, we can use database layouts used by the DBUnit testing framework. One of the layouts is a traditional, normalized XML structure, defined by the following DTD:
<!ELEMENT dataset (table+)>
<!ELEMENT table (column+, row*)>
<!ATTLIST table name CDATA #REQUIRED>
<!ELEMENT column (#PCDATA)>
<!ELEMENT row (value+)>
<!ELEMENT value (#PCDATA)>
For example:
<dataset>
<table name="TABLE1">
<column>col1</column>
<column>col2</column>
<row>
<value>1</value>
<value>11</value>
</row>
<row>
<value>2</value>
<value>22</value>
</row>
</table>
</dataset>
The following DBUnitRuleSet class illustrates a
RuleSet that can be used to process the XML document above.
As you can see, each element is assigned a custom Rule
that processes data from the corresponding part of an XML document.
public final class DBUnitRuleSet
    extends RuleSetBase {
  public void addRuleInstances( Digester d) {
    d.addRule("dataset/table",
        new TableRule());
    d.addRule("dataset/table/column",
        new TableColumnRule());
    d.addRule("dataset/table/row",
        new TableRowRule());
    d.addRule("dataset/table/row/value",
        new TableRowValueRule());
  }
  ...
DBUnit also has a more efficient flat XML layout. Unfortunately, it can't be represented by a DTD. Here is how the same data will appear in this layout.
<dataset>
<TABLE1 col1="1" col2="11"/>
<TABLE1 col1="2" col2="22"/>
</dataset>
Because each table is represented as a single element in the XML file, a single custom
Rule is sufficient to handle this format. However, to match all
table elements, it is necessary to use RegexRules that support
wildcards. The DBLoader constructor should then look like the following.
...
  public DBLoader( RuleSet ruleSet) {
    digester = new Digester();
    RegexMatcher m = new SimpleRegexMatcher();
    digester.setRules(new RegexRules(m));
    digester.addRuleSet( ruleSet);
  }
...
Then DBUnitFlatRuleSet can use patterns for assigning rules.
public class DBUnitFlatRuleSet
    extends RuleSetBase {
  public void addRuleInstances( Digester d) {
    d.addRule("dataset/*", new FlatTableRule());
  }
  ...
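FlatTableRule itself is not listed in this article. As a rough sketch of the step such a rule would need, the helper below builds the INSERT statement text from an element's name and its attributes, using the same SQL shape as the normalized layout. The class and method names (FlatInsertSketch, buildInsert) are assumptions for illustration, and the implementation in the downloadable source may differ.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

class FlatInsertSketch {
    // Builds "INSERT INTO <table>(<attr names>) VALUES (?,...)" for one flat element.
    // Returns null when the element carries no attributes, since there is nothing to insert.
    static String buildInsert(String tableName, Attributes atts) {
        if (atts.getLength() == 0) {
            return null;
        }
        StringBuilder sql = new StringBuilder("INSERT INTO ")
            .append(tableName).append("(");
        StringBuilder marks = new StringBuilder();
        for (int i = 0; i < atts.getLength(); i++) {
            if (i > 0) {
                sql.append(",");
                marks.append(",");
            }
            sql.append(atts.getQName(i)); // attribute name is the column name
            marks.append("?");            // attribute value is bound as a parameter later
        }
        return sql.append(") VALUES (").append(marks).append(")").toString();
    }

    // Small demonstration with attributes built by hand instead of by a parser.
    static String demo() {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "col1", "col1", "CDATA", "1");
        atts.addAttribute("", "col2", "col2", "CDATA", "11");
        return buildInsert("TABLE1", atts);
    }
}
```

Inside a rule's begin() method, the attribute values would then be bound with setObject() in the same index order. Note that table and column names are concatenated into the SQL text, so this approach is only suitable for trusted input files.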
Download the complete source code for DBUnitRuleSet and DBUnitFlatRuleSet, with an accompanying Maven project.
Below is the implementation
of the following rules from DBUnitRuleSet: TableRule,
TableColumnRule, TableRowRule, and TableRowValueRule.
For convenience, the concrete rules could be coded as static inner
classes within RuleSet.
Each rule may handle any combination of the begin(), body(), and end() events.
In this example, TableRule creates a child copy of the parent context
for each new table, initializes the TABLE_NAME attribute, and creates
a new TABLE_COLUMNS List for column names when handling the opening
<table> element. It also drops the current child context from
the Digester stack at the closing </table> element.
private static class TableRule extends Rule {
  public void begin( String ns, String name,
      Attributes att) {
    Map parentCtx = (Map) getDigester().peek();
    Map ctx = new HashMap(parentCtx);
    ctx.put("TABLE_NAME", att.getValue("name"));
    ctx.put("TABLE_COLUMNS", new ArrayList());
    ctx.put("TABLE_ROWS", new ArrayList());
    getDigester().push( ctx);
  }

  public void end( String ns, String name) {
    getDigester().pop();
  }
}
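The effect of copying the parent context can be shown without Digester. In the sketch below, a plain java.util.Deque stands in for Digester's internal stack, and the names ContextStackSketch, enterTable, and leaveTable are assumptions made for illustration. Because new HashMap(parent) is a shallow copy, the child sees the same CONNECTION object as the parent, while table-specific keys never leak back into the parent context.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class ContextStackSketch {
    // Mirrors the begin() step: copy the parent context and push the child.
    static Map<String, Object> enterTable(Deque<Map<String, Object>> stack, String tableName) {
        Map<String, Object> ctx = new HashMap<>(stack.peek()); // shallow copy
        ctx.put("TABLE_NAME", tableName);
        ctx.put("TABLE_COLUMNS", new ArrayList<String>());
        stack.push(ctx);
        return ctx;
    }

    // Mirrors the end() step: drop the child context.
    static void leaveTable(Deque<Map<String, Object>> stack) {
        stack.pop();
    }

    // Small demonstration: returns true if the parent context stayed untouched.
    static boolean demo() {
        Deque<Map<String, Object>> stack = new ArrayDeque<>();
        Map<String, Object> root = new HashMap<>();
        root.put("CONNECTION", new Object()); // placeholder for a JDBC connection
        stack.push(root);

        Map<String, Object> child = enterTable(stack, "TABLE1");
        boolean shared = child.get("CONNECTION") == root.get("CONNECTION");
        leaveTable(stack);

        return shared && !root.containsKey("TABLE_NAME") && stack.peek() == root;
    }
}
```

This design means every table starts from a clean slate, and nothing needs to be reset by hand when one table ends and the next begins.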
TableColumnRule adds a single column
name into the TABLE_COLUMNS List
in the current context.
private static class TableColumnRule
    extends Rule {
  public void body( String ns, String name,
      String text) {
    Map ctx = ( Map) getDigester().peek();
    ((List) ctx.get("TABLE_COLUMNS")).add(text);
  }
}
TableRowRule initializes a TABLE_ROW List
that will be used to store values for the current table row at the opening
<row> element.
This rule also executes SQL to insert data from the current row when the
closing </row> element is handled. This way, the entire XML document is
never loaded into memory. The actual SQL is constructed in the
getStatement() method.
private static class TableRowRule extends Rule {
  public void begin( String ns, String name,
      Attributes att) {
    Map ctx = (Map) getDigester().peek();
    ctx.put("TABLE_ROW", new ArrayList());
  }

  public void end( String ns, String name)
      throws SQLException {
    Map ctx = (Map) getDigester().peek();
    execute(ctx, getStatement(ctx));
    ctx.remove("TABLE_ROW");
  }

  private int execute( Map ctx,
      PreparedStatement st) throws SQLException {
    List values = (List) ctx.get("TABLE_ROW");
    if( values.size()==0) return 0;
    for( int i = 0; i<values.size(); i++) {
      st.setObject(i+1, values.get(i));
    }
    return st.executeUpdate();
  }

  private PreparedStatement getStatement( Map ctx)
      throws SQLException {
    List cols = (List) ctx.get("TABLE_COLUMNS");
    if(cols.size()==0) return null;
    String tableName = getTableName(ctx);
    // Build the column list and a matching list of "?" placeholders.
    StringBuffer sql = new StringBuffer()
        .append("INSERT INTO ")
        .append(tableName).append("(");
    StringBuffer values = new StringBuffer("?");
    sql.append(cols.get(0));
    for( int i = 1; i<cols.size(); i++) {
      sql.append(",").append(cols.get(i));
      values.append(",?");
    }
    sql.append(") VALUES (")
        .append(values).append(")");
    Connection conn = getConnection(ctx);
    return conn.prepareStatement(sql.toString());
  }

  private Connection getConnection( Map ctx) {
    return (Connection) ctx.get("CONNECTION");
  }

  private String getTableName(Map ctx) {
    return (String) ctx.get("TABLE_NAME");
  }
}
TableRowValueRule collects column values
for the current row from the <value> element
within the TABLE_ROW List of the current context.
private static class TableRowValueRule
    extends Rule {
  public void body( String ns, String name,
      String text) {
    Map ctx = (Map) getDigester().peek();
    ((List) ctx.get("TABLE_ROW")).add(text);
  }
}
The code above does not cache the created
PreparedStatement instances, and instead recreates them every time. This
may cause some performance concerns; however, if this code is used inside of a J2EE container,
a connection is obtained from the container-managed DataSource,
so most likely, caching of prepared statements is being done automatically. If not,
then the getStatement() method can be extended in order to save
created instances of the PreparedStatement within the processing
context. Also, please note that these statements must be explicitly closed at the
end of processing, such as in the end() method of
TableRule.
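One possible way to do this is sketched below. It is not taken from the article's source: the class StatementCacheSketch and the STATEMENT_CACHE context key are assumptions, while the CONNECTION key matches the one used by DBLoader. Statements are cached by SQL text inside the processing context, and closeAll() is the hook that could be called from TableRule's end() method.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

@SuppressWarnings("unchecked")
class StatementCacheSketch {
    // Hypothetical context key; the article's code does not define it.
    static final String KEY = "STATEMENT_CACHE";

    // Returns the cached statement for this SQL text, preparing it on first use.
    static PreparedStatement getStatement(Map<String, Object> ctx, String sql)
            throws SQLException {
        Map<String, PreparedStatement> cache =
            (Map<String, PreparedStatement>) ctx.get(KEY);
        if (cache == null) {
            cache = new HashMap<>();
            ctx.put(KEY, cache);
        }
        PreparedStatement st = cache.get(sql);
        if (st == null) {
            Connection conn = (Connection) ctx.get("CONNECTION");
            st = conn.prepareStatement(sql);
            cache.put(sql, st);
        }
        return st;
    }

    // Closes and forgets every cached statement in this context.
    static void closeAll(Map<String, Object> ctx) throws SQLException {
        Map<String, PreparedStatement> cache =
            (Map<String, PreparedStatement>) ctx.remove(KEY);
        if (cache == null) {
            return;
        }
        for (PreparedStatement st : cache.values()) {
            st.close();
        }
    }
}
```

Because TableRule creates a separate context for each table, a cache stored this way lives exactly as long as the table being loaded, so the statement for one table is prepared once and reused for all of its rows.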
For event-driven code, testing is twice as important as it is for any other application.
It is not always possible to clearly observe which events will be fired by the
event generator. In our case, events are generated by the SAX XML parser, so
we build test data for this. It does not make much sense to test each rule
independently, because they are related. On the other hand, for a first shot
at an execution sequence test for DBLoader, we don't
really need a database connection and can use a mocked environment. It is easy to implement such a test using the
jMock dynamic mock
testing framework. A mocked Connection and PreparedStatement can verify that
rules are executed in an appropriate order and that they convert all data from XML.
Here is a simple test suite.
public class DBLoaderTest extends TestCase {
  ...
  private static final String DBUNIT_FDATA =
      "<dataset>\n"+
      " <TABLE1 col1=\"1\" col2=\"11\"/>\n"+
      " <TABLE1 col1=\"2\" col2=\"22\"/>\n"+
      "</dataset>";

  public static Test suite() {
    String name = DBLoaderTest.class.getName();
    TestSuite suite = new TestSuite(name);
    suite.addTest( new DBLoaderTest(
        new DBUnitRuleSet(), DBUNIT_DATA));
    suite.addTest( new DBLoaderTest(
        new DBUnitFlatRuleSet(), DBUNIT_FDATA));
    return suite;
  }

  private final RuleSet ruleSet;
  private final String xml;

  private DBLoaderTest( RuleSet ruleSet,
      String xml) {
    super("testDBLoader");
    this.ruleSet = ruleSet;
    this.xml = xml;
  }

  public void testDBLoader() throws Exception {
    Mock ps = new Mock(PreparedStatement.class);
    // Expected (parameter index, value) pairs, in document order.
    Object[][] params = new Object[][] {
        { new Integer(1), "1"},
        { new Integer(2), "11"},
        { new Integer(1), "2"},
        { new Integer(2), "22"}};
    for( int i = 0; i<params.length; i++) {
      ps.expects(new InvokeOnceMatcher())
          .method(new IsSetter())
          .with(new IsEqual(params[i][0]),
              new IsEqual(params[i][1]))
          .isVoid();
    }
    ps.expects(new InvokeCountMatcher(2))
        .method("executeUpdate")
        .will(new ReturnStub(new Integer(1)));

    Mock conn = new Mock(Connection.class);
    conn.expects(new InvokedRecorder())
        .method("prepareStatement")
        .will(new ReturnStub(ps.proxy()));

    Reader r = new StringReader( xml);
    DBLoader loader = new DBLoader(ruleSet);
    loader.load((Connection) conn.proxy(), r);

    ps.verify();
    conn.verify();
  }

  public String getName() {
    String name = ruleSet.getClass().getName();
    return super.getName()+" "+name;
  }

  public class IsSetter implements Constraint {
    public boolean eval( Object o) {
      return ((String) o).startsWith("set");
    }
  }
}
The same test case can be used to test both layouts, because the sequence of
JDBC calls will be the same in both cases for the same data. The method
testDBLoader() creates a Mock for PreparedStatement
and sets its expectations based on the source XML structure. Expected methods are
setObject()/setString() and executeUpdate().
The test method also calls verify() for all mocks after DBLoader
execution to ensure that expectations are met.
As shown above, Digester can help to isolate XML processing logic in maintainable
rules and maintain the advantages of the stream-based XML processing. The code is
easy to understand and test.
Eugene Kuleshov is an independent consultant with over 15 years of experience in software design and development.
Copyright © 2009 O'Reilly Media, Inc.