ONJava.com    
 Published on ONJava.com (http://www.onjava.com/)


Parsing and Processing Large XML Documents with Digester Rules

by Eugene Kuleshov
09/01/2004

XML is commonly used for integration with third-party applications or web services, especially those running on non-Java platforms. When that code runs in a managed environment (e.g., a J2EE container) under a large number of concurrent client requests, it is very important to reduce the usage of runtime resources and to minimize the performance impact of the components that do XML processing. Of course, this must be carefully profiled, but in order to minimize memory requirements, in most cases it is not a good idea to handle XML using in-memory representations such as DOM or JDOM.

Applications based on SAX or the new StAX APIs can process documents iteratively during parsing. The SAX API is very mature, and is part of the standard JAXP API and supported by many tools and frameworks. It also allows you to chain handlers together in order to implement sophisticated transformations and processing rules.
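
To make the idea of chaining concrete, here is a minimal sketch that uses only the JDK's SAX classes. The SkipColumnsFilter and FilterChainDemo names are made up for this illustration. The filter sits between the parser and the final handler and drops every <column> element, so the handler at the end of the chain never sees it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: hides <column> elements and their content. */
final class SkipColumnsFilter extends XMLFilterImpl {
  private int depth; // greater than 0 while inside a skipped element

  SkipColumnsFilter(XMLReader parent) {
    super(parent);
  }

  @Override
  public void startElement(String uri, String local, String qName,
      Attributes atts) throws SAXException {
    if (depth > 0 || "column".equals(qName)) {
      depth++;
      return;
    }
    super.startElement(uri, local, qName, atts);
  }

  @Override
  public void endElement(String uri, String local, String qName)
      throws SAXException {
    if (depth > 0) {
      depth--;
      return;
    }
    super.endElement(uri, local, qName);
  }

  @Override
  public void characters(char[] ch, int start, int length)
      throws SAXException {
    if (depth == 0) super.characters(ch, start, length);
  }
}

final class FilterChainDemo {
  /** Returns the element names that reach the handler behind the filter. */
  static String elementsSeen(String xml) throws Exception {
    XMLReader parser =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    SkipColumnsFilter filter = new SkipColumnsFilter(parser);
    final StringBuilder seen = new StringBuilder();
    filter.setContentHandler(new DefaultHandler() {
      @Override
      public void startElement(String uri, String local, String qName,
          Attributes atts) {
        seen.append(qName).append(' ');
      }
    });
    // The filter forwards parse() to the parser and relays its events.
    filter.parse(new InputSource(new StringReader(xml)));
    return seen.toString().trim();
  }
}
```

More filters can be stacked the same way, each one taking the previous XMLReader as its parent.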

SAX is based on an event-driven model, where a parser or the previous filter in a chain calls a provided ContentHandler instance for each parsing event (such as the start or end of an element). Because of that, the ContentHandler implementation has to keep track of the current processing state itself, which makes implementations quite complex and difficult to maintain. However, the Jakarta Digester component provides an extensible ContentHandler implementation that can help to separate processing logic from the parser.
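
The following sketch shows the kind of bookkeeping a hand-written handler needs, using plain JDK SAX and no Digester. RowCollectingHandler is a hypothetical class written for this illustration. It assumes a document made of <row> elements that contain <value> elements, and it collects the rows in a list only to make the tracked state visible.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that tracks its own parsing state. */
final class RowCollectingHandler extends DefaultHandler {
  // State the handler must maintain between callbacks.
  private final StringBuilder text = new StringBuilder();
  private List<String> currentRow;
  final List<List<String>> rows = new ArrayList<List<String>>();

  @Override
  public void startElement(String uri, String local, String qName,
      Attributes atts) {
    text.setLength(0); // start collecting text for the new element
    if ("row".equals(qName)) currentRow = new ArrayList<String>();
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // Text may arrive in several chunks, so it has to be buffered.
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String local, String qName) {
    if ("value".equals(qName) && currentRow != null) {
      currentRow.add(text.toString().trim());
    } else if ("row".equals(qName)) {
      rows.add(currentRow);
      currentRow = null;
    }
  }

  /** Parses the given XML text and returns the collected rows. */
  static List<List<String>> parse(String xml) throws Exception {
    RowCollectingHandler handler = new RowCollectingHandler();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return handler.rows;
  }
}
```

Even for two element types, the handler mixes element matching, text buffering, and row handling in one class. Every new element adds another branch to each callback.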

Using Digester

Let's take a simple example. Imagine a raw-database-reporting or export/import tool that must be able to load a very large number of rows into a database from a large XML document.

The core class of the Digester framework is Digester, which implements SAX's ContentHandler and provides an internal stack. The stack can be used to store intermediate data during processing. Here is a simple DBLoader class that illustrates typical usage of Digester for loading an XML document from a given Reader.

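import java.io.Reader;
import java.sql.Connection;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.digester.Digester;
import org.apache.commons.digester.RuleSet;
import org.xml.sax.SAXException;

// DBLoaderException is an application-defined exception, not part of Digester.
public class DBLoader {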
  private Digester digester;
  
  public DBLoader( RuleSet ruleSet) {
    digester = new Digester();
    digester.addRuleSet(ruleSet);
  }
  
  public void load( Connection connection, 
      Reader reader) throws DBLoaderException {
    Map ctx = new HashMap();
    ctx.put("CONNECTION", connection);
    digester.push(ctx);
    try {
      digester.parse( reader);
      
    } catch( SAXException ex) {
      Exception ex2 = ex.getException()==null ?
          ex : ex.getException();
      throw new DBLoaderException(ex2);

    } catch( Exception ex) {
      throw new DBLoaderException(ex);

    } finally {
      digester.clear();
      
    }
  }
}

The Digester instance is initialized with a RuleSet, which defines a set of rules and their mapping to XML elements, as required by the processing logic. For a complex RuleSet, this initialization can be an expensive operation, and because rules do not keep state information, the Digester instance is configured once and then reused for multiple calls. Note, however, that a single Digester instance is not thread-safe, so it can't be used from several threads at the same time.

A map with processing context attributes is pushed onto Digester's stack before parsing. It can be pre-populated with properties required for processing, taken from the runtime info, configuration files, etc. In the example above, the JDBC connection is stored within this context. The context map is used within processing rules to capture information from an input XML file. It could also be used to store processing errors or collect processing results.

Digester automatically wraps any exceptions from the rules into SAXException (even RuntimeExceptions), so it is necessary to catch SAXException and unwrap exceptions thrown from the processing code.
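
The unwrapping step can be factored into a small helper. The sketch below is an illustration rather than part of DBLoader: the SaxExceptions class name is hypothetical, and it relies only on the standard SAXException.getException() method. Unlike the single check in load(), it also follows wrappers nested more than one level deep.

```java
import org.xml.sax.SAXException;

/** Hypothetical helper for digging out the original exception. */
final class SaxExceptions {
  /**
   * Returns the innermost exception embedded in a SAXException,
   * or the SAXException itself if it wraps nothing.
   */
  static Exception unwrap(SAXException ex) {
    Exception current = ex;
    while (current instanceof SAXException
        && ((SAXException) current).getException() != null) {
      current = ((SAXException) current).getException();
    }
    return current;
  }
}
```

With such a helper, the catch block in load() could throw new DBLoaderException(SaxExceptions.unwrap(ex)) instead of inspecting the exception inline.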

Note that the clear() method is called in the finally block in order to clean up the Digester instance and release resources, even if the XML processing has been terminated early.

Implementing RuleSet

Actual processing logic is defined in Digester's rules. A collection of rules for a particular XML format can be grouped into the RuleSet. The RuleSet must implement addRuleInstances(), which should add all rules into the Digester instance.

Usually, Digester is configured with predefined common rules or loads these rules from an XML-based configuration file. However, implementing custom rules gives finer control over XML processing.

For an illustration, we can use database layouts used by the DBUnit testing framework. One of the layouts is a traditional, normalized XML structure, defined by the following DTD:

<!ELEMENT dataset (table+)>
<!ELEMENT table (column+, row*)>
<!ATTLIST table name CDATA #REQUIRED>
<!ELEMENT column (#PCDATA)>
<!ELEMENT row (value+)>
<!ELEMENT value (#PCDATA)>

For example:

<dataset>
  <table name="TABLE1">
    <column>col1</column>
    <column>col2</column>
    <row>
      <value>1</value>
      <value>11</value>
    </row>
    <row>
      <value>2</value>
      <value>22</value>
    </row>
  </table>
</dataset>

The following DBUnitRuleSet class illustrates a RuleSet that can be used to process the XML document above. As you can see, each element is assigned a custom Rule that processes data from the corresponding part of an XML document.

public final class DBUnitRuleSet 
      extends RuleSetBase {
  public void addRuleInstances( Digester d) {
    d.addRule("dataset/table", 
       new TableRule());
    d.addRule("dataset/table/column", 
       new TableColumnRule());
    d.addRule("dataset/table/row", 
       new TableRowRule());
    d.addRule("dataset/table/row/value", 
       new TableRowValueRule());
  }
  ...

DBUnit also has a more efficient flat XML layout. Unfortunately, it can't be represented by a DTD. Here is how the same data will appear in this layout.

<dataset>
  <TABLE1 col1="1" col2="11"/>
  <TABLE1 col1="2" col2="22"/>
</dataset>

Because each table row is represented as a single element in the XML file, a single custom Rule is sufficient to handle this format. However, to match all table elements, it is necessary to use RegexRules, which supports wildcards. The DBLoader constructor should look like the following.

  ...
  public DBLoader( RuleSet ruleSet) {
    digester = new Digester();
    RegexMatcher m = new SimpleRegexMatcher();
    digester.setRules(new RegexRules(m));
    digester.addRuleSet( ruleSet);
  }
  ...

Then DBUnitFlatRuleSet can use patterns for assigning rules.

public class DBUnitFlatRuleSet 
      extends RuleSetBase {
  public void addRuleInstances( Digester d) {
    d.addRule("dataset/*", new FlatTableRule());
  }
  ...
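
The FlatTableRule implementation is part of the source download and is not reproduced here. As a rough sketch of the mapping such a rule has to perform, the hypothetical FlatRow class below turns an element name and its SAX Attributes into a table name, a column list, and a value list. It uses only JDK classes and is not the article's actual rule.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;

/** Hypothetical holder for one row of the flat layout. */
final class FlatRow {
  final String table;
  final List<String> columns = new ArrayList<String>();
  final List<String> values = new ArrayList<String>();

  /**
   * The element name is the table name, and each attribute
   * contributes one column name and one value, in document order.
   */
  FlatRow(String elementName, Attributes atts) {
    table = elementName;
    for (int i = 0; i < atts.getLength(); i++) {
      columns.add(atts.getQName(i));
      values.add(atts.getValue(i));
    }
  }
}
```

A rule built on this idea would do the conversion in its begin() callback, because the attributes are only available there, and could then issue the INSERT immediately.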

Implementing Custom Rules

Download the complete source code for DBUnitRuleSet and DBUnitFlatRuleSet, with an accompanying Maven project. Below is the implementation of the following rules from DBUnitRuleSet: TableRule, TableColumnRule, TableRowRule, and TableRowValueRule. For convenience, the concrete rules could be coded as static inner classes within RuleSet.

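Each rule may handle any combination of the callbacks defined by the Rule class: begin() for the opening tag of a matched element, body() for its text content, end() for its closing tag, and finish() once parsing is complete.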

In this example, TableRule creates a child copy of the parent context for each new table, initializes the TABLE_NAME attribute, and creates a new TABLE_COLUMNS List for column names when handling the opening <table> element. It also pops the current child context off the Digester stack at the closing </table> element.

private static class TableRule extends Rule {

  public void begin( String ns, String name, 
        Attributes att) {
    Map parentCtx = (Map) getDigester().peek();
    Map ctx = new HashMap(parentCtx);
    ctx.put("TABLE_NAME", att.getValue("name"));
    ctx.put("TABLE_COLUMNS", new ArrayList());
    ctx.put("TABLE_ROWS", new ArrayList());
    getDigester().push( ctx);
  }

  public void end( String ns, String name) {
    getDigester().pop();
  }
}
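
The child context is a shallow copy, which is worth a short illustration. In the sketch below, ContextCopyDemo is a hypothetical class and only the key names come from the rule above. The child shares the objects stored in the parent, such as the connection, while keys added to the child never leak back into the parent.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo of the parent/child context relationship. */
final class ContextCopyDemo {
  /** Creates a per-table context on top of the parent context. */
  static Map<String, Object> childOf(Map<String, Object> parent,
      String tableName) {
    // Shallow copy: entries are copied, the stored objects are shared.
    Map<String, Object> child = new HashMap<String, Object>(parent);
    child.put("TABLE_NAME", tableName);
    return child;
  }
}
```

This is why popping the child at </table> is enough to discard all table-specific state before the next table starts.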

TableColumnRule adds a single column name into the TABLE_COLUMNS List in the current context.

private static class TableColumnRule 
      extends Rule {
  public void body( String ns, String name, 
        String text) {
    Map ctx = ( Map) getDigester().peek();
    ((List) ctx.get("TABLE_COLUMNS")).add(text);
  }
}

TableRowRule initializes a TABLE_ROW List that will be used to store values for the current table row at the opening <row> element.

This rule also executes SQL to insert data from the current row when the closing </row> element is handled. This way, the entire XML document is never loaded into memory. The actual SQL is constructed in the getStatement() method.

private static class TableRowRule extends Rule {
  public void begin( String ns, String name, 
        Attributes att) {
    Map ctx = (Map) getDigester().peek();
    ctx.put("TABLE_ROW", new ArrayList());
  }

  public void end( String ns, String name) 
        throws SQLException {
    Map ctx = (Map) getDigester().peek();
    execute(ctx, getStatement(ctx));
    ctx.remove("TABLE_ROW");
  }
  
  private int execute( Map ctx, 
      PreparedStatement st) throws SQLException {
    List values = (List) ctx.get("TABLE_ROW");
    if( values.size()==0) return 0;

    for( int i = 0; i<values.size(); i++) {
      st.setObject(i+1, values.get(i));
    }
    return st.executeUpdate();
  }

  private PreparedStatement getStatement( Map ctx) 
        throws SQLException {
    List columns = (List) ctx.get("TABLE_COLUMNS");
    if(columns.size()==0) return null;

    String tableName = getTableName(ctx);
    StringBuffer sql = new StringBuffer()
        .append("INSERT INTO ")
        .append(tableName).append("(");    
    StringBuffer values = new StringBuffer("?");
    sql.append(columns.get(0));
    for( int i = 1; i<columns.size(); i++) {
      sql.append(",").append(columns.get(i));
      values.append(",?");
    }
    sql.append(") VALUES (")
       .append(values).append(")");

    Connection conn = getConnection(ctx);
    return conn.prepareStatement(sql.toString());
  }

  private Connection getConnection( Map ctx) {
    return (Connection) ctx.get("CONNECTION");
  }

  private String getTableName(Map ctx) {
    return (String) ctx.get("TABLE_NAME");
  }
}
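
To see what getStatement() produces, the SQL construction can be isolated from the JDBC calls. The InsertSql class below is a hypothetical standalone version of the same string building, written so that it can be run without a database.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical standalone version of the INSERT construction. */
final class InsertSql {
  /** Builds a parameterized INSERT with one placeholder per column. */
  static String build(String table, List<String> columns) {
    if (columns.isEmpty()) {
      throw new IllegalArgumentException("no columns");
    }
    StringBuilder names = new StringBuilder();
    StringBuilder marks = new StringBuilder();
    for (int i = 0; i < columns.size(); i++) {
      if (i > 0) {
        names.append(",");
        marks.append(",");
      }
      names.append(columns.get(i));
      marks.append("?");
    }
    return "INSERT INTO " + table + "(" + names
        + ") VALUES (" + marks + ")";
  }

  public static void main(String[] args) {
    // prints: INSERT INTO TABLE1(col1,col2) VALUES (?,?)
    System.out.println(build("TABLE1", Arrays.asList("col1", "col2")));
  }
}
```

Only the values are bound as parameters. The table and column names are concatenated into the SQL text, so the XML document has to come from a trusted source.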

TableRowValueRule collects column values for the current row from the <value> element within the TABLE_ROW List of the current context.

private static class TableRowValueRule 
      extends Rule {
  public void body( String ns, String name, 
      String text) {
    Map ctx = (Map) getDigester().peek();
    ((List) ctx.get("TABLE_ROW")).add(text);
  }
}

The code above does not cache the created PreparedStatement instances, and instead recreates them every time. This may cause some performance concerns; however, if this code is used inside of a J2EE container, a connection is obtained from the container-managed DataSource, so most likely, caching of prepared statements is being done automatically. If not, then the getStatement() method can be extended in order to save created instances of the PreparedStatement within the processing context. Also, please note that these statements must be explicitly closed at the end of processing, such as in the end() method of TableRule.
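
One possible shape for such caching is sketched below. The StatementCache class and the STATEMENT_CACHE key are assumptions made for this illustration; only the CONNECTION key comes from the code above. Statements are keyed by their SQL text and stored in the processing context.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache kept inside the processing context. */
final class StatementCache {
  private static final String KEY = "STATEMENT_CACHE"; // assumed key name

  /** Returns the statement for this SQL, preparing it on first use. */
  @SuppressWarnings("unchecked")
  static PreparedStatement get(Map<String, Object> ctx, String sql)
      throws SQLException {
    Map<String, PreparedStatement> cache =
        (Map<String, PreparedStatement>) ctx.get(KEY);
    if (cache == null) {
      cache = new HashMap<String, PreparedStatement>();
      ctx.put(KEY, cache);
    }
    PreparedStatement st = cache.get(sql);
    if (st == null) {
      Connection conn = (Connection) ctx.get("CONNECTION");
      st = conn.prepareStatement(sql);
      cache.put(sql, st);
    }
    return st;
  }

  /** Closes every cached statement and removes the cache. */
  @SuppressWarnings("unchecked")
  static void closeAll(Map<String, Object> ctx) throws SQLException {
    Map<String, PreparedStatement> cache =
        (Map<String, PreparedStatement>) ctx.remove(KEY);
    if (cache == null) return;
    for (PreparedStatement st : cache.values()) {
      st.close();
    }
  }
}
```

Because TableRule creates a fresh context for each table, a cache stored there lives exactly as long as the table is being processed, and closeAll() could be called from TableRule's end() method before the context is popped.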

Testing

For event-driven code, testing is twice as important as it is for any other application. It is not always possible to clearly observe which events will be fired by the event generator. In our case, events are generated by the SAX XML parser, so we build test data for this. It does not make much sense to test each rule independently, because they are related. On the other hand, for a first shot at an execution sequence test for DBLoader, we don't really need a database connection and can use a mocked environment. It is easy to implement such a test using the jMock dynamic mock testing framework. A mocked Connection and PreparedStatement can verify that rules are executed in an appropriate order and that they convert all data from XML. Here is a simple test suite.

public class DBLoaderTest extends TestCase {
  ...
  private static final String DBUNIT_FDATA = 
    "<dataset>\n"+
    "  <TABLE1 col1=\"1\" col2=\"11\"/>\n"+
    "  <TABLE1 col1=\"2\" col2=\"22\"/>\n"+
    "</dataset>";
  
  public static Test suite() {
    String name = DBLoaderTest.class.getName();
    TestSuite suite = new TestSuite(name);
    suite.addTest( new DBLoaderTest( 
        new DBUnitRuleSet(), DBUNIT_DATA));
    suite.addTest( new DBLoaderTest( 
        new DBUnitFlatRuleSet(), DBUNIT_FDATA));
    return suite;
  }

  
  private final RuleSet ruleSet;
  private final String xml;

  private DBLoaderTest( RuleSet ruleSet, 
      String xml) {
    super("testDBLoader");
    this.ruleSet = ruleSet;
    this.xml = xml;
  }
  
  public void testDBLoader() throws Exception {
    Mock ps = new Mock(PreparedStatement.class);

    Object[][] params = new Object[][] {
        { new Integer(1), "1"},
        { new Integer(2), "11"},
        { new Integer(1), "2"},
        { new Integer(2), "22"}};
    for( int i = 0; i<params.length; i++) {
      ps.expects(new InvokeOnceMatcher())
        .method(new IsSetter())
        .with(new IsEqual(params[i][0]), 
              new IsEqual(params[i][1]))
        .isVoid();
    }

    ps.expects(new InvokeCountMatcher(2))
      .method("executeUpdate")
      .will(new ReturnStub(new Integer(1)));
    
    Mock conn = new Mock(Connection.class);
    conn.expects(new InvokedRecorder())
        .method("prepareStatement")
        .will(new ReturnStub(ps.proxy()));
    
    Reader r = new StringReader( xml);

    DBLoader loader = new DBLoader(ruleSet);
    loader.load((Connection) conn.proxy(), r);
    
    ps.verify();
    conn.verify();
  }
  
  public String getName() {
    String name = ruleSet.getClass().getName();
    return super.getName()+" "+name;
  }
  
  
  public class IsSetter implements Constraint {

    public boolean eval( Object o) {
      return ((String) o).startsWith("set");
    }

  }
  
}

The same test case can be used to test both layouts, because the sequence of JDBC calls will be the same in both cases for the same data. The method testDBLoader() creates a Mock for PreparedStatement and sets its expectations based on the source XML structure. Expected methods are setObject()/setString() and executeUpdate(). The test method also calls verify() for all mocks after DBLoader execution to ensure that expectations are met.
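
If jMock is not available, a similar check can be approximated with the JDK's dynamic proxies. This is a different technique from the jMock expectations used above. The RecordingStatement class is a hypothetical helper that records every call made on a PreparedStatement so that a test can compare the recorded sequence afterwards.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical recorder for calls made on a PreparedStatement. */
final class RecordingStatement implements InvocationHandler {
  final List<String> calls = new ArrayList<String>();

  public Object invoke(Object proxy, Method method, Object[] args) {
    // args is null for methods without parameters.
    calls.add(method.getName()
        + (args == null ? "[]" : Arrays.toString(args)));
    // executeUpdate() must return an int, so report one updated row.
    // The other methods exercised here (the setters) return void.
    return "executeUpdate".equals(method.getName())
        ? Integer.valueOf(1) : null;
  }

  /** Creates a PreparedStatement whose calls end up in this recorder. */
  PreparedStatement proxy() {
    return (PreparedStatement) Proxy.newProxyInstance(
        RecordingStatement.class.getClassLoader(),
        new Class<?>[] { PreparedStatement.class }, this);
  }
}
```

A test would hand the proxy to the code under test and then assert on the contents of calls. This verifies the order of the JDBC calls as well, at the cost of less readable failure messages than jMock provides.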

Conclusion

As shown above, Digester can help to isolate XML processing logic in maintainable rules while retaining the advantages of stream-based XML processing. The resulting code is easy to understand and test.

Eugene Kuleshov is an independent consultant with over 15 years of experience in software design and development.


Copyright © 2009 O'Reilly Media, Inc.