Sunday, June 14, 2009

Parsing HTML using XMLUnit

XMLUnit's HTMLDocumentBuilder is also useful outside a testing environment when you want to parse HTML files into a DOM tree. XMLUnit also ships an XPath engine that makes traversing the HTML easier. Here is sample code that fetches all anchor tags that have an href attribute defined.


// content holds the HTML source to parse; the tolerant builder copes with markup that is not well-formed XML
TolerantSaxDocumentBuilder tolerantSaxDocumentBuilder = new TolerantSaxDocumentBuilder(XMLUnit.getTestParser());
HTMLDocumentBuilder htmlDocumentBuilder = new HTMLDocumentBuilder(tolerantSaxDocumentBuilder);
Document doc = htmlDocumentBuilder.parse(content);

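// getMatchingNodes returns every matching node as an org.w3c.dom.NodeList;
// evaluate would return only a single String for the expression
XpathEngine engine = XMLUnit.newXpathEngine();
NodeList anchors = engine.getMatchingNodes("/html/body//a[@href]", doc);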
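To see what the XPath expression selects, here is a minimal sketch that runs the same query with only the JDK's javax.xml.xpath API instead of XMLUnit's XpathEngine. It assumes the input is already well-formed XHTML, so the plain JDK parser can stand in for HTMLDocumentBuilder. The class AnchorHrefs and the method extractHrefs are made-up names for illustration, not part of XMLUnit.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of XMLUnit.
final class AnchorHrefs {

    private AnchorHrefs() {
    }

    // Returns the href value of every <a href="..."> under /html/body.
    static List<String> extractHrefs(String xhtml) {
        try {
            // Assumption: the input is well-formed XHTML with no namespace,
            // so the default JDK DOM parser can read it directly.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Same expression as in the post; NODESET asks for all matches.
            NodeList anchors = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("/html/body//a[@href]", doc, XPathConstants.NODESET);

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                // The expression only matches elements, so the cast is safe.
                hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"http://example.com/\">with href</a>"
                + "<a name=\"top\">no href</a>"
                + "<p><a href=\"/about\">nested</a></p>"
                + "</body></html>";
        // The anchor without an href attribute is skipped.
        System.out.println(extractHrefs(page));
    }
}
```

With XMLUnit, the same loop should work over the NodeList returned by XpathEngine.getMatchingNodes, with the Document coming from HTMLDocumentBuilder so that real-world HTML is handled too.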