Pulling my hair out over XML problem

Hi, I am trying to recurse through an XML file starting from the document root.

What I want to do at each element is find all tags below it with the element name "COUNTRY" (all the COUNTRY elements are leaf nodes). I am using getElementsByTagName("COUNTRY") to help me achieve this.

So basically the document root should find the most instances of this element, and the lower I go in the XML tree, the fewer instances will be found.

Here is the code:

DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = null;
Document doc = null;
try {
    docBuilder = docBuilderFactory.newDocumentBuilder();
    doc = docBuilder.parse(new File("data/countryRollup.xml"));
} catch (ParserConfigurationException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (SAXException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

//normalize() -> remove whitespace characters/linefeeds between element tags
Element docRoot = doc.getDocumentElement();
docRoot.normalize();
recurseThroughCountryRollupTree(docRoot);

private static void recurseThroughCountryRollupTree(Node root) {
    NodeList countryNodes = ((Element) root).getElementsByTagName("COUNTRY");
    int length = countryNodes.getLength();
    for (int i = 0; i < length; i++) {
        System.out.println(countryNodes.item(i).getAttributes().getNamedItem("name").getNodeValue());
    }

    countryNodes = ((Element) root.getFirstChild()).getElementsByTagName("COUNTRY"); //ERROR OCCURS HERE
    length = countryNodes.getLength();
    for (int i = 0; i < length; i++) {
        System.out.println(countryNodes.item(i).getAttributes().getNamedItem("name").getNodeValue());
    }
}

Note that I commented where I get the error.

The error message I get is:

"Exception in thread "main" java.lang.ClassCastException: com.sun.org.apache.xerces.internal.dom.DeferredTextImpl cannot be cast to org.w3c.dom.Element

at ReportingTool.recurseThroughCountryRollupTree(ReportingTool.java:190)

at ReportingTool.main(ReportingTool.java:116)"

[3452 byte] By [jellystonesa] at [2007-11-27 11:08:03]
# 1

It sounds like root.getFirstChild() is actually a Text object, not an Element. Text and Element both inherit from Node, so see if you can accomplish your goal by treating everything as Nodes.
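A minimal sketch of that idea, assuming the standard JAXP parser that ships with the JDK: walk the children as plain Nodes and only cast once getNodeType() says the node is an Element. The class name and the helper methods (firstChildElement, parse, firstChildElementName) are made up for illustration and are not part of the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FirstChildElementDemo {

    // Hypothetical helper: walks the children as plain Nodes and returns the
    // first one that really is an Element, or null if there is none.
    static Element firstChildElement(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n; // safe: the node type was checked first
            }
        }
        return null;
    }

    // Parses an XML string; checked exceptions are wrapped to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Tag name of the root's first child element, or null if it has none.
    static String firstChildElementName(String xml) {
        Element first = firstChildElement(parse(xml).getDocumentElement());
        return first == null ? null : first.getTagName();
    }

    public static void main(String[] args) {
        String xml = "<GLOBAL>\n  <REGION>\n    <COUNTRY name=\"Canada\"/>\n  </REGION>\n</GLOBAL>";
        System.out.println(firstChildElementName(xml)); // prints REGION
    }
}
```

With a helper like this, the failing line could call firstChildElement(root) instead of casting root.getFirstChild() directly, and handle the null case when there is no child element.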

hunter9000a at 2007-7-29 13:26:10 > top of Java-index,Java Essentials,New To Java...
# 2

Also, if you still have questions, please post the xml document you're testing with.

hunter9000a at 2007-7-29 13:26:10
# 3

The previous poster has nailed the problem, I think.

As a fix, try:

countryNodes = ((Element)root).getElementsByTagName("COUNTRY");

instead of your line:

countryNodes = ((Element)root.getFirstChild()).getElementsByTagName("COUNTRY"); //ERROR OCCURS HERE

- PKF

PatrickFinnigana at 2007-7-29 13:26:10
# 4

> It sounds like root.getFirstChild() is actually a Text object, not an Element.

It's quite common for there to be whitespace between the start tags of two elements, like this:

<continent name="Asia">
  <country name="Turkmenistan">...

In an example like that, the first child of that continent element is the whitespace text node (linefeed plus two blanks) and the second child is the country element.

So avoid making assumptions like that.
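To see this for yourself, here is a small sketch (assuming the default JDK parser settings, which keep whitespace between tags) that lists the node names of an element's direct children. The class and method names are invented for this example, and the XML is a self-closing variant of the snippet above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChildNodeNamesDemo {

    // Returns getNodeName() for every direct child of the document root.
    // Text nodes report the fixed name "#text".
    static List<String> childNodeNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> names = new ArrayList<String>();
            for (int i = 0; i < children.getLength(); i++) {
                names.add(children.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<continent name=\"Asia\">\n  <country name=\"Turkmenistan\"/>\n</continent>";
        System.out.println(childNodeNames(xml)); // prints [#text, country, #text]
    }
}
```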

DrClapa at 2007-7-29 13:26:10
# 5

@PatrickFinnigan:

I used countryNodes = ((Element)root).getElementsByTagName("COUNTRY"); near the beginning of my code. Now I want to go one level down and call getElementsByTagName("COUNTRY") again, but this time with a smaller scope.

jellystonesa at 2007-7-29 13:26:10
# 6

Do it recursively.

Here, I'll start you off:

ArrayList<Element> getCountryMusicErNodesRather(Element e) {
    ArrayList<Element> elements = new ArrayList<Element>();
    // "child" rather than "e", so the loop variable does not clash with the parameter
    for (Element child : ...) {
        elements.add(child);
        elements.addAll(getC...);
    }
    return elements;
}

jGardnera at 2007-7-29 13:26:10
# 7

Ok so I rewrote it recursively:

private static void recurseThroughCountryRollupTree(Node root, List<transactionObj> transactionObjsToReport) {
    System.out.println("");

    // Print every COUNTRY found anywhere below the current node.
    NodeList countryNodes = ((Element) root).getElementsByTagName("COUNTRY");
    int length = countryNodes.getLength();
    for (int i = 0; i < length; i++) {
        System.out.println(countryNodes.item(i).getAttributes().getNamedItem("name").getNodeValue());
    }

    // Recurse into every child that has children of its own.
    NodeList childNodes = ((Element) root).getChildNodes();
    int length2 = childNodes.getLength();
    for (int i = 0; i < length2; i++) {
        Node temp = childNodes.item(i);
        if (temp.hasChildNodes()) {
            recurseThroughCountryRollupTree(temp, transactionObjsToReport);
        }
    }
}

and it works. My only question now is: why does int length2 = childNodes.getLength(); return 7? Shouldn't the return value be 3 (because there are 3 REGION tags)?

Here is the XML document:

<GLOBAL>
  <REGION>
    <IOT>
      <COUNTRY name = "Canada"/>
      <COUNTRY name = "USA"/>
      <COUNTRY name = "Australia"/>
      <COUNTRY name = "Mexico"/>
    </IOT>
    <IOT>
      <COUNTRY name = "Belarus"/>
      <COUNTRY name = "China"/>
      <COUNTRY name = "Czech"/>
      <COUNTRY name = "Bulgaria"/>
    </IOT>
  </REGION>
  <REGION>
    <IOT>
      <COUNTRY name = "Argentina"/>
      <COUNTRY name = "Brazil"/>
      <COUNTRY name = "Switzerland"/>
      <COUNTRY name = "Germany"/>
    </IOT>
  </REGION>
  <REGION>
    <IOT>
      <COUNTRY name = "Norway"/>
      <COUNTRY name = "Finland"/>
      <COUNTRY name = "USA"/>
      <COUNTRY name = "Iceland"/>
    </IOT>
  </REGION>
</GLOBAL>

jellystonesa at 2007-7-29 13:26:10
# 8

Text nodes again.

<GLOBAL>
(Text node)
  <REGION>   <-- (Element node)
    <IOT>
      <COUNTRY name = "Canada"/>
      <COUNTRY name = "USA"/>
      <COUNTRY name = "Australia"/>
      <COUNTRY name = "Mexico"/>
    </IOT>
    <IOT>
      <COUNTRY name = "Belarus"/>
      <COUNTRY name = "China"/>
      <COUNTRY name = "Czech"/>
      <COUNTRY name = "Bulgaria"/>
    </IOT>
  </REGION>
(Text node)
  <REGION>   <-- (Element node)
    <IOT>
      <COUNTRY name = "Argentina"/>
      <COUNTRY name = "Brazil"/>
      <COUNTRY name = "Switzerland"/>
      <COUNTRY name = "Germany"/>
    </IOT>
  </REGION>
(Text node)
  <REGION>   <-- (Element node)
    <IOT>
      <COUNTRY name = "Norway"/>
      <COUNTRY name = "Finland"/>
      <COUNTRY name = "USA"/>
      <COUNTRY name = "Iceland"/>
    </IOT>
  </REGION>
(Text node)
</GLOBAL>

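The parser turns every run of whitespace between tags (the newline after a tag plus the spaces that indent the next tag) into a "Text" node. GLOBAL's 7 children are the 3 REGION elements plus the 4 whitespace Text nodes around them. You need to skip those (not by removing them from the document, but by filtering them out of the NodeList when you walk it).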
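One way to do that filtering, as a rough sketch: keep only the children whose node type is ELEMENT_NODE. The class name and the helpers elementChildren, allChildCount and elementChildCount are hypothetical, and the XML is a cut-down stand-in for the real file (three empty REGION elements, one per line).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementChildrenDemo {

    // Hypothetical helper: the direct children of parent that are Elements,
    // with Text nodes (and comments etc.) left out.
    static List<Element> elementChildren(Node parent) {
        List<Element> result = new ArrayList<Element>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) child);
            }
        }
        return result;
    }

    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Number of direct children of the root, Text nodes included.
    static int allChildCount(String xml) {
        return parseRoot(xml).getChildNodes().getLength();
    }

    // Number of direct children of the root that are Elements.
    static int elementChildCount(String xml) {
        return elementChildren(parseRoot(xml)).size();
    }

    public static void main(String[] args) {
        String xml = "<GLOBAL>\n  <REGION/>\n  <REGION/>\n  <REGION/>\n</GLOBAL>";
        System.out.println(allChildCount(xml));     // prints 7
        System.out.println(elementChildCount(xml)); // prints 3
    }
}
```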

jGardnera at 2007-7-29 13:26:10
# 9

Interesting. I thought all whitespace was taken care of by normalize(). I'll take a look into this.

jellystonesa at 2007-7-29 13:26:10
# 10

Also, thanks for your help :)

jellystonesa at 2007-7-29 13:26:10
# 11

I don't think normalize() behaves QUITE the way you are expecting it to.

Edit: No worries.

jGardnera at 2007-7-29 13:26:10
# 12

From the API:

"void normalize()

Puts all Text nodes in the full depth of the sub-tree underneath this Node, including attribute nodes, into a "normal" form where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes."

(http://java.sun.com/j2se/1.4.2/docs/api/org/w3c/dom/Node.html)

So, if "there are neither adjacent Text nodes nor empty Text nodes", it looks like we can assume that there is a single Text node separating each element node (whether the Text node contains purely whitespace or not).
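A quick experiment along those lines, offered as a sketch rather than a definitive test. It assumes the default JDK DOM implementation; the class and method names are invented. Text nodes added by hand collapse into one after normalize(), while the whitespace-only Text nodes that come out of the parser are left alone, because they are neither adjacent to each other nor empty.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NormalizeDemo {

    // Creates a parser with default settings; checked exceptions are wrapped.
    static DocumentBuilder newBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException("could not create parser", e);
        }
    }

    // Builds <ROOT> with three Text children ("a", "b", "") by hand,
    // normalizes it, and returns how many children are left.
    static int handBuiltChildCountAfterNormalize() {
        Document doc = newBuilder().newDocument();
        Element root = doc.createElement("ROOT");
        doc.appendChild(root);
        root.appendChild(doc.createTextNode("a"));
        root.appendChild(doc.createTextNode("b"));
        root.appendChild(doc.createTextNode(""));
        root.normalize(); // merges "a" and "b", drops the empty node
        return root.getChildNodes().getLength();
    }

    // Parses xml, normalizes the root, and returns its number of children.
    static int parsedChildCountAfterNormalize(String xml) {
        try {
            Document doc = newBuilder().parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            root.normalize(); // whitespace-only Text nodes are not removed
            return root.getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<GLOBAL>\n  <REGION/>\n  <REGION/>\n  <REGION/>\n</GLOBAL>";
        System.out.println(handBuiltChildCountAfterNormalize());    // prints 1
        System.out.println(parsedChildCountAfterNormalize(xml));    // prints 7
    }
}
```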

PatrickFinnigana at 2007-7-29 13:26:10
# 13

> From the API:
>
> "void normalize()
> Puts all Text nodes in the full depth of the sub-tree underneath this Node, including attribute nodes, into a "normal" form where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes."
> (http://java.sun.com/j2se/1.4.2/docs/api/org/w3c/dom/Node.html)
>
> So, if "there are neither adjacent Text nodes nor empty Text nodes", it looks like we can assume that there is a single Text node separating each element node (whether the Text node contains purely whitespace or not).

I believe that is correct, unless the elements are written like this:

<Element1><Element2></Element2></Element1>

That is bad practice anyway, so I guess it's good to keep in mind that one must allow for Text nodes.

jGardnera at 2007-7-29 13:26:10
# 14

Yeah, so it seems like a bad idea to assume the existence of Text nodes or the lack thereof. Thus, good practice would probably be to write code that will execute the same regardless of whether there are text nodes in-between element nodes or not.
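As an illustration of that practice, here is a sketch of a recursive walk that only ever descends into Element children, so it should produce the same report whether or not the file is indented. The class name, the report format ("TAG=count") and the helper methods are all assumptions made for this example, and the XML is a shortened version of the file posted above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class CountryRollupWalk {

    // Records "TAG=n" for this element, where n is the number of COUNTRY
    // descendants, then recurses into child Elements only. Text nodes are
    // skipped, so whitespace between tags has no effect.
    static void walk(Element element, List<String> report) {
        int countries = element.getElementsByTagName("COUNTRY").getLength();
        report.add(element.getTagName() + "=" + countries);
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                walk((Element) n, report);
            }
        }
    }

    // Parses xml and returns the report for the whole tree.
    static List<String> report(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> lines = new ArrayList<String>();
            walk(root, lines);
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String compact = "<GLOBAL><REGION><IOT><COUNTRY name=\"Canada\"/>"
                + "<COUNTRY name=\"USA\"/></IOT></REGION></GLOBAL>";
        String indented = "<GLOBAL>\n  <REGION>\n    <IOT>\n      <COUNTRY name=\"Canada\"/>\n"
                + "      <COUNTRY name=\"USA\"/>\n    </IOT>\n  </REGION>\n</GLOBAL>";
        System.out.println(report(compact));  // prints [GLOBAL=2, REGION=2, IOT=2, COUNTRY=0, COUNTRY=0]
        System.out.println(report(compact).equals(report(indented))); // prints true
    }
}
```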

PatrickFinnigana at 2007-7-29 13:26:10