Pulling my hair out over XML problem
Hi, I am trying to recurse through an XML file starting from the document root.
What I want to do at each element is find all tags below it with element name "COUNTRY" (all the COUNTRY elements are leaf nodes). I am using getElementsByTagName("COUNTRY") to help me achieve this.
So basically the document root should find the most instances of this element, and the lower I go in the XML tree, the fewer instances will be found.
Here is the code:
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder =null;
Document doc =null;
try{
docBuilder = docBuilderFactory.newDocumentBuilder();
doc = docBuilder.parse(new File("data/countryRollup.xml"));
}catch (ParserConfigurationException e){
// TODO Auto-generated catch block
e.printStackTrace();
}catch (SAXException e){
// TODO Auto-generated catch block
e.printStackTrace();
}catch (IOException e){
// TODO Auto-generated catch block
e.printStackTrace();
}
//normalize() -> remove whitespace characters/linefeeds between element tags
doc.getDocumentElement().normalize();
recurseThroughCountryRollupTree(doc.getDocumentElement());
private static void recurseThroughCountryRollupTree(Node root){
NodeList countryNodes = ((Element)root).getElementsByTagName("COUNTRY");
int length = countryNodes.getLength();
for (int i=0; i<length; i++){
System.out.println(countryNodes.item(i).getAttributes().getNamedItem("name").getNodeValue());
}
countryNodes = ((Element)root.getFirstChild()).getElementsByTagName("COUNTRY");//ERROR OCCURS HERE
length = countryNodes.getLength();
for (int i = 0; i < length; i++){
System.out.println(countryNodes.item(i).getAttributes().getNamedItem("name").getNodeValue());
}
}
Note that I commented where I get the error.
The error message I get is:
"Exception in thread "main" java.lang.ClassCastException: com.sun.org.apache.xerces.internal.dom.DeferredTextImpl cannot be cast to org.w3c.dom.Element
at ReportingTool.recurseThroughCountryRollupTree(ReportingTool.java:190)
at ReportingTool.main(ReportingTool.java:116)"
It sounds like root.getFirstChild() is actually a Text object, not an Element. Text and Element both inherit from Node, so see if you can accomplish your goal by treating everything as Nodes.
Also, if you still have questions, please post the xml document you're testing with.
Previous poster has nailed the problem, I think.
As a fix, try:
countryNodes = ((Element)root).getElementsByTagName("COUNTRY");
instead of your line:
countryNodes = ((Element)root.getFirstChild()).getElementsByTagName("COUNTRY"); //ERROR OCCURS HERE
- PKF
> It sounds like root.getFirstChild() is actually a Text object, not an Element.
It's quite common for there to be whitespace between the start tags of two elements, like this:
<continent name="Asia">
  <country name="Turkmenistan">...
In an example like that, the first child of that continent element is the whitespace text node (linefeed plus two blanks) and the second child is the country element.
So avoid making assumptions like that.
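To make that concrete, here is a minimal sketch (not the original poster's code) that parses a two-line fragment like the one above and shows that getFirstChild() hands back the whitespace Text node. The class name FirstElementChildDemo and the helpers parse and firstElementChild are made up for this illustration, and the country tag is written as an empty element so the fragment is well-formed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class FirstElementChildDemo {

    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Hypothetical helper: returns the first child that is an Element,
    // skipping Text, Comment and other node types; null if there is none.
    static Element firstElementChild(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<continent name=\"Asia\">\n  <country name=\"Turkmenistan\"/>\n</continent>";
        Element continent = parse(xml).getDocumentElement();

        // The first child is the linefeed-plus-two-blanks Text node, not the country element.
        System.out.println(continent.getFirstChild().getNodeType() == Node.TEXT_NODE); // true

        // Skipping non-Element children finds the country element safely.
        System.out.println(firstElementChild(continent).getAttribute("name")); // Turkmenistan
    }
}
```

Casting only after an instanceof (or getNodeType()) check avoids the ClassCastException from the original post.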
@PatrickFinnigan:
I used countryNodes = ((Element)root).getElementsByTagName("COUNTRY"); near the beginning of my code. Now I want to go one level down and call getElementsByTagName("COUNTRY") again, but this time with a smaller scope.
Do it recursively.
Here, I'll start you off:
ArrayList<Element> getCountryMusicErNodesRather(Element e) {
ArrayList<Element> elements = new ArrayList<Element>();
for(Element child : ...) {
elements.add(child);
elements.addAll(getC...);
}
return elements;
}
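For anyone who wants to see the recursive idea written out in full, here is one possible sketch. It is not a completion of the outline above (the elided parts there are the poster's to fill in); the names DescendantElements, parse and descendants are invented for this example, and it assumes you want every descendant Element in document order.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DescendantElements {

    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Collects every descendant Element of 'parent' in document order.
    // Non-Element children (whitespace Text nodes, comments) are skipped.
    static List<Element> descendants(Element parent) {
        List<Element> found = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element childElement = (Element) child;
                found.add(childElement);
                found.addAll(descendants(childElement)); // recurse one level down
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<GLOBAL>\n <REGION>\n  <IOT>\n   <COUNTRY name=\"Canada\"/>\n  </IOT>\n </REGION>\n</GLOBAL>";
        for (Element e : descendants(parse(xml).getDocumentElement())) {
            System.out.println(e.getTagName());
        }
    }
}
```

The node-type check is what keeps the cast to Element safe when the file is indented.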
Ok so I rewrote it recursively:
private static void recurseThroughCountryRollupTree(Node root, List<transactionObj> transactionObjsToReport) {
System.out.println("");
NodeList countryNodes = ((Element)root).getElementsByTagName("COUNTRY");
int length = countryNodes.getLength();
for (int i=0; i<length; i++){
System.out.println(countryNodes.item(i).getAttributes().getNamedItem("name").getNodeValue());
}
NodeList childNodes = ((Element)root).getChildNodes();
int length2 = childNodes.getLength();
for (int i = 0; i < length2; i++){
Node temp = childNodes.item(i);
if (temp.hasChildNodes()){
recurseThroughCountryRollupTree(temp, transactionObjsToReport);
}
}
}
and it works. My only question now is: why does int length2 = childNodes.getLength(); return 7? Shouldn't the return value be 3 (because there are 3 REGION tags)?
Here is the XML document:
<GLOBAL>
<REGION>
<IOT>
<COUNTRY name = "Canada"/>
<COUNTRY name = "USA"/>
<COUNTRY name = "Australia"/>
<COUNTRY name = "Mexico"/>
</IOT>
<IOT>
<COUNTRY name = "Belarus"/>
<COUNTRY name = "China"/>
<COUNTRY name = "Czech"/>
<COUNTRY name = "Bulgaria"/>
</IOT>
</REGION>
<REGION>
<IOT>
<COUNTRY name = "Argentina"/>
<COUNTRY name = "Brazil"/>
<COUNTRY name = "Switzerland"/>
<COUNTRY name = "Germany"/>
</IOT>
</REGION>
<REGION>
<IOT>
<COUNTRY name = "Norway"/>
<COUNTRY name = "Finland"/>
<COUNTRY name = "USA"/>
<COUNTRY name = "Iceland"/>
</IOT>
</REGION>
</GLOBAL>
Text nodes again.
<GLOBAL>
(Text node)<REGION><--(Element Node)
<IOT>
<COUNTRY name = "Canada"/>
<COUNTRY name = "USA"/>
<COUNTRY name = "Australia"/>
<COUNTRY name = "Mexico"/>
</IOT>
<IOT>
<COUNTRY name = "Belarus"/>
<COUNTRY name = "China"/>
<COUNTRY name = "Czech"/>
<COUNTRY name = "Bulgaria"/>
</IOT>
</REGION>
<REGION><--(Element Node)
(Text Node)<IOT>
<COUNTRY name = "Argentina"/>
<COUNTRY name = "Brazil"/>
<COUNTRY name = "Switzerland"/>
<COUNTRY name = "Germany"/>
</IOT>
</REGION>
<REGION><--(Element Node)
(TextNode)<IOT>
<COUNTRY name = "Norway"/>
<COUNTRY name = "Finland"/>
<COUNTRY name = "USA"/>
<COUNTRY name = "Iceland"/>
</IOT>
</REGION>
(Text Node)
</GLOBAL>
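It counts the whitespace between two tags (the newline after one tag plus the spaces that indent the next tag) as a Text node. For GLOBAL that gives 4 Text nodes around the 3 REGION elements, which is where the 7 comes from. You need to skip those (not remove them from the document, just ignore them when you walk the NodeList).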
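A small sketch of that count, assuming a cut-down GLOBAL element with three empty REGION children on separate lines; the class ChildCountDemo and the helpers parse and elementChildren are hypothetical names used only here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ChildCountDemo {

    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Hypothetical helper: returns only the direct children that are Elements.
    static List<Element> elementChildren(Node parent) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) child);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<GLOBAL>\n  <REGION/>\n  <REGION/>\n  <REGION/>\n</GLOBAL>";
        Element global = parse(xml).getDocumentElement();

        // All children: Text, REGION, Text, REGION, Text, REGION, Text.
        System.out.println(global.getChildNodes().getLength()); // 7

        // Element children only.
        System.out.println(elementChildren(global).size()); // 3
    }
}
```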
Message was edited by:
jGardner
Interesting, I thought all whitespace was taken care of by normalize(). I'll take a look into this.
Also, thanks for your help :)
I don't think normalize() behaves QUITE the way you are expecting it to.
Edit: No worries.
Message was edited by:
jGardner
From the API:
"void normalize()
Puts all Text nodes in the full depth of the sub-tree underneath this Node, including attribute nodes, into a "normal" form where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes."
(http://java.sun.com/j2se/1.4.2/docs/api/org/w3c/dom/Node.html)
So, if "there are neither adjacent Text nodes nor empty Text nodes", it looks like we can assume that there is a single Text node separating each element node (whether the Text node contains purely whitespace or not).
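Here is a rough sketch of what that description means in practice, assuming the default JDK parser with no DTD. It appends two Text nodes by hand so that there are adjacent Text nodes to merge; the class name NormalizeDemo and its helpers are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeDemo {

    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Returns the root's child count as parsed, after appending two Text nodes,
    // and after normalize().
    static int[] childCounts() {
        Document doc = parse("<a>\n  <b/>\n</a>");
        Element a = doc.getDocumentElement();
        int parsed = a.getChildNodes().getLength();      // Text, b, Text

        a.appendChild(doc.createTextNode("x"));
        a.appendChild(doc.createTextNode("y"));
        int appended = a.getChildNodes().getLength();    // Text, b, Text, "x", "y"

        a.normalize();                                   // merges the three trailing Text nodes
        int normalized = a.getChildNodes().getLength();  // Text, b, Text

        return new int[] {parsed, appended, normalized};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(childCounts())); // [3, 5, 3]
    }
}
```

So normalize() merges adjacent Text nodes, but the whitespace-only Text nodes around the b element are still there afterwards.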
> From the API:
>
> "void normalize()
> Puts all Text nodes in the full depth of
> the sub-tree underneath this Node, including
> attribute nodes, into a "normal" form where only
> structure (e.g., elements, comments, processing
> instructions, CDATA sections, and entity references)
> separates Text nodes, i.e., there are neither
> adjacent Text nodes nor empty Text nodes."
> (http://java.sun.com/j2se/1.4.2/docs/api/org/w3c/dom/Node.html)
>
> So, if "there are neither adjacent Text nodes nor
> empty Text nodes", it looks like we can assume that
> there is a single Text node separating each element
> node (whether the Text node contains purely
> whitespace or not).
I believe that is correct, unless the elements are as such:
<Element1><Element2></Element2></Element1>
Which is bad practice to do anyway, so I guess it's good to keep in mind that one must allow for text nodes.
Yeah, so it seems like a bad idea to assume the existence of Text nodes or the lack thereof. Thus, good practice would probably be to write code that will execute the same regardless of whether there are text nodes in-between element nodes or not.
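As a sketch of what "executes the same regardless" could look like for the rollup in this thread: walk only the Element children and let getElementsByTagName("COUNTRY") do the counting. The class RollupWalk and the methods parse, walk and rollup are made-up names, and the "TAG=count" output format is just an assumption for the demo.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class RollupWalk {

    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Records "TAG=count" for this element, then recurses into every child
    // Element except the COUNTRY leaves. Text nodes are never cast or visited.
    static void walk(Element e, List<String> out) {
        out.add(e.getTagName() + "=" + e.getElementsByTagName("COUNTRY").getLength());
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && !"COUNTRY".equals(n.getNodeName())) {
                walk((Element) n, out);
            }
        }
    }

    static List<String> rollup(String xml) {
        List<String> out = new ArrayList<>();
        walk(parse(xml).getDocumentElement(), out);
        return out;
    }

    public static void main(String[] args) {
        String pretty = "<GLOBAL>\n <REGION>\n  <IOT>\n   <COUNTRY name=\"Canada\"/>\n"
                + "   <COUNTRY name=\"USA\"/>\n  </IOT>\n </REGION>\n</GLOBAL>";
        String compact = "<GLOBAL><REGION><IOT><COUNTRY name=\"Canada\"/>"
                + "<COUNTRY name=\"USA\"/></IOT></REGION></GLOBAL>";

        System.out.println(rollup(pretty));                         // [GLOBAL=2, REGION=2, IOT=2]
        System.out.println(rollup(pretty).equals(rollup(compact))); // true
    }
}
```

The indented and the single-line version of the document give the same result because the walk never depends on where the Text nodes are.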