Converting XML to Spreadsheet - Problem with Accented Characters

I have a program that uses an external program to gather user account information from Active Directory. The external program is .NET, and I execute it from my Java app, which then collects the XML output and saves it in a String variable. So far so good.

Next it parses the XML and retrieves the values for each field for each user 'record' and stores these values in a String member inside of a class I created called "User".

At the end of the process, the User class data is written to a file in XML format. Still, so far so good. However, when I open that file in a text editor, any user name that contains an accented character shows an extra space character after the accented character, e.g. accented "e" becomes "e ". At least that is how it appears in the TextPad editor that I use.

Next I have a class called "XMLToExcel" that creates a spreadsheet using Apache's POI library, reads in the XML records, and writes the value of each field into a cell. Here a name with an accented character produces the message "[Fatal Error] <file name>:2617:23: Invalid byte 1 of 1-byte UTF-8 sequence."

I'm not sure what I can do to fix this, or even where to fix it. I wouldn't mind losing the accented character and replacing it with an unaccented one.
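For reference, the fallback of dropping accents could be sketched roughly as below. This is only an illustration: the class and method names are hypothetical, it relies on java.text.Normalizer (available from Java 6), and it would hide an encoding problem rather than fix it.

```java
import java.text.Normalizer;

class AccentStripper {

    // Decomposes each accented character into a base letter plus
    // combining marks (NFD), then removes the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // \u00e9 is an accented e; the escape keeps this source file pure ASCII.
        System.out.println(stripAccents("Ren\u00e9"));
    }
}
```

Note that this only works if the String already holds the correct characters; if the bytes were decoded with the wrong charset earlier, there is no accented character left to strip.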

Any ideas? I can provide code snippets if that would help.

-John Gooch

[1370 byte] By [jagoocha] at [2007-11-27 4:19:45]
# 1

Look at your source XML: what encoding is it in? An accented e is one of the Latin characters that is encoded differently in ISO-8859-1 (one byte) and UTF-8 (two bytes). I am guessing the characters are encoded in ISO-8859-1, but the XML document's header says UTF-8 (or names no encoding at all, in which case the parser assumes UTF-8).

Open your source XML in a text editor that supports Latin characters (TextPad, or Windows Notepad?) to check whether the accented characters show up there.
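To illustrate the kind of mismatch described in this reply, here is a minimal sketch using the JDK's built-in DOM parser. The class name, method names and the sample name "René" are made up for the example; the exact wording of the parser error depends on the byte values involved and may differ from the message quoted in the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingMismatchDemo {

    // Builds a small XML document whose declaration names one encoding
    // while the bytes are actually produced with another.
    static byte[] xmlBytes(String declaredEncoding, Charset actualCharset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<user><name>Ren\u00e9</name></user>";
        return xml.getBytes(actualCharset);
    }

    // Returns the text of the document, or null if the parser rejects the bytes.
    static String parseName(byte[] bytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Malformed byte sequences surface here as a parse failure.
            return null;
        }
    }

    public static void main(String[] args) {
        // ISO-8859-1 bytes labelled as UTF-8: expected to be rejected.
        System.out.println(parseName(xmlBytes("UTF-8", StandardCharsets.ISO_8859_1)));
        // ISO-8859-1 bytes labelled as ISO-8859-1: expected to parse.
        System.out.println(parseName(xmlBytes("ISO-8859-1", StandardCharsets.ISO_8859_1)));
    }
}
```

The point is that the parser trusts the declaration, so the declaration and the bytes on disk have to agree.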

yue42a at 2007-7-12 9:26:42
# 2

The text looks the same in Notepad and TextPad. However, I manually added the "encoding" attribute with the value "ISO-8859-1" to the XML declaration, and the converter no longer complains about invalid characters.

I added the encoding declaration to my application's output file, so the converter should not complain about it on the next run. Unfortunately, the names have since been fixed in Active Directory (the special character was there by mistake), so I cannot do a live test to be sure that it is fixed.
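A rough sketch of how an output routine can keep the declared encoding and the actual bytes in step is shown below. The class name, method names and element names are hypothetical and are not taken from my application; the idea is simply that the same Charset drives both the declaration and the byte conversion.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UserXmlWriter {

    // Produces the XML bytes for the given user names. The charset passed in
    // is used both for the encoding declaration and for encoding the bytes,
    // so the two cannot drift apart.
    static byte[] toXmlBytes(Charset charset, String... names) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"")
          .append(charset.name())
          .append("\"?>\n");
        sb.append("<users>\n");
        for (String name : names) {
            sb.append("  <user><name>").append(escape(name)).append("</name></user>\n");
        }
        sb.append("</users>\n");
        return sb.toString().getBytes(charset);
    }

    // Escapes the characters that are not allowed verbatim in element text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        byte[] bytes = toXmlBytes(StandardCharsets.UTF_8, "Ren\u00e9", "John");
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The resulting bytes can be written to disk as-is, for example with java.nio.file.Files.write, without going through a Writer that silently falls back to the platform default encoding.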

Thanks for the help, Duke points incoming!

jagoocha at 2007-7-12 9:26:42