What you need to make sure of with XML files is that the encoding= parameter on the first line is actually the encoding with which the file is stored on disc. If you're getting "invalid UTF-8" from SAX, that probably means you have encoding="UTF-8" but your file is stored in your machine's default locale encoding.
Try changing the first line to encoding="ISO-8859-1". If you are seeing odd-looking characters then you need to find out what the machine's default encoding actually is. On Windows, run a command prompt and type chcp. Then try cpNNN, where NNN is what comes back.
> Does anyone know how I can identify and remove invalid
> UTF-8 characters like ? from a file/string?
>
> The actual problem is that because of these characters my
> SAX parser is throwing an exception. I want to remove
> them from the file before giving it to the SAX parser.
I am still not sure what the question is, since the character mentioned in the question is a valid UTF-8 character. Let me put it this way: Java Strings are always sequences of UTF-16 code units (often loosely called UCS-2). It is only byte arrays that are encoded using an encoding like ISO-8859-1 or UTF-8 or something else.
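To make the String versus byte array distinction concrete, here is a minimal sketch. The class and method names are just made up for illustration. It shows the same String turning into different byte sequences depending on the encoding you ask for.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
final class StringVersusBytes {
    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the text occupies when encoded as ISO-8859-1. */
    static int latin1Length(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        // "café", written with an escape so the encoding of this source file does not matter.
        String text = "caf\u00e9";
        System.out.println(text.length());      // 4 chars
        System.out.println(utf8Length(text));   // 5 bytes: é takes two bytes in UTF-8
        System.out.println(latin1Length(text)); // 4 bytes: é is the single byte 0xE9
    }
}
```

The String itself has no "file encoding". An encoding only comes into play at the moment bytes are produced or consumed.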
So please restate your question and what you want to accomplish.
> What you need to make sure of with XML files is that
> the encoding= parameter on the first line is actually
> the encoding with which the file is stored on disc.
> If you're getting "invalid UTF-8" from SAX that
> probably means you have encoding="UTF-8" but your file
> is stored in your machine's default locale encoding.
That should not result in a SAXException but rather lost data characters.
>
> Try changing the first line to encoding="ISO-8859-1".
> If you are seeing odd-looking characters then you
> need to find out what the machine's default encoding
> actually is. On Windows run a command prompt and type
> chcp. Then try cpNNN where NNN is what comes back.
Again, if the encoding for the XML is changed to 8859_1, any data in a language other than the Western European languages is going to be replaced by question marks and lost.
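A small sketch of where those question marks come from, assuming the text is being written out (encoded) as ISO-8859-1. The class and method names are hypothetical. String.getBytes substitutes the charset's replacement byte, which is '?' for ISO-8859-1, for any character the charset cannot represent.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
final class Latin1Loss {
    /** Encodes to ISO-8859-1 and decodes again, as a round trip through a Latin-1 file would. */
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** True if every character of the text can be represented in ISO-8859-1. */
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("\u00e9"));    // é survives
        System.out.println(roundTrip("\u0416"));    // Cyrillic Ж comes back as ?
        System.out.println(fitsInLatin1("\u0416")); // false
    }
}
```

Checking with canEncode before writing is one way to find out up front whether a given encoding would lose anything.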
> > If you're getting "invalid UTF-8" from SAX that
> > probably means you have encoding="UTF-8" but your file
> > is stored in your machine's default locale encoding.
>
> That should not result in a SAXException but rather
> lost data characters.
>
No, because not all sequences of bytes are valid UTF-8 sequences, but all sequences of bytes are valid in ISO-8859-1. So if the file is actually encoded in ISO-8859-1 and contains characters above 127, then it may be invalid UTF-8, not merely come out as incorrect characters if treated as UTF-8.
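You can check this for yourself without involving a parser at all. The following is a rough sketch with a made-up class name. It uses a strict CharsetDecoder that reports malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
final class Utf8Check {
    /** Returns true only if the whole byte array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on the first bad sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));      // true
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1))); // false: a lone 0xE9 byte
    }
}
```

Running a check like this over the raw file bytes tells you whether an encoding="UTF-8" declaration can possibly be right.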
> > > If you're getting "invalid UTF-8" from SAX that
> > > probably means you have encoding="UTF-8" but your file
> > > is stored in your machine's default locale encoding.
> >
> > That should not result in a SAXException but rather
> > lost data characters.
>
> No, because not all sequences of bytes are valid
> UTF-8 sequences, but all sequences of bytes are valid
> in ISO-8859-1. So if the file is actually encoded in
> ISO-8859-1 and contains characters above 127 then it may be
> invalid UTF-8, not merely come out as incorrect
> characters if treated as UTF-8.
Yep, I have recently done some character encoding unit tests with SAXParser. If the XML file is saved with UTF-8 encoding but has encoding="ISO-8859-1" in the XML declaration, the SAXParser will read the file as ISO-8859-1, producing garbage characters for all non-ASCII characters. If the XML file is saved with ISO-8859-1 encoding but has encoding="UTF-8" in the declaration, then a SAXException is thrown.
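For anyone who wants to reproduce the two cases, here is a sketch along those lines. It is not the original unit tests. The class name, method names and the tiny test document are all made up, and it assumes the default JAXP parser shipped with the JDK. Depending on the parser version, the failure in the second case may surface as an IOException subclass rather than a SAXException, so the sketch catches both.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical test harness, only for illustration.
final class DeclaredVersusActual {
    /** Builds a tiny document that declares one encoding but is stored in another. */
    static byte[] document(String declared, Charset actual) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declared + "\"?>"
                + "<word>caf\u00e9</word>";
        return xml.getBytes(actual);
    }

    /** Parses the bytes and returns the element text, or throws whatever the parser throws. */
    static String parseText(byte[] bytes)
            throws ParserConfigurationException, SAXException, IOException {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stored as UTF-8, declared as ISO-8859-1: parses, but é arrives as two garbage characters.
        System.out.println(parseText(document("ISO-8859-1", StandardCharsets.UTF_8)));

        // Stored as ISO-8859-1, declared as UTF-8: the parser rejects the byte sequence.
        try {
            System.out.println(parseText(document("UTF-8", StandardCharsets.ISO_8859_1)));
        } catch (SAXException | IOException e) {
            System.out.println("Parse failed: " + e);
        }
    }
}
```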
> > Try changing the first line to encoding="ISO-8859-1".
> > If you are seeing odd-looking characters then you
> > need to find out what the machine's default encoding
> > actually is. On Windows run a command prompt and type
> > chcp. Then try cpNNN where NNN is what comes back.
>
> Again if the encoding for the XML is changed to
> 8859_1 any data in a language other than Western
> European languages is going to be replaced by
> question marks and lost
But the point is that the file probably is in a single-byte encoding, and doesn't contain any characters outside that charset. If you don't have anything else to guide you, your best bet is to try the system default encoding for the machine you're working on. In the US, the most likely candidates are ISO-8859-1 and cp1252.
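If editing the first line of the file is not an option, one alternative is to decode the bytes yourself and hand the parser a character stream. When a SAX InputSource carries a Reader, the parser reads the characters directly and disregards the declared encoding. The following is only a sketch with made-up names, and the candidate list is a guess that you would adjust for your machine. Note that on JDK 18 and later Charset.defaultCharset() is UTF-8 unless configured otherwise, so it may not reflect the OS locale.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, only for illustration.
final class ForcedEncodingParse {
    /** Parses the bytes as if they were stored in the assumed charset, whatever the file declares. */
    static String parseAs(byte[] bytes, Charset assumed)
            throws ParserConfigurationException, SAXException, IOException {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), assumed)) {
            // A character stream takes precedence over the encoding named in the declaration.
            SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(reader), handler);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // A file that claims UTF-8 but was really saved in a single-byte encoding.
        byte[] bytes = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>caf\u00e9</word>"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("JVM default charset: " + Charset.defaultCharset());
        Charset[] candidates = { StandardCharsets.ISO_8859_1, Charset.forName("windows-1252") };
        for (Charset candidate : candidates) {
            System.out.println(candidate + ": " + parseAs(bytes, candidate));
        }
    }
}
```

A single-byte decoder will not raise an error for a wrong guess, so you still have to look at the output to judge whether the characters make sense.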