How to change file format from UTF8 to ASCII?

I have the source file located at D:\Converter\srcFile.dat

The format of srcFile.dat is UTF8, and it contains some Chinese characters.

I need to change the file format of srcFile.dat from UTF8 to ASCII, and the Chinese characters should not be changed into anything else (like \u0089). Also, a new file, desFile.dat (D:\Converter\desFile.dat), should be generated.

Can anyone tell me how to do this? It is very urgent...

Thx a lot.

[512 byte] By [Kevin.MCUa] at [2007-11-26 19:50:07]
# 1
native2ascii?
suparenoa at 2007-7-9 22:39:14 > top of Java-index,Java Essentials,Java Programming...
# 2
NEVER, because native2ascii changes those Chinese characters into a format like \uXXXX (where each X stands for a hexadecimal digit). Plz help~~~!!!
Kevin.MCUa at 2007-7-9 22:39:14
# 3
Chinese characters cannot be represented in ASCII, so you need UTF-8 or some other character encoding that can. What is your real problem?
glevnera at 2007-7-9 22:39:14
# 4
I can use UltraEdit to convert from UTF8 to ASCII, and the Chinese characters are not changed; they display normally. It is just like opening srcFile.dat (UTF8 file type) with NotePad, then saving it as desFile.dat with the ANSI file type...
Kevin.MCUa at 2007-7-9 22:39:14
# 5

ASCII doesn't include any encodings for Chinese characters. What you are asking for is, therefore, impossible.

It may well be that you want to do whatever UltraEdit is doing, but whatever that is it's not translating into an ASCII encoding.

So you're going to have to define more precisely what it is that you do actually want to do.

dcmintera at 2007-7-9 22:39:14
# 6
I think there are 8-bit encodings that represent some Chinese characters, big8 (?). Actually ASCII is a 7-bit encoding; anything that defines characters for 0x80 - 0xff is an extended ASCII.
malcolmmca at 2007-7-9 22:39:14
# 7

> I think there are 8-bit encodings that represent some

> Chinese character big8 (?). Actually ASCII it a 7-bit

> encoding, anything the defines characters for 0x80 -

> 0xff is an extended ascii.

As far as I know there are no 8-bit single-byte encodings for Japanese, Chinese, or Korean.

There are multi-byte formats that do that however.

jschella at 2007-7-9 22:39:14
# 8

> I can use UltraEdit to convert... from UTF8 to ASCII

> and the Chinese Characters have not been changed...

> shows normally.

No that isn't what it is doing.

You need to understand that a character set is a representation consisting of bytes and rules for those bytes.

ASCII is a 7-bit single-byte representation, and there is absolutely no way that you are going to fit Chinese in there, if for no other reason than that all the characters are already used.

What you are seeing is a different character set - not ASCII.

As a guess it is probably a multi-byte character set. It is quite possible that the first 128 characters of that multi-byte character set have a one-to-one mapping with ASCII. That is very common in multi-byte character sets (probably always true once you exclude EBCDIC). That however doesn't make it ASCII.

So once you determine what character set you are actually using (again, it is not ASCII), you can use the string encoding support in Java to convert from bytes to a string and from a string to bytes. That is how you must do it in Java (bytes, not strings).
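The bytes-to-string-to-bytes step described above might look roughly like the sketch below. The class and method names are made up for illustration, and UTF-16BE is used as the target only because every JVM is required to support it; the real target charset is whatever the receiving system actually expects.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-encodes a byte array from one charset to another.
final class RecodeBytes {
    static byte[] recode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // bytes -> characters
        return text.getBytes(to);              // characters -> bytes
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file stays ASCII
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = recode(utf8, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        System.out.println(utf8.length + " UTF-8 bytes -> " + utf16.length + " UTF-16BE bytes");
    }
}
```

The same two characters come out as different byte sequences in each encoding, which is the whole point: the characters survive, the bytes change.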

jschella at 2007-7-9 22:39:15
# 9

First of all, thx for all of your information.

But, actually, I do understand the facts about 7-bit ASCII and 8-bit UTF8...

Please see, another fact is this:

1) Open the srcFile.dat file with NotePad

2) When I press "Save as", it shows that the "Encoding" is "UTF8"

3) All I need to do is change the "Encoding" from "UTF8" to "ASCII"

4) None of the Chinese or other special characters are changed. (I am using an English system, but I can also type Chinese in NotePad and save it as an ASCII file)

What I need to do is write a program that can do the function above. In our company, SysA generates UTF8-encoded files, but SysB only understands ASCII-encoded files (never the \uXXXX form, but the actual characters you want SysB to receive). SysA may generate more than 100 files at once, so I have to write a program to convert all those UTF8 files to ASCII files... For now, I need to get at least one file converted from UTF8 to ASCII so that I can then handle all the files...

If you have the correct answer concerning how to do it... plz let me know.

And plz stop saying that it is impossible... since UltraEdit or NotePad can change it manually... Thx a lot. ^_^

Kevin.MCUa at 2007-7-9 22:39:15
# 10

The only way I can see to save Chinese characters as ASCII is if you encode it as Base64 first, and I seriously doubt that either UltraEdit or Notepad is doing that. But seriously, the set of Chinese characters is way too large to fit into a seven- or eight-bit encoding. Forcing the editor to use one would turn the text into garbage. What you describe simply cannot be happening.
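For what it's worth, the Base64 route would look something like this sketch. It assumes a JDK recent enough to have java.util.Base64 (Java 8 or later), and the class and method names are invented for the example. The output is genuinely ASCII, but it is no longer readable text, which is presumably not what the receiving system wants.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: wraps arbitrary text in an ASCII-only Base64 string and back.
final class Base64Ascii {
    static String toAsciiSafe(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    static String fromAsciiSafe(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        String encoded = toAsciiSafe(original);
        System.out.println(encoded); // only letters, digits, '+', '/' and '='
        System.out.println(fromAsciiSafe(encoded).equals(original)); // prints "true"
    }
}
```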

uncle_alicea at 2007-7-9 22:39:15
# 11

I notice that in one of your posts you typed ANSI instead of ASCII. These are not at all the same thing, so I figured it was a mistake. But maybe what you have been meaning to say all along is ANSI, and not ASCII.

I don't use Windows, so I can't run Notepad myself, but I suspect you are really saving your files in what Windows calls an "ANSI" encoding. These encodings have little or nothing to do with ANSI, but refer in fact to Windows code pages. Notepad is probably saving your file using a Windows code page that covers Chinese characters, such as code page 936 (also known as GBK, used on mainland China), or code page 950 (also known as Big5, used in Taiwan and Hong Kong).

So to answer your original question (how to change the file format), one way would be to read the UTF-8 file using an InputStreamReader with the charset "UTF-8", then write the contents to a new file using an OutputStreamWriter with whatever charset you want, such as "GBK" or "Cp950". For more information on character encodings supported by Java, look at this web page:

http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html
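A minimal sketch of the InputStreamReader / OutputStreamWriter approach described above, assuming a modern JDK (it uses try-with-resources). The class name ConvertEncoding and its command-line arguments are made up for illustration, and "GBK" in the usage message is only a guess at the target code page; the right one depends on what the receiving system really expects.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical converter: reads a file in one charset and writes it in another.
final class ConvertEncoding {
    static void convert(File src, File dest, Charset from, Charset to) throws IOException {
        try (Reader in = new BufferedReader(new InputStreamReader(new FileInputStream(src), from));
             Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dest), to))) {
            char[] buffer = new char[8192];
            int n;
            // Copy decoded characters; the writer re-encodes them in the target charset
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.out.println("usage: java ConvertEncoding <srcFile> <destFile> <targetCharset, e.g. GBK>");
            return;
        }
        // Charset.forName throws if this JRE does not support the requested charset
        convert(new File(args[0]), new File(args[1]), StandardCharsets.UTF_8, Charset.forName(args[2]));
    }
}
```

Once compiled, it could be run as `java ConvertEncoding D:\Converter\srcFile.dat D:\Converter\desFile.dat GBK`. One caveat: if the source file starts with a UTF-8 byte order mark (Notepad may write one), Java's UTF-8 decoder passes it through as the character U+FEFF, so it may need to be skipped by hand.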

If you are using Sun's JRE for Windows, then note that support for non-European charsets may not have been installed. The site says:

"Sun's Java 2 Runtime Environment, Standard Edition, v. 1.4.2 for Windows may be installed as a complete international version or as a European languages version. The J2RE installer by default installs a European languages version if it recognizes that the host operating system only supports European languages. If the installer recognizes that any other language is needed, or if the user requests support for non-European languages in a customized installation, a complete international version is installed."

Geoff

glevnera at 2007-7-9 22:39:15
# 12

I don't use Windows either, but there are ways to find out which character encoding a given text file uses. Write "<html><body>" at the beginning of the file and "</body></html>" at the end, and save it as a .html file. Then open this new file in a web browser that detects character encodings automatically (e.g. Firefox). You will find the character encoding in the "page information", or whatever it might be called.

jsalonena at 2007-7-9 22:39:15
# 13

> I don't use Windows either, there are ways to find

> out which character encoding a given text file uses.

> Write "<html><body>" at the beginning of the file and

> "</body></html>" at the end and save it as a .html

> file. Then open this new file in a web browser that

> detects character encodings automatically (e.g.

> Firefox). You will find the character encoding from

> the "page information" or what ever it might be

> called.

Any such detection is, by the very nature of the thing, heuristic and I seriously doubt it will identify a Chinese file correctly.

You can find out what encoding Windows is configured for easily enough. Open a cmd window and type "chcp" (change code-page) and it will print the code page in use. In general, if Java knows that encoding it calls it cpNNNN, where NNNN is the code page number that chcp gives you.

(As usual Windows philosophy is not to bother the user's tiny mind with anything as technical as code pages).
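To see whether a particular JRE knows a given code page under the cpNNNN name, something like the following sketch can be used. The class and method names are invented for the example, and which code pages are available depends on the JRE build, so the lookup is guarded rather than assumed.

```java
import java.nio.charset.Charset;

// Hypothetical helper: maps a Windows code page number (as printed by chcp) to a Java charset.
final class CodePageLookup {
    // Returns the matching charset, or null if this JRE does not provide it.
    static Charset forCodePage(int codePage) {
        String name = "Cp" + codePage;
        return Charset.isSupported(name) ? Charset.forName(name) : null;
    }

    public static void main(String[] args) {
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        for (int cp : new int[] {1252, 936, 950}) {
            Charset cs = forCodePage(cp);
            System.out.println("Cp" + cp + " -> " + (cs == null ? "not supported by this JRE" : cs.name()));
        }
    }
}
```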

malcolmmca at 2007-7-9 22:39:15
# 14

> Any such detection is, by the very nature of the

> thing, heuristic and I seriously doubt it will get a

> Chinese file correctly.

>

On the contrary, actually in many cases it does. And if it doesn't you're left with trying each of the few encodings for Chinese that can be found in the "character encodings" menu. The encoding is probably well known.

> You can find out what encoding Windows is configured

> for easilly enough. Open a cmd window and type "chcp"

Unfortunately that will only tell you the code pages of the system and not the encoding that was used to save a specific file.

jsalonena at 2007-7-9 22:39:15
# 15
> Unfortunately that will only tell you the code pages of the system and not the encoding that was used to save a specific file.

Odds are that if Notepad shows the data correctly, then the current Windows code page corresponds to the encoding of the file.
malcolmmca at 2007-7-21 17:44:32
# 16

> First of all, thx all of your information.

>

> But, actually, i do understandy that the fact of

> 7-bit ASCII and 8-bit UTF8...

>

> Please see, another fact is that:

> 1) Open the srcFile.dat file with NotePad

> 2) When I press "Save as", it shows that the

> "Encoding" is "UTF8"

> 3) What I need to do is just select the "Encoding"

> from "UTF8" to "ASCII"

> 4) All Chinese or whatever special Characters wont be

> changed. (I am using english system but I can also

> type the Chinese in the NotePad and save it as ASCII

> file)

Quite simply, the above two statements are a complete contradiction.

ASCII does not support Chinese characters. Period. There is absolutely no way, regardless of methodology, to keep Chinese characters using the ASCII character set.

And as I have already explained to you, the most likely explanation is that what is being saved is in fact a multibyte character set. And by definition a multibyte character set is not ASCII. Period.

>

> what I need to do is to make a program which can do

> the funcstion above.

And I believe I already mentioned your first step is to actually determine what character set (not ASCII) is being used. Until you determine what it is you can't convert it.

> because in our company, SysA

> generates UTF8 encoding files, but SysB just

> understand ASCII encoding files (never the type as

> \uXXX but the actual characters you want sysB to

> receive), and sysA maybe will generate more than 100

> files at once, so I hav to make a program to transfer

> all those UTF8 files to ASCII files... and now... I

> need to make at list one file transfers from UTF8 to

> ASCII encoding possibly so that i can make all files

> possible...

>

> If you hav the correct answer concering how to do

> it... plz let me know.

> and plz stop saying that it is impossible...since

> UltraEdit or NotePah can change it manually... Thx a

> lot. ^_^

What I and others said is that you are confusing terminology with what some program implements via a GUI.

Until you understand what a character set is and what it isn't, you aren't going to make much headway on this problem.

If you refuse to accept the correct explanation then I would suggest you look into AutoIt. It is free software that you can use to automate tasks on Windows. You can write a script in it that will open UltraEdit or Notepad and save the files exactly as you would do manually. Using that process you will not need to understand what character sets are.

jschella at 2007-7-21 17:44:32
# 17

Thx for all of your replies....

Actually, I have done what I asked about before... and the core code is as follows:

FileInputStream fInStream = new FileInputStream(srcDir + "/" + strFileList[i]);
InputStreamReader inStreamReader = new InputStreamReader(fInStream, "UTF8");
BufferedReader bufReader = new BufferedReader(inStreamReader);
FileWriter fWriter = new FileWriter(desDir + "/" + strFileList[i], false);
System.out.println("Converting File:" + strFileList[i]);
while ((intCh = bufReader.read()) != -1) {
    strBuf.append((char) intCh);
}
buf = strBuf.toString();
fWriter.write(buf);

Not only were all of my files transferred normally, but the format of all the files has also been changed from UTF8 to ASCII.

Plz do not say that it is impossible to show Chinese if u have never used Chinese or don't know how to. Anyway, thank you very much for all of your replies. thx.

Kevin.MC.Ua at 2007-7-21 17:44:32
# 18

Never mind, guys. By now, the OP has undoubtedly switched to C# so he can use the [url=http://notgartner.wordpress.com/2006/11/26/c-30-adds-the-because-justification-operator/]"because I said so" operator[/url].

edit: Pfft! Leaves the thread inactive for a full day, then comes back just in time to mess up my joke. There should be a law!

uncle_alicea at 2007-7-21 17:44:33
# 19

> Plz do not say thatz impossible to show Chinese, if u

> hav never or dont know how to use Chinese. Anyway,

> thank you very much for all of your replys. thx.

I don't believe you understood a single thing that has been said to you in this conversation. Your loss, not ours.

dcmintera at 2007-7-21 17:44:33
# 20

> Not only all of my files transfered normally but also

> the format of all files have been changed from UTF8

> to ASCII.

>

Not to ASCII -- to the platform's default character encoding. Look up the documentation of FileWriter.

The default character encoding on Windows in Western locales is "windows-1252". This encoding is not ASCII; it contains all of ASCII plus several non-ASCII characters. What it does not contain, however, is any Chinese characters, so any Chinese characters are destroyed in the process. See http://en.wikipedia.org/wiki/Windows-1252

The default character encoding on many desktop Linux distributions happens to be UTF-8, so on a Linux box the snippet you posted merely copies the given file, and does so extremely slowly.

On other systems very likely it does something very different.
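A rough way to check what the default encoding is on a given machine, and whether a piece of text survives it, is sketched below. The class and method names are made up for the example. It relies on FileWriter using the JVM's default charset when none is given, and on String.getBytes(Charset) replacing unmappable characters with the charset's replacement byte, which for US-ASCII is '?'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical check: can a given charset represent every character of a string?
final class LossyEncoding {
    static boolean fits(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "A\u4e2d"; // 'A' followed by one Chinese character
        Charset dflt = Charset.defaultCharset(); // what FileWriter uses when no charset is given
        System.out.println("Default charset: " + dflt);
        System.out.println("Text fits in default charset? " + fits(text, dflt));
        System.out.println("Text fits in US-ASCII? " + fits(text, StandardCharsets.US_ASCII));
        // Encoding to real ASCII silently replaces the Chinese character
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints "A?"
    }
}
```

If the Chinese text fits in the default charset, that charset is by definition something other than ASCII. To avoid depending on the platform default at all, the target charset can be passed explicitly to an OutputStreamWriter.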

jsalonena at 2007-7-21 17:44:33
# 21
Sounds like it must be a version of Windows set up with a Chinese locale - which presumably has a multi-byte code page. Maybe even UTF-8, though it would be unlike MS to follow a standard.
malcolmmca at 2007-7-21 17:44:33
# 22

> Sounds like it must be a version of Windows set up

> with a Chinese locale - which presumably has a

> multi-byte code page. Maybe even UTF-8, though it

> would be unlike MS to follow a standard.

Yes, that has to be the case. Wikipedia lists a few variable-width encodings as "windows code pages:" http://en.wikipedia.org/wiki/Category:Windows_code_pages Unfortunately it's impossible to know which one corresponds to the "ANSI encoding" in question--at least two of them have been designed for Chinese.

jsalonena at 2007-7-21 17:44:33
# 23
But the point is that... desFile.dat, which is generated by the code above, really is in ASCII encoding, since SysB identifies it that way... and none of the Chinese inside is DESTROYED... So, would you plz explain how come? I appreciate all of your replies. Thx a lot.
Kevin.MC.Ua at 2007-7-21 17:44:33
# 24

> Actually, I have done what I queried before...and the

> core coding as following:

I still don't understand how it works. Could you do me a favor and add another print statement to your code?

After this line:

> System.out.println("Converting File:" + e + strFileList[i]);

Could you add this line, and post what it says?

System.out.println("writer uses encoding: " + fWriter.getEncoding());

kdgregorya at 2007-7-21 17:44:33
# 25

> But the point is that... desFile.dat which is

> generated by the code above is really in ASCII

> encoding

No, it isn't. ASCII doesn't allow you to write Chinese characters, therefore this is not ASCII by definition.

Now, it may be a file that is incorrectly being identified by some tool as "ASCII", but calling it ASCII doesn't make it ASCII.

Doubtless you've written this out in some format that can contain Chinese characters. Perhaps that format meets your needs (which you've never explained), but whatever else is true about that encoding, it's not ASCII. Never was, never can be.
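Whether a file really is ASCII can be checked directly rather than guessed at: ASCII only uses byte values 0x00 to 0x7F, so a single byte of 0x80 or above settles the matter. A minimal sketch, with an invented class name and taking the file path as a command-line argument:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical check: a file is ASCII only if every byte is in the range 0x00-0x7F.
final class AsciiCheck {
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) {      // Java bytes are signed, so 0x80-0xFF show up as negative values
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java AsciiCheck <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isPureAscii(data) ? "pure ASCII" : "not ASCII: contains bytes >= 0x80");
    }
}
```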

> since sysB idientify it in that way...and

> all of the Chinese inside does not DESTORYED...

And that's how we know it's not ASCII. It's, at the very least, some eight bit encoding (again, if it's an eight bit character set it's not ASCII - see an earlier comment about ANSI encodings). Even more likely is that you've actually written out some double byte character set encoding.

ASCII represents 128 characters, a large number of which are non-printable.

Eight bit encodings can represent up to 256 characters.

The Chinese alphabet contains circa 50,000 pictograms. Do the math.

> So, would you plz explanation how come?

> I appreciate all of your replys. Thx a lot.

Let me ask you this: You've got the data in UTF-8, which is interchangeable with ASCII for characters that can be represented in ASCII, so what are you actually trying to use this file for, why can't you use UTF-8, and why "must" you use ASCII when that's actually impossible?

dcmintera at 2007-7-21 17:44:33
# 26

>

> > since sysB idientify it in that way...and

> > all of the Chinese inside does not DESTORYED...

>

> And that's how we know it's not ASCII. It's, at the

> very least, some eight bit encoding (again, if it's

> an eight bit character set it's not ASCII - see an

> earlier comment about ANSI encodings). Even more

> likely is that you've actually written out some

> double byte character set encoding.

>

More likely multi-byte (or variable byte if you prefer.)
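The variable-width behaviour is easy to see by counting bytes per character. The sketch below uses an invented class name; UTF-8 is always available, while GBK is only tried if this JRE happens to support it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: number of bytes a string occupies in a given charset.
final class ByteWidths {
    static int width(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String latin = "A";
        String chinese = "\u4e2d"; // one Chinese character
        System.out.println("UTF-8: 'A' = " + width(latin, StandardCharsets.UTF_8)
                + " byte, U+4E2D = " + width(chinese, StandardCharsets.UTF_8) + " bytes");
        if (Charset.isSupported("GBK")) { // not guaranteed to be present in every JRE
            Charset gbk = Charset.forName("GBK");
            System.out.println("GBK:   'A' = " + width(latin, gbk)
                    + " byte, U+4E2D = " + width(chinese, gbk) + " bytes");
        }
    }
}
```

In both encodings the ASCII letter takes one byte while the Chinese character takes more, so the byte count per character varies within the same file.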

jschella at 2007-7-21 17:44:33
# 27
As they say in China: no ASCII, no tell-ee
DrLaszloJamfa at 2007-7-21 17:44:33
# 28

> But the point is that... desFile.dat which is

> generated by the code above is really in ASCII

> encoding since sysB idientify it in that way...and

> all of the Chinese inside does not DESTORYED...

>

> So, would you plz explanation how come?

I and others already have.

It is written in some character set that supports Chinese. That character set is not ASCII.

I have no idea what tool you are using but tools that identify file types do so algorithmically based on rules which result in a guess and not a guarantee. Generally all such algorithms can be fooled with the right combination of data.

jschella at 2007-7-21 17:44:33