Unicode output in SJSE 8

I'm writing a multilingual Java web application in SJSE 8 which uses log4j 1.2.13 to log messages and the Bundled Tomcat Server 5.5.7. Those log messages sometimes contain Unicode characters outside the English alphabet, which are currently being output to the SJSE 8 "Output" window in the "Bundled Tomcat (5.5.7)" tab as question marks (e.g. "term is: ?").

Can I get the output window to display the real characters instead of the question marks?

[467 byte] By [steve@@] at [2007-11-26 9:22:14]
# 1

Since the output window has to deal with the output from various programs, I do not think it is possible for it to recognize the encoding used by external programs.

Please refer to http://www.netbeans.org/issues/show_bug.cgi?id=48902 which discusses this issue.

Have you tried running the IDE under the locale you are interested in? You can use the '-locale' switch (either on the command line or in the ide.conf file...)

http://www.netbeans.org/kb/41/using-netbeans/configuring.html#switches

Also from http://www.netbeans.org/issues/show_bug.cgi?id=48902:

... it doesn't seem like Windows can be told to use UTF-8 as the encoding for all things, but needs these separate regional settings; I think part of it might be that there are still some parts of Windows that do not/cannot use UTF-8 (or Unicode in general) by default, so that the additional step is needed to somehow tell them what locale/encoding is being used when sent data from other programs....

http://www.netbeans.org/nonav/issues/showattachment.cgi/18152/windows.usingasianlocales

KarthikR at 2007-7-6 23:55:20 > top of Java-index,Development Tools,Java Tools...
# 2

Thanks for your advice, but it doesn't seem to have done what I needed. I added:

--locale ar

to jstudio.conf after the other options for netbeans_default_options. That line now looks like this:

netbeans_default_options="-J-Xms256m -J-Xmx512m -J-XX:PermSize=32m -J-XX:MaxPermSize=96m -J-Dnetbeans.logger.console=true -J-ea -J-Dsun.java2d.pmoffscreen=false -J-Dcom.sun.im.service.xmpp.session.keepaliveinterval=30 -J-Dcom.sun.aas.installRoot=\"C:\Program Files\Sun\jstudio_ent8\AppServ8.1UR2\" --locale ar"

ar is the ISO 639:1988 standard 2-letter code for Arabic. I then ran my code again and input Arabic characters. The Output window still displayed them as ?.

Maybe the IDE still isn't running in Arabic because I didn't add --locale correctly. Although I would ideally like to be able to see the characters in any language (not just the one I set the locale to), being able to see them in even that one language would be helpful. If you have any more advice on how to set the locale properly, I'd appreciate it. Thanks.

steve@@ at 2007-7-6 23:55:20
# 3

-locale doesn't seem to be working for Arabic locale...

There is one other thing you can try, to see if it works and is not too cumbersome...

- Go to tools | options | editing | editor settings | Plain editor

- Select fonts&colors and set the font to an available Arabic font

- Right-click on the output window, select save-as to save the output in a file without any extension

- Use file | open file to select the file in the editor.

If the above works, then all you have to do is keep the editor window open, periodically do save-as on the output window and refresh the editor...

KarthikR at 2007-7-6 23:55:20
# 4

Thanks again, but still no luck. I can write in Arabic in the IDE editor but when I right-click in the Output window and Save As, and then open that file in the IDE editor, the Arabic still shows up as question marks. I think they've been transformed into question marks by the time they get to the Output window.

Since you mentioned a problem with --locale ar, I tried --locale zh in jstudio.conf and used a Chinese script but got the same "?" results in the Output window. Again, I can write in Chinese in the IDE editor window; it's just the Output window that converts non-Latin characters to question marks.

By the way, did either of these methods work on your system? If so, which --locale and / or language did you use? Maybe there is a system configuration I need to change to match the way yours is set up.

steve@@ at 2007-7-6 23:55:20
# 5

I am sorry my suggestions didn't work. Of course, your conclusion is right: the characters are transformed before they are displayed in the output window, so the original information is lost. That is why setting the font doesn't work; sorry I didn't realise it before.

It is not clear to me why setting the locale wouldn't work, though...

In any case, it seems that unless the input to output window is somehow intercepted, there is no way to stop the loss of information.

At this point, I have only one suggestion left: you could check whether log4j provides any API to set the encoding of its output. If the encoding information is present, perhaps the IDE output window would respect it...

KarthikR at 2007-7-6 23:55:20
# 6

That's a good idea: looking at log4j to see if it's losing the Unicode data before the Output window even gets it. I'll continue to investigate that.

However, I can do this:

    String u = "\u0661"; // non-Latin code point
    String a = "\u0061"; // this is the Unicode code point for the character 'a'
    System.out.println(u);
    System.out.println(a);

And the output window shows this:

?

a

So I think there is definitely a problem with the SJSE 8 configuration itself - even if there's also a problem with log4j.

steve@@ at 2007-7-6 23:55:20
# 7

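If I try something like

    // needs java.io.PrintStream and java.io.UnsupportedEncodingException
    try {
        // wrap System.out so that characters are encoded as UTF-8 bytes
        PrintStream ps = new PrintStream(System.out, true, "UTF-8");
        String u = "\u0661"; // non-Latin code point (Arabic-Indic digit one)
        String a = "\u0061"; // this is the Unicode code point for the character 'a'
        ps.println(u);
        ps.println(a);
    } catch (UnsupportedEncodingException e) {
        // report the problem instead of silently swallowing it
        e.printStackTrace();
    }

I see different output than "?".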

So it looks like the output window in the IDE merely works with whatever bytes it receives. The standard output stream (System.out) is a PrintStream, and according to

http://java.sun.com/j2se/1.5.0/docs/api/java/io/PrintStream.html

"All characters printed by a PrintStream are converted into bytes using the platform's default character encoding."

So it looks like one solution for your issue may be to set the platform default encoding to Unicode. (And I think that needs to be set at the OS level...)
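The "?" seen earlier in the thread is consistent with that Javadoc sentence: when a character cannot be represented in the charset used for encoding, Java substitutes the charset's replacement byte, which for common single-byte charsets is '?'. Here is a minimal sketch of that effect. It assumes a Western-European default such as windows-1252 (the thread never states the actual platform encoding), it uses StandardCharsets, which needs Java 7 or later, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Encode the text in the given charset, then decode the bytes with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 stands in for the poster's platform default.
        Charset western = Charset.forName("windows-1252");
        String u = "\u0661"; // Arabic-Indic digit one, not representable in windows-1252

        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println("windows-1252 round trip: " + roundTrip(u, western));
        System.out.println("UTF-8 round trip intact: "
                + u.equals(roundTrip(u, StandardCharsets.UTF_8)));
    }
}
```

Once the character has been turned into the '?' byte, nothing downstream (font, output window, save-as) can recover it, which matches what was observed above.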

Also:

You may want to ask this question on NetBeans forums:

http://www.netbeans.org/community/lists/

JSE is built on top of NetBeans and inherits the output window from NetBeans; NetBeans authors and users would certainly have more insight into this.

Or, you may want to file a bug at http://www.netbeans.org/community/issues.html. In the bug entry, you may just want to give the simple Java program you wrote and indicate that the output is not what is expected. (Bugs with simple reproducible cases are easier for developers to deal with than bug reports which have extraneous info...)

KarthikR at 2007-7-6 23:55:20
# 8

> KarthikR wrote: set the platform default encoding to Unicode...

The above is not possible; sorry about that. What can be done is to set the OS to a locale that has an associated encoding. For instance, Unix has UTF-8 locales for ja, zh, etc., as well as ones that use the traditional default encoding (i.e. the ja locale's default encoding is EUC-JP). Also, the required fonts need to be available on the machine...
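To see which locale and encoding a given JVM has actually picked up from the OS, a small diagnostic like the following may help. It is only a sketch: the class name is hypothetical, and it reports the defaults of whichever JVM runs it, which is not necessarily the same JVM as the IDE or the bundled Tomcat.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class DefaultsReport {
    // Build a one-line summary of this JVM's default locale and encoding.
    static String describe() {
        return "locale=" + Locale.getDefault()
                + " charset=" + Charset.defaultCharset().name()
                + " file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it from inside the web application (for example from a servlet's init code) would show the defaults that System.out in that process is using.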

By the way, did you use '--locale' or '-locale' when running the IDE? It should be two dashes...

Also, the --locale startup flag may not be very helpful if the development involves other servers and databases, since the flag only affects the IDE. I think it would be better to actually be in the desired locale. (This would of course mean you will need to switch the system from one locale to another...)

KarthikR at 2007-7-6 23:55:20
# 9

I did use two dashes on the locale flag previously, thanks for the clarification.

When I used your code I did get a different character (or group of characters) for each Unicode value I tried (instead of always a question mark). However, those characters were not the correct ones. For example, an Arabic character (Unicode value \u0682) came out as an accented U with a comma.
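The "accented U with a comma" is what one would expect if the UTF-8 bytes are written correctly but then decoded by the console with a single-byte Western charset. The sketch below reproduces that under the assumption that the console decodes with windows-1252; that charset is a guess, since the thread does not say what the Output window uses, and the class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those same bytes with a different charset,
    // imitating a console that assumes the wrong encoding.
    static String misread(String text, Charset wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrongCharset);
    }

    public static void main(String[] args) {
        // Assumption: the console interprets incoming bytes as windows-1252.
        Charset assumedConsole = Charset.forName("windows-1252");
        String shown = misread("\u0682", assumedConsole);

        // U+0682 is the two bytes DA 82 in UTF-8; windows-1252 maps them to
        // U+00DA (U with acute) and U+201A (single low-9 quotation mark).
        for (char c : shown.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

So the data is no longer lost at this point; it is only displayed with the wrong decoding.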

I tried this to see if I could get valid Unicode anywhere: I added a log4j FileAppender and wrote the log4j output to a .log file. I then opened that .log file using Notepad and I saw the valid Unicode characters.

For now, I'm going to just use two log4j appenders: one to the console, and one to the file. When I need to see the Unicode, I'll open the file. I'll use a file for the production system anyway, so this isn't that much of a sacrifice.

Thanks for all your help! Here is some code that works for displaying Unicode:

In a log4j.properties file:

    # Configuration of log4j Logger
    log4j.rootLogger=DEBUG, consoleAppender, fileAppender

    log4j.appender.consoleAppender=org.apache.log4j.ConsoleAppender
    log4j.appender.consoleAppender.layout=org.apache.log4j.PatternLayout
    log4j.appender.consoleAppender.encoding=UTF-8

    log4j.appender.fileAppender=org.apache.log4j.RollingFileAppender
    log4j.appender.fileAppender.layout=org.apache.log4j.PatternLayout
    log4j.appender.fileAppender.encoding=UTF-8
    log4j.appender.fileAppender.File=log4jFileAppender.log

    # Once this limit is reached, the file will be moved to filename.ext.1
    # a new file will be started with file name filename.ext
    log4j.appender.fileAppender.MaxFileSize=100KB

    # Keep one backup file
    log4j.appender.fileAppender.MaxBackupIndex=1

    # Print the date in ISO 8601 format
    log4j.appender.consoleAppender.layout.ConversionPattern=%d %-5p %c{1}:: %m%n
    log4j.appender.fileAppender.layout.ConversionPattern=%d %-5p %c{1}:: %m%n

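In Java:

    // needs java.io.PrintStream, java.io.UnsupportedEncodingException and
    // org.apache.log4j.Logger; "log" is a Logger obtained elsewhere in the class,
    // e.g. Logger.getLogger(MyClass.class), where MyClass is a placeholder name
    String u = "\u0682"; // non-Latin (Arabic) code point
    String a = "\u0061"; // this is the Unicode code point for the character 'a'
    try {
        PrintStream ps = new PrintStream(System.out, true, "UTF-8");
        ps.println(u);
        log.debug(u);
        ps.println(a);
        log.debug(a);
    } catch (UnsupportedEncodingException e) {
        // report the problem instead of silently swallowing it
        e.printStackTrace();
    }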

This will print an accented U with a comma twice in the console (once with ps.println(u) and once with log.debug(u)) and you'll also see the letter "a" twice from ps.println(a) and log.debug(a). But, if you open log4jFileAppender.log, you'll see the real Arabic character followed by the letter "a".
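The file shows the right character presumably because the appender writes through a writer with an explicit UTF-8 encoding rather than the platform default, and Notepad then decodes the file correctly. A stand-alone sketch of the same idea without log4j follows. The class and method names and the temporary file are illustrative only, and the code uses java.nio.file, so it needs Java 8 or later rather than the JDK of the original setup.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileDemo {
    // Write the text to a temporary file with an explicit UTF-8 encoding,
    // then read the bytes back and decode them as UTF-8.
    static String writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".log");
            try {
                try (Writer out = new OutputStreamWriter(
                        Files.newOutputStream(file), StandardCharsets.UTF_8)) {
                    out.write(text);
                }
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String u = "\u0682"; // the Arabic code point used in the post above
        System.out.println("survived the file round trip: " + u.equals(writeAndReadBack(u)));
    }
}
```

Because both the writer and the reader agree on UTF-8, the platform default encoding never gets a chance to replace or misinterpret the character.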

I'm content with this solution for now. Thanks for all your help. I learned a lot along the way.

steve@@ at 2007-7-6 23:55:20
# 10
I am glad you found a workaround. Also, thanks for posting your solution, as I am sure it will be of use to other users on this forum.
KarthikR at 2007-7-6 23:55:20