HTMLDocument.getText() issue: HTML <br> tag is translated to ' ' (space)?
HTMLDocument.getText() issue:
the <br> tag (HTML.Tag.BR) is translated to ' ' (space), but I expected '\n'!
I have an HTMLDocument that contains:
<html><body>
SOMETEXT
</body></html>
The JTextPane displays this formatting perfectly, but after I convert the HTML to plain text using
getText(0, getDocument().getLength())
I receive "SOMETEXT ": all line breaks are lost! I get only spaces instead of line breaks! (Using a different tag instead of <br> is not a general solution.)
Help me please! It shouldn't be necessary to do the "HTML to plain text" conversion manually..
Strange that I can't find any existing solutions here...
JDK1.4.1 and 1.4.2_03
[745 byte] By [ovrubla] at [2007-9-30 2:17:08]

Well, whether it's right or wrong is a moot point. It's what browsers do. Ever tried a simple copy/paste of formatted code posted to this forum and looked at the underlying HTML source?
{
System.out.println("Hello, world");
}
The association of a line break in a formatting GUI component with the \n character must be some kind of psyche thing...
Anyway, the problem is that there is a one-character "place-holder" in the content to represent the position of the <br> tag. It could have been a \n or it could have been a space. It's irrelevant to rendering, as the javax.swing.text.View used to represent it would not paint anything. If it were a \n, you would get what you want.
You need to find out where in the HTML parser the space is put in. I'm sure someone else can finish my answer off...
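To see the place-holder described above for yourself, something like the following sketch may help. It parses a small HTML string into an HTMLDocument and reports which character the document stores at the position of each <br> element. The class name BrPlaceholderProbe and the method brPlaceholders are made up for this illustration, and the code assumes a newer JDK than the 1.4.x mentioned in the question (it uses generics and multi-catch).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper for illustration; not part of Swing.
class BrPlaceholderProbe {

    // Returns the character the document holds at the start offset of each <br> element.
    static List<Character> brPlaceholders(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        List<Character> found = new ArrayList<>();
        try {
            kit.read(new StringReader(html), doc, 0);
            // ElementIterator walks the element tree depth-first, starting at the root.
            ElementIterator it = new ElementIterator(doc);
            for (Element e = it.next(); e != null; e = it.next()) {
                Object name = e.getAttributes().getAttribute(StyleConstants.NameAttribute);
                if (name == HTML.Tag.BR) {
                    found.add(doc.getText(e.getStartOffset(), 1).charAt(0));
                }
            }
        } catch (IOException | BadLocationException ex) {
            throw new IllegalStateException(ex);
        }
        return found;
    }

    public static void main(String[] args) {
        for (char c : brPlaceholders("<html><body>A<br>B</body></html>")) {
            // Print the numeric code so a space and a newline are easy to tell apart.
            System.out.println("br place-holder char code: " + (int) c);
        }
    }
}
```

Printing the numeric code rather than the character itself makes it obvious whether the stored place-holder is a space or a line feed.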
Looks to me like you have some legwork to do... Java does not provide any HTML-to-plain-text conversion. As a matter of fact, Java doesn't even provide a way to obtain the inner HTML. You can scan the string returned from getText and convert all leading and trailing spaces to \n. For more info, see the link below:
http://forum.java.sun.com/thread.jsp?forum=57&thread=227015
;o)
V.V.
Here is a sample of how I get the plain text from the HTMLDocument:
HTMLDocument htmlDocument = (HTMLDocument)getDocument();
String plainText = htmlDocument.getText(0, this.getDocument().getLength());
"...SOMETEXT..."
The plainText string contains the plain text shown above.
For the same document, here is an alternative way to get the HTML file from this document :)
HTMLDocument htmlDocument = (HTMLDocument)getDocument();
Writer w = new FileWriter(new File("/home/ovrubl/temp/AAAAAAAAAAAAAAAAAAAA"));
htmlEditorKit.write(w, htmlDocument, 0, htmlDocument.getLength());
w.flush();
w.close();
"
...
SOMETEXT
...
"
I'm looking for a way to get
"...\n\n\nSOMETEXT..." from htmlDocument.getText(0, this.getDocument().getLength())
Here's one way:
1) put htmlDocument.getText(0, this.getDocument().getLength()); into a StringBuffer
2) walk the document element hierarchy recursively from document.getDefaultRootElement()
3) When you find an element of kind HTML.Tag.BR replace the character at the position it indicates with a \n
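
The three steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the class name BrToNewline and its methods are invented here, StringBuilder stands in for the StringBuffer mentioned in step 1, and the code targets a newer JDK than 1.4.x (multi-catch). The plainTextOf(String) method exists only so the sketch can be tried without a JTextPane; with an existing pane you would pass its HTMLDocument to toPlainText directly.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper for illustration; not part of Swing.
class BrToNewline {

    // Step 1: copy the document text into a mutable buffer, then fix it up.
    static String toPlainText(HTMLDocument doc) throws BadLocationException {
        StringBuilder text = new StringBuilder(doc.getText(0, doc.getLength()));
        replaceBr(doc.getDefaultRootElement(), text);
        return text.toString();
    }

    // Step 2: walk the element hierarchy recursively.
    private static void replaceBr(Element elem, StringBuilder text) {
        Object name = elem.getAttributes().getAttribute(StyleConstants.NameAttribute);
        if (name == HTML.Tag.BR) {
            // Step 3: overwrite the place-holder character of the BR element.
            int pos = elem.getStartOffset();
            if (pos < text.length()) {
                text.setCharAt(pos, '\n');
            }
            return;
        }
        for (int i = 0; i < elem.getElementCount(); i++) {
            replaceBr(elem.getElement(i), text);
        }
    }

    // Convenience wrapper: parse an HTML string and convert it in one call.
    static String plainTextOf(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        try {
            kit.read(new StringReader(html), doc, 0);
            return toPlainText(doc);
        } catch (IOException | BadLocationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><br><br><br>SOMETEXT<br></body></html>";
        // Show newlines as a visible escape so the result is easy to inspect.
        System.out.println(plainTextOf(html).replace("\n", "\\n"));
    }
}
```

Because the text is read starting at offset 0, an element's start offset can be used directly as an index into the buffer. The bounds check is there because getLength() leaves out the document's implicit final newline.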