Problem removing HTML tags from the text retrieved

Hi there,

I am using JDBC to connect to the database and retrieve data. In one of the columns, a few of the records contain HTML tags along with the description. Is there a way to retrieve only the text, ignoring the HTML tags in between? Or can I retrieve the data and then strip the HTML out of the text so that only plain text is displayed?

Example of the retrieved data, which is pipe-separated; one of the columns has HTML tags in it:

209|The euphoria |187945-2|http://www.abc/lst.jsp?mktgChannel=I86023&sku=18791-2&siteID=qpF0HYnRugA|http://www.abc.com/assets/images/product/medium/18793-2_198.jpg|Rooftop Singers: Walk Right In | abc Music proudly presents THE FOLK YEARS, an unforgettable era in music history!

<B>Featuring:</B>

<LI>The most complete collection of folk and folk-rock songs ever put together -- 132 classics!

<LI>Original hits by the original artists!

Now I need to remove the tags before displaying this in the output. Is there a simple way to do this?

Thanks...

[1091 byte] By [Naik1983a] at [2007-11-26 17:07:28]
# 1
Use a regular expression that matches HTML tags.
bckrispia at 2007-7-8 23:35:17 > top of Java-index,Java Essentials,Java Programming...
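A minimal sketch of that suggestion in Java, assuming the tags are simple enough that the pattern <[^>]*> catches them (no ">" inside attribute values, no comments or scripts). The class and method names are made up for illustration, and entities such as &amp; are left untouched.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // '<', then any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the text with everything that looks like a tag removed.
    static String stripTags(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String description = "<B>Featuring:</B>\n<LI>Original hits by the original artists!";
        // Prints "Featuring:" and then "Original hits by the original artists!" on the next line.
        System.out.println(stripTags(description));
    }
}
```

Only the tags are removed; any line breaks that surrounded them stay in the string.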
# 2
Thanks... If you don't mind, could you please give me some sample code showing how to write a regex for this? I have never used them in Java before...
Naik1983a at 2007-7-8 23:35:17
# 3
All you need should be right here: http://www.regular-expressions.info/java.html
kevjavaa at 2007-7-8 23:35:17
# 4
Thanks for the link... but it doesn't say anything about HTML tags. I would appreciate it if someone could send me sample code to remove the HTML from the text I submitted above...
Naik1983a at 2007-7-8 23:35:17
# 5
http://txt2re.com has sample code, and is useful in its own right.
mkoryaka at 2007-7-8 23:35:17
# 6
Have we tried googling for regex html tag?
bckrispia at 2007-7-8 23:35:17
# 7
Hi, the link you sent is not opening... Can you please send me the correct link? Thanks.
Naik1983a at 2007-7-8 23:35:17
# 8
I forgot the www. Add that in, and it works.
mkoryaka at 2007-7-8 23:35:17
# 9

> Thanks for the link... but it doesn't say anything
> about HTML tags. I would appreciate it if someone could
> send me sample code to remove the HTML from
> the text I submitted above...

*inserts spoon*

http://www.google.com/search?q=java+strip+html+tags&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a

Link #5

http://examples.oreilly.com/jenut/RemoveHTMLReader.java

kevjavaa at 2007-7-8 23:35:17
# 10

Thank you all for your great help... I was able to remove the HTML tags from the text using a regex.

But after doing that, I am getting blank space in the regions where the HTML was removed.

Here is an example:

twoFREE,packagedinaspecialcollector'sbox

<blank space here........>

Previewthisentirecollectionforthelowexaminationfeeofjust$9.95|9.95|Y

I tried applying the trim function to the string, but the blank space still doesn't go away. Any comments on this?

Thanks again...

Naik1983a at 2007-7-8 23:35:17
# 11
Did you read the documentation of the trim() method, where it describes which whitespace it removes?
DrClapa at 2007-7-8 23:35:17
# 12
Yes, I figured out what I was doing wrong: trim() removes only leading and trailing whitespace. Is there a way to remove the whitespace that exists in the middle of the string?
Naik1983a at 2007-7-8 23:35:17
# 13

> Did you read the documentation of the trim() method,
> where it describes which whitespace it removes?

I believe his problem is that

"Some text here
<blah>
More text"

becomes

"Some text here

More text"

... and he wants ...

"Some text here
More text"

So, your problem is that your regex isn't matching whitespace as well.

See the "Trimming Whitespace" section:

http://www.regular-expressions.info/examples.html

kevjavaa at 2007-7-8 23:35:17
# 14
Yep, you got my question perfectly right... but for some reason it still shows the blank spaces in the middle of the string:

the prophet184 minute|4.99|Y

The function I used was:

out.write(str.replaceAll("^[ \t]+|[ \t]+$", ""));

Any comments? Thanks...
Naik1983a at 2007-7-8 23:35:18
# 15

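The ^ matches the beginning of a line (by default in Java, the beginning of the whole input, unless Pattern.MULTILINE is set).

The $ matches the end of a line (by default, the end of the whole input).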

Thus, your regex will match something like

"<tab>|<tab>"

and replace it with nothing, but won't traverse over a line break. You need to take out those ^ and $ symbols if you want to match beyond the line, then try the whole [ \t\r\n]+ thing.

kevjavaa at 2007-7-21 16:58:05
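A rough sketch of the unanchored version in Java. The class and method names are hypothetical, and it makes one assumption beyond the advice above: each whitespace run is replaced with a single space rather than with nothing, so that the words on either side of a removed tag do not run together.

```java
import java.util.regex.Pattern;

final class WhitespaceCollapser {
    // No ^ or $ anchors, so a run of spaces, tabs, or line breaks
    // is matched wherever it occurs in the string.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("[ \\t\\r\\n]+");

    // Replaces every whitespace run with one space, then trims both ends.
    static String collapse(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String stripped = "Some text here\n\n\nMore text";
        System.out.println(collapse(stripped)); // prints: Some text here More text
    }
}
```

Passing "" instead of " " to replaceAll removes the whitespace entirely, which also deletes the spaces between ordinary words.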
# 16
Thank you so much for the great help :) ... I finally got it right with your suggestion.
Naik1983a at 2007-7-21 16:58:05
# 17
> Thank you so much for the great help :) ... I finally got it
> right with your suggestion.

Cool. Glad it worked out. Have a nice weekend :).
kevjavaa at 2007-7-21 16:58:05
# 18

If you are referring to the link "http://jregex.sourceforge.net/examples-html.html", then look at this piece of code:

    private static String strip(String s) {
        StringBuffer sb = new StringBuffer();
        // Split the input on the tag pattern and keep only the text between tags.
        RETokenizer tok = delimiters.tokenizer(s);
        while (tok.hasMore()) {
            sb.append(tok.nextToken());
            sb.append(" "); // *** this line adds the space
        }
        return sb.toString();
    }

The sb.append(" ") call is what adds the space to your text. If you don't want the space, just remove that line and it works. That way you can drop the extra regular-expression code for removing blank spaces.

I_LOVE_JAVAa at 2007-7-21 16:58:05
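For readers without the jregex library, the same split-and-join idea can be sketched with the standard java.util.regex package. This is an illustration, not the jregex example itself: the class name and the TAG pattern are assumptions standing in for the page's delimiters pattern.

```java
import java.util.regex.Pattern;

final class SplitStripper {
    // Assumed stand-in for the jregex "delimiters" pattern: anything shaped like a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Splits the input on tags and joins the pieces back together.
    static String strip(String s) {
        StringBuilder sb = new StringBuilder();
        for (String token : TAG.split(s)) {
            sb.append(token); // no sb.append(" ") here, so no extra spaces are introduced
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<B>Featuring:</B> original hits")); // prints: Featuring: original hits
    }
}
```

Joining without a separator keeps the text exactly as it was between the tags, so any whitespace that remains was already in the data.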