Regarding conversion of HTML to strings

Hi All,

I need a little help. I have a requirement to ignore HTML tags in strings. Suppose I have a string that contains some HTML tags; I want to ignore those tags and print the resulting string.

String a = "&nbsp; &nbsp; Need<b> dog training</b>";

After ignoring all the HTML tags in the string, the resulting string must be

String a = "Need dog training";

Please guide me on how to achieve this. Is there any class or method that provides this facility?

I apologize for my poor English.

Regards,

Rama

[559 byte] By [RamaDevia] at [2007-11-27 10:21:59]
# 1

There is no function that I know of that will remove tags from your string.

A tag can have any number of attributes, which also have to be dropped from your output (example: <a href="commands" />), so you can't identify all possible tag variations ahead of time. Instead, you have to look for the opening < and the closing >, and ignore those characters and everything between them. The code below does this.

Note, however, that if the text body itself contains a < or >, it will not work correctly.

Note: in the last line below, I removed &nbsp; and '. You will have to chain additional replaceAll() calls onto it to remove any other special strings.

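String x1 = "&nbsp; &nbsp; Need<b> dog training</b>";
String answer;
boolean withinTag = false; // true while scanning between '<' and '>'
StringBuilder str = new StringBuilder();

for (int ii = 0; ii < x1.length(); ++ii) {
    char x2 = x1.charAt(ii);
    if (x2 == '<') {
        withinTag = true;  // a tag starts here
    }
    if (x2 == '>') {
        withinTag = false; // the tag ends here
    }
    // Keep only characters outside tags; skip the closing '>' itself.
    if (!withinTag && x2 != '>') {
        str.append(x2);
    }
}

// Replace the &nbsp; entities and apostrophes with plain spaces.
answer = str.toString().replaceAll("&nbsp;", " ").replaceAll("'", " ");
System.out.println(answer);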
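The note above about chaining more replaceAll() calls could be handled with a small lookup table instead. The following is only a sketch: the class name EntityReplacer, the method replaceEntities, and the choice of entities are assumptions made for illustration, and the table is deliberately incomplete.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EntityReplacer {
    // Hypothetical, incomplete table of named entities and their plain-text forms.
    // LinkedHashMap keeps insertion order, which matters for "&amp;" below.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&nbsp;", " ");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        // Handled last, so that "&amp;lt;" ends up as "&lt;" rather than "<".
        ENTITIES.put("&amp;", "&");
    }

    // Replaces every entity in the table with its plain-text form.
    static String replaceEntities(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            // String.replace works on literal text, so no regex escaping is needed.
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(replaceEntities("&nbsp;Tom &amp; Jerry &lt;3"));
    }
}
```

Run this after the tags have been stripped; otherwise a decoded &lt; or &gt; would be mistaken for the start or end of a tag.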

George123a at 2007-7-28 17:13:16 > top of Java-index,Java Essentials,Java Programming...
# 2

Additional note: when I posted to this forum, it removed & n b s p ; from the first argument of the first replaceAll(). The line should read:

answer = str.toString().replaceAll("&nbsp;", " ").replaceAll("'", " ");

George123a at 2007-7-28 17:13:16
# 3

Just getting rid of HTML tags isn't that difficult: str = str.replaceAll("<[^<>]++>", "");

Those NBSP entities are a separate problem; you probably want to replace them with simple spaces so you can trim() them away if appropriate: str = str.replaceAll("&nbsp;?", " ").trim();

You may also need to normalize any remaining whitespace: str = str.replaceAll("\\s+", " ");
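Put together, the three replacements above might look like the sketch below. The class name HtmlStripper and the method stripHtml are made up for this example, and trim() is applied once at the end rather than right after the NBSP step; the regular expressions are the ones given above.

```java
class HtmlStripper {
    // Applies the three regex steps in order: tags, NBSP entities, whitespace.
    static String stripHtml(String html) {
        // 1. Drop anything that looks like a tag: '<', then non-bracket characters, then '>'.
        String text = html.replaceAll("<[^<>]++>", "");
        // 2. Turn NBSP entities (with or without the trailing ';') into ordinary spaces.
        text = text.replaceAll("&nbsp;?", " ");
        // 3. Collapse runs of whitespace into one space and trim both ends.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Uses the string from the original question.
        System.out.println(stripHtml("&nbsp; &nbsp; Need<b> dog training</b>"));
        // prints "Need dog training"
    }
}
```

Like the character loop earlier in the thread, this assumes the text itself contains no stray < or > characters.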

uncle_alicea at 2007-7-28 17:13:16