Problem removing HTML tags from the retrieved text
Hi there,
I am using JDBC to connect to the database and retrieve the data. In one of the columns, a few of the records contain HTML tags along with the description. Is there a way to retrieve only the text, ignoring the HTML tags in between? Or can I retrieve the data and then strip the HTML out of the text so that only plain text is displayed?
Here is an example of the retrieved data, which is pipe-separated; one of the columns has HTML tags in it:
209|The euphoria |187945-2|http://www.abc/lst.jsp?mktgChannel=I86023&sku=18791-2&siteID=qpF0HYnRugA|http://www.abc.com/assets/images/product/medium/18793-2_198.jpg|Rooftop Singers: Walk Right In | abc Music proudly presents THE FOLK YEARS, an unforgettable era in music history!
<B>Featuring:</B>
<LI>The most complete collection of folk and folk-rock songs ever put together -- 132 classics!
<LI>Original hits by the original artists!
Now I need to remove the tags before displaying this in the output. Is there a simple way to do this?
Thanks...
[1091 byte] By [Naik1983a] at [2007-11-26 17:07:28]

Use a regular expression that matches HTML tags.
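A minimal sketch of that suggestion, using String.replaceAll from the standard library. The class name TagStripper, the method stripTags, and the pattern <[^>]*> are illustrative assumptions, not something posted in this thread. A pattern like this should cope with simple markup such as the <B> and <LI> tags in the sample data, but it is not a real HTML parser: comments, scripts, and a ">" inside an attribute value would trip it up.

```java
public class TagStripper {
    // Hypothetical helper: removes every substring that looks like <...>.
    public static String stripTags(String html) {
        // "<", then any run of characters that are not ">", then ">"
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String sample = "<B>Featuring:</B>\n<LI>Original hits by the original artists!";
        System.out.println(stripTags(sample));
    }
}
```

Note that only the tags are removed; any line breaks or spaces that surrounded them stay in the text.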
Thanks... If you don't mind, could you please give me some sample code showing how to write a regex for this? I have never used them in Java before.
All you need should be right here: http://www.regular-expressions.info/java.html
Thanks for the link... but it doesn't say anything about HTML tags. I would appreciate it if someone could send me some sample code to remove the HTML from the text I submitted above.
http://txt2re.com has sample code, and is useful in its own right.
Have we tried googling for regex html tag?
Hi, the link you sent is not opening. Can you please send me the correct link? Thanks.
I forgot the www. Add that in, and it works.
> Thanks for the link... but it doesn't say anything
> about HTML tags. I would appreciate it if someone could
> send me some sample code to remove the HTML from
> the text I submitted above.
*inserts spoon*
http://www.google.com/search?q=java+strip+html+tags&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a
Link #5
http://examples.oreilly.com/jenut/RemoveHTMLReader.java
Thank you all for your great help. I was able to remove the HTML tags from the text using a regex.
But after doing that, I am getting blank space in the places where the HTML code was removed.
Here is an example:
twoFREE,packagedinaspecialcollector'sbox
<blank space here........>
Previewthisentirecollectionforthelowexaminationfeeofjust$9.95|9.95|Y
I tried applying the trim function to the string, but the blank space still doesn't go away. Any comments on this?
Thanks again...
Did you read the documentation of the trim() method, where it describes which whitespace it removes?
Yes, I figured out what I was doing wrong: trim() only trims leading and trailing whitespace. Is there a way to trim out the whitespace that exists in the middle of the string?
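To make the trim() behaviour concrete, here is a small sketch; the class name TrimDemo and the method trimOnly are made up for illustration. It only demonstrates that String.trim() leaves interior whitespace, including blank lines, untouched.

```java
public class TrimDemo {
    // Hypothetical wrapper around String.trim(), used only to show its effect.
    public static String trimOnly(String s) {
        // Strips whitespace from the two ends of the string, nothing in between.
        return s.trim();
    }

    public static void main(String[] args) {
        String s = "  Some text here\n\n  More text  ";
        // The blank line and the indentation in the middle are still there.
        System.out.println("[" + trimOnly(s) + "]");
    }
}
```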
> Did you read the documentation of the trim() method,
> where it describes which whitespace it removes?
I believe his problem is that
"Some text here
<blah>
More text"
becomes
"Some text here

More text"
... and he wants ...
"Some text here
More text"
So, your problem is that your regex isn't matching whitespace as well.
See the "Trimming Whitespace" section:
http://www.regular-expressions.info/examples.html
Yep, you got my question exactly right. But for some reason it still shows the blank space in the middle of the string:
the prophet184 minute|4.99|Y
The function I used was:
out.write(str.replaceAll("^[ \t]+|[ \t]+$",""));
Any comments? Thanks.
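By default in Java, the ^ matches only at the beginning of the input; it matches the beginning of each line only when the MULTILINE flag is set.
Likewise, the $ matches only at the end of the input (or just before a final line terminator) by default.
Thus, your regex will match only a run of spaces or tabs at the very start or the very end of the string, something like
"<tab>text<tab>"
and replace those runs with nothing, but it won't traverse a line break, because [ \t] does not include line terminators. You need to take out those ^ and $ symbols if you want to match whitespace in the middle of the text, then try the whole [ \t\r\n]+ thing.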
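One possible way to write that, as a sketch rather than the exact code used in this thread: the class name BlankLineRemover and the method removeBlankLines are hypothetical. Be aware that replacing [ \t\r\n]+ with an empty string would also delete the single spaces between words, so this variant only touches runs of whitespace that contain a line break and replaces each of them with one newline.

```java
public class BlankLineRemover {
    // Hypothetical helper: collapses any whitespace run that contains a line
    // break into a single "\n", leaving ordinary spaces between words alone.
    public static String removeBlankLines(String text) {
        // optional spaces/tabs, one line break, then any further whitespace
        String collapsed = text.replaceAll("[ \\t]*\\r?\\n[ \\t\\r\\n]*", "\n");
        // trim() takes care of whitespace left at the very start or end
        return collapsed.trim();
    }

    public static void main(String[] args) {
        String s = "Some text here\n\nMore text";
        System.out.println(removeBlankLines(s));
    }
}
```

If the pieces should end up on one line instead, the same pattern can be replaced with a single space rather than "\n".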
Thank you so much for the great help :) I finally got it right with your suggestion.
> Thank you so much for the great help :) I finally got it
> right with your suggestion.
Cool. Glad it worked out. Have a nice weekend :).
If you are referring to the link "http://jregex.sourceforge.net/examples-html.html", then
look at this piece of code:
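// "delimiters" is a pattern defined elsewhere in the linked example.
private static String strip(String s){
    StringBuffer sb=new StringBuffer();
    RETokenizer tok=delimiters.tokenizer(s);
    while(tok.hasMore()){
        sb.append(tok.nextToken());
        sb.append(" "); // *** this line adds a space after every token
    }
    return sb.toString();
}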
sb.append(" ") is what adds the space to your text. If you don't want the space, just remove that line and it works. That way you can drop the extra regular-expression code for removing blank spaces.