StringTokenizer or regex?
Hi everybody,
I've written an application that parses a text file and counts words. Basically, it strips all punctuation except hyphens and apostrophes (the straight up-and-down kind, ASCII code 39, not the curly ones that look like closing quotes). I used a StringTokenizer to do it, but I recently learned that this is mostly frowned upon now because StringTokenizer is being phased out. Would it be better to use regex, just as good practice? I really don't know much about regex; I assumed that if I told it to split around punctuation, it would get rid of the hyphens and apostrophes too.
Any advice on this or regex explanation would be appreciated--I'm not going to ever use this particular program outside of my computer, but I'd just like to know, for programming purposes, which method is better in practice and whether it might be a good idea to change my approach.
Thanks,
Jezzica85
By jezzica85a at 2007-11-26 22:28:54

If all you need is to count the number of words, then use the String.split() method. Split on one or more spaces, and the length of the resulting array is the number of words. To use the split method you will need the correct regex. Take a look at the Pattern class in the java.util.regex package.
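A minimal sketch of that suggestion, assuming words are separated by any run of whitespace (the regex `\\s+`). The class and method names here are made up for illustration, and punctuation stays attached to the words:

```java
class WordCountSketch {
    // Counts whitespace-separated tokens; punctuation is NOT removed here.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // "".split("\\s+") yields one empty string, so handle this case explicitly.
            return 0;
        }
        // Trimming first avoids an empty leading token when the text starts with whitespace.
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("  the quick  brown fox ")); // prints 4
    }
}
```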
Actually, I'm using the words for other things too; I guess I should have been more specific. That's why I need the specific splitting of the punctuation. I'm really bad at regex, though, and the regex javadoc was confusing. How do you have it split a file on all punctuation except apostrophes and hyphens (but still split on em dashes and closing quotes)?
Thanks!
Jezzica85
I repeat: take a look at the Pattern class in the java.util.regex package. At the start of this class in the API it explains how to create a regular expression.
I appreciate your repetition, but I don't understand it. I've looked at it and I don't understand how to do this with patterns; that's why I'm posting here.
> ...but I just learned
> recently that most of the time now that's frowned
> upon because StringTokenizer is being phased out.
I would ignore that.
Regexes are inefficient for simple matching, and the way you described your problem suggests that is exactly what you are doing.
Even if that class were removed from the API, external replacements would be created.
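For comparison, here is a rough sketch of the StringTokenizer approach. The delimiter set is an assumption (whitespace plus some common ASCII punctuation), not the original poster's actual list, and the class name is hypothetical. Apostrophe and hyphen are left out of the delimiters so they stay inside words:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerSketch {
    // Assumed delimiter set: whitespace and common ASCII punctuation only.
    // Curly quotes and em dashes would have to be added by hand.
    static final String DELIMS = " \t\n\r\f.,;:!?\"()[]{}";

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        // Every character in DELIMS acts as a separator; runs of them are skipped.
        StringTokenizer tokens = new StringTokenizer(text, DELIMS);
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Don't stop, well-read (friend)!"));
        // prints [Don't, stop, well-read, friend]
    }
}
```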
StringTokenizer is never going to be removed, so if that's all you're worried about, you can relax. However, regexes are very powerful and versatile, and learning about them is definitely worthwhile. Here's a much better introduction to regexes than the Pattern javadoc, which is really just an enumeration of features.
http://www.regular-expressions.info/
Thank you, uncle_alice. I like your handle, by the way.
Jezzica85
Thank you. Just for that: str = str.replaceAll("[\\p{P}&&[^'-]]+", "");
That's fairly advanced stuff, and partly Java-specific, which means you're not likely to grok it just from reading a tutorial. So I'll explain, but don't worry, it won't be on the quiz. ^_^
\p{P} is a predefined [url=http://www.regular-expressions.info/charclass.html]character class[/url] that matches any [url=http://www.regular-expressions.info/unicode.html]Unicode[/url] punctuation character. That's not the same as \p{Punct}, which only matches ASCII punctuation characters (which means it won't match the curly quotes and em-dashes Microsoft tools are so fond of pushing on us).
&& is the set intersection operator, which only works inside a character class (i.e., the outer pair of square brackets). As far as I know, Java's regex flavor is the only one that supports that feature, although some of the others have their own ways of doing set operations on character classes.
Finally, [^'-] matches any character except an apostrophe or a hyphen. All together, it means "one or more (that's the plus sign at the end) of any Unicode punctuation character that isn't an apostrophe or a hyphen".
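To see the expression above in action, here is a small sketch that strips the punctuation and then splits on whitespace. The class and method names are invented for the example, and the whitespace split is an assumption about how the words get separated afterwards:

```java
import java.util.Arrays;

class PunctuationStripSketch {
    // Removes every Unicode punctuation character except ' (U+0027) and - (U+002D).
    static String stripPunctuation(String text) {
        return text.replaceAll("[\\p{P}&&[^'-]]+", "");
    }

    // Strips punctuation, then splits what is left on runs of whitespace.
    static String[] words(String text) {
        String cleaned = stripPunctuation(text).trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        // \u201c and \u201d are curly quotes, \u2014 is an em dash.
        String sample = "Don't stop, well-read \u201cfriend\u201d \u2014 now!";
        System.out.println(Arrays.toString(words(sample)));
        // prints [Don't, stop, well-read, friend, now]
    }
}
```

Note that \p{P} covers punctuation only. Characters that Unicode classifies as symbols, such as + or $, are left in place, so add \p{S} to the class if those should go too.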
Whoa!!!!!!!!!!!!
Thanks a million! That's one of the coolest things I've ever seen!
:) :) :) :) :) :)
(dances a jig)
I like "grok" too, cool word! I'm an amateur writer, so I think I'll definitely have to use it. :) :)
Thank you so much!
Jezzica85
I think I'm in love.
> I like "grok" too, cool word! I'm an amateur writer,
> so I think I'll definitely have to use it. :) :)
You do know [url=http://en.wiktionary.org/wiki/grok]where it's from[/url], don't you? :D
Haha, no, I didn't know. The geek culture is just fine with me; I'm a bioinformatics major in college and I spent almost eight hours in lab today. I'm glad I finally found somebody who doesn't think I'm nuts!
To someone who's read [url=http://www.amazon.com/exec/obidos/ASIN/0441790348/wegrokitcomshein]the book[/url] as many times as I have, all that stuff about "geek culture" sounds like a load of pretentious **** (and the [url=http://en.wikipedia.org/wiki/Grok]Wikipedia article[/url] is even worse). I feel I should warn you, though: if you decide to read it, wait until Spring break or something. People who like it tend to find themselves unable to put it down. (People who don't like it tend to hate it. Heinlein fans usually take this as a sign of poor character, but I think that's a bit too harsh. I believe those who hate Stranger as young persons merely have no brains, while those who hate it as adults have no sense of humor.)
Also, if you do decide to read it (having set aside sufficient time), try to get the uncut edition (circa 2005, I think). It's not the usual "director's cut" kind of thing, where you get to find out exactly why the deleted scenes were deleted. Bigger is definitely better in this case. Why am I going on about Stranger, anyway? Hmm, now that I think about it, advanced sleep deprivation sounds plausible.
