Regular Expression help (using string split)
I'm attempting to split up a text document into sentences and then words, and I'm struggling to get the right regular expression to make it work.
Basically, I want to split each sentence based on a period, followed by one or more spaces, followed by a capital letter. The closest I've got is: "[.]{1}[ ]+[A-Z]{1}" - but this removes the first capital letter of the next sentence, because split() discards everything the regex matches. Is there a better regex to use, or a way to tell it not to remove certain characters?
[505 byte] By [thoseiona] at [2007-11-27 1:32:21]
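
One way to keep split() from eating the capital letter is to move the period and the capital into zero-width lookaround groups, so that only the spaces between them count as the delimiter. The following is a minimal sketch under that assumption; the class and method names (SentenceSplitSketch, splitSentences) are made up for illustration, and it is not necessarily the approach anyone in this thread settled on.

```java
class SentenceSplitSketch {
    // (?<=[.]) is a lookbehind: a period must come before, but is not consumed.
    // (?=[A-Z]) is a lookahead: a capital must follow, but is not consumed.
    // Only the run of spaces in between is treated as the delimiter.
    static final String SENTENCE_BREAK = "(?<=[.])[ ]+(?=[A-Z])";

    // Hypothetical helper: splits text into sentences, keeping punctuation.
    static String[] splitSentences(String text) {
        return text.split(SENTENCE_BREAK);
    }

    public static void main(String[] args) {
        String text = "The cat sat. It weighed 3.5 kg.  Then it left.";
        for (String sentence : splitSentences(text)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Note that this still breaks after abbreviations such as "Mr. Smith", since they also look like period, space, capital letter.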

I really, really, really suck at regexes, but I came up with this!
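import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Test
{
    public static void main(String[] args)
    {
        String line = "Bla Bla Bla? A E I O U. AAAAAH!";
        // A capital letter, then anything up to the first '.', '!' or '?',
        // then that terminator and any trailing whitespace.
        String regex = "[A-Z][^\\.!\\?]*[\\.!\\?]\\s*";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(line);
        // Each find() yields one sentence instead of one delimiter.
        while (m.find())
        {
            System.out.println(m.group());
        }
    }
}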
Laugh at it all you want, at least it works. ;-)
Next ever so slight problem... :-)
BTW, I went with uncle_alice's approach simply because I was more comfortable using split() over patterns.
I'm now checking each sentence for words, and it works almost perfectly using text.split("\\W"). The only problem I am having is with words that contain an apostrophe, e.g. it's, don't, would've etc.: it counts these as two words and I only want it to be one.
So basically I want it to split (and remove) when it encounters any character that is not a word character (a-z, 0-9), except the apostrophe. Is there an easy way of doing this?
String[] words = sentence.split("[\\W&&[^'-]]+");
&& is the intersection operator, so that regex matches one or more of any non-word character except apostrophe or hyphen (as used, for instance, in the word "non-word").
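
To see the intersection in action, here is a small sketch that runs the split above on a sample sentence. The class and method names (WordSplitSketch, splitWords) and the sample text are assumptions for illustration only.

```java
import java.util.Arrays;

class WordSplitSketch {
    // \W intersected with [^'-]: a non-word character that is
    // neither an apostrophe nor a hyphen. The trailing + merges
    // runs such as ", " into a single delimiter.
    static final String WORD_BREAK = "[\\W&&[^'-]]+";

    // Hypothetical helper wrapping the split() call from the reply.
    static String[] splitWords(String sentence) {
        return sentence.split(WORD_BREAK);
    }

    public static void main(String[] args) {
        String sentence = "It's a non-word, don't you think?";
        System.out.println(Arrays.toString(splitWords(sentence)));
        // prints [It's, a, non-word, don't, you, think]
    }
}
```

One thing to watch for: if the sentence starts with a delimiter (a leading quote mark or space, say), split() returns an empty string as the first element.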
Actually, Cap's positive-matching approach is better for this kind of thing. That way, you don't really need to treat different punctuation characters differently.
String wordRegex = "\\w+(?:\\p{Punct}\\w+)*";
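
For completeness, here is a rough sketch of how that word regex could be used with a Matcher to collect the words directly rather than splitting on the gaps. The class and method names (WordMatchSketch, findWords) and the sample sentence are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatchSketch {
    // One or more word characters, optionally followed by groups of
    // "single punctuation character + more word characters".
    // Punctuation is only kept when it sits between word characters.
    static final Pattern WORD = Pattern.compile("\\w+(?:\\p{Punct}\\w+)*");

    // Hypothetical helper: returns every match of WORD in order.
    static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("It's a non-word, don't you think?"));
        // prints [It's, a, non-word, don't, you, think]
    }
}
```

Because the pattern accepts any punctuation between word characters, something like "3.5" or "a/b" also comes back as a single token, which may or may not be what you want.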