Regular Expression help (using string split)

I'm trying to split a text document into sentences and then into words, and I'm struggling to get the right regular expression to make it work.

Basically, I want to split each sentence on a period, followed by one or more spaces, followed by a capital letter. The closest I've got is: "[.]{1}[ ]+[A-Z]{1}" - but this also removes the first capital letter of the next sentence, because split() discards everything the regex matches. Is there a better regex to use, or a way to tell it not to remove certain characters?

[505 byte] By [thoseiona] at [2007-11-27 1:32:21]
# 1
Use a lookahead to identify the capital letter without consuming it:

String[] sentences = text.split("\\.\\s+(?=[A-Z])");
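
For anyone who wants to try the lookahead split in isolation, here is a minimal sketch. The class name SentenceSplitDemo, the helper splitSentences and the sample text are made up for illustration; only the regex comes from the line above.

```java
public class SentenceSplitDemo {
    // Splits on a period plus whitespace, but only where a capital letter follows.
    // The lookahead (?=[A-Z]) checks for the capital without making it part of
    // the delimiter, so split() does not throw it away.
    static String[] splitSentences(String text) {
        return text.split("\\.\\s+(?=[A-Z])");
    }

    public static void main(String[] args) {
        String text = "The cat sat.  It purred. Then it slept.";
        for (String sentence : splitSentences(text)) {
            System.out.println("[" + sentence + "]");
        }
        // prints:
        // [The cat sat]
        // [It purred]
        // [Then it slept.]
    }
}
```

Note that the period is still part of the delimiter, so it is dropped from every sentence except the last one (which has no whitespace after its period). The pattern will also split after abbreviations such as "Mr. Smith", since those look exactly like a sentence boundary to this regex.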
uncle_alicea at 2007-7-12 0:36:45 > top of Java-index,Java Essentials,Java Programming...
# 2

I really, really, really suck at regexes, but I came up with this!

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Test
{
    public static void main(String[] args)
    {
        String line = "Bla Bla Bla? A E I O U. AAAAAH!";
        String regex = "[A-Z][^\\.!\\?]*[\\.!\\?]\\s*";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(line);
        while(m.find())
        {
            System.out.println(m.group());
        }
    }
}

Laugh at it all you want, at least it works. ;-)

CaptainMorgan08a at 2007-7-12 0:36:45
# 3
Superb, thank you - works great!
thoseiona at 2007-7-12 0:36:45
# 4

Next ever so slight problem... :-)

BTW, I went with uncle_alice's approach simply because I was more comfortable using split() over patterns.

I'm now checking each sentence for words, and it works almost perfectly using text.split("\\W"). The only problem is with words that contain an apostrophe (e.g. it's, don't, would've): it counts each of these as two words and I want it to be one.

So basically I want it to split on (and remove) any character that is not a word character (a-z, 0-9), except the apostrophe. Is there an easy way of doing this?

thoseiona at 2007-7-12 0:36:45
# 5

String[] words = sentence.split("[\\W&&[^'-]]+");

&& is the intersection operator, so that regex matches one or more of any non-word character except apostrophe or hyphen (as used, for instance, in the word "non-word").
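
Here is a small sketch of that split in action. The class name WordSplitDemo, the helper splitWords and the sample sentence are only for illustration; the regex is the one above.

```java
public class WordSplitDemo {
    // [\W&&[^'-]] = characters that are non-word AND are not apostrophe or hyphen.
    // The trailing + treats a run such as ", " as a single delimiter.
    static String[] splitWords(String sentence) {
        return sentence.split("[\\W&&[^'-]]+");
    }

    public static void main(String[] args) {
        String sentence = "It's a non-word test, don't you think";
        System.out.println(String.join(" | ", splitWords(sentence)));
        // prints: It's | a | non-word | test | don't | you | think
    }
}
```

Two things to watch for: if the input starts with a delimiter, split() returns an empty string as the first element, and an apostrophe or hyphen that is not inside a word (a quoted 'word', or a dash with spaces around it) stays in the output.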

Actually, Cap's positive-matching approach is better for this kind of thing. That way, you don't really need to treat different punctuation characters differently.

String wordRegex = "\\w+(?:\\p{Punct}\\w+)*";
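
A rough sketch of how that word regex could be used with a Matcher, in the same style as the earlier sentence example. The class name WordMatchDemo, the helper findWords and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordMatchDemo {
    // A word is a run of word characters, optionally followed by groups of
    // "one punctuation character + more word characters" (it's, non-word).
    private static final Pattern WORD = Pattern.compile("\\w+(?:\\p{Punct}\\w+)*");

    // Collects every match instead of splitting on delimiters.
    static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("It's a non-word test, don't you think?"));
        // prints: [It's, a, non-word, test, don't, you, think]
    }
}
```

Because the matcher only collects what it recognises as a word, leading or trailing punctuation never produces empty strings. The trade-off is that any single punctuation character between word characters joins them, so "a,b" or "3.14" comes back as one word.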

uncle_alicea at 2007-7-12 0:36:45