Separating a sentence into words

Hi there. What is the best or most efficient algorithm for separating a string into words? The string contains sentences and phrases, and I would like to have them separated into words. How can I do this manually, without using a native Java method?
[258 byte] By [rock_chica] at [2007-10-2 10:44:05]
# 1

Hi,

It depends on what a word is. Is "it's" one word or two? The most common way is to treat everything between blanks as a word. So this is basically what you need to do:

1) Trim off all whitespace from the beginning and the end of the string.

2) Start looking for a whitespace character.

3) If one is found, the characters before it form a word. Store it (or increase the word counter), then skip all consecutive whitespace until you find a character which isn't whitespace.

4) Go to 2. When you reach the end of the string, the remaining characters form the last word.

You can do all of this using methods in the String class (e.g. charAt(i)).
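
The steps above can be sketched in Java roughly as follows. This is only a sketch, assuming that a "blank" means a space, tab, newline, or carriage return. The names WhitespaceSplitter, isBlank, and splitOnBlanks are made up for illustration. It relies only on charAt, length, and substring from String, plus an ArrayList to hold the result. Instead of trimming first, it skips blanks at the top of each pass, which takes care of leading and trailing blanks the same way.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSplitter {

    // Hypothetical helper: which characters count as a blank is an assumption.
    static boolean isBlank(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    // Returns every maximal run of non-blank characters in text, in order.
    static List<String> splitOnBlanks(String text) {
        List<String> words = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            // Skip a run of blanks (covers leading, trailing, and repeated blanks).
            while (i < n && isBlank(text.charAt(i))) {
                i++;
            }
            if (i == n) {
                break; // only blanks were left
            }
            // Everything up to the next blank (or the end) is one word.
            int start = i;
            while (i < n && !isBlank(text.charAt(i))) {
                i++;
            }
            words.add(text.substring(start, i));
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints: [It's, a, small, world.]
        System.out.println(splitOnBlanks("  It's a   small world.  "));
    }
}
```

Note that punctuation stays attached to the word ("world."), since only blanks are treated as separators here.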

Kaj

kajbja at 2007-7-13 2:54:50 > top of Java-index,Other Topics,Algorithms...
# 2

I always use java.util.StringTokenizer... probably easier than making your own algorithm:

    StringTokenizer st = new StringTokenizer(sentence);
    while (st.hasMoreTokens()) {
        String word = st.nextToken();
        // do what you want with the word
    }

java318a at 2007-7-13 2:54:50
# 3
> without using a native java method?

Haha, my bad... didn't read your whole post.
java318a at 2007-7-13 2:54:50
# 4

One has to split on sequences of non-word characters.

In Java:

    String sentence = "...";
    String[] words = sentence.split("\\W+");

If you want to do it yourself, the difficulty lies in deciding what is part of a word and what is not. With Unicode...
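
To illustrate the "do it yourself" route, here is a rough sketch that treats any run of letters or digits as a word, using Character.isLetterOrDigit to decide what is part of a word. That choice is an assumption, not the only reasonable definition, and the names LetterRunSplitter and splitOnNonLetters are made up. Two things to be aware of when comparing it with split("\\W+"): by default \W in Java regular expressions means anything outside [a-zA-Z_0-9], so accented letters act as separators there, and split returns an empty first element when the sentence starts with a separator. The sketch below avoids both, but it works one char at a time, so characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class LetterRunSplitter {

    // Returns every maximal run of letters/digits in text, in order.
    static List<String> splitOnNonLetters(String text) {
        List<String> words = new ArrayList<>();
        int start = -1; // index where the current word began, or -1 when outside a word
        // Loop one position past the end so a word ending the text is flushed too.
        for (int i = 0; i <= text.length(); i++) {
            boolean wordChar = i < text.length()
                    && Character.isLetterOrDigit(text.charAt(i));
            if (wordChar && start < 0) {
                start = i;                            // a new word begins here
            } else if (!wordChar && start >= 0) {
                words.add(text.substring(start, i));  // the word ended just before i
                start = -1;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints: [Hello, world, It, s, 2007]
        System.out.println(splitOnNonLetters("...Hello, world! It's 2007."));
    }
}
```

As the output shows, "It's" comes out as two words under this definition. If the apostrophe should count as part of a word, the test in the loop is the place to change.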

joop_eggena at 2007-7-13 2:54:50
# 5

Your question is best answered by looking at how the native Java class is used:

    import java.util.StringTokenizer;
    ....
    ....
    StringTokenizer word =
        new StringTokenizer("You have the answer, Spider.", " \n.,");
    while (word.hasMoreTokens()) {
        System.out.println(word.nextToken());
    }

Output:

    You
    have
    the
    answer
    Spider

The string tokenizer is given a sentence and a set of separators, in this case the blank, newline, comma, and period. So the basic idea is:

Start from the beginning of the sentence.

Scan each character and check whether it is one of the specified separators.

If it is, print the characters collected since the previous separator, if there are any.

Continue from the position right after the separator and repeat the same steps until you have gone through the whole sentence. Whatever is left at the end is the last word.
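
A hand-written version of that idea might look like the sketch below. It is not how StringTokenizer is actually implemented, just one way to follow the steps above. The names DelimiterTokenizer, tokenize, and isDelimiter are made up for illustration, and the tokens are collected into a list instead of being printed directly so that the caller can decide what to do with them.

```java
import java.util.ArrayList;
import java.util.List;

final class DelimiterTokenizer {

    // Hypothetical helper: true if c is one of the characters in delimiters.
    static boolean isDelimiter(char c, String delimiters) {
        for (int j = 0; j < delimiters.length(); j++) {
            if (delimiters.charAt(j) == c) {
                return true;
            }
        }
        return false;
    }

    // Splits text at any delimiter character; empty tokens are dropped.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        int start = 0; // position right after the previous separator
        for (int i = 0; i < text.length(); i++) {
            if (isDelimiter(text.charAt(i), delimiters)) {
                if (i > start) {
                    tokens.add(text.substring(start, i));
                }
                start = i + 1;
            }
        }
        // Whatever follows the last separator is the final token.
        if (start < text.length()) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints You, have, the, answer, Spider on separate lines.
        for (String token : tokenize("You have the answer, Spider.", " \n.,")) {
            System.out.println(token);
        }
    }
}
```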

I hope that will help you to implement your own algorithm.

Good luck!

buteForcea at 2007-7-13 2:54:50