Capitalizing words in a string
I have written a method using the Pattern class to capitalize the first letter of every word, but it isn't very sophisticated. For example, it breaks words at apostrophe boundaries, creating two words from one.
I was wondering if anyone had written a more sophisticated method I could steal.
I was also wondering what the recommended way to process natural language is, now that we have the BreakIterator, CharacterIterator, Pattern and StringTokenizer classes?
BreakIterator is intended for natural languages, according to its API docs. It also has a static method, BreakIterator.getWordInstance(), which looks like what you are seeking.
Thanks, yes, I know, but how does it decide what a word is? When I tried to use it, it sometimes returned one more than the length of the string it was parsing.
Also, it's not particularly useful. The regular expression stuff seems much better, but I wasn't sure if it was valid for processing any natural language or just English.
Can anyone give me a more definitive answer?
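For what it's worth, here is a minimal sketch of capitalizing with BreakIterator.getWordInstance(). The class and method names are made up for illustration. The boundaries it returns are positions between characters, so the last one is normally text.length(), which is one past the last valid index; that may explain the "one more than the length" value. Whether an apostrophe splits a word is decided by the iterator's rules for the given locale, so it is worth testing on your own data.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of any library.
class BreakIteratorCapitalizer {

    // Upper-cases the first character of every segment that starts with a letter.
    static String capitalize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        StringBuilder out = new StringBuilder(text.length());
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            // Each segment is a word, a run of whitespace, or punctuation.
            String segment = text.substring(start, end);
            char first = segment.charAt(0);
            if (Character.isLetter(first)) {
                out.append(Character.toUpperCase(first)).append(segment.substring(1));
            } else {
                out.append(segment);
            }
        }
        return out.toString();
    }

    // Returns the final boundary, which is an offset and not a character index.
    static int lastBoundary(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        return words.last();
    }

    public static void main(String[] args) {
        System.out.println(capitalize("hello world", Locale.ENGLISH)); // Hello World
        System.out.println(lastBoundary("hello world"));               // 11
    }
}
```

Use the boundaries only as arguments to substring(start, end) and never as an index into charAt().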
As an FYI, it is recommended that String.split() is used instead of StringTokenizer.
> As an FYI, it is recommended that String.split() is used instead of StringTokenizer.
Is this your personal recommendation? I like StringTokenizer.. :)
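It is not only a personal preference: the StringTokenizer API docs describe it as a legacy class and point to String.split() or java.util.regex for new code. A rough sketch of word capitalization with split() follows. The class and method names are invented for the example, and it assumes you are happy to collapse runs of whitespace into single spaces.

```java
// Hypothetical example class; only String.split() itself comes from the JDK.
class SplitCapitalizer {

    // Splits on runs of whitespace and upper-cases the first character of each word.
    static String capitalizeWithSplit(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (String word : s.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens when the input is empty or all whitespace
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWithSplit("  karen's   book ")); // Karen's Book
    }
}
```

Because the split is on whitespace only, an apostrophe never starts a new word here.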
asjfa at 2007-7-14 23:59:03 >

public class ProperCase {
    private ProperCase() {
    }
    public static String convert(String s) {
        char[] chars = s.trim().toLowerCase().toCharArray();
        // found is true while we are inside a word whose first letter is already capitalized
        boolean found = false;
        for (int i = 0; i < chars.length; i++) {
            if (!found && Character.isLetter(chars[i])) {
                chars[i] = Character.toUpperCase(chars[i]);
                found = true;
            } else if (Character.isWhitespace(chars[i])) {
                found = false;
            }
        }
        return String.valueOf(chars);
    }
}
Nice, but it falls over with words containing apostrophes, such as Karen's or it's. This is easy enough to fix, but I was wondering about a definitive syntax.
I'm not sure you've really grasped what a can of worms you are trying to open. Sure, you don't want the "s" in "Karen's" capitalized, but you do want the "R" in "O'Reilly" capitalized.
We have a web application that uses the function that you describe (our database fields are in all-caps but that looks horrible on the web, so we change it to have all words with initial caps only). The code was written by one of our programmers as his very first production Java code, so it is not very sophisticated. But it doesn't need to be, because its subject matter is not arbitrary English-language phrases but just the names of products.
When we put it into production, I sort of cringed at some of its results, such as "Chocolate/vanilla Wafers" where I would have preferred to see "Vanilla". But after I looked at it for a while, I decided not to tinker with it, because for every change I would have made, there was a corresponding situation where that change would make it worse.
I don't know what your subject matter is, but most natural languages have a large collection of little rules that appear arbitrary. So after the "O'Reilly" rule, which is an exception to the "Karen's" rule, you have the "McKinley" rule... these are not the sort of things you can use generic mechanisms to deal with. Unless you're willing to live with the 90% solution.
Actually, it will work fine with the 's. Compile and run the above source and use this test: STRING'S THAT (HAS PARENS) INSIDE OF IT.
Try this as a "test" string: Father O'Reilly gave Mr. McKinley a pamplet titled "Good Neighbors" to read after hearing of his troubles with Mr. MacGregor.
Of course it will fail on stuff like "MacGregor". Very (very) simple class that implements a "ProperCase" algorithm.
In case anyone is interested, below are the results of the test and the code again. Maybe someone can add to my 2 minutes of work and make it better.
// Results
Before: Father O'Reilly gave Mr. McKinley a pamplet titled "Good Neighbors" to read after hearing of his troubles with Mr. MacGregor.
After: Father O'reilly Gave Mr. Mckinley A Pamplet Titled "Good Neighbors" To Read After Hearing Of His Troubles With Mr. Macgregor.
// Code
public class ProperCaseTest {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        String testString = "Father O'Reilly gave Mr. McKinley a pamplet titled \"Good Neighbors\" to read after hearing of his troubles with Mr. MacGregor.";
        System.out.println("Before: " + testString);
        System.out.println("After: " + ProperCase.convert(testString));
    }
}
class ProperCase {
    private ProperCase() {
    }
    public static String convert(String s) {
        char[] chars = s.trim().toLowerCase().toCharArray();
        boolean found = false;
        for (int i = 0; i < chars.length; i++) {
            if (!found && Character.isLetter(chars[i])) {
                chars[i] = Character.toUpperCase(chars[i]);
                found = true;
            } else if (Character.isWhitespace(chars[i])) {
                found = false;
            }
        }
        return String.valueOf(chars);
    }
}
Here's a sort of extensible version. It works on most inputs but can
still be caught out quite easily (although a dictionary-based
StringProcessor could be written).
StringProcessor.java
public interface StringProcessor
{
    /**
     * Examine the character at position pos in the String s. The return
     * value is true if that character should be capitalised.
     */
    public boolean processCharAt(String s, int pos);
}
LineStartSP.java
/**
* string processor that returns true when pos is the first char in the
* String
*/
public class LineStartSP implements StringProcessor
{
    public LineStartSP(){}
    public boolean processCharAt(String s, int pos)
    {
        return pos == 0;
    }
}
SurroundingCharSP.java
/**
* StringProcessor that returns true when the string at the specified
* position matches the matchValue stored in this class at its specified
* offset relative to the position.
*/
public class SurroundingCharSP implements StringProcessor
{
    private String matchValue;
    private int offset;
    private boolean ignoreCase = false;
    public SurroundingCharSP(String matchValue, int offset, boolean ignoreCase)
    {
        super();
        this.matchValue = matchValue;
        this.offset = offset;
        this.ignoreCase = ignoreCase;
    }
    public boolean processCharAt(String s, int pos)
    {
        int start = pos+offset;
        int end = pos+offset+this.matchValue.length();
        // end is exclusive, so it may be equal to s.length()
        if((start>=0)&&(end<=s.length()))
        {
            return s.regionMatches(this.ignoreCase,start,this.matchValue,0,this.matchValue.length());
        }
        return false;
    }
}
CapitaliseWords.java
import java.util.*;
public class CapitaliseWords {
    private LinkedList<StringProcessor> list = new LinkedList<StringProcessor>();
    public CapitaliseWords()
    {
        super();
        addProcessor(new LineStartSP()); // first char in line
        addProcessor(new SurroundingCharSP(" ",-1,true)); // blank space
        addProcessor(new SurroundingCharSP(".",-1,false)); // full stop
        addProcessor(new SurroundingCharSP("\"",-1,false)); // quote start
        addProcessor(new SurroundingCharSP("(",-1,false)); // (
        addProcessor(new SurroundingCharSP(" o'",-3,true)); // O'Reilly rule (not a good match)
        addProcessor(new SurroundingCharSP(" mc",-3,true)); // McDougal rule (not a good match)
        addProcessor(new SurroundingCharSP(" mac",-4,true)); // MacDougal rule (not a good match)
    }
    public boolean addProcessor(StringProcessor sp)
    {
        if(sp==null) throw new IllegalArgumentException("StringProcessor must not be null");
        return this.list.add(sp);
    }
    public String convert(String s)
    {
        StringBuilder output = new StringBuilder(s.length());
        // run through all characters in the String
        for(int i=0;i<s.length();i++)
        {
            boolean capitalise = false;
            // run through all of the registered StringProcessors to see if any matches are found
            for(StringProcessor sp : list)
            {
                if(sp.processCharAt(s,i))
                    capitalise = true;
            }
            if(capitalise) {
                output.append(Character.toUpperCase(s.charAt(i)));
            } else {
                output.append(s.charAt(i));
            }
        }
        return output.toString();
    }
    // test the program
    public static void main(String[] args) {
        String testString = "father o'reilly gave mr. mckinley a pamphlet titled \"good neighbors\" to read after hearing of his troubles with mr. Macgregor. (sod this for a laugh)";
        System.out.println("Before: " + testString);
        CapitaliseWords cw = new CapitaliseWords();
        System.out.println("After: " + cw.convert(testString));
    }
}
Before: father o'reilly gave mr. mckinley a pamphlet titled "good neighbors" to read after hearing of his troubles with mr. Macgregor. (sod this for a laugh)
After: Father O'Reilly Gave Mr. McKinley A Pamphlet Titled "Good Neighbors" To Read After Hearing Of His Troubles With Mr. MacGregor. (Sod This For A Laugh)
As I said earlier, it's easy to fool. Try "macintosh", which is a surname
and an object. But handling capitalisation in this kind of circumstance
requires knowledge of the context in which it is used to determine
its status as a surname or object.
It's based on a sort of very simple blackboard system which could
be beefed up (but to be honest it really isn't worth it).
matfud
Thanks, that's just what I was looking for.
You might want to add a few more rules.
The O'R, Mac and Mc rules only work if the word they are a part of is
preceded by a space (" "). If the word (or name) starts directly after
a full stop (.), a tab, a return, a ", a ( or anything else for that
matter, the capitalisation will not work properly.
Lots of rules, as there are lots of exceptions to each rule.
You should really change the StringProcessor interface such that the
method it defines returns a float value. That value should be in the
range of -1 to 1 depending on the confidence the StringProcessor
implementation places on its decision. Having the range able to be
negative means that you can have anti-rules (or rules that negate other
rules) which could be useful. Your Capitalisation class would then run
through all StringProcessors and sum their results. If the sum is greater
than a specified threshold the class would take action.
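The scoring idea described above could be sketched roughly like this. ScoringProcessor, WeightedCapitaliser, the weights and the 0.5 threshold are all assumptions made up for the illustration; none of them is taken from the posted code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of confidence-based rules; names and weights are invented.
class WeightedCapitaliser {

    // Variant of StringProcessor that returns a confidence in the range -1 to 1.
    interface ScoringProcessor {
        float score(String s, int pos);
    }

    private final List<ScoringProcessor> rules = new ArrayList<>();
    private final float threshold;

    WeightedCapitaliser(float threshold) {
        this.threshold = threshold;
    }

    void addRule(ScoringProcessor rule) {
        rules.add(rule);
    }

    // Capitalises a character when the summed confidence exceeds the threshold.
    String convert(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            float sum = 0f;
            for (ScoringProcessor rule : rules) {
                sum += rule.score(s, i);
            }
            char c = s.charAt(i);
            out.append(sum > threshold ? Character.toUpperCase(c) : c);
        }
        return out.toString();
    }

    static WeightedCapitaliser example() {
        WeightedCapitaliser wc = new WeightedCapitaliser(0.5f);
        wc.addRule((s, pos) -> pos == 0 ? 1f : 0f);                            // start of line
        wc.addRule((s, pos) -> pos > 0 && s.charAt(pos - 1) == ' ' ? 1f : 0f); // after a space
        // Weak "Mac" rule: regionMatches returns false when the offset is negative.
        wc.addRule((s, pos) -> s.regionMatches(true, pos - 4, " mac", 0, 4) ? 0.6f : 0f);
        // Anti-rule: words starting with "mach" (machine) veto the "Mac" rule.
        wc.addRule((s, pos) -> s.regionMatches(true, pos - 4, " mach", 0, 5) ? -1f : 0f);
        return wc;
    }

    public static void main(String[] args) {
        System.out.println(example().convert("the macgregor machine")); // The MacGregor Machine
    }
}
```

The anti-rule pulls the sum for the "h" in "machine" down to -0.4, so it stays in lower case, while the "g" in "macgregor" scores 0.6 and is capitalised.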
It would also be interesting to create a StringProcessor that takes
other StringProcessors as arguments. This would allow you to create
far more complex rules but still keep associated rules together.
For example, a number of the rules in the code I provided are there to
provide for capitalisation directly after a space, ", (, tab, etc. These
could conceivably be added to a single CompoundStringProcessor which would perform those associated rule checks.
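A compound processor along those lines might look something like the sketch below. The interface is repeated so the example compiles on its own; CompoundStringProcessor, precededBy and wordStart are hypothetical names, and the "any child matches" policy is just one possible way to combine the rules.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; only the processCharAt signature comes from the thread.
class CompoundDemo {

    interface StringProcessor {
        boolean processCharAt(String s, int pos);
    }

    // Returns true as soon as any of its child processors returns true.
    static class CompoundStringProcessor implements StringProcessor {
        private final List<StringProcessor> children;

        CompoundStringProcessor(StringProcessor... children) {
            this.children = Arrays.asList(children);
        }

        public boolean processCharAt(String s, int pos) {
            for (StringProcessor child : children) {
                if (child.processCharAt(s, pos)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Rule: the character at pos directly follows the character c.
    static StringProcessor precededBy(char c) {
        return (s, pos) -> pos > 0 && s.charAt(pos - 1) == c;
    }

    // Groups the "start of a word" checks into one compound rule.
    static StringProcessor wordStart() {
        return new CompoundStringProcessor(
                (s, pos) -> pos == 0,
                precededBy(' '), precededBy('\t'), precededBy('"'), precededBy('('));
    }

    public static void main(String[] args) {
        System.out.println(wordStart().processCharAt("(sod this", 1)); // true
        System.out.println(wordStart().processCharAt("(sod this", 2)); // false
    }
}
```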
matfud
You are right about "lots of little rules". The phrase "well-formed XML" displays two of them:
(1) Should you capitalize after a hyphen?
(2) Should you decapitalize acronyms?
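To make those two questions concrete, here is what the simple ProperCase approach posted earlier in this thread does with that phrase. The class name is invented and the method is a copy of the earlier convert() logic, repeated so the example runs on its own.

```java
// Hypothetical demo class; properCase repeats the ProperCase.convert logic from this thread.
class HyphenAcronymDemo {

    static String properCase(String s) {
        char[] chars = s.trim().toLowerCase().toCharArray();
        boolean found = false;
        for (int i = 0; i < chars.length; i++) {
            if (!found && Character.isLetter(chars[i])) {
                chars[i] = Character.toUpperCase(chars[i]);
                found = true;
            } else if (Character.isWhitespace(chars[i])) {
                found = false; // only whitespace starts a new word, a hyphen does not
            }
        }
        return String.valueOf(chars);
    }

    public static void main(String[] args) {
        System.out.println(properCase("well-formed XML")); // Well-formed Xml
    }
}
```

So that code answers "no" to (1) and "yes" to (2), whether or not that is what you want.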
This might be a good topic for those people who post here asking what they should do for their final-year thesis.
Dr Clap,
You are of course correct. This is an insoluble problem. Unless you can
get the program to understand the text, it will never be 100% right. You
could alternatively do a dictionary search of all possible matches. This
still can't work 100% due to multifunction words, or words that are also
acronyms, or mistakes in the original document.
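A very rough sketch of the dictionary search idea is below. DictionaryCapitaliser and addKnown are invented names. It assumes words are separated by whitespace and looks each one up exactly, so a name with punctuation attached, such as "macgregor.", will not be found.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a dictionary lookup with a simple fallback.
class DictionaryCapitaliser {

    // Maps the lower-case form of a word to its preferred spelling.
    private final Map<String, String> known = new HashMap<>();

    void addKnown(String properForm) {
        known.put(properForm.toLowerCase(Locale.ROOT), properForm);
    }

    // Uses the dictionary spelling when there is one, otherwise initial caps.
    String convert(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (String word : s.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            String hit = known.get(word.toLowerCase(Locale.ROOT));
            out.append(hit != null
                    ? hit
                    : Character.toUpperCase(word.charAt(0)) + word.substring(1).toLowerCase(Locale.ROOT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        DictionaryCapitaliser dc = new DictionaryCapitaliser();
        dc.addKnown("McKinley");
        dc.addKnown("XML");
        System.out.println(dc.convert("mr. mckinley likes xml")); // Mr. McKinley Likes XML
    }
}
```

A map like this can only hold one spelling per word, so it cannot tell the surname "Macintosh" from the object.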
To add a bit more flexibility, for those cases where the match could
be one of a number of things, I have just added the following
regular expression based class. It's more difficult to use than the
previous exact match class but should reduce the ruleset required to
implement a functional system (and could improve the performance manyfold).
/**
 * StringProcessor that examines the substring starting at pos+offset and of length matchLength
 * for a match with the specified regular expression (matchExp). Care should be taken that
 * the matchExp only matches expressions of length matchLength. If not, incorrect matches can be
 * made.
 */
public class SurroundingRegExpSP implements StringProcessor
{
    private String matchExp;
    private int offset;
    private int length;
    private boolean ignoreCase = false; // stored but not used by processCharAt
    public SurroundingRegExpSP(String matchExp, int offset, int matchLength, boolean ignoreCase)
    {
        super();
        this.matchExp = matchExp;
        this.offset = offset;
        this.length = matchLength;
        this.ignoreCase = ignoreCase;
    }
    public boolean processCharAt(String s, int pos)
    {
        int start = pos+offset;
        int end = pos+offset+this.length;
        // end is exclusive, so it may be equal to s.length()
        if((start>=0)&&(end<=s.length()))
        {
            return s.substring(start,end).matches(this.matchExp);
        }
        return false;
    }
}
matfud
I should have taken the ignoreCase boolean out of that post. It isn't used.
matfud
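If the flag were kept instead, one way to honour it would be to compile the expression once with Pattern.CASE_INSENSITIVE. This is only a sketch under that assumption. WindowRegexSP and RegexRuleDemo are hypothetical names, and the interface is repeated so the example compiles on its own.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a regex rule that actually uses ignoreCase.
class RegexRuleDemo {

    interface StringProcessor {
        boolean processCharAt(String s, int pos);
    }

    // Matches a fixed-length window, relative to pos, against a precompiled pattern.
    static class WindowRegexSP implements StringProcessor {
        private final Pattern pattern;
        private final int offset;
        private final int length;

        WindowRegexSP(String regex, int offset, int length, boolean ignoreCase) {
            this.pattern = Pattern.compile(regex, ignoreCase ? Pattern.CASE_INSENSITIVE : 0);
            this.offset = offset;
            this.length = length;
        }

        public boolean processCharAt(String s, int pos) {
            int start = pos + offset;
            int end = start + length; // exclusive
            if (start < 0 || end > s.length()) {
                return false;
            }
            return pattern.matcher(s.substring(start, end)).matches();
        }
    }

    // One rule covering "mc" after a space, tab, full stop, quote or bracket.
    static StringProcessor mcRule() {
        return new WindowRegexSP("[\\s.\"(]mc", -3, 3, true);
    }

    public static void main(String[] args) {
        System.out.println(mcRule().processCharAt("(McKinley", 3)); // true
        System.out.println(mcRule().processCharAt("emckinley", 3)); // false
    }
}
```

A single character class here stands in for several of the exact match rules, though a name at the very start of the string is still missed because there is no preceding character to match.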