Regex help
Hi all,
I'm trying to break a string at every "," and ".".
So I have something like Pattern.compile("(\\. )|(, )");
but this is not working well for me. I want to change it so that it matches the "." only if that is the only "." in the given word.
Example:
"Break here. But don't break he.re."
So basically I want to know how to define a pattern that requires no more than 1 "." in a word.
Thanks a lot.
[442 byte] By [tomerg3a] at [2007-11-26 13:19:37]

So, you're saying that you want to find periods that are surrounded by word characters?
Trial and error, lol. http://forum.java.sun.com/thread.jspa?threadID=771359&messageID=4395122
If you define a word as being surrounded by whitespace, then look for a space, followed by zero or more characters, followed by one period, followed by zero or more characters, followed by a space. You'll have to decide what constitutes a "character".
It is not clear to me what you want, but is this it?
String line = "Break here. But, don't break he.re.";
String[] splitLine = line.split(",|(?=\\s)\\.(?!\\w)|(?!\\w)\\.(?=\\s)");
for (String segment : splitLine)
{
    System.out.println("[" + segment + "]");
}
It's not really what I want. If the string is
"Break here. But, don't break he.re. ok"
I get
[Break here]
[ But]
[ don't break he.re]
[ ok]
What I want to get is
[Break here]
[ But]
[ don't break he.re ok]
Thanks for the quick replies
> its not really what I want, if the string is
> "Break here. But, don't break he.re. ok"
> I get
> [Break here]
> [ But]
> [ don't break he.re]
> [ ok]
> What I want to get is
> [Break here]
> [ But]
> [ don't break he.re ok]
Is this last line correct? Should it be
[ don't break he.re.ok] ? If so then
final String[] splitLine = line.split(",|\\.(?=\\s)");
is the best I can do without a more formal definition.
Message was edited by:
sabre150
Sorry, correction:
[Break here]
[ But]
[ don't break he.re. ok]
leaving the "he.re." as is
> Sorry, correction:
> [Break here]
> [ But]
> [ don't break he.re. ok]
> leaving the "he.re." as is
This is what my modified regex does with your test string.
String[] splitLine = line.split(",|\\.(?=\\s)");
Gets me
[Break here]
[ But]
[ don't break he.re]
[ ok]
and I want
[Break here]
[ But]
[ don't break he.re. ok]
Again, thanks for the help
In your test case, is there a space between the 'he.re.' and the 'ok'? If so, then what are the criteria for splitting or not splitting on a full stop?
I want to split on a full stop ONLY if it's the only full stop in that word (the word may be in a foreign language as well),
so I want to break in these cases:
bye.
página.
and I don't want to break in these cases:
U.S.S.R.
殚.uu.
Sorry, but I can't do this in one regex because the required 'look behind' needs to have a maximum length.
You may have more luck if uncle_alice (the regex GURU) picks this up.
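If an upper bound on word length is acceptable, the maximum-length restriction on look-behind can be worked around. The sketch below is only an illustration under that assumption: it caps a word at 40 characters (an arbitrary limit chosen here, not something from the thread), and the class name BoundedLookbehindSplit is made up for the example. It splits on "." or "," that is followed by whitespace or the end of the string, unless an earlier "." occurs in the same word.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class BoundedLookbehindSplit
{
    // ASSUMPTION: no word is longer than 40 characters. Java requires a
    // bounded look-behind, so \S{0,40} stands in for "the rest of the word".
    // (?<!\.\S{0,40})  no earlier '.' inside the current word
    // [.,]             the candidate break character
    // (?=\s|$)         must be followed by whitespace or end of input
    private static final Pattern BREAK =
        Pattern.compile("(?<!\\.\\S{0,40})[.,](?=\\s|$)");

    public static List<String> split(String text)
    {
        return Arrays.asList(BREAK.split(text));
    }

    public static void main(String[] args)
    {
        for (String segment : split("Break here. But, don't break he.re. ok"))
        {
            System.out.println("[" + segment + "]");
        }
    }
}
```

With the test string from this thread, this should give [Break here], [ But] and [ don't break he.re. ok]. Note that a comma after a word that already contains a "." (such as "he.re,") is not treated as a break either; whether that is wanted depends on the requirements.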
I was trying to do something like
\s + anything but a . + \. + \s
but either I wrote it wrong, or it just doesn't work.
No, this is definitely not a job for split(). In a case like this, I usually recommend a positive matching approach, using Matcher.find(), but even that's going to be pretty ugly this time. The code below yields the desired result given your sample data, but I pity the guy who would have to maintain it. You're probably better off writing your own parser using String.charAt() and Character.isLetter() and such.
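import java.util.*;
import java.util.regex.*;

public class Test
{
    public static List<String> breakIt(String str)
    {
        // Group 1: a whole word made of letters and apostrophes, possibly
        //          containing embedded '.' or ',' (as in "he.re").
        // Group 2: participates only if the word has an embedded '.' or ','.
        // The word must be followed by '.' or ',' and then whitespace.
        Pattern p = Pattern.compile("(['\\pL]++([.,]['\\pL]++)*+)[.,]\\s++");
        Matcher m = p.matcher(str);
        List<String> result = new ArrayList<String>();
        int pos = 0;
        while (m.find())
        {
            // Break only if the word had no embedded punctuation.
            if (m.start(2) == -1)
            {
                result.add(str.substring(pos, m.end(1)));
                pos = m.end(0);
            }
        }
        // Whatever is left after the last break is the final segment.
        if (pos < str.length())
        {
            result.add(str.substring(pos));
        }
        return result;
    }

    public static void main(String[] args)
    {
        String test = "Break here. But, don't break he.re. ok";
        for (String s : breakIt(test))
        {
            System.out.printf("[%s]%n", s);
        }
    }
}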
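For comparison, the hand-written parser suggested above could look roughly like the following. This is a minimal sketch, not a drop-in replacement: the class name HandParser is made up for the example, and it treats a "word" as any run of non-whitespace characters (using Character.isWhitespace() rather than Character.isLetter()), which is a simplifying assumption. A word triggers a break only if it ends in "." or "," and contains no other "." or ",".

```java
import java.util.ArrayList;
import java.util.List;

public class HandParser
{
    public static List<String> breakIt(String str)
    {
        List<String> result = new ArrayList<String>();
        int n = str.length();
        int segStart = 0; // start of the segment currently being collected
        int i = 0;
        while (i < n)
        {
            // Skip whitespace in front of the next word.
            while (i < n && Character.isWhitespace(str.charAt(i)))
            {
                i++;
            }
            // Scan one word, counting the '.' and ',' characters in it.
            int wordStart = i;
            int marks = 0;
            while (i < n && !Character.isWhitespace(str.charAt(i)))
            {
                char c = str.charAt(i);
                if (c == '.' || c == ',')
                {
                    marks++;
                }
                i++;
            }
            // Break if the only mark in the word is its last character.
            if (i - wordStart >= 2 && marks == 1)
            {
                char last = str.charAt(i - 1);
                if (last == '.' || last == ',')
                {
                    result.add(str.substring(segStart, i - 1));
                    // The next segment starts at the next non-whitespace char.
                    while (i < n && Character.isWhitespace(str.charAt(i)))
                    {
                        i++;
                    }
                    segStart = i;
                }
            }
        }
        if (segStart < n)
        {
            result.add(str.substring(segStart));
        }
        return result;
    }

    public static void main(String[] args)
    {
        for (String s : breakIt("Break here. But, don't break he.re. ok"))
        {
            System.out.printf("[%s]%n", s);
        }
    }
}
```

It is longer than the regex, but each rule is spelled out in plain code, so changing the definition of a word or of a break character means editing one condition rather than re-deriving a pattern.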
I'm not sure if I've ever seen such ug....ly code!
Good job, uncle_alice ;)
(and please don't do it again!)