String replacing and pattern matching
Hi everybody,
I'm pretty sure my earlier post got deleted in the system maintenance, so I'm going to repost, just to be sure.
I'm writing a version of a grep program, which can take either a string of words separated by spaces or a string containing punctuation marks. What I need to do is compile a pattern that replaces each space with a predefined list of punctuation marks, so that it can match any run of those punctuation marks in that place.
So, as an example:
Query string: This is a test
Should match: This is a test; This, is--a. test; This: is, a test; or any variant, basically, all the words of the query, in order, with any number of punctuation marks in it.
I was looking at java.util.regex, and this is what I have so far:
String query = args[1];
Pattern pattern = Pattern.compile( query.replaceAll(" ","[ ,:?)棑攽抃"]+" ) );
Matcher matcher = pattern.matcher( element );
if( matcher.matches() ){
chapterCounter++;
}
Something about this isn't working, though--I never get matches. Can anyone see where I'm going wrong?
EDIT: I'm realizing that this probably isn't clear, since I haven't gotten responses yet, so I'll try to pare it down to the essentials: Instead of matching a string that contains spaces, I need to match a string of words that can have any number of punctuation marks in place of the simple spaces. The key is that the words are the same, and in the same order.
Thank you!
Jezzica85
[1707 byte] By [jezzica85a] at [2007-11-27 10:09:47]

For one thing, you probably want " +" or even "\\s+" instead of just " ", so that one or more spaces (or one or more of any whitespace) can be replaced by one or more punctuation marks. Also, you might want "\\p{Punct}". Also, "-" in the middle of a [x-y] character class means "x through y". If you want a literal "-", it has to go at the beginning or the end of the class (either position works), or be escaped.
Start with this (it's in beanshell, which is a Java scripting shell.)
bsh % str = "this is a test";
<this is a test>
bsh % tester1 = "this-is-a-test";
<this-is-a-test>
bsh % rep = str.replaceAll(" +", "\\\\p{Punct}\\+");
<this\p{Punct}+is\p{Punct}+a\p{Punct}+test>
bsh % p = Pattern.compile(rep);
<this\p{Punct}+is\p{Punct}+a\p{Punct}+test>
bsh % m = p.matcher(tester1);
<java.util.regex.Matcher[pattern=this\p{Punct}+is\p{Punct}+a\p{Punct}+test region=0,14 lastmatch=]>
bsh % m.matches();
<true>
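For anyone without BeanShell, the session above could be written as plain Java roughly like this. It is only a sketch: the class name PunctDemo and the helper build are made up for illustration. Note that \p{Punct} does not include the space itself, so this version only matches when punctuation sits between the words.

```java
import java.util.regex.Pattern;

class PunctDemo {
    // Hypothetical helper: turn each run of spaces in the query into
    // "one or more ASCII punctuation characters".
    static Pattern build(String query) {
        // In a replacement string, "\\\\" produces one literal backslash.
        String regex = query.replaceAll(" +", "\\\\p{Punct}+");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = build("this is a test");
        System.out.println(p.pattern());
        // prints: this\p{Punct}+is\p{Punct}+a\p{Punct}+test
        System.out.println(p.matcher("this-is-a-test").matches()); // prints: true
        System.out.println(p.matcher("this is a test").matches()); // prints: false
    }
}
```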
Thank you jverd,
I've never used that shell before, so I really don't know what your code means, sorry. In the pattern I put down, that's actually an em dash, not a hyphen, but do you think it could still be causing a problem? I'll try that thing you were suggesting with the spaces and see if it works. Was I instantiating the pattern right?
EDIT: I think I'm still missing something; neither the \\s nor the + signs work.
Jezzica85
> Thank you jverd,
> I've never used that shell before, so I really don't
> know what your code means,
It's Java code. You can ignore the bsh % prompt and the <...> responses.
> put down, that's actually an em dash, not a hyphen,
> but do you think it could still be causing a problem?
I wouldn't think so, no.
Start simple--maybe just comma and colon or something--and make that work (using my code as a starting point). Once that works, add additional punct marks (assuming \p{Punct} doesn't cover the chars you want).
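A rough sketch of that "start simple" approach in plain Java might look like the following. The class and method names are hypothetical, and the choice of separators (space, comma, colon, then period and hyphen) is just an assumption taken from the example strings in the first post. The hyphen goes last in the class so that it is read as a literal character.

```java
import java.util.regex.Pattern;

class SimpleSeparators {
    // Step 1 (hypothetical helper): each run of whitespace in the query
    // may be any mix of space, comma, and colon in the text.
    static Pattern build(String query) {
        return Pattern.compile(query.replaceAll("\\s+", "[ ,:]+"));
    }

    // Step 2: add the period and a literal hyphen (placed last in the class).
    static Pattern buildWithHyphen(String query) {
        return Pattern.compile(query.replaceAll("\\s+", "[ ,:.-]+"));
    }

    public static void main(String[] args) {
        String query = "This is a test";
        System.out.println(build(query).matcher("This: is, a test").matches());            // prints: true
        System.out.println(build(query).matcher("This, is--a. test").matches());           // prints: false
        System.out.println(buildWithHyphen(query).matcher("This, is--a. test").matches()); // prints: true
    }
}
```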
By the way, you might want to play with BeanShell (www.beanshell.org). I find it very handy for quick, interactive tests of Java code.
\p{Punct} only matches punctuation characters in the 7-bit ASCII range. In other words, it will not match the em dash, en dash, curly quotes, or other fancy punctuation characters that Microsoft apps keep trying to sneak into our documents. Even listing those characters explicitly, as you did in your original post, isn't safe, because it requires you to use a certain encoding when you save and compile the source file.
You could list them by their Unicode escapes (\u2018 for Left Single Quote, \u2014 for Em Dash), but you'd probably be better off using the Unicode "punctuation" property, \p{P} (which can also be written as \pP). Combined with jverd's advice, that leaves you with:
Pattern pattern = Pattern.compile( query.replaceAll( "\\s+", "[\\\\s\\\\pP]+" ) );
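Here is a small, self-contained check of that pattern, as a sketch only. The class name and the sample strings are invented for illustration, and the em dash is written as a Unicode escape so the source file's encoding doesn't matter. One more thing that may be worth checking: matches() only succeeds when the whole input matches the pattern. If element in the original program is a longer line or chapter (an assumption, since the thread doesn't say), find() may be the call that's wanted.

```java
import java.util.regex.Pattern;

class UnicodeSeparators {
    // Each run of whitespace in the query becomes "one or more whitespace
    // or Unicode punctuation characters" in the compiled pattern.
    static Pattern build(String query) {
        return Pattern.compile(query.replaceAll("\\s+", "[\\\\s\\\\pP]+"));
    }

    public static void main(String[] args) {
        Pattern p = build("This is a test");
        System.out.println(p.pattern()); // prints: This[\s\pP]+is[\s\pP]+a[\s\pP]+test

        // \u2014 is the em dash, which \p{Punct} would not match by default.
        System.out.println(p.matcher("This, is\u2014a. test").matches()); // prints: true

        // matches() needs the entire input to match; find() looks for the
        // phrase anywhere inside it.
        String longer = "Intro. This: is, a test! More text.";
        System.out.println(p.matcher(longer).matches()); // prints: false
        System.out.println(p.matcher(longer).find());    // prints: true
    }
}
```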