REGEX -- Finding matches in a String up to a certain point
I assume that most people will find this to be a naive question, but my general lack of experience with regular expressions is killing me here. I am about to pull my hair out, so I'm asking for help.
I have a one-line string and I want to replace all occurrences of ***.jar with ***-obf.jar, but ONLY up to a certain point in the string. Here are the specifics -- I have a list of jars:
abc.jar def.jar my_stuff.jar ThirdParty/blah.jar ThirdParty/lib/external-stuff-3.2.jar
I want to replace *.jar with *-obj.jar, but ONLY in "my" jars (i.e. not the ThirdParty jars). Therefore, the REGEX replace needs to work on all jars in the String up to the first 'ThirdParty/...'.
I can't for the life of me figure this out. All of my patterns end up matching some of the ThirdParty jars, or they don't match enough of my jars (i.e. using ^ results in only matching the first jar in the list). Second, during the replace-all, too many groups are being found and I end up with duplicate replacements.
The ONLY thing I have come up with so far that works is the following:
(?<![./-])\b(\w+)\.jar\b
This is based on the following assumptions:
- my jars will be defined by word boundaries
- my jars will not contain the ., /, or - characters
- all ThirdParty jars will at least contain one '/' path-separator character
These are not ideal assumptions. What I really want is a pattern that matches ALL of my jars up until it hits the first ThirdParty string. Can anyone recommend one?
Thank You,
Eric
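For anyone who wants to try the pattern from the post, here is a minimal sketch of applying it with String.replaceAll. The class and method names (WordBoundaryRename, rename) are hypothetical, and the -obf suffix is the one from the first sentence of the question.

```java
class WordBoundaryRename {
    // Pattern from the post: a run of word characters ending in ".jar",
    // not immediately preceded by '.', '/' or '-'.
    static final String PATTERN = "(?<![./-])\\b(\\w+)\\.jar\\b";

    static String rename(String classpath) {
        // $1 is the base name captured by (\w+)
        return classpath.replaceAll(PATTERN, "$1-obf.jar");
    }

    public static void main(String[] args) {
        String jars = "abc.jar def.jar my_stuff.jar ThirdParty/blah.jar "
                + "ThirdParty/lib/external-stuff-3.2.jar";
        // Only the first three entries get the suffix.
        System.out.println(rename(jars));
    }
}
```

Under these assumptions a jar of your own named my-lib.jar would be left untouched, which is exactly the limitation listed in the post.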
# 1
To be honest with you, finding a regex that matches all jars up to the first ThirdParty would be nice, but a more ideal regex would not assume that ThirdParty jars are at the end of the list. The best regex would assume a randomly ordered list of space-separated jars, such that the match would only occur on jar names not preceded by ThirdParty/...
The problem is that the negative lookbehind regex construct, (?<!X), does not work with arbitrary-length patterns, i.e. you can't do:
(?<!(ThirdParty/.*))
Anyway, thanks for any help!
- Eric
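One possible way to get the order-independent behaviour described above without a variable-length lookbehind is to anchor each entry on whitespace and put a negative lookahead at the start of the entry. This is only a sketch and not something proposed in the thread: the class name AnyOrderRename is hypothetical, and it assumes entries are separated by whitespace and contain no spaces themselves.

```java
class AnyOrderRename {
    // (?<!\S)         the entry starts at the beginning or right after whitespace
    // (?!ThirdParty/) ...and does not begin with "ThirdParty/"
    // (\S+)\.jar      the entry's name up to its ".jar" extension
    // (?!\S)          the entry ends at whitespace or at the end of the input
    static final String PATTERN = "(?<!\\S)(?!ThirdParty/)(\\S+)\\.jar(?!\\S)";

    static String rename(String classpath) {
        return classpath.replaceAll(PATTERN, "$1-obf.jar");
    }

    public static void main(String[] args) {
        String jars = "ThirdParty/blah.jar abc.jar "
                + "ThirdParty/lib/external-stuff-3.2.jar my_stuff.jar";
        // ThirdParty entries stay as they are, wherever they appear.
        System.out.println(rename(jars));
    }
}
```

The lookbehind here is a single character wide, so it stays within the fixed-length restriction mentioned above.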
# 2
If regex is so much of a pain for this task, I'd go for the straightforward approach:
StringTokenizer st = new StringTokenizer(stringToParse, " ");
StringBuilder sb = new StringBuilder();
while (st.hasMoreTokens()) {
    sb.append(" ");
    String s = st.nextToken();
    sb.append(s);
    // insert() takes the offset first; 4 is the length of ".jar"
    if (!s.startsWith("ThirdParty")) sb.insert(sb.length() - 4, "-obf");
}
sb.delete(0, 1); // remove leading blank
# 3
At this point I have spent so much time on the REGEX that I really want to know the answer. :)
# 4
> At this point I have spent so much time on the REGEX
> that I really want to know the answer. :)
String updateClasspath = classpath.replaceAll("(?! ThirdParty/.*?.jar)((^| ).*?).jar","$1-obj.jar");
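A minimal, runnable sketch of the replaceAll call above applied to the list from the question. The wrapper class and method names (LookaheadRename, rename) are hypothetical; the pattern and replacement are copied unchanged from the reply, including the unescaped dots.

```java
class LookaheadRename {
    // Pattern from the reply: skip any entry that starts with " ThirdParty/".
    static final String PATTERN = "(?! ThirdParty/.*?.jar)((^| ).*?).jar";

    static String rename(String classpath) {
        // $1 holds the leading space (if any) plus the jar's base name
        return classpath.replaceAll(PATTERN, "$1-obj.jar");
    }

    public static void main(String[] args) {
        String jars = "abc.jar def.jar my_stuff.jar ThirdParty/blah.jar "
                + "ThirdParty/lib/external-stuff-3.2.jar";
        System.out.println(rename(jars));
    }
}
```

Note that the lookahead requires a leading space, so a ThirdParty entry at the very start of the string would still be renamed by this pattern.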
# 5
WOW. That's it. Ugh. I am going to study this closely! I knew it was possible, I just can't believe it was so hard to figure out.
I notice in your response that there may be an extra capture group that is unnecessary. I tried this pattern and it works too, with one fewer capture group:
(?! ThirdParty/.*?.jar)((?:^| ).*?).jar
Thanks VERY much.
- Eric
# 6
Hmm. Can you explain what this part is doing:
((^| ).*?)
# 7
Also, I have never seen the 'negative lookahead' clause come first:
(?! ThirdParty/.*?.jar)
Examples usually show cases where you don't want a character to match if it is followed by the negative lookahead check (not the other way around, like you posed above). My lack of regex experience is showing here, but I'm hoping someone can break down exactly what this pattern is doing.
# 8
I was playing around and I found that this pattern works too:
(?! ThirdParty)((^| ).*?).jar
Does the first part of the pattern (?! ThirdParty) just tell the regex engine not to match if the jar is preceded by the string ThirdParty at any point in the past?
# 9
> Hmm. Can you explain what this part is doing:
>
> ((^| ).*?)
Capturing a space or start of line and the following characters (reluctantly). I suspect that the reluctance will not be really effective because the terminal condition is outside the capture group. If so then there could be a fair amount of backtracking. I can't see a way to make this .*? have a more positive terminal condition. If uncle_alice (the regex Guru) picks this up he may be able to improve on this.
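To see what the two groups actually capture, a small sketch like the following can help. The class and method names (GroupInspector, groups) are hypothetical; the pattern is the one quoted above followed by the .jar suffix used throughout the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupInspector {
    // Returns "[group1|group2]" for every match of ((^| ).*?).jar
    static List<String> groups(String input) {
        Matcher m = Pattern.compile("((^| ).*?).jar").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            // group(2) is the start-of-input (empty) or the single space;
            // group(1) is group(2) plus the reluctantly matched name.
            out.add("[" + m.group(1) + "|" + m.group(2) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [[abc|], [ def| ]]
        System.out.println(groups("abc.jar def.jar"));
    }
}
```

Because group 1 includes the leading space, writing $1 back in the replacement keeps the separators intact.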
# 10
My curiosity is getting the best of me, so I will pose another question. What if I want to match all jars up to, but not including, my_stuff.jar?
abc.jar def.jar my_stuff.jar ThirdParty/blah.jar ThirdParty/lib/external-stuff-3.2.jar
I would have thought this pattern would do it, but it does not:
(?! my_stuff.*?)((^| ).*?).jar
# 11
> I was playing around and I found that this pattern
> works too:
>
> (?! ThirdParty)((^| ).*?).jar
>
> Does the first part of the pattern (?! ThirdParty)
> just tell the regex engine not to match if the jar is
> preceded by the string ThirdParty at any point in the
> past?
Nearly, but I find it difficult to explain!
If one of your jars was named ThirdParty.jar would your solution work?
# 12
> My curiosity is getting the best of me, so I will
> pose another question.
My decorating is waiting! When I get time I will study your question.
# 13
Good point. No, if one of my jars is ThirdParty.jar, the following pattern does not work:
(?! ThirdParty)((^| ).*?).jar
but this DOES:
(?! ThirdParty/)((^| ).*?).jar
in other words, you don't need the rest like this:
(?! ThirdParty/.*?.jar)((^| ).*?).jar
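The difference the trailing slash makes can be checked with a short sketch. The class and method names (SlashMatters, withoutSlash, withSlash) are hypothetical, and the input list containing a ThirdParty.jar of your own is made up for illustration.

```java
class SlashMatters {
    // Lookahead without the slash: also rejects " ThirdParty.jar"
    static String withoutSlash(String cp) {
        return cp.replaceAll("(?! ThirdParty)((^| ).*?).jar", "$1-obj.jar");
    }

    // Lookahead with the slash: only rejects entries under ThirdParty/
    static String withSlash(String cp) {
        return cp.replaceAll("(?! ThirdParty/)((^| ).*?).jar", "$1-obj.jar");
    }

    public static void main(String[] args) {
        String jars = "abc.jar ThirdParty.jar ThirdParty/blah.jar";
        System.out.println(withoutSlash(jars)); // ThirdParty.jar is skipped
        System.out.println(withSlash(jars));    // ThirdParty.jar is renamed
    }
}
```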
# 14
> Good point. No, if one of my jars is ThirdParty.jar,
> the following pattern does not work:
>
> (?! ThirdParty)((^| ).*?).jar
>
> but this DOES:
>
> (?! ThirdParty/)((^| ).*?).jar
>
> in other words, you don't need the rest like this:
>
> (?! ThirdParty/.*?.jar)((^| ).*?).jar
OK, so my solution was not optimal! One does not even have to look for the .*?.jar; one can use
String updateClasspath = classpath.replaceAll("(?! ThirdParty/)((?:^| ).*?).jar","$1-obj.jar");
# 15
So can anyone answer this one... My curiosity is getting the best of me, so I will pose another question. What if I want to match all jars up to, but not including, my_stuff.jar?
abc.jar def.jar my_stuff.jar ThirdParty/blah.jar ThirdParty/lib/external-stuff-3.2.jar
I would have thought this pattern would do it based on the previous discussion, but it does not:
(?! my_stuff.*?)((^| ).*?).jar
# 16
> abc.jar def.jar my_stuff.jar
> ThirdParty/blah.jar
> ThirdParty/lib/external-stuff-3.2.jar
>
> I would have thought this pattern would do it based
> on the previous discussion, but it does not:
>
> (?! my_stuff.*?)((^| ).*?).jar
I have not tested it but I would expect it to match ALL jars except my_stuff.jar. Reasoning - it says match any jar prefixed by a space or start of line UNLESS it starts with a space followed by my_stuff followed by anything.
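The expectation above can be tried out with a short sketch. The class and method names (SkipOneJar, rename) are hypothetical; the pattern is the one quoted, and the -obj.jar replacement is borrowed from the earlier replies.

```java
class SkipOneJar {
    // The lookahead only rejects a match that would START at " my_stuff";
    // it says nothing about entries that come later in the string.
    static String rename(String cp) {
        return cp.replaceAll("(?! my_stuff.*?)((^| ).*?).jar", "$1-obj.jar");
    }

    public static void main(String[] args) {
        String jars = "abc.jar def.jar my_stuff.jar ThirdParty/blah.jar "
                + "ThirdParty/lib/external-stuff-3.2.jar";
        // Every entry except my_stuff.jar gets the suffix.
        System.out.println(rename(jars));
    }
}
```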
# 17
I'm very confused at this point. You say that the negative lookahead construct:
(?! my_stuff.*?)
means: don't match anything that STARTS with that particular pattern. So, the original pattern you suggested:
(?! ThirdParty/)((?:^| ).*?)\.jar
says don't match any *.jar that STARTS with ThirdParty/.*. Ok, I understand that. But then I would expect this pattern:
(?! my_stuff.jar.*)((?:^| ).*?)\.jar
to say don't match any *.jar that starts with my_stuff.jar*, or in other words, don't match any jars AFTER the first occurrence of my_stuff.jar. However, this pattern does NOT do that. It merely serves to NOT match my_stuff.jar.
So I ask again, can you suggest a pattern that says match all jars BEFORE my_stuff.jar, but don't match anything after the first occurrence of my_stuff.jar? I think if I see that pattern I'll finally be able to put this all together.
# 18
> So I ask again, can you suggest a pattern that says
> match all jars BEFORE my_stuff.jar, but don't match
> anything after the first occurance of my_stuff.jar?
> I think if I see that pattern I'll finally be able
> to put this all together.
You seem very confused as to what you want. I thought my original regex solved your problem. Has the requirement changed?
If you want to modify my_stuff.jar as well as the earlier ones then maybe
line.replaceAll("(^| )(.*?)\\.jar(?=.*?my_stuff.jar)|(my_stuff).jar","$1$2$3-obj.jar");
If you want to just modify the ones before my_stuff.jar then maybe
line.replaceAll("(^| )(.*?)\\.jar(?=.*?my_stuff.jar)","$1$2-obj.jar");
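Both suggestions can be exercised with a sketch along these lines. The class and method names (BeforeMarker, beforeOnly, upToAndIncluding) are hypothetical; the patterns and replacements are copied from the two replaceAll calls above.

```java
class BeforeMarker {
    // Rename only entries that still have my_stuff.jar somewhere after them.
    static String beforeOnly(String line) {
        return line.replaceAll(
                "(^| )(.*?)\\.jar(?=.*?my_stuff.jar)", "$1$2-obj.jar");
    }

    // Same, plus a second alternative that renames my_stuff.jar itself.
    // Groups that did not take part in a match expand to nothing.
    static String upToAndIncluding(String line) {
        return line.replaceAll(
                "(^| )(.*?)\\.jar(?=.*?my_stuff.jar)|(my_stuff).jar",
                "$1$2$3-obj.jar");
    }

    public static void main(String[] args) {
        String jars = "abc.jar def.jar my_stuff.jar ThirdParty/blah.jar "
                + "ThirdParty/lib/external-stuff-3.2.jar";
        System.out.println(beforeOnly(jars));
        System.out.println(upToAndIncluding(jars));
    }
}
```

The positive lookahead is what ties each match to its position relative to my_stuff.jar: once the scan has passed that entry, the lookahead can no longer succeed.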
# 19
Ah yes. Thank you! The light bulb finally went on for me. I had a misunderstanding of how negative lookahead worked. I just finally now got it.
If you have:
(?!foo)bar
This will NOT match any occurrence of 'bar' not preceded by foo. This is because negative lookahead will FIRST do the positive match, and THEN do the lookahead. So the above code will find 'bar', then BACKTRACK to the beginning of the match, and then try to find 'foo' (but of course it will find 'bar', and the match will succeed). In order to find a 'bar' not preceded by foo, you would need to do something like:
(?!foo)...bar
This will find the match '...bar', then BACKTRACK to the BEGINNING of the match (i.e. before the ...), and then it will try to find foo.
I had missed this very important point that the negative lookahead backtracks to the beginning of the match, because in all previous examples that I had seen, the negative lookahead was always AFTER the match, like this:
foo(?!bar)
(this would find all foo not followed by 'bar'). So your example was the first time I ever saw the negative lookahead construct (?!X) PRECEDING a match. I would have normally thought you would use a negative lookbehind for that (?<!X), but the problem there is that you can only look behind a fixed number of chars (.* not allowed).
So... now I get it! Thanks very much.
Message was edited by:
oppositereaction
# 20
> Ah yes. Thank you! The light bulb finally went on
> for me. I had a misunderstanding of how negative
> lookahead worked. I just finally now got it.
>
> If you have:
>
> (?!foo)bar
>
> This will NOT match any occurrence of 'bar' not
> preceded by foo.
Yes it will! Try it! I think you still have a problem understanding negative look ahead.
String result = "ooobar".replaceAll("(?!foo)bar","xxx");
System.out.println("[" + result + "]");
result = "foobar".replaceAll("(?!foo)bar","xxx");
System.out.println("[" + result + "]");
Both 'foobar' and 'ooobar' match!
I suspect the light has just gone out!
# 21
UGH! No, I really do understand it. I just MISSPOKE! What I meant to say was, if you do NOT want to match bar if it is preceded by 'foo', then you cannot simply do:
(?!foo)bar
That snippet WILL match 'bar'. You need to do a little more work to prevent a match on 'bar' preceded by 'foo', like this:
(?!foo)...bar
And the reason for that is explained in the second half of my previous post. I just started off that post with the wrong choice of words.
I'm really sorry for the confusion. I think at least the last half of my previous post was without typos. I do understand the topic now, and I do thank you very much for your help and examples.
I'm going to hang up now, before I confuse anyone else. :)
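For reference, the patterns discussed in the last few posts can be compared side by side. A lookahead is tested at the position where it appears in the pattern, so (?!foo) placed first inspects the characters at the start of the candidate match. The sketch below is only an illustration: the class and helper names (LookaroundDemo, r) are hypothetical, and the fixed-length lookbehind (?<!foo)bar is included as an alternative that was mentioned but not tried in the thread.

```java
class LookaroundDemo {
    // Replace every match of regex in input with "xxx".
    static String r(String input, String regex) {
        return input.replaceAll(regex, "xxx");
    }

    public static void main(String[] args) {
        // The lookahead sits where 'bar' starts, so it never sees "foo".
        System.out.println(r("foobar", "(?!foo)bar"));       // fooxxx
        // Three placeholder characters move the lookahead back to 'foo'.
        System.out.println(r("foobar", "(?!foo)...bar"));    // foobar
        System.out.println(r("ooobar", "(?!foo)...bar"));    // xxx
        // A fixed-length negative lookbehind checks what precedes 'bar'.
        System.out.println(r("foobar", "(?<!foo)bar"));      // foobar
        System.out.println(r("ooobar", "(?<!foo)bar"));      // oooxxx
        // The familiar form: 'foo' not followed by 'bar'.
        System.out.println(r("foobar foobaz", "foo(?!bar)")); // foobar xxxbaz
    }
}
```

Note that (?!foo)...bar consumes the three preceding characters as part of the match, whereas the lookbehind version matches only 'bar' itself.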