regex: parsing out nested divs

I have a dillemma.

I need to parse out nested divs, but i cant just have an expression like this:

String re ="(?i)(<div class=\"leftNavMenu\" id=\"leftNavMenu\">(.*?))<\\/div>)"

because this div may contain other divs, and a </div> tag from one of those will be matched.

I need to somehow keep track of balanced </div> tags. any ideas?

(other then using an html parser, because i am already waaaaaaaaaaaaaay over the deadline)

[515 byte] By [mkoryaka] at [2007-11-26 16:21:06]
# 1
according my my mastering regular expressions book, it seems like this cannot be done in java using *just* regular expressions. ****.
mkoryaka at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 2
> according my my mastering regular expressions book,> it seems like this cannot be done in java using> *just* regular expressions. ****.Jos will tell you it is because of the Pumping Lemma - http://en.wikipedia.org/wiki/Pumping_lemma .
sabre150a at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 3
oh god, not the pumping lemma. Why do you have to bring that up? i thought i was done and over with that. What i am asking for is possible to do in perl by the way.
mkoryaka at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 4

> oh god, not the pumping lemma. Why do you have to

> bring that up? i thought i was done and over with

> that.

>

> What i am asking for is possible to do in perl by the

> way.

The 'pimping lemon' applies whatever the regex flavour. Java regex is based (uncle_alice will be able to say more) on Perl 5 so if it can't be done in Java it probably can't be done in Perl.

P.S. This is a Java forum.

sabre150a at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 5
pimping lemon, heh thats a new one. i used to call it "humping lemmas"
mkoryaka at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 6

i dunno about pumping lemmas...that article hurt my brain.

but, it looks very much like the problem you have to solve when you are dealing with parathesis in an inline algebraic equation.

you use stacks (at least two) to find out if they are balanced by pushing when you encounter the left and popping when you encounter the right. if you have nothing left in the data stack but you have a single parenthesis in your operator stack...outta balance. there are a few other states that you need to check for but thats the gist.

sorry, i cant remember all the details....

txjumpa at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 7

yeah stacks.. counters, same thing.

here is the finished code, if you see a bug tell me:

package test;

import java.util.regex.Matcher;

import java.util.regex.Pattern;

public class Text {

public Text(){

String text = " shit <div id=blee>"+

"<div>" +

"<div>" +

"text1"+

"</div> test" +

"<div>" +

"text2"+

"</div>" +

"<div>" +

"<div>" +

"text3"+

"</div>" +

"</div>" +

"</div>" +

" </div> <div> bad </div>";

String contents = getDivContents(text);

System.out.println("contents: "+contents);

}

private String getDivContents(String text){

String div = "<div id=blee>";

String textl = text.toLowerCase();

int index = textl.indexOf(div);

int loc = index + div.length();

int end = findBalancedClosingTag(loc, textl, "<div","></div>", 1);

return text.substring(index,end);

}

private int findBalancedClosingTag(int start, String textl, String startTag, String endTag, int unClosedStarts){

int index = textl.indexOf(endTag,start);

if(index == -1){

return start;

}

Pattern p = Pattern.compile(startTag);

Matcher m = p.matcher(textl.substring(start, index));

unClosedStarts --; // we are in this function so we closed ONE start tag

while(m.find()){

unClosedStarts ++;

}

if(unClosedStarts == 0){

return index + endTag.length();

} else {

return findBalancedClosingTag(index + endTag.length(), textl, startTag, endTag, unClosedStarts);

}

}

public static void main(String[] args) {

new Text();

}

}

mkoryaka at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 8

why are you looking for "></div>" as your end tag?

that aside, if i use this string:

String text =

"<div id=blee>" +

"</div>" +

"</div>" +

"</div>";

i think you are returning a -1 for unclosed starts but it doesnt tell me about unopened ends does it?

Message was edited by:

txjump

txjumpa at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 9

> why are you looking for "></div>" as your end tag?

thats a known forum bug. it is actually fine.

> that aside, if i use this string:

>

> > String text =

> "<div id=blee>" +

> "</div>" +

> "</div>" +

> "</div>";

>

>

> i think you are returning a -1 for unclosed starts

> but it doesnt tell me about unopened ends does it?

that returns

<div id=blee>" +

"</div>

and leaves the other ones alone, since it returns as soon as it finds the first balanced div.

i guess i assume well formed html since everyone who writes it sits at most 8 feet away from me and i can yell at them

mkoryaka at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 10
> What i am asking for is possible to do in perl by the> way.Perl 5? How?
jschella at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 11

> > What i am asking for is possible to do in perl by the

> > way.

>

> Perl 5? How?

Note that by definition if you can actually do it in a perl 5 regex then you should be able to do it in java.

One can certainly write a perl routine to do it. One can do the same in java.

jschella at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 12
> I need to parse out nested divs, but i cant just have> an expression like this:By the way it is probably much more efficient to just use a parser.
jschella at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 13

> One can certainly write a perl routine to do it. One

> can do the same in java.

this is how my regex book says its done in perl:

my $LevelN; # this is predeclared because its used in its own definition

$LevelN = qr/ \(([^()] | (?{ LevelN }) )* \) /x;

if you can show me a way to do that in java i will give you all the dukes i have + 1.

mkoryaka at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 14

> > I need to parse out nested divs, but i cant just

> have

> > an expression like this:

>

> By the way it is probably much more efficient to just

> use a parser.

probably. but not if you have a deadline of 2 hours.

If i didnt have deadlines i would write the most magnificent code in the world.. but then i probably wouldnt get paid, and stave to death.

mkoryaka at 2007-7-8 22:44:44 > top of Java-index,Java Essentials,Java Programming...
# 15

> > One can certainly write a perl routine to do it. One

> can do the same in java.

>

> this is how my regex book says its done in perl:

>

>

> my $LevelN;

> $LevelN = qr/ \(([^()] | (?{ LevelN }) )* \) /x;

>

But that isn't just a regular expression is it?

There is a reason that the variable itself is inside the expression.

You can create a java method that recurses on a matched expression.

jschella at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...
# 16
> If i didnt have deadlines i would write the most> magnificent code in the world.. but then i probably> wouldnt get paid, and stave to death.That is why I work to schedules not deadlines.
jschella at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...
# 17

> If i didnt have deadlines i would write the most

> magnificent code in the world.. but then i probably

> wouldnt get paid, and stave to death.

I love it... he has time to post on this forum about how strapped for time he is, but no time to fix his code to Do The Right Thing.

kevjavaa at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...
# 18

Neither Perl nor Java really supports Regular Expressions, strictly defined. In the Perl community, they use the word "regex" to mean "the thing that we do that started out as regular expressions, but does a helluva lot more than any self-respecting regular expression would ever attempt". Perl's regexes are much more powerful than Java's, and always will be.

If you know the div's will only be nested to a certain depth, you can write a nauseatingly long regex to match one level, two levels, ... n levels, but it's not worth the effort. If you only want to match the innermost level, you can do that with a long-but-fairly-elegant regex like "<div\\b[^>]*+>(?:[^<]++|<(?!/?div>))*+</div>"

Note that this regex relies on negative lookaheads and possessive quantifiers, neither of which exist in Regular Expressions.

Edit: left out a closing parenthesis. These purple lemmings running all over the place are so distracting!

uncle_alicea at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...
# 19
yes, i would have used jdom for the parsing....but he wanted an algorithm.hehehe...who all expects well formed html/xml from their coworkers? and, html allows for "non-well formed" tags....but xml doesnt.
txjumpa at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...
# 20
dude, it takes about 5 seconds to post here and complain. it takes MORE then 2 hours to write good code when using an API you have never used. christ.
mkoryaka at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...
# 21
It also takes less time to write a simple parser using regular expressions to tokenize, than to try to solve a problem that demands parsing using only regular expressions.
paulcwa at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...
# 22
well, thats where you guys come in. if someone here could have done this using only regular expressions that would have been rather godly, dont you think?
mkoryaka at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...
# 23

> The 'pimping lemon' applies whatever the regex flavour.

Yes, but it only applies to regular expressions. You can use the pumping lemma to prove that Java's "regexes" aren't. In fact, they're not even context-free. See http://forum.java.sun.com/thread.jspa?forumID=54&threadID=375436

YAT_Archivista at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...
# 24

> > The 'pimping lemon' applies whatever the regex

> flavour.

>

> Yes, but it only applies to regular expressions. You

> can use the pumping lemma to prove that Java's

> "regexes" aren't. In fact, they're not even

> context-free. See

> http://forum.java.sun.com/thread.jspa?forumID=54&threa

> dID=375436

YAT,

So can I tell Jos that he is wrong? I do hope so! :-)

I suspect that at this point in my career I'm not going to find the energy to do do a formal course that covers what 'regexes' are and are not. I shall let you and Jos push the 'Pumping Lemma' and I shall just keep using Java regex in my project that is going to make me a billion.

Sabre

sabre150a at 2007-7-21 16:44:06 > top of Java-index,Java Essentials,Java Programming...