I assume you're referring to an infix expression. If so, I've worked on this problem before. The actual Java code I created isn't completely debugged, so I can't post it, but here's the description of the algorithm I used. Its highly recursive.
//Define a list of operator characters.
public double parse(string){
/*Scanning left to right, search for the FIRST open parenthesis.
*If none are found, move on. If we do find one, now find the
*LAST close parenthesis. Call parse on everything in between
*the two and substitute this new value in the expression.
*/
//Once we get here, there are no more parenthesis in string.
for(each level of operator precedence, in reverse order){
/* Scan the string from RIGHT to LEFT, searching for the operators
* at this level. When we find one, compute and RETURN the following:
*parse(leftSide) (operator) parse(rightSide)
*/
}
//Once we get here, there aren't any more operators. The string is either
//a constant or a variable. you'll have to decide which. You can do this
//using a look up table - probably a hash table.
return valueOf(string);
}
I haven't done any research on this, so I don't know if there are better solutions are available. But this approach will work. The reason we have to scan strings backward is because the solution is recursive -- the last call to parse actually gets executed first. (We're using a stack.) This is also why we look for operators in reverse order or precedance.
The only problems you'll run into have to deal with implied multiplication. For example, this algorithm will not correctly parse:
2x + 5 OR 2(x) + 5
But it will work for:
2*x + 5
Dealing with those cases is a bit more tricky, and are the current "bugs" I mentioned at the beginning of my post.
Hope this helps,
Bill
Observe:
Parsing an expression is essentially the act of converting the expression from infix notation to postfix notation. Infix requires parens and precedence to specify order of operations but post fix unambiguously specifies them by the order of the tokens
So something like
"a+b*c" becomes "a b c * +"
"a*b+c" becomes "a b * c +"
"(a+b)*c" becomes "a b + c *"
notice that arguments (variables and numbers, the leaves of the implied expression tree) occur in the same sequence in both the infix and the postfix notation. Thus you can view the parsing as a filtering process that passes arguments directly from input to output and all the filter is doing is re-arranging the order of the operators.
All this rearranging of operators can be done with a single stack and it is all based on this observation:
When you saw a string like "a op1 b op2 c" you run across the token for the first operator before you have even collected its right argument. You must put that operator somewhere and a OpStack is the place to put it. You don't know whether to emit op1 until you have had a look at the next operator, op2. If op2 was higher priority than op1, then op2 gets to consume the argument b first. In that case op2 just gets added to the stack on top of op1.
However if op2 was less than or equal in precedence to op1, well then op1 can now be flowed to the output stream and allowed to act on the b argument that is already out on the output stream.
This means that the fundamental operation of the expression parser is the business of pushing an opertor token onto the stack, but first spewing to the output all the other operators on the stack that had a higher or equal priority. If we give this function a name like opStack.pushSpew(token) it could look something like this:
class OpStack{
Stack s = new Stack();
...
void pushSpew(List out, Token t){
Token tos;
while(!s.empty() && (tos = peek()).priority >= t.priority){out.add(pop());}
push(t);
}
}
Next thing to observe is that a grouper, like open and close parens, behaves in the following way. An open paren acts like a very high priority operator when pushed to the stack in that it spews nothing. The previous op is NOT allowed to eat the argument that is about to come right after the open paren. However once the open paren marks the stack, the open should be a lower priority than any real op that gets pushed onto the stack above it because it must delay going out until every real op in the group had its shot. The close paren should also act like a very low priority op. It should spew every op upto but not including the open paren that started this group.
As a computation trick, if you set the priority of an open lowest of all, and the priority of a close just above that, and you just push the open onto the stack instead of pushSpewing it onto the stack, you will get the desired behavior. You with then pushSpew the close paren and it will dump everything from the stack to the output up to but not including the open. At this point it is easy to check the balance. If you do not have an open/close pair sitting as the top two elements on the stack then the parens (or the brackets or the braces) were not properly nested.
We are almost done. There are post fix operators like factorial and prefix operators like minus that we need to deal with. It only makes sense to evaluate these suckers from the inside out i.e. if you had a string of prefixs on an arg
op1 op2 op3 arg
the only sensible order is defined by parens like this
(op1 (op2 (op3 arg) ) )
and in post fix that would be
arg op3 op2 op1
that is exactly what you would get if you just pushed each prefix op onto the stack in the order you saw them and then popped them off after you dumped out the argument.
Post ops should also work from the inside out
arg op1 op2
being
( (arg op1) op2 )
and you get this behavior by just dumping a post op to the output when you see it.
There is only one special concern. What should you do if you saw, both pre ops and post ops on the same arg?
is "op1 arg op2" done like this: "(op1 arg) op2" or "op1 (arg op2)"? Well, why should these be any different from regular diadic ops, let precdence decide. An espression like minus three factorial "-3!" should be interpreted as take the factorial of 3 and then negate it (and not the other way around!) so let the priority of factorial be greater than that of minus and just do the standard pushSpew with post ops to let them spew out any high priority pre ops before they go onto the stack, but then unlike binary ops, the post op immediately comes off the stack and goes to the output because post ops bind up right away.
Now if you are not careful with your precedences you could get "wierd" behavior. If for example you did not assign a high priority to a factorial, you could write expressions like "3+4!" and the priority would treat that like "(3+4)!" the plus would bind tighter than the factorial. It is hard for me to imagine a language where you would want the effects of either a postfix or a prefix operator to span across multiple arguments combined by single diadic operator, but that is really up to you. You certainly could have a language where you gave a very low precedence to ~ which could mean the boolean NOT, a lower precedence than the diadic operator >= which would allow you to write "~ a >= b" which would mean "~(a>=b)"
After all it is your language, do what you want. None the less I would advise against alowing ANY pre or post op to have a priority LESS than that of ANY diadic operator.
Lots of talk here but to recap, the code is quite simple:
while((t = getNextToken()) != null){
switch (t.type()){
case ARG: out.add(t); break; // arguments pass straight to output
case PRE: stack.push(t); break; // pre ops are just pushed (binding them to the next arg)
case OPEN: stack.push(t); break; // Opens just push, they don't spew to output
case DI: stack.pushSpew(out,t); break; // Diadic ops do the standard pushSpew
case POST: stack.pushSpew(out,t); out.add(stack.pop()); break; //spew and then go out
case CLOSE: stack.pushSpew(out,t); stack.removeOCpair(); // spew all ops in group
// and then check if you had a balancing open
}
}
That just about does the entire parse. If you can find the tokens in the input you can rearrange them into the output. You do need to flush the opStack when you are done with the input stream and there is a convenient trick to doing that. The trick is this. Invent an invisible Open and Close pair, invisible meaning that they never show up in any input stream, you just made them up and pretended that that expression started with an invisible OpenExp and at the end it closed with a CloseExp.
The CloseExp at the end will act like any grouper, it will flush the stack looking for the corresponding opener. IF there was any unbalanced grouper, like say an extra unbalanced open paren somewhere in the expression, the CloseExp will spew its way down to that unbalanced paren and then fail when it discovers that it does not match the open in the call to remove the OC pair. The is a way to do error detection. Use a fictional group on the entire expression to both flush the remainder of the opStack and to do a final check for unbalanced groupers.
In fact, while we are on the topic of error checking there is another one that is very good to include. Basically in a real expression if you ignore the parens, the progression should be an alternation between args and diadic ops, with an occasional pre or post op thrown in.
A simple count tells you where you are. Go up to 1 when you pass an argument, go back down to zero when you see a diadic operator. You should start at zero, because there have been no arguments or ops yet. In state zero the only legal things to do are to put in a pre op or an arg. Once you have an arg out there the only legal things to do are to have a post op or a diadic op. This little state machine will tell you if and when you have got either a pair of arguments together with no op between them, or a pair of ops with no argument between them. You should start at state = 0 and you should end at state = 1 (unless you allow empty expressions) So we can beef up the error checking with a very simple addition. The guts of the parse will now look like this:
state = 0
stack.push(Token.OpenExp); // start off with an open of the expression
while((t = getNextToken()) != null){
switch (t.type()){
case ARG: assertStateIs(0); out.add(t); state = 1; break;
case PRE: assertStateIs(0); stack.push(t); break;
case OPEN: stack.push(t); break;
case DI: assertStateIs(1); stack.pushSpew(out,t); state =0; break;
case POST: assertStateIs(1);stack.pushSpew(out,t); out.add(stack.pop()); break;
case CLOSE: stack.pushSpew(out,t); stack.removeOCpair();
}
}
stack.pushSpew(out,Token.CloseExp); stack.removeOCpair();
assertState(1);
return out;
So - there it is, in outline form the way that you parse an expression.You build 3 classes, Token, ExpressionParser, and OpStack
Token is just a bucket that holds a name, a type (ARG, PRE, OPEN, DI, ...) and a precedence.
OpStack is just a wrapper for a reqular stack that will hold Tokens and allow for a pushSpew function.
ExpressionParser is little other than a Map full of all your Tokens. This map allows the parser to map substrings, tokens, that it finds in the input like >= to actual Tokens with a capital T. It will have two main methods:
1) the parse method which I have almost completly written for you. It takes an input string and will fill up an output list placing the Tokens in post fix order, which is your parse tree.
2) the routine getNextToken which will walk the input string, skipping white space when appropriate, looking for names, numbers, and operators and building each chunk that it finds into a Token.
One last complexity note. If you want to allow functional notation like sqrt(3) you are using parens in a fundamentally different way than was described above (ditto when you use brackets to represent array indexing like x[5]) the fundamentally different thing that you are doing is that you are allowing an argument (a name) like "sqrt" to sit right next to another argument like "3" with no binary op in between. You are using an "implied" binary op at that point. You have saved the user from typing "sqrt FUNCTIONCALL (3)" or "x ARRAYINDEX 5"
I will leave it as an exercise to the reader to see how you make a small modification to the OPEN case to detect and deal with an implied operation. Don't forget that you must tell the state error checker what you are doing.
And now that I have shown you how fundamentally simple it is to build an expression parser let me comment on why you do not find a standard library routine for doing this.
The primary reason is that there is no such thing as a "standard" math expression. Parsing is based on a language and there is no one language. Are you going to allow hex numbers in your expressions? How about scientific notation? Is = equality or assignment? Do you allow functional notation? Array indexing? Multi-dimensional array (like x[2,3])? How about string constants like "foobar"? How do you embed quotes in your quoted strings? Are you allowing comments, block comments, statements, code, Lisp expressions, complex numbers, implicit type conversion, type casting? At what point does it stop being a simple expression and start being a full computer language?
Furthermore, most of the work you will quickly see if you use what I have suggested to build an expression parser, it how to give your user feedback that the expression that he typed had problems and where. There is no stardard UI for error feedback in things that don't parse correctly. Are you just going to dump cryptic strings to the console or are you going to highlight something in a text box somewhere?
All the work turns out to be that of defining your language and determining your set of tokens and what precedence you will employ. Once you get that all worked out, parsing the expressions only takes the ten or fifteen lines of code that I outlined.
And look, it only took several thousands of words to explain why it only takes about 15 lines of code to parse expressions once you have decided upon your language and written your tokenizer.
Enjoy
tokenizing like parsing is more or less painful depending on how you define your language.
for example what is a+++b
is that a++ + b or a + ++b
much easier to decide if spaces are required. Even in cases where there is no ambiguity like a+-b it is easier to tokenize if you require spaces between adjacent clusters of symbols that are operators.
So first you define your language to make things easy. Then you observe that quick and dirty tokenization is typically just a matter of dusting off your knowledge about regular expressions and writing (and testing) a few of them.
Consider a language as complicated as java. I believe that you could do a reasonable job if you were only able to recognize the following tokens.
integer
octal integer
hex integer
floating point number
string constant
char constant
end of line comment
block comment
grouper (parens, braces, or brackets)
name
operator (+,*,-,/, >=, += ,etc)
All you do is look to see which of these patterns most closly matches the initial portion of the string and that is your token. Yes, you can get overlap
123.45e-6
looks like a floating point number AND the 123 portion of it looks like an integer as well, but the longer match always wins.
none of those regexps are difficult to write. For example I think that the one for hex input is "0x[0-9a-fA-F]+" I am not really sure because I almost never use regular expressions and thus never remember the details.
Perhaps someone out there that does regexps all the time will show you how to knock out the nine or ten that I listed, but even without help it should only take you an evening to hack out the handful that you might need.
Is this the most efficient way to tokenize? Well it is not efficient for the computer and is a bit like hauling out the sledge hammer to crush an ant, but then again who cares. Far better to have your tokenizer finished in an evening so you can start fixing all the other probelms in your application than it is to spend any time building an "efficient" parser.
Are there other classes that you could you use or other ways to do this? No doubt there are. I am hardly a Java expert. I'm just one of those old dinosaurs that reflexively builds everything from scratch cause that used to be our only option back when we were living in caves and did our programming on plug boards and front panel switches.
Enjoy
You can try JFormula :
http://www.japisoft.com/formula/
Main features (from the site ) :
* Decimal, string, boolean operators (or, and, not, xor...)
* Unicode
* High precision mode
* Boolean expression support : (A<B)&&(B>C), (A or B) and not ( C equals D )
* IF THEN ELSE expression
* Short expression format : 2x+3y
* Variable : A=(cos(PI + x )*2) + [y-x]^2
* Multiple lines expression : A = 1B = A + 1 ...
* Functions with decimal, boolean, string or list arguments
* Override or add your operators
* Extend or add a new library dynamically
* Override any functions from the current library by your one
* Lazy evaluation for boolean expressions
* Evaluation tree produced by a pluggable parsing system
* Evaluation optimization for symbol value changes
* Standard library with 24 mathematical functions
* Use delegates for resolving unknown functions or symbols
* Support for multithreaded computing
* Many samples (library extension, graphes) for API interesting parts
* JDK 1.1 compliant (tested on JDK1.1.8, JDK 1.4 and JDK1.5)
Best regards