Grammar: difference between a top down and bottom up?

What is the difference between a top down and bottom up grammar? An example would be awesome.

First of all, the grammar itself isn't top-down or bottom-up; the parser is (though there are grammars that can be parsed by one but not the other).
From a practical viewpoint, the main difference is that most hand-written parsers are top-down, while a much larger percentage of machine-generated parsers are bottom-up (though, of course, the reverse is certainly possible).
A top-down parser typically uses recursive descent, which means a structure something like this (using ordinary mathematical expressions as an example):
expression() { term() [-+] expression() }
term() { factor() [*/] term() }
factor() { operand() | '(' expression() ')' }
A bottom-up parser works in the reverse direction. Where a recursive descent parser starts from the full expression and breaks it down into smaller and smaller pieces until it reaches the level of individual tokens, a bottom-up parser starts from the individual tokens and uses tables of rules about how those tokens fit together into higher and higher levels of the expression hierarchy until it reaches the top level (what's represented as "expression" above).
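To make the bottom-up direction concrete, here is a minimal sketch in Python. It is a hand-written operator-precedence shift-reduce loop, not the table-driven LR machinery a parser generator would emit, and the names `to_postfix`, `PRECEDENCE` and `reduce_while` are made up for this illustration. Tokens are single characters, as in the pseudo-grammar above. Tokens are shifted onto a stack, and whenever the top of the stack looks like "operand operator operand" and the operator binds at least as tightly as the next token, that group is reduced to a single operand.

```python
PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}

def to_postfix(text):
    """Shift-reduce conversion of single-character infix tokens to postfix."""
    stack = []  # operators and '(' are str; reduced operands are lists of tokens

    def is_operand(item):
        return isinstance(item, list)

    def reduce_while(next_precedence):
        # Reduce "operand operator operand" on top of the stack for as long as
        # the stacked operator binds at least as tightly as what comes next.
        while (len(stack) >= 3
               and is_operand(stack[-1])
               and isinstance(stack[-2], str)
               and stack[-2] in PRECEDENCE
               and PRECEDENCE[stack[-2]] >= next_precedence
               and is_operand(stack[-3])):
            right = stack.pop()
            op = stack.pop()
            left = stack.pop()
            stack.append(left + right + [op])

    tokens = [c for c in text if not c.isspace()]
    for tok in tokens + [None]:            # None marks the end of the input
        if tok is None or tok == ')' or tok in PRECEDENCE:
            reduce_while(PRECEDENCE.get(tok, 0))
            if tok == ')':
                # Reduce "( operand )" to the operand itself.
                if len(stack) < 2 or not is_operand(stack[-1]) or stack[-2] != '(':
                    raise SyntaxError("unbalanced or empty parentheses")
                inner = stack.pop()
                stack.pop()
                stack.append(inner)
            elif tok is not None:
                stack.append(tok)          # shift the operator
        elif tok == '(':
            stack.append(tok)              # shift the open paren
        else:
            stack.append([tok])            # shift a token and treat it as an operand
    if len(stack) != 1 or not is_operand(stack[0]):
        raise SyntaxError("malformed expression")
    return ' '.join(stack[0])

print(to_postfix("1+2*(3+4*(5/6))"))
```

Because this sketch reduces as soon as an equal-precedence operator arrives, it groups `8-3-2` as `(8-3)-2`.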
Edit: To clarify, perhaps it would make sense to add a really trivial parser. In this case, I'll just do the old classic of converting a simplified version of a typical mathematical expression from infix to postfix:
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h> /* isspace */
void expression(void);
void show(int ch) {
putchar(ch);
putchar(' ');
}
int token() {
int ch;
while (isspace(ch=getchar()))
;
return ch;
}
void factor() {
int ch = token();
if (ch == '(') {
expression();
ch = token();
if (ch != ')') {
fprintf(stderr, "Syntax error. Expected close paren, found: %c\n", ch);
exit(EXIT_FAILURE);
}
}
else
show(ch);
}
void term() {
int ch;
factor();
ch = token();
if (ch == '*' || ch == '/') {
term();
show(ch);
}
else
ungetc(ch, stdin);
}
void expression() {
int ch;
term();
ch = token();
if (ch == '-' || ch=='+') {
expression();
show(ch);
}
else
ungetc(ch, stdin);
}
int main(int argc, char **argv) {
expression();
return 0;
}
Note that the lexing here is pretty stupid (it basically just accepts a single character as a token) and the expressions allowed are quite limited (only +-*/). OTOH, it's good enough to handle an input like:
1+2*(3+4*(5/6))
from which it does produce what I believe is correct output:
1 2 3 4 5 6 / * + * +

Afaik it doesn't make any difference for the grammar itself, but it does for the parser.
Wikipedia has a quite lengthy explanation of both bottom-up and top-down parsing.
Generally the (imho) more intuitive way is top-down. You start with the start symbol and apply the transformation rules that fit, while with bottom-up you need to apply transformation rules backwards (which usually created quite a headache for me).

Related

Evaluating a compiled expression tree is sometimes slower

I generate an expression tree containing simple math expressions. The expression types are limited to constants, variables, addition, subtraction, multiplication, division, negation, sqrt and a few trigonometric functions. Variables are like constants, but their values can change.
To evaluate the tree, I iterate from bottom to top and perform the operation indicated by the expression type. This involves switching on the expression type, for example:
for (int i = 0; i < count; ++i) {
ref Expr expr = ref expressions[i];
switch (expr.Type) {
case Op.Addition:
expr.Value = expressions[expr.Operand1].Value + expressions[expr.Operand2].Value;
break;
case ...
}
}
I also evaluate the derivatives using reverse automatic differentiation. This involves iterating the tree from top to bottom and computing the adjoint for each expression. There are simple rules for each expression type:
ref Expr root = ref expressions[expressions.Length - 1];
root.Adjoint = 1.0;
for (int i = expressions.Length - 1; i >= 0; --i) {
ref Expr expr = ref expressions[i];
switch (expr.Type) {
case Op.Addition:
ref Expr left = ref expressions[expr.Operand1];
ref Expr right = ref expressions[expr.Operand2];
left.Adjoint += expr.Adjoint;
right.Adjoint += expr.Adjoint;
break;
case ...
}
}
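The two sweeps described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Expr` record, its field names and the small set of operations (`add`, `mul`, `neg`, `sqrt`) are made up here and are not the asker's actual types. The forward sweep fills in values from the leaves up, and the reverse sweep seeds the root adjoint with 1 and pushes adjoints down to the operands.

```python
import math
from dataclasses import dataclass

@dataclass
class Expr:
    op: str               # 'var', 'const', 'add', 'mul', 'neg' or 'sqrt'
    a: int = -1           # index of the first operand in the tape
    b: int = -1           # index of the second operand (binary ops only)
    value: float = 0.0
    adjoint: float = 0.0

def forward(tape):
    """Bottom-to-top sweep: operands always precede the expressions using them."""
    for e in tape:
        if e.op == 'add':
            e.value = tape[e.a].value + tape[e.b].value
        elif e.op == 'mul':
            e.value = tape[e.a].value * tape[e.b].value
        elif e.op == 'neg':
            e.value = -tape[e.a].value
        elif e.op == 'sqrt':
            e.value = math.sqrt(tape[e.a].value)
        # 'var' and 'const' entries simply keep their stored value

def backward(tape):
    """Top-to-bottom sweep: accumulate d(root)/d(expr) into each adjoint."""
    for e in tape:
        e.adjoint = 0.0
    tape[-1].adjoint = 1.0
    for e in reversed(tape):
        if e.op == 'add':
            tape[e.a].adjoint += e.adjoint
            tape[e.b].adjoint += e.adjoint
        elif e.op == 'mul':
            tape[e.a].adjoint += e.adjoint * tape[e.b].value
            tape[e.b].adjoint += e.adjoint * tape[e.a].value
        elif e.op == 'neg':
            tape[e.a].adjoint -= e.adjoint
        elif e.op == 'sqrt':
            tape[e.a].adjoint += e.adjoint / (2.0 * e.value)

# f(x, y) = x * y + sqrt(x) at x = 4, y = 3
tape = [
    Expr('var', value=4.0),   # 0: x
    Expr('var', value=3.0),   # 1: y
    Expr('mul', 0, 1),        # 2: x * y
    Expr('sqrt', 0),          # 3: sqrt(x)
    Expr('add', 2, 3),        # 4: root
]
forward(tape)
backward(tape)
print(tape[-1].value, tape[0].adjoint, tape[1].adjoint)
```

The `if`/`elif` chain here plays the role of the `switch` in the loops above.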
To avoid the branching, I thought that I'd compile this expression tree by generating IL code. To do this, I again iterate from bottom to top through the tree and emit instructions for calculating the expression values. Similarly, I iterate from top to bottom and emit instructions for calculating the adjoints. I then end up with two big functions that compute the values and adjoints for all expressions without any branching.
var method = new DynamicMethod("Evaluate", typeof(void), new Type[] { typeof(double[]) }, true);
var il = method.GetILGenerator();
for (int i = 0; i < expressions.Length; ++i) {
ref Expr expr = ref expressions[i];
switch (expr.Type) {
case Op.Addition:
il.Emit(OpCodes.Ldarg_0);
il.Emit(OpCodes.Ldc_I4, expr.Index);
il.Emit(OpCodes.Ldarg_0);
il.Emit(OpCodes.Ldc_I4, expr.Operand1);
il.Emit(OpCodes.Ldelem_R8);
il.Emit(OpCodes.Ldarg_0);
il.Emit(OpCodes.Ldc_I4, expr.Operand2);
il.Emit(OpCodes.Ldelem_R8);
il.Emit(OpCodes.Add);
il.Emit(OpCodes.Stelem_R8);
break;
case ...
}
}
il.Emit(OpCodes.Ret);
evaluate = method.CreateDelegate<EvaluateFunction>();
For small expression trees, this proved to be quite effective. The time to evaluate the tree was quite significantly reduced. However, for larger expression trees the compiled evaluation function actually becomes slower.
Why could this be? Clearly, for large expression trees, the compiled function can contain a large number of instructions. Is this then simply due to the size of the instruction cache? There is no branching in the function, so I'd have thought that the instructions could be loaded very efficiently.

Determine types from a variadic function's arguments in C

I'd like a step by step explanation on how to parse the arguments of a variadic function so that when calling va_arg(ap, TYPE); I pass the correct data TYPE of the argument being passed.
Currently I'm trying to code printf.
I am only looking for an explanation preferably with simple examples but not the solution to printf since I want to solve it myself.
Here are three examples which look like what I am looking for:
https://stackoverflow.com/a/1689228/3206885
https://stackoverflow.com/a/5551632/3206885
https://stackoverflow.com/a/1722238/3206885
I know the basics of what typedef, struct, enum and union do but can't figure out some practical application cases like the examples in the links.
What do they really mean? I can't wrap my brain around how they work.
How can I pass the data type from a union to va_arg like in the linked examples? How does it match: with a modifier like %d, %i ..., or with the data type of a parameter?
Here's what I've got so far:
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include "my.h"
typedef struct s_flist
{
char c;
int (*f)();
} t_flist;
int my_printf(char *format, ...)
{
va_list ap;
int i;
int j;
int result;
int arg_count;
char *cur_arg = format;
char *types;
t_flist flist[] =
{
{ 's', &my_putstr },
{ 'i', &my_put_nbr },
{ 'd', &my_put_nbr }
};
i = 0;
result = 0;
types = (char*)malloc( sizeof(*format) * (my_strlen(format) / 2 + 1) );
fparser(types, format);
arg_count = my_strlen(types);
while (format[i])
{
if (format[i] == '%' && format[i + 1])
{
i++;
if (format[i] == '%')
result += my_putchar(format[i]);
else
{
j = 0;
va_start(ap, format);
while (flist[j].c)
{
if (format[i] == flist[j].c)
result += flist[i].f(va_arg(ap, flist[i].DATA_TYPE??));
j++;
}
}
}
result += my_putchar(format[i]);
i++;
}
va_end(ap);
return (result);
}
char *fparser(char *types, char *str)
{
int i;
int j;
i = 0;
j = 0;
while (str[i])
{
if (str[i] == '%' && str[i + 1] &&
str[i + 1] != '%' && str[i + 1] != ' ')
{
i++;
types[j] = str[i];
j++;
}
i++;
}
types[j] = '\0';
return (types);
}
You can't get actual type information from va_list. You can get what you're looking for from format. What it seems you're not expecting is: none of the arguments know what the actual types are, but format represents the caller's idea of what the types should be. (Perhaps a further hint: what would the actual printf do if a caller gave it format specifiers that didn't match the varargs passed in? Would it notice?)
Your code would have to parse the format string for "%" format specifiers, and use those specifiers to branch into reading the va_list with specific hardcoded types. For example, (pseudocode) if (fspec was "%s") { char* str = va_arg(ap, char*); print out str; }. Not giving more detail because you explicitly said you didn't want a complete solution.
You will never have a type as a piece of runtime data that you can pass to va_arg as a value. The second argument to va_arg must be a literal, hardcoded specification referring to a known type at compile time. (Note that va_arg is a macro that gets expanded at compile time, not a function that gets executed at runtime - you couldn't have a function taking a type as an argument.)
A couple of your links suggest keeping track of types via an enum, but this is only for the benefit of your own code being able to branch based on that information; it is still not something that can be passed to va_arg. You have to have separate pieces of code saying literally va_arg(ap, int) and va_arg(ap, char*) so there's no way to avoid a switch or a chain of ifs.
The solution you want to make, using the unions and structs, would start from something like this:
typedef union {
int i;
char *s;
} PRINTABLE_THING;
int print_integer(PRINTABLE_THING pt) {
// format and print pt.i
}
int print_string(PRINTABLE_THING pt) {
// format and print pt.s
}
The two specialized functions would work fine on their own by taking explicit int or char* params; the reason we make the union is to enable the functions to formally take the same type of parameter, so that they have the same signature, so that we can define a single type that means pointer to that kind of function:
typedef int (*print_printable_thing)(PRINTABLE_THING);
Now your code can have an array of function pointers of type print_printable_thing, or an array of structs that have print_printable_thing as one of the structs' fields:
typedef struct {
char format_char;
print_printable_thing printing_function;
} FORMAT_CHAR_AND_PRINTING_FUNCTION_PAIRING;
FORMAT_CHAR_AND_PRINTING_FUNCTION_PAIRING formatters[] = {
{ 'd', print_integer },
{ 's', print_string }
};
int formatter_count = sizeof(formatters) / sizeof(FORMAT_CHAR_AND_PRINTING_FUNCTION_PAIRING);
(Yes, the names are all intentionally super verbose. You'd probably want shorter ones in the real program, or even anonymous types where appropriate.)
Now you can use that array to select the correct formatter at runtime:
for (int i = 0; i < formatter_count; i++)
if (current_format_char == formatters[i].format_char)
result += formatters[i].printing_function(current_printable_thing);
But the process of getting the correct thing into current_printable_thing is still going to involve branching to get to a va_arg(ap, ...) with the correct hardcoded type. Once you've written it, you may find yourself deciding that you didn't actually need either the union or the array of structs.

Lex Yacc / Flex Bison variables

I was just wondering how any of you would implement multi-character variables in C using Flex and Bison / Lex and Yacc?
And if so, can you provide a simple example?
I am attempting to write an interpreter for a language and I can't seem to find a good way to implement variables. So far, the methods I've tried have either failed or caused the execution of any program with a lot of variables to become really slow (it could take minutes to execute a program that just assigns 1000 variables and does nothing else).
Thanks for your time,
Francis
In a lexer provided by ADAIC for Ada, the following method is used; I find it very useful for lexing multi-character tokens such as reserved words and variables. It (along with the corresponding Bison grammar and some other stuff) is available at ADAIC docs
%%
[a-zA-Z](_?[a-zA-Z0-9])* return(lk_keyword(yytext));
%%
# define NUM_KEYWORDS 69
KEY_TABLE key_tab[NUM_KEYWORDS] =
{
{"ABORT", ABORT},
{"ABS", ABS},
....
....
....
};
lk_keyword(str)
char *str;
{
int min;
int max;
int guess, compare;
min = 0;
max = NUM_KEYWORDS-1;
guess = (min + max) / 2;
to_upper(str);
for (guess=(min+max)/2; min<=max; guess=(min+max)/2) {
if ((compare = strcmp(key_tab[guess].kw, str)) < 0) {
min = guess + 1;
} else if (compare > 0) {
max = guess - 1;
} else {
return key_tab[guess].kwv;
}
}
return identifier;
}
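The same idea, sketched in Python for illustration only: one rule matches every word, a binary search over a sorted keyword table decides whether it is a reserved word, and anything else is an identifier. The keyword subset, the `classify` function and the `SymbolTable` class are assumptions made up for this sketch. They are not part of the ADAIC lexer. The symbol table uses a hash map, which is one common way to keep variable lookup cheap no matter how many variables a program assigns.

```python
import bisect

# Hypothetical subset of a keyword table; it must be kept sorted for bisect.
KEYWORDS = sorted(["ABORT", "ABS", "BEGIN", "END", "IF", "THEN"])

def classify(lexeme):
    """Return ('KEYWORD', word) for reserved words, else ('IDENTIFIER', lexeme)."""
    word = lexeme.upper()
    pos = bisect.bisect_left(KEYWORDS, word)   # binary search
    if pos < len(KEYWORDS) and KEYWORDS[pos] == word:
        return ("KEYWORD", word)
    return ("IDENTIFIER", lexeme)

class SymbolTable:
    """Maps variable names to values using a dict (a hash table)."""
    def __init__(self):
        self._values = {}

    def assign(self, name, value):
        self._values[name] = value

    def lookup(self, name):
        if name not in self._values:
            raise NameError("undefined variable: " + name)
        return self._values[name]

table = SymbolTable()
for n in range(1000):
    table.assign("var%d" % n, n)
print(classify("abs"), classify("total_count"), table.lookup("var999"))
```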

What is the "trick" to writing a Quine?

I read Ken Thompson's classic paper Reflections on Trusting Trust in which he prompts users to write a Quine as an introduction to his argument (highly recommended read).
A quine is a computer program which takes no input and produces a copy of its own source code as its only output.
The naive approach is simply to want to say:
print "[insert this program's source here]"
But one quickly sees that this is impossible. I ended up writing one myself using Python but still have trouble explaining "the trick". I'm looking for an excellent explanation of why Quines are possible.
The normal trick is to use printf such that the format-string represents the structure of the program, with a place-holder for the string itself to get the recursion you need:
The standard C example from http://www.nyx.net/~gthompso/quine.htm illustrates this quite well:
char*f="char*f=%c%s%c;main(){printf(f,34,f,34,10);}%c";main(){printf(f,34,f,34,10);}
edit: After writing this, I did a bit of searching: http://www.madore.org/~david/computers/quine.html gives a very good, more theoretical, description of what exactly quines are and why they work.
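For comparison, the same format-string trick can be sketched in Python, where `%r` inserts the quoted representation of the string into itself. The two-line quine is held in a string here (the names `TEMPLATE`, `QUINE_SOURCE` and `run` are made up for this illustration) so that the snippet can run it and check that its output equals its own source.

```python
import contextlib
import io

# The template describes the program's structure, with %r as the
# place-holder for the template's own quoted text.
TEMPLATE = 's = %r\nprint(s %% s)'

# Filling the place-holder with the template itself yields the quine's source.
QUINE_SOURCE = TEMPLATE % TEMPLATE

def run(source):
    """Execute a program given as text and return what it printed."""
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        exec(source, {})
    return captured.getvalue()

print(QUINE_SOURCE)
print(run(QUINE_SOURCE) == QUINE_SOURCE + "\n")
```

The string is used twice: once as data to be quoted, and once as the frame around that quotation.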
Here's one I wrote that uses putchar instead of printf; thus it has to process all its own escape codes. But it is 100% portable across all C execution character sets.
You should be able to see that there is a seam in the text representation that mirrors a seam in the program text itself, where it changes from working on the beginning to working on the end. The trick to writing a Quine is getting over this "hump", where you switch to digging your way out of the hole! Your options are constrained by the textual representation and the language's output facilities.
#include <stdio.h>
void with(char *s) {
for (; *s; s++) switch (*s) {
case '\n': putchar('\\'); putchar('n'); break;
case '\\': putchar('\\'); putchar('\\'); break;
case '\"': putchar('\\'); putchar('\"'); break;
default: putchar(*s);
}
}
void out(char *s) { for (; *s; s++) putchar(*s); }
int main() {
char *a[] = {
"#include <stdio.h>\n\n",
"void with(char *s) {\n",
" for (; *s; s++) switch (*s) {\n",
" case '\\",
"n': putchar('\\\\'); putchar('n'); break;\n",
" case '\\\\': putchar('\\\\'); putchar('\\\\'); break;\n",
" case '\\\"': putchar('\\\\'); putchar('\\\"'); break;\n",
" default: putchar(*s);\n",
" }\n}\n",
"void out(char *s) { for (; *s; s++) putchar(*s); }\n",
"int main() {\n",
" char *a[] = {\n",
NULL }, *b[] = {
"NULL }, **p;\n",
" for (p = a; *p; p++) out(*p);\n",
" for (p = a; *p; p++) {\n",
" putchar('\\\"');\n",
" with(*p);\n",
" putchar('\\\"'); putchar(','); putchar('\\",
"n');\n",
" }\n",
" out(\"NULL }, *b[] = {\\",
"n\");\n",
" for (p = b; *p; p++) {\n",
" putchar('\\\"');\n",
" with(*p);\n",
" putchar('\\\"'); putchar(','); putchar('\\",
"n');\n",
" }\n",
" for (p = b; *p; p++) out(*p);\n",
" return 0;\n",
"}\n",
NULL }, **p;
for (p = a; *p; p++) out(*p);
for (p = a; *p; p++) {
putchar('\"');
with(*p);
putchar('\"'); putchar(','); putchar('\n');
}
out("NULL }, *b[] = {\n");
for (p = b; *p; p++) {
putchar('\"');
with(*p);
putchar('\"'); putchar(','); putchar('\n');
}
for (p = b; *p; p++) out(*p);
return 0;
}
A common trick is to jump start the quine by writing a program to read a textfile and output an array of numbers. Then you modify it to use a static array, and run the first program against the new (static array) program, producing an array of numbers that represents the program. Insert that into the static array, run it again until it settles down, and that gets you a quine. But, it's tied to a specific character set (== not 100% portable). A program like the above (and not the classic printf hack) will work the same on ASCII or EBCDIC (the classic printf hack fails in EBCDIC because it contains hard-coded ASCII).
edit:
Reading the question again, carefully (finally), it appears you're actually looking for more philosophy less technique. The trick that lets you out of the infinite regress is the two-fer. You've got to get both the encoded program and the expanded program out of the same data: using the same data 2 ways. This data thus only describes the part of the program surrounding its future manifestation, the frame. The image within this frame is a straight copy of the original.
This is how you would naturally go about producing a recursive drawing by hand: the tv of a tv of tv. At some point you get tired and just sketch some glare over the screen, because the recursion has been sufficiently established.
edit:
I'm looking for an excellent explanation of why Quines are possible.
The "classic" quine, by W. V. O. Quine, is the sequence of words (IIRC)
yields falsehood when preceded by its quotation
which is a paradox, akin to David's request for something that "makes me happy when sad, and makes me sad when happy" answered by the medallion inscribed on both sides: "this too shall pass".
The same sort of knot was investigated by the pioneers of modern mathematical logic such as Frege, Russell and Whitehead, Łukasiewicz, and of course, our boys Turing, Church and Thue. The trick that makes it possible to transpose the Quine from the realm of wordplay to a programmatic demonstration (untwisting the paradox part along the way) was Gödel's method of encoding the arithmetic operations themselves as numbers, so an entire mathematical expression can be encoded into a single (enormous) integer. In particular, a mathematical function that performs a decoding of this representation can be expressed in the same (numerical) form. This number (a Gödel-encoded function) is both code and data.
This power-trio (Code, Representation, Data) can be transposed to different representations. Choosing a different Representation (or a chain like: bytes -> ASCII -> hexadecimal -> integer) alters the behavior of the Code, which alters the appearance of the Data.

Expression Evaluation and Tree Walking using polymorphism? (ala Steve Yegge)

This morning, I was reading Steve Yegge's: When Polymorphism Fails, when I came across a question that a co-worker of his used to ask potential employees when they came for their interview at Amazon.
As an example of polymorphism in action, let's look at the classic "eval" interview question, which (as far as I know) was brought to Amazon by Ron Braunstein. The question is quite a rich one, as it manages to probe a wide variety of important skills: OOP design, recursion, binary trees, polymorphism and runtime typing, general coding skills, and (if you want to make it extra hard) parsing theory.
At some point, the candidate hopefully realizes that you can represent an arithmetic expression as a binary tree, assuming you're only using binary operators such as "+", "-", "*", "/". The leaf nodes are all numbers, and the internal nodes are all operators. Evaluating the expression means walking the tree. If the candidate doesn't realize this, you can gently lead them to it, or if necessary, just tell them.
Even if you tell them, it's still an interesting problem.
The first half of the question, which some people (whose names I will protect to my dying breath, but their initials are Willie Lewis) feel is a Job Requirement If You Want To Call Yourself A Developer And Work At Amazon, is actually kinda hard. The question is: how do you go from an arithmetic expression (e.g. in a string) such as "2 + (2)" to an expression tree. We may have an ADJ challenge on this question at some point.
The second half is: let's say this is a 2-person project, and your partner, who we'll call "Willie", is responsible for transforming the string expression into a tree. You get the easy part: you need to decide what classes Willie is to construct the tree with. You can do it in any language, but make sure you pick one, or Willie will hand you assembly language. If he's feeling ornery, it will be for a processor that is no longer manufactured in production.
You'd be amazed at how many candidates boff this one.
I won't give away the answer, but a Standard Bad Solution involves the use of a switch or case statment (or just good old-fashioned cascaded-ifs). A Slightly Better Solution involves using a table of function pointers, and the Probably Best Solution involves using polymorphism. I encourage you to work through it sometime. Fun stuff!
So, let's try to tackle the problem all three ways. How do you go from an arithmetic expression (e.g. in a string) such as "2 + (2)" to an expression tree using cascaded-if's, a table of function pointers, and/or polymorphism?
Feel free to tackle one, two, or all three.
[update: title modified to better match what most of the answers have been.]
Polymorphic Tree Walking, Python version
#!/usr/bin/python

class Node:
    """base class, you should not process one of these"""
    def process(self):
        raise NotImplementedError('you should not be processing a node')

class BinaryNode(Node):
    """base class for binary nodes"""
    def __init__(self, _left, _right):
        self.left = _left
        self.right = _right
    def process(self):
        raise NotImplementedError('you should not be processing a binarynode')

class Plus(BinaryNode):
    def process(self):
        return self.left.process() + self.right.process()

class Minus(BinaryNode):
    def process(self):
        return self.left.process() - self.right.process()

class Mul(BinaryNode):
    def process(self):
        return self.left.process() * self.right.process()

class Div(BinaryNode):
    def process(self):
        return self.left.process() / self.right.process()

class Num(Node):
    def __init__(self, _value):
        self.value = _value
    def process(self):
        return self.value

def demo(n):
    print(n.process())

demo(Num(2))                                           # 2
demo(Plus(Num(2), Num(5)))                             # 2 + 5
demo(Plus(Mul(Num(2), Num(3)), Div(Num(10), Num(5))))  # (2 * 3) + (10 / 5)
The tests are just building up the binary trees by using constructors.
program structure:
abstract base class: Node (all Nodes inherit from this class)
abstract base class: BinaryNode (all binary operators inherit from this class; the process method does the work of evaluating the expression and returning the result)
binary operator classes: Plus, Minus, Mul, Div (two child nodes, one each for the left-side and right-side subexpressions)
number class: Num (holds a leaf-node numeric value, e.g. 17 or 42)
The problem, I think, is that we need to parse parentheses, and yet they are not a binary operator. Should we take (2) as a single token that evaluates to 2?
The parens don't need to show up in the expression tree, but they do affect its shape. E.g., the tree for (1+2)+3 is different from 1+(2+3):
    +
   / \
  +   3
 / \
1   2
versus
  +
 / \
1   +
   / \
  2   3
The parentheses are a "hint" to the parser (e.g., per superjoe30, to "recursively descend")
This gets into parsing/compiler theory, which is kind of a rabbit hole... The Dragon Book is the standard text for compiler construction, and takes this to extremes. In this particular case, you want to construct a context-free grammar for basic arithmetic, then use that grammar to parse out an abstract syntax tree. You can then iterate over the tree, reducing it from the bottom up (it's at this point you'd apply the polymorphism/function pointers/switch statement to reduce the tree).
I've found these notes to be incredibly helpful in compiler and parsing theory.
Representing the Nodes
If we want to include parentheses, we need 6 kinds of nodes:
the binary nodes: Add, Minus, Mul, Div; these have two children, a left and a right side
     +
    / \
 node   node
a node to hold a value: Val; no child nodes, just a numeric value
a node to keep track of the parens: Paren; a single child node for the subexpression
 ( )
  |
 node
For a polymorphic solution, we need to have this kind of class relationship:
Node
    BinaryNode : inherits from Node
        Plus : inherits from BinaryNode
        Minus : inherits from BinaryNode
        Mul : inherits from BinaryNode
        Div : inherits from BinaryNode
    Value : inherits from Node
    Paren : inherits from Node
There is a virtual function for all nodes called eval(). If you call that function, it will return the value of that subexpression.
String Tokenizer + LL(1) Parser will give you an expression tree... the polymorphism way might involve an abstract Arithmetic class with an "evaluate(a,b)" function, which is overridden for each of the operators involved (Addition, Subtraction etc) to return the appropriate value, and the tree contains Integers and Arithmetic operators, which can be evaluated by a post(?)-order traversal of the tree.
I won't give away the answer, but a Standard Bad Solution involves the use of a switch or case statment (or just good old-fashioned cascaded-ifs). A Slightly Better Solution involves using a table of function pointers, and the Probably Best Solution involves using polymorphism.
The last twenty years of evolution in interpreters can be seen as going the other way - polymorphism (eg naive Smalltalk metacircular interpreters) to function pointers (naive lisp implementations, threaded code, C++) to switch (naive byte code interpreters), and then onwards to JITs and so on - which either require very big classes, or (in singly polymorphic languages) double-dispatch, which reduces the polymorphism to a type-case, and you're back at stage one. What definition of 'best' is in use here?
For simple stuff a polymorphic solution is OK - here's one I made earlier, but either stack and bytecode/switch or exploiting the runtime's compiler is usually better if you're, say, plotting a function with a few thousand data points.
Hm... I don't think you can write a top-down parser for this without backtracking, so it has to be some sort of a shift-reduce parser. LR(1) or even LALR will of course work just fine with the following (ad-hoc) language definition:
Start -> E1
E1 -> E1+E1 | E1-E1
E1 -> E2*E2 | E2/E2 | E2
E2 -> number | (E1)
Separating it out into E1 and E2 is necessary to maintain the precedence of * and / over + and -.
But this is how I would do it if I had to write the parser by hand:
Two stacks, one storing nodes of the tree as operands and one storing operators.
Read the input left to right, make leaf nodes of the numbers and push them onto the operand stack.
If you have >= 2 operands on the stack, pop 2, combine them with the topmost operator on the operator stack and push this structure back onto the operand stack, unless
the next operator has higher precedence than the one currently on top of the stack.
This leaves us the problem of handling brackets. One elegant (I thought) solution is to store the precedence of each operator as a number in a variable. So initially,
int plus = 1, minus = 1;
int mul = 2, div = 2;
Now every time you see a left bracket, increment all these variables by 2, and every time you see a right bracket, decrement all the variables by 2.
This will ensure that the + in 3*(4+5) has higher precedence than the *, and 3*4 will not be pushed onto the stack. Instead it will wait for 5, push 4+5, then push 3*(4+5).
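A minimal Python sketch of the two-stack scheme described above, under a few assumptions made for illustration: tree nodes are plain tuples `(operator, left, right)`, numbers are non-negative integers, the input is well formed, and the names `parse`, `BASE_PRECEDENCE` and `bump` are made up here. Instead of incrementing four separate variables, the sketch keeps a single `bump` that is added to an operator's base precedence, which has the same effect.

```python
BASE_PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}

def parse(text):
    """Build an expression tree from infix text using two stacks."""
    operands = []    # tree nodes: ints or (operator, left, right) tuples
    operators = []   # (effective precedence, symbol) pairs
    bump = 0         # grows by 2 inside each pair of brackets

    def reduce_top():
        # Pop one operator and two operands, push the combined node back.
        _, symbol = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((symbol, left, right))

    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            operands.append(int(text[i:j]))   # leaf node
            i = j
            continue
        if ch == '(':
            bump += 2
        elif ch == ')':
            bump -= 2
        elif ch in BASE_PRECEDENCE:
            precedence = BASE_PRECEDENCE[ch] + bump
            # Combine what is already stacked, unless the incoming
            # operator has higher precedence than the one on top.
            while operators and operators[-1][0] >= precedence:
                reduce_top()
            operators.append((precedence, ch))
        elif not ch.isspace():
            raise ValueError("unexpected character: " + ch)
        i += 1
    while operators:
        reduce_top()
    return operands[0]

print(parse("3*(4+5)"))
```

With this input, the `+` gets an effective precedence of 3, so `3*4` is never combined and the tree comes out as `('*', 3, ('+', 4, 5))`.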
Re: Justin
I think the tree would look something like this:
  +
 / \
2  ( )
    |
    2
Basically, you'd have an "eval" node, that just evaluates the tree below it. That would then be optimized out to just being:
  +
 / \
2   2
In this case the parens aren't required and don't add anything. They don't add anything logically, so they'd just go away.
I think the question is about how to write a parser, not the evaluator. Or rather, how to create the expression tree from a string.
Case statements that return a base class don't exactly count.
The basic structure of a "polymorphic" solution (which is another way of saying, I don't care what you build this with, I just want to extend it with rewriting the least amount of code possible) is deserializing an object hierarchy from a stream with a (dynamic) set of known types.
The crux of the implementation of the polymorphic solution is to have a way to create an expression object from a pattern matcher, likely recursive. I.e., map a BNF or similar syntax to an object factory.
Or maybe this is the real question:
how can you represent (2) as a BST? That is the part that is tripping me up.
Recursion.
@Justin:
Look at my note on representing the nodes. If you use that scheme, then
2 + (2)
can be represented as
  .
 / \
2  ( )
    |
    2
You should use a functional language, imo. Trees are harder to represent and manipulate in OO languages.
As people have been mentioning previously, when you use expression trees parens are not necessary. The order of operations becomes trivial and obvious when you're looking at an expression tree. The parens are hints to the parser.
While the accepted answer is the solution to one half of the problem, the other half - actually parsing the expression - is still unsolved. Typically, these sorts of problems can be solved using a recursive descent parser. Writing such a parser is often a fun exercise, but most modern tools for language parsing will abstract that away for you.
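As a rough illustration of that other half, here is a small recursive descent parser in Python that turns a string into a tree of node objects and then evaluates it. Everything in it is an assumption made for this sketch: the class names `Num`, `BinOp` and `Parser`, the token regex, and the choice to look operators up in a table from the standard `operator` module rather than writing one class per operator. It handles plain decimal numbers such as 10, 10., 10.123 and .025, but not exponents, suffixes or unary minus.

```python
import operator
import re

TOKEN = re.compile(r"\s*(\d+\.?\d*|\.\d+|[-+*/()])")

class Num:
    def __init__(self, value):
        self.value = value
    def evaluate(self):
        return self.value

class BinOp:
    FUNCTIONS = {"+": operator.add, "-": operator.sub,
                 "*": operator.mul, "/": operator.truediv}
    def __init__(self, symbol, left, right):
        self.symbol, self.left, self.right = symbol, left, right
    def evaluate(self):
        return self.FUNCTIONS[self.symbol](self.left.evaluate(),
                                           self.right.evaluate())

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0
    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None
    def take(self):
        token = self.peek()
        self.pos += 1
        return token
    def expression(self):            # expression := term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            symbol = self.take()
            node = BinOp(symbol, node, self.term())
        return node
    def term(self):                  # term := factor (('*' | '/') factor)*
        node = self.factor()
        while self.peek() in ("*", "/"):
            symbol = self.take()
            node = BinOp(symbol, node, self.factor())
        return node
    def factor(self):                # factor := number | '(' expression ')'
        token = self.take()
        if token == "(":
            node = self.expression()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is None or token in BinOp.FUNCTIONS or token == ")":
            raise SyntaxError("expected a number or '('")
        return Num(float(token))

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return tree

print(parse("2 + (2)").evaluate())
```

The parentheses never become nodes in this sketch; they only steer which subtree `factor` returns.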
The parser is also significantly harder if you allow floating point numbers in your string. I had to create a DFA to accept floating point numbers in C -- it was a very painstaking and detailed task. Remember, valid floating points include: 10, 10., 10.123, 9.876e-5, 1.0f, .025, etc. I assume some dispensation from this (in favor of simplicity and brevity) was made in the interview.
I've written such a parser with some basic techniques like
Infix -> RPN and
Shunting Yard and
Tree Traversals.
Here is the implementation I've come up with.
It's written in C++ and compiles on both Linux and Windows.
Any suggestions/questions are welcomed.
So, let's try to tackle the problem all three ways. How do you go from an arithmetic expression (e.g. in a string) such as "2 + (2)" to an expression tree using cascaded-if's, a table of function pointers, and/or polymorphism?
This is interesting, but I don't think this belongs to the realm of object-oriented programming. I think it has more to do with parsing techniques.
I've kind of chucked this c# console app together as a bit of a proof of concept. Have a feeling it could be a lot better (that switch statement in GetNode is kind of clunky (it's there coz I hit a blank trying to map a class name to an operator)). Any suggestions on how it could be improved very welcome.
using System;
class Program
{
    static void Main(string[] args)
    {
        string expression = "(((3.5 * 4.5) / (1 + 2)) + 5)";
        Console.WriteLine(string.Format("{0} = {1}", expression, new Expression.ExpressionTree(expression).Value));
        Console.WriteLine("\nShow's over folks, press a key to exit");
        Console.ReadKey(false);
    }
}
namespace Expression
{
    // -------------------------------------------------------
    abstract class NodeBase
    {
        public abstract double Value { get; }
    }
    // -------------------------------------------------------
    class ValueNode : NodeBase
    {
        public ValueNode(double value)
        {
            _double = value;
        }
        private double _double;
        public override double Value
        {
            get
            {
                return _double;
            }
        }
    }
    // -------------------------------------------------------
    abstract class ExpressionNodeBase : NodeBase
    {
        protected NodeBase GetNode(string expression)
        {
            // Remove parenthesis
            expression = RemoveParenthesis(expression);
            // Is expression just a number?
            double value = 0;
            if (double.TryParse(expression, out value))
            {
                return new ValueNode(value);
            }
            else
            {
                int pos = ParseExpression(expression);
                if (pos > 0)
                {
                    string leftExpression = expression.Substring(0, pos - 1).Trim();
                    string rightExpression = expression.Substring(pos).Trim();
                    switch (expression.Substring(pos - 1, 1))
                    {
                        case "+":
                            return new Add(leftExpression, rightExpression);
                        case "-":
                            return new Subtract(leftExpression, rightExpression);
                        case "*":
                            return new Multiply(leftExpression, rightExpression);
                        case "/":
                            return new Divide(leftExpression, rightExpression);
                        default:
                            throw new Exception("Unknown operator");
                    }
                }
                else
                {
                    throw new Exception("Unable to parse expression");
                }
            }
        }
        private string RemoveParenthesis(string expression)
        {
            if (expression.Contains("("))
            {
                expression = expression.Trim();
                int level = 0;
                int pos = 0;
                foreach (char token in expression.ToCharArray())
                {
                    pos++;
                    switch (token)
                    {
                        case '(':
                            level++;
                            break;
                        case ')':
                            level--;
                            break;
                    }
                    if (level == 0)
                    {
                        break;
                    }
                }
                if (level == 0 && pos == expression.Length)
                {
                    expression = expression.Substring(1, expression.Length - 2);
                    expression = RemoveParenthesis(expression);
                }
            }
            return expression;
        }
        private int ParseExpression(string expression)
        {
            int winningLevel = 0;
            byte winningTokenWeight = 0;
            int winningPos = 0;
            int level = 0;
            int pos = 0;
            foreach (char token in expression.ToCharArray())
            {
                pos++;
                switch (token)
                {
                    case '(':
                        level++;
                        break;
                    case ')':
                        level--;
                        break;
                }
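                // Only operators outside all parentheses are split candidates.
                // Using >= lets the rightmost operator of the winning weight win,
                // so "8 - 3 - 2" splits into (8 - 3) - 2 rather than 8 - (3 - 2).
                byte weight = OperatorWeight(token);
                if (level <= winningLevel && weight > 0 && weight >= winningTokenWeight)
                {
                    winningLevel = level;
                    winningTokenWeight = weight;
                    winningPos = pos;
                }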
            }
            return winningPos;
        }
        private byte OperatorWeight(char value)
        {
            switch (value)
            {
                case '+':
                case '-':
                    return 3;
                case '*':
                    return 2;
                case '/':
                    return 1;
                default:
                    return 0;
            }
        }
    }
    // -------------------------------------------------------
    class ExpressionTree : ExpressionNodeBase
    {
        protected NodeBase _rootNode;
        public ExpressionTree(string expression)
        {
            _rootNode = GetNode(expression);
        }
        public override double Value
        {
            get
            {
                return _rootNode.Value;
            }
        }
    }
    // -------------------------------------------------------
    abstract class OperatorNodeBase : ExpressionNodeBase
    {
        protected NodeBase _leftNode;
        protected NodeBase _rightNode;
        public OperatorNodeBase(string leftExpression, string rightExpression)
        {
            _leftNode = GetNode(leftExpression);
            _rightNode = GetNode(rightExpression);
        }
    }
    // -------------------------------------------------------
    class Add : OperatorNodeBase
    {
        public Add(string leftExpression, string rightExpression)
            : base(leftExpression, rightExpression)
        {
        }
        public override double Value
        {
            get
            {
                return _leftNode.Value + _rightNode.Value;
            }
        }
    }
    // -------------------------------------------------------
    class Subtract : OperatorNodeBase
    {
        public Subtract(string leftExpression, string rightExpression)
            : base(leftExpression, rightExpression)
        {
        }
        public override double Value
        {
            get
            {
                return _leftNode.Value - _rightNode.Value;
            }
        }
    }
    // -------------------------------------------------------
    class Divide : OperatorNodeBase
    {
        public Divide(string leftExpression, string rightExpression)
            : base(leftExpression, rightExpression)
        {
        }
        public override double Value
        {
            get
            {
                return _leftNode.Value / _rightNode.Value;
            }
        }
    }
    // -------------------------------------------------------
    class Multiply : OperatorNodeBase
    {
        public Multiply(string leftExpression, string rightExpression)
            : base(leftExpression, rightExpression)
        {
        }
        public override double Value
        {
            get
            {
                return _leftNode.Value * _rightNode.Value;
            }
        }
    }
}
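On the clunky switch: one possible way around it is to let each operator class carry its own symbol and build a lookup table from symbol to class. The sketch below is in Python rather than C# and is only an illustration of that idea, not a port of the program above. All names in it (Node, Number, BinaryOp, OPERATORS, strip_outer_parens, build) are hypothetical. It uses the same split-at-the-loosest-operator strategy, picks the rightmost candidate so that subtraction and division group left to right, assumes balanced parentheses, and does not handle unary minus.

```python
class Node:
    def value(self):
        raise NotImplementedError

class Number(Node):
    def __init__(self, number):
        self.number = number
    def value(self):
        return self.number

class BinaryOp(Node):
    symbol = None   # operator character, set by each subclass
    weight = 0      # higher weight binds more loosely, so it is split first
    def __init__(self, left, right):
        self.left, self.right = left, right

class Add(BinaryOp):
    symbol, weight = '+', 2
    def value(self):
        return self.left.value() + self.right.value()

class Subtract(BinaryOp):
    symbol, weight = '-', 2
    def value(self):
        return self.left.value() - self.right.value()

class Multiply(BinaryOp):
    symbol, weight = '*', 1
    def value(self):
        return self.left.value() * self.right.value()

class Divide(BinaryOp):
    symbol, weight = '/', 1
    def value(self):
        return self.left.value() / self.right.value()

# Lookup table from operator symbol to node class; this replaces the switch.
OPERATORS = {cls.symbol: cls for cls in (Add, Subtract, Multiply, Divide)}

def strip_outer_parens(text):
    # Drop parentheses only when the first '(' matches the very last ')'.
    text = text.strip()
    while text.startswith('(') and text.endswith(')'):
        depth = 0
        for i, ch in enumerate(text):
            depth += (ch == '(') - (ch == ')')
            if depth == 0 and i < len(text) - 1:
                return text   # the first '(' closes early, e.g. "(1) + (2)"
        text = text[1:-1].strip()
    return text

def build(text):
    text = strip_outer_parens(text)
    depth, best = 0, None
    for i, ch in enumerate(text):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif depth == 0 and ch in OPERATORS:
            # >= keeps the rightmost operator among those of equal weight.
            if best is None or OPERATORS[ch].weight >= OPERATORS[text[best]].weight:
                best = i
    if best is None:
        return Number(float(text))
    return OPERATORS[text[best]](build(text[:best]), build(text[best + 1:]))
```

For example, build("(((3.5 * 4.5) / (1 + 2)) + 5)").value() evaluates to 10.25. Adding a new operator then means adding one class and listing it in the tuple that feeds OPERATORS; build itself does not change.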
OK, here is my naive implementation. Sorry, I did not feel like using objects for this one, but it is easy to convert. I feel a bit like evil Willy (from Steve's story).
#!/usr/bin/env python
#tree structure [left argument, operator, right argument, priority level]
tree_root = [None, None, None, None]
#count of parenthesis nesting
parenthesis_level = 0
#current node with empty right argument
current_node = tree_root
#indices in tree_root nodes Left, Operator, Right, PRiority
L, O, R, PR = 0, 1, 2, 3
#functions that realise operators
def sum(a, b):
    return a + b
def diff(a, b):
    return a - b
def mul(a, b):
    return a * b
def div(a, b):
    return a / b
#tree evaluator
def process_node(n):
    try:
        len(n)
    except TypeError:
        return n
    left = process_node(n[L])
    right = process_node(n[R])
    return n[O](left, right)
#mapping operators to relevant functions
o2f = {'+': sum, '-': diff, '*': mul, '/': div, '(': None, ')': None}
#converts token to a node in tree
def convert_token(t):
    global current_node, tree_root, parenthesis_level
    if t == '(':
        parenthesis_level += 2
        return
    if t == ')':
        parenthesis_level -= 2
        return
    try: #assumption that we have just an integer
        l = int(t)
    except (ValueError, TypeError):
        pass #if not, no problem
    else:
        if tree_root[L] is None: #if it is first number, put it on the left of root node
            tree_root[L] = l
        else: #put on the right of current_node
            current_node[R] = l
        return
    priority = (1 if t in '+-' else 2) + parenthesis_level
    #if tree_root does not have operator put it there
    if tree_root[O] is None and t in o2f:
        tree_root[O] = o2f[t]
        tree_root[PR] = priority
        return
    #if new node has less or equals priority, put it on the top of tree
    if tree_root[PR] >= priority:
        temp = [tree_root, o2f[t], None, priority]
        tree_root = current_node = temp
        return
    #starting from root search for a place with higher priority in hierarchy
    current_node = tree_root
    while type(current_node[R]) != type(1) and priority > current_node[R][PR]:
        current_node = current_node[R]
    #insert new node
    temp = [current_node[R], o2f[t], None, priority]
    current_node[R] = temp
    current_node = temp
def parse(e):
    token = ''
    for c in e:
        if c <= '9' and c >= '0':
            token += c
            continue
        if c == ' ':
            if token != '':
                convert_token(token)
                token = ''
            continue
        if c in o2f:
            if token != '':
                convert_token(token)
            convert_token(c)
            token = ''
            continue
        #single-argument print() behaves the same under Python 2 and Python 3
        print("Unrecognized character: " + c)
    if token != '':
        convert_token(token)
def main():
    parse('(((3 * 4) / (1 + 2)) + 5)')
    print(tree_root)
    print(process_node(tree_root))
if __name__ == '__main__':
    main()