Lucene scoring tweaking - lucene

How can I make it so that, given the query "20", a document with content "something 20" gets something like MAX_SCORE while another document, e.g. "something 20/12", gets a regular score?
I'm playing around with overriding the Similarity algorithm to simplify the search, but this behavior is a pain right now. I need the lengthNorm factor set to 1 because I don't want the "shorter documents get a bigger score" behavior (without this, "20" obviously wins, not because it matches entirely but because it's shorter).
My custom Similarity class looks like this at the moment:
public class SimpleSimilarity extends DefaultSimilarity {
    public SimpleSimilarity() {}

    @Override
    public float idf(long docFreq, long numDocs) { return 1f; }

    @Override
    public float tf(float freq) { return 1f; }

    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1f;
    }
}

You can still do this with a custom similarity.
You don't need shorter documents to score higher as such; what you need in the score is the ratio (matched tokens / total terms in the document).
Try this lengthNorm in your custom similarity (keep tf, idf, etc. returning 1f as in your class above):
@Override
public float lengthNorm(FieldInvertState state) {
    return (float) 1.0 / state.getLength();
}
state.getLength() returns the number of tokens in the document's field.
As per the similarity score equation (http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html),
the lengthNorm() value is contributed once for each matched query term, so the net result is the ratio (matched tokens / total terms in the document).
Now, in your example, if your query is "20", here is the order of the returned documents:
1) 20 (document has only one term which matched with query) - score ~1.0
2) something 20 (document has two terms and one matched) - score ~0.5
3) something 20/12 (document has three terms and one matched) - score ~0.33
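As a rough check of the arithmetic above, here is a minimal sketch that mimics the ratio outside Lucene. It assumes a tokenizer that splits on whitespace and "/" (so "something 20/12" yields three tokens, as in the list above), and it ignores query normalization and any precision loss in stored norms. `ratio_score` is a hypothetical helper, not a Lucene API.

```python
import re

def tokenize(text):
    # Assumed tokenizer: split on whitespace and "/" (not Lucene's analyzer).
    return [t for t in re.split(r"[\s/]+", text) if t]

def ratio_score(query, document):
    """Matched query terms divided by total document tokens (tf = idf = 1)."""
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return 0.0
    doc_terms = set(doc_tokens)
    matched = sum(1 for term in set(tokenize(query)) if term in doc_terms)
    return matched / len(doc_tokens)

for text in ["20", "something 20", "something 20/12"]:
    print(text, round(ratio_score("20", text), 2))
```

With a different analyzer the token counts, and therefore the scores, may differ.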

What is the most efficient way to generate random numbers from a union of disjoint ranges in Kotlin?

I would like to generate random numbers from a union of ranges in Kotlin. I know I can do something like
((1..10) + (50..100)).random()
but unfortunately this creates an intermediate list, which can be rather expensive when the ranges are large.
I know I could write a custom function to randomly select a range with a weight based on its width, followed by randomly choosing an element from that range, but I am wondering if there is a cleaner way to achieve this with Kotlin built-ins.
Suppose your ranges are non-overlapping and sorted; if they are not, preprocess them by sorting and merging.
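The preprocessing step is not spelled out in this answer; a minimal sketch of one way to do it, assuming inclusive integer ranges given as (first, last) tuples, could look like this. `merge_ranges` is a hypothetical name.

```python
def merge_ranges(ranges):
    """Sort inclusive (first, last) ranges and merge those that overlap or touch."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            # Overlaps or is adjacent to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged

print(merge_ranges([(50, 100), (1, 10), (5, 20), (21, 30)]))
# [(1, 30), (50, 100)]
```

Adjacent ranges such as (5, 20) and (21, 30) are merged because, for integers, they cover a contiguous block.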
This comes down to choosing an algorithm:
1) O(1) time and O(N) space, where N is the total count of numbers: expand the ranges into a collection of numbers and pick one at random. To be compact, use an array or list as the container.
2) O(M) time and O(1) space, where M is the number of ranges: find the position with a linear scan over the ranges.
3) O(M + log(M)) time and O(M) space, where M is the number of ranges: find the position with a binary search. If you generate many numbers from the same set of ranges, you can separate the preparation (O(M)) from the generation (O(log(M))).
For the last algorithm, imagine a sorted list of all the available numbers; this list can be partitioned into your ranges. There is no need to actually create the list: you only compute the position at which each range starts within it. When you have a position within this list and want to know which range it falls in, do a binary search.
import kotlin.random.Random

fun random(ranges: Array<IntRange>): Int {
    // preparation: positions[i] is the offset of ranges[i] in the virtual list
    val positions = ranges.map {
        it.last - it.first + 1
    }.runningFold(0) { sum, item -> sum + item }
    // generation
    val randomPos = Random.nextInt(positions[ranges.size])
    val found = positions.binarySearch(randomPos)
    // binarySearch returns -(insertion point) - 1 when there is no exact match
    val range = if (found < 0) -(found + 1) - 1 else found
    return ranges[range].first + randomPos - positions[range]
}
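The separation of preparation and generation mentioned for the binary-search variant can be sketched as follows. This is an illustration in Python rather than Kotlin, assuming non-empty, disjoint, inclusive ranges given as (first, last) tuples; `RangeUnionSampler` and its methods are hypothetical names.

```python
import bisect
import random
from itertools import accumulate

class RangeUnionSampler:
    """Uniform sampler over a union of disjoint inclusive integer ranges."""

    def __init__(self, ranges):
        # Preparation, O(M): record where each range starts in the virtual list.
        self.ranges = list(ranges)
        sizes = [last - first + 1 for first, last in self.ranges]
        self.starts = [0] + list(accumulate(sizes))[:-1]
        self.total = sum(sizes)

    def value_at(self, position):
        # Generation, O(log M): find the last range starting at or before position.
        index = bisect.bisect_right(self.starts, position) - 1
        first, _ = self.ranges[index]
        return first + position - self.starts[index]

    def sample(self, rng=random):
        return self.value_at(rng.randrange(self.total))

sampler = RangeUnionSampler([(1, 10), (50, 100)])
print(sampler.total, sampler.value_at(20), sampler.sample())
```

Keeping `value_at` separate from `sample` makes the position-to-value mapping easy to check without involving randomness.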
Short solution
We can do it like this:
import kotlin.random.Random

fun main() {
    println(random(1..10, 50..100))
}

fun random(vararg ranges: IntRange): Int {
    // total count of numbers across all ranges
    var index = Random.nextInt(ranges.sumOf { it.last - it.first } + ranges.size)
    ranges.forEach {
        val size = it.last - it.first + 1
        if (index < size) {
            return it.first + index
        }
        index -= size
    }
    throw IllegalStateException()
}
It uses the same approach you described, but it calls for random integer only once, not twice.
Long solution
As I said in the comment, I often miss utilities in the Java/Kotlin stdlib for creating collection views. If IntRange had something like asList(), and we had a way to concatenate lists by creating a view, this would be really trivial, reusing existing logic blocks. Views would do the trick for us: they would automatically calculate the size and translate the random number to the proper value.
I implemented a POC, maybe you will find it useful:
fun main() {
    val list = listOf(1..10, 50..100).mergeAsView()
    println(list.size) // 61
    println(list[20]) // 60
    println(list.random())
}

@JvmName("mergeIntRangesAsView")
fun Iterable<IntRange>.mergeAsView(): List<Int> = map { it.asList() }.mergeAsView()

@JvmName("mergeListsAsView")
fun <T> Iterable<List<T>>.mergeAsView(): List<T> = object : AbstractList<T>() {
    override val size = this@mergeAsView.sumOf { it.size }

    override fun get(index: Int): T {
        if (index < 0 || index >= size) {
            throw IndexOutOfBoundsException(index)
        }
        var remaining = index
        this@mergeAsView.forEach { curr ->
            if (remaining < curr.size) {
                return curr[remaining]
            }
            remaining -= curr.size
        }
        throw IllegalStateException()
    }
}

fun IntRange.asList(): List<Int> = object : AbstractList<Int>() {
    override val size = endInclusive - start + 1

    override fun get(index: Int): Int {
        if (index < 0 || index >= size) {
            throw IndexOutOfBoundsException(index)
        }
        return start + index
    }
}
This code does almost exactly the same thing as the short solution above; it just does it indirectly.
Once again: this is just a POC. This implementation of asList() and mergeAsView() is not at all production-ready. We should implement more methods, for example iterator(), contains() and indexOf(), because right now they are much slower than they could be. But it should already work efficiently for your specific case. You should probably test it at least a little. Also, mergeAsView() assumes the provided lists have a fixed size, which may not be true.
It would probably be good to implement asList() for IntProgression and for other primitive types as well. You may also prefer a varargs version of mergeAsView() over an extension function.
As a final note: I guess there are libraries that do this already, probably some related to immutable collections. But if you are looking for a relatively lightweight solution, this should work for you.

Find 10 nearest points from an array of coordinates for each coordinate in Kotlin

I have an array
var poses = arrayOf<Array<Double>>()
That I populate using a loop.
The output looks something like this:
poses.forEach {
println(Arrays.toString(it))
}
[-71.42510166478651, 106.43593221597114]
[104.46430594348055, 78.62761919208839]
[100.27031925094859, 79.65568893000942]
[311.2433803626159, 233.67219485640456]
[330.3015877764689, -114.9000129699181]
[34.76986782382592, -383.71914014833436]
[355.477931403836, -173.29388985868835]
[322.72821807215564, -45.99138725647516]
...
Is there an efficient way to find 10 nearest points from this list for each coordinate?
For example:
Find 10 nearest points for [-71.42510166478651, 106.43593221597114], then [104.46430594348055, 78.62761919208839] and so on.
I tried looking into numpy-like libraries for Kotlin but seeing as though I'm new to the language I couldn't figure out how to do it.
You can write a distance function with the Pythagorean theorem. (This GeeksforGeeks page might be helpful too.)
You could also use a data class for the points, instead of using an array with two double values. The code below uses the approach that Mateen Ulhaq suggested in his comment, with two modifications:
The addition of "point to" lets us create a map from a point to the ten nearest points (so we know which point the ten points are related to).
The call to ".drop(1)" before ".take(10)" keeps the point itself out of its list (since the distance to itself is 0).
This code uses a list of points, determines the nearest points and prints them for each point:
import kotlin.math.sqrt

fun main() {
    val poses = listOf(
        Point(-71.42510166478651, 106.43593221597114),
        Point(104.46430594348055, 78.62761919208839),
        Point(100.27031925094859, 79.65568893000942),
        Point(311.2433803626159, 233.67219485640456),
        Point(330.3015877764689, -114.9000129699181),
        Point(34.76986782382592, -383.71914014833436),
        Point(355.477931403836, -173.29388985868835),
        Point(322.72821807215564, -45.99138725647516)
    )
    // pair each point with the other points sorted by distance, nearest first
    val nearestPoints = poses.map { point ->
        point to poses.sortedBy { point.distance(it) }.drop(1).take(10)
    }
    println("Nearest points:")
    nearestPoints.forEach {
        println("${it.first} is closest to ${it.second}")
    }
}

data class Point(val x: Double, val y: Double) {
    fun distance(that: Point): Double {
        val distanceX = this.x - that.x
        val distanceY = this.y - that.y
        return sqrt(distanceX * distanceX + distanceY * distanceY)
    }
}
If the points are evenly (or almost evenly) distributed in some area, I suggest dividing them into rectangular chunks with size area.size.x / poses.size * 10 by area.size.y / poses.size * 10.
Then to find the nearest points for any point, you only need to check neighboring chunks. Since points are evenly distributed, you can find the nearest points for all points in O(kn) where n is a number of points and k = 10.
If the points are not guaranteed to be evenly (or almost evenly) distributed, you have to divide the area into several chunks and then recursively repeat the same process for each chunk until all the sub-chunks contain at most x points. (It's hard to tell what is optimal x and optimal count of sub-chunks per chunk, you need to do some research to find it out).
Then you can find the nearest points for any point, just as you did with evenly distributed points.
A few tricks to improve performance:
Use distanceSquared instead of distance. Here is how you can implement distanceSquared:
fun Point.distanceSquared(other: Point) = (x - other.x).squared() + (y - other.y).squared()
typealias Point = Array<Double>
val Point.x get() = this[0]
val Point.y get() = this[1]
fun Double.squared() = this * this
Use PriorityQueue<Point>(10, compareBy { -it.distanceSquared(destination) }) to store nearest points, and offer(point, 10) to add points to it:
import java.util.PriorityQueue

fun <E : Any> PriorityQueue<E>.offer(element: E, maxSize: Int) {
    if (size < maxSize) offer(element)
    else if (compare(element, peek()) > 0) {
        // element ranks above the current head, so replace the head
        poll()
        offer(element)
    }
}

// if `comparator()` returns `null` queue uses naturalOrder and `E` is `Comparable<E>`
@Suppress("UNCHECKED_CAST")
fun <E : Any> PriorityQueue<E>.compare(o1: E, o2: E) =
    comparator()?.compare(o1, o2) ?: (o1 as Comparable<E>).compareTo(o2)
Divide your points into several groups and run the calculation for each group in a separate thread. This lets your program use all available cores.
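The first two tricks (squared distances and a queue bounded to the k best candidates) can be combined as in the following sketch. It is written in Python with `heapq` rather than Kotlin's PriorityQueue, points are assumed to be (x, y) tuples, and `k_nearest` is a hypothetical name.

```python
import heapq

def distance_squared(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def k_nearest(points, target_index, k=10):
    """Return up to k points closest to points[target_index], nearest first."""
    target = points[target_index]
    heap = []  # max-heap via negated distances; never holds more than k entries
    for i, point in enumerate(points):
        if i == target_index:
            continue  # skip the point itself
        entry = (-distance_squared(target, point), i)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Closer than the farthest kept candidate: swap it in.
            heapq.heapreplace(heap, entry)
    return [points[i] for _, i in sorted(heap, reverse=True)]

points = [(0, 0), (1, 0), (5, 0), (2, 0), (10, 0)]
print(k_nearest(points, 0, k=2))  # [(1, 0), (2, 0)]
```

Each lookup costs O(n log k) instead of the O(n log n) of a full sort, which matters when this is repeated for every point.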

How to get document ID in CustomScoreProvider?

In short, I am trying to determine a document's true document ID in method CustomScoreProvider.CustomScore which only provides a document "ID" relative to a sub-IndexReader.
More info: I am trying to boost my documents' scores by precomputed boost factors (imagine an in-memory structure that maps Lucene's document ids to boost factors). Unfortunately I cannot store the boosts in the index for a couple of reasons: boosting will not be used for all queries, plus the boost factors can change regularly and that would trigger a lot of reindexing.
Instead I'd like to boost the score at query time and thus I've been working with CustomScoreQuery/CustomScoreProvider. The boosting takes place in method CustomScoreProvider.CustomScore:
public override float CustomScore(int doc, float subQueryScore, float valSrcScore) {
    float baseScore = subQueryScore * valSrcScore; // the default computation
    // boost -- THIS IS WHERE THE PROBLEM IS
    float boostedScore = baseScore * MyBoostCache.GetBoostForDocId(doc);
    return boostedScore;
}
My problem is with the doc parameter passed to CustomScore. It is not the true document id -- it is relative to the subreader used for that index segment. (The MyBoostCache class is my in-memory structure mapping Lucene's doc ids to boost factors.) If I knew the reader's docBase I could figure out the true id (id = doc + docBase).
Any thoughts on how I can determine the true id, or perhaps there's a better way to accomplish what I'm doing?
(I am aware that the id I'm trying to get is subject to change and I've already taken steps to make sure the MyBoostCache is always up to date with the latest ids.)
I was able to achieve this by passing the IndexSearcher to my CustomScoreProvider, using it to determine which of its subreaders is being used by the CustomScoreProvider, and then getting the MaxDoc for the prior subreaders from the IndexSearcher to determine the docBase.
private int DocBase { get; set; }

public MyScoreProvider(IndexReader reader, IndexSearcher searcher) {
    DocBase = GetDocBaseForIndexReader(reader, searcher);
}

private static int GetDocBaseForIndexReader(IndexReader reader, IndexSearcher searcher) {
    // get all segment readers for the searcher
    IndexReader rootReader = searcher.GetIndexReader();
    var subReaders = new List<IndexReader>();
    ReaderUtil.GatherSubReaders(subReaders, rootReader);
    // sequentially loop through the subreaders until we find the specified reader, adjusting our offset along the way
    int docBase = 0;
    for (int i = 0; i < subReaders.Count; i++)
    {
        if (subReaders[i] == reader)
            break;
        docBase += subReaders[i].MaxDoc();
    }
    return docBase;
}

public override float CustomScore(int doc, float subQueryScore, float valSrcScore) {
    float baseScore = subQueryScore * valSrcScore;
    float boostedScore = baseScore * MyBoostCache.GetBoostForDocId(doc + DocBase);
    return boostedScore;
}
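The offset arithmetic in GetDocBaseForIndexReader can be illustrated on its own. The sketch below is plain Python, not Lucene.NET; the segment sizes are made-up numbers standing in for each subreader's MaxDoc(), and both function names are hypothetical.

```python
from itertools import accumulate

def doc_bases(segment_max_docs):
    """Return the first global doc id of each segment, in reader order."""
    return [0] + list(accumulate(segment_max_docs))[:-1]

def global_doc_id(segment_index, local_doc, bases):
    # id = doc + docBase, as in the question
    return bases[segment_index] + local_doc

bases = doc_bases([100, 40, 250])   # hypothetical MaxDoc values
print(bases)                        # [0, 100, 140]
print(global_doc_id(2, 7, bases))   # 147
```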

Lucene: Iterate all entries

I have a Lucene index that I would like to iterate over (for a one-time evaluation at the current stage of development).
I have 4 documents, each with a few hundred thousand up to a million entries, which I want to iterate over to count the number of words in each entry (~2-10) and calculate the frequency distribution.
What I am doing at the moment is this:
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    Field text = doc.getField("myDocName#1");
    String content = text.stringValue();
    int wordLen = countNumberOfWords(content);
    //store
}
So far, it is iterating over something. Debugging confirms that it's at least operating on the terms stored in the document, but for some reason it only processes a small part of the stored terms. What am I doing wrong? I simply want to iterate over all documents and everything that is stored in them.
Firstly you need to ensure you index with TermVectors enabled
doc.add(new Field(TITLE, page.getTitle(), Field.Store.YES, Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
Then you can use IndexReader.getTermFreqVector to count terms
TopDocs res = indexSearcher.search(YOUR_QUERY, null, 1000);
// iterate over documents in res, omitted for brevity
reader.getTermFreqVector(res.scoreDocs[i].doc, YOUR_FIELD, new TermVectorMapper() {
    public void map(String termval, int freq, TermVectorOffsetInfo[] offsets, int[] positions) {
        // increment frequency count of termval by freq
        freqs.increment(termval, freq);
    }
    public void setExpectations(String arg0, int arg1, boolean arg2, boolean arg3) {}
});
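Independently of how the entries are read from the index, the distribution the question asks for (how many entries have 2 words, 3 words, and so on) is a small counting step. A minimal sketch in plain Python, assuming the entries are already available as strings and that splitting on whitespace is an acceptable stand-in for the asker's countNumberOfWords; `word_count_distribution` is a hypothetical name.

```python
from collections import Counter

def word_count_distribution(entries):
    """Map 'words per entry' to 'number of entries with that many words'."""
    return Counter(len(entry.split()) for entry in entries)

entries = ["quick brown fox", "lazy dog", "hello world", "one"]
distribution = word_count_distribution(entries)
for words, count in sorted(distribution.items()):
    print(words, count)
```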

Expression Evaluation and Tree Walking using polymorphism? (ala Steve Yegge)

This morning, I was reading Steve Yegge's: When Polymorphism Fails, when I came across a question that a co-worker of his used to ask potential employees when they came for their interview at Amazon.
As an example of polymorphism in action, let's look at the classic "eval" interview question, which (as far as I know) was brought to Amazon by Ron Braunstein. The question is quite a rich one, as it manages to probe a wide variety of important skills: OOP design, recursion, binary trees, polymorphism and runtime typing, general coding skills, and (if you want to make it extra hard) parsing theory.
At some point, the candidate hopefully realizes that you can represent an arithmetic expression as a binary tree, assuming you're only using binary operators such as "+", "-", "*", "/". The leaf nodes are all numbers, and the internal nodes are all operators. Evaluating the expression means walking the tree. If the candidate doesn't realize this, you can gently lead them to it, or if necessary, just tell them.
Even if you tell them, it's still an interesting problem.
The first half of the question, which some people (whose names I will protect to my dying breath, but their initials are Willie Lewis) feel is a Job Requirement If You Want To Call Yourself A Developer And Work At Amazon, is actually kinda hard. The question is: how do you go from an arithmetic expression (e.g. in a string) such as "2 + (2)" to an expression tree. We may have an ADJ challenge on this question at some point.
The second half is: let's say this is a 2-person project, and your partner, who we'll call "Willie", is responsible for transforming the string expression into a tree. You get the easy part: you need to decide what classes Willie is to construct the tree with. You can do it in any language, but make sure you pick one, or Willie will hand you assembly language. If he's feeling ornery, it will be for a processor that is no longer manufactured in production.
You'd be amazed at how many candidates boff this one.
I won't give away the answer, but a Standard Bad Solution involves the use of a switch or case statement (or just good old-fashioned cascaded-ifs). A Slightly Better Solution involves using a table of function pointers, and the Probably Best Solution involves using polymorphism. I encourage you to work through it sometime. Fun stuff!
So, let's try to tackle the problem all three ways. How do you go from an arithmetic expression (e.g. in a string) such as "2 + (2)" to an expression tree using cascaded-if's, a table of function pointers, and/or polymorphism?
Feel free to tackle one, two, or all three.
[update: title modified to better match what most of the answers have been.]
Polymorphic Tree Walking, Python version
#!/usr/bin/python
class Node:
    """base class, you should not process one of these"""
    def process(self):
        raise NotImplementedError('you should not be processing a node')

class BinaryNode(Node):
    """base class for binary nodes"""
    def __init__(self, _left, _right):
        self.left = _left
        self.right = _right
    def process(self):
        raise NotImplementedError('you should not be processing a binarynode')

class Plus(BinaryNode):
    def process(self):
        return self.left.process() + self.right.process()

class Minus(BinaryNode):
    def process(self):
        return self.left.process() - self.right.process()

class Mul(BinaryNode):
    def process(self):
        return self.left.process() * self.right.process()

class Div(BinaryNode):
    def process(self):
        return self.left.process() / self.right.process()

class Num(Node):
    def __init__(self, _value):
        self.value = _value
    def process(self):
        return self.value

def demo(n):
    print(n.process())

demo(Num(2))                                           # 2
demo(Plus(Num(2), Num(5)))                             # 2 + 5
demo(Plus(Mul(Num(2), Num(3)), Div(Num(10), Num(5))))  # (2 * 3) + (10 / 5)
The tests are just building up the binary trees by using constructors.
program structure:
abstract base class: Node
    all Nodes inherit from this class
abstract base class: BinaryNode
    all binary operators inherit from this class
    process method does the work of evaluating the expression and returning the result
binary operator classes: Plus, Minus, Mul, Div
    two child nodes, one each for left side and right side subexpressions
number class: Num
    holds a leaf-node numeric value, e.g. 17 or 42
The problem, I think, is that we need to parse parentheses, and yet they are not a binary operator. Should we take (2) as a single token that evaluates to 2?
The parens don't need to show up in the expression tree, but they do affect its shape. E.g., the tree for (1+2)+3 is different from 1+(2+3):
      +
     / \
    +   3
   / \
  1   2
versus
    +
   / \
  1   +
     / \
    2   3
The parentheses are a "hint" to the parser (e.g., per superjoe30, to "recursively descend")
This gets into parsing/compiler theory, which is kind of a rabbit hole... The Dragon Book is the standard text for compiler construction, and takes this to extremes. In this particular case, you want to construct a context-free grammar for basic arithmetic, then use that grammar to parse out an abstract syntax tree. You can then iterate over the tree, reducing it from the bottom up (it's at this point you'd apply the polymorphism/function pointers/switch statement to reduce the tree).
I've found these notes to be incredibly helpful in compiler and parsing theory.
Representing the Nodes
If we want to include parentheses, we need 6 kinds of nodes:
the binary nodes: Add Minus Mul Div
these have two children, a left and right side
      +
     / \
 node   node
a node to hold a value: Val
no children nodes, just a numeric value
a node to keep track of the parens: Paren
a single child node for the subexpression
 ( )
  |
 node
For a polymorphic solution, we need to have this kind of class relationship:
Node
    BinaryNode : inherit from Node
        Plus : inherit from BinaryNode
        Minus : inherit from BinaryNode
        Mul : inherit from BinaryNode
        Div : inherit from BinaryNode
    Value : inherit from Node
    Paren : inherit from Node
There is a virtual function for all nodes called eval(). If you call that function, it will return the value of that subexpression.
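A minimal Python sketch of this hierarchy, showing only Plus and Mul among the binary operators. The class names follow the list above; the attribute names (`left`, `right`, `child`, `value`) are assumptions made for the example.

```python
class Node:
    def eval(self):
        raise NotImplementedError("subclasses must implement eval()")

class Value(Node):
    def __init__(self, value):
        self.value = value
    def eval(self):
        return self.value

class BinaryNode(Node):
    def __init__(self, left, right):
        self.left = left
        self.right = right

class Plus(BinaryNode):
    def eval(self):
        return self.left.eval() + self.right.eval()

class Mul(BinaryNode):
    def eval(self):
        return self.left.eval() * self.right.eval()

class Paren(Node):
    """Keeps track of a parenthesized subexpression; evaluates to its child."""
    def __init__(self, child):
        self.child = child
    def eval(self):
        return self.child.eval()

tree = Plus(Value(2), Paren(Value(2)))  # 2 + (2)
print(tree.eval())  # 4
```

Paren adds nothing to the result; it only preserves the fact that the source text had parentheses.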
String Tokenizer + LL(1) Parser will give you an expression tree... the polymorphism way might involve an abstract Arithmetic class with an "evaluate(a,b)" function, which is overridden for each of the operators involved (Addition, Subtraction etc) to return the appropriate value, and the tree contains Integers and Arithmetic operators, which can be evaluated by a post(?)-order traversal of the tree.
I won't give away the answer, but a Standard Bad Solution involves the use of a switch or case statement (or just good old-fashioned cascaded-ifs). A Slightly Better Solution involves using a table of function pointers, and the Probably Best Solution involves using polymorphism.
The last twenty years of evolution in interpreters can be seen as going the other way - polymorphism (eg naive Smalltalk metacircular interpreters) to function pointers (naive lisp implementations, threaded code, C++) to switch (naive byte code interpreters), and then onwards to JITs and so on - which either require very big classes, or (in singly polymorphic languages) double-dispatch, which reduces the polymorphism to a type-case, and you're back at stage one. What definition of 'best' is in use here?
For simple stuff a polymorphic solution is OK - here's one I made earlier, but either stack and bytecode/switch or exploiting the runtime's compiler is usually better if you're, say, plotting a function with a few thousand data points.
Hm... I don't think you can write a top-down parser for this without backtracking, so it has to be some sort of a shift-reduce parser. LR(1) or even LALR will of course work just fine with the following (ad-hoc) language definition:
Start -> E1
E1 -> E1+E1 | E1-E1
E1 -> E2*E2 | E2/E2 | E2
E2 -> number | (E1)
Separating it out into E1 and E2 is necessary to maintain the precedence of * and / over + and -.
But this is how I would do it if I had to write the parser by hand:
Two stacks, one storing nodes of the tree as operands and one storing operators.
Read the input left to right, make leaf nodes of the numbers and push them onto the operand stack.
If you have >= 2 operands on the stack, pop 2, combine them with the topmost operator on the operator stack and push this structure back onto the operand stack, unless
the next operator has higher precedence than the one currently on top of the stack.
This leaves us the problem of handling brackets. One elegant (I thought) solution is to store the precedence of each operator as a number in a variable. So initially,
int plus = 1, minus = 1;
int mul = 2, div = 2;
Now every time you see a left bracket, increment all these variables by 2, and every time you see a right bracket, decrement all of them by 2.
This will ensure that the + in 3*(4+5) has higher precedence than the *, and 3*4 will not be pushed onto the stack. Instead it will wait for 5, push 4+5, then push 3*(4+5).
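The two-stack scheme with the bracket bump can be sketched as follows. This is one possible reading of the description, not the answerer's code: it keeps a single `bump` offset instead of incrementing four variables, handles only non-negative integers, silently skips unrecognized characters, and does not validate malformed input. Tree nodes are plain (operator, left, right) tuples.

```python
import operator
import re

BASE_PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
APPLY = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": operator.truediv}

def parse(expression):
    operands = []   # tree nodes: ints or (symbol, left, right) tuples
    operators = []  # (symbol, effective precedence) pairs
    bump = 0        # grows by 2 inside each pair of brackets

    def reduce_top():
        symbol, _ = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((symbol, left, right))

    for token in re.findall(r"\d+|[-+*/()]", expression):
        if token == "(":
            bump += 2
        elif token == ")":
            bump -= 2
        elif token in BASE_PRECEDENCE:
            precedence = BASE_PRECEDENCE[token] + bump
            # Reduce while the stacked operator binds at least as tightly.
            while operators and operators[-1][1] >= precedence:
                reduce_top()
            operators.append((token, precedence))
        else:
            operands.append(int(token))
    while operators:
        reduce_top()
    return operands[0]

def evaluate(node):
    if isinstance(node, int):
        return node
    symbol, left, right = node
    return APPLY[symbol](evaluate(left), evaluate(right))

tree = parse("3*(4+5)")
print(tree)            # ('*', 3, ('+', 4, 5))
print(evaluate(tree))  # 27
```

Reducing on equal precedence is what makes "1-2-3" group as (1-2)-3.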
Re: Justin
I think the tree would look something like this:
    +
   / \
  2  ( )
      |
      2
Basically, you'd have an "eval" node, that just evaluates the tree below it. That would then be optimized out to just being:
    +
   / \
  2   2
In this case the parens aren't required and don't add anything. They don't add anything logically, so they'd just go away.
I think the question is about how to write a parser, not the evaluator. Or rather, how to create the expression tree from a string.
Case statements that return a base class don't exactly count.
The basic structure of a "polymorphic" solution (which is another way of saying, I don't care what you build this with, I just want to extend it with rewriting the least amount of code possible) is deserializing an object hierarchy from a stream with a (dynamic) set of known types.
The crux of the implementation of the polymorphic solution is to have a way to create an expression object from a pattern matcher, likely recursive. I.e., map a BNF or similar syntax to an object factory.
Or maybe this is the real question: how can you represent (2) as a BST? That is the part that is tripping me up.
Recursion.
@Justin:
Look at my note on representing the nodes. If you use that scheme, then
2 + (2)
can be represented as
    .
   / \
  2  ( )
      |
      2
You should use a functional language, IMO. Trees are harder to represent and manipulate in OO languages.
As people have been mentioning previously, when you use expression trees parens are not necessary. The order of operations becomes trivial and obvious when you're looking at an expression tree. The parens are hints to the parser.
While the accepted answer is the solution to one half of the problem, the other half - actually parsing the expression - is still unsolved. Typically, these sorts of problems can be solved using a recursive descent parser. Writing such a parser is often a fun exercise, but most modern tools for language parsing will abstract that away for you.
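For illustration, a recursive descent parser for the four operators and parentheses can be sketched in a few dozen lines. The grammar in the comments, the tuple-based tree, and all function names are assumptions made for this example; the tokenizer accepts only plain decimal numbers and silently drops characters it does not recognize.

```python
import re

def tokenize(text):
    return re.findall(r"\d+(?:\.\d+)?|[-+*/()]", text)

def parse(text):
    tokens = tokenize(text)
    node, pos = parse_expression(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return node

def parse_expression(tokens, pos):
    # expression := term (("+" | "-") term)*
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_term(tokens, pos):
    # term := factor (("*" | "/") factor)*
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_factor(tokens, pos):
    # factor := number | "(" expression ")"
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        node, pos = parse_expression(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return node, pos + 1
    if token in ("+", "-", "*", "/", ")"):
        raise ValueError("unexpected token: " + token)
    return float(token), pos + 1

print(parse("2 + (2)"))  # ('+', 2.0, 2.0)
print(parse("1+(2+3)"))  # ('+', 1.0, ('+', 2.0, 3.0))
```

The parentheses never appear in the resulting tree; they only steer which subtree gets built first.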
The parser is also significantly harder if you allow floating-point numbers in your string. I had to create a DFA to accept floating-point numbers in C; it was a very painstaking and detailed task. Remember, valid floating-point literals include: 10, 10., 10.123, 9.876e-5, 1.0f, .025, etc. I assume some dispensation from this (in favor of simplicity and brevity) was made in the interview.
I've written such a parser with some basic techniques like
Infix -> RPN,
Shunting Yard, and
Tree Traversals.
Here is the implementation I came up with.
It's written in C++ and compiles on both Linux and Windows.
Any suggestions/questions are welcome.
So, let's try to tackle the problem all three ways. How do you go from an arithmetic expression (e.g. in a string) such as "2 + (2)" to an expression tree using cascaded-if's, a table of function pointers, and/or polymorphism?
This is interesting, but I don't think this belongs to the realm of object-oriented programming. I think it has more to do with parsing techniques.
I've kind of chucked this C# console app together as a bit of a proof of concept. I have a feeling it could be a lot better (that switch statement in GetNode is kind of clunky; it's there because I hit a blank trying to map a class name to an operator). Any suggestions on how it could be improved are very welcome.
using System;
class Program
{
static void Main(string[] args)
{
string expression = "(((3.5 * 4.5) / (1 + 2)) + 5)";
Console.WriteLine(string.Format("{0} = {1}", expression, new Expression.ExpressionTree(expression).Value));
Console.WriteLine("\nShow's over folks, press a key to exit");
Console.ReadKey(false);
}
}
namespace Expression
{
// -------------------------------------------------------
abstract class NodeBase
{
public abstract double Value { get; }
}
// -------------------------------------------------------
class ValueNode : NodeBase
{
public ValueNode(double value)
{
_double = value;
}
private double _double;
public override double Value
{
get
{
return _double;
}
}
}
// -------------------------------------------------------
abstract class ExpressionNodeBase : NodeBase
{
protected NodeBase GetNode(string expression)
{
// Remove parenthesis
expression = RemoveParenthesis(expression);
// Is expression just a number?
double value = 0;
if (double.TryParse(expression, out value))
{
return new ValueNode(value);
}
else
{
int pos = ParseExpression(expression);
if (pos > 0)
{
string leftExpression = expression.Substring(0, pos - 1).Trim();
string rightExpression = expression.Substring(pos).Trim();
switch (expression.Substring(pos - 1, 1))
{
case "+":
return new Add(leftExpression, rightExpression);
case "-":
return new Subtract(leftExpression, rightExpression);
case "*":
return new Multiply(leftExpression, rightExpression);
case "/":
return new Divide(leftExpression, rightExpression);
default:
throw new Exception("Unknown operator");
}
}
else
{
throw new Exception("Unable to parse expression");
}
}
}
private string RemoveParenthesis(string expression)
{
if (expression.Contains("("))
{
expression = expression.Trim();
int level = 0;
int pos = 0;
foreach (char token in expression.ToCharArray())
{
pos++;
switch (token)
{
case '(':
level++;
break;
case ')':
level--;
break;
}
if (level == 0)
{
break;
}
}
if (level == 0 && pos == expression.Length)
{
expression = expression.Substring(1, expression.Length - 2);
expression = RemoveParenthesis(expression);
}
}
return expression;
}
private int ParseExpression(string expression)
{
int winningLevel = 0;
byte winningTokenWeight = 0;
int winningPos = 0;
int level = 0;
int pos = 0;
foreach (char token in expression.ToCharArray())
{
pos++;
switch (token)
{
case '(':
level++;
break;
case ')':
level--;
break;
}
if (level <= winningLevel)
{
if (OperatorWeight(token) > winningTokenWeight)
{
winningLevel = level;
winningTokenWeight = OperatorWeight(token);
winningPos = pos;
}
}
}
return winningPos;
}
private byte OperatorWeight(char value)
{
switch (value)
{
case '+':
case '-':
return 3;
case '*':
return 2;
case '/':
return 1;
default:
return 0;
}
}
}
// -------------------------------------------------------
class ExpressionTree : ExpressionNodeBase
{
protected NodeBase _rootNode;
public ExpressionTree(string expression)
{
_rootNode = GetNode(expression);
}
public override double Value
{
get
{
return _rootNode.Value;
}
}
}
// -------------------------------------------------------
abstract class OperatorNodeBase : ExpressionNodeBase
{
protected NodeBase _leftNode;
protected NodeBase _rightNode;
public OperatorNodeBase(string leftExpression, string rightExpression)
{
_leftNode = GetNode(leftExpression);
_rightNode = GetNode(rightExpression);
}
}
// -------------------------------------------------------
class Add : OperatorNodeBase
{
public Add(string leftExpression, string rightExpression)
: base(leftExpression, rightExpression)
{
}
public override double Value
{
get
{
return _leftNode.Value + _rightNode.Value;
}
}
}
// -------------------------------------------------------
class Subtract : OperatorNodeBase
{
public Subtract(string leftExpression, string rightExpression)
: base(leftExpression, rightExpression)
{
}
public override double Value
{
get
{
return _leftNode.Value - _rightNode.Value;
}
}
}
// -------------------------------------------------------
class Divide : OperatorNodeBase
{
public Divide(string leftExpression, string rightExpression)
: base(leftExpression, rightExpression)
{
}
public override double Value
{
get
{
return _leftNode.Value / _rightNode.Value;
}
}
}
// -------------------------------------------------------
class Multiply : OperatorNodeBase
{
public Multiply(string leftExpression, string rightExpression)
: base(leftExpression, rightExpression)
{
}
public override double Value
{
get
{
return _leftNode.Value * _rightNode.Value;
}
}
}
}
OK, here is my naive implementation. Sorry, I did not feel like using objects for this one, but it is easy to convert. I feel a bit like evil Willie (from Steve's story).
#!/usr/bin/env python
# tree structure: [left argument, operator, right argument, priority level]
tree_root = [None, None, None, None]
# count of parenthesis nesting
parenthesis_level = 0
# current node with empty right argument
current_node = tree_root

# indices in tree_root nodes: Left, Operator, Right, PRiority
L, O, R, PR = 0, 1, 2, 3

# functions that implement the operators
def add(a, b):
    return a + b

def diff(a, b):
    return a - b

def mul(a, b):
    return a * b

def div(a, b):
    return a / b

# tree evaluator
def process_node(n):
    try:
        len(n)
    except TypeError:
        return n
    left = process_node(n[L])
    right = process_node(n[R])
    return n[O](left, right)

# mapping operators to relevant functions
o2f = {'+': add, '-': diff, '*': mul, '/': div, '(': None, ')': None}

# converts token to a node in tree
def convert_token(t):
    global current_node, tree_root, parenthesis_level
    if t == '(':
        parenthesis_level += 2
        return
    if t == ')':
        parenthesis_level -= 2
        return
    try:  # assumption that we have just an integer
        l = int(t)
    except (ValueError, TypeError):
        pass  # if not, no problem
    else:
        if tree_root[L] is None:  # if it is the first number, put it on the left of the root node
            tree_root[L] = l
        else:  # put on the right of current_node
            current_node[R] = l
        return
    priority = (1 if t in '+-' else 2) + parenthesis_level
    # if tree_root does not have an operator, put it there
    if tree_root[O] is None and t in o2f:
        tree_root[O] = o2f[t]
        tree_root[PR] = priority
        return
    # if the new node has lower or equal priority, put it on top of the tree
    if tree_root[PR] >= priority:
        temp = [tree_root, o2f[t], None, priority]
        tree_root = current_node = temp
        return
    # starting from the root, search for a place with higher priority in the hierarchy
    current_node = tree_root
    while not isinstance(current_node[R], int) and priority > current_node[R][PR]:
        current_node = current_node[R]
    # insert new node
    temp = [current_node[R], o2f[t], None, priority]
    current_node[R] = temp
    current_node = temp

def parse(e):
    token = ''
    for c in e:
        if '0' <= c <= '9':
            token += c
            continue
        if c == ' ':
            if token != '':
                convert_token(token)
                token = ''
            continue
        if c in o2f:
            if token != '':
                convert_token(token)
            convert_token(c)
            token = ''
            continue
        print("Unrecognized character:", c)
    if token != '':
        convert_token(token)

def main():
    parse('(((3 * 4) / (1 + 2)) + 5)')
    print(tree_root)
    print(process_node(tree_root))

if __name__ == '__main__':
    main()