Getting number of hyponym and deep in WordNet using JWNL - wordnet

I'm using JWNL library for search similarity. In the formula, hyponym count is needed. How i get number of Hyponym and Deep of synset in WordNet using JWNL ? Thanks
I have tried the code below. But when the program running there is error java heap space. My java heap space is 2 GB
//Method for count number of hyponym of synset
public double getHypo(Synset synset) throws JWNLException {
double hypo = PointerUtils.getInstance().getHyponymTree(synset).toList().size();
return hypo;
}
//Method for count deep of synset from root (deep root =1)
public double getDeep(Synset synset) throws JWNLException {
double deep = PointerUtils.getInstance().getHypernymTree(synset).toList().size() + 1;
return deep;
}

Related

Beam Java SDK with TFRecord and Compression GZIP

We're using Beam Java SDK (and Google Cloud Dataflow to run batch jobs) a lot, and we noticed something weird (possibly a bug?) when we tried to use TFRecordIO with Compression.GZIP. We were able to come up with some sample code that can reproduce the errors we face.
To be clear, we are using Beam Java SDK 2.4.
Suppose we have PCollection<byte[]> which can be a PC of proto messages, for instance, in byte[] format.
We usually write this to GCS (Google Cloud Storage) using Base64 encoding (newline delimited Strings) or using TFRecordIO (without compression). We have had no issue reading the data from GCS in this manner for a very long time (2.5+ years for the former and ~1.5 years for the latter).
Recently, we tried TFRecordIO with Compression.GZIP option, and sometimes we get an exception as the data is seen as invalid (while being read). The data itself (the gzip files) is not corrupted, and we've tested various things, and reached the following conclusion.
When a byte[] that is being compressed under TFRecordIO is above certain threshold (I'd say when at or above 8192), then TFRecordIO.read().withCompression(Compression.GZIP) would not work.
Specifically, it will throw the following exception:
Exception in thread "main" java.lang.IllegalStateException: Invalid data
at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
at org.apache.beam.sdk.io.TFRecordIO$TFRecordCodec.read(TFRecordIO.java:642)
at org.apache.beam.sdk.io.TFRecordIO$TFRecordSource$TFRecordReader.readNextRecord(TFRecordIO.java:526)
at org.apache.beam.sdk.io.CompressedSource$CompressedReader.readNextRecord(CompressedSource.java:426)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:473)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:468)
at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:261)
at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$BoundedReadEvaluator.processElement(BoundedReadEvaluatorFactory.java:141)
at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:161)
at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:125)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This can be reproduced easily, so you can refer to the code at the end. You will also see comments about the byte array length (as I tested with various sizes, I concluded that 8192 is the magic number).
So I'm wondering if this is a bug or known issue -- I couldn't find anything close to this on Apache Beam's Issue Tracker here but if there is another forum/site I need to check, please let me know!
If this is indeed a bug, what would be the right channel to report this?
The following code can reproduce the error we have.
A successful run (with parameters 1, 39, 100) would show the following message at the end:
------------ counter metrics from CountDoFn
[counter] plain_base64_proto_array_len: 8126
[counter] plain_base64_proto_in: 1
[counter] plain_base64_proto_val_cnt: 39
[counter] tfrecord_gz_proto_array_len: 8126
[counter] tfrecord_gz_proto_in: 1
[counter] tfrecord_gz_proto_val_cnt: 39
[counter] tfrecord_uncomp_proto_array_len: 8126
[counter] tfrecord_uncomp_proto_in: 1
[counter] tfrecord_uncomp_proto_val_cnt: 39
With parameters (1, 40, 100) which would push the byte array length over 8192, it will throw the said exception.
You can tweak the parameters (inside CreateRandomProtoData DoFn) to see why the length of byte[] being gzipped matters.
It may help you also to use the following protoc-gen Java class (for TestProto used in the main code above. Here it is: gist link
References:
Main Code:
package exp.moloco.dataflow2.compression; // NOTE: Change appropriately.
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Random;
import java.util.TreeMap;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.commons.codec.binary.Base64;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.protobuf.InvalidProtocolBufferException;
import com.moloco.dataflow.test.StackOverflow.TestProto;
import com.moloco.dataflow2.Main;
// #formatter:off
// This code uses TestProto (java class) that is generated by protoc.
// The message definition is as follows (in proto3, but it shouldn't matter):
// message TestProto {
// int64 count = 1;
// string name = 2;
// repeated string values = 3;
// }
// Note that this code does not depend on whether this proto is used,
// or any other byte[] is used (see CreateRandomData DoFn later which generates the data being used in the code).
// We tested both, but are presenting this as a concrete example of how (our) code in production can be affected.
// #formatter:on
public class CompressionTester {
private static final Logger LOG = LoggerFactory.getLogger(CompressionTester.class);
static final List<String> lines = Arrays.asList("some dummy string that will not used in this job.");
// Some GCS buckets where data will be written to.
// %s will be replaced by some timestamped String for easy debugging.
static final String PATH_TO_GCS_PLAIN_BASE64 = Main.SOME_BUCKET + "/comp-test/%s/output-plain-base64";
static final String PATH_TO_GCS_TFRECORD_UNCOMP = Main.SOME_BUCKET + "/comp-test/%s/output-tfrecord-uncompressed";
static final String PATH_TO_GCS_TFRECORD_GZ = Main.SOME_BUCKET + "/comp-test/%s/output-tfrecord-gzip";
// This DoFn reads byte[] which represents a proto message (TestProto).
// It simply counts the number of proto objects it processes
// as well as the number of Strings each proto object contains.
// When the pipeline terminates, the values of the Counters will be printed out.
static class CountDoFn extends DoFn<byte[], TestProto> {
private final Counter protoIn;
private final Counter protoValuesCnt;
private final Counter protoByteArrayLength;
public CountDoFn(String name) {
protoIn = Metrics.counter(this.getClass(), name + "_proto_in");
protoValuesCnt = Metrics.counter(this.getClass(), name + "_proto_val_cnt");
protoByteArrayLength = Metrics.counter(this.getClass(), name + "_proto_array_len");
}
#ProcessElement
public void processElement(ProcessContext c) throws InvalidProtocolBufferException {
protoIn.inc();
TestProto tp = TestProto.parseFrom(c.element());
protoValuesCnt.inc(tp.getValuesCount());
protoByteArrayLength.inc(c.element().length);
}
}
// This DoFn emits a number of TestProto objects as byte[].
// Input to this DoFn is ignored (not used).
// Each TestProto object contains three fields: count (int64), name (string), and values (repeated string).
// The three parameters in DoFn determines
// (1) the number of proto objects to be generated,
// (2) the number of (repeated) strings to be added to each proto object, and
// (3) the length of (each) string.
// TFRecord with Compression (when reading) fails when the parameters are 1, 40, 100, for instance.
// TFRecord with Compression (when reading) succeeds when the parameters are 1, 39, 100, for instance.
static class CreateRandomProtoData extends DoFn<String, byte[]> {
static final int NUM_PROTOS = 1; // Total number of TestProto objects to be emitted by this DoFn.
static final int NUM_STRINGS = 40; // Total number of strings in each TestProto object ('repeated string').
static final int STRING_LEN = 100; // Length of each string object.
// Returns a random string of length len.
// For debugging purposes, the string only contains upper-case English alphabets.
static String getRandomString(Random rd, int len) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < len; i++) {
sb.append('A' + (rd.nextInt(26)));
}
return sb.toString();
}
// Returns a randomly generated TestProto object.
// Each string is generated randomly using getRandomString().
static TestProto getRandomProto(Random rd) {
TestProto.Builder tpBuilder = TestProto.newBuilder();
tpBuilder.setCount(rd.nextInt());
tpBuilder.setName(getRandomString(rd, STRING_LEN));
for (int i = 0; i < NUM_STRINGS; i++) {
tpBuilder.addValues(getRandomString(rd, STRING_LEN));
}
return tpBuilder.build();
}
// Emits TestProto objects are byte[].
#ProcessElement
public void processElement(ProcessContext c) {
// For debugging purposes, we set the seed here.
Random rd = new Random();
rd.setSeed(132475);
for (int n = 0; n < NUM_PROTOS; n++) {
byte[] data = getRandomProto(rd).toByteArray();
c.output(data);
// With parameters (1, 39, 100), the array length is 8126. It works fine.
// With parameters (1, 40, 100), the array length is 8329. It breaks TFRecord with GZIP.
System.out.println("\n--------------------------\n");
System.out.println("byte array length = " + data.length);
System.out.println("\n--------------------------\n");
}
}
}
public static void execute() {
PipelineOptions options = PipelineOptionsFactory.create();
options.setJobName("compression-tester");
options.setRunner(DirectRunner.class);
// For debugging purposes, write files under 'gcsSubDir' so we can easily distinguish.
final String gcsSubDir =
String.format("%s-%d", DateTime.now(DateTimeZone.UTC), DateTime.now(DateTimeZone.UTC).getMillis());
// Write PCollection<TestProto> in 3 different ways to GCS.
{
Pipeline pipeline = Pipeline.create(options);
// Create dummy data which is a PCollection of byte arrays (each array representing a proto message).
PCollection<byte[]> data = pipeline.apply(Create.of(lines)).apply(ParDo.of(new CreateRandomProtoData()));
// 1. Write as plain-text with base64 encoding.
data.apply(ParDo.of(new DoFn<byte[], String>() {
#ProcessElement
public void processElement(ProcessContext c) {
c.output(new String(Base64.encodeBase64(c.element())));
}
})).apply(TextIO.write().to(String.format(PATH_TO_GCS_PLAIN_BASE64, gcsSubDir)).withNumShards(1));
// 2. Write as TFRecord.
data.apply(TFRecordIO.write().to(String.format(PATH_TO_GCS_TFRECORD_UNCOMP, gcsSubDir)).withNumShards(1));
// 3. Write as TFRecord-gzip.
data.apply(TFRecordIO.write().withCompression(Compression.GZIP)
.to(String.format(PATH_TO_GCS_TFRECORD_GZ, gcsSubDir)).withNumShards(1));
pipeline.run().waitUntilFinish();
}
LOG.info("-------------------------------------------");
LOG.info(" READ TEST BEGINS ");
LOG.info("-------------------------------------------");
// Read PCollection<TestProto> in 3 different ways from GCS.
{
Pipeline pipeline = Pipeline.create(options);
// 1. Read as plain-text.
pipeline.apply(TextIO.read().from(String.format(PATH_TO_GCS_PLAIN_BASE64, gcsSubDir) + "*"))
.apply(ParDo.of(new DoFn<String, byte[]>() {
#ProcessElement
public void processElement(ProcessContext c) {
c.output(Base64.decodeBase64(c.element()));
}
})).apply("plain-base64", ParDo.of(new CountDoFn("plain_base64")));
// 2. Read as TFRecord -> byte array.
pipeline.apply(TFRecordIO.read().from(String.format(PATH_TO_GCS_TFRECORD_UNCOMP, gcsSubDir) + "*"))
.apply("tfrecord-uncomp", ParDo.of(new CountDoFn("tfrecord_uncomp")));
// 3. Read as TFRecord-gz -> byte array.
// This seems to fail when 'data size' becomes large.
pipeline
.apply(TFRecordIO.read().withCompression(Compression.GZIP)
.from(String.format(PATH_TO_GCS_TFRECORD_GZ, gcsSubDir) + "*"))
.apply("tfrecord_gz", ParDo.of(new CountDoFn("tfrecord_gz")));
// 4. Run pipeline.
PipelineResult res = pipeline.run();
res.waitUntilFinish();
// Check CountDoFn's metrics.
// The numbers should match.
Map<String, Long> counterValues = new TreeMap<String, Long>();
for (MetricResult<Long> counter : res.metrics().queryMetrics(MetricsFilter.builder().build()).counters()) {
counterValues.put(counter.name().name(), counter.committed());
}
StringBuffer sb = new StringBuffer();
sb.append("\n------------ counter metrics from CountDoFn\n");
for (Entry<String, Long> entry : counterValues.entrySet()) {
sb.append(String.format("[counter] %40s: %5d\n", entry.getKey(), entry.getValue()));
}
LOG.info(sb.toString());
}
}
}
This looks clearly like a bug in TFRecordIO. Channel.read() can read fewer bytes than the capacity of the input buffer. 8192 seems to be the buffer size in GzipCompressorInputStream. I filed https://issues.apache.org/jira/browse/BEAM-5412.
It is a bug, please see: https://issues.apache.org/jira/browse/BEAM-7695, I have solved it.

Lucene scoring tweaking

How to achieve that with given query "20", document with content "something 20" had something like MAX_SCORE while other document e.g. "something 20/12" had regular one?
Im playing around with overriding Similarity algorithm to simplify the search but this behavior is pain right now.. I need to have lengthNorm factor set to "1" as I dont want to have "shorter documents will have bigger score" behavior (without this "20" obviously wins, but not because it fits entirely, but because its shorter...).
My custom Similarity class looks like that at the moment
public class SimpleSimilarity extends DefaultSimilarity {
public SimpleSimilarity(){}
#Override
public float idf(long docFreq, long numDocs) { return 1f; }
#Override
public float tf(float freq) { return 1f; }
#Override
public float lengthNorm(FieldInvertState state) {
return 1f;
}
}
You can still do this with custom similarity.
You don't need smaller documents to score high but you need ratio of (matched token / total terms in document) in your score.
Try this lengthNorm in your custom similarity (keep tf/idf etc to return 1f as you mentioned above)
#Override
public float lengthNorm (FieldInvertState state)
{
return (float) 1.0 / state.getLength();
}
state.getLength() returns number of tokens in document.
As per similarity score equation (http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html)
lengthNorm() will be added for each matched term, net net you will get ratio of (matched tokens / total terms in document).
Now in you example if your query is "20", here is order of returned document
1) 20 (document has only one term which matched with query) - score ~1.0
2) something 20 (document has two terms and one matched) - score ~0.5
3) something 20/12 (document has three terms and one matched) - score ~0.33

How would I add a count for guesses and a best score in a simple Guessing game with multiple methods?

so I am having trouble with being able to have a proper TOTAL guess counter. my code needs to be basic so please dont offer me advice for doing anything you think is not trivial. currently, my code plays a game, then asks the user if they want to play again and its a simple Y or N. if yes, another games plays and if no then then game ends and it reports the results of every game played such: the total games, guesses per game, and the best game (reports game that had lowest guess count). my issues are being able to accurately count ALL THE GUESSES from every game. I am able to report results and track the total games played correctly but i cant figure out how to report a result that tracks the guesses per game, adds them all up, finds the smallest one.
import java.util.*;
public class Guess2{
// The range for what numbers the correct answer is allowed to be.
public static final int MAX = 2;
// Main method that calls the other methods for the game to run properly.
public static void main (String [] args) {
Scanner console = new Scanner(System.in);
introduction();
int userGuess = game(console);
int numGames = 1;
System.out.print("Do you want to play again? ");
String putYesNo = console.next();
while (putYesNo.equalsIgnoreCase("y")) {
System.out.println();
game(console);
numGames++;
System.out.print("Do you want to play again? ");
putYesNo = console.next();
}
if (putYesNo.equalsIgnoreCase("n")) {
System.out.println();
finalScores(console,totalGuess,numGames);
}
}
// Method to play the guessing game using the input from the console(parameter) via the user playing the game.
public static int game(Scanner console){
Random answer = new Random();
int correct = answer.nextInt(MAX) + 1;
System.out.println("I'm thinking of a number between 1 and " + MAX + "...");
int userGuess = 0; // This is used to prime the while loop.
int numGuess = 0; // This is used to start the count at 1.
int totalGuess = 0;
while (userGuess != correct){
System.out.print("Your guess? ");
userGuess = console.nextInt();
if (userGuess > correct){
System.out.println("It's lower.");
} else if (userGuess < correct){
System.out.println("It's higher.");
}
numGuess++;
}
if (userGuess == correct && numGuess == 1){
System.out.println("You got it right in 1 guess");
} else if (userGuess == correct && numGuess != 1){
System.out.println("You got it right in " + numGuess + " guesses");
}
totalGuess =+ numGuess;
return totalGuess; // This Returns the number of total guesses for this single game.
}
/* Method used to report the Users' final statistics of the game(s).
Uses information from the user input via the console,
the sum of guesses used for every game, and the total number of games played. */
public static void finalScores(Scanner console, int totalGuesses, int numGames){
System.out.println("Overall results:");
System.out.println(" total games = " + numGames);
System.out.println(" total guesses = " + totalGuesses);
System.out.println(" guesses/game = " + iLoveRounding(1.0*totalGuesses/numGames));
System.out.println(" best game = " );
}
// Method that introduces the User to the Game.
public static void introduction() {
System.out.println("This program allows you to play a guessing game.");
System.out.println("I will think of a number between 1 and");
System.out.println(MAX + " and will allow you to guess until");
System.out.println("you get it. For each guess, I will tell you");
System.out.println("whether the right answer is higher or lower");
System.out.println("than your guess.");
System.out.println();
}
/* Method used to round the "guessing/game" ratio to 1 decimal place.
The return value of this method returns the ratio in decimal form with only 1 didgit, so
The scores method can report it properly. */
public static double iLoveRounding(double round){
return Math.round(round * 10.0) / 10.0;
}
}
You could look into making an integer ArrayList to hold all of the guesses per game. Then, when you are printing the results, you could first print out the whole ArrayList to represent the guesses per game, then call 'Collections.sort(yourArrayList)', which will sort the ArrayList from lowest to highest, then print out the first element of the ArrayList.
I'd recommend the integer ArrayList over a simple integer array, since the array will require you to define the size beforehand. An ArrayList can be dynamic, and you can change the size according to the number of times the user plays the game.

How to get document ID in CustomScoreProvider?

In short, I am trying to determine a document's true document ID in method CustomScoreProvider.CustomScore which only provides a document "ID" relative to a sub-IndexReader.
More info: I am trying to boost my documents' scores by precomputed boost factors (imagine an in-memory structure that maps Lucene's document ids to boost factors). Unfortunately I cannot store the boosts in the index for a couple of reasons: boosting will not be used for all queries, plus the boost factors can change regularly and that would trigger a lot of reindexing.
Instead I'd like to boost the score at query time and thus I've been working with CustomScoreQuery/CustomScoreProvider. The boosting takes place in method CustomScoreProvider.CustomScore:
public override float CustomScore(int doc, float subQueryScore, float valSrcScore) {
float baseScore = subQueryScore * valSrcScore; // the default computation
// boost -- THIS IS WHERE THE PROBLEM IS
float boostedScore = baseScore * MyBoostCache.GetBoostForDocId(doc);
return boostedScore;
}
My problem is with the doc parameter passed to CustomScore. It is not the true document id -- it is relative to the subreader used for that index segment. (The MyBoostCache class is my in-memory structure mapping Lucene's doc ids to boost factors.) If I knew the reader's docBase I could figure out the true id (id = doc + docBase).
Any thoughts on how I can determine the true id, or perhaps there's a better way to accomplish what I'm doing?
(I am aware that the id I'm trying to get is subject to change and I've already taken steps to make sure the MyBoostCache is always up to date with the latest ids.)
I was able to achieve this by passing the IndexSearcher to my CustomScoreProvider, using it to determine which of its subreaders is being used by the CustomScoreProvider, and then getting the MaxDoc for the prior subreaders from the IndexSearcher to determine the docBase.
private int DocBase { get; set; }
public MyScoreProvider(IndexReader reader, IndexSearcher searcher) {
DocBase = GetDocBaseForIndexReader(reader, searcher);
}
private static int GetDocBaseForIndexReader(IndexReader reader, IndexSearcher searcher) {
// get all segment readers for the searcher
IndexReader rootReader = searcher.GetIndexReader();
var subReaders = new List<IndexReader>();
ReaderUtil.GatherSubReaders(subReaders, rootReader);
// sequentially loop through the subreaders until we find the specified reader, adjusting our offset along the way
int docBase = 0;
for (int i = 0; i < subReaders.Count; i++)
{
if (subReaders[i] == reader)
break;
docBase += subReaders[i].MaxDoc();
}
return docBase;
}
public override float CustomScore(int doc, float subQueryScore, float valSrcScore) {
float baseScore = subQueryScore * valSrcScore;
float boostedScore = baseScore * MyBoostCache.GetBoostForDocId(doc + DocBase);
return boostedScore;
}

Expression Evaluation and Tree Walking using polymorphism? (ala Steve Yegge)

This morning, I was reading Steve Yegge's: When Polymorphism Fails, when I came across a question that a co-worker of his used to ask potential employees when they came for their interview at Amazon.
As an example of polymorphism in
action, let's look at the classic
"eval" interview question, which (as
far as I know) was brought to Amazon
by Ron Braunstein. The question is
quite a rich one, as it manages to
probe a wide variety of important
skills: OOP design, recursion, binary
trees, polymorphism and runtime
typing, general coding skills, and (if
you want to make it extra hard)
parsing theory.
At some point, the candidate hopefully
realizes that you can represent an
arithmetic expression as a binary
tree, assuming you're only using
binary operators such as "+", "-",
"*", "/". The leaf nodes are all
numbers, and the internal nodes are
all operators. Evaluating the
expression means walking the tree. If
the candidate doesn't realize this,
you can gently lead them to it, or if
necessary, just tell them.
Even if you tell them, it's still an
interesting problem.
The first half of the question, which
some people (whose names I will
protect to my dying breath, but their
initials are Willie Lewis) feel is a
Job Requirement If You Want To Call
Yourself A Developer And Work At
Amazon, is actually kinda hard. The
question is: how do you go from an
arithmetic expression (e.g. in a
string) such as "2 + (2)" to an
expression tree. We may have an ADJ
challenge on this question at some
point.
The second half is: let's say this is
a 2-person project, and your partner,
who we'll call "Willie", is
responsible for transforming the
string expression into a tree. You get
the easy part: you need to decide what
classes Willie is to construct the
tree with. You can do it in any
language, but make sure you pick one,
or Willie will hand you assembly
language. If he's feeling ornery, it
will be for a processor that is no
longer manufactured in production.
You'd be amazed at how many candidates
boff this one.
I won't give away the answer, but a
Standard Bad Solution involves the use
of a switch or case statment (or just
good old-fashioned cascaded-ifs). A
Slightly Better Solution involves
using a table of function pointers,
and the Probably Best Solution
involves using polymorphism. I
encourage you to work through it
sometime. Fun stuff!
So, let's try to tackle the problem all three ways. How do you go from an arithmetic expression (e.g. in a string) such as "2 + (2)" to an expression tree using cascaded-if's, a table of function pointers, and/or polymorphism?
Feel free to tackle one, two, or all three.
[update: title modified to better match what most of the answers have been.]
Polymorphic Tree Walking, Python version
#!/usr/bin/python
class Node:
"""base class, you should not process one of these"""
def process(self):
raise('you should not be processing a node')
class BinaryNode(Node):
"""base class for binary nodes"""
def __init__(self, _left, _right):
self.left = _left
self.right = _right
def process(self):
raise('you should not be processing a binarynode')
class Plus(BinaryNode):
def process(self):
return self.left.process() + self.right.process()
class Minus(BinaryNode):
def process(self):
return self.left.process() - self.right.process()
class Mul(BinaryNode):
def process(self):
return self.left.process() * self.right.process()
class Div(BinaryNode):
def process(self):
return self.left.process() / self.right.process()
class Num(Node):
def __init__(self, _value):
self.value = _value
def process(self):
return self.value
def demo(n):
print n.process()
demo(Num(2)) # 2
demo(Plus(Num(2),Num(5))) # 2 + 3
demo(Plus(Mul(Num(2),Num(3)),Div(Num(10),Num(5)))) # (2 * 3) + (10 / 2)
The tests are just building up the binary trees by using constructors.
program structure:
abstract base class: Node
all Nodes inherit from this class
abstract base class: BinaryNode
all binary operators inherit from this class
process method does the work of evaluting the expression and returning the result
binary operator classes: Plus,Minus,Mul,Div
two child nodes, one each for left side and right side subexpressions
number class: Num
holds a leaf-node numeric value, e.g. 17 or 42
The problem, I think, is that we need to parse perentheses, and yet they are not a binary operator? Should we take (2) as a single token, that evaluates to 2?
The parens don't need to show up in the expression tree, but they do affect its shape. E.g., the tree for (1+2)+3 is different from 1+(2+3):
+
/ \
+ 3
/ \
1 2
versus
+
/ \
1 +
/ \
2 3
The parentheses are a "hint" to the parser (e.g., per superjoe30, to "recursively descend")
This gets into parsing/compiler theory, which is kind of a rabbit hole... The Dragon Book is the standard text for compiler construction, and takes this to extremes. In this particular case, you want to construct a context-free grammar for basic arithmetic, then use that grammar to parse out an abstract syntax tree. You can then iterate over the tree, reducing it from the bottom up (it's at this point you'd apply the polymorphism/function pointers/switch statement to reduce the tree).
I've found these notes to be incredibly helpful in compiler and parsing theory.
Representing the Nodes
If we want to include parentheses, we need 5 kinds of nodes:
the binary nodes: Add Minus Mul Divthese have two children, a left and right side
+
/ \
node node
a node to hold a value: Valno children nodes, just a numeric value
a node to keep track of the parens: Parena single child node for the subexpression
( )
|
node
For a polymorphic solution, we need to have this kind of class relationship:
Node
BinaryNode : inherit from Node
Plus : inherit from Binary Node
Minus : inherit from Binary Node
Mul : inherit from Binary Node
Div : inherit from Binary Node
Value : inherit from Node
Paren : inherit from node
There is a virtual function for all nodes called eval(). If you call that function, it will return the value of that subexpression.
String Tokenizer + LL(1) Parser will give you an expression tree... the polymorphism way might involve an abstract Arithmetic class with an "evaluate(a,b)" function, which is overridden for each of the operators involved (Addition, Subtraction etc) to return the appropriate value, and the tree contains Integers and Arithmetic operators, which can be evaluated by a post(?)-order traversal of the tree.
I won't give away the answer, but a
Standard Bad Solution involves the use
of a switch or case statment (or just
good old-fashioned cascaded-ifs). A
Slightly Better Solution involves
using a table of function pointers,
and the Probably Best Solution
involves using polymorphism.
The last twenty years of evolution in interpreters can be seen as going the other way - polymorphism (eg naive Smalltalk metacircular interpreters) to function pointers (naive lisp implementations, threaded code, C++) to switch (naive byte code interpreters), and then onwards to JITs and so on - which either require very big classes, or (in singly polymorphic languages) double-dispatch, which reduces the polymorphism to a type-case, and you're back at stage one. What definition of 'best' is in use here?
For simple stuff a polymorphic solution is OK - here's one I made earlier, but either stack and bytecode/switch or exploiting the runtime's compiler is usually better if you're, say, plotting a function with a few thousand data points.
Hm... I don't think you can write a top-down parser for this without backtracking, so it has to be some sort of a shift-reduce parser. LR(1) or even LALR will of course work just fine with the following (ad-hoc) language definition:
Start -> E1
E1 -> E1+E1 | E1-E1
E1 -> E2*E2 | E2/E2 | E2
E2 -> number | (E1)
Separating it out into E1 and E2 is necessary to maintain the precedence of * and / over + and -.
But this is how I would do it if I had to write the parser by hand:
Two stacks, one storing nodes of the tree as operands and one storing operators
Read the input left to right, make leaf nodes of the numbers and push them into the operand stack.
If you have >= 2 operands on the stack, pop 2, combine them with the topmost operator in the operator stack and push this structure back to the operand tree, unless
The next operator has higher precedence that the one currently on top of the stack.
This leaves us the problem of handling brackets. One elegant (I thought) solution is to store the precedence of each operator as a number in a variable. So initially,
int plus, minus = 1;
int mul, div = 2;
Now every time you see a a left bracket increment all these variables by 2, and every time you see a right bracket, decrement all the variables by 2.
This will ensure that the + in 3*(4+5) has higher precedence than the *, and 3*4 will not be pushed onto the stack. Instead it will wait for 5, push 4+5, then push 3*(4+5).
Re: Justin
I think the tree would look something like this:
+
/ \
2 ( )
|
2
Basically, you'd have an "eval" node, that just evaluates the tree below it. That would then be optimized out to just being:
+
/ \
2 2
In this case the parens aren't required and don't add anything. They don't add anything logically, so they'd just go away.
I think the question is about how to write a parser, not the evaluator. Or rather, how to create the expression tree from a string.
Case statements that return a base class don't exactly count.
The basic structure of a "polymorphic" solution (which is another way of saying, I don't care what you build this with, I just want to extend it with rewriting the least amount of code possible) is deserializing an object hierarchy from a stream with a (dynamic) set of known types.
The crux of the implementation of the polymorphic solution is to have a way to create an expression object from a pattern matcher, likely recursive. I.e., map a BNF or similar syntax to an object factory.
Or maybe this is the real question:
how can you represent (2) as a BST?
That is the part that is tripping me
up.
Recursion.
#Justin:
Look at my note on representing the nodes. If you use that scheme, then
2 + (2)
can be represented as
.
/ \
2 ( )
|
2
should use a functional language imo. Trees are harder to represent and manipulate in OO languages.
As people have been mentioning previously, when you use expression trees parens are not necessary. The order of operations becomes trivial and obvious when you're looking at an expression tree. The parens are hints to the parser.
While the accepted answer is the solution to one half of the problem, the other half - actually parsing the expression - is still unsolved. Typically, these sorts of problems can be solved using a recursive descent parser. Writing such a parser is often a fun exercise, but most modern tools for language parsing will abstract that away for you.
The parser is also significantly harder if you allow floating point numbers in your string. I had to create a DFA to accept floating point numbers in C -- it was a very painstaking and detailed task. Remember, valid floating points include: 10, 10., 10.123, 9.876e-5, 1.0f, .025, etc. I assume some dispensation from this (in favor of simplicty and brevity) was made in the interview.
I've written such a parser with some basic techniques like
Infix -> RPN and
Shunting Yard and
Tree Traversals.
Here is the implementation I've came up with.
It's written in C++ and compiles on both Linux and Windows.
Any suggestions/questions are welcomed.
So, let's try to tackle the problem all three ways. How do you go from an arithmetic expression (e.g. in a string) such as "2 + (2)" to an expression tree using cascaded-if's, a table of function pointers, and/or polymorphism?
This is interesting,but I don't think this belongs to the realm of object-oriented programming...I think it has more to do with parsing techniques.
I've kind of chucked this c# console app together as a bit of a proof of concept. Have a feeling it could be a lot better (that switch statement in GetNode is kind of clunky (it's there coz I hit a blank trying to map a class name to an operator)). Any suggestions on how it could be improved very welcome.
using System;
class Program
{
static void Main(string[] args)
{
string expression = "(((3.5 * 4.5) / (1 + 2)) + 5)";
Console.WriteLine(string.Format("{0} = {1}", expression, new Expression.ExpressionTree(expression).Value));
Console.WriteLine("\nShow's over folks, press a key to exit");
Console.ReadKey(false);
}
}
namespace Expression
{
// -------------------------------------------------------
abstract class NodeBase
{
public abstract double Value { get; }
}
// -------------------------------------------------------
class ValueNode : NodeBase
{
public ValueNode(double value)
{
_double = value;
}
private double _double;
public override double Value
{
get
{
return _double;
}
}
}
// -------------------------------------------------------
abstract class ExpressionNodeBase : NodeBase
{
protected NodeBase GetNode(string expression)
{
// Remove parenthesis
expression = RemoveParenthesis(expression);
// Is expression just a number?
double value = 0;
if (double.TryParse(expression, out value))
{
return new ValueNode(value);
}
else
{
int pos = ParseExpression(expression);
if (pos > 0)
{
string leftExpression = expression.Substring(0, pos - 1).Trim();
string rightExpression = expression.Substring(pos).Trim();
switch (expression.Substring(pos - 1, 1))
{
case "+":
return new Add(leftExpression, rightExpression);
case "-":
return new Subtract(leftExpression, rightExpression);
case "*":
return new Multiply(leftExpression, rightExpression);
case "/":
return new Divide(leftExpression, rightExpression);
default:
throw new Exception("Unknown operator");
}
}
else
{
throw new Exception("Unable to parse expression");
}
}
}
private string RemoveParenthesis(string expression)
{
if (expression.Contains("("))
{
expression = expression.Trim();
int level = 0;
int pos = 0;
foreach (char token in expression.ToCharArray())
{
pos++;
switch (token)
{
case '(':
level++;
break;
case ')':
level--;
break;
}
if (level == 0)
{
break;
}
}
if (level == 0 && pos == expression.Length)
{
expression = expression.Substring(1, expression.Length - 2);
expression = RemoveParenthesis(expression);
}
}
return expression;
}
private int ParseExpression(string expression)
{
int winningLevel = 0;
byte winningTokenWeight = 0;
int winningPos = 0;
int level = 0;
int pos = 0;
foreach (char token in expression.ToCharArray())
{
pos++;
switch (token)
{
case '(':
level++;
break;
case ')':
level--;
break;
}
if (level <= winningLevel)
{
if (OperatorWeight(token) > winningTokenWeight)
{
winningLevel = level;
winningTokenWeight = OperatorWeight(token);
winningPos = pos;
}
}
}
return winningPos;
}
private byte OperatorWeight(char value)
{
switch (value)
{
case '+':
case '-':
return 3;
case '*':
return 2;
case '/':
return 1;
default:
return 0;
}
}
}
// -------------------------------------------------------
class ExpressionTree : ExpressionNodeBase
{
protected NodeBase _rootNode;
public ExpressionTree(string expression)
{
_rootNode = GetNode(expression);
}
public override double Value
{
get
{
return _rootNode.Value;
}
}
}
// -------------------------------------------------------
abstract class OperatorNodeBase : ExpressionNodeBase
{
protected NodeBase _leftNode;
protected NodeBase _rightNode;
public OperatorNodeBase(string leftExpression, string rightExpression)
{
_leftNode = GetNode(leftExpression);
_rightNode = GetNode(rightExpression);
}
}
// -------------------------------------------------------
class Add : OperatorNodeBase
{
public Add(string leftExpression, string rightExpression)
: base(leftExpression, rightExpression)
{
}
public override double Value
{
get
{
return _leftNode.Value + _rightNode.Value;
}
}
}
// -------------------------------------------------------
class Subtract : OperatorNodeBase
{
public Subtract(string leftExpression, string rightExpression)
: base(leftExpression, rightExpression)
{
}
public override double Value
{
get
{
return _leftNode.Value - _rightNode.Value;
}
}
}
// -------------------------------------------------------
class Divide : OperatorNodeBase
{
public Divide(string leftExpression, string rightExpression)
: base(leftExpression, rightExpression)
{
}
public override double Value
{
get
{
return _leftNode.Value / _rightNode.Value;
}
}
}
// -------------------------------------------------------
class Multiply : OperatorNodeBase
{
public Multiply(string leftExpression, string rightExpression)
: base(leftExpression, rightExpression)
{
}
public override double Value
{
get
{
return _leftNode.Value * _rightNode.Value;
}
}
}
}
Ok, here is my naive implementation. Sorry, I did not feel to use objects for that one but it is easy to convert. I feel a bit like evil Willy (from Steve's story).
#!/usr/bin/env python
#tree structure [left argument, operator, right argument, priority level]
tree_root = [None, None, None, None]
#count of parethesis nesting
parenthesis_level = 0
#current node with empty right argument
current_node = tree_root
#indices in tree_root nodes Left, Operator, Right, PRiority
L, O, R, PR = 0, 1, 2, 3
#functions that realise operators
def sum(a, b):
return a + b
def diff(a, b):
return a - b
def mul(a, b):
return a * b
def div(a, b):
return a / b
#tree evaluator
def process_node(n):
try:
len(n)
except TypeError:
return n
left = process_node(n[L])
right = process_node(n[R])
return n[O](left, right)
#mapping operators to relevant functions
o2f = {'+': sum, '-': diff, '*': mul, '/': div, '(': None, ')': None}
#converts token to a node in tree
def convert_token(t):
global current_node, tree_root, parenthesis_level
if t == '(':
parenthesis_level += 2
return
if t == ')':
parenthesis_level -= 2
return
try: #assumption that we have just an integer
l = int(t)
except (ValueError, TypeError):
pass #if not, no problem
else:
if tree_root[L] is None: #if it is first number, put it on the left of root node
tree_root[L] = l
else: #put on the right of current_node
current_node[R] = l
return
priority = (1 if t in '+-' else 2) + parenthesis_level
#if tree_root does not have operator put it there
if tree_root[O] is None and t in o2f:
tree_root[O] = o2f[t]
tree_root[PR] = priority
return
#if new node has less or equals priority, put it on the top of tree
if tree_root[PR] >= priority:
temp = [tree_root, o2f[t], None, priority]
tree_root = current_node = temp
return
#starting from root search for a place with higher priority in hierarchy
current_node = tree_root
while type(current_node[R]) != type(1) and priority > current_node[R][PR]:
current_node = current_node[R]
#insert new node
temp = [current_node[R], o2f[t], None, priority]
current_node[R] = temp
current_node = temp
def parse(e):
token = ''
for c in e:
if c <= '9' and c >='0':
token += c
continue
if c == ' ':
if token != '':
convert_token(token)
token = ''
continue
if c in o2f:
if token != '':
convert_token(token)
convert_token(c)
token = ''
continue
print "Unrecognized character:", c
if token != '':
convert_token(token)
def main():
parse('(((3 * 4) / (1 + 2)) + 5)')
print tree_root
print process_node(tree_root)
if __name__ == '__main__':
main()