Fast lookup of tree with placeholders? - indexing

For an application I'm considering, there would be a large (100,000+) 'database' of trees (think expressions in a programming language, or S-expressions), and I would need to query that database for expressions that match a specific given expression.
Before giving the details of what I'd like to have, note that I'd appreciate any information related to indexing a large set of trees for optimizing lookup by a subtree.
In my specific situation (which would be for a backend to be used by Metamath proof assistants), expressions have the following structure (in Haskell-like notation):
data Expression = Placeholder Id | VarName Id | ConstName Id [Expression]
or as a BNF for an S-expression form:
Expression = '?' Id | Id | '(' Id Expression* ')'
where Id is some kind of identifier.
For example, I could have a database with expressions like
(equiv ?ph ?ps)
(not (in (appl (sqrt) (2)) (Q)))
(equiv (eq ?A ?B) (forall ?x (equiv (in ?x ?A) (in ?x ?B))))
In this context, two expressions match if they can be made equal by substitution of expressions for placeholders. So looking up (equiv (eq A (emptyset)) ?ph) in the above mini-database would result in the first and last expressions.
So again: how would I implement fast lookups in a large set of (expression) trees with placeholders? What kind of index data structure could I use?

I would implement the lookup with a trie. Each key would consist of one of the following:
ConstName Identifier
Variable w/ context info
ConstValue
Placeholder
These should be ordered in some fashion- possibly Placeholder, then all ConstNames (alphabetical), then variables (scope ordering, then argument order), then ConstValues (numerical order). As long as there's a concrete ordering for usage in the trie, you're fine.
Traverse the expression's tree, injecting the appropriate keys into the trie as they are encountered. Do this for all the expressions you want to insert into your data structure. When it comes time to query it, you can traverse the trie in a similar fashion, but with a few new rules.
Everything matches a placeholder node. If it matches some other key as well, then you'll need to explore both branches (easily done via a recursive DFS-like approach).
A placeholder matches everything. This is not equivalent to the previous point- we are talking about placeholders in the query here, the previous bullet is regarding placeholders as trie keys.
Now, this does mean that the search space can somewhat "explode" as you encounter placeholders, but there is one thing you can do to try to mitigate this in practice. Traverse the expression's tree in a breadth-first fashion (both in construction of the trie, and querying). This means if one of the arguments is a placeholder, you won't have to full-depth search every single subtree that matches that expression so far- instead you jump ahead to the next argument- which may not be a placeholder, and will thus greatly prune the search space (compared to matching "everything").
For completeness sake, lets take one of your examples
(not (in (appl (sqrt) (2)) (Q)))
and make a trie entry from that-
not -> in -> apply -> "Q" -> sqrt -> 2
adding (not (in ?ph E)) to this would result in-
not -> in -> apply -> "Q" -> sqrt -> 2
\-> ?ph -> "E"
Continue in this fashion injecting expressions into the trie. Also traverse in this fashion for querying until you reach the ends of your searches into the trie, and return those that matched.
Note- the uniqueness of these entries is based on the assumption you do not have to support variadic functions. If you do, attach to each key some context info (read the next paragraphs for info on how to do this) to distinguish which arguments go to which functions
There is one detail I glossed over- variables. If you only want it to match if they are the exact same variable name, then no work is necessary. But this likely isn't what you want; you probably want it to match generic variables as long as they are "consistent" with each other. The way to do this is to assign each variable an identifier that represents the scope of which it was first defined.
The easiest way to do this is just compose an identifier from the concatenation of the argument ordering of its ancestors. That is, if a variable is first defined as the second argument to a function which is the fifth argument to the root function, then we might label it as (5, 2) or (2, 5), whichever makes more sense intuitively. Either way, this will ensure the variable is given a consistent identifier regardless of other variables / functions elsewhere. Then proceed as normal with this new variable name.

Related

Lucene operator precedence for boolean operators

What is the order of operations for boolean operators? Left to right? Right to left? Specific operators have higher priority?
For example, if I search for:
jakarta OR apache AND website
What do I get? Is it
Anything with "jakarta" as well as anything with both "apache" and "website"?
Anything with "website" that also has either "jakarta" or "apache"?
Something else?
Short answer:
In Lucene, the AND operator takes precedence over the OR operator. So, you are effectively doing this:
jakarta OR (apache AND website)
You can verify this for yourself by parsing your query string and seeing how it converts AND and OR to the "required" and "optional" operators.
And the NOT operator takes precendence over the AND operator, since we are discussing precedence.
But you need to be very careful when dealing with Lucene's so-called "boolean" operators, as they do not behave the way you may expect based on their collective name ("boolean").
(Unfortunately I have never seen any official documentation which provides a citation for these precedence rules - but instead I am relying on empirical observations. See below for more about that. If the documentation for this does exist, that would be great to see.)
Longer Answer
One key thing to understand is that Lucene boolean operators are not really "boolean" in the sense that you may think, based on Boolean algebra, where you use parentheses to help avoid ambiguity (or where you need to know what rules a programming language may be applying) - and where everything evaluates to TRUE or FALSE.
Lucene boolean operators serve a subtly different purpose.
They are not purely concerned with TRUE/FALSE inclusion/exclusion, but also concerned with how to score results so that the more relevant results have higher scores than less relevant results.
The Lucene query jakarta OR apache AND website is equivalent to the following:
jakarta +apache +website
This means the document's field must contain apache and website, but may also include jakarta (for a higher relevance score).
You can see this for yourself by taking your original query string and parsing it:
Query query = parser.parse(queryString);
...and then printing the resulting string representation of the query. The + operator is the "required" operator. It:
requires that the term after the "+" symbol exist somewhere in the field
And the lack of a + operator means the default of "may" as in "may contain" - meaning the term is optional: it does not need to be present, if there is some other clause in the query which does match a document.
The use of AND forces the terms on either side of the AND to be required.
You can encounter some potentially surprising situations.
Consider this:
foo AND bar OR baz AND bat
This parses to the following:
+foo +bar +baz +bat
This is because the AND operators are transformed to + operators for every term, rendering the OR redundant.
It's the same result as if you had written this:
foo AND bar AND baz AND bat
But not the same as this:
(foo AND bar) OR (baz AND bat)
which is parsed to this, where the parentheses are retained:
(+foo +bar) (+baz +bat)
Bottom Line:
Use parentheses to explicitly make your intentions clear, when using AND and OR and also NOT.
Regarding NOT, since we mentioned it - that takes prescendence over AND.
The query:
foo AND bar NOT baz AND bat
Is parsed as:
+foo +bar -baz +bat
So, a document field must contain foo, bar and bat - and must not contain baz.
Why does this situation exist?
I don't know, but I think Lucene originally did not include AND, OR and NOT - but instead used + (must include), - (must not include) and "nothing" (may include). The so-called boolean operators AND, OR, NOT were added later on, as a kind of "syntactic sugar" for these original operators - introduced for people who were more familiar with AND, OR and NOT from other contexts. I'm basing this on the following thread:
Getting a Better Understanding of Lucene's Search Operators
A summary of that thread is included in this answer about the NOT operator.

Is it possible to preserve variable names when writing and reading term programatically?

I'm trying to write an SWI-Prolog predicate that applies numbervars/3 to a term's anonymous variables but preserves the user-supplied names of its non-anonymous variables. I eventually plan on adding some kind of hook to term_expansion (or something like that).
Example of desired output:
?- TestList=[X,Y,Z,_,_].
> TestList=[X,Y,Z,A,B].
This answer to the question Converting Terms to Atoms preserving variable names in YAP prolog shows how to use read_term to obtain as atoms the names of the variables used in a term. This list (in the form [X='X',Y='Y',...]) does not contain the anonymous variables, unlike the variable list obtained by term_variables, making isolation of the anonymous variables fairly straightforward.
However, the usefulness of this great feature is somewhat limited if it can only be applied to terms read directly from the terminal. I noticed that all of the examples in the answer involve direct user input of the term. Is it possible to get (as atoms) the variable names for terms that are not obtained through direct user input? That is, is there some way to 'write' a term (preserving variable names) to some invisible stream and then 'read' it as if it were input from the terminal?
Alternatively... Perhaps this is more of a LaTeX-ish line of thinking, but is there some way to "wrap" variables inside single quotes (thereby atom-ifying them) before Prolog expands/tries to unify them as variables, with the end result that they're treated as atoms that start with uppercase letters rather than as variables?
You can use the ISO core standard variable_names/1 read and write option. Here is some example code, that replaces anonymous variables in a variable name mapping:
% replace_anon(+Map, +Map, -Map)
replace_anon([_=V|M], S, ['_'=V|N]) :- member(_=W, S), W==V, !,
replace_anon(M, S, N).
replace_anon([A=V|M], S, [A=V|N]) :-
replace_anon(M, S, N).
replace_anon([], _, []).
variable_names/1 is ISO core standard. It was always a read option. It then became a write option as well. See also: https://www.complang.tuwien.ac.at/ulrich/iso-prolog/WDCor3
Here is an example run:
Welcome to SWI-Prolog (threaded, 64 bits, version 7.7.25)
?- read_term(X,[variable_names(M),singletons(S)]),
replace_anon(M,S,N),
write_term(X,[variable_names(N)]).
|: p(X,Y,X).
p(X,_,X)
To use the old numbervars/3 is not recommended, since its not compatible with attribute variables. You cannot use it for example in the presence of CLP(FD).
Is it possible to get (as atoms) the variable names for terms that are not obtained through direct user input?
if you want to get variable names from source files you should read them from there.
The easiest way to do so using term expansion.
Solution:
read_term_from_atom(+Atom, -Term, +Options)
Use read_term/3 to read the next term from Atom.
Atom is either an atom or a string object.
It is not required for Atom to end with a full-stop.
Use Atom as input to read_term/2 using the option variable_names and return the read term in Term and the variable bindings in variable_names(Bindings).
Bindings is a list of Name = Var couples, thus providing access to the actual variable names. See also read_term/2.
If Atom has no valid syntax, a syntax_error exception is raised.
write_term( Term ) :-
numbervars(Term, 0, End),
write_canonical(Term), nl.

Why does this predicate leave behind a choicepoint?

I've written the following predicate:
list_withoutlast([_Last], []). % forget the last element
list_withoutlast([First, Second|List], [First|WithoutLast]) :-
list_withoutlast([Second|List], WithoutLast).
Queries like list_withoutlast(X, [1, 2]). succeed deterministically, but queries like list_withoutlast([1, 2, 3], X) leave behind a choicepoint, even though there's only one answer.
When I trace it seems to be that SWI attempts to match list_withoutlast([3], Var) against both clauses, even though definitely only the first one will ever match!
Is there something else I can do to tell SWI that I want a list with more than one element? Or, if I want to take advantage of first-argument indexing, are my only options "zero-length list" and "non-zero length list"?
Do other Prologs handle this situation any differently?
You can rewrite your predicate to avoid the spurious choice point:
list_withoutlast([Head| Tail], List) :-
list_withoutlast(Tail, Head, List).
list_withoutlast([], _, []).
list_withoutlast([Head| Tail], Previous, [Previous| List]) :-
list_withoutlast(Tail, Head, List).
This definition takes advantage of first-argument indexing, which will distinguish in the list_withoutlast /3 predicate the first clause, which have an atom (the empty list) in the first argument, from the second clause, which have a (non-empty) list in the first argument.
Passing the head and tail of an input list argument as separate arguments to an auxiliary predicate is a common Prolog programming idiom to take advantage of first-argument indexing and avoid spurious choice-points.
Note that most Prolog systems don't apply deep term indexing. In particular, for compound terms, indexing usually only takes into account the name and arity and doesn't take into account the compound term arguments (a list with one element and a list with two or more elements share the same functor).

How would you structure a spreadsheet app in elm?

I've been looking at elm and I really enjoy learning the language. I've been thinking about doing a spreadsheet application, but i can't wrap my head how it would be structured.
Let's say we have three cells; A, B and C.
If I enter 4 in cell A and =A in cell B how would i get cell B to always equal cell A? If i then enter =A+B in cell C, can that be evaluated to 8, and also be updated when A or B changes?
Not sure how to lever Signals for such dynamic behavior..
Regards Oskar
First you need to decide how to represent your spreadsheet grid. If you come from a C background, you may want to use a 2D array, but I've found that a dictionary actually works better in Elm. So you can define type alias Grid a = Dict (Int, Int) a.
As for the a, what each cell holds... this is an opportunity to define a domain-specific language. So something like
type Expr = Lit Float | Ref (Int, Int) | Op2 (Float -> Float -> Float) Expr Expr
This means an expression is either a literal float, a reference to another cell location, or an operator. An operator can be any function on two floats, and two other expressions which get recursively evaluated. Depending on what you're going for, you can instead define specific tags for each operation, like Plus Expr Expr | Times Expr Expr, or you can add extra opN tags for operations of different arity (like negate).
So then you might define type alias Spreadsheet = Grid Expr, and if you want to alias (Int, Int) to something, that might help too. I'm also assuming you only want floats in your spreadsheet.
Now you need functions to convert strings to expressions and back. The traditional names for these functions are parse and eval.
parse : String -> Maybe Expr -- Result can also work
eval : Spreadsheet -> Grid Float
evalOne : Expr -> Spreadsheet -> Maybe Float
Parse will be a little tricky; the String module is your friend. Eval will involve chasing references through the spreadsheet and filling in the results, recursively. At first you'll want to ignore the possibility of catching infinite loops. Also, this is just a sketch, if you find that different type signatures work better, use them.
As for the view, I'd start with read-only, so you can verify hard-coded spreadsheets are evaluated properly. Then you can worry about editing, with the idea being that you just rerun the parser and evaluator and get a new spreadsheet to render. It should work because a spreadsheet has no state other than the contents of each cell. (Minimizing the recomputed work is one of many different ways you can extend this.) If you're using elm-html, table elements ought to be fine.
Hope this sets you off in the right direction. This is an ambitious project and I'd love to see it when you're done (post it to the mailing list). Good luck!

How to tell if an identifier is being assigned or referenced? (FLEX/BISON)

So, I'm writing a language using flex/bison and I'm having difficulty with implementing identifiers, specifically when it comes to knowing when you're looking at an assignment or a reference,
for example:
1) A = 1+2
2) B + C (where B and C have already been assigned values)
Example one I can work out by returning an ID token from flex to bison, and just following a grammar that recognizes that 1+2 is an integer expression, putting A into the symbol table, and setting its value.
examples two and three are more difficult for me because: after going through my lexer, what's being returned in ex.2 to bison is "ID PLUS ID" -> I have a grammar that recognizes arithmetic expressions for numerical values, like INT PLUS INT (which would produce an INT), or DOUBLE MINUS INT (which would produce a DOUBLE). if I have "ID PLUS ID", how do I know what type the return value is?
Here's the best idea that I've come up with so far: When tokenizing, every time an ID comes up, I search for its value and type in the symbol table and switch out the ID token with its respective information; for example: while tokenizing, I come across B, which has a regex that matches it as being an ID. I look in my symbol table and see that it has a value of 51.2 and is a DOUBLE. So instead of returning ID, with a value of B to bison, I'm returning DOUBLE with a value of 51.2
I have two different solutions that contradict each other. Here's why: if I want to assign a value to an ID, I would say to my compiler A = 5. In this situation, if I'm using my previously described solution, What I'm going to get after everything is tokenized might be, INT ASGN INT, or STRING ASGN INT, etc... So, in this case, I would use the former solution, as opposed to the latter.
My question would be: what kind of logical device do I use to help my compiler know which solution to use?
NOTE: I didn't think it necessary to post source code to describe my conundrum, but I will if anyone could use it effectively as a reference to help me understand their input on this topic.
Thank you.
The usual way is to have a yacc/bison rule like:
expr: ID { $$ = lookupId($1); }
where the the lookupId function looks up a symbol in the symbol table and returns its type and value (or type and storage location if you're writing a compiler rather than a strict interpreter). Then, your other expr rules don't need to care whether their operands come from constants or symbols or other expressions:
expr: expr '+' expr { $$ = DoAddition($1, $3); }
The function DoAddition takes the types and values (or locations) for its two operands and either adds them, producing a result, or produces code to do the addition at run time.
If possible redesign your language so that the situation is unambiguous. This is why even Javascript has var.
Otherwise you're going to need to disambiguate via semantic rules, for example that the first use of an identifier is its declaration. I don't see what the problem is with your case (2): just generate the appropriate code. If B and C haven't been used yet, a value-reading use like this should be illegal, but that involves you in control flow analysis if taken to the Nth degree of accuracy, so you might prefer to assume initial values of zero.
In any case you can see that it's fundamentally a language design problem rather than a coding problem.