I'm trying to wrap my head around Grakn to understand its added value. Is Graql compiled or translated to Gremlin traversal steps?
This makes me wonder about the difference in expressivity between SPARQL and Graql, given that the former has so far not been fully translated into Gremlin; it seems to be an open problem. Is Graql fundamentally simpler than SPARQL, and would that explain why it can be fully translated, if that is the case? If not, are there any limitations in translating it to Gremlin steps at this point?
I'll try to shine some light on your questions.
To begin with, Graql was designed to be a high-level, human-readable query language. The main idea was to abstract the vertex-edge graph data structure into concepts that are specific to a given user-defined domain. That way users don't need to worry about the underlying graph representation and low-level Gremlin constructs; instead they can work with high-level terms that they defined themselves and/or are familiar with.
Now, implementation-wise, Graql is an abstraction over Gremlin: it translates high-level queries to Gremlin traversals, which can then be executed against a specific graph. However, the mapping between Graql and Gremlin is not 1-1. In fact, Graql operates with a subset of Gremlin that is sufficient to capture the intended behaviours of the Graql language. It was never our intention to find such a mapping, as the goal was to translate high-level queries to queries understandable by the underlying graph processor.
Now to the efficiency of traversal generation. Graql queries can be decomposed into properties (has, isa, sub, etc.) and fragments. Each fragment has a defined Gremlin counterpart, and each property can contain multiple fragments. The fragment translation is unambiguous; however, there is a lot of freedom in picking and arranging the fragments that go into a property. Keeping in mind that queries contain multiple properties, this makes the arrangement a non-trivial task. To perform this arrangement, which in Gremlin is left to the user, we implemented a query processor. The idea of the processor is to pick an arrangement and ordering of the fragments such that the resulting query execution is as fast as possible. This is reminiscent of SQL query processors, and the motivation is exactly the same: to abstract query optimisation away from the user.
We are actively working on the query planning component, and although it gives no guarantee of producing the optimal plan in all cases, we are trying to make the produced plans converge to optimal solutions.
I'm working on a streaming rules engine, and some of my customers have a few hundred rules they'd like to evaluate on every event that arrives at the system. The rules are pure (i.e. non-side-effecting) Boolean expressions, and they can be nested arbitrarily deeply.
Customers are creating, updating and deleting rules at runtime, and I need to detect and adapt to the population of rules dynamically. At the moment, the expression evaluation uses an interpreter over the internal AST, and I haven't started thinking about codegen yet.
As always, some of the predicates in the tree are MUCH cheaper to evaluate than others, and I've been looking for an algorithm or data structure that makes it easier to find the predicates that are cheap, and that are validly interpretable as controlling the entire expression. My mental headline for this pattern is "ANDs all the way to the root", i.e. any predicate for which all ancestors are ANDs can be interpreted as controlling.
Despite several days of literature search, reading about ROBDDs, CNF, DNF, etc., I haven't been able to close the loop from what might be common practice in the industry to my particular use case. One thing I've found that seems related is "Analysis and optimization for boolean expression indexing", but it's not clear how I could apply it without implementing the BE-Tree data structure myself, as there doesn't seem to be an open source implementation.
I keep half-jokingly mentioning to my team that we're going to need a SAT solver one of these days. 😅 I guess it would probably suffice to write a recursive algorithm that traverses the tree and keeps track of whether every ancestor is an AND or an OR, but I keep getting the "surely this is a solved problem" feeling. :)
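A minimal sketch of that recursive traversal, in Python for illustration. The node classes `Pred`, `And` and `Or` and the `cost` field are hypothetical names, not taken from any real rules engine. The recursion only descends through AND nodes, so every predicate it returns has nothing but ANDs between it and the root.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Pred:
    name: str
    cost: float  # assumed relative cost estimate, supplied by the engine


@dataclass(frozen=True)
class And:
    children: Tuple[object, ...]


@dataclass(frozen=True)
class Or:
    children: Tuple[object, ...]


def controlling_predicates(node):
    """Return the predicates whose ancestors are all ANDs."""
    if isinstance(node, Pred):
        return [node]
    if isinstance(node, And):
        found = []
        for child in node.children:
            found.extend(controlling_predicates(child))
        return found
    # An OR (or any other operator) ends the chain of ANDs, so nothing
    # below it can decide the whole expression on its own.
    return []


rule = And((
    Pred("cheap", 1),
    Or((Pred("a", 5), Pred("b", 50))),
    And((Pred("c", 2), Pred("d", 100))),
))
cheapest_first = sorted(controlling_predicates(rule), key=lambda p: p.cost)
names = [p.name for p in cheapest_first]
```

If any predicate in `cheapest_first` evaluates to false, the whole rule is false, so the cheapest ones are natural candidates for an early-exit check.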
Edit: After talking to a couple of friends, I think I may have a sketch of a solution!
Transform the expressions into Conjunctive Normal Form, in which, by definition, every node is in a valid short-circuit position.
Use the Tseitin algorithm to try to avoid exponential blowups in expression size as a result of the CNF transform
For each AND in the tree, sort it in ascending order of cost (i.e. cheapest to the left)
???
Profit!^Weval as usual :)
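The sorting and evaluation steps of this sketch could look roughly like the following Python. The tuple-based node layout and the cost numbers are assumptions for illustration, and the CNF/Tseitin step is left out entirely. The sketch also sorts OR children, which goes slightly beyond the step described above; that is safe only because the predicates are pure.

```python
def estimated_cost(node):
    """Cost of a predicate, or the summed cost of an operator's children."""
    if node[0] == "pred":
        return node[2]
    return sum(estimated_cost(child) for child in node[1])


def reorder(node):
    """Return a copy of the tree with every child list sorted cheapest first."""
    if node[0] == "pred":
        return node
    children = sorted((reorder(child) for child in node[1]), key=estimated_cost)
    return (node[0], children)


def evaluate(node, event, calls):
    """Short-circuit evaluation; `calls` records which predicates actually ran."""
    kind = node[0]
    if kind == "pred":
        calls.append(node[1])
        return node[3](event)
    results = (evaluate(child, event, calls) for child in node[1])
    # all()/any() stop consuming the generator as soon as the result is known.
    return all(results) if kind == "and" else any(results)


# Hypothetical predicates: ("pred", name, cost, function)
slow = ("pred", "slow_regex", 100.0, lambda e: "error" in e["body"])
medium = ("pred", "has_user", 5.0, lambda e: "user" in e)
cheap = ("pred", "is_eu", 1.0, lambda e: e["region"] == "eu")

rule = ("and", [("or", [slow, medium]), cheap])

calls = []
result = evaluate(reorder(rule), {"region": "us", "body": "error: disk"}, calls)
```

Here only `is_eu` runs: after reordering it sits first under the AND, and its false result short-circuits the rest of the rule.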
You should seriously consider compiling the rules (and the predicates). An interpreter is 10-50x slower than machine code for the same thing. This is a good idea if the rule set doesn't change very often. It's even a good idea if the rules can change dynamically, because in practice they still don't change very fast, although now your rule compiler has to be online. Eh, that just makes for a bigger application program, and memory isn't much of an issue anymore.
A Boolean expression evaluation using individual machine instructions is even better. Any complex boolean equation can be compiled in branchless sequences of individual machine instructions over the leaf values. No branches, no cache misses; stuff runs pretty damn fast. Now, if you have expensive predicates, you probably want to compile code with branches to skip subtrees that don't affect the result of the expression, if they contain expensive predicates.
Within reason, you can generate any equivalent form (I'd run screaming into the night over the idea of using CNF because it always blows up on you). What you really want is the shortest boolean equation (deepest expression tree) equivalent to what the clients provided, because that will take the fewest machine instructions to execute. This may sound crazy, but you might consider exhaustive search code generation, e.g., literally try every combination that has a chance of working, especially if the number of operators in the equation is relatively small. The VLSI world has been working hard on various optimizations when synthesizing boolean equations into gates. You should look into the Espresso heuristic boolean logic optimizer (https://en.wikipedia.org/wiki/Espresso_heuristic_logic_minimizer).
One thing that might drive your expression evaluation is literally the cost of the predicates. If I have the formula A and B, and I know that A is expensive to evaluate and usually returns true, then clearly I want to evaluate B and A instead.
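To make the A/B example concrete, here is a small Python sketch. It assumes each predicate has a known cost and an estimated probability of returning true, and that the predicates are independent; the numbers are invented. Under those assumptions, the expected cost of an AND is minimized by evaluating operands in ascending order of cost / (1 - P(true)).

```python
def rank(pred):
    """Ordering key for AND operands: cost per unit chance of short-circuiting."""
    _name, cost, p_true = pred
    if p_true >= 1.0:
        return float("inf")  # never short-circuits, so evaluate it last
    return cost / (1.0 - p_true)


def order_for_and(preds):
    return sorted(preds, key=rank)


def expected_cost(preds):
    """Expected cost of evaluating the operands left to right with short-circuit."""
    total, reach = 0.0, 1.0  # reach = probability that this operand is evaluated
    for _name, cost, p_true in preds:
        total += reach * cost
        reach *= p_true
    return total


# Hypothetical measurements: (name, cost, probability of returning true)
A = ("A", 100.0, 0.95)  # expensive, usually true
B = ("B", 1.0, 0.5)     # cheap, true half of the time

best = order_for_and([A, B])
cost_b_first = expected_cost([B, A])
cost_a_first = expected_cost([A, B])
```

With these numbers, evaluating B first costs 51 units on average against roughly 101 for A first. For an OR the same idea applies with P(false) in place of P(true).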
You should consider common sub expression evaluation, so that any common subterm is only computed once. This is especially important when one has expensive predicates; you never want to evaluate the same expensive predicate twice.
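One simple way to get this effect in an interpreter, sketched in Python under the assumption that each predicate has a unique name, is a per-event cache keyed by that name. A compiler would instead hoist the shared subterm into a temporary. All names below are hypothetical.

```python
def evaluate(node, event, cache):
    """Evaluate a rule tree, computing each named predicate at most once per event."""
    kind = node[0]
    if kind == "pred":
        name, fn = node[1], node[2]
        if name not in cache:
            cache[name] = fn(event)
        return cache[name]
    if kind == "and":
        return all(evaluate(child, event, cache) for child in node[1])
    if kind == "or":
        return any(evaluate(child, event, cache) for child in node[1])
    raise ValueError(f"unknown node kind: {kind}")


call_count = {"n": 0}


def expensive(event):
    call_count["n"] += 1  # count invocations to show that sharing works
    return event["score"] > 10


shared = ("pred", "score_gt_10", expensive)
rules = [
    ("and", [shared, ("pred", "is_eu", lambda e: e["region"] == "eu")]),
    ("or", [shared, ("pred", "is_vip", lambda e: e["vip"])]),
]

event = {"score": 42, "region": "eu", "vip": False}
cache = {}  # one cache per event, shared across all rules
results = [evaluate(rule, event, cache) for rule in rules]
```

The expensive predicate appears in both rules but runs only once for this event.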
I implemented these tricks in a PLC emulator (these are basically machines that evaluate buckets [like hundreds of thousands] of boolean equations telling factory actuators when to move) using x86 machine instructions for AND/OR/NOT for Rockwell Automation some 20 years ago. It outran Rockwell's "premier" PLC which had custom hardware but was essentially an interpreter.
You might also consider incremental evaluation of the equations. The basic idea is not to re-evaluate all the equations over and over, but rather to re-evaluate only those equations whose input changed. Details are too long to include here, but a patent I did back then explains how to do it. See https://patents.google.com/patent/US5623401A/en?inventor=Ira+D+Baxter&oq=Ira+D+Baxter
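As a rough illustration of the basic idea only (this is a generic dependency-index sketch in Python, not the method described in the patent), each equation declares the inputs it reads, and an update re-evaluates just the equations that depend on an input whose value actually changed. The class and equation names are made up.

```python
from collections import defaultdict

_MISSING = object()  # sentinel for "no value seen yet"


class IncrementalEvaluator:
    def __init__(self, equations):
        # equations: name -> (tuple of input names, function of the value dict)
        self.equations = equations
        self.dependents = defaultdict(set)
        for name, (inputs, _fn) in equations.items():
            for input_name in inputs:
                self.dependents[input_name].add(name)
        self.values = {}
        self.results = {}
        self.evaluations = 0

    def update(self, changes):
        """Apply input changes and re-evaluate only the affected equations."""
        dirty = set()
        for key, value in changes.items():
            if self.values.get(key, _MISSING) != value:
                self.values[key] = value
                dirty |= self.dependents[key]
        for name in dirty:
            inputs, fn = self.equations[name]
            if all(i in self.values for i in inputs):
                self.results[name] = fn(self.values)
                self.evaluations += 1
        return dirty


equations = {
    "motor_on": (("start", "guard_closed"), lambda v: v["start"] and v["guard_closed"]),
    "alarm": (("overheat",), lambda v: v["overheat"]),
}
ev = IncrementalEvaluator(equations)
first = ev.update({"start": True, "guard_closed": True, "overheat": False})
second = ev.update({"overheat": True})   # only "alarm" is re-evaluated
third = ev.update({"overheat": True})    # unchanged input: nothing to do
```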
Are there any thick-client alternatives to Pulse / Gfsh for querying GemFire regions? Pulse is good, but it's not usable the way SQL Developer or Toad is for testing and querying.
Unfortunately, none that I know of, sorry.
However, an alternative approach would be to use Spring Data GemFire Repositories (additional details here) to write/express your (OQL) queries, and then write automated [JUnit] tests to test your queries defined in your application Repository interface.
For example, I can define an interface extension of either the SDC's [Crud]Repository or SDG's GemfireRepository interface and declare my application queries following certain conventions (a specification of the query criteria defined by the interface method signature). I.e. I do not need to write the actual queries.
Then, it is a relatively simple matter to define tests to exercise your application's queries.
You can even express more complex queries (like Equi-Joins on 2 or more collocated PRs). However, beware of the query limitations involving PRs in particular, as well as in general.
More information on querying PRs can be found here, and specifically involving Equi-Join Queries on PRs.
I have a hard time imagining any tool successfully enabling this sort of practical querying, since querying 2 collocated PRs (or a PR with any other Region type, e.g. REPLICATE or LOCAL) in an Equi-Join (OQL) Query must be performed inside a GemFire Function.
Anyway, I know this was not exactly what you were looking for since you probably just need something quick to test the validity of your query results in addition to analyzing the perf (like Explain Plan), but, this at least increases your test coverage in an automated, repeatable fashion.
Of course, this is all a moot point if you are just looking to perform analysis on the data outside an application.
Cheers,
John
I would like to ask for some thoughts about two concepts: Domain Object and Semantic Model.
I really want to understand what a Domain Object / Semantic Model is for, and what it is not for.
As far as I've been able to figure out, given a grammar, it is absolutely advisable to keep these concepts separate.
However, I can't quite figure out how to do it. For example, given this small grammar, how do you build a Domain Object or a Semantic Model?
It's exactly what I'm trying to figure out...
Most books suggest this approach for going through an AST: instead of translating directly while you walk the AST, you create a semantic model as you go through it, and then connect an interpreter to that model.
Example (SQL Syntax Tree):
Instead of directly generating a SQL statement, I create a semantic model, and then I'm able to connect an interpreter that translates this semantic model into a SQL statement.
Abstract Syntax Tree -> Semantic Model -> Interpreter
This way, I could have one interpreter for Transact-SQL and another one for SQLite.
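A small Python sketch of that pipeline, under heavy simplifying assumptions: the AST is a nested tuple, the semantic model is a single hypothetical `SelectQuery` class, and each "interpreter" is a function that renders the model in one dialect. Identifier quoting, WHERE clauses and so on are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SelectQuery:
    """Semantic model: what the query means, independent of any SQL dialect."""
    table: str
    columns: Tuple[str, ...]
    limit: Optional[int] = None


def build_model(ast):
    """Walk a (hypothetical) syntax tree and build the semantic model."""
    if ast[0] != "select":
        raise ValueError("expected a select node")
    parts = {child[0]: child[1:] for child in ast[1:]}
    limit = parts.get("limit")
    return SelectQuery(
        table=parts["from"][0],
        columns=tuple(parts["columns"]),
        limit=limit[0] if limit else None,
    )


def to_tsql(query):
    """Transact-SQL limits rows with TOP, right after SELECT."""
    top = f"TOP {query.limit} " if query.limit is not None else ""
    return f"SELECT {top}{', '.join(query.columns)} FROM {query.table}"


def to_sqlite(query):
    """SQLite limits rows with a trailing LIMIT clause."""
    limit = f" LIMIT {query.limit}" if query.limit is not None else ""
    return f"SELECT {', '.join(query.columns)} FROM {query.table}{limit}"


ast = ("select", ("columns", "id", "name"), ("from", "users"), ("limit", 10))
model = build_model(ast)
tsql = to_tsql(model)      # SELECT TOP 10 id, name FROM users
sqlite = to_sqlite(model)  # SELECT id, name FROM users LIMIT 10
```

The point of the intermediate model is that adding a third dialect means writing one more renderer, without touching the parser or the tree walk.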
The terms "domain object" and "semantic model" aren't really standard terms from the compiler literature, so you'll get lots of random answers.
The usual terms related to parsing are "concrete syntax tree" (matches the shape of the grammar rules), "abstract syntax tree" (an attempt to make a tree which contains less accidental detail, although it might not be worth the trouble.).
Parsing is only a small part of the problem of processing a language. You need a lot of semantic interpretation of syntax, however you represent it (AST, CST, ...). This includes concepts such as :
Name resolution (for every identifier, where is it defined? used?)
Type resolution (for every identifier/expression/syntax construct, what is the type of that entity?)
Type checking (is that syntax construct used in a valid way?)
Control flow analysis (what order are the program parts executed in, possibly even parallel/dynamic/constraint-determined)
Data flow analysis (where are values defined? consumed?)
Optimization (replacement of one set of syntax constructs by another semantically equivalent set with some nice property [executes faster after compilation is common]), at high or low levels of abstraction
High level code generation, e.g, interpreting sets of syntactic constructs in the language, to equivalent sets in the targeted [often assembly-language like] language
Each of these concepts more or less builds on top of the preceding ones.
The closest I can come to "semantic model" is that high-level code generation. That takes a lot of machinery that you have to build on top of trees.
ANTLR parses. You have to do/supply the rest.
Requirement:
I am trying to develop a language application using antlr4. The language in question is not important. The important thing is that the grammar is very large (easily >2000 rules!). I want to do a number of operations:
Extract a bunch of information. This can be call graphs, variable names, constant expressions, etc.
Any number of transformations:
if a loop can be expanded, we go ahead and expand it
If we can eliminate dead code we might choose to do that
we might choose to rename all variable names to conform to some norms.
Each of these operations can be applied independently of the others. After applying these steps, I want to rewrite the input so that it stays as close as possible to the original.
e.g. So we might want to eliminate loops and rename the variable and then output the result in the original language format.
Questions:
I see a need to build a custom Tree (read AST) for this. So that I can modify the tree with each of the transformations. However when I want to generate the output, I lose the nice abilities of the TokenStreamRewriter. I have to specify how to write each of the nodes of the tree and I lose the original input formatting for the places I didn't do any transformations. Does antlr4 provide a good way to get around this problem?
Is AST the best way to go? Or do I build my own object representation? If so how do I create that object efficiently? Creating object representation is very big pain for such a vast language. But may be better in the long run. Again how do I get back the original formatting?
Is it possible to work just on the parse tree?
Are there similar language applications which do the same thing? If so what strategy do they use?
Any input is welcome.
Thanks in advance.
In general, what you want is called a Program Transformation System (PTS).
PTSs generally have parsers, build ASTs, can prettyprint the ASTs to recover compilable source text. More importantly, they have standard ways to navigate/inspect/modify the ASTs so that you can change them programmatically.
Many offer these capabilities in the form of pattern-matching code fragments written in the surface syntax of the language being transformed; this avoids forever having to know excruciatingly fine details about which nodes are in your AST and how they are related to their children. This is incredibly useful when you have big, complex grammars, as most of our modern (and our legacy) languages seem to have.
More sophisticated PTSs (very few) provide additional facilities for teasing out the semantics of the source code. It is pretty hard to analyze/transform most code without knowing what scopes individual symbols belong to, or their type, and many other details such as data flow. Full disclosure: I build one of these.
I have a project where I need to build and store large trees of data in Ruby. I am considering different approaches for serialization, deserialization and querying of trees, and I am wondering what would be the best way to go. My major constraints are read time, query efficiency and cross-version/cross-platform compatibility. The most frequent operation is to retrieve sets of nodes based on a combination of id/value and/or feature(s). Trees can be up to 15-20 levels deep. Moving subtrees is an uncommon procedure, but should be possible without too much black magic. Rails integration is not a primary concern. The options I thought about, along with some issues I'm concerned about, are the following:
Marshal the trees, and when needed load them into memory and query them in Ruby (inefficiency as tree grows, cross-version compatibility?)
Same as above, but use YAML (more cross-version compatible, but less efficient?)
Same as above, but use a custom XML parser (need to recreate objects from scratch each time the tree is loaded?)
Serialize the trees to XML, store them in an XML database (e.g. Sedna) and use XPath to query the trees (no experience with this approach, not sure about efficiency?)
Use adjacency lists to query trees stored in an schema-less database (inefficiency when counting descendants?)
Use materialized paths (potential of overfilling the max string length for deep trees?)
Use nested sets (complex SQL queries?)
Use the array of ancestors approach? Seems interesting in terms of querying efficiency according to the MongoDB page, but I haven't been able to find any serious discussion of this algorithm.
Based on your experience, which approach would better fit with the constraints I have described? If I go for an XML database, are there ones that would be more suited for this project? Are there other approaches I have overlooked that would be more efficient? Thanks for your time.
Trees work really well with graph databases, such as neo4j: http://neo4j.org/learn/
Neo4j is a graph database, storing data in the nodes and relationships of a graph. The most generic of data structures, a graph elegantly represents any kind of data, preserving the natural structure of the domain.
Ruby has a good interface for the trees:
https://github.com/andreasronge/neo4j
Pacer is a JRuby library that enables very expressive graph traversals. Pacer allows you to create, modify and traverse graphs using very fast and memory-efficient stream processing. That also means that almost all processing is done in pure Java, so when it comes to the usual Ruby expressiveness vs. speed problem, you can have your cake and eat it too: it's very fast!
https://github.com/pangloss/pacer
Neography is like the neo4j.rb gem and suggested by Ron in the comments (thanks Ron!)
https://github.com/maxdemarzi/neography
Since you are considering a SQL approach, here are some things to think about.
First, how big are the trees? For many applications, 10,000 leaves would seem big. Yet this is small for a database. On any decent database system (even one running on a laptop), you should be able to store hundreds of thousands or millions of leaves in memory.
What a database buys you over other approaches is:
-- Not having to worry about memory/disk performance. When the data spills over to disk, you don't take a big hit on performance. By comparison, consider what happens when a hash table overflows memory.
-- Being able to add indexes to optimize performance.
-- Being able to alter your access path for the tree "just" by modifying SQL
One of the problems with standard SQL is that you can represent a tree node as a simple pair: , , . Then, with a simple join, you can move between parents and leafs. However, the joins accumulate as you move up the tree.
Sigh. Different databases have different solutions for this. SQL Server has recursive CTEs, which let you traverse the tree. Oracle has another approach for tree structures.
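For a feel of what a recursive CTE looks like, here is a self-contained Python sketch that uses the standard library's `sqlite3` module as a stand-in database. The `node` table and its columns are invented for the example. SQLite spells it `WITH RECURSIVE`; SQL Server uses plain `WITH` for the same construct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, label TEXT)"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [(1, None, "root"), (2, 1, "a"), (3, 1, "b"), (4, 2, "a1"), (5, 4, "a1x")],
)


def ancestors(conn, node_id):
    """Walk up from a node to the root; nearest ancestor first."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, parent_id, label, depth) AS (
            SELECT id, parent_id, label, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.parent_id, n.label, chain.depth + 1
            FROM node AS n JOIN chain ON n.id = chain.parent_id
        )
        SELECT label FROM chain WHERE depth > 0 ORDER BY depth
        """,
        (node_id,),
    ).fetchall()
    return [row[0] for row in rows]


def subtree(conn, node_id):
    """Return the node and all of its descendants, ordered by id."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id, label) AS (
            SELECT id, label FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.label FROM node AS n JOIN sub ON n.parent_id = sub.id
        )
        SELECT label FROM sub ORDER BY id
        """,
        (node_id,),
    ).fetchall()
    return [row[0] for row in rows]


up = ancestors(conn, 5)   # labels of nodes 4, 2, 1
down = subtree(conn, 2)   # labels of nodes 2, 4, 5
```

The database does the repeated self-join for you, so the query text stays the same regardless of how deep the tree is.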
This starts to get complicated.
Perhaps a better approach is to assign a "leaf" id based on the hierarchy in the tree. So, if this is a binary tree, then "10011" would be the node at right branch, left branch, left branch, right branch, right branch. There you would store information . . . such as whether it has children and whatever else. Getting the parent is easy, because you can just truncate the last digit.
You can see how this would generalize to non-binary trees. Having any number of children could pose a little challenge.
I believe this may be related to the "array of ancestors" approach.
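A minimal Python sketch of this id scheme for the binary case, assuming the root has the empty string as its id, "0" means left and "1" means right. The node payloads are placeholders. For non-binary trees one common variant is to join child indexes with a separator, which is not shown here.

```python
# Node id -> payload; the id encodes the path from the root.
nodes = {
    "": "root",
    "0": "L",
    "1": "R",
    "10": "RL",
    "100": "RLL",
    "1001": "RLLR",
    "10011": "RLLRR",
}


def parent(path):
    """The parent id is the path with its last digit removed."""
    return path[:-1] if path else None


def ancestors(path):
    """All proper prefixes of the path, nearest ancestor first."""
    return [path[:i] for i in range(len(path) - 1, -1, -1)]


def subtree(nodes, path):
    """A node's subtree is every stored id that starts with its path."""
    return sorted(p for p in nodes if p.startswith(path))


p = parent("10011")
chain = ancestors("10011")
below = subtree(nodes, "100")
```

In SQL the same subtree lookup becomes a prefix match on the id column (for example `LIKE '100%'`), which can be served by an ordinary index.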
As I think about it, I think this would work pretty well. I would then suggest that you define separate stored procedures for each action that you want:
usp_tree_FetchNode (NodeId)
usp_tree_GetParent (NodeId)
usp_tree_NodeDelete (NodeId)
usp_tree_FetchSubTree (NodeId)
etc. etc. etc.
Although SQL does not really support object-oriented programming, you can still organize your code with clean naming conventions and good function wrappers.
I actually think this might work and provide a pretty good method for developing the code. One nice side effect is that you can analyze the tree outside the application, which might suggest future enhancements.
Have you looked at the ancestry gem? I've only used it for simple trees, but from its description it looks like it fits your requirements.