I'm writing a grammar parser for my config evaluation library using the pest parser. The grammar supports the and, or, matches and == operators to build complex queries. For example:
Input: "foo: when bar == 'apple' and baz matches 'foobaz'"
The precedence of "==" and "matches" is greater than that of "and".
I want to create an AST representation of the input so that I can later evaluate the expression based on the values of the 'bar' and 'baz' variables:
{
  "foo": {
    "and": [
      {
        "ident": "bar",
        "op": "==",
        "value": "apple"
      },
      {
        "ident": "baz",
        "op": "matches",
        "value": "foobaz"
      }
    ]
  }
}
I'm having trouble specifying the precedence relations. How should I specify operator precedence in pest? Can someone provide an example? Thanks.
Here is my pest file, which doesn't work when I use the "and" and "or" operators:
input = _{ monadicExpr }
monadicExpr = { ident ~ startingverb ~ expr }
expr = {
| terms
| dyadicExpr
| ident
}
terms = { dyadicExpr ~ ((logicalAnd | logicalOr) ~ dyadicExpr)* }
dyadicExpr = { ident ~ verb ~ value }
value = _{ string | array }
startingverb = _{
":" ~ WHITESPACE* ~ "when"
}
verb = {
"matches" | "==" | logicalAnd | logicalOr
}
logicalAnd = { "and" }
logicalOr = { "or" }
ident = @{ ASCII_ALPHA ~ (ASCII_ALPHANUMERIC | "_")* }
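One common way to get precedence is to give each precedence level its own rule, so that the tighter-binding operators sit deeper in the grammar. The sketch below illustrates that layering as a hand-written recursive-descent parser in Python rather than in pest; the function names (tokenize, parse, comparison, and_expr, or_expr) are made up for illustration, and it assumes that "or" binds more loosely than "and", which the question does not state.

```python
import re

# Tokens: single-quoted strings, "==", ":", and bare words (identifiers/keywords).
TOKEN = re.compile(r"\s*('[^']*'|==|:|[A-Za-z_][A-Za-z0-9_]*)")

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    # Highest precedence: ident (== | matches) value
    def comparison():
        ident, op, value = take(), take(), take()
        if op not in ("==", "matches"):
            raise SyntaxError(f"unknown operator {op!r}")
        return {"ident": ident, "op": op, "value": value.strip("'")}

    # Middle level: comparisons joined by "and"
    def and_expr():
        items = [comparison()]
        while peek() == "and":
            take("and")
            items.append(comparison())
        return items[0] if len(items) == 1 else {"and": items}

    # Lowest level (assumed): and-expressions joined by "or"
    def or_expr():
        items = [and_expr()]
        while peek() == "or":
            take("or")
            items.append(and_expr())
        return items[0] if len(items) == 1 else {"or": items}

    name = take()
    take(":")
    take("when")
    tree = or_expr()
    if peek() is not None:
        raise SyntaxError(f"trailing input: {peek()!r}")
    return {name: tree}

print(parse("foo: when bar == 'apple' and baz matches 'foobaz'"))
```

The same layering should carry over to a PEG grammar: one rule for comparisons, one for "and" chains built from comparisons, and one for "or" chains built from "and" chains.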
We are building our own query language, similar to MySQL, using ANTLR4, except that we only support the WHERE clause; in other words, the user does not enter SELECT/FROM statements.
I was able to create a grammar for it and generate the lexer, parser and listener in Go.
Below is our grammar file, EsDslQuery.g4:
grammar EsDslQuery;
options {
language = Go;
}
query
: leftBracket = '(' query rightBracket = ')' #bracketExp
| leftQuery=query op=OR rightQuery=query #orLogicalExp
| leftQuery=query op=AND rightQuery=query #andLogicalExp
| propertyName=attrPath op=COMPARISON_OPERATOR propertyValue=attrValue #compareExp
;
attrPath
: ATTRNAME ('.' attrPath)?
;
fragment ATTR_NAME_CHAR
: '-' | '_' | ':' | DIGIT | ALPHA
;
fragment DIGIT
: ('0'..'9')
;
fragment ALPHA
: ( 'A'..'Z' | 'a'..'z' )
;
attrValue
: BOOLEAN #boolean
| NULL #null
| STRING #string
| DOUBLE #double
| '-'? INT EXP? #long
;
...
Query example: color="red" and price=20000 or model="hyundai" and (seats=4 or year=2001)
Elasticsearch supports SQL queries through a plugin: https://github.com/elastic/elasticsearch/tree/master/x-pack/plugin/sql.
I am having a hard time understanding the Java code.
Since we have logical operators, I am not quite sure how to get the parse tree and convert it to an ES query. Can somebody help or suggest ideas?
Update 1: Added more examples with corresponding ES query
Query Example 1: color="red" AND price=2000
ES query 1:
{
"query": {
"bool": {
"must": [
{
"terms": {
"color": [
"red"
]
}
},
{
"terms": {
"price": [
2000
]
}
}
]
}
},
"size": 100
}
Query Example 2: color="red" AND price=2000 AND (model="hyundai" OR model="bmw")
ES query 2:
{
"query": {
"bool": {
"must": [
{
"bool": {
"must": {
"terms": {
"color": ["red"]
}
}
}
},
{
"bool": {
"must": {
"terms": {
"price": [2000]
}
}
}
},
{
"bool": {
"should": [
{
"term": {
"model": "hyundai"
}
},
{
"term": {
"model": "bmw"
}
}
]
}
}
]
}
},
"size": 100
}
Query Example 3: color="red" OR color="blue"
ES query 3:
{
"query": {
"bool": {
"should": [
{
"bool": {
"must": {
"terms": {
"color": ["red"]
}
}
}
},
{
"bool": {
"must": {
"terms": {
"color": ["blue"]
}
}
}
}
]
}
},
"size": 100
}
Working demo url: https://github.com/omurbekjk/convert-dsl-to-es-query-with-antlr, estimated time spent: ~3 weeks
After investigating ANTLR4 and several examples, I found a simple solution using a listener and a stack, similar to how expressions are evaluated with a stack.
We need to override the default base listener with our own so that we get callbacks on entering and exiting each grammar rule. The important rules are:
Comparison expressions (price=200, price>190)
Logical operators (OR, AND)
Brackets (to build the ES query correctly, the grammar file must respect operator precedence, which is why brackets come first in the grammar file)
Below is my custom listener code, written in Go:
package parser
import (
"github.com/olivere/elastic"
"strings"
)
type MyDslQueryListener struct {
*BaseDslQueryListener
Stack []*elastic.BoolQuery
}
func (ql *MyDslQueryListener) ExitCompareExp(c *CompareExpContext) {
boolQuery := elastic.NewBoolQuery()
attrName := c.GetPropertyName().GetText()
attrValue := strings.Trim(c.GetPropertyValue().GetText(), `"`)
// Build a different query depending on the operator; the default (=) is a term query
termQuery := elastic.NewTermQuery(attrName, attrValue)
boolQuery.Must(termQuery)
ql.Stack = append(ql.Stack, boolQuery)
}
func (ql *MyDslQueryListener) ExitAndLogicalExp(c *AndLogicalExpContext) {
size := len(ql.Stack)
right := ql.Stack[size-1]
left := ql.Stack[size-2]
ql.Stack = ql.Stack[:size-2] // Pop last two elements
boolQuery := elastic.NewBoolQuery()
boolQuery.Must(right)
boolQuery.Must(left)
ql.Stack = append(ql.Stack, boolQuery)
}
func (ql *MyDslQueryListener) ExitOrLogicalExp(c *OrLogicalExpContext) {
size := len(ql.Stack)
right := ql.Stack[size-1]
left := ql.Stack[size-2]
ql.Stack = ql.Stack[:size-2] // Pop last two elements
boolQuery := elastic.NewBoolQuery()
boolQuery.Should(right)
boolQuery.Should(left)
ql.Stack = append(ql.Stack, boolQuery)
}
And main file:
package main
import (
"encoding/json"
"fmt"
"github.com/antlr/antlr4/runtime/Go/antlr"
"github.com/omurbekjk/convert-dsl-to-es-query-with-antlr/parser"
)
func main() {
fmt.Println("Starting here")
query := "price=2000 OR model=\"hyundai\" AND (color=\"red\" OR color=\"blue\")"
stream := antlr.NewInputStream(query)
lexer := parser.NewDslQueryLexer(stream)
tokenStream := antlr.NewCommonTokenStream(lexer, antlr.TokenDefaultChannel)
dslParser := parser.NewDslQueryParser(tokenStream)
tree := dslParser.Start()
listener := &parser.MyDslQueryListener{}
antlr.ParseTreeWalkerDefault.Walk(listener, tree)
esQuery := listener.Stack[0]
src, err := esQuery.Source()
if err != nil {
panic(err)
}
data, err := json.MarshalIndent(src, "", " ")
if err != nil {
panic(err)
}
stringEsQuery := string(data)
fmt.Println(stringEsQuery)
}
/** Generated es query
{
"bool": {
"should": [
{
"bool": {
"must": [
{
"bool": {
"should": [
{
"bool": {
"must": {
"term": {
"color": "blue"
}
}
}
},
{
"bool": {
"must": {
"term": {
"color": "red"
}
}
}
}
]
}
},
{
"bool": {
"must": {
"term": {
"model": "hyundai"
}
}
}
}
]
}
},
{
"bool": {
"must": {
"term": {
"price": "2000"
}
}
}
}
]
}
}
*/
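The stack idea can be shown without ANTLR or the elastic client. The sketch below is a Python illustration, not the Go code from the demo: it assumes the listener's exit callbacks arrive in post-order and models them as a plain list of events (the event tuples and the build_query name are hypothetical). Unlike the Go listener, it keeps the operands in left-to-right order.

```python
def build_query(events):
    """Build a nested bool query from post-order (exit) events using a stack."""
    stack = []
    for event in events:
        kind = event[0]
        if kind == "cmp":
            # Leaf: wrap a single term query in a bool/must, as the listener does.
            _, field, value = event
            stack.append({"bool": {"must": {"term": {field: value}}}})
        elif kind in ("and", "or"):
            # Binary operator: pop both operands and push the combined query.
            right = stack.pop()
            left = stack.pop()
            clause = "must" if kind == "and" else "should"
            stack.append({"bool": {clause: [left, right]}})
        else:
            raise ValueError(f"unknown event {kind!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced expression")
    return stack[0]

# Exit events for: color="red" AND price=2000
events = [("cmp", "color", "red"), ("cmp", "price", "2000"), ("and",)]
print(build_query(events))
```

Brackets need no event of their own here: they only change the order in which the exit events are produced.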
Have you thought about converting your sql-like statements to query string queries?
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"query_string" : {
"query" : "(new york city) OR (big apple)",
"default_field" : "content"
}
}
}
'
If your use cases stay as simple as color="red" and price=20000 or model="hyundai" and (seats=4 or year=2001), I'd go with the above. The syntax is quite powerful, but the queries will likely run somewhat more slowly than the native, spelled-out DSL queries, since Elasticsearch has to parse them into the DSL for you.
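For simple inputs, the conversion to query string syntax is mostly a token-level rewrite. Here is a rough Python sketch, assuming the DSL only contains field=value comparisons, and/or, parentheses and double-quoted values; the to_query_string name is made up for illustration.

```python
import re

# Double-quoted strings first, then parentheses, "=", and any other bare word.
TOKEN = re.compile(r'"[^"]*"|[()]|=|[^\s()="]+')

def to_query_string(dsl):
    out = []
    for tok in TOKEN.findall(dsl):
        if tok == "=":
            out.append(":")                    # field=value -> field:value
        elif tok.lower() in ("and", "or"):
            out.append(f" {tok.upper()} ")     # operators must be uppercase
        else:
            out.append(tok)                    # fields, values, parentheses
    return "".join(out)

print(to_query_string(
    'color="red" and price=20000 or model="hyundai" and (seats=4 or year=2001)'
))
# color:"red" AND price:20000 OR model:"hyundai" AND (seats:4 OR year:2001)
```

One caveat: as far as I know, the query string syntax does not follow the usual AND-before-OR precedence, so it is safer to add explicit parentheses around every group before sending the result.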
There is a tool called Dremio (https://www.dremio.com/).
It can translate SQL queries into Elasticsearch queries:
https://www.dremio.com/tutorials/unlocking-sql-on-elasticsearch/
Given this query written for SQL Server, how would I efficiently convert it to MongoDB?
select * from thetable where column1 = column2 * 2
You can use the aggregation below.
It adds a new field comp with $addFields to hold the result of comparing column1 with column2 * 2, then uses $match to keep the documents where comp is 0 (equal), and finally a $project with exclusion to drop the comp field.
db.collection.aggregate([
{ $addFields: {"comp": {$cmp: ["$column1", {$multiply: [ 2, "$column2" ]} ]}}},
{ $match: {"comp":0}},
{ $project:{"comp":0}}
])
If you want to run your query in the mongo shell, try the code below:
db.thetable.find({}).forEach(function(tt){
// Look up the same document again, requiring column1 to equal column2 * 2
var ttcol2 = tt.column2 * 2;
var compareCurrent = db.thetable.findOne({_id : tt._id, column1 : ttcol2});
if(compareCurrent){
printjson(compareCurrent);
}
});
I liked the answer posted by @Veeram, but it is also possible to achieve this using the $project and $match pipeline operations.
This is just for understanding the flow.
Assume we have the below 2 documents stored in a math collection
Mongo Documents
{
"_id" : ObjectId("58a055b52f67a312c3993553"),
"num1" : 2,
"num2" : 4
}
{
"_id" : ObjectId("58a055be2f67a312c3993555"),
"num1" : 2,
"num2" : 6
}
Now we need to find the documents where num2 is two times num1 (in our case, the document with _id ObjectId("58a055b52f67a312c3993553") matches this condition).
Query:
db.math.aggregate([
{
"$project": {
"num2": {
"$multiply": ["$num2",1]
},
"total": {
"$multiply": ["$num1",2]
},
"doc": "$$ROOT"
}
},
{
"$project": {
"areEqual": {"$eq": ["$num2","$total"]
},
doc: 1
}
},
{
"$match": {
"areEqual": true
}
},
{
"$project": {
"_id": 1,
"num1": "$doc.num1",
"num2": "$doc.num2"
}
}
])
Pipeline operation steps:
The 1st pipeline operation, $project, calculates the total.
The 2nd pipeline operation, $project, checks whether the total matches num2. This is needed because $match cannot directly compare num2 with total.
The 3rd pipeline operation, $match, keeps the documents where areEqual is true.
The 4th pipeline operation, $project, just projects the output fields.
Note:
In the 1st pipeline operation I multiplied num2 by 1 because num1 and num2 are stored as integers and $multiply returns a double value. If I did not use $multiply for num2, it would try to match 4 against 4.0, which would not match the document.
Certainly no need for multiple pipeline stages when a single $redact pipeline will suffice as it beautifully incorporates the functionality of $project and $match pipeline steps. Consider running the following pipeline for an efficient query:
db.collection.aggregate([
{
"$redact": {
"$cond": [
{
"$eq": [
"$column1",
{ "$multiply": ["$column2", 2] }
]
},
"$$KEEP",
"$$PRUNE"
]
}
}
])
In the above, $redact returns all documents that match the condition using $$KEEP and discards those that don't using the $$PRUNE system variable.
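To make the $cond / $$KEEP / $$PRUNE flow concrete, here is a small in-memory emulation in Python. It is only a sketch of the logic for top-level documents, not how MongoDB implements $redact (which also handles embedded documents and $$DESCEND); the evaluate and redact helpers are hypothetical and support just the three operators used in the pipeline.

```python
from math import prod

def evaluate(expr, doc):
    """Evaluate a tiny subset of aggregation expressions against one document."""
    # "$field" is a field path; "$$KEEP"/"$$PRUNE" are passed through as literals.
    if isinstance(expr, str) and expr.startswith("$") and not expr.startswith("$$"):
        return doc.get(expr[1:])
    if isinstance(expr, dict):
        (op, args), = expr.items()
        if op == "$multiply":
            return prod(evaluate(a, doc) for a in args)
        if op == "$eq":
            left, right = (evaluate(a, doc) for a in args)
            return left == right
        if op == "$cond":
            test, then, otherwise = args
            return evaluate(then if evaluate(test, doc) else otherwise, doc)
        raise ValueError(f"unsupported operator {op!r}")
    return expr  # plain literal such as 2 or "$$KEEP"

def redact(docs, expr):
    # Keep a document only when the expression resolves to $$KEEP.
    return [d for d in docs if evaluate(expr, d) == "$$KEEP"]

condition = {"$cond": [
    {"$eq": ["$column1", {"$multiply": ["$column2", 2]}]},
    "$$KEEP",
    "$$PRUNE",
]}
docs = [{"column1": 4, "column2": 2}, {"column1": 5, "column2": 2}]
print(redact(docs, condition))  # [{'column1': 4, 'column2': 2}]
```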
Given an array of objects, I would like to inject a property with its position in the array.
For example:
[ { "w" : "Hello" }, { "w" : "World" } ]
I would like to produce:
[ { "w" : "Hello", "p" : 0 }, { "w" : "World", "p" : 1 } ]
where p is the zero-based position in the array.
Is there a way to get the index of the element?
I tried this but it is not working:
keys[] as $i | [ .[] | .p= $i ]
I get:
[ { "w" : "Hello", "p" : 0 }, { "w" : "World", "p" : 0 } ]
You could do it like this:
[ keys[] as $i | .[$i] | .p=$i ]
Alternatively, you could make it work using to_entries like this:
[ to_entries[] | (.value.p=.key).value ]
Both of which yield:
[
{
"w": "Hello",
"p": 0
},
{
"w": "World",
"p": 1
}
]
Here is a solution that uses reduce:
reduce keys[] as $i (.; .[$i].p = $i)
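For readers more at home in Python than jq, the difference between the failing attempt and the working filters can be mimicked like this. It is only an analogy: the two function names are made up, and the first one models how `keys[] as $i` sitting outside the array construction produces one whole array per index.

```python
def broken_attempt(items):
    # Like `keys[] as $i | [ .[] | .p = $i ]`: the index loop is outside the
    # array construction, so every index yields a full copy of the array.
    return [[{**item, "p": i} for item in items] for i in range(len(items))]

def inject_position(items):
    # Like `[ keys[] as $i | .[$i] | .p = $i ]`: one output element per index.
    return [{**item, "p": i} for i, item in enumerate(items)]

data = [{"w": "Hello"}, {"w": "World"}]
print(broken_attempt(data)[0])  # [{'w': 'Hello', 'p': 0}, {'w': 'World', 'p': 0}]
print(inject_position(data))    # [{'w': 'Hello', 'p': 0}, {'w': 'World', 'p': 1}]
```

The first result of the broken version is the output shown in the question; a second array with p set to 1 everywhere follows it.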
I want to regex search an integer value in MongoDB. Is this possible?
I'm building a CRUD type interface that allows * for wildcards on the various fields. I'm trying to keep the UI consistent for a few fields that are integers.
Consider:
> db.seDemo.insert({ "example" : 1234 });
> db.seDemo.find({ "example" : 1234 });
{ "_id" : ObjectId("4bfc2bfea2004adae015220a"), "example" : 1234 }
> db.seDemo.find({ "example" : /^123.*/ });
>
As you can see, I insert an object and I'm able to find it by the value. If I try a simple regex, I can't actually find the object.
Thanks!
If you want to do a pattern match on numbers, the way to do it in MongoDB is to use the $where expression and pass in a pattern match.
> db.test.find({ $where: "/^123.*/.test(this.example)" })
{ "_id" : ObjectId("4bfc3187fec861325f34b132"), "example" : 1234 }
I am not a big fan of the $where query operator because of the way it evaluates the query expression: it doesn't use indexes, and it is a security risk if the query uses user input.
Starting from MongoDB 4.2, you can use the $regexMatch, $regexFind and $regexFindAll operators (available since MongoDB 4.1.9) together with $expr to do this.
let regex = /123/;
$regexMatch and $regexFind
db.col.find({
"$expr": {
"$regexMatch": {
"input": {"$toString": "$name"},
"regex": /123/
}
}
})
$regexFindAll
db.col.find({
"$expr": {
"$gt": [
{
"$size": {
"$regexFindAll": {
"input": {"$toString": "$name"},
"regex": "123"
}
}
},
0
]
}
})
From MongoDB 4.0 you can use the $toString operator, which is a wrapper around the $convert operator, to stringify integers.
db.seDemo.aggregate([
{ "$redact": {
"$cond": [
{ "$gt": [
{ "$indexOfCP": [
{ "$toString": "$example" },
"123"
] },
-1
] },
"$$KEEP",
"$$PRUNE"
]
}}
])
If what you want is to retrieve all the documents that contain a particular substring, starting from release 3.4 you can use the $redact operator, which allows conditional logic processing with $cond, together with $indexOfCP.
db.seDemo.aggregate([
{ "$redact": {
"$cond": [
{ "$gt": [
{ "$indexOfCP": [
{ "$toLower": "$example" },
"123"
] },
-1
] },
"$$KEEP",
"$$PRUNE"
]
}}
])
which produces:
{
"_id" : ObjectId("579c668c1c52188b56a235b7"),
"example" : 1234
}
{
"_id" : ObjectId("579c66971c52188b56a235b9"),
"example" : 12334
}
Prior to MongoDB 3.4, you need to $project your document and add another computed field that holds the string value of your number.
The $toLower operator and its sibling $toUpper convert a string to lowercase and uppercase respectively, but they have a little-known feature: they can also be used to convert an integer to a string.
The $match operator then returns all the documents that match your pattern using the $regex operator.
db.seDemo.aggregate(
[
{ "$project": {
"stringifyExample": { "$toLower": "$example" },
"example": 1
}},
{ "$match": { "stringifyExample": /^123.*/ } }
]
)
which yields:
{
"_id" : ObjectId("579c668c1c52188b56a235b7"),
"example" : 1234,
"stringifyExample" : "1234"
}
{
"_id" : ObjectId("579c66971c52188b56a235b9"),
"example" : 12334,
"stringifyExample" : "12334"
}
Now, if what you want is to retrieve all the documents that contain a particular substring, the easier and better way is the $redact operator (available from release 3.4, which was still upcoming as of the original writing), which allows conditional logic processing with $cond, together with $indexOfCP.
db.seDemo.aggregate([
{ "$redact": {
"$cond": [
{ "$gt": [
{ "$indexOfCP": [
{ "$toLower": "$example" },
"123"
] },
-1
] },
"$$KEEP",
"$$PRUNE"
]
}}
])
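Note that the two approaches above do not select the same documents: the /^123.*/ pattern is anchored at the start of the stringified number, while the $indexOfCP check matches "123" anywhere in it. A small in-memory Python sketch of the difference (the sample documents and helper names are made up; this is not MongoDB code):

```python
import re

docs = [{"example": 1234}, {"example": 12334}, {"example": 51234}, {"example": 999}]

def match_prefix(docs, pattern=r"^123.*"):
    # Mirrors stringifying the field and then applying an anchored regex.
    rx = re.compile(pattern)
    return [d for d in docs if rx.search(str(d["example"]))]

def contains(docs, needle="123"):
    # Mirrors the $indexOfCP > -1 check: substring found at any position.
    return [d for d in docs if str(d["example"]).find(needle) > -1]

print(match_prefix(docs))  # [{'example': 1234}, {'example': 12334}]
print(contains(docs))      # [{'example': 1234}, {'example': 12334}, {'example': 51234}]
```

Which one you want depends on whether the UI wildcard is meant as "starts with" or "contains".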
Is there a way to find out via the elasticsearch API how a query string query is actually parsed? You can do that manually by looking at the lucene query syntax, but it would be really nice if you could look at some representation of the actual results the parser has.
As javanna mentioned in the comments, there's the _validate API. Here's what works on my local Elasticsearch (version 1.6):
curl -XGET 'http://localhost:9201/pl/_validate/query?explain&pretty' -d'
{
"query": {
"query_string": {
"query": "a OR (b AND c) OR (d AND NOT(e or f))",
"default_field": "t"
}
}
}
'
pl is the name of an index on my cluster. Different indices can have different analyzers, which is why query validation is executed in the scope of an index.
The result of the above curl is the following:
{
"valid" : true,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"explanations" : [ {
"index" : "pl",
"valid" : true,
"explanation" : "filtered(t:a (+t:b +t:c) (+t:d -(t:e t:or t:f)))->cache(org.elasticsearch.index.search.nested.NonNestedDocsFilter@ce2d82f1)"
} ]
}
I made one OR lowercase on purpose and, as you can see in the explanation, it is interpreted as a token and not as an operator.
As for interpreting the explanation, the format is similar to the +/- operators of the query string query:
( and ) characters start and end a bool query
+ prefix means the clause will be in must
- prefix means the clause will be in must_not
no prefix means the clause will be in should (with default_operator equal to OR)
So the above is equivalent to the following:
{
"bool" : {
"should" : [
{
"term" : { "t" : "a" }
},
{
"bool": {
"must": [
{
"term" : { "t" : "b" }
},
{
"term" : { "t" : "c" }
}
]
}
},
{
"bool": {
"must": {
"term" : { "t" : "d" }
},
"must_not": {
"bool": {
"should": [
{
"term" : { "t" : "e" }
},
{
"term" : { "t" : "or" }
},
{
"term" : { "t" : "f" }
}
]
}
}
}
}
]
}
}
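The reading rules above are mechanical enough to script. Below is a Python sketch that turns the inner part of such an explanation into a nested bool structure. It assumes you have already stripped the filtered(...)->cache(...) wrapper, that terms contain no parentheses or whitespace, and it always emits lists for must, must_not and should; parse_explanation is a made-up helper, not part of any Elasticsearch client.

```python
import re

PREFIX_TO_CLAUSE = {"+": "must", "-": "must_not", "": "should"}

def parse_explanation(text):
    # Tokens: an optionally prefixed "(", a ")", or a (possibly prefixed) term.
    tokens = re.findall(r"[+-]?\(|\)|[^\s()]+", text)
    pos = 0

    def group():
        nonlocal pos
        clauses = {"must": [], "must_not": [], "should": []}
        while pos < len(tokens) and tokens[pos] != ")":
            tok = tokens[pos]
            pos += 1
            prefix = tok[0] if tok[0] in "+-" else ""
            body = tok[len(prefix):]
            if body == "(":
                node = group()   # nested bool query
                pos += 1         # skip the matching ")"
            else:
                field, _, value = body.partition(":")
                node = {"term": {field: value}}
            clauses[PREFIX_TO_CLAUSE[prefix]].append(node)
        # Drop the clause types that were not used in this group.
        return {"bool": {k: v for k, v in clauses.items() if v}}

    return group()

query = parse_explanation("t:a (+t:b +t:c) (+t:d -(t:e t:or t:f))")
print(query["bool"]["should"][0])  # {'term': {'t': 'a'}}
```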
I have used the _validate API quite heavily to debug complex filtered queries with many conditions. It is especially useful if you want to check how an analyzer tokenized input such as a URL, or whether some filter is cached.
There's also an awesome parameter, rewrite, that I was not aware of until now; it makes the explanation even more detailed by showing the actual Lucene query that will be executed.