Lucene query where two fields will be compared - lucene

I have an Elasticsearch cluster. All documents in the cluster have the same index and type. Each document has two numeric fields, field1 and field2.
I want to display all documents in Grafana where the value of field1 is greater than the value of field2.
Is there a query like:
document_type:test AND field1 > field2 ?

As far as I'm aware there is no way to perform that sort of query using Elasticsearch (Lucene). It supports range queries, but not comparisons between different fields of the same document.

You can do this with a (Groovy) script filter, like this:
{
    "query" : {
        "term" : {
            "document_type" : "test"
        }
    },
    "filter" : {
        "script" : {
            "script" : "doc['field1'].value > doc['field2'].value"
        }
    }
}
See also the documentation on what is available from the Elasticsearch scripting module.
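On newer Elasticsearch versions (where Groovy scripting and the top-level filter are gone), the same comparison is usually expressed as a script clause inside a bool filter, with the script written in Painless. A rough sketch, reusing the field names from the question:

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "document_type": "test" } },
        {
          "script": {
            "script": {
              "source": "doc['field1'].value > doc['field2'].value",
              "lang": "painless"
            }
          }
        }
      ]
    }
  }
}

Note that a script filter runs per document, so it is more expensive than a plain term or range filter.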

Related

How to implement a backend for GraphQL connections?

In GraphQL the recommended way to paginate is to use connections, as described here. I understand the reasons and advantages of this approach, but I need advice on how to implement it.
The server side of the application works on top of a SQL database (Postgres in my case). Some of the GraphQL connection fields have an optional argument to specify sorting. Now, knowing the sorting columns and a cursor from the GraphQL query, how can I build an SQL query? Of course it should be efficient: if there is an SQL index for the combination of sorting columns, it should be used.
The problem is that SQL doesn't know anything like GraphQL cursors - we can't tell it to select all rows after a certain row. There is just WHERE, OFFSET and LIMIT. From my point of view it seems I need to first select a single row based on the cursor and then build a second SQL query, using the values of the sorting columns in that row to specify a complicated WHERE clause - I'm not sure the database would use an index in that case.
What bothers me is that I could not find any article on this topic. Does it mean that an SQL database is not usually used when implementing a GraphQL server? What database should be used then? How are GraphQL queries to connection fields usually transformed into queries for the underlying database?
EDIT: This is more or less what I came up with myself. The problem is how to extend it to support sorting as well, and how to implement it efficiently using database indexes.
The trick here is that, as the server implementer, the cursor can be literally any value you want encoded as a string. Most examples I've seen have been base64-encoded for a bit of opacity, but it doesn't have to be. (Try base64-decoding the cursors from the Star Wars examples in your link, for example.)
Let's say your GraphQL schema looks like
enum ThingColumn { FOO BAR }
input ThingFilter {
  foo: Int
  bar: Int
}
type Query {
  things(
    filter: ThingFilter,
    sort: ThingColumn,
    first: Int,
    after: String
  ): ThingConnection
}
Your first query might be
query {
  things(filter: { foo: 1 }, sort: BAR, first: 2) {
    edges {
      node { bar }
    }
    pageInfo {
      endCursor
      hasNextPage
    }
  }
}
This on its own could fairly directly translate into an SQL query like
SELECT bar FROM things WHERE foo=1 ORDER BY bar ASC LIMIT 2;
Now as you iterate through each item you can just use a string version of its offset as its cursor; that's totally allowed by the spec.
{
  "data": {
    "things": {
      "edges": [
        { "node": { "bar": 17 } },
        { "node": { "bar": 42 } }
      ],
      "pageInfo": {
        "endCursor": "2",
        "hasNextPage": true
      }
    }
  }
}
Then when the next query says after: "2", you can turn that back into an SQL OFFSET and repeat the query.
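Concretely, a follow-up request with after: "2" could reuse the same query with the cursor turned into an OFFSET. If deep offsets become a concern, the cursor could instead encode the last row's sort value (the keyset idea sketched in the question); that variant is only an assumption here, not something this answer prescribes. Rough SQL for both:

-- cursor "2" interpreted as an offset, as described above
SELECT bar FROM things WHERE foo = 1 ORDER BY bar ASC LIMIT 2 OFFSET 2;

-- hypothetical keyset variant: the cursor encodes the last seen bar value (42 from the previous page)
SELECT bar FROM things WHERE foo = 1 AND bar > 42 ORDER BY bar ASC LIMIT 2;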
If you're trying to build a generic GraphQL interface that gets translated into reasonably generic SQL queries, it's impossible to create indexes such that every query is "fast". As in other cases, you need to figure out what your common and/or slow queries are and CREATE INDEX as needed. You might be able to limit the options in your schema to things you know you can index:
type Other {
  things(first: Int, after: String): ThingConnection
}
query OtherThings($id: ID!, $cursor: String) {
  node(id: $id) {
    ... on Other {
      things(first: 100, after: $cursor) { ... FromAbove }
    }
  }
}
SELECT * FROM things WHERE other_id=? ORDER BY id LIMIT ?;
CREATE INDEX things_other ON things(other_id);
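Similarly, for the earlier things(filter: { foo: 1 }, sort: BAR) query, the usual candidate is a composite index over the filtered and sorted columns; whether the planner actually uses it depends on your data, so treat this as a sketch:

CREATE INDEX things_foo_bar ON things(foo, bar);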

Elasticsearch query context vs filter context

I am a little bit confused about the ElasticSearch Query DSL's query context and filter context. I have the 2 queries below. Both queries return the same results; the first one evaluates a score and the second one does not. Which one is more appropriate?
1st Query :-
curl -XGET 'localhost:9200/xxx/yyy/_search?pretty' -d'
{
  "query": {
    "bool": {
      "must": {
        "terms": { "mcc" : ["5045","5499"] }
      },
      "must_not": {
        "term": { "maximum_flag": false }
      },
      "filter": {
        "geo_distance": {
          "distance": "500",
          "location": "40.959334, 29.082142"
        }
      }
    }
  }
}'
2nd Query :-
curl -XGET 'localhost:9200/xxx/yyy/_search?pretty' -d'
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "maximum_flag": true } },
        { "terms": { "mcc" : ["5045","5499"] } },
        {
          "geo_distance": {
            "distance": "500",
            "location": "40.959334, 29.082142"
          }
        }
      ]
    }
  }
}'
Thanks,
In the official guide you have a good explanation:
Query context
A query clause used in query context answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a _score representing how well the document matches, relative to other documents.
Query context is in effect whenever a query clause is passed to a query parameter, such as the query parameter in the search API.
Filter context
In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g.
Does this timestamp fall into the range 2015 to 2016?
Is the status field set to "published"?
Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.
Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters in the bool query, the filter parameter in the constant_score query, or the filter aggregation.
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-filter-context.html
For your case we would need more information, but given that you are looking for exact values, a filter suits it better.
The first query evaluates a score because you are using "term" directly inside the query without wrapping it inside "filter", so by default a "term" written directly inside a query runs in query context, which results in a score being calculated.
In the second query, however, the "term" is inside "filter", which changes its context from query context to filter context. In filter context no score is calculated (by default a _score of 1 is assigned to all matching documents).
You can find more details about query behavior in this article:
https://towardsdatascience.com/deep-dive-into-querying-elasticsearch-filter-vs-query-full-text-search-b861b06bd4c0
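Putting the two contexts together, a common pattern is to keep the clauses that should influence the score under must and the exact-value clauses under filter. Applied to the fields from the question, that would look roughly like this (one possible form, not the only one):

{
  "query": {
    "bool": {
      "must": {
        "terms": { "mcc": ["5045", "5499"] }
      },
      "filter": [
        { "term": { "maximum_flag": true } },
        {
          "geo_distance": {
            "distance": "500",
            "location": "40.959334, 29.082142"
          }
        }
      ]
    }
  }
}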

Count in mongoDB

I have something like this for every user in mongoDB:
{
  "id" : 1234,
  "name" : "Mr. Someone",
  "userdata" : {
    "living" : {
      "city" : "Somecity",
      "address" : "Main Street 10.",
      "zip" : "1023"
    },
    "interest" : "Cars"
  }
}
I'm trying to find a way to count how many subscribers live in Somecity.
My best guess was the following:
db.users.count({userdata:{living:{city:"Somecity"}}})
But the result was 0.
How can I properly count "rows" by a given value in MongoDB?
I'm using MongoDB's documentation (for example: http://docs.mongodb.org/manual/reference/sql-comparison/) but could not resolve my problem yet.
I'm using MongoDB through the shell.
I think I have found the solution to my problem:
db.users.count({"userdata.living.city":"Somecity"})
This "dotting" method allowed me to search for only one value in the array, while the method I tried first wanted to find an exact match.
Further reading: http://docs.mongodb.org/manual/reference/operator/query/elemMatch/
Quote:
Since the $elemMatch only specifies a single condition, the $elemMatch
expression is not necessary, and instead you can use the following
query:
db.survey.find( { "results.product": "xyz" } )
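As a follow-up, if you want a per-city breakdown instead of a single count, an aggregation along these lines works in the mongo shell (using the collection and field names from the question):

db.users.aggregate([
    { $group: { _id: "$userdata.living.city", count: { $sum: 1 } } }
])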

Simple full text search in ElasticSearch

I'm trying to understand how ElasticSearch Query DSL works.
It would be a lot of help if anyone could give me an example of how to perform a search like the following MySQL query:
SELECT * FROM products
WHERE shop_id = 1
AND MATCH(title, description) AGAINST ('test' IN BOOLEAN MODE)
Assuming that you indexed some documents containing at least the shop_id, title and description fields, something like the following example:
{
  "shop_id" : "here goes your shop_id",
  "title" : "here goes your title",
  "description" : "here goes your description"
}
You can execute a multi match query against multiple fields, and give them a different weight (usually title is more important). You can also combine the query with a term filter on shop_id:
{
  "query" : {
    "multi_match" : {
      "query" : "here goes your query",
      "fields" : [ "title^2", "description" ]
    }
  },
  "filter" : {
    "term" : { "shop_id" : "here goes your shop id" }
  }
}
You need to submit the query using the search API. Filters are used to reduce the set of documents the query is executed against. Filters are faster since they don't involve scoring and can be cached. In my example I applied a top level filter, which may or may not be a good fit for you depending on what else you want to do next. If you want to compute a facet, for instance, the top level filter is ignored in the facet. Another way to add a filter, which is taken into account while computing the facets as well, is the filtered query.
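For reference, the filtered query form mentioned above would look roughly like this on the Elasticsearch versions this answer targets (on 2.x and later the equivalent is a bool query with a filter clause):

{
  "query" : {
    "filtered" : {
      "query" : {
        "multi_match" : {
          "query" : "here goes your query",
          "fields" : [ "title^2", "description" ]
        }
      },
      "filter" : {
        "term" : { "shop_id" : "here goes your shop id" }
      }
    }
  }
}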

MongoDB Update / Upsert Question - Schema Related

I have a problem representing data in MongoDB. I was using this schema design, where the combination of date and word is unique.
{'date': '2-1-2011',
 'word': 'word1',
 'users': ['user1', 'user2', 'user3', 'user4']}
{'date': '1-1-2011',
 'word': 'word2',
 'users': ['user1', 'user2']}
There are a fixed number of dates, approximately 200; potentially 100k+ words for each date; and 100k+ users.
I inserted records with an algorithm like so:
while records exist:
message, user, date = pop a record off a list
words = set(tokenise(message))
for word in words:
collection1.insert({'date':date, 'word':word}, {'user':user})
collection2.insert('something similar')
collection3.insert('something similar again')
collection4.insert('something similar again')
However, this schema resulted in extremely large collections and terrible performance. I am inserting different information into each of the four collections, so it amounts to an extremely large number of operations on the database.
I'm considering representing the data in a format like so, where the words and users arrays are sets.
{'date': '26-6-2011',
 'words': {
   'word1': ['user1', 'user2'],
   'word2': ['user1'],
   'word3': ['user1', 'user2', 'user3']}}
The idea behind this was to cut down on the number of database operations, so that for each loop of the algorithm I perform just one update per collection. However, I am unsure how to perform an update / upsert on this, because with each loop of the algorithm I may need to insert a new word, a new user, or both.
Could anyone recommend either a way to update this document, or could anyone suggest an alternative schema?
Thanks
Upsert is well suited for dynamically extending documents. Unfortunately, I only found it to work properly if the update object contains an atomic modifier operation, like the $addToSet here (mongo shell code):
db.words is empty. Add the first document for a given date with an upsert.
var query = { 'date' : 'date1' }
var update = { $addToSet: { 'words.word1' : 'user1' } }
db.words.update(query,update,true,false)
check object.
db.words.find();
{ "_id" : ObjectId("4e3bd4eccf7604a2180c4905"), "date" : "date1", "words" : { "word1" : [ "user1" ] } }
Now add some more users to the first word, and another word, in one update.
var update = { $addToSet: { 'words.word1' : { $each : ['user2', 'user4', 'user5'] }, 'words.word2': 'user3' } }
db.words.update(query,update,true,false)
again, check object.
db.words.find()
{ "_id" : ObjectId("4e3bd7e9cf7604a2180c4907"), "date" : "date1", "words" : { "word1" : [ "user1", "user2", "user4", "user5" ], "word2" : [ "user3" ] } }
I'm using MongoDB to insert 105 million records with ~10 attributes each. Instead of updating this dataset with changes, I just delete and reinsert everything. I found this method to be faster than individually touching each row to see if it was one I needed to update. You will get better insert speeds if you create JSON-formatted text files and use MongoDB's mongoimport tool.
Format your data into JSON text files (one file per collection).
mongoimport each file, specifying the collection you want it inserted into.
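A rough example of that workflow, with hypothetical database, collection, and file names:

mongoimport --db mydb --collection words --file words.json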