How to compute sum(field) grouped by multiple fields in Elasticsearch?

I'm using Logstash to load row data from MySQL into Elasticsearch. How can I calculate the sum of one field, grouped by two other fields?
For example, here is a table named "Students" with the columns id, class_id, name, gender, age,
and here is a SQL query:
select class_id, gender, sum(age) from Students group by class_id, gender;
How can I translate this SQL into an Elasticsearch high-level REST client API call?
Below is my attempt, but it is not correct:
public TermsAggregationBuilder constructAggregation() {
TermsAggregationBuilder aggregation = AggregationBuilders.terms("by_classid")
.field("classId.keyword");
aggregation = aggregation.subAggregation(AggregationBuilders.terms("by_gender").field("gender.keyword"));
aggregation = aggregation.subAggregation(AggregationBuilders.sum("sum_age")
.field("age"));
return aggregation;
}

The following is the raw query for your SQL statement:
POST aggregation_index_2/_search
{
"size": 0,
"aggs": {
"gender_agg": {
"terms": {
"field": "gender"
},
"aggs": {
"class_id_aggregator": {
"terms": {
"field": "class_id"
},
"aggs": {
"age_sum_aggregator": {
"sum": {
"field": "age"
}
}
}
}
}
}
}
}
Mappings
PUT aggregation_index_2
{
"mappings": {
"properties": {
"gender": {
"type": "keyword"
},
"age": {
"type": "integer"
},
"class_id": {
"type": "integer"
}
}
}
}
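To get rows shaped like the SQL result, the nested buckets in the response still have to be flattened. Here is a minimal Python sketch of that step, not part of the original answer. It builds the same kind of request body as a dict and walks a response with the usual nested-bucket layout. The function names, the aggregation names ("outer", "inner", "total") and the sample response values are made up for illustration.

```python
def build_group_sum_body(outer_field, inner_field, sum_field):
    """Request body for: GROUP BY outer_field, inner_field with SUM(sum_field)."""
    return {
        "size": 0,  # only the aggregation results are needed, no hits
        "aggs": {
            "outer": {
                "terms": {"field": outer_field},
                "aggs": {
                    "inner": {
                        "terms": {"field": inner_field},
                        "aggs": {"total": {"sum": {"field": sum_field}}},
                    }
                },
            }
        },
    }


def flatten_buckets(response):
    """Turn nested terms buckets into (outer_key, inner_key, sum) rows."""
    rows = []
    for outer in response["aggregations"]["outer"]["buckets"]:
        for inner in outer["inner"]["buckets"]:
            rows.append((outer["key"], inner["key"], inner["total"]["value"]))
    return rows


# Hypothetical response, trimmed to the fields the flattening code reads.
sample_response = {
    "aggregations": {
        "outer": {
            "buckets": [
                {"key": "female", "doc_count": 3, "inner": {"buckets": [
                    {"key": 1, "doc_count": 2, "total": {"value": 31.0}},
                    {"key": 2, "doc_count": 1, "total": {"value": 16.0}},
                ]}},
                {"key": "male", "doc_count": 1, "inner": {"buckets": [
                    {"key": 1, "doc_count": 1, "total": {"value": 15.0}},
                ]}},
            ]
        }
    }
}

body = build_group_sum_body("gender", "class_id", "age")
rows = flatten_buckets(sample_response)
```

Keep in mind that a terms aggregation returns only 10 buckets per level by default, so a larger "size" may be needed to see every group.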

Related

SQL having equivalent keyword in ES query DSL

I have the following query that I need to convert to DSL and execute on ES. I could not find an out-of-the-box way in ES to aggregate and then filter on the aggregation results. As a workaround, I am fetching the 'group by count' for each id from ES and filtering the result in my application logic, which is not efficient. Can you suggest a more suitable solution?
select distinct id from index where colA = "something" group by id having count(*) > 10;
index mapping
id : (string)
colA: (string)
Use a terms aggregation to get the distinct ids, and a bucket_selector pipeline aggregation to keep only the ids with a doc count greater than 10:
{
"query": {
"bool": {
"filter": [
{
"term": {
"colA.keyword": "something" --> where clause
}
}
]
}
},
"aggs": {
"distinct_id": {
"terms": { --> group by
"field": "id.keyword",
"size": 10
},
"aggs": {
"ids_having_count_morethan_10": {
"bucket_selector": { --> having
"buckets_path": {
"count": "_count"
},
"script": "params.count>10"
}
}
}
}
}
}
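If the threshold and field names vary, the same request can be generated from code. Below is a rough Python sketch under a few assumptions: the function names and the "having_count" aggregation name are hypothetical, the top-level "size": 0 is an addition to suppress hits, and the second function is only an in-memory reference for what HAVING count(*) > N means, not something Elasticsearch runs.

```python
from collections import Counter


def build_having_count_body(filter_field, filter_value, group_field,
                            min_count, max_groups=10):
    """WHERE filter_field = filter_value GROUP BY group_field HAVING count(*) > min_count."""
    return {
        "size": 0,  # assumption: hits are not needed, only buckets
        "query": {"bool": {"filter": [{"term": {filter_field: filter_value}}]}},
        "aggs": {
            "distinct_id": {
                # max_groups caps how many ids can come back at all
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    "having_count": {
                        "bucket_selector": {
                            "buckets_path": {"count": "_count"},
                            "script": f"params.count > {int(min_count)}",
                        }
                    }
                },
            }
        },
    }


def having_count_local(docs, filter_field, filter_value, group_field, min_count):
    """In-memory equivalent, useful for checking results on small data."""
    counts = Counter(d[group_field] for d in docs if d[filter_field] == filter_value)
    return sorted(key for key, n in counts.items() if n > min_count)


body = build_having_count_body("colA.keyword", "something", "id.keyword", 10)

# Made-up documents: "a" matches 12 times, "b" 3 times, "c" never matches the filter.
docs = (
    [{"id": "a", "colA": "something"}] * 12
    + [{"id": "b", "colA": "something"}] * 3
    + [{"id": "c", "colA": "other"}] * 20
)
kept = having_count_local(docs, "colA", "something", "id", 10)
```

Since the terms "size" limits the number of buckets returned, it should be at least as large as the number of ids you expect to pass the threshold.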

aggregations merged in hits in elasticsearch

Just for an example, let's say I have a database, or an elastic index, holding sales persons and also all their customer visits past and into the future.
Let's also say I want to produce a list of these sales persons and show how many customer visits they have scheduled.
In SQL I would do something like this:
(mind: SQL is probably not all that correct, because it is just written here and just for telling what I am intending to do)
select foo, bar, sum(baz) from table_barbaz
where appointment_date > now()
group by bar
Is it possible to get the same result in Elasticsearch? For example, a list of documents looking something like this:
{
"foo": "Salesmen John",
"bar": "Client visit this week",
"sum_baz": 99
}
Not sure if this is related to nested aggregations or something else.
Below is a mapping that could have been used in this example. As the real mapping is internal IP, I don't really want to share it publicly.
{
"mappings": {
"properties": {
"salesman_id": {
"type": "integer"
},
"salesman_name": {
"type": "keyword"
},
"customer_visit": {
"type": "integer"
},
"customer_visit_start_date": {
"type": "date",
"format": "yyyy-MM-dd||strict_date"
},
"customer_visit_end_date": {
"type": "date",
"format": "yyyy-MM-dd||strict_date"
}
}
}
}
Then, an aggregation query like the following one would give you the number of customer visits for each salesman, for each day:
{
"size": 0,
"aggs": {
"salesmen": {
"terms": {
"field": "salesman_name",
"size": 20
},
"aggs": {
"days": {
"date_histogram": {
"field": "customer_visit_start_date",
"interval": "day"
},
"aggs": {
"visits": {
"sum": {
"field": "customer_visit"
}
}
}
}
}
}
}
}
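The query above returns one bucket per salesman per day and does not restrict the result to future visits. If one row per salesman is what is wanted, a variation could look like the Python sketch below. It is only a sketch: the function names and the output key "scheduled_visits" are hypothetical, the date histogram is dropped on purpose, and the range filter with "now" stands in for the `appointment_date > now()` condition from the question.

```python
def build_future_visits_body(max_salesmen=20):
    """Sum of customer_visit per salesman, counting only visits starting after now."""
    return {
        "size": 0,
        # "now" is Elasticsearch date math, evaluated when the query runs
        "query": {"range": {"customer_visit_start_date": {"gt": "now"}}},
        "aggs": {
            "salesmen": {
                "terms": {"field": "salesman_name", "size": max_salesmen},
                "aggs": {"visits": {"sum": {"field": "customer_visit"}}},
            }
        },
    }


def to_rows(response):
    """Flatten the salesmen buckets into one dict per salesman."""
    return [
        {"salesman_name": b["key"], "scheduled_visits": b["visits"]["value"]}
        for b in response["aggregations"]["salesmen"]["buckets"]
    ]


# Hypothetical response, trimmed to what to_rows() reads.
sample_response = {
    "aggregations": {
        "salesmen": {
            "buckets": [
                {"key": "Salesman John", "doc_count": 4, "visits": {"value": 99.0}},
                {"key": "Salesman Jane", "doc_count": 2, "visits": {"value": 7.0}},
            ]
        }
    }
}

body = build_future_visits_body()
rows = to_rows(sample_response)
```

The aggregation results always come back under "aggregations", separate from "hits", so a reshaping step like to_rows() is needed on the client side to get a flat list.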

How do I write an ElasticSearch query to find unique elements in columns?

For example, if I have a SQL query:
SELECT distinct emp_id, salary FROM TABLE_EMPLOYEE
what would be its ElasticSearch equivalent?
This is what I have come up with until now:
{
"aggs": {
"Employee": {
"terms": {
"field":["emp_id", "salary" ],
"size": 1000
}
}
}
}
Instead of sending a list of fields to perform distinct upon, send them as separate aggregations.
{
"aggs": {
"Employee": {
"terms": {
"field": "emp_id",
"size": 10
}
},
"Salary":{
"terms": {
"field": "salary",
"size": 10
}
}
},
"size": 0
}
To follow up on our conversation: you would send the request with the following HTTP command using curl.
curl -XGET localhost:9200/<your index>/<type>/_search?pretty
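Note that two separate terms aggregations give the distinct values of each field on its own, not the distinct (emp_id, salary) pairs that the SQL returns. One way to get pairs is a composite aggregation, which is a different technique from the one in the answer above. The Python sketch below builds such a request and reads one page of the response. The function names and the "pairs" aggregation name are hypothetical, and the sample response is invented.

```python
def build_distinct_pairs_body(fields, page_size=1000, after_key=None):
    """Composite aggregation with one terms source per field."""
    composite = {
        "size": page_size,
        "sources": [{f: {"terms": {"field": f}}} for f in fields],
    }
    if after_key is not None:
        # resume after the last key of the previous page
        composite["after"] = after_key
    return {"size": 0, "aggs": {"pairs": {"composite": composite}}}


def pairs_from_response(response, fields):
    """Return (list of value tuples, after_key or None) for one page."""
    agg = response["aggregations"]["pairs"]
    pairs = [tuple(b["key"][f] for f in fields) for b in agg["buckets"]]
    return pairs, agg.get("after_key")


# Hypothetical page of results.
sample_response = {
    "aggregations": {
        "pairs": {
            "after_key": {"emp_id": 2, "salary": 5000},
            "buckets": [
                {"key": {"emp_id": 1, "salary": 4000}, "doc_count": 3},
                {"key": {"emp_id": 2, "salary": 5000}, "doc_count": 1},
            ],
        }
    }
}

fields = ["emp_id", "salary"]
first_page = build_distinct_pairs_body(fields, page_size=2)
pairs, after = pairs_from_response(sample_response, fields)
next_page = build_distinct_pairs_body(fields, page_size=2, after_key=after)
```

To collect all pairs, keep requesting pages with the returned after_key until a page comes back with no buckets.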

Elastic search aggregation

Is there a way to return only one product when it exists in several colors?
e.g. suppose I have a product with following properties:
brand,color,title
nike, red, air max
nike, blue, air max
Now I want to write an Elasticsearch query that returns only one product from the aggregation, but still counts it as two belonging to the brand nike.
{
"query" : {
"match_all" : {}
},
"aggs" : {
"brand" : {
"terms" : {
"field" : "brand"
},
"aggs" : {
"size" : {
"terms" : {
"field" : "title"
}
}
}
}
}
}
I am not able to get the desired results. I want something like: select name,color,title, count(*) title from product group by name,title
I think you want one document per group, aggregated by name and title.
This can be done using the top_hits aggregation.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"brand": {
"terms": {
"field": "name"
},
"aggs": {
"size": {
"terms": {
"field": "title"
},
"aggs": {
"top_doc": {
"top_hits": {
"_source": ["name", "color", "band"],
"size": 1
}
}
}
}
}
}
}
}
For count, there is always doc_count in returned buckets.
Hope this helps!! If I am missing something, do mention.
Thanks
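What nested terms buckets with a size-1 top_hits give back can be pictured in plain Python: one representative document per group, plus the group's count. The sketch below is only an in-memory illustration of those semantics using the two rows from the question. The function name is made up and nothing here talks to Elasticsearch.

```python
def group_with_sample(products, keys):
    """Group dicts by the given keys, keeping a count and the first document seen."""
    groups = {}
    for product in products:
        group_key = tuple(product[k] for k in keys)
        # the first document of each group plays the role of the single top hit
        entry = groups.setdefault(group_key, {"doc_count": 0, "sample": product})
        entry["doc_count"] += 1
    return groups


products = [
    {"brand": "nike", "color": "red", "title": "air max"},
    {"brand": "nike", "color": "blue", "title": "air max"},
]

groups = group_with_sample(products, ["brand", "title"])
```

Here the single ("nike", "air max") group has a doc_count of 2 and carries the red product as its sample, which matches the "one product, counted as two" result asked for.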

ElasticSearch: filtering documents based on field length?

Is there a way to filter ElasticSearch documents based on the length of a specific field?
For instance, I have a bunch of documents with the field "body", and I only want to return results where the number of characters in body is > 1000. Is there a way to do this in ES without having to add an extra column with the length in the index?
Use the script filter, like this:
"filtered" : {
"query" : {
...
},
"filter" : {
"script" : {
"script" : "doc['body'].length > 1000"
}
}
}
EDIT
Sorry, meant to reference the query DSL guide on script filters
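The "filtered" query used above was removed in Elasticsearch 5.0. On newer versions the same idea is usually written as a script query inside a bool filter. The Python sketch below builds such a body. It rests on assumptions that are not in the original answer: the function name is hypothetical, the script language is Painless, and the field passed in must have doc values (for example a keyword sub-field mapped without an "ignore_above" limit that would drop long values), since scripts cannot read doc values from an analyzed text field by default.

```python
def build_min_length_filter(field, min_chars):
    """Bool filter with a script query keeping docs whose field value is longer than min_chars."""
    source = (
        f"doc['{field}'].size() > 0 && "          # skip documents without a value
        f"doc['{field}'].value.length() > params.min_chars"
    )
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "painless",
                            "source": source,
                            "params": {"min_chars": min_chars},
                        }
                    }
                }
            }
        }
    }


# "body.keyword" is an assumed sub-field name, not something from the question.
body = build_min_length_filter("body.keyword", 1000)
```

Passing the threshold through "params" rather than pasting it into the script text lets Elasticsearch reuse the compiled script when only the number changes.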
You can also create a custom tokenizer and use it in a multifields property as in the following:
PUT test_index
{
"settings": {
"analysis": {
"analyzer": {
"character_analyzer": {
"type": "custom",
"tokenizer": "character_tokenizer"
}
},
"tokenizer": {
"character_tokenizer": {
"type": "nGram",
"min_gram": 1,
"max_gram": 1
}
}
}
},
"mappings": {
"person": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"words_count": {
"type": "token_count",
"analyzer": "standard"
},
"length": {
"type": "token_count",
"analyzer": "character_analyzer"
}
}
}
}
}
}
}
PUT test_index/person/1
{
"name": "John Smith"
}
PUT test_index/person/2
{
"name": "Rachel Alice Williams"
}
GET test_index/person/_search
{
"query": {
"term": {
"name.length": 10
}
}
}
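The term query above matches an exact length. For a threshold such as "more than 1000 characters", a range query on the same token_count sub-field should work, since token_count values are numeric. A small Python sketch, with hypothetical function names, and assuming the single-character tokenizer from the mapping above keeps every character (spaces included), so that the stored count equals the plain string length for ordinary text:

```python
def build_min_length_query(length_field, min_exclusive):
    """Range query on a token_count sub-field such as "name.length"."""
    return {"query": {"range": {length_field: {"gt": min_exclusive}}}}


def expected_length(text):
    # Assumption: one token per character, so the count is just len(text).
    return len(text)


body = build_min_length_query("name.length", 10)

# The two sample documents indexed above.
john = expected_length("John Smith")
rachel = expected_length("Rachel Alice Williams")
```

With a threshold of 10, only "Rachel Alice Williams" would be expected to match, because "John Smith" is exactly 10 characters long and the range is exclusive.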