ElasticSearch: TooManyClauses exception when adding highlight - lucene

My query_string query gives me a TooManyClauses exception. However, in my case I don't think the exception is thrown for the usual reason, i.e. a query that by itself expands into more than 1024 boolean clauses. Instead, it seems to be related to highlighting, because when I remove the highlight part from the query, it works. This is my original query:
{
"query" : {
"query_string" : {
"query" : "aluminium potassium +DOS_UUID:*",
"default_field" : "fileTextContent.fileTextContentAnalyzed"
}
},
"fields" : [ "attachmentType", "DOS_UUID", "ATT_UUID", "DOCUMENT_REFERENCE", "filename", "isCSR", "mime" ],
"highlight" : {
"fields" : {
"fileTextContent.fileTextContentAnalyzed" : { }
}
}
}
and it gives me the TooManyClauses error:
{
"error": "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed; shardFailures {[02Z45jhrTCu7bSYy-XSW_g][markosindex][0]: FetchPhaseExecutionException[[markosindex][0]: query[filtered(fileTextContent.fileTextContentAnalyzed:aluminium fileTextContent.fileTextContentAnalyzed:potassium +DOS_UUID:*)->cache(_type:markostype)],from[0],size[10]: Fetch Failed [Failed to highlight field [fileTextContent.fileTextContentAnalyzed]]]; nested: TooManyClauses[maxClauseCount is set to 1024]; }]",
"status": 500
}
This is the query without the highlight, which works:
{
"query" : {
"query_string" : {
"query" : "aluminium potassium +DOS_UUID:*",
"default_field" : "fileTextContent.fileTextContentAnalyzed"
}
},
"fields" : [ "attachmentType", "DOS_UUID", "ATT_UUID", "DOCUMENT_REFERENCE", "filename", "isCSR", "mime" ]
}
UPDATE 1:
This is the stacktrace from the ElasticSearch log file:
[2014-10-10 16:03:18,236][DEBUG][action.search.type ] [Doop] [markosindex][0], node[02Z45jhrTCu7bSYy-XSW_g], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest#14d7ab1e]
org.elasticsearch.search.fetch.FetchPhaseExecutionException: [markosindex][0]: query[filtered(fileTextContent.fileTextContentAnalyzed:aluminium fileTextContent.fileTextContentAnalyzed:potassium +DOS_UUID:*)->cache(_type:markostype)],from[0],size[10]: Fetch Failed [Failed to highlight field [fileTextContent.fileTextContentAnalyzed]]
at org.elasticsearch.search.highlight.PlainHighlighter.highlight(PlainHighlighter.java:121)
at org.elasticsearch.search.highlight.HighlightPhase.hitExecute(HighlightPhase.java:126)
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:211)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:340)
at org.elasticsearch.search.action.SearchServiceTransportAction$11.call(SearchServiceTransportAction.java:308)
at org.elasticsearch.search.action.SearchServiceTransportAction$11.call(SearchServiceTransportAction.java:305)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
at org.apache.lucene.search.ScoringRewrite$1.checkMaxClauseCount(ScoringRewrite.java:72)
at org.apache.lucene.search.ScoringRewrite$ParallelArraysTermCollector.collect(ScoringRewrite.java:149)
at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:79)
at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:105)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:288)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:217)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:99)
at org.elasticsearch.search.highlight.CustomQueryScorer$CustomWeightedSpanTermExtractor.extractUnknownQuery(CustomQueryScorer.java:89)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:224)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:474)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:197)
at org.elasticsearch.search.highlight.PlainHighlighter.highlight(PlainHighlighter.java:113)
... 9 more
[2014-10-10 16:03:18,237][DEBUG][action.search.type ] [Doop] All shards failed for phase: [query_fetch]
Note: I am using ElasticSearch 1.2.1.
UPDATE 2:
This is my mapping:
{
"markosindex": {
"mappings": {
"markostype": {
"_id": {
"path": "DOCUMENT_REFERENCE"
},
"properties": {
"ATT_UUID": {
"type": "string",
"index": "not_analyzed"
},
"DOCUMENT_REFERENCE": {
"type": "string",
"index": "not_analyzed"
},
"DOS_UUID": {
"type": "string",
"index": "not_analyzed"
},
"attachmentType": {
"type": "string",
"index": "not_analyzed"
},
"fileTextContent": {
"type": "string",
"index": "no",
"fields": {
"fileTextContentAnalyzed": {
"type": "string"
}
}
},
"filename": {
"type": "string",
"index": "not_analyzed"
},
"isCSR": {
"type": "boolean"
},
"mime": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Any idea? Thanks!
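From the stack trace, the failure happens while the plain highlighter rewrites the query to extract terms: the +DOS_UUID:* wildcard is expanded by ScoringRewrite into one clause per matching term, and more than 1024 distinct DOS_UUID values trip maxClauseCount, even though the search phase itself avoids this expansion. A possible workaround, sketched below and not verified against 1.2.1: keep only the text clauses in the query the highlighter sees, and move the wildcard into a filter (an exists filter here, assuming DOS_UUID:* is only meant to require that the field is present):

{
  "query" : {
    "filtered" : {
      "query" : {
        "query_string" : {
          "query" : "aluminium potassium",
          "default_field" : "fileTextContent.fileTextContentAnalyzed"
        }
      },
      "filter" : {
        "exists" : { "field" : "DOS_UUID" }
      }
    }
  },
  "fields" : [ "attachmentType", "DOS_UUID", "ATT_UUID", "DOCUMENT_REFERENCE", "filename", "isCSR", "mime" ],
  "highlight" : {
    "fields" : {
      "fileTextContent.fileTextContentAnalyzed" : { }
    }
  }
}

If your version supports it, a separate highlight_query containing only the term clauses should achieve the same effect.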

Related

How to avoid duplicated data entries after parsing JSON in Kusto?

I have the following sample JSON data.
{
"data": {
"type": "ABC",
"id": "17495500314",
"attributes": {
[!["event": "update",
"gps_vali][1]][1]d": true,
"gps": {
"distance_diff": 6.48,
"total_distance": 848.6
},
"hdop": 79,
"fuel_level": 46.8,
"total_fuel_used": 60443.9,
"location": {
"latitude": 411.372618,
"longitude": -1.254931,
"relative_position": {
"distance": "37",
}
},
"idle_periods": []
},
"relationships": {
"assets": {
"data": [
{
"type": "ABCDFTTG",
"id": "1589799143500003",
"attributes": {
"external_id": "ABCDFTTG",
"hardware_id": "ABCDFTTG"
}
}
]
},
"devices": {
"data": [
{
"type": "ABCDFTTG",
"id": "1585231172900341",
"attributes": {
"serial": "5572016191"
}
},
{
"type": "tablet",
"id": "1587893062600175",
"attributes": {
"serial": "ABCDFTTG"
}
}
]
},
"users": {
"data": [
{
"type": "user",
"id": "ABCDFTTG",
"attributes": {
"external_id": "ABCDFTTG"
}
}
]
}
}
},
"meta": {
"message_id": "11eb-8c75-0b3f87aedbb5",
"consumer_version": "1.2.0",
"origin_version": null,
"timestamp": "2021-06-14T17:42:29Z"
}
}
I want only one row instead of these two. Here is my Kusto query, which is used for parsing the JSON data into table columns:
Test
|where messageId =="123"
//|mv-expand message=message.data.attributes
|mv-expand message
|mv-expand Value=message.data.relationships.assets.['data']
|mv-expand value_devices=message.data.relationships.devices.['data']
|mv-expand value_user=message.data.relationships.users.['data']
| project type=message.data.type,id=message.data.id,
event=tostring(message.data.attributes.event),
logged_at=tostring(message.data.attributes.logged_at),
distance=toint(message.data.attributes.location.relative_position.distance),
// Value=message.data.relationships.assets.['data'],//.['data']
type_asset=Value.type,asset_id=Value.id,
device_type=value_devices.type,device_id=value_devices.id,
device_attr_serial=value_devices.attributes.serial,
user_type=value_user.type,user_id=value_user.id,
user_external_id=value_user.attributes.external_id
This duplicate row appeared after adding the users tag; this tag is an array, so how do I handle such an array with a single id?
I have parsed my JSON data and got two rows in the output.
The expected output should be a single row; check the device_type and device_id columns.
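The cross product comes from mv-expanding the devices data array, which has two elements, so every user row is multiplied by every device row. One untested sketch of a way around it: if you only need one device per message, index into the array instead of expanding it (taking element [0] here is just an assumption for illustration):

Test
| where messageId == "123"
| mv-expand message
| mv-expand Value = message.data.relationships.assets.['data']
| mv-expand value_user = message.data.relationships.users.['data']
| extend value_devices = message.data.relationships.devices.['data'][0] // first device only
| project type = message.data.type, id = message.data.id,
    event = tostring(message.data.attributes.event),
    device_type = value_devices.type, device_id = value_devices.id,
    device_attr_serial = value_devices.attributes.serial,
    user_type = value_user.type, user_id = value_user.id,
    user_external_id = value_user.attributes.external_id

Alternatively, keep the mv-expand and collapse the duplicates afterwards, e.g. with summarize make_list(device_id) by tostring(id), if you need all devices packed into a single row.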

Query Druid SQL inner join with a dataSource name that has a dash

How do I write an INNER JOIN query between two data sources when one of them has a dash in its schema name?
Executing the following query on the Druid SQL binary results in a query error
SELECT *
FROM first
INNER JOIN "second-schema" on first.device_id = "second-schema".device_id;
org.apache.druid.java.util.common.ISE: Cannot build plan for query
Is this the correct syntax when trying to reference a data source that has a dash in its name?
Schema
[
{
"dataSchema": {
"dataSource": "second-schema",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "ts_start"
},
"dimensionsSpec": {
"dimensions": [
"etid",
"device_id",
"device_name",
"x_1",
"x_2",
"x_3",
"vlan",
"s_x",
"d_x",
"d_p",
"msg_type"
],
"dimensionExclusions": [],
"spatialDimensions": []
}
}
},
"metricsSpec": [
{ "type": "hyperUnique", "name": "conn_id_hll", "fieldName": "conn_id"},
{
"type": "count",
"name": "event_count"
}
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "minute"
}
},
"ioConfig": {
"type": "realtime",
"firehose": {
"type": "kafka-0.8",
"consumerProps": {
"zookeeper.connect": "localhost:2181",
"zookeeper.connectiontimeout.ms": "15000",
"zookeeper.sessiontimeout.ms": "15000",
"zookeeper.synctime.ms": "5000",
"group.id": "flow-info",
"fetch.size": "1048586",
"autooffset.reset": "largest",
"autocommit.enable": "false"
},
"feed": "flow-info"
},
"plumber": {
"type": "realtime"
}
},
"tuningConfig": {
"type": "realtime",
"maxRowsInMemory": 50000,
"basePersistDirectory": "\/opt\/druid-data\/realtime\/basePersist",
"intermediatePersistPeriod": "PT10m",
"windowPeriod": "PT15m",
"rejectionPolicy": {
"type": "serverTime"
}
}
},
{
"dataSchema": {
"dataSource": "first",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "ts_start"
},
"dimensionsSpec": {
"dimensions": [
"etid",
"category",
"device_id",
"device_name",
"severity",
"x_2",
"x_3",
"x_4",
"x_5",
"vlan",
"s_x",
"d_x",
"s_i",
"d_i",
"d_p",
"id"
],
"dimensionExclusions": [],
"spatialDimensions": []
}
}
},
"metricsSpec": [
{ "type": "doubleSum", "name": "val_num", "fieldName": "val_num" },
{ "type": "doubleMin", "name": "val_num_min", "fieldName": "val_num" },
{ "type": "doubleMax", "name": "val_num_max", "fieldName": "val_num" },
{ "type": "doubleSum", "name": "size", "fieldName": "size" },
{ "type": "doubleMin", "name": "size_min", "fieldName": "size" },
{ "type": "doubleMax", "name": "size_max", "fieldName": "size" },
{ "type": "count", "name": "first_count" }
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "minute"
}
},
"ioConfig": {
"type": "realtime",
"firehose": {
"type": "kafka-0.8",
"consumerProps": {
"zookeeper.connect": "localhost:2181",
"zookeeper.connectiontimeout.ms": "15000",
"zookeeper.sessiontimeout.ms": "15000",
"zookeeper.synctime.ms": "5000",
"group.id": "first",
"fetch.size": "1048586",
"autooffset.reset": "largest",
"autocommit.enable": "false"
},
"feed": "first"
},
"plumber": {
"type": "realtime"
}
},
"tuningConfig": {
"type": "realtime",
"maxRowsInMemory": 50000,
"basePersistDirectory": "\/opt\/druid-data\/realtime\/basePersist",
"intermediatePersistPeriod": "PT10m",
"windowPeriod": "PT15m",
"rejectionPolicy": {
"type": "serverTime"
}
}
}
]
Based on your schema definitions there are a few observations I'll make.
When doing a join you usually have to list out the columns explicitly (not use a *), otherwise you get collisions from duplicate columns. In your join, for example, you have a device_id in both "first" and "second-schema", not to mention all the other columns that are the same across both.
When using a quoting delimiter I don't mix quoted and unquoted references: I either quote an identifier everywhere it appears or not at all.
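For instance, a minimal spot check against the dashed data source would quote it in every position (a hypothetical query, just to illustrate the quoting rule):

SELECT "second-schema"."device_id", "second-schema"."msg_type"
FROM "second-schema"
LIMIT 10;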
So I think your query will work better in the form of something more like this:
SELECT
"first"."etid",
"first"."category",
"first"."device_id",
"first"."device_name",
"first"."severity",
"first"."x_2",
"first"."x_3",
"first"."x_4",
"first"."x_5",
"first"."vlan",
"first"."s_x",
"first"."d_x",
"first"."s_i",
"first"."d_i",
"first"."d_p",
"first"."id",
"second-schema"."etid" as "ss_etid",
"second-schema"."device_id" as "ss_device_id",
"second-schema"."device_name" as "ss_device_name",
"second-schema"."x_1" as "ss_x_1",
"second-schema"."x_2" as "ss_x_2",
"second-schema"."x_3" as "ss_x_3",
"second-schema"."vlan" as "ss_vlan",
"second-schema"."s_x" as "ss_s_x",
"second-schema"."d_x" as "ss_d_x",
"second-schema"."d_p" as "ss_d_p",
"second-schema"."msg_type"
FROM "first"
INNER JOIN "second-schema" ON "first"."device_id" = "second-schema"."device_id";
Obviously feel free to name the columns as you see fit, or include/exclude columns as needed. SELECT * will only work when all column names across both tables are unique.

elasticsearch Not_analyzed and analyzed

Hello. For a certain requirement I have made all the fields in the index not_analyzed:
{
"template": "*",
"mappings": {
"_default_": {
"dynamic_templates": [
{
"my_template": {
"match_mapping_type": "string",
"mapping": {
"index": "not_analyzed"
}
}
}
]
}
}
}
But now, as per our requirements, I have to make certain fields analyzed and keep the rest of the fields not_analyzed.
My data is of this shape:
{ "field1":"Value1",
"field2":"Value2",
"field3":"Value3",
"field4":"Value3",
"field5":"Value4",
"field6":"Value5",
"field7":"Value6",
"field8":"",
"field9":"ce-3sdfa773-7sdaf2-989e-5dasdsdf",
"field10":"12345678",
"field11":"ertyu12345ffd",
"field12":"A",
"field13":"Value7",
"field14":"Value8",
"field15":"Value9",
"field16":"Value10",
"field17":"Value11",
"field18":"Value12",
"field19":{
"field20":"Value13",
"field21":"Value14"
},
"field22":"Value15",
"field23":"ipaddr",
"field24":"datwithtime",
"field25":"Value6",
"field26":"0",
"field20":"0",
"field28":"0"
}
If I change my template as per the recommendation to something like this:
{
"template": "*",
"mappings": {
"_default_": {
"properties": {
"filed6": {
"type": "string",
"analyzer": "keyword",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}}},
"dynamic_templates": [
{
"my_template": {
"match_mapping_type": "*",
"mapping": {
"type": "string",
"index": "not_analyzed"
}
}
}
]
}
}
}
Then I get an error while inserting data, stating:
{"error":"MapperParsingException[failed to parse [field19]]; nested: ElasticsearchIllegalArgumentException[unknown property [field20 ]]; ","status":400}
In short you want to change the mapping of your index.
If your index does not contain any data (which I suppose is not the case), then you can simply delete the index and create it again with the new mapping.
If your index contains data, you will have to reindex it.
Steps for reindexing:
Copy all data from the existing index to a dummy index.
Delete the existing index and recreate it with the new mapping.
Transfer the data from the dummy index to the newly created index (see the sketch below).
You can also have a look at Elasticsearch aliases here.
This link might also be useful.
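As a rough sketch of the reindexing steps above, assuming the Python elasticsearch client and its helpers.reindex utility (a scan/scroll read plus a bulk write; the index names are hypothetical):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex

es = Elasticsearch(["localhost:9200"])

new_mapping = {
    "mappings": {
        "_default_": {
            # ... the updated _default_ mapping with the analyzed field ...
        }
    }
}

# 1. Copy all data from the existing index into a dummy index.
reindex(es, source_index="myindex", target_index="myindex_dummy")

# 2. Delete the existing index and recreate it with the new mapping.
es.indices.delete(index="myindex")
es.indices.create(index="myindex", body=new_mapping)

# 3. Transfer the data back from the dummy index, then clean up.
reindex(es, source_index="myindex_dummy", target_index="myindex")
es.indices.delete(index="myindex_dummy")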
If you want to use the same field as analyzed and not_analyzed at the same time, you have to use a multi-field:
"title": {
"type": "multi_field",
"fields": {
"title": { "type": "string" },
"raw": { "type": "string", "index": "not_analyzed" }
}
}
This is for your reference.
For defining a multi-field in dynamic_templates, use:
{
"template": "*",
"mappings": {
"_default_": {
"dynamic_templates": [
{
"my_template": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
]
}
}
}
Refer to this for more info.
You can either write multiple templates or have separate properties depending on your requirements. Both will work smoothly.
1) Multiple Templates
{
"mappings": {
"your_doctype": {
"dynamic_templates": [
{
"analyzed_values": {
"match_mapping_type": "*",
"match_pattern": "regex",
"match": "title|summary",
"mapping": {
"type": "string",
"analyzer": "keyword"
}
}
},
{
"date_values": {
"match_mapping_type": "date",
"match": "*_date",
"mapping": {
"type": "date"
}
}
},
{
"exact_values": {
"match_mapping_type": "*",
"mapping": {
"type": "string",
"index": "not_analyzed"
}
}
}
]
}
}
}
Here title and summary are analyzed by the keyword analyzer. I have also added a date template, which is optional; it will map create_date etc. as date. The last one will match everything else and keep it not_analyzed, which fulfills your requirements.
2) Add analyzed fields as properties.
{
"mappings": {
"your_doctype": {
"properties": {
"title": {
"type": "string",
"analyzer": "keyword"
},
"summary": {
"type": "string",
"analyzer": "keyword"
}
},
"dynamic_templates": [
{
"any_values": {
"match_mapping_type": "*",
"mapping": {
"type": "string",
"index": "not_analyzed"
}
}
}
]
}
}
}
Here the title and summary fields are analyzed while the rest will be not_analyzed.
You would have to reindex the data no matter which solution you take.
EDIT 1: After looking at your data and mapping, there is one slight problem. Your data contains an object structure: you are mapping everything apart from filed6 as string, but field19 is an object and not a string, hence ES is throwing the error. The solution is to let ES decide which datatype each field is via {dynamic_type}. Change your mapping to this:
{
"template": "*",
"mappings": {
"_default_": {
"properties": {
"filed6": {
"type": "string",
"analyzer": "keyword",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
},
"dynamic_templates": [
{
"my_template": {
"match_mapping_type": "*",
"mapping": {
"type": "{dynamic_type}", <--- this will decide if field is either object or string.
"index": "not_analyzed"
}
}
}
]
}
}
}
Hope this helps!!

Elasticsearch - Index Mapping settings for both exact and partial matching

I'm new to elasticsearch and am trying to learn how to index using optimal mapping settings to achieve the following.
If I have a document like this
{"name":"Galapagos Islands"}
I want to get this as a result for both of the following queries:
1) Partial matching
{
"query": {
"match": {
"name": "ga"
}
}
}
2) Exact matching
{
"query": {
"term": {
"name": "Galapagos Islands"
}
}
}
With the settings I currently have, I am able to achieve the partial matching part, but exact matching returns no results. Please find below the settings with which I indexed.
{
"mappings": {
"islands": {
"properties": {
"name":{
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "search_ngram"
}
}
}
},
"settings":{
"analysis":{
"analyzer":{
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
},
"filter":{
"ngram":{
"type":"ngram",
"min_gram":2,
"max_gram":15
}
}
}
}
}
What is the correct way to do exact matching and partial matching on a field ?
UPDATE
After recreating the index with the settings given below, my mappings look like this:
curl -XGET 'localhost:9200/testing/_mappings?pretty'
{
"testing" : {
"mappings" : {
"islands" : {
"properties" : {
"name" : {
"type" : "string",
"index_analyzer" : "autocomplete",
"search_analyzer" : "search_ngram",
"fields" : {
"raw" : {
"type" : "string",
"analyzer" : "my_keyword_lowercase_analyzer"
}
}
}
}
}
}
}
}
My index settings are below:
{
"mappings": {
"islands": {
"properties": {
"name":{
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "search_ngram",
"fields": {
"raw": {
"type": "string",
"analyzer": "my_keyword_lowercase_analyzer"
}
}
}
}
}
},
"settings":{
"analysis":{
"analyzer":{
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
},
"my_keyword_lowercase_analyzer": {
"type": "custom",
"filter": ["lowercase"],
"tokenizer": "keyword"
}
},
"filter":{
"ngram":{
"type":"ngram",
"min_gram":2,
"max_gram":15
}
}
}
}
}
And with all the above, when I query like this
curl -XGET 'localhost:9200/testing/islands/_search?pretty' -d '{"query": {"term": {"name.raw" : "Galapagos Islands"}}}'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
And my document is this:
curl -XGET 'localhost:9200/testing/islands/1?pretty'
{
"_index" : "testing",
"_type" : "islands",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source":{"name":"Galapagos Islands"}
}
Add a subfield to your name property which should be not_analyzed, or, if you care about lowercase/uppercase, analyzed with a keyword tokenizer together with a lowercase filter.
This will index Galapagos Islands as a single term, with no modifications other than lowercasing. Then you can do your term search.
For example, a keyword analyzer together with lowercase filter:
"my_keyword_lowercase_analyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
And the mapping:
"properties": {
"name":{
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "search_ngram",
"fields": {
"raw": {
"type": "string",
"analyzer": "my_keyword_lowercase_analyzer"
}
}
}
}
The query to be used is:
{
"query": {
"term": {
"name.raw": "galapagos islands"
}
}
}
So, instead of using the same field, name, you should be using name.raw (the subfield). Note also that the term you search for must be the lowercased form ("galapagos islands"): term queries are not analyzed, and my_keyword_lowercase_analyzer lowercases the value at index time, which is why your term query for "Galapagos Islands" returned no hits.
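For instance, repeating the earlier curl call with the lowercased term should now return the document (a sketch against the same testing index):

curl -XGET 'localhost:9200/testing/islands/_search?pretty' -d '{
  "query": { "term": { "name.raw": "galapagos islands" } }
}'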

Query for missing fields in nested documents

I have a user document which contains many tags
Here is the mapping:
{
"user" : {
"properties" : {
"tags" : {
"type" : "nested",
"properties" : {
"id" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"current" : {
"type" : "boolean"
},
"type" : {
"type" : "string"
},
"value" : {
"type" : "multi_field",
"fields" : {
"value" : {
"type" : "string",
"analyzer" : "name_analyzer"
},
"value_untouched" : {
"type" : "string",
"index" : "not_analyzed",
"include_in_all" : false
}
}
}
}
}
}
}
}
Here are the sample user documents:
User 1
{
"created_at": 1317484762000,
"updated_at": 1367040856000,
"tags": [
{
"type": "college",
"value": "Dhirubhai Ambani Institute of Information and Communication Technology",
"id": "a6f51ef8b34eb8f24d1c5be5e4ff509e2a361829"
},
{
"type": "company",
"value": "alma connect",
"id": "58ad4afcc8415216ea451339aaecf311ed40e132"
},
{
"type": "company",
"value": "Google",
"id": "93bc8199c5fe7adfd181d59e7182c73fec74eab5",
"current": true
},
{
"type": "discipline",
"value": "B.Tech.",
"id": "a7706af7f1477cbb1ac0ceb0e8531de8da4ef1eb",
"institute_id": "4fb424a5addf32296f00013a"
}
]
}
User 2:
{
"created_at": 1318513355000,
"updated_at": 1364888695000,
"tags": [
{
"type": "college",
"value": "Dhirubhai Ambani Institute of Information and Communication Technology",
"id": "a6f51ef8b34eb8f24d1c5be5e4ff509e2a361829"
},
{
"type": "college",
"value": "Bharatiya Vidya Bhavan's Public School, Jubilee hills, Hyderabad",
"id": "d20730345465a974dc61f2132eb72b04e2f5330c"
},
{
"type": "company",
"value": "Alma Connect",
"id": "93bc8199c5fe7adfd181d59e7182c73fec74eab5"
},
{
"type": "sector",
"value": "Website and Software Development",
"id": "dc387d78fc99ab43e6ae2b83562c85cf3503a8a4"
}
]
}
User 3:
{
"created_at": 1318513355001,
"updated_at": 1364888695010,
"tags": [
{
"type": "college",
"value": "Dhirubhai Ambani Institute of Information and Communication Technology",
"id": "a6f51ef8b34eb8f24d1c5be5e4ff509e2a361821"
},
{
"type": "sector",
"value": "Website and Software Development",
"id": "dc387d78fc99ab43e6ae2b83562c85cf3503a8a1"
}
]
}
Using the above ES documents for search, I want to construct a query that fetches users who either have a matching company tag in the nested tag documents or do not have any company tag at all. What would my search query be?
For example, in the above case, if I search for the google tag, the returned documents should be 'user 1' and 'user 3' (as user 1 has the company tag google, and user 3 has no company tag). User 2 is not returned, as it has a company tag but not the google one.
Not trivial at all, mainly due to the "does not have a type:company tag" clause. Here's what I came up with:
{
"or" : {
"filters" : [ {
"nested" : {
"filter" : {
"and" : {
"filters" : [ {
"term" : {
"tags.value" : "google"
}
}, {
"term" : {
"tags.type" : "company"
}
} ]
}
},
"path" : "tags"
}
}, {
"not" : {
"filter" : {
"nested" : {
"filter" : {
"term" : {
"tags.type" : "company"
}
},
"path" : "tags"
}
}
}
} ]
}
}
It contains an or filter with two clauses: the first is a nested filter that finds the documents that have tags.type:company together with tags.value:google, while the second finds all the documents that don't have any tags.type:company.
This needs to be optimized though, since and/or/not filters don't take advantage of the caching available to filters that work with bitsets, like the term filter does. It would be best to take some more time to find a way to use a bool filter and obtain the same result. Have a look at this article to know more.
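A rough sketch of that bool variant, untested, keeping the same two nested clauses (should plays the role of or, must_not the role of not):

{
  "bool": {
    "should": [
      {
        "nested": {
          "path": "tags",
          "filter": {
            "bool": {
              "must": [
                { "term": { "tags.value": "google" } },
                { "term": { "tags.type": "company" } }
              ]
            }
          }
        }
      },
      {
        "bool": {
          "must_not": {
            "nested": {
              "path": "tags",
              "filter": { "term": { "tags.type": "company" } }
            }
          }
        }
      }
    ]
  }
}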