Elasticsearch - Index Mapping settings for both exact and partial matching - indexing

I'm new to elasticsearch and am trying to learn how to index using optimal mapping settings to achieve the following.
If I have a document like this
{"name":"Galapagos Islands"}
I want to get this document as a result for both of the following queries:
1) Partial matching
{
"query": {
"match": {
"name": "ga"
}
}
}
2) Exact matching
{
"query": {
"term": {
"name": "Galapagos Islands"
}
}
}
With my current settings I am able to achieve the partial matching part, but exact matching returns no results. Below are the settings I indexed with.
{
"mappings": {
"islands": {
"properties": {
"name":{
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "search_ngram"
}
}
}
},
"settings":{
"analysis":{
"analyzer":{
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
},
"filter":{
"ngram":{
"type":"ngram",
"min_gram":2,
"max_gram":15
}
}
}
}
}
What is the correct way to do exact matching and partial matching on a field?
UPDATE
After recreating the index with the settings given below, my mappings look like this:
curl -XGET 'localhost:9200/testing/_mappings?pretty'
{
"testing" : {
"mappings" : {
"islands" : {
"properties" : {
"name" : {
"type" : "string",
"index_analyzer" : "autocomplete",
"search_analyzer" : "search_ngram",
"fields" : {
"raw" : {
"type" : "string",
"analyzer" : "my_keyword_lowercase_analyzer"
}
}
}
}
}
}
}
}
My index settings are below:
{
"mappings": {
"islands": {
"properties": {
"name":{
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "search_ngram",
"fields": {
"raw": {
"type": "string",
"analyzer": "my_keyword_lowercase_analyzer"
}
}
}
}
}
},
"settings":{
"analysis":{
"analyzer":{
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
},
"my_keyword_lowercase_analyzer": {
"type": "custom",
"filter": ["lowercase"],
"tokenizer": "keyword"
}
},
"filter":{
"ngram":{
"type":"ngram",
"min_gram":2,
"max_gram":15
}
}
}
}
}
And with all the above, when I query like this
curl -XGET 'localhost:9200/testing/islands/_search?pretty' -d '{"query": {"term": {"name.raw" : "Galapagos Islands"}}}'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
And my document is this:
curl -XGET 'localhost:9200/testing/islands/1?pretty'
{
"_index" : "testing",
"_type" : "islands",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source":{"name":"Galapagos Islands"}
}

Add a subfield to your name property which should be not_analyzed. Or, if you want the exact match to be case-insensitive, give the subfield an analyzer that combines the keyword tokenizer with a lowercase filter.
This indexes Galapagos Islands as a single token, without modifications other than the lowercasing. Then you can do your term search.
For example, a keyword analyzer together with lowercase filter:
"my_keyword_lowercase_analyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
And the mapping:
"properties": {
"name":{
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "search_ngram",
"fields": {
"raw": {
"type": "string",
"analyzer": "my_keyword_lowercase_analyzer"
}
}
}
}
The query to be used is:
{
"query": {
"term": {
"name.raw": "galapagos islands"
}
}
}
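So, instead of using the same field - name - you should be using name.raw (the subfield). Note that the term in the query is lowercased: a term query is not analyzed, so it has to equal the lowercased token stored in name.raw. Searching name.raw for "Galapagos Islands" with capitals returns nothing.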
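To see why the field and the casing both matter, here is a rough Python model of the two analyzers. It is only a sketch: the standard tokenizer is approximated with a regex, the stop and kstem filters are left out, and all function names are made up for the illustration.

```python
import re

def standard_tokens(text):
    # Rough stand-in for the standard tokenizer: split on non-word characters.
    return re.findall(r"\w+", text)

def ngrams(token, min_gram=2, max_gram=15):
    # All substrings whose length is between min_gram and max_gram.
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }

def autocomplete_analyzer(text):
    # standard tokenizer -> lowercase -> ngram (stop and kstem omitted here).
    grams = set()
    for token in standard_tokens(text):
        grams |= ngrams(token.lower())
    return grams

def keyword_lowercase_analyzer(text):
    # The keyword tokenizer keeps the whole input as one token; then lowercase.
    return {text.lower()}

def term_query(indexed_terms, term):
    # A term query is not analyzed: the term must equal an indexed token.
    return term in indexed_terms

def match_query(indexed_terms, text, search_analyzer):
    # A match query runs the input through the search analyzer first.
    return any(t in indexed_terms for t in search_analyzer(text))

doc = "Galapagos Islands"
name_terms = autocomplete_analyzer(doc)      # tokens held by "name"
raw_terms = keyword_lowercase_analyzer(doc)  # tokens held by "name.raw"

print(match_query(name_terms, "ga", keyword_lowercase_analyzer))
print(term_query(name_terms, "Galapagos Islands"))
print(term_query(raw_terms, "Galapagos Islands"))
print(term_query(raw_terms, "galapagos islands"))
```

In this model only the first and last lookups succeed: "ga" is one of the ngrams stored in name, and "galapagos islands" is the single lowercased token stored in name.raw.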

Related

Elasticsearch: Update mapping field type ID from long to string

I changed the elasticsearch mapping field type from:
"articles": {
"properties": {
"id": {
"type": "long"
}}}
to
"articles": {
"properties": {
"id": {
"type": "string",
"index": "not_analyzed"
}}}
After that I did the following steps:
1) Created the index with the new mapping
2) Reindexed the data into the new index
After the mapping update my previous query filter doesn't work anymore and I have no results:
GET /art/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"type": {
"value": "articles"
}
},
{
"term": {
"id": "123467679"
}
}
]
}
}
}
},
"size": 1,
"sort": [
{
"_score": "desc"
}
]
}
If I check with this query the result is what I expect:
GET /art/articles/_search
{
"query": {
"match_all": {}
}
}
I would appreciate it if somebody has an idea why the query no longer works after the field type change.
Thanks!
The problem in the query was the id filter.
The query works correctly after changing the filter from:
"term": {
"id": "123467679"
}
to:
"term": {
"_id": "123467679"
}
I'm still too much of a beginner with Elasticsearch to figure out why the mapping change broke the query even though I reindexed, but using "_id" fixed my query.
You can find more information in the Elasticsearch mapping reference documentation.

ElasticSearch: TooManyClauses exception when adding highlight

My query_string query gives me a TooManyClauses exception. However, in my case I don't think that the exception is thrown due to the usual reason. Instead, it seems to be related to highlighting, because when I remove the highlight from the query, it works. This is my original query:
{
"query" : {
"query_string" : {
"query" : "aluminium potassium +DOS_UUID:*",
"default_field" : "fileTextContent.fileTextContentAnalyzed"
}
},
"fields" : [ "attachmentType", "DOS_UUID", "ATT_UUID", "DOCUMENT_REFERENCE", "filename", "isCSR", "mime" ],
"highlight" : {
"fields" : {
"fileTextContent.fileTextContentAnalyzed" : { }
}
}
}
and it gives me the TooManyClauses error:
{
"error": "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed; shardFailures {[02Z45jhrTCu7bSYy-XSW_g][markosindex][0]: FetchPhaseExecutionException[[markosindex][0]: query[filtered(fileTextContent.fileTextContentAnalyzed:aluminium fileTextContent.fileTextContentAnalyzed:potassium +DOS_UUID:*)->cache(_type:markostype)],from[0],size[10]: Fetch Failed [Failed to highlight field [fileTextContent.fileTextContentAnalyzed]]]; nested: TooManyClauses[maxClauseCount is set to 1024]; }]",
"status": 500
}
This is the query without the highlight, which works:
{
"query" : {
"query_string" : {
"query" : "aluminium potassium +DOS_UUID:*",
"default_field" : "fileTextContent.fileTextContentAnalyzed"
}
},
"fields" : [ "attachmentType", "DOS_UUID", "ATT_UUID", "DOCUMENT_REFERENCE", "filename", "isCSR", "mime" ]
}
UPDATE 1:
This is the stacktrace from the ElasticSearch log file:
[2014-10-10 16:03:18,236][DEBUG][action.search.type ] [Doop] [markosindex][0], node[02Z45jhrTCu7bSYy-XSW_g], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest#14d7ab1e]
org.elasticsearch.search.fetch.FetchPhaseExecutionException: [markosindex][0]: query[filtered(fileTextContent.fileTextContentAnalyzed:aluminium fileTextContent.fileTextContentAnalyzed:potassium +DOS_UUID:*)->cache(_type:markostype)],from[0],size[10]: Fetch Failed [Failed to highlight field [fileTextContent.fileTextContentAnalyzed]]
at org.elasticsearch.search.highlight.PlainHighlighter.highlight(PlainHighlighter.java:121)
at org.elasticsearch.search.highlight.HighlightPhase.hitExecute(HighlightPhase.java:126)
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:211)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:340)
at org.elasticsearch.search.action.SearchServiceTransportAction$11.call(SearchServiceTransportAction.java:308)
at org.elasticsearch.search.action.SearchServiceTransportAction$11.call(SearchServiceTransportAction.java:305)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
at org.apache.lucene.search.ScoringRewrite$1.checkMaxClauseCount(ScoringRewrite.java:72)
at org.apache.lucene.search.ScoringRewrite$ParallelArraysTermCollector.collect(ScoringRewrite.java:149)
at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:79)
at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:105)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:288)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:217)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:99)
at org.elasticsearch.search.highlight.CustomQueryScorer$CustomWeightedSpanTermExtractor.extractUnknownQuery(CustomQueryScorer.java:89)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:224)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:474)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:197)
at org.elasticsearch.search.highlight.PlainHighlighter.highlight(PlainHighlighter.java:113)
... 9 more
[2014-10-10 16:03:18,237][DEBUG][action.search.type ] [Doop] All shards failed for phase: [query_fetch]
Note: I am using ElasticSearch 1.2.1.
UPDATE 2:
This is my mapping:
{
"markosindex": {
"mappings": {
"markostype": {
"_id": {
"path": "DOCUMENT_REFERENCE"
},
"properties": {
"ATT_UUID": {
"type": "string",
"index": "not_analyzed"
},
"DOCUMENT_REFERENCE": {
"type": "string",
"index": "not_analyzed"
},
"DOS_UUID": {
"type": "string",
"index": "not_analyzed"
},
"attachmentType": {
"type": "string",
"index": "not_analyzed"
},
"fileTextContent": {
"type": "string",
"index": "no",
"fields": {
"fileTextContentAnalyzed": {
"type": "string"
}
}
},
"filename": {
"type": "string",
"index": "not_analyzed"
},
"isCSR": {
"type": "boolean"
},
"mime": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Any idea? Thanks!
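One possible reading of the stack trace above: the exception is raised inside MultiTermQuery.rewrite, called from the highlighter's WeightedSpanTermExtractor. That suggests the DOS_UUID:* wildcard is being expanded into one boolean clause per matching term while highlighting. The toy Python model below only illustrates that mechanism; it is not Lucene's code, and the function name and sample UUID values are invented.

```python
MAX_CLAUSE_COUNT = 1024  # the limit named in the exception message


class TooManyClauses(Exception):
    pass


def rewrite_multi_term(field, indexed_terms, matches, max_clauses=MAX_CLAUSE_COUNT):
    # Expand a wildcard-style query into one (field, term) clause per
    # indexed term that matches, failing once the clause limit is exceeded.
    clauses = []
    for term in indexed_terms:
        if matches(term):
            if len(clauses) >= max_clauses:
                raise TooManyClauses(f"maxClauseCount is set to {max_clauses}")
            clauses.append((field, term))
    return clauses


def match_everything(term):
    # "*" matches every term in the field.
    return True


# Hypothetical data: an index with a few, then many, distinct DOS_UUID values.
few_uuids = [f"uuid-{i}" for i in range(10)]
many_uuids = [f"uuid-{i}" for i in range(2000)]

print(len(rewrite_multi_term("DOS_UUID", few_uuids, match_everything)))
try:
    rewrite_multi_term("DOS_UUID", many_uuids, match_everything)
except TooManyClauses as exc:
    print("failed:", exc)
```

If this reading is right, the failure depends on how many distinct DOS_UUID values the index holds, which would explain why the same query works when no highlighting is requested.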

hierarchical faceting with Elasticsearch

I'm using Elasticsearch and need to implement faceted search for hierarchical objects, as follows:
category 1 (10)
subcategory 1 (4)
subcategory 2 (6)
category 2 (X)
...
So I need to get facets for two related objects. The documentation says that this kind of facet is possible for numeric values, but I need it for strings: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-stats-facet.html
Here is another interesting topic, though unfortunately it's old: http://elasticsearch-users.115913.n3.nabble.com/Pivot-facets-td2981519.html
Is this possible with Elasticsearch?
If so, how can I do that?
The previous solution works really well as long as a single document has no more than one multi-level tag. Once a document carries several, a simple aggregation doesn't work, because the flat structure of Lucene fields mixes the results in the inner aggregation.
See the example below:
DELETE /test_category
POST /test_category
# Insert a doc with 2 hierarchical tags
POST /test_category/test/1
{
"categories": [
{
"cat_1": "1",
"cat_2": "1.1"
},
{
"cat_1": "2",
"cat_2": "2.2"
}
]
}
# Simple two-levels aggregations query
GET /test_category/test/_search?search_type=count
{
"aggs": {
"main_category": {
"terms": {
"field": "categories.cat_1"
},
"aggs": {
"sub_category": {
"terms": {
"field": "categories.cat_2"
}
}
}
}
}
}
This is the WRONG response I got on ES 1.4, where the fields of the inner aggregation are mixed at the document level:
{
...
"aggregations": {
"main_category": {
"buckets": [
{
"key": "1",
"doc_count": 1,
"sub_category": {
"buckets": [
{
"key": "1.1",
"doc_count": 1
},
{
"key": "2.2", <= WRONG
"doc_count": 1
}
]
}
},
{
"key": "2",
"doc_count": 1,
"sub_category": {
"buckets": [
{
"key": "1.1", <= WRONG
"doc_count": 1
},
{
"key": "2.2",
"doc_count": 1
}
]
}
}
]
}
}
}
A solution is to use nested objects. These are the steps:
1) Define a new type in the schema with nested objects
POST /test_category/test2/_mapping
{
"test2": {
"properties": {
"categories": {
"type": "nested",
"properties": {
"cat_1": {
"type": "string"
},
"cat_2": {
"type": "string"
}
}
}
}
}
}
# Insert a single document
POST /test_category/test2/1
{"categories":[{"cat_1":"1","cat_2":"1.1"},{"cat_1":"2","cat_2":"2.2"}]}
2) Run a nested aggregation query:
GET /test_category/test2/_search?search_type=count
{
"aggs": {
"categories": {
"nested": {
"path": "categories"
},
"aggs": {
"main_category": {
"terms": {
"field": "categories.cat_1"
},
"aggs": {
"sub_category": {
"terms": {
"field": "categories.cat_2"
}
}
}
}
}
}
}
}
This is the response I got, which is now correct:
{
...
"aggregations": {
"categories": {
"doc_count": 2,
"main_category": {
"buckets": [
{
"key": "1",
"doc_count": 1,
"sub_category": {
"buckets": [
{
"key": "1.1",
"doc_count": 1
}
]
}
},
{
"key": "2",
"doc_count": 1,
"sub_category": {
"buckets": [
{
"key": "2.2",
"doc_count": 1
}
]
}
}
]
}
}
}
}
The same solution can be extended to facet hierarchies with more than two levels.
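The difference between the two responses can be reproduced with a small Python model. This is a sketch of the behaviour described above, not of how Elasticsearch is implemented: the flat version treats each document as one bag of cat_1 values and one bag of cat_2 values, while the nested version keeps each category object separate. The function names are made up for the illustration.

```python
from collections import defaultdict

doc = {
    "categories": [
        {"cat_1": "1", "cat_2": "1.1"},
        {"cat_1": "2", "cat_2": "2.2"},
    ]
}


def flat_aggregation(docs):
    # Object arrays are flattened: the link between a cat_1 value and
    # "its" cat_2 value is lost, so every pairing within a doc is counted.
    buckets = defaultdict(lambda: defaultdict(int))
    for d in docs:
        cat_1_values = {c["cat_1"] for c in d["categories"]}
        cat_2_values = {c["cat_2"] for c in d["categories"]}
        for c1 in cat_1_values:
            for c2 in cat_2_values:
                buckets[c1][c2] += 1
    return {key: dict(sub) for key, sub in buckets.items()}


def nested_aggregation(docs):
    # Each nested object is handled on its own, so pairs stay together.
    buckets = defaultdict(lambda: defaultdict(int))
    for d in docs:
        for c in d["categories"]:
            buckets[c["cat_1"]][c["cat_2"]] += 1
    return {key: dict(sub) for key, sub in buckets.items()}


print(flat_aggregation([doc]))
print(nested_aggregation([doc]))
```

The flat result puts both "1.1" and "2.2" under each main category, as in the wrong response; the nested result keeps "1.1" under "1" and "2.2" under "2".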
Currently, Elasticsearch does not support hierarchical faceting out of the box. But the upcoming 1.0 release features a new aggregations module that can be used to get this kind of facet (these are more like pivot facets than hierarchical facets). Version 1.0 is currently in beta; you can download the second beta and test aggregations yourself. Your example might look like
curl -XPOST 'localhost:9200/_search?pretty' -d '
{
"aggregations": {
"main category": {
"terms": {
"field": "cat_1",
"order": {"_term": "asc"}
},
"aggregations": {
"sub category": {
"terms": {
"field": "cat_2",
"order": {"_term": "asc"}
}
}
}
}
}
}'
The idea is to have a different field for each level of faceting and to bucket your facets based on the terms of the first level (cat_1). Each of these buckets then has sub-buckets based on the terms of the second level (cat_2). The result may look like
{
"aggregations" : {
"main category" : {
"buckets" : [ {
"key" : "category 1",
"doc_count" : 10,
"sub category" : {
"buckets" : [ {
"key" : "subcategory 1",
"doc_count" : 4
}, {
"key" : "subcategory 2",
"doc_count" : 6
} ]
}
}, {
"key" : "category 2",
"doc_count" : 7,
"sub category" : {
"buckets" : [ {
"key" : "subcategory 1",
"doc_count" : 3
}, {
"key" : "subcategory 2",
"doc_count" : 4
} ]
}
} ]
}
}
}

ElasticSearch: filtering documents based on field length?

Is there a way to filter ElasticSearch documents based on the length of a specific field?
For instance, I have a bunch of documents with the field "body", and I only want to return results where the number of characters in body is > 1000. Is there a way to do this in ES without having to add an extra column with the length in the index?
Use the script filter, like this:
"filtered" : {
"query" : {
...
},
"filter" : {
"script" : {
"script" : "doc['body'].length > 1000"
}
}
}
EDIT
Sorry, meant to reference the query DSL guide on script filters
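You can also create a custom tokenizer and use it in a multi-field property, as in the following. The character_tokenizer emits one token per character, so the token_count subfield name.length stores the number of characters, while name.words_count stores the number of words: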
PUT test_index
{
"settings": {
"analysis": {
"analyzer": {
"character_analyzer": {
"type": "custom",
"tokenizer": "character_tokenizer"
}
},
"tokenizer": {
"character_tokenizer": {
"type": "nGram",
"min_gram": 1,
"max_gram": 1
}
}
}
},
"mappings": {
"person": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"words_count": {
"type": "token_count",
"analyzer": "standard"
},
"length": {
"type": "token_count",
"analyzer": "character_analyzer"
}
}
}
}
}
}
}
PUT test_index/person/1
{
"name": "John Smith"
}
PUT test_index/person/2
{
"name": "Rachel Alice Williams"
}
GET test_index/person/_search
{
"query": {
"term": {
"name.length": 10
}
}
}
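A rough Python model of what the two token_count subfields end up storing may help. It assumes the nGram tokenizer keeps every character, including spaces (no token_chars restriction is set in the settings above), and it approximates the standard analyzer with a regex; the helper names are invented for the sketch.

```python
import re


def character_tokens(text):
    # nGram tokenizer with min_gram = max_gram = 1: one token per character.
    # Assumption: whitespace is kept, since token_chars is not configured.
    return list(text)


def standard_tokens(text):
    # Rough stand-in for the standard analyzer: split on non-word characters.
    return re.findall(r"\w+", text)


names = {1: "John Smith", 2: "Rachel Alice Williams"}

# What the token_count subfields would hold for each document.
index = {
    doc_id: {
        "name.length": len(character_tokens(name)),
        "name.words_count": len(standard_tokens(name)),
    }
    for doc_id, name in names.items()
}


def term_query(index, field, value):
    # Return the ids of documents whose stored count equals the value.
    return [doc_id for doc_id, fields in index.items() if fields[field] == value]


print(index)
print(term_query(index, "name.length", 10))
```

Under these assumptions the query for name.length 10 returns only document 1 ("John Smith", ten characters including the space).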

Query for missing fields in nested documents

I have a user document which contains many tags
Here is the mapping:
{
"user" : {
"properties" : {
"tags" : {
"type" : "nested",
"properties" : {
"id" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"current" : {
"type" : "boolean"
},
"type" : {
"type" : "string"
},
"value" : {
"type" : "multi_field",
"fields" : {
"value" : {
"type" : "string",
"analyzer" : "name_analyzer"
},
"value_untouched" : {
"type" : "string",
"index" : "not_analyzed",
"include_in_all" : false
}
}
}
}
}
}
}
}
Here are the sample user documents:
User 1
{
"created_at": 1317484762000,
"updated_at": 1367040856000,
"tags": [
{
"type": "college",
"value": "Dhirubhai Ambani Institute of Information and Communication Technology",
"id": "a6f51ef8b34eb8f24d1c5be5e4ff509e2a361829"
},
{
"type": "company",
"value": "alma connect",
"id": "58ad4afcc8415216ea451339aaecf311ed40e132"
},
{
"type": "company",
"value": "Google",
"id": "93bc8199c5fe7adfd181d59e7182c73fec74eab5",
"current": true
},
{
"type": "discipline",
"value": "B.Tech.",
"id": "a7706af7f1477cbb1ac0ceb0e8531de8da4ef1eb",
"institute_id": "4fb424a5addf32296f00013a"
}
]
}
User 2:
{
"created_at": 1318513355000,
"updated_at": 1364888695000,
"tags": [
{
"type": "college",
"value": "Dhirubhai Ambani Institute of Information and Communication Technology",
"id": "a6f51ef8b34eb8f24d1c5be5e4ff509e2a361829"
},
{
"type": "college",
"value": "Bharatiya Vidya Bhavan's Public School, Jubilee hills, Hyderabad",
"id": "d20730345465a974dc61f2132eb72b04e2f5330c"
},
{
"type": "company",
"value": "Alma Connect",
"id": "93bc8199c5fe7adfd181d59e7182c73fec74eab5"
},
{
"type": "sector",
"value": "Website and Software Development",
"id": "dc387d78fc99ab43e6ae2b83562c85cf3503a8a4"
}
]
}
User 3:
{
"created_at": 1318513355001,
"updated_at": 1364888695010,
"tags": [
{
"type": "college",
"value": "Dhirubhai Ambani Institute of Information and Communication Technology",
"id": "a6f51ef8b34eb8f24d1c5be5e4ff509e2a361821"
},
{
"type": "sector",
"value": "Website and Software Development",
"id": "dc387d78fc99ab43e6ae2b83562c85cf3503a8a1"
}
]
}
Using the above ES documents for search, I want to construct a query that fetches users who have a given company tag in their nested tag documents, or users who do not have any company tags. What will be my search query?
For example, in the above case, if I search for the google tag, the returned documents should be 'user 1' and 'user 3' (user 1 has the company tag google and user 3 has no company tag). User 2 is not returned, as it has a company tag other than google.
Not trivial at all, mainly because of the "does not have a type:company tag" clause. Here's what I came up with:
{
"or" : {
"filters" : [ {
"nested" : {
"filter" : {
"and" : {
"filters" : [ {
"term" : {
"tags.value" : "google"
}
}, {
"term" : {
"tags.type" : "company"
}
} ]
}
},
"path" : "tags"
}
}, {
"not" : {
"filter" : {
"nested" : {
"filter" : {
"term" : {
"tags.type" : "company"
}
},
"path" : "tags"
}
}
}
} ]
}
}
It contains an or filter with two nested clauses: the first one finds the documents that have tags.type:company and tags.value:google, while the second one finds all the documents that don't have any tags.type:company.
This needs to be optimized though, since and/or/not filters don't take advantage of caching for filters that work with bitsets, like the term filter does. It would be best to take some more time to find a way to use a bool filter and obtain the same result. Have a look at this article to learn more.
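The logic of the filter can be checked against the three sample users with a small Python model. This is only a sketch: tags are reduced to their type and value, and name_analyzer is assumed to lowercase and split on whitespace, which the question does not confirm. The bool_filter dictionary at the end is an untested guess at an equivalent bool form, not a verified query.

```python
import json

# The sample users, reduced to (type, value) pairs.
users = {
    "user 1": [("college", "Dhirubhai Ambani Institute of Information and Communication Technology"),
               ("company", "alma connect"), ("company", "Google"), ("discipline", "B.Tech.")],
    "user 2": [("college", "Dhirubhai Ambani Institute of Information and Communication Technology"),
               ("college", "Bharatiya Vidya Bhavan's Public School, Jubilee hills, Hyderabad"),
               ("company", "Alma Connect"), ("sector", "Website and Software Development")],
    "user 3": [("college", "Dhirubhai Ambani Institute of Information and Communication Technology"),
               ("sector", "Website and Software Development")],
}


def nested(tags, predicate):
    # A nested filter matches if at least one nested tag satisfies it.
    return any(predicate(tag_type, value) for tag_type, value in tags)


def matches(tags, company):
    # First clause: one tag that is both type:company and value:<company>.
    has_company = nested(
        tags, lambda t, v: t == "company" and company in v.lower().split())
    # Second clause: no tag at all with type:company.
    has_any_company = nested(tags, lambda t, v: t == "company")
    return has_company or not has_any_company


print([name for name, tags in users.items() if matches(tags, "google")])

# Untested sketch of the same logic expressed with bool filters.
company_tag = {"term": {"tags.type": "company"}}
bool_filter = {"bool": {"should": [
    {"nested": {"path": "tags", "filter": {"bool": {"must": [
        {"term": {"tags.value": "google"}}, company_tag]}}}},
    {"bool": {"must_not": {"nested": {"path": "tags", "filter": company_tag}}}},
]}}
print(json.dumps(bool_filter, indent=2))
```

With the sample data this model selects user 1 and user 3, which matches the result the question asks for.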