prefix fuzzy query (not using query_string) - lucene

I want to do a prefix fuzzy search on a single term.
Basically, I want to get the same result as if this search request had been sent:
{
  "from": 0,
  "size": 100,
  "query": {
    "query_string": {
      "query": "dala~*"
    }
  },
  "filter": {}
}
but without query_string syntax parsing. The search above should match the term Dallas.

In Elasticsearch, if you set fuzzy_prefix_length, you should be able to specify just the fuzzy tilde and get prefix matching:
{
  "from": 0,
  "size": 100,
  "query": {
    "query_string": {
      "query": "dala~",
      "fuzzy_prefix_length": 3
    }
  },
  "filter": {}
}
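If you need the same behaviour without query_string parsing, the term-level fuzzy query exposes prefix_length directly. A minimal sketch under a few assumptions: "city" is a placeholder field name; since fuzzy is a term-level query, its value is not analyzed, so it is lowercased by hand to match the indexed form of Dallas; and fuzziness is set to 2 because "dala" is two edits away from "dallas":

{
  "from": 0,
  "size": 100,
  "query": {
    "fuzzy": {
      "city": {
        "value": "dala",
        "fuzziness": 2,
        "prefix_length": 3
      }
    }
  }
}

Here prefix_length: 3 requires the first three characters ("dal") to match exactly, just like fuzzy_prefix_length above, which keeps the fuzzy expansion from scanning the whole term dictionary.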
Similar in spirit to this question

Why ElasticSearch is not able to search when special characters are available?

I have an Elasticsearch index with the below configuration:
{
  "my_ind": {
    "settings": {
      "index": {
        "mapping": {
          "total_fields": {
            "limit": "10000000"
          }
        },
        "number_of_shards": "3",
        "provided_name": "my_ind",
        "creation_date": "1539773409246",
        "analysis": {
          "analyzer": {
            "default": {
              "filter": [
                "lowercase"
              ],
              "type": "custom",
              "tokenizer": "whitespace"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "3wC7i-E_Q9mSDjnTN2gxrg",
        "version": {
          "created": "5061299"
        }
      }
    }
  }
}
I want to search for the below content with a plain search:
DL-1234170386456
This content is stored in the below field:
DNumber
This field has the below mapping:
{
  "DNumber": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
I am implementing this in Java. I came across Elasticsearch analyzers and tokenizers, so I made use of the "whitespace" tokenizer.
I am trying to search with the below query:
{
  "query": {
    "multi_match": {
      "query": "DL-1234170386456",
      "fields": [
        "_all"
      ],
      "type": "best_fields",
      "operator": "OR",
      "analyzer": "default",
      "slop": 0,
      "prefix_length": 0,
      "max_expansions": 50,
      "lenient": false,
      "zero_terms_query": "NONE",
      "boost": 1
    }
  }
}
What am I doing wrong?
After a lot of research and trial and error, I found the answer!
Some basic but important points:
We need to specify analyzers and tokenizers while creating/indexing the index/data.
The specified string, "DL-1234170386456", contains a special character ("-"), and Elasticsearch uses the standard analyzer by default.
The standard analyzer contains the standard tokenizer, which is based on the Unicode Text Segmentation algorithm.
Actual problem:
Elasticsearch splits the string "DL-1234170386456" into two separate tokens, "dl" and "1234170386456" (the standard analyzer also lowercases).
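You can see the split with the _analyze API. A quick verification sketch (the host localhost:9200 is an assumption):

curl -XGET 'http://localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '{
  "analyzer": "standard",
  "text": "DL-1234170386456"
}'

This returns two tokens, "dl" and "1234170386456", which is why a search for the whole string cannot match a single indexed term.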
Solution:
We need to specify the whitespace analyzer, which contains the whitespace tokenizer.
It splits the input only where whitespace is encountered, so the string "DL-1234170386456" is kept as a single token by Elasticsearch and we are able to find it.
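Running the same check against the index's custom default analyzer (whitespace tokenizer plus lowercase filter) shows the difference; again, localhost:9200 is an assumption:

curl -XGET 'http://localhost:9200/my_ind/_analyze?pretty' -H 'Content-Type: application/json' -d '{
  "analyzer": "default",
  "text": "DL-1234170386456"
}'

This returns the single token "dl-1234170386456", so the same query analyzed the same way can find the document.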

Elasticsearch: How to prevent the increase of score when search term appears multiple times in document?

When a search term appears several times in a document rather than just once, the score goes up. While this might be desirable most of the time, it is not in my case.
The query:
"query": {
"bool": {
"should": {
"nested": {
"path": "editions",
"query": {
"match": {
"title_author": {
"query": "look me up",
"operator": "and",
"boost": 2
}
}
}
}
},
"must": {
"nested": {
"path": "editions",
"query": {
"match": {
"title_author": {
"query": "look me up",
"operator": "and",
"fuzziness": 0.5,
"boost": 1
}
}
}
}
}
}
}
doc_1
{
  "editions": [
    {
      "editionid": 1,
      "title_author": "look me up look me up"
    },
    {
      "editionid": 2,
      "title_author": "something else"
    }
  ]
}
and doc_2
{
  "editions": [
    {
      "editionid": 3,
      "title_author": "look me up"
    },
    {
      "editionid": 4,
      "title_author": "something else"
    }
  ]
}
Now, doc_1 gets a higher score because the search terms appear twice. I don't want that. How do I turn this behavior off? I want the same score, no matter whether the search term was found once or twice in the matching document.
In addition to what @keety and @Sid1199 suggested, there is another way to do that: a property for fields of type "text" called index_options. By default it is set to "positions", but you can explicitly set it to "docs", so term frequencies will not be stored in the index and Elasticsearch will not know about repetitions while searching.
"title_author": {
"type": "text",
"index_options": "docs"
}
There is a property in Elasticsearch known as "similarity". There are a lot of similarity types, but the one that is useful here is "boolean". If you set the similarity to "boolean" in your mapping, it will prevent repeated occurrences from boosting your query.
"title_author": {
  "type": "text",
  "similarity": "boolean"
}
If you run your query on this mapping, it will boost only once, regardless of the number of times the word appears. You can read up more on similarities here.
This is only available in ES versions 5.4 and above.
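For reference, here is a minimal sketch of an index creation request that applies both options to the nested field from the question. The index name books and type name book are placeholders, and the syntax assumes ES 5.x, where mappings are nested under a type; note that changing similarity or index_options requires reindexing:

curl -XPUT 'http://localhost:9200/books' -H 'Content-Type: application/json' -d '{
  "mappings": {
    "book": {
      "properties": {
        "editions": {
          "type": "nested",
          "properties": {
            "title_author": {
              "type": "text",
              "similarity": "boolean",
              "index_options": "docs"
            }
          }
        }
      }
    }
  }
}'

Either setting on its own is usually enough to flatten term-frequency scoring; "docs" additionally stops storing positions, so phrase queries on the field will no longer work.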

elasticsearch missing filter on query

It's the first time I've used the 'missing' filter, and I am not sure if I am doing something wrong, as I am not getting what I expect.
Can someone please tell me if the missing condition is correctly integrated in this query? It should create 5 facet entries, counting for each one only the occurrences for which the decimallatitude field is not set in the index or its value is null.
curl -XGET http://my_url:9200/idx_occurrence/Occurrene/_search?pretty=true -d '{
  "filter": {
    "missing": {
      "field": "decimallatitude",
      "existence": true,
      "null_value": true
    }
  },
  "query": {
    "query_string": {
      "fields": ["dataset"],
      "query": "3",
      "default_operator": "AND"
    }
  },
  "facets": {
    "test": {
      "terms": {
        "field": ["kingdom_interpreted"],
        "size": 5
      }
    }
  }
}
'
As you can see on the Search API - Filter page, the filter is applied to your query results but not to the facets. To make it work for facets, try using the filtered query instead:
curl -XGET http://my_url:9200/idx_occurrence/Occurrene/_search?pretty=true -d '{
  "query": {
    "filtered": {
      "filter": {
        "missing": {
          "field": "decimallatitude",
          "existence": true,
          "null_value": true
        }
      },
      "query": {
        "query_string": {
          "fields": ["dataset"],
          "query": "3",
          "default_operator": "AND"
        }
      }
    }
  },
  "facets": {
    "test": {
      "terms": {
        "field": ["kingdom_interpreted"],
        "size": 5
      }
    }
  }
}
'
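Alternatively, in the same pre-1.0 API you can leave the top-level query untouched and attach the filter to the facet itself with facet_filter. A sketch of just the facets section, assuming the rest of the request stays as in the question:

"facets": {
  "test": {
    "terms": {
      "field": "kingdom_interpreted",
      "size": 5
    },
    "facet_filter": {
      "missing": {
        "field": "decimallatitude",
        "existence": true,
        "null_value": true
      }
    }
  }
}

The filtered query is usually the better choice when the hits and the facets should agree; facet_filter is useful when each facet needs its own filtering.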

Elasticsearch/Tire text query DSL for excluding certain fields from being searched

I have an Elasticsearch query like the following:
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": ["title"],
            "query": "test"
          }
        }
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 50,
  "sort": [],
  "facets": {}
}
I am able to execute an Elasticsearch query on certain fields by giving a fields param to query_string, as shown above. In my index mapping I have around 50 fields indexed. How do I query all but one field? Something like an exclude option for query_string. Is it possible with Tire/Elasticsearch?
I assumed it cannot be done and proceeded with getting all the mappings and parsing the hash, which kinda sucks, actually.
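For completeness, the manual workaround looks roughly like this: pull the mapping, collect the field names, drop the unwanted one, and feed the rest to query_string. A sketch, where my_index and the excluded field internal_notes are hypothetical:

curl -XGET 'http://localhost:9200/my_index/_mapping?pretty'

Then, with every mapped field except internal_notes:

{
  "query": {
    "query_string": {
      "fields": ["title", "body", "author"],
      "query": "test"
    }
  }
}

The field list has to be rebuilt whenever the mapping changes, which is why parsing the mapping hash feels fragile.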

elasticsearch / lucene highlight

I'm using ElasticSearch to index documents.
My mapping is:
"mongodocid": {
"boost": 1.0,
"store": "yes",
"type": "string"
},
"fulltext": {
"boost": 1.0,
"index": "analyzed",
"store": "yes",
"type": "string",
"term_vector": "with_positions_offsets"
}
To highlight the complete fulltext I am setting number_of_fragments to 0.
If I do the following Lucene-like string query:
{
  "highlight": {
    "pre_tags": "<b>",
    "fields": {
      "fulltext": {
        "number_of_fragments": 0
      }
    },
    "post_tags": "</b>"
  },
  "query": {
    "query_string": {
      "query": "fulltext:test"
    }
  },
  "size": 100
}
For some documents in the result set, the length of the highlighted fulltext is smaller than the fulltext itself. Since I am setting number_of_fragments to 0 and pre_tags/post_tags are added, this should not happen.
Now comes the strange behaviour: If I only search for one of the failing elements by doing this:
{
  "highlight": {
    "pre_tags": "<b>",
    "fields": {
      "fulltext": {
        "number_of_fragments": 0
      }
    },
    "post_tags": "</b>"
  },
  "query": {
    "query_string": {
      "query": "fulltext:test AND mongodocid:4d0a861c2ebef6032c00b1ec"
    }
  },
  "size": 100
}
then all works fine.
Any ideas?
Sounds like an issue which has been fixed in 0.14.0 (see #479). As of writing, 0.14.0 hasn't been released yet; can you try master?