protect certain phrases for search - lucene

I am currently trying to improve the corner cases of my Elasticsearch results. One in particular is giving me a headache: "google+" is simply reduced to "google". Dropping special characters is usually fine, but for this one I want an exception. Any ideas how to achieve this?
I tried the following setup:
{
"index": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "standard",
"filter": [
"synonym",
"word_delimiter"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
},
"word_delimiter": {
"type": "word_delimiter",
"protected_words_path": "analysis/protected.txt"
}
}
}
}
}
protected.txt contains one line with google+

I guess the Standard tokenizer is stripping out the + from google+. You can check it using the analyze api. I'd use the Whitespace tokenizer instead and properly configure the Word delimiter token filter that you're already using.
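A minimal sketch of that suggestion, written as a Python dict so it can be dumped to JSON and sent as index settings. It assumes the whitespace tokenizer simply replaces standard and that the protected word is listed inline through protected_words rather than in the file from the question. The filter name my_word_delimiter is made up for illustration, and the settings have not been run against a cluster.

```python
import json

# Sketch only: the whitespace tokenizer leaves "google+" intact, so the
# word_delimiter filter sees the "+" and can honour the protected word.
settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["synonym", "my_word_delimiter"],
                }
            },
            "filter": {
                "synonym": {
                    "type": "synonym",
                    "synonyms_path": "analysis/synonym.txt",
                },
                # Hypothetical filter name; protected words given inline.
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    "protected_words": ["google+"],
                },
            },
        }
    }
}

print(json.dumps(settings, indent=2))
```

Running the analyze API against an index created with these settings should show whether "google+" survives as a single token.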

I think pattern replace would be a better idea - http://www.elasticsearch.org/guide/reference/index-modules/analysis/pattern_replace-tokenfilter.html
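To illustrate the pattern replace idea, here is a rough sketch. It assumes a pattern_replace token filter (named plus_to_word here, a made-up name) that rewrites a trailing "+" to "plus", and it assumes the tokenizer has not already removed the "+", for example because the whitespace tokenizer is used. The Python function only imitates what such a filter would do to each token.

```python
import re

# Hypothetical filter definition; "pattern" and "replacement" are the
# parameters of the pattern_replace token filter.
plus_to_word = {
    "type": "pattern_replace",
    "pattern": "\\+$",
    "replacement": "plus",
}

def simulate(text):
    # Imitates whitespace tokenization followed by the replacement above.
    tokens = text.split()
    return [
        re.sub(plus_to_word["pattern"], plus_to_word["replacement"], t)
        for t in tokens
    ]

print(simulate("I use google+ daily"))  # ['I', 'use', 'googleplus', 'daily']
```

The same filter would have to run at search time as well, so that a query for "google+" is rewritten to the same token.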

Related

Why ElasticSearch is not able to search when special characters are available?

I have an ElasticSearch index with below configuration:
{
"my_ind": {
"settings": {
"index": {
"mapping": {
"total_fields": {
"limit": "10000000"
}
},
"number_of_shards": "3",
"provided_name": "my_ind",
"creation_date": "1539773409246",
"analysis": {
"analyzer": {
"default": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
},
"number_of_replicas": "1",
"uuid": "3wC7i-E_Q9mSDjnTN2gxrg",
"version": {
"created": "5061299"
}
}
}
}
}
I want to search for the content below with a plain search:
DL-1234170386456
This content is stored in the field below:
DNumber
This field has the following mapping:
{
"DNumber": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
I am implementing this in Java. I came across Elasticsearch analyzers and tokenizers, so I used the "whitespace" tokenizer.
I am searching with the query below:
{
"query": {
"multi_match": {
"query": "DL-1234170386456",
"fields": [
"_all"
],
"type": "best_fields",
"operator": "OR",
"analyzer": "default",
"slop": 0,
"prefix_length": 0,
"max_expansions": 50,
"lenient": false,
"zero_terms_query": "NONE",
"boost": 1
}
}
}
What am I doing wrong?
After a lot of research and trial and error, I found the answer.
Some basic but important points:
Analyzers and tokenizers must be specified when creating the index, so that they are applied when the data is indexed.
The string "DL-1234170386456" contains a special character ("-"), and Elasticsearch uses the Standard Analyzer by default.
The Standard Analyzer contains the Standard Tokenizer, which is based on the Unicode Text Segmentation algorithm.
Actual problem:
Elasticsearch splits the string "DL-1234170386456" into two separate tokens, "DL" and "1234170386456".
Solution:
Specify the Whitespace Analyzer, which contains the Whitespace Tokenizer.
It splits text only where whitespace occurs, so the string "DL-1234170386456" is kept as it is by Elasticsearch and can be found.
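The difference can be illustrated with a small Python sketch. The function approx_standard_tokens is only a rough stand-in for the Standard Tokenizer (it keeps runs of ASCII letters and digits and drops everything else); the real tokenizer follows the Unicode Text Segmentation rules and behaves differently in many other cases.

```python
import re

def approx_standard_tokens(text):
    # Crude approximation: punctuation such as "-" acts as a separator.
    return re.findall(r"[A-Za-z0-9]+", text)

def whitespace_tokens(text):
    # The whitespace tokenizer only splits on whitespace.
    return text.split()

value = "DL-1234170386456"
print(approx_standard_tokens(value))  # ['DL', '1234170386456']
print(whitespace_tokens(value))       # ['DL-1234170386456']
```

On a real index, the analyze API shows the tokens an analyzer actually produces for a given string.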

elastic search query filter out ids by wildcard

I'm hoping to create a query where it will filter out IDs containing a wildcard. For instance, I would like to search for something everywhere except where the ID contains the word current. Is this possible?
Yes, this is possible with a regexp filter or regexp query. I could not find a way to do it directly with the complement option, so for the time being I have used bool must_not to solve your problem. I'll refine the answer later if possible.
POST <index name>/_search
{
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must_not": [
{
"regexp": {
"ID": {
"value": ".*current.*"
}
}
}
]
}
}
}
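A sketch of the same exclusion expressed entirely inside a bool query, which I believe is the form later Elasticsearch versions expect instead of a top-level filter. The helper name exclude_ids_containing is made up for illustration. It assumes ID is indexed as a single unanalyzed term, since regexp matches against indexed terms.

```python
import json
import re

def exclude_ids_containing(word, field="ID"):
    # Escape the word so characters such as "." are matched literally.
    # (re.escape is used here for convenience; Lucene regexp syntax is
    # not identical to Python's, so treat this as an approximation.)
    pattern = ".*" + re.escape(word) + ".*"
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "must_not": [{"regexp": {field: {"value": pattern}}}],
            }
        }
    }

print(json.dumps(exclude_ids_containing("current"), indent=2))
```

The match_all clause can be replaced with the actual search, leaving the must_not clause to drop the unwanted IDs.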

elasticsearch exact match containing hash value

I am facing a problem with Elasticsearch. I am using a query to search data in a document. The following query searches for a single value:
"query": {
"filtered": {
"query": {
"query_string": {
"query": "'.$lotnumber.'",
"fields": ["LotNumber"]
}
}
}
}
}'
It works fine for simple values, but if $lotnumber contains a hash (#), it returns all the data from the document. Can anyone help me search for an exact value that contains a hash?
The first thing I would try in this case is to make the lotnumber field not analyzed in your mapping. That should do the trick.
In your mapping (the example uses a field named album; apply the same structure to your lot number field):
"album": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
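Once a not_analyzed subfield exists, an exact lookup can use a term filter rather than query_string. The sketch below assumes the mapping above has been applied to the LotNumber field, so that a LotNumber.raw subfield exists, and it keeps the filtered query style from the question. The helper name build_lot_query is made up for illustration.

```python
import json

def build_lot_query(lot_number):
    # A term filter compares the value as-is, so "#" is not stripped
    # or interpreted the way it is in an analyzed query_string search.
    return {
        "query": {
            "filtered": {
                "filter": {
                    "term": {"LotNumber.raw": lot_number}
                }
            }
        }
    }

print(json.dumps(build_lot_query("AB#1234")))
```

Building the body as a data structure and serializing it also avoids the quoting problems that come with concatenating $lotnumber into a JSON string by hand.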

How to specify an analyzer while creating an index in ElasticSearch

I'd like to specify an analyzer, name it, and use that name in a mapping while creating an index. I'm lost: my ES instance always returns an error message.
This is, roughly, what I'd like to do:
"settings": {
"mappings": {
"alfedoc": {
"properties": {
"id": { "type": "string" },
"alfefield": { "type": "string", "analyzer": "alfeanalyzer" }
}
}
},
"analysis": {
"analyzer": {
"alfeanalyzer": {
"type": "pattern",
"pattern":"\\s+"
}
}
}
}
But this does not seem to work; the ES instance always returns an error like
MapperParsingException[mapping [alfedoc]]; nested: MapperParsingException[Analyzer [alfeanalyzer] not found for field [alfefield]];
I tried putting the "analysis" branch of the dictionary at several places (inside the mapping etc.) but to no avail. I guess a working complete example (which I couldn't find up to now) would help me along as well. Probably I'm missing something rather basic.
"analysis" goes in the "settings" block, which goes either before or after the "mappings" block when creating an index.
"settings": {
"analysis": {
"analyzer": {
"alfeanalyzer": {
"type": "pattern",
"pattern": "\\s+"
}
}
}
},
"mappings": {
"alfedoc": { ... }
}
Here's a good, complete example: Example 1
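For reference, a sketch of a full create-index body with "settings" and "mappings" as siblings, assembled in Python from the names used in the question. The field definitions are copied from the question; nothing here has been run against a cluster.

```python
import json

create_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "alfeanalyzer": {"type": "pattern", "pattern": "\\s+"}
            }
        }
    },
    # "mappings" sits next to "settings", not inside it.
    "mappings": {
        "alfedoc": {
            "properties": {
                "id": {"type": "string"},
                "alfefield": {"type": "string", "analyzer": "alfeanalyzer"},
            }
        }
    },
}

print(json.dumps(create_index_body, indent=2))
```

The printed JSON is what would be sent as the request body when creating the index.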

Indexing with edge NGrams for typeahead

I'm trying to get Elasticsearch to index some documents for typeahead suggestions. As far as I can tell, the edge NGram handling in Elasticsearch is provided by Lucene underneath. Unfortunately, the documentation for Lucene in this regard is proving to be very tough for me to make sense of. The best I have come up with is based on https://gist.github.com/988923, but it doesn't seem to work (the index with these settings only returns matches on full words, as though the settings didn't exist):
{
"settings":{
"index":{
"analysis":{
"analyzer":{
"typeahead_analyzer":{
"type":"custom",
"tokenizer":"edgeNGram",
"filter":["typeahead_ngram"]
}
},
"filter":{
"typeahead_ngram":{
"type":"edgeNGram",
"min_gram":1,
"max_gram":8,
"side":"front"
}
}
}
}
}
}
I really don't know at all how analyzers, tokenizers, and filters go together - do I even want a filter? Should I just have a tokenizer? Do I have to reference these settings when I index the documents for them to be used? How can I find out what settings Lucene underneath is using for a given index? How do I debug this? Help :-)
I solved this using an edgeNGram token filter. Below are the analysis settings, followed by the mapping, that I used to accomplish this.
{
"analysis": {
"analyzer": {
"str_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"str_index_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"substring"
]
}
},
"filter": {
"substring": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 10,
"side": "front"
}
}
}
}
{
"index_name": {
"properties": {
"location": {
"type": "geo_point"
},
"name": {
"type": "string",
"index": "analyzed",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer"
}
}
}
}
An important footnote is that I needed to use a match query with the AND operator to query against this properly.
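To make the footnote concrete, here is a small sketch. edge_ngrams imitates what a front edge n-gram filter with min_gram 1 and max_gram 10 would emit for one token at index time, and typeahead_query builds a match query with the AND operator against the name field from the mapping above. Both function names are made up for illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    # Prefixes of the token, from min_gram up to max_gram characters.
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def typeahead_query(text):
    # With "and", every typed word must match some indexed prefix.
    return {"query": {"match": {"name": {"query": text, "operator": "and"}}}}

print(edge_ngrams("berlin"))
# ['b', 'be', 'ber', 'berl', 'berli', 'berlin']
print(typeahead_query("new yo"))
```

With the default OR operator, a query such as "new yo" would also return documents that match only one of the two words, which is usually too broad for typeahead.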
Hope this helps.