Why is ElasticSearch not able to search when special characters are present? - tokenize

I have an ElasticSearch index with the configuration below:
{
"my_ind": {
"settings": {
"index": {
"mapping": {
"total_fields": {
"limit": "10000000"
}
},
"number_of_shards": "3",
"provided_name": "my_ind",
"creation_date": "1539773409246",
"analysis": {
"analyzer": {
"default": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
},
"number_of_replicas": "1",
"uuid": "3wC7i-E_Q9mSDjnTN2gxrg",
"version": {
"created": "5061299"
}
}
}
}
}
I want to search for the following content with a plain search:
DL-1234170386456
This content is available in the field below:
DNumber
This field has the following mapping:
{
"DNumber": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
I am trying to implement it in Java. I came across ElasticSearch analyzers and tokenizers, so I made use of the "whitespace" tokenizer.
I am trying to search with the query below:
{
"query": {
"multi_match": {
"query": "DL-1234170386456",
"fields": [
"_all"
],
"type": "best_fields",
"operator": "OR",
"analyzer": "default",
"slop": 0,
"prefix_length": 0,
"max_expansions": 50,
"lenient": false,
"zero_terms_query": "NONE",
"boost": 1
}
}
}
What am I doing wrong?

After doing a lot of research and trial & error, I found out the answer!
Some basic but important points:
We need to specify analyzers and tokenizers when creating/indexing the index/data.
The string "DL-1234170386456" contains a special character ("-"), and ElasticSearch uses the Standard Analyzer by default.
The Standard Analyzer contains the Standard Tokenizer, which is based on the Unicode Text Segmentation algorithm.
Actual Problem:
ElasticSearch splits the string "DL-1234170386456" into two separate tokens, "DL" and "1234170386456".
Solution:
We need to specify the Whitespace Analyzer, which contains the Whitespace Tokenizer.
It splits text only where whitespace is encountered, so the string "DL-1234170386456" is kept as a single token by ElasticSearch and we are able to find it.
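The difference can be checked with the _analyze API; this is a minimal sketch, POSTing the body below to my_ind/_analyze (the index name and the sample value are taken from the question):
{
  "analyzer": "standard",
  "text": "DL-1234170386456"
}
With "analyzer": "standard" this returns the two tokens "dl" and "1234170386456"; with "analyzer": "default" (the custom whitespace analyzer from the index settings above) it returns the single token "dl-1234170386456".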

Related

Change syntax highlighting of embedded code based on a previous line's keyword?

I'm trying to write a TextMate grammar for a VS Code language extension. Take the following example:
(lang=css attribute2=something-else)
"""
.css-class {
  background: gray;
}
"""
The (...) part is an "attributes" section, and the """ ... """ is a code section. I'm trying to highlight everything in the code section according to the lang attribute.
The problem is, they are two distinct sections where one might be present without the other in other parts of the file. For example, you can have attributes without the code block.
In the grammars section of package.json I have
"embeddedLanguages": {
"meta.embedded.block.css": "css",
"meta.embedded.block.javascript": "javascript"
}
In the tmLanguage.json file I have both patterns in the repository property.
"attributes": {
"begin": "\\(",
"end": "\\)",
"captures": {
"0": {
"name": "punctuation.definition.annotation punctuation.section.group punctuation.section.parens"
}
},
"patterns": [
{
"begin": "[a-zA-Z_][a-zA-Z0-9_\\.-]*",
"beginCaptures": {
"0": {
"name": "entity.other.attribute-name"
}
},
"end": "(?=\\s*+[^=\\s])",
"patterns": [
{
"begin": "=",
"beginCaptures": {
"0": {
"name": "punctuation.separator.key-value"
}
},
"end": "(?<=[^\\s=])(?!\\s*=)|(?=/?>)",
"patterns": [
{
"match": "([^0-9-.\\s='\"][^\\s='\")]*)",
"name": "string.unquoted.html"
},
{
"match": "=",
"name": "invalid.illegal.unexpected-equals-sign"
},
{
"include": "#strings"
},
{
"include": "#number"
}
]
}
]
}
]
},
"fenced-code": {
"begin": "\\(.*lang=(css|javascript).*\\)\\s*(\"\"\")",
"beginCaptures": {
"1": {
"name": "string.quoted.triple"
}
},
"end": "\"\"\"",
"endCaptures": {
"0": {
"name": "string.quoted.triple"
}
},
"contentName": "meta.embedded.block.$1",
"patterns": [
{
"include": "source.css"
}
]
}
I have a third pattern, not shown, where I'm using these together by including them in its patterns array. They seem to be mutually exclusive, though. I can have the attributes, and I can have a code block if I start the pattern at """, but if I start the code pattern with \\(.*lang=(css|javascript).*\\)\\s*(\"\"\") to capture the lang attribute, the attributes stop getting highlighted.
Is this even possible? I've never worked with TextMate grammars outside of a VS Code theme and VS Code doesn't seem to have deep documentation on the more "advanced" (I guess) things like this.
I tried using VS Code's HTML grammar for its uses of embedded code, but I don't think the HTML one needs to swap syntax based on something in a previous line.
Update
The fenced-code pattern below allows the attributes pattern to keep its highlighting while also giving the code block a dynamic contentName property, i.e. "meta.embedded.block.$2" becomes meta.embedded.block.css when "css" is found in the attributes.
"fenced-code": {
"begin": "(\\(.*lang=(css|javascript).*\\))\\s*(\"\"\")",
"beginCaptures": {
"1": {
"patterns": [
{
"include": "#attributes"
}
]
},
"3": {
"name": "string.quoted.triple"
}
},
"end": "\"\"\"",
"endCaptures": {
"0": {
"name": "string.quoted.triple"
}
},
"contentName": "meta.embedded.block.$2",
"patterns": [
{
"include": "source.css"
},
{
"include": "source.js"
}
]
}
However, there are two things wrong so far:
It only works when the opening """ is on the same line as the attributes section
fenced-code's patterns array doesn't seem to allow a dynamic include, i.e. "patterns": [{"include": "source.$2"}]. I'm not sure if including the different languages as I did above will work (a per-language variant is sketched below).
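One possible direction for the second point, given here only as an untested sketch: instead of a dynamic include, define one fixed rule per language (hypothetical names fenced-code-css and fenced-code-javascript) and list both in the parent pattern's patterns array, each with a hard-coded contentName and include:
"fenced-code-css": {
  "begin": "(\\(.*lang=(css).*\\))\\s*(\"\"\")",
  "beginCaptures": {
    "1": {
      "patterns": [
        {
          "include": "#attributes"
        }
      ]
    },
    "3": {
      "name": "string.quoted.triple"
    }
  },
  "end": "\"\"\"",
  "endCaptures": {
    "0": {
      "name": "string.quoted.triple"
    }
  },
  "contentName": "meta.embedded.block.css",
  "patterns": [
    {
      "include": "source.css"
    }
  ]
}
A fenced-code-javascript rule would be identical except for lang=javascript, meta.embedded.block.javascript and source.js.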

Search including special characters in MongoDB Atlas

I am facing an issue when I try to search for several words including a special character (the section sign "§").
Example: AB § 32.
I want all the words "AB", "32" and the symbol "§" to be present in the found documents.
In some cases the document can be found, in others not.
If my document contains the following text, then the search finds it:
Lagrum: 32 § 1 mom. första stycket a) kommunalskattelagen (1928:370) AB
But if the document contains this text, then the search doesn't find it:
Lagrum: 32 § 1 mom. första stycket AB
For symbol "§" I use UT8-encoding "\xc2\xa7".
Index uses "lucene.swedish" analyzer.
"Content": [
  {
    "analyzer": "lucene.swedish",
    "minGrams": 4,
    "tokenization": "nGram",
    "type": "autocomplete"
  },
  {
    "analyzer": "lucene.swedish",
    "type": "string"
  }
]
Query looks like:
{
"index": "test_index",
"compound": {
"filter": [
{
"text": {
"query": [
"111111111111"
],
"path": "ProductId"
}
},
],
"must": [
{
"autocomplete": {
"query": [
"AB"
],
"path": "Content"
}
},
{
"autocomplete": {
"query": [
"\xc2\xa7",
],
"path": "Content"
}
},
{
"autocomplete": {
"query": [
"32"
],
"path": "Content"
}
}
],
},
"count": {
"type": "lowerBound",
"threshold": 500
}
}
The question is: what is wrong with the search, and how can I get a correct result (returning both of the above-mentioned documents)?
Focusing only on the content field, here is an index definition that should work for your requirements. The docs are here. Let me know if this works for you.
{
"mappings": {
"dynamic": false,
"fields": {
"content": [
{
"type": "autocomplete",
"tokenization": "nGram",
"minGrams": 4,
"maxGrams": 7,
"foldDiacritics": false,
"analyzer": "lucene.whitespace"
},
{
"analyzer": "lucene.swedish",
"type": "string"
}
]
}
}
}
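One more note on the query side, independent of the index definition: "\xc2\xa7" is a raw UTF-8 byte escape and not a valid JSON string escape, so in a JSON query body the section sign has to be sent either as the literal "§" character or as "\u00a7". A minimal sketch of that single must clause, with everything else unchanged from the question:
{
  "autocomplete": {
    "query": [
      "\u00a7"
    ],
    "path": "Content"
  }
}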

Specific analyzers for sub-documents in lucene / elasticsearch

After reading the documentation, testing, and reading a lot of other questions here on Stack Overflow:
We have documents that have titles and descriptions in multiple languages. There are also tags that are translated into the same languages. There might be up to 30-40 different languages in the system, but probably only 3 or 4 translations for a single document.
This is the planned document structure:
{
"luck": {
"id": 10018,
"pub": 0,
"pr": 100002,
"loc": {
"lat": 42.7,
"lon": 84.2
},
"t": [
{
"lang": "en-analyzer",
"title": "Forest",
"desc": "A lot of trees.",
"tags": [
"Wood",
"Nature",
"Green Mouvement"
]
},
{
"lang": "fr-analyzer",
"title": "Forêt",
"desc": "A grand nombre d'arbre.",
"tags": [
"Bois",
"Nature",
"Mouvement Vert"
]
}
],
"dates": [
"2014-01-01T20:00",
"2014-06-06T20:00",
"2014-08-08T20:00"
]
}
}
Possible queries are "arbre" or "wood" or "forest" or "nature" combined with a date and a geo_distance filter; furthermore, there will be some facets over the tags array (which obviously include counting).
We can produce any document structure that fits best for elasticsearch (or for lucene). It's crucial that each language is analyzed specifically, so we use "_analyzer" in order to distinguish the languages.
{
"luck": {
"properties": {
"id": {
"type": "long"
},
"pub": {
"type": "long"
},
"pr": {
"type": "long"
},
"loc": {
"type": "geo_point"
},
"t": {
"_analyzer": {
"path": "t.lang"
},
"properties": {
"lang": {
"type": "string"
},
"properties": {
"title": {
"type": "string"
},
"desc": {
"type": "string"
},
"tags": {
"type": "string"
}
}
}
}
}
}
A) Apparently, this idea does not work: after PUTting the mapping, we retrieve the same mapping ("GET") and it seems to ignore the specific analyzers (a test with a top-level "_analyzer" worked fine). Does "_analyzer" work for sub-documents, and if yes, how should we refer to it? We also tested declaring the sub-document as "object" or "nested". How is multi-language document indexing supposed to work?
B) One possibility would be to put each language in its own document: in that case, how do we manage the id? In the end, both documents should refer to the same id. For example, if the user searches for "nature" (and we don't know whether the user intends to find "nature" in English or French), this document would appear twice in the result set, and the counting and paging would be very wrong (also facet counting).
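To make option B concrete, the per-language documents would look roughly like this (a sketch only, reusing the field names from the planned structure above; both documents carry the same id, which is exactly what makes counting and paging hard):
{
  "id": 10018,
  "lang": "en",
  "title": "Forest",
  "desc": "A lot of trees.",
  "tags": ["Wood", "Nature", "Green Mouvement"]
}
{
  "id": 10018,
  "lang": "fr",
  "title": "Forêt",
  "desc": "A grand nombre d'arbre.",
  "tags": ["Bois", "Nature", "Mouvement Vert"]
}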
Any ideas?

Indexing with edge NGrams for typeahead

I'm trying to get Elasticsearch to index some documents for typeahead suggestions. As far as I can tell, the edge NGram handling in Elasticsearch is provided by Lucene underneath. Unfortunately, the documentation for Lucene in this regard is proving to be very tough for me to make sense of. The best I have come up with is based on https://gist.github.com/988923, but it doesn't seem to work (the index with these settings only returns matches on full words, as though the settings didn't exist):
{
"settings":{
"index":{
"analysis":{
"analyzer":{
"typeahead_analyzer":{
"type":"custom",
"tokenizer":"edgeNGram",
"filter":["typeahead_ngram"]
}
},
"filter":{
"typeahead_ngram":{
"type":"edgeNGram",
"min_gram":1,
"max_gram":8,
"side":"front"
}
}
}
}
}
}
I really don't know at all how analyzers, tokenizers, and filters go together - do I even want a filter? Should I just have a tokenizer? Do I have to reference these settings when I index the documents for them to be used? How can I find out what settings Lucene underneath is using for a given index? How do I debug this? Help :-)
I solved this using edgeNGram. Below are the mappings and analysis that I used to accomplish this.
{
"analysis": {
"analyzer": {
"str_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"str_index_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"substring"
]
}
},
"filter": {
"substring": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 10,
"side": "front"
}
}
}
}
{
"index_name": {
"properties": {
"location": {
"type": "geo_point"
},
"name": {
"type": "string",
"index": "analyzed",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer"
}
}
}
}
An important footnote is that I needed to use a match query with the AND operator to query against this properly.
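For example, something along these lines (a sketch only; "new yo" stands in for whatever partial input the user has typed, and name is the field from the mapping above):
{
  "query": {
    "match": {
      "name": {
        "query": "new yo",
        "operator": "and"
      }
    }
  }
}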
Hope this helps.

elasticsearch / lucene highlight

I'm using ElasticSearch to index documents.
My mapping is:
"mongodocid": {
"boost": 1.0,
"store": "yes",
"type": "string"
},
"fulltext": {
"boost": 1.0,
"index": "analyzed",
"store": "yes",
"type": "string",
"term_vector": "with_positions_offsets"
}
To highlight the complete fulltext I am setting number_of_fragments to 0.
If I do the following Lucene-like string query:
{
"highlight": {
"pre_tags": "<b>",
"fields": {
"fulltext": {
"number_of_fragments": 0
}
},
"post_tags": "</b>"
},
"query": {
"query_string": {
"query": "fulltext:test"
}
},
"size": 100
}
For some documents in the result set the length of the highlighted fulltext is smaller than the fulltext itself.
Since I am setting number_of_fragments to 0 and pre_tags/post_tags are added this should not happen.
Now comes the strange behaviour: If I only search for one of the failing elements by doing this:
{
"highlight": {
"pre_tags": "<b>",
"fields": {
"fulltext": {
"number_of_fragments": 0
}
},
"post_tags": "</b>"
},
"query": {
"query_string": {
"query": "fulltext:test AND mongodocid:4d0a861c2ebef6032c00b1ec"
}
},
"size": 100
}
then all works fine.
Any ideas?
Sounds like an issue which has been fixed in 0.14.0 (see #479). As of this writing, 0.14.0 hasn't been released yet; can you try master?