Indexing with edge NGrams for typeahead - lucene

I'm trying to get Elasticsearch to index some documents for typeahead suggestions. As far as I can tell, the edge NGram handling in Elasticsearch is provided by Lucene underneath. Unfortunately, the documentation for Lucene in this regard is proving to be very tough for me to make sense of. The best I have come up with is based on https://gist.github.com/988923, but it doesn't seem to work (the index with these settings only returns matches on full words, as though the settings didn't exist):
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "typeahead_analyzer": {
            "type": "custom",
            "tokenizer": "edgeNGram",
            "filter": ["typeahead_ngram"]
          }
        },
        "filter": {
          "typeahead_ngram": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 8,
            "side": "front"
          }
        }
      }
    }
  }
}
I really don't know at all how analyzers, tokenizers, and filters go together - do I even want a filter? Should I just have a tokenizer? Do I have to reference these settings when I index the documents for them to be used? How can I find out what settings Lucene underneath is using for a given index? How do I debug this? Help :-)

I solved this using edgeNGram. Below are the mappings and analysis that I used to accomplish this.
{
  "analysis": {
    "analyzer": {
      "str_search_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase"
        ]
      },
      "str_index_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "substring"
        ]
      }
    },
    "filter": {
      "substring": {
        "type": "edgeNGram",
        "min_gram": 1,
        "max_gram": 10,
        "side": "front"
      }
    }
  }
}
{
  "index_name": {
    "properties": {
      "location": {
        "type": "geo_point"
      },
      "name": {
        "type": "string",
        "index": "analyzed",
        "search_analyzer": "str_search_analyzer",
        "index_analyzer": "str_index_analyzer"
      }
    }
  }
}
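A quick way to verify that the index analyzer actually emits edge n-grams is the _analyze API (a sketch; index_name stands for whichever index the settings above were applied to):

curl -XGET 'localhost:9200/index_name/_analyze?analyzer=str_index_analyzer&text=Fruit'

This should return the tokens f, fr, fru, frui, and fruit rather than the single token fruit.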
An important footnote is that I needed to use a match query with the AND operator to query against this properly.
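For example, a typeahead query for a partial input could look like this (a sketch; the name field comes from the mapping above, and the query text is illustrative):

{
  "query": {
    "match": {
      "name": {
        "query": "gre gat",
        "operator": "and"
      }
    }
  }
}

With the default or operator, a document matching any single n-gram would be returned; and requires every typed prefix to match, which is what makes the results sensible for typeahead.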
Hope this helps.

Related

Change syntax highlighting of embedded code based on a previous line's keyword?

I'm trying to write a TextMate grammar for a VS Code language extension. Take the following example
(lang=css attribute2=something-else)
"""
.css-class {
  background: gray;
}
"""
The (...) part is an "attributes" section, and the """ ... """ is a code section. I'm trying to highlight everything in the code section according to the lang attribute.
The problem is, they are two distinct sections where one might be present without the other in other parts of the file. For example, you can have attributes without the code block.
In the grammars section of package.json I have
"embeddedLanguages": {
"meta.embedded.block.css": "css",
"meta.embedded.block.javascript": "javascript"
}
In the tmLanguage.json file I have both patterns in the repository property.
"attributes": {
"begin": "\\(",
"end": "\\)",
"captures": {
"0": {
"name": "punctuation.definition.annotation punctuation.section.group punctuation.section.parens"
}
},
"patterns": [
{
"begin": "[a-zA-Z_][a-zA-Z0-9_\\.-]*",
"beginCaptures": {
"0": {
"name": "entity.other.attribute-name"
}
},
"end": "(?=\\s*+[^=\\s])",
"patterns": [
{
"begin": "=",
"beginCaptures": {
"0": {
"name": "punctuation.separator.key-value"
}
},
"end": "(?<=[^\\s=])(?!\\s*=)|(?=/?>)",
"patterns": [
{
"match": "([^0-9-.\\s='\"][^\\s='\")]*)",
"name": "string.unquoted.html"
},
{
"match": "=",
"name": "invalid.illegal.unexpected-equals-sign"
},
{
"include": "#strings"
},
{
"include": "#number"
}
]
}
]
}
]
},
"fenced-code": {
"begin": "\\(.*lang=(css|javascript).*\\)\\s*(\"\"\")",
"beginCaptures": {
"1": {
"name": "string.quoted.triple"
}
},
"end": "\"\"\"",
"endCaptures": {
"0": {
"name": "string.quoted.triple"
}
},
"contentName": "meta.embedded.block.$1",
"patterns": [
{
"include": "source.css"
}
]
}
I have a third pattern, not shown, where I'm using these together by including them in its patterns array. They seem to be mutually exclusive, though: I can have the attributes, and I can have a code block if I start the pattern at """, but if I start the code pattern with \\(.*lang=(css|javascript).*\\)\\s*(\"\"\") to capture the lang attribute, the attributes stop being highlighted.
Is this even possible? I've never worked with TextMate grammars outside of a VS Code theme and VS Code doesn't seem to have deep documentation on the more "advanced" (I guess) things like this.
I tried using VS Code's HTML grammar for its uses of embedded code, but I don't think the HTML one needs to swap syntax based on something in a previous line.
Update
The fenced-code pattern I have below allows the attributes pattern's highlighting while also giving the code block a dynamic contentName property, i.e. "meta.embedded.block.$2" becomes meta.embedded.block.css when "css" is found in the attributes.
"fenced-code": {
"begin": "(\\(.*lang=(css|javascript).*\\))\\s*(\"\"\")",
"beginCaptures": {
"1": {
"patterns": [
{
"include": "#attributes"
}
]
},
"3": {
"name": "string.quoted.triple"
}
},
"end": "\"\"\"",
"endCaptures": {
"0": {
"name": "string.quoted.triple"
}
},
"contentName": "meta.embedded.block.$2",
"patterns": [
{
"include": "source.css"
},
{
"include": "source.js"
}
]
}
However, there are two things wrong so far:
1. It only works when the opening """ is on the same line as the attributes section.
2. fenced-code's patterns array doesn't seem to allow a dynamic include, i.e. "patterns": [{"include": "source.$2"}]. I'm not sure if including the different languages as I did above will work.

Why is Elasticsearch not able to search when special characters are present?

I have an Elasticsearch index with the below configuration:
{
  "my_ind": {
    "settings": {
      "index": {
        "mapping": {
          "total_fields": {
            "limit": "10000000"
          }
        },
        "number_of_shards": "3",
        "provided_name": "my_ind",
        "creation_date": "1539773409246",
        "analysis": {
          "analyzer": {
            "default": {
              "filter": [
                "lowercase"
              ],
              "type": "custom",
              "tokenizer": "whitespace"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "3wC7i-E_Q9mSDjnTN2gxrg",
        "version": {
          "created": "5061299"
        }
      }
    }
  }
}
I want to search for the below content with a plain search:
DL-1234170386456
This content is available in the below field:
DNumber
This field has the below mapping:
{
  "DNumber": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
I am implementing this in Java. I came across Elasticsearch analyzers and tokenizers, so I made use of the "whitespace" tokenizer.
I am trying to search with the below query:
{
  "query": {
    "multi_match": {
      "query": "DL-1234170386456",
      "fields": [
        "_all"
      ],
      "type": "best_fields",
      "operator": "OR",
      "analyzer": "default",
      "slop": 0,
      "prefix_length": 0,
      "max_expansions": 50,
      "lenient": false,
      "zero_terms_query": "NONE",
      "boost": 1
    }
  }
}
What am I doing wrong?
After doing a lot of research and trial & error, I found the answer!
Some basic but important points:
We need to specify analyzers and tokenizers when creating/indexing the index/data.
In the specified string, i.e. "DL-1234170386456", a special character ("-") is present, and Elasticsearch uses the Standard Analyzer by default.
The Standard Analyzer contains the Standard Tokenizer, which is based on the Unicode Text Segmentation algorithm.
Actual problem:
Elasticsearch was splitting the string ("DL-1234170386456") into two parts, "DL" and "1234170386456".
Solution:
We need to specify the Whitespace Analyzer, which contains the Whitespace Tokenizer.
It splits words only when whitespace is encountered, so the string ("DL-1234170386456") is kept as-is by Elasticsearch and we are able to find it.
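You can see the difference directly with the _analyze API (a sketch; my_ind is the index from the settings above):

GET my_ind/_analyze
{
  "analyzer": "standard",
  "text": "DL-1234170386456"
}

This returns the two tokens dl and 1234170386456. Running the same request with "analyzer": "whitespace" returns the single token DL-1234170386456, which is why the whitespace-based analyzer makes the full string searchable.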

Elasticsearch: How to prevent the increase of score when search term appears multiple times in document?

When a search term appears not just once but several times in the document I'm searching, the score goes up. While this might be wanted most of the time, it is not in my case.
The query:
"query": {
"bool": {
"should": {
"nested": {
"path": "editions",
"query": {
"match": {
"title_author": {
"query": "look me up",
"operator": "and",
"boost": 2
}
}
}
}
},
"must": {
"nested": {
"path": "editions",
"query": {
"match": {
"title_author": {
"query": "look me up",
"operator": "and",
"fuzziness": 0.5,
"boost": 1
}
}
}
}
}
}
}
doc_1
{
  "editions": [
    {
      "editionid": 1,
      "title_author": "look me up look me up"
    },
    {
      "editionid": 2,
      "title_author": "something else"
    }
  ]
}
and doc_2
{
  "editions": [
    {
      "editionid": 3,
      "title_author": "look me up"
    },
    {
      "editionid": 4,
      "title_author": "something else"
    }
  ]
}
Now, doc_1 would have a higher score due to the fact that the search terms are included twice. I don't want that. How do I turn this behavior off? I want the same score - no matter if the search term was found once or twice in the matching document.
In addition to what #keety and #Sid1199 talked about, there is another way to do this: a special property for fields of type "text" called index_options. By default it is set to "positions", but you can explicitly set it to "docs", so term frequencies will not be placed in the index and Elasticsearch will not know about repetitions while searching.
"title_author": {
"type": "text",
"index_options": "docs"
}
There is a property in Elasticsearch known as "similarity". There are a lot of similarity types, but the one that is useful here is "boolean". If you set similarity to "boolean" in your mapping, it will keep repeated occurrences from boosting the score.
"title_author":{"type":"text","similarity":"boolean"}
If you run your query on this mapping, it will boost only once regardless of the number of times the word appears. You can read up more on similarities here
This is only available in ES versions 5.4 and above
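Either property has to be set in the field mapping when the index is created; here is a minimal sketch for the "similarity" option, assuming ES 5.x syntax, an illustrative index/type name (books/book), and the nested editions field from the question:

PUT books
{
  "mappings": {
    "book": {
      "properties": {
        "editions": {
          "type": "nested",
          "properties": {
            "title_author": {
              "type": "text",
              "similarity": "boolean"
            }
          }
        }
      }
    }
  }
}

Note that changing these settings on an existing field generally requires reindexing.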

Specific analyzers for sub-documents in lucene / elasticsearch

After reading the documentation, testing and reading a lot of other questions here on stackoverflow:
We have documents that have titles and description in multiple languages. There are also tags that are translated to the same languages. There might be up to 30-40 different languages in the system, but probably only 3 or 4 translations for a single document.
This is the planned document structure:
{
  "luck": {
    "id": 10018,
    "pub": 0,
    "pr": 100002,
    "loc": {
      "lat": 42.7,
      "lon": 84.2
    },
    "t": [
      {
        "lang": "en-analyzer",
        "title": "Forest",
        "desc": "A lot of trees.",
        "tags": [
          "Wood",
          "Nature",
          "Green Mouvement"
        ]
      },
      {
        "lang": "fr-analyzer",
        "title": "ForĂȘt",
        "desc": "A grand nombre d'arbre.",
        "tags": [
          "Bois",
          "Nature",
          "Mouvement Vert"
        ]
      }
    ],
    "dates": [
      "2014-01-01T20:00",
      "2014-06-06T20:00",
      "2014-08-08T20:00"
    ]
  }
}
Possible queries are "arbre" or "wood" or "forest" or "nature" combined with a date and a geo_distance filter, furthermore there will be some facets over the tags array (that obviously include counting).
We can produce any document structure that fits best for elasticsearch (or for lucene). It's crucial that each language is analyzed specifically, so we use "_analyzer" in order to distinguish the languages.
{
  "luck": {
    "properties": {
      "id": {
        "type": "long"
      },
      "pub": {
        "type": "long"
      },
      "pr": {
        "type": "long"
      },
      "loc": {
        "type": "geo_point"
      },
      "t": {
        "_analyzer": {
          "path": "t.lang"
        },
        "properties": {
          "lang": {
            "type": "string"
          },
          "properties": {
            "title": {
              "type": "string"
            },
            "desc": {
              "type": "string"
            },
            "tags": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
A) Apparently, this idea does not work: after PUTting the mapping, we retrieve the same mapping ("GET") and it seems to ignore the specific analyzers (a test with a top-level "_analyzer" worked fine). Does "_analyzer" work for sub-documents, and if so, how should we refer to it? We also tested declaring the sub-document as "object" or "nested". How is multi-language document indexing supposed to work?
B) One possibility would be to put each language in its own document: In that case how do we manage the id? Finally both documents should refer to the same id. For example if the user searches for "nature" (and we don't know if the user intends to find "nature" in English or French), this document would appear twice in the result set, and the counting and paging would be very wrong (also facet counting).
Any ideas?

How to specify an analyzer while creating an index in ElasticSearch

I'd like to specify an analyzer, name it, and use that name in a mapping while creating an index. I'm lost; my ES instance always returns an error message.
This is, roughly, what I'd like to do:
"settings": {
"mappings": {
"alfedoc": {
"properties": {
"id": { "type": "string" },
"alfefield": { "type": "string", "analyzer": "alfeanalyzer" }
}
}
},
"analysis": {
"analyzer": {
"alfeanalyzer": {
"type": "pattern",
"pattern":"\\s+"
}
}
}
}
But this does not seem to work; the ES instance always returns me an error like
MapperParsingException[mapping [alfedoc]]; nested: MapperParsingException[Analyzer [alfeanalyzer] not found for field [alfefield]];
I tried putting the "analysis" branch of the dictionary in several places (inside the mapping etc.), but to no avail. I guess a working, complete example (which I couldn't find up to now) would help me along as well. Probably I'm missing something rather basic.
"analysis" goes in the "settings" block, which goes either before or after the "mappings" block when creating an index.
"settings": {
"analysis": {
"analyzer": {
"alfeanalyzer": {
"type": "pattern",
"pattern": "\\s+"
}
}
}
},
"mappings": {
"alfedoc": { ... }
}
Here's a good, complete example: Example 1
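For reference, a complete index-creation request along those lines could look like this (a sketch reusing the names from the question; the index name alfeindex and the pre-2.x "string" field syntax are assumptions):

curl -XPUT 'localhost:9200/alfeindex' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "alfeanalyzer": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  },
  "mappings": {
    "alfedoc": {
      "properties": {
        "id": { "type": "string" },
        "alfefield": { "type": "string", "analyzer": "alfeanalyzer" }
      }
    }
  }
}'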