In Elasticsearch, why do I lose the whole word token when I run a word through an ngram filter?

It seems that if I am running a word or phrase through an ngram filter, the original word does not get indexed. Instead, I only get chunks of the word up to my max_gram value. I would expect the original word to get indexed as well. I'm using Elasticsearch 0.20.5. If I set up an index using a filter with ngrams like so:
curl -XPUT 'http://localhost:9200/test/' -d '{
"settings": {
"analysis": {
"filter": {
"my_ngram": {
"max_gram": 10,
"min_gram": 1,
"type": "nGram"
},
"my_stemmer": {
"type": "stemmer",
"name": "english"
}
},
"analyzer": {
"default_index": {
"filter": [
"standard",
"lowercase",
"asciifolding",
"my_ngram",
"my_stemmer"
],
"type": "custom",
"tokenizer": "standard"
},
"default_search": {
"filter": [
"standard",
"lowercase"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
}'
Then I put a long word into a document:
curl -XPUT 'http://localhost:9200/test/item/1' -d '{
"foo" : "REALLY_REALLY_LONG_WORD"
}'
And I query for that long word:
curl -XGET 'http://localhost:9200/test/item/_search' -d '{
"query":
{
"match" : {
"foo" : "REALLY_REALLY_LONG_WORD"
}
}
}'
I get 0 results. I do get a result if I query for a 10 character chunk of that word. When I run this:
curl -XGET 'localhost:9200/test/_analyze?text=REALLY_REALLY_LONG_WORD'
I get tons of grams back, but not the original word. Am I missing a configuration to make this work the way I want?

If you would like to keep the complete word or phrase, use a multi-field mapping for the value, with one sub-field that is "not analyzed" or that uses the keyword tokenizer.
Also, when searching a field with nGram-tokenized values, you should probably use the nGram tokenizer for the search as well. The n-character limit then also applies to the search phrase, and you will get the expected results.
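To see why the whole word never reaches the index, here is a minimal Python sketch of what an n-gram filter emits for a single token. It is only a simulation: the function name `ngrams` is hypothetical, and it ignores the stemmer and the other filters in the question's analyzer chain.

```python
def ngrams(token, min_gram=1, max_gram=10):
    """Return every substring of `token` whose length is between
    min_gram and max_gram, shortest grams first (simulation only)."""
    return [
        token[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(token) - size + 1)
    ]


# The token as it would look after the lowercase filter.
word = "really_really_long_word"
grams = ngrams(word, min_gram=1, max_gram=10)

# No gram is longer than max_gram, so the 23-character word itself is absent.
print(max(len(g) for g in grams))
print(word in grams)
```

Because the search analyzer in the question has no n-gram step, it looks for the full 23-character term, which this output never contains.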

Related

How do I post a bulleted list using the slack api

Background
I am trying to use the Slack Bolt SDK along with the following dependencies:
// Slack bolt SDK
implementation("com.slack.api:bolt:1.8.1")
implementation("com.slack.api:bolt-servlet:1.8.1")
implementation("com.slack.api:bolt-jetty:1.8.1")
implementation("com.slack.api:slack-api-model-kotlin-extension:1.8.1")
implementation("com.slack.api:slack-api-client-kotlin-extension:1.8.1")
What I want to achieve (in slack)
What I currently am getting (in slack)
What I've tried so far
fun SlashCommandContext.sendSectionAndAck(
message: String,
): Response {
slack.methods(botToken).chatPostMessage { req ->
req
.channel(channelId)
.blocks {
section {
markdownText(message)
}
}
}
return ack()
}
It seems like the markdown is being formatted almost properly. The header and footer are both bold as intended, but for some reason, the bulleted list is not being formatted correctly. I have also tried replacing the * with - without any luck.
In my case, I can call the function with the following input:
val input = """
*Some header text in bold*
- item
- another item
*Some footer text also in bold*
"""
sendSectionAndAck(input)
What am I doing wrong?

The easiest workaround would be to use the '•' character itself in the text.
Slack also uses the following structure in a Block Kit message to represent bullet points:
"text": "• test",
"blocks": [
{
"type": "rich_text",
"block_id": "erY",
"elements": [
{
"type": "rich_text_list",
"elements": [
{
"type": "rich_text_section",
"elements": [
{
"type": "text",
"text": "test"
}
]
}
],
"style": "bullet",
"indent": 0
}
]
}
]
Another reference:
https://superuser.com/questions/1282510/how-do-i-make-a-bullet-point-in-a-slack-message
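As a rough illustration, the structure above can be generated from a list of strings. This Python sketch only builds the payload; the helper names `bullet_list_block` and `bullet_fallback_text` are hypothetical, and whether a given API method accepts a `rich_text` block should be checked against Slack's documentation.

```python
import json


def bullet_list_block(items):
    """Build a rich_text block holding one bulleted list,
    mirroring the JSON shape shown above."""
    return {
        "type": "rich_text",
        "elements": [
            {
                "type": "rich_text_list",
                "style": "bullet",
                "indent": 0,
                "elements": [
                    {
                        "type": "rich_text_section",
                        "elements": [{"type": "text", "text": item}],
                    }
                    for item in items
                ],
            }
        ],
    }


def bullet_fallback_text(items):
    """Plain-text variant: prefix each item with the bullet character."""
    return "\n".join(f"\u2022 {item}" for item in items)


if __name__ == "__main__":
    print(json.dumps(bullet_list_block(["item", "another item"]), indent=2))
    print(bullet_fallback_text(["item", "another item"]))
```

The fallback string is the same workaround as typing '•' by hand, and it can be passed to a markdown section unchanged.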

A simple jq script to prefix a stream of lines read from stdin with bullets for the purposes of pasting into a slack message:
jq -rR '"\u2022 \(.)"'

Searching within an array in kibana

I am pushing my logs to elasticsearch which stores a typical doc as-
{
"_index": "logstash-2014.08.11",
"_type": "machine",
"_id": "2tSlN1P1QQuHUkmoJfkmnQ",
"_score": null,
"_source": {
"category": "critical log with list",
"app_name": "attachment",
"stacktrace_array": [
"this is the first line",
"this is the second line",
"this is the third line",
"this is the fourth line"
],
"@timestamp": "2014-08-11T13:30:51+00:00"
},
"sort": [
1407763851000,
1407763851000
]
}
Kibana makes searching substrings very easy. For example searching for "critical" in the dashboard will fetch all logs with the word critical in any string mapped value.
How do I go about searching for something like "second line", which is a string nested in an array within my doc?

It would be a simple field:<search_term> query, like -
"query": {
"query_string": {
"query": "stacktrace_array:*second line*"
}
...
In layman's terms: in the Kibana dashboard, enter your search query like so -
stacktrace_array:*second line*
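If the query body is built programmatically rather than typed into the dashboard, the same field-scoped string can be assembled as in this small Python sketch. The helper name `field_wildcard_query` is hypothetical, and the code only constructs the JSON body; it does not send it anywhere.

```python
import json


def field_wildcard_query(field, phrase):
    """Build a query_string body of the form field:*phrase*,
    matching the example above."""
    return {"query": {"query_string": {"query": f"{field}:*{phrase}*"}}}


if __name__ == "__main__":
    body = field_wildcard_query("stacktrace_array", "second line")
    print(json.dumps(body, indent=2))
```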

Join / split search words in elasticsearch (using tire)

I have the following analyzer (a slight tweak to the way snowball would be set up):
string_analyzer: {
filter: [ "standard", "stop", "snowball" ],
tokenizer: "lowercase"
}
Here is the field it is applied to:
indexes :title, type: 'string', analyzer: 'string_analyzer'
query do
match ['title'], search_terms, fuzziness: 0.5, max_expansions: 10, operator: 'and'
end
I have a record in my index with title foo bar.
If I search for foo bar it appears in the results.
However, if I search for foobar it doesn't.
Can someone explain why and if possible how I could get it to?
Can someone explain how I could get the reverse of this to work as well so that if I had a record with title foobar a user could search for foo bar and see it as a result?
Thanks

You can only search for tokens that are in your index. So let's look at what you are indexing.
You're currently using the lowercase tokenizer (which tokenizes a string on non-letter characters and lowercases them) then applying the standard filter (redundant, because you are not using the standard tokenizer), the stop and snowball filters.
If we create that analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"string_analyzer" : {
"filter" : [
"standard",
"stop",
"snowball"
],
"tokenizer" : "lowercase"
}
}
}
}
}
'
and use the analyze API to test it out:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=string_analyzer'
you'll see that "foo bar" produces the terms ["foo","bar"] and "foobar" produces the term ["foobar"]. So indexing "foo bar" and searching for "foobar" currently cannot work.
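A rough Python approximation of the lowercase tokenizer makes the difference visible without a running cluster. This is a sketch under assumptions: `lowercase_tokenize` is a hypothetical name, and the stop and snowball filters are left out because they do not change these particular terms' boundaries.

```python
import re


def lowercase_tokenize(text):
    """Split on anything that is not a letter and lowercase each piece,
    approximating the behaviour described for the lowercase tokenizer."""
    return [piece.lower() for piece in re.findall(r"[^\W\d_]+", text)]


if __name__ == "__main__":
    print(lowercase_tokenize("foo bar"))
    print(lowercase_tokenize("foobar"))
```

The two inputs share no term, which is why a search for one cannot find a document indexed with the other.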
If you want to be able to search "inside" words, then you need to break words up into smaller tokens. To do this, we use an analyzer with an ngram token filter.
So delete the test index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
and specify a new analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"filter" : {
"ngrams" : {
"max_gram" : 5,
"min_gram" : 1,
"type" : "ngram"
}
},
"analyzer" : {
"ngrams" : {
"filter" : [
"standard",
"lowercase",
"ngrams"
],
"tokenizer" : "standard"
}
}
}
}
}
'
Now, if we test the analyzer, we get:
"foo bar" => [f,o,o,fo,oo,foo,b,a,r,ba,ar,bar]
"foobar" => [f,o,o,b,a,r,fo,oo,ob,ba,ar,foo,oob,oba,bar,foob,ooba,obar,fooba,oobar]
So if we index "foo bar" and we search for "foobar" using the match query, then the query becomes a query looking for any of those tokens, some of which exist in the index.
Unfortunately, it'll also overlap with "wear the fox hat" (f,o,a). While foobar will appear higher up the list of results because it has more tokens in common, you will still get apparently unrelated results.
This can be controlled by using the minimum_should_match parameter, eg:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
"query" : {
"match" : {
"my_field" : {
"minimum_should_match" : "60%",
"query" : "foobar"
}
}
}
}
'
The exact value for minimum_should_match depends upon your data - experiment with it.
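One way to get a feel for a sensible threshold is to count how many query grams each candidate document shares. The Python sketch below is a simplified model, not Elasticsearch's scoring: it splits on whitespace, compares sets of distinct grams, and all function names are hypothetical.

```python
def ngrams(token, min_gram=1, max_gram=5):
    """All substrings of `token` with length min_gram..max_gram."""
    return [
        token[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(token) - size + 1)
    ]


def analyze(text, min_gram=1, max_gram=5):
    """Lowercase, split on whitespace, and collect the distinct grams."""
    grams = set()
    for word in text.lower().split():
        grams.update(ngrams(word, min_gram, max_gram))
    return grams


def matched_terms(query, document):
    """Return (distinct query grams found in the document, distinct query grams)."""
    query_terms = analyze(query)
    return len(query_terms & analyze(document)), len(query_terms)


def passes(query, document, minimum_should_match):
    """True if the matched share reaches the threshold (e.g. 0.5 for 50%)."""
    matched, total = matched_terms(query, document)
    return matched >= minimum_should_match * total


if __name__ == "__main__":
    for doc in ("foo bar", "wear the fox hat"):
        print(doc, matched_terms("foobar", doc))
```

Under this simplified model, "foo bar" shares 11 of the 19 distinct "foobar" grams and "wear the fox hat" shares 6, so the two are easy to separate. Elasticsearch counts clauses in its own way, so treat these numbers as an illustration only.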

Unable to filter out n shingle(n - gram) facets using the "exclude" words option provided in the "facets" query

I am trying to make a tagcloud of words and phrases using the facets feature of elasticsearch.
My mapping:
curl -XPOST http://localhost:9200/myIndex/ -d '{
...
"analysis":{
"filter":{
"myCustomShingle":{
"type":"shingle",
"max_shingle_size":3,
"output_unigrams":true
}
},
"analyzer":{ //making a custom analyzer
"myAnalyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":[
"lowercase",
"myCustomShingle",
"stop"
]
}
}
}
...
},
"mappings":{
...
"description":{ //the field to be analyzed for making the tag cloud
"type":"string",
"analyzer":"myAnalyzer",
"null_value" : "null"
},
...
}
Query for generating facets:
curl -X POST "http://localhost:9200/myIndex/myType/_search?pretty=true" -d '
{
"size": 0,
"query": {
"match_all": {}
},
"facets": {
"blah": {
"terms": {
"fields" : ["description"],
"exclude" : ["evil"],
"size": 50
}
}
}
}'
My problem is that when I put a word, say 'evil', in the "exclude" option of "facets", it successfully removes the facets for single words (unigram shingles) that match 'evil'. But it doesn't remove the two- and three-word shingles such as "resident evil", "evil computer", or "my evil cat". How do I remove the facets for phrases that contain the excluded words?

It isn't completely clear what you want to achieve. You usually wouldn't make facets on analyzed fields. Maybe you could explain why you're making shingles, so that we can help you achieve what you want in a better way.
With the exclude facet parameter you can exclude some specific entry, but evil is not the same as resident evil. If you want to exclude it you need to specify it. Facets are made based on indexed terms, and resident evil is in fact a single term in the index, which is not the same as the term evil.
Given the choice that you already made for indexing and faceting, there is a way to achieve what you want. Elasticsearch has a really powerful scripting module. You can use a script to decide whether each entry should be included in the facet or not like this:
{
"query": {
"match_all" : {}
},
"facets": {
"tags": {
"terms": {
"field" : "tags",
"script" : "term.contains('evil') ? false : true"
}
}
}
}
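The difference between exact-term exclusion and a substring test can be sketched in plain Python. This is only a model of the two behaviours described above, with hypothetical function names and a made-up list of facet terms.

```python
def exclude_exact(terms, excluded):
    """Drop only the terms that are exactly equal to an excluded word,
    which is how the exclude parameter is described above."""
    return [term for term in terms if term not in excluded]


def exclude_containing(terms, excluded):
    """Drop every term that contains an excluded word as a substring,
    mirroring a term.contains(...) style script."""
    return [term for term in terms if not any(word in term for word in excluded)]


if __name__ == "__main__":
    # Hypothetical facet terms produced by a shingle filter.
    terms = ["evil", "resident evil", "evil computer", "my evil cat", "good dog"]
    print(exclude_exact(terms, ["evil"]))
    print(exclude_containing(terms, ["evil"]))
```

Note that a substring test would also drop a term such as "devil", so a word-boundary check may be preferable depending on the data.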

elasticsearch splits by space in facets

I am trying to do a simple facet request over a field containing more than one word (Eg: 'Name1 Name2', sometimes with dots and commas inside) but what I get is...
"terms" : [{
"term" : "Name1",
"count" : 15
},
{
"term" : "Name2",
"count" : 15
}]
so my field value is split by spaces and then runs the facet request...
Query example:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
"query": {
"query_string": {
"fields": [
"dataset"
],
"query": "2",
"default_operator": "AND"
}
},
"facets": {
"test": {
"terms": {
"field": [
"speciesName"
],
"size": 50000
}
}
}
}'

Your field shouldn't be analyzed, or at least not tokenized. You need to update your mapping and then reindex if you want to index the field without tokenizing it.
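The effect of tokenization on facet counts can be reproduced with a small Python model. It assumes whitespace splitting as a stand-in for the analyzer, and `facet_counts` is a hypothetical helper, not an Elasticsearch API.

```python
from collections import Counter


def facet_counts(values, tokenized):
    """Count facet terms: one term per token if the field is tokenized,
    otherwise one term per whole field value."""
    counts = Counter()
    for value in values:
        counts.update(value.split() if tokenized else [value])
    return counts


if __name__ == "__main__":
    values = ["Name1 Name2"] * 15
    print(facet_counts(values, tokenized=True))
    print(facet_counts(values, tokenized=False))
```

With tokenization the 15 documents yield two separate terms with a count of 15 each, as in the question; without it they yield the single term "Name1 Name2".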

First of all, javanna provided a very good answer from a practical perspective. However, for the sake of completeness, I want to mention that in some cases there is a way to do it without reindexing the data.
If the speciesName field is stored and your queries produce a relatively small number of results, you can use script_field to retrieve stored field values:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
"query": {
"query_string": {
"fields": ["dataset"],
"query": "2",
"default_operator": "AND"
}
},
"facets": {
"test": {
"terms": {
"script_field": "_fields['\''speciesName'\''].value",
"size": 50000
}
}
}
}
'
As a result of this query, elasticsearch will retrieve the speciesName field for every record in your result set and it will construct facets from these values. Needless to say, if your result set contains millions of records, performance of this query might be sluggish.
Similarly, if the field is not stored, but record source is stored, you can use script_field facet to retrieve field values from the source:
......
"script_field": "_source['\''speciesName'\'']",
......
Again, source for each record in the result list will be retrieved and parsed, so you might need some patience to run this query on a large set of records.
