Join / split search words in elasticsearch (using tire) - ruby-on-rails-3

I have the following analyzer (a slight tweak to the way snowball would be set up):
string_analyzer: {
  filter: [ "standard", "stop", "snowball" ],
  tokenizer: "lowercase"
}
Here is the field it is applied to:
indexes :title, type: 'string', analyzer: 'string_analyzer'
and here is the query:
query do
  match ['title'], search_terms, fuzziness: 0.5, max_expansions: 10, operator: 'and'
end
I have a record in my index with the title "foo bar".
If I search for foo bar, it appears in the results.
However, if I search for foobar, it doesn't.
Can someone explain why, and if possible, how I could get it to?
Can someone also explain how to get the reverse to work, so that if I had a record with the title "foobar", a user searching for foo bar would see it as a result?
Thanks

You can only search for tokens that are in your index. So let's look at what you are indexing.
You're currently using the lowercase tokenizer (which tokenizes a string on non-letter characters and lowercases the tokens), then applying the standard filter (redundant here, because you are not using the standard tokenizer) plus the stop and snowball filters.
If we create that analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "string_analyzer" : {
          "filter" : [
            "standard",
            "stop",
            "snowball"
          ],
          "tokenizer" : "lowercase"
        }
      }
    }
  }
}
'
and use the analyze API to test it out:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=string_analyzer'
you'll see that "foo bar" produces the terms ["foo","bar"] and "foobar" produces the term ["foobar"]. So indexing "foo bar" and searching for "foobar" currently cannot work.
If you want to be able to search "inside" words, then you need to break the words up into smaller tokens. To do this, we use an ngram token filter inside a custom analyzer.
So delete the test index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
and specify a new analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "ngrams" : {
          "max_gram" : 5,
          "min_gram" : 1,
          "type" : "ngram"
        }
      },
      "analyzer" : {
        "ngrams" : {
          "filter" : [
            "standard",
            "lowercase",
            "ngrams"
          ],
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'
Now, if we test the analyzer, we get:
"foo bar" => [f,o,o,fo,oo,foo,b,a,r,ba,ar,bar]
"foobar" => [f,o,o,b,a,r,fo,oo,ob,ba,ar,foo,oob,oba,bar,foob,ooba,obar,fooba,oobar]
So if we index "foo bar" and we search for "foobar" using the match query, the query looks for any of those tokens, some of which exist in the index.
Unfortunately, it will also overlap with "wear the fox hat" (which shares f, o and a). While "foo bar" will appear higher up the list of results because it has more tokens in common, you will still get apparently unrelated results.
This can be controlled by using the minimum_should_match parameter, eg:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query" : {
    "match" : {
      "my_field" : {
        "minimum_should_match" : "60%",
        "query" : "foobar"
      }
    }
  }
}
'
The exact value for minimum_should_match depends upon your data - experiment with it.
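To tie this back to the original mapping, the title field then needs to use the new analyzer. A minimal sketch in the same curl style (the test/test index and type names are placeholders; with tire you would instead pass analyzer: 'ngrams' to the indexes call):
curl -XPUT 'http://127.0.0.1:9200/test/test/_mapping?pretty=1' -d '
{
  "test" : {
    "properties" : {
      "title" : {
        "type" : "string",
        "analyzer" : "ngrams"
      }
    }
  }
}
'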

Related

Mongodb query problem, how to get the matching items of the $or operator

First of all, thank you.
MongoDB version: 4.2.11
I have a piece of data like this:
{
  "name" : ...,
  ...
  "administration" : [
    {"name" : ..., "job" : ...},
    {"name" : ..., "job" : ...}
  ],
  "shareholder" : [
    {"name" : ..., "proportion" : ...},
    {"name" : ..., "proportion" : ...}
  ]
}
I want to match certain documents using regular expressions. For example:
db.collection.aggregate([
  {"$match" : {
    "$or" : [
      {"name" : {"$regex" : "Keyword"}},
      {"administration.name" : {"$regex" : "Keyword"}},
      {"shareholder.name" : {"$regex" : "Keyword"}}
    ]
  }}
])
I want to set a flag (a custom field) when the $or operator successfully matches a condition, indicating which condition matched. For example, if {"name" : {"$regex": "Keyword"}} matches, the output should be:
{"$project" :
{
"_id":false,
"name" : true,
"__regex_type__" : "name"
}
},
{"administration.name" : {"$regex": "Keyword"}}Execute on success:"__regex_type__" : "administration.name"
I tried this:
{"$project" :
{
"_id":false,
"name" : true,
"__regex_type__" :
{
"$switch":
{
"branches":
[
{"case": {"$regexMatch":{"input":"$name","regex": "Keyword"}},"then" : "name"},
{"case": {"$regexMatch":{"input":"$administration.name","regex": "Keyword"}},"then" : "administration.name"},
{"case": {"$regexMatch":{"input":"$shareholder.name","regex": "Keyword"}},"then" : "shareholder.name"},
],
"default" : "Other matches"
}
}
}
},
But $regexMatch cannot match against an array. I also tried $unwind, but it returns one document per array member, which is not what I want.
I want to implement the equivalent of this MySQL statement in MongoDB:
SELECT name, administration.name, shareholder.name,
  (CASE
     WHEN name REGEXP("Keyword") THEN "name"
     WHEN administration.name REGEXP("Keyword") THEN "administration.name"
     WHEN shareholder.name REGEXP("Keyword") THEN "shareholder.name"
   END) AS __regex_type__
FROM db.mytable
WHERE name REGEXP("Keyword")
   OR shareholder.name REGEXP("Keyword")
   OR administration.name REGEXP("Keyword");
Maybe this method is stupid, but I don’t have a better solution.
If you have a better solution, I would appreciate it!!!
Thank you!!!
Since $regexMatch does not handle arrays, use $filter to filter individual array elements with $regexMatch, then use $size to see how many elements matched.
[{"$match"=>{"$or"=>[{"a"=>"test"}, {"arr.a"=>"test"}]}},
{"$project"=>
{"a"=>1,
"arr"=>1,
"src"=>
{"$switch"=>
{"branches"=>
[{"case"=>{"$regexMatch"=>{"input"=>"$a", "regex"=>"test"}},
"then"=>"a"},
{"case"=>
{"$gte"=>
[{"$size"=>
{"$filter"=>
{"input"=>"$arr.a",
"cond"=>
{"$regexMatch"=>{"input"=>"$$this", "regex"=>"test"}}}}},
1]},
"then"=>"arr.a"}],
"default"=>"def"}}}}]
[{"_id"=>BSON::ObjectId('5ffb2df748966813f82f15ad'), "a"=>"test", "src"=>"a"},
{"_id"=>BSON::ObjectId('5ffb2df748966813f82f15ae'),
"arr"=>[{"a"=>"test"}],
"src"=>"arr.a"}]

elasticsearch: how to index terms which are stopwords only?

I had much success building my own little search with elasticsearch in the background. But there is one thing I couldn't find in the documentation.
I'm indexing the names of musicians and bands. There is one band called "The The" and due to the stop words list this band is never indexed.
I know I can ignore the stop words list completely but this is not what I want since the results searching for other bands like "the who" would explode.
So, is it possible to save "The The" in the index but not disabling the stop words at all?
You can use the synonym filter to convert The The into a single token, e.g. thethe, which won't be removed by the stopwords filter.
First, configure the analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "syn" : {
          "synonyms" : [
            "the the => thethe"
          ],
          "type" : "synonym"
        }
      },
      "analyzer" : {
        "syn" : {
          "filter" : [
            "lowercase",
            "syn",
            "stop"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'
Then test it with the string "The The The Who".
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+The+Who&analyzer=syn'
{
  "tokens" : [
    {
      "end_offset" : 7,
      "position" : 1,
      "start_offset" : 0,
      "type" : "SYNONYM",
      "token" : "thethe"
    },
    {
      "end_offset" : 15,
      "position" : 3,
      "start_offset" : 12,
      "type" : "<ALPHANUM>",
      "token" : "who"
    }
  ]
}
"The The" has been tokenized as "the the", and "The Who" as "who" because the preceding "the" was removed by the stopwords filter.
To stop or not to stop
Which brings us back to whether we should include stopwords or not? You said:
I know I can ignore the stop words list completely
but this is not what I want since the results searching
for other bands like "the who" would explode.
What do you mean by that? Explode how? Index size? Performance?
Stopwords were originally introduced to improve search engine performance by removing common words which are likely to have little effect on the relevance of a query. However, we've come a long way since then. Our servers are capable of much more than they were back in the 80s.
Indexing stopwords won't have a huge impact on index size. For instance, to index the word the means adding a single term to the index. You already have thousands of terms - indexing the stopwords as well won't make much difference to size or to performance.
Actually, the bigger problem is that the is very common and thus will have a low impact on relevance, so a search for "The The concert Madrid" will prefer Madrid over the other terms.
This can be mitigated by using a shingle filter, which would result in these tokens:
['the the','the concert','concert madrid']
While the may be common, the the isn't and so will rank higher.
You wouldn't query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (without stopwords) with a query against the shingled field.
We can use a multi-field to analyze the text field in two different ways:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "mappings" : {
    "test" : {
      "properties" : {
        "text" : {
          "type" : "multi_field",
          "fields" : {
            "shingle" : {
              "type" : "string",
              "analyzer" : "shingle"
            },
            "text" : {
              "type" : "string",
              "analyzer" : "no_stop"
            }
          }
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "no_stop" : {
          "stopwords" : "",
          "type" : "standard"
        },
        "shingle" : {
          "filter" : [
            "standard",
            "lowercase",
            "shingle"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'
Then use a multi_match query to query both versions of the field, giving the shingled version more "boost"/relevance. In this example the text.shingle^2 means that we want to boost that field by 2:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query" : {
    "multi_match" : {
      "fields" : [
        "text",
        "text.shingle^2"
      ],
      "query" : "the the concert madrid"
    }
  }
}
'
'

Unable to filter out shingle (n-gram) facets using the "exclude" words option provided in the "facets" query

I am trying to make a tagcloud of words and phrases using the facets feature of elasticsearch.
My mapping:
curl -XPOST http://localhost:9200/myIndex/ -d '{
  ...
  "analysis" : {
    "filter" : {
      "myCustomShingle" : {
        "type" : "shingle",
        "max_shingle_size" : 3,
        "output_unigrams" : true
      }
    },
    "analyzer" : {            // making a custom analyzer
      "myAnalyzer" : {
        "type" : "custom",
        "tokenizer" : "standard",
        "filter" : [
          "lowercase",
          "myCustomShingle",
          "stop"
        ]
      }
    }
  }
  ...
},
"mappings" : {
  ...
  "description" : {           // the field to be analyzed for making the tag cloud
    "type" : "string",
    "analyzer" : "myAnalyzer",
    "null_value" : "null"
  },
  ...
}
Query for generating facets:
curl -X POST "http://localhost:9200/myIndex/myType/_search?pretty=true" -d '
{
  "size" : "0",
  "query" : {
    "match_all" : {}
  },
  "facets" : {
    "blah" : {
      "terms" : {
        "fields" : ["description"],
        "exclude" : ["evil"],    // remove facets that contain these words
        "size" : "50"
      }
    }
  }
}'
My problem is that when I insert a word, say 'evil', in the "exclude" option of "facets", it successfully removes the facets containing the single-word shingles that match 'evil'. But it doesn't remove the two- or three-word shingles: "resident evil", "evil computer", "my evil cat". How do I remove the facets of phrases containing the "exclude" words?
It isn't completely clear what you want to achieve. You usually wouldn't make facets on analyzed fields. Maybe you could explain why you're making shingles so that we can help achieving what you want in a better way.
With the exclude facet parameter you can exclude some specific entry, but evil is not the same as resident evil. If you want to exclude it you need to specify it. Facets are made based on indexed terms, and resident evil is in fact a single term in the index, which is not the same as the term evil.
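One workaround, then, is simply to list every shingle you want excluded in the exclude parameter, which works but does not scale (the extra values here are illustrative):
"facets" : {
  "blah" : {
    "terms" : {
      "fields" : ["description"],
      "exclude" : ["evil", "resident evil", "evil computer", "my evil cat"],
      "size" : "50"
    }
  }
}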
Given the choice that you already made for indexing and faceting, there is a more flexible way to achieve what you want. Elasticsearch has a really powerful scripting module. You can use a script to decide whether each term should be included in the facet or not; returning false for terms that contain evil excludes them:
{
  "query" : {
    "match_all" : {}
  },
  "facets" : {
    "tags" : {
      "terms" : {
        "field" : "tags",
        "script" : "term.contains('evil') ? false : true"
      }
    }
  }
}

How to prevent Facet Terms from tokenizing

I am using Facet Terms to get all the unique values and their count for a field. And I am getting wrong results.
term: web
Count: 1191979
term: misc
Count: 1191979
term: passwd
Count: 1191979
term: etc
Count: 1191979
While the actual result should be:
term: WEB-MISC /etc/passwd
Count: 1191979
Here is my sample query:
{
  "facets" : {
    "terms1" : {
      "terms" : {
        "field" : "message"
      }
    }
  }
}
If reindexing is an option, the best solution would be to change the mapping and mark this field as not_analyzed:
"your_field" : { "type": "string", "index" : "not_analyzed" }
You can use the multi_field type if keeping an analyzed version of the field is desired:
"your_field" : {
"type" : "multi_field",
"fields" : {
"your_field" : {"type" : "string", "index" : "analyzed"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
}
This way, you can continue using your_field in the queries, while running facet searches using your_field.untouched.
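For example, the facet request would then look something like this (field names as in the mapping above):
"facets" : {
  "term" : {
    "terms" : {
      "field" : "your_field.untouched"
    }
  }
}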
Alternatively, if this field is stored, you can use a script field facet instead:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_fields.your_field.value"
}
}
}
As the last resort, if this field is not stored, but record source is stored in the index, you can try this:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_source.your_field"
}
}
}
The first solution is the most efficient. The last solution is the least efficient and may take a lot of time on a large index.
I also ran into this same issue today while running a terms aggregation in a recent version of elasticsearch. After some googling and partial understanding, I found out how this indexing works (which is very simple).
Queries can find only terms that actually exist in the inverted index
When you index the following string
"WEB-MISC /etc/passwd"
it will be passed to an analyzer. The analyzer might tokenize it into
"WEB", "MISC", "etc" and "passwd"
along with their position details. These tokens might then be lowercased by a filter to
"web", "misc", "etc" and "passwd"
So, after indexing, a search query can only see those 4 tokens, not the complete string "WEB-MISC /etc/passwd". For your requirement, the following are the options you can use:
1. Change the default analyzer used by elasticsearch.
2. If analysis is not needed, turn it off by setting those fields to not_analyzed.
3. To make already indexed data searchable this way, re-indexing is the only option.
There are multiple approaches to this. One is the use of not_analyzed to preserve the string exactly as it is. But since that has the drawback of making matching case-sensitive, a better approach is to use the keyword tokenizer combined with a lowercase filter.
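A minimal sketch of such an analyzer, assuming a custom name keyword_lower (the keyword tokenizer emits the whole string as a single token, and the lowercase filter makes matching case-insensitive):
"analysis" : {
  "analyzer" : {
    "keyword_lower" : {
      "type" : "custom",
      "tokenizer" : "keyword",
      "filter" : [ "lowercase" ]
    }
  }
}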

ElasticSearch's Fuzzy Query

I am brand new to ElasticSearch, and am currently exploring its features. One of them I am interested in is the Fuzzy Query, which I am testing and having trouble using. It is probably a dumb question, so I guess someone who has already used this feature will quickly find the answer, at least I hope. :)
By the way, I have the feeling that this might not be related only to ElasticSearch but directly to Lucene.
Let's start with a new index named "firstindex", in which I store a document with a field "label" whose value is "american football". This is the request I use:
bash-3.2$ curl -XPOST 'http://localhost:9200/firstindex/node/?pretty=true' -d '{
  "node" : {
    "label" : "american football"
  }
}
'
This is the result I get.
{
  "ok" : true,
  "_index" : "firstindex",
  "_type" : "node",
  "_id" : "6TXNrLSESYepXPpFWjpl1A",
  "_version" : 1
}
So far so good, now I want to find this entry using a fuzzy query. This is the one I send:
bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american football",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}
'
And this is the result I get
{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
As you can see, no hit. But when I shrink my query's value a bit, from "american football" to "american footb", like this:
bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american footb",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}
'
Then I get a correct hit on my entry, thus the result is:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "firstindex",
      "_type" : "node",
      "_id" : "6TXNrLSESYepXPpFWjpl1A",
      "_score" : 0.19178301,
      "_source" : {
        "node" : {
          "label" : "american football"
        }
      }
    } ]
  }
}
So, I have several questions related to this test:
1. Why didn't I get any results when performing a query whose value exactly equals my only entry, "american football"?
2. Is it related to the fact that I have a multi-word value?
3. Is there a way to get the "similarity" score in my query results, so I can better understand how to find the right threshold for my fuzzy queries?
4. There is a page dedicated to the Fuzzy Query on the ElasticSearch web site, but I am not sure it lists all the potential parameters I can use for the fuzzy query. Where could I find such an exhaustive list?
5. Same question for the other queries, actually.
6. Is there a difference between a Fuzzy Query and a Query String Query that uses Lucene syntax to get fuzzy matching?
1.
The fuzzy query operates on terms. It cannot handle phrases because it doesn't analyze the text. So, in your example, elasticsearch tries to match the term "american football" to the term american and to the term football. The match between terms is based on Levenshtein distance, which is used to calculate similarity score. Since you have min_similarity=0.0 any term should match any term as long as edit distance is smaller than the size of the smallest term. In your case, the term "american football" has size 17 and the term "american" has size 8. The distance between these two terms is 9 which is bigger than the size of the smallest term 8. So, as a result, this term is getting rejected. The edit distance between "american footb" and "american" is 6. It's basically the term "american" with 6 additions at the end. That's why it produces results. With min_similarity=0.0 pretty much anything with edit distance 7 or less will match. You will even get results while searching for "aqqqqqq", for example.
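In other words, the fuzzy query behaves as expected once it is given a single (possibly misspelled) term. A rough sketch against the same index (the misspelling and min_similarity value are illustrative):
bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "footbal",
        "min_similarity" : 0.5
      }
    }
  }
}
'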
2.
Yes, as I explained above, it is somewhat related to multi-word values. If you want to search for multiple terms, take a look at the Fuzzy Like This query and the fuzziness parameter of the Text query.
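For instance, a text query with a fuzziness parameter handles the multi-term case because it analyzes the input into separate terms first; a rough sketch (the query text and fuzziness value are illustrative):
bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "text" : {
      "label" : {
        "query" : "americann footbal",
        "fuzziness" : 0.6
      }
    }
  }
}
'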
4 & 5.
Usually, the next best source of information after elasticsearch.org is the elasticsearch source code.