How to set up a field mapping for ElasticSearch that allows both exact and full-text searching?

Here is my problem:
I have a field called product_id that is in a format similar to:
A+B-1232412
If I used the standard text analyzer it splits it into tokens like so:
curl -XGET 'http://127.0.0.1:9200/_analyze?analyzer=standard&pretty=true' -d '
A+B-1232412
'
{
"tokens" : [ {
"token" : "a",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "b",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "1232412",
"start_offset" : 5,
"end_offset" : 12,
"type" : "<NUM>",
"position" : 3
} ]
}
Ideally, I would like to sometimes search for an exact product id and at other times use a substring, or just query for part of the product id.
My understanding of mappings and analyzers is that I can only specify one analyzer per field.
Is there a way to store a field as both analyzed and exact match?

Yes, you can use the fields parameter. In your case:
"product_id": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
This allows you to index the same data twice, using two different definitions. In this case it will be indexed both via the default analyzer and as not_analyzed, which will only pick up exact matches. This is also useful for sorting returned results:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/multi-fields.html
However, you will need to spend some time thinking about how you want to search. In particular, given part numbers with a mix of alphabetic, numeric, punctuation and special characters, you may need to get creative when tuning your queries and matches.
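For example, here is a minimal sketch of the two kinds of query against such a mapping (the products index name and the sample values are assumptions, not part of the original question):

# Exact match against the not_analyzed sub-field
curl -XGET 'http://127.0.0.1:9200/products/_search?pretty=1' -d '
{
  "query": {
    "term": { "product_id.raw": "A+B-1232412" }
  }
}
'

# Full-text match against the analyzed field (matches on any token, e.g. the numeric part)
curl -XGET 'http://127.0.0.1:9200/products/_search?pretty=1' -d '
{
  "query": {
    "match": { "product_id": "1232412" }
  }
}
'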

Related

How to query an embedded document using MongoDB

I need help constructing this mongo query.
So far I can query on the first level, but I am unable to do so at the next embedded level ("labels" > "2").
For example, the document structure looks like this:
> db.versions_20170420.findOne();
{
"_id" : ObjectId("54bf146b77ac503bbf0f0130"),
"account" : "foo",
"labels" : {
"1" : {
"name" : "one",
"color" : "color1"
},
"2" : {
"name" : "two",
"color" : "color2"
},
"3" : {
"name" : "three",
"color" : "color3"
}
},
"profile" : "bar",
"version" : NumberLong("201412192106")
This query I can filter at the first level (account, profile).
db.profile_versions_20170420.find({"account":"foo", "profile": "bar"}).pretty()
However, given this structure, I'm looking for documents where "labels" > "2". It looks like "2" is not a number but a string. Is there a way to construct the mongo query to do that? Do I need to do some conversion?
If I understand you and your data structure correctly, "labels" > "2" means that the labels object must contain the property labels.3, which is easy to check with the following query:
db.profile_versions_20170420.find(
{"account": "foo", "profile": "bar", "labels.3": {$exists: true}}
).pretty();
But this doesn't mean that your object contains at least 3 properties: $exists is not the $size operator, which counts the elements of an array, and we cannot use $size here because labels is an object, not an array. So in our case we only know that labels has the property 3, even if that is the only property labels contains.
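For example, a document whose labels object is just { "3" : { "name" : "three", "color" : "color3" } } would still match, even though it contains only one label.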
You can improve the find criteria:
db.profile_versions_20170420.find({
"account": "foo",
"profile": "bar",
"labels.1": {$exists: true},
"labels.2": {$exists: true},
"labels.3": {$exists: true}
}).pretty();
and ensure that labels contains the elements 1, 2 and 3, but in this case you have to take care of the object structure at the application level when inserting, updating or deleting data in the document.
As another option, you can update your db and add an extra field labelsCount; after that you will be able to run a query like this:
db.profile_versions_20170420.find(
{"account": "foo", "profile": "bar", "labelsConut": {$gt: 2}}
).pretty();
Btw, it will work faster...
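A minimal sketch for backfilling that counter from the mongo shell (the field name labelsCount is an assumption; the collection name is taken from the question):

db.profile_versions_20170420.find().forEach(function (doc) {
    // count the keys of the labels object and store the count on the document
    db.profile_versions_20170420.update(
        { "_id": doc._id },
        { "$set": { "labelsCount": Object.keys(doc.labels || {}).length } }
    );
});

You would then need to keep labelsCount in sync whenever labels is modified.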

Cloudant search document by attributes of nested objects

My documents in cloudant have the following structure
{
"_id" : "1234",
"name" : "test",
"objects" : [
{
"type" : "TYPE1"
"time" : "1215
},
{
"type" : "TYPE2"
"time" : "1115"
}
]
}
Now I need to query my documents by a list of types.
Examples
1) If I query with TYPE1, then all the documents where there is an object with this type would be returned. (The example doc would be returned.)
2) If I query with TYPE1 and TYPE3, it would return all documents which contain either of them. (The example doc would be returned.)
3) If I query with TYPE3, TYPE4 and TYPE5, it would return all documents which contain either of them. (The example doc would not be returned.)
What would the code in the _design document look like, and what would my API request look like?
One option is to use Cloudant Search.
Here is a sample design document named types, which indexes each type property in your objects array:
{
"_id": "_design/types",
"views": {},
"language": "javascript",
"indexes": {
"one-of": {
"analyzer": "standard",
"index": "function (doc) {\n for(var i in doc.objects) {\n index(\"type\", doc.objects[i].type); \n }\n}"
}
}
}
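For readability, here is the same index function unescaped:

function (doc) {
  for (var i in doc.objects) {
    index("type", doc.objects[i].type);
  }
}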
Query examples:
Search for one key (type=val)
GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1
Search for multiple keys (type=val1 OR type=val2)
GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1%20OR%20type%3ATYPE2
Search for multiple keys (type=val1 AND type=val2)
GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1%20AND%20type%3ATYPE2
To include the documents in the response append &include_docs=true.
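For example, the second query above with documents included would be:

GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1%20OR%20type%3ATYPE2&include_docs=true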

Fuzzy Like This on Attachment Returns Nothing on Partial Word

I have my mapping like this:
{
"doc": {
"mappings": {
"mydocument": {
"properties": {
"file": {
"type": "attachment",
"path": "full",
"fields": {
"file": {
"type": "string",
"store": true,
"term_vector": "with_positions_offsets"
},
"author": {
...
When I search for a complete word I get the result:
"query": {
"fuzzy_like_this" : {
"fields" : ["file"],
"like_text" : "This_is_something_I_want_to_search_for",
"max_query_terms" : 12
}
},
"highlight" : {
"number_of_fragments" : 3,
"fragment_size" : 650,
"fields" : {
"file" : { }
}
}
But if I change the search term to "This_is_something_I_want" I get nothing. What am I missing?
To implement a partial match, we must first understand what fuzzy like this does, and then decide what you want partial matching to return. Fuzzy like this performs two key functions.
First, the like_text will be analyzed using the default analyzer. All the resulting tokens will then be used to find documents based on term frequency, or tf-idf.
This typically means that the input will be split on spaces and lowercased. This_is_something_I_want will therefore be tokenized to this_is_something_i_want. Unless you have files with this exact term, no documents will match.
Secondly, all terms will be fuzzified. Fuzzy searches score terms based on how many character changes need to be made to one word to match another word. For instance, to get from bat to hat we need to make 1 character change.
In our case, to get from this_is_something_i_want to this_is_something_i_want_to_search_for, we would need to make 14 character changes (adding _to_search_for). Standard fuzzy search only allows for 3 character changes when working with terms longer than 5 or 6 characters. Increasing the fuzzy limit to 14 would, however, produce severely skewed results.
So neither of these functions will help produce the results you seek.
Here is what I can suggest:
You can implement an analyzer that splits on underscores, similar to this. The tokens produced will then be ['this', 'is', 'something', 'i', 'want'], which can correctly be matched against the sample case.
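A minimal sketch of such an analyzer using the pattern analyzer (the analyzer name underscore_split is an assumption; the index name is taken from the mapping above):

curl -XPUT 'http://127.0.0.1:9200/doc/?pretty=1' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "underscore_split": {
          "type": "pattern",
          "pattern": "_",
          "lowercase": true
        }
      }
    }
  }
}
'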
Alternatively, if all you want is documents whose text starts with the specified string, you can use a phrase prefix query instead of fuzzy like this. Documentation here.
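A rough sketch of such a query against the file field (index and type names are taken from the mapping above):

curl -XGET 'http://127.0.0.1:9200/doc/mydocument/_search?pretty=1' -d '
{
  "query": {
    "match_phrase_prefix": {
      "file": "This_is_something_I_want"
    }
  }
}
'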

Search multiple fields with "and" operator (but use fields' own analyzers)

ElasticSearch Version: 0.90.2
Here's the problem: I want to find documents in the index so that they:
match all query tokens across multiple fields
fields' own analyzers are used
So if there are 4 documents:
{ "_id" : 1, "name" : "Joe Doe", "mark" : "1", "message" : "Message First" }
{ "_id" : 2, "name" : "Ann", "mark" : "3", "message" : "Yesterday Joe Doe got 1 for the message First"}
{ "_id" : 3, "name" : "Joe Doe", "mark" : "2", "message" : "Message Second" }
{ "_id" : 4, "name" : "Dan Spencer", "mark" : "2", "message" : "Message Third" }
And if the query is "Joe First 1", it should find ids 1 and 2. I.e., it should find documents which contain all the tokens from the search query, no matter which fields they are in (maybe all tokens are in one field, or maybe each token is in its own field).
One solution would be to use elasticsearch's "_all" field functionality: that way it will merge all the fields I need (name, mark, message) into one, and I'll be able to query it with something like:
"match": {
"_all": {
"query": "Joe First 1",
"operator": "and"
}
}
But this way I can specify an analyzer for the "_all" field only, and I need the "name" and "message" fields to have different sets of tokenizers/token filters (let's say name will have a phonetic analyzer and message will have some stemming token filter).
Is there a way to do this?
Thanks to the guys at the elasticsearch group, here's the solution... pretty simple, I have to say :)
All I needed to do was to use the query_string query (http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query/) with default_operator = AND, and it does the trick:
{
"query": {
"query_string": {
"fields": [
"name",
"mark",
"message"
],
"query": "Joe First 1",
"default_operator": "AND"
}
}
}
I think using a multi match query makes sense here. Something like:
"multi_match": {
"query": "Joe First 1",
"operator": "and"
"fields": [ "name", "message", "mark"]
}
As you say, you can set the analyzer (or search_analyzer/index_analyzer) to be used on the _all field. It seems to me that should indeed be your first step to achieve the query results you're looking for.
From http://jontai.me/blog/2012/10/lucene-scoring-and-elasticsearch-_all-field/, we have this tasty quote:
... the _all field copies the text from the other fields and analyzes
them again; it doesn’t copy the pre-analyzed tokens. You can set a
separate analyzer for the _all field.
Which I interpret to mean that you should set your _all analyzer(s) as well as the individual field analyzer(s). The _all field doesn't reuse each field's pre-analyzed tokens; it grabs the original field contents and analyzes them with its own analyzer.
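A rough sketch of such a mapping (the type name and the analyzer names my_phonetic and my_stemming are placeholders; those analyzers would be defined in the index settings):

curl -XPUT 'http://127.0.0.1:9200/test/mytype/_mapping?pretty=1' -d '
{
  "mytype": {
    "_all": { "index_analyzer": "standard", "search_analyzer": "standard" },
    "properties": {
      "name":    { "type": "string", "analyzer": "my_phonetic" },
      "message": { "type": "string", "analyzer": "my_stemming" },
      "mark":    { "type": "string" }
    }
  }
}
'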

elasticsearch: how to index terms which are stopwords only?

I had much success building my own little search with elasticsearch in the background. But there is one thing I couldn't find in the documentation.
I'm indexing the names of musicians and bands. There is one band called "The The" and due to the stop words list this band is never indexed.
I know I can ignore the stop words list completely but this is not what I want since the results searching for other bands like "the who" would explode.
So, is it possible to keep "The The" in the index without disabling the stop words entirely?
You can use the synonym filter to convert The The into a single token, e.g. thethe, which won't be removed by the stopwords filter.
First, configure the analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"filter" : {
"syn" : {
"synonyms" : [
"the the => thethe"
],
"type" : "synonym"
}
},
"analyzer" : {
"syn" : {
"filter" : [
"lowercase",
"syn",
"stop"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
}
}
'
Then test it with the string "The The The Who".
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+The+Who&analyzer=syn'
{
"tokens" : [
{
"end_offset" : 7,
"position" : 1,
"start_offset" : 0,
"type" : "SYNONYM",
"token" : "thethe"
},
{
"end_offset" : 15,
"position" : 3,
"start_offset" : 12,
"type" : "<ALPHANUM>",
"token" : "who"
}
]
}
"The The" has been tokenized as "the the", and "The Who" as "who" because the preceding "the" was removed by the stopwords filter.
To stop or not to stop
Which brings us back to whether we should include stopwords or not? You said:
I know I can ignore the stop words list completely
but this is not what I want since the results searching
for other bands like "the who" would explode.
What do you mean by that? Explode how? Index size? Performance?
Stopwords were originally introduced to improve search engine performance by removing common words which are likely to have little effect on the relevance of a query. However, we've come a long way since then. Our servers are capable of much more than they were back in the 80s.
Indexing stopwords won't have a huge impact on index size. For instance, indexing the word the means adding a single term to the index. You already have thousands of terms - indexing the stopwords as well won't make much difference to size or to performance.
Actually, the bigger problem is that the is very common and thus will have a low impact on relevance, so a search for "The The concert Madrid" will prefer Madrid over the other terms.
This can be mitigated by using a shingle filter, which would result in these tokens:
['the the','the concert','concert madrid']
While the may be common, the the isn't and so will rank higher.
You wouldn't query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (without stopwords) with a query against the shingled field.
We can use a multi-field to analyze the text field in two different ways:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"mappings" : {
"test" : {
"properties" : {
"text" : {
"fields" : {
"shingle" : {
"type" : "string",
"analyzer" : "shingle"
},
"text" : {
"type" : "string",
"analyzer" : "no_stop"
}
},
"type" : "multi_field"
}
}
}
},
"settings" : {
"analysis" : {
"analyzer" : {
"no_stop" : {
"stopwords" : "",
"type" : "standard"
},
"shingle" : {
"filter" : [
"standard",
"lowercase",
"shingle"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
}
}
'
Then use a multi_match query to query both versions of the field, giving the shingled version more "boost"/relevance. In this example the text.shingle^2 means that we want to boost that field by 2:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
"query" : {
"multi_match" : {
"fields" : [
"text",
"text.shingle^2"
],
"query" : "the the concert madrid"
}
}
}
'