Cloudant search document by attributes of nested objects - indexing

My documents in cloudant have the following structure
{
  "_id": "1234",
  "name": "test",
  "objects": [
    {
      "type": "TYPE1",
      "time": "1215"
    },
    {
      "type": "TYPE2",
      "time": "1115"
    }
  ]
}
Now I need to query my documents by a list of types.
Examples:
1) If I query with TYPE1, then all documents that contain an object with this type are returned. (The example doc would be returned.)
2) If I query with TYPE1 and TYPE3, all documents that contain either of them are returned. (The example doc would be returned.)
3) If I query with TYPE3, TYPE4 and TYPE5, all documents that contain any of them are returned. (The example doc would not be returned.)
What would the code in the _design document look like, and what would my API request look like?

One option is to use Cloudant Search.
Here is a sample design document named types, which indexes each type property in your objects array:
{
  "_id": "_design/types",
  "views": {},
  "language": "javascript",
  "indexes": {
    "one-of": {
      "analyzer": "standard",
      "index": "function (doc) {\n  for (var i in doc.objects) {\n    index(\"type\", doc.objects[i].type);\n  }\n}"
    }
  }
}
Query examples:
Search for one key (type=val)
GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1
Search for multiple keys (type=val1 OR type=val2)
GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1%20OR%20type%3ATYPE2
Search for multiple keys (type=val1 AND type=val2)
GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1%20AND%20type%3ATYPE2
To include the documents in the response, append &include_docs=true to the request.
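For example, a concrete request that matches either TYPE1 or TYPE2 and also returns the matching documents themselves would look something like this (host and database are placeholders, as above):
GET https://$HOST/$DATABASE/_design/types/_search/one-of?q=type%3ATYPE1%20OR%20type%3ATYPE2&include_docs=true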

Related

Query for entire JSON document in nested JSON schema

Background:
I wish to locate the entire JSON document that has a condition where "state" = "new" and where length(Features.id) > 4
{
  "id": "123",
  "feedback": {
    "Features": [
      {
        "state": "new",
        "id": "12345"
      }
    ]
  }
}
This is what I have tried to do.
Since this is a nested document, my query looks like the one below; a Stack Overflow member helped me access the nested contents within the query, but is there a way to obtain the full document?
I have used:
SELECT VALUE t.id FROM t IN f.feedback.Features where t.state = 'new' and length(t.id)>4
This gives me the ids, but what I want is access to the full document matching this condition:
{
  "id": "123",
  "feedback": {
    "Features": [
      {
        "state": "new",
        "id": "12345"
      }
    ]
  }
}
Any help is appreciated
Try this:
SELECT *
FROM f
WHERE f.feedback.Features[0].state = 'new'
  AND length(f.feedback.Features[0].id) > 4
Here is the SELECT spec for Cosmos DB, with more details:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-select
Also, check out "working with JSON" in the Cosmos DB docs:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-working-with-json
If the Features array has more than one element, you can use the EXISTS clause to search within it, as sketched below. See the spec for EXISTS here, with examples:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-subquery#exists-expression
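As a rough sketch (untested against your data), an EXISTS subquery that returns the whole document whenever any element of Features satisfies the condition could look like this:
SELECT VALUE f
FROM f
WHERE EXISTS (
    SELECT VALUE t
    FROM t IN f.feedback.Features
    WHERE t.state = 'new' AND length(t.id) > 4
)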

How to setup a field mapping for ElasticSearch that allows both exact and full text searching?

Here is my problem:
I have a field called product_id that is in a format similar to:
A+B-12321412
If I use the standard text analyzer, it splits it into tokens like so:
/_analyze/?analyzer=standard&pretty=true" -d '
A+B-1232412
'
{
  "tokens" : [ {
    "token" : "a",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "b",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "1232412",
    "start_offset" : 5,
    "end_offset" : 12,
    "type" : "<NUM>",
    "position" : 3
  } ]
}
Ideally, I would like to sometimes search for an exact product id and other times use a sub string and or just do a query for part of the product id.
My understanding of mappings and analyzers is that I can only specify one analyzer per field.
Is there a way to store a field as both analyzed and exact match?
Yes, you can use the fields parameter. In your case:
"product_id": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
This allows you to index the same data twice, using two different definitions. In this case it will be indexed both via the default analyzer and as not_analyzed, which only picks up exact matches. This is also useful for sorting returned results:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/multi-fields.html
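For example (a sketch using the field names from the mapping above and the sample value from the question), an exact lookup goes against the not_analyzed sub-field:
{ "query": { "term": { "product_id.raw": "A+B-12321412" } } }
while a query for just part of the product id goes against the analyzed field:
{ "query": { "match": { "product_id": "12321412" } } }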
However, you will need to spend some time thinking about how you want to search. In particular, given part numbers with a mix of alpha, numeric and punctuation or special characters you may need to get creative to tune your queries and matches.

How should I index this schema in Elasticsearch

I am a bit lost on how to index these documents in Elasticsearch.
Document 1
{
text: ['chicken']
}
Document 2
{
text: ['chicken', ['broth', 'stock']]
}
I need to be able to query these using either 'chicken flavored stock' or 'chicken flavored broth' and it should return both documents with the same score, since all of their terms have been matched in the input query. It also shouldn't return doc 2 with only 'chicken' as query.
Basically, I want to know that all the terms in the 'text' field have been found somewhere in the query, and the internal array (i.e. 'broth' and 'stock') acts like an OR clause.
Is this even possible?
Update:
I did find a (cumbersome) way of doing it. I save the document by combining their fields into phrases (ex: ['chicken broth', 'chicken stock'] for doc 2). Then I search using every combination of the input as a phrase (ex: ['chicken', 'chicken flavored', 'chicken flavored broth', 'chicken broth', ...].)
This solution does give me the results I want, but I can't help but feel this is a common case that could be handled much more elegantly. It feels like the ngrams are along the path to my answer, but I can't quite work it out.
When you index documents without adding a custom mapping, Elasticsearch uses the standard analyzer by default.
You could remove the arrays from the text fields and index your documents as:
Document 1
{
"text": "chicken"
}
Document 2
{
"text": "chicken broth stock"
}
The standard analyzer will create the following tokens in the Lucene index:
Document 1
"chicken"
Document 2
"chicken", "broth", "stock"
Your documents match the search terms as follows:
chicken: the term 'chicken' matches in both documents; because the text field is shorter in Document 1, it scores higher than Document 2.
chicken flavored: the term 'chicken' matches in both documents, but there is no match for the term 'flavored'. Again, as the text field is shorter in Document 1, it scores higher than Document 2.
chicken flavored broth: the term 'chicken' matches in both documents, and the term 'broth' also matches in Document 2. There is no match on the term 'flavored' in either of the documents. Document 2 scores higher than Document 1 as it matches two of the terms in the query.
I don't really see a use case for ngrams as the above does what you want.
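For reference, the plain match query that exercises the behaviour described above would be something along these lines (the index name my-index is just a placeholder):
GET /my-index/_search
{
  "query": {
    "match": {
      "text": "chicken flavored broth"
    }
  }
}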
So here is something that you can try. Percolator can solve your problem but you will have to change the way you are indexing your documents.
So instead of indexing doc1 the way you are doing, index it like so:
PUT /test-index/.percolator/1
{
  "query": {
    "term": {
      "text": {
        "value": "chicken"
      }
    }
  }
}
And, index doc2 like so:
PUT /test-index/.percolator/2
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "text": {
              "value": "chicken"
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "text": {
                    "value": "broth"
                  }
                }
              },
              {
                "term": {
                  "text": {
                    "value": "stock"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
Now, instead of querying the way you were querying your documents earlier, percolate them:
GET /test-index/all_terms_search/_percolate
{
  "doc": {
    "text": "chicken flavored stock"
  }
}
This will match both of your documents. It also gives you the flexibility to control what, and how much, you want to match. While indexing your documents' reverse queries in the percolator, you provide an ID for each query; against that ID you can keep the corresponding text in a much simpler form for you to consume, either in a separate index in Elasticsearch or in some other datastore that can return the matching documents really fast.

Search multiple fields with "and" operator (but use fields' own analyzers)

ElasticSearch Version: 0.90.2
Here's the problem: I want to find documents in the index so that they:
match all query tokens across multiple fields
fields' own analyzers are used
So if there are 4 documents:
{ "_id" : 1, "name" : "Joe Doe", "mark" : "1", "message" : "Message First" }
{ "_id" : 2, "name" : "Ann", "mark" : "3", "message" : "Yesterday Joe Doe got 1 for the message First"}
{ "_id" : 3, "name" : "Joe Doe", "mark" : "2", "message" : "Message Second" }
{ "_id" : 4, "name" : "Dan Spencer", "mark" : "2", "message" : "Message Third" }
And if the query is "Joe First 1", it should find ids 1 and 2. I.e., it should find documents which contain all the tokens from the search query, no matter which fields they are in (maybe all tokens are in one field, or maybe each token is in its own field).
One solution would be to use elasticsearch "_all" field functionality: that way it will merge all the fields I need (name, mark, message) into one and I'll be able to query it with something like
"match": {
"_all": {
"query": "Joe First 1",
"operator": "and"
}
}
But this way I can specify an analyzer for the "_all" field only, and I need the "name" and "message" fields to have different sets of tokenizers/token filters (let's say name will have a phonetic analyzer and message will have some stemming token filter).
Is there a way to do this?
Thanks to the guys at the elasticsearch group, here's the solution... pretty simple, needless to say :)
All I needed to do was use the query_string query (http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query/) with default_operator = AND, and it does the trick:
{
  "query": {
    "query_string": {
      "fields": [
        "name",
        "mark",
        "message"
      ],
      "query": "Joe First 1",
      "default_operator": "AND"
    }
  }
}
I think using a multi_match query makes sense here. Something like:
"multi_match": {
  "query": "Joe First 1",
  "operator": "and",
  "fields": [ "name", "message", "mark" ]
}
As you say, you can set the analyzer (or search_analyzer/index_analyzer) to be used on the _all field. It seems to me that should indeed be your first step to achieve the query results you're looking for.
From http://jontai.me/blog/2012/10/lucene-scoring-and-elasticsearch-_all-field/, we have this tasty quote:
... the _all field copies the text from the other fields and analyzes
them again; it doesn’t copy the pre-analyzed tokens. You can set a
separate analyzer for the _all field.
Which I interpret to mean that you should set your _all analyzer(s) as well as the individual field analyzer(s). The _all field doesn't copy the fields' pre-analyzed tokens; it grabs the original field contents and analyzes them with its own analyzer.
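A rough mapping sketch along those lines (the type name doc and the analyzer names name_phonetic and message_stemmed are placeholders you would define yourself under settings.analysis):
{
  "mappings": {
    "doc": {
      "_all": { "analyzer": "standard" },
      "properties": {
        "name": { "type": "string", "analyzer": "name_phonetic" },
        "message": { "type": "string", "analyzer": "message_stemmed" },
        "mark": { "type": "string" }
      }
    }
  }
}
Note that, as the quote says, the _all copy is analyzed only by the _all analyzer; the per-field analyzers still apply when you query name or message directly.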

elasticsearch: how to index terms which are stopwords only?

I had much success building my own little search with elasticsearch in the background. But there is one thing I couldn't find in the documentation.
I'm indexing the names of musicians and bands. There is one band called "The The" and due to the stop words list this band is never indexed.
I know I can ignore the stop words list completely but this is not what I want since the results searching for other bands like "the who" would explode.
So, is it possible to keep "The The" in the index without disabling the stop words entirely?
You can use the synonym filter to convert The The into a single token, e.g. thethe, which won't be removed by the stopwords filter.
First, configure the analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "syn" : {
          "synonyms" : [
            "the the => thethe"
          ],
          "type" : "synonym"
        }
      },
      "analyzer" : {
        "syn" : {
          "filter" : [
            "lowercase",
            "syn",
            "stop"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'
Then test it with the string "The The The Who".
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+The+Who&analyzer=syn'
{
  "tokens" : [
    {
      "end_offset" : 7,
      "position" : 1,
      "start_offset" : 0,
      "type" : "SYNONYM",
      "token" : "thethe"
    },
    {
      "end_offset" : 15,
      "position" : 3,
      "start_offset" : 12,
      "type" : "<ALPHANUM>",
      "token" : "who"
    }
  ]
}
"The The" has been tokenized as "the the", and "The Who" as "who" because the preceding "the" was removed by the stopwords filter.
To stop or not to stop
Which brings us back to whether we should include stopwords or not? You said:
I know I can ignore the stop words list completely
but this is not what I want since the results searching
for other bands like "the who" would explode.
What do you mean by that? Explode how? Index size? Performance?
Stopwords were originally introduced to improve search engine performance by removing common words which are likely to have little effect on the relevance of a query. However, we've come a long way since then. Our servers are capable of much more than they were back in the 80s.
Indexing stopwords won't have a huge impact on index size. For instance, indexing the word "the" means adding a single term to the index. You already have thousands of terms, so indexing the stopwords as well won't make much difference to size or to performance.
Actually, the bigger problem is that "the" is very common and thus will have a low impact on relevance, so a search for "The The concert Madrid" will prefer Madrid over the other terms.
This can be mitigated by using a shingle filter, which would result in these tokens:
['the the','the concert','concert madrid']
While "the" may be common, "the the" isn't and so will rank higher.
You wouldn't query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (without stopwords) with a query against the shingled field.
We can use a multi-field to analyze the text field in two different ways:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "mappings" : {
    "test" : {
      "properties" : {
        "text" : {
          "fields" : {
            "shingle" : {
              "type" : "string",
              "analyzer" : "shingle"
            },
            "text" : {
              "type" : "string",
              "analyzer" : "no_stop"
            }
          },
          "type" : "multi_field"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "no_stop" : {
          "stopwords" : "",
          "type" : "standard"
        },
        "shingle" : {
          "filter" : [
            "standard",
            "lowercase",
            "shingle"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'
Then use a multi_match query to query both versions of the field, giving the shingled version more boost/relevance. In this example, text.shingle^2 means that we want to boost that field by a factor of 2:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query" : {
    "multi_match" : {
      "fields" : [
        "text",
        "text.shingle^2"
      ],
      "query" : "the the concert madrid"
    }
  }
}
'