How to do case insensitive search on dictionary key in CosmosDB? - sql

I can't figure out a way to do a case insensitive search on dictionary keys in CosmosDB. My objects look like this:
...
"Codes": {
"CodeSystem1": [
"A1", "A2"
],
"CodeSystem2": [
"x1","x2"
]
},
...
Codes is a Dictionary<string, List<string>>
My query looks like this:
SELECT * FROM c WHERE ARRAY_CONTAINS(c.Codes["CodeSystem2"], 'x1')
However, I'd like to do a LOWER() on both the dictionary key and value, but it doesn't work like that.
SELECT * FROM c WHERE ARRAY_CONTAINS(c.Codes[LOWER("CodeSystem2")], LOWER('x1'))
Any ideas? I can't change the structure of the objects, and I'd rather not do the filtering in my .NET code.

LOWER/UPPER will not work on array elements the way you want. If you have something like this:
"CodeSystem4": [
"Z1"
],
"CodeSystem5": "Z3"
We can use LOWER on the CodeSystem5 element as below:
select * from c where lower(c.Codes["CodeSystem5"]) = Lower('Z3')
But we cannot do the same for 'CodeSystem4' with ARRAY_CONTAINS; it will not return any result.
Also as per the below article, "The LOWER system function does not utilize the index. If you plan to do frequent case insensitive comparisons, the LOWER system function may consume a significant amount of RU's. If this is the case, instead of using the LOWER system function to normalize data each time for comparisons, you can normalize the casing upon insertion."
https://learn.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-lower
One way is to add another searchable array in lower case so that the query works. Otherwise, we can filter through the SDK.
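The "add another searchable array in lower case" idea can be sketched as follows. This is a minimal illustration in Python, not Cosmos DB SDK code; the property name CodesLower and the helper add_lowercase_codes are hypothetical names chosen for the sketch, and the normalized copy is assumed to be written together with the document at insertion time.

```python
from typing import Dict, List


def add_lowercase_codes(doc: dict) -> dict:
    """Return a copy of doc with a lower-cased mirror of its Codes dictionary.

    "CodesLower" is a hypothetical property name used only in this sketch.
    """
    codes: Dict[str, List[str]] = doc.get("Codes", {})
    normalized = {
        key.lower(): [value.lower() for value in values]
        for key, values in codes.items()
    }
    # The original Codes dictionary is left untouched.
    return {**doc, "CodesLower": normalized}


doc = {"id": "1", "Codes": {"CodeSystem1": ["A1", "A2"], "CodeSystem2": ["x1", "x2"]}}
stored = add_lowercase_codes(doc)
```

With such a copy stored, the caller can lower-case the key and value before building the query and filter on ARRAY_CONTAINS(c.CodesLower["codesystem2"], 'x1'), so LOWER is not needed inside the query.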

Related

How to query and filter efficiently on Fauna DB?

For example, let’s assume we have a collection with hundreds of thousands of documents of clients with 3 fields, name, monthly_salary, and age.
How can I search for documents where monthly_salary is higher than 2000 and age is higher than 30?
In SQL this would be straightforward, but with Fauna I'm struggling to understand the best approach because Index terms only work with an exact match. I see in the docs that I can use the Filter function, but I would need to get all documents in advance, so it looks a bit counterintuitive and not performant.
Below is an example of how I can achieve it, but not sure if it’s the best approach, especially if it contains a lot of records.
Map(
Filter(
Paginate(Documents(Collection('clients'))),
Lambda(
'client',
And(
GT(Select(['data', 'monthly_salary'], Get(Var('client'))), 2000),
GT(Select(['data', 'age'], Get(Var('client'))), 30),
)
)
),
Lambda(
'filteredClients',
Get(Var('filteredClients'))
)
)
Is this correct, or am I missing some fundamental concepts about Fauna and FQL?
Can anyone help?
Thanks in advance
Efficient searching is performed using Indexes. You can check out the docs for search with Indexes, and there is a "cookbook" for some different search examples.
There are two ways to use Indexes to search, and which one you use depends on if you are searching for equality (exact match) or inequality (greater than or less than, for example).
Searching for equality
If you need an exact match, then use Index terms. This is most explicit in the docs, and it is also not what your original question is about, so I am not going to dwell much here. But here is a simple example
given user documents with this shape
{
ref: Ref(Collection("User"), "1234"),
ts: 16934907826026,
data: {
name: "John Doe",
email: "jdoe#example.com",
age: 50,
monthly_salary: 3000
}
}
and an index defined like the following
CreateIndex({
name: "users_by_email",
source: Collection("User"),
terms: [ { field: ["data", "email"] } ],
unique: true // user emails are unique
})
You can search for exact matches with... the Match function!
Get(
Match(Index("users_by_email"), "jdoe#example.com")
)
Searching for inequality
Searching for inequalities is more interesting and also complicated. It requires using Index values and the Range function.
Keeping with the document above, we can create a new index
CreateIndex({
name: "users__sorted_by_monthly_salary",
source: Collection("User"),
values: [
{ field: ["data", "monthly_salary"] },
{ field: ["ref"] }
]
})
Note that I've not defined any terms in the above Index. The important thing for inequalities is again the values. We've also included the ref as a value, since we will need that later.
Now we can use Range to get all users with salary in a given range. This query will get all users with salary starting at 2000 and all above.
Paginate(
Range(
Match(Index("users__sorted_by_monthly_salary")),
[2000],
[]
)
)
Combining Indexes
For "OR" operations, use the Union function.
For "AND" operations, use the Intersection function.
Functions like Match and Range return Sets. When you "combine" Sets with functions like Intersection, it is really important to make sure that the shape of the data is the same.
Using Sets with the same shape is not difficult for Indexes with no values, because they all default to the same single ref value.
Paginate(
Intersection(
Match(Index("user_by_age"), 50), // type is Set<Ref>
Match(Index("user_by_monthly_salary"), 3000) // type is Set<Ref>
)
)
When the Sets have different shapes, they need to be modified or else the Intersection will never return results.
Paginate(
Intersection(
Range(
Match(Index("users__sorted_by_age")),
[30],
[]
), // type is Set<[age, Ref]>
Range(
Match(Index("users__sorted_by_monthly_salary")),
[2000],
[]
) // type is Set<[salary, Ref]>
)
)
{
data: [] // Intersection is empty
}
So how do we change the shape of the Set so they can be intersected? We can use the Join function, along with the Singleton function.
Join will run an operation over all entries in the Set. We will use that to return only a ref.
Join(
Range(Match(Index("users__sorted_by_age")), [30], []),
Lambda(["age", "ref"], Singleton(Var("ref")))
)
Altogether then:
Paginate(
Intersection(
Join(
Range(Match(Index("users__sorted_by_age")), [30], []),
Lambda(["age", "ref"], Singleton(Var("ref")))
),
Join(
Range(Match(Index("users__sorted_by_monthly_salary")), [2000], []),
Lambda(["salary", "ref"], Singleton(Var("ref")))
)
)
)
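The shape problem can be seen without Fauna at all. The sketch below is a plain Python model, not FQL: index entries are modelled as tuples, refs are made-up strings, and the sample data is invented for illustration. It shows why intersecting (age, ref) entries with (salary, ref) entries yields nothing, and why projecting both Sets down to the ref, which is what Join with Singleton does in the query above, gives the expected result.

```python
# Invented sample index entries: (value, ref) pairs.
age_index = [(25, "ref4"), (35, "ref1"), (41, "ref3"), (50, "ref2")]
salary_index = [(1800, "ref3"), (2500, "ref1"), (3000, "ref2"), (4000, "ref4")]

# Model of Range(..., [30], []) and Range(..., [2000], []): inclusive lower bound.
age_range = {entry for entry in age_index if entry[0] >= 30}
salary_range = {entry for entry in salary_index if entry[0] >= 2000}

# Different shapes: an (age, ref) tuple never equals a (salary, ref) tuple.
raw_intersection = age_range & salary_range

# Keep only the ref from each entry, then intersect.
refs_by_age = {ref for _age, ref in age_range}
refs_by_salary = {ref for _salary, ref in salary_range}
matching_refs = refs_by_age & refs_by_salary

print(raw_intersection)       # empty set
print(sorted(matching_refs))  # ['ref1', 'ref2']
```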
Tips for combining indexes
You can use additional logic to combine different indexes when different terms are provided, or search for missing fields using bindings. Lots of cool stuff you can do.
Do check out the cookbook and the Fauna forums as well for ideas.
BUT WHY!!!
It's a good question!
Consider this: Since Fauna is served as a serverless API, you get charged for each individual read and write on your documents and indexes as well as the compute time to execute your query. SQL can be much easier, but it is a much higher level language. Behind SQL sits a query planner making assumptions about how to get you your data. If it cannot do it efficiently it may default to scanning your entire table of data or otherwise performing an operation much more expensive than you might have expected.
With Fauna, YOU are the query planner. That means it is much more complicated to get started, but it also means you have fine control over the performance of your database and thus your cost.
We are working on improving the experience of defining schemas and the indexes you need, but at the moment you do have to define these queries at a low level.

Extract specific key from array of jsons in Amazon Redshift

Background
I am working in an Amazon Redshift database using SQL. I have a table, and one of its columns, called attributes, contains data that looks like this:
[{"name": "Color", "value": "Beige"},{"name":"Size", "value":"Small"}]
or
[{"name": "Size", "value": "Small"},{"name": "Color", "value": "Blue"},{"name": "Material", "value": "Cotton"}]
From what I understand, the above is a series of path elements in a JSON string.
Issue
I am looking to extract the color value in each JSON string. I am unsure how to proceed. I know that if color was in the same location I could use the index to indicate where to extract from. But that is not the case here.
What I tried
select json_extract_array_element_text(attributes, 1) as color_value, json_extract_path_text(color_value, 'value') as color from my_table
This query works for some rows but not all, as the location of the color value is different.
I would appreciate any help here as I am very new to SQL and have only done basic querying. I have been using the following page as a reference
First off your data is in an array format (between [ ]), not object format (between { }). The page you mention is a function for extracting data from JSON objects, not arrays. Also array format presents challenges as you need to know the numeric position of the element you wish to extract.
Based on your example data it seems like objects is the way to go. If so you will want to reformat your data to be more like:
{"Color": "Beige", "Size": "Small"}
and
{"Size": "Small", "Color": "Blue", "Material": "Cotton"}
This conversion only works if the "name" values are unique in your data.
With this the function you selected - JSON_EXTRACT_PATH_TEXT() - will pull the values you want from the data.
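The reformatting described above can be sketched outside the database. This is a minimal Python illustration, assuming the data can be rewritten before loading; pairs_to_object is a hypothetical helper name, and the duplicate check reflects the caveat that the conversion only works when the "name" values are unique.

```python
import json


def pairs_to_object(attributes: str) -> str:
    """Convert '[{"name": ..., "value": ...}, ...]' into '{"<name>": "<value>", ...}'."""
    result = {}
    for element in json.loads(attributes):
        name = element["name"]
        if name in result:
            # The object form cannot hold two entries with the same name.
            raise ValueError(f"duplicate name: {name}")
        result[name] = element["value"]
    return json.dumps(result)


converted = pairs_to_object(
    '[{"name": "Color", "value": "Beige"}, {"name": "Size", "value": "Small"}]'
)
print(converted)  # {"Color": "Beige", "Size": "Small"}
```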
Now, changing the data may not be an option, and dealing with these arrays will make things harder and less performant. To do this you will need to expand the arrays by cross joining with a set of numbers that contains every index up to the maximum length of your arrays. For the samples you gave, you will need to cross join with the values 0, 1, 2 so that your 3-element array can be fully extracted. You can then filter on only those rows that have a "name" of "Color".
The function you will need for extracting elements from an array is JSON_EXTRACT_ARRAY_ELEMENT_TEXT() and since you have objects stored in the array you will need to run JSON_EXTRACT_PATH_TEXT() on the results.
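The logic of that expand-and-filter approach can be modelled in a few lines. This is a Python sketch of the idea only, not Redshift SQL: the loop over array positions plays the role of the cross join with 0, 1, 2, and the name check plays the role of the filter. extract_value is a hypothetical helper name.

```python
import json
from typing import Optional


def extract_value(attributes: str, wanted_name: str) -> Optional[str]:
    """Visit every array position and return the value whose "name" matches."""
    elements = json.loads(attributes)
    for position in range(len(elements)):  # stands in for the numbers table
        element = elements[position]       # like JSON_EXTRACT_ARRAY_ELEMENT_TEXT
        if element.get("name") == wanted_name:
            return element.get("value")    # like JSON_EXTRACT_PATH_TEXT(..., 'value')
    return None


row1 = '[{"name": "Color", "value": "Beige"}, {"name": "Size", "value": "Small"}]'
row2 = ('[{"name": "Size", "value": "Small"}, {"name": "Color", "value": "Blue"}, '
        '{"name": "Material", "value": "Cotton"}]')

print(extract_value(row1, "Color"))  # Beige
print(extract_value(row2, "Color"))  # Blue
```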

How to get the equivalent of combining [contains] and [in] operators in the same query?

So I have a field that's a multi-choice on the Directus back end so when the JSON comes out of the API it's a one-dimensional array, like so:
"field_name": [
"",
"option 6",
"option 11",
""
]
(btw I have no idea why all these fields produce those blank values, but that's a matter for another day)
I am trying to make an interface on the front end where you can select one or more of these values and the result will come back if ANY of them are found for that record. Think of it like a tag list, if the item has just one of the values it should be returned.
I can use the [contains] operator to find if it has one of the values I'm looking for, but I can only pass a single value, whereas I need all that have either optionX OR optionY OR optionZ. I would basically need a combination of [contains] and [in] to achieve what I'm trying to do. Is there a way to achieve this?
I've also tried setting the [logical] operator to OR, but then that screws up the other filters that need to be included as AND (or I'm doing something wrong). Not to mention the query gets completely unruly.
Help?

Cloudant - Lucene range search using numbers stored as text

I have a number of documents in Cloudant, that have ID field of type string. ID can be a simple string, like "aaa", "bbb" or number stored as text, e.g. "111", "222", etc. I need to be able to full text search using the above field, but I encountered some problems.
Assuming that I have two documents, having ID="aaa" and ID="111", then searching with query:
ID:aaa
ID:"aaa"
ID:[aaa TO zzz]
ID:["aaa" TO "zzz"]
returns first document, as expected
ID:111
returns nothing, but
ID:"111"
returns second document, so at least there is a way to retrieve it.
Unfortunately, when searching for range:
ID:[111 TO 999]
ID:["111" TO "999"]
I get no results, and I have no idea what to do to get around this problem. Is there any special syntax for such case?
UPDATE:
Index function:
function(doc){
if(!doc.ID) return;
index("ID", doc.ID, { index:'not_analyzed_no_norms', store:true });
}
Changing index to analyzed doesn't help. Analyzer itself is keyword, but changing to standard doesn't help either.
UPDATE 2
Just to add some more context, because I think I missed one key point. The field I'm indexing will be searched using ranges, and both min and max values can be provided by user. So it is possible that one of them will be number stored as a string, while other will be a standard non-numeric text. For example search all document where ID >= "11" and ID <= "foo".
Assuming that the database contains documents with ID "1", "5", "alpha", "beta", "gamma", this query should return "5", "alpha", "beta". Please note that "5" should actually be returned, because string "5" is greater than string "11".
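The expected result follows from plain string (lexicographic) ordering, which can be checked quickly. The snippet below is only a Python illustration of that ordering, not a Cloudant query.

```python
ids = ["1", "5", "alpha", "beta", "gamma"]
lower, upper = "11", "foo"

# Strings are compared character by character, so "5" > "11" and "1" < "11".
in_range = [doc_id for doc_id in ids if lower <= doc_id <= upper]

print(in_range)  # ['5', 'alpha', 'beta']
```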
Our team just came up with a workaround. We managed to get proper results by appending an arbitrary character, e.g. 'a', to the upper range value, and by introducing an additional search term to exclude documents having an ID between the upper range value and the upper range value + 'a'.
When searching for a range
ID:[X TO Y]
actual query would be
(ID:[X TO Ya] AND -ID:{Y TO Ya])
For example, to find documents having an ID between 23 and 758, we execute
(ID:[23 TO 758a] AND -ID:{758 TO 758a]).
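Building the rewritten query string can be sketched as below. This is a small Python helper written for illustration; build_range_query and its parameters are hypothetical names, and the function does no escaping of special Lucene characters in the inputs.

```python
def build_range_query(field: str, lower: str, upper: str, pad: str = "a") -> str:
    """Rewrite field:[lower TO upper] using the padded-upper-bound workaround."""
    padded_upper = upper + pad
    # Inclusive range up to the padded upper bound ...
    include = f"{field}:[{lower} TO {padded_upper}]"
    # ... minus everything strictly above the real upper bound.
    exclude = "-" + field + ":{" + upper + " TO " + padded_upper + "]"
    return f"({include} AND {exclude})"


print(build_range_query("ID", "23", "758"))
# (ID:[23 TO 758a] AND -ID:{758 TO 758a])
```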
First of all, I would suggest using the keyword analyzer, so you can control tokenization during both indexing and search.
"analyzer": "keyword",
"index": "function(doc){\n if(!doc.ID) return;\n index(\"ID\", doc.ID, {store:true });\n}"
To retrieve your document with ID "111", use the following range query:
curl -X GET "http://.../facetrangetest/_design/ddoc/_search/f?q=ID:\[111%20TO%20A\]"
If you use the query q=ID:\[111%20TO%20999\], Cloudant Search sees numbers on both sides of the range and interprets it as a NumericRangeQuery; since your ID of "111" is a string, it will not be part of the results returned. Including a string in the query, as in [111%20TO%20A], makes Cloudant interpret it as a range query on strings.
You can get both docs returned like this:
q=ID:["111" TO "CCC"]
Here's a working live example:
https://rajsingh.cloudant.com/facetrangetest/_design/ddoc/_search/f?q=ID:[%22111%22%20TO%20%22CCC%22]
I found something quirky. It seems that range queries on strings only work if at least one of the range values is a string. Querying on ID:["111" TO "555"] doesn't return anything either, so maybe this is resolving to a numeric query somehow? Could be a bug.
This could also be achieved using regular expressions in queries. Something like this:
curl -X POST "https://.../facetrangetest/_design/ddoc/_search/f" -d '{"q":"ID:/<23-758>/"}' | jq .
This regular expression retrieves all documents whose ID field is between 23 and 758. Slashes (/ /) enclose the regular expression; the interval is enclosed inside <>.

MongoDB or CouchDB or something else?

I know this is another question on this topic but I am a complete beginner in the NoSQL world so I would love some advice. People at SO told me MySQL might be a bad idea for this dataset so I'm asking this. I have lots of data in the following format:
TYPE 1
ID1: String String String ...
ID2: String String String ...
ID3: String String String ...
ID4: String String String ...
which I am hoping to convert into something like this:
TYPE 2
ID1: String
ID1: String
ID1: String
ID1: String
ID2: String
ID2: String
This is the most inefficient way but I need to be able to search by both the key and the value. For instance, my queries would look like this:
I might need to know what all strings a given ID contains and then intersect the list with another list obtained for a different ID.
I might need to know what all IDs contain a given string
I would love to achieve this without transforming Type 1 into Type 2 because of the sheer space requirements, but I would like to know if either MongoDB or CouchDB or something else (someone suggested NoSQL, so I started Googling and found these two are very popular) would help me out in this situation. I have a 14-node cluster I can leverage, but I would love some advice on which one is the right database for this use case. Any suggestions?
A few extra things:
The input will mostly be static. I will create new data but will not modify any of the existing data.
The ID is 40 bytes in length whereas the strings are about 20 bytes
MongoDB will let you store this data efficiently in Type 1. Depending on your use, it will look like one of these (data is in JSON):
Array of Strings
{ "_id" : 1, "strings" : ["a", "b", "c", "d", "e"] }
Set of KV Strings
{ "_id" : 1, "s1" : "a", "s2" : "b", "s3" : "c", "s4" : "d", "s5" : "e" }
Based on your queries, I would probably use the Array of Strings method. Here's why:
I might need to know what all strings
a given ID contains and then intersect
the list with another list obtained
for a different ID.
This is easy, you get one Key Value look-up for the ID. In code, it would look something like this:
db.my_collection.find({ "_id" : 1});
I might need to know what all IDs contain a given string
Similarly easy:
db.my_collection.find({ "strings" : "my_string" })
Yeah it's that easy. I know that "strings" is technically an array, but MongoDB will recognize the item as an array and will loop through to find the value. Docs for this are here.
As a bonus, you can index the "strings" field and you will get an index on the array. So the find above will actually perform relatively fast (with the obvious trade-off that the index will be very large).
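The two lookups can be modelled in memory to see why an index on the array field helps. This is a plain Python sketch with invented sample data, not MongoDB or driver code: the inverted dictionary only approximates the role that an index over the array elements plays.

```python
from collections import defaultdict

# Invented sample data in the "Array of Strings" layout: _id -> strings.
documents = {
    1: ["a", "b", "c"],
    2: ["b", "c", "d"],
    3: ["c", "e"],
}

# Query 1: strings of one ID, intersected with the strings of another ID.
common = set(documents[1]) & set(documents[2])

# Query 2: map each string to the IDs that contain it, so a lookup by
# string does not have to scan every document.
ids_by_string = defaultdict(set)
for doc_id, strings in documents.items():
    for s in strings:
        ids_by_string[s].add(doc_id)

print(sorted(common))              # ['b', 'c']
print(sorted(ids_by_string["c"]))  # [1, 2, 3]
```

The sketch also makes the trade-off visible: the inverted mapping holds one entry per (string, ID) pair, which is why such an index grows large.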
In terms of scaling, a 14-node cluster may almost be overkill. However, Mongo does support auto-sharding and replica sets. They even work together; here's a blog post from a 10gen member to get you started (10gen makes Mongo).