Cloudant - Lucene range search using numbers stored as text - lucene

I have a number of documents in Cloudant, that have ID field of type string. ID can be a simple string, like "aaa", "bbb" or number stored as text, e.g. "111", "222", etc. I need to be able to full text search using the above field, but I encountered some problems.
Assuming that I have two documents, having ID="aaa" and ID="111", then searching with query:
ID:aaa
ID:"aaa"
ID:[aaa TO zzz]
ID:["aaa" TO "zzz"]
returns first document, as expected
ID:111
returns nothing, but
ID:"111"
returns second document, so at least there is a way to retrieve it.
Unfortunately, when searching for range:
ID:[111 TO 999]
ID:["111" TO "999"]
I get no results, and I have no idea what to do to get around this problem. Is there any special syntax for such case?
UPDATE:
Index function:
function(doc){
if(!doc.ID) return;
index("ID", doc.ID, { index:'not_analyzed_no_norms', store:true });
}
Changing index to analyzed doesn't help. Analyzer itself is keyword, but changing to standard doesn't help either.
UPDATE 2
Just to add some more context, because I think I missed one key point. The field I'm indexing will be searched using ranges, and both min and max values can be provided by user. So it is possible that one of them will be number stored as a string, while other will be a standard non-numeric text. For example search all document where ID >= "11" and ID <= "foo".
Assumig that database contains documents with ID "1", "5", "alpha", "beta", "gamma", this query should return "5", "alpha", "beta". Please note that "5" should actually be returned, because string "5" is greater than string "11".

Our team just came to a workaround solution. We managed to get proper results by adding some arbitrary character, e.g. 'a' to an upper range value, and by introducing additional search term, to exclude documents having ID between upper range value and upper range value + 'a'.
When searching for a range
ID:[X TO Y]
actual query would be
(ID:[X TO Ya] AND -ID:{Y TO Ya])
For example, to find a documents having ID between 23 and 758, we execute
(ID:[23 TO 758a] AND -ID:{758 TO 758a]).

First of all, I would suggest to use keyword analyzer, so you can control the right tokenization during both indexing and search.
"analyzer": "keyword",
"index": "function(doc){\n if(!doc.ID) return;\n index(\"ID\", doc.ID, {store:true });\n}
To retrieve you document with _id "111", use the following range query:
curl -X GET "http://.../facetrangetest/_design/ddoc/_search/f?q=ID:\[111%20TO%A\]"
If you use a query q=ID:\[111%20TO%20999\], Cloudant search seeing numbers on both size of the range, will interpret it as NumericRangeQuery; and since your ID of "111" is a String, it will not be part of the results returned. Including a string into query [111%20TO%20A], will make Cloudant interpret it as a range query on strings.

You can get both docs returned like this:
q=ID:["111" TO "CCC"]
Here's a working live example:
https://rajsingh.cloudant.com/facetrangetest/_design/ddoc/_search/f?q=ID:[%22111%22%20TO%20%22CCC%22]
I found something quirky. It seems that range queries on strings only work if at least one of the range values is a string. Querying on ID:["111" TO "555"] doesn't return anything either, so maybe this is resolving to a numeric query somehow? Could be a bug.

This could also be achieved using regular expressions in queries. Something line this:
curl -X POST "https://.../facetrangetest/_design/ddoc/_search/f" -d '{"q":"ID:/<23-758>/"}' | jq .
This regular expressions means to retrieve all documents with ID field from 23 to 758. Slashes: / / are used to enclose a regular expression; the interval is enclosed inside <>.

Related

How to get the equivalent of combinig [contains] and [in] operators in the same query?

So I have a field that's a multi-choice on the Directus back end so when the JSON comes out of the API it's a one-dimensional array, like so:
"field_name": [
"",
"option 6",
"option 11",
""
]
(btw I have no idea why all these fields produce those blank values, but that's a matter for another day)
I am trying to make an interface on the front end where you can select one or more of these values and the result will come back if ANY of them are found for that record. Think of it like a tag list, if the item has just one of the values it should be returned.
I can use the [contains] operator to find if it has one of the values I'm looking for, but I can only pass a single value, whereas I need all that have either optionX OR optionY OR optionZ. I would basically need a combination of [contains] and [in] to achieve what I'm trying to do. Is there a way to achieve this?
I've also tried setting the [logical] operator to OR, but then that screws up the other filters that need to be included as AND (or I'm doing something wrong). Not to mention the query gets completely unruly.
Help?

ERROR: function regexp_matches(jsonb, unknown) does not exist in Tableau but works elsewhere

I have a column called "Bakery Activity" whose values are all JSONs that look like this:
{"flavors": [
{"d4js95-1cc5-4asn-asb48-1a781aa83": "chocolate"},
{"dc45n-jnsa9i-83ysg-81d4d7fae": "peanutButter"}],
"degreesToCook": 375,
"ingredients": {
"d4js95-1cc5-4asn-asb48-1a781aa83": [
"1nemw49-b9s88e-4750-bty0-bei8smr1eb",
"98h9nd8-3mo3-baef-2fe682n48d29"]
},
"numOfPiesBaked": 1,
"numberOfSlicesCreated": 6
}
I'm trying to extract the number of pies baked with a regex function in Tableau. Specifically, this one:
REGEXP_EXTRACT([Bakery Activity], '"numOfPiesBaked":"?([^\n,}]*)')
However, when I try to throw this calculated field into my text table, I get an error saying:
ERROR: function regexp_matches(jsonb, unknown) does not exist;
Error while executing the query
Worth noting is that my data source is PostgreSQL, which Tableau regex functions support; not all of my entries have numOfPiesBaked in them; when I run this in a simulator I get the correct extraction (actually, I get "numOfPiesBaked": 1" but removing the field name is a problem for another time).
What might be causing this error?
In short: Wrong data type, wrong function, wrong approach.
REGEXP_EXTRACT is obviously an abstraction layer of your client (Tableau), which is translated to regexp_matches() for Postgres. But that function expects text input. Since there is no assignment cast for jsonb -> text (for good reasons) you have to add an explicit cast to make it work, like:
SELECT regexp_matches("Bakery Activity"::text, '"numOfPiesBaked":"?([^\n,}]*)')
(The second argument can be an untyped string literal, Postgres function type resolution can defer the suitable data type text.)
Modern versions of Postgres also have regexp_match() returning a single row (unlike regexp_matches), which would seem like the better translation.
But regular expressions are the wrong approach to begin with.
Use the simple json/jsonb operator ->>:
SELECT "Bakery Activity"->>'numOfPiesBaked';
Returns '1' in your example.
If you know the value to be a valid integer, you can cast it right away:
SELECT ("Bakery Activity"->>'numOfPiesBaked')::int;
I found an easier way to handle JSONB data in Tableau.
Firstly, make a calculated field from the JSONB field and convert the field to a string by using str([FIELD_name]) command.
Then, on the calculated field, make another calculated field and use function:
REGEXP_EXTRACT([String_Field_Name], '"Key_to_be_extracted":"?([^\n,}]*)')
The required key-value pair will form the second caluculated field.

How to search for an element in literal array or integer array in cloud search

I have an cloud search domain where I have index on a column named "color_f_la". Its a faceted index and is a literal array. Sample value for it is :
[
"Blue",
"Green"
]
I've been trying to find out the documentation to construct a query which would search for a particular Color, but to no avail. Is it even possible?
If you're running the search with the "Structured" Query Parser, you should be able to pass your value directly as if it were a literal. There is no difference for querying literal-array fields.
For example:
(and color_f_la:'Blue')
Or if you want to return things that have blue or green:
(or color_f_la:'Blue' color_f_la:'Green')
Or if you want to return things that have both blue and green only:
(and color_f_la:'Blue' color_f_la:'Green')

Lucene NOT_ANALYZED not working with uppercase characters

I have build an index using a StandardAnalyzer, in this index are a few fields. For example purposes, imagine it has Id and Type. Both are NON_ANALYZED, meaning you can only search for them as-is.
There are a few entries in my index:
{Id: "1", Type: "Location"},
{Id: "2", Type: "Group"},
{Id: "3", Type: "Location"}
When I search for +Id:1 or any other number, I get the appropriate result (again using StandardAnalyzer).
However, when I search for +Type:Location or the +Type:Group, I'm not getting any results. The strange thing is that when I enable leading wildcards, that +Type:*ocation does return results! +Type:*Location or other combinations do not.
This got me leading to believe the indexer/query doesn't like uppercase characters! After lowercasing the Type to location and group before indexing them, I could search for them as such.
If I turn the Type-field to ANALYZED, it works with pretty much any search (uppercase/lowercase, etc), but I want to query for the Type-field as-is.
I'm completely baffled why it's doing this. Could anyone explain to me why my indexer doesn't let me search for NON_ANALYZED fields that have a capital in their value?
Are you using StandardAnalyzer when parsing your your query string (+Type:Location)? The StandardAnalyzer will lower-case all terms, so you're really searching with +Type:location.
Always use the same analyzer when searching and indexing. Look into using the PerFieldAnalyzer and set the Type field to use the KeywordAnalyzer.

Searching in the middle of a not_analyzed field

I have an Elasticsearch index where one of the fields is marked with not_analyzed. This field contains a space-separated list of values, like this:
Value1 Value2 Value3
Now I want to perform a search to find documents where this field contains "Value2". I've tested to search using text phrase prefix but a search for "Value2" matches nothing. A search for "Value1" or "Value1 Value2" on the other hand matches. I don't want any fuzzyness in the searching but only exact matches (which is the reason the field was set to not_analyzed).
Is there any way to do a search like this?
From my limited understanding of Elasticsearch, I'm guessing I need to set the field to analyzed using the whitespace analyzer. Is that right?
Correct, using either the Standard or Whitespace Analyzer among others would break the word down into chunks, split by whitespace, commas etc. A simple_query_string query would then match "Value2" no matter of its position in the documents field.
Standard Analyzer will also Lowercase your fields, meaning that only search terms that are lower-case will match.
You could do this using wildcards, it will be an expensive query though.
You might will have to set "lowercase_expanded_terms" to false in order to have the match.
When you're searching for "Value2" and you use wildcard the search would be interpreted as "value2" after the lucene parsing.
query_string:Value2* -> ES interpretation value2*
note that it lowercase your search, this is usefull for analyze fields, but in not_analyzed fields you wont have a match (if the original value is in upper case)
the lowercase_expanded_terms prevents this from happening
now if the field is not_analyzed as you said the following query should match your documents
{
"size": 10,
"query": {
"query_string": {
"query": "title:*Value2*"
}
}
}
sorry for the lousy answer.