Searching in the middle of a not_analyzed field - indexing

I have an Elasticsearch index where one of the fields is marked with not_analyzed. This field contains a space-separated list of values, like this:
Value1 Value2 Value3
Now I want to perform a search to find documents where this field contains "Value2". I've tested to search using text phrase prefix but a search for "Value2" matches nothing. A search for "Value1" or "Value1 Value2" on the other hand matches. I don't want any fuzzyness in the searching but only exact matches (which is the reason the field was set to not_analyzed).
Is there any way to do a search like this?
From my limited understanding of Elasticsearch, I'm guessing I need to set the field to analyzed using the whitespace analyzer. Is that right?

Correct, using either the Standard or Whitespace Analyzer among others would break the word down into chunks, split by whitespace, commas etc. A simple_query_string query would then match "Value2" no matter of its position in the documents field.
Standard Analyzer will also Lowercase your fields, meaning that only search terms that are lower-case will match.

You could do this using wildcards, it will be an expensive query though.
You might will have to set "lowercase_expanded_terms" to false in order to have the match.
When you're searching for "Value2" and you use wildcard the search would be interpreted as "value2" after the lucene parsing.
query_string:Value2* -> ES interpretation value2*
note that it lowercase your search, this is usefull for analyze fields, but in not_analyzed fields you wont have a match (if the original value is in upper case)
the lowercase_expanded_terms prevents this from happening
now if the field is not_analyzed as you said the following query should match your documents
{
"size": 10,
"query": {
"query_string": {
"query": "title:*Value2*"
}
}
}
sorry for the lousy answer.

Related

Lucene NOT_ANALYZED not working with uppercase characters

I have build an index using a StandardAnalyzer, in this index are a few fields. For example purposes, imagine it has Id and Type. Both are NON_ANALYZED, meaning you can only search for them as-is.
There are a few entries in my index:
{Id: "1", Type: "Location"},
{Id: "2", Type: "Group"},
{Id: "3", Type: "Location"}
When I search for +Id:1 or any other number, I get the appropriate result (again using StandardAnalyzer).
However, when I search for +Type:Location or the +Type:Group, I'm not getting any results. The strange thing is that when I enable leading wildcards, that +Type:*ocation does return results! +Type:*Location or other combinations do not.
This got me leading to believe the indexer/query doesn't like uppercase characters! After lowercasing the Type to location and group before indexing them, I could search for them as such.
If I turn the Type-field to ANALYZED, it works with pretty much any search (uppercase/lowercase, etc), but I want to query for the Type-field as-is.
I'm completely baffled why it's doing this. Could anyone explain to me why my indexer doesn't let me search for NON_ANALYZED fields that have a capital in their value?
Are you using StandardAnalyzer when parsing your your query string (+Type:Location)? The StandardAnalyzer will lower-case all terms, so you're really searching with +Type:location.
Always use the same analyzer when searching and indexing. Look into using the PerFieldAnalyzer and set the Type field to use the KeywordAnalyzer.

Cloudant - Lucene range search using numbers stored as text

I have a number of documents in Cloudant, that have ID field of type string. ID can be a simple string, like "aaa", "bbb" or number stored as text, e.g. "111", "222", etc. I need to be able to full text search using the above field, but I encountered some problems.
Assuming that I have two documents, having ID="aaa" and ID="111", then searching with query:
ID:aaa
ID:"aaa"
ID:[aaa TO zzz]
ID:["aaa" TO "zzz"]
returns first document, as expected
ID:111
returns nothing, but
ID:"111"
returns second document, so at least there is a way to retrieve it.
Unfortunately, when searching for range:
ID:[111 TO 999]
ID:["111" TO "999"]
I get no results, and I have no idea what to do to get around this problem. Is there any special syntax for such case?
UPDATE:
Index function:
function(doc){
if(!doc.ID) return;
index("ID", doc.ID, { index:'not_analyzed_no_norms', store:true });
}
Changing index to analyzed doesn't help. Analyzer itself is keyword, but changing to standard doesn't help either.
UPDATE 2
Just to add some more context, because I think I missed one key point. The field I'm indexing will be searched using ranges, and both min and max values can be provided by user. So it is possible that one of them will be number stored as a string, while other will be a standard non-numeric text. For example search all document where ID >= "11" and ID <= "foo".
Assumig that database contains documents with ID "1", "5", "alpha", "beta", "gamma", this query should return "5", "alpha", "beta". Please note that "5" should actually be returned, because string "5" is greater than string "11".
Our team just came to a workaround solution. We managed to get proper results by adding some arbitrary character, e.g. 'a' to an upper range value, and by introducing additional search term, to exclude documents having ID between upper range value and upper range value + 'a'.
When searching for a range
ID:[X TO Y]
actual query would be
(ID:[X TO Ya] AND -ID:{Y TO Ya])
For example, to find a documents having ID between 23 and 758, we execute
(ID:[23 TO 758a] AND -ID:{758 TO 758a]).
First of all, I would suggest to use keyword analyzer, so you can control the right tokenization during both indexing and search.
"analyzer": "keyword",
"index": "function(doc){\n if(!doc.ID) return;\n index(\"ID\", doc.ID, {store:true });\n}
To retrieve you document with _id "111", use the following range query:
curl -X GET "http://.../facetrangetest/_design/ddoc/_search/f?q=ID:\[111%20TO%A\]"
If you use a query q=ID:\[111%20TO%20999\], Cloudant search seeing numbers on both size of the range, will interpret it as NumericRangeQuery; and since your ID of "111" is a String, it will not be part of the results returned. Including a string into query [111%20TO%20A], will make Cloudant interpret it as a range query on strings.
You can get both docs returned like this:
q=ID:["111" TO "CCC"]
Here's a working live example:
https://rajsingh.cloudant.com/facetrangetest/_design/ddoc/_search/f?q=ID:[%22111%22%20TO%20%22CCC%22]
I found something quirky. It seems that range queries on strings only work if at least one of the range values is a string. Querying on ID:["111" TO "555"] doesn't return anything either, so maybe this is resolving to a numeric query somehow? Could be a bug.
This could also be achieved using regular expressions in queries. Something line this:
curl -X POST "https://.../facetrangetest/_design/ddoc/_search/f" -d '{"q":"ID:/<23-758>/"}' | jq .
This regular expressions means to retrieve all documents with ID field from 23 to 758. Slashes: / / are used to enclose a regular expression; the interval is enclosed inside <>.

Lucene phase query case insensitive

I am writing a query to do an exact match on a 'city' field. The field/property is defined as:
#org.hibernate.search.annotations.Field(index = Index.YES, analyze = Analyze.NO, store = Store.NO)
private String city;
If I have the value of "New York", I want to find a match if user enters "new york", or some variation of case. I am using the StandardAnalyzer for the entity, so I know that will lowercase all the tokens. I don't tokenize since I want to match the phrase (Analyze.NO).
I tried to lowercase my search value, but no luck.
Query query = qb.phrase().onField(.....).sentence(location.toLowerCase()).createQuery();
If I don't lowercase the search term and the value is 'New York', results are returned. Searching for 'new york' does not return any result.
If I tokenize (Analyze.YES), then other cities like 'New Jersey' are returned. I know I can use a wildcard query (searchTerm*), but I was hoping to be able to do a case insensitive search on a phrase. Just not sure if that's possible unless you use the wildcard.
thanks
It sounds like you would want to use an analyzer which emits the entire text as a single token while lower-casing the input. In this case, you would want to use analyze=Analyze.YES, while specifying the appropriate analyzer (the answer here has code that looks like what you need) using analyzer=#Analyzer(impl=your.fully.qualified.Analyzer.class).

multiple matches for NSRegularExpression

I have the following regex pattern to match:
NSString *pattern=[NSString stringWithFormat:#"%#(.*)%#", key, key2];
So let say if key=\\\\[\\\\\[ and key2=\\\\]\\\\] then I am getting the string containing the
keys along with the contained text. But the problem is that if there are multiple matches then it only takes ist appearance of key and last appearance of key2 and gives me the text contained between those along with the keys. Eg.: This is [[some]] [[text]]. This gives me:[[some]] [[text]] as one match whereas I want [[some]] and [[text]] as separate matches. How should I modify it to give all the matches separately?
Same thing that bothers novice parser makers who wanna parse a string between quotes, and think that
\\".*\\"
is sufficient, but then are surprised when this matches all the text between
"a string" and also "another string"
The reason behind this is that the * operator is greedy. You have to use character set negation to achieve the expected result:
\\[\\[[^\\[\\]]*\\]\\]
Hope this helps.

Is it possible to ignore characters in a string when matching with a regular expression

I'd like to create a regular expression such that when I compare the a string against an array of strings, matches are returned with the regex ignoring certain characters.
Here's one example. Consider the following array of names:
{
"Andy O'Brien",
"Bob O'Brian",
"Jim OBrien",
"Larry Oberlin"
}
If a user enters "ob", I'd like the app to apply a regex predicate to the array and all of the names in the above array would match (e.g. the ' is ignored).
I know I can run the match twice, first against each name and second against each name with the ignored chars stripped from the string. I'd rather this by done by a single regex so I don't need two passes.
Is this possible? This is for an iOS app and I'm using NSPredicate.
EDIT: clarification on use
From the initial answers I realized I wasn't clear. The example above is a specific one. I need a general solution where the array of names is a large array with diverse names and the string I am matching against is entered by the user. So I can't hard code the regex like [o]'?[b].
Also, I know how to do case-insensitive searches so don't need the answer to focus on that. Just need a solution to ignore the chars I don't want to match against.
Since you have discarded all the answers showing the ways it can be done, you are left with the answer:
NO, this cannot be done. Regex does not have an option to 'ignore' characters. Your only options are to modify the regex to match them, or to do a pass on your source text to get rid of the characters you want to ignore and then match against that. (Of course, then you may have the problem of correlating your 'cleaned' text with the actual source text.)
If I understand correctly, you want a way to match the characters "ob" 1) regardless of capitalization, and 2) regardless of whether there is an apostrophe in between them. That should be easy enough.
1) Use a case-insensitivity modifier, or use a regexp that specifies that the capital and lowercase version of the letter are both acceptable: [Oo][Bb]
2) Use the ? modifier to indicate that a character may be present either one or zero times. o'?b will match both "o'b" and "ob". If you want to include other characters that may or may not be present, you can group them with the apostrophe. For example, o['-~]?b will match "ob", "o'b", "o-b", and "o~b".
So the complete answer would be [Oo]'?[Bb].
Update: The OP asked for a solution that would cause the given character to be ignored in an arbitrary search string. You can do this by inserting '? after every character of the search string. For example, if you were given the search string oleary, you'd transform it into o'?l'?e'?a'?r'?y'?. Foolproof, though probably not optimal for performance. Note that this would match "o'leary" but also "o'lea'r'y'" if that's a concern.
In this particular case, just throw the set of characters into the middle of the regex as optional. This works specifically because you have only two characters in your match string, otherwise the regex might get a bit verbose. For example, match case-insensitive against:
o[']*b
You can add more characters to that character class in the middle to ignore them. Note that the * matches any number of characters (so O'''Brien will match) - for a single instance, change to ?:
o[']?b
You can make particular characters optional with a question mark, which means that it will match whether they're there or not, e.g:
/o\'?b/
Would match all of the above, add .+ to either side to match all other characters, and a space to denote the start of the surname:
/.+? o\'?b.+/
And use the case-insensitivity modifier to make it match regardless of capitalisation.