Lucene phrase query case insensitive

I am writing a query to do an exact match on a 'city' field. The field/property is defined as:
@org.hibernate.search.annotations.Field(index = Index.YES, analyze = Analyze.NO, store = Store.NO)
private String city;
If I have the value "New York", I want to find a match if the user enters "new york" or some other variation of case. I am using the StandardAnalyzer for the entity, so I know that will lowercase all the tokens. I don't tokenize since I want to match the whole phrase (Analyze.NO).
I tried to lowercase my search value, but no luck.
Query query = qb.phrase().onField(.....).sentence(location.toLowerCase()).createQuery();
If I don't lowercase the search term and the value is 'New York', results are returned. Searching for 'new york' does not return any result.
If I tokenize (Analyze.YES), then other cities like 'New Jersey' are returned. I know I can use a wildcard query (searchTerm*), but I was hoping to be able to do a case insensitive search on a phrase. Just not sure if that's possible unless you use the wildcard.
thanks

It sounds like you want an analyzer that emits the entire text as a single token while lower-casing the input. In that case, use analyze=Analyze.YES and specify the appropriate analyzer (the answer here has code that looks like what you need) with analyzer=@Analyzer(impl=your.fully.qualified.Analyzer.class).
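To illustrate the idea without depending on a particular Lucene or Hibernate Search version, here is a minimal plain-Java sketch of what such an analyzer does to both the field value and the search input: the whole string becomes one lowercased token, so matching stays exact but ignores case. The names SingleTokenLowercase, toToken and matches are made up for this illustration and are not part of any library; in Lucene itself the equivalent is typically built from a keyword tokenizer followed by a lowercase filter.

```java
import java.util.Locale;

// Hypothetical illustration only: models an analyzer that emits the whole
// input as a single lowercased token. Not a Lucene or Hibernate Search API.
final class SingleTokenLowercase {

    // Turn the full field value (or search input) into one normalized token.
    static String toToken(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Exact, case-insensitive match: both sides pass through the same analysis.
    static boolean matches(String indexedValue, String userInput) {
        return toToken(indexedValue).equals(toToken(userInput));
    }

    public static void main(String[] args) {
        System.out.println(matches("New York", "new york"));   // true
        System.out.println(matches("New York", "NEW YORK"));   // true
        System.out.println(matches("New Jersey", "new york")); // false
    }
}
```

Because the value is never split into words, "New Jersey" cannot match a search for "new york" the way it does with a tokenizing analyzer.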

Related

Searching in the middle of a not_analyzed field

I have an Elasticsearch index where one of the fields is marked with not_analyzed. This field contains a space-separated list of values, like this:
Value1 Value2 Value3
Now I want to perform a search to find documents where this field contains "Value2". I've tried searching with a text phrase prefix query, but a search for "Value2" matches nothing. A search for "Value1" or "Value1 Value2", on the other hand, matches. I don't want any fuzziness in the search, only exact matches (which is the reason the field was set to not_analyzed).
Is there any way to do a search like this?
From my limited understanding of Elasticsearch, I'm guessing I need to set the field to analyzed using the whitespace analyzer. Is that right?
Correct. Using the Standard or Whitespace analyzer (among others) breaks the value into tokens, splitting on whitespace (and, for the Standard analyzer, on punctuation such as commas). A simple_query_string query would then match "Value2" regardless of its position in the document's field.
The Standard analyzer will also lowercase your fields, meaning that the indexed tokens are lower-case and only lower-case search terms will match them.
You could do this using wildcards, though it will be an expensive query.
You might have to set "lowercase_expanded_terms" to false in order to get a match.
When you search for "Value2" with a wildcard, the search is interpreted as "value2" after Lucene parsing:
query_string:Value2* -> ES interpretation value2*
Note that it lowercases your search term. This is useful for analyzed fields, but on not_analyzed fields you won't get a match if the original value contains upper-case characters.
Setting lowercase_expanded_terms to false prevents this from happening.
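The effect can be reproduced in a small plain-Java model. This is only a sketch under simplifying assumptions: the not_analyzed field is treated as one stored term, the wildcard is translated to a regular expression, and the boolean flag mimics what lowercase_expanded_terms does to the pattern. WildcardCaseDemo and its methods are hypothetical names, not Elasticsearch code.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical model, not Elasticsearch code: a not_analyzed field is stored
// as one term exactly as written, and a wildcard pattern is matched against it.
final class WildcardCaseDemo {

    // Translate a simple wildcard pattern (* and ?) into a regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // lowercaseExpandedTerms mimics the parser option of the same name.
    static boolean matches(String storedTerm, String wildcard, boolean lowercaseExpandedTerms) {
        String effective = lowercaseExpandedTerms ? wildcard.toLowerCase(Locale.ROOT) : wildcard;
        return toRegex(effective).matcher(storedTerm).matches();
    }

    public static void main(String[] args) {
        String stored = "Value1 Value2 Value3"; // indexed verbatim (not_analyzed)
        System.out.println(matches(stored, "*Value2*", true));  // false
        System.out.println(matches(stored, "*Value2*", false)); // true
    }
}
```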
Now, if the field is not_analyzed as you said, the following query should match your documents:
{
"size": 10,
"query": {
"query_string": {
"query": "title:*Value2*"
}
}
}
sorry for the lousy answer.

How to search multiple fields using Surround QueryParser?

I have some queries regarding the Surround QueryParser. Could any of you please suggest?
How to search multiple fields at once?
As shown below, the syntax allows searching against one field. But how do I submit a query like "FIELD1:N(abc,corp) FIELD2:N(xyz,corp)"? Is something like this possible with the Surround QueryParser?
SrndQuery srndQuery = org.apache.lucene.queryparser.surround.parser.QueryParser.parse(strTxtSearchString);
Query query = srndQuery.makeLuceneQueryField(, new BasicQueryFactory());
How to escape special characters the way we do in the regular QueryParser with QueryParser.escape()?
How to escape words such as "and", "or", "W", "N" etc.? The search string itself might have the words such as "and". In that case, my query would look something like "N(abc,and,sons)" or "W(abc,n,company)".
I get an org.apache.lucene.queryparser.surround.parser.ParseException when I submit such a query.
How to provide a wildcard at the beginning of a word?
The regular QueryParser lets us do parser.setAllowLeadingWildcard(true); Is there some way to do this with the Surround QueryParser?
Any inputs will be very helpful. Thanks!

SQL to return results for the following regex

I have the following regular expression:
WHERE A.srvc_call_id = '40750564' AND REGEXP_LIKE (A.SRVC_CALL_DN, '[^TEST]')
The row that contains 40750564 has "TEST CALL" in the column SRVC_CALL_DN and REGEXP_LIKE doesn't seem to be filtering it out. Whenever I run the query it returns the row when it shouldn't.
Is my regex pattern wrong? Or does SQL not accept [^whatever]?
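Outside square brackets, the caret anchors the expression to the start of the string; as the first character inside square brackets, it negates the character class instead. By enclosing the letters T, E, S & T in square brackets you're searching, as barsju suggests, for a single character from that set (or, with the caret, any character not in it), not for the string TEST.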
You say that SRVC_CALL_DN contains the string 'TEST CALL', but you don't say where in the string. You also say that you're looking for where this string doesn't match. This implies that you want to use not regexp_like(...
Putting all this together I think you need:
AND NOT REGEXP_LIKE (A.SRVC_CALL_DN, '^TEST[[:space:]]CALL')
This excludes every row from your query where the string starts with 'TEST CALL'. However, if this string may be in any position in the column, you need to remove the caret (^).
This also assumes that the string is always in upper case. If it's in mixed case or lower, then you need to change it again. Something like the following:
AND NOT REGEXP_LIKE (upper(A.SRVC_CALL_DN), '^TEST[[:space:]]CALL')
By upper-casing SRVC_CALL_DN you ensure that the pattern matches regardless of case, but you also ensure that the query cannot use an ordinary index on this column. I wouldn't worry about this particular point, as regular expression queries can be fairly poor at using indexes anyway and it appears as though SRVC_CALL_ID is indexed.
Also if it may not include 'CALL' you will have to remove this. It is best when using regular expressions to make your match pattern as explicit as possible; so include 'CALL' if you can.
Try with '^TEST' or '^TEST.*'
Your regexp matches any string that contains at least one character other than T, E or S; it does not test whether the string starts with TEST.
But your case is simple: the value starts with TEST. Why not use a plain LIKE:
LIKE 'TEST%'
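The difference between the bracketed pattern and the anchored one is easy to check outside the database. The sketch below uses Java's java.util.regex rather than Oracle's REGEXP_LIKE, so it is an analogy and not the same engine: find() stands in for REGEXP_LIKE matching anywhere in the value, and \s replaces the POSIX class [[:space:]], which Java does not support in that form. The class name CaretDemo is made up for the example.

```java
import java.util.regex.Pattern;

// Illustration with java.util.regex; Oracle's engine differs in details.
final class CaretDemo {

    // True if the pattern matches anywhere in the input (like REGEXP_LIKE).
    static boolean like(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String value = "TEST CALL";
        // [^TEST] = one character that is not T, E or S; the space qualifies.
        System.out.println(like(value, "[^TEST]"));        // true
        // Only a string made up entirely of T, E and S fails to match.
        System.out.println(like("TEST", "[^TEST]"));       // false
        // Anchored literal match, negated to exclude the row.
        System.out.println(!like(value, "^TEST\\sCALL"));  // false
    }
}
```

This is why the original query still returned the 'TEST CALL' row: the bracketed pattern is satisfied by the space (and by C, A and L) alone.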

Find all Lucene documents having a certain field

I want to find all documents in the index that have a certain field, regardless of the field's value. If at all possible using the query language, not the API.
Is there a way?
If you know the type of data stored in your field, you can try a range query. For example, if your field contains string data, a query like field:[a* TO z*] would return all documents where there is a string value in that field.
I've done some experimenting, and it seems the simplest way to achieve this is to create a QueryParser and call SetAllowLeadingWildcard( true ) and search for field:* like so:
var qp = new QueryParser( Lucene.Net.Util.Version.LUCENE_29, field, analyzer );
qp.SetAllowLeadingWildcard( true );
var query = qp.Parse( "*" );
(Note I am setting the default field of the QueryParser to field in its constructor, hence the search for just "*" in Parse()).
I cannot vouch for how efficient this method is over other methods, but being the simplest method I can find, I would expect it to be at least as efficient as field:[* TO *], and it avoids having to do hackish things like field:[0* TO z*], which may not account for all possible values, such as values starting with non-alphanumeric characters.
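The concern about field:[0* TO z*] missing values can be made concrete. This is a minimal sketch, assuming the range bounds are treated as literal terms and compared in plain lexicographic order (which is how Java's String.compareTo orders ASCII text). The class name RangeCoverageDemo and the sample terms are invented for illustration, and real Lucene term ordering may differ for non-ASCII input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of an inclusive term range; not a Lucene API.
final class RangeCoverageDemo {

    // Keep only the terms t with lower <= t <= upper in lexicographic order.
    static List<String> inRange(List<String> terms, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String t : terms) {
            if (t.compareTo(lower) >= 0 && t.compareTo(upper) <= 0) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("#hashtag", "42", "Zurich", "apple", "zebra", "~tilde");
        // Under the stated assumption, "#hashtag", "zebra" and "~tilde" fall outside.
        System.out.println(inRange(terms, "0*", "z*")); // [42, Zurich, apple]
    }
}
```

In this model even "zebra" is dropped, because it sorts after the literal bound "z*", which is one more reason to prefer a query that does not depend on guessing the bounds.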
Another solution is to use a ConstantScoreQuery with a FieldValueFilter:
new ConstantScoreQuery(new FieldValueFilter("field"))

How to make Lucene match all words in query?

I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be used to force a term to be included, but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify
I also thought of inserting '+' into all spaces in the query. I just wanted to avoid detecting grouped terms (brackets, quotes etc) and potentially breaking the query. Is there another approach?
This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");
Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.
Why not just preparse the user's search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene? Alternatively, you could create some help documentation on how to use the standard syntax to build a specific query and let the user decide how the query should be performed.
Lucene has an extensive query language, as described here, that covers everything you want except for + being the default, but that's something you can simply handle by replacing spaces with +. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise adhering to the default Lucene syntax), and then you can write the transformations from your own syntax to the Lucene syntax.
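If you do go the pre-parsing route, blindly replacing spaces would break quoted phrases, which is the concern raised in the question. Here is a rough sketch of a pre-parser that keeps quoted phrases intact and leaves clauses that already start with + or - alone. The class RequireAllTerms and the method requireAll are hypothetical names. It deliberately ignores parentheses, escaped quotes and the AND/OR/NOT keywords, so treat it as a starting point only. Setting the parser's default operator, as shown in the other answer, avoids this work entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-parser sketch; it does not handle parentheses, escaped
// quotes, or the AND/OR/NOT keywords of the Lucene query syntax.
final class RequireAllTerms {

    // A clause is a run of non-space characters in which quoted phrases
    // (spaces included) count as part of the clause.
    private static final Pattern CLAUSE = Pattern.compile("(?:[^\\s\"]|\"[^\"]*\")+");

    static String requireAll(String userQuery) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(userQuery);
        while (m.find()) {
            String clause = m.group();
            // Leave clauses alone if the user already marked them + or -.
            boolean marked = clause.startsWith("+") || clause.startsWith("-");
            clauses.add(marked ? clause : "+" + clause);
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(requireAll("lucene \"abc xyz\" -slow"));
        // prints: +lucene +"abc xyz" -slow
    }
}
```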
The behavior is hard-coded in method addClause(List, int, int, Query) of class org.apache.lucene.queryParser.QueryParser, so the only way to change the behavior (other than the workarounds above) is to change that method. The end of the method looks like this:
if (required && !prohibited)
    clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
    clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
    clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
    throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.