Lucene NOT_ANALYZED not working with uppercase characters - lucene

I have built an index using a StandardAnalyzer; this index has a few fields. For example purposes, imagine it has Id and Type. Both are NOT_ANALYZED, meaning you can only search for them as-is.
There are a few entries in my index:
{Id: "1", Type: "Location"},
{Id: "2", Type: "Group"},
{Id: "3", Type: "Location"}
When I search for +Id:1 or any other number, I get the appropriate result (again using StandardAnalyzer).
However, when I search for +Type:Location or +Type:Group, I get no results. The strange thing is that when I enable leading wildcards, +Type:*ocation does return results! +Type:*Location and other combinations do not.
This led me to believe the indexer/query doesn't like uppercase characters! After lowercasing the Type to location and group before indexing, I could indeed search for them as such.
If I turn the Type field to ANALYZED, it works with pretty much any search (uppercase, lowercase, etc.), but I want to query the Type field as-is.
I'm completely baffled why it's doing this. Could anyone explain why my indexer doesn't let me search for NOT_ANALYZED fields that have a capital in their value?

Are you using the StandardAnalyzer when parsing your query string (+Type:Location)? The StandardAnalyzer lower-cases all terms, so you're really searching with +Type:location.
Always use the same analyzer when searching and indexing. Look into using a PerFieldAnalyzerWrapper and set the Type field to use the KeywordAnalyzer.
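A minimal sketch of that setup, using recent Java Lucene class names (older 3.x and Lucene.Net versions also take a Version argument and use slightly different namespaces):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;

public class PerFieldSetup {
    public static void main(String[] args) {
        // Type is kept as one untouched token; every other field uses the StandardAnalyzer.
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("Type", new KeywordAnalyzer());
        Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

        // Use the very same analyzer on both sides:
        IndexWriterConfig config = new IndexWriterConfig(analyzer); // indexing
        QueryParser parser = new QueryParser("Id", analyzer);       // searching
    }
}

With this in place, the parser no longer lowercases the Type term, so +Type:Location matches the NOT_ANALYZED value exactly.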

Related

Cloudant - Lucene range search using numbers stored as text

I have a number of documents in Cloudant that have an ID field of type string. ID can be a simple string, like "aaa", "bbb", or a number stored as text, e.g. "111", "222", etc. I need to be able to full-text search on this field, but I encountered some problems.
Assuming that I have two documents, having ID="aaa" and ID="111", then searching with query:
ID:aaa
ID:"aaa"
ID:[aaa TO zzz]
ID:["aaa" TO "zzz"]
returns the first document, as expected
ID:111
returns nothing, but
ID:"111"
returns the second document, so at least there is a way to retrieve it.
Unfortunately, when searching for range:
ID:[111 TO 999]
ID:["111" TO "999"]
I get no results, and I have no idea how to get around this problem. Is there any special syntax for such a case?
UPDATE:
Index function:
function(doc){
  if(!doc.ID) return;
  index("ID", doc.ID, { index: 'not_analyzed_no_norms', store: true });
}
Changing the index to analyzed doesn't help. The analyzer itself is keyword, but changing it to standard doesn't help either.
UPDATE 2
Just to add some more context, because I think I missed one key point. The field I'm indexing will be searched using ranges, and both the min and max values can be provided by the user. So it is possible that one of them will be a number stored as a string, while the other will be ordinary non-numeric text. For example: search all documents where ID >= "11" and ID <= "foo".
Assuming that the database contains documents with ID "1", "5", "alpha", "beta", "gamma", this query should return "5", "alpha", "beta". Please note that "5" should actually be returned, because the string "5" is greater than the string "11".
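To make that ordering concrete, the comparison here is plain lexicographic string comparison, as these Java lines illustrate:

System.out.println("5".compareTo("11") > 0);      // true  -> "5" sorts after "11", so it falls inside ["11" TO "foo"]
System.out.println("alpha".compareTo("foo") < 0); // true  -> "alpha" is included
System.out.println("gamma".compareTo("foo") > 0); // true  -> "gamma" sorts after "foo", so it is excluded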
Our team just came up with a workaround. We managed to get proper results by appending an arbitrary character, e.g. 'a', to the upper range value, and by introducing an additional search term to exclude documents with an ID between the upper range value and the upper range value + 'a'.
When searching for a range
ID:[X TO Y]
actual query would be
(ID:[X TO Ya] AND -ID:{Y TO Ya])
For example, to find documents with an ID between 23 and 758, we execute
(ID:[23 TO 758a] AND -ID:{758 TO 758a]).
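For illustration, a small hypothetical Java helper (the name buildRangeQuery is just for this sketch) that assembles the workaround query string described above:

static String buildRangeQuery(String field, String lower, String upper) {
    String widened = upper + "a"; // artificially extend the upper bound
    return "(" + field + ":[" + lower + " TO " + widened + "]"
         + " AND -" + field + ":{" + upper + " TO " + widened + "])";
}

// buildRangeQuery("ID", "23", "758")
// => (ID:[23 TO 758a] AND -ID:{758 TO 758a])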
First of all, I would suggest using the keyword analyzer, so you can control tokenization during both indexing and search.
"analyzer": "keyword",
"index": "function(doc){\n if(!doc.ID) return;\n index(\"ID\", doc.ID, {store:true });\n}
To retrieve your document with ID "111", use the following range query:
curl -X GET "http://.../facetrangetest/_design/ddoc/_search/f?q=ID:\[111%20TO%20A\]"
If you use the query q=ID:\[111%20TO%20999\], Cloudant search, seeing numbers on both sides of the range, will interpret it as a NumericRangeQuery; and since your ID of "111" is a string, it will not be part of the results returned. Including a string in the query, [111%20TO%20A], makes Cloudant interpret it as a range query on strings.
You can get both docs returned like this:
q=ID:["111" TO "CCC"]
Here's a working live example:
https://rajsingh.cloudant.com/facetrangetest/_design/ddoc/_search/f?q=ID:[%22111%22%20TO%20%22CCC%22]
I found something quirky. It seems that range queries on strings only work if at least one of the range values is non-numeric. Querying on ID:["111" TO "555"] doesn't return anything either, so maybe this is resolving to a numeric query somehow? Could be a bug.
This could also be achieved using regular expressions in queries. Something like this:
curl -X POST "https://.../facetrangetest/_design/ddoc/_search/f" -d '{"q":"ID:/<23-758>/"}' | jq .
This regular expression means: retrieve all documents with an ID field from 23 to 758. The slashes (/ /) enclose the regular expression; the numeric interval is enclosed inside <>.

How to index only words with a minimum length using Apache Lucene 5.3.1?

Can someone give me a hint on how to index only words with a minimum length using Apache Lucene 5.3.1?
I've searched through the API but didn't find anything that suits my needs except this, and I couldn't figure out how to use it.
Thanks!
Edit:
I guess that's important info, so here's a copy of my explanation of what I want to achieve from my reply below:
"I don't intend to use queries. I want to create a source code summarization tool for which I created a doc-term matrix using Lucene. Now it also shows single- or double-character words. I want to exclude them so they don't show up in the results as they have little value for a summary. I know I could filter them when outputting the results, but that's not a clean solution imo. An even worse would be to add all combinations of single- or double-character words to the stoplist. I am hoping there is a more elegant way then one of those."
You should use a custom Analyzer with a LengthFilter. E.g.
Analyzer ana = CustomAnalyzer.builder()
    .withTokenizer("standard")
    .addTokenFilter("standard")
    .addTokenFilter("lowercase")
    .addTokenFilter("length", "min", "4", "max", "50")
    .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
    .build();
But it is better to use a stop-word list (words that occur in almost all documents, such as articles in English). This gives a more accurate result.

Searching in the middle of a not_analyzed field

I have an Elasticsearch index where one of the fields is marked with not_analyzed. This field contains a space-separated list of values, like this:
Value1 Value2 Value3
Now I want to perform a search to find documents where this field contains "Value2". I've tried searching using a text phrase prefix query, but a search for "Value2" matches nothing. A search for "Value1" or "Value1 Value2", on the other hand, matches. I don't want any fuzziness in the searching, only exact matches (which is the reason the field was set to not_analyzed).
Is there any way to do a search like this?
From my limited understanding of Elasticsearch, I'm guessing I need to set the field to analyzed using the whitespace analyzer. Is that right?
Correct. Using either the Standard or Whitespace Analyzer, among others, would break the text down into tokens, split on whitespace, commas, etc. A simple_query_string query would then match "Value2" regardless of its position in the document's field.
Note that the Standard Analyzer will also lowercase your fields, meaning that only search terms that are lower-case will match.
You could do this using wildcards; it will be an expensive query though.
You might have to set "lowercase_expanded_terms" to false in order to get a match.
When you search for "Value2" with a wildcard, the query is interpreted as "value2" after Lucene parsing:
query_string:Value2* -> ES interpretation: value2*
Note that it lowercases your search term. This is useful for analyzed fields, but in not_analyzed fields you won't get a match (if the original value contains upper case).
Setting lowercase_expanded_terms to false prevents this from happening.
Now, if the field is not_analyzed as you said, the following query should match your documents:
{
  "size": 10,
  "query": {
    "query_string": {
      "query": "title:*Value2*"
    }
  }
}
sorry for the lousy answer.

Lucene.net GetFieldQuery vs TermQuery

Using Lucene's standard analyzer. The Title field in question is non-stored and analyzed. The query is as follows:
title:"Some-Url-Friendly-Title"
In Luke, this query gets correctly rewritten as:
title:"some url friendly title" (hyphens replaced by whitespace, everything lowercased).
I thought the Lucene.net version would be:
new TermQuery(new Term("title","Some-Url-Friendly-Title"))
However, no results are returned.
Then I tried:
_parser.GetFieldQuery("title","Some-Url-Friendly-Title")
And it worked as expected!
Both queries were executed via:
_searcher.Search([query object], [sort object])
Can somebody point me in the right direction to see what the differences between TermQuery and _parser.GetFieldQuery() are?
A TermQuery is much simpler than running a query through a query parser. Not only is the term not lowercased and hyphenated words not split, the text isn't even tokenized. It just searches for the exact term you give it. That means it is looking for "Some-Url-Friendly-Title" as a single untokenized keyword in your index. I assume you are using an analyzer, so chances are no such token exists.
To take it a step further, if you had searched for "Some Url Friendly Title" as the term text, you still wouldn't find anything, since it would look for "Some Url Friendly Title" as a single token, not as the four tokens (or rather, terms) in your index.
If you look at what the standard query parser generates when you parse your query, you'll see that TermQueries are only one of the building blocks it uses to assemble the complete query, along with BooleanQuery, and possibly PhraseQuery, PrefixQuery, etc.
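A minimal Java Lucene sketch of the difference (the Lucene.Net API mirrors it; recent Java versions shown, older ones also take a Version argument):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TermQueryVsParser {
    public static void main(String[] args) throws Exception {
        // TermQuery: no analysis at all, the literal term is looked up as-is.
        Query raw = new TermQuery(new Term("title", "Some-Url-Friendly-Title"));
        System.out.println(raw); // title:Some-Url-Friendly-Title

        // QueryParser: runs the text through the analyzer first, producing
        // the lowercased, tokenized form that actually exists in the index.
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        Query parsed = parser.parse("\"Some-Url-Friendly-Title\"");
        System.out.println(parsed); // title:"some url friendly title"
    }
}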
In Lucene.Net version 3.0.3, GetFieldQuery is inaccessible due to its protection modifier. Use
MultiFieldQueryParser.Parse(searchText, field)
instead.

Comparison of Lucene Analyzers

Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception and I understand that I can avoid this by using a KeywordAnalyzer but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.
In general, any analyzer in Lucene is a tokenizer + stemmer + stop-words filter.
The tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) use spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Javadocs.
Stemmers are used to get the base form of a word. This heavily depends on the language used. For example, for the previous English phrase there will be something like ["i", "be", "veri", "happi"] produced, and for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use an analyzer of one language to stem text in another, the rules of the wrong language will be applied and the stemmer may produce incorrect results. This isn't a failure of the whole system, but search results may then be less accurate.
KeywordAnalyzer doesn't use any stemmers; it passes the whole field unmodified. So, if you are going to search for words in English text, it isn't a good idea to use this analyzer.
Stop words are the most frequent and almost useless words. Again, this heavily depends on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower noise in search results, so finally our phrase "I'm very happy" with StandardAnalyzer will be transformed into the list ["veri", "happi"].
KeywordAnalyzer, again, does nothing here. So KeywordAnalyzer is used for things like IDs or phone numbers, but not for usual text.
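A small Java sketch that makes the difference visible by printing the token stream each analyzer produces (exact StandardAnalyzer output may vary by version and stop-word configuration):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    static void printTokens(Analyzer analyzer, String text) throws Exception {
        try (TokenStream ts = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "I am very happy";
        printTokens(new StandardAnalyzer(), text); // e.g. [i] [am] [very] [happy]
        printTokens(new KeywordAnalyzer(), text);  // [I am very happy]  (one token)
    }
}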
As for your maxClauseCount exception, I believe you get it while searching. In that case it is most probably caused by a too-complex search query. Try to split it into several queries or use more low-level functions.
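If splitting the query is not practical, the clause limit itself can also be raised; this is a different approach from the one above and a global, version-dependent setting (a static setter on BooleanQuery up to Lucene 7.x, moved to IndexSearcher.setMaxClauseCount in 8+):

import org.apache.lucene.search.BooleanQuery;

// Default limit is 1024; raising it trades memory/CPU for fewer TooManyClauses exceptions.
BooleanQuery.setMaxClauseCount(4096);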
From my experience, I have used StandardAnalyzer and SmartChineseAnalyzer, as I have to search text in Chinese. Obviously, SmartChineseAnalyzer is better at handling Chinese. For different purposes, you have to choose the most appropriate analyzer.