How to suggest only single words but index phrases - lucene

I am using ShingleAnalyzerWrapper to index phrases, but I need the SpellChecker to suggest only single words.
How can I index phrases but search for both phrases and single words with SpellChecker?
Please give some advice.

Use this constructor for your ShingleAnalyzerWrapper:
ShingleAnalyzerWrapper(Analyzer defaultAnalyzer,
int minShingleSize,
int maxShingleSize,
String tokenSeparator,
boolean outputUnigrams,
boolean outputUnigramsIfNoShingles)
passing true as the fifth argument (outputUnigrams). This will index all single tokens regardless of what your minShingleSize is. Lowering minShingleSize to 1 is not an alternative, because the shingle filter rejects a minimum shingle size below 2; outputUnigrams is the setting that controls whether single tokens are emitted.
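To make the effect of outputUnigrams concrete, here is a minimal sketch in plain Python. It is a toy model of shingling, not Lucene's ShingleFilter; the function name `shingles` and its parameters are hypothetical and only mirror the constructor arguments above.

```python
def shingles(tokens, min_size=2, max_size=2, sep=" ", output_unigrams=True):
    """Toy model of shingling: emit word n-grams, optionally with single tokens."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            # Single tokens are emitted independently of min_size.
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out


tokens = "please divide this sentence".split()
with_unigrams = shingles(tokens, output_unigrams=True)
without_unigrams = shingles(tokens, output_unigrams=False)
print(with_unigrams)
print(without_unigrams)
```

With unigrams enabled, the single words sit in the index next to the two-word shingles, so a consumer that only wants single words can skip any term containing the separator.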

Related

TermRangeQuery in lucene for long values

I'm using Lucene 8 and trying to run a range query over epoch values (the timestamps of the documents being indexed). However, as far as I can see, Lucene 8 only offers TermRangeQuery, which takes BytesRef parameters instead of long. Is there an alternative that takes long values as input and performs the range query? If not, how do I convert a long value to a BytesRef?
Below is my code:
Term startTerm = new Term(OFFER_END_DATE_KEY, valueOf(Instant.now()));
Term endTerm = new Term(OFFER_END_DATE_KEY, valueOf(Instant.now().plus(2, ChronoUnit.YEARS)));
new TermRangeQuery(OFFER_END_DATE_KEY, startTerm, endTerm, true, true);
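One workaround is to use the following factory method:
TermRangeQuery.newStringRange(String field, String lowerTerm, String upperTerm, boolean includeLower, boolean includeUpper);
To use this, convert the epoch timestamp to a String value. Because terms are compared lexicographically, the strings must be zero-padded to a fixed width, both at index time and at query time, for the range to behave numerically.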
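A small sketch of why the string form matters, in plain Python rather than Lucene. The helper name `to_sortable` and the width of 19 digits (enough for the largest signed 64-bit value) are assumptions made for illustration.

```python
def to_sortable(epoch_millis, width=19):
    """Zero-pad a non-negative epoch value so string order matches numeric order."""
    if epoch_millis < 0:
        raise ValueError("this sketch only handles non-negative values")
    return str(epoch_millis).zfill(width)


values = [999, 1000, 20, 1700000000000]

# Plain decimal strings sort lexicographically, which is not numeric order.
naive = sorted(str(v) for v in values)

# Fixed-width strings sort the same way the numbers do.
padded = sorted(to_sortable(v) for v in values)

print(naive)
print([int(p) for p in padded])
```

The same padding would have to be applied to the lower and upper bounds passed to the string range query.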
To answer your BytesRef question directly, the LongPoint field type (https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/document/LongPoint.html) has a static method pack which does this.
It seems easier, however, to index the value as a LongPoint field and query it with LongPoint.newRangeQuery(String field, long lowerValue, long upperValue), which takes the long bounds directly.

Regex match SQL values string with multiple rows and same number of columns

I tried to match the SQL values string (0),(5),(12),... or (0,11),(122,33),(4,51),... or (0,121,12),(31,4,5),(26,227,38),... and so on with the regular expression
\(\s*\d+\s*(\s*,\s*\d+\s*)*\)(\s*,\s*\(\s*\d+\s*(\s*,\s*\d+\s*)*\))*
and it works. But...
How can I ensure that the regex does not match a values string like (0,12),(1,2,3),(56,7), where the rows have different numbers of columns?
Thanks in advance...
As I mentioned in a comment on the question, the best way to check that the input string is valid (that is, that every pair of brackets contains the same count of numbers) is to use a client-side program rather than pure SQL.
Implementation:
using System;
using System.Collections.Generic;
using System.Linq;
List<string> s = new List<string>(){
    "(0),(5),(12)", "(0,11),(122,33),(4,51)",
    "(0,121,12),(31,4,5),(26,227,38)", "(0,12),(1,2,3),(56,7)"};
var qry = s.Select(a => new
    {
        orig = a,
        // Split the input into one string per bracketed group, e.g. "0,11".
        newst = a.Split(new string[]{"),(", "(", ")"},
            StringSplitOptions.RemoveEmptyEntries)
    })
    .Select(a => new
    {
        orig = a.orig,
        // Valid only if every group contains the same number of values.
        isValid = a.newst
            .Select(b => b.Split(new char[]{','},
                StringSplitOptions.RemoveEmptyEntries).Length)
            .Distinct()
            .Count() == 1
    });
Result:
orig isValid
(0),(5),(12) True
(0,11),(122,33),(4,51) True
(0,121,12),(31,4,5),(26,227,38) True
(0,12),(1,2,3),(56,7) False
Note: The second Select statement counts the values in each bracketed group and checks that all groups share the same count. Comparing only the total number of values modulo the number of groups is not enough: (1),(1,2,3) has four values in two groups, so the remainder is zero even though the column counts differ.
I strongly believe there's a simpler way to achieve that, but at this moment I don't know how ;)
:(
Unless you add some more constraints, I don't think you can solve this problem only with regular expressions.
Regular expressions can't solve every string problem. For the same reason, a classical regular expression cannot check whether opening and closing brackets (like "((())()(()(())))") are balanced: both tasks require counting, which is a more complicated issue.
That's what I learnt in class :P If someone knows a way then that'd be sweet!
I'm sorry, I spent a bit of time looking into how we could turn this string into an array and process it further in SQL, but the built-in functionality is lacking and the solution would end up being very hacky.
I'd recommend handling this situation differently, as large-scale string computation isn't the best way to go if your database is going to fill up gradually.
A combination of client-side and server-side validation can help prevent bad data (like rows with extra numbers) from getting into the database.
If you need to keep those numbers then you could rework your schema to include some metadata which you can use in your queries, like how many numbers there are and whether it all matches nicely. This information can be computed inexpensively from your server and provided to the database.
Good luck!
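The validation suggested in the answers above could be sketched as follows in Python: a regular expression checks the overall syntax, and ordinary code compares the column counts. The names `ROW`, `VALUES` and `same_column_count` are made up for this sketch, and the pattern is a slightly tidied variant of the one in the question.

```python
import re

# One bracketed row of integers, e.g. "(0, 11)".
ROW = re.compile(r"\(\s*\d+\s*(?:,\s*\d+\s*)*\)")

# One or more rows separated by commas.
VALUES = re.compile(r"\s*%s(?:\s*,\s*%s)*\s*" % (ROW.pattern, ROW.pattern))


def same_column_count(values):
    """Return True if every row is well formed and all rows have the same number of columns."""
    if not VALUES.fullmatch(values):
        return False
    # A row with n commas has n + 1 columns.
    counts = {row.count(",") + 1 for row in ROW.findall(values)}
    return len(counts) == 1


for candidate in ["(0),(5),(12)", "(0,11),(122,33),(4,51)", "(0,12),(1,2,3),(56,7)"]:
    print(candidate, same_column_count(candidate))
```

Collecting the counts in a set and requiring exactly one distinct value also rejects inputs such as (1),(1,2,3), where the total number of values happens to divide evenly by the number of rows.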

Space issue in Lucene.NET C#

I want to run a full-text search for a phrase that contains a space.
Ex: Tom is a very good boy in class.
I want to search for the keyword "very good".
I'm using a whitespace tokenizer to create and search the index, but it does not find the keyword if it contains a space.
Code:
Query searchItemQuery = new WildcardQuery(new Term(string-field-name, searchkeyword.ToLower()));
I've tried splitting the keyword, but that does not work properly.
Can anyone suggest a solution for this problem?
Thanks,
Vijay
Since you are working with a tokenized string, every word is a separate term.
In order to find a phrase consisting of multiple terms, you need to use PhraseQuery instead of WildcardQuery.
Like this:
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.Add(new Term(string-field-name, "very"));
phraseQuery.Add(new Term(string-field-name, "good"));
Note also that you are using a wildcard query. Wildcards in a phrase query are a bit complex. Check this post for details: Lucene - Wildcards in phrases
And finally, I would suggest considering QueryParser instead of constructing the query manually.
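To illustrate why a single term lookup cannot find "very good" while a phrase query can, here is a toy positional index in plain Python. It only models the idea (terms plus positions) and is not how Lucene.NET is implemented; `build_index` and `phrase_match` are hypothetical names.

```python
from collections import defaultdict


def build_index(text):
    """Map each lowercased whitespace token to the positions where it occurs."""
    positions = defaultdict(list)
    for pos, token in enumerate(text.lower().split()):
        positions[token].append(pos)
    return positions


def phrase_match(index, phrase):
    """Return True if the phrase's tokens occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return False
    return any(
        all(start + offset in index.get(term, ()) for offset, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )


index = build_index("Tom is a very good boy in class.")
print("very good" in index)              # no single term contains a space
print(phrase_match(index, "very good"))  # the two terms are adjacent
```

The whitespace tokenizer never produces a term containing a space, so the lookup for the whole keyword fails, whereas checking the two terms at adjacent positions succeeds.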

querying for a string'ed number in lucene finds nothing

I have an existing index with some documents I'm trying to search.
When I search a "real textual" field, everything is OK.
When I try to search a field which is a number, the search gives 0 results.
The code is something like this (it is pylucene but the concept is the same):
dir = SimpleFSDirectory(File(indexDir))
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
searcher = IndexSearcher(dir)
query = QueryParser(Version.LUCENE_CURRENT, "id", analyzer).parse("902")
hits = searcher.search(query, MAX)
print hits.totalHits #gives me 0
A Luke search (id:902) gives me empty results as well.
When I look at the Overview tab in Luke, it says this field is UTF-8 (string).
Am I doing anything wrong?
Edit:
It appears this happens on fields that are indexed and have no norms (according to the flags shown in Luke).
Can someone explain it?
I don't like answering my own questions but I believe this answer is an important reference.
The solution is to use a NumericRangeQuery with both bounds set to the number you seek (this time in Java):
NumericRangeQuery.newIntRange("id", Integer.valueOf(902), Integer.valueOf(902),
true, true)
Are you using SimpleAnalyzer while indexing? It strips off numbers. Make sure you are using the same analyzer while indexing and searching.
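The effect of a letters-only analyzer on a numeric value can be sketched in plain Python. This is a rough approximation of the behaviour described above, not Lucene's implementation; `letter_tokens` is a hypothetical helper.

```python
import re


def letter_tokens(text):
    """Rough stand-in for a letters-only tokenizer: keep lowercased runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


print(letter_tokens("order 902 shipped"))
print(letter_tokens("902"))
```

If the id was indexed through such an analyzer, no term for 902 exists in the index, so a query for it cannot match regardless of how the query is parsed.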

Find all Lucene documents having a certain field

I want to find all documents in the index that have a certain field, regardless of the field's value. If at all possible using the query language, not the API.
Is there a way?
If you know the type of data stored in your field, you can try a range query. For example, if your field contains string data, a query like field:[a* TO z*] would return all documents where there is a string value in that field.
I've done some experimenting, and it seems the simplest way to achieve this is to create a QueryParser and call SetAllowLeadingWildcard( true ) and search for field:* like so:
var qp = new QueryParser( Lucene.Net.Util.Version.LUCENE_29, field, analyzer );
qp.SetAllowLeadingWildcard( true );
var query = qp.Parse( "*" );
(Note I am setting the default field of the QueryParser to field in its constructor, hence the search for just "*" in Parse()).
I cannot vouch for how efficient this method is over other methods, but being the simplest method I can find, I would expect it to be at least as efficient as field:[* TO *], and it avoids having to do hackish things like field:[0* TO z*], which may not account for all possible values, such as values starting with non-alphanumeric characters.
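The concern about range bounds missing some values can be shown with plain string comparison in Python. This is only an analogy that treats the bounds as ordinary strings compared by code point; it does not reproduce how Lucene parses or evaluates a range query, and the sample values are invented.

```python
values = ["apple", "902", "#tag", "~tmp", "émile"]

# Inclusive lexicographic range from "0" to "z".
matched = [v for v in values if "0" <= v <= "z"]
missed = [v for v in values if v not in matched]

print(matched)
print(missed)
```

Values starting with characters that sort before "0" or after "z" fall outside the range, which is why a wildcard over the whole field is the more robust way to ask "does this field have any value".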
Another solution is to use a ConstantScoreQuery with a FieldValueFilter:
new ConstantScoreQuery(new FieldValueFilter("field"))