TermRangeQuery in Lucene for long values

I'm using Lucene 8 and trying to perform a range query over epoch values (the timestamps of the documents I'm indexing). However, Lucene 8 seems to offer only TermRangeQuery for this, which takes BytesRef parameters instead of long. Can someone tell me if there is an alternative that takes long input values and performs a range query, or failing that, how to convert a long value to a BytesRef?
Below is my code:
Term startTerm = new Term(OFFER_END_DATE_KEY, valueOf(Instant.now()));
Term endTerm = new Term(OFFER_END_DATE_KEY, valueOf(Instant.now().plus(2, ChronoUnit.YEARS)));
new TermRangeQuery(OFFER_END_DATE_KEY, startTerm, endTerm, true, true);

One workaround is to use the following construct:
TermRangeQuery.newStringRange(String field, String lowerTerm, String upperTerm, boolean includeLower, boolean includeUpper);
To use this, convert the epoch timestamp to a String value.

To answer your BytesRef question directly: the LongPoint field type (https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/document/LongPoint.html) has a static pack method which does this.
It seems easier, however, to use the LongPoint field at index time and query with LongPoint.newRangeQuery(String field, long[] lowerValue, long[] upperValue).
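The reason a plain decimal string doesn't work for TermRangeQuery is that terms are compared as unsigned bytes, so "9" sorts after "10". LongPoint's pack encodes each long as sign-flipped big-endian bytes so that byte order matches numeric order. A minimal plain-Java sketch of that encoding idea (class and method names are mine, not Lucene's):

```java
public class SortableLongBytes {
    // Encode a long into 8 big-endian bytes with the sign bit flipped, so that
    // unsigned lexicographic byte order matches signed numeric order. This mirrors
    // the idea behind LongPoint.pack / NumericUtils.longToSortableBytes.
    static byte[] encode(long value) {
        long sortable = value ^ 0x8000000000000000L; // flip sign bit
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) (sortable & 0xFF);
            sortable >>>= 8;
        }
        return out;
    }

    // Unsigned lexicographic comparison of two 8-byte arrays.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < 8; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return 0;
    }

    public static void main(String[] args) {
        long now = 1_700_000_000_000L;                   // an epoch-millis timestamp
        long later = now + 2L * 365 * 24 * 3600 * 1000;  // roughly two years later
        System.out.println(compare(encode(now), encode(later)) < 0); // true
    }
}
```

In practice you wouldn't hand-roll this: index with new LongPoint(field, Instant.now().toEpochMilli()) and query with LongPoint.newRangeQuery, which handles the encoding for you.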

Related

How to avoid Lucene tokenizing a query string containing '/' or '-'?

Good day!
In my document, I have a date field, which contains ISO-8601 date, which can also be a period like "25-08-2016/P1D"
I want to search for a document having exactly this date or period - that is, having the same value in the "date" field. Unfortunately, I was unable to do this. I tried different query strings, with and without escaping, such as:
date:"25-08-2016/P1D" - is transformed to a PhraseQuery, which fails with an exception; AFAIUI because the field is just a String field
date:"25-08-2016\/P1D" - same as above
date:25-08-2016\/P1D - does not fail with an exception, but a PhraseQuery is still created, and nothing is found
date:25-08-2016/P1D - parsing fails with "org.apache.lucene.queryparser.classic.TokenMgrError: Lexical error at line 1, column 150. Encountered: after : "/P1D""
What am I doing wrong? How can I tell Lucene to search this field using a simple string match, without any tokenization?
After some research I found that escaping the query string is the wrong way; the correct way to achieve this is to customize the query analyzer for the field in question ("date" in my case).
Map<String, Analyzer> analyzerPerField = new HashMap<>();
analyzerPerField.put("date", new WhitespaceAnalyzer());
Analyzer analyzer = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(), analyzerPerField);
QueryParser parser = new QueryParser("title", analyzer);
In the given code we use WhitespaceAnalyzer (which divides the query only at whitespace) for the "date" field, instead of the StandardAnalyzer used for the other fields, which divides text at non-letter characters. WhitespaceAnalyzer does not break the ISO-8601 date apart.
For additional details about custom analyzing/tokenizing in Lucene, please refer to e.g. http://www.hascode.com/2014/07/lucene-by-example-specifying-analyzers-on-a-per-field-basis-and-writing-a-custom-analyzertokenizer/
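To see why the per-field analyzer helps, here is a rough plain-Java illustration (not Lucene's actual tokenizers) of the difference: splitting at non-alphanumeric characters shreds the date, while splitting only at whitespace keeps it as one token.

```java
import java.util.Arrays;
import java.util.List;

public class AnalyzerSketch {
    // Approximates an analyzer that splits at non-alphanumeric characters,
    // which is roughly what happens to "25-08-2016/P1D" under the default analyzer.
    static List<String> splitAtNonAlphanumeric(String text) {
        return Arrays.asList(text.split("[^A-Za-z0-9]+"));
    }

    // Approximates WhitespaceAnalyzer: tokens are separated only by whitespace.
    static List<String> splitAtWhitespace(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        String date = "25-08-2016/P1D";
        System.out.println(splitAtNonAlphanumeric(date)); // [25, 08, 2016, P1D]
        System.out.println(splitAtWhitespace(date));      // [25-08-2016/P1D]
    }
}
```

With the date shredded into four tokens, the query parser builds a PhraseQuery that doesn't match the single stored term; the whitespace-style tokenization leaves the term intact so an exact match is possible.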

Regex match SQL values string with multiple rows and same number of columns

I tried to match the SQL values string (0),(5),(12),... or (0,11),(122,33),(4,51),... or (0,121,12),(31,4,5),(26,227,38),... and so on with the regular expression
\(\s*\d+\s*(\s*,\s*\d+\s*)*\)(\s*,\s*\(\s*\d+\s*(\s*,\s*\d+\s*)*\))*
and it works. But...
How can I ensure that the regex does not match a values string like (0,12),(1,2,3),(56,7) with different number of columns?
Thanks in advance...
As I mentioned in a comment on the question, the best way to check whether the input string is valid (i.e. contains the same count of numbers between brackets) is to use a client-side program, not plain SQL.
Implementation:
List<string> s = new List<string>(){
    "(0),(5),(12)", "(0,11),(122,33),(4,51)",
    "(0,121,12),(31,4,5),(26,227,38)", "(0,12),(1,2,3),(56,7)"};

var qry = s.Select(a => new
    {
        orig = a,
        // split into the per-bracket groups: "0,11", "122,33", ...
        newst = a.Split(new string[]{ "),(", "(", ")" },
                        StringSplitOptions.RemoveEmptyEntries)
    })
    .Select(a => new
    {
        orig = a.orig,
        // valid when the total number count divides evenly by the group count
        isValid = (a.newst
            .Sum(b => b.Split(new char[]{ ',' },
                 StringSplitOptions.RemoveEmptyEntries).Count()) %
            a.newst.Count()) == 0
    });
Result:
orig isValid
(0),(5),(12) True
(0,11),(122,33),(4,51) True
(0,121,12),(31,4,5),(26,227,38) True
(0,12),(1,2,3),(56,7) False
Note: the second Select statement takes the total count of numbers across all groups modulo the number of groups returned by the Split function. If the result isn't zero, the input string is invalid. (Strictly speaking this only checks divisibility, not that every group has the same count, but it catches the example above.)
I strongly believe there's a simpler way to achieve that, but - at this moment - I don't know it ;)
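The simpler check the answer hopes for is to compare the group sizes directly rather than sum them: validate the overall shape with the question's regex, then require every bracketed group to have the same column count. A sketch in Java (class and method names are mine):

```java
import java.util.regex.Pattern;

public class ValuesValidator {
    // The question's shape regex: one or more (n,n,...) groups separated by commas.
    private static final Pattern SHAPE = Pattern.compile(
        "\\(\\s*\\d+\\s*(\\s*,\\s*\\d+\\s*)*\\)(\\s*,\\s*\\(\\s*\\d+\\s*(\\s*,\\s*\\d+\\s*)*\\))*");

    static boolean isValid(String values) {
        if (!SHAPE.matcher(values).matches()) return false;
        // Split on the "),(" boundaries, then count columns in each group.
        String[] groups = values.split("\\)\\s*,\\s*\\(");
        int expected = -1;
        for (String g : groups) {
            int cols = g.replaceAll("[()\\s]", "").split(",").length;
            if (expected == -1) expected = cols;
            else if (cols != expected) return false; // mismatched column count
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("(0),(5),(12)"));           // true
        System.out.println(isValid("(0,11),(122,33),(4,51)")); // true
        System.out.println(isValid("(0,12),(1,2,3),(56,7)"));  // false
    }
}
```

This also avoids the divisibility blind spot of the modulo approach, since each group is compared against the first one individually.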
Unless you add some more constraints, I don't think you can solve this problem with regular expressions alone.
Regular expressions can't solve every string problem - for example, they can't check that an arbitrarily nested bracket string like "((())()(()(())))" is balanced. That's a more complicated class of problem than a regular language can express.
That's what I learnt in class :P If someone knows a way then that'd be sweet!
I'm sorry - I spent a bit of time looking into how we could turn this string into an array and do more work on it in SQL, but the built-in functionality is lacking and the solution would end up being very hacky.
I'd recommend handling this situation differently, as large-scale string computation isn't the way to go if your database will gradually fill up.
A combination of client-side and server-side validation can help prevent bad data (like the rows with extra numbers) from getting into the database.
If you need to keep those numbers, you could rework your schema to include some metadata you can use in your queries, like how many numbers there are and whether everything matches up. This information can be computed inexpensively by your server and provided to the database.
Good luck!

Convert an alphanumeric string to integer format

I need to store an alphanumeric string in an integer column on one of my models.
I have tried:
#result.each do |i|
hex_id = []
i["id"].split(//).each{|c| hex_id.push(c.hex)}
hex_id = hex_id.join
...
Model.create(:origin_id => hex_id)
...
end
When I run this in the console using puts hex_id in place of the create line, it returns the correct values, however the above code results in the origin_id being set to "2147483647" for every instance. An example string input is "t6gnk3pp86gg4sboh5oin5vr40" so that doesn't make any sense to me.
Can anyone tell me what is going wrong here or suggest a better way to store a string like the aforementioned example as a unique integer?
Thanks.
Answering by request from the OP.
What actually happens here is that hex_id.join does concatenate the hex values into one long digit string - 060003008600401100500050040 for your example input. The problem is that this number is far larger than the maximum value of a signed 32-bit integer, so when it is saved to an integer column the database clamps the out-of-range value to that maximum, 2147483647, for every record.
A better approach would be to keep the value as a string, or to use a different algorithm for producing a number from the original string - one whose result is guaranteed to fit the column's range.
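If the goal is just a unique, reversible integer for an alphanumeric ID, one option - sketched here in Java rather than Ruby for illustration - is to interpret the string as a base-36 number. The result is unique and reversible, but still far too large for a 32-bit column, so it would need a string or arbitrary-precision column.

```java
import java.math.BigInteger;

public class AlphanumericToInteger {
    // Interpret a lowercase alphanumeric string as a base-36 number.
    // Unique and reversible, but the value will not fit a 32-bit integer column.
    static BigInteger toInteger(String id) {
        return new BigInteger(id, 36);
    }

    // Recover the original string from the number.
    static String fromInteger(BigInteger n) {
        return n.toString(36);
    }

    public static void main(String[] args) {
        String id = "t6gnk3pp86gg4sboh5oin5vr40";
        BigInteger n = toInteger(id);
        System.out.println(fromInteger(n).equals(id)); // true: round-trips
    }
}
```

Unlike the per-character hex approach, this mapping is collision-free (two different IDs can never produce the same number), which matters if the column is meant to be unique.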

How to suggest only single words but index phrases

I'm having some trouble right now. I'm using ShingleAnalyzerWrapper to index phrases, but I need the SpellChecker to suggest only single words.
How can I index phrases but search for both phrases and single words with SpellChecker?
Please give me some advice.
Use this constructor for your ShingleAnalyzerWrapper:
ShingleAnalyzerWrapper(Analyzer defaultAnalyzer,
int minShingleSize,
int maxShingleSize,
String tokenSeparator,
boolean outputUnigrams,
boolean outputUnigramsIfNoShingles)
passing true as the fifth argument (outputUnigrams). This will index all single tokens regardless of what your minShingleSize is. (Note that ShingleFilter does not allow a minShingleSize below 2, so the outputUnigrams flag is the way to get single words into the index.)
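To illustrate what outputUnigrams changes, here is a plain-Java sketch (not Lucene's actual ShingleFilter) that emits two-word shingles from a token stream and, when the flag is set, the individual tokens as well:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {
    // Emit word bigrams ("shingles") from a token stream and, when
    // outputUnigrams is true, also emit each individual token.
    static List<String> shingles(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) out.add(tokens.get(i));
            if (i + 1 < tokens.size()) out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("please", "divide", "this");
        System.out.println(shingles(tokens, false)); // [please divide, divide this]
        System.out.println(shingles(tokens, true));  // [please, please divide, divide, divide this, this]
    }
}
```

With the unigrams present in the index, SpellChecker can be built over the single-word terms while phrase queries still match the shingles.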

Lucene: query parser is not working as expected

I'm using Lucene.Net, but I'm sure this still applies to the non-.Net flavour.
This is my query:
Collection:drwho AND Format:"Blu-ray"
This is what the query parser does to it:
{+Collection:drwho +Format:"blu ray"}
This is clearly not what I am after. This is the code I'm using:
Dim analyzer = New StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29)
Dim qp = New QueryParser(Lucene.Net.Util.Version.LUCENE_29, Nothing, analyzer)
Dim q As Query = qp.Parse(query)
Any ideas on why the query is being butchered? Going by http://lucene.apache.org/java/3_4_0/queryparsersyntax.html, I cannot for the life of me see what is wrong with my query...
For NOT_ANALYZED fields you should either create a TermQuery in your code or use KeywordAnalyzer, since those require an exact match between the term in the index and the term in your query (your input is stored as Blu-ray in the index). Other analyzers process the input and convert Blu-ray to, for example, blu ray, as you have already noticed.
If you change your field to ANALYZED and use StandardAnalyzer while indexing, your query would also work in its current form.