Get words corresponding to a match from SpanNearQuery in Lucene - lucene

I would need to retrieve the words in my text that correspond to a match of Spans returned by SpanNearQuery.getSpans(). For instance, if my text is [a b c d e f] and I use SpanNearQueries with queries 'b' and 'e' (and sufficient slop), then I get a match 'b c d e' in my text. Now, how can I most efficiently retrieve the words as they appear in the match, that is, the sequence of words 'b c d e' itself?
Here is an example code of what I would need:
SpanNearQuery allNear = new SpanNearQuery(spansTermQueries, numWordsInBetween, true);
Spans allSpans = allNear.getSpans(reader);
Now I would like to iterate over all the matches in allSpans, and for each match retrieve the exact words in the text that correspond to that match, including the words between the query terms.
One indirect way is to get the end and start position of that match, read through the text document using a file reader, and find the string of text between position 'end' and 'start'. But that does not seem a very efficient way. It seems that this information should already be stored in the Lucene Index.
Would anyone know of a more direct way of retrieving the words between the queries in a match?
Thanks.

What you want to do is highlighting. You can use either the plain highlighter or, if you store term vectors, the fast vector highlighter.
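Whichever highlighter is used, the positions a span reports already identify the matched words. Here is a minimal sketch in plain Python (not Lucene code) of that mapping. It assumes the document's tokens are available as a list in position order, and that the start position is inclusive and the end position exclusive, as with Lucene's Spans. The function name span_words is hypothetical.

```python
def span_words(tokens, start, end):
    """Return the words covered by a span [start, end) over token positions."""
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span positions fall outside the token list")
    return tokens[start:end]


tokens = "a b c d e f".split()
# A match for 'b' ... 'e' starts at position 1 and ends just past position 4.
print(" ".join(span_words(tokens, 1, 5)))
```

How the token list is obtained (stored term vectors, or re-analyzing the stored text) is left open here.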

Related

how to get the longest string in an array in Openrefine

With GREL is it possible to get the longest string of an array ?
For example, if I have an array with 3 strings ["a","aaa","aa"], I want to obtain "aaa".
You can probably do that, but at the cost of a very complicated formula. This is exactly the kind of case for which OpenRefine added Python (and Clojure) as scripting languages. Even if you don't know Python, you can find the answer to the question "how to choose the longest string in a list?" in two minutes and simply copy and paste it (using "return" instead of "print").
In this case:
return max(['a','aaa','aaaa','aa'], key=len)
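If the strings live in a single delimited cell rather than in a literal list, the same idea applies to the split cell value. This is a small sketch that assumes a comma-separated cell; the helper name longest_item and the separator are illustrative. Note that max returns the first of several equally long strings.

```python
def longest_item(cell_value, sep=","):
    """Return the longest piece of a delimited string (first one wins on ties)."""
    return max(cell_value.split(sep), key=len)


print(longest_item("a,aaa,aa"))
```

In an OpenRefine Python/Jython expression, where the cell content is available as value, the body would reduce to return max(value.split(','), key=len).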
EDIT
Just for the sake of the challenge, here is a possible solution with GREL.
value = "a,aa,aaaa,aa"
forEach(value.split(','), e, if(length(e)==sort(forEach(value.split(','), e, e.length()))[-1], e, null)).join(',').split(',')

Regex matching sequence of characters

I have a test string such as: The Sun and the Moon together, forever
I want to be able to type a few characters or words and be able to match this string if the characters appear in the correct sequence together, even if there are missing words. For example, the following search word(s) should all match against this string:
The Moon
Sun tog
Tsmoon
The get ever
What regex pattern should I be using for this? I should add that the supplied test strings are going to be dynamic within an app, and so I'd like to be able to use a pattern based on the search string.
Your example Tsmoon shows partial words (T), ignores case (s, m), and allows anything between the entered characters. So as a first attempt you can:
Set the ignore case option
Between each character of the input, insert the regular expression to match zero or more of anything. You can choose whether to match the shortest or longest run.
Try that, reading the documentation for NSRegularExpression if you're stuck, and see how it goes. If you get stuck ask a new question showing your code and the RE constructed and explain what happens/doesn't work as expected.
HTH
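The two steps above can be sketched as follows. This uses Python's re module rather than NSRegularExpression, so treat it as an illustration of how the pattern is built, not as app code. It assumes that whitespace in the search string is ignored and that every other character is matched literally; the function name build_sequence_pattern is hypothetical.

```python
import re


def build_sequence_pattern(search, lazy=True):
    """Compile a pattern that matches the characters of `search` in order."""
    # ".*?" matches the shortest run between characters, ".*" the longest.
    gap = ".*?" if lazy else ".*"
    # Escape each character so that input like "." or "(" is taken literally.
    chars = [re.escape(c) for c in search if not c.isspace()]
    return re.compile(gap.join(chars), re.IGNORECASE)


text = "The Sun and the Moon together, forever"
for query in ["The Moon", "Sun tog", "Tsmoon", "The get ever"]:
    print(query, bool(build_sequence_pattern(query).search(text)))
```

A search string whose characters appear in a different order, such as "Moon Sun", does not match this text.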

How to make the first n words more important in Lucene

I want to make the first n words of a document (n being a value I set) more important than the rest of the document in Lucene. How can I do that? I found something about boosting, but that boosts a whole field to make it more important, and my document is supposed to have only one field.
Would numbering the words at indexing time and boosting them be a solution? Something like this:
TextField myField = new TextField("text",termAtt.toString(),Store.YES);
myField.setBoost(2);
document.add(myField);
as long as I haven't reached the n-th word in my document?
I want to get the following result: let's say that the first 20 words in a document are more important than the rest. I have 2 identical documents that have more than 20 words. I add the word I am searching for to one document as the first word and to the second document as the last word, and I want the first document to get a higher score.
The best approach would be to simply create two different fields, one containing the higher value portion of the text (this wouldn't need to be stored), and the next containing the full text:
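int leadinLength = 20;
// substring(0, n) keeps the first n characters (not words); it assumes the text has at least n characters
TextField myFieldLeadin = new TextField("text_leadin", termAtt.toString().substring(0, leadinLength), Store.NO);
TextField myField = new TextField("text", termAtt.toString(), Store.YES);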
myFieldLeadin.setBoost(2);
document.add(myFieldLeadin);
document.add(myField);
You could use a MultiFieldQueryParser to streamline searching in both fields at once, if desired, like:
Query query = new MultiFieldQueryParser(Version.LUCENE_48, new String[] {"text_leadin", "text"}, analyzer).parse("my search query");
TopDocs docs = searcher.search(query, 10);
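Because the question asks for the first n words rather than the first n characters, the lead-in text has to be cut on word boundaries before it goes into the boosted field. Here is a language-neutral sketch of that step in Python. It assumes simple whitespace tokenization, which may differ from what the analyzer does, and the function name split_leadin is hypothetical.

```python
def split_leadin(text, n_words=20):
    """Return (lead-in, full text), where the lead-in is the first n_words words."""
    words = text.split()
    return " ".join(words[:n_words]), text


leadin, full = split_leadin("one two three four five", n_words=3)
print(leadin)  # would feed the boosted "text_leadin" field
print(full)    # would feed the regular "text" field
```

A document shorter than n words simply ends up with the same content in both fields.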

XQuery Full Text Search: Text Near Element?

I am using the eXist implementation of Lucene. Is there a query that would allow me to find, for instance, all occurrences of <span>A</span> B in a document? I.e., all Bs that occur within 1 word of <span>A</span>, but aren’t wrapped in their own elements?
This XPath should do the trick:
//span[. = 'A'][following-sibling::node()[1] = ' B']
This doesn't make use of eXist's Lucene-based full text index, but you haven't said if you've applied an index to the span element here. If there's another aspect to the challenge, please let me know.

Is it possible in Lucene to find documents with the nearest string?

For example, I have the following words in the index:
abcde
adeb
bbcdefsdg
bdef
bfgtj
I want to find documents containing nearest word to given one (in the sorted list), to have something like:
NEAREST_BOTTOM(bdeb) that returns documents containing bbcdefsdg and
NEAREST_TOP(bdeb) that returns documents containing bdef
This is what the spell checker does.
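The NEAREST_BOTTOM and NEAREST_TOP behaviour described in the question amounts to a predecessor and successor lookup in a sorted term list. Here is a minimal sketch in plain Python of those semantics (it is not Lucene's spell checker). It assumes the terms are available as a sorted list, and the function names nearest_bottom and nearest_top are hypothetical.

```python
from bisect import bisect_left, bisect_right


def nearest_bottom(sorted_terms, word):
    """Return the closest term that sorts strictly before `word`, or None."""
    i = bisect_left(sorted_terms, word)
    return sorted_terms[i - 1] if i > 0 else None


def nearest_top(sorted_terms, word):
    """Return the closest term that sorts strictly after `word`, or None."""
    i = bisect_right(sorted_terms, word)
    return sorted_terms[i] if i < len(sorted_terms) else None


terms = sorted(["abcde", "adeb", "bbcdefsdg", "bdef", "bfgtj"])
print(nearest_bottom(terms, "bdeb"))
print(nearest_top(terms, "bdeb"))
```

Once the neighbouring term is known, the documents containing it can be fetched with an ordinary term query.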