How to index documents that have the most common n-grams in their summaries using Lucene

I have a set of articles and their corresponding summaries that have been generated using BART. These summaries, however, contain repeated n-grams. I need to index all the articles whose summaries contain these repetitions using Lucene.
I am new to Java and to Lucene, but I have been instructed to use it.
I came across How to have ngram tokenizer in lucene 4.0? but I cannot work out how to apply the n-gram tokenizer to the summaries for my use case and then retrieve the corresponding articles.
Please help!
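Since the question has no answer yet, one pragmatic starting point (a minimal sketch in plain Java, independent of Lucene; the class and method names are hypothetical) is to first detect which summaries contain repeated word n-grams, and then index only the matching articles with Lucene:

```java
import java.util.*;

public class RepeatedNgrams {
    // Returns the word n-grams that occur more than once in the text.
    static Set<String> repeatedNgrams(String text, int n) {
        String[] words = text.toLowerCase().split("\\W+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= words.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        Set<String> repeated = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() > 1) repeated.add(e.getKey());
        return repeated;
    }

    public static void main(String[] args) {
        String summary = "the cat sat on the mat and the cat sat quietly";
        // Bigrams "the cat" and "cat sat" each occur twice.
        System.out.println(repeatedNgrams(summary, 2));
    }
}
```

If `repeatedNgrams(summary, n)` is non-empty, add that article's document to the Lucene index; this keeps the repetition check separate from the indexing itself, which may be simpler than building a custom n-gram analyzer as a first Lucene project.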

Related

Lucene 4.*: FuzzyQuery and finding positions of hits

The various examples I see about how to find positions of the matches returned by an IndexSearcher either retrieve the document's content and search a TokenStream, or index the positions and offsets in the term vectors, turn the query into a term, and look it up in the term vector.
But what happens when I use a FuzzyQuery? Is there a way to know which term(s) exactly matched in the hit so that I may look for them in the term vector of this document?
In case that's of any value, I'm new to Lucene and my goal here is to annotate a set of documents (the ones indexed in Lucene) with a set of terms, but the documents are from scanned texts and contain OCR errors, therefore I must use a FuzzyQuery. I thought about using lucene-suggest to do some spellchecking beforehand, but it occurred to me that this boils down to the same problem of finding fuzzy matches.

Do Lucene's analyzers create a tf-idf representation?

I'd like to know if Lucene's analyzers use the tf-idf representation for building the index.
Thanks
No: Analyzers just break a document into a stream of tokens.
IndexWriter is an analysis consumer that builds an inverted index, recording raw statistics like how many occurrences of the term appear in the document and how many documents contain the term.
But this isn't a tf-idf representation: the index format is independent of the scoring model.
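As a rough illustration of that answer, here is a toy inverted index in plain Java (not Lucene's actual data structures) that records only raw per-document occurrence counts; weighting schemes such as tf-idf are applied later by the scoring model, not stored in the index:

```java
import java.util.*;

public class ToyInvertedIndex {
    // term -> (docId -> raw occurrence count); no weights are stored.
    static Map<String, Map<Integer, Integer>> index(List<String> docs) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId).toLowerCase().split("\\W+")) {
                postings.computeIfAbsent(token, t -> new TreeMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("to be or not to be", "to do is to be");
        // e.g. "to" -> {0=2, 1=2}: raw counts only; tf-idf (or BM25) is computed at query time.
        System.out.println(index(docs));
    }
}
```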

Lucene: how do I assign weights to the different search terms at query time?

I have a Lucene indexed corpus of more than 1 million documents.
I am searching for named entities such as "Susan Witting" by using the Lucene Java API for queries.
I would like to expand my queries by also searching for "Sue Witting" for example but would like that term to have a lower weight than the main query term.
How can I go about doing that?
I found information about the boosting option in the Lucene manual. But it seems to be set at indexing time and to require fields.
You can boost each query clause independently. See the Query Javadoc.
If you want to give different weights to the individual words of a query, then
Query#setBoost(float)
is not what you need, because it boosts the whole query at once. A better way is to let QueryParser apply per-term boosts:
Query query = parser.parse("stand^3 firm^2 always");
This gives each word in the same query its own weight: here, stand is boosted by three, firm by two, and always keeps the default boost of 1. (Note that the ^ syntax is interpreted by QueryParser; a plain Term stores its text literally and does not parse boosts.)

Lucene to perform document similarity

I have written code to find the similarity between two documents by computing their term frequencies and then the cosine between them. But when I was looking at the standard examples for Lucene, every program made use of an index.
My process involves comparing one reference document against other documents from a folder.
Do you think I should use indexing?
Check out the MoreLikeThis class.
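For a small folder of documents, the tf-then-cosine process the question describes can also be sketched in plain Java without any index (a minimal illustration under that assumption, not the Lucene approach; `tf` and `cosine` are hypothetical helper names):

```java
import java.util.*;

public class CosineSim {
    // Raw term-frequency vector of a document.
    static Map<String, Integer> tf(String doc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : doc.toLowerCase().split("\\W+"))
            counts.merge(t, 1, Integer::sum);
        return counts;
    }

    // Cosine of the angle between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        double na = 0, nb = 0;
        for (int v : a.values()) na += (double) v * v;
        for (int v : b.values()) nb += (double) v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        String ref = "lucene builds an inverted index";
        String other = "an inverted index is built by lucene";
        System.out.printf("%.3f%n", cosine(tf(ref), tf(other)));
    }
}
```

An index (and MoreLikeThis) starts to pay off when the folder is large or when you need repeated comparisons, since the term statistics are then computed once rather than per comparison.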

Lucene: output elaborated data by adding IR information to it

I need to process a database in order to add meta-information such as tf-idf weights to the documents' terms.
Next, I need to create document pairs with similarity measures such as tf-idf cosine similarity, etc.
I'm planning to use Apache Lucene for this task. I'm actually not interested in retrieval or running queries, but in indexing the data and processing it in order to generate an output file with the above-mentioned document pairs and similarity scores. The next step would be to pass these results to a Weka classifier.
Can I easily do it with Lucene ?
thanks
Try Integrating Apache Mahout with Apache Lucene and Solr. Replace the places that say "Mahout" with "Weka". Good Luck.
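For the tf-idf weighting step specifically, here is a minimal plain-Java sketch over a toy corpus, assuming the common weighting tf(t,d) * ln(N/df(t)) (Lucene and Mahout compute comparable statistics at scale, with their own formulas):

```java
import java.util.*;

public class TfIdf {
    // tf-idf weight per term for each document: tf(t,d) * ln(N / df(t)).
    static List<Map<String, Double>> tfidf(List<String> docs) {
        List<Map<String, Integer>> tfs = new ArrayList<>();
        Map<String, Integer> df = new HashMap<>();   // document frequency per term
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : doc.toLowerCase().split("\\W+"))
                tf.merge(t, 1, Integer::sum);
            tfs.add(tf);
            for (String t : tf.keySet()) df.merge(t, 1, Integer::sum);
        }
        List<Map<String, Double>> weights = new ArrayList<>();
        for (Map<String, Integer> tf : tfs) {
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet())
                w.put(e.getKey(),
                      e.getValue() * Math.log((double) docs.size() / df.get(e.getKey())));
            weights.add(w);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> w = tfidf(List.of("apple banana apple", "banana cherry"));
        // "banana" appears in both documents, so its idf (and hence its weight) is 0.
        System.out.println(w.get(0));
    }
}
```

The resulting per-document weight vectors can then be paired up, scored with cosine similarity, and written out in whatever format the Weka classifier expects.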