Lucene: output elaborated data by adding IR information to it

I need to process a database in order to add meta-information, such as tf-idf weights, to the documents' terms.
Then I need to create document pairs with similarity measures such as tf-idf cosine similarity.
I'm planning to use Apache Lucene for this task. I'm actually not interested in retrieval or running queries, but in indexing the data and processing it to generate an output file with the above-mentioned document pairs and similarity scores. The next step would be to pass these results to a Weka classifier.
Can I easily do this with Lucene?
Thanks.

Try the article "Integrating Apache Mahout with Apache Lucene and Solr", replacing the places that say "Mahout" with "Weka". Good luck.
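For the computation itself, here is a minimal sketch (my own illustration, not from that article) of how tf-idf vectors could be read out of a Lucene index and compared pairwise. It assumes Lucene 6+ and a field named "content" indexed with term vectors enabled (FieldType.setStoreTermVectors(true)); the CSV output format is also an assumption.

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class TfIdfPairs {

        // Build a sparse tf-idf vector for one document from its term vector.
        static Map<String, Double> tfIdf(IndexReader reader, int docId, String field)
                throws Exception {
            Map<String, Double> vec = new HashMap<>();
            Terms terms = reader.getTermVector(docId, field);
            if (terms == null) return vec; // term vectors were not stored
            int numDocs = reader.numDocs();
            TermsEnum te = terms.iterator();
            for (BytesRef t = te.next(); t != null; t = te.next()) {
                String word = t.utf8ToString();
                long tf = te.totalTermFreq();                   // frequency in this doc
                int df = reader.docFreq(new Term(field, word)); // docs containing it
                double idf = Math.log((double) numDocs / (df + 1.0));
                vec.put(word, tf * idf);
            }
            return vec;
        }

        // Cosine similarity between two sparse vectors.
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double w = b.get(e.getKey());
                if (w != null) dot += e.getValue() * w;
                na += e.getValue() * e.getValue();
            }
            for (double w : b.values()) nb += w * w;
            return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) throws Exception {
            try (IndexReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
                // Emit one line per document pair: id1, id2, cosine score.
                for (int i = 0; i < reader.maxDoc(); i++) {
                    Map<String, Double> vi = tfIdf(reader, i, "content");
                    for (int j = i + 1; j < reader.maxDoc(); j++) {
                        Map<String, Double> vj = tfIdf(reader, j, "content");
                        System.out.println(i + "," + j + "," + cosine(vi, vj));
                    }
                }
            }
        }
    }

The resulting pair/score lines can then be handed to Weka. Note that the pairwise loop recomputes vectors on every pass; it is meant to show the API, not to scale.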

Interpret the Doc2Vec Vectors Clusters Representation

I am new to Doc2Vec, so please bear with the naive questions.
I have generated Doc2Vec vectors, i.e. using the 'Paragraph Vector' algorithm.
I have an array output for each document.
I use the model's most_similar() for doc1 and get the output: doc5 and doc10 are similar to doc1.
Q1) How can I summarize, using code, the important words or high-level summary this document holds?
In addition, if I use the array output and run k-means to get 5 clusters, how do I define each cluster?
Q2) I can read the documents, but the number of documents is very high, and manually reading them to work out a cluster definition is not possible.
There's no built-in 'summarization' function for Doc2Vec doc-vectors (or clusters of them).
Theoretically, the model could do something that's sort of the opposite of doc-vector inference. It could take a doc-vector – perhaps one corresponding to an existing document – run the model "forward", and read out the activation levels of all its output nodes. At least in models using the default negative sampling, those nodes map one-to-one with known vocabulary words, and you could plausibly sort/scale those activation levels to find the top-N words "most associated" with that doc-vector.
You could look at the predict_output_word() method source of Word2Vec to get a rough idea of how such a calculation could work:
https://github.com/RaRe-Technologies/gensim/blob/3514d3fb9224280edd8ddd14c46b722220df5436/gensim/models/word2vec.py#L1131
As mentioned, this isn't an existing capability, and I don't know of an online source for code to do such a calculation. But, if it were implemented, it would be a welcome contribution.
(I'm not sure what your Q2 question actually is.)

Is it possible to obtain, alter and replace the tfidf document representations in Lucene?

Hi guys,
I'm working on some ranking related research. I would like to index a collection of documents with Lucene, take the tfidf representations (of each document) it generates, alter them, put them back into place and observe how the ranking over a fixed set of queries changes accordingly.
Is there any non-hacky way to do this?
Your question is too vague to have a clear answer, especially on what you plan to do with:
take the tfidf representations (of each document) it generates, alter them
Lucene stores raw values for scoring:
CollectionStatistics
TermStatistics
Per term/doc pair stats: PostingsEnum
Per field/doc pair: norms
All this data is managed by Lucene and is used to compute a score for a given query term. A custom Similarity class can be used to change the formula that generates this score.
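For example, a minimal sketch (my own illustration, not from this answer) of changing the formula by subclassing ClassicSimilarity, Lucene's classic tf-idf implementation:

    import org.apache.lucene.search.similarities.ClassicSimilarity;

    // A sketch assuming the classic vector-space scoring is in use.
    public class DampedTfSimilarity extends ClassicSimilarity {

        @Override
        public float tf(float freq) {
            // Replace the default sqrt(freq) with logarithmic damping.
            return (float) (1.0 + Math.log(1.0 + freq));
        }

        @Override
        public float idf(long docFreq, long docCount) {
            // Same shape as the classic idf, shown to illustrate that it
            // can be overridden as well.
            return (float) (Math.log((double) docCount / (docFreq + 1)) + 1.0);
        }
    }

The same instance has to be set on the IndexWriterConfig (setSimilarity) before indexing and on the IndexSearcher at query time, because norms are computed at index time.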
But you have to consider that a search query is made of multiple terms, and the way the scores of individual terms are combined can be changed as well. You could use existing Query classes (e.g. BooleanQuery, DisjunctionMaxQuery), but you could also write your own.
So it really depends on what you want to do with all this, but note that if you want to change the raw values stored by Lucene, this is going to be rather hard: you'll have to write a custom Lucene codec, and probably most of the query stack, to take advantage of your new data.
One nice thing you should consider is the possibility of storing arbitrary byte[] payloads. This way you could store a value computed outside of Lucene and use it in a custom Similarity or Query.
Please see the following tutorials: Getting Started with Payloads and Custom Scoring with Lucene Payloads; they may give you some ideas.
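As a sketch of the payload route (my own illustration; the '|' delimiter convention comes from DelimitedPayloadTokenFilter, the other names are assumptions, and the PayloadScoreQuery constructor shown is the Lucene 6.x form; 7+ adds a PayloadDecoder argument):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
    import org.apache.lucene.analysis.payloads.FloatEncoder;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.payloads.AveragePayloadFunction;
    import org.apache.lucene.queries.payloads.PayloadScoreQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class PayloadSketch {

        // Index side: tokens arrive as "term|weight" (e.g. "ranking|2.5
        // research|0.7") and the float after '|' is stored as a payload
        // next to each posting.
        static Analyzer payloadAnalyzer() {
            return new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName) {
                    WhitespaceTokenizer source = new WhitespaceTokenizer();
                    TokenStream sink =
                        new DelimitedPayloadTokenFilter(source, '|', new FloatEncoder());
                    return new TokenStreamComponents(source, sink);
                }
            };
        }

        // Query side: fold the stored payload values into the score.
        static PayloadScoreQuery payloadQuery(String field, String word) {
            return new PayloadScoreQuery(
                new SpanTermQuery(new Term(field, word)),
                new AveragePayloadFunction(),
                true); // true: combine with the underlying span score
        }
    }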

What is the use of Lucene index files in DBpedia Spotlight?

I am trying to find named entities in a given text. For that, I have tried using the DBpedia Spotlight service.
I am able to get a response out of it. However, the DBpedia dataset is limited, so I tried replacing their spotter.dict file with my own dictionary. My dictionary contains one entity per line:
Sachin Tendulkar###PERSON
Barack Obama ###PERSON
.... etc
Then I parse this file and build an ExactDictionaryChunker object.
Now I am able to get the entities and their types (after modifying the DBpedia code).
My question is: DBpedia Spotlight is using Lucene index files, and I really don't understand what purpose they serve.
Can we do it without using index files? What's the significance of the index files?
Lucene was used in the earlier implementation of DBpedia Spotlight to store a model of each entity in our KB. This model is used to give us a relatedness measure between the context (extracted from your input text) and the entity. More concretely, each entity is represented by a vector {t1: score1, t2: score2, ...}. At runtime we model your input text as a vector in the same dimensions and measure the cosine between the input vector and the entity vectors. In your case, you would have to add a vector for Sachin Tendulkar to the space (add a document to the Lucene index) in case it is not already there.
The latest implementation, though, has moved away from Lucene to an in-house in-memory context store: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)

Lucene: how do I assign weights to the different search terms at query time?

I have a Lucene-indexed corpus of more than 1 million documents.
I am searching for named entities such as "Susan Witting" by using the Lucene Java API for queries.
I would like to expand my queries by also searching for "Sue Witting", for example, but would like that term to have a lower weight than the main query term.
How can I go about doing that?
I found info about the boosting option in the Lucene manual, but it seems to be set at indexing time, and it needs fields.
You can boost each query clause independently. See the Query Javadoc.
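For instance, a minimal sketch (my own illustration; the field name "body" and the 0.3 factor are assumptions) of keeping the main phrase at full weight while down-weighting the expansion, using BoostQuery (which replaces setBoost in recent Lucene versions):

    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;

    public class ExpandedQuery {
        public static Query build() {
            // Main clause at full weight, expanded variant down-weighted.
            Query main = new PhraseQuery("body", "susan", "witting");
            Query expansion = new PhraseQuery("body", "sue", "witting");
            return new BooleanQuery.Builder()
                .add(main, BooleanClause.Occur.SHOULD)
                .add(new BoostQuery(expansion, 0.3f), BooleanClause.Occur.SHOULD)
                .build();
        }
    }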
If you want to give different weights to the individual words of a query, then
Query#setBoost(float)
is not useful. A better way is to use the query parser's boost syntax:
Query query = new QueryParser("some_key", analyzer).parse("stand^3 firm^2 always");
This gives different weights to words in the same query: here, stand is boosted by three, firm by two, and always keeps the default boost value. (The ^ boost markers are interpreted by QueryParser; a raw Term would treat them as literal text.)

Lucene to perform document similarity

I have written code to find the similarity between two documents by computing their term frequencies and then their cosine similarity. But when I was looking at the standard examples for Lucene, every program made use of an index.
My process involves a comparison between one reference document and other documents from a folder.
Do you think I should use indexing?
Check out the MoreLikeThis class.
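A minimal sketch (my own illustration; the index path, field name, and tuning values are assumptions) of using MoreLikeThis once the folder's documents have been indexed:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class SimilarDocs {
        public static void main(String[] args) throws Exception {
            try (IndexReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                MoreLikeThis mlt = new MoreLikeThis(reader);
                mlt.setAnalyzer(new StandardAnalyzer());
                mlt.setFieldNames(new String[] {"content"}); // assumed field
                mlt.setMinTermFreq(1);  // loosen defaults for small corpora
                mlt.setMinDocFreq(1);
                Query query = mlt.like(0); // Lucene doc id of the reference doc
                TopDocs hits = searcher.search(query, 10);
                for (ScoreDoc sd : hits.scoreDocs) {
                    System.out.println("doc " + sd.doc + "  score " + sd.score);
                }
            }
        }
    }

The index is what makes this cheap: document frequencies and term vectors are precomputed, so you get tf-idf-weighted similarity without re-scanning the folder for every comparison.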