Using FieldSelector when searching with Lucene - lucene

I'm searching articles in PubMed via Lucene.
Each of the 20,000,000 articles has an abstract with ~250 words and an ID.
At the moment I store my searches, with each take multiple seconds, in a TopDocs object.
Searchs can find thousands of articles.
I'm just interested in the ID of the article.
Does Lucene load the abstracts internally into the TopDocs?
If so can I prevent that behavior through FieldSelectors or do FieldSelectors only work with IndexReader and don't work with IndexSearcher?

No, Lucene does not load the values of fields into TopDocs. TopDocs only contains the doc number and score for each one of the matching documents.
If you're having performance issues, here's another SO question that can help you:
Optimizing Lucene performance

Lucene, by default, does not load any stored fields. If you want to retrieve only the ID field, and if you can afford to load up all the IDs in memory, then you can load all values as follows and reuse them.
String[] allIDs = FieldCache.DEFAULT.getStrings(indexReader, "IDFieldName")
Please check the answer for FieldCache. Best way to retrieve certain field of all documents returned by a Lucene search

You're on the right lines.
Try using a SetBasedFieldSelector when you retrieve the document from the index.
As another poster noted, iterating through the hits will return a ScoreDoc object. This will give you the document Id that can be used to retrieve the document using the IndexReader associated with the IndexSearcher.
If IO is a problem because of loading fields you aren't interested in, you should be in for a pleasant surprise.
Hope this helps,

Related

Is it possible to obtain, alter and replace the tfidf document representations in Lucene?

Hej guys,
I'm working on some ranking related research. I would like to index a collection of documents with Lucene, take the tfidf representations (of each document) it generates, alter them, put them back into place and observe how the ranking over a fixed set of queries changes accordingly.
Is there any non-hacky way to do this?
Your question is too vague to have a clear answer, esp. on what you plan to do with :
take the tfidf representations (of each document) it generates, alter them
Lucene stores raw values for scoring :
CollectionStatistics
TermStatistics
Per term/doc pair stats : PostingsEnum
Per field/doc pair : norms
All this data is managed by lucene and will be used to compute a score for a given query term. A custom Similarity class can be used to change the formula that generates this score.
But you have to consider that a search query is made of multiple terms, and the way the scores of individual terms are combined can be changed as well. You could use existing Query classes (e.g. BooleanQuery, DisjunctionMax) but you could also write your own.
So it really depends on what you want to do with of all this but note that if you want to change the raw values stored by lucene this is going to be rather hard. You'll have to write a custom lucene codec and probably most the query stack to take benefit of your new data.
One nice thing you should consider is the possibility to store an arbitrary byte[] payloads. This way you could store a value that would have been computed outside of lucene and use it in a custom similarity or query.
Please see the following tutorials: Getting Started with Payloads and Custom Scoring with Lucene Payloads it may you give some ideas.

Lucene query documents that don't have a specific field

I am using Lucene in android to search my content. I have two types of documents and one has a trashed field which is either true or false. The other type of document does not have that field. I want to return all documents that have trashed:false, or don't have the trashed field.
I have tried add -trashed:true to my query, which returns all the correct documents, but it messes up the offsets of the search surround a different word and not the one I am searching for.
EDIT:
I have to add this to every search query I perform. I have an index of approximately 20,000 documents and I would really like to not have to rebuild it because I had my users rebuild their indices my last release. Note: this is on android devices so it takes a long time and a lot of battery to reindex all of their documents.
Thanks for the help.
I can think of following solution.
1) If you can rebuild the index.
Add trashed:na field-value to the docs for which "trashed" is not applicable.
To get all the docs with trashed:false or "trashed" is not applicable, you can use following..
Query: trashed:false OR trashed:na
2) If you cannot rebuild the index, I am not sure...

Getting lucene to return only unique threads (indexing both threads and posts)

I have a StackOverflow-like system where content is organised into threads, each thread having content of its own (the question body / text), and posts / replies.
I'm producing the ability to search this content via Lucene, and if possible I have decided I would like to index individual posts, (it makes the index easier to update, and means I have more control and ability to tweak the results), rather than index entire threads. The problem I have however is that I want the search to display a list of threads, rather than a list of posts.
How can I get Lucene to return only unique threads as results, while also searching the content of the posts?
Each document can have a "threadId" field. After running a search, you can loop through your result set and return all the unique threadId's.
The tricky part is specifying how many results you want to return. If you want to show say, 10 results on your results page, you'll probably need Lucene to return 10 + m results, since a certain percentage of the return set will be de-duped out, because they are posts belonging to the same thread. You'll need to incorporate some extra logic that will run another Lucene search if the deduped set is < 10.
This is what the Nutch project does when collapsing multiple search results that belong to the same domain.
When you index the threads, you should break each thread into postings and make each post a Document with a field containing a unique id identifying the thread to which it belongs.
When you do the search implementation, I would recommend using lucene 2.9 or later, which enables you to use a Collector. Collectors lets you preprocess the retrieved documents and thereby you'll be able to group together posts that originate from the same thread-id.
Just for completenes, latest Lucene versions (from 3.2 onwards) support a grouping API that is very useful for this kind of use-cases:
http://lucene.apache.org/java/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html

Collect all hits for a search in Lucene / Optimization

Summary: I collect the doc ids of all hits for a given search by using a custom Collector (it populates a BitSet with the ids). The searching and getting doc ids are quite fast according to my needs but when it comes to actually fetching the documents from disk, things get very slow. Is there a way to optimize Lucene for faster document collection?
Details: I'm working on a processed corpus of Wikipedia and I keep each sentence as a separate document. When I search for "computer", I get all sentences containing the term computer. Currently, searching the corpus and getting all document ids work in sub-second but fetching the first 1000 documents takes around 20 seconds. Fetching all documents takes proportionally more time (i.e. another 20 sec for each 1000-document batch).
Subsequent searches and document fetching takes much less time (though, I don't know who's doing the caching, OS or Lucene?) but I'll be searching for many diverse terms and I don't want to rely on caching, the performance on the very first search is crucial for me.
I'm looking for suggestions/tricks that will improve the document-fetching performance (if it's possible at all). Thanks in advance!
Addendum:
I use Lucene 3.0.0 but I use Jython to drive Lucene classes. Which means, I call the get_doc method of the following Jython class for every doc id I retrieved during the search:
class DocumentFetcher():
def __init__(self, index_name):
self._directory = FSDirectory.open(java.io.File(index_name))
self._index_reader = IndexReader.open(self._directory, True)
def get_doc(self, doc_id):
return self._index_reader.document(doc_id)
I have 50M documents in my index.
You, probably, are storing lot of information in the document. Reduce the stored fields to as much as you can.
Secondly, while retrieving fields, select only those fields which you need. You can use following method of IndexReader to specify only few of the stored fields.
public abstract Document document(int n, FieldSelector fieldSelector)
This way you don't load up fields which are not used.
You can utilize following code sample.
FieldSelector idFieldSelector =
new SetBasedFieldSelector(Collections.singleton("idFieldName"), Collections.emptySet());
for (int i: resultDocIDs) {
String id = reader.document(i, idFieldSelector).get("idFieldName");
}
Scaling Lucene and Solr discusses many ways to improve Lucene performance.
As you are working on Lucene search within Wikipedia, you may be interested in Rainman's Lucene Search of Wikipedia. He mostly discusses algorithms and less performance, but this may still be relevant.

Bubbling up newest content in lucene search results

I am storing various articles in my lucene index.
When user searches for articles which contain a specific term or phrase,I need to show all th articles (could be anywhere between 1000 to 10000 articles) but with newest articles "bubbled up" in the search results.
I believe you can bubble up a search result in Lucene using "Date field Boosting".
Can someone please give me the details of how to go about this?
Thanks in advance!
I would implement the SortComparatorSource interface. You should write a new ScoreDocComparator, whose compare() function compares two dates. Then you will need to sort your searches using the new sorter. This advice is taken from chapter 6 of Lucene in Action.
You can use the setBoost method to set the "boost" for a particular document in the index at index time. Since the default boost value is 1.0, setting a value less than 1.0 will make the document "less relevant" in search results. By tying the boost value of a document to its age (lower boost the older the document gets), you can make newer content seem more relevant in search results.
Note in the documentation for setBoost that the boost value set at indexing time is not available for retrieved documents (boost works, you just can't read the value back at retrieval time to see if you applied the correct value at index time).