Getting lucene to return only unique threads (indexing both threads and posts) - lucene

I have a StackOverflow-like system where content is organised into threads, each thread having content of its own (the question body / text), and posts / replies.
I'm building the ability to search this content via Lucene, and I have decided that, if possible, I would like to index individual posts rather than entire threads (it makes the index easier to update and gives me more control over tweaking the results). The problem, however, is that I want the search to display a list of threads rather than a list of posts.
How can I get Lucene to return only unique threads as results, while also searching the content of the posts?

Each document can have a "threadId" field. After running a search, you can loop through your result set and return all the unique threadIds.
The tricky part is specifying how many results you want to return. If you want to show, say, 10 results on your results page, you'll probably need Lucene to return 10 + m results, since a certain percentage of the returned set will be de-duped out because they are posts belonging to the same thread. You'll need to incorporate some extra logic that runs another Lucene search if the de-duped set has fewer than 10 entries.
This is what the Nutch project does when collapsing multiple search results that belong to the same domain.
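The over-fetch-and-dedupe loop described above can be sketched in plain Python, independent of Lucene. The `search_posts(offset, limit)` callable is a hypothetical stand-in for whatever runs the Lucene query and returns ranked `(thread_id, score)` pairs; the batch size and the in-memory sample data are likewise assumptions made for illustration.

```python
from typing import Callable, List, Tuple

PostHit = Tuple[str, float]  # (thread_id, score), assumed ordered by descending score


def unique_threads(search_posts: Callable[[int, int], List[PostHit]],
                   wanted: int = 10,
                   batch_size: int = 30) -> List[str]:
    """Collect up to `wanted` distinct thread ids, fetching further batches as needed."""
    seen = set()
    ordered = []          # thread ids in order of their best-ranked post
    offset = 0
    while len(ordered) < wanted:
        hits = search_posts(offset, batch_size)
        if not hits:      # no more posts match: fewer than `wanted` threads exist
            break
        for thread_id, _score in hits:
            if thread_id not in seen:
                seen.add(thread_id)
                ordered.append(thread_id)
                if len(ordered) == wanted:
                    break
        offset += len(hits)
    return ordered


# Hypothetical in-memory "index" standing in for a real Lucene search.
_POSTS = [("t1", 9.0), ("t1", 8.5), ("t2", 8.0), ("t1", 7.0), ("t3", 6.5), ("t2", 6.0)]


def fake_search(offset: int, limit: int) -> List[PostHit]:
    return _POSTS[offset:offset + limit]
```

Calling `unique_threads(fake_search, wanted=2, batch_size=2)` needs two batches here, because the first batch contains two posts from the same thread.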

When you index the threads, you should break each thread into posts and make each post a Document with a field containing a unique id identifying the thread to which it belongs.
When you implement the search, I would recommend using Lucene 2.9 or later, which enables you to use a Collector. A Collector lets you preprocess the retrieved documents, so you'll be able to group together posts that originate from the same thread id.
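As a rough illustration of what such a collector would do, the sketch below groups hits by thread and keeps the best-scoring post per thread. It is plain Python rather than the Lucene Collector API; the class name, the `collect` signature, and the `thread_of` lookup are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class ThreadGroupingCollector:
    """Keeps the best-scoring post for each thread (illustrative, not the Lucene API)."""

    def __init__(self, thread_of: Callable[[int], str]) -> None:
        # thread_of maps a document number to its thread id. How that lookup
        # is done against a real index is left open; here it is an assumption.
        self._thread_of = thread_of
        self._best: Dict[str, Tuple[float, int]] = {}  # thread id -> (score, doc id)

    def collect(self, doc_id: int, score: float) -> None:
        """Record one matching post, replacing the thread's entry if it scores higher."""
        thread_id = self._thread_of(doc_id)
        current = self._best.get(thread_id)
        if current is None or score > current[0]:
            self._best[thread_id] = (score, doc_id)

    def top_threads(self, n: int) -> List[Tuple[str, int, float]]:
        """Return up to n (thread id, best doc id, best score), highest score first."""
        ranked = sorted(self._best.items(), key=lambda item: item[1][0], reverse=True)
        return [(tid, doc, score) for tid, (score, doc) in ranked[:n]]
```

Every matching post is passed to `collect` once, and `top_threads` is read after the search finishes, so the result page is made of threads rather than posts.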

Just for completeness, recent Lucene versions (from 3.2 onwards) support a grouping API that is very useful for this kind of use case:
http://lucene.apache.org/java/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html

Related

Nested search in solr

I have an Activity model and an ActivityOccurrence model, where Activity has_many :activity_occurrences.
Activity: this model has all the metadata required by ActivityOccurrence.
ActivityOccurrence: attrs - occurrence (datetime), completed.
Now we have a new requirement: we have to show all occurrences of an activity in the search results when a user searches for activities in a particular range.
Previously we showed only one record for repeating activities.
So, as per the new requirement, we have decided to move the search from Activity to ActivityOccurrence.
Now, I don't want to index the metadata of Activity in each of my ActivityOccurrence records, as my Activity has 10 more fields than ActivityOccurrence.
For example:
if I have an Activity with 1000 ActivityOccurrences, then I will be indexing all my activity information in 1000 ActivityOccurrence records.
This will take a huge amount of space as the app grows if we index this way.
Hence, my major concern is the amount of indexing I have to do.
So I am thinking of not indexing the Activity data in ActivityOccurrence.
So is there a way to search Activity based on its filters first and then search ActivityOccurrence in the range based on the results from activities?
Note: we also have never-ending occurrences.
Any ideas?
Thanks in advance.
Unless you're dealing with millions of Activities/Occurrences, this may be a premature optimization: space is cheap, and Solr is fast. Looking at this the other way around, have you considered just indexing a list of the activity occurrences that pertain to each activity (using callbacks to ensure that it gets updated)? It's hard to really optimize without more info about your data access patterns, but I'm never a fan of doing more round-trips than necessary.
That said, while I'm not sure how to write a pure SOLR query to do this, you can do it with Sunspot pretty easily:
Make sure that ActivityOccurrence is easily searchable by Activity (i.e. by Activity ID).
Search Activity for the metadata that you want, and use this to extract the IDs that are relevant:
search = Activity.solr_search {<some block that does what you want>}
activity_ids = search.hits.map { |hit| hit.primary_key.to_i }
Now you can just add a with parameter to your ActivityOccurrence search block:
with(:activity_id, activity_ids)
This will limit the search to the occurrences for those activities. Note that you are trading off search-time performance for index efficiency with this.

Lucene query documents that don't have a specific field

I am using Lucene on Android to search my content. I have two types of documents, and one has a trashed field which is either true or false. The other type of document does not have that field. I want to return all documents that have trashed:false or don't have the trashed field.
I have tried adding -trashed:true to my query, which returns all the correct documents, but it messes up the offsets of the search: they surround a different word and not the one I am searching for.
EDIT:
I have to add this to every search query I perform. I have an index of approximately 20,000 documents and I would really like not to have to rebuild it, because I had my users rebuild their indices in my last release. Note: this is on Android devices, so it takes a long time and a lot of battery to reindex all of their documents.
Thanks for the help.
I can think of the following solution.
1) If you can rebuild the index:
Add a trashed:na field value to the docs for which "trashed" is not applicable.
To get all the docs where trashed is false or not applicable, you can use the following:
Query: trashed:false OR trashed:na
2) If you cannot rebuild the index, I am not sure...

Using FieldSelector when searching with Lucene

I'm searching articles in PubMed via Lucene.
Each of the 20,000,000 articles has an abstract with ~250 words and an ID.
At the moment I store the results of my searches, which each take multiple seconds, in a TopDocs object.
Searches can find thousands of articles.
I'm just interested in the ID of the article.
Does Lucene load the abstracts internally into the TopDocs?
If so, can I prevent that behavior through FieldSelectors, or do FieldSelectors only work with IndexReader and not with IndexSearcher?
No, Lucene does not load the values of fields into TopDocs. TopDocs only contains the doc number and score for each one of the matching documents.
If you're having performance issues, here's another SO question that can help you:
Optimizing Lucene performance
Lucene, by default, does not load any stored fields. If you want to retrieve only the ID field, and if you can afford to load up all the IDs in memory, then you can load all values as follows and reuse them.
String[] allIDs = FieldCache.DEFAULT.getStrings(indexReader, "IDFieldName");
For more on FieldCache, see the answer to "Best way to retrieve certain field of all documents returned by a Lucene search".
You're on the right lines.
Try using a SetBasedFieldSelector when you retrieve the document from the index.
As another poster noted, iterating through the hits will return a ScoreDoc object. This will give you the document Id that can be used to retrieve the document using the IndexReader associated with the IndexSearcher.
If IO is a problem because of loading fields you aren't interested in, you should be in for a pleasant surprise.
Hope this helps,

Collect all hits for a search in Lucene / Optimization

Summary: I collect the doc ids of all hits for a given search by using a custom Collector (it populates a BitSet with the ids). Searching and getting the doc ids are fast enough for my needs, but when it comes to actually fetching the documents from disk, things get very slow. Is there a way to optimize Lucene for faster document collection?
Details: I'm working on a processed corpus of Wikipedia and I keep each sentence as a separate document. When I search for "computer", I get all sentences containing the term computer. Currently, searching the corpus and getting all document ids takes under a second, but fetching the first 1000 documents takes around 20 seconds. Fetching all documents takes proportionally more time (i.e. another 20 sec for each 1000-document batch).
Subsequent searches and document fetches take much less time (though I don't know who's doing the caching, the OS or Lucene?), but I'll be searching for many diverse terms and I don't want to rely on caching; the performance of the very first search is crucial for me.
I'm looking for suggestions/tricks that will improve the document-fetching performance (if it's possible at all). Thanks in advance!
Addendum:
I use Lucene 3.0.0, but I drive the Lucene classes from Jython. This means I call the get_doc method of the following Jython class for every doc id I retrieved during the search:
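import java.io
from org.apache.lucene.index import IndexReader
from org.apache.lucene.store import FSDirectory

class DocumentFetcher():
    def __init__(self, index_name):
        self._directory = FSDirectory.open(java.io.File(index_name))
        # True opens the reader in read-only mode
        self._index_reader = IndexReader.open(self._directory, True)

    def get_doc(self, doc_id):
        # Loads all stored fields of the document with this doc id
        return self._index_reader.document(doc_id)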
I have 50M documents in my index.
You are probably storing a lot of information in each document. Reduce the stored fields as much as you can.
Secondly, when retrieving fields, select only those fields that you need. You can use the following IndexReader method to load only a few of the stored fields:
public abstract Document document(int n, FieldSelector fieldSelector)
This way you don't load fields that are not used.
You can use the following code sample:
// Load only the id field eagerly; load no fields lazily.
FieldSelector idFieldSelector =
    new SetBasedFieldSelector(Collections.singleton("idFieldName"), Collections.<String>emptySet());
for (int i : resultDocIDs) {
    String id = reader.document(i, idFieldSelector).get("idFieldName");
}
Scaling Lucene and Solr discusses many ways to improve Lucene performance.
As you are working on Lucene search within Wikipedia, you may be interested in Rainman's Lucene Search of Wikipedia. He mostly discusses algorithms and less performance, but this may still be relevant.

Bubbling up newest content in lucene search results

I am storing various articles in my lucene index.
When a user searches for articles that contain a specific term or phrase, I need to show all the articles (anywhere between 1000 and 10000 articles), but with the newest articles "bubbled up" in the search results.
I believe you can bubble up a search result in Lucene using "Date field Boosting".
Can someone please give me the details of how to go about this?
Thanks in advance!
I would implement the SortComparatorSource interface. You should write a new ScoreDocComparator, whose compare() function compares two dates. Then you will need to sort your searches using the new sorter. This advice is taken from chapter 6 of Lucene in Action.
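The ordering such a date comparator produces can be shown with a small plain-Python sketch. It is not the Lucene sorting API; the `ArticleHit` type and its `published` field are hypothetical, and falling back to score when two dates are equal is an assumption rather than something the answer specifies.

```python
from datetime import date
from typing import List, NamedTuple


class ArticleHit(NamedTuple):
    doc_id: int
    score: float
    published: date   # hypothetical date field stored with each article


def newest_first(hits: List[ArticleHit]) -> List[ArticleHit]:
    """Order hits by publication date, newest first; equal dates fall back to score."""
    return sorted(hits, key=lambda h: (h.published, h.score), reverse=True)
```

Note that a pure date sort ignores relevance except as a tie-breaker, which differs from boosting, where age only nudges the score.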
You can use the setBoost method to set the "boost" for a particular document in the index at index time. Since the default boost value is 1.0, setting a value less than 1.0 will make the document "less relevant" in search results. By tying the boost value of a document to its age (lower boost the older the document gets), you can make newer content seem more relevant in search results.
Note in the documentation for setBoost that the boost value set at indexing time is not available for retrieved documents (boost works, you just can't read the value back at retrieval time to see if you applied the correct value at index time).
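One way to tie the boost to a document's age is an exponential decay, sketched below in plain Python. The function name, the half-life parameter, and its default of 365 days are assumptions chosen for illustration; the returned value is what would be passed to setBoost at index time.

```python
from datetime import date


def age_boost(published: date, today: date, half_life_days: float = 365.0) -> float:
    """Boost that is 1.0 for a brand-new document and halves every `half_life_days`."""
    age_days = max((today - published).days, 0)  # treat future dates as age 0
    return 0.5 ** (age_days / half_life_days)
```

Because an index-time boost is fixed when the document is written, the values would presumably need to be recomputed by re-indexing from time to time if they are to keep tracking age.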