How to boost linked documents in Lucene? - lucene

Is it possible to boost found documents based on other found documents?
For example, if document A has a link to document B and both are found independently, can both be boosted? By link I mean a field holding the ID of another document.
Currently I'm doing it "manually": I post-process the TopDocs, look for documents that link to other documents in the same result, and move those to the top. This is not the best solution, because the TopDocs is already truncated without taking my custom boosting into account.

I would suggest implementing a custom Lucene collector or extending an existing one. That way you can store all the doc IDs that are retrieved and post-process them at the end. Depending on the links between your documents, you may be able to throw away some of the docs during the "collecting" phase, which will save you memory.
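A minimal sketch of the post-processing step, assuming the collector has already gathered every hit (not just the top N) together with its link field. It is plain Python rather than Lucene code; the Hit type, the boost_linked function and the multiplicative boost factor are hypothetical names and choices made for illustration only.

```python
from typing import NamedTuple, Optional

class Hit(NamedTuple):
    doc_id: str           # application-level ID stored in the index
    score: float          # relevance score reported by the search
    link: Optional[str]   # ID of the linked document, if any

def boost_linked(hits, boost=2.0, top_n=10):
    """Boost every hit whose link partner is also in the full hit set."""
    found = {h.doc_id for h in hits}
    boosted_ids = set()
    for h in hits:
        if h.link is not None and h.link in found:
            boosted_ids.add(h.doc_id)  # the document holding the link
            boosted_ids.add(h.link)    # the document it points to
    rescored = [
        h._replace(score=h.score * boost) if h.doc_id in boosted_ids else h
        for h in hits
    ]
    # Truncate only after boosting, so a linked pair cannot be cut off early.
    rescored.sort(key=lambda h: h.score, reverse=True)
    return rescored[:top_n]

all_hits = [
    Hit("A", 0.40, "B"),
    Hit("B", 0.35, None),
    Hit("C", 0.90, "Z"),  # Z was not matched, so C is not boosted
    Hit("D", 0.50, None),
]
top = boost_linked(all_hits, boost=3.0, top_n=3)
print([h.doc_id for h in top])
```

The point of doing this over the full hit set is that the cut to the top N happens after the boost, which is what the TopDocs post-processing in the question cannot do.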

Related

umbraco pdf searcher result ranking

We use the PDF searcher (a NuGet package) in one of our Umbraco applications. The PDF search results do not look 100% correct.
The top two PDFs in the search result contain the search term, but the third, fourth and all remaining PDFs do not. I'm not sure why PDFs that don't contain the search term are included in the result.
Can anyone explain how the Umbraco PDF searcher works and how it ranks the result items?
Is there any way to remove PDFs that do not contain the search term at all from the search result?
Download LUKE (https://code.google.com/archive/p/luke/), a tool that lets you look inside Lucene indexes and see exactly what has been indexed.
You can get Umbraco Examine to output the raw Lucene query string it is using by calling .ToString on the criteria object. Paste that into LUKE to run the search, and you'll be able to see all sorts of useful details, such as the matched terms and the ranking.
:)

different cloudsearch relevance scores for equivalent matches

I'm new to AWS CloudSearch and have set up my first domain. It only has one basic text index field.
I've tried a number of simple searches and – more often than not – I get different relevance scores across documents where it seems they should be the same. Even searching for one simple word, which matches exactly once in a number of documents, often produces different scores.
Is this supposed to happen? If so, why?
This is normal. Document length is one factor that will affect this. Think about it: finding your query in a 5 word document indicates a better match than finding your query in a 1000 word document.
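To make the document-length effect concrete, here is a small sketch of the length normalization used by Lucene's classic TF-IDF similarity, which scales a field's contribution by 1/sqrt(number of terms). Whether CloudSearch applies exactly this formula is an assumption, and the sketch ignores the lossy encoding Lucene uses when it stores norms; length_norm is a name chosen for illustration.

```python
import math

def length_norm(num_terms: int) -> float:
    """Classic TF-IDF length normalization: shorter fields score higher."""
    return 1.0 / math.sqrt(num_terms)

short_doc = length_norm(5)      # one match in a 5-word document
long_doc = length_norm(1000)    # one match in a 1000-word document
print(f"5 words: {short_doc:.3f}, 1000 words: {long_doc:.3f}")
```

With everything else equal (same term, matched once), the shorter document ends up with the larger factor, which is why equivalent-looking matches get different scores.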
The current version of CloudSearch uses Solr/Lucene, an Apache project, so you can dig into the internals to your heart's content if you'd like to learn more. The Lucene Similarity documentation discusses the underlying scoring algorithm.
As your app matures, you may want to look into custom ranking of your results. CloudSearch provides this capability as well as a tool for comparing the results according to different rankers. You aren't able to customize the base document relevance score but you can boost it according to different fields, etc.

Tagging documents with predefined labels

I am working with a large number of documents and have a set of predefined categories/tags (which could be phrases) that appear in the text of the documents in either exact or inexact form.
I want to assign each document exactly one tag: the one closest to its text.
Please give me some direction on how to approach this problem.
You can look at the Lucene search engine, which can tag documents while indexing. The Northernlight search engine used to do something similar in its search methodology. You can have a look at its implementation to get an idea.

Index strategy for tagged documents where tags can change often

In addition to text content my documents have tags which can be searched too. The problem now is that the tags change quite often and every time a tag gets added or removed I have to call UpdateDocument which is quite slow when done for hundreds of documents.
Are there any well performing strategies for storing tags that change often and need to be searched with Lucene? I have been thinking about keeping the tags in separate documents to keep them smaller but I can't figure out how to quickly search for tags AND content.
Store [tag, UID] pairs in a relational database. Every time a tag is added or removed, make the corresponding change in this table rather than in the Lucene index.
When performing a Lucene search that includes both tag data (stored in a database) and content (indexed in Lucene) you will need to merge the results together. One way you can do this is to:
1. Make a database query to pull up all the UID's for the tag in question
2. Translate all the UID's to Lucene doc ID's and set a bit in a BitSet for every matching Lucene doc ID
3. Create a Filter that wraps the BitSet, and pass that filter in to your search.
We implemented this approach in our system, and it works well. You might need to put a cache in front of the database for performance reasons, though. The particulars of step (3) will vary depending on which version of Lucene you're using.
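The three steps can be sketched roughly as follows. This is plain Python with an in-memory SQLite table, not Lucene code: the doc_tags table, the uid_to_doc_id mapping and the function names are all hypothetical, and the "filter" here is applied to an already-computed hit list purely for illustration, whereas a real Lucene Filter restricts the search itself.

```python
import sqlite3

def uids_for_tag(conn, tag):
    """Step 1: pull every UID carrying the tag from the database."""
    rows = conn.execute("SELECT uid FROM doc_tags WHERE tag = ?", (tag,))
    return {row[0] for row in rows}

def build_bitset(uids, uid_to_doc_id):
    """Step 2: translate UIDs to doc IDs and set one bit per doc ID."""
    bits = 0
    for uid in uids:
        doc_id = uid_to_doc_id.get(uid)
        if doc_id is not None:  # UID may not be indexed yet
            bits |= 1 << doc_id
    return bits

def filter_hits(hits, bits):
    """Step 3: keep only the content hits whose bit is set."""
    return [(doc_id, score) for doc_id, score in hits if (bits >> doc_id) & 1]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_tags (tag TEXT, uid TEXT)")
conn.executemany(
    "INSERT INTO doc_tags VALUES (?, ?)",
    [("urgent", "u1"), ("urgent", "u3"), ("draft", "u2")],
)

uid_to_doc_id = {"u1": 0, "u2": 1, "u3": 2}       # assumed to come from the index
content_hits = [(0, 0.9), (1, 0.8), (2, 0.4)]     # (doc ID, score) from the text query

bits = build_bitset(uids_for_tag(conn, "urgent"), uid_to_doc_id)
matching = filter_hits(content_hits, bits)
print(matching)
```

Because tags live only in the database, adding or removing a tag is a single row change and no UpdateDocument call is needed.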

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
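One possible mechanism, sketched under the assumption that every table carries a last-modified column that all writers maintain: the indexer remembers a high-water mark and only fetches rows changed since then. The articles table, the updated_at column and fetch_changed_rows are hypothetical names; the sketch uses SQLite and does not handle deletes.

```python
import sqlite3

def fetch_changed_rows(conn, last_indexed_at):
    """Return rows modified after the indexer's high-water mark."""
    cur = conn.execute(
        "SELECT id, body, updated_at FROM articles "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_indexed_at,),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, "old", 100), (2, "edited", 205), (3, "new", 230)],
)

last_indexed_at = 200                     # timestamp of the previous indexing run
changed = fetch_changed_rows(conn, last_indexed_at)
if changed:
    # Feed `changed` to the indexer here, then advance the mark.
    last_indexed_at = changed[-1][2]
print(changed, last_indexed_at)
```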
Another point to consider is do you want one index that covers all of your tables, or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
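As a rough illustration of the single-index layout, each database row can be flattened into one document that carries a field naming its source table. The field names source_table and doc_key, and the row_to_document helper, are assumptions made for this sketch; a real indexer would build Lucene documents instead of dicts.

```python
def row_to_document(table, primary_key, row):
    """Flatten one database row into a document for a shared index."""
    doc = {col: str(value) for col, value in row.items() if value is not None}
    doc["source_table"] = table                # lets queries filter by table
    doc["doc_key"] = f"{table}:{primary_key}"  # unique across all tables
    return doc

docs = [
    row_to_document("customers", 7, {"name": "Ada", "email": None}),
    row_to_document("orders", 7, {"item": "lamp"}),
]
print(docs)
```

Prefixing the key with the table name keeps rows with the same primary key in different tables from overwriting each other when documents are updated by key.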
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well but there'll be some DRY-violations with the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. Solr 1.3 includes the DataImportHandler, which lets you specify your database tables (they call them entities) along with their relationships. Once these are defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
As an introduction, Brian McCallister wrote a nice blog post: Using Lucene with OJB.