Index verification tools for Lucene

How can we know the index in Lucene is correct?
Detail
I created a simple program that builds a Lucene index and stores it in a folder. Using the diagnostic tool Luke, I could look inside the index and view its contents.
I realise Lucene is a standard framework for building a search engine, but I wanted to be sure that Lucene indexes every term that exists in a file.
Can I verify that Lucene's index creation is dependable, and that not even a single term went missing?

You could always build a small program that performs the same analysis you use when indexing your content. Then, for every term, query your index to make sure the document is among the results. Repeat for all of the content; a sketch of this approach is shown below. But personally, I would not waste time on this. If you can open your index in Luke and if you can make a couple of queries, everything is most probably fine.
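A minimal sketch of that verification loop, assuming a recent Lucene (5+) API, a hypothetical field named contents, and that you indexed with StandardAnalyzer; adjust the analyzer, field name, and index path to match your own setup:

```java
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class IndexSpotCheck {
    public static void main(String[] args) throws Exception {
        String field = "contents";                  // hypothetical field name
        String text = "the original document text"; // re-read from your source file

        // Step 1: re-run the same analysis used at index time
        // to compute the set of terms we expect to find.
        Set<String> expectedTerms = new HashSet<>();
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                expectedTerms.add(termAtt.toString());
            }
            ts.end();
        }

        // Step 2: for every expected term, check that the index returns
        // at least one hit. (With many documents you would also verify
        // that the right document id comes back.)
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (String t : expectedTerms) {
                if (searcher.count(new TermQuery(new Term(field, t))) == 0) {
                    System.out.println("Missing term: " + t);
                }
            }
        }
    }
}
```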
Often, the real question is whether or not the analysis you did on the content will be appropriate for the queries that will be made against your index. You have to make sure that your index will have a good balance between recall and precision.

Related

Apache Lucene Search program

I am currently reading a few tutorials on Apache Lucene. I had a question on how indexing works, which I was not able to find an answer to.
Suppose I have a set of documents that I want to index and then search with a query string. It seems a Lucene program would have to index all these documents and then search for the input string each time the program is run. Will this not cause performance issues? Or am I missing something?
No, it would be pretty atypical to build a new index every time you run the program.
Many tutorials and examples out there use RAMDirectory, perhaps that's where the confusion is coming from. RAMDirectory creates an index entirely in memory. This is great for demos and tutorials, because you don't have to worry about the file system, or any of that nonsense, and it ensures you are working from a predictable blank state from the start.
In practice, though, you usually won't be using it. You would, instead, use a directory on the file system, and open an existing index after creating it for the first time, rather than creating a new index every time you run the program.
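A minimal sketch of that pattern (assuming a recent Lucene 5+ API; the index path is a placeholder). The CREATE_OR_APPEND open mode builds the index on the first run and reuses it on every run after that:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class OpenOrCreateIndex {
    public static void main(String[] args) throws Exception {
        // A directory on the file system, not RAMDirectory:
        // the index survives between runs of the program.
        Directory dir = FSDirectory.open(Paths.get("/path/to/index"));

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // Create the index only if it doesn't exist yet;
        // otherwise open and append to the existing one.
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);

        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // add new or updated documents here; no need to reindex everything
        }
    }
}
```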

Elasticsearch _boost deprecated. Any alternative?

So basically _boost, which was a mapping option to give a field a certain boost, is now deprecated.
The page suggests using "function score instead of boost". But the function score documentation says:
The function_score allows you to modify the score of documents that are retrieved by a query
So it's not really an alternative: function score only modifies the score of the documents at query time.
How do I alter the relevance of a field at mapping time?
Is that option no longer valid? Was it removed with no replacement?
The option is no longer valid and there is no direct replacement. The problem is that index-time boosting was removed from Lucene 4.0, upon which Elasticsearch runs. Elasticsearch then used its own implementation, which had its own issues. A good write-up of the issues can be found here: http://blog.brusic.com/2014/02/document-boosting-in-elasticsearch.html and the issue deprecating boost at index time here: https://github.com/elasticsearch/elasticsearch/issues/4664
To summarize, it basically was not working in a transparent and understandable way: you could boost one document by 100 and another by 50, hit the same keyword, and yet get the same score. So the decision was made to remove it and rely on function score queries, which have the benefit of being much more transparent and predictable in their impact on scoring.
If you feel strongly that function score queries do not meet your needs and use case, I'd open an issue on GitHub and explain your case.
A function score query can be used to boost the whole document. If you want a field-level boost, you can apply one with a multi match query or a term query, as sketched below.
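For example, with the Elasticsearch Java API (a hedged sketch: the exact builder methods vary between ES versions, and the field names here are placeholders), a per-field boost might look like:

```java
import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class FieldBoostExample {
    public static void main(String[] args) {
        // Per-field boosting at query time: matches in "title" count
        // three times as much as matches in "body".
        MultiMatchQueryBuilder query = QueryBuilders
                .multiMatchQuery("google")   // the query text
                .field("title", 3.0f)        // boosted field
                .field("body");              // unboosted field

        // Equivalent single-field form with a boosted term query:
        // QueryBuilders.termQuery("title", "google").boost(3.0f);

        // Pass `query` to your search request, e.g.
        // client.prepareSearch("myindex").setQuery(query).get();
        System.out.println(query);
    }
}
```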
I don't know about your case, but I assume you have strong reasons to boost documents at index time. It is generally recommended to boost at query time, as index-time boosting requires reindexing the data whenever your boost criteria change. That said, in my application we have implemented both index- and query-time boosting:
Index-time boosting (document boosting), to boost some documents which we know will always be the top hit for our search. E.g. searching for the word "google" should always put a document containing "google.com" as the top hit. We achieve this using a custom boost field and a custom boosting script. Please see this link: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Query-time boosting (per-field boosting): we use the ES Java APIs to execute our queries and apply field-level boosts at query time to each field, as this is highly flexible and allows us to change the field-level boosts without reindexing the whole data set.
You can have a look at this; it might be helpful for you: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#_field_value_factor
I have described my complete use case here; hopefully you will find it helpful.
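To make the document-boosting side concrete, here is roughly what a function score query with field_value_factor looks like via the Java API (a hedged sketch: doc_boost is a hypothetical numeric field indexed alongside each document, and the builder names shifted between ES versions; this is approximately the 5.x style, while earlier versions used an .add(...) method on the builder instead):

```java
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

public class DocBoostExample {
    public static void main(String[] args) {
        // Multiply each matching document's score by the value of its
        // numeric "doc_boost" field, so documents given a higher
        // doc_boost at index time rank higher at query time.
        QueryBuilder query = QueryBuilders.functionScoreQuery(
                QueryBuilders.matchQuery("contents", "google"),
                ScoreFunctionBuilders.fieldValueFactorFunction("doc_boost"));

        System.out.println(query);
    }
}
```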

Creating Lucene Index in a Database

I am using the Grails Searchable plugin. It creates index files at a given location. Is there any way in the Searchable plugin to create the Lucene index in a database?
Generally, no.
You could probably attempt to implement your own storage format, but this would require a lot of effort.
I am no expert in Lucene, but I know that it is optimized for fast search over the filesystem. So it would be theoretically possible to build a Lucene index over a database, but the main feature of Lucene (being a VERY fast search engine) would be lost.
As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by Elasticsearch.

how lucene use skip list in inverted index?

From some blogs and the Lucene website, I know that Lucene uses the skip list data structure in its inverted index. But I have some questions about it:
1: In general, a skip list is kept in memory, but the inverted index is stored on disk. So how does Lucene use it when searching the index? Does it scan it on disk, or load it into memory?
2: A skip list's insert operation usually uses a random(0,1) draw to decide whether to promote an entry to the next level, but in the Lucene documentation the skip entries seem to appear at a fixed interval for each term. So how does Lucene create its "skip list"? Is it different or not?
Please correct me if I am wrong.
Even though the index is persisted on disk, Lucene uses memory in a couple of different ways: when the IndexReader is created for searching, and for operations like sorting (the field cache):
http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
Basically, those binary files get copied into RAM for much faster scanning and reduced I/O. The link above hints at how searching with certain parameters can force Lucene to "skip terms in searching"; hence, that is where this data structure can be used.
Lucene is open source, so you can see for yourself in the code what is being used, either in Java or in Lucene.NET for the C# implementation.
See also: "To accelerate posting list skips, Lucene uses skip lists."
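On the second question: a Lucene segment's postings are written once and never modified, so there is no insert operation at all, and no randomized promotion is needed. The writer can simply record a skip entry at a fixed interval as it writes the postings, with higher skip levels built at fixed multiples of that interval. A toy, single-level illustration of the idea (not Lucene's actual code; the real defaults are larger and multi-level):

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of a single-level, fixed-interval skip structure
 *  over a sorted, immutable postings list. Not Lucene's actual code. */
public class FixedIntervalSkipDemo {
    static final int SKIP_INTERVAL = 4; // Lucene's real interval is larger

    public static void main(String[] args) {
        int[] postings = {2, 5, 8, 11, 14, 20, 23, 31, 40, 47, 52, 60};

        // Because the list never changes, skip entries can be laid out
        // deterministically: one pointer every SKIP_INTERVAL entries.
        List<Integer> skipPositions = new ArrayList<>();
        for (int i = SKIP_INTERVAL - 1; i < postings.length; i += SKIP_INTERVAL) {
            skipPositions.add(i);
        }

        // advance(target): follow the skip entries to jump ahead,
        // then scan linearly within the final interval.
        int target = 45;
        int start = 0;
        for (int pos : skipPositions) {
            if (postings[pos] < target) start = pos; else break;
        }
        int found = -1;
        for (int i = start; i < postings.length; i++) {
            if (postings[i] >= target) { found = postings[i]; break; }
        }
        System.out.println("first doc >= " + target + " is " + found); // 47
    }
}
```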

Solr on a .NET site

I've got an ASP.NET site backed by a SQL Server database. I've been using Lucene.NET to index and search the database. I'm adding faceted search navigation to the results page (the facets are a hierarchical category tree). I asked yesterday to make sure I was using the right technique for faceting. All I've gotten so far is a suggestion to use Solr, but Solr does a lot of things I don't need.
I would really like to know, from anyone who is familiar with Solr's source code, whether Solr's facet processing is terribly different from the one described here by Bert Willems: basically you have a Lucene filter for each facet value, you get the bit array from it, and you count the set bits in the array (a toy sketch follows below).
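To make the counting step concrete, here is a toy sketch of the idea (in Java with plain java.util.BitSet for brevity, rather than Lucene.NET's filter classes; a real implementation would pull the bit arrays from cached per-facet filters):

```java
import java.util.BitSet;

/** Toy sketch of bitset-based facet counting: one cached bitset per
 *  facet value, intersected with the bitset of the current query's hits. */
public class FacetCountSketch {
    /** Count documents that match both the query and the facet value. */
    static int countFacet(BitSet queryHits, BitSet facetDocs) {
        BitSet intersection = (BitSet) queryHits.clone();
        intersection.and(facetDocs);       // docs in both sets
        return intersection.cardinality(); // number of set bits = facet count
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();        // docs matching the current query
        hits.set(1); hits.set(3); hits.set(7);

        BitSet categoryBooks = new BitSet(); // docs whose category is "books"
        categoryBooks.set(3); categoryBooks.set(7); categoryBooks.set(9);

        System.out.println("books: " + countFacet(hits, categoryBooks)); // 2
    }
}
```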
I'm thinking that since mine is hierarchical to begin with, I should be able to optimize this pretty well, but I'm afraid I might be grossly underestimating the impact of this design on search performance. If Solr is no quicker, I'm not going to gain anything by using it.
I'd recommend creating a prototype project that models your faceting needs with Solr and benchmarking it against Lucene.net.
Even though faceting in Solr is very optimized (and gets new optimizations all the time, like the parallel per-segment faceting method), when using Solr there is some overhead, for example network roundtrips and response parsing.
If your code already implements Lucene.NET, performs adequately, and you don't need any of Solr's additional features, then there is no need to switch to Solr. But also consider that if you choose Solr, you will get faceting performance boosts for free with each new version.