For a project I am working on, I have an index of nearly 10 million documents. For sets of documents ranging from 100k to 5m, I regularly need to add fields.
Lucene 4 supports updating documents (basically a remove and re-add). What would be a good approach to adding a field to such a large set of documents?
What I have tried so far is using a SearcherManager to wrap an IndexWriter, and making small searches for documents that do not yet contain the field but do match the Query I am interested in, by wrapping these in a BooleanQuery. I then iterate over the ScoreDocs, retrieve the documents, add my new field and call writer.updateDocument with the uuid I stored with each document. Then I call commit and maybeRefreshBlocking, reacquire the IndexSearcher and search again. This is kinda slow and seems like a naive approach.
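For reference, a rough sketch of one pass of that loop (Lucene 4.x style APIs; the field names, the "hasNewField" marker and the batch size are placeholders I picked, not my real schema):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

class BulkFieldAdder {
    // One pass of the update loop described above.
    static void addFieldToMatches(Directory dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));
        SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

        // Documents that match the query I care about but do not yet have the new field;
        // "category", "hasNewField", "newField" and "uuid" are placeholder field names.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("category", "book")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("hasNewField", "true")), BooleanClause.Occur.MUST_NOT);

        IndexSearcher searcher = manager.acquire();
        try {
            TopDocs hits = searcher.search(query, 1000);              // small batch
            for (ScoreDoc sd : hits.scoreDocs) {
                // Note: a document loaded this way only contains its stored fields.
                Document doc = searcher.doc(sd.doc);
                doc.add(new StringField("newField", "someValue", Field.Store.YES));
                doc.add(new StringField("hasNewField", "true", Field.Store.YES));
                // Delete-and-add, keyed on the uuid stored with each document.
                writer.updateDocument(new Term("uuid", doc.get("uuid")), doc);
            }
        } finally {
            manager.release(searcher);
        }
        writer.commit();
        manager.maybeRefreshBlocking();
        // writer and manager are kept open for the next pass / next search
    }
}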
You only need to reacquire the IndexSearcher before searches whose results will differ based on the fields that you add.
In the case where your searches are never affected by the fields that you add, you only need to reacquire the IndexSearcher when documents are added to the index.
So it will simplify and speed things up, at least a little, if you only reacquire the IndexSearcher when necessary rather than before each search.
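For instance, something along these lines (just a sketch; whether a batch of updates actually affects your queries is your own application logic, and the "affectsQueries" flag is made up, not a Lucene API):

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;

class SearcherReuse {
    // Reacquire only when the writes we just made can change query results.
    static IndexSearcher nextSearcher(IndexWriter writer, SearcherManager manager,
                                      boolean affectsQueries) throws Exception {
        writer.commit();
        if (affectsQueries) {
            manager.maybeRefreshBlocking();
        }
        return manager.acquire();   // remember to release() it after searching
    }
}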
Related
Say I update my index once a day, every day, at the same time. During the time between updates (for 21 hours or so), will the docIds remain constant?
As @andrewjames mentioned, docIds only change when a merge happens. The docId is basically the array index position of the doc within a particular segment.
A side effect of that is that if you have multiple segments, a given docId might be assigned to multiple docs: one in one segment, one in another segment, and so on. If that's a problem, you can do a force merge once you are done building your index so that there is only a single segment; then no two docs will have the same docId.
The docId for a given document will not change if a merge does not happen. And a merge won't happen unless you call force merge or add or delete documents, or upgrade your index.
So... if you build your index and don't add docs, delete docs, call force merge, or upgrade your index, then the docIds will be stable. But the next time you build your index, a given doc may receive a totally different docId. And as @andrewjames said, the docId assignments and the timing of those assignments are an internal affair in Lucene, so you should be cautious about relying on them even when you know how and when they are currently assigned.
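For completeness, forcing the index down to a single segment is a single call on the IndexWriter. A minimal sketch (recent Lucene API; the index path is a placeholder):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class ForceMergeToOneSegment {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/path/to/index"));  // placeholder path
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Merge everything into one segment so a docId is unique across the
            // whole index; this is an expensive operation on a large index.
            writer.forceMerge(1);
        }
    }
}

Keep in mind that the merge itself rewrites segments, so docIds can change while it runs; they are only stable again once it completes and no further writes happen.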
I've set up a basic implementation of ElasticSearch, storing a couple of fields in the document and I'm able to execute queries.
// Ask Elasticsearch for the IDs of the matching documents
var searchResult = client
    .Search<SearchTest>(s => s
        .Size(1000)
        .Fields(f => f.ID)
        .Query(q => q.QueryString(d => d.Query(query)))
    )
    .Documents
    .Select(item => item.ID)
    .ToList();

// Then load the full records from the database using those IDs
var products = this.DbContext.Products
    .Where(item =>
        searchResult.Contains(item.ProductId)
        && ...
    )
    .Select(item => ...);
// subsequent queries here
Right now I simply return the IDs from the index, which I then use in database queries to retrieve a whole lot of information. The information stored in the documents is retrieved from the database as well. Now I'm wondering: should I skip retrieving this from the database and use the data in the document store instead? Or should I use the document store for nothing but searching?
Some context: searching in a product database, some information is always the same, some information (like price calculation) depends on which customer is searching.
There isn't really a hard and fast answer to this question. I like to pull enough information from the index to populate a list of search results, but retrieve the full contents of the document from other, external sources (e.g. a database). Entirely subjectively, this seems to be the more common use of Lucene, from what I've seen.
Storage strategy, as far as I know, should not have a direct impact on search performance, but keeping the data stored for each document to a minimum will improve the performance of retrieving documents from the index (i.e., for that list of results mentioned before).
I'm also sometimes hesitant to make Lucene the system of record. It seems to be much easier to find yourself with a broken/corrupt index than a database. I like having the option available to trash and rebuild it.
I see you already accepted an answer, but I'd like to offer a second approach.
Elasticsearch excels at storing documents (JSON), so retrieving complete object graphs can be a very fast and powerful way to overcome the impedance mismatch and N+1-prone database queries.
To me, the best approach would be for searchResult to already be the definitive IEnumerable<Product>, without having to do N database queries afterwards.
Elasticsearch (unlike raw Lucene or even Solr) has a special field called _source that stores the original JSON graph, so the overhead of loading your whole document is minimal.
This comes at the cost of basically having to write your data twice, once to the database and once to Elasticsearch, on every mutation. Depending on your architecture this may or may not be achievable.
I agree with @femtoRgon that being able to reindex from an external datasource is a good idea, but the Elasticsearch developers are working very hard to get proper backup and restore into 1.0. This will greatly reduce the need for the second data store.
BTW, not sure if you are aware, but specifying .Fields() already forces Elasticsearch to load only the specified fields instead of the whole graph from the special _source field.
I am using Lucene to index the records from my database. I have a million records in my table called "Documents". Each record may be accessed by particular users only. A realistic scenario is that a single user can access at most 100 records in the Documents table. Which of the following is the best practice for this scenario?
Indexing all 1 million records in the Documents table as a single index, with the user information as one of the fields in that index, OR
Creating user-specific indexes
Sounds like you'll end up with a lot of indices in the second scenario, and if you want to search them concurrently, Lucene will have to keep a lot of files open, so you might easily hit your OS limit on the number of open files. If you decide to open/close them on demand, you might not benefit from caching, and your searches might be slow because of cold indices (or you pre-warm them, but again that adds a lot of overhead). I'd go with the first approach; Lucene can handle 1M documents in a single index.
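With the single-index approach, restricting a search to one user's documents is just an extra required clause on the query. A rough sketch (recent Lucene API; the "allowedUser" field name is an assumption):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class PerUserSearch {
    // Combine the user's actual query with a mandatory, non-scoring filter on
    // the "allowedUser" field that was indexed with each document.
    static Query forUser(Query userQuery, String userId) {
        return new BooleanQuery.Builder()
                .add(userQuery, BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("allowedUser", userId)), BooleanClause.Occur.FILTER)
                .build();
    }
}

If a document is visible to more than one user, just index the allowedUser field once per user; the TermQuery will match any of the values.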
I would like to use Lucene for indexing a table in an existing database. I have been thinking the process is something like this (a rough code sketch follows the list):
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
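Roughly what I have in mind, in code (the column names and analyzer choice are just placeholders, not my actual schema):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

class RowToDocument {
    // One database row becomes one Lucene Document. The primary key is kept as
    // a single unanalyzed token; the other columns are analyzed text. Everything
    // is stored here, per step 2 above.
    static Document toDocument(String id, String title, String hugeTextColumn) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));               // not analyzed
        doc.add(new TextField("title", title, Field.Store.YES));           // analyzed + stored
        doc.add(new TextField("body", hugeTextColumn, Field.Store.YES));   // the huge column
        return doc;
    }
}

Each Document built this way would then be handed to IndexWriter.addDocument.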
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the field regardless of the size and if a hit is found for a search, fetch the appropriate Field from Document
Don't store the Field and if a hit is found for a search, query the data base to get the relevant information out
I realize there may not be a one-size-fits-all answer ...
For sure, your system will be more responsive if you store everything in Lucene. Stored fields do not affect query time; they only make your index bigger. And probably not that much bigger, if only a small portion of the rows have a lot of data. So if index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase index size, which can be a problem with a relatively slow I/O system;
stored fields are all loaded when you load a Document into memory, which can put significant stress on the GC;
stored fields are likely to increase reader reopen time.
The final answer, of course, is: it depends. If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
When adding a row from the database to Lucene, you can judge whether a column actually needs to be written to the inverted index. If not, you can use Index.NO to avoid writing too much data to the inverted index.
Likewise, you can judge whether a column will ever need to be retrieved from the index. If not, you needn't use Store.YES to store the data.
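With the classic Field API those constants come from (Lucene 3.x style), the choice looks roughly like this; the column names here are made up:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class StoreVsIndex {
    static Document example(String id, String body, String price) {
        Document doc = new Document();
        // Looked up by exact key and shown in results: index as one token, and store.
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Full-text searched, but fetched back from the database: index, don't store.
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        // Only needed for display, never searched: store, don't index.
        doc.add(new Field("price", price, Field.Store.YES, Field.Index.NO));
        return doc;
    }
}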
I am implementing Solr for free-text search in a project where the records available to be searched will need to be added and deleted on a large scale every day.
Because of the scale I need to make sure that the size of the index is appropriate.
On my test installation of Solr, I index a set of 10 documents. Then I make a change in one of the documents and want to replace the document with the same ID in the index. This works correctly and behaves as expected when I search.
I am using this code to update the document:
// Remove the old version of the document, then add the new version and commit.
getSolrServer().deleteById(document.getIndexId());
getSolrServer().add(document.getSolrInputDocument());
getSolrServer().commit();
What I noticed, though, is that when I look at the stats page for the Solr server, the figures are not what I expect.
After the initial index, numDocs and maxDocs both equal 10, as expected. When I update the document, however, numDocs is still 10 (expected) but maxDocs equals 11 (unexpected).
When reading the documentation I see that
maxDoc may be larger as the maxDoc count includes logically deleted documents that have not yet been removed from the index.
So the question is, how do I remove logically deleted documents from the index?
If these documents still exist in the index, do I run the risk of performance penalties when this is run with a very large volume of documents?
Thanks :)
You have to optimize your index.
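With SolrJ (which the snippet in your question appears to be using) that is a single call; a sketch, reusing your getSolrServer() helper:

// Merges the index segments and expunges the logically deleted documents.
getSolrServer().optimize();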
Note that an optimize is expensive; you probably should not do it more than daily.
Here is some more info on optimize:
http://www.lucidimagination.com/search/document/CDRG_ch06_6.3.1.3
http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations