Are docids constant if the index is not manipulated in Lucene 8.6.1?

Say I update my index once a day, everyday, at the same time. During the time between updates (for 21 hours or so), will the docids remain constant?

As @andrewjames mentioned, docIds only change when a merge happens. A docId is basically the array index position of the doc within a particular segment.
A side effect of that is that if you have multiple segments, a given docId might be assigned to multiple docs: one in one segment, one in another segment, and so on. If that's a problem, you can do a force merge once you are done building your index so that there is only a single segment; then no two docs will share the same docId.
The docId for a given document will not change if a merge does not happen. And a merge won't happen unless you call force merge, add or delete documents, or upgrade your index.
So... if you build your index and don't add docs, delete docs, call force merge, or upgrade your index, then the docIds will be stable. But the next time you build your index, a given doc may receive a totally different docId. And as @andrewjames said, docId assignment and its timing are an internal affair in Lucene, so you should be cautious about relying on docIds even when you know when and how they are currently assigned.
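For reference, here is a minimal sketch (the index path and analyzer are my own placeholders) of a daily rebuild in Lucene 8.6.1 that ends with a force merge, leaving the index as a single segment:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

// Hypothetical daily update: apply the day's adds/deletes, then merge down to one
// segment so per-segment docIds are also unique across the whole index.
try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
    // ... writer.addDocument(...) / writer.deleteDocuments(...) for the daily batch ...
    writer.forceMerge(1);  // expensive: rewrites the index into a single segment
    writer.commit();
}

Keep in mind that the force merge itself reassigns docIds, so anything keyed on them has to be rebuilt after this step.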

Related

RethinkDb do function based secondary indexes update themselves dynamically?

Let's say that I need to maintain an index on a table where multiple documents can relate to the same item_id (not the primary key, of course).
Can a secondary compound index, based on the result of a function that returns, for any item_id, the most recent document matching a condition, update itself whenever a newer document gets inserted?
This table already holds 1.2 million documents after just 25 days, so it's a big-data case: it will keep growing, and it must always keep the old records so that whatever pivots are needed can be built over the years.
I'm not 100% sure I understand the question, but if you have a secondary index and insert a new document or change an old document, the document will be in the correct place in the index once the write completes. So if you had a secondary index on a timestamp, you could write r.table('items').orderBy(index: r.desc('timestamp')).limit(n) to get the most recent n documents (and you could also subscribe to changes on that).
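If it helps, a rough equivalent with the official RethinkDB Java driver looks like this (the table and index names are just placeholders); once the index exists, every completed write is reflected in it automatically:

import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;

public class RecentItems {
    public static void main(String[] args) {
        RethinkDB r = RethinkDB.r;
        Connection conn = r.connection().hostname("localhost").port(28015).connect();

        // One-time setup: a secondary index on the timestamp field.
        r.table("items").indexCreate("timestamp").run(conn);
        r.table("items").indexWait("timestamp").run(conn);

        // The n most recent documents, served straight from the index.
        Object recent = r.table("items")
                .orderBy().optArg("index", r.desc("timestamp"))
                .limit(10)
                .run(conn);

        System.out.println(recent);
        conn.close();
    }
}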

AWS DynamoDB v2: Do I need secondary index for alternative queries?

I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somehow confused...
So should I just create a table with the UUID as the hash and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and then not create any secondary indices? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range as described, and without a secondary index, then how would I query for all messages that are in a certain time range regardless of their UUIDs?
DynamoDB has since introduced Global Secondary Indexes, which solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
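For illustration, here is a sketch with the AWS SDK for Java (every name in it is made up): keep uuid as the table's hash key and add a GSI keyed on a coarse date attribute plus the timestamp, so messages can be queried by time range without knowing their UUIDs:

import com.amazonaws.services.dynamodbv2.model.*;

// Hypothetical "Messages" table: uuid hash key, plus a GSI on (day, messageTimestamp).
CreateTableRequest request = new CreateTableRequest()
    .withTableName("Messages")
    .withAttributeDefinitions(
        new AttributeDefinition("uuid", ScalarAttributeType.S),
        new AttributeDefinition("day", ScalarAttributeType.S),
        new AttributeDefinition("messageTimestamp", ScalarAttributeType.N))
    .withKeySchema(new KeySchemaElement("uuid", KeyType.HASH))
    .withProvisionedThroughput(new ProvisionedThroughput(10L, 10L))
    .withGlobalSecondaryIndexes(new GlobalSecondaryIndex()
        .withIndexName("day-timestamp-index")
        .withKeySchema(
            new KeySchemaElement("day", KeyType.HASH),
            new KeySchemaElement("messageTimestamp", KeyType.RANGE))
        .withProjection(new Projection().withProjectionType(ProjectionType.KEYS_ONLY))
        .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)));
// dynamoDbClient.createTable(request);

Note that keying the index on a plain date can reintroduce the "hot key" concern discussed in the next answer, so a bucketed hash may still be worth considering.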
We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time series data. To do this:
1) Use the date plus "bucket" id for a hash key
You could just use the date, but then I'm guessing today's date would become a "hot" key, one that is written with excessive frequency. This can create a serious bottleneck, because the throughput available to a particular DynamoDB partition is the total provisioned throughput divided by the number of partitions. That means that if all your writes go to a single key (today's key), and you have provisioned 20 writes per second spread across 20 partitions, your effective throughput would be only 1 write per second. Any requests beyond this would be throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket.
This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic MapReduce jobs, but I have not tried that myself yet, so I cannot say how easy or effective it is to work with.
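Here is a rough sketch of steps 1 and 3 with the AWS SDK for Java (the table name, attribute names, and bucket count below are assumptions, not values from a real setup); it ignores query pagination for brevity:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.*;
import java.util.*;

// Hypothetical time-series table "MessagesByDay": hash key "dayBucket"
// (e.g. "2013-06-01.17"), range key "uuid".
public class DayBucketQueries {
    static final int N_BUCKETS = 200;

    // Step 1: writes pick a random bucket so today's key is spread over many partitions.
    static String hashKeyFor(String day, Random random) {
        return day + "." + (1 + random.nextInt(N_BUCKETS));
    }

    // Step 3: read a whole day back by issuing one query per bucket.
    static List<Map<String, AttributeValue>> messagesForDay(AmazonDynamoDB client, String day) {
        List<Map<String, AttributeValue>> items = new ArrayList<>();
        for (int bucket = 1; bucket <= N_BUCKETS; bucket++) {
            QueryRequest request = new QueryRequest()
                .withTableName("MessagesByDay")
                .withKeyConditions(Collections.singletonMap("dayBucket",
                    new Condition()
                        .withComparisonOperator(ComparisonOperator.EQ)
                        .withAttributeValueList(new AttributeValue(day + "." + bucket))));
            items.addAll(client.query(request).getItems());
        }
        return items;
    }
}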
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. All DynamoDB queries MUST specify the primary hash key. Optionally, you can also use the range key and/or a local secondary index. With the current DynamoDB functionality you won't be able to use an LSI as an alternative to the primary index. You also cannot issue a query with only the range key (you can test this out easily in the AWS Console).
A (costly) workaround that I can think of is to issue a scan of the table, adding filters based on the timestamp value in order to find out which items to delete. Note that filtering will not reduce the consumed capacity of the operation, since the scan still reads the whole table.
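A sketch of that workaround with the AWS SDK for Java, assuming the schema proposed in the question (uuid hash key, messageTimestamp range key); the table name and variables are illustrative:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.*;
import java.util.*;

// "client" is an AmazonDynamoDB instance and maxAgeMillis is the retention window.
// The filter does not reduce consumed capacity: the scan still reads every item.
long cutoff = System.currentTimeMillis() - maxAgeMillis;
Map<String, Condition> filter = Collections.singletonMap("messageTimestamp",
    new Condition()
        .withComparisonOperator(ComparisonOperator.LT)
        .withAttributeValueList(new AttributeValue().withN(Long.toString(cutoff))));

Map<String, AttributeValue> startKey = null;
do {
    ScanResult result = client.scan(new ScanRequest()
        .withTableName("Messages")
        .withScanFilter(filter)
        .withExclusiveStartKey(startKey));
    for (Map<String, AttributeValue> item : result.getItems()) {
        Map<String, AttributeValue> key = new HashMap<>();
        key.put("uuid", item.get("uuid"));
        key.put("messageTimestamp", item.get("messageTimestamp"));
        client.deleteItem(new DeleteItemRequest().withTableName("Messages").withKey(key));
    }
    startKey = result.getLastEvaluatedKey();
} while (startKey != null);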

Why a 3rd column family when creating a custom index on Cassandra?

As I always say, sorry for my English. I'm working on creating some manual indexes for some column families in Cassandra. I have read everything I could about this but I have found something I'm not able to understand properly.
In the presentation Indexing in Cassandra (pages 36 to 45) by Ed Anuff, I have seen his simple example of creating an index for a Users column family. He uses the two obvious CFs and another one to deal with concurrency. This third CF is "my problem". If I'm not wrong, Cassandra will always store the most recent value for each column. If that value is indexed, I have to update it in the index CF (delete the old index entry and create the new one), but why is the third CF necessary? When I think about concurrency, my understanding says: OK, many people updating an indexed value means a lot of work updating the index, but in the end the last value will be in the Users CF and also in the index CF; that's why there is a timestamp per column. So what's the problem with concurrency? Even more, if the value can be updated only by one user (the owner of the data), there will be no concurrency at all...
I know I am quite ignorant of Cassandra matters, but I don't see the reason behind the third CF. Ed Anuff explains here that using this third column family you can restore the indexes to a consistent state, but why would they fall into an inconsistent state in the first place? And if that happens, wouldn't the Users CF be enough to restore the index, or am I wrong?
Could someone please explain this to me? Where am I going wrong?
Thank you very much!
Since I think someone else could have the same doubt as me, I'm going to answer my own question with what I have found:
As I supposed, the main problem is concurrency. If many users can be changing the same indexed value at the same time, you have to read the index before updating it, and between the time you read a value and the time you update it in the index, another user could have changed that value again. Likewise, the system could crash between the moment the value is updated and the moment you update the index. After a few such concurrent changes, the index can contain old values that point to rows which no longer hold that value.
By adding the third column family this process becomes safer, but NOT 100% safe.
And one last thing: from my understanding, if there is no concurrency when updating the values, then there should be no problem. Let's suppose you are indexing some user data. If only the owner of the data is allowed to modify it, there is no concurrency at all. The only risk is that the system crashes before you finish aligning the index with the value, but that operation is idempotent, so you can repeat it until it succeeds.
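To make that concrete, here is a rough sketch of the repeatable update, written as CQL through the DataStax Java driver. This is my own reconstruction of the pattern from the presentation, not Ed Anuff's code, and all table and column names are invented. The users_index_entries table plays the role of the third CF: it remembers which index entries this row has written, so the whole method can simply be re-run after a crash:

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

// Invented schema: users(user_id PRIMARY KEY, location),
// users_by_location(location, user_id, PRIMARY KEY (location, user_id)),
// users_index_entries(user_id, entry_id timeuuid, location, PRIMARY KEY (user_id, entry_id)).
public class LocationIndex {
    public static void updateLocation(Session session, UUID userId, String newLocation) {
        // 1. Read the entries written by previous updates (the "third CF").
        Iterable<Row> previous = session.execute(
            "SELECT entry_id, location FROM users_index_entries WHERE user_id = ?", userId);

        // 2. Record the new value here first, so a crash later can be repaired
        //    simply by running this method again (the operation is idempotent).
        session.execute(
            "INSERT INTO users_index_entries (user_id, entry_id, location) VALUES (?, ?, ?)",
            userId, UUIDs.timeBased(), newLocation);

        // 3. Add the new index entry, then remove the stale ones read in step 1.
        session.execute(
            "INSERT INTO users_by_location (location, user_id) VALUES (?, ?)",
            newLocation, userId);
        for (Row old : previous) {
            if (!newLocation.equals(old.getString("location"))) {
                session.execute(
                    "DELETE FROM users_by_location WHERE location = ? AND user_id = ?",
                    old.getString("location"), userId);
            }
            session.execute(
                "DELETE FROM users_index_entries WHERE user_id = ? AND entry_id = ?",
                userId, old.getUUID("entry_id"));
        }

        // 4. Finally write the value itself.
        session.execute("UPDATE users SET location = ? WHERE user_id = ?", newLocation, userId);
    }
}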
I hope this explains what I have understood and helps others.
Actually, I think it's more about idempotency than concurrency.
Whether you have two column families or three, concurrent users may produce false positives, i.e. keys in the index column family that point to rows which no longer have that value. But with the two-column-family design, if you have to repeat part of the update process you may end up losing the row's key from the correct row of the index column family, whereas with the three-column-family design you can be sure that each row's key ends up in the correct place in the index column family.
Filtering the results solves the false-positive problem, but if the key is not in the correct place you simply cannot find the row, and the whole indexing mechanism is in vain.
Consider this example with a two-column-family design:
1) User 1 updates the location; Cassandra returns an error, but the write actually succeeded.
2) User 2 updates the location, reads the result of user 1's write as the current value, and writes his own location to the data column family.
3) User 1 retries, writes his location to the data column family and updates the index column family.
4) User 2 updates the index column family, deleting the entry for user 1's location and inserting his own.
In the end the row holds user 1's location, but its key only exists in the index row for user 2's location.
I made this example up just now, so it may have some problems, and you might be able to avoid losing the key by changing the update process, but it should convey the concept behind it; you can probably think of a better example.
I'm not completely sure about this, but the explanation makes sense to me, and hopefully I have managed to explain it to you.

Way to create a frozen table-view in SQLite?

I've got an SQLite table with potentially hundreds of thousands of entries, which is being added to (and occasionally removed from) in the background at irregular intervals. The UI needs to display this table in an arbitrary user-selected sorted order, within a wxWidgets wxListCtrl item.
I'm planning to use a wxLC_VIRTUAL list control, and query the table for small groups of items as needed using LIMIT and OFFSET, but I foresee trouble. When the background process makes changes to items that are "above" the currently-viewed ones, I can't see any way to know how the offsets of the currently-viewed items will change.
Is there some SQLite trick to handle this? Maybe a way to identify what offset a particular record is at in a specific sorted order, without iterating through all of the records returned by a SELECT statement?
Alternatively, is there some way to create an unchanging view of the database at a particular time, without a time-consuming duplication of it?
If all else fails, I can store the changed items and add them later, but I'm hoping I won't have to.
Solved it by creating a query that finds the index of an item by counting the number of items that are "less than" it (in the user-defined order). It's a little complex to write because of the user-defined ordering, but it works and runs surprisingly fast even on a huge table.
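In case it helps anyone later, the idea looks roughly like this; the items(id, name) schema, the sort on (name, id), and the use of JDBC with the sqlite-jdbc driver are my own illustration (the original code was C++/wxWidgets), but the interesting part is the SQL:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Position of a row in the user-selected order = number of rows that sort before it.
public final class RowPosition {
    static long positionOf(Connection db, String name, long id) throws Exception {
        String sql = "SELECT COUNT(*) FROM items WHERE name < ? OR (name = ? AND id < ?)";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setString(2, name);
            ps.setLong(3, id);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);   // 0-based position under ORDER BY name, id
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:items.db")) {
            System.out.println(positionOf(db, "example", 42L));
        }
    }
}

An index on the sort columns keeps the COUNT fast even when the table is large.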

How do I remove logically deleted documents from a Solr index?

I am implementing Solr for a free text search for a project where the records available to be searched will need to be added and deleted on a large scale every day.
Because of the scale I need to make sure that the size of the index is appropriate.
On my test installation of Solr, I index a set of 10 documents. Then I make a change in one of the documents and want to replace the document with the same ID in the index. This works correctly and behaves as expected when I search.
I am using this code to update the document:
getSolrServer().deleteById(document.getIndexId());
getSolrServer().add(document.getSolrInputDocument());
getSolrServer().commit();
What I noticed, though, is that when I look at the stats page for the Solr server, the figures are not what I expect.
After the initial index, numDocs and maxDocs both equal 10 as expected. When I update the document however, numDocs is still equal to 10 (expected) but maxDocs equals 11 (unexpected).
When reading the documentation I see that
maxDoc may be larger as the maxDoc count includes logically deleted documents that have not yet been removed from the index.
So the question is, how do I remove logically deleted documents from the index?
If these documents still exist in the index do I run the risk of performance penalties when this is run with a very large volume of documents?
Thanks :)
You have to optimize your index.
Note that an optimize is expensive; you probably should not do it more than once a day.
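For example, using the same getSolrServer() helper as in the question (just a sketch, scheduled as sparingly as noted above):

// Merges the index segments and expunges the logically deleted documents;
// afterwards numDocs and maxDocs should match again.
getSolrServer().optimize();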
Here is some more info on optimize:
http://www.lucidimagination.com/search/document/CDRG_ch06_6.3.1.3
http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations