Lucene IndexCommit serializable? - serialization

IndexCommit does not implement java.io.Serializable - does that mean one cannot persist it to another data store? What I want is to save a write history, i.e., a list of commit points, so that even if I bounce my Lucene-based process, I can still travel back to a particular commit point from before the bounce. Is that possible in Lucene?
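Note that an IndexCommit is only a handle onto a commit point that already lives in the index directory, so instead of serializing the object you can persist a stable identifier for it (for example the segments file name or generation) and re-open a reader at that commit later, as long as your deletion policy keeps old commits around. A rough sketch of that idea, using standard Lucene APIs and a made-up index path:

    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexCommit;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.NoDeletionPolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CommitPointExample {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/tmp/index")); // hypothetical path

            // Keep every commit instead of deleting old ones, so commit points survive restarts.
            IndexWriterConfig cfg = new IndexWriterConfig()
                    .setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                // ... add/update documents, then:
                writer.commit();
            }

            // Instead of serializing IndexCommit, persist a stable identifier such as
            // the segments file name (or generation) in your own store.
            List<IndexCommit> commits = DirectoryReader.listCommits(dir);
            String savedSegmentsFile = commits.get(0).getSegmentsFileName();

            // Later (even after a restart), find that commit again and open a reader on it.
            for (IndexCommit commit : DirectoryReader.listCommits(dir)) {
                if (commit.getSegmentsFileName().equals(savedSegmentsFile)) {
                    try (DirectoryReader reader = DirectoryReader.open(commit)) {
                        System.out.println("docs at this commit: " + reader.numDocs());
                    }
                }
            }
        }
    }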

Related

Avoid two-phase commits in an event-sourced application saving BLOB data

Let's assume we have an Aggregate User which has a UserPortraitImage and a Contract as a PDF file. I want to store files in a dedicated document-based store and just hold process-relevant data in the event (with a link to the BLOB data).
But how do I avoid a two-phase commit when I have to store the files and store the new event?
At first I'd store the documents and then the event; if the first transaction fails it doesn't matter, the command failed. If the second transaction fails it also doesn't matter even if we generated some dead files in the store, the command fails; we could even apply a rollback.
But could there be an additional problem?
The next question is how to design the aggregate and the event. If the aggregate only holds a reference to the BLOB storage, what does the process look like after a SignUp command is called?
SignUpCommand ==> Store documents (UserPortraitImage and Contract) ==> Create new User aggregate with the given BLOB storage references and store it?
Is there a better design which unburdens the aggregate of knowing that BLOB data is saved in another store? And who is responsible for storing BLOB data and forwarding the reference to the aggregate?
Sounds like you are working with something analogous to an AtomPub media-entry/media-link-entry pair. The blob is going into your data store, and the metadata gets copied into the aggregate history.
But how do I avoid a two-phase commit when I have to store the files and store the new event?
In practice, you probably don't.
That is to say, if the blob store and the aggregate store happen to be the same database, then you can update both in the same transaction. That couples the two stores, and adds some pretty strong constraints to your choice of storage, but it is doable.
Another possibility is that you accept that the two changes that you are making are isolated from one another, and therefore that for some period of time the two stores are not consistent with each other.
In this second case, the saga pattern is what you are looking for, and it is exactly what you describe; you pair the first action with a compensating action to take if the second action fails. So "manual" rollback.
Or not - in a sense, the git object database uses a two-phase commit; an object gets copied into the object store, and then the trees get updated, and then the commit... garbage collection comes along later to discard the objects that you don't need.
who is responsible for storing BLOB data and forwarding the reference to the aggregate?
Well, ultimately it is an infrastructure concern; does your model actually need to interact with the document, or is it just carrying a claim check that can be redeemed later?
At first I'd store the documents and then the event; if the first transaction fails it doesn't matter, the command failed. If the second transaction fails it also doesn't matter even if we generated some dead files in the store, the command fails; we could even apply a rollback. But could there be an additional problem?
Not that I can think of, aside from wasted disk space. That's what I typically do when I want to avoid distributed transactions or when they're not available across the two types of data stores. Oftentimes, one of the two operations is less important and you can afford to let it complete even if the master operation fails later.
Cleaning up botched attempts can be done during exception handling, as an out-of-band process or as part of a Saga as #VoiceOfUnreason explained.
SignUpCommand ==> Store documents (UserPortraitImage and Contract) ==> Create new User aggregate with the given BLOB storage references and store it?
Yes. Usually the Application layer component (the Command handler in your case) acts as a coordinator between the different data stores and gets back all it needs to know from one store before talking to the other or to the Domain.
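As a rough sketch of that coordination in Java (all the types here - BlobStore, UserRepository, the command and aggregate classes - are hypothetical placeholders, not any particular framework's API), the handler stores the BLOBs first, then persists the aggregate holding only the references, and compensates if the second step fails:

    import java.util.UUID;

    // Hypothetical ports; the real interfaces depend on your infrastructure.
    interface BlobStore {
        String put(byte[] content, String contentType); // returns a reference/URI
        void delete(String reference);                  // used as the compensating action
    }

    interface UserRepository {
        void save(User user); // appends the UserSignedUp event / persists the aggregate
    }

    record SignUpCommand(String name, byte[] portraitImage, byte[] contractPdf) {}

    record User(UUID id, String name, String portraitRef, String contractRef) {}

    class SignUpCommandHandler {
        private final BlobStore blobs;
        private final UserRepository users;

        SignUpCommandHandler(BlobStore blobs, UserRepository users) {
            this.blobs = blobs;
            this.users = users;
        }

        User handle(SignUpCommand cmd) {
            // 1. Store the BLOBs first; if this fails the command simply fails.
            String portraitRef = blobs.put(cmd.portraitImage(), "image/png");
            String contractRef = blobs.put(cmd.contractPdf(), "application/pdf");

            try {
                // 2. Create and persist the aggregate, which only holds the references.
                User user = new User(UUID.randomUUID(), cmd.name(), portraitRef, contractRef);
                users.save(user);
                return user;
            } catch (RuntimeException e) {
                // 3. Compensating action: best-effort cleanup of the now-orphaned blobs.
                blobs.delete(portraitRef);
                blobs.delete(contractRef);
                throw e;
            }
        }
    }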

How can I get data from an SQL database as it was at a specific point in time?

I'm writing a browser turn-based RPG. Nearly every part of the game, including enemies, items, and levels, is a row in an SQL table corresponding to the prototype of that object. The same data is accessible in wiki format, allowing users to edit it freely, subject to some community regulation. However, if the game were live at this precise moment, and I was playing, and some troll decided to make the next boss's health "over 9000!", it would be devastating to my campaign, and the losses I suffered would be irreversible. With this in mind, I want to implement a sort of "release system" for the game data: users can choose for their client to fetch data as it is updated, or to fetch data that has been reviewed and tested on the first of each month. What would be the best way to do this (although I'm fairly sure the correct answer is "copy your database once a month")?
For your purpose, you want to add a boolean column "tested" to your tables and, depending on what mode you are playing in, request either the values for which "tested" is true or the values for which it is false.
Changes to the values should be stored in their own tables and merged with the directly accessed values when "tested" is false. Once a month, the testing review goes through these changes and decides which ones should stay. The changes can then be cleaned out if you intend the update process you described to be permanent.
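As a rough illustration of the "tested"-flag idea (the enemies table, its columns, and the JDBC wiring below are made up for the example), the client simply picks which flag value to query depending on the mode it is playing in:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class GameDataLoader {
        // Load an enemy prototype either from the reviewed ("tested") data or from
        // the live, freely editable data, depending on the client's chosen mode.
        public static int loadBossHealth(String jdbcUrl, int enemyId, boolean reviewedOnly) throws Exception {
            String sql = "SELECT health FROM enemies WHERE id = ? AND tested = ?"; // hypothetical schema
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setInt(1, enemyId);
                stmt.setBoolean(2, reviewedOnly);
                try (ResultSet rs = stmt.executeQuery()) {
                    return rs.next() ? rs.getInt("health") : -1;
                }
            }
        }
    }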

How to design a news feed system like Google Reader? [closed]

I'm preparing for a system design interview and I expect to be asked this kind of question, so I want to show my design process for it. I would also like to know the best practices for solving some of the difficulties that come up along the way. I want to think in terms of scalability and how I would handle heavy read and write load on the database. Please correct me if I'm wrong in any of my thinking.
First, I want to build subscribe/unsubscribe functionality, and for each user I want to support marking feeds as read/unread. How can I design a system like this? At first glance, the first problem I can see is that if I put all the data in the database, it could involve tons of read/write operations once thousands of users subscribe to or unsubscribe from a source, or a media source like CNN posts a feed every 5-10 minutes.
Obviously, the database would become a bottleneck once the user base grows past a certain point.
How can I solve this? What are the usual approaches to this problem? Although the database is a bottleneck from this point of view, we still need a database, just with a better design, right? I have seen a lot of articles talking about denormalized data.
Question:
What's the best way to store the subscribers for each source?
In the database, I can think of a table with "source_id" and "user_id" columns, meaning user_id subscribes to source_id. Is this a good design or a bad one? If tons of users subscribe to a new source, the database becomes a burden.
The approach I can think of is using Redis, which provides fast writes and fast reads.
Advantages:
Fast read and write operations.
Provides multiple data structures rather than a simple key-value store.
Disadvantages:
Data needs to fit in memory ⇒ solution: sharding; I can use twemproxy to manage the cluster.
If a node is lost, we lose data ⇒ solution: replication with a "master-slave" setup. Write to the master, read from the slaves, and back data up to disk (persistence). Additionally, take a snapshot on an hourly basis.
Now that I have listed the pros and cons of moving to a Redis cluster, how do I store the relation between source and subscriber in Redis? Is it a good design to have a hash map where each source points to a list of its subscribers?
For example,
CNN ⇒ (sub1, sub2, sub3, sub4, ...)
ESPN ⇒ (sub1, sub2, sub3, sub4, ...)
…
In terms of scalability, we can shard sources and users across dedicated Redis nodes - this is at least what I can think of right now.
In addition, we can store user info (which sources a user subscribes to) in Redis as well and shard users across multiple nodes:
User1 ⇒ (source1, source2, source4..)
User2 ⇒ (source1, source2, source4..)
...
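As a concrete sketch of that layout (assuming the Jedis client; the key naming scheme is just illustrative), both directions of the relation can be kept as Redis sets, which also makes subscribe/unsubscribe a pair of cheap set operations:

    import java.util.Set;
    import redis.clients.jedis.Jedis;

    public class SubscriptionStore {
        private final Jedis jedis;

        public SubscriptionStore(Jedis jedis) {
            this.jedis = jedis;
        }

        // source -> subscribers and user -> sources, kept as two Redis sets.
        public void subscribe(String userId, String sourceId) {
            jedis.sadd("source:" + sourceId + ":subscribers", userId);
            jedis.sadd("user:" + userId + ":sources", sourceId);
        }

        public void unsubscribe(String userId, String sourceId) {
            jedis.srem("source:" + sourceId + ":subscribers", userId);
            jedis.srem("user:" + userId + ":sources", sourceId);
        }

        public Set<String> subscribersOf(String sourceId) {
            return jedis.smembers("source:" + sourceId + ":subscribers");
        }

        public Set<String> sourcesOf(String userId) {
            return jedis.smembers("user:" + userId + ":sources");
        }
    }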
For feeds and posts from a single source, I can have both a database table and a Redis data structure (basically, my idea is to store everything in Redis with the database as backup - is that a good design consideration in this case? Maybe not everything; perhaps only active users or recent feeds in Redis).
Database: I want to make it as compact as possible, storing only one copy of each feed:
feedID, sourceID, created_timestamp, data
Redis: store the feedID, source_id and content, and find subscribers based on source_id.
For the read/unread part, I don't have a clear idea how to design around these limitations.
Every user has a join timestamp, and the server will push a feed (at most 10 feeds per source) if the user hasn't read it. What is a good design for telling whether a user has read an item or not? My initial thought is to keep track of every feed each user has read or not read, but that table could grow linearly with the number of feeds. In Redis, I can design a similar structure:
Userid, feedid, status
User1, 001, read
User1, 002, read
User1, 003, unread
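A space-saving variation on the table above (again a sketch assuming Jedis, with illustrative key names) is to store only the IDs of feeds a user has already read in a per-user set and treat anything absent from the set as unread:

    import redis.clients.jedis.Jedis;

    public class ReadStateStore {
        private final Jedis jedis;

        public ReadStateStore(Jedis jedis) {
            this.jedis = jedis;
        }

        // Mark a feed as read for this user; absence from the set means "unread".
        public void markRead(String userId, String feedId) {
            jedis.sadd("user:" + userId + ":read", feedId);
        }

        public boolean isRead(String userId, String feedId) {
            return jedis.sismember("user:" + userId + ":read", feedId);
        }
    }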
At this point, my initial idea of the data structures is as above. Redis runs in a "master-slave" setup and backs up to disk on an hourly basis.
Now I'm going to think about how the subscribe/unsubscribe process works. A user clicks the subscribe button on a media page, for instance CNN. The web server receives a request saying "user X subscribes to source Y". In the application layer, we find the machine that holds user X's data; this can be achieved by installing a sharding map on every application server, which works like user_id mod number_of_shards = machine_id.
Once the application has looked up the server IP that holds user X's data, the application server talks to that Redis node and updates the user structure with the new source_id. Unsubscribing works the same way.
For marking a specific feed read/unread for user X, the application looks up the Redis node and updates the structure there, and Redis asynchronously propagates the update to the database (here I embrace eventual consistency).
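The routing step described above can be a small static map baked into each application server; a minimal sketch (the host list and shard count are made up):

    public class ShardRouter {
        // Hypothetical Redis hosts; in practice this map would come from configuration.
        private static final String[] REDIS_HOSTS = {
            "redis-0.internal", "redis-1.internal", "redis-2.internal", "redis-3.internal"
        };

        // user_id mod number_of_shards = machine index, as described above.
        public static String hostForUser(long userId) {
            int shard = (int) Math.floorMod(userId, (long) REDIS_HOSTS.length);
            return REDIS_HOSTS[shard];
        }
    }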
Let's think about how to design the push/pull model.
For push notifications, once there is a recent feed, I can store the most recent feeds in Redis and update only the active users (the reason is to avoid as many write operations on the database as possible).
For the pull model, we only update a user once they reload their home feed page, which also avoids a lot of disk seek time.
Some points:
Only put active users in Redis (logged in within the last 30 days).
If a user has been inactive for 6 months and then logs back in to check their feeds, another service reconstructs their data from the database, puts it into Redis, and serves the user.
Store recent feeds in Redis and only push notifications to subscribers who are active at that time. This avoids disk seek time on the database.
To make feeds sortable, embed a timestamp in the feedID. For example, the first 10 bits of the feedID could be the timestamp, and we could embed another 10 bits for the sourceID in the ID as well. This makes feeds naturally sortable (see the sketch just after these points).
Application servers can scale horizontally and sit behind a load balancer.
Application servers connect to the Redis cluster, and the database is there for backing data up and reconstructing it when needed (as in the inactive-user case).
Redis uses a "master-slave" setup: write to the master, read from the slaves, and replicate data asynchronously. Data is backed up to disk on a regular basis, and the database is also updated asynchronously.
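As a sketch of the sortable-ID point above (the bit widths here are illustrative and more generous than the 10 bits mentioned, since a millisecond timestamp needs more room; the layout is essentially Snowflake-style):

    public final class FeedIds {
        // Illustrative layout: milliseconds since a custom epoch in the high bits,
        // then a source id, then a per-millisecond sequence number.
        private static final long CUSTOM_EPOCH_MS = 1577836800000L; // 2020-01-01T00:00:00Z
        private static final int SOURCE_BITS = 12;
        private static final int SEQUENCE_BITS = 10;

        // Put the timestamp in the high bits so that numeric order == chronological order.
        public static long newFeedId(long timestampMillis, long sourceId, long sequence) {
            long ts = timestampMillis - CUSTOM_EPOCH_MS;
            return (ts << (SOURCE_BITS + SEQUENCE_BITS))
                    | ((sourceId & ((1L << SOURCE_BITS) - 1)) << SEQUENCE_BITS)
                    | (sequence & ((1L << SEQUENCE_BITS) - 1));
        }

        public static long timestampMillisOf(long feedId) {
            return (feedId >>> (SOURCE_BITS + SEQUENCE_BITS)) + CUSTOM_EPOCH_MS;
        }

        public static long sourceOf(long feedId) {
            return (feedId >>> SEQUENCE_BITS) & ((1L << SOURCE_BITS) - 1);
        }
    }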
Questions:
Is updating the database asynchronously from Redis when new events come in a feasible solution, or is keeping the replication enough?
I know it's a long post and I want to hear back from the community. Please correct me if I'm wrong, or point out anything else, so we can discuss the approach further.

Updating Lucene payloads without a full re-index

In Lucene, I'm using payloads to store information for each token in a document (a float value in my case). From time to time, those payloads may need to be updated. If I know the docID, termID, offset, etc., is there any way for me to update the payloads in place without having to re-index the whole document?
I'm not aware of any Lucene API that supports this; even an "update" operation is executed under the hood as a "delete" followed by an "add".
A workaround that requires more storage but reduces IO and latency could be to store the whole source of a document either in the Lucene index itself or in a dedicated data store on the same node as the Lucene index. Then you could still send only the updated payload info to your application to get the document updated, but the whole document still needs to be re-indexed.
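In practice, one way to do that re-index is through IndexWriter.updateDocument: rebuild the document from the stored source, re-analyzing the payload-bearing field with the new values, and replace the old document keyed by some ID term. A minimal sketch, assuming an "id" field and that the caller has already rebuilt the document with the new payloads:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class PayloadUpdater {
        // Replaces the stored document whose "id" field equals docId. Under the hood
        // Lucene deletes the old document and adds the rebuilt one; there is no
        // in-place payload patching.
        public static void updatePayloads(IndexWriter writer, String docId, Document rebuiltDoc)
                throws Exception {
            // rebuiltDoc must be reconstructed from the stored source, with the analyzer
            // (or a custom TokenStream) emitting the new payload values for each token.
            writer.updateDocument(new Term("id", docId), rebuiltDoc);
            writer.commit();
        }
    }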
See also How to set a field to keep a row unique in lucene?

Programmatically purge document deletion stubs

I have a database with an agent that periodically deletes (via a Java agent, using the "removePermanently" method) all documents in a view and re-creates them.
After some months, I've noticed that the database size has increased considerably.
Showing the database information with this command
sh database <dbpath>
it turns out that I have a lot of deleted documents (I suppose they are deletion stubs):
Document Type    Live     Deleted
Documents        1,922    817,378
After compacting the database, 80% of the space was recovered.
Is there a way to programmatically delete the stubs for good, to avoid this "database explosion"? Or is there a way to manage this scenario (deletion and re-creation of documents) correctly?
Don't delete the documents! Re-use them. That's the best answer. Seriously. Take the existing documents, clear the fields and set Form := "Obsolete". Modify the selection formula for all your views by appending & Form != "Obsolete" Create a new hidden view called "Obsolete" with selection formula Form = "Obsolete", and instead of creating new documents, change your code to go to the Obsolete view, grab an available document and set new field values (including changing the Form field). Only create new documents if there are not enough available in the Obsolete view. Any performance that you lose by doing this, which really should be minimal with the number of documents that you seem to have, will be more than offset by what you will gain by avoiding the growth and fragmentation of the NSF file that you are creating by doing all the deletions and creating new documents.
If, however, there's no possible way for you to do that -- maybe some third party tool that is outside of your control is creating the documents -- then it's important to know if the database you are talking about is replicated. If it is replicated, then you must be very careful because purging deletion stubs before all replicas are brought up to date will cause deleted documents to "come back to life" if a replica that has been off-line since before the delete occurs comes back on-line.
If the database is not replicated at all, or is reliably replicated across all replicas quickly, then you can reduce the purge interval. Go to the Replication Settings dialog, find the checkbox labeled "Remove documents not modified in the last __ days". Do not check the box, but enter a small number into the number of days. The purge interval for deletion stubs will be set to 1/3 of this number. So if you set it to 3 the effect will be that stubs are kept for 1 day and then purged, giving you 24 hours to assure that all replicas are up to date. If you need more, set the interval higher, maintaining the 3x multiple as needed. If a server is down for an extended period of time (longer than your purge interval), then adjust your operations procedures so that you will be sure to disable replication of the database to that server before it comes back on line and the replica can be deleted and recreated. Be aware, though, that user replicas pose the same problem, and it's not really possible to control or be aware of user replicas that might go off-line for longer than the purge interval. In any case, remember: do not check the box. To reduce the purge interval for deletion stubs only, just reduce the number.
Apart from this, the only way to programmatically delete deletion stubs requires use of the Notes C API. It's possible to call the required routines from LotusScript, but in my experience once the total number of stubs plus documents gets too high you will likely run into an error and may have to create and deploy a new non-replica copy of the database to get past it. You can find code along with my explanation in the answer to this previous question.
I have to second Richard's recommendation to reuse documents. I recently had a similar project, and started the way you did with deleting everything and importing half a million records every night. Deletion stubs and the growth of the FT index quickly became problems, eating up huge amounts of disk space and slowing performance significantly. I tried to manage the deletion stubs, but I was clearly going against the grain of Domino's architecture.
I read Richard's suggestion here, and adopted that approach. Here's what I did:
1) Create 2 views based on Form - one for 'active' records, and another for 'inactive' records.
2) Start the agent by setting autoupdate = false for both views.
3) Use stampall("form", "inactive") to change all of the active records to inactive.
4) Manually refresh the 2 views using notesview.refresh().
5) Start importing data. For each record, pull a document out of the pool of inactive records (by walking the 'inactive' view).
6) If I run out of inactive records in the pool, create new ones.
7) When the import is complete, manually refresh the views again.
8) Use db.createftindex(0, true) to re-create the FT index.
The code is really not that complex, and it runs in about the same amount of time as my original approach, if not faster.
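For reference, here is a rough sketch of what those steps might look like as a Notes Java agent. The view names, the "Subject" field, and loadRecordsToImport() are placeholders for whatever the real import source and schema look like, so treat this as an outline of the approach rather than a drop-in agent:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    import lotus.domino.AgentBase;
    import lotus.domino.Database;
    import lotus.domino.Document;
    import lotus.domino.NotesException;
    import lotus.domino.Session;
    import lotus.domino.View;

    public class NightlyImportAgent extends AgentBase {
        public void NotesMain() {
            try {
                Session session = getSession();
                Database db = session.getAgentContext().getCurrentDatabase();

                // Steps 1-2: two views selected on the Form value, with auto-update off.
                View activeView = db.getView("Active");     // placeholder view names
                View inactiveView = db.getView("Inactive");
                activeView.setAutoUpdate(false);
                inactiveView.setAutoUpdate(false);

                // Step 3: mark every existing record inactive instead of deleting it.
                db.getAllDocuments().stampAll("Form", "Inactive");

                // Step 4: refresh both views so the pool of reusable documents is current.
                activeView.refresh();
                inactiveView.refresh();

                // Steps 5-6: walk the inactive pool, reusing documents and only creating
                // new ones when the pool runs out.
                Document candidate = inactiveView.getFirstDocument();
                for (Map<String, String> record : loadRecordsToImport()) {
                    Document doc;
                    if (candidate != null) {
                        doc = candidate;
                        candidate = inactiveView.getNextDocument(candidate);
                    } else {
                        doc = db.createDocument();
                    }
                    doc.replaceItemValue("Form", "Active");
                    doc.replaceItemValue("Subject", record.get("subject")); // placeholder field
                    doc.save(true, false);
                }

                // Step 7: refresh the views again once the import is complete.
                activeView.refresh();
                inactiveView.refresh();

                // Step 8: rebuild the full-text index.
                db.createFTIndex(0, true);
            } catch (NotesException e) {
                e.printStackTrace();
            }
        }

        // Placeholder: in the real agent this would read the nightly import source.
        private List<Map<String, String>> loadRecordsToImport() {
            return Collections.emptyList();
        }
    }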
Thanks Richard!
Also, look at the advanced db properties - several things there that will help optimize the db.
It sounds like you are "refreshing" the contents of the database by periodically deleting all the documents and creating new ones from some other source. Cut that out. If the data are in the Notes database already, leave the document alone. What you're doing is very inefficient.