How synchronous is galera cluster - galera

Actually I have couple of questions here.
1) When I call insert from my application using the MySQL connector, it is answered by one of the master nodes, but does that master node wait until the insert is applied on all the nodes before replying to the client? If it does wait for all the nodes to insert before replying to the client, then how does wsrep_sst_method=xtrabackup help: will it make the node reply to the client immediately, or will it make no difference? Maybe I misunderstood this variable.
2) What about reads? I guess a read is just answered by one of the master nodes, and only if wsrep_sync_wait is set does it wait for a reply from all the nodes.
Thanks

"How synchronous"? Synchronous enough, but with one exception: "Critical read".
The "fix" is during reading, not writing.
When writing, the heavyweight checking is done during COMMIT. At this point, all other nodes are contacted to see if "this transaction will eventually commit successfully". That is, the other nodes say "yes" but don't actually finish the work enough for a subsequent SELECT to see the results of the write. The guarantee here is that the cluster is in a consistent state and will stay that way, even if any one node dies.
"Critical read" is, for example, when a user posts something, then immediately reads the database and expects to see the posting. But if the read (SELECT) hits a different node, the "almost" synchronous nature of Galera means the data may not yet have been committed on the reading node. The data is there, and will be successfully written to disk, but maybe not yet. The workaround is to use wsrep_sync_wait when reading to ensure that replication is caught up before the SELECT. No action is taken when writing.
(I don't see the relevance of wsrep_sst_method=xtrabackup. That relates to recovering from a dead node.)
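The wsrep_sync_wait workaround for a critical read can be sketched roughly as follows. This is only an illustration: `cursor` is assumed to be any Python DB-API cursor connected to a Galera node, and the `posts` table and the helper names are hypothetical. The value 1 is the bitmask that asks the node to catch up on replication before READ statements; restoring 0 assumes the session was using the default.

```python
from contextlib import contextmanager


@contextmanager
def causal_reads(cursor, mask=1):
    """Enable wsrep_sync_wait for the statements run inside the block.

    mask=1 makes the node wait until it has applied pending replicated
    writes before it answers READ statements.
    """
    cursor.execute("SET SESSION wsrep_sync_wait = %d" % int(mask))
    try:
        yield cursor
    finally:
        # Assumption: the session default was 0 (no causality check),
        # so ordinary reads go back to being fast.
        cursor.execute("SET SESSION wsrep_sync_wait = 0")


def read_own_post(cursor, post_id):
    """Critical read: must see a row that was written a moment ago,
    possibly through a different node. The `posts` table is hypothetical."""
    with causal_reads(cursor):
        cursor.execute("SELECT body FROM posts WHERE id = %s", (post_id,))
        return cursor.fetchone()
```

Only the reads that really need it pay the extra wait; everything else keeps the default behaviour.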

Apache ignite spring data save method transaction behaviour with map parameter

As per the Apache Ignite Spring Data documentation, there are two methods to save data in an Ignite cache:
1. org.apache.ignite.springdata.repository.IgniteRepository.save(key, value)
and
2. org.apache.ignite.springdata.repository.IgniteRepository.save(Map<ID, S> entities)
So, I just want to understand the transaction behavior of the 2nd method. Suppose we are going to save 100 records by using the save(Map<ID, S>) method and, for some reason, some nodes go down after 70 records. In this case, will it roll back all 70 records?
Note: As per the 1st method's behavior, if we use @Transactional at method level then it will roll back the particular entity.
First of all, you should read about the transaction mechanism used in Apache Ignite. It is described very well in the articles here:
https://apacheignite.readme.io/v1.0/docs/transactions#section-two-phase-commit-2pc
The most interesting part for you is "Backup Node Failures" and "Primary Node Failures":
Backup Node Failures
If a backup node fails during either "Prepare" phase or "Commit" phase, then no special handling is needed. The data will still be committed on the nodes that are alive. GridGain will then, in the background, designate a new backup node and the data will be copied there outside of the transaction scope.
Primary Node Failures
If a primary node fails before or during the "Prepare" phase, then the coordinator will designate one of the backup nodes to become primary and retry the "Prepare" phase. If the failure happens before or during the "Commit" phase, then the backup nodes will detect the crash and send a message to the Coordinator node to find out whether to commit or rollback. The transaction still completes and the data within distributed cache remains consistent.
In your case, the updates for all values in the map should either all be applied in one transaction or all be rolled back. I hope these articles answer your question.
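The all-or-nothing outcome of the prepare/commit phases described above can be illustrated with a toy model. This is not Ignite code: `Participant` and `save_all` are hypothetical names, and the sketch does not model Ignite's promotion of a backup node to primary. It only shows that nothing becomes visible unless every participant has prepared.

```python
class Participant:
    """Toy stand-in for a cache node; not an Ignite API."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.committed = {}  # data visible to readers
        self.staged = {}     # data prepared but not yet visible

    def prepare(self, entries):
        if self.fail_on_prepare:
            raise RuntimeError(f"{self.name} failed during prepare")
        self.staged = dict(entries)

    def commit(self):
        self.committed.update(self.staged)
        self.staged = {}

    def rollback(self):
        self.staged = {}


def save_all(participants, entries):
    """Apply every entry on every participant, or none at all."""
    prepared = []
    try:
        # Phase 1: every participant must accept the whole batch.
        for p in participants:
            p.prepare(entries)
            prepared.append(p)
    except RuntimeError:
        # Any failure in phase 1 discards what was staged so far.
        for p in prepared:
            p.rollback()
        return False
    # Phase 2: only now does the data become visible.
    for p in participants:
        p.commit()
    return True
```

With 100 entries and one participant failing during prepare, `save_all` returns False and no participant exposes any of the entries, which is the behaviour the question asks about.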

JEE 7 JSR 352 passing data from batchlet to a chunk-step

I have read the standard (and the javadoc) but still have some questions.
My use case is simple:
A batchlet fetches data from an external source and acknowledges the data (meaning that the data is deleted from the external source after acknowledgement).
Before acknowledging the data the batchlet produces relevant output (an in-memory object) that is to be passed to the next chunk-oriented step.
Questions:
1) What is the best practice for passing data between a batchlet and a chunk step?
It seems that I can do that by calling jobContext#setTransientUserData
in the batchlet and then in my chunk step I can access that data by calling
jobContext#getTransientUserData.
I understand that both jobContext and stepContext are implemented in threadlocal-manner.
What worries me here is the "Transient"-part.
What will happen if the batchlet succeeds but my chunk-step fails?
Will the "TransientUserData"-data still be available or will it be gone if the job/step is restarted?
For my use case it is important that the batchlet is run just once.
So even if the job or the chunk step is restarted it is important that the output data from the successfully-run-batchlet is preserved - otherwise the batchlet would have to be run once more. (I have already acknowledged the data and it is gone - so running the batchlet once more would not help me.)
2) Follow-up question
In stepContext there are a couple of methods: getPersistentUserData and setPersistentUserData.
What is the intended usage of these methods?
What does the "Persistent"-part refer to?
Are these methods relevant only for partitioning?
Thank you!
/ Daniel
Transient user data is just transient, and will not be available during job restart. A job restart can happen in a different process or on a different machine, so users cannot count on transient job data from a previous run being available at restart.
Step persistent user data are those application data that the batch job developers deem necessary to save/persist for purpose of restarting, monitoring or auditing. They will be available at restart, but they are typically scoped to the current step (not across steps).
From reading your brief description, I got the feeling that your two steps are too tightly coupled and you can almost consider them a single unit of work. You want them to either both succeed or both fail in order to maintain your application state integrity. I think that could be the root of the problem.

Best way to handle timeouts on rabbitmq message processing

I am trying to get my head around an issue I have recently encountered and I hope someone will be able to point me in the most reasonable direction of solving it.
I am using Riak KV store and working on CRDT data, where I have some sort of counter inside each CRDT item stored in database.
I have a rabbitmq queue, where each message is a request to increase or decrease a certain amount of aforementioned counters.
Finally, I have a group of service-workers, that listens on the queue, and for each request try to change the amount of counters accordingly.
The issue I have is as follows: while a single worker is processing a request, it may get stuck for a while on a write operation to the database, let's say on the second change of counters out of three. Its connection with rabbitmq gets lost (timeout), so the message-request gets back onto the queue (I cannot afford to miss one). Then it is picked up by a second worker, which begins all the processing anew. However, the first worker finishes its work, and as a result I have processed a single message twice.
I can split those increments into single actions, but this still leaves me with a dilemma: I can still change the value of a counter twice if some worker gets stuck on a write operation for a long period.
I do not have possibility of making Riak KV CRDT writes work faster, nor can I accept missing out a message-request. I need to implement some means of checking whether a request was already processed before.
My initial thought was to use some alternative, quick KV store for storing the RabbitMQ message IDs of messages that are being processed. That way other workers could tell whether they are about to start processing a message that is already being handled elsewhere.
I could use any help and pointers to materials I can read.
You can't have "exactly once delivery" semantics. You can reduce double-sent messages or missed deliveries, so it's up to you to decide which misbehavior is the least inconvenient.
First of all, are you sure it's the CRDTs that are too slow? Are you using simple counters or counters inside maps? In my experience they are quite fast, although slower than plain KV. You could try:
- having simple CRDTs (no maps) and more CRDT objects, to lower the stress on each (can you split the counters in two?)
- not using CRDTs, but using good old sibling resolution on the client side on simple key/values.
- accumulating the counter update orders and applying them in batch, but then you're accepting an increase in latency, so it's equivalent to increasing the timeout.
Can you provide some metrics? Like how long the updates take, what numbers you'd expect, whether it's as slow when you have few updates or many updates, etc.
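The deduplication idea from the question (recording message IDs in a fast KV store before processing) might look roughly like this. It is a minimal sketch under assumptions: `ProcessedStore` is an in-memory stand-in for whatever store is used, and the only property relied on is an atomic "set if absent". All names are hypothetical.

```python
import threading


class ProcessedStore:
    """In-memory stand-in for a fast KV store with an atomic set-if-absent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._states = {}

    def claim(self, message_id):
        """Return True only for the first caller that claims this message ID."""
        with self._lock:
            if message_id in self._states:
                return False
            self._states[message_id] = "in_progress"
            return True

    def mark_done(self, message_id):
        with self._lock:
            self._states[message_id] = "done"

    def release(self, message_id):
        """Forget a claim so the message can be retried after a failure."""
        with self._lock:
            self._states.pop(message_id, None)


def handle(store, message_id, apply_changes):
    """Run apply_changes only if this worker wins the claim.

    Returns True if this call did the work, False if the message was
    already claimed or finished by another worker.
    """
    if not store.claim(message_id):
        return False
    try:
        apply_changes()
    except Exception:
        store.release(message_id)
        raise
    store.mark_done(message_id)
    return True
```

A worker that dies without releasing its claim would block the message forever, so a real store would need an expiry on the claim. That expiry reopens a small window for double processing, which is the trade-off the answer describes.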

Safely setting keys with StackExchange.Redis while allowing deletes

I am trying to use Redis as a cache that sits in front of an SQL database. At a high level I want to implement these operations:
Read value from Redis, if it's not there then generate the value via querying SQL, and push it in to Redis so we don't have to compute that again.
Write value to Redis, because we just made some change to our SQL database and we know that we might have already cached it and it's now invalid.
Delete value, because we know the value in Redis is now stale, we suspect nobody will want it, but it's too much work to recompute now. We're OK letting the next client who does operation #1 compute it again.
My challenge is understanding how to implement #1 and #3, if I attempt to do it with StackExchange.Redis. If I naively implement #1 with a simple read of the key and push, it's entirely possible that between me computing the value from SQL and pushing it in that any number of other SQL operations may have happened and also tried to push their values into Redis via #2 or #3. For example, consider this ordering:
Client #1 wants to do operation #1 [Read] from above. It tries to read the key, sees it's not there.
Client #1 calls to SQL database to generate the value.
Client #2 does something to SQL and then does operation #2 [Write] above. It pushes some newly computed value into Redis.
Client #3 comes along, does some other operation in SQL, and wants to do operation #3 [Delete] to Redis knowing that if there's something cached there, it's no longer valid.
Client #1 pushes its (now stale) value to Redis.
So how do I implement my operation #1? Redis offers a WATCH primitive that makes this fairly easy to do against the bare metal, where I would be able to observe that other things happened on the key from Client #1, but it's not supported by StackExchange.Redis because of how it multiplexes commands. Its conditional operations aren't quite sufficient here, since if I try saying "push only if key doesn't exist", that doesn't prevent the race as I explained above. Is there a pattern/best practice that is used here? This seems like a fairly common pattern that people would want to implement.
One idea I do have is I can use a separate key that gets incremented each time I do some operation on the main key and then can use StackExchange.Redis' conditional operations that way, but that seems kludgy.
This looks like a question about the right cache invalidation strategy rather than a question about Redis. Here is why I think so: Redis WATCH/MULTI is a kind of optimistic locking strategy, and this kind of locking is not suitable for most caching cases, where the expensive DB read query is exactly the problem the cache is meant to solve. In your description of operation #3 you write:
It's too much work to recompute now. We're OK letting the next client who does operation #1 compute it again.
So we can go with update-on-read as the update strategy. Here are some more questions before we continue:
What happens when two clients start to perform operation #1? Both of them may fail to find the value in Redis, perform the SQL query, and then both write it to Redis. So should we have a guarantee that just one client updates the cache?
How can we be sure of the right sequence of writes (operation #3)?
Why not optimistic locking
Optimistic concurrency control assumes that multiple transactions can frequently complete without interfering with each other. While running, transactions use data resources without acquiring locks on those resources. Before committing, each transaction verifies that no other transaction has modified the data it has read. If the check reveals conflicting modifications, the committing transaction rolls back and can be restarted.
You can read about the OCC transaction phases on Wikipedia, but in a few words:
If there is no conflict, you update your data. If there is a conflict, you resolve it, typically by aborting the transaction and restarting it if you still need to update the data.
Redis WATCH/MULTI is a kind of optimistic locking, so it can't help you here: you do not find out that your cache key was modified until you try to work with it.
What works?
Whenever you hear somebody talk about locking, a few words later you hear about compromises: performance, and consistency vs. availability. The last pair is the most important.
In most heavily loaded systems, availability wins. What does this mean for caching? Usually something like this:
Each cache key holds some metadata about the value: state, version and lifetime. The last one is not the Redis TTL: usually, if your key should be in the cache for X time, the lifetime in the metadata is X + Y, where Y is some extra time to guarantee that the update process can complete.
You never delete a key directly; you just update its state or lifetime.
Each time your application reads data from the cache, it should make a decision: if the data has state "valid", use it. If the data has state "invalid", try to update it or use the obsolete data.
How to update on read (the important part is this "hand-made" mix of optimistic and pessimistic locking):
Try to take a pessimistic lock (in Redis, with SETNX or SET with the NX option plus an expiry).
If that fails, return the obsolete data (remember, we still need availability).
If it succeeds, perform the SQL query and write the result into the cache.
Read the version from Redis again and compare it with the version read previously.
If the version is the same, mark the state as "valid".
Release the lock.
How to invalidate (your operations #2, #3):
Increment cache version and set state "invalid".
Update the lifetime/TTL if needed.
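The update-on-read and invalidation steps above can be sketched with an in-memory model. This is an illustration only, not StackExchange.Redis code: `VersionedCache` is a hypothetical name, a plain dict stands in for Redis, a set stands in for the expiring lock key, and the lifetime metadata is left out for brevity.

```python
class VersionedCache:
    """In-memory model of the scheme: each key has a value, a state and a version."""

    def __init__(self):
        self.entries = {}   # key -> {"value", "state", "version"}
        self.locks = set()  # keys currently being refreshed

    def _entry(self, key):
        return self.entries.setdefault(
            key, {"value": None, "state": "invalid", "version": 0})

    def invalidate(self, key):
        """Operations #2/#3: never delete, just bump the version and mark invalid."""
        entry = self._entry(key)
        entry["version"] += 1
        entry["state"] = "invalid"

    def read(self, key, load_from_sql):
        entry = self._entry(key)
        if entry["state"] == "valid":
            return entry["value"]
        if key in self.locks:
            # Someone else is refreshing: serve the obsolete data.
            return entry["value"]
        self.locks.add(key)  # stands in for a set-if-absent lock with expiry
        try:
            seen_version = entry["version"]
            entry["value"] = load_from_sql()
            # Mark valid only if nobody invalidated the key while we were loading.
            if entry["version"] == seen_version:
                entry["state"] = "valid"
            return entry["value"]
        finally:
            self.locks.discard(key)
```

If a writer invalidates the key while the reader is still querying SQL, the version check fails, the entry stays "invalid", and the next reader refreshes it again instead of trusting the stale value.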
Why so difficult
We can always get and return a value from the cache and rarely have a cache miss. So we avoid the cache invalidation cascade hell where many processes try to update one key.
We still have ordered key updates.
Just one process at a time can update a key.
I have a queue!
Sorry, you had not said so before, or I would not have written all this. If you have a queue, everything becomes simpler:
Each modification operation should push a job to the queue.
Only an async worker should execute the SQL and update the key.
You still need to use a "state" (valid/invalid) for the cache key to separate the application logic from the cache.
Is this the answer?
Actually, yes and no at the same time. This is one of the possible solutions. Cache invalidation is a very complex problem with many possible solutions: some of them may be simple, others complex. In most cases it depends on the real business requirements of the concrete application.

HA Database configuration that avoids split-brain issues?

I am looking for a (SQL/RDB) database setup that works something like this:
I will have 3+ databases in an active/active/active configuration
prior to doing any insert, the database will communicate with at least a majority of the others, such that they all either insert at the same time or rollback (transaction)
this way I can write and read from any of the databases, and always get the same results (as long as the field wasn't updated very recently)
note: this is for a use case that will be very read-heavy and have few writes (and delay on the writes is an OK situation)
does anything like this exist? I see all sorts of solutions with database HA configurations, but most of them suggest writing to a primary node or having a passive backup
alternatively I could setup a custom application, and have each application talk to exactly 1 database, and achieve a similar result, but I was hoping something similar would already exist
So my question is: does something like this exist? If not, are there any technical/architectural reasons why not?
P.S. - I will NOT be using a SAN where all databases can store/access the same data
edit: more clarifications as far as what I am looking for:
1. I have no database picked out yet, but I am more familiar with MySQL / SQL Server / Oracle, so I would have a minor inclination towards one of those
2. If a majority of the nodes are down (or a single node can't communicate with the collective), then I expect all writes from that node to fail, and accept that it may provide old/outdated information
failure / recover scenario expectations:
1. A node goes down: it will query and get updates from the other nodes when it comes back up
2. A node loses connection with the collective: it will provide potentially old data to read request, and refuse any writes
3. A node is in disagreement with the data stores in others: majority rule
4. Majority rule does not work: go with whoever has the latest data (although this really shouldn't happen)
5. The entries are insert/update/read only, i.e. there will be no deletes (except manually ofc), so I shouldn't need to worry about an update after a delete, however in that case I would choose to delete the record and ignore the update
6. Any other scenarios I missed?
update: The closest I see to what I am looking for seems to be using a quorum + 2 DBs; however, I would prefer to have 3 DBs instead, so that I can query any of them at any time (to further distribute the reads, and also to keep another copy of the data)
You need to define "very recently". In most environments with replication for inserts, all the databases will have the same data within a few seconds of an insert (and a few seconds seems pessimistic).
An alternative approach is a "read-from-one/write-to-all" approach. In this case, reads are spread through the system. Writes are then sent to all nodes by the application (or a common layer that the application uses).
Remember, though, that the issue with database replication is not how it works when it works. The issue is how it recovers when it fails and even how failures are identified. You need to decide what happens when nodes go down, how they recover lost transactions, how you decide that nodes are really synchronized. I would suggest that you peruse the documentation of the database that you are actually using and understand the replication mechanisms provided by that platform.
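The "read-from-one/write-to-all" approach mentioned above can be sketched as a toy application-side layer. Everything here is hypothetical (`Node`, `WriteAllReadOne`, dicts standing in for databases), and the sketch deliberately skips the hard parts the answer warns about: a node failing in the middle of a write, and how a recovered node catches up.

```python
import random


class Node:
    """Toy database node holding rows in a dict."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.rows = {}


class WriteAllReadOne:
    """Application-side layer: writes go to every node, reads go to any one."""

    def __init__(self, nodes, rng=None):
        self.nodes = nodes
        self.rng = rng or random.Random()

    def write(self, key, value):
        """Apply the write on every node, or refuse it if any node is unreachable."""
        if not all(n.up for n in self.nodes):
            return False
        for n in self.nodes:
            n.rows[key] = value
        return True

    def read(self, key):
        """Serve the read from any one reachable node."""
        live = [n for n in self.nodes if n.up]
        if not live:
            raise RuntimeError("no node available")
        return self.rng.choice(live).rows.get(key)
```

Refusing writes while any node is down keeps the copies identical at the price of write availability; relaxing that to "a majority of nodes" is where quorum-based systems come in.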