Cassandra Commit and Recovery on a Single Node - crash

I am a newbie to Cassandra. I have been searching for information related to commits and crash recovery in Cassandra on a single node, and I am hoping someone can clarify the details.
I am testing Cassandra, so I set it up on a single node. I am using the DataStax stress tool to insert millions of rows. What happens if there is an electrical failure or system shutdown? Will all the data that was in Cassandra's memory get written to disk upon Cassandra restart (I guess the commit log acts as an intermediary)? How long does this process take?
Thanks!

Cassandra's commit log gives Cassandra durable writes. When you write to Cassandra, the write is appended to the commit log before the write is acknowledged to the client. This means every write that the client receives a successful response for is guaranteed to be written to the commit log. The write is also made to the current memtable, which will eventually be written to disk as an SSTable when large enough. This could be a long time after the write is made.
However, the commit log is not immediately synced to disk for performance reasons. The default is periodic mode (set by the commitlog_sync param in cassandra.yaml) with a period of 10 seconds (set by commitlog_sync_period_in_ms in cassandra.yaml). This means the commit log is synced to disk every 10 seconds. With this behaviour you could lose up to 10 seconds of writes if the server loses power. If you had multiple nodes in your cluster and used a replication factor of greater than one you would need to lose power to multiple nodes within 10 seconds to lose any data.
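For reference, the periodic defaults correspond to the following settings in cassandra.yaml (a sketch; check the file shipped with your version for the exact commented defaults):
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000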
If this risk window isn't acceptable, you can use batch mode for the commit log. This mode won't acknowledge writes to the client until the commit log has been synced to disk. The time window is set by commitlog_sync_batch_window_in_ms, default is 50 ms. This will significantly increase your write latency and probably decrease the throughput as well so only use this if the cost of losing a few acknowledged writes is high. It is especially important to store your commit log on a separate drive when using this mode.
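Switching to batch mode means setting something like this in cassandra.yaml instead (again a sketch, using the 50 ms default window mentioned above):
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 50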
In the event that your server loses power, on startup Cassandra replays the commit log to rebuild its memtable. This process will take seconds (possibly minutes) on very write heavy servers.
If you want to ensure that the data in the memtables is written to disk, you can run 'nodetool flush' (this operates per node). This will create a new SSTable and delete the commit log segments that refer to the data in the flushed memtables.
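For example, to flush every keyspace on a node, or only a specific keyspace (the keyspace name below is just a placeholder), you would run:
nodetool flush
nodetool flush my_keyspace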

You are asking something like:
What happens if there is a network failure while data is being loaded into Oracle using SQL*Loader?
Or what happens if Sqoop stops processing due to some condition while transferring data?
Simply put, whatever data was transferred before the electrical failure or system shutdown remains as it is.
Coming to the second question: whenever the memtable runs out of space, i.e. when the number of keys or the amount of data exceeds a configured limit, or when a time threshold is reached, it is flushed to an SSTable, which is immutable.

Related

Behaviour of redis client-output-buffer-limit during resynchronization

I'm assuming that during replica resynchronisation (full or partial), the master will attempt to send data as fast as possible to the replica. Wouldn't this mean the replica output buffer on the master would rapidly fill up since the speed the master can write is likely to be faster than the throughput of the network? If I have client-output-buffer-limit set for replicas, wouldn't the master end up closing the connection before the resynchronisation can complete?
Yes, the Redis master will close the connection and the synchronization will start again from the beginning. But please find some details below:
Do you need to touch this configuration parameter, and what is its purpose/benefit/cost?
There is almost zero chance this will happen with the default configuration and reasonably modern hardware.
"By default normal clients are not limited because they don't receive data without asking (in a push way), but just after a request, so only asynchronous clients may create a scenario where data is requested faster than it can read." - quoted from the documentation.
Even if it does happen, the replication will start again from the beginning, but it may lead to a near-infinite loop in which the replicas continuously ask for synchronization over and over. For each attempt the Redis master has to fork and dump a full memory snapshot (perform a BGSAVE), and can use up to 3 times the RAM of the initial snapshot size during each synchronization. That causes higher CPU utilization, memory spikes, network utilization and IO.
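For reference, the replica limit is set with the client-output-buffer-limit directive in redis.conf, and the stock default looks like this (in Redis 5 and later the client class is named replica instead of slave):
client-output-buffer-limit slave 256mb 64mb 60
That is, the master drops the replica connection if the buffer exceeds the 256 MB hard limit, or stays above the 64 MB soft limit for 60 seconds.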
General recommendations to avoid production issues when tweaking this configuration parameter:
Don't decrease this buffer, and before increasing its size make sure you have enough memory on your box.
Consider the total amount of RAM required as the snapshot memory size (doubled for the copy-on-write BGSAVE process), plus the size of any other configured buffers, plus some extra capacity.
Please find more details here

Dirty Reads in SQL Server AlwaysOn

I have a pair of SQL Server 2014 databases set up as a synchronous AlwaysOn availability group.
Both servers are set to the Synchronous commit availability mode, with a session timeout of 50 seconds. The secondary is set to be a Read-intent only readable secondary.
If I write to the primary and then immediately read from the secondary (via ApplicationIntent=ReadOnly), I consistently read dirty data (i.e. the state before the write). If I wait for around a second between writing and reading, I get the correct data.
Is this expected behaviour? If so, is there something I can do to ensure that reads from the secondary are up-to-date?
I'd like to use the secondary as a read-only version of the primary (as well as a fail-over), to reduce the load on the primary.
There is no way you can get dirty reads unless you are using the NOLOCK hint.
When you enable read-only secondaries in AlwaysOn, SQL Server internally uses row versioning to store the previous version of the row.
Further, you are using synchronous commit mode, which ensures that log records are hardened on the secondary before the commit is acknowledged on the primary.
What you are seeing is data latency.
This whitepaper deals with this scenario. Below is the relevant part, which helps in understanding it:
The reporting workload running on the secondary replica will incur some data latency, typically a few seconds to minutes depending upon the primary workload and the network latency.
The data latency exists even if you have configured the secondary replica to synchronous mode. While it is true that a synchronous replica helps guarantee no data loss in ideal conditions (that is, RPO = 0) by hardening the transaction log records of a committed transaction before sending an ACK to the primary, it does not guarantee that the REDO thread on secondary replica has indeed applied the associated log records to database pages.
So there is some data latency. You may wonder if this data latency is more likely when you have configured the secondary replica in asynchronous mode. This is a more difficult question to answer. If the network between the primary replica and the secondary replica is not able to keep up with the transaction log traffic (that is, if there is not enough bandwidth), the asynchronous replica can fall further behind, leading to higher data latency.
In the case of synchronous replica, the insufficient network bandwidth does not cause higher data latency on the secondary but it can slow down the transaction response time and throughput for the primary workload.
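If you want to see how far behind the redo thread is at a given moment, one option (a sketch; run it on the secondary replica) is to query the AlwaysOn DMVs:
SELECT database_id, synchronization_state_desc, redo_queue_size, redo_rate
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1;
redo_queue_size (KB) and redo_rate (KB/sec) together give a rough feel for the current data latency on the secondary.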

Does Bigtable write operations to the log for every single operation or in batches?

I was wondering how Google's Bigtable stays persistent. When a write operation comes in, the tablet server updates the in-memory "hashmap" and it is also written to a log file. This way, if the tablet server dies, a new tablet server can read all recent operations and be "equal" to the dead tablet.
This makes sense, but doesn't it slow down to write every operation to a log server rather than in batch (because it is written to a disk)?
Let's take each of these questions in turn.
Does Bigtable write operations to the log for every single operation or in batches?
Bigtable writes every single operation to the persistent log as it comes in, not in batch. In other words, it's synchronous rather than asynchronous: by the time the server responds to the client, the data has already been written to a log (which is durable and replicated), not just to memory.
If a storage system only writes to memory, and writes out to a log in batches, it will lose data that was only in memory if the server were to crash after accepting some writes, but before it flushed them to a log.
This makes sense, but doesn't it slow down to write every operation to a log server rather than in batch (because it is written to a disk)?
The distributed file system behind Bigtable (formerly Google File System, now Colossus) is much faster than typical file systems, even though it's distributed and each write is replicated.
On benchmarks using YCSB, Google Cloud Bigtable has demonstrated single-digit millisecond latency on both reads and writes, even at the tail.

Couchbase 3.1.0 - Hard out of memory error when performing full backup

We recently migrated to Couchbase 3.1.0. The odd thing is: when performing a full backup of a bucket, the web UI alerts "Hard Out Of Memory Error. Bucket X on node Y is full. All memory allocated to this bucket is used for metadata". The numbers for RAM usage in the web UI contradict that: about 75% is used, but not 100%. I looked into the logs, but haven't found any similar errors there.
Is that even normal?
This is a known issue in the Couchbase Server 3.x releases.
To understand the problem, we must also first understand Database Change Protocol (DCP), the protocol used to transfer data throughout the system. At a high level the flow-control for DCP is as follows:
1. The Consumer creates a connection with the Producer and sends an Open Connection message. The Consumer then sends a Control message to indicate per-stream flow control. This message will contain "stream_buffer_size" in the key section and the buffer size the Consumer would like each stream to have in the value section.
2. The Consumer will then start opening streams so that it can receive data from the server.
3. The Producer will then continue to send data for the stream that has buffer space available until it reaches the maximum send size.
Steps 1-3 continue until the connection is closed, as the Consumer continues to consume items from the stream.
The cbbackup utility does not implement any flow control (data buffer limits) however, and it will try to stream all vbuckets from all nodes at once, with no cap on the buffer size.
While this does not mean that it will use the same amount of memory as your overall data size (as the streams are being drained slowly by the cbbackup process), it does mean that a large memory overhead is required to be able to store the data streams.
When you are in a heavy DGM (disk greater than memory) scenario, the amount of memory required to store the streams is likely to grow more rapidly than cbbackup can drain them as it is streaming large quantities of data off of disk, leading to very large streams, which take up a lot of memory as previously mentioned.
The slightly misleading message about metadata taking up all of the memory is displayed as there is no memory left for the data, so all of the remaining memory is allocated to the metadata, which when using value eviction cannot be ejected from memory.
The reason that this only affects Couchbase Server versions prior to 4.0 is that in 4.0 a server-side improvement to DCP stream management was made that allows DCP streams to be paused to keep the memory footprint down; this is tracked as MB-12179.
As a result, you should not experience the same issue on Couchbase Server versions 4.x+, regardless of how DGM your bucket is.
Workaround
If you find yourself in a situation where this issue is occurring, then terminating the backup job should release all of the memory consumed by the streams immediately.
Unfortunately if you have already had most of your data evicted from memory as a result of the backup, then you will have to retrieve a large quantity of data off of disk instead of RAM for a small period of time, which is likely to increase your get latencies.
Over time 'hot' data will be brought into memory when requested, so this will only be a problem for a small period of time, however this is still a fairly undesirable situation to be in.
The workaround to avoid this issue completely is to only stream a small number of vbuckets at once when performing the backup, as opposed to all vbuckets which cbbackup does by default.
This can be achieved using cbbackupwrapper which comes bundled with all Couchbase Server releases 3.1.0 and later, details of using cbbackupwrapper can be found in the Couchbase Server documentation.
In particular the parameter to pay attention to is the -n flag, which specifies the number of vbuckets to be backed up in a batch at once.
As the name suggests, cbbackupwrapper is simply a wrapper script on top of cbbackup which partitions the vbuckets up and automatically handles all of the directory creation and backup generation, while still using cbbackup under the hood.
As an example, with a batch size of 50, cbbackupwrapper would backup vbuckets 0-49 first, followed by 50-99, then 100-149 etc.
It is suggested that you test with cbbackupwrapper in a testing environment which mirrors your production environment to find suitable values for -n and -P (which controls how many backup processes run at once); the combination of these two controls the amount of memory pressure caused by the backup as well as the overall speed.
You should not find that lowering the value of -n from its default of 100 decreases the backup speed; in some cases you may find that the backup speed actually increases because there is far less memory pressure on the server.
You may however wish to sensibly adjust the -P parameter if you wish to speed up the backup further.
Below is an example command:
cbbackupwrapper http://[host]:8091 [backup_dir] -u [user_name] -p [password] -n 50
It should be noted that if you use cbbackupwrapper to perform your backup then you must also use cbrestorewrapper to restore the data, as cbrestorewrapper is automatically aware of the directory structures used by cbbackupwrapper.
When you run a full backup, by default the backup tool streams data from all nodes over the network. This is not the best way, because it causes a lot of extra load and increased memory usage, especially if you run cbbackup on one of the Couchbase nodes. I would use the data-copy mode of cbbackup, which copies data directly from the files on disk:
> sudo /opt/couchbase/bin/cbbackup couchstore-files:///opt/couchbase/var/lib/couchbase/data/ /tmp/backup
Of course, change the data path to wherever your Couchbase data is actually stored. (In my example it runs as sudo because only root has read access to /opt/couchbase/blabla..) Do this on every node, then collect all the backup folders and put them somewhere. Note that the backups are very compressible, so you might want to zip them before copying over the network.

Aspstate SQL Server database mirroring high IO

We are currently having issues with ASPState database mirroring. We have around 10,000 active users online 9-5 day to day, and the ASPState database is so heavy on writing, and on passing this to the mirror, that the mirror's drive is very high on IO and keeps causing both servers to be inaccessible due to the latency of writing the data on the mirror. We're using SQL Server 2012 Standard, so we are not in asynchronous mode.
We're running SQL Server on Amazon EC2 instances with EBS-backed volumes and 1000 IOPS; in your view, should this be enough? We seem to have very smooth times when over 15,000 users are online, and then other times when only 10,000 users are online we have issues with disk queue lengths on the mirror (the backup server, not the principal server).
The principal can be writing to the aspstate.mdf files at 10-20 mbps constantly when the disk queue length goes up.
We're going to increase the IOPS to 2000 in the meantime, as currently we've had to disable mirroring. However, would you expect this, and has anyone handled this sort of volume before?
Regards
Liam
The bottleneck with a high transaction workload like ASPState is not the data file but the transaction log. In the case of synchronous mirroring, additional latency is introduced for both the network and synchronous commit at the mirror. This latency will not be tolerable if you have a large number of APSState requests. Keep in mind that unless specified otherwise with session state enabled, each ASP.NET page request will require 2 updates to a session state row. So if you have 10,000 active users clicking once every 15 seconds, that requires about 1,300 I/Os per second for transaction log writes alone on each database.
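As a rough back-of-the-envelope version of that estimate:
10,000 users / 15 seconds between requests ≈ 667 requests per second
667 requests per second x 2 session state updates per request ≈ 1,333 transaction log writes per second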
If you must have HA for session state, I suggest failover clustering to eliminate network latency. You might also consider tuning session state by specifying the read only or none directive for pages that don't need session state. Consider using an in-memory session state solution instead of the out-of-the box ASPSession state database if you need to support a large number of users. Also remember that session state data is temporary so you can forgo durability.
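For example, pages that only read session state can be marked ReadOnly via the EnableSessionState page directive, and pages that don't use it at all can set it to False (a sketch; the default is True):
<%@ Page EnableSessionState="ReadOnly" %>
<%@ Page EnableSessionState="False" %>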