What is the difference between Cassy's snapshots and consistent backups? (Cassy is a backup tool for Apache Cassandra.)

I want to know the characteristics of Cassy backups and how to use them in actual operation. Specifically, from an operational design perspective: what is the difference between snapshots and consistent backups, in what usage patterns can each be used (alone or in combination), and what are the benefits and tradeoffs?

Both are snapshots. A consistent snapshot (a.k.a. cluster-wide snapshot/backup) is a special snapshot that is transactionally consistent.
Imagine a transaction T in a payment application.
T: sends $100 from account A to account B.
T actually consists of 2 steps: deducting $100 from A's account and adding $100 to B's account.
A normal snapshot is not transaction-aware, so the produced snapshot might contain the result of only one of the two steps.
A transactionally consistent snapshot, by contrast, reflects the state either before or after the transaction, never in between.
A transactionally consistent backup is useful only when a quorum of C* disks is broken and the data is lost.
Otherwise, a normal snapshot is enough, since the C* cluster still has a quorum of replicas alive and the latest data can simply be re-synchronized.
Snapshots should be taken regularly, say once every week or two.
Cluster-wide snapshots can be taken at longer intervals, just for the worst-case scenario.
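To make the consistency issue concrete, here is a minimal, hypothetical sketch in Python (the balances dict and functions are made up for illustration; this is not Cassy's API). A per-node snapshot taken between the two writes captures a half-applied transfer:

    # Hypothetical two-step transfer, as in the payment example above.
    balances = {"A": 500, "B": 200}

    def transfer(src, dst, amount):
        balances[src] -= amount   # step 1: deduct from A
        # A naive snapshot taken at this point sees A debited ($400)
        # but B not yet credited ($200): an inconsistent state.
        balances[dst] += amount   # step 2: credit B

    def naive_snapshot():
        return dict(balances)     # copies whatever state exists right now

A transactionally consistent snapshot would only ever observe {A: 500, B: 200} or {A: 400, B: 300}, never the intermediate state.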

Related

SQL Server Snapshot on live DB

I am in need of some help. I want to reinitialize one of my subscribers with a new snapshot. The previous snapshot I generated was when activity was low on the production database. It took under 2 minutes.
My question is, can I generate a new snapshot during the day while applications are using the database live? Would it lock my tables badly, to the point where transactions won't make it through to the database?
Yes, you can create the initial snapshot during the day.
With transactional replication, the Snapshot Agent takes locks only during the initial phase of snapshot generation.
The performance impact of this operation depends on the current workload of your system. So, if the previous snapshot generation took 2 minutes, it should not take much longer than that.

Does Redis persist data?

I understand that Redis serves all data from memory, but does it also persist across server reboots, so that when the server reboots it reads all the data from disk into memory? Or is it always a blank store that only holds data while apps are running, with no persistence?
I suggest you read about this at http://redis.io/topics/persistence. Basically, you lose guaranteed persistence when you increase performance by storing only in memory. Imagine a scenario where you INSERT into memory but lose power before the data is persisted to disk: there will be data loss.
Redis supports so-called "snapshots". This means that it makes a complete copy of what's in memory at certain points in time (e.g., every full hour). If you lose power between two snapshots, you lose the data from the time between the last snapshot and the crash (it doesn't have to be a power outage...). Redis trades data safety for performance, as most NoSQL DBs do.
Most NoSQL databases follow a concept of replication among multiple nodes to minimize this risk. Redis is considered more of a speedy cache than a database that guarantees data consistency. Therefore its use cases typically differ from those of traditional databases:
You can, for example, store sessions, performance counters, or whatever in it with unmatched performance and no real loss in case of a crash. But processing orders, purchase histories, and so on is considered a job for traditional databases.
The Redis server saves all its data to disk from time to time, thus providing some level of persistence.
It saves data in one of the following cases:
automatically from time to time
when you manually call the BGSAVE command
when Redis is shutting down
But data in Redis is not truly persistent, because:
a crash of the Redis process means losing all changes since the last save
a BGSAVE operation can only be performed if you have enough free RAM (the amount of extra RAM needed can be up to the size of the Redis DB)
N.B.: the BGSAVE RAM requirement is a real problem, because Redis continues serving requests until it runs out of RAM, but it stops saving data to disk much earlier (at approx. 50% of RAM).
For more information see Redis Persistence.
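As a quick illustration, here is a sketch using the redis-py client (the host and save thresholds are placeholders) for configuring and triggering RDB snapshots:

    import redis

    r = redis.Redis(host="localhost", port=6379)  # placeholder connection

    # RDB save points: snapshot if >= 1 key changed in the last 3600 s,
    # or >= 100 keys changed in the last 300 s.
    r.config_set("save", "3600 1 300 100")

    r.bgsave()           # fork a child process and write the RDB file in the background
    print(r.lastsave())  # timestamp of the last successful save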
It is a matter of configuration. You can have no, partial, or full persistence of your data in Redis. The best decision will be driven by the project's technical and business needs.
According to the Redis documentation about persistence, you can set up your instance to save data to disk from time to time or on each query, in a nutshell. It provides two strategies/methods, AOF and RDB (read the documentation for details about them); you can use each one alone or both together.
If you want "SQL-like persistence", they have said:
The general indication is that you should use both persistence methods if you want a degree of data safety comparable to what PostgreSQL can provide you.
The answer is generally yes; however, a fuller answer really depends on what type of data you're trying to store. In general, the more complete short answer is:
Redis isn't the best fit for persistent storage, as it is mainly performance-focused.
Redis is really more suitable for reliable in-memory storage/caching of current-state data, particularly for allowing scalability by providing a central source for data used across multiple clients/servers.
Having said this, by default Redis persists data snapshots at a periodic interval (apparently every minute, but I haven't verified this; it is described in the article below, which is a good basic intro):
http://qnimate.com/redis-permanent-storage/
TL;DR
From the official docs:
RDB persistence [the default] performs point-in-time snapshots of your dataset at specified intervals.
AOF persistence [needs to be explicitly configured] logs every write operation received by the server, which will be replayed at server startup, reconstructing the original dataset.
Redis must be explicitly configured for AOF persistence, if this is required, and this comes with a performance penalty as well as growing log files. It may suffice for relatively reliable persistence of a limited amount of data flow.
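For reference, a minimal sketch of enabling AOF at runtime with the redis-py client (in production you would normally set these directives in redis.conf instead):

    import redis

    r = redis.Redis(host="localhost", port=6379)  # placeholder connection

    r.config_set("appendonly", "yes")        # turn on the append-only file
    r.config_set("appendfsync", "everysec")  # fsync once per second ("always" and "no" are the alternatives)

    # BGREWRITEAOF compacts the growing log into the smallest equivalent command stream.
    r.bgrewriteaof()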
You can also choose no persistence at all: better performance, but all data is lost when Redis shuts down.
Redis has two persistence mechanisms: RDB and AOF. RDB takes scheduled global snapshots, and AOF writes updates to an append-only log file, similar to MySQL.
You can use either of them or both. When Redis reboots, it reconstructs its data by reading the RDB file or the AOF file.
All the answers in this thread are talking about the possibility of Redis persisting data: https://redis.io/topics/persistence (using AOF + fsync on every write (change)).
It's a great link to get you started, but it is definitely not showing you the full picture.
Can/Should You Really Persist Unrecoverable Data/State On Redis?
The Redis docs do not talk about:
Which Redis providers support this (AOF + fsync on every write) option:
Almost none of them. Redis Labs in the cloud does NOT provide this option. You may need to buy the on-premise version of Redis Labs to get it, and as not all companies are willing to go on-premise, they will have a problem.
Other Redis providers do not specify whether they support this option at all: AWS Cache, Aiven, ...
AOF + fsync on every write is slow; you will have to test it yourself on your production hardware to see if it fits your requirements.
Redis Enterprise provides this option, and from this link: https://redislabs.com/blog/your-cloud-cant-do-that-0-5m-ops-acid-1msec-latency/ let's look at some benchmarks:
On a 1x x1.16xlarge instance on AWS, they could not achieve less than 2 ms latency:
where latency was measured from the time the first byte of the request arrived at the cluster until the first byte of the ‘write’ response was sent back to the client
They ran additional benchmarks on much better storage (Dell-EMC VMAX), which achieved < 1 ms operation latency (!!) at anywhere from 70K ops/sec (write-intensive test) to 660K ops/sec (read-intensive test). Pretty impressive!
But it definitely requires (very) skilled devops to help you create this infrastructure and maintain it over time.
One could (falsely) argue that once you have a cluster of Redis nodes (with replicas), you have full persistence. This is a false claim:
All DBs (SQL, NoSQL, Redis, ...) have the same problem. For example, after running set x 1 on node1, how much time does it take for this (or any) change to reach all the other nodes, so that subsequent reads return the same output? Well, it depends on a lot of factors and configurations.
It is a nightmare to deal with an inconsistent value of a key across multiple nodes (in any DB type). You can read more about it from the Redis author (antirez): http://antirez.com/news/66. Here is a short example of the actual nightmare of storing state in Redis (plus a partial solution, the WAIT command, which reports how many Redis replicas acknowledged the latest change):
def save_payment(payment_id)
  redis.rpush(payment_id, "in progress") # Return false on exception
  if redis.wait(3, 1000) >= 3
    redis.rpush(payment_id, "confirmed") # Return false on exception
    if redis.wait(3, 1000) >= 3
      return true
    else
      redis.rpush(payment_id, "cancelled")
      return false
    end
  else
    return false
  end
end
The above example is not sufficient, and it has the real problem of needing to know in advance how many nodes actually exist (and are alive) at every moment.
Other DBs will have the same problem as well. Maybe they have better APIs, but the problem still exists.
As far as I know, a lot of applications are not even aware of this problem.
All in all, adding more DB nodes is not a one-click configuration. It involves a lot more.
To conclude this research, what to do depends on:
How many devs your team has (so this task won't slow you down)
Whether you have a skilled devops engineer
The distributed-systems skills in your team
Money to buy hardware
Time to invest in the solution
And probably more...
Many not-so-well-informed and relatively new users think that Redis is a cache only and NOT an ideal choice for reliable persistence.
The reality is that the lines between DB and cache (and many more types) are blurred nowadays.
It's all configurable, and as users/engineers we have the choice to configure it as a cache, as a DB (or even as a hybrid).
Each choice comes with benefits and costs. And this is NOT an exception for Redis; all well-known distributed systems provide options to configure different aspects (persistence, availability, consistency, etc.). So, if you configure Redis in default mode hoping that it will magically give you highly reliable persistence, then it's the team's/engineer's fault (and NOT Redis's).
I have discussed these aspects in more detail on my blog here.
Also, here is a link from Redis itself.

Does a simple-recovery database record transaction logs when selected from a full-recovery database?

Does a simple-recovery database record transaction logs when data is selected from a full-recovery database? I mean, we have a full-recovery database, and it records too many transaction logs, causing its size to grow.
My question is: does the simple-recovery database still do its minimal logging even if the data it receives is selected from a full-recovery model database? Thank you!
One thing has nothing to do with the other. Where the data comes from does not affect logging of changes to the tables in the database it's going to.
However, as Martin Smith pointed out, this is treating a symptom: there's naff-all point in having full recovery mode on if you (they??) aren't backing up the transaction logs frequently enough to make the overhead useful. The whole point of them, aside from restoring up to a particular transaction in the event of some catastrophe in your applications, is speed and granularity.
Please read the MSDN page on recovery models:
http://msdn.microsoft.com/en-us/library/ms189275.aspx
Here is a quick summary from MSDN.
1 - Simple model - Automatically reclaims log space to keep space requirements small, essentially eliminating the need to manage transaction log space.
2 - Bulk-logged model - An adjunct of the full recovery model that permits high-performance bulk copy operations.
**The first two do not support point-in-time recovery!**
3 - Full model - Can recover to an arbitrary point in time (for example, prior to application or user error). If no tail-log backup is possible, recover to the last log backup.
So your problem is with either log usage or log backups.
A - Are you deleting from temporary tables instead of truncating them? http://msdn.microsoft.com/en-us/library/ms177570.aspx A DELETE operation logs each row in the transaction log; TRUNCATE is minimally logged.
B - Are you inserting large amounts of data via an ETL job? Each insert will get logged in the T-log.
If you use bulk copy and an ETL that supports fast data loads, it will be minimally logged.
However, page density and fill factor come into play when determining the size of the T-log.
http://blogs.msdn.com/b/sqlserverfaq/archive/2011/01/07/using-bulk-logged-recovery-model-for-bulk-operations-will-reduce-the-size-of-transaction-log-backups-myths-and-truths.aspx
C - How often are you taking transaction log backups? After each backup, the T-log space can be reused, resulting in an overall smaller size.
D - How fragmented is the T-log? I suggest shrinking and re-growing the log during a maintenance period. A 20% log-to-data ratio with hourly backups worked fine at my old company. It all depends on how many changes you are making. http://craftydba.com/?p=3374
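As a starting point for checking the items above, here is a small sketch using the pyodbc driver (the connection string is a placeholder) that reports each database's recovery model and current log usage:

    import pyodbc

    # Placeholder connection string; adjust server/credentials for your environment.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=master;Trusted_Connection=yes;"
    )
    cur = conn.cursor()

    # Recovery model per database (SIMPLE / BULK_LOGGED / FULL).
    for name, model in cur.execute(
        "SELECT name, recovery_model_desc FROM sys.databases"
    ):
        print(name, model)

    # Log size and percentage used per database.
    cur.execute("DBCC SQLPERF(LOGSPACE)")
    for row in cur.fetchall():
        print(row)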
In summary, these are the places you should be looking, not the old data in the system, since that data is probably not being modified.
Moving the old data to a read-only reporting database for ad hoc queries from novice T-SQL users might not be a bad idea. But that solves other problems: possible BLOCKING and DEADLOCKS in your OLTP database.

DynamoDB - How to do incremental backup?

I am using DynamoDB tables with keys and throughput optimized for application use cases. To support other ad hoc administrative and reporting use cases, I want to keep a complete backup in S3 (a day-old backup is OK). Again, I cannot afford to scan the entire DynamoDB tables to do the backup. The keys I have are not sufficient to find out what is "new". How do I do incremental backups? Do I have to modify my DynamoDB schema, or add extra tables, just to do this? Any best practices?
Update:
DynamoDB Streams solves this problem.
DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table, and stores this information in a log for up to 24 hours. Applications can access this log and view the data items as they appeared before and after they were modified, in near real time.
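As an illustration, here is a hedged sketch of reading a stream directly with boto3 (the stream ARN is a placeholder; a real one comes from DescribeTable's LatestStreamArn):

    import boto3

    streams = boto3.client("dynamodbstreams")
    stream_arn = "arn:aws:dynamodb:region:account:table/my-table/stream/label"  # placeholder

    shard = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest record in the shard
    )["ShardIterator"]

    for record in streams.get_records(ShardIterator=iterator)["Records"]:
        print(record["eventName"], record["dynamodb"].get("Keys"))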
I see two options:
Generate the current snapshot. You'll have to read from the table to do this, which you can do at a very slow rate to stay under your capacity limits (Scan operation). Then, keep an in-memory list of updates performed for some period of time. You could put these in another table, but you'll have to read those, too, which would probably cost just as much. This time interval could be a minute, 10 minutes, an hour, whatever you're comfortable losing if your application exits. Then, periodically grab your snapshot from S3, replay these changes on the snapshot, and upload your new snapshot. I don't know how large your data set is, so this may not be practical, but I've seen this done with great success for data sets up to 1-2GB.
Add read throughput and backup your data using a full scan every day. You say you can't afford it, but it isn't clear if you mean paying for capacity, or that the scan would use up all the capacity and the application would begin failing. The only way to pull data out of DynamoDB is to read it, either strongly or eventually consistent. If the backup is part of your business requirements, then I think you have to determine if it's worth it. You can self-throttle your read by examining the ConsumedCapacityUnits property on your results. The Scan operation has a Limit property that you can use to limit the amount of data read in each operation. Scan also uses eventually consistent reads, which are half the price of strongly consistent reads.
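A minimal sketch of option 2's self-throttled scan with boto3 (the table name and throttle factor are placeholders):

    import time
    import boto3

    table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

    items, start_key = [], None
    while True:
        kwargs = {"Limit": 100, "ReturnConsumedCapacity": "TOTAL"}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        # Self-throttle: sleep in proportion to the capacity this page consumed.
        time.sleep(page["ConsumedCapacity"]["CapacityUnits"] / 10.0)
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break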
You can now use DynamoDB Streams to have data persisted into another table or to maintain another copy of the data in another datastore.
https://aws.amazon.com/blogs/aws/dynamodb-streams-preview/
For incremental backups, you can associate your DynamoDB Stream with a Lambda function to automatically trigger code for every data update (i.e., push the data to another store like S3), as sketched below.
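A hedged sketch of such a Lambda handler (the bucket name is a placeholder; the event shape follows the documented DynamoDB Streams record format):

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-incremental-backup-bucket"  # placeholder

    def handler(event, context):
        # Each invocation receives a batch of item-level changes from the stream.
        for record in event["Records"]:
            change = record["dynamodb"]
            s3.put_object(
                Bucket=BUCKET,
                Key="backup/" + record["eventID"] + ".json",
                Body=json.dumps(change.get("NewImage", {})),  # NewImage is absent on REMOVE events
            )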
A Lambda function you can use to tie in with DynamoDB for incremental backups:
https://github.com/PageUpPeopleOrg/dynamodb-replicator
I've provided a detailed walk-through of how you can use DynamoDB Streams, Lambda, and S3 versioned buckets to create incremental backups for your data in DynamoDB on my blog:
https://www.abhayachauhan.com/category/aws/dynamodb/dynamodb-backups
Alternatively, DynamoDB just released On-Demand Backup and Restore. These aren't incremental, but full backup snapshots.
Check out https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups/ for more information.
HTH
On November 29th, 2017, On-Demand Backup was introduced. It allows you to create backups directly in DynamoDB, essentially instantly, without consuming any capacity. Here are a few snippets from the blog post:
This feature is designed to help you to comply with regulatory requirements for long-term archival and data retention. You can create a backup with a click (or an API call) without consuming your provisioned throughput capacity or impacting the responsiveness of your application. Backups are stored in a highly durable fashion and can be used to create fresh tables.
...
The backup is available right away! It is encrypted with an Amazon-managed key and includes all of the table data, provisioned capacity settings, Local and Global Secondary Index settings, and Streams. It does not include Auto Scaling or TTL settings, tags, IAM policies, CloudWatch metrics, or CloudWatch Alarms.
You may be wondering how this operation can be instant, given that some of our customers have tables approaching half of a petabyte. Behind the scenes, DynamoDB takes full snapshots and saves all change logs. Taking a backup is as simple as saving a timestamp along with the current metadata for the table.
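With boto3, taking such a backup is a single call (the table and backup names are placeholders):

    import boto3

    client = boto3.client("dynamodb")

    # Full snapshot; does not consume the table's provisioned throughput.
    resp = client.create_backup(
        TableName="my-table",
        BackupName="my-table-backup-2017-11-29",
    )
    print(resp["BackupDetails"]["BackupArn"])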
A Scan operation in DynamoDB returns rows sorted by primary key (hash key). So if a table's hash key is an auto-incremented integer, set the hash key of the last record saved during the previous backup as the ExclusiveStartKey parameter of the Scan request when doing the next backup, and the Scan will return only records created since the last backup.

Daily Data subset of main database

I have a large DB (500 GB), and one of our customers wants a daily snapshot of only his data. He only has a 3 Mb connection, and I suspect that is the max! What would be the most effective method?
1. Views that are updated, but he wants the underlying tables.
2. Replication: I don't know much about this.
3. An alternative method.
Merge replication, which allows you to initialize a subscriber without using a snapshot. You will have to initialize replication on the subscriber from a backup. You can possibly do the same with transactional replication, but it just never worked quite the same for me; YMMV. When it breaks (etc.), you will have to be prepared to ship a new backup and start over (hacking replication sometimes works, but don't count on it). Changing database structures is also a pain once they are in replication. I have seen ~500 GB over less than 3 Mbit work in production, though not without proper planning and preparation (and grey hair).
I have used transactional replication with a read-only subscription, where I invoked the distribution with a batch file on a task schedule once a day. Transactional replication did not feel as well maintained (by MS) or as stable as merge replication, though the data integrity with transactional was more consistent.
I have not tried transaction log shipping, but that might also be an option.
(P.S. Notice I didn't say "if it breaks".)