Are there any in-memory (persistent) solutions faster than Aerospike for a single-node? - aerospike

I am working on a cloud application that requires low latency and a very high number of reads and writes per second. I will only have around 1 million records stored persistently, but this may fluctuate widely as the application runs.
After benchmarking Aerospike and Redis with YCSB, I found that Aerospike beats both Redis and MongoDB in single-node performance for a 60/40 read/write workload.
Some points to note:
Fetching all my data using a single 32-bit integer key (no advanced queries)
Running on a single machine with 8 GB RAM and an SSD (small number of records)
Multiple clients need access to the database at once (via LAN)
I'm also assuming that key-value stores will outperform document stores and are the best fit considering I do not need advanced queries.
Before committing to Aerospike, are there any other solutions that may be a better fit for my scenario, considering that I am only running a single node with a small-ish number of records?

Not that I'm aware of. I think Aerospike is the fastest.
However, for some use cases you can consider Tarantool.
Here's one of the benchmarks: https://medium.com/@rvncerr/tarantool-vs-competitors-racing-in-microsoft-azure-ebde9c5d619

Related

How to estimate the maximum number of reads and writes per second a RDBMS server can handle?

Before spinning up an actual (MySQL, Postgres, etc) database, are there ways to estimate how many reads & writes per second the database can handle?
I'm assuming this is dependent on the CPU and memory (plus network if we're sharding), but is there a best practice for putting these variables together?
This is useful for estimating cost and understanding how much of a traffic spike the db can handle.
You can learn from others to gauge transactions per second you'll get from certain instances. For example, https://aiven.io/blog/postgresql-12-gcp-aws-performance gives you a good idea of how PostgreSQL 12 performs.
Percona has blogged about performance benchmarks also: https://www.percona.com/blog/2017/01/06/millions-queries-per-second-postgresql-and-mysql-peaceful-battle-at-modern-demanding-workloads/
Here's another benchmark with useful information: http://dimitrik.free.fr/blog/posts/mysql-performance-80-and-sysbench-oltp_rw-updatenokey.html about MySQL 8.0 and links to 5.7 performance.
There are several blogs about SQL Server performance such as https://storagehub.vmware.com/t/microsoft-sql-server-2017-database-on-vmware-vsan-tm-6-7-using-vmware-cloud-foundation-tm/performance-test-results/ that can also help you recognize the workloads these databases can handle.
Under 10K tps shouldn't be much of a problem with modern hardware. You can start with a common cloud configuration or a standard-sized server in your own environment. Use SSDs. Optimize your server settings to gain more speed and be ready to add more resources gradually. As Gordon mentions, benchmark your database after you have installed it. As a rule of thumb, I'd start with 32 GB of memory, 8 cores and SSDs to pull 10K tps, and adjust from there.
As you assumed, a lot depends on the number and type of CPUs, memory and SSDs, your workload, how you structure data, latency between your app and database, reporting happening against the database, master/slave configuration, types of transactions, storage engines, etc.

Redis vs Aerospike usecases?

After going through a couple of resources on Google and Stack Overflow (mentioned below), I have a high-level understanding of when to use which, but I also have a couple of questions.
My understanding:
1. When used as pure in-memory databases, both have comparable performance. But for big data, where the complete dataset cannot fit in memory, or can fit only at increased cost, AS (Aerospike) can be a good fit because it provides a mode where indexes are kept in memory and data on SSD. I believe performance will be somewhat degraded compared to a completely in-memory database (though the way AS handles reads and writes to SSD makes it faster than traditional disk I/O), but it saves cost and performs better than keeping the complete data on disk. So when the complete data fits in memory, both can be equally good, but when memory is a constraint, AS can be a good choice. Is that right?
2. It is also said that AS provides a rich, easy-to-set-up clustering feature, whereas some clustering features in Redis need to be handled in the application. Does this still hold, or was it only true until a couple of years ago (I believe so, as I see Redis also provides a clustering feature)?
How is aerospike different from other key-value nosql databases?
What are the use cases where Redis is preferred to Aerospike?
Your assumption in (1) is off, because it applies to (mostly) synthetic situations where all data fits in memory. What happens when you have a system that grows to many terabytes or even petabytes of data? Would you want to try and fit that data in a very expensive, hard to manage fully in-memory system containing many nodes? A modern machine can store a lot more SSD/NVMe drives than memory. If you look at the new i3en instance family type from Amazon EC2, the i3en.24xl has 768G of RAM and 60TB of NVMe storage (8 x 7.5TB). That kind of machine works very well with Aerospike, as it only stores the indexes in memory. A very large amount of data can be stored on a small cluster of such dense nodes, and perform exceptionally well.
Aerospike is used in the real world in clusters that have grown to hundreds of terabytes or even petabytes of data (tens to hundreds of billions of objects), serving millions of operations per-second, and still hitting sub-millisecond to single digit millisecond latencies. See https://www.aerospike.com/summit/ for several talks on that topic.
Another aspect affecting (1) is that the performance of a single Redis instance is misleading if in reality you'll be deploying on multiple servers, each running multiple instances of Redis. Redis isn't a distributed database the way Aerospike is: it requires application-side sharding (which becomes a bit of a clustering and horizontal scaling nightmare) or a separate proxy, which often ends up being the bottleneck. It's great that a single shard can do a million operations per second, but if the proxy can't handle the combined throughput, and competes with shards for CPU and memory, there's more to the performance-at-scale picture than just in-memory versus data on SSD.
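To make the application-side sharding point concrete, here is a minimal sketch. It does not use any Redis client library; the function names (shard_for, fraction_moved), the key format and the shard counts are all hypothetical, and plain modulo hashing is just one simple scheme an application might choose.

```python
import zlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard index by hashing the key (simple modulo sharding)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


if __name__ == "__main__":
    # Hypothetical keys; any string keys would do.
    keys = [f"user:{i}" for i in range(10_000)]
    share = fraction_moved(keys, old_shards=4, new_shards=5)
    print(f"keys that change shard when going from 4 to 5 shards: {share:.0%}")
```

With plain modulo hashing, adding a single shard typically remaps most keys, which is one reason resharding in the application layer is painful. Schemes such as consistent hashing reduce this, at the cost of more client-side logic.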
Unless you're looking at a tiny amount of objects or a small amount of data that isn't likely to grow, you should probably compare the two for yourself with a proof-of-concept test.

Optimizing write performance of a 3 Node 8 Core/16G Cassandra cluster

We have set up a 3-node performance cluster with 16G RAM and 8 cores each. Our use case is to write 1 million rows to a single table with 101 columns, and the write currently takes 57-58 minutes. What should be our first steps towards optimizing write performance on our cluster?
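For scale, the figures above work out to roughly 290 rows per second for the whole cluster. The short calculation below uses only the numbers stated in the question (1 million rows, 57-58 minutes); the function name is illustrative.

```python
ROWS = 1_000_000


def rows_per_second(rows: int, minutes: float) -> float:
    """Average write throughput for a job of `rows` rows taking `minutes` minutes."""
    return rows / (minutes * 60)


if __name__ == "__main__":
    for minutes in (57, 58):
        print(f"{minutes} min -> {rows_per_second(ROWS, minutes):.0f} rows/s")
```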
The first thing I would do is look at the application that is performing the writes:
What language is the application written in and what driver is it using? Some drivers offer better inherent performance than others. For example, the Python, Ruby, and Node.js drivers may only make use of one thread, so running multiple instances of your application (1 per core) may be something to consider. Your question is tagged 'spark-cassandra-connector', which suggests you are using that; it uses the DataStax Java driver, which should perform well as a single instance.
Are your writes asynchronous, or are you writing rows one at a time? How many writes does it execute concurrently? Too many concurrent writes could cause pressure in Cassandra, but too few could reduce throughput. If you are using the Spark connector, are you using saveToCassandra/saveAsCassandraTable or something else?
Are you using batching? If you are, how many rows are you inserting/updating per batch? Too many rows could put a lot of pressure on Cassandra. Additionally, are all of the inserts/updates within a batch going to the same partition? If they aren't, you should consider grouping them so that each batch targets a single partition.
Spark Connector Specific: You can tune the write settings, like batch size, batch level (i.e. by partition or by replica set), write throughput in mb per core, etc. You can see all these settings here.
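The batching questions above can be illustrated with a small sketch that groups rows by partition key and caps the batch size. It is plain Python and does not call the Cassandra driver or the Spark connector; the function name, the sensor_id column and the batch size of 2 are hypothetical. Each yielded batch would then be handed to whatever driver call you use to execute a batch.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def batches_by_partition(
    rows: Iterable[Row], partition_key: str, max_batch_size: int
) -> Iterator[Tuple[Hashable, List[Row]]]:
    """Yield (partition, rows) batches in which every row shares one partition key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[partition_key]].append(row)
    for partition, group in grouped.items():
        # Split large groups so that no single batch grows without bound.
        for start in range(0, len(group), max_batch_size):
            yield partition, group[start:start + max_batch_size]


if __name__ == "__main__":
    # Hypothetical rows: 10 readings spread over 3 partitions.
    rows = [{"sensor_id": i % 3, "reading": i} for i in range(10)]
    for partition, batch in batches_by_partition(rows, "sensor_id", max_batch_size=2):
        print(partition, [r["reading"] for r in batch])
```

A suitable batch size depends on row width and cluster capacity, so treat the cap as something to measure, not a recommended value.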
The second thing I would look at is the metrics on the Cassandra side, on each individual node.
What do the garbage collection metrics look like? You can enable GC logs by uncommenting lines in conf/cassandra-env.sh (as shown here; see also "Are Your Garbage Collection Logs Speaking to You?"). You may need to tune your GC settings, although if you are using an 8GB heap the defaults are usually pretty good.
Do your CPU and disk utilization indicate that your systems are under heavy load? Your hardware or configuration could be constraining your capability (see "Selecting hardware for enterprise implementations").
Commands like nodetool cfhistograms and nodetool proxyhistograms will help you understand how long your requests are taking (proxyhistograms), and cfhistograms (latencies in particular) could give you insight into any disparity between how long it takes to process a request and how long the mutation operations themselves take.

What are the use cases where Redis is preferred to Aerospike?

We are currently using Redis and it's a great in-memory datastore. We're starting to look at some new problems where the in-memory limitation is a factor, and we are looking at other options. One we came across is Aerospike: it seems very fast, even faster than Redis on in-memory single-shard operations.
Now that we're adding this to our stack, I'm trying to understand the use cases where Aerospike would not be able to replace redis?
Aerospike supports fewer data types and features than Redis; for example, pub/sub is not available in Aerospike. However, Aerospike is a distributed key-value store and has superior clustering features.
The two are both great databases. It really depends on how big of a dataset you're handling, and your expectations of growth.
Redis:
Key/value store, dataset fits into RAM in single machine or you can shard yourself across multiple machines (and/or cores since it's single-threaded), persists data to disk, has data structures like lists/sets, basic pub/sub, simple slave replication, Lua scripting.
Aerospike:
Key/value row-store (meaning value contains bins with values and those values can be more maps/lists/values to have multiple levels), multithreaded to use all cores, built for clustering across machines with replication, and can do cross-datacenter replication, Lua scripting for UDFs. Can run directly on SSDs so you can store much more data without it fitting into RAM.
Comparison:
If you just have a smaller dataset or are fine with single-core performance then Redis is great. Quick to install, simple to run, easy to just attach a slave with 1 command if you need more read scalability. Redis also has more unique functionality with list/set/bitmap operations so you can do "more" out of the box.
If you want to store more complicated or nested data or need more performance on a single machine or clustering, then Aerospike gets the job done really well with less operational overhead. Very fast performance and easy cluster setup with all nodes being exactly the same role so you can scale reads and writes.
That's the big difference: scalability beyond a single core or server. With Lua scripting, you can usually reproduce in Aerospike any native feature that Redis has. If you have lots of data (like TBs), then Aerospike's SSD feature means you get RAM-like performance without the RAM cost.
Have you looked at the benchmarks? I believe each one performs differently under different conditions and use cases:
http://www.aerospike.com/when-to-use-aerospike-vs-redis/
https://redislabs.com/blog/nosql-performance-aerospike-cassandra-datastax-couchbase-redis
Redis and Aerospike are different and both have their pros and cons, but Redis seems a better fit than Aerospike in the 2 following use cases:
when we don't need replication
We are using a big cache with intensive writes and a very short TTL (20s) for deduplication. There is no point in replicating this data. Redis would probably use half as much CPU and less than half as much RAM as Aerospike. It would be cheaper and as fast, or even faster thanks to pipelining.
when we need cross data-center replication
We have one large database that we need to access from 5 data centres, with lots of writes and intensive reads. There is no perfect solution, but the best one so far seems to be storing the central database in Redis and keeping a copy in each data centre using Redis master-slave replication.

Redis as a database

I want to use Redis as a database, not a cache. From my (limited) understanding, Redis is an in-memory datastore. What are the risks of using Redis, and how can I mitigate them?
You can use Redis as an authoritative store in a number of different ways:
Turn on AOF (Append-Only File); see the AOF docs. This keeps a log of every write command made against your dataset in real time.
Run Redis using master-slave replication; see the replication docs. This lets you provide high availability if one of your instances fails.
If you're running on something like EC2, you can back your Redis partition with EBS to provide another layer of protection against instance failure.
On the horizon is Redis Cluster - this is specifically designed as a way to run Redis in a way that should help with HA and scalability. However, this won't appear for at least another six months or so.
Redis is an in-memory store that can also write its data to disk. You can specify how often it performs an fsync, which makes Redis safer but also slower (a trade-off).
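The fsync trade-off can be illustrated with a toy append-only log. This is a sketch of the general idea only, not how Redis implements AOF: the AppendOnlyLog class and its fsync_every parameter are hypothetical, and it counts writes where Redis's appendfsync setting chooses between always, everysec and no.

```python
import os
import tempfile


class AppendOnlyLog:
    """Toy append-only command log with a configurable fsync interval."""

    def __init__(self, path: str, fsync_every: int):
        self._file = open(path, "ab")
        self._fsync_every = fsync_every  # 1 = fsync on every write (safest, slowest)
        self._pending = 0                # writes not yet forced to disk
        self.fsync_calls = 0

    def append(self, command: str) -> None:
        self._file.write(command.encode("utf-8") + b"\n")
        self._pending += 1
        if self._pending >= self._fsync_every:
            self._sync()

    def _sync(self) -> None:
        self._file.flush()
        os.fsync(self._file.fileno())  # ask the OS to push the data to the device
        self._pending = 0
        self.fsync_calls += 1

    def close(self) -> None:
        if self._pending:
            self._sync()
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = AppendOnlyLog(os.path.join(tmp, "commands.log"), fsync_every=100)
        for i in range(1000):
            log.append(f"SET key:{i} {i}")
        log.close()
        print("fsync calls:", log.fsync_calls)
```

A larger fsync_every means fewer expensive fsync calls, but more writes can be lost if the machine crashes between syncs.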
Still, I am not certain that Redis is yet in a state to store (mission-)critical data. If, for example, it is not a huge problem when one or more tweets (twitter.com) or something similar get lost, then I would certainly use Redis. There is also a lot of information about persistence on Redis's own website.
You should also be aware of some persistence problems that could occur; the blog of antirez (the Redis maintainer) has an article on this. His blog is worth reading in general because it has some interesting articles.
I would like to share a few things that we have learned by using Redis as the primary database in our service. We chose Redis because we had data that could not be partitioned, and we wanted to get the best performance we could out of one box.
Pros:
Redis was unbeatable in raw performance. We got 10K transactions per second out of the box (note that one transaction involved multiple Redis commands). We were able to hit a rate of 25K+ transactions per second after a few optimizations, along with Lua scripts. So when it comes to performance per box, Redis is unmatched.
Redis is very simple to set up and has a gentle learning curve compared to SQL and other NoSQL datastores.
Cons:
Redis supports only a few primitive data structures, such as hashes, sets and lists, and operations on those data structures. These are more than sufficient when you are using Redis as a cache, but if you want to use Redis as a full-fledged primary data store, you will feel constrained. We had a tough time modelling our data requirements using these simple types.
The biggest problem we have seen with Redis was the lack of flexibility. Once you have settled on the structure of your data, any modification to storage requirements or access patterns virtually requires rethinking the entire solution. Not sure if this is the case with all NoSQL data stores though (I have heard MongoDB is more flexible, but haven't used it myself).
Since Redis is single-threaded, CPU utilization is very low. You can't put multiple Redis instances on the same machine to improve CPU utilization, as they will compete for the same disk, making the disk the bottleneck.
Lack of horizontal scalability is a problem as mentioned by other answers.
As Redis is in-memory storage, you cannot store data that won't fit in your machine's memory. Redis usually performs very badly when the data it stores is larger than 1/3 of the RAM size. So this is the fatal limitation of using Redis as a database.
Certainly, you can distribute your big data across several Redis instances, but you have to do it all manually on your own. The operation is usually done like this (assuming you start with only 1 instance):
Use its master-slave mechanism to replicate data to a second machine. Now you have 2 copies of the same data.
Cut off the connection between master and slave.
Delete the first half (split by hashing, etc.) of the data on the first machine, and delete the second half of the data on the second machine.
Tell all clients (PHP, C, etc.) to operate on the first machine if the specified keys are on that machine, and otherwise on the second machine.
This is how Redis scales! You also have to stop your service to prevent any writes during the migration.
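The manual split described above can be sketched with plain dictionaries standing in for the two Redis machines. Nothing here talks to a real Redis server; the helper names (owner, split_replica, lookup) and the CRC32-based split are assumptions chosen for illustration.

```python
import zlib


def owner(key: str) -> int:
    """Return 0 for the first machine, 1 for the second (hash-based split)."""
    return zlib.crc32(key.encode("utf-8")) % 2


def split_replica(data: dict, machine: int) -> dict:
    """Keep only the keys this machine owns, i.e. delete the other half."""
    return {k: v for k, v in data.items() if owner(k) == machine}


def lookup(key: str, machines: list):
    """Client-side routing: read from whichever machine owns the key."""
    return machines[owner(key)].get(key)


if __name__ == "__main__":
    original = {f"user:{i}": i for i in range(1000)}
    # After replication and disconnecting the slave, both machines hold a full copy.
    machines = [dict(original), dict(original)]
    # Each machine then drops the half of the keys it does not own.
    machines = [split_replica(copy, idx) for idx, copy in enumerate(machines)]
    # Clients route every request by key.
    print(lookup("user:42", machines))  # prints 42
```

Every client has to apply the same owner function, which is why the routing rule ends up duplicated across all client languages.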
From our experience, we reached these conclusions about Redis: Redis is not the right choice for storing more than 30G of data, Redis is not scalable, and Redis is quite suitable for prototype development.
We later found an alternative to Redis, SSDB (https://github.com/ideawu/ssdb), a LevelDB server that supports nearly all of the Redis APIs. It is suitable for storing more than 1TB of data; the limit depends only on the size of your hard disk.
Redis is a database, that means we can use it for persisting information for any kind of app, information like user accounts, blog posts, comments and so on. After storing information we can retrieve it later on by writing queries.
Now this behavior is similar to just about every other database, but what is the difference? Or rather why would we use it over any other database?
Redis is fast.
Redis is not fast because it's written in a special programming language or anything like that, it's fast because all data is stored in-memory.
Most databases store their information across both the computer's memory and the hard drive. Accessing data in memory is fast, but accessing data stored on a hard disk is relatively slow.
So rather than storing data on a hard disk, Redis stores it in memory.
The downside is that working with data larger than the amount of memory your computer has is not going to work.
That may sound like a tremendous problem, but Redis has clear strategies for working around this limitation.
The above is just the first reason why Redis is so fast.
The second reason is that Redis stores all of its data or rather organizes all of its data in simple data structures such as Doubly Linked Lists, Sorted Sets and so on.
These data structures have well-known and well-understood performance characteristics. So as developers we can decide exactly how our information is organized and how to efficiently query data.
It's also very fast because Redis is simple in nature, it's not feature heavy; feature heavy datastores like Postgres have performance penalties.
So to use Redis as a database, you have to know how to store data in limited space, how to organize it into the simple data structures mentioned above, and how to work around the limited feature set.
So as far as mitigating risks goes, you start by thinking in terms of a Redis design methodology rather than a SQL database design methodology. What do I mean?
So instead of, step 1. Put the data in tables, step 2. figure out how we will query it.
With Redis it's more:
Step 1. Figure out what queries we need to answer.
Step 2. Structure data to best answer those queries.
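As a small illustration of the two steps, suppose the queries are "fetch a post by id" and "list an author's latest posts". The sketch below uses plain Python structures as stand-ins for Redis types (a dict per post, roughly like a hash, and a timestamp-ordered list per author, roughly like a sorted set). The BlogStore class and its method names are hypothetical and no Redis client is involved.

```python
import bisect
from collections import defaultdict


class BlogStore:
    """In-memory stand-in shaped around two known queries."""

    def __init__(self):
        self.posts = {}                    # post_id -> fields (like a hash)
        self.timeline = defaultdict(list)  # author -> sorted [(timestamp, post_id)]

    def add_post(self, post_id, author, timestamp, title):
        self.posts[post_id] = {"author": author, "timestamp": timestamp, "title": title}
        # Keep each author's timeline ordered so "latest N" is a cheap slice.
        bisect.insort(self.timeline[author], (timestamp, post_id))

    def get_post(self, post_id):
        """Query 1: fetch a post by id."""
        return self.posts.get(post_id)

    def latest_titles(self, author, count):
        """Query 2: the author's most recent `count` titles, newest first."""
        if count <= 0:
            return []
        # .get avoids creating an empty timeline for an unknown author.
        newest_first = reversed(self.timeline.get(author, [])[-count:])
        return [self.posts[post_id]["title"] for _, post_id in newest_first]


if __name__ == "__main__":
    store = BlogStore()
    store.add_post("p1", "ann", 100, "first")
    store.add_post("p2", "ann", 300, "third")
    store.add_post("p3", "ann", 200, "second")
    print(store.latest_titles("ann", 2))  # ['third', 'second']
```

A new query, such as "all posts with a given tag", would need another structure maintained at write time, which matches the inflexibility described in the earlier answer.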