I am using Redis as a time series database. I am importing MySQL data into Redis by reshaping each row into a score/value pair so that it fits into a sorted set. I have 26 tables, and at some point each table may grow to 100 million records.
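A minimal sketch of this kind of sorted-set import, assuming redis-py and mysql-connector, with a hypothetical table and key name (not my real schema):

```python
import redis
import mysql.connector

r = redis.Redis(host="localhost", port=6379)
db = mysql.connector.connect(user="user", password="pass", database="mydb")

cur = db.cursor()
# Hypothetical table/columns: each row becomes a sorted-set member whose
# score is its timestamp, so time-range queries stay cheap (ZRANGEBYSCORE).
cur.execute("SELECT id, ts, value FROM readings")
for row_id, ts, value in cur:
    member = f"{row_id}:{value}"
    r.zadd("readings:series", {member: float(ts)})
```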
Is it okay to store that much data in Redis, given that Redis keeps its data in memory?
Is there a chance that Redis will crash? If so, how often will it crash?
Is it okay to use Redis for my task?
You should ask yourself how you intend to query your data. Will you access single values or do scans?
Depending on your answers, a more specialized solution might be a better fit for your problem:
Warp 10 (disclaimer: I help build it)
InfluxDB
KairosDB
OpenTSDB
I have a 100GB dataset with rows in the format shown below.
cookie,iplong1,iplong2..,iplongN
I am currently trying to fit this data into Redis as a sorted set data structure. I also need to set a TTL for each of those IPs. To implement a TTL on each element in the set, my plan is to give each element a score equal to the epoch time it was inserted, and perhaps write a separate script that parses the scores and removes expired IPs based on them. With that said, I am also noticing that it takes almost 100GB of memory to hold this 100GB dataset. I was wondering if there is any other way of efficiently packing this data into Redis with a minimal memory footprint.
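A minimal sketch of the score-as-expiry idea described above, assuming redis-py; the key layout and retention window are illustrative only:

```python
import time
import redis

r = redis.Redis()

TTL_SECONDS = 7 * 24 * 3600  # hypothetical retention window

def add_ip(cookie: str, iplong: int) -> None:
    # The score is the insertion time, so it doubles as an expiry marker.
    r.zadd(f"cookie:{cookie}", {str(iplong): time.time()})

def purge_expired(cookie: str) -> int:
    # Remove every member whose score (insert time) is older than the TTL.
    cutoff = time.time() - TTL_SECONDS
    return r.zremrangebyscore(f"cookie:{cookie}", "-inf", cutoff)
```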
I would also be happy to know if there is any other tech stack out there that can handle this better. The dataset would be updated frequently based on hourly logs, and the expectation is that we should be able to read from it quickly and concurrently.
Thanks in advance.
I'm using the BigQuery client for Java to do small reads on a table with about 5GB of data. The queries I run follow the most standard SQL, like SELECT foo FROM my-table WHERE bar=$1, where the result will be at most 1 row. I need to do this at a high frequency, so performance is a big concern. How do I optimize for this?
I thought about pulling the entire data set periodically since it's only 5GB, but then again 5GB sounds like a lot to be constantly keeping in memory.
Running this query in the BigQuery console shows something like Query complete (0.6 sec elapsed, 4.2 GB processed). That's fast for 4.2 GB, but not fast enough. Again, I need to read from it very frequently but write rarely (maybe once a day or week).
Maybe tell the server to cache the processed data somehow?
You don't have control over the Cache layer in BigQuery. That is something the service does automatically for you. Unfortunately typical cache lifetime is 24 hours, and the cached results are best-effort and may be invalidated sooner (Official docs).
A query completing in 0.6s seems to be good for BigQuery. I'm afraid that if you are looking for something faster, BigQuery may not be the data warehouse for your use case.
BigQuery is built for analytical processing, not for interacting with individual rows. The best practice would be, as you mentioned, to hold a copy of the data in a place that allows quicker, more efficient reading of individual rows (like a MySQL database).
However, you can still vastly optimize the amount of data scanned in your query by clustering the table on the field that you're filtering on.
https://cloud.google.com/bigquery/docs/creating-clustered-tables
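As a hedged sketch (shown with the Python client for brevity; the Java client runs the same DDL), where the dataset, table and column names are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# Hypothetical names: rebuild the table clustered on the filter column `bar`,
# so point lookups only scan the blocks that contain the requested value.
ddl = """
CREATE TABLE `my_dataset.my_table_clustered`
CLUSTER BY bar
AS SELECT * FROM `my_dataset.my_table`
"""
client.query(ddl).result()  # wait for the DDL job to finish
```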
We are currently using Redis and it's a great in-memory datastore. We're starting to look at some new problems where the in-memory limitation is a factor, and we are looking at other options. One we came across is Aerospike - it seems very fast, even faster than Redis for in-memory single-shard operations.
Now that we're adding this to our stack, I'm trying to understand: in which use cases would Aerospike not be able to replace Redis?
Aerospike supports fewer data types than Redis; for example, pub/sub is not available in Aerospike. However, Aerospike is a distributed key-value store and has superior clustering features.
The two are both great databases. It really depends on how big of a dataset you're handling, and your expectations of growth.
Redis:
Key/value store; dataset fits into RAM on a single machine, or you can shard it yourself across multiple machines (and/or cores, since it's single-threaded); persists data to disk; has data structures like lists/sets; basic pub/sub; simple slave replication; Lua scripting.
Aerospike:
Key/value row store (a value contains bins, and bin values can themselves be maps/lists/values, giving multiple levels of nesting); multithreaded to use all cores; built for clustering across machines with replication, including cross-datacenter replication; Lua scripting for UDFs. Can run directly on SSDs, so you can store much more data than fits into RAM.
Comparison:
If you just have a smaller dataset or are fine with single-core performance, then Redis is great. It's quick to install, simple to run, and easy to attach a slave to with one command if you need more read scalability. Redis also has more unique functionality with list/set/bitmap operations, so you can do "more" out of the box.
If you want to store more complicated or nested data, or need more performance on a single machine or clustering, then Aerospike gets the job done really well with less operational overhead. It offers very fast performance and easy cluster setup, with all nodes playing exactly the same role, so you can scale reads and writes.
That's the big difference: scalability beyond a single core or server. With Lua scripting, you can usually reproduce in Aerospike any native feature that Redis has. If you have lots of data (like TBs) then Aerospike's SSD feature means you get RAM-like performance without the RAM cost.
Have you looked at the benchmarks? I believe each one performs differently under different conditions and use cases:
http://www.aerospike.com/when-to-use-aerospike-vs-redis/
https://redislabs.com/blog/nosql-performance-aerospike-cassandra-datastax-couchbase-redis
Redis and Aerospike are different and both have their pros and cons, but Redis seems a better fit than Aerospike in the following two use cases:
when we don't need replication
We are using a big cache with intensive writes and a very short TTL (20s) for deduplication. There is no point in replicating this data. Redis would probably use half as much CPU and less than half as much RAM as Aerospike. It would be cheaper and just as fast, or even faster thanks to pipelining (see the sketch after these two cases).
when we need cross data-center replication
We have one large database that we need to access from 5 data centres, with lots of writes and intensive reads. There is no perfect solution, but the best one so far seems to be storing the central database in Redis and a copy in each data centre using Redis master-slave replication.
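For the deduplication case above, a minimal redis-py sketch of what that could look like; the key format and batching are assumptions, not our production code:

```python
import redis

r = redis.Redis()

def deduplicate(event_ids):
    """Return only the event ids not seen in the last 20 seconds."""
    # Pipeline the SET NX EX calls so the whole batch costs one round trip.
    pipe = r.pipeline()
    for event_id in event_ids:
        pipe.set(f"dedup:{event_id}", 1, nx=True, ex=20)
    results = pipe.execute()
    # SET ... NX returns True only when the key was newly created.
    return [e for e, created in zip(event_ids, results) if created]
```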
The task is to filter and analyze a huge amount of logfile data (around 8TB) from a finished research project. The idea is to fill a database with the data so that different analysis tasks can be run on it later.
The values are stored comma-separated. In principle, each entry is a tuple of up to 5 values:
id, timestamp, type, v1, v2, v3, v4, v5
In a first attempt using MySQL, I used one table with one log entry per row, so there is no direct relation between the log values. The downside here is slow querying of subsets.
Because there is no relation, I looked into alternatives like NoSQL databases, and column-based stores like HBase or Cassandra seemed to be a perfect fit for this kind of data. But these systems are made for huge distributed setups, which we do not have. In our case the analysis will run on a single machine or perhaps a few VMs.
Which kind of database would fit this task? Is it worth setting up a single-machine instance with Hadoop+HBase... or is this all a bit oversized?
What database would you choose to do high-performance logfile analysis?
EDIT: Maybe it is not clear from my question that we cannot spend money on cloud services or new hardware. The question is whether there are benefits in using NoSQL approaches instead of MySQL (especially for this data). If there are none, or if they are so small that the effort of setting up a NoSQL system is not worth the benefit, we can use our ESXi infrastructure and MySQL.
EDIT2: I'm still having the problem here. I did further experiments with MySQL and inserted just a quarter of all available data. The insert has now been running for over 2 days and is not yet finished. Currently there are 2,147,483,647 rows in my single-table DB. With indexes this takes 211.2 GiB of disk space. And this is just a quarter of all the logging data...
A query of the form
SELECT * FROM `table` WHERE `timestamp`>=1342105200000 AND `timestamp`<=1342126800000 AND `logid`=123456 AND `unit`="UNIT40";
takes 761 seconds to complete, in this case returning one row.
There is a combined index on timestamp, logid, unit.
So I think this is not the way to go, because later in the analysis I will have to get all entries in a time range and compare the data points.
I read about MongoDB and Redis, but the problem with them is that they are in-memory databases.
In the later analysis process there will be a very small amount of concurrent database access. In fact, the analysis will be run from a single machine.
I do not need redundancy. I would be able to regenerate the database in case of a failure.
Once the database is completely written, there would also be no need to update it or add further rows.
What do you think about alternatives like Redis, MongoDB and so on? If I understand this right, I would need RAM on the order of my data size...
Is this task even somehow possible with a single-node system, or with maybe two nodes?
Well, I personally would prefer the faster solution, since you said you need high-performance analysis. The problem is, if you have to set up a whole new system to do so and the performance improvement would be minor relative to the additional effort needed, then stay with SQL.
In our company, we have quite a small database containing not even half a GB of data on the VM. The problem is, as soon as you use a VM you will have major performance issues; when opening the database on the VM you can go for a coffee in the meantime ;)
But if the time until the database is loaded into cache is not so important, it doesn't matter. It all depends on how much faster you think the new system will be and how much effort you will have to put into it, but as I said, I'd prefer the faster solution if you have to go for "high-performance analysis".
I want to use Redis as a database, not a cache. From my (limited) understanding, Redis is an in-memory datastore. What are the risks of using Redis, and how can I mitigate them?
You can use Redis as an authoritative store in a number of different ways:
Turn on AOF (the append-only file); see the AOF docs. This will keep a real-time log of all Redis commands made against your dataset.
Run Redis using master-slave replication; see the replication docs. This will allow you to provide high availability if one of your instances fails.
If you're running on something like EC2 you can EBS back your Redis partition to provide another layer of protection against instance failure.
On the horizon is Redis Cluster - this is specifically designed as a way to run Redis in a way that should help with HA and scalability. However, this won't appear for at least another six months or so.
Redis is an in-memory store which can also write the data back to disk. You can specify how often to fsync to make Redis safer (but also slower => trade-off).
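For illustration, a minimal redis-py sketch of turning those knobs at runtime (the same settings can simply live in redis.conf):

```python
import redis

r = redis.Redis()

# Enable the append-only file (equivalent to "appendonly yes" in redis.conf).
r.config_set("appendonly", "yes")
# fsync policy: "always" is safest but slowest, "everysec" is the usual
# trade-off, and "no" leaves flushing entirely to the OS.
r.config_set("appendfsync", "everysec")
```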
Still, I am not certain whether Redis is yet in a state to really store (mission-)critical data. If, for example, it is not a huge problem when one more tweet (twitter.com) or something similar gets lost, then I would certainly use Redis. There is also a lot of information available about persistence on Redis's own website.
You should also be aware of some persistence problems that can occur; read the blog articles by antirez (the Redis maintainer). His blog is worth reading because he has some interesting articles.
I would like to share a few things that we have learned by using Redis as the primary database in our service. We chose Redis since we had data that could not be partitioned, and we wanted to get the best performance we could out of one box.
Pros:
Redis was unbeatable in raw performance. We got 10K transactions per second out of the box (note that one transaction involved multiple Redis commands). We were able to hit a rate of 25K+ transactions per second after a few optimizations, along with Lua scripts (see the sketch below). So when it comes to performance per box, Redis is unmatched.
Redis is very simple to set up and has a very small learning curve compared to other SQL and NoSQL datastores.
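A minimal sketch of the kind of Lua-scripted operation alluded to above, using redis-py; the key names and logic are hypothetical, not our production scripts:

```python
import redis

r = redis.Redis()

# Bundle several commands into one server-side script so a "transaction"
# costs a single round trip and runs atomically.
script = """
local n = redis.call('INCR', KEYS[1])
redis.call('EXPIRE', KEYS[1], ARGV[1])
return n
"""
bump = r.register_script(script)
print(bump(keys=["counter:signups"], args=[3600]))
```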
Cons:
Redis supports only a few primitive data structures like hashes, sets, lists, etc., and operations on those data structures. These are more than sufficient when you are using Redis as a cache, but if you want to use Redis as a full-fledged primary data store, you will feel constrained. We had a tough time modelling our data requirements using these simple types.
The biggest problem we have seen with Redis is the lack of flexibility. Once you have settled on the structure of your data, any modification to the storage requirements or access patterns practically requires re-thinking the entire solution. I'm not sure if this is the case with all NoSQL data stores, though (I have heard MongoDB is more flexible, but haven't used it myself).
Since Redis is single-threaded, CPU utilization is very low. You can't put multiple Redis instances on the same machine to improve CPU utilization, as they will compete for the same disk, making the disk the bottleneck.
Lack of horizontal scalability is a problem as mentioned by other answers.
As Redis is an in-memory store, you cannot store more data than fits in your machine's memory. Redis usually performs very badly when the data it stores is larger than about 1/3 of the RAM size. So this is the fatal limitation of using Redis as a database.
Certainly, you can distribute your big data across several Redis instances, but you have to do it all manually yourself. The operation is usually done like this (assuming you start with only 1 instance):
Use the master-slave mechanism to replicate the data to a second machine. Now you have 2 copies of the same data.
Cut off the connection between master and slave.
Delete the first half (split by hashing, etc.) of the data on the first machine, and delete the second half of the data on the second machine.
Tell all clients (PHP, C, etc.) to operate on the first machine if the specified keys are on that machine, and otherwise on the second machine.
This is how Redis scales! You also have to stop your service to prevent any writes during the migration.
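A rough sketch of the client-side routing this implies, with redis-py; the hostnames and hash function are placeholders:

```python
import zlib
import redis

# The two instances left after the split described above (hypothetical hosts).
nodes = [
    redis.Redis(host="redis-a.example.com", port=6379),
    redis.Redis(host="redis-b.example.com", port=6379),
]

def node_for(key: str) -> redis.Redis:
    # Every client must agree on this hash, or keys end up on the wrong node.
    return nodes[zlib.crc32(key.encode()) % len(nodes)]

node_for("user:42").hset("user:42", mapping={"name": "alice"})
profile = node_for("user:42").hgetall("user:42")
```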
From the experience we have had, our conclusion about Redis is: Redis is not the right choice for storing more than 30GB of data, Redis is not scalable, and Redis is quite suitable for prototype development.
We later found an alternative to Redis: SSDB (https://github.com/ideawu/ssdb), a LevelDB server that supports nearly all of the Redis APIs. It is suitable for storing more than 1TB of data; the limit depends only on the size of your hard disk.
Redis is a database, which means we can use it to persist information for any kind of app: information like user accounts, blog posts, comments and so on. After storing information, we can retrieve it later by writing queries.
Now this behavior is similar to just about every other database, so what is the difference? Or rather, why would we use it over any other database?
Redis is fast.
Redis is not fast because it's written in some special programming language or anything like that; it's fast because all data is stored in memory.
Most databases store their information split between the computer's memory and the hard drive. Accessing data in memory is fast, but accessing it on a hard disk is relatively slow.
So rather than storing data on the hard disk, Redis stores it in memory.
Now, the downside to this is that working with data larger than the amount of memory your computer has is not going to work.
That may sound like a tremendous problem, but Redis has clear strategies for working around this limitation.
The above is just the first reason why Redis is so fast.
The second reason is that Redis stores, or rather organizes, all of its data in simple data structures such as doubly linked lists, sorted sets and so on.
These data structures have well-known and well-understood performance characteristics. So as developers we can decide exactly how our information is organized and how to efficiently query data.
It's also very fast because Redis is simple in nature; it's not feature-heavy, and feature-heavy datastores like Postgres pay performance penalties.
So to use Redis as a database, you have to know how to store your data in limited space, how to organize it into the simple data structures mentioned above, and how to work around the limited feature set.
As far as mitigating risks goes, the way to start is to think in terms of a Redis design methodology rather than a SQL database design methodology. What do I mean?
Instead of: step 1, put the data in tables; step 2, figure out how we will query it.
With Redis it's more:
Step 1. Figure out what queries we need to answer.
Step 2. Structure data to best answer those queries.
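A small sketch of what "structure data to answer the query" can look like in practice, assuming redis-py and a hypothetical blog-post example:

```python
import time
import redis

r = redis.Redis()

# Query to answer: "what are the 10 most recent posts?" So each post lives
# in a hash, and a sorted set scored by publish time indexes the post ids.
def publish(post_id: str, title: str, body: str) -> None:
    r.hset(f"post:{post_id}", mapping={"title": title, "body": body})
    r.zadd("posts:by_time", {post_id: time.time()})

def latest_posts(n: int = 10):
    ids = r.zrevrange("posts:by_time", 0, n - 1)
    return [r.hgetall(f"post:{i.decode()}") for i in ids]
```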