Storing 30M records in redis

I'm wondering about the most efficient way to store this data.
I need to track 30-50 million data points per day. Reads and writes need to be extremely fast, so I'm using Redis.
The data only needs to last for 24 hours, at which point it will EXPIRE.
The data looks like this as a key/value hash:
{
"statistics:a5ded391ce974a1b9a86aa5322ea9e90": {
xbi: 1,
bid: 0.24024,
xpl: 25.0,
acc: 40,
pid: 43,
cos: 0.025,
xmp: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
clu: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
}
I've replaced the actual string with a lot of x's, but that IS the proper length of the string.
So far, according to my calculations, this will use hundreds of GB of memory. Does that seem correct?
This is mostly ephemeral logging data that's important, but not important enough to try to support writing to disk or failovers. I am comfortable keeping it on one machine, if that helps make this easier.
What would be the best way to reduce memory usage in this scenario? Is there a better way I can do this? Does Redis support 300GB on a single instance?

In redis.conf, set hash-max-ziplist-value to 1 more than the length of the field 'xmp'. Then restart Redis, and watch your memory usage go down significantly.
The default value is 64. Increasing it increases CPU utilization when you modify or add fields in the hash. But your use case seems to be create-only, and in that case there shouldn't be any drawbacks to increasing the setting.
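A minimal sketch of the same idea, assuming redis-py and a local instance; instead of editing redis.conf and restarting, it changes the setting at runtime with CONFIG SET, which Redis also supports. The key, field values, and threshold are illustrative, and on newer Redis versions the compact encoding may be reported as 'listpack' rather than 'ziplist':

import redis

r = redis.Redis(host="localhost", port=6379)

# Raise the threshold above the longest field value so the hash stays
# in the compact ziplist/listpack encoding. 2048 is illustrative; pick a
# value just above the real length of 'xmp'.
r.config_set("hash-max-ziplist-value", 2048)

r.hset("statistics:a5ded391ce974a1b9a86aa5322ea9e90", mapping={
    "xbi": 1, "bid": 0.24024, "xpl": 25.0, "acc": 40,
    "pid": 43, "cos": 0.025, "xmp": "x" * 900, "clu": "x" * 900,
})

# Should report the compact encoding instead of 'hashtable'.
print(r.object("encoding", "statistics:a5ded391ce974a1b9a86aa5322ea9e90"))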

this will use hundreds of GB of memory. Does that seem correct?
YES
Does redis support 300GB on a single instance?
YES
Is there a better way I can do this?
You can try the following methods:
Avoid Using Hash
Since you always get all fields of the log with HGETALL, there's NO need to save the log as HASH. HASH consumes more memory than STRING.
You can serialize all fields into a string, and save the log as a key-value pair:
SET 'statistics:a5ded391ce974a1b9a86aa5322ea9e90' '{xbi: 1, bid: 0.24024, and other fields}'
@Sripathi Krishnan's answer gives another way to avoid HASH, i.e. configuring Redis to encode the HASH as a ZIPLIST. It's a good idea if you don't share your Redis instance with other applications; otherwise, this modification might cause problems for them.
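A minimal sketch of the serialize-to-string approach, assuming redis-py (the field values are illustrative; the 24-hour TTL comes from the question):

import json
import redis

r = redis.Redis()

log = {"xbi": 1, "bid": 0.24024, "xpl": 25.0, "acc": 40,
       "pid": 43, "cos": 0.025, "xmp": "...", "clu": "..."}

# One STRING per log instead of one HASH; EX sets the 24-hour expiry.
r.set("statistics:a5ded391ce974a1b9a86aa5322ea9e90", json.dumps(log), ex=86400)

# Reading it back is a single GET plus a JSON parse instead of HGETALL.
entry = json.loads(r.get("statistics:a5ded391ce974a1b9a86aa5322ea9e90"))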
Compress The Data
In order to reduce memory usage, you can try to compress your data. Redis can store binary strings, so you can use gzip, snappy or another compression algorithm to compress the log text into a binary string, and save that into Redis.
Normally, you get better compression when the input is bigger. So you'd better compress the whole log, instead of compressing each field one by one.
The side effect is that the producer and consumer of the log need to spend some CPU compressing and decompressing the data. However, normally that's NOT a problem, and it also saves some network bandwidth.
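A minimal sketch, again assuming redis-py and using gzip from the Python standard library (snappy or another codec would work the same way):

import gzip
import json
import redis

r = redis.Redis()

log = {"xbi": 1, "bid": 0.24024, "xpl": 25.0, "acc": 40,
       "pid": 43, "cos": 0.025, "xmp": "x" * 900, "clu": "x" * 900}

# Compress the whole serialized log; Redis strings are binary-safe.
payload = gzip.compress(json.dumps(log).encode("utf-8"))
r.set("statistics:a5ded391ce974a1b9a86aa5322ea9e90", payload, ex=86400)

# The consumer decompresses on read.
raw = r.get("statistics:a5ded391ce974a1b9a86aa5322ea9e90")
entry = json.loads(gzip.decompress(raw))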
Batch Write and Batch Read
As I mentioned above, if you want better compression, you need a bigger input. So if you can write multiple logs in a batch, you can compress the batch of logs to get a better compression ratio (see the sketch after these steps).
Compress multiple logs into a batch: compress(log1, log2, log3) -> batch1: batch-result
Put the batch result into Redis as a key-value pair: SET batch1 batch-result
Build an index for the batch: MSET log1 batch1 log2 batch1 log3 batch1
When you need to get the log:
Search the index to get the batch key: GET log1 -> batch1
Get the batch result: GET batch1 -> batch-result
Decompress the batch result and look up the log from the result
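A minimal sketch of the batch scheme, assuming redis-py (key names like batch1 and log1 are illustrative, matching the steps above):

import gzip
import json
import redis

r = redis.Redis()

logs = {"log1": {"xbi": 1}, "log2": {"xbi": 2}, "log3": {"xbi": 3}}

# Write path: one compressed batch value plus one small index entry per log.
batch_key = "batch1"
r.set(batch_key, gzip.compress(json.dumps(logs).encode("utf-8")), ex=86400)
r.mset({log_id: batch_key for log_id in logs})  # index: log id -> batch key
# (In practice the index keys need their own EXPIRE, since MSET sets no TTL.)

# Read path: follow the index, decompress the batch, pick out the log.
batch_key_for_log = r.get("log1")                          # -> b"batch1"
batch = json.loads(gzip.decompress(r.get(batch_key_for_log)))
entry = batch["log1"]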
The last method is the most complicated one, and the extra index costs some extra memory. However, it can greatly reduce the size of your data.
How much these methods help also depends largely on your logs. You should do lots of benchmarking :)

Does dataframe.repartition(x) make execution faster

I have a Spark script that reads data from Amazon S3 and then writes to another bucket using the Parquet format.
This is what the code looks like:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods.repartition(25).write.format("parquet").mode("OverWrite").save("AnotherLocationInS3")
My question is: how does the repartition argument (here 25) affect the execution time? Should I increase it so the script runs faster?
Second question: would it be better if I cached my df before the last line?
Thank you
In typical setups neither repartition nor cache will help you in this specific case, since you read data from a non-splittable format:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods will have only one partition.
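A quick way to confirm this, as a minimal PySpark sketch (assuming the same input path and a running SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A gzipped CSV cannot be split, so the whole file lands in one partition.
df_ods = spark.read.csv("LocationInFirstBucket.csv.gz", header=True, sep=";")
print(df_ods.rdd.getNumPartitions())  # prints 1 until you repartition explicitly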
In such a case repartitioning would make sense if you performed any actual processing on this data.
However, if you just write to a distributed file system, repartitioning simply doubles the cost: you have to send the data to other nodes first (which involves serialization, deserialization, network transfer, and writes to disk) and then still write to the distributed file system.
There are of course edge cases where this makes sense. If the network connecting your cluster nodes is much faster than the network connecting your cluster to the S3 nodes, effective latency might be a bit lower.
As for caching, there is no value in caching here at all. Caching a Dataset is expensive, and makes sense only if the persisted data is reused.
Answer 1: Whether 25 partitions is the right number depends on how much data you have and on the number of executors you provided. If your Spark code runs on a cluster with more than one executor and the data is not repartitioned, repartitioning lets the write happen in parallel and speeds it up.
Answer 2: There is no need to cache df before the last line, because your code performs only a single action. Cache only if you perform multiple actions on the DataFrame and don't want it recalculated for each one.
The thing here is that Spark can parallelize writing only up to a certain point, since one file can't be written by multiple executors at the same time.
Repartition helps with this parallelization because it will write 25 different files (one for each partition). If you increase the number of partitions, you increase the number of written files and hence speed up the execution. This comes at a price, because reading time increases with the number of files to be read.
The limit is the number of executors you are running your job with, e.g. if you are running with 25 executors, then setting repartition to 26 will not help you, because to write the 26th partition one of the previous 25 would have to be finished first.
For the other question, I don't think .cache() will help you because Spark is lazy; maybe this article can help you further.

Redis 1+ min query time on lists with 15 large JSON objects (20MB total)

I use Redis to cache database inserts. For this I created a list CACHE into which I push serialized JSON lists. In pseudocode:
let entries = [{a}, {b}, {c}, ...];
redis.rpush("CACHE", JSON.stringify(entries));
The idea is to run this code for an hour, then later do an
let all = redis.lrange("CACHE", 0, LIMIT);
processAndInsert(all);
redis.ltrim("CACHE", 0, all.length);
Now the thing is that each entries value can be relatively large (but far below the 512MB / whatever Redis limit I read about). Each of the a, b, c is an object of probably 20 bytes, and entries itself can easily have 100k+ objects / 2MB.
My problem now is that even for very short CACHE lists of only 15 entries, a simple lrange can take many minutes(!), even from the redis-cli (my node.js actually dies with a "FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory", but that's a side comment).
The debug output for the list looks like this:
127.0.0.1:6379> debug object "CACHE"
Value at:00007FF202F4E330 refcount:1 encoding:linkedlist serializedlength:18104464 lru:12984004 lru_seconds_idle:1078
What is happening? Why is this so massively slow, and what can I do about it? This does not seem to be a normal slowness, something seems to be fundamentally wrong.
I am using a local Redis 2.8.2101 (x64), ioredis 1.6.1, node.js 0.12 on a relatively hardcore Windows 10 gaming machine (i5, 16GB RAM, 840 EVO SSD, ...) by the way.
Redis is great at doing lots of small operations, but not so great at doing a small number of "very big" operations.
I think you should re-evaluate your algorithm and try to break your data apart into smaller chunks. Not only will you save bandwidth, you also won't lock your Redis instance for long periods of time.
Redis offers many data structures you should be able to use for finer-grained control over your data.
Well, still, in this case, since you are running Redis locally, and assuming you are not running anything else but this code, I doubt that either bandwidth or Redis itself is the problem. I'm more inclined to think that this line:
JSON.stringify()
is the main culprit behind the slow execution you are seeing.
JSON serialization of a 20MB string is not a simple operation. The process needs to allocate many small strings, and also has to go through your entire array and inspect each item individually. All of this takes a long time for an object this big.
Again, if you were breaking your data apart and doing smaller operations with Redis, you wouldn't need the JSON serializer at all.
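The question's code is Node.js, but the "smaller operations" idea is language-agnostic; here is a minimal redis-py sketch of it (the key name and entries are illustrative), pushing each entry as its own small list element and draining the list in bounded chunks:

import json
import redis

r = redis.Redis()
entries = [{"a": 1}, {"b": 2}, {"c": 3}]  # illustrative entries

# Producer: one small RPUSH per entry, batched client-side with a pipeline.
pipe = r.pipeline()
for e in entries:
    pipe.rpush("CACHE", json.dumps(e))
pipe.execute()

# Consumer: drain in bounded chunks instead of one huge LRANGE.
chunk = [json.loads(x) for x in r.lrange("CACHE", 0, 999)]
r.ltrim("CACHE", len(chunk), -1)  # drop only the elements just processed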

camel split big sql result in smaller chunks

Because of memory limitations I need to split a result from the sql component (List<Map<column,value>>) into smaller chunks (a few thousand rows each).
I know about
from(sql:...).split(body()).streaming().to(...)
and I also know
.split().tokenize("\n", 1000).streaming()
but the latter does not work with List<Map<>> and also returns a String.
Is there an out-of-the-box way to create those chunks? Or do I need to add a custom aggregator just behind the split? Or is there another way?
Edit
Additional info as requested by soilworker:
At the moment the sql endpoint is configured this way:
SqlEndpoint endpoint = context.getEndpoint("sql:select * from " + lookupTableName + "?dataSource=" + LOOK_UP_DS,
SqlEndpoint.class);
// returns complete result in one list instead of one exchange per line.
endpoint.getConsumerProperties().put("useIterator", false);
// poll interval
endpoint.getConsumerProperties().put("delay", LOOKUP_POLL_INTERVAL);
The route using this should poll once a day (we will add a CronScheduledRoutePolicy soon) and fetch a complete table (view). All the data is converted to CSV with a custom processor and sent via a custom component to proprietary software. The table has 5 columns (small strings) and around 20M entries.
I don't know if there is a memory issue. But I know that on my local machine 3GB isn't enough. Is there a way to approximate the memory footprint to know whether a certain amount of RAM would be enough?
thanks in advance
maxMessagesPerPoll will help you get the result in batches

What is the maximum value size you can store in redis?

Does anyone know the maximum value size you can store in Redis? I want to use Redis as a message queue with Celery to store some small documents that need to be processed by a worker on another server, and I want to make sure the documents aren't going to be too big.
I found one page with a reference to 1GB, but when I followed the link on the page for where they got that answer, the link wasn't valid anymore. Here is the link:
http://news.ycombinator.com/item?id=1182005
All string values are limited to 512 MiB. This is the size limit you probably care most about.
EDIT: Because keys in Redis are strings, the maximum key size is 512 MiB. The maximum number of keys is 2^32 - 1 = 4,294,967,295.
Values, on the other hand, can vary in size depending on their type. For aggregate data types (i.e. hash, list, set, and sorted set), the maximum value size is 512 MiB for each element, although the data structure itself can have up to 2^32 - 1 elements.
https://redis.io/topics/data-types
https://redis.io/topics/faq#what-is-the-maximum-number-of-keys-a-single-redis-instance-can-hold-and-what-is-the-max-number-of-elements-in-a-hash-list-set-sorted-set
http://groups.google.com/group/redis-db/browse_thread/thread/1c7e33fbc98734b3?fwc=2
This article about Redis memory usage can help you roughly determine how much memory your database would take.
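Separately from that article, you can also measure per-key memory directly; here is a minimal redis-py sketch using the MEMORY USAGE command (assumes Redis 4.0+; the key and value are illustrative):

import redis

r = redis.Redis()

r.set("doc:123", "x" * 1_000_000)  # illustrative ~1 MB value
# Approximate bytes consumed by the key and its value, including overhead.
print(r.memory_usage("doc:123"))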
It's on the order of the amount of RAM you have, at least, so unless you plan on putting multi-gigabyte objects in there I wouldn't worry. I've had sets that were hundreds of megabytes big without a problem, but I don't know the exact limits.
A string value can hold at most 512MB. But according to this link, the size can be increased.

combining open(Filename, {delayed_write, Size, Delay}) with index

How can I combine open(Filename, {delayed_write, Size, Delay}) with an index of where in the file to write the data?
I want to wait until I receive a certain amount of data and then write it to a position in the file.
Also, is {read_ahead, Size} the opposite of {delayed_write, Size, Delay}? I would like to read a certain amount of data and then send it.
Thanks
read_ahead is kind of the opposite of delayed_write, in the sense that reading is the opposite of writing.
If you want to read and send bigger chunks of memory, you don't need read_ahead; just read big chunks and send them (there aren't many OS calls to save here).
From the file:open/2 manpage on read_ahead:
If read/2 calls are for sizes not significantly less than, or even greater than Size bytes, no performance gain can be expected.
You don't need to specify an index when opening. Just use pwrite/3, or a combination of position/2 and write/2.
But writing to different positions in the file might reduce the gain of delayed_write, since (also from the manpage of file:open/2):
The buffered data is also flushed before some other file operation than write/2 is executed.
If you have chunks of data for several positions, collect them in a list of {Location, Bytes} and from time to time write them with file:pwrite/2 all in one go. This can map to a very efficient writev(2) system call that writes several chunks in one go.