Efficient way to store a 100GB dataset in Redis

I have a 100GB dataset whose rows look like the format below:
cookie,iplong1,iplong2,...,iplongN
I am currently trying to fit this data into Redis as a sorted set data structure. I would also need to set a TTL for each of those IPs. To implement a per-element TTL inside each set, I will probably give every element a score equal to an epoch time, and then write a separate script that parses the scores and removes expired IPs as applicable. With that said, I am also noticing that it takes almost 100GB of memory to hold this 100GB dataset. I was wondering if there is any other way of packing this data into Redis more efficiently, with a minimal memory footprint.
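To make the approach concrete, here is a rough sketch of what I have in mind, assuming the redis-py client; the key naming and TTL value are only placeholders.

import time
import redis

r = redis.Redis()  # assumes a local Redis instance

def add_ips(cookie, iplongs, ttl_seconds=86400):
    # Score each IP with its expiry epoch so a cleanup pass can find stale entries.
    expiry = int(time.time()) + ttl_seconds
    r.zadd("cookie:%s" % cookie, {str(ip): expiry for ip in iplongs})

def purge_expired(cookie):
    # Remove every member whose expiry score is already in the past.
    r.zremrangebyscore("cookie:%s" % cookie, 0, int(time.time()))

The separate cleanup script would just call purge_expired over the keys on a schedule; it automates the expiry idea but does not change the memory math.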
I am also happy to hear about any other tech stack out there that can handle this better. The dataset will be updated frequently from hourly logs, and the expectation is that we should be able to read from it quickly and concurrently.
Thanks in advance.

Related

How to use chronicle-map instead of redis as a data cache

I intend to use chronicle-map instead of Redis. The application scenario is that a memoryData module loads hundreds of millions of records from the database into chronicle-map every day, and dozens of JVMs continuously read the chronicle-map records. Each JVM has hundreds of threads. But probably because of my lack of understanding of chronicle-map, the code performs poorly and keeps getting slower until memory overflows. I wonder whether the above practice is the correct use of chronicle-map.
Because Chronicle Map stores your data off-heap, it is able to store more data than you can hold in main memory, but it will perform better if all the data fits into memory (so if possible, consider increasing your machine's memory; if that is not possible, try to use an SSD drive). Another reason for poor performance may be how you have sized the map in the Chronicle Map builder, for example how you have set the maximum number of entries: if this is too large, it will affect performance.

How scalable is Redis as a time-series database?

I am using Redis as a time-series database. I am importing MySQL data into Redis, reshaping it into score/value pairs so that it fits into sorted sets. I have 26 tables, and at some point each table can grow to 100 million records.
Is it okay to store that much data in Redis, given that Redis keeps data in memory?
Is there a chance of Redis crashing? If so, how often will it crash?
Is it okay to use Redis for my task?
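For reference, a rough sketch of the kind of score/value import described above, assuming the redis-py client and a hypothetical (timestamp, value) row layout pulled from MySQL:

import redis

r = redis.Redis()

def import_rows(table_name, rows):
    # rows: iterable of (timestamp, value) tuples from MySQL.
    # The timestamp becomes the sorted-set score, so time-range reads can use ZRANGEBYSCORE.
    pipe = r.pipeline(transaction=False)
    for ts, value in rows:
        pipe.zadd("ts:%s" % table_name, {value: ts})
    pipe.execute()

Pipelining the ZADD calls keeps the import from paying one network round trip per record.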
You should ask yourself how you intend to query your data. Will you access single values or do scans?
Depending on your answers, a more specialized solution might be a better fit for your problem:
Warp 10 (disclaimer: I help build it)
InfluxDB
KairosDB
OpenTSDB

How can I quickly insert test data into BigQuery?

Inserting large amounts of test data into BigQuery can be slow, especially if the exact details of the data aren't important and you just want to test the performance of a particular shape of query/data.
What's the best way to achieve this without waiting around for many GB of data to upload to GCS?
In general, I'd recommend testing over small amounts of data (to save money and time).
If you really need large amounts of test data, there are several options.
If you care about the exact structure of the data:
You can upload data to GCS in parallel (if a slow single transfer is the bottleneck).
You could create a short-lived Compute Engine VM and use it to insert test data into GCS (which is likely to provide higher throughput than over your local link). This is somewhat involved, but gives you a very fast path for inserting data generated on-the-fly by a script.
If you just want to try out the capabilities of the platform, there are a number of public datasets available for experimentation. See:
https://cloud.google.com/bigquery/docs/sample-tables
If you just need a large amount of data and duplicate rows are acceptable:
You can insert a moderate amount of data via upload to GCS. Then duplicate it by querying the table and appending the result to the original. You can also use the bq command line tool with copy and the --append flag to achieve a similar result without being charged for a query.
This method has a bit of a caveat -- to get performance similar to typical production usage, you'll want to load your data in reasonably large chunks. For a 400GB use case, I'd consider starting with 250MB - 1GB of data in a single import. Many tiny insert operations will slow things down (and are better handled via the streaming API, which does the appropriate batching for you).
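As a rough sketch of the duplicate-and-append idea, this is what it could look like with the google-cloud-bigquery Python client instead of the bq tool mentioned above; the project, dataset, and table names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()
seed = "my-project.my_dataset.seed_rows"    # hypothetical ~1GB seed table
target = "my-project.my_dataset.test_data"  # hypothetical table to grow

# Copy jobs are free, so repeatedly appending the seed table grows the
# test table without incurring query charges.
job_config = bigquery.CopyJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND)
for _ in range(10):
    client.copy_table(seed, target, job_config=job_config).result()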

RRDtool: what use are multiple RRAs?

I'm trying to implement rrdtool. I've read the various tutorials and got my first database up and running. However, there is something that I don't understand.
What eludes me is why so many of the examples I come across instruct me to create multiple RRAs?
Allow me to explain: Let's say I have a sensor that I wish to monitor. I will want to ultimately see graphs of the sensor data on an hourly, daily, weekly and monthly basis and one that spans (I'm still on the fence on this one) about 1.5 yrs (for visualising seasonal influences).
Now, why would I want to create an RRA for each of these views? Why not just create a database like this (stepsize=300 seconds):
DS:sensor:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:160000
If I understand correctly, I can then create any graph I desire, for any given period with whatever resolution I need.
What would be the use of all the other RRAs people tell me I need to define?
BTW: I can imagine that in the past this would have been helpful when computing power was scarcer. Nowadays, with fast disks, high-speed interfaces and powerful CPUs, I guess you don't need the kind of pre-processing that RRAs seem to be designed for.
EDIT:
I'm aware of this page. Although it explains consolidation very clearly, it is my understanding that rrdtool graph can do this consolidation as well, at the moment the data is graphed. There still appears (to me) to be no added value in "harvest-time consolidation".
Each RRA is a pre-consolidated set of data points at a specific resolution. This performs two important functions.
Firstly, it saves on disk space. If you are interested in high-detail graphs for the last 24h, but only low-detail graphs for the last year, then you do not need to keep the high-detail data for a whole year -- consolidated data will be sufficient. In this way, you can minimise the amount of storage required to hold the data for graph generation (although of course you lose the detail, so you can't access it should you want to). Yes, disk is cheap, but if you have a lot of metrics and are keeping low-resolution data for a long time, this can be a surprisingly large amount of space (in our case, it would be in the hundreds of GB).
Secondly, it means that the consolidation work is moved from graphing time to update time. RRDtool generates graphs very quickly because most of the calculation work is already done in the RRAs at update time, provided there is an RRA of the required configuration. If there is no RRA available at the correct resolution, then RRDtool will perform the consolidation on the fly from a higher-granularity RRA, but this takes time and CPU. RRDtool graphs are usually generated on the fly by CGI scripts, so this is important, particularly if you expect a large number of queries coming in. In your example, using a single 5-min RRA to make a 1.5-yr graph (where 1 pixel would be about 1 day), you would need to read and process 288 times more data to generate the graph than if you had a 1-day-granularity RRA available!
In short, yes, you could have a single RRA and let the graphing work harder. If your particular implementation needs faster updates and doesn't care about slower graph generation, and you need to keep the detailed data for the entire time, then maybe this is a solution for you, and RRDtool can be used in this way. Usually, however, people optimise for graph generation and disk space, which means using tiered sets of RRAs with decreasing granularity.
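To illustrate what such a tiered setup might look like (the step and retention values here are only an example, not a recommendation):

rrdtool create sensor.rrd --step 300 \
 DS:sensor:GAUGE:600:U:U \
 RRA:AVERAGE:0.5:1:2016 \
 RRA:AVERAGE:0.5:6:2880 \
 RRA:AVERAGE:0.5:24:1460 \
 RRA:AVERAGE:0.5:288:550

This keeps 5-minute samples for one week, 30-minute averages for two months, 2-hour averages for roughly four months, and daily averages for about 1.5 years, so each of the hourly/daily/weekly/monthly/seasonal graphs has an RRA close to its natural resolution.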

boto dynamodb: is there a way to optimize batch writing?

I am indexing large amounts of data into DynamoDB and experimenting with batch writing to increase actual throughput (i.e. make indexing faster). Here's a block of code (this is the original source):
def do_batch_write(items, conn, table):
    batch_list = conn.new_batch_write_list()
    batch_list.add_batch(table, puts=items)
    while True:
        response = conn.batch_write_item(batch_list)
        unprocessed = response.get('UnprocessedItems', None)
        if not unprocessed:
            break
        # identify unprocessed items and retry batch writing
I am using boto version 2.8.0. I get an exception if items has more than 25 elements. Is there a way to increase this limit? Also, I noticed that sometimes, even if items is shorter, it cannot process all of them in a single try. But there does not seem to be any correlation between how often this happens, or how many elements are left unprocessed after a try, and the original length of items. Is there a way to avoid this and write everything in one try? The ultimate goal is to make processing faster, not just to avoid repeats, so sleeping for a long period between successive tries is not an option.
Thx
From the documentation:
"The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB."
The reason some items do not succeed is probably that you are exceeding the provisioned throughput of your table. Do you have other write operations being performed on the table at the same time? Have you tried increasing the write throughput on your table to see if more items are processed?
I'm not aware of any way of increasing the limit of 25 items per request but you could try asking on the AWS Forums or through your support channel.
I think the best way to get maximum throughput is to increase the write capacity units as high as you can and to parallelize the batch write operations across several threads or processes.
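As a rough sketch of that last suggestion, using the newer boto3 API rather than the boto 2.x API from the question (the table name and chunk sizes are assumptions):

import boto3
from concurrent.futures import ThreadPoolExecutor

TABLE_NAME = "my-index-table"  # hypothetical table name

def write_chunk(chunk):
    # Give each thread its own resource, since boto3 resources are not thread-safe.
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    # batch_writer() groups items into 25-item BatchWriteItem calls
    # and automatically resubmits any unprocessed items.
    with table.batch_writer() as batch:
        for item in chunk:
            batch.put_item(Item=item)

def parallel_write(items, chunk_size=500, workers=8):
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(write_chunk, chunks))

Even with this, total throughput is still capped by the table's write capacity units, which is the point made above.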
From my experience, there is little to be gained in trying to optimize your write throughput using either batch write or multithreading. Batch write saves a little network time, and multithreading saves close to nothing as the item size limitation is quite low and the bottleneck is very often DDB throttling your request.
So (like it or not) increasing your Write Capacity in DynamoDB is the way to go.
Also, as garnaat said, latency from inside the region is often very different (think 15ms versus 250ms) from inter-region or outside-AWS latency.
Increasing the Write Capacity alone will not necessarily make it faster.
If your hash key diversity is poor, then even if you increase your write capacity you can still get throughput errors.
Throughput errors depend on your hit map (access pattern).
Example: if your hash key is a number between 1 and 10, and you have 10 records spread across hash values 1-10 but 10k records with the value 10, then you will get many throughput errors even after increasing your write capacity.