I've recently been getting involved in implementing an Aerospike data store in our product. We've been trying to work out the best configuration for our namespace. The requirement to persist data means we need a storage-engine of type device, and we have set data-in-memory to true.
My question is: does data-in-memory attempt to load ALL the backing store data into memory, as the vague description in the documentation implies?
"Keep a copy of all data in memory always."
Or will it pay attention to the memory-size setting on the namespace and only load a memory-size amount of data from the backing store?
I asked the person who first implemented Aerospike here, but he wasn't sure either, so I'm looking for clarification.
For reference, my namespace config looks something like this, with an obviously smaller memory quota than backing store:
namespace Test {
    replication-factor 2
    memory-size 4G
    default-ttl 0
    storage-engine device {
        file /opt/aerospike/data/Test.dat
        filesize 16G
        data-in-memory true
    }
}
It will keep all the data in memory. Aerospike does not yet have a partial cache implementation that keeps only the most used data in the provided memory.
Your data will be served entirely from memory, while the disk is used only for persistence, so that the data can be recovered after a server restart. The reason the filesize is larger than the memory-size is that disk space is needed for maintenance operations such as defragmentation of blocks. Disk devices are block devices, and with the default 1MB write-block-size a single block can hold multiple records. Defrag works by moving records out of blocks that are less than defrag-lwm-pct full. This takes extra blocks, so you need that spare capacity.
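To make the defrag reasoning above concrete, here is a minimal sketch of how blocks become defrag candidates. It is an illustration only, not Aerospike's implementation: the Block class, both function names, and the sample fill levels are hypothetical, and the threshold of 50 is an assumed value for defrag-lwm-pct.

```python
from dataclasses import dataclass

WRITE_BLOCK_SIZE = 1024 * 1024  # 1 MB, the default write-block-size mentioned above


@dataclass
class Block:
    """Hypothetical stand-in for one write block on the device."""
    used_bytes: int  # bytes still occupied by live records


def blocks_in_file(filesize_bytes: int, block_size: int = WRITE_BLOCK_SIZE) -> int:
    """Number of whole write blocks that fit in the persistence file."""
    return filesize_bytes // block_size


def defrag_candidates(blocks, lwm_pct: int = 50, block_size: int = WRITE_BLOCK_SIZE):
    """Indexes of blocks filled below lwm_pct percent (assumed threshold of 50)."""
    threshold = block_size * lwm_pct / 100
    return [i for i, block in enumerate(blocks) if block.used_bytes < threshold]


# The 16G filesize from the question, split into 1 MB blocks.
total_blocks = blocks_in_file(16 * 1024**3)

# Made-up fill levels, only to show which blocks fall under the threshold.
sample_blocks = [Block(900_000), Block(300_000), Block(524_288), Block(100_000)]
candidates = defrag_candidates(sample_blocks)

print(total_blocks, candidates)
```

The live records of each candidate block have to be rewritten into a fresh block before the old one can be reused, which is why the file needs headroom beyond the size of the data itself.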
Related
How can I limit the maximum size on disk when using Ignite Persistence? For example, my data set in a database is 5TB. I want to cache a maximum of 50GB of data in memory, with no more than 500GB on disk. Any reasonable eviction policy like LRU for on-disk data will work for me. The maxSize parameter controls the in-memory size, and I will set it to 50GB. What should I do to limit my disk storage to 500GB? I am looking for something like maxPersistentSize and cannot find it.
Thank you
There is no direct parameter to limit the total disk space occupied by the data itself. As you mentioned in the question, you can control the in-memory region allocation. When a data region is full, data pages are flushed to disk and loaded back on demand; this process is called page replacement.
Page eviction, on the other hand, works only for a non-persistent cluster, where it prevents the cluster from running out of memory. Personally, I can't see how and why such eviction would be implemented for data stored on disk. I'm almost sure that other "normal" DBs like Postgres or MySQL do not have this option either.
I suppose you might check the following options:
You can limit the maximum sizes of the WAL and the WAL archive. Although these are auxiliary files, they can still occupy a lot of space [1]
Check whether you can use expiry policies on your data items; in that case, expired data is cleared from disk as well [2]
Use monitoring metrics and configure alerting to be aware if you are close to the disk limits.
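For the monitoring option, a minimal sketch of a disk-usage check is shown below. It uses only the Python standard library and does not touch any Ignite API or metric; the function names, the 80% threshold, and the path are assumptions, so point it at the directory where your persistence files actually live.

```python
import shutil


def disk_usage_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def check_disk(path: str, warn_at: float = 0.80) -> str:
    """Return 'warn' once usage reaches warn_at (assumed 80% default), else 'ok'."""
    return "warn" if disk_usage_fraction(path) >= warn_at else "ok"


if __name__ == "__main__":
    # "." is a placeholder for the persistence directory (hypothetical path).
    print(check_disk("."))
```

Run on a schedule (cron, a systemd timer, or your existing monitoring agent), a check like this gives you time to add capacity or expire data before the disk fills up.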
I want to understand when a cache is created with native persistence enabled, will it store the data in the defined data region/RAM and in the disk at the same time? Is there any way I can restrict the disk utilization for storing the data?
Additionally, in a cluster of 3 nodes, if the disk of one node fills up for any reason and there is not enough memory available, what will be the impact on the cluster?
Yes, data will be stored both in RAM and on the disk. It does not all have to fit in RAM at the same time.
If you run out of disk space, your persistent store will likely be corrupted.
I was using this page for capacity planning, but I find it ambiguous:
The very first section, "Data Storage Required", says in its last paragraph: "Data can be stored in RAM or on flash storage (SSD)". Does that mean the calculations above it are relevant for both RAM and SSD?
I ask because further down there is another section called "For Data", which states that "If a namespace is configured to store data in memory, the RAM requirement can be calculated as the sum of:" and gives different numbers from the first section.
Assuming that I want to keep all my data in RAM - which section is relevant to me? Could anyone suggest?
Thanks in advance
Aerospike is a database that has very flexible storage options for its namespaces. Each namespace defines its own storage.
For data stored in memory you have two options:
In-memory without persistence (essentially a Redis-like cache, but on a distributed data store)
In-memory with persistence to either a file or a raw device.
To do capacity planning for the first case (in-memory no persistence) you would look at the index memory required - 64B per-object if you're not using the optional secondary indexes. To that you'd add the in-memory storage cost. Mind you, if you declared the namespace to be single-bin too, it would save some of the overhead.
If you're using persistence, the memory requirement is the same as above, and the SSD/filesystem storage cost is calculated using the Data Storage Required section up top (as is the case for data-on-SSD).
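As a rough worked example of the in-memory calculation, the sketch below adds the 64B per-object index cost to an in-memory data cost. The function names and sample numbers are hypothetical. avg_record_bytes is a placeholder that you would compute from the "For Data" section of the capacity planning page, including its per-record overhead, and multiplying by the replication factor is an assumption that gives a cluster-wide total.

```python
INDEX_BYTES_PER_OBJECT = 64  # primary index cost per object, as stated above


def index_memory_bytes(objects: int, replication_factor: int = 1) -> int:
    """Cluster-wide primary index memory, without secondary indexes."""
    return objects * replication_factor * INDEX_BYTES_PER_OBJECT


def data_memory_bytes(objects: int, avg_record_bytes: int,
                      replication_factor: int = 1) -> int:
    """Cluster-wide in-memory data cost for a given average record size."""
    return objects * replication_factor * avg_record_bytes


# Hypothetical workload: 10 million objects, 200 bytes each, replication-factor 2.
objects = 10_000_000
replication_factor = 2

index_bytes = index_memory_bytes(objects, replication_factor)
data_bytes = data_memory_bytes(objects, 200, replication_factor)
total_bytes = index_bytes + data_bytes

print(f"index: {index_bytes} B, data: {data_bytes} B, total: {total_bytes} B")
```

Divide the total by the number of nodes to get a per-node figure, and leave headroom on top of it rather than sizing memory-size to the exact result.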
I have faced a strange issue.
My Aerospike data has suddenly been erased, even though I have not executed any command to delete data from Aerospike.
namespace test {
    replication-factor 2
    memory-size 4G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    storage-engine memory
}
I haven't configured the TTL here myself, but a few days ago I ran a UDF to set the TTL of all the records to -1 so that they never expire. The sets were also being updated periodically, so even without that the records should not have expired after 30 days. I lost everything at once, which should not happen.
I have been stuck on this for 2 days. Any help is appreciated.
You're using a namespace that is basically defined to be a cache. It is in-memory with no persistence. For example, a restart of the node will cause the namespace to start empty.
The Namespace Storage Configuration article in the deployment guide gives recipes for storage engine configuration. You can set the storage of a specific namespace to be one of the following:
Data stored on SSD
Data stored on a filesystem (not recommended for production)
Data stored in-memory with persistence to an SSD
Data stored in-memory with persistence on a filesystem
Data stored in-memory with no persistence
There is a special case of data in-memory for counters, data-in-index. This is done with persistence.
Is there a way to set Redis so it will never evict data and will cause a hard failure if it runs out of memory? I need to ensure no data is lost. I am not using this as a permanent data storage mechanism, but rather as a temporary data storage mechanism for high volume/high performance data transformations.
Is there an alternative NoSQL data store that could come close in performance, but use disk reads/writes if memory runs out? This is not ideal, but it is better than losing data. I am reading/writing/updating millions of JSON documents (12+ million and growing).
Yes.
First make sure that you set the maxmemory directive (in the conf file or with a CONFIG SET) to a value other than 0. This will instruct Redis to use that value as its upper memory limit.
Next, set the maxmemory-policy directive to noeviction. This will cause Redis to return an OOM (out of memory) error when a write is attempted after maxmemory has been reached.
See the config file in-file documentation for more details on these directives: http://download.redis.io/redis-stable/redis.conf
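As a small sanity check for the two directives above, the following sketch scans redis.conf-style text and reports whether maxmemory and maxmemory-policy are set as described. It is a simplified reader, not Redis's own config parser: the function names are hypothetical, the 4gb value in the sample is only an example, and a missing directive is deliberately reported as a problem so that both settings are explicit.

```python
def parse_directives(conf_text: str) -> dict:
    """Collect 'name value' lines, skipping blank lines and # comments."""
    directives = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        directives[parts[0].lower()] = value
    return directives


def eviction_problems(conf_text: str) -> list:
    """Return a list of reasons why the config may still evict or be unbounded."""
    directives = parse_directives(conf_text)
    problems = []
    if directives.get("maxmemory", "0") == "0":
        problems.append("maxmemory is unset or 0")
    if directives.get("maxmemory-policy") != "noeviction":
        problems.append("maxmemory-policy is not noeviction")
    return problems


# Example fragment; the 4gb limit is an arbitrary illustrative value.
sample_conf = """
# memory limit and eviction policy
maxmemory 4gb
maxmemory-policy noeviction
"""

print(eviction_problems(sample_conf))
```

With both directives in place, the application has to be ready to handle the OOM error on writes, for example by retrying after older temporary data has been removed.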