Aerospike - Data in RAM capacity planning - aerospike

I was using this page for capacity planning, but I find it ambiguous:
The very first section, "Data Storage Required", says in its last paragraph: "Data can be stored in RAM or on flash storage (SSD)". Does that mean the calculations above it apply to both RAM and SSD?
Further down there is another section called "For Data", which states that "If a namespace is configured to store data in memory, the RAM requirement can be calculated as the sum of:" and gives different numbers from the first section.
Assuming that I want to keep all my data in RAM, which section is relevant to me? Could anyone advise?
Thanks in advance

Aerospike is a database that has very flexible storage options for its namespaces. Each namespace defines its own storage.
For data stored in memory you have two options:
In-memory without persistence (essentially a Redis-like cache, but on a distributed data store)
In-memory with persistence to either a file or a raw device.
To do capacity planning for the first case (in-memory no persistence) you would look at the index memory required - 64B per-object if you're not using the optional secondary indexes. To that you'd add the in-memory storage cost. Mind you, if you declared the namespace to be single-bin too, it would save some of the overhead.
If you're using persistence, the memory requirement is the same as above, and the SSD/filesystem storage cost is calculated using the Data Storage Required section up top (as is the case for data-on-SSD).
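As a rough illustration of the arithmetic for the in-memory case, here is a minimal sketch. The 64 bytes per object comes from the answer above; the function name, the replication factor parameter, and the average in-memory record size are assumptions for illustration, and the real per-record data overhead should be taken from the capacity planning page.

```python
# Rough RAM estimate for an in-memory namespace (cluster-wide).
# Assumption: avg_record_bytes already includes whatever per-record and
# per-bin overhead the capacity planning page specifies for data in memory.

INDEX_BYTES_PER_RECORD = 64  # primary index cost per object, as stated above


def in_memory_namespace_ram(records, avg_record_bytes, replication_factor=2):
    """Return (index_bytes, data_bytes, total_bytes) for the whole cluster."""
    copies = records * replication_factor  # every replica is indexed and stored
    index_bytes = copies * INDEX_BYTES_PER_RECORD
    data_bytes = copies * avg_record_bytes
    return index_bytes, data_bytes, index_bytes + data_bytes


if __name__ == "__main__":
    index_b, data_b, total_b = in_memory_namespace_ram(100_000_000, 200)
    gib = 2 ** 30
    print(f"index {index_b / gib:.1f} GiB, data {data_b / gib:.1f} GiB, "
          f"total {total_b / gib:.1f} GiB")
```

Secondary indexes, if used, would add to this and are not covered by the sketch.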

Related

For Ignite Persistence, how to control maximum data size on disk?

How can I limit maximum size on disk when using Ignite Persistence? For example, my data set in a database is 5TB. I want to cache maximum of 50GB of data in memory with no more than 500GB on disk. Any reasonable eviction policy like LRU for on-disk data will work for me. Parameter maxSize controls in-memory size and I will set it to 50GB. What should I do to limit my disk storage to 500GB then? Looking for something like maxPersistentSize and not finding it there.
Thank you
There is no direct parameter to limit the total disk space occupied by the data itself. As you mentioned in the question, you can control in-memory region allocation. When a data region is full, data pages are flushed to disk and loaded back on demand; this process is called page replacement.
Page eviction, on the other hand, works only for a non-persistent cluster and prevents it from running out of memory. Personally, I can't see how and why that eviction might be implemented for the data stored on disk. I'm almost sure that other "normal" DBs like Postgres or MySQL do not have this option either.
I suppose you might check the following options:
You can limit the maximum sizes of the WAL and the WAL archive. Although these are auxiliary files, they can still occupy a lot of space [1]
Check if it's possible to use expiry policies on your data items, in this case, data will be cleared from disk as well [2]
Use monitoring metrics and configure alerting to be aware if you are close to the disk limits.
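For the last option, a minimal sketch of a size check you could run from a scheduler might look like the following. It is not an Ignite API: it just walks a directory and compares the total against a budget. The function names, the warning ratio, and the idea of pointing it at your persistence directory are all assumptions for illustration.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file disappeared between listing and stat; skip it
    return total


def over_budget(root, budget_bytes, warn_ratio=0.9):
    """Return (should_alert, used_bytes) for the given disk budget."""
    used = directory_size_bytes(root)
    return used >= budget_bytes * warn_ratio, used
```

With a 500GB budget you would call something like `over_budget(storage_dir, 500 * 10**9)`, where `storage_dir` is a placeholder for wherever your node keeps its persistence files, and raise an alert when the first element is true.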

Cons of using MemoryCache as a temporary copy of DB table

I have a site where you can list your car for sale. There is a list and a map with filtering on car types and other car specifications. My idea was to cache cars table and use that to filter on when user is searching for a car on the website. Currently, especially when zooming in/out on the map, each time user does that, http request is made and it's querying the database, and that can be slow and heavy on the server.
As an experiment with 1 000 items, I have cached the map data (trimmed data with only basic info) and it's working fine. I was thinking of keeping what is basically a copy of the cars table, with all the needed joins applied, in Memory Cache and using that instead of querying the DB on every request for both the list and the map. I would have a cron job run every 5 minutes (data can change, but the update doesn't have to be immediate) to refresh Memory Cache with the latest cars data from the DB.
What would be the cons of using this approach in long term and for using it for example storing 100 000 records? Beside server needing more RAM, would there be any concerns about scalability or usability of this approach? Would it be better to use Redis instead?
I do have in place now "search as you type" service, but I don't really need that functionality as filtering is pretty exact, I have added it more as a caching server but I think I would be better off just using Memory Cache until a real need for that kind of service is required.
Thank you
Since memory isn’t infinite, we need to limit the number of items stored in the In-Memory cache.
MemoryCache VS Redis
MemoryCache
MemoryCache is embedded in the process, hence can only be used as a plain key-value store from that process.
Redis
Redis is a remote data structure server. It is certainly slower than just storing the data in local memory.
In short, MemoryCache runs inside the web server process of the current application, so it is limited by that server's resources. On the same hardware it will be very fast. The main disadvantage is that the cached data cannot be shared with other applications.
With Redis, reads go to a remote server and so are not as fast as MemoryCache, but Redis offers high reliability and high scalability.
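The approach from the question (an in-process copy refreshed about every 5 minutes) can be sketched roughly as below. This is a language-neutral illustration in Python, not the MemoryCache API; `SnapshotCache`, the `loader` callback, and the injectable `clock` are hypothetical names introduced here.

```python
import time


class SnapshotCache:
    """Keeps one in-process copy of a data set and reloads it when stale."""

    def __init__(self, loader, max_age_seconds=300, clock=time.monotonic):
        self._loader = loader          # callable that queries the database
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._snapshot = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at >= self._max_age
        if stale:
            self._snapshot = self._loader()  # one DB hit per refresh window
            self._loaded_at = now
        return self._snapshot
```

Requests would then filter the list returned by `get()` in memory instead of hitting the database. Note that each web server process holds its own copy, which is the sharing limitation mentioned above.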
Related Post:
1. How to update redis after updating database?
2. how to keep caching up to date
3. How can MySQL update data in real time in redis cache?

Foundry Data Storage Optimization

Hi I have a general question about pipelines optimization in order to lower storage space.
Does deleting trashed datasets help alleviate disk storage? Ex. Remove obsolete datasets: a.) based on business knowledge and utilization and b.) datasets in the trash.
Also, We'd like to manage the copies of datasets that are stored when a schedule runs. We believe that if we ever had to fall back to a previous version, we only need to reference the latest one, as opposed to keeping multiple copies.
Does this affect storage? And is there a way to manage configuration on this?
Deleting trashed datasets (in typical setups) will result in their underlying files being deleted, but typically a larger driver of storage consumption is the set of previous dataset views kept by default.
You can control the length of time these files and views are kept using the Foundry Retention service. I'd recommend you consult with platform documentation and your support team for configuration of this service.
Retention will compute and mark files matching your configuration for deletion and periodically delete them, thus reducing your storage consumption.

Choose to store table exclusively on disk in Apache Ignite

I understand that the native persistence mode of Apache Ignite keeps as much data as possible in memory, with any remaining data stored on disk.
Is it possible to manually choose which table I want to store in memory and which I want to store EXCLUSIVELY on disk? If I want to save costs, should I just give Ignite a lot of disk space and just a small amount of memory? What if I know some tables should return results as fast as possible while other tables have lower priorities in terms of speed (even if they are accessed more often)? Is there any feature to prioritize data storage into memory at table level or any other level?
You can define two different data regions: one with a small amount of memory and persistence enabled, and a second one without persistence but with a bigger max memory size: https://apacheignite.readme.io/docs/memory-configuration
You can't have a cache (which contains rows for a table) to be stored exclusively on disk.
When you add a row to a table it gets stored in Durable Memory, which is always located in RAM. Later it may be flushed to disk by the checkpointing process, which uses the checkpoint page buffer, which is also in RAM. So you can have a separate region with low memory usage (see the other answer), but you can't have data exclusively on disk.
When you access data it will always be pulled from disk to Durable Memory, too.

How to limit the RAM used by multiple embedded HSQLDB DB instances as a whole?

Given:
HSQLDB embedded
50 distinct databases (I have 50 different data sources)
All the databases are of the file:/ kind
All the tables are CACHED
The total amount of RAM that all the embedded DB instances combined may use is limited and is given at startup of the Java process.
The LOG file is disabled (no need to recover upon crash)
My understanding is that the RAM used by a single DB instance is comprised of the following pieces:
The cache of all the tables (all my tables are CACHED)
The DB instance internal state
Also, as far as I can see I have these two properties to control the total size of the cache of a single DB instance:
SET FILES CACHE SIZE
SET FILES CACHE ROWS
However, they control only the cache part of the RAM used by a DB instance. Plus, they are per DB instance, whereas I would like to limit all the instances as a whole.
So, I wonder whether it is possible to instruct HSQLDB to stay within the specified amount of RAM in total including all the DB instances?
You can only limit the CACHE memory use per database instance. Each instance is independent of the other.
You can reduce the CACHE SIZE and CACHE ROWS per database to suit your application.
HSQLDB does not use a lot of other memory, but when it does, it uses the memory of the JVM, which is shared among the different database instances.
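Since the limit is per instance, one workaround is to divide the overall budget yourself and issue the two statements to each database. Below is a minimal sketch of that arithmetic. It assumes CACHE SIZE is expressed in kilobytes (please verify against the HSQLDB guide for your version), and the helper names, the reserve fraction, and the average row size are assumptions for illustration. Actual memory use can exceed the configured cache size, so leave headroom.

```python
def per_instance_cache_kb(total_budget_mb, instances, reserve_fraction=0.25):
    """Split a total RAM budget evenly, holding back a reserve for other JVM use."""
    usable_kb = int(total_budget_mb * 1024 * (1 - reserve_fraction))
    return usable_kb // instances


def cache_statements(total_budget_mb, instances, avg_row_bytes,
                     reserve_fraction=0.25):
    """Build the per-database statements to run on each instance."""
    size_kb = per_instance_cache_kb(total_budget_mb, instances, reserve_fraction)
    # Assumption: cap the row count so rows * avg_row_bytes fits in size_kb.
    rows = max(1, size_kb * 1024 // avg_row_bytes)
    return [
        f"SET FILES CACHE SIZE {size_kb}",
        f"SET FILES CACHE ROWS {rows}",
    ]
```

For example, `cache_statements(2000, 50, 512)` splits a 2000 MB budget over 50 databases with a 25% reserve. This only bounds the cache part; it does not make the instances share one pool.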