How to limit the RAM used by multiple embedded HSQLDB DB instances as a whole? - hsqldb

Given:
HSQLDB embedded
50 distinct databases (I have 50 different data sources)
All the databases are of the file:/ kind
All the tables are CACHED
The total amount of RAM that all the embedded DB instances combined are allowed to use is limited and fixed at startup of the Java process.
The LOG file is disabled (no need to recover upon crash)
My understanding is that the RAM used by a single DB instance consists of the following pieces:
The cache of all the tables (all my tables are CACHED)
The DB instance internal state
Also, as far as I can see I have these two properties to control the total size of the cache of a single DB instance:
SET FILES CACHE SIZE
SET FILES CACHE ROWS
However, they control only the cache part of the RAM used by a DB instance. Plus, they are per DB instance, whereas I would like to limit all the instances as a whole.
So, I wonder: is it possible to instruct HSQLDB to stay within a specified total amount of RAM across all the DB instances?

You can only limit the CACHE memory use per database instance. Each instance is independent of the others.
You can reduce the CACHE SIZE and CACHE ROWS per database to suit your application.
HSQLDB does not use a lot of other memory, but when it does, it uses the memory of the JVM, which is shared among the different database instances.
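As a minimal sketch of that per-database approach, assuming plain JDBC, hypothetical database paths, and a JVM started with a fixed heap (the 2 GB figure and the even split are illustrative assumptions, not HSQLDB recommendations):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqldbCacheLimits {
    public static void main(String[] args) throws Exception {
        // Hypothetical database paths; one file: database per data source.
        String[] dbPaths = { "data/source01", "data/source02" /* ... up to 50 */ };

        // Assumption: the JVM was started with, say, -Xmx2g shared by all
        // instances; splitting a ~2 GB budget evenly over 50 databases
        // leaves roughly 40 MB of table cache per database.
        for (String path : dbPaths) {
            try (Connection c = DriverManager.getConnection(
                    "jdbc:hsqldb:file:" + path, "SA", "");
                 Statement st = c.createStatement()) {
                // CACHE SIZE is in units of 1024 bytes, so 40000 is ~40 MB.
                st.execute("SET FILES CACHE SIZE 40000");
                // Also cap the row count; whichever limit is hit first wins.
                st.execute("SET FILES CACHE ROWS 50000");
                st.execute("SHUTDOWN");
            }
        }
    }
}

Since SET FILES settings are saved with each database, this only needs to run once per database. Note that the result is a budget rather than a hard limit: each instance still manages its cache independently inside the shared JVM heap.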

Related

How does Spark SQL access databases

Suppose you access a SQL database with Spark SQL.
With RDDs, Spark partitions the data into many different parts that together make up the data set.
My question is how Spark SQL manages this access from the N nodes to the database. I can see several possibilities:
Each node of the RDD accesses the database and builds up its own part. The advantage is that the nodes are not forced to allocate a lot of memory, but the database has to sustain N connections, with N potentially very large.
A single node accesses the data and sends it to the other N-1 nodes as required. The problem is that this single node would need to hold all the data, which is unworkable in many cases. Possibly this can be alleviated by fetching the data in chunks.
The JDBC package uses pooled connections in order to avoid connecting again and again. But this does not address the problem.
What would be a reference explaining how Spark manages this access to a SQL database? How much of it can be parameterized?
This is documented in quite some detail on the JDBC To Other Databases data sources documentation page.
In short, each node involved in the job will establish a connection to the database. However, it isn't the node count that determines the number of connections, but the configured number of partitions:
numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
And regarding
The JDBC package uses pooled connections in order to avoid connecting again and again. But this does not address the problem.
JDBC drivers don't implicitly pool connections; it is the application that sets up, configures, and uses a connection pool. And since Spark doesn't need to connect to the database repeatedly in order to fetch the data, it doesn't establish multiple connections per partition, so there would be no need to pool connections.
The linked documentation page has a list of options that applications can use to control how the data is fetched from the database.
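For illustration, a minimal Java sketch of such a partitioned read, assuming a hypothetical PostgreSQL source with a numeric id column to split on (URL, table, and credentials are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jdbc-read-sketch")
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://dbhost:5432/mydb") // placeholder
                .option("dbtable", "public.orders")                  // placeholder
                .option("user", "reader")
                .option("password", "secret")
                // Split the read on a numeric column: Spark issues one query
                // (and opens one connection) per partition, up to numPartitions.
                .option("partitionColumn", "id")
                .option("lowerBound", "1")
                .option("upperBound", "1000000")
                .option("numPartitions", "8")
                // Rows fetched per round trip within each partition's query.
                .option("fetchsize", "10000")
                .load();

        System.out.println(df.count());
        spark.stop();
    }
}

With these options the database sees at most 8 concurrent connections regardless of cluster size, each fetching one WHERE-bounded slice of the table.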

Choose to store table exclusively on disk in Apache Ignite

I understand that the native persistence mode of Apache Ignite stores as much data as possible in memory, with the rest spilling to disk.
Is it possible to manually choose which tables I want to store in memory and which I want to store EXCLUSIVELY on disk? If I want to save costs, should I just give Ignite a lot of disk space and only a small amount of memory? What if I know some tables should return results as fast as possible while other tables have lower priority in terms of speed (even if they are accessed more often)? Is there any feature to prioritize data storage in memory at the table level, or at any other level?
You can define two different data regions: one with a small amount of memory and persistence enabled, and a second one without persistence but with a bigger maximum memory size: https://apacheignite.readme.io/docs/memory-configuration
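For illustration, a minimal sketch of that two-region setup using the Java configuration API (the region names and sizes are illustrative assumptions):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class TwoRegionsSketch {
    public static void main(String[] args) {
        // Small persistent region for tables that should mostly live on disk.
        DataRegionConfiguration diskHeavy = new DataRegionConfiguration()
                .setName("disk_heavy")
                .setInitialSize(64L * 1024 * 1024)
                .setMaxSize(128L * 1024 * 1024)      // tight RAM cap
                .setPersistenceEnabled(true);

        // Bigger purely in-memory region for the hot tables.
        DataRegionConfiguration hot = new DataRegionConfiguration()
                .setName("hot")
                .setMaxSize(4L * 1024 * 1024 * 1024) // 4 GB
                .setPersistenceEnabled(false);

        DataStorageConfiguration storage = new DataStorageConfiguration()
                .setDataRegionConfigurations(diskHeavy, hot);

        IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(storage);

        try (Ignite ignite = Ignition.start(cfg)) {
            // A persistence-enabled cluster starts inactive and must be
            // activated before use.
            ignite.cluster().active(true);
            // Tables are then assigned to a region, e.g. in SQL DDL:
            // CREATE TABLE ... WITH "DATA_REGION=disk_heavy";
        }
    }
}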
You can't have a cache (which holds the rows of a table) stored exclusively on disk.
When you add a row to a table, it gets stored in Durable Memory, which is always located in RAM. Later it may be flushed to disk via the checkpointing process, which uses a checkpoint page buffer that is also in RAM. So you can have a separate region with low memory usage (see the other answer), but you can't have data exclusively on disk.
When you access data, it will likewise always be pulled from disk into Durable Memory first.

How can I dump a single Redis DB index?

Redis SAVE and BGSAVE commands dump the complete Redis data to a persistent file.
But is there a way to dump only one DB index?
I am using the same Redis server with multiple DB indices.
I use DB 0 for config; it is edited manually and contains just a small number of keys. I wish to dump it to a file as a (versioned) config snapshot, to keep track of manual changes in the prod environment.
The rest of the DBs hold a large number of items that would take too long to dump, and I don't need to back them up.
Redis' persistence scope is the entire instance, meaning all shared/numbered databases and all keys in them. Saving only a subset of these is not supported.
Instead, use two independent Redis instances and configure each to persist (or not) per your needs. The overhead of running an extra instance is a few megabytes, so it is practically negligible.
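If a second instance is not an option, the small config database can also be snapshotted by hand, since it only holds a few keys. A minimal sketch, assuming the Jedis 3.x client and placeholder host, port, and output file: it SCANs DB 0 and saves each key's DUMP payload, which RESTORE can later load back.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

import java.io.FileOutputStream;
import java.io.ObjectOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfigDbSnapshot {
    public static void main(String[] args) throws Exception {
        Map<String, byte[]> snapshot = new LinkedHashMap<>();
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.select(0); // the small, manually edited config database
            String cursor = ScanParams.SCAN_POINTER_START;
            do {
                ScanResult<String> page = jedis.scan(cursor);
                for (String key : page.getResult()) {
                    // DUMP returns Redis' internal serialization of the value;
                    // RESTORE can load it back into any instance later.
                    snapshot.put(key, jedis.dump(key));
                }
                cursor = page.getCursor();
            } while (!"0".equals(cursor));
        }
        try (ObjectOutputStream out = new ObjectOutputStream(
                new FileOutputStream("db0-snapshot.bin"))) {
            out.writeObject(snapshot); // version this file as needed
        }
    }
}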

Optimize tempdb log files

Have a dedicated DB server with 16 cores and 200+ GB of memory.
Use tempdb a lot but it typically stays under 4 GB
Recently added a dedicated SSD stripe for tempdb
Based on this page, I should create multiple files:
Optimizing tempdb Performance
I understand the multiple row (data) files.
Here is my question:
Should I also create multiple tempdb log files?
It does say "create one data file for each CPU".
So my thought is that "data" means row (not log) files.
No. As with all its databases, SQL Server can only use one log file at a time, so there is no benefit at all in having multiple log files.
The best things you can really do with log files are: keep them on separate drives from the data files, since they have different IO requirements; pre-size them so they don't have to autogrow; and, if they do have to autogrow, make sure the growth increment is sensible, to keep the number of virtual log files created inside them manageable.
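As a sketch of that pre-sizing advice applied to the tempdb log, assuming JDBC and placeholder connection details (templog is the default logical name of the tempdb log file; the sizes are illustrative, not recommendations):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PresizeTempdbLog {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; the statement needs ALTER
        // permission on tempdb (typically sysadmin).
        String url = "jdbc:sqlserver://dbhost;databaseName=master;"
                   + "user=sa;password=secret;encrypt=true;trustServerCertificate=true";
        try (Connection c = DriverManager.getConnection(url);
             Statement st = c.createStatement()) {
            // 'templog' is the default logical name of the tempdb log file.
            // Pre-size it, and use a fixed growth increment so that any
            // autogrowth that does happen creates a sane number of
            // virtual log files.
            st.execute("ALTER DATABASE tempdb MODIFY FILE "
                    + "(NAME = templog, SIZE = 8192MB, FILEGROWTH = 512MB)");
        }
    }
}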

Low MySQL Table Cache Hit Rate

I've been working on optimizing my site and databases, and I have been using mysqltuner.pl to help with this. I've gotten just about everything right except for the table cache hit rate: no matter how high I raise it in my.cnf, I still get about 0% (284 open / 79k opened).
My problem is that I don't really understand exactly what affects this so I don't really know what to look for in my queries/database structure to fix this.
The table cache defines the number of file descriptors that MySQL can hold open simultaneously. So the table cache hit rate is affected by how many tables you have relative to your limit, as well as by how frequently you re-reference tables (keeping in mind that it counts not just a single connection, but all simultaneous connections).
For instance, if your limit is 100 and you have 101 tables and you query each one in order, you'll never get any table cache hits. On the other hand, if you only have 1 table, you should generally get close to a 100% hit rate unless you run FLUSH TABLES a lot (as long as your table_cache is set higher than the typical number of simultaneous connections).
So for tuning, you need to look at how many distinct tables you might reference by one process/client and then look at how many simultaneous connections you might typically have.
Without more details, I can't guess whether your case is due to too many simultaneous connections or too many frequently referenced tables.
A cache is supposed to maintain copies of hot data, i.e. data that is used a lot. If you cannot retrieve data out of a certain cache, it means the DB has to go to disk to retrieve it.
--edit--
Sorry if the definition seemed a bit obnoxious. A specific cache often covers a lot of entities, and these are database specific; you first need to find out what is cached by the table cache.
--edit: some investigation --
OK, it seems (from the reply to this post) that MySQL uses the table cache for the data structures used to represent a table. These data structures also (via encapsulation, or by having duplicate table entries for each table) hold a set of file descriptors open for the data files on the file system. The MyISAM engine uses one per table and one per index; additionally, each active query element requires its own descriptors.
A file descriptor is a kernel entity used for file IO; it represents the low-level context of a particular file read or write.
I think you are either interpreting the values incorrectly, or they need to be interpreted differently in this context. 284 is the number of tables open at the instant you took the snapshot, and the second value represents the number of times a table was opened since you started MySQL.
I would hazard a guess that you need to take multiple snapshots of this reading and see whether the first value (active file descriptors at that instant) ever reaches your cache size capacity.
P.S. The kernel generally has an upper limit on the number of file descriptors it will allow each process to open, so you might need to raise this limit if it is too low.
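To take those repeated snapshots, something along these lines could poll the two status counters that the "284 open / 79k opened" figures come from. A minimal JDBC sketch with placeholder credentials (on older MySQL versions the controlling variable is table_cache, later renamed table_open_cache):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TableCachePoller {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials; any account that can run SHOW STATUS works.
        String url = "jdbc:mysql://localhost:3306/?user=monitor&password=secret";
        try (Connection c = DriverManager.getConnection(url);
             Statement st = c.createStatement()) {
            for (int i = 0; i < 12; i++) { // one minute of 5-second samples
                long open = statusValue(st, "Open_tables");     // open right now
                long opened = statusValue(st, "Opened_tables"); // opens since startup
                System.out.printf("open=%d opened_since_start=%d%n", open, opened);
                Thread.sleep(5000);
            }
        }
    }

    private static long statusValue(Statement st, String name) throws Exception {
        try (ResultSet rs = st.executeQuery(
                "SHOW GLOBAL STATUS LIKE '" + name + "'")) {
            rs.next();
            return rs.getLong(2); // column 1 is Variable_name, column 2 is Value
        }
    }
}

If Open_tables sits pinned at your cache limit while Opened_tables keeps climbing, the cache really is too small; if it stays well below the limit, something else (such as FLUSH TABLES or heavy temporary-table use) is closing tables, and raising the limit won't improve the ratio.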