Is there a standard on the number of Ignite caches per cluster?

I know that a larger number of caches opens more file descriptors and consumes more resources.
Is there a recommendation on the number of caches per Ignite/GridGain cluster?
Is there a recommendation on the number of caches vs. the number of nodes vs. the OS configuration (CPU, RAM)?
We have 45 caches (PARTITIONED), and the system configuration is 4 CPUs and 60 GB of RAM on each node. It is a 3-node cluster.
The current data storage size is 2 GB, and the data is expected to grow to 1.5-2 TB over the next year.
We are frequently getting the "Too many open files" error.

First of all, there's nothing wrong with increasing the file descriptor limit at the OS level. You can use the ulimit utility for that.
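For example, to raise the limit for the shell that starts the Ignite nodes (the value is only an illustration):
ulimit -n 65536
On most Linux distributions you would also raise the nofile limits in /etc/security/limits.conf to make the change permanent.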
Another option is to leverage cache groups, which make caches share some internal structures, including files.
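As a rough sketch in Java (the cache and group names here are made up), caches are assigned to a group via CacheConfiguration.setGroupName, and caches that share a group also share partition files and related structures:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheGroupExample {
    public static void main(String[] args) {
        // Two PARTITIONED caches placed in the same cache group ("appCaches" is a hypothetical name).
        CacheConfiguration<Long, String> orders = new CacheConfiguration<Long, String>("orders")
                .setCacheMode(CacheMode.PARTITIONED)
                .setGroupName("appCaches");

        CacheConfiguration<Long, String> customers = new CacheConfiguration<Long, String>("customers")
                .setCacheMode(CacheMode.PARTITIONED)
                .setGroupName("appCaches"); // same group -> shared partition files and structures

        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache(orders);
            ignite.getOrCreateCache(customers);
        }
    }
}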

Related

Alluxio data is not evenly distributed

I have an EMR setup with 4 r3.4xlarge machines (a total of 128 GB of memory, 32 GB per node, and 1000 GB of SSD, 250 GB per node, allocated to Alluxio).
I have loaded around 650 GB of ORC data, but I can see that 3 workers have used 80%+ of their allocated space while one worker has used only 1%.
Is there any way to evenly distribute the data across all workers?
Thanks in advance
Typically, when an Alluxio client reads data from the UFS, it will cache the data to its local worker. If there is a large imbalance in the distribution of the data, it may indicate that the task distribution is not even.
There is an Alluxio client configuration parameter which can change the default behavior when caching data into Alluxio. For example, you can set:
alluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy
to change the write location policy to round robin, which will distribute the data across the workers more evenly. This configuration parameter has to be set on the Alluxio client; how you do that depends on the specific framework you are using.
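For example (a hedged illustration, since the exact mechanism depends on your Alluxio version and compute framework), client-side properties can usually be placed in an alluxio-site.properties file on the client's classpath, or passed to the client JVM as a system property:
-Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy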

Speed Up Hot Folder Data Upload

I have a CSV file of around 188 MB. When I try to upload the data using the hot folder technique, it takes too long, 10-12 hours. How can I speed up the data upload?
Thanks
The default value of impex.import.workers is 1. Try changing this value. I also recommend running a performance test with a file a bit smaller than 188 MB first, just to get quick results.
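For example (an illustrative value only; see the sizing guidance in the next answer), in your local.properties:
impex.import.workers=4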
Adjust the number of ImpEx threads on the backoffice server to speed up ImpEx file processing. It is recommended that you start with a value equal to the number of cores available on a backoffice node. You should not set it higher than 2 x the number of cores, and only then if the ImpEx processes will be the only thing running on the node. The actual value will likely be somewhere in between and can only be determined by testing and by analyzing the other processes, jobs, and applications running on your server, to ensure you are not maxing out the CPU.
NOTE: this value could be higher for lower environments since Hybris will likely be the only process running.
Taken from Tuning Parameters - Hybris Wiki

Hadoop 2.6.4 and Big File

I am a new user of Apache Hadoop. There is one thing I do not understand. I have a simple cluster (3 nodes), and every node has about 30 GB of free space. When I look at the Overview page of Hadoop, I see DFS Remaining: 90.96 GB. I set the replication factor to 1.
Then I create a 50 GB file and try to upload it to HDFS, but it runs out of space. Why? Can't I upload a file that is larger than the free space of a single node in the cluster?
According to Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes.
I think it depends on whether the client is itself one of the Hadoop nodes. If the client runs on a DataNode, then (with the default placement policy quoted above) the first replica of every block goes to that same node; with a replication factor of 1 that is the only copy, so the whole 50 GB file has to fit into that node's ~30 GB of free space even though the cluster as a whole has 90 GB remaining. This also doesn't give any better read/write throughput despite having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each block, so the blocks are spread across the nodes of the cluster, which does give better read/write throughput.
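If you want to confirm this, you can watch per-DataNode usage while the upload is running, for example with:
hdfs dfsadmin -report
With the client on a DataNode, you would expect to see only that node's DFS Used growing.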

Couchbase 3.1.0 - Hard out of memory error when performing full backup

We recently migrated to Couchbase 3.1.0. The odd thing is that when performing a full backup of a bucket, the web UI alerts "Hard Out Of Memory Error. Bucket X on node Y is full. All memory allocated to this bucket is used for metadata". The numbers for RAM usage in the web UI contradict that: about 75% is used, but not 100%. I looked into the logs but haven't found any similar errors there.
Is that even normal?
This is a known issue in the Couchbase Server 3.x releases.
To understand the problem, we must also first understand Database Change Protocol (DCP), the protocol used to transfer data throughout the system. At a high level the flow-control for DCP is as follows:
1. The Consumer creates a connection with the Producer and sends an Open Connection message. The Consumer then sends a Control message to indicate per-stream flow control. This message will contain "stream_buffer_size" in the key section and the buffer size the Consumer would like each stream to have in the value section.
2. The Consumer will then start opening streams so that it can receive data from the server.
3. The Producer will then continue to send data for the stream that has buffer space available until it reaches the maximum send size.
Steps 1-3 continue until the connection is closed, as the Consumer continues to consume items from the stream.
However, the cbbackup utility does not implement any flow control (data buffer limits), and it will try to stream all vbuckets from all nodes at once, with no cap on the buffer size.
While this does not mean that it will use the same amount of memory as your overall data size (as the streams are being drained slowly by the cbbackup process), it does mean that a large memory overhead is required to be able to store the data streams.
When you are in a heavy DGM (disk greater than memory) scenario, the amount of memory required to store the streams is likely to grow more rapidly than cbbackup can drain them as it is streaming large quantities of data off of disk, leading to very large streams, which take up a lot of memory as previously mentioned.
The slightly misleading message about metadata taking up all of the memory is displayed as there is no memory left for the data, so all of the remaining memory is allocated to the metadata, which when using value eviction cannot be ejected from memory.
The reason that this only affects Couchbase Server versions prior to 4.0 is that in 4.0 a server-side improvement to DCP stream management was made that allows DCP streams to be paused to keep the memory footprint down; this is tracked as MB-12179.
As a result, you should not experience the same issue on Couchbase Server versions 4.x+, regardless of how DGM your bucket is.
Workaround
If you find yourself in a situation where this issue is occurring, then terminating the backup job should release all of the memory consumed by the streams immediately.
Unfortunately if you have already had most of your data evicted from memory as a result of the backup, then you will have to retrieve a large quantity of data off of disk instead of RAM for a small period of time, which is likely to increase your get latencies.
Over time, 'hot' data will be brought back into memory as it is requested, so this will only be a problem for a short period of time; however, it is still a fairly undesirable situation to be in.
The workaround to avoid this issue completely is to only stream a small number of vbuckets at once when performing the backup, as opposed to all vbuckets which cbbackup does by default.
This can be achieved using cbbackupwrapper, which comes bundled with all Couchbase Server releases 3.1.0 and later; details of using cbbackupwrapper can be found in the Couchbase Server documentation.
In particular the parameter to pay attention to is the -n flag, which specifies the number of vbuckets to be backed up in a batch at once.
As the name suggests, cbbackupwrapper is simply a wrapper script on top of cbbackup which partitions the vbuckets up and automatically handles all of the directory creation and backup generation, while still using cbbackup under the hood.
As an example, with a batch size of 50, cbbackupwrapper would backup vbuckets 0-49 first, followed by 50-99, then 100-149 etc.
It is suggested that you test with cbbackupwrapper in a testing environment that mirrors your production environment, to find suitable values for -n and for -P (which controls how many backup processes run at once); the combination of these two controls the amount of memory pressure caused by the backup as well as the overall speed.
You should not find that lowering the value of -n from its default of 100 decreases the backup speed; in some cases you may find that the backup speed actually increases because there is far less memory pressure on the server.
You may however wish to sensibly adjust the -P parameter if you wish to speed up the backup further.
Below is an example command:
cbbackupwrapper http://[host]:8091 [backup_dir] -u [user_name] -p [password] -n 50
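If you also want to adjust the number of parallel backup processes discussed above, the -P flag can be added to the same command (the value here is only an illustration):
cbbackupwrapper http://[host]:8091 [backup_dir] -u [user_name] -p [password] -n 50 -P 2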
It should be noted that if you use cbbackupwrapper to perform your backup then you must also use cbrestorewrapper to restore the data, as cbrestorewrapper is automatically aware of the directory structures used by cbbackupwrapper.
When you run a full backup, by default the backup tool streams data from all nodes over the network. This is not the best way, because it causes a lot of extra load and increased memory usage, especially if you run cbbackup on one of the Couchbase nodes. I would use the data-copy mode of cbbackup, which copies data directly from the files on disk:
> sudo /opt/couchbase/bin/cbbackup couchstore-files:///opt/couchbase/var/lib/couchbase/data/ /tmp/backup
Of course, change the data path to wherever your Couchbase data is actually stored. (In my example it runs as sudo because only root has read access to /opt/couchbase/blabla..) Do this on every node, then collect all the backup folders and put them somewhere. Note that the backups are very compressible, so you might want to zip them before copying over the network.

How to limit the RAM used by multiple embedded HSQLDB instances as a whole?

Given:
HSQLDB embedded
50 distinct databases (I have 50 different data sources)
All the databases are of the file:/ kind
All the tables are CACHED
The amount of RAM that all the embedded DB instances combined are allowed to use is limited and is fixed at startup of the Java process.
The LOG file is disabled (no need to recover upon crash)
My understanding is that the RAM used by a single DB instance is comprised of the following pieces:
The cache of all the tables (all my tables are CACHED)
The DB instance internal state
Also, as far as I can see I have these two properties to control the total size of the cache of a single DB instance:
SET FILES CACHE SIZE
SET FILES CACHE ROWS
However, they control only the cache part of the RAM used by a DB instance. Plus, they are per DB instance, whereas I would like to limit all the instances as a whole.
So, I wonder whether it is possible to instruct HSQLDB to stay within the specified amount of RAM in total including all the DB instances?
You can only limit the CACHE memory use per database instance. Each instance is independent of the others.
You can reduce the CACHE SIZE and CACHE ROWS per database to suit your application.
HSQLDB does not use a lot of other memory, but when it does, it uses the memory of the JVM, which is shared among the different database instances.
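As a rough, illustrative sketch (the paths and values are made up, and there is no single knob for the combined total; the JVM heap limit, -Xmx, is effectively the overall cap), you could lower the per-database cache limits when opening each of the databases:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqldbCacheLimits {

    // Apply conservative cache limits to one embedded, file-based database.
    // The size value is interpreted by HSQLDB in KB units (check the documentation for your version).
    static void limitCache(String dbPath, int cacheSizeKb, int cacheRows) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:hsqldb:file:" + dbPath, "SA", "");
             Statement st = c.createStatement()) {
            st.execute("SET FILES CACHE SIZE " + cacheSizeKb);
            st.execute("SET FILES CACHE ROWS " + cacheRows);
            st.execute("SET FILES LOG FALSE"); // the log is already disabled in this setup
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical paths for the 50 data sources; tune the limits so that all
        // instances together stay comfortably inside the JVM heap (-Xmx).
        for (int i = 0; i < 50; i++) {
            limitCache("data/db" + i, 2000, 5000);
        }
    }
}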