Alluxio data is not evenly distributed

I have an EMR setup with 4 r3.4xlarge machines (128 GB of memory in total, 32 GB per node, and 1000 GB of SSD, 250 GB per node, allocated to Alluxio).
I have loaded around 650 GB of ORC data, but I can see that 3 workers have used 80%+ of their allocated space while one worker has used only 1%.
Is there any way to evenly distribute the data across all workers?
Thanks in advance

Typically, when an Alluxio client reads data from the UFS, it caches that data on its local worker. A large imbalance in the data distribution may therefore indicate that the tasks themselves are not evenly distributed.
There is an Alluxio client configuration parameter which can change the default behavior when caching data into Alluxio. For example, you can set:
alluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy
to change the write location policy to round robin, which will distribute the data across the workers more evenly. This configuration parameter has to be set on the Alluxio client; how you do that depends on the specific framework you are using.
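As one hedged sketch, for a Spark client you could pass the property as a JVM system property on the driver and executors (assuming the Alluxio client jar is already on the Spark classpath; this uses Spark's standard configuration mechanism, nothing Alluxio-specific):

# spark-defaults.conf (or the equivalent --conf flags on spark-submit)
spark.driver.extraJavaOptions   -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy
spark.executor.extraJavaOptions -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy

For other frameworks (Presto, MapReduce, etc.) the same property would go into that framework's own configuration for the Alluxio client.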

Related

Is there a standard on number of Ignite caches per cluster?

I know that a larger number of caches opens more file descriptors and consumes more resources.
Is there a recommendation on the number of caches per Ignite/GridGain cluster?
Is there a recommendation on the number of caches vs. the number of nodes vs. the OS configuration (CPU, RAM)?
We have 45 caches (PARTITIONED), and each node has 4 CPUs and 60 GB RAM. It is a 3-node cluster.
The current data storage size is 2 GB, and the data is expected to grow to 1.5-2 TB in the next year.
We are frequently getting "Too many open files" errors.
First of all, there's nothing wrong with increasing the file descriptor limit at the OS level. You can use the ulimit utility for that.
Another option is to leverage cache groups, which make caches share some internal structures, including files.
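A minimal sketch of both options (the user name, limit value, and cache/group names below are placeholders):

# raise the open-files limit for the current shell (temporary)
ulimit -n 65536
# make it permanent by adding lines like these to /etc/security/limits.conf,
# assuming Ignite runs as the user "ignite"
ignite  soft  nofile  65536
ignite  hard  nofile  65536

And for cache groups, put the caches into the same group in the Spring XML cache configuration:

<!-- two caches placed in the same cache group share partition files and other structures -->
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="cacheA"/>
    <property name="groupName" value="group1"/>
</bean>
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="cacheB"/>
    <property name="groupName" value="group1"/>
</bean>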

does ceph RBD have the ability to load balance?

I don't know much about Ceph. As far as I know, RBD is Ceph's distributed block storage device, and the same data should be stored on several computers that make up the Ceph cluster. So, does this distributed block device (Ceph RBD) have the ability to load balance? In other words, if multiple clients (in my situation, QEMU) use this RBD block storage and read the same data at the same time, will Ceph RBD balance the traffic so that different computers in the cluster serve the clients simultaneously, or will just one computer send its data to all of them? If I have a Ceph cluster composed of 6 computers and a Ceph cluster composed of 3 computers, is there any difference in the performance of these RBDs?
It's not load balancing in the strict sense, but the distributed nature of Ceph allows many clients to be served in parallel. If we focus on replicated pools with a size of 3, there are 3 different disks (on different hosts) storing the exact same object, but there's always a primary OSD which forwards write requests to the other copies. This makes write requests a little slower; read requests are served only by the primary OSD, so reading is much faster than writing. And since clients "talk" directly to the OSDs (they get the addresses from the MONs), many clients can be served in parallel, especially because the OSDs don't store an RBD image as a single object but split it into many objects grouped into placement groups.
However, if you are really talking about the exact same object being read by multiple clients, you have to know that there are watchers on RBDs which lock them so that only one client can change the data. If you could describe your scenario in more detail, we could provide more information.
If I have a ceph cluster composed of 6 computers and a ceph cluster composed of 3 computers, is there any difference in the performance of these RBDs?
It depends on the actual configuration (a reasonable number of PGs, CRUSH rules, the network, etc.), but in general the answer is yes: the more Ceph nodes you have, the more clients you can serve in parallel. Ceph may not have the best performance compared to other storage systems (again, depending on the actual setup), but it scales so well that performance stays roughly the same as the number of clients increases.
https://ceph.readthedocs.io/en/latest/architecture/
does ceph RBD have the ability to load balance?
Yes, it does. For RBD there's the rbd_read_from_replica_policy option:
"… Policy for determining which OSD will receive read operations. If set to default, each PG’s primary OSD will always be used for read operations. If set to balance, read operations will be sent to a randomly selected OSD within the replica set. If set to localize, read operations will be sent to the closest OSD as determined by the CRUSH map
…
Note: this feature requires the cluster to be configured with a minimum compatible OSD release of Octopus. …"
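A minimal sketch of turning this on for librbd clients (assuming the cluster runs Octopus or newer; whether you set it in ceph.conf or via the centralized config store depends on your setup):

# ceph.conf on the client host
[client]
rbd_read_from_replica_policy = balance

# or, using the centralized configuration database
ceph config set client rbd_read_from_replica_policy balance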

Choose to store table exclusively on disk in Apache Ignite

I understand that Apache Ignite's native persistence mode stores as much data as possible in memory, with the remaining data on disk.
Is it possible to manually choose which table I want to store in memory and which I want to store EXCLUSIVELY on disk? If I want to save costs, should I just give Ignite a lot of disk space and only a small amount of memory? What if I know some tables should return results as fast as possible while other tables have lower priority in terms of speed (even if they are accessed more often)? Is there any feature to prioritize data storage in memory at the table level or at any other level?
You can define two different data regions: one with a small amount of memory and persistence enabled, and a second one without persistence but with a bigger maximum memory size: https://apacheignite.readme.io/docs/memory-configuration
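A minimal sketch of such a setup in the Spring XML IgniteConfiguration (region names and sizes are placeholders):

<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="dataRegionConfigurations">
            <list>
                <!-- small region with persistence enabled -->
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="diskBackedRegion"/>
                    <property name="maxSize" value="#{1L * 1024 * 1024 * 1024}"/>
                    <property name="persistenceEnabled" value="true"/>
                </bean>
                <!-- larger, memory-only region -->
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="inMemoryRegion"/>
                    <property name="maxSize" value="#{8L * 1024 * 1024 * 1024}"/>
                    <property name="persistenceEnabled" value="false"/>
                </bean>
            </list>
        </property>
    </bean>
</property>

Each cache (and therefore each SQL table) is then assigned to one of these regions via the dataRegionName property of its CacheConfiguration.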
You can't have a cache (which holds the rows of a table) stored exclusively on disk.
When you add a row to a table, it gets stored in Durable Memory, which is always located in RAM. Later it may be flushed to disk via the checkpointing process, which uses the checkpoint page buffer, also located in RAM. So you can have a separate region with low memory usage (see the other answer), but you can't keep the data exclusively on disk.
When you access data, it will always be pulled from disk into Durable Memory as well.

Controlling HDFS replication, number of mappers and reducer placement

I am trying to run Apache Hadoop 2.6.5 in a distributed way (with a cluster of 3 computers), and I want to decide the number of mappers and reducers.
I am using HDFS with a replication factor of 1, and my input is 3 files (tables).
I want to adjust the way data flows in the system, and I would like some help with the following matters: is each of them possible, and if so, how and where can I change it?
Replication of HDFS - Can I influence how HDFS replication is done? For example, can I make sure that each file is stored on a different computer, and if so, can I choose which computer it will be stored on?
Number of mappers - Can I change the number of mappers or input splits? I know that it is decided by the number of input splits and the block size. I read on the web that I can do that by changing the following parameters, but I don't know where:
-D mapred.map.tasks=5
mapred.min.split.size property
Reducer placement - How can I suggest or force the ResourceManager to start the reduce containers (reduce tasks) on specific computers, and if so, can I choose how many run on each computer (i.e., divide the map output differently across the cluster)? More specifically, can I add another parameter to the ContainerLaunchContext (we already have memory, CPU, disk, and locality)?
Replication of HDFS - Can I influence how HDFS replication is done?
Answer: Yes, you can change the replication factor in HDFS; just set it in the configuration file (hdfs-site.xml).
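For example, a minimal sketch (the paths and the replication value are placeholders):

<!-- etc/hadoop/hdfs-site.xml: default replication factor for newly written files -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

# change the replication factor of data already in HDFS (and wait for it to complete)
hdfs dfs -setrep -w 1 /user/hadoop/tables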
Number of mappers - Can I change the number of mappers or input splits?
Answer: Yes, the number of mappers can also be changed, but through the MapReduce job configuration rather than HDFS: the number of map tasks follows from the number of input splits, which you control via the split size settings (and the block size).
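A hedged sketch of where those parameters go (the jar, driver class, and paths are placeholders; in Hadoop 2.x the current property names are mapreduce.job.maps and mapreduce.input.fileinputformat.split.minsize, with the older mapred.* names kept as deprecated aliases):

# pass job-level overrides on the command line; this requires a driver that uses
# ToolRunner/GenericOptionsParser so that -D options are picked up
hadoop jar my-job.jar MyDriver \
  -D mapreduce.job.maps=5 \
  -D mapreduce.input.fileinputformat.split.minsize=268435456 \
  /input /output

Keep in mind that mapreduce.job.maps is only a hint; the effective number of map tasks equals the number of input splits, so adjusting the split size (or block size) is the reliable way to change it.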

Setting up a Hadoop Cluster on Amazon Web services with EBS

I was wondering how I could set up a Hadoop cluster (say 5 nodes) on AWS. I know how to create the cluster on EC2, but I don't know how to handle the following challenges.
What happens if I lose my spot instance? How do I keep the cluster going?
I am working with datasets of size 1 TB. Would it be possible to set up EBS accordingly? How can I access HDFS in this scenario?
Any help will be great!
These suggestions would change depending on your requirements. However, assuming a 2-master and 3-worker setup, you can probably use r3 instances for the master nodes, as they are optimized for memory-intensive applications, and d2 instances for the worker nodes. d2 instances have multiple local disks and can therefore withstand some disk failures while still keeping your data safe.
To answer your specific questions,
Treat Hadoop machines like any other Linux servers. What would happen if your ordinary CentOS spot instances were lost? Hence, it is generally advised to use reserved instances.
Hadoop typically stores data by maintaining 3 copies and distributing them across the worker nodes in the form of 128 MB or 256 MB blocks. So, for a 1 TB dataset, you will have 3 TB of data to store across the three worker nodes. Obviously, you have to consider some overhead while calculating space requirements.
You can use AWS's EMR service; it is designed specifically for running Hadoop clusters on top of EC2 instances.
It is fully managed, and it comes pre-packaged with all the services you need for Hadoop.
Regarding your questions:
There are three main types of nodes in an EMR Hadoop cluster:
Master - a single node; don't run it as a spot instance.
Core - a node that handles tasks and holds part of the HDFS.
Task - a node that handles tasks but does not hold any part of the HDFS.
If Task nodes are lost (e.g. if they are spot instances), the cluster will continue to work with no problems.
Regarding storage, the default replication factor in EMR is as follows:
1 for clusters < four nodes
2 for clusters < ten nodes
3 for all other clusters
But you can change it - http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hdfs-config.html
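For example, a minimal sketch of an EMR configuration classification that overrides the default at cluster creation time (the value 2 here is just an illustration):

[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "2"
    }
  }
]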