A solution to increase my EMR master node capacity without shutting down the cluster

My EMR master node has become full and I need to attach an EBS volume to it. Is there any way to do this without terminating the cluster?

You can add additional EBS volumes and also resize existing ones without restarting the cluster.
How to do this is explained here:
https://superuser.com/questions/1409373/how-to-add-an-ebs-volume-by-snapshot-id-to-amazon-emr
https://github.com/qyjohn/AWS_Tutorials/wiki/Grow-EBS-volumes-on-EMR-clusters
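For illustration, here is a minimal boto3 sketch of the attach step, assuming the master node's instance ID is known; the region, IDs, size, and device name are hypothetical placeholders, and the exact filesystem/mount commands depend on your AMI:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region
    master_instance_id = "i-0123456789abcdef0"          # hypothetical master node ID

    # Create a new volume in the master's availability zone and wait for it.
    volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100, VolumeType="gp3")
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

    # Attach it to the running master; /dev/sdf is an assumed free device name.
    ec2.attach_volume(VolumeId=volume["VolumeId"],
                      InstanceId=master_instance_id,
                      Device="/dev/sdf")

    # Then SSH to the master, create a filesystem, and mount it, e.g.:
    #   sudo mkfs -t xfs /dev/xvdf && sudo mkdir /mnt1 && sudo mount /dev/xvdf /mnt1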

I don't think so. You set up Amazon Elastic Block Store (Amazon EBS) volumes and configure mount points when the cluster is launched, so it's difficult to modify the storage capacity after the cluster is running.
The feasible solutions usually involve adding more nodes to your cluster, or backing up your data to a data lake and then launching a new cluster with a higher storage capacity. Or, if the data that occupies the storage is expendable, removing the excess data is usually the way to go.
For more details, have a look at: https://aws.amazon.com/blogs/big-data/dynamically-scale-up-storage-on-amazon-emr-clusters/
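If you go the add-more-nodes route, growing the core instance group of a running cluster can be done through the API; here is a minimal boto3 sketch, with the cluster ID as a hypothetical placeholder:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region
    cluster_id = "j-0123456789ABC"                      # hypothetical cluster ID

    # Find the core instance group and request two more nodes.
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    core = next(g for g in groups if g["InstanceGroupType"] == "CORE")
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{
            "InstanceGroupId": core["Id"],
            "InstanceCount": core["RequestedInstanceCount"] + 2,
        }],
    )

Note that this grows HDFS capacity on the core nodes; it does not add local disk to the master itself.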

Related

Can I have a 2-stage Redis backup using free Redis?

I am new to Redis and still reading the docs; I hope you can help me here.
I need a 2-stage database solution:
On the local devices, there is a database cluster. It has several primaries and several replicas; to my understanding, each primary or replica normally holds a portion of the whole data set. This is called data sharding.
In the cloud, there is another database replica. This cloud replica backs up the whole data set.
I'd like to use free Redis for this solution, not the enterprise version.
Is this achievable? From what I have read so far, it seems there is no problem if the cloud replica is just like a local replica, backing up a portion of the data set. So I want to know whether I can use the cloud database to back up the whole cluster.
Thanks!
Nothing prevents you from having a replica hosted in the cloud, but each Redis Cluster node is either a master responsible for a set of key slots (shards) or a replica of a single master; in a multi-master scenario there is no way to have a single replica cover different master nodes.
With the goal of having your entire cluster data replicated in the cloud, you should configure and host there one additional Redis replica for each master node. To prevent those new replicas from ever becoming masters themselves, you can set their cluster-replica-no-failover configuration property accordingly in their redis.conf files:
cluster-replica-no-failover yes
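To illustrate the wiring, here is a minimal redis-py sketch that introduces one such cloud-hosted replica to the cluster and attaches it to a local master; the host names and addresses are hypothetical placeholders:

    import redis  # pip install redis

    # Hypothetical addresses -- replace with your own nodes.
    local_master = redis.Redis(host="10.0.0.11", port=6379)
    cloud_replica = redis.Redis(host="replica-1.cloud.example.com", port=6379)

    # Introduce the new (empty, cluster-enabled) node to the cluster...
    cloud_replica.execute_command("CLUSTER", "MEET", "10.0.0.11", "6379")
    # ...wait for the cluster gossip to propagate, then attach it to one master.
    master_id = local_master.execute_command("CLUSTER", "MYID")
    cloud_replica.execute_command("CLUSTER", "REPLICATE", master_id)

Repeat this once per local master so that every shard ends up with a cloud copy.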
In all cases, please note that replication is not a backup solution, and you may want to pair your setup with a proper Redis persistence mechanism (RDB snapshots and/or AOF).
If I understand your question correctly, your master datasets (in shards) are located on premises and the replicas (slaves) are hosted in the cloud. There is nothing preventing you from backing up to your slaves in the cloud with open-source Redis: Redis doesn't care where the slaves are situated, provided the masters can reach them. Master-slave replication is also available in Redis Enterprise with no such restriction. You might have a little trouble implementing master-master replication on open-source Redis, but that is outside the scope of this question.

Setting up a Hadoop cluster on Amazon Web Services with EBS

I was wondering how I could set up a Hadoop cluster (say, 5 nodes) on AWS. I know how to create the cluster on EC2, but I don't know how to face the following challenges.
What happens if I lose my spot instance? How do I keep the cluster going?
I am working with datasets of size 1 TB. Would it be possible to set up the EBS volumes accordingly? How can I access HDFS in this scenario?
Any help will be great!
Depending on your requirements, these suggestions would change. However, assuming a 2-master and 3-worker setup, you could use r3 instances for the master nodes, since they are memory-optimized, and d2 instances for the worker nodes. d2 instances have multiple local disks and can therefore withstand some disk failures while still keeping your data safe.
To answer your specific questions:
Treat Hadoop machines like any other Linux servers. What would happen if your plain CentOS spot instances were lost? Hence, it is generally advised to use reserved instances.
Hadoop typically stores data by keeping 3 copies of every block and distributing them across the worker nodes in 128 MB or 256 MB blocks. So your 1 TB dataset means 3 TB to store across the three worker nodes, and you obviously have to allow some overhead on top of that when calculating space requirements.
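As a rough back-of-the-envelope calculation (the 25% overhead figure is an assumption, not a Hadoop constant):

    # Rough HDFS capacity estimate for the 1 TB dataset mentioned above.
    raw_tb = 1.0          # raw dataset size
    replication = 3       # default HDFS replication factor
    overhead = 0.25       # assumed headroom for intermediate/temporary data
    required_tb = raw_tb * replication * (1 + overhead)
    print(f"Provision at least {required_tb:.2f} TB across the workers")  # 3.75 TB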
You can use AWS's EMR service - it is designed especially for Hadoop clusters on top of EC2 instances.
It is fully managed, and it comes pre-packaged with all the services you need in Hadoop.
Regarding your questions:
There are three main types of nodes in Hadoop:
Master - a single node; don't run it on a spot instance.
Core - a node that handles tasks and holds part of the HDFS.
Task - a node that handles tasks but does not hold any part of the HDFS.
If Task nodes are lost (if they are spot instances), the cluster will continue to work with no problems.
Regarding storage, the default replication factor in EMR is as follows:
1 for clusters with fewer than four nodes
2 for clusters with four to nine nodes
3 for all larger clusters
But you can change it - http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hdfs-config.html
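If you launch through the API, the hdfs-site classification at cluster creation is one place to override it; here is a minimal boto3 sketch, where every name and instance type is a hypothetical placeholder:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    emr.run_job_flow(
        Name="my-hadoop-cluster",
        ReleaseLabel="emr-5.36.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 4,
        },
        Configurations=[{
            "Classification": "hdfs-site",
            "Properties": {"dfs.replication": "3"},  # override the size-based default
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )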

Can we copy an Apache Ignite cluster to another Ignite cluster?

I want to back up the entire Ignite cluster so that the backup cluster can be used if the original (active) cluster goes down. Is there any approach for this?
If you need two separate clusters with replication across data centers, it would be better to look at the GridGain offerings, which support data center replication.
Unfortunately, Ignite itself does not support DR.
With Apache Ignite you can logically divide your cluster into two zones to guarantee that every zone contains a full copy of the data. However, there is no way to choose the primary node for partitions manually. See AffinityFunction and the affinityBackupFilter() method of the standard implementations.
As answered above, a ready-made solution is only available in the paid version. Open-source Apache Ignite provides the ability to take a cluster-wide snapshot. You can add a cron job in your Ignite cluster to take this snapshot, and another job to copy the snapshot data to an object store like S3.
On the other cluster, you download this data node by node into the work directories of the respective nodes, following the manual restore procedure, and start the cluster. It should activate automatically once all baseline nodes have started successfully, and your cluster is then ready to use.
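A minimal sketch of such a backup job in Python, assuming Ignite 2.11+ (which ships the control.sh snapshot command) and the default snapshot location under the work directory; the paths and bucket name are placeholders:

    import os
    import subprocess
    import boto3

    IGNITE_HOME = "/opt/ignite"        # hypothetical install path
    SNAPSHOT_NAME = "nightly"          # hypothetical snapshot name
    BUCKET = "my-ignite-backups"       # hypothetical S3 bucket

    # 1. Trigger a cluster-wide snapshot.
    subprocess.run(
        [f"{IGNITE_HOME}/bin/control.sh", "--snapshot", "create", SNAPSHOT_NAME],
        check=True,
    )

    # 2. Upload this node's snapshot files to S3. Run on every node,
    #    since each node writes only its own partitions.
    s3 = boto3.client("s3")
    snapshot_dir = f"{IGNITE_HOME}/work/snapshots/{SNAPSHOT_NAME}"
    for root, _, files in os.walk(snapshot_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, snapshot_dir)
            s3.upload_file(path, BUCKET, f"{SNAPSHOT_NAME}/{key}")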

Ignite servers go down when putting large data into the cluster

I deployed an Ignite cluster on YARN. The cluster has 5 servers; each server has 10 GB of memory and an 8 GB heap. I was trying to write a lot of data to an Ignite cache, where each item is an integer array of length 100K and the number of backups is 2. By the time I had written 3,980 items to the cache, the cluster's heap was almost full, but instead of rejecting writes, the servers went down one by one.
My questions are:
Is there a configuration or other way to control the cache ratio of the servers, so that the heap won't fill up and the servers won't go down?
Servers going down when too much is written into the cache does not seem good for users; I'm wondering why Ignite lets this happen under the default configuration.
Apache Ignite, like the Java Virtual Machine itself, is NOT responsible for managing or controlling the size of the data sets placed into the Java heap. This is why OutOfMemoryError exists in the Java API: it is the application's responsibility to handle its data sets and make sure they fit into the heap.
You can set up an eviction policy, and Ignite may then either move data to the off-heap region, swap it to disk, or remove it from memory completely.
Refer to my foreword above: this is the responsibility of the application. Ignite can assist here with its eviction policy, its off-heap mode, and its ability to scale out.

Using Glacier as back end for web crawling

I will be doing a crawl of several million URLs from EC2 over a few months and I am thinking about where I ought to store this data. My eventual goal is to analyze it, but the analysis might not be immediate (even though I would like to crawl it now for other reasons) and I may want to eventually transfer a copy of the data out for storage on a local device I have. I estimate the data will be around 5TB.
My question: I am considering using Glacier for this, with the idea that I will run a multithreaded crawler that stores the crawled pages locally (on EBS) and then use a separate thread that combines, compresses, and shuttles that data to Glacier. I know transfer speeds on Glacier are not necessarily good, but since there is no online element of this process, it seems feasible (especially since I could always increase the size of my local EBS volume in case I'm crawling faster than I can store to Glacier).
Is there a flaw in my approach or can anyone suggest a more cost-effective, reliable way to do this?
Thanks!
Redshift seems more relevant than Glacier. Glacier is all about freeze / thaw and you'll have to move the data prior to doing any analysis.
Redshift is more about adding the data into a large, inexpensive, data warehouse and running queries over it.
Another option is to store the data in EBS and leave it there. When you're done with your crawling, take a snapshot to push the volume into S3 and decommission the volume and EC2 instance. Then, when you're ready to do the analysis, just create a volume from the snapshot.
The upside of this approach is that it's all file access (no formal data store) which may be easier for you.
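A minimal boto3 sketch of that snapshot/restore cycle; the region, volume ID, and availability zone are hypothetical placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region
    crawl_volume_id = "vol-0123456789abcdef0"           # hypothetical volume ID

    # When crawling is done: snapshot the volume, then decommission it.
    snap = ec2.create_snapshot(VolumeId=crawl_volume_id, Description="crawl data")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
    ec2.delete_volume(VolumeId=crawl_volume_id)

    # Months later, when ready to analyze: recreate a volume from the snapshot.
    vol = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                            AvailabilityZone="us-east-1a")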
Personally, I would probably push the data into Redshift. :-)
--
Chris
If your analysis will not be immediate, you can adopt one of the following two approaches:
Approach 1) Amazon EC2 crawler -> store on EBS disks -> move the files frequently to Amazon S3 -> archive regularly to Glacier (see the lifecycle-rule sketch after this list). You can keep your last X days of data in Amazon S3 and use it for ad hoc processing as well.
Approach 2) Amazon EC2 crawler -> store on EBS disks -> move the files frequently to Amazon Glacier. Retrieve when needed and do the processing on EMR or other processing tools.
If you need frequent analysis:
Approach 3) Amazon EC2 crawler -> store on EBS disks -> move the files frequently to Amazon S3 -> analyze with EMR or other tools, store the processed results in S3/DB/MPP, and move the raw files to Glacier.
Approach 4) If your data is structured: Amazon EC2 crawler -> store on EBS disks -> load into Amazon Redshift, and move the raw files to Glacier.
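For the S3-to-Glacier step in Approaches 1 and 3, an S3 lifecycle rule can do the archiving automatically; here is a minimal boto3 sketch, where the bucket name, prefix, and 30-day window are assumed placeholders:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-crawl-data",  # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-raw-pages",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                # Keep the last 30 days in S3 for ad hoc processing,
                # then transition objects to Glacier automatically.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }]
        },
    )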
Additional tips:
If you can retrieve the data again (from the source), you can use ephemeral disks for your crawlers instead of EBS.
Amazon has introduced the Data Pipeline service; check whether it fits your data-movement needs.