Schedule YARN application on active/standby nodes - hadoop-yarn

I would like to have a cluster that is split into 2 sub-clusters: "active" nodes and "standby" nodes.
Normally, when an application is scheduled I would like it to run on the "active" nodes. But if no "active" node is healthy, I would like it to run on the "standby" nodes.
Is there a way to achieve such behavior in YARN?
To give a bit more detail, the "active" nodes of the cluster will be located in a different zone than the "standby" nodes (but not far from them).
In this way we are trying to achieve multi-zone high availability for our application: upon a disaster in the "active" zone, the application will be recovered and scheduled in the "standby" zone.

To route jobs to specific nodes, you will need Node Labels. The Capacity Scheduler has had them for a while (since around 2.6), but for the Fair Scheduler I think support was only planned for Hadoop 3.x.
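For illustration, a minimal sketch of how the two partitions could be defined with node labels, assuming the Capacity Scheduler; the label names and hostnames below are placeholders:

# Define two sharable (non-exclusive) node label partitions
yarn rmadmin -addToClusterNodeLabels "active(exclusive=false),standby(exclusive=false)"
# Attach the labels to specific NodeManager hosts (hostnames are placeholders)
yarn rmadmin -replaceLabelsOnNode "active1.example.com=active standby1.example.com=standby"

Queues are then given capacity on each partition in capacity-scheduler.xml, and applications target a partition through a node label expression; non-exclusive labels additionally let unlabeled requests borrow idle labeled capacity.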
Another option to consider is YARN Federation, where you have more than one YARN cluster: your second cluster would be in zone 2, and you can re-route your job to zone 2 if zone 1 has issues.
References
YARN Node Labels
Hadoop: YARN Federation

Related

Ensure a new RabbitMQ quorum queue's replicas are spread among the cluster's availability zones

I'm going to run a multi-node (3 zones, 2 nodes in each, expected to grow) RabbitMQ cluster with many dynamically created quorum queues. It will be unmanageable to tinker with the queue replicas manually.
I need to ensure that a new quorum queue (1 leader, 2 replicas) always spans all 3 AZs, to be able to survive an AZ outage.
Is there a way to configure node affinity to achieve that goal?
There is one poor man's solution (actually, pretty expensive to do right) that comes to mind:
create queues with x-quorum-initial-group-size=1
have a reconciliation script that runs periodically and adds replica members on nodes in the right zones (see the sketch below)
Of course, a built-in feature for configurable node affinity, which I might be missing somehow, would be the best option.
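For what it's worth, the "grow" step of such a reconciliation script could be a couple of CLI calls along these lines; the vhost, queue name, and node names are placeholders:

# Add quorum queue members on nodes in the missing zones (all names are placeholders)
rabbitmq-queues add_member --vhost "/" "orders" rabbit@node-zone-b-1
rabbitmq-queues add_member --vhost "/" "orders" rabbit@node-zone-c-1

The script would first inspect where the members currently live (for example with rabbitmq-queues quorum_status) and compare that against the desired zone spread.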

How to make Dataproc detect Python-Hive connection as a Yarn Job?

I launch a Dataproc cluster and serve Hive on it. Remotely from any machine I use Pyhive or PyODBC to connect to Hive and do things. It's not just one query. It can be a long session with intermittent queries. (The query itself has issues; will ask separately.)
Even during one single, active query, the operation does not show as a "Job" (I guess it's Yarn) on the dashboard. In contrast, when I "submit" tasks via Pyspark, they show up as "Jobs".
Besides the lack of task visibility, I also suspect that, w/o a Job, the cluster may not reliably detect a Python client is "connected" to it, hence the cluster's auto-delete might kick in prematurely.
Is there a way to "register" a Job to companion my Python session, and cancel/delete the job at times of my choosing? For my case, it is a "dummy", "nominal" job that does nothing.
Or maybe there's a more proper way to let Yarn detect my Python client's connection and create a job for it?
Thanks.
This is not supported right now; you need to submit jobs via the Dataproc Jobs API to make them visible on the jobs UI page and to have them taken into account by the cluster TTL feature.
If you cannot use the Dataproc Jobs API to execute your actual jobs, then you can submit a dummy Pig job that sleeps for the desired time (5 hours in the example below) to prevent cluster deletion by the max idle time feature:
gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
--execute="sh sleep $((5 * 60 * 60))"
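If the Hive statements themselves can be expressed up front, they can likewise go through the Jobs API so that each run shows up as a job; the query below is only a placeholder:

gcloud dataproc jobs submit hive --cluster="${CLUSTER_NAME}" \
--execute="SELECT COUNT(*) FROM my_table"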

Restoring Large State in Apache Flink Streaming Job

We have a cluster running Hadoop and YARN on AWS EMR with one core node and one master node, each with 4 vCores, 32 GB memory, and 32 GB disk. We only have one long-running YARN application, and within that, only one or two long-running Flink applications, each with a parallelism of 1. Checkpointing runs at a 10-minute interval with a minimum of 5 minutes between checkpoints. We use event time with a window of 10 minutes and a watermark duration of 15 seconds. The state is stored in S3 through the FsStateBackend with async snapshots enabled. Exactly-once checkpointing is enabled as well.
We have UUIDs set up for all operators but don't have HA set up for YARN or explicit max parallelism for the operators.
Currently, when restoring from a checkpoint (3 GB), processing stalls at the windowing step until an org.apache.flink.util.FlinkException: The assigned slot <container_id> was removed error is thrown during the next checkpoint. I have seen that all operators except the one with the largest state (a ProcessFunction directly after the windowing) finish checkpointing.
I know it is strongly suggested to use RocksDB for production, but is that mandatory for state that most likely won't exceed 50 GB?
Where would be the best place to start addressing this problem? Parallelism?
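For reference (this command is not from the question itself), resuming from a retained externalized checkpoint, optionally with a different parallelism, uses the standard Flink CLI flags; the S3 path and jar name are placeholders:

# Restore from a retained checkpoint and override the parallelism (path and jar are placeholders)
flink run -s s3://my-bucket/checkpoints/<job-id>/chk-42 -p 2 my-streaming-job.jar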

Can we copy an Apache Ignite cluster to another Ignite cluster?

I want to back up an entire Ignite cluster so that the backup cluster can be used if the original (active) cluster goes down. Is there any approach for this?
If you need two separate clusters with replication across data centers, it would be better to look at GridGain, whose solutions support data center replication.
Unfortunately, Ignite itself does not support DR.
With Apache Ignite you can logically divide your cluster into two zones to guarantee that every zone contains a full copy of the data. However, there is no way to choose the primary node for partitions manually. See AffinityFunction and the affinityBackupFilter() method of the standard implementations.
As answered above, a ready-made solution is only available in the paid version. Open-source Apache Ignite provides the ability to take a cluster-wide snapshot. You can add a cron job in your Ignite cluster to take this snapshot, and another job to copy the snapshot data to object storage such as S3.
On the restore side, you download this data node by node into the work directories of the respective nodes as per the manual restore procedure and start the cluster. It should activate automatically once all baseline nodes have started successfully, and your cluster is then ready to use.
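A rough sketch of that cron-driven backup, assuming Ignite 2.11+ with native persistence; the snapshot name, IGNITE_HOME path, and bucket are placeholders:

# Take a cluster-wide snapshot, then copy it to S3 (names and paths are placeholders)
$IGNITE_HOME/bin/control.sh --snapshot create nightly_backup
aws s3 sync "$IGNITE_HOME/work/snapshots/nightly_backup" s3://my-ignite-backups/nightly_backup/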

Couchbase node failure

My understanding could be amiss here. As I understand it, Couchbase uses a smart client to automatically select which node to write to or read from in a cluster. What I DON'T understand is, when this data is written/read, is it also immediately written to all other nodes? If so, in the event of a node failure, how does Couchbase know to use a different node from the one that was 'marked as the master' for the current operation/key? Do you lose data in the event that one of your nodes fails?
This sentence from the Couchbase Server Manual gives me the impression that you do lose data (which would make Couchbase unsuitable for high availability requirements):
With fewer larger nodes, in case of a node failure the impact to the application will be greater
Thank you in advance for your time :)
By default, when data is written into Couchbase, the client returns success as soon as the data is written to one node's memory. After that, Couchbase persists it to disk and replicates it.
If you want to ensure that data is persisted to disk, most client libraries have functions that allow you to do that. With the help of those functions you can also ensure that data is replicated to another node. This mechanism is called observe.
When a node goes down, it should be failed over. Couchbase Server can do that automatically when the auto-failover timeout is set in the server settings. For example, if you have a 3-node cluster and the stored data has 2 replicas, you will not lose data if one node goes down. If a second node fails you will still not lose all data - it will be available on the last node.
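Enabling that auto-failover behaviour can be done from the UI or, roughly, with couchbase-cli; the host, credentials, and timeout below are placeholders:

# Turn on automatic failover with a 30-second timeout (host and credentials are placeholders)
couchbase-cli setting-autofailover -c 127.0.0.1:8091 -u Administrator -p password \
--enable-auto-failover 1 --auto-failover-timeout 30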
If a node that was the master for a key goes down and is failed over, another live node becomes the master. In your client you point to all servers in the cluster, so if it is unable to retrieve data from one node, it tries to get it from another.
Also, if you have 2 nodes at your disposal, you can install 2 separate Couchbase servers, configure XDCR (cross datacenter replication), and check server availability with HA proxies or something similar. That way you get a single IP to connect to (the proxy's), which will automatically serve data from the live server.
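Setting up such an XDCR link with couchbase-cli looks roughly like this; the cluster names, hosts, credentials, and bucket names are all placeholders:

# Register the remote cluster, then create a bucket-to-bucket replication (all values are placeholders)
couchbase-cli xdcr-setup -c cluster-a:8091 -u Administrator -p password --create \
--xdcr-cluster-name cluster-b --xdcr-hostname cluster-b:8091 \
--xdcr-username Administrator --xdcr-password password
couchbase-cli xdcr-replicate -c cluster-a:8091 -u Administrator -p password --create \
--xdcr-cluster-name cluster-b --xdcr-from-bucket default --xdcr-to-bucket default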
Couchbase is a good fit for HA systems.
Let me explain in a few sentences how it works. Suppose you have a 5-node cluster. The application, using the client API/SDK, is always aware of the topology of the cluster (and of any change in the topology).
When you set/get a document in the cluster, the client API uses the same algorithm as the server to choose which node it should be written to. So the client selects the node using a CRC32 hash and writes to that node. Then, asynchronously, the cluster copies 1 or more replicas to the other nodes (depending on your configuration).
Couchbase has only 1 active copy of a document at a time, so it is easy to stay consistent. The applications get and set against this active copy.
In case of failure, the server has some work to do: once the failure is discovered (automatically or by a monitoring system), a "failover" occurs. This means that the replicas are promoted to active and it is now possible to work as before. Usually you then rebalance to distribute the data across the cluster properly.
The sentence you are quoting simply says that the fewer nodes you have, the bigger the impact will be in case of a failure/rebalance, since you will have to route the same number of requests to a smaller number of nodes. Fortunately, you do not lose data ;)
You can find some very detailed information about this way of working on the Couchbase CTO's blog:
http://damienkatz.net/2013/05/dynamo_sure_works_hard.html
Note: I work as a developer evangelist at Couchbase.