Why does Dataproc use GC_OPTS="-XX:+UseConcMarkSweepGC" for YARN? - hadoop-yarn

While working with Dataproc I was exploring different configurations related to Spark and YARN, and I found that Dataproc includes GC_OPTS="-XX:+UseConcMarkSweepGC" as part of its YARN environment configuration:
GC_OPTS="-XX:+UseConcMarkSweepGC"
# Log GC details to stdout, these will be in diagnostic tarballs.
GC_LOGGING_OPTS="-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails"
export YARN_TIMELINESERVER_OPTS="${GC_OPTS} ${GC_LOGGING_OPTS} ${YARN_TIMELINESERVER_OPTS}"
Is there any specific YARN performance need that calls for setting the garbage collector to the CMS collector instead of the default options?

In certain cases with very high memory usage, stop-the-world garbage collection can potentially trigger timeouts in daemons talking to the ResourceManager or NameNode. This was actually observed in some Dataproc clusters prior to reconfiguring to use CMS GC.
Optimal options may vary depending on the characteristics of the workload, but this approach is corroborated by other general Hadoop guidance, such as https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html
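For illustration, the same pattern generalizes to the other YARN daemons via the standard hooks in yarn-env.sh. A minimal sketch, assuming you want the ResourceManager to use the same flags (YARN_RESOURCEMANAGER_OPTS is the standard Hadoop hook for ResourceManager JVM options; the values simply mirror the snippet above):
# Sketch of a yarn-env.sh addition: reuse the same CMS and GC-logging
# flags for the ResourceManager daemon.
GC_OPTS="-XX:+UseConcMarkSweepGC"
GC_LOGGING_OPTS="-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails"
export YARN_RESOURCEMANAGER_OPTS="${GC_OPTS} ${GC_LOGGING_OPTS} ${YARN_RESOURCEMANAGER_OPTS}"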

Related

How to switch Ignite to maintenance mode?

What is Ignite maintenance mode, and how do I switch a node into it? I was stuck joining a node to the cluster: it complains about cleaning up the persistent data, yet the data can be cleaned (using control.sh) only in maintenance mode.
This is a special mode, similar to booting Windows in safe mode after a crash or data corruption: most of the cluster functionality is disabled and the user is asked to perform some maintenance task to resolve the issue. The most straightforward example I can think of is cleaning (removing) corrupted files on disk, just like in your question. You can refer to the IEP-53: Maintenance Mode proposal for the details.
I don't think there is a way to enter this mode manually; it is triggered by preconfigured conditions, such as stopping a node in the middle of a checkpoint with the WAL disabled. Once the state is fixed, maintenance mode is resolved automatically, allowing the node to join the cluster.
Also, from my understanding, this mode applies to a particular node rather than the whole cluster. For example, you can have a 4-node cluster with only one node in maintenance mode; in that case you have to run the control.sh commands locally against the failed node, not from another healthy node. If that's not the case, please provide more details or file a JIRA ticket, because the reported behavior looks quite broken to me.
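For completeness, here is a hedged sketch of the local cleanup flow, assuming an Ignite version that ships the IEP-53 tooling (roughly 2.11+); the --persistence subcommands may differ across versions, so check control.sh --help on your build first:
# Run locally on the node that entered maintenance mode (assumption: IEP-53 tooling).
./control.sh --persistence info              # inspect which cache data is marked corrupted
./control.sh --persistence clean corrupted   # remove the corrupted cache files
# After cleaning, restart the node so it can leave maintenance mode and rejoin the cluster.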

Flink on YARN: use yarn-session or not?

There are two ways to deploy Flink applications on YARN. The first is to use a yarn-session, with all Flink applications deployed into that shared session. The second is to deploy each Flink application on YARN as its own YARN application.
My question is: what's the difference between these two methods, and which one should I choose in a production environment?
I can't find any material about this.
I think the first method will save resources, since it needs only one JobManager (the YARN application master). But that is also its disadvantage, since the single JobManager can become a bottleneck as the number of Flink applications grows.
Both modes have their uses in production environments.
Session mode generally makes sense when you will be running a bunch of short-lived jobs, and want to avoid the overhead of starting up a cluster for each one. On the other hand, there are security implications, as any credentials available to any of the jobs will be accessible to all of the jobs. Cluster-per-job mode may use more resources overall, but is, in some sense, more straightforward.
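To make the two modes concrete, here is a hedged sketch of the CLI invocations, following the Flink-on-YARN documentation for the 1.x line; the memory sizes, parallelism, and example jar path are illustrative:
# Session mode: start one long-running Flink cluster on YARN (detached)...
./bin/yarn-session.sh -d -jm 1024m -tm 4096m
# ...then submit any number of short-lived jobs into that shared session.
./bin/flink run ./examples/streaming/WordCount.jar
# Per-job mode: each submission brings up its own JobManager on YARN.
./bin/flink run -m yarn-cluster -p 4 ./examples/streaming/WordCount.jar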

Ignite data backup to hard disk

So I'm totally new to Ignite. Is there any configuration or strategy in Ignite to export all data present in cache memory to the local hard disk?
Basically, what I'm hoping for is some kind of logger/snapshot that captures the change in data whenever an SQL update operation is performed on the data in the caches.
If someone could suggest a solution, I'd appreciate it a lot.
You can create and configure a persistence store for any cache [1]. If the cluster is restarted, all the data will still be there and can be reloaded into memory using the IgniteCache#loadCache(..) method. Out of the box, Ignite provides integration with RDBMSs [2] and Cassandra [3].
Additionally, in one of the future versions (most likely the next one, 2.1) Ignite will provide local disk persistence storage, which will allow running with a cold cache, i.e. without explicit reloading after a cluster restart. I would recommend monitoring the Apache Ignite dev and user mailing lists for more details.
[1] https://apacheignite.readme.io/docs/persistent-store
[2] https://apacheignite-tools.readme.io/docs/automatic-rdbms-integration
[3] https://apacheignite-mix.readme.io/docs/ignite-with-apache-cassandra

Does Apache Drill have any negative influence on the other members of the Hadoop ecosystem in an existing Hadoop cluster?

If I deploy Apache Drill in an existing Hadoop cluster, does it have any negative influence on the other members of the Hadoop ecosystem in that cluster?
It won't have any negative impact on the other members of the ecosystem, but it will hog a lot of the node's memory. Make sure you have enough memory before installing Drill.
It's a Hadoop-compatible component. Any negative influence only happens if you don't provide enough resources for all the members to operate.
You can co-locate Drill with HDFS on the same nodes or in the same cluster to get the best performance.
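As a hedged illustration of keeping Drill's memory appetite bounded, Drill's heap and direct-memory caps can be set in conf/drill-env.sh; the variable names are standard Drill, while the sizes below are placeholder assumptions to be tuned to your nodes:
# Sketch of conf/drill-env.sh settings that bound the Drillbit's footprint
# so other Hadoop services on the same nodes keep enough memory.
export DRILL_HEAP="4G"                  # JVM heap for the Drillbit
export DRILL_MAX_DIRECT_MEMORY="8G"     # off-heap memory used for query execution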

Real-world examples of Apache Helix, ZooKeeper, Mesos and Erlang?

I am new to:
Apache ZooKeeper : ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Apache Mesos : Apache Mesos is a cluster manager that simplifies the complexity of running applications on a shared pool of servers.
Apache Helix : Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes.
Erlang Language : Erlang is a programming language used to build massively scalable soft real-time systems with requirements on high availability.
It sounds to me like Helix and Mesos are both useful for cluster management. How are they related to ZooKeeper? It would be great if someone could give me a real-world example of their usage.
I am also curious to know how BOINC distributes tasks to its clients. Is it using any of the above technologies? (Forget about Erlang.)
I just need a brief overview :)
Erlang was built by Ericsson, designed for use in phone systems. By design, it runs hundreds, thousands, or even tens of thousands of small processes that handle tasks by sending information between them instead of sharing memory or state. This enables all sorts of interesting features that are great for high-availability distributed systems, such as:
Hot code reloading. Each process is paused, its relevant module code is swapped out, and it is resumed where it left off, so deploys can happen without restarting or causing significant interruption.
Easy distributed messaging and clustering. Sending a message to a local process or a remote one is fairly seamless in most instances.
Process-local GC. Garbage collection happens in each process independently instead of as a global stop-the-world event like Java's, aiding low-latency results.
Supervision trees, with complex process hierarchies and monitoring/management.
A few concrete real-world examples that make great use of Erlang:
MongooseIM: a highly performant and incredibly scalable distributed XMPP / chat server.
Riak: a distributed key/value store.
Mesos, on the other hand, you can think of as effectively turning a datacenter of servers into a platform for teams and developers. If I, as a company, own a datacenter with 10,000 physical servers and have 1,000 engineers developing hundreds of services, I need a good way to allow the engineers to deploy and manage services across that hardware without needing to worry about the servers directly. Mesos is an abstraction layer on top of the physical servers that allows you to share and intelligently allocate resources.
As a user of Mesos, I might say that I have Service X. It's an executable bundle that lives in location Y. Each instance of Service X needs 4 GB of RAM and 2 cores. And I need 8 instances, which will be attached to a load balancer. You can specify this in configuration and deploy based on that config. Mesos will find hardware that has enough RAM and CPU capacity available to handle each instance of that service and start it running in each of those locations.
It can handle a lot of other more complex topics about the orchestration of them as well, but that's probably a bit in-depth for this :)
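As a hedged sketch, this is roughly what such a service description looks like when submitted to Marathon, one of the schedulers that runs on top of Mesos; the service name, host, command, and sizes are illustrative assumptions:
# Hypothetical Marathon app definition: 8 instances of Service X,
# each with 2 cores and 4 GB of RAM, posted to Marathon's REST API.
curl -X POST http://marathon.example.com:8080/v2/apps \
  -H "Content-Type: application/json" \
  -d '{
        "id": "/service-x",
        "cmd": "./service-x --port $PORT0",
        "cpus": 2,
        "mem": 4096,
        "instances": 8
      }'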
ZooKeeper's most common use cases are service discovery and configuration management. You can think of it, fundamentally, a bit like a nested key-value store, where services can look at pre-defined paths to see where other services currently live.
A simple example: I have a web service using a shared database cluster. I know a simple name for that database cluster and where its configuration lives in ZooKeeper. I can look up (or repeatedly poll) that path in ZooKeeper to check the addresses of the active database hosts. And on the other side, if I take a database node out of rotation and replace it with a new one, the config in ZooKeeper gets updated with the new address, and anything continually watching it will detect the change and update where it connects.
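A hedged sketch of that lookup with the zkCli.sh shell that ships with ZooKeeper; the znode path /services/mydb/active-hosts is a made-up convention agreed upon by the services, not anything ZooKeeper mandates:
# Read the current database hosts from a well-known znode path.
./bin/zkCli.sh -server zk1.example.com:2181 get /services/mydb/active-hosts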
A more complex use case for ZooKeeper is how Kafka uses it (or did at the time I last used Kafka). Kafka has streams, and streams have many shards. Each consumer of each stream uses ZooKeeper to save a checkpoint in each shard after it has read and processed up to a certain point in the stream. That way, if the consumer crashes or is restarted, it knows where to pick up in the stream.
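For reference, a hedged sketch of where the legacy ZooKeeper-based Kafka consumers kept those checkpoints (newer Kafka consumers store offsets in Kafka itself, so this layout only applies to the old consumer; group and topic names are illustrative):
# Legacy layout: /consumers/<group>/offsets/<topic>/<partition>
./bin/zkCli.sh -server zk1.example.com:2181 get /consumers/my-group/offsets/my-topic/0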
I don't know about Mesos and the Erlang language, but this article might help you with Helix and ZooKeeper.
This article tells us:
Zookeeper is responsible for gluing all parts together, whereas Helix is the cluster management component that registers all cluster details (the cluster itself, nodes, and resources).
The article is about clustering in jBPM using Helix and ZooKeeper, but it will give you a basic idea of what Helix and ZooKeeper are used for.
And from most of the articles I read online, it seems like ZooKeeper and Helix are used together.
Apache Zookeeper can be installed on a single machine or on a cluster.
It can be used to keep track of logs. It can provide various services on a distributed platform.
Storm and Kafka rely on Zookeeper.
Storm uses Zookeeper to store all state so that it can recover from an outage in any of its (distributed) component services.
Kafka queue consumers can use Zookeeper to store information on what has been consumed from the queue.