With the latest Ignite release (2.4), embedded deployment of Ignite was deprecated; the original discussion is on the developer forum:
http://apache-ignite-developers.2346864.n4.nabble.com/Deprecate-IgniteRDD-in-embedded-mode-td24867.html
1) However, it is not clear from the documentation what advantage YARN deployment has over embedded. Could this be explained? Wouldn't YARN deployment have similar shortcomings to embedded?
2) My use case involves using Ignite to create a distributed cache while computing in Spark. Would a vanilla deployment of Ignite in a different/same cluster make more sense than a YARN deployment in my Spark cluster?
I guess it was deprecated because adding and removing server nodes to the topology on a whim leads to an expensive and error-prone process of rebalancing caches between nodes. Data may be lost if there are insufficient backups, or it will need to be transferred between nodes whenever this happens. You can also get cluster failures if too few nodes are kept alive during a run.
It is much better to start all the needed nodes before work begins, avoid changing the topology while work is underway, and kill all the nodes once they're no longer needed. That's what YARN deployment tries to do.
Vanilla deployment may make more sense if the lifecycle of the Ignite cluster is longer than the lifecycle of the work you run on it.
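For illustration, here is a minimal sketch of the non-embedded usage from Spark, assuming a standalone Ignite cluster is already running and a Spring XML config file points at it; "ignite-config.xml" and "myCache" are illustrative names, not anything from the original post:

// A hedged sketch, not the official recipe: using IgniteRDD against a
// standalone Ignite cluster instead of the deprecated embedded mode.
import org.apache.ignite.spark.JavaIgniteContext;
import org.apache.ignite.spark.JavaIgniteRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class StandaloneIgniteFromSpark {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("ignite-standalone-demo"));

        // standalone = true: executors join the already-running Ignite cluster
        // as client nodes rather than starting embedded server nodes.
        JavaIgniteContext<String, Integer> ic =
            new JavaIgniteContext<>(sc, "ignite-config.xml", true);

        // Expose an existing Ignite cache as a Spark pair RDD.
        JavaIgniteRDD<String, Integer> cacheRdd = ic.fromCache("myCache");
        System.out.println("entries in cache: " + cacheRdd.count());

        sc.stop();
    }
}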
We use the Bitnami redis-cluster chart to deploy our cluster into AKS.
When the Kubernetes pods are being created, the master nodes seem to load the module just fine.
The replica nodes crash a couple of seconds after loading the module, with exit code 137.
We build the RediSearch module in coordinator mode like so:
RUN make setup
RUN make build COORD=1
Using the config option, we load the RediSearch module as follows:
module load /opt/bitnami/redis/modules/module-oss.so OSS_GLOBAL_PASSWORD ourPassword
Redis Version: 6.2.6 (tried 6.2.7)
RediSearch Version: 2.2.10 (tried 2.4.5)
Bitnami redis-cluster Version: 7.5.2
We have already raised an issue on the RediSearch GitHub page here. It has details of the stack trace, the way we're building our RediSearch module, and the Dockerfile for Redis itself.
Deploying the cluster without modules works completely fine.
Deploying the cluster without modules and then manually loading the module via the CLI also seems to work fine; a sketch of that workaround is below.
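For completeness, the manual load that works looks roughly like this (the pod name is a placeholder for one of our cluster pods, and ourPassword stands in for the real password):

kubectl exec -it redis-cluster-0 -- redis-cli -a ourPassword MODULE LOAD /opt/bitnami/redis/modules/module-oss.so OSS_GLOBAL_PASSWORD ourPassword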
Doing a double deployment, one without module loading and a second one with the module-loading command, also seems to work, but is not really an option, as we don't want to run our release pipeline twice as a workaround.
I tried using a PostStart lifecycle hook to load the module, but it failed because the Redis node is not yet available when the hook fires.
Any suggestions or ideas would be highly appreciated. Thanks
Is there a native Spinnaker way to clean up old AMIs after a successful deployment has taken place?
It's nice that the previous version of the newest deployment is still available in AWS, but the older ones keep adding up and thus incur not only cost but also confusion.
Thanks.
Nothing native in Spinnaker. Janitor Monkey will do this, however, and cleans up a bunch of other unused AWS artifacts.
It's very possible Spinnaker will support this natively in the future - just not right now.
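Until then, a rough stopgap outside Spinnaker is the AWS CLI. This sketch only illustrates the relevant calls; deciding which AMIs are actually safe to remove is up to you, and the IDs below are placeholders:

# list your own AMIs sorted by creation date, to pick cleanup candidates
aws ec2 describe-images --owners self --query 'sort_by(Images, &CreationDate)[].[ImageId,Name,CreationDate]' --output table

# deregister an old AMI, then delete its backing snapshot
aws ec2 deregister-image --image-id ami-0123456789abcdef0
aws ec2 delete-snapshot --snapshot-id snap-0123456789abcdef0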
Documentation on this is quite sparse, but are there any tips on how to speed up builds on CloudBees, especially when using the Workflow plugin?
Usually, when using the very same machine for subsequent builds, you can make use of caches or reuse previous computations.
Some computations, like downloading dependencies with SBT, Maven, or Gradle, the initial npm install, or restoring a Gemfile cache, are expensive in time and compute but are great candidates for caching.
On CloudBees you will most probably get a random (new) node for your builds, so there's no cache.
We are also using Snap-CI, where we have a persistent CACHE_DIR that allows this. Is there anything similar on CloudBees?
If you are referring to DEV#cloud, CloudBees’ hosted Jenkins, there is a cached workspace system, though it is not used for every build. (Depends on detail of hardware allocation in the cloud.) If you run a number of builds, over time you should see most of them picking up an existing workspace, and thus being able to use Maven local repository caches, etc.
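If you want Maven's cache to live inside that (potentially reused) workspace rather than in the node's home directory, one hedged trick is to point the local repository at the workspace in a shell step; the path here is an arbitrary choice, not a CloudBees convention:

mvn -Dmaven.repo.local="$WORKSPACE/.m2/repository" clean install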
Use of the Workflow plugin as opposed to freestyle or other project types should not matter in this regard.
I have set up a single-node cluster of Hadoop 2.6, but I need to integrate ZooKeeper and HBase with it.
I am a beginner with no prior experience with big data tools.
How do you set up ZooKeeper to coordinate a Hadoop cluster, and how do we use HBase over HDFS?
How do they combine to make an ecosystem?
For standalone mode, just follow the steps provided in the HBase quickstart guide: http://hbase.apache.org/book.html#quickstart
HBase's standalone mode makes it easy for starters to get going: all HBase daemons and a local ZooKeeper run in a single JVM process, using the local filesystem instead of HDFS.
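Roughly, the quickstart boils down to the following (assuming you have unpacked an HBase release and set JAVA_HOME in conf/hbase-env.sh; the table and column family names are just examples):

bin/start-hbase.sh                      # starts HBase with its embedded ZooKeeper
bin/hbase shell                         # open the HBase shell
create 'test', 'cf'                     # create a table with one column family
put 'test', 'row1', 'cf:a', 'value1'    # insert a cell
scan 'test'                             # read it back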
It depends on the kind of system that you want to build. As you said, the Hadoop ecosystem is made of three major components: HBase, HDFS, and ZooKeeper. Although they can be installed independently of each other, sometimes there is no need to install them all, depending on the kind of cluster that you want to set up.
Since you are using a single-node cluster, there are two HBase run modes you can choose from: standalone mode and pseudo-distributed mode. In standalone mode there is no need to install HDFS or ZooKeeper; HBase will do everything in a transparent way. If you want to use pseudo-distributed mode, you can run HBase against the local filesystem or against HDFS. If you want to use HDFS, you'll have to install Hadoop. As for ZooKeeper, again, HBase will do the job by itself (you just need to tell it that through the configuration files; a minimal sketch follows below).
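For instance, a minimal hbase-site.xml for pseudo-distributed mode on top of HDFS could look like this; the NameNode address must match the fs.defaultFS of your Hadoop installation, so hdfs://localhost:9000 is only an assumption:

<configuration>
  <property>
    <!-- run the daemons as separate processes instead of one standalone JVM -->
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <!-- store HBase data in HDFS; adjust host/port to your fs.defaultFS -->
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
</configuration>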
If you want to use HDFS in pseudo-distributed mode, downloading Hadoop will give you both HDFS and MapReduce. If you don't want to execute MapReduce jobs, just ignore its tools.
If you want to learn more, I think that this guide explains it all very well: https://hbase.apache.org/book.html (check the HBase run modes).
I want to use Hadoop 2.6.0, and by default it is in YARN mode. So should I write a YARN application like this:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
Or should I just write a MapReduce application as usual? And what is the function of this YARN application?
I need your suggestions. Thanks, all.
Think of YARN as a data operating system and MapReduce as an application that runs on top of YARN.
So your existing MapReduce code should work without any modifications even in YARN mode; the classic word count sketched below, for example, runs unchanged.
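As an illustration, this is the stock word-count example from the Hadoop MapReduce tutorial (not code specific to your setup); there is nothing YARN-specific in it, and submitted with hadoop jar it runs the same way regardless of the resource manager underneath:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();           // sum the counts for each word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}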
The page you linked shows how you could develop your own applications on top of YARN, which hides away the details of resource allocation, multi-tenancy, distributed programming, failover, and so on. For example, the MapReduce framework itself was rewritten as a YARN application so that it could run on top of YARN. This lets YARN run multiple applications (MapReduce, Spark, Tez, Storm, etc.) simultaneously on the same cluster.