Is it possible to have nodes from multiple datacenters join the same Spark cluster? - datastax

I am running a DataStax Enterprise cluster (with GossipingPropertyFileSnitch). I have two datacenters, Analytics and Cassandra. The Analytics nodes form a Spark cluster. I am considering merging the two to make better use of resources.
When I enable Spark (in /etc/dse/default) on my Cassandra nodes, I get a new master, and it seems those nodes aren't joining the same Spark cluster as the Analytics nodes. Can I somehow make the Cassandra datacenter nodes join the Analytics Spark cluster?

Because you're using GossipingPropertyFileSnitch, you must also change which DC the new Spark nodes belong to (the DC is set per node in cassandra-rackdc.properties); otherwise they will remain in the datacenter named "Cassandra".
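For reference, a minimal sketch of what that might look like in cassandra-rackdc.properties on each node you move (assuming your Analytics DC is literally named "Analytics"; changing the DC of a node that already holds data has operational consequences, so check the DSE documentation first):
dc=Analytics
rack=rack1
The node has to be restarted for the new DC assignment to take effect.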
Edit:
The short answer to your headline question is "no": separate DCs are assigned separate Spark masters and don't share resources for Spark jobs.

Related

Join on two Ignite Cluster is it possible?

In our project we are using Ignite; we have multiple Ignite clusters and we use ZooKeeper discovery. I wanted to know whether Ignite supports a join across two different Ignite clusters; if yes, please share the approach.
I found a few inputs on this, but they were not that helpful:
Apache Ignite: caching ClusterGroup
Communication between two Ignite clusters (maybe merging two Ignite clusters into one)
Well, if they are two independent clusters, they are independent. The most common scenario I can think of is having a master and a replica cluster with synchronization in between.
If you want to run SQL over multiple nodes, they have to be in a single cluster. There is no hard limit on how many nodes you can have; I know that some companies run hundreds of nodes. But it can be tricky in terms of maintenance, for example using ZooKeeper discovery and paying close attention to the network.
If you really need to join results from two completely independent clusters, you will have to do it manually, i.e. get a result from one, then from the second, and do the aggregation/processing yourself.
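A minimal sketch of that manual approach, assuming both clusters expose the thin-client port (10800) and each holds a hypothetical "Orders" cache with a matching SQL schema:

import org.apache.ignite.Ignition
import org.apache.ignite.client.IgniteClient
import org.apache.ignite.cache.query.SqlFieldsQuery
import org.apache.ignite.configuration.ClientConfiguration
import scala.jdk.CollectionConverters._

object CrossClusterJoin extends App {
  // Hypothetical thin-client addresses for the two independent clusters.
  val clusterA: IgniteClient = Ignition.startClient(new ClientConfiguration().setAddresses("cluster-a:10800"))
  val clusterB: IgniteClient = Ignition.startClient(new ClientConfiguration().setAddresses("cluster-b:10800"))

  // Run the per-cluster part of the query on each cluster separately ...
  def partialSums(client: IgniteClient): Seq[(Any, Double)] =
    client.cache[AnyRef, AnyRef]("Orders")
      .query(new SqlFieldsQuery("SELECT customerId, SUM(total) FROM Orders GROUP BY customerId"))
      .getAll.asScala.toSeq
      .map(row => (row.get(0), row.get(1).asInstanceOf[Number].doubleValue))

  // ... then do the final aggregation/processing in application code.
  val merged: Map[Any, Double] =
    (partialSums(clusterA) ++ partialSums(clusterB))
      .groupBy(_._1)
      .map { case (customer, rows) => customer -> rows.map(_._2).sum }

  println(merged)
  clusterA.close()
  clusterB.close()
}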

Reduce Redis cluster to single GCP memorystore

I have 3 Redis instances. One is the master and the other two are slaves. I connected to the master node and got info with the redis-cli INFO command. I can see the parameter cluster_enabled:0 and
#Replication
role:master
connected_slaves:2
slave0:ip=xxxxx,port=6379,state=online,offset=15924636776,lag=1
slave1:ip=xxxxx,port=6379,state=online,offset=15924636776,lag=0
As for the keyspace, each node has different DBs. I need to migrate all the data to a single Memorystore instance in GCP, but I don't know how. Can anyone help me?
Since the two nodes are slaves and clustering is not enabled, you only need to replicate the master node. RIOT is a great tool for migrating data in and out of Redis.
However, if by "DB by node" you mean Redis DBs that you access with SELECT, then you'll need to prefix keys, as there may be overlap between the keysets of the DBs.
I think setting up another Redis cluster in a single node configuration is the least of your worries.
The real challenge for you would be migrating all your records over to the new setup. This is not a simple question to answer and would depend heavily on multiple factors:
The total size of your data being migrated
Is this a live database in production?
Do you want to keep the two DB schemas in your new configuration separate?
OK, I believe your Redis instances are currently hosted on Google Compute Engine, and you are looking to migrate to Memorystore for Redis.
As mentioned here, you can leverage Redis snapshots for this. The documentation gives step-by-step instructions, using GCS buckets as transient storage, to:
import data into Cloud Memorystore instances using RDB (Redis Database Backup) snapshots, as well as back up data from existing Redis instances.
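A hedged sketch of that flow (the master address, dump path, bucket name, instance name, and region below are all placeholders):

redis-cli -h MASTER_IP BGSAVE
gsutil cp /var/lib/redis/dump.rdb gs://my-redis-backups/dump.rdb
gcloud redis instances import gs://my-redis-backups/dump.rdb my-memorystore --region=us-central1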

Setting up a Hadoop Cluster on Amazon Web services with EBS

I was wondering how I could set up a Hadoop cluster (say 5 nodes) on AWS. I know how to create the cluster on EC2, but I don't know how to handle the following challenges:
What happens if I lose a spot instance? How do I keep the cluster going?
I am working with datasets of about 1 TB. Would it be possible to size the EBS volumes accordingly? How would I access HDFS in this scenario?
Any help will be great!
These suggestions would change depending on your requirements. However, assuming a 2-master, 3-worker setup, you could use r3 instances for the master nodes, since they are optimized for memory-intensive applications, and d2 instances for the worker nodes. d2 instances have multiple local disks and can therefore withstand some disk failures while still keeping your data safe.
To answer your specific questions,
Treat Hadoop machines like any other Linux application: what would happen if your regular CentOS spot instances were lost? Hence, it is generally advised to use reserved instances.
Hadoop typically stores data by keeping 3 copies of each block (128 or 256 MB) distributed across the worker nodes. So you will have about 3 TB of data to store across the three worker nodes, and you have to allow for some overhead when calculating space requirements.
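As a rough worked example under those assumptions (1 TB of input, replication factor 3, three workers, plus some slack for intermediate job output):
1 TB x 3 replicas = 3 TB of raw HDFS capacity
3 TB / 3 workers = roughly 1 TB of HDFS disk per worker, before overhead
so budgeting on the order of 1.25-1.5 TB of disk per worker node would be a safer starting point.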
You can use AWS's EMR service - it is designed specifically for Hadoop clusters on top of EC2 instances.
It is fully managed, and it comes pre-packaged with all the services you need for Hadoop.
Regarding your questions:
There are three main types of nodes in hadoop:
Master - a single node; it should not be a spot instance.
Core - a node that handles tasks and holds part of HDFS.
Task - a node that handles tasks but does not hold any part of HDFS.
If task nodes are lost (e.g. because they are spot instances), the cluster will continue to work without problems.
Regarding storage, the default replication factor in EMR is as follows:
1 for clusters < four nodes
2 for clusters < ten nodes
3 for all other clusters
But you can change it - http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hdfs-config.html
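For example, a sketch of an EMR configuration classification (passed with --configurations when creating the cluster) that pins the replication factor to 2; the value here is only an illustration:

[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "2"
    }
  }
]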

Apache Ignite to Amazon S3 connectivity

I want to know how to load data from Amazon S3 into an Apache Ignite cluster. Would a single-node or a multi-node cluster be required?
You can load data into any cluster, single-node or multi-node, as long as your data set fits in the memory of that cluster. Please refer to this documentation page for information about data loading: https://apacheignite.readme.io/docs/data-loading
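A minimal sketch of such a load using the data streamer described on that page (the cache name and the local file standing in for your S3 download are hypothetical):

import org.apache.ignite.Ignition
import scala.io.Source

object LoadIntoIgnite extends App {
  val ignite = Ignition.start()                              // or Ignition.start("ignite-config.xml")
  ignite.getOrCreateCache[String, String]("s3Data")          // make sure the target cache exists

  // Hypothetical: these lines would come from your S3 object (AWS SDK, s3a://, a prior download, ...).
  val lines = Source.fromFile("/tmp/exported-from-s3.csv").getLines()

  val streamer = ignite.dataStreamer[String, String]("s3Data")
  try {
    lines.zipWithIndex.foreach { case (line, i) => streamer.addData(i.toString, line) }
  } finally {
    streamer.close()                                         // close() flushes any buffered entries
  }
}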
You can use Spark + Ignite as a workaround: Spark reads from S3 and then writes to Ignite, as explained in the Ignite examples.
Also, you can use Spark Structured Streaming with Trigger.Once just to write incremental files to Ignite, i.e. combine Spark Structured Streaming and Ignite.
https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
https://github.com/apache/ignite/blob/85af9c789a109f7f067145972a82693c7d28b4a9/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala#L74
https://github.com/apache/ignite/blob/85af9c789a109f7f067145972a82693c7d28b4a9/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala#L89
I'm not sure whether Spark can overwrite Ignite tables; a workaround would be to create a DataFrame over the existing Ignite data, union it with the latest data, and then overwrite the Ignite table.
https://github.com/apache/ignite/blob/85af9c789a109f7f067145972a82693c7d28b4a9/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala#L113
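A hedged sketch of the Spark-to-Ignite write path referenced in those examples (the bucket, table name, primary key column, and Ignite config path are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.ignite.spark.IgniteDataFrameSettings._

object S3ToIgnite extends App {
  val spark = SparkSession.builder().appName("s3-to-ignite").getOrCreate()

  // Spark reads straight from S3 via the s3a connector.
  val df = spark.read.option("header", "true").csv("s3a://my-bucket/input/")

  df.write
    .format(FORMAT_IGNITE)                                   // "ignite" data source from ignite-spark
    .option(OPTION_CONFIG_FILE, "/etc/ignite/ignite-config.xml")
    .option(OPTION_TABLE, "my_table")
    .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "id")
    .mode("append")                                          // see the caveat above about overwrite behaviour
    .save()

  spark.stop()
}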

Can we copy Apache Ignite Cluster to another Ignite cluster?

I want to back up an entire Ignite cluster so that the backup cluster can be used if the original (active) cluster is down. Is there an approach for this?
If you need two separate clusters with replication across datacenters, it would be better to look at GridGain, which supports Datacenter Replication.
Unfortunately, Ignite does not support DR.
With Apache Ignite you can logically divide your cluster into two zones so that every zone is guaranteed to contain a full copy of the data. However, there is no way to choose the primary node for partitions manually. See AffinityFunction and the affinityBackupFilter() method of the standard implementations.
As answered above, a ready-made solution is only available in the paid version. Open-source Apache Ignite does provide the ability to take a cluster-wide snapshot. You can add a cron job on your Ignite cluster to take this snapshot, and another job to copy the snapshot data to object storage such as S3.
On the other side, you download this data node by node into the work directories of the respective nodes, following the manual restore procedure, and start the cluster. It should activate automatically once all baseline nodes have started successfully, and your cluster is then ready to use.
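A rough sketch of that snapshot-plus-cron idea (this uses the snapshot command available from Ignite 2.11; the paths, snapshot name, and bucket are assumptions):

# nightly cluster-wide snapshot, created from one node
0 2 * * * /opt/ignite/bin/control.sh --snapshot create nightly_$(date +\%Y\%m\%d)
# second job to ship the snapshot directory (work/snapshots by default) to S3
30 2 * * * aws s3 sync /opt/ignite/work/snapshots s3://my-ignite-backups/snapshots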