How can I load data from Amazon S3 into an Apache Ignite cluster? Is a single-node or a multi-node cluster required?
You can load data into any cluster, single node or multi node, as long as your data set fits in the memory of that cluster. Please refer to this documentation page for information about data loading: https://apacheignite.readme.io/docs/data-loading
You can use Spark + Ignite as a workaround: Spark reads from S3 and then writes to Ignite, as explained in the Ignite examples.
Also, you can use Spark Structured Streaming with a trigger-once setting to write only the incremental files to Ignite, combining Spark Structured Streaming and Ignite.
https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
https://github.com/apache/ignite/blob/85af9c789a109f7f067145972a82693c7d28b4a9/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala#L74
https://github.com/apache/ignite/blob/85af9c789a109f7f067145972a82693c7d28b4a9/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala#L89
I'm not sure whether Spark overwrites Ignite tables. A workaround would be to create a data frame on the existing Ignite data, union it with the latest data, and then overwrite the Ignite table.
https://github.com/apache/ignite/blob/85af9c789a109f7f067145972a82693c7d28b4a9/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala#L113
Related
I have 3 Redis instances. One is the master and the other two are slaves. I connected to the master node and ran the INFO command with redis-cli. I can see the parameter cluster_enabled:0 and
#Replication
role:master
connected_slaves:2
slave0:ip=xxxxx,port=6379,state=online,offset=15924636776,lag=1
slave1:ip=xxxxx,port=6379,state=online,offset=15924636776,lag=0
As for the keyspace, each node has different DBs. I need to migrate all the data to a single Memorystore instance in GCP, but I don't know how. Can anyone help me?
Since the two nodes are slaves and clustering is not enabled, you only need to replicate the master node. RIOT is a great tool for migrating data in and out of Redis.
However, when you say DBs per node, do you mean the Redis DBs that you access with SELECT? In that case you'll need to prefix the keys, as the keysets of the DBs may overlap.
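To make the prefixing idea concrete, here is a minimal sketch in plain Python. It assumes the "DBs" are the numbered logical databases reached with SELECT, and it models each one as a dict instead of talking to a live server. The `db{n}:` prefix format and the function names are my own illustration, not something Redis or RIOT prescribes.

```python
from typing import Dict, List


def prefixed_key(db_index: int, key: str) -> str:
    """Namespace a key by its source DB, e.g. 'db1:user:1' (hypothetical convention)."""
    return f"db{db_index}:{key}"


def overlapping_keys(dbs: Dict[int, Dict[str, str]]) -> List[str]:
    """Return the keys that exist in more than one logical DB."""
    owners: Dict[str, List[int]] = {}
    for db_index, data in dbs.items():
        for key in data:
            owners.setdefault(key, []).append(db_index)
    return sorted(key for key, indexes in owners.items() if len(indexes) > 1)


def merge_with_prefix(dbs: Dict[int, Dict[str, str]]) -> Dict[str, str]:
    """Flatten all DBs into one keyspace, prefixing every key with its DB index."""
    merged: Dict[str, str] = {}
    for db_index, data in dbs.items():
        for key, value in data.items():
            merged[prefixed_key(db_index, key)] = value
    return merged


if __name__ == "__main__":
    # Toy stand-in for two logical DBs that share the key "user:1".
    dbs = {0: {"user:1": "alice"}, 1: {"user:1": "bob", "cart:9": "empty"}}
    print(overlapping_keys(dbs))
    print(merge_with_prefix(dbs))
```

Running the overlap check first tells you whether prefixing is needed at all. If no key appears in more than one DB, the keys can be copied over unchanged.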
I think setting up another Redis cluster in a single node configuration is the least of your worries.
The real challenge for you would be migrating all your records over to the new setup. This is not a simple question to answer and would depend heavily on multiple factors:
The total size of your data being migrated
Is this a live database in production?
Do you want to keep the two DB schemas in your new configuration separate?
Ok, I believe currently your Redis Instances are hosted on Google Compute Engine.
And you are looking to migrate to Memorystore for Redis.
As mentioned here, you can leverage Redis snapshots for this. The linked guide gives step-by-step instructions on how to achieve this, using GCS buckets as transient storage.
You can import data into Cloud Memorystore instances using RDB (Redis Database Backup) snapshots, as well as back up data from existing Redis instances.
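The snapshot route can be sketched as three commands: dump an RDB file from the source, copy it to a GCS bucket, and import it into Memorystore. The helper below only builds and prints the commands, it does not run them. The host, bucket, file, instance and region values are placeholders I made up, so please check the flags against the current redis-cli and gcloud documentation before relying on them.

```python
import shlex
from typing import List


def migration_commands(
    source_host: str, bucket: str, rdb_file: str, instance_id: str, region: str
) -> List[List[str]]:
    """Build the command sequence for an RDB-based migration (all arguments are placeholders)."""
    gcs_uri = f"gs://{bucket}/{rdb_file}"
    return [
        # 1. Ask the source Redis server for an RDB dump and store it locally.
        ["redis-cli", "-h", source_host, "--rdb", rdb_file],
        # 2. Upload the dump to the GCS bucket used as transient storage.
        ["gsutil", "cp", rdb_file, gcs_uri],
        # 3. Import the dump into the Memorystore instance.
        ["gcloud", "redis", "instances", "import", gcs_uri, instance_id, f"--region={region}"],
    ]


if __name__ == "__main__":
    commands = migration_commands(
        "10.0.0.5", "example-bucket", "dump.rdb", "example-instance", "us-central1"
    )
    for command in commands:
        print(shlex.join(command))
```

As far as I know, an import replaces whatever data is already in the target instance, so if you have several source datasets you would merge them before the import instead of importing one dump after another.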
I am new to both PySpark and AWS EMR. I have been given a small project where I need to scrub large amounts of data files every hour and build aggregated data sets based on them. These data files are stored on S3, and I can use some of the basic functions in Spark (like filter and map) to derive the aggregated data. To save on egress costs, and after performing a cost-benefit analysis, I decided to create an EMR cluster and make pyspark calls. The concept is working fine using Lambda functions triggered when a file is created in the S3 bucket. I am writing the output files back to S3.
But I am not able to comprehend the need for the 3 node EMR cluster I created and its use for me. How can I use the Hadoop file system to my advantage here and all the storage that is made available on the nodes?
How do I view (if possible) the utilization of the slave/core nodes in the cluster? How do I know they are used, how often, etc etc? I am executing the pyspark code on the master node.
Are there alternatives to EMR that I can use with pyspark?
Is there any good documentation available to get a better understanding?
Thanks
Spark is a framework for distributed computing. It can process larger-than-memory datasets and split the workload into chunks that run on multiple workers in parallel. By default EMR creates 1 master node and 2 worker nodes. The disk space on the Spark nodes is typically not used directly. Spark can use the space to cache temporary results.
To use a Hadoop filesystem, you need to start an HDFS service in AWS.
However, S3 is also distributed storage. It is supported by the Hadoop libraries. Spark on EMR ships with Hadoop drivers and supports S3 out of the box. Using Spark with S3 is a perfectly valid storage solution and will be good enough for a lot of basic data processing tasks.
There is a Spark manager UI in AWS EMR. You can see each running Spark application session and the current job. By clicking on the job you can see how many executors are used. Whether those executors run on all nodes depends on your Spark memory and CPU configuration. Tuning those is a really big topic. There are good hints here on SO.
There is also a hardware monitoring tab, showing cpu and memory usage for each node.
The Spark code is always executed on the master node, but it only creates a DAG plan on that node and shifts the actual work to the worker nodes according to the plan. Hence the guides speak of submitting a Spark application rather than executing it.
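As a rough illustration of that split, here is a small PySpark sketch of a filter/map aggregation. The `key,value` line format, the S3 paths and the script name `job.py` are assumptions of mine, not taken from the question. The calls inside `run_job` run on the driver and only describe the plan; `parse_line` and the filter lambda are shipped to the executors, and nothing is computed until `saveAsTextFile` is called.

```python
import sys
from operator import add
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, float]]:
    """Parse a 'key,value' line into (key, value); return None for malformed rows."""
    parts = line.strip().split(",")
    if len(parts) != 2:
        return None
    try:
        return parts[0], float(parts[1])
    except ValueError:
        return None


def run_job(spark, input_path: str, output_path: str) -> None:
    """Build the plan on the driver; executors do the work once the action runs."""
    lines = spark.sparkContext.textFile(input_path)
    totals = (
        lines.map(parse_line)                   # executed on the workers
        .filter(lambda rec: rec is not None)    # drop malformed rows
        .reduceByKey(add)                       # sum the values per key
    )
    # Action: this triggers the distributed computation and writes the result.
    totals.saveAsTextFile(output_path)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: spark-submit job.py <input_path> <output_path>")
    else:
        # Imported here so the parsing logic can be tested without Spark installed.
        from pyspark.sql import SparkSession

        session = SparkSession.builder.appName("hourly-aggregation").getOrCreate()
        run_job(session, sys.argv[1], sys.argv[2])
        session.stop()
```

On the cluster you would hand this to Spark with something like `spark-submit job.py s3://example-bucket/input/ s3://example-bucket/output/`, and then check in the Spark UI how many executors picked up tasks.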
Yes. You can start your own Spark cluster on normal EC2 instances. There is even a standalone mode, which allows you to start Spark on only one machine. That installs quite a large footprint, and you still need to tune the memory, CPU and executor settings. So it is quite complex compared to just implementing some multiprocessing in Python or using Dask. However, there are valid reasons to do so. It allows you to use all cores on one machine, and it gives you a well-known, well-documented API, the same one that can be used to process petabytes of data. The linked article above explains the motivation.
Another possibility is to use AWS Glue. It is serverless Spark. The service will submit your jobs to some on-demand Spark nodes on AWS, over which you have no control, similar to how Lambda functions run on random AWS EC2 instances. However, Glue has some limitations. With PySpark on Glue, you cannot install Python libraries with C extensions, e.g. numpy, pandas and most ML libraries. Glue also forces you to create a schema mapping of your data in the Athena catalog, whereas standalone Spark can just process the data on the fly.
Databricks also offers a separate serverless spark solution outside of AWS. It is more sophisticated in my opinion. It also allows custom c-extensions.
A big part of the official documentation focuses on the different data processing APIs and not on the internals of Apache Spark. There are some good notes on Spark internals on GitHub. I assume every good book will cover some of the inner workings of Spark. AWS EMR is just an automated Spark cluster with the YARN orchestrator. (Unfortunately, I have never read a good book on Spark. I picked up information here and there, so I cannot recommend one.)
I am trying to speed up Spark SQL queries by introducing Ignite as a cache layer, using IgniteRDD. In the example from the Ignite docs, data is loaded from an Ignite cache to construct the RDD. But in our use case the data may be too big to fit into Ignite memory. We actually keep the data in HBase, so is it possible to do one of the following:
1. Construct an IgniteRDD by loading data from HBase
2. Just use Ignite to cache and share the RDD generated by Spark SQL, to speed up Spark SQL
There are three possible usage scenarios.
First approach. If you run Ignite SQL queries from Spark using igniteRdd.sql(...) method then all the data must be stored in an Ignite cluster. Ignite SQL engine cannot query an underlying 3rd party persistence layer if not all the data is cached in memory. But if you enable Ignite persistence and store all your data there instead of HBase then you can cache as much data as possible and run SQL safely since Ignite can query its own persistence.
The second approach is to use HBase as a cache store (you need to implement your own version since there's nothing out-of-the-box) and use Spark SQL queries instead of Ignite SQL, because the latter requires all the data to be cached in RAM if Ignite persistence is not used.
The third approach is to try out the Ignite in-memory file system (IGFS) and the Hadoop accelerator. IGFS and the accelerator are deployed on top of HDFS. However, here you cannot use the IgniteRDD API because all the operations will go through this pipeline: Spark->HBase->IGFS+Accelerator+HDFS.
If I were to choose I would go for the first approach.
Apart from the three approaches above, if you have the flexibility to add another component, use Apache Phoenix. It supports integration with Spark SQL. You can check it on their official website. In this case you will not need Apache Ignite.
I want to back up an entire Ignite cluster so that the backup cluster can be used if the original (active) cluster is down. Is there any approach for this?
If you need two separate clusters with replication across data centers, it would be better to look at the GridGain solution, which supports Datacenter Replication.
Unfortunately, Ignite does not support DR.
With Apache Ignite you can logically divide your cluster into two zones to guarantee that every zone contains a full copy of the data. However, there is no way to choose the primary node for partitions manually. See AffinityFunction and the affinityBackupFilter() method of the standard implementations.
As answered above, a ready-made solution is only available in the paid version. Open source Apache Ignite provides the ability to take a cluster-wide full snapshot. You can add a cron job in your Ignite cluster to take this snapshot, and add another job to copy the snapshot data to object storage like S3.
On the other side, download this data node by node into the work directories of the respective nodes, as per the manual restore procedure, and start the cluster. It should activate automatically once all baseline nodes have started successfully, and your cluster is then ready to use.
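A minimal sketch of the two jobs, assuming Ignite 2.9 or later with native persistence enabled and the `boto3` package for the S3 upload. The script path, snapshot directory, node id and key prefix are all placeholders of mine. The snapshot command is issued once for the whole cluster, while the upload has to run on every node, because each node only holds its own part of the snapshot.

```python
import os
from typing import List, Tuple


def snapshot_create_command(control_script: str, snapshot_name: str) -> List[str]:
    """Command for a cluster-wide snapshot; control_script is the path to control.sh."""
    return [control_script, "--snapshot", "create", snapshot_name]


def plan_uploads(snapshot_dir: str, node_id: str, prefix: str) -> List[Tuple[str, str]]:
    """Map each file under snapshot_dir to an S3 key that keeps the node id and relative path."""
    uploads = []
    for root, _dirs, files in os.walk(snapshot_dir):
        for name in files:
            local_path = os.path.join(root, name)
            relative = os.path.relpath(local_path, snapshot_dir).replace(os.sep, "/")
            uploads.append((local_path, f"{prefix}/{node_id}/{relative}"))
    return sorted(uploads)


def upload_snapshot(bucket: str, uploads: List[Tuple[str, str]]) -> None:
    """Copy the planned files to S3 (needs boto3 and AWS credentials on the node)."""
    import boto3  # third-party; imported here so the planning code works without it

    s3 = boto3.client("s3")
    for local_path, key in uploads:
        s3.upload_file(local_path, bucket, key)
```

Keeping the node id in the key is what makes the node-wise restore possible later: each node downloads only the objects under its own prefix back into its work directory.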
I am running a Datastax Enterprise cluster (with GossipingPropertyFileSnitch). I have two datacenters, Analytics and Cassandra. The Analytics nodes form a Spark cluster. I am considering merging the two clusters to better utilize resources.
When I enable Spark (in /etc/dse/default) on my Cassandra nodes I get a new master and it seems like those nodes aren't joining the same Spark cluster as the Analytics nodes. Can I somehow make the Cassandra datacenter nodes join the Analytics Spark cluster?
Because you're using GossipingPropertyFileSnitch, you must also change which DC the new Spark nodes are in. Otherwise they will continue to be in the so-named "Cassandra" datacenter.
Edit:
The short answer to your headline question is "No". Separate DCs are assigned separate Spark masters and don't share resources on Spark jobs.