I really love the Apache Ignite's shared RDD for spark. However, due to the limitation, I can not deploy ignite onto cluster nodes. The only way I can use Ignite is throuhgh embedded mode with Spark.
I would like to knowledge, in embedded mode, can the RDD shared through different Spark applications?
I have two Spark jobs:
Job 1: Produce the data, and stores into the shared RDD
Job 2: retrieve the data from the Shared RDD, and do some calculation.
Can this task be done using ignite's embedded mode?
Thanks
In embedded mode Ignite nodes are started within executors which are under control of Spark. Having said that, this mode is more for testing purposes in my opinion - no need to deploy and start Ignite separately while having an ability to try basic functionality. But in real scenarios it would be very hard to achieve consistency and failover guarantees as Spark can start and stop executors which in case of embedded mode are actually holding the data. I would recommend to work around your limitation and make sure Ignite can be installed separately in standalone mode.
Related
I am new to both PySpark and AWS EMR. I have been given a small project where I need to scrub large amounts of data files every hour and build aggregated data sets based on them. These data files are stored on S3 and I can utilize some of the basic functions in Spark (like filter and map) to derive the aggregated data. To save on egress costs and after performing some CBA analysis, I decided to create an EMR cluster and make pypark calls. The concept is working fine using Lambda functions triggered by file created in the S3 bucket. I am writing the output files back to S3.
But I am not able to comprehend the need for the 3 node EMR cluster I created and its use for me. How can I use the Hadoop file system to my advantage here and all the storage that is made available on the nodes?
How do I view (if possible) the utilization of the slave/core nodes in the cluster? How do I know they are used, how often, etc etc? I am executing the pyspark code on the master node.
Are there alternatives to EMR that I can use with pyspark?
Is there any good documentation available to get a better understanding.
Thanks
Spark is a framework for distributed computing. It can process larger than memory datasets and split the workload in chunks onto multiple workers in parallel. By default EMR creates 1 master node and 2 worker nodes. The disk space on the spark nodes is typically not used directly. Spark can use the space to cache temp results.
To use a Hadoop filesystem, you need to start a hdfs service in aws .
However s3 is also distributed storage. It is supported by Hadoop libraries. Spark EMR ships with Hadoop drivers and support S3 out of the box. Using spark with S3 is perfectly valid storage solution and will be good enough for a lot of basic data processing tasks.
The is a spark manager UI in AWS EMR. You can see each running spark application session and current job. By clicking on the job you can see how many executors are used. Whether those executors run on all nodes depends on your spark memory and cpu configuration. Tuning those is a really big topic. There are good hints here on SO.
There is also a hardware monitoring tab, showing cpu and memory usage for each node.
The spark code is always executed on the master node. But it just creates a DAG plan on that node and shifts the actual work to the worker nodes according to the plan. Hence the guides speak of submitting the spark application rather than executing.
Yes. You can start your own spark cluster on normal ec2 instances. There is even a standalone mode , allowing to start spark on only one machine. It is quite some footprint, that is installed then. And you still need to tune the memory, cpu and executor settings. So it is quite a complexity compared to just implement some multiprocessing in python or use dask. However there are valid reasons to do so. It allows to use all cores on one machine. And it allows you to use a well known , good documented api. The same one, which can be used to process petabytes of data. The linked article above, explains the motivation.
Another possibility is to use AWS Glue. It is serverless spark. The
service will submit your jobs to some on demand spark nodes on AWS,
where you have no control over. Similar to how lambda functions run
on random AWS EC2 instances. However glue has some limitations. With
pyspark on glue, you cannot install python libs with c-extensions
e.g numpy, pandas, most of ml libs. Also Glue forces you to create
schema mapping of your data in Athena catalog. But standalone spark
can just process those on the fly.
Databricks also offers a separate serverless spark solution outside of AWS. It is more sophisticated in my opinion. It also allows custom c-extensions.
Big part of official documentation is focusing on the different data processing apis and not on the internals of apache spark. There are some good notes on spark internals on github. I assume every good book will cover some inner workings on spark. AWS EMR is just an automated spark cluster with yarn orchestrator. (Unfortunately, never read some good book on spark, got some info here and there, so cannot recommend one)
I'm trying to give useful information but I am far from being a data engineer.
I am currently using the python library pandas to execute a long series of transformation to my data which has a lot of inputs (currently CSV and excel files). The outputs are several excel files. I would like to be able to execute scheduled monitored batch jobs with parallel computation (I mean not as sequential as what I'm doing with pandas), once a month.
I don't really know Beam or Airflow, I quickly read through the docs and it seems that both can achieve that. Which one should I use ?
The other answers are quite technical and hard to understand. I was in your position before so I'll explain in simple terms.
Airflow can do anything. It has BashOperator and PythonOperator which means it can run any bash script or any Python script.
It is a way to organize (setup complicated data pipeline DAGs), schedule, monitor, trigger re-runs of data pipelines, in a easy-to-view and use UI.
Also, it is easy to setup and everything is in familiar Python code.
Doing pipelines in an organized manner (i.e using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts all over the place.
Nowadays (roughly year 2020 onwards), we call it an orchestration tool.
Apache Beam is a wrapper for the many data processing frameworks (Spark, Flink etc.) out there.
The intent is so you just learn Beam and can run on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
Google Cloud Platform's Cloud Dataflow is one backend for running Beam on.
They call it the Dataflow runner.
GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster in Google Kubernetes Engine (GKE).
So you can either:
manual Airflow implementation, doing data processing on the instance itself (if your data is small (or your instance is powerful enough), you can process data on the machine running Airflow. This is why many are confused if Airflow can process data or not)
manual Airflow implementation calling Beam jobs
Cloud Composer (managed Airflow as a service) calling jobs in Cloud Dataflow
Cloud Composer running data processing containers in Composer's Kubernetes cluster environment itself, using Airflow's KubernetesPodOperator (KPO)
Cloud Composer running data processing containers in Composer's Kubernetes cluster environment with Airflow's KPO, but this time in a better isolated fashion by creating a new node-pool and specifying that the KPO pods are to be run in the new node-pool
My personal experience:
Airflow is lightweight and not difficult to learn (easy to implement), you should use it for your data pipelines whenever possible.
Also, since many companies are looking for experience using Airflow, if you're looking to be a data engineer you should probably learn it
Also, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself, and managing the airflow webserver and scheduler processes.
Apache Airflow and Apache Beam look quite similar on the surface. Both of them allow you to organise a set of steps that process your data and both ensure the steps run in the right order and have their dependencies satisfied. Both allow you to visualise the steps and dependencies as a directed acyclic graph (DAG) in a GUI.
But when you dig a bit deeper there are big differences in what they do and the programming models they support.
Airflow is a task management system. The nodes of the DAG are tasks and Airflow makes sure to run them in the proper order, making sure one task only starts once its dependency tasks have finished. Dependent tasks don't run at the same time but only one after another. Independent tasks can run concurrently.
Beam is a dataflow engine. The nodes of the DAG form a (possibly branching) pipeline. All the nodes in the DAG are active at the same time, and they pass data elements from one to the next, each doing some processing on it.
The two have some overlapping use cases but there are a lot of things only one of the two can do well.
Airflow manages tasks, which depend on one another. While this dependency can consist of one task passing data to the next one, that is not a requirement. In fact Airflow doesn't even care what the tasks do, it just needs to start them and see if they finished or failed. If tasks need to pass data to one another you need to co-ordinate that yourself, telling each task where to read and write its data, e.g. a local file path or a web service somewhere. Tasks can consist of Python code but they can also be any external program or a web service call.
In Beam, your step definitions are tightly integrated with the engine. You define the steps in a supported programming language and they run inside a Beam process. Handling the computation in an external process would be difficult if possible at all*, and is certainly not the way Beam is supposed to be used. Your steps only need to worry about the computation they're performing, not about storing or transferring the data. Transferring the data between different steps is handled entirely by the framework.
In Airflow, if your tasks process data, a single task invocation typically does some transformation on the entire dataset. In Beam, the data processing is part of the core interfaces so it can't really do anything else. An invocation of a Beam step typically handles a single or a few data elements and not the full dataset. Because of this Beam also supports unbounded length datasets, which is not something Airflow can natively cope with.
Another difference is that Airflow is a framework by itself, but Beam is actually an abstraction layer. Beam pipelines can run on Apache Spark, Apache Flink, Google Cloud Dataflow and others. All of these support a more or less similar programming model. Google has also cloudified Airflow into a service as Google Cloud Compose by the way.
*Apache Spark's support for Python is actually implemented by running a full Python interpreter in a subprocess, but this is implemented at the framework level.
Apache Airflow is not a data processing engine.
Airflow is a platform to programmatically author, schedule, and
monitor workflows.
Cloud Dataflow is a fully-managed service on Google Cloud that can be used for data processing. You can write your Dataflow code and then use Airflow to schedule and monitor Dataflow job. Airflow also allows you to retry your job if it fails (number of retries is configurable). You can also configure in Airflow if you want to send alerts on Slack or email, if your Dataflow pipeline fails.
I am doing the same as you with airflow, and I've got very good results. I am not very sure about the following: Beam is machine learning focused and airflow is for anything you want.
Finally you can create a hive with kubernetes +airflow.
I am trying to speed spark sql queries by introduce ignite as cache layer, by using IgniteRDD. From the example by ignite doc, it loads data from ignite cache to construct the RDD. But in our usecase the data size may too big to put into ignite memory, actually we just put the data in hbase, so is it possible to do:
1, construct igniteRDD by loading data from hbase
2, Just use ignite to cache share rdd which is generated by spark sql to speed up spark sql.
There are two possible usage scenarios.
First approach. If you run Ignite SQL queries from Spark using igniteRdd.sql(...) method then all the data must be stored in an Ignite cluster. Ignite SQL engine cannot query an underlying 3rd party persistence layer if not all the data is cached in memory. But if you enable Ignite persistence and store all your data there instead of HBase then you can cache as much data as possible and run SQL safely since Ignite can query its own persistence.
Second approach is to use HBase as a cache store (need to implement your own version since there's nothing out-of-the-box) and use Spark SQL queries instead of Ignite SQL because the latter requires us to cache all the data in RAM if Ignite persistence is not used.
Third approach is to try out Ignite in-memory file system (IGFS) and Hadoop accelerator. IGFS and the accelerator are deployed on top of HDFS. However, here you cannot use IgniteRDDs API because all the operations will go through this pipeline Spark->HBase->IGFS+Accelerator+HDFS.
If I were to choose I would go for the first approach.
Apart from above three approaches, if you have flexibility to add another component, use Apache Phoenix. It supports integration with Spark SQL. You can check it on their official website. In this case you will not need Apache Ignite.
We need to load millions of key/values into Apache Geode and we'd like to know what are some the options available. Our values happen to be in the 256kb range.
There are several options depending on your application requirements/SLAs or whether you need to perform conversion or other transformations, etc.
Out-of-the-box, Apache Geode provides the Cache & Region Snapshot Service. This is useful when you want to migrate data from 1 existing Apache Geode cluster to another, for instance. Not so useful if your data is coming from an external source, like a RDBMS.
Another option is to lazily load the data based on need. This can be accomplished by implementing the CacheLoader interface and registering the CacheLoader with a Region. Obviously, you could create a CacheLoader implementation that intelligently loads a block of data based on some rules/criteria in addition to loading and returning the single value of interests based on the current requests.
A lot of times, users create an external, custom Conversion process or tool to extract, transform and bulk load (ETL) a bunch of data into Apache Geode. This is typical in complex Use Cases or requirements. However, it is highly advisable to use perhaps a framework/tool like...
Spring XD (now Spring Cloud Data Flow on Pivotal's Cloud Foundry (PCF)) is great ETL tool and pipeline for creating stream-based applications. Spring XD / SCDF provides many different options for "sources" and "sinks" (e.g. GemFire Server). In addition to sources & sinks, you can even "tap" the stream to process the data with "Processors". So whether you are doing real-time stream or batch-oriented data operations (e.g. bulk loads), Spring XD is a great option.
I am sure Google might provide other answers on how to perform ETL with a KeyValue store like Apache Geode.
Hope this helps get you going.
Cheers,
John
We have very limited options to load Gemfire regions .
1) Spring batch:
Create Gemfire writer for load data and remove data
Create batch configuration and lod it
2) Apache Spark
https://www.linkedin.com/pulse/fast-data-access-using-gemfire-apache-spark-part-vaquar-khan-/
I have been studying 'in-memory data grids' and saw the term 'gemfire'. I'm confused. It seems that gemfire is a term to refer to technologies that store and manipulate data like a database but in the computer memory, isn't it? What exactly is gemfire?
Which technologies can I use to work with 'in-memory data grids' in Node.js?
I saw some applications, like 'Apache Geode' and 'Pivotal gemfire'. How do I work with them? Is it like work with some cache technologies (like Redis or Memcached)? In geode's case, are the data only accessed through an API or are there other ways to access this one?
There are many products that qualify as a "in-memory data grid", GemFire is one of the leading ones. From this article the main ones are:
VMware Gemfire (Java)
Oracle Coherence (Java)
Alachisoft NCache (.Net)
Gigaspaces XAP Elastic Caching Edition (Java)
Hazelcast (Java)
Scaleout StateServer (.Net)
Most of these products have drivers in many languages. You can access data in GemFire over REST, or over the native node.js client.
Apache Geode is the open source version of GemFire. It is much more powerful than memcached and Redis; You can use Geode not only as a cache, but as a store of record (it has native persistence). It has an Object Query Language (OQL) engine built in, which allows you to query nested objects, has powerful features like Continuous Queries and replication over WAN, among others. Geode also has protocol adapters for memcached and Redis, allowing your memcached and Redis clients to connect to Geode.
I would add to the list of "In memory data grid" solutions:
Apache Ignite
Infinispan
They also provide powerful features.
For feature comparison you can use this website: https://db-engines.com/en/system/Hazelcast%3BIgnite .
Last note: GemFire is now a Pivotal solution.
GemFire is a high performance distributed data management infrastructure that sits between application cluster and back-end data sources.
With GemFire, data can be managed in-memory, which makes the access faster.
Kindly check the Link below for further details
https://www.baeldung.com/spring-data-gemfire