Trying to get a handle on what I would use to schedule and run jobs to move data into S3, run scripts on it and move it around s3 afterward.
My requirement is to be able to ingest from API's and also directly from databases. Some formats to ingest will be XML, and others could be flat files. The raw files need to be joined and transformed and turned into a format that graphs could be produced with.
What is AWS glue is like as an ETL tool? My specific question is can you see the finished pipelines showing the data sources and processing parts in a graphical view once they are created?
I have used Azure Data Factory - and it had a graphical UI to view and monitor the pipelines which I found quite useful. Just wondering if AWS glue has a similar thing.
If not - would Nifi on AWS S3 be a good way to do this?
Thanks
If you are looking for the best GUI, I would recommend NiFi. It is commonly used with S3 and has many connectors out of the box for other data sources. It becomes even more interesting if you want to do things outside of the AWS cloud.
That being said, I would think that Glue will also get the job done.
Running Data Factory when you have a heavy AWS footprint feels like an anti-pattern.
Full Disclosure: Have not worked with Glue/Data Factory and work for Cloudera, the driving force behind NiFi
I'm currently using AWS Glue to extract data from DB into s3, manipulate the data and save it back to Redshift/S3 or send via API to my client. AWS Glue GUI is not that good, you won't see a diagram of your flow and sometimes you will need to use other tools like step functions, airflow to orchestrate your job. Also, most of my jobs I have to use PySpark because AWS Glue methods are too limited.
Related to monitoring, you can see if there is an error, how many CPU and memory is been consumed by your job, s3 bytes read/written. If you want additional information you need to use logger or print to send it to the logs.
Related
I am trying to find the most efficient way to process files in AWS.
Read a json, xml, csv from S3 bucket
Map it to another type of json, xml, csv
Save it to S3 bucket
Right now we are using Java with AWS lambdas but we write lots of code.
AWS Data Glue looks good but my experience with MS BizTalk is even better.
Is there any service that can help me with this?
There are many options available within AWS for reading from one file format and writing it to another file format in s3 bucket. Below are some options -
A) AWS SDK for Pandas (DataWrangler) which is an open source Python library from AWS ProServe. You can run this either from a Lambda, or any other server. It provides several out of the box connectors for reading, writing data from various sources and sinks. This option may be used if the volumes are low. It also provides the flexibility to use this from Amazon Lambda or any other server where the SDK can be installed.
B) AWS Glue either using Spark or Python which is a is a serverless data integration service. This also provides a drag and drop option using the Glue Studio to generate data pipelines using many out of the box transformations. One can control the processing windows by using the desired number of Data Processing Units (DPUs). It also has the Glue Workflow for orchestration.
C) EMR which is a PetaByte scale AWS Service that one can use for high volume distributed data processing, machine learning, interactive analytics using open source frameworks like Apache Spark.
Which option one would choose would depend on the use cases one is trying to solve and the requirements. Other factors like volume of data, processing window, low code\no code options, cost, etc. would help decide which option to leverage.
I want to build a Data Lake in AWS S3 and asking my self how to work with CDC. I wanna avoid loading the whole data from the sources and furthermore I wanna avoid duplicates in the target. Are there some proven methodologies how to tackle that?
You can refer to following blog:
https://aws.amazon.com/blogs/big-data/loading-ongoing-data-lake-changes-with-aws-dms-and-aws-glue/
The deduplication is done by AWS Glue by running jobs on raw data. This dumps data to another bucket which will be mirror replication of your source database.
I am new to both PySpark and AWS EMR. I have been given a small project where I need to scrub large amounts of data files every hour and build aggregated data sets based on them. These data files are stored on S3 and I can utilize some of the basic functions in Spark (like filter and map) to derive the aggregated data. To save on egress costs and after performing some CBA analysis, I decided to create an EMR cluster and make pypark calls. The concept is working fine using Lambda functions triggered by file created in the S3 bucket. I am writing the output files back to S3.
But I am not able to comprehend the need for the 3 node EMR cluster I created and its use for me. How can I use the Hadoop file system to my advantage here and all the storage that is made available on the nodes?
How do I view (if possible) the utilization of the slave/core nodes in the cluster? How do I know they are used, how often, etc etc? I am executing the pyspark code on the master node.
Are there alternatives to EMR that I can use with pyspark?
Is there any good documentation available to get a better understanding.
Thanks
Spark is a framework for distributed computing. It can process larger than memory datasets and split the workload in chunks onto multiple workers in parallel. By default EMR creates 1 master node and 2 worker nodes. The disk space on the spark nodes is typically not used directly. Spark can use the space to cache temp results.
To use a Hadoop filesystem, you need to start a hdfs service in aws .
However s3 is also distributed storage. It is supported by Hadoop libraries. Spark EMR ships with Hadoop drivers and support S3 out of the box. Using spark with S3 is perfectly valid storage solution and will be good enough for a lot of basic data processing tasks.
The is a spark manager UI in AWS EMR. You can see each running spark application session and current job. By clicking on the job you can see how many executors are used. Whether those executors run on all nodes depends on your spark memory and cpu configuration. Tuning those is a really big topic. There are good hints here on SO.
There is also a hardware monitoring tab, showing cpu and memory usage for each node.
The spark code is always executed on the master node. But it just creates a DAG plan on that node and shifts the actual work to the worker nodes according to the plan. Hence the guides speak of submitting the spark application rather than executing.
Yes. You can start your own spark cluster on normal ec2 instances. There is even a standalone mode , allowing to start spark on only one machine. It is quite some footprint, that is installed then. And you still need to tune the memory, cpu and executor settings. So it is quite a complexity compared to just implement some multiprocessing in python or use dask. However there are valid reasons to do so. It allows to use all cores on one machine. And it allows you to use a well known , good documented api. The same one, which can be used to process petabytes of data. The linked article above, explains the motivation.
Another possibility is to use AWS Glue. It is serverless spark. The
service will submit your jobs to some on demand spark nodes on AWS,
where you have no control over. Similar to how lambda functions run
on random AWS EC2 instances. However glue has some limitations. With
pyspark on glue, you cannot install python libs with c-extensions
e.g numpy, pandas, most of ml libs. Also Glue forces you to create
schema mapping of your data in Athena catalog. But standalone spark
can just process those on the fly.
Databricks also offers a separate serverless spark solution outside of AWS. It is more sophisticated in my opinion. It also allows custom c-extensions.
Big part of official documentation is focusing on the different data processing apis and not on the internals of apache spark. There are some good notes on spark internals on github. I assume every good book will cover some inner workings on spark. AWS EMR is just an automated spark cluster with yarn orchestrator. (Unfortunately, never read some good book on spark, got some info here and there, so cannot recommend one)
I've never used AWS Glue however believe it will deliver what I want and am after some advice. I have a monthly CSV data upload that I push to S3 that has a staging Athena table (all strings) associated to it. I want Glue to perform a Create Table As (with all necessary convert/cast) against this dataset in Parquet format, and then move that dataset from one S3 bucket to another S3 bucket, so the primary Athena Table can access the data.
As stated, never used Glue before, and want a starter for 10, so I don't go down rabbit holes.
I currently perform all these steps manually, so want to understand how to use Glue to automate my manual tasks.
Yes, you can use AWS Glue ETL jobs to do exactly what you described. However, it doesn't perform CREATE TABLE AS SELECT queries, instead it does it with ETL jobs based on spark. Here is github repo that describes such process in quite detailed way and here is more of official AWS documentation on ETL programming based on AWS Glue service. After the initial setup, you can define some trigger events/scheduling to run your Glue ETL jobs automatically.
However, one thing to remember is cost of using AWS Glue services. Since it is based on execution time, sometimes it is not that trivial to forecast the final cost. For the workflow you described, performing CTAS queries with Athena would work just fine to transform your data and write it into a different s3 bucket. In this case you would know exactly price since it depends on the size of your data. Then you can use AWS API to do some manipulation with metadata catalog, so that new information would be accessible and in once place.
Since you are new to AWS Glue ETL jobs, I would suggest to stick with CTAS queries for simple tasks (although you can come up with quite complicated queries) and look into an open source project Apache Airflow for automation/scheduling and orchestration. This is the approach the I am using for tasks similar to yours. Airflow is easy to setup on both local and remote machines, has reach CLI and GUI for task monitoring, abstracts away all scheduling and retrying logic. It even has hooks to interact with AWS services. Hell, Airflow even provides you with a dedicated operator for sending queries to Athena. I wrote a little bit more about this approach here.
I am implementing a Hadoop Map reduce job that needs to create output in multiple S3 objects.
Hadoop itself creates only a single output file (an S3 object) but I need to partition the output into multiple files.
How do I achieve this?
I did this by just writing the output directly from my reducer method to S3, using an S3 toolkit. Since I was running on EC2, this was quick and free.
In general, you want Hadoop to handle your input and output as much as possible, for cleaner mappers and reducers; and, of course, you want to write to S3 at the very end of your pipeline, to let Hadoop's code moving do it's job over HDFS.
In any case, I recommend doing all of your data partitioning, and writing entire output sets to S3 in a final reduce task, one set per S3 file. This puts as little writer logic in your code as possible. This paid off for me because I ended up with a minimal Hadoop S3 toolkit which I used for several task flows.
I needed to write to S3 in my reducer code because the S3/S3n filesystems weren't mature; they might work better now.
Do you also know the MultipleOutputFormat?
It's not related to S3, but in general it allows to write output to multiple files, implementing a given logic.