Can we run ETL jobs on AWS EFS? - amazon-s3

I would like to know if we can run ETL jobs on files in an EFS mount.
If so, how? Is it done with Hive or another service?
Our goal is to reduce all the files in one mount point to a single file and store that file in S3 for better processing.

EFS in itself does not inherently have a particular data warehouse product included. For data warehousing and ETL you can choose what you want to use that operates in the AWS environment.
On to your problem:
You want to concatenate or in some way combine all of the files currently in your EFS mount into a single file and store that in S3, if I understand it correctly.
You do not mention what type of data you have or what type of files you want to combine. That makes a huge difference in how you would do this, so I can only give general suggestions. If you have different types of data (SQL tables from different databases, documents, non-SQL data), then you need to determine how to combine that data. For that you would be looking at a data integration solution that can accommodate raw data.
Several products that run on AWS may assist the process, such as Redshift, Athena, and Amazon's ETL service Glue, as well as third-party products like Snowflake. Which ones you add depends on your company's needs and budget.
A more flexible data integration approach would be to use ELT (extract, load, transform) instead of ETL. Basically you would create an appropriate target file in your S3 bucket. Then you would extract each file on EFS one at a time and load its contents into that S3 file. When you query the data in S3, you would perform any transformations needed before seeing the query results. Here's an article that explains the differences in more detail: https://blog.panoply.io/etl-vs-elt-the-difference-is-in-the-how.
Several vendors and tools support the ELT process, such as Talend, Hadoop/Hive/Spark, Teradata, and Informatica, should you want to investigate options.
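For the simplest reading of the question (plain files that can be joined byte for byte), the combine-then-upload step could be sketched as below. This is only an illustration under assumptions: the function names, the sorted-path ordering, and the bucket and key arguments are hypothetical, and the upload relies on the third-party boto3 package with AWS credentials already configured. Structured formats such as CSV with headers or Parquet would need format-aware merging instead.

```python
import shutil
from pathlib import Path


def concatenate_files(source_dir: str, output_path: str) -> int:
    """Append every regular file under source_dir to output_path, in sorted path order.

    Returns the number of files merged.
    """
    output = Path(output_path).resolve()
    merged = 0
    with open(output, "wb") as out:
        for path in sorted(Path(source_dir).rglob("*")):
            # Skip directories, and the output file itself if it lives under source_dir.
            if not path.is_file() or path.resolve() == output:
                continue
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, so large files are fine
            merged += 1
    return merged


def upload_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Upload the merged file to S3 (bucket and key are placeholders supplied by the caller)."""
    import boto3  # imported lazily so the merge step works without boto3 installed

    boto3.client("s3").upload_file(local_path, bucket, key)
```

On a machine with the EFS file system mounted, this would be called as `concatenate_files("/mnt/efs/data", "/tmp/merged.dat")` followed by `upload_to_s3("/tmp/merged.dat", "example-bucket", "merged/merged.dat")`, where the paths and bucket name are made up for the example.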

Related

Snowflake - Loading data from cloud storage

I have some data stored in an S3 bucket and I want to load it into one of my Snowflake DBs. Could you help me better understand the following two points?
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> The COPY INTO doc says that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible but slower because of network transfer time?
Thanks in advance
The primary advantage of creating an external stage is the ability to tie a file format directly to the stage and not have to worry about defining it on every COPY INTO statement. You can also tie in a connection object that contains all of your security information, which makes that transparent to your users. Lastly, if you have a ton of code that references the stage but you wind up moving your bucket, you won't need to update any of your code. This is nice for Dev to Prod migrations, as well.
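To make the first point concrete, here is a small sketch that only builds the SQL text for a stage and a load. It assumes the "connection object" above corresponds to a Snowflake storage integration, and every name in it (stage, integration, table, bucket URL) is a made-up placeholder. The file format lives on the stage, so the COPY INTO statement stays short.

```python
def create_stage_sql(stage: str, url: str, integration: str) -> str:
    """Build a CREATE STAGE statement that carries the file format and credentials."""
    return (
        f"CREATE STAGE {stage}\n"
        f"  URL = '{url}'\n"
        f"  STORAGE_INTEGRATION = {integration}\n"
        f"  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);"
    )


def copy_into_sql(table: str, stage: str) -> str:
    """Build a COPY INTO statement; no format or credentials are repeated here."""
    return f"COPY INTO {table} FROM @{stage};"


# Hypothetical names, for illustration only.
print(create_stage_sql("my_s3_stage", "s3://example-bucket/exports/", "my_s3_integration"))
print(copy_into_sql("my_table", "my_s3_stage"))
```

If the bucket moves, only the stage definition changes; code that runs `COPY INTO my_table FROM @my_s3_stage` is untouched.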
Snowflake can load from any S3 bucket regardless of region. It might be a little bit slower, but not any slower than it'd be for you to copy it to a different bucket and then load to Snowflake. Just be aware that you might incur some egress charges from AWS for moving data across regions.

Using AWS Glue to Create a Table and move the dataset

I've never used AWS Glue, but I believe it will deliver what I want, and I am after some advice. I have a monthly CSV data upload that I push to S3, with a staging Athena table (all strings) associated with it. I want Glue to perform a Create Table As (with all necessary convert/cast) against this dataset in Parquet format, and then move that dataset from one S3 bucket to another S3 bucket, so the primary Athena table can access the data.
As stated, I have never used Glue before and want a starter for 10, so I don't go down rabbit holes.
I currently perform all these steps manually, so want to understand how to use Glue to automate my manual tasks.
Yes, you can use AWS Glue ETL jobs to do exactly what you described. However, Glue doesn't perform CREATE TABLE AS SELECT queries; instead it does the work with ETL jobs based on Spark. Here is a GitHub repo that describes such a process in quite a detailed way, and here is more of the official AWS documentation on ETL programming with the AWS Glue service. After the initial setup, you can define trigger events or a schedule to run your Glue ETL jobs automatically.
However, one thing to remember is the cost of using AWS Glue. Since it is based on execution time, it is sometimes not trivial to forecast the final cost. For the workflow you described, performing CTAS queries with Athena would work just fine to transform your data and write it into a different S3 bucket. In this case you would know the price exactly, since it depends on the size of your data. Then you can use the AWS API to manipulate the metadata catalog, so that the new information is accessible and in one place.
Since you are new to AWS Glue ETL jobs, I would suggest sticking with CTAS queries for simple tasks (although you can come up with quite complicated queries) and looking into the open source project Apache Airflow for automation/scheduling and orchestration. This is the approach that I am using for tasks similar to yours. Airflow is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all scheduling and retrying logic. It even has hooks to interact with AWS services. Hell, Airflow even provides you with a dedicated operator for sending queries to Athena. I wrote a little bit more about this approach here.
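As a rough sketch of what such a CTAS query might look like, the helper below builds the SQL text that casts the all-string staging columns and writes Parquet to another S3 location. The table names, column names, types, and bucket path are hypothetical, and submitting the query to Athena (for example from an Airflow task) is left out.

```python
def build_ctas(target_table: str, source_table: str, columns: dict, external_location: str) -> str:
    """Build an Athena CTAS statement.

    columns maps each staging column name to the SQL type it should be cast to.
    """
    select_list = ",\n  ".join(
        f"CAST({name} AS {sql_type}) AS {name}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (format = 'PARQUET', external_location = '{external_location}')\n"
        f"AS SELECT\n  {select_list}\n"
        f"FROM {source_table}"
    )


# Hypothetical tables and columns, for illustration only.
print(
    build_ctas(
        "prod.monthly_sales",
        "staging.monthly_sales_raw",
        {"order_id": "INTEGER", "amount": "DOUBLE", "order_date": "DATE"},
        "s3://example-target-bucket/monthly_sales/",
    )
)
```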

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses a variety of sources and sends the data to BigQuery. Talend cannot handle both relational and non-relational database components in one job for my use case, so here's how I am doing it currently:
JOB 1 -- Get data from a source (SQL Server, API, etc.), transform it, and store the transformed data in a delimited file (text or CSV)
JOB 2 -- Use the transformed data stored in the delimited file by JOB 1 as the source, then transform it to fit BigQuery and send it.
I am using a delimited text/CSV file as intermediary data storage to achieve this. Since confidentiality of the data is important and the solution also needs to scale to millions of rows, what should I use as this intermediary source? Will a relational database help? Are delimited files good enough? Or is there anything else I can use?
PS: I am deleting these files as soon as the job finishes, but I am worried about security while the job runs, although it will run on a safe cloud architecture.
Please share your views on this.
In data warehousing architecture, it's usually good practice to make the staging layer persistent. Among other things, this lets you trace data lineage back to the source, reload your final model from the staging point when business rules change, and see the full set of transformation steps the data went through, all the way from landing to reporting.
I'd also consider changing your design so that the staging layer persists in its own dataset in BigQuery, rather than deleting the files after processing.
Since this is just an operational layer for ETL/ELT and not for end-user reports, you will be paying only for storage for the most part.
Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need, and you have full control over permissions. BigQuery works seamlessly with Cloud Storage, and you can even load a table from a Cloud Storage file straight from the Cloud Console.
All things considered, whichever direction you choose, I recommend storing the files you're using to load the table rather than deleting them. Sooner or later there will be questions or failures in your final report, and you'll likely need to trace back to the source for investigation.
In a nutshell, the process would be:
|---Extract and Transform---|----Load----|
Source ---> Cloud Storage --> BigQuery
I would do ELT instead of ETL: load the source data as-is and transform it in BigQuery using SQL functions.
This potentially lets you reshape the data (for example, convert to arrays), filter out columns/rows, and perform the transformation in one single SQL statement.

BigQuery best approach for ETL (external tables and views vs Dataflow)

CSV files get uploaded to an FTP server (for which I don't have SSH access) on a daily basis, and I need to generate weekly data that merges those files with transformations. That data would go into a history table in BQ and a CSV file in GCS.
My approach goes as follows:
Create a Linux VM and set a cron job that syncs the files from the FTP server with a GCS bucket (I'm using GCSFS)
Use an external table in BQ for each category of CSV files
Create views with complex queries that transform the data
Use another cron job to create a table with the historic data and also the CSV file on a weekly basis.
My idea is to remove as many intermediate steps as I can and to make the implementation as easy as possible, possibly including Dataflow for ETL, but I have some questions first:
What's the problem with my approach in terms of efficiency and money?
Is there anything Dataflow can provide that my approach can't?
Any ideas about other approaches?
BTW, I ran into one problem that might be fixable by parsing the CSV files myself rather than using external tables: invalid characters, like the null char. If I parse the files myself I can get rid of them, whereas with an external table there is a parsing error.
Your ETL will probably be simplified by a Google Dataflow pipeline run as a batch job. Upload your files to the GCS bucket. For transforming, use pipeline transformations to strip null values and invalid characters (or whatever you need). On the transformed dataset, run your complex queries, such as grouping by key and aggregating (sum or combine). If you need side inputs, Dataflow also lets you merge other datasets into the current dataset. Finally, the transformed output can be written to BQ, or you can write your own custom implementation for writing those results.
So Dataflow gives your solution a lot of flexibility: you can branch the pipeline and work differently on each branch with the same dataset. Regarding cost, running your batch job with three workers, which is the default, should not be very costly. And if you just want to concentrate on your business logic and not worry about the rest, Google Dataflow is pretty interesting, and it is very powerful if used wisely.
Dataflow helps you keep everything in one place and manage it effectively. Go through its pricing and determine whether it is the best fit for you (your problem is completely solvable with Google Dataflow). Your approach is not bad, but it needs extra maintenance across those separate pieces.
Hope this helps.
Here are a few thoughts.
If you are working with a very low volume of data, then your approach may work just fine. If you are working with more data and need several VMs, Dataflow can automatically scale the number of workers your pipeline uses up and down, to help it run more efficiently and save costs.
Also, is your Linux VM always running? Or does it only spin up when you run your cron job? A batch Dataflow job only runs when it is needed, which also helps to save on costs.
In Dataflow you could use TextIO to read each line of the file in, and add your custom parsing logic.
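A minimal sketch of such per-line parsing logic, which also deals with the null characters mentioned in the question, might look like this. The function name and the comma delimiter are assumptions. In the Beam Python SDK, where the counterpart of TextIO is `beam.io.ReadFromText`, a function like this could be applied to each line with `beam.Map`.

```python
import csv


def parse_line(line: str, delimiter: str = ",") -> list:
    """Strip NUL characters from one line of text, then split it into CSV fields."""
    cleaned = line.replace("\x00", "")
    # csv.reader handles quoted fields that contain the delimiter.
    return next(csv.reader([cleaned], delimiter=delimiter), [])


print(parse_line('a\x00,"b,c",d'))  # ['a', 'b,c', 'd']
```

Because the input is read one line at a time, quoted fields that span several lines would not be handled by this approach.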
You mention that you have a cron job which puts the files into GCS. Dataflow can read from GCS, so it would probably be simplest to keep that process around and have your dataflow job read from GCS. Otherwise you would need to write a custom source to read from your FTP server.
Here are some useful links:
https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling

Export big query data into in house Hadoop Cluster

We have GA data in Big query, and some of my users want to join that to in house data in Hadoop which we can not move to Big Query.
Please let me know what is the best way to do this.
See BigQuery to Hadoop Cluster - How to transfer data?:
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop
https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class.
Write a query to select the appropriate BigQuery objects.
Splits the results of the query evenly among the Hadoop nodes.
Parses the splits into java objects to pass to the mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.
(It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes)
You could follow the route of the Hadoop connector, as Felipe Hoffa suggested, or build your own application that transfers data from BigQuery to your Hadoop cluster. Either way, you will be able to make the required joins on the Hadoop cluster using Pig, Hive, etc.
In case you want to try the application method, let me take you through a process flow which your application may need to follow:
Query BQ tables (flatten any nested or repeated fields)
If your query response is too large, you can divert this response into a destination table. Your destination table is simply another table in BigQuery.
You can then export this destination table to a GCS bucket. This uses a separate export (extract) job rather than a query. You will have options to choose an export format and compression type, split the data into multiple files, etc.
From the GCS bucket, using a tool called gsutil, you can copy the files to your cluster gateway machine.
From your cluster gateway machine, you can use the hadoop command 'copyFromLocal' to copy this data to your HDFS directory.
Once it is in an HDFS directory, you can create a Hive external table pointing to this HDFS directory. Your data will now be available in the Hive table, ready to be joined with the in-house data on your cluster.
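The export and copy steps above could be scripted roughly as follows. This sketch only builds the command lines for the `bq`, `gsutil`, and `hadoop` CLIs. The table name, bucket, and directories are hypothetical, CSV is just one of the available export formats, and the query and Hive table creation steps are not covered.

```python
def transfer_commands(table: str, gcs_prefix: str, local_dir: str, hdfs_dir: str) -> list:
    """Return the commands (as argument lists) for export -> GCS -> gateway -> HDFS."""
    # A wildcard in the destination lets BigQuery split a large export into several files.
    gcs_pattern = f"{gcs_prefix}/part-*.csv"
    return [
        # Export the destination table from BigQuery to the GCS bucket.
        ["bq", "extract", "--destination_format=CSV", table, gcs_pattern],
        # Copy the exported files from GCS to the cluster gateway machine.
        ["gsutil", "-m", "cp", gcs_pattern, local_dir],
        # Copy the local directory into HDFS.
        ["hadoop", "fs", "-copyFromLocal", local_dir, hdfs_dir],
    ]


# Hypothetical names, for illustration only.
for command in transfer_commands(
    "mydataset.ga_export", "gs://example-bucket/ga_export", "/tmp/ga_export", "/data/ga"
):
    print(" ".join(command))
```

On the gateway machine, each argument list could be passed to `subprocess.run(command, check=True)` so that a failed step stops the sequence.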
Let me know if you need any more details or clarifications. I went down this route because I found the connector alternative a little too complex, but that is a subjective opinion that varies from person to person.