Migration of ETL jobs to Hadoop - pandas

I have a set of ETL jobs (created in Informatica) that I want to migrate to Hadoop. I have already created the source and target tables in the Hadoop environment. I can write Hive queries that implement the ETL logic, pulling data from the source tables and writing to the target tables. But this is a lengthy process: my ETL jobs contain complex business logic, so developing and testing these queries takes a long time. Is there a better way to migrate my ETL code to Hadoop? I have heard that pandas DataFrames can be used instead of Hive. Any suggestions?

Related

Replicate data from cloud SQL postgres to bigQuery

I am looking for the recommended way to stream database changes from Cloud SQL (Postgres) to BigQuery. CDC streaming does not seem to be available for Postgres; does anyone know the timeline for this feature?
Thanks a lot for your help.
Jonathan.
With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery
You have to create an ETL process. That will let you automatically transform and load data from Postgres into BigQuery. There are many ways to do that, but I will point you to the two main approaches that I have already implemented:
Way 1:
Set Up the ETL Process manually:
Create your ETL using open source tools...
This method uses the COPY command to move data between PostgreSQL tables and standard file-system files. COPY can be run as a normal SQL statement, from SQL functions, or from PL/pgSQL procedures, which gives a lot of flexibility to extract data as a full dump or incrementally. Be aware that building this yourself is time-consuming and requires engineering bandwidth.
You could also try different tech stacks to implement the above; I recommend this one: Java Spring Data Flow
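As a rough illustration of the incremental extract described above, here is a minimal Python sketch. It assumes a psycopg2 cursor (whose copy_expert method streams COPY output into a file-like object) and a hypothetical watermark column such as updated_at; the function, table, and column names are placeholders, not part of any existing setup. Loading the resulting CSV into BigQuery is left out.

```python
from datetime import datetime


def build_incremental_copy(table: str, watermark_column: str, since: datetime) -> str:
    """Return a COPY statement that exports rows changed after `since` as CSV."""
    # COPY cannot take table or column names as bind parameters, so the
    # identifiers are checked before being placed into the statement.
    for name in (table, watermark_column):
        if not name.replace("_", "").replace(".", "").isalnum():
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"COPY (SELECT * FROM {table} "
        f"WHERE {watermark_column} > '{since.isoformat()}') "
        "TO STDOUT WITH (FORMAT csv, HEADER)"
    )


def export_changes(cursor, out_file, table, watermark_column, since):
    """Stream the changed rows into out_file (any writable file-like object)."""
    # Assumption: `cursor` is a psycopg2 cursor; copy_expert(sql, file)
    # writes the COPY ... TO STDOUT output into the given file object.
    cursor.copy_expert(build_incremental_copy(table, watermark_column, since), out_file)
```

Passing a very old timestamp as `since` gives a full dump; passing the time of the last successful run gives an incremental extract.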
Way 2:
Using Dataflow
You can automate the ETL process using GCP's Dataflow without coding your own solution. It is faster to set up, but of course it costs money.
Dataflow is "unified stream and batch data processing that's serverless, fast, and cost-effective."
Check more details and learn in a minute here
Also check this

Automate data transforming ( SQL) and then push processed data to Tableau

I have a question about how to automate a data transformation process.
I normally transform data using Python or PostgreSQL and then export the processed data as CSV. After that, I connect the CSV file to Tableau.
I have done some research and found that ETL tools can help. However, after watching demo videos for a few ETL tools, I'm not sure whether their transform features would meet my needs. For example, I have written more than 100 lines of SQL for one of my data transformation tasks; I would prefer to run that query in PostgreSQL rather than rebuild it in an ETL tool.
The problem is that I don't know the proper way to automate the data transformation and then push the data to Tableau. The CSV files will be updated daily, so I'll need to refresh the data.
Data transformation can be done in various ways. The right fit depends on the nature of your data.
If you have a huge volume of data and are comfortable with Python or Java, you can automate your transformation logic using Spark, write the result to a Hive table, and then connect Tableau to read the data from Hive.
Most next-gen ETL tools, such as Pentaho and Talend, can also be used, but you give up the flexibility and portability that a framework like Spark or Beam provides.
If you want to know how to achieve this using cloud provider services such as GCP or AWS, please let me know.
Prep is the Tableau tool for preparing data. It can be used for joining, appending, cleaning, pivoting, filtering and other data cleansing activities.
Tableau Prep is available:
for free if you have a Tableau Creator license
in desktop and Tableau Online/Tableau Server versions
Scheduling Prep flows is available in Tableau Online/Server. To schedule flows, you will need the Tableau Prep Conductor add-on.

How to transform data from S3 bucket before writing to Redshift DW?

I'm creating a (modern) data warehouse in Redshift. All of our infrastructure is hosted at Amazon. So far, I have set up DMS to ingest data (including changed data) from some tables of our business database (SQL Server on EC2, not RDS) and store it directly in S3.
Now I must transform and enrich this data from S3 before I can write it to Redshift. Our DW has some tables for facts and dimensions (star schema). Take a Customer dimension, for example: it should contain not only the customer's basic info, but also address info, city, state, etc. This data is spread across a few tables in our business database.
So here's my problem: I don't have a clear idea of how to query the S3 staging area in order to join these tables and write the result to my Redshift DW. I want to do it using AWS services like Glue, Kinesis, etc., i.e. fully serverless.
Can Kinesis accomplish this task? Would it make things easier if I moved my staging area from S3 to Redshift, since all of our data is highly relational in nature anyway? If so, the question remains: how do I transform/enrich data before saving it to our DW schemas? I have searched everywhere for this particular topic, but information on it is scarce.
Any help is appreciated.
There are a lot of ways to do this but one idea is to query the data using Redshift Spectrum. Spectrum is a way to query S3 (called an external database) using your Redshift cluster.
At a very high level, one way to do this would be to create a Glue Crawler job to crawl your S3 bucket, which creates the external database that Redshift Spectrum can query.
This way, you don't need to move your data into Redshift itself. Likely, you will want to keep your "staging" area in S3 and only bring into Redshift the data that is ready to be used for reporting or analytics, which would be your Customer Dim table.
Here is the documentation to do this: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
To schedule the ETL SQL: I don't believe there is a scheduling tool built into Redshift but you can do that in a few ways:
1) Get an ETL tool, or set up cron jobs (on a server or in Glue) that schedule SQL scripts to be run. I do this with a Python script that connects to the database and then runs the SQL text. This would be more of a bulk operation. You can also do this in a Lambda function scheduled by a CloudWatch trigger, which can follow a cron schedule.
2) Use a Lambda function, triggered by S3 PUTs into that bucket, that runs the SQL script you want. That way the script runs as soon as a file drops, which makes this essentially a real-time operation. DMS drops files very quickly, so you will have files arriving multiple times per minute, which may make this approach more difficult to handle.
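The "Python script that connects to the database then runs the SQL text" from option 1 could look roughly like the sketch below. It is written against the generic Python DB-API, and sqlite3 is used here only as a stand-in so the example runs anywhere; for Redshift you would pass in a connection from whichever driver you use, which is an assumption on my part. The function name run_sql_script is hypothetical.

```python
import sqlite3


def run_sql_script(conn, sql_text: str) -> int:
    """Run every statement in sql_text, commit once, and return how many ran."""
    # Naive split on ';' -- this assumes no semicolons inside string literals.
    statements = [s.strip() for s in sql_text.split(";") if s.strip()]
    cur = conn.cursor()
    try:
        for stmt in statements:
            cur.execute(stmt)
        conn.commit()
    except Exception:
        # Undo any uncommitted work and let the scheduler see the failure.
        conn.rollback()
        raise
    finally:
        cur.close()
    return len(statements)
```

The same function can be called from a cron-driven script or from inside a Lambda handler; only the way the connection is created changes.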
One option is to load the 'raw' data into Redshift as 'staging' tables. Then, run SQL commands to manipulate the data (JOINs, etc) into the desired format.
Finally, copy the resulting data into the 'public' tables that users query.
This is a normal Extract-Load-Transform process (slightly different to ETL) that uses the capabilities of Redshift to do the transform.
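To make the staging-then-transform idea concrete, here is a small sketch of the transform step for the Customer dimension. It uses sqlite3 purely as a local stand-in for Redshift so that it can be run as-is; the table and column names (stg_customer, stg_address, dim_customer) are made up for illustration, and in Redshift the staging tables would first be loaded from S3.

```python
import sqlite3

# Join the raw staging tables into the shape the dimension table needs.
TRANSFORM_SQL = """
INSERT INTO dim_customer (customer_id, name, city, state)
SELECT c.customer_id, c.name, a.city, a.state
FROM stg_customer AS c
JOIN stg_address AS a ON a.customer_id = c.customer_id
"""


def build_customer_dim(conn) -> int:
    """Rebuild dim_customer from the staging tables; return its row count."""
    cur = conn.cursor()
    # Full refresh: clear the dimension, then repopulate it from staging.
    cur.execute("DELETE FROM dim_customer")
    cur.execute(TRANSFORM_SQL)
    conn.commit()
    cur.execute("SELECT COUNT(*) FROM dim_customer")
    return cur.fetchone()[0]
```

A full refresh keeps the sketch short; with DMS change data you would more likely merge only the changed rows, which is not shown here.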

Dump materialize aggregation from BigQuery to SQL server, Dataflow vs Airflow

I use a BigQuery dataset as a data lake to store all record/event-level data, and a SQL Server database to store aggregated reports that are updated regularly. Because the reports will be accessed frequently by clients via a web interface, and each report aggregates a large amount of data, serving them directly from BigQuery is a no-go.
What is the best practise for doing this? Internally we have 2 ideas running around:
Run a Dataflow batched job every X hr to recalculate the aggregation and update the SQL server. It will need a scheduler to trigger the job, and the same job can be used to backfill all data.
Run an Airflow job that does the same thing. A separate job will be needed for backfill (but can still share most of the code with the regular job)
I know Dataflow does well at processing chunks of data in parallel, but I wonder about Airflow's performance, as well as the risk of exhausting the SQL Server connection limit.
Please check this answer from a previous similar question
In conclusion: Airflow gives you a more efficient way to manage the whole workflow end to end. Google's managed offering based on Airflow is Cloud Composer.

Is there a way where the output of the MapReduce job is imported into SQL table?

I want to know if we could automatically import the output of a MapReduce job into a SQL table (MySQL, Oracle, etc.), with the MapReduce job itself responsible for the export.
I know Sqoop could be used as a tool, but could it be used inside an MR job?
Unless you write code in the reducer that, instead of writing the output to the context, connects to the SQL database via JDBC and inserts the rows (which would be a really BAD idea), the only thing you can do is use Oozie to automate the execution of the MapReduce job and then perform the insertion using Sqoop. Oozie is a workflow scheduler that automates these kinds of operations. You can find more information about it here.
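To show the ordering only (run the MapReduce job, then export its output with Sqoop), here is a small Python driver. Note that this swaps a plain script in for Oozie, which is what the answer actually recommends; in Oozie the same two steps would be two actions in a workflow. The JDBC URL, table, and paths are placeholders, password handling is omitted, and the tab delimiter assumes the job writes key/value text output with the default separator.

```python
import subprocess


def sqoop_export_cmd(jdbc_url, table, export_dir, username):
    """Build the argument list for a `sqoop export` of an HDFS directory."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--export-dir", export_dir,
        # Sqoop interprets the two-character escape \t as a tab delimiter.
        "--input-fields-terminated-by", r"\t",
    ]


def run_pipeline(mapreduce_cmd, export_cmd, runner=subprocess.run):
    """Run the MapReduce job, then the Sqoop export, stopping on failure."""
    # check=True makes subprocess.run raise CalledProcessError on a non-zero
    # exit code, so the export never runs against a failed job's output.
    runner(mapreduce_cmd, check=True)
    runner(export_cmd, check=True)
```

The `runner` parameter exists only so the sequencing can be exercised without a Hadoop cluster.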