Tool to "Data Load" or "ETL" -- from SQL Server into Amazon Redshift - sql

I am trying to figure out decent but simple tool which I can host myself in AWS EC2, which will allow me to pull data out of SQL Server 2005 and push to Amazon Redshift.
I basically have a view in SQL Server on which I am doing SELECT * and I need just put all this data into Redshift. The biggest concern is that there is a lot of data, and this will need to be configurable so I can queue it, run as a nighly/continuous job, etc.
Any suggestions?

alexeypro,
dump tables to files, then you have two fundamental challenges to solve:
Transporting data to Amazon
Loading data to Redshift tables.
Amazon S3 will help you with both:
S3 supports fast upload of files to Amazon from your SQL server location. See this great article. It is from 2011 but I did some testing a few months back and saw very similar results. I was testing with gigabytes of data and 16 uploader threads were ok, as I'm not on backbone. Key thing to remember is that compression and parallel upload are your friends to cut down the time for upload.
Once data are on S3, Redshift supports high-performance parallel load from files on S3 to table(s) via COPY SQL command. To get fastest load performance pre-partition your data based on table distribution key and and pre-sort it to avoid expensive vacuums. All is well documented in Amazon's best practices. I have to say these guys know how to make things neat & simple, so just follow the steps.
If you are coder you can orchestrate the whole process remotely using scripts in whatever shell/language you want. You'll need tools/libraries for parallel HTTP upload to S3 and command line access to Redshift (psql) to launch the COPY command.
Another options is Java, there are libraries for S3 upload and JDBC access to Redshift.
As other posters suggest, you could probably use SSIS (or essentially any other ETL tool) as well. I was testing with CloverETL. Took care of automating the process as well as partitioning/presorting the files for load.

Now Microsoft released SSIS Powerpack, so you can do it natively.
SSIS Amazon Redshift Data Transfer Task
Very fast bulk copy from on-premises data to Amazon Redshift in few clicks
Load data to Amazon Redshift from traditional DB engines like SQL Server, Oracle, MySQL, DB2
Load data to Amazon Redshift from Flat Files
Automatic file archiving support
Automatic file compression support to reduce bandwidth and cost
Rich error handling and logging support to troubleshoot Redshift Datawarehouse loading issues
Support for SQL Server 2005, 2008, 2012, 2014 (32 bit and 64 bit)
Why SSIS PowerPack?
High performance suite of Custom SSIS tasks, transforms and adapters

With existing ETL tools, an alternate option to avoid staging data in Amazon (S3/Dynamo) is to use the commercial DataDirect Amazon Redshift Driver which supports a high performance load over the wire without additional dependencies to stage data.
https://blogs.datadirect.com/2014/10/recap-amazon-redshift-salesforce-data-integration-oow14.html

For getting data into Amazon Redshift, I made DataDuck http://dataducketl.com/
It's like Ruby on Rails but for building ETLs.
To give you an idea of how easy it is to set up, here's how you get your data into Redshift.
Add gem 'dataduck' to your Gemfile.
Run bundle install
Run datatduck quickstart and follow the instructions
This will autogenerate files representing the tables and columns you want to migrate to the data warehouse. You can modify these to customize it, e.g. remove or transform some of the columns.
Commit this code to your own ETL project repository
Git pull the code on your EC2 server
Run dataduck etl all on a cron job, from the EC2 server, to transfer all the tables into Amazon Redshift

Why not Python+boto+psycopg2 script?
It will run on EC2 Windows or Linux instance.
If it's OS Windows you could:
Extract data from SQL Server( using sqlcmd.exe)
Compress it (using gzip.GzipFile).
Multipart upload it to S3 (using boto)
Append it to Amazon Redshit table (using psycopg2).
Similarly, it worked for me when I wrote Oracle-To-Redshift-Data-Loader

Related

Replicate data from cloud SQL postgres to bigQuery

I am looking for the recommended way of streaming database change from cloud SQL (postgres) to bigQuery ? I am seeing that CDC streaming does not seems available for postgres, does anyone know the timeline of this feature ?
Thanks a lot for you help.
Jonathan.
With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery
You have to create an ETL process. That will allow you to automatically transform data from Postgres into BigQuery. You can do that using many ways, but I will point you to the two main approaches that I've already implemented:
Way 1:
Set Up the ETL Process manually:
Create your ETL using open source tools...
This method involves the use of the COPY command to migrate data from PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement with SQL functions or PL/pgSQL procedures which gives a lot of flexibility to extract data as a full dump or incrementally. You need to know that it is a time-consuming process and would need you to invest in engineering bandwidth!
Also, you could try different tech stacks to implement the above, and I recommended this one Java Spring Data Flow
Way 2:
Using DataFlow
You can automate the ETL process using GCP's DataFlow without coding your own solution. It is faster and it cost, of course.
DataFlow is Unified stream and batch data processing that's
serverless, fast, and cost-effective.
Check more details and learn in a minute here
Also check this

Export SQL query result as a txt file to Azure blob storage automatically

I am trying to export the result of a query to a container in the Azure blob storage. I did much research and it seems there are services that can do this, but they are paid services; is there any way to automate this without any paid service at all? I can already push the files from my computer to the storage automatically, but if I could find a way to directly do this it will be great. Essentially I want to extract some data on daily basis to the storage and make it possible for a simple download using browsers or from withing Excel
fictitious Example:
SELECT name, salary FROM dbo.Employees
Export to https://mystorage.blob.core.windows.net/mycontainer/myresults.txt
If you are use on-premise SQL Server and want to run the script which save the query result to Blob Storage automatically, SSIS is the good solution. It's free and very effective.
You can ref this tutorial Azure Blob Storage Data Upload with SSIS: it teaches us run a SQL query and upload it to Blob Storage.
Then you can schedule run the SSIS package with a SQL Server agent job. Ref this document: How to Execute SSIS Packages from SQL Server Agent:
SSIS is indeed a good choice for implementing ETL processes. The
typical process is scheduled to run on a periodic basis. SQL Server
Agent is a good tool for executing SSIS packages as well as
scheduling jobs to run at the appropriate times.
You can combine these two documents and achieve your purpose.

Data migration from teradata to bigquery

My requirement is to migrate data from teradata database to Google bigquery database where table structure and schema remains unchanged. Later, using the bigquery database, I want to generate reports.
Can anyone suggest how I can achieve this?
I think you should try TDCH to export data to Google Cloud Storage in Avro format. TDCH runs on top of hadoop and exports data in parallel. You can then import data from avro files into BigQuery.
I was part of a team that addressed this issue in a Whitepaper.
The white paper documents the process of migrating data from Teradata Database to Google BigQuery. It highlights several key areas to consider when planning a migration of this nature, including the rationale for Apache NiFi as the preferred data flow technology, pre-migration considerations, details of the migration phase, and post-migration best practices.
Link: How To Migrate From Teradata To Google BigQuery
I think you can also try to use cloud composer(apache airflow) or install apache airflow in instance.
If you can open the ports from Teradata DB then you can run 'gsutil' command from there and schedule it via airflow/composer to run the jobs on daily basis. Its quick and you can leverage the scheduling capabilities of airflow.
BigQuery introduced Migration Service which is a comprehensive solution for migrating the data warehouse to BigQuery. It includes free to use tools that help with each phase of migration including assessment and planning to execution and verification.
Reference:
https://cloud.google.com/bigquery/docs/migration-intro

Test Amazon S3 and Redshift

Currently I am trying to figure out a strategy to automate the below testing scenarios:
Compare data integrity between a set of files between On premise server and Amazon S3 (csv) i.e. comparing two file's entire content while they are residing in two diff server.
My thought: I have thought of using Java to compare both, but not sure bow to perform run time comparison between two diff servers. Otherwise I have to bring both of the files into same server for comparison.
Compare data integrity between Amazon S3 and Amazon Redshift (After data loaded from S3 to Redshift). Can I use Java to query Amazon S3 object and create a table first and then compare with Redshift? But I think although they are part of the same environment S3 and Redshift are 2 diff servers.
Please suggest if there is any SIT test framework to test On-premise to AWS Cloud migration.

Synchronize Amazon RDS with Google BigQuery

People, the company where I work has some MySQL databases on AWS (Amazon RDS). We are making a POC with BigQuery and what I am researching now is how to replicate the bases to BigQuery (the existing registers and the new ones in the future). My doubts are:
How to replicate the MySQL tables and rows to BigQuery. Is there any tool to do that (I am reading about Amazon Database Migration Service)? Should I replicate to Google Cloud SQL and than export to BigQuery?
How to replicate the future registers? Is possible to create a job inside MySQL to send the new registers after a predefined number? For example, after 1,000 new rows are inserted (or a time is passed), some event is "triggered" and the new registers are copied to Cloud SQL/BigQuery?
My initial idea is to dump the original base, load it to the other and use a script to listen to new registers and send them to the new base.
Have I explained it properly? Is it understandable?
You will need to use one of the ETL tools which have integration with both mySQL and BigQuery to perform initial transfer of the data and copy subsequent changes to BigQuery. Take a look on the list of available tools [1]
You can also implement your own tool by developing a process which will extract the data from mySQL to a CSV file and then load that file into BigQuery using data import [2]
[1] https://cloud.google.com/bigquery/third-party-tools
[2] https://cloud.google.com/bigquery/loading-data-into-bigquery
In addition to what Vadim said, you can try:
mysqldump to CSV files to s3 (I believe RDS allows that)
run "gsutil" Google Cloud Storage utility to copy data from s3 to GCS
run "bq load file.csv" to load the file to BigQuery
I'm interested in hearing your experience, so feel free to ping me in private.