Synchronize Amazon RDS with Google BigQuery - google-bigquery

People, the company where I work has some MySQL databases on AWS (Amazon RDS). We are making a POC with BigQuery and what I am researching now is how to replicate the bases to BigQuery (the existing registers and the new ones in the future). My doubts are:
How to replicate the MySQL tables and rows to BigQuery. Is there any tool to do that (I am reading about Amazon Database Migration Service)? Should I replicate to Google Cloud SQL and than export to BigQuery?
How to replicate the future registers? Is possible to create a job inside MySQL to send the new registers after a predefined number? For example, after 1,000 new rows are inserted (or a time is passed), some event is "triggered" and the new registers are copied to Cloud SQL/BigQuery?
My initial idea is to dump the original base, load it to the other and use a script to listen to new registers and send them to the new base.
Have I explained it properly? Is it understandable?

You will need to use one of the ETL tools which have integration with both mySQL and BigQuery to perform initial transfer of the data and copy subsequent changes to BigQuery. Take a look on the list of available tools [1]
You can also implement your own tool by developing a process which will extract the data from mySQL to a CSV file and then load that file into BigQuery using data import [2]
[1] https://cloud.google.com/bigquery/third-party-tools
[2] https://cloud.google.com/bigquery/loading-data-into-bigquery

In addition to what Vadim said, you can try:
mysqldump to CSV files to s3 (I believe RDS allows that)
run "gsutil" Google Cloud Storage utility to copy data from s3 to GCS
run "bq load file.csv" to load the file to BigQuery
I'm interested in hearing your experience, so feel free to ping me in private.

Related

Replicate data from cloud SQL postgres to bigQuery

I am looking for the recommended way of streaming database change from cloud SQL (postgres) to bigQuery ? I am seeing that CDC streaming does not seems available for postgres, does anyone know the timeline of this feature ?
Thanks a lot for you help.
Jonathan.
With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery
You have to create an ETL process. That will allow you to automatically transform data from Postgres into BigQuery. You can do that using many ways, but I will point you to the two main approaches that I've already implemented:
Way 1:
Set Up the ETL Process manually:
Create your ETL using open source tools...
This method involves the use of the COPY command to migrate data from PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement with SQL functions or PL/pgSQL procedures which gives a lot of flexibility to extract data as a full dump or incrementally. You need to know that it is a time-consuming process and would need you to invest in engineering bandwidth!
Also, you could try different tech stacks to implement the above, and I recommended this one Java Spring Data Flow
Way 2:
Using DataFlow
You can automate the ETL process using GCP's DataFlow without coding your own solution. It is faster and it cost, of course.
DataFlow is Unified stream and batch data processing that's
serverless, fast, and cost-effective.
Check more details and learn in a minute here
Also check this

Google Cloud Data Fusion, How can I load many tables to bigquery in one pipeline

I want to load many tables which is in aws rds mysql server by using cloud data fusion. each table storage is more than about 1gb. also I found the plugin which name is "multiple database table" to load multi table. but i got a fail. Also basically when I used database source I can check my tables' schema. However, in multiple database table, i can 't find how to check table's schema. how can i use this plugin? or is there any other way to load many tables in data fusion service?
My pipeline setting was as follows.
I'm posting this Community Wiki as OP didn't provide enough details to reproduce but the below information might help someone.
There are few ways to get your data using Cloud Data Fusion, you can use pipeline, plugin, driver and a few others depending on your needs.
On the internet you can find two very well described guides with examples.
If you would like to find some information about Cloud Data Fusion with GCP products you should read Bahadir Bulut guide - How I used Google Cloud Data Fusion to create a data warehouse - Part 1 and Part 2. Also Data Fusion allows to use 150+ preconfigured connectors and transformations like Amazons S3, SQS, etc. Azure services and many more.
Another well described (which I guess would help OP) is to configure both Amazon and GCP resources and using pipelines. This guide is Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery — Part 1: Setting UP AWS Data pipeline and second part Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery — Part 2: Setting up BigQuery Transfer Service and Scheduled Query.. In short this guide describes 2 main steps:
Extract data from MYSQL RDS and bring into S3 using AWS data pipeline service
From S3, bring the file inside Bigquery using BigqQuery transfer service.

Data migration from teradata to bigquery

My requirement is to migrate data from teradata database to Google bigquery database where table structure and schema remains unchanged. Later, using the bigquery database, I want to generate reports.
Can anyone suggest how I can achieve this?
I think you should try TDCH to export data to Google Cloud Storage in Avro format. TDCH runs on top of hadoop and exports data in parallel. You can then import data from avro files into BigQuery.
I was part of a team that addressed this issue in a Whitepaper.
The white paper documents the process of migrating data from Teradata Database to Google BigQuery. It highlights several key areas to consider when planning a migration of this nature, including the rationale for Apache NiFi as the preferred data flow technology, pre-migration considerations, details of the migration phase, and post-migration best practices.
Link: How To Migrate From Teradata To Google BigQuery
I think you can also try to use cloud composer(apache airflow) or install apache airflow in instance.
If you can open the ports from Teradata DB then you can run 'gsutil' command from there and schedule it via airflow/composer to run the jobs on daily basis. Its quick and you can leverage the scheduling capabilities of airflow.
BigQuery introduced Migration Service which is a comprehensive solution for migrating the data warehouse to BigQuery. It includes free to use tools that help with each phase of migration including assessment and planning to execution and verification.
Reference:
https://cloud.google.com/bigquery/docs/migration-intro

google-cloud-dataflow : How to read data from a Database and write to BigQuery

I need to setup a data pipeline from some source databases like Oracle, MySQL and load the data to BigQuery.
How can I use google-cloud-dataflow to read data from a database(jdbc connection) and write to BigQuery tables using Python.
Also, I have some hive tables in an on-premise Hadoop cluster, how do I transfer this data to BigQuery.
I couldn't find the right documentation or examples to achieve this.
Can you please point me in the right direction.
I applied a solution in my project to provide such thing, you need to follow these steps:
Load data from Google Cloud SQL to Google Cloud storage in CSV by following this link.
Load the CSV data from Google cloud storage directly into BigQuery by following this link.

Tool to "Data Load" or "ETL" -- from SQL Server into Amazon Redshift

I am trying to figure out decent but simple tool which I can host myself in AWS EC2, which will allow me to pull data out of SQL Server 2005 and push to Amazon Redshift.
I basically have a view in SQL Server on which I am doing SELECT * and I need just put all this data into Redshift. The biggest concern is that there is a lot of data, and this will need to be configurable so I can queue it, run as a nighly/continuous job, etc.
Any suggestions?
alexeypro,
dump tables to files, then you have two fundamental challenges to solve:
Transporting data to Amazon
Loading data to Redshift tables.
Amazon S3 will help you with both:
S3 supports fast upload of files to Amazon from your SQL server location. See this great article. It is from 2011 but I did some testing a few months back and saw very similar results. I was testing with gigabytes of data and 16 uploader threads were ok, as I'm not on backbone. Key thing to remember is that compression and parallel upload are your friends to cut down the time for upload.
Once data are on S3, Redshift supports high-performance parallel load from files on S3 to table(s) via COPY SQL command. To get fastest load performance pre-partition your data based on table distribution key and and pre-sort it to avoid expensive vacuums. All is well documented in Amazon's best practices. I have to say these guys know how to make things neat & simple, so just follow the steps.
If you are coder you can orchestrate the whole process remotely using scripts in whatever shell/language you want. You'll need tools/libraries for parallel HTTP upload to S3 and command line access to Redshift (psql) to launch the COPY command.
Another options is Java, there are libraries for S3 upload and JDBC access to Redshift.
As other posters suggest, you could probably use SSIS (or essentially any other ETL tool) as well. I was testing with CloverETL. Took care of automating the process as well as partitioning/presorting the files for load.
Now Microsoft released SSIS Powerpack, so you can do it natively.
SSIS Amazon Redshift Data Transfer Task
Very fast bulk copy from on-premises data to Amazon Redshift in few clicks
Load data to Amazon Redshift from traditional DB engines like SQL Server, Oracle, MySQL, DB2
Load data to Amazon Redshift from Flat Files
Automatic file archiving support
Automatic file compression support to reduce bandwidth and cost
Rich error handling and logging support to troubleshoot Redshift Datawarehouse loading issues
Support for SQL Server 2005, 2008, 2012, 2014 (32 bit and 64 bit)
Why SSIS PowerPack?
High performance suite of Custom SSIS tasks, transforms and adapters
With existing ETL tools, an alternate option to avoid staging data in Amazon (S3/Dynamo) is to use the commercial DataDirect Amazon Redshift Driver which supports a high performance load over the wire without additional dependencies to stage data.
https://blogs.datadirect.com/2014/10/recap-amazon-redshift-salesforce-data-integration-oow14.html
For getting data into Amazon Redshift, I made DataDuck http://dataducketl.com/
It's like Ruby on Rails but for building ETLs.
To give you an idea of how easy it is to set up, here's how you get your data into Redshift.
Add gem 'dataduck' to your Gemfile.
Run bundle install
Run datatduck quickstart and follow the instructions
This will autogenerate files representing the tables and columns you want to migrate to the data warehouse. You can modify these to customize it, e.g. remove or transform some of the columns.
Commit this code to your own ETL project repository
Git pull the code on your EC2 server
Run dataduck etl all on a cron job, from the EC2 server, to transfer all the tables into Amazon Redshift
Why not Python+boto+psycopg2 script?
It will run on EC2 Windows or Linux instance.
If it's OS Windows you could:
Extract data from SQL Server( using sqlcmd.exe)
Compress it (using gzip.GzipFile).
Multipart upload it to S3 (using boto)
Append it to Amazon Redshit table (using psycopg2).
Similarly, it worked for me when I wrote Oracle-To-Redshift-Data-Loader