Replicate data from Cloud SQL Postgres to BigQuery

I am looking for the recommended way of streaming database changes from Cloud SQL (Postgres) to BigQuery. I see that CDC streaming does not seem to be available for Postgres; does anyone know the timeline for this feature?
Thanks a lot for your help.
Jonathan.

With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery

You have to create an ETL process that transforms the data from Postgres on its way into BigQuery. There are many ways to do that, but I will point you to the two main approaches that I've already implemented:
Way 1:
Set Up the ETL Process manually:
Create your ETL using open source tools...
This method involves the COPY command, which moves data between PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement or from SQL functions and PL/pgSQL procedures, which gives a lot of flexibility to extract data as a full dump or incrementally. Be aware that it is a time-consuming process and requires you to invest engineering bandwidth!
Also, you could try different tech stacks to implement the above, and I recommend this one: Java Spring Data Flow. A minimal sketch of the COPY approach follows.
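To make Way 1 concrete, here is a minimal sketch in Python, assuming psycopg2 and the google-cloud-bigquery client; the connection settings, file path, and table names are placeholders, not anything from the original answer.

import psycopg2
from google.cloud import bigquery

# Export from Postgres with COPY (a full dump here; add a WHERE clause on a
# timestamp column to extract incrementally instead).
conn = psycopg2.connect(host="localhost", dbname="appdb", user="etl")
with conn.cursor() as cur, open("/tmp/customers.csv", "w") as f:
    cur.copy_expert("COPY (SELECT * FROM customers) TO STDOUT WITH CSV HEADER", f)
conn.close()

# Load the exported CSV into BigQuery.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
with open("/tmp/customers.csv", "rb") as f:
    job = client.load_table_from_file(
        f, "my_project.my_dataset.customers", job_config=job_config)
job.result()  # wait for the load job to finish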
Way 2:
Using Dataflow
You can automate the ETL process using GCP's Dataflow instead of coding your own solution. It is faster, but of course it comes at a cost.
Dataflow is unified stream and batch data processing that's serverless, fast, and cost-effective.
Check the Dataflow documentation for more details and tutorials.
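As an illustration of the Dataflow route, here is a minimal Apache Beam sketch in Python; it assumes the Postgres data was first exported to CSV in GCS (for example via the COPY approach above), and the bucket, project, schema, and table names are all placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv(line):
    # Map one CSV line to a BigQuery row dict (no quoted commas handled here).
    customer_id, name = line.split(",", 1)
    return {"customer_id": int(customer_id), "name": name}

options = PipelineOptions(
    runner="DataflowRunner", project="my-project",
    region="us-central1", temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText(
           "gs://my-bucket/export/customers.csv", skip_header_lines=1)
     | "Parse" >> beam.Map(parse_csv)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.customers",
           schema="customer_id:INTEGER,name:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))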

Related

Automate data transformation (SQL) and then push processed data to Tableau

I have questions about ways to automate data transformation process.
What I normally do is transform data using Python or PostgreSQL, export the processed data as CSV, and then connect the CSV file to Tableau.
I have done some research and found that ETL tools can help. However, I've watched some ETL tools' demo videos, and I'm not sure whether their transform features would meet my needs. For example, I have written 100+ lines of SQL for one of my data transformation tasks; it would be better if I could use PostgreSQL to run the query instead of an ETL tool.
The problem is that I don't know the proper way to automate the data transformation process and then push the data to Tableau. The CSV files will be updated on a daily basis, so I'll need to refresh the data.
Data transformation can be done in various ways; the right fit depends on the nature of your data.
If you have a huge volume of data and you are comfortable in Python/Java, you can automate your transformation logic using Spark, write the result to a Hive table, and then connect Tableau to read data from Hive (see the sketch below).
Most of the next-gen ETL tools like Pentaho and Talend can be used as well, but that erodes the flexibility and portability that a framework like Spark or Beam can give.
If you want to know how you can achieve this using cloud provider services like GCP or AWS, please let me know.
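A minimal PySpark sketch of that Spark-to-Hive route, assuming the raw data lands as CSV and Tableau connects to the resulting Hive table; the paths, database, and column names are illustrative.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-transform")
         .enableHiveSupport()
         .getOrCreate())

raw = spark.read.csv("/data/incoming/sales.csv", header=True, inferSchema=True)

# Your existing 100+ lines of SQL can run here mostly unchanged:
raw.createOrReplaceTempView("sales")
result = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
""")

# Overwrite the Hive table that Tableau reads on its daily refresh.
result.write.mode("overwrite").saveAsTable("analytics.daily_sales")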
Prep is the Tableau tool for preparing data. It can be used for joining, appending, cleaning, pivoting, filtering and other data cleansing activities.
Tableau Prep is available:
for free if you have a Tableau Creator license
in desktop and Online/Tableau Server versions
Scheduling Prep flows is available in Tableau Online/Server. To schedule flows you will need to acquire the Tableau Prep Conductor add-on.

Transformation in Snowflake or Azure Data Factory?

I'm very new to Snowflake, so forgive me if the answer is obvious.
I am loading the data from on-prem into Azure using Data Factory, and then ingesting it into Snowflake using COPY INTO. However, I need to enable access to some of the transformed data for other platforms, meaning that if I perform the transformation in Snowflake, I'll need to create an external table in Azure (essentially pushing this data back to Azure so other platforms can access it).
As we don't particularly want to introduce a new tool, I have two options for our fairly basic transformation:
do the transformation in ADF
do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools (these platforms don't integrate with Snowflake)
Are there any major drawbacks to option 2 apart from increased storage costs?
I'm trying to weigh up the following: maintenance effort (our team's skills lie in SQL not ADF), cost, and performance.
Any advice would be appreciated.
As stated in the question, there are many possible answers for this scenario - with my favorite being the second one ("do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools").
If you need to make the results of these transformations available on Azure storage, Azure Data Factory supports this natively:
Copy data from Snowflake that utilizes Snowflake's COPY into [location] command to achieve the best performance. https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake#supported-capabilities
Or you could manage this inside Snowflake using the same COPY INTO that ADF uses.
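A minimal sketch of option 2 using the Snowflake Python connector: run the transformation as SQL inside Snowflake, then unload the result to an external Azure stage with COPY INTO <location>. The stage, table names, and credentials here are placeholders, not anything from the question.

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",
    warehouse="transform_wh", database="analytics", schema="public")
cur = conn.cursor()

# Transformation step: materialize the result inside Snowflake.
cur.execute("""
    CREATE OR REPLACE TABLE curated_orders AS
    SELECT order_id, customer_id, SUM(amount) AS total
    FROM raw_orders
    GROUP BY order_id, customer_id
""")

# Unload to an external stage that points at Azure Blob Storage, so the
# other platforms can read the files without touching Snowflake.
cur.execute("""
    COPY INTO @azure_stage/curated_orders/
    FROM curated_orders
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE
""")
conn.close()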
For more background, see the Snowflake webinar "Data Warehouse or Data Lake? How You Can Have Both in a Single Platform":
https://resources.snowflake.com/webinars-thought-leadership/data-warehouse-or-data-lake-how-you-can-have-both-in-a-single-platform-3

How to create a triggered update of a Cloud SQL instance export into a SQL dump file in Cloud Storage? [duplicate]

I am designing a solution in which Google Cloud SQL will be used to store all data from the regular functioning of the app (kind of OLTP data). The data is expected to grow to a pretty large size over time. The data itself is relational in nature, and hence we have chosen Cloud SQL instead of Cloud Datastore.
This data needs to be fed into BigQuery for analytics, and this needs to be near-real-time analytics (in the best case), although realistically some lag can be expected. I am trying to design a solution which reduces this lag to the minimum possible.
My question has 3 parts -
Should I use Cloud SQL for storing the data and then move it to BigQuery, or change the basic design itself and use BigQuery for storing the data initially as well? Is BigQuery suitable for regular, low-latency OLTP workloads? (I don't think so; is my assumption correct?)
What is the recommended/best practice for loading Cloud SQL data into BigQuery and having this integration work in near real time?
Is Cloud Dataflow a good option? If I connect Cloud SQL to Cloud Dataflow and further to BigQuery, will it work? Or is there any other, better way to achieve this (as asked in question 2)?
Take a look at how WePay does this:
https://wecode.wepay.com/posts/bigquery-wepay
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
BigQuery supports Cloud SQL federated queries, which let you directly query a Cloud SQL database from BigQuery. To keep a Cloud SQL table in sync with BigQuery, you can write a simple script with the following query to sync the two tables every hour.
INSERT demo.customers (column1)
SELECT *
FROM EXTERNAL_QUERY(
  "project.us.connection",
  "SELECT column1 FROM mysql_table WHERE timestamp > ${timestamp};");
Just remember to replace ${timestamp} with the current timestamp minus 1 hour.
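A minimal sketch of that hourly sync script in Python, assuming the google-cloud-bigquery client and an existing connection named "project.us.connection"; the table and column names come from the query above and are illustrative.

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client()

# High watermark: everything newer than one hour ago.
watermark = (datetime.now(timezone.utc)
             - timedelta(hours=1)).strftime("%Y-%m-%d %H:%M:%S")

query = f"""
INSERT demo.customers (column1)
SELECT *
FROM EXTERNAL_QUERY(
  "project.us.connection",
  "SELECT column1 FROM mysql_table WHERE timestamp > '{watermark}';")
"""
client.query(query).result()  # schedule this e.g. with cron or Cloud Scheduler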
Another method would be to split the write process: write both to Cloud SQL and to Cloud Pub/Sub, and then have a Dataflow reader stream the changes into BigQuery. This works well when you have a materially different target schema for your BigQuery tables, which is common when denormalizing your relational data.
The upside is that you can reduce overall latency to, say, a few seconds; however, the main downside is that if your transactional data is highly mutating, you will have to create a versioning scheme to track changes.
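A minimal sketch of the dual-write idea, assuming the google-cloud-pubsub client: alongside the Cloud SQL write, the application publishes the same change to a Pub/Sub topic for a Dataflow pipeline to pick up. The topic name, payload fields, and versioning field are illustrative.

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "customer-changes")

change = {
    "op": "UPDATE",                       # insert / update / delete
    "table": "customers",
    "row": {"customer_id": 42, "name": "Ada"},
    "version": 7,                         # versioning scheme for mutating rows
}
future = publisher.publish(topic_path, data=json.dumps(change).encode("utf-8"))
future.result()  # block until Pub/Sub acknowledges the message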
Google has provided a reference article on this subject, related to using a change data capture tool to identify the changed data and push only that.
This makes some assumptions that may not work for you:
willingness to learn Debezium
willingness to let GCP connect to your source MySQL database
If those work for your situation, it seems like a good solution.
I think you can use federated queries as one possible solution:
A federated query is a way to send a query statement to an external database and get the result back as a temporary table. Federated queries use the BigQuery Connection API to establish a connection with the external database. In your standard SQL query, you use the EXTERNAL_QUERY function to send a query statement to the external database, using that database's SQL dialect. The results are converted to BigQuery standard SQL data types.
You can use federated queries with the following external databases:
Cloud Spanner
Cloud SQL
After the initial one-time setup, you can write a query with the EXTERNAL_QUERY SQL function.
I leave you the documentation so you can implement it on your project:
https://cloud.google.com/bigquery/docs/federated-queries-intro

Data migration from Teradata to BigQuery

My requirement is to migrate data from a Teradata database to a Google BigQuery database, where the table structure and schema remain unchanged. Later, using the BigQuery database, I want to generate reports.
Can anyone suggest how I can achieve this?
I think you should try TDCH (Teradata Connector for Hadoop) to export the data to Google Cloud Storage in Avro format. TDCH runs on top of Hadoop and exports data in parallel. You can then import the data from the Avro files into BigQuery.
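For the second step, here is a minimal sketch with the google-cloud-bigquery client; the bucket, prefix, and table ID are placeholders, and the schema is read from the Avro files themselves.

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://my-bucket/teradata-export/orders/*.avro",
    "my_project.my_dataset.orders",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete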
I was part of a team that addressed this issue in a Whitepaper.
The white paper documents the process of migrating data from Teradata Database to Google BigQuery. It highlights several key areas to consider when planning a migration of this nature, including the rationale for Apache NiFi as the preferred data flow technology, pre-migration considerations, details of the migration phase, and post-migration best practices.
Link: How To Migrate From Teradata To Google BigQuery
I think you can also try Cloud Composer (Apache Airflow), or install Apache Airflow on an instance.
If you can open the ports from the Teradata DB, then you can run the 'gsutil' command from there and schedule it via Airflow/Composer to run the jobs on a daily basis (see the sketch below). It's quick, and you can leverage the scheduling capabilities of Airflow.
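A minimal Airflow sketch of that daily schedule, assuming gsutil and the bq CLI are available on the worker; the DAG ID, paths, and table names are placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="teradata_to_bigquery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    upload = BashOperator(
        task_id="upload_export_to_gcs",
        bash_command=("gsutil -m cp /exports/orders/*.avro "
                      "gs://my-bucket/teradata-export/orders/"),
    )
    load = BashOperator(
        task_id="load_into_bigquery",
        bash_command=("bq load --source_format=AVRO my_dataset.orders "
                      "gs://my-bucket/teradata-export/orders/*.avro"),
    )
    upload >> load  # upload first, then load into BigQuery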
BigQuery introduced the BigQuery Migration Service, a comprehensive solution for migrating your data warehouse to BigQuery. It includes free-to-use tools that help with each phase of migration, from assessment and planning to execution and verification.
Reference:
https://cloud.google.com/bigquery/docs/migration-intro

Impala or Hive with Spark as execution engine?

I want to design a Web UI which fetches data from HDFS. I want to generate some reports using this data, which is stored in HDFS, in my own custom report format. I am writing REST APIs to fetch the data. But running Hive queries gives latency issues, hence I want a different approach. I can think of two:
Using Impala to create tables. But I am not sure about REST support for Impala.
Using Hive, but with Spark as the execution engine instead of MR.
spark-job-server provides REST support, and data can be fetched with Spark SQL.
Which of the approaches will be suitable, or is there a better approach for this?
Please, can anyone help, as I am very new to this.
I'd prefer to choose Impala if latency is the main consideration. It's dedicated to SQL processing on HDFS and does it well. As for the REST API and the application logic you are trying to achieve, a minimal sketch of the Impala query side follows.
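This sketch assumes the impyla package (not mentioned in the original answer); the host, port, and query are placeholders, and the rows could be serialized as JSON by your REST handler.

from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)  # default Impala daemon port
cur = conn.cursor()
cur.execute("SELECT region, COUNT(*) AS cnt FROM events GROUP BY region")
rows = cur.fetchall()
for region, cnt in rows:
    print(region, cnt)
conn.close()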