Google Cloud Data Fusion: How can I load many tables to BigQuery in one pipeline? - google-bigquery

I want to load many tables that live in an AWS RDS MySQL server by using Cloud Data Fusion. Each table is more than about 1 GB. I found the plugin named "Multiple Database Tables" for loading multiple tables, but I got a failure. Also, when I use the regular Database source I can check my tables' schemas; however, with the Multiple Database Tables plugin I can't find how to check a table's schema. How can I use this plugin? Or is there any other way to load many tables in the Data Fusion service?
My pipeline settings were as follows (screenshot not reproduced here).

I'm posting this as a Community Wiki because the OP didn't provide enough details to reproduce the issue, but the information below might help someone.
There are a few ways to get your data in with Cloud Data Fusion; you can use a pipeline, a plugin, a driver, and a few other approaches depending on your needs.
On the internet you can find two well-described guides with examples.
If you would like information about using Cloud Data Fusion with GCP products, you should read Bahadir Bulut's guide: How I used Google Cloud Data Fusion to create a data warehouse - Part 1 and Part 2. Data Fusion also lets you use 150+ preconfigured connectors and transformations, such as Amazon S3, SQS, Azure services, and many more.
Another well-described guide (which I guess would help the OP) covers configuring both Amazon and GCP resources and using pipelines: Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery — Part 1: Setting UP AWS Data pipeline, and its second part, Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery — Part 2: Setting up BigQuery Transfer Service and Scheduled Query. In short, this guide describes two main steps:
Extract data from the MySQL RDS instance into S3 using the AWS Data Pipeline service.
From S3, load the file into BigQuery using the BigQuery Data Transfer Service.
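For the second step, here is a minimal sketch of creating such an S3 transfer with the BigQuery Data Transfer Service Python client. The project, dataset, bucket, and credential values are placeholders, and the exact params keys should be verified against the Amazon S3 data source documentation:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

project_id = "my-gcp-project"  # placeholder

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",  # placeholder dataset
    display_name="rds-via-s3-to-bigquery",
    data_source_id="amazon_s3",
    params={
        "destination_table_name_template": "my_table",
        "data_path": "s3://my-bucket/exports/*.csv",  # placeholder S3 path
        "access_key_id": "<AWS_ACCESS_KEY_ID>",
        "secret_access_key": "<AWS_SECRET_ACCESS_KEY>",
        "file_format": "CSV",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```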

Related

Usage Tracking in Azure Synapse Analytics

Can anyone share a Kusto query (KQL) that I can use in Log Analytics to return some usage tracking stats?
I am trying to identify which "Views" and "Tables" are used the most. I am also trying to find out who the power users are and which commands/queries are run against the "Tables".
Any insights would be appreciated.
You can use the functions below to gather the usage statistics:
DiagnosticMetricsExpand()
DiagnosticLogsExpand()
ActivityLogRecordsExpand()
Then create target tables to store the function output so you can analyse the usage information.
Refer to the Azure documentation for complete details: https://learn.microsoft.com/en-us/azure/data-explorer/ingest-data-no-code?tabs=activity-logs
Tutorial: Ingest monitoring data in Azure Data Explorer without code. The tutorial shows how to ingest monitoring data into Azure Data Explorer without writing a line of code and how to query that data.
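As a rough illustration, such a query can also be run programmatically against the Log Analytics workspace with the azure-monitor-query Python client. The table and column names in the KQL below (SynapseSqlPoolSqlRequests, Command) are assumptions based on the Synapse diagnostic log schema and may differ in your workspace:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

# Assumed KQL: verify the table/column names in your own workspace.
QUERY = """
SynapseSqlPoolSqlRequests
| summarize QueryCount = count() by Command
| top 20 by QueryCount desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=7))

for table in response.tables:
    for row in table.rows:
        print(row)
```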

Connecting to a Cloud SQL Server instance from BigQuery

There is an option to connect a Cloud SQL MySQL instance from BigQuery. I just wanted to know how we can connect a Cloud SQL for SQL Server instance to BigQuery.
SQL Server:
There are a bunch of third-party extensions/tools that provide this service. One of them is SSIS Data Flow Source & Destination for Google BigQuery, a Visual Studio extension that connects SQL Server with Google BigQuery data through SSIS workflows:
https://www.cdata.com/drivers/bigquery/ssis/
https://marketplace.visualstudio.com/items?itemName=CDATASOFTWARE.SSISDataFlowSourceDestinationforGoogleBigQuery
With regard to using SQL Server Integration Services to load the data from the on-premises SQL Server into BigQuery, you can take a look at this site. You can also perform ETL from a relational database into BigQuery using Cloud Dataflow; the official documentation details how it can be done, and you might need to use Cloud Storage as an intermediate data sink.
Cloud SQL:
BigQuery allows you to query data from Cloud SQL by using federated queries. The connection must be created within the same project where your Cloud SQL instance is located. If you want to query data stored in a Cloud SQL instance from BigQuery in another project, please follow the steps listed below:
Enable the BigQuery API and the BigQuery Connection API within your project.
Create a connection to your Cloud SQL instance within the project by following this documentation.
Once you have created the connection, please locate and select it within BigQuery.
Click on the SHARE CONNECTION button and grant permissions to the users that will use that connection. Please note that the BigQuery Connection User role is the only role needed to use a shared connection.
Additionally, please note that the "Cloud SQL federated queries" feature is in a Beta stage and might change or have limited support (it is not available in certain regions, in which case you need to use one of the supported options mentioned here). Please remember that to use Cloud SQL federated queries in BigQuery, the instances need to have a public IP.
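For illustration, once the connection is shared, a federated query can be issued from BigQuery with EXTERNAL_QUERY. The connection ID and the inner query below are placeholders; check that your Cloud SQL engine is supported by federated queries:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder connection ID ("project.region.connection_id") and source table.
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT id, name FROM my_source_table;'
)
"""

for row in client.query(sql).result():
    print(dict(row))
```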
If you are limited, e.g. by region, one good option might be exporting the data from Cloud SQL to Cloud Storage as a CSV and then loading it into BigQuery. If you need to, it is possible to automate this process using Cloud Composer; refer to this article.
Another approach is to extract information from Cloud SQL (with exports) and import it into BigQuery through load jobs or streaming inserts.
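A minimal sketch of that export-then-load pattern with the google-cloud-bigquery client; the bucket path and table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # assumes the export wrote a header row
    autodetect=True,      # infer the schema from the CSV
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/cloudsql-export/*.csv",  # placeholder export path
    "my-project.my_dataset.my_table",        # placeholder target table
    job_config=job_config,
)
load_job.result()  # wait for the job to finish
print(f"Loaded {load_job.output_rows} rows")
```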
I hope you find the above pieces of information useful.
It is possible, but be warned that the feature is currently in Beta:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries

Data migration from Teradata to BigQuery

My requirement is to migrate data from a Teradata database to a Google BigQuery database, where the table structure and schema remain unchanged. Later, using the BigQuery database, I want to generate reports.
Can anyone suggest how I can achieve this?
I think you should try TDCH (Teradata Connector for Hadoop) to export the data to Google Cloud Storage in Avro format. TDCH runs on top of Hadoop and exports data in parallel. You can then import the data from the Avro files into BigQuery.
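As an illustration, the BigQuery half of that approach could look like the sketch below. Avro files carry their own schema, so none needs to be supplied; the GCS path and table name are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.load_table_from_uri(
    "gs://my-bucket/tdch-export/*.avro",     # placeholder TDCH output path
    "my-project.my_dataset.teradata_table",  # placeholder target table
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
)
job.result()  # wait for the load to complete
```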
I was part of a team that addressed this issue in a Whitepaper.
The white paper documents the process of migrating data from Teradata Database to Google BigQuery. It highlights several key areas to consider when planning a migration of this nature, including the rationale for Apache NiFi as the preferred data flow technology, pre-migration considerations, details of the migration phase, and post-migration best practices.
Link: How To Migrate From Teradata To Google BigQuery
I think you can also try to use Cloud Composer (Apache Airflow) or install Apache Airflow on an instance.
If you can open the ports from the Teradata DB, then you can run the 'gsutil' command from there and schedule it via Airflow/Composer to run the jobs on a daily basis. It's quick, and you can leverage the scheduling capabilities of Airflow.
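A rough sketch of such a daily DAG, assuming a recent Airflow 2.x with the Google provider installed; the gsutil command, bucket, and table names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="teradata_to_bq_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Placeholder: push the nightly Teradata export into GCS with gsutil.
    copy_to_gcs = BashOperator(
        task_id="copy_export_to_gcs",
        bash_command="gsutil cp /exports/daily/*.csv gs://my-bucket/teradata/",
    )

    load_to_bq = GCSToBigQueryOperator(
        task_id="load_into_bigquery",
        bucket="my-bucket",
        source_objects=["teradata/*.csv"],
        destination_project_dataset_table="my_dataset.teradata_table",
        source_format="CSV",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )

    copy_to_gcs >> load_to_bq
```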
BigQuery introduced the Migration Service, which is a comprehensive solution for migrating a data warehouse to BigQuery. It includes free-to-use tools that help with each phase of migration, from assessment and planning to execution and verification.
Reference:
https://cloud.google.com/bigquery/docs/migration-intro

google-cloud-dataflow: How to read data from a database and write to BigQuery

I need to set up a data pipeline from some source databases like Oracle and MySQL and load the data into BigQuery.
How can I use google-cloud-dataflow to read data from a database (JDBC connection) and write to BigQuery tables using Python?
Also, I have some Hive tables in an on-premises Hadoop cluster; how do I transfer that data to BigQuery?
I couldn't find the right documentation or examples to achieve this.
Can you please point me in the right direction.
I applied a solution in my project that provides exactly this; you need to follow these steps:
Load the data from Google Cloud SQL to Google Cloud Storage as a CSV by following this link.
Load the CSV data from Google Cloud Storage directly into BigQuery by following this link.
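Since the question specifically asks about a JDBC read with the Python SDK, here is an alternative minimal sketch using Beam's cross-language ReadFromJdbc; the driver class, JDBC URL, credentials, schema, and table names are placeholders, and the transform needs a Java expansion environment available at runtime:

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",  # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromMySQL" >> ReadFromJdbc(
            table_name="my_table",
            driver_class_name="com.mysql.cj.jdbc.Driver",
            jdbc_url="jdbc:mysql://10.0.0.1:3306/mydb",  # placeholder
            username="user",
            password="secret",
        )
        # Rows come back as named tuples; convert them to dicts for BigQuery.
        | "ToDict" >> beam.Map(lambda row: row._asdict())
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="id:INTEGER,name:STRING",  # placeholder schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```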

Synchronize Amazon RDS with Google BigQuery

People, the company where I work has some MySQL databases on AWS (Amazon RDS). We are doing a POC with BigQuery, and what I am researching now is how to replicate the databases to BigQuery (both the existing records and the new ones in the future). My doubts are:
How to replicate the MySQL tables and rows to BigQuery. Is there any tool to do that (I am reading about the AWS Database Migration Service)? Should I replicate to Google Cloud SQL and then export to BigQuery?
How to replicate the future records? Is it possible to create a job inside MySQL to send the new records after a predefined threshold? For example, after 1,000 new rows are inserted (or some time has passed), some event is "triggered" and the new records are copied to Cloud SQL/BigQuery?
My initial idea is to dump the original database, load it into the other one, and use a script to listen for new records and send them to the new database.
Have I explained it properly? Is it understandable?
You will need to use one of the ETL tools that have integrations with both MySQL and BigQuery to perform the initial transfer of the data and copy subsequent changes to BigQuery. Take a look at the list of available tools [1].
You can also implement your own tool by developing a process which extracts the data from MySQL to a CSV file and then loads that file into BigQuery using a data import [2]; for the incremental changes, see the streaming-insert sketch after the reference links below.
[1] https://cloud.google.com/bigquery/third-party-tools
[2] https://cloud.google.com/bigquery/loading-data-into-bigquery
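For the incremental side mentioned above (copying new rows as they arrive), a bare-bones sketch using the BigQuery streaming API; the rows and table name are hypothetical, and the change-capture query against MySQL is left out:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical batch of new rows fetched from MySQL since the last sync
# (e.g. SELECT ... WHERE id > last_synced_id).
new_rows = [
    {"id": 1001, "name": "alice"},
    {"id": 1002, "name": "bob"},
]

errors = client.insert_rows_json("my-project.my_dataset.my_table", new_rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```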
In addition to what Vadim said, you can try:
mysqldump to CSV files and upload them to S3 (I believe RDS allows that)
run the "gsutil" Google Cloud Storage utility to copy the data from S3 to GCS
run "bq load file.csv" to load the file into BigQuery
I'm interested in hearing your experience, so feel free to ping me in private.