How do I create a BigQuery dataset out of another BigQuery dataset? - google-bigquery

I need to understand the below:
1.) How does one BigQuery dataset connect to another, apply some logic, and create a new one? For example, suppose I have an ETL tool like DataStage and some data has been uploaded for us to consume in BigQuery. In DataStage, or using any other technology, how do I design the job so that the source is one BigQuery dataset and the target is another?
2.) My input will be a BigQuery view; I need to run some logic on that view and then load the result into another BigQuery view.
3.) What technology is used to connect one BigQuery dataset to another? Is it HTTPS or something else?
Thanks

If you have a large amount of data to process (many GB), you should transform the data directly in BigQuery. It would be very slow to extract all the data, run it through something locally, and send it back. You don't need any outside technology to make one view depend on another view; you only need access to the relevant data.
The ideal job design is a SQL query that BigQuery can process. If you are linking tables or views across different projects, the source table must be given in fully qualified form, projectId.datasetName.tableName, in the FROM clause of the SQL query. Project IDs are globally unique in Google Cloud.
Permissions to access the data must be set up correctly. BigQuery provides fine-grained access control, which is described in the BigQuery documentation. You can also enable public access for all BigQuery users if that is appropriate.
Once you have that SQL query, you can create a new view by sending your SQL to Google BigQuery either through the command line (the bq tool), the web console, or an API.
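As a minimal sketch of the kind of SQL described above, the following Python snippet only builds the statement text; it does not send anything to BigQuery. The project, dataset, and view names and the filter condition are hypothetical, chosen purely for illustration.

```python
def qualified(project: str, dataset: str, table: str) -> str:
    """Return a backtick-quoted, fully qualified BigQuery name."""
    return f"`{project}.{dataset}.{table}`"


def create_view_sql(target: str, source: str, where: str) -> str:
    """Build a CREATE OR REPLACE VIEW statement that reads from `source`."""
    return (
        f"CREATE OR REPLACE VIEW {target} AS\n"
        f"SELECT *\n"
        f"FROM {source}\n"
        f"WHERE {where}"
    )


# Hypothetical names, for illustration only.
source = qualified("source-project", "raw_data", "orders_view")
target = qualified("target-project", "reporting", "recent_orders")
sql = create_view_sql(target, source, "order_date >= '2024-01-01'")
print(sql)
```

The resulting text could then be submitted through the bq tool, the web console, or an API client, provided the account running it can read the source view and write to the target dataset.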

1) You can use the BigQuery Connector in DataStage to read from and write to BigQuery.
2) BigQuery uses names in the format project.dataset.table to access tables across projects. This lets you manipulate your data in GCP as if it were all in the same database.
To manipulate your data you can use standard SQL, including DML statements.
To execute your queries you can use the GCP web console or client libraries for languages such as Python or Java.
3) BigQuery is a RESTful web service and uses HTTPS.

Related

Replicate data from cloud SQL postgres to bigQuery

I am looking for the recommended way to stream database changes from Cloud SQL (Postgres) to BigQuery. CDC streaming does not seem to be available for Postgres; does anyone know the timeline for this feature?
Thanks a lot for your help.
Jonathan.
With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery
You have to create an ETL process. That will allow you to automatically transform data from Postgres and load it into BigQuery. There are many ways to do that, but I will point you to the two main approaches that I've already implemented:
Way 1:
Set up the ETL process manually:
Create your ETL using open source tools...
This method uses the COPY command to move data between PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement with SQL functions or PL/pgSQL procedures, which gives a lot of flexibility to extract data as a full dump or incrementally. Be aware that it is a time-consuming process and requires investing engineering bandwidth!
Also, you could try different tech stacks to implement the above; I recommend Java Spring Data Flow.
Way 2:
Using Dataflow
You can automate the ETL process using GCP's Dataflow without coding your own solution. It is faster, but of course it costs money.
Dataflow is unified stream and batch data processing that's serverless, fast, and cost-effective.

BigQuery to BigQuery DataFlow

I've had a look at this SO post but it's three years old and I think GCP has changed since then.
What I'm trying to do is set up a data pipeline using DataFlow jobs to copy/transform data from one GBQ project into another GBQ project.
To create a DataFlow job, you need to choose a template and there is no template that matches my needs i.e. no BQ to BQ template.
There is an option to use a custom template (which I imagine would be a python script or something along those lines), but it seems odd that there is no BQ to BQ template. Is DataFlow not the right tool for this job? Should I just use scheduled queries?
Thanks in advance
If you really want to use a Dataflow template, there is a way, although it is not very straightforward: use the BigQuery to Cloud Storage template to store the data in GCS, and then the Cloud Storage to BigQuery template to bring the data into the destination project. Make sure you grant the permissions required to access the Cloud Storage buckets from the destination project.
If the transformations you want are not possible or not practical in SQL, you can use Cloud Data Fusion -> Integration studio. There you can choose BigQuery as both source and sink, and a number of options are available for the transformation component. It is similar to an ETL tool. See the Data Fusion Quickstart documentation.
Otherwise, you can simply execute or schedule a query in BigQuery itself, as your requirements dictate, and save the result of the query in another table. See "Saving query results in destination table".

Transformation in Snowflake or Azure data Factory?

I'm very new to Snowflake, so forgive me if the answer is obvious.
I am loading the data from on-prem into Azure using Data Factory, and then ingesting into Snowflake using COPY INTO. However, I need to enable access for some of the transformed data to other platforms, meaning that if I perform transformation in Snowflake, I'll need to create an external table in Azure (essentially pushing this data back to Azure so other platforms can access it).
As we don't particularly want to introduce a new tool, I have two options for our fairly basic transformation:
do the transformation in ADF
do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools (these platforms don't integrate with Snowflake)
Are there any major drawbacks to option 2 apart from increased storage costs?
I'm trying to weigh up the following: maintenance effort (our team's skills lie in SQL not ADF), cost, and performance.
Any advice would be appreciated.
As stated in the question, there are many possible answers for this scenario - with my favorite being the second one ("do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools").
If you need to make the results of these transformations available on Azure storage, Azure Data Factory supports this natively:
Copy data from Snowflake that utilizes Snowflake's COPY into [location] command to achieve the best performance. https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake#supported-capabilities
Or you could manage this inside Snowflake using the same COPY INTO that ADF uses.
Let me add a couple screenshots from the Snowflake webinar "Data Warehouse or Data Lake? How You Can Have Both in a Single Platform":
https://resources.snowflake.com/webinars-thought-leadership/data-warehouse-or-data-lake-how-you-can-have-both-in-a-single-platform-3

How to create a triggered update of cloud SQL instance export into SQL dump file in cloud storage? [duplicate]

I am designing a solution in which Google Cloud SQL will be used to store all data from the regular functioning of the app (essentially OLTP data). The data is expected to grow quite large over time. The data itself is relational in nature, and hence we have chosen Cloud SQL instead of Cloud Datastore.
This data needs to be fed into BigQuery for analytics, and this needs to be near real-time analytics (as the best case), although realistically some lag can be expected. I am trying to design a solution that reduces this lag to the minimum possible.
My question has 3 parts:
1) Should I use Cloud SQL for storing data and then move it to BigQuery, or change the basic design itself and use BigQuery for storing the data initially as well? Is BigQuery suitable for regular, low-latency OLTP workloads? (I don't think so; is my assumption correct?)
2) What is the recommended/best practice for loading Cloud SQL data into BigQuery and having this integration work in near real time?
3) Is Cloud Dataflow a good option? If I connect Cloud SQL to Cloud Dataflow and then to BigQuery, will it work? Or is there a better way to achieve this (as asked in question 2)?
Take a look at how WePay does this:
https://wecode.wepay.com/posts/bigquery-wepay
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
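The high-watermark idea in the quoted passage can be sketched locally. This is not WePay's code: it uses an in-memory SQLite table as a stand-in for the MySQL source, and the `events` table, its columns, and the `fetch_since` helper are all hypothetical names introduced for illustration.

```python
import sqlite3


def fetch_since(conn, watermark: int, overlap: int = 0):
    """Return rows whose id is >= (watermark - overlap), oldest first."""
    cur = conn.execute(
        "SELECT id, payload FROM events WHERE id >= ? ORDER BY id",
        (watermark - overlap,),
    )
    return cur.fetchall()


# In-memory stand-in for an append-only source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (id, payload) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")],
)

# Suppose the previous run loaded everything up to id 3.
# overlap=1 re-reads one earlier row to catch anything missed last time.
rows = fetch_since(conn, watermark=3, overlap=1)
new_watermark = max(row[0] for row in rows)
print(rows, new_watermark)
```

Because the overlap deliberately re-reads rows that were already loaded, the load step on the destination side would need to deduplicate or upsert on the key.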
BigQuery supports Cloud SQL federated queries, which let you query a Cloud SQL database directly from BigQuery. To keep a BigQuery table in sync with a Cloud SQL table, you can write a simple script that runs the following query every hour.
INSERT demo.customers (column1)
SELECT *
FROM EXTERNAL_QUERY(
  "project.us.connection",
  "SELECT column1 FROM mysql_table WHERE timestamp > ${timestamp};");
Just remember to replace ${timestamp} with the current timestamp minus 1 hour.
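The substitution step might look like the sketch below, which only builds the query text. The `build_sync_query` helper is hypothetical, and formatting the cutoff as a single-quoted `YYYY-MM-DD HH:MM:SS` literal is an assumption about how the MySQL column is compared; adjust it to the actual column type.

```python
from datetime import datetime, timedelta, timezone

# Same query shape as in the answer; only the cutoff is filled in.
SYNC_TEMPLATE = """INSERT demo.customers (column1)
SELECT *
FROM EXTERNAL_QUERY(
  "project.us.connection",
  "SELECT column1 FROM mysql_table WHERE timestamp > '{cutoff}';");"""


def build_sync_query(now: datetime, lookback: timedelta = timedelta(hours=1)) -> str:
    """Fill in the cutoff as `now - lookback` (assumed literal format)."""
    cutoff = (now - lookback).strftime("%Y-%m-%d %H:%M:%S")
    return SYNC_TEMPLATE.format(cutoff=cutoff)


# Fixed "now" so the output is reproducible; a real script would pass
# datetime.now(timezone.utc) instead.
query = build_sync_query(datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc))
print(query)
```

Passing the current time in as a parameter, rather than reading the clock inside the function, keeps the hourly job easy to test and to re-run for a past window.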
Another method would be to split the write process so that it writes both to Cloud SQL and to Cloud Pub/Sub, and then have a Dataflow reader stream into BigQuery. This works well when you have a materially different target schema for your BigQuery tables, which is common when denormalizing your relational data.
The upside is that you can reduce overall latency to say a few seconds; however, the main downside is that if your transactional data is highly mutating you will have to create a versioning scheme to track changes.
Google has provided a reference article on this subject related to using a change data capture tool to identify the changed data and only pushing that.
This makes some assumptions that may not work for you:
willingness to learn Debezium
willingness to let GCP connect to your source MySQL database
If those work for your situation it seems like a good solution.
I think you can use federated queries as one possible solution:
A federated query is a way to send a query statement to an external database and get the result back as a temporary table. Federated queries use the BigQuery Connection API to establish a connection with the external database. In your standard SQL query, you use the EXTERNAL_QUERY function to send a query statement to the external database, using that database's SQL dialect. The results are converted to BigQuery standard SQL data types.
You can use federated queries with the following external databases:
Cloud Spanner
Cloud SQL
After the initial one-time setup, you can write a query with the EXTERNAL_QUERY SQL function.
Here is the documentation so you can implement it in your project:
https://cloud.google.com/bigquery/docs/federated-queries-intro

Tableau data extract refresh from Google BigQuery takes very long

We are very pleased with the combination BigQuery <-> Tableau Server with live connection. However, we now want to work with a data extract (500MB) on Tableau Server (since this datasource is not too big and is used very frequently). This takes too much time to refresh (1.5h+). We noticed that only 0.1% is query time and the rest is data export. Since the Tableau Server is on the same platform and location, latency should not be a problem.
This is similar to the slow export of a BigQuery table to a single file, which can be solved by using "daisy chain" option (wildcards). Unfortunately we can't use similar logic with a Google BigQuery data extract refresh in Tableau...
We have identified some approaches, but are not pleased with our current ideas:
Working with incremental refresh: our existing BigQuery table rows can change: these changes can only be applied in Tableau if you do a full refresh
Exporting the BigQuery table to GCS using the daisy chain option and making a Tableau data extract using the Tableau SDK: this would result in quite some overhead...
Writing a Dataflow job using a custom sink for Tableau Server (data extracts).
Experimenting with a Tableau web connector that communicates directly with the BigQuery API: I don't think this will be faster. I didn't see anything about parallelizing calls with the Tableau web connector, but I haven't tried this approach yet.
We would prefer a non-technical option, to limit maintenance... Is there a way to modify the Tableau connector to make use of the "daisy chain" option for BigQuery?
You've uploaded the data in BigQuery. Can't you just use the input for that load job (a CSV perhaps) as input for Tableau?
When we use Tableau and BigQuery we also notice that extracts are slow, but we generally don't use extracts because you lose BigQuery's power. We start with a live data connection, and then (if needed) convert this into a custom query that aggregates the data into a much smaller dataset, which extracts in just a few seconds.
Another way to achieve higher performance with BigQuery and Tableau is to aggregate or join tables beforehand. JOINs on huge tables can be slow, so if you use a lot of those you might consider generating a denormalised dataset that does all of the joining first. You will get a dataset with a lot of duplicates and a lot of columns. But if you select only what you need in Tableau (hide unused fields!) then these columns won't count in your query cost.
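To illustrate the effect of aggregating before extracting, here is a small local sketch that uses an in-memory SQLite table in place of BigQuery. The `sales` table and its columns are hypothetical; the point is only that the grouped result has fewer rows than the detail table.

```python
import sqlite3

# In-memory stand-in for a detail table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-01", "EU", 10.0),
        ("2024-01-01", "EU", 5.0),
        ("2024-01-01", "US", 7.0),
        ("2024-01-02", "EU", 3.0),
        ("2024-01-02", "US", 4.0),
        ("2024-01-02", "US", 6.0),
    ],
)

detail_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

# The "custom query": one row per (day, region) instead of one per sale.
summary = conn.execute(
    "SELECT day, region, SUM(amount) AS total "
    "FROM sales GROUP BY day, region ORDER BY day, region"
).fetchall()
print(detail_rows, len(summary))
```

With real data the reduction is usually far larger than in this toy table, which is why an extract built on the aggregated query can refresh much faster.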
One recommendation I have seen is similar to your point 2 where you export the BQ table to Google Cloud Storage and then use the Tableau Extract API to create a .tde from the flat files in GCS.
This was from an article on the Google Cloud site so I'd assume it would be best practice:
https://cloud.google.com/blog/products/gcp/the-switch-to-self-service-marketing-analytics-at-zulily-best-practices-for-using-tableau-with-bigquery
There is an article here which provides a step by step guide to achieving the above.
https://community.tableau.com/docs/DOC-23161
It would be nice if Tableau optimised the BQ connector for extract refresh using the BigQuery Storage API. We too have our Tableau Server environment in the same GCP zone as our BQ datasets and experience slow refresh times.