Perform custom SQL query with Google Cloud Data Fusion - google-bigquery

I have data pipelines that consist of multiple SQL queries run against BigQuery tables. I would like to build these in Google Cloud Data Fusion, but I don't see an option to transform/select with custom SQL.
Is this available, or am I misinterpreting the use cases for this tool?

A new Action plugin is being added that will allow you to specify a SQL query to run in BigQuery. Expect the connectors to be available in the Hub by mid-May.
Nitin

There is now a native BigQuery Execute action that allows SQL queries to run as part of a Data Fusion pipeline.
This plugin is an action; see the excerpt below from the official documentation:
Action plugins define custom actions that are scheduled to take place during a workflow but don't directly manipulate data in the workflow. For example, using the Database custom action, you can run an arbitrary database command at the end of your pipeline. Alternatively, you can trigger an action to move files within Cloud Storage.
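For illustration, the kind of arbitrary SQL you would put into such an action can be tested up front with the BigQuery Python client; this is only a sketch, and the project and table names below are made up:

    from google.cloud import bigquery

    # Hypothetical project and table names, just to show the kind of
    # "arbitrary database command" such an action would run.
    client = bigquery.Client(project="my-project")
    sql = """
        MERGE `my-project.analytics.daily_totals` AS t
        USING `my-project.staging.new_rows` AS s
        ON t.id = s.id
        WHEN NOT MATCHED THEN INSERT ROW
    """
    client.query(sql).result()  # blocks until the statement completes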

Related

Can we use SQL scripts (Develop hub) during pipeline creation (Integrate hub) in Azure Synapse?

I want to use my SQL script file (under the Develop hub) inside a pipeline (under the Integrate hub). Currently I do not see any activities that serve this purpose.
There is a Script activity under the General section, but it only offers Query and NonQuery options, with no way to reference a SQL script file created earlier.
Is this feature available at all in Azure Synapse Analytics? Can we refer to a SQL script by some other means?
If your Synapse workspace is paired with Azure DevOps, then I imagine it's easy to get the file content with a REST API call (e.g. here). However, you then have to parse the file, as GO is not supported by the Script activity. ADF / Synapse pipeline functions do not support a regex-style split, e.g. on a word boundary around GO (\bGO\b), so it starts to get fiddly. I had some success with the replace and uriComponent functions.
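For what it's worth, the word-boundary split itself is trivial outside the pipeline expression language; a rough Python sketch of the idea (the script content below is made up) looks like this:

    import re

    script = """CREATE TABLE dbo.demo (id INT);
    GO
    INSERT INTO dbo.demo VALUES (1);
    GO"""

    # Split on GO batch separators with a word-boundary match, which the
    # ADF/Synapse pipeline expression language cannot do natively.
    batches = [b.strip() for b in re.split(r"\bGO\b", script, flags=re.IGNORECASE) if b.strip()]
    for batch in batches:
        print(batch)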
However, you would be better off using stored procedures and the Stored procedure activity in Synapse pipelines; it is a much simpler implementation.

BigQuery to BigQuery DataFlow

I've had a look at this SO post but it's three years old and I think GCP has changed since then.
What I'm trying to do is set up a data pipeline using Dataflow jobs to copy/transform data from one GBQ project into another GBQ project.
To create a Dataflow job, you need to choose a template, and there is no template that matches my needs, i.e. no BQ-to-BQ template.
There is an option to use a custom template (which I imagine would be a Python script or something along those lines), but it seems odd that there is no BQ-to-BQ template. Is Dataflow not the right tool for this job? Should I just use scheduled queries?
Thanks in advance
There is a way, though not very straightforward, if you really want to use a Dataflow template: you can use the BigQuery to Cloud Storage template to store the data in GCS, and then the Cloud Storage to BigQuery template to bring the data into the destination project. However, make sure you grant the permissions required to access the Cloud Storage bucket from the destination project.
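If the two-hop route through GCS is acceptable, the same BigQuery -> GCS -> BigQuery idea can also be sketched with the BigQuery client library rather than the two Dataflow templates; this is only an illustration of the route, and all bucket and table names below are assumptions:

    from google.cloud import bigquery

    # Export from the source project to a shared GCS bucket, then load into
    # the destination project; equivalent in spirit to the two templates.
    src = bigquery.Client(project="source-project")
    src.extract_table(
        "source-project.sales.orders",
        "gs://my-transfer-bucket/orders-*.avro",
        job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
    ).result()

    dst = bigquery.Client(project="dest-project")
    dst.load_table_from_uri(
        "gs://my-transfer-bucket/orders-*.avro",
        "dest-project.sales.orders",
        job_config=bigquery.LoadJobConfig(source_format="AVRO"),
    ).result()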
If the transformations you want are not possible in SQL, or not practical to express in SQL, you can use Cloud Data Fusion -> Integration Studio. Here you can choose BigQuery as both source and sink, and there are a number of options available for the transformation component. It is similar to an ETL tool. See the Data Fusion Quickstart documentation.
Otherwise, you can simply execute or schedule a query as per your requirement in BigQuery itself and save the result of the query in another table (see Saving query results in a destination table).
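As a sketch of that last option, a query job can write its result straight into a table in the other project, and the same SQL can then be attached to a scheduled query; the project, dataset and table names here are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client(project="dest-project")
    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string("dest-project.reporting.orders_copy"),
        write_disposition="WRITE_TRUNCATE",  # overwrite the table on each run
    )
    sql = "SELECT * FROM `source-project.sales.orders`"
    client.query(sql, job_config=job_config).result()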

How do I create a BigQuery dataset out of another BigQuery dataset?

I need to understand the below:
1.) How does one BigQuery dataset connect to another BigQuery dataset, apply some logic, and produce a third? For example, if I have an ETL tool like DataStage and some data has been uploaded for us to consume in the form of a BigQuery dataset, how do I design the job (in DataStage or any other technology) so that the source is one BQ dataset and the target is another BQ dataset?
2.) What I want to achieve is: my input will be a BigQuery view, I need to run some logic on that view, and then load the result into another BigQuery view.
3.) What is the technology used to connect one BigQuery dataset to another? Is it HTTPS or something else?
Thanks
If you have a large amount of data to process (many GB), you should do the transformation of the data directly in the BigQuery database. It would be very slow to extract all the data, run it through something locally, and send it back. You don't need any outside technology to make one view depend on another view, besides access to the relevant data.
The ideal job design is a SQL query that BigQuery can process. If you are trying to link tables/views across different projects, then the source BQ table must be listed in fully-qualified form, projectName.datasetName.tableName, in the FROM clauses of the SQL query. Project names are globally unique in Google Cloud.
Permissions to access the data must be set up correctly. BQ provides fine-grained control over who can access what, and this is covered in the BQ documentation. You can also enable public access to all BQ users if that is appropriate.
Once you have that SQL query, you can create a new view by sending your SQL to Google BigQuery through the command line (the bq tool), the web console, or an API.
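A minimal sketch of that last step with the Python client (all project, dataset and column names are placeholders) would be:

    from google.cloud import bigquery

    client = bigquery.Client(project="target-project")
    view = bigquery.Table("target-project.reporting.orders_view")
    # The source table is referenced in fully-qualified form, as described above.
    view.view_query = """
        SELECT order_id, total
        FROM `source-project.sales.orders`
    """
    client.create_table(view)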
1) You can use the BigQuery Connector in DataStage to read from and write to BigQuery.
2) BigQuery uses namespaces in the format project.dataset.table to access tables across projects. This allows you to manipulate your data in GCP as if it were in the same database.
To manipulate your data you can use DML or standard SQL.
To execute your queries you can use the GCP web console or client libraries such as Python or Java.
3) BigQuery is a RESTful web service and uses HTTPS.
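That last point is easy to see directly: the client libraries are wrappers over the same HTTPS endpoints, so a query can be issued with a plain authorized POST. A minimal sketch (the query itself is just a placeholder):

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    credentials, project = google.auth.default(
        scopes=["https://www.googleapis.com/auth/bigquery"]
    )
    session = AuthorizedSession(credentials)

    # jobs.query over HTTPS, the same REST surface the client libraries wrap.
    resp = session.post(
        f"https://bigquery.googleapis.com/bigquery/v2/projects/{project}/queries",
        json={"query": "SELECT 1 AS ok", "useLegacySql": False},
    )
    resp.raise_for_status()
    print(resp.json()["rows"])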

Scheduling a query to copy data from a dataset between projects in BigQuery

We want to perform a test on BigQuery with scheduled queries.
The test retrieves a table from a dataset and, basically, copies it into another dataset (for which we have owner permissions) in another project. So far, we have managed to do that with a script we wrote in R against the BigQuery API on a Google Compute Engine instance, but we want/need to do it with scheduled queries in BigQuery.
If I just compose a query to retrieve the initial table data and try to schedule it, I see there's a project selector, but it's disabled, so it seems I'm tied to the project of the user I'm logging in with.
Is this doable or am I overdoing it and using the API is the only option to do this?
Is this doable or am I overdoing it and using the API is the only option to do this?
The current scheduler logic doesn't allow this, and for that reason the project drop-down is disabled in the web UI.
As an example, I tried setting up this scheduled job:
CREATE TABLE IF NOT EXISTS `projectId.partitionTables.tableName` (Field0 TIMESTAMP) --AS SELECT * FROM mydataset.myothertable
And this is the error returned from the transfer API.
You will need to ask the BigQuery team to add this option in a future version of the scheduler API.
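Until then, going through the API (as you already do from R) is the way; for reference, the equivalent with the Python client is a single cross-project copy job. This is only a sketch, and the project and dataset names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client(project="source-project")
    # Copy a table into a dataset owned in another project.
    copy_job = client.copy_table(
        "source-project.mydataset.myothertable",
        "other-project.partitionTables.tableName",
    )
    copy_job.result()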

How can I export data from BigQuery to S3?

I would like to automatically export the results of a Google BigQuery query to an S3 bucket every night. Does BigQuery support any kind of automated query runs?
This is kind of the reverse of this question.
BigQuery does not support any automatic scheduling of jobs. You would have to use some other framework to run a script on a schedule that inserts the query job.
One such option might be a Google Apps Script time-driven simple trigger. BigQuery is accessible through Google Apps Script, so putting these together should get you the ability to run BigQuery jobs on a schedule.
Google Apps Script Simple Trigger options: https://developers.google.com/apps-script/guides/triggers/#available_types_of_triggers
BigQuery Google Apps script sample code: https://developers.google.com/apps-script/advanced/bigquery
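Whichever scheduler you choose (Apps Script, cron on a VM, etc.), the script it triggers usually boils down to three steps: materialise the query result, export it to Cloud Storage, and push the file to S3. A rough sketch of those steps in Python rather than Apps Script, where the bucket names, object keys and the query itself are all placeholders:

    import boto3
    from google.cloud import bigquery, storage

    BQ_PROJECT = "my-project"          # placeholder
    GCS_BUCKET = "my-export-bucket"    # placeholder
    S3_BUCKET = "my-s3-bucket"         # placeholder

    bq = bigquery.Client(project=BQ_PROJECT)

    # 1. Run the query; the result lands in a temporary destination table.
    job = bq.query("SELECT * FROM `my-project.analytics.daily`")
    job.result()

    # 2. Export that table to Cloud Storage as CSV.
    bq.extract_table(job.destination, f"gs://{GCS_BUCKET}/nightly/export.csv").result()

    # 3. Download the export and push it to S3.
    blob = storage.Client(project=BQ_PROJECT).bucket(GCS_BUCKET).blob("nightly/export.csv")
    blob.download_to_filename("/tmp/export.csv")
    boto3.client("s3").upload_file("/tmp/export.csv", S3_BUCKET, "nightly/export.csv")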