I would like to automatically export the results of a Google BigQuery query to an S3 bucket every night. Does BigQuery support any kind of automated query runs?
This is kind of the reverse of this question.
BigQuery does not support automatic scheduling of jobs. You would have to use some other framework to run a script on a schedule that inserts the query job.
One such option might be a Google Apps Script time-driven trigger. BigQuery is accessible through Google Apps Script, so putting the two together should give you the ability to run BigQuery jobs on a schedule.
Google Apps Script trigger types: https://developers.google.com/apps-script/guides/triggers/#available_types_of_triggers
BigQuery Google Apps Script sample code: https://developers.google.com/apps-script/advanced/bigquery
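Whichever scheduler you end up with, the script it runs only needs to insert a query job and then an extract job to Cloud Storage; moving the exported files from Cloud Storage to S3 would still need a separate copy step. A minimal sketch of that job-insertion pattern, shown here with the Python BigQuery client rather than Apps Script, with hypothetical project, dataset, and bucket names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# 1. Run the nightly query and write the results to a staging table.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.nightly_results",  # hypothetical table
    write_disposition="WRITE_TRUNCATE",
)
client.query(
    "SELECT * FROM `my-project.my_dataset.source_table`",  # hypothetical query
    job_config=job_config,
).result()  # wait for the query job to finish

# 2. Export the staging table to Cloud Storage as sharded CSV files.
client.extract_table(
    "my-project.my_dataset.nightly_results",
    "gs://my-export-bucket/nightly/results-*.csv",  # hypothetical bucket
).result()
```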
Our organization has data in Google Bigtable, hosted by our vendor. We want to run jobs in BigQuery that query Bigtable and export the data to Cloud Storage as .csv files, without storing the data as a dataset in BigQuery.
We do not want to store it in BigQuery datasets because we are not doing any analysis in BigQuery; all analysis is done using an on-premise analytical solution.
Is this possible?
You have a few options, and the best solution would be to automate using Cloud Workflows.
The steps I see would be:
Export from Bigtable to Cloud Storage in Avro or Parquet format.
There is a gcloud way and an API way to do this, described here.
You then import the exported files into BigQuery.
There is a way to do this with the bq CLI tool or with the API as well, described here.
Then you export from BigQuery to multiple CSV files, as documented here.
Since you get multiple CSV files, you can then merge them with gsutil compose.
All of the above can be done in Cloud Workflows. Each call can be implemented either via the API (preferred) or via the command-line tools, using Cloud Build triggers for example. For Workflows syntax you can get guidance from this article, and from the content linked in the footer section of the article.
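If you drive this from Workflows via the API, the import-into-BigQuery and export-to-CSV steps above boil down to a load job followed by an extract job. Here is a minimal sketch of those two calls using the Python BigQuery client (the bucket, dataset, and table names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Load the Avro files exported from Bigtable into a staging table.
client.load_table_from_uri(
    "gs://my-bucket/bigtable-export/*.avro",  # hypothetical export location
    "my-project.staging.bigtable_data",       # hypothetical staging table
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
).result()

# Extract the staging table to sharded CSV files in Cloud Storage.
client.extract_table(
    "my-project.staging.bigtable_data",
    "gs://my-bucket/csv-export/part-*.csv",   # files to merge with gsutil compose
).result()
```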
I've had a look at this SO post but it's three years old and I think GCP has changed since then.
What I'm trying to do is set up a data pipeline using Dataflow jobs to copy/transform data from one GBQ project into another GBQ project.
To create a Dataflow job, you need to choose a template, and there is no template that matches my needs, i.e. no BQ-to-BQ template.
There is an option to use a custom template (which I imagine would be a Python script or something along those lines), but it seems odd that there is no BQ-to-BQ template. Is Dataflow not the right tool for this job? Should I just use scheduled queries?
Thanks in advance
There is a way, though not very straightforward, if you really want to use a Dataflow template: you can use the BigQuery to Cloud Storage template to store the data in GCS, and then the Cloud Storage to BigQuery template to bring the data into the destination project. However, make sure you grant the permissions required to access the Cloud Storage bucket from the destination project.
If the transformations you want are not possible in SQL, or are not practical to express in SQL, you can use Cloud Data Fusion -> Integration Studio. There you can choose BigQuery as both source and sink, and there are a number of options available for the transformation components. It is similar to an ETL tool. See the Data Fusion Quickstart documentation.
Otherwise, you can simply execute or schedule a query in BigQuery itself, as per your requirement, and save the result of the query in another table: Saving query results in a destination table.
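If you go the scheduled-query route, a minimal sketch using the BigQuery Data Transfer Service Python client (google-cloud-bigquery-datatransfer), with hypothetical project, dataset, and query placeholders, might look like this:

```python
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="dest_dataset",   # dataset in the destination project
    display_name="Nightly copy",             # hypothetical name
    data_source_id="scheduled_query",
    params={
        "query": "SELECT * FROM `source-project.src_dataset.src_table`",  # hypothetical
        "destination_table_name_template": "copied_table",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

# The scheduled query is created in the destination project and reads from the source project.
transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("dest-project"),  # hypothetical project
    transfer_config=transfer_config,
)
```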
I have an App Engine scheduled job which runs every day and looks for rows in a PostgreSQL table (hosted in GCP, not Cloud SQL) that meet the criteria for archiving. If the criteria are met, it connects to BigQuery and streams the data there. Every day only a few records qualify for archiving, and we write those to BigQuery. Is this the cost-effective way, or should we try loading the data using Cloud Functions? https://cloud.google.com/solutions/performing-etl-from-relational-database-into-bigquery
App Engine and Cloud Functions have different purposes. You should use App Engine if you want to deploy a full application in a serverless environment. If you need to integrate services in the cloud, use Cloud Functions. In your case it seems that Cloud Functions fits better.
It's important to remember that Cloud Functions has a time limitation: the maximum time your code can run is 9 minutes.
You can find this and other limitations here.
Furthermore, you can find here a pricing calculator for GCP products.
If you have any further questions, please let me know.
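As a rough illustration only, assuming a Python Cloud Function triggered on a schedule (e.g. via Cloud Scheduler and Pub/Sub), with hypothetical connection details, table names, and archive criteria, the function could fetch the qualifying rows and batch-load them rather than streaming them:

```python
import psycopg2                      # assumed PostgreSQL driver
from google.cloud import bigquery

def archive_rows(event, context):    # hypothetical Pub/Sub-triggered entry point
    # Fetch the rows that meet the (hypothetical) archive criteria.
    conn = psycopg2.connect(host="10.0.0.5", dbname="appdb", user="archiver", password="secret")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload, created_at FROM events "
            "WHERE created_at < now() - interval '90 days'"
        )
        rows = [
            {"id": r[0], "payload": r[1], "created_at": r[2].isoformat()}
            for r in cur.fetchall()
        ]

    if rows:
        client = bigquery.Client()
        # Batch load instead of the streaming API; batch loads are not billed per GB inserted.
        client.load_table_from_json(rows, "my-project.archive.events").result()
```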
I have data pipelines that consist of multiple SQL queries run against BigQuery tables. I would like to build these in Google Cloud Data Fusion, but I don't see an option to transform/select with custom SQL.
Is this available, or am I misinterpreting the use cases for this tool?
A new Action plugin is being added that will allow you to specify a SQL query to run in BQ. Expect the connectors to be available in the Hub by mid-May.
Nitin
There is now a native BigQuery Execute action that allows SQL queries to run as part of a Data Fusion Pipeline.
This plugin is an Action; see below from the official documentation:
Action plugins define custom actions that are scheduled to take place during a workflow but don't directly manipulate data in the workflow. For example, using the Database custom action, you can run an arbitrary database command at the end of your pipeline. Alternatively, you can trigger an action to move files within Cloud Storage.
My requirement is to migrate data from a Teradata database to Google BigQuery, with the table structure and schema remaining unchanged. Later, using BigQuery, I want to generate reports.
Can anyone suggest how I can achieve this?
I think you should try TDCH to export the data to Google Cloud Storage in Avro format. TDCH runs on top of Hadoop and exports data in parallel. You can then import the data from the Avro files into BigQuery.
I was part of a team that addressed this issue in a white paper.
The white paper documents the process of migrating data from Teradata Database to Google BigQuery. It highlights several key areas to consider when planning a migration of this nature, including the rationale for Apache NiFi as the preferred data flow technology, pre-migration considerations, details of the migration phase, and post-migration best practices.
Link: How To Migrate From Teradata To Google BigQuery
I think you can also try using Cloud Composer (Apache Airflow), or install Apache Airflow on an instance.
If you can open the ports from the Teradata DB, you can run the 'gsutil' command from there and schedule it via Airflow/Composer to run the jobs on a daily basis. It's quick, and you can leverage the scheduling capabilities of Airflow.
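For example, a minimal Airflow DAG sketch along those lines, assuming the exported files are reachable from the Airflow workers and using the Google provider's GCS-to-BigQuery operator (all paths, bucket, and table names are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="teradata_to_bigquery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Copy the daily Teradata export files to Cloud Storage.
    upload = BashOperator(
        task_id="upload_exports",
        bash_command="gsutil -m cp /exports/teradata/*.avro gs://my-bucket/teradata/",  # hypothetical paths
    )

    # Load the uploaded files into a BigQuery table.
    load = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-bucket",
        source_objects=["teradata/*.avro"],
        destination_project_dataset_table="my-project.warehouse.teradata_data",
        source_format="AVRO",
        write_disposition="WRITE_TRUNCATE",
    )

    upload >> load
```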
BigQuery introduced the BigQuery Migration Service, which is a comprehensive solution for migrating a data warehouse to BigQuery. It includes free-to-use tools that help with each phase of migration, from assessment and planning to execution and verification.
Reference:
https://cloud.google.com/bigquery/docs/migration-intro