CI/CD pipeline for BigQuery schema/DDL deployment - google-bigquery

I am looking for a CI/CD solution for Google BigQuery scripts.
The requirement: given a list of files containing DDL scripts, design a CI/CD solution that maintains versions and deploys the scripts to Google BigQuery automatically or on a schedule.

Since you want to use version control to commit the schema, you can use the CI for Data in BigQuery CLI utility (available as a GitHub repository), which will help you orchestrate the process. For more information you can check this documentation. For an implementation example, you can check this link.
For CD, Cloud Build can be used with BigQuery, where you can write your own custom builders for your requirement. You can also configure notifications for both BigQuery and GitHub using Cloud Build.

As a product recommendation: for CI, use Cloud Source Repositories, and for CD use Cloud Build.
There are multiple ways to do the deployment.
Option 1: here you specify an inline query in the Cloud Build steps. This does not pick up the latest version of your SQL files; see option 2 for that.
Here $PROJECT_ID and $_DATASET are dynamic substitution variables that you set at run time in Cloud Build; you can pass other values the same way.
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bq'
  id: 'create entry min day view'
  args:
    - query
    - --use_legacy_sql=false
    - "CREATE OR REPLACE TABLE $PROJECT_ID.$_DATASET.TABLENAME
       AS
       SELECT 1"
Option 2:
There is a post covering this here.
As in the last answer in the linked post, you can use bash as the entrypoint and pass the bq arguments as args, which lets you deploy the latest SQL files from your repository.
Hope this helps.
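Option 2 can be sketched as a Cloud Build step like the following; the file path sql/my_table.sql and the step id are assumptions for illustration, not part of the original post:

```yaml
steps:
  - name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    id: 'deploy latest ddl from repo'
    args:
      - '-c'
      # bq reads the query from stdin, so whatever DDL is currently
      # committed in the repo is what gets deployed.
      - 'bq query --use_legacy_sql=false --project_id=$PROJECT_ID < sql/my_table.sql'
```

Because the DDL lives in a file rather than inline in the build config, every build runs the latest committed version of the SQL.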

Related

Automatically execute Redshift SQL code periodically stored on Github using Jenkins

I have a few Redshift SQL scripts on GitHub which I need to execute in sequence every 30 mins.
e.g.
At 10:00 am
script001.sql to be executed first
script002.sql to be executed next and so on...
At 10:30 am
script001.sql to be executed first
script002.sql to be executed next and so on...
These scripts run on already existing tables in Redshift. Some of these scripts create tables which are used in the subsequent queries and hence order of execution is important to avoid "Table not found" error.
I have tried:
Creating a freestyle project in Jenkins with the following configuration:
General Tab --> GitHub Project and provided a Project URL
Source Code Management Tab --> Selected Git and provided Repository URL in the format https://personaltoken@github.com/site/repo.git
Branches to build Tab --> */main
Build Triggers Tab --> Build periodically (H/30 * * * *)
Now I don't know how to add Build Step to execute the query from GitHub. The configuration builds successfully but obviously does nothing as no steps have been defined.
Creating a pipeline project in Jenkins with the same configuration as above, but without a Pipeline script, as I am not sure how to write a pipeline script that runs Redshift SQL stored on GitHub.
I tried looking for a solution but couldn't find anything for Redshift. There are tutorials and snippets for SQL Server and Oracle, but their scripts are different and can't be used with Redshift.
Any help on this would be greatly appreciated.
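A minimal sketch of the missing build step, assuming psql is installed on the Jenkins agent (Redshift accepts PostgreSQL clients) and a hypothetical REDSHIFT_URL connection string injected via Jenkins credentials; the real psql call is left commented out here:

```shell
# "Execute shell" build step: run the checked-out scripts in lexicographic
# order, stopping on the first failure so dependent scripts never run
# against missing tables.
set -e
mkdir -p sql && touch sql/script002.sql sql/script001.sql  # demo files for this sketch only
for f in $(ls sql/*.sql | sort); do
  echo "executing $f"
  # psql "$REDSHIFT_URL" -v ON_ERROR_STOP=1 -f "$f"   # real call, disabled in the sketch
done
```

With the "Build periodically" trigger (H/30 * * * *) already configured, this step would re-run the whole ordered sequence every 30 minutes.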

What is the difference and relationship between an Azure DevOps Build Definition, and a Pipeline?

I am trying to automate a process in Azure DevOps, using the REST API. I think it should go like this (at least, this is the current manual process):
fork repo
create pipeline(s) using YAML files in the newly forked repo
run pipelines in particular way
I am new to the Azure DevOps REST API and I am struggling to understand what I have done and what I should be doing.
Using the REST API, I seem to be able to create what I would call a pipeline, using the pipelines endpoint; I do notice that if I want to run it, I have to interact with its build definition instead.
Also, looking at code other colleagues have written, it seems (though I may be wrong) that they are able to achieve the same by simply creating a build definition, without explicitly creating a pipeline.
This lack of understanding is driving me bonkers so I am hoping someone can enlighten me!
Question
What is the difference, and relationship, between a Build Definition and a Pipeline?
Additional info, I am not interested in working with the older Release Pipelines and I have tried to find the answer among the Azure DevOps REST API docs, but to no avail.
You can create a pipeline using either of these endpoints. The difference is conceptual:
build definitions are part of the first available flow, which consisted of build and release: build was responsible for building, testing and publishing an artifact for later use in releases to deploy;
pipelines are the newer approach, which leverages a YAML-defined process for building/testing/deploying code.
You can find more info here - What's the difference between a build pipeline and a release pipeline in Azure DevOps?
And for instance for this pipeline/build
https://dev.azure.com/thecodemanual/DevOps%20Manual/_build?definitionId=157
where definition id is 157
You will get responses from both endpoints:
https://dev.azure.com/{{organization}}/{{project}}/_apis/build/definitions/157?api-version=5.1
and
https://dev.azure.com/{{organization}}/{{project}}/_apis/pipelines/157?api-version=6.0-preview.1
and in that sense, pipeline id = build definition id
The pipelines endpoint is not very useful:
https://dev.azure.com/{Organization}/{ProjectName}/_apis/pipelines?api-version=6.0-preview.1
It will only give you a list of pipelines with very basic info such as name, ID, folder etc.
To create and update YAML pipelines you need to use the Build definitions endpoint. The IDs you use in the endpoint are the same IDs as the Pipelines endpoint uses.
Get definition, Get list, Create, Update:
https://dev.azure.com/{Organization}/{ProjectName}/_apis/build/definitions?api-version=6.0
(To create a working pipeline you must first Get an existing pipeline, modify the JSON you receive, then POST it as a new definition.)
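That GET-modify-POST flow can be sketched as follows; the list of fields to strip and the helper name are assumptions about what the service rejects, not documented behavior:

```python
# Sketch: clone an existing YAML pipeline by GET-ing its build definition,
# stripping server-assigned fields, and POST-ing the result as a new one.

def clone_definition(definition, new_name):
    """Prepare a GET response body for re-POSTing as a new definition."""
    body = dict(definition)
    # Assumed server-assigned fields that must not appear in the POST body.
    for field in ("id", "revision", "createdDate", "uri", "url", "_links"):
        body.pop(field, None)
    body["name"] = new_name
    return body

# GET  https://dev.azure.com/{org}/{project}/_apis/build/definitions/157?api-version=6.0
# POST https://dev.azure.com/{org}/{project}/_apis/build/definitions?api-version=6.0
#      with the JSON produced by clone_definition(...)
```

The same numeric ID then works against both the build/definitions and the pipelines endpoints, as shown by the two URLs above.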

Run GitLab CI job only if a trigger builds a tag

I want to run a conditional build that is triggered by an api trigger, but only when the ref passed in by the trigger matches a specific regex.
I can imagine 2 ways this could be done:
Logical operators in .gitlab-ci.yml's only: directive like so:
only:
- /^staging-v.*$/ AND triggers
or
Controlling the result status using return codes
script:
- return 3;
would be interpreted as "not run" or "skipped"
Am I missing something? I read all the documentation I could find, but this scenario is never really explained. Is there maybe a third way to do this?
This would be handy with the new environments feature of GitLab 8.9
I'm using the latest 8.9.0 gitlab release.
Also, the API trigger is needed, as I need to pass additional dynamic variables from the developer to the build and deploy environment.
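In GitLab versions released after 8.9, the rules: keyword can express exactly this AND condition. A sketch, assuming the job name and a hypothetical deploy script, and assuming the pipeline source reports as "trigger" (pipelines created directly via the API may report "api" instead):

```yaml
deploy_staging:
  script:
    - ./deploy.sh   # hypothetical deploy script
  rules:
    # Run only when the pipeline was started by a trigger AND the ref
    # matches the staging pattern; otherwise the job is not created.
    - if: '$CI_PIPELINE_SOURCE == "trigger" && $CI_COMMIT_REF_NAME =~ /^staging-v/'
```

On 8.9 itself this was not possible, because multiple entries under only: are OR-ed rather than AND-ed.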

Is it possible to run an OpenRefine script in the background?

Can I trigger an OpenRefine script to run in the background without user interaction? Possibly use a Windows service to load an OpenRefine config file, or start the OpenRefine web server with parameters and save the output?
We parse various data sources from files and place the output into specific tables and fields in SQL Server. We have a very old application that creates these "match patterns" and would like to replace it with something more modern. Speed is important but not critical. We typically parse files with 5 to 1,000,000 lines.
I could be going in the wrong direction with OpenRefine; if so, please let me know. Our support team that creates these "match patterns" would be better served by a UI like OpenRefine than by writing Perl or Python scripts.
Thanks for your help.
OpenRefine has a set of libraries that let you automate an existing job. The following are available:
* two in Python here and here
* one in Ruby
* one in Node.js
These libraries need two inputs:
a source file to be processed in OpenRefine
the OpenRefine operations in JSON format.
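For illustration, the operations JSON is what OpenRefine lets you export under Undo/Redo → Extract; a minimal example (the column names here are made up):

```json
[
  {
    "op": "core/column-rename",
    "oldColumnName": "name",
    "newColumnName": "full_name"
  }
]
```

Applying this JSON to a new file with the same structure replays the recorded transformation without the UI.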
At RefinePro (disclaimer: I am the founder and CEO of RefinePro), we have written extra wrappers to select an OpenRefine project, extract the JSON operations, invoke the library, and save the result. The newly created job can then be scheduled.
Please keep in mind that OpenRefine has very poor error handling, which limits its use as an ETL platform.

How to download all data in a Google BigQuery dataset?

Is there an easy way to directly download all the data contained in a certain dataset on Google BigQuery? I'm currently downloading "as csv", making one query after another, but that doesn't let me get more than 15k rows, and the rows I need to download number over 5M.
Thank you
You can run BigQuery extraction jobs using the Web UI, the command line tool, or the BigQuery API. The data is extracted to a Google Cloud Storage bucket, e.g. as CSV files.
For example, using the command line tool:
First install and auth using these instructions:
https://developers.google.com/bigquery/bq-command-line-tool-quickstart
Then make sure you have an available Google Cloud Storage bucket (see Google Cloud Console for this purpose).
Then, run the following command:
bq extract my_dataset.my_table gs://mybucket/myfilename.csv
More on extracting data via API here:
https://developers.google.com/bigquery/exporting-data-from-bigquery
Detailed step-by-step to download large query output
enable billing
You have to give Google your credit card number to export the output, and you might have to pay.
But the free quota (1 TB of processed data) should suffice for many hobby projects.
create a project
associate billing to a project
do your query
create a new dataset
click "Show options" and enable "Allow Large Results" if the output is very large
export the query result to a table in the dataset
create a bucket on Cloud Storage.
export the table to the created bucket on Cloud Storage.
make sure to select GZIP compression
use a name like <bucket>/prefix.gz.
If the output is very large, the file name must contain an asterisk * and the output will be split into multiple files.
download the table from cloud storage to your computer.
It does not seem possible to download multiple files from the web interface if the large file got split up, but you could install gsutil and run:
gsutil -m cp -r 'gs://<bucket>/prefix_*' .
See also: Download files and folders from Google Storage bucket to a local folder
There is a gsutil in Ubuntu 16.04 but it is an unrelated package.
You must install and setup as documented at: https://cloud.google.com/storage/docs/gsutil
unzip locally:
for f in *.gz; do gunzip "$f"; done
Here is a sample project I needed this for which motivated this answer.
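The export steps above also have a command-line equivalent; a sketch, assuming the dataset, table, and bucket names (not executed here, since it needs cloud credentials):

```shell
# Export with GZIP compression; the * wildcard lets BigQuery split a
# large table across multiple files.
bq extract --compression=GZIP my_dataset.my_table 'gs://mybucket/prefix_*.csv.gz'
# Download the shards and decompress locally:
gsutil -m cp 'gs://mybucket/prefix_*' .
for f in *.gz; do gunzip "$f"; done
```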
For Python you can use the following code; it will download the data as a DataFrame.
from google.cloud import bigquery

def read_from_bqtable(bq_projectname, bq_query):
    client = bigquery.Client(bq_projectname)
    bq_data = client.query(bq_query).to_dataframe()  # runs the query and returns a DataFrame
    return bq_data

bigQueryTableData_df = read_from_bqtable('gcp-project-id', 'SELECT * FROM `gcp-project-id.dataset-name.table-name`')
Yes, the steps suggested by Michael Manoochehri are correct and the easy way to export data from Google BigQuery.
I have written a bash script so that you do not have to repeat these steps every time; just use my bash script.
Below is the GitHub URL:
https://github.com/rajnish4dba/GoogleBigQuery_Scripts
Scope:
1. export data based on your BigQuery SQL;
2. export data based on your table name;
3. transfer your export file to an SFTP server.
Try it and let me know your feedback.
For help, use: ExportDataFromBigQuery.sh -h