BigQuery - Version Control Scheduled Queries - SQL

Right now I have scheduled queries set up via the BQ interface. They work, but they do not scale or migrate very well (across dev and prod GCP projects), so I am trying to do scheduled queries in a way that is reproducible, scalable and migratable.
My queries are complicated, and hence I am struggling with ', " and ''' quoting to make them run via bq commands and schedule them via GitHub Actions.
This is the query which is most complicated:
declare bq_last_id int64;
declare external_sql string;
set bq_last_id = (select max(id) from bq_dataset.bq_table);
set external_sql = '"select * from mysql_table where id > ('|| bq_last_id ||')"';
execute immediate 'select * from external_query("my-gcp-project.my-region.my-connection-name",'|| external_sql || ');'
In total there are 20-something queries that have to be scheduled. This one is the only one which is incremental; the others drop and recreate the table again, so they are not as complicated as this one.
WHAT I HAVE TRIED TILL NOW:
Creating an on-demand query in the BQ interface and then using the bq mk command to run it with the timestamp as a variable, as shown in this answer. The problem with it is that I still have to manually create the on-demand queries, and I will have to do it separately in the dev and prod projects.
I am unable to find a way to create an on-demand query using the bq command.
I am unable to use bq query to run the queries (that is, not create BQ scheduled queries at all) and then schedule them later via GitHub Actions.
Any help with the correct syntax, or better suggestions on how to do this, will be super helpful to me.
Thanks.

The way I solve problems with escape characters when inserting queries into bq commands is to use jq in the shell and keep my queries in a file, as follows:
Create a queries.sql file with your query script:
cat queries.sql
declare bq_last_id int64;
declare external_sql string;
set bq_last_id = (select max(id) from bq_dataset.bq_table);
set external_sql = '"select * from mysql_table where id > ('|| bq_last_id ||')"';
execute immediate 'select * from external_query("my-gcp-project.my-region.my-connection-name",'|| external_sql || ');'
Create the following script:
schedule_query.sh
#!/bin/bash
set -f # avoid * being expanded as a shell wildcard
json=$(jq -nc --arg query "$(<queries.sql)" '{ "query": $query }')
# adapt the command and params to work in your environment (destination, tables, etc.)
bq mk \
--transfer_config \
--target_dataset=mydataset \
--display_name='My Scheduled Query' \
--params="$json" \
--data_source=scheduled_query \
--service_account_name=abcdef-test-sa@abcdef-test.iam.gserviceaccount.com
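Since the question also mentions migrating the same definitions across dev and prod, the script can take the target project as an argument so one definition serves both environments; it can then be invoked from a GitHub Actions job (or any CI runner) authenticated against the right project. A minimal sketch, where the project IDs, dataset and display name are placeholders:
#!/bin/bash
# Usage: ./schedule_query.sh my-dev-project
#        ./schedule_query.sh my-prod-project
set -f                        # avoid * in the SQL being expanded as a shell wildcard
PROJECT_ID="$1"               # target GCP project passed on the command line
json=$(jq -nc --arg query "$(<queries.sql)" '{ "query": $query }')
bq mk \
--transfer_config \
--project_id="$PROJECT_ID" \
--target_dataset=mydataset \
--display_name="My Scheduled Query ($PROJECT_ID)" \
--params="$json" \
--data_source=scheduled_query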

Related

List all objects in a dataset using bq ls command in BigQuery

I am trying to list all the objects present in a dataset in BigQuery.
I tried using the bq ls projectID:dataset_name command in the Google Cloud SDK shell. However, this returned only the list of tables present in the dataset. I am interested in listing all the stored procedures present in the same dataset.
It is possible to get the list of routines (including stored procedures) with a query:
bq query --nouse_legacy_sql \
'SELECT
*
FROM
mydataset.INFORMATION_SCHEMA.ROUTINES'
It is possible to include all routines in the bq ls command by setting the flag --routines=true:
bq ls dataset_name --routines=true
The default value is false. Routines include persistent user-defined functions, table functions (Preview), and stored procedures. See GCP docs for more detail.
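Since the question is specifically about stored procedures, the same INFORMATION_SCHEMA query can be narrowed by filtering on routine_type; mydataset below is a placeholder for your dataset:
bq query --nouse_legacy_sql \
'SELECT routine_name, routine_type
FROM
mydataset.INFORMATION_SCHEMA.ROUTINES
WHERE routine_type = "PROCEDURE"'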

How do I specify query parameters for a scheduled query in BigQuery?

I am trying to set up some Scheduled Queries against my tables in BigQuery.
https://cloud.google.com/bigquery/docs/scheduling-queries
I would like these queries to use parameters, e.g.
bq query --use_legacy_sql=false \
--parameter=label::TEST_LABEL \
"SELECT #label AS label"
I cannot find any way to do this via the bq command-line tool or the API. Passing it in as a --parameter flag, or as a field in the params JSON, does not work.
bq mk \
--transfer_config \
--project_id=my_project \
--location=US \
--display_name='Test Scheduled Query' \
--params='{"query":"INSERT my_project.my_data.test SELECT #label AS label;"}' \
--data_source=scheduled_query \
--service_account_name=my-svc-act#my_project.iam.gserviceaccount.com
You have got a concept slightly wrong here, and it is not clear what your use case is.
Scheduled Queries
A scheduled query is set up to run periodically in the background, and it triggers automatically. Hence, as it is automatically scheduled, there is no way to provide custom parameters in the way you have described. This is because scheduled queries use features of the BigQuery Data Transfer Service.
A scheduled query has only a few built-in runtime parameters, such as @run_time; more on this here. You will find further information in this other link.
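As an illustrative sketch of using those built-in parameters (the dataset, display name and table template below are placeholders, and the params keys are the ones documented for the scheduled_query data source):
bq mk \
--transfer_config \
--target_dataset=mydataset \
--display_name='Scheduled Query Using run_time' \
--params='{"query":"SELECT @run_time AS snapshot_time","destination_table_name_template":"snapshot_{run_date}","write_disposition":"WRITE_APPEND"}' \
--data_source=scheduled_query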
You probably want to execute a query on-demand, and not a scheduled query.
On-demand queries
Example:
bq query \
--use_legacy_sql=false \
--parameter=corpus::romeoandjuliet \
--parameter=min_word_count:INT64:250 \
'SELECT
word, word_count
FROM
`bigquery-public-data.samples.shakespeare`
WHERE
corpus = @corpus
AND
word_count >= @min_word_count
ORDER BY
word_count DESC'
More on running parameterized queries here.
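If the query still needs to run on a schedule with your own parameter values, one option is to drive the parameterized on-demand query from an external scheduler such as cron or a CI job. A sketch, reusing the INSERT from the question (the table and label value are placeholders):
# hypothetical crontab entry: run the parameterized insert daily at 03:00
# 0 3 * * * /path/to/run_labelled_insert.sh
bq query --use_legacy_sql=false \
--parameter=label::TEST_LABEL \
'INSERT my_project.my_data.test SELECT @label AS label'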

How to drop multiple tables in BigQuery using the wildcard function TABLE_DATE_RANGE()?

I was looking at the documentation, but I haven't found a way to drop multiple tables using wildcards.
I was trying to do something like this, but it doesn't work:
DROP TABLE
TABLE_DATE_RANGE([clients.sessions_],
TIMESTAMP('2017-01-01'),
TIMESTAMP('2017-05-31'))
For a dataset named stats and tables like daily_table_20181017 that follow a date naming convention, I would go with a simple script and the bq command-line tool from the gcloud SDK:
for table in `bq ls --max_results=10000000 stats |grep TABLE |grep daily_table |awk '{print $1}'`; do echo stats.$table; bq rm -f -t stats.$table; done
DROP TABLE [table_name]; is now supported in BigQuery, so here is a purely SQL/BigQuery UI solution.
select concat("drop table ",table_schema,".", table_name, ";" )
from <dataset-name>.INFORMATION_SCHEMA.TABLES
where table_name like "partial_table_name%"
order by table_name desc
Audit that you are dropping the correct tables, then copy and paste the output back into BigQuery to drop the listed tables.
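If you would rather not copy and paste by hand, the same generated statements can be fed straight back through the CLI. A sketch, where mydataset and partial_table_name are placeholders as above:
bq --quiet --format=csv query --nouse_legacy_sql \
'SELECT CONCAT("drop table ", table_schema, ".", table_name, ";") AS del_stmt
FROM mydataset.INFORMATION_SCHEMA.TABLES
WHERE table_name LIKE "partial_table_name%"' \
| tail -n +2 \
| while read -r stmt; do
  bq query --nouse_legacy_sql "$stmt"   # run each generated drop statement
done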
DDL, e.g. DROP TABLE, doesn't exist yet in BigQuery. However, I know Google is currently working on it.
In the meantime, you'll need to use the API to delete tables. For example, using the bq tool from the gcloud SDK:
bq rm -f -t dataset.table
If you want to do bulk deletes, then you can use some bash/awk magic. Or, if you prefer, call the REST API directly with e.g. the Python client.
See here too.
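As a sketch of that bash/awk approach, reusing the clients.sessions_ naming from the question (the dataset and table prefix are placeholders):
for t in $(bq ls --max_results=10000 clients | awk '$2 == "TABLE" && $1 ~ /^sessions_2017/ {print $1}'); do
  bq rm -f -t clients."$t"
done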
I just used Python to loop over this and solve it, building on Graham's example:
from subprocess import call
# table_name and period are assumed to be set by the surrounding loop
return_code = call('bq rm -f -t dataset.' + table_name + '_' + period, shell=True)
For a long time @Graham's approach worked for me, but just recently the BQ CLI stopped working effectively and froze every time I ran the above command. Hence I dug around for a new approach and used some parts of the official Google Cloud documentation. I followed the approach below using a Jupyter notebook.
from google.cloud import bigquery
# TODO(developer): Construct a BigQuery client object.
client = bigquery.Client.from_service_account_json('/folder/my_service_account_credentials.json')
dataset_id = 'project_id.dataset_id'
dataset = client.get_dataset(dataset_id)
# Creating a list of all tables in the above dataset
tables = list(client.list_tables(dataset))  # API request(s)
## Filtering out relevant wildcard tables to be deleted
## Mention a substring that's common in all your tables that you want to delete
tables_to_delete = ["{}.{}.{}".format(dataset.project, dataset.dataset_id, table.table_id)
                    for table in tables if "search_sequence_" in table.table_id]
for table in tables_to_delete:
    client.delete_table(table)
    print("Deleted table {}".format(table))
To build off of @Dengar's answer, you can use procedural SQL in BigQuery to run all of those drop statements in a FOR loop, like so:
FOR record IN (
  select concat(
    "drop table ",
    table_schema, ".", table_name, ";") as del_stmt
  from <dataset_name>.INFORMATION_SCHEMA.TABLES
  order by table_name) DO
  -- run each generated drop statement
  EXECUTE IMMEDIATE
    FORMAT("""
    %s
    """, record.del_stmt);
END FOR;
Add a WHERE condition if you do not want to delete all tables in the dataset.
With scripting and the table INFORMATION_SCHEMA views available, the following can also be used directly in the UI.
I would not recommend this for removing a large number of tables.
FOR tn IN (SELECT table_name FROM yourDataset.INFORMATION_SCHEMA.TABLES WHERE table_name LIKE "filter%")
DO
EXECUTE IMMEDIATE FORMAT("DROP TABLE yourDataset.%s", tn.table_name);
END FOR;

Google Bigquery BQ command line execute query from a file

I use the bq command line tool to run queries, e.g:
bq query "select * from table"
What if I store the query in a file and run the query from that file? Is there a way to do that?
The other answers seem to be either outdated or needlessly brittle. As of 2019, bq query reads from stdin, so you can just redirect your file into it:
bq query < myfile.sql
Query parameters are passed like this:
bq query --parameter name:type:value < myfile.sql
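For instance, assuming myfile.sql references a @cutoff parameter, an invocation might look like this (the dataset, table and column names are made up for illustration):
echo 'SELECT COUNT(*) AS n FROM mydataset.mytable WHERE created_at >= @cutoff' > myfile.sql
bq query --use_legacy_sql=false --parameter=cutoff:TIMESTAMP:'2019-01-01 00:00:00' < myfile.sql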
There is another way.
Try this:
bq query --flagfile=[your file with absolute path]
Ex:
bq query --flagfile=/home/user/abc.sql
You can run a query from a text file with a little bit of shell magic:
$ echo "SELECT 17" > qq.txt
$ bq query "$(cat qq.txt)"
Waiting on bqjob_r603d91b7e0435a0f_00000150c56689c6_1 ... (0s) Current status: DONE
+-----+
| f0_ |
+-----+
| 17 |
+-----+
Note this works on any Unix variant (including macOS). If you're using Windows, this should work under PowerShell but not the default cmd prompt.
If you are using standard SQL (not legacy SQL):
Steps:
1. Create a .sql file (you can use any extension).
2. Put your query in it. Make sure there is a semicolon (;) at the end of the query.
3. Go to the command line and execute the commands below.
4. If you want to add parameters, then you have to specify them sequentially.
Example:
bq query --use_legacy_sql=False "$(cat /home/airflow/projects/bql/query/test.sql)"
for parameter
bq query --use_legacy_sql=False --parameter=country::USA "$(cat /home/airflow/projects/bql/query/test.sql)"
cat >/home/airflow/projects/bql/query/test.sql
select * from l1_gcb_trxn.account where country=@country;
This thread offers a good solution:
bq query `cat my_query.sql`
bq query --replace --use_legacy_sql=false --destination_table=syw-analytics:store_ranking.SHC_ENGAGEMENT_RANKING_TEST \
"SELECT RED,
DEC,
REDEM
from \`syw.abc.xyz\`"

Storing query result in a variable

I have a query whose result I wanted to store in a variable.
How can I do it?
I tried
./hive -e "use telecom;insert overwrite local directory '/tmp/result' select
avg(a) from abc;"
./hive --hiveconf MY_VAR =`cat /tmp/result/000000_0`;
I am able to get the average value in MY_VAR, but it takes me into the Hive CLI, which is not required.
Also, is there a way to access Unix commands inside the Hive CLI?
Use case: in MySQL the following is valid:
set @max_date := select max(date) from some_table;
select * from some_other_table where date > @max_date;
This is super useful for scripts that need to repeatedly call this variable since you only need to execute the max date query once rather than every time the variable is called.
Hive does not currently support this. (Please correct me if I'm wrong! I have been trying to figure out how to do this all afternoon.)
My workaround is to store the required variable in a table that is small enough to map join onto the query in which it is used. Because the join is a map-side join rather than a full shuffle join, it should not significantly hurt performance. For example:
drop table if exists var_table;
create table var_table as
select max(date) as max_date from some_table;
select some_other_table.*
from some_other_table
left join var_table
where some_other_table.date > var_table.max_date;
The suggested solution by @visakh is not optimal, because it stores the string 'select count(1) from table_name;' rather than the returned value, and so will not be helpful in cases where you need to call a variable repeatedly during a script.
Storing Hive query output in a variable and using it in another query:
In the shell, create a variable with the desired value by doing:
var=`hive -S -e "select max(datekey) from ....;"`
echo $var
Use the variable value in another hive query by:
hive -hiveconf MID_DATE=$var -f test.hql
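For completeness, test.hql then reads the value through the hiveconf namespace (the table and column names below are placeholders):
cat > test.hql <<'EOF'
-- MID_DATE is the value passed with -hiveconf on the command line
select * from some_other_table where datekey > ${hiveconf:MID_DATE};
EOF
hive -hiveconf MID_DATE=$var -f test.hql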
You can simply achieve this using a shell script.
Create a shell script:
file: avg_op.sh
#!/bin/sh
hive -e 'use telecom;select avg(a) from abc;' > avg.txt
wait
value=`cat avg.txt`
hive --hiveconf avgval=$value -e "set avgval;set hiveconf:avgval;
use telecom;
select * from abc2 where avg_var=\${hiveconf:avgval};"
Execute the .sh file:
>bash avg_op.sh
If you are trying to capture a number from a Hive or Impala query in Linux, you can achieve this by executing the query and extracting the number with a regex.
With Hive,
max=`beeline -u ${hiveConnectionUrl} -e "select max(col1) from schema_name.table_name;" | sed 's/[^0-9]*//g'`
The main part is to extract the number from the result. Also, if you're getting too much output, you can use the --silent=true flag to silence the execution, which reduces the log messages.
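The captured value can then be reused in a follow-up beeline call, for example (schema_name.table_name and col1 are the same placeholders as above):
beeline -u ${hiveConnectionUrl} --silent=true -e "select * from schema_name.table_name where col1 = ${max};"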
You can use BeeTamer for that. It allows you to store the result (or part of it) in a variable and use this variable later in your code.
BeeTamer is a macro language / macro processor that extends the functionality of the Apache Hive and Cloudera Impala engines.
select avg(a) from abc;
%capture MY_AVERAGE;
select * from abc2 where avg_var=#MY_AVERAGE#;
Here you save the average value from your query into the macro variable MY_AVERAGE and then reuse it in the second query.
Try the below:
$ var=$(hive -e "select '12' ")
$ echo $var
12 -- output