BigQuery command line to execute multiple SQL files

Does anyone here know how to execute multiple SQL files with the bq command line? For example, if I have two SQL files named test1.sql and test2.sql, how should I do it?
If I do this:
bq query --use_legacy_sql=false < test1.sql
it only executes test1.sql.
What I want to do is execute both test1.sql and test2.sql.

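There isn't a built-in way to pass several files to a single bq query call.
If you want a one-liner, the best option is to chain the calls with the && operator, which runs the second command only if the first succeeds:
bq query --use_legacy_sql=false < test1.sql && bq query --use_legacy_sql=false < test2.sql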

There is an alternative way, which is using a shell script in order to loop through all of the files:
#!/bin/bash
# Iterate over every .sql file in the directory, not the directory path itself
for f in /path/to/sqls/*.sql
do
bq query --use_legacy_sql=false < "$f"
done
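The loop above keeps going even if one of the files fails. Below is a minimal sketch of a variant that stops at the first failing file, assuming all files sit in one directory and should run in lexical order. The function name run_sql_files and the directory argument are illustrative and not part of bq.

```shell
#!/bin/bash
# Hypothetical helper: run every .sql file in a directory through bq, in
# lexical order, stopping at the first query that fails.
run_sql_files() {
  local dir="$1" f
  for f in "$dir"/*.sql; do
    # With no matches the glob stays literal, so skip non-existent paths.
    [ -e "$f" ] || continue
    echo "Running $f" >&2
    # bq query reads the SQL text from stdin.
    bq query --use_legacy_sql=false < "$f" || {
      echo "Failed on $f" >&2
      return 1
    }
  done
}

# Usage (path is a placeholder):
#   run_sql_files /path/to/sqls
```

Wrapping the loop in a function keeps the stop-on-failure behaviour of the && one-liner while still handling any number of files.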

Related

Bigquery - Version Control Scheduled Queries

Right now I have scheduled queries via BQ interface. They work but do not scale or migrate very well (across dev and prod gcp projects). So I am trying to do scheduled queries in a way that is reproducible, scalable and migrate-able.
My queries are complicated and hence I am struggling with ', " and ''' to make it run via bq commands and schedule via github actions.
This is the query which is most complicated:
declare bq_last_id int64;
declare external_sql string;
set bq_last_id = (select max(id) from bq_dataset.bq_table);
set external_sql = '"select * from mysql_table where id > ('|| bq_last_id ||')"';
execute immediate 'select * from external_query("my-gcp-project.my-region.my-connection-name",'|| external_sql || ');'
In total there are twenty-something queries that have to be scheduled. This is the only one that is incremental; the others drop and recreate the table each time, so they are not as complicated as this one.
WHAT I HAVE TRIED TILL NOW:
creating an on-demand query in BQ interface and then using bq mk command to run it with timestamp as a variable as shown in this answer. The problem with it is I still have to manually create the on-demand queries and I will have to do separately in dev and prod projects.
I am unable to find a way to create an on-demand query using the bq command.
I am unable to get bq query to run the queries directly (that is, without creating BQ scheduled queries at all) so that I could schedule them via GitHub Actions instead.
Any help with correct syntax or better suggestions to do this will be super helpful to me.
Thanks.
The way I solve problems with escape characters when passing queries to bq commands is to keep the queries in a file and use jq in the shell, as follows:
Create a queries.sql file with your query script:
cat queries.sql
declare bq_last_id int64;
declare external_sql string;
set bq_last_id = (select max(id) from bq_dataset.bq_table);
set external_sql = '"select * from mysql_table where id > ('|| bq_last_id ||')"';
execute immediate 'select * from external_query("my-gcp-project.my-region.my-connection-name",'|| external_sql || ');'
Create the following script:
schedule_query.sh
#!/bin/bash
set -f # avoid * being expanded as a wildcard
json=$(jq -nc --arg query "$(<queries.sql)" '{ "query": $query }')
#adapt the command and param to work in your environment (destination, tables, etc...)
bq mk \
--transfer_config \
--target_dataset=mydataset \
--display_name='My Scheduled Query' \
--params="$json" \
--data_source=scheduled_query \
--service_account_name=abcdef-test-sa@abcdef-test.iam.gserviceaccount.com
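To see what jq is doing here, the sketch below builds the same params JSON in a small function and prints it, so the escaping can be inspected before it is handed to bq mk. The function name build_params is hypothetical, and the sketch assumes jq is installed.

```shell
#!/bin/bash
# Hypothetical helper: wrap the contents of a SQL file in the JSON object
# that --params expects. jq takes care of escaping quotes and newlines.
build_params() {
  local sql_file="$1"
  # --arg binds the file contents as a JSON string named $query.
  jq -nc --arg query "$(cat "$sql_file")" '{ "query": $query }'
}

# Usage: print the JSON to check it, then pass it on, for example
#   --params="$(build_params queries.sql)"
```

Because the query text only ever travels as a shell variable and a jq argument, the single and double quotes inside it never need manual escaping.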

Reduce Beeline Hive CSV verbosity

I am new to using beeline and I am using a statement like the following:
beeline -u 'jdbc:hive2://myserver' --outputformat=csv2 -f export.sql > results.csv
The statement works as desired but the resulting CSV is very verbose with all the INFO statements preceding and following the actual CSV data. How can I modify the statement so that the only items in the CSV is the actual data and none of the other statements? I just want a clean dataset in results.csv
You can do that by setting --silent=true and --verbose=false:
beeline -u 'jdbc:hive2://myserver' --outputformat=csv2 --silent=true --verbose=false -f export.sql > results.csv

how to enable standard sql for BigQuery using bq shell

bq query has a --use_legacy_sql flag that can be set to false to enable standard SQL.
How can I do the same when using bq shell?
I tried below variations and both of those failed with error Unknown command line flag 'use_legacy_sql'.
bq --use_legacy_sql=false shell
bq shell --use_legacy_sql=false
It doesn't look like it's possible currently, so I filed a feature request. The alternative is to pass it to "query" each time, although that feels very verbose. For example:
$ bq shell
myproject> query --use_legacy_sql=false SELECT [1, 2, 3] AS arr

Google Bigquery BQ command line execute query from a file

I use the bq command line tool to run queries, e.g:
bq query "select * from table"
What if I store the query in a file and want to run the query from that file? Is there a way to do that?
The other answers seem to be either outdated or needlessly brittle. As of 2019, bq query reads from stdin, so you can just redirect your file into it:
bq query < myfile.sql
Query parameters are passed like this:
bq query --parameter name:type:value < myfile.sql
There is another way.
Try this:
bq query --flagfile=[your file with absolute path]
Ex:
bq query --flagfile=/home/user/abc.sql
You can run a query from a text file with a little bit of shell magic:
$ echo "SELECT 17" > qq.txt
$ bq query "$(cat qq.txt)"
Waiting on bqjob_r603d91b7e0435a0f_00000150c56689c6_1 ... (0s) Current status: DONE
+-----+
| f0_ |
+-----+
| 17 |
+-----+
Note that this works on any Unix variant (including macOS). If you're using Windows, this should work under PowerShell but not the default cmd prompt.
If you are using standard SQL (not legacy SQL):
**Steps:**
1. Create a .sql file (you can use any extension).
2. Put your query in it. Make sure there is a semicolon (;) at the end of the query.
3. Go to the command line and execute the commands below.
4. If you want to add parameters, specify them one after another.
Example:
bq query --use_legacy_sql=False "$(cat /home/airflow/projects/bql/query/test.sql)"
for parameter
bq query --use_legacy_sql=False --parameter=country::USA "$(cat /home/airflow/projects/bql/query/test.sql)"
cat >/home/airflow/projects/bql/query/test.sql
select * from l1_gcb_trxn.account where country=@country;
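Putting those steps together as a sketch: the query file refers to a named parameter as @country, and the value is supplied with --parameter in name:type:value form (an empty type, as in country::USA, is treated as STRING). The function name and the file path are placeholders, and the sketch feeds the file on stdin instead of using "$(cat ...)".

```shell
#!/bin/bash
# Hypothetical helper: run a standard SQL file with one named STRING parameter.
run_with_country() {
  local sql_file="$1" country="$2"
  # The SQL file refers to the parameter as @country.
  bq query --use_legacy_sql=false \
    --parameter="country:STRING:${country}" \
    < "$sql_file"
}

# Usage (file path is a placeholder):
#   run_with_country /path/to/test.sql USA
```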
This thread offers a good solution:
bq query `cat my_query.sql`
bq query --replace --use_legacy_sql=false --destination_table=syw-analytics:store_ranking.SHC_ENGAGEMENT_RANKING_TEST \
"SELECT RED,
DEC,
REDEM
from \`syw.abc.xyz\`"

Storing query result in a variable

I have a query whose result I wanted to store in a variable
How can I do it ?
I tried
./hive -e "use telecom;insert overwrite local directory '/tmp/result' select
avg(a) from abc;"
./hive --hiveconf MY_VAR =`cat /tmp/result/000000_0`;
I am able to get the average value into MY_VAR, but this drops me into the Hive CLI, which I don't want.
Also, is there a way to run Unix commands from inside the Hive CLI?
Use Case: in mysql the following is valid:
set @max_date := select max(date) from some_table;
select * from some_other_table where date > @max_date;
This is super useful for scripts that need to repeatedly call this variable since you only need to execute the max date query once rather than every time the variable is called.
Hive does not currently support this. (Please correct me if I'm wrong! I have been trying to figure out how to do this all afternoon.)
My workaround is to store the required variable in a table that is small enough to map join onto the query in which it is used. Because the join is a map rather than a broadcast join it should not significantly hurt performance. For example:
drop table if exists var_table;
create table var_table as
select max(date) as max_date from some_table;
select some_other_table.*
from some_other_table
left join var_table
where some_other_table.date > var_table.max_date;
The solution suggested by @visakh is not optimal because it stores the string 'select count(1) from table_name;' rather than the returned value, so it will not help in cases where you need to use a variable repeatedly during a script.
Storing hive query output in a variable and using it in another query.
In shell create a variable with desired value by doing:
var=`hive -S -e "select max(datekey) from ....;"`
echo $var
Use the variable value in another hive query by:
hive -hiveconf MID_DATE=$var -f test.hql
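A sketch of the two steps combined into one script. The table name some_table and the contents of test.hql are placeholders, and the function name is hypothetical. Inside test.hql the value would be referenced as ${hiveconf:MID_DATE}.

```shell
#!/bin/bash
# Hypothetical helper: capture a scalar from one Hive query and pass it to
# a second script as a hiveconf variable.
run_with_max_date() {
  local var
  # -S (silent) keeps log lines out of the captured output.
  var=$(hive -S -e "select max(datekey) from some_table;") || return 1
  # Refuse to continue with an empty value.
  [ -n "$var" ] || { echo "no value returned" >&2; return 1; }
  # test.hql can read the value as ${hiveconf:MID_DATE}.
  hive --hiveconf "MID_DATE=$var" -f test.hql
}

# Usage:
#   run_with_max_date
```

Checking that the captured value is non-empty before the second call avoids running test.hql with a blank substitution when the first query fails or returns nothing.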
You can simply achieve this using a shell script.
create a shell script
file: avg_op.sh
#!/bin/sh
hive -e 'use telecom;select avg(a) from abc;' > avg.txt
wait
value=`cat avg.txt`
hive --hiveconf avgval=$value -e "set avgval;set hiveconf:avgval;
use telecom;
select * from abc2 where avg_var=\${hiveconf:avgval};"
execute the .sh file
>bash avg_op.sh
If you are trying to capture a number from a Hive or Impala query in Linux, you can do so by executing the query and extracting the digits with a regex.
With Hive,
max=`beeline -u ${hiveConnectionUrl} -e "select max(col1) from schema_name.table_name;" | sed 's/[^0-9]*//g'`
The main part is extracting the number from the result. Also, if the output is too noisy, you can use the --silent=true flag to silence the execution, which reduces the log messages.
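Note that sed 's/[^0-9]*//g' deletes every non-digit character, so a decimal such as 12.5 becomes 125, and digits from headers or log lines are merged into the result. Below is a sketch of a stricter filter, assuming the output contains one line that holds only the numeric value (possibly wrapped in table borders). The function name extract_number is illustrative.

```shell
#!/bin/bash
# Hypothetical helper: read query output on stdin and print the first line
# that consists solely of an integer or decimal number.
extract_number() {
  # Strip table borders (|) and whitespace, then keep the first line that
  # is a plain number, optionally negative or with a decimal part.
  sed 's/[|[:space:]]//g' | grep -E -m 1 '^-?[0-9]+(\.[0-9]+)?$'
}

# Usage (connection URL and query are placeholders):
#   max=$(beeline -u "$hiveConnectionUrl" --silent=true \
#         -e "select max(col1) from schema_name.table_name;" | extract_number)
```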
You can use BeeTamer for that. It lets you store a result (or part of it) in a variable and use this variable later in your code.
BeeTamer is a macro language / macro processor that extends the functionality of the Apache Hive and Cloudera Impala engines.
select avg(a) from abc;
%capture MY_AVERAGE;
select * from abc2 where avg_var=#MY_AVERAGE#;
Here you save the average value from your query into the macro variable MY_AVERAGE and then reuse it in the second query.
Try the following:
$ var=$(hive -e "select '12' ")
$ echo $var
12 -- output