Google BigQuery bq command line: execute a query from a file

I use the bq command line tool to run queries, e.g.:
bq query "select * from table"
What if I store the query in a file and run the query from that file? Is there a way to do that?

The other answers seem to be either outdated or needlessly brittle. As of 2019, bq query reads from stdin, so you can just redirect your file into it:
bq query < myfile.sql
Query parameters are passed like this:
bq query --parameter name:type:value < myfile.sql
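For instance, a minimal sketch with a named parameter (the file name, table, and parameter here are hypothetical; named parameters require standard SQL):
# my_query.sql contains: SELECT name FROM mydataset.mytable WHERE age > @min_age
bq query --use_legacy_sql=false --parameter=min_age:INT64:25 < my_query.sql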

There is another way.
Try this:
bq query --flagfile=[your file with absolute path]
Ex:
bq query --flagfile=/home/user/abc.sql

You can run a query from a text file with a little bit of shell magic:
$ echo "SELECT 17" > qq.txt
$ bq query "$(cat qq.txt)"
Waiting on bqjob_r603d91b7e0435a0f_00000150c56689c6_1 ... (0s) Current status: DONE
+-----+
| f0_ |
+-----+
| 17 |
+-----+
Note this works on any Unix variant (including macOS). If you're using Windows, this should work under PowerShell but not the default cmd prompt.

If you are using standard SQL (not legacy SQL):
Steps:
1. Create a .sql file (you can use any extension).
2. Put your query in it. Make sure there is a semicolon (;) at the end of the query.
3. Go to the command line and execute the commands below.
4. If you want to add parameters, specify them one after another, each with its own --parameter flag.
Example:
bq query --use_legacy_sql=False "$(cat /home/airflow/projects/bql/query/test.sql)"
For a parameter:
bq query --use_legacy_sql=False --parameter=country::USA "$(cat /home/airflow/projects/bql/query/test.sql)"
cat >/home/airflow/projects/bql/query/test.sql
select * from l1_gcb_trxn.account where country=@country;
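Putting the pieces together, a minimal end-to-end sketch (the path and table name are taken from the example above; the STRING type is written out explicitly instead of being left blank):
cat > /home/airflow/projects/bql/query/test.sql <<'EOF'
select * from l1_gcb_trxn.account where country = @country;
EOF
bq query --use_legacy_sql=false --parameter=country:STRING:USA "$(cat /home/airflow/projects/bql/query/test.sql)"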

This thread offers a good solution:
bq query `cat my_query.sql`

bq query --replace --use_legacy_sql=false \
  --destination_table=syw-analytics:store_ranking.SHC_ENGAGEMENT_RANKING_TEST \
  "SELECT RED,
  DEC,
  REDEM
  from \`syw.abc.xyz\`"
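If the backtick escaping gets awkward, the same command can read the query from a file instead (the file name here is hypothetical), combining this with the stdin redirect shown earlier:
bq query --replace --use_legacy_sql=false \
  --destination_table=syw-analytics:store_ranking.SHC_ENGAGEMENT_RANKING_TEST \
  < ranking_query.sql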

Related

big query command line to execute multiple sql files

Does anyone here know how to execute multiple SQL files from the bq command line? For example, if I have 2 SQL files named test1.sql and test2.sql, how should I do it?
If I do this:
bq query --use_legacy_sql=false < test1.sql
this only executes the test1.sql.
What I want to do is to execute both test1.sql and test2.sql.
There isn't a way to do it with a single bq query call.
The best option if you want one line is to use the && operator:
bq query --use_legacy_sql=false < test1.sql && bq query --use_legacy_sql=false < test2.sql
There is an alternative way, which is to use a shell script to loop through all of the files:
#!/bin/bash
# run every .sql file in the directory through bq
FILES="/path/to/sqls/*.sql"
for f in $FILES
do
  bq query --use_legacy_sql=false < "$f"
done
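If the files are spread across nested directories, an equivalent one-liner with find (a sketch, assuming the .sql extension) would be:
find /path/to/sqls -name '*.sql' -exec sh -c 'bq query --use_legacy_sql=false < "$1"' _ {} \;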

BigQuery error when using wildcard in "select * from ..." only when executed on GCE VM

I get an error from BigQuery when running a basic query with a wildcard:
bq query --use_legacy_sql=false "SELECT * FROM mydata.states LIMIT 10"
The problem is with the * - here is the error I get from bq when running it on the VM in GCE:
Error in query string: Error processing job '...': Field 'workspace' not found in table 'mydata.states'.
The "workspace" is the name of the directory in my current working directory - it appears that bq is expanding that (similar to ls *).
The same command works just fine in the bq shell without expanding * to the first directory it finds. The same query works perfectly fine on my local Ubuntu machine outside of GCE.
If I list the columns explicitly, it works fine. I can't figure out what makes bq replace * with the directory name in my current path, or how to disable that.
I have two very similar machines running bq command line version 2.0.24 and both are Ubuntu 14.04. Other than this, the * works in bash just as expected, including set -f, which stops expansion altogether, but it has no effect on bq...
The funny thing is that * works as expected when used in a query like this:
bq query --use_legacy_sql=false "SELECT COUNT(*) FROM mydata.states LIMIT 10"
The other odd thing is that this also works fine:
echo "SELECT * FROM mydata.states LIMIT 10" | bq query
The BigQuery command line client does not expand the * itself; that's caused by Bash. The best long-term solution would be to put your query into a file, e.g. my_query.sql. Then you can do:
bq query --use_legacy_sql=false < my_query.sql
Now you don't need to worry about escaping any part of the query, since the query text is read from the file.
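A minimal sketch of that approach, reusing the query from the question (my_query.sql is just an example file name):
echo "SELECT * FROM mydata.states LIMIT 10" > my_query.sql
bq query --use_legacy_sql=false < my_query.sql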

how to enable standard sql for BigQuery using bq shell

bq query has a --use_legacy_sql flag that can be set to false to enable standard SQL.
How do I do the same when bq shell is used?
I tried the variations below and both of them failed with the error Unknown command line flag 'use_legacy_sql'.
bq --use_legacy_sql=false shell
bq shell --use_legacy_sql=false
It doesn't look like it's possible currently, so I filed a feature request. The alternative is to pass it to "query" each time, although that feels very verbose. For example:
$ bq shell
myproject> query --use_legacy_sql=false SELECT [1, 2, 3] AS arr
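Another option, if you just want the flag applied by default every time, is to set it in your ~/.bigqueryrc file, which bq reads on startup; to the best of my knowledge a per-command section like the following works (a sketch, so verify against your bq version):
# ~/.bigqueryrc
[query]
--use_legacy_sql=false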

How to save the results of an impala query

I've loaded a large set of data from S3 into HDFS and then inserted the data into a table in Impala.
I then ran a query against this data, and I'm looking to get these results back into S3.
I'm using Amazon EMR, with Impala 1.2.4. If it's not possible to get the results of the query back to S3 directly, are there options to get the data back to HDFS and then somehow send it back to S3 from there?
I have messed around with the impala-shell -o filename option, but that appears to only work on the local Linux file system.
I thought this would have been a common scenario, but having trouble finding any information about saving the results of a query anywhere.
Any pointers appreciated.
To add to the knowledge above, I am including the command that writes the query results to a file with a delimiter, declared via the --output_delimiter option, together with the --delimited option, which switches off the default tab delimiter:
impala-shell -q "query" --delimited --output_delimiter='\001' --print_header -o 'filename'
What I usually do if it's a smallish result set is run the script from the command line then upload to s3 using the AWS command line tool:
impala-shell -e "select ble from bla" -o filename
aws s3 cp filename s3://mybucket/filename
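Combining that with the delimiter flags from the earlier answer, a sketch that produces a comma-delimited file and pushes it to S3 (the query, bucket, and file names are placeholders):
impala-shell -B --output_delimiter=',' --print_header -q "select ble from bla" -o filename
aws s3 cp filename s3://mybucket/filename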
An alternative is to use Hive as the last step in your data pipeline after you've run your query in Impala:
1. Impala step:
create table processed_data
as
select blah
--do whatever else you need to do in here
from raw_data1
join raw_data2 on a=b
2. Hive step:
create external table export
like processed_data
location 's3://mybucket/export/';
insert into table export
select * from processed_data;
If you have the AWS CLI installed, you can pipe the standard output of impala-shell straight to S3 using a Unix pipe and the stream placeholder (-), for example:
impala-shell -B -q "select ble from bla" | aws s3 cp - s3://mybucket/outputfilename

Storing query result in a variable

I have a query whose result I want to store in a variable.
How can I do it?
I tried
./hive -e "use telecom;insert overwrite local directory '/tmp/result' select
avg(a) from abc;"
./hive --hiveconf MY_VAR =`cat /tmp/result/000000_0`;
I am able to get the average value in MY_VAR, but it drops me into the Hive CLI, which is not what I want.
Also, is there a way to access Unix commands inside the Hive CLI?
Use Case: in mysql the following is valid:
set @max_date := (select max(date) from some_table);
select * from some_other_table where date > @max_date;
This is super useful for scripts that need to repeatedly call this variable since you only need to execute the max date query once rather than every time the variable is called.
Hive does not currently support this. (Please correct me if I'm wrong! I have been trying to figure out how to do this all afternoon.)
My workaround is to store the required variable in a table that is small enough to map join onto the query in which it is used. Because the join is a map join rather than a full shuffle join, it should not significantly hurt performance. For example:
drop table if exists var_table;
create table var_table as
select max(date) as max_date from some_table;
select some_other_table.*
from some_other_table
left join var_table
where some_other_table.date > var_table.max_date;
The solution suggested by @visakh is not optimal, because it stores the string 'select count(1) from table_name;' rather than the returned value, and so will not be helpful in cases where you need to call a variable repeatedly during a script.
Storing Hive query output in a variable and using it in another query:
In the shell, create a variable with the desired value:
var=`hive -S -e "select max(datekey) from ....;"`
echo $var
Use the variable's value in another Hive query:
hive -hiveconf MID_DATE=$var -f test.hql
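where test.hql references the variable through the hiveconf prefix, roughly like this (the table and column names are placeholders):
-- test.hql
select * from some_other_table where datekey > ${hiveconf:MID_DATE};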
You can simply achieve this using a shell script.
create a shell script
file: avg_op.sh
#!/bin/sh
hive -e 'use telecom;select avg(a) from abc;' > avg.txt
wait
value=`cat avg.txt`
hive --hiveconf avgval=$value -e "set avgval;set hiveconf:avgval;
use telecom;
select * from abc2 where avg_var=\${hiveconf:avgval};"
execute the .sh file
>bash avg_op.sh
If you are trying to capture a number from a Hive or Impala query in Linux, you can achieve this by executing the query and extracting the number with a regex.
With Hive,
max=`beeline -u ${hiveConnectionUrl} -e "select max(col1) from schema_name.table_name;" | sed 's/[^0-9]*//g'`
The main part is to extract the number from the result. Also, if you're getting too much log output, you can use the --silent=true flag to quiet the execution, which reduces the log messages.
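The captured value can then be fed back into another beeline call via --hivevar, roughly like this (a sketch; the table and column names reuse the example above, and the backslash keeps the shell from expanding the Hive variable reference):
beeline -u ${hiveConnectionUrl} --hivevar max_val=$max -e "select * from schema_name.table_name where col1 = \${hivevar:max_val};"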
You can use BeeTamer for that. It allows you to store a result (or part of it) in a variable and use this variable later in your code.
BeeTamer is a macro language / macro processor that allows you to extend the functionality of the Apache Hive and Cloudera Impala engines.
select avg(a) from abc;
%capture MY_AVERAGE;
select * from abc2 where avg_var=#MY_AVERAGE#;
Here you save the average value from your query into the macro variable MY_AVERAGE and then reuse it in the second query.
Try this:
$ var=$(hive -e "select '12' ")
$ echo $var
12 -- output