Storing query result in a variable

I have a query whose result I want to store in a variable. How can I do it?
I tried
./hive -e "use telecom;insert overwrite local directory '/tmp/result' select
avg(a) from abc;"
./hive --hiveconf MY_VAR=`cat /tmp/result/000000_0`;
I am able to get the average value into MY_VAR, but the command drops me into the Hive CLI, which I don't want.
Also, is there a way to access Unix commands from inside the Hive CLI?

Use case: in MySQL the following is valid:
set @max_date := (select max(date) from some_table);
select * from some_other_table where date > @max_date;
This is super useful for scripts that need this value repeatedly, since you only execute the max-date query once rather than every time the variable is referenced.
Hive does not currently support this. (Please correct me if I'm wrong! I have been trying to figure out how to do this all afternoon.)
My workaround is to store the required value in a table that is small enough to map-join onto the query in which it is used. Because this is a map-side join against a single-row table rather than a full shuffle join, it should not significantly hurt performance. For example:
drop table if exists var_table;
create table var_table as
select max(date) as max_date from some_table;
select some_other_table.*
from some_other_table
cross join var_table
where some_other_table.date > var_table.max_date;
The solution suggested by @visakh is not optimal, because it stores the string 'select count(1) from table_name;' rather than the returned value, and so will not help in cases where you need to reference the variable repeatedly during a script.

Storing hive query output in a variable and using it in another query.
In the shell, create a variable with the desired value:
var=`hive -S -e "select max(datekey) from ....;"`
echo $var
Use the variable's value in another Hive query:
hive --hiveconf MID_DATE=$var -f test.hql
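The mechanics of this capture-and-pass pattern can be sketched without a live Hive installation by substituting `echo` for the `hive` calls; the value `20240101` and the variable name `MID_DATE` are illustrative assumptions:

```shell
#!/bin/sh
# Stand-in for: var=`hive -S -e "select max(datekey) from ...;"`
# echo simulates the single scalar value the query would return.
var=$(echo "20240101")

# In the real flow you would now run:
#   hive --hiveconf MID_DATE=$var -f test.hql
# and test.hql would reference the value as ${hiveconf:MID_DATE}, e.g.:
#   select * from some_table where datekey > ${hiveconf:MID_DATE};
echo "MID_DATE=$var"
```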

You can simply achieve this using a shell script.
Create a shell script:
file: avg_op.sh
#!/bin/sh
hive -e 'use telecom;select avg(a) from abc;' > avg.txt
value=`cat avg.txt`
hive --hiveconf avgval=$value -e "set avgval;set hiveconf:avgval;
use telecom;
select * from abc2 where avg_var=\${hiveconf:avgval};"
Execute the .sh file:
$ bash avg_op.sh

If you're trying to capture a number from a Hive or Impala query on Linux, you can do it by executing the query and extracting the number with a regex.
With Hive,
max=`beeline -u ${hiveConnectionUrl} -e "select max(col1) from schema_name.table_name;" | sed 's/[^0-9]*//g'`
The main part is extracting the number from the result. Also, if the output is too noisy, you can use the --silent=true flag to silence the execution, which reduces the log messages.
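The number-extraction step can be tried on its own against a mocked beeline result (the table below is a made-up example; the header is chosen digit-free so the regex keeps only the value):

```shell
#!/bin/sh
# Mocked beeline output: an ASCII-framed table around the value 4711.
beeline_output='+---------+
| max_val |
+---------+
| 4711    |
+---------+'

# Same idea as in the answer: delete every non-digit character.
# tr -d '\n' joins the lines, since sed works line by line.
max=$(printf '%s\n' "$beeline_output" | sed 's/[^0-9]*//g' | tr -d '\n')
echo "$max"
```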

You can use BeeTamer for that. It allows you to store a result (or part of it) in a variable and use that variable later in your code.
BeeTamer is a macro language / macro processor that extends the functionality of the Apache Hive and Cloudera Impala engines.
select avg(a) from abc;
%capture MY_AVERAGE;
select * from abc2 where avg_var=#MY_AVERAGE#;
Here you save the average value from your query into the macro variable MY_AVERAGE and then reuse it in the second query.

Try this:
$ var=$(hive -e "select '12' ")
$ echo $var
12 -- output

How to export hive query result to single local file?

I want to export a Hive query result to a single local file with a pipe delimiter.
The Hive query contains an ORDER BY clause.
I have tried the solutions below.
Solution1:
hive -e "insert overwrite local directory '/problem1/solution' row format delimited fields terminated by '|' select * from table_name order by rec_date"
This solution creates multiple files, and after merging the files the data order is lost.
Solution2:
beeline -u 'jdbc:hive2://server_ip:10000/db_name' --silent --outputformat=dsv --delimiterForDSV='|' -e 'select * from table_name order by rec_date' > /problem1/solution
This solution creates a single file, but it has 2 empty lines at the top and 2 at the bottom.
I am removing the empty lines with a sed command, which takes a very long time.
Is there any other efficient way to achieve this?
Try these settings to force the ORDER BY onto a single reducer:
set hive.optimize.sampling.orderby=false; --disable parallel ORDER BY
Or try to set the number of reducers manually:
set mapred.reduce.tasks=1;
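Putting these settings together with the single-file export from the question, a sketch might look like the following (the path and table name are the question's placeholders; the script only writes the .hql file here, since actually running it needs a live Hive installation):

```shell
#!/bin/sh
# Write the statements to a file first; this avoids the nested-quote
# problems from Solution1 in the question.
cat > /tmp/export_single.hql <<'EOF'
set hive.optimize.sampling.orderby=false;
set mapred.reduce.tasks=1;
insert overwrite local directory '/problem1/solution'
row format delimited fields terminated by '|'
select * from table_name order by rec_date;
EOF

# Then run it against the cluster:
# hive -f /tmp/export_single.hql
wc -l < /tmp/export_single.hql
```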

Writing a loop in a shell script to read output of HIVE Select

I am new to writing shell scripts. I currently have some reports that we run in Hive SQL, and I am trying to automate them as best I can. Currently I use crontab in our UNIX environment to run these queries automatically every day. Right now I have to paste the data into Excel, then filter and create separate documents for each "end-user".
What I am trying to accomplish is this:
I have a column in my output query that shows the name of a company: CompanyA, CompanyB, CompanyC,
etc. Depending on the details, there could be anywhere from 12 to 20 different "companies". I don't want to hard-code a query for each one to create its own output. What I would like to do is have a query that selects each unique company in that field (agency_name) and then run my select statement:
select * from output_results where agency_name = "name here" and then write this output to a CSV named Balance_Detail_"NAME"_Date, looping through to create an output for each name found in the agency_name field.
Here is a template for how you can loop over a query result, call another Hive query for each row, and save the output as a CSV. It needs debugging, and of course use your own queries:
#!/bin/bash
dt=$(date +'%Y_%m_%d')
for NAME in $(hive -S -e "select distinct agency_name from ..... ")
do
#echo "Company name is $NAME"
hive -S -e "select * from output_results where agency_name = '$NAME'" | sed 's/[[:space:]]\+/,/g' > Balance_Detail_${NAME}_${dt}.csv
done
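The loop mechanics above can be dry-run without Hive by substituting a fixed company list for the hive call; CompanyA/CompanyB and the /tmp output path are illustrative stand-ins:

```shell
#!/bin/bash
dt=$(date +'%Y_%m_%d')
# Stand-in for: hive -S -e "select distinct agency_name from ..."
for NAME in CompanyA CompanyB
do
    # Stand-in for the per-company select piped through sed.
    echo "balance detail for $NAME" > "/tmp/Balance_Detail_${NAME}_${dt}.csv"
done
ls /tmp/Balance_Detail_*.csv
```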

Why don't decorators work on integer range partitioned bigquery tables?

I created an integer range partitioned BigQuery table similar to the one described in the tutorial:
CREATE TABLE
mydataset.newtable
PARTITION BY
RANGE_BUCKET(customer_id, GENERATE_ARRAY(1, 100, 1))
AS SELECT 1 AS customer_id, DATE "2019-10-01" AS date1
However, when I try to extract a single partition into a bucket by running in bash:
bq extract myproject:mydataset.newtable\$1 gs://mybucket/newtable.csv
I get an error "partition key is not valid". Why? How do I find the valid keys?
Similarly, I cannot use the decorator to select from a specific partition in the query composer:
select * from mydataset.newtable$0 or select * from mydataset.newtable$1
both give
Syntax error: Illegal input character "$" at [1:46]
The $ decorator is valid in legacy SQL, but you can opt for one of these options:
# LegacySQL, legacy sql is used by default in the following command.
# From the UI you need to change it in More -> Query Settings
bq query 'SELECT * from mydataset.newtable$10'
or
# StandardSQL, the option use_legacy_sql=false force to use standard sql
bq query --use_legacy_sql=false 'SELECT * from mydataset.newtable WHERE customer_id BETWEEN 10 AND 20'
Regarding the bq extract command, I could export after removing the escape character (\):
$ bq extract myproject:mydataset.newtable$1 gs://mybucket/newtable.csv
Waiting on bqjob_..._000001701cb5d260_1 ... (0s) Current status: DONE
$ gsutil cat gs://mybucket/newtable.csv
customer_id,date1
18,2020-10-01
1,2019-10-01
2,2019-10-02
$
Edit:
After checking your comment below, you are correct: the bq extract above returns all the data.
The doc Exporting table data suggests that 'mydataset.table$N' should work, but when the escape character (\) is used, this error is returned: Partition key is invalid: key: "N"
Since there is no documentation indicating this is possible, I have created a FR to add this functionality. You can monitor the request in this link; note that there is no ETA for its resolution.
This issue has now been solved by Google, so the following command works as expected:
bq extract myproject:mydataset.newtable\$1 gs://mybucket/newtable.csv

Use result of a Hive/Impala query inside another query

I'm working with Hive/Impala, and I often need to query the result of show partitions to get a specific partition. Suppose I have a table tbl1 partitioned by the fields country and date; then show partitions tbl1 would output something like this:
country=c1/date=d1
country=c1/date=d3
country=c2/date=d2
I want to do something like select * from (show partitions tbl1) a where a.country='c1', and I want to do it in Hue or in a shell (hive and impala).
Is this possible?
I don't think what you are trying is possible directly inside Impala/Hive.
I can suggest an alternative: use bash in combination with Impala/Hive.
Instead of entering the interactive shells, use the command-line option to pass the query from the bash shell itself, so that the result comes back to bash, and then use grep or other text-processing commands on it.
It would look like:
impala-shell -k -i <> --ssl --ca_cert <> -B -q "show partitions tbl1" | grep "country=c1"
Here you need to put the required values in place of <>.
This way you can use grep/sed or other tools to get the desired output.
Obviously it depends on your exact use case, but I hope this helps.
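The grep filtering can be tried against the partition listing from the question without a cluster; the three partition strings are copied from above:

```shell
#!/bin/sh
# Simulated 'show partitions tbl1' output from the question.
partitions='country=c1/date=d1
country=c1/date=d3
country=c2/date=d2'

# Same filtering idea as the impala-shell | grep pipeline.
c1_parts=$(printf '%s\n' "$partitions" | grep 'country=c1')
printf '%s\n' "$c1_parts"
```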
If someone ever finds this useful: this is what I ended up doing. Assuming you have spark-shell or spark2-shell, you can store the output of show partitions in a DataFrame and then transform it. This is what I did (inside spark2-shell):
val df = spark.sql("show partitions tbl1").map(row => {
  val arrayValues = row.getString(0).split("/")
  (arrayValues.head.split("=")(1), arrayValues(1).split("=")(1))
}).toDF("country", "date")
This takes the list of partitions (a one-string-column DataFrame), splits each row on /, and then for each piece splits on = and takes the value.

run shell command inside Hive that uses Hive's variables value as input

I have a Python script that receives a Hive table name and 2 dates, and adds all partitions between those dates (it runs a bunch of hive -e 'alter table ... add partition (date=...)' commands).
What I would like to do is, when running a Hive script that has a hiveconf:date variable, pass it to the Python script as input.
something like:
!python addpartitions.py mytable date=${hiveconf:date}
but of course the variable substitution does not take place...
Any way to achieve this?