Use result of a Hive/Impala query inside another query - hive

I'm working with Hive/Impala, and I often run into the need to query the results of a show partitions command to get a specific partition. Let's suppose I have a table tbl1 partitioned by the fields country and date. Then show partitions tbl1 would result in something like this:
country=c1/date=d1
country=c1/date=d3
country=c2/date=d2
I want to do something like select * from (show partitions tbl1) a where a.country='c1', and I want to do this in Hue or from the shell (Hive and Impala).
Is this possible?

I don't think what you are trying is possible inside impala/hive directly.
I can suggest an alternative way:
Use bash in combination with impala/hive: instead of entering interactive mode, use the command-line option to pass the query from the bash shell itself, so that the result comes back to the shell, where you can process it with grep or other text-processing commands.
It would look something like this:
impala-shell -k -i <> --ssl --ca_cert <> -B -q "show partitions tbl1" | grep "country=c1"
Here you need to put the required values in place of <>.
In this way you can use grep/sed or other tools to get the desired output.
Obviously it depends on your use case what exactly you want, but I hope this helps.
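To see the text-processing step in isolation, here is a minimal sketch where printf stands in for the actual impala/hive invocation (the partition strings are hypothetical sample data):

```shell
# printf simulates the output of "show partitions tbl1";
# in practice you would pipe the impala-shell/hive command instead.
printf 'country=c1/date=d1\ncountry=c1/date=d3\ncountry=c2/date=d2\n' \
  | grep "country=c1"
```

This keeps only the two partitions for country=c1; the same grep works unchanged on the real command's output.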

If someone ever finds this useful, this is what I ended up doing. Assuming you have spark-shell or spark2-shell available, you can store the output of show partitions in a dataframe and then transform that dataframe. This is what I did (inside spark2-shell):
val df = spark.sql("show partitions tbl1")
  .map(row => {
    // each row holds one partition string, e.g. "country=c1/date=d1"
    val arrayValues = row.getString(0).split("/")
    // keep only the value after "=" in each key=value piece
    (arrayValues.head.split("=")(1), arrayValues(1).split("=")(1))
  })
  .toDF("country", "date")
This takes the list of partitions (a dataframe with a single string column), splits each row on /, and then, for each piece, splits on = and takes the value. The resulting dataframe can then be filtered as needed, e.g. df.where("country = 'c1'").
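If you would rather stay out of Spark, the same split logic can be sketched in the shell with awk (the partition string here is a hypothetical sample):

```shell
# split "country=c1/date=d1" on "/" into key=value pieces,
# then print the part after "=" from each piece
echo "country=c1/date=d1" \
  | awk -F'/' '{split($1,a,"="); split($2,b,"="); print a[2], b[2]}'
# prints: c1 d1
```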

Related

Hive - Is there a way to dynamically create tables from a list

I'm using Hive to aggregate stats, and I want to do a breakdown by the industry our customers fall under. Ideally, I'd like to write the stats for each industry to a separate output file per industry (e.g. industry1_stats, industry2_stats, etc.). I have a list of various industries our customers are in, but that list isn't pre-set.
So far, everything I've seen from Hive documentation indicates that I need to know what tables I'd want beforehand and hard-code those into my Hive script. Is there a way to do this dynamically, either in the Hive script itself (preferable) or through some external code before kicking off the Hive script?
I would suggest going with a shell script.
Get the list of industry names:
hive -e 'select distinct industry_name from [dbname].[table_name];' > list
Iterate over every line, passing each line (an industry name) of list as an argument to the while loop:
tail -n +1 list | while IFS=' ' read -r industry_name
do
hive -hiveconf MY_VAR="$industry_name" -f my_script.hql
done
save the shell script as test.sh
and in my_script.hql
use uvtest;
create table ${hiveconf:MY_VAR} (id INT, name CHAR(10));
you'll have to place both the test.sh and my_script.hql in the same folder.
The command below should create one table per industry name in the list.
sh test.sh
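The loop logic can be dry-run without a Hive installation; this sketch uses a hand-written list file and echoes the command each iteration would run (file path and industry names are hypothetical):

```shell
# stand-in for: hive -e 'select distinct industry_name ...' > list
printf 'retail\nfinance\n' > /tmp/industry_list
while IFS= read -r industry_name; do
  # in the real script this line would be:
  # hive -hiveconf MY_VAR="$industry_name" -f my_script.hql
  echo "would create table: $industry_name"
done < /tmp/industry_list
```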
Follow this link for using hive in shell scripts:
https://www.mapr.com/blog/quick-tips-using-hive-shell-inside-scripts
I wound up achieving this using Hive's dynamic partitioning (each partition writes to a separate directory on disk, so I can just iterate through those directories). The official Hive documentation on partitioning and this blog post were particularly helpful for me.

Write results of SQL query to multiple files based on field value

My team uses a query that generates a text file over 500MB in size.
The query is executed from a Korn Shell script on an AIX server connecting to DB2.
The results are ordered and grouped by a specific field.
My question: Is it possible, using SQL, to write all rows with this specific field value to its own text file?
For example: All rows with field VENDORID = 1 would go to 1.txt, VENDORID = 2 to 2.txt, etc.
The field in question currently has 1000+ different values, so I would expect the same amount of text files.
Here is an alternative approach that gets each file directly from the database.
You can use the DB2 export command to generate each file. Something like this should create one file:
db2 export to 1.txt of DEL select * from table where vendorid = 1
I would use a shell script or something like Perl to automate the execution of such a command for each value.
Depending on how fancy you want to get, you could just hardcode the extent of vendorid, or you could first get the list of distinct vendorids from the table and use that.
This method might scale a bit better than extracting one huge text file first.
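As a rough sketch of that automation, the loop below echoes the command each iteration would run (the table name and the hardcoded id list are hypothetical; in practice the ids would come from a distinct-vendorid query against the table):

```shell
# one export command per vendor id; echo shows what would be executed
for vendorid in 1 2 3; do
  echo "db2 \"export to ${vendorid}.txt of del select * from table where vendorid = ${vendorid}\""
done
```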

Storing query result in a variable

I have a query whose result I want to store in a variable.
How can I do it?
I tried:
./hive -e "use telecom;insert overwrite local directory '/tmp/result' select
avg(a) from abc;"
./hive --hiveconf MY_VAR =`cat /tmp/result/000000_0`;
I am able to get the average value into MY_VAR, but the second command drops me into the Hive CLI, which is not what I want.
Also, is there a way to access Unix commands inside the Hive CLI?
Use case: in MySQL the following is valid:
set @max_date := (select max(date) from some_table);
select * from some_other_table where date > @max_date;
This is super useful for scripts that need to repeatedly call this variable since you only need to execute the max date query once rather than every time the variable is called.
Hive does not currently support this. (Please correct me if I'm wrong! I have been trying to figure out how to do this all afternoon.)
My workaround is to store the required variable in a table that is small enough to map-join onto the query in which it is used. Because the join is a map join rather than a common (shuffle) join, it should not significantly hurt performance. For example:
drop table if exists var_table;
create table var_table as
select max(date) as max_date from some_table;
select some_other_table.*
from some_other_table
cross join var_table
where some_other_table.date > var_table.max_date;
The suggested solution by @visakh is not optimal, because it stores the string 'select count(1) from table_name;' rather than the returned value, and so will not be helpful in cases where you need to call a variable repeatedly during a script.
Storing Hive query output in a variable and using it in another query:
In the shell, create a variable with the desired value:
var=`hive -S -e "select max(datekey) from ....;"`
echo $var
Then use the variable's value in another Hive query:
hive -hiveconf MID_DATE=$var -f test.hql
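A minimal simulation of that capture step, with echo standing in for the hive call (the value is a made-up datekey):

```shell
# in practice: var=$(hive -S -e "select max(datekey) from some_table;")
var=$(echo "20240101")
echo "datekey is $var"
# prints: datekey is 20240101
```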
You can achieve this simply using a shell script.
Create a shell script, file avg_op.sh:
#!/bin/sh
# run the first query and capture its output in a file
hive -e 'use telecom;select avg(a) from abc;' > avg.txt
value=$(cat avg.txt)
# pass the captured value into the second query as a hiveconf variable
hive --hiveconf avgval="$value" -e "set avgval;set hiveconf:avgval;
use telecom;
select * from abc2 where avg_var=\${hiveconf:avgval};"
Execute the .sh file:
bash avg_op.sh
If you're trying to capture a number from a Hive or Impala query on Linux, you can do it by executing the query and extracting the digits with a regex.
With Hive,
max=`beeline -u ${hiveConnectionUrl} -e "select max(col1) from schema_name.table_name;" | sed 's/[^0-9]*//g'`
The main part is extracting the number from the result. Also, if you're getting too much output, you can use the --silent=true flag to silence the execution, which reduces the log messages.
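The digit-extraction step can be checked against a fake beeline result table; printf stands in for the beeline call, and a tr -d '\n' is added here to drop the blank lines that the border rows leave behind:

```shell
# printf simulates beeline's boxed output for "select max(col1) ..."
max=$(printf '+------+\n| 42   |\n+------+\n' | sed 's/[^0-9]*//g' | tr -d '\n')
echo "$max"
# prints: 42
```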
You can use BeeTamer for that. It allows you to store a result (or part of it) in a variable and use this variable later in your code.
BeeTamer is a macro language / macro processor that extends the functionality of the Apache Hive and Cloudera Impala engines.
select avg(a) from abc;
%capture MY_AVERAGE;
select * from abc2 where avg_var=#MY_AVERAGE#;
Here you save the average value from your query into the macro variable MY_AVERAGE and then reuse it in the second query.
Try the below:
$ var=$(hive -e "select '12' ")
$ echo $var
12 -- output

run shell command inside Hive that uses Hive's variables value as input

I have a python script that receives a Hive table name and 2 dates and adds all partitions between those dates. (runs a bunch of hive -e 'alter table add partition (date=...)')
What I would like to do, when running a Hive script that has a hiveconf:date variable, is pass it to the python script as input, something like:
!python addpartitions.py mytable date=${hiveconf:date}
but of course the variable substitution does not take place...
Any way to achieve this?
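One hedged workaround, since ! lines don't get variable substitution: resolve the date in the shell, outside Hive, and hand it to both scripts from there. All file and table names below are hypothetical, and echo shows the commands such a wrapper would invoke:

```shell
# resolve the date once in the shell ...
my_date="2020-01-01"
# ... then pass it to both the python script and the hive script
echo "would run: python addpartitions.py mytable date=${my_date}"
echo "would run: hive -hiveconf date=${my_date} -f myscript.hql"
```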

Best practice to add time partitions to a table

I have an event table, partitioned by time (year, month, day, hour).
I want to join a few events in a Hive script that gets the year, month, day, and hour as variables.
How can I also add, for example, events from all 6 hours prior to my time, without doing 'recover all...'?
Thanks.
So basically what I needed was a way to use a date that the Hive script receives as a parameter, and add all partitions from 3 hours before to 3 hours after that date, without recovering all partitions and adding the specific hours in every WHERE clause.
I didn't find a way to do it inside the Hive script, so I wrote a quick python script that gets a date and table name, along with how many hours to add before/after.
When trying to run it inside the Hive script with:
!python script.py tablename ${hiveconf:my.date} 3
I was surprised that the variable substitution does not take place in a line that starts with !
My workaround was to get the date that the hive script received from the log file on the machine, using something like:
cat /mnt/var/log/hadoop/steps/$(ls /mnt/var/log/hadoop/steps/ | sort -r | head -n 1)/stdout
and from there you can parse each hive parameter in the python code without passing it via Hive.
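For reference, the ±3 hours computation itself can be sketched in the shell with GNU date using epoch arithmetic (the base timestamp here is hypothetical), which may be simpler than doing it in python:

```shell
# base timestamp the script would receive as its date parameter
base_epoch=$(date -u -d "2020-01-01 12:00 UTC" +%s)
# emit one partition spec per hour, 3 hours before through 3 hours after
for off in -3 -2 -1 0 1 2 3; do
  date -u -d "@$((base_epoch + off * 3600))" "+year=%Y/month=%m/day=%d/hour=%H"
done
```

This prints seven partition specs, from hour=09 through hour=15 on that date.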