Expression in Hive LIMIT clause - sql

In Impala, you can do this:
SELECT x FROM t1 LIMIT cast(truncate(9.9) AS INT);
But in Hive, LIMIT seems to accept only a constant (LIMIT [constant]).
Is there a way to use an expression in LIMIT?
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_limit.html

Unfortunately, this is not possible in Hive. The LIMIT clause accepts only constants or pre-calculated variables as arguments. As a workaround, you can calculate the value in the shell and pass it to Hive with --hivevar.
Demo with a variable set inside the session (the same variable can also be passed as a --hivevar argument on the hive command line):
hive> set hivevar:limit=10;
hive> select 10 limit ${hivevar:limit};
OK
10
Time taken: 0.098 seconds, Fetched: 1 row(s)
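The shell side of that workaround could look roughly like the sketch below. The input value 9.9 and the table t1 are taken from the question; the variable names raw, limit and cmd are made up for illustration. The command is only printed, not executed, so the sketch can be followed without a Hive installation.

```shell
#!/bin/sh
# Hypothetical input: the number the LIMIT should be derived from.
raw=9.9

# Drop the fractional part in the shell, mirroring cast(truncate(9.9) AS INT).
limit=${raw%.*}

# Build the hive call. \$ keeps ${hivevar:limit} literal so that Hive,
# not the shell, performs the substitution.
cmd="hive --hivevar limit=$limit -e 'SELECT x FROM t1 LIMIT \${hivevar:limit};'"

# Print the command instead of running it.
echo "$cmd"
```

By the time Hive parses the statement, the LIMIT argument is already the constant 9, which is the only form the clause accepts.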

Related

Remove nulls from an array in SQL

I want to remove nulls from an array in Hive/SQL.
For example, the array is ['1',null]; after converting it to a string, the value should be '1' only.
To join the array into a string I am using the following:
concat_ws( ",", array_val)
This gives: 1,null
Required output: 1
Thanks for the help!
Use regexp_replace to remove null from the concatenated string:
hive> select regexp_replace('null,1,2,null,2,3,null','(,+null)|(^null,)','');
OK
1,2,2,3
Time taken: 6.006 seconds, Fetched: 1 row(s)
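To get a feel for what this pattern does and does not cover, it can be tried outside Hive. The sketch below uses sed -E as a stand-in for Hive's regexp_replace, on the assumption that both engines treat this simple pattern the same way; strip_nulls is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: apply the same pattern as the regexp_replace call above.
strip_nulls() {
  printf '%s\n' "$1" | sed -E 's/(,+null)|(^null,)//g'
}

strip_nulls 'null,1,2,null,2,3,null'   # prints 1,2,2,3
strip_nulls '1,null'                   # prints 1
strip_nulls 'null'                     # prints null (no comma, so nothing matches)
```

The last call suggests that a string consisting only of null is left unchanged, so that case may need separate handling.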

Hive variable concatenation

I am facing problems concatenating the value of a variable with a string.
My script contains the following:
set hivevar:tab_dt= substr(date_sub(current_date,1),1,10);
CREATE TABLE default.udr_lt_bc_${hivevar:tab_dt}
(
trans_id double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
In the above, the variable tab_dt gets assigned correctly with yesterday's date in the format yyyymmdd.
But when I try to concatenate this variable with a static string in a table name, the script fails; the concatenation is not performed.
Kindly provide a solution.
Note: I also tried the following, which errors out as well:
set hivevar:tab_dt= substr(date_sub(current_date,1),1,10);
set hivevar:tab_nm1= default.udr_lt_bc_;
set hivevar:tab_name= concat(${hivevar:tab_dt},${hivevar:tab_nm1})
CREATE TABLE ${hivevar:tab_name}
(
trans_id double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
This too is returning an error.
Hive does not evaluate expressions stored in variables; it substitutes their text as is.
Your create table statement results in this:
CREATE TABLE default.udr_lt_bc_substr(date_sub(current_date,1),1,10)...
Your second expression results in this:
CREATE TABLE concat(substr(date_sub(current_date,1),1,10),default.udr_lt_bc_)
Unfortunately, Hive does not support such expressions in DDL.
I recommend calculating this value in the shell and passing it to the Hive script as a --hivevar.
For example, in the shell script:
table_name=udr_lt_bc_$(date +'%Y_%m_%d' --date "-1 day")
#table_name is udr_lt_bc_2017_10_31 now
#call your script
hive --hivevar table_name="$table_name" -f your_script.hql
Then, in your_script.hql, you can use the variable:
CREATE TABLE default.${hivevar:table_name}
Note that '-' is not allowed in table names, which is why I used '_' instead.
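If the date arrives already formatted with dashes, the replacement can also be done in the shell before the name is handed to Hive. A minimal sketch, assuming a fixed example date; run_date and suffix are illustrative names only:

```shell
#!/bin/sh
# Hypothetical input: a date string in yyyy-MM-dd form.
run_date="2017-10-31"

# Replace every '-' with '_' so the value is safe inside a table name.
suffix=$(printf '%s' "$run_date" | sed 's/-/_/g')

table_name="udr_lt_bc_$suffix"
echo "$table_name"   # prints udr_lt_bc_2017_10_31
```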
To better understand how Hive substitutes variables, try this:
hive> set hivevar:tab_dt= substr(date_sub(current_date,1),1,10);
hive> select ${hivevar:tab_dt};
OK
2017-10-31
Time taken: 1.406 seconds, Fetched: 1 row(s)
hive> select '${hivevar:tab_dt}';
OK
substr(date_sub(current_date,1),1,10)
Time taken: 0.087 seconds, Fetched: 1 row(s)
Note that in the first select statement the variable was substituted as is before execution, and the resulting expression was then evaluated as part of the SQL. The second select statement prevents evaluation because the variable is quoted, so the text remains as is: substr(date_sub(current_date,1),1,10).
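The same behaviour can be imitated with plain shell string interpolation. This is only an analogy for Hive's textual substitution, not Hive itself; unquoted and quoted are made-up names. It prints the query text that the parser ends up seeing in each case:

```shell
#!/bin/sh
# The variable holds the text of an expression, not its result.
tab_dt='substr(date_sub(current_date,1),1,10)'

# Unquoted: the text becomes part of the SQL and is evaluated as a function call.
unquoted="select ${tab_dt};"

# Quoted: the same text ends up inside a string literal and is returned unchanged.
quoted="select '${tab_dt}';"

echo "$unquoted"
echo "$quoted"
```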
Another way in Hive:
select concat("table_",date_sub(from_unixtime(unix_timestamp(current_date,'yyyy-MM-dd'),'yyyy-MM-dd'),0));
Here, the expression above can be placed in a variable and used as needed.

Optimizing multiple identical operator and function calls in Hive?

I'm new to Hive and trying to optimize a query that is taking a while to run. I have identical calls to regexp_extract and get_json in my SELECT and WHERE statements, and I was wondering if there is a way to optimize this by storing the results from one statement and using them in the other (or if Hive is already doing something like this in the background).
Example query:
SELECT
regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)') as query
FROM
api_request_logs
WHERE
LENGTH(regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)')) > 0
Thanks!
You can use a derived table to specify the regex only once, but I don't think it runs any faster:
select * from (
select regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)') as query
from api_request_logs
) t where length(query) > 0

ROUND second argument only takes constant + hive

The following:
hive> create table t1 (val double, digit int);
hive> insert into t1 values(10,2);
hive> insert into t1 values(156660,3);
hive> insert into t1 values(8765450,4);
hive> select round(val, digit) from t1;
Gives this error:
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments
'digit': ROUND second argument only takes constant
It works fine in Impala.
Could somebody please help me point out where the error is coming from?
BigDecimal a = new BigDecimal(value);
BigDecimal roundOff = a.setScale(places, BigDecimal.ROUND_HALF_EVEN);
return roundOff.doubleValue();
Thanks, Mark, for your quick response.
I've already used a UDF to solve this. It is a known issue, HIVE-4523; I thought a patch had already been applied.
The error says that the second argument of ROUND must be a constant, i.e. in Hive you can't use a column as the second argument of the ROUND function. If you need to do that, I'd suggest creating your own UDF.

In Hive I need to Get numeric value after a particular word is it possible?

I want to get the numeric value immediately after a particular word in a string in Hive.
For example, in APDSGDSCRAM051 I need to get the numeric value after the word RAM.
Is this possible in Hive?
Note: it is not a fixed-length string.
Here you go; you need to use the built-in Hive functions substr and instr:
create table str_testing (c string);
insert into table str_testing values ('APDSGDSCRAM051');
select substr(c, instr(c, 'RAM') + 3) from str_testing;
OK
051
Time taken: 0.243 seconds, Fetched: 1 row(s)
Alternatively, you can implement this in Hive with regexp_extract:
select regexp_extract(name, '\\d+', 0) from <table_name>;
Note: I do not have a Hive environment configured, so please check this by running it on your end. This will only work for the first run of digits found in the string; if your string has numbers in multiple places, it might not return the one you want.