Using not equal symbol in hive query - hive

I need to use '!=' symbol in my hive query with partitions.
I tried something like
from sample_table
insert overwrite table sample1
partition (src='a')
select * where act=10
insert overwrite table sample1
partition (src!='a')
select * where act=20
But it shows an error at the '!=' symbol. How can I replace !=?

Try using the rlike/regex function in Hive to specify the condition.
I think you can also use the <> operator instead of !=.

partition (src!='a') - what do you expect Hive to do here: write the "select *" result into every partition except "a"? You see, partition (src='a') means that you are writing the result of the following select statement into the table's partition named "a". "PARTITION (a=b)" is not a conditional clause like "WHERE a=b"; it only specifies which partition to write to.
You just have to specify another partition name, so your query should look like:
from sample_table
insert overwrite table sample1
partition (src='a')
select * where act=10
insert overwrite table sample1
partition (src='b')
select * where act=20;
After that you should see 2 new partitions, "a" and "b", in table "sample1", holding the data from the select * where act=10 and select * where act=20 queries respectively.
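If you have more than two target partitions, it can help to generate the statement rather than type it. Here is a minimal Python sketch under the assumption that each partition value maps to exactly one WHERE condition; the helper name build_multi_insert and its parameters are made up for illustration and are not part of Hive.

```python
def build_multi_insert(source, target, partition_col, routes):
    """Build a Hive multi-insert statement as text.

    routes maps a literal partition value to the WHERE condition that
    selects the rows destined for that partition (hypothetical helper).
    """
    parts = [f"from {source}"]
    for value, condition in routes.items():
        parts.append(f"insert overwrite table {target}")
        # Each partition clause names one concrete partition; it is not a filter.
        parts.append(f"partition ({partition_col}='{value}')")
        parts.append(f"select * where {condition}")
    return "\n".join(parts) + ";"


query = build_multi_insert(
    "sample_table", "sample1", "src", {"a": "act=10", "b": "act=20"}
)
print(query)
```

The filtering lives entirely in the WHERE conditions; the partition clause only names the destination.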

May I know your Hive version?
Try using A <> B.
Description from Hive DOCS:
NULL if A or B is NULL, TRUE if expression A is NOT equal to expression B, otherwise FALSE.
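To make the NULL rule quoted above concrete, here is a small Python emulation, with None standing in for NULL. The function name hive_not_equal is hypothetical; it only mirrors the description from the docs and is not Hive code.

```python
def hive_not_equal(a, b):
    """Emulate A <> B as described in the docs, with None playing the role of NULL."""
    if a is None or b is None:
        return None  # NULL if either side is NULL
    return a != b


values = ["a", "b", None]
# A WHERE clause keeps only rows whose condition is TRUE, so the NULL row is dropped.
kept = [v for v in values if hive_not_equal(v, "a") is True]
print(kept)
```

So a filter like src <> 'a' will not return rows where src is NULL, which is worth remembering when you rely on it.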

Related

BigQuery Temp Table Column has no Name

I'm trying to create a temp table in BigQuery, something like:
CREATE TEMP TABLE myTmpTable AS
SELECT t.event_id, MAX(t.event_date)
FROM eventsTable t
WHERE t.field_name = "foo"
AND t.new_string = "bar"
GROUP BY t.event_id;
This results in error "CREATE TABLE columns must be named, but column 2 has no name". I understand that it can't extract a column name from MAX(t.event_date). Is there a way I can specify a column name?
Yes, give the aggregate an alias:
SELECT t.event_id, MAX(t.event_date) AS max_event_date
Meanwhile, the whole SELECT looks wrong to me: if you group by issue_id, then event_id should be aggregated somehow. Or you might want to group by event_id instead!
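If you want to try the aliasing idea without a BigQuery project, here is a rough sketch using Python's built-in sqlite3 as a stand-in. SQLite is not BigQuery (it would accept the unnamed column without complaint), and the sample rows are invented, so this only shows that the alias becomes the column name of the created table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eventsTable "
    "(event_id INTEGER, event_date TEXT, field_name TEXT, new_string TEXT)"
)
# Invented sample data; the last row fails the new_string filter.
conn.executemany(
    "INSERT INTO eventsTable VALUES (?, ?, ?, ?)",
    [
        (1, "2021-01-01", "foo", "bar"),
        (1, "2021-03-01", "foo", "bar"),
        (2, "2021-02-01", "foo", "bar"),
        (2, "2021-05-01", "foo", "baz"),
    ],
)
conn.execute(
    """
    CREATE TEMP TABLE myTmpTable AS
    SELECT t.event_id, MAX(t.event_date) AS max_event_date
    FROM eventsTable t
    WHERE t.field_name = 'foo'
      AND t.new_string = 'bar'
    GROUP BY t.event_id
    """
)
cursor = conn.execute("SELECT * FROM myTmpTable ORDER BY event_id")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(columns, rows)
```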

Select column value if column exists in that table else create that column and set its value to null in BigQuery

I want to select a fixed set of 450 columns from a table that may not always have all 450 columns. When a column is missing, the query should create it and set its value to null.
In SQL there is a function
if exists()
but in BigQuery I am unable to use it effectively.
Any suggestion will help a lot.
I assume in the following that you have a source table (the one with potentially "missing" columns) and an existing target table (with the desired schema).
In order to get the information of the columns of these tables, you just need to look into the INFORMATION_SCHEMA.COLUMNS table.
The solution below uses dynamic SQL, to 1) generate the desired SQL, 2) run it.
DECLARE column_selection STRING;
-- Build the column list: source columns where they exist, NULL placeholders otherwise.
SET column_selection = (
  WITH column_table AS (
    SELECT
      source.column_name AS source_column,
      tgt.column_name AS target_column
    FROM
      (SELECT column_name
       FROM `<yourproject>.<target_dataset>.INFORMATION_SCHEMA.COLUMNS`
       WHERE table_name='<target_table>') tgt
    LEFT JOIN
      (SELECT column_name
       FROM `<yourproject>.<source_dataset>.INFORMATION_SCHEMA.COLUMNS`
       WHERE table_name='<source_table>') source
    ON source.column_name = tgt.column_name
  )
  SELECT STRING_AGG(coalesce(source_column,
    CONCAT("NULL AS `", target_column, "`")), ", \n") AS col_selection
  FROM column_table
);
-- Run the generated SELECT against the source table.
EXECUTE IMMEDIATE
  FORMAT("SELECT %s FROM `<yourproject>.<source_dataset>.<source_table>`", column_selection);
Explanation of the steps
1. Build a column_table for the columns we want to query:
a. one column containing the columns of the target table,
b. another containing the corresponding source columns if they exist, or NULL if they don't.
2. Once we have this table, we can build the desired SELECT statement: we use the name of the column if it's in the source table, and if it's NOT present, we put " NULL AS `column_name_in_target` " in the query instead.
This is expressed in the
coalesce(source_column, CONCAT("NULL AS `", target_column, "`"))
3. We aggregate all these pieces with STRING_AGG into the desired column selection.
4. Final step: put together the rest of the query ("SELECT" + <column_selection_string> + "FROM <your_source_table>" + ...) and EXECUTE IMMEDIATE it.
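The steps above can also be traced outside BigQuery. Below is a minimal Python sketch of the same string-building logic, assuming you already have the two lists of column names (for example from INFORMATION_SCHEMA.COLUMNS). The function name, the sample column names and the table path are hypothetical.

```python
def build_column_selection(target_columns, source_columns):
    """Mirror the coalesce + STRING_AGG step: keep a column if the source
    has it, otherwise emit a NULL placeholder with the target's name."""
    available = set(source_columns)
    pieces = []
    for col in target_columns:
        if col in available:
            pieces.append(col)
        else:
            pieces.append(f"NULL AS `{col}`")
    return ", \n".join(pieces)


# Hypothetical schemas: the source table is missing the "name" column.
selection = build_column_selection(["id", "name", "age"], ["id", "age"])
query = f"SELECT {selection} FROM `myproject.mydataset.source_table`"
print(query)
```

Unlike STRING_AGG without an ORDER BY, this keeps the columns in the target table's order.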

Hive - getting the column names count of a table

How can I get the number of columns in a Hive table using HQL? I know we can use describe tablename to get the names of the columns. How do we get the count?
create table mytable(i int,str string,dt date, ai array<int>,strct struct<k:int,j:int>);
select count(*)
from (select transform ('')
using 'hive -e "desc mytable"'
as col_name,data_type,comment
) t
;
5
Some additional playing around:
create table mytable (id int,first_name string,last_name string);
insert into mytable values (1,'Dudu',null);
select size(array(*)) from mytable limit 1;
This is not bulletproof, since not all combinations of column types can be combined into an array.
It also requires that the table contains at least 1 row.
Here is a more complex but also more robust solution (type-wise), which also requires that the table contains at least 1 row:
select size(str_to_map(val)) from (select transform (struct(*)) using 'sed -r "s/.(.*)./\1/"' as val from mytable) t;
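If you can run the desc command from a script anyway, counting the lines on the client side is another option. Here is a rough Python sketch that counts column rows in captured `hive -e "desc mytable"` output. The sample text is typed by hand to match the table above, and stopping at the first blank or '#' line is my assumption about how extra sections (such as partition information) are laid out, so check it against your own output.

```python
def count_columns(desc_output):
    """Count column rows in describe output, stopping at the first blank
    line or '#' section header (assumed layout, verify on your cluster)."""
    count = 0
    for line in desc_output.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            break
        count += 1
    return count


# Hand-written sample: tab-separated col_name, data_type, comment.
sample = (
    "i\tint\t\n"
    "str\tstring\t\n"
    "dt\tdate\t\n"
    "ai\tarray<int>\t\n"
    "strct\tstruct<k:int,j:int>\t\n"
)
print(count_columns(sample))
```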

HiveQL: Using query results as variables

In Hive I'd like to dynamically extract information from a table, save it in a variable and use it later. Consider the following example, where I retrieve the maximum of column var and want to use it as a condition in the subsequent query.
set maximo=select max(var) from table;
select
*
from
table
where
var=${hiveconf:maximo}
It does not work, although
set maximo=select max(var) from table;
${hiveconf:maximo}
shows me the intended result.
Doing:
select '${hiveconf:maximo}'
gives
"select max(var) from table"
though.
Best
Hive substitutes variables as is and does not execute them. Use a shell wrapper script to get the result into a variable and pass it to your Hive script.
maximo=$(hive -e "set hive.cli.print.header=false; select max(var) from table;")
hive -hiveconf "maximo"="$maximo" -f your_hive_script.hql
And after this inside your script you can use select '${hiveconf:maximo}'
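To see why the original attempt fails, it may help to emulate the substitution. The Python sketch below does plain text replacement of ${hiveconf:name}, which is how the behaviour is described above; the function is a hypothetical stand-in, not Hive's actual implementation, and the value 42 is invented.

```python
import re


def substitute_hiveconf(query, variables):
    """Replace each ${hiveconf:name} with the variable's raw text, unexecuted."""
    return re.sub(
        r"\$\{hiveconf:(\w+)\}",
        lambda m: variables[m.group(1)],
        query,
    )


template = "select * from table where var=${hiveconf:maximo}"

# The variable holds query text, so the text is pasted in verbatim.
unresolved = substitute_hiveconf(template, {"maximo": "select max(var) from table"})

# With the shell wrapper, the variable already holds the computed value.
resolved = substitute_hiveconf(template, {"maximo": "42"})

print(unresolved)
print(resolved)
```

The first result is not a valid query, while the second is, which is why the value has to be computed before Hive sees the script.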
@Hein du Plessis
Whilst it's not possible to do exactly what you want from Hue -- a constant source of frustration for me -- if you are restricted to Hue, and can't use a shell wrapper as suggested above, there are workarounds depending on the scenario.
When I once wanted to set a variable by selecting the max of a column in a table to use in a query, I got round it like this:
I first put the result into a table comprising two columns, with the (arbitrary word) 'MAX_KEY' in one column and the result of the max calculation in the other, like this:
drop table if exists tam_seg.tbl_stg_temp_max_id;
create table tam_seg.tbl_stg_temp_max_id as
select
'MAX_KEY' as max_key
, max(pvw_id) as max_id
from
tam_seg.tbl_dim_cc_phone_vs_web;
I then added the word 'MAX_KEY' to a sub-query then joined in the above table so I could use the result in the main query:
select
-- *** here is the joined in value from the table being used ***
cast(mxi.max_id + qry.temp_id as string) as pvw_id
, qry.cc_phone_vs_web
from
(
select
snp.cc_phone_vs_web
, row_number() over(order by snp.cc_phone_vs_web) as temp_id
-- *** here is the key being added to the sub-query ***
, 'MAX_KEY' as max_key
from
(
select distinct cc_phone_vs_web from tam_seg.tbl_stg_base_snapshots
) as snp
left outer join
tam_seg.tbl_dim_cc_phone_vs_web as pvw
on snp.cc_phone_vs_web = pvw.cc_phone_vs_web
where
pvw.cc_phone_vs_web is null
) as qry
-- *** here is the table with the select result in being joined in ***
left outer join
tam_seg.tbl_stg_temp_max_id as mxi
on qry.max_key = mxi.max_key
;
Not sure if this is your scenario but maybe it can be adapted. I'm 99% sure you can't just put a select statement directly into a variable in Hue though.
If I were doing something in Hue only, I would probably use the temporary table and join method. But if I were using a shell wrapper anyway, I would definitely do it there.
I hope this helps.
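For anyone who wants to play with the temporary table and join trick on a small scale, here is a stripped-down sketch using Python's built-in sqlite3 in place of Hive. The table and column names (dim, new_labels, tmp_max) and the data are invented, and the row numbers are stored in the table up front instead of being computed with row_number().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim (id INTEGER, label TEXT);
    INSERT INTO dim VALUES (1, 'phone'), (2, 'web');

    -- New labels that still need ids, with a precomputed running number.
    CREATE TABLE new_labels (temp_id INTEGER, label TEXT);
    INSERT INTO new_labels VALUES (1, 'chat'), (2, 'email');

    -- One-row table holding the "variable", tagged with a constant key.
    CREATE TABLE tmp_max AS
    SELECT 'MAX_KEY' AS max_key, MAX(id) AS max_id FROM dim;
    """
)

rows = conn.execute(
    """
    SELECT mxi.max_id + qry.temp_id AS id, qry.label
    FROM (
        SELECT temp_id, label, 'MAX_KEY' AS max_key FROM new_labels
    ) AS qry
    LEFT OUTER JOIN tmp_max AS mxi
        ON qry.max_key = mxi.max_key
    ORDER BY id
    """
).fetchall()
print(rows)
```

Every row of the sub-query carries the same constant key, so the join attaches the single max value to all of them.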

how to group by data from hive with specific partition?

I have the following:
hive>show partitions TABLENAME
pt=2012.07.28.08
pt=2012.07.28.09
pt=2012.07.28.10
pt=2012.07.28.11
hive> select pt,count(*) from TABLENAME group by pt;
OK
Why can't the group by get the data?
Check whether hive.mapred.mode is set to "strict"; if so, it will not allow the submitted query to scan all partitions. You can set it to nonstrict as below:
hive>set hive.mapred.mode=nonstrict;
I'm not sure whether this is what caused your query to return no results, but it is worth trying. Do share the results.
Note: You can check the default value for this parameter in hive-default.xml
You can always achieve the same using 2 select statements. For example:
Create table table1(
session_id string,
page_id string
)
partitioned by (metrics_date string);
Suppose we have loaded the table with 2 partitions:
hive>show partitions table1
metrics_date=2012.07.28.08
metrics_date=2012.07.28.09
select * from table1 ;
1212121212 google.com 2012.07.28.08
1212121212 google.com 2012.07.28.09
Getting number of rows per partition
select metrics_date,count(*) from (
select * from table1 ) temp
group by metrics_date;
To get the full result rows along with the per-group count, you can use the query below.
SELECT pt,count(*) OVER (PARTITION BY pt) FROM TABLENAME;
This can be achieved through partition by.
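The difference between the group by and the partition by versions is easy to miss, so here is a small Python emulation with made-up rows: group by yields one row per partition value, while count(*) over (partition by ...) keeps every input row and attaches the count to each.

```python
from collections import Counter

# Made-up rows of (pt, session_id).
rows = [
    ("2012.07.28.08", "s1"),
    ("2012.07.28.08", "s2"),
    ("2012.07.28.09", "s3"),
]

counts = Counter(pt for pt, _ in rows)

# select pt, count(*) ... group by pt: one output row per distinct pt.
grouped = sorted(counts.items())

# select pt, count(*) over (partition by pt): one output row per input row.
windowed = [(pt, counts[pt]) for pt, _ in rows]

print(grouped)
print(windowed)
```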