Creating a local variable in Hive

Let's assume I have a table (my_table) with four double columns A, B, C, and D. I'd like to create a new table that uses derived data from the existing table as such:
create table my_new_table as select A, B, C, D, A / (C + D), B / (C + D) from my_table;
Is there a way to define a local variable (e.g. my_var = C + D) that I could declare in the select statement and then use across the row, i.e.
create table my_new_table as select A, B, C, D, A / my_var, B / my_var from my_table;
Just wanted to know if this is feasible in Hive.

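Yes, that is feasible. Hive substitutes the variable as plain text before the query runs, so wrap each use in parentheses; otherwise A / C+D would be evaluated as (A / C) + D.
set my_var = C+D;
create table my_new_table as
select A, B, C, D, A / (${hiveconf:my_var}), B / (${hiveconf:my_var}) from my_table;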
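If text substitution is not wanted, a derived table is another way to name C + D once and reuse it across the row. The sketch below is only an illustration: it uses Python's sqlite3 module as a stand-in engine (an assumption, Hive is not involved), the sample row is made up, and the column names a_ratio and b_ratio are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for Hive here; only the query shape matters.
conn = sqlite3.connect(":memory:")
conn.execute("create table my_table (A real, B real, C real, D real)")
conn.execute("insert into my_table values (6.0, 3.0, 1.0, 2.0)")  # made-up row

# The inner select names C + D once; the outer select reuses it per row.
conn.execute("""
    create table my_new_table as
    select A, B, C, D, A / my_var as a_ratio, B / my_var as b_ratio
    from (select A, B, C, D, C + D as my_var from my_table) t
""")

rows = conn.execute("select * from my_new_table").fetchall()
print(rows)
```

The same select shape (a subquery that computes the shared expression) should carry over to Hive, but that is untested here.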

Related

How to use the IN clause forcing more columns into a tuple and then looking up that tuple among the result of a query returning only one column

I have two Postgresql tables table_j and table_r containing some fields in common:
a, b, c, d
I would like to extract from table_r a tuple of these fields where a condition on other fields is met, and since this condition can be met for records that have the same values on fields a, b, c, d, I would like to extract unique tuples.
select distinct(a, b, c, d) from table_r where date > '2022-05-01' and code='123456'
I would like to extract from table_j records that match values of fields a, b, c, d from the previous query.
I have tried to do it like so:
select a, b, c, d, e
from table_j
where (a, b, c, d) in
( select distinct(a, b, c, d) from table_r where date > '2022-05-01' and code='123456' )
order by a, b, c, d;
but it fails with:
ERROR: the subquery has too few columns
LINE 1: e from table_j where (a, b, c, d) in
^
So I bet this means that the result of the subquery is only one column, while the snippet
where (a, b, c, d) in
from the "outer" query is treating (a, b, c, d) as four columns, even though I have put them in parentheses.
So how can I turn those 4 columns in that snippet into only one column containing the tuple of the four fields?
distinct is not a function and always applies to all columns of the SELECT list. Enclosing one or more columns in parentheses won't change the result of the DISTINCT.
Your problem stems from the fact that in Postgres (a,b,c,d) in the SELECT list creates a single column whose type is an anonymous record. The subquery does not return four columns, just one, and so the comparison with the four columns on the left of the IN condition fails.
Additionally, in a subquery used for an IN condition the distinct is actually useless, because the IN operator only checks for the first match. It might even make things slower.
select a, b, c, d, e
from table_j
where (a, b, c, d) in (select distinct a, b, c, d
from table_r
where date > '2022-05-01'
and code='123456' )
order by a, b, c, d;
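The point that IN only needs one match can be checked with a correlated EXISTS, which behaves the same way as long as none of the compared columns are NULL. This is a minimal sketch, not the answer's own method: it runs on SQLite through Python's sqlite3 module rather than on Postgres, and all sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table_r (a int, b int, c int, d int, date text, code text);
    create table table_j (a int, b int, c int, d int, e text);
    -- two qualifying table_r rows share the tuple (1, 1, 1, 1)
    insert into table_r values
        (1, 1, 1, 1, '2022-06-01', '123456'),
        (1, 1, 1, 1, '2022-07-01', '123456'),
        (2, 2, 2, 2, '2022-04-01', '123456'),
        (3, 3, 3, 3, '2022-06-01', '999999');
    insert into table_j values
        (1, 1, 1, 1, 'x'), (1, 1, 1, 1, 'w'),
        (2, 2, 2, 2, 'y'), (3, 3, 3, 3, 'z');
""")

# Each table_j row is tested once; duplicates in table_r do not multiply it.
matches = conn.execute("""
    select j.a, j.b, j.c, j.d, j.e
    from table_j j
    where exists (select 1 from table_r r
                  where r.a = j.a and r.b = j.b and r.c = j.c and r.d = j.d
                    and r.date > '2022-05-01' and r.code = '123456')
    order by j.a, j.b, j.c, j.d, j.e
""").fetchall()
print(matches)
```

Each table_j row comes back at most once even though two table_r rows carry the same tuple, which is why no distinct is needed in the subquery.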

How to deduplicate rows without an additional table?

I have a table containing some duplicates e.g.
-- table definition: t(a,b,value)
select a, b
from t
group by a, b
having count(*) > 1;
I could do
create table x as
select a, b, min(value)
from t
group by a, b;
delete from t;
insert into t select * from x;
drop table x;
but this needs creating a table x which for huge tables becomes impractical.
Assuming you want to retain the tuple having the smallest value for a given a and b value, you may try:
DELETE
FROM yourTable
WHERE EXISTS (SELECT 1 FROM yourTable t
WHERE t.a = yourTable.a AND
t.b = yourTable.b AND
t.value < yourTable.value);
The above query might benefit from an index on (a, b, value). But if you don't already have this index, then suggesting it is something of a moot point, as building it on a huge table is itself expensive.
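To keep the smallest value per (a, b), the delete has to remove every row for which a smaller value exists in the same group. The sketch below checks that EXISTS form on SQLite through Python's sqlite3 module; the sample rows are made up. Note that rows tied on the smallest value are all kept, so exact duplicates survive this statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table yourTable (a int, b int, value int)")
conn.executemany(
    "insert into yourTable values (?, ?, ?)",
    [(1, 1, 10), (1, 1, 5), (1, 1, 7), (2, 2, 3), (2, 2, 3)],  # made-up rows
)

# Delete a row whenever another row in its (a, b) group has a smaller value.
conn.execute("""
    delete from yourTable
    where exists (select 1 from yourTable t
                  where t.a = yourTable.a
                    and t.b = yourTable.b
                    and t.value < yourTable.value)
""")

remaining = conn.execute(
    "select a, b, value from yourTable order by a, b, value"
).fetchall()
print(remaining)
```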
"but this needs creating a table x which for huge tables becomes impractical"
On the contrary, creating a new table with all the distinct rows is the preferred method for huge tables with a large number of duplicates.
Create the new table x with the exact same schema as t:
CREATE TABLE x(a ... REFERENCES ...., b ..., value ...);
Disable foreign key constraints checks to speed up the process:
PRAGMA foreign_keys = OFF;
Insert the distinct rows of t to x:
INSERT INTO x(a, b, value)
SELECT a, b, MIN(value)
FROM t
GROUP BY a, b;
Drop the table t:
DROP TABLE t;
Rename the table x as t:
ALTER TABLE x RENAME TO t;
Finally reenable foreign key constraints checks:
PRAGMA foreign_keys = ON;

An equivalent expression for MIN_BY in SQL?

In SQL we have the function MIN_BY(B,C), which returns the value of B at the minimum of C.
How would one get the same functionality, but without using the MIN_BY function?
i.e. given columns A,B,C, I want to group by A and return the value of B that corresponds to the minimum of C. I can see there must be some way to do it using OVER and PARTITION BY but am not well versed in enough to see how!
One method uses window functions:
select a, min(min_bc)
from (select t.*, min(b) over (partition by a order by c) as min_bc
from t
)
group by a;
Just to understand:
SETUP:
create table Test (a int, b int, c int);
insert into test values(1,2,3);
insert into test values(1,3,2);
insert into test values(1,4,1);
insert into test values(2,4,5);
insert into test values(2,8,4);
QUERY(min(b) for the case of multiple rows with minimum of c):
select a, min(b) from Test t
where c = (select min(c) from Test b where b.a = t.a)
group by a
RESULT:
A MIN(B)
1 4
2 8
RESULT of Gordon Linoff's query:
A MIN(MIN_BC)
1 2
2 4
Who's right, who's wrong, and why?
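One way to look at the difference: min(b) over (partition by a order by c) is a running minimum of b along increasing c, so its overall minimum is simply min(b) for the group, not the b found at the smallest c. A window function that does pick the b at the smallest c is first_value. The sketch below runs both on the Test data above, using SQLite (3.25 or later, for window functions) through Python's sqlite3 module; the column alias b_at_min_c is hypothetical, and when several rows tie on the smallest c, which b is returned is not determined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Test (a int, b int, c int);
    insert into Test values (1, 2, 3), (1, 3, 2), (1, 4, 1), (2, 4, 5), (2, 8, 4);
""")

# Running minimum of b in c order, then the minimum of that: equals min(b).
running_min = conn.execute("""
    select a, min(min_bc)
    from (select t.*, min(b) over (partition by a order by c) as min_bc
          from Test t)
    group by a
    order by a
""").fetchall()

# first_value takes b from the first row of each partition in c order.
first_val = conn.execute("""
    select distinct a,
           first_value(b) over (partition by a order by c) as b_at_min_c
    from Test
    order by a
""").fetchall()

print(running_min, first_val)
```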
If you don't want to use subquery or window functions, I wrote an alternative here (Works best for INT types as the comparator):
https://stackoverflow.com/a/75289042/1997873

How do I insert rows from another table and update at the same time?

Table A:
A|B|C|Version
1|2|3|1
1|2|3|2
I want table B to be
A|B|C|Version
1|2|3|2
Every row is the same as A except Version is increased by 1.
Let's say I only want to copy the rows from table A where version=1.
How do I do that?
Use INSERT INTO ... SELECT to copy the rows, with a WHERE clause so that only the wanted version is copied, and add 1 to Version in the select list:
INSERT INTO tableB (A, B, C, Version)
SELECT A, B, C, Version + 1 FROM tableA WHERE tableA.Version = 1;
INSERT INTO b( a, b, c, version )
SELECT a, b, c, version + 1
FROM a
WHERE version = 1
would work. Of course, since your WHERE clause limits you to just the rows where version = 1, you could just use a hard-coded 2 in your SELECT
INSERT INTO b( a, b, c, version )
SELECT a, b, c, 2
FROM a
WHERE version = 1
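A quick check of the version + 1 form against the two sample rows from the question, as a sketch run on SQLite through Python's sqlite3 module (the int column types are an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table a (a int, b int, c int, version int);
    create table b (a int, b int, c int, version int);
    insert into a values (1, 2, 3, 1), (1, 2, 3, 2);

    -- copy only the version 1 rows, bumping the version on the way
    insert into b (a, b, c, version)
    select a, b, c, version + 1
    from a
    where version = 1;
""")

copied = conn.execute("select a, b, c, version from b").fetchall()
print(copied)
```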

Excel like calculations in SQL

In Excel if I have say numbers 1, 2 and 3 in columns A, B and C. I can write a formula in column D "=A+B" and then a formula in column E "=D+C".
Basically, I can use the result of a calculated column in the same row.
Can I achieve something similar in SQL with a single query?
For example, something like
SELECT A, B, C, A+B as D, D+C as E
FROM TABLE1
Result: 1, 2, 3, 3, 6
You can use computed columns when creating the table:
CREATE TABLE tbl(id int, A int, B int, C int, D as A+B, E as A + B + C);
insert tbl(A, B, C) values (1, 2, 3)
Or repeat the expression in the query:
SELECT A, B, C, A+B as D, A+B+C as E
FROM TABLE1
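A third option, closer to the Excel behaviour asked about, is to compute D in a derived table so that the outer query can refer to it by name. This is a minimal sketch run on SQLite through Python's sqlite3 module; the int column types are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table TABLE1 (A int, B int, C int)")
conn.execute("insert into TABLE1 values (1, 2, 3)")

# D is defined in the inner select, so the outer select can use it for E.
result = conn.execute("""
    select A, B, C, D, D + C as E
    from (select A, B, C, A + B as D from TABLE1) t
""").fetchall()
print(result)
```

The row returned matches the 1, 2, 3, 3, 6 asked for in the question.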