How to reshape data after GROUPING SETS in Hive? - hive

I would like to aggregate a column over many different dimensions. I think GROUPING SETS would be appropriate for my problem, but I cannot figure out how to transform/reshape the resulting table from GROUPING SETS.
This is my query using GROUPING SETS:
select date, dim1, dim2, dim3, sum(value) as sum_value
from table
group by date, dim1, dim2, dim3
grouping sets ((date, dim1), (date, dim2), (date, dim3))
The query would result in a table like this:
date dim1 dim2 dim3 sum_value
2017-01-01 A NULL NULL [value_A]
2017-01-01 B NULL NULL [value_B]
2017-01-01 NULL C NULL [value_C]
2017-01-01 NULL D NULL [value_D]
2017-01-01 NULL NULL E [value_E]
2017-01-01 NULL NULL F [value_F]
But what I really need is a table like this:
date dim factor sum_value
2017-01-01 dim1 A [value_A]
2017-01-01 dim1 B [value_B]
2017-01-01 dim2 C [value_C]
2017-01-01 dim2 D [value_D]
2017-01-01 dim3 E [value_E]
2017-01-01 dim3 F [value_F]
The actual number of dimensions is far more than 3, so it wouldn't be a good idea to hard-code the query. Is there a way to reshape the table from grouping sets or other aggregation methods to get the desired table?
Thanks!

select `date`
,elt(log2(GROUPING__ID - 1),'dim1','dim2','dim3') as dim
,coalesce (dim1,dim2,dim3) as factor
,sum(value) as sum_value
from `table`
group by `date`,dim1,dim2,dim3
grouping sets ((`date`,dim1),(`date`,dim2),(`date`,dim3))
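To see why the elt(log2(GROUPING__ID - 1), ...) expression picks the right label, here is a small Python sketch of the arithmetic. It assumes GROUPING__ID is a bitmask in which bit i is set when the i-th GROUP BY column is part of the grouping set, with `date` as the lowest bit. As far as I know that matches Hive releases before 2.3.0; later releases number the sets differently, so check the values your version actually returns. The helper names (grouping_id, dim_label, factor) are made up for illustration.

```python
import math

# GROUP BY column order from the query; position i maps to bit i (assumed convention).
GROUP_BY_COLUMNS = ["date", "dim1", "dim2", "dim3"]
DIM_LABELS = ["dim1", "dim2", "dim3"]
GROUPING_SETS = [("date", "dim1"), ("date", "dim2"), ("date", "dim3")]


def grouping_id(grouping_set):
    """Bitmask with bit i set when GROUP BY column i belongs to the grouping set."""
    return sum(1 << i for i, col in enumerate(GROUP_BY_COLUMNS) if col in grouping_set)


def dim_label(gid):
    """Mirror elt(log2(GROUPING__ID - 1), 'dim1', 'dim2', 'dim3'); elt() is 1-based."""
    index = int(math.log2(gid - 1))  # subtracting 1 clears the `date` bit
    return DIM_LABELS[index - 1]


def factor(dim1, dim2, dim3):
    """Mirror coalesce(dim1, dim2, dim3): return the first non-NULL value."""
    return next((v for v in (dim1, dim2, dim3) if v is not None), None)


for grouping_set in GROUPING_SETS:
    gid = grouping_id(grouping_set)
    print(grouping_set, gid, dim_label(gid))
```

Under that assumption the three grouping sets get the ids 3, 5 and 9, so GROUPING__ID - 1 is 2, 4 or 8 and log2 yields the positions 1, 2 and 3. Note that coalesce cannot tell a genuine NULL in a dimension column apart from the NULL placeholders that GROUPING SETS produces.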

Related

SQL select 1 row out of several rows that have similar values

I have a table like this:
ID  OtherID  Date
----------------------
1   z        2022-09-19
1   b        2021-04-05
2   e        2022-04-05
3   t        2022-07-08
3   z        2021-03-02
I want a table like this:
ID  OtherID  Date
----------------------
1   z        2022-09-19
2   e        2022-04-05
3   t        2022-07-08
That is, for each ID I want only the ID-OtherID pair with the most recent Date.
The problem I have now is that the relationship between ID and OtherID is 1:M.
I've looked at SELECT DISTINCT, GROUP BY, LAG but I couldn't figure it out. I'm sorry if this is a duplicate question. I couldn't find the right keywords to search for the answer.
Update: I use Postgres but would like to know other SQL as well.
This works in many DBMSs (versions of Postgres, MySQL and others), but you may need to adapt it for other systems. You could use a CTE, a join, or a subquery such as this:
select id, otherid, date
from (
select id, otherid, date,
rank() over (partition by id order by date desc) as id_rank
from my_table
)z
where id_rank = 1
id  otherid  date
------------------------------------
1   z        2022-09-19T00:00:00.000Z
2   e        2022-04-05T00:00:00.000Z
3   t        2022-07-08T00:00:00.000Z
You can use a Common Table Expression (CTE) with ROW_NUMBER() to assign a row number based on the ID column (then return the first row for each ID in the WHERE clause rn = 1):
WITH cte AS
(SELECT ID,
OtherID,
Date,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Date DESC) AS rn
FROM sample_table)
SELECT ID,
OtherID,
Date
FROM cte
WHERE rn = 1;
Result:
ID  OtherID  Date
----------------------
1   z        2022-09-19
2   e        2022-04-05
3   t        2022-07-08
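If you want to try the "latest row per ID" pattern without a database server, here is a minimal sketch using Python's built-in sqlite3 module with the sample rows from the question. It assumes a SQLite build with window function support (3.25 or later); the table name my_table is taken from the first answer and the alias ranked is just illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, otherid TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, "z", "2022-09-19"),
        (1, "b", "2021-04-05"),
        (2, "e", "2022-04-05"),
        (3, "t", "2022-07-08"),
        (3, "z", "2021-03-02"),
    ],
)

# Number the rows of each id from newest to oldest, then keep only the newest.
latest = conn.execute(
    """
    SELECT id, otherid, date
    FROM (
        SELECT id, otherid, date,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rn
        FROM my_table
    ) AS ranked
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

print(latest)
```

One difference between the two answers: if two rows of the same ID share the most recent date, rank() returns both of them, while ROW_NUMBER() returns exactly one (an arbitrary one unless you add a tie-breaker to the ORDER BY).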

Select the first row in the last group of consecutive rows

How would I select the row that is the first occurrence in the last 'grouping' of consecutive rows, where a grouping is defined by the consecutive appearance of a particular column value (state in the example below)?
For example, given the following table:
id  datetime                    state       value_needed
--------------------------------------------------------
1   2021-04-01 09:42:41.319000  incomplete  A
2   2021-04-04 09:42:41.319000  done        B
3   2021-04-05 09:42:41.319000  incomplete  C
4   2021-04-05 10:42:41.319000  incomplete  C
5   2021-04-07 09:42:41.319000  done        D
6   2021-04-12 09:42:41.319000  done        E
I would want the row with id=5, as it is the first occurrence of state=done in the last (i.e. most recent) grouping of state=done.
Assuming all columns NOT NULL.
SELECT *
FROM tbl t1
WHERE NOT EXISTS (
SELECT FROM tbl t2
WHERE t2.state <> t1.state
AND t2.datetime > t1.datetime
)
ORDER BY datetime
LIMIT 1;
NOT EXISTS is only true for the last group of peers. (There is no later row with a different state.)
ORDER BY datetime and take the first. Voilà.
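For anyone who wants to check the NOT EXISTS logic locally, here is a small sketch with Python's sqlite3 module and the sample rows from the question. Two things differ from the Postgres query and are my adaptations: the subquery uses SELECT 1, because I would not rely on SQLite accepting an empty select list, and the timestamps are stored as text, which sorts correctly here only because they are all in the same fixed-width format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER, datetime TEXT, state TEXT, value_needed TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, "2021-04-01 09:42:41.319000", "incomplete", "A"),
        (2, "2021-04-04 09:42:41.319000", "done", "B"),
        (3, "2021-04-05 09:42:41.319000", "incomplete", "C"),
        (4, "2021-04-05 10:42:41.319000", "incomplete", "C"),
        (5, "2021-04-07 09:42:41.319000", "done", "D"),
        (6, "2021-04-12 09:42:41.319000", "done", "E"),
    ],
)

# Keep rows that have no later row with a different state (the last group),
# then take the earliest of them.
first_of_last_group = conn.execute(
    """
    SELECT id, state, value_needed
    FROM tbl AS t1
    WHERE NOT EXISTS (
        SELECT 1
        FROM tbl AS t2
        WHERE t2.state <> t1.state
          AND t2.datetime > t1.datetime
    )
    ORDER BY datetime
    LIMIT 1
    """
).fetchone()

print(first_of_last_group)
```

Only the rows with id 5 and 6 survive the NOT EXISTS filter, and ordering by datetime puts id 5 first.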
Here's a window function solution that accesses your table only once (which may or may not perform better for large data sets):
SELECT *
FROM (
SELECT *,
LEAD (state) OVER (ORDER BY datetime DESC)
IS DISTINCT FROM state AS first_in_group
FROM tbl
) t
WHERE first_in_group
ORDER BY datetime DESC
LIMIT 1
A dbfiddle based on Erwin Brandstetter's. To illustrate, here's the value of first_in_group for each row:
id datetime state value_needed first_in_group
---------------------------------------------------------------------
6 2021-04-12 09:42:41.319 done E f
5 2021-04-07 09:42:41.319 done D t
4 2021-04-05 10:42:41.319 incomplete C f
3 2021-04-05 09:42:41.319 incomplete C t
2 2021-04-04 09:42:41.319 done B t
1 2021-04-01 09:42:41.319 incomplete A t
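The first_in_group flags above can be reproduced with a short sqlite3 sketch in Python. This is an adaptation rather than the Postgres query itself: it uses SQLite's null-safe IS NOT operator in place of IS DISTINCT FROM (which older SQLite builds do not have), so the flag comes back as 1/0 instead of t/f. It assumes window function support (SQLite 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER, datetime TEXT, state TEXT, value_needed TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, "2021-04-01 09:42:41.319000", "incomplete", "A"),
        (2, "2021-04-04 09:42:41.319000", "done", "B"),
        (3, "2021-04-05 09:42:41.319000", "incomplete", "C"),
        (4, "2021-04-05 10:42:41.319000", "incomplete", "C"),
        (5, "2021-04-07 09:42:41.319000", "done", "D"),
        (6, "2021-04-12 09:42:41.319000", "done", "E"),
    ],
)

# With ORDER BY datetime DESC, LEAD(state) is the state of the next-older row.
# A row starts a group when that older state differs (or does not exist).
flags = conn.execute(
    """
    SELECT id, first_in_group
    FROM (
        SELECT *,
               (LEAD(state) OVER (ORDER BY datetime DESC)) IS NOT state
                   AS first_in_group
        FROM tbl
    ) AS t
    ORDER BY datetime DESC
    """
).fetchall()

# The newest row that starts a group is the first row of the last group.
picked_id = next(row_id for row_id, is_first in flags if is_first)

print(flags)
print(picked_id)
```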

How to pivot after grouping in sql

select day,type,sum(type)
from table1
group by 1,2
It returned something like this
day type count(type)
2021-04-13 a 10
2021-04-13 b 5
2021-04-14 c 2
But my desired result is as follows. I would like to pivot them. How can I transform them?
type  2021-04-13  2021-04-14
a     10          0
b     5           0
c     0           2
Thanks
You can use conditional aggregation. Presumably, you intend:
select type,
count(*) filter (where day = '2021-04-13') as cnt_20210413,
count(*) filter (where day = '2021-04-14') as cnt_20210414
from table1
group by type;
Note that your query has sum(type), but type is then shown with the values that are non-numeric strings. Hence, confusion. I am guessing you really intend count() and not sum().
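Here is a runnable sketch of the conditional aggregation idea using Python's sqlite3 module. The raw rows are made up so that the counts match the numbers in the question (10 rows of type a and 5 of type b on 2021-04-13, 2 of type c on 2021-04-14). It uses SUM(CASE ...) instead of FILTER so that it also runs on engines and older SQLite builds without FILTER support; the idea is the same, one output row per type and one counting column per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (day TEXT, type TEXT)")

# Hypothetical raw data, chosen to reproduce the counts shown in the question.
raw_rows = (
    [("2021-04-13", "a")] * 10
    + [("2021-04-13", "b")] * 5
    + [("2021-04-14", "c")] * 2
)
conn.executemany("INSERT INTO table1 VALUES (?, ?)", raw_rows)

# Each CASE contributes 1 only for rows of "its" day, so SUM counts per day.
pivot = conn.execute(
    """
    SELECT type,
           SUM(CASE WHEN day = '2021-04-13' THEN 1 ELSE 0 END) AS cnt_20210413,
           SUM(CASE WHEN day = '2021-04-14' THEN 1 ELSE 0 END) AS cnt_20210414
    FROM table1
    GROUP BY type
    ORDER BY type
    """
).fetchall()

print(pivot)
```

The day columns are still hard-coded. Plain SQL needs the output columns to be known when the query is written, so for an open-ended list of days you would have to generate the query text from the distinct day values first.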

SQL Group by only correlative rows

Say I have the following table:
Code A B C Date ID
------------------------------
50 1 1 A 2018-01-08 150001
50 1 1 A 2018-01-15 165454
50 1 1 B 2018-02-01 184545
50 1 1 A 2018-02-02 195487
I need the sql query to output the following:
Code A B C Min(Date) Min(ID)
-------------------------------
50 1 1 A 2018-01-08 150001
50 1 1 B 2018-02-01 184545
50 1 1 A 2018-02-02 195487
If I use a standard GROUP BY, rows 1, 2 and 4 are grouped into one row, which is not what I want.
I want to select the row with MIN(date) and MIN(id) from each run of adjacent duplicate records, where duplicates are defined by the columns Code, A, B and C.
In this case the first two rows are duplicates, so I want the MIN() row, while the 3rd and 4th rows are distinct.
Note that the database is Vertica 8.1, which is very similar to Oracle or PostgreSQL.
I think you would need the analytic function LAG(). Using this function, you can get the value of the previous row (or NULL if it's the first row itself). So you can check if the value on the previous row is different or not, and filter accordingly.
I'm not familiar with Vertica, but this should be the correct documentation for it: https://my.vertica.com/docs/7.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/LAGAnalytic.htm
Please try the query below, it should do it:
SELECT l.Code, l.A, l.B, l.C, l.Date, l.ID
FROM (SELECT t.*,
LAG(t.C, 1) OVER (PARTITION BY t.Code, t.A ORDER BY t.Date) prev_val
FROM table_1 t) l
WHERE l.C != l.prev_val
OR l.prev_val IS NULL
ORDER BY l.Code, l.A, l.Date
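I can't test on Vertica either, but the LAG() approach can be checked against the sample data with Python's sqlite3 module (assuming SQLite 3.25 or later for window functions). The query is the one above with only the aliases renamed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_1 (Code INTEGER, A INTEGER, B INTEGER, C TEXT, Date TEXT, ID INTEGER)"
)
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?, ?, ?, ?)",
    [
        (50, 1, 1, "A", "2018-01-08", 150001),
        (50, 1, 1, "A", "2018-01-15", 165454),
        (50, 1, 1, "B", "2018-02-01", 184545),
        (50, 1, 1, "A", "2018-02-02", 195487),
    ],
)

# Keep a row only when C differs from the previous row's C (or there is no
# previous row), i.e. when the row starts a new run of adjacent duplicates.
run_starts = conn.execute(
    """
    SELECT l.Code, l.A, l.B, l.C, l.Date, l.ID
    FROM (
        SELECT t.*,
               LAG(t.C, 1) OVER (PARTITION BY t.Code, t.A ORDER BY t.Date) AS prev_val
        FROM table_1 AS t
    ) AS l
    WHERE l.C != l.prev_val
       OR l.prev_val IS NULL
    ORDER BY l.Code, l.A, l.Date
    """
).fetchall()

for row in run_starts:
    print(row)
```

Because each kept row is the earliest of its run, its Date is the MIN(Date) of that run; its ID is the MIN(ID) only if IDs grow with the date, as they do in the sample. If B can vary within the same Code and A, you would probably want B in the PARTITION BY as well.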

SQLite: Divide value from one column based on criteria from other columns

I have a table in SQLite3 with the following structure:
Date Category Value
------------ -------------- -------------
20160101 A 5
20160101 B 3
20160102 A 4
20160102 B 2
20160103 A 7
20160103 B 3
20160104 A 8
20160104 B 1
My goal is to select values from the table so that for each date I divide the value of category A with the value of category B. I have exactly one value for each category for every date. I.e. the goal is to select two columns with these values:
Date NewValue(A/B)
------------ --------------
20160101 1.6667
20160102 2
20160103 2.3333
20160104 8
I have tried to solve this by creating a temporary table, but I get wrong values.
You can do this using conditional aggregation or a join:
select ta.date, ta.value / tb.value
from t ta join
t tb
on ta.date = tb.date and ta.category = 'A' and tb.category = 'B';
One caveat: SQLite does integer division. So, if the values are integers, you should use something like:
select ta.date, ta.value * 1.0 / tb.value
from t ta join t tb on ta.date = tb.date and ta.category = 'A' and tb.category = 'B';
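To see the integer division caveat in action, here is a small sketch with Python's sqlite3 module and the data from the question. It assumes the Value column holds integers and uses the table name t from the answer; the column aliases int_ratio and real_ratio are just illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, category TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("20160101", "A", 5), ("20160101", "B", 3),
        ("20160102", "A", 4), ("20160102", "B", 2),
        ("20160103", "A", 7), ("20160103", "B", 3),
        ("20160104", "A", 8), ("20160104", "B", 1),
    ],
)

# Self-join: ta supplies the category A row, tb the category B row of the same date.
ratios = conn.execute(
    """
    SELECT ta.date,
           ta.value / tb.value       AS int_ratio,   -- integer division
           ta.value * 1.0 / tb.value AS real_ratio   -- forced to floating point
    FROM t AS ta
    JOIN t AS tb
      ON ta.date = tb.date
     AND ta.category = 'A'
     AND tb.category = 'B'
    ORDER BY ta.date
    """
).fetchall()

for date, int_ratio, real_ratio in ratios:
    print(date, int_ratio, round(real_ratio, 4))
```

For 20160101 the plain division gives 1, while the version multiplied by 1.0 gives 1.6667 after rounding, which is the value the question asks for.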