SQL - Impala - How to unfold one categorical column into many?


I have the following table:
user  category  number
1     A         8
1     B         6
2     A         1
2     C         9
3     B         5
I want to "unfold" or "dummify" the category column, filling the new columns with the values of the "number" column, to obtain:
user  cat_A  cat_B  cat_C
1     8      6      0
2     1      0      9
3     0      5      0
Is it possible to achieve this in SQL (Impala)?
I found this question: "How to create dummy variable columns for thousands of categories in Google BigQuery?"
However, it seems a little complex, and I'd rather do it in Pandas.
Is there a simpler solution, knowing that I have 10 categories (A, B, C, D, etc.)?

You can use conditional aggregation: one SUM over a CASE expression per category.
SELECT user,
       SUM(CASE WHEN category = 'A' THEN number ELSE 0 END) cat_A,
       SUM(CASE WHEN category = 'B' THEN number ELSE 0 END) cat_B,
       SUM(CASE WHEN category = 'C' THEN number ELSE 0 END) cat_C
FROM T
GROUP BY user
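To check the query above against the sample rows from the question, here is a minimal sketch that runs it from Python. It assumes an in-memory SQLite database as a stand-in for Impala (the query is plain SQL, but Impala itself is not involved here), and the column name `user` is quoted only to stay clear of keyword clashes.

```python
import sqlite3

# In-memory SQLite is an assumption standing in for Impala.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE T ("user" INTEGER, category TEXT, number INTEGER)')
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [(1, "A", 8), (1, "B", 6), (2, "A", 1), (2, "C", 9), (3, "B", 5)],
)

# One SUM(CASE ...) per category; categories a user lacks contribute 0.
pivot_rows = conn.execute(
    """
    SELECT "user",
           SUM(CASE WHEN category = 'A' THEN number ELSE 0 END) AS cat_A,
           SUM(CASE WHEN category = 'B' THEN number ELSE 0 END) AS cat_B,
           SUM(CASE WHEN category = 'C' THEN number ELSE 0 END) AS cat_C
    FROM T
    GROUP BY "user"
    ORDER BY "user"
    """
).fetchall()

for row in pivot_rows:
    print(row)
```

With 10 categories the select list simply grows to 10 such expressions, one per category.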

Related

pgsql rotate table and generate dynamic columns based on values

I want to make a dynamic pivot on a pgsql table.
Original table:
time  strategy  pnl
1     a         100
2     a         200
3     a         300
1     b         1000
2     b         2000
1     c         22
Target table:
time  sum   a    b     c
1     1132  100  1000  32
2     2200  200  2000  0
3     22    0    0     22
The problem is that the strategy content is dynamic: I can sometimes have over 40 unique values (in this example there are only 3: a, b, c).
I have the following code. It looks like a good start, but there are some problems I cannot solve:
SELECT time,
       --sum(case when strategy='a' then pnl else 0 end) AS "a" ,
       --sum(case when strategy='b' then pnl else 0 end) AS "b" ,
       --sum(case when strategy='c' then pnl else 0 end) AS "c"
       --generate the content above (when using the code above the function works)
       (SELECT string_agg(clause, ',')
        FROM (SELECT format('sum(case when strategy=''%s'' then pnl else 0 end) AS "%s" ',
                            strategy, strategy) AS clause
              FROM (SELECT DISTINCT strategy FROM server_logs.logs where strategy != '' and subclass = 'pnl') s
              ORDER BY strategy) clauses)
FROM (
    select case when strategy is null then 'system' else strategy end as strategy,
           time,
           sum(case when value::float!=0 then 0::float else value::float end) as pnl
    FROM server_logs.logs
    where subclass='pnl'
    group by rollup(strategy), time
) as t
group by time
order by time desc
;
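The subquery in the attempt above returns the generated clauses as a single text value; a database does not execute a string that appears in the select list. One common workaround is to do it in two steps: read the distinct strategy values, build the query text, then run that text as a second statement. The sketch below illustrates the idea from Python against an in-memory SQLite database with the sample rows. The table name `logs`, the alias `total`, and the helper `build_pivot_sql` are hypothetical, and in PostgreSQL itself the same two-step idea would typically be done with dynamic SQL (for example `EXECUTE` in PL/pgSQL).

```python
import sqlite3


def build_pivot_sql(strategies):
    """Hypothetical helper: one SUM(CASE ...) column per strategy value."""
    clauses = []
    for name in strategies:
        literal = name.replace("'", "''")  # escape for a SQL string literal
        ident = name.replace('"', '""')    # escape for a quoted identifier
        clauses.append(
            f"SUM(CASE WHEN strategy = '{literal}' THEN pnl ELSE 0 END) AS \"{ident}\""
        )
    return (
        "SELECT time, SUM(pnl) AS total, "
        + ", ".join(clauses)
        + " FROM logs GROUP BY time ORDER BY time"
    )


# SQLite and the table name `logs` are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (time INTEGER, strategy TEXT, pnl INTEGER)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [(1, "a", 100), (2, "a", 200), (3, "a", 300),
     (1, "b", 1000), (2, "b", 2000), (1, "c", 22)],
)

# Step 1: discover the column set. Step 2: generate and run the pivot.
strategies = [
    row[0]
    for row in conn.execute("SELECT DISTINCT strategy FROM logs ORDER BY strategy")
]
pivot_sql = build_pivot_sql(strategies)
dynamic_rows = conn.execute(pivot_sql).fetchall()

print(pivot_sql)
for row in dynamic_rows:
    print(row)
```

The design choice is that the column list is decided before the pivot query is parsed, so it copes with 40 or more strategy values without editing the query by hand.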

Show two different sum columns based on a single column

Show two different sum columns based on another column.
For this table:
ID  Item  Quantity  Location
1   1     10        A
2   1     10        B
3   1     10        A
4   2     10        A
5   2     10        A
6   2     10        B
7   3     10        A
8   3     20        A
I need to see the total quantities for both location A and location B (to compare which is higher), but only for items that have a location B:
Expected result:
Item  Quantity A  Quantity B
1     20          10
2     20          10
I've been trying this but getting errors:
SELECT st.item, st.qty ALIAS(stqty),
(SELECT SUM(dc.qty)
FROM table dc
WHERE st.item = dc.item) ALIAS(dcqty))
FROM table st
WHERE location ='b'
I can do this easily with two queries obviously, but I was hoping for a way to do it in one.

You can use SUM with a CASE expression to do the pivot, then a HAVING clause to exclude items that have no total for location B.
Here is the fiddle:
https://www.db-fiddle.com/f/rS8fgvWoFxn879Utc2CKbu/0
select Item,
       sum(case when Location = 'A' then Quantity else 0 end) as quantity_a,
       sum(case when Location = 'B' then Quantity else 0 end) as quantity_b
from myTable
group by Item
having sum(case when Location = 'B' then Quantity else 0 end) > 0
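As a quick check of the HAVING filter, the sketch below runs the same kind of query on the sample rows from the question. It assumes an in-memory SQLite database, and the aliases `quantity_a` and `quantity_b` are illustrative names only.

```python
import sqlite3

# In-memory SQLite is an assumption; the question does not name its DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (ID INTEGER, Item INTEGER, Quantity INTEGER, Location TEXT)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [(1, 1, 10, "A"), (2, 1, 10, "B"), (3, 1, 10, "A"), (4, 2, 10, "A"),
     (5, 2, 10, "A"), (6, 2, 10, "B"), (7, 3, 10, "A"), (8, 3, 20, "A")],
)

# HAVING is evaluated per group, after aggregation, so it can drop
# whole items whose location-B total is zero.
location_rows = conn.execute(
    """
    SELECT Item,
           SUM(CASE WHEN Location = 'A' THEN Quantity ELSE 0 END) AS quantity_a,
           SUM(CASE WHEN Location = 'B' THEN Quantity ELSE 0 END) AS quantity_b
    FROM myTable
    GROUP BY Item
    HAVING SUM(CASE WHEN Location = 'B' THEN Quantity ELSE 0 END) > 0
    ORDER BY Item
    """
).fetchall()

for row in location_rows:
    print(row)
```

Item 3 has only location-A rows, so its group is removed by the HAVING clause.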

Presto SQL pivoting (for lack of a better word) data

I am working with some course data in a Presto database. The data in the table looks like:
student_id  period   score  completed
1           2016_Q1  3      Y
1           2016_Q3  4      Y
3           2017_Q1  4      Y
4           2018_Q1  2      N
I would like to format the data so that it looks like:
student_id  2018_Q1_score  2018_Q1_completed  2017_Q3_score
1           0              N                  5
3           4              Y                  4
4           2              N                  2
I know that I could do this by joining to the table for each time period, but I wanted to ask here to see if any gurus had a recommendation for a more scalable solution (e.g. perhaps not having to manually create a new join for each period). Any suggestions?

You can just use conditional aggregation:
select student_id,
       max(case when period = '2018_Q1' then score else 0 end) as score_2018q1,
       max(case when period = '2018_Q1' then completed else 'N' end) as completed_2018q1,
       max(case when period = '2017_Q3' then score else 0 end) as score_2017q3
from t
group by student_id
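The sketch below runs this conditional aggregation from Python on an in-memory SQLite database (an assumption; Presto is not involved). The first four rows come from the question. The last four are hypothetical rows, added only because the sample data contains no 2017_Q3 rows, and chosen so that the result lines up with the asker's expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (student_id INTEGER, period TEXT, score INTEGER, completed TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        # Rows from the question.
        (1, "2016_Q1", 3, "Y"), (1, "2016_Q3", 4, "Y"),
        (3, "2017_Q1", 4, "Y"), (4, "2018_Q1", 2, "N"),
        # Hypothetical rows, not in the question's sample data.
        (1, "2017_Q3", 5, "Y"), (3, "2018_Q1", 4, "Y"),
        (3, "2017_Q3", 4, "Y"), (4, "2017_Q3", 2, "N"),
    ],
)

# MAX works for the text flag because 'Y' sorts after the default 'N'.
period_rows = conn.execute(
    """
    SELECT student_id,
           MAX(CASE WHEN period = '2018_Q1' THEN score ELSE 0 END) AS score_2018q1,
           MAX(CASE WHEN period = '2018_Q1' THEN completed ELSE 'N' END) AS completed_2018q1,
           MAX(CASE WHEN period = '2017_Q3' THEN score ELSE 0 END) AS score_2017q3
    FROM t
    GROUP BY student_id
    ORDER BY student_id
    """
).fetchall()

for row in period_rows:
    print(row)
```

Each new period still needs its own expressions in the select list, but no extra join.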

SQL Server : how can I get difference between counts of total rows and those with only data

I have a table with data as shown below (the table is built every day with current date, but I left off that field for ease of reading).
This table keeps track of people and the doors they enter on a daily basis.
Table entrance_t:
id entrance entered
------------------------
1 a 0
1 b 0
1 c 0
1 d 0
2 a 1
2 b 0
2 c 0
2 d 0
3 a 0
3 b 1
3 c 1
3 d 1
My goal is to report on people and count the entrances they did not use (grouped by person), but ONLY for people who entered at least once (entered=1).
So, using the above table, I would like the results of the query to be:
id count
----------
2 3
3 1
(id=2 did not use 3 of the entrances and id=3 did not use 1)
I tried queries (some with inner joins on two instances of the same table) and I can get the entrances not used, but it's always for everybody, like this:
id count
----------
1 4
2 3
3 1
How do I not display results id=1 since they did not enter at all?
Thank you,

You could use conditional aggregation:
SELECT id, count(CASE WHEN entered = 0 THEN 1 END) AS cnt
FROM entrance_t
GROUP BY id
HAVING count(CASE WHEN entered = 1 THEN 1 END) > 0;
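To see why id=1 drops out, the sketch below runs the query above on the table from the question, using an in-memory SQLite database as an assumed stand-in for SQL Server. COUNT ignores NULLs, and a CASE without ELSE yields NULL for non-matching rows, so each COUNT only counts the rows its condition selects.

```python
import sqlite3

# In-memory SQLite is an assumption standing in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entrance_t (id INTEGER, entrance TEXT, entered INTEGER)")
conn.executemany(
    "INSERT INTO entrance_t VALUES (?, ?, ?)",
    [(1, "a", 0), (1, "b", 0), (1, "c", 0), (1, "d", 0),
     (2, "a", 1), (2, "b", 0), (2, "c", 0), (2, "d", 0),
     (3, "a", 0), (3, "b", 1), (3, "c", 1), (3, "d", 1)],
)

# SELECT counts unused entrances; HAVING keeps only ids with at least one entry.
unused_rows = conn.execute(
    """
    SELECT id, COUNT(CASE WHEN entered = 0 THEN 1 END) AS cnt
    FROM entrance_t
    GROUP BY id
    HAVING COUNT(CASE WHEN entered = 1 THEN 1 END) > 0
    ORDER BY id
    """
).fetchall()

for row in unused_rows:
    print(row)
```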

Inserting a new indicator column to tell if a given row maximizes another column in SQL

I currently have a table in SQL that looks like this
PRODUCT_ID_1  PRODUCT_ID_2  SCORE
1             2             10
1             3             100
1             10            3000
2             10            10
3             35            100
3             2             1001
That is, (PRODUCT_ID_1, PRODUCT_ID_2) is the primary key for this table.
What I would like to do is add a column to this table that tells whether or not the current row is the one that maximizes SCORE for its value of PRODUCT_ID_1.
In other words, what I would like to get is the following table:
PRODUCT_ID_1  PRODUCT_ID_2  SCORE  IS_MAX_SCORE_FOR_ID_1
1             2             10     0
1             3             100    0
1             10            3000   1
2             10            10     1
3             35            100    0
3             2             1001   1
I am wondering how I can compute the IS_MAX_SCORE_FOR_ID_1 column and insert it into the table without having to create a new table.

You can try it like this:
Select PRODUCT_ID_1, PRODUCT_ID_2, SCORE,
       (Case when b.Score =
                  (Select Max(a.Score) from TableName a where a.PRODUCT_ID_1 = b.PRODUCT_ID_1)
             then 1 else 0 End) as IS_MAX_SCORE_FOR_ID_1
from TableName b

You can use a window function for this:
select product_id_1,
       product_id_2,
       score,
       case
         when score = max(score) over (partition by product_id_1) then 1
         else 0
       end as is_max_score_for_id_1
from the_table
order by product_id_1;
(The above is ANSI SQL and should run on any modern DBMS)
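Both answers compute the flag in a SELECT. Since the question also asks about storing it in the existing table, here is a minimal sketch of one way to do that: add the column, then fill it with an UPDATE that uses a correlated subquery. It assumes an in-memory SQLite database and a hypothetical table name `product_scores`; the ALTER TABLE and UPDATE syntax may need adjusting for other DBMSs.

```python
import sqlite3

# SQLite and the table name `product_scores` are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_scores (
        product_id_1 INTEGER,
        product_id_2 INTEGER,
        score INTEGER,
        PRIMARY KEY (product_id_1, product_id_2)
    )
    """
)
conn.executemany(
    "INSERT INTO product_scores VALUES (?, ?, ?)",
    [(1, 2, 10), (1, 3, 100), (1, 10, 3000), (2, 10, 10), (3, 35, 100), (3, 2, 1001)],
)

# Add the indicator column in place, then fill it row by row: a row gets 1
# when its score equals the maximum score for its product_id_1.
conn.execute(
    "ALTER TABLE product_scores ADD COLUMN is_max_score_for_id_1 INTEGER DEFAULT 0"
)
conn.execute(
    """
    UPDATE product_scores
    SET is_max_score_for_id_1 = CASE
        WHEN score = (SELECT MAX(p.score)
                      FROM product_scores AS p
                      WHERE p.product_id_1 = product_scores.product_id_1)
        THEN 1 ELSE 0
    END
    """
)

flag_rows = conn.execute(
    """
    SELECT product_id_1, product_id_2, score, is_max_score_for_id_1
    FROM product_scores
    ORDER BY product_id_1, score
    """
).fetchall()

for row in flag_rows:
    print(row)
```

A stored flag goes stale when scores change, so it has to be recomputed after inserts or updates; the SELECT-based answers avoid that by computing it at query time.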
