Create a matrix with counts - hive sql

Is there a way to achieve this with hive? I need to count users per segment.
I have a table:
user1, categoryA
user1, categoryB
user2, categoryC
And the desired output would be:
| | Category A | Category B | Category C |
| --- | --- | --- | --- |
| Category A | 1 | 1 | 0 |
| Category B | 1 | 1 | 0 |
| Category C | 0 | 0 | 1 |

For a static set of categories, this is possible:
with your_data as(
select stack (6,
'user1', 'categoryA',
'user1', 'categoryB',
'user2', 'categoryC',
'user2', 'categoryC',
'user3', 'categoryA',
'user4', 'categoryA'
) as (`user`, category)
)
select
category, sum(catA) as CategoryA, sum(catB) as CategoryB, sum(catC) as CategoryC
from
(
select `user` , category, --each user counted once per category
max(case when category='categoryA' then 1 else 0 end) over (partition by `user`) as catA,
max(case when category='categoryB' then 1 else 0 end) over (partition by `user`) as catB,
max(case when category='categoryC' then 1 else 0 end) over (partition by `user`) as catC
from your_data
group by `user` , category
)s
group by Category
order by category
Result:
category categorya categoryb categoryc
categoryA 3 1 0
categoryB 1 1 0
categoryC 0 0 1
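
When the set of categories is not known in advance, one option is to compute the same counts in long form (one row per pair of categories) with a self-join and pivot them on the client side. The sketch below is only an illustration, not part of the answer: it uses Python's built-in sqlite3 module as a stand-in for Hive, and the column name user_id is an assumption chosen to avoid quoting `user`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_data (user_id TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO your_data VALUES (?, ?)",
    [
        ("user1", "categoryA"), ("user1", "categoryB"),
        ("user2", "categoryC"), ("user2", "categoryC"),
        ("user3", "categoryA"), ("user4", "categoryA"),
    ],
)

# Deduplicate (user, category) pairs, then join each user's categories with
# themselves: every joined row is one user shared by a pair of categories.
pairs = conn.execute(
    """
    WITH d AS (SELECT DISTINCT user_id, category FROM your_data)
    SELECT a.category, b.category, COUNT(*) AS users
    FROM d a JOIN d b ON a.user_id = b.user_id
    GROUP BY a.category, b.category
    ORDER BY a.category, b.category
    """
).fetchall()

# Pivot the long result into a matrix; a missing pair means zero shared users.
categories = sorted({row[0] for row in pairs})
counts = {(r, c): n for r, c, n in pairs}
matrix = [[counts.get((r, c), 0) for c in categories] for r in categories]

for name, row in zip(categories, matrix):
    print(name, *row)
```

For the sample data this reproduces the counts in the result table above. The self-join itself uses no window functions, so it should translate to Hive with little change.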

Related

Incremental Sum across different groups

I am trying to count every product at every date so that the count accumulates over time for each product.
This is a dummy table for illustration; the real data has millions of records with thousands of different products.
I have not been able to write a query that returns, for every date, the running count of each product together with the miles for that date.
CREATE TABLE Dummy_tab (
empid int,
date1_start date,
name_emp varchar(255),
product varchar(255),
miles varchar(20)
);
INSERT INTO Dummy_tab VALUES
(1, '2018-08-27', 'Eric', 'a',10),
(1, '2018-08-28', 'Eric','b',10),
(1, '2018-08-28', 'Eric','a',20),
(2, '2020-01-8', 'Jack','d',10),
(2, '2020-02-8', 'Jack','b',20),
(2, '2020-12-28', 'Jack','b',20),
(2, '2020-12-28', 'Jack','d',20),
(2,'2021-10-28', 'Jack','c',20),
(2, '2022-12-28', 'Jack','d',20),
(3, '2018-12-31', 'Jane','',10),
(3, '2018-12-31', 'Jane','',15);
My desired output is:
Id Date a b c d empty miles
1 2018-08-27 1 0 0 0 0 10
1 2018-08-28 2 1 0 0 0 20
2 2020-01-08 0 0 0 1 0 10
2 2020-02-08 0 1 0 1 0 20
2 2020-12-28 0 2 0 2 0 20
2 2021-10-28 0 2 1 2 0 20
2 2022-12-28 0 2 1 3 0 20
3 2018-12-31 0 0 0 0 1 10
3 2019-12-31 0 0 0 0 2 15
For example, Eric (ID = 1) has 3 entries: product a on 2018-08-27, product b on 2018-08-28, and product a on 2018-08-28.
So after the 1st entry, a = 1 for ID = 1. The 2nd entry is the sum of the previous and current counts, so now a = 2 for ID = 1, and b = 1 because there was no earlier entry for b.
Miles needs to be the maximum miles for that date from past dates.

You need to first (conditionally) aggregate your values here, and then you can do a cumulative SUM:
WITH Aggregates AS(
SELECT empid AS Id,
date1_start AS [Date],
COUNT(CASE product WHEN 'a' THEN 1 END) AS a,
COUNT(CASE product WHEN 'b' THEN 1 END) AS b,
COUNT(CASE product WHEN 'c' THEN 1 END) AS c,
COUNT(CASE product WHEN 'd' THEN 1 END) AS d,
COUNT(CASE product WHEN '' THEN 1 END) AS empty,
MAX(miles) AS miles
FROM dbo.Dummy_tab
GROUP BY empid, date1_start)
SELECT Id,
[Date],
SUM(a) OVER (PARTITION BY Id ORDER BY [Date]) AS a,
SUM(b) OVER (PARTITION BY Id ORDER BY [Date]) AS b,
SUM(c) OVER (PARTITION BY Id ORDER BY [Date]) AS c,
SUM(d) OVER (PARTITION BY Id ORDER BY [Date]) AS d,
SUM(empty) OVER (PARTITION BY Id ORDER BY [Date]) AS empty,
miles
FROM Aggregates
ORDER BY ID,
[Date];
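
To see the two steps (aggregate per date, then take a running sum) in action, here is a minimal sketch that runs the same query shape in SQLite through Python's sqlite3 module. It is an approximation of the T-SQL above, with some assumptions: SQLite 3.25 or later for window functions, only Eric's and Jack's rows are loaded, dates are stored as zero-padded ISO text so that text order equals date order, and miles is stored as INTEGER so that MAX compares numbers (the original table declares it as varchar, where MAX compares strings).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Dummy_tab "
    "(empid INTEGER, date1_start TEXT, product TEXT, miles INTEGER)"
)
conn.executemany(
    "INSERT INTO Dummy_tab VALUES (?, ?, ?, ?)",
    [
        (1, "2018-08-27", "a", 10), (1, "2018-08-28", "b", 10),
        (1, "2018-08-28", "a", 20), (2, "2020-01-08", "d", 10),
        (2, "2020-02-08", "b", 20), (2, "2020-12-28", "b", 20),
        (2, "2020-12-28", "d", 20), (2, "2021-10-28", "c", 20),
        (2, "2022-12-28", "d", 20),
    ],
)

rows = conn.execute(
    """
    WITH Aggregates AS (
        -- Step 1: one row per (empid, date) with per-product counts.
        SELECT empid, date1_start,
               COUNT(CASE product WHEN 'a' THEN 1 END) AS a,
               COUNT(CASE product WHEN 'b' THEN 1 END) AS b,
               COUNT(CASE product WHEN 'c' THEN 1 END) AS c,
               COUNT(CASE product WHEN 'd' THEN 1 END) AS d,
               MAX(miles) AS miles
        FROM Dummy_tab
        GROUP BY empid, date1_start
    )
    -- Step 2: running totals per employee, ordered by date.
    SELECT empid, date1_start,
           SUM(a) OVER (PARTITION BY empid ORDER BY date1_start) AS a,
           SUM(b) OVER (PARTITION BY empid ORDER BY date1_start) AS b,
           SUM(c) OVER (PARTITION BY empid ORDER BY date1_start) AS c,
           SUM(d) OVER (PARTITION BY empid ORDER BY date1_start) AS d,
           miles
    FROM Aggregates
    ORDER BY empid, date1_start
    """
).fetchall()

for row in rows:
    print(row)
```

With thousands of products the CASE columns would have to be generated, or the counts kept in long form (one row per employee, date and product) instead of one column per product.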

select from table where a=1 and a=2

I have four requirements (four separate SELECTs would be fine) where I need to find, from a single table, whether a customer has:
a. apple and samsung
b. no_apple and no_samsung
c. apple and no_samsung
d. no_apple and samsung
My table looks like this:
cust_name device
john apple
john samsung
dave apple
tim samsung
patrick nokia
rick nokia
The expected output looks like this:
a:- output ( both apple and samsung)
count(*)
1
b:-output (no_apple and no_samsung)
count(*)
2
c:-output (apple and no_samsung)
count(*)
1
d:-output (no_apple and samsung)
count(*)
1

You can do it all in a single query using conditional aggregation:
SELECT COUNT(CASE WHEN num_apple > 0 AND num_samsung > 0 THEN 1 END)
AS apple_and_samsung,
COUNT(CASE WHEN num_apple = 0 AND num_samsung > 0 THEN 1 END)
AS no_apple_and_samsung,
COUNT(CASE WHEN num_apple > 0 AND num_samsung = 0 THEN 1 END)
AS apple_and_no_samsung,
COUNT(CASE WHEN num_apple = 0 AND num_samsung = 0 THEN 1 END)
AS no_apple_and_no_samsung
FROM (
SELECT cust_name,
COUNT(CASE device WHEN 'apple' THEN 1 END) AS num_apple,
COUNT(CASE device WHEN 'samsung' THEN 1 END) AS num_samsung
FROM table_name
GROUP BY cust_name
)
Which, for the sample data:
CREATE TABLE table_name (cust_name, device) AS
SELECT 'john', 'apple' FROM DUAL UNION ALL
SELECT 'john', 'samsung' FROM DUAL UNION ALL
SELECT 'dave', 'apple' FROM DUAL UNION ALL
SELECT 'tim', 'samsung' FROM DUAL UNION ALL
SELECT 'patrick', 'nokia' FROM DUAL UNION ALL
SELECT 'rick', 'nokia' FROM DUAL;
Outputs:
| APPLE_AND_SAMSUNG | NO_APPLE_AND_SAMSUNG | APPLE_AND_NO_SAMSUNG | NO_APPLE_AND_NO_SAMSUNG |
| --- | --- | --- | --- |
| 1 | 1 | 1 | 2 |
You can also do it by PIVOTing twice:
SELECT *
FROM table_name
PIVOT (
COUNT(DISTINCT device) FOR device IN (
'apple' AS apple,
'samsung' AS samsung
)
)
PIVOT (
COUNT(cust_name) FOR (apple, samsung) IN (
(1, 1) AS apple_and_samsung,
(1, 0) AS apple_and_no_samsung,
(0, 1) AS no_apple_and_samsung,
(0, 0) AS no_apple_and_no_samsung
)
)
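
As a quick check of the conditional-aggregation query at the start of this answer, the sketch below runs it against the sample data in SQLite (via Python's sqlite3, used here only as a convenient stand-in for Oracle) and compares the result with a plain-Python count over sets of devices. The alias per_customer and the variable names are illustrative assumptions.

```python
import sqlite3

sample = [
    ("john", "apple"), ("john", "samsung"), ("dave", "apple"),
    ("tim", "samsung"), ("patrick", "nokia"), ("rick", "nokia"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (cust_name TEXT, device TEXT)")
conn.executemany("INSERT INTO table_name VALUES (?, ?)", sample)

# Inner query: one row per customer with device counts.
# Outer query: count the customers that fall into each of the four cases.
result = conn.execute(
    """
    SELECT COUNT(CASE WHEN num_apple > 0 AND num_samsung > 0 THEN 1 END),
           COUNT(CASE WHEN num_apple = 0 AND num_samsung > 0 THEN 1 END),
           COUNT(CASE WHEN num_apple > 0 AND num_samsung = 0 THEN 1 END),
           COUNT(CASE WHEN num_apple = 0 AND num_samsung = 0 THEN 1 END)
    FROM (
        SELECT cust_name,
               COUNT(CASE device WHEN 'apple' THEN 1 END) AS num_apple,
               COUNT(CASE device WHEN 'samsung' THEN 1 END) AS num_samsung
        FROM table_name
        GROUP BY cust_name
    ) per_customer
    """
).fetchone()

# Cross-check in plain Python: collect each customer's set of devices.
devices = {}
for name, device in sample:
    devices.setdefault(name, set()).add(device)

check = (
    sum("apple" in d and "samsung" in d for d in devices.values()),
    sum("apple" not in d and "samsung" in d for d in devices.values()),
    sum("apple" in d and "samsung" not in d for d in devices.values()),
    sum("apple" not in d and "samsung" not in d for d in devices.values()),
)

print(result, check)
```

Both tuples list the counts in the same column order as the query: both devices, samsung only, apple only, neither.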

You can add a suitable HAVING clause for each case after grouping by the cust_name column. The nested SUM(COUNT(...)) is Oracle syntax; it adds the per-customer groups up into a single total. MAX checks that a device is present at least once, and MIN checks that a device never appears:
a)
SELECT SUM(COUNT(DISTINCT cust_name))
FROM t
GROUP BY cust_name
HAVING MAX(CASE WHEN device ='apple' THEN 1 ELSE 0 END)
* MAX(CASE WHEN device ='samsung' THEN 1 ELSE 0 END) = 1;
b)
SELECT SUM(COUNT(DISTINCT cust_name))
FROM t
GROUP BY cust_name
HAVING MIN(CASE WHEN device ='apple' THEN 0 ELSE 1 END)
* MIN(CASE WHEN device ='samsung' THEN 0 ELSE 1 END) = 1;
c)
SELECT SUM(COUNT(DISTINCT cust_name))
FROM t
GROUP BY cust_name
HAVING MIN(CASE WHEN device ='samsung' THEN 0 ELSE 1 END)
* MAX(CASE WHEN device ='apple' THEN 1 ELSE 0 END) = 1;
d)
SELECT SUM(COUNT(DISTINCT cust_name))
FROM t
GROUP BY cust_name
HAVING MIN(CASE WHEN device ='apple' THEN 0 ELSE 1 END)
* MAX(CASE WHEN device ='samsung' THEN 1 ELSE 0 END) = 1;
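
Nested aggregates are not available in every database. A portable variant, sketched here as an assumption rather than as part of the answer, wraps the grouped query in a subquery and counts its rows. The example runs in SQLite through Python's sqlite3; the helper count_customers and the constants HAS_APPLE and HAS_SAMSUNG are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cust_name TEXT, device TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("john", "apple"), ("john", "samsung"), ("dave", "apple"),
        ("tim", "samsung"), ("patrick", "nokia"), ("rick", "nokia"),
    ],
)

# 1 if the customer has at least one row with the device, otherwise 0.
HAS_APPLE = "MAX(CASE WHEN device = 'apple' THEN 1 ELSE 0 END)"
HAS_SAMSUNG = "MAX(CASE WHEN device = 'samsung' THEN 1 ELSE 0 END)"


def count_customers(having: str) -> int:
    """Count the customer groups that satisfy the given HAVING condition."""
    sql = (
        "SELECT COUNT(*) FROM ("
        "SELECT cust_name FROM t GROUP BY cust_name HAVING " + having +
        ") g"
    )
    return conn.execute(sql).fetchone()[0]


counts = {
    "a": count_customers(f"{HAS_APPLE} = 1 AND {HAS_SAMSUNG} = 1"),
    "b": count_customers(f"{HAS_APPLE} = 0 AND {HAS_SAMSUNG} = 0"),
    "c": count_customers(f"{HAS_APPLE} = 1 AND {HAS_SAMSUNG} = 0"),
    "d": count_customers(f"{HAS_APPLE} = 0 AND {HAS_SAMSUNG} = 1"),
}
print(counts)
```

Unlike a SUM over zero groups, COUNT(*) over the subquery returns 0 rather than NULL when no customer matches.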

how do we have count of a specific values for multiple columns with table having a unique column

If I have a table like :
u_id A B C D
----------------------------------
jud 1 1 0 1
bud 0 0 1 0
cud 1 1 0 1
nud 0 0 1 0
dud 1 0 0 1
aud 0 1 1 0
fud 1 0 1 1
which sql is useful to get output like:
count 0 count 1
-----------------------
A 3 4
B 4 3
C 3 4
D 3 4
Doesn't matter row or columns just need count of a specific value count for multiple columns in a table.
Instead of 0's and 1's it can be specific string values as well as 'yes' or 'no'
Thank you

Use UNION ALL and aggregation. Assuming that the only possible values in the columns are 0 and 1:
SELECT 'A' col, COUNT(*) - SUM(A) count0, SUM(A) count1 FROM mytable
UNION ALL SELECT 'B', COUNT(*) - SUM(B), SUM(B) FROM mytable
UNION ALL SELECT 'C', COUNT(*) - SUM(C), SUM(C) FROM mytable
UNION ALL SELECT 'D', COUNT(*) - SUM(D), SUM(D) FROM mytable
Result for the sample data:
| col | count0 | count1 |
| --- | ------ | ------ |
| A | 3 | 4 |
| B | 4 | 3 |
| C | 3 | 4 |
| D | 3 | 4 |
If values other than 0/1 are possible, for example 'yes'/'no', change the SELECTs to conditional sums:
SELECT
'A' col,
SUM(CASE WHEN A = 'no' THEN 1 ELSE 0 END) count_no,
SUM(CASE WHEN A = 'yes' THEN 1 ELSE 0 END) count_yes
FROM mytable
GROUP BY col
UNION ALL SELECT
'B' col,
SUM(CASE WHEN B = 'no' THEN 1 ELSE 0 END),
SUM(CASE WHEN B = 'yes' THEN 1 ELSE 0 END)
FROM mytable
GROUP BY col
UNION ALL SELECT
'C' col,
SUM(CASE WHEN C = 'no' THEN 1 ELSE 0 END),
SUM(CASE WHEN C = 'yes' THEN 1 ELSE 0 END)
FROM mytable
GROUP BY col
UNION ALL SELECT
'D' col,
SUM(CASE WHEN D = 'no' THEN 1 ELSE 0 END),
SUM(CASE WHEN D = 'yes' THEN 1 ELSE 0 END)
FROM mytable
GROUP BY col
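
Since every branch of the UNION ALL has the same shape, the query can be generated from a list of column names. The following sketch assumes the 'yes'/'no' variant of the sample table (1 mapped to 'yes', 0 to 'no') and runs in SQLite via Python's sqlite3. The COLUMNS list is a fixed, trusted set of names; user input should never be interpolated into SQL this way.

```python
import sqlite3

COLUMNS = ["A", "B", "C", "D"]  # trusted column names, not user input

sample = {
    "jud": (1, 1, 0, 1), "bud": (0, 0, 1, 0), "cud": (1, 1, 0, 1),
    "nud": (0, 0, 1, 0), "dud": (1, 0, 0, 1), "aud": (0, 1, 1, 0),
    "fud": (1, 0, 1, 1),
}
to_text = {1: "yes", 0: "no"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (u_id TEXT, A TEXT, B TEXT, C TEXT, D TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [(u, *(to_text[v] for v in vals)) for u, vals in sample.items()],
)

# One SELECT per column, glued together with UNION ALL.
query = "\nUNION ALL\n".join(
    f"SELECT '{c}' AS col, "
    f"SUM(CASE WHEN {c} = 'no' THEN 1 ELSE 0 END) AS count_no, "
    f"SUM(CASE WHEN {c} = 'yes' THEN 1 ELSE 0 END) AS count_yes "
    f"FROM mytable"
    for c in COLUMNS
) + "\nORDER BY col"

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each branch scans the table once, so the cost grows with the number of columns.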

If you are okay with a single row, you can do:
select sum(a), sum(1-a), sum(b), sum(1-b), sum(c), sum(1-c), sum(d), sum(1-d)
from t;
The advantage of this approach is that t is read only once, which matters even more if t is a complex view.
With that in mind, you can unpivot this result:
select v.x,
(case when v.x = 'a' then a_0 end) as a_0,
(case when v.x = 'a' then a_1 end) as a_1,
(case when v.x = 'b' then b_0 end) as b_0,
(case when v.x = 'b' then b_1 end) as b_1,
(case when v.x = 'c' then c_0 end) as c_0,
(case when v.x = 'c' then c_1 end) as c_1,
(case when v.x = 'd' then d_0 end) as d_0,
(case when v.x = 'd' then d_1 end) as d_1
from (select sum(a) as a_1, sum(1-a) as a_0,
sum(b) as b_1, sum(1-b) as b_0,
sum(c) as c_1, sum(1-c) as c_0,
sum(d) as d_1, sum(1-d) as d_0
from t
) s cross join
(values ('a'), ('b'), ('c'), ('d')) v(x) -- may require a subquery
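
If the reshaping can happen in the application instead of in SQL, the single-scan query is enough on its own. Below is a minimal sketch under that assumption, using SQLite through Python's sqlite3 with the 0/1 sample data; the COLUMNS list and variable names are illustrative.

```python
import sqlite3

COLUMNS = ["a", "b", "c", "d"]  # trusted column names, not user input

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (u_id TEXT, a INTEGER, b INTEGER, c INTEGER, d INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        ("jud", 1, 1, 0, 1), ("bud", 0, 0, 1, 0), ("cud", 1, 1, 0, 1),
        ("nud", 0, 0, 1, 0), ("dud", 1, 0, 0, 1), ("aud", 0, 1, 1, 0),
        ("fud", 1, 0, 1, 1),
    ],
)

# One pass over t: for each column, the number of zeros and the number of ones.
select_list = ", ".join(f"SUM(1 - {c}), SUM({c})" for c in COLUMNS)
flat = conn.execute(f"SELECT {select_list} FROM t").fetchone()

# Unpivot on the client: flat is (a_0, a_1, b_0, b_1, ...).
counts = {c: (flat[2 * i], flat[2 * i + 1]) for i, c in enumerate(COLUMNS)}

for col, (zeros, ones) in counts.items():
    print(col, zeros, ones)
```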

You don't mention the database you're using, but in Oracle you can use DECODE and COUNT together to make this reasonably clean:
SELECT 'A' AS FIELD_NAME,
COUNT(DECODE(A, 0, 0, NULL)) AS ZERO_COUNT,
COUNT(DECODE(A, 0, NULL, A)) AS NON_ZERO_COUNT
FROM TEST_TABLE UNION ALL
SELECT 'B', COUNT(DECODE(B, 0, 0, NULL)),
COUNT(DECODE(B, 0, NULL, B))
FROM TEST_TABLE UNION ALL
SELECT 'C', COUNT(DECODE(C, 0, 0, NULL)),
COUNT(DECODE(C, 0, NULL, C))
FROM TEST_TABLE UNION ALL
SELECT 'D', COUNT(DECODE(D, 0, 0, NULL)),
COUNT(DECODE(D, 0, NULL, D))
FROM TEST_TABLE

how to achieve distinct records based on the priority

I have a table with data like below,
id code data1 data2 country
1 1 A NULL IND
1 1 B B NZ
1 1 CA
1 1 C Z WI
1 1 D S UK
2 2 NULL NULL IND
2 2 S NULL NZ
2 2 NULL K CA
2 2 T T WI
2 2 R K UK
3 3 NULL A WI
3 3 NULL a UK
The record is populated based on the priority of the country field. The priority order is IND, NZ, CA, WI, UK.
If data1 or data2 is NULL or blank, the value is taken from the next-priority record.
So my expected result is:
target table:
id code data1 data2 country
1 1 A B IND
2 2 S K IND
3 3 NULL A WI
Can anyone help me with a query to achieve the above result set?
I have added a few more rows to make the requirement clearer.

Hive has the first_value() function, which can be used for this purpose:
select distinct id, code,
first_value(data1) over (partition by id, code
order by (case when trim(data1) <> '' then 1 else 2 end), -- NULL and blank values sort last
(case country when 'IND' then 1 when 'NZ' then 2 when 'CA' then 3 when 'WI' then 4 when 'UK' then 5 else 6 end)
) as data1,
first_value(data2) over (partition by id, code
order by (case when trim(data2) <> '' then 1 else 2 end), -- NULL and blank values sort last
(case country when 'IND' then 1 when 'NZ' then 2 when 'CA' then 3 when 'WI' then 4 when 'UK' then 5 else 6 end)
) as data2,
first_value(country) over (partition by id, code
order by (case country when 'IND' then 1 when 'NZ' then 2 when 'CA' then 3 when 'WI' then 4 when 'UK' then 5 else 6 end) -- country follows the priority order only
) as country
from t;
I am not a big fan of select distinct with window functions. In this case, it seems like the simplest solution.

Use a CASE expression to compute the priorities and apply first_value over them.
select id, max(code), max(data1), max(data2), max(country)
from (
select
id,
code,
first_value(data1) over (partition by id
order by case when data1 is null or data1 = '' then 1 else 0 end * 10 + priority) data1,
first_value(data2) over (partition by id
order by case when data2 is null or data2 = '' then 1 else 0 end * 10 + priority) data2,
first_value(country) over (partition by id
order by case when country is null or country = '' then 1 else 0 end * 10 + priority) country
from (
select
t.*,
case country
when 'IND' then 1
when 'NZ' then 2
when 'CA' then 3
when 'WI' then 4
when 'UK' then 5
end priority
from your_table t
) t
) t group by id;
Produces:
ID MAX(CODE) MAX(DATA1) MAX(DATA2) MAX(COUNTRY)
1 1 A B IND
2 2 S K IND
3 3 NULL A WI
EDIT:
Alternatively, you can use the FIELD function (available in Hive and MySQL) to produce the priorities, as suggested by @Dudu in the comments below:
field(country,'IND','NZ','CA','WI','UK')
See:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Another approach based on MIN of STRUCT.
For the order I'm using the function field ( field(country,'IND','NZ','CA','WI','UK')).
Since it was missing, I have added it to the documentation.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
select id
,min (code) as code
,min (case when coalesce(trim(data1),'') <> '' then struct(field(country,'IND','NZ','CA','WI','UK'),data1) end).col2 as data1
,min (case when coalesce(trim(data2),'') <> '' then struct(field(country,'IND','NZ','CA','WI','UK'),data2) end).col2 as data2
,min (struct(field(country,'IND','NZ','CA','WI','UK'),country)).col2 as country
from mytable
group by id
order by id
;
Demo
create table mytable
(
id int
,code int
,data1 string
,data2 string
,country string
);
insert into mytable values
(1 ,1 ,'A' ,NULL ,'IND')
,(1 ,1 ,'B' ,'B' ,'NZ' )
,(1 ,1 ,'' ,'' ,'CA' )
,(1 ,1 ,'C' ,'Z' ,'WI' )
,(1 ,1 ,'D' ,'S' ,'UK' )
,(2 ,2 ,NULL ,NULL ,'IND')
,(2 ,2 ,'S' ,NULL ,'NZ' )
,(2 ,2 ,NULL ,'K' ,'CA' )
,(2 ,2 ,'T' ,'T' ,'WI' )
,(2 ,2 ,'R' ,'K' ,'UK' )
,(3 ,3 ,NULL ,'A' ,'WI' )
,(3 ,3 ,NULL ,'a' ,'UK' )
;
select id
,min (code) as code
,min (case when coalesce(trim(data1),'') <> '' then struct(field(country,'IND','NZ','CA','WI','UK'),data1) end).col2 as data1
,min (case when coalesce(trim(data2),'') <> '' then struct(field(country,'IND','NZ','CA','WI','UK'),data2) end).col2 as data2
,min (struct(field(country,'IND','NZ','CA','WI','UK'),country)).col2 as country
from mytable
group by id
order by id
;
+----+------+-------+-------+---------+
| id | code | data1 | data2 | country |
+----+------+-------+-------+---------+
| 1 | 1 | A | B | IND |
| 2 | 2 | S | K | IND |
| 3 | 3 | NULL | A | WI |
+----+------+-------+-------+---------+
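
The priority-based fill used in the answers above can be tried without a Hive cluster. The sketch below is an approximation in SQLite (3.25 or later, through Python's sqlite3): since SQLite has no field() function, the priority comes from a CASE expression, and the column name prio and the alias p are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable "
    "(id INTEGER, code INTEGER, data1 TEXT, data2 TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "A", None, "IND"), (1, 1, "B", "B", "NZ"),
        (1, 1, "", "", "CA"), (1, 1, "C", "Z", "WI"),
        (1, 1, "D", "S", "UK"), (2, 2, None, None, "IND"),
        (2, 2, "S", None, "NZ"), (2, 2, None, "K", "CA"),
        (2, 2, "T", "T", "WI"), (2, 2, "R", "K", "UK"),
        (3, 3, None, "A", "WI"), (3, 3, None, "a", "UK"),
    ],
)

rows = conn.execute(
    """
    SELECT DISTINCT id, code,
        -- Non-blank values sort first (0), then by country priority.
        first_value(data1) OVER (
            PARTITION BY id
            ORDER BY (coalesce(trim(data1), '') = ''), prio) AS data1,
        first_value(data2) OVER (
            PARTITION BY id
            ORDER BY (coalesce(trim(data2), '') = ''), prio) AS data2,
        -- The country itself follows the priority order only.
        first_value(country) OVER (PARTITION BY id ORDER BY prio) AS country
    FROM (
        SELECT m.*,
               CASE country WHEN 'IND' THEN 1 WHEN 'NZ' THEN 2
                            WHEN 'CA' THEN 3 WHEN 'WI' THEN 4
                            WHEN 'UK' THEN 5 ELSE 6 END AS prio
        FROM mytable m
    ) p
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

For id 3 every data1 value is NULL, so the fill has nothing to pick and the column stays NULL, matching the expected result in the question.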

SQL Server: Rows to Columns with Case

I have a table with 2 columns:
CREATE TABLE Prop_Cl
(
Id int,
ClId int
);
INSERT INTO Prop_Cl
(Id, ClId)
VALUES
(1, 1111111),
(1, 1111112),
(1, 1111113),
(2, 2222221),
(3, 3333331),
(3, 3333332);
ID CLID
1 1111111
1 1111112
1 1111113
2 2222221
3 3333331
3 3333332
I'm trying to show this table this way:
ID CLIENT 1 CLIENT 2 CLIENT 3 CLIENT 4
1 1111111 1111112 1111113 0
2 2222221 0 0 0
3 3333331 3333332 0 0
with this statement:
SELECT p.Id,
CASE WHEN (ROW_NUMBER() OVER(PARTITION BY p.Id ORDER BY p.Id)) = 1 THEN p.ClId ELSE 0 END AS 'Client 1',
CASE WHEN (ROW_NUMBER() OVER(PARTITION BY p.Id ORDER BY p.Id)) = 2 THEN p.ClId ELSE 0 END AS 'Client 2',
CASE WHEN (ROW_NUMBER() OVER(PARTITION BY p.Id ORDER BY p.Id)) = 3 THEN p.ClId ELSE 0 END AS 'Client 3',
CASE WHEN (ROW_NUMBER() OVER(PARTITION BY p.Id ORDER BY p.Id)) = 4 THEN p.ClId ELSE 0 END AS 'Client 4'
FROM Prop_Cl p
But I get this result:
ID CLIENT 1 CLIENT 2 CLIENT 3 CLIENT 4
1 1111111 0 0 0
1 0 1111112 0 0
1 0 0 1111113 0
2 2222221 0 0 0
3 3333331 0 0 0
3 0 3333332 0 0
I can't use the PIVOT function because of my SQL Server setup.
There are maximum 4 clients in each ID.
Any ideas?

I would change the syntax slightly to use an aggregate function and a subquery similar to:
select id,
max(case when seq = 1 then ClId else 0 end) Client1,
max(case when seq = 2 then ClId else 0 end) Client2,
max(case when seq = 3 then ClId else 0 end) Client3,
max(case when seq = 4 then ClId else 0 end) Client4
from
(
select Id, ClId,
ROW_NUMBER() OVER(PARTITION BY Id ORDER BY ClId) seq -- order by ClId so the client columns are assigned deterministically
from Prop_Cl
) s
group by id;
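
The reason the aggregate helps is that GROUP BY collapses the one-row-per-client output into a single row per Id, and MAX picks the only non-zero value in each column. The sketch below shows this in SQLite (3.25 or later) through Python's sqlite3 rather than SQL Server. It generates the MAX(CASE ...) columns from a constant, MAX_CLIENTS, which is a hypothetical name reflecting the stated limit of 4 clients per Id, and it numbers rows by ClId so that the column assignment is deterministic.

```python
import sqlite3

MAX_CLIENTS = 4  # the question states at most 4 clients per Id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Prop_Cl (Id INTEGER, ClId INTEGER)")
conn.executemany(
    "INSERT INTO Prop_Cl VALUES (?, ?)",
    [
        (1, 1111111), (1, 1111112), (1, 1111113),
        (2, 2222221), (3, 3333331), (3, 3333332),
    ],
)

# One MAX(CASE ...) column per client slot.
slots = ",\n       ".join(
    f"MAX(CASE WHEN seq = {n} THEN ClId ELSE 0 END) AS Client{n}"
    for n in range(1, MAX_CLIENTS + 1)
)

query = f"""
SELECT Id,
       {slots}
FROM (
    -- Number each client within its Id.
    SELECT Id, ClId,
           ROW_NUMBER() OVER (PARTITION BY Id ORDER BY ClId) AS seq
    FROM Prop_Cl
) s
GROUP BY Id
ORDER BY Id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```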
