union grouped result sets in Hive

union grouped result sets in Hive - hive

I need to break a Hive query grouped by an ID column over the quarters of calendar year 2018. Below is how I am currently going about it I would like another option to achieve the same result with fewer queries.
--Query 1 quarter 1 2018 plus three identical queries for Q2,Q3,Q4
Create TABLE Q12018 stored as ORC as
select
ID,
count(1) as cnt,
sum(revenue) as revenue,
sum( (CASE
WHEN condition1
THEN 1
ELSE 0 END)) as metric1,
sum( (CASE
WHEN condition2
THEN revenue
ELSE 0 END)) as metric2,
sum( (CASE
WHEN condition3
THEN 1
ELSE 0 END)) as metric3,
sum( (CASE
WHEN codition4
THEN revenue
ELSE 0 END)) as metric4
from mainTable
where month between 201801 and 201803
group by
ID;
--Query 2
Create TABLE combined2018 stored as ORC as
select * from Q12018
union all
select * from Q22018
union all
select * from Q32018
union all
select * from Q42018 ;
--Query 3
Create TABLE Agg2018 stored as ORC as
Select
ID,
Sum(cnt),
Sum(revenue),
Sum(metric1),
Sum(metric2),
sum(metric3),
sum(metric4)
from combined2018
group by ID

Seems like at the end you are aggregating all the quarterly results, grouped by ID.If the end result is the aggregation of the quarterly results then change the where clause to include the entire year range to achieve the same end result.
select
ID,
count(1) as cnt,
sum(revenue) as revenue,
sum((CASE WHEN condition1 THEN 1 ELSE 0 END)) as metric1,
sum((CASE WHEN condition2 THEN revenue ELSE 0 END)) as metric2,
sum((CASE WHEN condition3 THEN 1 ELSE 0 END)) as metric3,
sum((CASE WHEN condition4 THEN revenue ELSE 0 END)) as metric4
from mainTable
where month between 201801 and 201812
group by ID;

Related

SQL query several columns about the same field but diferent codes

How can I create a query that has multiple counter columns for the same field?
I have a field called card_status that can have 7 different values.
I wanted to create a query that would display total values on the same row and not on 7 different rows.

SELECT SUM(CASE WHEN card_status = 1 THEN 1 ELSE 0 END) as Count_of_1,
SUM(CASE WHEN card_status = 2 THEN 1 ELSE 0 END) as Count_of_2,
...
SUM(CASE WHEN card_status = 7 THEN 1 ELSE 0 END) as Count_of_7
FROM your_table;

You could use a conditional count
For example:
SELECT col1, col2
, COUNT(CASE WHEN card_status = 'revoked' THEN card_status END) AS TotalRevoked
, COUNT(CASE WHEN card_status = 'requested' THEN card_status END) AS TotalRequested
, COUNT(CASE WHEN card_status = 'lost' THEN card_status END) AS TotalLost
-- add more
, COUNT(*) AS Total
FROM YourTable t
GROUP BY col1, col2
ORDER BY col1, col2
This works on the principle that counting a column or expression doesn't count the NULL's

Transpose SQL row headers to first column

What is the shortest, fastest, easiest way to transpose the results of the query below? I want the 0-3 and 3-6 to show in the first column. Sorry but this is one of those things that will boggle me for days if I don't reach out. Thanks in advance.
SELECT SUM (CASE WHEN CMAPR BETWEEN 0 AND 3 THEN 1 ELSE 0 END) AS [0-3],
SUM (CASE WHEN CMAPR BETWEEN 3.01 AND 6 THEN 1 ELSE 0 END) AS [3-6]
FROM TBL
Current Results:

Use aggregation:
SELECT (CASE WHEN CMAPR >= 0 AND CMAPR <=3 THEN '0-3'
WHEN CMAPR <= 6 THEN '3-6'
ELSE 'Other'
END) AS grp, COUNT(*)
FROM tbl
GROUP BY (CASE WHEN CMAPR >= 0 AND CMAPR <=3 THEN '0-3'
WHEN CMAPR <= 6 THEN '3-6'
ELSE 'Other'
END);
The only downside is that this will not return a group if it has no rows.
You can simply unpivot your query by doing:
SELECT v.*
FROM (SELECT SUM (CASE WHEN CMAPR BETWEEN 0 AND 3 THEN 1 ELSE 0 END) AS [0-3],
SUM (CASE WHEN CMAPR BETWEEN 3.01 AND 6 THEN 1 ELSE 0 END) AS [3-6]
FROM TBL
) x CROSS APPLY
(VALUES ('[0-3]', [0-3]), ('[3-6]', [3-6])) v(which, val);

Comparing values in two columns and returning values on conditional bases using sql hive

I have two columns, I want to get an output based on a comparative basis of both. My data is somewhat like:
Customer Id status
100 A
100 B
101 B
102 A
103 A
103 B
So a customer can have a status A or B or both, I have to segrerate them on customer id basis for a status. If status A and B then return happy, if only A, return Avg and if only B return Sad.

try the below query,
SELECT DISTINCT Customer_Id,
(CASE WHEN COUNT(*) OVER(PARTITION BY Customer_Id) > 1 THEN 'happy'
WHEN Tstatus = 'A' THEN 'Avg'
ELSE 'Sad'END) AS New_Status
FROM #table1
GROUP BY Customer_Id,Tstatus

if Customer Id and status is a unique combination then
STEP 1: use case to determine a or b
SELECT customer id
,CASE WHEN avg(case when [status] ='A' then 0 else 2 end)
FROM [Your Table]
group by[customer id]
and step 2 will be casing avg into result: like this
SELECT customer id
,CASE WHEN (avg(case when [status] ='A' then 0 else 2 end)) = 1 THEN 'happy' ELSE WHEN (avg(case when [status] ='A' then 0 else 2 end)) = 0 THEN 'Avg' ELSE 'Sad' END
FROM [Your Table]
group by[customer id]

I would do this simply as:
select customer_id,
(case when min(status) <> max(status) then 'happy'
when min(status) = 'A' then 'avg'
else 'sad'
end)
from t
where status in ('A', 'B')
group by customer_id

Tuning oracle subquery in select statement

I have a master table and a reference table as below.
WITH MAS as (
SELECT 10 as CUSTOMER_ID, 1 PROCESS_ID, 44 PROCESS_TYPE, 200 as AMOUNT FROM DUAL UNION ALL
SELECT 10 as CUSTOMER_ID, 1 PROCESS_ID, 44 PROCESS_TYPE, 250 as AMOUNT FROM DUAL UNION ALL
SELECT 10 as CUSTOMER_ID, 2 PROCESS_ID, 45 PROCESS_TYPE, 300 as AMOUNT FROM DUAL UNION ALL
SELECT 10 as CUSTOMER_ID, 2 PROCESS_ID, 45 PROCESS_TYPE, 350 as AMOUNT FROM DUAL
), REFTAB as (
SELECT 44 PROCESS_TYPE, 'A' GROUP_ID FROM DUAL UNION ALL
SELECT 44 PROCESS_TYPE, 'B' GROUP_ID FROM DUAL UNION ALL
SELECT 45 PROCESS_TYPE, 'C' GROUP_ID FROM DUAL UNION ALL
SELECT 45 PROCESS_TYPE, 'D' GROUP_ID FROM DUAL
) SELECT ...
My first select statement which works correctly is this one:
SELECT CUSTOMER_ID,
SUM(AMOUNT) as AMOUNT1,
SUM(CASE WHEN PROCESS_TYPE IN (SELECT PROCESS_TYPE FROM REFTAB WHERE GROUP_ID = 'A')
THEN AMOUNT ELSE NULL END) as AMOUNT2,
COUNT(CASE WHEN PROCESS_TYPE IN (SELECT PROCESS_TYPE FROM REFTAB WHERE GROUP_ID = 'D')
THEN 1 ELSE NULL END) as COUNT1
FROM MAS
GROUP BY CUSTOMER_ID
However, to address a performance issue, I changed it to this select statement:
SELECT CUSTOMER_ID,
SUM(AMOUNT) as AMOUNT1,
SUM(CASE WHEN GROUP_ID = 'A' THEN AMOUNT ELSE NULL END) as AMOUNT2,
COUNT(CASE WHEN GROUP_ID = 'D' THEN 1 ELSE NULL END) as COUNT1
FROM MAS A
LEFT JOIN REFTAB B ON A.PROCESS_TYPE = B.PROCESS_TYPE
GROUP BY CUSTOMER_ID
For the AMOUNT2 and COUNT1 columns, the values stay the same. But for AMOUNT1, the value is multiplied because of the join with the reference table.
I know I can add 1 more left join with an additional join condition on GROUP_ID. But that won't be any different from using a subquery.
Any idea how to make the query work with just 1 left join while not multiplying the AMOUNT1 value?

I know I can add 1 more left join with adding aditional GROUP_ID clause but it wont be different from subquery.
You'd be surprised. Having 2 left joins instead of subqueries in the SELECT gives the optimizer more ways of optimizing the query. I would still try it:
select m.customer_id,
sum(m.amount) as amount1,
sum(case when grpA.group_id is not null then m.amount end) as amount2,
count(grpD.group_id) as count1
from mas m
left join reftab grpA
on grpA.process_type = m.process_type
and grpA.group_id = 'A'
left join reftab grpD
on grpD.process_type = m.process_type
and grpD.group_id = 'D'
group by m.customer_id
You can also try this query, which uses the SUM() analytic function to calculate the amount1 value before the join to avoid the duplicate value problem:
select m.customer_id,
m.customer_sum as amount1,
sum(case when r.group_id = 'A' then m.amount end) as amount2,
count(case when r.group_id = 'D' then 'X' end) as count1
from (select customer_id,
process_type,
amount,
sum(amount) over (partition by customer_id) as customer_sum
from mas) m
left join reftab r
on r.process_type = m.process_type
group by m.customer_id,
m.customer_sum
You can test both options, and see which one performs better.

Starting off with your original query, simply replacing your IN queries with EXISTS statements should provide a significant boost. Also, be wary of summing NULLs, perhaps your ELSE statements should be 0?
SELECT CUSTOMER_ID,
SUM(AMOUNT) as AMOUNT1,
SUM(CASE WHEN EXISTS(SELECT 1 FROM REFTAB WHERE REFTAB.GROUP_ID = 'A' AND REFTAB.PROCESS_TYPE = MAS.PROCESS_TYPE)
THEN AMOUNT ELSE NULL END) as AMOUNT2,
COUNT(CASE WHEN EXISTS(SELECT 1 FROM REFTAB WHERE REFTAB.GROUP_ID = 'D' AND REFTAB.PROCESS_TYPE = MAS.PROCESS_TYPE)
THEN 1 ELSE NULL END) as COUNT1
FROM MAS
GROUP BY CUSTOMER_ID

The normal way is to aggregate the values before the group by. You can also use conditional aggregation, if the rest of the query is correct:
SELECT CUSTOMER_ID,
SUM(CASE WHEN seqnum = 1 THEN AMOUNT END) as AMOUNT1,
SUM(CASE WHEN GROUP_ID = 'A' THEN AMOUNT ELSE NULL END) as AMOUNT2,
COUNT(CASE WHEN GROUP_ID = 'D' THEN 1 ELSE NULL END) as COUNT1
FROM MAS A LEFT JOIN
(SELECT B.*, ROW_NUMBER() OVER (PARTITION BY PROCESS_TYPE ORDER BY PROCESS_TYPE) as seqnum
FROM REFTAB B
) B
ON A.PROCESS_TYPE = B.PROCESS_TYPE
GROUP BY CUSTOMER_ID;
This ignores the duplicates created by the joins.

Use of AVG function to determine percentages in a SQL query

I want to know what percentage of records have a given value, where percentage is defined as the number of records that match the value divided by the total number of records. i.e. if there are 100 records, of which 10 have a null value for student_id and 20 have a value of 999999, then the percentage_999999 should be 20%. Can I use the AVG function to determine this?
Option 1:
SELECT year, college_name,
sum(case when student_id IN ('999999999') then 1 else 0 end) as count_id_999999999,
count_id_999999999/total_id as percent_id_999999999,
sum(case when student_id IS NULL then 1 else 0 end) as count_id_NULL,
count_id_NULL/total_id as percent_id_NULL
count(*) as total_id
FROM enrolment_data ed
GROUP BY year, college_name
ORDER BY year, college_name;
Option 2:
SELECT year, college_name,
sum(case when student_id IN ('999999999') then 1 else 0 end) as count_id_999999999,
avg(case when student_id IN ('999999999') then 1.0 else 0 end) as percent_id_999999999,
sum(case when student_id IS NULL then 1 else 0 end) as count_id_NULL,
avg(case when student_id IS NULL then 1.0 else 0 end) as percent_id_NULL
count(*) as total_id
FROM enrolment_data ed
GROUP BY year, college_name
ORDER BY year, college_name;

I created a similar table with 100 records, 20 999999999s, 10 nulls, and 70 1s. This worked for me on SQL Server:
select count(*), StudentID
from ScratchTbl
group by StudentID;
(No column name) StudentID
10 NULL
70 1
20 999999999
select avg(case when StudentID = '999999999' then 1.0 else 0.0 end) as 'pct_9s',
sum(case when StudentID = '999999999' then 1 else 0 end) as 'count_9s',
avg(case when StudentID is null then 1.0 else 0.0 end) as 'pct_null',
sum(case when StudentID is null then 1 else 0 end) as 'count_null'
from ScratchTbl
pct_9s count_9s pct_null count_null
0.200000 20 0.100000 10
I have a feeling that your use of the group by clause could be creating problems for you, perhaps select a specific year/college using the where clause (and get rid of the group by line) and see if you get the results you expect.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

union grouped result sets in Hive - hive

Related

SQL query several columns about the same field but diferent codes

Transpose SQL row headers to first column

Comparing values in two columns and returning values on conditional bases using sql hive

Tuning oracle subquery in select statement

Use of AVG function to determine percentages in a SQL query

Categories

Resources