I'm using Apache Hive and I have a query like this:
SELECT CASE type WHEN 'a' THEN 'A'
WHEN 'b' THEN 'B'
ELSE 'C'
END AS map_type
,COUNT(user_id) AS count
FROM user_types
GROUP BY CASE type WHEN 'a' THEN 'A'
WHEN 'b' THEN 'B'
ELSE 'C'
END
;
As you can see, I need to group the result by the map_type field, which is calculated in a complex way. In my case, will the CASE WHEN parts in SELECT and GROUP BY be calculated twice? And if I used a subquery like below, will it be more efficient or not?
SELECT map_type
,COUNT(user_id) AS count
FROM (
SELECT CASE type WHEN 'a' THEN 'A'
WHEN 'b' THEN 'B'
ELSE 'C'
END AS map_type
,user_id
FROM user_types
) a
GROUP BY map_type;
The second query (involving the sub-query) might be more performant. This is based on interpretation from Hive's explain plan, and running these queries a few times.
The explain plan for query 1 (without the sub-query) has this section:
Group By Operator [GBY_2]
aggregations:["count(user_id)"]
keys:CASE (type) WHEN ('a') THEN ('A') WHEN ('b') THEN ('B') ELSE ('C') END (type: string)
On the other hand, the same section for query 2 (with the sub-query) has this:
Group By Operator [GBY_3]
aggregations:["count(_col1)"]
keys:_col0 (type: string)
Based on the plan, it looks like query 2 is doing slightly less work.
Also ran a test on dummy data, and got these execution times.
Query 1: (1st time) 6.43 s, (2nd time) 5.92 s, (3rd time) 4.30 s
Query 2: (1st time) 0.82 s, (2nd time) 1.29 s, (3rd time) 1.03 s
Query 2 completed faster in all cases.
The expense of doing an aggregation involves reading lots and lots of data. Then either sorting it or hashing it to bring the keys together. Then the engine needs to process the data and calculate the count.
Whether a case expression is called once or twice is pretty meaningless in the context of all the data movement. Don't worry about it. If there is extra work, it is trivial compared to everything else that needs to be done for the query.
I also think that Hive supports column aliases in the GROUP BY, but I might be mistaken.
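If it does, the first query could be written more briefly, something like this (a sketch only, assuming your Hive version accepts the alias in GROUP BY):
SELECT CASE type WHEN 'a' THEN 'A'
                 WHEN 'b' THEN 'B'
                 ELSE 'C'
       END AS map_type
      ,COUNT(user_id) AS count
FROM user_types
GROUP BY map_type;  -- relies on alias support in GROUP BY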
The CASE expression is not harmful in your case, but using a subquery might actually increase the time.
You can continue with:
SELECT CASE type WHEN 'a' THEN 'A'
WHEN 'b' THEN 'B'
ELSE 'C'
END AS map_type
,COUNT(user_id) AS count
FROM user_types
GROUP BY CASE type WHEN 'a' THEN 'A'
WHEN 'b' THEN 'B'
ELSE 'C'
END
;
Related
I'm new to SQL. I would like to split the value into 2 columns and group it by the same customer. Below is my current table:
I have tried the query:
Select *
,Case when [Devices] = 'RF' THEN (Select [Lines] From table_name Else '0' )End As [RF]
,Case when [Devices] = 'Desktop' THEN (Select [Lines] From table_name Else '0') End As [Desktop]
From table_name
But it gives me the error : This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
Please advise if anything is wrong with the query.
Thank you!!
Customer | Lines | Devices
A        | 3     | RF
A        | 4     | Desktop
What I expected to see:
Customer | RF | Desktop
A        | 3  | 4
First problem: Don't use *; list the columns you need explicitly.
Second issue: You don't need subqueries inside your CASE WHEN construct. A corrected version of the query you tried is:
SELECT customer,
CASE WHEN devices = 'RF' THEN lines ELSE 0 END AS RF,
CASE WHEN devices = 'Desktop' THEN lines ELSE 0 END AS Desktop
FROM table_name;
Third point: This will produce two rows: One row RF 3, Desktop 0 and one row RF 0, Desktop 4. But the expected outcome according to your description is one row only. To achieve this, you need to SUM your values and GROUP BY customer:
SELECT customer,
SUM(CASE WHEN devices = 'RF' THEN lines ELSE 0 END) AS RF,
SUM(CASE WHEN devices = 'Desktop' THEN lines ELSE 0 END) AS Desktop
FROM table_name
GROUP BY customer;
Especially when using the second (SUM) version, I recommend checking whether the ELSE clause of your CASE WHEN is really required: SUM ignores NULLs, so leaving the ELSE out only changes the result when no row matches at all (you then get NULL instead of 0).
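For illustration, the RF column without the ELSE would be (a minimal sketch; the sum is the same unless a customer has no 'RF' row at all):
SUM(CASE WHEN devices = 'RF' THEN lines END) AS RF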
In these simple use cases, such queries will work correctly. If your table is more complex and you need to cover more cases, have a look at features like PIVOT instead of writing lots of CASE WHEN constructs.
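For example, a hedged sketch using SQL Server's PIVOT syntax (your error message suggests SQL Server, but that is an assumption; other databases provide different pivot features):
SELECT customer, [RF], [Desktop]
FROM (SELECT customer, devices, lines FROM table_name) AS src
PIVOT (SUM(lines) FOR devices IN ([RF], [Desktop])) AS p;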
A last note: All these queries assume that your column "lines" has a numeric data type. If that isn't the case, you need to convert it; the exact conversion syntax depends on your DB.
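For example (a sketch; CAST is widely supported, but the target type and exact syntax depend on your DB):
SUM(CASE WHEN devices = 'RF' THEN CAST(lines AS INT) ELSE 0 END) AS RF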
Simplified example:
In hive, I have a table t with two columns:
Name, Value
Bob, 2
Betty, 4
Robb, 3
I want to do a case when that uses the total of the Value column:
Select
Name
, CASE
When value>0.5*sum(value) over () THEN '0'
When value>0.9*sum(value) over () THEN '1'
ELSE '2'
END as var
From table
I don’t like the fact that sum(value) over () is computed twice. Is there a way to compute this only once. Added twist, I want to do this in one query, so without declaring user variables.
I was thinking of scalar queries:
With total as
(Select sum(value) from table)
Select
Name
, CASE
When value>0.5*(select * from total) THEN '0'
When value>0.9*(select * from total) THEN '1'
ELSE '2'
END as var
From table;
But this doesn’t work.
TLDR: Is there a way to simplify the first query without user variables ?
Don't worry about that. Let the optimizer worry about it. But, you can use a subquery or CTE if you don't want to repeat the expression:
select Name,
(case when value > 0.5 * total then '0'
when value > 0.9 * total then '1'
else '2'
end) as var
From (select t.*, sum(value) over () as total
from table t
) t;
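The CTE form of the same idea would look roughly like this (a sketch using the question's table and column names; the 0.9 check is placed first so that branch can actually match):
with t2 as (
      select t.*, sum(value) over () as total
      from table t
     )
select Name,
       (case when value > 0.9 * total then '1'
             when value > 0.5 * total then '0'
             else '2'
        end) as var
from t2;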
Cross join a subquery that fetches the sum to the table:
Select
t.Name
, CASE
When t.value>0.9*tt.value THEN '1'
When t.value>0.5*tt.value THEN '0'
ELSE '2'
END as var
From table t cross join (select sum(value) value from table) tt
and change the order of the WHEN clauses in the CASE expression because as they are, the 2nd case will never succeed.
Since I/O is the major factor that slows down Hive queries, we should strive to reduce the number of stages to get better performance.
So it's better not to use a sub-query or CTE here.
Try this SQL with a global window clause:
select
name,
case
when value > 0.5*sum(value) over w then '0'
when value > 0.9*sum(value) over w then '1'
else '2'
end as var
from my_table
window w as (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
In this case, a window clause is the recommended way to reduce repetition of code.
Both the windowing and the sum aggregation will be computed only once. You can run EXPLAIN on the query to confirm that only ONE meaningful MR stage will be launched.
Edit:
1. A simple SELECT on top of a subquery is not something to worry about. It can be pushed down to the last phase of the subquery, avoiding an additional MR stage.
2. Two identical aggregations residing in the same query block will only be evaluated once. So don’t worry about potential repeated calculation.
I'm trying to write a query that will aggregate data in a table according to a user-supplied table that drives the aggregations. I got it to work fine when I just used a sum statement, but when I put the sum inside of a case statement to allow the user to specify sum, count, mean, etc., I get group by errors.
I replaced:
sum(column)
with:
CASE b.calculationtype
WHEN 'SUM' THEN SUM(column)
WHEN 'MEAN' THEN AVG(column)
WHEN 'COUNT' THEN COUNT(column)
WHEN 'VARIANCE' THEN VARIANCE(column)
WHEN 'STANDARD DEVIATION' THEN STDDEV(column)
END
Does Oracle see past the CASE statement when evaluating the GROUP BY, or am I out of luck trying to make the actual aggregation function change based on the value in table b?
I could always brute force it the long way and move the calculationtype logic outside of the actual query, but that seems a little painful in that I'd have 5 identical queries with different aggregate functions that are called depending on the calculation type field.
select b.REPORT,
case b.AGG_VARIABLE_A_FLAG
when 'N' then null
when 'Y' then a.AGG_VARIABLE_A
end,
case b.AGG_VARIABLE_B_FLAG
when 'N' then null
when 'Y' then a.AGG_VARIABLE_B
end,
--<<< problem starts >>>
case b.CALCULATIONTYPE
when 'SUM' then sum(a.column1) when 'MEAN' then avg(a.column1) when 'COUNT' then count(a.column1) when 'VARIANCE' then variance(a.column1) when 'STANDARD DEVIATION' then stddev(a.column1)
end,
case b.CALCULATIONTYPE
when 'SUM' then sum(a.column2) when 'MEAN' then avg(a.column2) when 'COUNT' then count(a.column2) when 'VARIANCE' then variance(a.column2) when 'STANDARD DEVIATION' then stddev(a.column2)
end
--<<< problem ends >>
from DATA_TABLE a
cross join CONTROL_TABLE b
where a.ID = bind_variable_id
and a.SOURCEARRAY = b.SOURCEARRAY
and b.CALCULATIONTYPE <> 'INTERNAL'
group by b.REPORT,
case b.AGG_VARIABLE_A_FLAG
when 'N' then null
when 'Y' then a.AGG_VARIABLE_A
end,
case b.AGG_VARIABLE_B_FLAG
when 'N' then null
when 'Y' then a.AGG_VARIABLE_B
end
Add b.CALCULATIONTYPE to the GROUP BY clause.
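In other words, only the GROUP BY changes:
group by b.REPORT,
case b.AGG_VARIABLE_A_FLAG
when 'N' then null
when 'Y' then a.AGG_VARIABLE_A
end,
case b.AGG_VARIABLE_B_FLAG
when 'N' then null
when 'Y' then a.AGG_VARIABLE_B
end,
b.CALCULATIONTYPE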
Sample Data: Oracle 10g
Number Activity
1 x Activity
1 no activity
2 x activity
3 no activity
What I need to do is produce the rows where there is either no activity or x activity, then comes the problem: If there is an ID with both x activity and no activity, I only want to produce the x activity rows. Can it be done? Here is the CASE statement that produces the above data:
CASE WHEN DSKMTB_ACTIVITY_TYPE.ACTIVITY_TYPE_LABEL IS NULL
THEN 'No Activity'
ELSE DSKMTB_ACTIVITY_TYPE.ACTIVITY_TYPE_LABEL
END AS "Activity Type"
I am thinking I need to nest a CASE statement, but I can't quite gather the logic in my head. Please let me know if there is anything that I can do to help. Per the usual, I have not included the entire query here, as it is quite large, but will edit if anyone feels it is necessary. Thanks in advance.
By grouping by ID and doing the filtering inside the CASE, you might get through it:
SELECT
CASE
WHEN MAX(NVL(a.ACTIVITY_TYPE_LABEL, 'aaaa')) = 'aaaa'
THEN 'No Activity'
ELSE MAX(a.ACTIVITY_TYPE_LABEL)  -- aggregate needed here because of GROUP BY a.id
END AS "Activity Type"
FROM DSKMTB_ACTIVITY_TYPE a
GROUP BY a.id
The problem is that if you actually select other fields, it will probably break the GROUP BY, and we'd have to move this into a subquery.
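A hedged sketch of that subquery variant, reusing the table and column names from the snippet above (max_label is just an illustrative alias; MAX ignores NULLs, so an ID with both kinds of rows keeps its real label):
SELECT x.id,
       CASE WHEN x.max_label IS NULL THEN 'No Activity'
            ELSE x.max_label
       END AS "Activity Type"
FROM (
      SELECT a.id, MAX(a.ACTIVITY_TYPE_LABEL) AS max_label
      FROM DSKMTB_ACTIVITY_TYPE a
      GROUP BY a.id
     ) x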
You can put a MAX() around the CASE statement and then a GROUP BY on the ID field.
MAX(CASE WHEN DSKMTB_ACTIVITY_TYPE.ACTIVITY_TYPE_LABEL IS NULL
THEN 'No Activity'
ELSE DSKMTB_ACTIVITY_TYPE.ACTIVITY_TYPE_LABEL
END) AS "Activity Type"
GROUP BY ID
I'm building a query with a GROUP BY clause that needs the ability to count records based only on a certain condition (e.g. count only records where a certain column value is equal to 1).
SELECT UID,
COUNT(UID) AS TotalRecords,
SUM(ContractDollars) AS ContractDollars,
(COUNTIF(MyColumn, 1) / COUNT(UID) * 100) -- Get the percentage of all records that are 1
FROM dbo.AD_CurrentView
GROUP BY UID
HAVING SUM(ContractDollars) >= 500000
The COUNTIF() line obviously fails since there is no native SQL function called COUNTIF, but the idea here is to determine the percentage of all rows that have the value '1' for MyColumn.
Any thoughts on how to properly implement this in a MS SQL 2005 environment?
You could use a SUM (not COUNT!) combined with a CASE statement, like this:
SELECT SUM(CASE WHEN myColumn=1 THEN 1 ELSE 0 END)
FROM AD_CurrentView
Note: in my own test NULLs were not an issue, though this can be environment dependent. You could handle NULLs like this:
SELECT SUM(CASE WHEN ISNULL(myColumn,0)=1 THEN 1 ELSE 0 END)
FROM AD_CurrentView
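Plugged back into the original query it would look something like this (a sketch; the PercentOnes alias is just illustrative, and multiplying by 100.0 avoids integer division in SQL Server):
SELECT UID,
       COUNT(UID) AS TotalRecords,
       SUM(ContractDollars) AS ContractDollars,
       SUM(CASE WHEN MyColumn = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(UID) AS PercentOnes
FROM dbo.AD_CurrentView
GROUP BY UID
HAVING SUM(ContractDollars) >= 500000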
I usually do what Josh recommended, but brainstormed and tested a slightly hokey alternative that I felt like sharing.
You can take advantage of the fact that COUNT(ColumnName) doesn't count NULLs, and use something like this:
SELECT COUNT(NULLIF(0, myColumn))
FROM AD_CurrentView
NULLIF - returns NULL if the two passed in values are the same.
Advantage: Expresses your intent to COUNT rows instead of having the SUM() notation.
Disadvantage: Not as clear how it is working ("magic" is usually bad).
I would use this syntax. It achieves the same as Josh and Chris's suggestions, but with the advantage that it is ANSI compliant and not tied to a particular database vendor.
select count(case when myColumn = 1 then 1 else null end)
from AD_CurrentView
How about
SELECT id, COUNT(IF status=42 THEN 1 ENDIF) AS cnt
FROM table
GROUP BY id
Shorter than CASE :)
Works because COUNT() doesn't count null values, and IF/CASE return null when condition is not met and there is no ELSE.
I think it's better than using SUM().
Adding on to Josh's answer,
SELECT COUNT(CASE WHEN myColumn=1 THEN AD_CurrentView.PrimaryKeyColumn ELSE NULL END)
FROM AD_CurrentView
Worked well for me (in SQL Server 2012) without changing the 'count' to a 'sum' and the same logic is portable to other 'conditional aggregates'. E.g., summing based on a condition:
SELECT SUM(CASE WHEN myColumn=1 THEN AD_CurrentView.NumberColumn ELSE 0 END)
FROM AD_CurrentView
It's 2022 and latest SQL Server still doesn't have COUNTIF (along with regex!). Here's what I use:
-- Count if MyColumn = 42
SELECT SUM(IIF(MyColumn = 42, 1, 0))
FROM MyTable
IIF is a shortcut for CASE WHEN MyColumn = 42 THEN 1 ELSE 0 END.
Not product-specific, but the SQL standard provides
SELECT COUNT(*) FILTER (WHERE <condition-1>),
       COUNT(*) FILTER (WHERE <condition-2>), ...
FROM ...
for this purpose. Or something that closely resembles it; I don't remember the details off the top of my head.
And of course vendors will prefer to stick with their proprietary solutions.
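Applied to the question it would look roughly like this (a sketch; CountOnes is an illustrative alias; PostgreSQL accepts this syntax, while SQL Server does not, as far as I know):
SELECT UID,
       COUNT(*) FILTER (WHERE MyColumn = 1) AS CountOnes
FROM AD_CurrentView
GROUP BY UID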
Why not like this?
SELECT count(1)
FROM AD_CurrentView
WHERE myColumn=1
I had to emulate COUNTIF() in my case as part of my SELECT columns, to mimic a % of the number of times each item appeared in my results.
So I used this...
SELECT COL1, COL2, ... ETC,
       (1 / (SELECT a.vcount
             FROM (SELECT vm2.visit_id, count(*) AS vcount
                   FROM dbo.visitmanifests AS vm2
                   WHERE vm2.inactive = 0 AND vm2.visit_id = vm.Visit_ID
                   GROUP BY vm2.visit_id) AS a)) AS [No of Visits],
       COL xyz
FROM etc etc
Of course you will need to format the result according to your display requirements.
SELECT COALESCE(COUNT(DISTINCT IF(myColumn = 1, NumberColumn, NULL)), 0) AS column1,
       COALESCE(COUNT(DISTINCT CASE WHEN myColumn = 1 THEN NumberColumn END), 0) AS column2
FROM AD_CurrentView