Which is faster: Sum(Case When) Or Group By/Count(*)? - sql

I can write
Select
Sum(Case When Resposta.Tecla = 1 Then 1 Else 0 End) Valor1,
Sum(Case When Resposta.Tecla = 2 Then 1 Else 0 End) Valor2,
Sum(Case When Resposta.Tecla = 3 Then 1 Else 0 End) Valor3,
Sum(Case When Resposta.Tecla = 4 Then 1 Else 0 End) Valor4,
Sum(Case When Resposta.Tecla = 5 Then 1 Else 0 End) Valor5
From Resposta
Or
Select
Count(*)
From Resposta Group By Tecla
I tried both over a large number of rows and they seem to take about the same time.
Can anyone confirm this?

I believe the Group By is better because it needs no special-case handling and can be optimized by the database engine.
I think the results may depend on the database engine you use.
Maybe the one you are using recognizes that the first query is equivalent to a group by!
You can try the "explain / explain plan" command to see how the engine is computing your queries, but with my Microsoft SQL Server 2008 I can only see a swap between two operations ("Compute Scalar" and "Aggregate").
I tried such queries on a database table:
SQL Server 2008
163,000 rows in the table
12 categories (Valor1 -> Valor12)
The results are quite different:
Group By: 2 seconds
Case When: 6 seconds!
So my choice is "Group By".
Another benefit is that the query is simpler to write!

What the DB does internally with the second query is practically the same as what you explicitly tell it to do with the first. There should be no difference in the execution plan, and thus in the time the query takes. Taking this into account, the second query is clearly better:
it's much more flexible: when there are more values of Tecla you don't need to change your query
it's easier to understand: if you have a lot of values for Tecla, it's harder to read the first query and realize it just counts distinct values
it's smaller: you're sending less information to the DB server, so it will probably parse the query faster, which is the only performance difference I see between these queries. It makes a difference, albeit a small one

Either one is going to have to read all rows from Resposta, so for any reasonably sized table, I'd expect the I/O cost to dominate - giving approximately the same overall runtime.
I'd generally use:
Select
Tecla,
Count(*)
From Resposta
Group By Tecla
If there's a reasonable chance that the range of Tecla values will change in the future.
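As a quick sanity check that the two forms really do produce the same counts, here is a small runnable sketch using SQLite from Python (the table and column names follow the question; the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Resposta (Tecla INTEGER)")
conn.executemany("INSERT INTO Resposta (Tecla) VALUES (?)",
                 [(1,), (1,), (2,), (3,), (3,), (3,), (5,)])

# Conditional aggregation: one wide row with a column per value.
wide = conn.execute("""
    SELECT
        SUM(CASE WHEN Tecla = 1 THEN 1 ELSE 0 END) AS Valor1,
        SUM(CASE WHEN Tecla = 2 THEN 1 ELSE 0 END) AS Valor2,
        SUM(CASE WHEN Tecla = 3 THEN 1 ELSE 0 END) AS Valor3
    FROM Resposta
""").fetchone()

# GROUP BY: one narrow row per distinct value actually present.
narrow = dict(conn.execute(
    "SELECT Tecla, COUNT(*) FROM Resposta GROUP BY Tecla"))

print(wide)    # (2, 1, 3)
print(narrow)  # {1: 2, 2: 1, 3: 3, 5: 1}
```

Note the shape difference: the GROUP BY version only returns rows for values that exist (here it also reports Tecla = 5 without any query change), which is exactly the flexibility argument made above.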

In my opinion the GROUP BY statement will always be faster than SUM(CASE WHEN ...), because in your example the SUM version performs 5 different calculations per row, while with GROUP BY the DB simply sorts and counts.
Imagine you have a bag with different coins and you need to know how many of each type of coin you have. You can do it these ways:
The SUM(CASE WHEN ...) way is to compare each coin with predefined sample coins and do the math for each sample (add 1 or 0);
The GROUP BY way is to sort the coins by type and then count each group.
Which method would you prefer?

To compete fairly with COUNT(*), your first SQL should probably be:
Select
Sum(Case When Resposta.Tecla >= 1 AND Resposta.Tecla <=5 Then 1 Else 0 End) Valor
From Resposta
And to answer your question: I'm not noticing any difference at all in speed between SUM(CASE WHEN ...) and COUNT. I'm querying over 250,000 rows in PostgreSQL.

Related

Query for grouping of successful attempts when order matters

Let's say, for example, I have a db table Jumper for tracking high jumpers. It has three columns of interest: attempt_id, athlete, and result (a boolean for whether the jumper cleared the bar or not).
I want to write a query that will compare all athletes' performance across different attempts yielding a table with this information: attempt number, number of cleared attempts, total attempts. In other words, what is the chance that an athlete will clear the bar on x attempt.
What is the best way of writing this query? It is trickier than it would seem at first because you need to determine the attempt number for each athlete to be able to total the final totals.
I would prefer answers be written with Django ORM, but SQL will also be accepted.
Edit: To be clear, I need it to be grouped by attempt, not by athlete. So it would be all athletes' combined x attempt.
You could solve it using SQL:
SELECT t.attempt_id,
SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
COUNT(*) AS total
FROM Jumper t
GROUP BY t.attempt_id
EDIT: If attempt_id is just a sequence, and you want to use it to calculate the attempt number for each jumper, you could use this query instead:
SELECT t.attempt_number,
SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
COUNT(*) AS total
FROM (SELECT s.*,
ROW_NUMBER() OVER(PARTITION BY athlete
ORDER BY attempt_id) AS attempt_number
FROM Jumper s) t
GROUP BY t.attempt_number
This way, you group every first attempt from all athletes, every second attempt from all athletes, and so on...
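Here is a small runnable sketch of that second query (SQLite from Python, which supports ROW_NUMBER(); the schema and data are invented to match the question, with result stored as 1 for cleared and 0 for missed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Jumper (attempt_id INTEGER, athlete TEXT, result INTEGER)")
# Two athletes, two attempts each.
conn.executemany("INSERT INTO Jumper VALUES (?, ?, ?)", [
    (1, "ana", 0), (2, "ana", 1),
    (3, "bob", 1), (4, "bob", 1),
])

rows = conn.execute("""
    SELECT t.attempt_number,
           SUM(CASE t.result WHEN 1 THEN 1 ELSE 0 END) AS cleared,
           COUNT(*) AS total
    FROM (SELECT s.*,
                 ROW_NUMBER() OVER (PARTITION BY athlete
                                    ORDER BY attempt_id) AS attempt_number
          FROM Jumper s) t
    GROUP BY t.attempt_number
    ORDER BY t.attempt_number
""").fetchall()

print(rows)  # [(1, 1, 2), (2, 2, 2)]
```

On attempt 1, one of two athletes cleared; on attempt 2, both did, which is the per-attempt grouping the question asked for.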

Add up all occurances when a condition exists

I am stuck on the best way to write the SQL for a report I have to produce. We use Sybase ASA; it is embedded in an application. The query needs to produce the following output:
Media Server | Total Number of Backups | Volume Size (KB) | Average Throughput | Number of Successful | Jobs Success %
Each media server would be unique.
I am having an issue with getting the number of successful jobs and then determining the percentage of success.
Here is the code that I have:
SELECT
dmj.name AS "Media Server",
CAST(SUM(dj.bytesWritten/1024/1024) as decimal(20,2)) as "Volume(MB)",
COUNT(distinct dj.id) AS "Total Number of Jobs",
CAST(AVG(dj.throughput) as decimal(10,2)) AS "Throughput (KB/sec)",
CASE
WHEN dj.statusCode = '0'
THEN COUNT (dj.statusCode)
END AS "Number of Successful Jobs"
FROM domain_JobArchive dj
INNER JOIN domain_MediaServer dmj
ON dj.mediaServerName = dmj.name
WHERE DATEDIFF(day, UtcBigIntToNomTime(dj.endTime), GETDATE()) <= 7
AND dj.Type != '17'
AND dj.statusCode = 0
GROUP BY dmj.name, dj.statusCode
I am also still unsure how to show the % of success.
Thanks.
So it's:
SUM(CASE when dj.statusCode = '0' then 1 else 0 end)
to get a count of successes, and:
SUM(CASE WHEN dj.statusCode = '0' THEN 1 else 0 END) * 100.0/ COUNT(*)
to get the percentage of success. As before, you won't get what you want by putting aggregate functions like COUNT(*) inside the CASE expression (I think Sybase will accept that, but you'll get the overall COUNT and probably break the GROUP BY; don't do it here).
You have a further problem - why are you grouping by statusCode? That should go, because you're looking at all jobs per Media Server and not separating them between success/failure, rather you're calculating stats on success/failure rates. So just group on the Server name.
Furthermore what's this:
COUNT(distinct dj.id)
I'd have thought you'd have a distinct list already - if you have to do that it suggests to me you've got multiple rows per job, so that will break your stats.
ALSO:
WHERE DATEDIFF(day, UtcBigIntToNomTime(dj.endTime), GETDATE()) <= 7
AND dj.Type != '17'
AND dj.statusCode = 0
You're restricting to successful jobs! I think you should start again: first write the SQL that selects the jobs, and make sure it is right. If you need failure rates you'll have to select failures too.
Then use the SUM(CASE ...) technique to calculate the results you need, with the appropriate GROUP BY.
I think you need to build up the query again carefully, because at the moment there are too many problems.
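To illustrate the SUM(CASE ...) percentage technique on its own, here is a minimal sketch (SQLite from Python, with an invented stand-in table rather than the real Sybase schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (server TEXT, statusCode INTEGER)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)", [
    ("ms1", 0), ("ms1", 0), ("ms1", 1),   # 2 of 3 succeeded
    ("ms2", 0), ("ms2", 5),               # 1 of 2 succeeded
])

# Count all jobs, count successes, and derive the success percentage,
# without filtering out failures in the WHERE clause.
rows = conn.execute("""
    SELECT server,
           COUNT(*) AS total,
           SUM(CASE WHEN statusCode = 0 THEN 1 ELSE 0 END) AS ok,
           SUM(CASE WHEN statusCode = 0 THEN 1 ELSE 0 END) * 100.0
               / COUNT(*) AS pct
    FROM jobs
    GROUP BY server
    ORDER BY server
""").fetchall()

for r in rows:
    print(r)
```

The key point from the answer above: failures must survive the WHERE clause so that COUNT(*) sees them; the CASE expression, not the filter, separates successes from failures.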

How to run a subquery to split a table into two groups?

I've got a table called spending (actually in BigQuery, though I don't think that's necessarily relevant for this question) that is about 2.9GB and 19 million rows.
The data structure is like this:
product,org,spend,to_include,proportion_overseas
----------------------------------
SK001,03V,"Yes",0.1
SK002,03V,2.4,"Yes",0.1
SK001,O3T,66.1,"No",0.47
SK002,03T,87.1,"No",0.47
SK001,04C,16.1,"Yes",0
SK002,04C,27.1,"Yes",0
...
For info, it is slightly denormalised, in that to_include and proportion_overseas are actually properties of each organisation.
Now I want to work out, for each product:
the total amount that all organisations with no overseas spending spent on that product, and
the total amount that all organisations with non-zero overseas spending spent on that product.
I also want to include only rows where to_include='Yes' in this calculation.
I'm not sure what the best approach to do this is in SQL. I don't mind whether I end up with two tables, or one.
I know how to get all spending by code, for all relevant rows:
SELECT product, SUM(spend)
FROM spending
WHERE to_include='Yes'
GROUP BY product;
But what I don't know is how to split each row into two groups: one group where proportion_overseas=0 and one group where proportion_overseas>0.
I don't think 'subquery' is the right term, so I don't really know what to Google for!
You can use conditional aggregation:
SELECT product, SUM(spend),
SUM(CASE WHEN proportion_overseas = 0 THEN spend ELSE 0 END) as not_overseas,
SUM(CASE WHEN proportion_overseas > 0 THEN spend ELSE 0 END) as overseas
FROM spending
WHERE to_include='Yes'
GROUP BY product;
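A runnable sketch of this conditional aggregation (SQLite from Python; the rows are shaped like the question's sample data, with spend values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE spending (
    product TEXT, org TEXT, spend REAL,
    to_include TEXT, proportion_overseas REAL)""")
conn.executemany("INSERT INTO spending VALUES (?, ?, ?, ?, ?)", [
    ("SK001", "03V", 5.0,  "Yes", 0.1),
    ("SK001", "04C", 16.0, "Yes", 0.0),
    ("SK001", "03T", 66.1, "No",  0.47),  # excluded by to_include
    ("SK002", "03V", 2.5,  "Yes", 0.1),
    ("SK002", "04C", 27.5, "Yes", 0.0),
])

rows = conn.execute("""
    SELECT product,
           SUM(spend) AS total,
           SUM(CASE WHEN proportion_overseas = 0 THEN spend ELSE 0 END)
               AS not_overseas,
           SUM(CASE WHEN proportion_overseas > 0 THEN spend ELSE 0 END)
               AS overseas
    FROM spending
    WHERE to_include = 'Yes'
    GROUP BY product
    ORDER BY product
""").fetchall()

print(rows)
# [('SK001', 21.0, 16.0, 5.0), ('SK002', 30.0, 27.5, 2.5)]
```

Both "groups" come back as columns of a single row per product, so there is no need for two tables or a subquery.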

Return NULL instead of 0 when using COUNT(column) SQL Server

I have a query which runs fine, and it does two types of work: COUNT and SUM.
Something like
select
id,
Count (contracts) as countcontracts,
count(something1),
count(something1),
count(something1),
sum(cost) as sumCost
from
table
group by
id
My problem is: if there is no contract for a given ID, COUNT returns 0 and SUM returns NULL. I want to see NULL instead of 0.
I was thinking about case when Count(contracts) = 0 then null else Count(contracts) end, but I don't want to do it this way because I have more than 12 COUNT columns in the query, and it processes a big number of records, so I think it may slow down query performance.
Are there any other ways to replace 0 with NULL?
Try this:
select NULLIF(Count(something), 0)
Here are three methods:
1. (case when count(contracts) > 0 then count(contracts) end) as countcontracts
2. sum(case when contracts is not null then 1 end) as countcontracts
3. nullif(count(contracts), 0)
All three of these require writing more complicated expressions. However, this really isn't that difficult. Just copy the line multiple times, and change the name of the variable on each one. Or, take the current query, put it into a spreadsheet and use spreadsheet functions to make the transformation. Then copy the function down. (Spreadsheets are really good code generators for repeated lines of code.)
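A quick sketch of the NULLIF approach (SQLite from Python; note that COUNT(column) skips NULLs, and sqlite3 returns Python None for SQL NULL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, contracts TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    (1, "c1"), (1, "c2"),   # id 1 has two contracts
    (2, None),              # id 2 has none: COUNT(contracts) is 0
])

rows = conn.execute("""
    SELECT id,
           COUNT(contracts) AS raw_count,
           NULLIF(COUNT(contracts), 0) AS null_count
    FROM t
    GROUP BY id
    ORDER BY id
""").fetchall()

print(rows)  # [(1, 2, 2), (2, 0, None)]
```

NULLIF(x, y) returns NULL when x = y and x otherwise, so it turns exactly the zero counts into NULL without touching the others.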

How to get aggregates without using a nested sql query

I am writing a custom report from an Avamar (Postgresql) database which contains backup job history. My task is to display jobs that failed last night (based on status_code), and include that client's success ratio (jobs succeeded/total jobs run) over the past 30 days on the same line.
So the overall select just picks up clients that failed (status_code doesn't equal 30000, which is the success code). However, for each failed client from last night, I need to also know how many jobs have succeeded, and how many jobs total were started/scheduled in the past 30 days. (The time-period filter is straightforward, so I've left it out of the code below to keep things simple.)
I tried to do this without using a nested query, based on Hobodave's feedback on this similar question but I'm not quite able to nail it.
In the query below, I get the following error:
column "v_activities_2.client_name" must appear in the GROUP BY clause or be used in an aggregate function
Here's my (broken) query. I know the logic is flawed, but I'm coming up empty with how best to accomplish this. Thanks in advance for any guidance!
select
split_part(client_name,'.',1) as client_name,
bunchofothercolumnns,
round(
100.0 * (
((sum(CASE WHEN status_code=30000 THEN 1 ELSE 0 END))) /
((sum(CASE WHEN type='Scheduled Backup' THEN 1 ELSE 0 END))))
as percent_total
from v_activities_2
where
status_code<>30000
order by client_name
You need to define a GROUP BY if you have columns in the SELECT that do not have aggregate functions performed on them:
SELECT SPLIT_PART(t.client_name, '.', 1) AS client_name,
SUM(CASE WHEN t.status_code = 30000 THEN 1 ELSE 0 END) AS successes
FROM v_activities_2 t
GROUP BY SPLIT_PART(t.client_name, '.', 1)
ORDER BY client_name
How do you expect the following to work:
SUM(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END) as successes
FROM v_activities_2
WHERE status_code <> 30000
You can't expect to count rows you're excluding.
Why avoid a nested query?
It seems the most logical and efficient solution here.
If you do this in one pass with no subqueries (only GROUP BYs), you will end up scanning the whole table (or joined tables), which is not efficient, because only SOME clients failed last night.
Subqueries are not that bad, in general.
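A sketch of the nested approach this answer suggests (SQLite from Python; the columns and sample data are invented stand-ins, not the real Avamar schema): filter last night's failures in one subquery, aggregate the 30-day history per client in another, then join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE v_activities_2
    (client_name TEXT, status_code INTEGER, started TEXT)""")
conn.executemany("INSERT INTO v_activities_2 VALUES (?, ?, ?)", [
    ("app1.example", 30000, "2023-01-01"),
    ("app1.example", 30000, "2023-01-02"),
    ("app1.example", 10000, "2023-01-03"),  # failed last night
    ("app2.example", 30000, "2023-01-03"),  # succeeded: excluded
])

rows = conn.execute("""
    SELECT f.client_name, s.successes, s.total
    FROM (SELECT DISTINCT client_name
          FROM v_activities_2
          WHERE started = '2023-01-03' AND status_code <> 30000) f
    JOIN (SELECT client_name,
                 SUM(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END)
                     AS successes,
                 COUNT(*) AS total
          FROM v_activities_2
          GROUP BY client_name) s
      ON s.client_name = f.client_name
""").fetchall()

print(rows)  # [('app1.example', 2, 3)]
```

Only the failed client appears in the output, yet its success ratio (2 of 3) is computed over all of its jobs, which is exactly why the success filter and the aggregation have to live in separate query blocks.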