How to get aggregates without using a nested SQL query

I am writing a custom report from an Avamar (Postgresql) database which contains backup job history. My task is to display jobs that failed last night (based on status_code), and include that client's success ratio (jobs succeeded/total jobs run) over the past 30 days on the same line.
So the overall select just picks up clients that failed (status_code doesn't equal 30000, which is the success code). However, for each failed client from last night, I also need to know how many jobs succeeded and how many jobs total were started or scheduled in the past 30 days. (The time-period logic is straightforward, so I've left it out of the code below to keep things simple.)
I tried to do this without using a nested query, based on Hobodave's feedback on this similar question, but I'm not quite able to nail it.
In the query below, I get the following error:
column "v_activities_2.client_name" must appear in the GROUP BY clause or be used in an aggregate function
Here's my (broken) query. I know the logic is flawed, but I'm coming up empty with how best to accomplish this. Thanks in advance for any guidance!
select
    split_part(client_name, '.', 1) as client_name,
    bunchofothercolumns,
    round(
        100.0 * (
            sum(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END) /
            sum(CASE WHEN type = 'Scheduled Backup' THEN 1 ELSE 0 END)
        )
    ) as percent_total
from v_activities_2
where
    status_code <> 30000
order by client_name

You need to define a GROUP BY if you have columns in the SELECT that do not have aggregate functions performed on them:
SELECT SPLIT_PART(t.client_name, '.', 1) AS client_name,
       SUM(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END) AS successes
FROM v_activities_2 t
GROUP BY SPLIT_PART(t.client_name, '.', 1)
ORDER BY client_name
How do you expect the following to work:
SUM(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END) as successes
FROM v_activities_2
WHERE status_code <> 30000
You can't expect to count rows you're excluding.
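One way out, as a rough sketch (reusing the column names from the question; the last-night/30-day date filters are still omitted, as in the original): aggregate over all rows so the successes remain countable, and use HAVING to keep only clients that had at least one failure:
-- Sketch only: aggregate over every job, then keep clients with failures
SELECT SPLIT_PART(client_name, '.', 1) AS client_name,
       SUM(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END) AS successes,
       COUNT(*) AS total_jobs
FROM v_activities_2
GROUP BY SPLIT_PART(client_name, '.', 1)
HAVING SUM(CASE WHEN status_code <> 30000 THEN 1 ELSE 0 END) > 0
ORDER BY client_name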

Why avoid a nested query?
It seems the most logical and efficient solution here.
If you do this in one pass with no subqueries (only GROUP BYs), you will end up scanning the whole table (or joined tables), which is not efficient, because only SOME clients failed last night.
Subqueries are not that bad, in general.
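For example, a sketch of the subquery approach for the original question (column names taken from the question; the date filters are again omitted, so this is not a drop-in answer):
-- One line per failed job, with the client's overall success ratio attached
SELECT SPLIT_PART(f.client_name, '.', 1) AS client_name,
       s.successes,
       s.total_jobs,
       ROUND(100.0 * s.successes / s.total_jobs, 2) AS percent_total
FROM v_activities_2 f
JOIN (SELECT client_name,
             SUM(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END) AS successes,
             COUNT(*) AS total_jobs
      FROM v_activities_2
      GROUP BY client_name) s ON s.client_name = f.client_name
WHERE f.status_code <> 30000    -- only the failed jobs
ORDER BY client_name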

Related

SQL multiple constrained counts in query

I am trying to get, with one query, multiple count results where each one is a subset of the previous one.
So my table would be called Recipe and has these columns:
recipe_num(Primary_key) decimal,
recipe_added date,
is_featured bit,
liked decimal
And what I want is to make a query that will return, grouped by day for any particular month:
total recipes as total_recipes,
total recipes that were featured as featured_recipes,
total recipes that were featured and had more than 100 likes as liked_recipes.
So as you can see, they are all counts, each being a subset of the previous one.
Ideally I don't want to run separate SELECT COUNTs that each query the whole table, but rather derive each count from the previous one.
I am not very good at using COUNT with WHERE, HAVING, etc., and I'm not exactly sure how to do it. So far I have the following, which I managed via digging around here.
select
recipe_added,
count(*) total_recipes,
count(case is_featured when 1 then 1 else null end) total_featured_recipes
from
RECIPES
group by
recipe_added
I am not exactly sure why I have to use CASE inside the COUNT; I wasn't able to get it to work using WHERE, and I would like to know if that is possible as well.
Thanks
With a CASE expression inside COUNT() you are doing conditional aggregation and this is exactly what you need for this requirement:
select recipe_added,
count(*) total_recipes,
count(case when is_featured = 1 then 1 end) total_featured_recipes,
count(case when is_featured = 1 and liked > 100 then 1 end) liked_recipes
from Recipes
group by recipe_added
There is no need for ELSE null because the default behavior of a CASE expression is to return null when no other branch returns a value.
If you want results for a specific month, say October 2020, you can add a WHERE clause before the GROUP BY:
where format(recipe_added, 'yyyyMM') = '202010'
This will work for SQL Server.
If you are using a different database then you can use a similar approach.
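For instance, a portable and index-friendly alternative to the FORMAT() filter (assuming recipe_added is a plain date column) is a range predicate, which works on most databases:
where recipe_added >= '2020-10-01'
  and recipe_added < '2020-11-01'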

Query for grouping of successful attempts when order matters

Let's say, for example, I have a db table Jumper for tracking high jumpers. It has three columns of interest: attempt_id, athlete, and result (a boolean for whether the jumper cleared the bar or not).
I want to write a query that will compare all athletes' performance across different attempts yielding a table with this information: attempt number, number of cleared attempts, total attempts. In other words, what is the chance that an athlete will clear the bar on x attempt.
What is the best way of writing this query? It is trickier than it would seem at first, because you need to determine the attempt number for each athlete before you can compute the overall totals.
I would prefer answers be written with Django ORM, but SQL will also be accepted.
Edit: To be clear, I need it to be grouped by attempt, not by athlete. So it would be all athletes' combined x attempt.
You could solve it using SQL:
SELECT t.attempt_id,
SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
COUNT(*) AS total
FROM Jumper t
GROUP BY t.attempt_id
EDIT: If attempt_id is just a sequence, and you want to use it to calculate the attempt number for each jumper, you could use this query instead:
SELECT t.attempt_number,
       SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
       COUNT(*) AS total
FROM (SELECT s.*,
             ROW_NUMBER() OVER (PARTITION BY athlete
                                ORDER BY attempt_id) AS attempt_number
      FROM Jumper s) t
GROUP BY t.attempt_number
This way, you group every first attempt from all athletes, every second attempt from all athletes, and so on...

Add up all occurrences when a condition exists

I am totally stuck on the best way to write the SQL for a report I have to produce. We use Sybase ASA, embedded in an application. The query needs to produce the following output:
Media Server | Total Number of Backups | Volume Size (KB) | Average Throughput | Number of Successful Jobs | Success %
Each media server would be unique.
I am having an issue with getting the Number of successful jobs and then determining the % of Success.
Here is the code that I have:
SELECT
    dmj.name AS "Media Server",
    CAST(SUM(dj.bytesWritten/1024/1024) as decimal(20,2)) as "Volume(MB)",
    COUNT(distinct dj.id) AS "Total Number of Jobs",
    CAST(AVG(dj.throughput) as decimal(10,2)) AS "Throughput (KB/sec)",
    CASE
        WHEN dj.statusCode = '0'
        THEN COUNT(dj.statusCode)
    END AS "Number of Successful Jobs"
FROM domain_JobArchive dj
INNER JOIN domain_MediaServer dmj
    ON dj.mediaServerName = dmj.name
WHERE DATEDIFF(day, UtcBigIntToNomTime(dj.endTime), GETDATE()) <= 7
    AND dj.Type != '17'
    AND dj.statusCode = 0
GROUP BY dmj.name, dj.statusCode
I am also still unsure how to show the % of success.
Thanks.
So it's:
SUM(CASE when dj.statusCode = '0' then 1 else 0 end)
to get a count of successes, and:
SUM(CASE WHEN dj.statusCode = '0' THEN 1 else 0 END) * 100.0/ COUNT(*)
to get the percentage of successes. As before, you won't get what you want by putting aggregate functions like COUNT(*) inside the CASE expression in this situation (I think Sybase will accept it, but you'll get the overall count and, I think, break the GROUP BY; don't do it here).
You have a further problem: why are you grouping by statusCode? That should go. You're looking at all jobs per media server, not separating them into success and failure; rather, you're calculating stats on the success/failure rates. So just group on the server name.
Furthermore, what's this:
COUNT(distinct dj.id)
I'd have thought you'd have a distinct list already; if you have to do that, it suggests you've got multiple rows per job, which will break your stats.
ALSO:
WHERE DATEDIFF(day, UtcBigIntToNomTime(dj.endTime), GETDATE()) <= 7
AND dj.Type != '17'
AND dj.statusCode = 0
You're restricting to successful jobs! I think you should start again: first write the SQL that selects the jobs, and make sure it is right. If you need failure rates, you'll have to select failures too.
Then use the SUM(CASE ...) technique to calculate the results you need, with the GROUP BY.
I think you need to build the query up again carefully, because at the moment there are too many problems.
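To illustrate, the rebuilt query might look something like this (a sketch only, reusing the table, column, and function names from the question; untested on Sybase ASA):
SELECT dmj.name AS "Media Server",
       COUNT(*) AS "Total Number of Jobs",
       CAST(SUM(dj.bytesWritten / 1024.0 / 1024.0) AS decimal(20,2)) AS "Volume (MB)",
       CAST(AVG(dj.throughput) AS decimal(10,2)) AS "Throughput (KB/sec)",
       SUM(CASE WHEN dj.statusCode = 0 THEN 1 ELSE 0 END) AS "Number of Successful Jobs",
       SUM(CASE WHEN dj.statusCode = 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS "Jobs Success %"
FROM domain_JobArchive dj
INNER JOIN domain_MediaServer dmj ON dj.mediaServerName = dmj.name
WHERE DATEDIFF(day, UtcBigIntToNomTime(dj.endTime), GETDATE()) <= 7
  AND dj.Type != '17'
  -- note: no statusCode filter here, so failed jobs are counted too
GROUP BY dmj.name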

Return NULL instead of 0 when using COUNT(column) SQL Server

I have a query which is running fine, and it's doing two types of work, COUNT and SUM.
Something like:
select
id,
Count (contracts) as countcontracts,
count(something1),
count(something1),
count(something1),
sum(cost) as sumCost
from
table
group by
id
My problem is: if there is no contract for a given ID, it will return 0 for COUNT and NULL for SUM. I want to see NULL instead of 0.
I was thinking about case when Count(contracts) = 0 then null else Count(contracts) end, but I don't want to do it this way because I have more than 12 COUNT positions in the query and it's processing a big number of records, so I think it may slow down query performance.
Is there any other way to replace 0 with NULL?
Try this:
select NULLIF(Count(something), 0)
Here are three methods:
1. (case when count(contracts) > 0 then count(contracts) end) as countcontracts
2. sum(case when contracts is not null then 1 end) as countcontracts
3. nullif(count(contracts), 0)
All three of these require writing more complicated expressions. However, this really isn't that difficult. Just copy the line multiple times, and change the name of the variable on each one. Or, take the current query, put it into a spreadsheet and use spreadsheet functions to make the transformation. Then copy the function down. (Spreadsheets are really good code generators for repeated lines of code.)
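For example, applying method 3 to the original query might look like this (a sketch; mytable stands in for the real table name):
select
    id,
    nullif(count(contracts), 0) as countcontracts,
    nullif(count(something1), 0) as countsomething1,
    sum(cost) as sumCost  -- SUM already returns NULL when every cost is NULL
from mytable
group by id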

Which is faster: Sum(Case When) Or Group By/Count(*)?

I can write
Select
Sum(Case When Resposta.Tecla = 1 Then 1 Else 0 End) Valor1,
Sum(Case When Resposta.Tecla = 2 Then 1 Else 0 End) Valor2,
Sum(Case When Resposta.Tecla = 3 Then 1 Else 0 End) Valor3,
Sum(Case When Resposta.Tecla = 4 Then 1 Else 0 End) Valor4,
Sum(Case When Resposta.Tecla = 5 Then 1 Else 0 End) Valor5
From Resposta
Or
Select
Count(*)
From Resposta Group By Tecla
I tried this over a large number of rows and it seems to take about the same time.
Can anyone confirm this?
I believe the GROUP BY is better because there is no special-case handling; it can be optimized by the database engine.
The results may depend on the database engine you use. Maybe the one you are using optimizes the first query, understanding that it is like a GROUP BY!
You can try the EXPLAIN / EXPLAIN PLAN command to see how the engine is computing your queries, but with my Microsoft SQL Server 2008 I can only see a swap between two operations ("Compute Scalar" and "Aggregate").
I tried such queries on a database table:
SQL Server 2008
163,000 rows in the table
12 categories (Valor1 -> Valor12)
The results are quite different:
Group By: 2 seconds
Case When: 6 seconds!
So my choice is "Group By".
Another benefit is that the query is simpler to write!
What the DB does internally with the second query is practically the same as what you explicitly tell it to do with the first. There should be no difference in the execution plan and thus in the time the query takes. Taking this into account, clearly using the second query is better:
- it's much more flexible: when there are more values of Tecla, you don't need to change your query
- it's easier to understand: if you have a lot of values for Tecla, it'll be harder to read the first query and realize it just counts distinct values
- it's smaller: you're sending less information to the DB server, and it will probably parse the query faster, which is the only performance difference I see in these queries; this makes a difference, albeit a small one
Either one is going to have to read all rows from Resposta, so for any reasonably sized table, I'd expect the I/O cost to dominate - giving approximately the same overall runtime.
I'd generally use:
Select
Tecla,
Count(*)
From Resposta
Group By Tecla
If there's a reasonable chance that the range of Tecla values will change in the future.
In my opinion the GROUP BY will always be faster than SUM(CASE WHEN ...), because in your example the SUM version performs 5 different calculations, while with GROUP BY the DB will simply sort and count.
Imagine you have a bag with different coins and you need to know how many of each type of coin you have. You can do it these ways:
The SUM(CASE WHEN ...) way would be to compare each coin with predefined sample coins and do the math for each sample (add 1 or 0);
The GROUP BY way would be to sort the coins by their types and then count each group.
Which method would you prefer?
To compete fairly with COUNT(*), your first SQL should probably be:
Select
Sum(Case When Resposta.Tecla >= 1 AND Resposta.Tecla <=5 Then 1 Else 0 End) Valor
From Resposta
And to answer your question: I'm not noticing any difference at all in speed between SUM(CASE WHEN ...) and COUNT. I'm querying over 250,000 rows in PostgreSQL.
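As an aside, since those numbers came from PostgreSQL: PostgreSQL 9.4 and later also support the standard aggregate FILTER clause, which expresses the same conditional counts more readably than SUM(CASE ...):
Select
    Count(*) Filter (Where Tecla = 1) As Valor1,
    Count(*) Filter (Where Tecla = 2) As Valor2,
    Count(*) Filter (Where Tecla = 3) As Valor3,
    Count(*) Filter (Where Tecla = 4) As Valor4,
    Count(*) Filter (Where Tecla = 5) As Valor5
From Resposta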