Query for grouping of successful attempts when order matters - sql

Let's say, for example, I have a db table Jumper for tracking high jumpers. It has three columns of interest: attempt_id, athlete, and result (a boolean for whether the jumper cleared the bar or not).
I want to write a query that will compare all athletes' performance across different attempts, yielding a table with this information: attempt number, number of cleared attempts, total attempts. In other words: what is the chance that an athlete will clear the bar on the x-th attempt?
What is the best way of writing this query? It is trickier than it seems at first, because you need to determine the attempt number for each athlete before you can compute the final totals.
I would prefer answers be written with Django ORM, but SQL will also be accepted.
Edit: To be clear, I need it grouped by attempt, not by athlete. So it would be all athletes' x-th attempts combined.

You could solve it using SQL:
SELECT t.attempt_id,
       SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
       COUNT(*) AS total
FROM Jumper t
GROUP BY t.attempt_id
EDIT: If attempt_id is just a sequence, and you want to use it to calculate the attempt number for each jumper, you could use this query instead:
SELECT t.attempt_number,
       SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
       COUNT(*) AS total
FROM (SELECT s.*,
             ROW_NUMBER() OVER (PARTITION BY athlete
                                ORDER BY attempt_id) AS attempt_number
      FROM Jumper s) t
GROUP BY t.attempt_number
This way, you group every first attempt from all athletes, every second attempt from all athletes, and so on...
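For illustration, here is a minimal, self-contained sketch of how that grouping behaves (PostgreSQL syntax; the table definition and sample rows are hypothetical):
CREATE TABLE Jumper (
    attempt_id serial PRIMARY KEY,  -- insertion order doubles as attempt order
    athlete    text,
    result     boolean
);
INSERT INTO Jumper (athlete, result) VALUES
    ('ann', false), ('bob', true),  -- each athlete's first attempt
    ('ann', true),  ('bob', true);  -- each athlete's second attempt
-- The second query above then returns:
-- attempt_number | cleared | total
--              1 |       1 |     2
--              2 |       2 |     2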

Related

SQL multiple constrained counts in query

I am trying to get, with one query, multiple count results where each one is a subset of the previous one.
So my table would be called Recipe and has these columns:
recipe_num(Primary_key) decimal,
recipe_added date,
is_featured bit,
liked decimal
And what I want is to make a query that, for any particular month, returns results grouped by day with:
total recipes as total_recipes,
total recipes that were featured as featured_recipes,
total number of recipes that were featured and had more than 100 likes as liked_recipes
So, as you can see, they are all counts, each being a subset of the previous one.
Ideally I don't want to run separate SELECT COUNTs that each query the whole table, but rather derive each count from the previous one.
I am not very good at using COUNT with WHERE, HAVING, etc., and I'm not exactly sure how to do it; so far I have the following, which I managed via digging around here.
select
    recipe_added,
    count(*) total_recipes,
    count(case is_featured when 1 then 1 else null end) total_featured_recipes
from
    RECIPES
group by
    recipe_added
I am not exactly sure why I have to use CASE inside COUNT; I wasn't able to get it to work using WHERE, so I would like to know if that is possible as well.
Thanks
With a CASE expression inside COUNT() you are doing conditional aggregation and this is exactly what you need for this requirement:
select recipe_added,
       count(*) total_recipes,
       count(case when is_featured = 1 then 1 end) total_featured_recipes,
       count(case when is_featured = 1 and liked > 100 then 1 end) liked_recipes
from Recipes
group by recipe_added
There is no need for ELSE NULL, because the default behavior of a CASE expression is to return NULL when no branch matches.
If you want results for a specific month, say October 2020, you can add a WHERE clause before the GROUP BY:
where format(recipe_added, 'yyyyMM') = '202010'
This will work for SQL Server.
If you are using a different database then you can use a similar approach.
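If recipe_added is indexed, a half-open date range is a common alternative that avoids wrapping the column in a function (a sketch for the same month; adjust the literal format to your dialect):
where recipe_added >= '2020-10-01'
  and recipe_added <  '2020-11-01'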

Identifying parent records for many transactions

This is related to a question I asked previously, for which lag/lead was suggested. However, the data I'm working with are more complex than I first thought, so I need a more robust solution. This screenshot shows the issue I need to tackle:
Within a single serial number, a shipment event defines a new reference window. So records 2, 3, 4 relate to 1; record 6 relates to 5, and so forth. I need to mark the records whose BillToId doesn't match the parent shipment's.
I'm trying to understand whether I could even use the LAG function to compare records 2, 3, 4 back to 1 when the number of post-shipment events varies (duplicates are allowed). I was thinking I might be better off with another fact table that first identifies the parent rowid for each record.
So then my question becomes: how do I efficiently identify which shipment each row belongs to? Am I forced to run a subquery for each record? I'm working right now with over 2 million total rows, though I would later make this query part of the ETL process, so it would be processing smaller chunks of data.
Here is an approach that uses the cumulative sum functionality in SQL Server. The idea is to assign each "ship" activity a value of "1" and "0" for everything else. Then do a cumulative sum to identify each group that should have the same billtoid. After that, the ship information can be assigned to all records in the same group:
select rowid, dateid, billtoid, activitytypeid, serialnumber
from (select t.*,
max(case when activitytypeid = 'Ship' then billtoid end) over
(partition by serialnumber, cumships) as ship_billtoid
from (select t.*,
sum(case when activitytypeid = 'Ship' then 1 else 0 end) over
(partition by serialnumber order by rowid) as cumships
from t
) t
) t
where billtoid <> ship_billtoid;
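To see why this works, here is a hypothetical trace for one serial number (values invented for illustration, ordered by rowid):
-- rowid  activitytypeid  billtoid  cumships  ship_billtoid
--     1  Ship                 100         1            100
--     2  Return               100         1            100
--     3  Repair               200         1            100  <- flagged: billtoid <> ship_billtoid
--     4  Return               100         1            100
--     5  Ship                 300         2            300
--     6  Return               300         2            300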

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the WHERE clause, but I can't even get a start.
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
Uses the aggregate function min() as a window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or sub-select for a condition on the result of a window function.
Details about window function calls are in the manual. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers; I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the timeline, or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
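For reference, that index could be created like this (a sketch; the index name is arbitrary):
CREATE INDEX table1_athlete_date_idx ON Table1 (athlete, "date");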
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.

How to write an SQL query that retrieves high scores over a recent subset of scores -- see explanation

Given a table of responses with columns:
Username, LessonNumber, QuestionNumber, Response, Score, Timestamp
How would I run a query that returns which users got a score of 90 or better on their first attempt at every question in their last 5 lessons? "Last 5 lessons" is a limiting condition, rather than a requirement, so if they completed only 1 lesson but got all of their first attempts for each question right, then they should be included in the results. We just don't want to look back farther than 5 lessons.
About the data: Users may be on different lessons. Some users may have not yet completed five lessons (may only be on lesson 3 for example). Each lesson has a different number of questions. Users have different lesson paths, so they may skip some lesson numbers or even complete lessons out of sequence.
Since this seems to be a problem of transforming temporally non-uniform/discontinuous values into uniform/contiguous values per-user, I think I can solve the bulk of the problem with a couple ranking function calls. The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions completed is variable per-user.
So far...
As a starting point or hint at what may need to happen, I've transformed Timestamp into an "AttemptNumber" for each question, by using "row_number() over (partition by Username,LessonNumber,QuestionNumber order by Timestamp) as AttemptNumber".
I'm also trying to transform LessonNumber from an absolute value into a contiguous ranked value for individual users. I could use "dense_rank() over (partition by Username order by LessonNumber desc) as LessonRank", but that assumes the order lessons are completed corresponds with the order of LessonNumber, which is unfortunately not always the case. However, let's assume that this is the case, since I do have a way of producing such a number through a couple of joins, so I can use the dense_rank transform described to select the "last 5 completed lessons" (i.e. LessonRank <= 5).
For the >90 condition, I think I can transform the score into an integer so that it's "1" if >= 90, and "0" if < 90. I can then introduce a clause like "group by Username having SUM(Score)=COUNT(Score).", which will select only those users with all scores equal to 1.
Any solutions or suggestions would be appreciated.
You kind of gave away the solution:
SELECT DISTINCT Username
FROM Results
WHERE Username NOT in (
SELECT DISTINCT Username
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1 and Score < 90
)
Concerning the LessonRank, I used exactly what you described, since it is not clear how to order the lessons otherwise: the timestamp of the first attempt of the first question of a lesson? Or the timestamp of the first attempt of any question of a lesson? Or simply the first (or the most recent?) timestamp of any result of any question of a lesson?
The innermost SELECT adds the AttemptNumber and LessonRank columns, as provided by you.
The next SELECT retains only the results that would disqualify a user from the final list - all first attempts with an insufficient score in the last 5 lessons. We end up with a list of users we do not want to display in the final result.
Therefore, in the outermost SELECT, we can select all the users who are not in that exclusion list - basically all the other users who have answered any question.
EDIT: As so often, second try should be better...
One more EDIT:
Here's a version including your remarks in the comments.
SELECT Username
FROM
(
SELECT Username, CASE WHEN Score >= 90 THEN 1 ELSE 0 END AS QuestionScoredWell
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1
) as ff
Group BY Username
HAVING MIN(QuestionScoredWell) = 1
I used a Having clause with a MIN expression on the calculated QuestionScoredWell value.
When comparing the execution plans for both queries, this query is actually faster. Not sure though whether this is partially due to the low number of data rows in my table.
Random suggestions:
1
The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions is variable per-user.
is equivalent to
There exists no first attempt with a score below 90 in the most-recent 5 lessons
which strikes me as a little easier to grab with a NOT EXISTS subquery.
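For instance, a sketch of that NOT EXISTS formulation, reusing the ranked derived table from the accepted answer (column names as in the question):
SELECT DISTINCT r.Username
FROM Results r
WHERE NOT EXISTS (
    SELECT 1
    FROM (SELECT Username, Score,
                 row_number() over (partition by Username, LessonNumber, QuestionNumber
                                    order by Timestamp) as AttemptNumber,
                 dense_rank() over (partition by Username order by LessonNumber desc) as LessonRank
          FROM Results) f
    WHERE f.Username = r.Username
      AND f.LessonRank <= 5
      AND f.AttemptNumber = 1
      AND f.Score < 90
)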
2
First attempt is the same as where timestamp = (select min(timestamp) ... )
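For example, a sketch of that equivalence (assuming Timestamp is unique per user, lesson, and question; ties would return duplicates):
SELECT r.*
FROM Results r
WHERE r.Timestamp = (SELECT MIN(r2.Timestamp)
                     FROM Results r2
                     WHERE r2.Username       = r.Username
                       AND r2.LessonNumber   = r.LessonNumber
                       AND r2.QuestionNumber = r.QuestionNumber)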
You need to identify the top 5 lessons per user first, using the timestamp to prioritize lessons, then you can limit by score. Try:
Select username
from table t inner join
     (select top 5 username, lessonNumber
      from table
      order by timestamp desc) l
  on t.username = l.username and t.lessonNumber = l.lessonNumber
where t.score >= 90

How to get aggregates without using a nested sql query

I am writing a custom report from an Avamar (Postgresql) database which contains backup job history. My task is to display jobs that failed last night (based on status_code), and include that client's success ratio (jobs succeeded/total jobs run) over the past 30 days on the same line.
So the overall select just picks up clients that failed (status_code doesn't equal 30000, which is the success code). However, for each failed client from last night, I need to also know how many jobs have succeeded, and how many jobs total were started/scheduled in the past 30 days. (The time period part is simple, so I haven't included it in the code below, to keep it simple.)
I tried to do this without using a nested query, based on Hobodave's feedback on this similar question but I'm not quite able to nail it.
In the query below, I get the following error:
column "v_activities_2.client_name" must appear in the GROUP BY clause or be used in an aggregate function
Here's my (broken) query. I know the logic is flawed, but I'm coming up empty with how best to accomplish this. Thanks in advance for any guidance!
select
split_part(client_name,'.',1) as client_name,
bunchofothercolumnns,
round(
100.0 * (
((sum(CASE WHEN status_code=30000 THEN 1 ELSE 0 END))) /
((sum(CASE WHEN type='Scheduled Backup' THEN 1 ELSE 0 END))))
as percent_total
from v_activities_2
where
status_code<>30000
order by client_name
You need to define a GROUP BY if you have columns in the SELECT that do not have aggregate functions performed on them:
SELECT SPLIT_PART(t.client_name, '.', 1) AS client_name,
       SUM(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END) AS successes
FROM v_activities_2 t
GROUP BY SPLIT_PART(t.client_name, '.', 1)
ORDER BY client_name
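In PostgreSQL you can also refer to the output column by its ordinal position, which avoids repeating the expression (a minor variant of the GROUP BY above):
GROUP BY 1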
How do you expect the following to work:
SUM(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END) as successes
FROM v_activities_2
WHERE status_code <> 30000
You can't expect to count rows you're excluding.
Why avoid a nested query?
It seems the most logical / efficient solution here.
If you do this in one pass with no subqueries (only GROUP BYs), you will end up scanning the whole table (or joined tables) - which is not efficient, because only SOME clients failed last night.
Subqueries are not that bad, in general.
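For completeness, a sketch of the nested approach this answer defends (the past-30-days and last-night filters, which the question deliberately omitted, would go where the comments indicate):
SELECT split_part(f.client_name, '.', 1) AS client_name,
       h.percent_total
FROM v_activities_2 f
JOIN (SELECT client_name,
             round(100.0 * sum(CASE WHEN status_code = 30000 THEN 1 ELSE 0 END)
                   / count(*), 1) AS percent_total
      FROM v_activities_2
      -- WHERE <past-30-days filter>
      GROUP BY client_name) h
  ON h.client_name = f.client_name
WHERE f.status_code <> 30000  -- AND <last-night filter>
ORDER BY client_name;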