How to write an SQL query that retrieves high scores over a recent subset of scores (see explanation) - sql

Given a table of responses with columns:
Username, LessonNumber, QuestionNumber, Response, Score, Timestamp
How would I run a query that returns which users got a score of 90 or better on their first attempt at every question in their last 5 lessons? "Last 5 lessons" is a limiting condition, rather than a requirement, so if they completed only 1 lesson but got all of their first attempts for each question right, then they should be included in the results. We just don't want to look back farther than 5 lessons.
About the data: Users may be on different lessons. Some users may have not yet completed five lessons (may only be on lesson 3 for example). Each lesson has a different number of questions. Users have different lesson paths, so they may skip some lesson numbers or even complete lessons out of sequence.
Since this seems to be a problem of transforming temporally non-uniform/discontinuous values into uniform/contiguous values per-user, I think I can solve the bulk of the problem with a couple ranking function calls. The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions completed is variable per-user.
So far...
As a starting point or hint at what may need to happen, I've transformed Timestamp into an "AttemptNumber" for each question, by using "row_number() over (partition by Username,LessonNumber,QuestionNumber order by Timestamp) as AttemptNumber".
I'm also trying to transform LessonNumber from an absolute value into a contiguous ranked value for individual users. I could use "dense_rank() over (partition by Username order by LessonNumber desc) as LessonRank", but that assumes the order lessons are completed corresponds with the order of LessonNumber, which is unfortunately not always the case. However, let's assume that this is the case, since I do have a way of producing such a number through a couple of joins, so I can use the dense_rank transform described to select the "last 5 completed lessons" (i.e. LessonRank <= 5).
For the >90 condition, I think I can transform the score into an integer so that it's "1" if >= 90, and "0" if < 90. I can then introduce a clause like "group by Username having SUM(Score)=COUNT(Score).", which will select only those users with all scores equal to 1.
Any solutions or suggestions would be appreciated.

You kind of gave away the solution:
SELECT DISTINCT Username
FROM Results
WHERE Username NOT IN (
    SELECT DISTINCT Username
    FROM (
        SELECT
            r.Username, r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp,
            row_number() over (partition by r.Username, r.LessonNumber, r.QuestionNumber order by r.Timestamp) as AttemptNumber,
            dense_rank() over (partition by r.Username order by r.LessonNumber desc) as LessonRank
        FROM Results r
    ) as f
    WHERE LessonRank <= 5 and AttemptNumber = 1 and Score < 90
)
Concerning the LessonRank, I used exactly what you described, since it is not clear how to order the lessons otherwise: the timestamp of the first attempt of the first question of a lesson? The timestamp of the first attempt of any question of a lesson? Or simply the first (or the most recent?) timestamp of any result of any question of a lesson?
The innermost Select adds all the AttemptNumber and LessonRank as provided by you.
The next Select retains only the results which would disqualify a user to be in the final list - all first attempts with an insufficient score in the last 5 lessons. We end up with a list of users we do not want to display in the final result.
Therefore, in the outermost Select, we can select all the users which are not in the exclusion list. Basically all the other users which have answered any question.
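The exclusion-list approach can be sanity-checked end to end. Here is a minimal sketch using Python's sqlite3 (SQLite 3.25+ supports these window functions); the users "ann" and "bob" and their scores are made-up sample data, not from the question:

```python
import sqlite3

# Hypothetical data: "ann" scores >= 90 on every first attempt;
# "bob" fails one first attempt and should be excluded.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Results (
    Username TEXT, LessonNumber INTEGER, QuestionNumber INTEGER,
    Response TEXT, Score INTEGER, Timestamp INTEGER
);
INSERT INTO Results VALUES
    ('ann', 1, 1, 'x', 95, 1),
    ('ann', 1, 2, 'x', 90, 2),
    ('ann', 1, 2, 'x', 40, 3),  -- a later retry; must not matter
    ('bob', 1, 1, 'x', 80, 1),  -- failed first attempt
    ('bob', 1, 1, 'x', 99, 2);
""")

rows = con.execute("""
SELECT DISTINCT Username
FROM Results
WHERE Username NOT IN (
    SELECT DISTINCT Username
    FROM (
        SELECT r.Username, r.Score,
               ROW_NUMBER() OVER (PARTITION BY r.Username, r.LessonNumber, r.QuestionNumber
                                  ORDER BY r.Timestamp) AS AttemptNumber,
               DENSE_RANK() OVER (PARTITION BY r.Username
                                  ORDER BY r.LessonNumber DESC) AS LessonRank
        FROM Results r
    ) AS f
    WHERE LessonRank <= 5 AND AttemptNumber = 1 AND Score < 90
)
""").fetchall()
print(rows)  # [('ann',)]
```

Note that bob's later 99 does not rescue him: only rows with AttemptNumber = 1 can land in the exclusion list, and his first attempt was 80.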
EDIT: As so often, second try should be better...
One more EDIT:
Here's a version including your remarks in the comments.
SELECT Username
FROM (
    SELECT Username, CASE WHEN Score >= 90 THEN 1 ELSE 0 END AS QuestionScoredWell
    FROM (
        SELECT
            r.Username, r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp,
            row_number() over (partition by r.Username, r.LessonNumber, r.QuestionNumber order by r.Timestamp) as AttemptNumber,
            dense_rank() over (partition by r.Username order by r.LessonNumber desc) as LessonRank
        FROM Results r
    ) as f
    WHERE LessonRank <= 5 and AttemptNumber = 1
) as ff
GROUP BY Username
HAVING MIN(QuestionScoredWell) = 1
I used a Having clause with a MIN expression on the calculated QuestionScoredWell value.
When comparing the execution plans for both queries, this query is actually faster. Not sure though whether this is partially due to the low number of data rows in my table.
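For comparison, the HAVING MIN(...) = 1 variant can be exercised the same way. A sketch in Python's sqlite3 with invented sample rows (the Response column is omitted for brevity):

```python
import sqlite3

# Invented sample: "ann" passes every first attempt, "bob" fails one.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Results (Username TEXT, LessonNumber INT, "
            "QuestionNumber INT, Score INT, Timestamp INT)")
con.executemany("INSERT INTO Results VALUES (?,?,?,?,?)", [
    ('ann', 1, 1, 95, 1), ('ann', 2, 1, 92, 2),
    ('bob', 1, 1, 89, 1), ('bob', 1, 2, 99, 2),
])

rows = con.execute("""
SELECT Username
FROM (
    SELECT Username,
           CASE WHEN Score >= 90 THEN 1 ELSE 0 END AS QuestionScoredWell
    FROM (
        SELECT r.*,
               ROW_NUMBER() OVER (PARTITION BY r.Username, r.LessonNumber, r.QuestionNumber
                                  ORDER BY r.Timestamp) AS AttemptNumber,
               DENSE_RANK() OVER (PARTITION BY r.Username
                                  ORDER BY r.LessonNumber DESC) AS LessonRank
        FROM Results r
    ) AS f
    WHERE LessonRank <= 5 AND AttemptNumber = 1
) AS ff
GROUP BY Username
HAVING MIN(QuestionScoredWell) = 1
""").fetchall()
print(rows)  # [('ann',)]
```

MIN over the 0/1 flag is 1 only when every first attempt in scope scored 90 or better, which is exactly the "all scores pass" condition.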

Random suggestions:
1
The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions is variable per-user.
is equivalent to
There exists no first attempt with a score below 90 in the most-recent 5 lessons
which strikes me as a little easier to capture with a NOT EXISTS subquery.
2
First attempt is the same as where timestamp = (select min(timestamp) ... )
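That equivalence is easy to demonstrate; a sketch with Python's sqlite3 and one made-up user/question, where the first attempt (smallest Timestamp) scored 60:

```python
import sqlite3

# One user, one question, two attempts: the first attempt scored 60.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Results (Username TEXT, QuestionNumber INT, "
            "Score INT, Timestamp INT)")
con.executemany("INSERT INTO Results VALUES (?,?,?,?)", [
    ('ann', 1, 60, 10),  # first attempt
    ('ann', 1, 95, 20),  # retry
])

# "First attempt" expressed as timestamp = min(timestamp) per user/question.
first = con.execute("""
SELECT Score FROM Results r
WHERE r.Timestamp = (SELECT MIN(r2.Timestamp)
                     FROM Results r2
                     WHERE r2.Username = r.Username
                       AND r2.QuestionNumber = r.QuestionNumber)
""").fetchone()
print(first)  # (60,)
```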

You need to identify the top 5 lessons per user first, using the timestamp to prioritize lessons, then you can limit by score. Try:
SELECT t.username
FROM table t
INNER JOIN (
    SELECT TOP 5 username, lessonNumber
    FROM table
    ORDER BY timestamp DESC
) l
    ON t.username = l.username AND t.lessonNumber = l.lessonNumber
WHERE t.score >= 90

Related

Count equal date parts of the timestamp column ACROSS the groups by the said column

I would like to count duplicate date values. I already know that my field "WHEN" is a timestamp, so I have to cast it to type of date.
my actual query looks like this:
SELECT
u.USERNAME,
r."WHEN",
r.UPDATEINOUT,
case (r.UPDATEINOUT) when 0 then 0 when 1 then 1 else r.INOUT end INOUT
FROM
ATTENDANT r
LEFT JOIN
USERS u ON r.USERID = u.ID
where
u.USERNAME = 'rk' and (r.UPDATEINOUT = 1 or r.UPDATEINOUT = 0 or r.UPDATEINOUT is null)
group by
r."WHEN",
INOUT,
u.USERNAME,
r.UPDATEINOUT
order by
r."WHEN"
And this is the result:
Username  WHEN                      UPDATEINOUT  INOUT
rk        09.04.2018, 14:59:45.000  [null]       0
rk        09.04.2018, 14:59:51.000  [null]       1
rk        11.04.2018, 08:31:02.000  [null]       0
rk        11.04.2018, 12:06:52.000  [null]       1
rk        11.04.2018, 12:10:29.000  [null]       0
rk        11.04.2018, 12:23:09.000  [null]       1
rk        11.04.2018, 12:43:47.000  [null]       0
rk        11.04.2018, 17:07:40.000  [null]       1
Now I would like to count the duplicate dates, the result should look like this:
Username  WHEN                      UPDATEINOUT  INOUT  Count
rk        09.04.2018, 14:59:45.000  [null]       0      2
rk        09.04.2018, 14:59:51.000  [null]       1      2
rk        11.04.2018, 08:31:02.000  [null]       0      6
rk        11.04.2018, 12:06:52.000  [null]       1      6
rk        11.04.2018, 12:10:29.000  [null]       0      6
rk        11.04.2018, 12:23:09.000  [null]       1      6
rk        11.04.2018, 12:43:47.000  [null]       0      6
rk        11.04.2018, 17:07:40.000  [null]       1      6
When I add
count(cast(r."WHEN" as date))
then it shows me only one.
It seems there are answers suggesting "what to do", but none trying to explain why the results are what they are - what is actually happening.
It seems like the topic starter is being given a fish, not a fishing rod.
When I add count(cast(r."WHEN" as date)) then it shows me only one.
This is a bit funny. This line shows a novice's confusion between natural human language, like English, and mathematical set languages like SQL.
R.Kut reads his added line as "I want to count how many [distinct] values of the given expression there are".
But that is not what this command actually means in SQL. In SQL it means "I want to count how many rows there are where the given expression is not null."
And so there is actually NO DIFFERENCE between count(cast(r."WHEN" as date)) and count(r."WHEN") - those two parameter expressions are either both NULL or both NOT NULL. Hence the counts of those equally nullable parameters are equal too.
The very attempt to truncate the value of the aggregate function's parameter, as if that could alter its nullability, is a misconception. I was there too. It takes time to get used to what aggregates over mathematical sets really mean, and to the fact that you are not reading English when you are reading SQL.
Frankly, you could just use count(1) here, removing not only the typecast but the column itself - it would still be the same, as it is rows, not values, that are being counted. Unless there are rows where "WHEN" IS NULL - those would be accounted for by group by but not by count. After you read and think through the next section, come back and toy with http://sqlfiddle.com/#!9/ee09a/7
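The nullability point is easy to verify. A sketch in Python's sqlite3 (SQLite's CAST semantics differ from Firebird's, but NULL stays NULL through a cast in both, which is all that matters here); the three sample timestamps are invented:

```python
import sqlite3

# COUNT(expr) counts non-NULL rows, so wrapping the column in a CAST cannot
# change the count; COUNT(1) counts every row regardless of NULLs.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE t ("WHEN" TEXT)')
con.executemany("INSERT INTO t VALUES (?)", [
    ("2018-04-09 14:59:45",),
    ("2018-04-09 14:59:51",),
    (None,),  # one NULL row to expose the difference
])

row = con.execute("""
SELECT COUNT("WHEN"), COUNT(CAST("WHEN" AS date)), COUNT(1) FROM t
""").fetchone()
print(row)  # (2, 2, 3) -- the cast changes nothing; only COUNT(1) sees the NULL row
```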
Now, there is another parameter to the count function that I kind of mentioned above. It is that "distinct" parameter.
Remark: One may say distinct is a keyword of the SQL language, not a parameter for a function, but de facto, if not de jure, it changes the way the function works, so in my perception it is a parameter, given in that unusual SQLish way parameters are often given to functions. Or, another way to reason about it: it may be seen as part of the function name, if we imagine having two functions to choose from, count and count-distinct.
So, the topic starter could have added count(distinct cast(r."WHEN" as date)) instead, and...
....and see that nothing changed. Because while this time he would really tell the server to count rows with not-null (always only not-null!) and distinct values, the counting still happens within the group.
And what are those groups?
group by
r."WHEN",
INOUT,
u.USERNAME,
r.UPDATEINOUT
See, across the groups we have rows with distinct time-and-date values of "WHEN". And other columns too, but I am not focusing on them. The thing here is that within every group both the time and date parts of "WHEN" are the same. And "being the same" means "there is one distinct value, repeated again and again". And if there is only one distinct time-and-date value, then the reduced only-time or only-date values have the same counts (reducing a value can only make previously different values equal, not make previously equal values different).
Granted, in other situations, when the counting goes over columns not included in the group (or when there is no group by clause at all), the result may differ. There, count(distinct ...) might do what the topic starter expects it to do.
http://sqlfiddle.com/#!9/0d65bf/7 - example.
However one has to keep in mind:
that it happens at the expense of extra sorting and grouping work done by the server, potentially making the query slow or consuming a lot of memory
still it would work within the group (the groups would then just contain different values for the counted columns)
....just sometimes the group is the whole query result set (simplest variant: the whole table), if the programmer did not set it to be different by adding a group by clause :-D
The code of the example, linked above:
create table X(a integer, b integer);
insert into X values (1,1);
insert into X values (1,2);
insert into X values (1,2);
commit;
select count(distinct b) from x group by a
-- Result: 1 row: 2
-- or if the whole table is the group
select count(distinct b) from x
-- Result: 1 row: 2
-- but if the group includes the counted column
-- then every group would contain EXACTLY ONE
-- row with a not-null distinct value
select count(distinct b) from x group by b
-- Result: 2 rows: 1 and 1
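The example above happens to be valid SQLite as well, so it can be replayed directly with Python's sqlite3:

```python
import sqlite3

# Replay of the count(distinct b) example from the answer above.
con = sqlite3.connect(":memory:")
con.executescript("""
create table X(a integer, b integer);
insert into X values (1,1);
insert into X values (1,2);
insert into X values (1,2);
""")

by_a  = con.execute("select count(distinct b) from x group by a").fetchall()
whole = con.execute("select count(distinct b) from x").fetchall()
by_b  = con.execute("select count(distinct b) from x group by b").fetchall()
print(by_a)   # [(2,)]  -- one group, two distinct b values
print(whole)  # [(2,)]  -- the whole table is the group
print(by_b)   # [(1,), (1,)]  -- each group holds exactly one distinct b
```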
Here we come to yet another modification to toy with.
group by
cast(r."WHEN" as date), -- <====
INOUT,
u.USERNAME,
r.UPDATEINOUT
Now, this time we are telling the server to assemble the groups, in which only "date" part of "WHEN" is the same, while "time" part might differ.
However....
I think not every SQL server supports expressions in the group by clause.
Unless there is an index made on exactly that expression, the server would have to do much extra work, probably resorting to a full scan and temporary-file grouping, which would make the query heavy and slow.
And of course you would HAVE to modify the columns list accordingly.
SELECT
u.USERNAME,
cast(r."WHEN" as date), -- <=== no more raw r."WHEN"
r.UPDATEINOUT,
Because, well, you just cannot have "One True Value" for WHEN in the row, if you explicitly asked server to group different values of it.
Here you may come to the conclusion, that there is no simple straightforward way to have both groups and across-the-groups aggregates. Or to put it differently, to have two different sets of group by in one query (I do not want to mention UNION here, mkay?).
You would need ONE set of group by criteria for counting rows with the same date part (but maybe different time part), and yet ANOTHER group by criteria for picking and rendering groups different by both date-and-time.
In plain and simple SQL 1999 that means you would have to make TWO selects to do both groupings one way or another, and here Yogesh's and Gordon's answers kick in.
As Gordon mentioned, post-99 SQL brought window functions, which could let you have both criteria sets in one select, but they are not available in Firebird 2.x anyway.
You could upgrade to Firebird 3.0 and use window functions.
Alternatively, you can use CTEs
with t as ( < your query here > )
select t.*, tw.cnt
from t join
(select cast("WHEN" as date) as wdate, count(*) as cnt
from t
group by cast("WHEN" as date)
) tw
on cast(t."WHEN" as date) = tw.wdate;
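This CTE pattern can be sketched runnably with Python's sqlite3, using a minimal one-column stand-in for the joined ATTENDANT/USERS result and SQLite's date() function in place of CAST(... AS date); the timestamps are invented:

```python
import sqlite3

# Per-date counts joined back onto the detail rows, CTE-style.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE t ("WHEN" TEXT)')
con.executemany("INSERT INTO t VALUES (?)", [
    ("2018-04-09 14:59:45",), ("2018-04-09 14:59:51",),
    ("2018-04-11 08:31:02",), ("2018-04-11 12:06:52",),
    ("2018-04-11 12:10:29",),
])

rows = con.execute("""
WITH tw AS (
    SELECT date("WHEN") AS wdate, COUNT(*) AS cnt
    FROM t
    GROUP BY date("WHEN")
)
SELECT t."WHEN", tw.cnt
FROM t JOIN tw ON date(t."WHEN") = tw.wdate
ORDER BY t."WHEN"
""").fetchall()
print(rows)  # the 2018-04-09 rows carry cnt 2, the 2018-04-11 rows cnt 3
```

Each detail row keeps its full timestamp while the joined aggregate counts rows sharing its date, which is exactly the "two different group by sets" trick.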
I think you need a subquery:
SELECT u.USERNAME, r."WHEN", r.UPDATEINOUT,
case (r.UPDATEINOUT) when 0 then 0 when 1 then 1 else r.INOUT end INOUT,
(SELECT COUNT(*)
 FROM ATTENDANT r1
 WHERE cast(r1."WHEN" as date) = cast(r."WHEN" as date)
) as Count
FROM ATTENDANT r LEFT JOIN
USERS u
ON r.USERID = u.ID
WHERE u.USERNAME = 'rk' AND
(r.UPDATEINOUT = 1 or r.UPDATEINOUT = 0 or r.UPDATEINOUT is null)
GROUP BY r."WHEN", INOUT, u.USERNAME, r.UPDATEINOUT
ORDER BY r."WHEN";

Nested subquery in Access alias causing "enter parameter value"

I'm using Access (I normally use SQL Server) for a little job, and I'm getting "enter parameter value" for Night.NightId in the statement below that has a subquery within a subquery. I expect it would work if I wasn't nesting it two levels deep, but I can't think of a way around it (query ideas welcome).
The scenario is pretty simple, there's a Night table with a one-to-many relationship to a Score table - each night normally has 10 scores. Each score has a bit field IsDouble which is normally true for two of the scores.
I want to list all of the nights, with a number next to each representing how many of the top 2 scores were marked IsDouble (would be 0, 1 or 2).
Here's the SQL, I've tried lots of combinations of adding aliases to the column and the tables, but I've taken them out for simplicity below:
select Night.*
,
( select sum(IIF(IsDouble,1,0)) from
(SELECT top 2 * from Score where NightId=Night.NightId order by Score desc, IsDouble asc, ID)
) as TopTwoMarkedAsDoubles
from Night
This is a bit of speculation. However, some databases have issues with correlation conditions in multiply nested subqueries. MS Access might have this problem.
If so, you can solve this by using aggregation with a where clause that chooses the top two values:
select s.nightid,
sum(IIF(IsDouble, 1, 0)) as TopTwoMarkedAsDoubles
from Score as s
where s.id in (select top 2 s2.id
from score as s2
where s2.nightid = s.nightid
order by s2.score desc, s2.IsDouble asc, s2.id
)
group by s.nightid;
If this works, it is a simple matter to join Night back in to get the additional columns.
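The aggregation-with-top-two idea can be sketched outside Access with Python's sqlite3, using LIMIT 2 in place of Access's TOP 2 and CASE in place of IIF; the two sample nights and their scores are invented:

```python
import sqlite3

# Count how many of each night's top two scores are marked IsDouble.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Score (ID INTEGER PRIMARY KEY, NightId INT, "
            "Score INT, IsDouble INT)")
con.executemany("INSERT INTO Score (NightId, Score, IsDouble) VALUES (?,?,?)", [
    (1, 100, 1), (1, 90, 0), (1, 80, 1),  # night 1: top two -> one double
    (2, 70, 1),  (2, 60, 1), (2, 50, 0),  # night 2: top two -> two doubles
])

rows = con.execute("""
SELECT s.NightId,
       SUM(CASE WHEN s.IsDouble THEN 1 ELSE 0 END) AS TopTwoMarkedAsDoubles
FROM Score AS s
WHERE s.ID IN (SELECT s2.ID FROM Score AS s2
               WHERE s2.NightId = s.NightId
               ORDER BY s2.Score DESC, s2.IsDouble ASC, s2.ID
               LIMIT 2)
GROUP BY s.NightId
ORDER BY s.NightId
""").fetchall()
print(rows)  # [(1, 1), (2, 2)]
```

The correlated IN subquery restricts each night's rows to its own top two before the aggregation runs, so the SUM never sees a third-placed double.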
Your subquery can only see one level above it, so Night.NightId is totally unknown to it, hence why you are being prompted to enter a value. You can use a GROUP BY to get the value you want for each NightId, then correlate that back to the original Night table.
Select *
From Night
left join (
    Select N.NightId
         , sum(IIF(S.IsDouble, 1, 0)) as [Number of Doubles]
    from Night N
    inner join Score S
        on S.NightId = N.NightId
    group by N.NightId
) NightsWithScores
    on Night.NightId = NightsWithScores.NightId
Because of the IIF(S.IsDouble,1,0) I don't see the point in using top.

Access 2013 - Query not returning correct Number of Results

I am trying to get the query below to return the TWO lowest PlayedTo results for each PlayerID.
select
x1.PlayerID, x1.RoundID, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and x2.PlayedTo <= x1.PlayedTo
) <3
order by PlayerID, PlayedTo, RoundID;
Unfortunately at the moment it doesn't return a result when there is a tie for one of the lowest scores. A copy of the dataset and code is here http://sqlfiddle.com/#!3/4a9fc/13.
PlayerID 47 has only one result returned as there are two different RoundID's that are tied for the second lowest PlayedTo. For what I am trying to calculate it doesn't matter which of these two it returns as I just need to know what the number is but for reporting I ideally need to know the one with the newest date.
One other slight problem with the query is the time it takes to run. It takes about 2 minutes in Access to run through the 83 records but it will need to run on about 1000 records when the database is fully up and running.
Any help will be much appreciated.
Resolve the tie by adding DatePlayed to your internal sorting (you wanted the one with the newest date anyway):
select
x1.PlayerID, x1.RoundID
, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and (x2.PlayedTo < x1.PlayedTo
or x2.PlayedTo = x1.PlayedTo
and x2.DatePlayed >= x1.DatePlayed
)
) <3
order by PlayerID, PlayedTo, RoundID;
For performance create an index supporting the join condition. Something like:
create index P_7to8Calcs__PlayerID_PlayedTo on P_7to8Calcs(PlayerID, PlayedTo);
Note: I used your SQLFiddle since I do not have Access available here.
Edit: In case the index does not improve performance enough, you might want to try the following query using window functions (which avoids nested sub-query). It works in your SQLFiddle but I am not sure if this is supported by Access.
select x1.PlayerID, x1.RoundID, x1.PlayedTo
from (
select PlayerID, RoundID, PlayedTo
, RANK() OVER (PARTITION BY PlayerId ORDER BY PlayedTo, DatePlayed DESC) AS Rank
from P_7to8Calcs
) as x1
where x1.RANK < 3
order by PlayerID, PlayedTo, RoundID;
See OVER clause and Ranking Functions for documentation.
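The tie behaviour of RANK() is easy to demonstrate; a sketch in Python's sqlite3 with invented rounds for player 47, two of which tie for second-lowest PlayedTo:

```python
import sqlite3

# Two lowest PlayedTo per player, ties broken toward the newest DatePlayed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE P_7to8Calcs (PlayerID INT, RoundID INT, "
            "PlayedTo REAL, DatePlayed TEXT)")
con.executemany("INSERT INTO P_7to8Calcs VALUES (?,?,?,?)", [
    (47, 1, 5.0, '2014-01-01'),
    (47, 2, 7.0, '2014-01-08'),  # tied for second-lowest
    (47, 3, 7.0, '2014-01-15'),  # tied for second-lowest, newer date
    (47, 4, 9.0, '2014-01-22'),
])

rows = con.execute("""
SELECT PlayerID, RoundID, PlayedTo
FROM (
    SELECT PlayerID, RoundID, PlayedTo,
           RANK() OVER (PARTITION BY PlayerID
                        ORDER BY PlayedTo, DatePlayed DESC) AS rnk
    FROM P_7to8Calcs
) x
WHERE rnk < 3
ORDER BY PlayedTo, RoundID
""").fetchall()
print(rows)  # [(47, 1, 5.0), (47, 3, 7.0)] -- the newer of the tied rounds wins
```

With DatePlayed DESC in the ORDER BY, the newer tied round gets rank 2 and the older one rank 3, so exactly two rows survive per player.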

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the where clause but I can't even get a start
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
Uses the aggregate function min() as a window function to get the minimum place of the last three rows plus the current one.
The then-trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or sub-select to put a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers, I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
#Craig already mentioned the index to make this fast.
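SQLite also supports this frame clause, so the query can be replayed against the question's sample data via Python's sqlite3 (with the column renamed to day, as suggested above):

```python
import sqlite3

# min(place) over each row and its 3 predecessors per athlete.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (day TEXT, athlete TEXT, place INT)")
con.executemany("INSERT INTO t VALUES (?,?,?)", [
    ('2013-06-15', 'Johnson', 1), ('2013-06-16', 'Johnson', 3),
    ('2013-06-17', 'Johnson', 4), ('2013-06-18', 'Johnson', 3),
    ('2013-06-19', 'Johnson', 2), ('2013-06-20', 'Johnson', 4),
    ('2013-06-21', 'Johnson', 1), ('2013-06-22', 'Johnson', 2),
])

rows = con.execute("""
SELECT day, athlete, place
FROM (
    SELECT *, MIN(place) OVER (PARTITION BY athlete ORDER BY day
                               ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS best
    FROM t
) sub
WHERE best > 1
ORDER BY day
""").fetchall()
print(rows)  # [('2013-06-19', 'Johnson', 2), ('2013-06-20', 'Johnson', 4)]
```

This reproduces exactly the two rows the question expects: the only events preceded by three straight non-wins.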
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs #Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.
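The same sample data can be run through this row_number + NOT EXISTS variant with Python's sqlite3 (column named day rather than date, since date is better avoided as an identifier):

```python
import sqlite3

# For each row, exclude it if a win (place = 1) exists among it and
# the 3 preceding rows for the same athlete.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table1 (day TEXT, athlete TEXT, place INT)")
con.executemany("INSERT INTO Table1 VALUES (?,?,?)", [
    ('2013-06-15', 'Johnson', 1), ('2013-06-16', 'Johnson', 3),
    ('2013-06-17', 'Johnson', 4), ('2013-06-18', 'Johnson', 3),
    ('2013-06-19', 'Johnson', 2), ('2013-06-20', 'Johnson', 4),
    ('2013-06-21', 'Johnson', 1), ('2013-06-22', 'Johnson', 2),
])

rows = con.execute("""
WITH CTE AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY athlete ORDER BY day) AS rn, *
    FROM Table1
)
SELECT day, athlete, place
FROM CTE cur
WHERE NOT EXISTS (
    SELECT 1 FROM CTE prev
    WHERE prev.place = 1
      AND prev.athlete = cur.athlete
      AND prev.rn BETWEEN cur.rn - 3 AND cur.rn
)
ORDER BY day
""").fetchall()
print(rows)  # [('2013-06-19', 'Johnson', 2), ('2013-06-20', 'Johnson', 4)]
```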

SQL conundrum, how to select latest date for part, but only 1 row per part (unique)

I am trying to wrap my head around this one this morning.
I am trying to show inventory status for parts (for our products) and this query only becomes complex if I try to return all parts.
Let me lay it out:
single table inventoryReport
I have a distinct list of X parts I wish to display, the result of which must be X # of rows (1 row per part showing latest inventory entry).
table is made up of dated entries of inventory changes (so I only need the LATEST date entry per part).
all data contained in this single table, so no joins necessary.
Currently for 1 single part, it is fairly simple and I can accomplish this by doing the following sql (to give you some idea):
SELECT TOP (1) ldDate, ptProdLine, inPart, inSite, inAbc, ptUm, inQtyOh + inQtyNonet AS in_qty_oh, inQtyAvail, inQtyNonet, ldCustConsignQty, inSuppConsignQty
FROM inventoryReport
WHERE (ldPart = 'ABC123')
ORDER BY ldDate DESC
that gets me my TOP 1 row, so it is simple per part; however, I need to show all X (let's say 30) parts. So I need 30 rows with that result. Of course the simple solution would be to loop X# of sql calls in my code (but it would be costly) and that would suffice, but for this purpose I would love to work this SQL some more to reduce the X# of calls back to the db down to just 1 query.
From what I can see here I need to keep track of the latest date per item somehow while looking for my result set.
I would ultimately do a
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
to limit the parts I need. Hopefully I made my question clear enough. Let me know if you have an idea. I cannot do a DISTINCT as the rows are not the same, the date needs to be the latest, and I need a maximum of X rows.
Thoughts? I'm stuck...
SELECT *
FROM (SELECT i.*,
             ROW_NUMBER() OVER (PARTITION BY ldPart ORDER BY ldDate DESC) r
      FROM inventoryReport i
      WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
     ) t
WHERE r = 1
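A runnable sketch of this latest-row-per-part pattern in Python's sqlite3, with an invented cut-down schema (three columns stand in for the full inventoryReport row):

```python
import sqlite3

# Number rows per part, newest ldDate first, and keep row 1 of each partition.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inventoryReport (ldPart TEXT, ldDate TEXT, inQtyOh INT)")
con.executemany("INSERT INTO inventoryReport VALUES (?,?,?)", [
    ('ABC123', '2014-01-01', 5), ('ABC123', '2014-02-01', 7),
    ('BFD21',  '2014-01-15', 3),
])

rows = con.execute("""
SELECT ldPart, ldDate, inQtyOh
FROM (
    SELECT i.*, ROW_NUMBER() OVER (PARTITION BY ldPart ORDER BY ldDate DESC) AS r
    FROM inventoryReport i
    WHERE ldPart IN ('ABC123', 'BFD21')
) t
WHERE r = 1
ORDER BY ldPart
""").fetchall()
print(rows)  # [('ABC123', '2014-02-01', 7), ('BFD21', '2014-01-15', 3)]
```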
EDIT: Be sure to test the performance of each solution. As pointed out in this question, the CTE method may outperform using ROW_NUMBER.
;with cteMaxDate as (
select ldPart, max(ldDate) as MaxDate
from inventoryReport
group by ldPart
)
SELECT md.MaxDate, ir.ptProdLine, ir.inPart, ir.inSite, ir.inAbc, ir.ptUm, ir.inQtyOh + ir.inQtyNonet AS in_qty_oh, ir.inQtyAvail, ir.inQtyNonet, ir.ldCustConsignQty, ir.inSuppConsignQty
FROM cteMaxDate md
INNER JOIN inventoryReport ir
on md.ldPart = ir.ldPart
and md.MaxDate = ir.ldDate
You need to join onto a sub-query:
SELECT i.ldPart, x.LastDate, i.inAbc
FROM inventoryReport i
INNER JOIN (Select ldPart, Max(ldDate) As LastDate FROM inventoryReport GROUP BY ldPart) x
on i.ldPart = x.ldPart and i.ldDate = x.LastDate
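The groupwise-max join gives the same result as the ROW_NUMBER approach; a sketch with Python's sqlite3 and the same invented cut-down schema:

```python
import sqlite3

# Join each part's rows to its own MAX(ldDate) to keep only the latest entry.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inventoryReport (ldPart TEXT, ldDate TEXT, inQtyOh INT)")
con.executemany("INSERT INTO inventoryReport VALUES (?,?,?)", [
    ('ABC123', '2014-01-01', 5), ('ABC123', '2014-02-01', 7),
    ('BFD21',  '2014-01-15', 3),
])

rows = con.execute("""
SELECT i.ldPart, x.LastDate, i.inQtyOh
FROM inventoryReport i
INNER JOIN (SELECT ldPart, MAX(ldDate) AS LastDate
            FROM inventoryReport
            GROUP BY ldPart) x
    ON i.ldPart = x.ldPart AND i.ldDate = x.LastDate
ORDER BY i.ldPart
""").fetchall()
print(rows)  # [('ABC123', '2014-02-01', 7), ('BFD21', '2014-01-15', 3)]
```

One caveat of this variant: if a part has two rows sharing the same maximum ldDate, both survive the join, whereas ROW_NUMBER would pick exactly one.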