I have to query a table with a few million rows, and I want to do it as efficiently as possible.
Let's suppose that we want to control access to a movie theater with multiple screening rooms and record it like this:
AccessRecord
(TicketId,
TicketCreationTimestamp,
TheaterId,
ShowId,
MovieId,
SeatId,
CheckInTimestamp)
To simplify, the 'Id' columns are of data type 'bigint' and the 'Timestamp' columns are 'datetime'. Tickets are sold at any time and people enter the theater in random order. The primary key (and therefore unique) is TicketId.
For each Movie, Theater and Show (time), I want to get the AccessRecord info of the first and the last person who entered the theater to see a movie. If two check-ins happen at the same time, I need just one of them, whichever.
My solution would be to concatenate the PK and the grouped column in a subquery to get the row:
select
AccessRecord.*
from
AccessRecord
inner join(
select
MAX(CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId)) as MaxKey,
MIN(CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId)) as MinKey
from
AccessRecord
group by
MovieId,
TheaterId,
ShowId
) as MaxAccess
on CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId) = MaxKey
or CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId) = MinKey
Conversion style 121 is the canonical datetime format, which results in yyyy-mm-dd hh:mi:ss.mmm (24h), so sorting it as a string gives the same order as sorting it as a datetime.
As you can see, this join is not very efficient. Any ideas?
Update with how I tested the different solutions:
I've tested all your answers on a real SQL Server 2008 R2 database with a table of over 3M rows to choose the right one.
If I get only the first or the last person who accessed:
Joe Taras's solution takes 10 secs.
GarethD's solution takes 21 secs.
If I run the same query but with the result ordered by the grouping columns:
Joe Taras's solution takes 10 secs.
GarethD's solution takes 46 secs.
If I get both (the first and the last) people who accessed, with an ordered result:
Joe Taras's solution (using a union) takes 19 secs.
GarethD's solution takes 49 secs.
The rest of the solutions (even mine) took more than 60 secs in the first test, so I canceled them.
Try this:
select a.*
from AccessRecord a
where not exists(
select 'next'
from AccessRecord a2
where a2.movieid = a.movieid
and a2.theaterid = a.theaterid
and a2.showid = a.showid
and a2.checkintimestamp > a.checkintimestamp
)
This way you pick the row with the latest timestamp for the same movie, theater and show.
The ticket (I suppose) is different for each row.
Using analytic functions may speed up the query; more specifically, ROW_NUMBER should reduce the number of reads:
WITH CTE AS
( SELECT TicketId,
TicketCreationTimestamp,
TheaterId,
ShowId,
MovieId,
SeatId,
CheckInTimestamp,
RowNumber = ROW_NUMBER() OVER(PARTITION By MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp, TicketID),
RowNumber2 = ROW_NUMBER() OVER(PARTITION By MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp DESC, TicketID)
FROM AccessRecord
)
SELECT TicketId,
TicketCreationTimestamp,
TheaterId,
ShowId,
MovieId,
SeatId,
CheckInTimestamp
FROM CTE
WHERE RowNumber = 1
OR RowNumber2 = 1;
However, as always with optimisation, you are best placed to tune your own queries: you have the data to test with and all the execution plans. Try the query with different indexes; if you display the actual execution plan, SSMS will even suggest indexes to help your query. I would expect that an index on (MovieId, TheaterId, ShowId) that includes CheckInTimestamp as a non-key column would help.
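As a rough illustration of the ROW_NUMBER approach, here is a minimal sketch that runs the same idea against an in-memory SQLite database through Python's sqlite3 module (an assumption made only so the example can be run stand-alone; the question targets SQL Server). The table is trimmed to the relevant columns and the sample rows are hypothetical. It also assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE AccessRecord (
        TicketId INTEGER PRIMARY KEY,
        TheaterId INTEGER, ShowId INTEGER, MovieId INTEGER,
        CheckInTimestamp TEXT
    )""")
# Hypothetical sample rows: one show with three check-ins, another with one.
conn.executemany(
    "INSERT INTO AccessRecord VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 100, 7, "2014-01-01 19:05:00"),
        (2, 10, 100, 7, "2014-01-01 19:01:00"),
        (3, 10, 100, 7, "2014-01-01 19:09:00"),
        (4, 10, 200, 7, "2014-01-01 21:00:00"),
    ],
)

# Number the rows of each (movie, theater, show) group in both directions
# and keep the rows that come first in either ordering.
first_last = conn.execute("""
    WITH ranked AS (
        SELECT TicketId, ShowId, CheckInTimestamp,
               ROW_NUMBER() OVER (PARTITION BY MovieId, TheaterId, ShowId
                                  ORDER BY CheckInTimestamp, TicketId) AS rn_first,
               ROW_NUMBER() OVER (PARTITION BY MovieId, TheaterId, ShowId
                                  ORDER BY CheckInTimestamp DESC, TicketId DESC) AS rn_last
        FROM AccessRecord
    )
    SELECT ShowId, TicketId, CheckInTimestamp
    FROM ranked
    WHERE rn_first = 1 OR rn_last = 1
    ORDER BY ShowId, CheckInTimestamp
""").fetchall()

for row in first_last:
    print(row)
```

A show with a single check-in appears once, because the same row is both first and last.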
Either add new columns to the table and pre-convert the dates, or join the PK in that access table to a new table which has the converted values sitting in it already. A new table that looks up the conversion instead of doing it in the join will speed up your queries immensely. If you can arrange for the access record to get an integer FK that points to the lookup (pre-converted values) table, then you avoid using dates at all and things will be phenomenally faster.
If you normalize the dataset and break it out into a star pattern, things will get even faster.
SELECT
R1.*
FROM AccessRecord R1
LEFT JOIN AccessRecord R2
ON R1.MovieId = R2.MovieId
AND R1.TheaterId = R2.TheaterId
AND R1.ShowId = R2.ShowId
AND (
R1.CheckInTimestamp < R2.CheckInTimestamp
OR (R1.CheckInTimestamp = R2.CheckInTimestamp
AND R1.TicketId< R2.TicketId
))
WHERE R2.TicketId IS NULL
This selects the last entry based on CheckInTimestamp. If several rows share that timestamp, the one with the highest TicketId is chosen.
Of course, an index on MovieId, TheaterId and ShowId will help.
This is where I learned the trick
You could also consider a UNION ALL query instead of that nasty OR. ORs are usually slower than UNION ALL queries.
Related
I am trying to get the query below to return the TWO lowest PlayedTo results for each PlayerID.
select
x1.PlayerID, x1.RoundID, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and x2.PlayedTo <= x1.PlayedTo
) <3
order by PlayerID, PlayedTo, RoundID;
Unfortunately at the moment it doesn't return a result when there is a tie for one of the lowest scores. A copy of the dataset and code is here http://sqlfiddle.com/#!3/4a9fc/13.
PlayerID 47 has only one result returned as there are two different RoundID's that are tied for the second lowest PlayedTo. For what I am trying to calculate it doesn't matter which of these two it returns as I just need to know what the number is but for reporting I ideally need to know the one with the newest date.
One other slight problem with the query is the time it takes to run. It takes about 2 minutes in Access to run through the 83 records but it will need to run on about 1000 records when the database is fully up and running.
Any help will be much appreciated.
Resolve the tie by adding DatePlayed to your internal sorting (you wanted the one with the newest date anyway):
select
x1.PlayerID, x1.RoundID
, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and (x2.PlayedTo < x1.PlayedTo
or x2.PlayedTo = x1.PlayedTo
and x2.DatePlayed >= x1.DatePlayed
)
) <3
order by PlayerID, PlayedTo, RoundID;
For performance create an index supporting the join condition. Something like:
create index P_7to8Calcs__PlayerID_RoundID on P_7to8Calcs(PlayerId, PlayedTo);
Note: I used your SQLFiddle as I do not have Access available here.
Edit: In case the index does not improve performance enough, you might want to try the following query using window functions (which avoids nested sub-query). It works in your SQLFiddle but I am not sure if this is supported by Access.
select x1.PlayerID, x1.RoundID, x1.PlayedTo
from (
select PlayerID, RoundID, PlayedTo
, RANK() OVER (PARTITION BY PlayerId ORDER BY PlayedTo, DatePlayed DESC) AS Rank
from P_7to8Calcs
) as x1
where x1.RANK < 3
order by PlayerID, PlayedTo, RoundID;
See OVER clause and Ranking Functions for documentation.
Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the where clause but I can't even get a start
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
Uses the aggregate function min() as window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or sub-select for a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers, I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
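The sliding-window check can be tried on the sample data from the question. This is a minimal sketch using an in-memory SQLite database via Python's sqlite3 module (an assumption for the sake of a runnable example; the answer itself targets PostgreSQL). The table name t and the columns day, athlete and place follow the answer's query, and the frame is spelled out in the longer BETWEEN form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day TEXT, athlete TEXT, place INTEGER)")
# Sample data copied from the question.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2013-06-22", "Johnson", 2),
        ("2013-06-21", "Johnson", 1),
        ("2013-06-20", "Johnson", 4),
        ("2013-06-19", "Johnson", 2),
        ("2013-06-18", "Johnson", 3),
        ("2013-06-17", "Johnson", 4),
        ("2013-06-16", "Johnson", 3),
        ("2013-06-15", "Johnson", 1),
    ],
)

# best = lowest place over the current row and the three rows before it;
# the filter on best has to sit one query level above the window function.
no_win_rows = conn.execute("""
    SELECT day, athlete, place
    FROM (
        SELECT day, athlete, place,
               MIN(place) OVER (PARTITION BY athlete ORDER BY day
                                ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS best
        FROM t
    ) AS sub
    WHERE best > 1
    ORDER BY day DESC
""").fetchall()

for row in no_win_rows:
    print(row)
```

On this data the query returns the rows for 2013-06-20 and 2013-06-19, the two rows the question expects.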
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.
Given a table of responses with columns:
Username, LessonNumber, QuestionNumber, Response, Score, Timestamp
How would I run a query that returns which users got a score of 90 or better on their first attempt at every question in their last 5 lessons? "Last 5 lessons" is a limiting condition rather than a requirement, so if they completed only 1 lesson, but got all of their first attempts for each question right, then they should be included in the results. We just don't want to look back farther than 5 lessons.
About the data: Users may be on different lessons. Some users may have not yet completed five lessons (may only be on lesson 3 for example). Each lesson has a different number of questions. Users have different lesson paths, so they may skip some lesson numbers or even complete lessons out of sequence.
Since this seems to be a problem of transforming temporally non-uniform/discontinuous values into uniform/contiguous values per-user, I think I can solve the bulk of the problem with a couple ranking function calls. The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions completed is variable per-user.
So far...
As a starting point or hint at what may need to happen, I've transformed Timestamp into an "AttemptNumber" for each question, by using "row_number() over (partition by Username,LessonNumber,QuestionNumber order by Timestamp) as AttemptNumber".
I'm also trying to transform LessonNumber from an absolute value into a contiguous ranked value for individual users. I could use "dense_rank() over (partition by Username order by LessonNumber desc) as LessonRank", but that assumes the order lessons are completed corresponds with the order of LessonNumber, which is unfortunately not always the case. However, let's assume that this is the case, since I do have a way of producing such a number through a couple of joins, so I can use the dense_rank transform described to select the "last 5 completed lessons" (i.e. LessonRank <= 5).
For the >90 condition, I think I can transform the score into an integer so that it's "1" if >= 90, and "0" if < 90. I can then introduce a clause like "group by Username having SUM(Score)=COUNT(Score).", which will select only those users with all scores equal to 1.
Any solutions or suggestions would be appreciated.
You kind of gave away the solution:
SELECT DISTINCT Username
FROM Results
WHERE Username NOT in (
SELECT DISTINCT Username
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1 and Score < 90
)
Concerning the LessonRank, I used exactly what you described, since it is not clear how to order the lessons otherwise: The timestamp of the first attempt of the first question of a lesson? Or the timestamp of the first attempt of any question of a lesson? Or simply the first (or the most recent?) timestamp of any result of any question of a lesson?
The innermost Select adds all the AttemptNumber and LessonRank as provided by you.
The next Select retains only the results which would disqualify a user to be in the final list - all first attempts with an insufficient score in the last 5 lessons. We end up with a list of users we do not want to display in the final result.
Therefore, in the outermost Select, we can select all the users which are not in the exclusion list. Basically all the other users which have answered any question.
EDIT: As so often, second try should be better...
One more EDIT:
Here's a version including your remarks in the comments.
SELECT Username
FROM
(
SELECT Username, CASE WHEN Score >= 90 THEN 1 ELSE 0 END AS QuestionScoredWell
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1
) as ff
Group BY Username
HAVING MIN(QuestionScoredWell) = 1
I used a Having clause with a MIN expression on the calculated QuestionScoredWell value.
When comparing the execution plans for both queries, this query is actually faster. Not sure though whether this is partially due to the low number of data rows in my table.
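To see the HAVING MIN(...) variant at work, here is a minimal sketch against an in-memory SQLite database through Python's sqlite3 module (an assumption so the example runs stand-alone; the answer was written for SQL Server). The sample users and scores are hypothetical, timestamps are plain integers, and lessons are ranked by LessonNumber as in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Results (
        Username TEXT, LessonNumber INTEGER, QuestionNumber INTEGER,
        Score INTEGER, Timestamp INTEGER
    )""")
rows = [
    # alice: both first attempts are >= 90; a later poor retry does not count.
    ("alice", 1, 1, 95, 1), ("alice", 1, 2, 90, 2), ("alice", 1, 2, 50, 3),
    # bob: first attempt fails, so the later perfect retry does not help.
    ("bob", 1, 1, 80, 1), ("bob", 1, 1, 100, 2),
    # carol: failed in lesson 1, but that is her sixth-most-recent lesson.
    ("carol", 1, 1, 10, 1),
]
rows += [("carol", lesson, 1, 95, lesson) for lesson in range(2, 7)]
conn.executemany("INSERT INTO Results VALUES (?, ?, ?, ?, ?)", rows)

qualified = [r[0] for r in conn.execute("""
    SELECT Username
    FROM (
        SELECT Username, Score,
               ROW_NUMBER() OVER (PARTITION BY Username, LessonNumber, QuestionNumber
                                  ORDER BY Timestamp) AS AttemptNumber,
               DENSE_RANK() OVER (PARTITION BY Username
                                  ORDER BY LessonNumber DESC) AS LessonRank
        FROM Results
    ) AS f
    WHERE LessonRank <= 5 AND AttemptNumber = 1
    GROUP BY Username
    HAVING MIN(CASE WHEN Score >= 90 THEN 1 ELSE 0 END) = 1
    ORDER BY Username
""")]

print(qualified)
```

Only users whose every first attempt inside the five-lesson window scored 90 or more survive the HAVING filter.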
Random suggestions:
1
The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions is variable per-user.
is equivalent to
There exists no first attempt with a score < 90 in the most recent 5 lessons
which strikes me as a little easier to grab with a NOT EXISTS subquery.
2
First attempt is the same as where timestamp = (select min(timestamp) ... )
You need to identify the top 5 lessons per user first, using the timestamp to prioritize lessons, then you can limit by score. Try:
Select t.username
from table t inner join
(select top 5 username, lessonNumber
from table
order by timestamp desc) l
on t.username = l.username and t.lessonNumber = l.lessonNumber
where t.score >= 90
I am trying to wrap my head around this one this morning.
I am trying to show inventory status for parts (for our products) and this query only becomes complex if I try to return all parts.
Let me lay it out:
single table inventoryReport
I have a distinct list of X parts I wish to display, the result of which must be X # of rows (1 row per part showing latest inventory entry).
table is made up of dated entries of inventory changes (so I only need the LATEST date entry per part).
all data contained in this single table, so no joins necessary.
Currently for 1 single part, it is fairly simple and I can accomplish this by doing the following sql (to give you some idea):
SELECT TOP (1) ldDate, ptProdLine, inPart, inSite, inAbc, ptUm, inQtyOh + inQtyNonet AS in_qty_oh, inQtyAvail, inQtyNonet, ldCustConsignQty, inSuppConsignQty
FROM inventoryReport
WHERE (ldPart = 'ABC123')
ORDER BY ldDate DESC
that gets me my TOP 1 row, so simple per part, however I need to show all X (lets say 30 parts). So I need 30 rows, with that result. Of course the simple solution would be to loop X# of sql calls in my code (but it would be costly) and that would suffice, but for this purpose I would love to work this SQL some more to reduce the x# calls back to the db (if not needed) down to just 1 query.
From what I can see here I need to keep track of the latest date per item somehow while looking for my result set.
I would ultimately do a
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
to limit the parts I need. Hopefully I made my question clear enough. Let me know if you have an idea. I cannot do a DISTINCT as the rows are not the same, the date needs to be the latest, and I need a maximum of X rows.
Thoughts? I'm stuck...
SELECT *
FROM (SELECT i.*,
ROW_NUMBER() OVER(PARTITION BY ldPart ORDER BY ldDate DESC) r
FROM inventoryReport i
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
) AS x
WHERE r = 1
EDIT: Be sure to test the performance of each solution. As pointed out in this question, the CTE method may outperform using ROW_NUMBER.
;with cteMaxDate as (
select ldPart, max(ldDate) as MaxDate
from inventoryReport
group by ldPart
)
SELECT md.MaxDate, ir.ptProdLine, ir.inPart, ir.inSite, ir.inAbc, ir.ptUm, ir.inQtyOh + ir.inQtyNonet AS in_qty_oh, ir.inQtyAvail, ir.inQtyNonet, ir.ldCustConsignQty, ir.inSuppConsignQty
FROM cteMaxDate md
INNER JOIN inventoryReport ir
on md.ldPart = ir.ldPart
and md.MaxDate = ir.ldDate
You need to join into a Sub-query:
SELECT i.ldPart, x.LastDate, i.inAbc
FROM inventoryReport i
INNER JOIN (Select ldPart, Max(ldDate) As LastDate FROM inventoryReport GROUP BY ldPart) x
on i.ldPart = x.ldPart and i.ldDate = x.LastDate
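The two families of answers above (ROW_NUMBER and a join to MAX(ldDate)) can be compared side by side. This is a minimal sketch on an in-memory SQLite database via Python's sqlite3 module (an assumption for runnability; the question is about SQL Server). The table is cut down to three columns and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventoryReport (ldPart TEXT, ldDate TEXT, inQtyOh INTEGER)")
# Hypothetical inventory entries; ZZZ is not in the requested part list.
conn.executemany(
    "INSERT INTO inventoryReport VALUES (?, ?, ?)",
    [
        ("ABC123", "2012-01-01", 5),
        ("ABC123", "2012-02-01", 8),
        ("BFD21", "2012-01-15", 3),
        ("ZZZ", "2012-03-01", 1),
    ],
)
parts = ("ABC123", "BFD21")

# Variant 1: number the rows of each part, newest first, and keep row 1.
by_row_number = conn.execute("""
    SELECT ldPart, ldDate, inQtyOh
    FROM (
        SELECT i.*, ROW_NUMBER() OVER (PARTITION BY ldPart ORDER BY ldDate DESC) AS r
        FROM inventoryReport i
        WHERE ldPart IN (?, ?)
    ) AS x
    WHERE r = 1
    ORDER BY ldPart
""", parts).fetchall()

# Variant 2: find the latest date per part, then join back to the table.
by_max_join = conn.execute("""
    SELECT i.ldPart, i.ldDate, i.inQtyOh
    FROM inventoryReport i
    INNER JOIN (
        SELECT ldPart, MAX(ldDate) AS LastDate
        FROM inventoryReport
        WHERE ldPart IN (?, ?)
        GROUP BY ldPart
    ) AS x ON i.ldPart = x.ldPart AND i.ldDate = x.LastDate
    ORDER BY i.ldPart
""", parts).fetchall()

print(by_row_number)
print(by_max_join)
```

Both variants agree here. They differ when a part has two entries with the same latest date: the MAX join returns both rows, while ROW_NUMBER keeps exactly one.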
I have 2 tables - an Account table and a Users table. Each account can have multiple users. I have a scenario where I want to execute a single query/join against these two tables, but I want all the Account data (Account.*) and only the first set of user data (specifically their name).
Instead of doing a "min" or "max" on my aggregated group, I wanted to do a "first". But, apparently, there is no "First" aggregate function in TSQL.
Any suggestions on how to go about getting this query? Obviously, it is easy to get the cartesian product of Account x Users:
SELECT User.Name, Account.* FROM Account, User
WHERE Account.ID = User.Account_ID
But how might I go about getting only the first user from the product, based on the order of their User.ID?
Rather than grouping, go about it like this...
select
*
from account a
join (
select
account_id,
row_number() over (order by account_id, id) -
rank() over (order by account_id) as row_num from [user]
) first on first.account_id = a.id and first.row_num = 0
I know my answer is a bit late, but that might help others. There is a way to achieve a First() and Last() in SQL Server, and here it is :
Stuff(Min(Convert(Varchar, DATE_FIELD, 126) + Convert(Varchar, DESIRED_FIELD)), 1, 23, '')
Use Min() for First() and Max() for Last(). The DATE_FIELD should be the date that determines if it is the first or last record. The DESIRED_FIELD is the field you want the first or the last value. What it does is :
Add the date in ISO format at the start of the string (23 characters long)
Append the DESIRED_FIELD to that string
Get the MIN/MAX value for that field (since it start with the date, you will get the first or last record)
Stuff that concatenated string to remove the first 23 characters (the date part)
Here you go!
EDIT: I ran into a problem with the first formula: when the DATE_FIELD has .000 as milliseconds, SQL Server returns the date as a string with NO milliseconds at all, so STUFF removes the first 4 characters of the DESIRED_FIELD. I simply changed the format to "20" (without milliseconds) and it works great. The only downside is that if you have two rows created within the same second, the sort can be messy... in which case you can revert to "126" for the format.
Stuff(Max(Convert(Varchar, DATE_FIELD, 20) + Convert(Varchar, DESIRED_FIELD)), 1, 19, '')
EDIT 2: My original intent was to return the last (or first) NON-NULL row. I was asked how to return the last or first row, whether it is null or not. Simply add an ISNULL to the DESIRED_FIELD. When you concatenate two strings with the + operator and one of them is NULL, the result is NULL. So use the following:
Stuff(Max(Convert(Varchar, DATE_FIELD, 20) + IsNull(Convert(Varchar, DESIRED_FIELD), '')), 1, 19, '')
Select *
From Accounts a
Left Join (
Select u.*,
row_number() over (Partition By u.AccountKey Order By u.UserKey) as Ranking
From Users u
) as UsersRanked
on UsersRanked.AccountKey = a.AccountKey and UsersRanked.Ranking = 1
This can be simplified by using the Partition By clause. In the above, if an account has three users, then the subquery numbers them 1, 2 and 3, and for a different AccountKey it will reset the numbering. This means that for each unique AccountKey there will always be a 1, and potentially 2, 3, 4, etc.
Thus you filter on Ranking=1 to grab the first from each group.
This will give you one row per account, and if there is at least one user for that account, then it will give you the user with the lowest key(because I use a left join, you will always get an account listing even if no user exists). Replace Order By u.UserKey with another field if you prefer that the first user be chosen alphabetically or some other criteria.
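A quick way to check this behaviour, including the account without users, is the following minimal sketch on an in-memory SQLite database via Python's sqlite3 module (an assumption so it runs stand-alone; the answer targets SQL Server). The AccountName column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (AccountKey INTEGER PRIMARY KEY, AccountName TEXT)")
conn.execute("CREATE TABLE Users (UserKey INTEGER PRIMARY KEY, AccountKey INTEGER, Name TEXT)")
# Account 1 has two users; account 2 has none.
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [(1, "Acme"), (2, "Empty Co")])
conn.executemany("INSERT INTO Users VALUES (?, ?, ?)", [(11, 1, "Zed"), (12, 1, "Amy")])

# Rank users within each account by UserKey and join only the rank-1 row.
first_users = conn.execute("""
    SELECT a.AccountKey, ur.Name
    FROM Accounts a
    LEFT JOIN (
        SELECT u.*,
               ROW_NUMBER() OVER (PARTITION BY u.AccountKey ORDER BY u.UserKey) AS Ranking
        FROM Users u
    ) AS ur ON ur.AccountKey = a.AccountKey AND ur.Ranking = 1
    ORDER BY a.AccountKey
""").fetchall()

print(first_users)
```

Account 1 gets Zed, the user with the lowest UserKey rather than the alphabetically first name, and account 2 is still listed with a NULL user name because of the left join.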
I've benchmarked all the methods; the simplest and fastest way to achieve this is to use OUTER/CROSS APPLY:
SELECT u.Name, Account.* FROM Account
OUTER APPLY (SELECT TOP 1 * FROM [User] WHERE Account.ID = Account_ID ) as u
CROSS APPLY works just like INNER JOIN and fetches the rows where both tables are related, while OUTER APPLY works like LEFT OUTER JOIN and fetches all rows from the left table (Account here)
You can use OUTER APPLY, see documentation.
SELECT User1.Name, Account.* FROM Account
OUTER APPLY
(SELECT TOP 1 Name
FROM [User]
WHERE Account.ID = [User].Account_ID
ORDER BY Name ASC) User1
SELECT (SELECT TOP 1 Name
FROM [User]
WHERE Account_ID = a.AccountID
ORDER BY UserID) [Name],
a.*
FROM Account a
The STUFF response from Dominic Goulet is slick. But, if your DATE_FIELD is SMALLDATETIME (instead of DATETIME), then the ISO 8601 length will be 19 instead of 23 (because SMALLDATETIME has no milliseconds) - so adjust the STUFF parameter accordingly or the return value from the STUFF function will be incorrect (missing the first four characters).
First and Last do not exist in SQL Server 2005 or 2008, but SQL Server 2012 has the FIRST_VALUE and LAST_VALUE functions. I tried to implement First and Last aggregates for SQL Server 2005 and hit the obstacle that SQL Server does not guarantee that the aggregate is calculated in a defined order. (See the SqlUserDefinedAggregateAttribute.IsInvariantToOrder property, which is not implemented.) This might be because the query analyser tries to execute the calculation of the aggregate on multiple threads and combine the results, which speeds up execution but does not guarantee the order in which elements are aggregated.
Define "First". What you think of as first is a coincidence that normally has to do with clustered index order but should not be relied on (you can contrive examples that break it).
You are right not to use MAX() or MIN(). While tempting, consider the scenario where the first name and last name are in separate fields. You might get names from different records.
Since it sounds like all you really care about is getting exactly one arbitrary record for each group, what you can do is just MIN or MAX an ID field for that record, and then join the table into the query on that ID.
There are a number of ways of doing this; here is a quick and dirty one.
Select (SELECT TOP 1 U.Name FROM Users U WHERE U.Account_ID = A.ID) AS "Name",
A.*
FROM Account A
(Slightly Off-Topic, but) I often run aggregate queries to list exception summaries, and then I want to know WHY a customer is in the results, so use MIN and MAX to give 2 semi-random samples that I can look at in details e.g.
SELECT Customer.Id, COUNT(*) AS ProblemCount
, MIN(Invoice.Id) AS MinInv, MAX(Invoice.Id) AS MaxInv
FROM Customer
INNER JOIN Invoice on Invoice.CustomerId = Customer.Id
WHERE Invoice.SomethingHasGoneWrong=1
GROUP BY Customer.Id
Create and join with a subselect 'FirstUser' that returns the first user for each account
SELECT [User].Name, Account.*
FROM Account, [User],
(select min([User].id) id, account_id from [User] group by [User].account_id) as firstUser
WHERE Account.ID = [User].Account_ID
and [User].id = firstUser.id and Account.ID = firstUser.account_id