Query optimization in Oracle SQL - sql

Let's say I have an oracle database schema like so:
tournaments( id, name )
players( id, name )
gameinfo( id, pid (references players.id), tid (references tournaments.id), date)
So a row in the gameinfo table means that a certain player played a certain game in a tournament on a given date. Tournaments has about 20 records, players about 160 000 and game info about 2 million. I have to write a query which lists tournaments (with tid in the range of 1-4) and the number of players that played their first game ever in that tournament.
I came up with the following query:
select tid, count(pid)
from gameinfo g
where g.date = (select min(date) from gameinfo g1 where g1.pid = g.pid)
and g.tid in (1,2,3,4)
group by tid;
This is clearly suboptimal (it ran for about 58 minutes).
I had another idea, that I could make a view of:
select pid, tid, min(date)
from gameinfo
where tid in(1,2,3,4)
group by pid, tid;
And run my queries on this view, as it only had about 600 000 records, but this still seems less than optimal.
Can you give any advice on how this could be optimized ?

My first recommendation is to try analytic functions. The row_number() function enumerates each player's games in date order, so a player's first game has a seqnum of 1:
select gi.*
from (select gi.*,
row_number() over (partition by gi.pid order by gi.date) as seqnum
from gameinfo gi
) gi
where tid in(1,2,3,4) and seqnum = 1
My second suggestion is to put the date of the first tournament into the players table, since it seems like important information for using the database.
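To get from the row_number() query to the per-tournament counts the question asks for, one option is to group the seqnum = 1 rows by tid. Here is a minimal sketch of that idea, run against a toy table with Python's sqlite3 module. SQLite (3.25 or later, for window functions) only stands in for Oracle here, the sample rows are invented, and the date column is renamed game_date for the sketch. Note that the tid filter sits outside the subquery, so "first game ever" is computed over all tournaments.

```python
import sqlite3

# Invented sample rows: (id, pid, tid, game_date)
ROWS = [
    (1, 10, 1, "2020-01-01"),  # player 10 debuts in tournament 1
    (2, 10, 2, "2020-02-01"),
    (3, 11, 2, "2020-01-15"),  # player 11 debuts in tournament 2
    (4, 12, 5, "2019-12-01"),  # player 12 debuts outside tournaments 1-4
    (5, 12, 1, "2020-03-01"),
    (6, 13, 1, "2020-01-05"),  # player 13 debuts in tournament 1
]

QUERY = """
select tid, count(*) as first_timers
from (select gi.*,
             row_number() over (partition by gi.pid
                                order by gi.game_date) as seqnum
      from gameinfo gi) gi
where seqnum = 1 and tid in (1, 2, 3, 4)
group by tid
order by tid
"""


def first_timers_per_tournament(rows):
    """Count, per tournament, the players whose first game ever was there."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table gameinfo "
        "(id integer, pid integer, tid integer, game_date text)"
    )
    conn.executemany("insert into gameinfo values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(first_timers_per_tournament(ROWS))  # [(1, 2), (2, 1)]
```

Player 12 is not counted for tournament 1, because that player's first game was in tournament 5.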

Related

SQL plus, top 3 rank across two tables

I'm trying to find a way to query the top three users in a database in terms of number of listens and output their user ID and their rank.
The schema for the two tables in question is as follows :
User(user_id, email, first_name, last_name, password, created_on, last_sign_in)
PreviouslyPlayed(user_id, track_id, timestamp)
I can see how many people would pull this off with a count query, but I am wondering if there's a way to do this with rank or dense rank.
If you just want the user id and are using Oracle 12c+, then you can do:
select pp.user_id, rank() over (order by count(*) desc) as therank
from previouslyplayed pp
group by pp.user_id
order by count(*) desc
fetch first 3 rows only;
In earlier versions, you would use a subquery:
select pp.*
from (select pp.user_id, rank() over (order by count(*) desc) as therank
from previouslyplayed pp
group by pp.user_id
) pp
where therank <= 3;
You might want to review row_number(), rank(), and dense_rank() to be sure you are getting what you really want (the difference is in how they handle ties).
You only need the join if you are concerned that something called user_id in one table is not a valid user id. That seems unlikely, in any well-designed database.
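To see how the three ranking functions differ on ties, here is a small sketch using Python's sqlite3 module as a stand-in for Oracle (an assumption; it needs SQLite 3.25 or later). The listen counts are invented: user a has 5 listens, b and c have 3 each, and d has 1. The aggregation is done in a subquery first, which keeps the window functions simple.

```python
import sqlite3

# Invented listens: one row per play, only user_id matters here.
PLAYS = [("a",)] * 5 + [("b",)] * 3 + [("c",)] * 3 + [("d",)] * 1

QUERY = """
select user_id, listens,
       row_number() over (order by listens desc, user_id) as rn,
       rank()       over (order by listens desc)          as rnk,
       dense_rank() over (order by listens desc)          as drnk
from (select user_id, count(*) as listens
      from previouslyplayed
      group by user_id)
order by listens desc, user_id
"""


def ranking_table(plays):
    """Return (user_id, listens, row_number, rank, dense_rank) per user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table previouslyplayed (user_id text)")
    conn.executemany("insert into previouslyplayed values (?)", plays)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


for row in ranking_table(PLAYS):
    print(row)
# ('a', 5, 1, 1, 1)
# ('b', 3, 2, 2, 2)
# ('c', 3, 3, 2, 2)
# ('d', 1, 4, 4, 3)
```

With this data, filtering on rank <= 3 or row_number <= 3 returns a, b and c, while dense_rank <= 3 also returns d.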

SQL random sample with groups

I have a university graduate database and would like to extract a random sample of data of around 1000 records.
I want to ensure the sample is representative of the population so would like to include the same proportions of courses eg
I could do this using the following:
select top 500 id from degree where coursecode = 1 order by newid()
union
select top 300 id from degree where coursecode = 2 order by newid()
union
select top 200 id from degree where coursecode = 3 order by newid()
but we have hundreds of course codes, so this would be time-consuming. I would also like to reuse this code for different sample sizes, and I don't particularly want to go through the query and hard-code the sample sizes.
Any help would be greatly appreciated
You want a stratified sample. I would recommend doing this by sorting the data by course code and doing an nth sample. Here is one method that works best if you have a large population size:
select d.*
from (select d.*,
row_number() over (order by coursecode, newid()) as seqnum,
count(*) over () as cnt
from degree d
) d
where seqnum % (cnt / 500) = 1;
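The nth-sample query above can be tried on a toy population. This is a sketch only: it uses Python's sqlite3 module (SQLite 3.25+), with random() standing in for newid(), and a hypothetical helper systematic_sample that builds a degree table from invented course sizes. Because the rows are numbered in course-code order, every course gets close to its share of the sample.

```python
import sqlite3

QUERY = """
select coursecode, count(*) as sampled
from (select d.*,
             row_number() over (order by coursecode, random()) as seqnum,
             count(*) over () as cnt
      from degree d) d
where seqnum % (cnt / ?) = 1
group by coursecode
order by coursecode
"""


def systematic_sample(course_sizes, sample_size):
    """Return (coursecode, rows sampled) for an every-nth-row sample.

    course_sizes maps a course code to its number of graduates
    (invented data for this sketch).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table degree (id integer primary key, coursecode integer)"
    )
    rows = [(code,) for code, n in course_sizes.items() for _ in range(n)]
    conn.executemany("insert into degree (coursecode) values (?)", rows)
    counts = conn.execute(QUERY, (sample_size,)).fetchall()
    conn.close()
    return counts


# 1000 graduates split 50% / 30% / 20%, sample of 100 (every 10th row)
print(systematic_sample({1: 500, 2: 300, 3: 200}, 100))
# [(1, 50), (2, 30), (3, 20)]
```

One caveat: the step is an integer division, so the population should be at least twice the sample size. If cnt / 500 evaluates to 1, the condition seqnum % 1 = 1 is never true and no rows come back.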
EDIT:
You can also calculate the population size for each group "on the fly":
select d.*
from (select d.*,
row_number() over (partition by coursecode order by newid()) as seqnum,
count(*) over () as cnt,
count(*) over (partition by coursecode) as cc_cnt
from degree d
) d
where seqnum <= 500 * (cc_cnt * 1.0 / cnt)
Add a table that stores the sample size for each course code (called population below).
I think the query should then look like this:
SELECT *
FROM (
SELECT id, coursecode, ROW_NUMBER() OVER (PARTITION BY coursecode ORDER BY NEWID()) AS rn
FROM degree) t
LEFT OUTER JOIN
population p ON t.coursecode = p.coursecode
WHERE
rn <= p.SampleSize
It is not necessary to partition the population at all.
If you are taking a sample of 1000 from a population among hundreds of course codes, it stands to reason that many of those course codes will not be selected at all in any one sampling.
If the population is uniform (say, a continuous sequence of student IDs), a uniformly-distributed sample will automatically be representative of population weighting by course code. Since newid() is a uniform random sampler, you're good to go out of the box.
The only wrinkle that you might encounter is if a student ID is associated with multiple course codes. In this case, make a unique list (temporary table or subquery) containing a sequential id, student id and course code, then sample the sequential id from it, grouping by student id to remove duplicates.
I've done similar queries (but not on MS SQL) using a ROW_NUMBER approach:
select ...
from
( select ...
,row_number() over (partition by coursecode order by newid()) as rn
from degree
) as d
join sample_size as s
on d.coursecode = s.coursecode
and d.rn <= s.samplesize

Unique Top 5 Random Query

Let's say I have an app that determines the winners in a prize drawing. All entries are entered into a table indicating their employeeID. Each employee can enter the drawing multiple times. I select from the table and order by newid() to get a random sort. I assume the more entries (database records) an employee has, the better the chance he will end up in the top 5 of my query each time I run it. So far so good. However, because each employee has multiple records, there is a good chance he will come up multiple times in the top 5. I need the ability to return 5 unique records from the randomly sorted results.
How do I get 5 unique rows while still ensuring those with multiple drawing entries get a heavier weighting in the selection?
My base query:
SELECT TOP 5 employeeID
FROM events
TABLESAMPLE(1000 ROWS)
ORDER BY CHECKSUM(NEWID());
Kinda what I am trying to do:
SELECT TOP 5 *
FROM events
WHERE employeeID IN (SELECT employeeID
FROM events
TABLESAMPLE(1000 ROWS)
ORDER BY CHECKSUM(NEWID())
)
ORDER BY CHECKSUM(NEWID())
But of course I cannot do an order by in the subquery.
Any solution must take into account 2 things:
If an employee enters multiple tickets, his chance of winning increases relative to the others.
Everyone can only win once.
Here's my approach:
;WITH
tmp1 AS
(
SELECT EmployeeID,
ROW_NUMBER() OVER (ORDER BY NEWID()) AS SortOrder
FROM Events
),
tmp2 AS
(
SELECT EmployeeID,
MIN(SortOrder) AS WinOrder
FROM tmp1
GROUP BY EmployeeID
)
SELECT TOP 5 *
FROM tmp2
ORDER BY WinOrder
The SQL Fiddle gives employees 1 & 5 higher chances to win, but they will only win once each, regardless of how many times they enter.
Here's a fairly simple way to get what you're after:
select top 5 EmployeeID
from
(
select EmployeeID, row_number() over (order by newid()) DrawOrder
from Events
) wins
group by EmployeeID
order by min(DrawOrder)
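Both answers rely on the same idea: shuffle every ticket, then rank each employee by the position of his best ticket. Below is a sketch of that idea in Python's sqlite3 module (SQLite 3.25+), where random() replaces newid() and limit replaces top. The function name draw_winners and the ticket list are made up for illustration.

```python
import sqlite3

QUERY = """
select employeeID
from (select employeeID,
             row_number() over (order by random()) as DrawOrder
      from events) wins
group by employeeID
order by min(DrawOrder)
limit ?
"""


def draw_winners(entries, n_winners=5):
    """Draw n_winners distinct employees; more tickets means better odds."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table events (employeeID integer)")
    conn.executemany("insert into events values (?)", [(e,) for e in entries])
    winners = [row[0] for row in conn.execute(QUERY, (n_winners,))]
    conn.close()
    return winners


# Invented tickets: employees 1 and 5 hold ten each, the rest hold one.
tickets = [1] * 10 + [2, 3, 4] + [5] * 10 + [6, 7, 8]
print(draw_winners(tickets))  # five distinct employee IDs, order varies
```

An employee with ten tickets has ten chances to land an early DrawOrder, but the group by collapses them into a single row, so he can appear only once.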

Select entry of each group having exactly 1 entry

I am looking for an optimized query.
Let me show you a small example.
Let's suppose I have a table with three fields: studentId, teacherId and subject.
Now I want the rows in which a physics teacher is teaching only one student, i.e.
teacher 300 is only teaching student 3, and so on.
What I have tried till now
select sid,tid from tabletesting with(nolock)
where tid in (select tid from tabletesting with(nolock)
where subject='physics' group by tid having count(tid) = 1)
and subject='physics'
The above query works fine, but I want a different solution in which I don't have to scan the same table twice.
I also tried using Rank() and Row_Number(), but without success.
FYI:
This is only an example, not the actual table I am working with. My table contains a huge number of rows and columns, and the where clause is also very complex (date comparisons etc.), so I don't want to repeat the same where clause in the subquery and the outer query.
You can do this with window functions. Assuming that there are no duplicate students for a given teacher (as in your sample data):
select tt.sid, tt.tid
from (select tt.*, count(*) over (partition by tt.tid) as scnt
from TableTesting tt
where tt.subject = 'physics'
) tt
where scnt = 1;
Another way to approach this, which might be more efficient, is to use a not exists clause:
select tt.sid, tt.tid
from TableTesting tt
where not exists (select 1 from TableTesting tt1 where tt1.tid = tt.tid and tt1.sid <> tt.sid)
Another option is to use an analytic function:
select sid, tid, subject from
(
select sid, tid, subject, count(sid) over (partition by subject, tid) cnt
from tabletesting
) X
where cnt = 1
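The window-function approach can be checked on a few rows. This is a sketch with invented data (the question's sample table is not shown), run through Python's sqlite3 module (SQLite 3.25+) rather than SQL Server, so the nolock hints are dropped. The where clause appears once, inside the subquery, and the count is taken over the rows that survive it.

```python
import sqlite3

# Invented rows: (sid, tid, subject)
ROWS = [
    (1, 100, "physics"),
    (2, 100, "physics"),  # teacher 100 has two physics students
    (3, 300, "physics"),  # teacher 300 has exactly one
    (4, 400, "math"),     # not a physics row at all
    (6, 500, "physics"),  # teacher 500: one physics student...
    (7, 500, "math"),     # ...plus a math student, which is filtered out
]

QUERY = """
select sid, tid
from (select tt.*, count(*) over (partition by tt.tid) as scnt
      from tabletesting tt
      where tt.subject = 'physics') tt
where scnt = 1
order by tid
"""


def single_student_teachers(rows):
    """Return (sid, tid) for physics teachers with exactly one student."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table tabletesting (sid integer, tid integer, subject text)"
    )
    conn.executemany("insert into tabletesting values (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(single_student_teachers(ROWS))  # [(3, 300), (6, 500)]
```

Teacher 500 qualifies because the math row is removed before the window count runs, which is the behaviour the original two-scan query has as well.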

Cumulative Game Score SQL

I have developed a game recently and the database is running on MSSQL.
Here is my database structure
Table : Player
PlayerID uniqueIdentifier (PK)
PlayerName nvarchar
Table : GameResult
ID bigint (PK - Auto Increment)
PlayerID uniqueIdentifier (FK)
DateCreated Datetime
Score int
TimeTaken bigint
PuzzleID int
I have written a query that lists the top 50 players, sorted by highest score (DESC) and then by time taken (ASC):
WITH ResultSet (PlayerID, Score, TimeTaken) AS(
SELECT DISTINCT(A.[PlayerID]), MAX(A.[Score]),MIN(A.[TimeTaken])
FROM GameResult A
WHERE A.[puzzleID] = @PuzzleID
GROUP BY A.[PlayerID])
SELECT TOP 50 RSP.[PlayerID], RSP.[PlayerName], RSA.[Score], RSA.[TimeTaken]
FROM ResultSet RSA
INNER JOIN Player RSP WITH(NOLOCK) ON RSA.PlayerID = RSP.PlayerID
ORDER By RSA.[Score] DESC, RSA.[timetaken] ASC
However, the above works for just one puzzle.
Question
1) I need to modify the SQL to produce a cumulative ranking over three puzzle IDs, for example puzzles 1, 2 and 3, sorted by highest total score (DESC) and then by total time taken (ASC).
2) I also need an overall ranking across all possible puzzles, 1 to 7.
3) Each player may appear on the list only once. Whoever played first and reached the highest score first is ranked 1st.
I tried using a CTE with UNION, but the SQL statement doesn't work.
I hope the gurus here can help me out with this. Much appreciated.
UPDATED WITH NEW SQL
The SQL below gives me the result for each puzzle ID. I'm not 100% sure, but I believe it is correct.
;with ResultSet (PlayerID, maxScore, minTime, playedDate)
AS
(
SELECT TOP 50 PlayerID, MAX(score) as maxScore, MIN(timetaken) as minTime, MIN(datecreated) as playedDate
FROM gameresult
WHERE puzzleID = @PuzzleID
GROUP BY PlayerID
ORDER BY maxScore desc, minTime asc, playedDate asc
)
SELECT RSP.[PlayerID], RSP.[PlayerName], RSA.maxScore, RSA.minTime, RSA.PlayedDate
FROM ResultSet RSA
INNER JOIN Player RSP WITH(NOLOCK)
ON RSA.PlayerID = RSP.PlayerID
ORDER BY
maxScore DESC,
minTime ASC,
playedDate ASC
I would first like to point out that I do not believe your original query is correct. If you are looking for the best player for a particular puzzle, would that be the combination of the highest score plus the best time for that puzzle? If yes, using max and min does not guarantee that the max and min come from the same game (or row), which I believe should be a requirement. Instead you should have first determined the best game per player by using a row number windowing function. You can then do the top 50 sort off of that data.
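To make the max/min problem concrete, here is a small sketch with invented scores, run through Python's sqlite3 module (SQLite 3.25+) as a stand-in for SQL Server. Player p1 has a slow high-scoring game and a fast low-scoring one. Aggregating with max and min reports a result that no single game produced, while row_number() picks one real game.

```python
import sqlite3

# Invented rows: (PlayerID, PuzzleID, Score, TimeTaken)
GAMES = [
    ("p1", 1, 100, 90),  # best score, slow
    ("p1", 1, 80, 30),   # lower score, fast
    ("p2", 1, 90, 40),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table GameResult "
    "(PlayerID text, PuzzleID integer, Score integer, TimeTaken integer)"
)
conn.executemany("insert into GameResult values (?, ?, ?, ?)", GAMES)

# max() and min() are computed independently, so they can mix two games.
aggregated = conn.execute("""
    select PlayerID, max(Score), min(TimeTaken)
    from GameResult
    where PuzzleID = 1
    group by PlayerID
    order by PlayerID
""").fetchall()

# row_number() keeps score and time from the same row.
best_game = conn.execute("""
    select PlayerID, Score, TimeTaken
    from (select gr.*,
                 row_number() over (partition by PlayerID
                                    order by Score desc, TimeTaken asc) as seq
          from GameResult gr
          where PuzzleID = 1) gr
    where seq = 1
    order by PlayerID
""").fetchall()
conn.close()

print(aggregated)  # [('p1', 100, 30), ('p2', 90, 40)]
print(best_game)   # [('p1', 100, 90), ('p2', 90, 40)]
```

The pair (100, 30) for p1 does not correspond to any row in the table, which is why the top 50 built on it can be ordered incorrectly.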
The cumulative metrics should be easier to calculate because you only have to aggregate the sum of their score and the sum of their time and then sort, which means the new query should most likely look something like this:
;with ResultSet (PlayerID, Score, TimeTaken)
AS
(
SELECT TOP 50
A.[PlayerID],
SUM(A.[Score]),
SUM(A.[TimeTaken])
FROM GameResult A
WHERE
A.[puzzleID] in(1,2,3)
GROUP BY
A.PlayerID
ORDER BY
SUM(A.[Score]) DESC,
SUM(A.[TimeTaken]) ASC
)
SELECT RSP.[PlayerID], RSP.[PlayerName], RSA.[Score], RSA.[TimeTaken]
FROM ResultSet RSA
INNER JOIN Player RSP WITH(NOLOCK)
ON RSA.PlayerID = RSP.PlayerID
ORDER BY
Score DESC,
TimeTaken ASC
UPDATE:
Based on the new criteria, you will have to do something like this.
;WITH ResultSet (PlayerID, PuzzleId, Score, TimeTaken, seq)
AS
(
SELECT
A.[PlayerID],
A.PuzzleID,
A.[Score],
A.[TimeTaken],
seq = ROW_NUMBER() over(PARTITION BY A.PlayerID, A.PuzzleID ORDER BY A.Score DESC, A.TimeTaken ASC)
FROM GameResult A
WHERE
A.[puzzleID] in(1,2,3)
)
SELECT TOP 50
RSP.[PlayerID],
RSP.[PlayerName],
Score = SUM(RSA.[Score]), --total score
TimeTaken = SUM(RSA.[TimeTaken]) --total time taken
FROM ResultSet RSA
INNER JOIN Player RSP
ON RSA.PlayerID = RSP.PlayerID
WHERE
--this is used to filter the top score for each puzzle per player
seq = 1
GROUP BY
RSP.[PlayerID],
RSP.[PlayerName]
ORDER BY
SUM(RSA.Score) DESC,
SUM(RSA.TimeTaken) ASC