Performance issue on selecting n newest rows in subselect - sql

I have a database with courses. Each course contains a set of nodes, and some nodes contain a set of answers from students. The Answer table looks (simplified) like this:
Answer
id | courseId | nodeId | answer
------------------------------------------------
1 | 1 | 1 | <- text ->
2 | 2 | 2 | <- text ->
3 | 1 | 1 | <- text ->
4 | 1 | 3 | <- text ->
5 | 2 | 2 | <- text ->
.. | .. | .. | ..
When a teacher opens a course (e.g. courseId = 1) I want to pick the node that has received the most answers lately. I can do this using the following query:
with Answers as
(
select top 50 id, nodeId from Answer A where courseId=1 order by id desc
)
select top 1 nodeId from Answers group by nodeId order by count(id) desc
or equally using this query:
select top 1 nodeId from
(select top 50 id, nodeId from Answer A where courseId=1 order by id desc) Answers
group by nodeId order by count(id) desc
In both queries the newest 50 answers (those with the highest ids) are selected and then grouped by nodeId, so I can pick the node with the highest frequency. My problem, however, is that the query is very slow. If I run only the subselect, it takes less than a second, and grouping 50 rows should be fast, but the entire query takes about 10 seconds! My guess is that SQL Server does the select and grouping first and only afterwards applies the top 50 and top 1, which in this case leads to terrible performance.
So, how can I rewrite the query to be efficient?

You can add indexes to make your queries more efficient. For this query:
with Answers as (
select top 50 id, nodeId
from Answer A
where courseId = 1
order by id desc
)
select top 1 nodeId
from Answers
group by nodeId
order by count(id) desc;
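The best index is Answer(courseId, id, nodeId): the equality column comes first, id lets the server read the newest rows in order without a sort, and nodeId makes the index covering.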
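As a rough illustration of the index plus the "newest 50, then group" query, here is a minimal sketch using Python's built-in sqlite3 module. It is not SQL Server: SQLite uses LIMIT instead of TOP, the index name ix_answer_course_id_node is made up for this example, and the sample rows are the ones from the question.

```python
import sqlite3

# In-memory SQLite database standing in for the real SQL Server table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Answer (id INTEGER PRIMARY KEY, courseId INT, nodeId INT, answer TEXT)"
)
# Hypothetical index name; the column order follows the recommendation above.
conn.execute("CREATE INDEX ix_answer_course_id_node ON Answer (courseId, id, nodeId)")
conn.executemany(
    "INSERT INTO Answer VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "a"), (2, 2, 2, "b"), (3, 1, 1, "c"), (4, 1, 3, "d"), (5, 2, 2, "e")],
)

# SQLite spelling of the question's query: LIMIT replaces TOP.
query = """
WITH Answers AS (
    SELECT id, nodeId
    FROM Answer
    WHERE courseId = ?
    ORDER BY id DESC
    LIMIT 50
)
SELECT nodeId
FROM Answers
GROUP BY nodeId
ORDER BY COUNT(id) DESC
LIMIT 1
"""
busiest_node = conn.execute(query, (1,)).fetchone()[0]
print(busiest_node)
```

With the sample data, course 1 has two answers on node 1 and one on node 3, so node 1 is returned. Whether the index fixes the slow plan on SQL Server still has to be checked against the actual execution plan.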

To be more insightful we'd need to see the indexes on that table and the execution plans you're getting (one plan for the inner query on its own, one plan for the full query).
I'd even recommend doing the same analysis again having added the index mentioned elsewhere on this page.
Without that information the only things we can recommend are trial and error.
For example, try avoiding TOP (this shouldn't matter, but we're guessing while we can't see your indexes and execution plans):
WITH
Answers AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY id DESC) AS rowId,
id,
nodeId
FROM
Answer
WHERE
courseId = 1
),
top50 AS
(
SELECT
nodeId,
COUNT(*) AS row_count
FROM
Answers
WHERE
rowId <= 50
GROUP BY
nodeId
),
ranked AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY row_count DESC, nodeId DESC) AS ordinal,
nodeID
FROM
top50
)
SELECT
nodeID
FROM
ranked
WHERE
ordinal = 1
This is massively over the top. It is functionally the same as the query in your question, but different enough that it may get a different execution plan.
Alternatively (and not very nicely), put the results of your inner query into a table variable, then run the outer query on the table variable.
I still expect, however, that adding the index will be the least-worst option.

Related

Delete rows based on group by

Let's say I have the following table:
Id | QuestionId
----------------
1 | 'MyQuestionId'
1 | NULL
2 | NULL
2 | NULL
It should behave like so
Find all the results of the same Id
If ANY of them has QuestionId IS NOT NULL, do not touch any rows with that Id.
Only if ALL the results for the same Id have QuestionId IS NULL, delete all the rows with that Id.
So in this case it should only delete rows with Id=2.
I haven't found an example for such a case anywhere. I've tried some options with rank, count, group by, but nothing worked. Can you help me?
You can use an updatable CTE or derived table for this, and calculate the count using a window function.
WITH cte AS (
SELECT t.*,
CountNonNulls = COUNT(t.QuestionId) OVER (PARTITION BY t.Id)
FROM YourTable t
)
DELETE cte
WHERE CountNonNulls = 0;
Note that this query does not contain any self-joins at all.
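For engines that cannot delete through a CTE, the same "COUNT ignores NULLs" idea can be expressed with GROUP BY and HAVING. The sketch below uses Python's sqlite3 module with the sample rows from the question. It is a swapped-in technique for illustration, not the updatable-CTE query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (Id INT, QuestionId TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [(1, "MyQuestionId"), (1, None), (2, None), (2, None)],
)

# COUNT(QuestionId) counts only non-NULL values, so a count of 0 means
# every row for that Id has QuestionId IS NULL.
conn.execute(
    """
    DELETE FROM YourTable
    WHERE Id IN (
        SELECT Id
        FROM YourTable
        GROUP BY Id
        HAVING COUNT(QuestionId) = 0
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT Id FROM YourTable ORDER BY Id")]
print(remaining_ids)
```

Both rows with Id = 1 survive because one of them has a non-NULL QuestionId; both rows with Id = 2 are deleted.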

How to return the category with max value for every user in postgresql?

This is the table
id | category | value
----------------------
1 | A | 40
1 | B | 20
1 | C | 10
2 | A | 4
2 | B | 7
2 | C | 7
3 | A | 32
3 | B | 21
3 | C | 2
I want the result like this
id | category
--------------
1 | A
2 | B
2 | C
3 | A
For small tables or for only very few rows per user, a subquery with the window function rank() (as demonstrated by The Impaler) is just fine. The resulting sequential scan over the whole table, followed by a sort will be the most efficient query plan.
For more than a few rows per user, this gets increasingly inefficient though.
Typically, you also have a users table holding one distinct row per user. If you don't have it, create it! See:
Is there a way to SELECT n ON (like DISTINCT ON, but more than one of each)
Select first row in each GROUP BY group?
We can leverage that for an alternative query that scales much better - using WITH TIES in a LATERAL JOIN. Requires Postgres 13 or later.
SELECT u.id, t.*
FROM users u
CROSS JOIN LATERAL (
SELECT t.category
FROM tbl t
WHERE t.id = u.id
ORDER BY t.value DESC
FETCH FIRST 1 ROWS WITH TIES -- !
) t;
See:
Get top row(s) with highest value, with ties
Fetching a minimum of N rows, plus all peers of the last row
This can use a multicolumn index to great effect - which must exist, of course:
CREATE INDEX ON tbl (id, value);
Or:
CREATE INDEX ON tbl (id, value DESC);
Even faster index-only scans become possible with:
CREATE INDEX ON tbl (id, value DESC, category);
Or (the optimum for the query at hand):
CREATE INDEX ON tbl (id, value DESC) INCLUDE (category);
This assumes value is defined NOT NULL; otherwise we have to use DESC NULLS LAST. See:
Sort by column ASC, but NULL values first?
To keep users in the result that don't have any rows in table tbl, use LEFT JOIN LATERAL (...) ON true. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
You can use RANK() to identify the rows you want. Then, filtering is easy. For example:
select *
from (
select *,
rank() over(partition by id order by value desc) as rk
from t
) x
where rk = 1
Result:
id category value rk
--- --------- ------ --
1 A 40 1
2 B 7 1
2 C 7 1
3 A 32 1
See running example at DB Fiddle.
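The rank() approach can be tried without a Postgres instance. The following is a minimal sketch using Python's sqlite3 module (it assumes the bundled SQLite is 3.25 or later, which is when window functions were added), with the sample rows from the question and the table name tbl from the first answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INT, category TEXT, value INT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (1, "A", 40), (1, "B", 20), (1, "C", 10),
        (2, "A", 4), (2, "B", 7), (2, "C", 7),
        (3, "A", 32), (3, "B", 21), (3, "C", 2),
    ],
)

# RANK() gives every row tied for the highest value rank 1,
# so ties (id = 2, categories B and C) are both kept.
top_categories = conn.execute(
    """
    SELECT id, category
    FROM (
        SELECT id, category,
               RANK() OVER (PARTITION BY id ORDER BY value DESC) AS rk
        FROM tbl
    ) AS ranked
    WHERE rk = 1
    ORDER BY id, category
    """
).fetchall()
print(top_categories)
```

Replacing RANK() with ROW_NUMBER() would return exactly one row per id, dropping one of the tied categories arbitrarily unless the ORDER BY is extended with a tie-breaker.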

Is SQL row_number order guaranteed when a CTE is referenced many times?

If I have a CTE definition that uses row_number() ordered by a non-unique column, and I reference that CTE twice in my query, is the row_number() value for each row guaranteed to be the same for both references to the CTE?
Example 1:
with tab as (
select 1 as id, 'john' as name
union
select 2, 'john'
union
select 3, 'brian'
),
ordered1 as (
select ROW_NUMBER() over (order by name) as rown, id, name
from tab
)
select o1.rown, o1.id, o1.name, o1.id - o2.id as id_diff
from ordered1 o1
join ordered1 o2 on o2.rown = o1.rown
Output:
+------+----+-------+---------+
| rown | id | name | id_diff |
+------+----+-------+---------+
| 1 | 3 | brian | 0 |
| 2 | 1 | john | 0 |
| 3 | 2 | john | 0 |
+------+----+-------+---------+
Is it guaranteed that id_diff = 0 for all rows?
Example 2:
with tab as (
select 1 as id, 'john' as name
union
select 2, 'john'
union
select 3, 'brian'
),
ordered1 as (
select ROW_NUMBER() over (order by name) as rown, id, name
from tab
),
ordered2 as (
select ROW_NUMBER() over (order by name) as rown, id, name
from tab
)
select o1.rown, o1.id, o1.name, o1.id - o2.id as id_diff
from ordered1 o1
join ordered2 o2 on o2.rown = o1.rown
Same output as above when I ran it, but that doesn't prove anything.
Now that I am joining two queries ordered1 and ordered2, can any guarantee be made about the value of id_diff = 0 in the result?
Example queries on http://rextester.com/AQDXP74920
I suspect that there is no guarantee in either case. If there is no such guarantee, then all CTEs using row_number() should always order by a unique combination of columns if the CTE may be referenced more than once in the query.
I have never heard this advice before, and would like some expert opinion.
No, there is no guarantee that ROW_NUMBER on a non-unique sort list returns the same sequence when a CTE is referenced multiple times. It is very likely to happen, but not guaranteed, as the CTE is merely a view.
So always make the sort list unique in such a case, e.g. order by name, id.
The answer that Thorsten gave is correct, I just want to add some more details.
Users of SQL Server often think of CTEs as "temporary tables" or "derived tables". However, they are nothing of the sort. Although some databases do materialize CTEs (at least some of the time), SQL Server never materializes CTEs.
In fact, what happens is that the CTE logic is inserted into the query -- just as if "replace(, )" were used on the query. This affects non-unique sorting keys. It also affects some non-deterministic functions, such as NEWID().
The advice in your case is simple: Whenever you use order by, include a unique key as the last order by key. You should do this whether order by is used in a window function or for a query. It is just a safe habit to get used to.
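A minimal sketch of that habit, run through Python's sqlite3 module rather than SQL Server: the window ORDER BY gets id appended as a tie-breaker, so the numbering is fully determined no matter how many times the CTE is expanded. The CTE name ordered is chosen for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ORDER BY name alone leaves the two 'john' rows in an unspecified order;
# adding id as the last key makes the numbering unambiguous.
rows = conn.execute(
    """
    WITH tab(id, name) AS (
        VALUES (1, 'john'), (2, 'john'), (3, 'brian')
    ),
    ordered AS (
        SELECT ROW_NUMBER() OVER (ORDER BY name, id) AS rown, id, name
        FROM tab
    )
    SELECT o1.rown, o1.id, o1.name, o1.id - o2.id AS id_diff
    FROM ordered o1
    JOIN ordered o2 ON o2.rown = o1.rown
    ORDER BY o1.rown
    """
).fetchall()
print(rows)
```

Because (name, id) is unique, both references to the CTE must assign the same row number to the same row, so id_diff is 0 by construction rather than by luck.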

SQL: How many rows have the largest value for a column

I am sure this has a very simple answer, though I have not turned anything up, mostly because I am sure I am phrasing the question wrong.
Anyway, let's say I have this very simple table:
Table: election_candidates
id | candidate_id | election_id | votes
---------------------------------------
1 | 2 | 1 | 3
2 | 5 | 1 | 3
3 | 3 | 1 | 2
I need to know if two candidates are tied, that is, whether more than one candidate has the most votes in an election.
I know I can use the MAX function to get the largest value for an election, but is there an easy query to get how many candidates have the MAX for a given election?
I'm using PHP and the Codeigniter framework, though just a general example of a query that could work is just fine.
Most databases support ANSI-standard window functions. One way to do this is using rank():
select ec.election_id, count(*) as NumTies
from (select ec.*, rank() over (partition by election_id order by votes desc) as seqnum
from election_candidates ec
) ec
where seqnum = 1
group by ec.election_id;
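To see the tie count in action, here is a small sketch using Python's sqlite3 module (assuming SQLite 3.25 or later for window functions) with the three sample rows from the question. The result column is named num_ties here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE election_candidates "
    "(id INT, candidate_id INT, election_id INT, votes INT)"
)
conn.executemany(
    "INSERT INTO election_candidates VALUES (?, ?, ?, ?)",
    [(1, 2, 1, 3), (2, 5, 1, 3), (3, 3, 1, 2)],
)

# Every candidate sharing the top vote count gets rank 1;
# counting those rows per election gives the number of tied leaders.
ties = conn.execute(
    """
    SELECT election_id, COUNT(*) AS num_ties
    FROM (
        SELECT election_id,
               RANK() OVER (PARTITION BY election_id ORDER BY votes DESC) AS seqnum
        FROM election_candidates
    ) AS ranked
    WHERE seqnum = 1
    GROUP BY election_id
    """
).fetchall()
print(ties)
```

A num_ties value greater than 1 means the election has a tie for first place; here candidates 2 and 5 both have 3 votes.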
Couldn't you just do something like:
select e.*
from election_candidates e
inner join (
select election_id, max(votes) as maxVotes
from election_candidates
group by election_id
) maxVotesPerElectionId on e.election_Id = maxVotesPerElectionId.election_id
and e.votes = maxVotesPerElectionId.maxVotes
This should get you the candidates (per election) with the max votes.
Just the winner:
SELECT *
from election_candidates
ORDER BY votes DESC
LIMIT 0,1
This ranks the candidates within each election by votes cast, using rank(), and lists them in order of placement.
All candidates are listed, showing how they did in each election.
DECLARE @T AS TABLE (id INT,candidate_id INT,election_id INT,votes INT)
INSERT INTO @T VALUES
(1 ,2,1,3),(2 ,5,1,3),(3 ,3,1,2),(4 ,2,2,3),(5 ,5,3,1),(6 ,6,1,4),(7 ,2,3,3),(8 ,1,4,3),
(9 ,1,5,2),(10,4,5,3),(11,5,5,3),(12,6,5,4)
SELECT
election_id,
votes,
RANK() OVER (PARTITION BY election_id ORDER BY votes DESC) AS RANKING,
candidate_id
FROM @T
ORDER BY election_id,
RANK() OVER (PARTITION BY election_id ORDER BY votes DESC)

How to find first duplicate row in a table sql server

I am working on SQL Server. I have a table that contains around 75000 records, several of which are duplicates. So I wrote a query to find out how many times each record is repeated:
SELECT [RETAILERNAME],COUNT([RETAILERNAME]) as Repeated FROM [Stores] GROUP BY [RETAILERNAME]
It gives me a result like this:
---------------------------
RETAILERNAME | Repeated
---------------------------
X | 4
---------------------------
Y | 6
---------------------------
Z | 10
---------------------------
Among the 4 records of X, I need to take only the first one.
So I want to retrieve all fields from the first row of each group of duplicates, i.e. of all the records with RETAILERNAME='X' I need only the first row.
Please guide me.
You could try using ROW_NUMBER.
Something like
;WITH Vals AS (
SELECT [RETAILERNAME],
ROW_NUMBER() OVER(PARTITION BY [RETAILERNAME] ORDER BY [RETAILERNAME]) RowID
FROM [Stores]
)
SELECT *
FROM Vals
WHERE RowID = 1
You can then also remove the duplicates if need be (BUT BE CAREFUL, THIS IS PERMANENT):
;WITH Vals AS (
SELECT [RETAILERNAME],
ROW_NUMBER() OVER(PARTITION BY [RETAILERNAME] ORDER BY [RETAILERNAME]) RowID
FROM Stores
)
DELETE
FROM Vals
WHERE RowID > 1;
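Which row counts as "first" depends on the ORDER BY inside the window: ordering by the partition column itself leaves the choice arbitrary. The sketch below, run through Python's sqlite3 module, assumes a hypothetical id column (and a hypothetical city column standing in for "all fields") so that the first row per retailer is well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id and city are hypothetical columns; the question only shows RETAILERNAME.
conn.execute("CREATE TABLE Stores (id INTEGER PRIMARY KEY, RETAILERNAME TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO Stores VALUES (?, ?, ?)",
    [(1, "X", "Oslo"), (2, "X", "Bergen"), (3, "Y", "Paris"), (4, "Y", "Lyon"), (5, "Z", "Rome")],
)

# Number the rows within each retailer by id, then keep row 1 of each group.
first_rows = conn.execute(
    """
    WITH Vals AS (
        SELECT id, RETAILERNAME, city,
               ROW_NUMBER() OVER (PARTITION BY RETAILERNAME ORDER BY id) AS RowID
        FROM Stores
    )
    SELECT id, RETAILERNAME, city
    FROM Vals
    WHERE RowID = 1
    ORDER BY id
    """
).fetchall()
print(first_rows)
```

If the real table has no such column, any column or combination that is unique within a retailer (for example a creation timestamp) would serve the same purpose.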
You can write the query as follows:
SELECT TOP 1 * FROM [Stores] GROUP BY [RETAILERNAME]
HAVING your condition
WITH cte
AS (SELECT [retailername],
Row_number()
OVER(
partition BY [retailername]
ORDER BY [retailername])'RowRank'
FROM [Stores])
SELECT *
FROM cte