How can I optimize this query? - sql

I've got a bit of a nasty query with several subselects that are really slowing it down. I'm already caching the query, but its results change often and they are meant to be shown on a high-traffic page.
SELECT user_id, user_id AS uid,
       (SELECT correct_words
        FROM score
        WHERE user_id = `uid`
        ORDER BY correct_words DESC, incorrect_words ASC
        LIMIT 0, 1) AS correct_words,
       (SELECT incorrect_words
        FROM score
        WHERE user_id = `uid`
        ORDER BY correct_words DESC, incorrect_words ASC
        LIMIT 0, 1) AS incorrect_words
FROM score
WHERE user_id > 0
  AND DATE(date_tested) = DATE(NOW())
GROUP BY user_id
ORDER BY correct_words DESC, incorrect_words ASC
LIMIT 0, 7
The goal of the query is to pick out the top scores for that day, but to show only the highest-scoring instance of each user instead of all of their scores. (So, for instance, if one user actually had 4 of the top 10 scores for that day, I only want to show that user's top score and remove the rest.)
Try as I might, I've yet to replicate the results of this query any other way. Right now its average run time is about 2 seconds, but I'm afraid that might increase greatly as the table gets bigger.
Any thoughts?

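Try this:
The subquery returns all of today's scores in the desired order, and the outer query picks out the first occurrence for each user. When grouping in MySQL, columns that are not grouped on typically return the equivalent of FIRST(column): the value of the first occurrence. Note that MySQL documents this value as indeterminate, and with ONLY_FULL_GROUP_BY enabled (the default since MySQL 5.7.5) the query is rejected, so treat this as a shortcut rather than a guarantee.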
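SELECT user_id, correct_words, incorrect_words
FROM
  ( SELECT user_id, correct_words, incorrect_words
    FROM score
    WHERE user_id > 0
      AND DATE(date_tested) = DATE(NOW())
    ORDER BY correct_words DESC, incorrect_words ASC
  ) AS s  -- MySQL requires an alias for every derived table
GROUP BY user_id
ORDER BY correct_words DESC, incorrect_words ASC  -- without this, LIMIT keeps an arbitrary 7 users
LIMIT 0, 7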
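If relying on the "first row per group" behaviour feels too fragile, the same idea can be written with a window function. This is only a sketch, assuming MySQL 8.0 or later (window functions are not available in earlier versions) and the same score table and columns as in the question; the alias names ranked and rn are made up for illustration.

```sql
-- Sketch, assuming MySQL 8.0+: rank each user's rows for today explicitly,
-- then keep only the best row per user.
SELECT user_id, correct_words, incorrect_words
FROM (
    SELECT user_id, correct_words, incorrect_words,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY correct_words DESC, incorrect_words ASC
           ) AS rn                       -- rn = 1 is the user's best score
    FROM score
    WHERE user_id > 0
      AND DATE(date_tested) = CURDATE()
) AS ranked
WHERE rn = 1
ORDER BY correct_words DESC, incorrect_words ASC
LIMIT 7;
```

Because the ranking is spelled out, the result does not depend on how MySQL picks values for non-grouped columns.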

The subqueries for correct_words and incorrect_words could be really killing your performance. In the worst case, MySQL has to execute those queries for each row it considers (not each row that it returns!). Rather than using scalar subqueries, consider rewriting your query to use JOIN-variants as appropriate.
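One possible JOIN-based rewrite is sketched below. It assumes the goal stated in the question (each user's best score for today, top 7 overall), so unlike the original scalar subqueries it only looks at today's rows; the alias names best and best_correct are hypothetical.

```sql
-- Sketch of a JOIN variant: compute each user's best correct_words for today
-- once, then join back to pick the matching row with the fewest incorrect_words.
SELECT s.user_id,
       s.correct_words,
       MIN(s.incorrect_words) AS incorrect_words
FROM score AS s
JOIN (
    SELECT user_id, MAX(correct_words) AS best_correct
    FROM score
    WHERE user_id > 0
      AND DATE(date_tested) = CURDATE()
    GROUP BY user_id
) AS best
  ON best.user_id = s.user_id
 AND best.best_correct = s.correct_words
WHERE DATE(s.date_tested) = CURDATE()
GROUP BY s.user_id, s.correct_words
ORDER BY s.correct_words DESC, MIN(s.incorrect_words) ASC
LIMIT 7;
```

The derived table is evaluated once rather than once per candidate row, which is the main point of moving away from scalar subqueries.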
Additionally, filtering by DATE(date_tested)=DATE(NOW()) may be preventing MySQL from using an index on date_tested. MySQL versions before 8.0.13 do not support function-based indexes (functional key parts), so wrapping the column in a function usually forces a scan.
Make sure you have indices on all the columns you filter and order by. MySQL can make use of multi-column indices if the columns filtered or ordered by match your query, e.g. CREATE INDEX score_correct_incorrect_idx ON score ( correct_words DESC, incorrect_words ASC ); would be a candidate index, though MySQL may choose not to use it depending on the execution plan it creates and its estimates of table sizes.
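To illustrate the point about DATE(date_tested), the filter can be expressed as a range on the bare column so that an ordinary index is usable. This is a sketch, assuming date_tested is a DATETIME or TIMESTAMP column; the index name score_date_tested_idx is made up.

```sql
-- Hypothetical index on the filtered column.
CREATE INDEX score_date_tested_idx ON score (date_tested);

-- Same "today" filter as DATE(date_tested) = DATE(NOW()), written as a
-- half-open range so the column is not wrapped in a function.
EXPLAIN
SELECT user_id, correct_words, incorrect_words
FROM score
WHERE user_id > 0
  AND date_tested >= CURDATE()
  AND date_tested <  CURDATE() + INTERVAL 1 DAY;
```

Running it under EXPLAIN shows whether MySQL actually chooses the index for your data; it may still prefer a scan on a small table.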

Related

How to optimise a SQL SELECT query for generating a user's newsfeed?

I'm currently trying to build a feature to generate a user's newsfeed using the following query from a table of posts. This is the SQL statement we are using:
SELECT *
FROM "posts" AS "post"
WHERE "post"."sourceId" IN (...)
ORDER BY "post"."createdAt" DESC, "post"."timestamp" DESC
LIMIT 10;
The posts table currently has roughly 200K+ rows and is likely to grow much larger. DB performance isn't my strongest skill, but is there any way to optimise this query to make it run as fast as possible? I'm assuming it's not enough to add an index on the sourceId column, and that I would instead need a multi-column index that also takes the ORDER BY columns into account.
For this query:
SELECT p.*
FROM posts p
WHERE p.sourceId IN (...)
ORDER BY p.createdAt DESC, p.timestamp DESC
LIMIT 10;
The only index that can really help is an index on posts(sourceId).
Note that I removed the double quotes ("). Do not escape table and column names when you define them. Then you don't need to escape them when you use them.
However, the query still has to sort all the data. And that can be time-consuming. A more complicated query is easier for Postgres to optimize:
select p.*
from ((select p.*
       from posts p
       where sourceId = $si_1
       order by p.createdAt desc, p.timestamp desc
       limit 10
      ) union all
      (select p.*
       from posts p
       where sourceId = $si_2
       order by p.createdAt desc, p.timestamp desc
       limit 10
      ) union all
      . . .
     ) p
order by p.createdAt desc, p.timestamp desc
limit 10;
This query can use an index on posts(sourceId, createdAt desc, timestamp desc) for the inner selects. That should be fast. The outer order by will still need sorting, but the volume of data should be much smaller.
For instance, if a typical source has 10,000 rows and you are looking at only 3 sources, then your version of the query needs to sort 30,000 rows to fetch 10. This version uses the index to fetch 30 rows and then sorts them to get the final 10.
That would be a big difference in performance.
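The same per-source "top 10, then merge" idea can also be written without repeating a branch for every source, using a LATERAL join. This is a sketch, assuming Postgres 9.3 or later, an integer sourceId column, and unquoted column names as in the answer above; the literal ids 1, 2, 3 and the alias names src and source_id are placeholders.

```sql
-- Sketch: for each source id, fetch its 10 newest posts (one index range
-- scan per source), then sort the small combined set and keep 10 overall.
SELECT p.*
FROM unnest(ARRAY[1, 2, 3]) AS src(source_id)   -- placeholder source ids
CROSS JOIN LATERAL (
    SELECT *
    FROM posts
    WHERE posts.sourceId = src.source_id
    ORDER BY posts.createdAt DESC, posts.timestamp DESC
    LIMIT 10
) AS p
ORDER BY p.createdAt DESC, p.timestamp DESC
LIMIT 10;
```

Like the union all form, this is meant to benefit from an index on posts(sourceId, createdAt desc, timestamp desc); checking the plan with EXPLAIN is the only way to confirm it does on your data.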
You may find that just an index on sourceId is sufficient:
CREATE INDEX src_idx ON posts (sourceId);
Postgres would then manually have to sort the records which make it past the WHERE clause. Further adding the columns in the ORDER BY clause might also help:
CREATE INDEX idx ON posts (sourceId, createdAt DESC, timestamp DESC);
This might speed up the sorting operation by letting Postgres sort the matching groups of sourceId records at once.

SQL tuning, long running query + rownum

I have a million records in a database table with account number, address, and many more columns. I want the first 100 rows in descending order. I used ROWNUM for this, but the query takes a long time to execute because it scans the full table, sorts it, and only then applies the ROWNUM filter.
What is the solution to minimize the query execution time?
For example:
select *
from
(select
acc_no, address
from
customer
order by
acc_no desc)
where
ROWNUM <= 100;
From past experience, I found that TOP works best for this scenario. (TOP is SQL Server and MS Access syntax, though; Oracle does not support it.)
Also, you should always select only the columns you need and avoid using the wildcard (*).
SELECT TOP 100 [acc_no], [address] FROM [customer] ORDER BY [acc_no] DESC
Useful resources about TOP, LIMIT and even ROWNUM.
https://www.w3schools.com/sql/sql_top.asp
Make sure you use index on acc_no column.
If you have an index already present on acc_no, check if that's being used during query execution or not by verifying the query execution plan.
To create a new index if not present, use below query :
Create index idx1 on customer(acc_no); -- If acc_no is not unique
Create unique index idx1 on customer(acc_no); -- If acc_no is unique. Note: Unique index is faster.
If the explain plan output shows "Full table scan", the optimizer is not using the index.
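One way to look at the plan in Oracle is sketched below, using the query from the question. EXPLAIN PLAN and DBMS_XPLAN.DISPLAY are standard Oracle features; the operation names mentioned afterwards are what one would typically expect to see, not a guarantee for your system.

```sql
-- Store the plan for the query without executing it.
EXPLAIN PLAN FOR
SELECT *
FROM (SELECT acc_no, address
      FROM customer
      ORDER BY acc_no DESC)
WHERE ROWNUM <= 100;

-- Print the most recently explained plan.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

A plan that walks the index in descending order and stops after 100 rows usually shows a descending index scan under a COUNT STOPKEY step, whereas TABLE ACCESS FULL followed by a sort indicates the index is being ignored.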
Try with a hint first :
select * from
  (select /*+ index(customer idx1) */  -- the hint names the table and sits in the block that reads it
          acc_no, address
   from customer
   order by acc_no desc)
where ROWNUM <= 100;
If the query with hint above returned results quickly, then you need to check why optimizer is ignoring your index deliberately. One probable reason for this is outdated statistics. Refresh the statistics.
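Refreshing the statistics can be done with the DBMS_STATS package. The call below is a minimal sketch, assuming the table is named CUSTOMER and lives in the current schema; adjust ownname and tabname for your environment.

```sql
-- Sketch: regather optimizer statistics for the table and its indexes.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,        -- assumption: table is in the current schema
    tabname => 'CUSTOMER',
    cascade => TRUE         -- also gather statistics for the table's indexes
  );
END;
/
```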
Hope this helps.
Consider getting your top account numbers in an inner query / in-line view such that you only perform the joins on those 100 customer records. Otherwise, you could be performing all the joins on the million+ rows, then sorting the million+ results to get the top 100. Something like this may work.
select .....
from customer
where customer.acc_no in (select acc_no from
(select inner_cust.acc_no
from customer inner_cust
order by inner_cust.acc_no desc
)
where rownum <= 100)
and ...
Or, if you are using 12C you can use FETCH FIRST 100 ROWS ONLY
select .....
from customer
where customer.acc_no in (select inner_cust.acc_no
from customer inner_cust
order by inner_cust.acc_no desc
fetch first 100 rows only
)
and ...
This will give the result within 100ms, but MAKE SURE that there is an index on column ACC_NO. There can also be a combined index on ACC_NO plus other columns, but ACC_NO MUST be in the first position in the index. You have to see "range scan" in the execution plan, not "full table scan" and not "skip scan". You will probably see nested loops in the execution plan (these fetch the ADDRESS values from the table). You can improve speed even more by creating a combined index on ACC_NO, ADDRESS (in this order). In that case the Oracle engine does not have to read the table at all, because all the information is contained in the index. You can compare the two in the execution plan.
select top 100 acc_no, address
from customer
order by acc_no desc

Reverse initial order of SELECT statement

I want to run a SQL query in Postgres that is exactly the reverse of the one that you'd get by just running the initial query without an order by clause.
So if your query was:
SELECT * FROM users
Then
SELECT * FROM users ORDER BY <something here to make it exactly the reverse of before>
Would it just be this?
ORDER BY Desc
You are building on the incorrect assumption that you would get rows in a deterministic order with:
SELECT * FROM users;
What you get is really arbitrary. Postgres returns rows in any way it sees fit. For simple queries typically in order of their physical storage, which typically is the order in which rows were entered. But there are no guarantees, and the order may change any time between two calls. For instance after any UPDATE (writing a new physical row version), or when any background process reorders rows - like VACUUM. Or a more complex query might return rows according to an index or a join. Long story short: there is no reliable order for table rows in a relational database unless you specify it with ORDER BY.
That said, assuming you get rows from the above simple query in the order of physical storage, this would get you the reverse order:
SELECT * FROM users
ORDER BY ctid DESC;
ctid is the internal tuple ID signifying physical order. Related:
In-order sequence generation
How list all tables with data changes in the last 24 hours?
Here is a T-SQL solution; this might give you an idea of how to do it in Postgres:
select * from (
SELECT *, row_number() over( order by (select 1)) rowid
FROM users
) x
order by rowid desc
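A rough Postgres counterpart of the T-SQL above is sketched here. It numbers the rows in whatever order Postgres happens to produce them and then reverses that numbering, so it inherits the same caveat: the "original" order is not guaranteed to be stable between runs. The alias names x and rn are arbitrary.

```sql
-- Sketch: number rows in the order they arrive, then sort by that number
-- in reverse. An empty OVER () means no ordering is imposed on the numbering.
SELECT *
FROM (
    SELECT u.*, row_number() OVER () AS rn
    FROM users AS u
) AS x
ORDER BY rn DESC;
```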

What order is used by First() function?

Why do the following two queries return identical results?
SELECT FIRST(score) FROM (SELECT score FROM scores ORDER BY score ASC)
SELECT FIRST(score) FROM (SELECT score FROM scores ORDER BY score DESC)
It's confusing, considering that I manually specify the order of subqueries.
The order of the results in the subquery is irrelevant, unless you use TOP within the subquery, which you don't here. Most SQL variants won't allow this syntax -- using an ORDER BY in a subquery throws an error in SQL Server, for example.
Your top-level query has no ORDER BY, thus the concepts of FIRST or TOP 1 are undefined in the context of that query.
In the reference docs, Microsoft states (emphasis mine):
Because records are usually returned in no particular order (unless
the query includes an ORDER BY clause), the records returned by these
functions will be arbitrary.
To answer the question directly:
Access ignores the ORDER BY clause in most subqueries. I believe (but can't prove) this is due to bugs or limitations in the query optimiser, although it's not documented anywhere that I could find. I've tested lots of SQL using Access 2007 and Access 2016 to come to this conclusion.
To make the examples work as expected:
Add TOP 100 PERCENT to the subqueries:
SELECT FIRST(score) FROM (SELECT TOP 100 PERCENT score FROM scores ORDER BY score ASC)
SELECT FIRST(score) FROM (SELECT TOP 100 PERCENT score FROM scores ORDER BY score DESC)
When to use First/Last instead of Max/Min:
A good example of when you'd want to use this approach instead of the simpler Min and Max aggregate functions is when there's another field that you want from the same record, e.g if the underlying scores table also held the names of players and the rounds of the game, you can get the name and score of the best and worst player in each round like this:
SELECT
round, FIRST(name) AS best, FIRST(score) AS highscore, LAST(name) AS worst, LAST(score) AS lowscore
FROM
(SELECT TOP 100 PERCENT * FROM scores ORDER BY score DESC)
GROUP BY
round
Your statements are perfect functional equivalents of
SELECT Min(Score) FROM Scores and
SELECT Max(Score) FROM Scores.
If you really want to retrieve the first and last score, you will need an AutoNumber or a DateTime field to indicate the input order. You could then query:
SELECT First(Score), Last(Score) FROM Scores ORDER BY MySortKey
If you persist with your question, the correct syntax would be
SELECT FIRST(score) FROM (SELECT score FROM scores) ORDER BY score ASC,
or, simplified,
SELECT FIRST(score) FROM scores ORDER BY score ASC

Are the results deterministic, if I partition SQL SELECT query without ORDER BY?

I have a SQL SELECT query which returns a lot of rows, and I have to split it into several partitions. That is, I set max results to 10000 and iterate over the rows by calling the query several times with an increasing first result (0, 10000, 20000). All the queries are done in the same transaction, and the data that my queries are fetching is not changing during the process (other data in those tables can change, though).
Is it ok to use just plain select:
select a from b where...
Or do I have to use order by with the select:
select a from b where ... order by c
In order to be sure that I will get all the rows? In other words, is it guaranteed that a query without order by will always return the rows in the same order?
Adding order by to the query drops performance of the query dramatically.
I'm using Oracle, if that matters.
EDIT: Unfortunately I cannot take advantage of scrollable cursor.
Order is definitely not guaranteed without an order by clause, but whether or not your results will be deterministic (aside from the order) would depend on the where clause. For example, if you have a unique ID column and your where clause included a different filter range each time you access it, then you would have non-ordered deterministic results, i.e.:
select a from b where ID between 1 and 100
select a from b where ID between 101 and 200
select a from b where ID between 201 and 300
would all return distinct result sets, but the order would not be guaranteed in any way.
No, without order by it is not guaranteed that query will ALWAYS return the rows in the same order.
No guarantees unless you have an order by on the outermost query.
Bad SQL Server example, but same rules apply. Not guaranteed order even with inner query
SELECT
*
FROM
(
SELECT
*
FROM
Mytable
ORDER BY SomeCol
) foo
Use Limit
So you would do:
SELECT * FROM table ORDER BY id LIMIT 0,100
SELECT * FROM table ORDER BY id LIMIT 100,100
SELECT * FROM table ORDER BY id LIMIT 200,100
The first number after LIMIT is the offset (how many rows to skip) and the second is how many rows to return. Note that this is MySQL syntax.
It's a good pagination trick.
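Since the question is about Oracle, which has no LIMIT clause, a comparable approach there is sketched below. It assumes Oracle 12c or later and a unique, indexed column on table b, here called id (a hypothetical name); the asker's own where conditions are left out. Each chunk continues from the last key seen instead of using a growing offset, so the database does not have to re-read and discard the rows already fetched.

```sql
-- First chunk: the 10000 rows with the smallest ids.
SELECT a, id
FROM b
ORDER BY id
FETCH FIRST 10000 ROWS ONLY;

-- Following chunks: bind :last_id to the largest id returned by the
-- previous chunk, and repeat until fewer than 10000 rows come back.
SELECT a, id
FROM b
WHERE id > :last_id
ORDER BY id
FETCH FIRST 10000 ROWS ONLY;
```

With an index on id, the ORDER BY can usually be satisfied by walking the index, which may avoid the sort cost that made the ordered query slow.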