I'm just getting into optimizing queries by logging slow queries and EXPLAINing them. The thing is, I'm not sure exactly what I should be looking for. I have the query
SELECT DISTINCT
screenshot.id,
screenshot.view_count
FROM screenshot_udb_affect_assoc
INNER JOIN screenshot ON id = screenshot_id
WHERE unit_id = 56
ORDER BY RAND()
LIMIT 0, 6;
Looking at the query and its EXPLAIN output below, where should I focus my optimization efforts?
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE screenshot ALL PRIMARY NULL NULL NULL 504 Using temporary; Using filesort
1 SIMPLE screenshot_udb_affect_assoc ref screenshot_id screenshot_id 8 source_core.screenshot.id,const 3 Using index; Distinct
To begin with, please refrain from using ORDER BY RAND(). It degrades performance badly as the table grows.
For example, even with LIMIT 1, MySQL generates a random number for every row in the table and sorts them all just to pick the smallest one. That is a lot of wasted work on a table that is large or bound to grow. A detailed discussion can be found at: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
Lastly, make sure your join columns are indexed.
Try:
SELECT s.id,
       s.view_count
  FROM SCREENSHOT s
 WHERE EXISTS (SELECT NULL
                 FROM SCREENSHOT_UDB_AFFECT_ASSOC x
                WHERE x.screenshot_id = s.id
                  AND x.unit_id = 56)
 ORDER BY RAND()
 LIMIT 6;
Under 100K records it's fine to use ORDER BY RAND(); beyond that, you want to start looking at alternatives that scale better.
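If the table does outgrow ORDER BY RAND(), one common alternative is to jump to a random id instead of sorting every row. A sketch, assuming id is a reasonably gap-free AUTO_INCREMENT primary key (gaps skew the distribution); it returns one pseudo-random row, so you would run it once per row needed or combine it with the unit filter:
-- Pick a random id in [0, MAX(id)], then take the first row at or above it
SELECT s.id, s.view_count
FROM screenshot s
JOIN (SELECT FLOOR(RAND() * (SELECT MAX(id) FROM screenshot)) AS rand_id) r
  ON s.id >= r.rand_id
ORDER BY s.id
LIMIT 1;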
I agree with kuriouscoder: refrain from using ORDER BY RAND(), and make sure each of the following columns is indexed:
screenshot_udb_affect_assoc.id
screenshot.id
screenshot.unit_id
You can do this using code like:
CREATE INDEX Index1 ON screenshot (id);
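Going by the EXPLAIN output, a composite index on the association table should serve both the unit_id filter and the join (the index name here is made up):
-- Hypothetical: covers WHERE unit_id = 56 and the join on screenshot_id
CREATE INDEX idx_assoc_unit_screenshot
    ON screenshot_udb_affect_assoc (unit_id, screenshot_id);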
I want to apply pagination on a table with a huge amount of data. All I want to know is a better option than using OFFSET in SQL Server.
Here is my simple query:
SELECT *
FROM TableName
ORDER BY Id DESC
OFFSET 30000000 ROWS
FETCH NEXT 20 ROWS ONLY
You can use Keyset Pagination for this. It's far more efficient than using Rowset Pagination (paging by row number).
In Rowset Pagination, all previous rows must be read, before being able to read the next page. Whereas in Keyset Pagination, the server can jump immediately to the correct place in the index, so no extra rows are read that do not need to be.
For this to perform well, you need to have a unique index on that key, which includes any other columns you need to query.
In this type of pagination, you cannot jump to a specific page number. You jump to a specific key and read from there. So you need to save the unique key value of the last row on the current page and use it to skip to the next. Alternatively, you could calculate or estimate a starting point for each page up-front.
One big benefit, apart from the obvious efficiency gain, is avoiding the "missing row" problem when paginating, caused by rows being removed from previously read pages. This does not happen when paginating by key, because the key does not change.
Here is an example:
Let us assume you have a table called TableName with an index on Id, and you want to start at the latest Id value and work backwards.
You begin with:
SELECT TOP (@numRows)
*
FROM TableName
ORDER BY Id DESC;
Note the use of ORDER BY to ensure the order is correct
In some RDBMSs you need LIMIT instead of TOP
The client will hold the last received Id value (the lowest in this case). On the next request, you jump to that key and carry on:
SELECT TOP (@numRows)
*
FROM TableName
WHERE Id < @lastId
ORDER BY Id DESC;
Note the use of < not <=
In case you were wondering: in a typical B+tree index, the row with the indicated Id is not read; it's the row after it that's read.
The key chosen must be unique, so if you are paging by a non-unique column you must add a second column to both the ORDER BY and the WHERE. You would need an index on (OtherColumn, Id), for example, to support this type of query. Don't forget the INCLUDE columns on the index.
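For example, a supporting index might look like this (the index name and the INCLUDE list are placeholders for whatever your query actually selects):
-- Hypothetical supporting index for paging by (OtherColumn, Id)
CREATE UNIQUE INDEX IX_TableName_OtherColumn_Id
    ON TableName (OtherColumn DESC, Id DESC)
    INCLUDE (SomeQueriedColumn);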
SQL Server does not support row/tuple comparators, so you cannot do (OtherColumn, Id) < (@lastOther, @lastId) (this is however supported in PostgreSQL, MySQL, MariaDB and SQLite).
Instead you need the following:
SELECT TOP (@numRows)
       *
FROM TableName
WHERE (
      (OtherColumn = @lastOther AND Id < @lastId)
      OR OtherColumn < @lastOther
)
ORDER BY
    OtherColumn DESC,
    Id DESC;
This is more efficient than it looks, as SQL Server can convert this into a proper < over both values.
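Where row-value comparators are supported (PostgreSQL, MySQL, MariaDB, SQLite, per the note above), the same page can be fetched more directly; a sketch, with parameter syntax varying by database:
-- Row-value comparison (not valid in SQL Server)
SELECT *
FROM TableName
WHERE (OtherColumn, Id) < (@lastOther, @lastId)
ORDER BY OtherColumn DESC, Id DESC
LIMIT @numRows;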
The presence of NULLs complicates things further. You may want to query those rows separately.
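A minimal sketch of that "query separately" idea, assuming OtherColumn is nullable and the NULL group is paged once the non-NULL rows are exhausted:
-- Hypothetical second phase: page the NULL group by Id alone
SELECT TOP (@numRows)
       *
FROM TableName
WHERE OtherColumn IS NULL
  AND Id < @lastId
ORDER BY Id DESC;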
On a very big merchant website we use a technique built around ids stored in a pseudo-temporary table, which is then joined back to the rows of the product table.
Let me explain with a concrete example.
We have a table designed this way:
CREATE TABLE S_TEMP.T_PAGINATION_PGN
(PGN_ID BIGINT IDENTITY(-9223372036854775808, 1) PRIMARY KEY,
 PGN_SESSION_GUID UNIQUEIDENTIFIER NOT NULL,
 PGN_SESSION_DATE DATETIME2(0) NOT NULL,
 PGN_PRODUCT_ID INT NOT NULL,
 PGN_SESSION_ORDER INT NOT NULL);
CREATE INDEX X_PGN_SESSION_GUID_ORDER
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_GUID, PGN_SESSION_ORDER)
INCLUDE (PGN_PRODUCT_ID); -- covers the join back to the product table
CREATE INDEX X_PGN_SESSION_DATE
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_DATE);
We have a very big product table called T_PRODUIT_PRD, which the customer filters with many predicates. We INSERT rows from the filtered SELECT into the pagination table this way:
DECLARE @SESSION_ID UNIQUEIDENTIFIER = NEWID();
INSERT INTO S_TEMP.T_PAGINATION_PGN
       (PGN_SESSION_GUID, PGN_SESSION_DATE, PGN_PRODUCT_ID, PGN_SESSION_ORDER)
SELECT @SESSION_ID, SYSUTCDATETIME(), PRD_ID,
       ROW_NUMBER() OVER(ORDER BY ...) -- custom order by
FROM dbo.T_PRODUIT_PRD
WHERE ... -- custom filter
Then, every time we need a page of @N products, we add a join to this table:
...
JOIN S_TEMP.T_PAGINATION_PGN
     ON PGN_SESSION_GUID = @SESSION_ID
    -- ROW_NUMBER() starts at 1, hence the -1 before the integer division
    AND 1 + ((PGN_SESSION_ORDER - 1) / @N) = @DESIRED_PAGE_NUMBER
    AND PGN_PRODUCT_ID = dbo.T_PRODUIT_PRD.PRD_ID
All the indexes will do the job!
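Put together, a page fetch might look like this (a sketch; the selected columns are placeholders):
-- Hypothetical complete page fetch using the pagination table
SELECT PRD.*
FROM dbo.T_PRODUIT_PRD AS PRD
JOIN S_TEMP.T_PAGINATION_PGN
     ON PGN_SESSION_GUID = @SESSION_ID
    AND 1 + ((PGN_SESSION_ORDER - 1) / @N) = @DESIRED_PAGE_NUMBER
    AND PGN_PRODUCT_ID = PRD.PRD_ID
ORDER BY PGN_SESSION_ORDER;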
Of course, we have to purge this table regularly, which is why a scheduled job deletes the rows whose sessions were created more than 4 hours ago:
DELETE FROM S_TEMP.T_PAGINATION_PGN
WHERE PGN_SESSION_DATE < DATEADD(hour, -4, SYSUTCDATETIME());
In the same spirit as SQLPro solution, I propose:
WITH CTE AS
    (SELECT 30000000 AS N
     UNION ALL
     SELECT N - 1 FROM CTE
     WHERE N > 30000000 + 1 - 20)
SELECT T.*
FROM CTE
JOIN TableName T ON CTE.N = T.Id
ORDER BY CTE.N DESC;
Tried with 2 billion rows and it's instant!
It's easy to make it a stored procedure...
Of course, this is only valid if the ids are contiguous.
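A sketch of the stored-procedure version (the procedure and parameter names are made up):
-- Hypothetical parameterized wrapper around the recursive-CTE trick
CREATE PROCEDURE dbo.GetPageByContiguousId
    @startId BIGINT,  -- highest id of the requested page
    @pageSize INT
AS
BEGIN
    WITH CTE AS
        (SELECT @startId AS N
         UNION ALL
         SELECT N - 1 FROM CTE
         WHERE N > @startId + 1 - @pageSize)
    SELECT T.*
    FROM CTE
    JOIN TableName T ON CTE.N = T.Id
    ORDER BY CTE.N DESC
    OPTION (MAXRECURSION 32767); -- the default limit of 100 would cap the page size
END;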
It seems like there is a strange performance hit when running a query that includes both NOT 'some string' = ANY(array_column) and an ORDER BY in the same query.
The following is a simplified table structure illustrating the behavior, where taggers is an array of UUIDs (v4):
CREATE TABLE IF NOT EXISTS "doc"."test" (
"id" STRING,
"last_active" TIMESTAMP,
"taggers" ARRAY(STRING)
)
The taggers array can grow somewhat large, with maybe hundreds and in some cases thousands of individual strings.
The following queries are all very performant and resolve within 0.03 seconds:
SELECT id FROM test ORDER BY last_active DESC LIMIT 10;
SELECT id FROM test WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) LIMIT 10;
SELECT id FROM test WHERE ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) ORDER BY last_active DESC LIMIT 10;
However, a query including both parts jumps to around 2-3 seconds:
SELECT id FROM test WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) ORDER BY last_active LIMIT 10;
What's very strange is that the last of the fast queries above is almost exactly the same as the slow one, just without the negation. Negating the ANY by itself is also very fast. It's only when the negated ANY is combined with the ORDER BY that things slow down. Any help would be greatly appreciated.
The query with only the ORDER BY doesn't apply any filtering, so of course it's much faster.
The query with only the NOT ... ANY() filter and no ORDER BY applies the filter to just enough rows to reach the LIMIT (10 in this case).
The last query (the NOT ... ANY() filter plus ORDER BY) is significantly slower because it has much more work to do: it must apply the filter to every row of the table, then sort them, and only then return the first 10 (LIMIT).
I had to review some code and came across something someone did. I can't think of a reason why my way is better, and it probably isn't, so: which is better/safer/more efficient?
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5 GROUP BY event_id
OR
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date
I would have gone with the 2nd option, but I'm not sure why, and if that's right.
1) When there is a clustered index on the table and the column to be queried, both the MAX() operator and the query SELECT TOP 1 will have almost identical performance.
2) When there is no clustered index on the table and the column to be queried, the MAX() operator offers the better performance.
Reference: http://www.johnsansom.com/performance-comparison-of-select-top-1-verses-max/
Performance is generally similar, if your table is indexed.
Worth considering though: TOP usually only makes sense if you're ordering your results (otherwise, top of what?).
Ordering a result requires more processing.
MAX/MIN doesn't always require ordering (it depends, but often you don't need an ORDER BY or GROUP BY at all).
In your two examples, I'd expect the speed and execution plans to be very similar. You can always check the stats to make sure, but I doubt the difference would be significant.
They are different queries.
The first one returns many records (the biggest a_date for each event_id found within a_primary_key = 5)
The second one returns one record (the smallest a_date found within a_primary_key = 5).
For the queries to have the same result you would need:
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date DESC
The best way to know which is faster is to check the query plan and do your benchmarks. There are many factors that would affect the speed, such as table/heap size, etc. And even different versions of the same database may be optimized to favor one query over the other.
I ran MAX and TOP on a table with 2,000,000+ records and found that TOP with ORDER BY gives faster results than the MAX or MIN function.
So the best way is to execute both of your queries several times each and compare the elapsed time for each.
MAX and TOP function differently. Your first query returns the maximum a_date with a_primary_key = 5 for each distinct event_id found. The second query simply grabs the first a_date with a_primary_key = 5 found in the result set.
To add to the otherwise brilliant responses noting that the queries do very different things: the results will also differ when no rows match the criteria in the SELECT.
SELECT MAX() will return one row with a NULL value
SELECT TOP 1 will return zero rows
These are very different things.
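A quick way to see this for yourself (assuming no row has a_primary_key = -1):
-- MAX() over an empty match: one row containing NULL
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = -1;
-- TOP 1 over an empty match: an empty result set
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = -1 ORDER BY a_date DESC;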
I ran an experiment and got a Clustered Index Scan cost of 98% when I used an aggregate like MIN/MAX, but when I used TOP with ORDER BY, the Clustered Index Scan cost dropped to 45%. When querying large datasets, the TOP and ORDER BY combination can be less expensive and give faster results.
Is there any way I can prohibit MySQL from performing a full table scan when the result was not found using indexes?
For example this query:
SELECT *
FROM a
WHERE (X BETWEEN a.B AND a.C)
ORDER BY a.B DESC
LIMIT 1;
This is only efficient if X satisfies the condition and at least one row is returned; if no data in the table can satisfy the condition, a full scan is performed, which can be very costly.
I don't want to optimize this particular query, it is just an example.
EXPLAIN produces the same plan whether X is inside or outside the range:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE a range long_ip long_ip 8 \N 116183 100.00 Using where
The status variables show much more useful information. For X outside the range:
Handler_read_prev 84181
Key_read_requests 11047
For X in range:
Handler_read_key 1
Key_read_requests 12
If only there was a way to prevent Handler_read_prev from ever growing past 1.
UPDATE: I can't accept my own answer, because it doesn't really answer the question (HANDLER is a great feature, though). It seems to me that there is no general way to prevent MySQL from doing a full scan. While simple conditions like key = 'X' are recognized as an "Impossible WHERE", more complex predicates like BETWEEN are not.
You could write a "fully covered" subquery that only uses data that is available in indexes. Based on the returned primary key, you can look up the rows in the master table.
The following query is fully covered by indexes on (id), (B,id), and (C,id):
select *
from a
where id in (
select id
from a
where X <= C
and id in (
select id
from a
where B <= X
)
)
limit 1
Each SELECT uses one index: the innermost SELECT uses the index on (B,id), the middle SELECT uses the index on (C,id), and the outer SELECT uses the primary key.
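The supporting indexes would look something like this (index names are made up; with InnoDB, the primary key is implicitly appended to every secondary index, so (B) and (C) alone would suffice there):
-- Hypothetical index definitions backing the covered subqueries
CREATE INDEX idx_a_b ON a (B, id);
CREATE INDEX idx_a_c ON a (C, id);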
Here is what I came up with in the end:
HANDLER a OPEN;
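-- Positions on the row with the greatest B <= X via index BC; no table scan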
HANDLER a READ BC <= (X);
HANDLER a CLOSE;
BC is the name of the key (B,C). If we keep the table ordered by B DESC, then the result is guaranteed to be equal to
SELECT *
FROM a
WHERE (X BETWEEN a.B AND a.C)
ORDER BY a.B DESC
LIMIT 1;
Now if X is not in the range of table a, we just have to check that a.C is greater than X; if it's not, then X is definitely outside the range and we don't need to look any further.
This is not very elegant, though, and you will have to re-sort the table on each insert or update.
I've been trying to solve this problem for a few days now without much luck. I have found loads of resources that talk about paging on SQL Server 2000 both here and on codeproject.
The problem I am facing is trying to implement some sort of paging mechanism on a table whose primary key is made up of three columns: Operator, CustomerIdentifier, DateDisconnected.
Any help/pointers would be greatly appreciated.
SQL Server 2000 doesn't have the handy ROW_NUMBER() function, so you'll have to auto-generate a row number column with a subquery, like so:
select
*
from
(select
*,
(select count(*) from tblA where
operator < a.operator
or (operator = a.operator
and customeridentifier < a.customeridentifier)
or (operator = a.operator
and customeridentifier = a.customeridentifier
and datedisconnected <= a.datedisconnected)) as rownum
from
tblA a) s
where
s.rownum between 5 and 10
order by s.rownum
However, you can sort those rows by any column in the table -- it doesn't have to be the composite key. It would probably run faster, too!
Additionally, composite keys are usually a red flag. Is there any particular reason you aren't just using a surrogate key with a unique constraint on these three columns?
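If you did restructure it that way, it might look like this (a sketch; it assumes the existing composite primary key has been dropped first):
-- Hypothetical surrogate key plus a unique constraint preserving the old composite key
ALTER TABLE tblA ADD id INT IDENTITY(1, 1) NOT NULL;
ALTER TABLE tblA ADD CONSTRAINT PK_tblA PRIMARY KEY (id);
ALTER TABLE tblA ADD CONSTRAINT UQ_tblA_operator_customer_date
    UNIQUE (operator, customeridentifier, datedisconnected);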