MAX vs TOP 1 - which is better?

I had to review some code and came across something someone else did, and I can't think of a reason why my way is better (it probably isn't). So, which is better/safer/more efficient?
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5 GROUP BY event_id
OR
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date
I would have gone with the 2nd option, but I'm not sure why, and if that's right.

1) When there is a clustered index on the table and the column to be queried, both the MAX() operator and the query SELECT TOP 1 will have almost identical performance.
2) When there is no clustered index on the table and the column to be queried, the MAX() operator offers the better performance.
Reference: http://www.johnsansom.com/performance-comparison-of-select-top-1-verses-max/

Performance is generally similar, if your table is indexed.
Worth considering though: Top usually only makes sense if you're ordering your results (otherwise, top of what?)
Ordering a result requires more processing.
Max/Min doesn't always require ordering. (It just depends, but often you don't need ORDER BY or GROUP BY, etc.)
In your two examples, I'd expect speed / x-plan to be very similar. You can always turn to your stats to make sure, but I doubt the difference would be significant.

They are different queries.
The first one returns many records (the biggest a_date for each event_id found within a_primary_key = 5)
The second one returns one record (the smallest a_date found within a_primary_key = 5).

For the queries to have the same result you would need:
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date DESC
The best way to know which is faster is to check the query plan and do your benchmarks. There are many factors that would affect the speed, such as table/heap size, etc. And even different versions of the same database may be optimized to favor one query over the other.
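For example, on SQL Server (where TOP is available) you could compare the two directly with timing and I/O statistics; a minimal sketch using the table from the question:
-- Enable per-query timing and I/O counters for this session
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5;
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date DESC;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
The messages output will then show elapsed time and logical reads for each statement, which is more reliable than eyeballing wall-clock time.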

I ran MAX and TOP on a table with 2,000,000+ records
and found that TOP with ORDER BY gave faster results than the MAX or MIN functions.
So the best approach is to execute both of your queries a few times and compare the elapsed times.

MAX and TOP function differently. Your first query will return the maximum value found for a_date that has an a_primary_key = 5 for each different event_id found. The second query will simply grab the smallest a_date with an a_primary_key = 5, since the ascending ORDER BY puts it first in the result set.

To add to the otherwise brilliant responses noting that the queries do very different things indeed, I'd like to point out that the results will also be very different if no rows match the criteria in the SELECT.
SELECT MAX() will return one row with a NULL value.
SELECT TOP 1 will return zero rows.
These are very different things.
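A quick way to see this for yourself (a sketch; -1 here stands in for any key value that matches no rows):
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = -1;
-- returns one row containing NULL
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = -1 ORDER BY a_date DESC;
-- returns zero rows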

I ran an experiment and got a Clustered Index Scan cost of 98% when I used an aggregate like MIN/MAX, but when I used TOP with ORDER BY, the Clustered Index Scan cost was reduced to 45%. When querying large datasets, the TOP and ORDER BY combination will be less expensive and will give faster results.

Related

Two queries with equal results take vastly different times to execute?

I'm currently using Redshift. I was trying to execute a query to compute a column called id_number (data type INTEGER) from a VARCHAR column called id, to speed up further queries by using id_number instead of id.
Here is the first query I tried:
select rank() over (order by id) id_number, id, sid1, sid2
from table
limit 10000
However, noticing that this query was taking quite some time, I tried the next query:
with A as (
    select id, sid1, sid2
    from table
    limit 10000
)
select rank() over (order by id) id_number, id, sid1, sid2
from A
which was over in a flash.
How is it that the second query took so much less time to execute, when the two queries seem to do the exact same thing?
If it is because of the positions of limit 10000, how did the position of the limit contribute to the difference in execution time?
Your two queries are quite different.
The first one has to sort the complete table to get the rank() and then emits the first 10000 rows of the result (with no particular ordering enforced).
The second one selects 10000 rows (without a particular ordering enforced) and then sorts those to calculate rank() on them.
If the table is significantly larger than 10000 rows, it is unsurprising that the first query, which has to sort it all, is much slower.
Look at the query plan (EXPLAIN) to understand this better.
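For example, prefixing either query with EXPLAIN (Redshift supports plain EXPLAIN) shows where the sort happens relative to the limit:
-- 'table' is the placeholder name from the question
EXPLAIN
select rank() over (order by id) id_number, id, sid1, sid2
from table
limit 10000;
In the plan for the first query, the window/sort step consumes the whole table; in the second, it only sees the 10000 rows produced by the inner LIMIT.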

Slow performance with NOT ANY and ORDER BY in same query

It seems like there is a strange performance hit when running a query that includes both NOT 'some string' = ANY(array_column) as well as an ORDER BY statement in the same query.
The following is a simplified table structure illustrating the behavior, where taggers is an array of UUIDs (v4):
CREATE TABLE IF NOT EXISTS "doc"."test" (
    "id" STRING,
    "last_active" TIMESTAMP,
    "taggers" ARRAY(STRING)
)
The taggers array can grow somewhat large, with maybe hundreds and in some cases thousands of individual strings.
The following queries are all very performant and resolve within .03 seconds:
SELECT id FROM test ORDER BY last_active DESC LIMIT 10;
SELECT id FROM test WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) LIMIT 10;
SELECT id FROM test WHERE ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) ORDER BY last_active DESC LIMIT 10;
However, including both parts in the same query jumps the time to around 2-3 seconds:
SELECT id FROM test WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) ORDER BY last_active LIMIT 10;
What's very strange is that the last of the fast queries above is almost exactly the same as the slow one, just without the negation. Negation of the ANY on its own is also very fast. It's only when the negated ANY is combined with ORDER BY that things slow down. Any help would be greatly appreciated.
The query with only ORDER BY doesn't apply any filtering, so of course it's much faster.
The query that only has the NOT ... ANY() filter without ORDER BY applies the filter to just a small number of records, until the LIMIT (10 in this case) is reached.
The last query (filtering with NOT ANY plus ORDER BY) is significantly slower because it has to do much more work: it has to apply the filter to all records of the table, then sort them, and finally return the first 10 (LIMIT).
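If you want to confirm where the time goes, the plan should show a full scan plus sort for the slow variant. A sketch (the schema above looks like CrateDB, which supports EXPLAIN; treat that as an assumption):
EXPLAIN
SELECT id FROM test
WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers))
ORDER BY last_active
LIMIT 10;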

Help optimizing simple MySQL query

I'm just getting into optimizing queries by logging slow queries and EXPLAINing them. The thing is... I'm not sure exactly what kind of things I should be looking for. I have the query
SELECT DISTINCT
    screenshot.id,
    screenshot.view_count
FROM screenshot_udb_affect_assoc
INNER JOIN screenshot ON id = screenshot_id
WHERE unit_id = 56
ORDER BY RAND()
LIMIT 0, 6;
Looking at these two elements.... where should I focus on optimization?
id  select_type  table                        type  possible_keys  key            key_len  ref                              rows  Extra
1   SIMPLE       screenshot                   ALL   PRIMARY        NULL           NULL     NULL                             504   Using temporary; Using filesort
1   SIMPLE       screenshot_udb_affect_assoc  ref   screenshot_id  screenshot_id  8        source_core.screenshot.id,const  3     Using index; Distinct
To begin with, please refrain from using ORDER BY RAND(). This in particular degrades performance when the table size is large.
For example, even with LIMIT 1, it generates a random number for every row and picks the smallest one. This is inefficient if the table is large or bound to grow. A detailed discussion can be found at: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
Lastly, also ensure that your join columns are indexed.
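For instance, a composite index covering the filter and join columns of the association table might look like this (hypothetical index name; assumes unit_id lives on that table):
CREATE INDEX idx_assoc_unit_screenshot
ON screenshot_udb_affect_assoc (unit_id, screenshot_id);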
Try:
SELECT s.id,
       s.view_count
FROM SCREENSHOT s
WHERE EXISTS (SELECT NULL
              FROM SCREENSHOT_UDB_AFFECT_ASSOC x
              WHERE x.screenshot_id = s.id
                AND x.unit_id = 56) -- keeps the unit_id filter from the original query (assuming unit_id is on the assoc table)
ORDER BY RAND()
LIMIT 6
Under 100K records, it's fine to use ORDER BY RAND() -- over that, and you want to start looking at alternatives that scale better. For more info, see this article.
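One common alternative, sketched below (not necessarily what the linked article suggests; it assumes an AUTO_INCREMENT id with few gaps, and it skips the unit_id join for brevity): pick a random id threshold instead of sorting every row.
-- Grab the first row at or above a random point in the id range
SELECT s.id, s.view_count
FROM screenshot s
JOIN (SELECT FLOOR(RAND() * (SELECT MAX(id) FROM screenshot)) AS rand_id) r
  ON s.id >= r.rand_id
ORDER BY s.id
LIMIT 1;
Rows that follow large id gaps get picked slightly more often, and you'd repeat this per random row you need, but it avoids sorting the whole table.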
I agree with kuriouscoder: refrain from using ORDER BY RAND(), and make sure each of the following fields is indexed (an index can't span tables, so index each table separately):
screenshot_udb_affect_assoc.id
screenshot.id
screenshot.unit_id
do this using code like:
create index Index1 on screenshot(id);

How can I order entries in a UNION without ORDER BY?

How can I be sure that my result set will have a first and b second? It would help me to solve a tricky ordering problem.
Here is a simplified example of what I'm doing:
SELECT a FROM A LIMIT 1
UNION
SELECT b FROM B LIMIT 1;
SELECT col
FROM
(
    SELECT a col, 0 ordinal FROM A LIMIT 1
    UNION ALL
    SELECT b, 1 FROM B LIMIT 1
) t
ORDER BY ordinal
I don't think order is guaranteed, at least not across all DBMS.
What I've done in the past to control the ordering in UNIONs is:
(SELECT a, 0 AS Foo FROM A LIMIT 1)
UNION
(SELECT b, 1 AS Foo FROM B LIMIT 1)
ORDER BY Foo
Your result set with UNION will eliminate duplicate values.
I can't find any proof in the documentation, but from 10 years of experience I can tell you that UNION ALL does preserve order, at least in Oracle.
Do not rely on this, however, if you're building a nuclear plant or something like that.
No, the order of results in a SQL query is controlled only by the ORDER BY clause. It may be that you happen to see ordered results without an ORDER BY clause in some situation, but that is by chance (e.g. a side-effect of the optimiser's current query plan) and not guaranteed.
What is the tricky ordering problem?
I know for Oracle there is no way to guarantee which will come out first without an order by. The problem is if you try it it may come out in the correct order even for most of the times you run it. But as soon as you rely on it in production, it will come out wrong.
I would have thought not, since the database would most likely need to do a sort in order to perform the UNION (to eliminate duplicates).
UNION ALL might behave differently, but YMMV.
The short answer is yes, you will get A then B.

How do I calculate a moving average using MySQL?

I need to do something like:
SELECT value_column1
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
Except in addition to value_column1, I also need to retrieve a moving average of the previous 20 values of value_column1.
Standard SQL is preferred, but I will use MySQL extensions if necessary.
This is just off the top of my head, and I'm on the way out the door, so it's untested. I also can't imagine that it would perform very well on any kind of large data set. I did confirm that it at least runs without an error though. :)
SELECT
    T1.value_column1,
    (
        SELECT AVG(T2.value_column1)
        FROM Table1 T2
        WHERE
        (
            SELECT COUNT(*)
            FROM Table1 T3
            WHERE T3.datetime_column1 BETWEEN T2.datetime_column1 AND T1.datetime_column1
        ) BETWEEN 1 AND 20
    ) AS moving_average
FROM Table1 T1
Tom H's approach will work. You can simplify it like this if you have an identity column:
SELECT T1.id, T1.value_column1, AVG(T2.value_column1) AS moving_average
FROM table1 T1
INNER JOIN table1 T2 ON T2.id BETWEEN T1.id - 19 AND T1.id
GROUP BY T1.id, T1.value_column1
I realize that this answer is about 7 years too late. I had a similar requirement and thought I'd share my solution in case it's useful to someone else.
There are some MySQL extensions for technical analysis that include a simple moving average. They're really easy to install and use: https://github.com/mysqludf/lib_mysqludf_ta#readme
Once you've installed the UDF (per instructions in the README), you can include a simple moving average in a select statement like this:
SELECT TA_SMA(value_column1, 20) AS sma_20 FROM table1 ORDER BY datetime_column1
When I had a similar problem, I ended up using temp tables for a variety of reasons, but it made this a lot easier! What I did looks very similar to what you're doing, as far as the schema goes.
Make the schema something like ID identity, start_date, end_date, value. When you select, do a subselect avg of the previous 20 based on the identity ID.
Only do this if you find yourself already using temp tables for other reasons though (I hit the same rows over and over for different metrics, so it was helpful to have the small dataset).
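A rough sketch of that temp-table idea (all names hypothetical; a plain scratch table is used because older MySQL versions can't reference a TEMPORARY table twice in one query):
CREATE TABLE tmp_values (
    id INT AUTO_INCREMENT PRIMARY KEY,
    datetime_column1 DATETIME,
    value_column1 DECIMAL(10, 2)
);
-- AUTO_INCREMENT ids are assigned in insert order, so ordering the INSERT
-- by datetime gives a dense row number to window over
INSERT INTO tmp_values (datetime_column1, value_column1)
SELECT datetime_column1, value_column1
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
-- Average of the current row and the 19 preceding rows
SELECT t1.datetime_column1, t1.value_column1,
       (SELECT AVG(t2.value_column1)
        FROM tmp_values t2
        WHERE t2.id BETWEEN t1.id - 19 AND t1.id) AS moving_average
FROM tmp_values t1
ORDER BY t1.id;
DROP TABLE tmp_values;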
My solution adds a row number to the table. The following example code may help:
set @MA_period=5;
select id1, tmp1.date_time, tmp1.c, avg(tmp2.c) from
(select @b:=@b+1 as id1, date_time, c from websource.EURUSD, (select @b:=0) bb order by date_time asc) tmp1,
(select @a:=@a+1 as id2, date_time, c from websource.EURUSD, (select @a:=0) aa order by date_time asc) tmp2
where id1>@MA_period and id1>=id2 and id2>(id1-@MA_period)
group by id1
order by id1 asc, id2 asc
In my experience, MySQL as of 5.5.x tends not to use indexes on dependent selects, whether a subquery or a join. This can have a very significant impact on performance where the dependent select criteria change on every row.
A moving average is an example of a query which falls into this category. Execution time may increase with the square of the rows. To avoid this, choose a database engine which can perform indexed look-ups on dependent selects. I find Postgres works effectively for this problem.
In MySQL 8, a window function frame can be used to obtain the averages.
SELECT value_column1, AVG(value_column1) OVER (ORDER BY datetime_column1 ROWS 19 PRECEDING) as ma
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
This calculates the average of the current row and 19 preceding rows.
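One caveat: for the first 19 rows the frame holds fewer than 20 values, so the "average" is computed over a shorter window. If you'd rather get NULL until a full 20-row window exists, a small variation (a sketch) is:
SELECT value_column1,
       CASE
           WHEN ROW_NUMBER() OVER (ORDER BY datetime_column1) >= 20
           THEN AVG(value_column1) OVER (ORDER BY datetime_column1 ROWS 19 PRECEDING)
       END AS ma
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;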