Explain the result of an "EXPLAIN" query in MySQL - sql

I am using indexes on my MySQL tables.
My query was like this:
EXPLAIN SELECT * FROM `logs` WHERE userId =288 AND dateTime BETWEEN '2010-08-01' AND '2010-08-27'
I have an index on the field userId for this table logs,
and the result of the EXPLAIN query is shown below.
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE logs ref userId userId 4 const 49560 Using where
The question is: is my index really useful or not?
Thanks in advance
@fastmultiplication:
I thought that indexing both of these fields might increase the load on MySQL, as there will be a lot of entries with unique (userId, dateTime) combinations.
I have tried adding a combined index userId_dateTime on both columns, and the result is:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE logs ref userId_dateTime userId_dateTime 4 const 63455 Using where

Your query is using indexes, and yes, they are useful. You might find the following doc pages useful:
EXPLAIN Output Format
How MySQL Uses Indexes
Multiple-Column Indexes
Also:
Multiple column index vs multiple indexes
MySQL will usually use the index that returns the smallest number of rows. In your first example, MySQL is using the userId index to narrow down the number of rows to 49560. That means that userId does not contain unique values (if it did, you wouldn't need the date range condition). As there is no index on the dateTime column, it then has to scan each row to find the ones that meet your date range criteria.
In your second example, you appear to have created a compound (multiple-column) index on userId and dateTime. In this case, it appears as though MySQL is not able to use the latter half of the index for the BETWEEN clause—I'm not sure why. It may be worth trying it with two separate indexes, rather than a multiple-column index. You may also want to try replacing BETWEEN with:
dateTime >= '2010-08-01' AND dateTime <= '2010-08-27'
This should be identical, but see the following bug report, which may affect your version of MySQL:
Optimizer does not use index for BETWEEN in a JOIN condition
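If you want to try the separate-index approach mentioned above, a minimal sketch (the index name is just an example) is to keep the existing userId index and add a single-column index on dateTime:
ALTER TABLE `logs` ADD INDEX idx_dateTime (dateTime);
The optimizer can then pick whichever index narrows the rows most or, in some cases, combine them with an index merge.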

From the 'rows' field it looks like MySQL still estimates it will have to look at a lot of rows.
You should try adding an index to the dateTime field, too.
And for this particular query, maybe an index on both of the fields.
alter table logs add index user_datetime (userId,dateTime);
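After adding it, re-run the EXPLAIN from the question; if MySQL can use both columns for the range, key_len should be larger than the 4 bytes reported for userId alone:
EXPLAIN SELECT * FROM `logs`
WHERE userId = 288 AND dateTime BETWEEN '2010-08-01' AND '2010-08-27';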

How many rows should the query return? And how fast is the query running?
It looks to me like a pretty simple query which is using the correct index, so if it is slow for some reason, it is probably because it has to actually return a lot of data. If you are not actually interested in all the rows, you can use LIMIT to return fewer rows.
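For example, if you only need the most recent rows for that user, a sketch (the ORDER BY column and the 100-row limit are just illustrative) could be:
SELECT * FROM `logs`
WHERE userId = 288 AND dateTime BETWEEN '2010-08-01' AND '2010-08-27'
ORDER BY dateTime DESC
LIMIT 100;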


Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
id BIGSERIAL PRIMARY KEY,
joincol VARCHAR
);
CREATE TABLE test2 (
joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
SELECT test1.id,
test1.joincol AS t1charcol,
test2.joincol AS t2charcol
FROM test1, test2
WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100ms. As far as I understand the execution plan, the runtime is independent of the rowcount, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a join on a row is possible and immediately returns.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on rowcount), since the two tables are "joined completely", before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
On my real environment test1 contains only a handful of rows (< 100), having unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on a row-by-row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use an Index Scan Backward on a row-by-row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
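A minimal sketch against the test1/test2 setup from the question (the index name is arbitrary):
CREATE INDEX test1_id_desc_idx ON test1 (id DESC NULLS LAST);
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;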
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test1.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly; regardless of index sort order, an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON test1 (id);
The PK on test1.id is implemented with a unique index on the column, which already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from the actual use case to be meaningful.
In the test setup, each table has 100k rows; there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has 10M rows in table1 and < 100 rows in table2, every value in table1.joincol has a match in table2.joincol, both are defined NOT NULL, and table2.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on table2.joincol and a FK constraint t1.joincol --> t2.joincol.
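A sketch of those two constraints, following the table1/table2 naming used above and assuming table2 is the small table with unique joincol values (constraint names are just examples):
ALTER TABLE table2 ADD CONSTRAINT table2_joincol_uni UNIQUE (joincol);
ALTER TABLE table1 ADD CONSTRAINT table1_joincol_fk FOREIGN KEY (joincol) REFERENCES table2 (joincol);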
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem, and a good test case.
I tested it in Postgres 9.3; perhaps 13 can do it faster.
I used Occam's razor and excluded some possibilities:
The view (it is slow without the view too)
The JOIN filtering out some rows (unfortunately not in your test, but with longer md5 values, 5-6 characters, it does)
Other basically equivalent SELECT statements do not solve your problem (inner query or EXISTS)
I managed to get it to use only the indexes, but because the tables aren't bigger than the indexes, that was not the solution.
I think
CREATE INDEX ON test1 (id);
is useless, because of the PK.
If you change this
CREATE INDEX ON test1 (joincol);
to this
CREATE INDEX ON test1 (joincol, id);
then the second query uses only indexes.
After you run this
REINDEX TABLE test1;
REINDEX TABLE test2;
VACUUM ANALYZE test1;
VACUUM ANALYZE test2;
you can achieve some performance gain, because you created the indexes before the inserts.
I think the reason is the two aims of the database.
The first aim is to optimize for just a few rows, so it runs a Nested Loop; you can force that with LIMIT x.
The second aim is to optimize for the whole table, i.e. run the query fast over all rows.
In this situation the Postgres optimizer didn't notice that a simple MAX can run with a NESTED LOOP. Or perhaps Postgres cannot use a limit inside an aggregate clause (it may run over the whole partial select that is filtered by the query).
And this is very expensive. But you have the possibility to use other aggregates there, like SUM, MIN, AVG, etc.
Perhaps window functions can help you too.
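For illustration only, such a window-function variant might look like the following sketch; whether it actually gets the fast plan would need testing:
SELECT id
FROM (SELECT id, row_number() OVER (ORDER BY id DESC) AS rn FROM testview) sub
WHERE rn = 1;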

Oracle sql statement on very large table

I'm relatively new to SQL and I have a statement which takes forever to run.
SELECT
sum(a.amountcur)
FROM
custtrans a
WHERE
a.transdate <= '2013-12-31';
It's a large table, but the statement takes about 6 minutes!
Any ideas why?
Your select, as you post it, will read 99% of the whole table (2013-12-31 is just a week ago, and I assume most entries are before that date and only very few after). If your table has many large columns (like varchar2(4000)), all that data will be read as well when Oracle scans the table. So you might read several KB per row just to get the 30 bytes you need for amountcur and transdate.
If you have this scenario, create a combined index on transdate and amountcur:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
With the combined index, Oracle can read the index to fulfill your query and doesn't have to touch the main table at all, which might result in considerably less data that needs to be read from disk.
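To check whether the index is actually picked up, one option (a sketch, assuming you have access to DBMS_XPLAN) is:
EXPLAIN PLAN FOR
SELECT sum(a.amountcur) FROM custtrans a WHERE a.transdate <= DATE '2013-12-31';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
An INDEX FAST FULL SCAN or INDEX RANGE SCAN on the new index, with no access to CUSTTRANS itself, would confirm the index-only access path.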
Make sure the table has an index on transdate.
create index custtrans_idx on custtrans (transdate);
Also, if this field is defined as a DATE in the table, then do:
SELECT sum(a.amountcur)
FROM custtrans a
WHERE a.transdate <= to_date('2013-12-31', 'yyyy-mm-dd');
If the table is really large, the query has to scan every row with transdate below the given one.
Even if you have an index on transdate and it helps to stop the scan early (which it may not), when the number of matching rows is very high, it would take considerable time to scan them all and sum the values.
To speed things up, you could calculate partial sums, e.g. for each past month, assuming that your data is historical and the past does not change. Then you'd only need to scan custtrans for 1-2 months, quickly scan the table with monthly sums, and add the results.
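A rough sketch of that idea (the summary table and its name are illustrative, it only covers closed months, and it would need to be refreshed as new months close):
-- pre-aggregate the closed months once
CREATE TABLE custtrans_monthly AS
SELECT TRUNC(transdate, 'MM') AS month_start, SUM(amountcur) AS amount_sum
FROM custtrans
WHERE transdate < DATE '2013-12-01'
GROUP BY TRUNC(transdate, 'MM');
-- then add the pre-aggregated months to the open tail of the range
SELECT (SELECT NVL(SUM(amount_sum), 0) FROM custtrans_monthly)
     + (SELECT NVL(SUM(amountcur), 0) FROM custtrans
        WHERE transdate >= DATE '2013-12-01' AND transdate <= DATE '2013-12-31') AS total_amount
FROM dual;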
Try to create an index only on column amountcur:
CREATE INDEX myindex ON custtrans(amountcur)
In this case Oracle will most probably read only the index (Index Full Scan), nothing else.
Correction, as mentioned in the comments: it must be a composite index:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
But maybe it is a bit useless to create an index just for a single select statement.
One option is to create an index on the column used in the where clause (this is useful if you want to retrieve only 10-15% of the rows by using the indexed column).
Another option is to partition your table if it has millions of rows. Even then, if you try to retrieve 70-80% of the data, it won't help.
The best option is first to analyze your requirements and then make a choice.
Whenever you deal with dates, it's better to use the to_date() function. Do not rely on implicit data type conversion.

Why is my index not automatically used?

I have a table
Archive(VarId SMALLINT, Timestamp DATETIME, Value FLOAT)
VarId is not unique. The table contains measurements. I have a clustered index on Timestamp. Now I have the requirement of finding a measurement for a specific VarId before a specific date. So I do:
SELECT TOP(1) *
FROM Archive
WHERE VarId = 135
AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;
If there is no such measurement this query searches the whole table. So I introduced another index on (VarId, Timestamp).
My problem is: SQL Server doesn't seem to care about it, the query still takes forever. When I explicitly state 'WITH (INDEX = <id>)' it works as it should. What can I do so SQL Server uses my index automatically?
I'm using SQL Server 2005.
There are different possibilities with this.
I'll try help you to isolate them:
It could be SQL Server is favouring your Clustered Index (very likely it's the Primary Key) over your newly created index. One way to solve this is to have a NonClustered Primary Key and cluster the index on the other two fields (varid and timestamp). That is, if you don't want varid and timestamp to be the PK.
Also, looking at the (estimated) execution plan might help.
But I believe #1 only works nicely if those 2 fields are the most commonly used (queried) index. To find out if this is the case, it would be good to analyse which indexes are used the most (from http://sqlblog.com/blogs/louis_davidson/archive/2007/07/22/sys-dm-db-index-usage-stats.aspx):
select
ObjectName = object_schema_name(indexes.object_id) + '.' + object_name(indexes.object_id),
indexes.name,
case when is_unique = 1 then 'UNIQUE ' else '' end + indexes.type_desc,
ddius.user_seeks,
ddius.user_scans,
ddius.user_lookups,
ddius.user_updates
from
sys.indexes
left join sys.dm_db_index_usage_stats ddius on (
indexes.object_id = ddius.object_id
and indexes.index_id = ddius.index_id
and ddius.database_id = db_id()
)
WHERE
object_schema_name(indexes.object_id) != 'sys' -- exclude sys objects
AND object_name(indexes.object_id) LIKE 'Archive'
order by
ddius.user_seeks + ddius.user_scans + ddius.user_lookups
desc
Good luck
My guess is that your index design is the issue. You have a CLUSTERED index on a DATETIME field and I suspect that it is not unique data, much like VarId, and hence you did not declare it as UNIQUE. Because it is not unique there is a hidden, 4-byte "uniqueifier" field (so that each row can be physically unique regardless of you not giving it unique data) and the rows with the same DATETIME value are essentially random within the group of same DATETIME values (so even narrowing down a time still requires scanning through that grouping). You also have a NONCLUSTERED index on VarId, Timestamp. NONCLUSTERED indexes include the data from the CLUSTERED index so internally your NONCLUSTERED index is really: VarId, Timestamp, Timestamp (from the CLUSTERED index). So you could have left off the Timestamp column in the NONCLUSTERED index and it would have all been the same to the optimizer, but in a sense it would have been better as it would be a smaller index.
So your physical layout is based on a date while the VarId values are spread across those dates. Hence VarId = 135 can be spread very far apart in terms of data pages. Yes, your non-clustered index does group them together, but the optimizer is probably looking at the fact that you are wanting all fields (the "SELECT *" part) and the Timestamp < '2012-06-01 14:21:00' condition in addition to that seems to get most of what you need as opposed to finding a few rows and doing a bookmark lookup to get the "Value" field to fulfill the "SELECT *". Quite possibly if you do just "SELECT TOP(1) VarId, Timestamp" it would more likely use your NONCLUSTERED index without needing the "INDEX =" hint.
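For example, a sketch of the narrower query that could be served entirely from the nonclustered (VarId, Timestamp) index, with no bookmark lookup needed:
SELECT TOP(1) VarId, Timestamp
FROM Archive
WHERE VarId = 135 AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;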
Another issue affecting performance overall could be that the ORDER BY is requesting the Timestamp in DESC order and if you have the CLUSTERED index in ASC order then it would be the opposite direction of what you are looking for (at least in this query). Of course, in that case then it might be ok to have Timestamp in the NONCLUSTERED index if it was in DESC order.
My advice is to rethink the CLUSTERED index. Judging on just this query alone (other queries/uses might alter the recommendation), try dropping the NONCLUSTERED index and recreating the CLUSTERED index with the Timestamp field first, in DESC order, and also with the VarId so it can be declared UNIQUE. So:
CREATE UNIQUE CLUSTERED INDEX [UIX_Archive_Timestamp_VarId]
ON Archive (Timestamp DESC, VarId ASC)
This, of course, assumes that the Timestamp and VarId combination is unique. If not, then still try this without the UNIQUE keyword.
Update:
To pull all of this info and advice together:
When designing indexes you need to consider the distribution of the data and the use-cases for interacting with it. More often than not there is A LOT to consider, and several different approaches will appear good in theory. You need to try a few approaches, profile/test them, and see which works best in reality. There is no "always do this" approach without knowing all aspects of what you are doing, what else is going on, and what else is planned to use and/or modify this table, which I suspect has not been presented in the original question.
So to start the journey: you are ordering records by date and are looking at ranges of dates, and dates naturally occur in order, so putting Timestamp first benefits more of what you are doing and causes less fragmentation, especially if defined as DESC in the CREATE. Having an NC index on just VarId at that point will then be fine, even if spread out, for looking at a set of rows for a particular VarId. So maybe start there (change the order/direction of the CLUSTERED index and remove Timestamp from the NC index). See how those changes compare to the existing structure. Then try moving the VarId field into the CLUSTERED index and removing the NC index. You say that the combination is also not unique, but it does increase the predictability of the ordering of the rows. See how that works. Does this table ever get updated? If not, and if the Value field along with Timestamp and VarId would be unique, then try adding that to the CLUSTERED index and be sure to create it with the UNIQUE keyword. See how these different approaches work by looking at the Actual Execution Plan, and use SET STATISTICS IO ON before running the query to see how the logical reads of the different approaches compare.
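A quick sketch of that comparison, run once per index layout you try:
SET STATISTICS IO ON;
SELECT TOP(1) * FROM Archive
WHERE VarId = 135 AND Timestamp < '2012-06-01 14:21:00'
ORDER BY Timestamp DESC;
SET STATISTICS IO OFF;
-- compare the "logical reads" reported for Archive between the different index layouts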
Hope this helps :)
You might need to analyze your table to collect statistics, so the optimizer can determine whether to use the index or not.

Fields after "ORDER BY" or "WHERE" and index in MySQL

Should fields used after "ORDER BY" or "WHERE" have an index (PRIMARY, UNIQUE, INDEX) in MySQL?
Consider a table with the following columns:
ID | AddedDate | CatID | Title | Description | Status | Editor
In these queries, should ID, AddedDate and CatID have an index?
SELECT *
FROM table WHERE ID = $id
SELECT *
FROM table
ORDER BY ID
SELECT *
FROM table
ORDER BY AddedDate
SELECT *
FROM table
ORDER BY CatID
You can order by any field. Please clarify your question if you want to know more / something else.
You might want to read ORDER BY optimization. There it says that an index on the fields might even improve the sorting, as no extra sorting has to be done (in the optimal case).
Update:
Yes, you can add an index if you want (if this is what you mean; it is still not clear, as OMG Ponies points out). In general, you should add an index to those fields that you often use in WHERE clauses.
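For instance, a sketch for the ORDER BY queries above (yourtable stands in for the real table name, and ID is assumed to already be the primary key, which covers WHERE ID = $id):
ALTER TABLE yourtable ADD INDEX idx_adddate (AddedDate);
ALTER TABLE yourtable ADD INDEX idx_catid (CatID);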
As far as I know, there are three basic ways to order rows:
In-memory sort: Read all rows into memory and sort them. Very fast.
Using sorted index: Read one row at a time, looking up the columns that are not in the index in the base table.
File sort: Build a sort order by reading a part of the table at a time. This is really slow.
For tables that fit in memory, MySQL will probably choose option 1. That means it won't use an index even if it's present. The index will just be overhead.
But indexes shine for bigger tables. If the table is too big for memory, MySQL can avoid the painful file sort and rely on the index.
These days, memory is plentiful, and tables almost always fit in memory. I would only add indexes for ordering after I saw a file sort happening.
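To see whether a file sort is actually happening, a sketch (yourtable stands in for the real table name):
EXPLAIN SELECT * FROM yourtable ORDER BY AddedDate;
-- "Using filesort" in the Extra column means the sort is not being satisfied by an index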
One of the main benefits of having an index is that it lets you select only the subset of rows you're interested in. The alternative to using an index is to do a "full table scan".
Unless you have a "where" clause, you're not really going to get much benefit from having indexes.

Slow MySQL retrieval of the row with the largest value in an indexed column

I have a SQL table readings something like:
id int
client_id int
device_id int
unique index(client_id, device_id)
I do not understand why the following query is so slow:
SELECT client_id FROM `readings` WHERE device_id = 10 ORDER BY client_id DESC LIMIT 1
My understanding of the index is that MySQL keeps an ordered list (one property of a B-tree) of each row in the table, sorted first by client_id and then by device_id. When I execute an EXPLAIN on this query it says that it will use the index, but that it will need to look at every row. This makes sense since, in the worst case, there may only be one row with device_id = 10 and that may also be the row with the smallest client_id, and thus at the end of its search. However, in practice, this is not true. My table has ~10 million rows, and rows with device_id = 10 are spread fairly evenly throughout that table. Why then doesn't MySQL start at the end of the index and scan until it finds the first row with device_id = 10, stop, and return that value? It does not seem possible that this is what is happening, since the query takes ~30 seconds to execute.
Is it that my unique key is implemented as a hash somehow and thus not accessible in a list form? PHPMyAdmin is telling me that it is implemented as a b-tree, which makes me think that it should be able to do the scan as I mentioned above and quit with the first instance.
Where is my error and how can I make this query execute more quickly?
Thanks
Try switching the column order in the index:
unique index(device_id, client_id)
Since you are filtering on device_id, you would want that to be the first column in the index.
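A sketch of that change (the index name is arbitrary; you would likely drop the old unique index once the new one is in place):
ALTER TABLE readings ADD UNIQUE INDEX device_client (device_id, client_id);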
First, I'm assuming that you have good statistics for this table. If not, you'll want to analyze the table to make sure the optimizer can figure out what the best option is.
Here's another approach you could try that might work better. It could be that MySQL is not understanding your intent well enough to optimize correctly:
SELECT MAX(client_id) from readings where device_id = 10
Otherwise you could modify the index to be by device_id first, then client_id. Or you could add another index by just device_id.
You have a compound index on (client_id, device_id); these will (more or less) be concatenated for the purpose of indexing, and the index will only be considered if you filter on the first of the columns. Your query is using device_id, which is the last of them; you could provide a separate index on that column, or swap the columns around in the index.
Also, check the output of EXPLAIN on your queries.
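For example, re-running the EXPLAIN after reordering the index (a sketch) should show the new index in the key column with far fewer examined rows:
EXPLAIN SELECT client_id FROM readings
WHERE device_id = 10
ORDER BY client_id DESC LIMIT 1;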