Avoiding a Full Table Scan in MySQL - sql

How can I avoid a full table scan on mysql?

In general, by making sure you have a usable index on fields that appear in WHERE, JOIN and ORDER BY clauses.

Index your data.
Write queries that use those indexes.
Anything more than that we need specifics.

Also note that sometimes you just can not rid of a full table scan, i.e. When you need all the rows from your table... or when the cost of scanning the index is gt the cost of scanning the full table.

Use a LIMIT clause when you know how many rows you are expecting to return, for example if you are looking for a record with a known ID field that is unique, limit your select to 1, that way mysql will stop searching after it finds the first record. The same goes for updates and deletes.
SELECT * FROM `yourTable` WHERE `idField` = 123 LIMIT 1

Related

Strange behavior when doing where and order by query in postgres

Background: A large table, 50M+, all column in query is indexed.
when I do a query like this:
select * from table where A=? order by id DESC limit 10;
In statement, A, id are both indexed.
Now confusing things happen:
the more rows where returned, the less time whole sql cost
the less rows where returned, the more time whole sql cost
I have a guess here: postgres do the order by first, and then where , so it cost more time to find 10 row in the orderd index when target rowset is small(like find 10 particular sand on beach); oppositeļ¼Œ if target rowset is large, it's easy to find the first 10.
Is it right? Or there are some other reason for this?
Final question: How to optimize this situation?
It can either use the index on A to apply the selectivity, then sort on "id" and apply the limit. Or it can read them already in order using the index on "id", then filter out the ones that meet the A condition until it finds 10 of them. It will choose the one it thinks is faster, and sometimes it makes the wrong choice.
If you had a multi-column index, on (A,id) it could use that one index to do both things, get the selectivity on A and still fetch the already in order by "id", at the same time.
Do you know PGAdmin? With "explain verbose" before your statement, you can check how the query is executed (meaning the order of the operators). Usually first happens the filter and only afterwards the sorting...

Does a query goes through all data when you only select the last N?

I have a query that select the last 5($new) items from my database.
SELECT OvenRunData.dataId AS id, OvenRunData.data AS data
FROM ovenRuns INNER JOIN OvenRunData ON OvenRuns.id = OvenRunData.ovenRunId
WHERE OvenRunData.ovenRunId = (SELECT MAX(id) FROM OvenRuns)
ORDER BY id DESC LIMIT '$new'
I want to execute this query every 5 seconds with an AJAX request so I can update my table.
I know this query select the last 5 records but I want to know if the query runs through all records and then selects the last 5 or does it select only the last 5 without checking all the data?
I'm really worried that I'll have lag.
You need two indexes to make it fast enough:
create index ix_OvenRuns_id on OvenRuns(id)
create index ix_OvenRunData_ovenRunId on OvenRunData(ovenRunId)
you can even put OvenRunData.dataId OvenRunData.data into the second one, or create clustered index, however, these indexes definitely avoid full data scan.
That depends on the indexes.
In your case, you should have one on OverRuns(id).
More here: http://use-the-index-luke.com/sql/partial-results/top-n-queries
The LIMIT is applied after the ORDER BY, and the ORDER BY is applied to the entire result-set. So the answer to your question is, yes it must go through all of the records in your result-set determined by your WHERE clause before applying the LIMIT.

Does DISTINCT performs a full table scan with multiple expressions?

I have a DISTINCT clause to remove the duplicate values.
What is the performance if there are multiple expressions?
For example:
SELECT DISTINCT city, state
FROM customers
WHERE total_orders > 10
ORDER BY city;
Will this perform a full table scan?
The DBMS performs a full table scan when it thinks it appropriate.
In your example, when the DBMS thinks that with total_orders > 10 it will only get very few rows and there is an index on that column, it will use that index to access the table records. In a second step it will apply DISTINCT and then sort (or sort on-the-fly when making rows distinct). If the DBMS thinks however it will get too many records with total_orders > 10 it may decide for a full table scan. (And then apply DISTINCT and ORDER BY). So whatever the situation, DISTINCT doesn't change anything.
In case you have an index on total_orders + City + state, the DBMS may decide not to access the table at all, because all data exists in the index and even in the order needed. The DBMS would do the same without DISTINCT, however.
In case you have an index on state + total_orders + City (i.e. wrong order; the WHERE clause can not be directly applied), the DBMS may still decide to read the index only, but it is less likely. And again: the DBMS would do the same without DISTINCT.
And if you have no index, the DBMS must do a full table scan of course, because there is no index to circumvent it. Well, I guess that was needless to say :-)
Will this perform a full table scan?
Check the EXPLAIN PLAN.
EXPLAIN PLAN FOR your_query;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
It is up to the optimizer to decide the optimal plan for execution of the query. Since you do not have an index on the column used in the filter predicate, it has no other option than a FTS(Full Table Scan).

Getting RID Lookup instead of Table Scan?

SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an In clause on an Id, and then also select other columns, the In is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SqlFiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows)
Otherwise the query plan shown in SQL Fiddle is identical, the only difference being that for the second query the RID Lookup in SQL Fiddle, is a Table Scan in production :(
I would like to understand possibilities that would cause this behavior? What kinds of things would you look at to help determine the cause of it using a table scan here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Detail column, then the performance of both queries is very close in production. It is only after adding other columns like Detail that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used an RID Lookup, I was surprised but slightly confused...
It doesn't have a clustered index because in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details though, and I can experiment with that more, but would like to have a understanding of what is going on now before I start shooting in the dark with random indexes.
What if you would change your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID lookup nor a table scan would be needed, since your query could be satisfied from just the index itself....
The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the 'Details' column is included. This is an almost sure sign that either the 'Details' column is not part of an index, or if it is part of an index, the data in that column is mostly unique such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than perhaps a fragmented read of an index.
To influence the path that will be chose by the optimiser, you would need to look at possible indexes that could be added/modified to make an index access more efficient, but this should be done with care as it can adversely affect other queries as well as possibly degrading insert performance.
The other important activity you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency that is appropriate to the rate of change of the frequency distribution in the table data
If it's true that 99% of the rows would be omitted if it performed the query using the relevant index + RID then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignID in (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics, you could use a table hint - if you know the index that your plan should be using which contains the ID and ForeignID columns then stick that in your query as a hint and force SQL optimiser to use the index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);

SQLite - select expression is very slow

I'm experiencing some heavy performance-issues with a query in SQLite. Currently there are around 20000 entries in the table activity_tbl and about 40 in the table activity_data_tbl. I have an index for both of the columns used in the query below, but it doesn't seem to have any effect on the performance at all.
SELECT a._id, a.start_time + b.length AS time
FROM activity_tbl a INNER JOIN activity_data_tbl b
ON a.activity_data_id = b._data_id
WHERE time > ?
ORDER BY 2
LIMIT 1
As you can see, I select one column and a value created from adding two columns together. I guess this is what's causing the low performance, since the query is very fast if I just select a.start_time or b.length.
Do you guys have any suggestion for how I could optimize this?
Try putting an index on the time column. This should speed up the query
This query is not optimizable using indexes for the filter part since you are filtering and ordering on a calculated value. To optimize the query you will either need to filter on one of the actual table columns (starttime or length) or pre-compute the time values before querying.
The only place an index will help, and I assume you have one, is on b.data_id.
A compound index may help. According to its docs, SQLite tries to avoid to access the table, if the index has enough information. So if the engine did its homework it will recognize that the index is enough to compute the where clause value and spare some time. If it does not work, only the pre-computation will do.
If you are more often confronted with similar tasks, please read this: http://www.sqlite.org/rtree.html