How To Reduce Number of Rows Scanned by MySQL

I have a query that pulls 5 records from a table of ~10,000. The order clause isn't covered by an index, but the where clause is.
The query scans about 7,700 rows to pull these 5 results, and that seems like a bit much. I understand, though, that the complexity of the ordering criteria complicates matters. How, if at all, can I reduce the number of rows scanned?
The query looks like this:
SELECT *
FROM `mediatypes_article`
WHERE `mediatypes_article`.`is_published` = 1
ORDER BY `mediatypes_article`.`published_date` DESC, `mediatypes_article`.`ordering` ASC, `mediatypes_article`.`id` DESC LIMIT 5;
mediatypes_article.is_published is indexed.

How many rows match "is_published = 1"?
I assume that is something like those 7,700 rows?
Either way, the full result set that matches the WHERE clause has to be fetched and completely ordered by all of the sorting criteria. Only then is the full sorted list of published articles truncated to the first 5 results.
Maybe it will help you to look at the MySQL documentation article about ORDER BY optimization, but as a first step you should try adding indexes on the columns that appear in the ORDER BY clause. It is very likely that this will speed things up greatly.
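For example, a composite index covering the WHERE column and the leading ORDER BY column might look like this (the index name is made up, and because the query mixes DESC and ASC directions, MySQL versions without descending indexes may still not use it for the entire sort):
ALTER TABLE mediatypes_article
  ADD INDEX idx_published_pubdate (is_published, published_date);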

Executing OPTIMIZE TABLE may not help, but it doesn't hurt either.

When you have ordering, the engine has to traverse the btree to work out the proper order.
10,000 records is not a big enough amount to worry about ordering. Remember, with proper indexing, the RDBMS doesn't fetch the whole record to figure out the order. It keeps the indexed columns in btree pages saved on disk, and with a few page reads the whole btree is loaded into memory and can be traversed.

In MySQL you can make an index that includes multiple columns. I think what you probably need to do is make an index that includes is_published and published_date. You should look at the output from the EXPLAIN statement to make sure it's doing things the smart way, and add an index if it is not.
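For instance, a quick check against the original query might look like this (just a sketch; the important parts of the output are the rows estimate and whether Extra shows "Using filesort"):
EXPLAIN SELECT *
FROM mediatypes_article
WHERE is_published = 1
ORDER BY published_date DESC, ordering ASC, id DESC
LIMIT 5;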

Related

Is ORDER BY time consuming?

I am always wondering whether ORDER BY is efficient, because I believe it inevitably needs a whole-table scan, even if the ordering field is indexed.
For example, if I order by created_at and limit to 10: I think, because the database cannot know a priori that I will order by created_at, it has to sort the whole data set and return the first 10 items. Of course, if we have an index on created_at, things might be better.
However, even with an index, I think we can still run into trouble. For example, I want to sort by a function of a field, say (age^2 - age - 10). Even if we indexed the age field, the database cannot know a priori what function I will use, so it has to evaluate the expression on all rows.
Am I wrong? Anyway, could anyone explain to me the workflow behind ORDER BY?
If there is an index that is sorted in the same order as specified in the ORDER BY clause, the database will not need to perform a sort operation. The query optimizer looks for indexes that can speed up your query. It analyzes your SQL query and, in the case of ORDER BY clauses, looks for indexes that have the same order. See Indexing ORDER BY for more details.
Some database engines allow indexing computed columns, which would cover the case you mentioned.
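As a rough sketch of that idea (PostgreSQL-style expression-index syntax, with a hypothetical people table and age column):
CREATE INDEX idx_people_age_expr ON people ((age * age - age - 10));
SELECT * FROM people ORDER BY (age * age - age - 10) LIMIT 10;
-- the planner can walk the expression index in order instead of computing
-- the expression for every row and then sorting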
In theory, the database optimizer can take into account the limit clause when determining the query plan. This is most obviously useful with a limit 1 query, which can be implemented just by keeping track of which row has the extreme value for the columns in the order by. The same idea can be extended to larger limit sizes.
In practice, I don't think that most databases implement this optimization when the limit is larger than 1. Some may for the special case of limit 1 (or top 1 or whatever the right syntax is).
An index can be used for an ORDER BY. In general, the columns in the ORDER BY would need to match exactly the leading columns of the index. SQL optimizers are generally not smart enough to recognize simple conversions. On the other hand, people who write SQL usually don't do such transformations.

Where clause is slowing my query from 2 seconds to 24 seconds

I am trying to write a simple query to count the results from a big table.
SELECT COUNT(*)
FROM DM.DM_CUSTOMER_SEG_BRIDGE_CORP_DW AL3
WHERE (AL3.REFERENCE_YEAR(+) =2012)
The above query is taking around 24 seconds to return output. If I remove the where clause and execute the same query, it gives me the result in 2 seconds.
May I know what the reason for that is? I am relatively new to SQL queries.
Please help
Thanks,
Naveen
You might need an index on the table. Typically you will need an index on any columns used in the WHERE clause.
As for the (+) syntax, I think it is redundant here (I'm no Oracle expert), but see Difference between Oracle's plus (+) notation and ansi JOIN notation?
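If the (+) really is redundant here, the query could presumably be written as plainly as:
SELECT COUNT(*)
FROM DM.DM_CUSTOMER_SEG_BRIDGE_CORP_DW AL3
WHERE AL3.REFERENCE_YEAR = 2012;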
The reason may seem subtle. But there are multiple ways that Oracle could approach a query like this:
SELECT COUNT(*)
FROM DM.DM_CUSTOMER_SEG_BRIDGE_CORP_DW AL3
One way is to read all the rows in the table. Because this is a big table, that is not the most efficient approach. A second method would be to use statistics of some sort, where the number of rows are in the statistics. I don't think Oracle ever does this, but it is conceivable.
The final method is to read an index. Typically, an index would be much smaller than the table and it might already be in memory. The above query would be reading a much smaller amount of data. (Here is an interesting article on counting all the rows in a table.)
When you introduce the where clause,
WHERE (AL3.REFERENCE_YEAR(+) =2012)
Oracle can no longer scan just any index; it needs the value of reference_year. If it scanned some other index, it would still need to fetch the data records to get the value of reference_year -- and that is equivalent to (actually worse than) scanning the whole table.
Even with an index on reference_year, you are not guaranteed to use the index. The problem is something called selectivity. The number of rows that you are fetching may still be quite large relative to the number of rows in the table (in this context, 10% is "quite large"). The Oracle optimizer may choose to do a full table scan rather than read the index.
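If reference_year is selective enough, an index along these lines (the index name is hypothetical) would let Oracle answer the count from the index alone:
CREATE INDEX IDX_BRIDGE_REF_YEAR
  ON DM.DM_CUSTOMER_SEG_BRIDGE_CORP_DW (REFERENCE_YEAR);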

Oracle Index Sort Order and Joining

I have 2 tables that are a few million rows each, with indexes. I'm looking to convert one of the indexes to DESC order to optimize some operations. However, will that affect joining speed or other optimizations?
For example:
Table A:
a_id (pk)
Table B:
b_id (pk)
a_id (fk)
If A.a_id is stored as DESC and B.a_id is stored as ASC, will I encounter any problems or slowness on joins? Will Oracle be able to use the indexes for joining even though they have different sort orders? Do I have to make B.a_id DESC as well, or create a second index that is DESC? Obviously I'd like to try a simple experiment, but I don't have DBA access or a spare Oracle setup to work with.
Will Oracle be able to use the indexes for joining even though they have different sort orders?
Indexes are not used "for joining". They're used to access data. The row sources thus created are then joined. The only reason I can think of that the sort order of the index would have any impact on joining would be if a merge join is occurring and the index is being used to avoid sorting. In this case, the impact of changing to a descending index might be that the data needs to be sorted in memory after it is accessed; or it might not, if the optimizer is intelligent enough to simply walk through that data in reverse order when doing the merge.
If you have queries whose execution plans rely on using the index on A.A_ID to get the data in ascending order (either for purposes of a merge join or to meet your requested ordering of the results), then changing the index to descending order could have an impact.
Edit: Just did a quick test on some sample data. The optimizer does seem to have the capability to merge row sources sorting in opposite orders without resorting either of them. So at the most obvious level, having one index ascending and the other descending should not cause serious performance problems. However, it does look like the descending indexes can have other effects on the execution plan -- in my case, the ascending index was used for a fast full scan, while the descending one was used for a range scan. This could cause changes in query performance -- good or bad -- but the only way to know for certain is to test it.
Oracle implements indexes as doubly-linked lists, so it makes no difference whether you specify an ASC or DESC index for a single column.
DESC indexes are a special case that helps when you have a multi-column index, e.g. if I have a query that often orders by colA ASC, colB DESC, then I might decide to add an index on (colA, colB DESC) in order to avoid a sort.
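A sketch of that kind of index and the query it helps (table and column names are illustrative):
CREATE INDEX idx_orders_cola_colb ON orders (colA ASC, colB DESC);
SELECT *
FROM orders
ORDER BY colA ASC, colB DESC;
-- with the mixed-direction index, Oracle can return rows in the requested
-- order without a separate sort step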
Developing without a development and test system?
Your answer is to develop with one. Oracle comes on all platforms, just install, add data, do your work.
For you, just live dangerously and do the index change, who cares what happens. Grab for that brass ring. So you miss. You won't lose any data.
I'm not sure I get what you're trying to ask: you cannot "store" rows in descending or ascending order. You can fetch the results of the query and order them using an ORDER BY clause, which will sort the resulting set in ascending or descending order.
There is no guarantee that you're inserting any data in ascending or descending order.
Consequently, the "order" in which rows are inserted will have no bearing on performance, because there is no order.
Generally speaking, an index can be scanned in either ascending or descending order, since the two pointers in the index structure are enough to identify the leaf blocks and walk them in either direction without sorting in memory.
However, if we create an index with a DESC column definition, its structure will be much larger than a normal index: a normal index on an incrementing key gets 90-10 block splits, whereas a DESC index will get 50-50 splits, which leads to unused space and makes it a candidate for rebuilds, meaning additional maintenance and overhead.
DESC indexes can be helpful when you have a multi-column index where one column is needed in ASC order and the other in DESC order, to avoid sorting in memory.
Early optimization is a waste of time. Just leave this problem and do the next thing. When there are 100 million rows in this table change the indexes and test what happens, until then your ten rows of data are not worth the time to "optimize".

Indexing affects only the WHERE clause?

If I have something like:
CREATE INDEX idx_myTable_field_x
ON myTable
USING btree (field_x);
SELECT COUNT(field_x), field_x FROM myTable GROUP BY field_x ORDER BY field_x;
Imagine myTable with around 500,000 rows and most of field_x values being unique.
Since I don't use any WHERE clause, will the created index have any effect at all in my query?
Edit: I'm asking this question because I don't see any relevant difference in query times before and after creating the index; they always take about 8 seconds (which, of course, is too much time!). Is this behaviour expected?
The index will not help here: since you are reading the whole table anyway, there is no use in going to an index first (PostgreSQL does not yet have index-only scans).
Because nearly all values in the index are unique, it wouldn't really help in this situation anyway. Index lookups (including index-scans for other DBMS) tend to be really helpful for lookup of a small number of rows.
There is a slight possibility that the index might be used for ordering but I doubt that.
If you look at the output of EXPLAIN ANALYZE VERBOSE you can see if the sorting is done in memory or (due to the size of the result) is done on disk.
If sorting is done on disk, you can speed up the query by increasing the work_mem - either globally or just for your session.
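A sketch of both checks (the work_mem value is only illustrative):
EXPLAIN ANALYZE VERBOSE
SELECT COUNT(field_x), field_x FROM myTable GROUP BY field_x ORDER BY field_x;
-- look for "Sort Method: external merge  Disk: ..." (spilled to disk) versus
-- "Sort Method: quicksort  Memory: ..." (done in memory)
SET work_mem = '64MB';  -- session-level; pick a value your RAM can support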
Since field_x is the only column referenced in your query, your index covers the query and should help you avoid lookups into actual rows of myTable.
EDIT: As indicated in the comment discussion below, while this answer is valid for most RDBMS implementations, it does not apply to postgresql.
The index should be used. If you ever want to see how your indexes are being used (or not), the execution plan of the query is a great place to see what the database has decided to do. In your case you should execute something like:
explain SELECT COUNT(field_x), field_x FROM myTable GROUP BY field_x ORDER BY field_x;
More information about what all the output you are seeing means can be found in the postgres docs: http://www.postgresql.org/docs/8.4/static/sql-explain.html
There is also: http://wiki.postgresql.org/wiki/Image:Explaining_EXPLAIN.pdf which is a bit more in depth.

Indexing table with duplicates MySQL/SQL Server with millions of records

I need help in indexing in MySQL.
I have a table in MySQL with the following columns:
ID Store_ID Feature_ID Order_ID Viewed_Date Deal_ID IsTrial
The ID is auto-generated. Store_ID goes from 1 to 8, Feature_ID from 1 to, let's say, 100. Viewed_Date is the date and time at which the row was inserted. IsTrial is either 0 or 1. You can ignore Order_ID and Deal_ID for this discussion.
There are millions of rows in the table, and we have a reporting backend that needs to see the number of views in a certain period (or overall) where IsTrial is 0, for a particular store id and a particular feature.
The query takes the form of:
select count(viewed_date)
from theTable
where viewed_date between '2009-12-01' and '2010-12-31'
and store_id = '2'
and feature_id = '12'
and Istrial = 0
In SQL Server you can have a filtered index to use for IsTrial. Is there anything similar in MySQL? Also, Store_ID and Feature_ID have a lot of duplicate data. I created an index on Store_ID and Feature_ID. Although this seems to have decreased the search time, I need a bigger improvement than that. Right now I have more than 4 million rows. For a query like the one above, it looks at 3.5 million rows in order to give me a count of 500k rows.
PS. I forgot to add view_date filter in the query. Now I have done this.
Well, you could expand your index to consist of Store_ID, Feature_ID and IsTrial. You won't get any better than this, performance-wise.
My first idea would be an index on (feature_id, store_id, istrial), since feature_id seems to be the column with the highest Shannon entropy. But without knowing the statistics on feature_id, I'm not sure. Maybe you should create two indexes instead, with (store_id, feature_id, istrial) being the other, and let the optimizer sort it out. Using all three columns also has the advantage that the database can answer your query from the index alone, which should improve performance, too.
But if neither of your columns is selective enough to sufficiently improve index performance, you might have to resort to denormalization by using INSERT/UPDATE triggers to fill a second table (feature_id, store_id, istrial, view_count). This would slow down inserts and updates, of course...
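Sketched out, the two candidate indexes might look like this (index names are made up); appending viewed_date would additionally cover the date-range filter in the sample query:
CREATE INDEX idx_feature_store_trial ON theTable (feature_id, store_id, istrial);
CREATE INDEX idx_store_feature_trial ON theTable (store_id, feature_id, istrial);
-- or, to cover the sample query completely:
-- CREATE INDEX idx_store_feature_trial_date
--   ON theTable (store_id, feature_id, istrial, viewed_date);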
You might want to think about splitting that table horizontally. You could run a nightly job that puts each store_id in a separate table. Or split on feature_id instead; yes, that's a lot of tables, but if you don't need real-time data it's the route I would take.
If you need to optimize this query specifically in MySQL, why not add istrial to the end of the existing index on Store_ID and Feature_ID? This will completely index away the WHERE clause, and MySQL will be able to grab the COUNT from the cardinality summary of the index if the table is MyISAM. All of your existing queries that leverage the current index will be unchanged as well.
Edit: also, I'm unsure why you're doing COUNT(viewed_date) instead of COUNT(*). Is viewed_date ever NULL? If not, you can just use COUNT(*), which will eliminate the need to go to the .MYD file if you take it in conjunction with my other suggestion.
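Assuming viewed_date is indeed never NULL, the rewritten count would simply be:
SELECT COUNT(*)
FROM theTable
WHERE viewed_date BETWEEN '2009-12-01' AND '2010-12-31'
  AND store_id = '2'
  AND feature_id = '12'
  AND Istrial = 0;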
The best way I found to tackle this problem is to skip the DTA's recommendations and do it on my own in the following way:
Use Profiler to find the costliest queries in terms of CPU usage (probably blocking queries) and apply indexes to tables based on those queries. If the query execution plan can be changed to decrease the reads, writes and overall execution time, do that first. If not, in which case the query is what it is, apply the clustered/non-clustered index combination that suits it best. This depends on the nature of the existing table indexes, the total bytes of the columns participating in the index, etc.
Run queries in SSMS to find the most frequently executed queries and do the same as above.
Create a defragmentation schedule to either reorganize or rebuild indexes, depending on how fragmented they are.
I am pretty sure others can suggest good ideas, but doing these gave me good results. I hope someone finds this helpful. I think the DTA does not really make things faster in terms of indexing, because you really need to go through all the indexes it is going to create. This is even more true for a database that gets hit a lot.