SQL query always uses filesort in ORDER BY clause

I am trying to optimize a SQL query that uses an ORDER BY clause. When I run EXPLAIN, the query always shows "Using filesort". The query is for a group-discussion forum where users attach tags to posts.
Here are the three tables I am using: usertable, user_tag, and tags.
user_tag is the association table mapping users to their tags.
CREATE TABLE `usertable` (
`user_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`user_name` varchar(20) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`user_name`),
KEY `user_id` (`user_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `user_tag` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(11) unsigned NOT NULL,
`tag_id` int(11) unsigned NOT NULL,
`usage_count` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `tag_id` (`tag_id`),
KEY `usage_count` (`usage_count`),
KEY `user_id` (`user_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I update usage_count in application code. Here is the query that's giving me trouble. It finds the tag_id and usage_count for a particular user name, sorted by usage count in descending order:
select user_tag.tag_id, user_tag.usage_count
from user_tag inner join usertable on usertable.user_id = user_tag.user_id
where user_name="abc" order by usage_count DESC;
Here is the explain output:
mysql> explain select
user_tag.tag_id,
user_tag.usage_count from user_tag
inner join usertable on
user_tag.user_id = usertable.user_id
where user_name="abc" order by
user_tag.usage_count desc;
Explain output here
What should I change to get rid of that "Using filesort"?

I'm rather rusty with this, but here goes.
The key used to fetch the rows is not the same as the one used in the ORDER BY:
http://dev.mysql.com/doc/refman/5.1/en/order-by-optimization.html
As mentioned by OMG Ponies, an index on user_id, usage_count may resolve the filesort.
KEY `user_id_usage_count` (`user_id`,`usage_count`)
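Concretely, that suggestion could be applied with something like the following (the index name is illustrative). The composite key lets MySQL both locate the rows by user_id and read them back already ordered by usage_count, so no separate sort pass is needed:

```sql
-- Add a composite index covering the join column and the sort column
ALTER TABLE `user_tag`
  ADD INDEX `user_id_usage_count` (`user_id`, `usage_count`);

-- The old single-column index on user_id becomes redundant
-- (it is a left prefix of the new one) and can be dropped
ALTER TABLE `user_tag` DROP INDEX `user_id`;
```

After this, EXPLAIN should no longer report "Using filesort" for the query above, since the rows come out of the index in the requested order.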

"Using filesort" is not necessarily bad; in many cases it doesn't actually matter.
Also, its name is somewhat confusing. The filesort() function does not necessarily use temporary files to perform the sort. For small data sets, the data are sorted in memory, which is pretty fast.
Unless you think it's a specific problem (for example, profiling your application on production-grade hardware shows that removing the ORDER BY resolves a measurable performance issue), or your data set is large, you probably should not worry about it.

Related

How to optimize SQL query that uses GROUP BY and joined many-to-many relation tables?

I have tables with many-to-many relations:
CREATE TABLE `item` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL DEFAULT '',
`size_id` tinyint(3) NOT NULL DEFAULT 0,
PRIMARY KEY (`id`),
INDEX `size` (`size_id`)
);
CREATE TABLE `items_styles` (
`style_id` smallint(5) unsigned NOT NULL,
`item_id` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`item_id`, `style_id`),
INDEX `style` (`style_id`),
INDEX `item` (`item_id`),
CONSTRAINT `items_styles_item_id_item_id` FOREIGN KEY (`item_id`) REFERENCES `item` (`id`)
);
CREATE TABLE `items_themes` (
`theme_id` tinyint(3) unsigned NOT NULL,
`item_id` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`item_id`, `theme_id`),
INDEX `theme` (`theme_id`),
INDEX `item` (`item_id`),
CONSTRAINT `items_themes_item_id_item_id` FOREIGN KEY (`item_id`) REFERENCES `item` (`id`)
);
I'm trying to get a report that shows each style_id and the number of items using that style, while applying filters to the item table and/or to another table, like this:
SELECT i_s.style_id, COUNT(i.id) total FROM item i
JOIN items_themes i_t ON i.id = i_t.item_id AND i_t.theme_id IN (6, 7)
JOIN items_styles i_s ON i.id = i_s.item_id
GROUP BY i_s.style_id;
-- or like this
SELECT i_s.style_id, COUNT(i.id) total FROM item i
JOIN items_themes i_t ON i.id = i_t.item_id AND i_t.theme_id IN (6, 7)
JOIN items_styles i_s ON i.id = i_s.item_id
WHERE i.size_id != 3
GROUP BY i_s.style_id;
The problem is that the tables are pretty big, so the queries take a long time to execute (~8 seconds):
item - 8M rows
items_styles - 12M rows
items_themes - 11M rows
Is there any way to optimize these queries? If not, what approach can be used to produce such reports?
I will be grateful for any help. Thanks.
First, you don't need the item table for these queries. That probably doesn't have much impact on performance, but there's no need for it.
So you can write the query as:
SELECT i_s.style_id, COUNT(*) as total
FROM items_themes i_t JOIN
items_styles i_s
ON i_s.item_id = i_t.item_id
WHERE i_t.theme_id IN (6, 7)
GROUP BY i_s.style_id;
For this query, you want an index on items_themes(theme_id, item_id). There is not much you can do about the GROUP BY.
Then, I don't think this is what you really want, because it will double count an item that has both themes. So, use EXISTS instead:
SELECT i_s.style_id, COUNT(*) as total
FROM items_styles i_s
WHERE EXISTS (SELECT 1
FROM items_themes i_t
WHERE i_t.item_id = i_s.item_id AND
i_t.theme_id IN (6, 7)
)
GROUP BY i_s.style_id;
For this, you want an index on items_themes(item_id, theme_id). You can also try an index on items_styles(style_id). Some databases might be able to use that one, but I am guessing not MariaDB.
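The index suggestions above could be created like this (index names are illustrative):

```sql
-- Supports the correlated EXISTS lookup by (item_id, theme_id)
CREATE INDEX it_item_theme ON items_themes (item_id, theme_id);

-- May let the GROUP BY walk style_id in index order;
-- whether the optimizer uses it is engine-dependent
CREATE INDEX is_style ON items_styles (style_id);
```

Note that items_themes already has PRIMARY KEY (item_id, theme_id), which serves the same purpose as the first index here, so on the schema shown only the second statement would actually add anything.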
In a many-to-many table, it is optimal to have these two indexes:
PRIMARY KEY (`item_id`, `style_id`),
INDEX `style` (`style_id`, `item_id`)
And be sure to use InnoDB.
More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
Still, you have two many-to-many mappings, so there probably is no great solution.

MS Access last() results are sometimes wrong

I have an MS Access query that uses Last(), but sometimes it doesn't work as expected (which I know is exactly what's expected, lol). I need to find a solution, either in Access or by converting the query below to MySQL. Any suggestions?
SELECT maindata.TrendShort, Last(maindata.Resistance) AS LastOfResistance, Last(maindata.Support) AS LastOfSupport, Count(maindata.ID) AS Days, Max(maindata.Datestamp) AS Datestamp, maindata.ProductID
FROM market_opinion AS maindata
WHERE (((Exists (select * from market_opinion action_count where maindata.ProductID = action_count.ProductID and maindata.Datestamp < action_count.Datestamp and maindata.TrendShort<> action_count.TrendShort))=False))
GROUP BY maindata.TrendShort, maindata.ProductID
ORDER BY Count(maindata.ID) DESC;
Only LastOfResistance and LastOfSupport are occasionally wrong; the other fields are always correct.
CREATE TABLE `market_opinion` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`ProductID` int(11) DEFAULT NULL,
`Trend` varchar(11) DEFAULT NULL,
`TrendShort` varchar(7) DEFAULT NULL,
`Resistance` decimal(9,2) unsigned DEFAULT NULL,
`Support` decimal(9,2) unsigned DEFAULT NULL,
`Username` varchar(12) DEFAULT NULL,
`Datestamp` date DEFAULT NULL,
PRIMARY KEY (`ID`),
KEY `ProductID` (`ProductID`),
KEY `Datestamp` (`Datestamp`),
KEY `TrendShort` (`TrendShort`)
) ENGINE=InnoDB AUTO_INCREMENT=9536 DEFAULT CHARSET=utf8;
Without some feel for the data, this becomes something of a guess, but what I'm thinking is that Last(Resistance) and Last(Support) aren't necessarily pulling from the same record as Max(DateStamp). You might try breaking your query into a two-part query, such as:
SELECT maindata.TrendShort, Resistance, Support, COUNT(ID) AS Days, ProductID
FROM market_opinion maindata
INNER JOIN (SELECT mo.TrendShort, MAX(mo.DateStamp) AS MaxDate
            FROM market_opinion mo
            WHERE (((EXISTS(SELECT ...))=FALSE))
            GROUP BY mo.TrendShort) latest
    ON maindata.TrendShort = latest.TrendShort
   AND maindata.DateStamp = latest.MaxDate
GROUP BY maindata.TrendShort, Resistance, Support, ProductID
ORDER BY Days DESC;
I've left out the bulk of your query where the ellipsis (...) is; I wouldn't expect any changes there. You might consider looking at http://www.access-programmers.co.uk/forums/showthread.php?t=42291 for a discussion of first/last vs min/max. Let me know if this gets you any closer. If not, post some sample data where it's not working out; that might give some insight into what's going on.
First() and Last() in Access just return some (arbitrary) record, not necessarily the first or last respectively.
In most cases, simply use Min for the first and Max for the last.
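As an illustrative sketch of that advice (assuming Datestamp determines the ordering), the Last() calls could be replaced with a correlated Max() lookup:

```sql
-- Fetch the Resistance/Support from the most recent row
-- per (ProductID, TrendShort) group
SELECT m.TrendShort, m.Resistance, m.Support, m.ProductID
FROM market_opinion AS m
WHERE m.Datestamp = (SELECT MAX(m2.Datestamp)
                     FROM market_opinion AS m2
                     WHERE m2.ProductID = m.ProductID
                       AND m2.TrendShort = m.TrendShort);
```

Unlike Last(), this is deterministic, though it can return more than one row per group if two rows in a group share the same maximum Datestamp.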

MySQL query slow when selecting VARCHAR

I have this table:
CREATE TABLE `search_engine_rankings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`keyword_id` int(11) DEFAULT NULL,
`search_engine_id` int(11) DEFAULT NULL,
`total_results` int(11) DEFAULT NULL,
`rank` int(11) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`indexed_at` date DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
KEY `search_engine_rankings_search_engine_id_fk` (`search_engine_id`),
CONSTRAINT `search_engine_rankings_keyword_id_fk` FOREIGN KEY (`keyword_id`) REFERENCES `keywords` (`id`) ON DELETE CASCADE,
CONSTRAINT `search_engine_rankings_search_engine_id_fk` FOREIGN KEY (`search_engine_id`) REFERENCES `search_engines` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=244454637 DEFAULT CHARSET=utf8
It has about 250M rows in production.
When I do:
select id,
rank
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very quickly.
When I add the url column (VARCHAR):
select id,
rank,
url
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very slowly.
Any ideas?
The first query can be satisfied by the index alone -- no need to read the base table to obtain the values in the Select clause. The second statement requires reads of the base table because the URL column is not part of the index.
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
The rows in the base table are not in the same physical order as the rows in the index, so the read of the base table can involve considerable disk-thrashing.
You can think of it as a kind of proof of optimization -- on the first query the disk-thrashing is avoided because the engine is smart enough to consult the index for the values requested in the select clause; it will already have read that index into RAM for the where clause, so it takes advantage of that fact.
In addition to Tim's answer: an index in MySQL can only be used left-to-right, which means it can use the columns of your index in your WHERE clause only up to the first one you skip.
Currently, your UNIQUE index is keyword_id, search_engine_id, rank, indexed_at. This can filter on keyword_id and search_engine_id, but it still has to scan the remaining rows to filter on indexed_at.
If you change the column order to keyword_id, search_engine_id, indexed_at, rank, the index can filter on keyword_id, search_engine_id, and indexed_at.
I believe it will then be able to fully use that index to read the appropriate part of your table.
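That reordering could be done with something like the following sketch. Rebuilding a unique index on a 250M-row table is a heavy operation, so try it on a copy of the table first:

```sql
ALTER TABLE `search_engine_rankings`
  DROP INDEX `unique_ranking`,
  ADD UNIQUE KEY `unique_ranking`
      (`keyword_id`,`search_engine_id`,`indexed_at`,`rank`);
```

The uniqueness guarantee is unchanged (same four columns), only the order inside the index differs, so existing data cannot violate the new key.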
I know it's an old post, but I was experiencing the same situation and didn't find an answer.
This really happens in MySQL: when you select varchar columns, it can take a lot of processing time. My query took about 20 sec to process 1.7M rows, and now it takes about 1.9 sec.
Ok first of all, create a view from this query:
CREATE VIEW view_one AS
select id,rank
from search_engine_rankings
where keyword_id = 19000
and search_engine_id = 11
and indexed_at = "2010-12-03";
Second, same query but with an inner join:
select v.*, s.url
from view_one AS v
inner join search_engine_rankings s ON s.id=v.id;
TLDR: I solved this by running optimize on the table.
I experienced the same just now. Even lookups on the primary key, selecting just a few rows, were slow. Testing a bit, I found it was not limited to the varchar column; selecting an int also took a considerable amount of time.
A query roughly looking like this took around 3s:
select someint from mytable where id in (1234, 12345, 123456).
While a query roughly looking like this took <10ms:
select count(*) from mytable where id in (1234, 12345, 123456).
The approved answer here is to make an index spanning someint as well, so the query becomes fast because MySQL can fetch everything it needs from the index and never touch the table. That probably works in some settings, but I think it's a silly workaround: something is clearly wrong if it takes three seconds to fetch three rows from a table. Besides, most applications just do a "select * from mytable", and making changes on the application side is not always trivial.
After OPTIMIZE TABLE, both queries take <10ms.
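For reference, the fix was just the following statement (table name from the examples above). On InnoDB, OPTIMIZE TABLE maps to a full table rebuild, so expect it to take a while on big tables:

```sql
OPTIMIZE TABLE mytable;
```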

MySQL 1 million row query speed

I'm having trouble getting a decent query time out of a large MySQL table; currently it's taking over 20 seconds. The problem lies in the GROUP BY, as MySQL needs to run a filesort, but I don't see how I can get around this.
QUERY:
SELECT play_date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE type = 'play'
AND play_date BETWEEN '2009-02-23'
AND '2009-02-24'
GROUP BY play_date
ORDER BY play_date desc
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE log ALL type,type_2 NULL NULL NULL 530892 Using where; Using filesort
TABLE STRUCTURE
CREATE TABLE IF NOT EXISTS `log` (
`id` int(11) NOT NULL auto_increment,
`email` varchar(255) NOT NULL,
`type` enum('played','reg','friend') NOT NULL,
`timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
`play_date` date NOT NULL,
`email_refer` varchar(255) NOT NULL,
`remote_addr` varchar(15) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `type` (`type`),
KEY `email_refer` (`email_refer`),
KEY `type_2` (`type`,`timestamp`,`play_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=707859 ;
If anyone knows how I could improve the speed, I would be very grateful.
Tom
EDIT
I've added the new index with just play_date and type but MySQL refuses to use it
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE log ALL play_date NULL NULL NULL 801647 Using where; Using filesort
This index was created using ALTER TABLE log ADD INDEX (type, play_date);
You need to create index on fields type AND play_date.
Like this:
ALTER TABLE `log` ADD INDEX (`type`, `play_date`);
Or, alternately, you can rearrange your last key like this:
KEY `type_2` (`type`,`play_date`,`timestamp`)
so MySQL can use its left part as a key.
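A sketch of that rearrangement, dropping and recreating the key under the same name as in the original schema:

```sql
ALTER TABLE `log`
  DROP INDEX `type_2`,
  ADD INDEX `type_2` (`type`, `play_date`, `timestamp`);
```

With type first and play_date second, the WHERE clause's equality on type plus the BETWEEN range on play_date line up with the leftmost prefix of the index.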
You should add an index on the fields you base your search on.
In your case, those are play_date and type.
You're not taking advantage of the key named type_2. It is a composite key for type, timestamp and play_date, but you're filtering by type and play_date, ignoring timestamp. Because of this, the engine can't make use of that key.
You should create an index on the fields type and play_date, or remove timestamp from the key type_2.
Or you could try to incorporate timestamp into your current query as a filter. But judging from your current query I don't think that is logical.
Does there need to be an index on play_date, or move the position in the composite index to second place?
The fastest option would be this:
ALTER TABLE `log` ADD INDEX (`type`, `play_date`, `email`);
It would turn this index into a "covering index", meaning the query would only access the index (likely cached in memory) and would not even need to go to the hard disk.
The DESC parameter is causing MySQL not to use the index for the ORDER BY. You can leave it ASC and iterate the resultset in reverse on the client side (?).

How do you setup Post Revisions/History Tracking with ORM?

I am trying to figure out how to set up a revisions system for posts and other content. I figured it would need to work with a basic belongs_to/has_one/has_many/has_many_through ORM (any good ORM should support these).
I was thinking that I could have some tables like these (with matching models):
[[POST]] (has_many (text) through (revisions))
id
title
[[Revisions]] (belongs_to posts/text)
id
post_id
text_id
date
[[TEXT]]
id
body
user_id
Where I could join THROUGH the revisions table to get the latest TEXT body. But I'm kind of foggy on how it will all work. Has anyone setup something like this?
Basically, I need to be able to load an article and request the latest content entry.
// Get the post row
$post = new Model_Post($id);
// Get the latest revision (JOIN through revisions to TEXT) and print that body.
$post->text->body;
Having the ability to shuffle back in time to previous revisions and removing revisions would also be a big help.
At any rate, these are just ideas of how I think that some kind of history tracking would work. I'm open to any form of tracking I just want to know what the best-practice is.
:EDIT:
It seems that, moving forward, two tables make the most sense. Since I plan to store two copies of the text, this will also help save space. The first table, posts, will store the data of the current revision for fast reads without any joins. The post's body will be the value of the matching revision's text field, but processed through markdown/bbcode/tidy/etc. This lets me retain the original text (for the next edit) without storing it twice in one revision row (or re-parsing it each time I display it).
So fetching will be ORM-friendly. Then, for creates/updates, I will handle revisions separately and just update the post object with the new current-revision values.
CREATE TABLE IF NOT EXISTS `posts` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`published` tinyint(1) unsigned DEFAULT NULL,
`allow_comments` tinyint(1) unsigned DEFAULT NULL,
`user_id` int(11) NOT NULL,
`title` varchar(100) NOT NULL,
`body` text NOT NULL,
`created` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `published` (`published`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
CREATE TABLE IF NOT EXISTS `postsrevisions` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`is_current` tinyint(1) unsigned DEFAULT NULL,
`date` datetime NOT NULL,
`title` varchar(100) NOT NULL,
`text` text NOT NULL,
`image` varchar(200) NOT NULL,
PRIMARY KEY (`id`),
KEY `post_id` (`post_id`),
KEY `user_id` (`user_id`),
KEY `is_current` (`is_current`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
Your Revisions table as you have shown it models a many-to-many relationship between Posts and Text. This is probably not what you want, unless a given row in Text may provide the content for multiple rows in Posts. This is not how most CMS architectures work.
You certainly don't need three tables. I have no idea why you think this is needed for 3NF. The point of 3NF is that an attribute should not depend on a non-key attribute, it doesn't say you should split into multiple tables needlessly.
So you might only need a one-to-many relationship between two tables: Posts and Revisions. That is, for each post, there can be multiple revisions, but a given revision applies to only one post. Others have suggested two alternatives for finding the current post:
A flag column in Revisions to note the current revision. Changing the current revision is as simple as changing the flag to true in the desired revision and to false to the formerly current revision.
A foreign key in Posts to the revision that is current for the given post. This is even simpler, because you can change the current revision in one update instead of two. But circular foreign key references can cause problems vis-a-vis backup & restore, cascading updates, etc.
You could even implement the revision system using a single table:
CREATE TABLE PostRevisions (
post_revision_id SERIAL PRIMARY KEY,
post_id INT NOT NULL,
is_current TINYINT NULL,
date DATE,
title VARCHAR(80) NOT NULL,
text TEXT NOT NULL,
UNIQUE KEY (post_id, is_current)
);
I'm not sure it's duplication to store the title with each revision, because the title could be revised as much as the text, couldn't it?
The column is_current should be either 1 or NULL. A unique constraint doesn't count NULLs, so you can have only one row where is_current is 1 and an unlimited number of rows where it's NULL.
This does require updating two rows to make a revision current, but you gain some simplicity by reducing the model to a single table. This is a great advantage when you're using an ORM.
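For example, promoting a given revision to current under that scheme takes two updates (the IDs here are illustrative). Demote the old row first, inside a transaction, so the unique constraint on (post_id, is_current) is never violated mid-flight:

```sql
START TRANSACTION;
-- Demote whatever revision of post 7 is currently flagged
UPDATE PostRevisions SET is_current = NULL
 WHERE post_id = 7 AND is_current = 1;
-- Promote revision 42 to current
UPDATE PostRevisions SET is_current = 1
 WHERE post_revision_id = 42;
COMMIT;
```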
You can create a view to simplify the common case of querying current posts:
CREATE VIEW Posts AS SELECT * FROM PostRevisions WHERE is_current = 1;
update: Re your updated question: I agree that proper relational design would encourage two tables, so that a few attributes of a Post could be invariant across all of that post's revisions. But most ORM tools assume an entity lives in a single table, and ORMs are clumsy at joining rows from multiple tables to constitute a single entity. So I would say that if using an ORM is a priority, you should store the posts and revisions in a single table, sacrificing a little relational correctness to fit the assumptions of the ORM paradigm.
Another suggestion is to consider Dimensional Modeling. This is a school of database design to support OLAP and data warehousing. It uses denormalization judiciously, so you can usually organize data in a Star Schema. The main entity (the "Fact Table") is represented by a single table, so this would be a win for an ORM-centric application design.
You'd probably be better off in this case to put a CurrentTextID on your Post table to avoid having to figure out which revision is current (an alternative would be a flag on Revision, but I think a CurrentTextID on the post will give you easier queries).
With the CurrentTextID on the Post, your ORM should place a single property (CurrentText) on your Post class which would allow you to access the current text with essentially the statement you provided.
Your ORM should also give you some way to load the Revisions based on the Post; If you want more details about that then you should include information about which ORM you are using and how you have it configured.
I think two tables would suffice here: a post table and its revisions. If you're not worried about duplicating data, a single (denormalized) table could also work.
For anyone interested, here is how wordpress handles revisions using a single MySQL posts table.
CREATE TABLE IF NOT EXISTS `wp_posts` (
`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_author` bigint(20) unsigned NOT NULL DEFAULT '0',
`post_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`post_date_gmt` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`post_content` longtext NOT NULL,
`post_title` text NOT NULL,
`post_excerpt` text NOT NULL,
`post_status` varchar(20) NOT NULL DEFAULT 'publish',
`comment_status` varchar(20) NOT NULL DEFAULT 'open',
`ping_status` varchar(20) NOT NULL DEFAULT 'open',
`post_password` varchar(20) NOT NULL DEFAULT '',
`post_name` varchar(200) NOT NULL DEFAULT '',
`to_ping` text NOT NULL,
`pinged` text NOT NULL,
`post_modified` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`post_modified_gmt` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`post_content_filtered` text NOT NULL,
`post_parent` bigint(20) unsigned NOT NULL DEFAULT '0',
`guid` varchar(255) NOT NULL DEFAULT '',
`menu_order` int(11) NOT NULL DEFAULT '0',
`post_type` varchar(20) NOT NULL DEFAULT 'post',
`post_mime_type` varchar(100) NOT NULL DEFAULT '',
`comment_count` bigint(20) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`),
KEY `post_name` (`post_name`),
KEY `type_status_date` (`post_type`,`post_status`,`post_date`,`ID`),
KEY `post_parent` (`post_parent`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ;