How can I optimize this SQL query to get rid of the filesort and temp table?

Here's the query:
SELECT
count(id) AS count
FROM `numbers`
GROUP BY
MONTH(created_at),
YEAR(created_at)
ORDER BY
YEAR(created_at),
MONTH(created_at)
That query throws a 'Using temporary' and 'Using filesort' when doing EXPLAIN.
Ultimately what I'm doing is looking at a table of user-submitted tracking numbers and counting the number of submitted rows, grouping the counts by month/year.
i.e. in November 2008 there were 11,312 submitted rows.
UPDATE: here's the DESCRIBE for the numbers table.
Field               Type          Null  Key  Default  Extra
id                  int(11)       NO    PRI  NULL     auto_increment
tracking            varchar(255)  YES        NULL
service             varchar(255)  YES        NULL
notes               text          YES        NULL
user_id             int(11)       YES        NULL
active              tinyint(1)    YES        1
deleted             tinyint(1)    YES        0
feed                text          YES        NULL
status              varchar(255)  YES        NULL
created_at          datetime      YES        NULL
updated_at          datetime      YES        NULL
scheduled_delivery  date          YES        NULL
carrier_service     varchar(255)  YES        NULL

Give this a shot:
SELECT COUNT(x.id)
FROM (SELECT t.id,
MONTH(t.created_at) 'created_month',
YEAR(t.created_at) 'created_year'
FROM NUMBERS t) x
GROUP BY x.created_month, x.created_year
ORDER BY x.created_month, x.created_year
It's not a good habit to apply functions to columns in the WHERE, GROUP BY and ORDER BY clauses, because indexes on those columns can't be used.
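For example, here is a minimal sketch of the difference for the table above: a plain range predicate on created_at can use an index on that column, while wrapping the column in YEAR()/MONTH() cannot.
-- Not sargable: an index on created_at cannot be used.
-- SELECT COUNT(id) FROM `numbers` WHERE YEAR(created_at) = 2008 AND MONTH(created_at) = 11;

-- Sargable: a range on the bare column can use an index on created_at.
SELECT COUNT(id)
FROM `numbers`
WHERE created_at >= '2008-11-01'
AND created_at < '2008-12-01';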
...query throws a 'Using temporary' and 'Using filesort' when doing EXPLAIN.
From what I found, that's to be expected when using DISTINCT/GROUP BY.

Make sure you have a covering index over YEAR and MONTH (that is, both fields within the same index) so that the ORDER BY component of your query can use an index. This should remove the need for a filesort, although a temporary table may still be needed to handle the grouping.

SELECT
count(`id`) AS count, MONTH(`created_at`) as month, YEAR(`created_at`) as year
FROM `numbers`
GROUP BY month, year
ORDER BY year, month
This will be the best you can get, as far as I can tell. I created a table with an id and a datetime column and filled it with 10,000 rows. The other answer's query uses a subselect, but it really doesn't do anything different and adds the overhead of the subselect. The resulting time for mine was 0.015s and his was 0.016s.
Make sure that you have an index on created_at; this will help your initial query. It is pretty rare not to end up with a filesort when a GROUP BY is involved, though it may be avoidable in other situations. MySQL's docs have an article about GROUP BY optimization if you feel so inclined, but I do not see how those methods can be applied here with the information you have provided.

Whenever MySQL has to do work in memory, and that work exceeds the available amount (innodb_buffer_pool_size), it starts having to use the disk to store temporary work. You could increase the variable I mentioned, but setting it too high could cause performance problems in other areas.
If you're running a dedicated server, set it to roughly 50-75% of available memory.

The best method would be creating a helper column that contains the numeric values of YEAR and MONTH combined together:
YEAR(created_at) * 100 + MONTH(created_at)
Grouping on this column would use INDEX FOR GROUP BY.
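As a hedged sketch, on MySQL 5.7+ that helper column can be a stored generated column with its own index (the column and index names here are made up; on older versions you would maintain the column with triggers or application code):
ALTER TABLE `numbers`
ADD COLUMN created_ym INT AS (YEAR(created_at) * 100 + MONTH(created_at)) STORED,
ADD INDEX idx_created_ym (created_ym);

SELECT created_ym, COUNT(id) AS count
FROM `numbers`
GROUP BY created_ym
ORDER BY created_ym;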
Alternatively, you can create two helper tables: the first one containing a reasonable number of years (say, from 1900 to 2100), the second one containing month offsets (from 0 to 11), and use these tables to generate the sets:
SELECT y AS year, m + 1 AS month,
(
SELECT COUNT(*)
FROM numbers
WHERE created_at >= '1900-01-01' + INTERVAL (y - 1900) YEAR + INTERVAL m MONTH
AND created_at < '1900-01-01' + INTERVAL (y - 1900) YEAR + INTERVAL (m + 1) MONTH
) AS count
FROM year_table
CROSS JOIN month_table
WHERE y BETWEEN 2008 AND 2010
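For completeness, a sketch of the two helper tables that query assumes (the single-column names y and m match the query above; populate them once however you like):
CREATE TABLE year_table (y INT PRIMARY KEY);
CREATE TABLE month_table (m INT PRIMARY KEY);

INSERT INTO year_table (y) VALUES (1900), (1901) /* ... through (2100) */;
INSERT INTO month_table (m) VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11);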

I'm sorry, but I have to disagree with the other answers.
I think what you need is to add an index to your table, preferably a covering index.
If you add an index on the columns you are searching on (created_at) and also on the columns you want to get a result from (id), then it will be dramatically faster than before.
The reason why you are using a temp table is because you use a group by.
To speed up the GROUP BY, you can change the MySQL server settings to increase tmp_table_size and max_heap_table_size so that the temporary table stays in memory.
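As a rough sketch of both suggestions (the index name and the sizes are examples only, not recommendations):
-- Index the column the grouping is derived from; on InnoDB the primary key (id)
-- is implicitly appended to every secondary index, so this also covers COUNT(id).
ALTER TABLE `numbers` ADD INDEX idx_created_at (created_at);

-- Let larger implicit temporary tables stay in memory (both limits apply together).
SET SESSION tmp_table_size = 64 * 1024 * 1024;
SET SESSION max_heap_table_size = 64 * 1024 * 1024;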

Related

Date Range Nested Loop Optimization

Redshift db.
Table A is a date/calendar table.
Table B is a member table, structured as a type 6 slowly changing dimension. It has nearly 200M records.
The goal is to write a performant query that gives the count of members for every day in the last 4 years. My first attempt resulted in a query like this:
select
date,
location,
sub_location,
race,
gender,
dob,
member_type,
count(distinct member_id)
from date_table d
join member_table m
on m.row_start <= d.full_date
and m.row_end >= d.full_date
and m.is_active = 'Y'
and m.row_end >= '2019-01-01'
where d.date_key >= 20190101
and d.date_key <= to_char(current_date, 'yyyymmdd')
group by
date,
location,
sub_location,
race,
gender,
dob,
member_type
The performance on this is god awful because of the join being a nested loop. I've been trying to think of a way to rework this to avoid that issue but have not had any success. Curious if there is a way to do so that would increase performance significantly.
For reference here are the table designs as well as the explain plan:
create table date_table
(
date_key integer not null encode delta
primary key,
full_date date encode delta
)
diststyle all
sortkey (date_key);
create table member_table
(
member_key bigint not null
primary key,
member_id integer,
location integer distkey,
sub_location integer encode zstd,
gender varchar(50) encode zstd,
race varchar(100) encode zstd,
date_of_birth date encode delta32k,
member_type char(10) encode zstd,
active char encode zstd,
row_start timestamp encode zstd,
row_end timestamp encode zstd
)
diststyle key
interleaved sortkey (location, member_id);
(execution plan screenshot not included)
I've rewritten the query in various ways, none of which meaningfully impacted performance.
The output should be
Date, member attributes, count of records
You're in luck as I solved this exact issue a few years back and wrote up a description on the solution. You can find it here - http://wad-design.s3-website-us-east-1.amazonaws.com/sql_limits_wp.html
The basic issue you are facing is the need to massively grow the data before you can condense it down. These fat-in-the-middle queries can be expensive on Redshift and often spill to disk, making them even more expensive. The solution is not to create a row for each account for each date, but rather to look at it as counting account starts by date and account ends by date: the count of active accounts is the difference between the rolling sums of these values.
I was able to take a clients query run time down from 45 minutes to 17 seconds using this approach.
If the approach isn't clear let me know in a comment and I can help apply this approach to your situation. It can trip people up the first time.
This approach can be used to solve other problems efficiently like joining on the nearest date.
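To make the idea concrete, here is a hedged sketch against the posted DDL. It drops the per-attribute grouping and the distinct count to keep the shape visible, uses the active and member_table names from the DDL (the query above says is_active), and only filters to the last four years after the running sum so that members who started earlier are still counted.
WITH daily_net AS (
    -- +1 on the day a member row becomes active, -1 on the day after it ends
    SELECT event_date, SUM(delta) AS net
    FROM (
        SELECT CAST(row_start AS DATE) AS event_date, 1 AS delta
        FROM member_table
        WHERE active = 'Y'
        UNION ALL
        SELECT CAST(DATEADD(day, 1, row_end) AS DATE), -1
        FROM member_table
        WHERE active = 'Y' AND row_end IS NOT NULL
    ) e
    GROUP BY event_date
)
SELECT full_date, active_members
FROM (
    SELECT d.full_date,
           SUM(COALESCE(n.net, 0)) OVER (ORDER BY d.full_date
                                         ROWS UNBOUNDED PRECEDING) AS active_members
    FROM date_table d
    LEFT JOIN daily_net n ON n.event_date = d.full_date
) x
WHERE full_date >= '2019-01-01'
ORDER BY full_date;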

SQL Server : index for finding latest value which is greater than a passed value

I have a table with 4 columns
USER_ID: numeric
EVENT_DATE: date
VERSION: date
SCORE: decimal
I have a clustered index on (USER_ID, EVENT_DATE, VERSION). These three values together are unique.
I need to get the maximum EVENT_DATE for a set of USER_IDs (~1000 different ids) where the SCORE is larger than a specific value, considering only the entries with a specific VERSION.
SELECT M.*
FROM (VALUES
( 5237 ),
………1000 more
( 27054 ) ) C (USER_ID)
CROSS APPLY
(SELECT TOP 1 C.USER_ID, M.EVENT_DATE, M.SCORE
FROM MY_HUGE_TABLE M
WHERE C.USER_ID = M.USER_ID
AND M.VERSION = 'xxxx-xx-xx'
AND M.SCORE > 2 -- removing this SCORE filter makes the query much faster
ORDER BY M.EVENT_DATE DESC) M
When I execute the query, the runtime is poor, presumably due to a missing index on the SCORE column.
If I remove the filter on M.SCORE > 2, I get my results ten times faster; however, the latest scores may then be less than 2.
Could anyone please hint at how to set up an index that would improve this query's performance?
Thank you very much in advance
For your query, the optimal index would be on (USER_ID, VERSION, EVENT_DATE DESC, SCORE).
Unfortunately, your clustered index doesn't match. Only its first and third columns (USER_ID and VERSION) appear in that leading prefix, and they would need to be adjacent and in that order, so in practice only USER_ID can help, and that probably doesn't do much to filter the data.
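As DDL, that suggested index might look like this (a hedged sketch; the index name is made up):
CREATE NONCLUSTERED INDEX IX_MY_HUGE_TABLE_User_Version_EventDate_Score
ON MY_HUGE_TABLE (USER_ID, VERSION, EVENT_DATE DESC, SCORE);
-- A variant worth testing: leave SCORE out of the key and add it with INCLUDE (SCORE) instead.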

order by field with more than 10000 ids

I need to do specific ordering with use of order by field.
select * from table order by field(id,3,4,1,2.......upto 10000 ids)
Since the required ordering can't be derived within SQL itself, how much does this affect performance, and is it even feasible?
Updates from the comments:
Ordering depends on user and category IDs and can be anything the user wants.
The ordering specification changes (about) daily.
So, we need a custom ordering that depends on the user and category and this ordering needs to change daily.
The easiest way would be to put your ordering in a separate table (called ordering_table in this example):
id | position
----+----------
1 | 11
2 | 42
3 | 23
etc.
The above would mean "put an id of 1 at position 11, 2 at position 42, 3 at position 23, ...". Then you can join that ordering table in:
SELECT t.id, t.col1, t.col2
FROM some_table t
JOIN ordering_table o ON (t.id = o.id)
ORDER BY o.position
Where ordering_table is the table (as above) that defines your strange ordering. This approach simply represents your ordering function as a table (any function with a finite domain is, essentially, just a table after all).
This "ordering table" approach should work fine as long as the ordering table is complete.
If you only need this strange ordering in one place then you could merge the position column into your main table and add NOT NULL and UNIQUE constraints on that column to make sure you cover everything and have a consistent ordering.
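A hedged sketch of that merge in MySQL (table and column names are illustrative): add the column as nullable, backfill a position for every row, then tighten the constraints.
ALTER TABLE some_table ADD COLUMN position INT NULL;
-- ... backfill position for every existing row here ...
ALTER TABLE some_table
MODIFY COLUMN position INT NOT NULL,
ADD UNIQUE KEY uq_some_table_position (position);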
Further commenting indicates that you want different orderings for different users and categories and that the ordering will change on a daily basis. You could make separate tables for each condition (which would lead to a combinatorial explosion) or, as Mikael Eriksson and ypercube suggest, add a couple more columns to the ordering table to hold the user and category:
CREATE TABLE ordering_table (
thing_id INT NOT NULL,
position INT NOT NULL,
user_id INT NOT NULL,
category_id INT NOT NULL
);
The thing_id, user_id, and category_id would be foreign keys to their respective tables, and you'd probably want to index all the columns in ordering_table, though spending a couple of minutes looking at the query plans to see if the indexes actually get used would be worthwhile. You could also make all four columns the primary key to avoid duplicates. Then, the lookup query would be something like this:
SELECT t.id, t.col1, t.col2
FROM some_table t
LEFT JOIN ordering_table o
ON (t.id = o.thing_id AND o.user_id = $user AND o.category_id = $cat)
ORDER BY COALESCE(o.position, 99999)
Where $user and $cat are the user and category IDs (respectively). Note the change to a LEFT JOIN and the addition of COALESCE to allow for missing rows in ordering_table; these changes push anything that doesn't have a specified position to the bottom of the list rather than removing it from the results completely.

SQL standard select current records from an audit log question

My memory is failing me. I have a simple audit log table based on a trigger:
ID int (identity, PK)
CustomerID int
Name varchar(255)
Address varchar(255)
AuditDateTime datetime
AuditCode char(1)
It has data like this:
ID  CustomerID  Name     Address              AuditDateTime            AuditCode
1   123         Bob      123 Internet Way     2009-07-17 13:18:06.353  I
2   123         Bob      123 Internet Way     2009-07-17 13:19:02.117  D
3   123         Jerry    123 Internet Way     2009-07-17 13:36:03.517  I
4   123         Bob      123 My Edited Way    2009-07-17 13:36:08.050  U
5   100         Arnold   100 SkyNet Way       2009-07-17 13:36:18.607  I
6   100         Nicky    100 Star Way         2009-07-17 13:36:25.920  U
7   110         Blondie  110 Another Way      2009-07-17 13:36:42.313  I
8   113         Sally    113 Yet another Way  2009-07-17 13:36:57.627  I
What would an efficient select statement be to get all the most current records between a start and end time? FYI: I is for insert, D for delete, and U for update.
Am I missing anything in the audit table? My next step is to create an audit table that only records changes yet still lets you extract the most recent records for a given time frame. For the life of me I cannot find this on any search engine easily. Links would work too. Thanks for the help.
Another (better?) method to keep audit history is to use a 'startDate' and 'endDate' column rather than an auditDateTime and AuditCode column. This is often the approach in tracking Type 2 changes (new versions of a row) in data warehouses.
This lets you more directly select the current rows (WHERE endDate is NULL), and you will not need to treat updates differently than inserts or deletes. You simply have three cases:
Insert: copy the full row along with a start date and NULL end date
Delete: set the End Date of the existing current row (endDate is NULL)
Update: do a Delete then Insert
Your select would simply be:
select * from AuditTable where endDate is NULL
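For illustration, a hedged sketch of the update case under that scheme (SQL Server syntax; the startDate/endDate column names and the parameters are assumed, not from your table): close the current version, then insert the new one.
-- Close the current version of the customer's row...
UPDATE AuditTable
SET endDate = GETDATE()
WHERE CustomerID = @CustomerID
AND endDate IS NULL;

-- ...then insert the new version as the current row.
INSERT INTO AuditTable (CustomerID, Name, Address, startDate, endDate)
VALUES (@CustomerID, @Name, @Address, GETDATE(), NULL);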
Anyway, here's my query for your existing schema:
declare @from datetime
declare @to datetime
select b.* from (
select
customerId,
max(auditdatetime) 'auditDateTime'
from
AuditTable
where
auditcode in ('I', 'U')
and auditdatetime between @from and @to
group by customerId
having
/* rely on "current" being defined as INSERTS > DELETES */
sum(case when auditcode = 'I' then 1 else 0 end) >
sum(case when auditcode = 'D' then 1 else 0 end)
) a
cross apply(
select top 1 customerId, name, address, auditdateTime
from AuditTable
where auditdatetime = a.auditdatetime and customerId = a.customerId
) b
References
A cribsheet for data warehouses, but has a good section on type 2 changes (what you want to track)
MSDN page on data warehousing
Ok, a couple of things for audit log tables.
For most applications, we want audit tables to be extremely quick on insertion.
If the audit log is truly for diagnostic or for very irregular audit reasons, then the quickest insertion criteria is to make the table physically ordered upon insertion time.
And this means to put the audit time as the first column of the clustered index, e.g.
create unique clustered index idx_mytable on mytable(AuditDateTime, ID)
This will allow for extremely efficient select queries upon AuditDateTime O(log n), and O(1) insertions.
If you wish to look up your audit table on a per CustomerID basis, then you will need to compromise.
You may add a nonclustered index upon (CustomerID, AuditDateTime), which will allow for O(log n) lookup of per-customer audit history, however the cost will be the maintenance of that nonclustered index upon insertion - that maintenance will be O(log n) conversely.
However that insertion time penalty may be preferable to the table scan (that is, O(n) time complexity cost) that you will need to pay if you don't have an index on CustomerID and this is a regular query that is performed.
An O(n) lookup for an irregular query locks the table against the writing process and can block writers, so it is sometimes in the writers' interest to be slightly slower on insert if that guarantees that readers, who would otherwise have to table scan for lack of a good index, won't block their commits.
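In DDL, that compromise index would be something like this (name made up):
create nonclustered index idx_mytable_customer_date on mytable(CustomerID, AuditDateTime)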
Addition: if you are looking to restrict to a given timeframe, the most important thing first of all is the index upon AuditDateTime. And make it clustered as you are inserting in AuditDateTime order. This is the biggest thing you can do to make your query efficient from the start.
Next, if you are looking for the most recent update for all CustomerID's within a given timespan, well thereafter a full scan of the data, restricted by insertion date, is required.
You will need to do a subquery upon your audit table, between the range,
select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
and then incorporate that into your select query proper, e.g.:
select AuditTrail.* from AuditTrail
inner join
(select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
) filtration
on filtration.CustomerID = AuditTrail.CustomerID and
filtration.AuditDateTime = AuditTrail.AuditDateTime
Another approach is using a subselect to find the latest audit row per customer:
select a.ID
, a.CustomerID
, a.Name
, a.Address
, a.AuditDateTime
, a.AuditCode
from myauditlogtable a
join (select s.CustomerID
, max(s.AuditDateTime) as maxAuditDateTime
from myauditlogtable s
group by s.CustomerID) subq
on subq.CustomerID = a.CustomerID
and subq.maxAuditDateTime = a.AuditDateTime;
Start and end time? e.g. as in between 1am and 3am?
Or start and end date-time? e.g. as in 2009-07-17 13:36 to 2009-07-18 13:36?

SQL Query for count of records matching day in a date range?

I have a table with records that look like this:
CREATE TABLE sample (
ix int unsigned auto_increment primary key,
start_active datetime,
last_active datetime
);
I need to know how many records were active on each of the last 30 days. The days should also be sorted incrementing so they are returned oldest to newest.
I'm using MySQL and the query will be run from PHP but I don't really need the PHP code, just the query.
Here's my start:
SELECT COUNT(1) cnt, DATE(?each of last 30 days?) adate
FROM sample
WHERE adate BETWEEN start_active AND last_active
GROUP BY adate;
Do an outer join.
No table? Make a table. I always keep a dummy table around just for this.
create table artificial_range(
id int not null primary key auto_increment,
name varchar( 20 ) null ) ;
-- or whatever your database requires for an auto increment column
insert into artificial_range( name ) values ( null )
-- create one row.
insert into artificial_range( name ) select name from artificial_range;
-- you now have two rows
insert into artificial_range( name ) select name from artificial_range;
-- you now have four rows
insert into artificial_range( name ) select name from artificial_range;
-- you now have eight rows
--etc.
insert into artificial_range( name ) select name from artificial_range;
-- you now have 1024 rows, with ids 1-1024
Now make it convenient to use, and limit it to 30 days, with a view:
Edit: JR Lawhorne notes:
You need to change "date_add" to "date_sub" to get the previous 30 days in the created view.
Thanks JR!
create view each_of_the_last_30_days as
select date_sub( now(), interval (id - 1) day ) as adate
from artificial_range where id < 32;
Now use this in your query (I haven't actually tested your query, I'm just assuming it works correctly):
Edit: I should be joining the other way:
SELECT COUNT(a.ix) cnt, b.adate
FROM each_of_the_last_30_days b
left outer join sample a
on ( b.adate BETWEEN a.start_active AND a.last_active)
GROUP BY b.adate;
SQL is great at matching sets of values that are stored in the database, but it isn't so great at matching sets of values that aren't in the database. So one easy workaround is to create a temp table containing the values you need:
CREATE TEMPORARY TABLE days_ago (d SMALLINT);
INSERT INTO days_ago (d) VALUES
(0), (1), (2), ... (29), (30);
Now you can compare a date that is d days ago to the span between start_active and last_active of each row. Count how many matching rows in the group per value of d and you've got your count.
SELECT CURRENT_DATE - INTERVAL d DAY AS adate, COUNT(s.ix) AS cnt
FROM days_ago
LEFT JOIN sample s ON (CURRENT_DATE - INTERVAL d DAY BETWEEN s.start_active AND s.last_active)
GROUP BY d
ORDER BY d DESC; -- oldest to newest
Another note: you can't use column aliases defined in your select-list in expressions until you get to the GROUP BY clause. Actually, in standard SQL you can't use them until the ORDER BY clause, but MySQL supports using aliases in GROUP BY and HAVING clauses as well.
Turn the date into a unix timestamp, which is seconds, in your query and then just look for the difference to be <= the number of seconds in a month.
You can find more information here:
http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_unix-timestamp
If you need help with the query please let me know, but MySQL has nice functions for dealing with datetime.
[Edit] Since I was confused as to the real question: I need to finish the lawn, but before I forget I want to write this down.
To get a count per day, you will want your WHERE clause to limit to the past 30 days as I described above, but you will also need to group by day: convert each start to a calendar day and then count those rows, as in the sketch below.
This assumes that each use is limited to one day; if the start and end dates can span several days then it will be trickier.
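A rough sketch of what I mean (assuming the activity of each row is pinned to its start_active day):
SELECT DATE(start_active) AS adate, COUNT(*) AS cnt
FROM sample
WHERE UNIX_TIMESTAMP(start_active) >= UNIX_TIMESTAMP(NOW()) - 30 * 24 * 60 * 60
GROUP BY adate
ORDER BY adate;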