Details:
MariaDB: Server version: 10.2.10-MariaDB MariaDB Server
The DB table, trans_tbl, is using the Aria storage engine
Table is somewhat large: 126,006,123 rows
Server is not at all large: AWS t3 micro w/attached 30GB EBS
I applied indexes to this DB table as follows:
A primary key: evt_id
Another index on the column I want to group by: transaction_type
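For reference, the indexes described above correspond to DDL roughly like the following (a sketch; the actual statements may have differed, but the index name TransType matches the EXPLAIN output further down):
ALTER TABLE trans_tbl ADD PRIMARY KEY (evt_id);
CREATE INDEX TransType ON trans_tbl (transaction_type);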
3 Related Questions:
Why is the transaction_type index ignored when I perform the following?
SELECT COUNT(evt_id), transaction_type FROM trans_tbl GROUP BY transaction_type
If I look at the output from EXPLAIN, I see:
MariaDB [my_db]> EXPLAIN SELECT COUNT(evt_id), transaction_type FROM trans_tbl GROUP BY transaction_type;
+------+-------------+-----------+------+---------------+------+---------+------+-----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+------+---------------+------+---------+------+-----------+---------------------------------+
| 1 | SIMPLE | trans_tbl | ALL | NULL | NULL | NULL | NULL | 126006123 | Using temporary; Using filesort |
+------+-------------+-----------+------+---------------+------+---------+------+-----------+---------------------------------+
What's confusing me here is that both of the items in the query are indexed. So, shouldn't the index(es) be utilized?
Why is the transaction_type index being used in the following case, where all I've done is switch from COUNT(evt_id) -- the primary key -- to COUNT(1)? (The column is transaction_type; the index generated from it is called TransType.)
MariaDB [my_db]> EXPLAIN SELECT COUNT(1), transaction_type FROM trans_tbl GROUP BY transaction_type;
+------+-------------+-----------+-------+---------------+-----------+---------+------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+-------+---------------+-----------+---------+------+-----------+-------------+
| 1 | SIMPLE | trans_tbl | index | NULL | TransType | 35 | NULL | 126006123 | Using index |
+------+-------------+-----------+-------+---------------+-----------+---------+------+-----------+-------------+
The first query (with COUNT(evt_id)) takes 2 minutes & 40 seconds. Since it is not using the indices, that makes sense. But the second query (with COUNT(1)) takes 50 seconds. This makes no sense to me. Shouldn't it take essentially 0 seconds? Can't it just look at the first and last index value of each group, subtract them, and have the count? It seems to me that it is indeed actually counting. What's the point of an index?
I guess my more important question is: How do I set up my indexes to allow for grouping on that index to return results almost instantaneously, as I would expect?
PS I know the machine is ridiculously underpowered for this size of DB table. But, the table data is not worth throwing a lot of money at it to improve performance. I'd rather just learn to implement Aria indexes properly to gain speed.
COUNT(x) checks x for being NOT NULL before counting the row.
COUNT(*) is the usual pattern for counting rows.
So...
SELECT COUNT(evt_id), transaction_type
FROM trans_tbl GROUP BY transaction_type;
decided to do a table scan, then sort and group.
SELECT COUNT(*), transaction_type
FROM trans_tbl GROUP BY transaction_type;
saw INDEX(transaction_type) and said "goodie; I can just scan that index without having to sort." Note: It still has to scan in order to count. But the INDEX is smaller than the table, so it could be done faster. This is also called a "covering" index since all the columns needed in the SELECT are found in that one INDEX.
COUNT(1) is treated the same as COUNT(*), since the constant 1 can never be NULL.
INDEX(transaction_type) is essentially identical to INDEX(transaction_type, evt_id). This is because the PRIMARY KEY is silently tacked onto any secondary key in InnoDB.
I don't know why INDEX(transaction_type, evt_id) was not used. Bottom line: Use COUNT(*).
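One note: the silent PK-append described above is an InnoDB behavior; Aria, like MyISAM, stores row pointers rather than the primary key in its secondary indexes. So on this Aria table an explicit composite index may be worth an experiment (a sketch; the index name is made up):
CREATE INDEX TransTypeEvt ON trans_tbl (transaction_type, evt_id);
-- This index covers both columns of the query, so even COUNT(evt_id)
-- can be answered from the index alone:
SELECT COUNT(evt_id), transaction_type FROM trans_tbl GROUP BY transaction_type;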
Why not 0 seconds? The counts are not saved anywhere. Anyway, there could be other queries modifying the counts as you run your SELECT. The improvement came from scanning 126M 2-column rows instead of 126M multi-column rows.
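The NOT NULL check is easy to demonstrate on a toy table (a sketch; the demo names are made up):
CREATE TABLE demo (x INT);
INSERT INTO demo VALUES (1), (NULL), (3);
SELECT COUNT(*) AS all_rows,   -- 3: counts every row
       COUNT(x) AS x_not_null  -- 2: skips the row where x IS NULL
FROM demo;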
Related
According to my research, when ordering by the primary key (or by any other indexed column), the query should run without an explicit sort.
I also found a blog where this behavior was demonstrated on different databases, one of them being Oracle.
However, in my tests this was not true. What could be the reason? Bad install options? A broken index? (Although I ruled the latter out by creating a completely new table.)
the query:
select * from auftrag_test order by auftragkey
the execution plan:
Plan Hash Value : 505195503
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 167910 | 44496150 | 11494 | 00:00:01 |
| 1 | SORT ORDER BY | | 167910 | 44496150 | 11494 | 00:00:01 |
| 2 | TABLE ACCESS FULL | AUFTRAG_TEST | 167910 | 44496150 | 1908 | 00:00:01 |
-----------------------------------------------------------------------------------
create table AUFTRAG_TEST
(
auftragkey VARCHAR2(40) not null,
...
);
alter table AUFTRAG_TEST
add constraint PK_AUFTRAG_TEST primary key (AUFTRAGKEY);
You might ask yourself why the primary key would be a varchar field. Well, this is something our bosses have decided. (Actually, we put in stringified GUIDs.)
The blog I found:
http://use-the-index-luke.com/sql/sorting-grouping/indexed-order-by
P.S.: I think I found the problem. The following select does NOT produce a sort step:
select *
from auftrag_test
where auftragkey = 'aabbccddeeffaabbccddeeffaabbccdd'
order by auftragkey
So, apparently, it only works if you filter against the index with an equality predicate, which wouldn't be very helpful at all.
P.P.S.: MS-SQL seems to do just what I expected: if I order by the primary key (with a non-clustered unique index), the sort is "free", both in the execution plan and in query time.
You should be aware that scanning a big table through an index might take hours, versus a full table scan on the same table that takes only a few minutes.
In this case, traversing the index in order to save an O(n*log(n)) sort operation doesn't sound like a good idea.
A heap table will yield a sort operation.
An IOT (index-organized table, also known as a "clustered index") is already sorted.
create table t_heap (i int primary key,j int);
create table t_iot (i int primary key,j int) organization index;
select * from t_heap order by i;
select * from t_iot order by i;
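To see the difference, compare the plans, e.g. with autotrace in SQL*Plus (a sketch, assuming SQL*Plus is available):
SET AUTOTRACE TRACEONLY EXPLAIN
select * from t_heap order by i;  -- expect a SORT ORDER BY step
select * from t_iot order by i;   -- expect an INDEX FULL SCAN and no sort step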
We have around 8 million records in a table with around 50 columns. We need to see a few records very quickly, so we are using the FIRST_ROWS(10) hint for this purpose, and it works amazingly fast.
SELECT /*+ FIRST_ROWS(10) */ ABC.view_ABC.ID, ABC.view_ABC.VERSION, ABC.view_ABC.M_UUID, ABC.view_ABC.M_PROCESS_NAME FROM ABC.view_ABC
However, when we add an ORDER BY clause, e.g. on creationtime (which is almost unique for each row in that table), the query takes ages to return all columns.
SELECT /*+ FIRST_ROWS(10) */ ABC.view_ABC.ID, ABC.view_ABC.VERSION, ABC.view_ABC.M_UUID, ABC.view_ABC.M_PROCESS_NAME FROM ABC.view_ABC ORDER BY ABC.view_ABC.CREATIONTIME DESC
One thing I noticed: if we ORDER BY a column like VERSION, which has the same value for multiple rows, the results come back faster.
This ORDER BY is not efficient for any unique column, such as the ID column in this table.
Another thing worth considering: if we reduce the number of columns fetched, e.g. 3 columns instead of 50, the results come back somewhat faster.
P.S. Statistics are gathered on this table weekly, but data is pushed hourly. Only INSERT statements run against this table; there are no DELETE or UPDATE queries.
Also, there is a simple view created on this table, and the above queries are run against that view.
Without an order by clause the optimiser can perform whatever join operations your view is hiding and start returning data as soon as it has some. The hint changes how it accesses the underlying tables so that it, for example, does a nested loop join instead of a merge join, which allows it to find the first matching rows quickly but might be less efficient overall for returning all of the data. Your hint is telling the optimiser that you want it to prioritise the speed of the first batch of rows returned over the speed of the entire query.
When you add the order by clause then all of the data has to be found before it can be ordered. All of the join conditions have to be met and all of the nested loops/merges etc. completed, and then the entire result set has to be sorted into the order you specified, before any rows can be returned.
If the column you're ordering by is indexed and that index is being used (or can be used) by the optimiser to identify rows in the driving table then it's possible it could be incorporating that into the sort, but you can't rely on that as the optimiser can change the plan as the data and statistics change.
You may find it useful to look at the execution plans of your various queries, with and without the hint, to see what the optimiser is doing in each case, including where in the chain of steps it is doing the sort operation, and the types of joins it is doing.
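For example, hedging on the view name from the question, the two plans can be inspected like this:
EXPLAIN PLAN FOR
SELECT /*+ FIRST_ROWS(10) */ ID, VERSION, M_UUID, M_PROCESS_NAME
FROM ABC.view_ABC
ORDER BY CREATIONTIME DESC;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- Repeat without the ORDER BY (and without the hint) to see where the
-- SORT ORDER BY step appears and how the join methods change.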
There is a multi-column index that includes CREATION_TIME, but somehow the Oracle optimizer was not using it, most likely because CREATION_TIME is not the leading column of that index, so an index scan cannot deliver rows in the requested order.
However, the same table had another column (TERMINATION_TIME) with an index of its own, so we ran the same query with that indexed column in the ORDER BY clause.
Below is the explain plan for the first query, with CREATION_TIME (part of the multi-column index) in the ORDER BY clause.
-------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7406K| 473M| | 308K (1)| 01:01:40 |
| 1 | SORT ORDER BY | | 7406K| 473M| 567M| 308K (1)| 01:01:40 |
| 2 | TABLE ACCESS FULL| Table_ABC | 7406K| 473M| | 189K (1)| 00:37:57 |
-------------------------------------------------------------------------------------------------------------
And this one is with TERMINATION_TIME in the ORDER BY clause.
--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 670 | 10 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TABLE_ABC | 7406K| 473M| 10 (0)| 00:00:01 |
| 2 | INDEX FULL SCAN DESCENDING| XGN620150305000000 | 10 | | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------------
As you can see, there is a clear difference in the cost, the rows involved, the use of temporary space (not even needed in the latter case), and finally the time.
Now the query response time is much better.
Thanks.
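If ordering by CREATION_TIME is still needed, a dedicated single-column index on it, mirroring the one that already existed on TERMINATION_TIME, might enable the same INDEX FULL SCAN DESCENDING plan (a sketch; the index name is made up, and the column spelling follows the query above):
CREATE INDEX IDX_ABC_CREATIONTIME ON Table_ABC (CREATIONTIME);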
I have 3 large tables (10k, 10k, and 100M rows) and am trying to do a simple count on a join of them, where all the joined columns are indexed. Why does the COUNT(*) take so long, and how can I speed it up (without triggers and a running summary)?
mysql> describe SELECT COUNT(*) FROM `metaward_alias` INNER JOIN `metaward_achiever` ON (`metaward_alias`.`id` = `metaward_achiever`.`alias_id`) INNER JOIN `metaward_award` ON (`metaward_achiever`.`award_id` = `metaward_award`.`id`) WHERE `metaward_award`.`owner_id` = 8;
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+------+-------------+
| 1 | SIMPLE | metaward_award | ref | PRIMARY,metaward_award_owner_id | metaward_award_owner_id | 4 | const | 1552 | |
| 1 | SIMPLE | metaward_achiever | ref | metaward_achiever_award_id,metaward_achiever_alias_id | metaward_achiever_award_id | 4 | paul.metaward_award.id | 2498 | |
| 1 | SIMPLE | metaward_alias | eq_ref | PRIMARY | PRIMARY | 4 | paul.metaward_achiever.alias_id | 1 | Using index |
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+------+-------------+
3 rows in set (0.00 sec)
But actually running the query takes about 10 minutes, and I'm on MyISAM, so the tables are fully locked for that duration.
I guess the reason is that you do a huge join over three tables (without applying the where clause first, the result would be 10k * 10k * 100M = 10^16 rows). Try to reorder the joins: for example, start with metaward_award, then join only metaward_achiever and see how long that takes, then try to plug in metaward_alias, possibly using a subquery to force your preferred evaluation order, as in the sketch below.
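In MySQL, the documented way to pin a join order is the STRAIGHT_JOIN modifier, so an experiment along these lines might be worth trying (a sketch using the tables from the question; the aliases are made up):
SELECT STRAIGHT_JOIN COUNT(*)
FROM metaward_award aw
INNER JOIN metaward_achiever ach ON ach.award_id = aw.id
INNER JOIN metaward_alias al ON al.id = ach.alias_id
WHERE aw.owner_id = 8;
-- STRAIGHT_JOIN makes MySQL join the tables exactly in the order listed,
-- so the WHERE clause on metaward_award is applied first.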
If that does not help you might have to denormalize your data, for example by storing number of aliases for particular metaward_achiever. Then you'd get rid of one join altogether. Maybe you can even cache the sums for metaward_award, depending on how and how often is your data updated.
Other thing that might help is getting all your database content into RAM :-)
Make sure you have indexes on:
metaward_alias id
metaward_achiever alias_id
metaward_achiever award_id
metaward_award id
metaward_award owner_id
I'm sure many people will also suggest to count on a specific column, but in MySql this doesn't make any difference for your query.
UPDATE:
You could also try to set the condition on the main table instead of one of the joined tables. That would give you the same result, but it could be faster (I don't know how clever MySql is):
SELECT COUNT(*) FROM `metaward_award`
INNER JOIN `metaward_achiever`
ON (`metaward_achiever`.`award_id` = `metaward_award`.`id`)
INNER JOIN `metaward_alias`
ON (`metaward_alias`.`id` = `metaward_achiever`.`alias_id`)
WHERE `metaward_award`.`owner_id` = 8
10 minutes is way too long for that query. I think you must have a really small key cache. You can get its size in bytes with:
SELECT @@key_buffer_size
First off, you should run ANALYZE TABLE or OPTIMIZE TABLE. They'll sort your index and can slightly improve the performance.
You should also see if you can use more compact types for your columns. For instance, if you're not going to have more than 16 millions owners or awards or aliases, you can change your INT columns into MEDIUMINT (UNSIGNED, of course). Perhaps even SMALLINT in some cases? That will reduce your index footprint and you'll fit more of it in the cache.
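Concretely, the maintenance and the column shrinking could look something like this (a sketch; the exact column types are assumptions about the schema):
ANALYZE TABLE metaward_achiever;
OPTIMIZE TABLE metaward_achiever;
-- Shrink the index footprint where the value ranges allow it:
ALTER TABLE metaward_achiever
    MODIFY award_id MEDIUMINT UNSIGNED NOT NULL,  -- assumes well under 16M awards
    MODIFY alias_id INT UNSIGNED NOT NULL;        -- 100M aliases still need INT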
Query:
select id,
title
from posts
where id in (23,24,60,19,21,32,43,49,9,11,17,34,37,39,46,52,55)
Explain plan:
mysql> explain select id,title from posts where id in (23,24,60,19,21,32,43,49,9,11,17,34,37,39,46,52,55);
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | posts | ALL | PRIMARY | NULL | NULL | NULL | 30 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.05 sec)
id is the primary key of posts table.
Other than adding other indexes, such as
a clustered index on id
a covering index which includes id [first] and the other columns from the SELECT clause
there seems to be little to be done...
In fact, even if such indexes were available, MySQL may decide to do a table scan, as is the case here (the "ALL" type). The reason may be that the table has relatively few rows (compared with the estimated number of rows the query would return), so it is more efficient to read the table sequentially, discarding non-matching rows as it goes, rather than hopping all over the place through an index indirection.
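As an experiment, you can make the optimizer use the primary key anyway and compare both the plan and the actual timing (a sketch):
select id, title from posts FORCE INDEX (PRIMARY)
where id in (23,24,60,19,21,32,43,49,9,11,17,34,37,39,46,52,55);
-- If this is no faster than the plain query, the table scan really was the cheaper choice.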
I don't see any problem with it. If you need to select against a list, then "IN" is the right way to do it. You're not selecting unnecessary information, and the thing you're selecting against is a key, which is presumably indexed.
I have a query which is starting to cause some concern in my application. I'm trying to understand this EXPLAIN statement better to understand where indexes are potentially missing:
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | s | ref | client_id | client_id | 4 | const | 102 | Using temporary; Using filesort |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | www_foo_com.s.user_id | 1 | |
| 1 | SIMPLE | a | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | h | ref | email_id | email_id | 4 | www_foo_com.a.email_id | 10 | Using index |
| 1 | SIMPLE | ph | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | em | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | pho | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | c | ALL | userfield | NULL | NULL | NULL | 1108 | |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
8 rows in set (0.00 sec)
I'm trying to understand where my indexes are missing by reading this EXPLAIN statement. Is it fair to say that one can understand how to optimize this query without seeing the query at all, just by looking at the results of the EXPLAIN?
It appears that the ALL scan against the 'c' table is the Achilles heel. What's the best way to index this based on constant values, as recommended in MySQL's documentation?
Note, I also added an index to userfield in the cdr table and that hasn't done much good either.
Thanks.
--- edit ---
Here's the query, sorry -- don't know why I neglected to include it the first pass through.
SELECT s.`session_id` id,
DATE_FORMAT(s.`created`,'%m/%d/%Y') date,
u.`name`,
COUNT(DISTINCT c.id) calls,
COUNT(DISTINCT h.id) emails,
SEC_TO_TIME(MAX(DISTINCT c.duration)) duration,
(COUNT(DISTINCT em.email_id) + COUNT(DISTINCT pho.phone_id) > 0) status
FROM `fa_sessions` s
LEFT JOIN `fa_users` u ON s.`user_id`=u.`user_id`
LEFT JOIN `fa_email_aliases` a ON a.session_id = s.session_id
LEFT JOIN `fa_email_headers` h ON h.email_id = a.email_id
LEFT JOIN `fa_phones` ph ON ph.session_id = s.session_id
LEFT JOIN `fa_email_aliases` em ON em.session_id = s.session_id AND em.status = 1
LEFT JOIN `fa_phones` pho ON pho.session_id = s.session_id AND pho.status = 1
LEFT JOIN `cdr` c ON c.userfield = ph.phone_id
WHERE s.`partner_id`=1
GROUP BY s.`session_id`
I assume you've looked here to get more info about what it is telling you. Obviously the ALL means it's going through all of the rows. The Using temporary and Using filesort are talked about on that page; you might want to look at that.
From the page:
Using filesort
MySQL must do an extra pass to find
out how to retrieve the rows in sorted
order. The sort is done by going
through all rows according to the join
type and storing the sort key and
pointer to the row for all rows that
match the WHERE clause. The keys then
are sorted and the rows are retrieved
in sorted order. See Section 7.2.12,
“ORDER BY Optimization”.
Using temporary
To resolve the query, MySQL needs to
create a temporary table to hold the
result. This typically happens if the
query contains GROUP BY and ORDER BY
clauses that list columns differently.
I agree that seeing the query might help to figure things out better.
My advice?
Break the query into 2 and use a temporary table in the middle (see the sketch after the reasoning below).
Reasoning
The problem appears to be that table c is being table scanned, and that this is the last table in the query. This is probably bad: if you have a table scan, you want to do it at the start of the query, so it's only done once.
I'm not a MySQL guru, but I have spent a whole lot of time optimising queries on other DBs. It looks to me like the optimiser hasn't worked out that it should start with c and work backwards.
The other thing that strikes me is that there are probably too many tables in the join. Most optimisers struggle with more than 4 tables (because the number of possible table orders is growing exponentially, so checking them all becomes impractical).
Having too many tables in a join is the root of 90% of performance problems I have seen.
Give it a go, and let us know how you get on. If it doesn't help, please post the SQL, table definitions and indexes, and I'll take another look.
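A minimal sketch of the "break it into 2" idea, using the tables from the posted query (the temporary table name is made up):
CREATE TEMPORARY TABLE tmp_calls AS
SELECT ph.session_id, c.id AS call_id, c.duration
FROM fa_phones ph
JOIN cdr c ON c.userfield = ph.phone_id;

-- Then LEFT JOIN tmp_calls instead of cdr in the main query, so the scan
-- of cdr happens exactly once, up front, instead of once per outer row.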
General Tips
Feel free to look at this answer I gave on general performance tips.
A great resource
MySQL Documentation for EXPLAIN
Well looking at the query would be useful, but there's at least one thing that's obviously worth looking into - the final line shows the ALL type for that part of the query, which is generally not great to see. If the suggested possible key (userfield) makes sense as an added index to table c, it might be worth adding it and seeing if that reduces the rows returned for that table in the search.
Query Plan
The query plan we might hope the optimiser would choose would be something like:
start with sessions where partner_id=1 , possibly using an index on partner_id,
join sessions to users, using an index on user_id
join sessions to phones, where status=1, using an index on session_id and possibly status
join sessions to phones again using an index on session_id and phone_id **
join phones to cdr using an index on userfield
join sessions to email_aliases, where status=1 using an index on session_id and possibly status
join sessions to email_aliases again using an index on session_id and email_id **
join email_aliases to email_headers using an index on email_id
** By putting 2 fields in these indexes, we enable the optimiser to join to the table using session_id and immediately find the associated phone_id or email_id without having to read the underlying table. This technique saves us a read, and can save a lot of time.
Indexes I would create:
The above query plan suggests these indexes:
fa_sessions ( partner_id, session_id )
fa_users ( user_id )
fa_email_aliases ( session_id, email_id )
fa_email_headers ( email_id )
fa_email_aliases ( session_id, status )
fa_phones ( session_id, status, phone_id )
cdr ( userfield )
Notes
You will almost certainly get acceptable performance without creating all of these.
If any of the tables are small ( less than 100 rows ) then it's probably not worth creating an index.
fa_email_aliases might work with ( session_id, status, email_id ), depending on how the optimiser works.
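As concrete DDL, the list above translates to something like this (a sketch; the index names are made up, and fa_users(user_id) is presumably already its primary key):
CREATE INDEX idx_sess_partner  ON fa_sessions (partner_id, session_id);
CREATE INDEX idx_alias_sess    ON fa_email_aliases (session_id, email_id);
CREATE INDEX idx_alias_status  ON fa_email_aliases (session_id, status);
CREATE INDEX idx_head_email    ON fa_email_headers (email_id);
CREATE INDEX idx_phone_sess    ON fa_phones (session_id, status, phone_id);
CREATE INDEX idx_cdr_userfield ON cdr (userfield);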