SQL to find duplicate entries (within a group)
I have a small problem and I'm not sure what would be the best way to fix it, as I only have limited access to the database (Oracle) itself.
In our table EVENT we have about 160k entries. Each event has a GROUPID, and a normal entry has exactly 5 rows with the same GROUPID. Due to a bug we currently get some duplicate entries (duplicates, so 10 rows instead of 5, each with a different EVENTID; the count may change, which is why the check is <> 5). We need to find all the entries of these groups.
Due to limited access to the database we cannot use a temporary table, nor can we add an index to the GROUPID column to make it faster.
We can get the GROUPIDs with this query, but we would need a second query to fetch the actual rows:
select A."GROUPID"
from "EVENT" A
group by A."GROUPID"
having count(A."GROUPID") <> 5
One solution would be a subselect:
select *
from "EVENT" A
where A."GROUPID" IN (
select B."GROUPID"
from "EVENT" B
group by B."GROUPID"
having count(B."GROUPID") <> 5
)
Without an index on GROUPID, and with 160k entries, this takes much too long.
I tried thinking about a join that could handle this, but haven't found a good solution so far.
Can anybody suggest a good solution for this?
Small edit:
We don't have 100% duplicates here, as each entry still has a unique EVENTID, and the GROUPID is not unique either (that's why we need the GROUP BY) - or maybe I'm just missing an easy solution :)
A small example of the data (I don't want to delete it, just find it):
EVENTID | GROUPID | TYPEID
--------------------------
123456  | 123     | 12
123457  | 123     | 145
123458  | 123     | 2612
123459  | 123     | 41
123460  | 123     | 238
234567  | 123     | 12
234568  | 123     | 145
234569  | 123     | 2612
234570  | 123     | 41
234571  | 123     | 238
The table has some more columns, like a timestamp etc., but as you can see, everything is identical besides the EVENTID.
We will run the query more often during testing, to find the bug and to check whether it happens again.
A classic problem for analytic queries to solve:
select eventid,
groupid,
typeid
from (
select eventid,
groupid,
typeid,
count(*) over (partition by groupid) count_by_group_id
from EVENT
)
where count_by_group_id <> 5
You can get the answer with a join instead of a subquery:
select
a.*
from
event a
inner join
(select groupid
from event
group by groupid
having count(*) <> 5) b
on a.groupid = b.groupid
This is a fairly common way of obtaining all the information from the rows in a group.
Like your suggested answer and the other responses, this will run a lot faster with an index on groupid. It's up to the DBA to balance the benefit of making your query run a lot faster against the cost of maintaining yet another index.
If the DBA decides against the index, make sure the appropriate people understand that it's the index strategy, and not the way you wrote the query, that is slowing things down.
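For reference, the index being discussed would be something along these lines (the index name is just illustrative):
create index event_groupid_ix on event (groupid);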
How long does that SQL actually take? You are only going to run it once, I presume, having fixed the bug that caused the corruption in the first place? I just set up a test case like this:
SQL> create table my_objects as
2 select object_name, ceil(rownum/5) groupid, rpad('x',500,'x') filler
3 from all_objects;
Table created.
SQL> select count(*) from my_objects;
COUNT(*)
----------
83782
SQL> select * from my_objects where groupid in (
2 select groupid from my_objects
3 group by groupid
4 having count(*) <> 5
5 );
OBJECT_NAME GROUPID FILLER
------------------------------ ---------- --------------------------------
XYZ 16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
YYYY 16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed: 00:00:01.67
Less than 2 seconds. OK, my table has half as many rows as yours, but 160K isn't huge. I added the filler column to make the table take up some disk space. The AUTOTRACE execution plan was:
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 389 | 112K| 14029 (2)|
|* 1 | HASH JOIN | | 389 | 112K| 14029 (2)|
| 2 | VIEW | VW_NSO_1 | 94424 | 1198K| 6570 (2)|
|* 3 | FILTER | | | | |
| 4 | HASH GROUP BY | | 1 | 1198K| 6570 (2)|
| 5 | TABLE ACCESS FULL| MY_OBJECTS | 94424 | 1198K| 6504 (1)|
| 6 | TABLE ACCESS FULL | MY_OBJECTS | 94424 | 25M| 6506 (1)|
-------------------------------------------------------------------------
If your DBAs won't add an index to make this faster, ask them what they suggest you do (that's what they're paid for, after all). Presumably you have a business case why you need this information in which case your immediate management should be on your side.
Perhaps you could ask your DBAs to duplicate the data into a database where you could add an index.
From a SQL perspective I think you've already answered your own question. The approach you've described (i.e. using the sub-select) is fine, and I'd be surprised if any other way of writing the query differed vastly in performance.
160K records doesn't seem like a lot to me. I could understand if you were unhappy with the performance of that query if it was going into a piece of application code, but from the sounds of it you're just using it as part of a data cleansing exercise (and so I would expect you to be a little more tolerant of its performance).
Even without any supporting index, it's still just two full table scans on 160K rows, which, frankly, I'd expect to complete in some sort of vaguely reasonable time.
Talk to your db administrators. They've helped create the problem, so let them be part of the solution.
/EDIT/ In the meantime, run the query you have and find out how long it takes, rather than guessing. Even better, run it with SET AUTOTRACE ON and post the results here; then we might be able to help you refine it somewhat.
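A minimal sketch of that in SQL*Plus, using the subselect query from the question:
SQL> set timing on
SQL> set autotrace traceonly
SQL> select * from "EVENT" A
  2  where A."GROUPID" in (
  3    select B."GROUPID" from "EVENT" B
  4    group by B."GROUPID"
  5    having count(B."GROUPID") <> 5
  6  );
The traceonly option suppresses the row output, so you get just the execution plan and statistics, plus the elapsed time.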
Does this do what you want, and does it offer better performance? (I just thought I'd throw it in as a suggestion.)
select *
from group g
where (select count(*) from event e where g.groupid = e.groupid) <> 5
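Note that this presumes a separate table of groups, which the question doesn't mention; if only the EVENT table exists, a sketch of the same correlated-count idea would be:
select *
from event g
where (select count(*) from event e where e.groupid = g.groupid) <> 5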
How about an analytic:
SELECT * FROM (
SELECT eventid, groupid, typeid, COUNT(groupid) OVER (PARTITION BY groupid) group_count
FROM event
)
WHERE group_count <> 5
Details:
MariaDB: Server version: 10.2.10-MariaDB MariaDB Server
The DB table, trans_tbl, uses the Aria DB engine
Table is somewhat large: 126,006,123 rows
Server is not at all large: AWS t3 micro w/attached 30GB EBS
I applied indexes to this DB table as follows:
A primary key: evt_id
Another index on the column I want to group by: transaction_type
3 Related Questions:
Why is the transaction_type index ignored when I perform the following?
SELECT COUNT(evt_id), transaction_type FROM trans_tbl GROUP BY transaction_type
If I look at the output from EXPLAIN, I see:
MariaDB [my_db]> EXPLAIN SELECT COUNT(evt_id), transaction_type FROM trans_tbl GROUP BY transaction_type;
+------+-------------+-----------+------+---------------+------+---------+------+-----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+------+---------------+------+---------+------+-----------+---------------------------------+
| 1 | SIMPLE | trans_tbl | ALL | NULL | NULL | NULL | NULL | 126006123 | Using temporary; Using filesort |
+------+-------------+-----------+------+---------------+------+---------+------+-----------+---------------------------------+
What's confusing me here is that both of the items in the query are indexed. So, shouldn't the index(es) be utilized?
Why is the transaction_type index being used in the following case, where all I've done is switch from COUNT(evt_id) -- the primary key -- to COUNT(1)? (The column is transaction_type; the index generated from it is called TransType.)
MariaDB [my_db]> EXPLAIN SELECT COUNT(1), transaction_type FROM trans_tbl GROUP BY transaction_type;
+------+-------------+-----------+-------+---------------+-----------+---------+------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+-------+---------------+-----------+---------+------+-----------+-------------+
| 1 | SIMPLE | trans_tbl | index | NULL | TransType | 35 | NULL | 126006123 | Using index |
+------+-------------+-----------+-------+---------------+-----------+---------+------+-----------+-------------+
The first query (with COUNT(evt_id)) takes 2 minutes & 40 seconds. Since it is not using the indices, that makes sense. But the second query (with COUNT(1)) takes 50 seconds. This makes no sense to me. Shouldn't it take essentially 0 seconds? Can't it just look at the first and last index value of each group, subtract them, and have the count? It seems to me that it is indeed actually counting. What's the point of an index?
I guess my more important question is: How do I set up my indexes to allow for grouping on that index to return results almost instantaneously, as I would expect?
PS I know the machine is ridiculously underpowered for this size of DB table. But, the table data is not worth throwing a lot of money at it to improve performance. I'd rather just learn to implement Aria indexes properly to gain speed.
COUNT(x) checks x for being NOT NULL before counting the row.
COUNT(*) is the usual pattern for counting rows.
So...
SELECT COUNT(evt_id), transaction_type
FROM trans_tbl GROUP BY transaction_type;
decided to do a table scan, then sort and group.
SELECT COUNT(*), transaction_type
FROM trans_tbl GROUP BY transaction_type;
saw INDEX(transaction_type) and said "goodie; I can just scan that index without having to sort." Note: It still has to scan in order to count. But the INDEX is smaller than the table, so it could be done faster. This is also called a "covering" index since all the columns needed in the SELECT are found in that one INDEX.
COUNT(1) might be treated the same as COUNT(*), I don't know.
INDEX(transaction_type) is essentially identical to INDEX(transaction_type, evt_id). This is because the PRIMARY KEY is silently tacked onto any secondary key in InnoDB.
I don't know why INDEX(transaction_type, evt_id) was not used. Bottom line: Use COUNT(*).
Why not 0 seconds? The counts are not saved anywhere. Anyway, there could be other queries modifying the counts as you run your SELECT. The improvement came from scanning 126M two-column index rows instead of 126M multi-column table rows.
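If the counts really must come back in near-zero time, one option beyond indexing (not part of the answer above; all names here are invented) is a summary table that is refreshed periodically:
-- Hypothetical summary table; REPLACE rewrites it wholesale on each refresh.
CREATE TABLE trans_type_counts (
  transaction_type VARCHAR(35) NOT NULL PRIMARY KEY,
  cnt BIGINT UNSIGNED NOT NULL
);

REPLACE INTO trans_type_counts
SELECT transaction_type, COUNT(*)
FROM trans_tbl
GROUP BY transaction_type;
Reading a count then becomes a primary-key lookup instead of a 126M-row index scan, at the price of the counts being only as fresh as the last refresh.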
The following is a problem which is not well suited to an RDBMS, I think, but that is what I've got to deal with.
I am trying to write a tool to search through logs stored in a database.
Some rows might be:
Time | ID | Object | Description
2012-01-01 13:37 | 1 | 1 | Something happened
2012-01-01 13:39 | 2 | 2 | Something else happened
2012-01-01 13:50 | 3 | 2 | Bad
2012-01-01 14:08 | 4 | 1 | Good
2012-01-01 14:27 | 5 | 1 | Bad
2012-01-01 14:30 | 6 | 2 | Good
Object is a foreign key. In practice, Time will increase with ID but that is not an actual constraint. In reality there are more fields. It's a Postgres database - I'd like to be able to support SQLite as well but am aware this may well be impossible.
Now, I want to be able to run a query for, say, all Bad events that happened to Object 2:
SELECT * FROM table WHERE Object = 2 AND Description = 'Bad';
But it would often be useful to see some lines of context around the results, just as the -C option to grep does when searching through text logs.
For the above query, if we wanted one line of context either side, we would want rows 2 and 6 in addition to row 3.
If the original query returned multiple rows, more context would need to be retrieved.
Notice that the context is not retrieved from the events associated with Object 1; we eliminate only the restriction on the Description.
Also, the order involved, and hence what determines what is adjacent to what, is that induced by the Time field.
This specifies what I want to achieve, but the database concerned is fairly big, at least in comparison to the power of the machine it's running on.
The most often cited solution for getting adjacent rows requires you to run one extra query per row returned by what I'll call the base query; this is no good because that might be thousands of queries.
My current least-bad solution is to run a query to retrieve the IDs of all possible rows that could be context - in the above example, that would be a search for all rows relating to Object 2. Then I get the IDs matching the base query, expand (using the list of all possible IDs) to a list of IDs of rows matching the base query or in its context, and finally retrieve the data for those IDs.
This works, but is inelegant and slow.
It is especially slow when using the tool from a remote computer, as that initial list of IDs can be very large, and retrieving it and then transmitting it over the internet can take an inordinately long time.
Another solution I have tried is using a subquery or view that computes the "buffer sequence" of the rows.
Here's what the table looks like with this field added:
Time | ID | Sequence | Object | Description
2012-01-01 13:37 | 1 | 1 | 1 | Something happened
2012-01-01 13:39 | 2 | 1 | 2 | Something else happened
2012-01-01 13:50 | 3 | 2 | 2 | Bad
2012-01-01 14:08 | 4 | 2 | 1 | Good
2012-01-01 14:27 | 5 | 3 | 1 | Bad
2012-01-01 14:30 | 6 | 3 | 2 | Good
Running the base query on this table then allows you to generate the list of IDs you want by adding or subtracting from the Sequence value.
This eliminates the problem of transferring loads of rows over the wire, but now the database has to run this complicated subquery, and it's unacceptably slow, especially on the first run - given the use-case, queries are sporadic and caching is not very effective.
If I were in charge of the schema I'd probably just store this field there in the database, but I'm not, so any suggestions for improvements are welcome. Thanks!
You should use the ROW_NUMBER windowing function
http://www.postgresql.org/docs/current/static/functions-window.html
Adjacency is an abstract construct and relies on an explicit sort (or PARTITION BY) ... do you mean the row with the preceding timestamp?
Decide what sort of "adjacent" you want, then get ROW_NUMBER over that criterion.
Once you have that, you would just JOIN each row to the rows having ROW_NUMBER +/- 1.
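A sketch of that approach, using the question's columns, an illustrative table name of logs, and one line of context either side (Postgres syntax; recent SQLite also has window functions):
-- Number only the candidate rows (Object 2), then join each 'Bad' hit
-- to its neighbours by row number.
WITH numbered AS (
    SELECT id, "time", object, description,
           ROW_NUMBER() OVER (ORDER BY "time") AS rn
    FROM logs
    WHERE object = 2
)
SELECT DISTINCT n.*
FROM numbered n
JOIN numbered hit
  ON hit.description = 'Bad'
 AND n.rn BETWEEN hit.rn - 1 AND hit.rn + 1
ORDER BY n.rn;
Widen the BETWEEN range for more context.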
You can try this with SQLite:
SELECT DISTINCT t2.*
FROM (SELECT * FROM t WHERE object=2 AND description='Bad') t1
JOIN
(SELECT * FROM t WHERE object=2) t2
ON t1.id = t2.id OR
t2.id IN (SELECT id FROM t WHERE object=2 AND t.time<t1.time ORDER BY t.time DESC LIMIT 1) OR
t2.id IN (SELECT id FROM t WHERE object=2 AND t.time>t1.time ORDER BY t.time ASC LIMIT 1)
ORDER BY t2.time
;
Change the LIMIT values for more context.
I would appreciate some guidance on the following query. We have a list of experiments and their current progress state (for simplicity, I've reduced the statuses to 4 types, but we have 10 different statuses in our data). I need to eventually return a list of the current status of all non-finished experiments.
Given a table exp_status,
Experiment | ID | Status
----------------------------
A | 1 | Starting
A | 2 | Working On It
B | 3 | Starting
B | 4 | Working On It
B | 5 | Finished Type I
C | 6 | Starting
D | 7 | Starting
D | 8 | Working On It
D | 9 | Finished Type II
E | 10 | Starting
E | 11 | Working On It
F | 12 | Starting
G | 13 | Starting
H | 14 | Starting
H | 15 | Working On It
H | 16 | Finished Type II
Desired Result Set:
Experiment | ID | Status
----------------------------
A | 2 | Working On It
C | 6 | Starting
E | 11 | Working On It
F | 12 | Starting
G | 13 | Starting
The most recent ID number will correspond to the most recent status.
Now, the current code I have executes in 150 seconds.
SELECT *
FROM
(SELECT Experiment, ID, Status,
row_number () over (partition by Experiment
order by ID desc) as rn
FROM exp_status)
WHERE rn = 1
AND status NOT LIKE ('Finished%')
The thing is, this code wastes its time. The result set is 45 thousand rows pulled from a table of 3.9 million. This is because most experiments are in the finished status: the code goes through and orders all of them, then only filters out the finished ones at the end. About 95% of the experiments in the table are in the finished phase. I could not figure out how to make the query first pick out all the experiments and statuses where there isn't a 'Finished' row for that experiment. I tried the following but had very slow performance.
SELECT *
FROM exp_status
WHERE experiment NOT IN
(
SELECT experiment
FROM exp_status
WHERE status LIKE ('Finished%')
)
Any help would be appreciated!
Given your requirement, I think your current query with row_number() is one of the most efficient possible. This query takes time not because it has to sort the data, but because there is so much data to read in the first place (the extra CPU time is negligible compared to the fetch time). Furthermore, the first query does a FULL SCAN, which really is the best way to read lots of data.
You need to find a way to read a lot less rows if you want to improve performance. The second query doesn't go in the right direction:
the inner query will likely be a full scan, since the 'Finished' rows will be spread across the whole table and likely represent a big percentage of all rows.
the outer query will also likely be a full scan plus a nice ANTI HASH JOIN, which should be quicker than 45k * (number of status changes per experiment) non-unique index scans.
So the second query seems to have at least twice the number of reads (plus a join).
If you want to really improve performance, I think you will need a change of design.
You could for instance build a table of active experiments and join to this table. You would maintain this table either as a materialized view or with a modification to the code that inserts experiment statuses. You could go further and store the last status in this table. Maintaining this "last status" will likely be an extra burden but this could be justified by the improved performance.
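As an illustration of that design, a sketch (the view name is invented; the refresh strategy is omitted):
-- Hypothetical materialized view of experiments with no 'Finished%' row yet,
-- keeping the id of their most recent status.
create materialized view active_experiments as
select experiment, max(id) as last_id
from exp_status
group by experiment
having count(case when status like 'Finished%' then 1 end) = 0;

-- The report is then a join against a much smaller row source:
select e.experiment, e.id, e.status
from active_experiments a
join exp_status e on e.experiment = a.experiment and e.id = a.last_id;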
Consider partitioning your table by status
www.orafaq.com/wiki/Partitioning_FAQ
You could also create materialized views to avoid having to recalculate your aggregations if these types of queries are frequent.
Could you provide the execution plans of your queries? Without those it is difficult to know the exact reason it is taking so long.
You can improve your first query slightly by using this variant:
select experiment
, max(id) id
, max(status) keep (dense_rank last order by id) status
from exp_status
group by experiment
having max(status) keep (dense_rank last order by id) not like 'Finished%'
If you compare the plans, you'll notice one step less
Regards,
Rob.
I have 3 large tables (10k, 10k, and 100M rows) and am trying to do a simple count on a join of them, where all the joined columns are indexed. Why does the COUNT(*) take so long, and how can I speed it up (without triggers and a running summary)?
mysql> describe SELECT COUNT(*) FROM `metaward_alias` INNER JOIN `metaward_achiever` ON (`metaward_alias`.`id` = `metaward_achiever`.`alias_id`) INNER JOIN `metaward_award` ON (`metaward_achiever`.`award_id` = `metaward_award`.`id`) WHERE `metaward_award`.`owner_id` = 8;
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+------+-------------+
| 1 | SIMPLE | metaward_award | ref | PRIMARY,metaward_award_owner_id | metaward_award_owner_id | 4 | const | 1552 | |
| 1 | SIMPLE | metaward_achiever | ref | metaward_achiever_award_id,metaward_achiever_alias_id | metaward_achiever_award_id | 4 | paul.metaward_award.id | 2498 | |
| 1 | SIMPLE | metaward_alias | eq_ref | PRIMARY | PRIMARY | 4 | paul.metaward_achiever.alias_id | 1 | Using index |
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+------+-------------+
3 rows in set (0.00 sec)
But actually running the query takes about 10 minutes, and I'm on MyISAM so the tables are fully locked down for that duration
I guess the reason is that you do a huge join over three tables (without applying the where clause first, the result would be 10k * 10k * 100M = 10^16 rows). Try to reorder the joins: for example start with metaward_award, then join only metaward_achiever and see how long that takes, then try to plug in metaward_alias, possibly using a subquery to force your preferred evaluation order.
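For example, a sketch of forcing that order with a derived table (same table and column names as the question):
SELECT COUNT(*)
FROM (
    -- evaluate the filtered award/achiever pair first
    SELECT ach.alias_id
    FROM metaward_award aw
    INNER JOIN metaward_achiever ach ON ach.award_id = aw.id
    WHERE aw.owner_id = 8
) small
INNER JOIN metaward_alias al ON al.id = small.alias_id;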
If that does not help you might have to denormalize your data, for example by storing the number of aliases for each metaward_achiever. Then you'd get rid of one join altogether. Maybe you can even cache the sums for metaward_award, depending on how, and how often, your data is updated.
Another thing that might help is getting all your database content into RAM :-)
Make sure you have indexes on:
metaward_alias id
metaward_achiever alias_id
metaward_achiever award_id
metaward_award id
metaward_award owner_id
I'm sure many people will also suggest to count on a specific column, but in MySql this doesn't make any difference for your query.
UPDATE:
You could also try to set the condition on the main table instead of one of the joined tables. That would give you the same result, but it could be faster (I don't know how clever MySql is):
SELECT COUNT(*) FROM `metaward_award`
INNER JOIN `metaward_achiever`
ON (`metaward_achiever`.`award_id` = `metaward_award`.`id`)
INNER JOIN `metaward_alias`
ON (`metaward_alias`.`id` = `metaward_achiever`.`alias_id`)
WHERE `metaward_award`.`owner_id` = 8
10 minutes is way too long for that query. I think you must have a really small key cache. You can get its size in bytes with:
SELECT @@key_buffer_size
First off, you should run ANALYZE TABLE or OPTIMIZE TABLE. They'll sort your index and can slightly improve the performance.
You should also see if you can use more compact types for your columns. For instance, if you're not going to have more than 16 millions owners or awards or aliases, you can change your INT columns into MEDIUMINT (UNSIGNED, of course). Perhaps even SMALLINT in some cases? That will reduce your index footprint and you'll fit more of it in the cache.
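Purely as an illustration of that change (check your actual maximum IDs first; column names are taken from the query above):
-- Hypothetical: shrink 4-byte INT foreign keys to 3-byte MEDIUMINT UNSIGNED
-- (max 16,777,215), cutting the index footprint for those columns by a quarter.
ALTER TABLE metaward_achiever
    MODIFY award_id MEDIUMINT UNSIGNED NOT NULL,
    MODIFY alias_id MEDIUMINT UNSIGNED NOT NULL;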
I have a query which is starting to cause some concern in my application. I'm trying to understand this EXPLAIN statement better to understand where indexes are potentially missing:
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | s | ref | client_id | client_id | 4 | const | 102 | Using temporary; Using filesort |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | www_foo_com.s.user_id | 1 | |
| 1 | SIMPLE | a | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | h | ref | email_id | email_id | 4 | www_foo_com.a.email_id | 10 | Using index |
| 1 | SIMPLE | ph | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | em | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | pho | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | c | ALL | userfield | NULL | NULL | NULL | 1108 | |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
8 rows in set (0.00 sec)
I'm trying to understand where my indexes are missing by reading this EXPLAIN output. Is it fair to say that one can understand how to optimize the query without seeing it at all, just by looking at the results of the EXPLAIN?
It appears that the ALL scan against the c table is the Achilles heel. What's the best way to index this based on constant values, as recommended in MySQL's documentation?
Note, I also added an index to userfield in the cdr table and that hasn't done much good either.
Thanks.
--- edit ---
Here's the query, sorry -- don't know why I neglected to include it the first pass through.
SELECT s.`session_id` id,
DATE_FORMAT(s.`created`,'%m/%d/%Y') date,
u.`name`,
COUNT(DISTINCT c.id) calls,
COUNT(DISTINCT h.id) emails,
SEC_TO_TIME(MAX(DISTINCT c.duration)) duration,
(COUNT(DISTINCT em.email_id) + COUNT(DISTINCT pho.phone_id) > 0) status
FROM `fa_sessions` s
LEFT JOIN `fa_users` u ON s.`user_id`=u.`user_id`
LEFT JOIN `fa_email_aliases` a ON a.session_id = s.session_id
LEFT JOIN `fa_email_headers` h ON h.email_id = a.email_id
LEFT JOIN `fa_phones` ph ON ph.session_id = s.session_id
LEFT JOIN `fa_email_aliases` em ON em.session_id = s.session_id AND em.status = 1
LEFT JOIN `fa_phones` pho ON pho.session_id = s.session_id AND pho.status = 1
LEFT JOIN `cdr` c ON c.userfield = ph.phone_id
WHERE s.`partner_id`=1
GROUP BY s.`session_id`
I assume you've looked here to get more info about what it is telling you. Obviously the ALL means it's going through all the rows. The Using temporary and Using filesort entries are discussed on that page; you might want to look at that.
From the page:
Using filesort
MySQL must do an extra pass to find out how to retrieve the rows in sorted order. The sort is done by going through all rows according to the join type and storing the sort key and pointer to the row for all rows that match the WHERE clause. The keys then are sorted and the rows are retrieved in sorted order. See Section 7.2.12, "ORDER BY Optimization".
Using temporary
To resolve the query, MySQL needs to create a temporary table to hold the result. This typically happens if the query contains GROUP BY and ORDER BY clauses that list columns differently.
I agree that seeing the query might help to figure things out better.
My advice?
Break the query into 2 and use a temporary table in the middle.
Reasoning
The problem appears to be that table c is being table scanned, and that this is the last table in the query. This is probably bad: if you have a table scan, you want to do it at the start of the query, so it's only done once.
I'm not a MySQL guru, but I have spent a whole lot of time optimising queries on other DBs. It looks to me like the optimiser hasn't worked out that it should start with c and work backwards.
The other thing that strikes me is that there are probably too many tables in the join. Most optimisers struggle with more than 4 tables (because the number of possible table orders grows exponentially, so checking them all becomes impractical).
Having too many tables in a join is the root of 90% of the performance problems I have seen.
Give it a go, and let us know how you get on. If it doesn't help, please post the SQL, table definitions and indices, and I'll take another look.
General Tips
Feel free to look at this answer I gave on general performance tips.
A great resource
MySQL Documentation for EXPLAIN
Well looking at the query would be useful, but there's at least one thing that's obviously worth looking into - the final line shows the ALL type for that part of the query, which is generally not great to see. If the suggested possible key (userfield) makes sense as an added index to table c, it might be worth adding it and seeing if that reduces the rows returned for that table in the search.
Query Plan
The query plan we might hope the optimiser would choose would be something like:
start with sessions where partner_id=1, possibly using an index on partner_id,
join sessions to users, using an index on user_id
join sessions to phones, where status=1, using an index on session_id and possibly status
join sessions to phones again using an index on session_id and phone_id **
join phones to cdr using an index on userfield
join sessions to email_aliases, where status=1 using an index on session_id and possibly status
join sessions to email_aliases again using an index on session_id and email_id **
join email_aliases to email_headers using an index on email_id
** by putting 2 fields in these indices, we enable the optimiser to join to the table using session_id, and immediately find the associated phone_id or email_id without having to read the underlying table. This technique saves us a read, and can save a lot of time.
Indices I would create:
The above query plan suggests these indices (expressed as DDL after the notes below):
fa_sessions ( partner_id, session_id )
fa_users ( user_id )
fa_email_aliases ( session_id, email_id )
fa_email_headers ( email_id )
fa_email_aliases ( session_id, status )
fa_phones ( session_id, status, phone_id )
cdr ( userfield )
Notes
You will almost certainly get acceptable performance without creating all of these.
If any of the tables are small (less than 100 rows) then it's probably not worth creating an index.
fa_email_aliases might work with (session_id, status, email_id), depending on how the optimiser works.
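For reference, the suggested indices expressed as DDL (index names are invented; fa_users(user_id) is presumably already its primary key):
CREATE INDEX fa_sessions_partner_ix ON fa_sessions (partner_id, session_id);
CREATE INDEX fa_email_aliases_sess_ix ON fa_email_aliases (session_id, email_id);
CREATE INDEX fa_email_headers_email_ix ON fa_email_headers (email_id);
CREATE INDEX fa_email_aliases_stat_ix ON fa_email_aliases (session_id, status);
CREATE INDEX fa_phones_sess_ix ON fa_phones (session_id, status, phone_id);
CREATE INDEX cdr_userfield_ix ON cdr (userfield);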