SQL script runs VERY slowly with small change - sql

I am relatively new to SQL. I have a script that used to run very quickly (<0.5 seconds) but runs very slowly (>120 seconds) if I add one change - and I can't see why this change makes such a difference. Any help would be hugely appreciated!
This is the script and it runs quickly if I do NOT include "tt2.bulk_cnt
" in line 26:
with bulksum1 as
(
select t1.membercode,
t1.schemecode,
t1.transdate
from mina_raw2 t1
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
group by t1.membercode,
t1.schemecode,
t1.transdate
),
bulksum2 as
(
select t1.schemecode,
t1.transdate,
count(*) as bulk_cnt
from bulksum1 t1
group by t1.schemecode,
t1.transdate
having count(*) >= 10
),
results as
(
select t1.*, tt2.bulk_cnt
from mina_raw2 t1
inner join bulksum2 tt2
on t1.schemecode = tt2.schemecode and t1.transdate = tt2.transdate
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
)
select * from results
EDIT: I apologise for not putting enough detail in here previously - although I can use basic SQL code, I am a complete novice when it comes to databases.
Database: Oracle (I'm not sure which version, sorry)
Execution plans:
QUICK query:
Plan hash value: 1712123489
---------------------------------------------
| Id | Operation | Name |
---------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN | |
| 2 | VIEW | |
| 3 | FILTER | |
| 4 | HASH GROUP BY | |
| 5 | VIEW | VM_NWVW_0 |
| 6 | HASH GROUP BY | |
| 7 | TABLE ACCESS FULL| MINA_RAW2 |
| 8 | TABLE ACCESS FULL | MINA_RAW2 |
---------------------------------------------
SLOW query:
Plan hash value: 1298175315
--------------------------------------------
| Id | Operation | Name |
--------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | FILTER | |
| 2 | HASH GROUP BY | |
| 3 | HASH JOIN | |
| 4 | VIEW | VM_NWVW_0 |
| 5 | HASH GROUP BY | |
| 6 | TABLE ACCESS FULL| MINA_RAW2 |
| 7 | TABLE ACCESS FULL | MINA_RAW2 |
--------------------------------------------

A few observations, and then some things to do:
1) More information is needed. In particular, how many rows are there in the MINA_RAW2 table, what indexes exist on this table, and when was the last time it was analyzed? To determine the answers to these questions, run:
SELECT COUNT(*) FROM MINA_RAW2;
SELECT TABLE_NAME, LAST_ANALYZED, NUM_ROWS
FROM USER_TABLES
WHERE TABLE_NAME = 'MINA_RAW2';
From looking at the plan output it looks like the database is doing two FULL SCANs on MINA_RAW2 - it would be nice if this could be reduced to no more than one, and hopefully none. It's always tough to tell without very detailed information about the data in the table, but at first blush it appears that an index on TRANSACTIONTYPE might be helpful. If such an index doesn't exist you might want to consider adding it.
2) Assuming that the statistics are out-of-date (as in, old, nonexistent, or a significant amount of data (> 10%) has been added, deleted, or updated since the last analysis) run the following:
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(owner => 'YOUR-SCHEMA-NAME',
table_name => 'MINA_RAW2');
END;
substituting the correct schema name for "YOUR-SCHEMA-NAME" above. Remember to capitalize the schema name! If you don't know if you should or shouldn't gather statistics, err on the side of caution and do it. It shouldn't take much time.
3) Re-try your existing query after updating the table statistics. I think there's a fair chance that having up-to-date statistics in the database will solve your issues. If not:
4) This query is doing a GROUP BY on the results of a GROUP BY. This doesn't appear to be necessary as the initial GROUP BY doesn't do any grouping - instead, it appears this is being done to get the unique combinations of MEMBERCODE, SCHEMECODE, and TRANSDATE so that the count of the members by scheme and date can be determined. I think the whole query can be simplified to:
WITH cteWORKING_TRANS AS (SELECT *
FROM MINA_RAW2
WHERE TRANSACTIONTYPE IN ('RSP','SP','UNTV',
'ASTR','CN','TVIN',
'UCON','TRAS')),
cteBULKSUM AS (SELECT a.SCHEMECODE,
a.TRANSDATE,
COUNT(*) AS BULK_CNT
FROM (SELECT DISTINCT MEMBERCODE,
SCHEMECODE,
TRANSDATE
FROM cteWORKING_TRANS) a
GROUP BY a.SCHEMECODE,
a.TRANSDATE)
SELECT t.*, b.BULK_CNT
FROM cteWORKING_TRANS t
INNER JOIN cteBULKSUM b
ON b.SCHEMECODE = t.SCHEMECODE AND
b.TRANSDATE = t.TRANSDATE

I managed to remove an unnecessary subquery, but this syntax with distinct inside count may not work outside of PostgreSQL or may not be the desired result. I know I've certainly used it there.
select t1.*, tt2.bulk_cnt
from mina_raw2 t1
inner join (select t2.schemecode,
t2.transdate,
count(DISTINCT membercode) as bulk_cnt
from mina_raw2 t2
where t2.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
group by t2.schemecode,
t2.transdate
having count(DISTINCT membercode) >= 10) tt2
on t1.schemecode = tt2.schemecode and t1.transdate = tt2.transdate
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
When you use those with queries, instead of subqueries when you don't need to, you're kneecapping the query optimizer.

Related

SQL structure for multiple queries of the same table (using window function, case, join)

I have a complex production SQL question. It's actually PrestoDB Hadoop, but conforms to common SQL.
I've got to get a bunch of metrics from a table, a little like this (sorry if the tables are mangled):
+--------+--------------+------------------+
| device | install_date | customer_account |
+--------+--------------+------------------+
| dev 1 | 1-Jun | 123 |
| dev 1 | 4-Jun | 456 |
| dev 1 | 10-Jun | 789 |
| dev 2 | 20-Jun | 50 |
| dev 2 | 25-Jun | 60 |
+--------+--------------+------------------+
I need something like this:
+--------+------------------+-------------------------+
| device | max_install_date | previous_account_number |
+--------+------------------+-------------------------+
| dev 1 | 10-Jun | 456 |
| dev 2 | 25-Jun | 50 |
+--------+------------------+-------------------------+
I can do two separate queries to get max install date and previous account number, like this:
select device, max(install_date) as max_install_date
from (select [a whole bunch of stuff], dense_rank() over(partition by device order by [something_else]) rnk
from some_table a
)
But how do you combine them into one query to get one line for each device? I have rank, with statements, case statements, and one join. They all work individually but I'm banging my head to understand how to combine them all.
I need to understand how to structure big queries.
ps. any good books you recommend on advanced SQL for data analysis? I see a bunch on Amazon but nothing that tells me how to construct big queries like this. I'm not a DBA. I'm a data guy.
Thanks.
You can use correlated subquery approach :
select t.*
from table t
where install_date = (select max(install_date) from table t1 where t1.device = t.device);
This assumes install_date has resonbale date format.
I think you want:
select t.*
from (select t.*, max(install_date) over (partition by device) as max_install_date,
lag(customer_account) over (partition by device order by install-date) as prev_customer_account
from t
) t
where install_date = max_install_date;

Best Way to Join One Column on Columns From Two Other Tables

I have a schema like the following in Oracle
Section:
+--------+----------+
| sec_ID | group_ID |
+--------+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
+--------+----------+
Section_to_Item:
+--------+---------+
| sec_ID | item_ID |
+--------+---------+
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
+--------+---------+
Item:
+---------+------+
| item_ID | data |
+---------+------+
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
+---------+------+
Item_Version:
+---------+----------+--------+
| item_ID | start_ID | end_ID |
+---------+----------+--------+
| 1 | 1 | |
| 2 | 1 | 3 |
| 3 | 2 | |
| 4 | 1 | 2 |
+---------+----------+--------+
Section_to_Item has FK into Section and Item on the *_ID columns.
Item_version is indexed on item_ID but has no FK to Item.item_ID (ran out of space in the snapshot group).
I have code that receives a list of version IDs and I want to get all items in sections in a given group that are valid for at least one of the versions passed in. If an item has no end_ID, it's valid for anything starting with start_ID. If it has an end_id, it's valid for anything up until (not including) end_ID.
What I currently have is:
SELECT Items.data
FROM Section, Section_to_Items, Item, Item_Version
WHERE Section.group_ID = 1
AND Section_to_Item.sec_ID = Section.sec_ID
AND Item.item_ID = Section_to_Item.item_ID
AND Item.item_ID = Item_Version.item_ID
AND exists (
SELECT *
FROM (
SELECT 2 AS version FROM DUAL
UNION ALL SELECT 3 AS version FROM DUAL
) passed_versions
WHERE Item_Version.start_ID <= passed_versions.version
AND (Item_Version.end_ID IS NULL or Item_Version.end_ID > passed_version.version)
)
Note that the UNION ALL statement is dynamically generated from the list of passed in versions.
This query currently does a cartesian join and is very slow.
For some reason, if I change the query to join
AND Item_Version.item_ID = Section_to_Item.item_ID
which is not a FK, the query does not do the cartesian join and is much faster.
A) Can anyone explain why this is?
B) Is this the right way to be joining this sequence of tables (I feel weird about joining Item.item_ID to two different tables)
C) Is this the right way to get versions between start_ID and end_ID?
Edit
Same query with inner join syntax:
SELECT Items.data
FROM Item
INNER JOIN Section_to_Items ON Section_to_Items.item_ID = Item.item_ID
INNER JOIN Section ON Section.sec_ID = Section_to_Items.sec_ID
INNER JOIN Item_Version ON Item_Version.item_ID = Item_.item_ID
WHERE Section.group_ID = 1
AND exists (
SELECT *
FROM (
SELECT 2 AS version FROM DUAL
UNION ALL SELECT 3 AS version FROM DUAL
) passed_versions
WHERE Item_Version.start_ID <= passed_versions.version
AND (Item_Version.end_ID IS NULL or Item_Version.end_ID > passed_version.version)
)
Note that in this case the performance difference comes from joining on Item_Version first and then joining Section_to_Item on Item_Version.item_ID.
In terms of table size, Section_to_Item, Item, and Item_Version should be similar (1000s) while Section should be small.
Edit
I just found out that apparently, the schema has no FKs. The FKs specified in the schema configuration files are ignored. They're just there for documentation. So there's no difference between joining on a FK column or not. That being said, by changing the joins into a cascade of SELECT INs, I'm able to avoid joining the entire Item table twice. I don't love the resulting query, and I don't really understand the difference, but the stats indicate it's much less work (changes the A-Rows returned from the inner most scan on Section from 656,000 to 488 (it used to be 656k starts returning 1 row, now it's 488 starts returning 1 row)).
Edit
It turned out to be stale statistics - the two queries were equivalent the whole time but with the incomplete statistics, the DB happened to notice the correct plan only in the second instance. After updating statistics, both queries generated the same plan.
I'm not sure if this is the best idea but this seems to avoid the cartesian join:
select data
from Item
where item_ID in (
select item_ID
from Item_Version
where item_ID in (
select item_ID
from Section_to_Item
where sec_ID in (
select sec_ID
from Section
where group_ID = 1
)
)
and exists (
select 1
from (
select 2 as version
from dual
union all
select 3 as version
from dual
) versions
where versions.version >= start_ID
and (end_ID is null or versions.version <)
)
)

Is it possible to do complex SQL queries using Django?

I have the following Script to get a list of calculated index for each day after specific date:
with test_reqs as (
select id_test, date_request, sum(n_requests) as n_req from cdr_test_stats
where
id_test in (2,4) and -- List of Ids included in index calc
date_request >= 20170823 -- Start date (end date -> Last in DB -> Today)
group by id_test, date_request
),
date_reqs as (
select date_request, sum(n_req) as n_req
from test_reqs
group by date_request
),
test_reqs_ratio as (
select H.id_test, H.date_request,
case when D.n_req = 0 then null else H.n_req/D.n_req end as ratio_req
from test_reqs H
inner join date_reqs D
on H.date_request = D.date_request
),
test_reqs_index as (
select HR.*, least(nullif(HA.n_dates_hbalert, 0), 10) as index_hb
from test_reqs_ratio HR
left join cdr_test_alerts_stats HA
on HR.id_test = HA.id_test and HR.date_request = HA.date_request
)
select date_request, 10-sum(ratio_req*index_hb) as index_hb
from test_reqs_index
group by date_request
Result:
---------------------------
| date_request | index_hb |
---------------------------
| 20170904 | 7.5508 |
| 20170905 | 7.6870 |
| 20170825 | 7.4335 |
| 20170901 | 7.7116 |
| 20170824 | 1.6568 |
| 20170823 | 0.0000 |
| 20170903 | 5.1850 |
| 20170830 | 0.0000 |
| 20170828 | 0.0000 |
---------------------------
The problem is that I want to get the same in Django and avoid to execute the raw query using the cursor.
Many thanks for any suggestion.
Without going deep into the specifics of your query, I'd say the Django ORM has enough expressiveness to handle most problems, but generally, would require you to redesign the query from the ground up. You would have to use subqueries and joins instead of the CTE's, and you might end up with a solution that does some of the work in Python land instead of the DB.
Taking this into account the answer is: depends. Your functional requirements, such as performance and data size play a role.
Another solution worth considering is declaring your SQL query as a view, and at least in the case of Postgres, use something like django-pgviews to query it with Django ORM almost as if it were a model.

JOIN two tables, but only include data from first table in first instance of each unique record

Title might be confusing.
I have a table of Cases, and each Case can contain many Tasks. To achieve a different workflow for each Task, I have different tables such as Case_Emails, Case_Calls, Case_Chats, etc...
I want to build a Query that will eventually be exported to Excel. In this query, I want to list out each Task, and the Tasks are already joined together via a UNION in another table using a common format. For each task in the Query, I want only the first Task associated with a case to include the details from Cases table. Example below:
+----+---------+------------+-------------+-------------+-------------+
| id | Case ID | Agent Name | Task Info 1 | Task Info 2 | Task Info 3 |
+----+---------+------------+-------------+-------------+-------------+
| 1 | 4000000 | Some Name | Detailstuff | Stuffdetail | Thingsyo |
| 2 | | | Detailstuff | Stuffdetail | Thingsyo |
| 3 | | | Detailstuff | Stuffdetail | Thingsyo |
| 4 | 4000003 | Some Name | Detailstuff | Stuffdetail | Thingsyo |
| 5 | | | Detailstuff | Stuffdetail | Thingsyo |
| 6 | 4000006 | Some Name | Detailstuff | Stuffdetail | Thingsyo |
+----+---------+------------+-------------+-------------+-------------+
My original approach was attempting a LEFT JOIN on Case ID, but I couldn't figure out how to filter the data out from the extra rows.
This would be much simpler if Access supported the ROW_NUMBER function. It doesn't, but you can sort of simulate it with a correlated subquery using the Tasks table (this assumes that each task has a unique numeric ID). This basically assigns a row number to each task, partitioned by the CaseID. Then you can just conditionally display the CaseID and AgentName where RowNum = 1.
SELECT Switch(RowNum = 1, CaseID) as Case,
Switch(RowNum = 1, AgentName) as Agent,
TaskName
FROM (
SELECT c.CaseID,
c.AgentName,
t.TaskName,
(select count(*)
from Tasks t2
where t2.CaseID = c.CaseID and t2.ID <= t.ID) as RowNum
FROM Cases c
INNER JOIN Tasks t ON c.CaseID = t.CaseID
order by c.CaseID, t.TaskName
)
You didn't post your table structure, so I'm not sure this will work for you as-is, but maybe you can adapt it.
No matter what when you join you will have duplicate values. to remove the duplicates either put in a Distinct in your select or a Group by after your filters. This should resolve the duplicates in you query for task info 1,2,3.
Found out that I can name my tables in the query like so:
FROM Case_Calls Calls
With this other name, I was able to filter based on a sub query:
IIF( Calls.[ID] <> (select top 1 [ID] from Case_Calls where [Case ID] = Calls.[Case ID]), '', Cases.[Creator]) As [Case Creator]
This solution gives me the results that I want :) It's rather ugly SQL, and difficult to parse when I'm dealing with dozens of columns, but it gets the job done!
I'm still curious if there is a better solution...

MySQL, reading this EXPLAIN statement

I have a query which is starting to cause some concern in my application. I'm trying to understand this EXPLAIN statement better to understand where indexes are potentially missing:
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | s | ref | client_id | client_id | 4 | const | 102 | Using temporary; Using filesort |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | www_foo_com.s.user_id | 1 | |
| 1 | SIMPLE | a | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | h | ref | email_id | email_id | 4 | www_foo_com.a.email_id | 10 | Using index |
| 1 | SIMPLE | ph | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | em | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | pho | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | c | ALL | userfield | NULL | NULL | NULL | 1108 | |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
8 rows in set (0.00 sec)
I'm trying to understand where my indexes are missing by reading this EXPLAIN statement. Is it fair to say that one can understand how to optimize this query without seeing the query at all and just look at the results of the EXPLAIN?
It appears that the ALL scan against the 'c' table is the achilles heel. What's the best way to index this based on constant values as recommended on MySQL's documentation? |
Note, I also added an index to userfield in the cdr table and that hasn't done much good either.
Thanks.
--- edit ---
Here's the query, sorry -- don't know why I neglected to include it the first pass through.
SELECT s.`session_id` id,
DATE_FORMAT(s.`created`,'%m/%d/%Y') date,
u.`name`,
COUNT(DISTINCT c.id) calls,
COUNT(DISTINCT h.id) emails,
SEC_TO_TIME(MAX(DISTINCT c.duration)) duration,
(COUNT(DISTINCT em.email_id) + COUNT(DISTINCT pho.phone_id) > 0) status
FROM `fa_sessions` s
LEFT JOIN `fa_users` u ON s.`user_id`=u.`user_id`
LEFT JOIN `fa_email_aliases` a ON a.session_id = s.session_id
LEFT JOIN `fa_email_headers` h ON h.email_id = a.email_id
LEFT JOIN `fa_phones` ph ON ph.session_id = s.session_id
LEFT JOIN `fa_email_aliases` em ON em.session_id = s.session_id AND em.status = 1
LEFT JOIN `fa_phones` pho ON pho.session_id = s.session_id AND pho.status = 1
LEFT JOIN `cdr` c ON c.userfield = ph.phone_id
WHERE s.`partner_id`=1
GROUP BY s.`session_id`
I assume you've looked here to get more info about what it is telling you. Obviously the ALL means its going through all of them. The using temporary and using filesort are talked about on that page. You might want to look at that.
From the page:
Using filesort
MySQL must do an extra pass to find
out how to retrieve the rows in sorted
order. The sort is done by going
through all rows according to the join
type and storing the sort key and
pointer to the row for all rows that
match the WHERE clause. The keys then
are sorted and the rows are retrieved
in sorted order. See Section 7.2.12,
“ORDER BY Optimization”.
Using temporary
To resolve the query, MySQL needs to
create a temporary table to hold the
result. This typically happens if the
query contains GROUP BY and ORDER BY
clauses that list columns differently.
I agree that seeing the query might help to figure things out better.
My advice?
Break the query into 2 and use a temporary table in the middle.
Reasonning
The problem appears to be that table c is being table scanned, and that this is the last table in the query. This is probably bad: if you have a table scan, you want to do it at the start of the query, so it's only done once.
I'm not a MySQL guru, but I have spent a whole lot of time optimising queries on other DBs. It looks to me like the optimiser hasn't worked out that it should start with c and work backwards.
The other thing that strikes me is that there are probably too many tables in the join. Most optimisers struggle with more than 4 tables (because the number of possible table orders is growing exponentially, so checking them all becomes impractical).
Having too many tables in a join is the root of 90% of performance problems I have seen.
Give it a go, and let us know how you get on. If it doesn't help, please post the SQL, table definitions and indeces, and I'll take another look.
General Tips
Feel free to look at this answer I gave on general performance tips.
A great resource
MySQL Documentation for EXPLAIN
Well looking at the query would be useful, but there's at least one thing that's obviously worth looking into - the final line shows the ALL type for that part of the query, which is generally not great to see. If the suggested possible key (userfield) makes sense as an added index to table c, it might be worth adding it and seeing if that reduces the rows returned for that table in the search.
Query Plan
The query plan we might hope the optimiser would choose would be something like:
start with sessions where partner_id=1 , possibly using an index on partner_id,
join sessions to users, using an index on user_id
join sessions to phones, where status=1, using an index on session_id and possibly status
join sessions to phones again using an index on session_id and phone_id **
join phones to cdr using an index on userfield
join sessions to email_aliases, where status=1 using an index on session_id and possibly status
join sessions to email_aliases again using an index on session_id and email_id **
join email_aliases to email_headers using an index on email_id
** by putting 2 fields in these indeces, we enable the optimiser to join to the table using session_id, and immediately find out the associated phone_id or email_id without having to read the underlying table. This technique saves us a read, and can save a lot of time.
Indeces I would create:
The above query plan suggests these indeces:
fa_sessions ( partner_id, session_id )
fa_users ( user_id )
fa_email_aliases ( session_id, email_id )
fa_email_headers ( email_id )
fa_email_aliases ( session_id, status )
fa_phones ( session_id, status, phone_id )
cdr ( userfield )
Notes
You will almost certainly get acceptable performance without creating all of these.
If any of the tables are small ( less than 100 rows ) then it's probably not worth creating an index.
fa_email_aliases might work with ( session_id, status, email_id ), depending on how the optimiser works.