SQL query feels inefficient - how can I improve it?

I'm using the SQL code below in SQLite to get a list of trades from a trades table and then combine it with the total portfolio value on each day from a holdings table that has position and price data for a set of instruments.
The holdings table has about 150,000 records and the trades table has about 1,700.
SELECT t.*,
       (SELECT p.adjclose
        FROM prices AS p
        WHERE t.instrument = p.instrument
          AND p.date = '2013-02-28 00:00:00') AS close,
       su.mv AS mv
FROM trades AS t
LEFT OUTER JOIN
    (SELECT h.date, SUM(h.price * h.position) AS mv
     FROM holdings AS h
     WHERE h.portfolio = 'usequity'
       AND h.date >= '2013-01-11 00:00:00'
       AND h.date <= '2013-02-28 00:00:00'
     GROUP BY h.date) AS su
    ON t.date = su.date
WHERE t.portname = 'usequity'
  AND t.date >= '2013-01-11 00:00:00'
  AND t.date <= '2013-02-28 00:00:00';
Running the SQL code returns
[2014-12-01 19:21:00] 123 row(s) retrieved starting from 1 in 572/627 ms
Which seems really slow for a small dataset. Both tables are indexed on instrument and date.
I don't know how to index the table su on the fly so I'm not sure how to improve this code. Any help greatly appreciated.
EDIT
explain query plan shows
selectid,order,from,detail
1,0,0,"SEARCH TABLE holdings AS h USING AUTOMATIC COVERING INDEX (portfolio=?) (~7 rows)"
1,0,0,"USE TEMP B-TREE FOR GROUP BY"
0,0,0,"SCAN TABLE trades AS t (~11111 rows)"
0,1,1,"SEARCH SUBQUERY 1 AS su USING AUTOMATIC COVERING INDEX (date=?) (~3 rows)"
0,0,0,"EXECUTE CORRELATED SCALAR SUBQUERY 2"
2,0,0,"SEARCH TABLE prices AS p USING INDEX p1 (instrument=? AND date=?) (~9 rows)"

The lookup on prices is fast (it's using the index for both columns).
You could create a temporary table for the su subquery and add an index to that, but the AUTOMATIC INDEX shows that the database is already doing this.
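A minimal sketch of that temporary-table approach, just for reference (the table and index names here are made up):
CREATE TEMP TABLE su AS
SELECT h.date, SUM(h.price * h.position) AS mv
FROM holdings AS h
WHERE h.portfolio = 'usequity'
  AND h.date >= '2013-01-11 00:00:00'
  AND h.date <= '2013-02-28 00:00:00'
GROUP BY h.date;

-- index the materialized per-date market values so the join can seek on date
CREATE INDEX su_date ON su(date);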
The lookup on holdings is done with a temporary index; you should create an explicit index for that. (An index on both portfolio and date would be even more efficient.)
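Something along these lines, assuming the column names from your query (the index name is made up):
-- covers the portfolio filter and the date grouping/join in one index
CREATE INDEX holdings_portfolio_date ON holdings(portfolio, date);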
You could avoid the need for a temporary table by looking up the values from holdings dynamically, like you're already doing for the closing price (but this might not be an improvement if there are many trades on the same day):
SELECT t.*,
       (SELECT p.adjclose
        FROM prices AS p
        WHERE p.instrument = t.instrument
          AND p.date = '2013-02-28 00:00:00'
       ) AS close,
       (SELECT SUM(h.price * h.position)
        FROM holdings AS h
        WHERE h.portfolio = 'usequity'
          AND h.date = t.date
       ) AS mv
FROM trades AS t
WHERE t.portname = 'usequity'
  AND t.date BETWEEN '2013-01-11 00:00:00'
                 AND '2013-02-28 00:00:00';

Related

Oracle SQL: Indexes not being used

I have the following indexes
create index i_payment_amount ON payment(amount);
create index i_customer_createdate ON customer(createdate);
and the following query
select c.createdate, c.firstname, c.lastname, round(sum(p.amount)) as spentmoney
from customer c
join rental r
on c.customerid = r.customerid
join payment p
on p.rentalid = r.rentalid
where c.createdate > date '2019-06-01' + 30
or (select round(sum(pp.amount)) from payment pp
join rental rr
on rr.rentalid = pp.rentalid
where rr.rentalid = r.rentalid) < 50
group by c.firstname, c.lastname,c.createdate
order by c.firstname, c.lastname;
The query is meant to find the customers who registered in the month before 2019-06-01, and also the customers who did not spend more than $50. I wanted to optimize it with the help of indexes.
I created B-tree indexes, and on the first try I just wanted the indexes to even appear in the query plan, but they didn't.
I also couldn't create a function-based index for payment, because function-based indexes don't support aggregate functions such as SUM.
Are there any suggestions for creating a proper index that will optimize the query, or for getting the existing ones used at all?
I don't see any obvious candidates for indexing in the query.
Say there are 2 million rows in customer, and 1 million of them have createdate > date '2019-06-01' + 30 (which can be simplified to createdate > date '2019-07-01'). Using an index to find those 1 million rows and then visiting the customer table a million times is going to be a lot more I/O than just full-scanning the table once. Range-partitioning customer might help, if you are licensed for it and depending on the data distribution.
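Purely as a hedged sketch (the column list is abbreviated and the partition bounds are invented), range-partitioning customer by createdate might look like this:
CREATE TABLE customer_part (
    customerid  NUMBER,
    firstname   VARCHAR2(50),
    lastname    VARCHAR2(50),
    createdate  DATE
)
PARTITION BY RANGE (createdate) (
    -- partition names and boundaries are illustrative only
    PARTITION p_pre_2019h2 VALUES LESS THAN (DATE '2019-07-01'),
    PARTITION p_2019h2     VALUES LESS THAN (DATE '2020-01-01'),
    PARTITION p_max        VALUES LESS THAN (MAXVALUE)
);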
Possibly an index on payment(rentalid, amount) could be treated by the optimiser as a skinny table which would be more efficient to full-scan than the payment table itself, since those are the only two columns you need from the table, making the join to payment more efficient. However, that is only one of three tables involved in the query so I wouldn't expect a massive improvement.
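If you want to try that, it would be something like this (the index name is made up):
-- a skinny copy of payment holding only the two columns the query needs
CREATE INDEX i_payment_rentalid_amount ON payment(rentalid, amount);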
I notice in your question you mention that you want the customers who registered in the month before 2019-06-01, but I don't see that condition in your query. If the registration date is c.createdate then perhaps
where c.createdate > date '2019-06-01' + 30
should be something more like
where c.createdate between date '2019-06-01' and date '2019-07-01'
in which case an index on c.createdate starts to look more useful, but that's a different query.
Would including a HAVING clause do any good? You already have all those values, so you'd avoid the subquery entirely. Something like this:
SELECT c.createdate,
c.firstname,
c.lastname,
ROUND (SUM (p.amount)) AS spentmoney
FROM customer c
JOIN rental r ON c.customerid = r.customerid
JOIN payment p ON p.rentalid = r.rentalid
GROUP BY c.firstname, c.lastname, c.createdate
HAVING SUM (p.amount) < 50
OR c.createdate > DATE '2019-06-01' + 30
ORDER BY c.firstname, c.lastname;

SQL repetitive code in where clause, how to insert the whole where into variable

I write a lot of queries with the same WHERE clause. I wish I could create a variable that I could insert into each query.
My query:
select distinct order_external_status
from analytics.dwh_orders_details dod
where dod.merchant_id = 7797
  and order_type = 'pre_live'
  and order_date >= '2019-09-10' and order_date <= '2019-09-24';
Next query with the same WHERE:
select dod.order_id,
       oc.*
from analytics.dwh_orders_details dod
left join analytics.dwh_oc_all_details oc
  on dod.order_id = oc.order_id
where dod.merchant_id = 7797
  and order_type = 'pre_live'
  and order_date >= '2019-09-10' and order_date <= '2019-09-24';
I can have 10 to 15 queries like that in a day. It would be nice if I could put the WHERE clause in a variable and write it just once. For now we use Redshift; we will move to Snowflake soon, if it matters.
The DB doesn't let us create views or temp tables...
You can create a view and use the view in your queries:
create view v_myview as
select dod.*
from analytics.dwh_orders_details dod
where dod.merchant_id = 7797 and
dod.order_type = 'pre_live' and
dod.order_date >= '2019-09-10' and
dod.order_date <= '2019-09-24';
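Then the queries above reduce to something like this (reusing the view wherever the filter applies):
select distinct order_external_status
from v_myview;

select mv.order_id,
       oc.*
from v_myview mv
left join analytics.dwh_oc_all_details oc
  on mv.order_id = oc.order_id;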

Optimize Query Join Between Dates

3 tables
WorkRecordfact - has workdate(date) - ~300000 rows
EmployeeStatus - Startdate(date), EndDate(date), PositionID - 450 Rows
Positions - PositionID, PositionCode - 10 rows
Queries that look for data in WorkRecordFact filtering by position are taking a long time. Basic sample Query
SELECT workrecordfact.*
FROM workrecordfact
INNER JOIN Employeestatus on
Employeestatus.EmployeeID = workrecordfact.EmployeeID and
employeestatus.startdate <= workrecordfact.workdate and
employeestatus.enddate >= workrecordfact.workdate
INNER JOIN Positions on
employeestatus.PositionID = positions.PositionID
Where workrecordfact.workdate >= '20180601'
and workrecordfact.workdate <= '20180930'
and PositionCode = 'CSR'
Workrecordfact has a clustered index on Workdate
Employeestatus has 4 indexes
EmployeeID
EmployeeID+ StartDate
EmployeeID+ EndDate
EmployeeID+ StartDate + EndDate
In the query statistics I'm seeing a lot of 500% elements, starting with a Clustered Index Seek on the WorkRecordFact index. Some numbers that stand out:
Estimated Number of Rows 250
Estimated Number of Rows to be Read 667
Number of Executions 381
Number of Rows Read 49525952??!?!?
Actual Number of Rows 112018
Results are taking long enough that the .net app sending the query is receiving a timeout in some cases.
I've rebuilt/reorganized fragmented indexes and refreshed statistics, but that hasn't solved the issue.
Any Ideas?
UPDATE: It seems the query is running quite well from SSMS and only timing out from the application. Dates are passed in as parameters, BTW; currently investigating possible issues with parameter sniffing :-/
Since workrecordfact has a lot more records than EmployeeStatus and this is a date-range query, indexing it well is always a pain.
First, drop the indexes on EmployeeStatus; in my opinion they are of no use with regard to this query:
EmployeeID
EmployeeID + StartDate
EmployeeID + EndDate
EmployeeID + StartDate + EndDate
Then create one more index, for EmployeeID, on the fact table (sketched below).
I think that should help.
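A hedged sketch of that extra fact-table index (the index name is made up):
CREATE INDEX IX_workrecordfact_EmployeeID ON workrecordfact(EmployeeID);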
The name of the game is to limit IO on the workrecordfact table. The way the query and indexing are currently set up, we're scanning 100% of the work in the third quarter, then later filtering that down to just the work of the CSRs. I wonder if the CSR criteria might be more selective and get us to a lower number of rows read?
This query is pretty zippy, right?
SELECT es.EmployeeID, es.StartDate, es.EndDate
FROM Positions p
JOIN Employeestatus es
ON p.PositionID = es.PositionID
WHERE PositionCode = 'CSR'
'CSR' probably matches exactly one Positions row. Assuming positions do approximately equal work, we probably only need to read 10% of the Q3 portion of the work table.
I'm thinking about bringing in the work facts afterwards, like this:
SELECT w.*
FROM
(
SELECT es.EmployeeID, es.StartDate, es.EndDate
FROM Positions p
JOIN Employeestatus es
ON p.PositionID = es.PositionID
WHERE PositionCode = 'CSR'
) es2
JOIN workrecordfact w
ON es2.EmployeeID = w.EmployeeID
AND es2.startdate <= w.workdate AND w.workdate <= es2.enddate
WHERE
'2018-06-01' <= w.workdate AND w.workdate <= '2018-09-30'
This query would be best supported by this index:
CREATE INDEX WorkRecordFact_EmployeeId_WorkDate ON WorkRecordFact(EmployeeId, WorkDate)
Moving the conditional choice of workdate logically earlier into the query might be helpful:
SELECT w.*
FROM
(
SELECT es.EmployeeID,
CASE WHEN es.StartDate <= '2018-06-01' THEN '2018-06-01' ELSE es.StartDate END as StartDate,
CASE WHEN '2018-09-30' <= es.EndDate THEN '2018-09-30' ELSE es.EndDate END as EndDate
FROM Positions p
JOIN Employeestatus es
ON p.PositionID = es.PositionID
WHERE PositionCode = 'CSR'
) es2
JOIN workrecordfact w
ON es2.EmployeeID = w.EmployeeID
AND es2.startdate <= w.workdate AND w.workdate <= es2.enddate

Slow running query, Postgresql

I have a very slow query (30+ minutes or more) that I think can be sped up with more efficient coding. Below is the code and the resulting query plan. I am looking for ways to speed up this query, which performs several joins on large tables.
drop table if exists totalshad;
create temporary table totalshad as
select pricedate, hour, sum(cast(price as numeric)) as totalprice
from pjm.rtcons
where rtcons.pricedate >= '2017-12-01'
-- and rtcons.pricedate <= '2018-01-23'
group by pricedate, hour
order by pricedate, hour;
-----------------------------
drop table if exists percshad;
create temporary table percshad as
select totalshad.pricedate, totalshad.hour, facility,
       round(sum(cast(price as numeric)), 2) as cons_shad,
       round(sum(cast(totalprice as numeric)), 2) as total_shad,
       round(cast(price/totalprice as numeric), 4) as per_shad
from totalshad
join pjm.rtcons on
     rtcons.pricedate = totalshad.pricedate
     and rtcons.hour = totalshad.hour
     and facility = 'ETOWANDA-NMESHOPP ETL 1057 A 115 KV'
where totalprice <> 0 and totalshad.pricedate > '2017-12-01'
group by totalshad.pricedate, totalshad.hour, facility, (price/totalprice)
order by per_shad desc
limit 5;
EXPLAIN
select facility, percshad.pricedate, percshad.hour, per_shad,
       minmcc.rtmcc, minnode.nodename, maxmcc.rtmcc, maxnode.nodename
from percshad
join pjm.prices minmcc on
     minmcc.pricedate = percshad.pricedate
     and minmcc.hour = percshad.hour
     and minmcc.rtmcc = (select min(rtmcc) from pjm.prices
                         where pricedate = percshad.pricedate and hour = percshad.hour)
join pjm.nodes minnode on
     minnode.node_id = minmcc.node_id
join pjm.prices maxmcc on
     maxmcc.pricedate = percshad.pricedate
     and maxmcc.hour = percshad.hour
     and maxmcc.rtmcc = (select max(rtmcc) from pjm.prices
                         where pricedate = percshad.pricedate and hour = percshad.hour)
join pjm.nodes maxnode on
     maxnode.node_id = maxmcc.node_id
order by per_shad desc
limit 5
And here is the EXPLAIN output:
UPDATE: I have now simplified my code down to the following. But as can be seen from the EXPLAIN, it still takes forever to find the node_id in the last select statement.
drop table if exists totalshad;
create temporary table totalshad as
select pricedate, hour, sum(cast(price as numeric)) as totalprice
from pjm.rtcons
where rtcons.pricedate >= '2017-12-01'
-- and rtcons.pricedate <= '2018-01-23'
group by pricedate, hour
order by pricedate, hour;
-----------------------------
drop table if exists percshad;
create temporary table percshad as
select totalshad.pricedate, totalshad.hour, facility,
       round(sum(cast(price as numeric)), 2) as cons_shad,
       round(sum(cast(totalprice as numeric)), 2) as total_shad,
       round(cast(price/totalprice as numeric), 4) as per_shad
from totalshad
join pjm.rtcons on
     rtcons.pricedate = totalshad.pricedate
     and rtcons.hour = totalshad.hour
     and facility = 'ETOWANDA-NMESHOPP ETL 1057 A 115 KV'
where totalprice <> 0 and totalshad.pricedate > '2017-12-01'
group by totalshad.pricedate, totalshad.hour, facility, (price/totalprice)
order by per_shad desc
limit 5;
drop table if exists mincong;
create temporary table mincong as
select pricedate, hour, min(rtmcc) as rtmcc
from pjm.prices JOIN percshad USING (pricedate, hour)
group by pricedate, hour;
EXPLAIN select distinct on (pricedate, hour) prices.node_id from mincong
JOIN pjm.prices USING (pricedate, hour, rtmcc)
group by pricedate, hour, node_id
The problem is the subselects in the join condition; they have to be executed for every row joined.
If you cannot get rid of them, try to create an index that supports the subselects as well as possible:
CREATE INDEX ON pjm.prices(pricedate, hour, rtmcc);
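If you do want to try eliminating them, one possibility, sketched here only as an idea and not taken from the original query, is to pre-aggregate the per-hour extremes once in a derived table and join to it instead of using correlated subselects:
SELECT p.pricedate, p.hour, p.rtmcc, p.node_id
FROM pjm.prices p
JOIN (
    -- one row per (pricedate, hour) with its min and max rtmcc
    SELECT pricedate, hour,
           min(rtmcc) AS min_rtmcc,
           max(rtmcc) AS max_rtmcc
    FROM pjm.prices
    GROUP BY pricedate, hour
) x ON x.pricedate = p.pricedate
   AND x.hour = p.hour
   AND (p.rtmcc = x.min_rtmcc OR p.rtmcc = x.max_rtmcc);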

Teradata spool space issue on running a sub query with Count

I am using the query below to calculate business days between two dates for all the order numbers. Business days are already available in the Teradata table Common_WorkingCalendar. But I'm also facing a spool space issue when I execute the query, even though I have ample space available in my data lab. I need to optimize the query; any inputs are appreciated.
SELECT
tx."OrderNumber",
(SELECT COUNT(1) FROM Common_WorkingCalendar
WHERE CalDate between Cast(tx."TimeStamp" as date) and Cast(mf.ShipDate as date)) as BusDays
from StoreFulfillment ff
inner join StoreTransmission tx
on tx.OrderNumber = ff.OrderNumber
inner join StoreMerchandiseFulfillment mf
on mf.OrderNumber = ff.OrderNumber
This is a very inefficient way to get this count; it results in a product join.
The recommended approach is to add a sequential number to your calendar that increases only on business days (calculated using SUM(CASE WHEN businessDay THEN 1 ELSE 0 END) OVER (ORDER BY CalDate ROWS UNBOUNDED PRECEDING)); then it's two joins, one for the start date and one for the end date.
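Spelled out, the cumulative counter could look like this; it assumes a calendar with one row per calendar day and a businessDay indicator column (the original table apparently holds business days only, which is why the query further below switches to ROW_NUMBER):
SELECT CalDate,
       -- running count of business days up to and including CalDate
       SUM(CASE WHEN businessDay = 1 THEN 1 ELSE 0 END)
           OVER (ORDER BY CalDate ROWS UNBOUNDED PRECEDING) AS DayNo
FROM Common_WorkingCalendar;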
If this calculation is needed a lot, you'd better add a new column; otherwise you can do it on the fly:
WITH cte AS
(
SELECT CalDate,
-- as this table only contains business days you can use this instead
row_number() Over (ORDER BY CalDate) AS DayNo
FROM Common_WorkingCalendar
)
SELECT
tx."OrderNumber",
to_dt.DayNo - from_dt.DayNo AS BusDays
FROM StoreFulfillment ff
INNER JOIN StoreTransmission tx
ON tx.OrderNumber = ff.OrderNumber
INNER JOIN StoreMerchandiseFulfillment mf
ON mf.OrderNumber = ff.OrderNumber
JOIN cte AS from_dt
ON from_dt.CalDate = Cast(tx."TimeStamp" AS DATE)
JOIN cte AS to_dt
ON to_dt.CalDate = Cast(mf.ShipDate AS DATE)