How would I optimize this? I need to pull data from different tables and use the results of queries in subsequent queries (SQL)

Please propose an approach I should follow, since I am obviously missing the point. I am new to SQL and still think in terms of MS Access. Here's an example of what I'm trying to do. Don't worry about the detail; I just want to know how I would do this in SQL.
I have the following tables:
Hrs_Worked (staff, date, hrs) (200 000+ records)
Units_Done (team, date, type) (few thousand records)
Rate_Per_Unit (date, team, RatePerUnit) (few thousand records)
Staff_Effort (staff, team, timestamp) (eventually 3 - 4 million records)
SO I need to do the following:
1) Calculate what each team earned by multiplying their units by RatePerUnit, grouping by Team and Date. I create a view teamEarnPerDay:
Create View teamEarnPerDay AS
SELECT
 Units_Done.Date
,Units_Done.TeamID
,Sum([Units_Done] * [Rate_Per_Unit].[Rate]) AS Earn
FROM Units_Done INNER JOIN Rate_Per_Unit
ON (Units_Done.quality = Rate_Per_Unit.quality)
AND (Units_Done.type = Rate_Per_Unit.type)
AND (Units_Done.TeamID = Rate_Per_Unit.TeamID)
AND (Units_Done.Date = Rate_Per_Unit.Date)
GROUP BY
 Units_Done.Date
,Units_Done.TeamID;
2) Count the TEAM's effort by Grouping Staff_Effort on Team and Date and counting records. This table has a few million records.
I have to cast the timestamp as a date....
CREATE View team_effort AS
SELECT
TeamID
,CAST([Timestamp] AS Date) AS TeamDate
,Count(Staff_EffortID) AS TeamEffort
FROM Staff_Effort
GROUP BY
TeamID
,CAST([Timestamp] AS Date);
3) Calculate the Team's Rate_of_pay: (1) Team_earnings / (2) Team_effort
I use the 2 views I created above. This view's performance drops but is still acceptable to me.
Create View team_rate_of_pay AS
SELECT
 tepd.Date
,tepd.TeamID
,tepd.Earn
,te.TeamEffort
,tepd.Earn / te.TeamEffort AS teamRate
FROM teamEarnPerDay tepd
INNER JOIN team_effort te
ON (tepd.Date = te.TeamDate)
AND (tepd.TeamID = te.TeamID);
4) Group Staff_Effort on Date and Staff and count records to get each individual's effort (share of the team effort).
I have to cast the Timestamp as a date....
Create View staff_effort AS
SELECT
TeamID
,StaffID
,CAST([Timestamp] AS Date) as StaffDate
,Count(Staff_EffortID) AS StaffEffort
FROM Staff_Effort
GROUP BY
TeamID
,StaffID
,CAST([Timestamp] AS Date);
5) Calculate Staff earnings by: (4) Staff_Effort x (3) team_rate_of_pay
Multiply the individual's effort by the team rate he worked at on the day.
This one is ridiculously slow. In fact, it's useless.
CREATE View staff_earnings AS
SELECT
staff_effort.StaffDate
,staff_effort.StaffID
,sum(staff_effort.StaffEffort) AS StaffEffort
,sum([StaffEffort]*[TeamRate]) AS StaffEarn
FROM staff_effort INNER JOIN team_rate_of_pay
ON (staff_effort.TeamID = team_rate_of_pay.TeamID)
AND (staff_effort.StaffDate = team_rate_of_pay.Date)
Group By
staff_effort.StaffDate,
staff_effort.StaffID;
So you see what I mean.... I need various results and subsequent queries are dependent on those results.
What I tried to do is write a view for each of the above steps and then just use the view in the next step, and so on. They work fine, but view nr 3 runs slower than the rest, though still acceptably. View nr 5 is just ridiculously slow.
I actually have another view after nr 5 which brings hours worked into play as well, but that just takes forever to produce a few rows.
I want a single line for each staff member, showing what he earned each day calculated as set out above, with his hours worked each day.
I also tried to reduce the number of views by using sub-queries instead but that took even longer.
A little guidance / direction will be much appreciated.
Thanks in advance.
--EDIT--
Taking the query posted in the comments, doing some formatting, adding aliases and a little cleanup, it would look like this:
SELECT epd.CompanyID
,epd.DATE
,epd.TeamID
,epd.Earn
,tb.TeamBags
,epd.Earn / tb.TeamBags AS RateperBag
FROM teamEarnPerDay epd
INNER JOIN teamBags tb ON epd.DATE = tb.TeamDate
AND epd.TeamID = tb.TeamID;

I eventually did 2 things:
1) Managed to reduce the number of nested views by using sub-queries. This did not improve performance by much, but it seems simpler with fewer views.
2) The actual improvement came from using LEFT JOIN instead of INNER JOIN.
The final view ran for 50 minutes with the Inner Join without producing a single row yet.
With LEFT JOIN, it produced all the results in 20 seconds!
Hope this helps someone.
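For anyone following the same path, below is a minimal sketch of how steps 1) to 5), plus the hours worked, could be folded into a single statement with CTEs instead of a chain of views. Column names such as Units, RatePerUnit, Hrs and StaffID are assumptions pieced together from the table list and views above, so treat it as a starting point rather than a drop-in solution.

WITH team_earn AS (
    -- step 1: earnings per team per day
    SELECT ud.[Date], ud.TeamID,
           SUM(ud.Units * rpu.RatePerUnit) AS Earn
    FROM Units_Done ud
    INNER JOIN Rate_Per_Unit rpu
        ON  rpu.TeamID = ud.TeamID
        AND rpu.[Date] = ud.[Date]
        AND rpu.[type] = ud.[type]
    GROUP BY ud.[Date], ud.TeamID
),
effort AS (
    -- steps 2 and 4 in one pass: effort per staff member per team per day
    SELECT TeamID, StaffID,
           CAST([Timestamp] AS date) AS WorkDate,
           COUNT(*) AS StaffEffort
    FROM Staff_Effort
    GROUP BY TeamID, StaffID, CAST([Timestamp] AS date)
),
team_rate AS (
    -- step 3: team rate of pay = team earnings / total team effort
    SELECT e.WorkDate, e.TeamID,
           te.Earn / SUM(e.StaffEffort) AS TeamRate
    FROM effort e
    INNER JOIN team_earn te
        ON  te.TeamID = e.TeamID
        AND te.[Date] = e.WorkDate
    GROUP BY e.WorkDate, e.TeamID, te.Earn
)
-- step 5 plus hours worked: one row per staff member per day
SELECT e.WorkDate, e.StaffID,
       SUM(e.StaffEffort * tr.TeamRate) AS StaffEarn,
       MAX(hw.Hrs) AS HrsWorked
FROM effort e
INNER JOIN team_rate tr
    ON  tr.TeamID   = e.TeamID
    AND tr.WorkDate = e.WorkDate
LEFT JOIN Hrs_Worked hw
    ON  hw.Staff  = e.StaffID
    AND hw.[Date] = e.WorkDate
GROUP BY e.WorkDate, e.StaffID;

Having everything in one statement lets the optimizer see all the joins at once, which is often where a chain of nested views loses out.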

Related

Redshift SQL result set 100s of columns wide efficiency (long to wide)

Scenario: Medical records reporting to state government which requires a pipe delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL that I want to make into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
record_id
,record_item_name
,record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records and the CTE_patient_record_item CTE has about 200 rows per record for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.record_id = cpri002.record_id
AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.record_id = cpri003.record_id
AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.record_id = cpri004.record_id
AND cpri004.record_item_name = 'medication_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id diagnosis_1 diagnosis_2 medication_1 ...
100001 09 9B 88X ...
...and then I use the Redshift UNLOAD command to write to disk pipe delimited.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
Processing output is 2 rows 200 columns wide.
It takes 30 to 40 minutes to process just the 2 records.
You might ask me why I am joining on the item name which is a string. Basically there is no item id, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW; however, it (not surprisingly) also takes 30 to 40 minutes to build.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can add integer IDs and indexes to join on? However, my managers are "new table" averse.
?
I could just stop with the first two CTEs, pull the data down to Python and process it with a pandas DataFrame, which I've done successfully before, but it would be nice to have an efficient query, just use Redshift UNLOAD, and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul I am unable to upvote your answer as I am too new here).
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
The EXPLAIN plan for the original solution is 2700 lines long; for the new solution using conditional aggregation it is 40 lines long.
Thanks guys.
Without some more information it is impossible to know what is going on for sure, but what you are doing is likely not ideal. An EXPLAIN plan and the execution time per step would help a bunch.
What I suspect is hurting you is that you are reading a 97M row table 200 times. This will slow things down but shouldn't take 40 minutes. So I also suspect that record_item_name is not unique per value of record_id. That would lead to row replication and could be expanding the data set many fold. Also, is record_id unique in fact_patient_record? If not, this will also cause row replication. If all of this is large enough to cause significant spill and significant network broadcasting, your 40-minute execution time is very plausible.
There is no need to be joining when all the data is in a single copy of the table. #PhilCoulson is correct that some sort of conditional aggregation could be applied and the decode() syntax could save you space if you don't like case. Several of the above issues that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values for record_item_value for each record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.
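For reference, a minimal sketch of the conditional-aggregation rewrite discussed above, reusing the first CTE and the example item names from the question (everything else about the schema is assumed as described there):

-- One scan of the 97M row item table; each item name becomes a column.
WITH CTE_patient_record AS (
    SELECT record_id
    FROM fact_patient_record
    WHERE update_date = <yesterday>
)
SELECT fpri.record_id,
       MAX(CASE WHEN fpri.record_item_name = 'diagnosis_1'  THEN fpri.record_item_value END) AS diagnosis_1,
       MAX(CASE WHEN fpri.record_item_name = 'diagnosis_2'  THEN fpri.record_item_value END) AS diagnosis_2,
       MAX(CASE WHEN fpri.record_item_name = 'medication_1' THEN fpri.record_item_value END) AS medication_1
       -- ... repeat for the remaining item names ...
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON cpr.record_id = fpri.record_id
GROUP BY fpri.record_id;

If record_item_name turns out not to be unique per record_id, MAX silently picks one value per cell, so the uniqueness question raised above still has to be answered before trusting the output.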

Alternative of IN in SQL to execute it in less time

I wrote a query, but I want to know whether its execution time will be slow or fast. Is there an alternative to IN, since it is used 4 times in such a small query?
Is there a better way to write this query? Moreover, I am confused because someone wants to query it with a large number of ep_et_id values, and the way I wrote it there is nowhere to supply the ep_et_id.
The query below fetches 74133 results in 14.55 secs in a direct Postgres call, but I believe this will take more time if I call it from the webpage.
select lp, ob_id
from ob_in
where ob_id in (select ob_id
from ob_for_e
where ep_et_id in (select ep_et_id
from rrts rep
left join rrts_inf rinf on rep.rrts_id = rinf.rrts_id
where rrts_type in ('FR','IN')
and rinf.status in('V')))
The tables I am using here are ob_in, ob_for_e, rrts , rrts_inf
The location point lp is in table ob_in, and I use several INs to fetch IDs from the other tables so I can apply the final condition on table rrts_inf for type FR or IN and status V.
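One commonly suggested alternative to the nested INs is to express the same filter with plain joins. The sketch below assumes ep_et_id and rrts_type are columns of rrts, as the original query implies; it is not guaranteed to be faster, since Postgres often plans IN (SELECT ...) and a join the same way, so comparing EXPLAIN ANALYZE output for both is the real test.

-- Sketch: same filter as the nested INs, written as joins.
-- DISTINCT compensates for duplicates that the IN subqueries would have absorbed.
select distinct oi.lp, oi.ob_id
from ob_in oi
join ob_for_e ofe  on ofe.ob_id    = oi.ob_id
join rrts rep      on rep.ep_et_id = ofe.ep_et_id
join rrts_inf rinf on rinf.rrts_id = rep.rrts_id
where rep.rrts_type in ('FR', 'IN')
and rinf.status = 'V';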

Order by in subquery behaving differently than native sql query?

So I am honestly a little puzzled by this!
I have a query that returns a set of transactions that contain both repair costs and an odometer reading at the time of repair on the master level. To get an accurate cost-per-mile reading I need a subquery to get both the first meter reading between a starting date and an end date, and the ending meter reading.
(select top 1 wf2.ro_num
from wotrans wotr2
left join wofile wf2
on wotr2.rop_ro_num = wf2.ro_num
and wotr2.rop_fac = wf2.ro_fac
where wotr.rop_veh_num = wotr2.rop_veh_num
and wotr.rop_veh_facility = wotr2.rop_veh_facility
AND ((@sdate = '01/01/1900 00:00:00' and wotr2.rop_tran_date = 0)
OR ([dbo].[udf_RTA_ConvertDateInt](@sdate) <= wotr2.rop_tran_date
AND [dbo].[udf_RTA_ConvertDateInt](@edate) >= wotr2.rop_tran_date))
order by wotr2.rop_tran_date asc) as highMeter
The reason I have the tables aliased as xx2 is because those tables are also used in the main query, and I don't want these to interact with each other except to pull the correct vehicle number and facility.
Basically, when I run the main query it returns a value that is not correct; it returns the one that is second (keep in mind that the first and second have the same date). But when I take the subquery, copy and paste it into its own query and run it, it returns the correct value.
I do have a workaround for this, but I am just curious as to why this is happening. I have searched quite a bit and found not much (other than the fact that people don't like ORDER BYs in subqueries). Talking to one of my friends, who also does quite a bit of SQL scripting, it looks to us as if the subquery is ordering differently than the subquery by itself when you have multiple values that are the same for the ORDER BY (i.e. 10 dates of 08/05/2016).
Any ideas would be helpful!
Like I said I have a work around that works in this one case, but don't know yet if it will work on a larger dataset.
Let me know if you want more code.
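The behaviour described is consistent with ORDER BY not being a stable sort: when several rows tie on rop_tran_date, the engine may return them in any order, and that choice can differ between the standalone query and the correlated subquery inside the larger plan. A sketch of the usual fix is to add a deterministic tie-breaker to the ORDER BY; wf2.ro_num is only an assumed tie-breaker here, and any column that is unique within a date would do.

-- Sketch: break ties on rop_tran_date so TOP 1 is deterministic.
(select top 1 wf2.ro_num
from wotrans wotr2
left join wofile wf2
on wotr2.rop_ro_num = wf2.ro_num
and wotr2.rop_fac = wf2.ro_fac
where wotr.rop_veh_num = wotr2.rop_veh_num
and wotr.rop_veh_facility = wotr2.rop_veh_facility
AND ((@sdate = '01/01/1900 00:00:00' and wotr2.rop_tran_date = 0)
OR ([dbo].[udf_RTA_ConvertDateInt](@sdate) <= wotr2.rop_tran_date
AND [dbo].[udf_RTA_ConvertDateInt](@edate) >= wotr2.rop_tran_date))
order by wotr2.rop_tran_date asc, wf2.ro_num asc) as highMeter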

Parallel Date Sales SQL View

I have a challenge which I can't seem to resolve on my own and now need help!
I have a requirement to show parallel year date sales via SQL and by that I mean if today (20/08/2015) Customer A has purchased products worth 500, I want to know how much Customer A spent on the same day last year (so 20/08/2014).
Here's a SQL fiddle where I've built everything (I reckoned that would be easiest for you guys). I have 3 dimensions (DimProduct, DimDate and DimCustomer), a fact table (FactSales) and a view (VW_ParallelSales) which I've built on top. I have also left a query on the right hand side with what I'm trying to achieve. If you run the query you will see that for Antonio, the SaleAmount on 20140820 was 3500 and if you look at the very bottom of the table, you can see there's one more record for Antonio in the fact table on 20150820 for 6500. So essentially, what I want is to have that 3500 which was sold on 20140820 (which is the parallel year date of 20150820) under the column ParallelSales (which at the moment is showing as NULL).
It all works like a charm if I don't include the ProductKey in the view and have just the CustomerKey (see this fiddle). However, as soon as I add the ProductKey, because there is no exact CustomerKey-ProductKey match that has happened in the past, I'm getting NULLs for ParallelSales (or at least I think that's the reason).
What I want to be able to do is then use the view and join on both DimCustomer and DimProduct and run queries both ways, i.e.:
Query 1: How much did Customer A spend today vs today last year?
Query 2: How much of Product A did we sell today vs today last year?
At the moment, as is, I need to have 2 views for that - one that joins the two sub-queries in the view on CustomerKey and the other one - on ProductKey (and obviously the dates).
I know it's a lot to ask but I do need to get this to work and would appreciate your help immensely! Thanks :)
For customer sales in different years:
SQL Fiddle Demo
SELECT DimCustomer.CustomerName,
VW_Current.Saledate,
VW_Current.ParallelDate,
VW_Current.CurrentSales,
VW_Previous.CurrentSales as ParallelSale
FROM DimCustomer
INNER JOIN VW_ParallelSales VW_Current
ON DimCustomer.CustomerKey = VW_Current.CustomerKey
LEFT JOIN VW_ParallelSales VW_Previous
ON VW_Current.ParallelDate = VW_Previous.Saledate
AND DimCustomer.CustomerKey = VW_Previous.CustomerKey
ORDER BY 1, 2
For ProductKey:
SQL Fiddle Demo
With sales as (
SELECT
DimProduct.ProductKey,
DimProduct.ProductName,
VW_ParallelSales.Saledate,
VW_ParallelSales.ParallelDate,
VW_ParallelSales.CurrentSales,
VW_ParallelSales.ParallelSales
FROM DimProduct INNER JOIN VW_ParallelSales ON DimProduct.ProductKey =
VW_ParallelSales.ProductKey
)
SELECT
s_recent.ProductName,
s_recent.Saledate ThisYear,
s_old.Saledate PreviousYear,
s_recent.CurrentSales CurrentSales,
s_old.CurrentSales ParallelSales
FROM
SALES s_recent
left outer join SALES s_old
on s_recent.saledate = s_old.saledate + 10000
and s_recent.ProductKey = s_old.ProductKey
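A note on the join condition above: s_old.saledate + 10000 works because SaleDate is an integer date key in YYYYMMDD form, so adding 10000 moves it forward exactly one year. If the column were a real DATE instead, a sketch of the equivalent final SELECT (reusing the same sales CTE, SQL Server DATEADD syntax assumed) would be:

SELECT
s_recent.ProductName,
s_recent.Saledate ThisYear,
s_old.Saledate PreviousYear,
s_recent.CurrentSales CurrentSales,
s_old.CurrentSales ParallelSales
FROM
SALES s_recent
left outer join SALES s_old
on s_recent.saledate = DATEADD(year, 1, s_old.saledate)
and s_recent.ProductKey = s_old.ProductKey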

daily difference calculation performance improvement

I need to calculate the daily price difference in percentage. The query I have works but is getting slower every day. The main idea is to calculate the delta with the previous row. The previous row is normally the previous day, but there might sometimes be a day missing. When that happens it needs to take the last day available.
I'm looking for a way to limit the set that I retrieve in the inner query. There are about 20,000 records added per day.
update
price_watches pw
set
min_percent_changed = calc.delta
from
(select
id,
product_id,
calculation_date,
(1 - (price_min / lag(price_min) over (order by product_id, calculation_date))) * 100 as delta
from
price_watches
where
price_min > 0) calc
where
calc.id = pw.id;
This is wrong on many levels.
1.) It looks like you are updating all rows, including old rows that already have their min_percent_changed set and probably shouldn't be updated again.
2.) You are updating even if the new min_percent_changed is the same as the old.
3.) You are updating rows to store a redundant value that could be calculated on the fly rather cheaply (if done right), thereby making the row bigger and more error prone and producing lots of dead row versions, which means a lot of work for vacuum and slowing down everything else.
You shouldn't be doing any of this.
If you need to materialize the daily delta for read performance optimization, I suggest a small additional 1:1 table that can be updated cheaply without messing with the main table, especially if you recalculate the value for every row every time. But it is better to calculate only the new data.
If you really want to recalculate for every row (like your current UPDATE seems to do), make that a MATERIALIZED VIEW to automate the process.
If the new query I am going to demonstrate is fast enough, don't store any redundant data and calculate deltas on the fly.
For your current setup, this query should be much faster, when combined with this matching index:
CREATE INDEX price_watches_product_id_calculation_date_idx
ON price_watches(product_id, calculation_date DESC NULLS LAST);
Query:
UPDATE price_watches pw
SET min_percent_changed = calc.delta
FROM price_watches p1
, LATERAL (
SELECT (1 - p1.price_min / p2.price_min) * 100 AS delta
FROM price_watches p2
WHERE p2.product_id = p1.product_id
AND p2.calculation_date < p1.calculation_date
ORDER BY p2.calculation_date DESC NULLS LAST
LIMIT 1
) calc
WHERE p1.price_min > 0
AND p1.calculation_date = current_date - 1 -- only update new rows!
AND pw.id = p1.id
AND pw.min_percent_changed IS DISTINCT FROM calc.delta;
I am restricting the update to rows from "yesterday": current_date - 1. This is a wild guess at what you actually need.
Explanation for the added last line of the query:
How do I (or can I) SELECT DISTINCT on multiple columns?
Similar to this answer on dba.SE from just a few hours ago:
Slow window function query with big table
Proper information in the question would allow me to adapt the query and give more explanation.
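For the "calculate deltas on the fly" option mentioned above, a sketch of the read-time equivalent of the same LATERAL lookup (nothing redundant is stored; a window-function variant with lag() partitioned by product_id would also work):

-- Sketch: compute the daily delta at read time instead of storing it.
SELECT p1.id, p1.product_id, p1.calculation_date
     , (1 - p1.price_min / prev.price_min) * 100 AS delta
FROM   price_watches p1
LEFT   JOIN LATERAL (
   SELECT p2.price_min
   FROM   price_watches p2
   WHERE  p2.product_id = p1.product_id
   AND    p2.calculation_date < p1.calculation_date
   AND    p2.price_min > 0          -- avoid division by zero
   ORDER  BY p2.calculation_date DESC NULLS LAST
   LIMIT  1
   ) prev ON true
WHERE  p1.price_min > 0;

The same index suggested above lets the LATERAL subquery find the previous row for each product with a short index scan, which is what keeps this cheap.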