Alternative of IN in SQL to execute it in less time - sql

I wrote a query but I want to know if this execution time will be slow or fast? Can we use any alternative to IN since it is used 4 times in such a small query.
Can we have a better way to write this query ? Moreover I am confused because the someone wants to query it with large number of ep_et_id and the way I wrote this is no where have a place to mention the ep_et_id
The below query fetches 74133 results in 14.55 secs in direct Postgres call. But I believe this will take more time if I call it from the webpage.
select lp, ob_id
from ob_in
where ob_id in (select ob_id
from ob_for_e
where ep_et_id in (select ep_et_id
from rrts rep
left join rrts_inf rinf on rep.rrts_id = rinf.rrts_id
where rrts_type in ('FR','IN')
and rinf.status in('V')))
The tables I am using here are ob_in, ob_for_e, rrts , rrts_inf
The location point lp are in table ob_in and I put many IN to fetch IDs from the table to apply the final condition on table rrts_inf for type FR , IN and Status with V

Related

Redshift SQL result set 100s of rows wide efficiency (long to wide)

Scenario: Medical records reporting to state government which requires a pipe delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL that I want to make into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
record_id
,record_item_name
,record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records and the CTE_patient_record_item CTE has about 200 rows per record for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.cpr.record_id = cpri002.cpr.record_id
AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.cpr.record_id = cpri003.cpr.record_id
AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.cpr.record_id = cpri004.cpr.record_id
AND cpri003.record_item_name = 'mediation_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id diagnosis_1 diagnosis_2 medication_1 ...
100001 09 9B 88X ...
...and then I use the Reshift UNLOAD command to write to disk pipe delimited.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
Processing output is 2 rows 200 columns wide.
It takes 30 to 40 minutes to process just just the 2 records.
You might ask me why I am joining on the item name which is a string. Basically there is no item id, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW however, it takes 30 to 40 minutes (not surprisingly) to compile the materialized view also.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can create integer id's to join on and indexes? However, my managers are "new table" averse.
?
I could just stop with the first two CTEs, pull the data down to python and process using pandas dataframe which I've done before successfully but it would be nice if I could have an efficient query, just use Redshift UNLOAD and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul I am unable to upvote your answer as I am too new here).
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
EXPLAIN PLAN for the original solution is 2700 lines long, for the new solution using conditional aggregation is 40 lines long.
Thanks guys.
Without some more information it is impossible to know what is going on for sure but what you are doing is likely not ideal. An explanation plan and the execution time per step would help a bunch.
What I suspect is getting you is that you are reading a 97M row table 200 times. This will slow things down but shouldn't take 40 min. So I also suspect that record_item_name is not unique per value of record_id. This will lead to row replication and could be expanding the data set many fold. Also is record_id unique in fact_patient_record? If not then this will cause row replication. If all of this is large enough to cause significant spill and significant network broadcasting your 40 min execution time is very plausible.
There is no need to be joining when all the data is in a single copy of the table. #PhilCoulson is correct that some sort of conditional aggregation could be applied and the decode() syntax could save you space if you don't like case. Several of the above issues that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values for record_item_value for each record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.

How would I optimize this? I need to pull data from different tables and use results of queries in queries

Please propose an approach I should follow since I am obviously missing the point. I am new to SQL and still think in terms of MS Access. Here's an example of what I'm trying to do: Like I said, don't worry about the detail, I just want to know how I would do this in SQL.
I have the following tables:
Hrs_Worked (staff, date, hrs) (200 000+ records)
Units_Done (team, date, type) (few thousand records)
Rate_Per_Unit (date, team, RatePerUnit) (few thousand records)
Staff_Effort (staff, team, timestamp) (eventually 3 - 4 million records)
SO I need to do the following:
1) Calculate what each team earned by multiplying their units with RatePerUnit and Grouping on Team and Date. I create a view TeamEarnPerDay:
Create View teamEarnPerDay AS
SELECT
,Units_Done.Date,
,Units_Done.TeamID,
,Sum([Units_Done]*[Rate_Per_Unit.Rate]) AS Earn
FROM Units_Done INNER JOIN Rate_Per_Unit
ON (Units_Done.quality = Rate_Per_Unit.quality)
AND (Units_Done.type = Rate_Per_Unit.type)
AND (Units_Done.TeamID = Rate_Per_Unit.TeamID)
AND (Units_Done.Date = Rate_Per_Unit.Date)
GROUP BY
Units_Done.Date,
Units_Done.TeamID;
2) Count the TEAM's effort by Grouping Staff_Effort on Team and Date and counting records. This table has a few million records.
I have to cast the timestamp as a date....
CREATE View team_effort AS
SELECT
TeamID
,CAST([Timestamp] AS Date) as TeamDate,
,Count(Staff_EffortID) AS TeamEffort
FROM Staff_Effort
GROUP BY
TeamID
,CAST([Timestamp] AS Date);
3) Calculate the Team's Rate_of_pay: (1) Team_earnings / (2) Team_effort
I use the 2 views I created above. This view's performance drops but is still acceptable to me.
Create View team_rate_of_pay AS
SELECT
tepd.Date
,tepd.TeamID
,tepd.Earn
,tepd.TeamBags
,[Earn]/[TeamEffort] AS teamRate
FROM teamEarnPerDay
INNER JOIN team_effort
ON (teamEarnPerDay.Date = team_effort.TeamDate)
AND (teamEarnPerDay.TeamID = team_effort.TeamID);
4) Group Staff_Effort on Date and Staff and count records to get each individuals's effort. (share of the team effort)
I have to cast the Timestamp as a date....
Create View staff_effort AS
SELECT
TeamID
,StaffID
,CAST([Timestamp] AS Date) as StaffDate
,Count(Staff_EffortID) AS StaffEffort
FROM Staff_Effort
GROUP BY
,TeamID
,StaffID
,CAST([Timestamp] AS Date);
5) Calculate Staff earnings by: (4) Staff_Effort x (3) team_rate_of_pay
Multiply the individual's effort by the team rate he worked at on the day.
This one is ridiculously slow. In fact, it's useless.
CREATE View staff_earnings AS
SELECT
staff_effort.StaffDate
,staff_effort.StaffID
,sum(staff_effort.StaffEffort) AS StaffEffort
,sum([StaffEffort]*[TeamRate]) AS StaffEarn
FROM staff_effort INNER JOIN team_rate_of_pay
ON (staff_effort.TeamID = team_rate_of_pay.TeamID)
AND (staff_effort.StaffDate = team_rate_of_pay.Date)
Group By
staff_effort.StaffDate,
staff_effort.StaffID;
So you see what I mean.... I need various results and subsequent queries are dependent on those results.
What I tried to do is to write a view for each of the above steps and then just use the view in the next step and so on. They work fine but view nr 3 runs slower than the rest, even though still acceptable. View nr 5 is just ridiculously slow.
I actually have another view after nr.5 which brings hours worked into play as well but that just takes forever to produce a few rows.
I want a single line for each staff member, showing what he earned each day calculated as set out above, with his hours worked each day.
I also tried to reduce the number of views by using sub-queries instead but that took even longer.
A little guidance / direction will be much appreciated.
Thanks in advance.
--EDIT--
Taking the query posted in the comments. Did some formatting, added aliases and a little cleanup it would look like this.
SELECT epd.CompanyID
,epd.DATE
,epd.TeamID
,epd.Earn
,tb.TeamBags
,epd.Earn / tb.TeamBags AS RateperBag
FROM teamEarnPerDay epd
INNER JOIN teamBags tb ON epd.DATE = tb.TeamDate
AND epd.TeamID = tb.TeamID;
I eventually did 2 things:
1) Managed to reduce the nr of nested views by using sub-queries. This did not improve performance by much but it seems simpler with fewer views.
2) The actual improvement was caused by using LEFT JOIN in stead of Inner Join.
The final view ran for 50 minutes with the Inner Join without producing a single row yet.
With LEFT JOIN, it produced all the results in 20 seconds!
Hope this helps someone.

Order by in subquery behaving differently than native sql query?

So I am honestly a little puzzled by this!
I have a query that returns a set of transactions that contain both repair costs and an odometer reading at the time of repair on the master level. To get an accurate Cost per mile reading I need to do a subquery to get both the first meter reading between a starting date and an end date, and an ending meter.
(select top 1 wf2.ro_num
from wotrans wotr2
left join wofile wf2
on wotr2.rop_ro_num = wf2.ro_num
and wotr2.rop_fac = wf2.ro_fac
where wotr.rop_veh_num = wotr2.rop_veh_num
and wotr.rop_veh_facility = wotr2.rop_veh_facility
AND ((#sdate = '01/01/1900 00:00:00' and wotr2.rop_tran_date = 0)
OR ([dbo].[udf_RTA_ConvertDateInt](#sdate) <= wotr2.rop_tran_date
AND [dbo].[udf_RTA_ConvertDateInt](#edate) >= wotr2.rop_tran_date))
order by wotr2.rop_tran_date asc) as highMeter
The reason I have the tables aliased as xx2 is because those tables are also used in the main query, and I don't want these to interact with each other except to pull the correct vehicle number and facility.
Basically when I run the main query it returns a value that is not correct; it returns the one that is second(keep in mind that the first and second have the same date.) But when I take the subquery and just copy and paste it into it's own query and run it, it returns the correct value.
I do have a work around for this, but I am just curious as to why this happening. I have searched quite a bit and found not much(other than the fact that people don't like order bys in subqueries). Talking to one of my friends that also does quite a bit of SQL scripting, it looks to us as if the subquery is ordering differently than the subquery by itsself when you have multiple values that are the same for the order by(i.e. 10 dates of 08/05/2016).
Any ideas would be helpful!
Like I said I have a work around that works in this one case, but don't know yet if it will work on a larger dataset.
Let me know if you want more code.

Max vs Count Huge Performance difference on a query

I have to similar queries which the only difference is that one is doing a sum of a column and the other is doing a count(distinct) of another column.
The first one runs in seconds (17s) and the other one never stops (1 hour and counting). I've seen the plan for the count query and it has huge costs. I don't understand why.
They are hitting the exact same views.
Why is this happening and what can I do?
The one that is running fine:
select a11.SOURCEPP SOURCEPP,
a12.DUMMY DUMMY,
a11.SIM_NAME SIM_NAME,
a13.THEORETICAL THEORETICAL,
sum(a11.REVENUE) WJXBFS1
from CLIENT_SOURCE_DATA a11
join DUMMY_V a12
on (a11.SOURCEPP = a12.SOURCEPP)
join SIM_INFO a13
on (a11.SIM_NAME = a13.SIM_NAME)
where (a13.THEORETICAL in (0)
and a11.SIM_NAME in ('ETS40'))
group by a11.SOURCEPP,
a12.DUMMY,
a11.SIM_NAME,
a13.THEORETICAL
the one that doesn't run:
select a12.SOURCEPP SOURCEPP,
a12.SIM_NAME SIM_NAME,
a13.THEORETICAL THEORETICAL,
count(distinct a12.CLIENTID) WJXBFS1
from CLIENT_SOURCE_DATA a12
join SIM_INFO a13
on (a12.SIM_NAME = a13.SIM_NAME)
where (a13.THEORETICAL in (0)
and a12.SIM_NAME in ('ETS40'))
group by a12.SOURCEPP,
a12.SIM_NAME,
a13.THEORETICAL
DISTINCT is very slow when there are many DISTINCT values, database needs to SORT/HASH and store all values (or sets) in memory/temporary tablespace. Also it makes parallel execution much more difficult to apply.
If there is a way how to rewrite the query without using DISTINCT you should definitely do it.
As answered above, DISTINCT has to do a table scan and then hash, aggregate and sort the data into sets. This increases the amount of time it takes across the board (CPU, disk access, and the time it takes to return the data). I would recommend trying a subquery instead if possible. This will limit the aggregation execution to only the data you want to be distinct instead of having the engine perform it on all of the data. Here's an article on how this works in practice, with an example.

Import multiple queries written in SQL into access

I've got to write 50 relatively simple queries, that all use the same basic form, but each successive query depends on the one before it to run.
I can quick and easily write the queries in SQL in an text editor e.g. word, but I dont know how to import the text back into access. Nor do I know how to specify the name of the query in the SQL code or how to specify that the end of a query has been readched.
Here is a sample of 4 queries. Here, the 1st line is the name of the Query and the two consecutive hard return represents the end of eqch query.
‘Ring2Q1
SELECT RINGS.Parent, RINGS_1.Child, 2 AS Ring
FROM RINGS INNER JOIN RINGS AS RINGS_1 ON RINGS.Child = RINGS_1.Parent;
‘Ring2Q2
SELECT Ring2Q1.Parent, Ring2Q1.Child, Max(Ring2Q1.Ring) AS Ring
FROM Ring2Q1
GROUP BY Ring2Q1.Parent, Ring2Q1.Child;
‘Ring3Q1
SELECT RINGS.Parent, Ring2Q2.Child, 3 AS Ring
FROM RINGS INNER JOIN Ring2Q2 ON RINGS.Child = Ring2Q2.Parent;
‘Ring3Q2
SELECT Ring3Q1.Parent, Ring3Q1.Child, Max(Ring3Q1.Ring) AS Ring
FROM Ring3Q1
GROUP BY Ring3Q1.Parent, Ring3Q1.Child;
Go into Access. Create a new query. Select SQL View. You can copy and paste the text of the query in here. Save it as the name you need for the next query. Repeat. You will obviously need a starting table that the first query calls. I would look at why you need a cascading set of 50 queries, on any sizeable amount of data this is going to take a long time to run.