Optimizing a Left Join on inequality

I have 2 tables: table 1 with policies and their effective dates, and table 2 with effective dates and a "factor". I want to pull the table 1 policies with their latest factors, i.e. the factor with the most recent effective date on or before the policy's effective date. All policies should have factors, but in the off chance they don't, a null value should be returned.
Table 1 example
State    Policy  Effective_date
Alabama  P15001  1/4/2021
Alabama  P15002  2/8/2022
Arizona  P15004  3/7/2018
... (1M+ rows)
Table 2 example
State    Effective_date  Factor
Alabama  1/1/2018        1.345
Alabama  7/1/2020        1.143
Alabama  10/1/2021       1.099
Arizona  1/1/2017        0.899
...
Want:
State    Policy  Policy_Effective_date  Factor  t2_effective_date
Alabama  P15001  1/4/2021               1.143   7/1/2020
Alabama  P15002  2/8/2022               1.099   10/1/2021
Arizona  P15004  3/7/2018               0.899   1/1/2017
Column 5 (t2_effective_date) is nice to have but not 100% necessary.
(All these data tables/numbers are made up)
The current query is something like this:
create table want as
select t1.state, t1.policy, t1.effective_date, max(t2.effective_date) as t2_effective_date
from t1 left join t2
on t1.state = t2.state
and t1.Effective_date >= t2.Effective_date
group by t1.state, t1.policy, t1.effective_date
and then the t2_effective_date is joined to t2 to get the factor.
It works but is quite inefficient (takes a long time). I also don't like the inequality in the join, but couldn't come up with anything better myself. Are there better ways than this? Any solution is fine: creating new helper tables or columns is OK, and so is more code if it makes the query more efficient.
I'm using SAS SQL. Thanks in advance!

You do not need to "join" the tables. Just interleave them and remember the last time the factor changed.
First let's convert your listing into actual datasets.
data t1 ;
input State :$20. Policy :$10. Effective_date :mmddyy.;
format effective_date yymmdd10.;
cards;
Alabama P15001 1/4/2021
Alabama P15002 2/8/2022
Arizona P15004 3/7/2018
;
data t2;
input State :$20. Effective_date :mmddyy. Factor ;
format effective_date yymmdd10.;
cards;
Alabama 1/1/2018 1.345
Alabama 7/1/2020 1.143
Alabama 10/1/2021 1.099
Arizona 1/1/2017 0.899
;
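Now use SET with BY to interleave the observations by STATE and DATE. Both datasets must already be sorted by STATE and EFFECTIVE_DATE (run PROC SORT first if they are not).
data want ;
  /* t2 is listed first so that, on equal dates, the factor row is read before the policy row */
  set t2(in=in2) t1(in=in1);
  by state effective_date ;
  retain factor_effective_date factor2;
  format factor_effective_date yymmdd10.;
  /* forget the remembered factor at the start of each state */
  if first.state then call missing(of factor_effective_date factor2);
  /* factor row: remember its date and value */
  if in2 then do;
    Factor_effective_date=effective_date;
    Factor2 = factor;
  end;
  /* output only the policy rows, which now carry the latest factor */
  if in1 ;
  drop factor ;
  rename factor2=Factor effective_date=Policy_effective_date;
run;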
Results:
Obs  State    Policy_effective_date  Policy  factor_effective_date  Factor
1    Alabama  2021-01-04             P15001  2020-07-01             1.143
2    Alabama  2022-02-08             P15002  2021-10-01             1.099
3    Arizona  2018-03-07             P15004  2017-01-01             0.899
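For readers who do not use SAS, the same interleave-and-remember idea can be sketched in Python. This is only an illustration under the assumption that both inputs are already sorted by state and date; the function name `latest_factors` and the tuple layouts are made up for the example.

```python
from datetime import date
from heapq import merge

# Sample rows from the question, already sorted by (state, date).
policies = [
    ("Alabama", date(2021, 1, 4), "P15001"),
    ("Alabama", date(2022, 2, 8), "P15002"),
    ("Arizona", date(2018, 3, 7), "P15004"),
]
factors = [
    ("Alabama", date(2018, 1, 1), 1.345),
    ("Alabama", date(2020, 7, 1), 1.143),
    ("Alabama", date(2021, 10, 1), 1.099),
    ("Arizona", date(2017, 1, 1), 0.899),
]

def latest_factors(policies, factors):
    """Interleave both sorted inputs and carry the last factor forward."""
    # Tag the rows so that a factor (0) sorts before a policy (1) on the same day.
    tagged_factors = ((s, d, 0, f) for s, d, f in factors)
    tagged_policies = ((s, d, 1, p) for s, d, p in policies)
    out = []
    cur_state = cur_date = cur_factor = None
    for state, day, kind, payload in merge(tagged_factors, tagged_policies):
        if state != cur_state:
            # New state: forget the factor remembered for the previous one.
            cur_state, cur_date, cur_factor = state, None, None
        if kind == 0:
            cur_date, cur_factor = day, payload   # factor row: remember it
        else:
            out.append((state, payload, day, cur_factor, cur_date))  # policy row
    return out

result = latest_factors(policies, factors)
for row in result:
    print(row)
```

Each input is read once, so the work is linear in the number of rows after sorting, which is the reason this approach tends to beat an inequality join on large tables. A policy with no earlier factor for its state comes out with `None` in the last two fields.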

Slightly different approach here, modifying table 2 to have a start/end date and then joining using a BETWEEN.
data t1 ;
input State :$20. Policy :$10. Effective_date :mmddyy.;
format effective_date yymmdd10.;
cards;
Alabama P15001 1/4/2021
Alabama P15002 2/8/2022
Arizona P15004 3/7/2018
;
data t2;
input State :$20. Effective_date :mmddyy. Factor ;
format effective_date yymmdd10.;
cards;
Alabama 1/1/2018 1.345
Alabama 7/1/2020 1.143
Alabama 10/1/2021 1.099
Arizona 1/1/2017 0.899
;
proc sort data=t2;
by state effective_date;
run;
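data t2_start_end;
  /* look-ahead merge WITHOUT a BY statement: each row is paired with the next row of t2 */
  /* (with BY state, firstobs=2 would only shift the first state's rows) */
  merge t2
        t2(firstobs=2 keep=state effective_date
           rename=(state=next_state effective_date=next_date));
  /* a factor applies until the day before the next factor of the same state */
  if state = next_state then end_date = next_date - 1;
  else end_date = today(); /* latest factor for the state is still open */
  format end_date yymmdd10.;
  rename effective_date = start_date;
  drop next_state next_date;
run;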
proc sql;
create table want as
select t1.*, t2.factor, t2.start_date as policy_factor_start
from t1
left join t2_start_end as t2
on t1.state=t2.state and t1.effective_date between t2.start_date and t2.end_date
order by 1, 2, 3;
quit;
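The range-table idea can also be tried outside SAS. The sketch below uses SQLite through Python's `sqlite3` module, so it is not PROC SQL, and it assumes an SQLite build with window functions (3.25 or later). It derives the end of each range with `LEAD` and uses a half-open comparison rather than BETWEEN with `end_date - 1`. The 'Alaska' policy is a made-up row added to show the NULL case.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (state TEXT, policy TEXT, effective_date TEXT);
CREATE TABLE t2 (state TEXT, effective_date TEXT, factor REAL);
INSERT INTO t1 VALUES
  ('Alabama', 'P15001', '2021-01-04'),
  ('Alabama', 'P15002', '2022-02-08'),
  ('Arizona', 'P15004', '2018-03-07'),
  ('Alaska',  'P15009', '2019-05-01');  -- hypothetical policy with no factor
INSERT INTO t2 VALUES
  ('Alabama', '2018-01-01', 1.345),
  ('Alabama', '2020-07-01', 1.143),
  ('Alabama', '2021-10-01', 1.099),
  ('Arizona', '2017-01-01', 0.899);
""")

rows = con.execute("""
WITH ranges AS (
  -- one row per factor, plus the date on which the next factor takes over
  SELECT state, factor, effective_date AS start_date,
         LEAD(effective_date) OVER (PARTITION BY state
                                    ORDER BY effective_date) AS next_start
  FROM t2
)
SELECT t1.state, t1.policy, t1.effective_date, r.factor, r.start_date
FROM t1
LEFT JOIN ranges AS r
  ON  r.state = t1.state
  AND t1.effective_date >= r.start_date
  AND (r.next_start IS NULL OR t1.effective_date < r.next_start)
ORDER BY t1.state, t1.policy
""").fetchall()

for row in rows:
    print(row)
```

Because every policy date falls into at most one range per state, the join returns at most one factor per policy and no GROUP BY or second join is needed.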

Related

Show the most current date using MAX Date

Table covidDeaths
Location Date total_cases total_deaths
_______________________________________________________________________
United States 2020-01-22 00:00:00.000 1 NULL
United States 2020-01-23 00:00:00.000 1 0
United States 2020-01-24 00:00:00.000 2 1
United States 2020-01-25 00:00:00.000 2 0
United States 2020-01-26 00:00:00.000 5 3
United States 2021-11-11 00:00:00.000 46851529 58626
United States 2021-11-12 00:00:00.000 46991304 139775
United States 2021-11-13 00:00:00.000 47050502 59198
United States 2021-11-14 00:00:00.000 47074080 23578
I'm running into a problem that is leaving me a bit frustrated. I am looking for the total_cases and total_deaths using the most current date where the location is the United States in a table named covidDeaths. I know you can use the Max() function to find the most current date on file so I have tried
SELECT MAX(date) AS "Current Date", total_deaths, total_cases
FROM covidDeaths
WHERE location = 'United States'
GROUP BY total_cases, total_deaths;
I want it to output a single row like this.
_______________________________________
|Current Date|Total_Deaths|Total_Cases|
|____________|____________|___________|
|2021-11-14 |763092 |47074080 |
|____________|____________|___________|
Instead, I'm getting
_______________________________________
|Current Date|Total_Deaths|Total_Cases|
|____________|____________|___________|
|2020-01-23 |Null |1 |
|____________|____________|___________|
|2020-01-24 |Null |2 |
|____________|____________|___________|
and so on until it reaches the max (date).
I am using SQL Server 2019.
I'm hoping someone can explain to me what I am doing wrong and why it's outputting multiple dates instead of just the most current.

Use a TOP query:
SELECT TOP 1 WITH TIES date AS [Current Date], total_deaths, total_cases
FROM covidDeaths
WHERE location = 'United States'
ORDER BY date DESC;
I am using WITH TIES here in case there might be more than one record having the most recent date. If not, or you don't care about ties, then you may simply use TOP 1 instead.
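Note: I see no reason to be using GROUP BY here. Grouping by total_cases and total_deaths produces one row per distinct (total_cases, total_deaths) pair, and MAX(date) is then computed separately within each of those groups, which is why your query returns many dates instead of one.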
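The difference between the two queries is easy to reproduce on a few rows. This is a small sketch in SQLite via Python's `sqlite3`, not SQL Server, so `ORDER BY ... LIMIT 1` stands in for `TOP 1`; the three rows are copied from the question's table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE covidDeaths (location TEXT, date TEXT,
                          total_cases INTEGER, total_deaths INTEGER);
INSERT INTO covidDeaths VALUES
  ('United States', '2021-11-12', 46991304, 139775),
  ('United States', '2021-11-13', 47050502, 59198),
  ('United States', '2021-11-14', 47074080, 23578);
""")

# The question's approach: one group per distinct (total_cases, total_deaths).
grouped = con.execute("""
SELECT MAX(date), total_deaths, total_cases
FROM covidDeaths
WHERE location = 'United States'
GROUP BY total_cases, total_deaths
""").fetchall()

# The answer's approach: sort by date and keep only the newest row.
latest = con.execute("""
SELECT date, total_deaths, total_cases
FROM covidDeaths
WHERE location = 'United States'
ORDER BY date DESC
LIMIT 1
""").fetchall()

print(len(grouped), "rows from GROUP BY;", "latest row:", latest)
```

With three distinct (total_cases, total_deaths) pairs the grouped query returns three rows, while the sorted query returns only the row for the newest date.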

SQL: Create a flag for separate records in the same table with overlapping date ranges

I'm trying to figure out how to create a boolean field that would tell me when two records have overlapping date ranges.
In the following example, every unique Location/Counterparty combo within a specified date range can EITHER have a Contract or a DeliveryPoint, not both. So ids 1 and 2 should be flagged, but ids 3 and 4 are OK because they don't overlap, so their flag should read "False".
I started to do a self join, but after that, I couldn't wrap my head around the next step. Did I start correctly, or is the solution totally different?
id  Location  Counterparty  Contract  DeliveryPoint  StartDate  EndDate
1   New York  Wal Mart                Philadelphia   3/1/2019   12/31/2020
2   New York  Wal Mart      123456                   5/1/2019   7/31/2019
3   Toronto   Target                  Boston         3/1/2019   5/31/2019
4   Toronto   Target        456789                   6/1/2019   12/31/2020
With the flag, I'd want it to look like
id  Location  Counterparty  Contract  DeliveryPoint  StartDate  EndDate     Overlap
1   New York  Wal Mart                Philadelphia   3/1/2019   12/31/2020  TRUE
2   New York  Wal Mart      123456                   5/1/2019   7/31/2019   TRUE
3   Toronto   Target                  Boston         3/1/2019   5/31/2019   FALSE
4   Toronto   Target        456789                   6/1/2019   12/31/2020  FALSE

In your insert query, you could use a subquery that searches for another record with overlapping dates. Pay attention to the date comparisons: two ranges overlap when each one starts on or before the other one ends. See the example:
insert into table(location, Counterparty, Overlap)
select
    @location,
    @Counterparty,
    case when exists(select Id
                     from table t
                     where t.location = @location
                       and t.Counterparty = @Counterparty
                       and @startDate <= t.EndDate
                       and @endDate >= t.StartDate
                    ) then 1 else 0 end as Overlap
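The same overlap test can also flag rows that are already in the table, which is closer to the output the question asks for. Below is a sketch in SQLite via Python's `sqlite3`; the table name `deals` is made up, the dates are rewritten in ISO format so they compare correctly as text, and the `o.id <> d.id` condition is an addition so that a row is not compared with itself.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE deals (id INTEGER, location TEXT, counterparty TEXT,
                    start_date TEXT, end_date TEXT);
INSERT INTO deals VALUES
  (1, 'New York', 'Wal Mart', '2019-03-01', '2020-12-31'),
  (2, 'New York', 'Wal Mart', '2019-05-01', '2019-07-31'),
  (3, 'Toronto',  'Target',   '2019-03-01', '2019-05-31'),
  (4, 'Toronto',  'Target',   '2019-06-01', '2020-12-31');
""")

flags = con.execute("""
SELECT d.id,
       EXISTS (SELECT 1
               FROM deals AS o
               WHERE o.id <> d.id                 -- never match the row itself
                 AND o.location = d.location
                 AND o.counterparty = d.counterparty
                 AND d.start_date <= o.end_date   -- d starts before o ends
                 AND d.end_date >= o.start_date   -- d ends after o starts
              ) AS overlap
FROM deals AS d
ORDER BY d.id
""").fetchall()

print(flags)
```

Rows 1 and 2 come back with flag 1 and rows 3 and 4 with flag 0, matching the TRUE/FALSE column in the question.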

CTE recursion infinite loop

I'm working with a stored procedure that uses a CTE in SQL Server, and I'm trying to read some data from 2 tables. When execution reaches the CTE query it goes into an infinite loop and never ends. Is there a way to prevent that infinite loop?
This is the query that I create:
WITH tableName(Id, enddate, statusDte, closeId, shceDte, calcDte, closeEndDte, ParentId, LastClose, lasCloseDte, closeClass,addSe,twon,code)
AS
(
SELECT
tba.Id,
CASE WHEN tb.ParentId IS NOT NULL
THEN tb.Id
WHEN tb.statusDte IN (1,2,3)
THEN tb.calcDte ELSE tb.shceDte
END ForecastDueDate,
statusDte, closeId, shceDte, calcDte,
CASE WHEN tb.ParentId IS NULL
THEN closeEndDte ELSE NULL END, tb.ParentId, 0,
CASE WHEN tb.ParentId IS NOT NULL
THEN statusDte
WHEN tb.statusDte = 5
AND (tb.calcDte BETWEEN '1/1/2020 12:00:00 AM' AND '12/31/2020 11:59:59 PM'
OR tb.closeEndDte BETWEEN '1/1/2020 12:00:00 AM' AND '12/31/2020 11:59:59 PM')
THEN ams.GetPreviousNthFullAuditDate(tb.Id, tb.AuditID, 2) ELSE a.statusDate END as lastDate,
a.closeClass, tba.addSe,tba.town,tba.code
FROM
tableA tba
INNER JOIN
tableB tb ON tb.Id = tba.Id
WHERE
statusDte NOT IN (3,4) AND tba.IsAtve = 1
UNION ALL
SELECT
Id, enddate,
statusDte, statusDte, shceDte, calcDte, closeEndDte, ParentId,
0, lasCloseDte, closeClass,addSe,twon,code
FROM
tableName
WHERE
enddate BETWEEN enddate AND '12/31/2020 11:59:59 PM'
)
SELECT *
FROM tableName
OPTION (maxrecursion 0)
Expected results
Id enddate statusDte closeId shceDte calcDte closeEndDte parentId lastClose lastCloseDte closeClass addSe town code
----------- ----------------------- ------------- ----------- ----------------------- ----------------------- ----------------------- ----------------------- ----------- ----------------------- ----------- --------------------------------- ---------------------- --------------------------------------------------
133 2011-04-04 00:00:00.000 22 14453 NULL 2011-04-04 00:00:00.000 2099-12-31 00:00:00.000 NULL 0 NULL 1 4707 EXECUTIVE DRIVE '' SAN DIEGO 123
56 2018-12-07 13:00:00.000 22 52354 NULL 2018-12-07 13:00:00.000 2019-12-07 00:00:00.000 NULL 0 NULL 1 75 STATE ST FL 24 '' BOSTON 345
12 2021-02-05 17:00:00.000 22 75751 NULL 2021-02-05 17:00:00.000 NULL NULL 0 NULL 1 1450 FRAZEE RD STE 308 '' SAN DIEGO 678
334 2019-03-07 16:30:00.000 15 66707 2019-03-07 16:30:00.000 2019-03-23 21:00:00.000 NULL NULL 0 2019-03-07 16:30:00.000 1 42690 RIO NEDO, STE E '' TEMECULA 91011
33 2020-01-10 17:00:00.000 22 65568 NULL 2020-01-10 17:00:00.000 NULL NULL 0 2018-01-10 17:00:00.000 1 2518 UNICORNIO ST. '' CARLSBAD 136
55 2020-04-16 20:00:00.000 22 67812 NULL 2020-04-16 20:00:00.000 NULL NULL 0 2018-04-17 20:00:00.000 1 4534 OSPREY STREET '' SAN DIEGO 653
66 2020-02-21 17:00:00.000 22 75956 NULL 2020-02-21 17:00:00.000 NULL NULL 0 2019-02-21 17:00:00.000 1 3511 CAMINO DEL RIO S, STE 305 '' SAN DIEGO 0484
094 2021-02-20 21:00:00.000 22 75629 NULL 2021-02-20 21:00:00.000 NULL NULL 0 NULL 1 29349 EAGLE DR '' MURRIETA 345

First, let's try to add some best practices. Qualify all your columns with the appropriate table alias. Just doing some of them is inconsistent and inconsistent style is difficult to read and prone to errors.
Next, you've (hopefully) dumbed down your actual query. Generic names like "tableA" hinder understanding.
Next - your first case expression seems highly suspicious. One branch returns tb.Id and the others return what appears to be a date (or datetime). You can, unfortunately, cast an int to a datetime; it might not make any sense, but it won't generate an error. So - does this make sense?
Next - you've made a common mistake with your datetime boundaries. Depending on your data you might never know this. But there is no reason to expect that and there is every reason to write your logic so that it avoids any possibility. Tibor discusses in great detail here. Shorter version - your upper boundary should always be an exclusive one to support all possible values of time for your datatype. 23:59:59 will ignore any time values with non-zero milliseconds. And use a literal format that is not dependent on language or connection settings.
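The boundary problem is independent of SQL and can be shown with plain datetime values. A minimal Python sketch, with made-up function names, comparing the inclusive 23:59:59 upper bound from the question against an exclusive upper bound at the start of the next year:

```python
from datetime import datetime

def in_2020_inclusive(ts):
    # Inclusive upper bound at 23:59:59, like the BETWEEN in the question.
    return datetime(2020, 1, 1) <= ts <= datetime(2020, 12, 31, 23, 59, 59)

def in_2020_half_open(ts):
    # Exclusive upper bound: everything strictly before 2021-01-01 00:00:00.
    return datetime(2020, 1, 1) <= ts < datetime(2021, 1, 1)

# A timestamp in the last second of 2020 with a non-zero fraction.
late = datetime(2020, 12, 31, 23, 59, 59, 500000)

print(in_2020_inclusive(late))  # False: the value falls in the gap
print(in_2020_half_open(late))  # True
```

In the query this corresponds to writing the range as `>= '20200101' AND < '20210101'` instead of BETWEEN, which also uses a literal format that does not depend on language settings.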
Next, you add confusion. You named your columns in the cte declaration, but your code also includes aliases for some (but not all; see the consistency comment above) columns, and those aliases differ significantly from the actual column names of the cte. The 2nd column of the cte is "enddate", while the anchor query uses the alias "ForecastDueDate".
Next, you have this: tb.statusDte = 5. The name implies date; the literal implies something different. You have other columns that end in "Dte" that are obviously dates, but not this one? Danger, danger!
Next, you refer to columns "a.closeClass" and "a.statusDate". There is no table or alias named "a".
Lastly, you have:
WHERE enddate BETWEEN enddate AND '12/31/2020 11:59:59 PM'
Think about what you wrote. Is not enddate always between enddate and Dec 31 2020 (so long as enddate <= that value)? I think this is the source of your issue. You're not computing or adjusting anything from the anchor, so the recursive part just keeps selecting and selecting and selecting. There is no logic to end the recursion.
The next question is obviously "how to fix it". That is impossible to say without knowing your schema, what it represents, and your goal. The use of recursion here is not obvious.
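For comparison, here is what a terminating recursive CTE looks like in general. It is a generic sketch, not a fix for the query above, and it runs in SQLite via Python's `sqlite3` (SQLite spells it `WITH RECURSIVE` and uses `date()` for date arithmetic, unlike SQL Server). The point is that the recursive member changes a value on every pass and its WHERE clause eventually becomes false.

```python
import sqlite3

con = sqlite3.connect(":memory:")

days = con.execute("""
WITH RECURSIVE days(d) AS (
  SELECT '2020-12-29'              -- anchor member: the starting row
  UNION ALL
  SELECT date(d, '+1 day')         -- recursive member: moves d forward
  FROM days
  WHERE d < '2020-12-31'           -- stops once the upper bound is reached
)
SELECT d FROM days ORDER BY d
""").fetchall()

print(days)
```

If the recursive member selected `d` unchanged, the WHERE clause would stay true forever, which is the situation described above.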

If the hierarchy in the data contains a loop (records that reference each other), the recursion never terminates, and you will see the resources used by the SQL Server process increase tremendously.
You can limit the recursion by setting MAXRECURSION to a value other than 0 (zero lets SQL Server recurse without a limit). The query then fails with an error once the limit is reached instead of running forever, which is what you want with looping or mutually referencing data.

Keeping Specific Dates AND Distinct NULLS

I have the following stored procedure
ALTER PROCEDURE [dbo].[sp_GetFeedDate10_TEST]
-- sp_GetFeedDate10_TEST '05/30/2018'
@daycurrent DATE
AS
BEGIN
SELECT
TANK.Name, TANK.Tank_Pond_uID,
FEED.FeedValue, FEED.Date
FROM
aa_Tanks_Ponds TANK
LEFT JOIN
aa_Feed_Chart FEED ON TANK.Tank_Pond_uID = FEED.Tank_Pond_uID
WHERE
TANK.Site_Name = 'Dry Creek' --and (FEED.Date = DATEADD(DAY, -10, @daycurrent))
ORDER BY
TANK.Tank_Pond_uID
END
I commented out the predicate
and (FEED.Date = DATEADD(DAY, -10, @daycurrent))
which is where I am currently having issues with filtering. The join pulls a list of tanks and joins it with any feed values and dates from another list.
Where I run into trouble is when filtering based on the date, I lose all the distinct NULLS for all tanks that don't contain a value. It's important to keep the result the same rows and order as the TANK table.
Here's an example of the first table:
Name Tank_Pond_uID
--------------------
B01 DCB01
B02 DCB02
B03 DCB03
B04 DCB04
B05 DCB05
Example of the second table:
Site_Name Tank_Pond_uID Date FeedValue
--------------------------------------------------
Dry Creek DCB01 2018-05-20 90
Dry Creek DCB02 2018-05-20 90
Dry Creek DCB03 2018-05-20 90
When I run the above stored procedure with a date:
sp_GetFeedDate10_TEST '05/30/2018'
(I know the date doesn't matter while the predicate that uses it is commented out), I get the following result:
Name Tank_Pond_uID FeedValue Date
-----------------------------------------------
B01 DCB01 90 2018-05-20
B01 DCB01 90 2018-05-21
B01 DCB01 90 2018-05-22
B02 DCB02 90 2018-05-20
B02 DCB02 90 2018-05-21
B02 DCB02 90 2018-05-22
B03 DCB03 NULL NULL
B04 DCB04 NULL NULL
B05 DCB05 NULL NULL
When I add the commented section back into the stored procedure and run it again, I get the following result:
Name Tank_Pond_uID FeedValue Date
-----------------------------------------------
B01 DCB01 90 2018-05-20
B02 DCB02 90 2018-05-20
I would like to keep all the rows from the left table and still filter by date.
Example of the expected result:
Name Tank_Pond_uID FeedValue Date
-----------------------------------------------
B01 DCB01 90 2018-05-20
B02 DCB02 90 2018-05-20
B03 DCB03 NULL NULL
B04 DCB04 NULL NULL
B05 DCB05 NULL NULL

You have turned your left join into an inner join with the predicates in your where clause. Move the date predicate to the join and this will work.
select TANK.Name
    , TANK.Tank_Pond_uID
    , FEED.FeedValue, FEED.Date
from aa_Tanks_Ponds TANK
LEFT JOIN aa_Feed_Chart FEED ON TANK.Tank_Pond_uID = FEED.Tank_Pond_uID
    and FEED.Date = DATEADD(DAY, -10, @daycurrent)
where TANK.Site_Name = 'Dry Creek'
order by TANK.Tank_Pond_uID
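The effect of moving the predicate is easy to check on a small copy of the data. The sketch below uses SQLite via Python's `sqlite3`, so `date(?, '-10 day')` stands in for `DATEADD(DAY, -10, @daycurrent)`; the short table names `tanks` and `feed` are made up, and the Site_Name filter is left out for brevity.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tanks (name TEXT, uid TEXT);
CREATE TABLE feed  (uid TEXT, date TEXT, feed_value INTEGER);
INSERT INTO tanks VALUES ('B01','DCB01'), ('B02','DCB02'), ('B03','DCB03'),
                         ('B04','DCB04'), ('B05','DCB05');
INSERT INTO feed VALUES ('DCB01','2018-05-20',90), ('DCB01','2018-05-21',90),
                        ('DCB02','2018-05-20',90), ('DCB02','2018-05-21',90);
""")
day_current = "2018-05-30"

# Date predicate in WHERE: rows whose feed columns are NULL are filtered out.
where_rows = con.execute("""
SELECT t.name, t.uid, f.feed_value, f.date
FROM tanks AS t
LEFT JOIN feed AS f ON t.uid = f.uid
WHERE f.date = date(?, '-10 day')
ORDER BY t.uid
""", (day_current,)).fetchall()

# Date predicate in ON: tanks without a matching feed row are kept with NULLs.
on_rows = con.execute("""
SELECT t.name, t.uid, f.feed_value, f.date
FROM tanks AS t
LEFT JOIN feed AS f ON t.uid = f.uid
                   AND f.date = date(?, '-10 day')
ORDER BY t.uid
""", (day_current,)).fetchall()

print(len(where_rows), len(on_rows))
```

The WHERE version returns only the two tanks with a feed row on 2018-05-20, while the ON version returns all five tanks, three of them with NULL feed values.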

SAS/SQL - Identifying/grouping return journeys in rail journey data

I have a table in Oracle that contains rail journeys. Some journeys have a journey_type of 'S', which means they are single journeys. However, some customers essentially use these single journeys to make a return journey (e.g. purchasing a single from London to Manchester and a single ticket from Manchester to London). I need to be able to identify instances of this and somehow group the two journeys into one row. To complicate matters a little further, 'reason for travel' is recorded at the transaction level and the two journeys will probably have different transaction IDs - so I need to be able to retain both of the values for this variable in the new row. I've been struggling with this today and haven't been able to come up with an acceptable solution so thought I'd ask for advice here.
Here is example data:
Cust_ID Journey_ID Origin Destination Type Date Reason
100 100001 London Manchester S 15/01/2014 Family
100 100100 Manchester London S 16/01/2014 Family
100 110023 London Manchester S 25/01/2014 Family
100 114000 Manchester London S 29/01/2014 Holiday
100 129345 London Norwich S 02/02/2014 Business
100 134578 Norwich London S 15/02/2014 Business
100 145843 London Manchester S 01/03/2014 Family
100 147893 Manchester London S 04/03/2014 Family
200 157878 Birmingham London S 04/04/2014 Friends
200 159899 London Birmingham S 06/04/2014 Friends
I'd like to create something like this:
Cust_ID Journey_ID Origin Destination Date Reason1 Reason2
100 100001 London Manchester 15/01/2014 Family Family
100 110023 London Manchester 25/01/2014 Family Holiday
100 129345 London Norwich 02/02/2014 Business Business
100 145843 London Manchester 01/03/2014 Family Family
200 157878 Birmingham London 04/04/2014 Friends Friends
The tools I have available are Oracle SQL Developer and SAS. Any thoughts on how you'd go about this would be appreciated! The table contains many millions of records so efficiency is an issue.
edit: forgot to include transaction_id in the tables. It could either be the same or different for the outward and return journeys.

Here's one solution, although there may be a better way using hash tables. Essentially my approach is to first change the transaction id of the return journey to be the same as that of the outbound journey; after that it is a fairly simple matter of outputting the desired result. If you need to keep the original transaction id for the return journey, then you'll have to output the data to a new dataset instead of using MODIFY and then add in the other journeys of type 'R'. I've added a few extra data lines, including a couple of return journeys (one that occurs between 2 singles) and a single journey with no return. Hope this helps.
/* create dummy dataset */
data have;
input Cust_ID Journey_ID Origin $ :12. Destination $ :12. Type $ Date :ddmmyy10. Reason $ Trans_id;
format date date9.;
datalines;
100 100001 London Manchester S 15/01/2014 Family 1
100 100100 Manchester London S 16/01/2014 Family 2
100 110023 London Manchester S 25/01/2014 Family 3
100 114000 Manchester London S 29/01/2014 Holiday 4
100 100300 London Exeter R 31/01/2014 Business 5
100 100300 Exeter London R 01/02/2014 Business 5
100 129345 London Norwich S 02/02/2014 Business 6
100 130300 Norwich Ipswich R 11/02/2014 Business 7
100 130300 Ipswich Norwich R 12/02/2014 Business 7
100 134578 Norwich London S 15/02/2014 Business 8
100 145843 London Manchester S 01/03/2014 Family 9
100 147893 Manchester London S 04/03/2014 Family 10
100 148123 London Brighton S 06/03/2014 Family 11
200 157878 Birmingham London S 04/04/2014 Friends 12
200 159899 London Birmingham S 06/04/2014 Friends 13
;
run;
/* sort data */
proc sort data=have;
by cust_id date trans_id;
run;
/* update transaction id for return journey to be same as outbound journey */
data have;
/* dataset entered twice to enable BY group processing */
modify have (where=(type='S')) have (where=(type='S'));
by cust_id date trans_id;
retain _tid _orig _dest _pair;
if first.cust_id or _pair=1 then do;
_tid = trans_id;
_orig = origin;
_dest = destination;
_pair=0;
end;
/* does destination and origin match origin and destination of previous journey */
else if origin=_dest and destination=_orig then do;
trans_id = _tid;
_pair=1;
end;
run;
/* need to sort again as some trans_id's have changed */
proc sort data=have;
by cust_id trans_id descending date;
run;
/* output required data with both reason */
data want (keep=cust_id journey_id origin destination type date reason1 reason2);
set have;
by cust_id trans_id descending date;
length reason1 reason2 $12;
retain reason2;
if first.trans_id and not last.trans_id then reason2 = reason;
else if last.trans_id then do;
reason1 = reason;
output;
call missing(reason1,reason2);
end;
run;
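The pairing rule itself (a single is matched with the customer's next single that reverses its origin and destination) can be sketched in a few lines of Python. This is an illustration of the idea, not a translation of the SAS steps above: it assumes the type 'R' rows have already been filtered out and that the rows are sorted by customer and date, and the function name and tuple layout are made up.

```python
def pair_single_journeys(journeys):
    """Pair each single with the next single that reverses its route.

    journeys: tuples (cust_id, journey_id, origin, destination, date, reason),
    sorted by cust_id and date. Returns tuples with reason1 and reason2.
    """
    paired = []
    pending = None  # outbound leg still waiting for a matching return
    for j in journeys:
        cust, _, origin, dest, _, reason = j
        if pending and pending[0] == cust and pending[2] == dest and pending[3] == origin:
            # Return leg found: emit the outbound row with both reasons.
            paired.append(pending[:5] + (pending[5], reason))
            pending = None
        else:
            if pending:
                # The previous single never got a return leg.
                paired.append(pending[:5] + (pending[5], None))
            pending = j
    if pending:
        paired.append(pending[:5] + (pending[5], None))
    return paired

singles = [
    (100, 100001, "London", "Manchester", "2014-01-15", "Family"),
    (100, 100100, "Manchester", "London", "2014-01-16", "Family"),
    (100, 110023, "London", "Manchester", "2014-01-25", "Family"),
    (100, 114000, "Manchester", "London", "2014-01-29", "Holiday"),
    (100, 129345, "London", "Norwich", "2014-02-02", "Business"),
    (100, 134578, "Norwich", "London", "2014-02-15", "Business"),
    (100, 145843, "London", "Manchester", "2014-03-01", "Family"),
    (100, 147893, "Manchester", "London", "2014-03-04", "Family"),
    (100, 148123, "London", "Brighton", "2014-03-06", "Family"),
    (200, 157878, "Birmingham", "London", "2014-04-04", "Friends"),
    (200, 159899, "London", "Birmingham", "2014-04-06", "Friends"),
]

result = pair_single_journeys(singles)
for row in result:
    print(row)
```

The unpaired London to Brighton single comes out with `None` as its second reason. A real implementation would probably also want a limit on how many days may pass between the two legs, which the question does not specify.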
