Hash join equivalent of a PROC SQL BETWEEN join

I usually use PROC SQL when I'm joining tables on a key plus a date condition (i.e. target_date falls between start_date and end_date).
I've been able to successfully translate this to a hash join when considering an INNER JOIN:
data hash_join;
if _n_ = 1 then do;
declare hash add1(dataset:'table_2',multidata: 'Y');
add1.defineKey('key_1');
add1.defineData('start_date','end_date','value_1');
add1.defineDone();
end;
format
start_date date9.
end_date date9.
value_1 10.5
;
set table_1 (keep=key_1 target_date);
if add1.find() = 0 then do until (add1.find_next());
if start_date le target_date le end_date then output;
end;
run;
Which is the same thing as:
proc sql;
create table sql_join as select
b.start_date,
b.end_date,
b.value_1,
a.key_1,
a.target_date
from table_1 a
inner join table_2 b
on a.key_1 = b.key_1 and
a.target_date between b.start_date and b.end_date
;quit;
I'm having trouble figuring out what the equivalent of a LEFT JOIN would be, though. For instance, if a table_1 record has no key match at all, I'd still want to output it, which I think is straightforward:
if add1.find() ne 0 then output;
And if it JOINs and the date is between, that seems straightforward as well:
if add1.find() = 0 then do until (add1.find_next());
if start_date le target_date le end_date then output;
end;
But how do I get the rest of the records from table_1 that might join, but don't have the target_date between the start_date and end_date? For instance, let's say table_2 is a start_date and end_date of a sale, and that sale didn't start until February 1st for a key_1 = 'Clothes'. If my table_1 has 'Clothes' and sales on January 1st, it will JOIN on the key, but I want to output the blank value. Any ideas on how to do this?
Any help would be much appreciated!

You just need to keep track of whether you've found a match or not. The hash find only matches on the key, not on the 'between' condition, so its return code can't tell you whether a date range matched; you have to track that yourself.
See this example. Here I modify SASHELP.CLASS to look like your input tables, then add a bit of logic to see if anything was found.
data table_1;
set sashelp.class;
rename age=target_date name=key_1;
drop height weight;
run;
data table_2;
set sashelp.class;
do _i = 1 to mod(_n_,3);
start_date = age-3+_i;
end_date = age+1-_i;
if start_date le end_date then output;
end;
rename name=key_1 height=value_1;
keep height weight start_date age end_date name;
run;
data hash_join;
if _n_ = 1 then do;
declare hash add1(dataset:'table_2',multidata: 'Y');
add1.defineKey('key_1');
add1.defineData('start_date','end_date','value_1');
add1.defineDone();
end;
format
start_date date9.
end_date date9.
value_1 10.5
;
set table_1 (keep=key_1 target_date);
if add1.find() = 0 then do until (add1.find_next());
if start_date le target_date le end_date then do;
found=1;
output;
end;
end;
call missing(start_date, end_date, value_1); *clear every hash data variable so an unmatched row is output with blanks;
if not (found) then output;
run;
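For anyone who wants to see the same left-join logic outside SAS, here is a minimal Python sketch. It is only an illustration of the found-flag idea, not a translation of the hash object: the function name left_join_between and the sample rows (including the year 2014) are made up for the example.

```python
from collections import defaultdict
from datetime import date


def left_join_between(table_1, table_2):
    """Left join on key_1 where target_date lies in [start_date, end_date]."""
    # Group table_2 rows by key, similar in spirit to a hash with multidata:'Y'.
    lookup = defaultdict(list)
    for row in table_2:
        lookup[row["key_1"]].append(row)

    out = []
    for left in table_1:
        found = False  # reset for every table_1 row
        for right in lookup.get(left["key_1"], []):
            if right["start_date"] <= left["target_date"] <= right["end_date"]:
                found = True
                out.append({**left,
                            "start_date": right["start_date"],
                            "end_date": right["end_date"],
                            "value_1": right["value_1"]})
        if not found:
            # Either the key is absent or no date range matched: emit blanks.
            out.append({**left, "start_date": None,
                        "end_date": None, "value_1": None})
    return out


# Hypothetical sample data: the sale for 'Clothes' starts on 1 February.
table_1 = [
    {"key_1": "Clothes", "target_date": date(2014, 1, 1)},
    {"key_1": "Clothes", "target_date": date(2014, 2, 15)},
    {"key_1": "Shoes", "target_date": date(2014, 2, 15)},
]
table_2 = [
    {"key_1": "Clothes", "start_date": date(2014, 2, 1),
     "end_date": date(2014, 2, 28), "value_1": 0.25},
]

result = left_join_between(table_1, table_2)
for row in result:
    print(row["key_1"], row["target_date"], row["value_1"])
```

The January 'Clothes' row joins on the key but misses the range, so it comes out once with blank values, just like the 'Shoes' row that has no key match at all.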

I think you just need to track if something has the key, but not in the range:
if add1.find() ^=0 then output;
else do;
found = 0;
do until (add1.find_next());
if start_date le target_date le end_date then do;
output;
found=1;
end;
end;
if ^found then do;
call missing(start_date, end_date, value_1);
output;
end;
end;
No data to test with, so this is just me coding in SO. Let me know if it doesn't work.

{SAS} Converting a SQL merge into a HASH merge

proc sql;
create table ndd1 as
select a.*, 1 as default_flag, b.retail_account_no, c.limit
from (
select
posting_date, counterparty_id, counterparty_indicator, customer_id, last_default_date, last_out_of_default_date
from gdwh30.tb0_default
where counterparty_id<>'*noval*'
and ((last_out_of_default_date < posting_date and last_default_date >= last_out_of_default_date)
or (last_out_of_default_date > posting_date and last_default_date < last_out_of_default_date))
and posting_date = to_date(%nrbquote(')&tt.%nrbquote('), 'DD-MON-YYYY hh24:mi')) a
left join (
select retail_account_no, REF_account_ID, regexp_substr(REF_account_ID,'[^*]+', 1) as counterparty_id
from gdwh30.TB0_account
/*tb_account*/
where bus_date_from<= to_date(%nrbquote(')&tt.%nrbquote('), 'DD-MON-YYYY hh24:mi')
and bus_date_until>=to_date(%nrbquote(')&tt.%nrbquote('), 'DD-MON-YYYY hh24:mi')
and entity_id='RBBG') b
on a.counterparty_id=b.counterparty_id
;quit;
&tt. is of the form
30APR2020:00:00:00
or date22.
data ndd_1;
if 0 then set gdwh30.tb0_default today ;
if _n_=1 then do;
declare hash k(dataset:"gdwh30.tb0_default");
k.definekey ("posting_date");
k.definedata ("counterparty_id", "posting_date", "last_out_of_default_date", "last_default_date");
k.definedone();
declare hash j(dataset:"today");
j.definekey("today1");
j.definedata ("today1");
j.definedone();
end;
set ndd_1;
if k.find(key:posting_date)=0 and j.find(key:today1)=0 then output;
run;
What I think I must do is format the columns before the _N_=1 block.
For the second join, on TB0_account, I want to attempt a fuzzy merge, so any help will be greatly appreciated.
Today1 is &tt., but stored in a table instead of a list. I am first trying to join the A part just to get a feel for these hash joins.
Posting_date is of the form 31AUG2016:00:00:00

Is there an sql query to count the number of people in a particular year, knowing the date of birth and the date of death of each person?

I have a table showing the name, the date of birth and the date of death of people (1900-2000). I need to know the number of people for each year in a certain period of time, for example, in 1940 the population was 2.3 billion, in 1941 2.4 billion, in 1942 2.2 billion and so on until 1950.
I work in SAS Enterprise Guide and maybe the code will look a little different than normal sql. At least I want to see something like this:
count of people | year
2.300.000.000 |1940
2.400.000.000 |1941
.....................
select
count(name),
from db
where bd<1jan1940 and dd>=1jan1940 and dd=<31dec1940
group by month
First of all, you must know the initial population at the end of 1899. Let's say that was 2 billion. Then add births minus deaths for each year. (You must access the table twice in order to do this, once for births and once for deaths.) Use SUM OVER to get a running total.
I am not sure which DBMS you are actually using, but this is pretty standard SQL:
select yr, 2000000000 + sum(births.cnt - deaths.cnt) over (order by yr)
from
(
select extract(year from bd) as yr, count(*) as cnt
from db
group by extract(year from bd)
) births
join
(
select extract(year from dd) as yr, count(*) as cnt
from db
group by extract(year from dd)
) deaths using (yr)
order by yr;
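To make the running-total idea concrete, here is a small Python sketch of the same arithmetic: count births and deaths per year, then accumulate births minus deaths on top of a starting population. The function name population_by_year and the four sample people are invented for the illustration, and the result is the population at the end of each year.

```python
from collections import Counter


def population_by_year(people, first_year, last_year, initial_population=0):
    """people is a list of (birth_year, death_year or None) pairs."""
    births = Counter(born for born, _ in people)
    deaths = Counter(died for _, died in people if died is not None)

    total = initial_population
    result = {}
    for year in range(first_year, last_year + 1):
        # Running total: add that year's births, subtract that year's deaths.
        total += births[year] - deaths[year]
        result[year] = total
    return result


# Hypothetical sample: four people, starting from an empty population.
people = [(1900, 1941), (1939, None), (1940, 1940), (1941, None)]
population = population_by_year(people, 1900, 1942)
for year in (1938, 1939, 1940, 1941, 1942):
    print(year, population[year])
```

One difference from the SQL above: a Counter returns 0 for a year with no births or no deaths, whereas the inner join of the two subqueries drops any year that is missing from either side.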
data dob_data;
do i = 1 to 10000;
num = ceil(rand('UNIFORM',0,10));
dob = intnx('day','01JAN1899'd,ceil(rand('UNIFORM',1,36865)));
select (num);
when (1) dod = intnx('day',dob,ceil(rand('UNIFORM',1,36865)));
otherwise dod = .;
end;
output;
end;
format dob dod date9.;
drop num;
run;
data calendar;
do i=0 to 100;
year = 1900+i;
soy = intnx('year','01JAN1900'd,i,'s');
eoy = intnx('year','01JAN1900'd,i,'e');
output;
end;
format soy eoy date9.;
run;
proc sql;
create table pop as
select year,
sum(case when DOB < soy and coalesce(DOD,'31DEC2200'd) ge soy then 1 else 0 end) as Alive_At_Start,
sum(case when DOB between soy and eoy then 1 else 0 end) as Born_During,
sum(case when coalesce(DOD,'31DEC2200'd) between soy and eoy then -1 else 0 end) as Passed,
sum(case when DOB le eoy and coalesce(DOD,'31DEC2200'd) > eoy then 1 else 0 end) as Alive_At_End
from dob_data t1, calendar t2
group by year;
quit;

plsql Trying to use number variable (year) to find fiscal year

Thanks for any help,
I am trying to build a stored procedure, in pl/sql, where i can load data from the fiscal year specified.
For example, fyloader(2013) would load pe.contact_date between 10/01/2012 to 09/30/2013. How can I do that?
Thank you again,
Will
create or replace procedure fyloader (
p_fiscal_year number
) as
begin
insert into schemax.tablex
(select distinct pe.pat_id as PATID,
pat.pat_mrn_id as PATMRNID,
pe.pcp_prov_id as PCPPROVID,
pe.department_id as DEPARTMENTID,
pe.contact_date as CONTACTDATE,
trunc((TO_DATE ('30-09-2013 23:59:59', 'DD-MM-YYYY HH24:MI:SS') - pat.birth_date)/365.25) AS AGE,
'' as AGEGROUP,
ce.financial_class as FINCLASS
from pat_enc pe
inner join patient pat on pe.pat_id = pat.pat_id
inner join clarity_dep b on b.department_id = pe.department_id
left join clarity_ser cs on pe.visit_prov_id = cs.prov_id
left join zc_def_division zc on cs.def_division_c = zc.def_division_c
inner join hsp_account ha on pe.hsp_account_id = ha.hsp_account_id
inner join clarity_epm ce on ha.primary_payor_id = ce.payor_id
where
pe.enc_type_c = '101' --include office visits only (add 1001 if you require VirVis)
and (pe.appt_status_c = 2 OR pe.appt_status_c is null)
and (pat.pat_status_c =1 or nvl(pat.pat_status_c,99999) = 99999)
and pe.contact_date between TO_DATE('1-OCT-2012','DD-MON-YY') and TO_DATE('30-SEP-2013','DD-MON-YY')
and (zc.def_division_c in (26,91,266,329,330,331,332,333,389,402)
or pe.visit_prov_id = pe.pcp_prov_id)
and ce.financial_class in ('1','3')
and b.DEPARTMENT_ID in (200101,200102,200104,200201,200202,200204,
200220,200301,200302,200304,200319,200401,200402,200404,200501,200601,
200602,200701,200702,200704,200801,200802,200804,200911,200912,200913,
200914,200916,200917,200921,200923,200924,200925,200926,200927,200928,
201002,201201,201202,202101,202102,202104,202108,202301,202302,202308,
234005,230407,290109));
commit;
end;
You can use the 'to_date' and 'add_months' functions to get the result you need. Assuming your financial year is 01-Jul -> 30-Jun the below code will give you what you want.
declare
p_input_year number := 2014;
v_fy_start_date date;
v_fy_end_date date;
begin
v_fy_start_date := add_months(to_date('0101'||p_input_year,'DDMMYYYY'), -6);
v_fy_end_date := add_months(to_date('0101'||p_input_year,'DDMMYYYY'), 6) - 1;
dbms_output.put_line('FY Start: ' || v_fy_start_date);
dbms_output.put_line('FY End: ' || v_fy_end_date);
end;
The results of dbms_output being:
FY Start: 01-JUL-13
FY End: 30-JUN-14
If your financial year dates are different, you will need to adjust the -6 / 6 passed to add_months. For the October-to-September year in the question, that would be -3 and 9.
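The date arithmetic is easy to check outside the database. Below is a hedged Python sketch of the same calculation; fiscal_year_bounds and its start_month parameter are names I made up, and it assumes the fiscal year is named after the calendar year in which it ends (as in fyloader(2013) covering 10/01/2012 to 09/30/2013).

```python
from datetime import date, timedelta


def fiscal_year_bounds(fiscal_year, start_month=10):
    """Return (first_day, last_day) of the fiscal year ending in fiscal_year."""
    if start_month == 1:
        # A fiscal year starting in January is just the calendar year.
        start = date(fiscal_year, 1, 1)
    else:
        start = date(fiscal_year - 1, start_month, 1)
    # The last day is the day before the next fiscal year starts.
    end = date(start.year + 1, start_month, 1) - timedelta(days=1)
    return start, end


print(fiscal_year_bounds(2013))      # October start, as in the question
print(fiscal_year_bounds(2014, 7))   # July start, as in the answer
```

In the procedure itself the two bounds would replace the hard-coded TO_DATE literals in the BETWEEN condition.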

Counting records given criteria

I give up! I've been trying to make this work for some time now, but I can't get the logic and/or code right.
What I'm trying to do is to count ongoing cases for every date in a defined period. I have a table looking basically like this:
CaseId OpenDate CloseDate
1 01JAN2014 05JAN2014
2 02JAN2014 04JAN2014
3 02JAN2014 .
4 03JAN2014 04JAN2014
5 06JAN2014 08JAN2014
6 07JAN2014 .
I created a data set iterating dates from today-30 to today as comparative dates (CompDate).
Definition of ongoing case is (CompDate <= OpenDate and CompDate < CloseDate) or CloseDate = .
My plan was to join the dates with the table and get something like this
CompDate OngoingCases
01JAN2014 1
02JAN2014 3
03JAN2014 4
04JAN2014 2
05JAN2014 1
06JAN2014 2
07JAN2014 3
So far I've come up with this code, which gives me something else.
proc sql;
create table Ongoing as
select distinct
a.CompDate,
count(distinct case when (a.CompDate <= datepart(b.caseopendate) and (a.CompDate < datepart(b.caseclosedate) or b.caseclosedate = . )) then b.caseid end) as Cases
from List_of_dates as a
left outer join dcms_cases as b
on a.Date
where
.
.
group by a.Date
;
quit;
I think all you need is to join the two tables, applying your date condition in the ON clause. The following SQL will do the trick:
proc sql;
create table Ongoing as
select a.CompDate, count(b.CaseId) as OngoingCases
from List_of_dates a
left join dcms_cases b
on b.OpenDate<=a.CompDate and (b.CloseDate>a.CompDate or b.CloseDate=.)
group by a.CompDate
;
quit;
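As a sanity check on that join condition, here is a small Python sketch that applies the same rule (opened on or before the comparison date, and either still open or closed after it) to the sample cases from the question. The function name ongoing_cases is just for illustration.

```python
from datetime import date, timedelta

# Sample cases from the question: (CaseId, OpenDate, CloseDate or None).
cases = [
    (1, date(2014, 1, 1), date(2014, 1, 5)),
    (2, date(2014, 1, 2), date(2014, 1, 4)),
    (3, date(2014, 1, 2), None),
    (4, date(2014, 1, 3), date(2014, 1, 4)),
    (5, date(2014, 1, 6), date(2014, 1, 8)),
    (6, date(2014, 1, 7), None),
]


def ongoing_cases(cases, first_day, last_day):
    """Count cases that are open on each day in [first_day, last_day]."""
    counts = {}
    day = first_day
    while day <= last_day:
        counts[day] = sum(
            1 for _, opened, closed in cases
            if opened <= day and (closed is None or closed > day)
        )
        day += timedelta(days=1)
    return counts


counts = ongoing_cases(cases, date(2014, 1, 1), date(2014, 1, 7))
for day, n in counts.items():
    print(day.isoformat(), n)
```

For 01JAN2014 through 07JAN2014 this gives 1, 3, 4, 2, 1, 2, 3, which is the table the question asks for.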
I made a few assumptions:
1) You meant to say the comp was greater than open date but less than close date or closed date is missing
2) You want the final result set to show how many ongoing cases were there based off the open date rather than the comp date since the comp date would be one static value... based on your description.
I have code below that used your sample and named it 'AA1'. It will give you the result set below.
Data AA1;
set AA1;
Format CompDate date9. Ongoing $3.;
CompDate = today() - 100; /*Change this to whatever your criteria is for the comparison*/
If ((CompDate >= opendate) and (CompDate < closedate)) or (closedate = .) then Ongoing = 'Y'; else Ongoing = 'N';
run;
/*Sort table in order to do a count of the ongoing cases*/
proc sort data = AA1;
by caseID CompDate Ongoing;
run;
/*Count how many ongoing cases exist based on the Open date values */
data AA1(rename= (count=OngoingCases));
set AA1;
count + 1;
by opendate;
if first.opendate then count = 1;
run;
/*Clean up to keep the variables you want in your result set */
data final;
set AA1(keep=opendate ongoingCases);
run;
One way is to iterate over your date range for every record in your dataset, and output records which satisfy your criteria...
%LET START = today()-30 ;
%LET END = today() ;
data want ;
set have ;
do CompDate = &START to &END ;
OngoingCase = (OpenDate <= CompDate < CloseDate)
or (OpenDate <= CompDate and missing(CloseDate)) ;
if OngoingCase then output ;
end ;
format CompDate date9. ;
run ;
proc summary data=want nway ;
class CompDate ;
var OngoingCase ;
output out=case_sum (drop=_:) sum= ;
run ;
I modified the current query slightly and moved the CASE condition to the WHERE clause.
proc sql;
create table Ongoing as
select distinct
a.CompDate,
count(distinct b.caseid ) as Cases
from List_of_dates as a
left outer join dcms_cases as b
on a.Date
where (a.CompDate <= datepart(b.caseopendate) and a.CompDate < datepart(b.caseclosedate))
or b.caseclosedate = .
group by a.Date
;
quit;

Comparing Two Sets of Date Ranges in SQL

I have two sets of data with different date ranges.
Tbl 1:
ID, Date_Start, Date_End
1, 2010-01-01, 2010-01-09
1, 2010-01-10, 2010-01-19
1, 2010-01-30, 2010-01-31
Tbl 2:
ID, Date_Start, Date_End
1, 2010-01-01, 2010-01-04
1, 2010-01-08, 2010-01-17
1, 2010-01-30, 2010-01-31
I'd like to find cases where date ranges in Tbl 1 are not entirely covered by date ranges in Tbl 2. So for instance, in this example, I'd like output that looks something like this --
Output:
ID, Gap_Start, Gap_End
1, 2010-01-05, 2010-01-07
1, 2010-01-18, 2010-01-19
Date ranges will never overlap within a table. To do this, I'm using either DB2 SQL or SAS. Unfortunately, the datasets are big enough (millions of records) that I can't just brute force it.
Thank you!
Following on from Jon of All Trades' approach, this is a more complete solution. The crucial features are:
Use an auxiliary calendar table, which is just a list of all dates.
From the calendar table, JOIN to Tbl1 to get a list of dates which are in range.
Also do an anti-JOIN to Tbl2 to get only the dates which aren't in Tbl2's ranges.
I've enclosed those results in a Common Table Expression (CTE) called OutDates.
Define another CTE based on OutDates to get just the dates which start a gap; call this EarliestDates.
Define another CTE based on OutDates to get just the dates which end a gap; call this LatestDates.
JOIN EarliestDates and LatestDates to put each gap into a single row.
WITH
OutDates(ID, dt) AS
( SELECT Tbl1.ID, Calendar.dt FROM Calendar
INNER JOIN Tbl1 ON Calendar.dt BETWEEN Tbl1.Date_Start AND Tbl1.Date_End
LEFT OUTER JOIN Tbl2 ON Calendar.dt BETWEEN Tbl2.Date_Start AND Tbl2.Date_End
WHERE Tbl2.ID IS NULL
)
,
EarliestDates AS
( SELECT earliest.ID, earliest.dt FROM OutDates earliest
LEFT OUTER JOIN OutDates nonesuch_earlier ON DateAdd(day, -1, earliest.dt) = nonesuch_earlier.dt
WHERE nonesuch_earlier.ID IS NULL
)
,
LatestDates AS
( SELECT latest.ID, latest.dt FROM OutDates latest
LEFT OUTER JOIN OutDates nonesuch_later ON DATEADD(day, 1, latest.dt) = nonesuch_later.dt
WHERE nonesuch_later.ID IS NULL
)
SELECT rangestart.ID, rangestart.dt AS Gap_Start, rangeend.dt AS Gap_End
FROM EarliestDates rangestart JOIN LatestDates rangeend
ON rangestart.dt <= rangeend.dt
LEFT OUTER JOIN EarliestDates nonesuch_inner1
ON nonesuch_inner1.dt <= rangeend.dt AND nonesuch_inner1.dt > rangestart.dt
LEFT OUTER JOIN LatestDates nonesuch_inner2
ON nonesuch_inner2.dt >= rangestart.dt AND nonesuch_inner2.dt < rangeend.dt
WHERE nonesuch_inner1.dt IS NULL AND nonesuch_inner2.dt IS NULL
This is a working implementation using SQL Server syntax for the common table expressions, but it should be easy to convert to DB2 syntax. I don't know how well it will scale, to be honest; I've only tested it with a very small dataset.
I don't think there is an efficient and general solution for all cases. Under certain circumstances, however, we can figure out some efficient ones. For instance, the code below assumes that: (1) datasets one and two have the same set of ids in the same order; and (2) the range of possible dates is relatively short (assumed here to be the dates in the year 2010 only). Notice that one input range may generate two gaps.
/* test data */
data one;
input id1 (start1 finish1) (:anydtdte.);
format start1 finish1 e8601da.;
cards;
1 2010-01-01 2010-01-09
1 2010-01-10 2010-01-19
1 2010-01-30 2010-01-31
2 2010-01-02 2010-01-10
;
run;
data two;
input id2 (start2 finish2) (:anydtdte.);
format start2 finish2 e8601da.;
cards;
1 2010-01-01 2010-01-04
1 2010-01-08 2010-01-17
1 2010-01-30 2010-01-31
2 2010-01-05 2010-01-06
;
run;
/* assumptions:
(1) datasets one and two have the same set of ids in the same
sorted order;
(2) only possible dates are in the year of 2010
*/
%let minDate = %sysevalf('01jan2010'd - 1);
%let maxDate = %sysevalf('31dec2010'd + 1);
data gaps;
array inRange[&minDate:&maxDate] _temporary_;
array covered[&minDate:&maxDate] _temporary_;
do i = &minDate to &maxDate; inRange[i] = 0; covered[i] = 0; end;
do until (last.id1);
set one;
by id1;
do i = start1 to finish1; inRange[i] = 1; end;
end;
do until (last.id2);
set two;
by id2;
do i = start2 to finish2; covered[i] = 1; end;
end;
format startGap finishGap e8601da.;
startGap = .;
finishGap = .;
do i = &minDate+1 to &maxDate;
if inRange[i] and not covered[i] and missing(startGap) then startGap = i;
if (covered[i] or not inRange[i]) and not missing(startGap) and not covered[i-1] then do;
finishGap = i - 1;
output;
call missing(startGap, finishGap);
keep id1 startGap finishGap;
end;
end;
run;
/* check */
proc print data=gaps noobs;
run;
/* on lst
id1 startGap finishGap
1 2010-01-05 2010-01-07
1 2010-01-18 2010-01-19
2 2010-01-02 2010-01-04
2 2010-01-07 2010-01-10
*/
This is not a complete solution, as it returns a list of dates rather than ranges, but maybe it will be of use:
SELECT
R1.ID, D.Date
FROM
#Ranges1 AS R1
INNER JOIN Dates AS D ON D.Date BETWEEN R1.StartDate AND R1.EndDate
EXCEPT
SELECT
R2.ID, D.Date
FROM
#Ranges2 AS R2
INNER JOIN Dates AS D ON D.Date BETWEEN R2.StartDate AND R2.EndDate
Note that this solution requires a dates table: a table with one record per day, for all the dates you're likely to use. It has the advantages of being succinct, and handling overlapping date ranges (not necessary in your case, but maybe for the next guy).
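To show how the list of dates can be turned back into ranges, here is a rough Python sketch: expand both tables to (ID, day) pairs, take the set difference (the EXCEPT step), then collapse runs of consecutive days. The names expand and gaps are mine, and this is an illustration on the sample data rather than something tuned for millions of records.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def expand(ranges):
    """Return the set of (id, day) pairs covered by (id, start, end) ranges."""
    days = set()
    for id_, start, end in ranges:
        day = start
        while day <= end:
            days.add((id_, day))
            day += ONE_DAY
    return days


def gaps(ranges_1, ranges_2):
    """Days covered by ranges_1 but not ranges_2, collapsed into ranges."""
    uncovered = sorted(expand(ranges_1) - expand(ranges_2))
    result = []
    for id_, day in uncovered:
        last = result[-1] if result else None
        if last and last[0] == id_ and last[2] + ONE_DAY == day:
            result[-1] = (id_, last[1], day)  # extend the current gap
        else:
            result.append((id_, day, day))    # start a new gap
    return result


tbl_1 = [(1, date(2010, 1, 1), date(2010, 1, 9)),
         (1, date(2010, 1, 10), date(2010, 1, 19)),
         (1, date(2010, 1, 30), date(2010, 1, 31))]
tbl_2 = [(1, date(2010, 1, 1), date(2010, 1, 4)),
         (1, date(2010, 1, 8), date(2010, 1, 17)),
         (1, date(2010, 1, 30), date(2010, 1, 31))]

found_gaps = gaps(tbl_1, tbl_2)
for id_, gap_start, gap_end in found_gaps:
    print(id_, gap_start.isoformat(), gap_end.isoformat())
```

On the sample tables from the question this yields the two gaps 2010-01-05 to 2010-01-07 and 2010-01-18 to 2010-01-19.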
For what it's worth, this is the method I ended up using. I think you could do it in pure SQL, but it got horrifically ugly and difficult to debug.
Step 1 -- I consolidated the date ranges in both datasets. This means that something like
ID, Start_Date, End_Date
1, 2010-01-01, 2010-01-31
1, 2010-02-01, 2010-02-28
got transformed into this --
ID, Start_Date, End_Date
1, 2010-01-01, 2010-02-28.
The query I used to produce this was --
WITH Cte_recomb (Id, Start_date, End_date, Hopcount) AS
(SELECT Id,
Start_date,
End_date,
1 AS Hopcount
FROM Table1
UNION ALL
SELECT Cte_recomb.Id,
Cte_recomb.Start_date,
Table1.End_date,
(Cte_recomb.Hopcount + 1) AS Hopcount
FROM Cte_recomb, Table1
WHERE (Cte_recomb.Id = Table1.Id) AND
(Cte_recomb.End_date + 1 day = Table1.Start_date)),
Cte_maxenddate AS
(SELECT Id,
Start_date,
Max (End_date) AS End_date
FROM Cte_recomb
GROUP BY Id, Start_date
ORDER BY Id, Start_date)
SELECT Maxend.*
FROM Cte_maxenddate AS Maxend
LEFT JOIN
Cte_recomb AS Nextrec
ON (Nextrec.Id = Maxend.Id) AND
(Nextrec.Start_date < Maxend.Start_date) AND
(Nextrec.End_date >= Maxend.End_date)
WHERE Nextrec.Id IS NULL;
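The consolidation in Step 1 can also be expressed as a single sorted pass, which may be easier to reason about than the recursive query. This Python sketch is only an illustration of the idea; the function name consolidate and the third sample row are made up.

```python
from datetime import date, timedelta


def consolidate(ranges):
    """Merge (id, start, end) ranges that touch or overlap within each id."""
    merged = []
    for id_, start, end in sorted(ranges):
        prev = merged[-1] if merged else None
        if prev and prev[0] == id_ and prev[2] + timedelta(days=1) >= start:
            # Adjacent or overlapping: extend the previous range.
            merged[-1] = (id_, prev[1], max(prev[2], end))
        else:
            merged.append((id_, start, end))
    return merged


ranges = [
    (1, date(2010, 1, 1), date(2010, 1, 31)),
    (1, date(2010, 2, 1), date(2010, 2, 28)),
    (1, date(2010, 4, 1), date(2010, 4, 30)),  # hypothetical, not adjacent
]
consolidated = consolidate(ranges)
for id_, start, end in consolidated:
    print(id_, start.isoformat(), end.isoformat())
```

The January and February rows collapse into one range from 2010-01-01 to 2010-02-28, while the April row stays separate because March is not covered.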
Step 2 --
I produced another dataset that created a record for every overlap between the two datasets. You'll need an additional step to find cases where a given record in Table1 doesn't have a corresponding record in Table2 at all.
SELECT Table1.Id,
Table1.Start_date AS Table1_start_date,
Table1.End_date AS Table1_end_date,
Table2.Start_date AS Table2_start_date,
Table2.End_date AS Table2_end_date
FROM Table1
INNER JOIN
Table2
ON (Table1.Id = Table2.Id) AND
( (Table1.Start_date BETWEEN Table2.Start_date AND Table2.End_date) OR
(Table2.Start_date BETWEEN Table1.Start_date AND Table1.End_date)) AND
( (Table1.Start_date <> Table2.Start_date) OR
(Table1.End_date <> Table2.End_date))
ORDER BY Table1.Id, Table1.Start_date, Table2.Start_date;
Step 3 --
I take the above dataset, and run the following SAS job. I tried to do this in pure SQL with recursive queries, but it got uglier and uglier every time I looked at it.
Data Table1_Gaps;
Set Table1_Compare;
By ID Table1_Start_Date Table2_Start_Date;
format Gap_Start_Date yymmdd10.;
format Gap_End_Date yymmdd10.;
format Old_Start_Date yymmdd10.;
format Old_End_Date yymmdd10.;
Retain Old_Start_Date Old_End_Date;
IF (Table2_End_Date = .) then do;
Gap_Start_Date = Table1_Start_Date;
Gap_End_Date = Table1_End_Date;
output;
end;
else do;
If (Table2_Start_Date > Table1_Start_Date) then do;
if first.Table1_Start_Date then do;
Gap_Start_Date = Table1_Start_Date;
Gap_End_Date = Table2_Start_Date - 1;
output;
end;
else do;
Gap_Start_Date = Old_End_Date + 1;
Gap_End_Date = Table2_Start_Date - 1;
output;
end;
end;
If (Table2_End_Date < Table1_End_Date) then do;
if Last.Table1_Start_Date then do;
Gap_Start_Date = Table2_End_Date + 1;
Gap_End_Date = Table1_End_Date;
output;
end;
end;
end;
Old_Start_Date = Table2_Start_Date;
Old_End_Date = Table2_End_Date;
drop Old_Start_Date Old_End_Date;
run;
I haven't verified it entirely yet, but this approach does seem to have given me the results I wanted. Any thoughts?