Efficient rolling sum (window aggregate) in SAS - sql

I have two tables:
tb_payments: contract_id, payment_date, payment_value
tb_reference: contract_id, reference_date
For each (contract_id, reference_date) in tb_reference, I want to create a column sum_payments as the 90-day rolling sum of payment_value from tb_payments. I can accomplish this (very inefficiently) with the query below:
%let window=90;
proc sql;
  create index contract_id on tb_payments(contract_id);
quit;
proc sql;
  create table tb_rolling as
  select a.contract_id,
         a.reference_date,
         (select sum(b.payment_value)
          from tb_payments as b
          where a.contract_id = b.contract_id
            and a.reference_date - &window. < b.payment_date
            and b.payment_date <= a.reference_date
         ) as sum_payments
  from tb_reference as a;
quit;
How can I rewrite this to reduce the time complexity, using proc sql or SAS data step?
Edit with more info:
I chose 90 days as the window arbitrarily, but I will perform the calculation for several windows. A solution that can handle several windows at the same time would be ideal.
Both tables can have 10+ million rows, and the data is completely arbitrary. My SAS server is quite powerful, though.
Contract_ids can be repeated in both tables
The pairs (contract_id, reference_date) and (contract_id, payment_date) are unique
Edit with sample data:
%let seed=1111;
data tb_reference (drop=i);
call streaminit(&seed.);
do i = 1 to 10000;
contract_id = round(rand('UNIFORM')*1000000,1);
output;
end;
run;
proc surveyselect data=tb_reference out=tb_payments n=5000 seed=&seed.; run;
data tb_reference(drop=i);
format reference_date date9.;
call streaminit(&seed.);
set tb_reference;
do i = 1 to 1+round(rand('UNIFORM')*4,1);
reference_date = '01jan2016'd + round(rand('UNIFORM')*1000,1);
output;
end;
run;
proc sort data=tb_reference nodupkey; by contract_id reference_date; run;
data tb_payments(drop=i);
format payment_date date9. payment_value comma20.2;
call streaminit(&seed.);
set tb_payments;
do i = 1 to 1+round(rand('UNIFORM')*20,1);
payment_date = '01jan2015'd + round(rand('UNIFORM')*1365,1);
payment_value = round(rand('UNIFORM')*3333,0.01);
output;
end;
run;
proc sort data=tb_payments nodupkey; by contract_id payment_date; run;
Update:
I compared my naive solution to both proposals from Quentin and Tom.
The merge method is quite fast and achieved over 10x speedup for n=10000. It is also very powerful, as beautifully demonstrated by Tom in his answer.
Hash tables are insanely fast and achieved over 500x speedup. Because my datasets are large, this is the way to go, but there's a catch: they need to fit in RAM.
If anyone needs the full testing code, feel free to send me a message.

Here's an example of a hash approach. Since your data are already sorted, I don't think there is much benefit to the hash approach over Tom's merge approach.
The general idea is to read all of the payment data into a hash table (you may run out of memory if your real data are too big), then read through the data set of reference dates. For each reference date, you look up all of the payments for that contract_id and iterate through them, testing whether the payment date is less than 90 days before the reference_date and conditionally incrementing sum_payments.
It should be noticeably faster than the SQL approach in your question, but it could lose to the MERGE approach. If the data were not sorted in advance, it might beat the combined time of sorting both big datasets and then merging. It also handles multiple payments on the same date.
data want;
  * Initialize host variables for the hash table;
  call missing(payment_date, payment_value);
  * Load a hash table with all of the payment data;
  if _n_=1 then do;
    declare hash h(dataset:"tb_payments", multidata:"yes");
    h.defineKey("contract_id");
    h.defineData("payment_date","payment_value");
    h.defineDone();
  end;
  * Read in the reference dates;
  set tb_reference (keep=contract_id reference_date);
  * For each reference date, look up all the payments for that contract_id and
    iterate through them. If the payment date is < 90 days before the reference
    date, then increment sum_payments;
  sum_payments=0;
  rc=h.find();
  do while (rc = 0); * found a record;
    if 0 <= (reference_date-payment_date) < 90 then sum_payments = sum_payments + payment_value;
    rc=h.find_next();
  end;
run;
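The question also mentions wanting several windows at once. That is not part of the answer above, but here is a minimal sketch of how the same hash lookup could fill several window sums in one pass; the 30/60/90-day windows and the sum_30/sum_60/sum_90 names are my own assumptions, not anything from the original code:
data want_multi;
  * Initialize host variables for the hash table;
  call missing(payment_date, payment_value);
  if _n_=1 then do;
    declare hash h(dataset:"tb_payments", multidata:"yes");
    h.defineKey("contract_id");
    h.defineData("payment_date","payment_value");
    h.defineDone();
  end;
  set tb_reference (keep=contract_id reference_date);
  array windows(3) _temporary_ (30 60 90);   * assumed window lengths in days;
  array sums(3) sum_30 sum_60 sum_90;
  do i=1 to dim(sums);
    sums(i)=0;
  end;
  rc=h.find();
  do while (rc=0);
    do i=1 to dim(sums);
      if 0 <= (reference_date-payment_date) < windows(i) then sums(i)=sums(i)+payment_value;
    end;
    rc=h.find_next();
  end;
  drop i rc payment_date payment_value;
run;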

It probably is possible to do this all with PROC EXPAND if you have it licensed. But let's look at how to do it without that.
It shouldn't be that hard if all of the dates are present in the PAYMENTS table. Just merge the two tables by ID and DATE. Calculate the running sum, but with the wrinkle of also subtracting out the value that is rolling out the back of the window. Then just keep the dates that are in the reference file.
One issue might be the need to find all possible dates for a CONTRACT_ID so that the LAG() function can be used. That is easy to do with PROC SUMMARY (or PROC MEANS).
proc summary data=tb_payments nway ;
by contract_id ;
var payment_date;
output out=tb_id_dates(drop=_:) min=date1 max=date2 ;
run;
And a data step. This step could also be a view instead.
data tb_id_dates_all;
  set tb_id_dates;
  do date=date1 to date2;
    output;
  end;
  format date date9.;
  keep contract_id date;
run;
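Since this step generates one row per contract per calendar day, the view alternative mentioned above may be worth using so the expanded date list is never written to disk; the same step written as a view is simply:
data tb_id_dates_all / view=tb_id_dates_all;
  set tb_id_dates;
  do date=date1 to date2;
    output;
  end;
  format date date9.;
  keep contract_id date;
run;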
Now just merge the three datasets and calculate the cumulative sums. Note that I included a do loop to accumulate multiple payments on a single day (remove the nodupkey in your sample data generation code to test it).
If you want to generate multiple windows then you will need multiple actual LAG() function calls.
data want;
  do until (last.contract_id);
    do until (last.date);
      merge tb_id_dates_all
            tb_payments(rename=(payment_date=date))
            tb_reference(rename=(reference_date=date) in=in2)
      ;
      by contract_id date;
      payment=sum(0,payment,payment_value);
    end;
    day_num=sum(day_num,1);
    array lag_days(5) _temporary_ (7 30 60 90 180);
    array lag_payment(5) _temporary_;
    array cumm(5) cumm_7 cumm_30 cumm_60 cumm_90 cumm_180;
    lag_payment(1) = lag7(payment);
    lag_payment(2) = lag30(payment);
    lag_payment(3) = lag60(payment);
    lag_payment(4) = lag90(payment);
    lag_payment(5) = lag180(payment);
    do i=1 to dim(cumm);
      cumm(i)=sum(cumm(i),payment);
      if day_num > lag_days(i) then cumm(i)=sum(cumm(i),-lag_payment(i));
      if .z < abs(cumm(i)) < 1e-5 then cumm(i)=0;
    end;
    if in2 then output;
  end;
  keep contract_id date cumm_:;
  format cumm_: comma20.2;
  rename date=reference_date;
run;
If you want to make the code flexible for the number of windows you will need to add some code generation to create the LAGxx() function calls. For example you could use this macro:
%macro lags(windows);
  %local i n lag;
  %let n=%sysfunc(countw(&windows));
  array lag_days(&n) _temporary_ (&windows);
  array lag_payment(&n) _temporary_;
  array cumm(&n)
  %do i=1 %to &n;
    %let lag=%scan(&windows,&i);
    cumm_&lag
  %end;
  ;
  %do i=1 %to &n;
    %let lag=%scan(&windows,&i);
    lag_payment(&i) = lag&lag(payment);
  %end;
%mend lags;
Then replace the ARRAY statements and the LAGxx() assignment statements above with this call to the macro (a sketch of the resulting data step follows the call):
%lags(7 30 60 90 180)
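To be explicit, this is my reading of that substitution; the step below is just the data step from above with the arrays and LAGxx() assignments replaced by the macro call, nothing else changed:
data want;
  do until (last.contract_id);
    do until (last.date);
      merge tb_id_dates_all
            tb_payments(rename=(payment_date=date))
            tb_reference(rename=(reference_date=date) in=in2)
      ;
      by contract_id date;
      payment=sum(0,payment,payment_value);
    end;
    day_num=sum(day_num,1);
    %lags(7 30 60 90 180)
    do i=1 to dim(cumm);
      cumm(i)=sum(cumm(i),payment);
      if day_num > lag_days(i) then cumm(i)=sum(cumm(i),-lag_payment(i));
      if .z < abs(cumm(i)) < 1e-5 then cumm(i)=0;
    end;
    if in2 then output;
  end;
  keep contract_id date cumm_:;
  format cumm_: comma20.2;
  rename date=reference_date;
run;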

Related

How to do a group by count on multiple columns in SAS (data step)?

I already have the answer in proc sql but I need the data step version of my code. If someone could please help me convert it, I would be grateful.
PROC SQL;
CREATE TABLE CARS AS
SELECT Origin, Type, Cylinders, DriveTrain, COUNT(*) AS COUNT
FROM SASHELP.CARS
group by Origin, Type, Cylinders, DriveTrain;
QUIT;
A data step would not be the appropriate solution here; PROC FREQ is the idiomatic SAS solution.
proc freq data=sashelp.cars;
table origin*type*cylinders*drivetrain / out=cars list;
run;
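One small note, not part of the original answer: the OUT= data set from PROC FREQ also carries a PERCENT variable, which you can drop with a data set option if you only want the counts:
proc freq data=sashelp.cars;
  table origin*type*cylinders*drivetrain / out=cars(drop=percent) list;
run;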
For completeness, here's a data step approach. Very much not recommended:
1. Sort the data set first by the grouping variables
2. Use BY-group processing in the data step to identify the groups of interest
3. Use RETAIN to hold the count across rows
4. Use FIRST./LAST. to reset the counter and output the total
*sort for BY statement is required;
proc sort data=sashelp.cars out=cars_sorted;
by origin type cylinders drivetrain;
run;
data cars_count;
set cars_sorted;
by origin type cylinders drivetrain;
*RETAIN tells SAS to keep this variable across rows, otherwise it resets for each observation;
retain count;
*if first in category set count to 0;
if first.drivetrain then count=0;
*increment count for each record (implicit retain so RETAIN is not actually required here);
count+1;
*if last of the group then output the total count for that group;
if last.drivetrain then output;
*keep only variables of interest;
keep origin type cylinders drivetrain count;
run;
*display results;
proc print data=cars_count;
run;
As long as none of your key variables have missing values and the full summary table will fit into your available memory, you could use a data step HASH.
That eliminates the need to pre-sort the data.
data _null_;
  set sashelp.cars end=eof;
  if _n_=1 then do;
    declare hash h(ordered:'yes');
    rc=h.definekey('Origin','Type','Cylinders','DriveTrain');
    rc=h.definedata('Origin','Type','Cylinders','DriveTrain','count');
    rc=h.definedone();
  end;
  if h.find() then count=0;
  count+1;
  rc=h.replace();
  if eof then rc=h.output(dataset:'cars2');
run;

SAS SQL Loop Inputting Sequential Files

I have:
6 files, named as follows: "ROSTER2008", "ROSTER2009", ..., "ROSTER2013"
Each file has these variables: TEAMID and MATE1, MATE2, ..., MATEX. The names of the teammates are stored for each team, through to teammate X.
Now: I want to loop through code that reads in the 6 files and creates one output file with TEAM, MATES2008, MATES2009, ..., MATES2013, where the MATES20XX variables contain the number of teammates on each team in that respective year (I no longer care about their names).
Here is what I've tried to do:
%macro sqlloop(start, end);
proc sql;
%DO year=&start. %TO &end.;
create table mates_by_team_&year. as
select distinct put(teamID, z2.) as team,
count(*) as mates
from myLibrary.rosters
where d='ROSTER(HOW DO I PUT YEAR HERE?)'.d;
%end;
quit;
%mend;
%sqlloop(start=2008, end=2013)
SO, here are my questions:
How do I fix the 'where' statement such that it understands the file name I intend to pull in on each iteration? (That is ROSTER2008,...,ROSTER2013)
Right now, I am creating 6 different tables for each file I input. How can I, instead, join these files into one table as I have described above?
Thank you in advance for your help!
Xtina
For the sake of simplicity I have assumed that your X in MATEX never goes beyond 999. Obviously you can increase the limit or you can use proc contents or dictionary tables to find out the max limit.
I have pulled all the variables MATE1-MATE999 into an array, and I check whether each one is missing, incrementing the counter to count the number of teammates.
%macro loops(start=,end=);
  %do year=&start. %to &end.;
    data year_&year.(keep=teamid Mates_&year.);
      set roster&year.;
      array mates{*} mate1-mate999;
      Mates_&year.=0;
      do i=1 to dim(mates);
        if not missing(mates{i}) then Mates_&year.=Mates_&year.+1;
      end;
    run;
    proc sort data=year_&year.;
      by teamid;
    run;
  %end;
At the end I have merged all the datasets together.
data team;
merge year_&start.-year_&end.;
by teamid;
run;
%mend;
%loops(start=2008,end=2009);
Assuming those are SAS data sets and you're using SAS 9.3+, I'd recommend a data step plus PROC TRANSPOSE approach. It's infinitely easier to understand. If you want to make it a macro, replace the years (2008/2012) with your macro variables.
Edit: added N() to count the number of teammates, assuming the variables follow the mate1 mate2 ... matex naming convention.
The single data step produces the count; you can drop the mate variables afterwards.
data combined;
  set roster2008 - roster2012 indsname=source;
  year = substr(scan(source, 2, '.'), 7);
  num_mates = n(of mate:);
  *drop mate:;
run;
proc sort data=combined;
  by teamid year;
run;
proc transpose data=combined out=want prefix=mate;
  by teamid;
  var num_mates;
  id year;
run;
*OLD CODE BASED ON SQL SOLUTION;
*proc freq data=combined;
*table dsn*teamid/list;
*run;

SAS how to get random selection by group randomly split into multiple groups

I have a simple data set of customers (about 40,000k)
It looks like:
customerid, group, other_variable
a,blue,y
b,blue,x
c,blue,z
d,green,y
e,green,d
f,green,r
g,green,e
I want to randomly select, for each group, Y customers (along with their other variables).
The catch is, I want two random selections of Y customers for each group,
i.e.
4000 random green customers split into two sets of 2000 randomly
and 4000 random blue customers split into two sets of 2000 randomly
This is because I have different messages to give to the two different splits
I'm not sampling with replacement, so they need to be unique customers.
I would prefer a solution in PROC SQL but am happy with an alternative SAS solution if PROC SQL isn't ideal.
proc surveyselect is the general tool of choice for random sampling in SAS. The code is very simple: sample 4000 from each group, then assign a new subgroup every 2000 rows, since the selected rows are already a random subset (although sorted by group).
The default sampling method for proc surveyselect is srs, which is simple random sampling without replacement, exactly what is required here.
Here's some example code.
/* create dummy dataset */
data have;
do customerid = 1 to 10000;
length group other_variable $8;
if rand('uniform')<0.5 then group = 'blue'; /* assign blue or green with equal likelihood */
else group = 'green';
other_variable = byte(97+(floor((1+122-97)*rand('uniform')))); /* random letter between a and z */
output;
end;
run;
/* dataset must be sorted by group variable */
proc sort data=have;
by group;
run;
/* extract random sample of 4000 from each group */
proc surveyselect data=have
out=want
n=4000
seed=12345; /* specify seed to enable results to be reproduced */
strata group; /* set grouping variable */
run;
/* assign a new subgroup for every 2000 rows */
data want;
set want;
sub=int((_n_-1)/2000)+1;
run;
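If you would rather have the subgroup numbering restart at 1 within each group (so each group gets sub=1 and sub=2) instead of running 1 to 4 across the whole file, here is a variation of that last step, assuming the same 2000-row splits:
data want;
  set want;
  by group;
  if first.group then rownum=0;
  rownum+1;
  sub=int((rownum-1)/2000)+1;
  drop rownum;
run;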
data custgroup;
  do i=1 to nobs;
    set sorted_data nobs=nobs;
    point = ranuni(0);
    output;
  end;
  drop i;
run;
proc sort data=custgroup out=sortedcust;
  by group point;
run;
data final;
  set sortedcust;
  by group point;
  if first.group then i=1;
  else i+1;
run;
Basically, what I am doing is first assigning a random number to every observation in the data set, then sorting by group and point.
That gives a random sequence of observations within each group. The counter i identifies the row number within each group, which avoids extracting duplicate observations. You can also use OUTPUT statements to control which data set each observation goes to based on i, as sketched below.
My approach may not be the most efficient one.
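Here is a rough sketch of that OUTPUT idea (my own illustration; the split1/split2 names and the 2000-row cutoffs are assumptions based on the question):
data split1 split2;
  set sortedcust;
  by group point;
  if first.group then i=1;
  else i+1;
  if i<=2000 then output split1;       /* first random set of 2000 per group  */
  else if i<=4000 then output split2;  /* second random set of 2000 per group */
  drop i point;
run;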
The code below should do it. First, you will need to generate a random number. As Joe said above, it is better to seed it with a specific number so that you can reproduce the sample if necessary. Then you can use PROC SQL with the OUTOBS= option to generate each sample.
(BTW, it would be a good idea not to name a variable 'group'.)
data YourDataSet;
set YourDataSet;
myrandomnumber = ranuni(123);
run;
proc sql outobs=2000;
create table bluesample as
select *
from YourDataSet
where group eq 'blue'
order by myrandomnumber;
quit;
proc sql outobs=2000;
create table greensample as
select *
from YourDataSet
where group eq 'green'
order by myrandomnumber;
quit;

sas - calculate moving average for grouped data with BY statement

I'm a SAS beginner and I'm curious whether the following task can be done much more simply than the way I currently have it in my head.
I have the following (simplified) meta data in a table named user_date_money:
User - Date - Money
with various users and dates for every calendar day (for the last 4 years). The data is ordered by User ASC and Date ASC; sample data looks like this:
User | Date | Money
Anna 23.10.2013 5
Anna 24.10.2013 1
Anna 25.10.2013 12
....
Aron 23.10.2013 5
Aron 24.10.2013 12
Aron 25.10.2013 4
....
Zoe 23.10.2013 1
Zoe 24.10.2013 1
Zoe 25.10.2013 0
I now want to calculate a five-day moving average of Money. I started with the pretty popular approach using the lag() function, like this:
data cma;
  set user_date_money;
  if missing(money) then do;
    OBS = 0;
    money = 0.0;
  end;
  else OBS = 1;
  money5 = lag5(money);
  OBS5 = lag5(obs);
  if missing(money5) then money5 = 0.0;
  if missing(obs5) then obs5 = 0;
  if _N_ = 1 then do;
    SUM = 0.0;
    N = 0;
  end;
  else;
  sum = sum + money - money5;
  n = n + obs - obs5;
  MEAN = sum / n;
  retain sum n;
run;
As you see, the problem with this method occurs when the data step runs into a new user: Aron would get some lagged values from Anna, which of course should not happen.
Now my question: I am pretty sure you can handle the user switch by adding some extra fields like laggeduser and by resetting the N, Sum and Mean variables if you notice such a switch but:
Can this be done in an easier way? Perhaps using the BY Clause in any way?
Thanks for your ideas and help!
Best regards
I think the easiest way is to use PROC EXPAND:
PROC EXPAND data=user_date_money out=cma;
ID date;
BY user;
CONVERT money=MEAN / transformin=(setmiss 0) transformout=(movave 5);
RUN;
And as mentioned in John's comment, it's important to account for missing values (and for the beginning and ending observations as well). I've added the SETMISS option to the code, as you made it clear that you want to zero out missing values, not ignore them (the default MOVAVE behaviour).
And if you want to exclude the first 4 observations for each user (since they don't have enough history to calculate a 5-day moving average), you can add 'TRIMLEFT 4' inside TRANSFORMOUT=().
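For example, the answer's call with just that one option added:
PROC EXPAND data=user_date_money out=cma;
  ID date;
  BY user;
  CONVERT money=MEAN / transformin=(setmiss 0) transformout=(movave 5 trimleft 4);
RUN;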
If your particular need is simple enough, you can calculate it using PROC MEANS and a multilabel format.
data mydata;
  do id = 1 to 5;
    datevar = '01JAN2010'd - 1;
    do month = 0 to 4;
      datevar = intnx('MONTH',datevar,1,'b');
      sales = floor(500*rand('normal',7))+1500;
      output;
    end;
  end;
run;
proc format;
value movingavg (multilabel notsorted)
'01JAN2010'd-'31MAR2010'd = 'JAN-MAR 2010'
'01FEB2010'd-'30APR2010'd = 'FEB-APR 2010'
'01MAR2010'd-'31MAY2010'd = 'MAR-MAY 2010'
/* ... more of these ... */
;
quit;
proc means data=mydata;
class id datevar/mlf order=data;
types id*datevar;
format datevar movingavg.;
var sales;
run;
The PROC FORMAT step can be generated programmatically by using a CNTLIN data set; see the SAS documentation for PROC FORMAT for more information.
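A rough sketch of that idea (my own code, not from the answer): build the overlapping three-month ranges in a data set and feed it to CNTLIN. The ten windows starting at January 2010 are an assumption to match the dummy data, and as far as I recall HLO='M' is how a CNTLIN data set marks multilabel entries, so double-check that against the documentation:
data movingavg_fmt;
  length fmtname $32 type hlo $1 label $20;
  fmtname='movingavg'; type='N'; hlo='M';
  do i=0 to 9;                                  /* assumed: 10 overlapping windows */
    start=intnx('month','01JAN2010'd,i,'b');    /* first day of the window  */
    end=intnx('month','01JAN2010'd,i+2,'e');    /* last day, two months on  */
    label=catx(' ',
               cats(upcase(put(start,monname3.)),'-',upcase(put(end,monname3.))),
               year(end));
    output;
  end;
  drop i;
run;
proc format cntlin=movingavg_fmt;
run;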
If you make sure your data is sorted, you can use the FIRST. and LAST. automatic variables to reset your running totals when you get to a new member. Those plus RETAIN should get you what you need; I don't think lag() is really called for here.
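A minimal sketch of that idea (my own code, not the answerer's): keep the last five Money values in a retained temporary array, reset it for each new user, and only compute the average once five values are available.
data cma;
  set user_date_money;
  by user;
  array last5(0:4) _temporary_;        /* holds the 5 most recent Money values */
  if first.user then do;
    n=0;
    call missing(of last5(*));
  end;
  last5(mod(n,5))=money;               /* overwrite the oldest of the 5 slots */
  n+1;
  if n>=5 then mean5=mean(of last5(*));
  drop n;
run;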
Yes, you can use by groupings. First, you'll sort by user and date (as you already have).
proc sort data=user_date_money;
by user date;
run;
Then, redo the data step using the by variable and a counter.
data cma;
set user_date_money;
by user;
length User_Recs 3
Average 8;
retain User_Recs;
if First.User=1 then User_Recs=0;
User_Recs=User_Recs+1;
if User_Recs>4 then do;
Average=(lag4(money)+lag3(money)+lag2(money)+lag1(money)+money)/5;
end;
drop User_Recs;
run;

How do you create an index for unique variables in sas?

I am trying to combine the odds ratios outputted from two models with different adjustments in SAS:
i.e:
ods output oddsratios=adjustedOR1(rename=(OddsRatioEst=OR1));
proc logistic data=dataname;
  model y= b d c a e;
run;
ods output oddsratios=adjustedOR2(rename=(OddsRatioEst=OR2));
proc logistic data=dataname;
  model y= b d c;
run;
proc sort.....
data Oddsratios (keep=Effect OR1 OR2);
  merge adjustedOR1 adjustedOR2;
  by effect;
run;
Problem is that if I sort and merge by the Effect variable, I lose the order in which I put the explanatory variables in the model.
Is there anyway to assign an index to the variable according to the order I put it in the model, so that the final table will have the effect column in the order: b d c a e?
Thanks for your help
I suggest creating a new sequence variable in your "primary" dataset with the sort order you want, then re-sorting the merged result by that variable:
data adjustedOR1;
set adjustedOR1;
sortkey = _n_;
run;
proc sort data=adjustedOR1;
by effect;
run;
proc sort data=adjustedOR2;
by effect;
run;
data Oddsratios (keep=Effect OR1 OR2 sortkey);
merge adjustedOR1 adjustedOR2;
by effect;
run;
proc sort data=Oddsratios;
by sortkey;
run;
This would be a bit more generic than hard-coding the sort sequence as Keith suggests using PROC SQL (which also works by the way).
And thanks to Keith for providing a practical example!
I think the easiest way to sort your data is to do the merge in Proc Sql and use a case statement in an 'order by' clause. Here is an example.
ods output oddsratios=adjustedOR1(rename=(OddsRatioEst=OR1));
proc logistic data=sashelp.class;
model sex= height age weight; run;
ods output oddsratios=adjustedOR2 (rename=(OddsRatioEst=OR2));
proc logistic data=sashelp.class;
model sex= height age; run;
proc sql;
create table Oddsratios as select
a.effect,
a.or1,
b.or2
from adjustedOR1 as a
left join
adjustedOR2 as b
on a.effect=b.effect
order by
case a.effect
when 'Height' then 1
when 'Age' then 2
when 'Weight' then 3
end;
quit;
If you want to just change the order in which the variables appear in the data set, you can use the retain statement:
data Oddsratios (keep=Effect OR1 OR2);
retain b d c a e;
merge adjustedOR1 adjustedOR2;
by effect;
run;
This isn't really what retain is for, but it works.
But I wonder why you care what the order of the variables in the data set is. You can specify the order when you display results with proc print, for example.
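For example, a VAR statement on PROC PRINT controls the displayed column order without touching the data set:
proc print data=Oddsratios;
  var Effect OR1 OR2;
run;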
The simplest answer that is still fairly flexible, I think, is to create an informat from your original dataset. Then, during the merge, you can create a new numeric order variable and sort by that afterwards.
Another solution would be to merge in a fashion that does not require a sort: create a hash table, for example (see the sketch at the end of this answer), or create a format out of the odds ratio 2 dataset and apply it in a simple data step rather than via a merge.
data have;
input effect $;
datalines;
b
d
c
a
e
;;;;
run;
data for_format;
set have;
fmtname='EFF';
type='j';
hlo='s';
start=effect;
label=_n_;
keep hlo type fmtname start label;
run;
proc format cntlin=for_format;
quit;
proc sort data=have;
by effect;
run;
data want;
set have; *your merge here instead;
by effect;
eff_order=input(effect,$EFF.);
run;
proc sort data=want;
by eff_order;
run;
proc print data=want;
run;
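For completeness, here is a rough sketch (my own code, using the dataset and variable names from the question) of the hash-table alternative mentioned above: look up OR2 by Effect while reading adjustedOR1 in its original order, so no sort is needed at all.
data Oddsratios;
  if _n_=1 then do;
    if 0 then set adjustedOR2(keep=effect OR2);   /* define host variables for the hash */
    declare hash h(dataset:'adjustedOR2(keep=effect OR2)');
    h.defineKey('effect');
    h.defineData('OR2');
    h.defineDone();
  end;
  set adjustedOR1(keep=effect OR1);
  if h.find() ne 0 then call missing(OR2);        /* no match: leave OR2 missing */
  keep effect OR1 OR2;
run;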