I have:
6 files, named as follows: "ROSTER2008", "ROSTER2009", ..., "ROSTER2013"
Each file has these variables: TEAMID and MATE1, MATE2, ..., MATEX. The names of the teammates are stored for each team, through to teammate X.
Now: I want to loop through code that reads in the 6 files and creates one output file with TEAM, MATES2008, MATES2009, ..., MATES2013, where the MATES20XX variables contain the number of teammates on each team in that respective year (I no longer care about their names).
Here is what I've tried to do:
%macro sqlloop(start, end);
proc sql;
%DO year=&start. %TO &end.;
create table mates_by_team_&year. as
select distinct put(teamID, z2.) as team,
count(*) as mates
from myLibrary.rosters
where d='ROSTER(HOW DO I PUT YEAR HERE?)'.d;
%end;
quit;
%mend;
%sqlloop(start=2008, end=2013)
SO, here are my questions:
How do I fix the 'where' statement such that it understands the file name I intend to pull in on each iteration? (That is ROSTER2008,...,ROSTER2013)
Right now, I am creating a separate table for each of the 6 files I read in. How can I instead combine these into one table, as described above?
Thank you in advance for your help!
Xtina
For the sake of simplicity I have assumed that your X in MATEX never goes beyond 999. Obviously you can increase the limit, or you can use PROC CONTENTS or the dictionary tables to find the actual maximum.
I pull all the variables MATE1-MATE999 into an array, check whether each one is missing, and increment a counter to count the number of teammates.
%macro loops(start=,end=);
%do year=&start. %to &end.;
data year_&year.(keep=teamid Mates_&year.);
set roster&year.;
array mates{*} $ mate1-mate999; /* character array, since the MATE variables hold names */
Mates_&year.=0;
do i=1 to 999;
if not missing(mates{i}) then Mates_&year.=Mates_&year.+1;
end;
run;
proc sort data=year_&year.;
by teamid;
run;
%end;
At the end I have merged all the datasets together.
data team;
merge year_&start.-year_&end.;
by teamid;
run;
%mend;
%loops(start=2008,end=2013);
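As noted above, you could use PROC CONTENTS or the dictionary tables instead of hard-coding the 999 limit. Here is a sketch of the dictionary-tables route (it assumes the ROSTER datasets sit in the WORK library; change the libname to match yours):

```sas
/* Count the MATE variables in each ROSTER dataset.  */
/* Dictionary table values are stored in uppercase.  */
proc sql;
  select memname, count(*) as n_mate_vars
    from dictionary.columns
    where libname = 'WORK'
      and memname like 'ROSTER%'
      and upcase(name) like 'MATE%'
    group by memname;
quit;
```

The largest n_mate_vars reported is the upper bound to use in the array.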
Assuming those are SAS datasets and you're using SAS 9.3+, I'd recommend a data step and PROC TRANSPOSE approach. It's far easier to understand. If you wanted to make it a macro, replace the years (2008/2013) with your macro variables.
Edit: added a count of the non-missing MATE variables, assuming they follow the naming mate1, mate2, ..., matex.
A single data step computes the count, and you can drop the name variables:
data combined;
set roster2008 - roster2013 indsname=source;
array m mate:;                          /* all the MATE variables, however many exist */
year = substr(scan(source, 2, '.'), 7); /* e.g. WORK.ROSTER2008 -> 2008 */
num_mates = dim(m) - cmiss(of mate:);   /* non-missing teammate names */
*drop mate:;
run;
proc sort data=combined;
by teamid year;
run;
proc transpose data=combined out=want prefix=mates;
by teamid;
var num_mates;
id year;
run;
*OLD CODE BASED ON SQL SOLUTION;
*Proc freq data=combined;
*Table dsn*teamid/list;
*Run;
Related
I already have the answer in proc sql but I need the data step version of my code. If someone could please help me convert it, I would be grateful.
PROC SQL;
CREATE TABLE CARS AS
SELECT Origin, Type, Cylinders, DriveTrain, COUNT(*) AS COUNT
FROM SASHELP.CARS
group by Origin, Type, Cylinders, DriveTrain;
QUIT;
A data step would not be the appropriate solution here; PROC FREQ is the SAS way to do this.
proc freq data=sashelp.cars;
table origin*type*cylinders*drivetrain / out=cars list;
run;
For completeness, here's a data step approach. Very much not recommended:
Sort data set first by grouping variables
Use BY Group in data step to identify groups of interest
Use RETAIN to hold value across rows
Use FIRST./LAST. to accumulate counter and output
*sort for BY statement is required;
proc sort data=sashelp.cars out=cars_sorted;
by origin type cylinders drivetrain;
run;
data cars_count;
set cars_sorted;
by origin type cylinders drivetrain;
*RETAIN tells SAS to keep this variable across rows, otherwise it resets for each observation;
retain count;
*if first in category set count to 0;
if first.drivetrain then count=0;
*increment count for each record (implicit retain so RETAIN is not actually required here);
count+1;
*if last of the group then output the total count for that group;
if last.drivetrain then output;
*keep only variables of interest;
keep origin type cylinders drivetrain count;
run;
*display results;
proc print data=cars_count;
run;
As long as none of your key variables have missing values and the full summary table will fit into your available memory, you could use a data step HASH object.
That eliminates the need to pre-sort the data.
data _null_;
set sashelp.cars end=eof;
if _n_=1 then do;
declare hash h(ordered:'yes');
rc=h.definekey('Origin','Type','Cylinders','DriveTrain');
rc=h.definedata('Origin','Type','Cylinders','DriveTrain','count');
rc=h.definedone();
end;
if h.find() then count=0;
count+1;
rc=h.replace();
if eof then rc=h.output(dataset:'cars2');
run;
I have two tables:
tb_payments: contract_id, payment_date, payment_value
tb_reference: contract_id, reference_date
For each (contract_id, reference_date) in tb_reference, I want to create a column sum_payments as the 90 days rolling sum from tb_payments. I can accomplish this (very inefficiently) with the query below:
%let window=90;
proc sql;
create index contract_id on tb_payments;
quit;
proc sql;
create table tb_rolling as
select a.contract_id,
a.reference_date,
(select sum(b.payment_value)
from tb_payments as b
where a.contract_id = b.contract_id
and a.reference_date - &window. < b.payment_date
and b.payment_date <= a.reference_date
) as sum_payments
from tb_reference as a;
quit;
How can I rewrite this to reduce the time complexity, using proc sql or SAS data step?
Edit with more info:
I chose 90 days as the window arbitrarily, but I will perform calculations for several windows. A solution that can perform calculations for several windows at the same time would be ideal
Both tables can have 10+ million rows, and the data is completely arbitrary. My SAS server is quite powerful, though
Contract_ids can be repeated in both tables
The pairs (contract_id, reference_date) and (contract_id, payment_date) are unique
Edit with sample data:
%let seed=1111;
data tb_reference (drop=i);
call streaminit(&seed.);
do i = 1 to 10000;
contract_id = round(rand('UNIFORM')*1000000,1);
output;
end;
run;
proc surveyselect data=tb_reference out=tb_payments n=5000 seed=&seed.; run;
data tb_reference(drop=i);
format reference_date date9.;
call streaminit(&seed.);
set tb_reference;
do i = 1 to 1+round(rand('UNIFORM')*4,1);
reference_date = '01jan2016'd + round(rand('UNIFORM')*1000,1);
output;
end;
run;
proc sort data=tb_reference nodupkey; by contract_id reference_date; run;
data tb_payments(drop=i);
format payment_date date9. payment_value comma20.2;
call streaminit(&seed.);
set tb_payments;
do i = 1 to 1+round(rand('UNIFORM')*20,1);
payment_date = '01jan2015'd + round(rand('UNIFORM')*1365,1);
payment_value = round(rand('UNIFORM')*3333,0.01);
output;
end;
run;
proc sort data=tb_payments nodupkey; by contract_id payment_date; run;
Update:
I compared my naive solution to both proposals from Quentin and Tom.
The merge method is quite fast and achieved over 10x speedup for n=10000. It is also very powerful, as beautifully demonstrated by Tom in his answer.
Hash tables are insanely fast and achieved over 500x speedup. Because my datasets are large, this is the way to go, but there's a catch: they need to fit in RAM.
If anyone needs the full testing code, feel free to send me a message.
Here's an example of a hash approach. Since your data are already sorted, I don't think there is much benefit to the hash approach over Tom's merge approach.
General idea is to read all of the payment data into a hash table (you may run out of memory if your real data is too big), then read through the data set of reference dates. For each reference date, you look up all of the payments for that contract_id, and iterate through them, testing to see if payment date is <90 days before the reference_date, and conditionally incrementing sum_payments.
Should be noticeably faster than the SQL approach in your question, but could lose to the MERGE approach. If the data were not sorted in advance, this might beat the time for sorting both big datasets and then merging. It could handle multiple payments on the same date.
data want;
*initialize variables for hash table ;
call missing(payment_date,payment_value) ;
*Load a hash table with all of the payment data ;
if _n_=1 then do ;
declare hash h(dataset:"tb_payments", multidata: "yes");
h.defineKey("contract_ID");
h.defineData("payment_date","payment_value");
h.defineDone() ;
end ;
*read in the reference dates ;
set tb_reference (keep=contract_id reference_date) ;
*for each reference date, look up all the payments for that contract_id ;
*and iterate through them. If the payment date is < 90 days before reference date then ;
*increment sum_payments ;
sum_payments=0 ;
rc=h.find();
do while (rc = 0); *found a record;
if 0<=(reference_date-payment_date)<90 then sum_payments = sum_payments + payment_value ;
rc=h.find_next();
end;
run ;
It probably is possible to do this all with PROC EXPAND if you have it licensed. But let's look at how to do it without that.
It shouldn't be that hard if all of the dates are present in the PAYMENTS table. Just merge the two tables by ID and DATE. Calculate the running sum, but with the wrinkle of also subtracting out the value that is rolling out the back of the window. Then just keep the dates that are in the reference file.
One issue might be the need to find all possible dates for a CONTRACT_ID so that LAG() function can be used. That is easy to do with PROC MEANS.
proc summary data=tb_payments nway ;
by contract_id ;
var payment_date;
output out=tb_id_dates(drop=_:) min=date1 max=date2 ;
run;
And a data step. This step could also be a view instead.
data tb_id_dates_all ;
set tb_id_dates ;
do date=date1 to date2 ;
output;
end;
format date date9.;
keep contract_id date ;
run;
Now just merge the three datasets and calculate the cumulative sums. Note that I included a do loop to accumulate multiple payments on a single day (remove the nodupkey in your sample data generation code to test it).
If you want to generate multiple windows then you will need multiple actual LAG() function calls.
data want ;
do until (last.contract_id);
do until (last.date);
merge tb_id_dates_all tb_payments(rename=(payment_date=date))
tb_reference(rename=(reference_date=date) in=in2)
;
by contract_id date ;
payment=sum(0,payment,payment_value);
end;
day_num=sum(day_num,1);
array lag_days(5) _temporary_ (7 30 60 90 180) ;
array lag_payment(5) _temporary_ ;
array cumm(5) cumm_7 cumm_30 cumm_60 cumm_90 cumm_180 ;
lag_payment(1) = lag7(payment);
lag_payment(2) = lag30(payment);
lag_payment(3) = lag60(payment);
lag_payment(4) = lag90(payment);
lag_payment(5) = lag180(payment);
do i=1 to dim(cumm) ;
cumm(i)=sum(cumm(i),payment);
if day_num > lag_days(i) then cumm(i)=sum(cumm(i),-lag_payment(i));
if .z < abs(cumm(i)) < 1e-5 then cumm(i)=0;
end;
if in2 then output ;
end;
keep contract_id date cumm_: ;
format cumm_: comma20.2 ;
rename date=reference_date ;
run;
If you want to make the code flexible for the number of windows you will need to add some code generation to create the LAGxx() function calls. For example you could use this macro:
%macro lags(windows);
%local i n lag ;
%let n=%sysfunc(countw(&windows));
array lag_days(&n) _temporary_ (&windows) ;
array lag_payment(&n) _temporary_ ;
array cumm(&n)
%do i=1 %to &n ;
%let lag=%scan(&windows,&i);
cumm_&lag
%end;
;
%do i=1 %to &n ;
%let lag=%scan(&windows,&i);
lag_payment(&i) = lag&lag(payment);
%end;
%mend lags;
And replace the ARRAY and assignment statements with LAGxx() functions with this call to the macro:
%lags(7 30 60 90 180)
Can anybody help me with how to calculate the correlation between two variables within each group in PROC SQL? Is there any function for that, just as there is for sum or mean? Thanks a lot!
You should use proc corr to start with, as this does all the required calculations, which gets you most of the way there. You will need to filter and transpose the output dataset into your desired format. There are many answers on this site showing how to do that sort of thing, so have a look at those - in this case a wide to long transposition is required.
proc sort data = sashelp.class out = class;
by sex;
run;
proc corr data = class outp=mypcorr noprint;
var HEIGHT WEIGHT;
by SEX;
run;
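The filtering and transposing mentioned above could be sketched like this: the OUTP= dataset stacks MEAN/STD/N/CORR rows per BY group, so keep the CORR row for one variable and read the other variable's column (variable and dataset names match the example above):

```sas
/* Pull the HEIGHT-WEIGHT correlation for each SEX out of the */
/* OUTP= dataset produced by PROC CORR above.                 */
data corr_by_sex;
  set mypcorr;
  where _type_ = 'CORR' and upcase(_name_) = 'HEIGHT';
  corr_height_weight = weight;  /* HEIGHT row, WEIGHT column */
  keep sex corr_height_weight;
run;
```

If you need the result long rather than wide, a PROC TRANSPOSE on this dataset finishes the job.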
I have a simple data set of customers (about 40,000k)
It looks like:
customerid, group, other_variable
a,blue,y
b,blue,x
c,blue,z
d,green,y
e,green,d
f,green,r
g,green,e
I want to randomly select, for each group, Y customers (along with their other variable(s)).
The catch is, i want to have two random selections of Y amounts for each group
i.e.
4000 random green customers split into two sets of 2000 randomly
and 4000 random blue customers split into two sets of 2000 randomly
This is because I have different messages to give to the two different splits
I'm not sampling with replacement. These need to be unique customers.
I'd prefer a solution in PROC SQL but am happy with an alternative SAS solution if PROC SQL isn't ideal.
proc surveyselect is the general tool of choice for random sampling in SAS. The code is very simple, I would just sample 4000 of each group, then assign a new subgroup every 2000 rows, since the data is in a random order anyway (although sorted by group).
The default sampling method for proc surveyselect is srs, which is simple random sampling without replacement, exactly what is required here.
Here's some example code.
/* create dummy dataset */
data have;
do customerid = 1 to 10000;
length group other_variable $8;
if rand('uniform')<0.5 then group = 'blue'; /* assign blue or green with equal likelihood */
else group = 'green';
other_variable = byte(97+(floor((1+122-97)*rand('uniform')))); /* random letter between a and z */
output;
end;
run;
/* dataset must be sorted by group variable */
proc sort data=have;
by group;
run;
/* extract random sample of 4000 from each group */
proc surveyselect data=have
out=want
n=4000
seed=12345; /* specify seed to enable results to be reproduced */
strata group; /* set grouping variable */
run;
/* assign a new subgroup for every 2000 rows */
data want;
set want;
sub=int((_n_-1)/2000)+1;
run;
data custgroup;
set sorted_data;
point = ranuni(0); /* random sort key for each observation */
run;
proc sort data=custgroup out=sortedcust;
by group point;
run;
data final;
set sortedcust;
by group point;
if first.group then i=0; /* restart the counter in each group */
i+1;
run;
Basically what I am doing is: first assign a random number to every observation in the data set, then sort by the variables group and point.
That gives a random sequence of observations within each group. The counter i then numbers the rows within each group, which avoids extracting duplicated observations. You can use an OUTPUT statement to route each observation to whichever split you want based on i.
My approach may not be the most efficient one.
The code below should do it. First, you will need to generate a random number. As Joe said above, it is better to seed it with a specific number so that you can reproduce the sample if necessary. Then you can use PROC SQL with the OUTOBS= option to take the sample.
(BTW, it would be a good idea not to name a variable 'group'.)
data YourDataSet;
set YourDataSet;
myrandomnumber = ranuni(123);
run;
proc sql outobs=2000;
create table bluesample as
select *
from YourDataSet
where group eq 'blue'
order by myrandomnumber;
quit;
proc sql outobs=2000;
create table greensample as
select *
from YourDataSet
where group eq 'green'
order by myrandomnumber;
quit;
I am trying to combine the odds ratios outputted from two models with different adjustments in SAS:
i.e:
ods output oddsratios=adjustedOR1(rename=(OddsRatioEst=OR1));
proc logistic data=dataname;
model y= b d c a e; run;
ods output oddsratios=adjustedOR2 (rename=(OddsRatioEst=OR2));
proc logistic data=dataname;
model y= b d c; run;
proc sort.....
data Oddsratios (keep=Effect OR1 OR2);
merge adjustedOR1 adjustedOR2; by effect; run;
Problem is that if I sort and merge by the Effect variable, I lose the order in which I put the explanatory variables in the model.
Is there anyway to assign an index to the variable according to the order I put it in the model, so that the final table will have the effect column in the order: b d c a e?
Thanks for your help
I suggest creating a new sequence variable in your "primary" dataset with the sort order you want, then re-sorting the merged result by that variable:
data adjustedOR1;
set adjustedOR1;
sortkey = _n_;
run;
proc sort data=adjustedOR1;
by effect;
run;
proc sort data=adjustedOR2;
by effect;
run;
data Oddsratios (keep=Effect OR1 OR2 sortkey);
merge adjustedOR1 adjustedOR2;
by effect;
run;
proc sort data=Oddsratios;
by sortkey;
run;
This would be a bit more generic than hard-coding the sort sequence as Keith suggests using PROC SQL (which also works by the way).
And thanks to Keith for providing a practical example!
I think the easiest way to sort your data is to do the merge in Proc Sql and use a case statement in an 'order by' clause. Here is an example.
ods output oddsratios=adjustedOR1(rename=(OddsRatioEst=OR1));
proc logistic data=sashelp.class;
model sex= height age weight; run;
ods output oddsratios=adjustedOR2 (rename=(OddsRatioEst=OR2));
proc logistic data=sashelp.class;
model sex= height age; run;
proc sql;
create table Oddsratios as select
a.effect,
a.or1,
b.or2
from adjustedOR1 as a
left join
adjustedOR2 as b
on a.effect=b.effect
order by
case a.effect
when 'Height' then 1
when 'Age' then 2
when 'Weight' then 3
end;
quit;
If you want to just change the order in which the variables appear in the data set, you can use the retain statement:
data Oddsratios (keep=Effect OR1 OR2);
retain b d c a e;
merge adjustedOR1 adjustedOR2;
by effect;
run;
This isn't really what retain is for, but it works.
But I wonder why you care what the order of the variables in the data set is. You can specify the order when you display results with proc print, for example.
The simplest answer, I think, that is still fairly flexible is to create an informat from your original dataset. Then, during the merge, you can create a new numeric order variable and sort by that afterwards.
Another solution would be to merge in a fashion not requiring a sort - create a hash table, for example, or create a format out of the odds ratio 2 dataset and append it on in a simple data step rather than via merge.
data have;
input effect $;
datalines;
b
d
c
a
e
;;;;
run;
data for_format;
set have;
fmtname='EFF';
type='j';
hlo='s';
start=effect;
label=_n_;
keep hlo type fmtname start label;
run;
proc format cntlin=for_format;
quit;
proc sort data=have;
by effect;
run;
data want;
set have; *your merge here instead;
by effect;
eff_order=input(effect,$EFF.);
run;
proc sort data=want;
by eff_order;
run;
proc print data=want;
run;
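The format-based alternative mentioned earlier (appending OR2 without a sort or merge) could be sketched as follows. The informat name OR2F is made up, and the adjustedOR1/adjustedOR2 datasets from the question are assumed:

```sas
/* Build a numeric informat mapping each effect name to its OR2 */
/* value, then look OR2 up in a single pass over adjustedOR1,   */
/* which keeps the original model order - no sort, no merge.    */
data or2_cntlin;
  set adjustedOR2;
  fmtname = 'OR2F';
  type    = 'i';      /* numeric informat: character in, number out */
  start   = effect;
  label   = OR2;
  keep fmtname type start label;
run;

proc format cntlin=or2_cntlin;
run;

data Oddsratios;
  set adjustedOR1;    /* already in the order the model was specified */
  OR2 = input(effect, or2f.);
  keep effect OR1 OR2;
run;
```

Any effect value missing from adjustedOR2 would trigger an invalid-data note; if that matters, add an HLO='O' row to the CNTLIN dataset so "other" maps to missing.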