I'm a SAS beginner and I'm curious whether the following task can be done much more simply than the way I currently have it in my head.
I have the following (simplified) meta data in a table named user_date_money:
User - Date - Money
with various users and dates for every calendar day (for the last 4 years). The data is ordered by User ASC and Date ASC, sample data looks like this:
User | Date | Money
Anna 23.10.2013 5
Anna 24.10.2013 1
Anna 25.10.2013 12
....
Aron 23.10.2013 5
Aron 24.10.2013 12
Aron 25.10.2013 4
....
Zoe 23.10.2013 1
Zoe 24.10.2013 1
Zoe 25.10.2013 0
I now want to calculate a five-day moving average for Money. I started with the pretty popular approach using the lag() function, like this:
data cma;
set user_date_money;
if missing(money) then
do;
OBS = 0;
money = 0.0;
end;
else OBS = 1;
money5 = lag5(money);
OBS5= lag5(obs);
if missing(money5) then money5= 0.0;
if missing(obs5) then obs5= 0;
if _N_ = 1 then
do;
SUM = 0.0;
N = 0;
end;
else;
sum = sum + money-money5;
n = n + obs-obs5;
MEAN = sum / n ;
retain sum n;
run;
As you can see, the problem with this method occurs when the data step runs into a new user. Aron would get some lagged values from Anna, which of course should not happen.
Now my question: I am pretty sure you can handle the user switch by adding some extra fields like laggeduser and by resetting the N, Sum and Mean variables if you notice such a switch but:
Can this be done in an easier way? Perhaps using the BY Clause in any way?
Thanks for your ideas and help!
Best regards
I think the easiest way is to use PROC EXPAND:
PROC EXPAND data=user_date_money out=cma;
ID date;
BY user;
CONVERT money=MEAN / transformin=(setmiss 0) transformout=(movave 5);
RUN;
And as mentioned in John's comment, it's important to remember about missing values (and about beginning and ending observations as well). I've added SETMISS option to the code, as you made it clear that you want to 'zerofy' missing values, not ignore them (default MOVAVE behaviour).
And if you want to exclude first 4 observations for each user (since they don't have enough pre-history to calculate moving average 5), you can use option 'TRIMLEFT 4' inside TRANSFORMOUT=().
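As a rough sketch of that variant, the same step with TRIMLEFT added might look like this. The output name cma_trimmed is just a placeholder. Note that TRIMLEFT sets the first 4 moving averages of each BY group to missing rather than deleting those rows.

```sas
/* Same PROC EXPAND call as above, with TRIMLEFT 4 added to TRANSFORMOUT. */
/* cma_trimmed is a hypothetical output data set name.                    */
proc expand data=user_date_money out=cma_trimmed;
  id date;
  by user;
  convert money=MEAN / transformin=(setmiss 0)
                       transformout=(movave 5 trimleft 4);
run;
```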
If your particular need is simple enough, you can calculate it using PROC MEANS and a multilabel format.
data mydata;
do id = 1 to 5;
datevar = '01JAN2010'd-1;
do month = 0 to 4;
datevar=intnx('MONTH',datevar,1,'b');
sales = floor(500*rand('normal',7))+1500;
output;
end;
end;
run;
proc format;
value movingavg (multilabel notsorted)
'01JAN2010'd-'31MAR2010'd = 'JAN-MAR 2010'
'01FEB2010'd-'30APR2010'd = 'FEB-APR 2010'
'01MAR2010'd-'31MAY2010'd = 'MAR-MAY 2010'
/* ... more of these ... */
;
quit;
proc means data=mydata;
class id datevar/mlf order=data;
types id*datevar;
format datevar movingavg.;
var sales;
run;
The PROC FORMAT step can be built programmatically by using a CNTLIN= data set; see the SAS documentation for PROC FORMAT for more information.
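A minimal sketch of the CNTLIN= route, assuming the same three-month windows as the hand-written format above. The data set name movingavg_fmt and the loop bounds are made up for illustration, and the HLO value 'MS' is my understanding of how the multilabel and notsorted options are encoded, so check it against the PROC FORMAT documentation before relying on it.

```sas
/* Build one control row per three-month window (hypothetical data set name). */
data movingavg_fmt;
  length fmtname $32 label $20 hlo $3;
  retain fmtname 'movingavg' type 'N' hlo 'MS'; /* assumed: M = multilabel, S = notsorted */
  do m = 0 to 2;                                /* windows starting JAN, FEB, MAR 2010 */
    start = intnx('month', '01JAN2010'd, m, 'b');
    end   = intnx('month', start, 2, 'e');      /* last day of the third month */
    label = catx(' ',
                 catx('-', upcase(put(start, monname3.)), upcase(put(end, monname3.))),
                 put(end, year4.));             /* e.g. JAN-MAR 2010 */
    output;
  end;
  drop m;
run;

/* Create the format from the control data set. */
proc format cntlin=movingavg_fmt;
run;
```

Widening the loop bounds generates as many windows as needed without typing the ranges by hand.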
If you make sure your data is sorted, you can use the FIRST. and LAST. automatic variables (available when you use a BY statement) to initialize your running totals when you get to a new user. These and RETAIN should get you what you need; I don't think lag() is really called for here.
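One way this idea could be sketched, assuming the table and column names from the question and the same zero-fill for missing values: a five-slot temporary array (retained automatically) acts as a circular buffer that is cleared at each new user. The names w and n_in_group are invented for the example.

```sas
data cma;
  set user_date_money;
  by user;
  array w{0:4} _temporary_;           /* last five values; temporary arrays are retained */
  if first.user then do;
    call missing(of w{*});            /* forget the previous user's values */
    n_in_group = 0;
  end;
  n_in_group + 1;                     /* sum statement: implicitly retained counter */
  w{mod(n_in_group, 5)} = coalesce(money, 0);  /* overwrite the oldest slot */
  if n_in_group >= 5 then mean = mean(of w{*}); /* only once five rows are available */
  drop n_in_group;
run;
```

Because the buffer is reset on first.user, no value from one user can leak into the next user's average.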
Yes, you can use by groupings. First, you'll sort by user and date (as you already have).
proc sort data=user_date_money;
by user date;
run;
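Then, redo the data step using the by variable and a counter. Note that the lag functions have to run on every row: each LAGn call keeps its own queue, which only advances when the call executes, so calling them inside the IF block would return stale values.
data cma;
set user_date_money;
by user;
length User_Recs 3
Average 8;
retain User_Recs;
/* call LAG unconditionally so every queue advances on every row */
lag1_money=lag1(money);
lag2_money=lag2(money);
lag3_money=lag3(money);
lag4_money=lag4(money);
if First.User=1 then User_Recs=0;
User_Recs=User_Recs+1;
/* only average once the current user has five rows */
if User_Recs>4 then Average=(lag4_money+lag3_money+lag2_money+lag1_money+money)/5;
drop User_Recs lag1_money lag2_money lag3_money lag4_money;
run;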
I'm fairly new to SAS and need some help, if this is at all possible.
My data looks something like this
sample data
For now, I only have 201801, but I want the following code to work for all months in 2018, or rather to force every yearmonth to show up:
PROC TRANSPOSE DATA = have OUT = want PREFIX = primary_;
BY member_no;
ID yearmonth_pd;
VAR desc;
OPTIONS MISSING = ' ';
RUN;
The purpose of this is that later on I want to left join every yearmonth_pd to other information, but this returns the error
ERROR: Column primary_201802 could not be found in the table/view identified with the
correlation name B.
We are entering the variables manually, as in b.primary_201801, b.primary_201802, ...,
but were hoping this could be done without the need for that, as other departments use the code but don't necessarily know how to code.
Basically, I want something like the following (the table on top is what I'm looking for and the one below is what data generally looks like)
desired outcome
Thanks all!
You did not indicate if yearmonth_pd is a string, a formatted date value, or simply an integer that is an encoded representation yyyymm for year and month. I'll presume simple integer.
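You will need to prepend a BY group of dummy rows to have, so that there is one row per target column. The dummy group uses member_no = 0, which assumes that value is lower than any real member number and that have is sorted by member_no.
data all_months;
member_no = 0;
do yearmonth_pd = 201801 to 201812;
output;
end;
run;
data haveFull;
set all_months have;
run;
/* transpose the padded data, then drop the dummy member_no = 0 row */
PROC TRANSPOSE DATA = haveFull OUT = want(where=(member_no ne 0)) PREFIX = primary_;
BY member_no;
ID yearmonth_pd;
VAR desc;
OPTIONS MISSING = ' ';
RUN;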
This approach can also be used to force a certain column order in the out= table when the data= has sparse or out of sequence id values.
I have a dataset with one row per week for 2 years (so 104 rows). I have a flag column which is either 1 or 0 for each week. I want to create a new column with the following logic:
if the flag=1 for that week then have a 1 for that week and the following 3 weeks as flag_new.
My current approach, which works, is:
if flag=1 or lag(flag)=1 or lag2(flag)=1 or lag3(flag)=1 then flag_new=1;
Although this works, it becomes very tedious if I want flag_new to be 1 for the following 20 or 30 weeks instead of just 3 weeks.
I was hoping there would be an easier way to do this (perhaps a loop?), but I am not too familiar with it.
Any help is much appreciated.
Maybe instead of a look back, think of it as a look ahead. That is, each time you see flag=1, set flag_new=1 for that record and the next three records. Something like (untested):
if flag=1 then count=3;
else count+(-1) ; *implicit retain from sum statement;
if count>=0 then flag_new=1;
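For context, the fragment above could sit in a complete data step roughly like this. The data set names have and want and the macro variable n_weeks are placeholders. Unlike the fragment, this version writes 0 rather than missing when the flag is not active.

```sas
%let n_weeks = 3;   /* hypothetical: number of following weeks to flag, e.g. 20 or 30 */

data want;
  set have;
  if flag = 1 then count = &n_weeks;  /* restart the countdown on every flagged week */
  else count + (-1);                  /* sum statement: count is retained and starts at 0 */
  flag_new = (count >= 0);            /* 1 while the countdown is running, otherwise 0 */
  drop count;
run;
```

Extending the window only requires changing n_weeks, with no extra lag() calls.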
You can use a temporary array as well to keep the lagged information and then take the maximum of the array. If it's 1, you can set the new flag to 1 as well. To change the window to n weeks in total, change the upper bound 3 to n-1 and the 4 in MOD to n.
This also demonstrates the BY statement and resetting the array at the beginning of a new group.
data want;
array p{0:3} _temporary_; /* current week plus the three previous weeks */
set have;
by object;
if first.object then call missing(of p{*});
p{mod(_n_,4)} = flag; /* overwrite the oldest slot */
highest = max(of p{*});
if highest >= 1 then do;
flag_new = 1;
end;
run;
I have a data table with columns: Year, Month, Sales. It is effectively a summary table, like a pivot table in excel.
With this table, if there are no sales reported for one month (i.e. Not 0 sales, but no mention of sales so SAS cannot pinpoint a value to a certain month) then that whole row would disappear.
I do not want this to happen, I would instead like that row to display 0 rather than not appear. Is there a way to change the format of this to ensure every row would appear?
Note: The months are not calendar months, as such you could have month60 relating to 2011.
If the table is being created using proc summary or proc means, one way of achieving the sort of output you want provided that you have at least 1 row for each month in your data is to use the completetypes option, e.g.
proc summary data = sashelp.class completetypes;
class sex age;
var weight;
output out = mysummary mean=;
run;
This produces a row with frequency 0 for Sex = F, Age = 16 rather than skipping that output entirely.
A more reliable but more labour-intensive method, which works even if some values never appear anywhere in your data, is to use the classdata option, e.g.
data myclassdata;
do SEX = 'M','F';
do AGE = 13 to 17;
output;
end;
end;
run;
proc summary nway data = sashelp.class classdata=myclassdata exclusive;
class sex age;
var weight;
output out = mysummary2 mean=;
run;
The exclusive option here restricts the output to combinations of levels that are present in the classdata dataset. Without it, you get at least those specified in the classdata plus rows for all possible combinations based on observed 1-way values as though you had specified completetypes.
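Applied to the table in the question, a sketch might look like the following. The data set names sales_data and all_periods and the ranges of year and month are assumptions, and the CLASSDATA= variables need to match the type and length of the class variables in the input data. An empty cell comes back with a missing sum, so a final step turns it into 0.

```sas
/* Every year/month combination that should appear (hypothetical ranges). */
data all_periods;
  do year = 2011 to 2012;
    do month = 1 to 12;
      output;
    end;
  end;
run;

proc summary nway data=sales_data classdata=all_periods exclusive;
  class year month;
  var sales;
  output out=sales_summary sum=;
run;

/* Months with no input rows have _FREQ_ = 0 and a missing sum; show 0 instead. */
data sales_summary;
  set sales_summary;
  sales = coalesce(sales, 0);
run;
```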
I have a simple data set of customers (about 40,000k)
It looks like:
customerid, group, other_variable
a,blue,y
b,blue,x
c,blue,z
d,green,y
e,green,d
f,green,r
g,green,e
For each group, I want to randomly select Y customers (along with their other variable(s)).
The catch is, I want to have two random selections of Y customers for each group
i.e.
4000 random green customers split into two sets of 2000 randomly
and 4000 random blue customers split into two sets of 2000 randomly
This is because I have different messages to give to the two different splits
I'm not sampling with replacement. Needs to be unique customers
I would prefer a solution in PROC SQL, but I'm happy with an alternative solution in SAS if PROC SQL isn't ideal.
proc surveyselect is the general tool of choice for random sampling in SAS. The code is very simple: I would just sample 4000 from each group, then assign a new subgroup every 2000 rows, since the data is in a random order anyway (although sorted by group).
The default sampling method for proc surveyselect is srs, which is simple random sampling without replacement, exactly what is required here.
Here's some example code.
/* create dummy dataset */
data have;
do customerid = 1 to 10000;
length group other_variable $8;
if rand('uniform')<0.5 then group = 'blue'; /* assign blue or green with equal likelihood */
else group = 'green';
other_variable = byte(97+(floor((1+122-97)*rand('uniform')))); /* random letter between a and z */
output;
end;
run;
/* dataset must be sorted by group variable */
proc sort data=have;
by group;
run;
/* extract random sample of 4000 from each group */
proc surveyselect data=have
out=want
n=4000
seed=12345; /* specify seed to enable results to be reproduced */
strata group; /* set grouping variable */
run;
/* assign a new subgroup for every 2000 rows */
data want;
set want;
sub=int((_n_-1)/2000)+1;
run;
data custgroup;
set sorted_data;
point = ranuni(0); /* random sort key; a seed of 0 uses the clock, so results are not reproducible */
run;
proc sort data=custgroup out=sortedcust;
by group point;
run;
data final;
set sortedcust;
by group point;
if first.group then i=0; /* restart the counter for each group */
i+1; /* row number within the group */
run;
Basically, what I am doing is first assigning a random number to every observation in the data set, then sorting by group and point.
This gives a random sequence of observations within each group. The counter i (reset at first.group and incremented by i+1) identifies the row number within the group, which avoids extracting duplicate observations. Use output statements as well to control where each observation is stored based on i.
My approach may not be the most efficient one.
The code below should do it. First, you will need to generate a random number. As Joe said above, it is better to seed it with a specific number so that you can reproduce the sample if necessary. Then you can use PROC SQL with the OUTOBS= option to generate a sample.
(BTW, it would be a good idea not to name a variable 'group'.)
data YourDataSet;
set YourDataSet;
myrandomnumber = ranuni(123);
run;
proc sql outobs=2000;
create table bluesample as
select *
from YourDataSet
where group eq 'blue'
order by myrandomnumber;
quit;
proc sql outobs=2000;
create table greensample as
select *
from YourDataSet
where group eq 'green'
order by myrandomnumber;
quit;
I have the following code to compute the skewness on a rolling window of returns:
libname backup 'C:\Users\Anwender\Desktop\backup sas data';
data crsp_daily;
set backup.crsp_daily;
run;
proc sort data=crsp_daily;
by permno date;
run;
data crsp_daily1a;
set crsp_daily;
lastofmonth = last.month;
by permno year month;
run;
proc sql;
create table roll_ret as
select h2.permno, h2.Date, h1.retadj as lagret
from crsp_daily1a as h1,
crsp_daily1a as h2
where h1.permno = h2.permno
and intck("MONTH",h1.date,h2.date) between 0 and 11
group by h2.permno, h2.date
having count(h2.permno)>250 and h2.lastofmonth = 1
;
quit;
proc means data = roll_ret noprint;
by permno date;
var lagret;
output out=crsp_daily_final skew=skewRet kurt=KurtRet;
run;
The input data set has a daily date variable, from which I have already constructed a year and month variable. It also has an ID for the stock (permno) and daily returns (retadj). I want to compute rolling skewness from all observations from the last year, but only if there are at least 250 observations in this window. I am only interested in results for the last day of each month.
The input data set has more than 60 million observations, and the above code is simply too slow. I have already tried to work with a view instead of a data set for roll_view, without improvement.
How can I quickly compute a rolling skewness in the above sense for this very large data set?
General comments on my code would be appreciated as well.
Thanks very much!
PROC SQL performs heuristic analysis of the potential join strategies; you can review the strategy it chose by using the PROC SQL _METHOD option. Potential user optimization strategies are outlined here (http://support.sas.com/techsup/technote/ts553.html). Your case probably falls into the category of joining a small (h2) and a large (h1) data set, where creating an index on the join key usually helps.
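A sketch of both suggestions, using the table and column names from the question. Whether the optimizer actually uses the index depends on the data, so treat this as something to try rather than a guaranteed speed-up.

```sas
/* Simple index on the join key; a simple index must carry the name of its column. */
proc sql;
  create index permno on crsp_daily1a(permno);
quit;

/* Rerun the query with _METHOD so the chosen join method is written to the log. */
proc sql _method;
  create table roll_ret as
  select h2.permno, h2.date, h1.retadj as lagret
  from crsp_daily1a as h1,
       crsp_daily1a as h2
  where h1.permno = h2.permno
    and intck("MONTH", h1.date, h2.date) between 0 and 11
  group by h2.permno, h2.date
  having count(h2.permno) > 250 and h2.lastofmonth = 1;
quit;
```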