I have a data table with columns: Year, Month, Sales. It is effectively a summary table, like a pivot table in Excel.
With this table, if no sales are reported for a month (i.e. not 0 sales, but no sales records at all, so SAS has no value to assign to that month), the whole row disappears.
I do not want this to happen; I would like that row to display 0 rather than not appear. Is there a way to ensure every row appears?
Note: The months are not calendar months, as such you could have month60 relating to 2011.
If the table is being created using proc summary or proc means, one way of achieving the sort of output you want, provided that each month appears in at least 1 row somewhere in your data, is to use the completetypes option, e.g.
proc summary data = sashelp.class completetypes;
class sex age;
var weight;
output out = mysummary mean=;
run;
This produces a row with frequency 0 for Sex = F, Age = 16 rather than skipping that output entirely.
A more reliable but more labour-intensive method, which works even if some values never appear anywhere in your data, is to use the classdata option, e.g.
data myclassdata;
do SEX = 'M','F';
do AGE = 13 to 17;
output;
end;
end;
run;
proc summary nway data = sashelp.class classdata=myclassdata exclusive;
class sex age;
var weight;
output out = mysummary2 mean=;
run;
The exclusive option here restricts the output to combinations of levels that are present in the classdata dataset. Without it, you get at least those specified in the classdata plus rows for all possible combinations based on observed 1-way values as though you had specified completetypes.
I'm fairly new to SAS and need some help, if this is at all possible.
My data looks something like this
sample data
For now, I only have 201801, but I want the following code to work for all months in 2018, or rather to force every yearmonth to show up
PROC TRANSPOSE DATA = have OUT = want PREFIX = primary_;
BY member_no;
ID yearmonth_pd;
VAR desc;
OPTIONS MISSING = ' ';
RUN;
The purpose of this is that later on I want to left join every yearmonth_pd to other information, but this returns the error
ERROR: Column primary_201802 could not be found in the table/view identified with the
correlation name B.
We are entering the variables manually, as in b.primary_201801, b.primary_201802,....
but were hoping this could be done without the need for that, as other departments use the code but don't necessarily know how to code.
Basically, I want something like the following (the table on top is what I'm looking for and the one below is what data generally looks like)
desired outcome
Thanks all!
You did not indicate whether yearmonth_pd is a string, a formatted date value, or simply an integer encoding yyyymm for year and month. I'll presume a simple integer.
You will need to prepend an extra by group of rows to have, so that there is one row per target column.
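/* dummy by group with one row per month; assumes member_no is numeric and never 0 */
data all_months;
member_no = 0;
do yearmonth_pd = 201801 to 201812;
output;
end;
run;
/* dummy rows first, so the transposed columns are created in month order */
data haveFull;
set all_months have;
run;
/* transpose the padded data, then drop the dummy member again */
PROC TRANSPOSE DATA = haveFull OUT = want(where=(member_no ne 0)) PREFIX = primary_;
BY member_no;
ID yearmonth_pd;
VAR desc;
OPTIONS MISSING = ' ';
RUN;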
This approach can also be used to force a certain column order in the out= table when the data= has sparse or out of sequence id values.
I'm trying to pull back all MAX instances given subset data....first.id or last.id doesn't work because I want to keep several rows of the same transaction. For example:
TableView_of_Data
In this example I want the highlighted rows as output. My data has several FORMs, QUARTERs, and CUST_IDs, and I'd like to have SAS programmatically pull back the latest rows based on FORM, QUARTER, CUST_ID.
Last.DB_ID only brings back 1 row. I need all rows of the same DB_ID.
Also, this failed to do anything:
data work.want;
set work.have;
by FORM Quarter Cust_ID DB_ID ;
if Max(DB_ID) then output;
run;
You need to do two passes through your data: one to determine what the max value is for that ID, and one to find the rows that have that maximum value.
One way to do this in the data step is a double DoW loop, which runs one data step iteration per cust_id group but makes two passes through that group's rows.
data want;
do _n_ = 1 by 1 until (last.cust_id);
set have;
by form quarter cust_id;
if last.cust_id then max_db_value=db_id;
end;
do _n_ = 1 by 1 until (last.cust_id);
set have;
by form quarter cust_id;
if db_id = max_db_Value then output;
end;
run;
That works if DB_ID is sorted as it is in your example. If it's not sorted, you can compare the currently stored max_db_value to the current db_id and assign the new value from db_id to it if it's higher, something like
max_db_value = max(db_id, max_db_value);
instead of assigning it when last.cust_id is true.
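As a rough sketch of that unsorted variant, the whole step might look like the code below. It assumes db_id is numeric (max() only takes numeric arguments) and that have is still sorted by form, quarter and cust_id. max_db_value is not retained, so it starts out missing for each cust_id group, and max() ignores missing values.
data want;
/* pass 1: find the largest db_id in this form/quarter/cust_id group */
do _n_ = 1 by 1 until (last.cust_id);
set have;
by form quarter cust_id;
max_db_value = max(db_id, max_db_value);
end;
/* pass 2: re-read the same group and keep the rows at that maximum */
do _n_ = 1 by 1 until (last.cust_id);
set have;
by form quarter cust_id;
if db_id = max_db_value then output;
end;
run;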
I have a simple data set of customers (about 40,000k)
It looks like:
customerid, group, other_variable
a,blue,y
b,blue,x
c,blue,z
d,green,y
e,green,d
f,green,r
g,green,e
I want to randomly select Y customers for each group (along with their other variable(s)).
The catch is, I want two random selections of Y customers for each group
i.e.
4000 random green customers split into two sets of 2000 randomly
and 4000 random blue customers split into two sets of 2000 randomly
This is because I have different messages to give to the two different splits
I'm not sampling with replacement. Needs to be unique customers
Would prefer a solution in PROC SQL but happy for an alternative solution in SAS if proc sql isn't ideal
proc surveyselect is the general tool of choice for random sampling in SAS. The code is very simple: I would just sample 4000 of each group, then assign a new subgroup every 2000 rows, since the data is in a random order anyway (although sorted by group).
The default sampling method for proc surveyselect is srs, which is simple random sampling without replacement, exactly what is required here.
Here's some example code.
/* create dummy dataset */
data have;
do customerid = 1 to 10000;
length group other_variable $8;
if rand('uniform')<0.5 then group = 'blue'; /* assign blue or green with equal likelihood */
else group = 'green';
other_variable = byte(97+(floor((1+122-97)*rand('uniform')))); /* random letter between a and z */
output;
end;
run;
/* dataset must be sorted by group variable */
proc sort data=have;
by group;
run;
/* extract random sample of 4000 from each group */
proc surveyselect data=have
out=want
n=4000
seed=12345; /* specify seed to enable results to be reproduced */
strata group; /* set grouping variable */
run;
/* assign a new subgroup for every 2000 rows */
data want;
set want;
sub=int((_n_-1)/2000)+1;
run;
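Note that the step above numbers the subgroups 1 to 4 across the whole dataset (1 and 2 for the first group, 3 and 4 for the second). If you would rather have 1 and 2 within every group, a by-group counter is one way to do it. This is only a sketch: want_split, row_in_group and split are names made up for the example, and it assumes want is still in group order with 4000 rows per group.
data want_split;
set want;
by group;
if first.group then row_in_group = 0; /* restart the counter for each group */
row_in_group + 1; /* sum statement, so the value carries over between rows */
split = ceil(row_in_group / 2000); /* 1 for the first 2000 rows of a group, 2 for the next 2000 */
drop row_in_group;
run;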
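/* attach a random number to every row */
data custgroup;
set sorted_data;
point = ranuni(0);
run;
/* shuffle the rows within each group */
proc sort data = custgroup out=sortedcust;
by group point;
run;
/* number the rows within each group: 1, 2, 3, ... */
data final;
set sortedcust;
by group point;
if first.group then i=1;
else i+1;
run;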
Basically, I first assign a random number to every observation in the data set, then sort by the variables group and point.
This gives a random sequence of observations within each group. i=1 and i+1
number the rows within each group, which avoids extracting duplicate observations. Use output statements as well to control where each observation is stored based on i.
My approach may not be the most efficient one.
The code below should do it. First, you will need to generate a random number. As Joe said above, it is better to seed it with a specific number so that you can reproduce the sample if necessary. Then you can use Proc Sql with the outobs= option to generate a sample.
(BTW, it would be a good idea not to name a variable 'group'.)
data YourDataSet;
set YourDataSet;
myrandomnumber = ranuni(123);
run;
proc sql outobs=2000;
create table bluesample as
select *
from YourDataSet
where group eq 'blue'
order by myrandomnumber;
quit;
proc sql outobs=2000;
create table greensample as
select *
from YourDataSet
where group eq 'green'
order by myrandomnumber;
quit;
I'm a SAS beginner and I'm curious whether the following task can be done much more simply than the way I currently have in mind.
I have the following (simplified) meta data in a table named user_date_money:
User - Date - Money
with various users and dates for every calendar day (for the last 4 years). The data is ordered by User ASC and Date ASC, sample data looks like this:
User | Date | Money
Anna 23.10.2013 5
Anna 24.10.2013 1
Anna 25.10.2013 12
....
Aron 23.10.2013 5
Aron 24.10.2013 12
Aron 25.10.2013 4
....
Zoe 23.10.2013 1
Zoe 24.10.2013 1
Zoe 25.10.2013 0
I now want to calculate a five day moving average for the Money. I started with the pretty popular approach with the lag() function like this:
data cma;
set user_date_money;
if missing(money) then
do;
OBS = 0;
money = 0.0;
end;
else OBS = 1;
money5 = lag5(money);
OBS5= lag5(obs);
if missing(money5) then money5= 0.0;
if missing(obs5) then obs5= 0;
if _N_ = 1 then
do;
SUM = 0.0;
N = 0;
end;
else;
sum = sum + money-money5;
n = n + obs-obs5;
MEAN = sum / n ;
retain sum n;
run;
As you see, the problem with this method occurs when the data step runs into a new user. Aron would get some lagged values from Anna, which of course should not happen.
Now my question: I am pretty sure you can handle the user switch by adding some extra fields like laggeduser and by resetting the N, Sum and Mean variables if you notice such a switch, but:
Can this be done in an easier way? Perhaps by using a BY statement in some way?
Thanks for your ideas and help!
Best regards
I think the easiest way is to use PROC EXPAND:
PROC EXPAND data=user_date_money out=cma;
ID date;
BY user;
CONVERT money=MEAN / transformin=(setmiss 0) transformout=(movave 5);
RUN;
And as mentioned in John's comment, it's important to remember about missing values (and about beginning and ending observations as well). I've added SETMISS option to the code, as you made it clear that you want to 'zerofy' missing values, not ignore them (default MOVAVE behaviour).
And if you want to exclude first 4 observations for each user (since they don't have enough pre-history to calculate moving average 5), you can use option 'TRIMLEFT 4' inside TRANSFORMOUT=().
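For illustration, the same step with that option added might look like this (cma_trimmed is just a name picked for the example). TRIMLEFT 4 sets the first four MEAN values of each user to missing; it does not delete the rows, so filter them out afterwards if you need them gone.
PROC EXPAND data=user_date_money out=cma_trimmed;
ID date;
BY user;
/* zero-fill missing input, take the 5-row moving average, blank the first 4 results per user */
CONVERT money=MEAN / transformin=(setmiss 0) transformout=(movave 5 trimleft 4);
RUN;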
If your particular need is simple enough, you can calculate it using PROC MEANS and a multilabel format.
data mydata;
do id = 1 to 5;
datevar = '01JAN2010'd-1;
do month = 0 to 4;
datevar=intnx('MONTH',datevar,1,'b');
sales = floor(500*rand('normal',7))+1500;
output;
end;
end;
run;
proc format;
value movingavg (multilabel notsorted)
'01JAN2010'd-'31MAR2010'd = 'JAN-MAR 2010'
'01FEB2010'd-'30APR2010'd = 'FEB-APR 2010'
'01MAR2010'd-'31MAY2010'd = 'MAR-MAY 2010'
/* ... more of these ... */
;
run;
proc means data=mydata;
class id datevar/mlf order=data;
types id*datevar;
format datevar movingavg.;
var sales;
run;
The PROC FORMAT can be done programmatically by use of the CNTLIN dataset; see the SAS documentation for PROC FORMAT for more information.
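As a sketch of the CNTLIN approach, the step below builds the same three ranges in a loop instead of typing them out. The dataset name movavg_cntlin and the helper variables range_start and range_end are made up for the example. HLO='MS' is meant to switch on the MULTILABEL (M) and NOTSORTED (S) flags, and the label logic assumes a window never crosses a year boundary.
data movavg_cntlin(rename=(range_start=start range_end=end));
retain fmtname 'movingavg' type 'N' hlo 'MS';
length label $12;
do i = 0 to 2; /* one row per three-month window */
range_start = intnx('month', '01JAN2010'd, i, 'b'); /* first day of the window */
range_end = intnx('month', range_start, 2, 'e'); /* last day, two months later */
label = upcase(put(range_start, monname3.)) || '-' || upcase(put(range_end, monname3.)) || ' ' || put(range_end, year4.);
output;
end;
drop i;
run;
proc format cntlin=movavg_cntlin;
run;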
If you make sure your data is sorted, you can use the FIRST. and LAST. variables to initialize your running totals when you get to a new member. These and retain should get you what you need; I don't think lag() is really called for here.
Yes, you can use by groupings. First, you'll sort by user and date (as you already have).
proc sort data=user_date_money;
by user date;
run;
Then, redo the data step using the by variable and a counter.
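data cma;
set user_date_money;
by user;
length User_Recs 3
Average 8;
retain User_Recs;
/* Call the LAG functions on every row: a LAG that runs only conditionally */
/* returns the value from the last row on which it ran, not the previous row. */
money1=lag1(money);
money2=lag2(money);
money3=lag3(money);
money4=lag4(money);
if First.User=1 then User_Recs=0;
User_Recs=User_Recs+1;
/* Average only once the user has five rows, so no other user's values leak in */
if User_Recs>4 then do;
Average=(money4+money3+money2+money1+money)/5;
end;
drop User_Recs money1-money4;
run;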
I have 2 SAS data sets, Lab and Rslt. They both have all of the same variables, but Rslt is supposed to have what is essentially a subset of Lab. For what I'm trying to do, there are 4 important variables: visit, accsnnum, battrnam, and lbtestcd. All are character variables. I want to compare the two files Lab and Rslt to find out where they vary -- specifically, I need to know the count of lbtestcd per unique accsnnum.
But I must control for a few factors. First, I only need to compare observations that have "Lipid Panel" or "Chemistry (6)" in the battrnam variable. The Rslt file only contains these observations, so we don't need to worry about that one. So I subsetted Lab using this code:
data work.lab;
set livingston.ndb_lab_1;
where battrnam contains "Lipid Panel" or battrnam = "Chemistry (6)";
run;
This worked fine. Now, I need to control for the variable visit. I need to get rid of all observations in both Lab and Rslt that have visits that contain "Day 1" or "Screening". I accomplished this using the following code:
data work.lab;
set work.lab;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
data work.rslt;
set work.rslt;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
Now this is where I get stuck. I need a way to compare the count of lbtestcd by accsnnum between the two separate files Lab and Rslt, and a way to flag each accsnnum where the count of lbtestcd differs between Lab and Rslt. For example, if Lab has an accsnnum A1 that has 5 unique lbtestcd values, and Rslt has the accsnnum A1 with 7 unique lbtestcd values, I need that one to be brought to my attention.
I can do a proc freq for each file, but these are large data sets and I don't want to have to compare by hand. Perhaps I could export the count of lbtestcd by accsnnum for each of the 2 files Lab and Rslt to a variable in a new 3rd dataset, then create a variable that is the difference of these two? Then if the difference != 0 I can get a report of those accsnnum values? Advice in SQL will work too, as I can run that through SAS.
Edit
I've used some SQL to get the count of lbtestcd by accsnnum for each data set using the code below, though I still need to figure out how to export these values to a data set to compare.
proc sql;
select accsnnum, count(lbtestcd)
from work.lab1
group by accsnnum;
quit;
proc sql;
select accsnnum, count(lbtestcd)
from work.rslt1
group by accsnnum;
quit;
Thanks for any and all help you can give. This one is really stumping me!
I would do a PROC FREQ on each dataset (or proc whatever-you-like-that-does-counts) and then use PROC COMPARE. For example:
proc freq data=rslt1;
tables accsnnum*lbtestcd/out=rsltcounts;
run;
proc freq data=lab1;
tables accsnnum*lbtestcd/out=labcounts;
run;
/* compare the two count datasets, not the raw data */
proc compare base=labcounts compare=rsltcounts out=compares /* options */;
by accsnnum;
run;
PROC COMPARE has a lot of options; in this case the most helpful would probably be:
outnoequal - only outputs rows that are not identical in the two datasets
outbase and outcomp - output a row from each of the BASE and COMPARE datasets (with OUTNOEQUAL, only when they differ)
outdif - outputs 'difference' rows, i.e., one minus the other; this may or may not be helpful for you
The documentation lists all of the options. You may also need to look at the METHOD options if your data might have numeric precision issues.
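If you prefer the SQL route mentioned in the question, a rough alternative to PROC COMPARE is to count the distinct lbtestcd values per accsnnum in each dataset and full-join the two sets of counts. This is a sketch that has not been run against your data: count_diffs, n_tests, lab_tests and rslt_tests are names invented here, and lab1 and rslt1 are assumed to be the subsetted datasets.
proc sql;
create table count_diffs as
select coalesce(l.accsnnum, r.accsnnum) as accsnnum,
coalesce(l.n_tests, 0) as lab_tests, /* 0 when the accsnnum is absent from lab1 */
coalesce(r.n_tests, 0) as rslt_tests, /* 0 when the accsnnum is absent from rslt1 */
coalesce(l.n_tests, 0) - coalesce(r.n_tests, 0) as difference
from (select accsnnum, count(distinct lbtestcd) as n_tests
from lab1 group by accsnnum) as l
full join
(select accsnnum, count(distinct lbtestcd) as n_tests
from rslt1 group by accsnnum) as r
on l.accsnnum = r.accsnnum
where coalesce(l.n_tests, 0) ne coalesce(r.n_tests, 0);
quit;
count_diffs then holds one row for each accsnnum whose counts differ, including any accsnnum that appears in only one of the two datasets.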