Making simple calculations across rows and columns in SAS - sql

I have an issue with making calculations in SAS. As an example, I have the following data:
Type amount
Axiom_Indlån 19966699113
Puljerneskontantindestående 133819901
Puljerne Andre passiver -9389117
Rap_Indlån 47501558321
I want to calculate the following:
('Rap_Indlån' - 'Puljerneskontantindestående' - 'Puljerne Andre passiver') - Axiom_Indlån
How do I achieve this ?
And how would I do it if it was columns instead of rows?
This is one of my big issues, and I hope you can point me in the right direction.

Hard to tell what you are asking for since you have not shown an input or output datasets.
But it sounds like you just want to multiply some of the values by negative one before summing.
So if your dataset looks like this:
data have;
infile cards dsd truncover;
input name :$50. value;
cards;
Axiom_Indlån,19966699113
Puljerneskontantindestående,133819901
Puljerne Andre passiver,-9389117
Rap_Indlån,47501558321
;
You could get the total pretty easily in SQL for example.
proc sql;
create table want as
select sum(VALUE*case when (NAME in ('Rap_Indlån')) then 1 else -1 end) as TOTAL
from HAVE
;
quit;

If you wanted to do it by columns, simply transpose and subtract as normal.
options validvarname=any; /* allow the spaces and special characters in the new column names */
proc transpose data=have out=have_tpose;
id name;
var value;
run;
data want;
set have_tpose;
total = ('Rap_Indlån'n - 'Puljerneskontantindestående'n - 'Puljerne Andre passiver'n) - 'Axiom_Indlån'n;
run;

Related

SAS, forcing the creation of columns when using proc transpose

I'm fairly new to SAS and need some help, if this is at all possible.
My data looks something like this
sample data
For now, I only have 201801, but I want the following code to work for all months in 2018, or rather to force every yearmonth to show up
PROC TRANSPOSE DATA = have OUT = want PREFIX = primary_;
BY member_no;
ID yearmonth_pd;
VAR desc;
OPTIONS MISSING = ' ';
RUN;
The purpose of this is that later on I want to left join every yearmonth_pd to other information, but this returns the error
ERROR: Column primary_201802 could not be found in the table/view identified with the
correlation name B.
We are entering the variables manually, as in b.primary_201801, b.primary_201802,....
but were hoping this could be done without the need for that, as other departments use the code but don't necessarily know how to code.
Basically, I want something like the following (the table on top is what I'm looking for and the one below is what data generally looks like)
desired outcome
Thanks all!
You did not indicate if yearmonth_pd is a string, a formatted date value, or simply an integer that is an encoded representation yyyymm for year and month. I'll presume simple integer.
You will need to prepend a by group of extra rows to have such that there is one row per target column.
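data all_months;
member_no = 0; /* dummy BY group, filtered out again after the transpose */
do yearmonth_pd = 201801 to 201812; /* must match the ID variable name */
output;
end;
run;
data haveFull;
set all_months have;
run;
/* transpose the padded dataset, not the original one */
PROC TRANSPOSE DATA = haveFull OUT = want(where=(member_no ne 0)) PREFIX = primary_;
BY member_no;
ID yearmonth_pd;
VAR desc;
OPTIONS MISSING = ' ';
RUN;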
This approach can also be used to force a certain column order in the out= table when the data= has sparse or out of sequence id values.

proc sql correlation

Can anybody tell me how I can calculate the correlation between two variables within each group in Proc Sql? Is there a function for this, like sum or mean? Thanks a lot!
You should use proc corr to start with, as this does all the required calculations, which gets you most of the way there. You will need to filter and transpose the output dataset into your desired format. There are many answers on this site showing how to do that sort of thing, so have a look at those - in this case a wide to long transposition is required.
proc sort data = sashelp.class out = class;
by sex;
run;
proc corr data = class outp=mypcorr noprint;
var HEIGHT WEIGHT;
by SEX;
run;
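As a rough sketch of the filtering step: the OUTP= dataset should hold a _TYPE_ column, a _NAME_ column, and one column per analysis variable, so the row where _TYPE_ is 'CORR' and _NAME_ is Height carries the Height/Weight correlation in the Weight column. The names corr_by_sex and corr are made up for illustration.

```sas
/* keep one row per BY group: the Height-Weight Pearson correlation */
data corr_by_sex;
set mypcorr;
where _TYPE_ = 'CORR' and upcase(_NAME_) = 'HEIGHT';
corr = Weight; /* correlation of Height with Weight */
keep Sex corr;
run;
```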

SAS how to get random selection by group randomly split into multiple groups

I have a simple data set of customers (about 40,000k)
It looks like:
customerid, group, other_variable
a,blue,y
b,blue,x
c,blue,z
d,green,y
e,green,d
f,green,r
g,green,e
I want to randomly select, for each group, Y customers (along with their other variable(s)).
The catch is that I want two random selections of Y customers for each group
i.e.
4000 random green customers split into two sets of 2000 randomly
and 4000 random blue customers split into two sets of 2000 randomly
This is because I have different messages to give to the two different splits
I'm not sampling with replacement. Needs to be unique customers
Would prefer a solution in PROC SQL, but happy with an alternative solution in SAS if PROC SQL isn't ideal
proc surveyselect is the general tool of choice for random sampling in SAS. The code is very simple: I would just sample 4000 of each group, then assign a new subgroup every 2000 rows, since the data is in a random order anyway (although sorted by group).
The default sampling method for proc surveyselect is srs, which is simple random sampling without replacement, exactly what is required here.
Here's some example code.
/* create dummy dataset */
data have;
do customerid = 1 to 10000;
length group other_variable $8;
if rand('uniform')<0.5 then group = 'blue'; /* assign blue or green with equal likelihood */
else group = 'green';
other_variable = byte(97+(floor((1+122-97)*rand('uniform')))); /* random letter between a and z */
output;
end;
run;
/* dataset must be sorted by group variable */
proc sort data=have;
by group;
run;
/* extract random sample of 4000 from each group */
proc surveyselect data=have
out=want
n=4000
seed=12345; /* specify seed to enable results to be reproduced */
strata group; /* set grouping variable */
run;
/* assign a new subgroup for every 2000 rows */
data want;
set want;
sub=int((_n_-1)/2000)+1;
run;
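/* attach a random sort key to every observation */
data custgroup;
set sorted_data;
point = ranuni(0); /* seed 0 uses the clock; use a positive seed to reproduce */
run;
/* shuffle the rows within each group */
proc sort data=custgroup out=sortedcust;
by group point;
run;
/* number the rows within each group */
data final;
set sortedcust;
by group point;
if first.group then i = 0;
i + 1; /* sum statement: i is retained across rows */
run;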
Basically, what I am doing is first assigning a random number to every observation in the data set, then sorting by the variables group and point.
This gives a random sequence of observations within each group. The counter i, reset at the start of each group and incremented on every row,
identifies the position of each observation within its group. Because every customer gets exactly one position, no observation can be extracted twice. Use the output statement as well to control where each observation is stored based on i.
My approach may not be the most efficient one.
The code below should do it. First, you will need to generate a random number. As Joe said above, it is better to seed it with a specific number so that you can reproduce the sample if necessary. Then you can use Proc Sql with the outobs option to generate a sample.
(BTW, it would be a good idea not to name a variable 'group'.)
data YourDataSet;
set YourDataSet;
myrandomnumber = ranuni(123);
run;
proc sql outobs=2000;
create table bluesample as
select *
from YourDataSet
where group eq 'blue'
order by myrandomnumber;
quit;
proc sql outobs=2000;
create table greensample as
select *
from YourDataSet
where group eq 'green'
order by myrandomnumber;
quit;

sas - calculate moving average for grouped data with BY statement

I'm a SAS beginner and I'm curious whether the following task can be done much more simply than the way I currently have it in my head.
I have the following (simplified) meta data in a table named user_date_money:
User - Date - Money
with various users and dates for every calendar day (for the last 4 years). The data is ordered by User ASC and Date ASC, sample data looks like this:
User | Date | Money
Anna 23.10.2013 5
Anna 24.10.2013 1
Anna 25.10.2013 12
....
Aron 23.10.2013 5
Aron 24.10.2013 12
Aron 25.10.2013 4
....
Zoe 23.10.2013 1
Zoe 24.10.2013 1
Zoe 25.10.2013 0
I now want to calculate a five day moving average for the Money. I started with the pretty popular approach using the lag() function, like this:
data cma;
set user_date_money;
if missing(money) then
do;
OBS = 0;
money = 0.0;
end;
else OBS = 1;
money5 = lag5(money);
OBS5= lag5(obs);
if missing(money5) then money5= 0.0;
if missing(obs5) then obs5= 0;
if _N_ = 1 then
do;
SUM = 0.0;
N = 0;
end;
else;
sum = sum + money-money5;
n = n + obs-obs5;
MEAN = sum / n ;
retain sum n;
run;
As you can see, the problem with this method occurs when the data step runs into a new user. Aron would get some lagged values from Anna, which of course should not happen.
Now my question: I am pretty sure you can handle the user switch by adding some extra fields like laggeduser and by resetting the N, Sum and Mean variables if you notice such a switch but:
Can this be done in an easier way? Perhaps using the BY Clause in any way?
Thanks for your ideas and help!
Best regards
I think the easiest way is to use PROC EXPAND:
PROC EXPAND data=user_date_money out=cma;
ID date;
BY user;
CONVERT money=MEAN / transformin=(setmiss 0) transformout=(movave 5);
RUN;
And as mentioned in John's comment, it's important to remember about missing values (and about beginning and ending observations as well). I've added SETMISS option to the code, as you made it clear that you want to 'zerofy' missing values, not ignore them (default MOVAVE behaviour).
And if you want to exclude first 4 observations for each user (since they don't have enough pre-history to calculate moving average 5), you can use option 'TRIMLEFT 4' inside TRANSFORMOUT=().
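For illustration, the same step with TRIMLEFT added might look like the following. This is only a sketch: METHOD=NONE is my own addition, on the assumption that you want the transformations applied to the raw series with no interpolation.

```sas
proc expand data=user_date_money out=cma method=none;
id date;
by user;
/* zero-fill missing values, take a 5-row moving average,
   then set the first 4 results of each user to missing */
convert money=MEAN / transformin=(setmiss 0) transformout=(movave 5 trimleft 4);
run;
```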
If your particular need is simple enough, you can calculate it using PROC MEANS and a multilabel format.
data mydata;
do id = 1 to 5;
datevar = '01JAN2010'd-1;
do month = 0 to 4;
datevar=intnx('MONTH',datevar,1,'b');
sales = floor(500*rand('normal',7))+1500;
output;
end;
end;
run;
proc format;
value movingavg (multilabel notsorted)
'01JAN2010'd-'31MAR2010'd = 'JAN-MAR 2010'
'01FEB2010'd-'30APR2010'd = 'FEB-APR 2010'
'01MAR2010'd-'31MAY2010'd = 'MAR-MAY 2010'
/* ... more of these ... */
;
run;
proc means data=mydata;
class id datevar/mlf order=data;
types id*datevar;
format datevar movingavg.;
var sales;
run;
The PROC FORMAT can be built programmatically by use of a CNTLIN dataset; see the SAS documentation for PROC FORMAT for more information.
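A minimal sketch of what such a control dataset could look like for the three ranges above. The dataset name fmt_cntl is made up, and I am assuming HLO='M' is how a control dataset flags MULTILABEL; writing the hand-coded format out with CNTLOUT= is an easy way to check the exact layout on your installation.

```sas
/* build one overlapping 3-month range per starting month */
data fmt_cntl;
length fmtname $32 type $1 hlo $1 label $20;
fmtname = 'movingavg';
type = 'N'; /* numeric format */
hlo = 'M';  /* assumed flag for a multilabel format */
do i = 0 to 2;
start = intnx('month', '01JAN2010'd, i, 'b');
end = intnx('month', start, 2, 'e');
/* e.g. JAN-MAR 2010 */
label = catx(' ', upcase(catx('-', put(start, monname3.), put(end, monname3.))), year(start));
output;
end;
keep fmtname type hlo start end label;
run;
proc format cntlin=fmt_cntl;
run;
```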
If you make sure your data is sorted, you can use the FIRST. and LAST. automatic variables (available with a BY statement) to initialize your running totals when you get to a new user. These and retain should get you what you need; I don't think lag() is really called for here.
Yes, you can use by groupings. First, you'll sort by user and date (as you already have).
proc sort data=user_date_money;
by user date;
run;
Then, redo the data step using the by variable and a counter.
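data cma;
set user_date_money;
by user;
length User_Recs 3
Average 8;
retain User_Recs;
if First.User=1 then User_Recs=0;
User_Recs=User_Recs+1;
/* call LAG on every row: a LAG inside an IF only updates its queue when the IF is true */
lag1=lag1(money);
lag2=lag2(money);
lag3=lag3(money);
lag4=lag4(money);
/* only average once the user has five rows, so no values leak in from the previous user */
if User_Recs>4 then Average=(lag4+lag3+lag2+lag1+money)/5;
drop User_Recs lag1-lag4;
run;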

Need to compare 2 variables, each coming from a separate data set, and flag differences

I have 2 SAS data sets, Lab and Rslt. They both have all of the same variables, but Rslt is supposed to have what is essentially a subset of Lab. For what I'm trying to do, there are 4 important variables: visit, accsnnum, battrnam, and lbtestcd. All are character variables. I want to compare the two files Lab and Rslt to find out where they vary -- specifically, I need to know the count of lbtestcd per unique accsnnum.
But I must control for a few factors. First, I only need to compare observations that have "Lipid Panel" or "Chemistry (6)" in the battrnam variable. The Rslt file only contains these observations, so we don't need to worry about that one. So I subsetted Lab using this code:
data work.lab;
set livingston.ndb_lab_1;
where battrnam contains "Lipid Panel" or battrnam = "Chemistry (6)";
run;
This worked fine. Now, I need to control for the variable visit. I need to get rid of all observations in both Lab and Rslt that have visits that contain "Day 1" or "Screening". I accomplished this using the following code:
data work.lab;
set work.lab;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
data work.rslt;
set work.rslt;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
Now this is where I get stuck. I need a way to compare the count of lbtestcd by accsnnum between the two separate files Lab and Rslt, and to flag each accsnnum where the count of lbtestcd differs between Lab and Rslt. For example, if Lab has an accsnnum A1 with 5 unique lbtestcd values, and Rslt has the accsnnum A1 with 7 unique lbtestcd values, I need that one to be brought to my attention.
I can do a proc freq for each file, but these are large data sets and I don't want to have to compare by hand. Perhaps I could export the count of lbtestcd by accsnnum to a variable in a new 3rd dataset for each of the 2 files Lab and Rslt, then create a variable that is the difference of these two? Then if difference != 0 I can get a report of those accsnnum values. Advice in SQL will work too, as I can run that through SAS.
Edit
I've used some SQL to get the count of lbtestcd by accsnum for each data set using the code below, though I still need to figure out how to export these values to a data set to compare.
proc sql;
select accsnnum, count(lbtestcd)
from work.lab1
group by accsnnum;
quit;
proc sql;
select accsnnum, count(lbtestcd)
from work.rslt1
group by accsnnum;
quit;
Thanks for any and all help you can give. This one is really stumping me!
I would do a PROC FREQ on each dataset (or proc whatever-you-like-that-does-counts) and then use PROC COMPARE. For example:
proc freq data=rslt1;
tables accsnnum*lbtestcd/out=rsltcounts;
run;
proc freq data=lab1;
tables accsnnum*lbtestcd/out=labcounts;
run;
/* compare the two count datasets, not the raw data */
proc compare base=labcounts compare=rsltcounts out=compares /* options */;
by accsnnum;
run;
PROC COMPARE has a lot of options; in this case the most helpful would probably be:
outnoequal - only outputs rows that are not identical in the two datasets
outbase and outcomp - outputs a row for each of BASE and COMPARE datasets (if OUTNOEQUAL, then only when they differ)
outdif - outputs 'difference' rows, ie, one minus the other; this may or may not be helpful for you
The documentation lists all of the options. You may also need to look at the METHOD options if your data might have numeric precision issues.
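Since the question also allows SQL, here is a sketch of the "difference in a third dataset" idea using a full join. It assumes the filtered datasets are work.lab and work.rslt as in the question; the names count_diffs, n_tests, lab_tests and rslt_tests are made up. An accsnnum that is missing from one side is reported with a count of 0 there.

```sas
proc sql;
create table count_diffs as
select coalesce(l.accsnnum, r.accsnnum) as accsnnum,
       coalesce(l.n_tests, 0) as lab_tests,
       coalesce(r.n_tests, 0) as rslt_tests
from (select accsnnum, count(distinct lbtestcd) as n_tests
      from work.lab
      group by accsnnum) as l
full join
     (select accsnnum, count(distinct lbtestcd) as n_tests
      from work.rslt
      group by accsnnum) as r
on l.accsnnum = r.accsnnum
/* keep only the accsnnum values whose counts disagree */
where calculated lab_tests ne calculated rslt_tests;
quit;
```

Use count(lbtestcd) instead of count(distinct lbtestcd) if repeated test codes within an accsnnum should be counted more than once.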