SAS, forcing the creation of columns when using proc transpose - sql

I'm fairly new to SAS and need some help, if this is at all possible.
My data looks something like this
sample data
For now, I only have 201801, but I want the following code to work for all months in 2018, or rather to force every yearmonth to show up:
PROC TRANSPOSE DATA = have OUT = want PREFIX = primary_;
BY member_no;
ID yearmonth_pd;
VAR desc;
OPTIONS MISSING = ' ';
RUN;
The purpose of this is that later on I want to left join every yearmonth_pd to other information, but this returns the error
ERROR: Column primary_201802 could not be found in the table/view identified with the
correlation name B.
We are entering the variables manually, as in b.primary_201801, b.primary_201802, ...,
but we were hoping this could be done without that, since other departments use the code and don't necessarily know how to code.
Basically, I want something like the following (the table on top is what I'm looking for and the one below is what data generally looks like)
desired outcome
Thanks all!

You did not indicate whether yearmonth_pd is a string, a formatted date value, or simply an integer that encodes year and month as yyyymm. I'll presume a simple integer.
You will need to prepend an extra BY group of rows to the have dataset, with one row per target column.
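data all_months;
member_no = 0; /* dummy BY group; assumes no real member has member_no = 0 */
do yearmonth_pd = 201801 to 201812;
output;
end;
run;
data haveFull;
set all_months have; /* dummy rows come first, so their ID values define the columns */
run;
PROC TRANSPOSE DATA = haveFull OUT = want(where=(member_no ne 0)) PREFIX = primary_;
BY member_no;
ID yearmonth_pd;
VAR desc;
OPTIONS MISSING = ' ';
RUN;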
This approach can also be used to force a certain column order in the out= table when the data= has sparse or out of sequence id values.

Making simple calculations across rows and columns in SAS

I have an issue with making calculations in SAS. As an example, I have the following data:
Type amount
Axiom_Indlån 19966699113
Puljerneskontantindestående 133819901
Puljerne Andre passiver -9389117
Rap_Indlån 47501558321
I want to calculate the following:
('Rap_Indlån' - 'Puljerneskontantindestående' - 'Puljerne Andre passiver') - Axiom_Indlån
How do I achieve this?
And how would I do it if the values were in columns instead of rows?
This is one of my big issues, and I hope you can point me in the right direction.
It is hard to tell what you are asking for, since you have not shown input or output datasets.
But it sounds like you just want to multiply some of the values by negative one before summing.
So if your dataset looks like this:
data have;
infile cards dsd truncover;
input name :$50. value;
cards;
Axiom_Indlån,19966699113
Puljerneskontantindestående,133819901
Puljerne Andre passiver,-9389117
Rap_Indlån,47501558321
;
You could get the total pretty easily in SQL for example.
proc sql;
create table want as
select sum(VALUE*case when (NAME in ('Rap_Indlån')) then 1 else -1 end) as TOTAL
from HAVE
;
quit;
If you wanted to do it by columns, simply transpose and subtract as normal.
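/* allow variable names with spaces and non-ASCII characters, so the ID values survive as name literals */
options validvarname=any;
proc transpose data=have out=have_tpose;
id name;
var value;
run;
data want;
set have_tpose;
total = ('Rap_Indlån'n - 'Puljerneskontantindestående'n - 'Puljerne Andre passiver'n) - 'Axiom_Indlån'n;
run;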

SAS Rows/Columns Not Appearing (Table Formatting Issue)

I have a data table with columns: Year, Month, Sales. It is effectively a summary table, like a pivot table in excel.
With this table, if there are no sales reported for one month (i.e. Not 0 sales, but no mention of sales so SAS cannot pinpoint a value to a certain month) then that whole row would disappear.
I do not want this to happen, I would instead like that row to display 0 rather than not appear. Is there a way to change the format of this to ensure every row would appear?
Note: The months are not calendar months, as such you could have month60 relating to 2011.
If the table is being created using proc summary or proc means, one way of achieving the sort of output you want, provided that you have at least one row for each month in your data, is to use the completetypes option, e.g.
proc summary data = sashelp.class completetypes;
class sex age;
var weight;
output out = mysummary mean=;
run;
This produces a row with frequency 0 for Sex = F, Age = 16 rather than skipping that output entirely.
A more reliable but more labour-intensive method, which works even if some values never appear anywhere in your data, is to use the classdata option, e.g.
data myclassdata;
do SEX = 'M','F';
do AGE = 13 to 17;
output;
end;
end;
run;
proc summary nway data = sashelp.class classdata=myclassdata exclusive;
class sex age;
var weight;
output out = mysummary2 mean=;
run;
The exclusive option here restricts the output to combinations of levels that are present in the classdata dataset. Without it, you get at least those specified in the classdata plus rows for all possible combinations based on observed 1-way values as though you had specified completetypes.
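Applied to the question's own table, a minimal sketch might look like the following. The dataset names sales_data and sales_summary are hypothetical, and the variables year, month and sales are assumed from the question's description. Empty combinations come out of proc summary with a missing sum, so a follow-up data step turns those into 0.

```sas
/* Sketch only: sales_data and sales_summary are assumed names */
proc summary data = sales_data completetypes nway;
  class year month;
  var sales;
  output out = sales_summary (drop = _type_) sum=;
run;

/* combinations with no input rows have _freq_ = 0 and a missing sum */
data sales_summary;
  set sales_summary;
  if missing(sales) then sales = 0;
run;
```

Since the months in the question are not calendar months, completetypes would also create year/month pairs that can never occur, so a classdata dataset listing only the valid pairs may be the better fit there.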

SAS DATA STEP: subset multiple rows without using first/last

I'm trying to pull back all MAX instances within each subset of the data. first.id or last.id doesn't work because I want to keep several rows of the same transaction. For example:
TableView_of_Data
In this example I want the highlighted rows as output. My data has several FORMs, QUARTERs, and CUST_IDs. I'd like to have SAS programmatically pull back the latest rows based on FORM, QUARTER, and CUST_ID.
Last.DB_ID only brings back 1 row. I need all rows of the same DB_ID.
Also, this failed to do anything:
data work.want;
set work.have;
by FORM Quarter Cust_ID DB_ID ;
if Max(DB_ID) then output;
run;
You need to do two passes through your data: one to determine what the max value is for that ID, and one to find the rows that have that maximum value.
Doing this in the data step requires a DoW loop, which runs one data step iteration per cust_id value but two passes through the dataset.
data want;
do _n_ = 1 by 1 until (last.cust_id);
set have;
by form quarter cust_id;
if last.cust_id then max_db_value=db_id;
end;
do _n_ = 1 by 1 until (last.cust_id);
set have;
by form quarter cust_id;
if db_id = max_db_Value then output;
end;
run;
That works if DB_ID is sorted as it is in your example. If it's not sorted, you can compare the currently stored max_db_value to the current db_id and assign the new value from db_id to it if it's higher, something like
max_db_value = max(db_id, max_db_value);
instead of assigning it when last.cust_id is true.
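Put together, the unsorted variant could be sketched as below. It assumes DB_ID is numeric (the max function needs numeric arguments) and that have is sorted by form, quarter and cust_id. max_db_value is reset to missing at the start of every data step iteration, and max() ignores missing values, so the first row of each group initialises it.

```sas
/* Sketch: double DoW loop that does not rely on DB_ID being in order */
data want;
  /* pass 1: find the largest DB_ID within the FORM/QUARTER/CUST_ID group */
  do _n_ = 1 by 1 until (last.cust_id);
    set have;
    by form quarter cust_id;
    max_db_value = max(db_id, max_db_value);
  end;
  /* pass 2: re-read the same group and keep the rows that match the maximum */
  do _n_ = 1 by 1 until (last.cust_id);
    set have;
    by form quarter cust_id;
    if db_id = max_db_value then output;
  end;
  drop max_db_value;
run;
```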

SAS how to get random selection by group randomly split into multiple groups

I have a simple data set of customers (about 40,000k)
It looks like:
customerid, group, other_variable
a,blue,y
b,blue,x
c,blue,z
d,green,y
e,green,d
f,green,r
g,green,e
I want to randomly select, for each group, Y customers (along with their other variable(s)).
The catch is, I want to have two random selections of Y customers for each group
i.e.
4000 random green customers split into two sets of 2000 randomly
and 4000 random blue customers split into two sets of 2000 randomly
This is because I have different messages to give to the two different splits
I'm not sampling with replacement. Needs to be unique customers
Would prefer a solution in PROC SQL, but happy with an alternative solution in SAS if PROC SQL isn't ideal
proc surveyselect is the general tool of choice for random sampling in SAS. The code is very simple: I would just sample 4000 from each group, then assign a new subgroup every 2000 rows, since the data is in a random order anyway (although sorted by group).
The default sampling method for proc surveyselect is srs, which is simple random sampling without replacement, exactly what is required here.
Here's some example code.
/* create dummy dataset */
data have;
do customerid = 1 to 10000;
length group other_variable $8;
if rand('uniform')<0.5 then group = 'blue'; /* assign blue or green with equal likelihood */
else group = 'green';
other_variable = byte(97+(floor((1+122-97)*rand('uniform')))); /* random letter between a and z */
output;
end;
run;
/* dataset must be sorted by group variable */
proc sort data=have;
by group;
run;
/* extract random sample of 4000 from each group */
proc surveyselect data=have
out=want
n=4000
seed=12345; /* specify seed to enable results to be reproduced */
strata group; /* set grouping variable */
run;
/* assign a new subgroup for every 2000 rows */
data want;
set want;
sub=int((_n_-1)/2000)+1;
run;
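A quick way to check the split is to cross-tabulate group against the new sub variable. This is only a sanity check on the want dataset built above. Because the sample is 4000 rows per stratum and the output is ordered by group, each group/sub cell should hold 2000 rows, with sub running from 1 to 4 across the two groups.

```sas
/* sanity check: row counts per group and subgroup */
proc freq data=want;
  tables group*sub / norow nocol nopercent;
run;
```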
data custgroup;
set sorted_data;
point = ranuni(0); /* one random number per observation */
run;
proc sort data = custgroup out = sortedcust;
by group point;
run;
data final;
set sortedcust;
by group point;
if first.group then i = 1; /* restart the counter for each group */
else i + 1; /* sum statement: i is retained across rows */
run;
Basically, I first assign a random number to every observation in the data set, then sort by the variables group and point.
That gives a random sequence of observations within each group. i=1 and i+1
number the rows within each group, which avoids extracting duplicate observations. Use output statements as well to control where each observation is stored, based on i.
My approach may not be the most efficient one.
The code below should do it. First, you will need to generate a random number. As Joe said above, it is better to seed it with a specific number so that you can reproduce the sample if necessary. Then you can use PROC SQL with the outobs option to generate a sample.
(BTW, it would be a good idea not to name a variable 'group'.)
data YourDataSet;
set YourDataSet;
myrandomnumber = ranuni(123);
run;
proc sql outobs=2000;
create table bluesample as
select *
from YourDataSet
where group eq 'blue'
order by myrandomnumber;
quit;
proc sql outobs=2000;
create table greensample as
select *
from YourDataSet
where group eq 'green'
order by myrandomnumber;
quit;

Need to compare 2 variables, each coming from a separate data set, and flag differences

I have 2 SAS data sets, Lab and Rslt. They both have all of the same variables, but Rslt is supposed to have what is essentially a subset of Lab. For what I'm trying to do, there are 4 important variables: visit, accsnnum, battrnam, and lbtestcd. All are character variables. I want to compare the two files Lab and Rslt to find out where they vary -- specifically, I need to know the count of lbtestcd per unique accsnnum.
But I must control for a few factors. First, I only need to compare observations that have "Lipid Panel" or "Chemistry (6)" in the battrnam variable. The Rslt file only contains these observations, so we don't need to worry about that one. So I subsetted Lab using this code:
data work.lab;
set livingston.ndb_lab_1;
where battrnam contains "Lipid Panel" or battrnam = "Chemistry (6)";
run;
This worked fine. Now, I need to control for the variable visit. I need to get rid of all observations in both Lab and Rslt that have visits that contain "Day 1" or "Screening". I accomplished this using the following code:
data work.lab;
set work.lab;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
data work.rslt;
set work.rslt;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
Now this is where I get stuck. I need a way to compare the count of lbtestcd by accsnnum between the two separate files Lab and Rslt, and a way to flag each accsnnum where the count of lbtestcd differs between Lab and Rslt. For example, if Lab has an accsnnum A1 with 5 unique lbtestcd values, and Rslt has the accsnnum A1 with 7 unique lbtestcd values, I need that one to be brought to my attention.
I can do a proc freq for each file, but these are large data sets and I don't want to have to compare by hand. Perhaps I could export the count of lbtestcd by accsnnum to a variable in a new third dataset for each of the two files Lab and Rslt, then create a variable that is the difference of these two? Then, if difference != 0, I can get a report of those accsnnum values. Advice in SQL will work too, as I can run that through SAS.
Edit
I've used some SQL to get the count of lbtestcd by accsnum for each data set using the code below, though I still need to figure out how to export these values to a data set to compare.
proc sql;
select accsnnum, count(lbtestcd)
from work.lab1
group by accsnnum;
quit;
proc sql;
select accsnnum, count(lbtestcd)
from work.rslt1
group by accsnnum;
quit;
Thanks for any and all help you can give. This one is really stumping me!
I would do a PROC FREQ on each dataset (or proc whatever-you-like-that-does-counts) and then use PROC COMPARE. For example:
proc freq data=rslt1;
tables accsnnum*lbtestcd/out=rsltcounts;
run;
proc freq data=lab1;
tables accsnnum*lbtestcd/out=labcounts;
run;
/* compare the two count datasets, not the raw data */
proc compare base=labcounts compare=rsltcounts out=compares /* options */;
by accsnnum;
run;
PROC COMPARE has a lot of options; in this case the most helpful would probably be:
outnoequal - outputs only the rows that are not identical in the two datasets
outbase and outcomp - output a row for each of the BASE and COMPARE datasets (with OUTNOEQUAL, only when they differ)
outdif - outputs 'difference' rows, i.e., one minus the other; this may or may not be helpful for you
The documentation lists all of the options. You may also need to look at the METHOD options if your data might have numeric precision issues.
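Since the question also asks for SQL, here is a rough sketch of the "third dataset with a difference variable" idea. The table names labcnt, rsltcnt and flagged and the columns n_lab, n_rslt and diff are made up for illustration. It uses count(distinct lbtestcd) because the question talks about unique lbtestcd values; swap in count(lbtestcd) if plain row counts are wanted.

```sas
proc sql;
  /* distinct LBTESTCD count per ACCSNNUM in each dataset */
  create table labcnt as
    select accsnnum, count(distinct lbtestcd) as n_lab
    from work.lab1
    group by accsnnum;

  create table rsltcnt as
    select accsnnum, count(distinct lbtestcd) as n_rslt
    from work.rslt1
    group by accsnnum;

  /* keep accession numbers whose counts differ, including those present in only one dataset */
  create table flagged as
    select coalesce(l.accsnnum, r.accsnnum) as accsnnum,
           coalesce(l.n_lab, 0) as n_lab,
           coalesce(r.n_rslt, 0) as n_rslt,
           coalesce(l.n_lab, 0) - coalesce(r.n_rslt, 0) as diff
    from labcnt as l
         full join rsltcnt as r
         on l.accsnnum = r.accsnnum
    where calculated diff ne 0;
quit;
```

The full join is there so that an accsnnum missing from one side still shows up, with a count of 0 on that side.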