SAS computing multiple new variables from one row - sql

I have a dataset as listed below:
ID-----V1-----V2------V3
01------5------3-------7
02------3------8-------5
03------6------9-------1
and I want to calculate 3 new variables (ERR_CODE, ERR_DETAIL, ERR_ID) according to behavior of certain columns.
If V1 is greater than 4 then ERR_CODE = A and ERR_DETAIL = "Out of range" and ERR_ID = [ID]_A
If V2 is greater than 4 then ERR_CODE = B and ERR_DETAIL = "Check Log" and ERR_ID = [ID]_B
If V3 is greater than 4 then ERR_CODE = C and ERR_DETAIL = "Fault" and ERR_ID = [ID]_C
The desired output table looks like this:
ID-----ERR_CODE----ERR_DETAIL---------ERR_ID
01--------A--------Out of range---------01_A
01--------C--------Fault----------------01_C
02--------B--------Check Log------------02_B
02--------C--------Fault----------------02_C
03--------A--------Out of range---------03_A
03--------B--------Check Log------------03_B
I am using SAS 9.3 with EG 5.1. I have tried do-loops, arrays, IF statements and CASE-WHENs, but once one condition is met, processing naturally skips to the next row. I want to evaluate the other conditions for each row as well.
I have managed to do it by creating separate tables for each condition and then merging them, but that doesn't seem an effective approach if there are many conditions to work with.
My question is: how can I evaluate all of the conditions for each ID at once, without calculating them separately? The output table will have more rows than the input, as expected, but I cannot achieve that with CASE-WHEN, IF, etc.
Thanks in advance, and sorry if I am not clear.

Just use IF/THEN/DO blocks. Add an OUTPUT statement to write a new observation for each error.
data have ;
input ID $ V1-V3;
cards;
01 5 3 7
02 3 8 5
03 6 9 1
;
data want;
set have;
length ERR_CODE $1 ERR_DETAIL $20 ERR_ID $10 ;
if v1>4 then do;
err_code='A'; err_detail="Out of range"; err_id=catx('_',id,err_code);
output;
end;
if v2>4 then do;
err_code='B'; err_detail="Check Log"; err_id=catx('_',id,err_code);
output;
end;
if v3>4 then do;
err_code='C'; err_detail="Fault"; err_id=catx('_',id,err_code);
output;
end;
drop v1-v3 ;
run;
Results:
Obs ID ERR_CODE ERR_DETAIL ERR_ID
1 01 A Out of range 01_A
2 01 C Fault 01_C
3 02 B Check Log 02_B
4 02 C Fault 02_C
5 03 A Out of range 03_A
6 03 B Check Log 03_B
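If there are many conditions, the same idea can be driven by arrays, so that each new check is one more array entry rather than another IF block. This is only a sketch: it assumes the HAVE dataset above, the same >4 threshold for every variable, and the code-to-detail mapping given in the question. The dataset name want_array and the array names are made up for illustration.

```sas
data want_array;                      /* hypothetical output name */
  set have;
  length ERR_CODE $1 ERR_DETAIL $20 ERR_ID $10;
  /* Parallel arrays: entry i describes the check on the i-th variable */
  array v[3] v1-v3;
  array codes[3]   $1  _temporary_ ('A' 'B' 'C');
  array details[3] $20 _temporary_ ('Out of range' 'Check Log' 'Fault');
  do i = 1 to dim(v);
    if v[i] > 4 then do;
      err_code   = codes[i];
      err_detail = details[i];
      err_id     = catx('_', id, err_code);
      output;                         /* one output row per condition met */
    end;
  end;
  drop i v1-v3;
run;
```

Adding a fourth check would then mean extending the three arrays rather than copying another block of statements.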

Related

SAS delete and group by

Simplified version of the dataset I have is:
DATA HAVE;
INPUT ID match1 $ match2 $ not_relevant;
DATALINES;
1 "ABC" "ABC" 4
1 "XYZ" "XYZ" 29
2 "QQQ" "AAA" 5
2 "ABC" "ABC" 9
3 "EFG" "EFG" 7
3 "DEF" "DEF" 12
3 "LMK" "LMK" 16
3 "LMK" . 29
;RUN;
I am looking to compare match1 and match2, and if match1 does not equal match2 anywhere within an ID, I would like to remove all of the rows with that ID. So for this example dataset I want to remove all of ID 2 (rows 3 and 4), since row 3 does not have a match between match1 and match2. All I can figure out how to do so far is to delete the rows where they don't match, which isn't terribly helpful for this application. I assume it would be easier to make a new dataset with some WHERE clauses, but I am unsure how to begin there. Any ideas or advice?
EDIT:
Apologies, I dumbed down my dataset too much and forgot about an important exception. Note in my new dataset (I only added one row to the end). I do NOT want to delete group 3, since match2 is blank. I only want to delete a group where match2 is not blank and match1 does not equal match2.
Thanks
There's a few ways to do this. One would be to just construct a dataset of IDs that have non-matching rows, then do a merge or a SQL join and remove anything that matched this list.
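The first option might look something like the sketch below. It uses a subquery rather than an explicit join, and it assumes the exception from the question's edit (a blank match2 does not count as a nonmatch). The output name want_subquery is hypothetical.

```sas
proc sql;
  create table want_subquery as       /* hypothetical output name */
  select *
  from have
  /* drop every ID that appears in the list of IDs with a real nonmatch */
  where id not in
    (select id
     from have
     where match1 ne match2 and not missing(match2));
quit;
```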
However, my preferred option (partly because of speed, but also it's more straightforward once you understand how it works) is the DoW loop.
data want;
id_nonmatch = 0;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if match1 ne match2 and not missing(match2) then id_nonmatch = 1; *set the flag to 1 on a nonmatch, ignoring rows where match2 is blank;
end;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if id_nonmatch = 0 then output;
end;
run;
There are two set statements on the data step, each of which runs through the same dataset separately. If it doesn't make sense, throw a put _all_; inside each of the do loops - that will show you what it's doing. The first loop goes over all of the rows for one ID, checks if any violate the constraint, and if none do, the flag variable (id_nonmatch) stays 0. If one does, it becomes a 1 (and stays that way). Then, when it hits an ID boundary, it stops pulling records from the first set statement, and goes onto the second - re-pulling those same rows. Now, it outputs only when the flag is a zero.
This is very efficient because of buffering - unless your id groups are very large, the data step may be able to use buffers to keep the same rows in memory and not have to reread them from disk. (This will depend on your disk and buffers - and seems to help much less on flash than on physical disks [since there is not the additional benefit of the disk head not having to move] - so your mileage may vary here.)
Just to show this difference, here is a log showing that there isn't much additional time needed for the second read - when the record is reasonably sized. This benefit is less when the record is very small - I imagine there is more overhead involved. Note that the second read adds only 1/7 of the time of the first read to the total processing time!
69 data have;
70 call streaminit(7);
71 length strvar $1000;
72 do id = 1 to 100000;
73 do iter = 1 to 50;
74 x = rand('Uniform');
75 output;
76 end;
77 end;
78 run;
NOTE: Variable strvar is uninitialized.
NOTE: The data set WORK.HAVE has 5000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 5.20 seconds
cpu time 5.20 seconds
79
80
81 data _null_;
82 do _n_ = 1 by 1 until (last.id);
83 set have;
84 by id;
85 end;
86 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.37 seconds
cpu time 2.37 seconds
87
88
89 data _null_;
90 do _n_ = 1 by 1 until (last.id);
91 set have;
92 by id;
93 end;
94 do _n_ = 1 by 1 until (last.id);
95 set have;
96 by id;
97 end;
98 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.74 seconds
cpu time 2.73 seconds
It is easy to do this with an SQL query using GROUP BY and a HAVING clause.
proc sql;
create table want as
select *
from have
group by id
having max( (match1 ne match2) and not missing(match2)) = 0
;
quit;
SAS evaluates boolean expressions as 1/0 for TRUE/FALSE, so the MAX() of a series of TRUE/FALSE values will be TRUE (1) if ANY of them are TRUE. Comparing that maximum to 0 keeps only the IDs where no row has a nonblank match2 that differs from match1.

How to subtract second row from first, fourth row from third and so forth

I have a SAS dataset as in the attached picture. What I'm trying to accomplish is to create a new calculated field from the Total column by subtracting the second row from the first, the fourth row from the third, and so on.
What I have tried so far is:
DATA WANT2;
SET WANT;
BY APPT_TYPE;
IF FIRST.APPT_TYPE THEN SUPPLY-OPEN; ELSE 'ERROR';
RUN;
This throws an error saying the statement is not valid.
I'm not really sure how to go about this.
Here you go. The best I can do with the limited information you provided. Next time please provide sample data and your expected output.
data have;
input APPT_TYPE$ _NAME_$ Quantity;
datalines;
ASON Supply 10
ASON Open 8
ASSN Supply 9
ASSN Open 7
S30 Supply 11
S30 Open 8
;
proc sort data = have;
by APPT_TYPE descending _NAME_ ;
run;
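data want;
set have;
by APPT_TYPE descending _NAME_;
lag_N_Order = lag1(Quantity); /* Quantity from the previous row (Supply) */
N_Order = Quantity; /* Quantity from the current row (Open) */
Difference = lag_N_Order - N_Order;
keep APPT_TYPE _NAME_ N_Order lag_N_Order Difference;
/* keep only the last row of each APPT_TYPE, and only when Supply exceeds Open */
if last.APPT_TYPE & last._NAME_ & Difference>0;
run;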

How do i assign a value to a new variable, using another dataset which contains one value in SAS

I have a dataframe
ID value1
1 12
2 345
3 342
I have a second dataframe
value2
3823
How do I get the following result?
ID value1 value2
1 12 3823
2 345 3823
3 342 3823
Any joins I have done have given me
ID value1 value2
1 12 .
2 345 .
3 342 .
. . 3823
No need for joins or helper variables:
data have;
do i = 1 to 3;
output;
end;
run;
data lookup;
j = 1;
run;
data want;
set have;
if _n_ = 1 then set lookup;
run;
Without the if _n_ = 1, the data step stops after one iteration when it tries to read a second row from the lookup dataset and finds that there are no rows remaining.
N.B. this requires that the have dataset doesn't already contain a variable with the same name as the variable(s) attached from the lookup dataset.
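Applied to the data in the question, the technique could look like this. The dataset names first and second are placeholders chosen for this sketch. Variables read by a SET statement are retained across iterations, which is why value2 stays populated on every row even though it is read only once.

```sas
data first;                           /* placeholder name for the main table */
  input ID value1;
  datalines;
1 12
2 345
3 342
;
run;

data second;                          /* placeholder name for the one-row table */
  input value2;
  datalines;
3823
;
run;

data want;
  set first;
  /* read the single lookup row once; its value is retained afterwards */
  if _n_ = 1 then set second;
run;
```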
By far the easiest way to do this is to use PROC SQL and define the join condition 1=1, which is always true for each comparison:
data first;
input ID value1 @@;
cards;
1 12 2 345 3 342
;
run;
data second;
input value2 ;
cards;
3823
;
run;
proc sql;
create table wanted as
select * from first
left join second
on 1 =1
;quit;
Edit: As far as I know, there isn't a direct way to merge datasets by each row, but you can use the following trick:
Add a helper variable, help:
data second_trick;
set second;
help=1;
run;
data first_trick;
set first;
help=1;
run;
Then we just perform the merge by the static variable:
data wanted_trick;
merge first_trick(in=a) second_trick;
by help;
if a; /*Left join, just to be sure.*/
run;
Note that this only works if you want to add a single static value. Don't try to use it if your second dataset has more than one row.
For more on Merges and joins see: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/249-30.pdf

Creating ID variable using digits from two different variables on SAS

I'm trying to create a new variable on SAS. There is a column called "Statefip" and a column called "countyfip". I need a four digit ID number that combines these two columns.
For example:
(image in the original post showing the desired ID format)
How do I tell SAS to follow this format when creating this new variable?
This is easy to do using the PUT and INPUT functions. The Z3. format includes leading zeros in the output, || concatenates the results of the two PUT calls, and INPUT then converts the concatenated string back to a numeric id.
data have;
input statefip countyfip;
datalines;
1 1
8 109
12 57
13 313
;
run;
data want;
set have;
id = input(put(statefip,2.) || put(countyfip,z3.),8.);
run;
proc print;
run;
Output:
Obs statefip countyfip id
1 1 1 1001
2 8 109 8109
3 12 57 12057
4 13 313 13313
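If the ID only needs to be numeric, plain arithmetic should give the same result without any conversion. This is a sketch under the assumption that countyfip never exceeds 999; id_num, id_char and want_alt are names invented here. The Z5. version is included in case a fixed-width character ID with a leading zero is wanted.

```sas
data want_alt;                        /* hypothetical output name */
  set have;
  /* shift the state code three digits left, then add the county code */
  id_num  = statefip * 1000 + countyfip;
  /* optional: five-character ID padded with leading zeros */
  id_char = put(id_num, z5.);
run;
```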

SAS do loop + lag function?

This is my first post, so please let me know if I'm not clear enough. Here's what I'm trying to do - this is my dataset. My approach for this is a do loop with a lag but the result is rubbish.
data a;
input @1 obs @4 mindate mmddyy10. @15 maxdate mmddyy10.;
format mindate maxdate date9.;
datalines;
1 01/02/2013 01/05/2013
2 01/02/2013 01/05/2013
3 01/02/2013 01/05/2013
4 01/03/2013 01/06/2013
5 02/02/2013 02/08/2013
6 02/02/2013 02/08/2013
7 02/02/2013 02/08/2013
8 03/10/2013 03/11/2013
9 04/02/2013 04/22/2013
10 04/10/2013 04/22/2013
11 05/04/2013 05/07/2013
12 06/10/2013 06/20/2013
;
run;
Now, I'm trying to produce a new column - "Replacement" based on the following logic:
If a record's mindate occurs before its lag's maxdate, it cannot be a replacement for it. If it cannot be a replacement, skip forward (so- 2,3,4 cannot replace 1, but 5 can).
Otherwise... if the mindate is less than 30 days, Replacement = Y. If not, replacement = N. Once a record replaces another (so, in this case, 5 does replace 1, because 02/02/2013 is <30 than 01/05/2013, it cannot duplicate as a replacement for another record. But if it's an N for one record above, it can still be a Y for some other record. So, 6 is now evaluated against 2, 7 against 3,etc. Since those two combos are both "Y", 8 is now evaluated versus 4, but because its mindate >30 relative to 4's maxdate, it's a N. But, it's then evaluated against against
And so on...
I should add that in a 100-record dataset, this would imply that the 100th record could technically replace the 1st, so I've been trying lags within loops. Any tips/help is greatly appreciated! Expected output:
obs mindate maxdate Replacement
1 02JAN2013 05JAN2013
2 02JAN2013 05JAN2013
3 02JAN2013 05JAN2013
4 03JAN2013 06JAN2013
5 02FEB2013 08FEB2013 Y
6 02FEB2013 08FEB2013 Y
7 02FEB2013 08FEB2013 Y
8 10MAR2013 11MAR2013 Y
9 02APR2013 22APR2013 Y
10 10APR2013 22APR2013 N
11 04MAY2013 07MAY2013 Y
12 10JUN2013 20JUN2013 Y
I think this is correct if the asker was mistaken about replacement = Y for obs = 12.
/*Get number of obs so we can build a temporary array to hold the dataset*/
/*The input is the dataset a from the question*/
data _null_;
set a nobs= nobs;
call symput("nobs",nobs);
stop;
run;
data want;
/*Load the dataset into a temporary array*/
array dates[2,&NOBS] _temporary_;
if _n_ = 1 then do _n_ = 1 by 1 until(eof);
set a end = eof;
dates[1,_n_] = maxdate;
dates[2,_n_] = 0;
end;
set a;
length replacement $1;
replacement = 'N';
do i = 1 to _n_ - 1 until(replacement = 'Y');
if dates[2,i] = 0 and 0 <= mindate - dates[1,i] <= 30 then do;
replacement = 'Y';
dates[2,i] = _n_;
replaces = i;
end;
end;
drop i;
run;
You could use a hash object + hash iterator instead of a temporary array if you preferred. I've also included an extra var, replaces, to show which previous row each row replaces.
Here is a solution using SQL and hash tables. It is not optimal but it was the first method that sprang to mind.
/* Join the input with itself */
proc sql;
create table b as
select
a1.obs,
a2.obs as obs2
from a as a1
inner join a as a2
/* Set the replacement criteria */
on a1.maxdate < a2.mindate <= a1.maxdate + 30
order by a2.obs, a1.obs;
quit;
/* Create a mapping for replacements */
data c;
set b;
/* Create two empty hash tables so we can look up the used observations */
if _N_ = 1 then do;
declare hash h();
h.definekey("obs");
h.definedone();
declare hash h2();
h2.definekey("obs2");
h2.definedone();
end;
/* Check if we've already used this observation as a replacement */
if h2.find() then do;
/* Check if we've already replaced this observation */
if h.find() then do;
/* Add the observations to the hash table and output */
h2.add();
h.add();
output;
end;
end;
run;
/* Combine the replacement map with the original data */
proc sql;
select
a.*,
ifc(c.obs, "Y", "N") as Replace,
c.obs as Replaces
from a
left join c
on a.obs = c.obs2
order by a.obs;
quit;
There are several ways in which this can be simplified:
The dates can be brought through the first proc sql
The if statements can be combined
The final join could be replaced by a little extra logic in the data step