SAS - Writing an if Statement with a Do Loop inside - sql

I am trying to help a colleague with a SAS script she is working on. I am a programmer so I understand the logic; however, I don't know the syntax in SAS. Basically this is what she is trying to do.
We have:
Array of Procedure Dates (proc_date[i])
Array of Procedures (proc[i]).
Each record in our data can have up to 20 Procedures and 20 dates.
i=20
Each procedure has an associated code; let's just say there are 100 different codes, where codes 1 to 10 are ProcedureA, 11 to 20 are ProcedureB, etc.
We need to loop through each Procedure and assign it the correct ProcedureCategory if it falls into one of the 100 codes (i.e. it enters one of the if statements). When this is true we then need to loop through the other corresponding Procedure Dates in that row: if they are different dates then we add the 'Weighted Values' together, else we just take the greater of the two values.
I hope that makes sense. I could write this in another language (e.g. C/C++/C#/VB); however, I'm at a loss with SAS as I'm just not that familiar with the syntax, and the logic doesn't seem to work like that of other OO languages.
Thanks in advance for any assistance.
Kind Regards.

You don't want to do 100 if statements, one way or the other.
The answer to the core of your question is that you need the do loop outside of the if statement.
data want;
set have;
array proc[20];
array proc_date[20];
do _i = 1 to dim(proc); *this would be 20;
if proc[_i] = 53 then ... ;
else if proc[_i] = 54 then ...;
end;
run;
Now, what you're trying to do with proc_date sounds like you need to do something with proc_date[_i] in that same loop. The loop is just an iterator - the only thing changing is _i, which is used as an array index. It can be the array index for either array (or do anything else). This is where this differs from common OOP practice, since this isn't an array class; you're not using the individual object to iterate over it. This is a procedural style (C would do it the same way, even).
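For instance, a minimal sketch of using both arrays off the same index - the code 53, the count, and the date variable below are made-up placeholders, not your actual coding scheme:
data want;
set have;
array proc[20];
array proc_date[20];
do _i = 1 to dim(proc);
if proc[_i] = 53 then do; *53 = hypothetical code for ProcedureA;
proc_a_count = sum(proc_a_count, 1); *per-row tally of ProcedureA;
proc_a_last_date = proc_date[_i]; *the same _i picks up the matching date;
end;
end;
format proc_a_last_date date9.;
drop _i;
run;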
However, the if/else bits there would be unwieldy and long. In SAS you have a lot of ways of dealing with that. You might have another array of the 100 values proc can take, and then inside that do loop have another do loop iterating over that array (do _j = 1 to 100;) - or the other way around (iterate through the 100, and inside that iterate through the 20) if that makes more sense (if you want to have all of the values available at one time).
data want;
set have;
array proc[20];
array proc_date[20];
array proc_val[100]; *needs to be populated in the `have` dataset, or in another;
do _i = 1 to dim(proc_val);
do _j = 1 to dim(proc);
if proc[_j] = proc_val[_i] then ...; *this statement executes 100*20 times;
end;
end;
run;
You also could have a user-defined format, which is really just a one-to-one mapping of values (start value -> label value). Map your 100 values to the 10 procedures they correspond to, or whatever. Then all 100 if statements become
proc_value[_i] = put(proc[_i],PROCFMT.);
and then proc_value[_i] stores the procedure (or whatever) which you then can evaluate more simply hopefully.
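For example, a sketch of that format approach - the PROCFMT ranges and labels below are invented just to show the shape of it:
proc format;
value procfmt
1 - 10 = 'ProcedureA' /* made-up code ranges */
11 - 20 = 'ProcedureB'
other = 'Unknown';
run;

data want;
set have;
array proc[20];
array proc_value[20] $ 12;
do _i = 1 to dim(proc);
proc_value[_i] = put(proc[_i], procfmt.); *one line replaces the 100 if statements;
end;
drop _i;
run;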
You also may want to look into hash tables; both for a similar concept as the format above, but also for doing the storage. Hash tables are a common idea in programming that perhaps you've already come across, and the way SAS implements them is actually OOP-like. If you're trying to do some sort of summarization based on the procedure values, you could easily do that in a hash table and probably much more efficiently than you could in IF statements.
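For example, a rough sketch of the hash version - it assumes a lookup dataset, here called proc_codes, with one row per code and variables code and category (all three names are made up):
data want;
set have;
array proc[20];
array proc_cat[20] $ 12;
length code 8 category $ 12;
call missing(code, category); *avoid uninitialized-variable notes;
if _n_ = 1 then do;
declare hash h(dataset: 'proc_codes'); *hypothetical code -> category lookup table;
h.defineKey('code');
h.defineData('category');
h.defineDone();
end;
do _i = 1 to dim(proc);
code = proc[_i];
if h.find() = 0 then proc_cat[_i] = category; *find() returns 0 when the code is found;
end;
drop _i code category;
run;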

Here are some statements mentioned.
*codes 1 to 10 is ProcedureA, 11 to 20 is ProcedureB;
proc format;
value codes 1-10 = 'A'
            11-20 = 'B';
run;
Then, inside the data step loop:
Procedure(i) = put(code, codes.);
Another way to recode ranges is with the between syntax:
if 1 <= value <= 10 then variable = <new-value>;
hth

Related

An automated way to use SAS Lag Function/ Loop with Lag?

I have a dataset with one row per week for 2 years (so 104 rows). I have a flag column which is either 1 or 0 for each week. I want to create a new column with the following logic:
if the flag=1 for that week then have a 1 for that week and the following 3 weeks as flag_new.
My current approach, which works, is:
if flag=1 or lag(flag)=1 or lag2(flag)=1 or lag3(flag)=1 then flag_new=1;
Although this works, it becomes very tedious if I want flag_new to be 1 for the following 20 or 30 weeks instead of just 3 weeks.
I was hoping there would be an easier way to do this (perhaps a loop?), but I am not too familiar with it.
Any help is much appreciated.
Maybe instead of a look back, think of it as a look ahead. That is, each time you see flag=1, set flag_new=1 for that record and the next three records. Something like (untested):
if flag=1 then count=3;
else count+(-1) ; *implicit retain from sum statement;
if count>=0 then flag_new=1;
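A runnable version of that sketch (assuming the dataset is called have, one row per week, already in week order):
data want;
set have;
if flag = 1 then count = 3; *this week plus the following 3 weeks;
else count + (-1); *sum statement, so count is retained across rows;
if count >= 0 then flag_new = 1;
else flag_new = 0;
drop count;
run;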
You can use a temporary array as well to keep the lagged information and then capture the highest of the array. If it's a one then you can set the new flag to 1 as well. To change the dimensions, just change the 2 to the n-1 you need.
This also demonstrates the BY statements and resetting it for the beginning of a new group.
data want;
array p{0:2} _temporary_;
set have;
by object;
if first.object then call missing(of p{*});
p{mod(_n_,3)} = flag; *modulus matches the 0:2 array dimension;
highest = max(of p{*});
if highest >= 1 then do;
flag_new = 1;
end;
run;

Vast data replacement between two tables in Oracle

Assume that we have two tables, named Tb1 and Tb2, and we are going to move data from one to the other. Tb1 is the main source of data and Tb2 is the destination. This operation has 3 parts.
In the first part we are going to validate all rows in Tb1 and check whether they are correct. For example, a national security code must have exactly 10 digits, or a real customer must have a valid birth date; according to these validation rules, 28 different validation methods and error codes have been defined. During the validation, every invalid row's description and status will be updated to a new state.
Part 2 fixes the rows' problems, and the third part moves them into Tb2.
For instance, this row says that it has 4 different errors:
-- Tb1.desc=6,8,14,16
-- Tb1.sts=0
A correct row of data
-- Tb1.desc=Null
-- Tb1.sts=1
I have been working on the first part recently and have come up with a solution which works fine but is too slow. Unfortunately, it takes exactly 31 minutes to validate 100,000 rows. In a real situation we are going to validate more than 2 million records, so it is totally useless despite all its functionality.
Let's take a look at my package:
procedure Val_primary IS
begin
Open X_CUSTOMER;
Loop
fetch X_CUSTOMER bulk collect into CUSTOMER_RECORD;
EXIT WHEN X_CUSTOMER%notfound;
For i in CUSTOMER_RECORD.first..CUSTOMER_RECORD.last loop
Val_CTYP(CUSTOMER_RECORD(i).XCUSTYP);
Val_BRNCH(CUSTOMER_RECORD(i).XBRNCH);
--Rest of the validations ...
UptDate_Val(CUSTOMER_RECORD(i).Xrownum);
end loop;
CUSTOMER_RECORD.delete;
End loop;
Close X_CUSTOMER;
end Val_primary;
Inside a validation procedure :
procedure Val_CTYP(customer_type IN number)IS
Begin
IF(customer_type<1 or customer_type>3)then
RW_FINAL_STATUS:=0;
FINAL_ERR_DSC:=Concat(FINAL_ERR_DSC,ERR_INVALID_CTYP);
End If;
End Val_CTYP;
Inside the update procedure :
procedure UptDate_Val(rownumb IN number) IS
begin
update tb1 set tb1.xstst=RW_FINAL_STATUS,tb1.xdesc=FINAL_ERR_DSC where xc1customer.xrownum=rownumb;
RW_FINAL_STATUS:=1;
FINAL_ERR_DSC:=null;
end UptDate_Val;
Is there any way to reduce the execution time?
It must be done in less than 20 minutes for more than 2 million records.
Maybe each validation check could be a case expression within an inline view, and you could concatenate them etc in the enclosing query, giving you a single SQL statement that could drive an update. Something along the lines of:
select xxx, yyy, zzz -- whatever columns you need from xc1customer
, errors -- concatenation of all error codes that apply
, case when errors is not null then 0 else 1 end as status
from ( select xxx, yyy, zzz
, trim(ltrim(val_ctyp||' ') || ltrim(val_abc||' ') || ltrim(val_xyz||' ') || etc...) as errors
from ( select c.xxx, c.yyy, c.zzz
, case when customer_type < 1 or customer_type > 3 then err_invalid_ctyp end as val_ctyp
, case ... end as val_abc
, case ... end as val_xyz
from xc1customer c
)
);
Sticking with the procedural approach, the slow part seems to be the single-row update. There is no advantage to bulk-collecting all 2 million rows into session memory only to apply 2 million individual updates. The quick fix would be to add a limit clause to the bulk collect (and move the exit to the bottom of the loop where it should be), have your validation procedures set a value in the array instead of updating the table, and batch the updates into one forall per loop iteration.
You can be a bit freer with passing records and arrays in and out of procedures rather than having everything a global variable, as passing by reference means there is no performance overhead.
There are two potential lines of attack.
Specific implementation. Collections are read into session memory. This is usually quite small compared to global memory allocation. Reading 100000 longish rows into session memory is a bad idea and can cause performance issues. So breaking up the process into smaller chunks (say 1000 rows) will most likely improve throughput.
General implementation. What is the point of the tripartite process? Updating Table1 with some error flags is an expensive activity. A more efficient approach would be to apply the fixes to the data in the collection and apply that to Table2. You can write a log record if you need to track what changes are made.
Applying these suggestion you'd end up with a single procedure which looks a bit like this:
procedure one_and_only is
begin
open x_customer;
<< tab_loop >>
loop
fetch x_customer bulk collect into customer_record
limit 1000;
exit when customer_record.count() = 0;
<< rec_loop >>
for i in customer_record.first..customer_record.last loop
val_and_fix_ctyp(customer_record(i).xcustyp);
val_and_fix_brnch(customer_record(i).xbrnch);
--rest of the validations ...
end loop rec_loop;
-- apply the cleaned data to target table
forall j in 1..customer_record.count()
insert into table_2
values customer_record(j);
end loop tab_loop;
close x_customer;
end one_and_only;
Note that this approach requires the customer_record collection to match the projection of the target table. Also, don't use %notfound to test for end of the cursor unless you can guarantee the total number of read records is an exact multiple of the LIMIT number.

fuzzy merge using SAS proc sql

I have two files which I would like to match by name and I would like to take account of spelling errors by using the compged function. The names have been thoroughly cleaned and I have no other useful match variables that could be used to reduce the search space.
The files name1 and name2 have over 500k rows each, and thus after 11 hours this code has not finished running.
Is there some way I can code this more efficiently or is my issue purely due to computing power?
proc sql;
create table name1_name2_Fuzzy as
select a.*, b.*
from name1 as a
inner join name2 as b
on COMPGED(a.match_name, b.match_name) < 200;
quit;
There is a parameter of the COMPGED function that you didn't use, and it can improve the performance (maybe 6 or 7 hours instead of 11).
This parameter is the cutoff. If you choose 300 as the cutoff, then once the distance between the words reaches 300, SAS stops the calculation and returns 300.
So here in your case, you should choose a cutoff > 200 (and NOT >= 200).
The COMPLEV function is faster than COMPGED. If you don't need an exact cost for each operation (via the CALL COMPCOST routine), you can use it instead of COMPGED and save minutes or maybe hours of computation. COMPLEV also has the cutoff option.
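For example, the original query with the cutoff argument supplied (300 here, matching the example above); untested:
proc sql;
create table name1_name2_Fuzzy as
select a.*, b.*
from name1 as a
inner join name2 as b
on compged(a.match_name, b.match_name, 300) < 200;
/* complev(a.match_name, b.match_name, ...) is cheaper still,
but note its scores are on a different scale than compged */
quit;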
Hope this helps !
Working off memory here, but if the first char of each match_name is different, the COMPGED will be over 200, true? So, you wouldn't consider them a match?
If so, make an indexed column with the first character of match_name in each table, and join on that before the COMPGED. That should eliminate most of the non-matches so far fewer COMPGED calculations will be needed.
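Something like this (untested; the first_char column and the *_key dataset names are invented, and the cutoff of 300 follows the other answer):
data name1_key(index=(first_char));
set name1;
first_char = substr(match_name, 1, 1);
run;

data name2_key(index=(first_char));
set name2;
first_char = substr(match_name, 1, 1);
run;

proc sql;
create table name1_name2_Fuzzy as
select a.*, b.*
from name1_key as a
inner join name2_key as b
on a.first_char = b.first_char
and compged(a.match_name, b.match_name, 300) < 200;
quit;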

SAS Proc sql row number

How do I get the row number of an observation in PROC SQL, similar to _N_ for a data step?
For example
proc sql outobs=5;
select case mod(<something>, 2)
when 0 then "EVEN"
else "ODD"
end
from maps.africa;
quit;
Want:
Row
----------
1 odd
2 even
3 odd
.
.
.
Monotonic() does exist and in some cases can be helpful, but it is not identical to a row number, and can be dangerous to use, particularly given SQL is a heavily optimized language that will happily split your query into multiple threads - in which case monotonic() would fail to accomplish what you want. It in particular can behave differently on different datasets, on different SAS installations, or even simply on different days.
The safe way to do this is to create a view with _n_ copied into a permanent variable.
data africa_v/view=africa_v;
set maps.africa;
rownum=_n_;
run;
proc sql;
select case mod(rownum, 2)
when 0 then "EVEN"
else "ODD"
end
from africa_v;
quit;
This adds nearly no overhead - a few milliseconds - and achieves the same result, but with the safety to be confident you have the right ordering. The two queries (this and shipt's) run in nearly identical times on my machine, well within the margin of error (2.95s vs 2.98s for all records).
Use the monotonic() function. In the past I have read that this is an undocumented function; it is true that it does not appear on the SAS website, but there is at least one SAS 'proceedings' paper which makes use of it heavily.
For example:
proc sql outobs=5;
select case mod(monotonic(), 2)
when 0 then "EVEN"
else "ODD"
end
from maps.africa;
quit;
will achieve your aim.

Need to compare 2 variables, each coming from a separate data set, and flag differences

I have 2 SAS data sets, Lab and Rslt. They both have all of the same variables, but Rslt is supposed to have what is essentially a subset of Lab. For what I'm trying to do, there are 4 important variables: visit, accsnnum, battrnam, and lbtestcd. All are character variables. I want to compare the two files Lab and Rslt to find out where they vary -- specifically, I need to know the count of lbtestcd per unique accsnnum.
But I must control for a few factors. First, I only need to compare observations that have "Lipid Panel" or "Chemistry (6)" in the battrnam variable. The Rslt file only contains these observations, so we don't need to worry about that one. So I subsetted Lab using this code:
data work.lab;
set livingston.ndb_lab_1;
where battrnam contains "Lipid Panel" or battrnam = "Chemistry (6)";
run;
This worked fine. Now, I need to control for the variable visit. I need to get rid of all observations in both Lab and Rslt that have visits that contain "Day 1" or "Screening". I accomplished this using the following code:
data work.lab;
set work.lab;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
data work.rslt;
set work.rslt;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
Now this is where I get stuck. I need to create a way to compare the count of lbtestcd by accsnnum between the two separate files Lab and Rslt, and I need a way for it to flag the accsnnum where there is a difference between Lab and Rslt for the count of lbtestcd. For example, if Lab has an accsnnum A1 that has 5 unique lbtestcd values, and Rslt has the accsnnum A1 with 7 unique lbtestcd values, I need that one to be brought to my attention.
I can do a proc freq for each file, but these are large data sets and I don't want to have to compare by hand. Perhaps exporting the count of lbtestcd by accsnnum to a variable in a new 3rd dataset for each of the 2 files Lab and Rslt, then creating a variable that is the difference of these two? So that if the difference != 0 then I can get a report of those accsnnum? Advice in SQL will work too, as I can run that through SAS.
Edit
I've used some SQL to get the count of lbtestcd by accsnnum for each data set using the code below, though I still need to figure out how to export these values to a data set to compare.
proc sql;
select accsnnum, count(lbtestcd)
from work.lab1
group by accsnnum;
quit;
proc sql;
select accsnnum, count(lbtestcd)
from work.rslt1
group by accsnnum;
quit;
Thanks for any and all help you can give. This one is really stumping me!
I would do a PROC FREQ on each dataset (or proc whatever-you-like-that-does-counts) and then use PROC COMPARE. For example:
proc freq data=rslt1;
tables accsnnum*lbtestcd/out=rsltcounts;
run;
proc freq data=lab1;
tables accsnnum*lbtestcd/out=labcounts;
run;
proc compare base=labcounts compare=rsltcounts out=compares /* options */;
by accsnnum;
run;
PROC COMPARE has a lot of options; in this case the most helpful would probably be:
outnoequal - only outputs rows that are not identical in the two datasets
outbase and outcomp - output a row for each of the BASE and COMPARE datasets (if OUTNOEQUAL, then only when they differ)
outdif - outputs 'difference' rows, i.e., one minus the other; this may or may not be helpful for you
The documentation lists all of the options. You may also need to look at the METHOD options if your data might have numeric precision issues.
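If you'd rather stay in SQL, as in your edit, something along these lines (untested) would get you the mismatches directly -- count the distinct lbtestcd per accsnnum in each file, join the two count tables, and keep only the accession numbers where the counts differ:
proc sql;
create table lab_counts as
select accsnnum, count(distinct lbtestcd) as lab_n
from work.lab1
group by accsnnum;

create table rslt_counts as
select accsnnum, count(distinct lbtestcd) as rslt_n
from work.rslt1
group by accsnnum;

create table count_diffs as
select coalesce(a.accsnnum, b.accsnnum) as accsnnum,
a.lab_n, b.rslt_n,
coalesce(a.lab_n, 0) - coalesce(b.rslt_n, 0) as diff
from lab_counts as a
full join rslt_counts as b
on a.accsnnum = b.accsnnum
where calculated diff ne 0;
quit;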