SAS: most efficient method to output first non-missing across multiple columns

The data I have are millions of rows and rather sparse, with anywhere between 3 and 10 variables that need to be processed. My end result needs to be a single row containing the first non-missing value for each column. Take the following test data:
** test data **;
data test;
length ID $5 AID 8 TYPE $5;
input ID $ AID TYPE $;
datalines;
A . .
. 123 .
C . XYZ
;
run;
The end result should look like this:
ID AID TYPE
A 123 XYZ
Using macro lists and loops I can brute force this result with multiple merge statements where the variable is non-missing and obs=1 but this is not efficient when the data are very large (below I'd loop over these variables rather than write multiple merge statements):
** works but takes too long on big data **;
data one_row;
merge
test(keep=ID where=(ID ne "") obs=1) /* character */
test(keep=AID where=(AID ne .) obs=1) /* numeric */
test(keep=TYPE where=(TYPE ne "") obs=1); /* character */
run;
The coalesce function seems very promising, but I believe I need it in combination with array and output to build this single-row result. The function also differs (coalesce and coalescec depending on variable type) whereas it does not matter using proc sql. I get an error using array since all variables in the array list are not the same type.

Exactly what is most efficient will largely depend on the characteristics of your data. In particular, whether the first nonmissing value for the last variable is usually relatively "early" in the dataset, or if you usually will have to trawl through the entire dataset to get to it.
I assume your dataset is not indexed (as that would simplify things greatly).
One option is the standard data step. This isn't necessarily fast, but it's probably not much slower than most other options, given that you're going to have to read most or all of the rows no matter what you do. It has the nice advantage that it can stop as soon as every variable has been filled.
data want;
if 0 then set test; *defines characteristics;
set test(rename=(id=_id aid=_aid type=_type)) end=eof;
id=coalescec(id,_id);
aid=coalesce(aid,_aid);
type=coalescec(type,_type);
if cmiss(of id aid type)=0 then do;
output;
stop;
end;
else if eof then output;
drop _:;
run;
You could populate all of that from macro variables from dictionary.columns, or even might use temporary arrays, though I think that gets too messy.
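As a rough sketch of that dictionary.columns idea, the rename list and the coalesce statements could be generated like this. The macro variable names (renames, fills, varlist) are made up for illustration, the dataset is assumed to be WORK.TEST, and variable names are assumed to be short enough that an underscore prefix still fits in 32 characters.

```sas
/* Sketch: build the rename list and the coalesce statements from the metadata */
proc sql noprint;
  select cats(name, '=_', name),
         case when type = 'char'
              then cats(name, '=coalescec(', name, ',_', name, ');')
              else cats(name, '=coalesce(',  name, ',_', name, ');')
         end,
         name
    into :renames separated by ' ',
         :fills   separated by ' ',
         :varlist separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'TEST'
    order by varnum;
quit;

data want;
  if 0 then set test;                      /* defines variable attributes */
  set test(rename=(&renames)) end=eof;
  &fills                                   /* one coalesce/coalescec statement per variable */
  if cmiss(of &varlist) = 0 then do;       /* every variable filled: write the row and stop */
    output;
    stop;
  end;
  else if eof then output;
  drop _:;
run;
```

The data step is the same as the hand-written one above; only the variable-specific parts come from the metadata.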
Another option is the self update, except it needs two changes. One, you need something to join on (as opposed to merge which can have no by variable). Two, it will give you the last nonmissing value, not the first, so you'd have to reverse-sort the dataset.
But assuming you added x to the first dataset, with any value (doesn't matter, but constant for every row), it is this simple:
data want;
update test(obs=0) test;
by x;
run;
So that has the huge advantage of simplicity of code, exchanged for some cost of time (reverse sorting and adding a new variable).
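A minimal sketch of the full self-update, including the constant BY variable and the reverse sort, might look like the following. The helper dataset test_x and the row counter n are names invented for this example.

```sas
/* Add a constant BY variable and a row counter to reverse the order */
data test_x;
  set test;
  x = 1;        /* any constant value works */
  n = _n_;      /* original row order */
run;

/* Reverse the rows so that UPDATE's "last non-missing wins" becomes "first non-missing wins" */
proc sort data=test_x;
  by x descending n;
run;

data want;
  update test_x(obs=0) test_x;
  by x;
  drop x n;
run;
```

With the test data from the question this should leave a single row with ID=A, AID=123, TYPE=XYZ. The extra pass to add x and n plus the sort is the time cost mentioned above.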
If your dataset is very sparse, a transpose might be a good compromise. It doesn't require knowing the variable names, since you can process them with arrays.
data test_t;
set test;
array numvars _numeric_;
array charvars _character_;
do _i = 1 to dim(numvars);
if not missing(numvars[_i]) then do;
varname = vname(numvars[_i]);
numvalue= numvars[_i];
output;
end;
end;
do _i = 1 to dim(charvars);
if not missing(charvars[_i]) then do;
varname = vname(charvars[_i]);
charvalue= charvars[_i];
output;
end;
end;
keep numvalue charvalue varname;
run;
proc sort data=test_t;
by varname;
run;
data want;
set test_t;
by varname;
if first.varname;
run;
Then you can PROC TRANSPOSE this to get the desired want dataset (or maybe the long form works for you as is). It does lose the formats and other attributes of the values, so take that into account. The character value length also needs to be set to something appropriately long and then set back afterwards (you can use an if 0 then set to fix it).
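One way the final transpose could be sketched, assuming the long want dataset produced above (columns varname, numvalue, charvalue). The output names want_num, want_char and want_wide are hypothetical, and numeric and character values are transposed separately because a single transposed column cannot hold both types.

```sas
/* Numeric values: one column per original numeric variable */
proc transpose data=want(where=(not missing(numvalue)))
               out=want_num(drop=_name_);
  id varname;
  var numvalue;
run;

/* Character values: one column per original character variable */
proc transpose data=want(where=(not missing(charvalue)))
               out=want_char(drop=_name_);
  id varname;
  var charvalue;
run;

/* Each piece has a single row, so a merge without BY lines them up side by side */
data want_wide;
  merge want_num want_char;
run;
```

The column order and lengths will differ from the original dataset, which is the attribute loss mentioned above.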
A similar hash approach would work roughly the same way; it has the advantage that it would stop much sooner, and doesn't require resorting.
data test_h;
set test end=eof;
array numvars _numeric_;
array charvars _character_;
length varname $32 numvalue 8 charvalue $1024; *or longest charvalue length;
if _n_=1 then do;
declare hash h(ordered:'a');
h.defineKey('varname');
h.defineData('varname','numvalue','charvalue');
h.defineDone();
end;
do _i = 1 to dim(numvars);
if not missing(numvars[_i]) then do;
varname = vname(numvars[_i]);
rc = h.find();
if rc ne 0 then do;
numvalue= numvars[_i];
rc=h.add();
end;
end;
end;
do _i = 1 to dim(charvars);
if not missing(charvars[_i]) then do;
varname = vname(charvars[_i]);
rc = h.find();
if rc ne 0 then do;
call missing(numvalue); *clear any numeric value left over from the loop above;
charvalue= charvars[_i];
rc=h.add();
end;
end;
end;
if eof or h.num_items = dim(numvars) + dim(charvars) then do;
rc = h.output(dataset:'want');
stop; *every variable has a value (or the data ran out), so stop reading;
end;
run;
There are lots of other solutions, just depending on your data which would be most efficient.

Related

sas macro resolving issue

Dummy data:
MEMNAME _var1 var2 var3 var4
XY XYSTART_1 XYSTATT_2 XYSTAET_3 XYSTAWT_4
I want to create a macro variable that will hold values such as TEST_XYSTART, TEST_XYSTATT, TEST_XYSTAET, TEST_XYSTAWT, and so on. How can I do this in a data step without using call symput? I want to use this macro variable in the same data step (call symput will not create the macro variable until I end the data step).
I tried the following (not working). Please tell me the correct way to write this step.
case = "TEST_"|| strip(reverse(substr(strip(reverse(testcase(i))),3)));
%let var = case; (with/without quotes not getting the desired result).
abc= strip(reverse(substr(strip(reverse(testcase(i))),3)));
%let test = TEST_;
%let var = &test.abc;
I am getting correct data with this statement: strip(reverse(substr(strip(reverse(testcase(i))),3)))
just not able to concatenate this value with TEST_ and assign it to the macro variable in a datastep.
Appreciate your help!
It makes no sense to locate a %LET statement in the middle of a data step. The macro processor evaluates the macro statements first and then passes the resulting code onto SAS to evaluate. So if you had this code:
data want;
set have;
%let var=abc ;
run;
It is the same as if you placed the %LET statements before the DATA statement.
%let var=abc ;
data want;
set have;
run;
If you want to reference a variable dynamically in a data step, you can use an index into an array:
data want;
set have;
array test test_1 - test_3;
new_var = test[ testnum ] ;
run;
Or use the VVALUEX() function, which returns the formatted value of a variable whose name is given by a character expression.
data want;
set have;
new_var = vvaluex( cats('test_',testnum) );
run;
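To make the two options concrete, here is a small self-contained sketch with made-up data. The dataset have and the variables by_index and by_name are invented for the example. Because VVALUEX returns a character string, the sketch converts it back to a number with INPUT.

```sas
data have;
  input test_1 test_2 test_3 testnum;
  datalines;
10 20 30 2
40 50 60 3
;
run;

data want;
  set have;
  array test test_1 - test_3;
  /* Option 1: testnum is used as an index into the array */
  by_index = test[testnum];
  /* Option 2: build the variable name as a string and look it up at run time */
  by_name = input(vvaluex(cats('test_', testnum)), 32.);
run;
```

Both new variables should pick up the same value on each row, since they point at the same source variable.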

SAS running out of memory when trying to do fuzzy matching using hash tables on a small sample dataset

I have a list of names, phone numbers, and addresses with ~5,000,000 rows. I am trying to create a list of unique customers at each address with each unique customer being assigned a key. The customer names are not formatted consistently, so John Smith might appear as, e.g., Smith, John, or John Smith Jr., etc.
The logic I want to follow is this:
If two records at an address have the same phone number, it's the same customer, whether they have different names or not, and get assigned the same customer_key.
If two records at an address don't have the same phone number (or have no phone number), but a fuzzy match on their names exceeds some threshold, they are the same customer, and get assigned the same customer_key.
Note that the same customer name + phone number match at two separate addresses should not be assigned the same customer key.
Here are example tables containing sample input and desired output:
https://docs.google.com/spreadsheets/d/1RsABSFy5a5dLE8mC-ZQF_lNg4I0kLQy7dEFhbfovq40/edit?usp=sharing
My code attempt using this sample dataset is below:
data customer_keys(keep=customer_name customer_key customer_phone clean_address);
length customer_name $50 Comp_Name $20 customer_key 8 clean_address comp_address $5 customer_phone comp_phone $11.;
if _N_ = 1 then do;
declare hash h(multidata:'Y');
h.defineKey('Comp_Name','comp_address');
h.defineData('Comp_Name', 'customer_key','comp_address', 'comp_phone');
h.defineDone();
declare hiter hi('h');
declare hash hh(multidata:'Y');
hh.defineKey('customer_key');
hh.defineData('customer_key', 'Comp_Name','comp_address','comp_phone');
hh.defineDone();
_customer_key=0;
end;
set testdat;
rc=h.find(key:customer_name, key:clean_address);
if rc ne 0 then do;
rc=hi.first();
do while (rc=0);
if not missing(customer_phone) and clean_address=comp_address and customer_phone=comp_phone
then do;
h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
hh.add();
end;
else if not missing(customer_name) and clean_address=comp_address and jaroT(customer_name,Comp_name) ge 0.8
then do;
rc=hh.find();
do while (r ne 0);
dist2=jaroT(customer_name,Comp_name);
hh.has_next(result:r);
if r=0 & dist2 ge 0.8 then do;
h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
hh.add();
output;return;
end;
else if r ne 0 & dist2 ge 0.8
then rc=hh.find_next();
else if dist2 < 0.8
then leave;
end;
end;
rc=hi.next();
end;
_customer_key+1;
customer_key=_customer_key;
h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
hh.add(key:customer_key, data:customer_key, data:customer_name, data:clean_address, data:customer_phone);
end;
output;
run;
Running this code yields the error:
ERROR: Hash object added 23068656 items when memory failure occurred.
FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
I notice that if I remove the additional logic dealing with the phone number entirely, I don't have this memory issue. However I am still going to assume that my approach will fail either way because of lack of memory when I try to run it on the full dataset.
Try starting SAS with a larger memory limit. Search for "sas.exe -memsize 16G" in the Windows search box to launch a SAS session that can use up to 16G of memory. You can change the number before the G, too. Also make sure you have enough disk space. Cheers.
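Before changing anything, it may help to check what the current session is allowed to use. A small sketch:

```sas
/* Show the current MEMSIZE setting and where it was set */
proc options option=memsize value;
run;

/* Or write it to the log from macro code */
%put MEMSIZE is currently %sysfunc(getoption(memsize));
```

MEMSIZE is a startup option, so it has to be supplied when SAS is launched (on the command line or in the configuration file) rather than changed in a running session.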
I managed to solve the memory issue and now the code is working as expected! The error was in the section dealing with matching phone numbers. The corrected code is below:
data test customer_keys(keep=customer_name customer_key customer_phone clean_address);
length customer_name $50 Comp_Name $20 customer_key 8 clean_address comp_address $22 customer_phone comp_phone $14.;
if _N_ = 1 then do;
declare hash h(multidata:'Y');
h.defineKey('Comp_Name','comp_address');
h.defineData('Comp_Name', 'customer_key','comp_address', 'comp_phone');
h.defineDone();
declare hiter hi('h');
declare hash hh(multidata:'Y');
hh.defineKey('customer_key');
hh.defineData('customer_key', 'Comp_Name','comp_address','comp_phone');
hh.defineDone();
_customer_key=0;
end;
set testdat;
rc=h.find(key:customer_name, key:clean_address);
if rc ne 0 then do;
rc=hi.first();
do while (rc=0);
if not missing(customer_phone) and clean_address=comp_address and customer_phone=comp_phone
then do;
rc=hh.find();
do while (r ne 0);
hh.has_next(result:r);
if r=0 then do;
h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
hh.add();
output;return;
end;
else leave;
end;
end;
if not missing(customer_name) and clean_address=comp_address and jaroT(customer_name,Comp_name) ge 0.8
then do;
rc=hh.find();
do while (r ne 0);
dist2=jaroT(customer_name,Comp_name);
hh.has_next(result:r);
if r=0 & dist2 ge 0.8 then do;
h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
hh.add();
output;return;
end;
else if r ne 0 & dist2 ge 0.8
then rc=hh.find_next();
else if dist2 < 0.8
then leave;
end;
end;
rc=hi.next();
end;
_customer_key+1;
customer_key=_customer_key;
h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
hh.add(key:customer_key, data:customer_key, data:customer_name, data:clean_address, data:customer_phone);
end;
output;
run;
Below is the code for the jaroT function so that others can use it if they'd like. I didn't write this function, and you could replace it in my code above with any comparison algorithm you like (be sure to also change the threshold values). Note that FUNCTION ... ENDSUB definitions like these have to be compiled inside a PROC FCMP step before a data step can call them.
FUNCTION jaroT(string_1 $,string_2 $);
if STRING_1=STRING_2 THEN return(1);
else do;
length1=length(string_1);
if length1>26 then length1=26;
length2=length(string_2);
if length2>26 then length2=26;
range=(int(max(length1,length2)/2)-1);
big=max(length1,length2);
short=min(length1,length2);
array String1{26} $ 1 _temporary_;
array String2{26} $ 1 _temporary_;
array String1Match{26} $ 1 _temporary_;
array String2Match{26} $ 1 _temporary_;
/*The following two do loops place the characters into arrays labelled string1 and string2.
While we are here, we also set a second array of the same dimensions full of zeros. This will
act as our match key, whereby values in the same relative position as those in the original string
will be set to 1 when we find a valid match candidate later on.*/
do i=1 to length1 by 1;
String1{i}=substr(string_1,i,1);
String1Match{i}='0';
end;
do i=1 to length2 by 1;
String2{i}=substr(string_2,i,1);
String2Match{i}='0';
end;
/*We introduce m, which will keep track of the number of matches */
m=0;
/*We set a loop to compare one string with the other. We only need to loop the same number of
times as there are characters in one of our strings. Hence "do while i<=length1".
We set the allowable search range for a character using pos and endpos, and set another loop to
search through this range. We loop through until we find our first match, or until we hit
the end of our search range. If the character in string 2 is already assigned to a match, we move
on to searching the next character. When we find a match, the match flag for that character in both
strings is set to 1. Hopefully by the end of the loop, we have match flags for our two arrays set.
*/
do i=1 to length1 by 1;
pos=max(i-range,1);
endpos=min(range+i,length2);
do while (pos<=endpos and String1Match{i}^='1');
if String1{i}=String2{pos} and String2Match{pos}^='1' then do;
m=m+1;
String1Match{i}='1';
String2Match{pos}='1';
end;
pos=pos+1;
end;
end;
/* If there are no matching characters, we do not bother with any more work, and say the two strings are not alike at all */
IF m=0 then return(0);
else if m=1 then do;
t=0;
end;
/* If those three conditions all fail, then we move onto the heavy lifting.*/
else do;
/* We set i back to 1, ready for another looping run.
c is a variable to track the position of the next valid transposition check.
j is a variable helping to keep track of matching characters found during the next loop inside string 1.
k is a variable helping to keep track of matching characters found during the next loop inside string 2.
t will be the number of transpositions found.
*/
i=1;
c=1;
k=0;
j=0;
t=0;
/* We begin our loop. These conditional loops within loops
make several logical conclusions to arrive at the correct number of transpositions and matching characters
at the beginning of a string. At the end of this we should have every variable we need to calculate the winkler
score (and theoretically the jaro as well). I'm not going to write out an explanation here, but if you're
interested all the extra variables are defined just above this comment, and I've already told you what the
string arrays are. Work through a couple of examples with pen and paper, or in your head, to see how
and why it works.*/
do while (j<m OR k<m);
IF j<m then do;
IF String1Match{i}='1' THEN DO;
j=j+1;
String1{j}=String1{i};
end;
end;
IF k<m then do;
IF String2Match{i}='1' THEN DO;
k=k+1;
String2{k}=String2{i};
end;
end;
IF j>=c and k>=c then do;
IF String1{c}^=String2{c} then t=t+1;
c=c+1;
end;
i=i+1;
end;
end;
/* Finally, we do the calculation of the scores */
jaro=(1/3)*((m/length1)+(m/length2)+((m-(t/2))/m));
return(jaro);
end;
endsub;
FUNCTION fuzznum(num_1,num_2,diff,direction);
IF direction=-1 THEN DO;
IF num_1-num_2<=diff AND num_1-num_2>=0 THEN RETURN(1);
ELSE RETURN(0);
END;
IF direction=1 THEN DO;
IF num_1-num_2>=-(diff) AND num_1-num_2<=0 THEN RETURN(1);
ELSE RETURN(0);
END;
IF direction=0 THEN DO;
IF ABS(num_1-num_2)<=diff THEN RETURN(1);
ELSE RETURN(0);
END;
endsub;
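For anyone unsure how to make these functions callable, here is a minimal sketch of the wrapper. It only repeats the short fuzznum function to stay self-contained; jaroT would go in the same PROC FCMP step. The library location work.funcs.fuzzy is a name picked for this example.

```sas
/* Compile the function into a function library */
proc fcmp outlib=work.funcs.fuzzy;
  function fuzznum(num_1, num_2, diff, direction);
    if direction = -1 then do;
      if num_1 - num_2 <= diff and num_1 - num_2 >= 0 then return(1);
      else return(0);
    end;
    if direction = 1 then do;
      if num_1 - num_2 >= -(diff) and num_1 - num_2 <= 0 then return(1);
      else return(0);
    end;
    if direction = 0 then do;
      if abs(num_1 - num_2) <= diff then return(1);
      else return(0);
    end;
  endsub;
run;

/* Tell the data step where to look for compiled functions */
options cmplib=work.funcs;

data _null_;
  near = fuzznum(10, 12, 3, 0);   /* |10-12| = 2 is within 3, so this returns 1 */
  put near=;
run;
```

The CMPLIB option has to be set before the data step that calls the function; otherwise the function will not be found.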

Output to a text file

I need to output lots of different datasets to different text files. The datasets share some common variables that need to be output but also have quite a lot of different ones. I have loaded these different ones into a macro variable separated by blanks so that I can macroize this.
So I created a macro which loops over the datasets and outputs each into a different text file.
For this purpose, I used a put statement inside a data step. The PUT statement looks like this:
PUT (all the common variables shared by all the datasets), (macro variable containing all the dataset-specific variables);
E.g.:
%MACRO OUTPUT();
%DO N=1 %TO &TABLES_COUNT;
DATA _NULL_;
SET &&TABLE&N;
FILE 'PATH/&&TABLE&N..txt';
PUT a b c d "&vars";
RUN;
%END;
%MEND OUTPUT;
Where &vars is the macro variable containing all the variables needed for outputting for a dataset in the current loop.
Which gets resolved, for example, to:
PUT a b c d special1 special2 special5 ... special329;
Now the problem is that a quoted string can only be 262 characters long. Some of the datasets I am trying to output have so many variables that the macro variable holding them, used as a quoted string, is much longer than that. Is there another way to do this?
Do not include quotes around the list of variable names.
put a b c d &vars ;
There should not be any limit to the number of variables you can output, but if the length of the output line gets too long SAS will wrap to a new line. The default line length is currently 32,767 (but older versions of SAS use 256 as the default line length). You can actually set that much higher if you want. So you could use 1,000,000 for example. The upper limit probably depends on your operating system.
FILE "PATH/&&TABLE&N..txt" lrecl=1000000 ;
If you just want to make sure that the common variables appear at the front (that is you are not excluding any of the variables) then perhaps you don't need the list of variables for each table at all.
DATA _NULL_;
retain a b c d ;
SET &&TABLE&N;
FILE "&PATH/&&TABLE&N..txt" lrecl=1000000;
put (_all_) (+0) ;
RUN;
I would tackle this by having one put statement per variable. Use the trailing @ line-hold specifier so that you don't get a new line.
For example:
data test;
a=1;
b=2;
c=3;
output;
output;
run;
data _null_;
set test;
put a @;
put b @;
put c @;
put;
run;
Outputs this to the log:
800 data _null_;
801 set test;
802 put a @;
803 put b @;
804 put c @;
805 put;
806 run;
1 2 3
1 2 3
NOTE: There were 2 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds
So modify your macro to loop through the two sets of values using this syntax.
Not sure why you're talking about quoted strings: you would not quote the &vars argument.
put a b c d &vars;
not
put a b c d "&vars";
There's a limit there, but it's much higher (64k).
That said, I would do this in a data driven fashion with CALL EXECUTE. This is pretty simple and does it all in one step, assuming you can easily determine which datasets to output from the dictionary tables in a WHERE statement. This has a limitation of 32kiB total, though if you're actually going to go over that you can work around it very easily (you can separate out various bits into multiple calls, and even structure the call so that if the callstr hits 32000 long you issue a call execute with it and then continue).
This avoids having to manage a bunch of large macro variables (your &VARS will really be &&VARS&N, which means many large macro variables).
data test;
length vars callstr $32767;
do _n_ = 1 by 1 until (last.memname);
set sashelp.vcolumn;
where memname in ('CLASS','CARS');
by libname memname;
vars = catx(' ',vars,name);
end;
callstr = catx(' ',
'data _null_;',
'set',cats(libname,'.',memname),';',
'file',cats('"c:\temp\',memname,'.txt"'),';',
'put',vars,';',
'run;');
call execute(callstr);
run;

Algorithm for calculating most stable, consecutive values from a database

I have some questions and I'm in need of your input.
Say I have a database table with 2000-3000 rows, where each row has a value and some identifiers. I need to extract ~100 consecutive rows with the most stable values (lowest spread). A few outlier ("jumper") values are okay if they can be excluded.
How would you do this and what algorithm would you use?
I'm currently using SAS Enterprise Guide for my DB which runs on Oracle. I don't really know that much of the generic SAS language but I don't know what other language I could use for this? Some scripting language? I have limited programming knowledge but this task seems pretty easy, correct?
The algorithms I've been thinking of are:
1. Select 100 consecutive rows and calculate the standard deviation. Shift the selection by 1 row and calculate the standard deviation again. Loop through the whole table, then export the rows with the lowest standard deviation.
2. Same as 1, but calculate the variance instead of the standard deviation (basically the same thing). When the whole table has been looped, do it again but exclude the 1 row with the highest value relative to the average. Repeat the process until 5 jumpers have been excluded and compare the results. Pros and cons compared to method 1?
Questions:
Best & easiest method?
Preferred language? Possible in SAS?
Do you have any other method you would recommend?
Thanks in advance
/Niklas
The code below will do what you are asking. It uses some sample data and only calculates the result for 10 observations (rather than 100). I'll leave it to you to adapt as required.
Create some sample data from a dataset that is available in all SAS installations, adding a row number (obs):
data xx;
set sashelp.stocks;
where stock = 'IBM';
obs = _n_;
run;
Sort by the row number in descending order. This makes it easier to calculate the stddev:
proc sort data=xx;
by descending obs;
run;
Use an array to keep the subsequent 10 obs for every row. Calculate the stddev for each row using the array (except for the last few rows, where fewer than 10 values are available). Remember that we are working backwards through the data.
data calcs;
set xx;
array a[10] arr1-arr10;
retain arr1-arr10 .;
do tmp=10 to 2 by -1;
a[tmp] = a[tmp-1];
end;
a[1] = close;
if _n_ ge 10 then do;
std = std(of arr1-arr10);
end;
run;
Find which obs (ie. row) had the lowest stddev calc. Save it to a macro var.
proc sql noprint;
select obs into :start_row
from calcs
having std = min(std)
;
quit;
Select the 10 observations from the sample data that were involved in calcing the lowest stddev.
proc sql noprint;
create table final as
select *
from xx
where obs between &start_row and %eval(&start_row+9) /* 10 rows: start_row through start_row+9 */
order by obs
;
quit;
This is an addition to Robert's solution that also covers part 2: it creates a second array, then loops through it and removes the top 5 values. You'll still need the last parts of Robert's solution to extract the row with the minimum standard deviation and then the corresponding attached rows. You didn't specify how you wanted to deal with the variances that have the max removed, so they are left in the dataset.
data want;
*set arrays for looping;
/*used to calculate the std*/
array p{0:9} _temporary_;
/*used to copy the array over to reduce variables*/
array ps(1:10) _temporary_;
/*used to store the var with 5 max values removed*/
array s{1:5} var1-var5;
set sample;
p{mod(_n_,10)} = open;
if _n_ ge 10 then std = std(of p{*});
*remove max values to calculate variance;
if _n_ ge 10 then do;
*copy array over to remove values;
do i=1 to 10;
ps(i)=p(i-1);
end;
do i=1 to 5;
index=whichn(max(of ps(*)), of ps(*));
ps(index)=.;
s(i)=var(of ps(*));
end;
end;
run;

Replace missings SAS

I have two tables:
data a;
input a b c;
datalines;
1 2 .
;
run;
data b;
input a b c;
datalines;
1 . 3
;
run;
The result I want from these tables is replacing the missings by the values that are not missing:
a b c
-----
1 2 3
How can I do this with the least amount of code?
EDIT:
I wrote the code below and it works, but maybe there is a simpler way to do this.
%macro x;
%macro dummy; %mend dummy;
data _null_;
set x end=Last;
call symputx("name"||left(_N_),name);
if Last then call symputx("num",_n_);
run;
data c;
set a b;
run;
data c;
set c;
%do i=1 %to &num;
x&i=lag(&&name&i);
%end;
n=_n_;
run;
data c1 (drop= n %do i=1 %to &num; x&i %end;);
set c (where=(n=2));
%do i=1 %to &num;
if missing(&&name&i) and not missing(x&i) then &&name&i=x&i;
%end;
run;
%mend;
%x;
If the values are consistent, ie, you never have:
1 2 3
1 3 .
and/or are happy for them to be overwritten, then UPDATE is excellent for this.
data c;
update a b;
by a;
run;
UPDATE will only replace values with non-missing values, so . gets replaced by 3 but 2 is not replaced by a missing value. This again assumes a is the ID variable, as Gordon assumes.
You also can easily do this:
data c;
set a b;
by a;
retain b_1 c_1;
if first.a then do; *save the first b and c;
b_1=b;
c_1=c;
end;
else do; *now fill in missings using COALESCE which only replaces if missing;
b_1=coalesce(b_1,b); *use coalescec if this is a char var;
c_1=coalesce(c_1,c); *same;
end;
if last.a then output; *output last row;
drop b c;
rename
b_1=b
c_1=c
;
run;
This makes sure you keep the first instance of any particular value, if they may be different (the opposite of update which keeps the last instance, and different from the SQL solution which takes MAX specifically). All three should give the same result if you have only identical values. Data step options should be a bit faster than the SQL option, I expect, as they're both one pass solutions with no matching required (though it probably doesn't matter).
Using proc SQL, you can do this with aggregation:
proc sql;
select max(a) as a, max(b) as b, max(c) as c
from (select a, b, c from a union all
select a, b, c from b
) x;
quit;
If, as I suspect, the first column is an id for matching the two tables, you should instead do:
proc sql;
select coalesce(a.a, b.a) as a, coalesce(a.b, b.b) as b, coalesce(a.c, b.c) as c
from a full join
b
on a.a = b.a;
quit;
I'm going to post how to do your approach with some details here: I wouldn't consider this the best approach for this, but you can perhaps learn more easily by starting with what you have, and it's not a horrible approach certainly - just not optimal.
Starting:
%macro x;
%macro dummy; %mend dummy;
data _null_;
set x end=Last;
call symputx("name"||left(_N_),name);
if Last then call symputx("num",_n_);
run;
data c;
set a b;
run;
data c; *NOTE 1;
set c;
%do i=1 %to &num;
x&i=lag(&&name&i); *NOTE 2;
%end;
n=_n_;
run;
data c1 (drop= n %do i=1 %to &num; x&i %end;); *NOTE 3;
set c (where=(n=2));
%do i=1 %to &num;
if missing(&&name&i) and not missing(x&i) then &&name&i=x&i;
%end;
run;
%mend;
%x;
Ending:
*You can still do the first datastep to figure out the dimensions of the arrays,
if you want, use &num instead of the 3s hardcoded in there (but do not need &name list).;
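data c;
set a(in=in_a) b(in=in_b);
array x[3] _temporary_; *NOTE 4;
array vars[3] a b c;
do i = 1 to dim(x); *NOTE 5;
x[i] = lag3(vars[i]); *this one LAG call runs 3 times per row on a single queue so LAG3 (LAGn with n equal to the number of variables) returns the value from the previous row;
end;
if in_b then do; *NOTE 6;
do i=1 to dim(x);
if missing(vars[i]) then vars[i]=x[i]; *NOTE 7;
end;
output;
end;
drop i;
run;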
Notes:
NOTE 1: You can combine the two c data steps here with no difference at all. In general, use as few data steps as you can, as they're slow. This is a difference from R and similar tools, which process data in memory: SAS processes data on disk, which is nice for handling 200GB of data but not as fast for multiple steps like this, so make fewer steps.
NOTE 2: This is basically a macro implementation of an array. SAS datastep has an array already! Use it.
NOTE 3: You don't need to do the drop like that. drop=n x: works fine as long as none of your real variables start with x (and if they do, use an _ before all of your dummy variables and it will be the same). : is a wild card for 'starts with'.
NOTE 4: Here is the array implementation of your x array. I use temporary because that means the variables will be dropped automatically for you.
NOTE 5: Here we do the lags. I don't like using lag for this where retain does a better job of the same thing, but it works fine.
NOTE 6: This if in_b is like your if last from your step. This identifies records in b only - if there's only one then it will only happen once.
NOTE 7: This is doing the replacement for missing. COALESCE \ COALESCEC would also work for this purpose (though in some cases you might need to use this method if you are unsure of the variable type). No reason to check if not missing unless you're using special missings in some fashion: there is no harm in replacing one missing value with another.