I have two tables:
data a;
input a b c;
datalines;
1 2 .
;
run;
data b;
input a b c;
datalines;
1 . 3
;
run;
The result I want from these tables is replacing the missings by the values that are not missing:
a b c
-----
1 2 3
How can I do this with the least amount of code?
EDIT:
I wrote the code below and it works, but maybe there is a simpler way to do this.
%macro x;
%macro dummy; %mend dummy;
data _null_;
set x end=Last;
call symputx("name"||left(_N_),name);
if Last then call symputx("num",_n_);
run;
data c;
set a b;
run;
data c;
set c;
%do i=1 %to &num;
x&i=lag(&&name&i);
%end;
n=_n_;
run;
data c1 (drop= n %do i=1 %to &num; x&i %end;);
set c (where=(n=2));
%do i=1 %to &num;
if missing(&&name&i) and not missing(x&i) then &&name&i=x&i;
%end;
run;
%mend;
%x;
If the values are consistent, ie, you never have:
1 2 3
1 3 .
and/or are happy for them to be overwritten, then UPDATE is excellent for this.
data c;
update a b;
by a;
run;
UPDATE will only replace values with non-missing values, so . gets replaced by 3 but 2 is not replaced by .. Again assuming a is the ID variable as Gordon assumes.
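A minimal sketch of the full UPDATE flow, assuming a is the ID variable and that the tables may not already be sorted. UPDATE needs both inputs sorted by the BY variable, and the master table (the first one listed) should have only one row per ID; the PROC PRINT at the end is only there to check the result.

```sas
/* Assumption: a is the key; sort both tables by it before UPDATE */
proc sort data=a; by a; run;
proc sort data=b; by a; run;

data c;
  update a b;   /* a is the master, b is the transaction table */
  by a;
run;

/* Check the merged row */
proc print data=c noobs;
run;
```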
You also can easily do this:
data c;
set a b;
by a;
retain b_1 c_1;
if first.a then do; *save the first b and c;
b_1=b;
c_1=c;
end;
else do; *now fill in missings using COALESCE which only replaces if missing;
b_1=coalesce(b_1,b); *use coalescec if this is a char var;
c_1=coalesce(c_1,c); *same;
end;
if last.a then output; *output last row;
drop b c;
rename
b_1=b
c_1=c
;
run;
This makes sure you keep the first instance of any particular value, if they may be different (the opposite of update which keeps the last instance, and different from the SQL solution which takes MAX specifically). All three should give the same result if you have only identical values. Data step options should be a bit faster than the SQL option, I expect, as they're both one pass solutions with no matching required (though it probably doesn't matter).
Using proc SQL, you can do this with aggregation:
proc sql;
select max(a) as a, max(b) as b, max(c) as c
from (select a, b, c from a union all
select a, b, c from b
) x;
If, as I suspect, the first column is an id for matching the two tables, you should instead do:
proc sql;
select coalesce(a.a, b.a) as a, coalesce(a.b, b.b) as b, coalesce(a.c, b.c) as c
from a full join
b
on a.a = b.a;
I'm going to post how to do your approach with some details here: I wouldn't consider this the best approach for this, but you can perhaps learn more easily by starting with what you have, and it's not a horrible approach certainly - just not optimal.
Starting:
%macro x;
%macro dummy; %mend dummy;
data _null_;
set x end=Last;
call symputx("name"||left(_N_),name);
if Last then call symputx("num",_n_);
run;
data c;
set a b;
run;
data c; *NOTE 1;
set c;
%do i=1 %to &num;
x&i=lag(&&name&i); *NOTE 2;
%end;
n=_n_;
run;
data c1 (drop= n %do i=1 %to &num; x&i %end;); *NOTE 3;
set c (where=(n=2));
%do i=1 %to &num;
if missing(&&name&i) and not missing(x&i) then &&name&i=x&i;
%end;
run;
%mend;
%x;
Ending:
*You can still do the first datastep to figure out the dimensions of the arrays,
if you want, use &num instead of the 3s hardcoded in there (but do not need &name list).;
data c;
set a(in=in_a) b(in=in_b);
array x[3] _temporary_; *NOTE 4;
array vars[3] a b c;
if in_b then do; *NOTE 6;
do i=1 to dim(x);
if missing(vars[i]) then vars[i]=x[i]; *NOTE 7;
end;
output;
end;
do i = 1 to dim(x); *NOTE 5;
x[i] = vars[i];
end;
drop i;
run;
Notes:
NOTE 1: You can combine the two c data steps here with no difference at all. In general, use as few data steps as you can, as they're slow. This is a difference from R or similar tools, which process data in memory; SAS processes data on disk, which is nice for handling 200GB of data but not as fast for multiple steps like this. So make fewer steps.
NOTE 2: This is basically a macro implementation of an array. SAS datastep has an array already! Use it.
NOTE 3: You don't need to do the drop like that. drop=n x: works fine as long as none of your real variables start with x (and if they do, use an _ before all of your dummy variables and it will be the same). : is a wild card for 'starts with'.
NOTE 4: Here is the array implementation of your x array. I use _temporary_ because that means the variables will be dropped automatically for you.
NOTE 5: Here we save the current row's values so the next row can see them. A _temporary_ array is retained across rows automatically, so it does the job your lag calls did. Do not call lag(vars[i]) inside the loop: every iteration would share a single lag queue, so the values of different variables would get mixed up.
NOTE 6: This if in_b is like your if last from your step. This identifies records in b only - if there's only one then it will only happen once.
NOTE 7: This is doing the replacement for missing. COALESCE / COALESCEC would also work for this purpose (though in some cases you might need to use this method if you are unsure of the variable type). No reason to check if not missing unless you're using special missings in some fashion - there is no harm in replacing one missing value with another.
Related
I have variables named _200701, _200702,... through _201612, each containing specific numeric data for that month. From these, I want to subtract a specific amount (variable cap_inc) if a condition is met:
%MACRO DeleteExc(var);
DATA Working.Test;
SET Working.Test;
IF &var. GE cap_inc THEN &var. = SUM(&var., - cap_inc);
ELSE &var. = &var.;
RUN;
%MEND;
Code is working if I put only one month as a parameter (eg _200909)... But I want to put there sequence from these variables. I have tried combinations like "OF _200701 -- _201612" or "OF _20:", but nothing has worked.
I have also another macro, using parmbuff parameter, working in the "for each loop" way, where I can put more variables separated by comma, for instance
%DeleteExc(_200701, _200702, _200703)
But I still can't pass all variables in some convenient, easy to follow way. (I don't want to type all the parameters, as there are 120 of them.)
Is there any way to do this?
Thank you!
First thing: if you want to pass a list into a macro, DO NOT delimit the list with commas. It will just make calling the macro a large pain: you will either need to use macro quoting to hide the commas, or override SAS's parameter processing by using the /parmbuff option and add logic to process the &syspbuff macro variable yourself. Use some other character that is not used in the values as the delimiter, like | or ^ for example. For a list of variable names, use spaces as the delimiter.
%DeleteExc(varlist=_200701 _200702 _200703)
Then you can use the macro variable anywhere SAS expects a list of variables.
array in &varlist ;
total = sum(of &varlist);
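As a rough sketch of how that could look for the capping problem: the macro below takes a space-delimited varlist= parameter and only generates data step statements, so it has to be called inside a data step. The array name _months and the loop counter _i are made-up helper names, and the name range _200701--_201612 assumes the month columns sit next to each other in the dataset with nothing else in between.

```sas
%macro DeleteExc(varlist=);
  /* Assumption: every name in the list is numeric and exists in the data */
  array _months {*} &varlist;
  do _i = 1 to dim(_months);
    if _months{_i} ge cap_inc then
      _months{_i} = sum(_months{_i}, -cap_inc);
  end;
  drop _i;
%mend DeleteExc;

data Working.Test;
  set Working.Test;
  %DeleteExc(varlist=_200701--_201612)
run;
```

Because the list is just text, any valid SAS variable list works here, for example an explicit list such as varlist=_200701 _200702 _200703.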
Now since your list is really a list of MONTHS then give your macro the start and end month and let it generate the list for you.
%macro DeleteExc(start,end);
%local i var ;
%do i=0 %to %sysfunc(intck(month,&start,&end)) ;
%let var=_%sysfunc(intnx(month,&start,&i,b),yymmn6.);
IF .Z < cap_inc <= &var. THEN &var. = &var. - cap_inc;
%end;
%mend;
DATA Working.Test;
SET Working.Test;
%DeleteExc("01JAN2007"d,"01DEC2016"d);
RUN;
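The same date arithmetic can also be used to generate just the variable names, which is handy for checking what the loop produces before changing any data. This is only a sketch: the macro name month_vars and the output dataset totals are made up for illustration.

```sas
%macro month_vars(start,end);
  %local i;
  %do i=0 %to %sysfunc(intck(month,&start,&end));
    _%sysfunc(intnx(month,&start,&i,b),yymmn6.)
  %end;
%mend month_vars;

/* Write the names for the first three months to the log */
%put NOTE: %month_vars("01JAN2007"d,"01MAR2007"d);

/* Use the generated list anywhere a variable list is expected */
data totals;
  set Working.Test;
  total = sum(of %month_vars("01JAN2007"d,"01DEC2016"d));
run;
```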
Here are a few options - perhaps there's one you haven't tried?
data example;
array months{*} _200701-_200712 _200801-_200812 (24*1);
array underscores{*} _:;
_randomvar = 100;
s1 = sum(of _200701-_200812); /*Generates lots of notes about uninitialised variables but gives correct result*/
s2 = sum(of _200701--_200812); /*Works only if there are no rogue columns in between month columns*/
s3 = sum(of months{*}); /* Requires array definition*/
s4 = sum(of _:); /*Sum any variables with _ prefix - potentially including undesired variables*/
put (s1-s4)(=);
run;
The double dash (--) variable name range list can be used to specify the variables in an array. A simple iterative DO LOOP lets you perform the desired operation on each variable.
data want;
set have;
array month_named_variables _200701 -- _201612;
do _index = 1 to dim(month_named_variables); drop _index;
IF month_named_variables(_index) GE cap_inc THEN
month_named_variables(_index) = SUM(month_named_variables(_index), - cap_inc);
ELSE
month_named_variables(_index) = month_named_variables(_index);
end;
run;
If the data set has extra variables within the name range you can still use an array and non-macro code:
data want;
set have;
array nums _numeric_;
do _index = 1 to dim(nums); drop _index;
_vname = vname(nums(_index)); drop _vname;
if _vname ne: '_'
or not (2007 <= input(substr(_vname,2,4), ??4.) <= 2016)
or not (01 <= input(substr(_vname,6,2), ??2.) <= 12)
or not length(_vname) = 7
then continue;
IF nums(_index) GE cap_inc THEN
nums(_index) = SUM(nums(_index), - cap_inc);
ELSE
nums(_index) = nums(_index);
end;
run;
If you really need use a specific list of variables and want to work within a macro, I would recommend passing the FROM and TO values corresponding to the variable names and looping that range according to the naming convention:
%macro want(data=, yyyymm_from=, yyyymm_to=, guard=1000, debug=0);
%local LOWER UPPER YEARMON INDEX NVARS;
%let LOWER = %sysfunc(inputn(&yyyymm_from,yymmn6.));
%let UPPER = %sysfunc(inputn(&yyyymm_to,yymmn6.));
%let INDEX = 1;
%do YEARMON = &LOWER %to &UPPER;
%let yyyymm = %sysfunc(putn(&YEARMON, yymmn6.));
%local ymvar&INDEX;
%let ymvar&INDEX = _&yyyymm; %* NAMING CONVENTION;
%if &debug %then %put NOTE: YMVAR&INDEX=%superq(YMVAR&INDEX);
%if &INDEX > &GUARD %then %do;
%put ERROR: Exceeded guard limit of &GUARD variables;
%return;
%end;
%let NVARS = &INDEX;
%let YEARMON = %sysfunc(INTNX(MONTH,&yearmon,1)); %* NAMING CONVENTION;
%let YEARMON = %eval(&YEARMON-1); %* back off by one for implicit macro do loop increment of +1;
%let INDEX = %eval(&INDEX+1);
%end;
%do INDEX = 1 %to &NVARS;
%put NOTE: &=INDEX YMVAR&INDEX=&&YMVAR&INDEX;
%end;
%mend;
%want (data=have, yyyymm_from=200701, yyyymm_to=201612)
If my understanding is correct, you want to loop over months, which depends on the variables in your data. You could set a start date and an end date, then loop between them.
%macro month_loop(start,end);
%local date var;
%let start=%sysfunc(inputn(&start,yymmn6.));
%let end=%sysfunc(inputn(&end,yymmn6.));
%let date=&start;
/* one data step for all months, so every change ends up in want */
data want;
set have;
%do %while (&date <= &end);
%let var=_%sysfunc(putn(&date,yymmn6.));
IF &var. GE cap_inc THEN &var. = SUM(&var., - cap_inc);
ELSE &var. = &var.;
/* advance to the next month after the current one has been used */
%let date=%sysfunc(intnx(month,&date,1));
%end;
run;
%mend;
%month_loop(200701,201612)
The data I have are millions of rows and rather sparse, with anywhere between 3 and 10 variables needing to be processed. My end result needs to be one single row containing the first non-missing value for each column. Take the following test data:
** test data **;
data test;
length ID $5 AID 8 TYPE $5;
input ID $ AID TYPE $;
datalines;
A . .
. 123 .
C . XYZ
;
run;
The end result should look like this:
ID AID TYPE
A 123 XYZ
Using macro lists and loops I can brute force this result with multiple merge statements where the variable is non-missing and obs=1 but this is not efficient when the data are very large (below I'd loop over these variables rather than write multiple merge statements):
** works but takes too long on big data **;
data one_row;
merge
test(keep=ID where=(ID ne "") obs=1) /* character */
test(keep=AID where=(AID ne .) obs=1) /* numeric */
test(keep=TYPE where=(TYPE ne "") obs=1); /* character */
run;
The coalesce function seems very promising, but I believe I need it in combination with array and output to build this single-row result. The function also differs (coalesce and coalescec depending on variable type) whereas it does not matter using proc sql. I get an error using array since all variables in the array list are not the same type.
Exactly what is most efficient will largely depend on the characteristics of your data. In particular, whether the first nonmissing value for the last variable is usually relatively "early" in the dataset, or if you usually will have to trawl through the entire dataset to get to it.
I assume your dataset is not indexed (as that would simplify things greatly).
One option is the standard data step. This isn't necessarily fast, but it's probably not too much slower than most other options given you're going to have to read most/all of the rows no matter what you do. This has a nice advantage that it can stop as soon as every variable has a value.
data want;
if 0 then set test; *defines characteristics;
set test(rename=(id=_id aid=_aid type=_type)) end=eof;
id=coalescec(id,_id);
aid=coalesce(aid,_aid);
type=coalescec(type,_type);
if cmiss(of id aid type)=0 then do;
output;
stop;
end;
else if eof then output;
drop _:;
run;
You could populate all of that from macro variables from dictionary.columns, or even might use temporary arrays, though I think that gets too messy.
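One possible way to do that, sketched under the assumption that the table is WORK.TEST and that every variable name is at most 31 characters (so the _ prefix still fits in a SAS name). The macro variable names renames, assigns and names are made up for this example.

```sas
proc sql noprint;
  /* Build the RENAME= pairs, the COALESCE/COALESCEC assignments,
     and the plain variable list from the column metadata */
  select cats(name,'=_',name),
         case when type='char'
              then cats(name,'=coalescec(',name,',_',name,');')
              else cats(name,'=coalesce(',name,',_',name,');')
         end,
         name
    into :renames separated by ' ',
         :assigns separated by ' ',
         :names   separated by ' '
    from dictionary.columns
    where libname='WORK' and memname='TEST';
quit;

data want;
  if 0 then set test;                      /* defines characteristics */
  set test(rename=(&renames)) end=eof;
  &assigns
  if cmiss(of &names)=0 then do;
    output;
    stop;
  end;
  else if eof then output;
  drop _:;
run;
```

This generates the same statements as the hand-written step, so it behaves the same way; it only saves typing when there are many variables.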
Another option is the self update, except it needs two changes. One, you need something to join on (as opposed to merge which can have no by variable). Two, it will give you the last nonmissing value, not the first, so you'd have to reverse-sort the dataset.
But assuming you added x to the first dataset, with any value (doesn't matter, but constant for every row), it is this simple:
data want;
update test(obs=0) test;
by x;
run;
So that has the huge advantage of simplicity of code, exchanged for some cost of time (reverse sorting and adding a new variable).
If your dataset is very sparse, a transpose might be a good compromise. Doesn't require knowing the variable names as you can process them with arrays.
data test_t;
set test;
array numvars _numeric_;
array charvars _character_;
do _i = 1 to dim(numvars);
if not missing(numvars[_i]) then do;
varname = vname(numvars[_i]);
numvalue= numvars[_i];
output;
end;
end;
do _i = 1 to dim(charvars);
if not missing(charvars[_i]) then do;
varname = vname(charvars[_i]);
charvalue= charvars[_i];
output;
end;
end;
keep numvalue charvalue varname;
run;
proc sort data=test_t;
by varname;
run;
data want;
set test_t;
by varname;
if first.varname;
run;
Then you proc transpose this to get the desired want (or maybe this works for you as is). It does lose the formats/etc. on the value, so take that into account, and your character value length probably needs to be set to something appropriately long - and then set back (you can use an if 0 then set to fix it).
A similar hash approach would work roughly the same way; it has the advantage that it would stop much sooner, and doesn't require resorting.
data test_h;
set test end=eof;
array numvars _numeric_;
array charvars _character_;
length varname $32 numvalue 8 charvalue $1024; *or longest charvalue length;
if _n_=1 then do;
declare hash h(ordered:'a');
h.defineKey('varname');
h.defineData('varname','numvalue','charvalue');
h.defineDone();
end;
do _i = 1 to dim(numvars);
if not missing(numvars[_i]) then do;
varname = vname(numvars[_i]);
rc = h.find();
if rc ne 0 then do;
numvalue= numvars[_i];
rc=h.add();
end;
end;
end;
do _i = 1 to dim(charvars);
if not missing(charvars[_i]) then do;
varname = vname(charvars[_i]);
rc = h.find();
if rc ne 0 then do;
charvalue= charvars[_i];
rc=h.add();
end;
end;
end;
if eof or h.num_items = dim(numvars) + dim(charvars) then do;
rc = h.output(dataset:'want');
stop; *every variable has a value (or the data ran out), so stop reading;
end;
run;
There are lots of other solutions; which one is most efficient depends on your data.
I am asking an old question. I have the code and have looked at previous questions, but nevertheless I am unable to correct my mistake. Below is the code with dummy data. I am unable to pass the names of variables to the macro.
data x;
input x $ y z;
datalines;
a 23 34
b 34 43
a 23 54
b 87 78
a 12 32
b 22 33
;
run;
Now I create a list of variables
%let name_list=y z;
Then I write macro.
%macro mixed;
%let j=1;
%let first=%scan(&name_list.,%eval(&j));
%do %while (&first ne );
proc mixed data=x;
class x;
model &name_list.=;
random x;
ods output covParms=cov1;
run;
%let j=%eval(&j+1);
%let first=%scan(&name_list.,%eval(&j));
%end;
run;
%mend;
%mixed;
Somehow this is not working. Any help will be appreciated.
If you want to iterate over the names in a list then you can just use a normal %DO ... %TO loop. No need to manually initialize or increment the counter.
%do i=1 %to %sysfunc(countw(&name_list));
%let name=%scan(&name_list,&i);
.... place code here that uses &NAME ....
%end;
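Applied to the PROC MIXED example, that loop could look roughly like the sketch below. Passing the list as a parameter and naming the output datasets cov_&name (one per response variable) are choices made for this illustration, not something from the original code.

```sas
%macro mixed(name_list);
  %local i name;
  %do i=1 %to %sysfunc(countw(&name_list));
    %let name=%scan(&name_list,&i);
    proc mixed data=x;
      class x;
      model &name = ;                  /* one response variable per run */
      random x;
      ods output covparms=cov_&name;   /* assumed naming: cov_y, cov_z */
    run;
  %end;
%mend mixed;

%mixed(y z)
```

Note that the MODEL statement uses the single name from the current iteration rather than the whole list.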
I wish to drop the columns in a SAS dataset which has a sum less than a particular value. Consider the case below.
Column_A Pred_1 Pred_2 Pred_3 Pred_4 Pred_5
A 1 1 0 1 0
A 0 1 0 1 0
A 0 1 0 1 0
A 0 1 0 1 1
A 0 1 0 0 1
Let us assume that our threshold is 4, so I wish to drop predictors having sum of active observations less than 4, so the output would look like
Column_A Pred_2 Pred_4
A 1 1
A 1 1
A 1 1
A 1 1
A 1 0
Currently I am using a very inefficient method of using multiple transposes to drop the predictors. There are multiple datasets with records > 30,000 so transpose approach is taking time. Would appreciate if anyone has a more efficient solution!
Thanks!
Seems like you could do:
1. Run PROC MEANS or a similar proc to get the sums
2. Create a macro variable that contains all variables in the dataset with sum < threshold
3. Drop those variables
Then no TRANSPOSE or whatever, just regular plain old summarization and drops. Note you should use ODS OUTPUT not the OUT= in PROC MEANS, or else you will have to PROC TRANSPOSE the normal PROC MEANS OUT= dataset.
An example using a trivial dataset:
data have;
array x[20];
do _n_ = 1 to 20;
do _i = 1 to dim(x);
x[_i] = rand('Uniform') < 0.2;
end;
output;
end;
run;
ods output summary=have_sums; *how we get our output;
ods html select none; *stop it from going to results window;
proc means data=have stackodsoutput sum; *stackodsoutput is 9.3+ I believe;
var x1-x20;
run;
ods html select all; *reenable normal html output;
%let threshhold=4; *your threshhold value;
proc sql;
select variable
into :droplist_threshhold separated by ' '
from have_sums
where sum lt &threshhold; *keep only variables whose sum is below the threshold;
quit;
data want;
set have;
drop &droplist_threshhold.; *and now, drop them!;
run;
Just use PROC SUMMARY to get the sums. You can then use a data step to generate the list of variable names to drop.
%let threshhold=4;
%let varlist= pred_1 - pred_5;
proc summary data=have ;
var &varlist ;
output out=sum sum= ;
run;
data _null_;
set sum ;
array x &varlist ;
length droplist $500 ;
do i=1 to dim(x);
if x(i) < &threshhold then droplist=catx(' ',droplist,vname(x(i)));
end;
call symputx('droplist',droplist);
run;
You can then use the macro variable to generate a DROP statement or a DROP= dataset option.
drop &droplist;
I can create a Numbered Range List of numeric type, but not character type.
My code is similar to this:
DATA TestDataset;
INPUT a1-a3 $;
DATALINES;
A B C
;
RUN;
This produces 3 variables - [a1], [a2] and [a3] as expected. However [a3] is character, but [a1] and [a2] are numeric. This leaves me with missing values as per the following table:
a1 a2 a3
. . C
The following code works, but obviously it does not scale nicely.
INPUT a1 $ a2 $ a3 $;
Am I missing something?
I believe you can use the hyphen notation on the length statement to get what you want. You really should use a length statement regardless; otherwise the length defaults to $8.
DATA TestDataset;
length a1-a3 $20;
INPUT a1-a3 ;
DATALINES;
A B C
;
RUN;
I came up with a macro solution:
%MACRO var_list_char (var_prefix, n);
%LOCAL i ;
%DO i = 1 %TO &n;
&var_prefix&i$
%END;
%MEND;
DATA TestDataset;
INPUT %var_list_char (a, 3);
DATALINES;
A B C
;
RUN;
I wish I could find a way to do this without macros - I will keep digging for a bit and will update this post if I find more. In the meantime, the above approach will definitely work.
UPDATE 1: #carolinajay65's solution above is the correct non-macro approach.
UPDATE 2: There is another way that I found.
DATA TestDataset;
INPUT (a1-a3) ($);
DATALINES;
A B C
;
RUN;
More documentation of the language features supporting this technique can be found in the SAS documentation for the INPUT statement (formatted input), in the section labeled "How to Group Variables and Informats".
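The grouped form can also carry a length. A small sketch, assuming values of up to 20 characters separated by blanks: the colon modifier keeps list-style reading (stop at the next blank) while the $20. informat sets the length of all three variables.

```sas
DATA TestDataset;
  /* (:$20.) applies the same character informat to a1, a2 and a3 */
  INPUT (a1-a3) (:$20.);
  DATALINES;
A B C
;
RUN;
```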