I want to create a new variable from the existing variable with the highest number n.
Say we have a table whose number of columns keeps increasing:
data have;
input uid $ var1 $ var2 $ var3 $;
datalines;
1111 1 0 1
2222 1 0 0
3333 0 0 0
4444 1 1 1
5555 0 0 0
6666 1 1 1
;
I want to derive final_code from the variable var3.
data want;
set have;
final_code = max(of var1-var3);
run;
The above doesn't do what I want, as I want only the var3 column to be used. Similarly, if var4 is there, I want var4 only.
Can somebody help me here?
If I understand you right, you don't want the max of the values but the value from the highest-numbered variable.
There are lots of ways to do this; which one fits depends on how the variables are named. Here's the easiest, if they're actually named as you say.
data want;
set have;
array var[*] var:;
final_code = var[dim(var)];
run;
Here we make an array out of var: (every variable whose name starts with var) and then choose the last element in the array using dim, which returns the size of the array.
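To see that the same step follows the table as columns are added, here is a minimal sketch assuming a fourth column var4 has appeared. The names have4, want4 and v are made up for illustration. One caveat: var: lists variables in data set order, so this picks the last var-prefixed variable by position, which is the highest-numbered one only if the columns were created in order.

```sas
/* Hypothetical input: same layout as the question, plus var4 */
data have4;
  input uid $ var1 $ var2 $ var3 $ var4 $;
  datalines;
1111 1 0 1 0
2222 1 0 0 1
;
run;

data want4;
  set have4;
  array v[*] var:;          /* var1-var4, in data set order */
  final_code = v[dim(v)];   /* dim(v) is 4 here, so this copies var4 */
run;
```

No change to the second step is needed when var5 shows up, because the array is rebuilt from whatever var-prefixed variables the input has.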
I think this is what you are looking for:
%let n=3;
data want;
set have;
var&n = max(of var1-var&n);
drop var1-var%eval(&n-1);
run;
The macro variable &n holds the value of n. It is substituted as plain text before the data step is compiled.
The DROP statement tells the data step to drop those variables.
The %eval() macro function performs integer arithmetic on macro values, so we are dropping var1 through var(n-1).
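To check what the macro expressions turn into before they reach the data step, a small sketch using %put (which writes the resolved text to the log) might look like this. The wording of the messages is arbitrary.

```sas
%let n = 3;

/* Each %put writes the resolved text to the log */
%put n minus 1 is %eval(&n - 1);
%put drop list is var1-var%eval(&n - 1);
%put assignment target is var&n;
```

With n set to 3, the log should show 2, var1-var2 and var3 respectively, which is the text the data step above ends up seeing.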
I have a dataframe
ID value1
1 12
2 345
3 342
I have a second dataframe
value2
3823
How do I get the following result?
ID value1 value2
1 12 3823
2 345 3823
3 342 3823
Any joins I have tried have given me
ID value1 value2
1 12 .
2 345 .
3 342 .
. . 3823
No need for joins or helper variables:
data have;
do i = 1 to 3;
output;
end;
run;
data lookup;
j = 1;
run;
data want;
set have;
if _n_ = 1 then set lookup;
run;
Without the if _n_ = 1, the data step stops after one iteration when it tries to read a second row from the lookup dataset and finds that there are no rows remaining.
N.B. this requires that the have dataset doesn't already contain a variable with the same name as the variable(s) attached from the lookup dataset.
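Applied to the data in the question, the same pattern can be sketched as follows. The data set names have, lookup and want follow this answer; the values come from the question. It works because variables read with SET are retained across iterations, so value2 keeps its value after the first row.

```sas
data have;
  input ID value1;
  datalines;
1 12
2 345
3 342
;
run;

data lookup;
  value2 = 3823;
run;

data want;
  set have;
  /* Read the single lookup row once; value2 is retained for later rows */
  if _n_ = 1 then set lookup;
run;
```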
By far the easiest way to do this is to use PROC SQL and define the join condition 1=1, which is true for every pair of rows:
data first;
input ID value1 @@;
cards;
1 12 2 345 3 342
run;
data second;
input value2 ;
cards;
3823
run;
proc sql;
create table wanted as
select * from first
left join second
on 1 =1
;quit;
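Because 1=1 matches every pair of rows, the join above is effectively a Cartesian product. A minimal sketch, assuming the same two data sets, writes that as a plain comma join; the output name wanted_cross is made up here. Two differences to keep in mind: PROC SQL normally writes a note about the Cartesian product to the log, and if second were empty this version would return no rows, whereas the left join keeps the rows of first.

```sas
data first;
  input ID value1 @@;
  datalines;
1 12 2 345 3 342
;
run;

data second;
  value2 = 3823;
run;

proc sql;
  /* Every row of first paired with every row of second */
  create table wanted_cross as
  select *
  from first, second;
quit;
```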
Edit: As far as I know, there isn't a direct way to merge datasets row by row without a key, but you can use the following trick:
Add a helper variable help to both datasets:
data second_trick;
set second;
help=1;
run;
data first_trick;
set first;
help=1;
run;
Then we just perform the merge by the static variable:
data wanted_trick;
merge first_trick(in=a) second_trick;
by help;
if a; /*Left join, just to be sure.*/
run;
Note that this only works if you want to add a single static value. Don't try to use it if your second dataset has more than one row.
For more on Merges and joins see: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/249-30.pdf
I have the following program, but I don't understand what the @ symbol at the end of the INPUT lines does:
data colors;
input @1 Var1 $ @8 Var2 $ @;
input @1 Var3 $ @8 Var4 $ @;
datalines;
RED ORANGE YELLOW GREEN
BLUE INDIGO PURPLE VIOLET
CYAN WHOTE FICSIA BLACK
GRAY BROWN PINK MAGENTA
run;
proc print data=colors;
run;
The output produced without @ at the end of the INPUT lines is different from the output with @.
Can you please clarify what the @ at the end of the 2nd and 3rd lines (the two INPUT statements) does?
@ at the end of an input statement means: do not advance the line pointer after the semicolon. @@ means: do not advance the line pointer at the run statement (that is, at the end of the data step iteration) either.
Normally an input statement implicitly advances the line pointer by one line after the semicolon. So:
data want;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Will return
1 2
5 6
If you want to read 3 4 into another row, then you might do something like:
data want;
input a b @;
output;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Which gives
1 2
3 4
5 6
7 8
Similarly you could simply write
data want;
input a b @@;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
to get the same result: @@ holds the line pointer even across the run statement, that is, into the next iteration of the data step. (It still advances once it hits the end of the line.)
In summary: I think you probably don't want the trailing @ in this case. The INPUT statements do not seem to fit the data you are reading. With the trailing @, you are reading the same data into var1 and var3, and the same data into var2 and var4, because the same line is read twice. Either way, you are not reading in what the data appears to be. You would be better off with:
input Var1 $ Var2 $ @;
input Var3 $ Var4 $;
Or, more simply:
input Var1 $ Var2 $ Var3 $ Var4 $;
Official details from the SAS support site, annotated:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000146292.htm
Using Line-Hold Specifiers
Line-hold specifiers keep the pointer on the current input record when a data record is read by more than one INPUT statement (trailing @).
Use a single trailing @ to allow the next INPUT statement to read from the same record.
Normally, each INPUT statement in a DATA step reads a new data record into the input buffer. When you use a trailing @, the following occurs:
The pointer position does not change.
No new record is read into the input buffer.
The next INPUT statement for the same iteration of the DATA step continues to read the same record rather than a new one.
SAS releases a record held by a trailing @ when
a null INPUT statement executes:
input;
an INPUT statement without a trailing @ executes
the next iteration of the DATA step begins.
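A common use of the single trailing @ described above is to look at part of a record before deciding how to read the rest. The following is only an illustrative sketch; the data set shapes and the variables kind, radius, width and height are invented for the example.

```sas
data shapes;
  input kind $ @;                    /* read the type code and hold the record */
  if kind = 'C' then input radius;   /* continues on the same record */
  else if kind = 'R' then input width height;
  datalines;
C 2
R 3 4
;
run;

proc print data=shapes;
run;
```

The second INPUT in each branch has no trailing @, so it releases the record and the next iteration starts on a new line. Variables that a branch does not read stay missing for that row.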
For this example, I do not mind whether it's solved using SAS or any SQL dialect.
What I want to do is productionise model equations so that I can calculate a prediction on the fly.
I have a lookup table that stores all model equations.
For this example we can assume the formula is always intercept + coef1*var1 + coef2*var2 etc.
model coef variable
churn 0.8 intercept
churn 0.5 var1
churn -0.2 var2
churn 0.2 var3
I then have a denormalised table of variables for each customer i.e.
customer_id var1 var2 var3
a 3 2 4
b 5 1 2
c 2 4 5
The output I want is the prediction from the formula for each customer.
I.e. for customer a:
The output would be the result of 0.8 + 0.5*3 + -0.2*2 + 0.2*4 = 2.7
What is the best way to have this model information stored in a database and do the calculations on the fly? Do I need to normalise the customer table so that I can join directly on each variable?
Normalizing the customer data is certainly going to be the easier way if you're doing this in SQL within SAS. If you're doing it in SQL Server, you could probably use pivot to get things in the same direction.
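As a rough sketch of the normalized route in SAS, assuming the two tables from the question: transpose the customer table to one row per customer and variable, then join it to the coefficients and sum. The names have_long, val, want_sql and prediction are made up for illustration, and the lowcase() call is only there in case the transposed names differ in case from the lookup.

```sas
data model;
  input model $ coef variable :$10.;
  datalines;
churn 0.8 intercept
churn 0.5 var1
churn -0.2 var2
churn 0.2 var3
;
run;

data have;
  input customer_id $ var1 var2 var3;
  datalines;
a 3 2 4
b 5 1 2
c 2 4 5
;
run;

/* One row per customer and variable (have is already sorted by customer_id) */
proc transpose data=have out=have_long(rename=(_name_=variable col1=val));
  by customer_id;
  var var1 var2 var3;
run;

proc sql;
  /* m supplies the per-variable coefficients, i supplies the intercept row */
  create table want_sql as
  select h.customer_id,
         i.coef + sum(m.coef * h.val) as prediction
  from have_long as h, model as m, model as i
  where m.model = 'churn' and i.model = 'churn'
    and m.variable = lowcase(h.variable)
    and i.variable = 'intercept'
  group by h.customer_id, i.coef;
quit;
```

For customer a this works out to 0.8 + 1.5 - 0.4 + 0.8, matching the hand calculation in the question up to floating-point rounding. Supporting several models would mean dropping the 'churn' filters and adding the model column to the select and group by lists.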
In base SAS, I would do the opposite. Two options: PROC SCORE or a simple data step.
PROC SCORE is really intended for this purpose. Either way you need to PROC TRANSPOSE the normalized model data into one row. If you're doing the data step version, make sure to add a prefix using the prefix option (giving model_var1 etc.). Then do one of the following:
data model;
input model $ coef variable :$10.;
datalines;
churn 0.8 intercept
churn 0.5 var1
churn -0.2 var2
churn 0.2 var3
;;;;
run;
data have;
input customer_id $ var1 var2 var3;
datalines;
a 3 2 4
b 5 1 2
c 2 4 5
;;;;
run;
proc transpose data=model out=model_score;
id variable;
by model;
run;
data model_score_fin;
set model_score;
drop _name_;
rename model = _model_;
_type_ = 'PARMS';
run;
proc score data=have out=want_score score=model_score_fin type=parms predict nostd;
var var1 var2 var3;
run;
or
proc transpose data=model out=model_t prefix=model_;
id variable;
where model="churn";
run;
data want;
if _n_=1 then set model_t;
set have;
array models model_var:;
array vars var:;
result = model_intercept;
do _i = 1 to dim(vars);
result = sum(result,models[_i]*vars[_i]);
end;
run;
Finally, if you have SAS IML, you can easily do this by transposing (like above or with IML transpose operator) and then use matrix multiplication.
I have 90 variables in the data, I want to do the following in SAS.
Here is my SAS code:
data test;
length id class sex $ 30;
input id $ 1 class $ 4-6 sex $ 8 survial $ 10;
cards;
1 3rd F Y
2 2nd F Y
3 2nd F N
4 1st M N
5 3rd F N
6 2nd M Y
;
run;
data items2;
set test;
length tid 8;
length item $8;
tid = _n_;
item = class;
output;
item = sex;
output;
item = survial;
output;
keep tid item;
run;
What if I have 90 variables to handle like this? That would make a very long list of statements. I want to simplify it.
You could use an ARRAY or alternately a PROC TRANSPOSE.
The following is untested, because you haven't provided an example of your input dataset.
DATA ITEMS;
SET REPLACE;
ARRAY VARS {*} VAR1-VAR90; /* declared after SET so existing variable types are kept */
DO I = LBOUND(VARS) TO HBOUND(VARS);
ITEM = VARS{I};
OUTPUT;
END;
RUN;
OR
PROC TRANSPOSE DATA = TEST OUT = WANT;
BY ID;
VAR CLASS -- SURVIAL;
RUN;
In the future it would be best if you could supply your input and desired output.
I don't seem to be able to add another comment to the above answer, so I am adding one here.
You need to extend the VAR statement to include all variables that you want transposed.
CLASS -- SURVIAL means all variables between CLASS and SURVIAL inclusive, in dataset order.
Post your code and the error so that I can help you better.
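The same CLASS -- SURVIAL name range can also be used in an ARRAY statement, which avoids listing the variables one by one. Here is a minimal sketch against the sample data from the question, read with list input for simplicity. It assumes that all the variables of interest sit next to each other in the dataset and are character.

```sas
data test;
  length id class sex survial $ 30;
  input id $ class $ sex $ survial $;
  datalines;
1 3rd F Y
2 2nd F Y
3 2nd F N
4 1st M N
5 3rd F N
6 2nd M Y
;
run;

data items2;
  set test;
  array vars[*] class -- survial;   /* class, sex, survial in dataset order */
  length tid 8 item $8;
  tid = _n_;
  do i = 1 to dim(vars);
    item = vars[i];                 /* one output row per variable */
    output;
  end;
  keep tid item;
run;
```

With 90 variables only the range in the ARRAY statement changes; the loop stays the same.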
I am trying to create SAS variable names based on data contained within other variables. For example, I could start with
Obs Var1 Var2
1 abc X
2 def X
3 ghi Y
4 jkl X
and I would like to end up with
Obs Var1 Var2 X Y
1 abc X abc
2 def X def
3 ghi Y ghi
4 jkl X jkl
I do have one way of doing this but it requires somewhat ugly macros to first create the variables needed (using a length statement) and then creating a whole series of numbered macro variables (1 per observation) that are later called inside a data step loop. It works but is complicated and I don't think will scale well to the real data, which contain multiple variables for creation per row, and a few thousand rows.
I've also tried something with arrays - saving variables names in a macro var, using it to generate an array statement, and trying to keep track of which array index is needed for each new variable, but it is also complicated.
What would really help would be something analogous to
vvaluex(var2)=var1
except vvaluex can't be on the left-hand side of an equals. Any thoughts or ideas?
PROC TRANSPOSE is a handy way to do the example in the question.
data have;
input Obs Var1 $ Var2 $;
datalines;
1 abc X
2 def X
3 ghi Y
4 jkl X
;;;;
run;
proc transpose data=have out=want;
by obs;
id var2;
var var1;
copy var1 var2;
run;
Another option is probably similar to what you've tried before, using arrays and VNAME:
proc sql;
select distinct var2 into :var2list separated by ' ' from have; /* distinct: list each new variable name once */
quit;
data want;
set have;
array newvars $ &var2list;
do _t = 1 to dim(newvars);
if vname(newvars[_t]) = Var2 then do;
newvars[_t] = var1;
leave;
end;
end;
run;
PROC TRANSPOSE should be faster and is probably more flexible, but this might work better for some purposes.