SAS - How to separate a string with variable substring into multiple columns - variables

I have a dataset containing a variable X, made up of multiple numbers separated by commas. The number of items differs between rows. I have already created a count of the items (Num_of_X). Now I would like to split the numbers into separate columns.
Here is an example:
X            Num_of_X  Var1  Var2  Var3  Var4  ...  Varn
3,10,165     3         3     10    165
1            1         1
15,100       2         15    100
10,52,63,90  4         10    52    63    90
I tried this way:
%let max_num_X=max(num_of_x);
data have;
set have;
length var1-var&max_num_X $10.;
array Var(&max_num_X) $;
do i=1 to &max_num_X;
Var[i]=scan(X,i,',');
output;
end;
run;
Could you help me?
Thank you

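Do something like this: write one row per item into a long dataset, then transpose it back to one row per original observation.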
data have;
input X :$20.;
datalines;
3,10,165
1
15,100
10,52,63,90
;
data long;
set have;
n = _N_;
do i = 1 to countw(X, ',');
xx = scan(X, i, ',');
output;
end;
run;
proc transpose data = long out = want(drop=_:) prefix=Var;
by n;
id i;
var xx;
run;

You could use a macro to find the maximum number of items and create the variables:
%macro create_vars();
/* count with the same ',' delimiter that SCAN uses, so both see identical items */
proc sql noprint; select max(countw(X, ',')) into :max_num_X from have; quit;
data have; set have;
%do i = 1 %to &max_num_X.; Var&i. = scan(X,&i.,','); %end;
run;
%mend;
%create_vars();
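The array attempt from the question can also be made to work without a macro loop. This is only a sketch: it assumes every item fits in 10 characters (the length the question used), and the array name v and the output dataset want are illustrative names, not part of the original code.

```sas
/* Find the largest number of comma-separated items in X */
proc sql noprint;
  select max(countw(X, ',')) into :max_num_X from have;
quit;
%let max_num_X = &max_num_X; /* re-assigning strips the leading blanks left by INTO */

data want;
  set have;
  /* v is a hypothetical array name mapped onto Var1-Var<max> */
  array v[&max_num_X] $10 Var1-Var&max_num_X;
  /* fill only as many columns as this row has items; the rest stay blank */
  do i = 1 to countw(X, ',');
    v[i] = scan(X, i, ',');
  end;
  drop i;
run;
```

Unlike the attempt in the question, there is no OUTPUT inside the loop, so each input row produces exactly one output row.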


fill the nulls of a column with the mean sum of the division of two columns multiplied by one column minus the previous column SAS

I need to fill the nulls of a column with the mean of the division of two columns, multiplied by one column, minus the previous column. An example would be:
A B_01 B_02 ... B_60
5 . .
5 2 3
7 3 1,2
9 3 0,3
4 . .
I would like the missing value in column B_01 to be (2/5 + 3/7 + 3/9) / 3 multiplied by the corresponding value of A.
For column B_02 it should be (3/5 + 1,2/7 + 0,3/9) / 3 multiplied by the corresponding value of A, minus the new value of B_01.
I have thought about doing it as follows, but I have 60 columns to process and the only way that comes to mind is to repeat this 60 times.
proc sql;
create table new as
select *
, sum(B_01/A)/sum(case when B_01 is missing then . else 1 end)*A as new_B_01
, sum(B_02/A)/sum(case when B_02 is missing then . else 1 end)*A-B_01 as new_B_02
from table_one
;
quit;
Thanks
This may be what you want.
data test;
input A B_01-B_02;
cards;
5 . .
5 2 3
7 3 1.2
9 3 0.3
4 . .
;;;;
data test2;
set test;
array B_ B_01-B_02;
array M_[2];
do i = 1 to dim(m_);
if not missing(b_[i]) then m_[i]= divide(b_[i],a);
end;
drop i;
run;
proc print;
run;
proc stdize reponly missing=mean data=test2 out=mean;
var m_:;
run;
proc print;
run;
data mean2;
set mean;
array B_ B_01-B_02;
array M_[2];
do i = 1 to dim(m_);
b_[i] = coalesce(b_[i],m_[i]);
end;
drop i m_:;
run;
proc print;
run;
Try this.
data test;
input A B_01-B_02;
cards;
5 . .
5 2 3
7 3 1.2
9 3 0.3
4 . .
;;;;
run;
data mean;
set test end=eof;
array b(*) b_:;
array sumb(2) _temporary_;
array cntb(2) _temporary_;
array mean_b_(2);
* cumulative sums and counts;
do i = 1 to dim(b);
sumb(i) = sum(sumb(i), b(i)/a);
if not missing(b(i)) then cntb(i) + 1;
end;
* mean;
if eof then do;
do i = 1 to dim(b);
mean_b_(i) = sumb(i)/cntb(i);
end;
output;
end;
keep mean_:;
run;
data test2;
if _n_ = 1 then do;
set mean;
array mean_b_(*) mean_b:;
end;
set test;
array b_(*) b_:;
array new_b_(2);
do i = 1 to dim(b_);
if i = 1 then new_b_(i) = coalesce(b_(i), mean_b_(i) * a);
else new_b_(i) = coalesce(b_(i), mean_b_(i) * a - new_b_(i-1));
end;
run;

How do I assign a value to a new variable, using another dataset which contains one value in SAS

I have a dataframe
ID value1
1 12
2 345
3 342
I have a second dataframe
value2
3823
How do I get the following result?
ID value1 value2
1 12 3823
2 345 3823
3 342 3823
Any joins I have tried have given me:
ID value1 value2
1 12 .
2 345 .
3 342 .
. . 3823
No need for joins or helper variables:
data have;
do i = 1 to 3;
output;
end;
run;
data lookup;
j = 1;
run;
data want;
set have;
if _n_ = 1 then set lookup;
run;
Without the if _n_ = 1, the data step stops after one iteration when it tries to read a second row from the lookup dataset and finds that there are no rows remaining.
N.B. this requires that the have dataset doesn't already contain a variable with the same name as the variable(s) attached from the lookup dataset.
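Applied to the data from the question, the same pattern might look like the sketch below. The dataset names first, second and want are assumptions chosen for illustration; the variable names come from the question.

```sas
data first;
  input ID value1;
  datalines;
1 12
2 345
3 342
;

data second;
  input value2;
  datalines;
3823
;

data want;
  set first;
  /* read the single lookup row once; variables read by SET are retained,
     so value2 keeps its value on every later iteration */
  if _n_ = 1 then set second;
run;
```

This relies on second having exactly one row; with more rows only the first would ever be read.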
By far the easiest way to do this is to use PROC SQL with the join condition 1=1, which is true for every pair of rows:
data first;
input ID value1 @@;
cards;
1 12 2 345 3 342
run;
data second;
input value2 ;
cards;
3823
run;
proc sql;
create table wanted as
select * from first
left join second
on 1 =1
;quit;
Edit: As far as I know, there isn't a direct way to merge datasets by each row, but you can use the following trick:
Add a helper variable, help:
data second_trick;
set second;
help=1;
run;
data first_trick;
set first;
help=1;
run;
Then we just perform the merge by the static variable:
data wanted_trick;
merge first_trick(in=a) second_trick;
by help;
if a; /*Left join, just to be sure.*/
run;
Note that this only works if you want to add a single static value. Don't try to use it if your second dataset has more than one row.
For more on Merges and joins see: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/249-30.pdf

Dropping variable based on sum of values in it using SAS

I wish to drop the columns in a SAS dataset which has a sum less than a particular value. Consider the case below.
Column_A Pred_1 Pred_2 Pred_3 Pred_4 Pred_5
A 1 1 0 1 0
A 0 1 0 1 0
A 0 1 0 1 0
A 0 1 0 1 1
A 0 1 0 0 1
Let us assume that our threshold is 4, so I wish to drop predictors having sum of active observations less than 4, so the output would look like
Column_A Pred_2 Pred_4
A 1 1
A 1 1
A 1 1
A 1 1
A 1 0
Currently I am using a very inefficient method of using multiple transposes to drop the predictors. There are multiple datasets with records > 30,000 so transpose approach is taking time. Would appreciate if anyone has a more efficient solution!
Thanks!
Seems like you could do:
1. Run PROC MEANS or a similar proc to get the sums.
2. Create a macro variable that contains all variables in the dataset with sum < threshold.
3. Drop those variables.
Then no TRANSPOSE or whatever, just regular plain old summarization and drops. Note you should use ODS OUTPUT not the OUT= in PROC MEANS, or else you will have to PROC TRANSPOSE the normal PROC MEANS OUT= dataset.
An example using a trivial dataset:
data have;
array x[20];
do _n_ = 1 to 20;
do _i = 1 to dim(x);
x[_i] = rand('Uniform') < 0.2;
end;
output;
end;
run;
ods output summary=have_sums; *how we get our output;
ods html select none; *stop it from going to results window;
proc means data=have stackodsoutput sum; *stackodsoutput is 9.3+ I believe;
var x1-x20;
run;
ods html select all; *reenable normal html output;
%let threshhold=4; *your threshhold value;
proc sql;
select variable
into :droplist_threshhold separated by ' '
from have_sums
where sum lt &threshhold; *keep only variables whose sum is below the threshhold;
quit;
data want;
set have;
drop &droplist_threshhold.; *and now, drop them!;
run;
Just use PROC SUMMARY to get the sums. You can then use a data step to generate the list of variable names to drop.
%let threshhold=4;
%let varlist= pred_1 - pred_5;
proc summary data=have ;
var &varlist ;
output out=sum sum= ;
run;
data _null_;
set sum ;
array x &varlist ;
length droplist $500 ;
do i=1 to dim(x);
if x(i) < &threshhold then droplist=catx(' ',droplist,vname(x(i)));
end;
call symputx('droplist',droplist);
run;
You can then use the macro variable to generate a DROP statement or a DROP= dataset option.
drop &droplist;

How to assign the count of columns to another variable in SAS?

%Let abc = count( no of variables in data set )
The following code assigns the number of columns in the dataset 'have' to the macro variable abc.
data _null_;
if 0 then
do;
set have (obs=0);
end;
array chars _character_;
array nums _numeric_;
ncharvar = dim(chars);
nnumvar = dim(nums);
nvar = ncharvar + nnumvar;
call symputx('abc',nvar); /* SYMPUTX converts the number without a conversion note or leading blanks */
run;
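A shorter alternative, offered as a sketch: the ATTRN function returns the number of variables in a dataset through its NVARS attribute, so no data step is needed. The macro variable names dsid and rc are illustrative helpers only.

```sas
/* open the dataset, read its variable count, then close it again */
%let dsid = %sysfunc(open(have));
%let abc  = %sysfunc(attrn(&dsid, nvars));
%let rc   = %sysfunc(close(&dsid));
%put NOTE: have contains &abc variables.;
```

If the dataset cannot be opened, OPEN returns 0, so it may be worth checking &dsid before calling ATTRN.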

Simplifying the variable input in SAS

I have 90 variables in the data, I want to do the following in SAS.
Here is my SAS code:
data test;
length id class sex $ 30;
input id $ 1 class $ 4-6 sex $ 8 survial $ 10;
cards;
1  3rd F Y
2  2nd F Y
3  2nd F N
4  1st M N
5  3rd F N
6  2nd M Y
;
run;
data items2;
set test;
length tid 8;
length item $8;
tid = _n_;
item = class;
output;
item = sex;
output;
item = survial;
output;
keep tid item;
run;
What if I have 90 variables to input the data like this? There should be a very long list. I want to simplify it.
You could use an ARRAY or alternately a PROC TRANSPOSE.
The following is untested, because you haven't provided an example of your input dataset.
DATA ITEMS;
SET REPLACE;
/* declared after SET so the array picks up the variables' existing types */
ARRAY VARS {*} VAR1-VAR90;
DO I = LBOUND(VARS) TO HBOUND(VARS);
ITEM = VARS{I};
OUTPUT;
END;
RUN;
OR
PROC TRANSPOSE DATA = TEST OUT = WANT;
BY ID;
VAR CLASS -- SURVIAL;
RUN;
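As a sketch of the ARRAY approach applied to the test dataset from the question: a name-range list (CLASS -- SURVIAL) lets the array cover every variable between the two names without typing them out. The array name vars and the loop index i are illustrative; all variables in the range are assumed to be character, as they are in test.

```sas
data items2;
  set test;
  length tid 8 item $8;
  /* every variable from class through survial, in dataset order */
  array vars {*} class -- survial;
  tid = _n_;
  do i = 1 to dim(vars);
    item = vars{i};
    output;   /* one output row per variable in the range */
  end;
  keep tid item;
run;
```

With 90 variables only the two names in the range change; the loop stays the same.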
In the future it would be best if you could supply your input and desired output.
I don't seem to be able to add another comment to the above answer, as such I am adding one here.
You need to extend the VAR statement to include all variables that you want transposed.
CLASS -- SURVIAL means all variables from CLASS through SURVIAL inclusive, in the order they are stored in the dataset.
Post your code and the error so that I can help you better.