Numeric variable to character variable conversion - formatting

I'm trying to plot two sets of calculations and compare them over time. The "cohort" variable, derived from coh_asof_yyyymm in the original table, is stored in the data set as a number (201003 for March 2010). When I plot the series with proc sgplot, four quarters of data are crammed together. How do I change the format of this variable so that the x-axis is in intervals of quarters?
options nofmterr;
libname backtest "/retail/mortgage/consumer/new";
proc sql;
create table frst_cur as
select
coh_asof_yyyymm as cohort,
sum(annual_default_occurrence* wt)/sum(wt) as dr_ct,
sum(ScorePIT_PD_2013 * wt)/sum(wt) as pd_ct_pit,
sum(ScoreTTC_PD_2013 * wt)/sum(wt) as pd_ct_ttc
from backtest.sample_frst_cur_201312bkts
group by 1;
quit;
proc sgplot data = frst_cur;
series x = cohort y = pd_ct_pit;
series x = cohort y = pd_ct_ttc;
format cohort yyyyqc.;
xaxis label = 'Cohort Date';
yaxis label = 'Defaults';
title 'First Mortgage Current';
run;

If I understand correctly, your date is a plain number and not a SAS date. That's not unusual: people often store dates as integers in their RDBMS tables, and when SAS imports data from such a table it treats the column as an integer rather than a date. See the code below for a possible solution.
data testing_date_integer;
infile datalines missover;
input int_date 8.;
/* Create another variable, a SAS date based on int_date.
Convert the integer date to character, append the day (01) to it,
and read the result with the YYMMDD8. informat so that SAS
stores it as a date.
*/
sas_date=input(cats(put(int_date,8.),'01'),yymmdd8.);
format sas_date YYQ8.;
datalines4;
200008
200009
200010
200011
200012
200101
200102
200103
200104
200105
200106
;;;;
run;
proc print data=testing_date_integer;run;
If the code above works and solves your problem, I would recommend updating your PROC SQL code:
proc sql;
create table frst_cur as
select
input(cats(put(coh_asof_yyyymm ,8.),'01'),yymmdd8.) as cohort,
.
.
.
I would also recommend updating the FORMAT statement for cohort in PROC SGPLOT:
proc sgplot data = frst_cur;
.
.
format cohort yyq8.;
Hope this solves your problem.

Using the YYMMn6. informat
data HAVE;
input DATE YYMMn6.;
format date YYQ8.;
datalines;
200008
200009
200010
200011
200012
200101
200102
200103
200104
200105
200106
;
run;
Proc Print Data=HAVE noobs; Run;
data HAVE2;
input coh_asof_yyyymm 8.;
datalines;
200008
200009
200010
200011
200012
200101
200102
200103
200104
200105
200106
;
run;
proc sql;
create table frst_cur as
select
input(put(coh_asof_yyyymm,6.),YYMMn6.) as cohort format=YYQ8.
From HAVE2;
Quit;
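Once cohort is a real SAS date, the axis itself can be asked for quarterly ticks. The following is only a sketch: the dataset cohort_demo and its values are made up for illustration, and it assumes a SAS release in which the XAXIS statement supports TYPE=TIME and INTERVAL=.

```sas
/* Hypothetical demo data: cohort_demo and its values are invented */
data cohort_demo;
  do coh_asof_yyyymm = 201003, 201006, 201009, 201012, 201103;
    /* numeric yyyymm -> text -> SAS date (first day of that month) */
    cohort = input(put(coh_asof_yyyymm, 6.), yymmn6.);
    pd_ct_pit = 0.010 + 0.002 * mod(coh_asof_yyyymm, 5);
    output;
  end;
  format cohort yyq6.;
run;

proc sgplot data=cohort_demo;
  series x=cohort y=pd_ct_pit;
  /* TYPE=TIME with INTERVAL=QUARTER requests ticks at quarter boundaries */
  xaxis type=time interval=quarter label='Cohort Date';
  yaxis label='Defaults';
run;
```

With the YYQ6. format attached, the tick labels should read like 2010Q1, 2010Q2 and so on.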


SAS - How to separate a string with variable substring into multiple columns

I have a dataset containing a variable X, made up of multiple numbers separated by commas. The number of items differs between rows. I created a count of the items (Num_of_X). Now I would like to see the numbers in separate columns.
Here the example:
X            Num_of_X  Var1  Var2  Var3  Var4  ...  Varn
3,10,165     3         3     10    165
1            1         1
15,100       2         15    100
10,52,63,90  4         10    52    63    90
I tried this way:
%let max_num_X=max(num_of_x);
data have;
set have;
length var1-var&max_num_X $10.;
array Var(&max_num_X) $;
do i=1 to &max_num_X;
Var[i]=scan(X,i,',');
output;
end;
run;
Could you help me?
Thank you
Do something like this
data have;
input X :$20.;
datalines;
3,10,165
1
15,100
10,52,63,90
;
data long;
set have;
n = _N_;
do i = 1 to countw(X, ',');
xx = scan(X, i, ',');
output;
end;
run;
proc transpose data = long out = want(drop=_:) prefix=Var;
by n;
id i;
var xx;
run;
You could use a macro to find the maximum number and create the variables:
%macro create_vars();
proc sql noprint; select max(countw(X, ',')) into :max_num_X from have; quit;
data have; set have;
%do i = 1 %to &max_num_X.; Var&i. = scan(X,&i.,','); %end;
run;
%mend;
%create_vars();
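The array approach from the question can also be made to work without a macro or a transpose. This is a minimal sketch, assuming SAS 9.3 or later for the TRIMMED keyword; the output dataset name want and the $10 length are illustrative choices. Two things differ from the original attempt: the maximum count is computed by PROC SQL rather than stored as the text max(num_of_x), and there is no OUTPUT inside the loop, so each input row yields exactly one output row.

```sas
data have;
  input X :$20.;
  datalines;
3,10,165
1
15,100
10,52,63,90
;

/* Store the largest item count in a macro variable */
proc sql noprint;
  select max(countw(X, ',')) into :max_num_X trimmed from have;
quit;

data want;
  set have;
  array Var[&max_num_X] $10;      /* creates Var1-Var4 for this data */
  do i = 1 to countw(X, ',');
    Var[i] = scan(X, i, ',');     /* i-th comma-separated item */
  end;
  drop i;
run;
```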

Dropping variable based on sum of values in it using SAS

I wish to drop the columns in a SAS dataset whose sum is less than a particular value. Consider the case below.
Column_A Pred_1 Pred_2 Pred_3 Pred_4 Pred_5
A 1 1 0 1 0
A 0 1 0 1 0
A 0 1 0 1 0
A 0 1 0 1 1
A 0 1 0 0 1
Let us assume that our threshold is 4, so I wish to drop predictors having sum of active observations less than 4, so the output would look like
Column_A Pred_2 Pred_4
A 1 1
A 1 1
A 1 1
A 1 1
A 1 0
Currently I am using a very inefficient method of using multiple transposes to drop the predictors. There are multiple datasets with records > 30,000 so transpose approach is taking time. Would appreciate if anyone has a more efficient solution!
Thanks!
Seems like you could do:
1. Run PROC MEANS or a similar proc to get the sums.
2. Create a macro variable that contains all variables in the dataset with sum < threshold.
3. Drop those variables.
Then there is no TRANSPOSE or anything else, just plain old summarization and drops. Note that you should use ODS OUTPUT rather than OUT= in PROC MEANS, or else you will have to PROC TRANSPOSE the normal PROC MEANS OUT= dataset.
An example using a trivial dataset:
data have;
array x[20];
do _n_ = 1 to 20;
do _i = 1 to dim(x);
x[_i] = rand('Uniform') < 0.2;
end;
output;
end;
run;
ods output summary=have_sums; *how we get our output;
ods html select none; *stop it from going to results window;
proc means data=have stackodsoutput sum; *stackodsoutput is 9.3+ I believe;
var x1-x20;
run;
ods html select all; *reenable normal html output;
%let threshhold=4; *your threshhold value;
proc sql;
select variable
into :droplist_threshhold separated by ' '
from have_sums
where sum lt &threshhold; *select variables whose sum is below the threshhold;
quit;
data want;
set have;
drop &droplist_threshhold.; *and now, drop them!;
run;
Just use PROC SUMMARY to get the sums. You can then use a data step to generate the list of variable names to drop.
%let threshhold=4;
%let varlist= pred_1 - pred_5;
proc summary data=have ;
var &varlist ;
output out=sum sum= ;
run;
data _null_;
set sum ;
array x &varlist ;
length droplist $500 ;
do i=1 to dim(x);
if x(i) < &threshhold then droplist=catx(' ',droplist,vname(x(i)));
end;
call symputx('droplist',droplist);
run;
You can then use the macro variable to generate a DROP statement or a DROP= dataset option.
drop &droplist;
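Put together with the sample data from the question, the PROC SUMMARY approach might look like the sketch below. It uses the DROP= dataset option form and assumes at least one variable falls below the threshold; if droplist were empty, DROP= would have nothing to drop and the step would fail.

```sas
data have;
  input Column_A $ Pred_1-Pred_5;
  datalines;
A 1 1 0 1 0
A 0 1 0 1 0
A 0 1 0 1 0
A 0 1 0 1 1
A 0 1 0 0 1
;

%let threshhold=4;
%let varlist=pred_1-pred_5;

/* One row holding the column sums */
proc summary data=have;
  var &varlist;
  output out=sum sum=;
run;

/* Collect the names of the columns whose sum is below the threshold */
data _null_;
  set sum;
  array x &varlist;
  length droplist $500;
  do i = 1 to dim(x);
    if x(i) < &threshhold then droplist = catx(' ', droplist, vname(x(i)));
  end;
  call symputx('droplist', droplist);
run;

/* DROP= on the input dataset, so the columns are never read in */
data want;
  set have(drop=&droplist);
run;
```

For this data the sums are 1, 5, 0, 4 and 2, so Pred_1, Pred_3 and Pred_5 are dropped and want keeps Column_A, Pred_2 and Pred_4.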

SAS proc sgplot with date axis formatted as m/d/yy (i.e. without leading zeros)

I'm trying to make a scatter plot with SAS proc sgplot and format the xaxis to be m/d/yy (for example 1/1/06). I created a custom date format like this:
PICTURE myDateFmt low-high = '%m/%d/%0y' (DATATYPE = date);
Then I formatted my date variable to be this format in a data step, and put this line in my proc sgplot step:
xaxis offsetmin = 0 offsetmax = 0 display=(nolabel) tickvalueformat=data;
However, when I do this, the date axis text all just disappears. Does anyone know of a way to format the date axis in a plot to be m/d/yy format?
Thank you in advance!
I think the TICKVALUEFORMAT option must have a problem with picture formats. When I tried this, my graph displayed "%m/%d/%0y" on the x-axis. But if I print the data, the formatted values are as desired so I think the picture format is created correctly.
I did a work-around where I created a value format for the date range of interest and then used that in the SGPLOT. To do this, I had to generate a dataset with one record for each day in the range of interest, and then converted that dataset to a format. Not ideal, but it works.
Hope this helps.
proc format;
PICTURE myDateFmt
low-high = '%m/%d/%0y' (DATATYPE = date)
;
run;
*** TEST DATA TO EXPERIMENT WITH - SPANS YEAR 1987 ***;
data stocks;
set sashelp.stocks;
where (mdy(1,1,1987) <= date <= mdy(12,31,1987));
format date myDateFmt. ;
run;
title 'USER CREATED PICTURE FORMAT DOES NOT WORK';
proc sgplot data=work.stocks;
scatter x=date y=close;
xaxis offsetmin = 0 offsetmax = 0 display=(nolabel) tickvalueformat=data;
run;
title 'SAS SUPPLIED FORMAT DOES WORK';
proc sgplot data=work.stocks;
scatter x=date y=close;
xaxis offsetmin = 0 offsetmax = 0 display=(nolabel) tickvalueformat=monyy5.;
run;
*** RECREATE FORMAT FOR SPECIFIC DATE RANGE THAT MATCHES DATA AND GRAPH AXIS DESIRED ***;
*** THIS WILL CREATE A FORMAT ENTRY FOR EVERY DAY IN THE RANGE ***;
data cntldate;
fmtname = 'myDateN';
type = 'n';
*** HARD CODE START/END DATES TO MATCH GRAPH AXIS DESIRED ***;
do start = mdy(1,1,1987) to mdy(1,1,1988);
*** FORMAT LABEL WILL BE DATE FORMAT WITHOUT LEADING ZEROS ***;
label = strip (put(start, myDateFmt.) );
output;
end;
run;
*** CONVERT CONTROL DATASET TO A FORMAT ***;
proc format library=work cntlin=cntldate;
run;
title 'USER CREATED VALUE FORMAT WORKS';
title2 'NOTE: HARDCODE OF START/END VALUE FOR XAXIS, OTHERWISE SAS MAY SELECT AXIS ENDPOINT OUTSIDE OF FORMAT RANGE';
title3 'NOTE2: AXIS MAY NOT REPORT EVERY MONTH DUE TO SPACE ISSUES';
proc sgplot data=work.stocks;
scatter x=date y=close;
xaxis offsetmin = 0 offsetmax = 0 display=(nolabel) tickvalueformat=myDateN.
values=('1jan87'd to '1jan88'd by month);
run;

Is there some way to tell SAS that for any obs ####1, ####2, or ####3 (where # = 1-9), I want them formatted #### Spring, #### Fall, and #### Winter?

I have 1000 observations for one variable that look like this:
19962
19943
19972
19951
19951
19912
The first four digits vary a bit, but the last digit is always 1, 2, or 3. Is there a way to only format the last digit, while not having to type out each iteration of the first four digits in a value statement?
That is, I want to avoid doing this:
proc format;
value varfmt
19911 = '1991 Spring'
19912 = '1991 Fall'
19913 = '1991 Winter'
19921 = '
19922 = '
[…]
19991 = '1999 Spring'
19992 = '1999 Fall'
19993 = '
;
run;
Instead, is there some way to tell SAS that for any ####1, ####2, or ####3, I want #### Spring, #### Fall, and #### Winter (which would be three lines under the value statement)?
Thanks in advance for any help.
Since the format only needs to apply to the last digit, you don't need all the digits in PROC FORMAT. Just extract the last digit, apply the format to it, and concatenate the result with the first four digits.
Creating the sample dataset
data test;
infile datalines;
input year;
datalines;
19962
19943
19972
19951
19951
19912
;
run;
Creating the formats
proc format;
value $varfmt
'1' = 'Spring'
'2' = 'Fall'
'3' = 'Winter'
;
run;
Here, the following steps are performed:
1. Extract the last digit
2. Apply the format created above to it
3. Extract the first four digits of the number
4. Concatenate the output of 2 and 3
data final;
set test;
year_new = cat(substr(compress(year),1,4)," ",put(substr(compress(year),5,1),$varfmt.));
run;
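Because year is numeric, the same split can also be done arithmetically, which avoids the implicit numeric-to-character conversion in compress(year). This is a sketch only: the format name lastdig and the dataset name final_numeric are hypothetical, and it assumes every value has exactly five digits.

```sas
data test;
  input year;
  datalines;
19962
19943
19951
;

proc format;
  value lastdig   /* hypothetical numeric format for the final digit */
    1 = 'Spring'
    2 = 'Fall'
    3 = 'Winter';
run;

data final_numeric;
  set test;
  length year_new $11;
  /* int(year/10) drops the last digit; mod(year, 10) keeps only the last digit */
  year_new = catx(' ', int(year/10), put(mod(year, 10), lastdig.));
run;
```

For 19962 this gives year_new = '1996 Fall'.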
You also have the option of creating a format from a dataset, if you do want a format for the whole value. You will have to create all possible rows, but it's not particularly hard.
data forfmt;
fmtname='SEASONF';
length start $5 label $8;
do startyr = 1990 to 2015;
start=cats(startyr,'1');
label=catx(' ',startyr,'Spring');
output;
start=cats(startyr,'2');
label=catx(' ',startyr,'Fall');
output;
start=cats(startyr,'3');
label=catx(' ',startyr,'Winter');
output;
end;
run;
proc format cntlin=forfmt;
quit;

SAS: Changing multiple variable names

This is my current issue:
I have 53 variable headers in a SAS data set that need to be changed, for example:
Current_Week_0 TS | Current_Week_1 TS | Current_Week_2 TS -- etc.
I need it to change such that Current_Week_# TS = Current_Week_# -- dropping the TS
Is there a way to automate this such as looping it like:
i = 0,53
Current_week_i TS = Current_Week_i ?
I just don't understand the proper syntax.
Edit: Thank you for editing my formats Sergiu, appreciate it! :)
Edit:
I used the following code, but I get the following error:
Missing numeric suffix on a numbered variable list (TS-Current_Week_53)
DATA True_Start_8;
SET True_Start_7;
ARRAY oldnames (53) Current_Week_1 TS-Current_Week_53 TS;
ARRAY newnames (53) Current_Week_1-Current_Week_53;
DO i = 1 TO 53;
newnames(i) = oldnames(i) ;
END;
RUN;
#Joe EDIT
Here's what the data looks like before and after the "denorm" / transpose
BEFORE
Product  ID       CurrentWeek     Market  TS
X        75av2kz  Current_Week_0  Z       1
Y        7sav2kz  Current_Week_0  Z       1
X        752v2kz  Current_Week_1  Z       1
Y        255v2kz  Current_Week_1  Z       1
AFTER
Product  ID       Market  Current_Week_0_TS  Current_Week_1_TS
X        75av2kz  Z       1                  0
Y        7sav2kz  Z       1                  1
X        752v2kz  Z       1                  1
Y        255v2kz  Z       1                  0
This isn't too hard. I assume these are variable labels.
proc sql noprint;
select catx('=', name, quote(strip(substr(label, 1, length(label) - 3))))
into :relabellist separated by ' '
from dictionary.columns
where libname='WORK' and memname='TRUE_START_7'
and label like '% TS'; *memname is stored in upper case in DICTIONARY.COLUMNS;
quit;
data want;
set True_Start_7;
label &relabellist.; *expands to a list of name="new label" pairs;
run;
The PROC SQL step finds the variables whose labels end in TS and builds one macro variable holding a name="new label" pair for each of them; the LABEL statement in the data step then applies those pairs. A LABEL statement only accepts literal strings, so the new label text has to be built before the data step is compiled. Change the WHERE clause if your labels follow a different pattern.
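If the TS suffix turns out to be part of the variable names rather than the labels (as in Current_Week_0_TS), a similar generated list can drive a RENAME instead. The sketch below is an assumption-laden illustration: the dataset demo and the macro variable renamelist are hypothetical, and it assumes the suffix is exactly _TS.

```sas
/* Hypothetical one-row dataset with the suffixed names */
data demo;
  Product = 'X';
  Current_Week_0_TS = 1;
  Current_Week_1_TS = 0;
run;

/* Build old=new pairs, dropping the trailing _TS from each name */
proc sql noprint;
  select catx('=', name, substr(name, 1, length(name) - 3))
    into :renamelist separated by ' '
  from dictionary.columns
  where libname = 'WORK' and memname = 'DEMO'
    and upcase(name) like '%^_TS' escape '^';
quit;

/* PROC DATASETS renames in place without rewriting the data */
proc datasets library=work nolist;
  modify demo;
  rename &renamelist;
quit;
```

The ESCAPE clause is there because an underscore is a single-character wildcard in LIKE patterns.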
Another option is to do this in the transpose. Your example data either doesn't match the example desired output, or there is something in logic not explained, but this does the simple transpose; if there is a logical reason that the current_week_0/1 are different in yours than in the below, explain why.
data have;
format currentWeek $20.;
input Product $ ID $ CurrentWeek $ Market $ TS;
datalines;
X 75av2kz Current_Week_0 Z 1
Y 7sav2kz Current_Week_0 Z 1
X 752v2kz Current_Week_1 Z 1
Y 255v2kz Current_Week_1 Z 1
;;;;
run;
proc sort data=have;
by market id product;
run;
proc transpose data=have out=want;
by market id product ;
id currentWeek;
var TS;
run;