Creating ID variable using digits from two different variables on SAS - sql

I'm trying to create a new variable in SAS. There is a column called "statefip" and a column called "countyfip". I need a four-digit ID number that combines these two columns.
For example:
[image not available]
How do I tell SAS to follow this format when creating this new variable?

This is easy to do with the PUT and INPUT functions. The z3. format pads the county code with leading zeros, || concatenates the two PUT results, and INPUT converts the combined string back to a numeric id.
data have;
input statefip countyfip;
datalines;
1 1
8 109
12 57
13 313
;
run;
data want;
set have;
id = input(put(statefip,2.) || put(countyfip,z3.),8.);
run;
proc print data=want;
run;
Output:
Obs statefip countyfip id
1 1 1 1001
2 8 109 8109
3 12 57 12057
4 13 313 13313
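Two variations on the same idea, offered as a sketch rather than as the answer's own method: because the county code always occupies three digits, the same numeric id can be computed arithmetically, and a character id built with the z2. format keeps the state's leading zero. The names want_alt, id_num and id_char are made up for this illustration.

```sas
data want_alt;
    set have;
    /* Numeric ID: shift the state code three digits left, then add the county code */
    id_num = statefip * 1000 + countyfip;
    /* Character ID: Z2. and Z3. keep leading zeros, so state 1, county 1 becomes 01001 */
    length id_char $5;
    id_char = put(statefip, z2.) || put(countyfip, z3.);
run;
```

The character version is worth considering if the id will later be matched against FIPS codes stored as text.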

Related

SAS computing multiple new variables from one row

I have a dataset as listed below:
ID-----V1-----V2------V3
01------5------3-------7
02------3------8-------5
03------6------9-------1
and I want to calculate 3 new variables (ERR_CODE, ERR_DETAIL, ERR_ID) according to behavior of certain columns.
If V1 is greater than 4 then ERR_CODE = A and ERR_DETAIL = "Out of range" and ERR_ID = [ID]_A
If V2 is greater than 4 then ERR_CODE = B and ERR_DETAIL = "Check Log" and ERR_ID = [ID]_B
If V3 is greater than 4 then ERR_CODE = C and ERR_DETAIL = "Fault" and ERR_ID = [ID]_C
The desired output table would look like this:
ID-----ERR_CODE----ERR_DETAIL---------ERR_ID
01--------A--------Out of range---------01_A
01--------C--------Fault----------------01_C
02--------B--------Check Log------------02_B
02--------C--------Fault----------------02_C
03--------A--------Out of range---------03_A
03--------B--------Check Log------------03_B
I am using SAS 9.3 with EG 5.1. I have tried do-loops, arrays, IF statements and CASE-WHENs, but once one condition is met for a row, processing moves on to the next row. I want the other conditions that are met to be evaluated for each row as well.
I have managed to do it by creating separate tables for each condition and then merging them. But that doesn't seem an effective way if there are many conditions to work with.
My question is: how can I evaluate all the met conditions for each ID at once, without calculating them separately? The output table will have more rows than the input, as expected, but I have not been able to achieve that with CASE-WHEN, IF, etc.
Thanks in advance, and sorry if I am not clear.
Just use IF/THEN/DO blocks. Add an OUTPUT statement in each block to write a new observation for each error.
data have ;
input ID $ V1-V3;
cards;
01 5 3 7
02 3 8 5
03 6 9 1
;
data want;
set have;
length ERR_CODE $1 ERR_DETAIL $20 ERR_ID $10 ;
if v1>4 then do;
err_code='A'; err_detail="Out of range"; err_id=catx('_',id,err_code);
output;
end;
if v2>4 then do;
err_code='B'; err_detail="Check Log"; err_id=catx('_',id,err_code);
output;
end;
if v3>4 then do;
err_code='C'; err_detail="Fault"; err_id=catx('_',id,err_code);
output;
end;
drop v1-v3 ;
run;
Results:
Obs ID ERR_CODE ERR_DETAIL ERR_ID
1 01 A Out of range 01_A
2 01 C Fault 01_C
3 02 B Check Log 02_B
4 02 C Fault 02_C
5 03 A Out of range 03_A
6 03 B Check Log 03_B
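If the number of conditions grows, the repeated blocks can be folded into a loop. The following is a minimal sketch, assuming every variable uses the same threshold of 4 and taking the code-to-detail mapping from the question; the dataset name want_arr and the array names are illustrative only.

```sas
data want_arr;
    set have;
    length ERR_CODE $1 ERR_DETAIL $20 ERR_ID $10;
    /* Variables to check, with a matching code and message for each */
    array v[3] v1-v3;
    array codes[3] $1 _temporary_ ('A' 'B' 'C');
    array details[3] $20 _temporary_ ('Out of range' 'Check Log' 'Fault');
    do i = 1 to dim(v);
        if v[i] > 4 then do;
            err_code = codes[i];
            err_detail = details[i];
            err_id = catx('_', id, err_code);
            output; /* one output row per condition that is met */
        end;
    end;
    drop i v1-v3;
run;
```

Adding a condition then means extending the three arrays rather than copying another IF block.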

How do I assign a value to a new variable, using another dataset which contains one value, in SAS

I have a dataframe
ID value1
1 12
2 345
3 342
I have a second dataframe
value2
3823
How do I get the following result?
ID value1 value2
1 12 3823
2 345 3823
3 342 3823
Any joins I have tried have given me
ID value1 value2
1 12 .
2 345 .
3 342 .
. . 3823
No need for joins or helper variables:
data have;
do i = 1 to 3;
output;
end;
run;
data lookup;
j = 1;
run;
data want;
set have;
if _n_ = 1 then set lookup;
run;
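Without the if _n_ = 1 condition, the data step stops during its second iteration, when it tries to read a second row from the lookup dataset and finds that there are no rows remaining, so only one row is output. Because variables read with a SET statement are retained automatically, the value read on the first iteration carries over to every later row.
N.B. this requires that the have dataset doesn't already contain a variable with the same name as any of the variables attached from the lookup dataset.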
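Applied to the data in the question, the same pattern might look like the sketch below. The dataset names base and single are placeholders chosen for this example.

```sas
data base;
    input ID value1;
    datalines;
1 12
2 345
3 342
;
run;

data single;
    value2 = 3823;
run;

data wanted;
    set base;
    /* Read the one-row dataset once; value2 is retained for all later rows */
    if _n_ = 1 then set single;
run;
```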
By far the easiest way to do this is to use PROC SQL and define the join condition 1=1, which is true for every pair of rows:
data first;
input ID value1 @@;
cards;
1 12 2 345 3 342
;
run;
data second;
input value2 ;
cards;
3823
;
run;
proc sql;
create table wanted as
select * from first
left join second
on 1=1
;
quit;
Edit: As far as I know, there isn't a direct way to merge the single row onto every row with MERGE, but you can use the following trick:
Add a helper variable, help, to both datasets:
data second_trick;
set second;
help=1;
run;
data first_trick;
set first;
help=1;
run;
Then we just perform the merge by the static variable:
data wanted_trick;
merge first_trick(in=a) second_trick;
by help;
if a; /*Left join, just to be sure.*/
run;
Note that this only works if you want to add a single static value. Don't use it if your second dataset has more than one row.
For more on Merges and joins see: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/249-30.pdf

SAS - Conditional input statement

I would like to use conditional if...then...else to read in the following data set, to read in using one input statement if source =1 and to read in using another input statement if source = 2. Not sure where my error is. This is what I have so far and the associated error. Not sure if the pointers are needed.
DATA results2;
infile datalines missover;
input @10 source 1. @;
if source = 1 then input @1 id @4 name $ @12 score;
else if source = 2 then input @1 id @4 score @12 name $;
DATALINES;
11 john  1 77
11 88    2 james
22 bobby 1 55
22 89    2 opey
;;;;
RUN;
It reads the id correctly, but the source is not matched to the right id, and there is an issue with the name and score.
Thanks for helping!

SAS do loop + lag function?

This is my first post, so please let me know if I'm not clear enough. Here's what I'm trying to do - this is my dataset. My approach for this is a do loop with a lag but the result is rubbish.
data a;
input @1 obs @4 mindate mmddyy10. @15 maxdate mmddyy10.;
format mindate maxdate date9.;
datalines;
1  01/02/2013 01/05/2013
2  01/02/2013 01/05/2013
3  01/02/2013 01/05/2013
4  01/03/2013 01/06/2013
5  02/02/2013 02/08/2013
6  02/02/2013 02/08/2013
7  02/02/2013 02/08/2013
8  03/10/2013 03/11/2013
9  04/02/2013 04/22/2013
10 04/10/2013 04/22/2013
11 05/04/2013 05/07/2013
12 06/10/2013 06/20/2013
;
run;
Now, I'm trying to produce a new column - "Replacement" based on the following logic:
If a record's mindate occurs before its lag's maxdate, it cannot be a replacement for it. If it cannot be a replacement, skip forward (so 2, 3 and 4 cannot replace 1, but 5 can).
Otherwise, if the mindate is less than 30 days after that maxdate, Replacement = Y. If not, Replacement = N. Once a record replaces another (in this case, 5 does replace 1, because 02/02/2013 is less than 30 days after 01/05/2013), it cannot also serve as a replacement for another record. But if it's an N for one record above, it can still be a Y for some other record. So 6 is now evaluated against 2, 7 against 3, etc. Since those two combos are both "Y", 8 is now evaluated against 4, but because its mindate is more than 30 days after 4's maxdate, it's an N. But it's then evaluated against
And so on...
I should add that in a 100-record dataset, this would imply that the 100th record could technically replace the 1st, so I've been trying lags within loops. Any tips/help is greatly appreciated! Expected output:
obs mindate maxdate Replacement
1 02JAN2013 05JAN2013
2 02JAN2013 05JAN2013
3 02JAN2013 05JAN2013
4 03JAN2013 06JAN2013
5 02FEB2013 08FEB2013 Y
6 02FEB2013 08FEB2013 Y
7 02FEB2013 08FEB2013 Y
8 10MAR2013 11MAR2013 Y
9 02APR2013 22APR2013 Y
10 10APR2013 22APR2013 N
11 04MAY2013 07MAY2013 Y
12 10JUN2013 20JUN2013 Y
I think this is correct if the asker was mistaken about replacement = Y for obs = 12. Note that the code reads the input from a dataset named have.
/*Get number of obs so we can build a temporary array to hold the dataset*/
data _null_;
set have nobs= nobs;
call symput("nobs",nobs);
stop;
run;
data want;
/*Load the dataset into a temporary array*/
array dates[2,&NOBS] _temporary_;
if _n_ = 1 then do _n_ = 1 by 1 until(eof);
set have end = eof;
dates[1,_n_] = maxdate;
dates[2,_n_] = 0;
end;
set have;
length replacement $1;
replacement = 'N';
do i = 1 to _n_ - 1 until(replacement = 'Y');
if dates[2,i] = 0 and 0 <= mindate - dates[1,i] <= 30 then do;
replacement = 'Y';
dates[2,i] = _n_;
replaces = i;
end;
end;
drop i;
run;
You could use a hash object + hash iterator instead of a temporary array if you preferred. I've also included an extra var, replaces, to show which previous row each row replaces.
Here is a solution using SQL and hash tables. It is not optimal but it was the first method that sprang to mind.
/* Join the input with itself */
proc sql;
create table b as
select
a1.obs,
a2.obs as obs2
from a as a1
inner join a as a2
/* Set the replacement criteria */
on a1.maxdate < a2.mindate <= a1.maxdate + 30
order by a2.obs, a1.obs;
quit;
/* Create a mapping for replacements */
data c;
set b;
/* Create two empty hash tables so we can look up the used observations */
if _N_ = 1 then do;
declare hash h();
h.definekey("obs");
h.definedone();
declare hash h2();
h2.definekey("obs2");
h2.definedone();
end;
/* Check if we've already used this observation as a replacement */
if h2.find() then do;
/* Check if we've already replaced this observation */
if h.find() then do;
/* Add the observations to the hash table and output */
h2.add();
h.add();
output;
end;
end;
run;
/* Combine the replacement map with the original data */
proc sql;
select
a.*,
ifc(c.obs, "Y", "N") as Replace,
c.obs as Replaces
from a
left join c
on a.obs = c.obs2
order by a.obs;
quit;
There are several ways in which this can be simplified:
The dates can be brought through the first proc sql
The if statements can be combined
The final join could be replaced by a little extra logic in the data step

SAS create a frequency of variable frequencies

I would like to create a table that lists the frequency of each variable's frequencies. For example, take a data set with 100 rows and 4 variables: ID, A, B, and C.
What I'm looking for would be like this:
Freqs| ID A B C
----------------------------
1 | 100 20 15 10
2 | 0 40 35 0
3 | 0 0 5 30
Since there are 100 unique IDs, there will be a frequency of 100 frequencies of 1 from the original data.
edit for clarification:
If you did a proc freq on the original data, you would get a frequency of 1 for every ID. Then if you did a proc freq on the count, you would have a frequency of 100 for counts of 1. I'm looking for that for every variable in a data set.
This should do what you want. You will probably want to post-process the preds table, since each value of its table variable starts with "Table", but this is a pretty simple way to do it.
ods output onewayfreqs=preds;
proc freq data=sashelp.class;
tables _all_;
run;
ods output close;
proc tabulate data=preds;
class table frequency;
tables frequency,table;
run;
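For a single variable, the two-step idea described in the question (a PROC FREQ on the counts produced by a first PROC FREQ) can be sketched as follows. It uses sashelp.class and its age variable purely as an example, and age_counts is a made-up dataset name.

```sas
/* Step 1: count how often each value of age occurs; OUT= stores the counts in COUNT */
proc freq data=sashelp.class noprint;
    tables age / out=age_counts;
run;

/* Step 2: tabulate the counts themselves, giving the frequency of each frequency */
proc freq data=age_counts;
    tables count;
run;
```

This handles one variable at a time, which is why the ODS OUTPUT approach above is more convenient when every variable in the dataset is needed.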