This is my first post, so please let me know if I'm not clear enough. Here is my dataset and what I'm trying to do. My approach so far is a DO loop with a lag, but the result is rubbish.
data a;
input obs mindate :mmddyy10. maxdate :mmddyy10.; /* list input with informats; # is a line pointer, not a column pointer */
format mindate maxdate date9.;
datalines;
1 01/02/2013 01/05/2013
2 01/02/2013 01/05/2013
3 01/02/2013 01/05/2013
4 01/03/2013 01/06/2013
5 02/02/2013 02/08/2013
6 02/02/2013 02/08/2013
7 02/02/2013 02/08/2013
8 03/10/2013 03/11/2013
9 04/02/2013 04/22/2013
10 04/10/2013 04/22/2013
11 05/04/2013 05/07/2013
12 06/10/2013 06/20/2013
;
run;
Now, I'm trying to produce a new column, "Replacement", based on the following logic:
If a record's mindate occurs before an earlier record's maxdate, it cannot be a replacement for that record. If it cannot be a replacement, skip forward (so 2, 3 and 4 cannot replace 1, but 5 can).
Otherwise, if the mindate is within 30 days of that maxdate, Replacement = Y. If not, Replacement = N. Once a record replaces another (in this case 5 replaces 1, because 02/02/2013 is less than 30 days after 01/05/2013), it cannot also be the replacement for another record. But if it's an N for one record above, it can still be a Y for some other record. So 6 is now evaluated against 2, 7 against 3, etc. Since those two combinations are both "Y", 8 is now evaluated against 4, but because its mindate is more than 30 days after 4's maxdate, it's an N. But it's then evaluated against
And so on...
I should add that in a 100-record dataset, this would imply that the 100th record could technically replace the 1st, so I've been trying lags within loops. Any tips/help is greatly appreciated! Expected output:
obs mindate maxdate Replacement
1 02JAN2013 05JAN2013
2 02JAN2013 05JAN2013
3 02JAN2013 05JAN2013
4 03JAN2013 06JAN2013
5 02FEB2013 08FEB2013 Y
6 02FEB2013 08FEB2013 Y
7 02FEB2013 08FEB2013 Y
8 10MAR2013 11MAR2013 Y
9 02APR2013 22APR2013 Y
10 10APR2013 22APR2013 N
11 04MAY2013 07MAY2013 Y
12 10JUN2013 20JUN2013 Y
I think this is correct if the asker was mistaken about replacement = Y for obs = 12.
/*Get number of obs so we can build a temporary array to hold the dataset*/
data _null_;
set have nobs=nobs;
call symputx("nobs", nobs); /* symputx strips blanks and avoids a numeric-to-character conversion note */
stop;
run;
data want;
/*Load the dataset into a temporary array*/
array dates[2,&NOBS] _temporary_;
/* Use a separate index so _n_ still holds the iteration number for the lookup loop below */
if _n_ = 1 then do i = 1 by 1 until(eof);
set have end = eof;
dates[1,i] = maxdate;
dates[2,i] = 0;
end;
set have;
length replacement $1;
replacement = 'N';
do i = 1 to _n_ - 1 until(replacement = 'Y');
if dates[2,i] = 0 and 0 <= mindate - dates[1,i] <= 30 then do;
replacement = 'Y';
dates[2,i] = _n_;
replaces = i;
end;
end;
drop i;
run;
You could use a hash object + hash iterator instead of a temporary array if you preferred. I've also included an extra var, replaces, to show which previous row each row replaces.
Here is a solution using SQL and hash tables. It is not optimal, but it was the first method that sprang to mind.
/* Join the input with itself */
proc sql;
create table b as
select
a1.obs,
a2.obs as obs2
from a as a1
inner join a as a2
/* Set the replacement criteria */
on a1.maxdate < a2.mindate <= a1.maxdate + 30
order by a2.obs, a1.obs;
quit;
/* Create a mapping for replacements */
data c;
set b;
/* Create two empty hash tables so we can look up the used observations */
if _N_ = 1 then do;
declare hash h();
h.definekey("obs");
h.definedone();
declare hash h2();
h2.definekey("obs2");
h2.definedone();
end;
/* Check if we've already used this observation as a replacement */
if h2.find() then do;
/* Check if we've already replaced this observation */
if h.find() then do;
/* Add the observations to the hash table and output */
h2.add();
h.add();
output;
end;
end;
run;
/* Combine the replacement map with the original data */
proc sql;
select
a.*,
ifc(c.obs, "Y", "N") as Replace,
c.obs as Replaces
from a
left join c
on a.obs = c.obs2
order by a.obs;
quit;
There are several ways in which this can be simplified:
The dates can be brought through the first proc sql
The if statements can be combined
The final join could be replaced by a little extra logic in the data step
I need to fill the nulls of a column with the mean of the ratio of two columns, scaled by the row's own value of the denominator column. An example would be:
A B C ... B_01 C_01
5 . .
5 2 3
7 3 1,2
9 3 0,3
4 . .
I would like each missing value in column B to be (2/5 + 3/7 + 3/9) / 3 multiplied by the corresponding value of column A.
For column C: (3/5 + 1,2/7 + 0,3/9) / 3 multiplied by the corresponding value of column A.
I have thought about doing this, but I have 60 columns to process, and the only way that comes to mind is to write it out 60 times.
proc sql;
create table new as
select *
, sum(B/A)/sum(case when B is missing then . else 1 end) as new_B
, sum(C/A)/sum(case when C is missing then . else 1 end) as new_C_01
from table_one
;
quit;
Thanks
PROC SQL should be able to do that easily.
First let's convert your data listing into an actual dataset.
data have;
input A B C ;
cards;
5 . .
5 2 3
7 3 1.2
9 3 0.3
4 . .
;
Now let's use it to create a new version of B that follows your rules.
proc sql;
create table want as
select *,coalesce(b,a*mean(b/a)) as new_b
from have
;
quit;
Results:
OBS A B C new_b
1 5 . . 1.93651
2 5 2 3.0 2.00000
3 7 3 1.2 3.00000
4 9 3 0.3 3.00000
5 4 . . 1.54921
You can use Proc MEANS to compute the mean fraction of each of the 60 columns, and apply the imputation rule in a DATA step.
Example:
data have;
call streaminit(20230129);
do row = 1 to 100;
a = rand('integer', 30);
array x x1-x60;
do over x;
x = ifn(rand('uniform') > 0.30, rand('integer', a-1), .);
end;
output;
end;
run;
data fractions;
set have;
array f x1-x60;
do over f;
if not missing(f) then f = f / a;
end;
rename x1-x60 = f1-f60;
run;
proc means noprint data=fractions;
output out=means mean(f1-f60)=mean1-mean60;
var f1-f60;
run;
data want;
set have;
one = 1;
set means point=one;
array means mean1-mean60;
array x x1-x60;
do over x;
if missing (x) then means = means * a; else means = x;
end;
rename mean1-mean60=new_x1-new_x60;
run;
I need to fill the nulls of a column with the mean of the ratio of two columns, multiplied by one column, minus the new value of the previous column. An example would be:
A B_01 B_02 ... B_60
5 . .
5 2 3
7 3 1,2
9 3 0,3
4 . .
I would like each missing value in column B_01 to be (2/5 + 3/7 + 3/9) / 3 multiplied by the corresponding value of column A.
For column B_02: (3/5 + 1,2/7 + 0,3/9) / 3 multiplied by the corresponding value of column A, minus the new value in B_01.
I have thought about doing this, but I have 60 columns to process, and the only way that comes to mind is to write it out 60 times.
proc sql;
create table new as
select *
, sum(B_01/A)/sum(case when B_01 is missing then . else 1 end)*A as new_B_01
, sum(B_02/A)/sum(case when B_02 is missing then . else 1 end)*A - B_01 as new_B_02
from table_one
;
quit;
Thanks
This may be what you want.
data test;
input A B_01-B_02;
cards;
5 . .
5 2 3
7 3 1.2
9 3 0.3
4 . .
;;;;
data test2;
set test;
array B_ B_01-B_02;
array M_[2];
do i = 1 to dim(m_);
if not missing(b_[i]) then m_[i]= divide(b_[i],a);
end;
drop i;
run;
proc print;
run;
proc stdize reponly missing=mean data=test2 out=mean;
var m_:;
run;
proc print;
run;
data mean2;
set mean;
array B_ B_01-B_02;
array M_[2];
do i = 1 to dim(m_);
b_[i] = coalesce(b_[i], m_[i]*a); /* scale the mean ratio back by A, as the question asks */
end;
drop i m_:;
run;
proc print;
run;
Try this.
data test;
input A B_01-B_02;
cards;
5 . .
5 2 3
7 3 1.2
9 3 0.3
4 . .
;;;;
run;
data mean;
set test end=eof;
array b(*) b_:;
array sumb(2) _temporary_;
array cntb(2) _temporary_;
array mean_b_(2);
* cumulative sums and counts;
do i = 1 to dim(b);
sumb(i) = sum(sumb(i), b(i)/a);
if not missing(b(i)) then cntb(i) + 1;
end;
* mean;
if eof then do;
do i = 1 to dim(b);
mean_b_(i) = sumb(i)/cntb(i);
end;
output;
end;
keep mean_:;
run;
data test2;
if _n_ = 1 then do;
set mean;
array mean_b_(*) mean_b:;
end;
set test;
array b_(*) b_:;
array new_b_(2);
do i = 1 to dim(b_);
if i = 1 then new_b_(i) = coalesce(b_(i), mean_b_(i) * a);
else new_b_(i) = coalesce(b_(i), mean_b_(i) * a - new_b_(i-1));
end;
run;
I have a dataset as listed below:
ID-----V1-----V2------V3
01------5------3-------7
02------3------8-------5
03------6------9-------1
and I want to calculate 3 new variables (ERR_CODE, ERR_DETAIL, ERR_ID) according to behavior of certain columns.
If V1 is greater than 4 then ERR_CODE = A and ERR_DETAIL = "Out of range" and ERR_ID = [ID]_A
If V2 is greater than 4 then ERR_CODE = B and ERR_DETAIL = "Check Log" and ERR_ID = [ID]_B
If V3 is greater than 4 then ERR_CODE = C and ERR_DETAIL = "Fault" and ERR_ID = [ID]_C
Desired output table be like
ID-----ERR_CODE----ERR_DETAIL---------ERR_ID
01--------A--------Out of range---------01_A
01--------C--------Fault----------------01_C
02--------B--------Check Log------------02_B
02--------C--------Fault----------------02_C
03--------A--------Out of range---------03_A
03--------B--------Check Log------------03_B
I am using SAS 9.3 with EG 5.1. I have tried do-loops, arrays, IF statements and CASE-WHENs, but each of them moves on to the next row as soon as one condition is met. I want every condition that is met to be evaluated for each row.
I have managed to do it by creating separate tables for each condition and then merging them. But that doesn't seem like an effective approach if there are many conditions to work with.
My question is: how can I evaluate all the conditions that are met for each ID in one pass, without calculating them separately? The output table will have more rows than the input, as expected, but I could not achieve that with CASE-WHEN or IF.
Thanks in advance, and sorry if I am not clear.
Just use IF/THEN/DO blocks. Add an OUTPUT statement to write new observation for each error.
data have ;
input ID $ V1-V3;
cards;
01 5 3 7
02 3 8 5
03 6 9 1
;
data want;
set have;
length ERR_CODE $1 ERR_DETAIL $20 ERR_ID $10 ;
if v1>4 then do;
err_code='A'; err_detail="Out of range"; err_id=catx('_',id,err_code);
output;
end;
if v2>4 then do;
err_code='B'; err_detail="Check Log"; err_id=catx('_',id,err_code);
output;
end;
if v3>4 then do;
err_code='C'; err_detail="Fault"; err_id=catx('_',id,err_code);
output;
end;
drop v1-v3 ;
run;
Results:
Obs ID ERR_CODE ERR_DETAIL ERR_ID
1 01 A Out of range 01_A
2 01 C Fault 01_C
3 02 B Check Log 02_B
4 02 C Fault 02_C
5 03 A Out of range 03_A
6 03 B Check Log 03_B
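Since the question mentions that there may be many conditions, the same idea can be written with arrays so that each new rule is one more array entry rather than another IF block. This is only a sketch: it assumes every rule has the same threshold (greater than 4) and uses the code/detail mapping from the question; the names want_arr, codes and details are illustrative and not part of the original answer.

```sas
data want_arr;
  set have;
  length ERR_CODE $1 ERR_DETAIL $20 ERR_ID $10;
  array v[3] v1-v3;
  /* Lookup tables: one entry per checked variable (mapping taken from the question) */
  array codes[3] $1 _temporary_ ('A' 'B' 'C');
  array details[3] $20 _temporary_ ('Out of range' 'Check Log' 'Fault');
  do i = 1 to dim(v);
    if v[i] > 4 then do;
      err_code = codes[i];
      err_detail = details[i];
      err_id = catx('_', id, err_code);
      output; /* one output row per rule that fires */
    end;
  end;
  drop i v1-v3;
run;
```

If the thresholds differ per variable, a third temporary array holding the limits could be compared against v[i] in the same loop.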
I have a dataframe
ID value1
1 12
2 345
3 342
I have a second dataframe
value2
3823
How do I get the following result?
ID value1 value2
1 12 3823
2 345 3823
3 342 3823
Any joins I have tried have given me
ID value1 value2
1 12 .
2 345 .
3 342 .
. . 3823
No need for joins or helper variables:
data have;
do i = 1 to 3;
output;
end;
run;
data lookup;
j = 1;
run;
data want;
set have;
if _n_ = 1 then set lookup;
run;
Without the if _n_ = 1, the data step stops after one iteration when it tries to read a second row from the lookup dataset and finds that there are no rows remaining.
N.B. this requires that the have dataset doesn't already contain a variable with the same name as the variable(s) attached from the lookup dataset.
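If the two datasets do share a variable name, one way around it is to rename the lookup variable as it is read. A minimal sketch, assuming the lookup variable is j as in the example above; lookup_j and want_renamed are hypothetical names chosen for illustration:

```sas
data want_renamed;
  set have;
  /* RENAME= is applied as the dataset is read, so lookup's j cannot overwrite a j in have */
  if _n_ = 1 then set lookup(rename=(j = lookup_j));
run;
```

Because lookup_j comes from a SET statement, its value is retained across iterations and so appears on every output row.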
By far the easiest way to do this is to use PROC SQL and join on the condition 1=1, which is true for every pair of rows:
data first;
input ID value1 @@;
cards;
1 12 2 345 3 342
run;
data second;
input value2 ;
cards;
3823
run;
proc sql;
create table wanted as
select * from first
left join second
on 1 =1
;quit;
Edit: As far as I know, there isn't a direct way to merge datasets row by row in a data step, but you can use the following trick:
Add variable Help:
data second_trick;
set second;
help=1;
run;
data first_trick;
set first;
help=1;
run;
Then we just perform the merge by the static variable:
data wanted_trick;
merge first_trick(in=a) second_trick;
by help;
if a; /*Left join, just to be sure.*/
run;
Note that this only works if you want to add a single static value. Don't try to use it if your second dataset has more than one row.
For more on Merges and joins see: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/249-30.pdf
This is my current issue:
I have 53 variable headers in a SAS data set that need to be changed, for example:
Current_Week_0 TS | Current_Week_1 TS | Current_Week_2 TS -- etc.
I need it to change such that Current_Week_# TS = Current_Week_# -- dropping the TS
Is there a way to automate this such as looping it like:
i = 0,53
Current_week_i TS = Current_Week_i ?
I just don't understand the proper syntax.
Edit: Thank you for editing my formats Sergiu, appreciate it! :)
Edit:
I used the following code, but I get the following error:
Missing numeric suffix on a numbered variable list (TS-Current_Week_53)
DATA True_Start_8;
SET True_Start_7;
ARRAY oldnames (53) Current_Week_1 TS-Current_Week_53 TS;
ARRAY newnames (53) Current_Week_1-Current_Week_53;
DO i = 1 TO 53;
newnames(i) = oldnames(i) ;
END;
RUN;
@Joe EDIT
Here's what the data looks like before and after the "denorm" / transpose
BEFORE
Product ID CurrentWeek Market TS
X 75av2kz Current_Week_0 Z 1
Y 7sav2kz Current_Week_0 Z 1
X 752v2kz Current_Week_1 Z 1
Y 255v2kz Current_Week_1 Z 1
AFTER
Product ID Market Current_Week_0_TS Current_Week_1_TS
X 75av2kz Z 1 0
Y 7sav2kz Z 1 1
X 752v2kz Z 1 1
Y 255v2kz Z 1 0
This isn't too hard. I assume these are variable labels.
proc sql noprint;
/* Build name="new label" pairs, dropping the trailing " TS" (3 characters) from each label */
select cats(nliteral(name), '=', quote(substr(label, 1, length(label)-3)))
into :relabellist separated by ' '
from dictionary.columns
where libname='WORK' and memname='TRUE_START_7' /* libname and memname are stored in upper case */
and label like '% TS';
quit;
data want;
set True_Start_7;
label &relabellist.;
run;
Basically the PROC SQL grabs the variables that qualify for relabelling and generates one macro variable holding all of the name="label" pairs, which the LABEL statement in the data step then applies. A LABEL statement only accepts literal strings, so the new labels are computed in the PROC SQL step rather than with VLABEL inside the data step. You may need to change the logic behind the WHERE in the PROC SQL if the labels don't end in " TS".
Another option is to do this in the transpose. Your example data either doesn't match the example desired output, or there is something in logic not explained, but this does the simple transpose; if there is a logical reason that the current_week_0/1 are different in yours than in the below, explain why.
data have;
format currentWeek $20.;
input Product $ ID $ CurrentWeek $ Market $ TS;
datalines;
X 75av2kz Current_Week_0 Z 1
Y 7sav2kz Current_Week_0 Z 1
X 752v2kz Current_Week_1 Z 1
Y 255v2kz Current_Week_1 Z 1
;;;;
run;
proc sort data=have;
by market id product;
run;
proc transpose data=have out=want;
by market id product ;
id currentWeek;
var TS;
run;