I have a dataset with one row per week for 2 years (so 104 rows). I have a flag column which is either 1 or 0 for each week. I want to create a new column with the following logic:
if the flag=1 for that week then have a 1 for that week and the following 3 weeks as flag_new.
My current approach, which works, is:
if flag=1 or lag(flag)=1 or lag2(flag)=1 or lag3(flag)=1 then flag_new=1;
Although this works, it becomes very tedious if I want flag_new to be 1 for the following 20 or 30 weeks instead of just 3 weeks.
I was hoping there would be an easier way to do this (perhaps a loop?), but I am not too familiar with it.
Any help is much appreciated.
Maybe instead of a look back, think of it as a look ahead. That is, each time you see flag=1, set flag_new=1 for that record and the next three records. Something like (untested):
if flag=1 then count=3;
else count+(-1) ; *implicit retain from sum statement;
if count>=0 then flag_new=1;
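The countdown idea extends to any window length by changing one number. As a way to check the expected output outside SAS, here is a minimal Python sketch of the same logic; `extend_flag` and `weeks` are names made up for this illustration, not anything from the original data.

```python
def extend_flag(flags, weeks):
    """Return 1 for each flagged week and for the `weeks` weeks after it."""
    out = []
    count = -1  # negative means no flag seen recently
    for flag in flags:
        if flag == 1:
            count = weeks      # restart the countdown on every flagged week
        else:
            count -= 1         # one week further from the last flag
        out.append(1 if count >= 0 else 0)
    return out


result = extend_flag([0, 1, 0, 0, 0, 0, 1, 0], weeks=3)
print(result)
```

For 20 or 30 weeks, only the `weeks` argument changes, just as only the `count=3` assignment changes in the data step.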
You can also use a temporary array to hold the current and lagged flags, then take the maximum of the array. If the maximum is 1, set the new flag to 1. To change the window, set the array's upper bound to the number of following weeks you need (3 here) and the divisor in mod() to that number plus one.
This also demonstrates the BY statement and resetting the array at the beginning of a new group.
data want;
array p{0:3} _temporary_; /* current week plus 3 lagged weeks */
set have;
by object;
if first.object then call missing(of p{*});
p{mod(_n_,4)} = flag; /* overwrite the oldest slot */
highest = max(of p{*});
if highest = 1 then do;
flag_new = 1;
end;
run;
I have a data table with columns: Year, Month, Sales. It is effectively a summary table, like a pivot table in Excel.
With this table, if there are no sales reported for one month (i.e. not 0 sales, but no mention of sales at all, so SAS cannot assign a value to that month), then that whole row disappears.
I do not want this to happen. I would like that row to display 0 rather than not appear. Is there a way to ensure every row appears?
Note: The months are not calendar months, as such you could have month60 relating to 2011.
If the table is being created using proc summary or proc means, one way of achieving the output you want, provided that you have at least 1 row for each month somewhere in your data, is to use the completetypes option, e.g.
proc summary data = sashelp.class completetypes;
class sex age;
var weight;
output out = mysummary mean=;
run;
This produces a row with frequency 0 for Sex = F, Age = 16 rather than skipping that output entirely.
A more reliable but more labour-intensive method, which works even if some values never appear anywhere in your data, is to use the classdata option, e.g.
data myclassdata;
do SEX = 'M','F';
do AGE = 13 to 17;
output;
end;
end;
run;
proc summary nway data = sashelp.class classdata=myclassdata exclusive;
class sex age;
var weight;
output out = mysummary2 mean=;
run;
The exclusive option here restricts the output to combinations of levels that are present in the classdata dataset. Without it, you get at least those specified in the classdata plus rows for all possible combinations based on observed 1-way values as though you had specified completetypes.
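The classdata idea is to build the full grid of levels first and then add the observed data onto it. A minimal Python sketch of that idea, to show the expected shape of the result; the function name and the sample rows are invented for illustration, with month numbers following the non-calendar numbering mentioned in the question.

```python
from itertools import product


def complete_totals(rows, years, months):
    """Sum sales per (year, month), keeping a zero row for every combination."""
    # Start from the full grid, like the classdata dataset.
    totals = {key: 0 for key in product(years, months)}
    for year, month, sales in rows:
        # Like EXCLUSIVE: levels outside the grid are ignored.
        if (year, month) in totals:
            totals[(year, month)] += sales
    return totals


# Hypothetical input: month 61 has no rows at all.
rows = [(2011, 59, 100), (2011, 60, 40), (2011, 59, 10)]
result = complete_totals(rows, years=[2011], months=[59, 60, 61])
print(result)
```

Month 61 still gets a row with total 0, which is the behaviour the question asks for.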
I'm trying to pull back all MAX instances given subset data....first.id or last.id doesn't work because I want to keep several rows of the same transaction. For example:
[Image: TableView_of_Data]
In this example I want the highlighted rows as output. My data has several FORMs, QUARTERs, and CUST_IDs. I'd like SAS to programmatically pull back the latest DB_ID for each combination of FORM, QUARTER, and CUST_ID.
Last.DB_ID only brings back 1 row. I need all rows of the same DB_ID.
This also failed to do anything:
data work.want;
set work.have;
by FORM Quarter Cust_ID DB_ID ;
if Max(DB_ID) then output;
run;
You need to do two passes through your data: one to determine what the max value is for that ID, and one to find the rows that have that maximum value.
Doing this in the data step requires a DoW loop, which runs one data step iteration per cust_id value but two passes through the dataset.
data want;
do _n_ = 1 by 1 until (last.cust_id);
set have;
by form quarter cust_id;
if last.cust_id then max_db_value=db_id;
end;
do _n_ = 1 by 1 until (last.cust_id);
set have;
by form quarter cust_id;
if db_id = max_db_Value then output;
end;
run;
That works if DB_ID is sorted as it is in your example. If it's not sorted, you can compare the currently stored max_db_value to the current db_id and assign the new value from db_id to it if it's higher, something like
max_db_value = max(db_id, max_db_value);
instead of assigning it when last.cust_id is true.
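The two-pass structure is easier to see outside the data step. A rough Python sketch of the same idea, assuming rows are tuples of (form, quarter, cust_id, db_id); the function name and sample values are hypothetical.

```python
def rows_with_group_max(rows):
    """Keep every row whose db_id equals the maximum db_id of its group."""
    best = {}
    # Pass 1: find the maximum db_id per (form, quarter, cust_id).
    for form, quarter, cust_id, db_id in rows:
        key = (form, quarter, cust_id)
        if key not in best or db_id > best[key]:
            best[key] = db_id
    # Pass 2: output all rows that carry that maximum, not just one of them.
    return [row for row in rows if row[3] == best[row[:3]]]


rows = [
    ("A", "Q1", 1, 100),
    ("A", "Q1", 1, 200),
    ("A", "Q1", 1, 200),
    ("A", "Q1", 2, 150),
]
kept = rows_with_group_max(rows)
print(kept)
```

Because the maximum is computed by comparison rather than taken from the last row, this version does not depend on DB_ID being sorted within the group.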
I am trying to help a colleague with a SAS script she is working on. I am a programmer, so I understand the logic, but I don't know the syntax in SAS. Basically, this is what she is trying to do.
We have:
Array of Procedure Dates (proc_date[i])
Array of Procedures (proc[i]).
Each record in our data can have up to 20 Procedures and 20 dates.
i=20
Each procedure has an associated code. Let's say there are 100 different codes, where codes 1 to 10 are ProcedureA, 11 to 20 are ProcedureB, etc.
We need to loop through each Procedure and assign it the correct ProcedureCategory if it falls into 1 of the 100 codes (i.e. it enters one of the If statements). When this is true, we then need to loop through each other corresponding Procedure Date in that row. If they are different dates, then we add the 'Weighted Values' together; otherwise we just take the greater of the 2 values.
I hope that helps and makes sense. I could write this in another language (e.g. C/C++/C#/VB), but I'm at a loss with SAS as I'm just not that familiar with the syntax, and the logic doesn't seem to be like that of other OO languages.
Thanks in advance for any assistance.
Kind Regards.
You don't want to do 100 if statements, one way or the other.
The answer to the core of your question is that you need the do loop outside of the if statement.
data want;
set have;
array proc[20];
array proc_date[20];
do _i = 1 to dim(proc); *this would be 20;
if proc[_i] = 53 then ... ;
else if proc[_i] = 54 then ...;
end;
run;
Now, what you're trying to do with the proc_date sounds like you need to do something with proc_date[_i] in that same loop. The loop is just an iterator: the only thing changing is _i, which is used as an array index. It can be the array index for either array (or be used for anything else). This is where this differs from common OOP practice, since this isn't an array class; you're not using the individual object to iterate it. This is a procedural style (C would do it the same way, even).
However, the if/else bits there would be unwieldy and long. In SAS you have a lot of ways of dealing with that. You might have another array of 100 values proc can take, and then inside that do loop have another do loop iterating over that array (do _j = 1 to 100;) - or the other way around (iterate through the 100, inside that iterate through the 20) if that makes more sense (if you want to at one time have all of the values).
data want;
set have;
array proc[20];
array proc_date[20];
array proc_val[100]; *needs to be populated in the `have` dataset, or in another;
do _i = 1 to dim(proc_val);
do _j = 1 to dim(proc);
if proc[_j] = proc_val[_i] then ...; *this statement executes 100*20 times;
end;
end;
run;
You also could have a user-defined format, which is really just a one-to-one mapping of values (start value -> label value). Map your 100 values to the 10 procedures they correspond to, or whatever. Then all 100 if statements become
proc_value[_i] = put(proc[_i],PROCFMT.);
and then proc_value[_i] (which must be a character array, since put returns a character value) stores the procedure category, which is much simpler to evaluate than the raw code.
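A format is effectively a range lookup table. A small Python sketch of the same mapping, using only the two ranges given in the question (the remaining ranges would be added the same way); `RANGES`, `category`, and `categorize` are illustrative names, not SAS features.

```python
# (low, high, label) triples; only the two ranges stated in the question.
RANGES = [(1, 10, "A"), (11, 20, "B")]


def category(code):
    """Return the procedure category for a code, or None if no range matches."""
    for low, high, label in RANGES:
        if low <= code <= high:
            return label
    return None


def categorize(procs):
    """Map each procedure code in one record to its category."""
    return [category(code) for code in procs]


labels = categorize([3, 15, 53])
print(labels)
```

The point is the same as with the format: the 100 comparisons live in one table, and the loop body stays a single lookup.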
You also may want to look into hash tables; both for a similar concept as the format above, but also for doing the storage. Hash tables are a common idea in programming that perhaps you've already come across, and the way SAS implements them is actually OOP-like. If you're trying to do some sort of summarization based on the procedure values, you could easily do that in a hash table and probably much more efficiently than you could in IF statements.
Here are some statements mentioned.
*codes 1 to 10 is ProcedureA, 11 to 20 is ProcedureB ;
proc format;
value codes 1-10 = 'A'
11-20 = 'B';
run;
/* then, in a data step, inside the loop over the procedure array */
Procedure(i) = put(code,codes.);
another way to recode ranges is with the between syntax
if 1 <= value <= 10 then variable = <new-value>;
hth
I'm a SAS beginner and I'm curious whether the following task can be done much more simply than the way I currently have it in my head.
I have the following (simplified) meta data in a table named user_date_money:
User - Date - Money
with various users and dates for every calendar day (for the last 4 years). The data is ordered by User ASC and Date ASC, sample data looks like this:
User | Date | Money
Anna 23.10.2013 5
Anna 24.10.2013 1
Anna 25.10.2013 12
....
Aron 23.10.2013 5
Aron 24.10.2013 12
Aron 25.10.2013 4
....
Zoe 23.10.2013 1
Zoe 24.10.2013 1
Zoe 25.10.2013 0
I now want to calculate a five-day moving average for the Money. I started with the pretty popular approach using the lag() function, like this:
data cma;
set user_date_money;
if missing(money) then
do;
OBS = 0;
money = 0.0;
end;
else OBS = 1;
money5 = lag5(money);
OBS5= lag5(obs);
if missing(money5) then money5= 0.0;
if missing(obs5) then obs5= 0;
if _N_ = 1 then
do;
SUM = 0.0;
N = 0;
end;
else;
sum = sum + money-money5;
n = n + obs-obs5;
MEAN = sum / n ;
retain sum n;
run;
As you see, the problem with this method occurs when the data step runs into a new user. Aron would get some lagged values from Anna, which of course should not happen.
Now my question: I am pretty sure you can handle the user switch by adding some extra fields like laggeduser and by resetting the N, Sum and Mean variables if you notice such a switch but:
Can this be done in an easier way? Perhaps using the BY Clause in any way?
Thanks for your ideas and help!
Best regards
I think the easiest way is to use PROC EXPAND:
PROC EXPAND data=user_date_money out=cma;
ID date;
BY user;
CONVERT money=MEAN / transformin=(setmiss 0) transformout=(movave 5);
RUN;
And as mentioned in John's comment, it's important to remember about missing values (and about beginning and ending observations as well). I've added SETMISS option to the code, as you made it clear that you want to 'zerofy' missing values, not ignore them (default MOVAVE behaviour).
And if you want to exclude first 4 observations for each user (since they don't have enough pre-history to calculate moving average 5), you can use option 'TRIMLEFT 4' inside TRANSFORMOUT=().
If your particular need is simple enough, you can calculate it using PROC MEANS and a multilabel format.
data mydata;
do id = 1 to 5;
datevar = '01JAN2010'd-1;
do month = 0 to 4;
datevar=intnx('MONTH',datevar,1,'b');
sales = floor(500*rand('normal',7))+1500;
output;
end;
end;
run;
proc format;
value movingavg (multilabel notsorted)
'01JAN2010'd-'31MAR2010'd = 'JAN-MAR 2010'
'01FEB2010'd-'30APR2010'd = 'FEB-APR 2010'
'01MAR2010'd-'31MAY2010'd = 'MAR-MAY 2010'
/* ... more of these ... */
;
quit;
proc means data=mydata;
class id datevar/mlf order=data;
types id*datevar;
format datevar movingavg.;
var sales;
run;
The PROC FORMAT can be generated programmatically by use of a CNTLIN dataset; see the SAS documentation for PROC FORMAT for more information.
If you make sure your data is sorted, you can use the first and last named variables to initialize your running totals when you get to a new member. These and retain should get you what you need; I don't think lag() is really called for here.
Yes, you can use by groupings. First, you'll sort by user and date (as you already have).
proc sort data=user_date_money;
by user date;
run;
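Then, redo the data step using the by variable and a counter. Call the lag functions on every row rather than inside the IF block: each LAG keeps its own queue, which only advances when the function actually executes, so a conditional call would return values from the wrong rows.
data cma;
set user_date_money;
by user;
length User_Recs 3
Average 8;
retain User_Recs;
if First.User=1 then User_Recs=0;
User_Recs=User_Recs+1;
/* unconditional calls keep the lag queues in step with the data */
m1=lag1(money); m2=lag2(money); m3=lag3(money); m4=lag4(money);
if User_Recs>4 then do;
Average=(m4+m3+m2+m1+money)/5;
end;
drop User_Recs m1-m4;
run;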
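To check what the BY-group reset should produce, here is a minimal Python sketch of a five-row moving average that restarts for each user. It assumes the rows are already sorted by user and date and carry no missing values; the function name and sample values are made up.

```python
from collections import deque


def moving_average(rows, window=5):
    """Per-user moving average; None until a user has `window` rows."""
    out = []
    current_user = None
    recent = deque(maxlen=window)  # holds at most the last `window` values
    for user, money in rows:
        if user != current_user:   # same role as first.user in the data step
            current_user = user
            recent.clear()
        recent.append(money)
        full = len(recent) == window
        out.append(sum(recent) / window if full else None)
    return out


rows = [("Anna", v) for v in [5, 1, 12, 2, 5, 10]] + [("Aron", 5)] * 4
averages = moving_average(rows)
print(averages)
```

Aron's first rows stay empty instead of borrowing Anna's values, which is exactly the problem with the unguarded lag() approach.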
I'd like to consult one thing. I have table in DB. It has 2 columns and looks like this:
Name...bilance
Jane...+3
Jane...-5
Jane...0
Jane...-8
Jane...-2
Paul...-1
Paul...2
Paul....9
Paul...1
...
I have to walk through this table and if I find record with different "name" (than was on previous row) I process all rows with the previous "name". (If I step on the first Paul row I process all Jane rows)
The processing goes like this:
Now I work only with Jane records and walk through them one by one. On each record I stop and compare it with all previous Jane rows one by one.
The task is to summarize the "bilance" column (within the scope of the current person) where the values have different signs.
Summary:
I loop through this table on 3 levels at once (nested loops):
1st level = search for changes of "name" column
2nd level = if change was found, get all rows with previous "name" and walk through them
3rd level = on each row stop and walk through all previous rows with current "name"
Can this be solved only with a CURSOR and FETCH, or is there a smoother solution?
My real table has 30 000 rows and 1500 people, and if I do the logic in PHP, it takes long minutes and then times out. So I would like to rewrite it in MS SQL 2000 (no other DB is allowed). Are cursors a fast solution, or is it better to use something else?
Thank you for your opinions.
UPDATE:
There are lots of questions about my "summarization". Problem is a little bit more difficult than I explained. I simplified it just to describe my algorithm.
Each row of my table contains much more columns. The most important is month. That's why there are more rows for each person. Each is for different month.
"Bilances" are the overtime hours and arrear hours of workers. I need to summarize the + and - bilances so that they neutralize each other, using values from previous months. I want to have as many zeroes as possible. The table must otherwise stay as it is; only the bilances change, towards zero.
Example:
Row (Jane -5) will be summarized with row (Jane +3). Instead of 3 I will get 0 and instead of -5 I will get -2. Because I used this -5 to reduce +3.
Next row (Jane 0) won't be affected
Next row (Jane -8) can not be used, because all previous bilances are negative
etc.
You can sum all the values per name using a single SQL statement:
select
name,
sum(bilance) as bilance_sum
from
my_table
group by
name
order by
name
On the face of it, it sounds like this should do what you want:
select Name, sum(bilance)
from table
group by Name
order by Name
If not, you might need to elaborate on how the Names are sorted and what you mean by "summarize".
I'm not sure what you mean by this line: "The task is to sumarize "bilance" column (in the scope of actual person) if they have different signs".
But it may be possible to use a group by query to get a lot of what you need.
select name, case when bilance < 0 then 'negative' else 'positive' end as bilance_sign, count(*) as row_count
from my_table
group by name, case when bilance < 0 then 'negative' else 'positive' end
The case expression has to be repeated in the group by so that rows are grouped by sign rather than by each individual bilance value.
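Going by the update in the question, a plain sum per name is not enough, because every monthly row has to survive with an adjusted value. Here is a sketch of the neutralization as I read the description (each row is offset against earlier rows of the opposite sign for the same person). It is written in Python only to pin down the expected result before translating it to T-SQL; `neutralize` is a made-up name and the reading of the rule is an assumption.

```python
def neutralize(balances):
    """Offset each balance against earlier balances of the opposite sign."""
    result = list(balances)
    for i in range(len(result)):
        for j in range(i):
            if result[i] == 0:
                break  # nothing left to offset on this row
            if result[i] * result[j] < 0:  # opposite signs
                offset = min(abs(result[i]), abs(result[j]))
                # Move both values towards zero by the same amount.
                result[i] += offset if result[i] < 0 else -offset
                result[j] += offset if result[j] < 0 else -offset
    return result


jane = neutralize([3, -5, 0, -8, -2])
paul = neutralize([-1, 2, 9, 1])
print(jane, paul)
```

For Jane this reproduces the example from the question: +3 becomes 0 and -5 becomes -2, while the later rows are untouched. With about 20 rows per person, the nested loop is small, so the cost is dominated by how the rows are fetched rather than by the comparison itself.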