I am very new to SAS and would like help understanding the use case for the RETAIN statement. Below are two data steps whose goal is a cumulative height in a new column, 'Tot_Height'. Both give the same result, so I am confused: when would RETAIN be needed, and when should the variable be initialised to 0?
Data set: https://i.stack.imgur.com/nt1j6.png
Code No 1
```
data class3;
set class2;
retain tot_height 0;
by sex;
if first.sex then tot_height = Height;
else tot_height + Height;
run;
```
Code No 2
````
data class3;
set class2;
if first.sex then tot_height = Height;
else tot_height + Height;
run;
````
Please help me understand, from the two data steps above, where RETAIN actually helps.
There is no real difference between those two steps because of the use of the SUM statement in the ELSE clause.
The SUM statement has the form:
variable + expression ;
The result is that the value of EXPRESSION is added to the value of VARIABLE. VARIABLE is also implicitly RETAINed, and if there is no other RETAIN statement with an explicit initial value it is initialized to zero. So adding the explicit RETAIN statement in the first data step, which sets the initial value to zero, does not change how the data step works.
Normally, on each iteration of the data step, variables that are created by this step (as opposed to variables that come from input datasets) are set to missing. The RETAIN statement says NOT to set them to missing, so the value at the start of one iteration is the same as it was at the end of the previous iteration.
Note that for these data steps the initial value is immediately overwritten by the first value of HEIGHT for the first observation since by definition the first observation in the dataset is the first observation of a BY group. So even if you were to change the initial value in the RETAIN statement to something other than zero it will have no real effect since it will be immediately replaced.
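To make the reset-versus-retain rule concrete, here is a rough Python model of the two data steps. It is only a sketch, not SAS: the function `run_step`, the `MISSING` placeholder and the four sample rows are made up for illustration. The `retain=False` case models what a plain assignment such as `tot_height = sum(tot_height, Height)` would do without a RETAIN statement, since the sum statement itself always retains.

```python
# Toy model of a SAS DATA step with BY-group processing; this is not SAS,
# only a sketch of the reset-versus-retain rule described above.
MISSING = None  # stands in for a SAS missing value


def run_step(rows, retain, initial=0):
    """Return the tot_height written for each (sex, height) row."""
    tot_height = initial if retain else MISSING
    prev_sex = object()  # sentinel: the first row always starts a BY group
    out = []
    for sex, height in rows:
        if not retain:
            tot_height = MISSING  # non-retained variables reset each iteration
        if sex != prev_sex:       # plays the role of first.sex
            tot_height = height
        else:
            # the sum treats a missing total as zero
            tot_height = (tot_height or 0) + height
        out.append(tot_height)
        prev_sex = sex
    return out


rows = [("F", 56), ("F", 65), ("M", 69), ("M", 62)]  # hypothetical data
retained = run_step(rows, retain=True)
retained_odd_start = run_step(rows, retain=True, initial=999)
not_retained = run_step(rows, retain=False)
```

In this model both retained runs give 56, 121, 69, 131 regardless of the initial value, while the non-retained run never accumulates and simply repeats each height.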
Related
I'm playing around with SAS (version: 7.11 HF2), I've a dataset which has columns A and B, variable A is decimal. When I run the below code, strangely I get a . (dot) in the first row of output.
Input data:
a, b
2.4, 1
1.2, 2
3.6, 3
Code:
data test;
c = a;
set abcd.test_data;
run;
Output data:
c, a, b
., 2.4, 1
2.4, 1.2, 2
1.2, 3.6, 3
3.6, ,
Strange things:
Derived variable is always generated on the right side, this one is being generated on left.
. (dot) is coming and the values are shifting by a row in the derived column.
Any help?
Looks like it did what you asked it to do.
On the first iteration of the data step it will set C to the value of A. The value of A is missing since you have not yet given it any value. Then the SET statement will read the first observation from your input dataset. Since there is no explicit OUTPUT statement the observation is written when the iteration reaches the end.
On the rest of the iterations of the data step the value that A will have when it is assigned to C will be the value as last read from the input dataset. Any variable that is part of an input dataset is "retained", which really just means it is not set to missing when a new iteration starts.
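The order of events can be traced with a small Python model. This is only a sketch of the iteration logic described above, not SAS; the names `MISSING`, `input_rows` and `position` are made up for illustration. The model stops as soon as there is nothing left to read, so it yields three rows.

```python
MISSING = None  # stands in for a SAS missing value

input_rows = [(2.4, 1), (1.2, 2), (3.6, 3)]  # the a, b values from the question

a = b = MISSING  # variables from the input dataset start out missing
output = []
position = 0
while True:
    c = a                        # c = a; runs before anything is read this iteration
    if position >= len(input_rows):
        break                    # no more observations to read, so the step ends
    a, b = input_rows[position]  # set abcd.test_data;
    position += 1
    output.append((c, a, b))     # implicit OUTPUT at the end of the iteration
```

Because `a` keeps its value between iterations, `c` always picks up the value read one row earlier, and is missing on the first row.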
If the goal was to create C with the previous value of A you could have created the same output by using the LAG() function.
data test;
set abcd.test_data;
c=lag(a);
run;
Your set statement is after your variable assignment statement. SAS is first trying to assign the value of a to c, which has not yet been read. Place your set statement first, then do variable manipulation.
data test;
set abcd.test_data;
c = a;
run;
Nothing strange here, just put the SET statement before.
Datastep processing consists of 2 phases.
Compilation Phase
Execution Phase
During compilation phase, each of the statements within the data step are scanned for syntax errors.
During execution phase, a dataset's data portion is created.
It initializes variables to missing and finally executes other statements in the order determined by their location in the data step.
In your case, the set statement comes after the assignment of c. At that time a and b are set to missing, hence giving a missing value for c. Finally, the SET statement will be executed and that is why you end up with a value for both a and b on the first line.
data test;
set abcd.test_data;
c = a;
run;
Note that the first variable in your dataset is c, because this is the first stated in your code.
Is it possible in SPSS to store a value in a variable (not a variable created in a data set)?
For example I have a loop for which I want to pass the value 4 to all the locations in the loop that say NumLvl.
NumLvl = 4.
VECTOR A1L(NumLvl-1).
LOOP #i = 1 to NumLvl-1.
COMPUTE A1L(#i) = 0.
IF(att1 = #i) A1L(#i) = 1.
IF(att1 = NumLvl) A1L(#i) = -1.
END LOOP.
EXECUTE.
You can do this using DEFINE / !ENDDEFINE, SPSS's macro facility, for example:
DEFINE !MyVar () 4 !ENDDEFINE.
You can then use !MyVar as a substitute for 4 wherever in your syntax you wish.
See DEFINE / !ENDDEFINE documentation for further notes.
There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using Python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and more in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10;
data['group'][mask] = 1;
mask2 = (data['length'] > 9) & (data['length'] < 15);
data['group'][mask2] = 2;
mask3 = data['length'] > 14;
data['group'][mask3] = 3;
This code is not good, of course. The reason is that you don't know at run time whether data['group'][mask3], for example, will be a view, and thus actually change the dataframe, or a copy, leaving the dataframe unchanged. It took me quite some time to explain this to her, since she argued correctly that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we verified that the assignment took place in two different ways:
1. By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
2. By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2, the group column contained no null values and all the values we wanted it to have. But by method 1, it didn't!
Can anyone explain this?
see docs here.
You don't want to set values this way, for exactly the reason you pointed out: since you don't know whether it's a view, you don't know that you are actually changing the data. pandas 0.13 will raise/warn that you are attempting to do this, but it is easiest/best to just access like:
data.loc[mask3,'group'] = 3
which guarantees an in-place setitem.
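Applied to all three groups, the same pattern might look like the sketch below. The seven-row frame is a made-up stand-in for the real 12,000-row data, and the boundaries follow the masks in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 12,000-row frame in the question.
data = pd.DataFrame({"length": [0, 4, 9, 10, 14, 15, 20]})
data["group"] = np.nan

# Each assignment names the rows and the column in a single .loc call,
# so it writes into `data` itself rather than into a temporary object.
data.loc[data["length"] < 10, "group"] = 1
data.loc[(data["length"] > 9) & (data["length"] < 15), "group"] = 2
data.loc[data["length"] > 14, "group"] = 3

null_count = int(data["group"].isna().sum())
counts = data["group"].value_counts().to_dict()
```

Checking `null_count` and `counts` afterwards is one way to confirm that every row received a group.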
I would like to do something very simple, but it doesn't work
This is a simple example but I intend to use it for some more complex stuff
the output I want is :
obs. dummy newcount
1 3 1
2 5 2
3 2 3
but the output I get is :
obs. dummy newcount
1 3 1
2 5 1
3 2 1
here is my code
data test;
input dummy;
cards;
3
5
2
;
run;
%let count=1;
data test2;
set test;
newcount = &count.;
%let count = &count. + 1;
run;
The variable count doesn't get incremented. How do I do this?
Thanks for your help !
You're mixing macro variables and datastep variables in a way you cannot. Macro variables used in the data step in most cases have to have their values already defined prior to the data step when used like this; what happens is the data step compiler immediately resolves &count to the number 1, and uses that number 1 in its compilation, not the macro variable's newer values.
Further, the %let is not a data step command but a macro statement - it is also only executed once, not one time per data step pass.
You could use
data test2;
set test;
newcount = symget("count");
call symput("count",newcount+1);
put _all_;
run;
and it would work (call symput is how you define a macro variable in a data step, symget is how you retrieve the value of a macro variable that isn't finalized before the data step begins). It is probably not a good idea, however: you shouldn't generally store data values in macro variables and interact repeatedly with them inside a data step. If you post more details about why you're trying to do this (i.e., what your actual goal is) I'm sure several of us could offer some suggestions for how to approach the problem.
Sometimes ABAP drives me crazy with really simple tasks such as incrementing an integer within a loop...
Here's my try:
METHOD test.
DATA lv_id TYPE integer.
lv_id = 1.
LOOP AT x ASSIGNING <y>.
lv_id = lv_id+1.
ENDLOOP.
ENDMETHOD.
This results in the error message Field type "I" does not permit subfield access.
You already answered the question yourself, but to make things a bit clearer:
variable + 1
is an arithmetic expression - add 1 to the value of the variable.
variable+1
is an offset operation on a character variable. For example, if variable contains ABC, variable+1 is BC.
This can be especially confusing when dealing with NUMCs. For example, with variable = '4711', variable + 1 is evaluated to 4712, whereas variable+1 is '711' (a character sequence).
The error you saw occurred because it's not possible to perform the offset operation on a non-character-like variable.
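For readers who think in Python, the two readings correspond roughly to integer addition versus slicing off the first character. This is only an analogy, not ABAP, and the variable names are made up for illustration.

```python
# Python analogy only; the ABAP syntax itself is shown in the answer above.
numc = "4711"  # hypothetical NUMC-like value held as text

arithmetic = int(numc) + 1  # like `variable + 1`: numeric addition
offset = numc[1:]           # like `variable+1`: skip the first character
```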
You mean like:
ADD 1 to lv_id.
By the way, when you loop over an internal table, SY-TABIX has the loop counter.
Uh, I got it.
It's the f****** spaces...
lv_id = lv_id + 1
works...
Simple
DATA : gv_inc type I .
place this statement in loop
gv_inc = gv_inc + 1 .
From SAP NetWeaver version 7.54 you can also use:
lv_id += 1.
Instead of
lv_id = lv_id + 1.
Happy coding!
If you are going to increment on every loop cycle, then you can get the table size directly.
describe table x lines data(lv_id). "Outside the loop.