Creating and modifying a global statement in SAS - variables

I would like to do something very simple, but it doesn't work
This is a simple example but I intend to use it for some more complex stuff
the output I want is :
obs. dummy newcount
1 3 1
2 5 2
3 2 3
but the output I get is :
obs. dummy newcount
1 3 1
2 5 1
3 2 1
here is my code
data test;
input dummy;
cards;
3
5
2
;
run;
%let count=1;
data test2;
set test;
newcount = &count.;
%let count = &count. + 1;
run;
The variable count doesn't get incremented. How do I do this?
Thanks for your help !

You're mixing macro variables and datastep variables in a way you cannot. Macro variables used in the data step in most cases have to have their values already defined prior to the data step when used like this; what happens is the data step compiler immediately resolves &count to the number 1, and uses that number 1 in its compilation, not the macro variable's newer values.
Further, the %let is not a data step command but a macro statement - it is also only executed once, not one time per data step pass.
You could use
data test2;
set test;
newcount = symget("count");
call symput("count",newcount+1);
put _all_;
run;
and it would work (call symput is how you define a macro variable in a data step, symget is how you retrieve the value of a macro variable that isn't finalized before the data step begins). It is probably not a good idea, however - you shouldn't generally store data values in macro variables and interact repeatedly with them inside a data step. If you post more details about why you're trying to do this (ie, what your actual goal is) I'm sure several of us could offer some suggestions for how to approach the problem.

Related

SAS Server behaving strangely

I'm playing around with SAS (version: 7.11 HF2), I've a dataset which has columns A and B, variable A is decimal. When I run the below code, strangely I get a . (dot) in the first row of output.
Input data:
a, b
2.4, 1
1.2, 2
3.6, 3
Code:
data test;
c = a;
set abcd.test_data;
run;
Output data:
c, a, b
., 2.4, 1
2.4, 1.2, 2
1.2, 3.6, 3
3.6, ,
Strange things:
Derived variable is always generated on the right side, this one is being generated on left.
. (dot) is coming and the values are shifting by a row in the derived column.
Any help?
Looks like it did want you asked it to do.
On the first iteration of the data step it will set C to the value of A. The value of A is missing since you have not yet given it any value. Then the SET statement will read the first observation from your input dataset. Since there is no explicit OUTPUT statement the observation is written when the iteration reaches the end.
On the rest of the iterations of the data step the value that A will have when it is assigned to C will be the value as last read from the input dataset. Any variable that is part of an input dataset is "retained", which really just means it is not set to missing when a new iteration starts.
If the goal was to create C with the previous value of A you could have created the same output by using the LAG() function.
data test;
set abcd.test_data;
c=lag(a);
run;
Your set statement is after your variable assignment statement. SAS is first trying to assign the value of a to c, which has not yet been read. Place your set statement first, then do variable manipulation.
data test;
set abcd.test_data;
c = a;
run;
Nothing strange here, just put the SET statement before.
Datastep processing consists of 2 phases.
Compilation Phase
Execution Phase
During compilation phase, each of the statements within the data step are scanned for syntax errors.
During execution phase, a dataset's data portion is created.
It initializes variables to missing and finally executes other statements in the order determined by their location in the data step.
In your case, the set statement comes after the assignment of c. At that time a and b are set to missing, hence giving a missing value for c. Finally, the SET statement will be executed and that is why you end up with a value for both a and b on the first line.
data test;
set abcd.test_data;
c = a;
run;
Note that the first variable in your dataset is c, because this is the first stated in your code.

Output to a text file

I need to output lots of different datasets to different text files. The datasets share some common variables that need to be output but also have quite a lot of different ones. I have loaded these different ones into a macro variable separated by blanks so that I can macroize this.
So I created a macro which loops over the datasets and outputs each into a different text file.
For this purpose, I used a put statement inside a data step. The PUT statement looks like this:
PUT (all the common variables shared by all the datasets), (macro variable containing all the dataset-specific variables);
E.g.:
%MACRO OUTPUT();
%DO N=1 %TO &TABLES_COUNT;
DATA _NULL_;
SET &&TABLE&N;
FILE 'PATH/&&TABLE&N..txt';
PUT a b c d "&vars";
RUN;
%END;
%MEND OUTPUT;
Where &vars is the macro variable containing all the variables needed for outputting for a dataset in the current loop.
Which gets resolved, for example, to:
PUT a b c d special1 special2 special5 ... special329;
Now the problem is, the quoted string can only be 262 characters long. And some of my datasets I am trying to output have so many variables to be output that this macro variable which is a quoted string and holds all those variables will be much longer than that. Is there any other way how I can do this?
Do not include quotes around the list of variable names.
put a b c d &vars ;
There should not be any limit to the number of variables you can output, but if the length of the output line gets too long SAS will wrap to a new line. The default line length is currently 32,767 (but older versions of SAS use 256 as the default line length). You can actually set that much higher if you want. So you could use 1,000,000 for example. The upper limit probably depends on your operating system.
FILE "PATH/&&TABLE&N..txt" lrecl=1000000 ;
If you just want to make sure that the common variables appear at the front (that is you are not excluding any of the variables) then perhaps you don't need the list of variables for each table at all.
DATA _NULL_;
retain a b c d ;
SET &&TABLE&N;
FILE "&PATH/&&TABLE&N..txt" lrecl=1000000;
put (_all_) (+0) ;
RUN;
I would tackle this but having 1 put statement per variable. Use the # modifier so that you don't get a new line.
For example:
data test;
a=1;
b=2;
c=3;
output;
output;
run;
data _null_;
set test;
put a #;
put b #;
put c #;
put;
run;
Outputs this to the log:
800 data _null_;
801 set test;
802 put a #;
803 put b #;
804 put c #;
805 put;
806 run;
1 2 3
1 2 3
NOTE: There were 2 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds
So modify your macro to loop through the two sets of values using this syntax.
Not sure why you're talking about quoted strings: you would not quote the &vars argument.
put a b c d &vars;
not
put a b c d "&vars";
There's a limit there, but it's much higher (64k).
That said, I would do this in a data driven fashion with CALL EXECUTE. This is pretty simple and does it all in one step, assuming you can easily determine which datasets to output from the dictionary tables in a WHERE statement. This has a limitation of 32kiB total, though if you're actually going to go over that you can work around it very easily (you can separate out various bits into multiple calls, and even structure the call so that if the callstr hits 32000 long you issue a call execute with it and then continue).
This avoids having to manage a bunch of large macro variables (your &VAR will really be &&VAR&N and will be many large macro variables).
data test;
length vars callstr $32767;
do _n_ = 1 by 1 until (last.memname);
set sashelp.vcolumn;
where memname in ('CLASS','CARS');
by libname memname;
vars = catx(' ',vars,name);
end;
callstr = catx(' ',
'data _null_;',
'set',cats(libname,'.',memname),';',
'file',cats('"c:\temp\',memname,'.txt"'),';',
'put',vars,';',
'run;');
call execute(callstr);
run;

Algorithm for calculating most stable, consecutive values from a database

I have some questions and I'm in need of your input.
Say I have a database table filled with 2000-3000 rows and each row has a value and some identifiers. I am in need of withdrawing ~100 consecutive rows with the most stable values (lowest spread). It's okay with a few jumper values if you can exclude them.
How would you do this and what algorithm would you use?
I'm currently using SAS Enterprise Guide for my DB which runs on Oracle. I don't really know that much of the generic SAS language but I don't know what other language I could use for this? Some scripting language? I have limited programming knowledge but this task seems pretty easy, correct?
The algorithms I've been thinking of is:
Select 100 consecutive rows and calculate standard deviation. Increment select statement by 1 and calculate standard deviation again. Loop trough the whole table.
Export the rows with the lowest standard deviation
Same as 1, but calculate variance instead of standard deviation (basically the same thing). When the whole table has been looped, do it again but exclude 1 row which has the highest value from avg. Repeat process until 5 jumpers has been excluded and compare the results.
Pros and cons compared to method 1?
Questions:
Best & easiest method?
Prefered language? Possible in SAS?
Do you have any other method you would recommend?
Thanks in advance
/Niklas
The below code will do what you are asking. It is just using some sample data and only calcs it for 10 observations (rather than 100). I'll leave it to you to adapt as required.
Create some sample data. available to all sas installations:
data xx;
set sashelp.stocks;
where stock = 'IBM';
obs = _n_;
run;
Create row numbers and sort it descending. Makes it easier to calc stddev:
proc sort data=xx;
by descending obs;
run;
Use an array to keep the subsequent 10 obs for every row. Calculate the stddev for each row using the array (except for the last 10 rows. Remember we are working backwards through the data.
data calcs;
set xx;
array a[10] arr1-arr10;
retain arr1-arr10 .;
do tmp=10 to 2 by -1;
a[tmp] = a[tmp-1];
end;
a[1] = close;
if _n_ ge 10 then do;
std = std(of arr1-arr10);
end;
run;
Find which obs (ie. row) had the lowest stddev calc. Save it to a macro var.
proc sql noprint;
select obs into :start_row
from calcs
having std = min(std)
;
quit;
Select the 10 observations from the sample data that were involved in calcing the lowest stddev.
proc sql noprint;
create table final as
select *
from xx
where obs between &start_row and %eval(&start_row+10)
order by obs
;
quit;
An addition to Robert's solution but with part 2 included as well, creating a second array and then looping through and removing the top 5 values. You'll still the last parts of Roberts solution to extract the row with the minimum standard deviation and then the corresponding attached rows. You didn't specify how you wanted to deal with the variances that have the max removed so they are left in the dataset.
data want;
*set arrays for looping;
/*used to calculate the std*/
array p{0:9} _temporary_;
/*used to copy the array over to reduce variables*/
array ps(1:10) _temporary_;
/*used to store the var with 5 max values removed*/
array s{1:5} var1-var5;
set sample;
p{mod(_n_,10)} = open;
if _n_ ge 10 then std = std(of p{*});
*remove max values to calculate variance;
if _n_ ge 10 then do;
*copy array over to remove values;
do i=1 to 10;
ps(i)=p(i-1);
end;
do i=1 to 5;
index=whichn(max(of ps(*)), of ps(*));
ps(index)=.;
s(i)=var(of ps(*));
end;
end;
run;

Counting rows in multiple variables using SAS

I have a question in creating a count variable using SAS.
Q R
----
1 a
1 a
1 b
1 b
1 b
2 a
3 a
3 c
4 c
4 c
4 c
I need to create a variable S that counts the rows that has same combination of Q and R. The following will be the output.
Q R S
-------------------
1 a 1
1 a 2
1 b 1
1 b 2
1 b 3*
2 a 1
3 a 1
3 c 1
4 b 1
4 b 2
4 b 3
I tried using following program:
data two;
set one;
S + 1;
by Q R;
if first.Q and first.R then S = 1;
run;
But, this did not run correctly. For example, * will come out as 1 instead of 3. I would appreciate any tips on how to make this counting variable work correctly.
Very close, your if statement is should be first.R (or change the and to OR but that isn't efficient). I usually prefer to have the increment after the set to 1.
data two;
set one;
by Q R;
*Retain S; *implicitly retained by using the +1 notation;
if first.R then S = 1;
else S+1;
run;
Reese's example is certainly sufficient in this case, but given this was a simple question with a mostly uninteresting answer, I'll instead present a very small variant simply from a programming style standpoint.
data two;
set one;
by Q R;
if first.R then s=0;
s+1;
run;
This will likely function exactly the same as Reese's code and the original question's code once the first.Q is removed. However, it has two slight differences.
First off, I like to group if first. variable-resetting code (that isn't otherwise dependent on location) in one place, as early as possible in the code (after by statements, where, "early" subsetting if, array, format, length, and retain). This is useful in an organizational standpoint because this code (usually) is something that roughly parallels what SAS does in between data step iterations, but for BY groups - so it's nice to have it at the start of the data step.
Second, I like initializing to zero. That avoids the need for the else conditional, which makes the code clearer; you're not doing anything different on the first row per BY group other than the re-initialization, so it makes sense to not have the increment be conditional.
These are both helpful in a more complex version of this data step; obviously in this particular step it doesn't matter much, but in a larger data step it can be helpful to organize your code more effectively. Imagine this data step:
data two;
set one;
by Q R;
retain Z;
format Z $12.;
array CS[15];
if first.R then do; *re-init block;
S=0;
Z=' ';
end;
S+1; *increment counter;
do _t = 1 to dim(CS);
Z = cats(Z,ifc(CS[_t]>0,put(CS[_t],8.),''));
end;
keep Q R S Z;
run;
That data step is nicely organized, and has a logical flow, even though almost every step could be moved anywhere else in the data step. Grouping the initializations together makes it more readable, particularly if you always group things that way.

SAS Call a macro variable

I have a program that I want to run on several years. Therefore at some time I have to select my data
data want;
set have(where='2014');
run;
To try on several years, I have a macro variable, that I define as
%let an=14
/*It is 14 and not 2014, because elsewhere I need it that way.*/
But then when I try to put it in my program it does not work at all
data want;
set have(where="&&20&an.");
run;
I would appreciate some help
First edit : changed ' ' into " ", but still does not work
Second edit and answer
"20&an"
The answer you arrived at (20&an) is correct - you are all set. You don't even need to read the rest of this answer I've posted :-)
However I noticed you were a bit confused about & vs. &&. If you'd like to know more about that, I put together some extra info on the difference between & and &&, as well as the purpose of &&, in SAS macro evaluation.
The & is the most common symbol - you simply use that to evaluate/de-reference a variable. So:
%LET an = 14 ;
%PUT ----- an ;
%PUT ----- &an ;
Output:
----- an
----- 14
So as you can see, you must put a & before the variable name in order to de-reference it to its value. Omitting the & simply prints the string an, which happens to be the name of the variable in this case. For most macro coding, & is all you will ever need. & in SAS macro is like $ in shell, * in C, etc.
Now, what is && for? It exists to enable you to have dynamic macro variable names. That is, having a macro variable whose value is the name of another macro variable. If you are familiar with C, you can think of this as a pointer to a pointer.
The way SAS evaluates && is in two passes. In the first pass, it converts && to &. At the same time, any & symbols it sees in that pass will be used to de-reference the variable names they are next to. The idea is for these latter expressions to resolve to a variable name. Then, in the second pass, the remaining & symbols (all originally && symbols) de-reference whatever variable names they now find themselves next to.
Here is an example with sample output:
%LET x = 3;
%LET name_of_variable = x;
%PUT ----- &x;
%PUT ----- &&name_of_variable;
%PUT ----- &&&name_of_variable;
Output:
----- 3
----- x
----- 3
In the first %PUT, we are just using plain old &, and thus we are doing what we did before, reading and printing the value that x holds. In the second %PUT, things get slightly more interesting. Because of the &&, SAS does two passes of evaluation. The first one converts this:
%PUT ----- &&name_of_variable;
To this
%PUT ----- &name_of_variable;
In the second pass, SAS does the standard & evaluation to print the value held in name_of_variable - the string x, which happens to be the name of the other variable we are using. Of course, this example is especially contrived: why would you write &&name_of_variable when you could have just written &name_of_variable?
In the third %PUT, where we now have the &&&, SAS does two passes. Here is where we finally see the true purpose of &&. I will put pieces of the expression in parentheses so you can see how they are evaluated. We go from this:
%PUT ----- (&&)(&name_of_variable);
To this:
%PUT ----- &x;
So the in the first pass, the && was converted to &, and &name_of_variable was a simple de-referencing of name_of_variable, evaluating to the content it is holding, which as we said was x.
So in the second pass, we are just left with the simple evaluation:
%PUT ----- &x;
As we had set x equal to 3, this evaluates to 3.
So in a sense, saying &&&name_of_variable is saying "Show me the value of the variable whose name is stored in name_of_variable."
Here is a motivating example for why you would want to do this. Suppose you had a simple macro subroutine that added an arbitrary number to a numerical value stored in a SAS macro variable. To do that, the subroutine would have to know the amount to add, but more importantly, it would need to know the name of the variable to add to. You would accomplish this dynamic variable naming via the && mechanism, like so:
%MACRO increment_by_amount (var_name = , amount = );
%LET &var_name = %EVAL (&&&var_name + &amount) ;
/* Note: this could also begin with %LET &&var_name = .... */
%MEND;
Here we are saying: "Let the variable whose name is held in var_name (i.e. &var_name) equal the value of the variable whose name is held in var_name (i.e. &&&var_name) plus the value held in amount (i.e. &amount).
When you call a subroutine like this, make sure you are passing the variable name, not the value. That is, say this:
%increment_by_amount (var_name = x , amount = 3 );
Not this:
%increment_by_amount (var_name = &x , amount = 3);
So an example of invocation would be:
%LET x = 3;
%PUT ----- &x;
%increment_by_amount (var_name = x , amount = 3 );
%PUT ----- &x;
Output:
----- 3
----- 6