SAS: How to use RETAIN statement to create a summed variable in the DATA step, equivalent to the SUM statement output in PROC PRINT

In SAS, I'm trying to create a variable that is the sum of another. In this case, I am trying to create two variables: Total_All_Ages, which is the sum of the 2013 US population POPESTIMATE2013, and Total_18Plus, which is the sum of the 2013 US population aged 18+ POPEST18PLUS2013.
I want the output of these variables to appear as though I had used the sum statement under proc print (where the sum appears at the bottom of the variable column in a new row). However, I do not want to use the print procedure. Instead, I want to create my output only using the data step.
The way I need to do this is with the retain (and input) statement.
My code is as follows:
data _NULL_;
retain Total_All_Ages Total_18Plus;
infile RAWfoldr DLM=',' firstobs=3 obs=53;
informat STATE $2. NAME $20.;
input SUMLEV REGION $ DIVISION STATE $ NAME $ POPESTIMATE2013 POPEST18PLUS2013 PCNT_POPEST18PLUS;
Total_All_Ages = sum(Total_All_Ages, POPESTIMATE2013);
Total_18Plus = sum(Total_18Plus, POPEST18PLUS2013);
keep STATE NAME POPESTIMATE2013 POPEST18PLUS2013 Total_All_Ages Total_18Plus;
format POPESTIMATE2013 comma11. POPEST18PLUS2013 comma11.;
file print notitles;
if _n_=1 then put '=== U.S. Resident Population Estimates for All Ages and ===
Ages 18 or Older by State (in Alphabetical Order), 2013';
if _n_=1 then put ' ';
if _n_=1 then put @5 'FIPS Code' @16 'State Name' @40 'All Ages' @55 'Ages 18 or Older';
if _n_=1 then put ' ';
put @5 STATE @16 NAME @40 POPESTIMATE2013 @55 POPEST18PLUS2013;
run;
You can see that in my input statement, I create the two variables that I mentioned. I also mention them in my retain statement. However, I'm not sure how to make them appear in my output in the way I specified.
I want them to appear as a Total line at the bottom of the output, like this:
POPESTIMATE2013 POPEST18PLUS2013
112312234 1234123412341234
23413412341234 213412341234
============ ============ ============
Total 23423423429 242234545345
Is there a way to put these variables on a new line at the very bottom of the output (sort of like how I put the variable labels using the if _n_=1 code)?
Let me know if I need to explain myself better. I appreciate any help with this. Thank you.

If I understand your question, you're almost there.
First, add end=eof to your infile statement. This creates a temporary variable "eof" that is 0 until SAS reads the last line of data, at which point it becomes 1. The same option works in a set statement as well.
Next, add this do block, which will execute when SAS is on the last line of the file:
if eof then do;
put @5 9*'=' @40 11*'=' @55 11*'=';
put @5 'Total' @40 Total_All_Ages comma11. @55 Total_18Plus comma11.;
end;
Here, you use put statements to print out the formatting (repeated ='s signs) and the totals. Complete code is below:
data _NULL_;
retain Total_All_Ages Total_18Plus;
infile RAWfoldr DLM=',' firstobs=3 obs=53 end=eof;
informat STATE $2. NAME $20.;
input SUMLEV REGION $ DIVISION STATE $ NAME $ POPESTIMATE2013 POPEST18PLUS2013 PCNT_POPEST18PLUS;
Total_All_Ages = sum(Total_All_Ages, POPESTIMATE2013);
Total_18Plus = sum(Total_18Plus, POPEST18PLUS2013);
keep STATE NAME POPESTIMATE2013 POPEST18PLUS2013 Total_All_Ages Total_18Plus;
format POPESTIMATE2013 comma11. POPEST18PLUS2013 comma11.;
file print notitles;
if _n_=1 then put '=== U.S. Resident Population Estimates for All Ages and ===
Ages 18 or Older by State (in Alphabetical Order), 2013';
if _n_=1 then put ' ';
if _n_=1 then put @5 'FIPS Code' @16 'State Name' @40 'All Ages' @55 'Ages 18 or Older';
if _n_=1 then put ' ';
put @5 STATE @16 NAME @40 POPESTIMATE2013 comma11. @55 POPEST18PLUS2013 comma11.;
if eof then do;
put @5 9*'=' @40 11*'=' @55 11*'=';
put @5 'Total' @40 Total_All_Ages comma11. @55 Total_18Plus comma11.;
end;
run;
One final note on your code: you can right-align your numbers by specifying a format followed by "-r" in your put statement, e.g.:
put @5 STATE @16 NAME @40 POPESTIMATE2013 comma11.-r @55 POPEST18PLUS2013 comma11.-r;
This will override any format statement you have.
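The same end= technique can be tried without the raw file. The following is only a minimal sketch, assuming the SASHELP.CLASS sample data set is available; it stands in for the population file, and total_height is a made-up variable name. It also uses a sum statement (total_height + height;) in place of retain plus the sum() function, since a sum statement retains its variable automatically and starts it at 0.

```sas
/* Sketch only: SASHELP.CLASS stands in for the population file, and  */
/* total_height is a hypothetical name that is not from the question. */
data _null_;
  set sashelp.class end=eof;       /* eof becomes 1 on the last observation */
  file print notitles;
  total_height + height;           /* sum statement: retained, starts at 0 */
  if _n_ = 1 then put @1 'Name' @12 'Height';
  put @1 name @12 height 8.1;
  if eof then do;
    put @1 8*'=' @12 8*'=';        /* separator line above the total */
    put @1 'Total' @12 total_height 8.1;
  end;
run;
```

The total line is written only once because the if eof block runs in the same iteration that reads the last observation, after the detail line for that observation has been written.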

Related

different variables with the same name in a sas dataset

I have a text file where two different columns are having the same name. As shown in the following figure.
Let's say for SystBP, I need to change the first SystBP to SystBP_B and the second SystBP to SystBP_E.
Could someone kindly offer me some help on this?
When programming in Base SAS, you should not always expect SAS to read column names from a text file and interpret them as variable names.
You have to tell SAS which row is the first data row, where the values are written, and how they should be interpreted (text, number, date, ...). You do that with an infile and an input statement in a data step.
As you write the code yourself, you have complete control.
data READ_FROM_TXT;
infile "C:\myFolder\myFile.txt" firstobs=3 truncover;
* firstobs=3 makes SAS skip the first 2 observations;
* truncover avoids jumping to the next line when the last variable is missing or too short ;
input
@01 ID 2.
@05 Week 4.
@11 SystBP_B 6.
@19 DiastBP_B 6.
...
@41 SystBP_E 6.
@49 DiastBP_E 6.
...
;
* @11 SystBP_B 6. instructs SAS to interpret positions 11 to 16 as a number
* and assign the value to variable SystBP_B;
run;
As you inserted the data as an image, not as text, using markup, I had to guess the positions, so you will have to correct them.
I would make timing into observations.
data test;
infile cards4 firstobs=2;
input id :$8. week @;
do time = 'STR','END';
input SystBP DiastBP Pulse Stress @;
output;
end;
cards4;
ID Week SystBP DiastBP Pulse Stress SystBP DiastBP Pulse Stress
1 1 134 44 66 5.8 134 44 66 5.8
;;;;
run;
The INFILE option FIRSTOBS= will let you INPUT the data starting in row 3.
Data file: C:\temp\bp-survey.txt
Start End
ID Week SystBP DiastBP Pulse Stress SystBP DiastBP Pulse Stress
1 1 134 44 66 5.8 134 44 66 5.8
...
Program
filename survey 'c:\temp\bp-survey.txt';
data want;
infile survey firstobs=3;
input
ID Week
SystBP_start DiastBP_start Pulse_start Stress_start
SystBP_end DiastBP_end Pulse_end Stress_end
;
run;
ods html ;
proc print data=want;
run;

SAS - creating macros dynamically

I would like to dynamically create macros to query a transactional data set. I have a table that has a set of parameters (parameter_data) and transaction data (txs). For each row in my parameter data I want to create a macro that can be called to query the data.
Parameter data:
data parameter_data;
input macro_name $ parameter_name $ parameter_value $;
datalines;
A Person_ID 1
B TX_ID 2
;
Transactional Data:
data txns;
input Person_ID $ TX_ID $ TX_Amount $;
datalines;
John Sales 1123
Mary Acctng 34
John Sales 23
Mary Sales 2134
;
Here I try to create a macro that should create macros dynamically according to the parameter data. The 'inner macros' are the macros that are created from the parameter data.
%macro outerMacro;
/*loop through each row in the parameter table to get the detail of the macro we want to create*/
%DO ROW = 1 %To 2;
data _NULL_;
set parameter_data;
if _N_ = ROW then do;
call symputx('parameter_name',parameter_name);
call symputx('parameter_value',parameter_value);
end;
run;
/*define inner macro parameters*/
%let macroName = myMacro; /*set the name of the macro we want to create*/
%let innerMacroStart = macro &macroName.; /*set the macro name to start the macro definition*/
%let innerMacroEnd = mend &macroName;
%&&innerMacroStart.; /*start the inner macro*/
/*body of the macro*/
data output;
set txns;
&&parameter_name = &&parameter_value;
/*so here effectively for the first row in the parameter table we are filtering where person_id = John*/
run;
%&&innerMacroEnd.; /*end the inner macro*/
%mend outerMacro;
%&&outerMacroName.;
It seems that SAS is unable to parse the lines %innerMacroStart. Any help is much appreciated.
Thanks!
If the goal is just to subset data then it might be better to generate macro variables instead of actual macros. Try something like this instead.
data _null_;
set parameter_data ;
call symputx(macro_name,catx(' ','where also'
,parameter_name,'=',quote(trim(parameter_value)),';'));
run;
Then just use the generated where statement(s) when you need them by expanding the macro variable. Like this:
data output ;
set txns;
&a
run;
If you really want to generate a macro definition then you probably want to just use a data step to write the code to a file and then %include the file to compile the macros. That will be much easier to debug than macro logic.
Let's fix your parameter file to better match your test data. Person_ID and TX_ID are character variables in your transaction dataset. You will probably need to add logic or change the parameter file to allow it to handle testing of both numeric and character variables. For now I just made it generate code that assumes that PARAMETER_NAME refers to a character variable so that PARAMETER_VALUE will need to have quotes added to make it a string literal.
data parameter_data;
input macro_name :$32. parameter_name :$32. parameter_value :$200.;
datalines;
A Person_ID John
B TX_ID Acctng
;
data txns;
input Person_ID $ TX_ID $ TX_Amount $;
datalines;
John Sales 1123
Mary Acctng 34
John Sales 23
Mary Sales 2134
;
Now let's run a data step to generate the code for all of your macros. I added logic to use AND if there were multiple "parameters" defined for each macro.
filename code temp;
data _null_;
set parameter_data ;
by macro_name ;
file code ;
if first.macro_name then put
'%macro ' macro_name ';'
/ 'data output;'
/ ' set txns;'
/ ' where ' @
;
else put ' and ' @ ;
put parameter_name '=' parameter_value :$quote. @ ;
if last.macro_name then put
';'
/ 'run;'
/ '%mend ' macro_name ';'
;
run;
Now just use %include to compile the macros.
%include code / source2 ;
NOTE: %INCLUDE (level 1) file CODE is file C:\...\#LN00048.
432 +%macro A ;
433 +data output;
434 + set txns;
435 + where Person_ID ="John" ;
436 +run;
437 +%mend A ;
438 +%macro B ;
439 +data output;
440 + set txns;
441 + where TX_ID ="Acctng" ;
442 +run;
443 +%mend B ;
NOTE: %INCLUDE (level 1) ending.
Now you can use your macros.
445 options mprint;
446 %a ;
MPRINT(A): data output;
MPRINT(A): set txns;
MPRINT(A): where Person_ID ="John" ;
MPRINT(A): run;
NOTE: There were 2 observations read from the data set WORK.TXNS.
WHERE Person_ID='John';
NOTE: The data set WORK.OUTPUT has 2 observations and 3 variables.
447 %b ;
MPRINT(B): data output;
MPRINT(B): set txns;
MPRINT(B): where TX_ID ="Acctng" ;
MPRINT(B): run;
NOTE: There were 1 observations read from the data set WORK.TXNS.
WHERE TX_ID='Acctng';
NOTE: The data set WORK.OUTPUT has 1 observations and 3 variables.
I have placed a comment before each block of code, but essentially it is:
Parameter set up.
Macro generation.
%include.
Call any desired macro.
I have assumed no more than 999 parameter observations - this is controlled by seq.
You can examine file "inner_macro.sas" to see the macro definitions.
NB. When you try it, make sure to use your own path in place of <your-path> (occurs twice):
/* set up parameters */
data parameters;
infile datalines dlm=',';
input var : $8.
operator : $8.
value : $8.
;
datalines;
name,eq,"John"
age,gt,12
weight,eq,0
;
/* read parameters and generate a macro definition for each obs, written to a file */
data _null_;
file '<your-path>/inner_macro.sas';
set parameters;
seq = put(_n_,z3.);
put '%macro inner_' seq ';';
put ' where ' var operator value ';';
put '%mend inner_' seq ';';
put;
run;
/* %include (submits code in file) all of the macro definitions */
%include '<your-path>/inner_macro.sas';
options mprint;
/* invoke the macro with the required data sets */
data class1;
set sashelp.class;
%inner_001;
run;
data class2;
set sashelp.class;
%inner_002;
run;
data class3;
set sashelp.class;
%inner_003;
run;

Is there a way to input hundreds of variables into SAS without using each variable separately?

I have a set of data of gym membership starting with an ID, then 119 in-time columns and 119 out-time columns. The in-time and out-time columns are in the syntax of ##:##:## and I am trying to input the variables in the simplest way. Rather than writing [ID in1 $ in2 $ inX $ out1 $ out2 $ outX $], is there a way to easily input hundreds of columns in a simple line of code?
Just use variable lists. Let's assume your data file is comma delimited.
data want ;
infile 'myfile.csv' dsd truncover ;
input id (in1-in119 out1-out119) (:time8.) ;
format in1-in119 out1-out119 time8.;
run;
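To try the variable-list syntax without the real file, here is a minimal sketch that reads inline data instead. It assumes only three in-times and three out-times per member; the data set name want_small and the values are invented for illustration.

```sas
/* Sketch only: three in/out pairs instead of 119, with made-up values. */
data want_small;
  infile datalines dsd truncover;
  /* One informat list applies to every variable in the variable list */
  input id (in1-in3 out1-out3) (:time8.);
  format in1-in3 out1-out3 time8.;
datalines;
1,08:00:00,09:15:00,17:30:00,08:45:00,10:00:00,18:10:00
2,07:30:00,,,08:20:00,,
;

proc print data=want_small;
run;
```

With the dsd option, two consecutive commas are read as a missing value, so members with fewer visits still line up with the right columns.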
"proc import" can be an alternative solution.
It defines data type automatically.
The statement looks like the following:
proc import
datafile = "myfile.csv"
out = work.destination_table
dbms = csv replace
;
run;

How to import specific lines from dat file

I've got a .dat file with numbers that I need imported into a SAS dataset. However, there's plenty of information that I do not need, and I only want specific lines of data (e.g. every 6th line starting from line 1000, until I have 100 observations). I also require a unique identifier based on what is displayed on the first line.
So for example, the .dat file contains this:
DATANOTREQUIRED
DATANOTREQUIRED
DATANOTREQUIRED
UPDATE AAA_1111111_Q_BBBBBB_0_1_#
123.4,
123.5,
124.0,
124.1
DATANOTREQUIRED
DATANOTREQUIRED
DATANOTREQUIRED
UPDATE AAA_1111111__Q_BBBBBB_0_2_#
125.1,
126.0,
127.1,
130.0
What I want the eventual SAS dataset to look like is this
Identifier | Value
X.1. | 124.1
X.2. | 130.0
I'm using the infile in SAS and using input to point to line 1000 but I'm stuck and cannot get the SAS dataset I want. (Updated code based on contributors below)
data work.test;
infile '\\filepath\mydatasource.dat' dsd firstobs=1042 truncover;
input #8 ID :$40.
#4 Value1 :8.;
run;
but what I'm seeing now is that the header lines are appearing fine, but the first observation has a . and instead the first data value is appearing for the 2nd header line.
ID | Value1
UPDATE AAA_1111111_Q_BBBBBB_0_1_# | .
UPDATE AAA_1111111__Q_BBBBBB_0_2_# | 124.1
Here's an example assuming that you have the same number of rows between each header row:
data want;
if _n_ > 2 then stop; /*Stop after we've output 2 rows */
infile cards firstobs=6; /*Skip the first 5 lines in the file*/
input #1 @8 ID :$32.
#5 myvar :8.;
cards;
UPDATE AAA_1111111_Q_BBBBBB_0_1_#
123.4,
123.5,
124.0,
124.1
UPDATE AAA_1111111__Q_BBBBBB_0_2_#
125.1,
126.0,
127.1,
130.0
UPDATE AAA_1111111_Q_BBBBBB_0_3_#
123.4,
123.5,
124.0,
124.1
UPDATE AAA_1111111__Q_BBBBBB_0_4_#
125.1,
126.0,
127.1,
130.0
;
run;
Use the FIRSTOBS= option to skip the beginning of the file.
If there are always 5 records per block you could just read them individually.
data want;
infile rawdata dsd firstobs=1000 truncover;
input id :$40. (4*value) (/) ;
run;
Or you could do something like this that should allow for a variable number of values per id and just keep the last one.
data want;
infile rawdata dsd firstobs=1000 end=eof;
input @ ;
length id $32 value 8 ;
retain id value;
if _infile_ =: 'UPDATE' then do;
if _n_ > 1 then output;
id = scan(_infile_,-1,' ');
end;
else input value;
if eof and _n_ > 1 then output;
run;

SAS - @ symbol in the INPUT statement

I have the following program, but don't understand what the @ symbol at the end of the INPUT lines does:
data colors;
input @1 Var1 $ @8 Var2 $ @;
input @1 Var3 $ @8 Var4 $ @;
datalines;
RED ORANGE YELLOW GREEN
BLUE INDIGO PURPLE VIOLET
CYAN WHOTE FICSIA BLACK
GRAY BROWN PINK MAGENTA
run;
proc print data=colors;
run;
The output produced without @ at the end of the INPUT lines is different from the output with @.
Can you please clarify what the @ at the end of the 2nd and 3rd lines (the INPUT statements) does?
A trailing @ at the end of an input statement means: do not advance the line pointer when the statement ends at the semicolon. A double trailing @@ means: do not advance the line pointer when the data step iteration ends either.
Normally an input statement implicitly advances the line pointer to the next line when it reaches the semicolon. So:
data want;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Will return
1 2
5 6
If you want to read 3 4 into another line, then, you might do something like:
data want;
input a b @;
output;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Which gives
1 2
3 4
5 6
7 8
Similarly you could simply write
data want;
input a b @@;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
To get the same result: @@ holds the line pointer even across data step iterations. (It still advances once it hits the end of the line.)
In summary: I think you probably don't want the trailing @ in this case. The input statements do not seem to fit the data you are reading. With the trailing @, you are reading the same data into var1 and var3, and the same data into var2 and var4, because the same line is read twice. Either way, you are not reading in what the data appears to be. You would be better off with:
input Var1 $ Var2 $ @;
input Var3 $ Var4 $;
Or, more simply:
input Var1 $ Var2 $ Var3 $ Var4 $;
Official details from the SAS support site, annotated:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000146292.htm
Using Line-Hold Specifiers
Line-hold specifiers keep the pointer on the current input record when a data record is read by more than one INPUT statement (trailing @).
Use a single trailing @ to allow the next INPUT statement to read from the same record.
Normally, each INPUT statement in a DATA step reads a new data record into the input buffer. When you use a trailing @, the following occurs:
The pointer position does not change.
No new record is read into the input buffer.
The next INPUT statement for the same iteration of the DATA step continues to read the same record rather than a new one.
SAS releases a record held by a trailing @ when
a null INPUT statement executes:
input;
an INPUT statement without a trailing @ executes
the next iteration of the DATA step begins.
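A common use of the single trailing @ is to look at part of a record before deciding how to read the rest. The following is a minimal sketch of that pattern; the data set name mixed, the variable names, and the record layouts are invented for illustration and are not from the question.

```sas
/* Sketch only: hypothetical record types A and B with different layouts. */
data mixed;
  infile datalines truncover;
  input type $ @;                  /* read the type, then hold the record */
  if type = 'A' then input amount;             /* releases the held record */
  else if type = 'B' then input qty price;     /* releases the held record */
datalines;
A 100
B 2 9.5
A 250
;

proc print data=mixed;
run;
```

Each iteration of the data step reads exactly one record: the second input statement continues on the held line and, because it has no trailing @, releases it, so the next iteration starts on a new record.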