I can create a Numbered Range List of numeric type, but not character type.
My code is similar to this:
DATA TestDataset;
INPUT a1-a3 $;
DATALINES;
A B C
;
RUN;
This produces 3 variables - [a1], [a2] and [a3] as expected. However [a3] is character, but [a1] and [a2] are numeric. This leaves me with missing values as per the following table:
a1 a2 a3
. . C
The following code works, but obviously it does not scale nicely.
INPUT a1 $ a2 $ a3 $;
Am I missing something?
I believe you can use the hyphen notation in the LENGTH statement to get what you want. You really should use a LENGTH statement regardless; otherwise each character variable defaults to a length of 8 ($8.).
DATA TestDataset;
length a1-a3 $20;
INPUT a1-a3 ;
DATALINES;
A B C
;
RUN;
I came up with a macro solution:
%MACRO var_list_char (var_prefix, n);
%LOCAL i ;
%DO i = 1 %TO &n;
&var_prefix&i$
%END;
%MEND;
DATA TestDataset;
INPUT %var_list_char (a, 3);
DATALINES;
A B C
;
RUN;
I wish I could find a way to do this without macros - I will keep digging for a bit and will update this post if I find more. In the meantime, the above approach will definitely work.
UPDATE 1: @carolinajay65's solution above is the correct non-macro approach.
UPDATE 2: There is another way that I found.
DATA TestDataset;
INPUT (a1-a3) ($);
DATALINES;
A B C
;
RUN;
More documentation of the language features supporting this technique can be found here, in the section labeled "How to Group Variables and Informats".
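The grouped form also accepts an informat with a width. A minimal sketch, assuming values of up to 20 characters (the width, the sample values, and the PROC CONTENTS check are illustrative assumptions, not part of the original post). The colon modifier keeps list-style reading, so each value ends at the next blank:

```sas
data TestDataset;
  /* (:$20.) is applied to every variable in the list (a1-a3) */
  input (a1-a3) (:$20.);
  datalines;
Alphabetical Bravo Charlie
;
run;

/* Confirm that a1-a3 are all character and check their lengths */
proc contents data=TestDataset;
run;
```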
I would like to dynamically create macros to query a transactional data set. I have a table that has a set of parameters (parameter_data) and transaction data (txs). For each row in my parameter data I want to create a macro that can be called to query the data.
Parameter data:
data parameter_data;
input macro_name $ parameter_name $ parameter_value $;
datalines;
A Person_ID 1
B TX_ID 2
;
Transactional Data:
data txns;
input Person_ID $ TX_ID $ TX_Amount $;
datalines;
John Sales 1123
Mary Acctng 34
John Sales 23
Mary Sales 2134
;
Here I try to create a macro that should create macros dynamically according to the parameter data. The 'inner macros' are the macros that are created from the parameter data.
%macro outerMacro;
/*loop through each row in the parameter table to get the detail of the macro we want to create*/
%DO ROW = 1 %To 2;
data _NULL_;
set parameter_data;
if _N_ = ROW then do;
call symputx('parameter_name',parameter_name);
call symputx('parameter_value',parameter_value);
end;
run;
/*define inner macro parameters*/
%let macroName = myMacro; /*set the name of the macro we want to create*/
%let innerMacroStart = macro &macroName.; /*set the macro name to start the macro definition*/
%let innerMacroEnd = mend &macroName;
%&&innerMacroStart.; /*start the inner macro*/
/*body of the macro*/
data output;
set txns;
&parameter_name = &parameter_value;
/*so here effectively for the first row in the parameter table we are filtering where person_id = John*/
run;
%&&innerMacroEnd.; /*end the inner macro*/
%mend outerMacro;
%&&outerMacroName.;
It seems that SAS is unable to parse the lines %innerMacroStart. Any help is much appreciated.
Thanks!
If the goal is just to subset data then it might be better to generate macro variables instead of actual macros. Try something like this instead.
data _null_;
set parameter_data ;
call symputx(macro_name,catx(' ','where also'
,parameter_name,'=',quote(trim(parameter_value)),';'));
run;
Then just use the generated where statement(s) when you need them by expanding the macro variable. Like this:
data output ;
set txns;
&a
run;
If you really want to generate a macro definition then you probably want to just use a data step to write the code to a file and then %include the file to compile the macros. That will be much easier to debug than macro logic.
Let's fix your parameter file to better match your test data. Person_ID and TX_ID are character variables in your transaction dataset. You will probably need to add logic or change the parameter file to allow it to handle testing of both numeric and character variables. For now I just made it generate code that assumes that PARAMETER_NAME refers to a character variable so that PARAMETER_VALUE will need to have quotes added to make it a string literal.
data parameter_data;
input macro_name :$32. parameter_name :$32. parameter_value :$200.;
datalines;
A Person_ID John
B TX_ID Acctng
;
data txns;
input Person_ID $ TX_ID $ TX_Amount $;
datalines;
John Sales 1123
Mary Acctng 34
John Sales 23
Mary Sales 2134
;
Now let's run a data step to generate the code for all of your macros. I added logic to use AND if there were multiple "parameters" defined for each macro.
filename code temp;
data _null_;
set parameter_data ;
by macro_name ;
file code ;
if first.macro_name then put
'%macro ' macro_name ';'
/ 'data output;'
/ ' set txns;'
/ ' where ' @
;
else put ' and ' @ ;
put parameter_name '=' parameter_value :$quote. @ ;
if last.macro_name then put
';'
/ 'run;'
/ '%mend ' macro_name ';'
;
run;
Now just use %include to compile the macros.
%include code / source2 ;
NOTE: %INCLUDE (level 1) file CODE is file C:\...\#LN00048.
432 +%macro A ;
433 +data output;
434 + set txns;
435 + where Person_ID ="John" ;
436 +run;
437 +%mend A ;
438 +%macro B ;
439 +data output;
440 + set txns;
441 + where TX_ID ="Acctng" ;
442 +run;
443 +%mend B ;
NOTE: %INCLUDE (level 1) ending.
Now you can use your macros.
445 options mprint;
446 %a ;
MPRINT(A): data output;
MPRINT(A): set txns;
MPRINT(A): where Person_ID ="John" ;
MPRINT(A): run;
NOTE: There were 2 observations read from the data set WORK.TXNS.
WHERE Person_ID='John';
NOTE: The data set WORK.OUTPUT has 2 observations and 3 variables.
447 %b ;
MPRINT(B): data output;
MPRINT(B): set txns;
MPRINT(B): where TX_ID ="Acctng" ;
MPRINT(B): run;
NOTE: There were 1 observations read from the data set WORK.TXNS.
WHERE TX_ID='Acctng';
NOTE: The data set WORK.OUTPUT has 1 observations and 3 variables.
I have placed a comment before each block of code, but essentially it is:
Parameter set up.
Macro generation.
%include.
Call any desired macro.
I have assumed no more than 999 parameter observations - this is controlled by seq.
You can examine file "inner_macro.sas" to see the macro definitions.
NB. When you try it, make sure to use your own path in place of <your-path> (occurs twice):
/* set up parameters */
data parameters;
infile datalines dlm=',';
input var : $8.
operator : $8.
value : $8.
;
datalines;
name,eq,"John"
age,gt,12
weight,eq,0
;
/* read parameters and generate a macro definition for each obs, written to a file */
data _null_;
file '<your-path>/inner_macro.sas';
set parameters;
seq = put(_n_,z3.);
put '%macro inner_' seq ';';
put ' where ' var operator value ';';
put '%mend inner_' seq ';';
put;
run;
/* %include (submits code in file) all of the macro definitions */
%include '<your-path>/inner_macro.sas';
options mprint;
/* invoke the macro with the required data sets */
data class1;
set sashelp.class;
%inner_001;
run;
data class2;
set sashelp.class;
%inner_002;
run;
data class3;
set sashelp.class;
%inner_003;
run;
I need to output lots of different datasets to different text files. The datasets share some common variables that need to be output but also have quite a lot of different ones. I have loaded these different ones into a macro variable separated by blanks so that I can macroize this.
So I created a macro which loops over the datasets and outputs each into a different text file.
For this purpose, I used a put statement inside a data step. The PUT statement looks like this:
PUT (all the common variables shared by all the datasets), (macro variable containing all the dataset-specific variables);
E.g.:
%MACRO OUTPUT();
%DO N=1 %TO &TABLES_COUNT;
DATA _NULL_;
SET &&TABLE&N;
FILE 'PATH/&&TABLE&N..txt';
PUT a b c d "&vars";
RUN;
%END;
%MEND OUTPUT;
Where &vars is the macro variable containing all the variables needed for outputting for a dataset in the current loop.
Which gets resolved, for example, to:
PUT a b c d special1 special2 special5 ... special329;
Now the problem is that a quoted string can only be 262 characters long. Some of the datasets I am trying to output have so many variables that the macro variable holding them, once quoted, is much longer than that. Is there another way to do this?
Do not include quotes around the list of variable names.
put a b c d &vars ;
There should not be any limit to the number of variables you can output, but if the length of the output line gets too long SAS will wrap to a new line. The default line length is currently 32,767 (but older versions of SAS use 256 as the default line length). You can actually set that much higher if you want. So you could use 1,000,000 for example. The upper limit probably depends on your operating system.
FILE "PATH/&&TABLE&N..txt" lrecl=1000000 ;
If you just want to make sure that the common variables appear at the front (that is you are not excluding any of the variables) then perhaps you don't need the list of variables for each table at all.
DATA _NULL_;
retain a b c d ;
SET &&TABLE&N;
FILE "&PATH/&&TABLE&N..txt" lrecl=1000000;
put (_all_) (+0) ;
RUN;
I would tackle this by having one PUT statement per variable. End each PUT with a trailing @ so that the next PUT continues on the same line instead of starting a new one.
For example:
data test;
a=1;
b=2;
c=3;
output;
output;
run;
data _null_;
set test;
put a @;
put b @;
put c @;
put;
run;
Outputs this to the log:
800 data _null_;
801 set test;
802 put a @;
803 put b @;
804 put c @;
805 put;
806 run;
1 2 3
1 2 3
NOTE: There were 2 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds
So modify your macro to loop through the two sets of values using this syntax.
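One way that loop might look is sketched below. The macro name put_each is hypothetical, and the sketch assumes the variable names arrive as a blank-separated list, as in the question's &vars:

```sas
/* Sample data, same as in the example above */
data test;
  a=1; b=2; c=3;
  output;
  output;
run;

/* Hypothetical helper: generates one PUT statement per name in VARS */
%macro put_each(vars);
  %local i;
  %do i = 1 %to %sysfunc(countw(&vars, %str( )));
    put %scan(&vars, &i, %str( )) @;  /* trailing @ holds the line */
  %end;
  put;  /* release the held line so the next observation starts a new one */
%mend put_each;

data _null_;
  set test;
  %put_each(a b c)
run;
```

With OPTIONS MPRINT turned on, the log shows the generated PUT statements, which makes it easy to check that the loop expands as intended.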
Not sure why you're talking about quoted strings: you would not quote the &vars argument.
put a b c d &vars;
not
put a b c d "&vars";
There's a limit there, but it's much higher (64k).
That said, I would do this in a data driven fashion with CALL EXECUTE. This is pretty simple and does it all in one step, assuming you can easily determine which datasets to output from the dictionary tables in a WHERE statement. This has a limitation of 32kiB total, though if you're actually going to go over that you can work around it very easily (you can separate out various bits into multiple calls, and even structure the call so that if the callstr hits 32000 long you issue a call execute with it and then continue).
This avoids having to manage a bunch of large macro variables (your &VAR will really be &&VAR&N and will be many large macro variables).
data test;
length vars callstr $32767;
do _n_ = 1 by 1 until (last.memname);
set sashelp.vcolumn;
where memname in ('CLASS','CARS');
by libname memname;
vars = catx(' ',vars,name);
end;
callstr = catx(' ',
'data _null_;',
'set',cats(libname,'.',memname),';',
'file',cats('"c:\temp\',memname,'.txt"'),';',
'put',vars,';',
'run;');
call execute(callstr);
run;
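The workaround for the 32K limit mentioned above could be sketched as follows: instead of building one long string, each piece is passed to CALL EXECUTE separately, so no single string has to hold the whole variable list. This is only a sketch under assumptions: the libname filter is added here for illustration, the c:\temp path is carried over from the step above, and the view is assumed to return rows grouped by libname and memname, as the step above also assumes.

```sas
data _null_;
  set sashelp.vcolumn;
  where libname = 'SASHELP' and memname in ('CLASS','CARS');
  by libname memname;
  if first.memname then do;
    /* Open the generated step once per table */
    call execute(catx(' ', 'data _null_; set', cats(libname, '.', memname), ';'));
    call execute(cats('file "c:\temp\', memname, '.txt";'));
    call execute('put');
  end;
  /* One short call per variable name, so no long string is needed */
  call execute(strip(name));
  /* Close the PUT statement and the generated step */
  if last.memname then call execute('; run;');
run;
```

The generated statements are stacked up and run after this data step finishes, so the log shows one generated DATA _NULL_ step per table.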
I was using the following code to read data in SAS. This is the code that I tried:
data libcards;
infile datalines;
input name $11. birthdate date9. issuedate mmddyy10.;
datalines;
A. Jones    1jan60    9-15-03
M. Rincon   05OCT1949 02-29-2000
Z. Grandage 18mar1988 10-10-2002
K. Kaminaka 29may2001 01-24-2003
;
run;
Needless to say the dates were not read in correctly, except the ones on the first row. Then I changed the format but it still didn't work. Then I looked up the solution, and this is the code that was given.
data libcards;
infile datalines;
input name $11. +1 birthdate date9. +1 issuedate mmddyy10.;
datalines;
A. Jones    1jan60    9-15-03
M. Rincon   05OCT1949 02-29-2000
Z. Grandage 18mar1988 10-10-2002
K. Kaminaka 29may2001 01-24-2003
;
run;
And this code works perfectly. I can see that the difference is the "+1" part, but I don't understand how it's working. The book that I am using has no explanation about it.
Can anyone tell me what's going on here? Thanks for your help.
+n moves the pointer n columns, in this case, just 1 to the right to read the data. This SAS doc page may help with more details.
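To make the pointer movement concrete, here is a sketch that reads the same records with absolute column pointers (@n) in place of the relative +1. It assumes the data lines are laid out in fixed columns, as the informat widths imply: name in columns 1-11, birthdate starting in column 13, issuedate starting in column 23.

```sas
data libcards;
  infile datalines;
  /* $11. reads columns 1-11 and leaves the pointer at column 12.      */
  /* +1 would move it to column 13; @13 jumps there directly.          */
  /* date9. reads columns 13-21; @23 then skips the blank in column 22. */
  input name $11. @13 birthdate date9. @23 issuedate mmddyy10.;
  datalines;
A. Jones    1jan60    9-15-03
M. Rincon   05OCT1949 02-29-2000
Z. Grandage 18mar1988 10-10-2002
K. Kaminaka 29may2001 01-24-2003
;
run;
```

Without either pointer control, birthdate would be read from columns 12-20, which clips the last character of every nine-character date. That is why only the first row, with its shorter date, came out right.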
To read this file you can combine formatted input (for the first field) with modified list input for the other two fields.
name : informat.
tells SAS to use list input with a specific informat. You can also do the same by using an INFORMAT statement to associate the informat with the variable.
data libcards;
infile datalines;
input name $11. birthdate :date9. issuedate :mmddyy10.;
datalines;
A. Jones    1jan60    9-15-03
M. Rincon   05OCT1949 02-29-2000
Z. Grandage 18mar1988 10-10-2002
K. Kaminaka 29may2001 01-24-2003
;
run;
proc print;
run;
You may want to associate a FORMAT with the date variables.
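A short sketch of that suggestion: the FORMAT statement below attaches DATE9. to both date variables so that PROC PRINT shows readable dates rather than raw day counts. The choice of DATE9. for display is an assumption; any date format would work the same way.

```sas
data libcards;
  infile datalines;
  input name $11. birthdate :date9. issuedate :mmddyy10.;
  /* Display-only: the stored values remain numeric SAS dates */
  format birthdate issuedate date9.;
  datalines;
A. Jones    1jan60    9-15-03
M. Rincon   05OCT1949 02-29-2000
Z. Grandage 18mar1988 10-10-2002
K. Kaminaka 29may2001 01-24-2003
;
run;

proc print data=libcards;
run;
```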
I have 1,000 observations for one variable that look like this:
19962
19943
19972
19951
19951
19912
The first four digits vary a bit, but the last digit is always 1, 2, or 3. Is there a way to only format the last digit, while not having to type out each iteration of the first four digits in a value statement?
That is, I want to avoid doing this:
proc format;
value varfmt
19911 = '1991 Spring'
19912 = '1991 Fall'
19913 = '1991 Winter'
19921 = '
19922 = '
[…]
19991 = '1999 Spring'
19992 = '1999 Fall'
19993 = '
;
run;
Instead, is there some way to tell SAS that for any ####1, ####2, or ####3, I want #### Spring, #### Fall, and #### Winter (which would be three lines under the value statement)?
Thanks in advance for any help.
Because the format applies only to the last digit, there is no need to list every five-digit value in PROC FORMAT. Extract the last digit, apply the format to it, and concatenate the result with the first four digits.
Creating the sample dataset
data test;
infile datalines;
input year;
datalines;
19962
19943
19972
19951
19951
19912
;
run;
Creating the formats
proc format;
value $varfmt
'1' = 'Spring'
'2' = 'Fall'
'3' = 'Winter'
;
run;
The data step below does the following:
1. Extracts the last digit
2. Applies the format created above to it
3. Extracts the first four digits of the number
4. Concatenates the output of 2 and 3
data final;
set test;
year_new = cat(substr(compress(year),1,4)," ",put(substr(compress(year),5,1),$varfmt.));
run;
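Since year is read as a numeric variable, the same split can also be done with arithmetic instead of substrings. A minimal sketch, assuming every value has exactly five digits; the format name lastdig and the data set name final_num are hypothetical:

```sas
data test;
  input year;
  datalines;
19962
19943
19951
;
run;

proc format;
  value lastdig          /* hypothetical numeric format for the last digit */
    1 = 'Spring'
    2 = 'Fall'
    3 = 'Winter';
run;

data final_num;
  set test;
  /* int(year/10) drops the last digit; mod(year,10) keeps only the last digit */
  year_new = catx(' ', int(year/10), put(mod(year, 10), lastdig.));
run;
```

For 19962, int(year/10) is 1996 and mod(year,10) is 2, so year_new becomes "1996 Fall". This avoids the automatic numeric-to-character conversion that happens when a numeric variable is passed to COMPRESS.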
You also have the option of creating a format from a dataset, if you do want a format for the whole value. You will have to create all possible rows, but it's not particularly hard.
data forfmt;
fmtname='SEASONF';
length start $5 label $8;
do startyr = 1990 to 2015;
start=cats(startyr,'1');
label=catx(' ',startyr,'Spring');
output;
start=cats(startyr,'2');
label=catx(' ',startyr,'Fall');
output;
start=cats(startyr,'3');
label=catx(' ',startyr,'Winter');
output;
end;
run;
proc format cntlin=forfmt;
quit;
I have two tables:
data a;
input a b c;
datalines;
1 2 .
;
run;
data b;
input a b c;
datalines;
1 . 3
;
run;
The result I want from these tables is replacing the missings by the values that are not missing:
a b c
-----
1 2 3
How can I do this with the least code?
EDIT:
I wrote the code below and it works, but maybe there is simpler code for this.
%macro x;
%macro dummy; %mend dummy;
data _null_;
set x end=Last;
call symputx("name"||left(_N_),name);
if Last then call symputx("num",_n_);
run;
data c;
set a b;
run;
data c;
set c;
%do i=1 %to &num;
x&i=lag(&&name&i);
%end;
n=_n_;
run;
data c1 (drop= n %do i=1 %to &num; x&i %end;);
set c (where=(n=2));
%do i=1 %to &num;
if missing(&&name&i) and not missing(x&i) then &&name&i=x&i;
%end;
run;
%mend;
%x;
If the values are consistent, ie, you never have:
1 2 3
1 3 .
and/or are happy for them to be overwritten, then UPDATE is excellent for this.
data c;
update a b;
by a;
run;
UPDATE will only replace values with non-missing values, so . gets replaced by 3 but 2 is not replaced by .. Again assuming a is the ID variable as Gordon assumes.
You also can easily do this:
data c;
set a b;
by a;
retain b_1 c_1;
if first.a then do; *save the first b and c;
b_1=b;
c_1=c;
end;
else do; *now fill in missings using COALESCE which only replaces if missing;
b_1=coalesce(b_1,b); *use coalescec if this is a char var;
c_1=coalesce(c_1,c); *same;
end;
if last.a then output; *output last row;
drop b c;
rename
b_1=b
c_1=c
;
run;
This makes sure you keep the first instance of any particular value, if they may be different (the opposite of update which keeps the last instance, and different from the SQL solution which takes MAX specifically). All three should give the same result if you have only identical values. Data step options should be a bit faster than the SQL option, I expect, as they're both one pass solutions with no matching required (though it probably doesn't matter).
Using proc SQL, you can do this with aggregation:
proc sql;
select max(a) as a, max(b) as b, max(c) as c
from (select a, b, c from a union all
select a, b, c from b
) x;
If, as I suspect, the first column is an id for matching the two tables, you should instead do:
proc sql;
select coalesce(a.a, b.a) as a, coalesce(a.b, b.b) as b, coalesce(a.c, b.c) as c
from a full join
b
on a.a = b.a;
I'm going to post how to do your approach with some details here: I wouldn't consider this the best approach for this, but you can perhaps learn more easily by starting with what you have, and it's not a horrible approach certainly - just not optimal.
Starting:
%macro x;
%macro dummy; %mend dummy;
data _null_;
set x end=Last;
call symputx("name"||left(_N_),name);
if Last then call symputx("num",_n_);
run;
data c;
set a b;
run;
data c; *NOTE 1;
set c;
%do i=1 %to &num;
x&i=lag(&&name&i); *NOTE 2;
%end;
n=_n_;
run;
data c1 (drop= n %do i=1 %to &num; x&i %end;); *NOTE 3;
set c (where=(n=2));
%do i=1 %to &num;
if missing(&&name&i) and not missing(x&i) then &&name&i=x&i;
%end;
run;
%mend;
%x;
Ending:
*You can still do the first datastep to figure out the dimensions of the arrays,
if you want, use &num instead of the 3s hardcoded in there (but do not need &name list).;
data c;
set a(in=in_a) b(in=in_b);
array x[3] _temporary_; *NOTE 4;
array vars[3] a b c;
if in_b then do; *NOTE 6;
do i=1 to dim(x);
if missing(vars[i]) then vars[i]=x[i]; *NOTE 7;
end;
output;
end;
do i = 1 to dim(x); *NOTE 5;
x[i] = vars[i];
end;
drop i;
run;
Notes:
NOTE 1: You can combine the two c data steps here with no difference at all. In general, use as few data steps as you can, because each one is a separate pass through the data on disk. This is a difference from R and similar tools that process in memory: disk processing lets SAS handle something like 200GB of data, but it makes multiple steps like this slower.
NOTE 2: This is basically a macro implementation of an array. The SAS data step already has arrays, so use them.
NOTE 3: You don't need to build the drop list like that. drop=n x: works fine as long as none of your real variables start with x (if they do, put an _ before all of your dummy variables and use that prefix instead). The : is a wildcard for 'starts with'.
NOTE 4: Here is the array implementation of your x variables. I use _temporary_ because the array elements are then never written to the output data set, and they are retained across rows automatically.
NOTE 5: Here we save the current row's values so that they are available on the next row, which is what the lag calls did in your version. I use the retained array rather than LAG because a single LAG call placed inside a loop keeps one queue for every pass of the loop, so it would not return the previous row's value for each variable.
NOTE 6: This if in_b is like the where=(n=2) filter in your last step. It identifies records that come from b only, so if b has only one record the block runs only once.
NOTE 7: This does the replacement for missing values. COALESCE or COALESCEC would also work for this purpose (though you might need this method if you are unsure of the variable type). There is no reason to check that the saved value is not missing unless you are using special missing values in some fashion; there is no harm in replacing . with another missing value.