SAS union distinct records from datasets with similar names - sql

I have about 100 large datasets, and from each one I want to extract the distinct IDs and then stack the results vertically. The datasets are unsorted and named data_01, data_02, data_03, ..., data_100.
Because the datasets are all very large, stacking them with SET without first reducing their size is not feasible; the job made no progress after hours of running. I therefore believe the datasets need to be reduced before they are stacked, and I'm looking for help with that.
I tried to write a macro that selects the distinct IDs and sums a numeric variable, cnt, by ID in each dataset before stacking all the datasets with a PROC SQL UNION. The macro is not working properly:
/*Get dataset names*/
proc sql noprint;
select memname into :mylist separated by ' '
from dictionary.tables where libname= "mylib" and upcase(memname) like "DATA_%"
;
quit;
%put &mylist;
/*create union statements*/
%global nextdata;
%let nextdata =;
%macro combinedata(mylist);
data _null_;
datanum = countw("&mylist");
call symput('Dataset', put(datanum, 10.));
run;
%do i = 1 %to &Dataset ;
data _null_;
temp = scan("&mylist", &i);
call symput("Dataname", strip(put(temp,$12.)));
run;
%put &Dataname;
%put &Dataset;
%if (&i=&Dataset) %then %do;
%let nextdata = &nextdata.
select id, sum(cnt)
from mylib.&&Dataname
group by id;
%end;
%else %do;
%let nextdata = &nextdata.
select id, sum(cnt)
from mylib.&&Dataname union
group by id;
%end;
%put nextdata = &nextdata;
%end;
%mend combinedata;
%combinedata(&mylist);
/*execute from proc sql*/
proc sql;
create table combined as (&nextdata);
quit;
I have also attempted to use proc summary, but there was not enough memory to run the following code:
data vneed / view=vneed;
set data_: (keep=id cnt);
run;
proc summary data=vneed nway;
class id;
var cnt;
output out=want (drop=_type_) sum=sumcnt;
run;
Appreciate any help!

If the number of values of ID is reasonable you should be able to use a hash object.
data _null_ ;
if _n_=1 then do;
dcl hash H (ordered: "A") ;
h.definekey ("ID") ;
h.definedata ("ID", "SUMCNT") ;
h.definedone () ;
end;
set data_: (keep=id cnt) end=eof;
if h.find() then sumcnt=.;
sumcnt+cnt ;
h.replace() ;
if eof then h.output (dataset: "WANT") ;
run ;
If the number of ID values is too large to fit the summary data into a HASH object you could adapt this code to stop at some reasonable number of distinct ID values to avoid memory overload and write the current summary to an actual SAS dataset and then generate the final counts by re-aggregating the intermediate datasets. But at that point you should just use my other answer and let PROC SQL create the intermediate summary datasets instead.

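Summarize the data as you go instead of trying to generate one massive query, then re-aggregate the aggregates. The %DO loop below has to run inside a macro (for example your existing one, where &Dataset and &mylist are already defined), because %DO is not allowed in open code.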
proc sql ;
%do i = 1 %to &Dataset ;
%let dataname=mylib.%scan(&mylist,&i,%str( ));
create table sum&i as
select id,sum(cnt) as cnt
from &dataname
group by id
order by id
;
%end;
quit;
data want ;
do until(last.id);
set sum1 - sum&dataset ;
by id;
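sumcnt=sum(sumcnt,cnt); /* plain assignment, so sumcnt starts from missing for each ID; a sum statement would retain the total across IDs */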
end;
drop cnt;
run;

Related

Using macro for formula proc sql in SAS

I need some help with macros in SAS. I want to sum variables (for example, from v_1 to v_7) to aggregate them, grouping by year. There are plenty of them, so I want to use a macro. However, it doesn't work (I get only v_1). I would really appreciate your help.
%macro my_macro();
%local i;
%do i = 1 %to 7;
proc sql;
create table my_table as select
year,
sum(v_&i.) as v_&i.
from my_table
group by year
;
quit;
%end;
%mend;
/* I don't know to run this macro - is it ok? */
data run_macro;
set my_table;
%my_macro();
run;
The macro processor just generates SAS code and then passes it on to SAS to run. You are calling a macro that generates a complete SAS step in the middle of your DATA step. So you are trying to run this code:
data run_macro;
set my_table;
proc sql;
create table my_table as select
year,
sum(v_1) as v_1
from my_table
group by year
;
quit;
proc sql;
create table my_table as select
year,
sum(v_2) as v_2
from my_table
group by year
;
quit;
...
So first you make a copy of MY_TABLE as RUN_MACRO. Then you overwrite MY_TABLE with a collapsed version of MY_TABLE that has just two variables and only one observation per year. Then you try to collapse it again, but you are referencing a variable named V_2 that no longer exists.
If you simply move the %DO loop inside the generation of the SQL statement it should work. Also, don't overwrite your input dataset. Here is a version of the macro that creates a new dataset named MY_NEW_TABLE with 8 variables from the existing dataset named MY_TABLE.
%macro my_macro();
%local i;
proc sql;
create table my_NEW_table as
select year
%do i = 1 %to 7;
, sum(v_&i.) as v_&i.
%end;
from my_table
group by year
;
quit;
%mend;
%my_macro;
Note if this is all you are doing then just use PROC SUMMARY. With regular SAS code instead of SQL code you can use variable lists like v_1-v_7. So there is no need for code generation.
proc summary nway data=my_table ;
class year ;
var v_1 - v_7;
output out=my_NEW_table sum=;
run;

Array of Column Variables in SQL

Is it possible to create an array of column variables within sql to perform an operation like the following (please excuse the syntax):
array(Col1,Col2,Col3);
update tempTable
for(i=1;i<3;i++){
set array[i] =
case missing(array[i])
then 0*1
else
array[i]*1
end
};
note: I am using a proc SQL step in SAS
Desired function:
Perform the operation in the for loop above on multiple columns of a table, without writing a separate set statement for each column.
It is possible to do what you are looking for with a SAS macro.
It is easier, if this is a local SAS table, to just update it with the Data Step.
data have;
set have;
array v[3] col1 col2 col3;
do i=1 to 3;
v[i] = sum(v[i],0);
end;
drop i;
run;
The sum() function sums values (obviously). If a value is missing, it is not added and the remaining values are added. So you will get 0 in the case of missing and the value of the column when it is not.
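As a small check of the behavior described above, here is a minimal sketch; the variable names a, b, a0 and b0 are made up for illustration only.

```sas
/* Illustration only: a, b, a0, b0 are hypothetical names */
data _null_;
  a = .;             /* missing value */
  b = 5;             /* non-missing value */
  a0 = sum(a, 0);    /* the missing argument is ignored, leaving 0 */
  b0 = sum(b, 0);    /* the value is unchanged */
  put a0= b0=;       /* writes a0=0 b0=5 to the log */
run;
```

Note that the + operator behaves differently: a + 0 stays missing when a is missing, which is why the sum() function is used here.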
SAS Macros write SAS code for you. They are pre-compiler scripts that generate SAS Code.
You want code that looks like
update table
set col1 = ... ,
col2 = ... ,
.... ,
;
Here is a script. It generates a test table, defines the macro, and then calls the macro on the table. It uses the sum() function from my other answer.
data have;
array col[3];
do r=1 to 100;
do i=1 to 3;
if ranuni(123)> .8 then
col[i] = .;
else
col[i] = rannor(123);
end;
output;
end;
drop i r;
run;
%macro sql_zero_if_missing(data, cols);
%local n i col;
%let n=%sysfunc(countw(&cols));
proc sql noprint;
update &data
set
%do i=1 %to &n;
%let col=%scan(&cols,&i);
&col = sum(&col,0)
%if &i ^= &n %then , ;
%end;
;
quit;
%mend;
options mprint;
%sql_zero_if_missing(have, col1 col2 col3);
The MPRINT option will let you see the SAS code that was generated. Here is the log:
MPRINT(SQL_ZERO_IF_MISSING): proc sql noprint;
MPRINT(SQL_ZERO_IF_MISSING): update have set col1 = sum(col1,0) ,
col2 = sum(col2,0) , col3 = sum(col3,0) ;
NOTE: 100 rows were updated in WORK.HAVE.
MPRINT(SQL_ZERO_IF_MISSING): quit;

check whether proc append was successful

I have some code which appends yesterday's data to [large dataset], using proc append. After doing so it changes the value of the variable "latest_date" in another dataset to yesterday's date, thus showing the maximum date value in [large dataset] without a time-consuming data step or proc sql.
How can I check, within the same program in which proc append is used, whether proc append was successful (no errors)? My goal is to change the "latest_date" variable in this secondary dataset only if the append is successful.
Try the automatic macro variable &SYSCC.
data test;
do i=1 to 10;
output;
end;
run;
data t1;
i=11;
run;
data t2;
XXX=12;
run;
proc append base=test data=t1;
run;
%put &syscc;
proc append base=test data=t2;
run;
%put &syscc;
I'm using the %get_table_size macro, which I found here. My steps are:
1. Run %get_table_size(large_table, size_preappend)
2. Create a dataset called to_append
3. Run %get_table_size(to_append, append_size)
4. Run proc append
5. Run %get_table_size(large_table, size_postappend)
6. Check whether &size_postappend = &size_preappend + &append_size
Using &syscc isn't exactly what I wanted, because it doesn't check specifically for an error in proc append. It could be thrown off by earlier errors.
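One possible way around the "earlier errors" concern is the automatic macro variable &SYSERR, which SAS resets at each step boundary, so it reflects only the step that just finished. The following is a rough sketch under assumptions, not a tested solution for the original job: the macro name append_and_stamp and the dataset work.latest are hypothetical, and the base and data tables are the test and t1 tables from the example above.

```sas
/* Sketch only: append_and_stamp and work.latest are hypothetical names */
%macro append_and_stamp;
  proc append base=test data=t1;
  run;

  /* SYSERR describes the step that just ended; 0 means no errors or warnings */
  %if &syserr = 0 %then %do;
    data work.latest;
      latest_date = today() - 1;   /* yesterday's date */
      format latest_date date9.;
    run;
  %end;
  %else %do;
    %put WARNING: PROC APPEND ended with SYSERR=&syserr.. latest_date was not updated.;
  %end;
%mend append_and_stamp;

%append_and_stamp;
```

The check has to come immediately after the RUN statement of proc append, before any other step runs and resets &SYSERR. A nonzero value can also indicate a warning, so a stricter or looser threshold may suit a given process better.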
You can do this by counting how many records are in the table before and after appending. This works with any SAS table or database table.
It is good practice to keep a control table for your process that logs the run time and the number of records read.
Code:
/*Create input data*/
data work.t1;
input row ;
datalines;
1
2
;;
run;
data work.t2;
input row ;
datalines;
3
;;
run;
/*Create the control table. Run this bit only once, otherwise you delete the table every time*/
data work.cntrl;
length load_dt 8. source 8. delta 8. total 8. ;
format load_dt datetime21.;
run;
proc sql; delete * from work.cntrl; quit;
/*Count Records before append*/
proc sql noprint ; select count(*) into: count_t1 from work.t1; quit;
proc sql noprint; select count(*) into: count_t2 from work.t2; quit;
/*Append data*/
proc append base=work.t1 data=work.t2 ; run;
/*Count Records after append*/
proc sql noprint ; select count(*) into: count_final from work.t1; quit;
/*Insert counts and timestamp into the Control Table*/
proc sql noprint; insert into work.cntrl
/*values(input(datetime(),datetime21.), input(&count_t1.,8.) , input(&count_t2.,8.) , input(&count_final.,8.)) ; */
values(%sysfunc(datetime()), &count_t1. , &count_t2., &count_final.) ;
quit;
Output: Control table is updated

Execute SAS macro for many parameters

I use the following code to insert rows to the table:
proc sql;
create table business_keys as
select name, memname
from sashelp.vcolumn
where 1=0;
quit;
%macro insert(list);
proc sql;
%do i=1 %to &max;
%let val = %scan(&list,&i);
insert into business_keys
select distinct name, memname
from sashelp.vcolumn
where upcase(memname) = "&list"
and upcase(name) like '%_ZRODLO_ID%'
and length(name) = 12;
%end;
quit;
%mend;
%insert(&name1);
As written, it inserts the same row &max times.
I need to execute it for all macro variables (&name#), not only for &name1. How can I pass all the variables at the same time? In principle, I want to loop through all of these table names:
%insert(&name1-&name&max)
&name1 = PEOPLE, &name2 = CREDITS, ... &name&max = ANY_TABLE_NAME
Here &name# is a table name and &max is the number of tables.
OK, now I understand what you want to do. It is actually quite simple:
%macro insert;
proc sql;
%do i=1 %to &max;
insert into business_keys
select distinct name, memname
from sashelp.vcolumn
where upcase(memname) = upcase("&&name&i") and upcase(name) like '%_ZRODLO_ID%' and length(name) = 12;
%end;
quit;
%mend;
%insert;
&&name&i resolves in two passes: first to &name1, &name2, ..., &namex, and then to PEOPLE, CREDITS, ..., ANY_TABLE_NAME, depending on the value of i.
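The two-pass resolution can be seen directly with %PUT. This is a minimal sketch; the macro name show_names and the two sample values are only for illustration.

```sas
/* Illustration only: show_names is a hypothetical macro */
%let name1 = PEOPLE;
%let name2 = CREDITS;
%let max = 2;

%macro show_names;
  %local i;
  %do i = 1 %to &max;
    /* pass 1: && becomes & and &i becomes the index, giving &name1, &name2 */
    /* pass 2: &name1 or &name2 resolves to the stored table name */
    %put i=&i gives &&name&i;
  %end;
%mend show_names;

%show_names;
```

Turning on OPTIONS SYMBOLGEN before the call should also make the log show each resolution step.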
Sounds like you want to pass a "macro array" to your macro, where by "macro array" I mean a series of macro variables that share a base name and differ by a numeric suffix, like NAME1, NAME2, etc. It is easier to do that by passing two parameters to your macro: one for the base name of the array and one for the upper limit (or max) index.
%macro insert(basename,max);
%local i;
...
%do i=1 %to &max ;
... &&&basename&i ...
%end;
...
%mend insert;
So you might call the macro like this:
%let name1=PEOPLE;
%let name2=CREDITS;
%insert(NAME,2);
Personally I would avoid the macro array and instead store the list in a single macro variable. If the list is just SAS names (datasets, libraries, variables, formats, etc.) then just use space for the delimiter. If it is something like labels that could include spaces then use some other character like | for the delimiter. Then your macro would look more like this.
%macro insert(memlist);
%local i;
...
%do i=1 %to %sysfunc(countw(&memlist,%str( ))) ;
... %scan(&memlist,&i,%str( )) ...
%end;
...
%mend insert;
So you might call the macro like this:
%insert(PEOPLE CREDITS);
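Filling in the skeleton with the INSERT statement from the question gives something like the sketch below. It assumes the business_keys table from the question already exists; the local macro variable mem is introduced here only for illustration.

```sas
/* Sketch: assumes business_keys was created as in the question; mem is a hypothetical helper variable */
%macro insert(memlist);
  %local i mem;
  proc sql;
  %do i = 1 %to %sysfunc(countw(&memlist, %str( )));
    /* pick the i-th space-delimited name and upper-case it to match upcase(memname) */
    %let mem = %upcase(%scan(&memlist, &i, %str( )));
    insert into business_keys
      select distinct name, memname
      from sashelp.vcolumn
      where upcase(memname) = "&mem"
        and upcase(name) like '%_ZRODLO_ID%'
        and length(name) = 12;
  %end;
  quit;
%mend insert;

%insert(PEOPLE CREDITS);
```

The LIKE pattern stays in single quotes so the macro processor does not try to interpret the % signs.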
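If the list looks like PEOPLE,CREDITS,...,ANY_TABLE_NAME, count its items to get the number of iterations. Because the list contains commas, mask it with %superq and pass the delimiter as %str(,):
%let max = %sysfunc(countw(%superq(list),%str(,)));
The macro then becomes:
%macro insert(list);
%local i max val;
%let max = %sysfunc(countw(%superq(list),%str(,)));
proc sql;
%do i=1 %to &max;
%let val = %scan(%superq(list),&i,%str(,));
insert into business_keys
select distinct name, memname
from sashelp.vcolumn
where upcase(memname) = "&val" and upcase(name) like '%_ZRODLO_ID%' and length(name) = 12;
%end;
quit;
%mend;
/* mask the commas in the call so the whole list arrives as one parameter */
%insert(%str(PEOPLE,CREDITS));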

SAS Dynamic SQL join

I want to join 6 tables, which all have different variables, to one table that shares columns with all 6 of them. Can I do this without inspecting the tables to see which columns each one has? I have a macro variable holding a list of column names, but I cannot think of a good way to join the tables using it.
The list is created by this macro:
%macro getvars(dsn);
%global vlist;
proc sql noprint;
select name into :vlist separated by ' '
from dictionary.columns
where memname=upcase("&dsn");
quit;
%mend getvars;
And I want to join the tables like this:
proc sql;
create table new_table as select * from table1 as l
left join table2 as r on l.age=r.age and l.type=r.type;
quit;
but not so manually :)
For example, table1 has columns name, age, coef1 and sex, and table2 has columns name, region and coef2. The third table, to which I want to join them, has name, age, sex, region, coef and many other columns. I want to write a program that doesn't know which table has which columns, but joins them so that the third table still has all the same columns plus coef1 and coef2.
This isn't an answer I'd normally recommend as it can lead to unwanted results if you're not careful, however it could work for you in this instance. I'm proposing using a natural join, which automatically joins on to all matching variables so you don't need to specify an ON clause. Here's example code.
proc sql;
create table want as select
*
from
a
natural left join
b
natural left join
c
;
quit;
As I say, be very careful about checking the results.
Here's one method...
Firstly, use DICTIONARY.COLUMNS to find all of the common variables in each table based on the 'master' table. Then dynamically generate the join criteria for tables with common variables, and finally join them all together based on those criteria.
%MACRO COMMONJOIN(DSN,DSNLIST) ;
%LET DSNC = %SYSFUNC(countw(&DSNLIST,%STR( ))) ; /* # of additional tables */
/* Create a list of variables from primary DSN, with flags where variable exists in DSNLIST datasets */
proc sql ;
create table commonvars as
select a.name %DO I = 1 %TO &DSNC ;
%LET D = %SYSFUNC(scan(&DSNLIST,&I,%STR( ))) ;
, d&I..V&I label="&D"
%END ;
from dictionary.columns a
%DO I = 1 %TO &DSNC ;
/* Iterate over list of dataset names */
%LET D = %SYSFUNC(scan(&DSNLIST,&I,%STR( ))) ;
left join
(select name, 1 as V&I
from dictionary.columns
where libname = scan(upcase("&D"),1,'.')
and memname = scan(upcase("&D"),2,'.'))
as d&I on a.name = d&I..name
%END ;
where libname = scan(upcase("&DSN"),1,'.')
and memname = scan(upcase("&DSN"),2,'.')
;
quit ;
/* Create join criteria between master & each secondary table */
%DO I = 1 %TO &DSNC ;
%LET JOIN&I = ;
proc sql ;
select catx(' = ',cats('a.',name),cats("V&I..",name)) into :JOIN&I separated by ' and '
from commonvars
where V&I = 1 ;
quit ;
%END ;
/* Join */
proc sql ;
create table masterjoin as
select a.*
%DO I = 1 %TO &DSNC ;
%IF "&&JOIN&I" ne "" %THEN %DO ;
, V&I..*
%END ;
%END ;
from &DSN as a
%DO I = 1 %TO &DSNC ;
%IF "&&JOIN&I" ne "" %THEN %DO ;
%LET D = %SYSFUNC(scan(&DSNLIST,&I,%STR( ))) ;
left join &D as V&I on &&JOIN&I
%END ;
%END ;
;
quit ;
%MEND ;
%COMMONJOIN(work.master,work.table1 work.table2 work.table3) ;
If you are open to using a data step instead of proc sql you may be in luck.
/* pre-sorting is required for SAS merge */
proc sort data=master; by key1 key2; run;
proc sort data=table1; by key1 key2; run;
proc sort data=table2; by key1 key2; run;
proc sort data=table3; by key1 key2; run;
data want;
merge master (in=_inMaster) table1 table2 table3;
by key1 key2;
/* for a Left Join, keep all rows from Master */
if _inMaster;
run;
The only gotcha I can think of is a non-key variable that appears in more than one table. If more than one table has variable x, the right-most table's value of x will overwrite the previous ones. SAS reports this in the log only when OPTIONS MSGLEVEL=I is in effect.
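One way to sidestep the overwrite is to rename the shared variable as each table is read. This is a sketch under assumptions: x stands for any non-key variable that happens to exist in both table1 and table2, and the new names x_table1 and x_table2 are made up. The inputs are assumed to be sorted by key1 key2 as above.

```sas
/* Sketch: x, x_table1 and x_table2 are hypothetical variable names */
data want;
  merge master (in=_inMaster)
        table1 (rename=(x=x_table1))   /* keep table1's copy under its own name */
        table2 (rename=(x=x_table2));  /* keep table2's copy under its own name */
  by key1 key2;
  /* left join: keep every row that exists in master */
  if _inMaster;
run;
```

With the renames in place, both values survive in the output and can be compared or combined afterwards.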