Finding variables whose sum equals zero in SAS / PROC SQL

I searched online but could not find anything relevant.
I have a table with thousands of variables.
I'm trying to sum each variable and find out which variables have a sum equal to zero.
example
col1 col2 col3
0 0 0
1 0 2
1 0 3
results
col2
0
However, PROC MEANS does not accept my WHERE clause.
proc sql;
create table toto as select nomvar,monotonic() as num_lig from dicofr
where nomvar <> 'date';
proc sql;
select nomvar into :varnom separated by ' ' from toto
where num_lig between 0 and 1000;
%put varnom: &varnom;
proc means data=afr sum (where=(sum(&varnom)=0) ;
var &varnom;
output out=want;
run;
What am I doing wrong?
Thank you for anything that can lead me to a solution.

This will do it. I believe this requires SAS 9.3+ for the stackodsoutput option.
*Generating some data;
data have;
array x[100];
call streaminit(7);
do i = 1 to 20;
do _t = 1 to dim(x);
if rand('Uniform') < 0.9 then x[_t]=0;
else x[_t]=1;
end;
output;
end;
run;
*ods output grabs what you want from proc means;
ods output summary=want(where=(sum=0));
proc means data=have sum stackodsoutput;
var x:;
run;
ods output close;

There are several ways to achieve this; here's a data step/array method:
%LET NVARS = 1000 ;
data want ;
set have end=eof ;
array n{*} col1-col&NVARS ;
array t{&NVARS} 5. _TEMPORARY_ ;
do i = 1 to dim(n) ;
t{i} + n{i} ;
end ;
if eof then do ;
do i = 1 to dim(t) ;
if missing(t{i}) or t{i} = 0 then do ;
vname = vname(n{i}) ;
put "Sum of " vname "= 0" ; /* write message to log */
output ; /* Write to dataset */
end ;
end ;
end ;
keep vname ;
run ;

Related

Conditional processing in proc sql(SAS) using a macro variable

I need to select the names of states that do not start with M if the macro variable M=N, but return only the names of states that start with M if the macro variable has any other value, using conditional processing.
for example:
%let M=N;
proc sql;
select states,profit,
case
when .....
else
end
from geography_dim
quit;
For the sake of argument, suppose you change the name of the macro variable M to something more ridiculously expressive, such as YN_OPTION_SELECT_M_STATES
%let YN_OPTION_SELECT_M_STATES = N;
proc sql;
select states,profit,
case
when .....
else
end
from geography_dim
/* add this */
where
("&YN_OPTION_SELECT_M_STATES" eq 'N' & states not like 'M%')
or
("&YN_OPTION_SELECT_M_STATES" ne 'N' & states like 'M%')
;
quit;
Revert to macro variable M if you must; however, the code will be somewhat opaque.
This is not SQL, but it is very simple in a data step. If you want to check whether the state name starts with the value of the macro variable M (here "N"), you can do something like:
/*test data*/
data geography_dim ;
states="Aaaaa";profit=10;output;
states="Naaaa";profit=10;output;
run;
/*set macro variable*/
%let M=N;
/*check if you want*/
%put "&M";
/*your case in datastep*/
data test;
set geography_dim;
if substr(states,1,1) eq "&M" then profit=profit*10;
else profit=0;
run;
/* results
states profit
Aaaaa 0
Naaaa 100
*/

Searching for pattern with characters and numerics in SAS

I am examining data quality and am trying to see how many rows are populated properly. The field should contain a string with one letter followed by nine digits, and is of type 'Character' with length 10.
Ex.
A123456789
B123531490
C319861045
I have tried using the PRXMATCH function, but I am unsure whether I am using the proper syntax. I have also tried using PROC SQL with "Where not like "[A-Z][0-9][0-9]" and so on. My feeling is that this should not be difficult to do. Does anyone have a solution?
Best regards
You can construct a REGEX to make that test. Or just build the test using normal SAS functions.
data want ;
set have ;
flag1 = prxmatch('/^[A-Z][0-9]{9}$/',trim(name));
test1 = 'A' <= char(name,1) <= 'Z' ; /* compare only the first character, so names starting with Z also pass */
test2 = not notdigit(trim(substr(name,2))) ;
test3 = length(name)=10;
flag2 = test1 and test2 and test3 ;
run;
Results:
Obs name flag1 test1 test2 test3 flag2
1 A123456789590 0 1 1 0 0
2 B123531490ABC 0 1 0 0 0
3 C3198610 0 1 1 0 0
4 A123456789 1 1 1 1 1
5 B123531490 1 1 1 1 1
6 C319861045 1 1 1 1 1
You can use:
^[a-zA-Z][0-9]{9}$
The built-in SAS functions NOTALPHA and NOTDIGIT can perform validation testing.
invalid_flag = notalpha(substr(s,1,1)) or notdigit(s,2) ;
You can select invalid records directly with a where statement or option:
data invalid;
set raw;
where notalpha(substr(s,1,1)) or notdigit(s,2) ; * statement;
run;
data invalid;
set raw (where=(notalpha(substr(s,1,1)) or notdigit(s,2))); * data set option;
run;
There are several functions in the NOT* and ANY* families and they can offer faster performance than the general purpose regular expression functions in the PRX* family.
You can use prxparse and prxmatch as shown below.
data have;
input name $20.;
datalines;
A123456789590
B123531490ABC
C3198610
A123456789
B123531490
C319861045
;
data want;
set have;
if _n_=1 then do;
retain re;
re = prxparse('/^[a-zA-Z][0-9]{9}$/');
end;
if prxmatch(re,trim(name)) gt 0 then Flag ='Y';
else Flag ='N';
drop re;
run;
If you want only the records that match the criteria, use:
data want;
set have;
if _n_=1 then do;
retain re;
re = prxparse('/^[a-zA-Z][0-9]{9}$/');
end;
if prxmatch(re,trim(name));
drop re;
run;

SAS PROC SQL NOT CONTAINS multiple values in one statement

In PROC SQL, I need to select all rows where a column called "NAME" does not contain multiple values "abc", "cde" and "fbv" regardless of what comes before or after these values. So I did it like this:
SELECT * FROM A WHERE
NAME NOT CONTAINS "abc"
AND
NAME NOT CONTAINS "cde"
AND
NAME NOT CONTAINS "fbv";
which works just fine, but I imagine it would be a headache if we had a hundred conditions. So my question is: can we accomplish this in a single statement in PROC SQL?
I tried using this:
SELECT * FROM A WHERE
NOT CONTAINS(NAME, '"abc" AND "cde" AND "fbv"');
but this doesn't work in PROC SQL, I am getting the following error:
ERROR: Function CONTAINS could not be located.
I don't want to use LIKE.
You could use regular expressions, I suppose.
data a;
input name $;
datalines;
xyabcde
xyzxyz
xycdeyz
xyzxyzxyz
fbvxyz
;;;;
run;
proc sql;
SELECT * FROM A WHERE
NAME NOT CONTAINS "abc"
AND
NAME NOT CONTAINS "cde"
AND
NAME NOT CONTAINS "fbv";
SELECT * FROM A WHERE
NOT (PRXMATCH('~ABC|CDE|FBV~i',NAME));
quit;
You can't use CONTAINS that way, though.
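You can use NOT IN, but note that it only excludes rows where NAME is exactly equal to one of the listed values; it does not test for substrings: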
SELECT * FROM A WHERE
NAME NOT IN ('abc','cde','fbv');
If the number of items is too large to reasonably hard-code, you can create a table (work.words below) to store the words and iterate over it to check for occurrences:
data work.values;
input name $;
datalines;
xyabcde
xyzxyz
xycdeyz
xyzxyzxyz
fbvxyz
;
run;
data work.words;
length word $50;
input word $;
datalines;
abc
cde
fbv
;
run;
data output;
set values;
/* build a hash of words */
length word $50;
if _n_ = 1 then do;
/* this runs once only */
call missing(word);
declare hash words (dataset: 'work.words');
words.defineKey('word');
words.defineData('word');
words.defineDone();
/* create the iterator once, rather than on every observation */
declare hiter iter('words');
end;
/* iterate hash of words */
rc = iter.first();
found = 0;
do while (rc=0);
if index(name, trim(word)) gt 0 then do; /* check if word present using INDEX function */
found= 1;
rc = 1;
end;
else rc = iter.next();
end;
if found = 0 then output; /* output only if no word found in name */
drop word rc found;
run;
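The same lookup-table idea can also be written directly in PROC SQL, which is what the question asked for. This is only a minimal sketch: names_tbl and words_tbl are hypothetical table names made up for the illustration, and the match is case-sensitive.

```sas
/* hypothetical sample data, for illustration only */
data work.names_tbl;
  input name :$20.;
  datalines;
xyabcde
xyzxyz
xycdeyz
xyzxyzxyz
fbvxyz
;
run;

data work.words_tbl;
  input word :$10.;
  datalines;
abc
cde
fbv
;
run;

proc sql;
  /* keep a row only if no word in words_tbl occurs anywhere in name */
  select n.*
  from work.names_tbl as n
  where not exists
    (select *
     from work.words_tbl as w
     where index(n.name, strip(w.word)) > 0);
quit;
```

With the sample rows above, this should list only xyzxyz and xyzxyzxyz. Adding a condition then means adding a row to the word table rather than editing the query.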

macro into a table or a macro variable with sas

I have this macro. The aim is to take the variable names from the table dicofr and put each row into a macro variable using CALL SYMPUT.
However, something is not working correctly because that variable, &nvarname, is not seen as a variable.
This is the content of dico&&pays&l
varname descr
var12 aza
var55 ghj
var74 mcy
This is the content of dico&&pays&l..1
varname
var12
var55
var74
Below is my code
%macro testmac;
%let pays1=FR ;
%do l=1 %to 1 ;
data dico&&pays&l..1 ; set dico&&pays&l (keep=varname);
call symput("nvarname",trim(left(_n_))) ;
run ;
data a&&pays&l;
set a&&pays&l;
nouv_date=mdy(substr(date,6,2),01,substr(date,1,4));
format nouv_date monyy5.;
run;
proc sql;
create table toto
(nouv_date date , nomvar varchar (12));
quit;
proc sql;
insert into toto SELECT max(nouv_date),"&nvarname" as nouv_date as varname FROM a&&pays&l WHERE (&nvarname ne .);
%end;
%mend;
%testmac;
A subsidiary question: is it possible to put the varname and the date related to that varname into a macro variable? My manager told me about this, but I have never done that before.
Thanks in advance.
Edited:
I have this table
date col1 col2 col3 ... colx
1999M12 . . . .
1999M11 . 2 . .
1999M10 1 3 . 3
1999M9 0.2 3 2 1
For each column, I'm trying to find the maximum date at which the value in that column is not missing.
For col1, it would be 1999M10. For col2, it would be 1999M11 etc ...
Based on your update, I think the following code does what you want. If you don't mind sorting your input dataset first, you can get all the values you're looking for with a single data step - no macros required!
data have;
length date $7;
input date col1 col2 col3;
format date2 monyy5.;
date2 = mdy(substr(date,6,2),1,substr(date,1,4));
datalines;
1999M12 . . .
1999M11 . 2 .
1999M10 1 3 .
1999M09 0.2 3 2
;
run;
/*Required for the following data step to work*/
/*Doing it this way allows us to potentially skip reading most of the input data set*/
proc sort data = have;
by descending date2;
run;
data want(keep = max_date:);
array max_dates{*} max_date1-max_date3;
array cols{*} col1-col3;
format max_date: monyy5.;
do until(eof); /*Begin DOW loop*/
set have end = eof;
/*Check to see if we've found the max date for each col yet.*/
/*Save the date for that col if applicable*/
j = 0;
do i = 1 to dim(cols);
if missing(max_dates[i]) and not(missing(cols[i])) then max_dates[i] = date2;
j + missing(max_dates[i]);
end;
/*Use j to count how many cols we still need dates for.*/
/* If we've got a full set, we can skip reading the rest of the data set*/
if j = 0 or eof then do; /* also output at end of file, in case some col is entirely missing */
output;
stop;
end;
end; /*End DOW loop*/
run;
EDIT: if you want to output the names alongside the max date for each, that can be done with a slight modification:
data want(keep = col_name max_date);
array max_dates{*} max_date1-max_date3;
array cols{*} col1-col3;
format max_date monyy5.;
do until(eof); /*Begin DOW loop*/
set have end = eof;
/*Check to see if we've found the max date for each col yet.*/
/*If not then save date from current row for that col*/
j = 0;
do i = 1 to dim(cols);
if missing(max_dates[i]) and not(missing(cols[i])) then max_dates[i] = date2;
j + missing(max_dates[i]);
end;
/*Use j to count how many cols we still need dates for.*/
/* If we've got a full set, we can skip reading the rest of the data set*/
if j = 0 or eof then do;
do i = 1 to dim(cols);
col_name = vname(cols[i]);
max_date = max_dates[i];
output;
end;
stop;
end;
end; /*End DOW loop*/
run;
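On the subsidiary question of getting each variable name and its date into a macro variable: one possible approach is PROC SQL's SELECT ... INTO. The sketch below is an assumption-laden illustration, not part of the answer above. It uses a hypothetical stand-in data set (want_demo) shaped like the col_name / max_date output, and a hypothetical macro variable name (name_dates).

```sas
/* hypothetical stand-in for the col_name / max_date data set produced above */
data work.want_demo;
  input col_name :$32. max_date :date9.;
  format max_date monyy5.;
  datalines;
col1 01OCT1999
col2 01NOV1999
col3 01SEP1999
;
run;

proc sql noprint;
  /* build one "name=date" token per row, separated by spaces */
  select catx('=', col_name, put(max_date, monyy5.))
    into :name_dates separated by ' '
  from work.want_demo;
quit;

/* write the resulting list to the log */
%put name_dates=&name_dates;
```

The macro variable then holds tokens of the form col1=OCT99, which can be split again with %scan if the pairs are needed individually.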
It looks to me that you're trying to use macros to generate INSERT INTO statements to populate your table. It's possible to do this without using macros at all which is the approach I'd recommend.
You could use a datastep statement to write out the INSERT INTO statements to a file. Then following the datastep, use a %include statement to run the file.
This will be easier to write/maintain/debug and will also perform better.
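A minimal sketch of that approach follows. The table and variable names (toto, afr, varlist, nouv_date) are borrowed from or modelled on the question, and the sample data is hypothetical; treat it as an illustration of the pattern rather than a drop-in replacement.

```sas
/* hypothetical target table, modelled on the toto table in the question */
proc sql;
  create table work.toto (nouv_date num format=monyy5., nomvar char(12));
quit;

/* hypothetical list of variables to process */
data work.varlist;
  input varname :$12.;
  datalines;
col1
col2
;
run;

/* hypothetical input data */
data work.afr;
  input nouv_date :date9. col1 col2;
  format nouv_date monyy5.;
  datalines;
01OCT1999 1 3
01NOV1999 . 2
;
run;

/* write one INSERT statement per variable to a temporary file */
filename ins temp;
data _null_;
  set work.varlist end=eof;
  file ins;
  if _n_ = 1 then put 'proc sql;';
  put 'insert into work.toto select max(nouv_date), "' varname +(-1)
      '" from work.afr where ' varname 'is not missing;';
  if eof then put 'quit;';
run;

/* run the generated code; source2 echoes it to the log */
%include ins / source2;
filename ins clear;
```

Because the generated statements are plain text in a file, they can be inspected in the log (or by pointing the fileref at a permanent file) before being run, which is what makes this easier to debug than nested macro references.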

Select character variables that have all missing values

I have a SAS dataset with around 3,000 variables, and I would like to get rid of the character variables for which all values are missing. I know how to do this for numeric variables-- I'm wondering specifically about the character variables. I need to do the work using base SAS, but that could include proc SQL, which is why I've tagged this one 'SQL' also.
Thank you!
Edit:
Background info: This is a tall dataset, with survey data from 7 waves of interviews. Some, but not all, of the survey items (variables) were repeated across waves. I'm trying to create a list of items that were actually used in each wave by pulling all the records for that wave, getting rid of all the columns that have nothing but SAS's default missing values, and then running proc contents.
I created a macro that will check for empty character columns and either remove them from the original or create a new data set with the empty columns removed. It takes two optional arguments: The name of the data set (default is the most recently created data set), and a suffix to name the new copy (set suffix to nothing to edit the original).
It uses proc freq with the levels option and a custom format to determine the empty character columns. proc sql is then used to create a list of the columns to be removed and store them in a macro variable.
Here is the macro:
%macro delemptycol(ds=_last_, suffix=_noempty);
option nonotes;
proc format;
value $charmiss
' '= ' '
other='1';
run;
%if "&ds"="_last_" %then %let ds=&syslast.;
ods select nlevels;
ods output nlevels=nlev;
proc freq data=&ds.(keep=_character_) levels ;
format _character_ $charmiss.;
run;
ods output close;
/* create macro var with list of cols to remove */
%local emptycols;
proc sql noprint;
select tablevar into: emptycols separated by ' '
from nlev
where NNonMissLevels=0;
quit;
%if &emptycols.= %then %do;
%put DELEMPTYCOL: No empty character columns were found in data set &ds.;
%end;
%else %do;
%put DELEMPTYCOL: The following empty character columns were found in data set &ds. : &emptycols.;
%put DELEMPTYCOL: Data set &ds.&suffix created with empty columns removed;
data &ds.&suffix. ;
set &ds(drop=&emptycols);
run;
%end;
options notes;
%mend;
Example usage:
/* create some fake data: Here char5 will be empty */
data chardata(drop= j randnum);
length char1-char5 $8.;
array chars(5) char1-char5;
do i=1 to 100;
call missing(of char:);
randnum=floor(10*ranuni(i));
do j=2 to 5;
if (j-1)<randnum<=(j+1) then chars(j-1)="FOO";
end;
output;
end;
run;
%delemptycol(); /* uses default _last_ for the data and "_noempty" as the suffix */
%delemptycol(ds=chardata, suffix=); /* removes the empty columns from the original */
There's probably a simpler way but this is what I came up with.
Cheers
Rob
EDIT: Note that this works for both character and numeric variables.
**
** TEST DATASET
*;
data x;
col1 = "a"; col2 = ""; col3 = "c"; output;
col1 = "" ; col2 = ""; col3 = "c"; output;
col1 = "a"; col2 = ""; col3 = "" ; output;
run;
**
** GET A LIST OF VARIABLE NAMES
*;
proc sql noprint;
select name into :varlist separated by " "
from sashelp.vcolumn
where upcase(libname) eq "WORK"
and upcase(memname) eq "X";
quit;
%put &varlist;
**
** USE A MACRO TO CREATE A DATASTEP. FOR EACH COLUMN THE
** THE DATASTEP WILL CREATE A NEW COLUMN WITH THE SAME NAME
** BUT PREFIXED WITH "DELETE_". IF THERE IS AT LEAST 1
** NON-MISSING VALUE FOR THE COLUMN THEN THE "DELETE" COLUMN
** WILL FINISH WITH A VALUE OF 0, ELSE 1. WE WILL ONLY
** KEEP THE COLUMNS CALLED "DELETE_" AND OUTPUT ONLY A SINGLE
** OBSERVATION TO THE FINAL DATASET.
*;
%macro find_unused_cols(iDs=);
%local cnt;
data vars_to_delete;
set &iDs end=eof;
%let cnt = 1;
%let varname = %scan(&varlist, &cnt);
%do %while ("&varname" ne "");
retain delete_&varname;
delete_&varname = min(delete_&varname, missing(&varname));
drop &varname;
%let cnt = %eval(&cnt + 1);
%let varname = %scan(&varlist, &cnt);
%end;
if eof then do;
output;
end;
run;
%mend;
%find_unused_cols(iDs=x);
**
** GET A LIST OF VARIABLE NAMES FROM THE NEW DATASET
** THAT WE WANT TO DELETE AND STORE TO A MACRO VAR.
*;
proc transpose data=vars_to_delete out=vars_to_delete;
run;
proc sql noprint;
select substr(_name_,8) into :vars_to_delete separated by " "
from vars_to_delete
where col1;
quit;
%put &vars_to_delete;
**
** CREATE A NEW DATASET CONTAINING JUST THOSE VARS
** THAT WE WANT TO KEEP
*;
data new_x;
set x;
drop &vars_to_delete;
run;
Rob and cmjohns, thank you SO MUCH for your help. Based on your solutions and an idea I had over the weekend, here is what I came up with:
%macro removeEmptyCols(origDset, outDset);
* get the number of obs in the original dset;
%let dsid = %sysfunc(open(&origDset));
%let origN = %sysfunc(attrn(&dsid, nlobs));
%let rc = %sysfunc(close(&dsid));
proc transpose data= &origDset out= transpDset;
var _all_;
run;
data transpDset;
set transpDset;
* proc transpose converted all old vars to character,
so the . from old numeric vars no longer means 'missing';
array oldVar_ _character_;
do over oldVar_;
if strip(oldVar_) = "." then oldVar_ = "";
end;
* each row from the old dset is now a column with varname starting with 'col';
numMiss = cmiss(of col:);
numCols = &origN;
run;
proc sql noprint;
select _NAME_ into: varsToKeep separated by ' '
from transpDset
where numMiss < numCols;
quit;
data &outDset;
set &origDset (keep = &varsToKeep);
run;
%mend removeEmptyCols;
I will try all 3 ways and report back on which one is fastest...
P.S. added 23 Dec 2010 for future reference: SGF Paper 048-2010: Dropping Automatically Variables with Only Missing Values
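This is a very simple method that works for all variables (as written, it keeps only the all-missing variables; swap keep= for drop= to remove them instead):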
proc freq data=class nlevels ;
ods output nlevels=levels(where=(nmisslevels>0 and nnonmisslevels=0));
run;
proc sql noprint;
select TABLEVAR into :_MISSINGVARS separated by ' ' from levels;
quit;
data want;
set class (keep=&_MISSINGVARS);
run;