How to select a range of columns in a case statement in proc SQL? - sql

I have around 80 columns named diag1 to diag80. I am wondering how I can pick just 30 of them and apply a CASE statement in PROC SQL. The following code produces an error because it doesn't understand the range.
proc sql;
create table data_WANT as
select *,
case
when **diag1:diag30** in ('F00','G30','F01','F02','F03','F051') then 1
else 0
end as p_nervoussystem
from data_HAVE;
quit;
Thank you, any help is appreciated!

You have two problems with that attempted syntax. First, variable lists are not supported by PROC SQL (since they are not part of SQL syntax). Second, there is no simple syntax to search N variables for a list of M strings.
You will need a loop of some kind. It will be much easier in SAS data step code than in SQL.
For example, you could make an array to reference your 30 variables, then loop over the variables checking whether each one has a value in the list of values. You can stop checking once one is found.
data want;
set have;
array vars diag1-diag30;
p_nervoussystem=0;
do index=1 to dim(vars) while (not p_nervoussystem);
p_nervoussystem = vars[index] in ('F00','G30','F01','F02','F03','F051');
end;
run;

The inverse pattern to @Tom's: search for each nervous system diagnostic code
via INDEXW over a concatenation of the observed diagnoses (way 1)
via WHICHC over an array of the observed diagnoses (way 2)
data have;
infile datalines missover;
length id 8;
array dx(50) $5; * sized to match the dx1-dx50 list read by INPUT;
input id (dx1-dx50) (50*:$5.);
datalines;
1 A00 B00 A12
2 F00 Z12 T45
3 A01 A02 B12 F00
4 Q12
5 Q13
6 T14
7 F44 F45 F46
8 . . . . . . . . . . . . . . G30
;
data want;
length p_nervoussystem p_ns 4;
set have;
array dx dx:;
array ns(6) $5 _temporary_ ('F00','G30','F01','F02','F03','F051');
dx_catx = catx(' ', of dx(*));* drop dx_catx; * way 1;
do _n_ = 1 to dim(ns) until(p_nervoussystem);
p_nervoussystem = 0 < indexw(dx_catx, trim(ns(_n_))); * way 1;
p_ns = 0 < whichc(ns(_n_), of dx(*)); * way 2;
end;
run;

Try querying sys.tables and sys.columns (these are SQL Server catalog views) and filtering for your columns.
SELECT * FROM sys.tables INNER JOIN sys.columns ON columns.object_id = tables.object_id
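In SAS itself, the comparable column metadata is exposed to PROC SQL as DICTIONARY.COLUMNS. The following is only a sketch of how it could be used to build the CASE condition from the first question. It assumes the table is WORK.DATA_HAVE and that every column whose name starts with diag is followed only by a number; the macro variable name cond is made up for illustration.

```sas
proc sql noprint;
  /* Build "diag1 in (...) or diag2 in (...) or ..." for diag1-diag30 */
  select catx(' ', name, "in ('F00','G30','F01','F02','F03','F051')")
    into :cond separated by ' or '
    from dictionary.columns
    where libname = 'WORK'
      and memname = 'DATA_HAVE'
      and upcase(name) like 'DIAG%'
      and input(substr(name, 5), 8.) between 1 and 30;

  /* Use the generated condition in place of the unsupported variable range */
  create table data_WANT as
    select *,
           case when &cond then 1 else 0 end as p_nervoussystem
    from data_HAVE;
quit;
```

This keeps everything in PROC SQL, at the cost of generating a long OR condition; the data step array approach is usually simpler to read.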

Related

Filtering a numeric column name in SAS SQL

I was trying to select a reporting month column from table temp_trans, it looks like:
GPNr 202112 202201 202202 .... 202208
x 1 5 2 .... 3
y 0.4 2 3 .... 8
z 3 1 5 .... 6
proc sql noprint;
select distinct Berichtsmonat into :timeperiod1 - FROM work.Basis;
quit;
%put &timeperiod1;
---> 202112
Now I was trying to apply a condition on the 202112 column:
Code:
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_TEMP_TRANS_0000 AS
SELECT t1.*
FROM WORK.TEMP_TRANS t1
WHERE t1.&timeperiod1 NOT = .;
QUIT;
I get a syntax error for t1.202112.
It runs when I change it to: where t1.'202112'n not = .;
Any clue how I can fix this?
Thanks in advance.
Kind regards,
Ben
Put the macro-variable inside double quotes and add a trailing n to use the name literal syntax.
Single quotes won't resolve macro-variables inside of them, double will.
proc sql;
create table want as
select t1.*
from have t1
where t1."&timeperiod1."n ne .;
quit;
Change how you create the macro variable, so it's a valid SAS variable name.
Use the NLITERAL function.
proc sql noprint;
select distinct nliteral(Berichtsmonat) into :timeperiod1 - FROM work.Basis;
quit;
%put &timeperiod1.;
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_TEMP_TRANS_0000 AS
SELECT t1.*
FROM WORK.TEMP_TRANS t1
WHERE t1.&timeperiod1 NOT = .;
QUIT;

SAS - Proc SQL or Merge - Trying to optimise an INNER join that includes a string search (index)

I've a rudimentary SAS skillset, most of which involves "proc sql", so feel free to challenge the fundamental approach of using this.
I'm attempting to match one set of personal details against another set, the first having some ~400k rows and the other 22 million. The complexity is that the 400k rows feature previous names and postcodes as well as current ones (all on the same row), so my approach (code below) was to concatenate all of the surnames together and all of the postcodes together and search for the string from the second table (single name and postcode) within the concatenated strings using the index(source, excerpt) function.
proc sql;
CREATE TABLE R4 AS
SELECT DISTINCT
BS.CUST_ID,
ED.MATCH_ID
FROM T_RECS_WITH_CONCATS BS
INNER JOIN T_RECS_TO_MATCH ED
ON LENGTH(ED.SinglePostcode) > 4
AND index(BS.AllSurnames,ED.SingleSurname) > 0
AND index(BS.AllPostcodes,ED.SinglePostcode) > 0
;
QUIT;
In the above, AllSurnames can contain up to 9 surnames (delimited by |), and AllPostcodes up to 9 concatenated postcodes (again, delimited by |).
The downside of this is of course that it takes forever to run. Is there a more efficient way of doing this, either within a proc sql step or a real data step?
Here is a way using the HASH component object.
Presume the data sets are named SHORT_MANY and TALL_ONE. Use the data in SHORT_MANY to populate a multidata hash table that can operate as a lookup for values being checked in TALL_ONE.
Using just surname and postal code as the lookup key could result in many false matches.
Example (with numeric surname & postcode)
data SHORT_MANY;
do cust_id = 1 to 400;
array Surnames surname1-surname9;
array Postcodes postcode1-postcode9;
call missing (of surnames(*));
do index = 1 to dim(surnames);
surnames(index) = ceil (100000 * ranuni(123));
postcodes(index) = ceil ( 99999 * ranuni(123));
if ranuni(123) < 0.15 then leave;
end;
output;
end;
run;
data TALL_ONE(keep=match_id surname postcode forcemark);
do match_id = 1 to 22000;
surname = ceil(100000 * ranuni(1234));
postcode = ceil( 99999 * ranuni(1234));
forcemark = .;
if ranuni(123) < 0.15 then do; * randomly ensure some match will occur;
point = ceil(400*ranuni(123));
set SHORT_MANY point=point;
array surnames surname1-surname9;
array postcodes postcode1-postcode9;
do until (surname ne .);
index = ceil(9 * ranuni(123));
surname = surnames(index);
postcode = postcodes(index);
end;
forcemark = point;
end;
output;
end;
stop;
run;
data WHEN_TALL_MEETS_SHORT(keep=cust_id match_id index);
if 0 then set TALL_ONE SHORT_MANY ; * prep pdv (for hash host variables);
if _n_ = 1 then do;
length index 8;
declare hash lookup(multidata: 'yes');
lookup.defineKey('surname', 'postcode');
lookup.defineData('cust_id', 'index');
lookup.defineDone();
do while (not lookup_filled);
SET SHORT_MANY end=lookup_filled;
array Surnames surname1-surname9;
array Postcodes postcode1-postcode9;
do index = 1 to dim(surnames) while (surnames(index) ne .);
surname = surnames(index);
postcode = postcodes(index);
lookup.add();
end;
end;
end;
call missing (surname, postcode, cust_id, index);
set TALL_ONE;
rc = lookup.find(); * grab just first match -- has_next/find_next to retrieve other lookup matches;
run;
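The comment on the last line mentions retrieving the other lookup matches. A minimal sketch of that variation, assuming the same SHORT_MANY and TALL_ONE data sets as above, loops with FIND_NEXT until it returns a nonzero code and writes one row per match. The output data set name ALL_MATCHES is made up for illustration.

```sas
data ALL_MATCHES(keep=cust_id match_id index);
  if 0 then set TALL_ONE SHORT_MANY;   /* prep PDV for hash host variables */
  if _n_ = 1 then do;
    declare hash lookup(multidata: 'yes');
    lookup.defineKey('surname', 'postcode');
    lookup.defineData('cust_id', 'index');
    lookup.defineDone();
    /* load every non-missing (surname, postcode) pair from the short table */
    do while (not lookup_filled);
      set SHORT_MANY end=lookup_filled;
      array surnames  surname1-surname9;
      array postcodes postcode1-postcode9;
      do index = 1 to dim(surnames) while (surnames(index) ne .);
        surname  = surnames(index);
        postcode = postcodes(index);
        lookup.add();
      end;
    end;
  end;
  call missing(cust_id, index);
  set TALL_ONE;
  /* output one row per matching hash item; unmatched rows are not output */
  rc = lookup.find();
  do while (rc = 0);
    output;
    rc = lookup.find_next();
  end;
run;
```

Unlike the step above, this version drops TALL_ONE rows that have no match, because OUTPUT only runs inside the loop.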

Searching for pattern with characters and numerics in SAS

I am examining data quality and am trying to see how many rows are populated properly. The field should contain a string with one letter followed by nine digits and is of type 'Character' length 10.
Ex.
A123456789
B123531490
C319861045
I have tried using the PRXMATCH function, but I am unsure if I am using the proper syntax. I have also tried using PROC SQL with where not like "[A-Z][0-9][0-9]..." and so on. My feeling is that this should not be difficult to perform. Does anyone have a solution?
Best regards
You can construct a REGEX to make that test. Or just build the test using normal SAS functions.
data want ;
set have ;
flag1 = prxmatch('/^[A-Z][0-9]{9}$/',trim(name));
test1 = 'A' <= substr(name,1,1) <= 'Z' ; * compare the first character only;
test2 = not notdigit(trim(substr(name,2))) ;
test3 = length(name)=10;
flag2 = test1 and test2 and test3 ;
run;
Results:
Obs    name             flag1    test1    test2    test3    flag2
 1     A123456789590      0        1        1        0        0
 2     B123531490ABC      0        1        0        0        0
 3     C3198610           0        1        1        0        0
 4     A123456789         1        1        1        1        1
 5     B123531490         1        1        1        1        1
 6     C319861045         1        1        1        1        1
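Since the question asks how many rows are populated properly, the same regular expression can also be summed directly. This is a sketch only, assuming a data set named have with a character variable name as in the example above; the column aliases are made up for illustration.

```sas
proc sql;
  /* a comparison evaluates to 1 or 0, so SUM counts the rows where it is true */
  select count(*)                                                 as n_total,
         sum(prxmatch('/^[A-Z][0-9]{9}$/', trim(name)) > 0)       as n_valid,
         sum(prxmatch('/^[A-Z][0-9]{9}$/', trim(name)) = 0)       as n_invalid
  from have;
quit;
```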
You can use:
^[a-zA-Z][0-9]{9}$
The built-in SAS functions NOTALPHA and NOTDIGIT can perform validation testing.
invalid_flag = notalpha(substr(s,1,1)) or notdigit(s,2) ;
You can select invalid records directly with a where statement or option
data invalid;
set raw;
where notalpha(substr(s,1,1)) or notdigit(s,2) ; * statement;
run;
data invalid;
set raw (where=(notalpha(substr(s,1,1)) or notdigit(s,2))); * data set option;
run;
There are several functions in the NOT* and ANY* families and they can offer faster performance than the general purpose regular expression functions in the PRX* family.
You can use PRXPARSE and PRXMATCH as shown below.
data have;
input name $20.;
datalines;
A123456789590
B123531490ABC
C3198610
A123456789
B123531490
C319861045
;
data want;
set have;
if _n_=1 then do;
retain re;
re = prxparse('/^[a-zA-Z][0-9]{9}$/');
end;
if prxmatch(re,trim(name)) gt 0 then Flag ='Y';
else Flag ='N';
drop re;
run;
If you want only the records that match the criteria, use a subsetting IF:
data want;
set have;
if _n_=1 then do;
retain re;
re = prxparse('/^[a-zA-Z][0-9]{9}$/');
end;
if prxmatch(re,trim(name));
drop re;
run;

Use SAS macro variable to create variable name in PROC SQL

I'm trying to create a set of flags based off of a column of character strings in a data set. The string has thousands of unique values, but I want to create flags for only a small subset (say 10). I'd like to use a SAS macro variable to do this. I've tried many different approaches, none of which have worked. Here is the code that seems simplest and most logical to me, although it's still not working:
%let Px1='12345';
PROC SQL;
CREATE TABLE CLAIM1 AS
SELECT
b.MEMBERID
, b.ENROL_MN
, CASE WHEN (a.PROCEDURE = &Px1.) THEN 1 ELSE 0 END AS CPT_+&Px1.
, a.DX1
, a.DX2
, a.DX3
, a.DX4
FROM ENROLLMENT as b
left join CLAIMS as a
on a.MEMBERID = b.MEMBERID;
QUIT;
Obviously there is only one flag in this code, but once I figure it out the idea is that I would add additional macro variables and flags. Here is the error message I get:
8048 , CASE WHEN (PROCEDURE= &Px1.) THEN 1 ELSE 0 END AS CPT_+&Px1.
-
78
ERROR 78-322: Expecting a ','.
It seems that the cause of the problem is related to combining the string CPT_ with the macro variable. As I mentioned, I've tried several approaches to addressing this, but none have worked.
Thanks in advance for your help.
Something like this normally requires dynamic SQL (although I am not sure how that would work with SAS; I believe it may depend on how you have established the connection with the database).
Proc sql;
DECLARE @px1 varchar(20) = '12345'
,@sql nvarchar(max) =
'SELECT b.MEMBERID
, b.ENROL_MN
, CASE WHEN (a.PROCEDURE = ' + @px1 + ') THEN 1 ELSE 0
END AS CPT_' + @px1 + '
, a.DX1
, a.DX2
, a.DX3
, a.DX4
FROM ENROLLMENT as b
left join CLAIMS as a
on a.MEMBERID = b.MEMBERID'
EXEC sp_executesql @sql;
QUIT;
Your issue here is the quotes in the macro variable.
%let Px1='12345';
So now SAS is seeing this:
... THEN 1 ELSE 0 END AS CPT_+'12345'
That's not remotely legal! You need to remove the '.
%let Px1 = 12345;
Then add back on at the right spot.
CASE WHEN a.procedure = "&px1." THEN 1 ELSE 0 END AS CPT_&px1.
Note " not ' as that lets the macro variable resolve.
If you have a list it might help to put the list into a table. Then you can use SAS code to generate the code to make the flag variables instead of macro code.
Say a table with PX code variable.
data pxlist;
input px $10. ;
cards;
12345
4567
;
You could then use a PROC SQL query to generate the code that makes the flag variables and store it in a macro variable.
proc sql noprint;
select catx(' ','PROCEDURE=',quote(trim(px)),'as',cats('CPT_',px))
into :flags separated by ','
from pxlist
;
%put &=flags;
quit;
Code looks like
PROCEDURE= "12345" as CPT_12345,PROCEDURE= "4567" as CPT_4567
So if we make some dummy data.
data enrollment ;
length memberid $8 enrol_mn $6 ;
input memberid enrol_mn;
cards;
1 201612
;
data claims;
length memberid $8 procedure $10 dx1-dx4 $10 ;
input memberid--dx4 ;
cards;
1 12345 1 2 . . .
1 345 1 2 3 . .
;
We can then combine the two tables and create the flag variables.
proc sql noprint;
create table want as
select *,&flags
from ENROLLMENT
natural join CLAIMS
;
quit;
Results
memberid    procedure    dx1    dx2    dx3    dx4    enrol_mn    CPT_12345    CPT_4567
1           12345        1      2                    201612      1            0
1           345          1      2      3             201612      0            0

macro into a table or a macro variable with sas

I have this macro. The aim is to take the variable names from the table dicofr and put each row's value into the macro variable nvarname using symput.
However, something is not working correctly, because &nvarname is not seen as a variable.
This is the content of dico&&pays&l
varname descr
var12 aza
var55 ghj
var74 mcy
This is the content of dico&&pays&l..1
varname
var12
var55
var74
Below is my code
%macro testmac;
%let pays1=FR ;
%do l=1 %to 1 ;
data dico&&pays&l..1 ; set dico&&pays&l (keep=varname);
call symput("nvarname",trim(left(_n_))) ;
run ;
data a&&pays&l;
set a&&pays&l;
nouv_date=mdy(substr(date,6,2),01,substr(date,1,4));
format nouv_date monyy5.;
run;
proc sql;
create table toto
(nouv_date date , nomvar varchar (12));
quit;
proc sql;
insert into toto SELECT max(nouv_date),"&nvarname" as nouv_date as varname FROM a&&pays&l WHERE (&nvarname ne .);
%end;
%mend;
%testmac;
A subsidiary question: is it possible to put the varname and the date related to that varname into a macro variable? My manager told me about this, but I have never done that before.
Thanks in advance.
Edited:
I have this table
date col1 col2 col3 ... colx
1999M12 . . . .
1999M11 . 2 . .
1999M10 1 3 . 3
1999M9 0.2 3 2 1
For each column, I'm trying to find the maximum date at which the value in that column is not missing.
For col1, it would be 1999M10. For col2, it would be 1999M11 etc ...
Based on your update, I think the following code does what you want. If you don't mind sorting your input dataset first, you can get all the values you're looking for with a single data step - no macros required!
data have;
length date $7;
input date col1 col2 col3;
format date2 monyy5.;
date2 = mdy(substr(date,6,2),1,substr(date,1,4));
datalines;
1999M12 . . .
1999M11 . 2 .
1999M10 1 3 .
1999M09 0.2 3 2
;
run;
/*Required for the following data step to work*/
/*Doing it this way allows us to potentially skip reading most of the input data set*/
proc sort data = have;
by descending date2;
run;
data want(keep = max_date:);
array max_dates{*} max_date1-max_date3;
array cols{*} col1-col3;
format max_date: monyy5.;
do until(eof); /*Begin DOW loop*/
set have end = eof;
/*Check to see if we've found the max date for each col yet.*/
/*Save the date for that col if applicable*/
j = 0;
do i = 1 to dim(cols);
if missing(max_dates[i]) and not(missing(cols[i])) then max_dates[i] = date2;
j + missing(max_dates[i]);
end;
/*Use j to count how many cols we still need dates for.*/
/* If we've got a full set, we can skip reading the rest of the data set*/
if j = 0 or eof then do; /*also output at end of file if some col never has a value*/
output;
stop;
end;
end; /*End DOW loop*/
run;
EDIT: if you want to output the names alongside the max date for each, that can be done with a slight modification:
data want(keep = col_name max_date);
array max_dates{*} max_date1-max_date3;
array cols{*} col1-col3;
format max_date monyy5.;
do until(eof); /*Begin DOW loop*/
set have end = eof;
/*Check to see if we've found the max date for each col yet.*/
/*If not then save date from current row for that col*/
j = 0;
do i = 1 to dim(cols);
if missing(max_dates[i]) and not(missing(cols[i])) then max_dates[i] = date2;
j + missing(max_dates[i]);
end;
/*Use j to count how many cols we still need dates for.*/
/* If we've got a full set, we can skip reading the rest of the data set*/
if j = 0 or eof then do;
do i = 1 to dim(cols);
col_name = vname(cols[i]);
max_date = max_dates[i];
output;
end;
stop;
end;
end; /*End DOW loop*/
run;
It looks to me like you're trying to use macros to generate INSERT INTO statements to populate your table. It's possible to do this without using macros at all, which is the approach I'd recommend.
You could use a data step to write the INSERT INTO statements out to a file. Then, following the data step, use a %include statement to run the file.
This will be easier to write, maintain and debug, and will also perform better.
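A rough sketch of that idea follows, under some assumptions taken from the question: with pays1=FR the tables resolve to dicoFR and aFR, aFR already has the nouv_date variable, and the table toto has already been created. The fileref name ins is made up for illustration.

```sas
filename ins temp;   /* temporary file to hold the generated code */

data _null_;
  set dicoFR(keep=varname) end=last;
  file ins;
  if _n_ = 1 then put 'proc sql;';
  /* one INSERT per variable: latest date where that variable is non-missing */
  put 'insert into toto select max(nouv_date), "' varname +(-1)
      '" from aFR where ' varname 'is not missing;';
  if last then put 'quit;';
run;

%include ins / source2;   /* run the generated statements, echoing them to the log */
```

The +(-1) pointer control backs up over the blank that PUT writes after varname, so the closing quote sits directly after the name.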