SAS - Proc SQL or Merge - Trying to optimise an INNER join that includes a string search (index)

I've a rudimentary SAS skillset, most of which involves "proc sql", so feel free to challenge the fundamental approach of using this.
I'm attempting to match one set of personal details against another: the first table has ~400k rows and the second 22 million. The complication is that the 400k rows include previous names and postcodes as well as current ones (all on the same row), so my approach (code below) was to concatenate all of the surnames together and all of the postcodes together, then search for the single name and postcode from the second table within those concatenated strings using the index(source, excerpt) function.
proc sql;
CREATE TABLE R4 AS
SELECT DISTINCT
  BS.CUST_ID,
  ED.MATCH_ID
FROM T_RECS_WITH_CONCATS BS
INNER JOIN T_RECS_TO_MATCH ED
  ON LENGTH(ED.SinglePostcode) > 4
  AND index(BS.AllSurnames,ED.SingleSurname) > 0
  AND index(BS.AllPostcodes,ED.SinglePostcode) > 0
;
QUIT;
In the above, AllSurnames can contain up to 9 surnames (delimited by |), and AllPostcodes up to 9 concatenated postcodes (again, delimited by |).
The downside of this, of course, is that it takes forever to run. Is there a more efficient way of doing this, either within a proc sql step or in a data step?

Here is a way using the hash component object.
Presume the data sets are named SHORT_MANY and TALL_ONE. Use the data in SHORT_MANY to populate a multidata hash table that can operate as a lookup for values being checked in TALL_ONE.
Using just surname and postal code as the lookup key could result in many false matches.
Example (with numeric surname & postcode)
data SHORT_MANY;
do cust_id = 1 to 400;
array Surnames surname1-surname9;
array Postcodes postcode1-postcode9;
call missing (of surnames(*), of postcodes(*));
do index = 1 to dim(surnames);
surnames(index) = ceil (100000 * ranuni(123));
postcodes(index) = ceil ( 99999 * ranuni(123));
if ranuni(123) < 0.15 then leave;
end;
output;
end;
run;
data TALL_ONE(keep=match_id surname postcode forcemark);
do match_id = 1 to 22000;
surname = ceil(100000 * ranuni(1234));
postcode = ceil( 99999 * ranuni(1234));
forcemark = .;
if ranuni(123) < 0.15 then do; * randomly ensure some match will occur;
point = ceil(400*ranuni(123));
set SHORT_MANY point=point;
array surnames surname1-surname9;
array postcodes postcode1-postcode9;
do until (surname ne .);
index = ceil(9 * ranuni(123));
surname = surnames(index);
postcode = postcodes(index);
end;
forcemark = point;
end;
output;
end;
stop;
run;
data WHEN_TALL_MEETS_SHORT(keep=cust_id match_id index);
if 0 then set TALL_ONE SHORT_MANY ; * prep pdv (for hash host variables);
if _n_ = 1 then do;
length index 8;
declare hash lookup(multidata: 'yes');
lookup.defineKey('surname', 'postcode');
lookup.defineData('cust_id', 'index');
lookup.defineDone();
do while (not lookup_filled);
SET SHORT_MANY end=lookup_filled;
array Surnames surname1-surname9;
array Postcodes postcode1-postcode9;
do index = 1 to dim(surnames) while (surnames(index) ne .);
surname = surnames(index);
postcode = postcodes(index);
lookup.add();
end;
end;
end;
call missing (surname, postcode, cust_id, index);
set TALL_ONE;
rc = lookup.find(); * grab just first match -- has_next/find_next to retrieve other lookup matches;
run;
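For readers who find the data step hash syntax opaque, the same lookup idea can be sketched in Python (an illustration only; the names and sample values are invented): build a dict keyed on (surname, postcode) whose values are lists of customer IDs, mirroring the multidata hash, then probe it once per row of the tall table.

```python
# Sketch of the multidata hash lookup: one dict entry per (surname, postcode)
# pair, holding every cust_id that ever used that combination.
def build_lookup(short_many):
    lookup = {}
    for cust_id, pairs in short_many:  # pairs = [(surname, postcode), ...]
        for surname, postcode in pairs:
            lookup.setdefault((surname, postcode), []).append(cust_id)
    return lookup

def match(tall_one, lookup):
    # Yield (cust_id, match_id) for every hit -- the equivalent of the
    # find()/find_next() retrieval loop mentioned in the comment above.
    for match_id, surname, postcode in tall_one:
        for cust_id in lookup.get((surname, postcode), []):
            yield cust_id, match_id

short_many = [(1, [("SMITH", "AB1 2CD"), ("JONES", "EF3 4GH")]),
              (2, [("SMITH", "AB1 2CD")])]
tall_one = [(10, "SMITH", "AB1 2CD"), (11, "BROWN", "IJ5 6KL")]
lookup = build_lookup(short_many)
print(sorted(match(tall_one, lookup)))  # [(1, 10), (2, 10)]
```

The key point is the same as in the hash step: the 400k-row table is loaded once into memory, and each of the 22 million rows costs a single keyed probe instead of a scan of every concatenated string.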

Related

How to sort by a dynamic column in Oracle?

I have a complex Oracle query, but I will try to simplify it. I have something like this:
SELECT TBL1.*, TBL2.*
FROM TABLE_1 TBL1
LEFT JOIN (
SELECT *
FROM
(
SELECT TBL2.VERSION_ID, TBL2.CONFIG_ID, TBL2.VALUE
FROM TABLE_2 TBL2
)
PIVOT
(
MAX(VALUE) FOR CONFIG_ID IN (:metadataClassConfigs)
)
) TBL2 ON TBL1.VERSION_ID = TBL2.VERSION_ID
ORDER BY
CASE
WHEN :orderByCustomClass IS NOT NULL THEN
CASE
WHEN :orderByCustomClass = 1 THEN TBL2."1"
WHEN :orderByCustomClass = 21 THEN TBL2."21"
WHEN :orderByCustomClass = 22 THEN TBL2."22"
WHEN :orderByCustomClass = 23 THEN TBL2."23"
WHEN :orderByCustomClass = 24 THEN TBL2."24"
WHEN :orderByCustomClass = 25 THEN TBL2."25"
WHEN :orderByCustomClass = 26 THEN TBL2."26"
WHEN :orderByCustomClass = 27 THEN TBL2."27"
WHEN :orderByCustomClass = 28 THEN TBL2."28"
WHEN :orderByCustomClass = 29 THEN TBL2."29"
WHEN :orderByCustomClass = 30 THEN TBL2."30"
WHEN :orderByCustomClass = 31 THEN TBL2."31"
WHEN :orderByCustomClass = 32 THEN TBL2."32"
WHEN :orderByCustomClass = 34 THEN TBL2."34"
WHEN :orderByCustomClass = 35 THEN TBL2."35"
WHEN :orderByCustomClass = 36 THEN TBL2."36"
WHEN :orderByCustomClass = 41 THEN TBL2."41"
WHEN :orderByCustomClass = 42 THEN TBL2."42"
END
END;
and this is working fine. The input parameters are: :metadataClassConfigs is a list of numbers (1,21,22,23,24,25,26,27,28,29,30,31,32,34,35,36,41,42) and :orderByCustomClass can be any of these numbers.
I have many more numbers than this list (more than 1000), so I am wondering how I can order by a dynamic column, something like:
WHEN :orderByCustomClass IS NOT NULL THEN TBL2."{:orderByCustomClass}"
?
There are multiple ways to do dynamic SQL in Oracle PL/SQL. I'm assuming that you are talking about PL/SQL, because in other kinds of clients (like python-oracle or JDBC) the only way to send a query is to create a cursor from a string, so you are always forced to build the query somewhat dynamically anyway.
Native Dynamic SQL - execute immediate
Good for simple cases (note how the result is retrieved: it works best for a single row; it is more complicated for arrays).
The query is a string, so you can "build" it. If you need to, you can pass parameters with the USING clause (each parameter in the query must have a colon as a prefix). Be aware that they are mapped by position in the query, not by name.
declare
type t_rec is record (
<describe returned columns>
);
type t_result_array is table of t_rec index by pls_integer;
v_result_array t_result_array;
v_sort_column varchar2(4000);
begin
-- do some logic to determine name of column for order by:
v_sort_column := <some_logic determining column name for sorting>;
-- if logic is based on raw user input, then you should sanitize it:
v_sort_column := DBMS_ASSERT.QUALIFIED_SQL_NAME(v_sort_column);
-- build query based on v_sort_column value
execute immediate 'select ... from ...
order by '||v_sort_column
bulk collect into v_result_array;
<do something with result stored in v_result_array>
end;
/
Native Dynamic SQL - OPEN FOR
Very similar to execute immediate but based on cursor variable and OPEN FOR statement. To accomplish it you have to do 3 steps: open cursor, fetch rows and close cursor.
declare
type t_rec is record (
<describe returned columns>
);
type t_result_array is table of t_rec index by pls_integer;
v_result_array t_result_array;
v_sort_column varchar2(4000);
type t_ret_cursor is ref cursor; -- must be a weak ref cursor (no RETURN clause) to OPEN FOR a dynamic string
v_cursor t_ret_cursor;
begin
-- do some logic to determine name of column for order by:
v_sort_column := <some_logic determining column name for sorting>;
-- if logic is based on raw user input, then you should sanitize it:
v_sort_column := DBMS_ASSERT.QUALIFIED_SQL_NAME(v_sort_column);
open v_cursor for 'select ... from ...
order by '||v_sort_column;
fetch v_cursor bulk collect into v_result_array;
close v_cursor;
<do something with result stored in v_result_array>
end;
/
Dynamic SQL - DBMS_SQL package
This is the most flexible way of doing it - you can even conditionally change selected columns or dynamically check what kind of row is in result (number of columns, data types etc.). Furthermore, it is also one of the best in terms of performance.
I'm just putting information about this option here so you can see for yourself if you need these features and capabilities.
There are many more steps here and they are more complex (open cursor, parse, bind each parameter (optional), define columns, execute, fetch, access data), so I will not post an example. It's probably overkill for your purposes.
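As an aside, whichever variant you pick, the safe pattern for a dynamic ORDER BY column is the same: validate the name against a known set before interpolating it into the SQL text. A minimal Python sketch of that whitelist idea (the column names and query skeleton are hypothetical):

```python
# Whitelist check before string interpolation -- the same role DBMS_ASSERT
# plays in the PL/SQL examples above. The column names are hypothetical.
ALLOWED_SORT_COLUMNS = {"1", "21", "22", "41", "42"}

def build_query(order_by_class):
    col = str(order_by_class)
    if col not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unexpected sort column: {col!r}")
    # Safe to interpolate: col is known to be one of a fixed set of names.
    return f'SELECT ... FROM ... ORDER BY TBL2."{col}"'

print(build_query(21))  # SELECT ... FROM ... ORDER BY TBL2."21"
```

Bind parameters cannot name a column, so interpolation is unavoidable here; the whitelist (or DBMS_ASSERT) is what keeps it safe.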

sql substring contains

I have a datasetA with a long narrative field. This field is called "narrative."
I have datasetB full of animal terms, such as "dog", "cat", "mouse". This field is called "animals."
I would like to flag any instance where the animal names are found in the narrative of datasetA, and to create a new field in datasetA, called "animal_found" which pulls that name.
For example, if the word "dog" is found in a narrative, the animal_found field for that record will populate "dog"
if the words "dog" and "cat" are found, the animal_found field will show "dog,cat"
Any thoughts on how to code this in SQL?
If you are using SQL Server, there is a way with dynamic SQL, but it's neither very elegant nor performant.
DECLARE @Animal nvarchar(100)
DECLARE cur CURSOR LOCAL FORWARD_ONLY FOR
SELECT Animal FROM datasetB
OPEN cur
FETCH NEXT FROM cur INTO @Animal
WHILE @@FETCH_STATUS = 0
BEGIN
    DECLARE @Query NVARCHAR(MAX)
    SELECT @Query = 'SELECT Columns FROM datasetA where narrative like ''%' + @Animal + '%'''
    exec sp_executesql @Query
    FETCH NEXT FROM cur INTO @Animal
END
CLOSE cur
DEALLOCATE cur
The way to do it would probably be to create a temp table or something like that, then insert the results into it and format them the way you want. But as I said, cursors are not really performant. It works, though.
Not SQL, but within a data step this can be done relatively easily.
load lookup data into a temporary array
Loop through list and search text for the data
Concatenate results as you loop
NOTE: This does not handle an extra 's' at the end of words, so you may want to consider how you'll handle "frog" vs "frogs", since those are technically not the same word. You cannot just switch to FIND because of partial matches inside other words, but you could run the loop twice or modify the FINDW call to check for both forms at the same time. I'll leave that to you to solve.
*fake text data;
data statements;
infile cards;
input narrative $100.;
cards;
This is some random text with words that are weirhd such as cat, dog frogs, and any other weird names
This is a notehr rnaodm text with word ssuch as bird and cat
This has nothing in it
This is another phrages with elephants
;
run;
*fake words;
data words;
input word $20.;
cards;
cat
dog
frog
bird
elephant
;;;;
run;
*lookup;
data want;
*load data set WORDS into temporary array W;
array W(5) $20. _temporary_;
if _n_=1 then do j=1 to 5;
set words;
W(j)=word;
end;
*main data set to work with;
length found $100.;
found = '';
set statements;
do i=1 to dim(w);
x = findw(narrative, W(i), " ,", 'ir');
if x > 0 then found = catx(", ", trim(found), w(i));
*for debugging, comment out/delete as needed;
put "N=" _N_ " | I= " i;
put "Statement = " narrative;
put "Word = " w(i);
put "Found = " x;
put "---------------";
end;
run;
SAS SQL is the wrong tool for aggregating rows into a concatenated result (CSV string).
SQL can be used to obtain the found items to be concatenated, and a data step DOW loop can do the concatenating:
proc sql;
create view matched_animals as
select narrative, animal from
narratives left join animals on narrative contains trim(animal)
order by narrative, animal;
data want;
length animal_found $2000;
do until (last.narrative);
set matched_animals;
by narrative;
animal_found = catx(',',animal_found,animal);
end;
run;
This will work but may run out of resources depending on the cardinality of the narratives and animals tables and the matching rate.
A data step approach can utilize the hash object, with countw and scan, or findw. There are two approaches below, with way 2 the probable best / most typical use case.
* Thanks Reeza for sample data;
data narratives;
infile cards;
input narrative $100.;
cards;
This is some random text with words that are weirhd such as cat, dog frogs, and any other weird names
This is a notehr rnaodm text with word ssuch as bird and cat
This has nothing in it
This is another phrages with elephants
;
run;
data animals;
input animal $20.;
cards;
cat
dog
frog
bird
elephant
;;;;
run;
data want;
set narratives;
length animals_found_way1 animals_found_way2 $2000;
if _n_ = 1 then do;
if 0 then set animals(keep=animal); * prep pdv;
declare hash animals(dataset:'animals');
animals.defineKey('animal');
animals.defineDone();
declare hiter animals_iter('animals');
end;
* check each word of narrative for animal match;
* way 1 use case: narratives shorter than animals list;
do _n_ = 1 to countw(narrative);
token = scan(narrative, _n_);
if animals.find(key:token) = 0 then
animals_found_way1 = catx(',', animals_found_way1, token);
loopcount_way1 = sum (loopcount_way1, 1);
end;
* check each animal for match;
* way 2 use case: animal list shorter than narratives;
do while (animals_iter.next() = 0);
if findw(narrative, trim(animal)) then
animals_found_way2 = catx(',', animals_found_way2, animal);
loopcount_way2 = sum(loopcount_way2, 1);
end;
put;
drop token animal;
run;
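The "way 2" loop above (iterate the shorter animal list and do a whole-word search against each narrative) can also be sketched outside SAS. A minimal Python illustration, using a regex word boundary to play the role of findw (sample data invented):

```python
import re

# For each narrative, loop over the (shorter) animal list and do a
# whole-word, case-insensitive search, collecting hits into a CSV string.
# \b mimics findw(): "elephant" will not match inside "elephants".
def animals_found(narrative, animals):
    hits = [a for a in animals
            if re.search(rf"\b{re.escape(a)}\b", narrative, re.IGNORECASE)]
    return ",".join(hits)

animals = ["cat", "dog", "frog", "bird", "elephant"]
print(animals_found("text with words such as bird and cat", animals))  # cat,bird
```

Note that the hits come out in word-list order, not narrative order, exactly as in the hiter loop above.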
If the list of animals is not too long, try this method and see how it performs. I tested this on SQL Server 2017.
with
cte1 as
(select 'I have a dog, a cat and a bunny as my pets' narrative union all
select 'I have a horse, a bunny and a dog as my pets' union all
select 'I have a cat as my pet' union all
select 'I have a dog as my pet' union all
select 'I have nothing')
,cte2 as
(select 'cat' animals union all
select 'dog' union all
select 'parrot' union all
select 'bunny' union all
select 'horse')
select
narrative,
string_agg(case when narrative like concat('%',animals,'%') then animals end,',') animals_found
from cte1 cross join cte2
group by narrative;

SAS PROC SQL NOT CONTAINS multiple values in one statement

In PROC SQL, I need to select all rows where a column called "NAME" does not contain multiple values "abc", "cde" and "fbv" regardless of what comes before or after these values. So I did it like this:
SELECT * FROM A WHERE
NAME NOT CONTAINS "abc"
AND
NAME NOT CONTAINS "cde"
AND
NAME NOT CONTAINS "fbv";
which works just fine, but I imagine it would be a headache if we had a hundred of conditions. So my question is - can we accomplish this in a single statement in PROC SQL?
I tried using this:
SELECT * FROM A WHERE
NOT CONTAINS(NAME, '"abc" AND "cde" AND "fbv"');
but this doesn't work in PROC SQL, I am getting the following error:
ERROR: Function CONTAINS could not be located.
I don't want to use LIKE.
You could use regular expressions, I suppose.
data a;
input name $;
datalines;
xyabcde
xyzxyz
xycdeyz
xyzxyzxyz
fbvxyz
;;;;
run;
proc sql;
SELECT * FROM A WHERE
NAME NOT CONTAINS "abc"
AND
NAME NOT CONTAINS "cde"
AND
NAME NOT CONTAINS "fbv";
SELECT * FROM A WHERE
NOT (PRXMATCH('~ABC|CDE|FBV~i',NAME));
quit;
You can't use CONTAINS that way, though.
You can use NOT IN, though note that it matches exact values rather than substrings:
SELECT * FROM A WHERE
NAME NOT IN ('abc','cde','fbv');
If the number of items is above a reasonable number to build inside the code, you can create a table (work.words below) to store the words and iterate over it to check for occurrences:
data work.values;
input name $;
datalines;
xyabcde
xyzxyz
xycdeyz
xyzxyzxyz
fbvxyz
;
run;
data work.words;
length word $50;
input word $;
datalines;
abc
cde
fbv
;
run;
data output;
set values;
/* build a hash of words */
length word $50;
if _n_ = 1 then do;
/* this runs once only */
call missing(word);
declare hash words (dataset: 'work.words');
words.defineKey('word');
words.defineData('word');
words.defineDone();
end;
/* iterate hash of words */
declare hiter iter('words');
rc = iter.first();
found = 0;
do while (rc=0);
if index(name, trim(word)) gt 0 then do; /* check if word present using INDEX function */
found= 1;
rc = 1;
end;
else rc = iter.next();
end;
if found = 0 then output; /* output only if no word found in name */
drop word rc found;
run;
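The same filter logic, stripped to its essentials, looks like this in Python (a sketch only; plain substring matching plays the role of the INDEX function):

```python
# Keep only the rows whose NAME contains none of the lookup words
# (plain substring matching, like INDEX in the data step above).
def filter_not_contains(names, words):
    return [n for n in names if not any(w in n for w in words)]

names = ["xyabcde", "xyzxyz", "xycdeyz", "xyzxyzxyz", "fbvxyz"]
print(filter_not_contains(names, ["abc", "cde", "fbv"]))  # ['xyzxyz', 'xyzxyzxyz']
```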

How to split a column by the number of white spaces in it with SQL?

I've got a single column that contains a set of names in it. I didn't design the database so that it contains multiple values in one column, but as it is I've got to extract that information now.
The problem is that in one field I've got multiple values like in this example:
"Jack Tom Larry Stan Kenny"
So the first three should be one group, and the ones on the far right are another group. (Basically the only thing that separates the groups in the column is a specific amount of whitespace between them, let's say 50 characters.)
How can I split them in pure SQL, so that I can get two columns like this:
column1 "Jack Tom Larry"
column2 "Stan Kenny"
A fairly simplistic answer would be to use a combination of left(), right() and locate(). Something like this (note I've substituted 50 spaces with "XXX" for readability):
declare global temporary table session.x(a varchar(100))
on commit preserve rows with norecovery;
insert into session.x values('Jack Tom LarryXXXStan Kenny');
select left(a,locate(a,'XXX')-1),right(a,length(a)+1-(locate(a,'XXX')+length('XXX'))) from session.x;
If you need a more general method of extracting the nth field from a string with a given separator, a bit like the split_part() function in PostgreSQL, in Ingres your options would be:
Write a user defined function using the Object Management Extension (OME). This isn't entirely straightforward but there is an excellent example in the wiki pages of Actian's community site to get you started:
http://community.actian.com/wiki/OME:_User_Defined_Functions
Create a row-producing procedure. A bit more clunky to use than an OME function, but much easier to implement. Here's my attempt at such a procedure, not terribly well tested but it should serve as an example. You may need to adjust the widths of the input and output strings:
create procedure split
(
inval = varchar(200) not null,
sep = varchar(50) not null,
n = integer not null
)
result row r(x varchar(200)) =
declare tno = integer not null;
srch = integer not null;
ptr = integer not null;
resval = varchar(50);
begin
tno = 1;
srch = 1;
ptr = 1;
while (:srch <= length(:inval))
do
while (substr(:inval, :srch, length(:sep)) != :sep
and :srch <= length(:inval))
do
srch = :srch + 1;
endwhile;
if (:tno = :n)
then
resval=substr(:inval, :ptr, :srch - :ptr);
return row(:resval);
return;
endif;
srch = :srch + length(:sep);
ptr = :srch;
tno = :tno + 1;
endwhile;
return row('');
end;
select s.x from session.x t, split(t.a,'XXX',2) s;
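For comparison, the nth-field extraction that the procedure implements is a one-liner in most general-purpose languages. A Python sketch of the same split_part()-style behaviour:

```python
# Extract the nth field of a string given a multi-character separator,
# like PostgreSQL's split_part(); returns '' when n is out of range,
# matching the procedure's "return row('')" fallback.
def split_part(value, sep, n):
    parts = value.split(sep)
    return parts[n - 1] if 1 <= n <= len(parts) else ""

print(split_part("Jack Tom LarryXXXStan Kenny", "XXX", 2))  # Stan Kenny
```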

macro into a table or a macro variable with sas

I have this macro. The aim is to take the names of variables from the table dicofr and put the rows inside into the variable name using symput.
However, something is not working correctly, because that variable, &nvarname, is not seen as a variable.
This is the content of dico&&pays&l
varname descr
var12 aza
var55 ghj
var74 mcy
This is the content of dico&&pays&l..1
varname
var12
var55
var74
Below is my code
%macro testmac;
%let pays1=FR ;
%do l=1 %to 1 ;
data dico&&pays&l..1 ; set dico&&pays&l (keep=varname);
call symput("nvarname",trim(left(_n_))) ;
run ;
data a&&pays&l;
set a&&pays&l;
nouv_date=mdy(substr(date,6,2),01,substr(date,1,4));
format nouv_date monyy5.;
run;
proc sql;
create table toto
(nouv_date date , nomvar varchar (12));
quit;
proc sql;
insert into toto SELECT max(nouv_date),"&nvarname" as nouv_date as varname FROM a&&pays&l WHERE (&nvarname ne .);
%end;
%mend;
%testmac;
A subsidiary question: is it possible to have the varname and the date related to that varname in a macro variable? My manager told me about this but I have never done that before.
Thanks in advance.
Edited:
I have this table
date col1 col2 col3 ... colx
1999M12 . . . .
1999M11 . 2 . .
1999M10 1 3 . 3
1999M9 0.2 3 2 1
I'm trying to find, for each column, the name of the column together with the maximum date for which the value in that column is not missing.
For col1, it would be 1999M10. For col2, it would be 1999M11 etc ...
Based on your update, I think the following code does what you want. If you don't mind sorting your input dataset first, you can get all the values you're looking for with a single data step - no macros required!
data have;
length date $7;
input date col1 col2 col3;
format date2 monyy5.;
date2 = mdy(substr(date,6,2),1,substr(date,1,4));
datalines;
1999M12 . . .
1999M11 . 2 .
1999M10 1 3 .
1999M09 0.2 3 2
;
run;
/*Required for the following data step to work*/
/*Doing it this way allows us to potentially skip reading most of the input data set*/
proc sort data = have;
by descending date2;
run;
data want(keep = max_date:);
array max_dates{*} max_date1-max_date3;
array cols{*} col1-col3;
format max_date: monyy5.;
do until(eof); /*Begin DOW loop*/
set have end = eof;
/*Check to see if we've found the max date for each col yet.*/
/*Save the date for that col if applicable*/
j = 0;
do i = 1 to dim(cols);
if missing(max_dates[i]) and not(missing(cols[i])) then max_dates[i] = date2;
j + missing(max_dates[i]);
end;
/*Use j to count how many cols we still need dates for.*/
/* If we've got a full set, we can skip reading the rest of the data set*/
if j = 0 then do;
output;
stop;
end;
end; /*End DOW loop*/
run;
EDIT: if you want to output the names alongside the max date for each, that can be done with a slight modification:
data want(keep = col_name max_date);
array max_dates{*} max_date1-max_date3;
array cols{*} col1-col3;
format max_date monyy5.;
do until(eof); /*Begin DOW loop*/
set have end = eof;
/*Check to see if we've found the max date for each col yet.*/
/*If not then save date from current row for that col*/
j = 0;
do i = 1 to dim(cols);
if missing(max_dates[i]) and not(missing(cols[i])) then max_dates[i] = date2;
j + missing(max_dates[i]);
end;
/*Use j to count how many cols we still need dates for.*/
/* If we've got a full set, we can skip reading the rest of the data set*/
if j = 0 or eof then do;
do i = 1 to dim(cols);
col_name = vname(cols[i]);
max_date = max_dates[i];
output;
end;
stop;
end;
end; /*End DOW loop*/
run;
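The early-exit logic of that DOW loop (rows pre-sorted by descending date; the max date for each column is simply the first row where that column is non-missing) can be sketched in Python as follows (sample values taken from the question):

```python
# rows are (date_label, [col1, ..., coln]), already sorted by descending date.
# The max date for each column is the first row where that column is
# non-missing; stop early once every column has been resolved.
def max_dates(rows, ncols):
    result = [None] * ncols
    for date, vals in rows:
        for i, v in enumerate(vals):
            if result[i] is None and v is not None:
                result[i] = date
        if all(r is not None for r in result):
            break  # full set found -- the equivalent of the early STOP
    return result

rows = [("1999M12", [None, None, None]),
        ("1999M11", [None, 2, None]),
        ("1999M10", [1, 3, None]),
        ("1999M09", [0.2, 3, 2])]
print(max_dates(rows, 3))  # ['1999M10', '1999M11', '1999M09']
```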
It looks to me like you're trying to use macros to generate INSERT INTO statements to populate your table. It's possible to do this without using macros at all, which is the approach I'd recommend.
You could use a data step to write the INSERT INTO statements out to a file, then follow the data step with a %include statement to run that file.
This will be easier to write/maintain/debug and will also perform better.