SQL substring contains

I have a datasetA with a long narrative field. This field is called "narrative."
I have datasetB full of animal terms, such as "dog", "cat", "mouse". This field is called "animals."
I would like to flag any instance where an animal name is found in the narrative of datasetA, and to create a new field in datasetA, called "animal_found", which pulls that name.
For example, if the word "dog" is found in a narrative, the animal_found field for that record will populate "dog".
If both "dog" and "cat" are found, the animal_found field will show "dog,cat".
Any thoughts on how to code this in SQL?

If you are using SQL Server, there is a way with dynamic SQL, but it's not very elegant or performant.
DECLARE @Animal nvarchar(100)
DECLARE cur CURSOR LOCAL FORWARD_ONLY FOR
SELECT Animal FROM datasetB
OPEN cur
FETCH NEXT FROM cur INTO @Animal
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @Query NVARCHAR(MAX)
SELECT @Query = 'SELECT Columns FROM datasetA WHERE narrative LIKE ''%' + @Animal + '%'''
EXEC sp_executesql @Query
FETCH NEXT FROM cur INTO @Animal
END
CLOSE cur
DEALLOCATE cur
The way to do it would probably be to create a temp table, insert the results into it inside the loop, and then format them the way you want. But as I said, cursors are not really performant. It works, though.
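To make the loop-and-accumulate idea above concrete, here is a minimal sketch of the same logic using Python's built-in sqlite3 module (the table names mirror datasetA/datasetB, and the sample rows are made up for illustration):

```python
import sqlite3

# In-memory database with stand-in tables for datasetA and datasetB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datasetA (narrative TEXT);
    CREATE TABLE datasetB (animal TEXT);
    INSERT INTO datasetA VALUES
        ('I saw a dog chase a cat'),
        ('nothing to see here');
    INSERT INTO datasetB VALUES ('dog'), ('cat'), ('mouse');
""")

# Loop over the lookup terms (what the cursor does above) and
# accumulate matches per narrative, then join them with commas.
matches = {}
for (animal,) in conn.execute("SELECT animal FROM datasetB"):
    rows = conn.execute(
        "SELECT narrative FROM datasetA WHERE narrative LIKE ?",
        (f"%{animal}%",),
    )
    for (narrative,) in rows:
        matches.setdefault(narrative, []).append(animal)

animal_found = {n: ",".join(a) for n, a in matches.items()}
print(animal_found)
```

The temp-table step in the T-SQL version plays the role of the `matches` dictionary here: collect the per-term hits first, then format them into one comma-separated value per narrative.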

Not SQL, but within a SAS data step this can be done relatively easily:
load the lookup data into a temporary array
loop through the list and search the text for each term
concatenate results as you loop
NOTE: This does not handle an extra 's' at the end of words, so you may want to consider how you'll handle "frog" vs. "frogs", since those are technically not the same word. You cannot just switch to FIND because of partial matches inside other words, but you could duplicate the loop or modify the FINDW call to check for both forms at the same time. I'll leave that to you to solve.
*fake text data;
data statements;
infile cards;
input narrative $100.;
cards;
This is some random text with words that are weirhd such as cat, dog frogs, and any other weird names
This is a notehr rnaodm text with word ssuch as bird and cat
This has nothing in it
This is another phrages with elephants
;
run;
*fake words;
data words;
input word $20.;
cards;
cat
dog
frog
bird
elephant
;;;;
run;
*lookup;
data want;
*loads data set WORDS into temporary array W;
array W(5) $20. _temporary_;
if _n_=1 then do j=1 to 5;
set words;
W(j)=word;
end;
*main data set to work with;
length found $100.;
found = '';
set statements;
do i=1 to dim(w);
x = findw(narrative, W(i), " ,", 'ir');
if x > 0 then found = catx(", ", trim(found), w(i));
*for debugging, comment out/delete as needed;
put "N=" _N_ " | I= " i;
put "Statement = " narrative;
put "Word = " w(i);
put "Found = " x;
put "---------------";
end;
run;

SAS SQL is the wrong tool for aggregating rows into a concatenation result (csv string).
SQL can be used to obtain the found items to be concatenated and data step DOW loop for concatenating:
proc sql;
create view matched_animals as
select narrative, animal from
narratives left join animals on narrative contains trim(animal)
order by narrative, animal;
data want;
length animal_found $2000;
do until (last.narrative);
set matched_animals;
by narrative;
animal_found = catx(',',animal_found,animal);
end;
run;
This will work but may run out of resources depending on the cardinality of the narratives and animals tables and the matching rate.
A data step approach can utilize a hash object, with countw and scan, or findw. There are two approaches; way 2 is probably the best and most typical use case.
* Thanks Reeza for sample data;
data narratives;
infile cards;
input narrative $100.;
cards;
This is some random text with words that are weirhd such as cat, dog frogs, and any other weird names
This is a notehr rnaodm text with word ssuch as bird and cat
This has nothing in it
This is another phrages with elephants
;
run;
data animals;
input animal $20.;
cards;
cat
dog
frog
bird
elephant
;;;;
run;
data want;
set narratives;
length animals_found_way1 animals_found_way2 $2000;
if _n_ = 1 then do;
if 0 then set animals(keep=animal); * prep pdv;
declare hash animals(dataset:'animals');
animals.defineKey('animal');
animals.defineDone();
declare hiter animals_iter('animals');
end;
* check each word of narrative for animal match;
* way 1 use case: narratives shorter than animals list;
do _n_ = 1 to countw(narrative);
token = scan(narrative, _n_);
if animals.find(key:token) = 0 then
animals_found_way1 = catx(',', animals_found_way1, token);
loopcount_way1 = sum (loopcount_way1, 1);
end;
* check each animal for match;
* way 2 use case: animal list shorter than narratives;
do while (animals_iter.next() = 0);
if findw(narrative, trim(animal)) then
animals_found_way2 = catx(',', animals_found_way2, animal);
loopcount_way2 = sum(loopcount_way2, 1);
end;
put;
drop token animal;
run;

If the list of animals is not too long, try this method and see how it performs. I tested this on SQL Server 2017.
with
cte1 as
(select 'I have a dog, a cat and a bunny as my pets' narrative union all
select 'I have a horse, a bunny and a dog as my pets' union all
select 'I have a cat as my pet' union all
select 'I have a dog as my pet' union all
select 'I have nothing')
,cte2 as
(select 'cat' animals union all
select 'dog' union all
select 'parrot' union all
select 'bunny' union all
select 'horse')
select
narrative,
string_agg(case when narrative like concat('%',animals,'%') then animals end,',') animals_found
from cte1 cross join cte2
group by narrative;
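The same cross-join pattern carries over to other engines that have a string aggregate. As a quick illustration, here it is with Python's sqlite3, where GROUP_CONCAT plays the role of STRING_AGG (sample rows shortened from the CTEs above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE narratives (narrative TEXT);
    CREATE TABLE animals (animal TEXT);
    INSERT INTO narratives VALUES
        ('I have a dog, a cat and a bunny'),
        ('I have nothing');
    INSERT INTO animals VALUES ('cat'), ('dog'), ('bunny');
""")

# Cross join every narrative with every animal; CASE keeps only the
# matching animal names, and GROUP_CONCAT collapses them back to one
# row per narrative (narratives with no match aggregate to NULL).
results = dict(conn.execute("""
    SELECT narrative,
           GROUP_CONCAT(CASE WHEN narrative LIKE '%' || animal || '%'
                             THEN animal END, ',') AS animals_found
    FROM narratives CROSS JOIN animals
    GROUP BY narrative
""").fetchall())
print(results)
```

One caveat with this dialect: SQLite's GROUP_CONCAT does not guarantee element order, whereas SQL Server's STRING_AGG accepts WITHIN GROUP (ORDER BY ...) if you need a deterministic order.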


How do I "conditionally" count missing values per variable in PROC SQL?

I'm measuring expenses in different categories. I have two types of variables: a categorical variable which states whether the respondent has had expenses in the category (such as "Exkl_UtgUtl_Flyg"), and numerical variables (such as UtgUtl_FlygSSEK_Pers) which provide information on the amount spent by each respondent in that category.
I want to create a table which tells me if there are missing values in my numerical variables for categories where expenses have been reported (so missing values of "UtgUtl_FlygSSEK_Pers" where the variable "Exkl_UtgUtl_Flyg" equals 1, in an example with only one variable).
This works in a simple SQL query, so something like:
PROC SQL;
SELECT nmiss(UtgUtl_FlygSSEK_Pers)
FROM IBIS3_5
WHERE Exkl_UtgUtl_Flyg=1;
quit;
But I don't want to navigate between 20 different datasets to find my missing values; I want them all in the same table. I figure this should be possible if I write a subquery in the SELECT clause for each variable, so something like:
PROC SQL;
SELECT (SELECT nmiss(UtgUtl_FlygSSEK_Pers)
FROM IBIS3_5
WHERE Exkl_UtgUtl_Flyg=1) as nmiss_variable_1
FROM IBIS3_5;
quit;
This last query does not seem to work, however. It does not return a single value, but one value for each row in the dataset.
How do I make this work?
I suspect you want to generate a single value.
Either the total number of mismatches:
select sum(missing(UtgUtl_FlygSSEK_Pers) and Exkl_UtgUtl_Flyg=1) as nmiss
from ibis3_5
;
Or perhaps just a binary 1/0 flag of whether or not there are any mismatches.
select max(missing(UtgUtl_FlygSSEK_Pers) and Exkl_UtgUtl_Flyg=1) as any_miss
from ibis3_5
;
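The SUM/MAX-of-a-condition trick above carries over to most SQL dialects. Here is a minimal check using Python's sqlite3 (the table and column names are stand-ins for the ones in the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ibis3_5 (spend REAL, flag INTEGER);
    INSERT INTO ibis3_5 VALUES (10.0, 1), (NULL, 1), (NULL, 0), (5.0, 1);
""")

# SUM of a boolean condition counts the rows where it holds;
# MAX of the same condition is a 1/0 "are there any?" flag.
nmiss, any_miss = conn.execute("""
    SELECT SUM(spend IS NULL AND flag = 1),
           MAX(spend IS NULL AND flag = 1)
    FROM ibis3_5
""").fetchone()
print(nmiss, any_miss)
```

Because the aggregate collapses the whole table to one row, this avoids the one-value-per-row problem the subquery-in-SELECT approach ran into.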
Maybe a good usage of proc freq instead, especially if you have multiple values.
Not all of this is necessary, but this is a missing report. It depends exactly how you're defining missing, of course.
*create sample data to work with;
data class;
set sashelp.class;
if age=14 then
call missing(height, weight, sex);
if name='Alfred' then
call missing(sex, age, height);
label age="Fancy Age Label";
run;
*set input data set name;
%let INPUT_DSN = class;
%let OUTPUT_DSN = want;
*create format for missing;
proc format;
value $ missfmt ' '="Missing" other="Not Missing";
value nmissfmt .="Missing" other="Not Missing";
run;
*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods output onewayfreqs=temp;
proc freq data=&INPUT_DSN.;
table _all_ / missing;
format _numeric_ nmissfmt. _character_ $missfmt.;
run;
ods select all;
*Format output;
data long;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);
Variable_Value=strip(trim(vvaluex(variable)));
presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
keep variable variable_value frequency percent cum: presentation;
label variable='Variable' variable_value='Variable Value';
run;
proc sort data=long;
by variable;
run;
*make it a wide data set for presentation, with values as N (Percent);
proc transpose data=long out=wide_presentation (drop=_name_);
by variable;
id variable_value;
var presentation;
run;
*transpose only N;
proc transpose data=long out=wide_N prefix=N_;
by variable;
id variable_value;
var frequency;
run;
*transpose only percents;
proc transpose data=long out=wide_PCT prefix=PCT_;
by variable;
id variable_value;
var percent;
run;
*final output file;
data &Output_DSN.;
merge wide_N wide_PCT wide_presentation;
by variable;
drop _name_;
label N_Missing='# Missing' N_Not_Missing='# Not Missing'
PCT_Missing='% Missing' PCT_Not_Missing='% Not Missing' Missing='Missing'
Not_missing='Not Missing';
run;
title "Missing Report of &INPUT_DSN.";
proc print data=&output_dsn. noobs label;
run;

SAS - Proc SQL or Merge - Trying to optimise an INNER join that includes a string search (index)

I have a rudimentary SAS skillset, most of which involves "proc sql", so feel free to challenge the fundamental approach of using it.
I'm attempting to match one set of personal details against another set, the first having some ~400k rows and the other 22 million. The complexity is that the 400k rows feature previous names and postcodes as well as current ones (all on the same row), so my approach (code below) was to concatenate all of the surnames together and all of the postcodes together and search for the string from the second table (single name and postcode) within the concatenated strings using the index(source, excerpt) function.
proc sql;
CREATE TABLE R4 AS
SELECT DISTINCT
BS.CUST_ID,
ED.MATCH_ID
FROM T_RECS_WITH_CONCATS BS
INNER JOIN T_RECS_TO_MATCH ED
ON LENGTH(ED.SinglePostcode) > 4
AND index(BS.AllSurnames,ED.SingleSurname) > 0
AND index(BS.AllPostcodes,ED.SinglePostcode) > 0
;
QUIT;
In the above, AllSurnames can contain up to 9 surnames (delimited by |), and AllPostcodes up to 9 concatenated postcodes (again, delimited by |).
The downside of this is of course that it takes forever to run. Is there a more efficient way of doing this, either within a proc sql step or a real data step?
Here is a way using HASH component object
Presume the data sets are named SHORT_MANY and TALL_ONE. Use the data in SHORT_MANY to populate a multidata hash table that can operate as a lookup for values being checked in TALL_ONE.
Using just surname and postal code as the lookup key could result in many false matches.
Example (with numeric surname & postcode)
data SHORT_MANY;
do cust_id = 1 to 400;
array Surnames surname1-surname9;
array Postcodes postcode1-postcode9;
call missing (of surnames(*));
do index = 1 to dim(surnames);
surnames(index) = ceil (100000 * ranuni(123));
postcodes(index) = ceil ( 99999 * ranuni(123));
if ranuni(123) < 0.15 then leave;
end;
output;
end;
run;
data TALL_ONE(keep=match_id surname postcode forcemark);
do match_id = 1 to 22000;
surname = ceil(100000 * ranuni(1234));
postcode = ceil( 99999 * ranuni(1234));
forcemark = .;
if ranuni(123) < 0.15 then do; * randomly ensure some match will occur;
point = ceil(400*ranuni(123));
set SHORT_MANY point=point;
array surnames surname1-surname9;
array postcodes postcode1-postcode9;
do until (surname ne .);
index = ceil(9 * ranuni(123));
surname = surnames(index);
postcode = postcodes(index);
end;
forcemark = point;
end;
output;
end;
stop;
run;
data WHEN_TALL_MEETS_SHORT(keep=cust_id match_id index);
if 0 then set TALL_ONE SHORT_MANY ; * prep pdv (for hash host variables);
if _n_ = 1 then do;
length index 8;
declare hash lookup(multidata: 'yes');
lookup.defineKey('surname', 'postcode');
lookup.defineData('cust_id', 'index');
lookup.defineDone();
do while (not lookup_filled);
SET SHORT_MANY end=lookup_filled;
array Surnames surname1-surname9;
array Postcodes postcode1-postcode9;
do index = 1 to dim(surnames) while (surnames(index) ne .);
surname = surnames(index);
postcode = postcodes(index);
lookup.add();
end;
end;
end;
call missing (surname, postcode, cust_id, index);
set TALL_ONE;
rc = lookup.find(); * grab just first match -- has_next/find_next to retrieve other lookup matches;
run;
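The find() call above grabs only the first match per key; as the comment notes, has_next/find_next retrieve the rest. In Python terms, a multidata hash behaves like a dict mapping each (surname, postcode) key to a list of entries, and "find all matches" looks like this (the key and data values are purely illustrative):

```python
from collections import defaultdict

# Multidata lookup: each key keeps every (cust_id, index) pair added
# under it, mirroring a SAS hash declared with multidata:'yes'.
lookup = defaultdict(list)
lookup[("SMITH", "AB1")].append((101, 1))
lookup[("SMITH", "AB1")].append((205, 3))  # same key, second entry

def find_all(surname, postcode):
    """Return every match for the key, not just the first."""
    return lookup.get((surname, postcode), [])

print(find_all("SMITH", "AB1"))  # both entries
print(find_all("JONES", "ZZ9"))  # no match
```

In the SAS step this would mean looping find()/find_next() until the return code is nonzero and outputting one row per match, instead of taking just the first hit.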

Rename column headers - either after a key database in sas - or after values from first row

I need to rename the column headers of my variables so they match what I have in my key list. I attached a picture below to describe what I have and what I need.
My Data
I don't necessarily need actual code, just an idea of how to make it happen. :)
Thank you so much folks, and so sorry about the changes, I have never posted a question before.
If you have a table like
NEW1 NEW2 NEW3
OLDX OLDY OLDZ
And you want to use it to generate rename statement like
rename oldx=new1 oldy=new2 oldz=new3 ;
Then an easy way to do it is to use PROC TRANSPOSE to convert it into a separate row for each name pair.
proc transpose data=have out=names ;
var _all_;
run;
Which will get you a table like
_NAME_ COL1
NEW1 OLDX
NEW2 OLDY
NEW3 OLDZ
Then you can either use PROC SQL to quickly generate a macro variable with the pairs.
proc sql noprint;
select catx('=',col1,_name_) into :rename separated by ' '
from names;
quit;
data new ;
set old;
rename &rename ;
run;
If the list of names is too long to put into a single macro variable then just use a data step to generate the rename statement to a text file and use %INCLUDE to run it where you want.
filename code temp;
data _null_;
set names end=eof;
file code ;
if _n_=1 then put 'rename' ;
put col1 '=' _name_ ;
if eof then put ';';
run;
data new ;
set old;
%include code ;
run;
EDIT
You could probably do the last step directly from the data set and skip the proc transpose.
filename code temp;
data _null_;
set have ;
array _X _character_ ;
file code ;
put 'rename ' @ ;
do i=1 to dim(_X);
oldname = _x(i);
newname = vname(_x(i));
put oldname '=' newname @ ;
end;
put / ';' ;
stop;
run;
You can use column aliases to change what's displayed in the results header row.
SELECT A AS 'NewA',
B AS 'OtherB',
C AS 'diffC'
FROM <<Table>>
If you want 'NewA OtherB diffC' as a row in the results, you could do this:
SELECT 'NewA' AS 'A',
'OtherB' AS 'B',
'diffC' AS 'C'
UNION
SELECT A,
B,
C
FROM <<Table>>

SAS PROC SQL NOT CONTAINS multiple values in one statement

In PROC SQL, I need to select all rows where a column called "NAME" does not contain multiple values "abc", "cde" and "fbv" regardless of what comes before or after these values. So I did it like this:
SELECT * FROM A WHERE
NAME NOT CONTAINS "abc"
AND
NAME NOT CONTAINS "cde"
AND
NAME NOT CONTAINS "fbv";
which works just fine, but I imagine it would be a headache if we had a hundred of conditions. So my question is - can we accomplish this in a single statement in PROC SQL?
I tried using this:
SELECT * FROM A WHERE
NOT CONTAINS(NAME, '"abc" AND "cde" AND "fbv"');
but this doesn't work in PROC SQL, I am getting the following error:
ERROR: Function CONTAINS could not be located.
I don't want to use LIKE.
You could use regular expressions, I suppose.
data a;
input name $;
datalines;
xyabcde
xyzxyz
xycdeyz
xyzxyzxyz
fbvxyz
;;;;
run;
proc sql;
SELECT * FROM A WHERE
NAME NOT CONTAINS "abc"
AND
NAME NOT CONTAINS "cde"
AND
NAME NOT CONTAINS "fbv";
SELECT * FROM A WHERE
NOT (PRXMATCH('~ABC|CDE|FBV~i',NAME));
quit;
You can't use CONTAINS that way, though.
If you only need to exclude exact values (rather than substrings), you can use NOT IN:
SELECT * FROM A WHERE
NAME NOT IN ('abc','cde','fbv');
If the number of items is above a reasonable number to build inside code, you can create a table (work.words below) to store the words and iterate over it to check for occurrences:
data work.values;
input name $;
datalines;
xyabcde
xyzxyz
xycdeyz
xyzxyzxyz
fbvxyz
;
run;
data work.words;
length word $50;
input word $;
datalines;
abc
cde
fbv
;
run;
data output;
set values;
/* build a hash of words */
length word $50;
if _n_ = 1 then do;
/* this runs once only */
call missing(word);
declare hash words (dataset: 'work.words');
words.defineKey('word');
words.defineData('word');
words.defineDone();
end;
/* iterate hash of words */
declare hiter iter('words');
rc = iter.first();
found = 0;
do while (rc=0);
if index(name, trim(word)) gt 0 then do; /* check if word present using INDEX function */
found= 1;
rc = 1;
end;
else rc = iter.next();
end;
if found = 0 then output; /* output only if no word found in name */
drop word rc found;
run;

macro into a table or a macro variable with sas

I have this macro. The aim is to take the names of variables from the table dicofr and put the rows inside into the variable name using symput.
However, something is not working correctly, because that variable, &nvarname, is not seen as a variable.
This is the content of dico&&pays&l
varname descr
var12 aza
var55 ghj
var74 mcy
This is the content of dico&&pays&l..1
varname
var12
var55
var74
Below is my code
%macro testmac;
%let pays1=FR ;
%do l=1 %to 1 ;
data dico&&pays&l..1 ; set dico&&pays&l (keep=varname);
call symput("nvarname",trim(left(_n_))) ;
run ;
data a&&pays&l;
set a&&pays&l;
nouv_date=mdy(substr(date,6,2),01,substr(date,1,4));
format nouv_date monyy5.;
run;
proc sql;
create table toto
(nouv_date date , nomvar varchar (12));
quit;
proc sql;
insert into toto SELECT max(nouv_date),"&nvarname" as nouv_date as varname FROM a&&pays&l WHERE (&nvarname ne .);
%end;
%mend;
%testmac;
A subsidiary question: is it possible to have the varname and the date related to that varname in a macro variable? My man-a told me about this but I have never done that before.
Thanks in advance.
Edited:
I have this table
date col1 col2 col3 ... colx
1999M12 . . . .
1999M11 . 2 . .
1999M10 1 3 . 3
1999M9 0.2 3 2 1
I'm trying to find the name of the column with the maximum date for which the value inside the column is not missing.
For col1, it would be 1999M10. For col2, it would be 1999M11 etc ...
Based on your update, I think the following code does what you want. If you don't mind sorting your input dataset first, you can get all the values you're looking for with a single data step - no macros required!
data have;
length date $7;
input date col1 col2 col3;
format date2 monyy5.;
date2 = mdy(substr(date,6,2),1,substr(date,1,4));
datalines;
1999M12 . . .
1999M11 . 2 .
1999M10 1 3 .
1999M09 0.2 3 2
;
run;
/*Required for the following data step to work*/
/*Doing it this way allows us to potentially skip reading most of the input data set*/
proc sort data = have;
by descending date2;
run;
data want(keep = max_date:);
array max_dates{*} max_date1-max_date3;
array cols{*} col1-col3;
format max_date: monyy5.;
do until(eof); /*Begin DOW loop*/
set have end = eof;
/*Check to see if we've found the max date for each col yet.*/
/*Save the date for that col if applicable*/
j = 0;
do i = 1 to dim(cols);
if missing(max_dates[i]) and not(missing(cols[i])) then max_dates[i] = date2;
j + missing(max_dates[i]);
end;
/*Use j to count how many cols we still need dates for.*/
/* If we've got a full set, we can skip reading the rest of the data set*/
if j = 0 then do;
output;
stop;
end;
end; /*End DOW loop*/
run;
EDIT: if you want to output the names alongside the max date for each, that can be done with a slight modification:
data want(keep = col_name max_date);
array max_dates{*} max_date1-max_date3;
array cols{*} col1-col3;
format max_date monyy5.;
do until(eof); /*Begin DOW loop*/
set have end = eof;
/*Check to see if we've found the max date for each col yet.*/
/*If not then save date from current row for that col*/
j = 0;
do i = 1 to dim(cols);
if missing(max_dates[i]) and not(missing(cols[i])) then max_dates[i] = date2;
j + missing(max_dates[i]);
end;
/*Use j to count how many cols we still need dates for.*/
/* If we've got a full set, we can skip reading the rest of the data set*/
if j = 0 or eof then do;
do i = 1 to dim(cols);
col_name = vname(cols[i]);
max_date = max_dates[i];
output;
end;
stop;
end;
end; /*End DOW loop*/
run;
It looks to me like you're trying to use macros to generate INSERT INTO statements to populate your table. It's possible to do this without using macros at all, which is the approach I'd recommend.
You could use a data step to write out the INSERT INTO statements to a file. Then, following the data step, use a %include statement to run the file.
This will be easier to write/maintain/debug and will also perform better.
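A minimal sketch of that generate-then-execute pattern, shown in Python with sqlite3 for illustration (the values and the toto table mirror the question; in SAS you would PUT these lines to a file and run it with %include):

```python
import sqlite3

# Rows found earlier: (max date, variable name) pairs -- made-up values.
results = [("1999M10", "col1"), ("1999M11", "col2")]

# Step 1: generate one INSERT statement per result row, as text.
# (This is what the data step would write to the %include file.)
statements = [
    f"INSERT INTO toto (nouv_date, nomvar) VALUES ('{d}', '{v}');"
    for d, v in results
]

# Step 2: execute the generated statements (the %include step).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toto (nouv_date TEXT, nomvar TEXT)")
conn.executescript("\n".join(statements))

rows = conn.execute("SELECT * FROM toto ORDER BY nomvar").fetchall()
print(rows)
```

The point is the separation: one step builds the code as plain text from the data, a second step runs it, with no macro quoting or &&-resolution to debug in between.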