SAS update multiple records for a by group - sql

I have a master dataset A and a transaction dataset B. I am trying to update records in A with the records in B by variable C.
DATA TEST;
  UPDATE A B;
  BY C;
RUN;
The issue is that I have some duplicate records in my master dataset, and I still want to update all of them. But what I get is a warning:
There was more than one record for the specified BY group
And only the first record out of those duplicates gets updated.
Is there any way how to tell SAS to update all of them?
Or is there any other, completely different way?
Any help appreciated.

If you create an index on the ID variable used for your update, you can do this using a MODIFY statement. This should be much quicker than using an UPDATE statement, as it avoids creating a temporary copy of the master table; however, if the data step is interrupted, there is a risk of data corruption. The syntax is a bit clunky, but it can be macro-ised if necessary.
data master;
  input ID1 ID2 VAR1 VAR2;
  cards;
1 1 2 3
1 2 3 4
2 1 5 6
;
run;
data transaction;
  input ID1 VAR1 VAR2;
  cards;
1 7 8
;
run;
proc datasets lib=work nolist nodetails;
  modify master;
  index create ID1;
quit;
data master;
  set transaction(rename=(VAR1=t_VAR1 VAR2=t_VAR2));
  do until(eof);
    modify master key=ID1 end=eof;
    if _IORC_ then _ERROR_ = 0;
    else do;
      VAR1 = t_VAR1;
      VAR2 = t_VAR2;
      replace;
    end;
  end;
  drop t_VAR1 t_VAR2;
run;

If you really want to apply transactions, then expand your transaction file to include all possible combinations of the key variables C and D (where D is a second key variable that distinguishes the duplicates) for the values of C it does contain.
proc sql;
  create table transactions as
    select a.D, b.*
    from A right join B
      on a.C = b.C
    order by b.C, a.D
  ;
quit;
Then do the update.
data want;
  update A transactions;
  by c d;
run;
If you try to use MERGE, then you will get into trouble when the extra variables exist in both tables: SAS will only change the values of the first record for each value of C. You could program around this by renaming the variables in the B dataset; you could then explicitly code whether you want the action to behave like a MERGE or an UPDATE. So if your extra variable is named E, then you could code like this:
data want;
  merge a b(in=inb rename=(e=new_e));
  by c;
  updated_e = coalesce(new_e, e);
  if inb then merged_e = new_e;
  else merged_e = e;
run;
So if you want the effect of MERGE (so that a missing value of E in the transaction makes it missing in the result), use the formula for MERGED_E. If you want the effect of UPDATE, use the formula for UPDATED_E. If you have more than one extra variable, rename them as well and add extra assignment statements to handle them.
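As a quick sketch of that last point (the second variable F here is hypothetical, just to show how a second renamed variable is handled alongside E), the UPDATE-style version with two extra variables might look like:

```sas
/* Sketch: UPDATE-style merge with two overlapping variables.        */
/* F is a hypothetical second extra variable, handled the same as E. */
data want;
  merge a b(in=inb rename=(e=new_e f=new_f));
  by c;
  /* keep the transaction value unless it is missing */
  e = coalesce(new_e, e);
  f = coalesce(new_f, f);
  drop new_e new_f;
run;
```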

check whether proc append was successful

I have some code which appends yesterday's data to [large dataset], using proc append. After doing so it changes the value of the variable "latest_date" in another dataset to yesterday's date, thus showing the maximum date value in [large dataset] without a time-consuming data step or proc sql.
How can I check, within the same program in which proc append is used, whether proc append was successful (no errors)? My goal is to change the "latest_date" variable in this secondary dataset only if the append is successful.
Try the automatic macro variable &SYSCC.
data test;
  do i=1 to 10;
    output;
  end;
run;
data t1;
  i=11;
run;
data t2;
  XXX=12;
run;
proc append base=test data=t1;
run;
%put &syscc;
proc append base=test data=t2;
run;
%put &syscc;
I'm using the %get_table_size macro, which I found here. My steps are:
1. Run %get_table_size(large_table, size_preappend)
2. Create a dataset called to_append
3. Run %get_table_size(to_append, append_size)
4. Run proc append
5. Run %get_table_size(large_table, size_postappend)
6. Check whether &size_postappend = &size_preappend + &append_size
Using &syscc isn't exactly what I wanted, because it doesn't check specifically for an error in proc append. It could be thrown off by earlier errors.
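One way around that concern (a sketch, not from the answers above): &SYSCC is writable, so you can reset it to 0 immediately before the PROC APPEND, and then any non-zero value afterwards must have come from the append itself. The %IF in open code assumes SAS 9.4M5 or later; on older releases, wrap the check in a macro.

```sas
/* Reset SYSCC so errors from earlier steps don't carry over */
%let syscc = 0;

proc append base=test data=t1;
run;

/* SYSCC > 4 means the append raised an error (4 is only a warning) */
%if &syscc. > 4 %then %do;
  %put ERROR: append failed - not updating latest_date;
%end;
%else %do;
  /* safe to update the latest_date dataset here */
%end;
```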
You can do this by counting how many records are in the table before and after appending. This would work with any SAS table or database.
The best practice is to always have a control table for your process, to log run time and the number of records read.
Code:
/*Create input data*/
data work.t1;
  input row;
  datalines;
1
2
;
run;
data work.t2;
  input row;
  datalines;
3
;
run;
/*Create the control table. Run this bit only once, otherwise you delete the table every time*/
data work.cntrl;
  length load_dt 8 source 8 delta 8 total 8;
  format load_dt datetime21.;
run;
proc sql; delete from work.cntrl; quit;
/*Count records before the append*/
proc sql noprint; select count(*) into :count_t1 from work.t1; quit;
proc sql noprint; select count(*) into :count_t2 from work.t2; quit;
/*Append data*/
proc append base=work.t1 data=work.t2; run;
/*Count records after the append*/
proc sql noprint; select count(*) into :count_final from work.t1; quit;
/*Insert the counts and a timestamp into the control table*/
proc sql noprint;
  insert into work.cntrl
  values(%sysfunc(datetime()), &count_t1., &count_t2., &count_final.);
quit;
Output: Control table is updated

How can I add a new observation on top of a data set in SAS?

I know that I can use PROC SQL's INSERT INTO table-name VALUES ('value'),
or INSERT INTO table-name SET variable-name = numeric-value or 'char_value',
but my question is: how do I make sure that this value appears at the top of the dataset instead of in the default bottom position?
data temp;
  input x y;
  datalines;
1 2
3 4
5 6
;
run;
proc sql;
  insert into work.temp
    (x, y)
    values (8, 9);
quit;
You cannot insert values "on top" of the dataset without rewriting the dataset. INSERT (and PROC APPEND) work by avoiding rewriting the entire dataset and instead just adding the rows at the bottom. SAS datasets have a defined structure in which observations are physically stored in the order they will be processed when normal, sequential processing is used (as opposed to index-based or random-access methods).
To put rows at the "top" of the dataset, simply create a new dataset (which can use the same name if you choose, though technically it will be a different dataset) and add them however you like. Even something as simple as the step below would work, though in a real application I'd put the data-to-be-inserted into a separate dataset (as it would probably come from a different data source).
data temp;
  if _n_=1 then do; *if on first iteration, add things here;
    x=8;
    y=9;
    output; *outputs the new record;
  end;
  set temp;
  output; *outputs the original record;
run;
You can also do this in a data step, as follows:
data a;
  x=1;
  y=2;
  output;
  x=3;
  y=4;
  output;
run;
data b;
  x=7;
  y=8;
  output;
run;
data c;
  set b a;
run;

Conditional Insert using SAS Proc SQL

I am trying to add records from one smaller table into a very large table if the primary key value for the rows in the smaller table is not in the larger one:
data test;
  length b c $4;
  infile datalines delimiter=',';
  input a b $ c $;
  datalines;
1000,Test,File
2000,Test,File
3000,Test,File
;
data test2;
  length b c $4;
  infile datalines delimiter=',';
  input a b $ c $;
  datalines;
1000,Test,File
4000,Test,File
;
proc sql;
  insert into test
    select * from test2
    where a not in (select a from test2);
quit;
This, however, inserts no records into the table TEST. Can anyone tell me what I am doing wrong? The end result should be that the row where a = 4000 is added to the table TEST.
EDIT:
Using where a not in (select a from test) was what I originally tried and it generated the following error:
WARNING: This DELETE/INSERT statement recursively references the target table. A consequence of this is a possible data integrity problem.
ERROR: You cannot reopen WORK.TEST.DATA for update access with member-level control because WORK.TEST.DATA is in use by you in resource
environment SQL.
ERROR: PROC SQL could not undo this statement if an ERROR were to happen as it could not obtain exclusive access to the data set. This
statement will not execute as the SQL option UNDO_POLICY=REQUIRED is in effect.
224 quit;
Thanks
You can do the process in two steps: first create the table of records to insert, and then insert them.
proc sql;
  create table to_add as
    select * from test2
    where a not in (select a from test)
  ;
  insert into test select * from to_add;
quit;
Or you could just change the setting for the UNDO_POLICY option and SAS will let you reference TEST while updating TEST.
proc sql undo_policy=none;
  insert into test
    select * from test2
    where a not in (select a from test)
  ;
quit;

Delete specific rows in Oracle Database using SAS table

I have an Oracle table with 1M rows in it, and a subset of the Oracle table in SAS with 3000 rows in it. I want to delete these 3000 rows from the Oracle table.
Oracle Table columns are
Col1 Col2 Col3 timestamp
SAS Table columns are:
Col1 Col2 Col3
The only additional column the Oracle table has is a timestamp. This is the code that I am using currently, but it's taking a lot of time.
libname ora oracle user='xxx' password='ppp' path=abcd;
proc sql;
  delete from ora.oracle_table a
  where exists (select * from sas_table b
                where a.col1=b.col1 and a.col2=b.col2 and a.col3=b.col3);
quit;
Please advise as to how to make it faster and more efficient.
Thank You !
One option is to push your SAS table up to Oracle, then use Oracle-side commands to perform the delete. I'm not sure exactly how SAS will translate the above code into DBMS-specific code, but it might be pushing a lot of data over the network, depending on how it's able to optimize the query; in particular, if it has to perform the join locally instead of on the database, that will be very expensive. Further, Oracle can probably do the delete faster using entirely native operations.
I.e.:
libname ora ... ;
data ora.gtt_tableb; *or create a temporary or GT table in Oracle and insert into it via proc sql;
  set sas_tableb;
run;
proc sql;
  connect to oracle (...);
  execute (
    delete from ...
  ) by oracle;
quit;
That may offer significant performance improvements over using the LIBNAME connection.
Further improvements may be possible if you take full advantage of an index on your PKs, if you don't already have that.
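For instance (a sketch only: the staging-table name gtt_tableb comes from the step above, and the column list is assumed from the question), the index could be created Oracle-side via SQL pass-through:

```sas
proc sql;
  connect to oracle (user='xxx' password='ppp' path=abcd);
  /* Hypothetical index on the staging table's key columns, */
  /* created Oracle-side so the delete's join can use it.   */
  execute (
    create index ix_gtt_tableb on gtt_tableb (col1, col2, col3)
  ) by oracle;
quit;
```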
@Joe has a good answer. Another way would be to do something like this, which MIGHT allow the libname engine to pass all the work to Oracle instead of retrieving rows back to SAS (which is where your time is going).
I created some test data to show it:
data test1 test2;
  do i=1 to 10;
    do j=1 to 10;
      do k=1 to 10;
        output;
      end;
    end;
  end;
run;
data todel;
  do i=1 to 3;
    do j=1 to 3;
      do k=1 to 3;
        output;
      end;
    end;
  end;
run;
proc sql noprint;
  delete from test1 as a
  where a.i in (select distinct i from todel)
    and a.j in (select distinct j from todel)
    and a.k in (select distinct k from todel);
quit;
proc sql noprint;
  delete from test2 as a
  where exists (select * from todel as b where a.i=b.i and a.j=b.j and a.k=b.k);
quit;
Thank you guys. Joe, I used your suggestion and wrote this code.
/*---create a temp table in oracle---*/
libname ora oracle user='xxx' password='ppp' path=abcd;
proc append base=ora.TEMP_TABLE data=SAS.sas_TABLE;
run;
/*-----delete the rows using the temp table--------*/
proc sql;
  connect to oracle(......);
  execute (
    delete from ORA.ORACLE_TABLE a
    where exists (select * from ora.TEMP_TABLE b
                  where a.col1=b.col1 and a.col2=b.col2 and a.col3=b.col3)
  ) by oracle;
quit;
Thank you so much guys ! I appreciate your feedback.

drop the duplicate columns with sas

Would the following code work?
I'm dropping duplicated columns from my tables. I feel a bit confused after thinking about it; my code looks like it works, but I'm concerned about unseen mistakes.
proc sql;
  create table toto as
    select min(nomvar) as nomvar, count(intitule) as compte
    from dicoat
    group by intitule
    having count(intitule) > 1;
data work.toto;
  set toto;
  do while(cpte>=1);
  proc sql;
    delete from dicoat where nomvar in (select nomvar from toto);
    insert into toto
      select min(nomvar) as nomvar, count(intitule) as compte from dicoat
      group by intitule
      having count(intitule) > 1;
  end;
run;
data _null_;
  file tempf;
  set toto end=lastobs;
  if _n_=1 then put "data aat;set aat (drop=";
  put var /;
  if lastobs then put ");run;";
run;
%inc tempf;
filename tempf clear;
After some thought and some questioning (OK, lots of questioning), one of my acquaintances helped me out with this:
proc sort data=dicoat;
  by title;
run;
data _null_;
  set dicoat end=last;
  length dropvar $1000;
  retain dropvar;
  by title;
  if not first.title then dropvar = catx(' ', dropvar, nomvar);
  if last then call symput('dropvar', trim(dropvar));
run;
data aat;
  set aat(drop=&DROPVAR.);
run;
It should do the trick of removing duplicate columns. And no, PROC SQL does not work within a data step.
Best.