Using Proc SQL to join two datasets with 2 matching variables

I have two datasets A & B. I want to join them against two fields: ID and End of Month date. This is defined as EOMDate in dataset A and BalDate in dataset B. How do I join them so that ID and the dates match with each other?

Tom's comment works. Here are a few worked samples:
/*Create some input data for the samples...*/
data first;
input id_a id_b data $;
cards;
1 1 A
2 2 B
3 33 C
4 4 D
55 5 E
;
run;
data second;
input id_a id_b data2 $;
cards;
1 1 AA
2 2 BB
3 3 CC
4 4 DD
5 5 EE
;
run;
/*The proc sql way. We create table 'combo' as result. */
/*You can add more conditions than one. */
proc sql noprint;
create table combo as
select * from first join second
on first.id_a=second.id_a and first.Id_b=second.id_b;
quit;
In my experience, proc sql can be quite slow when working with large datasets.
The same inner join can be done with a data step merge.
Both datasets must first be sorted by the key variables.
/*A way to accomplish this with datasets.*/
proc sort data=first; by id_a id_b; run;
proc sort data=second; by id_a id_b; run;
data Combo_sas;
merge first(in=a) second(in=b);
by id_a id_b;
if a and b;
run;
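For the variable names in the question (EOMDate in A, BalDate in B) the same pattern applies; only the ON condition changes. A minimal sketch, assuming both date variables hold SAS date values; the sample rows and the amount and balance variables are made up for illustration:

```sas
/* Illustrative input: A carries EOMDate, B carries BalDate. */
data A;
  input ID EOMDate :date9. amount;
  format EOMDate date9.;
  datalines;
1 31JAN2020 100
1 29FEB2020 110
2 31JAN2020 200
;
run;

data B;
  input ID BalDate :date9. balance;
  format BalDate date9.;
  datalines;
1 31JAN2020 50
2 31JAN2020 75
2 29FEB2020 80
;
run;

/* Inner join on both keys; the date variables may have different names. */
proc sql;
  create table joined as
  select a.ID, a.EOMDate, a.amount, b.balance
  from A as a
  inner join B as b
    on a.ID = b.ID
   and a.EOMDate = b.BalDate;
quit;
```

With a data step merge the key names have to match, so B would need a rename=(BalDate=EOMDate) dataset option before using by ID EOMDate.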

Related

Using SQL in SAS, how do I create a new column that counts/indicates the uniqueness of values in an existing column..?

My data is as follows:
ID
1
2
3
3
4
5
6
6
I want to create a column that indicates the uniqueness of a value in the ID column as such:
ID COUNT
1 1
2 1
3 1
3 0
4 1
5 1
6 1
6 0
I'd like to do this without creating a temporary table, via a subquery or something. Any assistance would be much appreciated.
One option would be to use by functionality in the data step:
data have;
input ID;
datalines;
1
2
3
3
4
5
6
6
;run;
data want;
set have;
by ID;
if first.ID then count = 1;
else count = 0;
run;
That type of logic is not really amenable to SQL, since the order of observations is not guaranteed. In a more modern version of SQL you could use windowing functions (like ROW_NUMBER() with PARTITION BY) to impose a record count.
If you really wanted to do it just in PROC SQL, you might need to resort to the undocumented MONOTONIC() function. But even then, to stop the optimizer from eliminating the duplicate rows, you might need to make a temporary table with the row counter first.
data have;
input ID @@;
datalines;
1 2 3 3 4 5 6 6
;
proc sql ;
create table _temp_ as select id,monotonic() as row from have;
create table want as
select a.id
, b.row=min(b.row) as FLAG
from have a,_temp_ b
where a.id=b.id
group by a.id
order by 1,2
;
quit;

How can I replace data from a table with data of another table, similar to a vlookup from Excel?

Sorry in advance for my bad English.
Using SAS, I'm trying to substitute data from one table, let's call it t1. To substitute, I'm comparing t1 column 1 and t2 column 1. If I have a match, I'd like to use t2 column 2 value.
Table 1 has lots of columns, and the data in the relevant column can be repeated. Table 2 has only two columns, the first one has only unique values, and will be compared to table 1. I will, after that, use values of the second column.
For some reason, I'm generating a cartesian product.
proc sql;
create view
v1 as
select
t2.c2, /* final result */
t1.c10, /* not relevant to problem */
SUM(t1.c11) /* not relevant to problem */
from
_outres.table1 t1
left join
_outres.table2 t2
on
t1.c1=t2.c1 /* comparing the tables */
where
t1.c10= "criteria"
group by
t2.c2,
t1.c10
;quit;
If it was Excel, I would solve it like this:
Table 1
column 1
A
A
A
B
B
C
C
Table 2
Column 1 column 2
A AA
B BB
C CC
=vlookup(table 1 column1, table 2, 2, false)
Result:
Table 1
column 1
AA
AA
AA
BB
BB
CC
CC
------------------ EDITED -----------------
@DCR, this was the code I used to test, based on your reply. I made some small changes to better reflect my data and tables. This worked as expected, but I failed to translate it to my original code.
data tttttt1;
input col1 $ col11 col10 $;
datalines;
A 10 critA
A 12 critA
A 13 critA
A 13 critB
B 11 critA
B 41 critA
B 19 critA
C 20 critA
C 55 critA
;
run;
data tttttt2;
input col1 $ col2 $ ;
datalines;
A AA
B BB
C CC
;
run;
proc sql noprint;
create table tttttt3 as
select b.col2, SUM(a.col11), a.col10
from (select * from tttttt1) as a
left join (select * from tttttt2) as b
on a.col1 = b.col1
where a.col10 = "critA"
group by b.col2, a.col10
;quit;
Expected and result were the same:
AA 35 critA
BB 71 critA
CC 75 critA
SAS has a unique feature in the form of custom formats. A format maps a source value to a target value, much in the way of VLOOKUP.
A format is associated with a variable by using the FORMAT statement.
proc format;
value $MyFormat
'A' = 'AA'
'B' = 'BB'
'C' = 'CC'
;
run;
data have;
input col1 $ @@;
col1_formatted_value = put(col1,$MyFormat.); * typically don't have to do this;
datalines;
A A A B B C C D D A
run;
proc print data=have;
title "Data rendered per attributes associated with variables in data set metadata";
run;
proc print data=have;
title "col1 Format applied at step time";
format col1 $MyFormat.;
run;
* col1 format attribute saved with data set;
data have2;
input col1 $ @@;
format col1 $MyFormat.;
datalines;
A A A B B C C D D A
run;
proc print data=have2;
title "Data rendered per format attributes associated with variables (in data set metadata)";
run;
SAS Formats can also be constructed directly from data:
data formatMappingData;
input source $ target $;
fmtname = "$MyFormatFromData";
start = source;
label = target;
datalines;
A AA!
B BB!
C CC!
;
run;
proc format cntlin=formatMappingData;
run;
proc print data=have2;
title "Data rendered per format attributes associated with variables (in data set metadata)";
format col1 $MyFormatFromData.;
run;
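A format can also be applied inside PROC SQL with PUT(), which gives the lookup without a join. A minimal sketch, assuming the same A/B/C mapping as above; the format is recreated so the block stands alone, and the names t1, t1_mapped and col1_mapped are illustrative:

```sas
/* Lookup table expressed as a character format. */
proc format;
  value $MyFormat
    'A' = 'AA'
    'B' = 'BB'
    'C' = 'CC'
  ;
run;

data t1;
  input col1 $ @@;
  datalines;
A A A B B C C
;
run;

/* PUT() returns the formatted (looked-up) value as a new character column. */
proc sql;
  create table t1_mapped as
  select put(col1, $MyFormat.) as col1_mapped
  from t1;
quit;
```

Values with no entry in the format come back as they are (possibly truncated to the format's width), so an OTHER= range may be worth adding if unmatched keys should be flagged.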
I think you might be looking for a left join using proc sql. Try the following:
data t1;
input col1 $ ;
datalines;
A
A
A
B
B
C
C
;
run;
data t2;
input col1 $ col2 $ ;
datalines;
A AA
B BB
C CC
;
run;
proc sql noprint;
create table t3 as
select b.col2
from (select * from t1) as a
left join (select * from t2) as b
on a.col1 = b.col1;
quit;
I found a solution!
Thanks, everyone; all the answers gave me some insight.
@nvioli and @DCR were especially helpful. I was trying to understand the cartesian product I thought I was generating. I counted the rows and found the same number in the result as in the original t1 table, but the summed values were clearly wrong. So I understood that, somehow, my code was inserting the total sum in each row instead of the "group by" subtotals.
I solved it with the simplest solution possible: I split the view into two views. The first one groups and sums, because an older version of this code was doing that correctly. The second view only does the left join and substitutes the data in a simple select. The final code is something like this (a simplified version, like the original example):
/*view to group and sum columns from t1*/
proc sql;
create view
v1 as
select
t1.c1, /* column that will be substituted later */
t1.c10, /* not relevant to problem, only to show the "criteria"/group by */
SUM(t1.c11) as sum_c11 /* only to show the sum; sum_c11 is an illustrative alias */
from
_outres.table1 t1
where
t1.c10= "criteria"
group by
t1.c1,
t1.c10
;quit;
After that:
/*view to substitute the desired column from t1 (now v1) */
proc sql;
create view
v2 as
select
t2.c2, /* column with new data */
v1.c10, /* now already grouped */
v1.sum_c11 /* now already summed */
from
v1
left join
_outres.table2 t2
on
v1.c1 = t2.c1 /* comparing view from t1 with t2 */
;quit;

SAS/SQL - Create a Column that shows the number of times a value has occurred

I have a table with an account number and several attributes.
acct | attr1 | attr2 | attr3...
The issue is that there are duplicate account numbers in the list with different attributes. To make matters worse, when there are two account number entries, those entries may have entirely different attributes.
I have a sorting scheme to use to somewhat solve the issue, but after I sort the table, I only need the first occurrence of each account number. I am attempting to do this in sas using Proc SQL.
Any ideas?
I don't think it's possible to do this with PROC SQL alone; in DATA step logic, however, it is.
Once the data is sorted, use first. (pronounced first-dot) logic to pick the first occurrence.
First sort the data, using your desired scheme:
proc sort data=have out=intermediate_table;
by acct <other variables>;
run;
Then just use first.acct:
data want;
set intermediate_table;
by acct <other variables>;
if first.acct then output;
run;
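A second PROC SORT with NODUPKEY is another way to keep one row per account. This is a sketch only: the sample rows and the choice of attr1 and attr2 as the ordering scheme are assumptions, and it relies on the EQUALS option so that the first row of each acct group from the first sort is the one kept.

```sas
/* Illustrative input with duplicate account numbers. */
data have;
  input acct attr1 $ attr2 $ attr3 $;
  datalines;
100 b d e
100 a b c
101 a b c
102 h k l
102 h i j
;
run;

/* Step 1: order rows by the scheme that decides which row should win. */
proc sort data=have out=intermediate_table;
  by acct attr1 attr2;
run;

/* Step 2: keep one row per acct. EQUALS preserves the step-1 order within acct, */
/* so NODUPKEY keeps the first row of each group.                                */
proc sort data=intermediate_table out=want nodupkey equals;
  by acct;
run;
```

The DATA step with first.acct does the same job in one pass over the sorted data, so this variant is mainly useful when a sort-only pipeline is preferred.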
PROC SORT is the easiest way to do this. You can also use the undocumented monotonic() function to do it in PROC SQL, as shown below:
data have;
input acct attr1 $ attr2 $ attr3 $;
datalines;
100 a b c
100 b d e
100 c e f
101 a b c
102 h i j
102 h k l
;
proc sql;
create table want(drop =rn) as
select * from
(select b.*,monotonic() as rn
from have b)
group by acct
having rn = min(rn);
quit;
Or use _n_ in a data step (creating a view is a good option, as @Richard suggested in the comments), followed by group by, as shown below.
data have_view/view=have_view;
set have;
rn=_n_;
run;
proc sql;
create table want as
select acct, attr1 , attr2 , attr3
from have_view b
group by acct
having rn = min(rn);
quit;

SAS: Merge or join and retain all records while filling missing

I'm essentially splitting a dataset into two (those that have an ID and those that are missing ID), and merging the missing back into the non-missing by a set of match keys to help fill the ID. I have five total records below, and the final dataset needs to maintain all five records. Below is an example:
DATA TEST;
LENGTH ID MKEY $12.;
INPUT ID $ MKEY $;
DATALINES;
. M123
. M456
A M123
B M456
C M789
;
RUN;
DATA MPOOL CPOOL; SET TEST;
IF ID IN ("","0") THEN DO;
MISS_ID = 1;
OUTPUT MPOOL;
END;
ELSE DO;
MISS_ID = 0;
OUTPUT CPOOL;
END;
RUN;
So we end up merging the missing data MPOOL:
. M123
. M456
with the non-missing data CPOOL:
A M123
B M456
C M789
Merging only gives me three records back, but I need to maintain all records and fill the missing (note that through the MKEY, we should be able to link A and B records to missing IDs) as shown below:
A M123
A M123
B M456
B M456
C M789
What kind of SQL JOIN would allow me to keep all records and fill the missing with those that have been successfully joined? Seems feasible but MERGE with this data never retains all records. I know I can lag/retain/fill in this example, but the big data I'm working with requires merging/joining due to other factors.
One method uses union all and join:
proc sql;
select c.id, c.mkey
from cpool c
union all
select c.id, m.mkey
from mpool m join
cpool c
on m.mkey = c.mkey;
quit;
You could also use a data step to fill in the ID.
proc sort data=test;
by mkey descending id;
run;
data want;
set test;
retain temp;
by mkey id notsorted;
if first.mkey then temp=id;
if missing(id) then id=temp;
drop temp;
run;
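A single left join can also keep every row of TEST and fill in the missing IDs. A minimal sketch, assuming MKEY is unique among the rows that already have an ID (otherwise the join multiplies rows); the aliases t and c and the output name want_sql are illustrative:

```sas
data test;
  length id mkey $12;
  input id $ mkey $;
  datalines;
. M123
. M456
A M123
B M456
C M789
;
run;

proc sql;
  create table want_sql as
  select case
           when t.id in ('', '0') then c.id  /* fill from the matched row */
           else t.id                         /* keep an ID that is already there */
         end as id,
         t.mkey
  from test as t
  left join
       (select id, mkey
        from test
        where id not in ('', '0')) as c      /* rows that already have an ID */
    on t.mkey = c.mkey
  order by 2, 1;
quit;
```

Because TEST is on the left side of the join, all five input rows survive even when no filled-in ID exists for a given MKEY.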

Rearrange order of columns by column name

I have a dataset like the following:
data dataset;
input name $ mob5 mob1 mob3 x;
datalines;
a 1 3 5 7
b 2 4 6 8
c 3 5 7 9
d 5 7 9 2
;
run;
I would like to select name and every column whose name starts with mob, ordered by column name. The mob column names and their number are unknown in advance, so I don't know how to list them in a RETAIN statement.
proc sql;
create table table1 as
select *
from dataset(keep=name mob:);
quit;
My desired output will be
name mob1 mob3 mob5
a 3 5 1
b 4 6 2
c 5 7 3
d 7 9 5
You can use the dictionary tables for this (assuming your source dataset is called 'dataset' and resides in the work library, make changes to the WHERE clause if not, but make sure you use upper-case for the values):
PROC SQL;
SELECT name INTO: mob_cols SEPARATED BY ','
FROM dictionary.columns
WHERE libname = 'WORK' and memname = 'DATASET'
AND upcase(name) LIKE 'MOB%'
ORDER BY name;
QUIT;
This code loads all of the 'mob' columns into a macro variable, ordered by name and separated by comma.
Then you can use this macro variable in the SELECT clause of your PROC SQL:
PROC SQL;
CREATE TABLE table1 AS
SELECT name,
&mob_cols.
FROM dataset;
QUIT;