I am trying to merge two large (million+) datasets in SAS. I'm pretty new to SAS and this is my first stackexchange question so hopefully the following makes sense...
SETUP:
All observations in the "Master" dataset have a unique identifier var1 and some have unique identifier var2. Some observations in the "Addition" dataset have unique identifier var1 and some have unique identifier var2; some observations have var2 but not var1.
I want to merge in all matches from the Addition dataset on EITHER var1 or var2 into the Master dataset.
METHODS I HAVE EXPLORED:
Option A: proc sql left join on var1 OR var2. Unfortunately, because there are multiple missing observations on var2 in both Master and Addition this runs into a Cartesian product problem - it works, but is impractically slow with my large datasets.
proc sql;
create table match as
select a.id1, a.id2, varmast, b.varadd
from master a
left join addition b
on (a.id1=b.id1 and a.id2=b.id2) or (a.id2=b.id2 and b.id2 is not null);
quit;
Option B: I'm thinking maybe merge on the first identifier and then use proc sql update to update from the Addition variables using the second identifier? I'm not sure of the syntax.
Option C: I could see probably doing this with a few regular merges & then appending and deduping, but as this would probably take 5+ steps and each step takes a few minutes to run (on a good day) am hoping for something shorter.
I suspect that two left joins are what you want . . . and it should have better performance. The result is something like this:
proc sql;
create table match as
select m.id1, m.id2, varmast, coalesce(a.varadd, a2.varadd) as varadd
from master m left join
addition a
on m.id1 = a.id1 and m.id2 = a.id2 left join
addition a2
on m.id1 = a2.id1 and m.id2 is null and a.id1 is null;
quit;
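To make the two-left-joins-plus-coalesce pattern concrete, here is a minimal sketch run through Python's sqlite3 rather than SAS. The sample rows and the exact join conditions (match on id1 first, fall back to id2 only when id1 found nothing) are my assumptions, not necessarily the conditions you need. Note that SAS treats two missing values as equal in a join, while SQLite does not, so the explicit "is not null" guard matters more in SAS than it does here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table master   (id1 integer, id2 text, varmast text);
    create table addition (id1 integer, id2 text, varadd  text);
    -- hypothetical sample rows
    insert into master values (1, 'x', 'm1'), (2, null, 'm2'), (3, 'z', 'm3');
    insert into addition values
        (1, null, 'by_id1'), (null, 'z', 'by_id2'), (null, null, 'no_key');
""")

rows = conn.execute("""
    select m.id1, m.id2, m.varmast,
           coalesce(a.varadd, a2.varadd) as varadd
    from master m
    left join addition a
           on m.id1 = a.id1
    left join addition a2
           on m.id2 = a2.id2
          and m.id2 is not null   -- guard against missing = missing (needed in SAS)
          and a.id1 is null       -- only fall back when the id1 join found nothing
    order by m.id1
""").fetchall()

for row in rows:
    print(row)
```

Each master row comes back exactly once in this sample, because each join is an equality on a single key instead of an OR across both keys.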
Related
I believe this is an easy one. Just getting started on SQL, so I am finding it a bit tricky. So I am using SQL on SAS, and I want to join two tables but on different columns based on a value of a column. Practical example:
Proc sql;
create table new_table_name as select
a.proposal_code as new_name_proposal_code,
a.1st_client_code as new_name_1st_client_code,
a.2nd_client_code as new_name_2nd_client_code,
a.3rd_client_code as new_name_3rd_client_code,
a.4th_client_code as new_name_4th_client_code,
a.product_type as new_name_product_type,
b.2nd_client_code
from existing_table a
left join existing table b (on b.2nd_client_code=a.2nd_client_code and a.product_type = "clothes") or
left join existing table b (on b.2nd_client_code=a.3rd_client_code and (a.product_type = "cars" or a.product_type = "bikes"));
quit;
So this is the code that I'm using at the moment, and the goal is to join table a and table b using b.2nd client code = a.2nd client code if the product type from table a is = to "clothes", and if the product type from table a is either "cars" or "bikes", join table a and table b using b.2nd client code = a.3rd client code. Basically, look at two different "on's" regarding the specific product type. When joining these two tables, if one row has product type "clothes", I want it to look at the 2nd client code, if it is either "cars" or "bikes", look at the 3rd client code.
Hope I made it clear. The error I am getting at the moment is "expecting an on". Is it a problem of syntax?
Yes. The parentheses before the on are not correct. Your query has other issues as well. I think you want:
create table new_table_name as
select a.proposal_code as new_name_proposal_code,
a.1st_client_code as new_name_1st_client_code,
a.2nd_client_code as new_name_2nd_client_code,
a.3rd_client_code as new_name_3rd_client_code,
a.4th_client_code as new_name_4th_client_code,
a.product_type as new_name_product_type,
coalesce(bc.2nd_client_code, bcb.2nd_client_code)
from existing_table a left join
existing_table bc
on bc.2nd_client_code = a.2nd_client_code and
a.product_type = 'clothes' left join
existing_table bcb
on bcb.2nd_client_code = a.3rd_client_code and
a.product_type in ('cars', 'bikes');
Notes:
No parentheses before the on clause.
No or left join. or is a boolean operator. left join is an operator on sets (i.e. tables and result sets). They don't mix.
No repeated table aliases.
You want to combine the two codes, so you need something like coalesce() in the select.
In standard SQL, the delimiter for strings is the single quote, not the double quote (SAS PROC SQL happens to accept both).
in is simpler than a string of or conditions.
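As a rough illustration of the points above, here is the same shape of query run through Python's sqlite3 instead of SAS. The table names proposals and clients, the column names, and the sample rows are all hypothetical (I renamed the columns because names starting with a digit are awkward), so treat it as a sketch of the pattern only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical tables standing in for the ones in the question
    create table proposals (proposal_code text, client_code2 text,
                            client_code3 text, product_type text);
    create table clients (client_code2 text);
    insert into proposals values
        ('P1', 'C10', 'C20', 'clothes'),
        ('P2', 'C10', 'C20', 'cars'),
        ('P3', 'C10', 'C20', 'toys');
    insert into clients values ('C10'), ('C20');
""")

two_joins = conn.execute("""
    select a.proposal_code,
           coalesce(bc.client_code2, bcb.client_code2) as matched_code
    from proposals a
    left join clients bc                      -- distinct alias per join
           on bc.client_code2 = a.client_code2
          and a.product_type = 'clothes'
    left join clients bcb
           on bcb.client_code2 = a.client_code3
          and a.product_type in ('cars', 'bikes')
    order by a.proposal_code
""").fetchall()

for row in two_joins:
    print(row)
```

At most one of the two joins can match a given row, because the product_type conditions are mutually exclusive, so coalesce() simply picks whichever side is filled in.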
Sounds like you just want a complex ON condition and not two joins.
Something like this:
proc sql;
create table new_table_name as
select
a.proposal_code as new_name_proposal_code
,a.client_code1 as new_name_client_code1
,a.client_code2 as new_name_client_code2
,a.client_code3 as new_name_client_code3
,a.client_code4 as new_name_client_code4
,a.product_type as new_name_product_type
,b.client_code2 as new_name_other_client_code2
from tableA a
left join tableB b
on (b.client_code2=a.client_code2 and a.product_type = "clothes")
or (b.client_code2=a.client_code3 and a.product_type in ("cars","bikes"))
;
quit;
For a better answer post example inputs and desired output.
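In the meantime, here is a small sketch of the single-join approach with made-up data, run through Python's sqlite3 rather than SAS. The rows in tableA and tableB are invented for illustration and the column list is trimmed down, so check it against your real data before relying on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tableA (proposal_code text, client_code2 text,
                         client_code3 text, product_type text);
    create table tableB (client_code2 text);
    -- invented sample rows
    insert into tableA values
        ('P1', 'C10', 'C20', 'clothes'),
        ('P2', 'C10', 'C20', 'bikes'),
        ('P3', 'C10', 'C20', 'toys');
    insert into tableB values ('C10'), ('C20');
""")

one_join = conn.execute("""
    select a.proposal_code, b.client_code2 as other_client_code2
    from tableA a
    left join tableB b
           on (b.client_code2 = a.client_code2 and a.product_type = 'clothes')
           or (b.client_code2 = a.client_code3 and a.product_type in ('cars', 'bikes'))
    order by a.proposal_code
""").fetchall()

for row in one_join:
    print(row)
```

With only one alias for tableB there is nothing to coalesce. The trade-off is that an OR in the join condition can be harder for the optimizer to handle on large tables.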
If I have tables with the following attributes:
A: id, race, key1
B: key1, driving_id
C: driving_id, fines
why would it be possible for us to have the following queries:
select A.id, A.race, B.key1, B.driving_id, C.fines
from A
left join B on A.key1=B.key1
left join C on B.driving_id= C.driving_id
even though there are no common keys for A and C in the last line of the SQL query?
The query that you have written is parsed as:
select A.id, A.race, B.key1, B.driving_id, C.fines
from (A left join
B
on A.key1 = B.key1
) left join
C
on B.driving_id = C.driving_id;
That is, C is -- logically -- being joined to the result of A and B. Any keys from those tables would be valid.
Although your original query is the preferable way to write it, you could also write:
select ab.id, ab.race, ab.key1, ab.driving_id, C.fines
from (select . . . -- whatever columns you need
from A left join
B
on A.key1 = B.key1
) ab left join
C
on ab.driving_id = C.driving_id;
The three versions are all equivalent, but the last one may help you better understand what is going on with joins between multiple tables.
Without seeing sample data from the three tables, we might not know for sure if the query makes any sense or would even run. Assuming it does run, then there should be nothing wrong with the join logic. For example, it is perfectly possible for table B to have a key key1 which relates to the A table, while at the same time having another key driving_id which relates to the C table. Note that either of these keys (but not both) could be a primary key in the B table, and if not then each key would be a foreign key.
The LEFT JOIN keyword returns all records from the left table (tableA), and the matched records from the right table (tableB). Similarly, the second LEFT JOIN returns all records from the result of the first join, and the matched records from the right table (tableC). Where there is no match, the columns from the right side are NULL.
So A & C have a link through table B.
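A tiny runnable sketch may help show how the link through B behaves. It uses Python's sqlite3 and invented rows for A, B and C. Only the query itself comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table A (id integer, race text, key1 integer);
    create table B (key1 integer, driving_id integer);
    create table C (driving_id integer, fines integer);
    -- invented rows: id 1 links all the way to C, id 2 stops at B, id 3 stops at A
    insert into A values (1, 'r1', 10), (2, 'r2', 20), (3, 'r3', 30);
    insert into B values (10, 100), (20, 200);
    insert into C values (100, 50);
""")

rows = conn.execute("""
    select A.id, A.race, B.key1, B.driving_id, C.fines
    from A
    left join B on A.key1 = B.key1
    left join C on B.driving_id = C.driving_id
    order by A.id
""").fetchall()

for row in rows:
    print(row)
```

Every row of A survives. Where B has no match, the B columns are NULL, and then the join to C cannot match either, so fines is NULL as well.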
proc sql;
create table test_Check10 as
select
a.KRI_RK,SCORE,
KRI_ID,
b.KRI_TEMPLATE_RK,
KRI_TEMPLATE_ID,
d.KRI_RSPNS_SCL_RK,
RANGE_MID_2,
RANGE_MAX
from
Sasoprsk.Kri_Obs_l as a,
Sasoprsk.Kri_l as b,
Sasoprsk.Kri_template_l as c,
Sasoprsk.Kri_rspns_scl_l as d
where
a.KRI_RK=b.KRI_TEMPLATE_RK and
b.KRI_ID=c.KRI_TEMPLATE_ID
order by
SCORE
;
quit;
proc sql;
create table final as
select * from test_Check10
where
SCORE <= RANGE_MAX and SCORE >= RANGE_MID_2
;
quit;
The actual source data may not match the data model you have in mind.
The case of 'extra rows' is often due to an insufficiently constrained join, which can be caused by repeated rows with respect to the join key fields. What this means is that there may not be a 1-1, or presumed 1-N, correspondence in the data with respect to
a.KRI_RK=b.KRI_TEMPLATE_RK and
b.KRI_ID=c.KRI_TEMPLATE_ID
Look for replicates in the result set and use the join key field values to closely examine the corresponding records in source data tables. You may have to
add more constraints to the join expression
include additional tables in the join
cogitate over a long walk or a couple beers
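One way to look for replicates is to count rows per join key. The sketch below uses Python's sqlite3 with made-up tables obs, kri and scale (loosely modelled on the names in the question, not the real Sasoprsk tables). It also shows what happens when a table in the from list has no join condition at all, which appears to be the case for table d in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- made-up stand-ins for the real tables
    create table obs   (kri_rk integer, score integer);
    create table kri   (kri_rk integer, kri_id text);
    create table scale (scl_rk integer);
    insert into obs values (1, 5), (2, 7);
    insert into kri values (1, 'K1'), (1, 'K1-dup'), (2, 'K2');
    insert into scale values (1), (2), (3);
""")

# Join keys that occur more than once on the "one" side of the join.
dup_keys = conn.execute("""
    select kri_rk, count(*) as n
    from kri
    group by kri_rk
    having count(*) > 1
""").fetchall()

# The duplicate key turns 2 obs rows into 3 joined rows.
joined_count = conn.execute("""
    select count(*) from obs o join kri k on o.kri_rk = k.kri_rk
""").fetchone()[0]

# A table listed without any join condition multiplies every row.
cross_count = conn.execute("""
    select count(*) from obs o, kri k, scale s
    where o.kri_rk = k.kri_rk
""").fetchone()[0]

print(dup_keys, joined_count, cross_count)
```

If dup_keys comes back non-empty for a key you assumed was unique, that is the place to start examining the source records.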
I want to create a new variable based on certain conditions in a table and merge it to another table using the newly created variable in a single data or a proc sql step.
eg)
table 1
var new_var
x 3x
y 4y
z 5z
table 2
new_var additional_var
3x a
3x a
4y z
and merging both the tables using the new_var in a single step
Thanks
You can accomplish this using a join with an inline view. Inline views can save you time coding and reduce I/O operations, which are often the biggest cause of slowdowns. This is particularly true on non-SSD hard drives.
proc sql noprint;
create table want as
select var, t1.new_var, additional_var
from table1 as t1
LEFT JOIN
(select new_var,
CASE
when(<conditions>) then 'a'
else 'z'
END as additional_var
from table2) as t2
ON t1.new_var = t2.new_var;
quit;
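Here is a runnable sketch of the same inline-view join using Python's sqlite3 and the sample rows from the question. The CASE condition (new_var = '3x') is a placeholder I made up to stand in for your real conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (var text, new_var text);
    create table table2 (new_var text);
    insert into table1 values ('x', '3x'), ('y', '4y'), ('z', '5z');
    insert into table2 values ('3x'), ('3x'), ('4y');
""")

rows = conn.execute("""
    select t1.var, t1.new_var, t2.additional_var
    from table1 as t1
    left join
        (select new_var,
                case
                    when new_var = '3x' then 'a'   -- placeholder condition
                    else 'z'
                end as additional_var
         from table2) as t2
      on t1.new_var = t2.new_var
    order by t1.var
""").fetchall()

for row in rows:
    print(row)
```

The derived column is computed inside the subquery and joined in the same statement, so no intermediate table is written. Because table2 has two '3x' rows, the 'x' row appears twice in the result.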
I have a problem with a full outer join in SAS.
I want to join two databases.
A is the "mama" containing patient ID, SEX, RACE, blablabla... but it doesn't have the status variable.
B is the one containing only ID and status.
So A is actually a much bigger database than B, and what I want to do is bring the status from B into A. Here's my code:
proc sql;
CREATE TABLE C AS
select *
from A full outer join B
on A.id=B.id ;
RUN;
The result I got does not actually merge the two databases. Instead, I got database C, which has all the data from A on top (status variable is null), followed by the data from B (status variable is there, but all other variables are null). So all I did was add rows....
Here are some details about my setup:
1. I use the University Edition
2. The ID is actually a character variable. Since B's IDs look like BD123, I converted the numeric ID variable in A into a character variable.
Anybody could help me with this? Thank you very much :-D
If you got an entire concatenation (100 rows in A, 15 rows in B, 115 in C) then you likely didn't correctly match the ID variable format when you converted. You may have an issue with additional spaces or something to that effect (the length of B.id may not match A.id). If possible I would convert the ID to numeric, or do a more careful conversion to character.
Second, if you're intending to just get the number of rows of A back (just adding B information to A), then you want a left join not a full outer join.
I think you might be looking for a left join instead.
proc sql;
create table C as
select A.*, B.*
from A left join B
on A.ID=B.ID;
quit;
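If the left join still leaves status empty on every row, the ID mismatch mentioned above is the likely culprit. The sketch below uses Python's sqlite3 with invented IDs to show how padding from a numeric-to-character conversion can stop keys from matching, and how trimming both sides fixes it. The padded values are an assumption about what your conversion produced, so inspect your actual IDs first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table A (id text, sex text);
    create table B (id text, status text);
    -- invented IDs: A's are padded, as a right-aligned conversion might leave them
    insert into A values ('     123', 'F'), ('     456', 'M');
    insert into B values ('123', 'alive');
""")

# Naive join: the padded IDs never equal the unpadded ones.
naive = conn.execute("""
    select A.id, A.sex, B.status
    from A left join B on A.id = B.id
    order by A.sex
""").fetchall()

# Trimming both sides lets the keys line up.
fixed = conn.execute("""
    select trim(A.id) as id, A.sex, B.status
    from A left join B on trim(A.id) = trim(B.id)
    order by A.sex
""").fetchall()

print(naive)
print(fixed)
```

In SAS the equivalent cleanup would typically be done with functions such as strip() or left() on the converted ID before joining.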