SAS/SQL: Combine two columns while retaining others

I need to merge two data sets. Each data set contains a sequential observation number. The first data set contains only the first observation. The second data set contains all subsequent observations. Not all subjects have the same number of observations.
The problem is as follows. There are two different types of subject. The type is contained only in the first data set. When I merge the two data sets together, the type is missing on all observations but the first for each subject. Please see my example below.
I would like to know how to do this with both SQL and a DATA step. My real data sets are not large, so processing efficiency is not a major concern.
I have tried using RETAIN, but as the second data set doesn't contain the TYPE variable, there is no value to retain. Regarding SQL, it seems like UNION should work, and there are countless examples of UNION on the internet, but they all involve a single variable. I need to know how to union the Observation variable by ID while retaining the Amount and assigning the Type.
Example
data set1;
input ID $
Observation
Type $
Amount
;
datalines;
002 1 A 15
026 1 A 30
031 1 B 7
028 1 B 10
036 1 A 22
;
run;
data set2;
input ID $
Observation
Amount
;
datalines;
002 2 11
002 3 35
002 4 13
002 5 12
026 2 21
026 3 12
026 4 40
031 2 11
028 2 27
036 2 10
036 3 15
036 4 16
036 5 12
036 6 20
;
run;
proc sort data = set1;
by ID
Observation
;
run;
proc sort data = set2;
by ID
Observation
;
run;
data merged;
merge set1
set2
;
by ID
Observation
;
run;
This gives
ID Observation Type Amount
002 1 A 15
002 2 11
002 3 35
002 4 13
002 5 12
026 1 A 30
026 2 21
026 3 12
026 4 40
028 1 B 10
028 2 27
031 1 B 7
031 2 11
036 1 A 22
036 2 10
036 3 15
036 4 16
036 5 12
036 6 20
However, what I need is
ID Observation Type Amount
002 1 A 15
002 2 A 11
002 3 A 35
002 4 A 13
002 5 A 12
026 1 A 30
026 2 A 21
026 3 A 12
026 4 A 40
028 1 B 10
028 2 B 27
031 1 B 7
031 2 B 11
036 1 A 22
036 2 A 10
036 3 A 15
036 4 A 16
036 5 A 12
036 6 A 20

I'm sure there are other ways to do it, but this is how I'd do it.
First, stack the data keeping only the common fields.
data new;
set set1 (drop = TYPE) set2;
run;
Then merge the type field back over.
proc sql;
create table new2 as select
a.*,
b.TYPE
from new a
left join set1 b
on a.id=b.id;
quit;

Proc SQL:
proc sql;
create table want as
select coalesce(a.id,b.id) as id, observation, type, amount
from (select * from set1(drop=type)
union
select * from set2) a
left join set1(keep=id type) b
on a.id=b.id;
quit;

The DATA step method is straightforward: use SET with BY to interleave the records (both data sets must already be sorted by ID). You need to create a NEW variable to retain the values. If you want, you can drop the old variable and rename the new one to take its name.
data want ;
set set1 set2 ;
by id ;
if first.id then new_type=type;
retain new_type;
run;
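The carry-forward logic of that DATA step can be sketched outside SAS. The following is a minimal Python illustration, not SAS semantics in full: it assumes the rows are already stacked and sorted by ID, uses a subset of the example data, and the function name carry_type_forward is hypothetical.

```python
# Rows from set1 and set2 stacked and sorted by ID.
# Type is None on rows that came from set2, which has no TYPE variable.
rows = [
    ("002", 1, "A", 15),
    ("002", 2, None, 11),
    ("002", 3, None, 35),
    ("031", 1, "B", 7),
    ("031", 2, None, 11),
]

def carry_type_forward(rows):
    """Copy the Type of the first row of each ID onto all rows of that ID."""
    result = []
    current_id = None
    retained_type = None
    for id_, observation, type_, amount in rows:
        if id_ != current_id:        # plays the role of first.id
            current_id = id_
            retained_type = type_    # plays the role of new_type = type
        result.append((id_, observation, retained_type, amount))
    return result

filled = carry_type_forward(rows)
```

The separate retained_type variable mirrors new_type in the DATA step: the value is stored once per ID and reused for the rows that carry no type of their own.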
For SQL, use the method that @JJFord3 posted: first union the common fields, then join the TYPE flag back on. You can combine both steps into a single statement.
proc sql;
create table want as
select a.*,b.type
from
(select id,observation,amount from set1
union
select id,observation,amount from set2
) a
left join set1 b
on a.id = b.id
order by 1,2
;
quit;
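The union-then-join pattern is not specific to PROC SQL. As a rough check, here is the same query shape run through Python's sqlite3 module on a subset of the example data; the in-memory database and the column types are assumptions made for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE set1 (id TEXT, observation INTEGER, type TEXT, amount INTEGER);
    CREATE TABLE set2 (id TEXT, observation INTEGER, amount INTEGER);
    INSERT INTO set1 VALUES ('002', 1, 'A', 15), ('031', 1, 'B', 7);
    INSERT INTO set2 VALUES ('002', 2, 11), ('002', 3, 35), ('031', 2, 11);
""")

# Stack the common columns, then look up each ID's type in set1.
want = conn.execute("""
    SELECT a.id, a.observation, b.type, a.amount
    FROM (
        SELECT id, observation, amount FROM set1
        UNION
        SELECT id, observation, amount FROM set2
    ) AS a
    LEFT JOIN set1 AS b ON a.id = b.id
    ORDER BY a.id, a.observation
""").fetchall()
```

Note that UNION removes exact duplicate rows; UNION ALL would keep them, which matters only if the two inputs can contain identical records.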

Related

Merge and copy columns into database view using sql

Simplified problem:
I have 3 tables with same structure: 2 columns, TIMESTAMP and VALUE.
Example
(I simplified timestamp and value for easier understanding):
Table A:
TIMESTAMP VALUE
1 a101
5 a105
9 a109
17 a117
Table B:
TIMESTAMP VALUE
3 b103
5 b105
8 b108
13 b113
Table C:
TIMESTAMP VALUE
9 c109
11 c111
13 c113
18 c118
View should contain TIMESTAMPs of all tables in one single column and one column for each VALUE column of each table:
View
TIMESTAMP A_VALUE B_VALUE C_VALUE
1 a101
3 b103
5 a105 b105
8 b108
9 a109 c109
11 c111
13 b113 c113
17 a117
18 c118
Is this possible using a view?
Many thanks for answers.
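One way such a view might be built is to union the timestamps of all three tables and left join each table back onto that list. The sketch below uses Python's sqlite3 module; the table names table_a, table_b, table_c, the view name merged, and the column names ts and val are assumptions, and it presumes each timestamp occurs at most once per table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (ts INTEGER, val TEXT);
    CREATE TABLE table_b (ts INTEGER, val TEXT);
    CREATE TABLE table_c (ts INTEGER, val TEXT);
    INSERT INTO table_a VALUES (1, 'a101'), (5, 'a105'), (9, 'a109'), (17, 'a117');
    INSERT INTO table_b VALUES (3, 'b103'), (5, 'b105'), (8, 'b108'), (13, 'b113');
    INSERT INTO table_c VALUES (9, 'c109'), (11, 'c111'), (13, 'c113'), (18, 'c118');

    -- Every timestamp from any table, with one value column per source table.
    CREATE VIEW merged AS
    SELECT t.ts AS ts, a.val AS a_value, b.val AS b_value, c.val AS c_value
    FROM (
        SELECT ts FROM table_a
        UNION
        SELECT ts FROM table_b
        UNION
        SELECT ts FROM table_c
    ) AS t
    LEFT JOIN table_a AS a ON a.ts = t.ts
    LEFT JOIN table_b AS b ON b.ts = t.ts
    LEFT JOIN table_c AS c ON c.ts = t.ts;
""")

view_rows = conn.execute("SELECT * FROM merged ORDER BY ts").fetchall()
```

Missing values come back as NULL (None in Python), which corresponds to the blank cells in the desired view.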

Postgresql: append two tables with different columns

I would like to append one table to the other; both tables may have different columns. The result should be a table with all columns and where values do not exist, it should be a missing observation. The data are time series - which I am getting from different sources due to time span constraints - so I need to "stack" them on each other, but it could be that one or the other column is added or dropped off.
As there is a little overlap in the rows, I am looking for a solution that takes the data from the first table where both tables have a row. The problem is that columns that do not exist in table 1 would then also be missing for those rows when I pick table 1 over table 2.
Current solution is to cut-off table 2 so there is no overlap.
table 1:
date AA BB CC DD
20100101 9 10 11 12
20100102 10 11 12 13
table 2:
date AA BB CC EE FF
20100102 99 99 10
20100103 11 12 13 14 10
20100104 12 13 14 15 11
and the result should be
date AA BB CC DD EE FF
20100101 9 10 11 12
20100102 10 11 12 13 99 10
20100103 11 12 13 14 10
20100104 12 13 14 15 11
So I do not in fact have anything to "join" on as suggested here: SQL union of two tables with different columns
The coalesce function can be used together with a full outer join, as follows:
select coalesce(t1.date,t2.date) date,
coalesce(t1.aa,t2.aa) aa,
coalesce(t1.bb,t2.bb) bb,
coalesce(t1.cc,t2.cc) cc,
t1.dd,
t2.ee,
t2.ff
from table1 t1 full outer join table2 t2 on ( t1.date = t2.date );
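To make the effect of the full outer join with coalesce concrete, here is a plain-Python sketch of the same merge. The dictionaries are keyed by date, None marks a missing value, and the placement of the values in the 20100102 row of table 2 is an assumption, since the flattened table in the question does not show which columns they belong to. The helper names coalesce and full_outer_merge are hypothetical.

```python
table1 = {
    20100101: {"aa": 9, "bb": 10, "cc": 11, "dd": 12},
    20100102: {"aa": 10, "bb": 11, "cc": 12, "dd": 13},
}
table2 = {
    # Assumed column placement for the overlapping row.
    20100102: {"aa": 99, "bb": None, "cc": None, "ee": 99, "ff": 10},
    20100103: {"aa": 11, "bb": 12, "cc": 13, "ee": 14, "ff": 10},
    20100104: {"aa": 12, "bb": 13, "cc": 14, "ee": 15, "ff": 11},
}

def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)

def full_outer_merge(t1, t2):
    """Keep every date from either table; prefer table 1 for shared columns."""
    merged = {}
    for date in sorted(set(t1) | set(t2)):
        r1 = t1.get(date, {})
        r2 = t2.get(date, {})
        merged[date] = {
            "aa": coalesce(r1.get("aa"), r2.get("aa")),
            "bb": coalesce(r1.get("bb"), r2.get("bb")),
            "cc": coalesce(r1.get("cc"), r2.get("cc")),
            "dd": r1.get("dd"),   # exists only in table 1
            "ee": r2.get("ee"),   # exists only in table 2
            "ff": r2.get("ff"),   # exists only in table 2
        }
    return merged

result = full_outer_merge(table1, table2)
```

For the overlapping date, the shared columns come from table 1 while ee and ff are still filled from table 2, which is the behaviour the question asks for.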

How do I change my SQL SELECT GROUP BY query to show me which records are missing a value?

I have a list of codes by area and type. I need to get the unique codes for each type, which I can do with a simple SELECT query with a GROUP BY. I now need to know which area does not have one of the codes. So how do I run a query that groups by unique values and tells me which records do not have one of the values?
ID Area Type Code
1 10 A 123
2 10 A 456
3 10 B 789
4 10 B 987
5 10 C 654
6 10 C 321
7 20 A 123
8 20 B 789
9 20 B 987
10 20 C 654
11 20 C 321
12 30 A 137
13 30 A 456
14 30 B 579
15 30 B 789
16 30 B 987
17 30 C 654
18 30 C 321
I can run this query to group them by type and get the unique codes:
SELECT tblExample.Type, tblExample.Code
FROM tblExample
GROUP BY tblExample.Type, tblExample.Code
This gives me this:
Type Code
A 123
A 137
A 456
B 579
B 789
B 987
C 321
C 654
Now I need to know which areas do not have a given code. For example, code 123 does not appear for Area 30, and code 137 does not appear for Areas 10 and 20. How do I write a query that tells me which areas are missing a code? The format of the output doesn't matter; I just need to get the results. I'm thinking the results could be in one column or spread out over multiple columns:
Type Code Missing Areas or Missing1 Missing2
A 123 30 30
A 137 10, 20 10 20
A 456 20 20
B 579 10, 20 10 20
B 789
B 987
C 321
C 654
You can get a list of the missing code/area pairs by first generating all combinations of area and code and then filtering out the ones that exist:
select c.type, c.code, a.area
from (select distinct area from tblExample) a cross join
(select distinct type, code from tblExample) c left join
tblExample e
on a.area = e.area and c.type = e.type and c.code = e.code
where e.area is null;
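As a check of the all-combinations-then-filter idea, the sketch below crosses every area with every (type, code) pair and keeps the pairs that have no matching row. It runs against the question's data in an in-memory SQLite database via Python's sqlite3 module; the column types are assumptions.

```python
import sqlite3

rows = [
    (1, 10, "A", 123), (2, 10, "A", 456), (3, 10, "B", 789),
    (4, 10, "B", 987), (5, 10, "C", 654), (6, 10, "C", 321),
    (7, 20, "A", 123), (8, 20, "B", 789), (9, 20, "B", 987),
    (10, 20, "C", 654), (11, 20, "C", 321),
    (12, 30, "A", 137), (13, 30, "A", 456), (14, 30, "B", 579),
    (15, 30, "B", 789), (16, 30, "B", 987), (17, 30, "C", 654),
    (18, 30, "C", 321),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblExample (id INTEGER, area INTEGER, type TEXT, code INTEGER)"
)
conn.executemany("INSERT INTO tblExample VALUES (?, ?, ?, ?)", rows)

# All area x (type, code) combinations, minus the ones present in the table.
missing = conn.execute("""
    SELECT c.type, c.code, a.area
    FROM (SELECT DISTINCT area FROM tblExample) AS a
    CROSS JOIN (SELECT DISTINCT type, code FROM tblExample) AS c
    LEFT JOIN tblExample AS e
        ON e.area = a.area AND e.type = c.type AND e.code = c.code
    WHERE e.id IS NULL
    ORDER BY c.type, c.code, a.area
""").fetchall()
```

Each returned row is one (type, code, missing area) triple; codes that appear in every area produce no rows.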

How do I merge by more than one variable using proc SQL in SAS

I have 2 datasets in SAS:
main_1
ID Rep Dose Response
1 2 34 567
1 1 45 756
2 1 35 456
3 1 56 345
main_2
ID Rep Hour Day
1 1 89 157
2 1 62 365
3 1 12 689
I can easily merge these 2 datasets first by ID and then by Rep (as one of the ID's has two observations) with the following code in SAS:
proc import out=main_1
datafile='/folders/myfolders/sasuser.v94/main_1.xls'
dbms=xls replace;
/*optional*/
sheet='Sheet1';
getnames=yes;
run;
proc import out=main_2
datafile='/folders/myfolders/sasuser.v94/main_2.xls'
dbms=xls replace;
/*optional*/
sheet='Sheet1';
getnames=yes;
run;
/*merge datasets based on common variable (ID then Rep)*/
/*first sort all datasets by target variables*/
proc sort data=main_1;
by ID Rep;
run;
proc sort data=main_2;
by ID Rep;
run;
/*can now be merged*/
data main_merge;
merge main_1 main_2;
by ID Rep;
run;
this produces the following table:
ID Rep Dose Response Hour Day
1 1 45 756 89 157
1 2 34 567 . .
2 1 35 456 62 365
3 1 56 345 12 689
I currently have the following proc SQL alternative (I am learning, so sorry if it's terrible) but cannot seem to merge by more than one variable (i.e. ID and Rep):
proc sql;
create table merged_sql as
select L.*, R.*
from main_1 as L
LEFT JOIN main_2 as R
on L.ID = R.ID;
quit;
producing the following:
ID Rep Dose Response Hour Day
1 2 34 567 89 157
1 1 45 756 89 157
2 1 35 456 62 365
3 1 56 345 12 689
Any suggestion on a proc SQL code to achieve the same table as previously? My current code adds the '89 157' to both ID=1 observations.
Many thanks.
You're almost there...
proc sql;
create table merged_sql as
select L.*,
R.HOUR,
R.DAY
from main_1 as L
LEFT JOIN main_2 as R
on L.ID = R.ID
and L.REP = R.REP;
quit;
The reason not to use R.* is to avoid a note or warning about having duplicate ID and REP fields.
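The effect of the second join condition can be checked outside SAS. This sketch loads the two example tables into an in-memory SQLite database with Python's sqlite3 module (the column types are assumptions) and joins on both ID and Rep.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_1 (id INTEGER, rep INTEGER, dose INTEGER, response INTEGER);
    CREATE TABLE main_2 (id INTEGER, rep INTEGER, hour INTEGER, day INTEGER);
    INSERT INTO main_1 VALUES (1, 2, 34, 567), (1, 1, 45, 756),
                              (2, 1, 35, 456), (3, 1, 56, 345);
    INSERT INTO main_2 VALUES (1, 1, 89, 157), (2, 1, 62, 365), (3, 1, 12, 689);
""")

# Matching on both keys keeps main_2 values off the ID=1, Rep=2 row.
merged_sql = conn.execute("""
    SELECT l.id, l.rep, l.dose, l.response, r.hour, r.day
    FROM main_1 AS l
    LEFT JOIN main_2 AS r
        ON l.id = r.id AND l.rep = r.rep
    ORDER BY l.id, l.rep
""").fetchall()
```

The row for ID 1, Rep 2 has no partner in main_2, so its Hour and Day come back as NULL (None), matching the missing values in the DATA step merge.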

Access SQL - Select only the last sequence

I have a table with an ID and multiple informative columns. Sometimes however, I can have multiple data for an ID, so I added a column called "Sequence". Here is a shortened example:
ID Sequence Name Tel Date Amount
124 1 Bob 873-4356 2001-02-03 10
124 2 Bob 873-4356 2002-03-12 7
124 3 Bob 873-4351 2006-07-08 24
125 1 John 983-4568 2007-02-01 3
125 2 John 983-4568 2008-02-08 13
126 1 Eric 345-9845 2010-01-01 18
So, I would like to obtain only these lines:
124 3 Bob 873-4351 2006-07-08 24
125 2 John 983-4568 2008-02-08 13
126 1 Eric 345-9845 2010-01-01 18
Could anyone give me a hand with building a SQL query to do this?
Thanks !
You can calculate the maximum sequence using group by. Then you can use join to get only the maximum in the original data.
Assuming your table is called t:
select t.*
from t inner join
(select id, MAX(sequence) as maxs
from t
group by id
) tmax
on t.id = tmax.id and
t.sequence = tmax.maxs;
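The group-then-join approach can be tried on the example rows. The sketch below uses Python's sqlite3 module rather than Access, so it only illustrates the query logic; the table name t comes from the answer, while the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, sequence INTEGER, name TEXT, "
    "tel TEXT, date TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)",
    [
        (124, 1, "Bob", "873-4356", "2001-02-03", 10),
        (124, 2, "Bob", "873-4356", "2002-03-12", 7),
        (124, 3, "Bob", "873-4351", "2006-07-08", 24),
        (125, 1, "John", "983-4568", "2007-02-01", 3),
        (125, 2, "John", "983-4568", "2008-02-08", 13),
        (126, 1, "Eric", "345-9845", "2010-01-01", 18),
    ],
)

# Find the highest sequence per ID, then keep only the rows that match it.
latest = conn.execute("""
    SELECT t.*
    FROM t
    INNER JOIN (
        SELECT id, MAX(sequence) AS maxs
        FROM t
        GROUP BY id
    ) AS tmax
        ON t.id = tmax.id AND t.sequence = tmax.maxs
    ORDER BY t.id
""").fetchall()
```

Because the join matches on both the ID and its maximum sequence, exactly one row per ID survives as long as sequence numbers are unique within an ID.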