SAS SQL: Many-to-many relationships with 2 tables BUT don't want multiple rows

I have two tables I need to join. These tables only share 1 field in common (ID, and it isn't unique). Is it possible to join these two tables but make it unique and keep all matching data in a row?
For example, I have two tables as follows:
+-------+----------+
| ID | NAME |
+-------+----------+
| A | Jack |
| A | Andy |
| A | Steve |
| A | Jay |
| B | Chris |
| B | Vicky |
| B | Emma |
+-------+----------+
And another table that is ONLY related by the ID column:
+-------+--------+
| ID | Age |
+-------+--------+
| A | 22 |
| A | 31 |
| A | 11 |
| B | 40 |
| B | 17 |
| B | 20 |
| B | 3 |
| B | 65 |
+-------+--------+
The end result I'd like to get is:
+-------+----------+-------+
| ID    | NAME     | Age   |
+-------+----------+-------+
| A     | Jack     | 22    |
| A     | Andy     | 31    |
| A     | Steve    | 11    |
| A     | Jay      | null  |
| B     | Chris    | 40    |
| B     | Vicky    | 17    |
| B     | Emma     | 20    |
| B     | null     | 3     |
| B     | null     | 65    |
+-------+----------+-------+

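This is close to the default behavior of the data step merge, which pairs rows in order within each BY group. The one difference is that merge will not set the shorter table's variable to missing once that table runs out of rows in the group, but that is easy to work around.
There are other ways to do this; the best, in my opinion, is the hash object, if you're comfortable with it.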
data names;
infile datalines dlm='|';
input ID $ NAME $;
datalines;
| A | Jack |
| A | Andy |
| A | Steve |
| A | Jay |
| B | Chris |
| B | Vicky |
| B | Emma |
;;;;
run;
data ages;
infile datalines dlm='|';
input id $ age;
datalines;
| A | 22 |
| A | 31 |
| A | 11 |
| B | 40 |
| B | 17 |
| B | 20 |
| B | 3 |
| B | 65 |
;;;;
run;
data want;
  merge names(in=_a) ages(in=_b);
  by id;
  if _a;
  if name ne lag(name) then output; *this assumes `name` is unique in id - if it is not we may have to do a bit more work here;
  call missing(age); *clear age after output so we do not attempt to fill extra rows with the same age - age will be 'retain'ed;
run;
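For readers who want to see the target pairing logic outside SAS, here is a minimal Python sketch of "pair the i-th name with the i-th age inside each ID, padding the shorter side with nulls". It is an illustration of the desired result from the question, not a translation of the data step above; the function names `group_values` and `pair_within_id` are made up for this example, and both inputs are assumed to be sorted by ID.

```python
from itertools import groupby, zip_longest

# Sample data copied from the question; both lists are already sorted by ID.
names = [("A", "Jack"), ("A", "Andy"), ("A", "Steve"), ("A", "Jay"),
         ("B", "Chris"), ("B", "Vicky"), ("B", "Emma")]
ages = [("A", 22), ("A", 31), ("A", 11),
        ("B", 40), ("B", 17), ("B", 20), ("B", 3), ("B", 65)]


def group_values(rows):
    """Map each ID to its values in input order (rows must be sorted by ID)."""
    return {key: [value for _, value in grp]
            for key, grp in groupby(rows, key=lambda r: r[0])}


def pair_within_id(left, right):
    """Pair the i-th left value with the i-th right value inside each ID.

    The shorter side is padded with None, so no value is repeated.
    """
    left_by_id, right_by_id = group_values(left), group_values(right)
    result = []
    for key in sorted(set(left_by_id) | set(right_by_id)):
        pairs = zip_longest(left_by_id.get(key, []), right_by_id.get(key, []))
        result.extend((key, l, r) for l, r in pairs)
    return result


want = pair_within_id(names, ages)
for row in want:
    print(row)
```

With the sample data this yields nine rows, including `('A', 'Jay', None)` and the two `B` rows that have an age but no name.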

Related

Postgres key-value table, select values as columns

I have the following table:
+----+---------+-------+
| id | Key | Value |
+----+---------+-------+
| 1 | name | Bob |
| 1 | surname | Test |
| 1 | car | Tesla |
| 2 | name | Mark |
| 2 | cat | Bobby |
+----+---------+-------+
Key can hold basically anything. I would like to arrive at the following output:
+----+------+---------+-------+-------+
| id | name | surname | car | cat |
+----+------+---------+-------+-------+
| 1 | Bob | Test | Tesla | |
| 2 | Mark | | | Bobby |
+----+------+---------+-------+-------+
Then I would like to merge the output with another table (based on the id).
Is it possible to do, if I don't know what the Key column holds? Values there are dynamic.
Could you point me in the right direction?
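To make the requested reshaping concrete, here is a small Python sketch of a pivot where the set of keys is discovered from the data rather than known in advance. It only illustrates the transformation on the sample rows and is not a Postgres solution; the function name `pivot` is hypothetical, and missing values are represented as `None`.

```python
# Sample rows copied from the question: (id, key, value).
rows = [(1, "name", "Bob"), (1, "surname", "Test"), (1, "car", "Tesla"),
        (2, "name", "Mark"), (2, "cat", "Bobby")]


def pivot(rows):
    """Turn (id, key, value) rows into one row per id, one column per key."""
    keys = []    # distinct keys, in order of first appearance
    by_id = {}   # id -> {key: value}
    for id_, key, value in rows:
        if key not in keys:
            keys.append(key)
        by_id.setdefault(id_, {})[key] = value
    header = ["id"] + keys
    table = [[id_] + [values.get(k) for k in keys]
             for id_, values in sorted(by_id.items())]
    return header, table


header, table = pivot(rows)
print(header)
for line in table:
    print(line)
```

Because the column list depends on the data, the output shape is only known after scanning all rows, which is the core difficulty the question is asking about.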

How to remove a row and the next row (rows) if it is in another table using SAS?

I'm working with SAS and I have a data frame like this:
Table1:
+------+------------+-----------+--------+
| name | date | time | price |
+------+------------+-----------+--------+
| A | 7-May-08 | 11:12:41 | 1 |
| A | 11-Jul-08 | 11:23:41 | 2 |
| A | 3-Jan-09 | 11:31:41 | 1 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 8-Jul-09 | 11:32:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 2 |
| A | 24-Jul-09 | 11:32:41 | 3 |
| A | 24-Jul-09 | 11:32:41 | 4 |
| A | 8-Dec-09 | 12:32:41 | 1 |
| B | 7-May-08 | 11:31:41 | 2 |
| B | 10-May-08 | 11:32:41 | 3 |
| B | 17-May-08 | 11:33:41 | 4 |
| B | 24-May-08 | 11:34:41 | 1 |
| B | 1-Jun-08 | 11:35:41 | 5 |
| B | 18-Jun-08 | 11:36:41 | 1 |
| B | 9-May-09 | 11:37:41 | 3 |
| C | 7-Oct-09 | 11:21:41 | 3 |
| C | 17-Oct-09 | 11:22:41 | 2 |
| C | 25-Oct-09 | 11:32:41 | 1 |
| C | 18-Nov-09 | 11:33:41 | 3 |
| C | 4-Dec-09 | 11:12:41 | 4 |
| C | 19-Dec-09 | 10:22:41 | 1 |
| C | 9-May-10 | 11:42:41 | 3 |
| C | 9-May-10 | 11:12:41 | 1 |
| C | 10-May-10 | 12:52:41 | 2 |
+------+------------+-----------+--------+
I have another data frame like this:
table2:
+------+-----------+
| name | date |
+------+-----------+
| A | 11-Jul-08 |
| A | 3-Jan-09 |
| A | 24-Jul-09 |
| B | 7-May-08 |
| B | 17-May-08 |
| B | 18-Jun-08 |
| B | 9-Jul-09 |
| C | 17-Oct-09 |
| C | 4-Dec-09 |
| C | 19-Dec-09 |
+------+-----------+
Now I want to do two main operations:
1- If a name and date in table2 also appear in table1, delete that row from table1;
2- If the previous step deleted a row, also delete the next row for that name, and if that next row's name and date are repeated in the rows after it, delete all of those too.
For example, table1 should end up like this:
+------+-----------+----------+-------+
| name | date | time | price |
+------+-----------+----------+-------+
| A | 7-May-08 | 11:12:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 2 |
| B | 1-Jun-08 | 11:35:41 | 5 |
| C | 7-Oct-09 | 11:21:41 | 3 |
| C | 18-Nov-09 | 11:33:41 | 3 |
| C | 10-May-10 | 12:52:41 | 2 |
+------+-----------+----------+-------+
Here is my code, which is not right for this operation for two reasons:
1- The nodupkey option deletes all duplicate observations from table1, which is not wanted: duplicates should be deleted only when the conditions described above are satisfied.
2- The "(inb=0 and lag(inb)=1 and not first.name)" condition deletes just one following row; any further rows with the same name and date remain in table1.
proc sort data=table1 out=tablea1 nodupkey;
  by name date;
run;
proc sort data=table2 out=tableb1 nodupkey;
  by name date;
run;
data want;
  merge tablea1 tableb1(in=inb);
  by name date;
  if inb or (inb=0 and lag(inb)=1 and not first.name) then delete;
run;
Thanks in advance.
Armin:
In a complex merge and process operation you will need some additional variables for maintaining the state of your business rules. The case of deleting the next row of a match and duplicates thereof requires tracking of the next name and date.
For example:
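data have;
  input @;                         /* hold the current line so it can be inspected */
  if _infile_ ne: '+';             /* skip the table border lines */
  attrib
    name  length=$10
    date  length=4 informat=date9. format=date11.
    time  length=4 informat=time8. format=time8.
    price length=8
  ;
  infile cards dlm='|' firstobs=4;
  input @1 name date time price;   /* re-read the held line from column 1 */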
datalines;
+------+------------+-----------+--------+
| name | date | time | price |
+------+------------+-----------+--------+
| A | 7-May-08 | 11:12:41 | 1 |
| A | 11-Jul-08 | 11:23:41 | 2 |
| A | 3-Jan-09 | 11:31:41 | 1 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 8-Jul-09 | 11:32:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 2 |
| A | 24-Jul-09 | 11:32:41 | 3 |
| A | 24-Jul-09 | 11:32:41 | 4 |
| A | 8-Dec-09 | 12:32:41 | 1 |
| B | 7-May-08 | 11:31:41 | 2 |
| B | 10-May-08 | 11:32:41 | 3 |
| B | 17-May-08 | 11:33:41 | 4 |
| B | 24-May-08 | 11:34:41 | 1 |
| B | 1-Jun-08 | 11:35:41 | 5 |
| B | 18-Jun-08 | 11:36:41 | 1 |
| B | 9-May-09 | 11:37:41 | 3 |
| C | 7-Oct-09 | 11:21:41 | 3 |
| C | 17-Oct-09 | 11:22:41 | 2 |
| C | 25-Oct-09 | 11:32:41 | 1 |
| C | 18-Nov-09 | 11:33:41 | 3 |
| C | 4-Dec-09 | 11:12:41 | 4 |
| C | 19-Dec-09 | 10:22:41 | 1 |
| C | 9-May-10 | 11:42:41 | 3 |
| C | 9-May-10 | 11:12:41 | 1 |
| C | 10-May-10 | 12:52:41 | 2 |
+------+------------+-----------+--------+
;
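data filter;
  input @;                         /* hold the current line so it can be inspected */
  if _infile_ ne: '+';             /* skip the table border lines */
  attrib
    name length=$10
    date length=4 informat=date9. format=date11.
  ;
  infile cards dlm='|' firstobs=4;
  input @1 name date;              /* re-read the held line from column 1 */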
datalines;
+------+-----------+
| name | date |
+------+-----------+
| A | 11-Jul-08 |
| A | 3-Jan-09 |
| A | 24-Jul-09 |
| B | 7-May-08 |
| B | 17-May-08 |
| B | 18-Jun-08 |
| B | 9-Jul-09 |
| C | 17-Oct-09 |
| C | 4-Dec-09 |
| C | 19-Dec-09 |
+------+-----------+
;
run;
data want (keep=name date time price);
  merge have(in=_have) filter(in=_filter);
  by name date;
  length match_at_n 4 next_name $10 next_date 4;
  retain match_at_n next_name next_date;
  if first.name then /* prevent delete next from sloshing into next group */
    match_at_n = -1;
  if _have and _filter then do;
    match_at_n = _n_;
    delete;
  end;
  if _filter then
    delete;
  * condition here is _have and not _filter;
  if _n_ = match_at_n + 1 then do;
    next_name = name;
    next_date = date;
    delete;
  end;
  if name = next_name and date = next_date then
    delete;
run;
I suppose the same outcome could be achieved with a single complex compound if statement that involved a variety of lags, flags and sums. Regardless, I would favor clarity over cleverness.
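The state-tracking idea described in this answer can also be sketched in Python, which may make the business rules easier to follow. This is an illustration under assumptions, not a port of the SAS step: dates are rewritten as ISO strings so they sort correctly, the time column is dropped because the rules never use it, and the names `drop_matches_and_followers`, `prev_was_match` and `follower_key` are made up for this example.

```python
# Rows of table1 as (name, date, price), sorted by name and date.
have = [
    ("A", "2008-05-07", 1), ("A", "2008-07-11", 2), ("A", "2009-01-03", 1),
    ("A", "2009-01-04", 2), ("A", "2009-01-04", 2), ("A", "2009-07-08", 1),
    ("A", "2009-07-08", 2), ("A", "2009-07-24", 3), ("A", "2009-07-24", 4),
    ("A", "2009-12-08", 1), ("B", "2008-05-07", 2), ("B", "2008-05-10", 3),
    ("B", "2008-05-17", 4), ("B", "2008-05-24", 1), ("B", "2008-06-01", 5),
    ("B", "2008-06-18", 1), ("B", "2009-05-09", 3), ("C", "2009-10-07", 3),
    ("C", "2009-10-17", 2), ("C", "2009-10-25", 1), ("C", "2009-11-18", 3),
    ("C", "2009-12-04", 4), ("C", "2009-12-19", 1), ("C", "2010-05-09", 3),
    ("C", "2010-05-09", 1), ("C", "2010-05-10", 2),
]

# (name, date) pairs from table2.
matches = {
    ("A", "2008-07-11"), ("A", "2009-01-03"), ("A", "2009-07-24"),
    ("B", "2008-05-07"), ("B", "2008-05-17"), ("B", "2008-06-18"),
    ("B", "2009-07-09"), ("C", "2009-10-17"), ("C", "2009-12-04"),
    ("C", "2009-12-19"),
}


def drop_matches_and_followers(rows, matches):
    """Drop matched rows, the row after a match, and that row's duplicates."""
    kept = []
    prev_name = None
    prev_was_match = False   # was the previous row (same name) a match?
    follower_key = None      # (name, date) of the row deleted after a match
    for name, date, price in rows:
        if name != prev_name:            # new name: reset the state
            prev_was_match = False
            follower_key = None
        key = (name, date)
        if key in matches:               # rule 1: matched row is deleted
            prev_was_match = True
        elif prev_was_match:             # rule 2: first row after a match
            follower_key = key
            prev_was_match = False
        elif key == follower_key:        # rule 2: duplicates of that row
            pass
        else:
            kept.append((name, date, price))
        prev_name = name
    return kept


want = drop_matches_and_followers(have, matches)
for row in want:
    print(row)
```

On the sample data this keeps seven rows: three for A, one for B and three for C.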
Based on Ksharp's code in SAS community:
data temp;
  set table2(in=inb) table1;
  by name date;
  group+first.date;
  _inb=inb;
run;
data key;
  set temp(where=(_inb=1));
  output;
  group=group+1;
  output;
  keep name group;
run;
proc sql;
  create table want as
  select name, date, time, price
  from temp
  where catx(' ',name,group) not in
    (select catx(' ',name,group) from key);
quit;
data want;
  merge table1 table2(in=inb);
  by name date;
  retain _date num;
  if first.name then call missing(_date,num);
  if inb then do;
    num=_n_;
    delete;
  end;
  else if _n_-num=1 then do;
    _date=date;
    delete;
  end;
  else if _date=date then delete;
  drop _date num;
run;

Need to shift the data to next column, unfortunately added data in wrong column

I have a table test
+----+---------+-------+
| ID | Name1   | Name2 |
+----+---------+-------+
| 1  | Andy    | NULL  |
| 2  | Kevin   | NULL  |
| 3  | Phil    | NULL  |
| 4  | Maria   | NULL  |
| 5  | Jackson | NULL  |
+----+---------+-------+
I am expecting output like
+----+-------+---------+
| ID | Name1 | Name2   |
+----+-------+---------+
| 1  | NULL  | Andy    |
| 2  | NULL  | Kevin   |
| 3  | NULL  | Phil    |
| 4  | NULL  | Maria   |
| 5  | NULL  | Jackson |
+----+-------+---------+
Unfortunately, I inserted the data into the wrong column, and now I want to shift it to the next column.
You can use an UPDATE statement with no WHERE condition, to cover the entire table.
UPDATE test
SET Name2 = Name1,
    Name1 = NULL;
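As a quick way to try the statement safely before running it on real data, here is a small sketch using Python's built-in `sqlite3` module with an in-memory copy of the sample table. SQLite is an assumption made for the demo, since the question does not name a database; in SQLite the right-hand sides of `SET` are evaluated before any assignment is made, so `Name2` receives the old `Name1` value.

```python
import sqlite3

# In-memory database with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (ID INTEGER PRIMARY KEY, Name1 TEXT, Name2 TEXT)")
conn.executemany(
    "INSERT INTO test (ID, Name1) VALUES (?, ?)",
    [(1, "Andy"), (2, "Kevin"), (3, "Phil"), (4, "Maria"), (5, "Jackson")],
)

# No WHERE clause, so every row is updated.
conn.execute("UPDATE test SET Name2 = Name1, Name1 = NULL")

rows = conn.execute("SELECT ID, Name1, Name2 FROM test ORDER BY ID").fetchall()
for row in rows:
    print(row)
conn.close()
```

Other engines may order the assignments differently, so it is worth checking the result on a copy of the table first.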

Series of conditional table and cell references

I have a reference table as such in Sheet2 of my workbook
|       | Score 1              | Score 2              |
----------------------------------------------------------
| name  | min | max | target | min | max | target |
----------------------------------------------------------
| jeff  | 30  | 40  | 35     | 45  | 55  | 50     |
----------------------------------------------------------
| steve | 35  | 45  | 40     | 45  | 65  | 55     |
then in Sheet1 I have a list of scores for each name as such
| jeff | 1 | | | | steve | 3 | | |
------------------------------------------------------------
| jeff | 2 | | | | steve | 2 | | |
------------------------------------------------------------
| jeff | 2 | | | | steve | 3 | | |
------------------------------------------------------------
| jeff | 3 | | | | steve | 3 | | |
------------------------------------------------------------
| jeff | 1 | | | | steve | 2 | | |
------------------------------------------------------------
I am aware of simple lookups and offsetting values, but I can't think of a way to do multiple references on different levels. Is there a function I can put in Sheet1 next to each score that looks up the score, then who the score is for, and then returns the corresponding min, max and target values for that person and score?
So if it sees 1 and then jeff, it returns 30 | 40 | 35 in the next 3 cells. I would do this manually, but the list is very long and is populated daily by an existing macro.
Use VLOOKUP keyed on the name (jeff), and use the score index (1) to calculate which column to return.
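The column arithmetic behind that suggestion can be sketched in Python. This is only an illustration of the idea, assuming each score occupies three consecutive columns (min, max, target) after the name, as in the reference table; the names `reference` and `lookup` are invented for this example.

```python
# Reference table from Sheet2: per name, the values for Score 1 then Score 2,
# each as (min, max, target).
reference = {
    "jeff":  [30, 40, 35, 45, 55, 50],
    "steve": [35, 45, 40, 45, 65, 55],
}

COLUMNS_PER_SCORE = 3  # min, max, target


def lookup(name, score):
    """Return (min, max, target) for the given name and 1-based score index."""
    start = (score - 1) * COLUMNS_PER_SCORE
    values = reference[name][start:start + COLUMNS_PER_SCORE]
    if len(values) != COLUMNS_PER_SCORE:
        raise KeyError(f"no columns for score {score}")
    return tuple(values)


print(lookup("jeff", 1))
print(lookup("steve", 2))
```

In spreadsheet terms, with the name in the first column of the lookup range, the same offset gives a column index of 2 + (score - 1) * 3 for min, and the next two columns for max and target.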

sort a table while keeping the hierarchy of rows

I have a table which represents the hierarchy of departments:
+-----------+--------------+--------------+--------------+-----------+-------+
| Top Dept. | 2-tier Dept. | 3-tier Dept. | 4-tier Dept. | name | tier |
+-----------+--------------+--------------+--------------+-----------+-------+
| 00 | | | | abc | 0 |
| | 00-01 | | | bcd | 1 |
| | | 00-01-01 | | cde | 2 |
| | | 00-01-02 | | abc | 2 |
| | 00-02 | | | aef | 1 |
| | | 00-02-01 | | qwe | 2 |
| | | 00-02-03 | | abc | 2 |
| | | | 00-02-03-01 | abc | 3 |
+-----------+--------------+--------------+--------------+-----------+-------+
Now I want to sort the rows that are in the same tier by their names while keeping the overall hierarchy. This is what I expect:
+-----------+--------------+--------------+--------------+-----------+-------+
| Top Dept. | 2-tier Dept. | 3-tier Dept. | 4-tier Dept. | name | tier |
+-----------+--------------+--------------+--------------+-----------+-------+
| 00 | | | | abc | 0 |
| | 00-02 | | | aef | 1 |
| | | 00-02-03 | | abc | 2 |
| | | 00-02-01 | | qwe | 2 |
| | 00-01 | | | bcd | 1 |
| | | 00-01-02 | | abc | 2 |
| | | 00-01-01 | | cde | 2 |
| | | | 00-02-03-01 | abc | 3 |
+-----------+--------------+--------------+--------------+-----------+-------+
Blank cells mean NULL. I'm using Oracle DB. Can anyone help me?
EDIT: Actually, this is a simplified version of my SQL. I tried adding a new column that concatenates the values of the first four columns and then ordering by it and by name, but it didn't work.
Update: This appears to be working... SQL Fiddle
All that was really needed from my original comment was to concatenate name and department, in that order, in both selects. This allows the engine to sort by name first while maintaining the hierarchy.
WITH cte (Dept, superiorDept, name, depth, sort) AS (
    -- anchor member: top-level departments have no superior
    SELECT
        Dept,
        superiorDept,
        name,
        0,
        name || Dept
    FROM hierarchy h
    WHERE superiorDept IS NULL
    UNION ALL
    -- recursive member: extend the parent's sort key with this row's name and Dept
    SELECT
        h2.Dept,
        h2.superiorDept,
        h2.name,
        cte.depth + 1,
        cte.sort || h2.name || h2.Dept
    FROM hierarchy h2
    INNER JOIN cte ON h2.superiorDept = cte.Dept
)
SELECT
    CASE WHEN depth = 0 THEN Dept END AS 一级部门,  -- first-level department
    CASE WHEN depth = 1 THEN Dept END AS 二级部门,  -- second-level department
    CASE WHEN depth = 2 THEN Dept END AS 三级部门,  -- third-level department
    CASE WHEN depth = 3 THEN Dept END AS 四级部门,  -- fourth-level department
    name,
    depth,
    sort
FROM cte
ORDER BY sort, name
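The sort-key idea in the query above (each row's key is its ancestors' names and departments followed by its own) can be tried out in a few lines of Python. This is a sketch under assumptions: the `(Dept, superiorDept, name)` layout is inferred from the query rather than shown in the question, the helper `sort_path` is invented for this example, and tuples are used instead of string concatenation so that names of different lengths cannot run together.

```python
# (Dept, superiorDept, name); the superiorDept values are inferred from the
# department codes in the question's table.
hierarchy = [
    ("00", None, "abc"),
    ("00-01", "00", "bcd"),
    ("00-01-01", "00-01", "cde"),
    ("00-01-02", "00-01", "abc"),
    ("00-02", "00", "aef"),
    ("00-02-01", "00-02", "qwe"),
    ("00-02-03", "00-02", "abc"),
    ("00-02-03-01", "00-02-03", "abc"),
]

by_dept = {dept: (parent, name) for dept, parent, name in hierarchy}


def sort_path(dept):
    """Build the (name, dept) pairs from the root down to this department."""
    path = []
    while dept is not None:
        parent, name = by_dept[dept]
        path.append((name, dept))
        dept = parent
    return tuple(reversed(path))


# Siblings sort by name; children always follow their own parent.
ordered = sorted(by_dept, key=sort_path)
for dept in ordered:
    depth = len(sort_path(dept)) - 1
    print("  " * depth + dept, by_dept[dept][1])
```

Note that with this key, 00-02-03-01 is listed directly under its parent 00-02-03 rather than at the end of the table.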