How to remove a row and the next row (rows) if it is in another table using SAS? - sql

I'm working with SAS and I have a data frame like this:
Table1:
+------+------------+-----------+--------+
| name | date | time | price |
+------+------------+-----------+--------+
| A | 7-May-08 | 11:12:41 | 1 |
| A | 11-Jul-08 | 11:23:41 | 2 |
| A | 3-Jan-09 | 11:31:41 | 1 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 8-Jul-09 | 11:32:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 2 |
| A | 24-Jul-09 | 11:32:41 | 3 |
| A | 24-Jul-09 | 11:32:41 | 4 |
| A | 8-Dec-09 | 12:32:41 | 1 |
| B | 7-May-08 | 11:31:41 | 2 |
| B | 10-May-08 | 11:32:41 | 3 |
| B | 17-May-08 | 11:33:41 | 4 |
| B | 24-May-08 | 11:34:41 | 1 |
| B | 1-Jun-08 | 11:35:41 | 5 |
| B | 18-Jun-08 | 11:36:41 | 1 |
| B | 9-May-09 | 11:37:41 | 3 |
| C | 7-Oct-09 | 11:21:41 | 3 |
| C | 17-Oct-09 | 11:22:41 | 2 |
| C | 25-Oct-09 | 11:32:41 | 1 |
| C | 18-Nov-09 | 11:33:41 | 3 |
| C | 4-Dec-09 | 11:12:41 | 4 |
| C | 19-Dec-09 | 10:22:41 | 1 |
| C | 9-May-10 | 11:42:41 | 3 |
| C | 9-May-10 | 11:12:41 | 1 |
| C | 10-May-10 | 12:52:41 | 2 |
+------+------------+-----------+--------+
I have another data frame like this:
table2:
+------+-----------+
| name | date |
+------+-----------+
| A | 11-Jul-08 |
| A | 3-Jan-09 |
| A | 24-Jul-09 |
| B | 7-May-08 |
| B | 17-May-08 |
| B | 18-Jun-08 |
| B | 9-Jul-09 |
| C | 17-Oct-09 |
| C | 4-Dec-09 |
| C | 19-Dec-09 |
+------+-----------+
Now I want to do two main operations:
1- If a name and date in table2 also appear in table1, delete that row from table1;
2- If step 1 deleted a row, also delete the next row for that name, and if that next row's name and date are repeated in the rows that follow, delete all of those as well.
For example, table1 should end up like this:
+------+-----------+----------+-------+
| name | date | time | price |
+------+-----------+----------+-------+
| A | 7-May-08 | 11:12:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 2 |
| B | 1-Jun-08 | 11:35:41 | 5 |
| C | 7-Oct-09 | 11:21:41 | 3 |
| C | 18-Nov-09 | 11:33:41 | 3 |
| C | 10-May-10 | 12:52:41 | 2 |
+------+-----------+----------+-------+
Here is some code that does not quite do this, for two reasons:
1- The nodupkey option deletes all duplicate observations from table1, which is not wanted: duplicates should only be deleted when the conditions described above are met.
2- The "(inb=0 and lag(inb)=1 and not first.name)" condition deletes only one following row; any further rows with the same name and date remain in table1.
proc sort data=table1 out=tablea1 nodupkey;
by name date;
run;
proc sort data=table2 out=tableb1 nodupkey;
by name date;
run;
data want;
merge tablea1 tableb1(in=inb) ;
by name date;
if inb or (inb=0 and lag(inb)=1 and not first.name) then delete;
run;
Thanks in advance.

Armin:
In a complex merge and process operation you will need some additional variables for maintaining the state of your business rules. The case of deleting the next row of a match and duplicates thereof requires tracking of the next name and date.
For example:
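data have;
  input @;                        /* load the line and hold it for inspection */
  if _infile_ ne: '+';            /* skip the +---+ border lines */
  attrib
    name  length=$10
    date  length=4 informat=date9. format=date11.
    time  length=4 informat=time8. format=time8.
    price length=8
  ;
  infile cards dlm='|' firstobs=4;
  input @1 name date time price;  /* re-read the held line from column 1 */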
datalines;
+------+------------+-----------+--------+
| name | date | time | price |
+------+------------+-----------+--------+
| A | 7-May-08 | 11:12:41 | 1 |
| A | 11-Jul-08 | 11:23:41 | 2 |
| A | 3-Jan-09 | 11:31:41 | 1 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 8-Jul-09 | 11:32:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 2 |
| A | 24-Jul-09 | 11:32:41 | 3 |
| A | 24-Jul-09 | 11:32:41 | 4 |
| A | 8-Dec-09 | 12:32:41 | 1 |
| B | 7-May-08 | 11:31:41 | 2 |
| B | 10-May-08 | 11:32:41 | 3 |
| B | 17-May-08 | 11:33:41 | 4 |
| B | 24-May-08 | 11:34:41 | 1 |
| B | 1-Jun-08 | 11:35:41 | 5 |
| B | 18-Jun-08 | 11:36:41 | 1 |
| B | 9-May-09 | 11:37:41 | 3 |
| C | 7-Oct-09 | 11:21:41 | 3 |
| C | 17-Oct-09 | 11:22:41 | 2 |
| C | 25-Oct-09 | 11:32:41 | 1 |
| C | 18-Nov-09 | 11:33:41 | 3 |
| C | 4-Dec-09 | 11:12:41 | 4 |
| C | 19-Dec-09 | 10:22:41 | 1 |
| C | 9-May-10 | 11:42:41 | 3 |
| C | 9-May-10 | 11:12:41 | 1 |
| C | 10-May-10 | 12:52:41 | 2 |
+------+------------+-----------+--------+
;
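data filter;
  input @;                        /* load the line and hold it for inspection */
  if _infile_ ne: '+';            /* skip the +---+ border lines */
  attrib
    name length=$10
    date length=4 informat=date9. format=date11.
  ;
  infile cards dlm='|' firstobs=4;
  input @1 name date;             /* re-read the held line from column 1 */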
datalines;
+------+-----------+
| name | date |
+------+-----------+
| A | 11-Jul-08 |
| A | 3-Jan-09 |
| A | 24-Jul-09 |
| B | 7-May-08 |
| B | 17-May-08 |
| B | 18-Jun-08 |
| B | 9-Jul-09 |
| C | 17-Oct-09 |
| C | 4-Dec-09 |
| C | 19-Dec-09 |
+------+-----------+
;
run;
data want (keep=name date time price);
  merge have(in=_have) filter(in=_filter);
  by name date;
  length match_at_n 4 next_name $10 next_date 4;
  retain match_at_n next_name next_date;
  if first.name then /* prevent delete next from sloshing into next group */
    match_at_n = -1;
  if _have and _filter then do;
    match_at_n = _n_;
    delete;
  end;
  if _filter then
    delete;
  * condition here is _have and not _filter;
  if _n_ = match_at_n + 1 then do;
    next_name = name;
    next_date = date;
    delete;
  end;
  if name = next_name and date = next_date then
    delete;
run;
I suppose the same outcome could be achieved with a single compound IF statement involving a variety of lags, flags and sums; regardless, I would favor clarity over cleverness.

Based on Ksharp's code in SAS community:
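data temp;
  set table2(in=inb) table1;  /* interleave; table2 rows come first within each key */
  by name date;
  group+first.date;           /* running counter: one group per name/date */
  _inb=inb;
run;
data key;
  set temp(where=(_inb=1));
  output;                     /* the group that appears in table2 */
  group=group+1;
  output;                     /* and the group that follows it */
  keep name group;
run;
proc sql;
  create table want as
  select name, date, time, price
  from temp
  where catx(' ',name,group) not in
    (select catx(' ',name,group) from key);
quit;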

data want;
  merge table1 table2(in=inb);
  by name date;
  retain _date num;
  if first.name then call missing(_date,num);
  if inb then do;
    num=_n_;
    delete;
  end;
  else if _n_-num=1 then do;
    _date=date;
    delete;
  end;
  else if _date=date then delete;
  drop _date num;
run;
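
Both DATA step answers above find the "next row" by comparing _n_ with the observation number of the last match. Since _n_ also counts table2 keys that have no row in table1, a flag may be a sturdier way to carry that state. A minimal sketch of that variant, assuming table1 and table2 are sorted by name and date, that date is a numeric SAS date in both, and using hypothetical helper variables _pending and _skip_date:

```sas
/* Sketch only: assumes both tables are sorted by name and date.      */
/* _pending and _skip_date are illustrative names, not from the post. */
data want;
  merge table1(in=in1) table2(in=in2);
  by name date;
  retain _pending 0 _skip_date .;
  if first.name then do;             /* never carry state into the next name */
    _pending = 0;
    _skip_date = .;
  end;
  if in2 then do;                    /* this name/date is listed in table2 */
    if in1 then _pending = 1;        /* arm the rule only for a real match */
    delete;
  end;
  if _pending then do;               /* first table1 row after a match */
    _pending = 0;
    _skip_date = date;               /* remember its date for duplicates */
    delete;
  end;
  if date = _skip_date then delete;  /* duplicates of that next row */
  drop _pending _skip_date;
run;
```

Traced by hand against the sample data in the question, this should leave seven rows: A 7-May-08, both A 8-Jul-09 rows, B 1-Jun-08, C 7-Oct-09, C 18-Nov-09 and C 10-May-10. PROC SORT without NODUPKEY on table1 keeps its duplicate rows intact before the merge.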

Related

SQL Query - Add column data from another table adding nulls

I have 2 tables, tableStock and tableParts:
tableStock
+----+----------+-------------+
| ID | Num_Part | Description |
+----+----------+-------------+
| 1 | sr37 | plate |
+----+----------+-------------+
| 2 | sr56 | punch |
+----+----------+-------------+
| 3 | sl30 | crimper |
+----+----------+-------------+
| 4 | mp11 | holder |
+----+----------+-------------+
tableParts
+----+----------+-------+
| ID | Location | Stock |
+----+----------+-------+
| 1 | A | 2 |
+----+----------+-------+
| 3 | B | 5 |
+----+----------+-------+
| 5 | C | 2 |
+----+----------+-------+
| 7 | A | 1 |
+----+----------+-------+
And I just want to do this:
+----+----------+-------------+----------+-------+
| ID | Num_Part | Description | Location | Stock |
+----+----------+-------------+----------+-------+
| 1 | sr37 | plate | A | 2 |
+----+----------+-------------+----------+-------+
| 2 | sr56 | punch | NULL | NULL |
+----+----------+-------------+----------+-------+
| 3 | sl30 | crimper | B | 5 |
+----+----------+-------------+----------+-------+
| 4 | mp11 | holder | NULL | NULL |
+----+----------+-------------+----------+-------+
I want to list ALL the rows of the first table and, where the second table has the information (here 'location' and 'stock'), add it to the columns; otherwise just show null.
I have been trying inner and left joins, but some rows of the first table disappear because of the lack of data in the second one:
select tableStock.ID, tableStock.Num_Part, tableStock.Description, tableParts.Location, tableParts.Stock from tableStock inner join tableParts on tableStock.ID = tableParts.ID;
What can I do?
You can use a left join:
select
    s.ID,
    s.Num_Part,
    s.Description,
    p.Location,
    p.Stock
from tableStock s
left join tableParts p
    on s.ID = p.ID
order by
    s.ID
output:
| id | num_part | description | location | stock |
| --- | -------- | ----------- | -------- | ----- |
| 1 | sr37 | plate | A | 2 |
| 2 | sr56 | punch | NULL | NULL |
| 3 | sl30 | crimper | B | 5 |
| 4 | mp11 | holder | NULL | NULL |

SAS SQL: Many to many relationships with 2 tables BUT don't want multiple rows

I have two tables I need to join. These tables only share 1 field in common (ID, and it isn't unique). Is it possible to join these two tables but make it unique and keep all matching data in a row?
For example, I have two tables as follows:
+-------+----------+
| ID | NAME |
+-------+----------+
| A | Jack |
| A | Andy |
| A | Steve |
| A | Jay |
| B | Chris |
| B | Vicky |
| B | Emma |
+-------+----------+
And another table that is ONLY related by the ID column:
+-------+--------+
| ID | Age |
+-------+--------+
| A | 22 |
| A | 31 |
| A | 11 |
| B | 40 |
| B | 17 |
| B | 20 |
| B | 3 |
| B | 65 |
+-------+--------+
The end result I'd like to get is:
+-------+----------+-------+
| ID | NAME | Age |
+-------+----------+-------+
| A | Jack | 22 |
| A | Andy | 31 |
| A | Steve | 11 |
| A | Jay | null |
| B | Chris | 40 |
| B | Vicky | 17 |
| B | Emma | 20 |
| B | null | 3 |
| B | null | 65 |
+-------+----------+-------+
This is the default behavior of the data step merge, except that it won't set the last row's variable to missing - but it's easy to fudge.
There are other ways to do this, the best in my opinion being the hash object if you're comfortable with that.
data names;
infile datalines dlm='|';
input ID $ NAME $;
datalines;
| A | Jack |
| A | Andy |
| A | Steve |
| A | Jay |
| B | Chris |
| B | Vicky |
| B | Emma |
;;;;
run;
data ages;
infile datalines dlm='|';
input id $ age;
datalines;
| A | 22 |
| A | 31 |
| A | 11 |
| B | 40 |
| B | 17 |
| B | 20 |
| B | 3 |
| B | 65 |
;;;;
run;
data want;
merge names(in=_a) ages(in=_b);
by id;
if _a;
if name ne lag(name) then output; *this assumes `name` is unique in id - if it is not we may have to do a bit more work here;
call missing(age); *clear age after output so we do not attempt to fill extra rows with the same age - age will be 'retain'ed;
run;
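
If every unmatched row on either side should survive, as in the desired output, another option is to number the rows within each ID and merge on that counter. A minimal sketch, assuming names and ages are already sorted by id and using a hypothetical helper variable seq:

```sas
/* Sketch only: seq is an illustrative helper variable, and both */
/* inputs are assumed to be sorted by id already.                */
data names_seq;
  set names;
  by id;
  if first.id then seq = 0;
  seq + 1;                  /* sum statement: seq is retained automatically */
run;

data ages_seq;
  set ages;
  by id;
  if first.id then seq = 0;
  seq + 1;
run;

data want;
  merge names_seq ages_seq;
  by id seq;                /* pair the n-th name with the n-th age per id */
  drop seq;
run;
```

Under those assumptions each id gets as many rows as the longer of its two lists, with name or age left missing on the shorter side.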

Need to shift the data to next column, unfortunately added data in wrong column

I have a table test
+----+---------+-------+
| ID | Name1 | Name2 |
+----+---------+-------+
| 1 | Andy | NULL |
| 2 | Kevin | NULL |
| 3 | Phil | NULL |
| 4 | Maria | NULL |
| 5 | Jackson | NULL |
+----+---------+-------+
I am expecting output like
+----+-------+---------+
| ID | Name1 | Name2 |
+----+-------+---------+
| 1 | NULL | Andy |
| 2 | NULL | Kevin |
| 3 | NULL | Phil |
| 4 | NULL | Maria |
| 5 | NULL | Jackson |
+----+-------+---------+
Unfortunately, I inserted the data into the wrong column, and now I want to shift it to the next column.
You can use an UPDATE statement with no WHERE condition, to cover the entire table.
UPDATE test
SET Name2 = Name1,
Name1 = NULL

Considering values from one table as column header in another

I have a base table where I need to calculate the difference between two dates based on the type of the entry.
tblA
+----------+------------+---------------+--------------+
| TypeCode | Log_Date | Complete_Date | Pending_Date |
+----------+------------+---------------+--------------+
| 1 | 18/04/2016 | 19/04/2016 | |
| 2 | 10/04/2016 | 18/04/2016 | 15/04/2016 |
| 3 | 12/04/2016 | 19/04/2016 | |
| 4 | 15/04/2016 | 17/04/2016 | 16/04/2016 |
| 5 | 16/04/2016 | 21/04/2016 | |
| 1 | 19/04/2016 | 20/04/2016 | |
| 2 | 20/03/2016 | 31/03/2015 | |
| 3 | 25/03/2016 | 28/03/2016 | |
| 4 | 26/03/2016 | 27/03/2016 | |
| 5 | 27/03/2016 | 30/03/2016 | |
+----------+------------+---------------+--------------+
I have another look up table which has the column names to be considered based on the TypeCode.
tblB
+----------+----------+---------------+
| TypeCode | DateCol1 | DateCol2 |
+----------+----------+---------------+
| 1 | Log_Date | Complete_Date |
| 2 | Log_Date | Pending_Date |
| 3 | Log_Date | Complete_Date |
| 4 | Log_Date | Pending_Date |
| 5 | Log_Date | Complete_Date |
+----------+----------+---------------+
I am doing a simple DATEDIFF between two dates for my calculation. However I want to lookup which columns to consider for this calculation from tblB and apply it on tblA based on the TypeCode.
For example, when the TypeCode is 2 or 4 the calculation should be DATEDIFF(d, Log_Date, Pending_Date); otherwise it should be DATEDIFF(d, Log_Date, Complete_Date).
Resulting table:
+----------+------------+---------------+--------------+----------+
| TypeCode | Log_Date | Complete_Date | Pending_Date | Cal_Days |
+----------+------------+---------------+--------------+----------+
| 1 | 18/04/2016 | 19/04/2016 | | 1 |
| 2 | 10/04/2016 | 18/04/2016 | 15/04/2016 | 5 |
| 3 | 12/04/2016 | 19/04/2016 | | 7 |
| 4 | 15/04/2016 | 17/04/2016 | 16/04/2016 | 1 |
| 5 | 16/04/2016 | 21/04/2016 | | 5 |
| 1 | 19/04/2016 | 20/04/2016 | | 1 |
| 2 | 20/03/2016 | 31/03/2015 | | |
| 3 | 25/03/2016 | 28/03/2016 | | 3 |
| 4 | 26/03/2016 | 27/03/2016 | | |
| 5 | 27/03/2016 | 30/03/2016 | | 3 |
+----------+------------+---------------+--------------+----------+
Any help would be appreciated. Thanks.
Use JOIN with CASE expression:
SELECT
    a.*,
    Cal_Days =
        DATEDIFF(
            DAY,
            CASE
                WHEN b.DateCol1 = 'Log_Date' THEN a.Log_Date
                WHEN b.DateCol1 = 'Complete_Date' THEN a.Complete_Date
                ELSE a.Pending_Date
            END,
            CASE
                WHEN b.DateCol2 = 'Log_Date' THEN a.Log_Date
                WHEN b.DateCol2 = 'Complete_Date' THEN a.Complete_Date
                ELSE a.Pending_Date
            END
        )
FROM TblA a
INNER JOIN TblB b
    ON b.TypeCode = a.TypeCode

Hiding inside group columns from other columns that don't have values

I'm working on a report. How do I get the date columns that sit outside the matrix to appear next to the column inside the matrix that displays the values?
For example it is setup like this:
| HiredDt | TermDt | [Type] | LicDt | MedDt |
---------------------------------------------------------------------------------
ID | [HiredDt] | [TermDt] | SUM([Count_of_Type]) | [LicDt] | [MedDt] |
---------------------------------------------------------------------------------
And looks like this:
| HiredDt | TermDt | Lic | Med | App | LicDt | MedDt |
----------------------------------------------------------------------------------------
1 | 1/31/12 | 1/31/14 | 1 | 1 | 12 | 6/1/15 | 9/1/14 |
2 | 2/19/12 | 9/18/14 | 1 | 1 | 12 | 3/2/15 | 9/1/14 |
But when I use inside grouping to match up the date next to the associated document type I get:
| HiredDt | TermDt | Lic | | | Med | | | App | | |
----------------------------------------------------------------------------------------------------------------------------
1 | 1/31/12 | 1/31/14 | 1 | 6/1/15 | | 1 | | 9/1/2014 | 12 | | |
2 | 2/19/12 | 9/18/14 | 1 | 3/2/15 | | 1 | | 9/1/2014 | 12 | | |
What I'm trying to get is this:
| HiredDt | TermDt | Lic | LicDt | Med | MedDt | App |
--------------------------------------------------------------------------------------
1 | 1/31/12 | 1/31/14 | 1 | 6/1/15 | 1 | 9/1/14 | 12 |
2 | 2/19/12 | 9/18/14 | 1 | 3/2/15 | 1 | 9/1/14 | 12 |
Is this possible?
I would right-click on the cell you have labelled SUM([Count_of_Type]) and choose Insert Column - Inside Group - Right.
In that new cell I would set the expression to: = Max ( [LicDt] )