Summation/counting over overlapping values or dates with group by over ids in SQL

I am working with a SAS table in which dates are stored as numbers in the columns "entered" and "left". I have to count the days each member remained in the system. In the example below, the person with id 1 entered on 7071 and used a different product on 7075, although he remained continuously in the system from 7071 to 7083; that is, the date ranges overlap. I want to count the final duration a member stayed in the system: for id 1 it is 12 days (7071 to 7083) + 2 days (7087 to 7089) + 4 days (7095 to 7099), so the total is 18 days. (Some rows have duplicate entered and left values, but other columns, not shown here, differ, so these rows were not removed.) Since I'm working in SAS, the solution can be either a SAS data step or SAS SQL.
For member 2 the values do not overlap, so the day count is 2 days (8921 to 8923) + 5 days (8935 to 8940) = 7 days. I was able to solve this case because the days don't overlap, but for the overlapping case any suggestion, code, or advice is appreciated.
id Entered left
1 7071 7077
1 7071 7077
1 7075 7079
1 7077 7083
1 7077 7083
1 7078 7085
1 7087 7089
1 7095 7099
2 8921 8923
2 8935 8940
So the final table should be of the form
id days_in_system
1 18
2 7

This is a surprisingly tricky problem: every row has to be compared to every other row for the same id to check for overlaps, and if there are multiple overlaps you have to be very careful not to double-count them.
Here's a hash-based solution - the idea is to build up a hash containing all of the individual days a member has stayed as you go along, then count the number of items in it at the end:
data have;
input id Entered left;
cards;
1 7071 7077
1 7071 7077
1 7075 7079
1 7077 7083
1 7077 7083
1 7078 7085
1 7087 7089
1 7095 7099
2 8921 8923
2 8935 8940
;
run;
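data want;
  length day 8;
  /* Create the hash once; it is keyed on the individual day number */
  if _n_ = 1 then do;
    declare hash h();
    rc = h.definekey('day');
    rc = h.definedone();
  end;
  /* Read all rows for one id per data step iteration */
  do until(last.id);
    set have;
    by id;
    /* Add each day of the stay; days already in the hash are not added again */
    do day = entered to left - 1;
      rc = h.add();
    end;
  end;
  total_days = h.num_items; /* number of distinct days for this id */
  rc = h.clear(); /* empty the hash before the next id */
  keep id total_days;
run;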
This should be fairly light on memory as it only has to load the days for 1 id at a time.
The output from id 1 is 20, not 18 - here's a breakdown of the new days added row-by-row that I generated by adding a bit of debugging logic. If this is wrong, please indicate where:
_N_=1
7071 7072 7073 7074 7075 7076
_N_=2
No new days
_N_=3
7077 7078
_N_=4
7079 7080 7081 7082
_N_=5
No new days
_N_=6
7083 7084
_N_=7
7087 7088
_N_=8
7095 7096 7097 7098
_N_=1
8921 8922
_N_=2
8935 8936 8937 8938 8939
If you want to add only days for rows matching a particular condition, you can pick those using a where clause on the set statement, e.g.
set have(where = (var1 in ('value1', 'value2', ...)));
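As a cross-check on the day counts, the same totals can be reached without enumerating individual days by sorting each id's intervals and merging the ones that overlap. The following is a minimal Python sketch of that alternative, not the hash approach above. It assumes, like the `do day = entered to left - 1` loop, that the `left` day itself is not counted; the function name `days_in_system` is hypothetical.

```python
from collections import defaultdict

def days_in_system(rows):
    """Total non-overlapping days per id from (id, entered, left) rows."""
    by_id = defaultdict(list)
    for member_id, entered, left in rows:
        by_id[member_id].append((entered, left))

    totals = {}
    for member_id, intervals in by_id.items():
        intervals.sort()  # order by start day so overlaps are adjacent
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or touching interval: extend the current run
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current run and start a new one
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start  # close the final run
        totals[member_id] = total
    return totals

rows = [
    (1, 7071, 7077), (1, 7071, 7077), (1, 7075, 7079), (1, 7077, 7083),
    (1, 7077, 7083), (1, 7078, 7085), (1, 7087, 7089), (1, 7095, 7099),
    (2, 8921, 8923), (2, 8935, 8940),
]
print(days_in_system(rows))  # {1: 20, 2: 7}
```

On the sample data this agrees with the hash result: 20 days for id 1 (the row 7078 to 7085 extends the first stay to 7085) and 7 days for id 2.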

Related

Oracle PL/SQL to Update Data with the 2nd minimum values between two dates

I have the following problem related to industrial pump readings. A pump usually has a meter that records the volume of material processed by that specific pump. Sometimes the meter needs to be replaced with an entirely new meter (meter reading starts at 0) or an old working meter (meter reading can be more than 0). I have a dataset that keeps the maintenance record of the pump with meter readings.
The only indication of a meter change is data in the OLD_METER_READING column; otherwise it is blank.
In the ideal scenario the data looks like the following:
PUMP_NO INSPECTION_DATE MAINTENANCE_TASK METER_READING OLD_METER_READING TOTAL_PUMP_LIFE
11 11-AUG-2000 A 12489 12489
11 14-JUL-2001 B 14007 14007
11 03-SEP-2002 Y 0 14007 14007
11 03-SEP-2002 C 0 14007 14007
11 03-SEP-2002 B 0 14007 14007
11 04-JUN-2003 A 1200 16007
11 21-DEC-2003 A 8000 22007
11 23-FEB-2004 Y 0 10000 24007
11 26-MAY-2004 B 10 24017
11 26-MAY-2004 P 20 24027
11 26-MAY-2004 R 300 24307
11 04-OCT-2004 B 2312 26319
11 31-MAR-2005 A 2889 26896
11 06-NOV-2006 V 5000 29007
11 14-JUL-2008 T 0 7000 31007
However, in many cases the pump technician makes a mistake in logging METER_READING during a meter change, so the data may end up looking like:
PUMP_NO INSPECTION_DATE MAINTENANCE_TASK METER_READING OLD_METER_READING TOTAL_PUMP_LIFE
11 11-AUG-2000 A 12489 12489
11 14-JUL-2001 B 14007 14007
11 03-SEP-2002 Y 0 14007 14007
11 03-SEP-2002 C 0 14007 14007
11 03-SEP-2002 B 0 14007 14007
11 04-JUN-2003 A 1200 16007
11 21-DEC-2003 A 8000 22007
11 23-FEB-2004 Y 0 10000 24007
11 26-MAY-2004 B 10000 34007
11 26-MAY-2004 P 10000 34007
11 26-MAY-2004 R 10000 34007
11 04-OCT-2004 B 2312 26319
11 31-MAR-2005 A 2889 26896
11 06-NOV-2006 V 5000 29007
11 14-JUL-2008 T 0 7000 31007
The mistake in the 2nd set of data is that on 26-MAY-2004 the technician, rather than logging the actual METER_READING, used the last METER_READING from the old meter as the new METER_READING. However, the correct METER_READING was logged again from 04-OCT-2004. We have numerous occasions where, for a specific pump (PUMP_NO), an erroneous METER_READING was entered in the database after a meter change event. This also creates wrong and confusing values for TOTAL_PUMP_LIFE.
So, to correct the data, we want to add another column to the table and update the table with an Oracle procedure that checks the METER_READING field with the following logic:
Check the data between two subsequent meter change events (for example, in this case, between the 1st meter change on 03-SEP-2002 and the 2nd meter change on 23-FEB-2004, and again between the 2nd meter change on 23-FEB-2004 and the 3rd meter change on 14-JUL-2008).
If, within any of these periods, METER_READING on an earlier date is higher than METER_READING on a later date, then update the higher METER_READING with the 2nd lowest value in that period (0 and 2312 are the 2 lowest, so update with 2312).
So the period between the first 2 meter changes will pass and no update will be necessary. However, in the 2nd set of data, all the values (10000) in the METER_READING column for 26-MAY-2004 will be updated with the value 2312.
I am not sure how to write PL/SQL to compare the values between two events, or how to update the value on an earlier date (if a higher value is found in the METER_READING column) with a lower value from that period.
Database: Oracle SQL 11g
So in looking at your problem, I don't know that you need to resort to PL/SQL. The following query should help you identify which records are in need of updating:
SELECT m.*,
MIN(meter_reading)
OVER (PARTITION BY m.pump_no
ORDER BY m.inspection_date
RANGE BETWEEN NVL((SELECT min(n.inspection_date)-m.inspection_date
FROM maintenance n
WHERE n.inspection_date > m.inspection_date),
0) FOLLOWING
AND NVL((SELECT min(n.inspection_date)-m.inspection_date-1
FROM maintenance n
WHERE n.old_meter_reading IS NOT NULL
AND n.inspection_date > m.inspection_date),
0) FOLLOWING) AS MIN_READING_FOLLOWING
FROM maintenance m
ORDER BY m.inspection_date, old_meter_reading ASC NULLS LAST;
I created a SQLFiddle to demonstrate the query. (Link)
The analytic MIN function is looking at all rows between the next date a read was performed AND the next meter change to see if any of them have a value which is less than the current read.
You could use this as part of an update statement. As for TOTAL_PUMP_LIFE, it might be easiest to recalculate that after you've corrected the meter_readings as part of a separate operation.
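To make the window logic easier to follow, here is a minimal Python sketch of the same idea for a single pump: for each row, take the lowest reading on any later date before the next meter change, and use it if the current reading is higher. The function name `correct_readings` and the dictionary keys are hypothetical. The sketch also assumes that, when no later meter change exists, the period runs to the end of the data, which may differ from how the query above treats that case.

```python
from datetime import date

def correct_readings(rows):
    """rows: dicts for one pump, sorted by date, with keys
    'date', 'reading' and 'old_reading' (None unless a meter change)."""
    # Dates on which a meter change was recorded
    change_dates = sorted({r["date"] for r in rows if r["old_reading"] is not None})
    corrected = []
    for r in rows:
        # First meter change strictly after this row, if any
        next_change = next((d for d in change_dates if d > r["date"]), None)
        # Original readings on later dates, up to (not including) that change
        later = [
            o["reading"] for o in rows
            if o["date"] > r["date"]
            and (next_change is None or o["date"] < next_change)
        ]
        reading = r["reading"]
        if later and reading > min(later):
            reading = min(later)  # a later, lower reading exposes the error
        corrected.append({**r, "reading": reading})
    return corrected

rows = [
    {"date": date(2004, 2, 23), "reading": 0, "old_reading": 10000},
    {"date": date(2004, 5, 26), "reading": 10000, "old_reading": None},
    {"date": date(2004, 5, 26), "reading": 10000, "old_reading": None},
    {"date": date(2004, 5, 26), "reading": 10000, "old_reading": None},
    {"date": date(2004, 10, 4), "reading": 2312, "old_reading": None},
    {"date": date(2005, 3, 31), "reading": 2889, "old_reading": None},
    {"date": date(2006, 11, 6), "reading": 5000, "old_reading": None},
    {"date": date(2008, 7, 14), "reading": 0, "old_reading": 7000},
]
print([r["reading"] for r in correct_readings(rows)])
```

On the erroneous rows from the question, the three 10000 readings logged on 26-MAY-2004 become 2312 and every other reading is left unchanged.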
Edit 1: Adding PL/SQL to make updates
DECLARE
CURSOR c_readings IS
SELECT m.*,
MIN(meter_reading)
OVER (PARTITION BY m.pump_no
ORDER BY m.inspection_date
RANGE BETWEEN NVL((SELECT min(n.inspection_date)-m.inspection_date
FROM maintenance n
WHERE n.inspection_date > m.inspection_date),
0) FOLLOWING
AND NVL((SELECT min(n.inspection_date)-m.inspection_date-1
FROM maintenance n
WHERE n.old_meter_reading IS NOT NULL
AND n.inspection_date > m.inspection_date),
0) FOLLOWING) AS MIN_READING_FOLLOWING
FROM maintenance m
ORDER BY m.inspection_date, old_meter_reading ASC NULLS LAST;
BEGIN
FOR rec IN c_readings LOOP
IF rec.meter_reading > rec.min_reading_following THEN
UPDATE maintenance m
SET m.meter_reading = rec.min_reading_following
WHERE m.pump_no = rec.pump_no
AND m.inspection_date = rec.inspection_date
AND m.maintenance_task = rec.maintenance_task;
END IF;
END LOOP;
END;
/
You'll need to either COMMIT when this is done or add it to the code.
Maybe what you need to do is something like this:
update MyTable mt1
set value = (select min(value)
from MyTable2 mt2
where mt1.id = mt2.id --your relation
and value NOT IN (select min(value)
from MyTable2 mt3
where mt2.id = mt3.id))
With this update you get the second-lowest value: the NOT IN excludes the original minimum, so the outer MIN returns the next value up.

SQL complex grouping "in column"

I have a table with 3 columns (sorted by the first two):
letter
number (sorted for each letter)
difference between current number and previous number of the same letter
I'd like to calculate (with vanilla SQL) a fourth column, RESULT, that groups these data: a new group starts whenever the third column (the difference between contiguous records; e.g. for #2, 4 = 5 - 1) is greater than 30, and all the records of the group are marked with the letter and number of its first record (e.g. A1 for #1, #2, #3).
Since the difference between contiguous numbers only makes sense for records with the same letter, the difference for the first record of a new letter is set to 31 (meaning that it starts a new group; e.g. #6).
Here is what I'd like to get as result:
# Letter Number Difference RESULT (new column)
1 A 1 1 A1
2 A 5 4 A1
3 A 7 2 A1
4 A 40 33 A40 (*)
5 A 43 3 A40
6 B 1 31 B1 (*)
7 B 25 24 B1
8 B 27 2 B1
9 B 70 43 B70 (*)
10 B 75 5 B70
Now I can only find the "breaking values" (*) with this query where they get a value of 1:
select letter
,number
,cast(difference/30 as int) break
from table
where cast(difference/30 as int) = 1
Even though I'm able to find these breaking values I can't finish my task.
Can anyone help me finding a way to obtain the column RESULT?
Thanks in advance
FF
As I understand you need to construct the last result column. You can use concat to do that:
SELECT letter
,number
,concat(letter, cast(difference/30 as int)) result
FROM table
HAVING result = 'A1'
After some experimenting and a little help from a friend of mine, I've found a possible solution to my SQL problem.
The only requirement for the solution is that my first record must have a value of 31 in the Difference field (since I need a "break" whenever Difference is greater than 30).
Here is the query to get the column RESULT I needed:
select alls.letter
,alls.number
,ints.letter||ints.number as result
from competition.lag alls
,(select letter
,number
,difference
,result
from (select letter
,number
,difference
,case when difference>30 then 1 else 2 end as result
from competition.lag
) temp
where result = 1
) ints
where ints.letter=alls.letter
and alls.number>=ints.number
and alls.number-30<=ints.number
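For comparison, the grouping rule from the question can be written as a single pass over the sorted rows: start a new group whenever the letter changes or the difference exceeds 30, and label every row with the letter and number of the row that opened its group. This is a minimal Python sketch of that rule, not a translation of the query above; `label_groups` and the `threshold` parameter are hypothetical names.

```python
def label_groups(rows, threshold=30):
    """rows: (letter, number, difference) tuples sorted by letter, number."""
    labelled = []
    current_letter, label = None, None
    for letter, number, difference in rows:
        # A new letter or a jump above the threshold opens a new group
        if letter != current_letter or difference > threshold:
            label = f"{letter}{number}"
            current_letter = letter
        labelled.append((letter, number, difference, label))
    return labelled

rows = [
    ("A", 1, 1), ("A", 5, 4), ("A", 7, 2), ("A", 40, 33), ("A", 43, 3),
    ("B", 1, 31), ("B", 25, 24), ("B", 27, 2), ("B", 70, 43), ("B", 75, 5),
]
for row in label_groups(rows):
    print(row)
```

Because the sketch checks for a change of letter explicitly, it does not depend on the first record of a letter carrying a difference of 31.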

SAS - keeping observations 6 months before and after a service date

I have a health care data set in which a subject has one row of data for each month they are enrolled, with a variable indicating the month of active enrollment. Therefore, if someone is enrolled for 12 months, they will have 12 rows in the set. They also have a variable for service date, giving the exact date they received a service.
I need to select the 6 consecutive months of enrollment before and the 6 consecutive months of enrollment after the service date. The specific days of the month are irrelevant. What is important is only the month and year of the service and enrollment month.
Here is what my data looks like:
service_dt MemberID enroll_month
11May2010 1 01Nov2009
11May2010 1 01Dec2009
11May2010 1 01Jan2010
11May2010 1 01Feb2010
11May2010 1 01Mar2010
11May2010 1 01Apr2010
11May2010 1 01May2010
11May2010 1 01Jun2010
11May2010 1 01Jul2010
15Jun2010 2 01Jun2010
15Jun2010 2 01Aug2010
So, for member 1 we see that the service was in May, so I need to select November 2009 through November 2010 IF the months are consecutive. For member 2, service was in June, but enrollment skips from June to August: July is not an enrollment month, so I would need to throw out member 2 from my final cohort.
You want records for the first six months after the earliest enroll month. You can get all members that meet this criterion by doing a join. Because of the SQL tag, I am assuming that you want this as a SQL statement:
select d.memberid
from data d join
(select memberid, min(year(service_dt) * 12 + month(service_dt)) as enroll_ym
from data
group by memberid
) dym
on d.memberid = dym.memberid and
year(d.service_dt) * 12 + month(d.service_dt) between dym.enroll_ym and dym.enroll_ym + 5
group by d.memberid
having count(distinct month(d.service_dt)) = 6;
To get the original rows, you would join back to the original data.
I took the advice of @Joe and separated my data into two sets: before the service date and after the service date. I then followed code also supplied by Joe in answer to a previous question I asked, modifying it only slightly.
/* This code will focus on the months before the service date.*/
data eligibility_before2;
set eligibility_before;
by memberid descending monthid;
if first.memberid then counter = 0;
if dif(monthid) < -1 and mod(monthid, 100) ne 12 then counter = 0;
if mod(monthid, 100) eq 12 and dif(monthid) ne -89 then counter = 0;
counter+1;
if counter = 6 then output;
run;
/*This code will focus on enrollment months after the service date*/
data eligibility_after2;
set eligibility_after;
by memberid monthid;
if first.memberid then counter = 0;
if dif(monthid) > 1 and mod(monthid, 100) ne 1 then counter = 0;
if mod(monthid, 100) eq 1 and dif(monthid) ne 89 then counter = 0;
counter+1;
if counter = 6 then output;
run;
After this point simply merge the data sets back together, specifying that the memberid must occur in BOTH data sets to be included in the final set.
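The underlying check can also be stated directly: convert each date to a month index (year * 12 + month) and require every month from six before the service month to six after it to be present. The following is a minimal Python sketch of that check under the question's rule that only month and year matter; `month_index` and `has_full_window` are hypothetical names, and member 1's enrollment months are extended through November 2010 purely for illustration, since the sample in the question stops at July 2010.

```python
from datetime import date

def month_index(d):
    """Map a date to a running month number so consecutive months differ by 1."""
    return d.year * 12 + d.month

def has_full_window(service_dt, enroll_months, span=6):
    """True if every month from `span` before to `span` after the
    service month appears among the enrollment months."""
    enrolled = {month_index(m) for m in enroll_months}
    centre = month_index(service_dt)
    return all(m in enrolled for m in range(centre - span, centre + span + 1))

# Member 1: Nov 2009 through Nov 2010 (illustrative extension of the sample)
member1 = [date(2009 + (10 + i) // 12, (10 + i) % 12 + 1, 1) for i in range(13)]
# Member 2: June and August 2010 only, as in the question
member2 = [date(2010, 6, 1), date(2010, 8, 1)]

print(has_full_window(date(2010, 5, 11), member1))  # True
print(has_full_window(date(2010, 6, 15), member2))  # False
```

A gap anywhere in the window, such as member 2's missing July, makes the check fail, which matches the requirement to drop that member from the cohort.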

create variable for unique sessions

I have some data about when, how long, and what channel people are listening to the radio. I need to make a variable called sessions that groups all entries which occur while the radio is on. Because the data may contain some errors I would like to say that if less than five minutes passes from the end of one channel period to the next then it is still the same session. Hopefully a brief example will clarify.
obs Entry_date Entry_time duration(in secs) channel
1 01/01/12 23:25:21 6000 2
2 01/03/12 01:05:64 300 5
3 01/05/12 12:12:35 456 5
4 01/05/12 16:45:21 657 8
I want to create the variable sessions so that
obs Entry_date Entry_time duration(in secs) channel session
1 01/01/12 23:25:21 6000 2 1
2 01/03/12 01:05:64 300 5 1
3 01/05/12 12:12:35 456 5 2
4 01/05/12 16:45:21 657 8 3
To define a session I need to use entry_time (and the date, if a session runs from 11pm into the next morning), so that if entry_time + duration + 5 minutes < entry_time of the next channel, the session changes. This has been killing me; simple arrays won't do the trick, or at least my attempt using arrays has not worked. Thanks in advance.
Aside from the comments I made in the OP, here's how I would do it using a SAS data step. I've changed the date and time values for row 2 to what I suspect they should be (in order to get the same result as in the OP). This avoids having to perform a self join, which is likely to be performance intensive on a large dataset.
I've used the DIF and LAG functions, so care needs to be taken if you're adding in extra code (particularly IF statements).
data have;
input entry_date :mmddyy10. entry_time :time. duration channel;
format entry_date date9. entry_time time.;
datalines;
01/01/2012 23:25:21 6000 2
01/02/2012 01:05:54 300 5
01/05/2012 12:12:35 456 5
01/05/2012 16:45:21 657 8
;
run;
data want;
set have;
by entry_date entry_time; /* put in to check data is sorted correctly */
retain session 1; /* initialise session with value 1 */
session+(dif(dhms(entry_date,0,0,entry_time))-lag(duration)>300); /* increment session by 1 if time difference > 5 minutes */
run;
hopefully I got your requirements right!
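The rule applied in the data step (start a new session when more than five minutes pass between the end of one entry and the start of the next) can be sketched in a few lines of Python. This is only an illustration of the rule, using the corrected row 2 values from the data step above; `assign_sessions` and `max_gap` are hypothetical names, and the entries are assumed to be sorted by start time.

```python
from datetime import datetime, timedelta

def assign_sessions(entries, max_gap=timedelta(minutes=5)):
    """entries: (start_datetime, duration_in_seconds) tuples sorted by start."""
    sessions = []
    session = 1
    prev_end = None
    for start, duration_secs in entries:
        # A gap longer than max_gap after the previous entry ends opens a new session
        if prev_end is not None and start - prev_end > max_gap:
            session += 1
        sessions.append(session)
        prev_end = start + timedelta(seconds=duration_secs)
    return sessions

entries = [
    (datetime(2012, 1, 1, 23, 25, 21), 6000),
    (datetime(2012, 1, 2, 1, 5, 54), 300),
    (datetime(2012, 1, 5, 12, 12, 35), 456),
    (datetime(2012, 1, 5, 16, 45, 21), 657),
]
print(assign_sessions(entries))  # [1, 1, 2, 3]
```

The first entry ends at 01:05:21 on 2 January, 33 seconds before the second begins, so both fall in session 1; the remaining gaps are far longer than five minutes.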
Since you need to base result on adjoining rows, there is a need to join a table to itself.
The Session #s are not consecutive, but you should get the point.
create table #temp
(obs int not null,
entry_date datetime not null,
duration int not null,
channel int not null)
--obs Entry_date Entry_time duration(in secs) channel
insert #temp
select 1, '01/01/12 23:25:21', 6000, 2
union all select 2, '01/03/12 01:05:54', 300, 5
union all select 3, '01/05/12 12:12:35', 456, 5
union all select 4, '01/05/12 16:45:21', 657, 8
select a.obs,
a.entry_date,
a.duration,
--duration is in seconds, so it is added with ss; the 5-minute grace period uses mi
endSession = dateadd(mi,5,dateadd(ss,a.duration,a.entry_date)),
a.channel,
b.entry_date,
minOverlapping = datediff(mi,b.entry_date,
dateadd(mi,5,dateadd(ss,a.duration,a.entry_date))),
anotherSession = case
when dateadd(mi,5,dateadd(ss,a.duration,a.entry_date))<b.entry_date
then b.obs
else a.obs end
from #temp a
left join #temp b on a.obs = b.obs - 1
hope this helps a bit

MS Access Update with Increment of Prior Record

I have an MS Access 2007 database that I need to create an update for. The table I am trying to update looks like this:
CarID WeekOf NumDataPoints NumWksZeroPoints
3AA May-14-2011 23 0
7BB May-14-2011 9 0
3AA May-21-2011 35 0
7BB May-21-2011 0 1
3AA May-28-2011 24
7BB May-28-2011 0
I am processing the latest recordset, for May-28-2011, and the gist is to update each car with the number of weeks it has had no data points. I do this by checking the current week's number of points: if it has some points, then NumWksZeroPoints gets set to zero, and if the current number of points is zero, then I take the prior week's count and increment it by one. For my last week I would have input
0
2
So I have tried something like
UPDATE tblCars
SET NumWksZeroPoints = IIF(NumDataPoints<>0, 0, (SELECT MAX(NumWksZeroPoints) AS wzp
FROM tblCars AS f
WHERE f.CarID=tblCars.CarID AND
f.WeekEnding=#5/21/2011#) + 1
)
WHERE WeekOf=#5/28/2011#;
Unfortunately this doesn't work like I thought it would. I think I have the concept down and most of the SQL; I just can't seem to make it work. This is against MS Access, so some of the other tricks I know just don't work. Any help is appreciated.
You could (and some might say should) do this as a query, without updating the table. If you are capturing the datapoints per week per car, your query can compute the number of weeks a car has had no data points using date math. What happens if someone inserts data for a car after you have run your update? You end up with data that are inconsistent.
Using your sample data I ran the following
UPDATE tblcar AS c
INNER JOIN tblcar AS previous
ON c.carid = previous.carid
SET c.numwkszeropoints = Iif([previous].[NumWksZeroPoints] = 0, 0,
[previous].[NumWksZeroPoints] + 1)
WHERE c.weekof =#5/28/2011#
AND previous.weekof =#5/21/2011#;
The table afterwards looked like this
CarID WeekOf NumDataPoints NumWksZeroPoints
----- ---------- ------------- -----------------
3AA 05/14/2011 23 0
7BB 05/14/2011 9 0
3AA 05/21/2011 35 0
7BB 05/21/2011 0 1
3AA 05/28/2011 24 0
7BB 05/28/2011 0 2
Basically, the query does a self join back to the previous week and then updates the current week to the previous week's value + 1 if it's not zero.
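For reference, the rule as stated in the question keys on the current week's NumDataPoints: the counter resets to zero when the week has points and otherwise continues from the prior week's count. The following is a minimal Python sketch of that rule computed over the full history, in the spirit of the suggestion to derive the value rather than store it; `zero_point_weeks` is a hypothetical name, and the records are assumed to be in chronological order within each car.

```python
def zero_point_weeks(records):
    """records: (car_id, week_of, num_data_points) tuples, chronological per car.
    Returns the records with a running count of consecutive zero-point weeks."""
    streak = {}  # current run of zero-point weeks per car
    out = []
    for car_id, week_of, points in records:
        # Reset on any week with data; otherwise extend the previous run
        streak[car_id] = 0 if points != 0 else streak.get(car_id, 0) + 1
        out.append((car_id, week_of, points, streak[car_id]))
    return out

records = [
    ("3AA", "2011-05-14", 23), ("7BB", "2011-05-14", 9),
    ("3AA", "2011-05-21", 35), ("7BB", "2011-05-21", 0),
    ("3AA", "2011-05-28", 24), ("7BB", "2011-05-28", 0),
]
for row in zero_point_weeks(records):
    print(row)
```

On the sample data this gives 0 for 3AA and 2 for 7BB in the week of May-28-2011, matching the values the question expects.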