Retrieve data by row with lag function - SQL

Good morning.
I have this dataset:
Appendix | Change_Serial_Number | Status    | Duration | Mileage | Service
20101234 | 0                    | .         | 60       | 120000  | Z
20101234 | 1                    | Proposed  | 48       | 110000  | Z
20101234 | 2                    | Activated | 24       | 90000   | Z
20101234 | 3                    | Proposed  | 60       | 120000  | Z
20101234 | 4                    | Proposed  | 50       | 160000  | B
20101234 | 5                    | Activated | 36       | 110000  | B
Each row is a variation that can be either activated or only proposed; the reference for each row is the first row (blank status) or, once one exists, the previously activated variation.
I need to have this table:
Appendix | Change_Serial_Number | Status    | Duration | Mileage | Service | Duration_Prev | Mileage_Prev
20101234 | 0                    | .         | 60       | 120000  | Z       | .             | .
20101234 | 1                    | Proposed  | 48       | 110000  | Z       | 60            | 120000
20101234 | 2                    | Activated | 24       | 90000   | Z       | 60            | 120000
20101234 | 3                    | Proposed  | 60       | 120000  | Z       | 24            | 90000
20101234 | 4                    | Proposed  | 50       | 160000  | B       | 24            | 90000
20101234 | 5                    | Activated | 36       | 110000  | B       | 24            | 90000
I need to compare the duration, mileage and service of each variation with the previously activated one, or with the initial condition if no variation has been activated yet.
I tried the lag function to retrieve data from the previous row, but I need to retrieve three fields at once, and only from the last activated variation or, if there is none, from the initial condition.
I used this code:
proc sort data=db_rdg;
by Appendix Change_Serial_Number;
run;
data db_rdg2;
set db_rdg;
by Appendix;
Duration_prev=lag(Duration);
if first.Appendix then Duration_prev=.;
run;
With this code, I can retrieve data only from the previous row (not from the previously activated row or from the initial condition), and only for the duration variable (not for duration, mileage and service at the same time).
I hope I have been clear enough :)
Thank you for your help!

The lag() function is only really useful for working with values from a specific number of observations earlier. In this case, you don't know whether the values you want to work with are from the previous observation or from five or six observations earlier, so instead of using lag(), you should RETAIN the additional variables and update their values when appropriate:
data db_rdg2;
retain duration_prev;
set db_rdg;
by Appendix;
if first.Appendix then duration_prev = .;
output; /* write the row first, so Activated rows still show the prior value */
if first.Appendix or status = 'Activated' then duration_prev = duration;
run;
The RETAIN statement allows duration_prev to keep its value as each new observation is read from the input, instead of being reset to missing; updating it only after the explicit OUTPUT means each row, including the Activated ones, is written with the value from the prior activation (or from the initial condition).
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214163.htm
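Since the question asks to carry three fields, not just the duration, here is a minimal sketch extending the same RETAIN-and-update idea to mileage and service as well (variable names assumed from the sample data):
data db_rdg2;
retain duration_prev mileage_prev service_prev;
set db_rdg;
by Appendix;
if first.Appendix then call missing(duration_prev, mileage_prev, service_prev);
output; /* write the row before refreshing, so each row sees the prior activation */
if first.Appendix or status = 'Activated' then do;
duration_prev = duration;
mileage_prev = mileage;
service_prev = service;
end;
run;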

Instead of using LAG to retrieve the duration from the prior row, you will want to store the prior-activation values (duration, mileage and serial number) in tracking variables that are retained and updated after an explicit OUTPUT.
In these two samples I also track the serial number, as you may want to know the number of changes since the prior activation.
data have;
input Appendix Change_Serial_Number Status $ Duration Mileage Service $;
datalines;
20101234 0 . 60 120000 Z
20101234 1 Proposed 48 110000 Z
20101234 2 Activated 24 90000 Z
20101234 3 Proposed 60 120000 Z
20101234 4 Proposed 50 160000 B
20101234 5 Activated 36 110000 B
run;
* NOTE: the _APA suffix means 'at prior activation';
* version 1;
* implicit loop with BY-group processing means an explicit first. test;
* is needed in order to reset the APA tracking variables;
data want;
set have;
by appendix;
if first.appendix then do;
length csn_apa dur_apa mil_apa 8;
call missing(csn_apa, dur_apa, mil_apa);
end;
output;
if status in (' ' 'Activate') then do; /* list input gives Status length 8, so 'Activated' is stored truncated as 'Activate' */
csn_apa = change_serial_number;
dur_apa = duration;
mil_apa = mileage;
end;
retain csn_apa dur_apa mil_apa;
run;
* version 2;
* DOW version;
* explicit loop over group means first. handling not explicitly needed;
* implicit loop performs tracking variable resets;
* retain not needed because output and tracking variables modified;
* within current iteration of implicit loop;
data want2;
do until (last.appendix);
set have;
by appendix;
output;
if status in (' ' 'Activate') then do; /* list input gives Status length 8, so 'Activated' is stored truncated as 'Activate' */
csn_apa = change_serial_number;
dur_apa = duration;
mil_apa = mileage;
end;
end;
run;
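If the number of changes since the prior activation is wanted (the reason csn_apa is tracked), a hypothetical follow-up step can difference the serial numbers:
data want3; /* hypothetical dataset name */
set want;
/* rows before the first activation get a missing count */
changes_since_apa = change_serial_number - csn_apa;
run;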

Related

SAS delete and group by

A simplified version of the dataset I have:
DATA HAVE;
INPUT ID match1 $ match2 $ not_relevant;
DATALINES;
1 "ABC" "ABC" 4
1 "XYZ" "XYZ" 29
2 "QQQ" "AAA" 5
2 "ABC" "ABC" 9
3 "EFG" "EFG" 7
3 "DEF" "DEF" 12
3 "LMK" LMK" 16
3 "LMK" . 29
;RUN;
I am looking to compare match1 and match2, and if anywhere within an ID match1 does not equal match2, I would like to remove all of the rows with that ID. So for this example dataset I want to remove all of ID 2 (rows 3 and 4), since row 3 does not have a match between match1 and match2. All I can figure out how to do so far is to delete the rows where they don't match, which isn't terribly helpful for this application. I assume it would be easier to build a new dataset with some WHERE conditions, but I am unsure how to begin. Any ideas / advice?
EDIT:
Apologies, I dumbed down my dataset too much and forgot about an important exception; note that in my new dataset I only added one row to the end. I do NOT want to delete group 3, since its match2 is blank. I only want to delete a group where match2 is not blank and match1 does not equal match2.
Thanks
There are a few ways to do this. One would be to just construct a dataset of IDs that have non-matching rows, then do a merge or an SQL join and remove anything that matches this list.
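That first approach might look like this (a sketch; the dataset name bad_ids is made up, and it honors the edit that a blank match2 doesn't count as a mismatch):
proc sql;
create table bad_ids as
select distinct id
from have
where match1 ne match2 and not missing(match2)
order by id;
quit;

data want;
merge have(in=inh) bad_ids(in=inb);
by id;
if inh and not inb; /* keep only IDs with no offending row */
run;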
However, my preferred option (partly because of speed, but also because it's more straightforward once you understand how it works) is the DoW loop.
data want;
id_nonmatch = 0;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if match1 ne match2 and not missing(match2) then id_nonmatch = 1; *flag real mismatches only - a blank match2 does not count;
end;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if id_nonmatch = 0 then output;
end;
run;
There are two set statements in the data step, each of which runs through the same dataset separately. If it doesn't make sense, throw a put _all_; inside each of the do loops - that will show you what it's doing.
The first loop goes over all of the rows for one ID and checks whether any violate the constraint. If none do, the flag variable (id_nonmatch) stays 0; if one does, it becomes 1 (and stays that way). Then, when it hits an ID boundary, it stops pulling records from the first set statement and moves on to the second - re-pulling those same rows. Now it outputs only when the flag is zero.
This is very efficient because of buffering - unless your id groups are very large, the data step may be able to use buffers to keep the same rows in memory and not have to reread them from disk. (This will depend on your disk and buffers - and seems to help much less on flash than on physical disks [since there is not the additional benefit of the disk head not having to move] - so your mileage may vary here.)
Just to show this difference, here is a log showing that there isn't much additional time needed for the second read - when the record is reasonably sized. This benefit is less when the record is very small - I imagine there is more overhead involved. Note that the second read adds only 1/7 of the time of the first read to the total processing time!
69 data have;
70 call streaminit(7);
71 length strvar $1000;
72 do id = 1 to 100000;
73 do iter = 1 to 50;
74 x = rand('Uniform');
75 output;
76 end;
77 end;
78 run;
NOTE: Variable strvar is uninitialized.
NOTE: The data set WORK.HAVE has 5000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 5.20 seconds
cpu time 5.20 seconds
79
80
81 data _null_;
82 do _n_ = 1 by 1 until (last.id);
83 set have;
84 by id;
85 end;
86 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.37 seconds
cpu time 2.37 seconds
87
88
89 data _null_;
90 do _n_ = 1 by 1 until (last.id);
91 set have;
92 by id;
93 end;
94 do _n_ = 1 by 1 until (last.id);
95 set have;
96 by id;
97 end;
98 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.74 seconds
cpu time 2.73 seconds
It is also easy to do this with an SQL query using GROUP BY and HAVING clauses.
proc sql;
create table want as
select *
from have
group by id
having max( (match1 ne match2) and not missing(match2) ) = 0
;
quit;
SAS evaluates boolean expressions as 1/0 for TRUE/FALSE, so the MAX() of a series of TRUE/FALSE values is TRUE if ANY of them are TRUE. Keeping only the groups where that MAX is 0 removes every ID that has a real mismatch.
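Outside SAS, where boolean expressions aren't automatically 1/0, a hypothetical ANSI-style equivalent spells the flag out with CASE (assuming blanks arrive as NULLs):
select *
from have
where id in (
  select id
  from have
  group by id
  having max(case when match1 <> match2 and match2 is not null
             then 1 else 0 end) = 0
);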

How to write a basic SQL function to match values with a maximum and minimum range of values?

If I have a data set like the following:
type | min | max
-----------------
a    | 25  | 30
b    | 20  | 30
c    | 15  | 20
My goal is to match an input with a type, and to do that while taking into account that my types have overlapping values.
So let's say I have an input in my system that is 25, and I want to match my input to a type (either a, b, or c). My input is most likely b, since the average of the min and max of b is 25, and could possibly be a, but that is less likely. I've tried implementing this and have had no luck, and have also thought of using p-values, but am not sure how I can do it.
What would be the best way to implement this?
Something like this fits your description:
select t.*
from t
where ? >= min and ? <= max
order by abs( ? - (min + max) / 2 )
fetch first 1 row only;
This identifies the ranges where the value matches. It then chooses the range whose midpoint is closest to the value.
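For instance, against the sample rows (table and column names assumed; min/max renamed because they are reserved words in many dialects):
create table t (type char(1), min_val int, max_val int);
insert into t values ('a', 25, 30), ('b', 20, 30), ('c', 15, 20);

-- input 25: a and b both match; the midpoints are 27.5 and 25.0,
-- so b comes out on top
select t.*
from t
where 25 between min_val and max_val
order by abs(25 - (min_val + max_val) / 2.0)
fetch first 1 row only;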

find closest match within a vector to fill missing values using dplyr

A dummy dataset is:
data <- data.frame(
group = c(1,1,1,1,1,2),
dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01","2004-08-01",
"2005-03-01","2010-02-01")),
value = c(10,20,NA,40,NA,5)
)
For each group, the missing values need to be filled with the non-missing value corresponding to the nearest date within same group. In case of a tie, pick any.
I am using dplyr. which.closest from the birk package needs a vector and a value; how can I look this up within a vector without writing loops? Even an SQL solution will do.
Any pointers to the solution?
Maybe something like: value = value[match(which.closest(dates,THISdate) & !is.na(value))]
Not sure how to specify THISdate.
Edit: The expected value vector should look like:
value = c(10,20,20,40,10,5)
Using knn1 (nearest neighbor) from the class package (which comes with R -- you don't need to install it) and dplyr, define an na.knn1 function which replaces each NA value in x with the non-NA x value having the closest time.
library(class)
na.knn1 <- function(x, time) {
is_na <- is.na(x)
if (sum(is_na) == 0 || all(is_na)) return(x)
train <- matrix(time[!is_na])
test <- matrix(time[is_na])
cl <- x[!is_na]
x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
x
}
data %>% mutate(value = na.knn1(value, dates))
giving:
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
Add an appropriate group_by if the intention was to do this by group.
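For instance, a per-group version might look like this (a sketch reusing the same na.knn1 defined above):
library(dplyr)
data %>%
  group_by(group) %>%
  mutate(value = na.knn1(value, dates)) %>%
  ungroup()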
You can try sapply to find the closest values, since the x argument of which.closest only takes a single value.
First create a vector vect in which the dates with missing values are replaced by NA, and use it within the which.closest function.
library(birk)
vect=replace(data$dates,which(is.na(data$value)),NA)
transform(data,value=value[sapply(dates,which.closest,vec=vect)])
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
If which.closest accepted a vector, there would be no need for sapply, but that is not the case.
Using the dplyr package:
library(birk)
library(dplyr)
data %>% mutate(vect = `is.na<-`(dates, is.na(value)),
                value = value[sapply(dates, which.closest, vect)]) %>%
  select(-vect)

Sum Previous Years rows in SAS, 3 groups

I would like to sum, for each Codinv and Class, the values of previous years listed in column D.
Thank you Rigerta. Here is my new request: now that I think about it, when there is just one row per Codinv per Class, it should show the same value as D. Hence, I would like a new column calculated as follows:
Codinv Class year D NewColumn
-----------------------------
13     C08F  1977 5 5
76     C01B  1999 1 1
76     C21D  2005 2 2
76     C23C  1998 2 2
76     C23C  1999 2 4
Based on what I read online, I would change the code as follows, but it still does not work:
data Want;
set Have;
by Codinv Class year;
retain NewColumn;
if first.Class then NewColumn=D; output;
if last.year NewColumn=NewColumn+D;
run;
It worked well in another analysis where I sorted by Codinv and year only; now that I am doing it with three variables I have tried different variations, but it shows missing data or 0 for all rows... Can you help me out? Forever grateful!
You're close with your attempt; I've modified it to produce the desired output. A summary of the changes I've made:
Removed the retain statement. The method used here creates an automatic retain, so it isn't necessary.
Initialise NewColumn to 0 each time the Class changes.
Add D to NewColumn for each row (the sum statement, as used here, creates an implied retain).
Removed the output statement. Output is implied at the end of the data step, so it isn't needed.
Removed the if last.year... line, as it is not necessary.
Strictly speaking, having year in the by statement isn't necessary, but it is useful to keep it in to ensure the data is sorted properly.
data have;
input Codinv Class $ year D;
datalines;
13 C08F 1977 5
76 C01B 1999 1
76 C21D 2005 2
76 C23C 1998 2
76 C23C 1999 2
;
run;
data Want;
set Have;
by Codinv Class year;
if first.Class then NewColumn=0;
newcolumn+D;
run;

Daemon to monitor query and send mail conditionally in SQL Server

I've been melting my brain over a peculiar request: execute a certain query every two minutes and, if it returns rows, send an e-mail with them. This was already done and delivered, so far so good. The result set of the query looks like this:
+----+---------------------+
| ID | last_update         |
+----+---------------------+
| 21 | 2011-07-20 13:03:21 |
| 32 | 2011-07-20 13:04:31 |
| 43 | 2011-07-20 13:05:27 |
| 54 | 2011-07-20 13:06:41 |
+----+---------------------+
The trouble starts when the user asks me to modify the solution so that, e.g., the first time ID 21 is caught being more than 5 minutes old, the e-mail is sent to a particular set of recipients; the second time, when ID 21 is between 5 and 10 minutes old, another set of recipients is chosen. So far it's ok. The gotcha for me is from the third time onwards: the e-mails are now to be sent each half-hour, instead of every five minutes.
How should I keep track of the status of Mr. ID = 43? How would I know if he has already received one e-mail, two or three? And how do I ensure that from the third e-mail onwards the mails go out each half-hour, instead of the usual 5 minutes?
I get the impression that you think this can be solved with a simple mathematical formula. And it probably can be, as long as your system is reliable.
Every thirty minutes can be seen as 360 degrees, or 2 pi radians, on a harmonic function graph. That makes 12 degrees = 1 minute. Let's take cosine, for instance:
f(x) = cos(x)
f(x) = cos(elapsedMinutes * 12 degrees)
Where elapsedMinutes is the time since the first 30-minute update was due to go out. This should be a constant number of minutes added to the value of last_update.
Since you have a two-minute window of error, it will be time to transmit the 30-minute update if the value of f(x) above is greater than the value you would get one minute before or after the scheduled update, which is cos(1 * 12 degrees) = 0.9781476007338056379285667478696.
Bringing it all together, it's time to send a thirty-minute update if this SQL expression is true:
COS(RADIANS(12.0 * DATEDIFF(minute,
DATEADD(minute, constantNumberOfMinutesBetweenSecondAndThirdUpdate, last_update),
CURRENT_TIMESTAMP))) > 0.9781476007338056379285667478696
If you need a wider window than exactly two minutes, just lower this number slightly.
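As a sanity check, here is a hypothetical T-SQL probe of the window test at a few elapsed times (threshold shortened for readability; only multiples of 30 minutes should fire):
-- elapsed = minutes since the first half-hour update was due
select m.elapsed,
       cos(radians(12.0 * m.elapsed)) as f_x,
       case when cos(radians(12.0 * m.elapsed)) > 0.97814760
            then 'send' else 'skip' end as decision
from (values (0), (15), (28), (30), (32), (45), (60)) as m(elapsed);
-- only 0, 30 and 60 fall inside the one-minute window, so only they 'send'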