SAS delete and group by - sql

Simplified version of the dataset I have is:
DATA HAVE;
INPUT ID match1 $ match2 $ not_relevant;
DATALINES;
1 "ABC" "ABC" 4
1 "XYZ" "XYZ" 29
2 "QQQ" "AAA" 5
2 "ABC" "ABC" 9
3 "EFG" "EFG" 7
3 "DEF" "DEF" 12
3 "LMK" "LMK" 16
3 "LMK" . 29
;
RUN;
I am looking to compare match1 and match2, and if match1 does not equal match2 in any row for a given ID, I would like to remove all of the rows with that ID. So for this example dataset I want to remove all of ID 2 (rows 3 and 4), since row 3 does not have a match between match1 and match2. All I can figure out how to do so far is to delete the rows where they don't match, which isn't terribly helpful for this application. I assume it would be easier to make a new data set with some WHERE conditions, but I am unsure how to begin there. Any ideas / advice?
EDIT:
Apologies, I dumbed down my dataset too much and forgot about an important exception. Note in my new dataset (I only added one row to the end). I do NOT want to delete group 3, since match2 is blank. I only want to delete a group where match2 is not blank and match1 does not equal match2.
Thanks

There are a few ways to do this. One would be to just construct a dataset of IDs that have non-matching rows, then do a merge or a SQL join and remove anything that matched this list.
However, my preferred option (partly because of speed, but also it's more straightforward once you understand how it works) is the DoW loop.
data want;
id_nonmatch = 0;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if match1 ne match2 and not missing(match2) then id_nonmatch = 1; *set the flag to 1 if we find a nonmatch (a blank match2 does not count);
end;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if id_nonmatch = 0 then output;
end;
run;
There are two set statements on the data step, each of which runs through the same dataset separately. If it doesn't make sense, throw a put _all_; inside each of the do loops - that will show you what it's doing. The first loop goes over all of the rows for one ID, checks if any violate the constraint, and if none do, the flag variable (id_nonmatch) stays 0. If one does, it becomes a 1 (and stays that way). Then, when it hits an ID boundary, it stops pulling records from the first set statement, and goes onto the second - re-pulling those same rows. Now, it outputs only when the flag is a zero.
This is very efficient because of buffering - unless your id groups are very large, the data step may be able to use buffers to keep the same rows in memory and not have to reread them from disk. (This will depend on your disk and buffers - and seems to help much less on flash than on physical disks [since there is not the additional benefit of the disk head not having to move] - so your mileage may vary here.)
Just to show this difference, here is a log showing that there isn't much additional time needed for the second read when the record is reasonably sized. This benefit is smaller when the record is very small - I imagine there is more overhead involved. Note that the second read adds only about 15% of the time of the first read to the total processing time (2.74 seconds versus 2.37 seconds)!
69 data have;
70 call streaminit(7);
71 length strvar $1000;
72 do id = 1 to 100000;
73 do iter = 1 to 50;
74 x = rand('Uniform');
75 output;
76 end;
77 end;
78 run;
NOTE: Variable strvar is uninitialized.
NOTE: The data set WORK.HAVE has 5000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 5.20 seconds
cpu time 5.20 seconds
79
80
81 data _null_;
82 do _n_ = 1 by 1 until (last.id);
83 set have;
84 by id;
85 end;
86 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.37 seconds
cpu time 2.37 seconds
87
88
89 data _null_;
90 do _n_ = 1 by 1 until (last.id);
91 set have;
92 by id;
93 end;
94 do _n_ = 1 by 1 until (last.id);
95 set have;
96 by id;
97 end;
98 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.74 seconds
cpu time 2.73 seconds
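The first approach mentioned in this answer (collect the IDs that have a non-matching row, then exclude them) might look like the sketch below. The table names bad_ids and want_sql are made up for illustration, and the missing(match2) test is there to honor the exception described in the question's edit.

```sas
proc sql;
  /* IDs with at least one row where match2 is non-blank and differs from match1 */
  create table bad_ids as
  select distinct id
  from have
  where match1 ne match2 and not missing(match2);

  /* keep every row whose ID is not on that list */
  create table want_sql as
  select *
  from have
  where id not in (select id from bad_ids);
quit;
```

This reads the data twice as well, but leaves the buffering to PROC SQL rather than controlling it in the data step.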

It is easy to do this with an SQL query with a GROUP BY and HAVING clause.
proc sql;
create table want as
select *
from have
group by id
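having max( (match1 ne match2) and not missing(match2)) = 0
;
quit;
SAS evaluates boolean expressions as 1/0 for TRUE/FALSE so the MAX() of a series of TRUE/FALSE values will be TRUE if ANY of them are TRUE. Requiring the MAX() to be 0 therefore keeps only the IDs in which no row has a non-blank match2 that differs from match1.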
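To see the 1/0 behaviour directly, a throwaway data step like the following can be run. The variable names differ and same are only for illustration.

```sas
data _null_;
  differ = ('QQQ' ne 'AAA');   /* the comparison is true  */
  same   = ('ABC' ne 'ABC');   /* the comparison is false */
  put differ= same=;           /* writes differ=1 same=0 to the log */
run;
```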

Related

SAS computing multiple new variables from one row

I have a dataset as listed below:
ID-----V1-----V2------V3
01------5------3-------7
02------3------8-------5
03------6------9-------1
and I want to calculate 3 new variables (ERR_CODE, ERR_DETAIL, ERR_ID) according to behavior of certain columns.
If V1 is greater than 4 then ERR_CODE = A and ERR_DETAIL = "Out of range" and ERR_ID = [ID]_A
If V2 is greater than 4 then ERR_CODE = B and ERR_DETAIL = "Check Log" and ERR_ID = [ID]_B
If V3 is greater than 4 then ERR_CODE = C and ERR_DETAIL = "Fault" and ERR_ID = [ID]_C
Desired output table be like
ID-----ERR_CODE----ERR_DETAIL---------ERR_ID
01--------A--------Out of range---------01_A
01--------C--------Fault----------------01_C
02--------B--------Check Log------------02_B
02--------C--------Fault----------------02_C
03--------A--------Out of range---------03_A
03--------B--------Check Log------------03_B
I am using SAS 9.3 with EG 5.1. I have tried do-loops, arrays, if statements and case-when's, but each of them naturally skips to the next row once a condition is met. I want to evaluate the other conditions that are met for each row as well.
I have managed to do it by creating separate tables for each condition and then merging them. But that doesn't seem an effective way if there are many conditions to work with.
My question is: how can I evaluate all of the met conditions for each ID at once, without calculating them separately? The output table's row count will be greater than the input's, as expected, but I have not been able to achieve that with case-when, if, etc.
Thanks in advance and sorry if i am not clear.
Just use IF/THEN/DO blocks. Add an OUTPUT statement to write a new observation for each error.
data have ;
input ID $ V1-V3;
cards;
01 5 3 7
02 3 8 5
03 6 9 1
;
data want;
set have;
length ERR_CODE $1 ERR_DETAIL $20 ERR_ID $10 ;
if v1>4 then do;
err_code='A'; err_detail="Out of range"; err_id=catx('_',id,err_code);
output;
end;
if v2>4 then do;
err_code='B'; err_detail="Check Log"; err_id=catx('_',id,err_code);
output;
end;
if v3>4 then do;
err_code='C'; err_detail="Fault"; err_id=catx('_',id,err_code);
output;
end;
drop v1-v3 ;
run;
Results:
Obs ID ERR_CODE ERR_DETAIL ERR_ID
1 01 A Out of range 01_A
2 01 C Fault 01_C
3 02 B Check Log 02_B
4 02 C Fault 02_C
5 03 A Out of range 03_A
6 03 B Check Log 03_B
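If the number of conditions grows, the same idea can be driven by arrays instead of one IF block per variable. This is only a sketch: the dataset name want_array and the temporary arrays codes and details are hypothetical, and it assumes every check uses the same "greater than 4" rule as in the question.

```sas
data want_array;
  set have;
  length ERR_CODE $1 ERR_DETAIL $20 ERR_ID $10;
  array v{3} v1-v3;
  /* lookup values, in the same order as v1-v3 */
  array codes{3}   $1  _temporary_ ('A' 'B' 'C');
  array details{3} $20 _temporary_ ('Out of range' 'Check Log' 'Fault');
  do i = 1 to dim(v);
    if v{i} > 4 then do;
      err_code   = codes{i};
      err_detail = details{i};
      err_id     = catx('_', id, err_code);
      output;   /* one output row per condition that is met */
    end;
  end;
  drop i v1-v3;
run;
```

Adding a fourth check then means extending the three arrays rather than copying another IF block.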

Retrieve data by row with lag function

Good morning.
I've this dataset:
Appendix | Change_Serial_Number| Status | Duration | Mileage | Service
20101234 0 . 60 120000 Z
20101234 1 Proposed 48 110000 Z
20101234 2 Activated 24 90000 Z
20101234 3 Proposed 60 120000 Z
20101234 4 Proposed 50 160000 B
20101234 5 Activated 36 110000 B
Each row is a variation that is either activated or only proposed. The first row, with a blank status, is the initial condition.
I need to have this table:
Appendix | Change_Serial_Number| Status | Duration | Mileage | Service |Duration_Prev| Mileage_prev |
20101234 0 . 60 120000 Z . .
20101234 1 Proposed 48 110000 Z 60 120000
20101234 2 Activated 24 90000 Z 60 120000
20101234 3 Proposed 60 120000 Z 24 90000
20101234 4 Proposed 50 160000 B 24 90000
20101234 5 Activated 36 110000 B 24 90000
I need to compare the duration, mileage and service of each variation with those of the previously activated variation, or with the initial condition if no variation has been activated yet.
I tried the lag function to retrieve data from the previous row, but I need to retrieve three fields, and only from the last activated variation or, if there is none, from the initial condition.
I used this code:
proc sort data=db_rdg;
by Appendix Change_Serial_Number descending Change_Serial_Number;
run;
data db_rdg2;
set db_rdg;
by Appendix;
Duration_prev=lag(Duration);
if first.Appendix then Duration_prev =.;
run;
With this code, I can only retrieve data from the previous row (not from the previously activated row or from the initial condition), and only for the duration variable (not for duration, mileage and service at the same time).
I hope I have been clear enough :)
Thank you for your help!
The lag() function is only really useful for working with values from a specific number of observations earlier. In this case, you don't know whether the values you want to work with are from the previous observation or from five or six observations earlier, so instead of using lag(), you should RETAIN the additional variables and update their values when appropriate:
data db_rdg2;
retain duration_prev .;
set db_rdg;
by Appendix;
if first.Appendix or status = 'Activated' then duration_prev = duration;
run;
The RETAIN statement allows duration_prev to retain its value as each new observation is read from the input, instead of being reset to missing.
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214163.htm
Instead of using LAG to retrieve the duration from the prior row, you will want to store the values from the last activated state (duration, mileage and serial number) in tracking variables that are retained and updated after an explicit output.
In these two sample codes I tossed in tracking serial as you may want to know # of changes from prior activate.
data have; input
Appendix Change_Serial_Number Status $ Duration Mileage Service $;
datalines;
20101234 0 . 60 120000 Z
20101234 1 Proposed 48 110000 Z
20101234 2 Activated 24 90000 Z
20101234 3 Proposed 60 120000 Z
20101234 4 Proposed 50 160000 B
20101234 5 Activated 36 110000 B
run;
* NOTE: _APA suffix means # prior activate;
* NOTE: Status is read with the default character length of 8 so Activated is stored as Activate;
* version 1;
* implicit loop with by group processing means ;
* explicit first. test needed in order to reset the apa tracking variables;
data want;
set have;
by appendix;
if first.appendix then do;
length csn_apa dur_apa mil_apa 8;
call missing(csn_apa, dur_apa, mil_apa);
end;
output;
if status in (' ' 'Activate') then do;
csn_apa = change_serial_number;
dur_apa = duration;
mil_apa = mileage;
end;
retain csn_apa dur_apa mil_apa;
run;
* version 2;
* DOW version;
* explicit loop over group means first. handling not explicitly needed;
* implicit loop performs tracking variable resets;
* retain not needed because output and tracking variables modified;
* within current iteration of implicit loop;
data want2;
do until (last.appendix);
set have;
by appendix;
output;
if status in (' ' 'Activate') then do;
csn_apa = change_serial_number;
dur_apa = duration;
mil_apa = mileage;
end;
end;
run;

SAS - Advanced querying

I have one SQL table with the data in SAS. The first column is a datetime, and there is one row for each second. The set spans for about 20 minutes. The other columns contain integer values.
Here is what I need:
For example, Let's pick 50. How many times did the integer value go from below 50 to above 50 and stay above 50 for at least n seconds.
Is it possible to conduct such analysis with proc sql? If yes, how so, and if not, how else?
I am new to SAS, so any help is appreciated. Let me know if you need more info!
Thanks!
How many times did the integer value go from below/above 50
I think this could be a solution to the first part of the question. It is probably best handled by comparing the current value with the prior one. The step below counts how often the value drops from 50 or above to below 50:
data begin; /*Some test data...*/
input int_in_question;
datalines;
51
51
49
55
55
40
40
60
40
;
run;
data With_calc;
set begin;
prev_value = lag(int_in_question); /* call LAG on every row so its queue stays in step */
if int_in_question < 50 and prev_value >= 50
then Times_below_50+1;
run;
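For the second part of the question (crossing from below to above and then staying above for at least n seconds), one possible data step sketch follows. It assumes one row per second, as the question states, so n rows correspond to n seconds. The macro variables threshold and n, the dataset name counted, and the variables run_len and crossings are all made up for illustration. A value exactly equal to the threshold is treated as "not above", and a run that is already above the threshold on the first row is not counted, because no upward crossing was observed.

```sas
%let threshold = 50;  /* value to cross (from the question) */
%let n = 2;           /* hypothetical minimum number of seconds to stay above */

data counted;
  set begin;
  retain run_len 0;
  prev_value = lag(int_in_question);   /* called on every row */

  if int_in_question > &threshold then do;
    /* a run starts only on an observed upward crossing */
    if _n_ > 1 and prev_value <= &threshold then run_len = 1;
    else if run_len > 0 then run_len = run_len + 1;
    /* count the crossing once, at the moment the run reaches n rows */
    if run_len = &n then crossings + 1;
  end;
  else run_len = 0;   /* the run is broken */
run;
```

The running total is in crossings on the last row of counted. With the test data above and n = 2, only the run 55, 55 that follows 49 qualifies, since the later 60 lasts a single row.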

Sum Previous Years rows in SAS, 3 groups

I would like to Sum, for each Codinv and Class, values of previous years listed in column D.
Thank you Rigerta. Here is my New Request. Now that I think about that, when there is just one row per CodInv per Class, it should show the same value as D. Hence, I would like a new column to be calculated as follows
Codinv Class year D NewColumn
----------------------------------------------------------
13 C08F 1977 5 5
76 C01B 1999 1 1
76 C21D 2005 2 2
76 C23C 1998 2 2
76 C23C 1999 2 4
I would change the code as follows, but it still does not work
As I read online, I tried with
data Want;
set Have;
by Codinv Class year;
retain NewColumn;
if first.Class then NewColumn=D; output;
if last.year NewColumn=NewColumn+D;
run;
It worked well with another analysis I had to do where I sorted by Codinv and Year only, now that I am doing it with three I tried different variations, but it is showing missing data for all rows or 0... Can you help me out? Forever Grateful
You're close with your attempt; I've modified it to produce the desired output. A summary of the changes I've made:
Removed the retain statement. The method I've used adopts an automatic retain, so isn't necessary.
Initialise NewColumn to 0 each time the class changes
Add D to NewColumn for each row. (x+y, as used here, creates an implied retain)
Removed the output statement. This is implied at the end of the data step, so isn't needed.
Removed the if last.year... line as it is not necessary
Strictly speaking, having year in the by statement isn't necessary, but it is useful to keep in to ensure the data is sorted properly.
data have;
input Codinv Class $ year D;
datalines;
13 C08F 1977 5
76 C01B 1999 1
76 C21D 2005 2
76 C23C 1998 2
76 C23C 1999 2
;
run;
data Want;
set Have;
by Codinv Class year;
if first.Class then NewColumn=0;
newcolumn+D;
run;

retrieve data with a proc print or a sql query or else

I'm looking for advice on that one. Some context before.
I have the following table on SAS. There are 711 observations and many more variables. Below is a sample from my table.
date col1 col2 col3
jun14 0 0 0
may14 1 0 2
apr14 1 0 3
The table has no index, no primary key , nothing.
The result I'm aiming for is to get, for a specific date, all the values of the columns.
For example, for May 14, I would have
date col1 col2 col3
may14 1 0 2
apr14 1 0 3
I'm running the following SQL query on it
proc sql;
select * from mytable where date < (input('may14',MONYY5.));
As you can imagine, the query is heavy when you have many variables and many observations. The query started 50 minutes ago and it is still running.
I also thought about using a proc print
proc print data=mytable;
var date col1 col2 col3;
where date = (input('may14',MONYY5.));
run;
So here is my question.
Is there another way to get my results than through this query or the proc print? Do I need a data step or a transpose? If I use a transpose, things would look different (see below).
date jun14 may14 apr14
col1 0 1 1
col2 0 0 0
col3 0 2 3
Thanks in advance for your insight.
Aren't you just missing QUIT; after the select statement to end PROC SQL? I can't believe this could take so long for 711 records.
EDIT:
So the problem is in creating and displaying the output directly in the SAS windows.
Below is the log from my test, which uses PROC PRINTTO to direct the output to a text file.
It takes less than 10 seconds.
The size of the file is 100MB for 1000 records and 10000 variables.
Obviously, it would make more sense to output the data to some other format.
Also, what's the use of presenting thousands of values to the user?
114 data mytable ;
115 format date date9.;
116 array var {10000};
117 do i=1 to 10000;
118 var(i)=i;
NOTE: The array var has the same name as a SAS-supplied or user-defined function. Parentheses
following this name are treated as array references and not function references.
119 end;
120 do i=1 to 1000;
121 date = i;
122 output;
123 end;
124 run;
NOTE: The data set WORK.MYTABLE has 1000 observations and 10002 variables.
NOTE: DATA statement used (Total process time):
real time 0.13 seconds
cpu time 0.12 seconds
125
126 ods html close;* no html output;
127 ods listing; *text output rather;
128
129 proc printto print="E:\sasoutput.lst";run;
NOTE: PROCEDURE PRINTTO used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
129! * output to file;
130 proc sql;
131 select * from mytable where date < '1may2014'd;
132 ;
133 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 9.35 seconds
cpu time 8.57 seconds
134
135 %put &SQLOBS;
1000
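Following the point above that rendering thousands of values is the slow part, a minimal sketch that stores the subset in a dataset instead of printing it is shown here. The table name subset is hypothetical, and the final PROC PRINTTO step simply routes procedure output back to its default destination after the redirect used in the test.

```sas
/* store the subset instead of rendering it */
proc sql;
  create table subset as
  select *
  from mytable
  where date < '01MAY2014'd;
quit;

/* route procedure output back to the default destination */
proc printto;
run;
```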