Suppose we've got the following dataset:
DATE VAR1 VAR2
1 A 1
2 A 1
3 B 1
4 C 2
5 D 3
6 E 4
7 F 5
8 B 6
9 B 7
10 D 1
Each record belongs to a person, but a single person can have more than one record, with different values.
To identify a person: two records that share the same VAR1 belong to the same person, and so do two records that share the same VAR2.
My objective is to create a new variable, IDPERSON, that uniquely identifies the person for each record. In my example there are only 4 different people:
DATE VAR1 VAR2 IDPERSON
1 A 1 1
2 A 1 1
3 B 1 1
4 C 2 2
5 D 3 1
6 E 4 3
7 F 5 4
8 B 6 1
9 B 7 1
10 D 1 1
How could I achieve this by using SQL or SAS?
%macro grouper(
inData /*Input dataset*/,
outData /*output dataset*/,
id1 /*First identification variable (must be numeric)*/,
id2 /*Second identification variable*/,
idOut /*Name of variable to contain group ID*/,
maxN = 5 /*Max number of iterations in case of failure*/);
/* Assign an ID to each distinct connected graph in a network */
/* Create first guess for group ID */
data _g_temp;
set &inData.;
&idOut. = &id1.;
run;
/* Loop, improve group ID each time*/
%let i = 1;
%do %while (&i. <= &maxN.);
%put Loop number &i.;
%let i = %eval(&i. + 1);
proc sql noprint;
/* Find the lowest group ID for each group of first variable */
create table _g_map1 as
select
min(&idOut.) as &idOut.,
&id1.
from _g_temp
group by &id1.;
/* Find the lowest group ID for each group of second variable */
create table _g_map2 as
select
min(&idOut.) as &idOut.,
&id2.
from _g_temp
group by &id2.;
/* Find the lowest group ID from both grouping variables */
create table _g_new as
select
a.&id1.,
a.&id2.,
coalesce(min(b.&idOut., c.&idOut.), a.&idOut.) as &idOut.,
a.&idOut. as &idOut._old
from _g_temp as a
full outer join _g_map1 as b
on a.&id1. = b.&id1.
full outer join _g_map2 as c
on a.&id2. = c.&id2.;
/* Put results into temporary dataset ready for next iteration */
create table _g_temp as
select *
from _g_new;
/* Check if the iteration provided any improvement */
select
min(
case when &idOut._old = &idOut. then 1
else 0
end) into :stopFlag
from _g_temp;
quit;
/* End loop if ID unchanged over last iteration */
%if &stopFlag. %then %let i = %eval(&maxN. + 1);
%end;
/* Output lookup table */
proc sql;
create table &outData. as
select
&id1.,
min(&idOut.) as &idOut.
from _g_temp
group by &id1.;
quit;
/* Clean up */
proc datasets nolist;
delete _g_:;
quit;
%mend grouper;
DATA baseData;
INPUT VAR1 VAR2 $;
CARDS;
1 A
1 A
1 B
2 C
3 D
4 E
5 F
6 B
7 B
1 D
1 X
7 G
6 Y
6 D
6 I
8 D
9 Z
9 X
;
RUN;
%grouper(
baseData,
outData,
VAR1,
VAR2,
groupID);
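To trace what the macro's loop does, here is a minimal Python sketch of the same minimum-ID propagation, run on the baseData rows above. The function name `propagate_min_ids` and the `max_passes` parameter are illustrative assumptions, not part of the macro; the sketch only mirrors the three steps (minimum per first variable, minimum per second variable, take the smaller of the two) and the stop-when-unchanged test.

```python
def propagate_min_ids(rows, max_passes=20):
    """rows: list of (id1, id2) pairs with numeric id1.

    Returns (ids, passes): the final group ID per row and the number
    of passes executed, including the pass that detected no change.
    """
    ids = [id1 for id1, _ in rows]            # first guess: group ID = id1
    for n in range(1, max_passes + 1):
        min1, min2 = {}, {}
        for (id1, id2), g in zip(rows, ids):
            min1[id1] = min(g, min1.get(id1, g))   # lowest ID per id1 value
            min2[id2] = min(g, min2.get(id2, g))   # lowest ID per id2 value
        new_ids = [min(min1[a], min2[b]) for a, b in rows]
        if new_ids == ids:                     # no improvement: done
            return ids, n
        ids = new_ids
    raise RuntimeError("no convergence within max_passes")


base_data = [
    (1, "A"), (1, "A"), (1, "B"), (2, "C"), (3, "D"), (4, "E"),
    (5, "F"), (6, "B"), (7, "B"), (1, "D"), (1, "X"), (7, "G"),
    (6, "Y"), (6, "D"), (6, "I"), (8, "D"), (9, "Z"), (9, "X"),
]

ids, passes = propagate_min_ids(base_data)
# Lookup table keyed by the first variable, like the macro's output dataset.
lookup = dict(zip((id1 for id1, _ in base_data), ids))
print(lookup)
```

Each pass can only lower an ID, so the loop settles on the smallest starting ID within each connected group; long chains of links need more passes, which is why the macro exposes maxN.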
Do you think this will work?
It's written in SAS, but it uses SQL statements.
DATA TEMP3;
INPUT VAR1 VAR2 $ DATE;
CARDS;
1 A 1
1 A 2
1 B 3
2 C 4
3 D 5
4 E 6
5 F 7
6 B 8
7 B 9
1 D 10
;
RUN;
PROC SQL;
CREATE TABLE WORK.TEMP4 AS SELECT DISTINCT VAR2, VAR1 FROM WORK.TEMP3 ORDER BY VAR2, VAR1;
CREATE TABLE WORK.TEMP5 AS SELECT DISTINCT VAR1, VAR2 FROM WORK.TEMP3 ORDER BY VAR1, VAR2;
CREATE TABLE WORK.TEMP6 AS SELECT TEMP4.VAR2, TEMP4.VAR1, TEMP5.VAR2 AS VAR22 FROM WORK.TEMP4 INNER JOIN WORK.TEMP5 ON (TEMP4.VAR1=TEMP5.VAR1);
CREATE TABLE WORK.TEMP7 AS SELECT TEMP6.*, TEMP5.VAR1 AS VAR12 FROM WORK.TEMP6 INNER JOIN WORK.TEMP5 ON (TEMP6.VAR2=TEMP5.VAR2);
CREATE TABLE WORK.TEMP8 AS SELECT DISTINCT VAR22, VAR12 FROM WORK.TEMP7 ORDER BY VAR22, VAR12;
CREATE TABLE WORK.TEMP9 AS SELECT VAR22, MAX(VAR12) AS VAR12 FROM WORK.TEMP8 GROUP BY VAR22;
CREATE TABLE WORK.TEMP10 AS SELECT TEMP8.* FROM WORK.TEMP8 INNER JOIN WORK.TEMP9 ON (TEMP8.VAR22=TEMP9.VAR22 AND TEMP8.VAR12=TEMP9.VAR12);
CREATE TABLE WORK.TEMP11 AS SELECT TEMP3.*, TEMP10.VAR12 AS IDPERSONA FROM WORK.TEMP3 LEFT JOIN WORK.TEMP10 ON (TEMP3.VAR2=TEMP10.VAR22);
QUIT;
I've broken this problem down into a few steps, which work for the data you've supplied. There's probably a way to reduce the number of steps, at the expense of readability. Let me know if this works for your real data.
/* create input dataset */
data have;
input DATE VAR1 $ VAR2;
datalines;
1 A 1
2 A 1
3 B 1
4 C 2
5 D 3
6 E 4
7 F 5
8 B 6
9 B 7
10 D 1
;
run;
/* calculate min VAR2 per VAR1 */
proc summary data=have nway idmin;
class var1;
output out=minvar2 (drop=_:) min(var2)=temp_var;
run;
/* add in min VAR2 data */
proc sql;
create table temp1 as select
a.*,
b.temp_var
from have as a
inner join
minvar2 as b
on a.var1 = b.var1
order by b.temp_var;
quit;
/* create idperson variable */
data want;
set temp1;
by temp_var;
if first.temp_var then idperson+1;
drop temp_var;
run;
/* sort back to original order */
proc sort data=want;
by date var1;
run;
Keith:
Your solution does not work properly. Take a look at the following dataset:
DATA TEMP3;
INPUT VAR2 VAR1 $ DATE;
DUMMY=1;
CARDS;
1 A 1
1 A 2
1 B 3
2 C 4
3 D 5
4 E 6
5 F 7
6 B 8
7 B 9
1 D 10
1 X 11
7 G 14
6 Y 15
6 D 16
6 I 18
8 D 20
9 Z 21
9 X 22
;
RUN;
Your program's result is:
VAR2 VAR1 DATE DUMMY idperson
1 A 1 1 1
1 A 2 1 1
1 B 3 1 1
2 C 4 1 2
3 D 5 1 1
4 E 6 1 3
5 F 7 1 4
6 B 8 1 1
7 B 9 1 1
1 D 10 1 1
1 X 11 1 1
7 G 14 1 6
6 Y 15 1 5
6 D 16 1 1
6 I 18 1 5
8 D 20 1 1
9 Z 21 1 7
9 X 22 1 1
This is not correct, since the VAR2=6 records have two different ids.
This is what I've done. The whole program (not posted here) is more complex (and not so elegant), since it deals with missing data in VAR1 and VAR2.
PROC SQL;
CREATE TABLE WORK.TEMP4 AS SELECT DISTINCT VAR1, VAR2 FROM WORK.TEMP3 WHERE DUMMY=1 AND VAR2^=. ORDER BY VAR1, VAR2;
CREATE TABLE WORK.TEMP5 AS SELECT DISTINCT VAR2, VAR1 FROM WORK.TEMP3 WHERE DUMMY=1 AND VAR2^=. ORDER BY VAR2, VAR1;
CREATE TABLE WORK.TEMP6 AS SELECT TEMP4.*, TEMP5.VAR1 AS CIP2 FROM WORK.TEMP4 INNER JOIN WORK.TEMP5 ON (TEMP4.VAR2=TEMP5.VAR2);
CREATE TABLE WORK.TEMP7 AS SELECT TEMP6.*, TEMP4.VAR2 AS IDHH2 FROM WORK.TEMP6 INNER JOIN WORK.TEMP4 ON (TEMP6.VAR1=TEMP4.VAR1);
CREATE TABLE WORK.TEMP8 AS SELECT DISTINCT IDHH2, CIP2 FROM WORK.TEMP7;
CREATE TABLE WORK.TEMP9 AS SELECT TEMP7.*, TEMP8.CIP2 AS CIP3 FROM WORK.TEMP7 INNER JOIN WORK.TEMP8 ON (TEMP7.IDHH2=TEMP8.IDHH2);
CREATE TABLE WORK.TEMP10 AS SELECT TEMP9.*, TEMP8.IDHH2 AS IDHH3 FROM WORK.TEMP9 INNER JOIN WORK.TEMP8 ON (TEMP9.CIP3=TEMP8.CIP2);
CREATE TABLE WORK.TEMP11 AS SELECT DISTINCT VAR1, IDHH3 AS VAR2 FROM WORK.TEMP10 ORDER BY VAR1, IDHH3;
CREATE TABLE WORK.TEMP12 AS SELECT VAR1, MAX(VAR2) AS VAR2 FROM WORK.TEMP11 GROUP BY VAR1;
CREATE TABLE WORK.TEMP13 AS SELECT TEMP11.* FROM WORK.TEMP11 INNER JOIN WORK.TEMP12 ON (TEMP11.VAR1=TEMP12.VAR1 AND TEMP11.VAR2=TEMP12.VAR2);
CREATE TABLE WORK.TEMP14 AS SELECT TEMP3.*, TEMP13.VAR2 AS IDPERSONA FROM WORK.TEMP3 LEFT JOIN WORK.TEMP13 ON (TEMP3.VAR1=TEMP13.VAR1);
CREATE TABLE WORK.TEMP15 AS SELECT DISTINCT VAR2, IDPERSONA FROM WORK.TEMP14 WHERE VAR2^=. AND IDPERSONA^=.;
CREATE TABLE WORK.TEMP16 AS SELECT TEMP14.*, TEMP15.IDPERSONA AS IDPERSONA2 FROM WORK.TEMP14 LEFT JOIN WORK.TEMP15 ON (TEMP14.VAR2=TEMP15.VAR2) ORDER BY DATE;
QUIT;
DATA TEMP16;
SET TEMP16;
IF IDPERSONA=. THEN IDPERSONA=IDPERSONA2;
DROP IDPERSONA2;
RUN;
And the right results:
VAR2 VAR1 DATE DUMMY IDPERSONA
1 A 1 1 9
1 A 2 1 9
1 B 3 1 9
2 C 4 1 2
3 D 5 1 9
4 E 6 1 4
5 F 7 1 5
6 B 8 1 9
7 B 9 1 9
1 D 10 1 9
1 X 11 1 9
7 G 14 1 9
6 Y 15 1 9
6 D 16 1 9
6 I 18 1 9
8 D 20 1 9
9 Z 21 1 9
9 X 22 1 9
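The grouping rule amounts to finding the connected components of a graph whose nodes are the VAR1 and VAR2 values. As a cross-check of the table above, here is a hedged union-find sketch in Python. `assign_person_ids` is a hypothetical helper, and it numbers groups in order of first appearance, so its labels (1 to 4) differ from the IDPERSONA values (9, 2, 4, 5) even though the grouping is the same.

```python
def assign_person_ids(records):
    """records: list of (var1, var2) pairs. Returns one group number per record."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Tag each value with its column so a VAR1 value never collides with a VAR2 value.
    for var1, var2 in records:
        union(("VAR1", var1), ("VAR2", var2))

    numbering = {}
    result = []
    for var1, _ in records:
        root = find(("VAR1", var1))
        if root not in numbering:
            numbering[root] = len(numbering) + 1
        result.append(numbering[root])
    return result


temp3 = [
    ("A", 1), ("A", 1), ("B", 1), ("C", 2), ("D", 3), ("E", 4),
    ("F", 5), ("B", 6), ("B", 7), ("D", 1), ("X", 1), ("G", 7),
    ("Y", 6), ("D", 6), ("I", 6), ("D", 8), ("Z", 9), ("X", 9),
]

person_ids = assign_person_ids(temp3)
print(person_ids)
```

Unlike the iterative approach, union-find needs a single pass over the records, so there is no iteration limit to tune.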
I forgot to post my final solution. It is a SAS macro; I've made another one for 3 variables.
%MACRO GROUPER2(INDATA,OUTDATA,ID1,ID2,IDOUT,IDN=_N_,MAXN=5);
%PUT ****************************************************************;
%PUT ****************************************************************;
%PUT **** GROUPER MACRO;
%PUT **** PARAMETERS:;
%PUT **** INPUT DATA: &INDATA.;
%PUT **** OUTPUT DATA: &OUTDATA.;
%PUT **** FIRST VARIABLE: &ID1.;
%PUT **** SECOND VARIABLE: &ID2.;
%PUT **** OUTPUT GROUPING VARIABLE: &IDOUT.;
%IF (&IDN.=_N_) %THEN %PUT **** STARTING NUMBER VARIABLE: AUTONUMBER;
%ELSE %PUT **** STARTING NUMBER VARIABLE: &IDN.;
%PUT **** MAX ITERATIONS: &MAXN.;
%PUT ****************************************************************;
%PUT ****************************************************************;
/* CREATE FIRST GUESS FOR GROUP ID */
DATA _G_TEMP1 _G_TEMP2;
SET &INDATA.;
&IDOUT.=&IDN.;
IF &IDOUT.=. THEN OUTPUT _G_TEMP2;
ELSE OUTPUT _G_TEMP1;
RUN;
PROC SQL NOPRINT;
SELECT MAX(&IDOUT.) INTO :MAXIDOUT FROM _G_TEMP1;
QUIT;
DATA _G_TEMP2;
SET _G_TEMP2;
&IDOUT.=_N_+&MAXIDOUT.;
RUN;
DATA _G_TEMP;
SET _G_TEMP1 _G_TEMP2;
RUN;
PROC SQL;
UPDATE _G_TEMP SET &IDOUT.=. WHERE &ID1. IS NULL AND &ID2. IS NULL;
QUIT;
/* LOOP, IMPROVE GROUP ID EACH TIME*/
%LET I = 1;
%DO %WHILE (&I. <= &MAXN.);
%PUT LOOP NUMBER &I.;
%LET I = %EVAL(&I. + 1);
PROC SQL NOPRINT;
/* FIND THE LOWEST GROUP ID FOR EACH GROUP OF FIRST VARIABLE */
CREATE TABLE _G_MAP1 AS SELECT MIN(&IDOUT.) AS &IDOUT., &ID1. FROM _G_TEMP WHERE &ID1. IS NOT NULL GROUP BY &ID1.;
/* FIND THE LOWEST GROUP ID FOR EACH GROUP OF SECOND VARIABLE */
CREATE TABLE _G_MAP2 AS SELECT MIN(&IDOUT.) AS &IDOUT., &ID2. FROM _G_TEMP WHERE &ID2. IS NOT NULL GROUP BY &ID2.;
/* FIND THE LOWEST GROUP ID FROM BOTH GROUPING VARIABLES */
CREATE TABLE _G_NEW AS SELECT A.&ID1., A.&ID2., COALESCE(MIN(B.&IDOUT., C.&IDOUT.), A.&IDOUT.) AS &IDOUT.,
A.&IDOUT. AS &IDOUT._OLD FROM _G_TEMP AS A FULL OUTER JOIN _G_MAP1 AS B ON A.&ID1. = B.&ID1.
FULL OUTER JOIN _G_MAP2 AS C ON A.&ID2. = C.&ID2.;
/* PUT RESULTS INTO TEMPORARY DATASET READY FOR NEXT ITERATION */
CREATE TABLE _G_TEMP AS SELECT * FROM _G_NEW ORDER BY &ID1., &ID2.;
/* CHECK IF THE ITERATION PROVIDED ANY IMPROVEMENT */
SELECT MIN(CASE WHEN &IDOUT._OLD = &IDOUT. THEN 1 ELSE 0 END) INTO :STOPFLAG FROM _G_TEMP;
%PUT NO IMPROVEMENT? &STOPFLAG.;
QUIT;
/* END LOOP IF ID UNCHANGED OVER LAST ITERATION */
%LET ITERATIONS=%EVAL(&I. - 1);
%IF &STOPFLAG. %THEN %LET I = %EVAL(&MAXN. + 1);
%END;
%PUT ****************************************************************;
%PUT ****************************************************************;
%IF &STOPFLAG. %THEN %PUT **** LOOPING ENDED BY NO-IMPROVEMENT CRITERIA. OUTPUT FULLY GROUPED.;
%ELSE %PUT **** WARNING: LOOPING ENDED BY REACHING THE MAXIMUM NUMBER OF ITERATIONS. OUTPUT NOT FULLY GROUPED.;
%PUT **** NUMBER OF ITERATIONS: &ITERATIONS. (MAX: &MAXN.);
%PUT ****************************************************************;
%PUT ****************************************************************;
DATA &OUTDATA.;
SET _G_TEMP;
DROP &IDOUT._OLD;
RUN;
/* OUTPUT LOOKUP TABLE */
PROC SQL;
CREATE TABLE &OUTDATA._1 AS SELECT &ID1., MIN(&IDOUT.) AS &IDOUT. FROM _G_TEMP WHERE &ID1. IS NOT NULL GROUP BY &ID1. ORDER BY &ID1.;
CREATE TABLE &OUTDATA._2 AS SELECT &ID2., MIN(&IDOUT.) AS &IDOUT. FROM _G_TEMP WHERE &ID2. IS NOT NULL GROUP BY &ID2. ORDER BY &ID2.;
QUIT;
/* CLEAN UP */
PROC DATASETS NOLIST;
DELETE _G_:;
QUIT;
%MEND GROUPER2;
I need to find the min_week when the counter is 0 and that row's week value should continue until again the counter becomes 0.
Following is an example of the Output I am looking for:
With the following code, my current output is:
proc sql;
create table week_min_1 as
select t1.*, t2.week as min_week from emp_table t1
left join (select * from emp_table where counter = 0 group by emp having sequence = min(sequence)) t2
on t1.emp= t2.emp
;
quit;
Let's first create your sample data:
data have;
input Emp $ Sequence Week Counter;
datalines;
a001 1 2 0
a001 2 4 1
a001 3 8 2
a001 4 12 3
a001 5 24 0
a001 6 36 1
a001 7 48 2
a001 8 52 3
;
run;
Now we should sort our data. In this sample it is already sorted, but it's better safe than sorry.
proc sort data=have;
by Emp Sequence;
run;
With a simple if statement we identify those min_week values.
data have2;
set have;
if counter = 0 then do;
min_week = Week;
end;
run;
And with an update statement we carry those values through every row.
data want;
update have2 (obs=0) have2;
by Emp;
output;
run;
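For readers more comfortable outside SAS, the carry-forward logic of these steps can be sketched in Python as follows. This is an illustration under assumptions, not a translation of the UPDATE statement itself: `carry_min_week` is a hypothetical name, and rows that precede an employee's first counter = 0 row get `None`.

```python
def carry_min_week(rows):
    """rows: (emp, sequence, week, counter) tuples.

    Returns the rows sorted by emp and sequence, each extended with
    min_week: the week of the most recent counter == 0 row for that emp.
    """
    current = {}          # emp -> week of the latest counter == 0 row seen
    out = []
    for emp, seq, week, counter in sorted(rows, key=lambda r: (r[0], r[1])):
        if counter == 0:
            current[emp] = week
        out.append((emp, seq, week, counter, current.get(emp)))
    return out


have = [
    ("a001", 1, 2, 0), ("a001", 2, 4, 1), ("a001", 3, 8, 2), ("a001", 4, 12, 3),
    ("a001", 5, 24, 0), ("a001", 6, 36, 1), ("a001", 7, 48, 2), ("a001", 8, 52, 3),
]

want = carry_min_week(have)
for row in want:
    print(row)
```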
I just started learning SAS and realised that proc sql doesn't support window functions. As I am more at ease with SQL, I was wondering how I can simulate a sum window function in proc sql?
Desired result:
select a.active, a.store_id, a.nbr, sum(nbr) over (partition by a.store_id)
from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
;
active store_id nbr sum
0 1 8 326
1 1 318 326
0 2 7 273
1 2 266 273
Example of raw data:
select active, store_id, customer_id
from customer
limit 10;
active store_id customer_id
1 1 1
1 1 2
1 2 3
1 2 4
1 1 5
1 1 6
0 1 7
1 2 8
1 1 9
1 2 10
Current result and query:
select a.active, a.store_id, a.nbr, sum(nbr)
from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
group by a.active, a.store_id, a.nbr;
active store_id nbr sum
0 1 8 8
1 1 318 318
0 2 7 7
1 2 266 266
Unlike some SQL implementations, SAS is happy to re-merge summary statistics back onto the detail rows when you include variables that are neither GROUP BY variables nor summary statistics.
Let's convert your printouts of data into an actual dataset. And let's change one value so we have at least two values of ACTIVE to group by.
data have;
input active store_id customer_id;
cards;
1 1 1
1 1 2
1 2 3
1 2 4
1 1 5
1 1 6
0 1 7
1 2 8
1 1 9
1 2 10
;
Now we can count the records by ACTIVE and STORE_ID and then generate the report by appending the store total.
proc sql;
select active,store_id,nbr,sum(nbr) as store_nbr
from (select active,store_id,count(*) as nbr
from have
group by active,store_id
)
group by store_id
;
quit;
Resulting printout:
active store_id nbr store_nbr
---------------------------------------
0 1 1 6
1 1 5 6
1 2 4 4
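To make the re-merge explicit, here is a small Python sketch that computes the same two levels of counts separately and then attaches the store total to each (active, store_id) row. It uses the ten rows of the `have` dataset above; the variable names are illustrative assumptions.

```python
from collections import Counter

# (active, store_id, customer_id) rows from the `have` dataset above.
have = [
    (1, 1, 1), (1, 1, 2), (1, 2, 3), (1, 2, 4), (1, 1, 5),
    (1, 1, 6), (0, 1, 7), (1, 2, 8), (1, 1, 9), (1, 2, 10),
]

# Inner query: count rows per (active, store_id).
nbr = Counter((active, store) for active, store, _ in have)

# The window function's job: sum those counts per store_id.
store_nbr = Counter()
for (active, store), n in nbr.items():
    store_nbr[store] += n

# Re-merge: every detail row gets its store's total appended.
report = sorted(
    ((active, store, n, store_nbr[store]) for (active, store), n in nbr.items()),
    key=lambda r: (r[1], r[0]),
)
for row in report:
    print(row)
```

The rows come out as (active, store_id, nbr, store_nbr), ordered by store and then by active, matching the printout above.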
You can do the equivalent in proc sql by merging two sub-queries: one for the count of customers by active, store_id, and another for the total customers for each store_id.
proc sql noprint;
create table want as
select t1.active
, t1.store_id
, t1.nbr
, t2.sum
from (select active
, store_id
, count(customer_id) as nbr
from have
group by store_id, active
) as t1
LEFT JOIN
(select store_id
, count(customer_id) as sum
from have
group by store_id
) as t2
ON t1.store_id = t2.store_id
;
quit;
If you wanted to do this in a more SASsy way, you can run proc means and merge together the results from a single dataset that holds everything you need. proc means will calculate all possible combinations of your variables by default.
proc means data=have noprint;
class store_id active;
ways 1 2;
output out=want_total
n(customer_id) = total
;
run;
data want;
merge want_total(where=(_TYPE_ = 3) rename=(total = nbr) )
want_total(where=(_TYPE_ = 2) rename=(total = sum) keep=_TYPE_ store_id total)
;
by store_id;
drop _:;
run;
Or, in SQL:
proc sql;
create table want as
select t1.store_id
, t1.active
, t1.total as nbr
, t2.total as sum
from want_total as t1
LEFT JOIN
want_total as t2
ON t1.store_id = t2.store_id
where t1._TYPE_ = 3
AND t2._TYPE_ = 2
;
quit;
The _TYPE_ variable identifies the level of the analysis. For example, _TYPE_ = 1 is for active only, _TYPE_ = 2 is for store_id only, and _TYPE_ = 3 is for all combinations. You can view this in the output dataset from proc means:
store_id active _TYPE_ _FREQ_ total
. 0 1 3 3
. 1 1 7 7
1 . 2 6 6
2 . 2 4 4
1 0 3 1 1
1 1 3 5 5
2 0 3 2 2
2 1 3 2 2
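One way to remember these _TYPE_ values is to read them as a binary number with one bit per CLASS variable, where the last variable on the CLASS statement is the lowest bit. The Python sketch below illustrates that reading for `class store_id active`; `type_value` is a hypothetical helper written for this explanation, not a SAS function.

```python
def type_value(class_vars, used):
    """Return the _TYPE_ value for the subset `used` of `class_vars`.

    class_vars: CLASS variables in the order they appear on the CLASS statement.
    used: the variables that define a given summary level.
    """
    n = len(class_vars)
    # The last CLASS variable contributes 1, the one before it 2, then 4, ...
    return sum(1 << (n - 1 - i) for i, var in enumerate(class_vars) if var in used)


class_vars = ["store_id", "active"]
print(type_value(class_vars, {"active"}))              # active only
print(type_value(class_vars, {"store_id"}))            # store_id only
print(type_value(class_vars, {"store_id", "active"}))  # both variables
```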
And if you wanted faster high-performance results, check out its big sibling, proc hpsummary.
Therein lies the cool thing about SAS: You can bounce between PROCs, SQL, the DATA Step, and Python via Pandas/proc python. You can exploit the unique benefits of each of these methods and thought processes for any number of data engineering and statistics problems.
I want to assign a value in new_col based on value in column 'ind' when months = 1;
idnum1 months ind new_col
1 1 X X
1 2 X X
1 3 Y X
1 4 Y X
1 5 X X
2 1 Y Y
2 2 Y Y
2 3 X Y
2 4 X Y
2 5 X Y
The query below assigns the value only where months = 1, but I want it in every row of new_col for each id:
create table tmp as
select t1.*,
case when months = 1 then ind end as new_col
from table t1;
I am trying to do it in SAS using proc sql.
Ideally you would use RETAIN within a data step:
data want;
set have;
retain new_var;
if months=1 then new_var = ind;
run;
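SQL isn't as good at this as a data step.
But assuming your variable ID is repeated across the rows of each group, this would work. If it's not, then you really do need the data step approach.
proc sql;
create table want as
/* take ind from the months=1 row only; a plain max(ind) would pick the largest ind in the group */
select *, max(case when months=1 then ind else ' ' end) as new_col
from have
group by ID;
quit;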
EDIT: If you want to retain the first value per ID, just use FIRST.ID instead of if months=1.
data want;
set have;
by ID;
retain new_var;
if first.id then new_var = ind;
run;
A robust PROC SQL statement that handles a possibly repeated first month by choosing the lowest ind among the first-month rows and distributing it to the whole group:
data have; input
idnum1 months ind $ new_col $; datalines;
1 1 X X
1 2 X X
1 3 Y X
1 4 Y X
1 5 X X
2 1 Y Y
2 2 Y Y
2 3 X Y
2 4 X Y
2 5 X Y
3 1 Z .
3 1 Y .
3 1 X .
3 2 A .
;
proc sql;
create table want as
select
have.idnum1, months, ind, new_col, lowest_first_ind
from
have
join
( select idnum1, min(ind) as lowest_first_ind from
(
select idnum1, ind
from have
group by idnum1
having months = min(months)
)
group by idnum1
) value_seeker
on
have.idnum1 = value_seeker.idnum1
;
quit;
You can use a window function (in SQL dialects that support them; PROC SQL does not):
select t1.*,
max(case when months = 1 then ind end) over (partition by id) as new_col
from t1;
If there is only one MONTH=1 observation per BY group then just use a simple join.
create table WANT as
select t1.*,t2.ind as new_col
from table t1
left join (select idnum1,ind from table where months=1) t2
on t1.idnum1 = t2.idnum1
;
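The join above boils down to a lookup: collect the ind value of the months = 1 row for each idnum1, then attach it to every row of that id. A minimal Python sketch of that idea follows, assuming at most one months = 1 row per id (with several, the last one seen wins). `broadcast_first_month` is a hypothetical name.

```python
def broadcast_first_month(rows):
    """rows: (idnum1, months, ind) tuples. Appends new_col to each row."""
    # Lookup table: the ind value found on the months == 1 row of each id.
    first_ind = {idnum: ind for idnum, months, ind in rows if months == 1}
    # Ids without a months == 1 row get None, like a LEFT JOIN without a match.
    return [(idnum, months, ind, first_ind.get(idnum)) for idnum, months, ind in rows]


have = [
    (1, 1, "X"), (1, 2, "X"), (1, 3, "Y"), (1, 4, "Y"), (1, 5, "X"),
    (2, 1, "Y"), (2, 2, "Y"), (2, 3, "X"), (2, 4, "X"), (2, 5, "X"),
]

want = broadcast_first_month(have)
for row in want:
    print(row)
```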
I have a data set containing an unbalanced panel of observations, where I want to forward and backward fill missing and/or "wrong" observations of ticker with the latest non-missing string.
id time ticker_have ticker_want
------------------------------
1 1 ABCDE YYYYY
1 2 . YYYYY
1 3 . YYYYY
1 4 YYYYY YYYYY
1 5 . YYYYY
------------------------------
2 4 . ZZZZZ
2 5 ZZZZZ ZZZZZ
2 6 . ZZZZZ
------------------------------
3 1 . .
------------------------------
4 2 OOOOO OOOOO
4 3 OOOOO OOOOO
4 4 OOOOO OOOOO
Basically, if the observation already has a ticker, but this ticker is not the same as the latest non-empty ticker, we replace this ticker using the latest ticker.
So far, I have managed to fill missing observations forward using this code:
proc sql;
create table have as select * from old_have order by id, time desc;
quit;
data want;
drop temp;
set have;
by id;
/* RETAIN the new variable*/
retain temp; length temp $ 5;
/* Reset TEMP when the BY-Group changes */
if first.id then temp=' ';
/* Assign TEMP when X is non-missing */
if ticker ne ' ' then temp=ticker;
/* When X is missing, assign the retained value of TEMP into X */
else if ticker=' ' then ticker=temp;
run;
Now I am stuck figuring out the cases where I can't access the non-missing value using last.ticker or first.ticker ...
How would one do this using DATA or PROC SQL or any other SAS commands?
You can do this several ways, but proc sql with some nested sub-queries is one solution.
(Read it from inside out, #1 then 2 then 3. You could build each subquery into a dataset first if it helps)
proc sql ;
create table want as
/* #3 - match last ticker on id */
select a.id, a.time, a.ticker_have, b.ticker_want
from have a
left join
/* #2 - id and last ticker */
(select x.id, x.ticker_have as ticker_want
from have x
inner join
/* #1 - max time with a ticker per id */
(select id, max(time) as mt
from have
where not missing(ticker_have)
group by id) as y on x.id = y.id and x.time = y.mt) as b on a.id = b.id
;
quit ;
Consider using a data step to retrieve the last ticker by time for each id, then joining it to the main table. Also, use a CASE expression to fill in the ticker only where it is missing.
data LastTicker;
set Tickers (where=(ticker_have ~=""));
by id;
first = first.id;
last = last.id;
if last = 1;
run;
proc sql;
create table Tickers_Want as
select t.id, t.time, t.ticker_have,
case when t.ticker_have = ""
then l.ticker_have
else t.ticker_have
end as tickerwant
from Tickers t
left join LastTicker l
on t.id = l.id
order by t.id, t.time;
quit;
Data
data Tickers;
length ticker_have $ 5;
input id time ticker_have $;
datalines;
1 1 ABCDE
1 2 .
1 3 .
1 4 YYYYY
1 5 .
2 4 .
2 5 ZZZZZ
2 6 .
3 1 .
4 2 OOOOO
4 3 OOOOO
4 4 OOOOO
;
Output
Obs  id  time  ticker_have  tickerwant
  1   1     1  ABCDE        ABCDE
  2   1     2               YYYYY
  3   1     3               YYYYY
  4   1     4  YYYYY        YYYYY
  5   1     5               YYYYY
  6   2     4               ZZZZZ
  7   2     5  ZZZZZ        ZZZZZ
  8   2     6               ZZZZZ
  9   3     1
 10   4     2  OOOOO        OOOOO
 11   4     3  OOOOO        OOOOO
 12   4     4  OOOOO        OOOOO
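Note that the output above keeps ABCDE in the first row, whereas the ticker_want column in the question overwrites it with YYYYY. If the question's behaviour is what is needed, the rule is simply "latest non-missing ticker per id, applied to every row of that id". Here is a hedged Python sketch of that rule; `latest_ticker_per_id` is a hypothetical name and missing tickers are represented as `None`.

```python
def latest_ticker_per_id(rows):
    """rows: (id, time, ticker) tuples, ticker is None when missing.

    Returns the rows in their original order with ticker_want appended.
    """
    latest = {}
    # Walk each id in time order; later non-missing tickers overwrite earlier ones.
    for id_, time, ticker in sorted(rows, key=lambda r: (r[0], r[1])):
        if ticker is not None:
            latest[id_] = ticker
    return [(id_, time, ticker, latest.get(id_)) for id_, time, ticker in rows]


tickers = [
    (1, 1, "ABCDE"), (1, 2, None), (1, 3, None), (1, 4, "YYYYY"), (1, 5, None),
    (2, 4, None), (2, 5, "ZZZZZ"), (2, 6, None),
    (3, 1, None),
    (4, 2, "OOOOO"), (4, 3, "OOOOO"), (4, 4, "OOOOO"),
]

want = latest_ticker_per_id(tickers)
for row in want:
    print(row)
```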
I have to get rid of a subject if it satisfies a condition.
DATA:
Name Value1
A 60
A 30
B 70
B 30
C 60
C 50
D 70
D 40
What I want is: if any row for a name has value = 30, then none of that name's rows should appear in the output.
Desired output is:
Name Value1
C 60
C 50
D 70
D 40
I have written this code in proc sql:
proc sql;
create table ck1 as
select * from ip where name in
(select distinct name from ip where value = 30)
order by name, subject, folderseq;
quit;
Change your SQL to be:
proc sql;
create table ck1 as
select * from ip where name not in
(select distinct name from ip where value = 30)
order by name, subject, folderseq;
quit;
Data step method:
data have;
input Name $ Value1;
datalines;
A 60
A 30
B 70
B 30
C 60
C 50
D 70
D 40
;;;;
run;
data want;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if value1=30 then value1_30=1;
/* no early LEAVE here: both loops must read the whole BY group to stay in step */
end;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if value1_30 ne 1 then output;
end;
run;
And an alternate, slightly faster method in some cases that avoids the second set statement when value1_30 is 1 (this is faster in particular if most have a 30 in them, so you're only keeping a small number of records).
data want;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
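counter+1;
if first.name then firstcounter=counter;
/* no ELSE: a single-row group is both first and last */
if last.name then lastcounter=counter;
if value1=30 then value1_30=1;
/* no early LEAVE: read the whole BY group so the next iteration starts on a new name */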
end;
if value1_30 ne 1 then
do _n_ = firstcounter to lastcounter ;
set have point=_n_;
output;
end;
run;
Another SQL option...
proc sql number;
select
a.name,
a.value1,
case
when value1 = 30 then 1
else 0
end as flag,
sum(calculated flag) as countflagpername
from have a
group by a.name
having countflagpername = 0
;quit;
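All of the approaches above follow the same two-phase pattern: first find the names that have a 30, then keep only the rows of the other names. A minimal Python sketch of that pattern, using the sample data from the question; `drop_names_with_value` is a hypothetical name.

```python
def drop_names_with_value(rows, bad_value=30):
    """rows: (name, value1) tuples. Drops every row of any name that has bad_value."""
    # Phase 1: names to exclude (the NOT IN subquery).
    flagged = {name for name, value in rows if value == bad_value}
    # Phase 2: keep rows whose name was never flagged, in the original order.
    return [(name, value) for name, value in rows if name not in flagged]


have = [
    ("A", 60), ("A", 30), ("B", 70), ("B", 30),
    ("C", 60), ("C", 50), ("D", 70), ("D", 40),
]

want = drop_names_with_value(have)
print(want)
```

Because the flagged names are collected before any row is filtered, the result does not depend on where the 30 sits within a group.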