Tracking customer retention on a weekly basis - SQL

I have start and end weeks for a given customer and I need to make panel data for the weeks they are subscribed. I have manipulated the data into a form that is easy to convert, but when I transpose I do not get the weeks between start and end filled in. Hopefully an example will shed some light on my request. Weeks start at 0 and end at 61, so I forced any week above 61 to be 61, again for simplicity. Populate with a 1 if they are still subscribed and a blank if not.
ID Start_week End_week
1 6 61
2 0 46
3 45 61
What I would like:
ID week0 week1 ... week6 ... week45 week46 week47 ... week61
1 . . ... 1 ... 1 1 1 ... 1
2 1 1 ... 1 ... 1 1 0 ... 0
3 0 0 ... 0 ... 1 1 1 ... 1

I see two ways to do it.
I would go for an array approach, since it will probably be the fastest (single data step) and is not that complex:
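data RESULT (drop=start_week end_week week);
set YOUR_DATA;
array week_array{62} week0-week61;
do week=0 to 61;
/* BETWEEN is only valid in WHERE clauses and SQL, so use a chained comparison. */
/* The array is 1-based, so week 0 lives in element 1, hence week+1. */
if start_week <= week <= end_week then week_array[week+1]=1;
else week_array[week+1]=0;
end;
run;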
Alternatively, you can prepare a table for the transpose to work by creating one record per week per ID:
data BEFORE_TRANSPOSE (drop=start_week end_week);
set YOUR_DATA;
do week=0 to 61;
/* BETWEEN is not available in a DATA step IF statement; use a chained comparison. */
if start_week <= week <= end_week then subscribed=1;
else subscribed=0;
output;
end;
run;
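The transpose step itself is not shown above. A minimal sketch, assuming BEFORE_TRANSPOSE holds the variables ID, week and subscribed and is already sorted by ID (the output name RESULT_WIDE is made up for illustration):

```sas
/* Assumes BEFORE_TRANSPOSE is sorted by ID with one row per ID per week (0-61). */
proc transpose data=BEFORE_TRANSPOSE
               out=RESULT_WIDE (drop=_name_)
               prefix=week;
  by ID;          /* one output row per customer */
  id week;        /* the week number becomes the column suffix: week0, week1, ... */
  var subscribed; /* the 0/1 flag fills the cells */
run;
```

Because every ID carries all 62 weeks in order, the columns should come out as week0 through week61.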

Use an array to create the variables. The one gotcha is that SAS arrays are 1-indexed by default, so week 0 is stored in element 1.
data input;
input ID Start_week End_week;
datalines;
1 6 61
2 0 46
3 45 61
;
data output;
array week[62] week0-week61;
set input;
do i=1 to 62;
if i > start_week and i<= (end_week+1) then
week[i] = 1;
else
week[i] = 0;
end;
drop i;
run;
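If the shift by one is a concern, SAS also accepts explicit array bounds, so the subscript can be the week number itself. A sketch of the same step under that assumption (the output name output0 is hypothetical):

```sas
data output0;
  set input;
  /* Explicit bounds 0:61, so week[0] is week0 and week[61] is week61. */
  array week[0:61] week0-week61;
  do i = 0 to 61;
    /* A comparison evaluates to 1 (true) or 0 (false) in a DATA step. */
    week[i] = (start_week <= i <= end_week);
  end;
  drop i;
run;
```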

I have no working syntax, but here is a guideline for you.
First, make a table, either with a CTE or physically, with the numbers 0 to 61 as rows. Then join this table with the subscription table. Something like:
FROM sub
INNER JOIN CTE
ON CTE.week BETWEEN sub.Start_week AND sub.End_week
Now you will have a row for every week a customer is subscribed. Transpose that and you will have the in between weeks also filled in.
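A rough sketch of that guideline in SAS. As far as I know PROC SQL has no WITH clause, so a small physical table stands in for the CTE here. The names weeks, sub_long and sub_wide are made up for illustration, and sub is assumed to hold ID, Start_week and End_week:

```sas
/* Hypothetical helper table standing in for the CTE: weeks 0 through 61. */
data weeks;
  do week = 0 to 61;
    output;
  end;
run;

proc sql;
  /* One row for every week a customer is subscribed. */
  create table sub_long as
  select s.ID, w.week, 1 as subscribed
  from sub as s
       inner join weeks as w
       on w.week between s.Start_week and s.End_week
  order by s.ID, w.week;
quit;

/* Turn the long table into one row per ID with a column per week. */
proc transpose data=sub_long out=sub_wide (drop=_name_) prefix=week;
  by ID;
  id week;
  var subscribed;
run;
```

With an inner join, weeks without a subscription come out as missing rather than 0, and the week columns may not appear in numeric order, because PROC TRANSPOSE creates them in the order the week values are first encountered.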

Related

SAS sum observations not in a group, by group

I have a data set :
data have;
input group $ value;
datalines;
A 4
A 3
A 2
A 1
B 1
C 1
D 2
D 1
E 1
F 1
G 2
G 1
H 1
;
run;
The first variable is a group identifier, the second a value.
For each group, I want a new variable "sum" with the sum of all values in the column, except for the group the observation is in.
My issue is that I have to do this on nearly 30 million observations, so efficiency matters.
I found that using a data step was more efficient than using procs.
The final data set should look like:
data want;
input group $ value sum;
datalines;
A 4 11
A 3 11
A 2 11
A 1 11
B 1 20
C 1 20
D 2 18
D 1 18
E 1 20
F 1 20
G 2 18
G 1 18
H 1 20
;
run;
Any idea how to perform this please?
Edit: I don't know if this matters, but the example I gave is a simplified version of my issue. In the real case, I have 2 other group variables, so taking the sum of the whole column and subtracting the sum within the group is not a viable solution.
The requirement
"sum of all values in the column, except for the group the observation is in"
indicates two passes of the data must occur:
1.) Compute allsum and each group's group_sum. A hash can store each group's sum, computed via a specified suminc: variable and .ref() method invocation. A variable can accumulate allsum.
2.) Compute allsum - group_sum for each row of a group. The group_sum is retrieved from the hash and subtracted from allsum.
Example:
data want;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
allsum + value;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve groups sum. Do you hear the Dragnet theme too?;
sum = allsum - sum; * subtract from allsum;
output;
end;
stop;
run;
What is wrong with a straightforward approach? You need to make two passes no matter what you do.
Like this. I included extra variables so you can see how the values are derived.
proc sql ;
create table want as
select a.*,b.grand,sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group
;
quit;
Results:
Obs group value grand total sum
1 A 3 21 10 11
2 A 1 21 10 11
3 A 2 21 10 11
4 A 4 21 10 11
5 B 1 21 1 20
6 C 1 21 1 20
7 D 2 21 3 18
8 D 1 21 3 18
9 E 1 21 1 20
10 F 1 21 1 20
11 G 1 21 3 18
12 G 2 21 3 18
13 H 1 21 1 20
Note it does not matter what you have as your GROUP BY clause.
Do you really need to output all of the original observations? Why not just output the summary table?
proc sql ;
create table want as
select a.group, sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group, b.grand
;
quit;
Results
Obs group total sum
1 A 10 11
2 B 1 20
3 C 1 20
4 D 3 18
5 E 1 20
6 F 1 20
7 G 3 18
8 H 1 20
I would break this out into two different segments:
1.) You could start by using PROC SQL to get the sums by the group
2.) Then use some IF/THEN statements to reassign the values by group
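One way those two segments might look, as a sketch only. Note that a MERGE by group stands in for the IF/THEN statements here, which is a substitution on my part. It assumes have is sorted by group; group_sums and grand_total are illustrative names:

```sas
proc sql noprint;
  /* Segment 1: per-group sums, ordered so they can be merged back by group. */
  create table group_sums as
  select h.group, sum(h.value) as group_sum
  from have as h
  group by h.group
  order by h.group;

  /* Grand total into a macro variable; best32. avoids rounding a large total. */
  select sum(value) format=best32. into :grand_total trimmed
  from have;
quit;

/* Segment 2: attach each group's sum and subtract it from the grand total. */
data want;
  merge have group_sums;
  by group;
  sum = &grand_total - group_sum;
  drop group_sum;
run;
```

This still reads the data more than once, so on 30 million rows it is unlikely to beat the hash approach above.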

identifying the rows with maximum continuous values

I have two columns in a table. The second column has 1 or 0 depending on a predefined condition. Can someone help me with the logic to identify the longest continuous run of 1s? For example, in the below table the maximum continuous occurrence is between rows 7 and 18. Just the logic to identify this would be enough.
Thanks
Create the intervals.
data intervals ;
set have ;
by B NOTSORTED ;
if first.b then start=A ;
retain start ;
if last.b then do;
end = A ;
duration = end - start + 1 ;
output;
end;
drop A ;
run;
Then find the interval with the maximum duration. Perhaps you want the first occurrence of the maximum duration?
proc sort data=intervals out=want ;
by descending duration start;
run;
data want ;
set want (obs=1);
where B=1;
run;
Something like this: keep a running count of consecutive 1s, reset it at every 0, and take its maximum.
data have;
input A B;
datalines;
1 0
2 0
3 1
4 1
5 1
6 0
7 0
8 0
9 1
10 0
11 1
12 1
13 1
14 1
15 1
16 1
17 0
18 0
19 0
20 1
21 0
;
proc sort data=have;
by A;
run;
data want;
set have;
if B=1 then count + 1;
if B = 0 then count = 0;
run;
proc sql;
select max(count) as max_value from want;
quit;
How do I add a key to a row based on its "group"?

I have a data set like this:
a 10
a 13
a 14
b 15
b 44
c 64
c 32
d 12
I want to write a PROC SQL statement or DATA step that will yield this:
a 10 1
a 13 1
a 14 1
b 15 2
b 44 2
c 64 3
c 32 3
d 12 4
How do I do this?
DATA TEST;
INPUT id $ value ;
DATALINES;
a 10
a 13
a 14
b 15
b 44
c 64
c 32
d 12
;
RUN;
Sort your data if needed:
proc sort data=test;
by id;
run;
Then:
data want;
set test;
retain key;
by id;
if _n_ = 1 then key = 0;
if first.id then key = key + 1;
run;
The retain statement keeps the value of key from one iteration of the data step to the next.
Then, whenever a new id appears, we add 1 to key.
Alternatively as stated by Keith, you could use this simplified data step to do the job:
data want;
set test;
by id;
if first.id then key + 1;
run;
I'll leave both versions here for reference because I think the first one is easier to understand, and the last one from Keith's comments is a lot cleaner.

How to do a last observation carrying forward using SAS PROC SQL

I have the data below. I want to write SAS PROC SQL code to carry forward the last non-missing value for each patient (ptno).
data sda;
input ptno visit weight;
format ptno z3. ;
cards;
1 1 122
1 2 123
1 3 .
1 4 .
2 1 156
2 2 .
2 3 70
2 4 .
3 1 60
3 2 .
3 3 112
3 4 .
;
run;
proc sql noprint;
create table new as
select ptno,visit,weight,
case
when weight = . then weight
else .
end as _weight_1
from sda
group by ptno,visit
order by ptno,visit;
quit;
The SQL code above does not work.
The desired output looks like this:
ptno visit weight
1 1 122
1 2 123
1 3 123
1 4 123
2 1 156
2 2 .
2 3 70
2 4 70
3 1 60
3 2 .
3 3 112
3 4 112
Since you do have effectively a row number (visit), you can do this - though it's much slower than the data step.
Here it is, broken out into a separate column for demonstration purposes - of course in your case you will want to coalesce this into one column.
Basically, you need a subquery that determines the maximum visit number less than the current one that does have a legitimate weight count, and then join that to the table to get the weight.
proc sql;
select ptno, visit, weight,
(
select weight
from sda A,
(select ptno, max(visit) as visit
from sda D
where D.ptno=S.ptno
and D.visit<S.visit
and D.weight is not null
group by ptno
) V
where A.visit=V.visit and A.ptno=V.ptno
)
from sda S
;
quit;
Although you don't describe it that way, you do not carry forward VISIT 1, right?
I don't know why you would want to do this using SQL. In SAS a data step is much better suited to the task. I like using the "update trick". If you're interested in how this works I will leave it to you to study the UPDATE statement.
data locf;
update sda(obs=0 keep=ptno) sda;
by ptno;
output;
if visit eq 1 then call missing(weight);
run;
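For comparison, roughly the same result can be sketched with RETAIN instead of UPDATE. This is an illustration rather than the method above; it assumes sda is sorted by ptno and visit, and locf_retain and carried are made-up names:

```sas
data locf_retain;
  set sda;
  by ptno;
  retain carried;                      /* last weight eligible to be carried forward */
  if first.ptno then carried = .;      /* never carry across patients */
  if missing(weight) then weight = carried;
  else if visit ne 1 then carried = weight; /* the visit 1 weight is not carried */
  drop carried;
run;
```

As in the desired output, a missing weight at visit 2 stays missing when the only earlier value comes from visit 1.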

Create a variable based on sum of two variables (one lag)

I have a data set like the one below, where the amount has dropped off, but the adjustment remains. For each row amount should be the sum of the previous amount and the adjustment. So, amount for observation 5 is 134 (124+10).
I have an answer which gets me the next value, but I need some sort of recursion to get me the rest of the way there. What am I missing? Thanks.
data have;
input amount adjust;
cards;
100 0
101 1
121 20
124 3
. 10
. 4
. 3
. 0
. 1
;
run;
data attempt;
set have;
x=lag1(amount);
if amount=. then amount=adjust+x;
run;
data want;
input amount adjust;
cards;
100 0
101 1
121 20
124 3
134 10
138 4
141 3
141 0
142 1
;
run;
EDIT:
Also trying something like this now, still not quite what I want.
%macro doodoo;
%do i = 1 %to 5;
data have;
set have;
/* if _n_=i+4 then*/
amount=lag1(amount)+adjust;
run;
%end;
%mend;
%doodoo;
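There is no need for LAG(); use RETAIN instead. LAG() returns the value that amount held when LAG() was called on the previous row, which is the original missing value rather than the one you filled in, so the computed amounts never carry forward.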
data want ;
set have ;
retain previous ;
if amount = . then amount=sum(previous,adjust);
previous=amount ;
run;