How do I add a key to a row based on its "group"?

I have a data set like this:
a 10
a 13
a 14
b 15
b 44
c 64
c 32
d 12
I want to write a PROC SQL statement or DATA step that will yield this:
a 10 1
a 13 1
a 14 1
b 15 2
b 44 2
c 64 3
c 32 3
d 12 4
How can I do this?
DATA TEST;
INPUT id $ value ;
DATALINES;
a 10
a 13
a 14
b 15
b 44
c 64
c 32
d 12
;
RUN;

Sort your data if needed:
proc sort data=test;
by id;
run;
Then:
data want;
set test;
retain key;
by id;
if _n_ = 1 then key = 0;
if first.id then key = key + 1;
run;
The RETAIN statement keeps the value of key from one iteration of the DATA step to the next instead of resetting it to missing.
Then, whenever a new id begins (first.id is true), we add 1 to key.
Alternatively, as Keith pointed out, you can use this simplified DATA step to do the job:
data want;
set test;
by id;
if first.id then key + 1;
run;
I'll leave both versions here for reference: I think the first one is easier to understand, while the second one, from Keith's comments, is a lot cleaner because the sum statement (key + 1) implicitly retains key and initializes it to 0.
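Since the question also asks about PROC SQL, here is a minimal sketch of one possible SQL approach. It is not taken from the answer above, and it assumes the key should follow the sort order of id; the table name want_sql and the aliases t and u are made up for illustration. The correlated subquery counts how many distinct id values are less than or equal to the current one, so it is likely to be much slower than the DATA step on large tables, and the order of rows within an id is not guaranteed.

```sas
proc sql;
  create table want_sql as
  select t.id,
         t.value,
         /* rank of this id among the distinct ids = group key */
         (select count(distinct u.id)
            from test u
           where u.id <= t.id) as key
  from test t
  order by t.id;
quit;
```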

Related

SAS sum observations not in a group, by multiple groups

This post follows on from this one: SAS sum observations not in a group, by group
My minimal example there was sadly a bit too minimal, and I wasn't able to apply the answers to my data.
Here is a complete example. What I have is:
data have;
input group1 group2 $ group3 $ value;
datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
1 B Z 2
1 B Z 1
1 C Y 1
1 C Y 6
1 C Z 7
2 A Z 3
2 A Z 9
2 A Y 2
2 B X 8
2 B X 5
2 B X 5
2 B Z 7
2 C Y 2
2 C X 1
;
run;
For each observation, I want a new variable "sum" with the sum of all values in the column for the same sub groups (group1 and group2), except for the group (group3) the observation is in.
data want;
input group1 group2 $ group3 $ value sum;
datalines;
1 A X 2 8
1 A X 4 6
1 A Y 1 9
1 A Y 3 7
1 B Z 2 1
1 B Z 1 2
1 C Y 1 13
1 C Y 6 8
1 C Z 7 7
2 A Z 3 11
2 A Z 9 5
2 A Y 2 12
2 B X 8 17
2 B X 5 20
2 B X 5 20
2 B Z 7 18
2 C Y 2 1
2 C X 1 2
;
run;
My goal is to use either DATA steps or PROC SQL (I am doing this on around 30 million observations, and PROC MEANS and similar procedures seemed slower than those on previous similar computations).
My issue with the solutions provided in the linked post is that they use the total of the whole column, and I don't know how to change this to use the total within the sub group.
Any ideas, please?

A SQL solution joins every row to an aggregating subquery:
proc sql;
create table want as
select have.group1, have.group2, have.group3, have.value
, aggregate.sum - value as sum
from
have
join
(select group1, group2, sum(value) as sum
from have
group by group1, group2
) aggregate
on
aggregate.group1 = have.group1
& aggregate.group2 = have.group2
;
quit;
SQL can be slower than a hash solution, but SQL code is understood by more people than a SAS DATA step involving hash objects (which can be faster than SQL):
data want2;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group1', 'group2');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve group sum.;
sum = sum - value; * subtract from group sum;
output;
end;
stop;
run;
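The SAS documentation touches on SUMINC and has some examples.
The code above reproduces the expected output in the question (the group1, group2 total minus the row's own value), but that output does not address the concept as the question words it:
For each row, compute the tier 2 sum that excludes the tier 3 this row is in.
A hash-based solution for that would require tracking both the two-level and the three-level sums: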
data want2;
if 0 then set have; * prep pdv;
declare hash T2 (suminc:'value'); * hash for two (T)iers;
T2.defineKey('group1', 'group2'); * one hash record per combination of group1, group2;
T2.defineDone();
declare hash T3 (suminc:'value'); * hash for three (T)iers;
T3.defineKey('group1', 'group2', 'group3'); * one hash record per combination of group1, group2, group3;
T3.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
T2.ref(); * adds value to internal sum of hash data record;
T3.ref();
end;
T2_cardinality = T2.num_items;
T3_cardinality = T3.num_items;
put 'NOTE: |T2| = ' T2_cardinality;
put 'NOTE: |T3| = ' T3_cardinality;
do while (not last_have);
set have end=last_have;
T2.sum(sum:t2_sum);
T3.sum(sum:t3_sum);
sum = t2_sum - t3_sum;
output;
end;
stop;
drop t2_: t3_:;
run;
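For comparison, the same "tier 2 sum minus tier 3 sum" idea can be sketched in PROC SQL by joining each row to two aggregating subqueries. This is an untested sketch rather than part of the answer above; want3, h, t2, t3, sum2 and sum3 are illustrative names, and it assumes have was read with group2 and group3 as character variables.

```sas
proc sql;
  create table want3 as
  select h.group1, h.group2, h.group3, h.value,
         /* two-level total minus the three-level total this row belongs to */
         t2.sum2 - t3.sum3 as sum
  from have h
  join (select group1, group2, sum(value) as sum2
          from have
         group by group1, group2) t2
    on t2.group1 = h.group1
   and t2.group2 = h.group2
  join (select group1, group2, group3, sum(value) as sum3
          from have
         group by group1, group2, group3) t3
    on t3.group1 = h.group1
   and t3.group2 = h.group2
   and t3.group3 = h.group3
  ;
quit;
```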

SAS sum observations not in a group, by group

I have a data set :
data have;
input group $ value;
datalines;
A 4
A 3
A 2
A 1
B 1
C 1
D 2
D 1
E 1
F 1
G 2
G 1
H 1
;
run;
The first variable is a group identifier, the second a value.
For each group, I want a new variable "sum" with the sum of all values in the column, except for the group the observation is in.
My issue is having to do that on nearly 30 million observations, so efficiency matters.
I found that using a DATA step was more efficient than using procs.
The final data set should look like:
data want;
input group $ value sum;
datalines;
A 4 11
A 3 11
A 2 11
A 1 11
B 1 20
C 1 20
D 2 18
D 1 18
E 1 20
F 1 20
G 2 18
G 1 18
H 1 20
;
run;
Any idea how to do this, please?
Edit: I don't know if this matters, but the example I gave is a simplified version of my issue. In the real case I have 2 other group variables, so taking the sum of the whole column and subtracting the sum within the group is not a viable solution.

The requirement
sum of all values in the column, except for the group the observation is in
indicates that two passes over the data must occur:
1.) Compute allsum and each group's group_sum. A hash can store each group's sum, computed via a specified suminc: variable and .ref() method invocation. A variable can accumulate allsum.
2.) Compute allsum - group_sum for each row of a group. The group_sum is retrieved from the hash and subtracted from allsum.
Example:
data want;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
allsum + value;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve groups sum. Do you hear the Dragnet theme too?;
sum = allsum - sum; * subtract from allsum;
output;
end;
stop;
run;

What is wrong with a straightforward approach? You need to make two passes no matter what you do.
Like this. I included extra variables so you can see how the values are derived.
proc sql ;
create table want as
select a.*,b.grand,sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group
;
quit;
Results:
Obs group value grand total sum
1 A 3 21 10 11
2 A 1 21 10 11
3 A 2 21 10 11
4 A 4 21 10 11
5 B 1 21 1 20
6 C 1 21 1 20
7 D 2 21 3 18
8 D 1 21 3 18
9 E 1 21 1 20
10 F 1 21 1 20
11 G 1 21 3 18
12 G 2 21 3 18
13 H 1 21 1 20
Note it does not matter what you have as your GROUP BY clause.
Do you really need to output all of the original observations? Why not just output the summary table?
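proc sql ;
create table want as
select a.group, sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
/* b.grand is a constant, so grouping by it keeps one row per group */
group by a.group, b.grand
;
quit;
Results: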
Obs group total sum
1 A 10 11
2 B 1 20
3 C 1 20
4 D 3 18
5 E 1 20
6 F 1 20
7 G 3 18
8 H 1 20

I would break this out into two different segments:
1.) You could start by using PROC SQL to get the sums by the group
2.) Then use some IF/THEN statements to reassign the values by group

identifying the rows with maximum continuous values

I have two columns in a table. The second column contains 1 or 0 depending on a predefined condition. Can someone help me with logic to identify the longest continuous run of 1s? For example, in the table below the longest continuous run is between rows 7 and 18. Just the logic to identify this would be enough.
Thanks

Create the intervals.
data intervals ;
set have ;
by B NOTSORTED ;
if first.b then start=A ;
retain start ;
if last.b then do;
end = A ;
duration = end - start + 1 ;
output;
end;
drop A ;
run;
Then find the interval with the maximum duration. Perhaps you want the first occurrence of the maximum duration?
proc sort data=intervals out=want ;
by descending duration start;
run;
data want ;
set want (obs=1);
where B=1;
run;

Something like this:
data have;
input A B;
datalines;
1 0
2 0
3 1
4 1
5 1
6 0
7 0
8 0
9 1
10 0
11 1
12 1
13 1
14 1
15 1
16 1
17 0
18 0
19 0
20 1
21 0
;
proc sort data=have;
by A;
run;
data want;
set have;
if B=1 then count + 1;
if B = 0 then count = 0;
run;
proc sql;
select max(count) as max_value from want;
quit;
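The query above returns only the length of the longest run. As a rough sketch, the rows where that run starts and ends can be recovered from the same want data set, assuming A is a consecutive row number (as it is in the sample data). The column names start_row, end_row and run_length and the alias w are made up for illustration. The running count reaches its maximum on the last row of the longest run, so the start is that row number minus count plus 1.

```sas
proc sql;
  select w.A - w.count + 1 as start_row,
         w.A               as end_row,
         w.count           as run_length
  from want w
  /* PROC SQL remerges the overall maximum back onto every row */
  having w.count = max(w.count);
quit;
```

With the sample data this should report the run from row 11 to row 16. If several runs tie for the maximum length, one row per run is returned.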

Data transformation Using Proc Transpose or simpler procedures

I have a data set:
Period Store Item feature_1 feature_2
JAN A a1 3 4
JAN A a2 4 9
JAN A a3 2 1
JAN A a4 4 9
FEB A a2 4 9
JAN B a2 3 1
FEB B b2 4 9
.....
I would like to get the dataset:
Period Store a1_feature_1 a1_feature_2 a2_feature_1 a2_feature_2....
JAN A 3 4 4 9
FEB A . . 4 9
JAN B . . 3 1
where each observation of the final data set corresponds to one outlet during one period, with all the features for each item together in the same observation.
My initial guess is to first use a macro to create the variables a1_feature_1, a1_feature_2, a2_feature_1, a2_feature_2....
and then use a PROC SQL GROUP BY to collapse across Store and Period.
I am wondering whether this can be done using PROC TRANSPOSE or SQL, or whether there are other, simpler steps for transforming this data?

Here is one way of doing this:
data have;
input (Period Store Item) ($) feature_1 feature_2; cards;
JAN A a1 3 4
JAN A a2 4 9
JAN A a3 2 1
JAN A a4 4 9
FEB A a2 4 9
JAN B a2 3 1
FEB B b2 4 9
;
run;
proc sql noprint;
select distinct cats(item,'_feature1'),cats(item,'_feature2')
into :item_list1 separated by ' ', :item_list2 separated by ' '
from have;
quit;
data want;
do until(last.period);
set have;
by store period notsorted;
array f1[*] &item_list1;
array f2[*] &item_list2;
do i = 1 to dim(f1);
if vname(f1[i]) eq: trim(item) then do;
f1[i] = feature_1;
f2[i] = feature_2;
end;
end;
end;
drop i item feature_1 feature_2;
run;
N.B. this does not give you the column order shown in the question, and the variables are named a1_feature1 rather than a1_feature_1, but you could easily fix both with a bit of additional logic if you wanted. Also, because a macro variable is limited to 65,534 characters, the macro variables used to define the arrays will only hold enough variable names for a few thousand items.

You can also put all your feature_ variables into a list, transpose the data once per feature variable (using the variable name as the suffix), then merge the results together. With this method you do not have to type out all your feature_ variables manually, because the SQL step collects them for you:
data test;
length Period Store Item $5 feature_1 feature_2 8;
input Period $ Store $ Item $ feature_1 feature_2;
datalines;
JAN A a1 3 4
JAN A a2 4 9
JAN A a3 2 1
JAN A a4 4 9
FEB A a2 4 9
JAN B a2 3 1
FEB B b2 4 9
;
run;
proc sort data = test;
by PERIOD STORE;
run;
** how many feature_ vars do I have? **;
proc sql noprint;
create table features as
select NAME
from dictionary.columns
where libname="WORK" and memname="TEST" and index(NAME,"feature");
** put them into a list to loop over **;
select NAME
into: feature_list separated by " "
from features;
quit;
%put &feature_list.;
** transpose data using each feature_ variable then merge when finished **;
%MACRO loop_over(feature_list);
%local i feature;
%do i=1 %to %sysfunc(countw(&feature_list.));
%let feature=%scan(&feature_list.,&i.);
proc transpose data = test out=trans_&feature.(drop=_NAME_) SUFFIX=_&feature.;
by PERIOD STORE;
id ITEM;
var &feature.;
run;
%end;
data merged;
merge trans_:;
by PERIOD STORE;
run;
%MEND;
%loop_over(&feature_list.);
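A macro-free variant is the common "double transpose" pattern: first stack the features into a long table, build the target column name, then transpose back to wide. The sketch below is an untested illustration that assumes the sorted test data set from above; long, wide, val and colname are hypothetical names.

```sas
/* step 1: one row per period, store, item and feature */
proc transpose data=test out=long(rename=(col1=val));
  by period store item notsorted;
  var feature_1 feature_2;
run;

/* step 2: build the wide column name, e.g. a1_feature_1 */
data long;
  set long;
  length colname $32;
  colname = catx('_', item, _name_);
run;

/* step 3: back to one row per period and store */
proc transpose data=long out=wide(drop=_name_);
  by period store notsorted;
  id colname;
  var val;
run;
```

The columns of wide appear in the order in which the names are first encountered, so they may not match the order shown in the question.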

tracking customer retention on weekly basis

I have start and end weeks for a given customer, and I need to make panel data for the weeks they are subscribed. I have manipulated the data into a form that is easy to convert, but when I transpose I do not get the weeks between start and end filled in. Hopefully an example will shed some light on my request. Weeks start at 0 and end at 61, so I forced any week above 61 to be 61, again for simplicity. Populate with a 1 if they are still subscribed and a blank if not.
ID Start_week End_week
1 6 61
2 0 46
3 45 61
What I would like:
ID week0 week1 ... week6 ... week45 week46 week47 ... week61
1 . . ... 1 ... 1 1 1 ... 1
2 1 1 ... 1 ... 1 1 0 ... 0
3 0 0 ... 0 ... 1 1 1 ... 1

I see two ways to do it.
I would go for an array approach, since it will probably be the fastest (single data step) and is not that complex:
data RESULT (drop=start_week end_week week);
set YOUR_DATA;
array week_array{62} week0-week61;
do week=0 to 61;
* BETWEEN is only valid in WHERE and SQL, so use a chained comparison;
if start_week <= week <= end_week then week_array[week+1]=1;
else week_array[week+1]=0;
end;
run;
Alternatively, you can prepare a table for the transpose to work on by creating one record per week per id:
data BEFORE_TRANSPOSE (drop=start_week end_week);
set YOUR_DATA;
do week=0 to 61;
* BETWEEN is only valid in WHERE and SQL, so use a chained comparison;
if start_week <= week <= end_week then subscribed=1;
else subscribed=0;
output;
end;
run;

Use an array to create the variables. The one gotcha is that SAS arrays are 1-indexed by default.
data input;
input ID Start_week End_week;
datalines;
1 6 61
2 0 46
3 45 61
;
data output;
array week[62] week0-week61;
set input;
do i=1 to 62;
if i > start_week and i<= (end_week+1) then
week[i] = 1;
else
week[i] = 0;
end;
drop i;
run;
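If the off-by-one arithmetic is a concern, the array bounds can be declared explicitly so that the subscript is the week number itself. This is a minimal sketch under that assumption; the data set name output0 and the loop variable w are made up for illustration.

```sas
data output0;
  set input;
  /* explicit lower bound: week[0] is week0, week[61] is week61 */
  array week[0:61] week0-week61;
  do w = 0 to 61;
    /* a comparison evaluates to 1 (true) or 0 (false) */
    week[w] = (start_week <= w <= end_week);
  end;
  drop w;
run;
```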

I have no working syntax, but here is a guideline for you.
First make a table, either with a CTE or physically, with the numbers 0 to 61 as rows. Then join this table with the subscription table. Something like:
FROM sub
INNER JOIN CTE
ON CTE.week BETWEEN sub.Start_week AND sub.End_week
Now you will have a row for every week a customer is subscribed. Transpose that and you will have the in between weeks also filled in.
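In SAS this guideline could be sketched roughly as follows. As far as I know PROC SQL does not support CTEs, so the week numbers go into a physical table instead. The data set names sub, weeks, long and wide are assumptions for illustration, and the sketch is untested.

```sas
/* hypothetical subscription table, same layout as the question */
data sub;
  input ID Start_week End_week;
  datalines;
1 6 61
2 0 46
3 45 61
;

/* physical table holding the weeks 0 to 61 */
data weeks;
  do week = 0 to 61;
    output;
  end;
run;

/* one row for every week a customer is subscribed */
proc sql;
  create table long as
  select s.ID, w.week, 1 as subscribed
  from sub s
  inner join weeks w
    on w.week between s.Start_week and s.End_week
  order by s.ID, w.week;
quit;

/* one row per customer, one column per week */
proc transpose data=long out=wide(drop=_name_) prefix=week;
  by ID;
  id week;
  var subscribed;
run;
```

Weeks in which a customer is not subscribed come out as missing rather than 0, and the week columns appear in the order in which they are first encountered, not necessarily from week0 to week61.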
