SAS hierarchical structure sum - sql

I have a dataset with a hierarchical codelist variable.
The hierarchy is determined by the LEVEL variable and by the prefix structure of the character variable CODE.
There are 6 "aggregate" levels (code lengths 1 to 6) and one terminal level (code length of 10 characters).
I need to update the NODES variable, which is the count of terminal nodes under each aggregate. Aggregate levels do not count toward the "higher" aggregates, only terminal nodes do, so the total count is the same at every level: for example, the level 5 counts add up to the same total as the level 6 counts.
I also need to sum the weights up into the "higher" level nodes.
NOTE: I offset the output table's NODES and WEIGHT variable so you can see better what I am talking about (just add up the numbers in each offset and you get the same value).
EDIT1: there can be multiple observations with the same code. A unique observation is a combination of 3 variables: code + var1 + var2.
Input table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 999 999 999
2 2 11 . . 999 999 999
3 3 111 . . 999 999 999
4 4 1111 . . 999 999 999
5 5 11111 . . 999 999 999
6 6 111111 . . 999 999 999
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 999 999 999
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 999 999 999
12 6 111190 . . 999 999 999
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
Desired output table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 5 1.0 98,1
2 2 11 . . 5 1.0 98,1
3 3 111 . . 5 1.0 98,1
4 4 1111 . . 5 1.0 98,1
5 5 11111 . . 3 0.7 98,5
6 6 111111 . . 2 0.2 107,3
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 1 0.5 95,0
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 2 0.3 97,2
12 6 111190 . . 2 0.3 97,2
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
And here's the code I came up with. It works just like I wanted, but man, it is really slow. I need something way faster, because this is a part of a webservice which has to run "instantly" on request.
Any suggestions on speeding up the code, or any other solutions are welcome.
%macro doit;
data temporary;
set have;
run;
%do i=6 %to 2 %by -1;
%if &i = 6 %then %let x = 10;
%else %let x = (&i+1);
proc sql noprint;
select count(code)
into :cc trimmed
from have
where level = &i;
select code
into :id1 - :id&cc
from have
where level = &i;
quit;
%do j=1 %to &cc.;
%let idd = &&id&j;
proc sql;
update have t1
set nodes = (
select sum(nodes)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%")),
set weight = (
select sum(weight)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%"))
where (t1.level = &i and t1.code like "&idd");
quit;
%end;
%end;
%mend doit;
Current code based on @Quentin's solution:
data have;
input ID level code : $10. nodes weight myIndex;
cards;
1 1 1 . . .
2 2 11 . . .
3 3 111 . . .
4 4 1111 . . .
5 5 11111 . . .
6 6 111111 . . .
7 10 1111110000 1 0.1 105.5
8 10 1111119999 1 0.1 109.1
9 6 111112 . . .
10 10 1111129999 1 0.5 95.0
11 5 11119 . . .
12 6 111190 . . .
13 10 1111900000 1 0.1 80.7
14 10 1111901000 1 0.2 105.5
;
data want (drop=_:);
*hash table of terminal nodes;
if (_n_ = 1) then do;
if (0) then set have (rename=(code=_code weight=_weight myIndex=_myIndex));
declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight myIndex=_myIndex))');
declare hiter iter('h');
h.definekey('ID');
h.definedata('_code','_weight','_myIndex');
h.definedone();
end;
set have;
*for each non-terminal node, iterate through;
*hash table of all terminal nodes, looking for children;
if level ne 10 then do;
call missing(weight, nodes, myIndex);
do _n_ = iter.first() by 0 while (_n_ = 0);
if trim(code) =: _code then do;
weight=sum(weight,_weight);
nodes=sum(nodes,1);
myIndex=sum(myIndex,_myIndex*_weight);
end;
_n_ = iter.next();
end;
myIndex=round(myIndex/weight,.1);
end;
output;
run;

Here's an alternative hash approach.
Rather than using a hash object to do a cartesian join, this adds the nodes & weight from each level 10 node to each of the 6 applicable parent nodes as it goes along. This may be marginally faster than Quentin's approach as there are no redundant hash lookups.
It takes a bit longer than Quentin's approach when constructing the hash object, and uses a bit more memory, as each terminal node is added 6 times with different keys and existing entries often have to be updated, but afterwards each parent node only has to look up its own individual stats, rather than looping through all the terminal nodes, which is a substantial saving.
Weighted stats are possible as well, but you have to update both loops, not just the second one.
data want;
if 0 then set have;
dcl hash h();
h.definekey('code');
h.definedata('nodes','weight','myIndex');
h.definedone();
length t_code $10;
do until(eof);
set have(where = (level = 10)) end = eof;
t_nodes = nodes;
t_weight = weight;
t_myindex = weight * myIndex;
do _n_ = 1 to 6;
t_code = substr(code,1,_n_);
if h.find(key:t_code) ne 0 then h.add(key:t_code,data:t_nodes,data:t_weight,data:t_myIndex);
else do;
nodes + t_nodes;
weight + t_weight;
myIndex + t_myIndex;
h.replace(key:t_code,data:nodes,data:weight,data:MyIndex);
end;
end;
end;
do until(eof2);
set have end = eof2;
if level ne 10 then do;
h.find();
myIndex = round(MyIndex / Weight,0.1);
end;
output;
end;
drop t_:;
run;

Below is a brute-force hash approach to doing a similar Cartesian product as in the SQL. Load a hash table of the terminal nodes. Then read through the dataset of nodes, and for each node that is not a terminal node, iterate through the hash table, identifying all of the child terminal nodes.
I think the approach @joop is describing may be more efficient, as this approach doesn't take advantage of the tree structure, so there is a lot of recalculation. With 5000 records, 3000 of them terminal nodes, this would do 2000*3000 comparisons. It might not be that slow, though, since the hash table is in memory and there is no excessive I/O.
data want (drop=_:);
*hash table of terminal nodes;
if (_n_ = 1) then do;
if (0) then set have (rename=(code=_code weight=_weight));
declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight))');
declare hiter iter('h');
h.definekey('ID');
h.definedata('_code','_weight');
h.definedone();
end;
set have;
*for each non-terminal node, iterate through;
*hash table of all terminal nodes, looking for children;
if level ne 10 then do;
call missing(weight, nodes);
do _n_ = iter.first() by 0 while (_n_ = 0);
if trim(code) =: _code then do;
weight=sum(weight,_weight);
nodes=sum(nodes,1);
end;
_n_ = iter.next();
end;
end;
output;
run;

It seems pretty simple. Just join back with itself and count/sum.
proc sql ;
create table want as
select a.id, a.level, a.code , a.var1, a.var2
, count(b.id) as nodes
, sum(b.weight) as weight
from have a
left join have b
on a.code eqt b.code
and b.level=10
group by 1,2,3,4,5
order by 1
;
quit;
If you don't want to use the EQT operator then you can use the SUBSTR() function instead.
on a.code = substr(b.code,1,a.level)
and b.level=10
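For completeness, here is a sketch of the whole query written with SUBSTR(), extended with the weighted myIndex from the question. It assumes that have contains var1, var2 and a numeric myIndex, and the output name want_substr is only illustrative. Like the join above, it matches each terminal row against every terminal row with the same code, so with duplicate codes (see EDIT1 in the question) the terminal rows would need separate handling.

```sas
proc sql;
  create table want_substr as
  select a.id, a.level, a.code, a.var1, a.var2
       , count(b.id)   as nodes    /* number of terminal nodes under a.code */
       , sum(b.weight) as weight   /* total weight of those terminal nodes  */
       /* weighted mean of myIndex, rounded to one decimal */
       , round(sum(b.weight * b.myIndex) / sum(b.weight), 0.1) as myIndex
  from have a
  left join have b
    on a.code = substr(b.code, 1, a.level)
   and b.level = 10
  group by a.id, a.level, a.code, a.var1, a.var2
  order by a.id
  ;
quit;
```

For code 111111 in the sample data this would give (0.1*105.5 + 0.1*109.1) / 0.2 = 107.3, in line with the desired output.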

Since you're using SAS, how about using proc summary to do the heavy lifting here? No cartesian joins required!
One advantage of this option over some of the others is that it's a bit easier to generalise if you want to calculate lots of more complex statistics for multiple variables.
data have;
input ID level code : $10. nodes weight myIndex;
format myIndex 5.1;
cards;
1 1 1 . . .
2 2 11 . . .
3 3 111 . . .
4 4 1111 . . .
5 5 11111 . . .
6 6 111111 . . .
7 10 1111110000 1 0.1 105.5
8 10 1111119999 1 0.1 109.1
9 6 111112 . . .
10 10 1111129999 1 0.5 95.0
11 5 11119 . . .
12 6 111190 . . .
13 10 1111900000 1 0.1 80.7
14 10 1111901000 1 0.2 105.5
;
run;
data v_have /view = v_have;
set have(where = (level = 10));
array lvl[6] $6;
do i = 1 to 6;
lvl[i]=substr(code,1,i);
end;
drop i;
run;
proc summary data = v_have;
class lvl1-lvl6;
var nodes weight;
var myIndex /weight = weight;
ways 1;
output out = summary(drop = _:) sum(nodes weight)= mean(myIndex)=;
run;
data v_summary /view = v_summary;
set summary;
length code $10;
code = cats(of lvl:);
drop lvl:;
run;
data have;
modify have v_summary;
by code;
replace;
run;
In theory a hash of hashes might also be an appropriate data structure, but that would be extremely complicated for a relatively small benefit. I might have a go anyway just as a learning exercise...

One approach (I think) would be to make the Cartesian product, and find all of the terminal nodes that are a "match" to each of the nodes, then sum the weights.
Something like:
data have;
input ID level code : $10. nodes weight ;
cards;
1 1 1 . .
2 2 11 . .
3 3 111 . .
4 4 1111 . .
5 5 11111 . .
6 6 111111 . .
7 10 1111110000 1 0.1
8 10 1111119999 1 0.1
9 6 111112 . .
10 10 1111129999 1 0.5
11 5 11119 . .
12 6 111190 . .
13 10 1111900000 1 0.1
14 10 1111901000 1 0.2
;
proc sql;
select min(id) as id
, min(level) as level
, a.code
, count(b.weight) as nodes /*count of terminal nodes*/
, sum(b.weight) as weight /*sum of weights of terminal nodes*/
from
have as a
,(select code , weight
from have
where level=10 /*selects terminal nodes*/
) as b
where a.code eqt b.code /*EQT is equivalent to =: */
group by a.code
;
quit;
I'm not sure that is correct, but it gives the desired results for the sample data.

This is the SQL needed to determine the parent record of every record. It only uses string functions (POSITION and LENGTH), so it should be adaptable to any dialect of SQL, maybe even SAS (the CTE might need to be rewritten as a subquery or a view). The idea is to:
- add a parent_id field to the dataset
- for each record, find the record with the longest code that is a proper prefix of its code
- use that record's id as the value for our parent_id
- (after that) update the records with the sum(nodes), sum(weight) of their direct children (the ones with child.parent_id = this.id)
BTW: I could have used LEVEL instead of LENGTH(code); the data is a bit redundant in this respect.
WITH sub AS (
SELECT id, length(code) AS len
, code
FROM tree)
UPDATE tree t
SET parent_id = s.id
FROM sub s
WHERE length(t.code) > s.len AND POSITION (s.code IN t.code) = 1
AND NOT EXISTS (
SELECT *
FROM sub nx
WHERE nx.len > s.len AND POSITION (nx.code IN t.code ) = 1
AND nx.len < length(t.code) AND POSITION (nx.code IN t.code ) = 1
)
;
SELECT * FROM tree
ORDER BY parent_id DESC NULLS LAST
, id
;
After finding the parents, the whole table should be updated (repeatedly) from itself, like:
-- PREPARE omg( integer) AS
UPDATE tree t
SET nodes = s.nodes , weight = s.weight
FROM ( SELECT parent_id , SUM(nodes) AS nodes , SUM(weight) AS weight
FROM tree GROUP BY parent_id) s
WHERE s.parent_id = t.id
;
In SAS, this could probably be done by sorting on {0-parent_id, id} and do some retain+summation magic. (my SAS is a bit rusty in this area)
UPDATE: if only the leaf nodes have non-NULL (non-missing) values for {nodes, weight}, the aggregation can be done in one sweep for the entire tree, without first computing the parent_ids:
UPDATE tree t
SET nodes = s.nodes , weight = s.weight
FROM ( SELECT p.id , SUM(c.nodes) AS nodes , SUM(c.weight) AS weight
FROM tree p
JOIN tree c ON c.lev > p.lev AND POSITION (p.code IN c.code ) = 1
GROUP BY p.id
) s
WHERE s.id = t.id
;
An index on {lev,code} will probably speed up things. (assuming an index on id)

Related

SAS sum observations not in a group, by multiple groups

This post follows on from this one: SAS sum observations not in a group, by group
My minimal example there was sadly a bit too minimal, and I wasn't able to apply the answers to my data.
Here is a complete case example, what I have is :
data have;
input group1 group2 $ group3 $ value;
datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
1 B Z 2
1 B Z 1
1 C Y 1
1 C Y 6
1 C Z 7
2 A Z 3
2 A Z 9
2 A Y 2
2 B X 8
2 B X 5
2 B X 5
2 B Z 7
2 C Y 2
2 C X 1
;
run;
For each group, I want a new variable "sum" with the sum of all values in the column for the same subgroup (group1 and group2), except for the group (group3) the observation is in.
data want;
input group1 group2 $ group3 $ value sum;
datalines;
1 A X 2 8
1 A X 4 6
1 A Y 1 9
1 A Y 3 7
1 B Z 2 1
1 B Z 1 2
1 C Y 1 13
1 C Y 6 8
1 C Z 7 7
2 A Z 3 11
2 A Z 9 5
2 A Y 2 12
2 B X 8 17
2 B X 5 20
2 B X 5 20
2 B Z 7 18
2 C Y 2 1
2 C X 1 2
;
run;
My goal is to use either data steps or PROC SQL: I am doing this on around 30 million observations, and PROC MEANS and the like seemed slower than those in previous similar computations.
My issue with the solutions provided in the linked post is that they use the total of the whole column, and I don't know how to change this to use the total within the subgroup.
Any ideas, please?
A SQL solution will join all data to an aggregating select:
proc sql;
create table want as
select have.group1, have.group2, have.group3, have.value
, aggregate.sum - value as sum
from
have
join
(select group1, group2, sum(value) as sum
from have
group by group1, group2
) aggregate
on
aggregate.group1 = have.group1
& aggregate.group2 = have.group2
;
quit;
SQL can be slower than a hash solution, but SQL code is understood by more people than a SAS DATA step involving hashes (which can be faster than SQL).
data want2;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group1', 'group2');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve group sum.;
sum = sum - value; * subtract from group sum;
output;
end;
stop;
run;
The SAS documentation touches on SUMINC and has some examples.
The question does not address this related concept:
For each row, compute the tier 2 sum that excludes the tier 3 group this row is in.
A hash-based solution would need to track both the two-level and the three-level sums:
data want2;
if 0 then set have; * prep pdv;
declare hash T2 (suminc:'value'); * hash for two (T)iers;
T2.defineKey('group1', 'group2'); * one hash record per combination of group1, group2;
T2.defineDone();
declare hash T3 (suminc:'value'); * hash for three (T)iers;
T3.defineKey('group1', 'group2', 'group3'); * one hash record per combination of group1, group2, group3;
T3.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
T2.ref(); * adds value to internal sum of hash data record;
T3.ref();
end;
T2_cardinality = T2.num_items;
T3_cardinality = T3.num_items;
put 'NOTE: |T2| = ' T2_cardinality;
put 'NOTE: |T3| = ' T3_cardinality;
do while (not last_have);
set have end=last_have;
T2.sum(sum:t2_sum);
T3.sum(sum:t3_sum);
sum = t2_sum - t3_sum;
output;
end;
stop;
drop t2_: t3_:;
run;
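The same tier 2 minus tier 3 calculation can also be sketched in PROC SQL by joining the detail rows to two aggregating selects. This is only an illustration of the idea above, not a tuned solution; the names want3, t2, t3, sum2 and sum3 are made up for the example.

```sas
proc sql;
  create table want3 as
  select h.group1, h.group2, h.group3, h.value
       , t2.sum2 - t3.sum3 as sum   /* tier 2 total minus this row's tier 3 total */
  from have as h
  join
    (select group1, group2, sum(value) as sum2
       from have
      group by group1, group2
    ) as t2
    on t2.group1 = h.group1
   and t2.group2 = h.group2
  join
    (select group1, group2, group3, sum(value) as sum3
       from have
      group by group1, group2, group3
    ) as t3
    on t3.group1 = h.group1
   and t3.group2 = h.group2
   and t3.group3 = h.group3
  ;
quit;
```

Every row of the same group1, group2, group3 combination gets the same result here, which differs from the desired output in the question, where only the row's own value is subtracted.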

SAS sum observations not in a group, by group

I have a data set :
data have;
input group $ value;
datalines;
A 4
A 3
A 2
A 1
B 1
C 1
D 2
D 1
E 1
F 1
G 2
G 1
H 1
;
run;
The first variable is a group identifier, the second a value.
For each group, I want a new variable "sum" with the sum of all values in the column, except for the group the observation is in.
My issue is having to do that on nearly 30 million observations, so efficiency matters.
I found that using a data step was more efficient than using procs.
The final dataset should look like:
data want;
input group $ value sum;
datalines;
A 4 11
A 3 11
A 2 11
A 1 11
B 1 20
C 1 20
D 2 18
D 1 18
E 1 20
F 1 20
G 2 18
G 1 18
H 1 20
;
run;
Any idea how to perform this please?
Edit: I don't know if this matters, but the example I gave is a simplified version of my issue. In the real case I have 2 other group variables, so taking the sum of the whole column and subtracting the sum in the group is not a viable solution.
The requirement
sum of all values in the column, except for the group the observation is in
indicates two passes of the data must occur:
1. Compute the all_sum and each group's group_sum. A hash can store each group's sum, computed via a specified suminc: variable and .ref() method invocation. A variable can accumulate allsum.
2. Compute allsum - group_sum for each row of a group. The group_sum is retrieved from the hash and subtracted from allsum.
Example:
data want;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
allsum + value;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve groups sum. Do you hear the Dragnet theme too?;
sum = allsum - sum; * subtract from allsum;
output;
end;
stop;
run;
What is wrong with a straightforward approach? You need to make two passes no matter what you do.
Like this. I included extra variables so you can see how the values are derived.
proc sql ;
create table want as
select a.*,b.grand,sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group
;
quit;
Results:
Obs group value grand total sum
1 A 3 21 10 11
2 A 1 21 10 11
3 A 2 21 10 11
4 A 4 21 10 11
5 B 1 21 1 20
6 C 1 21 1 20
7 D 2 21 3 18
8 D 1 21 3 18
9 E 1 21 1 20
10 F 1 21 1 20
11 G 1 21 3 18
12 G 2 21 3 18
13 H 1 21 1 20
Note it does not matter what you have as your GROUP BY clause.
Do you really need to output all of the original observations? Why not just output the summary table?
proc sql ;
create table want as
select a.group, sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group, b.grand
;
quit;
Results
Obs group total sum
1 A 10 11
2 B 1 20
3 C 1 20
4 D 3 18
5 E 1 20
6 F 1 20
7 G 3 18
8 H 1 20
I would break this out into two different segments:
1.) You could start by using PROC SQL to get the sums by the group
2.) Then use some IF/THEN statements to reassign the values by group
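A minimal sketch of those two steps follows. Instead of hand-written IF/THEN statements it merges the subtotals back by group, which is a swapped-in technique and not necessarily what this answer had in mind. The names group_sums, have_sorted, group_sum and the macro variable grand are all illustrative.

```sas
/* Step 1: subtotal per group, plus the grand total in a macro variable */
proc sql noprint;
  create table group_sums as
  select h.group, sum(h.value) as group_sum
  from have as h
  group by h.group
  order by h.group;

  select sum(value) format=best32. into :grand
  from have;
quit;

/* Step 2: merge the subtotals back onto the detail rows */
proc sort data=have out=have_sorted;
  by group;
run;

data want;
  merge have_sorted group_sums;
  by group;
  sum = &grand - group_sum;   /* everything outside this row's group */
  drop group_sum;
run;
```

Passing the grand total through a macro variable converts it to text and back, so for very large or fractional totals a join to a one-row summary table may be the safer choice.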

identifying the rows with maximum continuous values

I have two columns in a table. The second column has 1 or 0 depending on a predefined condition. Can someone help me with the logic to identify the longest continuous run of 1s? For example, in the table below the longest continuous run is between rows 7 and 18. Just the logic to identify this would be enough.
Thanks
Create the intervals.
data intervals ;
set have ;
by B NOTSORTED ;
if first.b then start=A ;
retain start ;
if last.b then do;
end = A ;
duration = end - start + 1 ;
output;
end;
drop A ;
run;
Then find the interval with the maximum duration. Perhaps you want the first occurrence of the maximum duration?
proc sort data=intervals out=want ;
by descending duration start;
run;
data want ;
set want (obs=1);
where B=1;
run;
Something like this:
data have;
input A B;
datalines;
1 0
2 0
3 1
4 1
5 1
6 0
7 0
8 0
9 1
10 0
11 1
12 1
13 1
14 1
15 1
16 1
17 0
18 0
19 0
20 1
21 0
;
proc sort data=have;
by A;
run;
data want;
set have;
if B=1 then count + 1;
if B = 0 then count = 0;
run;
proc sql;
select max(count) as max_value from want;
quit;
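The running count above gives the length of the longest run but not where it sits. As a sketch, the start and end rows can be tracked in the same pass. This assumes have is sorted by A and that A serves as the row identifier; the names longest_run, run_start, run_len and best_* are illustrative.

```sas
data longest_run (keep=best_start best_end best_len);
  set have end=last;
  /* all five variables are retained across rows and start at 0 */
  retain run_start run_len best_start best_end best_len 0;
  if B = 1 then do;
    if run_len = 0 then run_start = A;   /* a new run begins on this row */
    run_len = run_len + 1;
    if run_len > best_len then do;       /* strictly longer: keeps the first longest run */
      best_len   = run_len;
      best_start = run_start;
      best_end   = A;
    end;
  end;
  else run_len = 0;                      /* a 0 ends the current run */
  if last then output;                   /* one summary row */
run;
```

With the sample data above this should report the run from row 11 to row 16, of length 6.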

How to do a last observation carrying forward using SAS PROC SQL

I have the data below. I want to write a sas proc sql code to get the last non-missing values for each patient(ptno).
data sda;
input ptno visit weight;
format ptno z3. ;
cards;
1 1 122
1 2 123
1 3 .
1 4 .
2 1 156
2 2 .
2 3 70
2 4 .
3 1 60
3 2 .
3 3 112
3 4 .
;
run;
proc sql noprint;
create table new as
select ptno,visit,weight,
case
when weight = . then weight
else .
end as _weight_1
from sda
group by ptno,visit
order by ptno,visit;
quit;
The SQL code above does not work well.
The desired output data looks like this:
ptno visit weight
1 1 122
1 2 123
1 3 123
1 4 123
2 1 156
2 2 .
2 3 70
2 4 70
3 1 60
3 2 .
3 3 112
3 4 112
Since you do have effectively a row number (visit), you can do this - though it's much slower than the data step.
Here it is, broken out into a separate column for demonstration purposes - of course in your case you will want to coalesce this into one column.
Basically, you need a subquery that determines the maximum visit number less than the current one that does have a legitimate weight count, and then join that to the table to get the weight.
proc sql;
select ptno, visit, weight,
(
select weight
from sda A,
(select ptno, max(visit) as visit
from sda D
where D.ptno=S.ptno
and D.visit<S.visit
and D.weight is not null
group by ptno
) V
where A.visit=V.visit and A.ptno=V.ptno
)
from sda S
;
quit;
Although you don't describe it that way, you do not carry VISIT 1 forward, right?
I don't know why you would want to do this using SQL. In SAS, a data step is much better suited to the task. I like using the "update trick". If you're interested in how this works, I will leave it to you to study the UPDATE statement.
data locf;
update sda(obs=0 keep=ptno) sda;
by ptno;
output;
if visit eq 1 then call missing(weight);
run;
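If the UPDATE trick feels too opaque, the same result can be sketched with an explicit RETAIN. This is a different technique from the one above, shown only for comparison. It assumes sda is sorted by ptno and visit, and the names locf_retain and last_weight are illustrative.

```sas
data locf_retain;
  set sda;
  by ptno;
  retain last_weight;
  if first.ptno then last_weight = .;            /* never carry across patients */
  if not missing(weight) then do;
    /* remember a real measurement, except visit 1, which the
       desired output does not carry forward */
    if visit ne 1 then last_weight = weight;
  end;
  else weight = last_weight;                     /* fill the gap, if anything is remembered */
  drop last_weight;
run;
```

For patient 2 in the sample data, visit 2 stays missing because only the visit 1 value precedes it, and visit 4 picks up the 70 from visit 3.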

Create a variable based on sum of two variables (one lag)

I have a data set like the one below, where the amount has dropped off, but the adjustment remains. For each row amount should be the sum of the previous amount and the adjustment. So, amount for observation 5 is 134 (124+10).
I have an answer which gets me the next value, but I need some sort of recursion to get me the rest of the way there. What am I missing? Thanks.
data have;
input amount adjust;
cards;
100 0
101 1
121 20
124 3
. 10
. 4
. 3
. 0
. 1
;
run;
data attempt;
set have;
x=lag1(amount);
if amount=. then amount=adjust+x;
run;
data want;
input amount adjust;
cards;
100 0
101 1
121 20
124 3
134 10
138 4
141 3
141 0
142 1
;
run;
EDIT:
Also trying something like this now, still not quite what I want.
%macro doodoo;
%do i = 1 %to 5;
data have;
set have;
/* if _n_=i+4 then*/
amount=lag1(amount)+adjust;
run;
%end;
%mend;
%doodoo;
No need for LAG(); use RETAIN instead.
data want ;
set have ;
retain previous ;
if amount = . then amount=sum(previous,adjust);
previous=amount ;
run;