Create a variable based on sum of two variables (one lag) - sum

I have a data set like the one below, where the amount has dropped off but the adjustment remains. For each row, amount should be the sum of the previous amount and the adjustment. So amount for observation 5 is 134 (124 + 10).
I have an answer which gets me the next value, but I need some sort of recursion to get me the rest of the way there. What am I missing? Thanks.
data have;
input amount adjust;
cards;
100 0
101 1
121 20
124 3
. 10
. 4
. 3
. 0
. 1
;
run;
data attempt;
set have;
x=lag1(amount);
if amount=. then amount=adjust+x;
run;
data want;
input amount adjust;
cards;
100 0
101 1
121 20
124 3
134 10
138 4
141 3
141 0
142 1
;
run;
EDIT:
I'm also trying something like this now; it's still not quite what I want.
%macro doodoo;
%do i = 1 %to 5;
data have;
set have;
/* if _n_=i+4 then*/
amount=lag1(amount)+adjust;
run;
%end;
%mend;
%doodoo;

No need for LAG(); use RETAIN instead. (LAG1() returns the value AMOUNT had the last time the function executed, which for the previous row was still missing because the lag queue is filled before your IF statement updates AMOUNT, so only the first gap gets filled.)
data want;
set have;
retain previous;
if amount = . then amount=sum(previous,adjust);
previous=amount; /* carry the amount just computed (or read) forward to the next row */
run;

Related

How to sort negative and positive data in SAS

I have the below data in variable NUM:
-3 1 0 1 3 2 -2 5 -5 -6 4 6 -4
I want the data in NUM in the below sort order:
0 -1 1 -2 2 -3 3 -4 4 -5 5 -6 6
How can we sort negative and positive values together? Please help.
data have;
input NUM ##;
cards;
-3 1 0 1 3 2 -2 5 -5 -6 4 6 -4
;
run;
Sort by abs(num), num if you want the negative value to appear before the positive one within the same absolute value, as in the requested output.
data have;
input NUM ##;
cards;
-3 1 0 -1 3 2 -2 5 -5 -6 4 6 -4
;
run;
proc sql;
create table want as
select * from have
order by abs(num), num
;
quit;
Make a new variable with the absolute value and include it in the sort.
data want;
set have;
absolute=abs(num);
run;
proc sort data=want;
by absolute num;
run;

Identifying the rows with maximum continuous values

I have two columns in a table. The second column has 1 or 0 depending on a predefined condition. Can someone help me with the logic to identify the maximum continuous occurrence of 1s? For example, in the below table the maximum continuous occurrence is between rows 7 and 18. Just the logic to identify this would be enough.
Thanks
Create the intervals.
data intervals ;
set have ;
by B NOTSORTED ;
if first.b then start=A ;
retain start ;
if last.b then do;
end = A ;
duration = end - start + 1 ;
output;
end;
drop A ;
run;
Then find the interval with the maximum duration. Perhaps you want the first occurrence of the maximum duration?
proc sort data=intervals out=want ;
by descending duration start;
run;
data want ;
set want (obs=1);
where B=1;
run;
Something like this:
data have;
input A B;
datalines;
1 0
2 0
3 1
4 1
5 1
6 0
7 0
8 0
9 1
10 0
11 1
12 1
13 1
14 1
15 1
16 1
17 0
18 0
19 0
20 1
21 0
;
proc sort data=have;
by A;
run;
data want;
set have;
if B=1 then count + 1;
if B = 0 then count = 0;
run;
proc sql;
select max(count) as max_value from want;
quit;
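If you also need to know which rows make up that longest run, a small follow-up on the counter dataset above can recover it. This is just a sketch: it assumes A increases by 1 from row to row, and the names maxrun and max_run are illustrative. The last row of each longest run is where count hits the maximum, so its start row is A - count + 1.
proc sql noprint;
select max(count) into :maxrun from want;
quit;
data max_run;
set want;
if count = &maxrun; /* keep only the final row of each longest run */
start_row = A - count + 1;
end_row = A;
run;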

SAS hierarchical structure sum

I have a dataset with a hierarchical codelist variable.
The logic of the hierarchy is determined by the LEVEL variable and the prefix structure of the CODE character variable.
There are 6 "aggregate" levels (code lengths 1 to 6) and the terminal level (code length of 10 characters).
I need to update the nodes variable (the count of terminal nodes - the aggregate levels do not count toward the "higher" aggregates, only the terminal nodes do), so the counts add up to the same total at every level; for example, the level 5 totals sum to the same value as the level 6 totals.
And I need to roll up (sum) the weight to the "higher"-level nodes.
NOTE: I offset the output table's NODES and WEIGHT variables so you can see better what I am talking about (just add up the numbers in each offset and you get the same value).
EDIT1: There can be multiple observations with the same code. A unique observation is a combination of 3 variables: code + var1 + var2.
Input table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 999 999 999
2 2 11 . . 999 999 999
3 3 111 . . 999 999 999
4 4 1111 . . 999 999 999
5 5 11111 . . 999 999 999
6 6 111111 . . 999 999 999
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 999 999 999
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 999 999 999
12 6 111190 . . 999 999 999
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
Desired output table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 5 1.0 98,1
2 2 11 . . 5 1.0 98,1
3 3 111 . . 5 1.0 98,1
4 4 1111 . . 5 1.0 98,1
5 5 11111 . . 3 0.7 98,5
6 6 111111 . . 2 0.2 107,3
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 1 0.5 95,0
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 2 0.3 97,2
12 6 111190 . . 2 0.3 97,2
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
And here's the code I came up with. It works just like I wanted, but man, it is really slow. I need something way faster, because this is a part of a webservice which has to run "instantly" on request.
Any suggestions on speeding up the code, or any other solutions are welcome.
%macro doit;
data temporary;
set have;
run;
%do i=6 %to 2 %by -1;
%if &i = 6 %then %let x = 10;
%else %let x = (&i+1);
proc sql noprint;
select count(code)
into :cc trimmed
from have
where level = &i;
select code
into :id1 - :id&cc
from have
where level = &i;
quit;
%do j=1 %to &cc.;
%let idd = &&id&j;
proc sql;
update have t1
set nodes = (
select sum(nodes)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%")),
weight = (
select sum(weight)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%"))
where (t1.level = &i and t1.code like "&idd");
quit;
%end;
%end;
%mend doit;
Current code based on #Quentin's solution:
data have;
input ID level code : $10. nodes weight myIndex;
cards;
1 1 1 . . .
2 2 11 . . .
3 3 111 . . .
4 4 1111 . . .
5 5 11111 . . .
6 6 111111 . . .
7 10 1111110000 1 0.1 105.5
8 10 1111119999 1 0.1 109.1
9 6 111112 . . .
10 10 1111129999 1 0.5 95.0
11 5 11119 . . .
12 6 111190 . . .
13 10 1111900000 1 0.1 80.7
14 10 1111901000 1 0.2 105.5
;
data want (drop=_:);
*hash table of terminal nodes;
if (_n_ = 1) then do;
if (0) then set have (rename=(code=_code weight=_weight myIndex=_myIndex));
declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight myIndex=_myIndex))');
declare hiter iter('h');
h.definekey('ID');
h.definedata('_code','_weight','_myIndex');
h.definedone();
end;
set have;
*for each non-terminal node, iterate through;
*hash table of all terminal nodes, looking for children;
if level ne 10 then do;
call missing(weight, nodes, myIndex);
do _n_ = iter.first() by 0 while (_n_ = 0);
if trim(code) =: _code then do;
weight=sum(weight,_weight);
nodes=sum(nodes,1);
myIndex=sum(myIndex,_myIndex*_weight);
end;
_n_ = iter.next();
end;
myIndex=round(myIndex/weight,.1);
end;
output;
run;
Here's an alternative hash approach.
Rather than using a hash object to do a cartesian join, this adds the nodes & weight from each level 10 node to each of the 6 applicable parent nodes as it goes along. This may be marginally faster than Quentin's approach as there are no redundant hash lookups.
It takes a bit longer than Quentin's approach when constructing the hash object, and uses a bit more memory, as each terminal node is added 6 times with different keys and existing entries often have to be updated, but afterwards each parent node only has to look up its own individual stats, rather than looping through all the terminal nodes, which is a substantial saving.
Weighted stats are possible as well, but you have to update both loops, not just the second one.
data want;
if 0 then set have;
dcl hash h();
h.definekey('code');
h.definedata('nodes','weight','myIndex');
h.definedone();
length t_code $10;
do until(eof);
set have(where = (level = 10)) end = eof;
t_nodes = nodes;
t_weight = weight;
t_myindex = weight * myIndex;
do _n_ = 1 to 6;
t_code = substr(code,1,_n_);
if h.find(key:t_code) ne 0 then h.add(key:t_code,data:t_nodes,data:t_weight,data:t_myIndex);
else do;
nodes + t_nodes;
weight + t_weight;
myIndex + t_myIndex;
h.replace(key:t_code,data:nodes,data:weight,data:MyIndex);
end;
end;
end;
do until(eof2);
set have end = eof2;
if level ne 10 then do;
h.find();
myIndex = round(MyIndex / Weight,0.1);
end;
output;
end;
drop t_:;
run;
Below is a brute-force hash approach to doing a similar Cartesian product as in the SQL. Load a hash table of the terminal nodes. Then read through the dataset of nodes, and for each node that is not a terminal node, iterate through the hash table, identifying all of the child terminal nodes.
I think the approach #joop is describing may be more efficient, as this approach doesn't take advantage of the tree structure, so there is a lot of re-calculating. With 5000 records and 3000 terminal nodes, this would do 2000*3000 comparisons. But it might not be that slow, since the hash table is in memory, so you're not going to have excessive I/O.
data want (drop=_:);
*hash table of terminal nodes;
if (_n_ = 1) then do;
if (0) then set have (rename=(code=_code weight=_weight));
declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight))');
declare hiter iter('h');
h.definekey('ID');
h.definedata('_code','_weight');
h.definedone();
end;
set have;
*for each non-terminal node, iterate through;
*hash table of all terminal nodes, looking for children;
if level ne 10 then do;
call missing(weight, nodes);
do _n_ = iter.first() by 0 while (_n_ = 0);
if trim(code) =: _code then do;
weight=sum(weight,_weight);
nodes=sum(nodes,1);
end;
_n_ = iter.next();
end;
end;
output;
run;
It seems pretty simple. Just join back with itself and count/sum.
proc sql ;
create table want as
select a.id, a.level, a.code , a.var1, a.var2
, count(b.id) as nodes
, sum(b.weight) as weight
from have a
left join have b
on a.code eqt b.code
and b.level=10
group by 1,2,3,4,5
order by 1
;
quit;
If you don't want to use the EQT operator then you can use the SUBSTR() function instead.
on a.code = substr(b.code,1,a.level)
and b.level=10
Since you're using SAS, how about using proc summary to do the heavy lifting here? No cartesian joins required!
One advantage of this option over the some of the others is that it's a bit easier to generalise if you want to calculate lots of more complex statistics for multiple variables.
data have;
input ID level code : $10. nodes weight myIndex;
format myIndex 5.1;
cards;
1 1 1 . . .
2 2 11 . . .
3 3 111 . . .
4 4 1111 . . .
5 5 11111 . . .
6 6 111111 . . .
7 10 1111110000 1 0.1 105.5
8 10 1111119999 1 0.1 109.1
9 6 111112 . . .
10 10 1111129999 1 0.5 95.0
11 5 11119 . . .
12 6 111190 . . .
13 10 1111900000 1 0.1 80.7
14 10 1111901000 1 0.2 105.5
;
run;
data v_have /view = v_have;
set have(where = (level = 10));
array lvl[6] $6;
do i = 1 to 6;
lvl[i]=substr(code,1,i);
end;
drop i;
run;
proc summary data = v_have;
class lvl1-lvl6;
var nodes weight;
var myIndex /weight = weight;
ways 1;
output out = summary(drop = _:) sum(nodes weight)= mean(myIndex)=;
run;
data v_summary /view = v_summary;
set summary;
length code $10;
code = cats(of lvl:);
drop lvl:;
run;
data have;
modify have v_summary;
by code;
replace;
run;
In theory a hash of hashes might also be an appropriate data structure, but that would be extremely complicated for a relatively small benefit. I might have a go anyway just as a learning exercise...
One approach (I think) would be to make the Cartesian product, and find all of the terminal nodes that are a "match" to each of the nodes, then sum the weights.
Something like:
data have;
input ID level code : $10. nodes weight ;
cards;
1 1 1 . .
2 2 11 . .
3 3 111 . .
4 4 1111 . .
5 5 11111 . .
6 6 111111 . .
7 10 1111110000 1 0.1
8 10 1111119999 1 0.1
9 6 111112 . .
10 10 1111129999 1 0.5
11 5 11119 . .
12 6 111190 . .
13 10 1111900000 1 0.1
14 10 1111901000 1 0.2
;
proc sql;
select min(id) as id
, min(level) as level
, a.code
, count(b.weight) as nodes /*count of terminal nodes*/
, sum(b.weight) as weight /*sum of weights of terminal nodes*/
from
have as a
,(select code , weight
from have
where level=10 /*selects terminal nodes*/
) as b
where a.code eqt b.code /*EQT is equivalent to =: */
group by a.code
;
quit;
I'm not sure that is correct, but it gives the desired results for the sample data.
This is the SQL needed to estimate the parent record for every record. It only uses string functions (POSITION and LENGTH), so it should be adaptable to any dialect of SQL, maybe even SAS (the CTE might need to be rewritten as subqueries or a view). The idea is to:
add a parent_id field to the dataset
find the record with the longest substring of code
and use its id as the value for our parent_id
(after that) update the records from the sum(nodes),sum(weight) of their direct children (the ones with child.parent_id = this.id )
BTW: I could have used LEVEL instead of LENGTH(code); the data is a bit redundant in this respect.
WITH sub AS (
SELECT id, length(code) AS len
, code
FROM tree)
UPDATE tree t
SET parent_id = s.id
FROM sub s
WHERE length(t.code) > s.len AND POSITION (s.code IN t.code) = 1
AND NOT EXISTS (
SELECT *
FROM sub nx
WHERE nx.len > s.len AND POSITION (nx.code IN t.code ) = 1
AND nx.len < length(t.code) AND POSITION (nx.code IN t.code ) = 1
)
;
SELECT * FROM tree
ORDER BY parent_id DESC NULLS LAST
, id
;
After finding the parents, the whole table should be updated (repeatedly) from itself
like:
-- PREPARE omg( integer) AS
UPDATE tree t
SET nodes = s.nodes , weight = s.weight
FROM ( SELECT parent_id , SUM(nodes) AS nodes , SUM(weight) AS weight
FROM tree GROUP BY parent_id) s
WHERE s.parent_id = t.id
;
In SAS, this could probably be done by sorting on {0-parent_id, id} and doing some retain+summation magic (my SAS is a bit rusty in this area); a rough sketch of that idea follows.
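Here is one hedged sketch of that retain/summation idea, assuming a parent_id variable has already been derived as above; the dataset and variable names (tree, kids, parent_sums, sum_nodes, sum_weight) are illustrative. Like the repeated SQL UPDATE, it rolls up one level per pass: the totals would then be merged back onto the parent rows and the pass repeated, so the hash and PROC SUMMARY answers above remain the more direct SAS routes.
proc sort data=tree out=kids;
by parent_id;
run;
data parent_sums;
set kids;
by parent_id;
if first.parent_id then do;
sum_nodes = 0;
sum_weight = 0;
end;
sum_nodes + nodes; /* the sum statement ignores missing values */
sum_weight + weight;
if last.parent_id then output;
keep parent_id sum_nodes sum_weight;
run;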
UPDATE: if only the leaf nodes have non-NULL (non-missing) values for {nodes, weight}, the aggregation can be done in one sweep for the entire tree, without first computing the parent_ids:
UPDATE tree t
SET nodes = s.nodes , weight = s.weight
FROM ( SELECT p.id , SUM(c.nodes) AS nodes , SUM(c.weight) AS weight
FROM tree p
JOIN tree c ON c.lev > p.lev AND POSITION (p.code IN c.code ) = 1
GROUP BY p.id
) s
WHERE s.id = t.id
;
An index on {lev, code} will probably speed things up (assuming an index on id).

SAS table with percentage attached

I am trying to create a matrix with both numeric and percentage results. I was given two tables:
id cc
1 2
1 5
1 40
2 55
2 2
2 130
2 177
3 20
3 55
3 40
4 30
4 100
id Description
1 Dell
1 Lenovo
1 HP
2 Sony
2 Dell
2 Acer
2 Other
3 Fujitsu
3 Sony
3 HP
4 Apple
4 Asus
I have already created a table that looks like the one below; the code I used follows it.
CC CC1 CC2… …CC177
1 264 5 0
2 0 132 6
…
…
177 2 1 692
data RESULT;
set id_CC;
by id;
retain CC1-CC177; /*CC range from 1 to 177*/
array CC_List(177) CC1-CC177;
if first.id then do i=1 to 177;
CC_List(i)=0;
end;
CC_List(CC)=1;
if last.id then output;
run;
ods output sscp=coocs;
ods select sscp;
proc corr data=RESULT sscp;
var CC1-CC177;
run;
/*proc print data=coocs;*/
/*run;*/
/**/
In other words, how many ids that have CC1 also have CC2, ..., CC177, etc. Now I am wondering whether it's doable to add the percentage next to each number. For instance, if CC1*CC1 = 264 (100%), then CC1*CC2 = 5/264 = 1.9%.
Another table I am trying to create has the description of each CC on the matrix. Each CC number stands for one brand: 2=Dell, 177=Other, etc.
If I want to change CC1, CC2, ... to character (brand) headings, how do I modify the arrays? Eventually, I would like my table to look like this:
Description Dell Lenovo HP Sony Acer Other Fujitsu Sony
Dell 264 (100%)
Lenovo
HP 50 (10%)
Sony
Acer
Other
Fujitsu
Sony
In other words, how many people who have Dell also have Acer, Sony, Other, etc.?
The rename is a question that's been asked on here before, so I'll leave that one for now.
For the percentages you'll need to create a character variable. To calculate the percent, use the automatic variable _n_, which is the row number and also indexes the denominator (the diagonal element) for your calculation. Then use a concatenation function such as CATS to create the variable in the format N (PP%).
data want;
set have;
array cc(177) cc1-cc177;
array dd(177) $ dd1-dd177;
do i=1 to 177;
percent=cc(i)/cc(_n_);
dd(i)=cats(cc(i), "(", put(percent, percent8.1), ")");
end;
run;
In response to Reeza's answer, I tried:
data RESULT_PRE;
set ID_CC;
by ID;
retain CC1-CC177;
array CC_LIST(177) CC1-CC177;
array DD_LIST(177) $ DD1-DD177;
if first.id then do i=1 to 177;
CC_LIST(i)=0;
end;
CC_LIST(CC)=1;
if last.id then output;
run;
data RESULT;
set RESULT_PRE;
array CC_LIST(177) CC1-CC177;
array DD_LIST(177) $ DD1-DD177;
do i=1 to 177;
percent=CC_LIST(i)/CC_LIST(_n_);
DD_LIST(i)=cats(CC_LIST(i), "(", put(percent, percent8.1), ")");
end;
run;
The log shows "Array subscript out of range at line xx column xx" and "ERROR 68-185: The function CC is unknown, or cannot be accessed."

Tracking customer retention on a weekly basis

I have start and end weeks for a given customer and I need to make panel data for the weeks they are subscribed. I have manipulated the data into an easy form to convert, but when I transpose I do not get the weeks in between the start and end filled in. Hopefully an example will shed some light on my request. Weeks start at 0 and end at 61, so I forced any week above 61 to be 61, again for simplicity. Populate with a 1 if they are still subscribed and a blank if not.
ID Start_week End_week
1 6 61
2 0 46
3 45 61
What I would like:
ID week0 week1 ... week6 ... week45 week46 week47 ... week61
1 . . ... 1 ... 1 1 1 ... 1
2 1 1 ... 1 ... 1 1 0 ... 0
3 0 0 ... 0 ... 1 1 1 ... 1
I see two ways to do it.
I would go for an array approach, since it will probably be the fastest (single data step) and is not that complex:
data RESULT (drop=start_week end_week);
set YOUR_DATA;
array week_array{62} week0-week61;
do week=0 to 61;
if start_week <= week <= end_week then week_array[week+1]=1;
else week_array[week+1]=0;
end;
run;
Alternatively, you can prepare a table for PROC TRANSPOSE to work on by creating one record per week per id (a possible transpose call is sketched after this step):
data BEFORE_TRANSPOSE (drop=start_week end_week);
set YOUR_DATA;
do week=0 to 61;
if start_week <= week <= end_week then subscribed=1;
else subscribed=0;
output;
end;
run;
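The transpose step itself might then look something like this; it's a sketch that assumes BEFORE_TRANSPOSE is sorted by id, and the output name AFTER_TRANSPOSE is illustrative.
proc transpose data=BEFORE_TRANSPOSE out=AFTER_TRANSPOSE(drop=_name_) prefix=week;
by id;
id week;
var subscribed;
run;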
Use an array to create the variables. The one gotcha is that SAS arrays are 1-indexed.
data input;
input ID Start_week End_week;
datalines;
1 6 61
2 0 46
3 45 61
;
data output;
array week[62] week0-week61;
set input;
do i=1 to 62;
if i > start_week and i<= (end_week+1) then
week[i] = 1;
else
week[i] = 0;
end;
drop i;
run;
I have no working syntax, but here is a guideline for you.
First make a table, either with a CTE or physically, with the numbers 0 to 61 as rows. Then join this table with the subscription table. Something like:
FROM sub
INNER JOIN CTE
ON CTE.week BETWEEN sub.Start_week AND sub.End_week
Now you will have a row for every week a customer is subscribed. Transpose that and the in-between weeks will also be filled in.
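In SAS terms, that guideline might look roughly like the sketch below. PROC SQL has no CTEs, so a small weeks table is built first; the names sub, weeks and long are assumptions based on the snippet above, and BETWEEN-AND is fine here because it sits inside an SQL expression. A PROC TRANSPOSE like the one in the previous answer would then pivot the result into one column per week.
data weeks;
do week = 0 to 61;
output;
end;
run;
proc sql;
create table long as
select s.id, w.week, 1 as subscribed
from sub as s
inner join weeks as w
on w.week between s.Start_week and s.End_week;
quit;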