Choose first occurrence and then choose the next one down - sql

In SAS (Data Step) or Proc SQL, I want to choose the first occurrence of TransB based on DaysBetweenTrans first and then Flag. If a TransB has already been chosen, I want the next available one. I also want TransA to be unique, i.e. each TransA appears in only one output row and each TransB appears in only one output row.
For example, the original table looks like this:
TransA  TransB  DaysBetweenTrans  Flag
A       1       1                 1
A       2       1                 1
B       1       3                 1
B       2       2                 1
B       3       3                 1
C       1       1                 1
C       3       4                 1
but I want only:
TransA  TransB  DaysBetweenTrans  Flag
A       2       1                 1
B       1       3                 1
C       3       4                 1
I tried sorting by TransA and dedupkey, then sorting by TransB and dedupkey, but no luck. The other way I thought of was to use first.TransA and output, join back on the original table, remove any TransA already chosen, and repeat, but there has to be a better way.

You might want to look into the SAS procedures for optimization, as the straightforward approach of taking the next best match for the current case might not find the best overall solution.
Here is an approach that uses a HASH to keep track of which targets have already been assigned.
It is not totally clear to me what your preference for ordering is, but here is one method. It sounds like you want to find the best match for TRANSB=1, then for TRANSB=2, etc.
data have;
  input TransA $ TransB $ DaysBetweenTrans Flag;
cards;
A 1 1 0
A 2 1 1
B 1 3 1
B 2 2 1
B 3 3 1
C 1 1 1
C 3 4 1
;
/* Within each TransB, put the preferred candidate first */
proc sort data=have;
  by transB daysbetweentrans descending flag transA;
run;
data _null_;
  if _n_=1 then do;
    /* Hash keyed on TransA records which TransA values are already assigned */
    declare hash h(ordered:'Y');
    rc=h.definekey('transA');
    rc=h.definedata('transA','transB','daysbetweentrans','flag');
    rc=h.definedone();
  end;
  set have end=eof;
  by transB;
  if first.transB then found=0;
  retain found;
  /* h.add() returns 0 on success, i.e. when this TransA is not yet in the hash */
  if not found then if not h.add() then found=1;
  if eof then do;
    rc=h.output(dataset:'want');
  end;
run;
Results:
Obs  TransA  TransB  DaysBetweenTrans  Flag
1    A       2       1                 1
2    B       3       3                 1
3    C       1       1                 1
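The same greedy logic can be sketched outside SAS. The following is a minimal Python illustration, not the answerer's code: the function name `greedy_assign` and the tuple layout are assumptions made for the example, and a plain dictionary stands in for the hash object. It uses the same rows as the HAVE dataset above and reproduces the results shown.

```python
from itertools import groupby

# Same rows as the HAVE dataset above: (TransA, TransB, DaysBetweenTrans, Flag)
rows = [
    ("A", "1", 1, 0),
    ("A", "2", 1, 1),
    ("B", "1", 3, 1),
    ("B", "2", 2, 1),
    ("B", "3", 3, 1),
    ("C", "1", 1, 1),
    ("C", "3", 4, 1),
]


def greedy_assign(rows):
    """Give each TransB the best-ranked row whose TransA is still unassigned."""
    # Mirrors the sort order: transB, daysbetweentrans, descending flag, transA
    ordered = sorted(rows, key=lambda r: (r[1], r[2], -r[3], r[0]))
    chosen = {}  # TransA -> row; plays the role of the hash object
    for _, candidates in groupby(ordered, key=lambda r: r[1]):
        for row in candidates:
            if row[0] not in chosen:
                chosen[row[0]] = row
                break  # this TransB is matched; move on to the next TransB
    # Return rows ordered by TransA, like the ordered hash output
    return [chosen[a] for a in sorted(chosen)]


for row in greedy_assign(rows):
    print(*row)
```

As with the SAS version, this is a greedy pass: an early assignment can block a better overall matching, which is why an optimization procedure may be needed for the general case.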

Related

How to create 2 datalines in sas with different length

I want to create a table like that:
a 1 2 3
b 1 2 3 4
a has 3 values, b has 4.
How can I do it in SAS?
When I enter it like that it deletes the 4 at the end.
data my_data;
input a b;
datalines;
1 1
2 2
3 3
4
I am very new to SAS. Thanks for your advice.
If you want to use LIST MODE input, like in your example, then each variable needs to have a "word" on the line. Use a period to indicate the missing values.
data my_data;
input a b;
datalines;
1 1
2 2
3 3
. 4
;
Otherwise switch to COLUMN MODE input. The last data line needs leading blanks so that the 4 falls in columns 3-4.
data my_data;
input a 1-2 b 3-4 ;
datalines;
1 1
2 2
3 3
  4
;
Or FORMATTED MODE, again with leading blanks on the last data line.
data my_data;
input a 2. b 2.;
datalines;
1 1
2 2
3 3
  4
;
Note that you can use the period to indicate a missing value even when the variable is character. This is because the normal character informat will convert that single period into a blank value.
data my_data;
input a $ b;
datalines;
1 1
2 2
3 3
. 4
;
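The difference between the input modes can be mimicked in a few lines of Python. This is only a rough analogy, not how SAS is implemented: the helper names `list_input` and `column_input` and the column spans are assumptions made for the illustration.

```python
def list_input(line):
    """List mode: split on blanks; a lone period stands for a missing value."""
    return [None if token == "." else int(token) for token in line.split()]


def column_input(line, spans=((0, 2), (2, 4))):
    """Column mode: read fixed character positions; a blank field is missing."""
    values = []
    for start, end in spans:
        field = line[start:end].strip()
        values.append(int(field) if field else None)
    return values


print(list_input(". 4"))    # the period keeps 4 in the second variable
print(column_input("  4"))  # leading blanks keep 4 in columns 3-4
print(column_input("4"))    # without them, 4 lands in the first variable
```

In list mode the position on the line does not matter, only the order of the words, which is why a placeholder is needed for the missing first value.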

SAS sum observations not in a group, by multiple groups

This post follows this one: SAS sum observations not in a group, by group
Sadly, my minimal example there was a bit too minimal, and I wasn't able to apply it to my data.
Here is a complete case example, what I have is :
data have;
input group1 group2 $ group3 $ value;
datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
1 B Z 2
1 B Z 1
1 C Y 1
1 C Y 6
1 C Z 7
2 A Z 3
2 A Z 9
2 A Y 2
2 B X 8
2 B X 5
2 B X 5
2 B Z 7
2 C Y 2
2 C X 1
;
run;
For each group, I want a new variable "sum" with the sum of all values in the column for the same subgroups (group1 and group2), except for the group (group3) the observation is in.
data want;
input group1 group2 $ group3 $ value sum;
datalines;
1 A X 2 8
1 A X 4 6
1 A Y 1 9
1 A Y 3 7
1 B Z 2 1
1 B Z 1 2
1 C Y 1 13
1 C Y 6 8
1 C Z 7 7
2 A Z 3 11
2 A Z 9 5
2 A Y 2 12
2 B X 8 17
2 B X 5 20
2 B X 5 20
2 B Z 7 18
2 C Y 2 1
2 C X 1 2
;
run;
My goal is to use either data steps or PROC SQL (I am doing this on around 30 million observations, and PROC MEANS and similar procedures seemed slower on previous similar computations).
My issue with the solutions provided in the linked post is that they use the total value of the column, and I don't know how to change this to use the total within the subgroup.
Any ideas, please?
A SQL solution will join all data to an aggregating select:
proc sql;
  create table want as
  select have.group1, have.group2, have.group3, have.value
       , aggregate.sum - have.value as sum
  from
    have
  join
    (select group1, group2, sum(value) as sum
     from have
     group by group1, group2
    ) aggregate
  on
    aggregate.group1 = have.group1
    and aggregate.group2 = have.group2
  ;
quit;
SQL can be slower than a hash solution, but SQL code is understood by more people than SAS DATA step code involving hashes (which can be faster than SQL).
data want2;
  if 0 then set have; * prep pdv;
  declare hash sums (suminc:'value');
  sums.defineKey('group1', 'group2');
  sums.defineDone();
  do while (not hash_loaded);
    set have end=hash_loaded;
    sums.ref(); * adds value to internal sum of hash data record;
  end;
  do while (not last_have);
    set have end=last_have;
    sums.sum(sum:sum); * retrieve group sum.;
    sum = sum - value; * subtract from group sum;
    output;
  end;
  stop;
run;
The SAS documentation touches on SUMINC and has some examples.
The question does not address this concept:
For each row, compute the tier 2 sum that excludes the tier 3 this row is in.
A hash-based solution would require tracking both the two-level and three-level sums:
data want2;
  if 0 then set have; * prep pdv;
  declare hash T2 (suminc:'value'); * hash for two (T)iers;
  T2.defineKey('group1', 'group2'); * one hash record per combination of group1, group2;
  T2.defineDone();
  declare hash T3 (suminc:'value'); * hash for three (T)iers;
  T3.defineKey('group1', 'group2', 'group3'); * one hash record per combination of group1, group2, group3;
  T3.defineDone();
  do while (not hash_loaded);
    set have end=hash_loaded;
    T2.ref(); * adds value to internal sum of hash data record;
    T3.ref();
  end;
  T2_cardinality = T2.num_items;
  T3_cardinality = T3.num_items;
  put 'NOTE: |T2| = ' T2_cardinality;
  put 'NOTE: |T3| = ' T3_cardinality;
  do while (not last_have);
    set have end=last_have;
    T2.sum(sum:t2_sum);
    T3.sum(sum:t3_sum);
    sum = t2_sum - t3_sum;
    output;
  end;
  stop;
  drop t2_: t3:;
run;
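To make the difference between the two readings concrete, here is a minimal Python sketch (not SAS, and not the answerer's code) that runs both on the question's data. The function name `subgroup_sums` is an assumption made for the example, and dictionaries stand in for the T2 and T3 hash objects.

```python
from collections import defaultdict

# Rows of the HAVE dataset: (group1, group2, group3, value)
have = [
    (1, "A", "X", 2), (1, "A", "X", 4), (1, "A", "Y", 1), (1, "A", "Y", 3),
    (1, "B", "Z", 2), (1, "B", "Z", 1),
    (1, "C", "Y", 1), (1, "C", "Y", 6), (1, "C", "Z", 7),
    (2, "A", "Z", 3), (2, "A", "Z", 9), (2, "A", "Y", 2),
    (2, "B", "X", 8), (2, "B", "X", 5), (2, "B", "X", 5), (2, "B", "Z", 7),
    (2, "C", "Y", 2), (2, "C", "X", 1),
]


def subgroup_sums(rows):
    """Return (tier 2 sum minus own value, tier 2 sum minus own tier 3 sum)."""
    t2 = defaultdict(int)  # one entry per (group1, group2)
    t3 = defaultdict(int)  # one entry per (group1, group2, group3)
    # First pass: accumulate the sums, like the .ref() loop
    for g1, g2, g3, value in rows:
        t2[(g1, g2)] += value
        t3[(g1, g2, g3)] += value
    # Second pass: look the sums up for every row, like the .sum() loop
    minus_value = [t2[(g1, g2)] - value for g1, g2, _, value in rows]
    minus_tier3 = [t2[(g1, g2)] - t3[(g1, g2, g3)] for g1, g2, g3, _ in rows]
    return minus_value, minus_tier3


minus_value, minus_tier3 = subgroup_sums(have)
print(minus_value)
print(minus_tier3)
```

The first list matches the `sum` column in the question's WANT dataset, which is what the SQL and single-hash solutions compute; the second list is the tier 2 sum excluding the whole tier 3 group, as in the two-hash version.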

How to sum values of two columns by an ID column, keeping some columns with repeated values and excluding others?

I need to organize a large df by summing the values of a column by an ID column (the ID is not sequential), keeping the columns of the df that have repeated values within each ID and excluding the columns that have different values within each ID. Below I have inserted a reproducible example and the output I need. I think there is a simple way to do that, but I am not so familiar with R.
df=read.table(textConnection("
ID spp effort generalist specialist
1 a 10 1 0
1 b 10 1 0
1 c 10 0 1
1 d 10 0 1
2 a 16 1 0
2 b 16 1 0
2 e 16 0 1
"), header = TRUE)
The output I need:
ID effort generalist specialist
1 10 2 2
2 16 2 1

SAS sum observations not in a group, by group

I have a data set :
data have;
input group $ value;
datalines;
A 4
A 3
A 2
A 1
B 1
C 1
D 2
D 1
E 1
F 1
G 2
G 1
H 1
;
run;
The first variable is a group identifier, the second a value.
For each group, I want a new variable "sum" with the sum of all values in the column, except for the group the observation is in.
My issue is having to do that on nearly 30 million observations, so efficiency matters.
I found that using a data step was more efficient than using procs.
The final dataset should look like:
data want;
input group $ value sum;
datalines;
A 4 11
A 3 11
A 2 11
A 1 11
B 1 20
C 1 20
D 2 18
D 1 18
E 1 20
F 1 20
G 2 18
G 1 18
H 1 20
;
run;
Any idea how to perform this please?
Edit: I don't know if this matters, but the example I gave is a simplified version of my issue. In the real case, I have 2 other group variables, so taking the sum of the whole column and subtracting the sum in the group is not a viable solution.
The requirement
sum of all values in the column, except for the group the observation is in
indicates two passes of the data must occur:
1. Compute allsum and each group's group_sum. A hash can store each group's sum, computed via a specified suminc: variable and .ref() method invocation. A variable can accumulate allsum.
2. Compute allsum - group_sum for each row of a group. The group_sum is retrieved from the hash and subtracted from allsum.
Example:
data want;
  if 0 then set have; * prep pdv;
  declare hash sums (suminc:'value');
  sums.defineKey('group');
  sums.defineDone();
  do while (not hash_loaded);
    set have end=hash_loaded;
    sums.ref(); * adds value to internal sum of hash data record;
    allsum + value;
  end;
  do while (not last_have);
    set have end=last_have;
    sums.sum(sum:sum); * retrieve groups sum. Do you hear the Dragnet theme too?;
    sum = allsum - sum; * subtract from allsum;
    output;
  end;
  stop;
run;
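The two passes can also be sketched in Python to check the arithmetic. This is an illustration only, with a dictionary standing in for the hash object; the function name `sum_outside_group` is an assumption made for the example.

```python
from collections import defaultdict

# Rows of the HAVE dataset: (group, value)
have = [
    ("A", 4), ("A", 3), ("A", 2), ("A", 1),
    ("B", 1), ("C", 1), ("D", 2), ("D", 1),
    ("E", 1), ("F", 1), ("G", 2), ("G", 1), ("H", 1),
]


def sum_outside_group(rows):
    """For each row, return (group, value, allsum - group_sum)."""
    group_sum = defaultdict(int)
    allsum = 0
    # Pass 1: accumulate the grand total and each group's total
    for group, value in rows:
        group_sum[group] += value
        allsum += value
    # Pass 2: subtract the row's group total from the grand total
    return [(group, value, allsum - group_sum[group]) for group, value in rows]


for row in sum_outside_group(have):
    print(*row)
```

The grand total is 21, so every row of group A gets 21 - 10 = 11 and every row of group G gets 21 - 3 = 18.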
What is wrong with a straightforward approach? You need to make two passes no matter what you do.
Like this. I included extra variables so you can see how the values are derived.
proc sql ;
create table want as
select a.*,b.grand,sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group
;
quit;
Results:
Obs group value grand total sum
1 A 3 21 10 11
2 A 1 21 10 11
3 A 2 21 10 11
4 A 4 21 10 11
5 B 1 21 1 20
6 C 1 21 1 20
7 D 2 21 3 18
8 D 1 21 3 18
9 E 1 21 1 20
10 F 1 21 1 20
11 G 1 21 3 18
12 G 2 21 3 18
13 H 1 21 1 20
Note that it does not matter which variables you have in your GROUP BY clause; the same pattern applies.
Do you really need to output all of the original observations? Why not just output the summary table?
proc sql ;
create table want as
select a.group, sum(a.value) as total, b.grand - sum(a.value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group, b.grand
;
quit;
Results
Obs group total sum
1 A 10 11
2 B 1 20
3 C 1 20
4 D 3 18
5 E 1 20
6 F 1 20
7 G 3 18
8 H 1 20
I would break this out into two different segments:
1.) You could start by using PROC SQL to get the sums by the group
2.) Then use some IF/THEN statements to reassign the values by group

Pig Latin Count difference between two tables

I have one table loaded twice to perform a self join called current and previous. Both contain columns "key" (not unique) and "value". I have grouped by key, and counted the number of values in each group of keys.
I would like to find how many more values were added to the current table compared to the previous table, but I get the error "Invalid scalar projection: cur_count : A column needs to be projected from a relation for it to be used as a scalar". I am relatively new to pig latin, so I'm unsure of what the syntax should be for performing this difference.
Please disregard syntax for the cur_count and prev_count.
cur_count = FOREACH cur_grouped GENERATE COUNT(current);
prev_count = FOREACH prev_grouped GENERATE COUNT(previous);
left_join = join current by key LEFT OUTER, previous by key-1;
difference = FOREACH left_join GENERATE key, cur_count-prev_count; //error here
dump difference;
Below are some sample data
key value
1 12
1 34
1 11
1 45
2 4
3 34
3 34
3 23
4 15
4 19
What my script does so far: it counts the number of values in each group of keys
key count
1 4
2 1
3 3
4 2
I would like to find the difference in number of values between a key and the previous key
key difference
2 -3
3 2
4 -1
cur_count and prev_count are relations and cannot be used the way you are using them. You can achieve the desired output using the script below. After joining the relations on (key-1), use the columns from the joined relation to get the difference.
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int);
B = GROUP A BY f1;
C = FOREACH B GENERATE group,COUNT(A);
D = FOREACH B GENERATE group,COUNT(A);
E = JOIN C BY $0,D BY ($0-1);
F = FOREACH E GENERATE $2,$3-$1;
DUMP F;
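To check the expected numbers, the same count-then-self-join idea can be written in a few lines of Python. This is a sketch for verification only, not Pig; the function name `count_differences` is an assumption made for the example, and the data is the sample from the question.

```python
from collections import Counter

# Sample data from the question: (key, value)
pairs = [
    (1, 12), (1, 34), (1, 11), (1, 45),
    (2, 4),
    (3, 34), (3, 34), (3, 23),
    (4, 15), (4, 19),
]


def count_differences(pairs):
    """Count values per key, then subtract the count of the previous key."""
    counts = Counter(key for key, _ in pairs)  # like GROUP ... COUNT
    # Like the join on (key - 1): keep only keys whose predecessor exists
    return {
        key: counts[key] - counts[key - 1]
        for key in sorted(counts)
        if key - 1 in counts
    }


print(count_differences(pairs))
```

Key 1 is dropped because it has no previous key, which corresponds to the inner join in the Pig script.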
Presume you have two groups grp1 and grp2 with the content you described earlier
key count
1 4
2 1
3 3
4 2
Note: I have not executed below Pig statements.
-- Generate the Ranks for two relations
grp1 = rank grp1;
grp2 = rank grp2;
-- Increment rank by 1 for each record in grp2
grp2 = foreach grp2 generate ($0+1) as rank,key,count;
After these the two relations would look like below. Arranged them side by side for comparison.
Group 1               Group 2
Rank  key  count      Rank  key  count
1     1    4          2     1    4
2     2    1          3     2    1
3     3    3          4     3    3
4     4    2          5     4    2
Join the two groups by RANK which would yield below output
Rank  key  count      Rank  key  count
2     2    1          2     1    4
3     3    3          3     2    1
4     4    2          4     3    3
                      5     4    2
Now you can run another "foreach" statement that finds the difference in two count columns above.
result = FOREACH <<joined relation>> GENERATE $1 as key,($2-$5) as difference;