SAS: proc freq list view, creating dummy variables

Is there any way to create dummy variables for the list view generated by SAS proc freq?
For example, this is my proc freq output:
x y z N %
0 0 0 10 2.8
0 0 1 20 5.6
0 1 0 30 8.3
0 1 1 40 11.1
1 0 0 50 13.9
1 0 1 60 16.7
1 1 0 70 19.4
1 1 1 80 22.2
Can I (easily, within proc freq) create dummy variables that take 1/0 values for each level of the output (that is, 8 dummy variables) or, alternatively, a single variable that takes the incremental values 1, 2, 3, ... for each level of the output?
Thanks in advance!

Here's one way you can do it with a single variable, assuming you just have combinations of variables with values of only 0 or 1:
data yourdata;
do i = 1 to 100;
x = round(ranuni(1));
y = round(ranuni(2));
z = round(ranuni(3));
t = 1;
output;
end;
run;
proc summary nway data = yourdata;
class x y z;
var t;
output out = summary_ds n=;
run;
data summary_ds;
set summary_ds;
singlevar = input(cats(x,y,z),binary3.); /* 000 -> 0, 001 -> 1, ..., 111 -> 7; add 1 for 1-based numbering */
run;
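To make the encoding idea concrete outside SAS, here is a minimal Python sketch of the same logic: the three 0/1 values are read as a binary number, and the eight dummy variables follow from that index. The function names `level_index` and `dummies` are made up for this illustration and are not part of the answer above.

```python
from itertools import product

def level_index(x, y, z):
    """Read the digits x, y, z as a binary number (000 -> 0, ..., 111 -> 7)."""
    return int(f"{x}{y}{z}", 2)

def dummies(x, y, z, n_levels=8):
    """One 0/1 flag per level; exactly one flag is set for each combination."""
    idx = level_index(x, y, z)
    return [1 if i == idx else 0 for i in range(n_levels)]

# Print every combination with its 1-based level number and dummy flags.
for x, y, z in product((0, 1), repeat=3):
    print(x, y, z, level_index(x, y, z) + 1, dummies(x, y, z))
```

Adding 1 to the index gives the 1, 2, 3, ... numbering asked for in the question.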


SAS sum observations not in a group, by multiple groups

This post follows this one: SAS sum observations not in a group, by group
My minimal example there was, sadly, a bit too minimal, and I wasn't able to apply the answers to my data.
Here is a complete example. What I have is:
data have;
input group1 group2 $ group3 $ value;
datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
1 B Z 2
1 B Z 1
1 C Y 1
1 C Y 6
1 C Z 7
2 A Z 3
2 A Z 9
2 A Y 2
2 B X 8
2 B X 5
2 B X 5
2 B Z 7
2 C Y 2
2 C X 1
;
run;
For each group, I want a new variable "sum" with the sum of all values in the column for the same subgroup (group1 and group2), except for the group (group3) the observation is in.
data want;
input group1 group2 $ group3 $ value sum;
datalines;
1 A X 2 8
1 A X 4 6
1 A Y 1 9
1 A Y 3 7
1 B Z 2 1
1 B Z 1 2
1 C Y 1 13
1 C Y 6 8
1 C Z 7 7
2 A Z 3 11
2 A Z 9 5
2 A Y 2 12
2 B X 8 17
2 B X 5 20
2 B X 5 20
2 B Z 7 18
2 C Y 2 1
2 C X 1 2
;
run;
My goal is to use either data steps or proc sql (I am doing this on around 30 million observations, and proc means and the like seemed slower than those in previous similar computations).
My issue with the solutions provided in the linked post is that they use the total of the whole column, and I don't know how to change this to use the total within the subgroup.
Any ideas, please?
A SQL solution will join all data to an aggregating select:
proc sql;
create table want as
select have.group1, have.group2, have.group3, have.value
, aggregate.sum - value as sum
from
have
join
(select group1, group2, sum(value) as sum
from have
group by group1, group2
) aggregate
on
aggregate.group1 = have.group1
& aggregate.group2 = have.group2
;
quit;
SQL can be slower than a hash solution, but SQL code is understood by more people than a SAS DATA step involving hashes (which can be faster than SQL).
data want2;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group1', 'group2');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve group sum.;
sum = sum - value; * subtract from group sum;
output;
end;
stop;
run;
SAS documentation touches on SUMINC and has some examples
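As a cross-check of the logic shared by the SQL and hash versions above (group total minus the row's own value), here is a small Python sketch run on the question's data. It is only an illustration under that reading of the problem; the function name `sum_of_others` is made up here.

```python
from collections import defaultdict

# (group1, group2, group3, value) rows copied from the question's `have` data.
have = [
    (1, "A", "X", 2), (1, "A", "X", 4), (1, "A", "Y", 1), (1, "A", "Y", 3),
    (1, "B", "Z", 2), (1, "B", "Z", 1),
    (1, "C", "Y", 1), (1, "C", "Y", 6), (1, "C", "Z", 7),
    (2, "A", "Z", 3), (2, "A", "Z", 9), (2, "A", "Y", 2),
    (2, "B", "X", 8), (2, "B", "X", 5), (2, "B", "X", 5), (2, "B", "Z", 7),
    (2, "C", "Y", 2), (2, "C", "X", 1),
]

def sum_of_others(rows):
    """Append, to each row, the (group1, group2) total minus the row's own value."""
    totals = defaultdict(int)
    for g1, g2, _g3, value in rows:          # first pass: accumulate group totals
        totals[(g1, g2)] += value
    return [(g1, g2, g3, value, totals[(g1, g2)] - value)   # second pass: subtract
            for g1, g2, g3, value in rows]

want = sum_of_others(have)
for row in want:
    print(*row)
```

The two passes correspond to the two loops over `have` in the hash step: one to build the totals, one to read them back.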
The question's sample output does not address this concept:
For each row, compute the tier 2 sum that excludes the tier 3 group this row is in.
A hash-based solution for that needs to track both the two-level and the three-level sums:
data want2;
if 0 then set have; * prep pdv;
declare hash T2 (suminc:'value'); * hash for two (T)iers;
T2.defineKey('group1', 'group2'); * one hash record per combination of group1, group2;
T2.defineDone();
declare hash T3 (suminc:'value'); * hash for three (T)iers;
T3.defineKey('group1', 'group2', 'group3'); * one hash record per combination of group1, group2, group3;
T3.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
T2.ref(); * adds value to internal sum of hash data record;
T3.ref();
end;
T2_cardinality = T2.num_items;
T3_cardinality = T3.num_items;
put 'NOTE: |T2| = ' T2_cardinality;
put 'NOTE: |T3| = ' T3_cardinality;
do while (not last_have);
set have end=last_have;
T2.sum(sum:t2_sum);
T3.sum(sum:t3_sum);
sum = t2_sum - t3_sum;
output;
end;
stop;
drop t2_: t3:;
run;
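The two-hash idea (tier 2 total minus the total of the row's own tier 3 group) can be sketched in Python as follows. This is an illustration on a subset of the question's rows, with a made-up function name, not a replacement for the DATA step.

```python
from collections import Counter

# Subset of the question's rows: (group1, group2, group3, value).
rows = [
    (1, "A", "X", 2), (1, "A", "X", 4), (1, "A", "Y", 1), (1, "A", "Y", 3),
    (1, "B", "Z", 2), (1, "B", "Z", 1),
]

def sum_excluding_own_tier3(rows):
    """For each row: (group1, group2) total minus (group1, group2, group3) total."""
    t2 = Counter()   # plays the role of hash T2
    t3 = Counter()   # plays the role of hash T3
    for g1, g2, g3, value in rows:
        t2[(g1, g2)] += value
        t3[(g1, g2, g3)] += value
    return [t2[(g1, g2)] - t3[(g1, g2, g3)] for g1, g2, g3, _value in rows]

result = sum_excluding_own_tier3(rows)
print(result)  # [4, 4, 6, 6, 0, 0]
```

Note that group (1, B) contains only Z rows, so excluding the row's own tier 3 group leaves nothing to sum and the result is 0.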

How can I initialize a parameter in AMPL when it is defined on multiple sets?

Suppose I have
param m; #number of modes
param n; #number of individual
param a; #number of alternatives
param f; #number of household
set M, default{1..m}; #set of modes
set N, default{1..n}; #set of individuals
set A, default{1..a}; #set of alternatives
set F, default{1..f}; #set of family
set E, within F cross N;
How can I initialize param X{E,M,A}?
Suppose
a:=2 , m:=3 , n:= 4 f:=2;
and set E is defined:
set E:= 1 1 1 2 2 3 2 4 ;
You can declare the parameter just as you suggested:
param X{E,M,A};
Now, if you want to provide a default value (which I assume is what you are asking), you can do it in the usual way:
param X{E,M,A} default 0;
Then provide some non-default values in the .dat file, e.g.,:
param: X :=
1 1 1 2 5
2 3 2 1 6;
Note that AMPL doesn't fill the default values into the parameter until you call solve. From the AMPL book, p. 120:
The expression that gives the default value of a parameter is evaluated only when the parameter’s value is first needed, such as when an objective or constraint that uses the parameter is processed by a solve command.
So if you type display X; after you have issued the model and data commands but before you have issued the solve command, you'll only get the non-default values, e.g.:
X :=
1 1 1 2 5
2 3 2 1 6
;
But if you use display X; after you call solve, you'll get the full list:
X [1,*,*,1] (tr)
: 1 2 :=
1 0 0
2 0 0
3 0 0
[1,*,*,2] (tr)
: 1 2 :=
1 5 0
2 0 0
3 0 0
[2,*,*,1] (tr)
: 3 4 :=
1 0 0
2 6 0
3 0 0
[2,*,*,2] (tr)
: 3 4 :=
1 0 0
2 0 0
3 0 0
;
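If it helps to picture what happens, the behavior resembles a sparse table with a fallback value: only the explicitly supplied entries are stored, and the default is produced when an entry is actually looked up. The Python sketch below is only an analogy under that assumption (it says nothing about how AMPL is implemented); the names `explicit`, `DEFAULT` and `X` are made up here.

```python
from itertools import product

E = [(1, 1), (1, 2), (2, 3), (2, 4)]   # (family, individual) pairs from the .dat file
M = range(1, 4)                        # modes 1..3
A = range(1, 3)                        # alternatives 1..2

explicit = {(1, 1, 1, 2): 5, (2, 3, 2, 1): 6}   # the two values given in the data
DEFAULT = 0

def X(i, j, m, a):
    """Return the stored value, or the default for any valid index without one."""
    if (i, j) not in E or m not in M or a not in A:
        raise KeyError((i, j, m, a))
    return explicit.get((i, j, m, a), DEFAULT)

# Before anything asks for the values, only the explicit entries exist:
print(explicit)

# Once every index is evaluated, the full 4 x 3 x 2 table appears:
full = {(i, j, m, a): X(i, j, m, a) for (i, j), m, a in product(E, M, A)}
print(len(full), sum(1 for v in full.values() if v != DEFAULT))
```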
For completeness, here are the .mod and .dat files I used for this answer:
.mod:
param m; #number of modes
param n; #number of individual
param a; #number of alternatives
param f; #number of household
set M, default{1..m}; #set of modes
set N, default{1..n}; #set of individuals
set A, default{1..a}; #set of alternatives
set F, default{1..f}; #set of family
set E, within F cross N;
param X{E,M,A} default 0;
var myVar{E,M,A} >= 0;
minimize Obj: sum {(i,j) in E, mm in M, aa in A} X[i,j,mm,aa] * myVar[i,j,mm,aa];
.dat:
param a:=2;
param m:=3;
param n:= 4;
param f:=2;
set E:= 1 1 1 2 2 3 2 4 ;
param: X :=
1 1 1 2 5
2 3 2 1 6;

SAS Lookup on Per Variable Basis

I have two tables in SAS, Table A and Table B. Suppose I want to write a little SAS code to obtain the table "Desired Output." How would I do this?
Table A:
Observation Var1 Var2
1 0 0
2 1 2
3 2 1
4 0 0
Table B:
Var Level Lookup
Var1 0 0.1
Var1 1 0.3
Var1 2 0.5
Var2 0 0.7
Var2 1 0.8
Var2 2 0.9
Desired output:
Observation Var1 Var2 Var1_new Var2_new
1 0 0 0.1 0.7
2 1 2 0.3 0.9
3 2 1 0.5 0.8
4 0 2 0.1 0.9
From my understanding, this may involve SQL in SAS, but I'm not sure. I have no idea how to do this. Pseudo-code might look like this, but I don't know how to actually make it work:
data DATA_OUT.DESIRED_OUTPUT;
set DATA_IN.TABLE_A;
set PP.TABLE_B key=(Var Level);
Var1_new = TABLE_B["Var1" Var1][Lookup];
Var2_new = TABLE_B["Var2" Var2][Lookup];
run;
How would you achieve the desired output in SAS?
Here is a method using a hash object to store your table B.
data A ;
input var1 var2;
cards;
0 0
1 2
2 1
0 0
;
data B;
input Var :$32. Level Lookup;
cards;
Var1 0 0.1
Var1 1 0.3
Var1 2 0.5
Var2 0 0.7
Var2 1 0.8
Var2 2 0.9
;
data want;
if _n_=1 then do;
if 0 then set b;
dcl hash h(dataset: 'b');
h.definekey('var','level');
h.definedata('lookup');
h.definedone();
end;
set a;
h.find(key:'Var1',key:var1);
var1_new=lookup;
h.find(key:'Var2',key:var2);
var2_new=lookup;
drop var level lookup;
run;
There are about a dozen ways to do this, but the best way for what you have there is probably to make a format from the second dataset.
Formats are just relationships between one value and another value, which is exactly what you have here! You use the CNTLIN option on PROC FORMAT to create the relationship from a dataset (your dataset B) and then apply it using PUT. (Then use INPUT to change back to a number - formats only create character values. You can't use INFORMAT here because those only take character values as input. Number to number always takes an extra step.)
You could also use a hash table lookup, or just a pair of data step merges, or keyed set statements... a lot of options, as well as SQL joins. But format here will be the fastest and the easiest to code IMO.
data a;
input Observation Var1 Var2;
datalines;
1 0 0
2 1 2
3 2 1
4 0 0
;;;;
run;
data b;
input Var $ Level Lookup;
datalines;
Var1 0 0.1
Var1 1 0.3
Var1 2 0.5
Var2 0 0.7
Var2 1 0.8
Var2 2 0.9
;;;;
run;
*Here we make a new dataset that has the required names for a format cntlin dataset;
data for_fmt;
set b;
rename var=fmtname
level=start
lookup=label
;
var = cats(var,'F'); *format names cannot end with numbers, so add an F at the end;
run;
proc format cntlin=for_fmt; *read in the format;
quit;
*now use the formats;
data want;
set a;
var1_new = input(put(var1,var1f.),best12.);
var2_new = input(put(var2,var2f.),best12.);
run;
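Both answers boil down to the same thing: a lookup keyed on the pair (variable name, level). For readers who want to see that logic in isolation, here is a minimal Python sketch using the tables from the question; the helper `add_lookups` is a made-up name for this illustration.

```python
# Table A and Table B from the question.
table_a = [
    {"Observation": 1, "Var1": 0, "Var2": 0},
    {"Observation": 2, "Var1": 1, "Var2": 2},
    {"Observation": 3, "Var1": 2, "Var2": 1},
    {"Observation": 4, "Var1": 0, "Var2": 0},
]
table_b = [
    ("Var1", 0, 0.1), ("Var1", 1, 0.3), ("Var1", 2, 0.5),
    ("Var2", 0, 0.7), ("Var2", 1, 0.8), ("Var2", 2, 0.9),
]

# One dictionary keyed by (variable name, level), like the hash object's key.
lookup = {(var, level): value for var, level, value in table_b}

def add_lookups(row, variables=("Var1", "Var2")):
    """Return a copy of the row with a <var>_new column for each variable."""
    out = dict(row)
    for var in variables:
        out[f"{var}_new"] = lookup.get((var, row[var]))  # None if no match
    return out

want = [add_lookups(row) for row in table_a]
for row in want:
    print(row)
```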

identifying the rows with maximum continuous values

I have two columns in a table. The second column is 1 or 0 depending on a predefined condition. Can someone help me with logic to identify the longest continuous run of 1s? For example, in the table below the longest continuous run is between rows 7 and 18. Just the logic to identify this would be enough.
Thanks
Create the intervals.
data intervals ;
set have ;
by B NOTSORTED ;
if first.b then start=A ;
retain start ;
if last.b then do;
end = A ;
duration = end - start + 1 ;
output;
end;
drop A ;
run;
Then find the interval with the maximum duration. Perhaps you want the first occurrence of the maximum duration?
proc sort data=intervals out=want ;
by descending duration start;
run;
data want ;
set want (obs=1);
where B=1;
run;
Something like this:
data have;
input A B;
datalines;
1 0
2 0
3 1
4 1
5 1
6 0
7 0
8 0
9 1
10 0
11 1
12 1
13 1
14 1
15 1
16 1
17 0
18 0
19 0
20 1
21 0
;
proc sort data=have;
by A;
run;
data want;
set have;
if B=1 then count + 1;
if B = 0 then count = 0;
run;
proc sql;
select max(count) as max_value from want;
quit;
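Since only the logic was asked for, here is the running-counter idea as a short Python sketch that also reports where the longest run starts and ends. It uses the sample data from the answer above (not the table from the question, which is not shown); `longest_run` is a made-up name.

```python
def longest_run(rows, target=1):
    """Return (length, start, end) of the first longest run of `target`, or None."""
    best = None
    start = None
    length = 0
    for a, b in rows:
        if b == target:
            if length == 0:
                start = a            # a new run begins on this row
            length += 1
            if best is None or length > best[0]:
                best = (length, start, a)
        else:
            length = 0               # the run is broken, reset the counter
    return best

# Column B from the sample data; column A is simply the row number 1..21.
B = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0]
rows = list(enumerate(B, start=1))
print(longest_run(rows))  # (6, 11, 16)
```

The strict `>` comparison keeps the first run when several runs tie for the maximum length.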

How would you do this task using SQL or R library sqldf?

I need to implement the following function (ideally in R or SQL): given two data frames, each with a userid column and the remaining columns being boolean attributes (only 0s and 1s are permitted), I need to return a new data frame with two columns (userid and count), where count is the number of attributes on which the user's 0s and 1s match across the two tables. A user could occur in both data frames or in just one. In the latter case, I need to return NA for that user's count. Here is an example:
DF1
ID c1 c2 c3 c4 c5
1 0 1 0 1 1
10 1 0 1 0 0
5 0 1 1 1 0
20 1 1 0 0 1
3 1 1 0 0 1
6 0 0 1 1 1
71 1 0 1 0 0
15 0 1 1 1 0
80 0 0 0 1 0
DF2
ID c1 c2 c3 c4 c5
5 1 0 1 1 0
6 0 1 0 0 1
15 1 0 0 1 1
80 1 1 1 0 0
78 1 1 1 0 0
98 0 0 1 1 1
1 0 1 0 0 1
2 1 0 0 1 1
9 0 0 0 1 0
My function must return something like this: (the following is a subset)
DF_Return
ID Count
1 4
2 NA
80 1
20 NA
.
.
.
Could you give me any suggestions for carrying this out? I'm not much of an expert in SQL.
Here is the R code that generates the example I used above.
id1=c(1,10,5,20,3,6,71,15,80)
c1=c(0,1,0,1,1,0,1,0,0)
c2=c(1,0,1,1,1,0,0,1,0)
c3=c(0,1,1,0,0,1,1,1,0)
c4=c(1,0,1,0,0,1,0,1,1)
c5=c(1,0,0,1,1,1,0,0,0)
DF1=data.frame(ID=id1,c1=c1,c2=c2,c3=c3,c4=c4,c5=c5)
DF2=data.frame(ID=c(5,6,15,80,78,98,1,2,9),c1=c2,c2=c1,c3=c5,c4=c4,c5=c3)
Many thanks in advance.
Best Regards!
Here are two approaches. The first hardcodes the columns to compare, while the second is more general and agnostic to how many columns DF1 and DF2 have:
#Merge together using all = TRUE for the equivalent of an outer join
DF3 <- merge(DF1, DF2, by = "ID", all = TRUE, suffixes= c(".1", ".2"))
#Calculate the rowSums where the same columns match
out1 <- data.frame(ID = DF3[, 1], count = rowSums(DF3[, 2:6] == DF3[, 7:ncol(DF3)]))
#Approach that is agnostic to the number of columns you have
library(reshape2)
library(plyr)
DF3.m <- melt(DF3, id.vars = 1)
DF3.m[, c("level", "DF")] <- with(DF3.m, colsplit(variable, "\\.", c("level", "DF")))
out2 <- dcast(data = DF3.m, ID + level ~ DF, value.var="value")
colnames(out2)[3:4] <- c("DF1", "DF2")
out2 <- ddply(out2, "ID", summarize, count = sum(DF1 == DF2))
#Are they the same?
all.equal(out1, out2)
#[1] TRUE
> head(out1)
ID count
1 1 4
2 2 NA
3 3 NA
4 5 3
5 6 2
6 9 NA
SELECT
COALESCE(DF1.ID, DF2.ID) AS ID,
CASE WHEN DF1.ID IS NULL OR DF2.ID IS NULL THEN NULL -- ID found in only one table
ELSE
CASE WHEN DF1.c1 = DF2.c1 THEN 1 ELSE 0 END +
CASE WHEN DF1.c2 = DF2.c2 THEN 1 ELSE 0 END +
CASE WHEN DF1.c3 = DF2.c3 THEN 1 ELSE 0 END +
CASE WHEN DF1.c4 = DF2.c4 THEN 1 ELSE 0 END +
CASE WHEN DF1.c5 = DF2.c5 THEN 1 ELSE 0 END
END AS count_of_matches
FROM
DF1
FULL OUTER JOIN
DF2
ON DF1.ID = DF2.ID
There's probably a more elegant way, but this works:
x <- merge(DF1,DF2,by="ID",all=TRUE)
pre <- paste("c",1:5,sep="")
x$Count <- rowSums(x[,paste(pre,"x",sep=".")]==x[,paste(pre,"y",sep=".")])
DF_Return <- x[,c("ID","Count")]
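To check the expected counts independently of R or SQL, here is a small Python sketch of the same logic: take the union of IDs, count position-wise matches where an ID is in both tables, and use None (standing in for NA) otherwise. The data is copied from the question; `match_counts` is a made-up name.

```python
# ID -> (c1, c2, c3, c4, c5), copied from DF1 and DF2 in the question.
df1 = {
    1: (0, 1, 0, 1, 1), 10: (1, 0, 1, 0, 0), 5: (0, 1, 1, 1, 0),
    20: (1, 1, 0, 0, 1), 3: (1, 1, 0, 0, 1), 6: (0, 0, 1, 1, 1),
    71: (1, 0, 1, 0, 0), 15: (0, 1, 1, 1, 0), 80: (0, 0, 0, 1, 0),
}
df2 = {
    5: (1, 0, 1, 1, 0), 6: (0, 1, 0, 0, 1), 15: (1, 0, 0, 1, 1),
    80: (1, 1, 1, 0, 0), 78: (1, 1, 1, 0, 0), 98: (0, 0, 1, 1, 1),
    1: (0, 1, 0, 0, 1), 2: (1, 0, 0, 1, 1), 9: (0, 0, 0, 1, 0),
}

def match_counts(left, right):
    """Full outer join on ID; count equal attributes, None if an ID is one-sided."""
    result = {}
    for user_id in sorted(left.keys() | right.keys()):
        if user_id in left and user_id in right:
            result[user_id] = sum(a == b for a, b in zip(left[user_id], right[user_id]))
        else:
            result[user_id] = None
    return result

counts = match_counts(df1, df2)
for user_id, count in counts.items():
    print(user_id, "NA" if count is None else count)
```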
We could use safe_full_join from my package safejoin, and apply ==
between conflicting columns. This will yield a new data frame with logical
c* columns that we can use rowSums on.
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
safe_full_join(DF1, DF2, by = "ID", conflict = `==`) %>%
transmute(ID, count = rowSums(.[-1]))
# ID count
# 1 1 4
# 2 10 NA
# 3 5 3
# 4 20 NA
# 5 3 NA
# 6 6 2
# 7 71 NA
# 8 15 1
# 9 80 1
# 10 78 NA
# 11 98 NA
# 12 2 NA
# 13 9 NA
You can use the apply function to handle this. To get the sum of each row, you can use:
sums <- apply(DF1[2:ncol(DF1)], 1, sum)
cbind(DF1[1], sums)
which will return the sum of all but the first column, then bind that to the first column to get the ID back.
You could do that on both data frames. I'm not really clear what the desired behavior is after that, but maybe look at the merge function.