proc sql correlation

Can anybody help me calculate the correlation between two variables within each group in Proc Sql? Is there a function for this, like sum or mean? Thanks a lot!

Start with proc corr, which does all the required calculations and gets you most of the way there. You will then need to filter and transpose the output dataset into your desired format. There are many answers on this site showing how to do that sort of thing, so have a look at those; in this case a wide-to-long transposition is required.
proc sort data = sashelp.class out = class;
by sex;
run;
proc corr data = class outp=mypcorr noprint;
var HEIGHT WEIGHT;
by SEX;
run;
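One possible way to do the filtering step is sketched here. It assumes the OUTP= dataset carries the usual _TYPE_ and _NAME_ columns and that only the Height/Weight correlation per SEX is wanted. The names class_corr and corr_height_weight are made up for illustration.

```sas
/* Keep only the correlation rows for HEIGHT; on those rows the WEIGHT
   column holds the Height/Weight correlation for that SEX group. */
data class_corr; /* hypothetical output name */
set mypcorr;
where _type_ = 'CORR' and upcase(_name_) = 'HEIGHT';
keep sex weight;
rename weight = corr_height_weight; /* hypothetical column name */
run;
```

With more than two variables in the VAR statement, a proc transpose by SEX and _NAME_ would probably be a better fit than this single-row filter.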

Related

Making simple calculations across rows and columns in SAS

I have an issue with making calculations in SAS. As an example, I have the following data:
Type amount
Axiom_Indlån 19966699113
Puljerneskontantindestående 133819901
Puljerne Andre passiver -9389117
Rap_Indlån 47501558321
I want to calculate the following:
('Rap_Indlån' - 'Puljerneskontantindestående' - 'Puljerne Andre passiver') - Axiom_Indlån
How do I achieve this?
And how would I do it if it was columns instead of rows?
This is one of my big issues I hope you can point me in the right direction.
It is hard to tell what you are asking for, since you have not shown an input or output dataset.
But it sounds like you just want to multiply some of the values by negative one before summing.
So if your dataset looks like this:
data have;
infile cards dsd truncover;
input name :$50. value;
cards;
Axiom_Indlån,19966699113
Puljerneskontantindestående,133819901
Puljerne Andre passiver,-9389117
Rap_Indlån,47501558321
;
You could get the total pretty easily in SQL for example.
proc sql;
create table want as
select sum(VALUE*case when (NAME in ('Rap_Indlån')) then 1 else -1 end) as TOTAL
from HAVE
;
quit;
If you wanted to do it by columns, simply transpose and subtract as normal.
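options validvarname=any; /* needed unless already set: keeps spaces and special characters in the transposed names so the name literals below resolve */
proc transpose data=have out=have_tpose;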
id name;
var value;
run;
data want;
set have_tpose;
total = ('Rap_Indlån'n - 'Puljerneskontantindestående'n - 'Puljerne Andre passiver'n) - 'Axiom_Indlån'n;
run;

SAS SQL Loop Inputting Sequential Files

I have:
6 files, named as follows: "ROSTER2008", "ROSTER2009", ..., "ROSTER2013"
Each file has these variables: TEAMID and MATE1, MATE2, ..., MATEX. The names of the teammates are stored for each team, through to teammate X.
Now: I want to loop through code that reads in the 6 files and creates one output file with TEAM, MATES2008, MATES2009, ..., MATES2013, where the MATES20XX variables contain the number of teammates on each team in that respective year (I no longer care about their names).
Here is what I've tried to do:
%macro sqlloop(start, end);
proc sql;
%DO year=&start. %TO &end.;
create table mates_by_team_&year. as
select distinct put(teamID, z2.) as team,
count(*) as mates
from myLibrary.rosters
where d='ROSTER(HOW DO I PUT YEAR HERE?)'.d;
%end;
quit;
%mend;
%sqlloop(start=2008, end=2013)
SO, here are my questions:
How do I fix the 'where' statement such that it understands the file name I intend to pull in on each iteration? (That is ROSTER2008,...,ROSTER2013)
Right now, I am creating 6 different tables for each file I input. How can I, instead, join these files into one table as I have described above?
Thank you in advance for your help!
Xtina
For the sake of simplicity I have assumed that the X in MATEX never goes beyond 999. You can increase the limit, or use proc contents or the dictionary tables to find the actual maximum.
I put all the variables Mate1-MateX into an array, check whether each one is missing, and increase a counter for every non-missing value to count the number of teammates.
%macro loops(start=,end=);
%do year=&start. %to &end.;
data year_&year.(keep=teamid Mates_&year.);
set roster&year.;
array mates{*} mate1-mate999;
Mates_&year.=0;
do i=1 to 999;
if not missing(mates{i}) then Mates_&year.=Mates_&year.+1;
end;
run;
proc sort data=year_&year.;
by teamid;
run;
%end;
At the end I have merged all the datasets together.
data team;
merge year_&start.-year_&end.;
by teamid;
run;
%mend;
%loops(start=2008,end=2009);
Assuming those are SAS datasets and you're using SAS 9.3+, I'd recommend a data step and proc transpose approach, which is much easier to follow. If you want to make it a macro, replace the years (2008/2013) with your macro variables.
Edit: Added N() to count the number of teammates, assuming the variables follow the mate1 mate2 ... matex pattern.
The single data step has the answer, and you can drop the variables.
Data combined;
Set roster2008 - roster2013 indsname=source;
Year = substr(scan(source, 2, '.'), 7); /* source is e.g. WORK.ROSTER2008, so this keeps the year */
Num_mates = n(of mate:); /* N() counts non-missing numeric values only */
*drop mate:;
Run;
PROC sort data=combined;
By teamid year;
Run;
PROC transpose data=combined out=want prefix=mates;
By teamid;
VAR num_mates;
Id year;
Run;
*OLD CODE BASED ON SQL SOLUTION;
*Proc freq data=combined;
*Table dsn*teamid/list;
*Run;

SAS how to get random selection by group randomly split into multiple groups

I have a simple data set of customers (about 40,000k)
It looks like:
customerid, group, other_variable
a,blue,y
b,blue,x
c,blue,z
d,green,y
e,green,d
f,green,r
g,green,e
I want to randomly select, for each group, Y customers (along with their other variables).
The catch is, I want two random selections of Y customers for each group
i.e.
4000 random green customers split into two sets of 2000 randomly
and 4000 random blue customers split into two sets of 2000 randomly
This is because I have different messages to give to the two different splits
I'm not sampling with replacement; the customers need to be unique.
I would prefer a solution in PROC SQL, but I'm happy with an alternative solution in SAS if PROC SQL isn't ideal.
proc surveyselect is the general tool of choice for random sampling in SAS. The code is very simple: I would sample 4000 from each group, then assign a new subgroup every 2000 rows, since the data is in a random order anyway (although sorted by group).
The default sampling method for proc surveyselect is srs, which is simple random sampling without replacement, exactly what is required here.
Here's some example code.
/* create dummy dataset */
data have;
do customerid = 1 to 10000;
length group other_variable $8;
if rand('uniform')<0.5 then group = 'blue'; /* assign blue or green with equal likelihood */
else group = 'green';
other_variable = byte(97+(floor((1+122-97)*rand('uniform')))); /* random letter between a and z */
output;
end;
run;
/* dataset must be sorted by group variable */
proc sort data=have;
by group;
run;
/* extract random sample of 4000 from each group */
proc surveyselect data=have
out=want
n=4000
seed=12345; /* specify seed to enable results to be reproduced */
strata group; /* set grouping variable */
run;
/* assign a new subgroup for every 2000 rows */
data want;
set want;
sub=int((_n_-1)/2000)+1;
run;
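The _n_-based split above relies on every group contributing exactly 4000 rows. A variant that restarts the count within each group is sketched below. The names want_split and seq are made up for illustration, and the sketch assumes want is still sorted by group, as it is when it comes out of the strata step.

```sas
/* Number the rows within each group, then put the first 2000 in
   subgroup 1 and the rest in subgroup 2. */
data want_split; /* hypothetical output name */
set want;
by group;
if first.group then seq = 0; /* seq is a hypothetical within-group row counter */
seq + 1;
sub = ifn(seq <= 2000, 1, 2);
run;
```

This keeps the subgroup labels as 1 and 2 in every group rather than running on to 3 and 4 for the second group.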
data custgroup;
set sorted_data;
point = ranuni(0); /* random sort key; use a positive seed for reproducible results */
run;
proc sort data=custgroup out=sortedcust;
by group point;
run;
data final;
set sortedcust;
by group point;
if first.group then i=0; /* restart the row counter at the start of each group */
i+1;
run;
What I am doing is first assigning a random number to every observation in the data set, then sorting by the variables group and point.
This gives a random sequence of observations within each group. Resetting i at first.group and then incrementing it with i+1 numbers the rows within each group.
Because each customer appears only once, no observation can be extracted twice. Add output statements to control which data set each observation goes to, based on i.
My approach may not be the most efficient one.
The code below should do it. First, you will need to generate a random number. As Joe said above, it is better to seed it with a specific number so that you can reproduce the sample if necessary. Then you can use Proc Sql with the outobs= option to generate a sample.
(BTW, it would be a good idea not to name a variable 'group'.)
data YourDataSet;
set YourDataSet;
myrandomnumber = ranuni(123);
run;
proc sql outobs=2000;
create table bluesample as
select *
from YourDataSet
where group eq 'blue'
order by myrandomnumber;
quit;
proc sql outobs=2000;
create table greensample as
select *
from YourDataSet
where group eq 'green'
order by myrandomnumber;
quit;

Compute rolling skewness and kurtosis in sas quickly for a very large data set

I have the following code to compute the skewness on a rolling window of returns:
libname backup 'C:\Users\Anwender\Desktop\backup sas data';
data crsp_daily;
set backup.crsp_daily;
run;
proc sort data=crsp_daily;
by permno date;
run;
data crsp_daily1a;
set crsp_daily;
lastofmonth = last.month;
by permno year month;
run;
proc sql;
create table roll_ret as
select h2.permno, h2.Date, h1.retadj as lagret
from crsp_daily1a as h1,
crsp_daily1a as h2
where h1.permno = h2.permno
and intck("MONTH",h1.date,h2.date) between 0 and 11
group by h2.permno, h2.date
having count(h2.permno)>250 and h2.lastofmonth = 1
;
quit;
proc means data = roll_ret noprint;
by permno date;
var lagret;
output out=crsp_daily_final skew=skewRet kurt=KurtRet;
run;
The input data set has a daily date variable, from which I have already constructed year and month variables. It also has an ID for the stock (permno) and daily returns (retadj). I want to compute rolling skewness from all observations in the last year, but only if there are at least 250 observations in this window. I am only interested in results for the last day of each month.
The input data set has more than 60 million observations, and the above code is simply too slow. I have already tried working with a view instead of a data set for roll_ret, without improvement.
How can I quickly compute a rolling skewness in the above sense for this very large data set?
General comments on my code would be appreciated as well.
Thanks very much!
PROC SQL performs a heuristic analysis of the potential join strategies; you can review the one it chose by using the proc sql _method option. Potential user optimization strategies are outlined here (http://support.sas.com/techsup/technote/ts553.html). Your case probably falls into the category of joining a small (h2) dataset with a large (h1) one; creating an index on the key usually helps in this case.
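As a rough sketch of those two suggestions, the code below indexes the join key and reruns the question's query with _method switched on. It assumes crsp_daily1a lives in the WORK library, as in the question. Whether the index actually speeds up this particular join is not guaranteed and would need to be checked in the log.

```sas
/* Create a simple index on the join key. */
proc datasets library=work nolist;
modify crsp_daily1a;
index create permno;
quit;

/* _method writes the join strategy PROC SQL chose to the log. */
proc sql _method;
create table roll_ret as
select h2.permno, h2.Date, h1.retadj as lagret
from crsp_daily1a as h1,
crsp_daily1a as h2
where h1.permno = h2.permno
and intck("MONTH",h1.date,h2.date) between 0 and 11
group by h2.permno, h2.date
having count(h2.permno)>250 and h2.lastofmonth = 1
;
quit;
```

Comparing the _method output in the log before and after creating the index shows whether the chosen join strategy changed.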

How do you create an index for unique variables in sas?

I am trying to combine the odds ratios outputted from two models with different adjustments in SAS:
i.e:
ods output oddsratios=adjustedOR1(rename=(OddsRatioEst=OR1));
proc logistic data=dataname;
model y= b d c a e; run;
ods output oddsratios=adjustedOR2 (rename=(OddsRatioEst=OR2));
proc logistic data=dataname;
model y= b d c; run;
proc sort.....
data Oddsratios (keep=Effect OR1 OR2);
merge adjustedOR1 adjustedOR2; by effect; run;
Problem is that if I sort and merge by the Effect variable, I lose the order in which I put the explanatory variables in the model.
Is there any way to assign an index to the variable according to the order I put it in the model, so that the final table will have the effect column in the order: b d c a e?
Thanks for your help
I suggest creating a new sequence variable in your "primary" dataset that records the order you want, then re-sorting the merged result by that variable:
data adjustedOR1;
set adjustedOR1;
sortkey = _n_;
run;
proc sort data=adjustedOR1;
by effect;
run;
proc sort data=adjustedOR2;
by effect;
run;
data Oddsratios (keep=Effect OR1 OR2 sortkey);
merge adjustedOR1 adjustedOR2;
by effect;
run;
proc sort data=Oddsratios;
by sortkey;
run;
This would be a bit more generic than hard-coding the sort sequence as Keith suggests using PROC SQL (which also works by the way).
And thanks to Keith for providing a practical example!
I think the easiest way to sort your data is to do the merge in Proc Sql and use a case statement in an 'order by' clause. Here is an example.
ods output oddsratios=adjustedOR1(rename=(OddsRatioEst=OR1));
proc logistic data=sashelp.class;
model sex= height age weight; run;
ods output oddsratios=adjustedOR2 (rename=(OddsRatioEst=OR2));
proc logistic data=sashelp.class;
model sex= height age; run;
proc sql;
create table Oddsratios as select
a.effect,
a.or1,
b.or2
from adjustedOR1 as a
left join
adjustedOR2 as b
on a.effect=b.effect
order by
case a.effect
when 'Height' then 1
when 'Age' then 2
when 'Weight' then 3
end;
quit;
If you want to just change the order in which the variables appear in the data set, you can use the retain statement:
data Oddsratios (keep=Effect OR1 OR2);
retain Effect OR1 OR2; /* list the data set's variables in the column order you want */
merge adjustedOR1 adjustedOR2;
by effect;
run;
This isn't really what retain is for, but it works.
But I wonder why you care what the order of the variables in the data set is. You can specify the order when you display results with proc print, for example.
I think the simplest answer that is still fairly flexible is to create an informat from your original dataset. Then during the merge you can create a new variable holding the numeric order and sort by that afterwards.
Another solution would be to merge in a fashion that does not require a sort: create a hash table, for example, or create a format out of the odds ratio 2 dataset and append it in a simple data step rather than via merge.
data have;
input effect $;
datalines;
b
d
c
a
e
;;;;
run;
data for_format;
set have;
fmtname='EFF';
type='j';
hlo='s';
start=effect;
label=_n_;
keep hlo type fmtname start label;
run;
proc format cntlin=for_format;
quit;
proc sort data=have;
by effect;
run;
data want;
set have; *your merge here instead;
by effect;
eff_order=input(effect,$EFF.);
run;
proc sort data=want;
by eff_order;
run;
proc print data=want;
run;
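The hash-table alternative mentioned above could look roughly like the sketch below. It assumes the adjustedOR1 and adjustedOR2 datasets from the question, with OR1 and OR2 already renamed. The output name Oddsratios_hash is made up for illustration.

```sas
/* Look up OR2 by Effect in a hash table while reading adjustedOR1 in
   its original (model) order, so no sort is needed. */
data Oddsratios_hash; /* hypothetical output name */
if _n_ = 1 then do;
if 0 then set adjustedOR2(keep=Effect OR2); /* defines Effect and OR2 without reading any rows */
declare hash h(dataset: 'adjustedOR2');
h.defineKey('Effect');
h.defineData('OR2');
h.defineDone();
end;
set adjustedOR1;
if h.find() ne 0 then call missing(OR2); /* effect not in the second model */
keep Effect OR1 OR2;
run;
```

Because adjustedOR1 is read sequentially and never sorted, the rows should come out in the order the effects were listed in the first model.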