I have a gender variable in the old dataset from a week ago and a gender variable in the current dataset. I have created old and new versions of these variables and combined them into a table called joined.
My goal is to create a flag that shows whether the counts for the old and the new gender variable differ by +/- 100. In particular, I'd like to know if there are any drops.
The gender variable is categorical (categories are: female, male, unknown).
Is there a way for me to create flags using proc sql? I will need to do this for the other variables as well, including sex, race, age, etc. What is the easiest way to create a flag and also show the counts of the variables? The end goal would be three columns: the counts of the two gender variables and a flag saying whether there was an increase or decrease of 100 between the two weeks.
This is the code I have so far:
proc sql;
create table joined as
select coalesce(a.person_key, b.person_key) as person_key,
a.gender_old, b.gender_new,
a.sexor_old, b.sexor_new,
a.race_cd_old, b.race_cd_new,
a.age_group_old, b.age_group_new,
a.sex_old, b.sex_new,
a.peh1_old, b.peh1_new,
a.peh2_old, b.peh2_new,
a.dose_total_old, b.dose_total_new
from oldperson a
full join newperson b
on a.person_key = b.person_key;
quit;
/* Goal: Flag significant increases and flag ANY drops. */
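One possible way to get the three columns is sketched below in Python with the standard sqlite3 module, so it can be run without SAS. It stacks the old and new values, counts each category, and flags any drop or any increase of at least 100. The sample rows, the reading of "+- 100" as a threshold of 100, and the flag labels are assumptions; the SELECT itself uses only plain SQL and would likely need only minor changes for PROC SQL.

```python
import sqlite3

THRESHOLD = 100  # assumed reading of "+- 100" in the question

con = sqlite3.connect(":memory:")
con.execute(
    "create table joined (person_key integer, gender_old text, gender_new text)"
)

# Hypothetical data: female 300 -> 450, male 200 -> 195, unknown 50 -> 60.
rows = []
key = 0
for category, n_old, n_new in [("female", 300, 450), ("male", 200, 195), ("unknown", 50, 60)]:
    for i in range(max(n_old, n_new)):
        key += 1
        rows.append((key,
                     category if i < n_old else None,
                     category if i < n_new else None))
con.executemany("insert into joined values (?, ?, ?)", rows)

# Stack old and new values, then count each category once per source.
query = """
    select category,
           sum(is_old) as old_count,
           sum(is_new) as new_count,
           case
             when sum(is_new) < sum(is_old) then 'DROP'
             when sum(is_new) - sum(is_old) >= ? then 'INCREASE'
             else 'OK'
           end as flag
    from (
        select gender_old as category, 1 as is_old, 0 as is_new
        from joined where gender_old is not null
        union all
        select gender_new as category, 0 as is_old, 1 as is_new
        from joined where gender_new is not null
    ) as stacked
    group by category
    order by category
"""
flags = {cat: (old, new, flag)
         for cat, old, new, flag in con.execute(query, (THRESHOLD,))}

for cat, (old, new, flag) in flags.items():
    print(cat, old, new, flag)
```

The same pattern can be repeated for sex, race, age group and so on by swapping the column names in the two inner SELECTs.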
I have this code in SAS, I'm trying to write SQL equivalent. I have no experience in SAS.
data Fulls Fulls_Dupes;
set Fulls;
by name coeff week;
if rid = 0 and ^last.week then output Fulls_Dupes;
else output Fulls;
run;
I tried the following, but didn't produce the same output:
Select * from Fulls where rid = 0 groupby name,coeff,week
Is my SQL query correct?
SQL does not have a concept of observation order. So there is no direct equivalent of the LAST. concept. If you have some variable that is monotonically increasing within the groups defined by distinct values of name, coeff, and week then you could select the observation that has the maximum value of that variable to find the observation that is the LAST.
So, for example, if you also had a variable named DAY that uniquely identified and ordered the observations in the same way as they exist in the FULLS dataset now, then you could use the test DAY=MAX(DAY) to find the last observation. In PROC SQL you can use that test directly because SAS will automatically remerge the aggregate value back onto all of the detail observations. In other SQL implementations you might need to add an extra query to get the max.
create table new_FULLES as
select * from FULLS
group by name, coeff, week
having day=max(day) or rid ne 0
;
SQL also does not have any concept of writing two datasets at once. But for this example since the two generated datasets are distinct and include all of the original observations you could generate the second from the first using EXCEPT.
So if you could build the new FULLS you could get FULLS_DUPES from the new FULLS and the old FULLS.
create table FULLS_DUPES as
select * from FULLS
except
select * from new_FULLES
;
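For SQL dialects that do not remerge aggregates the way PROC SQL does, the two steps above might look roughly like the following sketch, run here through Python's standard sqlite3 module. The DAY column, the sample rows, and the lowercase table names are assumptions for illustration. Note that EXCEPT treats rows as sets, so this only reproduces the SAS split when the rows are distinct.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table fulls (name text, coeff text, week integer, day integer, rid integer)"
)
con.executemany("insert into fulls values (?, ?, ?, ?, ?)", [
    ("a", "x", 1, 1, 0),  # rid = 0 and not last in its group -> dupe
    ("a", "x", 1, 2, 0),  # rid = 0 but last in its group     -> kept
    ("a", "x", 2, 1, 1),  # rid ne 0                          -> kept
    ("a", "x", 2, 2, 0),  # rid = 0 but last in its group     -> kept
    ("b", "y", 1, 1, 0),  # only row in its group, so last    -> kept
])

# Extra query for the per-group max, joined back onto the detail rows.
con.execute("""
    create table new_fulls as
    select f.*
    from fulls f
    join (select name, coeff, week, max(day) as last_day
          from fulls
          group by name, coeff, week) m
      on f.name = m.name and f.coeff = m.coeff and f.week = m.week
    where f.day = m.last_day or f.rid <> 0
""")

# Everything that did not make it into new_fulls is a dupe.
con.execute("""
    create table fulls_dupes as
    select * from fulls
    except
    select * from new_fulls
""")

kept = con.execute(
    "select name, week, day from new_fulls order by name, week, day").fetchall()
dupes = con.execute(
    "select name, week, day from fulls_dupes order by name, week, day").fetchall()
print(kept)
print(dupes)
```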
I have a data table with columns: Year, Month, Sales. It is effectively a summary table, like a pivot table in excel.
With this table, if there are no sales reported for one month (i.e. not 0 sales, but no mention of sales, so SAS cannot assign a value to that month), then that whole row disappears.
I do not want this to happen; I would instead like that row to appear with a value of 0. Is there a way to ensure every row appears?
Note: The months are not calendar months, so you could have month60 relating to 2011.
If the table is being created using proc summary or proc means, one way of getting the output you want, provided that you have at least 1 row for each month somewhere in your data, is to use the completetypes option, e.g.
proc summary data = sashelp.class completetypes;
class sex age;
var weight;
output out = mysummary mean=;
run;
This produces a row with frequency 0 for Sex = F, Age = 16 rather than skipping that output entirely.
A more reliable but more labour-intensive method, which works even if some values never appear anywhere in your data, is to use the classdata option, e.g.
data myclassdata;
do SEX = 'M','F';
do AGE = 13 to 17;
output;
end;
end;
run;
proc summary nway data = sashelp.class classdata=myclassdata exclusive;
class sex age;
var weight;
output out = mysummary2 mean=;
run;
The exclusive option here restricts the output to combinations of levels that are present in the classdata dataset. Without it, you get at least those specified in the classdata plus rows for all possible combinations based on observed 1-way values as though you had specified completetypes.
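The idea behind classdata (list every combination you expect, then fill in whatever the data provides) is not specific to SAS. A minimal sketch in Python, assuming hypothetical year/month keys and sales totals, could look like this:

```python
from itertools import product

# Hypothetical summary: (year, month) -> total sales. Month 59 has no rows at all.
sales = {(2011, 58): 120.0, (2011, 60): 75.5}

# The full set of levels we want to see, playing the role of the classdata dataset.
years = [2011]
months = [58, 59, 60]

# Every expected combination appears; missing ones are filled with 0.
complete = {key: sales.get(key, 0.0) for key in product(years, months)}

for (year, month), total in complete.items():
    print(year, month, total)
```

Like the exclusive option, this only emits the combinations listed in the expected levels, so anything in the data outside those levels is left out.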
I've created a procedure that predicts College football game lines by using the variables #Team1 and #Team2. In the current setup, these teams are entered manually.
For example: #Team1 = 'Ohio St.', #Team2 = 'Southern Miss.'
Then, my procedure runs a series of calculations on stat comparisons, strength of schedule, etc. to produce the hypothetical game line (in this case, Ohio St. -39).
Here's where I need your help: I'm trying to turn this line prediction system into a ranking system, ranking each team from best to worst. I'd like to take each team in my Team table and put it through this calculation for each possible matchup. Then, rank the teams based on who has the biggest advantage over every other team that given week vs. who has the least advantage.
Any ideas? I've toyed around with the idea of turning the calculation into a function and passing the values through that way, but I'm not sure where to start.
Thanks!
Apologies for the made-up column names, but the following should do what you want if you convert your proc to a function that takes the two team names as arguments:
Select a.Name as Team1
, b.Name as Team2
, fn_GetStats(a.Name, b.Name)
from TeamsList a
inner join TeamsList b
on a.Name > b.Name --To avoid duplicate rows
order by 3 desc
The join will create a list of all possible unique combinations (e.g. TeamB and TeamA, but not also TeamA and TeamB or TeamA and TeamA).
Assuming the proc outputs just a single value right now, this seems like the easiest solution. You could also do the same join and then loop through your proc with the results, instead.
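To get from the list of pairings to an actual ranking, one option is to average each team's predicted margin over all of its opponents. The sketch below shows that idea in Python. The `predict_line` function and the `ratings` table are hypothetical stand-ins for the real calculation, "Team C" is made up, and the sign convention (positive means the first team is favored) is an assumption.

```python
from itertools import combinations

# Hypothetical per-team strength numbers standing in for the real stats.
ratings = {"Ohio St.": 30.0, "Southern Miss.": -9.0, "Team C": 5.0}

def predict_line(team1, team2):
    """Hypothetical stand-in: positive result means team1 is favored by that margin."""
    return ratings[team1] - ratings[team2]

# Visit each unordered pair once, like the a.Name > b.Name join.
totals = {team: 0.0 for team in ratings}
for team1, team2 in combinations(sorted(ratings), 2):
    line = predict_line(team1, team2)
    totals[team1] += line
    totals[team2] -= line  # assumes the line for the reverse matchup is the negative

# Average advantage per opponent, biggest advantage first.
opponents = len(ratings) - 1
ranking = sorted(
    ((total / opponents, team) for team, total in totals.items()),
    reverse=True,
)

for average, team in ranking:
    print(f"{team}: {average:+.1f}")
```

If the real calculation is not symmetric (for example, because of home-field adjustments), each ordered pair would need to be evaluated separately instead of reusing the negated line.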
I have 2 SAS data sets, Lab and Rslt. They both have all of the same variables, but Rslt is supposed to have what is essentially a subset of Lab. For what I'm trying to do, there are 4 important variables: visit, accsnnum, battrnam, and lbtestcd. All are character variables. I want to compare the two files Lab and Rslt to find out where they vary -- specifically, I need to know the count of lbtestcd per unique accsnnum.
But I must control for a few factors. First, I only need to compare observations that have "Lipid Panel" or "Chemistry (6)" in the battrnam variable. The Rslt file only contains these observations, so we don't need to worry about that one. So I subsetted Lab using this code:
data work.lab;
set livingston.ndb_lab_1;
where battrnam contains "Lipid Panel" or battrnam = "Chemistry (6)";
run;
This worked fine. Now, I need to control for the variable visit. I need to get rid of all observations in both Lab and Rslt that have visits that contain "Day 1" or "Screening". I accomplished this using the following code:
data work.lab;
set work.lab;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
data work.rslt;
set work.rslt;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;
Now this is where I get stuck. I need a way to compare the count of lbtestcd by accsnnum between the two separate files Lab and Rslt, and I need it to flag each accsnnum where the count of lbtestcd differs between Lab and Rslt. For example, if Lab has an accsnnum A1 that has 5 unique lbtestcd values, and Rslt has the accsnnum A1 with 7 unique lbtestcd values, I need that one to be brought to my attention.
I can do a proc freq for each file, but these are large data sets and I don't want to have to compare by hand. Perhaps I could export the count of lbtestcd by accsnnum to a variable in a new 3rd dataset for each of the 2 files Lab and Rslt, then create a variable that is the difference of these two? So that if difference != 0 then I can get a report of those accsnnum values? Advice in SQL will work too, as I can run that through SAS.
Edit
I've used some SQL to get the count of lbtestcd by accsnum for each data set using the code below, though I still need to figure out how to export these values to a data set to compare.
proc sql;
select accsnnum, count(lbtestcd)
from work.lab1
group by accsnnum;
quit;
proc sql;
select accsnnum, count(lbtestcd)
from work.rslt1
group by accsnnum;
quit;
Thanks for any and all help you can give. This one is really stumping me!
I would do a PROC FREQ on each dataset (or proc whatever-you-like-that-does-counts) and then use PROC COMPARE. For example:
proc freq data=rslt1;
tables accsnnum*lbtestcd/out=rsltcounts;
run;
proc freq data=lab1;
tables accsnnum*lbtestcd/out=labcounts;
run;
proc compare base=labcounts compare=rsltcounts out=compares /* options */;
by accsnnum;
run;
PROC COMPARE has a lot of options; in this case the most helpful would probably be:
outnoequal - only outputs rows that are not identical in the two datasets
outbase and outcomp - outputs a row for each of BASE and COMPARE datasets (if OUTNOEQUAL, then only when they differ)
outdif - outputs 'difference' rows, ie, one minus the other; this may or may not be helpful for you
The documentation lists all of the options. You may also need to look at the METHOD options if your data might have numeric precision issues.
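The count-then-compare logic itself is small enough to sketch outside SAS. The following Python example, using made-up accession numbers and test codes, counts the distinct lbtestcd values per accsnnum in each dataset and reports only the accsnnum values where the two counts differ. It is an illustration of the idea, not a replacement for PROC COMPARE.

```python
from collections import defaultdict

def distinct_test_counts(rows):
    """Return {accsnnum: number of distinct lbtestcd values} for (accsnnum, lbtestcd) rows."""
    seen = defaultdict(set)
    for accsnnum, lbtestcd in rows:
        seen[accsnnum].add(lbtestcd)
    return {accsnnum: len(tests) for accsnnum, tests in seen.items()}

# Hypothetical rows: (accsnnum, lbtestcd)
lab = [("A1", "CHOL"), ("A1", "HDL"), ("A1", "LDL"), ("A2", "GLUC")]
rslt = [("A1", "CHOL"), ("A1", "HDL"), ("A2", "GLUC"), ("A3", "TRIG")]

lab_counts = distinct_test_counts(lab)
rslt_counts = distinct_test_counts(rslt)

# Keep only the accession numbers whose counts differ; a missing one counts as 0.
mismatches = {
    acc: (lab_counts.get(acc, 0), rslt_counts.get(acc, 0))
    for acc in sorted(set(lab_counts) | set(rslt_counts))
    if lab_counts.get(acc, 0) != rslt_counts.get(acc, 0)
}

for acc, (n_lab, n_rslt) in mismatches.items():
    print(acc, n_lab, n_rslt, n_lab - n_rslt)
```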
I have a query to pull clickthrough for a funnel, where if a user hit a page it records as "1", else NULL --
SELECT datestamp
,COUNT(visits) as Visits
,count([QE001]) as firstcount
,count([QE002]) as secondcount
,count([QE004]) as thirdcount
,count([QE006]) as finalcount
,user_type
,user_loc
FROM
dbname.dbo.loggingtable
GROUP BY datestamp, user_type, user_loc
I want to have a column for each ratio, e.g. firstcount/Visits, secondcount/firstcount, etc. as well as a total (finalcount/Visits).
I know this can be done:
in an Excel PivotTable by adding a "calculated field"
in SQL by grouping
in PowerPivot by adding a CalculatedColumn, e.g.
=IFERROR(QueryName[finalcount]/QueryName[Visits],0)
BUT I need to give the report consumer the option of slicing by just user_type or just user_loc, etc., and Excel will tend to ADD the proportions, which won't work because
SUM(A/B) != SUM(A)/SUM(B)
Is there a way in DAX/MDX/PowerPivot to add a calculated column/measure, so that it will be calculated as SUM(finalcount)/SUM(Visits), for any user-defined subset of the data (daterange, user type, location, etc.)?
Yes, via calculated measures. Calculated columns are for creating values that you want to see on rows, columns, or the report header; calculated measures are for creating values that you want to see in the values section of a pivot table and slice and dice by the columns in the model.
The easiest way would be to create 3 calculated "measures" in the calculation area of the powerpivot sheet.
TotalVisits:=SUM(QueryName[visits])
TotalFinalCount:=SUM(QueryName[finalcount])
TotalFinalCount2VisitsRatio:=[TotalFinalCount]/[TotalVisits]
You can then slice the calculated measure [TotalFinalCount2VisitsRatio] by user_type or just user_loc (or whatever) and the value will be calculated correctly. The difference here is that you are explicitly telling the xVelocity engine to SUM-then-DIVIDE. If you create the calculated column, then the engine thinks you want to DIVIDE-then-SUM.
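A small worked example makes the SUM-then-DIVIDE versus DIVIDE-then-SUM difference concrete. The two rows below are made-up numbers, and the arithmetic is done in Python only for illustration.

```python
# Hypothetical per-row counts for two user types.
rows = [
    {"user_type": "new", "visits": 100, "finalcount": 10},
    {"user_type": "returning", "visits": 300, "finalcount": 90},
]

# DIVIDE-then-SUM: what adding up a calculated column does.
sum_of_ratios = sum(r["finalcount"] / r["visits"] for r in rows)

# SUM-then-DIVIDE: what the measure [TotalFinalCount]/[TotalVisits] does.
ratio_of_sums = sum(r["finalcount"] for r in rows) / sum(r["visits"] for r in rows)

print(f"divide then sum: {sum_of_ratios:.2f}")
print(f"sum then divide: {ratio_of_sums:.2f}")
```

Here the row ratios are 0.10 and 0.30, which add up to 0.40, while the overall conversion is 100 / 400 = 0.25. Only the second figure is a meaningful ratio for the combined slice.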
Also, you don't have to break down the measure into 3 separate measures...it's just good practice. If you're interested in learning more, I'd recommend this book...the author is the PowerPivot/DAX guru and the book is very straightforward.