How to do conditional count based on row value in SAS/SQL?

Re-uploading since there were some problems with my last post, and I did not know that we were supposed to post sample data. I'm fairly new to SAS, and I have a problem that I know how to solve in Excel but not in SAS. However, the dataset is too large to reasonably work with in Excel.
I have four variables: id, year_start, group_name, test_score.
Sample data:
id year_start group_name test_score
1 19931231 Red 90
1 19941230 Red 89
1 19951231 Red 91
1 19961231 Red 92
2 19930630 Red 85
2 19940629 Red 87
2 19950630 Red 95
3 19950931 Blue 90
3 19960931 Blue 90
4 19930331 Red 95
4 19940331 Red 97
4 19950330 Red 98
4 19960331 Red 95
5 19931231 Red 96
5 19941231 Red 97
My goal is to produce a ranked list (fractional) by test_score for each year. I hoped I could do this with PROC RANK and the FRACTION option, which assigns an order by test_score (highest is 1, 2nd highest is 2, and so on) and then divides by the total number of observations to give a fractional rank. Unfortunately, year_start differs widely from row to row. For each id/year combination, I want to perform a one-year look-back from that row's year_start and rank that observation against all other ids whose year_start falls within that one-year range. I'm not interested in comparing by calendar year; the rank of each id should be relative to its own year_start. Adding another level of complication, I would like this rank to be computed separately by group_name.
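For reference, a minimal sketch of plain PROC RANK with the FRACTION and DESCENDING options (assuming a dataset named have, already sorted by group_name) is shown below; it ranks within group_name across all years at once, so it does not handle the rolling one-year window I need:
proc rank data=have out=want fraction descending;
  by group_name;       /* rank separately within each group */
  var test_score;      /* variable to rank */
  ranks score_rank;    /* fractional rank: highest test_score gets 1/n */
run;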
PROC SQL is totally fine if someone has a SQL solution.
Using the above data, the ranks would be like this:
id year_start group_name test_score rank
1 19931231 Red 90 0.75
1 19941230 Red 89 0.8
1 19951231 Red 91 1
1 19961231 Red 92 1
2 19930630 Red 85 1
2 19940629 Red 87 0.8
2 19950630 Red 95 0.75
3 19950931 Blue 90 1
3 19960931 Blue 90 1
4 19930331 Red 95 1
4 19940331 Red 97 0.2
4 19950330 Red 98 0.2
4 19960331 Red 95 0.333
5 19931231 Red 96 0.25
5 19941231 Red 97 0.667
To calculate the rank for row 1, we first exclude the Blue observations. Then we count the observations whose year_start falls within one year before that row's year_start, 19931231 (which gives 4 observations). Next we count how many of those observations have a higher test_score and add 1 to find the order of the current observation (so it is the 3rd highest). Finally, we divide the order by the total count to get the rank (3/4 = 0.75).
In Excel, the formula for this variable would look something like the following. Assume the formula is for row 1 and there are 100 rows, with id in column A, year_start in B, group_name in C, and test_score in D:
=(1+countifs(D1:D100,">"&D1,
B1:B100,"<="&B1,
B1:B100,">"&B1-365.25,
C1:C100, C1))/
countifs(B1:B100,"<="&B1,
B1:B100,">"&B1-365.25,
C1:C100, C1)
Thanks so much for the help!
ahammond428

Your example isn't correct if I'm reading it correctly, so it's hard to know exactly what you're trying to do, but try the following and see if it works. You may need to tweak the inequalities to be open or closed depending on whether you want the window to include exactly one year to the date. Note that your year_start column needs to be a SAS date for this to work; otherwise convert it first, for example with input(put(year_start, 8.), yymmdd8.) if it was read in as a numeric yyyymmdd value.
proc sql;
  select distinct
    a.id,
    a.year_start,
    a.group_name,
    a.test_score,
    1 + sum(case when b.test_score > a.test_score then 1 else 0 end) as rank_num,
    count(b.id) as rank_denom,
    calculated rank_num / calculated rank_denom as rank
  from testdata a left join testdata b
    on a.group_name = b.group_name
    and intnx('year', a.year_start, -1, 's') le b.year_start le a.year_start
  group by a.id, a.year_start, a.group_name, a.test_score
  order by id, year_start;
quit;
Note that I changed dates of 9/31 to 9/30 (since there is no 9/31), but left 3/30, 6/29, and 12/30 alone since perhaps that was intended, though the other dates seem to be quarter-end.
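As a side note, if year_start was imported as a numeric yyyymmdd value (for example 19931231) rather than as a character string, a minimal conversion sketch (assuming the dataset is named testdata, as in the code above) would be:
data testdata;
  set testdata;
  year_start = input(put(year_start, 8.), yymmdd8.);  /* numeric yyyymmdd -> SAS date value */
  format year_start date9.;
run;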

Consider correlated count subqueries in SQL:
DATA
data ranktable;
infile datalines missover;
input id year_start group_name $ test_score;
datalines;
1 19931231 Red 90
1 19941230 Red 89
1 19951231 Red 91
1 19961231 Red 92
2 19930630 Red 85
2 19940629 Red 87
2 19950630 Red 95
3 19950930 Blue 90
3 19960930 Blue 90
4 19930331 Red 95
4 19940331 Red 97
4 19950330 Red 98
4 19960331 Red 95
5 19931231 Red 96
5 19941231 Red 97
;
run;
data ranktable;
set ranktable;
format year_start date9.;
year_start = input(put(year_start,z8.),yymmdd8.);
run;
PROC SQL
Additional fields included for your review
proc sql;
  select r.id, r.year_start, r.group_name, r.test_score,
    put(intnx('year', r.year_start, -1, 's'), yymmdd10.) as year_ago,
    (select count(*) from ranktable sub
      where sub.test_score >= r.test_score
        and sub.group_name = r.group_name
        and sub.year_start <= r.year_start
        and sub.year_start >= intnx('year', r.year_start, -1, 's')) as num_rank,
    (select count(*) from ranktable sub
      where sub.group_name = r.group_name
        and sub.year_start <= r.year_start
        and sub.year_start >= intnx('year', r.year_start, -1, 's')) as denom_rank,
    calculated num_rank / calculated denom_rank as rank
  from ranktable r;
quit;
OUTPUT
You will notice a slight difference from your expected results, which may be due to the fixed quarter-day year (365.25 days) you apply for every year in Excel: SAS's INTNX steps back one full calendar year, whose length in days changes from year to year.
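To illustrate, here is a minimal sketch (date literals chosen only for illustration) showing that the INTNX look-back window is 365 days in a normal year and 366 days when the window spans a leap year:
data _null_;
  do d = '31dec1995'd, '31dec1996'd;
    lookback = intnx('year', d, -1, 's');  /* same-day alignment, one calendar year earlier */
    window   = d - lookback;               /* 365 days for 1995, 366 days for leap-year 1996 */
    put d date9. +1 lookback date9. +1 window=;
  end;
run;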

Related

Assigning Score based on Order Sequence in pandas

Following are the dataframes I have
score_df
col1_id col2_id score
1 2 10
5 6 20
records_df
date col_id
D1 6
D2 4
D3 1
D4 2
D5 5
D6 7
I would like to compute a score based on the following criteria:
When 2 occurs after 1, the score assigned should be 10; likewise, when 1 occurs after 2, the score assigned should be 10.
That is, (1,2) gives a score of 10, and (2,1) also gets the same score of 10.
Considering (1,2): when 1 occurs for the first time we don't assign a score. We flag the row and wait for 2 to occur. When 2 occurs in the column we give the score 10.
Considering (2,1): when 2 comes first, we assign 0 and wait for 1 to occur. When 1 occurs, we give the score 10.
So, for the first occurrence, don't assign the score; wait for the corresponding event to occur and then assign the score.
So, my result dataframe should look something like this
result
date col_id score
D1 6 0 -- even though 6 is in the score list, it occurred for the first time, so 0
D2 4 0 -- 4 is not even in the list
D3 1 0 -- 1 occurred for the first time, so 0
D4 2 10 -- 1 occurred previously and 2 occurs now, so we can assign 10
D5 5 20 -- 6 occurred previously, so we can assign 20
D6 7 0 -- 7 is not in the list
I have around 100k rows in both score_df and records_df. Looping and assigning the score is taking too long. Can someone help with logic that avoids looping over the entire dataframe?
From what I understand, you can try melt to unpivot and then merge. Keeping the index from the melted df, we check where the index is duplicated, and then return the score from the merge, else 0. (Note that the code and output below assume both frames also carry a shared uid column.)
m = score_df.reset_index().melt(['index','uid','score'],
var_name='col_name',value_name='col_id')
final = records_df.merge(m.drop('col_name',1),on=['uid','col_id'],how='left')
c = final.duplicated(['index']) & final['index'].notna()
final = final.drop('index',1).assign(score=lambda x: x['score'].where(c,0))
print(final)
uid date col_id score
0 123 D1 6 0.0
1 123 D2 4 0.0
2 123 D3 1 0.0
3 123 D4 2 10.0
4 123 D5 5 20.0
5 123 D6 7 0.0

Find the largest value from column using SQL?

I am using SQL, and I have a table with values like
A B
X1 2 4 6 8 10
X2 2 33 44 56 78 98 675 891 11111
X3 2 4 672 234 2343 56331
X4 51 123 232 12 12333
I want a query that returns col A and col B for the row where col B has the maximum count of values, i.e. the output should be
x2 2 33 44 56 78 98 675 891 11111
Query I use:
select max(B) from table
Results in
51 123 232 12 12333
Assuming that both columns are strings, and that column B uses single space for separators and no leading/trailing spaces, you can use this approach:
SELECT A, B
FROM MyTable
ORDER BY LENGTH(B) - LENGTH(REPLACE(B, ' ', '')) DESC
FETCH FIRST 1 ROW ONLY
The heart of this solution is the LENGTH(B) - LENGTH(REPLACE(B, ' ', '')) expression, which counts the number of spaces in the string B.
Note: FETCH FIRST N ROWS ONLY is Oracle-12c syntax. For earlier versions use ROWNUM approach described in this answer.
In case there is more than one separating space, or more than one row meets the criteria, do this: count the number of spaces (or groups of spaces) in each row using regexp_count(), use rank() to find the row(s) with the most (groups of) spaces, and take only the rows ranked 1:
demo
select *
from (select t.*, rank() over (order by regexp_count(b, ' +') desc) rnk from t)
where rnk = 1

PowerPivot formula for row wise weighted average

I have a table in PowerPivot which contains the logged data of a traffic control camera mounted on a road. The table holds the velocity and the number of vehicles that passed the camera during specific time windows (e.g. 14:10 - 15:25). I want to know how I can get the average velocity of cars for a specific hour and list the results in a separate table with 24 rows (hours 0-23), where the second column of each row is the weighted average velocity of that hour. A sample of my stat_table data is given below:
count vel hour
----- --- ----
133 96.00237 15
117 91.45705 21
81 81.90521 6
2 84.29946 21
4 77.7841 18
1 140.8766 17
2 56.14951 14
6 71.72839 13
4 64.14309 9
1 60.949 17
1 77.00728 21
133 100.3956 6
109 100.8567 15
54 86.6369 9
1 83.96901 17
10 114.6556 21
6 85.39127 18
1 76.77993 15
3 113.3561 2
3 94.48055 2
In a separate PowerPivot table I have 24 rows and 2 columns, but when I enter my formula, all of the rows get updated with the same number. My formula is:
=sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count] * stat_table[vel])/sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count])
Create a new calculated column named "WeightedVelocity" as follows
WeightedVelocity = [count]*[vel]
Create a measure "WeightedAverage" as follows
WeightedAverage = sum(stat_table[WeightedVelocity]) / sum(stat_table[count])
Use measure "WeightedAverage" in VALUES area of pivot Table and use "hour" column in ROWS to get desired result.

sql server 2008 - calculated and ordered list needs to return only 2 entries per supplier

I have a dataset like the one below, but longer. I want to pick the fleet_ids with the best 'StarDriver' value overall, but return only two results for each 'supplier_id' and a maximum of 20 rows in total.
fleet_id supplier_id Ratings Driver Punctuality Car StarDriver
19442 151 10 5 5 5 5
19634 151 11 5 5 5 5
19437 151 12 5 5 5 5
12832 10 14 5 4.92857142857143 5 4.97619047619048
12217 111 10 5 5 4.9 4.96666666666667
21135 158 19 5 4.89473684210526 5 4.96491228070175
19436 151 14 4.85714285714286 5 5 4.95238095238095
12239 111 12 4.91666666666667 5 4.91666666666667 4.94444444444445
10520 92 12 4.91666666666667 5 4.91666666666667 4.94444444444445
19997 151 12 5 5 4.83333333333333 4.94444444444444
To limit to the top 2 for each supplier, use row_number(). This will enumerate the rows and you can choose just two with where seqnum <= 2.
The rest of the query is just selecting 20 rows based on a field:
select top 20 t.*
from (select t.*,
             row_number() over (partition by supplier_id order by StarDriver desc) as seqnum
      from table t
     ) t
where seqnum <= 2
order by StarDriver desc;

complex sorting sql

I have the following table
Priority Time
100 1
86 3
85 2
I want to sort it first by priority and then by time; however, priorities within 20 points of each other are treated as the same level, e.g. 100 and 85 are considered the same priority level.
so the result will be:
Priority Time
100 1
85 2
86 3
Thanks,
Try this (assuming that priority is an integer)
select *
from foobar
order by ( priority / 20 ) desc ,  -- integer division: 0-19 yields 0, 20-39 yields 1, etc.; highest bucket first
         time