SQL - How to query a VIEW with a CASE expression - Find the average number in a column

I am looking to compute the average win odds and average place odds of horse racing markets.
I have tried a CASE expression, since averages for WinOdds should only be calculated when Place = 1, and averages for PlaceOdds should only be included when Place <= 10:
SELECT
CASE WHEN Place = 1 THEN AVG(IndustrySP) AS AvgWinOdds,
CASE WHEN Place <= 10 THEN AVG((IndustrySP - 1.0) / 5) AS AvgPlaceOdds
FROM dbo.GrandNational -- This is the `view` I want to query
I am looking for the average odds for a win (when Place = 1)
and the average odds for a place (when Place <= 10).
It should return something like this:
AvgWinOdds    AvgPlaceOdds
6.44          4.22

You must apply the aggregate to the CASE expressions, not the other way around:
SELECT
AVG(CASE WHEN Place = 1 THEN IndustrySP END) AS AvgWinOdds,
AVG(CASE WHEN Place <= 10 THEN (IndustrySP - 1.0) / 5 END) AS AvgPlaceOdds
FROM dbo.GrandNational
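The trick generalizes to any conditional aggregate: the filter goes inside the CASE, the aggregate stays outside, and rows that fail the WHEN become NULL, which AVG ignores. A quick way to convince yourself, here using SQLite from Python with made-up numbers in place of the real view:

```python
import sqlite3

# Hypothetical stand-in for the dbo.GrandNational view: one row per runner.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE GrandNational (Place INTEGER, IndustrySP REAL)")
con.executemany("INSERT INTO GrandNational VALUES (?, ?)",
                [(1, 6.0), (2, 11.0), (3, 16.0), (12, 99.0)])

row = con.execute("""
    SELECT AVG(CASE WHEN Place = 1   THEN IndustrySP END)              AS AvgWinOdds,
           AVG(CASE WHEN Place <= 10 THEN (IndustrySP - 1.0) / 5 END)  AS AvgPlaceOdds
    FROM GrandNational
""").fetchone()

# Only Place = 1 feeds AvgWinOdds; Places 1-3 feed AvgPlaceOdds; Place 12 feeds neither.
print(row)  # (6.0, 2.0)
```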

Related

Assign an age to a person based on known population average but no Date of birth

I would like to use Postgres SQL to assign an age category to a list of households, where we don't know the date of birth of any of the family members.
Dataset looks like:
household_id    household_size
x1              5
x2              1
x3              8
...             ...
I then have a set of percentages for each age group, with that dataset looking like:
age_group    percentage
0-18         30
19-30        40
31-100       30
I want the query to calculate an assignment that makes the whole dataset as close to the percentages as possible, and if possible similar at the household level (not as important). The dataset will end up looking like:
household_id    household_size    0-18    19-30    31-100
x1              5                 2       2        1
x2              1                 0       1        0
x3              8                 3       3        2
...             ...               ...     ...      ...
I have looked at the ntile function, but any pointers as to how I could handle this with Postgres would be really helpful.
I didn't want to post an answer with just a link, so I figured I'd give it a shot and see if I can simplify depesz's weighted_random to plain SQL. The result is a slower, less readable, worse version of it, but in shorter, plain SQL:
CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] )
RETURNS ANYELEMENT language sql as $$
select choice
from
( select case when (sum(weight) over (rows UNBOUNDED PRECEDING)) >= hit
then choice end as choice
from ( select unnest(p_choices) as choice,
unnest(p_weights) as weight ) inputs,
( select sum(weight)*random() as hit
from unnest(p_weights) a(weight) ) as random_hit
) chances
where choice is not null
limit 1
$$;
It's not inlineable because of the aggregate and window function calls. It can be made faster if you assume the weights are probabilities that sum to 1.
The principle is that you provide an array of choices and an equal-length array of weights (they can be percentages but don't have to be, nor do they have to sum to any specific number):
update test_area t
set ("0-18",
"19-30",
"31-100")
= (with cte AS (
select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}')
as age_group
from generate_series(1,household_size,1))
select count(*) filter (where age_group='0-18') as "0-18",
count(*) filter (where age_group='19-30') as "19-30",
count(*) filter (where age_group='31-100') as "31-100"
from cte)
returning *;
Online demo showing that both his version and mine are statistically reliable.
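Outside SQL, the cumulative-weight trick in the function above is just a running sum compared against a uniform draw scaled by the total weight. A minimal Python sketch of the same idea (not a translation of the Postgres function, just the principle):

```python
import random

def weighted_random(choices, weights):
    """Pick one choice; weights need not sum to 1 or to any particular total."""
    hit = sum(weights) * random.random()
    running = 0.0
    for choice, weight in zip(choices, weights):
        running += weight
        if running >= hit:
            return choice
    return choices[-1]  # guard against floating-point rounding at the top end

# Sanity check: frequencies should approach the 30/40/30 split.
counts = {"0-18": 0, "19-30": 0, "31-100": 0}
for _ in range(100_000):
    counts[weighted_random(["0-18", "19-30", "31-100"], [30, 40, 30])] += 1
print(counts)
```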
A minimum start could be:
SELECT
household_id,
MIN(household_size) as size,
ROUND(SUM(CASE WHEN agegroup_from=0 THEN g ELSE 0 END),1) as g1,
ROUND(SUM(CASE WHEN agegroup_from=19 THEN g ELSE 0 END),1) as g2,
ROUND(SUM(CASE WHEN agegroup_from=31 THEN g ELSE 0 END),1) as g3
FROM (
SELECT
h.household_id,
h.household_size,
p.agegroup_from,
p.percentage/100.0 * h.household_size as g
FROM households h
CROSS JOIN PercPerAge p) x
GROUP BY household_id
ORDER BY household_id;
output:
household_id    size    g1     g2     g3
x1              5       1.5    2.0    1.5
x2              1       0.3    0.4    0.3
x3              8       2.4    3.2    2.4
see: DBFIDDLE
Notes:
Of course you should round the columns g to whole numbers, taking into account the complete split (g1+g2+g3 = total).
Because g1, g2 and g3 are based on percentages, their values can shift during rounding (as long as the total split is OK; for more info, see: Return all possible combinations of values on columns in SQL).
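One way to do the rounding mentioned in the notes, so that g1+g2+g3 still equals the household size, is the largest-remainder method. A sketch in Python, purely to illustrate the arithmetic (the function name is made up):

```python
import math

def split_household(size, percentages):
    """Allocate `size` members across groups so the parts always sum to `size`."""
    exact = [size * p / 100.0 for p in percentages]
    parts = [math.floor(g) for g in exact]
    # Hand the leftover members to the groups with the largest fractional remainders.
    leftover = size - sum(parts)
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - parts[i], reverse=True)
    for i in by_remainder[:leftover]:
        parts[i] += 1
    return parts

print(split_household(5, [30, 40, 30]))  # [2, 2, 1]
print(split_household(8, [30, 40, 30]))  # [3, 3, 2]
print(split_household(1, [30, 40, 30]))  # [0, 1, 0]
```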

Oracle 11g Nested Case Statement Calculation

In Oracle 11g, I am trying to get to a sell price from a query of data. Yes, I could export this and write the code somewhere else, but I want to do this elegantly in the query.
I only seem to get the first part of the equation and not the last CASE, where I use:
WHEN sales_code
What I am ultimately trying to do is take the result from the top and divide it by the bottom, except in the case of sales_code 4, where I add 1+1, i.e. 2, to the top result and then divide by the equation.
round(to_number(price) *
CASE WHEN class_code='X'
THEN .48
ELSE .5
END * e1.set_qty +
CASE WHEN carton_pack_qty = '1'
THEN 0
ELSE (
CASE WHEN NVL(SUBSTR(size, 1,NVL(LENGTH(size) - 2,0)),1) > '35'
THEN 3.5
ELSE 3
END)
END +
CASE
WHEN sales_code='1' THEN 0 /(1-17/100)
WHEN sales_code='2' THEN 0 /(1-5/100)
WHEN sales_code='3' THEN 0 /(1-18/100)
WHEN sales_code='4' THEN 1+1 / (1-9.5/100)
WHEN sales_code='5' THEN 0 /(1-17/100)
WHEN sales_code='6' THEN 0 /(1-8/100)
WHEN sales_code='7' THEN 0 /((1-150)/100)
ELSE (100/100)
END,2) AS "Price",
I get a result from the query, but not the whole calculation. I tried this many other ways and there was always an error with parentheses or some other arbitrary error.
Any help would be appreciated.
I think this is your problem:
WHEN sales_code='1' THEN 0 /(1-17/100)
CASE returns a scalar, a number. You're trying to have it return the second half of the formula in your calculation. You need something more like this:
...
END +
CASE WHEN sales_code='4' THEN 1 ELSE 0 END /
CASE
WHEN sales_code='1' THEN (1-17/100)
WHEN sales_code='2' THEN (1-5/100)
WHEN sales_code='3' THEN (1-18/100)
WHEN sales_code='4' THEN (1-9.5/100)
WHEN sales_code='5' THEN (1-17/100)
WHEN sales_code='6' THEN (1-8/100)
WHEN sales_code='7' THEN ((1-150)/100)
ELSE 1 END ...
Actually, I'm not entirely sure what you're trying to do with sales_code='4', but that looks close.
I think I understand now what you are trying to do. Almost at least :-)
The first thing you should do is write down the complete formula with parentheses where needed. Something like:
final = ((price * class_code_factor * set_qty) + quantity_summand + two_if_sales_code4) * sales_code_factor
(That last part looks like a percentage factor, not a divisor to me. I may be wrong of course.)
Once you have the formula right, translate this to SQL:
ROUND
(
(
(
TO_NUMBER(price) *
CASE WHEN class_code = 'X' THEN 0.48 ELSE 0.5 END *
e1.set_qty
)
+
CASE WHEN carton_pack_qty = 1 THEN 0
ELSE CASE WHEN NVL(SUBSTR(size, 1,NVL(LENGTH(size) - 2,0)),1) > '35'
THEN 3.5
ELSE 3
END
END
+
CASE WHEN sales_code = 4 THEN 2 ELSE 0 END
)
*
CASE
WHEN sales_code = 1 THEN 1 - (17 / 100)
WHEN sales_code = 2 THEN 1 - (5 / 100)
WHEN sales_code = 3 THEN 1 - (18 / 100)
WHEN sales_code = 4 THEN 1 - (9.5 / 100)
WHEN sales_code = 5 THEN 1 - (17 / 100)
WHEN sales_code = 6 THEN 1 - (8 / 100)
WHEN sales_code = 7 THEN (1 - 150) / 100
ELSE 1
END
, 2 ) AS "Price",
Adjust this to the formula you actually want. There are some things I want to point out:
Why is price not a number in your database, but a string that you must convert with TO_NUMBER? It should not be; store values in the appropriate type in your database.
In a good database you would not have to get a substring of size. It seems you are storing two different things in this column, which violates database normalization. Separate the two things and store them in separate columns.
The substring construct looks strange, too. You are taking the left part of size, leaving out the last two characters. It seems you don't know the length of the part you are getting, so let's say it can be one, two or three characters. (I don't know, of course.) You then compare this result with another string; a string that contains a numeric value. But as you are comparing strings, '4' is greater than '35', because '4' > '3'. And '200' is less than '35', because '2' < '3'. Is this really intended?
There are more things you treated as strings, and I took the liberty of changing them to numbers. For instance, a quantity (carton_pack_qty) should be stored as a number, so do this and don't compare it to the string '1' but to the number 1. The sales code seems to be numeric, too. Again, I may be wrong.
In a good database there would be no magic numbers in the query. Knowledge belongs in the database, not in the query. If a class code 'X' means a factor of 0.48 and other class codes mean a factor of 0.5, then why is there no table of class codes showing what a class code represents and what factor to apply? The same goes for the mysterious summand of 3 resp. 3.5; there should be a table holding these values and the size and quantity ranges they apply to. And lastly there is the sales code, which should also be stored in a table showing the summand (2 for code 4, 0 otherwise) and the factor.
The query part would then look something like this:
ROUND(((price * cc.factor * e1.set_qty) + qs.value + sc.value) * sc.factor, 2) AS "Price"
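To sanity-check the operator precedence, it can help to transcribe the full formula as a plain function. The sketch below follows the parenthesized formula above; all parameter names are made up, and sale code 7's suspicious (1-150)/100 factor is left out:

```python
def sell_price(list_price, class_code, set_qty, carton_pack_qty, size_num, sales_code):
    """(price * class factor * qty + carton summand + sale-4 summand) * discount factor."""
    base = list_price * (0.48 if class_code == "X" else 0.5) * set_qty
    summand = 0 if carton_pack_qty == 1 else (3.5 if size_num > 35 else 3)
    if sales_code == 4:
        summand += 2  # the "1+1" added on top for sale code 4
    # Discount percentages per sales code; code 7 omitted as its factor looks like a typo.
    discounts = {1: 17, 2: 5, 3: 18, 4: 9.5, 5: 17, 6: 8}
    factor = 1 - discounts[sales_code] / 100 if sales_code in discounts else 1
    return round((base + summand) * factor, 2)

print(sell_price(10, "X", 2, 1, 30, 4))  # (10*0.48*2 + 2) * (1 - 0.095) = 10.5
```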
Breaking the dividend into a subquery, and then adding parentheses around it to divide by in the main query, worked:
(
select
style,
to_number(price) *
CASE WHEN class_code='X'
THEN .48
ELSE .5
END * set_qty +
CASE WHEN carton_pack_qty = '1'
THEN 1
ELSE (
CASE WHEN to_number(NVL(SUBSTR(size, 1,NVL(LENGTH(size) - 2,0)),1)) > 35
THEN 3.5
ELSE 3
END)
END as Price
FROM STYL1 s1,STY2 s2
WHERE s1.style=s2.style
) P1

Postgres CASE condition with SUM aggregation evaluates unneeded ELSE part

According to Postgres documentation:
A CASE expression does not evaluate any subexpressions that are not
needed to determine the result. For example, this is a possible way of
avoiding a division-by-zero failure:
SELECT ... WHERE CASE WHEN x <> 0 THEN y/x > 1.5 ELSE false END;
Why does the following expression return an ERROR: division by zero? - apparently evaluating the else part:
SELECT CASE WHEN SUM(0) = 0 THEN 42 ELSE 43 / 0 END
while
SELECT CASE WHEN SUM(0) = 0 THEN 42 ELSE 43 END
returns 42.
EDIT: So the example above fails because Postgres evaluates immutable expressions (43/0) already in the planning phase. Our actual query looks more like this:
case when sum( column1 ) = 0
     then 0
     else round( sum( price
                      * hours
                      / column1 ), 2 )
end
Although this expression doesn't look immutable (it depends on actual values), there is still a division-by-zero error. Of course, sum(column1) is actually 0 in our case.
Interesting example. This does have a good explanation. Say you have data like this:
db=# table test;
column1 | price | hours
---------+-------+-------
1 | 2 | 3
3 | 2 | 1
PostgreSQL executes your SELECT in two passes, first it would calculate all the aggregate functions (like sum()) present:
db=# select sum(column1) as sum1, sum(price * hours / column1) as sum2 from test;
sum1 | sum2
------+------
4 | 6
And then it would plug those results in your final expression and calculate the actual result:
db=# with temp as (
db(# select sum(column1) as sum1, sum(price * hours / column1) as sum2 from test
db(# ) select case when sum1 = 0 then 0 else round(sum2, 2) end from temp;
round
-------
6.00
Now clearly if there's an error in the first aggregate pass, it never reaches the CASE statement.
So this isn't really a problem in documentation about the CASE statement -- it applies to all conditional constructs -- but about the way aggregates are processed in a SELECT statement. This kind of problem cannot occur in any other context because aggregates are only allowed in SELECT.
But the documentation does need updating in this case too. The right documentation in this instance is "the general processing of SELECT". Step #4 there talks about GROUP BY and HAVING clauses, but it actually evaluates any aggregate functions in this step as well, regardless of GROUP BY/HAVING. And your CASE statement is evaluated in step #5.
Solution
The common solution, if you want to ignore aggregate inputs that would otherwise cause a division by zero, is to use the nullif() construct to turn them into NULLs:
round( sum( price
* hours
/ nullif(column1, 0) ), 2 )
PostgreSQL 9.4 will introduce a new FILTER clause for aggregates, which can also be used for this purpose:
round( sum( price
* hours
/ column1
) filter (where column1!=0), 2 )
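The nullif() arithmetic can be checked with SQLite from Python. (Note SQLite returns NULL for x/0 rather than raising an error, so this only demonstrates the NULLIF fix, not the original Postgres failure.) The data is the two-row example above plus a zero divisor:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (column1 INTEGER, price INTEGER, hours INTEGER)")
con.executemany("INSERT INTO test VALUES (?, ?, ?)",
                [(1, 2, 3), (3, 2, 1), (0, 5, 5)])  # last row would divide by zero

# NULLIF turns the zero divisor into NULL; the whole product becomes NULL and
# SUM skips it. Integer division keeps 2*1/3 = 0, as in the walkthrough above.
row = con.execute("""
    SELECT ROUND(SUM(price * hours / NULLIF(column1, 0)), 2) FROM test
""").fetchone()
print(row)  # (6.0,)
```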

How to find daily differences over a flexible time period?

I have a set of data as follows:
CustomerId char(6)
Points int
PointsDate date
with example data such as:
000021 0 01-JAN-2014
000021 10 02-JAN-2014
000021 20 03-JAN-2014
000021 30 06-JAN-2014
000021 40 07-JAN-2014
000021 10 12-JAN-2014
000034 0 04-JAN-2014
000034 40 05-JAN-2014
000034 20 06-JAN-2014
000034 40 08-JAN-2014
000034 60 10-JAN-2014
000034 80 21-JAN-2014
000034 10 22-JAN-2014
So, the PointsDate component is NOT consistent, nor is it contiguous (it's based on some "activity" happening).
I am trying to get, for each customer, the total amount of positive and negative differences in points, the number of positive and negative changes, and the max and min... but ignoring the very first instance for each customer, which will always be zero.
e.g.
CustomerId Pos Neg Count(pos) Count(neg) Max Min
000021 40 30 3 1 40 10
000034 100 90 4 2 80 10
...but I have not a single clue how to achieve this!
I would put it in a cube, but a) there is only a single table and no other references and b) I know almost nothing about cubes!
The problem can be solved in regular T-SQL with a common table expression that numbers the rows per customer, along with a self join that compares each row with the previous one:
WITH cte AS (
SELECT customerid, points,
ROW_NUMBER() OVER (PARTITION BY customerid ORDER BY pointsdate) rn
FROM mytable
)
SELECT cte.customerid,
SUM(CASE WHEN cte.points > old.points THEN cte.points - old.points ELSE 0 END) pos,
SUM(CASE WHEN cte.points < old.points THEN old.points - cte.points ELSE 0 END) neg,
SUM(CASE WHEN cte.points > old.points THEN 1 ELSE 0 END) [Count(pos)],
SUM(CASE WHEN cte.points < old.points THEN 1 ELSE 0 END) [Count(neg)],
MAX(cte.points) max,
MIN(cte.points) min
FROM cte
JOIN cte old
ON cte.rn = old.rn + 1
AND cte.customerid = old.customerid
GROUP BY cte.customerid
An SQLfiddle to test with.
The query would have been somewhat simplified using SQL Server 2012's more extensive analytic functions.
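The consecutive-row logic is easy to mimic in plain Python as a cross-check of the expected figures. Note that customer 000021 actually has four +10 steps, so Count(pos) comes out as 4 rather than the 3 shown in the question's sample output:

```python
from collections import defaultdict

rows = [  # (CustomerId, Points), already ordered by PointsDate within customer
    ("000021", 0), ("000021", 10), ("000021", 20), ("000021", 30),
    ("000021", 40), ("000021", 10),
    ("000034", 0), ("000034", 40), ("000034", 20), ("000034", 40),
    ("000034", 60), ("000034", 80), ("000034", 10),
]

series = defaultdict(list)
for cust, pts in rows:
    series[cust].append(pts)

results = {}
for cust, pts in series.items():
    diffs = [b - a for a, b in zip(pts, pts[1:])]  # first row has no predecessor
    results[cust] = (sum(d for d in diffs if d > 0),   # Pos
                     -sum(d for d in diffs if d < 0),  # Neg
                     sum(d > 0 for d in diffs),        # Count(pos)
                     sum(d < 0 for d in diffs),        # Count(neg)
                     max(pts[1:]), min(pts[1:]))       # Max, Min (first row excluded)
print(results)
```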
An approach similar to the one of Joachim Isaksson, but with more work in the CTE and less in the main query:
WITH A AS (
SELECT c.CustomerID, c.Points, c.PointsDate
, Diff = c.Points - l.Points
, l.PointsDate lPointsDate
FROM Customer c
CROSS APPLY (SELECT TOP 1
Points, PointsDate
FROM Customer cu
WHERE c.CustomerID = cu.CustomerID
AND c.PointsDate > cu.PointsDate
ORDER BY cu.PointsDate Desc) l
)
SELECT CustomerID
, Pos = SUM(Diff * CAST(Sign(Diff) + 1 AS BIT))
, Neg = SUM(Diff * (1 - CAST(Sign(Diff) + 1 AS BIT)))
, [Count(pos)] = SUM(0 + CAST(Sign(Diff) + 1 AS BIT))
, [Count(neg)] = SUM(1 - CAST(Sign(Diff) + 1 AS BIT))
, Max(Points) [Max], Min(Points) [Min]
FROM A
GROUP BY CustomerID
SQLFiddle Demo
The condition that removes the first day is the JOIN (CROSS APPLY) in the CTE: the first day has no previous day, so it is filtered out.
In the main query, instead of using a CASE to filter the positive and negative differences, I preferred the SIGN function:
this function returns -1 for negative, 0 for zero and +1 for positive
shifting the value with Sign(Diff) + 1 means the return values become 0, 1 and 2
the CAST to BIT compresses those to 0 for negative and 1 for zero or positive.
The 0 + in the definition of [Count(pos)] creates an implicit conversion to an integer value, as BIT cannot be summed.
The 1 - used to SUM and COUNT the negative differences is equivalent to a NOT: it inverts the BIT of the SIGN to 1 for negative and 0 for zero or positive.
I'll copy my comment from above: I know literally nothing about cubes, but it sounds like what you're looking for is just a cursor, is it not? I know everyone hates cursors, but that's the best way I know to compare consecutive rows without loading it down onto a client machine (which is obviously worse).
I see you mentioned in your response to me that you'd be okay setting it off to run overnight, so if you're willing to accept that sort of performance, I definitely think a cursor will be the easiest and quickest to implement. If this is just something you do here or there, I'd definitely do that. It's nobody's favorite solution, but it'd get the job done.
Unfortunately, yeah, at twelve million records, you'll definitely want to spend some time optimizing your cursor. I work frequently with a database that's around that size, and I can only imagine how long it'd take. Although depending on your usage, you might want to filter based on user, in which case the cursor will be easier to write, and I doubt you'll be facing enough records to cause much of a problem. For instance, you could just look at the top twenty users and test their records, then do more as needed.

How to do 'grading' in pure (i.e. ANSI) SQL

I have a table that looks something like this:
CREATE TABLE student_results(id integer, name varchar(32), score float);
Let's make the following two assumptions:
assume that the score goes from 0 to a maximum of 100.
assume that I want to grade students in 'step sizes' of 10
so I want to apply the following grading:
Score Grade Awarded
0-10 GRADE9
10-20 GRADE8
20-30 GRADE7
30-40 GRADE6
40-50 GRADE5
50-60 GRADE4
60-70 GRADE3
70-80 GRADE2
80-90 GRADE1
90-100 GENIUS
I would like to write an SQL query that takes in the following input arguments:
lowest score: 0 in this example
highest score: 100 in this example
'step' size: 10 in this example
As ever, if possible, I would like to write such a query using ANSI SQL. If I have to choose a database, then in order of DECREASING preference, it would have to be:
PostgreSQL
MySQL
Could someone please explain how I may be able to write an SQL query that does this kind of grading, using the above table as an example?
[Edit]
Sample input data
1, 'homer', 10.5
2, 'santas little helper', 15.2
3, 'bart', 20.5
4, 'marge', 40.5
5, 'lisa', 100
I will have an SQL function grade_rank() that ranks the students.
The arguments for the function grade_rank() are:
1st argument: LOWEST possible score value
2nd argument: HIGHEST possible score value
3rd argument: step size, which determines the levels/divisions between the ranks
select id, name, grade_rank(0,100, 10) grade from student_scores;
the output (based on the input above) should be:
1, homer, GRADE9
2, santas little helper, GRADE9
3, bart, GRADE8
4, marge, GRADE6
5, lisa, GENIUS
This way you can do it more generally, but the grades will be in reverse order, starting from 1 up to N, i.e.
0-10 Grade1
10-20 Grade2
20-30 Grade3
30-40 Grade4
...
For example using the values
step 10
score 43
This algorithm
SELECT (((score-1)-((score-1) % step))/step)+1
will return 5
You don't have to know the maximum score: if the max score is 100, no one will be able to score higher than 100; you just have to decide the size of the steps. For example, say you want a step size of 25. Knowing that the maximum score is 100, there will be 4 grade levels. So by setting the step size to 25 instead of 10, the result will be 2, i.e. grade 2.
SELECT (((43-1)-((43-1) % 25))/25)+1
Perhaps not right on the spot of what you expected, but maybe generic enough to be useful. Here is how the function would look in SQL.
CREATE OR REPLACE FUNCTION grade_rank(IN score integer, IN step integer, OUT rank integer)
AS 'SELECT ((($1-1)-(($1-1) % $2))/$2)+1'
LANGUAGE 'SQL';
Now calling this function
select * from grade_rank(43,10)
returns 5.
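The ranking arithmetic is easy to verify outside the database; the same formula in Python:

```python
def grade_rank(score, step):
    """1-based bucket of `score` among buckets of width `step` (scores 1..step -> 1)."""
    return ((score - 1) - ((score - 1) % step)) // step + 1

print(grade_rank(43, 10))  # 5
print(grade_rank(43, 25))  # 2
print(grade_rank(10, 10))  # 1 -- scores 1..10 land in the first bucket
```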
And this is the plpgsql equivalent:
CREATE OR REPLACE FUNCTION grade_rank(IN score integer, IN step integer)
RETURNS integer AS
$BODY$
DECLARE rank integer;
BEGIN
SELECT (((score-1)-((score-1) % step))/step)+1 INTO rank;
RETURN rank;
END;
$BODY$
LANGUAGE 'plpgsql';
There are a few options:
1)
create a table with grades (min, max) and join on that table
SELECT score, grades.grade
FROM table
INNER JOIN grades ON table.score >= grades.min AND table.score <= grades.max
2)
create a temporary table (or even select from DUAL) and join on it; for example, in the above, instead of grades you can write the subquery
(SELECT 0 as MIN, 10 as max, 'GRADE9' as grade FROM DUAL
UNION ALL
SELECT 11 as MIN, 20 as max, 'GRADE8' as grade FROM DUAL
UNION ALL
...
SELECT 91 as min, 100 as max, 'GENIUS' as grade FROM DUAL
) AS grades
3)
use a case
SELECT score,
CASE WHEN score = 0 THEN 'GRADE9'
WHEN score >= 1 AND score <= 90 THEN 'GRADE' || (9 - (score-1) / 10)
WHEN score >= 91 THEN 'GENIUS'
ELSE 'ERROR'
END grade
FROM table
(notice that in the above query you could substitute 0, 100 and 10 with lowest, highest and step to get dynamic sql)
4)
create a user function (but this will get RDBMS-specific)
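Option 3 can be tried directly. A small check using SQLite from Python with the sample students; score is cast to an integer first so the (score-1)/10 division stays integral (which the || arithmetic relies on), and with that truncation these rows match the asker's expected output:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student_results (id INTEGER, name TEXT, score FLOAT)")
con.executemany("INSERT INTO student_results VALUES (?, ?, ?)",
                [(1, "homer", 10.5), (3, "bart", 20.5),
                 (4, "marge", 40.5), (5, "lisa", 100)])

rows = con.execute("""
    SELECT name,
           CASE WHEN score = 0 THEN 'GRADE9'
                WHEN score <= 90 THEN 'GRADE' || (9 - (CAST(score AS INTEGER) - 1) / 10)
                ELSE 'GENIUS'
           END AS grade
    FROM student_results
""").fetchall()
print(rows)  # [('homer', 'GRADE9'), ('bart', 'GRADE8'), ('marge', 'GRADE6'), ('lisa', 'GENIUS')]
```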
something like this?
SELECT
[name],
score,
CASE
WHEN score > #max - #stepsize THEN 'GENIUS'
ELSE CONCAT('GRADE',
CAST(
FLOOR((#max - score)/#stepsize -
CASE score
WHEN #min THEN 1
ELSE 0
END
) as char(3)
)
)
END
FROM
student_results
you might have to tweak it a bit - I didn't quite understand the min part (is it used only because the last range is 1 bigger than the other ranges?)
Edit
Renamed #step to #stepsize for clarity per Ivar (#step could be misinterpreted as step count)
How about this?
(Turns out I've used #steps, as in the number of steps, instead of #step. If you'd rather specify #step, #steps can be calculated as #steps = (#highest-#lowest)/#step.)
SET #lowest = 0;
SET #highest = 100;
SET #steps = 10;
SELECT
name,
CASE
WHEN score >= (#highest-#steps) THEN 'GENIUS'
ELSE
CONCAT(
'GRADE',
#steps-FLOOR((score-#lowest)/((#highest-#lowest)/#steps))-1)
END
FROM
student_results
This will give you a new grade whenever you pass the next step:
0-9.999 => GRADE9
10-19.999 => GRADE8
etc.