I have a table like this (unsorted):
risk     category
Low      A
Medium   B
High     C
Medium   A
Low      B
High     A
Low      C
Low      E
Low      D
High     B
I need to sort rows by category, but first by the value of risk. The desired result should look like this (sorted):
risk     category
Low      A
Low      B
Low      C
Low      D
Low      E
Medium   A
Medium   B
High     A
High     B
High     C
I've come up with the query below but wonder if it is correct:
SELECT *
FROM   some_table
ORDER  BY CASE
              WHEN risk = 'Low'    THEN 0
              WHEN risk = 'Medium' THEN 1
              WHEN risk = 'High'   THEN 2
              ELSE 3
          END
        , category;
I just want to understand whether the query is correct or not. The actual data set is huge and there are many other values for risk and category, so I can't easily tell whether the results are correct. I've just simplified it here.
Basically correct, but you can simplify:
SELECT *
FROM some_table
ORDER BY CASE risk
WHEN 'Low' THEN 0
WHEN 'Medium' THEN 1
WHEN 'High' THEN 2
-- rest defaults to NULL and sorts last
END
, category;
A "switched" CASE is shorter and slightly cheaper.
In the absence of an ELSE branch, all remaining cases default to NULL, and NULL sorts last in default ascending sort order. So you don't need to do anything extra.
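If you prefer to make that placement explicit, PostgreSQL also accepts a NULLS LAST clause; a minor variation of the query above:

SELECT *
FROM   some_table
ORDER  BY CASE risk
             WHEN 'Low'    THEN 0
             WHEN 'Medium' THEN 1
             WHEN 'High'   THEN 2
          END NULLS LAST
        , category;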
Many other values
... there are many other values for risk
As long as all other values can be lumped together at the bottom of the sort order, this seems OK.
If all of those many values get their individual ranking, I would suggest an additional table to handle ranks of risk values. Like:
CREATE TABLE riskrank (
risk text PRIMARY KEY
, riskrank real
);
INSERT INTO riskrank VALUES
('Low' , 0)
, ('Medium', 1)
, ('High' , 2)
-- many more?
;
The rank column has data type real, so it's easy to squeeze in new rows with fractional values between existing ranks (much like enum values work internally).
Then your query is:
SELECT s.*
FROM some_table s
LEFT JOIN riskrank rr USING (risk)
ORDER BY rr.riskrank, s.category;
LEFT JOIN, so missing entries in riskrank don't eliminate rows.
enum?
I already mentioned the data type enum. That's a possible alternative as enum values are sorted in the order they are defined (not how they are spelled). They only occupy 4 bytes on disk (real internally), are fast and enforce valid values implicitly. See:
How to change the data type of a table column to enum?
However, I would only even consider an enum if the sort order of your values is immutable. Changing sort order and adding / removing allowed values is cumbersome. The manual:
Although enum types are primarily intended for static sets of values,
there is support for adding new values to an existing enum type, and
for renaming values (see ALTER TYPE). Existing values cannot be
removed from an enum type, nor can the sort ordering of such values be
changed, short of dropping and re-creating the enum type.
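A minimal sketch of the enum alternative, assuming a new type named risk_level and that every existing risk value matches one of the labels:

CREATE TYPE risk_level AS ENUM ('Low', 'Medium', 'High');

ALTER TABLE some_table
  ALTER COLUMN risk TYPE risk_level USING risk::risk_level;

-- ORDER BY now follows the declared enum order; no CASE expression needed:
SELECT * FROM some_table ORDER BY risk, category;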
I'm working on a financial system application, in Microsoft SQL Server, that
requires calculating a rate for a given tenor (maturity in days) from yield
curves stored in a database. The problem is that the yield curves are stored
sparsely around dates and tenors; i.e. not all dates, and not all tenors,
are available in the database.
The yield curves are stored as this:
CREATE TABLE Curves (
date DATE NOT NULL,
type VARCHAR(10) NOT NULL,
currency CHAR(3) NOT NULL,
tenor INT NOT NULL,
rate DOUBLE NOT NULL
);
I want to be able to get a rate for a given date, type, currency, and
tenor. With the following assumptions:
For a givenDate, choose the most recent curve for the given type and
currency. If the givenDate is not found, use the most recent earlier date
to get the rates.
For a givenTenor, do a linear interpolation from the available tenors. If
the givenTenor is smaller than the smallest tenor, use the rate associated
with the smallest tenor. If the givenTenor is larger than the largest tenor,
use the rate associated with the largest tenor. For anything in between, use a
linear interpolation between the two closest tenors.
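For reference, the linear interpolation between the two closest tenor points (t1, rate1) and (t2, rate2) for a given tenor t works out to: rate = rate1 + (t - t1) * (rate2 - rate1) / (t2 - t1).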
So assuming that I have another table with financial instruments:
CREATE TABLE Instruments (
id INT NOT NULL PRIMARY KEY,
date DATE NOT NULL,
type VARCHAR(10) NOT NULL,
currency CHAR(3) NOT NULL,
tenor INT NOT NULL
);
I would like to have a stored function that can produce the rates, following the
assumptions above, in a query like this:
SELECT *, SOME_FUNCTION(date, type, currency, tenor) AS rate FROM Instruments;
I believe that a function can be created to do this, but I'm also sure that a
naive implementation would not perform at all. Particularly with millions of
records. The algorithm for the function would be something like this:
With the givenDate, get the availableDate as the maximum date that is less
than or equal to the givenDate, for the givenType and givenCurrency.
For the availableDate get all the tenors and rates. With this vector, go
over the tenors with the assumptions as above, to calculate a rate.
Any ideas on how to do this in Microsoft SQL Server in a performant way?
-- Edit
Following up on @dalek and @trenton-ftw's comments, here is the 'naive'
implementation of the algorithm. I've created random data to test:
5,000 Curve points, spanning the last 5 years, with two types and
two currencies,
1,000,000 Instruments with random dates, types, currencies and
tenors.
To test the function I just executed the following query:
SELECT AVG(dbo.TenorRate("date", "type", "currency", "tenor"))
FROM Instruments
For the random sample above, this query executes in 2:30 minutes on my
laptop. Nevertheless, the real application would need to work on
something close to 100,000,000 instruments on a heavily loaded SQL
Server. That would potentially put it in the 20-minute ballpark.
How can I improve the performance of this function?
Here is the naive implementation:
CREATE OR ALTER FUNCTION TenorRate(
    @date DATE,
    @type VARCHAR(8),
    @currency VARCHAR(3),
    @tenor INT
) RETURNS REAL
AS
BEGIN
    DECLARE @rate REAL
    -- The available date is the first date that is less than or equal to
    -- the provided one, or the first one if none are less than or equal to it.
    DECLARE @availableDate DATE
    SET @availableDate = (
        SELECT MAX("date") FROM Curves
        WHERE "date" <= @date AND "type" = @type AND "currency" = @currency
    )
    IF (@availableDate IS NULL)
    BEGIN
        SET @availableDate = (
            SELECT MIN("date") FROM Curves
            WHERE "type" = @type AND "currency" = @currency
        )
    END
    -- Get the tenors and rates for the available date, type and currency.
    -- Ordering by tenor to ensure that the fetch is made in order.
    DECLARE @PreviousTenor INTEGER, @PreviousRate REAL
    DECLARE @CurrentTenor INTEGER, @CurrentRate REAL
    DECLARE Curve CURSOR FAST_FORWARD READ_ONLY FOR
        SELECT "tenor", "rate"
        FROM Curves
        WHERE "date" = @availableDate AND "type" = @type AND "currency" = @currency
        ORDER BY "tenor"
    -- Open a cursor to iterate over the tenors and rates in order.
    OPEN Curve
    FETCH NEXT FROM Curve INTO @CurrentTenor, @CurrentRate
    IF (@tenor < @CurrentTenor)
    BEGIN
        -- If the tenor is less than the first one,
        -- then use the first tenor rate.
        SET @rate = @CurrentRate
    END
    ELSE
    BEGIN
        WHILE @@FETCH_STATUS = 0
        BEGIN
            IF (@tenor = @CurrentTenor)
            BEGIN
                -- If it matches exactly, return the found rate.
                SET @rate = @CurrentRate
                BREAK
            END
            IF (@tenor < @CurrentTenor)
            BEGIN
                -- If the given tenor is less than the current one
                -- (but not equal), then interpolate with the
                -- previous one and return the calculated rate.
                SET @rate = @PreviousRate +
                    (@tenor - @PreviousTenor) *
                    (@CurrentRate - @PreviousRate) /
                    (@CurrentTenor - @PreviousTenor)
                BREAK
            END
            -- Keep track of the previous tenor and rate.
            SET @PreviousTenor = @CurrentTenor
            SET @PreviousRate = @CurrentRate
            -- Fetch the next tenor and rate.
            FETCH NEXT FROM Curve INTO @CurrentTenor, @CurrentRate
        END
        IF (@tenor > @CurrentTenor)
        BEGIN
            -- If we exhausted the tenors and still nothing found,
            -- then use the last tenor rate.
            SET @rate = @CurrentRate
        END
    END
    CLOSE Curve
    DEALLOCATE Curve
    RETURN @rate
END;
Assuming a table dbo.Curves exists with the following structure:
CREATE TABLE dbo.Curves(
[date] [date] NOT NULL
,[type] [varchar](10) NOT NULL
,currency [char](3) NOT NULL
,tenor [int] NOT NULL
,rate [real] NOT NULL
,UNIQUE NONCLUSTERED ([date] ASC,[type] ASC,currency ASC,tenor ASC)
,INDEX YourClusteredIndexNameHere CLUSTERED ([type] ASC,currency ASC,[date] ASC)
);
Notable differences from OP's DDL script:
column rate has the data type corrected from double to real. Assuming this is the correct data type because the provided function stores rate values in variables with a real data type. The data type double does not exist in SQL Server.
I have added a clustered index on (type,currency,date). The table did not have a clustered index in OP's DDL script, but should certainly have one. Considering the only knowledge I have of how this table is used is this function, I will design one that works best for this function.
OP indicated this table has a unique constraint on (date,type,currency,tenor), so I have included it.
Just to note, if the unique constraint that already exists does not require the provided order, I would recommend removing that unique constraint and the clustered index I recommended and simply creating a single clustered primary key constraint on (type,currency,date,tenor).
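As a sketch, that simpler alternative (assuming the column order of the existing unique constraint is not required; constraint name is illustrative) would look like:

CREATE TABLE dbo.Curves(
[date] [date] NOT NULL
,[type] [varchar](10) NOT NULL
,currency [char](3) NOT NULL
,tenor [int] NOT NULL
,rate [real] NOT NULL
,CONSTRAINT PK_Curves PRIMARY KEY CLUSTERED ([type] ASC,currency ASC,[date] ASC,tenor ASC)
);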
Differences between OP's function and set based approach
The actual flow is very similar to OP's provided function. It is well commented, but the function does not read top to bottom. Start from the FROM of the innermost subquery and read 'outwards'.
This function is inline-able. You can confirm this once this function is installed to a given database using Microsoft's documentation on this topic.
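For example, on SQL Server 2019 and later you can check the is_inlineable column in sys.sql_modules once the function is created:

SELECT OBJECT_NAME(object_id) AS function_name, is_inlineable
FROM sys.sql_modules
WHERE object_id = OBJECT_ID(N'dbo.TenorRate');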
I corrected the input type for the #i_Type variable to varchar(10) to match the type column of the dbo.Curves table.
I created the two tables and populated them with what I believe was reasonable randomized data. I used the same scale of 200:1 from OP's test scenario of 1m records in dbo.Instruments and 5k records in dbo.Curves. I tested using 80k records in dbo.Instruments and 400 records in dbo.Curves. Using the exact same test script from OP of SELECT AVG(dbo.TenorRate("date", "type", "currency", "tenor")) FROM Instruments;
OP's CURSOR based approach had a CPU/Elapsed ms of 57774/58654.
Set based approach I wrote had a CPU/Elapsed ms of 7437/7440. So roughly 12.6% elapsed time of the original.
The set based approach is guaranteed to result in more reads overall because it repeatedly reads from dbo.Curves, as opposed to the CURSOR option, which pulls the rows from dbo.Curves one time and then reads from the CURSOR. It is worth noting that comparing execution plans of both function approaches in the test case above will yield misleading results. Because the CURSOR based option cannot be inlined, its logical operators and query statistics are not going to be shown in the execution plan, whereas the set based approach can be inlined and so its logical operators and query statistics will be included in the execution plan. Just thought I would point that out.
I did validate the results of this compared to the original function. I noticed there are some potential areas for discrepancy.
Both this function and the OP's CURSOR based option are doing math using real data types. I saw a few examples where the math, although using the same numbers in the CURSOR approach and the set based approach, resulted in slightly different outputs; when rounded to the 5th decimal, the outputs matched exactly. You might need to track that down, but considering that you said you are building a financial application, and a financial application is the classic example where real and float data types should be avoided (because they are approximate data types), I strongly suggest you change the way these values are both stored and used in this function to be numeric. It was not worth my time to track this issue down because IMO it only exists because of very bad practice that should be resolved anyhow.
Based on my understanding of OP's approach, in scenarios where the input tenor is between the minimum and maximum tenor, we would want to use the two rows from dbo.Curves where the tenor value is as close to the input tenor as possible, and then use those two rows (and their rates) to perform a linear interpolation for the input tenor. In this same scenario, OP's function finds the closest available tenor and then, instead of using the row with the next closest available tenor, it uses the previous row (sorted by tenor ASC). Because of how the CURSOR implementation was written, this is not guaranteed to be the next closest row to the input tenor, so I fixed that in my approach. Feel free to revert if I misunderstood.
Notes about my specific implementation
This approach is not heavily optimized towards certain conditions that might exist more commonly. There are still potential optimizations that you might be able to make in your system based on the conditions that exist.
I am using an aggregate to pivot rows to columns. I did test out LEAD/LAG to do this, but found that pivoting using the aggregate was consistently faster, with roughly a 7% decrease in elapsed time.
There are simple warehousing strategies that could be used to GREATLY increase the performance of this approach. For example, storing the pre-calculated min/max for a given combination of a (type,currency,date) could have a tremendous impact.
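As a rough sketch of that idea (table and constraint names are hypothetical), a small summary table keyed the same way as the clustered index could be maintained alongside dbo.Curves:

CREATE TABLE dbo.CurveTenorBounds(
[type] [varchar](10) NOT NULL
,currency [char](3) NOT NULL
,[date] [date] NOT NULL
,MinTenor [int] NOT NULL
,MaxTenor [int] NOT NULL
,CONSTRAINT PK_CurveTenorBounds PRIMARY KEY CLUSTERED ([type] ASC,currency ASC,[date] ASC)
);

INSERT INTO dbo.CurveTenorBounds ([type], currency, [date], MinTenor, MaxTenor)
SELECT [type], currency, [date], MIN(tenor), MAX(tenor)
FROM dbo.Curves
GROUP BY [type], currency, [date];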
Here is the function:
CREATE OR ALTER FUNCTION dbo.TenorRate(
@i_Date DATE
,@i_Type VARCHAR(10)
,@i_Currency CHAR(3)
,@i_Tenor INT
)
RETURNS REAL
BEGIN
RETURN (
SELECT
/*now pick the return value based on the scenario that was returned, as a given scenario can have a specific formula*/
CASE
WHEN pivoted_vals.Scenario = 1 THEN pivoted_vals.CurrentRate /*returned row is first row with matching tenor*/
WHEN pivoted_vals.Scenario = 2 THEN pivoted_vals.CurrentRate /*returned row is row with smallest tenor*/
WHEN pivoted_vals.Scenario = 3 THEN pivoted_vals.CurrentRate /*returned row is row with largest tenor*/
WHEN pivoted_vals.Scenario = 4 /*returned row contains pivoted values for the two rows that are two closest tenor rows*/
THEN pivoted_vals.PreviousRate
+
(@i_Tenor - pivoted_vals.PreviousTenor)
*
(pivoted_vals.CurrentRate - pivoted_vals.PreviousRate)
/
(pivoted_vals.CurrentTenor - pivoted_vals.PreviousTenor)
END AS ReturnValue
FROM
(SELECT
/*pivot rows using aggregates. aggregates are only considering one row and value at a time, so they are not truly aggregating anything. simply pivoting values.*/
MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.Scenario ELSE NULL END) AS Scenario
,MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.CurrentRate ELSE NULL END) AS CurrentRate
,MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.CurrentTenor ELSE NULL END) AS CurrentTenor
,MAX(CASE WHEN top_2_rows.RowNumber = 2 THEN top_2_rows.CurrentRate ELSE NULL END) AS PreviousRate
,MAX(CASE WHEN top_2_rows.RowNumber = 2 THEN top_2_rows.CurrentTenor ELSE NULL END) AS PreviousTenor
FROM
/*return a row number for our top two returned rows. done in an outer subquery so that the row_number function is not used until the final two rows have been selected*/
(SELECT TOP (2)
picked_curve.Scenario
,picked_curve.ScenarioRankValue
,picked_curve.CurrentRate
,picked_curve.CurrentTenor
/*generate row number to match order by clause*/
,ROW_NUMBER() OVER(ORDER BY picked_curve.Scenario ASC,picked_curve.ScenarioRankValue ASC) AS RowNumber
FROM
/*we need top two rows because scenario 4 requires the previous row's value*/
(SELECT TOP (2)
scenarios.Scenario
,scenarios.ScenarioRankValue
,c.rate AS CurrentRate
,c.tenor AS CurrentTenor
FROM
/*first subquery to select the date that we will use from two separate conditions*/
(SELECT TOP (1)
date_options.[date]
FROM
(/*most recent available date before or equal to input date*/
SELECT TOP (1)
c.[date]
FROM
dbo.Curves AS c
WHERE
/*match on type and currency*/
c.[type] = @i_Type
AND
c.currency = @i_Currency
AND
c.[date] <= @i_Date
ORDER BY
c.[date] DESC
UNION ALL
/*first available date after input date*/
SELECT TOP (1)
c.[date]
FROM
dbo.Curves AS c
WHERE
/*match on type and currency*/
c.[type] = @i_Type
AND
c.currency = @i_Currency
AND
c.[date] > @i_Date
ORDER BY
c.[date] ASC) AS date_options
ORDER BY
/*we want to prioritize date from first query, which we know will be 'older' than date from second query. So ascending order*/
date_options.[date] ASC) AS selected_date
/*go get curve values for input type, input currency, and selected date*/
INNER JOIN dbo.Curves AS c ON
@i_Type = c.[type]
AND
@i_Currency = c.currency
AND
selected_date.[date] = c.[date]
/*go get max and min curve values for input type, input currency, and selected date*/
OUTER APPLY (SELECT TOP (1) /*TOP (1) is redundant since this is an aggregate with no grouping, but keeping for clarity*/
MAX(c_inner.tenor) AS MaxTenor
,MIN(c_inner.tenor) AS MinTenor
FROM
dbo.Curves AS c_inner
WHERE
c_inner.[type] = @i_Type
AND
c_inner.currency = @i_Currency
AND
c_inner.[date] = selected_date.[date]) AS max_min_tenor
/*for readability in select, outer apply logic to give us a value that will prioritize certain scenarios over others (and indicate to us the scenario that was returned) and
return the minimum number of rows that are ranked in a manner in which the top returned rows contain the information needed for the given scenario*/
OUTER APPLY (SELECT
CASE /*rank the scenarios*/
WHEN @i_Tenor = c.tenor THEN 1
WHEN @i_Tenor < max_min_tenor.MinTenor THEN 2
WHEN @i_Tenor > max_min_tenor.MaxTenor THEN 3
ELSE 4 /*input tenor is between the max/min tenor*/
END AS Scenario
,CASE /*rank value that ensures the top row will be the row we need for the returned scenario*/
WHEN @i_Tenor = c.tenor THEN c.tenor
WHEN @i_Tenor < max_min_tenor.MinTenor THEN c.tenor
WHEN @i_Tenor > max_min_tenor.MaxTenor THEN c.tenor*-1
ELSE ABS(c.tenor-@i_Tenor)
END AS ScenarioRankValue) AS scenarios
ORDER BY
/*highest priority scenario (by lowest value) and the associated value to be ranked, which is designed to be used in ascending order*/
scenarios.Scenario ASC
,scenarios.ScenarioRankValue ASC) AS picked_curve
ORDER BY
picked_curve.Scenario ASC
,picked_curve.ScenarioRankValue ASC) AS top_2_rows) AS pivoted_vals
);
END;
GO
Just one last note on usage: there are certain ways that you might use this function that could prevent it from being inlined. I highly recommend you read the entirety of Microsoft's doc on Scalar UDF Inlining, but at a minimum you should at least read over the inlineable scalar UDF requirements contained in that same doc.
Hopefully this helps you out or at least points you in the right direction.
In fact, your function looks like an aggregate function. If that is the case, you could write a SQL CLR one. We did that for a financial score and the performance was fine...
Aside from doing a direct match on something like a whitespace-normalized hash of a query, what might be a useful (but not necessarily perfect) way to handle query caching in a partial manner? For example, let's take the following basic case:
SELECT
Product, # VARCHAR
Revenue # DOUBLE
FROM
Sales
WHERE
Country='US'
This could potentially be used as a 'base cache' against which a further query could be executed to improve performance:
SELECT
Product, # VARCHAR
Revenue # DOUBLE
FROM
Sales
WHERE
Country='US' AND State='CA'
So, assuming the data in the from table(s) doesn't change, the following might serve as a starting point for determining the cache key:
fields: [field:type, ...] // can be less but not more
from: hash of table(s)+joins
filters: [filter1, filter2, ...] // can be less but not more
aggregations: [agg1, agg2, ...] // can be less but not more
having: [having1, having2, ...] // can be less but not more
order+limit+offset if limited result-set // can be less but not more
However, this becomes quite tricky when we think about something like the following case:
SELECT
ProductGroup AS Product, # Would produce a Product:VARCHAR hash
Revenue
FROM
Sales
WHERE
Country='US'
What might be a realistic starting point for how a partial query cache could be implemented?
Use case: writing SQL to query data in a non-DBMS-managed source, such as a CSV file, where any query takes ~20s or so and we cannot create indexes on the file. https://en.wikipedia.org/wiki/SQL/MED or Spark-like.
I think the following might be a good starting place for a basic cache implementation that allows the usage of a cache that can be further queried for refinements:
Start by substituting any UDFs or CTEs. The query itself needs to be self-contained.
Normalize whitespace and capitalization.
Hash the entire query. This will be our starting place.
Remove the select fields and hash the rest of the query. Now store a hash of all the individual items in the select list.
For the partial cache, generate a hash minus the select fields, where clause, sort, and limit+offset. Then hash the where clause's list of filters (separated by AND), making sure no filter is contained in the cache that is not contained in the current query; hash the order by, to see whether the data needs to be re-sorted; and compare the limit+offset, making sure the limit+offset of the initial query is null or greater than that of the current query.
Here would be an example of how the data might look saved:
Hash:                           673c0185c6a580d51266e78608e8e9b2
HashMinusFields:                41257d239fb19ec0ccf34c36eba1948e
HashOfFields:                   [dc99e4006c8a77025c0407c1fdebeed3, …]
HashMinusFieldsWhereOrderLimit: d50961b6ca0afe05120a0196a93726f5
HashOfWheres:                   [0519669bae709d2efdc4dc8db2d171aa, ...]
HashOfOrder:                    81961d1ff6063ed9d7515a3cefb0c2a5
LimitOffset:                    null
Now let's try a few examples. I will use human-readable hashes for easier readability:
SELECT Name, Age FROM Sales WHERE id=2
-- fullHash: selectname,agefromsaleswhereid=2
-- selectless: fromsaleswhereid=2
-- hashoffields: [name, age]
-- minusfieldswhereorderlimit: null
-- hashofwheres: [id=2, ]
-- hashororder: null
-- limitoffset: null
-- query1
select age FROM sales where id=2
-- selectless: fromsaleswhereid=2
-- fields: [age] OK, all fields contained in initial fields
-- query2
select age FROM sales where id=2 and country='us' order by id limit 100
-- minusfieldswhereorderlimit: null
-- hashofwheres: [id=2, country=us] OK initial query does not contain any additional filters
-- limitoffset: 100 OK initial limitoffset is null (infinity)
-- hashorder: orderbyid
--> Can grab partial cache, need to apply one filter and re-sort/limit:
--> SELECT * FROM <cache> WHERE country='us' order by id limit 100
Does the above seem like a valid initial implementation?
I am writing a query to show returns from placing each-way bets on horse races.
There is an issue with the PlaceProfit result - this should show a return if the horse's finishing position is between 1 and 4, and a loss if the position is >= 5.
It does show the correct return if the horse's finishing position is below 9th, but 10th place and above is being counted as a win.
I include my code below along with the output.
ALTER VIEW EachWayBetting
AS
SELECT a.ID,
RaceDate,
runners,
track.NAME AS Track,
horse.NAME as HorseName,
IndustrySP,
Place AS 'FinishingPosition',
-- // calculates returns on the win & place parts of an each way bet with 1/5 place terms //
IIF(A.Place = '1', 1.0 * (A.IndustrySP-1), '-1') AS WinProfit,
IIF(A.Place <='4', 1.0 * (A.IndustrySP-1)/5, '-1') AS PlaceProfit
FROM dbo.NewRaceResult a
LEFT OUTER JOIN track ON track.ID = A.TrackID
LEFT OUTER JOIN horse ON horse.ID = A.HorseID
WHERE a.Runners > 22
This returns:
As I mention in the comments, the problem is your choice of data type for place, it's varchar. The ordering for a string data type is completely different to that of a numerical data type. Strings are sorted by character from left to right, in the order the characters are ordered in the collation you are using. Numerical data types, however, are ordered from the lowest to highest.
This means that, for a numerical data type, the value 2 has a lower value than 10; however, for a varchar the value '2' has a higher value than '10'. For the varchar that's because the ordering is performed on the first character first: '2' has a higher value than '1', and so '2' has a higher value than '10'.
The solution here is simple, fix your design; store numerical data in a numerical data type (int seems appropriate here). You're also breaking Normal Form rules, as you're storing other data in the column; mainly the reason a horse failed to be classified. Such data isn't a "Place" but information on why the horse didn't place, and so should be in a separate column.
You can therefore fix this by firstly adding a new column, then updating its value to be the values that aren't numerical and making Place only contain numerical data, and then finally altering your Place column.
ALTER TABLE dbo.YourTable ADD UnClassifiedReason varchar(5) NULL; --Obviously use an appropriate length.
GO
UPDATE dbo.YourTable
SET Place = TRY_CONVERT(int,Place),
UnClassifiedReason = CASE WHEN TRY_CONVERT(int,Place) IS NULL THEN Place END;
GO
ALTER TABLE dbo.YourTable ALTER COLUMN Place int NULL;
GO
If Place does not allow NULL values, you will need to ALTER the column first to allow them.
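For example, assuming Place is currently defined as varchar(10) (adjust to the actual definition):

ALTER TABLE dbo.YourTable ALTER COLUMN Place varchar(10) NULL;
GO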
In addition to fixing the data as Larnu suggests, you should also fix the query:
SELECT nrr.ID, nrr.RaceDate, nrr.runners,
       t.NAME AS Track, h.NAME AS HorseName, nrr.IndustrySP,
       nrr.Place AS FinishingPosition,
       -- // calculates returns on the win & place parts of an each way bet with 1/5 place terms //
       (CASE WHEN nrr.Place = 1 THEN (nrr.IndustrySP - 1.0) ELSE -1 END) AS WinProfit,
       (CASE WHEN nrr.Place <= 4 THEN (nrr.IndustrySP - 1.0) / 5 ELSE -1 END) AS PlaceProfit
FROM dbo.NewRaceResult nrr LEFT JOIN
     track t
     ON t.ID = nrr.TrackID LEFT JOIN
     horse h
     ON h.ID = nrr.HorseID
WHERE nrr.Runners > 22;
The important changes are removing single quotes from numbers and column names. It seems you need to understand the differences among strings, numbers, and identifiers.
Other changes are:
Meaningful table aliases, rather than meaningless letters such as a.
Qualifying all column references, so it is clear where columns are coming from.
Switching from IIF() to CASE. IIF() is bespoke SQL Server; CASE is standard SQL for conditional expressions (both work fine).
Being sure that the types returned by all branches of the conditional expressions are consistent.
Note: This version will work even if you don't change the type of Place. The strings will be converted to numbers in the appropriate places. I don't advocate relying on such silent conversion, so I recommend fixing the data.
If place can have non-numeric values, then you need to convert them:
(CASE WHEN TRY_CONVERT(int, nrr.Place) = 1 THEN (nrr.IndustrySP - 1.0) ELSE -1 END) AS WinProfit,
(CASE WHEN TRY_CONVERT(int, nrr.Place) <= 4 THEN (nrr.IndustrySP - 1.0) / 5 ELSE -1 END) AS PlaceProfit
But the important point is to fix the data.
I have some entries in my database, in my case Videos with a rating and popularity and other factors. Of all these factors I calculate a likelihood factor or more to say a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated in a way that it turns out as an integer that represents the percentage of how often this entry should be hit in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as many on ID 2 and 7 times as many on ID 3.
So every hit should be random but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work. Consider the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have either only the 2nd element or both elements to choose from randomly.
So 0.5 of the hits go to the second element,
and 0.5 of the hits go to the (second and first) elements, which are then chosen from randomly, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively so I just do a range query from (0-sum()), but then I would have to re-index everything coming after one item if I change it or develop some swapping algorithm or something... but that's really not elegant and stuff.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "Please choose a random ad with a given probability"... However, I need it for another purpose; this is just to give you a final picture of what it should do.
edit:
Thanks to Ken's answer I thought about the following approach:
calculate a random value from 0-sum(distinct boost):
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor from all distinct boost factors whose running total surpasses the random value;
then we have, in our 1st example, 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor.
PROBLEM: the count of entries having one boost is always different. For example, if there is only one 1-boosted entry I get it in 1 of 10 calls, but if there are 1 million entries with 7, each of them is hardly ever returned...
so this doesn't work out :( trying to refine it.
I have to somehow include the count of entries with this boost factor... but I am somehow stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
RAND(CHECKSUM(NEWID())) * boost AS weighted,
SUM(boost) OVER () AS boostcount,
id
FROM
@sample
GROUP BY
id, boost
ORDER BY
weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a random ordering that takes no account of boost. It's useful to seed RAND, but not by itself.
This sample was put together on SQL Server 2008, BTW
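As a usage sketch (reusing the @sample table variable above), the single weighted pick is simply the top row of that ordering:

SELECT TOP (1) id
FROM @sample
ORDER BY RAND(CHECKSUM(NEWID())) * boost DESC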
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select sum of boosts, and generate some number between 0 and boost sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored as a variable, let's call it {random_number}
Then, select table rows, calculating the cumulative sum of boosts, and find the first row whose cumulative boost is greater than or equal to {random_number}:
SET @cumulative_boost = 0;
SELECT
id,
@cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
table
HAVING
cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets then you would have a higher chance to win "the lottery".
Since I didn't trust any of the results I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did, which in your case would look a little bit like this:
SELECT vals.id, vals.boost
FROM (SELECT id, boost FROM foo) AS vals
INNER JOIN (
SELECT id % 100 + 1 AS counter
FROM user
GROUP BY counter) AS numbers ON numbers.counter <= vals.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
The maximum boost value is less than the maximum number returned by the numbers query
That the inner query returns ALL numbers between 1..100. It might not depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= vals.boost means that if a row has a boost of 2 it ends up duplicated in the final result. If a row has a boost of 100 it ends up in the final set 100 times. Or in other words: if the sum of boosts is 4212, which it was in my case, you would have 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you might even create a temporary table which simply has all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= vals.boost, as in the sketch below.
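A rough sketch of that numbers-table variant (assuming boosts never exceed 100; adjust the generated range as needed):

CREATE TEMPORARY TABLE numbers (id INT PRIMARY KEY);
INSERT INTO numbers (id)
SELECT a.n + 10 * b.n + 1
FROM (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
CROSS JOIN (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b;

SELECT vals.id
FROM (SELECT id, boost FROM foo) AS vals
INNER JOIN numbers ON numbers.id <= vals.boost
ORDER BY RAND()
LIMIT 1;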
I have a table foodbar, created with the following DDL. (I am using mySQL 5.1.x)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
user_id INT NOT NULL,
weight double not null,
created_at date not null
);
I have four questions:
1. How may I write a query that returns a result set that gives me the following information: user_id, weight_gain, where weight_gain is the difference between a weight and a weight that was recorded 7 days ago?
2. How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
3. Since in question 2 (and indeed question 1) I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query - however, since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?
4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight' - would I be right in assuming again that I need to index 'height' and 'weight'? Do I also need to create a composite key (say (height, weight))? If this question is not clear, I would be happy to clarify.
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL '7 days'
;
The date arithmetic syntax is probably wrong, but you get the idea.
How may I write a query that will return the top N users with the biggest weight gain (again say over a week).? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
See above; add ORDER BY curr.weight - prev.weight DESC and LIMIT N.
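Putting those together, a sketch (MySQL interval syntax assumed, N = 10 as an example):

SELECT curr.user_id, curr.weight - prev.weight AS weight_gain
FROM foodbar curr
JOIN foodbar prev ON curr.user_id = prev.user_id
WHERE curr.created_at = CURRENT_DATE
  AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
ORDER BY weight_gain DESC
LIMIT 10;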
For the last two questions: don't speculate, examine execution plans (PostgreSQL has EXPLAIN ANALYZE, dunno about MySQL). You'll probably find you need to index columns that participate in WHERE and JOIN, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If MySQL supports calculated columns in a table and allows indexing on those columns, then that might help.
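MySQL 5.1 does not, but for reference, MySQL 5.7+ supports indexed generated columns; a sketch, assuming a height column exists on the table:

ALTER TABLE foodbar
  ADD COLUMN height_weight DOUBLE AS (height * weight) STORED,
  ADD INDEX idx_height_weight (height_weight);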
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when you write records to the foodbar table, a user could supply the weight at that time, but the query would subtract the initial weight from the current weight. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you could use SUM, because it's possible for someone to have weighings every day - using just somebody's equation of curr.weight - prev.weight wouldn't work, regardless of time span.
Getting the top x is easy in MySQL - use the LIMIT clause, but mind that you provide an ORDER BY to make sure the limit is applied correctly.
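For example, a sketch of the top 10 gainers over the last week using the delta approach (the one-week date filter is an assumption):

SELECT user_id, SUM(weight_delta) AS total_gain
FROM foodbar
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY total_gain DESC
LIMIT 10;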
It's not obvious, but there's some important information missing in the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current-weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly the MySql syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC limit N
2&3: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing when it comes to index selection is what columns you are filtering by or joining on. The optimiser will use the index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of the data, to be considered useful). There's always a trade-off between the slow disk seek times of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, the only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. This is a complex calculation based on a number of queries and a lot of processing that has gone before; so Weight will provide zero benefit as an index.
Another note is that even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore, unless you have a very large number of readings per user, you will read most of the table. (I.e. a table scan will be used to obtain the bulk of the data.)
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on User_id, Created_at would be useful (more so if this is the clustered index).
4: No, unfortunately it is mathematically impossible for indexes on the individual values H and W to determine the ordering of their product. E.g. H=3 and W=3 are both less than 5, yet the product 3*3 = 9 is greater than 5*1 = 5.
You would have to actually store the calculation and index that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.