Completing values from a sparse representation of data - sql

I'm working on a financial system application, in Microsoft SQL Server, that
requires calculating a rate for a given tenor (maturity in days) from yield
curves stored in a database. The problem is that the yield curves are stored
sparsely around dates and tenors; i.e. not all dates, and not all tenors,
are available in the database.
The yield curves are stored as this:
CREATE TABLE Curves (
date DATE NOT NULL,
type VARCHAR(10) NOT NULL,
currency CHAR(3) NOT NULL,
tenor INT NOT NULL,
rate DOUBLE NOT NULL
);
I want to be able to get a rate for a given date, type, currency, and
tenor. With the following assumptions:
For a givenDate, choose the most recent curve for the given type and
currency. If the givenDate is not found use the previously available date
to get the rates.
For a givenTenor, do a linear interpolation from the available tenors. If
the givenTenor is smaller than the smallest tenor, use the rate associated
with the smallest tenor. If the givenTenor is larger than the largest tenor,
use the rate associated with the largest tenor. For anything in between, use a
linear interpolation between the two closest tenors.
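For example (illustrative numbers only): if the two closest available tenors are 30 and 90 with rates 1.0 and 2.0, a givenTenor of 60 would interpolate to 1.0 + (60 - 30) * (2.0 - 1.0) / (90 - 30) = 1.5.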
So assuming that I have another table with financial instruments:
CREATE TABLE Instruments (
id INT NOT NULL PRIMARY KEY,
date DATE NOT NULL,
type VARCHAR(10) NOT NULL,
currency CHAR(3) NOT NULL,
tenor INT NOT NULL
);
I would like to have a stored function that can produce the rates, following the
assumptions above, in a query like this:
SELECT *, SOME_FUNCTION(date, type, currency, tenor) AS rate FROM Instruments;
I believe that a function can be created to do this, but I'm also sure that a
naive implementation would not perform at all. Particularly with millions of
records. The algorithm for the function would be something like this:
With the givenDate, get the availableDate as the maximum date that is less
than or equal to the givenDate, for the givenType and givenCurrency.
For the availableDate get all the tenors and rates. With this vector, go
over the tenors with the assumptions as above, to calculate a rate.
Any ideas on how to do this in Microsoft SQL Server in a performant way?
-- Edit
Following up on @dalek and @trenton-ftw's comments, here is the 'naive'
implementation of the algorithm. I've created random data to test:
5,000 Curve points, spanning the last 5 years, with two types and
two currencies,
1,000,000 Instruments with random dates, types, currencies and
tenors.
To test the function I just executed the following query:
SELECT AVG(dbo.TenorRate("date", "type", "currency", "tenor"))
FROM Instruments
For the random sample above, this query executes in about 2 minutes 30 seconds
on my laptop. Nevertheless, the real application would need to work on
something close to 100,000,000 instruments on a heavily loaded SQL
Server. That would potentially put it in the 20-minute ballpark.
How can I improve the performance of this function?
Here is the naive implementation:
CREATE OR ALTER FUNCTION TenorRate(
    @date DATE,
    @type VARCHAR(8),
    @currency VARCHAR(3),
    @tenor INT
) RETURNS REAL
BEGIN
    DECLARE @rate REAL
    -- The available date is the first date that is less than or equal to
    -- the provided one, or the first one if none are less than or equal to it.
    DECLARE @availableDate DATE
    SET @availableDate = (
        SELECT MAX("date") FROM Curves
        WHERE "date" <= @date AND "type" = @type AND "currency" = @currency
    )
    IF (@availableDate IS NULL)
    BEGIN
        SET @availableDate = (
            SELECT MIN("date") FROM Curves
            WHERE "type" = @type AND "currency" = @currency
        )
    END
    -- Get the tenors and rates for the available date, type and currency.
    -- Ordering by tenor to ensure that the fetch is made in order.
    DECLARE @PreviousTenor INTEGER, @PreviousRate REAL
    DECLARE @CurrentTenor INTEGER, @CurrentRate REAL
    DECLARE Curve CURSOR FAST_FORWARD READ_ONLY FOR
        SELECT "tenor", "rate"
        FROM Curves
        WHERE "date" = @availableDate AND "type" = @type AND "currency" = @currency
        ORDER BY "tenor"
    -- Open a cursor to iterate over the tenors and rates in order.
    OPEN Curve
    FETCH NEXT FROM Curve INTO @CurrentTenor, @CurrentRate
    IF (@tenor < @CurrentTenor)
    BEGIN
        -- If the tenor is less than the first one,
        -- then use the first tenor's rate.
        SET @rate = @CurrentRate
    END
    ELSE
    BEGIN
        WHILE @@FETCH_STATUS = 0
        BEGIN
            IF (@tenor = @CurrentTenor)
            BEGIN
                -- If it matches exactly, return the found rate.
                SET @rate = @CurrentRate
                BREAK
            END
            IF (@tenor < @CurrentTenor)
            BEGIN
                -- If the given tenor is less than the current one
                -- (but not equal), then interpolate with the
                -- previous one and return the calculated rate.
                SET @rate = @PreviousRate +
                    (@tenor - @PreviousTenor) *
                    (@CurrentRate - @PreviousRate) /
                    (@CurrentTenor - @PreviousTenor)
                BREAK
            END
            -- Keep track of the previous tenor and rate.
            SET @PreviousTenor = @CurrentTenor
            SET @PreviousRate = @CurrentRate
            -- Fetch the next tenor and rate.
            FETCH NEXT FROM Curve INTO @CurrentTenor, @CurrentRate
        END
        IF (@tenor > @CurrentTenor)
        BEGIN
            -- If we exhausted the tenors and still nothing found,
            -- then use the last tenor's rate.
            SET @rate = @CurrentRate
        END
    END
    CLOSE Curve
    DEALLOCATE Curve
    RETURN @rate
END;

Assuming a table dbo.Curves exists with the following structure:
CREATE TABLE dbo.Curves(
[date] [date] NOT NULL
,[type] [varchar](10) NOT NULL
,currency [char](3) NOT NULL
,tenor [int] NOT NULL
,rate [real] NOT NULL
,UNIQUE NONCLUSTERED ([date] ASC,[type] ASC,currency ASC,tenor ASC)
,INDEX YourClusteredIndexNameHere CLUSTERED ([type] ASC,currency ASC,[date] ASC)
);
Notable differences from OP's DDL script:
column rate has its data type corrected from double to real. I assume real is the intended type because the provided function stores rate values in variables with a real data type; the data type double does not exist in SQL Server.
I have added a clustered index on (type,currency,date). The table did not have a clustered index in OP's DDL script, but should certainly have one. Considering the only knowledge I have of how this table is used is this function, I will design one that works best for this function.
OP indicated this table has a unique constraint on (date,type,currency,tenor), so I have included it.
Just to note, if the unique constraint that already exists does not require the provided order, I would recommend removing that unique constraint and the clustered index I recommended and simply creating a single clustered primary key constraint on (type,currency,date,tenor).
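A sketch of that alternative (the constraint name is just a placeholder):
CREATE TABLE dbo.Curves(
[date] [date] NOT NULL
,[type] [varchar](10) NOT NULL
,currency [char](3) NOT NULL
,tenor [int] NOT NULL
,rate [real] NOT NULL
,CONSTRAINT PK_Curves PRIMARY KEY CLUSTERED ([type] ASC,currency ASC,[date] ASC,tenor ASC)
);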
Differences between OP's function and set based approach
The actual flow is very similar to OP's provided function. It is well commented, but the function does not read top to bottom: start from the FROM of the innermost subquery and read 'outwards'.
This function is inline-able. You can confirm this once this function is installed to a given database using Microsoft's documentation on this topic.
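For example, on SQL Server 2019 and later you can check the is_inlineable flag exposed by sys.sql_modules (a quick sanity check, not a substitute for reading the doc):
SELECT OBJECT_NAME(object_id) AS FunctionName, is_inlineable
FROM sys.sql_modules
WHERE object_id = OBJECT_ID(N'dbo.TenorRate');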
I corrected the input type for the #i_Type variable to varchar(10) to match the type column of the dbo.Curves table.
I created the two tables and populated them with what I believe was reasonable randomized data. I used the same 200:1 scale from OP's test scenario of 1m records in dbo.Instruments and 5k records in dbo.Curves: I tested using 80k records in dbo.Instruments and 400 records in dbo.Curves, with the exact same test script from OP of SELECT AVG(dbo.TenorRate("date", "type", "currency", "tenor")) FROM Instruments;
OP's CURSOR based approach had a CPU/Elapsed ms of 57774/58654.
Set based approach I wrote had a CPU/Elapsed ms of 7437/7440. So roughly 12.6% of the original's elapsed time.
The set based approach is guaranteed to result in more reads overall because it is repeatedly reading from dbo.Curves, as opposed to the CURSOR option, which pulls the rows from dbo.Curves one time and then reads from the CURSOR. It is worth noting that comparing execution plans of both function approaches in the test case above will yield misleading results. Because the CURSOR based option cannot be inlined, its logical operators and query statistics are not going to be shown in the execution plan, but the set based approach can be inlined and so its logical operators and query statistics will be included in the execution plan. Just thought I would point that out.
I did validate the results of this compared to the original function. I noticed that there are some potential areas for discrepancy.
Both this function and the OP's CURSOR based option do math using real data types. I saw a few examples where the math, although using the same numbers in the CURSOR approach and the set based approach, resulted in slightly different outputs; when rounded to the 5th decimal, the numbers matched exactly. You might need to track that down, but considering that you said you are building a financial application, and a financial application is the classic example where real and float data types should be avoided (because they are approximate data types), I strongly suggest you change the way these values are both stored and used in this function to be numeric. It was not worth my time to track this issue down because IMO it only exists because of a very bad practice that should be resolved anyhow.
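If you do switch, a minimal sketch would be something like the following (the precision and scale are placeholders you would pick to match your rate conventions):
ALTER TABLE dbo.Curves ALTER COLUMN rate decimal(18,9) NOT NULL;
The function's parameters, variables, and return type would need the same treatment.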
Based on my understanding of OP's approach, in scenarios where the input tenor is between the minimum and maximum tenor we would want to use the two rows from dbo.Curves whose tenor values are as close to the input tenor as possible, and then use those two rows (and their rates) to perform a linear interpolation for the input tenor. In this same scenario, the OP's function finds the closest available tenor and then, instead of using the row with the next closest available tenor, interpolates with the previous row (sorted by tenor ASC). Because of how the CURSOR implementation was written, this is not guaranteed to be the next closest row to the input tenor, so I fixed that in my approach. Feel free to revert if I misunderstood.
Notes about my specific implementation
This approach is not heavily optimized towards certain conditions that might exist more commonly. There are still potential optimizations that you might be able to make in your system based on the conditions that exist.
I am using an aggregate to pivot rows to columns. I did test out LEAD/LAG to do this, but found that pivoting using the aggregate was consistently faster, by roughly a 7% decrease in elapsed time.
There are simple warehousing strategies that could be used to GREATLY increase the performance of this approach. For example, storing the pre-calculated min/max for a given combination of a (type,currency,date) could have a tremendous impact.
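As a sketch of that idea (table and column names are made up), the min/max lookup that currently requires the OUTER APPLY against dbo.Curves could come from a small pre-aggregated table instead:
CREATE TABLE dbo.CurveTenorBounds(
[type] [varchar](10) NOT NULL
,currency [char](3) NOT NULL
,[date] [date] NOT NULL
,MinTenor [int] NOT NULL
,MaxTenor [int] NOT NULL
,CONSTRAINT PK_CurveTenorBounds PRIMARY KEY CLUSTERED ([type],currency,[date])
);
INSERT INTO dbo.CurveTenorBounds ([type],currency,[date],MinTenor,MaxTenor)
SELECT [type],currency,[date],MIN(tenor),MAX(tenor)
FROM dbo.Curves
GROUP BY [type],currency,[date];
This would need to be refreshed whenever dbo.Curves changes.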
Here is the function:
CREATE OR ALTER FUNCTION dbo.TenorRate(
@i_Date DATE
,@i_Type VARCHAR(10)
,@i_Currency CHAR(3)
,@i_Tenor INT
)
RETURNS REAL
BEGIN
RETURN (
SELECT
/*now pick the return value based on the scenario that was returned, as a given scenario can have a specific formula*/
CASE
WHEN pivoted_vals.Scenario = 1 THEN pivoted_vals.CurrentRate /*returned row is first row with matching tenor*/
WHEN pivoted_vals.Scenario = 2 THEN pivoted_vals.CurrentRate /*returned row is row with smallest tenor*/
WHEN pivoted_vals.Scenario = 3 THEN pivoted_vals.CurrentRate /*returned row is row with largest tenor*/
WHEN pivoted_vals.Scenario = 4 /*returned row contains pivoted values for the two rows with the closest tenors*/
THEN pivoted_vals.PreviousRate
+
(@i_Tenor - pivoted_vals.PreviousTenor)
*
(pivoted_vals.CurrentRate - pivoted_vals.PreviousRate)
/
(pivoted_vals.CurrentTenor - pivoted_vals.PreviousTenor)
END AS ReturnValue
FROM
(SELECT
/*pivot rows using aggregates. aggregates are only considering one row and value at a time, so they are not truly aggregating anything. simply pivoting values.*/
MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.Scenario ELSE NULL END) AS Scenario
,MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.CurrentRate ELSE NULL END) AS CurrentRate
,MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.CurrentTenor ELSE NULL END) AS CurrentTenor
,MAX(CASE WHEN top_2_rows.RowNumber = 2 THEN top_2_rows.CurrentRate ELSE NULL END) AS PreviousRate
,MAX(CASE WHEN top_2_rows.RowNumber = 2 THEN top_2_rows.CurrentTenor ELSE NULL END) AS PreviousTenor
FROM
/*return a row number for our top two returned rows. done in an outer subquery so that the row_number function is not used until the final two rows have been selected*/
(SELECT TOP (2)
picked_curve.Scenario
,picked_curve.ScenarioRankValue
,picked_curve.CurrentRate
,picked_curve.CurrentTenor
/*generate row number to match order by clause*/
,ROW_NUMBER() OVER(ORDER BY picked_curve.Scenario ASC,picked_curve.ScenarioRankValue ASC) AS RowNumber
FROM
/*we need top two rows because scenario 4 requires the previous row's value*/
(SELECT TOP (2)
scenarios.Scenario
,scenarios.ScenarioRankValue
,c.rate AS CurrentRate
,c.tenor AS CurrentTenor
FROM
/*first subquery to select the date that we will use from two separate conditions*/
(SELECT TOP (1)
date_options.[date]
FROM
(/*most recent available date before or equal to input date*/
SELECT TOP (1)
c.[date]
FROM
dbo.Curves AS c
WHERE
/*match on type and currency*/
c.[type] = @i_Type
AND
c.currency = @i_Currency
AND
c.[date] <= @i_Date
ORDER BY
c.[date] DESC
UNION ALL
/*first available date after input date*/
SELECT TOP (1)
c.[date]
FROM
dbo.Curves AS c
WHERE
/*match on type and currency*/
c.[type] = @i_Type
AND
c.currency = @i_Currency
AND
c.[date] > @i_Date
ORDER BY
c.[date] ASC) AS date_options
ORDER BY
/*we want to prioritize date from first query, which we know will be 'older' than date from second query. So ascending order*/
date_options.[date] ASC) AS selected_date
/*go get curve values for input type, input currency, and selected date*/
INNER JOIN dbo.Curves AS c ON
@i_Type = c.[type]
AND
@i_Currency = c.currency
AND
selected_date.[date] = c.[date]
/*go get max and min curve values for input type, input currency, and selected date*/
OUTER APPLY (SELECT TOP (1) /*TOP (1) is redundant since this is an aggregate with no grouping, but keeping for clarity*/
MAX(c_inner.tenor) AS MaxTenor
,MIN(c_inner.tenor) AS MinTenor
FROM
dbo.Curves AS c_inner
WHERE
c_inner.[type] = @i_Type
AND
c_inner.currency = @i_Currency
AND
c_inner.[date] = selected_date.[date]) AS max_min_tenor
/*for readability in the select, outer apply logic that gives us a value which will prioritize certain scenarios over others (and indicate to us which scenario was returned) and
return the minimum number of rows, ranked in a manner in which the top returned rows contain the information needed for the given scenario*/
OUTER APPLY (SELECT
CASE /*rank the scenarios*/
WHEN @i_Tenor = c.tenor THEN 1
WHEN @i_Tenor < max_min_tenor.MinTenor THEN 2
WHEN @i_Tenor > max_min_tenor.MaxTenor THEN 3
ELSE 4 /*input tenor is between the max/min tenor*/
END AS Scenario
,CASE /*rank value that ensures the top row will be the row we need for the returned scenario*/
WHEN @i_Tenor = c.tenor THEN c.tenor
WHEN @i_Tenor < max_min_tenor.MinTenor THEN c.tenor
WHEN @i_Tenor > max_min_tenor.MaxTenor THEN c.tenor*-1
ELSE ABS(c.tenor-@i_Tenor)
END AS ScenarioRankValue) AS scenarios
ORDER BY
/*highest priority scenario (by lowest value) and the associated value to be ranked, which is designed to be used in ascending order*/
scenarios.Scenario ASC
,scenarios.ScenarioRankValue ASC) AS picked_curve
ORDER BY
picked_curve.Scenario ASC
,picked_curve.ScenarioRankValue ASC) AS top_2_rows) AS pivoted_vals
);
END;
GO
Just one last note on usage: there are certain ways you might use this function that could prevent it from being inlined. I highly recommend you read the entirety of Microsoft's doc on Scalar UDF Inlining, but at a minimum you should at least read over the inlineable scalar UDF requirements contained in that same doc.
Hopefully this helps you out or at least points you in the right direction.

In fact, your function looks like an aggregate function. If that is the case, you should write a SQL CLR one. We did that for a financial score and the performance was fine...


Sort by given "rank" of column values

I have a table like this (unsorted):
risk    category
Low     A
Medium  B
High    C
Medium  A
Low     B
High    A
Low     C
Low     E
Low     D
High    B
I need to sort rows by category, but first based on the value of risk. The desired result should look like this (sorted):
risk    category
Low     A
Low     B
Low     C
Low     D
Low     E
Medium  A
Medium  B
High    A
High    B
High    C
I've come up with below query but wonder if it is correct:
SELECT
*
FROM
some_table
ORDER BY
CASE
WHEN risk = 'Low' THEN
0
WHEN risk = 'Medium' THEN
1
WHEN risk = 'High' THEN
2
ELSE
3
END,
category;
Just want to understand whether the query is correct or not. The actual data set is huge and there are many other values for risk and categories and hence I can't figure out if the results are correct or not. I've just simplified it here.
Basically correct, but you can simplify:
SELECT *
FROM some_table
ORDER BY CASE risk
WHEN 'Low' THEN 0
WHEN 'Medium' THEN 1
WHEN 'High' THEN 2
-- rest defaults to NULL and sorts last
END
, category;
A "switched" CASE is shorter and slightly cheaper.
In the absence of an ELSE branch, all remaining cases default to NULL, and NULL sorts last in default ascending sort order. So you don't need to do anything extra.
Many other values
... there are many other values for risk
While all other values are lumped together at the bottom of the sort order, this seems ok.
If all of those many values get their individual ranking, I would suggest an additional table to handle ranks of risk values. Like:
CREATE TABLE riskrank (
risk text PRIMARY KEY
, riskrank real
);
INSERT INTO riskrank VALUES
('Low' , 0)
, ('Medium', 1)
, ('High' , 2)
-- many more?
;
I used data type real for the rank, so it's easy to squeeze in rows between existing ones using fractional values (much like enum values do it internally).
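For example, a later value can be slotted between existing ranks without renumbering anything (hypothetical value):
INSERT INTO riskrank VALUES ('Moderate', 0.5);  -- sorts between 'Low' and 'Medium'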
Then your query is:
SELECT s.*
FROM some_table s
LEFT JOIN riskrank rr USING (risk)
ORDER BY rr.riskrank, s.category;
LEFT JOIN, so missing entries in riskrank don't eliminate rows.
enum?
I already mentioned the data type enum. That's a possible alternative as enum values are sorted in the order they are defined (not how they are spelled). They only occupy 4 bytes on disk (real internally), are fast and enforce valid values implicitly. See:
How to change the data type of a table column to enum?
However, I would only even consider an enum if the sort order of your values is immutable. Changing sort order and adding / removing allowed values is cumbersome. The manual:
Although enum types are primarily intended for static sets of values,
there is support for adding new values to an existing enum type, and
for renaming values (see ALTER TYPE). Existing values cannot be
removed from an enum type, nor can the sort ordering of such values be
changed, short of dropping and re-creating the enum type.

IIF Function returning incorrect calculated values - SQL Server

I am writing a query to show returns of placing each way bets on horse races
There is an issue with the PlaceProfit result - this should show a return if the horse's finishing position is between 1-4 and a loss if the position is >= 5.
It does show the correct return if the horse's finishing position is below 9th, but 10th place and above is being counted as a win.
I include my code below along with the output.
ALTER VIEW EachWayBetting
AS
SELECT a.ID,
RaceDate,
runners,
track.NAME AS Track,
horse.NAME as HorseName,
IndustrySP,
Place AS 'FinishingPosition',
-- // calculates returns on the win & place parts of an each way bet with 1/5 place terms //
IIF(A.Place = '1', 1.0 * (A.IndustrySP-1), '-1') AS WinProfit,
IIF(A.Place <='4', 1.0 * (A.IndustrySP-1)/5, '-1') AS PlaceProfit
FROM dbo.NewRaceResult a
LEFT OUTER JOIN track ON track.ID = A.TrackID
LEFT OUTER JOIN horse ON horse.ID = A.HorseID
WHERE a.Runners > 22
This returns:
As I mention in the comments, the problem is your choice of data type for Place: it's varchar. The ordering for a string data type is completely different to that of a numerical data type. Strings are sorted character by character from left to right, in the order the characters are defined in the collation you are using. Numerical data types, however, are ordered from the lowest value to the highest.
This means that, for a numerical data type, the value 2 has a lower value than 10; however, for a varchar the value '2' has a higher value than '10'. For the varchar that's because the ordering is performed on the first character first: '2' has a higher value than '1', and so '2' has a higher value than '10'.
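You can see this directly with a quick illustration:
SELECT CASE WHEN '2' > '10' THEN 'varchar: ''2'' sorts after ''10''' END AS string_compare,
       CASE WHEN 2 < 10 THEN 'int: 2 sorts before 10' END AS numeric_compare;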
The solution here is simple, fix your design; store numerical data in a numerical data type (int seems appropriate here). You're also breaking Normal Form rules, as you're storing other data in the column; mainly the reason a horse failed to be classified. Such data isn't a "Place" but information on why the horse didn't place, and so should be in a separate column.
You can therefore fix this by first adding a new column, then moving the non-numerical values into it so that Place only contains numerical data, and finally altering your Place column's data type.
ALTER TABLE dbo.YourTable ADD UnClassifiedReason varchar(5) NULL; --Obviously use an appropriate length.
GO
UPDATE dbo.YourTable
SET Place = TRY_CONVERT(int,Place),
UnClassifiedReason = CASE WHEN TRY_CONVERT(int,Place) IS NULL THEN Place END;
GO
ALTER TABLE dbo.YourTable ALTER COLUMN Place int NULL;
GO
If Place does not allow NULL values, you will need to ALTER the column first to allow them.
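For example (a sketch; keep whatever length your Place column actually has):
ALTER TABLE dbo.YourTable ALTER COLUMN Place varchar(5) NULL;
GO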
In addition to fixing the data as Larnu suggests, you should also fix the query:
SELECT nrr.ID, nrr.RaceDate, nrr.runners,
t.NAME AS Track, h.NAME AS HorseName, nrr.IndustrySP,
nrr.Place AS FinishingPosition,
-- // calculates returns on the win & place parts of an each way bet with 1/5 place terms //
(CASE WHEN nrr.Place = 1 THEN (nrr.IndustrySP - 1.0) ELSE -1 END) AS WinProfit,
(CASE WHEN nrr.Place <= 4 THEN (nrr.IndustrySP - 1.0) / 5 ELSE -1 END) AS PlaceProfit
FROM dbo.NewRaceResult nrr LEFT JOIN
track t
ON t.ID = nrr.TrackID LEFT JOIN
horse h
ON h.ID = nrr.HorseID
WHERE nrr.Runners > 22;
The important changes are removing single quotes from numbers and column names. It seems you need to understand the differences among strings, numbers, and identifiers.
Other changes are:
Meaningful table aliases, rather than meaningless letters such as a.
Qualifying all column references, so it is clear where columns are coming from.
Switching from IIF() to CASE. IIF() is bespoke SQL Server; CASE is standard SQL for conditional expressions (both work fine).
Being sure that the types returned by all branches of the conditional expressions are consistent.
Note: This version will work even if you don't change the type of Place. The strings will be converted to numbers in the appropriate places. I don't advocate relying on such silent conversion, so I recommend fixing the data.
If place can have non-numeric values, then you need to convert them:
(CASE WHEN TRY_CONVERT(int, nrr.Place) = 1 THEN (nrr.IndustrySP - 1.0) ELSE -1 END) AS WinProfit,
(CASE WHEN TRY_CONVERT(int, nrr.Place) <= 4 THEN (nrr.IndustrySP - 1.0) / 5 ELSE -1 END) AS PlaceProfit
But the important point is to fix the data.

MS SQL - Count by range of data

I have a database of aircraft flight track data that cross certain points. I'm looking at the altitude that the aircraft crossed these points at and trying to bin them by every 100 ft. The altitudes range from about 2000 ft to 15000 ft so I want a way to do this that automates the 100 ft increments. So I want to have the crossing point, a range (say 2000-2100 ft), and the count. And the next line is the crossing point, the next range (2100-2200 ft), and the count, and so on.
I'm still a SQL newbie so any help to get me pointed in the right direction would be appreciated. Thanks.
Edited for clarity - I have nothing. I want a column with my crossing location, another with the altitude range, and a third with the count. I'm just not sure to bin the data so it will give me the ranges in 100 ft. increments.
You can use a calculated column for the AltitudeBucket. This is automatically calculated. (This technique is often used for loading dimension tables into data warehouses.)
In this case, having the AltitudeBucket as a calculated column means you can do calculations on it and use it in WHERE clauses.
Create and populate a table.
CREATE TABLE dbo.TrackPoint
(
TrackPointID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
CrossingPoint nvarchar(50) NOT NULL,
AltitudeFeet int NOT NULL
CHECK (AltitudeFeet BETWEEN 1 AND 60000),
AltitudeBucket AS (AltitudeFeet / 100) * 100 PERSISTED NOT NULL
);
GO
INSERT INTO dbo.TrackPoint (CrossingPoint, AltitudeFeet)
VALUES
(N'Paris', 12772),
(N'Paris', 12765),
(N'Paris', 32123),
(N'Toulouse', 5123),
(N'Toulouse', 6123),
(N'Toulouse', 6120),
(N'Lyon', 15000),
(N'Lyon', 15010);
Display what's in the table.
SELECT *
FROM dbo.TrackPoint;
Run a SELECT query to calculate summarised counts.
SELECT CrossingPoint, AltitudeBucket, COUNT(*) AS 'Count'
FROM dbo.TrackPoint
GROUP BY CrossingPoint, AltitudeBucket
ORDER BY CrossingPoint, AltitudeBucket;
If you want to display the altitude range.
SELECT CrossingPoint, AltitudeBucket, CAST(AltitudeBucket AS nvarchar) + N'-' + CAST(AltitudeBucket + 99 AS nvarchar) AS 'AltitudeBucketRange', COUNT(*) AS 'Count'
FROM dbo.TrackPoint
GROUP BY CrossingPoint, AltitudeBucket
ORDER BY CrossingPoint, AltitudeBucket;
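Because AltitudeBucket is a persisted column, you can also filter on it directly, for example (a sketch):
SELECT CrossingPoint, AltitudeFeet
FROM dbo.TrackPoint
WHERE AltitudeBucket = 6100;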
Whenever you're attempting to automate any kind of process, you first must design the algorithm for the process to successfully execute manually. To begin, pick out the smallest piece of this process: returning a count of altitudes between range x and x+100. So when x = 2000, you want to return all records between 2000 and 2100.
SELECT COUNT(*) FROM AltitudesTable
WHERE altitude >= 2000 AND altitude < 2100;
The above code works for one case: 2000 <= x < 2100.
To "automate," or loop through all cases, try using T-SQL:
DECLARE @x INT = 2000;
WHILE @x <= (SELECT MAX(altitude) FROM AltitudesTable)
BEGIN
SELECT COUNT(*) FROM AltitudesTable
WHERE altitude >= @x AND altitude < @x + 100;
SET @x = @x + 100;
END
Respectfully, your requirements are not solidly defined, so I had to make some assumptions regarding table structure and datatypes.

Is it faster to check that a Date is (not) NULL or compare a bit to 1/0?

I'm just wondering what is faster in SQL (specifically SQL Server).
I could have a nullable column of type Date and compare that to NULL, or I could have a non-nullable Date column and a separate bit column, and compare the bit column to 1/0.
Is the comparison to the bit column going to be faster?
In order to check that a column IS NULL SQL Server would actually just check a bit anyway. There is a NULL BITMAP stored for each row indicating whether each column contains a NULL or not.
I just did a simple test for this:
DECLARE @d DATETIME
,@b BIT = 0
SELECT 1
WHERE @d IS NULL
SELECT 2
WHERE @b = 0
The actual execution plan results show the computation as exactly the same cost relative to the batch.
Maybe someone can tear this apart, but to me it seems there's no difference.
MORE TESTS
SET DATEFORMAT ymd;
CREATE TABLE #datenulltest
(
dteDate datetime NULL
)
CREATE TABLE #datebittest
(
dteDate datetime NOT NULL,
bitNull bit DEFAULT (1)
)
INSERT INTO #datenulltest ( dteDate )
SELECT CASE WHEN CONVERT(bit, number % 2) = 1 THEN '2010-08-18' ELSE NULL END
FROM master..spt_values
INSERT INTO #datebittest ( dteDate, bitNull )
SELECT '2010-08-18', CASE WHEN CONVERT(bit, number % 2) = 1 THEN 0 ELSE 1 END
FROM master..spt_values
SELECT 1
FROM #datenulltest
WHERE dteDate IS NULL
SELECT 2
FROM #datebittest
WHERE bitNull = CONVERT(bit, 1)
DROP TABLE #datenulltest
DROP TABLE #datebittest
dteDate IS NULL result:
bitNull = 1 result:
OK, so this extended test comes up with the same responses again.
We could do this all day - it would take some very complex query to find out which is faster on average.
All other things being equal, I would say the Bit would be faster because it is a "smaller" data type. However, if performance is very important here (and I assume it is because of the question) then you should always do testing, as there may be other factors such as indexes, caching that affect this.
It sounds like you are trying to decide on a datatype for field which will record whether an event X has happened or not. So, either a timestamp (when X happened) or just a Bit (1 if X happened, otherwise 0). In this case I would be tempted to go for the Date as it gives you more information (not only whether X happened, but also exactly when) which will most likely be useful in the future for reporting purposes. Only go against this if the minor performance gain really is more important.
Short answer: if you have only 1s and 0s, something like a bitmap index over 1/0 is uber fast. Nulls are not indexed on certain SQL engines, so 'is null' and 'not null' are slow. However, do think of the entity semantics before dishing this out. It is always better to have a semantic table definition, if you know what I mean.
The speed comes from ability to use indices and not from data size in this case.
Edit
Please refer to Martin Smith's answer. That makes more sense for sqlserver, I got carried away by oracle DB, my mistake here.
The bit will be faster, as loading the bit into memory will load only 1 byte while loading the date will take 8 bytes. The comparison itself will take the same time, but the loading from disk will take more time. Unless you use a very old server or need to load more than 10^8 rows you won't notice anything.

table design + SQL question

I have a table foodbar, created with the following DDL. (I am using mySQL 5.1.x)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT,
user_id INT NOT NULL,
weight double not null,
created_at date not null
);
I have four questions:
How may I write a query that returns
a result set that gives me the
following information: user_id,
weight_gain where weight_gain is
the difference between a weight and
a weight that was recorded 7 days
ago.
How may I write a query that will
return the top N users with the
biggest weight gain (again say over
a week).? An 'obvious' way may be to
use the query obtained in question 1
above as a subquery, but somehow
picking the top N.
Since in question 2 (and indeed
question 1), I am searching the
records in the table using a
calculated field, indexing would be
preferable to optimise the query -
however since it is a calculated
field, it is not clear which field
to index (I'm guessing the 'weight'
field is the one that needs
indexing). Am I right in that
assumption?.
Assuming I had another field in the
foodbar table (say 'height') and I
wanted to select records from the
table based on (say) the product
(i.e. multiplication) of 'height'
and 'weight' - would I be right in
assuming again that I need to index
'height' and 'weight'?. Do I also
need to create a composite key (say
(height,weight)). If this question
is not clear, I would be happy to
clarify
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL '7 days'
;
the date arithmetic syntax is probably wrong but you get the idea
How may I write a query that will return the top N users with the biggest weight gain (again say over a week).? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
see above, add ORDER BY curr.weight - prev.weight DESC and LIMIT N
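Put together, it would look something like this (same caveat about the date arithmetic; the 10 is a placeholder for N):
SELECT curr.user_id, curr.weight - prev.weight AS weight_gain
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
ORDER BY weight_gain DESC
LIMIT 10;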
for the last two questions: don't speculate, examine execution plans. (postgresql has EXPLAIN ANALYZE, dunno about mysql) you'll probably find you need to index columns that participate in WHERE and JOIN, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If mySQL supports calculated columns in a table and allows indexing on those columns then that might help.
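For what it's worth, later MySQL versions (5.7 and up, not the 5.1.x in the question) do support generated columns and indexes on them; a sketch of what that could look like, assuming a height column exists:
ALTER TABLE foodbar
ADD COLUMN hw_product DOUBLE AS (height * weight) STORED,
ADD INDEX idx_hw_product (hw_product);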
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when you write records to the foodbar table, a user could supply the weight at that time, but the query would subtract the initial weight from the current weight. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you could use SUM, because it's possible for someone to have weighings every day - using just somebody's equation of curr.weight - prev.weight wouldn't work, regardless of time span.
Getting the top x is easy in MySQL - use the LIMIT clause, but mind that you provide an ORDER BY to make sure the limit is applied correctly.
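A sketch of that top-N query over the deltas (the 7-day window and the 10 are placeholders):
SELECT user_id, SUM(weight_delta) AS weight_gain
FROM foodbar
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY weight_gain DESC
LIMIT 10;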
It's not obvious, but there's some important information missing in the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current-weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly the MySql syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC limit N
3&4: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing when it comes to index selection is what columns you are filtering by or joining on. The optimiser will use the index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of data, to be considered useful). There's always a trade-off between slow disk seek times of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, their only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. This is a complex calculation based on a number of queries and a lot of processing that has gone before; so Weight will provide zero benefit as an index.
Another note is that even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore unless you have a very large number of readings per user you will read most of the table. (I.e. a table scan will be used to obtain the bulk of the data.)
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on User_id, Created_at would be useful (more so if this is the clustered index).
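For example (a sketch; the index name is arbitrary, and the composite primary key suggested earlier would serve the same purpose):
CREATE INDEX idx_foodbar_user_created ON foodbar (user_id, created_at);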
4: No, unfortunately it is mathematically impossible for the ordering of the individual values H and W to determine the ordering of their product. E.g. both H=3 and W=3 are less than 5, yet the product 3*3 = 9 is greater than 5*1 = 5.
You would have to actually store the calculated value and create an index on that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.