BigQuery ASOF join use case - google-bigquery

Is it possible to do an ASOF join in BigQuery? I think it only supports equality joins, but I'm trying to understand possible workarounds.
Quotes table
time                     sym  bid    ask
2020-01-09T14:30:00.023  XYZ  16.22  16.25
2020-01-09T14:30:00.023  XYZ  16.21  16.27
2020-01-09T14:30:00.030  XYZ  16.20  16.28
2020-01-09T14:30:00.041  XYZ  16.22  16.26
2020-01-09T14:30:00.048  XYZ  16.23  16.28
Trade table
time                     sym  price  quantity
2020-01-09T14:30:00.023  MMM  16.23  75
2020-01-09T14:30:00.041  MMM  16.24  50
2020-01-09T14:30:00.041  MMM  16.25  100
Typical SQL to do this in a time-series database would look something like the following, but I'm wondering whether it is possible to compute such a result in BigQuery:
SELECT timestamp, trades.sym, price, quantity, ask, bid, (ask - bid) AS spread FROM trades LEFT ASOF JOIN quotes
Expected result
timestamp                       sym  price  quantity  ask    bid    spread
2020-01-09T14:30:00.023000000Z  MMM  16.23  75        16.25  16.22  0.03
2020-01-09T14:30:00.041000000Z  MMM  16.24  50        16.26  16.22  0.04
2020-01-09T14:30:00.041000000Z  MMM  16.25  100       16.26  16.22  0.04

There are at least two approaches here, one that is unscalable using pure Standard SQL, and a second scalable solution that utilizes a helper function created with BigQuery's JavaScript UDF functionality.
Unscalable Solution
I had exactly this same problem with BigQuery using tick data organized much like yours. You can do an inequality join in BigQuery, but it will not scale for even a few months' worth of US equities data covering a few hundred names or more. In other words, you might think you could try:
-- Example level 1 quotes table.
WITH l1_quotes AS (
SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.22 as bid, 16.25 as ask
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.21 as bid, 16.27 as ask
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.030') as time, 'XYZ' as sym, 16.20 as bid, 16.28 as ask
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'XYZ' as sym, 16.22 as bid, 16.26 as ask
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.048') as time, 'XYZ' as sym, 16.23 as bid, 16.28 as ask
),
-- Example trades table.
trades AS (
SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'MMM' as sym, 16.23 as price, 75 as quantity
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.24 as price, 50 as quantity
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.25 as price, 100 as quantity
),
-- Get all level 1 quotes <= each trade for each day using an inequality join.
inequality_join as (
SELECT
t.time as time,
t.sym,
t.price,
t.quantity,
l.time as l1_time,
l.sym as l1_sym,
l.bid,
l.ask,
l.ask - l.bid as spread
FROM l1_quotes as l
JOIN trades as t
ON EXTRACT(DATE FROM l.time) = EXTRACT(DATE FROM t.time)
AND l.time <= t.time
)
-- Use DISTINCT and LAST_VALUE to get the latest l1 quote asof each trade.
SELECT DISTINCT
time, sym, price, quantity,
LAST_VALUE(bid) OVER (latest_asof_time) as bid,
LAST_VALUE(ask) OVER (latest_asof_time) as ask,
LAST_VALUE(spread) OVER (latest_asof_time) as spread
FROM inequality_join
WINDOW latest_asof_time AS (
PARTITION BY time, sym, l1_sym
ORDER BY l1_time
ROWS between unbounded preceding and unbounded following
);
This will solve your example problem, but it will not scale in practice. Notice in the above that the join hinges on having one equality join condition by date and one inequality join condition by l1_quotes.time <= trades.time. The inequality portion will ultimately generate all l1_quotes entries whose timestamp is at or before the timestamp of each trade entry. Although this is not quite a full CROSS JOIN, and although it gets partitioned by date thanks to the equality join piece, it becomes a very large join in practice. For the US equity database I was working with, BigQuery was unable to do the ASOF JOIN using this technique even after letting it run for many hours. You can play around with shortening the time window of the equality condition, e.g. EXTRACT an hour, minute, etc. (sketched below), but the smaller you make that window the more you run the risk of not getting a prior level 1 quote for some entries, particularly for illiquid symbols.
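For example, a minute-bucketed variant of the inequality_join CTE above would look roughly like this (a hedged sketch; TIMESTAMP_TRUNC buckets both sides to the minute, and a trade whose latest quote falls in an earlier minute would get no match):
-- Drop-in replacement for the inequality_join CTE above, bucketed by minute instead of by date.
inequality_join as (
SELECT
t.time as time,
t.sym,
t.price,
t.quantity,
l.time as l1_time,
l.sym as l1_sym,
l.bid,
l.ask,
l.ask - l.bid as spread
FROM l1_quotes as l
JOIN trades as t
ON TIMESTAMP_TRUNC(l.time, MINUTE) = TIMESTAMP_TRUNC(t.time, MINUTE)
AND l.time <= t.time
)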
Scalable Solution
So how do you ultimately solve this in a scalable way in BigQuery? Well, one idiomatic approach in BigQuery is to leverage one or more helper JavaScript UDFs when Standard SQL cannot do what you need.
Here is a JavaScript UDF that I wrote to solve this problem. It will generate an array of tuples of X,Y timestamps representing X ASOF Y.
CREATE TEMPORARY FUNCTION asof_join(x ARRAY<TIMESTAMP>, y ARRAY<TIMESTAMP>) RETURNS ARRAY<STRUCT<x TIMESTAMP, y TIMESTAMP>> LANGUAGE js AS """
function js_timestamp_sort(a, b) { return a - b }
x.sort(js_timestamp_sort);
y.sort(js_timestamp_sort);
var epoch = new Date(1970, 0, 1);
var results = [];
var j = 0;
for (var i = 0; i < y.length; i++) {
var curr_res = {x: epoch, y: epoch};
for (; j < x.length; j++) {
if (x[j] <= y[i]) {
curr_res.x = x[j];
curr_res.y = y[i];
} else {
break;
}
}
if (curr_res.x !== epoch) {
results.push(curr_res);
j--;
}
}
return results;
""";
The idea is to pull the l1_quotes timestamps (X) and the trades timestamps (Y) into two separate arrays. Then we use the above function to generate the array of X,Y ASOF tuples. We then UNNEST this array of tuples to convert them back into an SQL table, and finally we join them back with the l1_quotes and then with the trades. Here is the query that achieves this. In your query editor, paste in the JavaScript UDF above followed by this query:
-- Example level 1 quotes table.
WITH l1_quotes AS (
SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.22 as bid, 16.25 as ask
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.21 as bid, 16.27 as ask
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.030') as time, 'XYZ' as sym, 16.20 as bid, 16.28 as ask
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'XYZ' as sym, 16.22 as bid, 16.26 as ask
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.048') as time, 'XYZ' as sym, 16.23 as bid, 16.28 as ask
),
-- Example trades table.
trades AS (
SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'MMM' as sym, 16.23 as price, 75 as quantity
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.24 as price, 50 as quantity
UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.25 as price, 100 as quantity
),
-- Extract distinct l1 quote times (use DISTINCT to reduce the size of the array for each day since we don't need duplicates for the UDF script to work).
distinct_l1_times as (
SELECT DISTINCT time
FROM l1_quotes
),
arrayed_l1_times AS (
SELECT
ARRAY_AGG(time) as times,
EXTRACT(DATE FROM time) as curr_day
FROM distinct_l1_times
GROUP BY curr_day
),
-- Do the same for trade times.
distinct_trade_times AS (
SELECT DISTINCT time
FROM trades
),
arrayed_trade_times AS (
SELECT
ARRAY_AGG(time) as times,
EXTRACT(DATE FROM time) as curr_day
FROM distinct_trade_times
GROUP BY curr_day
),
-- Use the handy asof_join JavaScript UDF created above.
asof_l1_trade_time_tuples AS (
SELECT
asof_join(arrayed_l1_times.times, arrayed_trade_times.times) as asof_tuples,
arrayed_l1_times.curr_day as curr_day
FROM arrayed_l1_times
JOIN arrayed_trade_times
USING (curr_day)
),
-- UNNEST the array of l1 quote/trade time asof tuples. Date grouping is no longer needed and dropped here.
unnested_times AS (
SELECT
a.x as l1_time,
a.y as trade_time
FROM
asof_l1_trade_time_tuples, UNNEST(asof_l1_trade_time_tuples.asof_tuples) as a
)
-- Join back the l1 quote and trade information for the final result. As before, use DISTINCT and LAST_VALUE to
-- eliminate duplicates that arise from repeated timestamps.
SELECT DISTINCT
u.trade_time as time,
t.sym as sym,
t.price as price,
t.quantity as quantity,
LAST_VALUE(l.bid) OVER (latest_asof_time) as bid,
LAST_VALUE(l.ask) OVER (latest_asof_time) as ask,
LAST_VALUE(l.ask - l.bid) OVER (latest_asof_time) as spread
FROM unnested_times as u
JOIN l1_quotes as l
ON u.l1_time = l.time
JOIN trades as t
ON u.trade_time = t.time
WINDOW latest_asof_time AS (
PARTITION BY t.time, t.sym, l.sym
ORDER BY l.time
ROWS between unbounded preceding and unbounded following
)
The above query using the JavaScript UDF runs in a few minutes on my database as opposed to never finishing after many hours using the unscalable approach. You can even store the asof_join JavaScript UDF as a permanent UDF in your dataset for use in other contexts.
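For reference, persisting it would look roughly like this (a sketch; your_dataset is a placeholder and the JavaScript body is identical to the temporary function above):
CREATE OR REPLACE FUNCTION your_dataset.asof_join(x ARRAY<TIMESTAMP>, y ARRAY<TIMESTAMP>)
RETURNS ARRAY<STRUCT<x TIMESTAMP, y TIMESTAMP>> LANGUAGE js AS """
// ... same body as the temporary asof_join function above ...
""";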
I agree that Google should consider implementing an ASOF JOIN, particularly since time-series analysis on distributed, columnar OLAP databases like BigQuery is becoming much more prevalent. While it's achievable in BigQuery with the approach described here, it would be a lot simpler to do this with a single SELECT statement and a built-in ASOF JOIN construct.

You can do this easily with Kinetica.
SELECT timestamp, t.sym, price, quantity, ask, bid, (ask - bid) AS spread
FROM trades t
LEFT JOIN quotes q
ON t.sym = q.sym AND
ASOF(t.time, q.time, INTERVAL '0' MILLISECONDS, INTERVAL '10' MILLISECONDS, MIN)

Related

Bigquery Rolling Count Distinct

I'd like to have a rolling count for daily visitors
Example.
date        visitor
2022-02-01  A
2022-02-01  B
2022-02-01  C
2022-02-02  D
2022-02-02  E
2022-02-03  C
2022-02-03  F
I want the output to be:
date        count_visitor
2022-02-01  3 (ABC)
2022-02-02  5 (DE)
2022-02-03  6 (CF)
I can't seem to find the query for this. I kindly need your help.
Consider below approach
select date, daily_count, daily_visitors_list,
( select count(distinct visitor)
from unnest(split(rolled_visitors)) visitor
) rolled_count
from (
select date, daily_count, daily_visitors_list,
string_agg(daily_visitors_list) over(order by unix_date(date)) rolled_visitors
from (
select date, count(distinct visitor) daily_count,
string_agg(distinct visitor) daily_visitors_list,
from your_table
group by date
)
) t
if applied to the sample data in your question, the output is:
date        daily_count  daily_visitors_list  rolled_count
2022-02-01  3            A,B,C                3
2022-02-02  2            D,E                  5
2022-02-03  2            C,F                  6
If you have a heavy amount of data, the above approach might end up with a resource-related error. In that case you can use the approach below, which relies on the HyperLogLog++ functions:
select date, daily_count, daily_visitors_list,
( select hll_count.merge(sketch)
from t.rolling_visitors_sketches sketch
) rolled_count
from (
select date, daily_count, daily_visitors_list,
array_agg(daily_visitors_sketch) over(order by unix_date(date)) rolling_visitors_sketches
from (
select date, count(distinct visitor) daily_count,
string_agg(distinct visitor) daily_visitors_list,
hll_count.init(visitor) daily_visitors_sketch
from your_table
group by date
)
) t
HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
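As a tiny, self-contained illustration of the HLL++ round trip (my own example, separate from the rolling query above):
select hll_count.merge(sketch) as approx_distinct_visitors
from (
select hll_count.init(visitor) as sketch
from unnest(['A', 'B', 'C', 'C']) as visitor
);
-- returns 3: init builds a sketch per group, merge combines sketches into an approximate distinct count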

Break down from date range to daily with high efficiency?

The data has a start_date and an end_date, like 2020-09-18 and 2020-09-28. I need to break it down to daily rows, which is 11 days including 2020-09-18.
My solution is to create a date table with every single day.
with cte as(
select b.fulldate,
count(1) over (partition by a,b,metric_c,metric_d) as count,
a,b,
metric_c, metric_d
from a
join dim_date b
on b.fulldate between a.start_date and a.end_date
)
select
fulldate,
a,b,
metric_c / count as metric_c, --maybe some cast or convert in here
metric_d / count as metric_d
from cte
This is what I'm using currently. But is there a more effective way? If the table has 1,000,000 rows and maybe 10 metrics, how can I get better performance?
Thanks in advance anyway. Maybe there's some method that doesn't need an extra date table (which needs updating whenever it doesn't cover enough dates) and still has really good performance with millions of rows. If not, I'll keep using my method.
I would keep the dim_date data model you have, as it materializes the rows between the start_dates and end_dates.
The DIM_DATE table is an example of a conformed dimension, and it can be used across any other subject areas in your reporting application that need a date dimension.
I would check whether your DIM_DATE has an index on the key being looked up (the b.fulldate field).
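For illustration, an index along these lines (a hedged sketch; SQL Server syntax is an assumption, since the recursive-CTE answer below uses T-SQL, and fulldate is the column from your query):
CREATE INDEX IX_dim_date_fulldate ON dim_date (fulldate);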
I wouldn't be surprised if a recursive subquery had better performance if you have lots of dates and relatively short periods:
with cte as (
select start_date, end_date,
metric_a / (datediff(day, start_date, end_date) + 1) as metric_a,
metric_b / (datediff(day, start_date, end_date) + 1) as metric_b
from a
union all
select dateadd(day, 1, start_date), end_date, metric_a, metric_b
from cte
where start_date < end_date
)
select *
from cte;
You can just add more metrics into the CTE as needed.
If any of the periods exceed 100 days, then you need to add option (maxrecursion 0).
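For example, the hint goes on the outer statement that consumes the recursive CTE (sketch):
select *
from cte
option (maxrecursion 0);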

Expected payments by day given start and end date

I'm trying to create a SQL view that gives me the expected amount to be received by calendar day for recurring transactions. I have a table containing recurring commitments data, with the following columns:
id,
start_date,
end_date (null if still active),
payment day (1,2,3,etc.),
frequency (monthly, quarterly, semi-annually, annually),
commitment amount
For now, I do not need to worry about business days vs calendar days.
In its simplest form, the end result would contain every historical calendar day as well as future dates for the next year, and produce how much was/is expected to be received in those particular days.
I've done quite a bit of researching, but cannot seem to find an answer that addresses the specific problem. Any direction on where to start would be greatly appreciated.
The expect output would look something like this:
| Date | Expected Amount |
|1/1/18 | 100 |
|1/2/18 | 200 |
|1/3/18 | 150 |
Thank you ahead of time!
Link to data table in db-fiddle
Expected Output Spreadsheet
It's something like this, but I've never used Netezza
SELECT
cal.d, sum(r.amount) as expected_amount
FROM
(
SELECT MIN(a.start_date) + ROW_NUMBER() OVER(ORDER BY NULL) as d
FROM recurring a, recurring b, recurring c
) cal
LEFT JOIN
recurring r
ON
(
(r.frequency = 'monthly' AND r.payment_day = DATE_PART('DAY', cal.d)) OR
(r.frequency = 'annually' AND DATE_PART('MONTH', cal.d) = DATE_PART('MONTH', r.start_date) AND r.payment_day = DATE_PART('DAY', cal.d))
) AND
r.start_date <= cal.d AND
(r.end_date >= cal.d OR r.end_date IS NULL)
GROUP BY cal.d
In essence, we cartesian join our recurring table together a few times to generate a load of rows, number them and add the number onto the min date to get an incrementing date series.
The payments data table is left joined onto this incrementing date series on:
(the day of the date from the series) = (the payment day) for monthlies
(the month-day of the date from the series) = (the month and payment day of the start_date)
Finally, the whole lot is grouped and summed
I don't have a test instance of Netezza so if you encounter some minor syntax errors, do please have a stab at fixing them up yourself (to make it faster for you to get a solution). If you reach a point where you can't work out what the query is doing, let me know
Disclaimer: I'm no expert on Netezza, so I decided to write you a standard SQL that may need some tweaking to run on Netezza.
with
digit as (select 0 as x union select 1 union select 2 union select 3 union select 4
union select 5 union select 6 union select 7 union select 8 union select 9
),
number as ( -- produces numbers from 0 to 9999 (28 years)
select d1.x + d2.x * 10 + d3.x * 100 + d4.x * 1000 as n
from digit d1
cross join digit d2
cross join digit d3
cross join digit d4
),
expected_payment as ( -- expands all expected payments
select
c.start_date + nb.n as day,
c.committed_amount
from recurring_commitement c
cross join number nb
where c.start_date + nb.n <= c.end_date
and c.frequency ... -- add logic for monthly, quarterly, etc. here
)
select
day,
sum(committed_amount) as expected_amount
from expected_payment
group by day
order by day
This solution is valid for commitments that do not exceed 28 years, since the number CTE (Common Table Expression) is producing up to a maximum of 9999 days. Expand with a fifth digit if you need longer commitments.
Note: I think the way I'm adding days to a date is not correct in Netezza's SQL. The expression c.start_date + nb.n may need to be rephrased.
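For what it's worth, here is a hedged sketch of what the frequency placeholder could look like for the monthly case only (column names follow the question's description; quarterly, semi-annual, and annual commitments would need analogous predicates):
expected_payment as ( -- monthly case only
select
c.start_date + nb.n as day,
c.committed_amount
from recurring_commitement c
cross join number nb
where c.start_date + nb.n <= c.end_date
and c.frequency = 'monthly'
and extract(day from c.start_date + nb.n) = c.payment_day
)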

SQL Server stock quote every second

I have a table with stock quotes
Symbol
Ask
Bid
QuoteDateTime
Using SQL Server 2008, let's say for any given 60-second time period I want to select quotes on all symbols so that there is a record for every second in that time period.
Problem is not every symbol has the same number of quotes - so there are some seconds that have no quote record for any given symbol. So I want to fill in the missing holes of data. So if ORCL has quotes at second 1, 2, 3, 5, 7, I want the result set that has 1,2,3,4,5,6,7...up to 60 sec (covering the whole minute). The values in row 4 come from row 3
In this scenario I would want to select the previous quote and use that for that particular second. So there is continuous records for each symbol and the same number of records are selected for each symbol.
I am not sure what this is called in SQL Server, but any help building a query to do this would be great.
For output, I am expecting that for any given 60-second time period, each symbol that has a record within those 60 seconds will produce 60 records, one for each second:
Symbol Ask Bid QuoteDateTime
MSFT 26.00 27.00 2010-05-20 06:28:00
MSFT 26.01 27.02 2010-05-20 06:28:01
...
ORCL 26.00 27.00 2010-05-20 06:28:00
ORCL 26.01 27.02 2010-05-20 06:28:01
etc
Here's one way. A triangular join could also be used. Wonder if there are any other options?
DECLARE @startTime DATETIME = '2010-09-16 14:59:00.000';
WITH Times AS
(
SELECT @startTime AS T
UNION ALL
SELECT DATEADD(SECOND,1, T) FROM Times
WHERE T < DATEADD(MINUTE,1,@startTime)
),
Stocks AS
(
SELECT 'GOOG' AS Symbol
UNION ALL
SELECT 'MSFT'
),
Prices AS
(
SELECT 'GOOG' AS Symbol, 1 AS Ask, 1 AS Bid,
CAST('2010-09-16 14:59:02.000' AS DATETIME) AS QuoteDateTime
UNION ALL
SELECT 'GOOG' AS Symbol, 1 AS Ask, 1 AS Bid,
CAST('2010-09-16 14:59:02.000' AS DATETIME) AS QuoteDateTime
UNION ALL
SELECT 'GOOG' AS Symbol, 1 AS Ask, 1 AS Bid,
CAST('2010-09-16 14:59:02.000' AS DATETIME) AS QuoteDateTime
UNION ALL
SELECT 'MSFT' AS Symbol, 1 AS Ask, 1 AS Bid,
CAST('2010-09-01 12:00:00.000' AS DATETIME) AS QuoteDateTime
)
SELECT p.Symbol, p.Ask, p.Bid, p.QuoteDateTime
FROM Times t CROSS JOIN Stocks s
CROSS APPLY
(SELECT TOP 1 p.Symbol, p.Ask, p.Bid, p.QuoteDateTime
FROM Prices p
WHERE s.Symbol = p.Symbol AND p.QuoteDateTime <= t.T
ORDER BY p.QuoteDateTime DESC) p
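Against a real Quotes table (Symbol, Ask, Bid, QuoteDateTime, as described in the question), the same pattern would look roughly like this (a hedged sketch, not tested):
DECLARE @startTime DATETIME = '2010-05-20 06:28:00.000';
WITH Times AS
(
SELECT @startTime AS T
UNION ALL
SELECT DATEADD(SECOND, 1, T) FROM Times
WHERE T < DATEADD(MINUTE, 1, @startTime)
)
SELECT s.Symbol, t.T AS QuoteSecond, q.Ask, q.Bid, q.QuoteDateTime
FROM Times t
CROSS JOIN (SELECT DISTINCT Symbol FROM Quotes) s
CROSS APPLY
(SELECT TOP 1 q2.Ask, q2.Bid, q2.QuoteDateTime
FROM Quotes q2
WHERE q2.Symbol = s.Symbol AND q2.QuoteDateTime <= t.T
ORDER BY q2.QuoteDateTime DESC) q
ORDER BY s.Symbol, t.T;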

SQL join against date ranges?

Consider two tables:
Transactions, with amounts in a foreign currency:
Date Amount
========= =======
1/2/2009 1500
2/4/2009 2300
3/15/2009 300
4/17/2009 2200
etc.
ExchangeRates, with the value of the primary currency (let's say dollars) in the foreign currency:
Date Rate
========= =======
2/1/2009 40.1
3/1/2009 41.0
4/1/2009 38.5
5/1/2009 42.7
etc.
Exchange rates can be entered for arbitrary dates - the user could enter them on a daily basis, weekly basis, monthly basis, or at irregular intervals.
In order to translate the foreign amounts to dollars, I need to respect these rules:
A. If possible, use the most recent previous rate; so the transaction on 2/4/2009 uses the rate for 2/1/2009, and the transaction on 3/15/2009 uses the rate for 3/1/2009.
B. If there isn't a rate defined for a previous date, use the earliest rate available. So the transaction on 1/2/2009 uses the rate for 2/1/2009, since there isn't an earlier rate defined.
This works...
Select
t.Date,
t.Amount,
ConvertedAmount=(
Select Top 1
t.Amount/ex.Rate
From ExchangeRates ex
Where t.Date > ex.Date
Order by ex.Date desc
)
From Transactions t
... but (1) it seems like a join would be more efficient & elegant, and (2) it doesn't deal with Rule B above.
Is there an alternative to using the subquery to find the appropriate rate? And is there an elegant way to handle Rule B, without tying myself in knots?
You could first do a self-join on the exchange rates, ordered by date, so that you have the start and the end date of each exchange rate, without any overlap or gap in the dates (maybe add that as a view to your database - in my case I'm just using a common table expression).
Now joining those "prepared" rates with the transactions is simple and efficient.
Something like:
WITH IndexedExchangeRates AS (
SELECT Row_Number() OVER (ORDER BY Date) ix,
Date,
Rate
FROM ExchangeRates
),
RangedExchangeRates AS (
SELECT CASE WHEN IER.ix=1 THEN CAST('1753-01-01' AS datetime)
ELSE IER.Date
END DateFrom,
COALESCE(IER2.Date, GETDATE()) DateTo,
IER.Rate
FROM IndexedExchangeRates IER
LEFT JOIN IndexedExchangeRates IER2
ON IER.ix = IER2.ix-1
)
SELECT T.Date,
T.Amount,
RER.Rate,
T.Amount/RER.Rate ConvertedAmount
FROM Transactions T
LEFT JOIN RangedExchangeRates RER
ON (T.Date > RER.DateFrom) AND (T.Date <= RER.DateTo)
Notes:
You could replace GETDATE() with a date in the far future, I'm assuming here that no rates for the future are known.
Rule (B) is implemented by setting the date of the first known exchange rate to the minimal date supported by the SQL Server datetime, which should (by definition if it is the type you're using for the Date column) be the smallest value possible.
Suppose you had an extended exchange rate table that contained:
Start Date End Date Rate
========== ========== =======
0001-01-01 2009-01-31 40.1
2009-02-01 2009-02-28 40.1
2009-03-01 2009-03-31 41.0
2009-04-01 2009-04-30 38.5
2009-05-01 9999-12-31 42.7
We can discuss the details of whether the first two rows should be combined, but the general idea is that it is trivial to find the exchange rate for a given date. This structure works with the SQL 'BETWEEN' operator which includes the ends of the ranges. Often, a better format for ranges is 'open-closed'; the first date listed is included and the second is excluded. Note that there is a constraint on the data rows - there are (a) no gaps in the coverage of the range of dates and (b) no overlaps in the coverage. Enforcing those constraints is not completely trivial (polite understatement - meiosis).
Now the basic query is trivial, and Case B is no longer a special case:
SELECT T.Date, T.Amount, X.Rate
FROM Transactions AS T JOIN ExtendedExchangeRates AS X
ON T.Date BETWEEN X.StartDate AND X.EndDate;
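Picking up the earlier note about open-closed ranges: with that layout (EndDate holding the next row's StartDate rather than the day before it), the lookup would use explicit comparisons instead of BETWEEN (sketch):
SELECT T.Date, T.Amount, X.Rate
FROM Transactions AS T JOIN ExtendedExchangeRates AS X
ON T.Date >= X.StartDate AND T.Date < X.EndDate;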
The tricky part is creating the ExtendedExchangeRate table from the given ExchangeRate table on the fly.
If it is an option, then revising the structure of the basic ExchangeRate table to match the ExtendedExchangeRate table would be a good idea; you resolve the messy stuff when the data is entered (once a month) instead of every time an exchange rate needs to be determined (many times a day).
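A hedged sketch of what that revised table might look like (names and types chosen to match the queries here):
CREATE TABLE ExtendedExchangeRate
(
StartDate DATE NOT NULL,
EndDate DATE NOT NULL,
Rate DECIMAL(10,5) NOT NULL,
CHECK (StartDate <= EndDate)
);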
How to create the extended exchange rate table? If your system supports adding or subtracting 1 from a date value to obtain the next or previous day (and has a single-row table called 'Dual'), then a variation on this will work (without using any OLAP functions):
CREATE TABLE ExchangeRate
(
Date DATE NOT NULL,
Rate DECIMAL(10,5) NOT NULL
);
INSERT INTO ExchangeRate VALUES('2009-02-01', 40.1);
INSERT INTO ExchangeRate VALUES('2009-03-01', 41.0);
INSERT INTO ExchangeRate VALUES('2009-04-01', 38.5);
INSERT INTO ExchangeRate VALUES('2009-05-01', 42.7);
First row:
SELECT '0001-01-01' AS StartDate,
(SELECT MIN(Date) - 1 FROM ExchangeRate) AS EndDate,
(SELECT Rate FROM ExchangeRate
WHERE Date = (SELECT MIN(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
Result:
0001-01-01 2009-01-31 40.10000
Last row:
SELECT (SELECT MAX(Date) FROM ExchangeRate) AS StartDate,
'9999-12-31' AS EndDate,
(SELECT Rate FROM ExchangeRate
WHERE Date = (SELECT MAX(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
Result:
2009-05-01 9999-12-31 42.70000
Middle rows:
SELECT X1.Date AS StartDate,
X2.Date - 1 AS EndDate,
X1.Rate AS Rate
FROM ExchangeRate AS X1 JOIN ExchangeRate AS X2
ON X1.Date < X2.Date
WHERE NOT EXISTS
(SELECT *
FROM ExchangeRate AS X3
WHERE X3.Date > X1.Date AND X3.Date < X2.Date
);
Result:
2009-02-01 2009-02-28 40.10000
2009-03-01 2009-03-31 41.00000
2009-04-01 2009-04-30 38.50000
Note that the NOT EXISTS sub-query is rather crucial. Without it, the 'middle rows' result is:
2009-02-01 2009-02-28 40.10000
2009-02-01 2009-03-31 40.10000 # Unwanted
2009-02-01 2009-04-30 40.10000 # Unwanted
2009-03-01 2009-03-31 41.00000
2009-03-01 2009-04-30 41.00000 # Unwanted
2009-04-01 2009-04-30 38.50000
The number of unwanted rows increases dramatically as the table increases in size (for N rows, there are (N-1) * (N-2) / 2 unwanted rows, I believe).
The result for ExtendedExchangeRate is the (disjoint) UNION of the three queries:
SELECT DATE '0001-01-01' AS StartDate,
(SELECT MIN(Date) - 1 FROM ExchangeRate) AS EndDate,
(SELECT Rate FROM ExchangeRate
WHERE Date = (SELECT MIN(Date) FROM ExchangeRate)) AS Rate
FROM Dual
UNION
SELECT X1.Date AS StartDate,
X2.Date - 1 AS EndDate,
X1.Rate AS Rate
FROM ExchangeRate AS X1 JOIN ExchangeRate AS X2
ON X1.Date < X2.Date
WHERE NOT EXISTS
(SELECT *
FROM ExchangeRate AS X3
WHERE X3.Date > X1.Date AND X3.Date < X2.Date
)
UNION
SELECT (SELECT MAX(Date) FROM ExchangeRate) AS StartDate,
DATE '9999-12-31' AS EndDate,
(SELECT Rate FROM ExchangeRate
WHERE Date = (SELECT MAX(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
On the test DBMS (IBM Informix Dynamic Server 11.50.FC6 on MacOS X 10.6.2), I was able to convert the query into a view but I had to stop cheating with the data types - by coercing the strings into dates:
CREATE VIEW ExtendedExchangeRate(StartDate, EndDate, Rate) AS
SELECT DATE('0001-01-01') AS StartDate,
(SELECT MIN(Date) - 1 FROM ExchangeRate) AS EndDate,
(SELECT Rate FROM ExchangeRate WHERE Date = (SELECT MIN(Date) FROM ExchangeRate)) AS Rate
FROM Dual
UNION
SELECT X1.Date AS StartDate,
X2.Date - 1 AS EndDate,
X1.Rate AS Rate
FROM ExchangeRate AS X1 JOIN ExchangeRate AS X2
ON X1.Date < X2.Date
WHERE NOT EXISTS
(SELECT *
FROM ExchangeRate AS X3
WHERE X3.Date > X1.Date AND X3.Date < X2.Date
)
UNION
SELECT (SELECT MAX(Date) FROM ExchangeRate) AS StartDate,
DATE('9999-12-31') AS EndDate,
(SELECT Rate FROM ExchangeRate WHERE Date = (SELECT MAX(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
I can't test this, but I think it would work. It uses coalesce with two sub-queries to pick the rate by rule A or rule B.
Select t.Date, t.Amount,
ConvertedAmount = t.Amount/coalesce(
(Select Top 1 ex.Rate
From ExchangeRates ex
Where t.Date > ex.Date
Order by ex.Date desc )
,
(select top 1 ex.Rate
From ExchangeRates ex
Order by ex.Date asc)
)
From Transactions t
SELECT
a.tranDate,
a.Amount,
a.Amount/a.Rate as convertedRate
FROM
(
SELECT
t.date tranDate,
e.date as rateDate,
t.Amount,
e.rate,
RANK() OVER (Partition BY t.date ORDER BY
CASE WHEN DATEDIFF(day,e.date,t.date) < 0 THEN
DATEDIFF(day,e.date,t.date) * -100000
ELSE DATEDIFF(day,e.date,t.date)
END ) AS diff
FROM
ExchangeRates e
CROSS JOIN
Transactions t
) a
WHERE a.diff = 1
The difference between the transaction date and the rate date is calculated, then negative values (condition B) are multiplied by -100000 so that they can still be ranked, but positive values (condition A) always take priority. We then select the minimum date difference for each transaction date using the RANK() OVER clause.
Many solutions will work. You should really find the one that works best (fastest) for your workload: do you usually search for one Transaction, a list of them, or all of them?
The tie-breaker solution given your schema is:
SELECT t.Date,
t.Amount,
r.Rate
--//add your multiplication/division here
FROM "Transactions" t
INNER JOIN "ExchangeRates" r
ON r."ExchangeRateID" = (
SELECT TOP 1 x."ExchangeRateID"
FROM "ExchangeRates" x
WHERE x."SourceCurrencyISO" = t."SourceCurrencyISO" --//these are currency-related filters for your tables
AND x."TargetCurrencyISO" = t."TargetCurrencyISO" --//,which you should also JOIN on
AND x."Date" <= t."Date"
ORDER BY x."Date" DESC)
You need to have the right indices for this query to be fast. Also, ideally you should not JOIN on "Date", but on an "ID"-like field (INTEGER). Give me more schema info and I will create an example for you.
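For illustration, a covering index along these lines (a hedged sketch; SQL Server syntax and the quoted identifiers from the query above are assumptions):
CREATE INDEX "IX_ExchangeRates_Lookup"
ON "ExchangeRates" ("SourceCurrencyISO", "TargetCurrencyISO", "Date" DESC)
INCLUDE ("ExchangeRateID", "Rate");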
There's nothing about a join that will be more elegant than the TOP 1 correlated subquery in your original post. However, as you say, it doesn't satisfy requirement B.
These queries do work (SQL Server 2005 or later required). See the SqlFiddle for these.
SELECT
T.*,
ExchangeRate = E.Rate
FROM
dbo.Transactions T
CROSS APPLY (
SELECT TOP 1 Rate
FROM dbo.ExchangeRate E
WHERE E.RateDate <= T.TranDate
ORDER BY
CASE WHEN E.RateDate <= T.TranDate THEN 0 ELSE 1 END,
E.RateDate DESC
) E;
Note that the CROSS APPLY with a single column value is functionally equivalent to the correlated subquery in the SELECT clause as you showed. I just prefer CROSS APPLY now because it is much more flexible and lets you reuse the value in multiple places, have multiple rows in it (for custom unpivoting) and lets you have multiple columns.
SELECT
T.*,
ExchangeRate = Coalesce(E.Rate, E2.Rate)
FROM
dbo.Transactions T
OUTER APPLY (
SELECT TOP 1 Rate
FROM dbo.ExchangeRate E
WHERE E.RateDate <= T.TranDate
ORDER BY E.RateDate DESC
) E
OUTER APPLY (
SELECT TOP 1 Rate
FROM dbo.ExchangeRate E2
WHERE E.Rate IS NULL
ORDER BY E2.RateDate
) E2;
I don't know which one might perform better, or if either will perform better than other answers on the page. With a proper index on the Date columns, they should zing pretty well--definitely better than any Row_Number() solution.
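To illustrate the multiple-columns point above, returning the matching rate's date alongside the rate is just one more column inside the APPLY (a sketch using the same tables):
SELECT
T.*,
ExchangeRate = E.Rate,
RateDateUsed = E.RateDate
FROM
dbo.Transactions T
CROSS APPLY (
SELECT TOP 1 Rate, RateDate
FROM dbo.ExchangeRate E
WHERE E.RateDate <= T.TranDate
ORDER BY E.RateDate DESC
) E;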