Bigquery Rolling Count Distinct - google-bigquery

I'd like to have a rolling count of distinct daily visitors.
Example:
date        visitor
2022-02-01  A
2022-02-01  B
2022-02-01  C
2022-02-02  D
2022-02-02  E
2022-02-03  C
2022-02-03  F
I want the output to be:
date        count_visitor
2022-02-01  3 (ABC)
2022-02-02  5 (DE)
2022-02-03  6 (CF)
I can't seem to find the query for this. Kindly need your help.

Consider the approach below:
select date, daily_count, daily_visitors_list,
  (select count(distinct visitor)
   from unnest(split(rolled_visitors)) visitor
  ) rolled_count
from (
  select date, daily_count, daily_visitors_list,
    string_agg(daily_visitors_list) over(order by unix_date(date)) rolled_visitors
  from (
    select date, count(distinct visitor) daily_count,
      string_agg(distinct visitor) daily_visitors_list
    from your_table
    group by date
  )
) t
If applied to the sample data in your question, rolled_count comes out as 3, 5, and 6 for the three dates, matching your expected output.
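For a quick self-contained test, the sample data can be inlined as a CTE (a sketch; your_table here is just the question's sample rows):

with your_table as (
  select date '2022-02-01' as date, 'A' as visitor union all
  select date '2022-02-01', 'B' union all
  select date '2022-02-01', 'C' union all
  select date '2022-02-02', 'D' union all
  select date '2022-02-02', 'E' union all
  select date '2022-02-03', 'C' union all
  select date '2022-02-03', 'F'
)
select date,
  (select count(distinct visitor)
   from unnest(split(rolled_visitors)) visitor
  ) rolled_count
from (
  select date,
    string_agg(daily_visitors_list) over(order by unix_date(date)) rolled_visitors
  from (
    select date, string_agg(distinct visitor) daily_visitors_list
    from your_table
    group by date
  )
)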
If you have a heavy amount of data, the above approach might end up with a resource-related error. In that case you can use the approach below, which relies on HyperLogLog++ functions:
select date, daily_count, daily_visitors_list,
  (select hll_count.merge(sketch)
   from t.rolling_visitors_sketches sketch
  ) rolled_count
from (
  select date, daily_count, daily_visitors_list,
    array_agg(daily_visitors_sketch) over(order by unix_date(date)) rolling_visitors_sketches
  from (
    select date, count(distinct visitor) daily_count,
      string_agg(distinct visitor) daily_visitors_list,
      hll_count.init(visitor) daily_visitors_sketch
    from your_table
    group by date
  )
) t
HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
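To see the trade-off on a concrete query, the exact and approximate distinct counts can be computed side by side (a minimal sketch against the same your_table):

-- exact vs HLL++ approximate distinct count over the whole table
select
  count(distinct visitor) as exact_count,                     -- exact, memory-heavy at scale
  hll_count.extract(hll_count.init(visitor)) as approx_count  -- HLL++ estimate from a sketch
from your_table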

Related

Converting event-wise table to timeseries

I have an SQLite database (with Django as ORM) with a table of change events (an Account is assigned a new Strategy). I would like to convert it to a timeseries, so that for each day I have the Strategy the Account was following.
My table:
Expected output:
As shown, there can be more than one change per day. In this case I take the last change of the day, as the desired timeseries output must have only one value per day.
My question is similar to this one but in SQL, not BigQuery (but I'm not sure I understood the unnest part they propose). I have a working solution in Pandas with reindex and fillna, but I'm sure there is an elegant and simple solution in SQL (maybe even better with Django ORM).
You can use a RECURSIVE Common Table Expression to generate all dates between the first and the last, and then join this generated table with your data to get the needed value for each day:
WITH RECURSIVE daterange(d) AS (
  SELECT date(min(created_at)) FROM events
  UNION ALL
  SELECT date(d, '1 day') FROM daterange
  WHERE d < (SELECT max(created_at) FROM events)
)
SELECT d, account_id, strategy_id
FROM daterange JOIN events
WHERE created_at = (SELECT max(e.created_at) FROM events e
                    WHERE e.account_id = events.account_id AND date(e.created_at) <= d)
GROUP BY account_id, d
ORDER BY account_id, d
The date() function converts a datetime value to a simple date, so you can use it to group your data by day.
date(d, '1 day') applies a modifier of +1 calendar day to d.
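For reference, here is the behavior of those two SQLite calls in isolation (hypothetical literals, just to show what they return):

SELECT date('2022-10-07 12:53:53');   -- 2022-10-07 (datetime truncated to a date)
SELECT date('2022-10-07', '1 day');   -- 2022-10-08 (modifier adds one calendar day)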
Here is an example with your data:
CREATE TABLE events (
  created_at,
  account_id,
  strategy_id
);
INSERT INTO events
VALUES ('2022-10-07 12:53:53', 4801323843, 7),
       ('2022-10-07 08:10:07', 4801323843, 5),
       ('2022-10-07 15:00:45', 4801323843, 8),
       ('2022-10-10 13:01:16', 4801323843, 6);
WITH RECURSIVE daterange(d) AS (
  SELECT date(min(created_at)) FROM events
  UNION ALL
  SELECT date(d, '1 day') FROM daterange
  WHERE d < (SELECT max(created_at) FROM events)
)
SELECT d, account_id, strategy_id
FROM daterange JOIN events
WHERE created_at = (SELECT max(e.created_at) FROM events e
                    WHERE e.account_id = events.account_id AND date(e.created_at) <= d)
GROUP BY account_id, d
ORDER BY account_id, d
d           account_id  strategy_id
2022-10-07  4801323843  8
2022-10-08  4801323843  8
2022-10-09  4801323843  8
2022-10-10  4801323843  6
2022-10-11  4801323843  6
The query could be slow with many rows. In that case create an index on the created_at column:
CREATE INDEX events_created_idx ON events(created_at);
My final version is the one proposed by @Andrea B., with a slight performance improvement: the join merges only the rows we need, so the WHERE clause can be discarded. I also converted the NULL to date('now').
Here is the final version I used:
with recursive daterange(day) as (
  select min(date(created_at)) from events
  union all
  select date(day, '1 day') from daterange
  where day < date('now')
),
events as (
  select account_id, strategy_id, created_at as start_date,
    case lead(created_at) over (partition by account_id order by created_at) is null
      when True then datetime('now')
      else lead(created_at) over (partition by account_id order by created_at)
    end as end_date
  from events
)
select * from daterange
join events on events.start_date < daterange.day and daterange.day < events.end_date
order by events.account_id
Hope this helps!

Past 7 days running amounts average as progress per each date

So the query is simple, but I am facing issues implementing the SQL logic. Suppose I have records like:
Phoneno  Company  Date      Amount
83838    xyz      20210901  100
87337    abc      20210902  500
47473    cde      20210903  600
The expected output is the past 7 days' progress as a running average of amount for each date (the current date and the 6 days before):
Date      Amount  Avg
20210901  100     100
20210902  500     300
20210903  600     400
I tried:
Select date, amount,
  (select avg(lg) from (
     Select case when lag(amount) Over (order by NULL) IS NULL
            THEN AMOUNT
            ELSE lag(amount) Over (order by NULL)
            END AS LG
     From table
     WHERE DATE >= t.date - 7)
  ) as avg
From table t;
But I am getting wrong avg values. Could anyone please help?
Note: I've tried without lag too; it produces the wrong averages as well.
You could use a self-join to group the dates:
select distinct
  a.dt,
  b.dt as preceding_dt, --just for QA purpose
  a.amt,
  b.amt as preceding_amt, --just for QA purpose
  avg(b.amt) over (partition by a.dt) as avg_amt
from t a
join t b on a.dt - b.dt between 0 and 6
group by a.dt, b.dt, a.amt, b.amt; --to dedupe the data after the join
If you want to make your correlated subquery approach work, you don't really need the lag.
select dt,
amt,
(select avg(b.amt) from t b where a.dt-b.dt between 0 and 6) as avg_lg
from t a;
If you don't have multiple rows per date, this gets even simpler
select dt,
amt,
avg(amt) over (order by dt rows between 6 preceding and current row) as avg_lg
from t;
Also, the condition DATE >= t.date - 7 you used is open on one side, meaning it will qualify a lot of dates that shouldn't have been qualified.
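Closed on both sides, that condition would become something like this (a sketch, using the question's column names):

WHERE date BETWEEN t.date - 6 AND t.date  -- the current date and the 6 days before it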
You can use an analytic function with a windowing clause to get your results:
SELECT DISTINCT BillingDate,
       AVG(amount) OVER (ORDER BY BillingDate
                         RANGE BETWEEN TO_DSINTERVAL('7 00:00:00') PRECEDING
                                   AND TO_DSINTERVAL('0 00:00:00') FOLLOWING) AS RUNNING_AVG
FROM accounts
ORDER BY BillingDate;

Restrict LAG to specific row condition

I am using the following query to return the percentage difference between this month and last month for a given Site ID.
SELECT
  reporting_month,
  total_revenue,
  invoice_count,
  --total_revenue_prev,
  --invoice_count_prev,
  ROUND(SAFE_DIVIDE(total_revenue, total_revenue_prev) - 1, 4) AS actual_growth,
  site_name
FROM (
  SELECT DATE_TRUNC(table.date, MONTH) AS reporting_month,
    ROUND(SUM(table.revenue), 2) AS total_revenue,
    COUNT(*) AS invoice_count,
    ROUND(IFNULL(
      LAG(SUM(table.revenue)) OVER (ORDER BY MIN(DATE_TRUNC(table.date, MONTH))),
      0), 2) AS total_revenue_prev,
    IFNULL(
      LAG(COUNT(*)) OVER (ORDER BY MIN(DATE_TRUNC(table.date, MONTH))),
      0) AS invoice_count_prev,
    tbl_sites.name AS site_name
  FROM table
  LEFT JOIN tbl_sites ON tbl_sites.id = table.site
  WHERE table.site = '123'
  GROUP BY site_name, reporting_month
  ORDER BY reporting_month
)
This is working correctly, printing:
reporting_month          total_revenue  invoice_count  actual_growth  site_name
2020-11-01 00:00:00 UTC  100.00         10             0.571          SiteNameString
2020-12-01 00:00:00 UTC  125.00         7              0.2500         SiteNameString
However I would like to be able to run the same query for all sites. When I remove WHERE table.site = '123' from the subquery, I assume it is the use of LAG that is making the numbers report incorrectly. Is there a way to restrict the LAG to the 'current' row site?
You can simply add a PARTITION BY clause in your LAG statement to define the window:
LAG(SUM(table.revenue)) OVER (PARTITION BY table.site ORDER BY MIN(DATE_TRUNC(table.date, MONTH)))
Here is the related BigQuery documentation page
"PARTITION BY: Breaks up the input rows into separate partitions, over which the analytic function is independently evaluated."

BigQuery ASOF joins usecase

Is it possible to do an ASOF join in BigQuery? I think it only supports equality joins, but I am trying to understand workarounds.
Quotes table
time sym bid ask
2020-01-9T14:30:00.023 XYZ 16.22 16.25
2020-01-9T14:30:00.023 XYZ 16.21 16.27
2020-01-9T14:30:00.030 XYZ 16.20 16.28
2020-01-9T14:30:00.041 XYZ 162.22 16.26
2020-01-9T14:30:00.048 XYZ 162.23 16.28
Trade table
time sym price quantity
2020-01-9T14:30:00.023 MMM 16.23 75
2020-01-9T14:30:00.041 MMM 16.24 50
2020-01-9T14:30:00.041 MMM 16.25 100
Typical SQL to do this in a time-series database would look something like this, but I am wondering whether such a result can be computed in BigQuery:
SELECT timestamp, trades.sym, price, quantity, ask, bid, (ask - bid) AS spread
FROM trades LEFT ASOF JOIN quotes
Expected result
$timestamp sym price quantity ask bid spread
2020-01-9T14:30:00.023000000Z MMM 16.23 75 16.25 16.22 0.03
2020-01-9T14:30:00.041000000Z MMM 16.24 50 16.26 16.22 0.04
2020-01-9T14:30:00.041000000Z MMM 16.25 100 16.26 16.22 0.04
There are at least two approaches here, one that is unscalable using pure Standard SQL, and a second scalable solution that utilizes a helper function created with BigQuery's JavaScript UDF functionality.
Unscalable Solution
I had exactly this same problem with BigQuery using tick data organized much like yours. You can do an inequality join in BigQuery, but this will not scale for even a few months' worth of US equities data consisting of a few hundred names or more. In other words, you might think you could try:
-- Example level 1 quotes table.
WITH l1_quotes AS (
  SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.22 as bid, 16.25 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.21 as bid, 16.27 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.030') as time, 'XYZ' as sym, 16.20 as bid, 16.28 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'XYZ' as sym, 16.22 as bid, 16.26 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.048') as time, 'XYZ' as sym, 16.23 as bid, 16.28 as ask
),
-- Example trades table.
trades AS (
  SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'MMM' as sym, 16.23 as price, 75 as quantity
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.24 as price, 50 as quantity
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.25 as price, 100 as quantity
),
-- Get all level 1 quotes <= each trade for each day using an inequality join.
inequality_join as (
  SELECT
    t.time as time,
    t.sym,
    t.price,
    t.quantity,
    l.time as l1_time,
    l.sym as l1_sym,
    l.bid,
    l.ask,
    l.ask - l.bid as spread
  FROM l1_quotes as l
  JOIN trades as t
    ON EXTRACT(DATE FROM l.time) = EXTRACT(DATE FROM t.time)
    AND l.time <= t.time
)
-- Use DISTINCT and LAST_VALUE to get the latest l1 quote asof each trade.
SELECT DISTINCT
  time, sym, price, quantity,
  LAST_VALUE(bid) OVER (latest_asof_time) as bid,
  LAST_VALUE(ask) OVER (latest_asof_time) as ask,
  LAST_VALUE(spread) OVER (latest_asof_time) as spread
FROM inequality_join
WINDOW latest_asof_time AS (
  PARTITION BY time, sym, l1_sym
  ORDER BY l1_time
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
This will solve your example problem, but it will not scale in practice. Notice in the above that the join hinges on having one equality join condition by date and one inequality join condition by l1_quotes.time <= trades.time. The inequality portion will ultimately generate all l1_quotes entries whose timestamp is at or before the timestamp of each trade entry. Although this is not quite a full CROSS JOIN, and although it gets partitioned by date thanks to the equality join piece, it becomes a very large join in practice. For the US equity database I was working with, BigQuery was unable to do the ASOF JOIN using this technique even after letting it run for many hours. You can play around with shortening the time window of the equality condition, e.g. EXTRACT an hour, minute, etc., but the smaller you make that window the more you run the risk of not getting a prior level 1 quote for some entries, particularly for illiquid symbols.
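For instance, narrowing the equality bucket from a day to an hour would change the join condition to something like this (a sketch; as noted, it risks missing the prior quote for illiquid symbols whose last quote falls in an earlier bucket):

-- hour-sized buckets instead of day-sized ones
ON TIMESTAMP_TRUNC(l.time, HOUR) = TIMESTAMP_TRUNC(t.time, HOUR)
AND l.time <= t.time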
Scalable Solution
So how do you ultimately solve this in a scalable way in BigQuery? Well, one idiomatic approach in BigQuery is to leverage one or more helper JavaScript UDF's when Standard SQL cannot do what you need.
Here is a JavaScript UDF that I wrote to solve this problem. It will generate an array of tuples of X,Y timestamps representing X ASOF Y.
CREATE TEMPORARY FUNCTION asof_join(x ARRAY<TIMESTAMP>, y ARRAY<TIMESTAMP>)
RETURNS ARRAY<STRUCT<x TIMESTAMP, y TIMESTAMP>>
LANGUAGE js AS """
  function js_timestamp_sort(a, b) { return a - b }
  x.sort(js_timestamp_sort);
  y.sort(js_timestamp_sort);
  var epoch = new Date(1970, 0, 1);
  var results = [];
  var j = 0;
  for (var i = 0; i < y.length; i++) {
    var curr_res = {x: epoch, y: epoch};
    for (; j < x.length; j++) {
      if (x[j] <= y[i]) {
        curr_res.x = x[j];
        curr_res.y = y[i];
      } else {
        break;
      }
    }
    if (curr_res.x !== epoch) {
      results.push(curr_res);
      j--;
    }
  }
  return results;
""";
The idea is to pull out the l1_quotes timestamps (X) and the trades timestamps (Y) into two separate arrays. Then we use the above function to generate the array of X,Y ASOF tuples. We then UNNEST this array of tuples to convert them back to an SQL table, and finally we join them back with the l1_quotes and then with the trades. Here is the query that achieves this. In your query editor you will want to paste in the JavaScript UDF above and this query following it:
-- Example level 1 quotes table.
WITH l1_quotes AS (
  SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.22 as bid, 16.25 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.21 as bid, 16.27 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.030') as time, 'XYZ' as sym, 16.20 as bid, 16.28 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'XYZ' as sym, 16.22 as bid, 16.26 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.048') as time, 'XYZ' as sym, 16.23 as bid, 16.28 as ask
),
-- Example trades table.
trades AS (
  SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'MMM' as sym, 16.23 as price, 75 as quantity
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.24 as price, 50 as quantity
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.25 as price, 100 as quantity
),
-- Extract distinct l1 quote times (use DISTINCT to reduce the size of the array
-- for each day since we don't need duplicates for the UDF script to work).
distinct_l1_times AS (
  SELECT DISTINCT time
  FROM l1_quotes
),
arrayed_l1_times AS (
  SELECT
    ARRAY_AGG(time) as times,
    EXTRACT(DATE FROM time) as curr_day
  FROM distinct_l1_times
  GROUP BY curr_day
),
-- Do the same for trade times.
distinct_trade_times AS (
  SELECT DISTINCT time
  FROM trades
),
arrayed_trade_times AS (
  SELECT
    ARRAY_AGG(time) as times,
    EXTRACT(DATE FROM time) as curr_day
  FROM distinct_trade_times
  GROUP BY curr_day
),
-- Use the handy asof_join JavaScript UDF created above.
asof_l1_trade_time_tuples AS (
  SELECT
    asof_join(arrayed_l1_times.times, arrayed_trade_times.times) as asof_tuples,
    arrayed_l1_times.curr_day as curr_day
  FROM arrayed_l1_times
  JOIN arrayed_trade_times
  USING (curr_day)
),
-- UNNEST the array of l1 quote/trade time asof tuples. Date grouping is no
-- longer needed and dropped here.
unnested_times AS (
  SELECT
    a.x as l1_time,
    a.y as trade_time
  FROM asof_l1_trade_time_tuples, UNNEST(asof_l1_trade_time_tuples.asof_tuples) as a
)
-- Join back the l1 quote and trade information for the final result. As before,
-- use DISTINCT and LAST_VALUE to eliminate duplicates that arise from repeated timestamps.
SELECT DISTINCT
  u.trade_time as time,
  t.price as price,
  t.quantity as quantity,
  LAST_VALUE(l.bid) OVER (latest_asof_time) as bid,
  LAST_VALUE(l.ask) OVER (latest_asof_time) as ask,
  LAST_VALUE(l.ask - l.bid) OVER (latest_asof_time) as spread
FROM unnested_times as u
JOIN l1_quotes as l
  ON u.l1_time = l.time
JOIN trades as t
  ON u.trade_time = t.time
WINDOW latest_asof_time AS (
  PARTITION BY t.time, t.sym, l.sym
  ORDER BY l.time
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
The above query using the JavaScript UDF runs in a few minutes on my database as opposed to never finishing after many hours using the unscalable approach. You can even store the asof_join JavaScript UDF as a permanent UDF in your dataset for use in other contexts.
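Persisting it would look roughly like this (a sketch; the project and dataset names are placeholders, and the body is the same JavaScript as the temporary function above):

CREATE OR REPLACE FUNCTION `your_project.your_dataset.asof_join`(x ARRAY<TIMESTAMP>, y ARRAY<TIMESTAMP>)
RETURNS ARRAY<STRUCT<x TIMESTAMP, y TIMESTAMP>>
LANGUAGE js AS """
  // paste the body of the temporary asof_join function here
""";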
I agree that Google should consider implementing an ASOF JOIN, particularly since timeseries analysis using distributed, columnar OLAP engines like BigQuery is becoming much more prevalent. While it's achievable in BigQuery with the approach described here, it would be a lot simpler to do this with a single SELECT statement and a built-in ASOF JOIN construct.
You can do this easily with Kinetica:
SELECT timestamp, t.sym, price, quantity, ask, bid, (ask - bid) AS spread
FROM trades t
LEFT JOIN quotes q
  ON t.sym = q.sym
  AND ASOF(t.time, q.time, INTERVAL '0' MILLISECONDS, INTERVAL '10' MILLISECONDS, MIN)

SQL join against date ranges?

Consider two tables:
Transactions, with amounts in a foreign currency:
Date Amount
========= =======
1/2/2009 1500
2/4/2009 2300
3/15/2009 300
4/17/2009 2200
etc.
ExchangeRates, with the value of the primary currency (let's say dollars) in the foreign currency:
Date Rate
========= =======
2/1/2009 40.1
3/1/2009 41.0
4/1/2009 38.5
5/1/2009 42.7
etc.
Exchange rates can be entered for arbitrary dates - the user could enter them on a daily basis, weekly basis, monthly basis, or at irregular intervals.
In order to translate the foreign amounts to dollars, I need to respect these rules:
A. If possible, use the most recent previous rate; so the transaction on 2/4/2009 uses the rate for 2/1/2009, and the transaction on 3/15/2009 uses the rate for 3/1/2009.
B. If there isn't a rate defined for a previous date, use the earliest rate available. So the transaction on 1/2/2009 uses the rate for 2/1/2009, since there isn't an earlier rate defined.
This works...
Select
  t.Date,
  t.Amount,
  ConvertedAmount = (
    Select Top 1 t.Amount / ex.Rate
    From ExchangeRates ex
    Where t.Date > ex.Date
    Order by ex.Date desc
  )
From Transactions t
... but (1) it seems like a join would be more efficient & elegant, and (2) it doesn't deal with Rule B above.
Is there an alternative to using the subquery to find the appropriate rate? And is there an elegant way to handle Rule B, without tying myself in knots?
You could first do a self-join on the exchange rates, ordered by date, so that you have the start and end date of each exchange rate, without any overlap or gap in the dates (maybe add that as a view to your database; in my case I'm just using a common table expression).
Now joining those "prepared" rates with the transactions is simple and efficient.
Something like:
WITH IndexedExchangeRates AS (
  SELECT Row_Number() OVER (ORDER BY Date) ix,
         Date,
         Rate
  FROM ExchangeRates
),
RangedExchangeRates AS (
  SELECT CASE WHEN IER.ix = 1 THEN CAST('1753-01-01' AS datetime)
              ELSE IER.Date
         END DateFrom,
         COALESCE(IER2.Date, GETDATE()) DateTo,
         IER.Rate
  FROM IndexedExchangeRates IER
  LEFT JOIN IndexedExchangeRates IER2
    ON IER.ix = IER2.ix - 1
)
SELECT T.Date,
       T.Amount,
       RER.Rate,
       T.Amount / RER.Rate ConvertedAmount
FROM Transactions T
LEFT JOIN RangedExchangeRates RER
  ON (T.Date > RER.DateFrom) AND (T.Date <= RER.DateTo)
Notes:
You could replace GETDATE() with a date in the far future, I'm assuming here that no rates for the future are known.
Rule (B) is implemented by setting the date of the first known exchange rate to the minimal date supported by the SQL Server datetime, which should (by definition if it is the type you're using for the Date column) be the smallest value possible.
Suppose you had an extended exchange rate table that contained:
Start Date End Date Rate
========== ========== =======
0001-01-01 2009-01-31 40.1
2009-02-01 2009-02-28 40.1
2009-03-01 2009-03-31 41.0
2009-04-01 2009-04-30 38.5
2009-05-01 9999-12-31 42.7
We can discuss the details of whether the first two rows should be combined, but the general idea is that it is trivial to find the exchange rate for a given date. This structure works with the SQL 'BETWEEN' operator which includes the ends of the ranges. Often, a better format for ranges is 'open-closed'; the first date listed is included and the second is excluded. Note that there is a constraint on the data rows - there are (a) no gaps in the coverage of the range of dates and (b) no overlaps in the coverage. Enforcing those constraints is not completely trivial (polite understatement - meiosis).
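For comparison, with open-closed ranges (EndDate holding the next row's StartDate and excluded from the range), the join would read as follows (a sketch):

SELECT T.Date, T.Amount, X.Rate
FROM Transactions AS T JOIN ExtendedExchangeRates AS X
  ON T.Date >= X.StartDate AND T.Date < X.EndDate;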
Now the basic query is trivial, and Case B is no longer a special case:
SELECT T.Date, T.Amount, X.Rate
FROM Transactions AS T JOIN ExtendedExchangeRates AS X
ON T.Date BETWEEN X.StartDate AND X.EndDate;
The tricky part is creating the ExtendedExchangeRate table from the given ExchangeRate table on the fly.
If it is an option, then revising the structure of the basic ExchangeRate table to match the ExtendedExchangeRate table would be a good idea; you resolve the messy stuff when the data is entered (once a month) instead of every time an exchange rate needs to be determined (many times a day).
How to create the extended exchange rate table? If your system supports adding or subtracting 1 from a date value to obtain the next or previous day (and has a single-row table called 'Dual'), then a variation on this will work (without using any OLAP functions):
CREATE TABLE ExchangeRate
(
    Date DATE NOT NULL,
    Rate DECIMAL(10,5) NOT NULL
);
INSERT INTO ExchangeRate VALUES('2009-02-01', 40.1);
INSERT INTO ExchangeRate VALUES('2009-03-01', 41.0);
INSERT INTO ExchangeRate VALUES('2009-04-01', 38.5);
INSERT INTO ExchangeRate VALUES('2009-05-01', 42.7);
First row:
SELECT '0001-01-01' AS StartDate,
       (SELECT MIN(Date) - 1 FROM ExchangeRate) AS EndDate,
       (SELECT Rate FROM ExchangeRate
        WHERE Date = (SELECT MIN(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
Result:
0001-01-01 2009-01-31 40.10000
Last row:
SELECT (SELECT MAX(Date) FROM ExchangeRate) AS StartDate,
       '9999-12-31' AS EndDate,
       (SELECT Rate FROM ExchangeRate
        WHERE Date = (SELECT MAX(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
Result:
2009-05-01 9999-12-31 42.70000
Middle rows:
SELECT X1.Date AS StartDate,
       X2.Date - 1 AS EndDate,
       X1.Rate AS Rate
FROM ExchangeRate AS X1 JOIN ExchangeRate AS X2
  ON X1.Date < X2.Date
WHERE NOT EXISTS
      (SELECT *
       FROM ExchangeRate AS X3
       WHERE X3.Date > X1.Date AND X3.Date < X2.Date);
Result:
2009-02-01 2009-02-28 40.10000
2009-03-01 2009-03-31 41.00000
2009-04-01 2009-04-30 38.50000
Note that the NOT EXISTS sub-query is rather crucial. Without it, the 'middle rows' result is:
2009-02-01 2009-02-28 40.10000
2009-02-01 2009-03-31 40.10000 # Unwanted
2009-02-01 2009-04-30 40.10000 # Unwanted
2009-03-01 2009-03-31 41.00000
2009-03-01 2009-04-30 41.00000 # Unwanted
2009-04-01 2009-04-30 38.50000
The number of unwanted rows increases dramatically as the table increases in size: for N rows there are N*(N-1)/2 ordered pairs but only N-1 consecutive ones, leaving (N-1)*(N-2)/2 unwanted rows. For the four-row sample above, that is 3*2/2 = 3, matching the three unwanted rows shown.
The result for ExtendedExchangeRate is the (disjoint) UNION of the three queries:
SELECT DATE '0001-01-01' AS StartDate,
       (SELECT MIN(Date) - 1 FROM ExchangeRate) AS EndDate,
       (SELECT Rate FROM ExchangeRate
        WHERE Date = (SELECT MIN(Date) FROM ExchangeRate)) AS Rate
FROM Dual
UNION
SELECT X1.Date AS StartDate,
       X2.Date - 1 AS EndDate,
       X1.Rate AS Rate
FROM ExchangeRate AS X1 JOIN ExchangeRate AS X2
  ON X1.Date < X2.Date
WHERE NOT EXISTS
      (SELECT *
       FROM ExchangeRate AS X3
       WHERE X3.Date > X1.Date AND X3.Date < X2.Date)
UNION
SELECT (SELECT MAX(Date) FROM ExchangeRate) AS StartDate,
       DATE '9999-12-31' AS EndDate,
       (SELECT Rate FROM ExchangeRate
        WHERE Date = (SELECT MAX(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
On the test DBMS (IBM Informix Dynamic Server 11.50.FC6 on Mac OS X 10.6.2), I was able to convert the query into a view, but I had to stop cheating with the data types by coercing the strings into dates:
CREATE VIEW ExtendedExchangeRate(StartDate, EndDate, Rate) AS
SELECT DATE('0001-01-01') AS StartDate,
       (SELECT MIN(Date) - 1 FROM ExchangeRate) AS EndDate,
       (SELECT Rate FROM ExchangeRate
        WHERE Date = (SELECT MIN(Date) FROM ExchangeRate)) AS Rate
FROM Dual
UNION
SELECT X1.Date AS StartDate,
       X2.Date - 1 AS EndDate,
       X1.Rate AS Rate
FROM ExchangeRate AS X1 JOIN ExchangeRate AS X2
  ON X1.Date < X2.Date
WHERE NOT EXISTS
      (SELECT *
       FROM ExchangeRate AS X3
       WHERE X3.Date > X1.Date AND X3.Date < X2.Date)
UNION
SELECT (SELECT MAX(Date) FROM ExchangeRate) AS StartDate,
       DATE('9999-12-31') AS EndDate,
       (SELECT Rate FROM ExchangeRate
        WHERE Date = (SELECT MAX(Date) FROM ExchangeRate)) AS Rate
FROM Dual;
I can't test this, but I think it would work. It uses coalesce with two sub-queries to pick the rate by rule A or rule B.
Select t.Date, t.Amount,
  ConvertedAmount = t.Amount / coalesce(
    (Select Top 1 ex.Rate
     From ExchangeRates ex
     Where t.Date > ex.Date
     Order by ex.Date desc),
    (Select Top 1 ex.Rate
     From ExchangeRates ex
     Order by ex.Date asc)
  )
From Transactions t
SELECT
  a.tranDate,
  a.Amount,
  a.Amount / a.Rate as convertedRate
FROM
(
  SELECT
    t.date tranDate,
    e.date as rateDate,
    t.Amount,
    e.rate,
    RANK() OVER (PARTITION BY t.date ORDER BY
      CASE WHEN DATEDIFF(day, e.date, t.date) < 0
           THEN DATEDIFF(day, e.date, t.date) * -100000
           ELSE DATEDIFF(day, e.date, t.date)
      END) AS diff
  FROM ExchangeRates e
  CROSS JOIN Transactions t
) a
WHERE a.diff = 1
The difference between the transaction date and the rate date is calculated; negative values (condition B) are multiplied by -100000 so that they can still be ranked, but positive values (condition A) always take priority. We then select the minimum date difference for each transaction date using the RANK() OVER clause.
Many solutions will work. You should really find the one that works best (fastest) for your workload: do you usually search for one transaction, a list of them, or all of them?
The tie-breaker solution given your schema is:
SELECT t.Date,
       t.Amount,
       r.Rate
       --//add your multiplication/division here
FROM "Transactions" t
INNER JOIN "ExchangeRates" r
   ON r."ExchangeRateID" = (
        SELECT TOP 1 x."ExchangeRateID"
        FROM "ExchangeRates" x
        WHERE x."SourceCurrencyISO" = t."SourceCurrencyISO" --//these are currency-related filters for your tables,
          AND x."TargetCurrencyISO" = t."TargetCurrencyISO" --//which you should also JOIN on
          AND x."Date" <= t."Date"
        ORDER BY x."Date" DESC)
You need the right indexes for this query to be fast. Also, ideally you should not JOIN on "Date" but on an "ID"-like field (INTEGER). Give me more schema info and I will create an example for you.
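For example, a covering index along these lines would let the TOP 1 subquery be answered by a single index seek (a sketch; SQL Server syntax, column names taken from the query above, index name arbitrary):

-- covering index for the (SourceCurrencyISO, TargetCurrencyISO, Date) lookup
CREATE INDEX IX_ExchangeRates_Lookup
    ON "ExchangeRates" ("SourceCurrencyISO", "TargetCurrencyISO", "Date" DESC)
    INCLUDE ("ExchangeRateID");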
There's nothing about a join that will be more elegant than the TOP 1 correlated subquery in your original post. However, as you say, it doesn't satisfy requirement B.
These queries do work (SQL Server 2005 or later required).
SELECT
  T.*,
  ExchangeRate = E.Rate
FROM
  dbo.Transactions T
CROSS APPLY (
  SELECT TOP 1 Rate
  FROM dbo.ExchangeRate E
  ORDER BY
    CASE WHEN E.RateDate <= T.TranDate THEN 0 ELSE 1 END,  -- prefer prior rates (Rule A)
    CASE WHEN E.RateDate <= T.TranDate
         THEN DATEDIFF(day, E.RateDate, T.TranDate)        -- latest prior rate first
         ELSE DATEDIFF(day, T.TranDate, E.RateDate)        -- else earliest later rate (Rule B)
    END
) E;
Note that the CROSS APPLY with a single column value is functionally equivalent to the correlated subquery in the SELECT clause as you showed. I just prefer CROSS APPLY now because it is much more flexible and lets you reuse the value in multiple places, have multiple rows in it (for custom unpivoting) and lets you have multiple columns.
SELECT
  T.*,
  ExchangeRate = Coalesce(E.Rate, E2.Rate)
FROM
  dbo.Transactions T
OUTER APPLY (
  SELECT TOP 1 Rate
  FROM dbo.ExchangeRate E
  WHERE E.RateDate <= T.TranDate
  ORDER BY E.RateDate DESC
) E
OUTER APPLY (
  SELECT TOP 1 Rate
  FROM dbo.ExchangeRate E2
  WHERE E.Rate IS NULL
  ORDER BY E2.RateDate
) E2;
I don't know which one might perform better, or whether either will beat the other answers on the page. With a proper index on the Date columns, they should zing pretty well; definitely better than any Row_Number() solution.
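For reference, such an index might look like this (a sketch; the index name is arbitrary, table and column names as used above):

-- index supporting the RateDate range scan, with Rate included to avoid lookups
CREATE INDEX IX_ExchangeRate_RateDate
    ON dbo.ExchangeRate (RateDate)
    INCLUDE (Rate);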