I have a table with stock quotes
Symbol
Ask
Bid
QuoteDateTime
Using SQL Server 2008, let's say for any given 60-second time period I want to select quotes on all symbols so that there is a record for every second in that period.
The problem is that not every symbol has the same number of quotes, so some seconds have no quote record for a given symbol. I want to fill in those missing holes of data. So if ORCL has quotes at seconds 1, 2, 3, 5, and 7, I want a result set that has rows for seconds 1, 2, 3, 4, 5, 6, 7... up to 60 (covering the whole minute), where the values in row 4 come from row 3.
In this scenario I would want to select the previous quote and use it for that particular second, so there are continuous records for each symbol and the same number of records is selected for each symbol.
I am not sure what this is called in SQL Server, but any help building a query to do this would be great.
For output, I am expecting that for any given 60-second time period, each symbol that has a record within that window will produce 60 records, one for each second:
Symbol Ask Bid QuoteDateTime
MSFT 26.00 27.00 2010-05-20 06:28:00
MSFT 26.01 27.02 2010-05-20 06:28:01
...
ORCL 26.00 27.00 2010-05-20 06:28:00
ORCL 26.01 27.02 2010-05-20 06:28:01
etc
Here's one way. A triangular join could also be used. Wonder if there are any other options?
DECLARE @startTime DATETIME = '2010-09-16 14:59:00.000';
WITH Times AS
(
SELECT @startTime AS T
UNION ALL
SELECT DATEADD(SECOND,1, T) FROM Times
WHERE T < DATEADD(MINUTE,1,@startTime)
),
Stocks AS
(
SELECT 'GOOG' AS Symbol
UNION ALL
SELECT 'MSFT'
),
Prices AS
(
SELECT 'GOOG' AS Symbol, 1 AS Ask, 1 AS Bid,
CAST('2010-09-16 14:59:02.000' AS DATETIME) AS QuoteDateTime
UNION ALL
SELECT 'GOOG' AS Symbol, 1 AS Ask, 1 AS Bid,
CAST('2010-09-16 14:59:02.000' AS DATETIME) AS QuoteDateTime
UNION ALL
SELECT 'GOOG' AS Symbol, 1 AS Ask, 1 AS Bid,
CAST('2010-09-16 14:59:02.000' AS DATETIME) AS QuoteDateTime
UNION ALL
SELECT 'MSFT' AS Symbol, 1 AS Ask, 1 AS Bid,
CAST('2010-09-01 12:00:00.000' AS DATETIME) AS QuoteDateTime
)
SELECT p.Symbol, p.Ask, p.Bid, p.QuoteDateTime
FROM Times t CROSS JOIN Stocks s
CROSS APPLY
    (SELECT TOP 1 pr.Symbol, pr.Ask, pr.Bid, pr.QuoteDateTime
     FROM Prices pr
     WHERE s.Symbol = pr.Symbol AND pr.QuoteDateTime <= t.T
     ORDER BY pr.QuoteDateTime DESC) p
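For completeness, here's a hedged sketch of one such alternative: drive the seconds from a tally source instead of a recursive CTE (assuming Prices is a real table here rather than the demo CTE; master..spt_values with type 'P' is just a convenient built-in 0..2047 number source, and any 60-row sequence works):

DECLARE @startTime DATETIME = '2010-09-16 14:59:00.000';

SELECT p.Symbol, p.Ask, p.Bid, p.QuoteDateTime
FROM (SELECT DATEADD(SECOND, number, @startTime) AS T
      FROM master..spt_values
      WHERE type = 'P' AND number BETWEEN 0 AND 59) t   -- the 60 seconds
CROSS JOIN (SELECT DISTINCT Symbol FROM Prices) s       -- one row per symbol
CROSS APPLY                                             -- latest quote at or before each second
    (SELECT TOP 1 pr.Symbol, pr.Ask, pr.Bid, pr.QuoteDateTime
     FROM Prices pr
     WHERE pr.Symbol = s.Symbol AND pr.QuoteDateTime <= t.T
     ORDER BY pr.QuoteDateTime DESC) p;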
Is it possible to do an ASOF join in BigQuery? I think it only supports equality joins, but I'm trying to understand workarounds.
Quotes table
time sym bid ask
2020-01-9T14:30:00.023 XYZ 16.22 16.25
2020-01-9T14:30:00.023 XYZ 16.21 16.27
2020-01-9T14:30:00.030 XYZ 16.20 16.28
2020-01-9T14:30:00.041 XYZ 16.22 16.26
2020-01-9T14:30:00.048 XYZ 16.23 16.28
Trade table
time sym price quantity
2020-01-9T14:30:00.023 MMM 16.23 75
2020-01-9T14:30:00.041 MMM 16.24 50
2020-01-9T14:30:00.041 MMM 16.25 100
Typical SQL to do this in a time series database would look something like this, but I'm wondering whether such a result can be computed in BigQuery:
SELECT timestamp, trades.sym, price, quantity, ask, bid, (ask - bid) AS spread
FROM trades LEFT ASOF JOIN quotes
Expected result
timestamp sym price quantity ask bid spread
2020-01-9T14:30:00.023000000Z MMM 16.23 75 16.25 16.22 0.03
2020-01-9T14:30:00.041000000Z MMM 16.24 50 16.26 16.22 0.04
2020-01-9T14:30:00.041000000Z MMM 16.25 100 16.26 16.22 0.04
There are at least two approaches here: one that is unscalable using pure Standard SQL, and a second, scalable solution that uses a helper function created with BigQuery's JavaScript UDF functionality.
Unscalable Solution
I had exactly this same problem with BigQuery, using tick data organized much like yours. You can do an inequality join in BigQuery, but it will not scale for even a few months' worth of US equities data covering a few hundred names or more. In other words, you might think you could try:
-- Example level 1 quotes table.
WITH l1_quotes AS (
  SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.22 as bid, 16.25 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.21 as bid, 16.27 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.030') as time, 'XYZ' as sym, 16.20 as bid, 16.28 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'XYZ' as sym, 16.22 as bid, 16.26 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.048') as time, 'XYZ' as sym, 16.23 as bid, 16.28 as ask
),
-- Example trades table.
trades AS (
  SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'MMM' as sym, 16.23 as price, 75 as quantity
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.24 as price, 50 as quantity
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.25 as price, 100 as quantity
),
-- Get all level 1 quotes <= each trade for each day using an inequality join.
inequality_join as (
  SELECT
    t.time as time,
    t.sym,
    t.price,
    t.quantity,
    l.time as l1_time,
    l.sym as l1_sym,
    l.bid,
    l.ask,
    l.ask - l.bid as spread
  FROM l1_quotes as l
  JOIN trades as t
    ON EXTRACT(DATE FROM l.time) = EXTRACT(DATE FROM t.time)
   AND l.time <= t.time
)
-- Use DISTINCT and LAST_VALUE to get the latest l1 quote asof each trade.
SELECT DISTINCT
  time, sym, price, quantity,
  LAST_VALUE(bid) OVER (latest_asof_time) as bid,
  LAST_VALUE(ask) OVER (latest_asof_time) as ask,
  LAST_VALUE(spread) OVER (latest_asof_time) as spread
FROM inequality_join
WINDOW latest_asof_time AS (
  PARTITION BY time, sym, l1_sym
  ORDER BY l1_time
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
This will solve your example problem, but it will not scale in practice. Notice in the above that the join hinges on having one equality join condition by date and one inequality join condition, l1_quotes.time <= trades.time. The inequality portion will ultimately generate all l1_quotes entries whose timestamp is at or before the timestamp of each trade entry. Although this is not quite a full CROSS JOIN, and although it gets partitioned by date thanks to the equality join piece, it becomes a very large join in practice. For the US equity database I was working with, BigQuery was unable to do the ASOF JOIN using this technique even after letting it run for many hours. You can play around with shortening the time window of the equality condition, e.g. EXTRACT an hour, minute, etc., but the smaller you make that window, the more you run the risk of not getting a prior level 1 quote for some entries, particularly for illiquid symbols.
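For illustration, a hedged sketch of what a tighter, hour-sized bucket would look like in the inequality_join CTE above (TIMESTAMP_TRUNC is standard BigQuery; the missed-quote risk just described still applies):

  FROM l1_quotes as l
  JOIN trades as t
    ON TIMESTAMP_TRUNC(l.time, HOUR) = TIMESTAMP_TRUNC(t.time, HOUR)  -- smaller join partitions than whole days
   AND l.time <= t.time  -- but quotes from earlier hours are no longer candidates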
Scalable Solution
So how do you ultimately solve this in a scalable way in BigQuery? Well, one idiomatic approach in BigQuery is to leverage one or more helper JavaScript UDFs when Standard SQL cannot do what you need.
Here is a JavaScript UDF that I wrote to solve this problem. It will generate an array of tuples of X,Y timestamps representing X ASOF Y.
CREATE TEMPORARY FUNCTION asof_join(x ARRAY<TIMESTAMP>, y ARRAY<TIMESTAMP>)
RETURNS ARRAY<STRUCT<x TIMESTAMP, y TIMESTAMP>> LANGUAGE js AS """
  function js_timestamp_sort(a, b) { return a - b }
  x.sort(js_timestamp_sort);
  y.sort(js_timestamp_sort);
  var epoch = new Date(1970, 0, 1);
  var results = [];
  var j = 0;
  for (var i = 0; i < y.length; i++) {
    var curr_res = {x: epoch, y: epoch};
    for (; j < x.length; j++) {
      if (x[j] <= y[i]) {
        curr_res.x = x[j];
        curr_res.y = y[i];
      } else {
        break;
      }
    }
    if (curr_res.x !== epoch) {
      results.push(curr_res);
      j--;
    }
  }
  return results;
""";
The idea is to pull the l1_quotes timestamps (X) and the trades timestamps (Y) into two separate arrays. Then we use the above function to generate the array of X,Y ASOF tuples. We then UNNEST this array of tuples to convert them back to an SQL table, and finally we join them back with the l1_quotes and then with the trades. Here is the query that achieves this; in your query editor, paste in the JavaScript UDF above and this query following it:
-- Example level 1 quotes table.
WITH l1_quotes AS (
  SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.22 as bid, 16.25 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'XYZ' as sym, 16.21 as bid, 16.27 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.030') as time, 'XYZ' as sym, 16.20 as bid, 16.28 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'XYZ' as sym, 16.22 as bid, 16.26 as ask
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.048') as time, 'XYZ' as sym, 16.23 as bid, 16.28 as ask
),
-- Example trades table.
trades AS (
  SELECT TIMESTAMP('2020-01-09T14:30:00.023') as time, 'MMM' as sym, 16.23 as price, 75 as quantity
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.24 as price, 50 as quantity
  UNION ALL SELECT TIMESTAMP('2020-01-09T14:30:00.041') as time, 'MMM' as sym, 16.25 as price, 100 as quantity
),
-- Extract distinct l1 quote times (DISTINCT reduces the size of the array for each day,
-- since the UDF script doesn't need duplicates to work).
distinct_l1_times as (
  SELECT DISTINCT time
  FROM l1_quotes
),
arrayed_l1_times AS (
  SELECT
    ARRAY_AGG(time) as times,
    EXTRACT(DATE FROM time) as curr_day
  FROM distinct_l1_times
  GROUP BY curr_day
),
-- Do the same for trade times.
distinct_trade_times AS (
  SELECT DISTINCT time
  FROM trades
),
arrayed_trade_times AS (
  SELECT
    ARRAY_AGG(time) as times,
    EXTRACT(DATE FROM time) as curr_day
  FROM distinct_trade_times
  GROUP BY curr_day
),
-- Use the handy asof_join JavaScript UDF created above.
asof_l1_trade_time_tuples AS (
  SELECT
    asof_join(arrayed_l1_times.times, arrayed_trade_times.times) as asof_tuples,
    arrayed_l1_times.curr_day as curr_day
  FROM arrayed_l1_times
  JOIN arrayed_trade_times
  USING (curr_day)
),
-- UNNEST the array of l1 quote/trade time asof tuples. Date grouping is no longer needed and is dropped here.
unnested_times AS (
  SELECT
    a.x as l1_time,
    a.y as trade_time
  FROM asof_l1_trade_time_tuples, UNNEST(asof_l1_trade_time_tuples.asof_tuples) as a
)
-- Join back the l1 quote and trade information for the final result. As before, use DISTINCT and LAST_VALUE to
-- eliminate duplicates that arise from repeated timestamps.
SELECT DISTINCT
  u.trade_time as time,
  t.sym as sym,
  t.price as price,
  t.quantity as quantity,
  LAST_VALUE(l.bid) OVER (latest_asof_time) as bid,
  LAST_VALUE(l.ask) OVER (latest_asof_time) as ask,
  LAST_VALUE(l.ask - l.bid) OVER (latest_asof_time) as spread
FROM unnested_times as u
JOIN l1_quotes as l
  ON u.l1_time = l.time
JOIN trades as t
  ON u.trade_time = t.time
WINDOW latest_asof_time AS (
  PARTITION BY t.time, t.sym, l.sym
  ORDER BY l.time
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
The above query using the JavaScript UDF runs in a few minutes on my database as opposed to never finishing after many hours using the unscalable approach. You can even store the asof_join JavaScript UDF as a permanent UDF in your dataset for use in other contexts.
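For reference, persisting it is mostly a matter of swapping CREATE TEMPORARY FUNCTION for a qualified CREATE OR REPLACE FUNCTION (project and dataset names below are placeholders):

CREATE OR REPLACE FUNCTION `my_project.my_dataset.asof_join`(x ARRAY<TIMESTAMP>, y ARRAY<TIMESTAMP>)
RETURNS ARRAY<STRUCT<x TIMESTAMP, y TIMESTAMP>> LANGUAGE js AS """
  // same JavaScript body as in the temporary function above
""";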
I agree that Google should consider implementing an ASOF JOIN, particularly since time-series analysis using distributed, columnar OLAP engines like BigQuery is becoming much more prevalent. While it's achievable in BigQuery with the approach described here, it would be a lot simpler with a single SELECT statement and a built-in ASOF JOIN construct.
You can do this easily with Kinetica:
SELECT timestamp, t.sym, price, quantity, ask, bid, (ask - bid) AS spread
FROM trades t
LEFT JOIN quotes q
ON t.sym = q.sym AND
ASOF(t.time, q.time, INTERVAL '0' MILLISECONDS, INTERVAL '10' MILLISECONDS, MIN)
I'm trying to create a SQL view that gives me the expected amount to be received by calendar day for recurring transactions. I have a table containing recurring commitments data, with the following columns:
id,
start_date,
end_date (null if still active),
payment day (1,2,3,etc.),
frequency (monthly, quarterly, semi-annually, annually),
commitment amount
For now, I do not need to worry about business days vs calendar days.
In its simplest form, the end result would contain every historical calendar day as well as future dates for the next year, and produce how much was/is expected to be received in those particular days.
I've done quite a bit of researching, but cannot seem to find an answer that addresses the specific problem. Any direction on where to start would be greatly appreciated.
The expected output would look something like this:
| Date | Expected Amount |
|1/1/18 | 100 |
|1/2/18 | 200 |
|1/3/18 | 150 |
Thank you ahead of time!
Link to data table in db-fiddle
Expected Output Spreadsheet
It's something like this, but I've never used Netezza
SELECT
    cal.d, sum(r.amount) as expected_amount
FROM
(
    SELECT MIN(a.start_date) OVER () + ROW_NUMBER() OVER (ORDER BY NULL) as d
    FROM recurring a, recurring b, recurring c
) cal
LEFT JOIN
    recurring r
ON
(
    (r.frequency = 'monthly' AND r.payment_day = DATE_PART('DAY', cal.d)) OR
    (r.frequency = 'annually' AND DATE_PART('MONTH', cal.d) = DATE_PART('MONTH', r.start_date) AND r.payment_day = DATE_PART('DAY', cal.d))
) AND
r.start_date <= cal.d AND
(r.end_date >= cal.d OR r.end_date IS NULL)
GROUP BY cal.d
In essence, we cartesian join our recurring table to itself a few times to generate a load of rows, number them, and add the number onto the min date to get an incrementing date series.
The payments data table is left joined onto this incrementing date series on:
(the day of the date from the series) = (the payment day) for monthlies
(the month and day of the date from the series) = (the month of the start_date and the payment day) for annuals
Finally, the whole lot is grouped and summed.
I don't have a test instance of Netezza so if you encounter some minor syntax errors, do please have a stab at fixing them up yourself (to make it faster for you to get a solution). If you reach a point where you can't work out what the query is doing, let me know
Disclaimer: I'm no expert on Netezza, so I decided to write you a standard SQL that may need some tweaking to run on Netezza.
with
digit as (select 0 as x union select 1 union select 2 union select 3 union select 4
union select 5 union select 6 union select 7 union select 8 union select 9
),
number as ( -- produces numbers from 0 to 9999 (28 years)
select d1.x + d2.x * 10 + d3.x * 100 + d4.x * 1000 as n
from digit d1
cross join digit d2
cross join digit d3
cross join digit d4
),
expected_payment as ( -- expands all expected payments
select
c.start_date + nb.n as day,
c.committed_amount
from recurring_commitment c
cross join number nb
where (c.end_date is null or c.start_date + nb.n <= c.end_date)
and c.frequency ... -- add logic for monthly, quarterly, etc. here
)
select
day,
sum(committed_amount) as expected_amount
from expected_payment
group by day
order by day
This solution is valid for commitments that do not exceed 28 years, since the number CTE (Common Table Expression) produces a maximum of 9999 days. Expand with a fifth digit if you need longer commitments.
Note: I think the way I'm adding days to a date may not be correct in Netezza's SQL. The expression c.start_date + nb.n may need to be rephrased.
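In case it helps, here is one hedged way the elided frequency filter might be completed, assuming the column names from the question (payment_day, frequency) and that EXTRACT and MOD behave as in standard SQL:

and extract(day from c.start_date + nb.n) = c.payment_day
and mod( (extract(year  from c.start_date + nb.n) - extract(year  from c.start_date)) * 12
       + (extract(month from c.start_date + nb.n) - extract(month from c.start_date)),
         case c.frequency when 'monthly'       then 1
                          when 'quarterly'     then 3
                          when 'semi-annually' then 6
                          else                      12 end ) = 0

The first condition pins each expanded day to the payment day; the second keeps only days whose month offset from start_date is a whole multiple of the frequency interval.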
I have records with two dates, check_in and check_out, and I want to know the ranges when more than one person was checked in at the same time.
So if I have the following checkin / checkouts:
Person A: 1PM - 6PM
Person B: 3PM - 10PM
Person C: 9PM - 11PM
I would want to get 3PM - 6PM (Overlap of person A and B) and 9PM - 10PM (overlap of person B and C).
I can write an algorithm to do this in linear time with code, is it possible to do this via a relational query in linear time with PostgreSQL as well?
It needs to be a minimal response, meaning no overlapping ranges: a result that gave the ranges 6PM - 9PM and 8PM - 10PM would be incorrect; it should instead return 6PM - 10PM.
Assumptions
The solution heavily depends on the exact table definition including all constraints. For lack of information in the question I'll assume this table:
CREATE TABLE booking (
booking_id serial PRIMARY KEY
, check_in timestamptz NOT NULL
, check_out timestamptz NOT NULL
, CONSTRAINT valid_range CHECK (check_out > check_in)
);
So, no NULL values, only valid ranges with inclusive lower and exclusive upper bound, and we don't really care who checks in.
Also assuming a current version of Postgres, at least 9.2.
Query
One way to do it with only SQL using a UNION ALL and window functions:
SELECT ts AS check_in, next_ts AS check_out
FROM (
SELECT *, lead(ts) OVER (ORDER BY ts) AS next_ts
FROM (
SELECT *, lag(people_ct, 1 , 0) OVER (ORDER BY ts) AS prev_ct
FROM (
SELECT ts, sum(sum(change)) OVER (ORDER BY ts)::int AS people_ct
FROM (
SELECT check_in AS ts, 1 AS change FROM booking
UNION ALL
SELECT check_out, -1 FROM booking
) sub1
GROUP BY 1
) sub2
) sub3
WHERE people_ct > 1 AND prev_ct < 2 OR -- start overlap
people_ct < 2 AND prev_ct > 1 -- end overlap
) sub4
WHERE people_ct > 1 AND prev_ct < 2;
SQL Fiddle.
Explanation
In subquery sub1 derive a table of check_in and check_out in one column. check_in adds one to the crowd, check_out subtracts one.
In sub2 sum all events for the same point in time and compute a running count with a window function: that's the window function sum() over an aggregate sum() - and cast to integer or we get numeric from this:
sum(sum(change)) OVER (ORDER BY ts)::int
In sub3 look at the count of the previous row
In sub4 only keep rows where overlapping time ranges start or end, and pull the end of the time range into the same row with lead().
Finally, only keep rows where time ranges start.
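To make this concrete, loading the question's sample stays into the booking table above (sketch, using one arbitrary day) yields exactly the two expected windows:

INSERT INTO booking (check_in, check_out) VALUES
  ('2014-01-01 13:00', '2014-01-01 18:00')  -- Person A: 1PM - 6PM
, ('2014-01-01 15:00', '2014-01-01 22:00')  -- Person B: 3PM - 10PM
, ('2014-01-01 21:00', '2014-01-01 23:00'); -- Person C: 9PM - 11PM

-- query result:
-- check_in            | check_out
-- 2014-01-01 15:00    | 2014-01-01 18:00     (3PM - 6PM)
-- 2014-01-01 21:00    | 2014-01-01 22:00     (9PM - 10PM)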
To optimize performance I would walk through the table once in a plpgsql function like demonstrated in this related answer on dba.SE:
Calculate Difference in Overlapping Time in PostgreSQL / SSRS
The idea is to divide time into periods and save them as bit values with a specified granularity.
0 - nobody is checked in during that grain
1 - somebody is checked in during that grain
Let's assume that the granularity is 1 hour and the period is 1 day.
000000000000000000000000 means nobody is checked in that day
000000000000000000000110 means somebody is checked in between 21 and 23
000000000000011111000000 means somebody is checked in between 13 and 18
000000000000000111111100 means somebody is checked in between 15 and 22
After that we do a binary OR on each value in the range, and we have our answer:
000000000000011111111110
It can be done in linear time. Here is an example in Oracle, but it can be transformed to PostgreSQL easily.
-- note: bin2dec/dec2bin are assumed helper functions, not Oracle built-ins
with rec (checkin, checkout) as (
    select 13, 18 from dual
    union all
    select 15, 22 from dual
    union all
    select 21, 23 from dual
),
spanempty (empt) as (
    select '000000000000000000000000' from dual
),
spanfull (fullspan) as (
    select '111111111111111111111111' from dual
),
bookingbin (binbook) as (
    select substr(empt, 1, checkin) ||
           substr(fullspan, checkin, checkout - checkin) ||
           substr(empt, checkout, 24 - checkout)
    from rec
    cross join spanempty
    cross join spanfull
),
bookingInt (rn, intbook) as (
    select rownum, bin2dec(binbook) from bookingbin
),
bitAndSum (bitAndSumm) as (
    select sum(bitand(b1.intbook, b2.intbook))
    from bookingInt b1
    join bookingInt b2
      on b1.rn = b2.rn - 1
),
SumAll (sumall) as (
    select sum(bin2dec(binbook)) from bookingBin
)
select lpad(dec2bin(sumall - bitAndSumm), 24, '0')
from SumAll, bitAndSum
Result:
000000000000011111111110
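As a rough illustration of that claim, here is a minimal PostgreSQL sketch of the same bitmask idea, using the built-in bit(24) type and the bit_or aggregate instead of the bin2dec/dec2bin helpers:

SELECT bit_or( (repeat('0', checkin)
             || repeat('1', checkout - checkin)
             || repeat('0', 24 - checkout))::bit(24) ) AS occupancy
FROM (VALUES (13, 18), (15, 22), (21, 23)) AS rec(checkin, checkout);
-- returns 000000000000011111111110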
If I had a large table (100,000+ entries) which had service records or perhaps admission records, how would I find all the instances of re-occurrence within a set number of days?
The table setup could be something like this, likely with more columns:
Record ID Customer ID Start Date Time Finish Date Time
1 123456 24/04/2010 16:49 25/04/2010 13:37
3 654321 02/05/2010 12:45 03/05/2010 18:48
4 764352 24/03/2010 21:36 29/03/2010 14:24
9 123456 28/04/2010 13:49 30/04/2010 09:45
10 836472 19/03/2010 19:05 20/03/2010 14:48
11 123456 05/05/2010 11:26 06/05/2010 16:23
What I am trying to do is work out a way to select the records where there is a re-occurrence of the field [Customer ID] within a certain time period (< X days, where the time period is the Start Date Time of the 2nd occurrence minus the Finish Date Time of the first occurrence).
This is what I would like it to look like once it was run for, say, x=7:
Record ID Customer ID Start Date Time Finish Date Time Re-occurrence
9 123456 28/04/2010 13:49 30/04/2010 09:45 1
11 123456 05/05/2010 11:26 06/05/2010 16:23 2
I can solve this problem with a smaller set of records in Excel but have struggled to come up with a SQL solution in MS Access. I do have some SQL queries that I have tried but I am not sure I am on the right track.
Any advice would be appreciated.
I think this is a clear expression of what you want. It's not extremely high performance, but I'm not sure that you can avoid either a correlated sub-query or a cartesian JOIN of the table to itself to solve this problem. It is standard SQL and should work in most any engine, although the details of the date math may differ:
SELECT * FROM YourTable YT1 WHERE EXISTS
    (SELECT * FROM YourTable YT2 WHERE
     YT2.CustomerID = YT1.CustomerID AND YT2.RecordID <> YT1.RecordID
     AND YT2.FinishTime <= YT1.StartTime AND YT1.StartTime <= YT2.FinishTime + 7)
In order to accomplish this you would need to make a self join as you are comparing the entire table to itself. Assuming similar names it would look something like this:
select r1.customer_id, min(r2.start_time), max(r1.finish_time), count(1) as reoccurrences
from records r1,
     records r2
where r1.record_id > r2.record_id -- this ensures you don't double count the records
  and r1.customer_id = r2.customer_id
  and r1.start_time - r2.finish_time <= 7
group by r1.customer_id
You wouldn't be able to easily get both the record_id and the number of occurrences, but you could go back and find them by correlating the start time to the record number with that customer_id and start_time, as sketched below.
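For example, a hedged sketch of that back-correlation (assuming start_time is unique per customer):

select g.customer_id, r.record_id, g.first_start, g.last_finish, g.reoccurrences
from (
    select r1.customer_id,
           min(r2.start_time)  as first_start,
           max(r1.finish_time) as last_finish,
           count(1)            as reoccurrences
    from records r1, records r2
    where r1.record_id > r2.record_id
      and r1.customer_id = r2.customer_id
      and r1.start_time - r2.finish_time <= 7
    group by r1.customer_id
) g
join records r
  on r.customer_id = g.customer_id
 and r.start_time  = g.first_start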
This will do it:
declare @t table(Record_ID int, Customer_ID int, StartDateTime datetime, FinishDateTime datetime)
insert @t values(1 ,123456,'2010-04-24 16:49','2010-04-25 13:37')
insert @t values(3 ,654321,'2010-05-02 12:45','2010-05-03 18:48')
insert @t values(4 ,764352,'2010-03-24 21:36','2010-03-29 14:24')
insert @t values(9 ,123456,'2010-04-28 13:49','2010-04-30 09:45')
insert @t values(10,836472,'2010-03-19 19:05','2010-03-20 14:48')
insert @t values(11,123456,'2010-05-05 11:26','2010-05-06 16:23')
declare @days int
set @days = 7
;with a as (
    select record_id, customer_id, startdatetime, finishdatetime,
           rn = row_number() over (partition by customer_id order by startdatetime asc)
    from @t),
b as (
    select record_id, customer_id, startdatetime, finishdatetime, rn, 0 recurrence
    from a
    where rn = 1
    union all
    select a.record_id, a.customer_id, a.startdatetime, a.finishdatetime,
           a.rn, case when a.startdatetime - @days < b.finishdatetime then recurrence + 1 else 0 end
    from b join a
      on b.rn = a.rn - 1 and b.customer_id = a.customer_id
)
select record_id, customer_id, startdatetime, recurrence from b
where recurrence > 0
Result:
https://data.stackexchange.com/stackoverflow/q/112808/
I just realized it should be done in Access. I am so sorry; this was written for SQL Server 2005. I don't know how to rewrite it for Access.
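For anyone who needs it in Access, here is a hedged sketch of a simpler variant using a correlated subquery (it flags a record as a re-occurrence by counting earlier records for the same customer whose finish falls within 7 days before the record's start; it does not reproduce the exact chain numbering above):

SELECT t1.[Record ID], t1.[Customer ID], t1.[Start Date Time], t1.[Finish Date Time],
       (SELECT COUNT(*)
        FROM YourTable AS t2
        WHERE t2.[Customer ID] = t1.[Customer ID]
          AND t2.[Finish Date Time] <= t1.[Start Date Time]
          AND DateDiff("d", t2.[Finish Date Time], t1.[Start Date Time]) < 7) AS Reoccurrence
FROM YourTable AS t1
WHERE (SELECT COUNT(*)
       FROM YourTable AS t2
       WHERE t2.[Customer ID] = t1.[Customer ID]
         AND t2.[Finish Date Time] <= t1.[Start Date Time]
         AND DateDiff("d", t2.[Finish Date Time], t1.[Start Date Time]) < 7) > 0;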
I have a database table that contains collection data for product collected from a supplier and I need to produce an estimate of month-to-date production figures for that supplier using an Oracle SQL query. Each day can have multiple collections, and each collection can contain product produced across multiple days.
Here's an example of the raw collection data:
Date Volume CollectionNumber ProductionDays
2011-08-22 500 1 2
2011-08-22 200 2 2
2011-08-20 600 1 2
Creating a month-to-date estimate is tricky because the first day of the month may have a collection for two days worth of production. Only a portion of that collected volume is actually attributable to the current month.
How can I write a query to produce this estimate?
My gut feeling is that I should be able to create a database view that transforms the raw data into estimated daily production figures by summing collections on the same day and distributing collection volumes across the number of days they were produced on. This would allow me to write a simple query to find the month-to-date production figure.
Here's what the above collection data would look like after being transformed into estimated daily production figures:
Date VolumeEstimate
2011-08-22 350
2011-08-21 350
2011-08-20 300
2011-08-19 300
Am I on the right track? If so, how can this be implemented? I have absolutely no idea how to do this type of transformation in SQL. If not, what is a better approach?
Note: I cannot do this calculation in application code since that would require a significant code change which we can't afford.
Try:
CREATE TABLE TableA (ProdDate DATE, Volume NUMBER, CollectionNumber NUMBER, ProductionDays NUMBER);
INSERT INTO TableA VALUES (TO_DATE ('20110822', 'YYYYMMDD'), 500, 1, 2);
INSERT INTO TableA VALUES (TO_DATE ('20110822', 'YYYYMMDD'), 200, 2, 2);
INSERT INTO TableA VALUES (TO_DATE ('20110820', 'YYYYMMDD'), 600, 1, 2);
COMMIT;
CREATE VIEW DailyProdVolEst AS
SELECT DateList.TheDate, SUM (DateRangeSums.DailySum) VolumeEstimate FROM
(
SELECT ProdStart, ProdEnd, SUM (DailyProduction) DailySum
FROM
(
SELECT (ProdDate - ProductionDays + 1) ProdStart, ProdDate ProdEnd, CollectionNumber, VolumeSum/ProductionDays DailyProduction
FROM
(
Select ProdDate, CollectionNumber, ProductionDays, Sum (Volume) VolumeSum FROM TableA
GROUP BY ProdDate, CollectionNumber, ProductionDays
)
)
GROUP BY ProdStart, ProdEnd
) DateRangeSums,
(
SELECT A.MinD + MyList.L TheDate FROM
(SELECT MIN (ProdDate - ProductionDays + 1) MinD FROM TableA) A,
(SELECT LEVEL - 1 L FROM DUAL CONNECT BY LEVEL <= (SELECT Max (ProdDate) - MIN (ProdDate - ProductionDays + 1) + 1 FROM TableA)) MyList
) DateList
WHERE DateList.TheDate BETWEEN DateRangeSums.ProdStart AND DateRangeSums.ProdEnd
GROUP BY DateList.TheDate;
The view DailyProdVolEst dynamically gives you the result you described, though some "constraints" apply:
the combination of ProdDate and CollectionNumber should be unique.
the ProductionDays need to be > 0 for all rows
EDIT - as requested in a comment:
How this query works:
It finds out what the smallest and biggest dates in the table are, then builds rows with each row being a date in that range (DateList)... this is matched up against a list of rows containing the daily sum for unique combinations of ProdDate Start/End (DateRangeSums), and it sums up on the date level.
What do SUM (DateRangeSums.DailySum) and SUM (DailyProduction) do ?
Both sum things up - SUM (DateRangeSums.DailySum) sums up in cases of partially overlapping date ranges, and SUM (DailyProduction) sums up within one date range if there is more than one CollectionNumber. Without SUM the GROUP BY wouldn't be needed.
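With the view in place, the month-to-date figure from the question reduces to a simple query (sketch; TRUNC(SYSDATE, 'MM') is the first day of the current month):

SELECT SUM(VolumeEstimate) AS MTDVolume
FROM DailyProdVolEst
WHERE TheDate >= TRUNC(SYSDATE, 'MM')
  AND TheDate <= TRUNC(SYSDATE);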
I think a UNION query will do the trick for you. You aren't using the CollectionNumber field in your example, so I excluded it from the sample below.
Something similar to the query below should work (disclaimer: my Oracle DB isn't accessible to me at the moment):
SELECT Date, SUM(Volume) VolumeEstimate
FROM
   (SELECT Date, SUM(Volume / ProductionDays) Volume
    FROM [Table]
    GROUP BY Date
    UNION ALL
    SELECT (Date - 1) Date, SUM(Volume / 2)
    FROM [Table]
    WHERE ProductionDays = 2
    GROUP BY Date - 1)
GROUP BY Date
It sounds like what you want to do is sum up by day and then use a tally table to divide out the results.
Here's a runnable example with your data in T-SQL dialect:
DECLARE @tbl AS TABLE (
[Date] DATE
, Volume INT
, ColectionNumber INT
, ProductionDays INT);
INSERT INTO @tbl
VALUES ('2011-08-22', 500, 1, 2)
, ('2011-08-22', 200, 2, 2)
, ('2011-08-20', 600, 1, 2);
WITH Numbers AS (SELECT 1 AS N UNION ALL SELECT 2 AS N)
,AssignedVolumes AS (
SELECT t.*
, t.Volume / t.ProductionDays AS PerDay
, DATEADD(d, 1 - n.N, t.[Date]) AS AssignedDate
FROM @tbl AS t
INNER JOIN Numbers AS n
ON n.N <= t.ProductionDays
)
SELECT AssignedDate
, SUM(PerDay)
FROM AssignedVolumes
GROUP BY AssignedDate;
I dummied up a simple numbers table with only 1 and 2 in it to perform the pivot. Typically you'll have a table with a million numbers in sequence.
For Oracle, the only thing you should need to change would be the DATEADD.
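If you don't have a numbers table handy, one hedged way to generate a large sequence on the fly in T-SQL (sys.all_objects is just a convenient row source; any sufficiently large table works):

WITH Numbers AS (
    SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
SELECT N FROM Numbers;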