Oracle SQL Join Data Sequentially

I am trying to track material usage with my SQL. There is no way in our database to link a part's usage back to the order it originally came from. A part simply ends up in a bin after an order arrives, and usage of parts just creates a record for the number of parts used at the time of the transaction. I am attempting, as best I can, to link usage to an order number by summing over the data and sequentially assigning it to order numbers.
My subqueries have gotten me this far. Each order number is received on a date. I then join the usage table records based on the USEDATE needing to be equal to or greater than the RECEIVEDATE of the order. The data produced by this looks like this:
| ORDERNUM | PARTNUM | RECEIVEDATE | ORDERQTY | USEQTY | USEDATE |
|----------|----------|-------------------------|-----------|---------|------------------------|
| 4412 | E1125 | 10/26/2016 1:32:25 PM | 1 | 1 | 11/18/2016 1:40:55 PM |
| 4412 | E1125 | 10/26/2016 1:32:25 PM | 1 | 3 | 12/26/2016 2:19:32 PM |
| 4412 | E1125 | 10/26/2016 1:32:25 PM | 1 | 1 | 1/3/2017 8:31:21 AM |
| 4111 | E1125 | 10/28/2016 2:54:13 PM | 1 | 1 | 11/18/2016 1:40:55 PM |
| 4111 | E1125 | 10/28/2016 2:54:13 PM | 1 | 3 | 12/26/2016 2:19:32 PM |
| 4111 | E1125 | 10/28/2016 2:54:13 PM | 1 | 1 | 1/3/2017 8:31:21 AM |
| 0393 | E1125 | 12/22/2016 11:52:04 AM | 3 | 3 | 12/26/2016 2:19:32 PM |
| 0393 | E1125 | 12/22/2016 11:52:04 AM | 3 | 1 | 1/3/2017 8:31:21 AM |
| 7812 | E1125 | 12/27/2016 10:56:01 AM | 1 | 1 | 1/3/2017 8:31:21 AM |
| 1191 | E1125 | 1/5/2017 1:12:01 PM | 2 | 0 | null |
The query for the above section looks as such:
SELECT
    B.*,
    NVL(B2.QTY, 0) AS USEQTY,
    B2.USEDATE AS USEDATE
FROM <<Sub Query B>> B
LEFT JOIN USETABLE B2
       ON B.PARTNUM = B2.PARTNUM
      AND B2.USEDATE >= B.RECEIVEDATE
My ultimate goal here is to join USEQTY records sequentially until they have filled enough ORDERQTYs. I also need to add an ORDERUSE column that represents what QTY from the USEQTY column was actually applied to that record. Not really sure how to word this any better, so here is an example of what I need to happen based on the table above:
| ORDERNUM | PARTNUM | RECEIVEDATE | ORDERQTY | USEQTY | USEDATE | ORDERUSE |
|----------|----------|-------------------------|-----------|---------|------------------------|-----------|
| 4412 | E1125 | 10/26/2016 1:32:25 PM | 1 | 1 | 11/18/2016 1:40:55 PM | 1 |
| 4111 | E1125 | 10/28/2016 2:54:13 PM | 1 | 3 | 12/26/2016 2:19:32 PM | 1 |
| 0393 | E1125 | 12/22/2016 11:52:04 AM | 3 | 2 | 12/26/2016 2:19:32 PM | 2 |
| 0393 | E1125 | 12/22/2016 11:52:04 AM | 3 | 1 | 1/3/2017 8:31:21 AM | 1 |
| 7812 | E1125 | 12/27/2016 10:56:01 AM | 1 | 0 | null | 0 |
| 1191 | E1125 | 1/5/2017 1:12:01 PM | 2 | 0 | null | 0 |
If I can get the query to pull the information like above, I will then be able to group the records together and sum the ORDERUSE column, which would tell me which orders have been fully used and which have not. So in the example above, if I were to sum the ORDERUSE column for each of the ORDERNUMs, orders 4412, 4111 and 0393 would all show full usage, while orders 7812 and 1191 would show as not fully used.
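For reference, the grouping step I have in mind afterwards would be something like this sketch (MATCHED is just a placeholder name for the result pictured above):
-- Sketch only: MATCHED is a placeholder for the result pictured above.
SELECT ORDERNUM,
       MAX(ORDERQTY) AS ORDERQTY,
       SUM(ORDERUSE) AS TOTAL_USED,
       CASE WHEN SUM(ORDERUSE) >= MAX(ORDERQTY)
            THEN 'FULLY USED' ELSE 'NOT FULLY USED'
       END AS USAGE_STATUS
FROM MATCHED
GROUP BY ORDERNUM;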

If I am reading this correctly, you want to determine how many parts have been used. In your example it looks like you have 5 usages, and 5 orders coming to a total of 8 parts, with the following orders having been used:
4412 - one part - one used
4111 - one part - one used
7812 - one part - one used
0393 - three parts - two used
After a bit of hacking away I came up with the following SQL. Not sure if this works outside of your sample data, since that's the only thing I used to test and I am no expert.
WITH data AS (
    -- expand each order into one row per unit (row generator capped at 15),
    -- order the rows by receive date, and keep only as many rows as the
    -- total used quantity, so the oldest orders are consumed first
    SELECT *
    FROM (SELECT *
          FROM sub_b1
          JOIN (SELECT ROWNUM rn
                FROM dual
                CONNECT BY LEVEL < 15) a
            ON a.rn <= sub_b1.orderqty
          ORDER BY receivedate)
    WHERE ROWNUM <= (SELECT SUM(useqty)
                     FROM sub_b2)
)
SELECT sub_b1.ordernum,
       partnum,
       receivedate,
       orderqty,
       usage
FROM sub_b1
JOIN (SELECT ordernum,
             MAX(rn) AS usage  -- highest consumed unit number = units used
      FROM data
      GROUP BY ordernum) b
  ON sub_b1.ordernum = b.ordernum

You are looking for "FIFO" inventory accounting.
The proper data model should have two tables, one for "received" parts and the other for "delivered" or "used" parts. Each table should show an order number, a part number and quantity (received or used) for that order, and a timestamp or date-time. I model both as CTEs in my query below, but in your business they should be two separate tables. Also, a trigger or similar should enforce the constraint that a part cannot be used until it is available in stock (that is: for each part id, the total quantity used since inception, at any point in time, should not exceed the total quantity received since inception, also at the same point in time). I assume that the two input tables do in fact satisfy this condition, and I don't check it in the solution.
The output shows a timeline of quantity used, by timestamp, matching "received" and "delivered" (used) quantities for each part_id. In the sample data I illustrate a single part_id, but the query will work with multiple part_ids, and orders (both for received and for delivered/used) that include multiple parts with different quantities.
with
received ( order_id, part_id, ts, qty ) as (
select '0030', '11A4', timestamp '2015-03-18 15:00:33', 20 from dual union all
select '0032', '11A4', timestamp '2015-03-22 15:00:33', 13 from dual union all
select '0034', '11A4', timestamp '2015-03-24 10:00:33', 18 from dual union all
select '0036', '11A4', timestamp '2015-04-01 15:00:33', 25 from dual
),
delivered ( order_id, part_id, ts, qty ) as (
select '1200', '11A4', timestamp '2015-03-18 16:30:00', 14 from dual union all
select '1210', '11A4', timestamp '2015-03-23 10:30:00', 8 from dual union all
select '1220', '11A4', timestamp '2015-03-23 11:30:00', 7 from dual union all
select '1230', '11A4', timestamp '2015-03-23 11:30:00', 4 from dual union all
select '1240', '11A4', timestamp '2015-03-26 15:00:33', 1 from dual union all
select '1250', '11A4', timestamp '2015-03-26 16:45:11', 3 from dual union all
select '1260', '11A4', timestamp '2015-03-27 10:00:33', 2 from dual union all
select '1270', '11A4', timestamp '2015-04-03 15:00:33', 16 from dual
),
-- End of test data; the actual query begins below. If you drop the test
-- CTEs and use your real tables, just add the word WITH back at the top:
-- with
combined ( part_id, rec_ord, rec_ts, rec_sum, del_ord, del_ts, del_sum ) as (
  select part_id, order_id, ts,
         sum(qty) over (partition by part_id order by ts, order_id),
         null, cast(null as date), cast(null as number)
  from   received
  union all
  select part_id, null, cast(null as date), cast(null as number),
         order_id, ts,
         sum(qty) over (partition by part_id order by ts, order_id)
  from   delivered
),
prep ( part_id, rec_ord, del_ord, del_ts, qty_sum ) as (
  select part_id, rec_ord, del_ord, del_ts, coalesce(rec_sum, del_sum)
  from   combined
)
select part_id,
       last_value(rec_ord ignore nulls) over (partition by part_id
                                              order by qty_sum desc) as rec_ord,
       last_value(del_ord ignore nulls) over (partition by part_id
                                              order by qty_sum desc) as del_ord,
       last_value(del_ts ignore nulls)  over (partition by part_id
                                              order by qty_sum desc) as used_date,
       qty_sum - lag(qty_sum, 1, 0) over (partition by part_id
                                          order by qty_sum, del_ts) as used_qty
from   prep
order  by qty_sum
;
Output:
PART_ID REC_ORD DEL_ORD USED_DATE USED_QTY
------- ------- ------- ----------------------------------- ----------
11A4 0030 1200 18-MAR-15 04.30.00.000000000 PM 14
11A4 0030 1210 23-MAR-15 10.30.00.000000000 AM 6
11A4 0032 1210 23-MAR-15 10.30.00.000000000 AM 2
11A4 0032 1220 23-MAR-15 11.30.00.000000000 AM 7
11A4 0032 1230 23-MAR-15 11.30.00.000000000 AM 4
11A4 0032 1230 23-MAR-15 11.30.00.000000000 AM 0
11A4 0034 1240 26-MAR-15 03.00.33.000000000 PM 1
11A4 0034 1250 26-MAR-15 04.45.11.000000000 PM 3
11A4 0034 1260 27-MAR-15 10.00.33.000000000 AM 2
11A4 0034 1270 03-APR-15 03.00.33.000000000 PM 12
11A4 0036 1270 03-APR-15 03.00.33.000000000 PM 4
11A4 0036 21
12 rows selected.
Notes: (1) One needs to be careful if at some moment the cumulative used quantity exactly matches the cumulative received quantity. All rows must be included in all the intermediate results, otherwise there will be bad data in the output; but this may result (as you can see in the output above) in a few rows with a "used quantity" of 0. Depending on how this output is consumed (for further processing, for reporting, etc.) these rows may be left as they are, or they may be discarded in a further outer query with the condition where used_qty > 0.
(2) The last row shows a quantity of 21 with no used_date and no del_ord. This is, in fact, the "current" quantity in stock for that part_id as of the last date in both tables - available for future use. Again, if this is not needed, it can be removed in an outer query. There may be one or more rows like this at the end of the table.
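To illustrate both notes, a sketch of such an outer query (fifo_timeline is a hypothetical view holding the result of the query above):
-- Sketch only: fifo_timeline is a hypothetical view over the query above.
select *
from   fifo_timeline
where  used_qty > 0        -- drop the zero-usage boundary rows (note 1)
and    del_ord is not null -- drop the trailing stock-on-hand row(s) (note 2)
order  by used_date;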

Special SQL window function that works like a loop

I'm looking for some kind of SQL window function that calculates values based on a calculated value from a previous iteration when looping over the window. I'm not looking for 'lag', which just takes the original value of the previous row.
Here is the case: we have web analytics sessions. We would like to attribute each session to the last relevant channel. There are 3 channels: direct, organic and paid. However, they have different priorities: paid is always relevant. Organic is only relevant if there was no paid channel in the last 30 days, and direct is only relevant if there was no paid or organic channel in the last 30 days.
So in the example table we would like to calculate the values in the 'attributed' column based on the channel and date columns. Note, the data is there for several users, so this should be calculated per user.
+─────────────+───────+──────────+─────────────+
| date | user | channel | attributed |
+─────────────+───────+──────────+─────────────+
| 2022-01-01 | 123 | direct | direct |
| 2022-01-14 | 123 | paid | paid |
| 2022-02-01 | 123 | direct | paid |
| 2022-02-12 | 123 | direct | paid |
| 2022-02-13 | 123 | organic | paid |
| 2022-03-08 | 123 | direct | direct |
| 2022-03-10 | 123 | paid | paid |
+─────────────+───────+──────────+─────────────+
So in the table above, row 1 is attributed direct because it's the first line. The second is then paid, as that has priority over direct. It stays paid for the next 2 sessions as direct has lower priority; then it switches to organic as the paid attribution is older than 30 days. The last one is then paid again, as it has a higher priority than organic.
I would know how to solve it if the decision whether a new channel needs to be attributed depended only on the current row and the previous one. I added the SQL to do that here:
with source as ( -- example data
select cast("2022-01-01" as date) as date, 123 as user, "direct" as channel
union all
select "2022-01-14", 123, "paid"
union all
select "2022-02-01", 123, "direct"
union all
select "2022-02-12", 123, "direct"
union all
select "2022-02-13", 123, "organic"
union all
select "2022-03-08", 123, "direct"
union all
select "2022-03-10", 123, "paid"
),
flag_new_channel as( -- flag sessions that would override channel information; this only works statically here
select *,
case
when lag(channel) over (partition by user order by date) is null then 1
when date_diff(date,lag(date) over (partition by user order by date),day)>30 then 1
when channel = "paid" then 1
when channel = "organic" and lag(channel) over (partition by user order by date)!='paid' then 1
else 0
end flag
from source
qualify flag=1
)
select s.*,
f.channel attributed_channel,
row_number() over (partition by s.user, s.date order by f.date desc) rn -- number of flagged previous sessions
from source s
left join flag_new_channel f
on s.date>=f.date
qualify rn=1 --only keep the last flagged session at or before the current session
However, this would for example set "organic" in row 5 because it doesn't know "paid" is still relevant.
+─────────────+───────+──────────+─────────────────────+
| date | user | channel | attributed_channel |
+─────────────+───────+──────────+─────────────────────+
| 2022-01-01 | 123 | direct | direct |
| 2022-01-14 | 123 | paid | paid |
| 2022-02-01 | 123 | direct | paid |
| 2022-02-12 | 123 | direct | paid |
| 2022-02-13 | 123 | organic | organic |
| 2022-03-08 | 123 | direct | organic |
| 2022-03-10 | 123 | paid | paid |
+─────────────+───────+──────────+─────────────────────+
Any ideas? I'm not sure whether recursive queries or UDFs can help. I'm usually using BigQuery, but if you know solutions in other dialects it would still be interesting to know.
Here's one approach:
Updated: corrected. I had lost track of your expected result due to the confusing story.
For PostgreSQL, we can do something like this (with CTEs and window functions):
The fiddle for PG 14
pri - provides a table of (channel, priority) pairs
cte0 - provides the test data
cte1 - determines the minimum priority over the last 30 days per user
final - the final query expression obtains the attributed channel name
WITH pri (channel, pri) AS (
VALUES ('paid' , 1)
, ('organic' , 2)
, ('direct' , 3)
)
, cte0 (date, xuser, channel) AS (
VALUES
('2022-01-01'::date, 123, 'direct')
, ('2022-01-14' , 123, 'paid')
, ('2022-02-01' , 123, 'direct')
, ('2022-02-12' , 123, 'direct')
, ('2022-02-13' , 123, 'organic')
, ('2022-03-08' , 123, 'direct')
, ('2022-03-10' , 123, 'paid')
)
, cte1 AS (
SELECT cte0.*
, pri.pri
, MIN(pri) OVER (PARTITION BY xuser ORDER BY date
RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW
) AS mpri
FROM cte0
JOIN pri
ON pri.channel = cte0.channel
)
SELECT cte1.*
, pri.channel AS attributed
FROM cte1
JOIN pri
ON pri.pri = cte1.mpri
;
The result:
+─────────────+────────+──────────+──────+───────+─────────────+
| date        | xuser  | channel  | pri  | mpri  | attributed  |
+─────────────+────────+──────────+──────+───────+─────────────+
| 2022-01-01  | 123    | direct   | 3    | 3     | direct      |
| 2022-01-14  | 123    | paid     | 1    | 1     | paid        |
| 2022-02-01  | 123    | direct   | 3    | 1     | paid        |
| 2022-02-12  | 123    | direct   | 3    | 1     | paid        |
| 2022-02-13  | 123    | organic  | 2    | 1     | paid        |
| 2022-03-08  | 123    | direct   | 3    | 2     | organic     |
| 2022-03-10  | 123    | paid     | 1    | 1     | paid        |
+─────────────+────────+──────────+──────+───────+─────────────+
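Since the question mentions BigQuery: the same minimum-priority idea should translate roughly as in the sketch below (untested; it assumes the source CTE from the question, and uses UNIX_DATE so that RANGE can operate on a numeric day count):
-- Untested sketch of the same idea in BigQuery; assumes the source CTE
-- from the question. UNIX_DATE turns the date into days since the epoch,
-- so RANGE BETWEEN 30 PRECEDING AND CURRENT ROW is a 30-day window.
with pri as (
  select 'paid' as channel, 1 as pri union all
  select 'organic', 2 union all
  select 'direct', 3
),
ranked as (
  select s.*,
         min(p.pri) over (
           partition by s.user
           order by unix_date(s.date)
           range between 30 preceding and current row
         ) as mpri
  from source s
  join pri p on p.channel = s.channel
)
select r.date, r.user, r.channel, p.channel as attributed
from ranked r
join pri p on p.pri = r.mpri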

Question: Joining two data sets with date conditions

I'm pretty new with SQL, and I'm struggling to figure out a seemingly simple task.
Here's the situation:
I'm working with two data sets
Data Set A, which is the most accurate but only refreshes every quarter
Data Set B, which has all the data, including the most recent, but is overall less accurate
My goal is to combine both data sets where I would have Data Set A for all data up to the most recent quarter and Data Set B for anything after (i.e., all recent data not captured in Data Set A)
For example:
Data Set A captures anything from Q1 2020 (January to March)
Let's say we are April 15th
Data Set B captures anything from Q1 2020 to the most current date, April 15th
My goal is to use Data Set A for all data from January to March 2020 (Q1) and then Data Set B for all data from April 1 to 15
Any thoughts or advice on how to do this? Potentially a join along with a date function?
Any help would be much appreciated; thanks in advance.
I hope I got your question right.
I put in some sample data that might match your description: a date and an amount. To keep it simple, one row per month. You can extract the quarter from a date, keep it as an additional column, and then filter by it down the line.
WITH
-- some sample data: date and amount ...
indata(dt,amount) AS (
SELECT DATE '2020-01-15', 234.45
UNION ALL SELECT DATE '2020-02-15', 344.45
UNION ALL SELECT DATE '2020-03-15', 345.45
UNION ALL SELECT DATE '2020-04-15', 346.45
UNION ALL SELECT DATE '2020-05-15', 347.45
UNION ALL SELECT DATE '2020-06-15', 348.45
UNION ALL SELECT DATE '2020-07-15', 349.45
UNION ALL SELECT DATE '2020-08-15', 350.45
UNION ALL SELECT DATE '2020-09-15', 351.45
UNION ALL SELECT DATE '2020-10-15', 352.45
UNION ALL SELECT DATE '2020-11-15', 353.45
UNION ALL SELECT DATE '2020-12-15', 354.45
)
-- real query starts here ...
SELECT
EXTRACT(QUARTER FROM dt) AS the_quarter
, CAST(
TIMESTAMPADD(
QUARTER
, CAST(EXTRACT(QUARTER FROM dt) AS INTEGER)-1
, TRUNC(dt,'YEAR')
)
AS DATE
) AS qtr_start
, *
FROM indata;
-- out the_quarter | qtr_start | dt | amount
-- out -------------+------------+------------+--------
-- out 1 | 2020-01-01 | 2020-01-15 | 234.45
-- out 1 | 2020-01-01 | 2020-02-15 | 344.45
-- out 1 | 2020-01-01 | 2020-03-15 | 345.45
-- out 2 | 2020-04-01 | 2020-04-15 | 346.45
-- out 2 | 2020-04-01 | 2020-05-15 | 347.45
-- out 2 | 2020-04-01 | 2020-06-15 | 348.45
-- out 3 | 2020-07-01 | 2020-07-15 | 349.45
-- out 3 | 2020-07-01 | 2020-08-15 | 350.45
-- out 3 | 2020-07-01 | 2020-09-15 | 351.45
-- out 4 | 2020-10-01 | 2020-10-15 | 352.45
-- out 4 | 2020-10-01 | 2020-11-15 | 353.45
-- out 4 | 2020-10-01 | 2020-12-15 | 354.45
If you filter by quarter, you can group your data by that column ...
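From there, stitching the two sets together as described in the question could look like the sketch below (dataset_a and dataset_b are hypothetical table names; TRUNC(CURRENT_DATE, 'Q'), as in Oracle or Vertica, gives the first day of the current quarter, which serves as the cutoff):
-- Sketch only: dataset_a / dataset_b are hypothetical table names.
SELECT dt, amount
FROM   dataset_a                          -- accurate, but only complete quarters
WHERE  dt <  TRUNC(CURRENT_DATE, 'Q')     -- everything up to the last full quarter
UNION ALL
SELECT dt, amount
FROM   dataset_b                          -- fresher, but less accurate
WHERE  dt >= TRUNC(CURRENT_DATE, 'Q');    -- the current, still-open quarter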

Postgresql: how to select from map of multiple values

I have a SOME_DELTA table which records all party-related transactions with the amount change
Ex.:
PARTY_ID | SOME_DATE | AMOUNT
--------------------------------
party_id_1 | 2019-01-01 | 100
party_id_1 | 2019-01-15 | 30
party_id_1 | 2019-01-15 | -60
party_id_1 | 2019-01-21 | 80
party_id_2 | 2019-01-02 | 50
party_id_2 | 2019-02-01 | 100
I have a case where an MVC controller accepts a map someMap(party_id, some_date), and I need to get a party_id list with the summed amount up to the specific some_date
In this case, if I send mapOf("party_id_1" to Date(2019 - 1 - 15), "party_id_2" to Date(2019 - 1 - 2))
I should get a list of party_id with the amount summed up to the given some_date
Output should look like:
party_id_1 | 70
party_id_2 | 50
Currently code is:
select sum(amount) from SOME_DELTA where party_id=:partyId and some_date <= :someDate
But in this case I need to iterate through the map and make a separate DB call for the summed amount of each party_id up to its some_date, which feels wrong
Is there a more delicate way to get this in one select query (to avoid 100+ DB calls)?
You can use a lateral join for this:
select map.party_id,
c.amount
from (
values
('party_id_1', date '2019-01-15'),
('party_id_2', date '2019-01-02')
) map (party_id, cutoff_date)
join lateral (
select sum(amount) amount
from some_delta sd
where sd.party_id = map.party_id
and sd.some_date <= map.cutoff_date
) c on true
order by map.party_id;
Online example
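If the lateral join feels unfamiliar, the same result can also be written as a plain outer join plus aggregation (same tables as above; coalesce covers parties with no matching rows):
-- Equivalent sketch without lateral: outer join, then aggregate per party.
select map.party_id,
       coalesce(sum(sd.amount), 0) as amount
from (
  values
    ('party_id_1', date '2019-01-15'),
    ('party_id_2', date '2019-01-02')
) map (party_id, cutoff_date)
left join some_delta sd
       on sd.party_id = map.party_id
      and sd.some_date <= map.cutoff_date
group by map.party_id
order by map.party_id;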

How to create a table that loops over data in Postgres

I want to create a table that returns, for each day, the top 10 cons_name by their aggregate sum over the week leading up to that day.
So for 5/29/2019 it will pull the top 10 cons_name by their sum dating back to 5/22/2019.
Then, for 5/28/2019, the top 10 cons_name by their sum back to 5/21/2019.
A table of top 10 dating back 7 days all the way to 2018-12-01.
I can write the simple query dating back 7 days, but I have tried window functions to no avail.
SELECT cons_name,
pricedate,
sum(shadow)
FROM spp.rtbinds
WHERE pricedate >= current_date - 7
GROUP BY cons_name, shadow, pricedate
ORDER BY shadow asc
LIMIT 10
This query generates the output below
cons_name pricedate sum
"TEMP17_24078" "2019-05-28 00:00:00" "-1473.29723333333"
"TEMP17_24078" "2019-05-28 00:00:00" "-1383.56638333333"
"TMP175_24736" "2019-05-23 00:00:00" "-1378.40504166667"
"TMP159_24149" "2019-05-23 00:00:00" "-1328.847675"
"TMP397_24836" "2019-05-23 00:00:00" "-1221.19560833333"
"TEMP17_24078" "2019-05-28 00:00:00" "-1214.9914"
"TMP175_24736" "2019-05-23 00:00:00" "-1123.83254166667"
"TEMP72_22893" "2019-05-29 00:00:00" "-1105.93840833333"
"TMP164_23704" "2019-05-24 00:00:00" "-1053.051375"
"TMP175_24736" "2019-05-27 00:00:00" "-1043.52104166667"
I would like a table and function that returns a table of each day's top 10 dating back a week.
Using window functions gets you on the right track, but you should read further in the documentation about the possibilities.
We have multiple issues here that we need to solve:
gaps in the data (missing pricedate) would not get us the correct number of rows (7) to calculate the overall sum
for the calculation itself we need all data rows, so the WHERE clause cannot be used to limit the result to only the visible days
in order to select the top 10 for each day, we have to generate a row number per partition, because the LIMIT clause cannot be applied per group
This is why I came up with the following CTEs:
CTE days: generate the gap-less date series and mark visible days
CTE daily: LEFT JOIN the data to the generated days and produce daily sums (and handle NULL entries)
CTE calc: produce the cumulative sums
CTE numbered: produce row numbers reset each day
select the actual visible rows and limit them to max. 10 per day
So for a specific week (2019-05-26 - 2019-06-01), the query will look like the following:
WITH
days (c_day, c_visible, c_lookback) as (
SELECT gen::date, (CASE WHEN gen::date < '2019-05-26' THEN false ELSE true END), gen::date - 6
FROM generate_series('2019-05-26'::date - 6, '2019-06-01'::date, '1 day'::interval) AS gen
),
daily (cons_name, pricedate, shadow_sum) AS (
SELECT
r.cons_name,
r.pricedate::date,
coalesce(sum(r.shadow), 0)
FROM days
LEFT JOIN spp.rtbinds AS r ON (r.pricedate::date = days.c_day)
GROUP BY 1, 2
),
calc (cons_name, pricedate, shadow_sum) AS (
SELECT
cons_name,
pricedate,
sum(shadow_sum) OVER (PARTITION BY cons_name ORDER BY pricedate ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
FROM daily
),
numbered (cons_name, pricedate, shadow_sum, position) AS (
SELECT
calc.cons_name,
calc.pricedate,
calc.shadow_sum,
ROW_NUMBER() OVER (PARTITION BY calc.pricedate ORDER BY calc.shadow_sum DESC)
FROM calc
)
SELECT
days.c_lookback,
numbered.cons_name,
numbered.shadow_sum
FROM numbered
INNER JOIN days ON (days.c_day = numbered.pricedate AND days.c_visible)
WHERE numbered.position < 11
ORDER BY numbered.pricedate DESC, numbered.shadow_sum DESC;
Online example with generated test data: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=a83a52e33ffea3783207e6b403bc226a
Example output:
c_lookback | cons_name | shadow_sum
------------+--------------+------------------
2019-05-26 | TMP400_27000 | 4578.04474575352
2019-05-26 | TMP700_25000 | 4366.56857151864
2019-05-26 | TMP200_24000 | 3901.50325547671
2019-05-26 | TMP400_24000 | 3849.39595793188
2019-05-26 | TMP700_28000 | 3763.51693260809
2019-05-26 | TMP600_26000 | 3751.72016620729
2019-05-26 | TMP500_28000 | 3610.75970225036
2019-05-26 | TMP300_26000 | 3598.36888491176
2019-05-26 | TMP600_27000 | 3583.89777677553
2019-05-26 | TMP300_21000 | 3556.60386707587
2019-05-25 | TMP400_27000 | 4687.20302128047
2019-05-25 | TMP200_24000 | 4453.61603102228
2019-05-25 | TMP700_25000 | 4319.10566615313
2019-05-25 | TMP400_24000 | 4039.01832416654
2019-05-25 | TMP600_27000 | 3986.68667223025
2019-05-25 | TMP600_26000 | 3879.92447655788
2019-05-25 | TMP700_28000 | 3632.56970774056
2019-05-25 | TMP800_25000 | 3604.1630071504
2019-05-25 | TMP600_28000 | 3572.50801157858
2019-05-25 | TMP500_27000 | 3536.57885829499
2019-05-24 | TMP400_27000 | 5034.53660146287
2019-05-24 | TMP200_24000 | 4646.08844632655
2019-05-24 | TMP600_26000 | 4377.5741555281
2019-05-24 | TMP700_25000 | 4321.11906399066
2019-05-24 | TMP400_24000 | 4071.37184911687
2019-05-24 | TMP600_25000 | 3795.00857752701
2019-05-24 | TMP700_26000 | 3518.6449117614
2019-05-24 | TMP600_24000 | 3368.15348120732
2019-05-24 | TMP200_25000 | 3305.84444172308
2019-05-24 | TMP500_28000 | 3162.57388606668
2019-05-23 | TMP400_27000 | 4057.08620966971
2019-05-23 | TMP700_26000 | 4024.11812392669
...
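Since the question also asks for a function, the query can be wrapped into one with the week boundaries as parameters. A sketch (top10_per_day is a hypothetical name; the body is the query above with the two date literals replaced, and the declared column types may need adjusting to match spp.rtbinds):
-- Sketch: the query above as a function; adjust the RETURNS TABLE column
-- types to match spp.rtbinds before using.
CREATE OR REPLACE FUNCTION top10_per_day(p_from date, p_to date)
RETURNS TABLE (c_lookback date, cons_name text, shadow_sum numeric)
LANGUAGE sql STABLE AS
$$
WITH
days (c_day, c_visible, c_lookback) as (
    SELECT gen::date, gen::date >= p_from, gen::date - 6
    FROM generate_series(p_from - 6, p_to, '1 day'::interval) AS gen
),
daily (cons_name, pricedate, shadow_sum) AS (
    SELECT r.cons_name, r.pricedate::date, coalesce(sum(r.shadow), 0)
    FROM days
    LEFT JOIN spp.rtbinds AS r ON (r.pricedate::date = days.c_day)
    GROUP BY 1, 2
),
calc (cons_name, pricedate, shadow_sum) AS (
    SELECT cons_name, pricedate,
           sum(shadow_sum) OVER (PARTITION BY cons_name ORDER BY pricedate
                                 ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
    FROM daily
),
numbered (cons_name, pricedate, shadow_sum, position) AS (
    SELECT calc.cons_name, calc.pricedate, calc.shadow_sum,
           ROW_NUMBER() OVER (PARTITION BY calc.pricedate
                              ORDER BY calc.shadow_sum DESC)
    FROM calc
)
SELECT days.c_lookback, numbered.cons_name, numbered.shadow_sum
FROM numbered
INNER JOIN days ON (days.c_day = numbered.pricedate AND days.c_visible)
WHERE numbered.position < 11
ORDER BY numbered.pricedate DESC, numbered.shadow_sum DESC;
$$;
-- usage:
-- SELECT * FROM top10_per_day(date '2019-05-26', date '2019-06-01');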

SQL - Average number of records within a time period

I'm trying to compile some lifetime value information for customers within one of our databases.
We have an MS SQL Server database which stores all of our customer/transactional information.
My issue is that I don't have much experience with MS SQL Server (or SQL in general). I'd like to be able to run a query against the database that pulls the AVG number of loans and the AVG revenue based on three criteria:
1.) Loans are only counted if they are 'approved'
2.) Loans from a customer_id are only counted if that customer's first loan (identified by the date_created field) is on or after a certain 'mm/yyyy'
3.) I'm able to specify for how many months after the first 'mm/yyyy' to tally the number of loans / revenue to be included in the AVG
Here is what the database would look like:
customer_id | loan_status | date_created | revenue
111 | 'approved' | 2010-06-20 17:17:09 | 100.00
222 | 'approved' | 2010-06-21 09:54:43 | 255.12
333 | 'denied' | 2011-06-21 12:47:30 | NULL
333 | 'approved' | 2011-06-21 12:47:20 | 56.87
222 | 'denied' | 2011-06-21 09:54:48 | NULL
222 | 'approved' | 2011-06-21 09:54:18 | 50.00
111 | 'approved' | 2011-06-20 17:17:23 | 100.00
... loads' of records ...
555 | 'approved' | 2012-01-02 09:08:42 | 24.70
111 | 'denied' | 2012-01-05 02:10:36 | NULL
666 | 'denied' | 2012-02-05 03:31:16 | NULL
555 | 'approved' | 2012-02-17 09:32:26 | 197.10
777 | 'approved' | 2012-04-03 18:28:45 | 300.50
777 | 'approved' | 2012-06-28 02:42:01 | 201.80
555 | 'approved' | 2012-06-21 22:16:59 | 10.00
666 | 'approved' | 2012-09-30 01:17:20 | 50.00
If I wanted to find the avg transaction count (approved transactions) and the average revenue per approved transaction for all customers whose first loan was in/after 2012-01, and for a period of 4 months after then, how would I go about querying the database?
Any help is greatly appreciated.
Something like this (there may be a few typos here and there)...
You could first calculate each customer's minimum loan date (tbl is a placeholder for your table name; "table" itself is a reserved word):
select customer_id, min(date_created)
from tbl
where loan_status = 'approved'
group by customer_id
Then you can join to it:
select t.customer_id, count(t.date_created) loan_count, avg(t.revenue) avg_revenue
from tbl t
join (
    select customer_id, min(date_created) as min_date
    from tbl
    where loan_status = 'approved'
    group by customer_id
) s on t.customer_id = s.customer_id
where t.date_created between s.min_date and DATEADD(month, 4, s.min_date)
  and t.loan_status = 'approved'
  -- and s.min_date >= '20120101' -- criterion 2: first loan on/after a given month
group by t.customer_id
Rename tbl to your table name.
Specify dates in the format YYYYMMDD.
select fl.customer_id, AVG(t.revenue) average_revenue
from
(
    select customer_id
    from tbl
    group by customer_id
    having min(date_created) >= '20120101'
) fl
join tbl t on t.customer_id = fl.customer_id
where t.loan_status = 'approved'
and t.date_created < '20120501' -- NOT including May the first, so Jan through Apr (4 months)
group by fl.customer_id
If you mean 4 months after each customer's first loan, leave me a comment, state whether it's 4 calendar months (e.g. 15-Jan to 15-May) or up to the last day of the 4th month (15-Jan to 30-Apr), and I'll update the answer.