Special SQL window function that works like a loop

I'm looking for some kind of SQL window function that calculates values based on a calculated value from a previous iteration while looping over the window. I'm not looking for 'lag', which just takes the original value of the previous row.
Here is the case: we have web analytics sessions. We would like to attribute each session to the last relevant channel. There are 3 channels: direct, organic and paid. However, they have different priorities: paid is always relevant. Organic is only relevant if there was no paid channel in the last 30 days, and direct is only relevant if there was no paid or organic channel in the last 30 days.
So in the example table we would like to calculate the values in the 'attributed' column based on the channel and date columns. Note, the data is there for several users, so this should be calculated per user.
+────────────+──────+─────────+────────────+
| date       | user | channel | attributed |
+────────────+──────+─────────+────────────+
| 2022-01-01 | 123  | direct  | direct     |
| 2022-01-14 | 123  | paid    | paid       |
| 2022-02-01 | 123  | direct  | paid       |
| 2022-02-12 | 123  | direct  | paid       |
| 2022-02-13 | 123  | organic | paid       |
| 2022-03-08 | 123  | direct  | direct     |
| 2022-03-10 | 123  | paid    | paid       |
+────────────+──────+─────────+────────────+
So in the table above, row 1 is attributed to direct because it's the first line. The second is then paid, as this has priority over direct. It stays paid for the next 2 sessions as direct has lower priority; then it switches to organic as the paid attribution is older than 30 days. The last one is then paid again as it has a higher priority than organic.
I would know how to solve it if a new channel only needed to be attributed based on the current row and the previous one. Here is the SQL to do that:
with source as ( -- example data
  select cast("2022-01-01" as date) as date, 123 as user, "direct" as channel
  union all
  select "2022-01-14", 123, "paid"
  union all
  select "2022-02-01", 123, "direct"
  union all
  select "2022-02-12", 123, "direct"
  union all
  select "2022-02-13", 123, "organic"
  union all
  select "2022-03-08", 123, "direct"
  union all
  select "2022-03-10", 123, "paid"
),
flag_new_channel as ( -- flag sessions that would override the channel information; this only works statically here
  select *,
    case
      when lag(channel) over (partition by user order by date) is null then 1
      when date_diff(date, lag(date) over (partition by user order by date), day) > 30 then 1
      when channel = "paid" then 1
      when channel = "organic" and lag(channel) over (partition by user order by date) != "paid" then 1
      else 0
    end flag
  from source
  qualify flag = 1
)
select s.*,
  f.channel attributed_channel,
  row_number() over (partition by s.user, s.date order by f.date desc) rn -- number of flagged previous sessions
from source s
left join flag_new_channel f
  on s.user = f.user -- attribution is per user
  and s.date >= f.date
qualify rn = 1 -- only keep the last flagged session at or before the current session
However, this would, for example, set "organic" in row 5 because it doesn't know "paid" is still relevant.
+────────────+──────+─────────+─────────────────────+
| date       | user | channel | attributed_channel  |
+────────────+──────+─────────+─────────────────────+
| 2022-01-01 | 123  | direct  | direct              |
| 2022-01-14 | 123  | paid    | paid                |
| 2022-02-01 | 123  | direct  | paid                |
| 2022-02-12 | 123  | direct  | paid                |
| 2022-02-13 | 123  | organic | organic             |
| 2022-03-08 | 123  | direct  | organic             |
| 2022-03-10 | 123  | paid    | paid                |
+────────────+──────+─────────+─────────────────────+
Any ideas? I'm not sure whether recursive queries or UDFs can help. I usually use BigQuery, but if you know solutions in other dialects it would still be interesting to know.

Here's one approach:
Updated: corrected. I had lost track of your expected result due to the confusing story.
For PostgreSQL, we can do something like this (with CTEs and window functions):
The fiddle for PG 14
pri - provides a table of (channel, priority) pairs
cte0 - provides the test data
cte1 - determines the minimum priority over the last 30 days per user
final - the final query expression obtains the attributed channel name
WITH pri (channel, pri) AS (
  VALUES ('paid'   , 1)
       , ('organic', 2)
       , ('direct' , 3)
)
, cte0 (date, xuser, channel) AS (
  VALUES
    ('2022-01-01'::date, 123, 'direct')
  , ('2022-01-14'      , 123, 'paid')
  , ('2022-02-01'      , 123, 'direct')
  , ('2022-02-12'      , 123, 'direct')
  , ('2022-02-13'      , 123, 'organic')
  , ('2022-03-08'      , 123, 'direct')
  , ('2022-03-10'      , 123, 'paid')
)
, cte1 AS (
  SELECT cte0.*
       , pri.pri
       , MIN(pri.pri) OVER (PARTITION BY xuser ORDER BY date
           RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW
         ) AS mpri -- lowest priority value (= most relevant channel) in the trailing 30 days
  FROM cte0
  JOIN pri
    ON pri.channel = cte0.channel
)
SELECT cte1.*
     , pri.channel AS attributed -- map the winning priority back to a channel name
FROM cte1
JOIN pri
  ON pri.pri = cte1.mpri
;
The result:
| date       | xuser | channel | pri | mpri | attributed |
|------------|-------|---------|-----|------|------------|
| 2022-01-01 | 123   | direct  | 3   | 3    | direct     |
| 2022-01-14 | 123   | paid    | 1   | 1    | paid       |
| 2022-02-01 | 123   | direct  | 3   | 1    | paid       |
| 2022-02-12 | 123   | direct  | 3   | 1    | paid       |
| 2022-02-13 | 123   | organic | 2   | 1    | paid       |
| 2022-03-08 | 123   | direct  | 3   | 2    | organic    |
| 2022-03-10 | 123   | paid    | 1   | 1    | paid       |
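Since the question mentions BigQuery: here is a minimal sketch of the same idea translated to BigQuery, reusing the source CTE from the question above. The main wrinkle is that BigQuery only allows RANGE frames over a numeric ordering key, so the sketch orders the window on unix_date (days since epoch); the CTE names pri and ranked are mine.
with pri as (
  select 'paid' as channel, 1 as pri union all
  select 'organic', 2 union all
  select 'direct', 3
),
ranked as (
  select s.date, s.user, s.channel, p.pri,
    -- lowest priority value (= most relevant channel) in the trailing 30 days
    min(p.pri) over (
      partition by s.user
      order by unix_date(s.date) -- RANGE frames need a numeric ordering key in BigQuery
      range between 30 preceding and current row
    ) as mpri
  from source s
  join pri p
    on p.channel = s.channel
)
select r.date, r.user, r.channel, p.channel as attributed
from ranked r
join pri p
  on p.pri = r.mpri
order by r.user, r.date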

Related

Self join to create a new column with updated records

I am trying to write a SQL query to get the start date for employees in a store. As seen in the first screenshot, employee number 5041 had the number A0EH, but as the number got updated, it updated the start date for the employee as well. This affects the metric of total duration in the store.
I am trying to get to the output below but haven't been able to figure out how to get this view.
This is the code I was trying, but I am not getting the correct output:
select
esd.employee_number,
(case when esd.old_employee_number is null then es.employee_number else es.old_employee_number end) as old_employee_number,
esd.entity_id,
esd.original_start_date
from earliest_start_date as esd
left join earliest_start_date as es
on (es.employee_number = esd.old_employee_number)
How do I solve this in SQL?
Redshift reportedly supports recursion via the WITH clause. Here's an example:
MariaDB 10.5 has similar support. Test case is here:
Fully working test case (via MariaDB 10.5) (Updated)
Link to Amazon Redshift detail for WITH clause and window functions:
Amazon Redshift - WITH clause
Amazon redshift - Window functions
WITH RECURSIVE cte (employee_number, original_no, entity_id, original_start_date, n) AS (
    -- anchor: employees whose number was never replaced
    SELECT employee_number, employee_number, entity_id, original_start_date, 1
    FROM earliest_start_date
    WHERE old_employee_number IS NULL
    UNION ALL
    -- recursive step: follow each renumbering chain one link at a time
    SELECT new_tbl.employee_number, cte.original_no, cte.entity_id, cte.original_start_date, n + 1
    FROM earliest_start_date new_tbl
    JOIN cte
      ON cte.employee_number = new_tbl.old_employee_number
)
, xrows AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY n DESC) AS rn
    FROM cte
)
SELECT * FROM xrows WHERE rn = 1
;
Result:
+-----------------+-------------+-----------+---------------------+------+----+
| employee_number | original_no | entity_id | original_start_date | n    | rn |
+-----------------+-------------+-----------+---------------------+------+----+
| XXXX            | XXXX        |        88 | 2021-09-02          |    1 |  1 |
| 5041            | A0EH        |        96 | 2021-09-05          |    2 |  1 |
+-----------------+-------------+-----------+---------------------+------+----+
2 rows in set
Raw test data:
SELECT * FROM earliest_start_date;
+-----------------+---------------------+-----------+---------------------+
| employee_number | old_employee_number | entity_id | original_start_date |
+-----------------+---------------------+-----------+---------------------+
| 5041            | A0EH                |        96 | 2021-09-10          |
| A0EH            | NULL                |        96 | 2021-09-05          |
| XXXX            | NULL                |        88 | 2021-09-02          |
+-----------------+---------------------+-----------+---------------------+
Note that the logic makes assumptions about the uniqueness of employee_number and, in its current form, can't handle cases where an employee_number is reused by the same employee, or used again for a different employee, without adjusting prior data. There may not be enough detail in the current structure to handle those cases.
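If number reuse or an accidental cycle is a concern, one hedged mitigation (not in the original answer) is to cap the recursion depth; 10 below is an arbitrary assumed limit:
WITH RECURSIVE cte (employee_number, original_no, entity_id, original_start_date, n) AS (
    SELECT employee_number, employee_number, entity_id, original_start_date, 1
    FROM earliest_start_date
    WHERE old_employee_number IS NULL
    UNION ALL
    SELECT new_tbl.employee_number, cte.original_no, cte.entity_id, cte.original_start_date, n + 1
    FROM earliest_start_date new_tbl
    JOIN cte
      ON cte.employee_number = new_tbl.old_employee_number
    WHERE cte.n < 10 -- assumed cap: stops runaway recursion if a renumbering chain ever loops
)
SELECT * FROM cte;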

How to pull data based on current and last update?

Our data table looks like this:
| Machine Name | Lot Number | Qty | Load TxnDate | Unload TxnDate |
|--------------|------------|-----|--------------|----------------|
| M123         | ABC        | 500 | 10/1/2020    | 10/2/2020      |
| M741         | DEF        | 325 | 10/1/2020    |                |
| M123         | ZZZ        | 100 | 10/5/2020    | 10/7/2020      |
| M951         | AAA        | 550 | 10/5/2020    | 10/9/2020      |
| M123         | BBB        | 550 | 10/7/2020    |                |
I need to create an SQL query that shows the currently loaded Lot number - Machines with no Unload TxnDate - and the last loaded Lot number based on the unload TxnDate.
So in the example, when I run a query for M123, the result will show:
| Machine Name | Lot Number | Qty | Load TxnDate | Unload TxnDate |
|--------------|------------|-----|--------------|----------------|
| M123         | ZZZ        | 100 | 10/5/2020    | 10/7/2020      |
| M123         | BBB        | 550 | 10/7/2020    |                |
As you can see, although Machine Name M123 has 3 records, the results only show the currently loaded and the last loaded lot. Is there any way to replicate this? The Machine Name is dynamic, so my user can enter the Machine Name and see the results for that machine based on the missing Unload TxnDate and the last Unload TxnDate.
You seem to want the last two rows. That would be something like this:
select t.*
from t
where machine_name = 'M123'
order by load_txn_date desc
fetch first 2 rows only;
Note: not all databases support the fetch first clause. Some spell it limit, or select top, or even something else.
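For reference, hedged equivalents of the same "last two rows" query in a few common dialects, against the same table:
-- MySQL / PostgreSQL / SQLite
select * from t where machine_name = 'M123' order by load_txn_date desc limit 2;

-- SQL Server
select top 2 * from t where machine_name = 'M123' order by load_txn_date desc;

-- Oracle 12c+ supports fetch first; for older versions, filter ROWNUM over an ordered subquery
select *
from (select * from t where machine_name = 'M123' order by load_txn_date desc)
where rownum <= 2;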
If you want two rows per machine, one option uses window functions:
select *
from (
    select t.*,
        row_number() over(
            partition by machine_name,
                (case when unload_txn_date is null then 0 else 1 end)
            order by coalesce(unload_txn_date, load_txn_date) desc
        ) rn
    from mytable t
) t
where rn = 1
The idea is to separate rows into those that have an unload date and those that do not. We can then bring the top record per group.
For your sample data, this returns:
Machine_Name | Lot_Number | Qty | Load_Txn_Date | Unload_Txn_Date | rn
:----------- | :--------- | --: | :------------ | :-------------- | -:
M123 | BBB | 550 | 2020-10-07 | null | 1
M123 | ZZZ | 100 | 2020-10-05 | 2020-10-07 | 1
M741 | DEF | 325 | 2020-10-01 | null | 1
M951 | AAA | 550 | 2020-10-05 | 2020-10-09 | 1
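If the user supplies a single machine name, the same query can simply be filtered; a sketch, with the literal standing in for a bound parameter:
select *
from (
    select t.*,
        row_number() over(
            partition by machine_name,
                (case when unload_txn_date is null then 0 else 1 end)
            order by coalesce(unload_txn_date, load_txn_date) desc
        ) rn
    from mytable t
    where machine_name = 'M123' -- user-supplied machine name
) t
where rn = 1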
You might use the following query, presuming that you're on a database that has window (or analytic) functions:
WITH t AS
(
SELECT COALESCE(Unload_Txn_Date -
LAG(Load_Txn_Date) OVER
(PARTITION BY Machine_Name ORDER BY Load_Txn_Date DESC),0) AS lg,
MAX(CASE WHEN Unload_Txn_Date IS NULL THEN Load_Txn_Date END) OVER
(PARTITION BY Machine_Name) AS mx,
t.*
FROM tab t
), t2 AS
(
SELECT DENSE_RANK() OVER (ORDER BY mx DESC NULLS LAST) AS dr, t.*
FROM t
WHERE mx IS NOT NULL
)
SELECT Machine_Name,Lot_Number,Qty,Load_Txn_Date,Unload_Txn_Date
FROM t2
WHERE dr = 1 AND lg = 0
ORDER BY Load_Txn_Date
Here, if a lot's Unload_Txn_Date equals the Load_Txn_Date of the lot loaded after it on the same machine, the job is treated as continuing without interruption (lg = 0), while mx marks, per machine, the load date of the lot that has no unload date yet. The result set is then produced by filtering, in the penultimate query, on the values derived from those window functions: dr = 1 keeps the machine with the most recent open load, and lg = 0 keeps its unbroken chain of lots.
Demo

SQL query that finds dates between a range and takes values from another query & iterates range over them?

Sorry if the wording for this question is strange. Wasn't sure how to word it, but here's the context:
I'm working on an application that shows some data about how often individual applications are being used when users make a request to my web server. The way we record data is that every time the start page loads, a row is added to a table called WEB_TRACKING with the date of the load. So there are a lot of holes in the data; for example, an application might've been used heavily on September 1st but not at all on September 2nd. What I want to do is fill those holes with a hits value of 0. This is what I came up with:
Select HIT_DATA.DATE_ACCESSED, HIT_DATA.APP_ID, HIT_DATA.NAME, WORKDAYS.BENCH_DAYS, NVL(HIT_DATA.HITS, 0) from (
select DISTINCT( TO_CHAR(WEB.ACCESS_TIME, 'MM/DD/YYYY')) as BENCH_DAYS
FROM WEB_TRACKING WEB
) workDays
LEFT join (
SELECT TO_CHAR(WEB.ACCESS_TIME, 'MM/DD/YYYY') as DATE_ACCESSED, APP.APP_ID, APP.NAME,
COUNT(WEB.IP_ADDRESS) AS HITS
FROM WEB_TRACKING WEB
INNER JOIN WEB_APP APP ON WEB.APP_ID = APP.APP_ID
WHERE APP.IS_ENABLED = 1 AND (APP.APP_ID = 1 OR APP.APP_ID = 2)
AND (WEB.ACCESS_TIME > TO_DATE('08/04/2018', 'MM/DD/YYYY')
AND WEB.ACCESS_TIME < TO_DATE('09/04/2018', 'MM/DD/YYYY'))
GROUP BY TO_CHAR(WEB.ACCESS_TIME, 'MM/DD/YYYY'), APP.APP_ID, APP.NAME
ORDER BY TO_CHAR(WEB.ACCESS_TIME, 'MM/DD/YYYY'), app_id DESC
) HIT_DATA ON HIT_DATA.DATE_ACCESSED = WORKDAYS.BENCH_DAYS
ORDER BY WORKDAYS.BENCH_DAYS
It returns all the dates within the date range and even converts null hits to 0. However, it returns null for app ID and app name, which makes sense. I understand how to give a default value for one application, but I was hoping someone could help me figure out how to do it for multiple applications.
Basically, I am getting this (in the case of using just one application):
| APP_ID | NAME | BENCH_DAYS | HITS |
| ------ | ---------- | ---------- | ---- |
| NULL | NULL | 08/04/2018 | 0 |
| 1 | test_app | 08/05/2018 | 1 |
| NULL | NULL | 08/06/2018 | 0 |
But I want this(with multiple applications):
| APP_ID | NAME | BENCH_DAYS | HITS |
| ------ | ---------- | ---------- | ---- |
| 1 | test_app | 08/04/2018 | 0 |<- these 0's are converted from null
| 1 | test_app | 08/05/2018 | 1 |
| 1 | test_app | 08/06/2018 | 0 | <- these 0's are converted from null
| 2 | prod_app | 08/04/2018 | 2 |
| 2 | prod_app | 08/05/2018 | 0 | <- these 0's are converted from null
So, again, to reiterate the question in this long post: how should I go about populating this query so that it fills up the holes in the dates, but also reuses the application names and IDs and populates that information as well?
You need a list of dates; that probably comes from a row generator rather than from a table (if that table has holes, your report will too).
Example, every date for the past 30 days:
select trunc(sysdate-30) + level as bench_days from dual connect by level <= 30
Use TRUNC instead of turning a date into a string in order to cut the time off
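For example, a quick sanity check of what TRUNC does to a date-time:
-- returns DATE 2018-09-04 00:00:00 (still a DATE, not a string),
-- so it compares cleanly against the generated calendar days
select trunc(to_date('09/04/2018 14:33', 'MM/DD/YYYY HH24:MI')) from dual;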
Now you have a list of dates, you want to add in repeating app id and name:
select * from
    (select trunc(sysdate-30) + level as bench_days from dual connect by level <= 30) dat
CROSS JOIN
    (select app_id, name from WEB_APP where IS_ENABLED = 1 and APP_ID in (1, 2)) app
Now you have all your dates crossed with all your apps. 2 apps and 30 days will make a 60-row result set via a cross join. Left join your stat data onto it, and group/count/sum/aggregate ...
select app.app_id, app.name, dat.artificialday, COALESCE(stat.ct, 0) as hits
from
    (select trunc(sysdate-30) + level as artificialday from dual connect by level <= 30) dat
CROSS JOIN
    (select app_id, name from WEB_APP where IS_ENABLED = 1 and APP_ID in (1, 2)) app
LEFT JOIN
    (select app_id, trunc(access_time) accdate, count(ip_address) ct
     from web_tracking
     group by app_id, trunc(access_time)) stat
ON  stat.app_id = app.app_id
AND stat.accdate = dat.artificialday
You don't have to write the query this way or do your grouping as a subquery; I'm just presenting it this way to lead you to thinking about your data in blocks that you build in isolation and join together later to build more comprehensive blocks.
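As a sketch of that block-by-block style, the same query can be factored into named CTEs (same assumed tables and filters as above; the CTE names are mine):
with dat as (
    -- block 1: the calendar
    select trunc(sysdate - 30) + level as artificialday
    from dual
    connect by level <= 30
),
apps as (
    -- block 2: the applications of interest
    select app_id, name
    from web_app
    where is_enabled = 1
      and app_id in (1, 2)
),
stat as (
    -- block 3: hits per app per day
    select app_id, trunc(access_time) as accdate, count(ip_address) as ct
    from web_tracking
    group by app_id, trunc(access_time)
)
select apps.app_id, apps.name, dat.artificialday, coalesce(stat.ct, 0) as hits
from dat
cross join apps
left join stat
       on stat.app_id = apps.app_id
      and stat.accdate = dat.artificialday
order by apps.app_id, dat.artificialday;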

Oracle SQL Join Data Sequentially

I am trying to track the usage of material with my SQL. There is no way in our database to link when a part is used to the order it originally came from. A part simply ends up in a bin after an order arrives, and then usage of parts basically just creates a record for the number of parts used at a time of transaction. I am attempting to, as best I can, link usage to an order number by summing over the data and sequentially assigning it to order numbers.
My subqueries have gotten me this far. Each order number is received on a date. I then join the usage table records based on the USEDATE needing to be equal to or greater than the RECEIVEDATE of the order. The data produced by this is as such:
| ORDERNUM | PARTNUM | RECEIVEDATE | ORDERQTY | USEQTY | USEDATE |
|----------|----------|-------------------------|-----------|---------|------------------------|
| 4412 | E1125 | 10/26/2016 1:32:25 PM | 1 | 1 | 11/18/2016 1:40:55 PM |
| 4412 | E1125 | 10/26/2016 1:32:25 PM | 1 | 3 | 12/26/2016 2:19:32 PM |
| 4412 | E1125 | 10/26/2016 1:32:25 PM | 1 | 1 | 1/3/2017 8:31:21 AM |
| 4111 | E1125 | 10/28/2016 2:54:13 PM | 1 | 1 | 11/18/2016 1:40:55 PM |
| 4111 | E1125 | 10/28/2016 2:54:13 PM | 1 | 3 | 12/26/2016 2:19:32 PM |
| 4111 | E1125 | 10/28/2016 2:54:13 PM | 1 | 1 | 1/3/2017 8:31:21 AM |
| 0393 | E1125 | 12/22/2016 11:52:04 AM | 3 | 3 | 12/26/2016 2:19:32 PM |
| 0393 | E1125 | 12/22/2016 11:52:04 AM | 3 | 1 | 1/3/2017 8:31:21 AM |
| 7812 | E1125 | 12/27/2016 10:56:01 AM | 1 | 1 | 1/3/2017 8:31:21 AM |
| 1191 | E1125 | 1/5/2017 1:12:01 PM | 2 | 0 | null |
The query for the above section looks as such:
SELECT
    B.*,
    NVL(B2.QTY, 0) USEQTY,
    B2.USEDATE USEDATE
FROM <<Sub Query B>> B
LEFT JOIN USETABLE B2
    ON B.PARTNUM = B2.PARTNUM
   AND B2.USEDATE >= B.RECEIVEDATE
My ultimate goal here is to join USEQTY records sequentially until they have filled enough ORDERQTYs. I also need to add an ORDERUSE column that represents what QTY from the USEQTY column was actually applied to that record. Not really sure how to word this any better, so here is an example of what I need to happen based on the table above:
| ORDERNUM | PARTNUM | RECEIVEDATE | ORDERQTY | USEQTY | USEDATE | ORDERUSE |
|----------|----------|-------------------------|-----------|---------|------------------------|-----------|
| 4412 | E1125 | 10/26/2016 1:32:25 PM | 1 | 1 | 11/18/2016 1:40:55 PM | 1 |
| 4111 | E1125 | 10/28/2016 2:54:13 PM | 1 | 3 | 12/26/2016 2:19:32 PM | 1 |
| 0393 | E1125 | 12/22/2016 11:52:04 AM | 3 | 2 | 12/26/2016 2:19:32 PM | 2 |
| 0393 | E1125 | 12/22/2016 11:52:04 AM | 3 | 1 | 1/3/2017 8:31:21 AM | 1 |
| 7812 | E1125 | 12/27/2016 10:56:01 AM | 1 | 0 | null | 0 |
| 1191 | E1125 | 1/5/2017 1:12:01 PM | 2 | 0 | null | 0 |
If I can get the query to pull the information like above, I will then be able to group the records together and sum the ORDERUSE column, which would tell me which orders have been used and which have not been fully used. So in the example above, if I were to sum the ORDERUSE column for each of the ORDERNUMs, orders 4412, 4111 and 0393 would all show full usage, while orders 7812 and 1191 would show not being fully used.
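For reference, that final grouping step might look like this sketch, assuming the desired output above were available as a view or CTE named order_usage (a hypothetical name):
select ordernum,
       min(orderqty) as orderqty,
       sum(orderuse) as total_used,
       case when sum(orderuse) >= min(orderqty) then 'fully used'
            else 'not fully used'
       end as usage_status
from order_usage -- hypothetical: the desired result set shown above
group by ordernum;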
If I am reading this correctly, you want to determine how many parts have been used. In your example it looks like you have 5 usages and 5 orders, coming to a total of 8 parts, with the following orders having been used:
4412 - one part - one used
4111 - one part - one used
7812 - one part - one used
0393 - three parts - two used
After a bit of hacking away I came up with the following SQL. Not sure if this works outside of your sample data, since that's the only thing I used to test and I am no expert.
WITH data
AS (SELECT *
    FROM (SELECT *
          FROM sub_b1
          -- expand each order into one row per ordered part
          join (SELECT ROWNUM rn
                FROM dual
                CONNECT BY LEVEL < 15) a
            ON a.rn <= sub_b1.orderqty
          ORDER BY receivedate)
    -- keep only as many part-rows as parts were actually used
    WHERE ROWNUM <= (SELECT SUM(useqty)
                     FROM sub_b2))
SELECT sub_b1.ordernum,
       partnum,
       receivedate,
       orderqty,
       usage
FROM sub_b1
join (SELECT ordernum,
             Max(rn) AS usage
      FROM data
      GROUP BY ordernum) b
  ON sub_b1.ordernum = b.ordernum
You are looking for "FIFO" inventory accounting.
The proper data model should have two tables, one for "received" parts and the other for "delivered" or "used" parts. Each table should show an order number, a part number and a quantity (received or used) for that order, and a timestamp or date-time. I model both as CTEs in my query below, but in your business they should be two separate tables. Also, a trigger or similar should enforce the constraint that a part cannot be used until it is available in stock (that is: for each part id, the total quantity used since inception, at any point in time, should not exceed the total quantity received since inception at the same point in time). I assume that the two input tables do, in fact, satisfy this condition, and I don't check it in the solution.
The output shows a timeline of quantity used, by timestamp, matching "received" and "delivered" (used) quantities for each part_id. In the sample data I illustrate a single part_id, but the query will work with multiple part_id's, and orders (both for received and for delivered or used) that include multiple parts (part id's) with different quantities.
with
received ( order_id, part_id, ts, qty ) as (
select '0030', '11A4', timestamp '2015-03-18 15:00:33', 20 from dual union all
select '0032', '11A4', timestamp '2015-03-22 15:00:33', 13 from dual union all
select '0034', '11A4', timestamp '2015-03-24 10:00:33', 18 from dual union all
select '0036', '11A4', timestamp '2015-04-01 15:00:33', 25 from dual
),
delivered ( order_id, part_id, ts, qty ) as (
select '1200', '11A4', timestamp '2015-03-18 16:30:00', 14 from dual union all
select '1210', '11A4', timestamp '2015-03-23 10:30:00', 8 from dual union all
select '1220', '11A4', timestamp '2015-03-23 11:30:00', 7 from dual union all
select '1230', '11A4', timestamp '2015-03-23 11:30:00', 4 from dual union all
select '1240', '11A4', timestamp '2015-03-26 15:00:33', 1 from dual union all
select '1250', '11A4', timestamp '2015-03-26 16:45:11', 3 from dual union all
select '1260', '11A4', timestamp '2015-03-27 10:00:33', 2 from dual union all
select '1270', '11A4', timestamp '2015-04-03 15:00:33', 16 from dual
),
-- end of test data; the query proper begins below (if you run it without the test CTEs, add the word WITH at the top)
-- with
combined ( part_id, rec_ord, rec_ts, rec_sum, del_ord, del_ts, del_sum) as (
select part_id, order_id, ts,
sum(qty) over (partition by part_id order by ts, order_id),
null, cast(null as date), cast(null as number)
from received
union all
select part_id, null, cast(null as date), cast(null as number),
order_id, ts,
sum(qty) over (partition by part_id order by ts, order_id)
from delivered
),
prep ( part_id, rec_ord, del_ord, del_ts, qty_sum ) as (
select part_id, rec_ord, del_ord, del_ts, coalesce(rec_sum, del_sum)
from combined
)
select part_id,
last_value(rec_ord ignore nulls) over (partition by part_id
order by qty_sum desc) as rec_ord,
last_value(del_ord ignore nulls) over (partition by part_id
order by qty_sum desc) as del_ord,
last_value(del_ts ignore nulls) over (partition by part_id
order by qty_sum desc) as used_date,
qty_sum - lag(qty_sum, 1, 0) over (partition by part_id
order by qty_sum, del_ts) as used_qty
from prep
order by qty_sum
;
Output:
PART_ID REC_ORD DEL_ORD USED_DATE USED_QTY
------- ------- ------- ----------------------------------- ----------
11A4 0030 1200 18-MAR-15 04.30.00.000000000 PM 14
11A4 0030 1210 23-MAR-15 10.30.00.000000000 AM 6
11A4 0032 1210 23-MAR-15 10.30.00.000000000 AM 2
11A4 0032 1220 23-MAR-15 11.30.00.000000000 AM 7
11A4 0032 1230 23-MAR-15 11.30.00.000000000 AM 4
11A4 0032 1230 23-MAR-15 11.30.00.000000000 AM 0
11A4 0034 1240 26-MAR-15 03.00.33.000000000 PM 1
11A4 0034 1250 26-MAR-15 04.45.11.000000000 PM 3
11A4 0034 1260 27-MAR-15 10.00.33.000000000 AM 2
11A4 0034 1270 03-APR-15 03.00.33.000000000 PM 12
11A4 0036 1270 03-APR-15 03.00.33.000000000 PM 4
11A4 0036 21
12 rows selected.
Notes: (1) One needs to be careful if at some moment the cumulative used quantity exactly matches the cumulative received quantity. All rows must be included in all the intermediate results, otherwise there will be bad data in the output; but this may result (as you can see in the output above) in a few rows with a used quantity of 0. Depending on how this output is consumed (for further processing, reporting, etc.) these rows may be left as they are, or they may be discarded in a further outer query with the condition where used_qty > 0.
(2) The last row shows a quantity of 21 with no used_date and no del_ord. This is, in fact, the "current" quantity in stock for that part_id as of the last date in both tables - available for future use. Again, if this is not needed, it can be removed in an outer query. There may be one or more rows like this at the end of the table.
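For instance, a sketch of such an outer query: continue the WITH chain above by naming the final SELECT as one more CTE (fifo_lines is a name introduced here), then filter it:
, fifo_lines (part_id, rec_ord, del_ord, used_date, used_qty) as (
    select part_id,
           last_value(rec_ord ignore nulls) over (partition by part_id order by qty_sum desc),
           last_value(del_ord ignore nulls) over (partition by part_id order by qty_sum desc),
           last_value(del_ts  ignore nulls) over (partition by part_id order by qty_sum desc),
           qty_sum - lag(qty_sum, 1, 0) over (partition by part_id order by qty_sum, del_ts)
    from prep
)
select *
from fifo_lines
where used_qty > 0          -- drops the zero-quantity rows from note (1)
  and del_ord is not null   -- drops the trailing in-stock row(s) from note (2)
order by used_date;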

SQL - Average number of records within a time period

I'm trying to compile some lifetime value information for customers within one of our databases.
We have an MS SQL Server database which stores all of our customer/transactional information.
My issue is that I don't have much experience when it comes to MS SQL Server (or SQL in general). I'd like to be able to run a query against the database that pulls the AVG number of loans and AVG revenue based on three criteria:
1.) Loans are only counted if they are 'approved'
2.) Loans from a customer_id are only counted if the first loan (identified by the date_created field) is on or after a certain 'mm/yyyy'
3.) I'm able to specify for how many months after the first 'mm/yyyy' to tally the number of loans / revenue to be included within the AVG
Here is what the database would look like:
customer_id | loan_status | date_created | revenue
111 | 'approved' | 2010-06-20 17:17:09 | 100.00
222 | 'approved' | 2010-06-21 09:54:43 | 255.12
333 | 'denied' | 2011-06-21 12:47:30 | NULL
333 | 'approved' | 2011-06-21 12:47:20 | 56.87
222 | 'denied' | 2011-06-21 09:54:48 | NULL
222 | 'approved' | 2011-06-21 09:54:18 | 50.00
111 | 'approved' | 2011-06-20 17:17:23 | 100.00
... loads' of records ...
555 | 'approved' | 2012-01-02 09:08:42 | 24.70
111 | 'denied' | 2012-01-05 02:10:36 | NULL
666 | 'denied' | 2012-02-05 03:31:16 | NULL
555 | 'approved' | 2012-02-17 09:32:26 | 197.10
777 | 'approved' | 2012-04-03 18:28:45 | 300.50
777 | 'approved' | 2012-06-28 02:42:01 | 201.80
555 | 'approved' | 2012-06-21 22:16:59 | 10.00
666 | 'approved' | 2012-09-30 01:17:20 | 50.00
If I wanted to find the avg transaction count (approved transactions) and the average revenue per approved transaction for all customers whose first loan was in/after 2012-01, and for a period of 4 months after that, how would I go about querying the database?
Any help is greatly appreciated.
Something like this (there may be a few typos here and there)...
you could first calculate the minimum loan date:
select customer_id, min(date_created)
from table t
where loan_status = 'approved'
group by customer_id
then you can join to it:
select t.customer_id, count(t.date_created), avg(t.revenue)
from table t
join (
    select customer_id, min(date_created) as min_date
    from table t
    where loan_status = 'approved'
    group by customer_id
) s
  on t.customer_id = s.customer_id
where t.date_created between s.min_date and DATEADD(month, 4, s.min_date)
  and t.loan_status = 'approved'
group by t.customer_id -- aggregate per customer
Rename tbl to your table name.
Specify dates in the format YYYYMMDD.
select t.customer_id, AVG(t.revenue) average_revenue
from
(
    select customer_id
    from tbl
    group by customer_id
    having min(date_created) >= '20120101'
) fl
join tbl t on t.customer_id = fl.customer_id
where t.loan_status = 'approved'
and t.date_created < '20120501' -- NOT including May the first, so Jan through Apr (4 months)
group by t.customer_id
If you mean 4 months after each customer's first loan, leave me a comment, state whether it's 4 calendar months (e.g. 15-Jan to 15-May) or up to the last day of the 4th month (15-Jan to 30-Apr), and I'll update the answer.
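And, as a hedged sketch tying both requested averages together under the same fixed Jan-Apr window as the query above (same assumed table tbl):
select count(*) * 1.0 / count(distinct t.customer_id) as avg_approved_loans_per_customer,
       avg(t.revenue)                                 as avg_revenue_per_approved_loan
from tbl t
join
(
    select customer_id
    from tbl
    group by customer_id
    having min(date_created) >= '20120101'
) fl on fl.customer_id = t.customer_id
where t.loan_status = 'approved'
  and t.date_created < '20120501';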