How to compute cumulative product in SQL Server 2008?

I have the table below with two columns, DATE and FACTOR. I would like to compute a cumulative product, something like the CUMFACTOR column, in SQL Server 2008.
Can someone please suggest an alternative?

Unfortunately, there's no PROD() aggregate or window function in SQL Server (or in most other SQL databases). But you can emulate it with LOG and EXP, since a sum of logarithms is the logarithm of the product:
SELECT Date, Factor, exp(sum(log(Factor)) OVER (ORDER BY Date)) CumFactor
FROM MyTable
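Two caveats: an ORDER BY inside OVER() for aggregate functions only arrived in SQL Server 2012, so on 2008 itself you would combine the same LOG/EXP trick with a self-join (see the next answer); and LOG() is undefined for zero and negative values. If your factors can be zero or negative, a variant along these lines should work (a sketch, untested):
SELECT Date, Factor,
       CASE
           -- any zero so far makes the whole running product zero
           WHEN MIN(ABS(Factor)) OVER (ORDER BY Date) = 0 THEN 0
           ELSE EXP(SUM(LOG(ABS(NULLIF(Factor, 0)))) OVER (ORDER BY Date))
                -- an odd count of negative factors so far flips the sign
                * CASE WHEN SUM(CASE WHEN Factor < 0 THEN 1 ELSE 0 END)
                           OVER (ORDER BY Date) % 2 = 1
                       THEN -1 ELSE 1 END
       END AS CumFactor
FROM MyTable;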

You can do it on SQL Server 2008 with a triangular self-join: each row is joined to every row up to and including itself, and the same LOG/EXP trick turns the SUM into a product.
SELECT A.[ROW]
     , A.[DATE]
     , A.RATE
     , EXP(SUM(LOG(B.RATE))) AS [CUM RATE]
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY [DATE]) AS [ROW], [DATE], RATE
    FROM [TABLE]
) A
JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY [DATE]) AS [ROW], [DATE], RATE
    FROM [TABLE]
) B
    ON B.[ROW] <= A.[ROW]
GROUP BY A.[ROW], A.[DATE], A.RATE

To calculate the cumulative product, as displayed in the CumFactor column in the original post, the following code does the job:
--first, load the sample data to a temp table
select *
into #t
from
(
values
('2/3/2000', 10),
('2/4/2000', 20),
('2/5/2000', 30),
('2/6/2000', 40)
) d ([Date], [Rate]);
--next, calculate the cumulative product (rounding before the int cast guards against floating-point truncation)
select *,
       CumFactor = cast(round(exp(sum(log([Rate])) over (order by [Date])), 0) as int)
from #t;
Here is the result (recomputed from the sample data; the original post showed it as an image):
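Date       Rate  CumFactor
---------- ----- ----------
2/3/2000   10    10
2/4/2000   20    200
2/5/2000   30    6000
2/6/2000   40    240000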

Related

Parametrizing dates in SQL IN clause - using cell magic in jupyter notebook

Using an MSSQL DB as the backend, I have a cell in my notebook with this SQL, which works fine:
%%sql
select * from (
select count(*) as CNT, COL1,CONVERT(VARCHAR,CAST(CREATED_DATE AS date)) as dt from TABLE1
group by COL1,CONVERT(VARCHAR,CAST(CREATED_DATE AS date))
)t
PIVOT
(
sum(CNT) for [dt] in ([2022-07-25],[2022-07-26])
) AS PivotTable
I am trying to parameterize the IN list of the PIVOT.
I tried a few things, but without much success:
import pandas as pd
from datetime import datetime
rng = pd.date_range(end = datetime.today(), periods = 5).strftime('%Y-%m-%d').tolist()
#rng = format(','.join('[{}]'.format(i) for i in rng))
#rng = pd.date_range(end = datetime.today().date(), periods = 5)
print (rng)
['2022-07-26', '2022-07-27', '2022-07-28', '2022-07-29', '2022-07-30']
select * from (
select count(*) as CNT, COL1,CONVERT(VARCHAR,CAST(CREATED_DATE AS date)) as dt from TABLE1
group by COL1,CONVERT(VARCHAR,CAST(CREATED_DATE AS date))
)t
PIVOT
(
sum(CNT) for [dt] in (:rng)
) AS PivotTable
* mssql+pymssql://---
(pymssql._pymssql.ProgrammingError) (102, b"Incorrect syntax near '('.DB-Lib error message 20018, severity 15:\nGeneral SQL Server error: Check messages from the SQL Server\n")
[SQL: select * from (
select count(*) as CNT, COL1,CONVERT(VARCHAR,CAST(CREATED_DATE AS date)) as dt from TABLE1
group by COL1,CONVERT(VARCHAR,CAST(CREATED_DATE AS date))
)t
PIVOT
(
sum(CNT) for [dt] in (%(rng)s)
) AS PivotTable]
[parameters: {'rng': ['2022-07-26', '2022-07-27', '2022-07-28', '2022-07-29', '2022-07-30']}]
(Background on this error at: https://sqlalche.me/e/14/f405)
Any ideas on how I can achieve this? I will try creating the entire query dynamically, but it would be much better if I could pass the dates alone into the query.
Thanks for your time.
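For what it's worth, the IN list of a PIVOT names result columns rather than values, so it cannot be bound as a parameter; the usual workaround is the dynamic-SQL route mentioned above, splicing the dates in as quoted identifiers. A minimal T-SQL sketch of that idea (the date list is hard-coded here for illustration):
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);
-- build '[2022-07-26],[2022-07-27],...' from the date list
SELECT @cols = STUFF((
    SELECT ',' + QUOTENAME(d)
    FROM (VALUES ('2022-07-26'), ('2022-07-27')) AS dates(d)
    FOR XML PATH('')
), 1, 1, '');
SET @sql = N'select * from (
    select count(*) as CNT, COL1,
           CONVERT(VARCHAR, CAST(CREATED_DATE AS date)) as dt
    from TABLE1
    group by COL1, CONVERT(VARCHAR, CAST(CREATED_DATE AS date))
) t
PIVOT ( sum(CNT) for [dt] in (' + @cols + ') ) AS PivotTable;';
EXEC sp_executesql @sql;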

Classic banking task

I think this is a common task in the banking area.
I need to fill the 'Income' column with previous values from 'Outcome'. But every 'Outcome' value is calculated as Outcome = Income + Debit - Credit from the current row.
I guess I should use lag() for 'Income', but that creates a cycle in the calculation.
I hope this can help:
create table account(acc_date date,income int, debit int, credit int, outcome int);
insert into account values('2021-01-01', 100,800,500,400),
('2021-02-01', null,900,1500,null),
('2021-03-01', null,1700,2000,null),
('2021-04-01', null,2100,2800,null),
('2021-05-01', null,3500,4000,null);
select * from account;
Untested, but by using sum() over () and coalesce() in concert with lag() over (). Because each row's Income is the previous row's Outcome, the running sum of isnull(Income, 0) + Debit - Credit telescopes into the current row's Outcome:
with cte as (
    select acc_date, income, debit, credit
         , outcome = sum(isnull(income, 0) + debit - credit) over (order by acc_date)
    from account
)
select acc_date
     , income = coalesce(income, lag(outcome, 1) over (order by acc_date))
     , debit
     , credit
     , outcome
from cte;
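Against the sample rows above, this should produce (each Income picking up the previous row's Outcome):
acc_date    income  debit  credit  outcome
2021-01-01  100     800    500     400
2021-02-01  400     900    1500    -200
2021-03-01  -200    1700   2000    -500
2021-04-01  -500    2100   2800    -1200
2021-05-01  -1200   3500   4000    -1700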

calculate time difference of consecutive row dates in SQL

Hello, I am trying to calculate the time difference between the dates of two consecutive rows (either in hours or in days), as attached in the image.
Highlighted in yellow is the result I want, which is basically the difference between the date in a row and the one above it.
How can we achieve this in SQL? Attached is my complex code, which has the rest of the fields in it:
with cte
as
(
select m.voucher_no,
       CONVERT(VARCHAR(30), CONVERT(datetime, f.action_Date, 109), 100) as action_date,
       f.col1_Value, f.col3_value, f.col4_value, f.comments,
       f.distr_user, f.wf_status, f.action_code, f.wf_user_id
from attdetailmap m
LEFT JOIN awftaskfin f ON f.oid = m.oid and f.client ='PC'
where f.action_Date !='' and action_date between '$?datef' and '$?datet'
),
cte2 as
(
select *, ROW_NUMBER() OVER(PARTITION BY voucher_no,action_Date,distr_user,wf_Status,wf_user_id order by voucher_no ) as row_no_1 from cte
)
select distinct v.dim_value as resid, c.voucher_no,
       CONVERT(datetime, c.action_Date, 109) as action_Date,
       c.col4_value, c.comments, c.distr_user, v.description, c.wf_status, c.action_code,
       c.wf_user_id, v1.description as name, r.rel_value as pay_office, r1.rel_value as site
from cte2 c
LEFT OUTER JOIN aagviuserdetail v ON v.user_id = c.distr_user
LEFT OUTER JOIN aagviuserdetail v1 ON v1.user_id = c.wf_user_id
LEFT OUTER JOIN ahsrelvalue r ON r.resource_id = v.dim_Value and r.rel_Attr_id = 'P1' and r.period_to = '209912'
LEFT OUTER JOIN ahsrelvalue r1 ON r1.resource_id = v.dim_Value and r1.rel_Attr_id = 'Z1' and r1.period_to = '209912'
where c.row_no_1 = '1' and r.rel_value like '$?site1' and voucher_no like '$?trans'
order by voucher_no,action_Date
The key idea is lag(). However, date/time functions vary among databases. So, the idea is:
select t.*,
(date - lag(date) over (partition by transaction_no order by date)) as diff
from t;
I should note that this exact syntax might not work in your database, because the minus operator may not even be defined on date/time values. However, lag() is a standard function and should be available.
For instance, in SQL Server, this would look like:
select t.*,
datediff(second, lag(date) over (partition by transaction_no order by date), date) / (24.0 * 60 * 60) as diff_days
from t;
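Applied to the question's own cte2 (a sketch: it converts the varchar action_date back to datetime, and assumes the difference is wanted per voucher_no):
select c.voucher_no,
       convert(datetime, c.action_date, 109) as action_date,
       datediff(hour,
                lag(convert(datetime, c.action_date, 109))
                    over (partition by c.voucher_no order by convert(datetime, c.action_date, 109)),
                convert(datetime, c.action_date, 109)) as diff_hours
from cte2 c;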

Calculate Average time spend based on a change in location zone

I have a table similar to
create table LOCHIST
(
RES_ID VARCHAR(10) NOT NULL,
LOC_DATE DATETIME NOT NULL, -- DATETIME rather than TIMESTAMP; in SQL Server, TIMESTAMP means rowversion
LOC_ZONE VARCHAR(10)
)
with values such as
insert into LOCHIST values('0911','2015-09-23 12:27:00.000000','SYLVSYLGA');
insert into LOCHIST values('5468','2013-02-15 13:13:24.000000','30726');
insert into LOCHIST values('23894','2013-02-15 13:12:13.000000','BECTFOUNC');
insert into LOCHIST values('24119','2013-02-15 13:12:09.000000','30363');
insert into LOCHIST values('7101','2013-02-15 13:11:37.000000','37711');
insert into LOCHIST values('26083','2013-02-15 13:11:36.000000','SHAWANDAL');
insert into LOCHIST values('24978','2013-02-15 13:11:36.000000','38132');
insert into LOCHIST values('26696','2013-02-15 13:11:27.000000','29583');
insert into LOCHIST values('5468','2013-02-15 13:11:00.000000','37760');
insert into LOCHIST values('5552','2013-02-15 13:10:55.000000','30090');
insert into LOCHIST values('24932','2013-02-15 13:10:48.000000','JBTTLITGA');
insert into LOCHIST values('23894','2013-02-15 13:10:42.000000','47263');
insert into LOCHIST values('26803','2013-02-15 13:10:25.000000','32534');
insert into LOCHIST values('24434','2013-02-15 13:10:03.000000','PLANSUFVA');
insert into LOCHIST values('26696','2013-02-15 13:10:00.000000','GEORALBGA');
insert into LOCHIST values('5468','2013-02-15 13:09:54.000000','19507');
insert into LOCHIST values('23894','2013-02-15 13:09:48.000000','37725');
This table literally goes on for millions of records.
Each RES_ID represents the ID of a trailer who pings their location to a LOC_ZONE which is then stored at the time in LOC_DATE.
What I am trying to find is the average amount of time spent by all trailers in a specific location zone. For example, if trailer x spent 4 hours in loc zone PLANSUFVA and trailer y spent 6 hours in loc zone PLANSUFVA, I would want to return:
Loc Zone    Avg Time
PLANSUFVA   5
Is there any way to do this without cursors?
I really appreciate your help.
This needs SQL 2012:
with data as (
    select *,
           (case when LOC_ZONE != PREVIOUS_LOC_ZONE or PREVIOUS_LOC_ZONE is null
                 then ROW_ID else null end) as STAY_START,
           (case when LOC_ZONE != NEXT_LOC_ZONE or NEXT_LOC_ZONE is null
                 then ROW_ID else null end) as STAY_END
    from (
        select RES_ID, LOC_ZONE, LOC_DATE,
               lead(LOC_DATE, 1) over (partition by RES_ID, LOC_ZONE order by LOC_DATE) as NEXT_LOC_DATE,
               lag(LOC_ZONE, 1) over (partition by RES_ID order by LOC_DATE) as PREVIOUS_LOC_ZONE,
               lead(LOC_ZONE, 1) over (partition by RES_ID order by LOC_DATE) as NEXT_LOC_ZONE,
               ROW_NUMBER() over (order by RES_ID, LOC_ZONE, LOC_DATE) as ROW_ID
        from LOCHIST
    ) t
), stays as (
    select * from (
        select RES_ID, LOC_ZONE, STAY_START,
               lead(STAY_END, 1) over (order by ROWID) as STAY_END
        from (
            select RES_ID, LOC_ZONE, STAY_START, STAY_END,
                   ROW_NUMBER() over (order by RES_ID, LOC_ZONE, STAY_START desc) as ROWID
            from data
            where STAY_START is not null or STAY_END is not null
        ) t
    ) t
    where STAY_START is not null and STAY_END is not null
)
select s.LOC_ZONE, avg(datediff(second, LOC_DATE, NEXT_LOC_DATE)) / 60 / 60 as AVG_IN_HOURS
from data d
inner join stays s
    on d.RES_ID = s.RES_ID and d.LOC_ZONE = s.LOC_ZONE
   and d.ROW_ID >= s.STAY_START and d.ROW_ID < s.STAY_END
group by s.LOC_ZONE
To solve this problem, you need the amount of time spent at each location.
One way to do this is with a correlated subquery. You need to group adjacent values. The idea is to find the next value in the sequence:
select lh.res_id, min(loc_zone) as loc_zone, min(loc_date) as StartTime,
max(loc_date) as EndTime,
nextdate as NextStartTime
from (select lh.*,
(select min(loc_date) from lochist lh2
where lh2.res_id = lh.res_id and lh2.loc_zone <> lh.loc_zone and
lh2.loc_date > lh.loc_date
) as nextdate
from lochist lh
) lh
group by lh.res_id, nextdate
With this data, you can then get the average that you want.
I am not clear whether the time should be based on EndTime - StartTime (the last recorded time at the location minus the first) or NextStartTime - StartTime (the first time at the next location minus the first time at this one).
Also, this returns NULL for the last location for each res_id. You don't say what to do about the last in the sequence.
If you build an index on res_id, loc_date, loc_zone, it might run faster.
If you had Oracle or SQL Server 2012, the right query is:
select lh.*,
lead(loc_date) over (partition by res_id order by loc_date) as nextdate
from (select lh.*,
lag(loc_zone) over (partition by res_id order by loc_date) as prevzone
from lochist lh
) lh
where prevzone is null or prevzone <> loc_zone
Now you have one row per stay and nextdate is the date at the next zone.
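To finish it off, the averaging step could look something like this (a sketch wrapping the query above; rows with no next stay are dropped):
with stays as (
    select lh.*,
           lead(loc_date) over (partition by res_id order by loc_date) as nextdate
    from (select lh.*,
                 lag(loc_zone) over (partition by res_id order by loc_date) as prevzone
          from lochist lh
         ) lh
    where prevzone is null or prevzone <> loc_zone
)
select loc_zone,
       avg(datediff(minute, loc_date, nextdate) / 60.0) as avg_hours
from stays
where nextdate is not null
group by loc_zone;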
This should get you each zone ordered by the average number of minutes spent in it. The CROSS APPLY returns the next ping in a different zone; note that each trailer's final stay has no such ping and is therefore excluded from the average.
SELECT
loc.LOC_ZONE
,AVG(DATEDIFF(mi,loc.LOC_DATE,nextPing.LOC_DATE)) AS avgMinutes
FROM LOCHIST loc
CROSS APPLY(
SELECT TOP 1 loc2.LOC_DATE
FROM LOCHIST loc2
WHERE loc2.RES_ID = loc.RES_ID
AND loc2.LOC_DATE > loc.LOC_DATE
AND loc2.LOC_ZONE <> loc.LOC_ZONE
ORDER BY loc2.LOC_DATE ASC
) AS nextPing
GROUP BY loc.LOC_ZONE
ORDER BY avgMinutes DESC
My variation of the solution:
select LOC_ZONE, avg(TOTAL_TIME) AVG_TIME from (
select RES_ID, LOC_ZONE, sum(TIME_SPENT) TOTAL_TIME
from (
select RES_ID, LOC_ZONE, datediff(mi, lag(LOC_DATE, 1) over (
partition by RES_ID order by LOC_DATE), LOC_DATE) TIME_SPENT
from LOCHIST
) t
where TIME_SPENT is not null
group by RES_ID, LOC_ZONE) f
group by LOC_ZONE
This accounts for multiple stays at the same location. The choice between lag and lead depends on whether a stay should start or end with the ping (i.e., if one trailer sends a ping from A and then x hours later from B, does that time count for A or for B?).
To do this without using either a cursor or a correlated subquery, try:
with rl as
(select l.*, rank() over (partition by res_id order by loc_date) rn
from lochist l),
fdr as
(select rc.*, coalesce(rn.loc_date, getdate()) next_date
from rl rc
left join rl rn on rc.res_id = rn.res_id and rc.rn + 1 = rn.rn)
select loc_zone, avg(datediff(second, loc_date, next_date))/3600 avg_time
from fdr
group by loc_zone
SQLFiddle here.
(Because of the way that SQL Server calculates time differences, it's probably better to calculate the average time in seconds and then divide by 60*60. With the exception of the getdate() and datediff clauses, which can be replaced by sysdate and next_date - loc_date, this should work in both SQL Server 2005 onwards and Oracle 10g onwards.)

How to find the row where the sum of all values in a column reaches a specified value?

Given data in a table with the following schema:
CREATE TABLE purchases (timestamp DATETIME, quantity INT)
I would like to find the point in time (i.e. the timestamp of the row) where the sum of the values in the quantity column passed a certain threshold value.
This is in MS SQL Server, and ideally I'd like to avoid using a cursor if possible.
SELECT timestamp, SUM(quantity)
FROM purchases
GROUP BY timestamp
HAVING SUM(quantity) > someValue
Or, if it is a running sum (using TOP 1 rather than LIMIT, since this is SQL Server), a triangular self-join works:
SELECT TOP 1 a1.timestamp
FROM purchases a1, purchases a2
WHERE a2.timestamp <= a1.timestamp
GROUP BY a1.timestamp
HAVING SUM(a2.quantity) >= someValue
ORDER BY a1.timestamp ASC
You could get the smallest timestamp where the sum of the previous values is larger than the threshold:
select min(timestamp)
from purchases p
where (
select sum(x.quantity)
from purchases x
where x.timestamp < p.timestamp
) > @threshold
However, this is not a very efficient query, so it might be better to use a cursor after all.
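On SQL Server 2012 or later you could also avoid both the cursor and the self-join with a windowed running sum; a sketch, assuming the threshold is held in a variable @threshold:
select top 1 timestamp
from (
    select timestamp,
           sum(quantity) over (order by timestamp
                               rows between unbounded preceding and current row) as running
    from purchases
) t
where running >= @threshold
order by timestamp;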
In SQL Server 2005+ you could try this:
;WITH numbered AS (
    SELECT
        timestamp,
        quantity,
        rownum = ROW_NUMBER() OVER (ORDER BY timestamp)
    FROM purchases
),
recursive AS (
    SELECT
        timestamp,
        quantity,
        rownum,
        runningsum = quantity,
        passed = CASE WHEN quantity < @threshold THEN 0 ELSE 1 END
    FROM numbered
    WHERE rownum = 1          -- anchor: start the recursion at the first row only
    UNION ALL
    SELECT
        n.timestamp,
        n.quantity,
        n.rownum,
        runningsum = n.quantity + r.runningsum,
        passed = CASE WHEN n.quantity + r.runningsum < @threshold THEN 0 ELSE 1 END
    FROM numbered n
    INNER JOIN recursive r ON n.rownum = r.rownum + 1
)
SELECT MIN(timestamp)
FROM recursive
WHERE passed = 1
Basically the same as @Guffa's solution; it just uses CTEs to avoid the need for a triangular join.