SQL: Running Total for identical transactions Without Using ROWS UNBOUNDED PRECEDING

I am trying to calculate a running total of cab fare earned by a driver on a particular day. I originally tested this on Netezza and am now trying to code it in Spark SQL.
However, for rows with the same (driver, day) key, if the fare values are identical, the running_total column always shows the final sum! When all the fares are distinct, it is calculated perfectly. Is there any way to achieve this (in ANSI SQL or the Spark DataFrame API) without using rowsBetween(start, end)?
Sample data:
driver_id | date_id    | fare
10001     | 2017-07-27 | 500
10001     | 2017-07-27 | 500
10001     | 2017-07-30 | 500
10001     | 2017-07-30 | 1500
The SQL query I ran to calculate the running total:
select driver_id, date_id, fare,
       sum(fare) over (partition by date_id, driver_id
                       order by date_id, fare) as run_tot_fare
from trip_info
order by 2
Result:
driver_id | date_id    | fare | run_tot_fare
10001     | 2017-07-27 | 500  | 1000   -- showing final total, expecting 500
10001     | 2017-07-27 | 500  | 1000
10001     | 2017-07-30 | 500  | 500    -- no problem here
10001     | 2017-07-30 | 1500 | 2000
If anybody can kindly let me know what I am doing wrong, and whether it is achievable without using ROWS UNBOUNDED PRECEDING / rowsBetween(b, e), I would highly appreciate it. Thanks in advance.

The traditional solution in SQL is to use range instead of rows:
select driver_id, date_id, fare ,
sum(fare) over (partition by date_id, driver_id
order by date_id, fare
range between unbounded preceding and current row
) as run_tot_fare
from trip_info
order by 2;
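To see the difference concretely, here is a runnable sketch using Python's sqlite3 module (window functions need SQLite 3.25 or later), loading the sample data and computing the sum under both frame types. With RANGE, the tied 500 fares on 2017-07-27 are peers and share one total; ROWS advances one row at a time:

```python
import sqlite3

# Sketch in SQLite; table and column names mirror the question's trip_info.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trip_info (driver_id INT, date_id TEXT, fare INT)")
con.executemany(
    "INSERT INTO trip_info VALUES (?, ?, ?)",
    [(10001, "2017-07-27", 500),
     (10001, "2017-07-27", 500),
     (10001, "2017-07-30", 500),
     (10001, "2017-07-30", 1500)],
)

rows = con.execute("""
    SELECT driver_id, date_id, fare,
           SUM(fare) OVER (PARTITION BY date_id, driver_id
                           ORDER BY fare
                           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               AS range_total,   -- peers (tied fares) share one total
           SUM(fare) OVER (PARTITION BY date_id, driver_id
                           ORDER BY fare
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               AS rows_total     -- one row at a time
    FROM trip_info
    ORDER BY date_id, fare
""").fetchall()
for r in rows:
    print(r)
```

On the 2017-07-27 pair, range_total is 1000 for both rows, while rows_total gives 500 and then 1000 (the order between the two identical rows is arbitrary).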
Absent that, use two levels of window functions, or an aggregation and a join:
select driver_id, date_id, fare,
max(run_tot_fare_temp) over (partition by date_id, driver_id ) as run_tot_fare
from (select driver_id, date_id, fare ,
sum(fare) over (partition by date_id, driver_id
order by date_id, fare
) as run_tot_fare_temp
from trip_info ti
) ti
order by 2;
(The max() assumes the fares are never negative.)


Past 7 days running amounts average as progress per each date

The query is simple, but I am facing issues implementing the SQL logic. Suppose I have records like:
Phoneno | Company | Date     | Amount
83838   | xyz     | 20210901 | 100
87337   | abc     | 20210902 | 500
47473   | cde     | 20210903 | 600
The expected output is the past-7-day progress as a running average of amount for each date (the current date and the 6 days before it):
Date     | amount | avg
20210901 | 100    | 100
20210902 | 500    | 300
20210903 | 600    | 400
I tried
Select date, amount, select
avg(lg) from (
Select case when lag(amount)
Over (order by NULL) IS NULL
THEN AMOUNT
ELSE
lag(amount)
Over (order by NULL) END AS LG)
From table
WHERE DATE>=t.date-7) as avg
From table t;
But I am getting wrong avg values. Could anyone please help?
Note: I've tried without lag too, and it gives the wrong averages as well.
You could use a self-join to group the dates:
select distinct
a.dt,
b.dt as preceding_dt, --just for QA purpose
a.amt,
b.amt as preceding_amt,--just for QA purpose
avg(b.amt) over (partition by a.dt) as avg_amt
from t a
join t b on a.dt-b.dt between 0 and 6
group by a.dt, b.dt, a.amt, b.amt; --to dedupe the data after the join
If you want to make your correlated subquery approach work, you don't really need the lag.
select dt,
amt,
(select avg(b.amt) from t b where a.dt-b.dt between 0 and 6) as avg_lg
from t a;
If you don't have multiple rows per date, this gets even simpler
select dt,
amt,
avg(amt) over (order by dt rows between 6 preceding and current row) as avg_lg
from t;
Also, the condition DATE >= t.date - 7 you used is only bounded on one side, meaning it will qualify a lot of dates that shouldn't be qualified.
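As a runnable sketch of the one-row-per-date variant (here in SQLite 3.25+ via Python's sqlite3; the table and column names follow the answer's t/dt/amt), using the question's sample amounts:

```python
import sqlite3

# ROWS 6 PRECEDING assumes exactly one row per date and no calendar gaps.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (dt TEXT, amt REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("2021-09-01", 100), ("2021-09-02", 500), ("2021-09-03", 600)])

rows = con.execute("""
    SELECT dt, amt,
           AVG(amt) OVER (ORDER BY dt
                          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_lg
    FROM t
    ORDER BY dt
""").fetchall()
for r in rows:
    print(r)   # averages: 100.0, 300.0, 400.0, matching the expected output
```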
You can use an analytic function with a windowing clause to get your results:
SELECT DISTINCT BillingDate,
AVG(amount) OVER (ORDER BY BillingDate
RANGE BETWEEN TO_DSINTERVAL('7 00:00:00') PRECEDING
AND TO_DSINTERVAL('0 00:00:00') FOLLOWING) AS RUNNING_AVG
FROM accounts
ORDER BY BillingDate;

Restrict LAG to specific row condition

I am using the following query to return the percentage difference between this month and last month for a given Site ID.
SELECT
reporting_month,
total_revenue,
invoice_count,
--total_revenue_prev,
--invoice_count_prev,
ROUND(SAFE_DIVIDE(total_revenue,total_revenue_prev)-1,4) AS actual_growth,
site_name
FROM (
SELECT DATE_TRUNC(table.date, MONTH) AS reporting_month,
ROUND(SUM(table.revenue),2) AS total_revenue,
COUNT(*) AS invoice_count,
ROUND(IFNULL(
LAG(SUM(table.revenue)) OVER (ORDER BY MIN(DATE_TRUNC(table.date, MONTH))) ,
0),2) AS total_revenue_prev,
IFNULL(
LAG(COUNT(*)) OVER (ORDER BY MIN(DATE_TRUNC(table.date, MONTH))) ,
0) AS invoice_count_prev,
tbl_sites.name AS site_name
FROM table
LEFT JOIN tbl_sites ON tbl_sites.id = table.site
WHERE table.site = '123'
GROUP BY site_name, reporting_month
ORDER BY reporting_month
)
This is working correctly, printing:
reporting_month         | total_revenue | invoice_count | actual_growth | site_name
2020-11-01 00:00:00 UTC | 100.00        | 10            | 0.571         | SiteNameString
2020-12-01 00:00:00 UTC | 125.00        | 7             | 0.2500        | SiteNameString
However I would like to be able to run the same query for all sites. When I remove WHERE table.site = '123' from the subquery, I assume it is the use of LAG that is making the numbers report incorrectly. Is there a way to restrict the LAG to the 'current' row site?
You can simply add a PARTITION BY clause to the window in your LAG statement:
LAG(SUM(table.revenue)) OVER (PARTITION BY table.site ORDER BY MIN(DATE_TRUNC(table.date, MONTH)))
Here is the related BigQuery documentation page
"PARTITION BY: Breaks up the input rows into separate partitions, over which the analytic function is independently evaluated."
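A minimal sketch of the effect (SQLite 3.25+ via Python's sqlite3; the site ids, the table name monthly, and the revenue figures are made up, with revenue already aggregated per month):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly (site TEXT, month TEXT, revenue REAL)")
con.executemany("INSERT INTO monthly VALUES (?, ?, ?)", [
    ("123", "2020-11", 100.0), ("123", "2020-12", 125.0),
    ("456", "2020-11", 50.0),  ("456", "2020-12", 40.0),
])

rows = con.execute("""
    SELECT site, month, revenue,
           -- PARTITION BY restarts the window for each site, so the lag
           -- never reaches across site boundaries
           LAG(revenue) OVER (PARTITION BY site ORDER BY month) AS revenue_prev
    FROM monthly
    ORDER BY site, month
""").fetchall()
for r in rows:
    print(r)
```

Without the PARTITION BY, the first row of site 456 would pick up site 123's December revenue as its "previous" value; with it, each site's first month correctly gets NULL.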

Error in implementing window function in SQL

I have a table as below:
customer_ID | date      | expense_transactions
BS:100331   | 4/30/2012 | 177.43
BS:100331   | 5/31/2012 | 96.9
BS:100331   | 6/30/2012 | 81.31
BS:100331   | 7/31/2012 | 98.13
BS:100331   | 8/31/2012 | 99.95
BS:100699   | 4/30/2012 | 403.99
BS:100699   | 5/31/2012 | 0
BS:100699   | 6/30/2012 | 3.24
BS:100699   | 7/31/2012 | 11.02
BS:100699   | 8/31/2012 | 11.27
My expected output is shown in the column expense_transactions_3_month_max. To arrive at this column, we first shift expense_transactions down by one row, as shown in expense_transactions_shifted, and then calculate the max over 3 rows, where 3 is the window size.
customer_ID | date      | expense_transactions | expense_transactions_shifted | expense_transactions_3_month_max
BS:100331   | 4/30/2012 | 177.43 |        |
BS:100331   | 5/31/2012 | 96.9   | 177.43 |
BS:100331   | 6/30/2012 | 81.31  | 96.9   |
BS:100331   | 7/31/2012 | 98.13  | 81.31  | 177.43
BS:100331   | 8/31/2012 | 99.95  | 98.13  | 98.13
BS:100699   | 4/30/2012 | 403.99 |        |
BS:100699   | 5/31/2012 | 0      | 403.99 |
BS:100699   | 6/30/2012 | 3.24   | 0      |
BS:100699   | 7/31/2012 | 11.02  | 3.24   | 403.99
BS:100699   | 8/31/2012 | 11.27  | 11.02  | 11.02
I have tried the SQL query below, but I am not sure where I am going wrong.
WITH shifted AS
(
SELECT
customer_ID, date,
LAG(expense_transactions, 1) OVER (PARTITION BY customer_ID ORDER BY customer_ID ASC) AS shiftedBy1Month
FROM
FundsFlowAfterMerge ffam
)
SELECT
customer_ID, date,
MAX(shiftedBy1Month) OVER (PARTITION BY customer_ID, date ORDER BY customer_ID ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Rolling3Window
FROM
shifted
Is my approach correct? I am getting the error below for the above query:
SQL Error [2809] [S0001]: The request for procedure 'FundsFlowAfterMerge' failed because 'FundsFlowAfterMerge' is a table object
Your current query is partitioning and ordering by the wrong columns.
Your LAG says partition by customer_ID order by customer_ID ASC, which means it will get an arbitrary result for each customer_ID.
Your MAX says PARTITION BY customer_ID, date order by customer_ID ASC rows between 2 PRECEDING and CURRENT row, which means that each individual date is its own partition.
Furthermore, you seem to want a result only when you actually have 3 rows, so you should take that into account.
You can also write this a bit shorter: ROWS 2 PRECEDING is short for ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, LAG defaults to the previous row, and ASC is the default ordering.
with shifted as (
SELECT
customer_ID,
date,
lag(expense_transactions) over
(partition by customer_ID order by date) as shiftedBy1Month
from FundsFlowAfterMerge ffam
)
select
customer_ID,
date,
CASE WHEN LAG(shiftedBy1Month, 2) OVER
(PARTITION BY customer_ID order by date) IS NOT NULL
THEN max(shiftedBy1Month) over
(PARTITION BY customer_ID order by date ROWS 2 PRECEDING)
END as Rolling3Window
FROM shifted
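The fixed query can be checked end to end. Below is a sketch in SQLite (3.25+) via Python's sqlite3 on the first customer's rows; note it uses ISO dates (2012-04-30) so that ORDER BY date sorts chronologically, which the question's 4/30/2012 text format would not:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE FundsFlowAfterMerge
               (customer_ID TEXT, date TEXT, expense_transactions REAL)""")
con.executemany("INSERT INTO FundsFlowAfterMerge VALUES (?, ?, ?)", [
    ("BS:100331", "2012-04-30", 177.43),
    ("BS:100331", "2012-05-31", 96.9),
    ("BS:100331", "2012-06-30", 81.31),
    ("BS:100331", "2012-07-31", 98.13),
    ("BS:100331", "2012-08-31", 99.95),
])

rows = con.execute("""
    WITH shifted AS (
        SELECT customer_ID, date,
               LAG(expense_transactions) OVER
                   (PARTITION BY customer_ID ORDER BY date) AS shiftedBy1Month
        FROM FundsFlowAfterMerge
    )
    SELECT customer_ID, date,
           -- only emit a max once 3 shifted rows exist
           CASE WHEN LAG(shiftedBy1Month, 2) OVER
                    (PARTITION BY customer_ID ORDER BY date) IS NOT NULL
                THEN MAX(shiftedBy1Month) OVER
                    (PARTITION BY customer_ID ORDER BY date ROWS 2 PRECEDING)
           END AS Rolling3Window
    FROM shifted
    ORDER BY date
""").fetchall()
for r in rows:
    print(r)
```

The Rolling3Window column comes out as NULL, NULL, NULL, 177.43, 98.13, matching the expected table for customer BS:100331.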
It's not very elegant, but you could just code:
greatest(expense_transactions, lag(expense_transactions, 1) over *blah blah*, lag(expense_transactions, 2) over *same window*)
If your SQL flavor does not include a greatest function, use the more verbose case syntax instead. It is inelegant because it is hard to generalize to n-month intervals, but it has the advantage of being accomplished within a simple, non-recursive select statement.

SQL query to compare a sales rep metrics with a quartile average captured in another table

I have written a SQL query that calculates my sales quartiles over the past three months for all sales representatives; it is captured in a temp table in a stored procedure like this:
Quartile value of all sales representatives for past three months:
Date    | 25th% | 50th% | 75th% | 100th%
10/2020 | 88.89 | 90.00 | 95.00 | 100.00
11/2020 | 85.63 | 91.00 | 96.00 | 100.00
12/2020 | 70.00 | 80.00 | 90.00 | 100.00
Now in another CTE I have the actual values for each sales rep, like this:
SalesRepId | Month   | salesvalue
101        | 10/2020 | 77
101        | 11/2020 | 90
101        | 12/2020 | 100
When I join the CTE and the temp table, the query performance is bad. What is the best way to look up the temp table for a sales value and assign the quartile to my SalesRepId?
Basically, for 10/2020 the sales value 77 is less than the 25th percentile, so the sales rep should be assigned the 25th quartile for the month of October.
Thank you
This is exactly what percentile_disc() and percentile_cont() are for. Unfortunately, these are not aggregation functions, but one method is:
select distinct month,
percentile_disc(0.25) over (partition by month order by salesvalue) as value_25,
percentile_disc(0.50) over (partition by month order by salesvalue) as value_50,
percentile_disc(0.75) over (partition by month order by salesvalue) as value_75
from sales;
If you want to calculate the quartile, the simplest method is ntile():
select s.*,
ntile(4) over (partition by month order by sales)
from sales s;
You don't need to calculate the breaks. The one caveat to ntile() is that the tiles are as close in size as possible. That means that ties can be in different tiles. To solve that, just do the calculation manually:
select s.*,
       ceiling(rank() over (partition by month order by sales) * 4.0 /
               count(*) over (partition by month)
       ) as quartile
from sales s;
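Here is a runnable comparison of ntile() against the manual rank-based quartile (SQLite 3.25+ via Python's sqlite3; the table and values are made up, and since many SQLite builds ship without a ceiling() function, equivalent integer arithmetic stands in for it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (month TEXT, salesvalue INT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("10/2020", v) for v in [10, 20, 20, 30, 40, 50, 60, 70]])

rows = con.execute("""
    SELECT salesvalue,
           NTILE(4) OVER (PARTITION BY month ORDER BY salesvalue) AS tile,
           -- ceiling(rank * 4 / count) via integer division:
           (RANK() OVER (PARTITION BY month ORDER BY salesvalue) * 4
            + COUNT(*) OVER (PARTITION BY month) - 1)
           / COUNT(*) OVER (PARTITION BY month) AS quartile
    FROM sales
    ORDER BY salesvalue
""").fetchall()
for r in rows:
    print(r)
```

The tied values of 20 can land in different ntile() buckets (the buckets must stay equal-sized), while the rank-based calculation always assigns ties the same quartile.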
I updated my query to save the data into a temp table instead of a CTE. Now when I join against the temp table, it's a breeze.
https://www.brentozar.com/archive/2019/06/whats-better-ctes-or-temp-tables/

Finding when requests are met or exceeded by customer by month

I have a table of customers, and I want to find the month in which each customer met or exceeded a certain number of requests.
The table has a customer_id and a timestamp for each request.
What I am looking for is the month (or day) on which the customer met or exceeded 10,000 requests. I've tried to get a running total in place, but it just isn't working for me. I've left it in the code (commented out) in case someone knows how I can do this.
What I have is the following:
SELECT
customer_id
, DATE_TRUNC(CAST(TIMESTAMP_MILLIS(created_timestamp) AS DATE), MONTH) as cMonth
, COUNT(created_timestamp) as searchCount
-- , SUM(COUNT (DISTINCT(created_timestamp))) OVER (ROWS UNBOUNDED PRECEDING) as RunningTotal2
FROM customer_requests.history.all
GROUP BY distributor_id, cMonth
ORDER BY 2 ASC, 1 DESC;
The representation I am after is something like this.
customer | requests | cMonth     | totalRequests
cust1    | 6000     | 2017-10-01 | 6000
cust1    | 4001     | 2017-11-01 | 10001
cust2    | 4000     | 2017-10-01 | 4000
cust2    | 4000     | 2017-11-01 | 8000
cust2    | 4000     | 2017-12-01 | 12000
cust2    | 3000     | 2017-12-01 | 3000
cust2    | 3000     | 2017-12-01 | 6000
cust2    | 3000     | 2017-12-01 | 9000
cust2    | 3000     | 2017-12-01 | 12000
Assuming SQL Server, try this (adjusting the cutoff at the top to get the number of transactions you need; right now it looks for the thousandth transaction per customer).
Note that this will not return customers who have not exceeded your cutoff, and assumes that each transaction has a unique date (or is issued a sequential ID number to break ties if there can be ties on date).
DECLARE @cutoff INT = 1000;
WITH CTE
AS (SELECT customer_id,
transaction_ID,
transaction_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY transaction_date, transaction_ID) AS RN,
COUNT(transaction_ID) OVER (PARTITION BY customer_id) AS TotalTransactions
FROM #test)
SELECT DISTINCT
customer_id,
transaction_date as CutoffTransactionDate,
TotalTransactions
FROM CTE
WHERE RN = @cutoff;
How it works:
ROW_NUMBER assigns a unique sequential identifier to each of a customer's transactions, in the order in which they were made. COUNT gives the total number of transactions the customer made (again assuming one record per transaction; otherwise you would need to calculate this separately, since DISTINCT won't work with the partition).
Then the outer SELECT returns the 1,000th (or however many you specify) row for each customer along with its date and the total for that customer.
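A sketch of the approach in SQLite (3.25+) via Python's sqlite3, with a cutoff of 3 instead of 1000 and made-up transaction data; customer B never reaches the cutoff and so is absent from the result:

```python
import sqlite3

cutoff = 3  # stand-in for the @cutoff variable
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tx
               (customer_id TEXT, transaction_ID INT, transaction_date TEXT)""")
con.executemany("INSERT INTO tx VALUES (?, ?, ?)", [
    ("A", 1, "2021-01-05"), ("A", 2, "2021-01-09"), ("A", 3, "2021-02-01"),
    ("A", 4, "2021-02-14"), ("A", 5, "2021-03-02"),
    ("B", 6, "2021-01-20"), ("B", 7, "2021-02-11"),  # never reaches the cutoff
])

rows = con.execute("""
    WITH CTE AS (
        SELECT customer_id, transaction_date,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY transaction_date, transaction_ID) AS rn,
               COUNT(transaction_ID) OVER (PARTITION BY customer_id) AS total_tx
        FROM tx)
    SELECT customer_id, transaction_date AS cutoff_date, total_tx
    FROM CTE
    WHERE rn = ?
""", (cutoff,)).fetchall()
print(rows)   # customer A crossed the cutoff on their 3rd transaction
```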
This is my solution:
SELECT
customerid
,SUM(requests) sumDay
,created_timestamp
FROM yourTable
GROUP BY
customerid,
created_timestamp
HAVING SUM(requests) >= 10000;
It's pretty simple: you just group according to your needs, sum up the requests, and select the rows that meet your HAVING clause.
If you want a cumulative sum, you can use window functions. In Standard SQL, this looks like:
SELECT customer_id,
       DATE_TRUNC(CAST(TIMESTAMP_MILLIS(created_timestamp) AS DATE), MONTH) as cMonth,
       COUNT(*) as searchCount,
       SUM(COUNT(*)) OVER (PARTITION BY customer_id ORDER BY MIN(created_timestamp)) as runningtotal
FROM customer_requests.history.all
GROUP BY customer_id, cMonth
ORDER BY 2 ASC, 1 DESC;
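A minimal sketch of the cumulative monthly count (SQLite 3.25+ via Python's sqlite3; the requests table and the customer_id/created_date names are illustrative, strftime stands in for BigQuery's DATE_TRUNC, and the aggregation is pushed into a subquery so the window function reads already-grouped rows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE requests (customer_id TEXT, created_date TEXT)")
con.executemany("INSERT INTO requests VALUES (?, ?)", [
    ("cust1", "2017-10-03"), ("cust1", "2017-10-21"),
    ("cust1", "2017-11-02"), ("cust1", "2017-11-15"), ("cust1", "2017-11-30"),
    ("cust2", "2017-10-12"),
])

rows = con.execute("""
    SELECT customer_id, cmonth, search_count,
           -- per-customer running total over the monthly counts
           SUM(search_count) OVER (PARTITION BY customer_id
                                   ORDER BY cmonth) AS running_total
    FROM (SELECT customer_id,
                 strftime('%Y-%m', created_date) AS cmonth,
                 COUNT(*) AS search_count
          FROM requests
          GROUP BY customer_id, cmonth)
    ORDER BY customer_id, cmonth
""").fetchall()
for r in rows:
    print(r)
```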