Error in implementing window function in SQL

I have a table as below:
customer_ID    date         expense_transactions
BS:100331      4/30/2012    177.43
BS:100331      5/31/2012    96.9
BS:100331      6/30/2012    81.31
BS:100331      7/31/2012    98.13
BS:100331      8/31/2012    99.95
BS:100699      4/30/2012    403.99
BS:100699      5/31/2012    0
BS:100699      6/30/2012    3.24
BS:100699      7/31/2012    11.02
BS:100699      8/31/2012    11.27
My expected output is shown in the column expense_transactions_3_month_max. To arrive at this column, we first shift expense_transactions by one row, as shown in expense_transactions_shifted, and then calculate the max value over 3 rows, where 3 is the window size.
customer_ID    date         expense_transactions    expense_transactions_shifted    expense_transactions_3_month_max
BS:100331      4/30/2012    177.43                  NULL                            NULL
BS:100331      5/31/2012    96.9                    177.43                          NULL
BS:100331      6/30/2012    81.31                   96.9                            NULL
BS:100331      7/31/2012    98.13                   81.31                           177.43
BS:100331      8/31/2012    99.95                   98.13                           98.13
BS:100699      4/30/2012    403.99                  NULL                            NULL
BS:100699      5/31/2012    0                       403.99                          NULL
BS:100699      6/30/2012    3.24                    0                               NULL
BS:100699      7/31/2012    11.02                   3.24                            403.99
BS:100699      8/31/2012    11.27                   11.02                           11.02
I have tried using this SQL query but I am not sure where I am going wrong.
WITH shifted AS
(
SELECT
customer_ID, date,
LAG(expense_transactions, 1) OVER (PARTITION BY customer_ID ORDER BY customer_ID ASC) AS shiftedBy1Month
FROM
FundsFlowAfterMerge ffam
)
SELECT
customer_ID, date,
MAX(shiftedBy1Month) OVER (PARTITION BY customer_ID, date ORDER BY customer_ID ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Rolling3Window
FROM
shifted
Is my approach correct? I am getting the following error for the above query:
SQL Error [2809] [S0001]: The request for procedure 'FundsFlowAfterMerge' failed because 'FundsFlowAfterMerge' is a table object

Your current query is partitioning and ordering by the wrong columns.
Your LAG says PARTITION BY customer_ID ORDER BY customer_ID ASC, which means it will pick an arbitrary previous row within each customer_ID.
Your MAX says PARTITION BY customer_ID, date ORDER BY customer_ID ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, which means that each individual date is its own partition.
Furthermore, you seem to only want a result when you actually have 3 rows, so you should take that into account.
You can also write this a bit shorter: ROWS 2 PRECEDING is short for ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, LAG defaults to the previous row, and ASC is the default ordering.
WITH shifted AS (
    SELECT
        customer_ID,
        date,
        LAG(expense_transactions) OVER
            (PARTITION BY customer_ID ORDER BY date) AS shiftedBy1Month
    FROM FundsFlowAfterMerge ffam
)
SELECT
    customer_ID,
    date,
    CASE WHEN LAG(shiftedBy1Month, 2) OVER
              (PARTITION BY customer_ID ORDER BY date) IS NOT NULL
         THEN MAX(shiftedBy1Month) OVER
              (PARTITION BY customer_ID ORDER BY date ROWS 2 PRECEDING)
    END AS Rolling3Window
FROM shifted
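Against the sample data, this reproduces the expense_transactions_3_month_max column in the expected output above, with NULL for the first three months of each customer.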

It's not very elegant, but you could just code
GREATEST(expense_transactions, LAG(expense_transactions, 1) OVER *blah blah*, LAG(expense_transactions, 2) OVER *same window*)
If your SQL flavor does not include a GREATEST function, use the more verbose CASE syntax instead. It is inelegant because it's hard to generalize to n-month intervals, but it has the advantage of being accomplished within a simple, non-recursive SELECT statement.
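A minimal sketch of that idea, adjusted to the shifted window the question asks for (lags 1 through 3, since the expected output first shifts by one month). Note that NULL handling varies by dialect: Oracle's and BigQuery's GREATEST return NULL when any argument is NULL, which matches the empty cells in the expected output, while PostgreSQL's GREATEST ignores NULLs:
SELECT
    customer_ID,
    date,
    -- max of the three previous months' values; with Oracle-style GREATEST
    -- this stays NULL until three prior months exist for the customer
    GREATEST(
        LAG(expense_transactions, 1) OVER (PARTITION BY customer_ID ORDER BY date),
        LAG(expense_transactions, 2) OVER (PARTITION BY customer_ID ORDER BY date),
        LAG(expense_transactions, 3) OVER (PARTITION BY customer_ID ORDER BY date)
    ) AS Rolling3Window
FROM FundsFlowAfterMerge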


Group by month and counting rows for current and all previous months

PostgreSQL 13
Assume a simplified table plans like the following, where there is at least 1 row for every month and sometimes multiple rows on the same day:
id             first_published_at
12345678910    2022-10-01 03:58:55.118
abcd1234efg    2022-10-03 03:42:55.118
jhsdf894hld    2022-10-03 17:34:55.118
aslb83nfys5    2022-09-12 08:17:55.118
My simplified query:
SELECT TO_CHAR(plans.first_published_at, 'YYYY-MM') AS publication_date, COUNT(*)
FROM plans
WHERE plans.first_published_at IS NOT NULL
GROUP BY TO_CHAR(plans.first_published_at, 'YYYY-MM');
This gives me the following result:
publication_date    count
2022-10             3
2022-09             1
But the result I would need for October is 4.
For every month, the count should be an aggregation of the current month and ALL previous months. I would appreciate any insight on how to approach this.
I would use your query as a CTE and run a SELECT that computes a cumulative sum with a window function:
with t as
(
    SELECT TO_CHAR(plans.first_published_at, 'YYYY-MM') AS publication_date,
           COUNT(*) AS cnt
    FROM plans
    WHERE plans.first_published_at IS NOT NULL
    GROUP BY publication_date
)
select publication_date,
       sum(cnt) over (order by publication_date) as "count"
from t
order by publication_date desc;
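Run against the sample rows above, the window sum rolls September's count into October, giving the 4 that the question asks for:
publication_date    count
2022-10             4
2022-09             1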

Average timestamp in one column in BigQuery

I need to find the average timestamp of the orders that came in:
Order_Date
2022-06-02 15:40:00 UTC
2022-06-07 11:01:00 UTC
2022-06-21 10:55:00 UTC
2022-06-23 14:44:00 UTC
Expected outcome: a single averaged Order_Date.
Just apply the AVG() aggregate function over your entire table:
SELECT AVG(Order_Date) AS Avg_Order_Date
FROM yourTable;
An average timestamp is an unusual ask! But anyway, formally you can do the below:
select
timestamp_seconds(cast(avg(unix_seconds(timestamp(Order_date))) as int64)) as average_Order_Date
from your_table
If applied to the sample data in your question, the output is 2022-06-13 19:05:00 UTC (the mean of the four epoch values).
Note: the supported signatures for AVG are AVG(INT64), AVG(FLOAT64), AVG(NUMERIC), AVG(BIGNUMERIC), and AVG(INTERVAL) - that is why you need all these back-and-forth "translations".
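If sub-second precision matters, the same back-and-forth translation works at millisecond granularity; a quick sketch, again assuming the hypothetical your_table:
select
    -- average the epoch milliseconds, then convert back to a TIMESTAMP
    timestamp_millis(cast(avg(unix_millis(timestamp(Order_date))) as int64)) as average_Order_Date
from your_table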
Alternatively, if what you are after is the average time between consecutive orders rather than the average timestamp, you can compare each order with the previous one via LAG (note that for TIMESTAMP columns BigQuery needs timestamp_diff rather than datetime_diff):
WITH CTE as
(
    SELECT Order_Date, LAG(Order_Date, 1) OVER (ORDER BY Order_Date ASC) as Datelag
    FROM table
),
CTE2 as
(
    SELECT Order_Date, timestamp_diff(Order_Date, Datelag, hour) as Datedif
    FROM CTE
)
SELECT AVG(Datedif)
FROM CTE2

SQL to find sum of total days in a window for a series of changes

Following is the table:
start_date    recorded_date    id
2021-11-10    2021-11-01       1a
2021-11-08    2021-11-02       1a
2021-11-11    2021-11-03       1a
2021-11-10    2021-11-04       1a
2021-11-10    2021-11-05       1a
I need a query to find the total day changes in aggregate for a given id. In this case, start_date changed from 10th Nov to 8th Nov, so 2 days; then from 8th to 11th Nov, so 3 days; again from 11th to 10th, so 1 day; and finally from 10th to 10th, so 0 days.
In total there is a change of 2+3+1+0 = 6 days for the id '1a'.
Basically, for each change there is a recorded_date, so we arrange the rows by recorded_date in ascending order and then calculate the aggregate change in days, grouped by id. The final result should look like:
id    Agg_Change
1a    6
Is there a way to do this using SQL? I am using the Vertica database.
Thanks.
You can use the window function LEAD to get the difference between rows and then group by id:
select id, sum(daydiff) Agg_Change
from (
    select id,
           abs(datediff(day, start_date, lead(start_date, 1, start_date) over (partition by id order by recorded_date))) as daydiff
    from tablename
) t
group by id
It is indeed done with LAG() to get the previous date in an OLAP query, plus an outer query that takes the absolute date difference and sums it, grouping by id:
WITH
-- your input - don't use in real query ...
indata(start_date,recorded_date,id) AS (
SELECT DATE '2021-11-10',DATE '2021-11-01','1a'
UNION ALL SELECT DATE '2021-11-08',DATE '2021-11-02','1a'
UNION ALL SELECT DATE '2021-11-11',DATE '2021-11-03','1a'
UNION ALL SELECT DATE '2021-11-10',DATE '2021-11-04','1a'
UNION ALL SELECT DATE '2021-11-10',DATE '2021-11-05','1a'
)
-- real query starts here, replace following comma with "WITH" ...
,
w_lag AS (
SELECT
id
, start_date
, LAG(start_date) OVER w AS prevdt
FROM indata
WINDOW w AS (PARTITION BY id ORDER BY recorded_date)
)
SELECT
id
, SUM(ABS(DATEDIFF(DAY,start_date,prevdt))) AS dtdiff
FROM w_lag
GROUP BY id
-- out id | dtdiff
-- out ----+--------
-- out 1a | 6
I was thinking the LAG function would provide me the answer, but it kept giving me the wrong answer because I had the wrong logic in one place. I have the answer I need:
with cte as (
    select id, start_date, recorded_date,
           row_number() over (partition by id order by recorded_date asc) as idrank,
           lag(start_date, 1) over (partition by id order by recorded_date asc) as prev
    from table_temp
)
select id, sum(abs(date(start_date) - date(prev))) as Agg_Change
from cte
group by 1
If someone has a better solution, please let me know.

Why is the value 1 output for every row in sales rank?

I know that to rank by sales I can just remove the partition; but what is the partition doing that would cause every rank value to be output as 1?
select
    trunc(sales_date, 'MON') as sales_month,
    sum(sales_amount) as Monthly_Sales,
    rank() over (partition by trunc(sales_date, 'MON') order by sum(sales_amount) desc) as Sales_Rank
from s
group by trunc(sales_date, 'MON')
order by 1;
SALES_MON MONTHLY_SALES SALES_RANK
--------- ------------- ----------
01-JAN-15 5600 1
01-FEB-15 50880 1
01-MAR-15 126120 1
01-APR-15 118320 1
01-MAY-15 2280 1
PARTITION BY creates groups for your data in the query. Here you have partitioned, i.e. grouped, your data by month for the ranking, but you have also already grouped the result rows by month, so each partition contains exactly one row and its rank shows as 1.
You have only one record in each month, and you are restarting the ranking enumeration with each month, so you get exactly "1" for each rank. That is how PARTITION BY works: it restarts the enumeration.
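A minimal sketch of the fix the question already hints at - dropping the PARTITION BY so that the months are ranked against each other (table name s as in the question):
select
    trunc(sales_date, 'MON') as sales_month,
    sum(sales_amount) as Monthly_Sales,
    -- one global ranking across all months, highest sales first
    rank() over (order by sum(sales_amount) desc) as Sales_Rank
from s
group by trunc(sales_date, 'MON')
order by 1;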

SQL: Running Total for identical transactions Without Using ROWS UNBOUNDED PRECEDING

I am trying to calculate a running total of "cab fare earned by a driver on a particular day". Originally tested on Netezza and now trying to code it on Spark SQL.
However, if two rows with the structure ((driver, day) --> fare) have an identical 'fare' value, the running_total column always shows the final sum! When all the fares are distinct, it is calculated perfectly. Is there any way to achieve this (in ANSI SQL or a Spark dataframe) without using rowsBetween(start, end)?
Sample data:
driver_id    date_id       fare
10001        2017-07-27    500
10001        2017-07-27    500
10001        2017-07-30    500
10001        2017-07-30    1500
The SQL query I fired to calculate the running total:
select driver_id, date_id, fare,
       sum(fare) over (partition by date_id, driver_id
                       order by date_id, fare) as run_tot_fare
from trip_info
order by 2
Result :
driver_id <<<<>>>> date_id <<<<>>>> fare <<<<>>>> run_tot_fare
10001 2017-07-27 500 1000 --**Showing Final Total expecting 500**
10001 2017-07-27 500 1000
10001 2017-07-30 500 500 --**No problem here**
10001 2017-07-30 1500 2000
If anybody can kindly let me know what I am doing wrong, and whether it is achievable without using ROWS UNBOUNDED PRECEDING / rowsBetween(b, e), I would highly appreciate that. Thanks in advance.
The traditional solution in SQL is to use range instead of rows:
select driver_id, date_id, fare ,
sum(fare) over (partition by date_id, driver_id
order by date_id, fare
range between unbounded preceding and current row
) as run_tot_fare
from trip_info
order by 2;
Absent that, two levels of window functions or an aggregation and join:
select driver_id, date_id, fare,
max(run_tot_fare_temp) over (partition by date_id, driver_id ) as run_tot_fare
from (select driver_id, date_id, fare ,
sum(fare) over (partition by date_id, driver_id
order by date_id, fare
) as run_tot_fare_temp
from trip_info ti
) ti
order by 2;
(The max() assumes the fares are never negative.)
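For context on why the ties collapse in the first place: with an ORDER BY and no explicit frame, the default frame (in Spark as in the ANSI standard) is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, and RANGE treats rows with equal ordering values as peers, so both 500-fare rows on 2017-07-27 receive the combined total. A hedged sketch of another workaround, assuming the table has some unique key to break ties (trip_id here is hypothetical):
select driver_id, date_id, fare,
       -- trip_id (hypothetical unique column) makes every row its own peer group,
       -- so the default RANGE frame behaves like a per-row running total
       sum(fare) over (partition by date_id, driver_id
                       order by fare, trip_id) as run_tot_fare
from trip_info
order by 2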