Getting an error when using LAST_VALUE() in Impala

I am trying to find the last blnc value for each id, but the query throws an error:
AnalysisException: select list expression not produced by aggregation
output (missing from GROUP BY clause?): last_value(blnc) OVER
(PARTITION BY id ORDER BY id date ASC ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) lasted.
SELECT id, number, type,
LAST_VALUE(blnc) OVER (PARTITION BY id ORDER BY date rows between unbounded preceding and unbounded following ) AS lasted ,
to_timestamp(MAX(date),'yyyyMMdd') as end_date,
concat(substr(date,1,6),"01") as start_date,
substr(date,1,6) as id_month
FROM table
GROUP BY id,number,type,concat(substr(date,1,6),"01"),substr(date,1,6)
I also tried adding the whole LAST_VALUE() expression to the GROUP BY, but then another error occurs.

The problem is that your expression:
LAST_VALUE(blnc) OVER (PARTITION BY id
ORDER BY date
rows between unbounded preceding and unbounded following
) AS lasted ,
is evaluated after the aggregation, so it can only reference expressions that still exist at that point: the GROUP BY keys and aggregate results. Neither date nor blnc is one of those. You can fix this by wrapping them in aggregation functions:
LAST_VALUE(MAX(blnc)) OVER (PARTITION BY id
ORDER BY MAX(date)
rows between unbounded preceding and unbounded following
) AS lasted ,
Although this answers your question and fixes the syntax error, it probably doesn't do anything useful. I think you want conditional aggregation. You haven't explained the logic you want or provided sample data, but the idea is:
SELECT id, number, type,
to_timestamp(MAX(date), 'yyyyMMdd') as end_date,
concat(substr(date,1,6),"01") as start_date,
substr(date, 1, 6) as id_month,
MAX(CASE WHEN seqnum = 1 THEN blnc END) as lasted
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id, number, type, concat(substr(date, 1, 6), '01'), substr(date,1,6)
ORDER BY date DESC
) as seqnum
FROM table t
) t
GROUP BY id, number, type, concat(substr(date, 1, 6), '01'), substr(date,1,6)
Note: String operations on dates look wrong. You should be using the built-in date/time functions, if the column is stored correctly.
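A minimal sketch of the conditional-aggregation idea, run against SQLite through Python's sqlite3 module rather than Impala, so the date handling is reduced to substr() and MAX(). The table name balances and the sample rows are made up for illustration; only the columns id, date and blnc come from the question.

```python
import sqlite3

# Hypothetical sample data: date is a 'yyyyMMdd' string, as in the question.
SAMPLE_BALANCES = [
    (1, "20230105", 100),
    (1, "20230120", 150),
    (1, "20230203", 90),
    (2, "20230110", 40),
]

# seqnum = 1 marks the latest row within each (id, month) group,
# so MAX(CASE ...) picks exactly that row's balance.
LAST_BALANCE_QUERY = """
SELECT id,
       substr(date, 1, 6) AS id_month,
       MAX(date) AS end_date,
       MAX(CASE WHEN seqnum = 1 THEN blnc END) AS lasted
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id, substr(date, 1, 6)
                                ORDER BY date DESC) AS seqnum
      FROM balances t
     ) t
GROUP BY id, substr(date, 1, 6)
ORDER BY id, id_month
"""


def last_balance_per_month(rows):
    """Load rows into an in-memory table and return the last balance per id and month."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE balances (id INTEGER, date TEXT, blnc INTEGER)")
        conn.executemany("INSERT INTO balances VALUES (?, ?, ?)", rows)
        return conn.execute(LAST_BALANCE_QUERY).fetchall()
    finally:
        conn.close()


for row in last_balance_per_month(SAMPLE_BALANCES):
    print(row)
```

With this data, id 1 gets 150 for 202301 and 90 for 202302, because the balance is taken from the row with the latest date in each group rather than from the largest balance.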

Related

SQL Rolling LTV (Lifetime Value)

I am trying to get a rolling calculation of customer lifetime value. The basic formula I am using would be 'SUM(revenue) / COUNT(DISTINCT CUSTOMERS)', but I am running into issues when trying to get those numbers cumulatively, from any given day moving backward. I have code below that isn't correct; I had also tried PARTITION code, which didn't work either.
CREATE TEMP TABLE customer_revenue AS
(
SELECT TRUNC(timestamp) AS "order_date", COUNT(DISTINCT customer_email) AS "customers",
SUM(revenue)-SUM(discount)-SUM(shipping)-SUM(tax) AS "revenue"
FROM public.fact_shopify_orders
GROUP BY TRUNC(timestamp)
);
SELECT TRUNC(SO.timestamp) AS "date", SUM(CR.revenue) / COUNT(customers) AS "LTV"
FROM customer_revenue CR
LEFT JOIN public.fact_shopify_orders SO ON CR.order_date = SO.timestamp
WHERE CR.order_date <= SO.timestamp
GROUP BY TRUNC(SO.timestamp)
ORDER BY TRUNC(SO.timestamp) DESC
I think you want rolling sums and count(distinct). The latter is a little tricky but you can emulate it easily using a flag based on the first time the customer is seen:
SELECT date,
( SUM(SUM(net_revenue)) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
SUM(SUM( (seqnum = 1)::int )) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
) as LTV
FROM (SELECT so.*, TRUNC(SO.timestamp) as date,
(revenue - discount - shipping - tax) as net_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY timestamp) as seqnum
FROM public.fact_shopify_orders so
) so
GROUP BY date;
EDIT:
I think Redshift supports window functions with aggregation . . . but there is some database out there that does not. You can try this:
SELECT date,
( SUM(net_revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
SUM(num_firsts) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
) as LTV
FROM (SELECT date, SUM(net_revenue) as net_revenue,
SUM( (seqnum = 1)::int ) as num_firsts
FROM (SELECT so.*, TRUNC(SO.timestamp) as date,
(revenue - discount - shipping - tax) as net_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY timestamp) as seqnum
FROM public.fact_shopify_orders so
) so
GROUP BY date
) so;
Here is a similar version running in Postgres.
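To see the first-seen flag at work, here is a small sketch of the two-level form of the query, run on SQLite through Python's sqlite3 module instead of Redshift. The orders table, its pre-computed net_revenue column and the sample rows are assumptions made for the example, and the (seqnum = 1)::int cast is written as a CASE expression because SQLite has no :: operator.

```python
import sqlite3

# Hypothetical orders: (customer_email, order_date, net_revenue)
SAMPLE_ORDERS = [
    ("a@example.com", "2023-01-01", 100.0),
    ("b@example.com", "2023-01-01", 50.0),
    ("a@example.com", "2023-01-02", 30.0),  # repeat customer, not counted again
    ("c@example.com", "2023-01-03", 60.0),
]

ROLLING_LTV_QUERY = """
SELECT order_date,
       SUM(net_revenue) OVER (ORDER BY order_date
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       / SUM(num_firsts) OVER (ORDER BY order_date
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ltv
FROM (SELECT order_date,
             SUM(net_revenue) AS net_revenue,
             -- count each customer only on the day of their first order
             SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS num_firsts
      FROM (SELECT o.*,
                   ROW_NUMBER() OVER (PARTITION BY customer_email
                                      ORDER BY order_date) AS seqnum
            FROM orders o
           ) o
      GROUP BY order_date
     ) d
ORDER BY order_date
"""


def rolling_ltv(rows):
    """Return (order_date, cumulative revenue / cumulative distinct customers)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (customer_email TEXT, order_date TEXT, net_revenue REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return conn.execute(ROLLING_LTV_QUERY).fetchall()
    finally:
        conn.close()


for row in rolling_ltv(SAMPLE_ORDERS):
    print(row)
```

On the second day the cumulative revenue rises to 180 while the customer count stays at 2, which is the behaviour a running COUNT(DISTINCT ...) would give.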

Redshift - Group Table based on consecutive rows

I am working right now with this table:
What I want to do is clean up this table a bit by grouping consecutive rows together.
Is there any way to achieve this kind of result?
The first table is already working fine, I just want to get rid of some rows to free some disk space.
One method is to peek at the previous row to see when the value changes. Assuming that valid_to and valid_from are really dates:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
sum(case when prev_valid_to >= valid_from + interval '-1 day' then 0 else 1 end) over (partition by id order by valid_to rows between unbounded preceding and current row) as grp
from (select t.*,
lag(valid_to) over (partition by id, class order by valid_to) as prev_valid_to
from t
) t
) t
group by id, class, grp;
If they are not dates, then this gets trickier. You could convert to dates. Or, you could use the difference of row_numbers:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
row_number() over (partition by id order by valid_from) as seqnum,
row_number() over (partition by id, class order by valid_from) as seqnum_2
from t
) t
group by id, class, (seqnum - seqnum_2)
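The difference-of-row-numbers trick can be checked on a tiny data set. The sketch below runs it on SQLite through Python's sqlite3 module; the table name periods and the rows are invented for the example. Note that this variant only looks at row order, so it merges adjacent rows of the same class even if there is a gap between their dates.

```python
import sqlite3

# Hypothetical rows: (id, class, valid_from, valid_to)
SAMPLE_PERIODS = [
    (1, "A", "2023-01-01", "2023-01-31"),
    (1, "A", "2023-02-01", "2023-02-28"),
    (1, "B", "2023-03-01", "2023-03-31"),
    (1, "A", "2023-04-01", "2023-04-30"),
]

# seqnum counts rows per id, seqnum_2 counts rows per (id, class).
# Their difference stays constant while the class does not change,
# so it identifies each run of consecutive rows.
MERGE_QUERY = """
SELECT id, class,
       MIN(valid_from) AS period_start,
       MAX(valid_to) AS period_end
FROM (SELECT p.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY valid_from) AS seqnum,
             ROW_NUMBER() OVER (PARTITION BY id, class ORDER BY valid_from) AS seqnum_2
      FROM periods p
     ) p
GROUP BY id, class, (seqnum - seqnum_2)
ORDER BY id, period_start
"""


def merge_consecutive(rows):
    """Collapse consecutive rows with the same id and class into one period."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE periods (id INTEGER, class TEXT, valid_from TEXT, valid_to TEXT)"
        )
        conn.executemany("INSERT INTO periods VALUES (?, ?, ?, ?)", rows)
        return conn.execute(MERGE_QUERY).fetchall()
    finally:
        conn.close()


for row in merge_consecutive(SAMPLE_PERIODS):
    print(row)
```

The two leading class A rows collapse into one period, while the later class A row stays separate because class B sits between them.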

Last_Value returning the current value

I am not able to get the last value; my code below just returns the same value as the current row in Snowflake. Does anyone have any idea? Is there something glaringly wrong?
select MNTH,
sum_cust,
last_value(sum_cust) over (partition by MNTH order by sum_cust desc ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum_cust_last
from block_2;
I think what you actually want is to LAG the value from the previous MNTH:
SELECT MNTH,
sum_cust,
LAG(sum_cust) OVER (ORDER BY MNTH) AS sum_cust_last
FROM block_2;
I actually recommend first_value() rather than last_value() for some technical reasons involving window frames. If you want the last value, order by the month desc and choose the first row:
select MNTH, sum_cust,
first_value(sum_cust) over (order by MNTH desc
rows between unbounded preceding and current row
) as sum_cust_last
from block_2;
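The window-frame issue is easy to reproduce. The following sketch uses SQLite through Python's sqlite3 module as a stand-in for Snowflake, with three invented rows in block_2. It compares last_value() under the default frame, last_value() with an explicit full frame, and first_value() over a descending order.

```python
import sqlite3

# Hypothetical monthly totals: (MNTH, sum_cust)
SAMPLE_MONTHS = [
    ("2023-01", 10),
    ("2023-02", 20),
    ("2023-03", 30),
]

FRAME_QUERY = """
SELECT MNTH,
       sum_cust,
       -- default frame ends at the current row, so this echoes sum_cust
       last_value(sum_cust) OVER (ORDER BY MNTH) AS last_default_frame,
       -- explicit frame covering the whole partition
       last_value(sum_cust) OVER (ORDER BY MNTH
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND UNBOUNDED FOLLOWING) AS last_full_frame,
       -- reversed order: the first row of the frame is the latest month
       first_value(sum_cust) OVER (ORDER BY MNTH DESC) AS first_desc
FROM block_2
ORDER BY MNTH
"""


def compare_frames(rows):
    """Return one row per month with the three variants side by side."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE block_2 (MNTH TEXT, sum_cust INTEGER)")
        conn.executemany("INSERT INTO block_2 VALUES (?, ?)", rows)
        return conn.execute(FRAME_QUERY).fetchall()
    finally:
        conn.close()


for row in compare_frames(SAMPLE_MONTHS):
    print(row)
```

Only the default-frame last_value() column repeats the current row's value; the other two columns return the value of the latest month on every row.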

Use a regular aggregative function (sum) alongside a window function

I was reading this tutorial on how to calculate running totals.
Copying the suggested approach I have a query of the form:
select
date,
sum(sales) over (order by date rows unbounded preceding) as cumulative_sales
from sales_table;
This works fine and does what I want - a running total by date.
However, in addition to the running total, I'd also like to add daily sales:
select
date,
sum(sales),
sum(sales) over (order by date rows unbounded preceding) as cumulative_sales
from sales_table
group by 1;
This throws an error:
SYNTAX_ERROR: line 6:8: '"sum"("sales") OVER (ORDER BY "activity_date" ASC ROWS UNBOUNDED PRECEDING)' must be an aggregate expression or appear in GROUP BY clause
How can I calculate both daily total as well as running total?
You can try the following, although it repeats daily_sales on every row of the same date. This way you don't need to group by the date field.
SELECT date,
SUM(sales) OVER (PARTITION BY DATE) as daily_sales,
SUM(sales) OVER (ORDER BY DATE ROWS UNBOUNDED PRECEDING) as cumulative_sales
FROM sales_table;
Presumably, you intend an aggregation query to begin with:
select date, sum(sales) as daily_sales,
sum(sum(sales)) over (order by date rows unbounded preceding) as cumulative_sales
from sales_table
group by date
order by date;
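The nested form sum(sum(sales)) reads as: the inner sum is the per-date aggregate, and the outer sum is a window function running over those aggregated rows. A minimal sketch on SQLite through Python's sqlite3 module, with made-up rows in sales_table (the error in the question looks like it comes from a Presto-style engine, which this does not reproduce):

```python
import sqlite3

# Hypothetical sales rows: (date, sales); the first date has two rows.
SAMPLE_SALES = [
    ("2023-01-01", 10),
    ("2023-01-01", 5),
    ("2023-01-02", 20),
    ("2023-01-03", 7),
]

DAILY_AND_RUNNING_QUERY = """
SELECT date,
       SUM(sales) AS daily_sales,
       -- inner SUM: per-date total; outer SUM: running total over those totals
       SUM(SUM(sales)) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS cumulative_sales
FROM sales_table
GROUP BY date
ORDER BY date
"""


def daily_and_running_totals(rows):
    """Return (date, daily_sales, cumulative_sales) for each date."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales_table (date TEXT, sales INTEGER)")
        conn.executemany("INSERT INTO sales_table VALUES (?, ?)", rows)
        return conn.execute(DAILY_AND_RUNNING_QUERY).fetchall()
    finally:
        conn.close()


for row in daily_and_running_totals(SAMPLE_SALES):
    print(row)
```

Each date appears once, with its own total and the running total up to that date.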

How to group same multiple window functions as one and call by an alias wherever needed in the query?

How can I address the issue of having the same window function multiple times in a single SQL query for different aggregations? Is there any way I can alias it and call it multiple times as needed in the query?
I tried using 'Window' clause for the same but SQL Server currently doesn't support the 'Window' clause.
select empid, qty,
sum(qty) over (partition by empid order by month rows between unbounded preceding and current row) as running_sum,
avg(qty) over (partition by empid order by month rows between unbounded preceding and current row) as running_avg,
min(qty) over (partition by empid order by month rows between unbounded preceding and current row) as running_min,
max(qty) over (partition by empid order by month rows between unbounded preceding and current row) as running_max
from employee
Is there a way to remove the redundancy in the code?
Not in SQL Server before version 2022. ANSI SQL defines a WINDOW clause for naming a window once and reusing it, but older SQL Server versions do not support it.
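I think you can slightly simplify your logic, because with an ORDER BY and no explicit frame the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That matches your ROWS frame as long as month is unique within each empid: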
select empid, qty,
sum(qty) over (partition by empid order by month) as running_sum,
avg(qty) over (partition by empid order by month) as running_avg,
min(qty) over (partition by empid order by month) as running_min,
max(qty) over (partition by empid order by month) as running_max
from employee;
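For comparison, on an engine that does support the ANSI WINDOW clause, the shared definition can be named once and referenced by each function. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in (it is not SQL Server), and the rows in employee are invented for the example.

```python
import sqlite3

# Hypothetical rows: (empid, month, qty)
SAMPLE_EMPLOYEE = [
    (1, 1, 10),
    (1, 2, 30),
    (1, 3, 20),
    (2, 1, 5),
    (2, 2, 15),
]

# The window w is defined once and reused by all four aggregates.
NAMED_WINDOW_QUERY = """
SELECT empid, month, qty,
       SUM(qty) OVER w AS running_sum,
       AVG(qty) OVER w AS running_avg,
       MIN(qty) OVER w AS running_min,
       MAX(qty) OVER w AS running_max
FROM employee
WINDOW w AS (PARTITION BY empid
             ORDER BY month
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ORDER BY empid, month
"""


def running_stats(rows):
    """Return running sum, avg, min and max of qty per employee."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE employee (empid INTEGER, month INTEGER, qty INTEGER)")
        conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)
        return conn.execute(NAMED_WINDOW_QUERY).fetchall()
    finally:
        conn.close()


for row in running_stats(SAMPLE_EMPLOYEE):
    print(row)
```

The output matches what four separate OVER (...) clauses with the same definition would produce; the named window only removes the repetition.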