I come to you today because I'm struggling with a query that involves the LAG function (FYI, I am using PostgreSQL).
I have a table that contains the monthly quantities of a product sold from an origin country to a destination country. The table is defined like this:
create table market_research.test_tonnage(
origin text, -- Origin country
desti text, -- Destination country
yr int, -- Year
mt int, -- Month
q numeric -- quantity sold (always > 0)
)
Here is the content:
origin  desti   yr    mt  q
toto    coucou  2019  1   1.4
toto    coucou  2019  2   2.5
toto    coucou  2019  3   1.2
tata    yoyo    2018  11  5.4
tata    yoyo    2018  12  5.5
tata    yoyo    2019  1   5.2
I am trying to create a view that adds 2 calculated fields, as follows:
beginning_stock: initial value of 0, then beginning_stock = ending_stock of the previous month
ending_stock: ending_stock = beginning_stock - q
origin  desti   yr    mt  q    beginning_stock  ending_stock
toto    coucou  2019  1   1.4  0                -1.4
toto    coucou  2019  2   2.5  -1.4             -3.9
toto    coucou  2019  3   1.2  -3.9             -5.1
tata    yoyo    2018  11  5.4  0                -5.4
tata    yoyo    2018  12  5.5  -5.4             -10.9
tata    yoyo    2019  1   5.2  -10.9            -16.1
I have tried many queries using the LAG function, but I think the problem comes from the sequential nature of the calculation over time. Here is an example of one attempt:
select origin,
desti,
yr,
mt,
q,
COALESCE(lag(ending_stock, 1) over (partition by origin order by yr, mt), 0) beginning_stock,
beginning_stock - q ending_stock
from market_research.test_tonnage
Thank you for your help!
Max
You need a cumulative SUM() function instead of LAG():
SELECT
*,
SUM(-q) OVER (PARTITION BY origin ORDER BY yr, mt) + q as beginning, -- 2
SUM(-q) OVER (PARTITION BY origin ORDER BY yr, mt) as ending -- 1
FROM my_table
Summing the quantities of all rows up to and including the current one gives the running total, i.e. ending_stock (negate q first, since you want negative values).
The same sum without the current row's contribution (add q back, because the SUM() already subtracted it) gives beginning_stock.
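The cumulative-SUM approach can be sanity-checked end to end. Here is a minimal sketch using Python's sqlite3 (any engine with window functions behaves the same way); the table is recreated in memory with the schema prefix dropped:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_tonnage (origin TEXT, desti TEXT, yr INT, mt INT, q NUMERIC)")
con.executemany(
    "INSERT INTO test_tonnage VALUES (?, ?, ?, ?, ?)",
    [
        ("toto", "coucou", 2019, 1, 1.4),
        ("toto", "coucou", 2019, 2, 2.5),
        ("toto", "coucou", 2019, 3, 1.2),
        ("tata", "yoyo", 2018, 11, 5.4),
        ("tata", "yoyo", 2018, 12, 5.5),
        ("tata", "yoyo", 2019, 1, 5.2),
    ],
)
# Running total of -q gives ending_stock; adding q back gives beginning_stock.
rows = con.execute("""
    SELECT origin, yr, mt, q,
           SUM(-q) OVER (PARTITION BY origin ORDER BY yr, mt) + q AS beginning_stock,
           SUM(-q) OVER (PARTITION BY origin ORDER BY yr, mt)     AS ending_stock
    FROM test_tonnage
    ORDER BY origin DESC, yr, mt
""").fetchall()
for r in rows:
    print(r)
```

The values it prints reproduce the running totals from the question's expected output, row for row.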
Related
I was trying this on a test table:
create table years (
yr bigint,
average decimal(10,2),
rollno bigint
)
I created this table to store 2 years, 2020 and 2021; the average of the marks scored is stored in the average column.
The condition is to find only those students whose average is above 54 in each of the last 2 years.
The data is as follows:
year average rollno
2021 55.20 1
2020 55.50 1
2020 54.50 2
2020 55.50 3
2021 55.40 3
select rollno
from years
where average > 54
and yr = (YEAR(GETDATE())-1)
and yr = (YEAR(GETDATE())-2)
I tried this query, but it does not work when I want to find only the students for whom the condition is true in both years.
If I use the query like this:
select rollno
from years
where average > 54
and yr = (YEAR(GETDATE())-1) or yr = (YEAR(GETDATE())-2)
it runs, but doesn't give me the desired result.
result i want is as follows
year average rollno
2020 55.50 1
2021 55.20 1
2020 55.50 3
2021 55.40 3
but I am getting roll no 2 in the output.
After reviewing your query, it seems there are some parentheses missing in your expression.
Here is an example, based on your query:
SELECT * FROM years WHERE
average > 55
and --below the bracket P1
(--Open P1
(--Open P2
yr = year(getdate())-1
)--Close P2
or
(--Open P3
yr = year(getdate())-2
)--Close P3
)--Close P1
What this means in the WHERE clause: average > 55 is mandatory, AND everything inside bracket P1 counts together as a single second condition.
The result
year average rollno
2020 55.50 1
2021 55.20 1
2020 55.50 3
2021 55.40 3
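The underlying rule is that AND binds more tightly than OR, so without parentheses the query is parsed as (average > 55 AND yr = ...) OR yr = .... A minimal sketch of the difference, using SQLite with the years hard-coded (2022 stands in for the current year, instead of YEAR(GETDATE())):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE years (yr INTEGER, average DECIMAL(10,2), rollno INTEGER)")
con.executemany("INSERT INTO years VALUES (?, ?, ?)", [
    (2021, 55.20, 1), (2020, 55.50, 1), (2020, 54.50, 2),
    (2020, 55.50, 3), (2021, 55.40, 3),
])

# Without parentheses: the OR is evaluated last, so every 2020 row passes,
# including rollno 2 regardless of its average.
unbracketed = con.execute(
    "SELECT rollno FROM years WHERE average > 55 AND yr = 2022 - 1 OR yr = 2022 - 2"
).fetchall()

# With parentheses: average > 55 applies to both years, so rollno 2 drops out.
bracketed = con.execute(
    "SELECT rollno FROM years WHERE average > 55 AND (yr = 2022 - 1 OR yr = 2022 - 2)"
).fetchall()

print(sorted(r[0] for r in unbracketed))
print(sorted(r[0] for r in bracketed))
```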
Best Regards
I am looking to filter very large tables down to the latest entry per user per month. I'm not sure I've found the best way to do this. I know I "should" trust the SQL engine (Snowflake), but part of me does not like the join on three columns.
Note that this is a very common operation on many big tables, and I want to use it in DBT views which means it will get run all the time.
To illustrate, my data is of this form:
mytable
userId  loginDate   year  month  value
1       2021-01-04  2021  1      41.1
1       2021-01-06  2021  1      411.1
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-06  2021  2      32
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
And I'm trying to use SQL to get the last value (by loginDate) for each month.
I'm currently doing a GROUP BY and a join as follows:
WITH latest_entry_by_month AS (
SELECT "userId", "year", "month", max("loginDate") AS "loginDate"
FROM mytable
GROUP BY "userId", "year", "month"
)
SELECT * FROM mytable NATURAL JOIN latest_entry_by_month
The above results in my desired output:
userId  loginDate   year  month  value
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
But I'm not sure if it's optimal.
Any guidance on how to do this faster? Note that I am not materializing the underlying data, so it is effectively un-clustered (I'm getting it from a vendor via the Snowflake marketplace).
Using QUALIFY and a windowed function (ROW_NUMBER):
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER(PARTITION BY userId, year, month
ORDER BY loginDate DESC) = 1
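QUALIFY is Snowflake-specific syntax; on engines without it, the same filter is written as a ROW_NUMBER() in a subquery. A small sketch of that portable form, using Python's sqlite3 and the question's sample rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (userId INT, loginDate TEXT, year INT, month INT, value REAL)")
con.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?)", [
    (1, "2021-01-04", 2021, 1, 41.1),
    (1, "2021-01-06", 2021, 1, 411.1),
    (1, "2021-01-25", 2021, 1, 251.1),
    (2, "2021-01-05", 2021, 1, 4369),
    (2, "2021-02-06", 2021, 2, 32),
    (2, "2021-02-14", 2021, 2, 731),
    (3, "2021-01-20", 2021, 1, 258),
    (3, "2021-02-19", 2021, 2, 4251),
    (3, "2021-03-15", 2021, 3, 171),
])
# Rank rows within each (userId, year, month) by loginDate, newest first,
# then keep only the top-ranked row of each group.
rows = con.execute("""
    SELECT userId, loginDate, year, month, value
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY userId, year, month
                                     ORDER BY loginDate DESC) AS rn
        FROM mytable
    )
    WHERE rn = 1
    ORDER BY userId, year, month
""").fetchall()
for r in rows:
    print(r)
```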
I have a table I want to join to bring through an ID. Straightforward enough, but I only want to bring through values that are 'live' (marked by a 1 in the Flag column below). For the latest year no values are live yet, but I need those rows brought through too. It might be easier to explain with an example.
Joining Table:
Company Year Product ID Flag
A 2019 X 100 0
A 2019 X 101 1
A 2019 Y 102 1
A 2019 Y 103 0
A 2019 Y 104 0
A 2020 X 105 1
A 2020 Y 106 0
A 2020 Y 107 1
A 2020 Y 108 0
A 2020 Z 109 1
A 2021 X 110 0
A 2021 Y 111 0
A 2021 Y 112 0
A 2021 Y 113 0
A 2021 Z 114 0
I need to bring through the rows that have a 1 in the Flag column, and then all rows with a year of 2021. (When 2021 begins, the Flag values for 2021 will switch to zeroes and ones, and again only the rows with a 1 should come through.)
The need to bring through next year's values will recur at the end of every year, so the idea is to future-proof this against further changes; hard-coding "when year = 2021" is not an option.
The original table has the company, year, and product, so the join will be on these three fields.
Any thoughts? Let me know
Thanks
Is this what you want?
select t.*
from mytable t
where flag = 1 or year = extract(year from current_date)
This brings rows where flag has value 1 or where year is the current year.
Note that this uses the standard date functions extract() and current_date; not all databases support this syntax, but they all have equivalents.
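A quick sketch of this filter using SQLite and an abbreviated version of the sample rows; the "current" year is passed in as a parameter so the example stays deterministic (on PostgreSQL you would use extract(year from current_date) instead):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE joining (company TEXT, year INT, product TEXT, id INT, flag INT)")
con.executemany("INSERT INTO joining VALUES (?, ?, ?, ?, ?)", [
    ("A", 2019, "X", 100, 0), ("A", 2019, "X", 101, 1),
    ("A", 2019, "Y", 102, 1), ("A", 2019, "Y", 103, 0),
    ("A", 2020, "X", 105, 1), ("A", 2020, "Y", 107, 1),
    ("A", 2021, "X", 110, 0), ("A", 2021, "Y", 111, 0),
])

# Keep live rows (flag = 1) plus all rows from the latest year,
# whose flags have not been assigned yet.
current_year = 2021
ids = [r[0] for r in con.execute(
    "SELECT id FROM joining WHERE flag = 1 OR year = ?", (current_year,)
)]
print(sorted(ids))
```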
I have two tables client and grouping. They look like this:
Client: C_id, C_grouping_id, Month, Profit
Grouping: Grouping_id, Month, Profit
The client table contains monthly profit for every client and every client belongs to a specific grouping scheme specified by C_grouping_id.
The grouping table contains all the groups and their monthly profits.
I'm struggling with a query that essentially calculates the monthly residual for every subscriber:
Residual= (Subscriber Monthly Profit - Grouping monthly Profit)*(average subscriber monthly profits for all months / average profits for all months for the grouping subscriber belongs to)
I have come up with the following query so far but the results seem to be incorrect:
SELECT client.C_id, client.C_grouping_Id, client.Month,
((client.Profit - grouping.profit) * (avg(client.Profit)/avg(grouping.profit))) as "residual"
FROM client
INNER JOIN grouping
ON "C_grouping_id"="Grouping_id"
group by client.C_id, client.C_grouping_Id,client.Month, grouping.profit
I would appreciate it if someone can shed some light on what I'm doing wrong and how to correct it.
EDIT: Adding sample data and desired results
Client
C_id C_grouping_id Month Profit
001 aaa jul 10$
001 aaa aug 12$
001 aaa sep 8$
016 abc jan 25$
016 abc feb 21$
Grouping
Grouping_id Month Profit
aaa Jul 30$
aaa aug 50$
aaa Sep 15$
abc Jan 21$
abc Feb 27$
Query Result:
C_ID C_grouping_id Month Residual
001 aaa Jul (10-30)*(10/31.3)=-6.38
... and so on for every month, for every client.
This can be done in a pretty straight forward way.
The main difficulty is obviously that you try to deal with different levels of aggregation at once (average of the group and the client as well as the current record).
This is rather difficult/clumsy with plain SELECT ... GROUP BY SQL.
But with analytic functions, a.k.a. window functions, it is very easy.
Start with combining the tables and calculating the base numbers:
select c.c_id as client_id,
c.c_grouping_id as grouping_id,
c.month,
c.profit as client_profit,
g.profit as group_profit,
avg (c.profit) over (partition by c.c_id) as avg_client_profit,
avg (g.profit) over (partition by g.grouping_id) as avg_group_profit
from client c inner join grouping g
on c."C_GROUPING_ID"=g."GROUPING_ID"
and c. "MONTH" = g. "MONTH";
With this you already get the average profits by client and by grouping_id.
Be aware that I changed the data type of the currency column to DECIMAL(10,3), as a VARCHAR with a $ sign in it is hard to convert.
I also fixed the MONTH data, as the test data contained different upper/lower-case spellings which prevented the join from working.
Finally, I turned all column names into upper case in order to make typing easier.
Anyhow, running this provides you with the following result set:
CLIENT_ID GROUPING_ID MONTH CLIENT_PROFIT GROUP_PROFIT AVG_CLIENT_PROFIT AVG_GROUP_PROFIT
16 abc JAN 25 21 23 24
16 abc FEB 21 27 23 24
1 aaa JUL 10 30 10 31.666
1 aaa AUG 12 50 10 31.666
1 aaa SEP 8 15 10 31.666
From here it's only one step further to the residual calculation.
You can either put this current SQL into a view to make it reusable for other queries or use it as a inline view.
I chose to use it as a common table expression (CTE) aka WITH clause because it's nice and easy to read:
with p as
(select c.c_id as client_id,
c.c_grouping_id as grouping_id,
c.month,
c.profit as client_profit,
g.profit as group_profit,
avg (c.profit) over (partition by c.c_id) as avg_client_profit,
avg (g.profit) over (partition by g.grouping_id) as avg_group_profit
from client c inner join grouping g
on c."C_GROUPING_ID"=g."GROUPING_ID"
and c. "MONTH" = g. "MONTH")
select client_id, grouping_id, month,
client_profit, group_profit,
avg_client_profit, avg_group_profit,
round( (client_profit - group_profit)
* (avg_client_profit/avg_group_profit), 2) as residual
from p
order by grouping_id, month, client_id;
Notice how easy to read the whole statement is and how straight forward the residual calculation is done.
The result is then this:
CLIENT_ID GROUPING_ID MONTH CLIENT_PROFIT GROUP_PROFIT AVG_CLIENT_PROFIT AVG_GROUP_PROFIT RESIDUAL
1 aaa AUG 12 50 10 31.666 -12
1 aaa JUL 10 30 10 31.666 -6.32
1 aaa SEP 8 15 10 31.666 -2.21
16 abc FEB 21 27 23 24 -5.75
16 abc JAN 25 21 23 24 3.83
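The figures above can be reproduced on any engine with window functions. Here is a sketch using Python's sqlite3, with the profits stored as plain numbers and the grouping table renamed to grp to sidestep keyword clashes:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE client (c_id INT, c_grouping_id TEXT, month TEXT, profit REAL)")
con.execute("CREATE TABLE grp (grouping_id TEXT, month TEXT, profit REAL)")
con.executemany("INSERT INTO client VALUES (?, ?, ?, ?)", [
    (1, "aaa", "JUL", 10), (1, "aaa", "AUG", 12), (1, "aaa", "SEP", 8),
    (16, "abc", "JAN", 25), (16, "abc", "FEB", 21),
])
con.executemany("INSERT INTO grp VALUES (?, ?, ?)", [
    ("aaa", "JUL", 30), ("aaa", "AUG", 50), ("aaa", "SEP", 15),
    ("abc", "JAN", 21), ("abc", "FEB", 27),
])
# CTE computes per-row profits plus the two averages at different
# aggregation levels; the outer query applies the residual formula.
rows = con.execute("""
    WITH p AS (
        SELECT c.c_id, c.c_grouping_id, c.month,
               c.profit AS client_profit,
               g.profit AS group_profit,
               AVG(c.profit) OVER (PARTITION BY c.c_id)        AS avg_client_profit,
               AVG(g.profit) OVER (PARTITION BY g.grouping_id) AS avg_group_profit
        FROM client c
        JOIN grp g ON c.c_grouping_id = g.grouping_id AND c.month = g.month
    )
    SELECT c_id, month,
           ROUND((client_profit - group_profit)
                 * (avg_client_profit / avg_group_profit), 2) AS residual
    FROM p
    ORDER BY c_grouping_id, month, c_id
""").fetchall()
for r in rows:
    print(r)
```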
Cheers,
Lars
I need to find a moving average for the previous 12 rows. I need to have my result set look like this.
t Year Month Sales MovingAverage
1 2010 3 20 NULL
2 2010 4 22 NULL
3 2010 5 24 NULL
4 2010 6 25 NULL
5 2010 7 23 NULL
6 2010 8 26 NULL
7 2010 9 28 NULL
8 2010 10 26 NULL
9 2010 11 29 NULL
10 2010 12 27 NULL
11 2011 1 28 NULL
12 2011 2 30 NULL
13 2011 3 27 25.67
14 2011 4 29 26.25
15 2011 5 26 26.83
For row 13 I need to average rows 1 to 12 and have the result returned in row 13 in the MovingAverage column. Rows 1-12 have a MovingAverage of NULL because there must be at least 12 previous rows for the calculation. The columns t, Year, Month, and Sales already exist; I need to create the MovingAverage column. I am using PostgreSQL, but the syntax should be very similar.
Don't use the lag() function. There is a built-in way to get a moving average with a window frame (using mytable as a stand-in for your table name). Well, almost:
select t.*,
       avg(sales) over (order by t rows between 12 preceding and 1 preceding) as MovingAverage
from mytable t;
The frame "rows between 12 preceding and 1 preceding" averages the twelve rows before the current one, excluding it. The remaining problem is that this also produces a partial average for the first twelve rows. To return NULL for those instead:
select t.*,
       (case when row_number() over (order by t) >= 13
             then avg(sales) over (order by t rows between 12 preceding and 1 preceding)
        end) as MovingAverage
from mytable t;
Note that rows between counts physical rows, whereas range between groups rows with equal t together; since t is a gap-free unique integer here, both give the same result.
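A runnable sketch of this 12-row moving average using Python's sqlite3 (the table name sales_t is made up for the example): the frame ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING averages the previous twelve rows, and the CASE suppresses the first twelve rows, which have too few predecessors.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_t (t INT, year INT, month INT, sales INT)")
# Monthly sales starting at 2010-03, matching the question's sample.
data = [20, 22, 24, 25, 23, 26, 28, 26, 29, 27, 28, 30, 27, 29, 26]
con.executemany("INSERT INTO sales_t VALUES (?, ?, ?, ?)",
                [(i + 1, 2010 + (i + 2) // 12, (i + 2) % 12 + 1, s)
                 for i, s in enumerate(data)])
rows = con.execute("""
    SELECT t, sales,
           CASE WHEN ROW_NUMBER() OVER (ORDER BY t) >= 13
                THEN ROUND(AVG(sales) OVER
                       (ORDER BY t ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING), 2)
           END AS moving_average
    FROM sales_t
    ORDER BY t
""").fetchall()
for r in rows:
    print(r)
```

Row 13 averages rows 1-12 (25.67), row 14 averages rows 2-13 (26.25), and so on, as in the desired result set.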