How can I create a week-to-date metric in Vertica?

I have a table which stores year-to-date metrics once per client per day. Simplified, the schema looks roughly like this; let's call the table history:
bus_date | client_id | ytd_costs
I'd like to create a view that adds a week-to-date cost: essentially, any cost that occurs after the prior Friday is considered part of the week-to-date. I currently have the following, but I'm concerned about the switch-case logic.
Here is an example of the logic I have right now, to show that it works. I also got to use the timeseries clause, which I've never used before...
with history as (
    select bus_date, client_id, ts_first_value(value, 'linear') as ytd_costs
    from (select {ts '2016-10-07'} t, 1 client_id, 5.0 "value"
          union all
          select {ts '2016-10-14'}, 1, 15) k
    timeseries bus_date as '1 day' over (partition by client_id order by t)
),
history_with_wtd as (
    select bus_date,
           client_id,
           ytd_costs,
           ytd_costs - decode(
               dayofweek(bus_date),
               6, first_value(ytd_costs) over (partition by client_id
                                               order by bus_date
                                               range '1 week' preceding),
               first_value(ytd_costs) over (partition by client_id, date_trunc('week', bus_date + 3)
                                            order by bus_date)
           ) as wtd_costs,
           ytd_costs - 5 as expected_wtd
    from history
)
select *
from history_with_wtd
where date_trunc('week', bus_date) = '2016-10-10'
In SQL Server I could just use the LAG function, since it accepts an expression for the look-back offset, but in Vertica no such option exists.
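For reference, the SQL Server pattern alluded to might look like the sketch below. This is hypothetical T-SQL, and it relies on two assumptions: the table really has one row per client per day (so a row offset equals a day offset), and the default DATEFIRST setting is in effect (Sunday = 1).
-- LAG's offset can be an expression in SQL Server, so each row can reach
-- back to the prior Friday's row: Sat -> 1, Sun -> 2, ..., Fri -> 7 rows back.
-- The first partial week comes back NULL.
SELECT bus_date,
       client_id,
       ytd_costs - LAG(ytd_costs, (DATEPART(WEEKDAY, bus_date) % 7) + 1) OVER (
           PARTITION BY client_id
           ORDER BY bus_date
       ) AS wtd_costs
FROM history;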

How about partitioning by week, with weeks starting on Saturday? First grab the first day of the week, then offset it so the week starts on Saturday: trunc(bus_date + 1,'D') - 1.
Also notice the window frame runs from the start of the partition (Saturday, unbounded preceding) to the current row.
select bus_date,
       client_id,
       ytd_costs,
       ytd_costs - first_value(ytd_costs) over (
           partition by client_id, trunc(bus_date + 1, 'D') - 1
           order by bus_date
           range between unbounded preceding and current row
       ) as wtd_costs
from sos.history
order by client_id, bus_date
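To sanity-check the Saturday anchor, here is a throwaway query (a sketch; it assumes Vertica's trunc(date, 'D') snaps to the locale's first day of the week, Sunday by default):
-- Each bus_date should map to the Saturday on or before it.
select bus_date,
       dayofweek(bus_date) as dow,           -- 1 = Sunday ... 7 = Saturday
       trunc(bus_date + 1, 'D') - 1 as week_start
from (select date '2016-10-07' as bus_date   -- Friday   -> 2016-10-01
      union all
      select date '2016-10-08'               -- Saturday -> 2016-10-08
      union all
      select date '2016-10-10'               -- Monday   -> 2016-10-08
     ) t;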

Related

Getting 3 best Posts per Month with three different queries

I am having a hard time wrapping my head around the row_number function.
My table has Post_timestamp, Post_impressions, and Post_tipo columns.
I am trying to build a query that outputs the top value of Post_impressions within a date range (i.e. a month) when RowNumber is set to 1, the second-best value when it is set to 2, and so on.
Here is the query I came up with so far:
SELECT Post_timestamp,
       Post_impressions,
       Post_tipo
FROM (
  SELECT Post_timestamp,
         Post_impressions,
         Post_tipo,
         FORMAT_DATE("%Y-%m-%d", DATE_TRUNC(TIMESTAMP(Post_timestamp), DAY)) AS TheDate,
         ROW_NUMBER() OVER (
           PARTITION BY FORMAT_DATE("%Y-%m-%d", DATE_TRUNC(TIMESTAMP(Post_timestamp), DAY))
           ORDER BY Post_impressions DESC
         ) AS RowNumber
  FROM `***DATABASENAME***`
)
WHERE RowNumber = 1 AND TheDate BETWEEN "2021-07-01" AND "2021-07-31";
Thanks for your help!
You're getting 31 rows because you're partitioning the subquery by day, and each partition has a RowNumber = 1. You could partition your query by month, but I suspect that wouldn't address all your use cases, particularly when you want to look at a time period spanning multiple partitions.
Alternatively, if your use case is limited to month over month, you can simply partition by the month:
SELECT Post_timestamp,
       Post_impressions,
       Post_tipo
FROM (
  SELECT Post_timestamp,
         Post_impressions,
         Post_tipo,
         FORMAT_DATE("%Y-%m-%d", DATE_TRUNC(TIMESTAMP(Post_timestamp), DAY)) AS TheDate,
         ROW_NUMBER() OVER (
           PARTITION BY FORMAT_DATE("%Y-%m-%d", DATE_TRUNC(TIMESTAMP(Post_timestamp), MONTH))
           ORDER BY Post_impressions DESC
         ) AS RowNumber
  FROM `***DATABASENAME***`
)
WHERE RowNumber = 1 AND TheDate BETWEEN "2021-07-01" AND "2021-07-31";
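Since the title asks for the three best posts per month, the same shape extends to top N by filtering on RowNumber <= 3. Here is a sketch with the date handling simplified; it assumes Post_timestamp converts cleanly via TIMESTAMP(), so adjust the conversion to your column's actual type:
SELECT Post_timestamp,
       Post_impressions,
       Post_tipo,
       RowNumber
FROM (
  SELECT Post_timestamp,
         Post_impressions,
         Post_tipo,
         DATE(TIMESTAMP(Post_timestamp)) AS TheDate,
         ROW_NUMBER() OVER (
           PARTITION BY DATE_TRUNC(DATE(TIMESTAMP(Post_timestamp)), MONTH)
           ORDER BY Post_impressions DESC
         ) AS RowNumber
  FROM `***DATABASENAME***`
)
WHERE RowNumber <= 3
  AND TheDate BETWEEN "2021-07-01" AND "2021-07-31"
ORDER BY TheDate, RowNumber;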

PostgreSQL subquery - calculating average of lagged values

I am looking at Sales Rates by month, and was able to query the 1st table. I am quite new to PostgreSQL and am trying to figure out how I can query the second (for now I had to build it in Excel).
I have the current Sales Rate and I would like to compare it to the Sales Rate 1 and 2 months ago, as an averaged rate.
I am not asking for the exact solution, because that's not how you get better, just for hints about PostgreSQL-specific functions to use. What I am trying to calculate in the 2nd table is the 2-month average based on the lagged sales rates. Thanks!
Here is the query for the 1st table:
with t1 as
  (select date,
          count(sales)::numeric/count(poss_sales) as SR_1M_before
   from data
   where date between '2019-07-01' and '2019-11-30'
   group by 1),
t2 as
  (select date,
          count(sales)::numeric/count(poss_sales) as SR_2M_before
   from data
   where date between '2019-07-01' and '2019-10-31'
   group by 1)
select t0.date,
       count(t0.sales)::numeric/count(t0.poss_sales) as Sales_Rate,
       t1.SR_1M_before,
       t2.SR_2M_before
from data as t0
left join t1 on t0.date = t1.date
left join t2 on t0.date = t2.date
where t0.date between '2019-07-01' and '2019-12-31'
group by 1, 3, 4
order by 1;
As commented by a_horse_with_no_name, you can use window functions to take the average of the two previous months with a RANGE clause (interval offsets in RANGE frames require PostgreSQL 11 or later):
select date,
       count(sales)::numeric/count(poss_sales) as sales_rate,
       avg(count(sales)::numeric/count(poss_sales)) over (
           order by date
           range between interval '2 months' preceding and interval '1 month' preceding
       ) as two_month_average,
       count(sales)::numeric/count(poss_sales)
         - avg(count(sales)::numeric/count(poss_sales)) over (
             order by date
             range between interval '2 months' preceding and interval '1 month' preceding
         ) as percent_deviation
from data
where date between '2019-07-01' and '2019-12-31'
group by date
order by date;
Your data is a bit confusing -- it would be less confusing if you had decimal places (that is, 58% being the average of 57% and 58% is not obvious).
Because you want to have NULL values on the first two rows, I'm going to calculate the values using sum() and count():
with q as (
      <whatever generates the data you have shown>
     )
select q.*,
       sum(sales_rate) over (order by date
                             rows between 2 preceding and 1 preceding
                            ) /
       nullif(count(*) over (order by date
                             rows between 2 preceding and 1 preceding
                            ), 0) as two_month_average
from q;
You could also express this using case and avg():
select q.*,
       (case when row_number() over (order by date) > 2
             then avg(sales_rate) over (order by date
                                        rows between 2 preceding and 1 preceding
                                       )
        end) as two_month_average
from q;

PostgreSQL: RANGE BETWEEN INTERVAL '10 DAY' AND CURRENT ROW

I have a table which stores, for every item, the daily price. If the price hasn't been updated, there isn't a record for that item on that day.
I need to write a query which retrieves, for every item, the most recent price within a look-back window of 10 days from the current row's date, otherwise NULL. I was thinking of achieving that with a RANGE BETWEEN INTERVAL frame. Something like:
SELECT
    DATE(datetime),
    item_id,
    LAST_VALUE(price) OVER (
        PARTITION BY item_id
        ORDER BY datetime DESC
        RANGE BETWEEN INTERVAL '10 DAYS' PRECEDING AND CURRENT ROW
    ) AS most_recent_price_within_last_10days
FROM ...
GROUP BY
    date,
    item_id,
    price
Unfortunately this query raises an error:
LINE 20: RANGE BETWEEN INTERVAL '10 DAY' PRECEDING AND CURRENT ROW
^
I came across an old blog post saying such operation is not possible in Postgres. Is this still accurate?
You could use ROW_NUMBER() to pull out the most recent record within the last 10 days for each item:
SELECT *
FROM (
SELECT
DATE(datetime),
item_id,
price AS most_recent_price_within_last_10days,
ROW_NUMBER() OVER(PARTITION BY item_id ORDER BY datetime DESC) rn
FROM ...
WHERE datetime > NOW() - INTERVAL '10 DAY'
) x WHERE rn = 1
In the subquery, the WHERE clause does the filtering on the date range; ROW_NUMBER() assigns a rank to each record within groups of records having the same item_id, with the most recent record first. Then, the outer query just filters in records having row number 1.
One method is to use LAG() and some comparison:
(CASE WHEN LAG(datetime) OVER (PARTITION BY item_id ORDER BY datetime) > datetime - interval '10 days'
THEN LAG(price) OVER (PARTITION BY item_id ORDER BY datetime)
END) as most_recent_price_within_last_10days
That is, the price you are looking for is on the preceding row. The only question is whether the date on that row is recent enough.
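A runnable version of that idea, as a sketch against an assumed prices(datetime, item_id, price) table (the question elides the real names):
-- Previous row's price per item, but only when that row falls inside the
-- 10-day look-back window; otherwise NULL.
SELECT DATE(datetime) AS date,
       item_id,
       (CASE WHEN LAG(datetime) OVER w > datetime - INTERVAL '10 days'
             THEN LAG(price) OVER w
        END) AS most_recent_price_within_last_10days
FROM prices
WINDOW w AS (PARTITION BY item_id ORDER BY datetime);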

Calculate MAX for value over a relative date range

I am trying to calculate the max of a value over a relative date range. Suppose I have these columns: Date, Week, Category, Value. Note: The Week column is the Monday of the week of the corresponding Date.
I want to produce a table which gives the MAX value within the last two weeks for each Date, Week, Category combination so that the output produces the following: Date, Week, Category, Value, 2WeeksPriorMAX.
How would I go about writing that query? I don't think the following would work:
SELECT Date, Week, Value,
       MAX(Value) OVER (PARTITION BY Category
                        ORDER BY Week
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS "2WeeksPriorMAX"
The above query doesn't account for cases where there are missing values for a given Category, Week combination within the last 2 weeks, and therefore it would span further than 2 weeks when it analyzes the 2 preceding rows.
Left joining or using a lateral join/subquery might be expensive. You can do this with window functions, but you need a bit more logic:
select t.*,
(case when lag(date, 1) over (partition by category order by date) < date - interval '2 week'
then value
when lag(date, 2) over (partition by category order by date) < date - interval '2 week'
then max(value) over (partition by category order by date rows between 1 preceding and current row)
else max(value) over (partition by category order by date rows between 2 preceding and current row)
end) as TwoWeekMax
from t;
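Alternatively, if the database supports interval offsets in RANGE frames (the question doesn't name an engine; PostgreSQL 11+ does), the frame can be pinned to the calendar directly, so missing rows cannot stretch it:
-- A sketch assuming PostgreSQL 11+: the frame covers the 14 days ending at
-- the current row's Date, no matter how many rows happen to fall in it.
SELECT Date, Week, Category, Value,
       MAX(Value) OVER (
           PARTITION BY Category
           ORDER BY Date
           RANGE BETWEEN INTERVAL '2 weeks' PRECEDING AND CURRENT ROW
       ) AS TwoWeekMax
FROM t;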

Postgres windowing (determine contiguous days)

Using Postgres 9.3, I'm trying to count the number of contiguous days of a certain weather type. If we assume we have a regular time series and weather report:
date|weather
"2016-02-01";"Sunny"
"2016-02-02";"Cloudy"
"2016-02-03";"Snow"
"2016-02-04";"Snow"
"2016-02-05";"Cloudy"
"2016-02-06";"Sunny"
"2016-02-07";"Sunny"
"2016-02-08";"Sunny"
"2016-02-09";"Snow"
"2016-02-10";"Snow"
I want to count the contiguous days of the same weather. The results should look something like this:
date|weather|contiguous_days
"2016-02-01";"Sunny";1
"2016-02-02";"Cloudy";1
"2016-02-03";"Snow";1
"2016-02-04";"Snow";2
"2016-02-05";"Cloudy";1
"2016-02-06";"Sunny";1
"2016-02-07";"Sunny";2
"2016-02-08";"Sunny";3
"2016-02-09";"Snow";1
"2016-02-10";"Snow";2
I've been banging my head on this for a while trying to use windowing functions. At first it seemed like it should be a no-brainer, but it turned out to be much harder than expected.
Here is what I've tried...
Select date, weather, Row_Number() Over (partition by weather order by date)
from t_weather
Would it just be easier to compare the current row to the next? How would you do that while maintaining a count? Any thoughts, ideas, or even solutions would be helpful!
-Kip
You need to identify the contiguous days where the weather is the same. You can do this by adding a grouping identifier, and there is a simple method: subtract an increasing sequence of numbers from the dates, and the result is constant for contiguous dates.
Once you have the grouping, the rest is row_number():
select date, weather,
       row_number() over (partition by weather, grp order by date) as contiguous_days
from (select w.*,
             (date - row_number() over (partition by weather order by date) * interval '1 day') as grp
      from t_weather w
     ) w;
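To see why the subtraction works, trace the Sunny rows: 2016-02-01 gets row_number 1, and 2016-02-06, 2016-02-07, and 2016-02-08 get 2, 3, and 4. Subtracting gives grp = 2016-01-31 for the isolated day and grp = 2016-02-04 for all three consecutive days, which is exactly the island identifier the outer row_number() needs.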
I'm not sure what the query engine is going to do when scanning multiple times across the same data set (kinda like calculating area under a curve), but this works...
WITH v(date, weather) AS (
  VALUES
    ('2016-02-01'::date, 'Sunny'::text),
    ('2016-02-02', 'Cloudy'),
    ('2016-02-03', 'Snow'),
    ('2016-02-04', 'Snow'),
    ('2016-02-05', 'Cloudy'),
    ('2016-02-06', 'Sunny'),
    ('2016-02-07', 'Sunny'),
    ('2016-02-08', 'Sunny'),
    ('2016-02-09', 'Snow'),
    ('2016-02-10', 'Snow')),
changes AS (
  SELECT date,
         weather,
         CASE WHEN lag(weather) OVER (ORDER BY date) = weather THEN 1 ELSE 0 END change
  FROM v)
SELECT date
     , weather
     , (SELECT count(weather)   -- number of times the weather didn't change
        FROM changes v2
        WHERE v2.date <= v1.date AND v2.weather = v1.weather
          AND v2.date >= (      -- bounded between changes of weather
            SELECT max(date)
            FROM changes v3
            WHERE change = 0
              AND v3.weather = v1.weather
              AND v3.date <= v1.date) --<-- here's the expensive part
       ) curve
FROM changes v1
Here is another approach, based on a similar answer.
First we add a change column that is 1 or 0 depending on whether the weather is different or not from the previous day.
Then we introduce a group_nr column by summing the change over an order by date. This produces a unique group number for each sequence of consecutive same-weather days since the sum is only incremented on the first day of each sequence.
Finally we do a row_number() over (partition by group_nr order by date) to produce the running count per group.
select date, weather,
       row_number() over (partition by group_nr order by date) as contiguous_days
from (
  select *, sum(change) over (order by date) as group_nr
  from (
    select *, (weather != lag(weather, 1, '') over (order by date))::int as change
    from tmp_weather
  ) t1
) t2;
You can accomplish this with a recursive CTE as follows:
WITH RECURSIVE CTE_ConsecutiveDays AS
(
SELECT
my_date,
weather,
1 AS consecutive_days
FROM My_Table T
WHERE
NOT EXISTS (SELECT * FROM My_Table T2 WHERE T2.my_date = T.my_date - INTERVAL '1 day' AND T2.weather = T.weather)
UNION ALL
SELECT
T.my_date,
T.weather,
CD.consecutive_days + 1
FROM
CTE_ConsecutiveDays CD
INNER JOIN My_Table T ON
T.my_date = CD.my_date + INTERVAL '1 day' AND
T.weather = CD.weather
)
SELECT *
FROM CTE_ConsecutiveDays
ORDER BY my_date;
Here's the SQL Fiddle to test: http://www.sqlfiddle.com/#!15/383e5/3