Get aggregate over n last values in vertica - sql

We have a table with the columns dates, sales and item.
An item's price can be different at every sale, and we want to find the price of an item, averaged over its most recent 50 sales.
Is there a way to do this using analytical functions in Vertica?
For a popular item, all these 50 sales could be from this week. For another, we may need to have a 3 month window.
Can we know what these windows are, per item?

You would use a window-frame clause to get the value on every row. With the rows ordered by ascending date, the frame covers each sale and the 49 sales before it:
select t.*,
avg(t.price) over (partition by item
order by t.date
rows between 49 preceding and current row
) as avg_price_50
from t;
On re-reading the question, I suspect you want a single row per item. For that, use row_number():
select t.item, avg(t.price)
from (select t.*,
row_number() over (partition by item order by t.date desc) as seqnum
from t
) t
where seqnum <= 50
group by item;
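The question also asks which date window those sales cover for each item. As a rough sketch, the same row_number() filter can return min and max dates alongside the average. This runs against in-memory SQLite (3.25 or later, for window functions) through Python's sqlite3 as a stand-in for Vertica. The table and column names follow the answer; the sample rows and N = 3 (instead of 50) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (item text, date text, price real)")
# Hypothetical sample data: item "a" has 4 sales, item "b" only 2.
conn.executemany("insert into t values (?, ?, ?)", [
    ("a", "2020-01-01", 10.0),
    ("a", "2020-01-02", 20.0),
    ("a", "2020-01-03", 30.0),
    ("a", "2020-01-04", 40.0),
    ("b", "2019-10-01", 5.0),
    ("b", "2020-01-01", 7.0),
])

N = 3  # the question uses 50
query = """
select item,
       avg(price) as avg_price,
       min(date)  as window_start,
       max(date)  as window_end,
       count(*)   as n_sales
from (select t.*,
             row_number() over (partition by item order by date desc) as seqnum
      from t
     ) t
where seqnum <= ?
group by item
order by item
"""
result = conn.execute(query, (N,)).fetchall()
for row in result:
    print(row)
```

Items with fewer than N sales simply use all of their rows, which is why n_sales is returned as well.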

Related

How would I get the last N quarters?

How would I get the last N quarters? I would like to extract the data for the last 5 quarters (including the current quarter).
Below is a SQL query that just groups by milestone to show how many unique data points there are. Each of these milestones contains multiple rows.
SELECT LEFT(MILESTONE,7) AS MILESTONE2
FROM XXXTable
WHERE MILESTONE LIKE '%M0'
GROUP BY 1
ORDER BY MILESTONE2 DESC
MILESTONE2
2020_Q4
2020_Q3
2020_Q2
2020_Q1
2019_Q4
2019_Q3
2019_Q2
2019_Q1
2018_Q4
2018_Q3
You can use dense_rank():
select t.*
from (select t.*, dense_rank() over (order by LEFT(MILESTONE, 7) desc) as seqnum
from XXXTable t
where MILESTONE like '%M0'
) t
where seqnum <= 5
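To see why dense_rank() fits here: every row in the same quarter gets the same rank, so filtering on the rank keeps whole quarters rather than individual rows. A small sketch using SQLite through Python's sqlite3, with substr() standing in for LEFT(), which SQLite lacks. The `val` column, the sample rows and the cutoff of 2 quarters are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table XXXTable (MILESTONE text, val integer)")
# Hypothetical rows; one quarter has two rows, one row is not an M0 milestone.
conn.executemany("insert into XXXTable values (?, ?)", [
    ("2020_Q4_M0", 1), ("2020_Q4_M0", 2),
    ("2020_Q3_M0", 3),
    ("2020_Q2_M0", 4),
    ("2020_Q2_M1", 5),  # removed by the LIKE filter
    ("2020_Q1_M0", 6),
])

query = """
select t.MILESTONE, t.val
from (select t.*,
             dense_rank() over (order by substr(MILESTONE, 1, 7) desc) as seqnum
      from XXXTable t
      where MILESTONE like '%M0'
     ) t
where seqnum <= 2   -- last 2 quarters here; the question wants 5
order by t.val
"""
kept = conn.execute(query).fetchall()
print(kept)
```

Both 2020_Q4 rows survive because they share rank 1; row_number() would have counted them as two separate positions.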

Spark SQL - Finding the maximum value of a month per year

I have created a data frame which contains Year, Month, and the occurrence of incidents (count).
I want to find which month of each year had the most incidents, using Spark SQL.
You can use window functions:
select *
from (select t.*, rank() over(partition by year order by cnt desc) rn from mytable t) t
where rn = 1
For each year, this gives you the row that has the greatest cnt. If there are ties, the query returns them.
Note that count is a language keyword in SQL, hence not a good choice for a column name. I renamed it to cnt in the query.
You can use window functions, if you want to use SQL:
select t.*
from (select t.*,
row_number() over (partition by year order by count desc) as seqnum
from t
) t
where seqnum = 1;
This returns one row per year, even if there are ties for the maximum count. If you want all such rows in the event of ties, then use rank() instead of row_number().
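The difference between the two answers only shows up when a year has tied months. A minimal sketch of that behaviour, run on SQLite via Python's sqlite3 rather than Spark (an assumption made to keep the example self-contained); the table name `mytable`, the `cnt` column and the sample counts are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (year integer, month integer, cnt integer)")
# Hypothetical data: 2019 has two months tied at 9 incidents.
conn.executemany("insert into mytable values (?, ?, ?)", [
    (2019, 1, 5), (2019, 2, 9), (2019, 3, 9),
    (2020, 1, 4), (2020, 2, 2),
])

def top_months(func):
    """Return the rows ranked first per year using the given window function."""
    sql = f"""
    select year, month, cnt
    from (select t.*,
                 {func}() over (partition by year order by cnt desc) as rn
          from mytable t
         ) t
    where rn = 1
    order by year, month
    """
    return conn.execute(sql).fetchall()

with_ties = top_months("rank")           # keeps both tied months of 2019
one_per_year = top_months("row_number")  # keeps one arbitrary month per year
print(with_ties)
print(one_per_year)
```

Which of the tied months row_number() keeps is not defined unless the order by includes a tie-breaker such as the month.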

Is there a way to split rows into groups based on certain values?

Consider this table:
I want to divide these rows into groups based on their id and price values: as long as two rows have the same id and price and are not separated by any other row, they belong to the same group. So I expect the output to be sorta like this:
I tried using window functions but with them I ended up with the last row having the same group as the first 3. Is there something I'm missing?
This is a gaps-and-islands problem. One method is to use lag() to detect changes and then a cumulative sum:
select t.*,
sum(case when prev_price = price then 0 else 1 end) over
(partition by id order by dt) as group_num
from (select t.*,
lag(price) over (partition by id order by dt) as prev_price
from t
) t
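Since the example tables from the question are not available, here is a small sketch of the lag() plus cumulative sum technique on made-up data, run on SQLite through Python's sqlite3. The column names `id`, `dt` and `price` follow the answer; the rows and the integer `dt` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, dt integer, price integer)")
# Hypothetical rows: the price returns to 100 at the end, after a run of 200.
conn.executemany("insert into t values (?, ?, ?)", [
    (1, 1, 100), (1, 2, 100), (1, 3, 100),
    (1, 4, 200), (1, 5, 200),
    (1, 6, 100),
])

query = """
select dt, price,
       sum(case when prev_price = price then 0 else 1 end)
           over (partition by id order by dt) as group_num
from (select t.*,
             lag(price) over (partition by id order by dt) as prev_price
      from t
     ) t
order by id, dt
"""
groups = conn.execute(query).fetchall()
for row in groups:
    print(row)
```

The last row gets group 3 rather than group 1, because a new group starts every time the price differs from the previous row, not every time a new price value appears.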

SQL query for backfilling register read values

I have a table with ID, timestamp and register reads for a day. The register reads are running totals that start at 12.00 at midnight and end at 11.00 at night.
The problem is that the cumulative reads may be missing for some random time intervals, and I need to backfill those.
The picture below gives a snapshot of the problem. KWH_RDNG is the difference between two cumulative intervals divided by 1000, but the value 5.851 in the 4th column is actually the accumulation of 3 missing hours along with the 4th hour's value. It's fine if I simply divide 5.851 by 4 and distribute it.
The challenge is that the gaps can happen at random intervals and can differ between meters (1st column). I am using SQL Server 2016.
Please help!
This is a gaps and islands problem -- sort of. You need to identify groups of NULL values with the subsequent value. One method is to use a cumulative count of the non-NULL values on or after each row. This defines the groups.
Then, you need the count and the reading. So, this should do the calculation:
select t.*,
(max_kwh_rding / cnt) as new_kwh_rding
from (select t.*, count(*) over (partition by meter_serial, grp) as cnt,
max(kwh_rding) over (partition by meter_serial, grp) as max_kwh_rding
from (select t.*,
count(kwh_rding) over (partition by meter_serial order by read_utc desc rows between unbounded preceding and current row) as grp
from t
) t
) t
where cnt > 1;
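To make the grouping step concrete, here is a sketch of the same query on made-up readings, run on SQLite through Python's sqlite3 instead of SQL Server (an assumption for the sake of a self-contained example). The column names follow the answer; the meter id, the integer hours used for `read_utc` and the values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (meter_serial text, read_utc integer, kwh_rding real)")
# Hypothetical meter: hours 2-4 are missing and hour 5 carries their total.
conn.executemany("insert into t values (?, ?, ?)", [
    ("m1", 1, 1.0),
    ("m1", 2, None),
    ("m1", 3, None),
    ("m1", 4, None),
    ("m1", 5, 8.0),
])

query = """
select read_utc, max_kwh_rding / cnt as new_kwh_rding
from (select t.*,
             count(*) over (partition by meter_serial, grp) as cnt,
             max(kwh_rding) over (partition by meter_serial, grp) as max_kwh_rding
      from (select t.*,
                   -- counting non-NULL readings from latest to earliest puts
                   -- each run of NULLs in the group of the reading after it
                   count(kwh_rding) over (partition by meter_serial
                                          order by read_utc desc
                                          rows between unbounded preceding
                                                   and current row) as grp
            from t
           ) t
     ) t
where cnt > 1
order by read_utc
"""
backfilled = conn.execute(query).fetchall()
for row in backfilled:
    print(row)
```

Hour 1 forms a group of its own (cnt = 1), so it is filtered out and left unchanged; hours 2 to 5 share the 8.0 reading equally.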
You can incorporate this into an update:
with toupdate as (
select t.*,
(max_kwh_rding / cnt) as new_kwh_rding
from (select t.*, count(*) over (partition by meter_serial, grp) as cnt,
max(kwh_rding) over (partition by meter_serial, grp) as max_kwh_rding
from (select t.*,
count(kwh_rding) over (partition by meter_serial order by read_utc desc rows between unbounded preceding and current row) as grp
from t
) t
) t
where cnt > 1
)
update toupdate
set kwh_rding = new_kwh_rding;

SQL Insert Statement that pulls top n from each set of categories that could have duplicates

I am trying to write an INSERT statement that goes through sales numbers for a group of people, with each sale marked as an R or C type of sale. I want to find the top 100 salespersons in ALL (both R and C), in R, and in C. I don't only have sales data, though: I have Sales, Margin, Count and Sales/Count data that I want to do the same thing for. So far I need 12 SQL statements to accomplish this (4 categories x 3 sale types), each one a slight variation of the following, which covers one of my 4 categories:
INSERT INTO ztbl_AllTopSalesPerson (SalesPerson)
SELECT TOP 100 tbl_Master.SalesPerson
FROM tbl_Master
WHERE tbl_Master.SaleType="C"
GROUP BY tbl_Master.SalesPerson
ORDER BY Sum(tbl_Master.Margin) DESC;
INSERT INTO ztbl_AllTopSalesPerson (SalesPerson)
SELECT TOP 100 tbl_Master.SalesPerson
FROM tbl_Master
WHERE tbl_Master.SaleType="R"
GROUP BY tbl_Master.SalesPerson
ORDER BY Sum(tbl_Master.Margin) DESC;
INSERT INTO ztbl_AllTopSalesPerson (SalesPerson)
SELECT TOP 100 tbl_Master.SalesPerson
FROM tbl_Master
GROUP BY tbl_Master.SalesPerson
ORDER BY Sum(tbl_Master.Margin) DESC;
Ideally I would like a way to make this all one statement. And (if it is not impossible) I would like to filter each one by date so I can do it on monthly data too, not just overall.
Just a few notes: I can't have duplicate names, so if a salesperson is top in all three sales types, they still only appear once. I'm using Access with a SQL Server back-end for only the main data table. I can't take the top 300 results, because there is so much overlap between the sales types, and I need the top from each (I do a separate query after this list is made that lines up the salespersons alphabetically with their 4 categories as fields). And lastly, I generally end up with a final list that has around 260-290 records.
THANKS!
p.s. thanks for your replies, stack exchange has saved my bacon 100s of times. I would post my attempts at this, but I think it would hurt more than it would help.
You might have to tweak it a little depending on what sort of output you want. You also might have to do a subquery for the COUNT(*) part of it, as this is untested. But I think this is the general idea of what you are looking for.
To get aggregated information, you can break it up into two CTE's:
WITH CTE1 AS (
SELECT SalesPerson,
SaleType,
SUM(Margin) OVER (PARTITION BY SalesPerson,SaleType) as Margin,
SUM(Sales) OVER (PARTITION BY SalesPerson,SaleType) as Sales,
SUM(Sales) OVER (PARTITION BY SalesPerson,SaleType) / COUNT(*) OVER (PARTITION BY SalesPerson,SaleType) as Sales_pct,
COUNT(*) OVER (PARTITION BY SalesPerson,SaleType) as Total,
SUM(Margin) OVER (PARTITION BY SalesPerson) as Margin_all,
SUM(Sales) OVER (PARTITION BY SalesPerson) as Sales_all,
SUM(Sales) OVER (PARTITION BY SalesPerson) / COUNT(*) OVER (PARTITION BY SalesPerson) as Sales_pct_all,
COUNT(*) OVER (PARTITION BY SalesPerson) as Total_all
FROM tbl_Master
)
-- CTE1 keeps one row per sale, so DENSE_RANK is used: RANK would count the repeated rows and inflate the ranks
,CTE2 AS (
SELECT SalesPerson
,DENSE_RANK() OVER (PARTITION BY SaleType ORDER BY Margin desc) as Margin
,DENSE_RANK() OVER (PARTITION BY SaleType ORDER BY Sales desc) as Sales
,DENSE_RANK() OVER (PARTITION BY SaleType ORDER BY Sales_pct desc) as Sales_pct
,DENSE_RANK() OVER (PARTITION BY SaleType ORDER BY Total desc) as Total
,DENSE_RANK() OVER (ORDER BY Margin_all desc) as Margin_all
,DENSE_RANK() OVER (ORDER BY Sales_all desc) as Sales_all
,DENSE_RANK() OVER (ORDER BY Sales_pct_all desc) as Sales_pct_all
,DENSE_RANK() OVER (ORDER BY Total_all desc) as Total_all
FROM CTE1 )
Select distinct SalesPerson from CTE2
Where Margin <= 100 Or Sales <= 100 Or Total <= 100 or Sales_pct <= 100
Or Margin_all <= 100 Or Sales_all <= 100 Or Total_all <= 100 or Sales_pct_all <= 100
I understand this is not perfect, but it should get you started. To filter by date, add DATEPART(month,[your date field]) to your PARTITION BY clauses (and the first CTE)
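One thing to watch with this pattern: because the first CTE uses window functions without grouping, it keeps one row per sale, and the ranking step then sees each salesperson's total once per sale. A small sketch of how rank() and dense_rank() behave on such repeated rows, run on SQLite through Python's sqlite3 as a stand-in for SQL Server. The CTE name `per_sale`, the salesperson names and the margins are made up, and only the Margin measure is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl_Master (SalesPerson text, SaleType text, Margin real)")
# Hypothetical sales: ann has three sales, bob and cid one each.
conn.executemany("insert into tbl_Master values (?, ?, ?)", [
    ("ann", "C", 50.0), ("ann", "C", 50.0), ("ann", "C", 50.0),  # total 150
    ("bob", "C", 120.0),
    ("cid", "C", 10.0),
])

query = """
with per_sale as (
    -- one output row per sale, each carrying the salesperson's total
    select SalesPerson,
           sum(Margin) over (partition by SalesPerson, SaleType) as Margin
    from tbl_Master
)
select distinct SalesPerson,
       rank()       over (order by Margin desc) as rnk,
       dense_rank() over (order by Margin desc) as dense_rnk
from per_sale
order by dense_rnk
"""
ranks = conn.execute(query).fetchall()
for row in ranks:
    print(row)
```

Here bob is second by total margin, yet rank() places him fourth because ann's three rows occupy positions 1 to 3, so a `<= 100` filter on rank() would return fewer than 100 people. dense_rank() numbers distinct totals consecutively; the trade-off is that people with exactly equal totals share a rank.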