Filtering by window function result in Postgresql - sql

OK, initially this was just a joke between a friend of mine and me, but it turned into an interesting technical question :)
I have the following stuff table:
CREATE TABLE stuff
(
id serial PRIMARY KEY,
volume integer NOT NULL DEFAULT 0,
priority smallint NOT NULL DEFAULT 0
);
The table contains the records for all of my stuff, with respective volume and priority (how much I need it).
I have a bag with a specified volume, say 1000. I want to select from the table all the stuff I can put into the bag, packing the most important stuff first.
This seems like the case for using window functions, so here is the query I came up with:
select s.*, sum(volume) OVER previous_rows as total
from stuff s
where total < 1000
WINDOW previous_rows as
(ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
order by priority desc
The problem with it, however, is that Postgres complains:
ERROR: column "total" does not exist
LINE 3: where total < 1000
If I remove this filter, total column gets properly calculated, results properly sorted but all stuff gets selected, which is not what I want.
So, how do I do this? How do I select only items that can fit into the bag?

I don't know if this qualifies as "more elegant" but it is written in a different manner than Cybernate's solution (although it is essentially the same)
WITH window_table AS
(
SELECT s.*,
sum(volume) OVER previous_rows as total
FROM stuff s
WINDOW previous_rows as
(ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
)
SELECT *
FROM window_table
WHERE total < 1000
ORDER BY priority DESC
If by "more elegant" you mean something that avoids the sub-select, then the answer is "no"

I haven't worked with PostgreSQL. However, my best guess would be using an inline view.
SELECT a.*
FROM (
SELECT s.*, sum(volume) OVER previous_rows AS total
FROM stuff AS s
WINDOW previous_rows AS (
ORDER BY priority desc
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY priority DESC
) AS a
WHERE a.total < 1000;

Related

How do I group a set of entities (people) into 10 equal portions based on their usage, for example using Oracle's Toad 11g (SQL)

Hi, I have a list of 2+ million people and their usage, put in order from largest to smallest.
I tried ranking using row_number () over (partition by user column order by usage desc) as rnk
but that didn't work; the results were crazy.
Simply put, I just want 10 equal groups, with the first group consisting of the highest usage, in the order in which I had first listed them.
HELP!
You can use ntile():
select t.*, ntile(10) over (order by usage desc) as usage_decile
from t;
The only caveat: this will divide the data into exactly 10 equal-sized groups. If usage values have duplicates, then users with the same usage may end up in different deciles.
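A minimal sketch of that caveat, using hypothetical inline data where every usage value ties:

```sql
-- Ten rows with identical usage: ntile(10) still forces exactly
-- one row per decile, so the tied rows are split across deciles
-- arbitrarily.
select usage, ntile(10) over (order by usage desc) as usage_decile
from (values (5), (5), (5), (5), (5), (5), (5), (5), (5), (5)) as t(usage);
-- every row has usage = 5, yet the deciles run from 1 through 10
```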
If you don't want that behavior, use a more manual calculation:
select t.*,
ceil(rank() over (order by usage desc) * 10 /
count(*) over ()
) as usage_decile
from t;

Why Window Functions Require My Aggregated Column in Group

I have been working with window functions a fair amount but I don't think I understand enough about how they work to answer why they behave the way they do.
For the query that I was working on (below), why am I required to take my aggregated field and add it to the group by? (In the second half of my query below I am unable to produce a result if I don't include "Events" in my second group by)
With Data as (
Select
CohortDate as month
,datediff(week,CohortDate,EventDate) as EventAge
,count(distinct case when EventDate is not null then GUID end) as Events
From MyTable
where month >= [getdate():month] - interval '12 months'
group by 1, 2
order by 1, 2
)
Select
month
,EventAge
,sum(Events) over (partition by month order by SubAge asc rows between unbounded preceding and current row) as TotEvents
from data
group by 1, 2, Events
order by 1, 2
I have run into this enough that I have just taken it for granted, but would really love some more color as to why this is needed. Is there a way I should be formatting these differently in order to avoid this (somewhat non-intuitive) requirement?
Thanks a ton!
What you are looking for is presumably a cumulative sum. That would be:
select month, EventAge,
sum(sum(Events)) over (partition by month
order by SubAge asc
rows between unbounded preceding and current row
) as TotEvents
from data
group by 1, 2
order by 1, 2 ;
Why? That might be a little hard to explain. Perhaps if you see the equivalent version with a subquery it will be clearer:
select me.*,
sum(sum_events) over (partition by month
order by SubAge asc
rows between unbounded preceding and current row
) as TotEvents
from (select month, EventAge, sum(events) as sum_events
from data
group by 1, 2
) me
order by 1, 2 ;
This is pretty much an exact shorthand for the first query. The window function is evaluated after aggregation. You want to sum the SUM of the events after the aggregation. Hence, you need sum(sum(events)). After the aggregation, events is no longer available.
The nesting of aggregation functions is awkward at first -- at least it was for me. When I first started using window functions, I think I first spent a few days writing aggregation queries using subqueries and then rewriting without the subqueries. Quickly, I got used to writing them without subqueries.
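A self-contained sketch of the nested-aggregate form, with hypothetical inline data standing in for the CTE (the column names are illustrative):

```sql
-- The inner sum(events) is the per-group aggregate; the outer
-- sum(...) over (...) then accumulates those group totals.
select month, EventAge,
       sum(sum(events)) over (partition by month
                              order by EventAge
                              rows between unbounded preceding and current row
                             ) as TotEvents
from (values (1, 0, 2), (1, 0, 3), (1, 1, 4)) as d(month, EventAge, events)
group by month, EventAge
order by month, EventAge;
-- month 1: EventAge 0 groups to 5, so TotEvents runs 5, then 9
```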

i need to count how many below average salary the table has

SELECT
*,
COUNT (AnnualSalary < avg(AnnualSalary)) AS Count
FROM Assessment
GROUP BY ServiceType
This is a Hive query; I'm trying to count how many records from the table earn less than the average salary.
First, distribute rows into different partitions according to their ServiceType. Without an ORDER BY or an explicit frame specification, the default frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
Then, apply the aggregate AVG as an analytic function over each window to get the average AnnualSalary for each partition. Consequently, we know whether each record's AnnualSalary is below the average of its partition.
Finally, a count on the intermediate result set.
SELECT
SERVICETYPE,
SUM(ISBELOW)
FROM (
SELECT
*,
CASE
WHEN ANNUALSALARY < AVG(ANNUALSALARY) OVER (PARTITION BY SERVICETYPE) THEN 1
ELSE 0
END AS ISBELOW
FROM ASSESSMENT
) TMP
GROUP BY SERVICETYPE
;
Note: a HAVING clause filters after GROUP BY, and the details of individual rows are lost before that filter is applied.
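To see why HAVING cannot do this directly, consider this naive attempt (a sketch, expected to fail):

```sql
-- After GROUP BY SERVICETYPE, individual ANNUALSALARY values no
-- longer exist as rows, so referencing one in HAVING is an error.
SELECT SERVICETYPE, COUNT(*)
FROM ASSESSMENT
GROUP BY SERVICETYPE
HAVING ANNUALSALARY < AVG(ANNUALSALARY); -- error: not a grouping key
```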

How do I group entry timestamps into visitor-specific "sessions" in PostgreSQL?

Here's some mock data:
visitor_id,channel,timestamp,order_id,session
100,A,1,,1
100,B,2,,1
100,A,3,,1
100,B,4,1,1
100,B,5,,2
100,B,6,,2
100,B,7,2,2
100,A,8,,3
100,A,9,,3
A visitor will come into the site via channels, and eventually order (creating an order_id). Many visitors never order, but I still want to group their session together (to determine what was their first channel, for example). The last column is one example solving the problem.
What's an efficient, declarative statement to create it in PostgreSQL? Are there better solutions than what I am proposing?
You want to combine the values up to an order. One method would be to assign each row a grouping id, such as the number of orders before a given time. This can be done with a correlated subquery:
select md.*,
(select count(md2.order_id)
from mockdata md2
where md2.visitor_id = md.visitor_id and
md2.timestamp < md.timestamp
) as session
from mockdata md;
This can also be done using a cumulative count:
select md.*,
count(order_id) over (partition by visitor_id
order by timestamp
rows between unbounded preceding and 1 preceding
) as session
from mockdata md;
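Note that both versions number sessions from 0 (the count of prior orders). If you want them to start at 1, as in the mock data, add 1 to the cumulative count; a sketch against the same mockdata table:

```sql
-- 1 + (orders strictly before this row) reproduces the 1-based
-- session column from the mock data.
select md.*,
       1 + count(order_id) over (partition by visitor_id
                                 order by timestamp
                                 rows between unbounded preceding and 1 preceding
                                ) as session
from mockdata md;
-- the row at timestamp 5 has one prior order, so session = 2
```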

COUNT(*) of a windowed SQL range

In Postgres 9.1 I'm using a windowing function like so:
SELECT a.category_id, (dense_rank() over w) - 1
FROM (
_inner select_
) a
WINDOW w AS (PARTITION BY category_id ORDER BY score)
What I can't figure out is how to also select the total number of elements in the windowed range. If I just use count(*) over w that tells me how many elements I've seen in the window so far instead of the total number in the window.
My core issue here is that cume_dist() counts from 1, not 0, for the number of rows before or equal to you. percent_rank() counts from 0, like I need, but then it also subtracts 1 from the total number of rows when it does its division.
SELECT
a.category_id,
(dense_rank() over w) - 1,
count(*) over (partition by category_id) --without order by
FROM (
_inner select_
) a
WINDOW w AS (PARTITION BY category_id ORDER BY score)
From the manual on Window Functions:
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.
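The distinction the manual describes can be seen side by side over hypothetical inline data:

```sql
SELECT score,
       count(*) OVER (ORDER BY score) AS running_count,   -- default frame: start of partition through current row
       count(*) OVER ()               AS partition_count  -- no ORDER BY: the whole partition
FROM (VALUES (10), (20), (30)) AS t(score);
-- running_count goes 1, 2, 3; partition_count is 3 on every row
```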