Selecting a grouped condition in an Aggregate query

I have a view that averages some statistics over past rows relative to the current outer row.
Think of a batting average over all previous at-bats for each batter. This works as I would like, but I would like more control over old_foo.date_.
The view's idealized query is like this:
create view myview as
select
avg(old_foo.stuff),
foo.person_id,
foo.date_ as this_date
from
foo
join ( select stuff, person_id, date_ from foo) as old_foo
on old_foo.date_ < foo.date_
group by foo.person_id, this_date
;
But what I would really like is to be able to set the minimum old_foo.date_ from outside the
view, so I could create arbitrary moving averages on the fly.
Such as:
select * from myview where mindate > now()::date - 10
(mindate is fictitious since I lose it with the group by)
I know I can do this with a function, but I would prefer not to. Would CTEs give me more flexibility for what I want?
Edit
I can't bring the old date column to the top level of the view without grouping by it (which is not what I want). I want the view to be general, so I could just as easily do a 10-day moving average as a 20-day one, or use any date I like. The old dates are in the inner query, so I have no access to them once I create the view.

I figured it out :)
create view myview as
select
avg(old_foo.stuff),
foo.person_id,
foo.date_ as this_date,
day_offset
from
generate_series(1, 100) as day_offset
cross join foo
join ( select stuff, person_id, date_ from foo) as old_foo
on old_foo.date_ < foo.date_
and old_foo.date_ > foo.date_ - day_offset
group by foo.person_id, foo.date_, day_offset
;
select * from myview where day_offset = 10;
Then day_offset simulates a function parameter. (The column is named day_offset rather than offset because OFFSET is a reserved word in PostgreSQL.)
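The idea of exposing the window length as a column can be tried out with a small self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so several things are assumptions and not part of the original view: the helper table win replaces generate_series(), days are plain integers to avoid date arithmetic, the join also matches on person_id, and the lower bound is inclusive so that an offset of N covers the previous N days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo (person_id INTEGER, day INTEGER, stuff REAL);
INSERT INTO foo VALUES
  (1, 1, 10.0), (1, 2, 20.0), (1, 3, 30.0), (1, 4, 40.0);

-- Hypothetical helper table standing in for generate_series(1, 3).
CREATE TABLE win (day_offset INTEGER);
INSERT INTO win VALUES (1), (2), (3);

-- One output row per (person, day, window length).
CREATE VIEW moving_avg AS
SELECT foo.person_id,
       foo.day AS this_day,
       win.day_offset,
       avg(old_foo.stuff) AS avg_stuff
FROM win
CROSS JOIN foo
JOIN foo AS old_foo
  ON old_foo.person_id = foo.person_id
 AND old_foo.day < foo.day
 AND old_foo.day >= foo.day - win.day_offset
GROUP BY foo.person_id, foo.day, win.day_offset;
""")


def moving_avg(day_offset):
    """Pick one window length, as if it were a view parameter."""
    return conn.execute(
        "SELECT this_day, avg_stuff FROM moving_avg "
        "WHERE day_offset = ? ORDER BY this_day",
        (day_offset,),
    ).fetchall()


two_day = moving_avg(2)
three_day = moving_avg(3)
print(two_day)
print(three_day)
```

The trade-off is that the view computes every window length up to the series limit unless the planner can push the day_offset filter down, so the series should be kept as small as the use case allows.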

Try using the HAVING clause. Here is a reference:
http://www.postgresql.org/docs/8.1/static/tutorial-agg.html
I believe it would look something like this:
create view myview as
select
avg(old_foo.stuff),
foo.person_id,
foo.date_ as this_date
from
foo
join ( select stuff, person_id, date_ from foo) as old_foo
on old_foo.date_ < foo.date_
group by foo.person_id, foo.date_
having min(foo.date_) <= now()::date - 10

Related

Insert data from table into a new one with condition

Okay, so this has been bugging me the whole day. I have two tables (e.g original_table and new_table). The new table is empty and I need to populate it with records from original_table given the following conditions:
Trip duration must be at least 30 seconds
Include only stations which have at least 100 trips starting there
Include only stations which have at least 100 trips ending there
The duration part is easy, but I find it hard to filter the other two conditions.
I tried to make two temporary tables like so:
CREATE TEMP TABLE start_stations AS(
SELECT ARRAY(SELECT DISTINCT start_station_id FROM `dataset.original_table`
WHERE duration_sec >= 30
GROUP BY start_station_id
HAVING COUNT(start_station_id)>=100
AND COUNT(end_station_id)>=100) as arr
);
CREATE TEMP TABLE end_stations AS(
SELECT ARRAY(SELECT DISTINCT end_station_id FROM `dataset.original_table`
WHERE duration_sec >= 30
GROUP BY end_station_id
HAVING COUNT(end_station_id)>=100
AND COUNT(start_station_id)>=100) as arr
);
And then try to insert in the new_table like this:
INSERT INTO `dataset.new_table`
SELECT a.* FROM `dataset.original_table` as a, start_stations as ss,
end_stations as es
WHERE a.start_station_id IN UNNEST(ss.arr)
AND a.end_station_id IN UNNEST(es.arr)
However, this does not give me the right answer. I tried to write a temporary function to clean up the data, but I didn't get far. :(
Here's a sample of the table:
trip_id|duration_sec|start_date|start_station_id| end_date|end_station_id|
--------------------------------------------------------------------------|
afad333| 231|2017-12-20| 210|2017-12-20| 355|
sffde56| 35|2017-12-12| 355|2017-12-12| 210|
af33445| 333|2018-10-27| 650|2018-10-27| 650|
dd1238d| 456|2017-09-15| 123|2017-09-15| 210|
dsa2223| 500|2017-09-15| 210|2017-09-15| 123|
...
I will be very thankful if you can help me.
Thanks in advance!

The approach should be:
with major_stations as(
select start_station_id station_id
from trips
group by start_station_id
having count(*) >= 100
union
select end_station_id station_id
from trips
group by end_station_id
having count(*) >= 100
)
select *
from trips
where start_station_id in (select station_id from major_stations)
and duration_sec >= 30
There may be an easier way, but this is the first approach I can think of.

So I found what my problem was. Since I must filter out stations where 100 trips started AND ended, doing it the way I did before was wrong.
The current answer for me was this:
INSERT INTO dataset.new_table
WITH stations AS (
SELECT start_station_id, end_station_id FROM dataset.original_table
GROUP BY start_station_id, end_station_id
HAVING count(start_station_id)>=100
AND count(end_station_id)>=100
)
SELECT a.* FROM dataset.original_table AS a, stations as s
WHERE a.start_station_id = s.start_station_id
AND a.end_station_id = s.end_station_id
AND a.duration_sec >= 30
This way I am creating only one WITH clause which filters only start AND end stations, by the given criteria.
As easy as it looks, obviously my brain needs a rest sometimes and a start with a new perspective.
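The two answers above read the station conditions differently: one counts trips per station, the other per (start, end) pair. As a hedged sketch of the per-station reading, the following uses Python's built-in sqlite3 module with made-up trips and a threshold of 2 standing in for 100. It assumes that a station qualifies when it has enough starts and enough ends (counted over all trips), and that a trip is kept when both of its stations qualify and it lasts at least 30 seconds. The table name trips and the sample rows are hypothetical.

```python
import sqlite3

MIN_TRIPS = 2      # stands in for 100 in the question
MIN_DURATION = 30  # seconds

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (trip_id TEXT, duration_sec INTEGER,
                    start_station_id INTEGER, end_station_id INTEGER);
INSERT INTO trips VALUES
  ('t1', 231, 210, 355),
  ('t2',  35, 355, 210),
  ('t3', 333, 210, 355),
  ('t4',  10, 355, 210),
  ('t5', 500, 123, 210),
  ('t6',  45, 210, 123);
""")

query = """
WITH busy AS (
  -- Stations with enough trips starting there ...
  SELECT start_station_id AS station_id
  FROM trips
  GROUP BY start_station_id
  HAVING count(*) >= :min_trips
  INTERSECT
  -- ... that also have enough trips ending there.
  SELECT end_station_id
  FROM trips
  GROUP BY end_station_id
  HAVING count(*) >= :min_trips
)
SELECT trip_id
FROM trips
WHERE duration_sec >= :min_duration
  AND start_station_id IN (SELECT station_id FROM busy)
  AND end_station_id IN (SELECT station_id FROM busy)
ORDER BY trip_id
"""

kept = [row[0] for row in conn.execute(
    query, {"min_trips": MIN_TRIPS, "min_duration": MIN_DURATION})]
print(kept)
```

Here station 123 has only one start and one end, so trips t5 and t6 drop out, and t4 is removed by the duration filter. If the station counts should only consider trips of at least 30 seconds, the duration condition would also have to go inside the busy CTE.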

PostgreSQL GROUP BY that includes zeros

I have a SQL query (postgresql) that looks something like this:
SELECT
my_timestamp::timestamp::date as the_date,
count(*) as count
FROM my_table
WHERE ...
GROUP BY the_date
ORDER BY the_date
The result is a table of YYYY-MM-DD, count pairs.
Now I've been asked to fill in the empty dates with zero. So if I was previously providing
2022-03-15 3
2022-03-17 1
I'd now want to return
2022-03-15 3
2022-03-16 0
2022-03-17 1
Now I can easily do this client-side (relative to the database) and let my program compute and return the zero-augmented list to its clients based on the original list from postgres. But perhaps it would be better if I could just tell postgresql to include zeros.
I suspect this isn't easy at all, because postgres has no obvious way of knowing what I'm up to. But in the interests of learning more about postgres and SQL, I thought I'd have a try. The try isn't too promising thus far...
Any pointers before I conclude that I was right to leave this to my (postgres client) program?
Update
This is an interesting case where my simplification of the problem led to a correct answer that didn't work for me. For those who come after, I thought it worth documenting what followed, because it takes some fun twists through constructing SQL queries.
@a_horse_with_no_name responded with a query that I've verified works if I simplify my own query to match. Unfortunately, my query had some extra baggage that I didn't think pertinent, and so had trimmed out when posting the original question.
Here's my real (original) query, with all names preserved (if shortened):
-- current query
SELECT
LEAST(time1, time2, time3, time4)::timestamp::date as the_date,
count(*) as count
FROM reading_group_reader rgr
INNER JOIN ( SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
WHERE LEAST(time1, time2, time3, time4) > current_date - 30
GROUP BY the_date
ORDER BY the_date;
If I translate that directly into the proposed solution, however, the inner join between reading_group_reader and the temporary table TT causes the left join to become inner (I think) and the date sequence drops its zeros again. Fwiw, the table TT is a table because sometimes it actually is a subselect.
So I transformed my query into this:
SELECT
g.dt::date as the_date,
count(*) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY the_date;
but this outputs 1's instead of 0's at the places that should be 0.
The reason is that the left join now produces a row for every date, so count(*) counts at least one row per date. I need to include an additional field (which will be NULL for unmatched dates) and count that.
So this query finally does what I want:
SELECT
g.dt::date as the_date,
count(rgrt.device_id) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date,
rgr.device_id
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)
) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt(the_date)
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY g.dt;
And, of course, on re-reading the accepted answer, I eventually saw that he did count an unrelated field, which I'd simply missed on my first several readings.

You will need to join to a list of dates. This can e.g. be done using generate_series()
SELECT g.dt::date as the_date,
count(t.my_timestamp) as count
FROM generate_series(date '2022-03-01',
date '2022-03-31',
interval '1 day') as g(dt)
LEFT JOIN my_table as t
ON t.my_timestamp::date = g.dt::date
AND ... -- the original WHERE clause goes here!
GROUP BY the_date
ORDER BY the_date;
Note that the original WHERE conditions need to go into the join condition of the LEFT JOIN. You can't put them into a WHERE clause because that would turn the outer join back into an inner join (which means the missing dates wouldn't be returned).
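Both pitfalls discussed here (filtering in WHERE instead of ON, and count(*) instead of counting a column from the joined table) can be reproduced in a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so the date series is built with a recursive CTE instead of generate_series(). The events table, its columns and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT, kind TEXT);
INSERT INTO events (id, ts, kind) VALUES
  (1, '2022-03-15 08:00:00', 'read'),
  (2, '2022-03-15 09:30:00', 'read'),
  (3, '2022-03-15 23:59:00', 'read'),
  (4, '2022-03-16 12:00:00', 'write'),
  (5, '2022-03-17 10:00:00', 'read');
""")

# Recursive CTE standing in for generate_series(date, date, interval '1 day').
DAYS = """
WITH RECURSIVE days(d) AS (
  SELECT '2022-03-15'
  UNION ALL
  SELECT date(d, '+1 day') FROM days WHERE d < '2022-03-17'
)
"""

# Filter in the ON clause: every day survives. count(*) still reports 1 for
# the empty day because the NULL-extended row is counted; count(e.id) is 0.
in_on = conn.execute(DAYS + """
SELECT days.d, count(*) AS star_count, count(e.id) AS col_count
FROM days
LEFT JOIN events AS e
  ON date(e.ts) = days.d
 AND e.kind = 'read'
GROUP BY days.d
ORDER BY days.d
""").fetchall()

# Same filter in WHERE: rows without a matching 'read' event are discarded,
# so the outer join behaves like an inner join and the empty day disappears.
in_where = conn.execute(DAYS + """
SELECT days.d, count(e.id) AS col_count
FROM days
LEFT JOIN events AS e
  ON date(e.ts) = days.d
WHERE e.kind = 'read'
GROUP BY days.d
ORDER BY days.d
""").fetchall()

print(in_on)
print(in_where)
```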

Get apps with the highest review count since a dynamic series of days

I have two tables, apps and reviews (simplified for the sake of discussion):
apps table
id int
reviews table
id int
review_date date
app_id int (foreign key that points to apps)
2 questions:
1. How can I write a query / function to answer the following question?:
Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?
I think I know how to write a query if given an explicit date:
SELECT
apps.id,
count(reviews.*)
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
group by
1
having
min(reviews.review_date) >= '2020-01-01'
order by 2 desc
limit 10;
But I don't know how to query this dynamically given the desired date series and compile all this information in a single view.
2. What's the best way to model this data?
It would be nice to have the # of reviews at the time for each date as well as the app_id. As of now I'm thinking something that might look like:
... 2020-01-01_app_id | 2020-01-01_review_count | 2020-01-02_app_id | 2020-01-02_review_count ...
But I'm wondering if there's a better way to do this. Stitching the data together also seems like a challenge.

I think this is what you are looking for:
Postgres 13 or newer
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT app_id, total_ct
FROM cte c
WHERE c.earliest_review >= d.review_window_start
ORDER BY total_ct DESC
FETCH FIRST 1 ROWS WITH TIES -- new & hot
) sub
GROUP BY 1
) a ON true;
WITH TIES makes it a bit cheaper. Added in Postgres 13 (currently beta). See:
Get top row(s) with highest value, with ties
Postgres 12 or older
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT total_ct, app_id
, rank() OVER (ORDER BY total_ct DESC) AS rnk
FROM cte c
WHERE c.earliest_review >= d.review_window_start
) sub
WHERE rnk = 1
GROUP BY 1
) a ON true;
db<>fiddle here
Same as above, but without WITH TIES.
We don't need to involve the table apps at all. The table reviews has all information we need.
The CTE cte computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.
It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED in Postgres 12 or later to force it. See:
How to force evaluation of subquery before joining / pushing down to foreign server
The optimized generate_series() call produces the series of days from earliest to latest review. See:
Generating time series between two dates in PostgreSQL
Join a count query on generate_series() and retrieve Null values as '0'
Finally, the LEFT JOIN LATERAL you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start. Alternatively, define tiebreaker(s) to get at most one winner. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
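For experimenting without a PostgreSQL server, here is a rough sketch of the same question in Python's built-in sqlite3 module. SQLite has neither LATERAL nor generate_series(), so this swaps in a different technique: a recursive CTE for the day series, a plain join against the per-app aggregate, and rank() partitioned by day to keep ties. It assumes an SQLite build with window functions (3.25 or newer), and the sample reviews are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reviews (id INTEGER PRIMARY KEY, review_date TEXT, app_id INTEGER);
INSERT INTO reviews (review_date, app_id) VALUES
  ('2020-01-01', 1), ('2020-01-02', 1), ('2020-01-03', 1),
  ('2020-01-02', 2), ('2020-01-03', 2),
  ('2020-01-02', 3), ('2020-01-02', 3);
""")

winners = conn.execute("""
WITH RECURSIVE
per_app AS (          -- earliest review and total count per app
  SELECT app_id, min(review_date) AS earliest_review, count(*) AS total_ct
  FROM reviews
  GROUP BY app_id
),
bounds AS (
  SELECT min(review_date) AS lo, max(review_date) AS hi FROM reviews
),
days(d) AS (          -- one row per day from earliest to latest review
  SELECT lo FROM bounds
  UNION ALL
  SELECT date(d, '+1 day') FROM days, bounds WHERE d < hi
),
ranked AS (           -- LEFT JOIN keeps days without any qualifying app
  SELECT days.d, p.app_id, p.total_ct,
         rank() OVER (PARTITION BY days.d ORDER BY p.total_ct DESC) AS rnk
  FROM days
  LEFT JOIN per_app AS p ON p.earliest_review >= days.d
)
SELECT d, app_id, total_ct
FROM ranked
WHERE rnk = 1
ORDER BY d, app_id
""").fetchall()

for row in winners:
    print(row)
```

Unlike the array_agg() version above, this returns one row per winning app, so a day with a tie appears more than once and a day with no qualifying app appears once with NULLs.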

If you are looking for hints, then here are a few:
Are you aware of generate_series() and how to use it to compose a table of dates given a start and end date? If not, then there are plenty of examples on this site.
To answer this question for any given date, you need to have only two measures for each app, and only one of these is used to compare an app against other apps. Your query in part 1 shows that you know what these two measures are.
Hints 1 and 2 should be enough to get this done. The only thing I can add is for you not to worry about making the database do "too much work." That is what it is there to do. If it does not do it quickly enough, then you can think about optimizations, but before you get to that step, concentrate on getting the answer that you want.
Please comment if you need further clarification on this.

The missing piece for me was lateral join.
I can accomplish just about what I want using the following:
select
review_windows.review_window_start,
id,
review_total,
earliest_review
from
(
select
date_trunc('day', review_windows.review_windows) :: date as review_window_start
from
generate_series(
(
SELECT
min(reviews.review_date)
FROM
reviews
),
(
SELECT
max(reviews.review_date)
FROM
reviews
),
'1 year'
) review_windows
order by
1 desc
) review_windows
left join lateral (
SELECT
apps.id,
count(reviews.*) as review_total,
min(reviews.review_date) as earliest_review
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
where
reviews.review_date >= review_windows.review_window_start
group by
1
having
min(reviews.review_date) >= review_windows.review_window_start
order by
2 desc,
3 desc
limit
2
) apps_most_reviews on true;

Count of id per day using window function

I'm trying to count the track_uri values associated with a given playlist_uri per day over a one-month window, and have composed the following SQL:
SELECT
playlist_uri, playlist_date, track_uri, count(track_uri)
over (partition by playlist_uri, playlist_date) as count_tracks
FROM
tbl1
WHERE
_PARTITIONTIME BETWEEN '2017-09-09' AND '2017-10-09'
AND playlist_uri in (
SELECT playlist_uri from tbl2 WHERE playlist_owner = "spotify"
)
However I am getting the following output:
I instead would like it to show me the count of track_uri for each playlist_uri on each day.
Would really appreciate some help with this.

Not sure if I understand your question correctly, but you might not need a window function for that:
SELECT
playlist_uri, playlist_date, COUNT(DISTINCT track_uri)
FROM
tbl1
WHERE
_PARTITIONTIME BETWEEN '2017-09-09' AND '2017-10-09'
AND playlist_uri in (
SELECT playlist_uri from tbl2 WHERE playlist_owner = "spotify"
)
GROUP BY 1, 2;
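The difference between the two queries can be seen on a tiny data set. This is a sketch using Python's built-in sqlite3 module instead of BigQuery, with hypothetical rows and without the _PARTITIONTIME and playlist_owner filters. It assumes an SQLite build with window functions (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl1 (playlist_uri TEXT, playlist_date TEXT, track_uri TEXT);
INSERT INTO tbl1 VALUES
  ('p1', '2017-09-09', 'a'),
  ('p1', '2017-09-09', 'b'),
  ('p1', '2017-09-09', 'a'),
  ('p1', '2017-09-10', 'a'),
  ('p2', '2017-09-09', 'c');
""")

# Window function: every input row is kept and the count is repeated on each.
windowed = conn.execute("""
SELECT playlist_uri, playlist_date, track_uri,
       count(track_uri) OVER (PARTITION BY playlist_uri, playlist_date)
         AS count_tracks
FROM tbl1
ORDER BY playlist_uri, playlist_date, track_uri
""").fetchall()

# GROUP BY: one row per (playlist, day) with the number of distinct tracks.
grouped = conn.execute("""
SELECT playlist_uri, playlist_date, count(DISTINCT track_uri) AS count_tracks
FROM tbl1
GROUP BY playlist_uri, playlist_date
ORDER BY playlist_uri, playlist_date
""").fetchall()

print(len(windowed), "rows from the window query")
print(grouped)
```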

Sorting twice on same column

I'm having a bit of a weird question, given to me by a client.
He has a list of data, with a date between parentheses like so:
Foo (14/08/2012)
Bar (15/08/2012)
Bar (16/09/2012)
Xyz (20/10/2012)
However, he wants the list to be displayed as follows:
Foo (14/08/2012)
Bar (16/09/2012)
Bar (15/08/2012)
Xyz (20/10/2012)
(notice that the second Bar has moved up one position)
So, the logic behind it is, that the list has to be sorted by date ascending, EXCEPT when two rows have the same name ('Bar'). If they have the same name, it must be sorted with the LATEST date at the top, while staying in the other sorting order.
Is this even remotely possible? I've experimented with a lot of ORDER BY clauses, but couldn't find the right one. Does anyone have an idea?
I should have specified that this data comes from a table in a SQL Server database (the name and the date are in two different columns). So I'm looking for a SQL query that can do the sorting I want.
(I've dumbed this example down quite a bit, so if you need more context, don't hesitate to ask)

This works, I think
declare @t table (data varchar(50), date datetime)
insert @t
values
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
select t.*
from @t t
inner join (select data, COUNT(*) cg, MAX(date) as mg from @t group by data) tc
on t.data = tc.data
order by case when cg>1 then mg else date end, date desc
produces
data date
---------- -----------------------
Foo 2012-08-14 00:00:00.000
Bar 2012-09-16 00:00:00.000
Bar 2012-08-15 00:00:00.000
Xyz 2012-10-20 00:00:00.000

A way with better performance than any of the other posted answers is to just do it entirely with an ORDER BY and not a JOIN or using CTE:
DECLARE @t TABLE (myData varchar(50), myDate datetime)
INSERT INTO @t VALUES
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
SELECT *
FROM @t t1
ORDER BY (SELECT MIN(t2.myDate) FROM @t t2 WHERE t2.myData = t1.myData), T1.myDate DESC
This does exactly what you request and will work with any indexes and much better with larger amounts of data than any of the other answers.
Additionally it's much more clear what you're actually trying to do here, rather than masking the real logic with the complexity of a join and checking the count of joined items.

This one uses analytic functions to perform the sort; it only requires one SELECT from your table.
The inner query finds gaps, where the name changes. These gaps are used to identify groups in the next query, and the outer query does the final sorting by these groups.
I have tried it here (SQL Fiddle) with extended test-data.
SELECT name, dat
FROM (
SELECT name, dat, SUM(gap) over(ORDER BY dat, name) AS grp
FROM (
SELECT name, dat,
CASE WHEN LAG(name) OVER (ORDER BY dat, name) = name THEN 0 ELSE 1 END AS gap
FROM t
) x
) y
ORDER BY grp, dat DESC
Extended test-data
('Bar','2012-08-12'),
('Bar','2012-08-11'),
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-08-16'),
('Bar','2012-09-17'),
('Xyz','2012-10-20')
Result
Bar 2012-08-12
Bar 2012-08-11
Foo 2012-08-14
Bar 2012-09-17
Bar 2012-08-16
Bar 2012-08-15
Xyz 2012-10-20
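To reproduce this result without SQL Server, the query can be run unchanged in SQLite through Python's built-in sqlite3 module. This is only a sketch: it assumes an SQLite build with window functions (3.25 or newer) and stores the dates as ISO text so that string ordering matches date ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (name TEXT, dat TEXT);
INSERT INTO t VALUES
  ('Bar', '2012-08-12'), ('Bar', '2012-08-11'), ('Foo', '2012-08-14'),
  ('Bar', '2012-08-15'), ('Bar', '2012-08-16'), ('Bar', '2012-09-17'),
  ('Xyz', '2012-10-20');
""")

rows = conn.execute("""
SELECT name, dat
FROM (
  -- Running total of gaps: rows in the same run of names share one grp.
  SELECT name, dat, SUM(gap) OVER (ORDER BY dat, name) AS grp
  FROM (
    -- gap = 1 whenever the name differs from the previous row by date.
    SELECT name, dat,
           CASE WHEN LAG(name) OVER (ORDER BY dat, name) = name
                THEN 0 ELSE 1 END AS gap
    FROM t
  ) x
) y
ORDER BY grp, dat DESC
""").fetchall()

for name, dat in rows:
    print(name, dat)
```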

I think that this works, including the case I asked about in the comments:
declare @t table (data varchar(50), [date] datetime)
insert @t
values
('Foo','20120814'),
('Bar','20120815'),
('Bar','20120916'),
('Xyz','20121020')
; With OuterSort as (
select *,ROW_NUMBER() OVER (ORDER BY [date] asc) as rn from @t
)
--Now we need to find contiguous ranges of the same data value, and the min and max row number for such a range
, Islands as (
select data,rn as rnMin,rn as rnMax from OuterSort os where not exists (select * from OuterSort os2 where os2.data = os.data and os2.rn = os.rn - 1)
union all
select i.data,rnMin,os.rn
from
Islands i
inner join
OuterSort os
on
i.data = os.data and
i.rnMax = os.rn-1
), FullIslands as (
select
data,rnMin,MAX(rnMax) as rnMax
from Islands
group by data,rnMin
)
select
*
from
OuterSort os
inner join
FullIslands fi
on
os.rn between fi.rnMin and fi.rnMax
order by
fi.rnMin asc,os.rn desc
It works by first computing the initial ordering in the OuterSort CTE. Then, using two CTEs (Islands and FullIslands), we compute the parts of that ordering in which the same data value appears in adjacent rows. Having done that, we can compute the final ordering by any value that all adjacent values will have (such as the lowest row number of the "island" that they belong to), and then within an "island", we use the reverse of the originally computed sort order.
Note that this may, though, not be too efficient for large data sets. On the sample data it shows up as requiring 4 table scans of the base table, as well as a spool.

Try something like...
ORDER BY CASE date
WHEN '14/08/2012' THEN 1
WHEN '16/09/2012' THEN 2
WHEN '15/08/2012' THEN 3
WHEN '20/10/2012' THEN 4
END
In MySQL, you can do:
ORDER BY FIELD(date, '14/08/2012', '16/09/2012', '15/08/2012', '20/10/2012')
In Postgres, you can create a function FIELD and do:
CREATE OR REPLACE FUNCTION field(anyelement, anyarray) RETURNS numeric AS $$
SELECT
COALESCE((SELECT i
FROM generate_series(1, array_upper($2, 1)) gs(i)
WHERE $2[i] = $1),
0);
$$ LANGUAGE SQL STABLE
If you do not want to use the CASE, you can try to find an implementation of the FIELD function for SQL Server.
