ORDER BY before WHERE - alternatives? - sql

I'm using a PostgreSQL database (12.8).
Given events sorted by start_time, I would like to retrieve the next 10 events after the event with id = 10.
My first idea was something like this:
SELECT *
FROM events
ORDER BY start_time
WHERE id > 10
LIMIT 10;
However, this does not work because the WHERE-clause is always applied before the ORDER BY-clause. In other words, first all events are selected which have an id > 10, and only the remaining events are then ordered by start_time.
Next, I came up with a CTE expression. If the record with ID 10 has row_number 3:
WITH ordered_events AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY start_time DESC) AS row_number
FROM events
)
SELECT *
FROM ordered_events
WHERE row_number > 3
LIMIT 10;
This would work as desired if I knew the row_number beforehand. However, I only know the ID, i.e. id = 10.
a) Ideally, I would do something like this (however, I don't know if there is any way to write such an expression):
WITH ordered_events AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY start_time DESC) AS row_number
FROM events
)
SELECT *
FROM ordered_events
WHERE row_number > (row_number of row where the ID is 10) # <------
LIMIT 10;
b) The only alternative I came up with is to first retrieve the row_number for all records and then make a second query.
First query:
SELECT id, ROW_NUMBER() OVER (ORDER BY start_time DESC) AS row_number
FROM events;
Now I know that the row_number for the record with ID 10 is 3. Thus, I have enough information to make the query with the CTE expression as described above. However, the performance is bad because I need to retrieve all records from events in the first query.
I would really appreciate it, if you could help me to figure out a (performant) way of doing this.

Your CTE orders by start_time descending, so the "next" events are the rows with a start time less than ID 10's start time. Use a WHERE clause for this.
select *
from events where start_time < (select start_time from events where id = 10)
order by start_time desc
limit 10;
This query may benefit from an index on start_time. (I take it for granted that there already is an index on ID to find the row with ID 10 quickly.)
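To try the query above on a small data set, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The sample rows and the index name events_start_time_idx are made up for illustration; only the SELECT statement comes from the answer.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, start_time TEXT)")

# Hypothetical sample data: ids are deliberately not in start_time order.
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (7, "2021-01-05"),
        (10, "2021-01-04"),
        (3, "2021-01-03"),
        (12, "2021-01-02"),
        (1, "2021-01-01"),
    ],
)

# Index on start_time so the range filter and the ORDER BY can use it.
conn.execute("CREATE INDEX events_start_time_idx ON events (start_time)")

# Rows that come after id = 10 when sorted by start_time descending.
next_events = conn.execute(
    """
    SELECT id, start_time
    FROM events
    WHERE start_time < (SELECT start_time FROM events WHERE id = 10)
    ORDER BY start_time DESC
    LIMIT 10
    """
).fetchall()

for row in next_events:
    print(row)
```

Note that the filter is on start_time, not on id, so it does not matter whether ids and start times are in the same order.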

Related

SQL Max or empty value grouped by conditions

I have a table like this
and i want my output to look like this
I need to look at the ID and then take the max created date and the max completed date for that ID. There are also some cases where the completed date is still empty, so in that case I just need to look at the max created date. I'm not sure how to tackle this; doing a GROUP BY doesn't account for my multiple scenarios.
Use ROW_NUMBER:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY QUOTE_NUMBER
ORDER BY WORKBOOK_CREATED_DATE DESC) rn
FROM yourTable
)
SELECT *
FROM cte
WHERE rn = 1;

can we get totalcount and last record from postgresql

I have a table with 23 records. I am trying to get the total count of records and also the last record in a single query, something like this:
select count(*) ,(m order by createdDate) from music m ;
Is there any way to get only the last record as well as the total count in PostgreSQL?
This can be done using window functions
select *
from (
select m.*,
row_number() over (order by createddate desc) as rn,
count(*) over () as total_count
from music m
) t
where rn = 1;
Another option would be to use a scalar sub-query and combine it with a limit clause:
select *,
(select count(*) from music) as total_count
from music
order by createddate desc
limit 1;
Depending on the indexes, your memory configuration, and the table definition, this might be faster than the two window functions.
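A minimal sketch of the scalar sub-query variant, run against an in-memory SQLite database via Python's sqlite3 module as a stand-in for PostgreSQL. The three sample rows and the title column are assumptions; the count is taken from the same music table that the row is read from.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE music (id INTEGER PRIMARY KEY, title TEXT, createddate TEXT)"
)

# Hypothetical sample rows, inserted out of date order on purpose.
conn.executemany(
    "INSERT INTO music (title, createddate) VALUES (?, ?)",
    [("a", "2020-01-01"), ("b", "2020-03-01"), ("c", "2020-02-01")],
)

# Newest row plus the total row count in one statement.
last_row = conn.execute(
    """
    SELECT title, createddate,
           (SELECT count(*) FROM music) AS total_count
    FROM music
    ORDER BY createddate DESC
    LIMIT 1
    """
).fetchone()

print(last_row)
```

The scalar sub-query is evaluated independently of the outer LIMIT, so total_count covers the whole table even though only one row is returned.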
No, it's not possible to do what is being asked with a plain aggregate query: the moment you ask for count(*), SQL changes the level of your data to an aggregation. Without window functions or a sub-query, the only way is to run the count(*) and the ORDER BY in separate queries.
Another solution using windowing functions and no subquery:
SELECT DISTINCT count(*) OVER w, last_value(m) OVER w
FROM music m
WINDOW w AS (ORDER BY createddate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING); -- ascending, so last_value is the newest row
The point here is that last_value applies on partitions defined by windows and not on groups defined by GROUP BY.
I did not run any tests, but I suspect my solution is the least efficient of the three posted so far. It is also the closest to your example query, though.

Global row numbers in chunked query

I would like to include a column row_number in my result set with the row number sequence, where 1 is the newest item, without gaps. This works:
SELECT id, row_number() over (ORDER BY id desc) AS row_number, title
FROM mytable
WHERE group_id = 10;
Now I would like to query for the same data in chunks of 1000 each to be easier on memory:
SELECT id, row_number() over (ORDER BY id desc) AS row_number, title
FROM mytable
WHERE group_id = 10 AND id >= 0 AND id < 1000
ORDER BY id ASC;
Here the row_number restarts from 1 for every chunk, but I would like it to be as if it were part of the global query, as in the first case. Is there an easy way to accomplish this?
Assuming:
id is defined as PRIMARY KEY - which means UNIQUE and NOT NULL. Else you may have to deal with NULL values and / or duplicates (ties).
You have no concurrent write access on the table - or you don't care what happens after you have taken your snapshot.
A MATERIALIZED VIEW, like you demonstrate in your answer, is a good choice.
CREATE MATERIALIZED VIEW mv_temp AS
SELECT row_number() OVER (ORDER BY id DESC) AS rn, id, title
FROM mytable
WHERE group_id = 10;
But the index and subsequent queries must be on the row number rn to get data in chunks of 1000:
CREATE INDEX ON mv_temp (rn);
SELECT * FROM mv_temp WHERE rn BETWEEN 1001 AND 2000; -- second chunk of 1000 rows
Your implementation would require a guaranteed gap-less id column - which would void the need for an added row number to begin with ...
When done:
DROP MATERIALIZED VIEW mv_temp;
The index dies with the table (materialized view in this case) automatically.
Related, with more details:
Optimize query with OFFSET on large table
You want to have a query for the first 1000 rows, then one for the next 1000, and so on?
Usually you just write one query (the one you already use), have your app fetch 1000 records, do something with them, then fetch the next 1000 and so on. No need for separate queries, hence.
However, it would be rather easy to write such partial queries:
select *
from
(
SELECT id, row_number() over (ORDER BY id desc) AS rn, title
FROM mytable
WHERE group_id = 10
) numbered
where rn between 1 and 1000; -- <- simply change the row number range here
-- e.g. where rn between 1001 and 2000 for the second chunk
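A minimal sketch of driving such partial queries from application code, using Python's sqlite3 module as a stand-in for PostgreSQL (window functions need SQLite 3.25 or newer). The sample data, the helper fetch_chunk, and the small CHUNK_SIZE are assumptions chosen to keep the example short; the ids have gaps on purpose to show that the row numbers stay gap-free across chunks.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, group_id INTEGER, title TEXT)"
)

# Hypothetical data: odd ids 1..19 belong to group 10, even ids to group 20.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(i, 10 if i % 2 else 20, f"title {i}") for i in range(1, 21)],
)

CHUNK_SIZE = 4  # hypothetical; the question uses 1000


def fetch_chunk(chunk_no):
    """Return one chunk of rows, numbered globally (1 = newest id)."""
    first = chunk_no * CHUNK_SIZE + 1
    last = first + CHUNK_SIZE - 1
    return conn.execute(
        """
        SELECT id, rn, title
        FROM (
            SELECT id, row_number() OVER (ORDER BY id DESC) AS rn, title
            FROM mytable
            WHERE group_id = 10
        ) numbered
        WHERE rn BETWEEN ? AND ?
        ORDER BY rn
        """,
        (first, last),
    ).fetchall()


# Fetch chunk after chunk until a chunk comes back empty.
chunks = []
chunk_no = 0
while True:
    rows = fetch_chunk(chunk_no)
    if not rows:
        break
    chunks.append(rows)
    chunk_no += 1

for chunk in chunks:
    print(chunk)
```

Each call recomputes the numbering over the whole group, which is why this can be slow on large tables compared with numbering once and storing the result.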
You need pagination. Try this:
SELECT id, row_number() over (ORDER BY id desc)+0 AS row_number, title
FROM mytable
WHERE group_id = 10 AND id >= 0 AND id < 1000
ORDER BY id ASC;
Next time, when you change the start value of id in the WHERE clause, change the offset added to row_number() as well, like below:
SELECT id, row_number() over (ORDER BY id desc)+1000 AS row_number, title
FROM mytable
WHERE group_id = 10 AND id >= 1000 AND id < 2000
ORDER BY id ASC;
Or, better, you can use the OFFSET and LIMIT approach for pagination:
https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
In the end I did it this way:
First I create a temporary materialized view:
CREATE MATERIALIZED VIEW vw_temp AS SELECT id, row_number() over (ORDER BY id desc) AS rn, title
FROM mytable
WHERE group_id = 10;
Then I define the index:
CREATE INDEX idx_temp ON vw_temp USING btree(id);
Now I can perform all operations very quickly, and with numbered rows:
SELECT * FROM vw_temp WHERE id BETWEEN 1000 AND 2000;
After doing the operations, cleanup:
DROP INDEX idx_temp;
DROP MATERIALIZED VIEW vw_temp;
Even though Thorsten Kettner's answer seems the cleanest one, it was not practical for me because it was too slow. Thanks for contributing, everyone. For those interested in the practical use case, I use this for feeding data to the Sphinx indexer.

Pagination issue with ROWNUM [duplicate]

I am struggling to fetch data based on rownum. When I execute the query below to get the results for rownum 1 to 4, it works fine.
SELECT ROWNUM TOTAL,MI.* FROM (SELECT USER_ID,CUSTOMER_NAME FROM ELEC_AUTO_MERC
ORDER BY CREATION_DATE DESC ) MI WHERE ROWNUM BETWEEN 1 AND 4;
But when I execute the same query to get the results for rownum 2 to 4, it does not work; it doesn't return anything.
SELECT ROWNUM TOTAL,MI.* FROM (SELECT USER_ID,CUSTOMER_NAME FROM ELEC_AUTO_MERC
ORDER BY CREATION_DATE DESC ) MI WHERE ROWNUM BETWEEN 2 AND 4;
As a workaround, when I wrap it in one more SELECT statement, it works fine, but I don't think it is a good approach to nest SELECT multiple times only for rownum.
SELECT * FROM (SELECT ROWNUM TOTAL,MI.* FROM (SELECT USER_ID,CUSTOMER_NAME FROM ELEC_AUTO_MERC
ORDER BY CREATION_DATE DESC ) MI) WHERE TOTAL BETWEEN 2 AND 4;
Can you please help me create an optimized query?
ROWNUM is weird in that it can be evaluated as part of a condition in the query - but if the row then fails to pass that filter, the ROWNUM value that it was assigned becomes available to be used again for the next row.
One important effect of this is that if you use any condition that excludes a ROWNUM value of 1, you will never get a match. The first row to be tested against this condition will be row 1; but then it will fail the test, so the next row will then be considered row 1; and so on.
So your condition ROWNUM BETWEEN 2 AND 4 can never be true.
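The behaviour described above can be illustrated with a toy model in plain Python. This is only a sketch of the rule "a ROWNUM value is consumed only when the row passes the filter", not Oracle's actual implementation; the function filter_with_rownum and the sample rows are made up.

```python
def filter_with_rownum(rows, predicate):
    """Toy model: a candidate ROWNUM is consumed only if the row passes."""
    result = []
    next_rownum = 1
    for row in rows:
        if predicate(next_rownum):
            result.append((next_rownum, row))
            next_rownum += 1  # value is used up only on success
        # On failure the same candidate value is offered to the next row.
    return result


rows = ["u1", "u2", "u3", "u4", "u5"]

# ROWNUM BETWEEN 1 AND 4: the first four rows pass, the fifth is offered 5.
one_to_four = filter_with_rownum(rows, lambda rn: 1 <= rn <= 4)

# ROWNUM BETWEEN 2 AND 4: every row is offered 1 and fails.
two_to_four = filter_with_rownum(rows, lambda rn: 2 <= rn <= 4)

print(one_to_four)
print(two_to_four)
```

In the second call the counter never gets past 1, which mirrors why the query returns no rows.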
The workaround you have found is the traditional one. Another would be to use an analytic function to rank the rows, then filter on the rank, e.g.:
SELECT MI.* FROM (
SELECT USER_ID,CUSTOMER_NAME, RANK() OVER (ORDER BY CREATION_DATE DESC) AS the_rank
FROM ELEC_AUTO_MERC
) MI
WHERE the_rank BETWEEN 2 AND 4;
Several analytic functions - RANK, DENSE_RANK, and ROW_NUMBER - can be used for this purpose, and will produce slightly different results, especially if there are ties. Check out the docs.
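To see how the three functions differ when there are ties, here is a minimal sketch using Python's sqlite3 module as a stand-in for Oracle (window functions need SQLite 3.25 or newer). The four sample rows are made up; users 2 and 3 share the same creation date. ROW_NUMBER gets user_id as an extra sort key here only to make its output deterministic.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elec_auto_merc (user_id INTEGER, creation_date TEXT)")

# Hypothetical sample data with one tie on creation_date.
conn.executemany(
    "INSERT INTO elec_auto_merc VALUES (?, ?)",
    [(1, "2021-03-01"), (2, "2021-02-01"), (3, "2021-02-01"), (4, "2021-01-01")],
)

ranked = conn.execute(
    """
    SELECT user_id,
           RANK()       OVER (ORDER BY creation_date DESC)          AS rnk,
           DENSE_RANK() OVER (ORDER BY creation_date DESC)          AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY creation_date DESC, user_id) AS row_num
    FROM elec_auto_merc
    ORDER BY creation_date DESC, user_id
    """
).fetchall()

for row in ranked:
    print(row)
```

With this data, RANK skips a value after the tie, DENSE_RANK does not, and ROW_NUMBER never repeats a value, so a filter such as BETWEEN 2 AND 4 can return a different number of rows depending on which function is used.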

Ranking over several columns

In the process of query optimization I arrived at the following SQL query:
select r.*
from
(
select id, DATA, update_dt, inspection_dt, check_dt,
RANK() OVER
(PARTITION by ID
ORDER BY update_dt DESC, DATA) as rank
FROM TABLE
where update_dt < inspection_dt or update_dt < check_dt
) r
where r.rank = 1
Query returns the DATA that corresponds to the latest check_dt.
However, what I want to get is:
1. DATA corresponding to latest check_dt
2. DATA corresponding to latest inspection_dt.
One trivial solution is to write two separate queries, each with a single WHERE condition: one for inspection_dt and one for check_dt. However, that defeats the initial intent, which was to shorten the running time.
By observing the source data I noticed the way to implement it - check date is always later than inspection date; knowing that I could just extract the record with the rank = 1 and it will give me DATA corresponding to latest CHECK_DT, and record with the largest rank would correspond to INSPECTION.
However, I'm afraid the data will not always be consistent, so I am looking for a more general solution.
How about this?
select r.*
from (select id, DATA, update_dt, inspection_dt, check_dt,
RANK() OVER (PARTITION by ID
ORDER BY update_dt DESC, DATA
) as rank_upd,
RANK() OVER (PARTITION by ID
ORDER BY inspection_dt DESC, DATA
) as rank_insp
FROM TABLE
) r
where r.rank_upd = 1 or r.rank_insp = 1;
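A minimal sketch of the two-rank idea on a tiny data set, using Python's sqlite3 module as a stand-in for the original database (window functions need SQLite 3.25 or newer). The table name readings and the three sample rows are assumptions, and check_dt is left out to keep the example short.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings "
    "(id INTEGER, data TEXT, update_dt TEXT, inspection_dt TEXT)"
)

# Hypothetical rows for one id: A has the latest update_dt,
# B has the latest inspection_dt, C has neither.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        (1, "A", "2021-01-03", "2021-01-01"),
        (1, "B", "2021-01-01", "2021-01-05"),
        (1, "C", "2021-01-02", "2021-01-02"),
    ],
)

# Both ranks are computed in a single pass over the table.
top_rows = conn.execute(
    """
    SELECT r.data, r.rank_upd, r.rank_insp
    FROM (SELECT id, data,
                 RANK() OVER (PARTITION BY id
                              ORDER BY update_dt DESC, data) AS rank_upd,
                 RANK() OVER (PARTITION BY id
                              ORDER BY inspection_dt DESC, data) AS rank_insp
          FROM readings) r
    WHERE r.rank_upd = 1 OR r.rank_insp = 1
    ORDER BY r.data
    """
).fetchall()

for row in top_rows:
    print(row)
```

The rank columns in the output tell you which of the two criteria each returned row satisfied.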