Optimize GROUP BY query to retrieve latest row per user - sql

I have the following log table for user messages (simplified form) in Postgres 9.2:
CREATE TABLE log (
    log_date DATE,
    user_id  INTEGER,
    payload  INTEGER
);
It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. payload is ever increasing for each user (if that matters).
I want to efficiently retrieve the latest record for each user before a specific date. My query is:
SELECT user_id, max(log_date), max(payload)
FROM log
WHERE log_date <= :mydate
GROUP BY user_id
which is extremely slow. I have also tried:
SELECT DISTINCT ON (user_id) user_id, log_date, payload
FROM log
WHERE log_date <= :mydate
ORDER BY user_id, log_date DESC;
which has the same plan and is equally slow.
So far I have a single index on log(log_date), but it doesn't help much.
And I have a users table with all users included. I also want to retrieve the result for only some users (those with payload > :value).
Is there any other index I should use to speed this up, or any other way to achieve what I want?

For best read performance you need a multicolumn index:
CREATE INDEX log_combo_idx
ON log (user_id, log_date DESC NULLS LAST);
To make index only scans possible, add the otherwise not needed column payload in a covering index with the INCLUDE clause (Postgres 11 or later):
CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST) INCLUDE (payload);
See:
Do covering indexes in PostgreSQL help JOIN columns?
Fallback for older versions:
CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST, payload);
Why DESC NULLS LAST?
Unused index in range of dates query
For few rows per user_id or small tables DISTINCT ON is typically fastest and simplest:
Select first row in each GROUP BY group?
For many rows per user_id an index skip scan (or loose index scan) is (much) more efficient. That's not implemented up to Postgres 15 (work is ongoing). But there are ways to emulate it efficiently.
Common Table Expressions require Postgres 8.4+.
LATERAL requires Postgres 9.3+.
The following solutions go beyond what's covered in the Postgres Wiki.
1. No separate table with unique users
With a separate users table, solutions in 2. below are typically simpler and faster. Skip ahead.
1a. Recursive CTE with LATERAL join
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT user_id, log_date, payload
   FROM   log
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT l.user_id, l.log_date, l.payload
      FROM   log l
      WHERE  l.user_id > c.user_id  -- lateral reference
      AND    log_date <= :mydate    -- repeat condition
      ORDER  BY l.user_id, l.log_date DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE  cte
ORDER  BY user_id;
This makes it simple to retrieve arbitrary columns and is probably the best choice in current Postgres. More explanation in chapter 2a. below.
1b. Recursive CTE with correlated subquery
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT l AS my_row  -- whole row
   FROM   log l
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l  -- whole row
           FROM   log l
           WHERE  l.user_id > (c.my_row).user_id
           AND    l.log_date <= :mydate  -- repeat condition
           ORDER  BY l.user_id, l.log_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.my_row).user_id IS NOT NULL  -- note parentheses
   )
SELECT (my_row).*  -- decompose row
FROM   cte
WHERE  (my_row).user_id IS NOT NULL
ORDER  BY (my_row).user_id;
Convenient to retrieve a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.
To assert a row was found in the previous iteration, test a single NOT NULL column (like the primary key).
More explanation for this query in chapter 2b. below.
Related:
Query last N related rows per row
GROUP BY one column, while sorting by another in PostgreSQL
2. With separate users table
Table layout hardly matters as long as exactly one row per relevant user_id is guaranteed. Example:
CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);
Ideally, the table is physically sorted in sync with the log table. See:
Optimize Postgres query on timestamp range
Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition. If the physical sort order of the users table happens to match the index on log, this may be irrelevant.
2a. LATERAL join
SELECT u.user_id, l.log_date, l.payload
FROM   users u
CROSS  JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id  -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l;
JOIN LATERAL allows referencing preceding FROM items at the same query level. See:
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
Results in one index (-only) look-up per user.
Returns no row for users missing in the users table. Typically, a foreign key constraint enforcing referential integrity would rule that out.
Also, no row for users without matching entry in log - conforming to the original question. To keep those users in the result use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:
Call a set-returning function with an array argument multiple times
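A minimal sketch of that variant, based on the 2a query above (users without log entries get NULL for log_date and payload):
SELECT u.user_id, l.log_date, l.payload
FROM   users u
LEFT   JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l ON true;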
Use LIMIT n instead of LIMIT 1 to retrieve more than one row (but not all) per user.
Effectively, all of these do the same:
JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...
The last one has lower priority, though. Explicit JOIN binds before comma. That subtle difference can matter with more joined tables. See:
"invalid reference to FROM-clause entry for table" in Postgres query
2b. Correlated subquery
Good choice to retrieve a single column from a single row. Code example:
Optimize groupwise maximum query
The same is possible for multiple columns, but you need more smarts:
CREATE TEMP TABLE combo (log_date date, payload int);

SELECT user_id, (combo1).*  -- note parentheses
FROM  (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub;
Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in log. You get NULL for combo1, which you can easily filter with a WHERE clause in the outer query if need be.
Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all column values happen to be NULL - same result. You need a NOT NULL column in the subquery to avoid this ambiguity.
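For example, assuming log_date is declared NOT NULL in log, the outer query of the example above could filter like this:
SELECT user_id, (combo1).*
FROM  (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub
WHERE (combo1).log_date IS NOT NULL;  -- assumes log.log_date is NOT NULL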
A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
Use a registered type like the row type of an existing table. Or register a composite type explicitly (and permanently) with CREATE TYPE. Or create a temporary table (dropped automatically at end of session) to register its row type temporarily. Cast syntax: (log_date, payload)::combo
Finally, we do not want to decompose combo1 on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (still true in Postgres 12). Instead, make it a subquery and decompose in the outer query.
Related:
Get values from first and last row per group
Demonstrating all 4 queries with 100k log entries and 1k users:
db<>fiddle here - pg 11
Old sqlfiddle

This is not a standalone answer but rather a comment on @Erwin's answer. For 2a, the LATERAL join example, the query can be improved by sorting the users table to exploit the locality of the index on log.
SELECT u.user_id, l.log_date, l.payload
FROM  (SELECT user_id FROM users ORDER BY user_id) u
    , LATERAL (
   SELECT log_date, payload
   FROM   log
   WHERE  user_id = u.user_id  -- lateral reference
   AND    log_date <= :mydate
   ORDER  BY log_date DESC NULLS LAST
   LIMIT  1
   ) l;
The rationale is that index lookups are expensive when user_id values arrive in random order. By sorting on user_id first, the subsequent lateral join becomes more like an in-order walk over the index on log. Even though both query plans look alike, the running times can differ substantially, especially for large tables.
The cost of the sort is minimal, especially if there is an index on the user_id field.

Perhaps a different index on the table would help. Try this one: log(user_id, log_date). I am not positive that Postgres will make optimal use of it with DISTINCT ON.
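In DDL form (the index name is illustrative):
create index log_user_date_idx on log(user_id, log_date);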
So, I would stick with that index and try this version:
select *
from log l
where l.log_date <= :mydate and  -- also filter the outer rows, else rows after :mydate are returned
      not exists (select 1
                  from log l2
                  where l2.user_id = l.user_id and
                        l2.log_date <= :mydate and
                        l2.log_date > l.log_date
                 );
This should replace the sorting/grouping with index lookups. It might be faster.

Related

DISTINCT ON slow for 300000 rows

I have a table named assets. Here is the DDL:
create table assets (
    id bigint primary key,
    name varchar(255) not null,
    value double precision not null,
    business_time timestamp with time zone,
    insert_time timestamp with time zone default now() not null
);
create index idx_assets_name on assets (name);
I need to extract the newest (based on insert_time) value for each asset name. This is the query that I initially used:
SELECT DISTINCT ON (a.name) *
FROM home.assets a
WHERE a.name IN (
'USD_RLS',
'EUR_RLS',
'SEKKEH_RLS',
'NIM_SEKKEH_RLS',
'ROB_SEKKEH_RLS',
'BAHAR_RLS',
'GOLD_18_RLS',
'GOLD_OUNCE_USD',
'SILVER_OUNCE_USD',
'PLATINUM_OUNCE_USD',
'GOLD_MESGHAL_RLS',
'GOLD_24_RLS',
'STOCK_IR',
'AED_RLS',
'GBP_RLS',
'CAD_RLS',
'CHF_RLS',
'TRY_RLS',
'AUD_RLS',
'JPY_RLS',
'CNY_RLS',
'RUB_RLS',
'BTC_USD'
)
ORDER BY a.name,
a.insert_time DESC;
I have around 300,000 rows in the assets table. On my VPS this query takes about 800 ms, causing a total response time of about 1 second for a specific endpoint. This is a bit slow, and considering that the assets table is growing fast, this endpoint will become even slower in the near future. I also tried to avoid IN (...) using this query:
SELECT DISTINCT ON (a.name) *
FROM home.assets a
ORDER BY a.name,
a.insert_time DESC;
But I didn't notice a significant difference. Any idea how I could optimize this query?
You may try adding the following index to your table:
CREATE INDEX idx ON assets (name, insert_time DESC);
If used, Postgres can simply scan this index to find the distinct record having the most recent insert_time for each name.
For more than a few rows per name in the table (looks to be so), I expect this query to be substantially faster, yet:
SELECT a.*
FROM unnest('{USD_RLS, EUR_RLS, SEKKEH_RLS, NIM_SEKKEH_RLS, ROB_SEKKEH_RLS
, BAHAR_RLS, GOLD_18_RLS, GOLD_OUNCE_USD, SILVER_OUNCE_USD
, PLATINUM_OUNCE_USD, GOLD_MESGHAL_RLS, GOLD_24_RLS, STOCK_IR
, AED_RLS, GBP_RLS, CAD_RLS, CHF_RLS
, TRY_RLS, AUD_RLS, JPY_RLS, CNY_RLS
, RUB_RLS, BTC_USD}'::text[]) AS n(name)
CROSS JOIN LATERAL (
SELECT *
FROM home.assets a
WHERE a.name = n.name
ORDER BY a.insert_time DESC
LIMIT 1
) a;
Pass your list as array, unnest, and then get each latest row in a LATERAL subquery. The CROSS JOIN eliminates names that are not found at all. (You might be interested in LEFT JOIN LATERAL ... ON true instead, to keep those in the result.)
You still need the multicolumn index that Tim mentioned.
CREATE INDEX ON assets (name, insert_time DESC);
Default ascending sort order would work, too, in this case. Postgres can scan backwards:
CREATE INDEX ON assets (name, insert_time);
See:
Postgres: getting latest rows for an array of keys
Optimize GROUP BY query to retrieve latest row per user - basically type 2a
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
Not the number of rows in the table, but the number of rows per group (per name in your case) decides whether DISTINCT ON is the best choice. See this benchmark comparing relevant query styles:
Select first row in each GROUP BY group?

Is there any better option to apply pagination without applying OFFSET in SQL Server?

I want to apply pagination on a table with a huge amount of data. All I want to know is a better option than using OFFSET in SQL Server.
Here is my simple query:
SELECT *
FROM TableName
ORDER BY Id DESC
OFFSET 30000000 ROWS
FETCH NEXT 20 ROWS ONLY
You can use Keyset Pagination for this. It's far more efficient than using Rowset Pagination (paging by row number).
In Rowset Pagination, all previous rows must be read, before being able to read the next page. Whereas in Keyset Pagination, the server can jump immediately to the correct place in the index, so no extra rows are read that do not need to be.
For this to perform well, you need to have a unique index on that key, which includes any other columns you need to query.
In this type of pagination, you cannot jump to a specific page number. You jump to a specific key and read from there. So you need to save the unique ID of the last row on the page you are on, and skip to that key for the next page. Alternatively, you could calculate or estimate a starting point for each page up-front.
One big benefit, apart from the obvious efficiency gain, is avoiding the "missing row" problem when paginating, caused by rows being removed from previously read pages. This does not happen when paginating by key, because the key does not change.
Here is an example:
Let us assume you have a table called TableName with an index on Id, and you want to start at the latest Id value and work backwards.
You begin with:
SELECT TOP (@numRows)
       *
FROM TableName
ORDER BY Id DESC;
Note the use of ORDER BY to ensure the order is correct
In some RDBMSs you need LIMIT instead of TOP
The client will hold the last received Id value (the lowest in this case). On the next request, you jump to that key and carry on:
SELECT TOP (@numRows)
       *
FROM TableName
WHERE Id < @lastId
ORDER BY Id DESC;
Note the use of < not <=
In case you were wondering, in a typical B+tree index, the row with the indicated ID is not read; it's the row after it that's read.
The key chosen must be unique, so if you are paging by a non-unique column then you must add a second column to both ORDER BY and WHERE. You would need an index on OtherColumn, Id for example, to support this type of query. Don't forget INCLUDE columns on the index.
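For example, a supporting index might look like this (names and the included column are illustrative):
CREATE INDEX IX_TableName_OtherColumn_Id
    ON TableName (OtherColumn DESC, Id DESC)
    INCLUDE (SomeNeededColumn);  -- hypothetical non-key column your query selects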
SQL Server does not support row/tuple comparators, so you cannot do (OtherColumn, Id) < (@lastOther, @lastId) (this is however supported in PostgreSQL, MySQL, MariaDB and SQLite).
Instead you need the following:
SELECT TOP (@numRows)
       *
FROM TableName
WHERE (
      (OtherColumn = @lastOther AND Id < @lastId)
   OR OtherColumn < @lastOther
)
ORDER BY
    OtherColumn DESC,
    Id DESC;
This is more efficient than it looks, as SQL Server can convert this into a proper < over both values.
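For reference, the equivalent row-value form in one of those databases (PostgreSQL syntax here; parameters are placeholders):
-- (a, b) < (x, y) means: a < x OR (a = x AND b < y)
SELECT *
FROM TableName
WHERE (OtherColumn, Id) < (:lastOther, :lastId)
ORDER BY OtherColumn DESC, Id DESC
LIMIT :numRows;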
The presence of NULLs complicates things further. You may want to query those rows separately.
On a very big merchant website we use a technique built on ids stored in a pseudo-temporary table, joined to the rows of the product table.
Let me explain with a concrete example.
We have a table designed this way:
CREATE TABLE S_TEMP.T_PAGINATION_PGN
(PGN_ID BIGINT IDENTITY(-9223372036854775808, 1) PRIMARY KEY,
 PGN_SESSION_GUID UNIQUEIDENTIFIER NOT NULL,
 PGN_SESSION_DATE DATETIME2(0) NOT NULL,
 PGN_PRODUCT_ID INT NOT NULL,
 PGN_SESSION_ORDER INT NOT NULL);
CREATE INDEX X_PGN_SESSION_GUID_ORDER
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_GUID, PGN_SESSION_ORDER)
INCLUDE (PGN_PRODUCT_ID);  -- a key column cannot also appear in INCLUDE; covering the join column instead
CREATE INDEX X_PGN_SESSION_DATE
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_DATE);
We have a very big product table called T_PRODUIT_PRD, which a customer filters with many predicates. We INSERT rows from the filtered SELECT into this table like this:
DECLARE @SESSION_ID UNIQUEIDENTIFIER = NEWID();
INSERT INTO S_TEMP.T_PAGINATION_PGN
       (PGN_SESSION_GUID, PGN_SESSION_DATE, PGN_PRODUCT_ID, PGN_SESSION_ORDER)
SELECT @SESSION_ID, SYSUTCDATETIME(), PRD_ID,
       ROW_NUMBER() OVER (ORDER BY ...)  -- custom order by
FROM dbo.T_PRODUIT_PRD
WHERE ...                                -- custom filter
Then every time we need a desired page of @N products, we add a join to this table:
...
JOIN S_TEMP.T_PAGINATION_PGN
     ON PGN_SESSION_GUID = @SESSION_ID
    AND 1 + (PGN_SESSION_ORDER / @N) = @DESIRED_PAGE_NUMBER
    AND PGN_PRODUCT_ID = dbo.T_PRODUIT_PRD.PRD_ID
All the indexes will do the job!
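Spelled out in full, the page query might look like the sketch below (using the names above; note that since ROW_NUMBER() starts at 1, subtracting 1 before the integer division keeps every page at exactly @N rows):
DECLARE @N INT = 20;                  -- page size
DECLARE @DESIRED_PAGE_NUMBER INT = 3; -- page to fetch
-- @SESSION_ID as captured when the pagination rows were inserted
SELECT prd.*
FROM dbo.T_PRODUIT_PRD AS prd
JOIN S_TEMP.T_PAGINATION_PGN AS pgn
     ON pgn.PGN_SESSION_GUID = @SESSION_ID
    AND 1 + ((pgn.PGN_SESSION_ORDER - 1) / @N) = @DESIRED_PAGE_NUMBER
    AND pgn.PGN_PRODUCT_ID = prd.PRD_ID
ORDER BY pgn.PGN_SESSION_ORDER;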
Of course, we regularly have to purge this table, which is why we have a scheduled job that deletes the rows whose sessions were generated more than 4 hours ago:
DELETE FROM S_TEMP.T_PAGINATION_PGN
WHERE PGN_SESSION_DATE < DATEADD(hour, -4, SYSUTCDATETIME());
In the same spirit as SQLpro's solution, I propose:
WITH CTE AS
(SELECT 30000000 AS N
 UNION ALL
 SELECT N - 1 FROM CTE
 WHERE N > 30000000 + 1 - 20)
SELECT T.*
FROM CTE
JOIN TableName T ON CTE.N = T.Id
ORDER BY CTE.N DESC;
Tried with 2 billion rows and it's instant!
Easy to make it a stored procedure...
Of course, this is only valid if the ids follow each other without gaps. Also note that SQL Server limits recursive CTEs to 100 levels by default, so for page sizes above 100 rows you would need to append OPTION (MAXRECURSION 0).

How to optimise a SQL SELECT query for generating a user's newsfeed?

I'm currently trying to build a feature to generate a user's newsfeed using the following query from a table of posts. This is the SQL statement we are using:
SELECT *
FROM "posts" AS "post"
WHERE "post"."sourceId" IN (...)
ORDER BY "post"."createdAt" DESC, "post"."timestamp" DESC
LIMIT 10;
The posts table currently has roughly 200K+ rows and is likely to grow much larger. My DB performance skills aren't the strongest, but is there any way to optimise this query to make it run as fast as possible? I'm assuming it's not enough to add an index on the sourceId column alone, and that I would instead need a multicolumn index that also takes the ORDER BY columns into account.
For this query:
SELECT p.*
FROM posts p
WHERE p.sourceId IN (...)
ORDER BY p.createdAt DESC, p.timestamp DESC
LIMIT 10;
The only index that can really help is an index on posts(sourceId).
Note that I removed the double quotes. Do not escape table and column names when you define them; then you don't need to escape them when you use them.
However, the query still has to sort all the data. And that can be time-consuming. A more complicated query is easier for Postgres to optimize:
select p.*
from ((select p.*
from posts p
where sourceId = $si_1
order by p.createdAt desc, p.timestamp desc
limit 10
) union all
(select p.*
from posts p
where sourceId = $si_2
order by p.createdAt desc, p.timestamp desc
limit 10
) union all
. . .
) p
order by p.createdAt desc, p.timestamp desc;
This query can use an index on posts(sourceId, createdAt desc, timestamp desc) for the inner selects. That should be fast. The outer ORDER BY will still need sorting, but the volume of data should be much smaller.
For instance, if a typical source has 10,000 rows and your IN list names 3 sources, then your version of the query needs to sort 30,000 rows to fetch 10. This version fetches 30 rows using the index and then sorts just those to get the final 10.
That would be a big difference in performance.
You may find that just an index on sourceId is sufficient:
CREATE INDEX src_idx ON posts (sourceId);
Postgres would then have to sort the records that make it past the WHERE clause itself. Further adding the columns from the ORDER BY clause to the index might also help:
CREATE INDEX idx ON posts (sourceId, createdAt DESC, timestamp DESC);
This might speed up the sorting operation by letting Postgres sort the matching groups of sourceId records at once.

How to find duplicate rows in Hive?

I want to find duplicate rows in one of my Hive tables, for which I was given two approaches.
The first approach is to use the following two queries:
select count(*) from mytable; -- gives the total row count
The second query gives the count of distinct rows:
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my tables the total row count from the first query is 3500 while the second query gives 2700. So it tells us that 3500 - 2700 = 800 rows are duplicates. But this doesn't tell me which rows are duplicated.
My second approach to find duplicates is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
The above query should list the rows that are duplicated and how many times each one occurs. But it returns zero rows, which would mean there are no duplicate rows in that table.
So I would like to know:
If my first approach is correct, then how do I find which rows are duplicated?
Why doesn't the second approach provide the list of duplicated rows?
Is there any other way to find the duplicates?
Hive does not validate primary and foreign key constraints.
Since these constraints are not validated, an upstream system needs to
ensure data integrity before it is loaded into Hive.
That means that Hive allows duplicates in Primary Keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get the list of duplicated rows.
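As a concrete sketch for the table from the question (other_col stands for whatever additional columns the table actually has):
select primary_key1, primary_key2, other_col, count(*) as dup_count
from mytable
group by primary_key1, primary_key2, other_col
having count(*) > 1;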
The analytic window function row_number() is quite useful and can provide the duplicates based upon the columns specified in the partition by clause. A simple in-line view and exists clause will then pinpoint which records in the original table correspond to these duplicates. (In some databases, like Teradata, you can forgo the inline view using a QUALIFY clause.)
SQL 1 and SQL 2 can be combined. For SQL 2: if you want to deal with NULLs rather than simply dismiss those rows, a coalesce and concatenation might be better in the distinct count (using concat() here, since + is not string concatenation in Hive):
SELECT count(1), count(distinct concat(coalesce(keypart1, ''), coalesce(keypart2, '')))
FROM srcTable s
3) This finds all the duplicated records themselves, not just the aggregated count > 1 keys. It provides all the context data as well as the keys, so it can be useful when analyzing why you have dups, not just which keys are involved:
select *
from srcTable s
where exists
      ( select 1
        from (
              SELECT keypart1,
                     keypart2,
                     row_number() over (partition by keypart1, keypart2) seq
              FROM srcTable t
              -- WHERE (whatever additional filtering you want)
             ) t
        where seq > 1
          AND t.keypart1 = s.keypart1
          AND t.keypart2 = s.keypart2
      )
Suppose you want to get duplicate rows based on a particular column, id here. The query below will give you all the id values that are duplicated in a Hive table (mytable from the question):
SELECT id
FROM mytable
GROUP BY id
HAVING count(id) > 1;
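To also see the full duplicated rows rather than only the ids, one option is to join back to the table (a sketch, using mytable and id as above):
SELECT t.*
FROM mytable t
JOIN (
    SELECT id
    FROM mytable
    GROUP BY id
    HAVING count(id) > 1
) d ON t.id = d.id;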

How to retrieve the last 2 records from table?

I have a table with n records.
How can I retrieve the nth record and the (n-1)th record from my table in SQL without using a derived table?
I have tried using ROWID:
select * from table where rowid in (select max(rowid) from table);
It gives the nth record, but I want the (n-1)th record as well.
And is there any method other than using max, derived tables, and pseudocolumns?
Thanks
You cannot depend on rowid to get you to the last row in the table. You need an auto-incrementing id or creation time to have the proper ordering.
You can use, for instance:
select *
from (select t.*, row_number() over (order by <id> desc) as seqnum
      from t
     ) t
where seqnum <= 2
Although allowed in the syntax, the order by clause in a subquery is ignored (for instance http://docs.oracle.com/javadb/10.8.2.2/ref/rrefsqlj13658.html).
Just to be clear, rowids have nothing to do with the ordering of rows in a table. The Oracle documentation is quite clear that they specify a physical access path for the data (http://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i6732). It is true that in an empty database, inserting records into a new table will probably create a monotonically increasing sequence of row ids. But you cannot depend on this. The only guarantees with rowids are that they are unique within a table and are the fastest way to access a particular row.
I have to admit that I cannot find good documentation on Oracle handling or not handling order by's in subqueries in its most recent versions. ANSI SQL does not require compliant databases to support order by in subqueries. Oracle syntax allows it, and it seems to work in some cases, at least. My best guess is that it would probably work on a single processor, single threaded instance of Oracle, or if the data access is through an index. Once parallelism is introduced, the results would probably not be ordered. Since I started using Oracle (in the mid-1990s), I have been under the impression that order bys in subqueries are generally ignored. My advice would be to not depend on the functionality, until Oracle clearly states that it is supported.
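As an aside, on Oracle 12c or later the row-limiting clause avoids a derived table entirely (a sketch, assuming an ordering column named id):
select *
from t
order by id desc
fetch first 2 rows only;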
select * from (select * from my_table order by rowid) where rownum <= 2
and for rows between N and M:
select * from (
  select t.*, rownum rn from (  -- capture rownum here; a plain "where rownum >= N" on the outermost query never matches for N > 1
    select * from my_table order by rowid
  ) t where rownum <= M
) where rn >= N
Try this
select top 2 * from table order by rowid desc
Assuming rowid is a column in your table:
SELECT * FROM table ORDER BY rowid DESC LIMIT 2