Incremental DISTINCT / GROUP BY operation - sql

I have a simple two-stage SQL query that operates on two tables A and B. I use a sub-select to retrieve a number of IDs of table A that are stored as foreign keys in B, using a (possibly complex) query on table B (and possibly other joined tables). Then I simply want to return the first x IDs of A. I tried a query like this:
SELECT sq.id
FROM (
    SELECT a_id AS id, created_at
    FROM B
    WHERE ...
    ORDER BY created_at DESC
) sq
GROUP BY sq.id
ORDER BY max(sq.created_at) DESC
LIMIT 10;
which is quite slow as Postgres seems to perform the GROUP BY / DISTINCT operation on the whole result set before limiting it. If I LIMIT the sub-query (e.g. to 100), the performance is just fine (as I'd expect), but of course it's no longer guaranteed that there will be at least 10 distinct a_id values in the resulting rows of sq.
Similarly, the query
SELECT a_id AS id
FROM B
WHERE ...
GROUP BY id
ORDER BY max(created_at) DESC
LIMIT 10
is quite slow as Postgres seems to perform a sequential scan on B instead of using an (existing) index. If I remove the GROUP BY clause it uses the index just fine.
The data in table B is such that most rows contain different a_ids, hence even without the GROUP BY most of the returned IDs will be different. The goal I pursue with the grouping is to assure that the result set always contains a given number of entries from A.
Is there a way to perform an "incremental DISTINCT / GROUP BY"? In my naive thinking it would suffice for Postgres to produce result rows and group them incrementally until it reaches the number specified by LIMIT, which in most cases should be nearly instantaneous as most a_id values are different. I tried various ways to query the data but so far I didn't find anything that works reliably.
The Postgres version is 9.6; the data schema is as follows:
Table "public.a"
Column | Type | Modifiers
--------+-------------------+------------------------------------------------
id | bigint | not null default nextval('a_id_seq'::regclass)
bar | character varying |
Indexes:
"a_pkey" PRIMARY KEY, btree (id)
"ix_a_bar" btree (bar)
Referenced by:
TABLE "b" CONSTRAINT "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)
Table "public.b"
Column | Type | Modifiers
------------+-----------------------------+--------------------------------------------------
id | bigint | not null default nextval('b_id_seq'::regclass)
foo | character varying |
a_id | bigint | not null
created_at | timestamp without time zone |
Indexes:
"b_pkey" PRIMARY KEY, btree (id)
"ix_b_created_at" btree (created_at)
"ix_b_foo" btree (foo)
Foreign-key constraints:
"b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

This problem is much more complex than it might seem at first glance.
If ...
- your criteria are not very selective (many more than 10 distinct a_id values qualify), and
- you don't have many duplicate a_id values in table B (like you stated),
then there is a very fast way.
To simplify things a bit, I assume created_at is also defined NOT NULL, or you need to do more.
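If it is currently nullable, one option (a sketch, assuming any existing NULL values have already been cleaned up) is simply:
-- assumption: no NULL values remain in created_at before adding the constraint
ALTER TABLE b ALTER COLUMN created_at SET NOT NULL;
With that in place, the recursive query looks like this: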
WITH RECURSIVE top10 AS (
    (  -- extra parentheses required
    SELECT a_id, ARRAY[a_id] AS id_arr, created_at
    FROM   b
    WHERE  ...                             -- your other filter conditions here
    ORDER  BY created_at DESC, a_id DESC   -- both NOT NULL
    LIMIT  1
    )
    UNION ALL  -- UNION ALL, not UNION, since we exclude dupes a priori
    (
    SELECT b.a_id, id_arr || b.a_id, b.created_at
    FROM   top10 t
    JOIN   b ON  (b.created_at, b.a_id)
               < (t.created_at, t.a_id)    -- comparing ROW values
             AND b.a_id <> ALL (t.id_arr)
    WHERE  ...                             -- repeat conditions
    ORDER  BY created_at DESC, a_id DESC
    LIMIT  1
    )
)
SELECT a_id
FROM   top10
LIMIT  10;
Ideally supported by an index on (created_at DESC, a_id DESC) (or just (created_at, a_id)).
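For example (the index name is arbitrary):
-- matches both the ORDER BY and the row-value comparison in the recursive term
CREATE INDEX ix_b_created_at_a_id ON b (created_at DESC, a_id DESC);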
Depending on your other WHERE conditions, other (partial?) indexes may serve even better.
This is particularly efficient for a small result set. Else, and depending on various other details, other solutions may be faster.
Related (with much more explanation):
Can spatial index help a “range - order by - limit” query
Optimize GROUP BY query to retrieve latest record per user
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Best way to select random rows PostgreSQL
PostgreSQL sort by datetime asc, null first?
Select first row in each GROUP BY group?

The only way the planner has a chance to avoid sorting the whole table is if you have an index on the complete ORDER BY clause.
Then an index scan can be chosen to get the correct ordering, and the first ten result rows may be found quickly.


Optimisation of sql query for deleting duplicate items from large table

Could anyone please help me optimise one of my queries, which takes more than 20 minutes to run against 3 million rows?
Table Structure
-----------------------------------------------------------------------------------------
|id [INT Auto Inc]| name_id (uuid) | name (varchar)| city (varchar) | name_type(varchar)|
-----------------------------------------------------------------------------------------
Query
The purpose of the query is to eliminate duplicates; here, a duplicate means rows having the same name_id and name.
DELETE FROM records
WHERE id NOT IN (
    SELECT DISTINCT ON (name_id, name) id
    FROM records
);
I would write your delete using exists logic:
DELETE FROM records r1
WHERE EXISTS (
    SELECT 1
    FROM records r2
    WHERE r2.name_id = r1.name_id
      AND r2.name = r1.name
      AND r2.id < r1.id
);
This delete query will spare the duplicate having the smallest id value. To speed this up, you may try adding the following index:
CREATE INDEX idx ON records (name_id, name, id);
You probably already have a primary key on the identity column; you can use it to exclude redundant rows by id in the following way:
WITH cte AS (
    SELECT MIN(id) AS id
    FROM records
    GROUP BY name_id, name
)
DELETE FROM records
WHERE NOT EXISTS (SELECT id FROM cte WHERE id = records.id);
Even without the index, this should work relatively fast, probably because of merge join strategy.
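To verify which plan the planner actually picks, you can run the statement under EXPLAIN inside a transaction and roll it back (note that EXPLAIN ANALYZE really executes the DELETE):
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
WITH cte AS (
    SELECT MIN(id) AS id FROM records GROUP BY name_id, name
)
DELETE FROM records
WHERE NOT EXISTS (SELECT id FROM cte WHERE id = records.id);
ROLLBACK;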

mariadb not using all fields of composite index

MariaDB is not fully using the composite index. The fast select and the slow select both return the same data, but EXPLAIN shows that the slow select uses only the ix_test_relation.entity_id part of the index and does not use the ix_test_relation.stamp part.
I tried many variants (inner join, WITH, FROM) but couldn't make MariaDB use both fields of the index together with the recursive query. I understand that I need to tell MariaDB to materialize the recursive query somehow.
Please help me optimize the slow select, which uses a recursive query, so that it runs at a speed similar to the fast select.
Some details about the task... I need to query user activity. One user-activity record may relate to multiple entities. Entities are hierarchical. I need to query user activity for some parent entity and all its children for a specified stamp range. Stamp is simplified from TIMESTAMP to BIGINT for demonstration purposes. There can be a lot (1 million) of entities, and each entity may relate to a lot (1 million) of user-activity entries. The entity hierarchy is expected to be about 10 levels deep. I assume that the stamp range used reduces the number of user-activity records to 10-100. I denormalized the schema and copied stamp from test_entry to test_relation so that I could include it in the test_relation index.
I use 10.4.11-Mariadb-1:10:4.11+maria~bionic.
I can upgrade or patch or whatever mariadb if needed, I have full control over building docker image.
Schema:
CREATE TABLE test_entity(
    id        BIGINT NOT NULL,
    parent_id BIGINT NULL,
    CONSTRAINT pk_test_entity PRIMARY KEY (id),
    CONSTRAINT fk_test_entity_pid FOREIGN KEY (parent_id) REFERENCES test_entity(id)
);

CREATE TABLE test_entry(
    id    BIGINT NOT NULL,
    name  VARCHAR(100) NOT NULL,
    stamp BIGINT NOT NULL,
    CONSTRAINT pk_test_entry PRIMARY KEY (id)
);

CREATE TABLE test_relation(
    entry_id  BIGINT NOT NULL,
    entity_id BIGINT NOT NULL,
    stamp     BIGINT NOT NULL,
    CONSTRAINT pk_test_relation PRIMARY KEY (entry_id, entity_id),
    CONSTRAINT fk_test_relation_erid FOREIGN KEY (entry_id) REFERENCES test_entry(id),
    CONSTRAINT fk_test_relation_enid FOREIGN KEY (entity_id) REFERENCES test_entity(id)
);

CREATE INDEX ix_test_relation ON test_relation(entity_id, stamp);

CREATE SEQUENCE sq_test_entry;
Test data:
CREATE OR REPLACE PROCEDURE test_insert()
BEGIN
    DECLARE v_entry_id BIGINT;
    DECLARE v_parent_entity_id BIGINT;
    DECLARE v_child_entity_id BIGINT;

    FOR i IN 1..1000 DO
        SET v_parent_entity_id = i * 2;
        SET v_child_entity_id = i * 2 + 1;

        INSERT INTO test_entity(id, parent_id)
        VALUES(v_parent_entity_id, NULL);

        INSERT INTO test_entity(id, parent_id)
        VALUES(v_child_entity_id, v_parent_entity_id);

        FOR j IN 1..1000000 DO
            SELECT NEXT VALUE FOR sq_test_entry INTO v_entry_id;

            INSERT INTO test_entry(id, name, stamp)
            VALUES(v_entry_id, CONCAT('entry ', v_entry_id), j);

            INSERT INTO test_relation(entry_id, entity_id, stamp)
            VALUES(v_entry_id, v_parent_entity_id, j);

            INSERT INTO test_relation(entry_id, entity_id, stamp)
            VALUES(v_entry_id, v_child_entity_id, j);
        END FOR;
    END FOR;
END;

CALL test_insert;
Slow select (> 100ms):
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (
    WITH RECURSIVE recursive_child AS (
        SELECT id
        FROM test_entity
        WHERE id IN (2, 4)
        UNION ALL
        SELECT C.id
        FROM test_entity C
        INNER JOIN recursive_child P ON P.id = C.parent_id
    )
    SELECT id
    FROM recursive_child
)
AND TR.stamp BETWEEN 6 AND 8
Fast select (1-2ms):
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (2,3,4,5)
AND TR.stamp BETWEEN 6 AND 8
UPDATE 1
I can demonstrate the problem with an even shorter example.
First, explicitly store the required entity_id records in a temporary table:
CREATE OR REPLACE TEMPORARY TABLE tbl
WITH RECURSIVE recursive_child AS (
    SELECT id
    FROM test_entity
    WHERE id IN (2, 4)
    UNION ALL
    SELECT C.id
    FROM test_entity C
    INNER JOIN recursive_child P ON P.id = C.parent_id
)
SELECT id
FROM recursive_child
Now try to run the select using the temporary table (below). The select is still slow, but the only difference from the fast query is that the IN clause now queries a table instead of inline constants.
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8
For your queries (both of them) it looks to me like you should, as you mentioned, flip the column order on your compound index:
CREATE INDEX ix_test_relation ON test_relation(stamp, entity_id);
Why?
Your queries have a range filter, TR.stamp BETWEEN 6 AND 8, on that column. For a range filter to use an index range scan (whether on a TIMESTAMP or a BIGINT column), the column being filtered must be first in a multicolumn index.
You also want a sargable filter, that is, something like this:
TR.stamp >= CURDATE() - INTERVAL 7 DAY
AND TR.stamp < CURDATE()
in place of
DATE(TR.stamp) BETWEEN DATE(NOW() - INTERVAL 7 DAY) AND DATE(NOW())
That is, don't put a function on the column you're scanning in your WHERE clause.
With a structured query like your first one, the query planner turns it into several queries. You can see this with ANALYZE FORMAT=JSON. The planner may choose different indexes and/or different chunks of indexes for each of those subqueries.
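For example, running it on the query from your update (this executes the statement and prints a JSON plan showing each subquery, the index it used and the actual row counts):
ANALYZE FORMAT=JSON
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
  AND TR.stamp BETWEEN 6 AND 8;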
And, a word to the wise: don't get too wrapped around the axle trying to outguess the query planner built into the DBMS. It's an extraordinarily complex and highly wrought piece of software, created by decades of programming work by world-class experts in optimization. Our job as MariaDB / MySQL users is to find the right indexes.
The order of columns in a composite index matters. (O.Jones explains it nicely -- using SQL that has been removed from the Question?!)
I would rewrite
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8
as
SELECT TR.entry_id
FROM tbl
JOIN test_relation TR ON tbl.id = TR.entity_id
WHERE TR.stamp BETWEEN 6 AND 8
or
SELECT entry_id
FROM test_relation TR
WHERE TR.stamp BETWEEN 6 AND 8
AND EXISTS ( SELECT 1 FROM tbl
WHERE tbl.id = TR.entity_id )
And have these in either case:
TR: INDEX(stamp, entity_id, entry_id) -- With `stamp` first
tbl: INDEX(id) -- maybe
Since tbl is a freshly built TEMPORARY TABLE, and it seems that only 3 rows need checking, it may not be worth adding INDEX(id).
Also needed:
test_entity: INDEX(parent_id, id)
Assuming that test_relation is a many:many mapping table, it is likely that you will also need (though not necessarily for the current query):
INDEX(entity_id, entry_id)
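Spelled out as DDL (the index names here are made up), those suggestions would be roughly:
-- range column first, then the join/output columns, so the index covers the query
CREATE INDEX ix_test_relation_stamp ON test_relation (stamp, entity_id, entry_id);
-- supports walking the hierarchy from parent to children in the recursive CTE
CREATE INDEX ix_test_entity_parent ON test_entity (parent_id, id);
-- the likely companion index for a many:many mapping table
CREATE INDEX ix_test_relation_entity ON test_relation (entity_id, entry_id);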

How to break ties when comparing columns in SQL

I am trying to delete duplicates in Postgres. I am using this as the base of my query:
DELETE FROM case_file as p
WHERE EXISTS (
SELECT FROM case_file as p1
WHERE p1.serial_no = p.serial_no
AND p1.cfh_status_dt < p.cfh_status_dt
);
It works well, except that when the dates cfh_status_dt are equal, neither of the records is removed.
For rows that have the same serial_no and the same date, I would like to keep the one that has a registration_no (if any do; this column also has NULLs).
Is there a way I can do this with all one query, possibly with a case statement or another simple comparison?
DELETE FROM case_file AS p
WHERE id NOT IN (
SELECT DISTINCT ON (serial_no) id -- id = PK
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no
);
This keeps the (one) latest row per serial_no, choosing the smallest registration_no if there are multiple candidates.
NULL sorts last in default ascending order. So any row with a not-null registration_no is preferred.
If you want the greatest registration_no instead, to still sort NULL values last, use:
...
ORDER BY serial_no, cfh_status_dt DESC, registration_no DESC NULLS LAST
See:
Select first row in each GROUP BY group?
Sort by column ASC, but NULL values first?
If you have no PK (PRIMARY KEY) or other UNIQUE NOT NULL (combination of) column(s) you can use for this purpose, you can fall back to ctid. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
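In that case the same pattern could look like this (a sketch, not tuned):
-- same logic as above, using the physical row identifier instead of a PK
DELETE FROM case_file
WHERE ctid NOT IN (
    SELECT DISTINCT ON (serial_no) ctid
    FROM case_file
    ORDER BY serial_no, cfh_status_dt DESC, registration_no
);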
NOT IN is typically not the most efficient way. But this deals with duplicates involving NULL values. See:
How to delete duplicate rows without unique identifier
If there are many duplicates - and you can afford to do so! - it can be (much) more efficient to create a new, pristine table of survivors and replace the old table, instead of deleting the majority of rows in the existing table.
Or create a temporary table of survivors, truncate the old and insert from the temp table. This way depending objects like views or FK constraints can stay in place. See:
How to delete duplicate entries?
Surviving rows are simply:
SELECT DISTINCT ON (serial_no) *
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no;
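The temp-table route, sketched in one transaction (TRUNCATE only works here if no other table references case_file via foreign key; otherwise fall back to DELETE):
BEGIN;
CREATE TEMP TABLE survivors AS
SELECT DISTINCT ON (serial_no) *
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no;
TRUNCATE case_file;
INSERT INTO case_file
SELECT * FROM survivors;
COMMIT;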

SQL Server : index for finding latest value which is greater than a passed value

I have a table with 4 columns
USER_ID: numeric
EVENT_DATE: date
VERSION: date
SCORE: decimal
I have a clustered index on (USER_ID, EVENT_DATE, VERSION). These three values together are unique.
I need to get the maximum EventDate for a set of UserIds (~1000 different ids) where the Score is larger than a specific value and only consider those entries with a specific Version.
SELECT M.*
FROM (VALUES
    ( 5237 ),
    ………1000 more
    ( 27054 )
) C (USER_ID)
CROSS APPLY (
    SELECT TOP 1 C.USER_ID, M.EVENT_DATE, M.SCORE
    FROM MY_HUGE_TABLE M
    WHERE C.USER_ID = M.USER_ID
      AND M.VERSION = 'xxxx-xx-xx'
      AND M.SCORE > 2        -- the query is fast if this filter is removed
    ORDER BY M.EVENT_DATE DESC
) M
When I execute the query, runtime is poor, presumably due to a missing index on the SCORE column.
If I remove the filter on M.SCORE > 2 I get my results ten times faster; however, the latest scores may then be less than 2.
Could anyone please give me a hint on how to set up an index that would improve the performance of this query?
Thank you very much in advance
For your query, the optimal index would be on (USER_ID, VERSION, EVENT_DATE DESC, SCORE).
Unfortunately, your clustered index doesn't match: only the first and third of those columns appear in it, and they need to match in order. So only the USER_ID part can help, and that probably doesn't do much to filter the data.
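As a sketch (the index name is made up; whether SCORE belongs in the key or in an INCLUDE clause depends on your workload):
CREATE NONCLUSTERED INDEX IX_MyHugeTable_User_Version_Date_Score
    ON MY_HUGE_TABLE (USER_ID, VERSION, EVENT_DATE DESC, SCORE);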

order by field with more than 10000 ids

I need to do specific ordering with use of order by field.
select * from table order by field(id,3,4,1,2.......upto 10000 ids)
As the required ordering cannot be derived within SQL, how much does this affect performance, and is it feasible at all?
Updates from the comments:
Ordering depends on user and category IDs and can be anything the user wants.
The ordering specification changes (about) daily.
So, we need a custom ordering that depends on the user and category and this ordering needs to change daily.
The easiest way would be to put your ordering in a separate table (called ordering_table in this example):
id | position
----+----------
1 | 11
2 | 42
3 | 23
etc.
The above would mean "put an id of 1 at position 11, 2 at position 42, 3 at position 23, ...". Then you can join that ordering table in:
SELECT t.id, t.col1, t.col2
FROM some_table t
JOIN ordering_table o ON (t.id = o.id)
ORDER BY o.position
Where ordering_table is the table (as above) that defines your strange ordering. This approach simply represents your ordering function as a table (any function with a finite domain is, essentially, just a table after all).
This "ordering table" approach should work fine as long as the ordering table is complete.
If you only need this strange ordering in one place then you could merge the position column into your main table and add NOT NULL and UNIQUE constraints on that column to make sure you cover everything and have a consistent ordering.
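A minimal sketch of that single-table variant, assuming the main table is the some_table from the query above:
-- add the column first, backfill the positions, then tighten the constraints
ALTER TABLE some_table ADD COLUMN position INT NULL;
-- ... populate position for every row here ...
ALTER TABLE some_table
    MODIFY position INT NOT NULL,
    ADD UNIQUE KEY uq_some_table_position (position);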
Further commenting indicates that you want different orderings for different users and categories and that the ordering will change on a daily basis. You could make separate tables for each condition (which would lead to a combinatorial explosion) or, as Mikael Eriksson and ypercube suggest, add a couple more columns to the ordering table to hold the user and category:
CREATE TABLE ordering_table (
    thing_id    INT NOT NULL,
    position    INT NOT NULL,
    user_id     INT NOT NULL,
    category_id INT NOT NULL
);
The thing_id, user_id, and category_id columns would be foreign keys to their respective tables, and you'd probably want to index all the columns in ordering_table, but spending a couple of minutes looking at the query plans to see whether the indexes actually get used would be worthwhile. You could also make all four columns the primary key to avoid duplicates. Then the lookup query would be something like this:
SELECT t.id, t.col1, t.col2
FROM some_table t
LEFT JOIN ordering_table o
ON (t.id = o.thing_id AND o.user_id = $user AND o.category_id = $cat)
ORDER BY COALESCE(o.position, 99999)
Here $user and $cat are the user and category IDs, respectively. Note the change to a LEFT JOIN and the addition of COALESCE to allow for missing rows in ordering_table; these changes push anything that doesn't have a specified position to the bottom of the list rather than removing it from the results completely.
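If you go the primary-key route mentioned above, a sketch (assuming ordering_table starts without one) would be:
-- all four columns in the key prevents exact duplicate entries, and
-- (user_id, category_id, thing_id) as the leading columns supports the join above
ALTER TABLE ordering_table
    ADD PRIMARY KEY (user_id, category_id, thing_id, position);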