Efficiently selecting distinct (a, b) from big table - sql

I have a table with around 54 million rows in a Postgres 9.6 DB and would like to find all distinct pairs of two columns (there are around 4 million such values). I have an index over the two columns of interest:
create index ab_index on tbl (a, b)
What is the most efficient way to get such pairs? I have tried:
select a,b
from tbl
where a>$previouslargesta
group by a,b
order by a,b
limit 1000
And also:
select distinct(a,b)
from tbl
where a>previouslargesta
order by a,b
limit 1000
Also this recursive query:
with recursive t AS (
select min(a) AS a from tbl
union all
select (select min(a) from tbl where a > t.a)
FROM t
WHERE t.a is not null)
select a FROM t where a is not null
But all are slooooooow.
Is there a faster way to get this information?

Your table has 54 million rows and ...
there are around 4 million such values
7.4 % of all rows is a high percentage. An index can mostly only help by providing pre-sorted data, ideally in an index-only scan. There are more sophisticated techniques for smaller result sets (see below), and there are much faster ways for paging, which returns far fewer rows at a time (see below), but for the general case a plain DISTINCT may be among the fastest:
SELECT DISTINCT a, b -- *no* parentheses
FROM tbl;
-- ORDER BY a, b -- ORDER BY wasn't mentioned as a requirement ...
Don't confuse it with DISTINCT ON, which would require parentheses. See:
Select first row in each GROUP BY group?
The B-tree index ab_index you have on (a, b) is already the best index for this. It has to be scanned in its entirety, though. The challenge is to have enough work_mem to process all in RAM. With standard settings it occupies at least 1831 MB on disk, typically more with some bloat. If you can afford it, run the query with a work_mem setting of 2 GB (or more) in your session. See:
Configuration parameter work_mem in PostgreSQL on Linux
SET work_mem = '2 GB';
SELECT DISTINCT a, b ...
RESET work_mem;
A read-only table helps. Otherwise you need VACUUM settings aggressive enough to allow an index-only scan. And more RAM (with appropriate settings) would help to keep the index cached.
Also upgrade to the latest version of Postgres (11.3 as of writing). There have been many improvements for big data.
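To check the assumptions above on your system, you can look at the actual on-disk size of the index and whether the planner uses an index-only scan; pg_relation_size() and EXPLAIN are standard tools for this:
SELECT pg_size_pretty(pg_relation_size('ab_index')); -- actual index size on disk

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT a, b FROM tbl; -- check whether the plan shows "Index Only Scan using ab_index"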
Paging
If you want to add paging as indicated by your sample query, strongly consider row value comparison. See:
Optimize query with OFFSET on large table
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
SELECT DISTINCT a, b
FROM tbl
WHERE (a, b) > ($previous_a, $previous_b) -- !!!
ORDER BY a, b
LIMIT 1000;
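As a usage sketch, each subsequent page feeds the last (a, b) of the previous page back into the row comparison (the literal values here are made up):
SELECT DISTINCT a, b
FROM tbl
WHERE (a, b) > (4217, 38) -- last (a, b) returned by the previous page
ORDER BY a, b
LIMIT 1000;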
Recursive CTE
This may or may not be faster for the general big query as well. For the small subset, it becomes much more attractive:
WITH RECURSIVE cte AS (
( -- parentheses required due to LIMIT 1
SELECT a, b
FROM tbl
WHERE (a, b) > ($previous_a, $previous_b) -- !!!
ORDER BY a, b
LIMIT 1
)
UNION ALL
SELECT x.a, x.b
FROM cte c
CROSS JOIN LATERAL (
SELECT t.a, t.b
FROM tbl t
WHERE (t.a, t.b) > (c.a, c.b) -- lateral reference
ORDER BY t.a, t.b
LIMIT 1
) x
)
TABLE cte
LIMIT 1000;
This can make perfect use of your index and should be as fast as it gets.
Further reading:
Optimize GROUP BY query to retrieve latest row per user
For repeated use and no or little write load on the table, consider a MATERIALIZED VIEW, based on one of the above queries - for much faster read performance.
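A minimal sketch, based on the plain DISTINCT query above (the name distinct_ab is only illustrative); the unique index makes lookups fast and allows a concurrent refresh later:
CREATE MATERIALIZED VIEW distinct_ab AS
SELECT DISTINCT a, b FROM tbl;

CREATE UNIQUE INDEX ON distinct_ab (a, b);

-- after (batches of) writes to tbl:
REFRESH MATERIALIZED VIEW CONCURRENTLY distinct_ab;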

I cannot guarantee the performance on Postgres, but this is a technique I used on SQL Server in a similar case, and it proved faster than the alternatives:
get distinct A into a temp table a
get distinct B into a temp table b
cross join the a and b temps into a Cartesian temp table abALL
rank abALL (optionally)
create a view myview as select top 1 a,b from tbl (your_main_table)
join temp table abALL with myview into a temp table abCLEAN
rank abCLEAN here if you haven't ranked above

Related

Joining a series in postgres with a select query

I'm looking for a way to join these two queries (or run these two together):
SELECT s
FROM generate_series(1, 50) s;
With this query:
SELECT id FROM foo ORDER BY RANDOM() LIMIT 50;
In a way where I get 50 rows like this:
series, ids_from_foo
1, 53
2, 34
3, 23
I've been at it for a couple days now and I can't figure it out. Any help would be great.
Use row_number()
select row_number() over() as rn, a
from (
select a
from foo
order by random()
limit 50
) s
order by rn;
Picking the top n rows from a randomly sorted table is a simple, but slow way to pick 50 rows randomly. All rows have to be sorted that way.
Doesn't matter much for small to medium tables and one-time, ad-hoc use. For repeated use on a big table, there are much more efficient ways.
If the ratio of gaps to islands in the primary key is low, use this:
SELECT row_number() OVER() AS rn, *
FROM (
SELECT *
FROM (
SELECT trunc(random() * 999999)::int AS foo_id
FROM generate_series(1, 55) g
GROUP BY 1 -- fold duplicates
) sub1
JOIN foo USING (foo_id)
LIMIT 50
) sub2;
With an index on foo_id, this is blazingly fast, no matter how big the table. (A primary key serves just fine.) Compare performance with EXPLAIN ANALYZE.
How?
999999 is an estimated row count of the table, rounded up. You can get it cheaply from:
SELECT reltuples FROM pg_class WHERE oid = 'foo'::regclass;
Round up to easily include possible new entries since the last ANALYZE. You can also use the expression itself in a generic query dynamically; it's cheap. Details:
Fast way to discover the row count of a table in PostgreSQL
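A sketch of plugging that expression in directly instead of the hard-coded 999999 (everything else is the query from above):
SELECT row_number() OVER() AS rn, *
FROM (
SELECT *
FROM (
SELECT trunc(random() * (SELECT reltuples FROM pg_class WHERE oid = 'foo'::regclass))::int AS foo_id
FROM generate_series(1, 55) g
GROUP BY 1 -- fold duplicates
) sub1
JOIN foo USING (foo_id)
LIMIT 50
) sub2;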
55 is your desired number of rows (50) in the result, multiplied by a low factor to easily make up for the gap ratio in your table and (unlikely but possible) duplicate random numbers.
If your primary key does not start near 1 (does not have to be 1 exactly, gaps are covered), add the minimum pk value to the calculation:
min_pkey + trunc(random() * 999999)::int
Detailed explanation here:
Best way to select random rows PostgreSQL

Efficient repeated sampling with replacement of a table in a PostgreSQL alike?

I'm trying to check the distribution of numbers in a column of a table. Rather than calculate on the entire table (which is large - tens of gigabytes) I want to estimate via repeated sampling. I think the typical Postgres method for this is
select COLUMN
from TABLE
order by RANDOM()
limit 1;
but this is slow for repeated sampling, especially since (I suspect) it manipulates the entire column each time I run it.
Is there a better way?
EDIT: Just to make sure I expressed it right, I want to do the following:
for(i in 1:numSamples)
draw 500 random rows
end
without having to reorder the entire massive table each time. Perhaps I could get all of the table row IDs and sample from it in R or something, and then just request those rows?
As you want a sample of the data, what about using the estimated size of the table and then calculating a percentage of that as the sample size?
The table pg_class stores an estimate of the number of rows for each table (updated by the vacuum process if I'm not mistaken).
So the following would select 1% of all rows from that table:
with estimated_rows as (
select reltuples as num_rows
from pg_class t
join pg_namespace n on n.oid = t.relnamespace
where t.relname = 'some_table'
and n.nspname = 'public'
)
select *
from some_table
limit (select 0.01 * num_rows from estimated_rows)
;
If you do that very often you might want to create a function so you could do something like this:
select *
from some_table
limit (select estimate_percent(0.01, 'public', 'some_table'))
;
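A minimal sketch of such a function, reusing the pg_class lookup from above (the name estimate_percent and its signature are assumptions, not an existing function):
create function estimate_percent(pct numeric, schema_name text, tbl_name text)
returns bigint
language sql stable
as $$
select (pct * t.reltuples)::bigint
from pg_class t
join pg_namespace n on n.oid = t.relnamespace
where t.relname = tbl_name
and n.nspname = schema_name
$$;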
Create a temporary table from the target table adding a row number column
drop table if exists temp_t;
create temporary table temp_t as
select *, (row_number() over())::int as rn
from t
Create a lighter temporary table by selecting only the columns that will be used in the sampling and filtering as necessary.
Index it by the row number column
create index temp_t_rn on temp_t(rn);
analyze temp_t;
Issue this query for each sample
with r as (
select ceiling(random() * (select max(rn) from temp_t))::int as rn
from generate_series(1, 500) s
)
select *
from temp_t
where rn in (select rn from r)
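Note that IN folds duplicate row numbers, so a draw can return slightly fewer than 500 rows; if true sampling with replacement is wanted (duplicates kept within a draw), a join preserves them. A sketch under that assumption:
with r as (
select ceiling(random() * (select max(rn) from temp_t))::int as rn
from generate_series(1, 500) s
)
select t.*
from r
join temp_t t using (rn);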

How to speed up group-based duplication-count queries on unindexed tables

When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
SELECT COUNT(1)
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to unwanted behaviour: SQL Server counts all rows grouped by c, which (when no index is on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
SELECT 1
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server now in theory can stop counting after n + 1, it leads to the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The two greatest costs in your query are the re-ordering for the GROUP BY (due to the lack of an appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH
sequenced_data AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY (SELECT NULL)) AS sequence_id -- ROW_NUMBER requires an ORDER BY; (SELECT NULL) keeps it arbitrary
FROM
yourTable
)
SELECT
COUNT(*)
FROM
sequenced_data
WHERE
sequence_id = (n+1)
Assumes SQL Server 2005+
Without an index, the GROUP BY solution is the best; every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine whether there are ANY rows in ANY group (or, to rephrase that, "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could work some performance magic.
WITH duplicateRows AS (
SELECT TOP(1)
1
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (but worst-case will take as long as the original approach). I have to say though that this is a somewhat imperative way of thinking - not sure if it's correct...
Speed and "without indexes" almost never go together.
Although, as others here have mentioned, I seriously doubt that it will have performance benefits. Perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
SELECT a.aFK,
ROW_NUMBER() OVER(PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
FROM Address a
) SELECT COUNT(DuplicateCount) FROM duplicateRows
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.

Fastest technique for deleting duplicate data

After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed.
In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete these 5 million rows. Here is my process:
-- Step 1: **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
select
MAX(prikey) as MaxPriKey, -- identity(1, 1)
a,
b,
c,
d,
e,
f,
g,
h,
i
into #dupTemp
FROM sourceTable
group by
a,
b,
c,
d,
e,
f,
g,
h,
i
having COUNT(*) > 1
Next,
-- Step 2: **This step is taking the 3+ hours**
-- delete the row when all the non-unique columns are the same (duplicates) and
-- have a smaller prikey not equal to the max prikey
delete
from sourceTable
from sourceTable
inner join #dupTemp on
sourceTable.a = #dupTemp.a and
sourceTable.b = #dupTemp.b and
sourceTable.c = #dupTemp.c and
sourceTable.d = #dupTemp.d and
sourceTable.e = #dupTemp.e and
sourceTable.f = #dupTemp.f and
sourceTable.g = #dupTemp.g and
sourceTable.h = #dupTemp.h and
sourceTable.i = #dupTemp.i and
sourceTable.PriKey != #dupTemp.MaxPriKey
Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.
Thanks so much.
UPDATE:
I had to stop step 2 from running at the 9 hour mark.
I tried OMG Ponies' method and it finished after only 40 minutes.
I tried my step 2 with Andomar's batch delete; it ran for 9 hours before I stopped it.
UPDATE:
Ran a similar query with one less field to get rid of a different set of duplicates and the query ran for only 4 minutes (8000 rows) using OMG Ponies' method.
I will try the cte technique the next chance I get, however, I suspect OMG Ponies' method will be tough to beat.
What about EXISTS:
DELETE FROM sourceTable
WHERE EXISTS(SELECT NULL
FROM #dupTemp dt
WHERE sourceTable.a = dt.a
AND sourceTable.b = dt.b
AND sourceTable.c = dt.c
AND sourceTable.d = dt.d
AND sourceTable.e = dt.e
AND sourceTable.f = dt.f
AND sourceTable.g = dt.g
AND sourceTable.h = dt.h
AND sourceTable.i = dt.i
AND sourceTable.PriKey < dt.MaxPriKey)
Can you afford to have the original table unavailable for a short time?
I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.
Then drop the original table and rename the intermediate table to have the same name as the old table.
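A rough sketch of that approach, reusing the column list from step 1 (the name sourceTable_clean is illustrative; indexes, constraints and permissions have to be recreated on the new table):
-- keep one row (the highest prikey) per distinct combination of a..i
select MAX(prikey) as prikey, a, b, c, d, e, f, g, h, i
into sourceTable_clean
from sourceTable
group by a, b, c, d, e, f, g, h, i;

-- swap once the copy has been verified
drop table sourceTable;
exec sp_rename 'sourceTable_clean', 'sourceTable';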
The bottleneck in bulk row deletion is usually the transaction that SQL Server has to build up. You might be able to speed it up considerably by splitting the removal into smaller transactions. For example, to delete 100 rows at a time:
while 1=1
begin
delete top (100)
from sourceTable
...
if @@rowcount = 0
break
end
...based on OMG Ponies' comment above, a CTE method that's a little more compact. This method works wonders on tables where you have (for whatever reason) no primary key and can therefore have rows which are identical on all columns.
;WITH cte AS (
SELECT ROW_NUMBER() OVER
(PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY prikey DESC) AS sequence
FROM sourceTable
)
DELETE
FROM cte
WHERE sequence > 1
Well, lots of different things. First, would something like this work (do a select to make sure, maybe even put it into a temp table of its own, #recordsToDelete):
delete sourceTable
from sourceTable
left join #dupTemp on
sourceTable.PriKey = #dupTemp.MaxPriKey
where #dupTemp.MaxPriKey is null
Next, you can index temp tables; put an index on prikey.
If you have records in a temp table of the ones you want to delete, you can delete in batches which is often faster than locking up the whole table with a delete.
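A sketch of that, assuming the keys to remove were collected in a temp table #recordsToDelete(prikey) as suggested above (the batch size is arbitrary):
while 1=1
begin
delete top (10000) s
from sourceTable s
join #recordsToDelete d on d.prikey = s.prikey;

if @@rowcount = 0
break;
end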
Here's a version where you can combine both steps into a single step.
WITH cte AS
( SELECT prikey, ROW_NUMBER() OVER (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY
prikey DESC) AS sequence
FROM sourceTable
)
DELETE
FROM sourceTable
WHERE prikey IN
( SELECT prikey
FROM cte
WHERE sequence > 1
) ;
By the way, do you have any indexes that can be temporarily removed?
If you're using an Oracle database, I recently found that the following statement performs best, in terms of both total duration and CPU consumption.
I've performed several tests with different data sizes, from tens of rows to thousands, always in a loop. I used the TKPROF tool to analyze the results.
Compared to the ROW_NUMBER() solution above, this approach took 2/3 of the original time and consumed about 50% of the CPU time. It seemed to behave linearly, i.e. it should give similar results with any input data size.
Feel free to give me your feedback. I wonder if there is a better method.
DELETE FROM sourceTable
WHERE
ROWID IN(
-- delete all
SELECT ROWID
FROM sourceTable t
MINUS
-- but keep every unique row
SELECT
rid
FROM
(
SELECT a,b,c,d,e,f,g,h,i, MAX(ROWID) KEEP (DENSE_RANK FIRST ORDER BY ROWID) AS RID
FROM sourceTable t
GROUP BY a,b,c,d,e,f,g,h,i
)
)
;

Lazy evaluation of Oracle PL/SQL statements in SELECT clauses of SQL queries

I have a performance problem with an Oracle select statement that I use in a cursor. In the statement one of the terms in the SELECT clause is expensive to evaluate (it's a PL/SQL procedure call, which accesses the database quite heavily). The WHERE clause and ORDER BY clauses are straightforward, however.
I expected that Oracle would first perform the WHERE clause to identify the set of records that match the query, then perform the ORDER BY clause to order them, and finally evaluate each of the terms in the SELECT clause. As I'm using this statement in a cursor from which I then pull results, I expected that the expensive evaluation of the SELECT term would only be performed as needed, when each result was requested from the cursor.
However, I've found that this is not the sequence Oracle uses. Instead, it appears to evaluate the terms in the SELECT clause for each record that matches the WHERE clause before performing the sort. Because of this, the expensive procedure is called for every row in the result set before any results are returned from the cursor.
I want to be able to get the first results out of the cursor as quickly as possible. Can anyone tell me how to persuade Oracle not to evaluate the procedure call in the SELECT statement until after the sort has been performed?
This is all probably easier to describe in example code:
Given a table example with columns a, b, c and d, I have a statement like:
select a, b, expensive_procedure(c)
from example
where <the_where_clause>
order by d;
On executing this, expensive_procedure() is called for every record that matches the WHERE clause, even if I open the statement as a cursor and only pull one result from it.
I've tried restructuring the statement as:
select a, b, expensive_procedure(c)
from example, (select example2.rowid AS rid, ROWNUM
from example example2
where <the_where_clause>
order by d) v
where example.rowid = v.rid;
Here the presence of ROWNUM in the inner SELECT statement forces Oracle to evaluate the inner query first. This restructuring has the desired performance benefit. Unfortunately, it doesn't always respect the ordering that is required.
Just to be clear, I know that I won't be improving the time it takes to return the entire result set. I'm looking to improve the time taken to return the first few results from the statement. I want the time taken to be progressive as I iterate over the results from the cursor, not all of it to elapse before the first result is returned.
Can any Oracle gurus tell me how I can persuade Oracle to stop executing the PL/SQL until it is necessary?
Why join EXAMPLE to itself in the in-line view? Why not just:
select /*+ no_merge(v) */ a, b, expensive_procedure(c)
from
( select a, b, c
from example
where <the_where_clause>
order by d
) v;
If your WHERE conditions are equalities, i.e.
WHERE col1 = :value1
AND col2 = :value2
you can create a composite index on (col1, col2, d):
CREATE INDEX ix_example_col1_col2_d ON example(col1, col2, d)
and hint your query to use it:
SELECT /*+ INDEX (e ix_example_col1_col2_d) */
a, b, expensive_procedure(c)
FROM example e
WHERE col1 = :value1
AND col2 = :value2
ORDER BY
d
In the example below, t_even is a table of 1,000,000 rows with an index on value.
Fetching 100 rows from this query:
SELECT SYS_GUID()
FROM t_even
ORDER BY
value
is instant (0.03 seconds), while this one:
SELECT SYS_GUID()
FROM t_even
ORDER BY
value + 1
takes about 170 seconds to fetch the first 100 rows.
SYS_GUID() is quite expensive in Oracle
As proposed by others, you can also use this:
SELECT a, b, expensive_proc(c)
FROM (
SELECT /*+ NO_MERGE */
*
FROM mytable
ORDER BY
d
)
, but using an index will improve your query response time (how soon the first row is returned).
Does this do what you intend?
WITH
cheap AS
(
SELECT A, B, C, D -- D is needed for the ORDER BY below
FROM EXAMPLE
WHERE <the_where_clause>
)
SELECT A, B, expensive_procedure(C)
FROM cheap
ORDER BY D
You might want to give this a try
select a, b, expensive_procedure(c)
from example, (select /*+ NO_MERGE */
example2.rowid AS rid,
ROWNUM
from example example2
where <the_where_clause>
order by d) v
where example.rowid = v.rid;
Might some form of this work?
FOR R IN (SELECT a,b,c FROM example WHERE ...) LOOP
e := expensive_procedure(R.c);
...
END LOOP;
One of the key problems with the solutions that we've tried is how to adjust the application that generates the SQL to structure the query correctly. The built SQL will vary in terms of number of columns retrieved, number and type of conditions in the where clause and number and type of expressions in the order by.
The inline view returning ROWIDs for joining to the outer was an almost completely generic solution that we can utilise, except where the search is returning a significant portion of the data. In this case the optimiser decides [correctly] that a HASH join is cheaper than a NESTED LOOP.
The other issue was that some of the objects involved are VIEWs that can't have ROWIDs.
For information: "D" was not a typo. The expression for the order by is not selected as part of the return value. Not an unusual thing:
select index_name, column_name
from user_ind_columns
where table_name = 'TABLE_OF_INTEREST'
order by index_name, column_position;
Here, you don't need to know the column_position, but sorting by it is critical.
We have reasons (with which we won't bore the reader) for avoiding the need for hints in the solution, but it's not looking like this is possible.
Thanks for the suggestions thus far - we have tried most of them already ...