How can I make this query faster? (SQL)

I'm running this query on an Oracle DB:
SELECT COUNT(1) FROM db.table WHERE columnA = 'VALUE' AND ROWNUM < 2
There is no index on columnA, and the table has many, many thousands of rows (possibly millions). Only about twenty rows should match, so it's not a huge set being returned. However, because the query triggers a full table scan, it takes eons. How can I make it go faster?
Note: I'm not a DBA, so I have limited access to the database and can't restructure it, add indexes, or get rid of old data.

If you're looking for the existence of a row, not the number of times it appears, then this would be more appropriate:
SELECT 1
FROM DB.TABLE
WHERE ColumnA = 'VALUE'
AND ROWNUM = 1
That will stop the query as fast as possible once a row's been found; however, if you need it to go faster, that's what indexes are for.
Test Case:
create table q8806566
( id        number not null,
  column_a  number not null,
  padding   char(256),  -- so all the rows aren't really short
  constraint pk_q8806566 primary key (id)
    using index tablespace users
)
tablespace users;

insert into q8806566  -- 4 million rows
  (id, column_a, padding)
with generator as
 (select --+ materialize
         rownum as rn
    from dba_objects
   where rownum <= 2000)
select rownum as id,
       mod(rownum, 20) as column_a,
       v1.rn as padding
  from generator v1
 cross join generator v2;

commit;
exec dbms_stats.gather_table_stats (ownname => user, tabname => 'q8806566');
The data for column_A is well distributed, and can be found in the first few blocks for all values, so this query runs well:
SELECT 1
FROM q8806566
WHERE Column_A = 1
AND ROWNUM = 1;
Sub-0.1 sec execution time and low I/O (on the order of 4 I/Os). However, when looking for a value that's NOT present, things change alarmingly:
SELECT 1
FROM q8806566
WHERE Column_A = 20
AND ROWNUM = 1;
20-40 seconds of execution time, and over 100,000 I/Os.
However, if we add the index:
create index q8806566_idx01 on q8806566 (column_a) tablespace users;
exec dbms_stats.gather_index_stats (ownname => user, indname => 'q8806566_idx01');
We get sub-0.1 second response time and single-digit I/Os from both queries.
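If you want to check the access path yourself (assuming you have the privileges to run EXPLAIN PLAN), something like this displays the plan:
EXPLAIN PLAN FOR
SELECT 1
FROM q8806566
WHERE column_a = 20
AND ROWNUM = 1;

SELECT * FROM TABLE(dbms_xplan.display);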


How to increase performance of COUNT SQL query in PostgreSQL?

I have a table with multiple columns, but for simplicity's sake we can consider the following table:
create table tmp_table
(
    entity_1_id  varchar(255) not null,
    status       integer default 1 not null,
    entity_2_id  varchar(255)
);

create index tmp_table_entity_1_id_idx
    on tmp_table (entity_1_id);

create index tmp_table_entity_2_id_idx
    on tmp_table (entity_2_id);
I want to execute this query:
SELECT tmp_table.entity_2_id, COUNT(*) FROM tmp_table
WHERE tmp_table.entity_1_id='cedca236-3f27-4db3-876c-a6c159f4d15e' AND
tmp_table.status <> 2 AND
tmp_table.entity_2_id = ANY (string_to_array('21c5598b-0620-4a8c-b6fd-a4bfee024254,af0f9cb9-da47-4f6b-a3c4-218b901842f7', ','))
GROUP BY tmp_table.entity_2_id;
It works fine when I pass string_to_array a string with just a few values (say 1-20). But when I try to send 500 elements, it is far too slow. Unfortunately, I really need 100-500 elements.
For this query:
SELECT t.entity_2_id, COUNT(*)
FROM tmp_table t
WHERE t.entity_1_id = 'cedca236-3f27-4db3-876c-a6c159f4d15e' AND
t.status <> 2 AND
t.entity_2_id = ANY (string_to_array('21c5598b-0620-4a8c-b6fd-a4bfee024254,af0f9cb9-da47-4f6b-a3c4-218b901842f7', ','))
GROUP BY t.entity_2_id;
I would recommend an index on tmp_table(entity_1_id, entity_2_id, status).
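A minimal sketch of that index (the name is just illustrative):
CREATE INDEX tmp_table_e1_e2_status_idx
    ON tmp_table (entity_1_id, entity_2_id, status);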
However, you might find this faster:
select rst.entity_2_id,
(select count(*)
from tmp_table t
where t.entity_2_id = rst.entity_2_id and
t.entity_1_id = 'cedca236-3f27-4db3-876c-a6c159f4d15e' AND
t.status <> 2
) as cnt
from regexp_split_to_table(str, ',') rst(entity_2_id);
Then you want an index on tmp_table(entity_2_id, entity_1_id, status).
In most databases this would be faster, because the index is a covering index and it avoids the final aggregation over the entire result set. However, Postgres stores row-visibility information on the data pages themselves, so the heap pages may still need to be read. It is still worth trying.
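The matching index for the subquery version (again, the name is illustrative; str in the query above stands in for your comma-separated parameter string):
CREATE INDEX tmp_table_e2_e1_status_idx
    ON tmp_table (entity_2_id, entity_1_id, status);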

Why does the query optimizer select completely different query plans?

Suppose we have the following table in SQL Server 2016:
-- generating 1M test table with four attributes
WITH x AS
(
SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
), t1 AS
(
SELECT ones.n + 10 * tens.n + 100 * hundreds.n + 1000 * thousands.n + 10000 * tenthousands.n + 100000 * hundredthousands.n as id
FROM x ones, x tens, x hundreds, x thousands, x tenthousands, x hundredthousands
)
SELECT id,
id % 50 predicate_col,
row_number() over (partition by id % 50 order by id) join_col,
LEFT('Value ' + CAST(CHECKSUM(NEWID()) AS VARCHAR) + ' ' + REPLICATE('*', 1000), 1000) as padding
INTO TestTable
FROM t1
GO
-- setting the `id` as a primary key (therefore, creating a clustered index)
ALTER TABLE TestTable ALTER COLUMN id int not null
GO
ALTER TABLE TestTable ADD CONSTRAINT pk_TestTable_id PRIMARY KEY (id)
-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col
ON TestTable (predicate_col, join_col)
GO
OK, and now when I run the following queries, which have just slightly different predicates (b.predicate_col <= 0 vs. b.predicate_col = 0), I get completely different plans.
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
-- Q2
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col = 0
option (maxdop 1)
If I look at the query plans, it is clear that in the case of Q1 the optimizer chooses to join the key lookup with the non-clustered index seek first, and then does the final join with the non-clustered index (which is bad). A much better solution appears in the case of Q2: it joins the non-clustered indexes first and then does the final key lookup.
The question is: why is that, and can I improve it somehow?
In my intuitive understanding of histograms, it should be easy to estimate the correct result for both variants of the predicate (b.predicate_col <= 0 vs. b.predicate_col = 0), so why the different query plans?
EDIT:
Actually, I do not want to change the indexes or the physical structure of the table. I would like to understand why the optimizer picks such a bad query plan in the case of Q1. Therefore, my question is precisely this:
Why does it pick such a bad query plan in the case of Q1, and can I improve it without altering the physical design?
I have checked the row estimates in the query plans, and both plans have exact row estimates for every operator! I have checked the memo structure (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8615, QUERYTRACEON 8620)) and the rules applied during compilation (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8619, QUERYTRACEON 8620)), and it seems that the optimizer finishes the plan search once it hits the first plan. Is this the reason for such behaviour?
This is caused by SQL Server's inability to use Index Columns to the Right of the Inequality search.
This code produces the same issue:
SELECT * FROM TestTable WHERE predicate_col <= 0 and join_col = 1
SELECT * FROM TestTable WHERE predicate_col = 0 and join_col <= 1
Inequality predicates such as >= or <= put a limitation on SQL Server: the optimizer can't use the columns in the index to the right of the inequality. So when you put an inequality on [predicate_col], you render the rest of the index useless; SQL Server can't make full use of the index and produces an alternate (bad) plan. [join_col] is the last column in the index, so in the second query SQL Server can still make full use of the index.
The reason SQL Server opts for the hash match is that it can't guarantee the order of the data coming out of table B. The inequality renders [join_col] in the index useless, so SQL Server has to prepare for unsorted data on the join, even though the row count is the same.
The only way to fix your problem (even though you don't like it) is to alter the index so that equality columns come before inequality columns.
This can be answered from a statistics-and-histogram point of view, or from an index-structure point of view. I will try to answer it from the index structure.
You get the same result from both queries because there are no records with predicate_col < 0. However, when there is a range predicate on a composite index, the columns to the right of it cannot be fully utilised (and there can be many other reasons for an index not being used).
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
If we want a plan like the one in Q2, we can create another composite index:
-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col_1
ON TestTable (join_col,predicate_col)
GO
This gives a query plan exactly like Q2's.
Another way is to define a CHECK constraint on predicate_col:
Alter table TestTable ADD check (predicate_col >= 0)
GO
This also gives the same query plan as Q2.
Whether you can create a CHECK constraint or another composite index on your real table and data is, of course, another discussion.

I have a fairly large table and need to get rows `where column_a <> column_b`, what kind of index do I need?

I have a fairly large table (~ 100M rows) and this table has two boolean columns. Let's call them a and b. I want to get all rows where a is not equal to b:
SELECT *
FROM table
WHERE a <> b
Do I need two indices, one on a and one on b for this, or will a composite index on (a, b) also work here?
I am using PostgreSQL 9.6 and will be upgrading to 10.1 soon.
Pretty much no index is going to help this query. If you happen to know that a is usually equal to b, then you could have an index on an expression. However, indexes on booleans are not usually recommended, and the values would have to be equal most of the time -- think 90% or 99% of the time.
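A minimal sketch of such an expression index (thetable is a stand-in name, since the question just calls it table; Postgres auto-generates the index name):
CREATE INDEX ON thetable ((a <> b));
The query's WHERE a <> b filter matches the indexed expression, so the planner can consider it; the test script further down uses a similar (partial) expression index.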
If there are relatively few records with a <> b, you could use a conditional (partial) index:
CREATE INDEX ON thetable (id) WHERE a <> b;
The actual indexed field id is not that important, and it could possibly shadow an existing unconditional (PK) index. If a and b are nullable (which makes little sense for booleans), you could use a IS DISTINCT FROM b as the condition.
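A sketch of that null-safe variant:
CREATE INDEX ON thetable (id) WHERE a IS DISTINCT FROM b;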
UPDATE:
-- \i tmp.sql
CREATE TABLE thetable
( id serial NOT NULL PRIMARY KEY
, data text
, a boolean NOT NULL
, b boolean NOT NULL
);
INSERT INTO thetable(a,b, data)
SELECT True, True, 'data_' || gs::integer
FROM generate_series(1,1000000) gs
;
UPDATE thetable SET a = False WHERE id % 37 = 0 ;
UPDATE thetable SET b = False WHERE id % 47 = 0 ;
SELECT version();
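-- the DROP INDEX below fails on the first run (hence the ERROR in the
-- output); it is only there so the script can be re-run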
DROP INDEX zzzzzz ;
CREATE INDEX zzzzzz ON thetable((a<>b)) WHERE a<>b;
VACUUM ANALYZE thetable;
EXPLAIN ANALYZE
-- SELECT COUNT(*)
SELECT id
FROM thetable
WHERE a <> b
;
Result:
psql:tmp.sql:2: NOTICE: drop cascades to table tmp.thetable
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 1000000
UPDATE 27027
UPDATE 21276
SET
version
----------------------------------------------------------------------------------------------
PostgreSQL 9.3.5 on i686-pc-linux-gnu, compiled by gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 32-bit
(1 row)
ERROR: index "zzzzzz" does not exist
CREATE INDEX
VACUUM
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Index Scan using zzzzzz on thetable (cost=0.29..9741.19 rows=995000 width=4) (actual time=0.057..193.087 rows=47153 loops=1)
Index Cond: ((a <> b) = true)
Total runtime: 259.891 ms
(3 rows)
So it appears you need exactly the same condition in the index expression as in the index's WHERE condition (and it should probably match the query's condition, too).

Optimization of selection of semi-random rows in Postgres

I currently have a query that randomly selects a job from a table of jobs:
select jobs.job_id
from jobs
where (jobs.type is null)
and (jobs.project_id = 5)
and (jobs.status = 'Available')
offset floor(random() * (select count(*) from jobs
where (jobs.type is null) and (jobs.project_id = 5)
and (jobs.status = 'Available')))
limit 1
This has the desired functionality, but is too slow. I am using Postgres 9.2 so I can't use TABLESAMPLE, unfortunately.
On the plus side, I do not need it to be truly random, so I'm thinking I can optimize it by making it slightly less random.
Any ideas?
Could I suggest an index on jobs(project_id, status, type)? That might speed up your query, if such an index is not already defined on the table.
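That is (the index name is just illustrative):
CREATE INDEX jobs_project_status_type_idx
    ON jobs (project_id, status, type);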
Instead of using OFFSET and LIMIT, why don't you use
ORDER BY random() LIMIT 1
If that is also too slow, you could replace your subquery
select count(*) from jobs
where (jobs.type is null) and (jobs.project_id = 5)
and (jobs.status = 'Available')
with something like
SELECT reltuples * <factor> FROM pg_class WHERE oid = <tableoid>
where <tableoid> is the OID of the jobs table and <factor> is a number that is slightly bigger than the selectivity of the WHERE condition of your subquery.
That will save one sequential scan, with the downside that you occasionally get no result and have to repeat the query.
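Concretely, a sketch (the 0.05 is a made-up selectivity factor; you would estimate it from your own data):
SELECT reltuples * 0.05
FROM pg_class
WHERE oid = 'jobs'::regclass;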
Is that good enough?
A dirty trick: store the random value inside the table and build a (partial) index on it. (You may want to re-randomise this field from time to time to prevent some records from never being picked. ;-)
-- assumed table definition
CREATE table jobs
( job_id SERIAL NOT NULL PRIMARY KEY
, type VARCHAR
, project_id INTEGER NOT NULL
, status VARCHAR NOT NULL
-- pre-computed magic random number
-- , magic DOUBLE PRECISION NOT NULL DEFAULT random()
);
-- some data to play with
INSERT INTO jobs(type,project_id,status)
SELECT 'aaa' , gs %10 , 'Omg!'
FROM generate_series(1,10000) gs;
UPDATE jobs SET type = NULL WHERE random() < 0.2;
UPDATE jobs SET status = 'Available' WHERE random() < 0.2;
-- add a column containing random numbers
ALTER TABLE jobs
ADD column magic DOUBLE PRECISION NOT NULL DEFAULT random()
;
CREATE INDEX ON jobs(magic)
-- index is only applied for the conditions you will be searching
WHERE status = 'Available' AND project_id = 5 AND type IS NULL
;
-- make sure statistics are present
VACUUM ANALYZE jobs;
-- EXPLAIN
SELECT j.job_id
FROM jobs j
WHERE j.type is null
AND j.project_id = 5
AND j.status = 'Available'
ORDER BY j.magic
LIMIT 1
;
Something similar can be accomplished by using a serial with a rather high increment value (some prime number around 3 billion) instead of a random float.
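A rough sketch of that variant, using 2^31 - 1 (a well-known prime) as the large increment; CYCLE lets the sequence wrap around instead of erroring out, and the sequence and column names are illustrative:
CREATE SEQUENCE jobs_magic_seq INCREMENT BY 2147483647 CYCLE;
ALTER TABLE jobs
    ADD COLUMN magic2 BIGINT NOT NULL DEFAULT nextval('jobs_magic_seq');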

Fastest check if row exists in PostgreSQL

I have a bunch of rows that I need to insert into a table, but these inserts are always done in batches. So I want to check if a single row from the batch exists in the table, because then I know they all were inserted.
So it's not a primary-key check, but that shouldn't matter too much. I would like to check only a single row, so count(*) probably isn't good; it's something like EXISTS, I guess.
But since I'm fairly new to PostgreSQL I'd rather ask people who know.
My batch contains rows with following structure:
userid | rightid | remaining_count
So if the table contains any rows with the provided userid, it means they are all present there.
Use the EXISTS keyword for a TRUE / FALSE return:
select exists(select 1 from contact where id=12)
How about simply:
select 1 from tbl where userid = 123 limit 1;
where 123 is the userid of the batch that you're about to insert.
The above query will return either an empty set or a single row, depending on whether there are records with the given userid.
If this turns out to be too slow, you could look into creating an index on tbl.userid.
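For example (Postgres will auto-generate an index name):
CREATE INDEX ON tbl (userid);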
if even a single row from the batch exists in the table, I don't have to insert my rows, because I know for sure they all were inserted.
For this to remain true even if your program gets interrupted mid-batch, I'd recommend that you make sure you manage database transactions appropriately (i.e. that the entire batch gets inserted within a single transaction).
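A minimal sketch of that, with placeholder values for the batch:
BEGIN;
INSERT INTO tbl (userid, rightid, remaining_count) VALUES (123, 1, 10);
INSERT INTO tbl (userid, rightid, remaining_count) VALUES (123, 2, 10);
-- ... the rest of the batch ...
COMMIT;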
INSERT INTO target( userid, rightid, count )
SELECT userid, rightid, count
FROM batch
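-- note: this subquery is deliberately uncorrelated with the outer SELECT;
-- per the question's premise, one matching batch row in target means the
-- whole batch was already inserted, so the batch is then skipped as a whole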
WHERE NOT EXISTS (
SELECT * FROM target t2, batch b2
WHERE t2.userid = b2.userid
-- ... other keyfields ...
)
;
BTW: if you want the whole batch to fail in case of a duplicate, then (given a primary key constraint)
INSERT INTO target( userid, rightid, count )
SELECT userid, rightid, count
FROM batch
;
will do exactly what you want: either it succeeds, or it fails.
If you think about performance, maybe you can use PERFORM inside a PL/pgSQL function, like this:
-- inside a PL/pgSQL function body (i is a variable holding the id to check):
PERFORM 1 FROM skytf.test_2 WHERE id = i LIMIT 1;
IF FOUND THEN
    RAISE NOTICE 'found record id=%', i;
ELSE
    RAISE NOTICE 'not found record id=%', i;
END IF;
As @MikeM pointed out:
select exists(select 1 from contact where id=12)
With an index on contact(id), it can usually reduce the time cost to about 1 ms:
CREATE INDEX index_contact on contact(id);
SELECT 1 FROM user_right where userid = ? LIMIT 1
If your resultset contains a row then you do not have to insert. Otherwise insert your records.
select true from tablename where condition limit 1;
I believe that this is the query that Postgres uses for checking foreign keys.
In your case, you could also do it in one go. Note that the check needs NOT EXISTS: a bare (select true ... limit 1) yields NULL rather than false when no row is found, which would suppress the insert:
insert into yourtable select $userid, $rightid, $count where not exists (select 1 from yourtable where userid = $userid);