Optimization of selection of semi-random rows in Postgres - sql

I currently have a query that randomly selects a job from a table of jobs:
select jobs.job_id
from jobs
where (jobs.type is null)
and (jobs.project_id = 5)
and (jobs.status = 'Available')
offset floor(random() * (select count(*) from jobs
where (jobs.type is null) and (jobs.project_id = 5)
and (jobs.status = 'Available')))
limit 1
This has the desired functionality, but is too slow. I am using Postgres 9.2 so I can't use TABLESAMPLE, unfortunately.
On the plus side, I do not need it to be truly random, so I'm thinking I can optimize it by making it slightly less random.
Any ideas?

Could I suggest an index on jobs(project_id, status, type)? That might speed up your query, if such an index is not already defined on the table.
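A minimal sketch of that index (the index name is just illustrative):
CREATE INDEX jobs_project_status_type_idx ON jobs (project_id, status, type);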

Instead of using OFFSET and LIMIT, why don't you use
ORDER BY random() LIMIT 1
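Applied to your original query, that would look roughly like:
select jobs.job_id
from jobs
where (jobs.type is null)
and (jobs.project_id = 5)
and (jobs.status = 'Available')
order by random()
limit 1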
If that is also too slow, you could replace your subquery
select count(*) from jobs
where (jobs.type is null) and (jobs.project_id = 5)
and (jobs.status = 'Available')
with something like
SELECT reltuples * <factor> FROM pg_class WHERE oid = <tableoid>
where <tableoid> is the OID of the jobs table and <factor> is a number that is slightly bigger than the selectivity of the WHERE condition of your subquery.
That will save one sequential scan, with the downside that you occasionally get no result and have to repeat the query.
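A rough sketch of how that might look plugged into the original query, using 'jobs'::regclass for the table OID; 0.05 is an assumed factor that you would tune to sit slightly above the actual selectivity of the WHERE condition:
select jobs.job_id
from jobs
where (jobs.type is null)
and (jobs.project_id = 5)
and (jobs.status = 'Available')
offset floor(random() * (select reltuples * 0.05 from pg_class
                         where oid = 'jobs'::regclass))
limit 1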
Is that good enough?

A dirty trick: store a random value inside the table and build a (partial) index on it. (You may want to re-randomise this field from time to time so that some records are not forever skipped. ;-)
-- assumed table definition
CREATE table jobs
( job_id SERIAL NOT NULL PRIMARY KEY
, type VARCHAR
, project_id INTEGER NOT NULL
, status VARCHAR NOT NULL
-- pre-computed magic random number
-- , magic DOUBLE PRECISION NOT NULL DEFAULT random()
);
-- some data to play with
INSERT INTO jobs(type,project_id,status)
SELECT 'aaa' , gs %10 , 'Omg!'
FROM generate_series(1,10000) gs;
UPDATE jobs SET type = NULL WHERE random() < 0.2;
UPDATE jobs SET status = 'Available' WHERE random() < 0.2;
-- add a column containing random numbers
ALTER TABLE jobs
ADD column magic DOUBLE PRECISION NOT NULL DEFAULT random()
;
CREATE INDEX ON jobs(magic)
-- partial index: only covers the conditions you will be searching on
WHERE status = 'Available' AND project_id = 5 AND type IS NULL
;
-- make sure statistics are present
VACUUM ANALYZE jobs;
-- EXPLAIN
SELECT j.job_id
FROM jobs j
WHERE j.type is null
AND j.project_id = 5
AND j.status = 'Available'
ORDER BY j.magic
LIMIT 1
;
Something similar can be accomplished by using a serial with a rather high increment value (some prime number around 3G) instead of a random float.

Related

How to increase performance of COUNT SQL query in PostgreSQL?

I have a table with multiple columns, but for simplicity we can consider the following table:
create table tmp_table
(
entity_1_id varchar(255) not null,
status integer default 1 not null,
entity_2_id varchar(255)
);
create index tmp_table_entity_1_id_idx
on tmp_table (entity_1_id);
create index tmp_table_entity_2_id_idx
on tmp_table (entity_2_id);
I want to execute this query:
SELECT tmp_table.entity_2_id, COUNT(*) FROM tmp_table
WHERE tmp_table.entity_1_id='cedca236-3f27-4db3-876c-a6c159f4d15e' AND
tmp_table.status <> 2 AND
tmp_table.entity_2_id = ANY (string_to_array('21c5598b-0620-4a8c-b6fd-a4bfee024254,af0f9cb9-da47-4f6b-a3c4-218b901842f7', ','))
GROUP BY tmp_table.entity_2_id;
It works fine when I pass a string with a few values (like 1-20) to the string_to_array function. But when I try to pass 500 elements, it is too slow. Unfortunately, I really need 100-500 elements.
For this query:
SELECT t.entity_2_id, COUNT(*)
FROM tmp_table t
WHERE t.entity_1_id = 'cedca236-3f27-4db3-876c-a6c159f4d15e' AND
t.status <> 2 AND
t.entity_2_id = ANY (string_to_array('21c5598b-0620-4a8c-b6fd-a4bfee024254,af0f9cb9-da47-4f6b-a3c4-218b901842f7', ','))
GROUP BY t.entity_2_id;
I would recommend an index on tmp_table(entity_1_id, entity_2_id, status).
However, you might find this faster:
select rst.entity_2_id,
(select count(*)
from tmp_table t
where t.entity_2_id = rst.entity_2_id and
t.entity_1_id = 'cedca236-3f27-4db3-876c-a6c159f4d15e' AND
t.status <> 2
) as cnt
from regexp_split_to_table(str, ',') rst(entity_2_id);
Then you want an index on tmp_table(entity_2_id, entity_1_id, status).
In most databases, this would be faster, because the index is a covering index and this avoids the final aggregation over the entire result set. However, Postgres stores locking information on the data pages, so they still need to be read. It is still worth trying.
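For reference, the two indexes suggested above could be created roughly like this (index names are illustrative):
CREATE INDEX tmp_table_e1_e2_status_idx
    ON tmp_table (entity_1_id, entity_2_id, status);  -- for the original GROUP BY query
CREATE INDEX tmp_table_e2_e1_status_idx
    ON tmp_table (entity_2_id, entity_1_id, status);  -- for the correlated-subquery version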

Why query optimizer selects completely different query plans?

Let us have the following table in SQL Server 2016
-- generating 1M test table with four attributes
WITH x AS
(
SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
), t1 AS
(
SELECT ones.n + 10 * tens.n + 100 * hundreds.n + 1000 * thousands.n + 10000 * tenthousands.n + 100000 * hundredthousands.n as id
FROM x ones, x tens, x hundreds, x thousands, x tenthousands, x hundredthousands
)
SELECT id,
id % 50 predicate_col,
row_number() over (partition by id % 50 order by id) join_col,
LEFT('Value ' + CAST(CHECKSUM(NEWID()) AS VARCHAR) + ' ' + REPLICATE('*', 1000), 1000) as padding
INTO TestTable
FROM t1
GO
-- setting the `id` as a primary key (therefore, creating a clustered index)
ALTER TABLE TestTable ALTER COLUMN id int not null
GO
ALTER TABLE TestTable ADD CONSTRAINT pk_TestTable_id PRIMARY KEY (id)
-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col
ON TestTable (predicate_col, join_col)
GO
Now, when I run the following queries with just slightly different predicates (b.predicate_col <= 0 vs. b.predicate_col = 0), I get completely different plans.
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
-- Q2
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col = 0
option (maxdop 1)
If I look at the query plans, it is clear that in the case of Q1 the optimizer chooses to join the key lookup with the non-clustered index seek first and then does the final join with the non-clustered index (which is bad). A much better solution appears in the case of Q2: it joins the non-clustered indexes first and then does the final key lookup.
The question is: why is that and can I improve it somehow?
With my intuitive understanding of histograms, it should be easy to estimate the correct result for both variants of the predicate (b.predicate_col <= 0 vs. b.predicate_col = 0), so why the different query plans?
EDIT:
Actually, I do not want to change the indexes or the physical structure of the table. I would like to understand why the optimizer picks such a bad query plan in the case of Q1. Therefore, my question is precisely this:
Why does it pick such a bad query plan in the case of Q1, and can I improve it without altering the physical design?
I have checked the estimates in the query plans, and both plans have exact row-count estimates for every operator! I have also checked the memo structure (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8615, QUERYTRACEON 8620)) and the rules applied during compilation (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8619, QUERYTRACEON 8620)), and it seems that the optimizer finishes the plan search once it hits the first plan. Is this the reason for such behaviour?
This is caused by SQL Server's inability to use index columns to the right of an inequality predicate.
This code produces the same issue:
SELECT * FROM TestTable WHERE predicate_col <= 0 and join_col = 1
SELECT * FROM TestTable WHERE predicate_col = 0 and join_col <= 1
Inequality predicates such as >= or <= put a limitation on SQL Server: the optimiser can't use the columns in the index that come after the inequality. So when you put an inequality on [predicate_col], you render the rest of the index useless; SQL Server can't make full use of the index and produces an alternate (bad) plan. [join_col] is the last column in the index, so in the second query SQL Server can still make full use of the index.
The reason SQL Server opts for the Hash Match is that it can't guarantee the order of the data coming out of table B. The inequality renders [join_col] in the index useless, so SQL Server has to prepare for unsorted data on the join, even though the row count is the same.
The only way to fix your problem (even though you don't like it) is to alter the Index so that Equality columns come before Inequality columns.
This can be answered from a statistics-and-histogram point of view, or from an index-structure point of view; I will try to answer it from the index structure.
You get the same result from both queries only because there are no records with predicate_col < 0.
When there is a range predicate on the leading column of a composite index, the rest of the index cannot be fully utilised. (There can also be many other reasons for an index not being used.)
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
If we want a plan like Q2's, we can create another composite index.
-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col_1
ON TestTable (join_col,predicate_col)
GO
We get a query plan exactly like Q2's.
Another way is to define a CHECK constraint on predicate_col:
Alter table TestTable ADD check (predicate_col>=0)
GO
This also gives the same query plan as Q2.
Whether you can create a CHECK constraint or another composite index on your real table and data is, of course, another discussion.

SQLite Update Query Optimization

So I have tables with the following structure:
TimeStamp,
var_1,
var_2,
var_3,
var_4,
var_5,...
The table contains about 600 columns named var_##. The user parses some data stored by a machine, and I have to update all null values inside that table to the last valid value. At the moment I use the following query:
update tableName
set var_## =
(select b.var_## from tableName as b
where b.timeStamp <= tableName.timeStamp and b.var_## is not null
order by timeStamp desc limit 1)
where tableName.var_## is null;
The problem right now is the time it takes to run this query for all columns. Is there any way to optimize it?
UPDATE: this is the query I'm executing for one column:
update wme_test2
set var_6 =
(select b.var_6 from wme_test2 as b
where b.timeStamp <= wme_test2.timeStamp and b.var_6 is not null
order by timeStamp desc limit 1)
where wme_test2.var_6 is null;
Having 600 indexes on the data columns would be silly. (But not necessarily more silly than having 600 columns.)
All queries can be sped up with an index on the timeStamp column.
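For example (using the wme_test2 table from the update above; the index name is illustrative):
CREATE INDEX IF NOT EXISTS wme_test2_timestamp_idx ON wme_test2(timeStamp);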

teradata case when issue

I have the following two queries, which are supposed to give the same result but give drastically different ones.
1.
select count(*)
from qigq_sess_parse_2
where str_vendor = 'natural search' and str_category is null and destntn_url = 'http://XXXX.com';
2.
create table qigq_test1 as
(
select case
when (str_vendor = 'natural search' and str_category is null and destntn_url = 'http://XXXX.com' ) then 1
else 0
end as m
from qigq_sess_parse_2
) with data;
select count(*) from qigq_test1 where m = 1;
The first query gives a count of 132868, while the second one gives only 1.
What are the subtle parts of the query that cause this difference?
Thanks
When you create a table in Teradata, you can specify it to be SET or MULTISET. If you don't specify, it defaults to SET. A set table cannot contain duplicates. So at most, your new table will contain two rows, a 0 and a 1, since that's all that can come from your case statement.
EDIT:
After a bit more digging, the defaults aren't quite that simple. But in any case, I suspect that if you add the MULTISET option to your create statement, you'll see the behavior you expect.
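A sketch of the original statement with the MULTISET option added (assuming the rest of the statement stays the same):
create multiset table qigq_test1 as
(
select case
when (str_vendor = 'natural search' and str_category is null and destntn_url = 'http://XXXX.com' ) then 1
else 0
end as m
from qigq_sess_parse_2
) with data;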
My guess would be that your Create Table statement is only pulling in one row of data that fits the parameters for the following Count statement. Try this instead:
CREATE TABLE qigq_test1 (m integer);
INSERT INTO qigq_test1
SELECT
CASE
WHEN (str_vendor = 'natural search' and str_category IS NULL AND destntn_url = 'http://XXXX.com' ) THEN 1
ELSE 0
END AS m
FROM qigq_sess_parse_2;
SELECT COUNT(*) FROM qigq_test1 WHERE m = 1;
This should pull ALL ROWS of data from qigq_sess_parse_2 into qigq_test1 as either a 0 or 1.

How can I make this query faster?

I'm running this query on an Oracle DB:
SELECT COUNT(1) FROM db.table WHERE columnA = 'VALUE' AND ROWNUM < 2
There is no index on columnA, and the table has many thousands of rows (possibly millions). Only about twenty rows match, so it's not a huge result set. However, because the query triggers a full table scan it takes eons. How can I make it go faster?
Note: I'm not a DBA, so I have limited access to the database and can't restructure it, add indexes, or get rid of old data.
If you're looking for the existence of a row, not the number of times it appears, then this would be more appropriate:
SELECT 1
FROM DB.TABLE
WHERE ColumnA = 'VALUE'
AND ROWNUM = 1
That will stop the query as fast as possible once a row's been found; however, if you need it to go faster, that's what indexes are for.
Test Case:
create table q8806566
( id number not null,
column_a number not null,
padding char(256), -- so all the rows aren't really short
constraint pk_q8806566 primary key (id)
using index tablespace users
)
tablespace users;
insert into q8806566 -- 4 million rows
(id, column_a, padding)
with generator as
(select --+ materialize
rownum as rn from dba_objects
where rownum <= 2000)
select rownum as id, mod(rownum, 20) as column_a,
v1.rn as padding
from generator v1
cross join generator v2;
commit;
exec dbms_stats.gather_table_stats (ownname => user, tabname => 'q8806566');
The data for column_A is well distributed, and can be found in the first few blocks for all values, so this query runs well:
SELECT 1
FROM q8806566
WHERE Column_A = 1
AND ROWNUM = 1;
Sub .1 sec execution time and low I/O - on the order of 4 I/Os. However, when looking for a value that's NOT present, things change alarmingly:
SELECT 1
FROM q8806566
WHERE Column_A = 20
AND ROWNUM = 1;
20-40 seconds of execution time, and over 100,000 I/Os.
However, if we add the index:
create index q8806566_idx01 on q8806566 (column_a) tablespace users;
exec dbms_stats.gather_index_stats (ownname => user, indname => 'q8806566_idx01');
We get sub .1 second response time and single-digit I/Os from both queries.