How to increase performance of COUNT SQL query in PostgreSQL?

I have a table with multiple columns, but for simplicity's sake we can consider the following table:
create table tmp_table
(
entity_1_id varchar(255) not null,
status integer default 1 not null,
entity_2_id varchar(255)
);
create index tmp_table_entity_1_id_idx
on tmp_table (entity_1_id);
create index tmp_table_entity_2_id_idx
on tmp_table (entity_2_id);
I want to execute this query:
SELECT tmp_table.entity_2_id, COUNT(*) FROM tmp_table
WHERE tmp_table.entity_1_id='cedca236-3f27-4db3-876c-a6c159f4d15e' AND
tmp_table.status <> 2 AND
tmp_table.entity_2_id = ANY (string_to_array('21c5598b-0620-4a8c-b6fd-a4bfee024254,af0f9cb9-da47-4f6b-a3c4-218b901842f7', ','))
GROUP BY tmp_table.entity_2_id;
It works fine when I pass a string with only a few values (around 1-20) to the string_to_array function. But when I try to send 500 elements, it is too slow. Unfortunately, I really need 100-500 elements.

For this query:
SELECT t.entity_2_id, COUNT(*)
FROM tmp_table t
WHERE t.entity_1_id = 'cedca236-3f27-4db3-876c-a6c159f4d15e' AND
t.status <> 2 AND
t.entity_2_id = ANY (string_to_array('21c5598b-0620-4a8c-b6fd-a4bfee024254,af0f9cb9-da47-4f6b-a3c4-218b901842f7', ','))
GROUP BY t.entity_2_id;
I would recommend an index on tmp_table(entity_1_id, entity_2_id, status).
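For example (the index name is illustrative):
create index tmp_table_e1_e2_status_idx
    on tmp_table (entity_1_id, entity_2_id, status);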
However, you might find this faster:
select rst.entity_2_id,
       (select count(*)
        from tmp_table t
        where t.entity_2_id = rst.entity_2_id and
              t.entity_1_id = 'cedca236-3f27-4db3-876c-a6c159f4d15e' and
              t.status <> 2
       ) as cnt
from regexp_split_to_table(str, ',') rst(entity_2_id);
Then you want an index on tmp_table(entity_2_id, entity_1_id, status).
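For example (again, the index name is illustrative):
create index tmp_table_e2_e1_status_idx
    on tmp_table (entity_2_id, entity_1_id, status);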
In most databases, this would be faster, because the index is a covering index and this avoids the final aggregation over the entire result set. However, Postgres stores row-visibility information on the data pages themselves, so heap pages may still need to be read even with a covering index (unless the visibility map shows them as all-visible). It is still worth trying.

Related

SQL query to find a row with a specific number of associations

Using Postgres I have a schema that has conversations and conversationUsers. Each conversation has many conversationUsers. I want to be able to find the conversation that has exactly the specified conversationUsers. In other words, given an array of userIds (say, [1, 4, 6]) I want to find the conversation that contains only those users, and no more.
So far I've tried this:
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;
Unfortunately, this also seems to return conversations which include these 2 users among others. (For example, it returns a result if the conversation also includes "userId" 5).
This is a case of relational-division - with the added special requirement that the same conversation shall have no additional users.
Assuming the PK of table "conversationUsers" is on ("userId", "conversationId"): it enforces uniqueness of the combination, makes the columns NOT NULL, and implicitly provides the essential index for performance, with the columns of the multicolumn PK in this order. Ideally, you have another index on ("conversationId", "userId"). See:
Is a composite index also good for queries on the first field?
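For reference, a minimal sketch of the table this assumes (column types are assumptions):
CREATE TABLE "conversationUsers" (
   "userId"         int NOT NULL,
   "conversationId" int NOT NULL,
   PRIMARY KEY ("userId", "conversationId")
);
CREATE INDEX ON "conversationUsers" ("conversationId", "userId");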
For the basic query, there is the "brute force" approach: count the number of matching users for all conversations of all given users, then filter the ones matching all given users. OK for small tables and/or only short input arrays and/or few conversations per user, but it doesn't scale well:
SELECT "conversationId"
FROM "conversationUsers" c
WHERE "userId" = ANY ('{1,4,6}'::int[])
GROUP BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = c."conversationId"
AND "userId" <> ALL('{1,4,6}'::int[])
);
Eliminating conversations with additional users with a NOT EXISTS anti-semi-join. More:
How do I (or can I) SELECT DISTINCT on multiple columns?
Alternative techniques:
Select rows which are not present in other table
There are various other (much faster) relational-division query techniques, but the fastest ones are not well suited for a dynamic number of user IDs.
How to filter SQL results in a has-many-through relation
For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = ('{1,4,6}'::int[])[1]
   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = ('{1,4,6}'::int[])[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(('{1,4,6}'::int[]), 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );
For ease of use wrap this in a function or prepared statement. Like:
PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = $1[1]
   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = $1[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length($1, 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL($1)
   );
Call:
EXECUTE conversations('{1,4,6}');
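A function wrapper might look like this (a minimal sketch, assuming "conversationId" is of type int):
CREATE OR REPLACE FUNCTION f_conversations(_users int[])
   RETURNS SETOF int
   LANGUAGE sql STABLE AS
$func$
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = _users[1]
   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = _users[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(_users, 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL(_users)
   );
$func$;
Call:
SELECT * FROM f_conversations('{1,4,6}');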
There is still room for improvement: to get top performance, put the users with the fewest conversations first in your input array, to eliminate as many rows as possible early. You can also generate a non-dynamic, non-recursive query dynamically (using one of the fast techniques from the first link) and execute that in turn. You could even wrap it in a single plpgsql function with dynamic SQL ...
More explanation:
Using same column multiple times in WHERE clause
Alternative: a materialized view for a sparsely written table
If the table "conversationUsers" is mostly read-only (old conversations are unlikely to change) you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.
CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users  -- sorted array
FROM  (
   SELECT "conversationId", "userId"
   FROM   "conversationUsers"
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;
CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");
The demonstrated covering index requires Postgres 11. See:
https://dba.stackexchange.com/a/207938/3684
About sorting rows in a subquery:
How to apply ORDER BY and LIMIT in combination with an aggregate function?
In older versions use a plain multicolumn index on (users, "conversationId"). With very long arrays, a hash index might make sense in Postgres 10 or later.
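Sketches of both variants:
-- before Postgres 11 (no INCLUDE): plain multicolumn btree index
CREATE INDEX ON mv_conversation_users (users, "conversationId");
-- Postgres 10 or later: hash index, an option for very long arrays
CREATE INDEX ON mv_conversation_users USING hash (users);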
Then the much faster query would simply be:
SELECT "conversationId"
FROM mv_conversation_users c
WHERE users = '{1,4,6}'::int[]; -- sorted array!
You have to weigh added costs to storage, writes and maintenance against benefits to read performance.
Aside: consider using legal identifiers that don't require double quotes, like conversation_id instead of "conversationId", etc.:
Are PostgreSQL column names case-sensitive?
You can modify your query like this and it should work; the inner query requires that both users are present, and the outer query excludes conversations with additional users:
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."conversationId" IN (
    SELECT c1."conversationId"
    FROM "conversationUsers" c1
    WHERE c1."userId" IN (1, 4)
    GROUP BY c1."conversationId"
    HAVING COUNT(DISTINCT c1."userId") = 2  -- both given users are present
)
GROUP BY c."conversationId"
HAVING COUNT(DISTINCT c."userId") = 2;      -- and there are no additional users
This might be easier to follow. You want the conversation ID, so group by it, then add a HAVING clause requiring that the count of matching user IDs equals the count of all distinct users within the group, and that the group is exactly the size of the user list. This will work, but will take longer to process because there is no pre-qualifier.
select
      cu.ConversationId
   from
      conversationUsers cu
   group by
      cu.ConversationId
   having
          sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and count( distinct cu.UserID ) = 2
To simplify the work even more, use a pre-query of conversations that at least one of the users is in. If that user is not in a conversation to begin with, why bother considering it?
select
      cu.ConversationId
   from
      ( select cu2.ConversationId
           from conversationUsers cu2
           where cu2.userID = 4 ) preQual
      JOIN conversationUsers cu
         ON preQual.ConversationId = cu.ConversationId
   group by
      cu.ConversationId
   having
          sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and count( distinct cu.UserID ) = 2

Is there something I can change to make my view run faster?

I just created a view but it is really slow, since my actual table has around 800k rows.
Is there something I can change in the actual sql code to make it run faster?
Here is how it looks now:
SELECT B.*
FROM  (SELECT A.*,
              (SELECT COUNT(B.KEY_ID) / 77
               FROM   book_new B
               WHERE  B.KEY_ID = A.KEY_ID) AS COUNT_KEY
       FROM  (SELECT *
              FROM   book_new
              WHERE  region = 'US'
              AND    (actual_release_date IS NULL OR
                      actual_release_date >= To_Date('01/07/16', 'dd/mm/yy'))
             ) A
      ) B
WHERE B.COUNT_KEY = 1
OR    (B.COUNT_KEY > 1 AND B.NEW_OLD <> 'Old')
The most obvious thing to do is to add indexes (see the examples below):
Add an index on book_new(key_id)
Add an index on book_new(region, actual_release_date)
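For example (index names are illustrative):
CREATE INDEX book_new_key_id_idx ON book_new (key_id);
CREATE INDEX book_new_region_ard_idx ON book_new (region, actual_release_date);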
These are probably sufficient. It is possible that rewriting the query would help, but this is a good beginning. If you want to rewrite the query, it would help if you described the logic you are trying to implement.
There are many ways to solve this issue, based on your needs:
You can create an indexed view.
You can create an index on the base tables which are used in this view.
You can name the required columns in the SELECT statement instead of using SELECT *.
If the table contains many columns but you require only a few of them, you can create a NONCLUSTERED INDEX with the INCLUDE columns option, which will reduce the LOGICAL READS (see the sketch below).
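A sketch of such an index in SQL Server syntax (the name and included columns are illustrative):
CREATE NONCLUSTERED INDEX ix_book_new_region_ard
    ON book_new (region, actual_release_date)
    INCLUDE (key_id, new_old);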
For starters, replace the scalar subquery for COUNT_KEY with a windowed COUNT(*).
SELECT *
FROM  (SELECT book_new.*,
              COUNT(*) OVER (PARTITION BY book_new.key_id) / 77 AS COUNT_KEY
       FROM   book_new
       WHERE  region = 'US'
       AND    (actual_release_date IS NULL OR
               actual_release_date >= To_Date('01/07/16', 'dd/mm/yy'))
      )
WHERE count_key = 1 OR (count_key > 1 AND new_old <> 'Old')
This way, you only go through the BOOK_NEW table one time.
BTW, I agree with other comments that this query makes little sense.

How to force Index usage on columns in Merge Statement in Oracle

I am working on Oracle 10gR2
I have a MERGE statement for a table, TBL_CUSTOMER. TBL_CUSTOMER contains a column USERNAME, which holds email addresses. The data stored in this table is case-insensitive; the incoming data can be in upper case, lower case, or any combination of cases.
While merging the data, I have to ensure that I compare the data without considering case. I have created a function-based index on the USERNAME column as UPPER(USERNAME).
MERGE INTO tbl_customer t
USING (SELECT /*+ dynamic_sampling(a 2) */ NVL(
(x.username||decode((x.cnt+x.rn-1),0,null,(x.cnt+x.rn-1))),
t1.cust_username
) community_id
,DECODE (source_system_name,'SYS1', t1.cust_firstname,t1.cust_username) display_name
,t1.cust_username
,t1.cust_id cust_id
,t1.cust_account_no cust_account_no
,t1.cust_creation_date
,t1.source_system_name
,t1.seq_no
,nvl(t1.cust_email,'NULLEMAIL') cust_email
,t1.file_name
,t1.diakey
,t1.sourcetupleidcustmer
,DECODE (source_system_name,'SYS1','DefaultPassword',t1.cust_password) cust_password
FROM gtt_customer_data t1,
(SELECT a.username,
cust_id,
row_number() over(partition by lower(a.username) order by seq_no) rn,
(SELECT count(community_id) FROM tbl_customer WHERE regexp_like(lower(community_id), '^'||lower(a.username)||'[0-9]*$')) cnt
FROM gtt_cust_count_name a
) x
WHERE t1.cust_status = 'A'
AND x.cust_id(+) = t1.cust_id
AND nvl(t1.op_code,'X') <> 'D'
AND t1.cust_id is not null
AND cust_email is not null
) a
ON (
(a.sourcetupleidcustmer = t.source_tuple_id AND a.source_system_name =t.created_by)
OR
( upper(a.cust_email) = upper(t.username) AND a.source_system_name ='SYS2' )
)
When I check the explain plan, the function-based index on USERNAME is not being used. I noticed that if I remove the OR condition the index is used, but I cannot remove it, due to complex business logic.
How can I force that index to be used?
Oracle lets you create a function-based index, in your case on upper(username). You can also try an INDEX hint in the query, but I think in your case a function-based index is a much better solution.
B-tree indexes are not normally used if the index field is an argument of a function (assuming the function is in the WHERE clause and it's not a covering index). For instance, WHERE trunc(date_field) = trunc(sysdate) will not use an index on date_field, but will use an index on (trunc(date_field)).
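A sketch of both options (the index name and bind variable are illustrative):
-- function-based index, so predicates on UPPER(username) can use it:
CREATE INDEX idx_customer_upper_username ON tbl_customer (UPPER(username));
-- an INDEX hint, if the optimizer still refuses to use it:
SELECT /*+ INDEX(t idx_customer_upper_username) */ *
FROM tbl_customer t
WHERE UPPER(t.username) = UPPER(:email);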

SQL Server - query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult AS (
    SELECT ROW_NUMBER() OVER (ORDER BY @sortColumn DESC) AS RowNum, profile.id
    FROM Profile
    WHERE
        (@a IS NULL OR a = @a) AND
        (@b IS NULL OR b = @b) AND
        ... (over 60 columns)
)
SELECT profile.* FROM TempResult JOIN profile ON TempResult.id = profile.id
WHERE
    (RowNum >= @FirstRow)
    AND
    (RowNum <= @LastRow)
SQL Server by default uses the clustered index to execute the query, but total execution time is over 300. We tested other solutions, such as a multi-column index on all of the columns in the WHERE clause, but total execution time was over 400.
Do you have any solution to bring total execution time below 100?
We are using SQL Server 2008.
Unfortunately, I don't think there is a pure SQL solution to your issue. Here are a couple of alternatives:
Dynamic SQL - build up a query that only includes WHERE clause conditions for values that are actually provided. Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized (see the sketch below).
Full Text Search - go to something more like a Google keyword search, with no individual options.
Lucene (or something else) - search outside of SQL; this is a fairly significant change, though.
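For instance, if a search only filled in city and state, the generated statement might look like this (column names are illustrative), and an index on (city, state) could serve it:
SELECT ROW_NUMBER() OVER (ORDER BY lname DESC) AS RowNum, profile.id
FROM Profile
WHERE city = @city
  AND state = @state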
One other option that I just remembered implementing in a system once: create a vertical table that includes all of the data you are searching on and build up a query against it. This is easiest to do with dynamic SQL, but could be done using table-valued parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID
Attribute Name
Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly; the index will make it perform well).
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
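A sketch of that table's definition (names and types are illustrative):
CREATE TABLE ProfileAttributes (
    ProfileID      int          NOT NULL,
    AttributeName  varchar(50)  NOT NULL,
    AttributeValue varchar(100) NOT NULL
);
-- unique to make the search work properly; the index makes it fast
CREATE UNIQUE INDEX ux_ProfileAttributes
    ON ProfileAttributes (ProfileID, AttributeName);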
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
    SELECT ProfileID
    FROM ProfileAttributes
    WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
       OR (AttributeName = 'state' AND AttributeValue = 'MI')  -- OR, not AND: each row holds a single attribute
    GROUP BY ProfileID
    HAVING COUNT(*) = 2  -- both attributes matched
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
    SELECT ProfileID
    FROM ProfileAttributes
    JOIN PassedInAttributeTable
        ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
        AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
    GROUP BY ProfileID
    HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
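The passed-in attribute table could be a temp table shaped like this (a sketch; names and types are illustrative):
CREATE TABLE #PassedInAttributeTable (
    AttributeName  varchar(50)  NOT NULL,
    AttributeValue varchar(100) NOT NULL
);
INSERT INTO #PassedInAttributeTable
VALUES ('city', 'grand rapids'), ('state', 'MI');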
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
    SELECT
        [a].*
    FROM
        (SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
    INNER JOIN
        (SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
            ON ([a].id = [b].id) OR ([b].id IS NULL)
    INNER JOIN
        (SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
            ON ([a].id = [c].id) OR ([c].id IS NULL)
    .
    .
    .
    INNER JOIN
        (SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
            ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY @sortColumn DESC) AS RowNum,
        [filter].*
    FROM
        [filter]
)
SELECT
    *
FROM
    TempResult
WHERE
    (RowNum >= @FirstRow)
    AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
You've several issues, IMHO. One is that you're going to end up with a full scan no matter what you do.
But I think the more crucial issue here is that you have an unnecessary join:
SELECT * FROM TempResult
WHERE
    (RowNum >= @FirstRow)
    AND
    (RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approach of "(@b IS NULL OR b = @b)" and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of perf tuning and query optimisation. The approach I've found best is to generate dynamic SQL inside a stored proc; most times you also need to add WITH RECOMPILE on the statement. The stored proc helps reduce the potential for SQL injection attacks, and the recompile is needed to force the selection of indexes appropriate to the parameters you are searching on.
Generally it is at least an order of magnitude faster.
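A minimal sketch of that approach (procedure, column, and parameter names are illustrative):
CREATE PROCEDURE SearchProfiles
    @city varchar(50) = NULL,
    @state varchar(50) = NULL
WITH RECOMPILE
AS
BEGIN
    -- only append predicates for parameters the caller actually supplied
    DECLARE @sql nvarchar(max) = N'SELECT id FROM Profile WHERE 1 = 1';
    IF @city IS NOT NULL SET @sql += N' AND city = @city';
    IF @state IS NOT NULL SET @sql += N' AND state = @state';
    -- parameterized execution via sp_executesql reduces the injection surface
    EXEC sp_executesql @sql,
        N'@city varchar(50), @state varchar(50)',
        @city = @city, @state = @state;
END;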
I agree you should also look at the points mentioned above, like:
If you commonly refer to only a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) will work best if they are the lead column in the index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of "enumerated datatype". But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them to use an equality or range query.
While it is often good to find row IDs with a small query and then join to get the other columns you want to retrieve (as you are doing above), this approach can sometimes backfire: if the first part of the query does a clustered index scan, it is often faster to get the other columns you need in the select list and save the second table scan. So it is always good to try it both ways and see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the IO is, and it may help you with index selection for the most frequent combinations of parameters.
I hope this makes sense without long code samples. (They are on my other machine.)

Best way to write a named SQL query that returns if row exists?

So I have this SQL query,
<named-query name="NQ::job_exists">
<query>
select 0 from dual where exists (select * from job_queue);
</query>
</named-query>
Which I plan to use like this:
Query q = em.createNamedQuery("NQ::job_exists");
List<Integer> results = q.getResultList();
boolean exists = !results.isEmpty();
return exists;
I am not very strong in SQL/JPA, however, and was wondering whether there is a better way of doing it (or ways to improve it). Should I, for example, write (select jq.id from job_queue jq) instead of using a star?
EDIT: This call is very performance-critical in our app.
EDIT: I did some performance testing, and while the differences were almost negligible, I finally decided to go with:
select distinct null
from dual
where exists (
select null from job_queue
);
If you are using EXISTS with Oracle, I recommend selecting null:
select null
from dual where exists (select null from job_queue);
The following will be the one with lower cost on Oracle:
select null
from job_queue
where rownum = 1;
Update: To cover the case when there are no rows in the table, you can run the following query:
select count(*)
from (select null
from job_queue
where rownum = 1);
With this query you have an optimal plan and only two possible results: 1 if there are rows and 0 if there are none.
If you do an "exists" then it will stop looking as soon as it finds a match. This can stop it from doing a full table scan. Same with TOP 1 if you don't have an ORDER BY. If you do a TOP 1 ID and ID is in an index it might use the index and not even go to the table at all. Stopping the full table scan is where the biggest saving in performance is.
Another small savings is that if you do "SELECT 1" or "SELECT COUNT(1)" instead of "SELECT * " or "SELECT COUNT(*)" it saves the work of getting the table structure.
So I would go with:
SELECT TOP 1 1 AS Found
FROM job_queue
Then:
return !results.isEmpty();
This is the least amount of work that I can think of.
For Oracle it would be:
SELECT 1
FROM job_queue
WHERE rownum < 2;
Or, using SQL Server syntax:
SET ROWCOUNT 1
SELECT 1
FROM job_queue
Why not just do:
select count(*) as JobCount from job_queue
If JobCount = 0, then there's your answer!