SQL query to find a row with a specific number of associations

Using Postgres I have a schema that has conversations and conversationUsers. Each conversation has many conversationUsers. I want to be able to find the conversation that has exactly the specified set of users. In other words, provided an array of userIds (say, [1, 4, 6]) I want to be able to find the conversation that contains only those users, and no more.
So far I've tried this:
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;
Unfortunately, this also seems to return conversations which include these 2 users among others. (For example, it returns a result if the conversation also includes "userId" 5).

This is a case of relational-division - with the added special requirement that the same conversation shall have no additional users.
Assuming the PK of table "conversationUsers" is on ("userId", "conversationId"): it enforces uniqueness of combinations and NOT NULL, and also implicitly provides the essential index for performance, with the columns of the multicolumn PK in this order. Ideally, you have another index on ("conversationId", "userId"). See:
Is a composite index also good for queries on the first field?
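In DDL terms, that could look like the following sketch (assuming integer ID columns; adjust to your actual types):
ALTER TABLE "conversationUsers"
   ADD PRIMARY KEY ("userId", "conversationId");

-- second index with inverted column order, for lookups by conversation:
CREATE INDEX ON "conversationUsers" ("conversationId", "userId");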
For the basic query, there is the "brute force" approach: count the number of matching users for all conversations of all given users, then filter the ones matching all given users. It's OK for small tables and/or only short input arrays and/or few conversations per user, but it doesn't scale well:
SELECT "conversationId"
FROM "conversationUsers" c
WHERE "userId" = ANY ('{1,4,6}'::int[])
GROUP BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = c."conversationId"
AND "userId" <> ALL('{1,4,6}'::int[])
);
Eliminating conversations with additional users with a NOT EXISTS anti-semi-join. More:
How do I (or can I) SELECT DISTINCT on multiple columns?
Alternative techniques:
Select rows which are not present in other table
There are various other, (much) faster relational-division query techniques. But the fastest ones are not well suited for a dynamic number of user IDs.
How to filter SQL results in a has-many-through relation
For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = ('{1,4,6}'::int[])[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = ('{1,4,6}'::int[])[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(('{1,4,6}'::int[]), 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL ('{1,4,6}'::int[])
   );
For ease of use wrap this in a function or prepared statement. Like:
PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = $1[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = $1[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length($1, 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL ($1)
   );
Call:
EXECUTE conversations('{1,4,6}');
db<>fiddle here (also demonstrating a function)
There is still room for improvement: for top performance, put users with the fewest conversations first in your input array, to eliminate as many rows as possible early. You can go further and generate a non-dynamic, non-recursive query dynamically (using one of the fast techniques from the first link) and execute that in turn. You could even wrap it in a single plpgsql function with dynamic SQL ...
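For example, the input array could be pre-sorted with a query like this (a sketch against the same table; users with the fewest conversations come first):
SELECT array_agg("userId" ORDER BY ct) AS sorted_ids
FROM  (
   SELECT "userId", count(*) AS ct
   FROM   "conversationUsers"
   WHERE  "userId" = ANY ('{1,4,6}'::int[])
   GROUP  BY 1
   ) sub;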
More explanation:
Using same column multiple times in WHERE clause
Alternative: MV for sparsely written table
If the table "conversationUsers" is mostly read-only (old conversations are unlikely to change) you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.
CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users  -- sorted array
FROM  (
   SELECT "conversationId", "userId"
   FROM   "conversationUsers"
   ORDER  BY 1, 2
   ) sub
GROUP BY 1
ORDER BY 1;

CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");
The demonstrated covering index requires Postgres 11. See:
https://dba.stackexchange.com/a/207938/3684
About sorting rows in a subquery:
How to apply ORDER BY and LIMIT in combination with an aggregate function?
In older versions use a plain multicolumn index on (users, "conversationId"). With very long arrays, a hash index might make sense in Postgres 10 or later.
Then the much faster query would simply be:
SELECT "conversationId"
FROM mv_conversation_users c
WHERE users = '{1,4,6}'::int[]; -- sorted array!
db<>fiddle here
You have to weigh added costs to storage, writes and maintenance against benefits to read performance.
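Also keep in mind that a MATERIALIZED VIEW is a snapshot and must be refreshed to pick up later writes. A sketch (CONCURRENTLY allows concurrent reads during the refresh, but requires a unique index, e.g. on "conversationId"):
CREATE UNIQUE INDEX ON mv_conversation_users ("conversationId");
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_conversation_users;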
Aside: consider legal, lower-case identifiers that don't need double quotes: conversation_id instead of "conversationId", etc.:
Are PostgreSQL column names case-sensitive?

You can modify your query like this and it should work:
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."conversationId" IN (
    SELECT c1."conversationId"
    FROM "conversationUsers" c1
    WHERE c1."userId" IN (1, 4)
    GROUP BY c1."conversationId"
    HAVING COUNT(DISTINCT c1."userId") = 2  -- both requested users are present
)
GROUP BY c."conversationId"
HAVING COUNT(DISTINCT c."userId") = 2;      -- and there are no additional users

This might be easier to follow. You want the conversation ID, so group by it, and add a HAVING clause requiring that the count of matching user IDs equals the count of all distinct users within the group. This will work, but takes longer to process because there is no pre-qualifier.
select
      cu.ConversationId
   from
      conversationUsers cu
   group by
      cu.ConversationID
   having
          sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and count( distinct cu.UserID ) = 2  -- both requested users must be present
To simplify the list even more, start from a pre-query of conversations that at least one of the users is in: if that user is not in a conversation to begin with, there is no point considering it.
select
      cu.ConversationId
   from
      ( select cu2.ConversationID
           from conversationUsers cu2
           where cu2.userID = 4 ) preQual
      JOIN conversationUsers cu
         ON preQual.ConversationId = cu.ConversationId
   group by
      cu.ConversationID
   having
          sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and count( distinct cu.UserID ) = 2

Related

Find if a string is in or not in a database

I have a list of IDs
'ACE', 'ACD', 'IDs', 'IN','CD'
I also have a table with a structure similar to the following:
ID value
ACE 2
CED 3
ACD 4
IN 4
IN 4
I want a SQL query that returns a list of IDs that exist in the database and a list of IDs that do not exist in the database.
The return should be:
1. ACE, ACD, IN (exist)
2. IDs, CD (not exist)
My code is like this:
select
    ID,
    value
from db
where ID in ( 'ACE', 'ACD', 'IDs', 'IN', 'CD')
However, the result is 1) super slow with all kinds of IDs and 2) returns multiple rows with the same ID. Is there any way, using PostgreSQL, to 1) return unique IDs and 2) make it run faster?
Assuming no duplicates in the table or the input, this query should do it:
SELECT t.id IS NOT NULL AS id_exists
     , array_agg(ids.id)
FROM   unnest(ARRAY['ACE','ACD','IDs','IN','CD']) ids(id)
LEFT   JOIN tbl t USING (id)
GROUP  BY 1;
Else, please define how to deal with duplicates on either side.
If the LEFT JOIN finds a matching row, the expression t.id IS NOT NULL is true. Else it's false. GROUP BY 1 groups by this expression (1st in the SELECT list), array_agg() forms arrays for each of the two groups.
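With the sample data from the question (the duplicate IN row removed, per the stated assumption), the result would look something like this; group order may vary:
id_exists | array_agg
----------+---------------
f         | {IDs,CD}
t         | {ACE,ACD,IN}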
Related:
Select rows which are not present in other table
Hmmm . . . Is this sufficient:
select ids.id,
       exists (select 1 from tbl t where t.id = ids.id) as id_exists
from unnest(array['ACE', 'ACD', 'IDs', 'IN', 'CD']) ids(id);

How to increase performance of COUNT SQL query in PostgreSQL?

I have a table with multiple columns, but for simplicity's sake we can consider the following table:
create table tmp_table
(
    entity_1_id varchar(255) not null,
    status integer default 1 not null,
    entity_2_id varchar(255)
);
create index tmp_table_entity_1_id_idx
    on tmp_table (entity_1_id);
create index tmp_table_entity_2_id_idx
    on tmp_table (entity_2_id);
I want to execute this query:
SELECT tmp_table.entity_2_id, COUNT(*)
FROM tmp_table
WHERE tmp_table.entity_1_id = 'cedca236-3f27-4db3-876c-a6c159f4d15e' AND
      tmp_table.status <> 2 AND
      tmp_table.entity_2_id = ANY (string_to_array('21c5598b-0620-4a8c-b6fd-a4bfee024254,af0f9cb9-da47-4f6b-a3c4-218b901842f7', ','))
GROUP BY tmp_table.entity_2_id;
It works fine when I send a string with a few values (like 1-20) to the string_to_array function. But when I try to send 500 elements, it is too slow. Unfortunately, I really need 100-500 elements.
For this query:
SELECT t.entity_2_id, COUNT(*)
FROM tmp_table t
WHERE t.entity_1_id = 'cedca236-3f27-4db3-876c-a6c159f4d15e' AND
t.status <> 2 AND
t.entity_2_id = ANY (string_to_array('21c5598b-0620-4a8c-b6fd-a4bfee024254,af0f9cb9-da47-4f6b-a3c4-218b901842f7', ','))
GROUP BY t.entity_2_id;
I would recommend an index on tmp_table(entity_1_id, entity_2_id, status).
However, you might find this faster:
select rst.entity_2_id,
       (select count(*)
        from tmp_table t
        where t.entity_2_id = rst.entity_2_id and
              t.entity_1_id = 'cedca236-3f27-4db3-876c-a6c159f4d15e' and
              t.status <> 2
       ) as cnt
from regexp_split_to_table(str, ',') rst(entity_2_id);  -- str is your comma-separated input string
Then you want an index on tmp_table(entity_2_id, entity_1_id, status).
In most databases, this would be faster, because the index is a covering index and this avoids the final aggregation over the entire result set. However, Postgres stores locking information on the data pages, so they still need to be read. It is still worth trying.
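For reference, the two recommended indexes might be declared like this (the index names are made up):
CREATE INDEX tmp_table_e1_e2_status_idx ON tmp_table (entity_1_id, entity_2_id, status);
CREATE INDEX tmp_table_e2_e1_status_idx ON tmp_table (entity_2_id, entity_1_id, status);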

Array field in postgres, need to do self-join with results

I have a table that looks like this:
stuff
id integer
content text
score double
children[] (an array of id's from this same table)
I'd like to run a query that selects all the children for a given id, and then right away gets the full row for all these children, sorted by score.
Any suggestions on the best way to do this? I've looked into WITH RECURSIVE but I'm not sure that's workable. Tried posting at postgresql SE with no luck.
The following query will find all rows corresponding to the children of the object with id 14:
SELECT *
FROM unnest((SELECT children FROM stuff WHERE id=14)) t(id)
JOIN stuff USING (id)
ORDER BY score;
This works by first finding the children of 14 as an array, then converting it into a table using the unnest function, and finally joining with stuff to find all rows with the given ids.
The ANY construct in the join condition would be simplest:
SELECT c.*
FROM   stuff p
JOIN   stuff c ON c.id = ANY (p.children)
WHERE  p.id = 14
ORDER  BY c.score;
It doesn't matter for the query whether the array of children IDs is in the same table or a different one. You just need table aliases here to be unambiguous.
Related:
Check if value exists in Postgres array
Similar solution:
With Postgres you can use a recursive common table expression:
with recursive rel_tree as (
   select rel_id, rel_name, rel_parent, 1 as level, array[rel_id] as path_info
   from relations
   where rel_parent is null
   union all
   select c.rel_id, rpad(' ', p.level * 2) || c.rel_name, c.rel_parent, p.level + 1, p.path_info || c.rel_id
   from relations c
   join rel_tree p on c.rel_parent = p.rel_id
)
select rel_id, rel_name
from rel_tree
order by path_info;
Ref: Postgresql query for getting n-level parent-child relation stored in a single table

SQL Server - query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult as (
    select ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum, profile.id
    from Profile
    where
        (@a is null or a = @a) and
        (@b is null or b = @b) and
        ... (over 60 columns)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
    (RowNum >= @FirstRow)
    AND
    (RowNum <= @LastRow)
SQL Server by default uses the clustered index to execute the query, but total execution time is over 300. We tested another solution, such as a multi-column index over all the columns in the WHERE clause, but total execution time was over 400.
Do you have any solution to make total execution time lower than 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple of alternatives:
Dynamic SQL - build up a query that only includes WHERE clause statements for values that are actually provided (see the sketch after this list). Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized.
Full Text Search - go to something more like a Google keyword search. No individual options.
Lucene (or something else) - Search outside of SQL; This is a fairly significant change though.
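A minimal sketch of the dynamic-SQL idea, with only two of the 60 parameters shown and all names illustrative:
DECLARE @sql nvarchar(max) = N'SELECT id FROM Profile WHERE 1 = 1';

-- Append a predicate only when the caller actually supplied a value,
-- so the optimizer sees a short, indexable WHERE clause:
IF @a IS NOT NULL SET @sql += N' AND a = @a';
IF @b IS NOT NULL SET @sql += N' AND b = @b';

EXEC sp_executesql @sql, N'@a int, @b int', @a = @a, @b = @b;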
One other option, which I just remembered implementing in a system once: create a vertical table that includes all of the data you are searching on and build up a query for it. This is easiest to do with dynamic SQL, but could be done using Table Value Parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID
Attribute Name
Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly, index will make it perform well).
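In DDL terms, the vertical table might look like this (all types here are assumptions):
CREATE TABLE ProfileAttributes (
    ProfileID      int          NOT NULL,
    AttributeName  varchar(50)  NOT NULL,  -- e.g. 'city', 'state'
    AttributeValue varchar(255) NOT NULL
);

-- unique to make the COUNT(*)-based search correct; also serves as the search index:
CREATE UNIQUE INDEX ux_ProfileAttributes
    ON ProfileAttributes (ProfileID, AttributeName);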
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
    SELECT ProfileID
    FROM ProfileAttributes
    WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
       OR (AttributeName = 'state' AND AttributeValue = 'MI')  -- OR here; the HAVING enforces that both match
    GROUP BY ProfileID
    HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
    SELECT ProfileID
    FROM ProfileAttributes
    JOIN PassedInAttributeTable
        ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
        AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
    GROUP BY ProfileID
    HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
    SELECT
        [a].*
    FROM
        (SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
    INNER JOIN
        (SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
            ON ([a].id = [b].id) OR ([b].id IS NULL)
    INNER JOIN
        (SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
            ON ([a].id = [c].id) OR ([c].id IS NULL)
    ...
    INNER JOIN
        (SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
            ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
    SELECT
        ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum,
        [filter].*
    FROM
        [filter]
)
SELECT
    *
FROM
    TempResult
WHERE
    (RowNum >= @FirstRow)
    AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
You've got several issues, imho. One is that you're going to end up with a seq scan no matter what you do.
But I think your more crucial issue here is that you've got an unnecessary join:
SELECT * FROM TempResult  -- assumes TempResult selects all the Profile columns you need, not just the id
WHERE
    (RowNum >= @FirstRow)
    AND
    (RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approaches of "(@b is null or b = @b)" and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of Perf/Tuning and Query Optimisation. The approach I've found best is to generate Dynamic SQL inside a Stored Proc. Most times you also need to add "with Recompile" on the statement. The Stored Proc helps reduce the potential for SQL injection attacks. The Recompile is needed to force the selection of indexes appropriate to the parameters you are searching on.
Generally it is at least an order of magnitude faster.
I agree you should also look at the points mentioned above, like:
If you commonly only refer to a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) will work best if they are the lead column in the index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own BITMASKED BIGINT to represent many columns, i.e. a form of "enumerated datatype" (see the sketch below). But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them to use an equality or range query.
While it is often good to find RowIDs with a small query and then join to get all the other columns you want to retrieve (as you are doing above), this approach can sometimes backfire: if the first part of the query does a Clustered Index Scan, it is often faster to get the other columns you need in the select list and save the second table scan.
So it is always good to try it both ways and see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the IO is, and it may help you with index selection for the most frequent combinations of parameters.
I hope this makes sense without long code samples. (it is on my other machine)
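A minimal sketch of the bitmask idea at least (the column name and bit assignments here are made up):
-- attribute_mask: hypothetical BIGINT column packing many low-cardinality flags.
-- Assumed encoding: bit 1 = male, bit 2 = active, bit 4 = has_degree.
SELECT id
FROM Profile
WHERE attribute_mask = 5;  -- 1 + 4: exact combination as a plain equality, so an index on attribute_mask is usable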

MySQL doesn't support the limit clause inside a subselect, how can I do that?

I have the following table on a MySQL 5.1.30:
CREATE TABLE article (
    article_id int(10) unsigned NOT NULL AUTO_INCREMENT,
    category_id int(10) unsigned NOT NULL,
    title varchar(100) NOT NULL,
    PRIMARY KEY (article_id)
);
With this information:
1, 1, 'foo'
2, 1, 'bar'
3, 1, 'baz'
4, 1, 'quox'
5, 2, 'quonom'
6, 2, 'qox'
I need to obtain the first three articles in each category for all categories that have articles. Something like this:
1, 1, 'foo'
2, 1, 'bar'
3, 1, 'baz'
5, 2, 'quonom'
6, 2, 'qox'
Of course a union would work:
select * from articles where category_id = 1 limit 3
union
select * from articles where category_id = 2 limit 3
But there are an unknown number of categories in the database. Also, the order should be specified by is_sticky and published_date columns that I left out of the examples to simplify.
Is it possible to build a query that retrieves this information?
UPDATE: I've tried the following, which seemed like it would work, except that MySQL doesn't support the limit clause inside a subselect. Do you know of a way to simulate the limit there?
select *
from articles a
where a.article_id in (select f.article_id
                       from articles f
                       where f.category_id = a.category_id
                       order by f.is_sticky, f.published_at
                       limit 3)
Thanks
SELECT ... LIMIT isn't supported in subqueries, I'm afraid, so it's time to break out the self-join magic:
SELECT article.*
FROM article
JOIN (
SELECT a0.category_id AS id, MIN(a2.article_id) AS lim
FROM article AS a0
LEFT JOIN article AS a1 ON a1.category_id=a0.category_id AND a1.article_id>a0.article_id
LEFT JOIN article AS a2 ON a2.category_id=a1.category_id AND a2.article_id>a1.article_id
GROUP BY id
) AS cat ON cat.id=article.category_id
WHERE article.article_id<=cat.lim OR cat.lim IS NULL
ORDER BY article_id;
The bit in the middle is working out the ID of the third-lowest-ID article for each category by trying to join three copies of the same table in ascending ID order. If there are fewer than three articles for a category, the left joins will ensure the limit is NULL, so the outer WHERE needs to pick up that case as well.
If your “top 3” requirement might change to “top n” at some point, this begins to get unwieldy. In that case you might want to reconsider the idea of querying the list of distinct categories first then unioning the per-category queries.
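That per-category approach could look like this (a sketch; the individual statements would be generated in application code from the list of distinct categories):
-- MySQL allows ORDER BY / LIMIT per SELECT when each one is parenthesized:
(SELECT * FROM article WHERE category_id = 1 ORDER BY article_id LIMIT 3)
UNION ALL
(SELECT * FROM article WHERE category_id = 2 ORDER BY article_id LIMIT 3);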
ETA: Ordering on two columns: eek, new requirements! :-)
It depends what you mean: if you're only trying to order the final results you can bang it on the end no problem. But if you need to use this ordering to select which three articles are to be picked things are a lot harder.
We are using a self-join with ‘<’ to reproduce the effect ‘ORDER BY article_id’ would have. Unfortunately, whilst you can do ‘ORDER BY a, b’, you can't do ‘(a, b)<(c, d)’... neither can you do ‘MIN(a, b)’. Plus, you'd actually be ordering by three columns, is_sticky, published_at and article_id, because you need to ensure that each ordering value is unique, to avoid getting four or more rows returned.
Whilst you could make up your own orderable value by some crude integer or string combination of columns:
LEFT JOIN article AS a1
ON a1.category_id=a0.category_id
AND HEX(a1.issticky)+HEX(a1.published_at)+HEX(a1.article_id)>HEX(a0.issticky)+HEX(a0.published_at)+HEX(a0.article_id)
this is getting unfeasibly ugly, and the calculations will scupper any chance of using the indices to make the query efficient. At which point you are better off simply doing the separate per-category LIMITed queries.
You probably should add another table containing the category_id and a description of the categories. Then you can query that table for a list of category IDs, and use a subquery or additional queries to get the articles with proper sorting and limiting. I don't have time to write this out fully now, but someone else probably will (or I'll do it in the unlikely event that no one else has responded by the time I get back).
Here's something I'm not proud of (in MS SQL - not sure if it'll work in MySQL)
select a2.article_id, a2.category_id, a2.title
from
    (select distinct category_id
     from article) as a1
    inner join article a2 on a2.category_id = a1.category_id
where a2.article_id <= (
    select top 1 a4.article_id
    from (
        select top 3 a3.article_id
        from article a3
        where a3.category_id = a1.category_id
        order by a3.article_id asc
    ) a4
    order by a4.article_id desc)
It'll depend on MySQL supporting subqueries in this manner. Basically it works out the third-largest article_id for each category and joins all articles less than or equal to that per category.
SELECT TOP n * should work the same as SELECT * LIMIT n, I hope...