SQL Performance: Using Union and Subqueries

Hi Stack Overflow (my first question!),
We're building something like an SNS and have a question about optimizing queries.
We use MySQL 5.1; the current table was created with:
CREATE TABLE friends(
user_id BIGINT NOT NULL,
friend_id BIGINT NOT NULL,
PRIMARY KEY (user_id, friend_id)
) ENGINE INNODB;
Sample data is populated like:
INSERT INTO friends VALUES
(1,2),
(1,3),
(1,4),
(1,5),
(2,1),
(2,3),
(2,4),
(3,1),
(3,2),
(4,1),
(4,2),
(5,1),
(5,6),
(6,5),
(7,8),
(8,7);
The business logic: we need to figure out which users are friends or friends of friends for a given user.
The current query for this for a user with user_id=1 is:
SELECT friend_id FROM friends WHERE user_id = 1
UNION
SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
SELECT friend_id FROM friends WHERE user_id = 1
);
The expected result is (order doesn't matter):
2
3
4
5
1
6
As you can see, the above query performs the subquery "SELECT friend_id FROM friends WHERE user_id = 1" twice.
So, here is the question. If performance is your primary concern, how would you change the above query or schema?
Thanks in advance.

In this particular case, you can use a JOIN:
SELECT DISTINCT f2.friend_id
FROM friends AS f1
JOIN friends AS f2 ON f1.friend_id=f2.user_id OR f2.user_id=1
WHERE f1.user_id=1;
Examining each query with EXPLAIN suggests the JOIN will be about as performant as the UNION in a big-O sense, though perhaps faster by a constant factor. Jasie's query (the IN ... OR user_id = 1 variant, third EXPLAIN below) looks like it might be big-O faster.
EXPLAIN SELECT friend_id FROM friends WHERE user_id = 1
UNION
SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
SELECT friend_id FROM friends WHERE user_id = 1
);
+----+--------------------+------------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
| 1 | PRIMARY | friends | ref | PRIMARY | PRIMARY | 8 | const | 4 | Using index |
| 2 | UNION | friends | index | NULL | PRIMARY | 16 | NULL | 16 | Using where; Using index; Using temporary |
| 3 | DEPENDENT SUBQUERY | friends | eq_ref | PRIMARY | PRIMARY | 16 | const,func | 1 | Using index |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------------+------------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
EXPLAIN SELECT DISTINCT f2.friend_id
FROM friends AS f1
JOIN friends AS f2
ON f1.friend_id=f2.user_id OR f2.user_id=1
WHERE f1.user_id=1;
+----+-------------+-------+-------+---------------+---------+---------+-------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+---------------------------------------------+
| 1 | SIMPLE | f1 | ref | PRIMARY | PRIMARY | 8 | const | 4 | Using index; Using temporary |
| 1 | SIMPLE | f2 | index | PRIMARY | PRIMARY | 16 | NULL | 16 | Using where; Using index; Using join buffer |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+---------------------------------------------+
EXPLAIN SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
SELECT friend_id FROM friends WHERE user_id = 1
) OR user_id = 1;
+----+--------------------+---------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
| 1 | PRIMARY | friends | index | PRIMARY | PRIMARY | 16 | NULL | 16 | Using where; Using index; Using temporary |
| 2 | DEPENDENT SUBQUERY | friends | eq_ref | PRIMARY | PRIMARY | 16 | const,func | 1 | Using index |
+----+--------------------+---------+--------+---------------+---------+---------+------------+------+-------------------------------------------+

No need for the UNION. Just include an OR with the user_id of the beginning user:
SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
SELECT friend_id FROM friends WHERE user_id = 1
) OR user_id = 1;
+-----------+
| friend_id |
+-----------+
| 2 |
| 3 |
| 4 |
| 5 |
| 1 |
| 6 |
+-----------+
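As a quick sanity check, the three variants discussed in this thread (UNION, JOIN, and IN ... OR) can be run side by side against the sample data. This is only a sketch: it uses Python's built-in sqlite3 with an in-memory SQLite database as a stand-in for MySQL 5.1 (an assumption), and the helper name friends_of_friends is made up for the illustration. It checks that the results agree; it says nothing about relative speed on MySQL.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the queries are plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE friends ("
    " user_id INTEGER NOT NULL,"
    " friend_id INTEGER NOT NULL,"
    " PRIMARY KEY (user_id, friend_id))"
)
pairs = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 3), (2, 4), (3, 1),
         (3, 2), (4, 1), (4, 2), (5, 1), (5, 6), (6, 5), (7, 8), (8, 7)]
conn.executemany("INSERT INTO friends VALUES (?, ?)", pairs)

QUERIES = {
    # Original query from the question.
    "union": """
        SELECT friend_id FROM friends WHERE user_id = :uid
        UNION
        SELECT friend_id FROM friends WHERE user_id IN (
            SELECT friend_id FROM friends WHERE user_id = :uid)
    """,
    # Self-join variant.
    "join": """
        SELECT DISTINCT f2.friend_id
        FROM friends AS f1
        JOIN friends AS f2 ON f1.friend_id = f2.user_id OR f2.user_id = :uid
        WHERE f1.user_id = :uid
    """,
    # Single SELECT with OR instead of UNION.
    "or": """
        SELECT DISTINCT friend_id FROM friends
        WHERE user_id IN (SELECT friend_id FROM friends WHERE user_id = :uid)
           OR user_id = :uid
    """,
}


def friends_of_friends(variant, uid):
    """Run one query variant and return the friend ids as a set."""
    rows = conn.execute(QUERIES[variant], {"uid": uid}).fetchall()
    return {row[0] for row in rows}


results = {name: friends_of_friends(name, 1) for name in QUERIES}
print(results)
```

Note that the JOIN variant returns nothing for a user with no friends rows at all, which matches the other two variants for that case.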

Related

how to perform sql actions/query for duplicate rows

I have 2 tables:
1-brokers(this is a company that could have multiple broker individuals)
and
2-brokerIndividuals (A person/individuals table that has a foreign key of broker company it belongs to and the individuals details)
I'm trying to create a unique index on the brokers table so that companyName is unique among the rows where isDeleted is NULL. The table is already populated, so I want to write an SQL query that finds duplicate rows; whenever several rows share the same companyName and have isDeleted IS NULL, I would like to perform 2 actions/queries:
1- keep the first row as it is and set the isDeleted column of the other duplicates (the rows following the first one) to true.
2- re-point the foreign key in brokerIndividuals from the duplicate rows to the first row.
In words: soft delete the duplicate rows and associate their corresponding brokerIndividuals with the first occurrence of the duplicates. The table must end up with exactly 1 occurrence of each companyName where isDeleted is NULL.
I am using the knex.js ORM, so if that helps you can also suggest a solution using knex functions, but knex doesn't support partial indexes yet (Knex.js - How to create unique index with 'where' clause?), so I have to use the raw SQL method. The DB I'm using is mssql (version: 6.0.1).
Here's a full test case (commented), with link to the fiddle:
Working test case, tested with MySQL 5.5, 5.6, 5.7, 8.0 and MariaDB up to 10.6
Create the tables and insert initial data with duplicate company_name entries:
CREATE TABLE brokers (
id int primary key auto_increment
, company_name VARCHAR(30)
, isDeleted boolean default null
);
CREATE TABLE brokerIndividuals (
id int primary key auto_increment
, broker_id int references brokers (id)
);
INSERT INTO brokers (company_name) VALUES
('name1')
, ('name1')
, ('name1')
, ('name1')
, ('name123')
, ('name123')
, ('name123')
, ('name123')
;
INSERT INTO brokerIndividuals (broker_id) VALUES
(2)
, (7)
;
SELECT * FROM brokers;
+----+--------------+-----------+
| id | company_name | isDeleted |
+----+--------------+-----------+
| 1 | name1 | null |
| 2 | name1 | null |
| 3 | name1 | null |
| 4 | name1 | null |
| 5 | name123 | null |
| 6 | name123 | null |
| 7 | name123 | null |
| 8 | name123 | null |
+----+--------------+-----------+
SELECT * FROM brokerIndividuals;
+----+-----------+
| id | broker_id |
+----+-----------+
| 1 | 2 |
| 2 | 7 |
+----+-----------+
Adjust brokers to determine isDeleted based on the MIN(id) per company_name:
UPDATE brokers
JOIN (
SELECT company_name, MIN(id) AS id
FROM brokers
GROUP BY company_name
) AS x
ON x.company_name = brokers.company_name
AND isDeleted IS NULL
SET isDeleted = CASE WHEN (x.id <> brokers.id) THEN 1 END
;
The updated brokers contents:
SELECT * FROM brokers;
+----+--------------+-----------+
| id | company_name | isDeleted |
+----+--------------+-----------+
| 1 | name1 | null |
| 2 | name1 | 1 |
| 3 | name1 | 1 |
| 4 | name1 | 1 |
| 5 | name123 | null |
| 6 | name123 | 1 |
| 7 | name123 | 1 |
| 8 | name123 | 1 |
+----+--------------+-----------+
For brokerIndividuals, find / set the correct broker_id:
UPDATE brokerIndividuals
JOIN brokers AS b1
ON b1.id = brokerIndividuals.broker_id
JOIN brokers AS b2
ON b1.company_name = b2.company_name
AND b2.isDeleted IS NULL
SET brokerIndividuals.broker_id = b2.id
;
New contents:
SELECT * FROM brokerIndividuals;
+----+-----------+
| id | broker_id |
+----+-----------+
| 1 | 1 |
| 2 | 5 |
+----+-----------+
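The same two steps can also be sketched without MySQL's UPDATE ... JOIN syntax, using correlated subqueries instead. The sketch below runs on Python's built-in sqlite3 (an assumption, chosen only so it is self-contained). Two things differ from the answer above and are my own additions: the inner MIN(id) only looks at rows that are still live, which matters if the table already contains soft-deleted rows, and a partial unique index (hypothetical name ux_brokers_live_company) is created at the end to confirm the cleanup worked. SQL Server offers the same idea under the name filtered index, but the exact statement for your setup is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE brokers (
        id INTEGER PRIMARY KEY,
        company_name TEXT,
        isDeleted INTEGER DEFAULT NULL
    );
    CREATE TABLE brokerIndividuals (
        id INTEGER PRIMARY KEY,
        broker_id INTEGER REFERENCES brokers (id)
    );
    INSERT INTO brokers (company_name) VALUES
        ('name1'), ('name1'), ('name1'), ('name1'),
        ('name123'), ('name123'), ('name123'), ('name123');
    INSERT INTO brokerIndividuals (broker_id) VALUES (2), (7);
""")

with conn:  # commit on success, roll back on error
    # Step 1: soft-delete every live row except the lowest-id live row
    # of each company.
    conn.execute("""
        UPDATE brokers
        SET isDeleted = 1
        WHERE isDeleted IS NULL
          AND id <> (SELECT MIN(b.id)
                     FROM brokers AS b
                     WHERE b.company_name = brokers.company_name
                       AND b.isDeleted IS NULL)
    """)
    # Step 2: re-point individuals to the surviving row of the same company.
    # COALESCE leaves broker_id untouched when no live row is found.
    conn.execute("""
        UPDATE brokerIndividuals
        SET broker_id = COALESCE(
            (SELECT keep.id
             FROM brokers AS dup
             JOIN brokers AS keep
               ON keep.company_name = dup.company_name
              AND keep.isDeleted IS NULL
             WHERE dup.id = brokerIndividuals.broker_id),
            broker_id)
    """)
    # Step 3 (addition): enforce uniqueness among live rows from now on.
    conn.execute("""
        CREATE UNIQUE INDEX ux_brokers_live_company
        ON brokers (company_name) WHERE isDeleted IS NULL
    """)

live = conn.execute(
    "SELECT id, company_name FROM brokers WHERE isDeleted IS NULL ORDER BY id"
).fetchall()
individuals = conn.execute(
    "SELECT id, broker_id FROM brokerIndividuals ORDER BY id"
).fetchall()
print(live, individuals)
```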

Dependent SERIAL column in PostgreSQL

I want to get this result in my table contacts:
|contact_id | user_id | user_contact_id |
+-----------+------------------+----------------------+
| 1 | 1 | 1 |
+-----------+------------------+----------------------+
| 2 | 1 | 2 |
+-----------+------------------+----------------------+
| 3 | 1 | 3 |
+-----------+------------------+----------------------+
| 4 | 2 | 1 |
+-----------+------------------+----------------------+
| 5 | 2 | 2 |
+-----------+------------------+----------------------+
| 6 | 2 | 3 |
+-----------+------------------+----------------------+
| 7 | 3 | 1 |
+-----------+------------------+----------------------+
I'm going to insert only user_id.
INSERT INTO contacts (user_id) VALUES ($user_id);
The contact_id will auto-increment because it's a serial. I want user_contact_id to also populate automatically by the DB itself, so it is 100% stable with concurrent writes.
As other users suggested, only sequences or the serial type are guaranteed to be safe under concurrent writes.
If you really need user_contact_id to restart for every user_id, you could use the following view:
create view contacts_v as
select
contact_id,
user_id,
rank() over (partition by user_id order by contact_id) as user_contact_id
from contacts;
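To see what the view produces, here is a small self-contained sketch. It runs on Python's built-in sqlite3 rather than PostgreSQL (an assumption; INTEGER PRIMARY KEY stands in for SERIAL, and window functions need SQLite 3.25 or newer). The view body is the same as above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (
        contact_id INTEGER PRIMARY KEY,  -- stand-in for SERIAL
        user_id INTEGER NOT NULL
    );
    CREATE VIEW contacts_v AS
    SELECT contact_id,
           user_id,
           rank() OVER (PARTITION BY user_id ORDER BY contact_id)
               AS user_contact_id
    FROM contacts;
""")

# Only user_id is inserted; contact_id is assigned by the database.
conn.executemany(
    "INSERT INTO contacts (user_id) VALUES (?)",
    [(1,), (1,), (1,), (2,), (2,), (2,), (3,)],
)

rows = conn.execute(
    "SELECT contact_id, user_id, user_contact_id"
    " FROM contacts_v ORDER BY contact_id"
).fetchall()
for row in rows:
    print(row)
```

Because the number is computed at read time, nothing is stored and concurrent inserts cannot collide. The trade-off is that deleting a contact renumbers that user's later contacts.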
INSERT INTO CONTACTS
(user_id, user_contact_id)
VALUES
($user_id, (SELECT COALESCE(MAX(user_contact_id), 0) + 1 FROM CONTACTS WHERE user_id = $user_id))
You should use a sequence as a default value for your user_contact_id. And this is exactly what the SERIAL column type is doing.
http://www.postgresql.org/docs/9.1/static/datatype-numeric.html
http://www.postgresql.org/docs/9.4/static/sql-createsequence.html
Sequences are safe with concurrent writes.
CREATE TABLE contacts (
contact_id SERIAL PRIMARY KEY,
user_id INTEGER,
user_contact_id SERIAL
);
INSERT INTO contacts (user_id) VALUES
(1), (1), (1), (2), (2), (2), (3);
And here are the results:
> SELECT * FROM contacts;
contact_id | user_id | user_contact_id
------------+---------+-----------------
1 | 1 | 1
2 | 1 | 2
3 | 1 | 3
4 | 2 | 4
5 | 2 | 5
6 | 2 | 6
7 | 3 | 7
> \d contacts
Table "public.contacts"
Column | Type | Modifiers
-----------------+---------+--------------------------------------------------------------------
contact_id | integer | not null default nextval('contacts_contact_id_seq'::regclass)
user_id | integer |
user_contact_id | integer | not null default nextval('contacts_user_contact_id_seq'::regclass)
Indexes:
"contacts_pkey" PRIMARY KEY, btree (contact_id)

MySQL Join Sub-Select Optimization

UPDATE users u
JOIN (select count(*) as job_count, user_id from job_responses where date_created > subdate(now(), 30) group by user_id) j
ON j.user_id = u.user_id
JOIN users_profile p
ON p.user_id = u.user_id
JOIN users_roles_xref x
ON x.user_id = u.user_id
SET num_job_responses = least(j.job_count, 5)
WHERE u.status = 1 AND p.visible = "Y" AND x.role_id = 2000
And explain tells me this:
+----+-------------+---------------+--------+---------------------------------+---------------+---------+----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+--------+---------------------------------+---------------+---------+----------------------+--------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 23008 | |
| 1 | PRIMARY | u | eq_ref | PRIMARY,user_id,status,status_2 | PRIMARY | 4 | j.user_id | 1 | Using where |
| 1 | PRIMARY | p | ref | user_id,visible | user_id | 4 | scoop_jazz.u.user_id | 2 | Using where |
| 1 | PRIMARY | x | ref | index_role_id,index_user_id | index_user_id | 4 | scoop_jazz.u.user_id | 3 | Using where |
| 2 | DERIVED | job_responses | range | date_created | date_created | 4 | NULL | 135417 | Using where; Using temporary; Using filesort |
+----+-------------+---------------+--------+---------------------------------+---------------+---------+----------------------+--------+----------------------------------------------+
I'm having trouble optimizing this query with explain. Any way to do it?
You will want to add an index on job_responses(date_created, user_id).
Then you can drop the current single-column index on date_created.
The most expensive part of the query is the subquery
(select count(*) as job_count, user_id
from job_responses
where date_created > subdate(now(), 30)
group by user_id)
The only two fields of note are user_id and date_created. There is an index on date_created that has been chosen to satisfy date_created in last 30 days. However, it will have to go back to the data pages to retrieve user_id, then group by it.
If you had a composite index, the user_id is available directly from the index. It also covers the single-column index date_created, so you can drop that one.
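A rough way to try the composite-index idea in isolation is sketched below. It uses Python's built-in sqlite3 instead of MySQL (an assumption), so the plan output will not look like MySQL's EXPLAIN, and the index name idx_job_responses_date_user and the extra body column are invented for the example. The point is only that the subquery touches nothing but date_created and user_id, both of which live in the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job_responses (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        date_created TEXT NOT NULL,
        body TEXT                      -- hypothetical column not in the index
    );
    -- Composite index: range column first, then the grouped column.
    CREATE INDEX idx_job_responses_date_user
        ON job_responses (date_created, user_id);
""")

# (user_id, age in days): three rows are recent, two are older than 30 days.
sample = [(1, 1), (1, 2), (1, 40), (2, 3), (3, 90)]
conn.executemany(
    "INSERT INTO job_responses (user_id, date_created)"
    " VALUES (?, datetime('now', ?))",
    [(uid, f"-{age} days") for uid, age in sample],
)

SUBQUERY = """
    SELECT user_id, COUNT(*) AS job_count
    FROM job_responses
    WHERE date_created > datetime('now', '-30 days')
    GROUP BY user_id
"""

# Print SQLite's plan; the last column holds the human-readable detail.
for step in conn.execute("EXPLAIN QUERY PLAN " + SUBQUERY):
    print(step[-1])

counts = dict(conn.execute(SUBQUERY).fetchall())
print(counts)
```

On MySQL, the equivalent check would be to rerun EXPLAIN after adding the index and look for "Using index" on the job_responses row.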
It ended up being easier and way faster to generate a temporary table, populate it, and then use a join on that table. I was "chunking" the original query, which ends up being very expensive when it has to create and destroy tables created by sub-selects.

help optimizing query (shows strength of two-way relationships between contacts)

i have a contact_relationship table that stores the reported strength of a relationship between one contact and another at a given point in time.
mysql> desc contact_relationship;
+------------------+-----------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+-----------+------+-----+-------------------+-----------------------------+
| relationship_id | int(11) | YES | | NULL | |
| contact_id | int(11) | YES | MUL | NULL | |
| other_contact_id | int(11) | YES | | NULL | |
| strength | int(11) | YES | | NULL | |
| recorded | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+------------------+-----------+------+-----+-------------------+-----------------------------+
now i want to get a list of two-way relationships between contacts (meaning there are two rows, one with contact a specifying a relationship strength with contact b and another with contact b specifying a strength for contact a -- the strength of the two-way relationship is the smaller of those two strength values).
this is the query i've come up with but it is pretty slow:
select
mrcr1.contact_id,
mrcr1.other_contact_id,
case when (mrcr1.strength < mrcr2.strength) then
mrcr1.strength
else
mrcr2.strength
end strength
from (
select
cr1.*
from (
select
contact_id,
other_contact_id,
max(recorded) as max_recorded
from
contact_relationship
group by
contact_id,
other_contact_id
) as cr2
inner join contact_relationship cr1 on
cr1.contact_id = cr2.contact_id
and cr1.other_contact_id = cr2.other_contact_id
and cr1.recorded = cr2.max_recorded
) as mrcr1,
(
select
cr3.*
from (
select
contact_id,
other_contact_id,
max(recorded) as max_recorded
from
contact_relationship
group by
contact_id,
other_contact_id
) as cr4
inner join contact_relationship cr3 on
cr3.contact_id = cr4.contact_id
and cr3.other_contact_id = cr4.other_contact_id
and cr3.recorded = cr4.max_recorded
) as mrcr2
where
mrcr1.contact_id = mrcr2.other_contact_id
and mrcr1.other_contact_id = mrcr2.contact_id
and mrcr1.contact_id != mrcr1.other_contact_id
and mrcr2.contact_id != mrcr2.other_contact_id
and mrcr1.contact_id <= mrcr1.other_contact_id;
anyone have any recommendations of how to speed it up?
note that because a user may specify the strength of his relationship with a particular user more than once, you must only grab the most recent record for each pair of contacts.
update: here is the result of explaining the query...
+----+-------------+----------------------+-------+----------------------------------------------------------------------------------------+------------------------------+---------+-------------------------------------+-------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+----------------------------------------------------------------------------------------+------------------------------+---------+-------------------------------------+-------+--------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 36029 | Using where |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 36029 | Using where; Using join buffer |
| 4 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 36021 | |
| 4 | DERIVED | cr3 | ref | contact_relationship_index_1,contact_relationship_index_2,contact_relationship_index_3 | contact_relationship_index_2 | 10 | cr4.contact_id,cr4.other_contact_id | 1 | Using where |
| 5 | DERIVED | contact_relationship | index | NULL | contact_relationship_index_3 | 14 | NULL | 37973 | Using index |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 36021 | |
| 2 | DERIVED | cr1 | ref | contact_relationship_index_1,contact_relationship_index_2,contact_relationship_index_3 | contact_relationship_index_2 | 10 | cr2.contact_id,cr2.other_contact_id | 1 | Using where |
| 3 | DERIVED | contact_relationship | index | NULL | contact_relationship_index_3 | 14 | NULL | 37973 | Using index |
+----+-------------+----------------------+-------+----------------------------------------------------------------------------------------+------------------------------+---------+-------------------------------------+-------+--------------------------------+
You are losing a lot of time selecting the most recent record. Two options:
1- Change the way you store the data: keep one table with only the most recent record per pair, and another table for the historical records.
2- Use an analytic (window) function to select the most recent record, if your DBMS supports it. Something like:
Select first_value(strength) over(partition by contact_id, other_contact_id order by recorded desc)
from contact_relationship
Once you have the right rows, I think your query will go a lot faster.
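A minimal sketch of the window-function route follows, assuming a DBMS that supports window functions (MySQL only gained them in 8.0). It runs on Python's built-in sqlite3 as a stand-in, uses ROW_NUMBER() rather than first_value so that the whole latest row can be picked with rn = 1, and the sample rows are made up. SQLite's two-argument MIN plays the role of MySQL's LEAST.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact_relationship (
        contact_id INTEGER,
        other_contact_id INTEGER,
        strength INTEGER,
        recorded TEXT
    )
""")
conn.executemany(
    "INSERT INTO contact_relationship VALUES (?, ?, ?, ?)",
    [
        (1, 2, 2, "2020-01-01"),  # superseded by the next row
        (1, 2, 8, "2020-03-01"),
        (2, 1, 4, "2020-02-01"),
        (1, 3, 7, "2020-01-05"),  # one-way only, must not appear
        (3, 4, 5, "2020-01-01"),
        (4, 3, 9, "2020-01-02"),
    ],
)

TWO_WAY = """
    WITH latest AS (
        SELECT contact_id, other_contact_id, strength,
               ROW_NUMBER() OVER (
                   PARTITION BY contact_id, other_contact_id
                   ORDER BY recorded DESC
               ) AS rn
        FROM contact_relationship
    )
    SELECT a.contact_id,
           a.other_contact_id,
           MIN(a.strength, b.strength) AS strength  -- LEAST() in MySQL
    FROM latest AS a
    JOIN latest AS b
      ON a.contact_id = b.other_contact_id
     AND a.other_contact_id = b.contact_id
    WHERE a.rn = 1
      AND b.rn = 1
      AND a.contact_id < a.other_contact_id  -- report each pair once
    ORDER BY a.contact_id, a.other_contact_id
"""

two_way = conn.execute(TWO_WAY).fetchall()
print(two_way)
```

The single CTE replaces both copies of the "latest row per pair" derived table from the original query.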
Scorpi0's answer got me to thinking maybe I could use a temp table...
create temporary table mrcr1 (
contact_id int,
other_contact_id int,
strength int,
index mrcr1_index_1 (
contact_id,
other_contact_id
)
) replace as
select
cr1.contact_id,
cr1.other_contact_id,
cr1.strength from (
select
contact_id,
other_contact_id,
max(recorded) as max_recorded
from
contact_relationship
group by
contact_id, other_contact_id
) as cr2
inner join
contact_relationship cr1 on
cr1.contact_id = cr2.contact_id
and cr1.other_contact_id = cr2.other_contact_id
and cr1.recorded = cr2.max_recorded;
which i had to do twice (second time into a temp table named mrcr2) because mysql has a limitation where you can't alias the same temp table twice in one query.
with my two temp tables created my query then becomes:
select
mrcr1.contact_id,
mrcr1.other_contact_id,
case when (mrcr1.strength < mrcr2.strength) then
mrcr1.strength
else
mrcr2.strength
end strength
from
mrcr1,
mrcr2
where
mrcr1.contact_id = mrcr2.other_contact_id
and mrcr1.other_contact_id = mrcr2.contact_id
and mrcr1.contact_id != mrcr1.other_contact_id
and mrcr2.contact_id != mrcr2.other_contact_id
and mrcr1.contact_id <= mrcr1.other_contact_id;

MySQL - How to use subquery into IN statement by value

The question is how to take a column's data and use it as the value list for the IN function.
For this example I created 2 tables: movies and genres
Table "movies" contains 3 columns: id, name and genre.
Table "genres" contains 2 columns: id and name.
+- movies-+
| |- movie_id - int(11) - AUTO_INCREMENT - PRIMARY
| |- movie_name - varchar(255)
| |- movie_genres - varchar(255)
|
|
+- genres-+
|- genre_id - int(11) - AUTO_INCREMENT - PRIMARY
|- genre_name - varchar(255)
Both tables contain some dummy data:
+----------+------------+--------------+
| movie_id | movie_name | movie_genres |
+----------+------------+--------------+
| 1 | MOVIE 1 | 2,3,1 |
| 2 | MOVIE 2 | 2,4 |
| 3 | MOVIE 3 | 1,3 |
| 4 | MOVIE 4 | 3,4 |
+----------+------------+--------------+
+----------+------------+
| genre_id | genre_name |
+----------+------------+
| 1 | Comedy |
| 2 | Fantasy |
| 3 | Action |
| 4 | Mystery |
+----------+------------+
My goal is to get result like this:
+----------+------------+--------------+-----------------------+
| movie_id | movie_name | movie_genres | movie_genre_names |
+----------+------------+--------------+-----------------------+
| 1 | MOVIE 1 | 2,3,1 | Fantasy,Action,Comedy |
| 2 | MOVIE 2 | 2,4 | Fantasy,Mystery |
| 3 | MOVIE 3 | 1,3 | Comedy,Action |
| 4 | MOVIE 4 | 3,4 | Action,Mystery |
+----------+------------+--------------+-----------------------+
I'm using this query and it partly works; the only problem is that it uses just the first value of the movie_genres field in the IN value list.
SELECT `m` . * , GROUP_CONCAT( `g`.`genre_name` ) AS `movie_genre_names`
FROM `genres` AS `g`
LEFT JOIN `movies` AS `m` ON ( `g`.`genre_id`
IN (
`m`.`movie_genres`
) )
WHERE `g`.`genre_id`
IN (
(
SELECT `movie_genres`
FROM `movies`
WHERE `movie_id` =1
)
)
GROUP BY 1 =1
The results greatly differ from the one I want:
+----------+------------+--------------+-------------------+
| movie_id | movie_name | movie_genres | movie_genre_names |
+----------+------------+--------------+-------------------+
| 1 | MOVIE 1 | 2,3,1 | Fantasy,Fantasy |
+----------+------------+--------------+-------------------+
Sorry if I missed some data; I'm new to MySQL.
What query should I use to get the wanted results?
This is a bad design. You should create a many-to-many link table (movie_id, genre_id).
If you cannot change this design, however, use this query:
SELECT movies.*,
(
-- ORDER BY keeps the names in the same order as the ids in movie_genres
SELECT GROUP_CONCAT(genre_name ORDER BY FIND_IN_SET(genre_id, movie_genres))
FROM genres
WHERE FIND_IN_SET(genre_id, movie_genres)
) AS movie_genre_names
FROM movies;
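For comparison, the link-table design recommended above could look roughly like this. It is a sketch on Python's built-in sqlite3 (an assumption, standing in for MySQL), and the table name movie_genre_link is invented for the example. With one row per (movie, genre) pair, the genre names come from ordinary joins and no string parsing is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (
        movie_id INTEGER PRIMARY KEY,
        movie_name TEXT
    );
    CREATE TABLE genres (
        genre_id INTEGER PRIMARY KEY,
        genre_name TEXT
    );
    -- Hypothetical link table: one row per (movie, genre) pair.
    CREATE TABLE movie_genre_link (
        movie_id INTEGER REFERENCES movies (movie_id),
        genre_id INTEGER REFERENCES genres (genre_id),
        PRIMARY KEY (movie_id, genre_id)
    );
    INSERT INTO movies VALUES
        (1, 'MOVIE 1'), (2, 'MOVIE 2'), (3, 'MOVIE 3'), (4, 'MOVIE 4');
    INSERT INTO genres VALUES
        (1, 'Comedy'), (2, 'Fantasy'), (3, 'Action'), (4, 'Mystery');
    INSERT INTO movie_genre_link VALUES
        (1, 2), (1, 3), (1, 1),
        (2, 2), (2, 4),
        (3, 1), (3, 3),
        (4, 3), (4, 4);
""")

rows = conn.execute("""
    SELECT m.movie_id,
           m.movie_name,
           GROUP_CONCAT(g.genre_name) AS movie_genre_names
    FROM movies AS m
    LEFT JOIN movie_genre_link AS l ON l.movie_id = m.movie_id
    LEFT JOIN genres AS g ON g.genre_id = l.genre_id
    GROUP BY m.movie_id, m.movie_name
    ORDER BY m.movie_id
""").fetchall()

# The order of names inside each concatenated string is not guaranteed,
# so compare them as sets.
genres_by_movie = {name: set(names.split(",")) for _, name, names in rows}
print(genres_by_movie)
```

A link table has no inherent ordering of genres per movie; if the order matters, it would need its own column.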