MySQL Join Sub-Select Optimization - sql

UPDATE users u
JOIN (select count(*) as job_count, user_id from job_responses where date_created > subdate(now(), 30) group by user_id) j
ON j.user_id = u.user_id
JOIN users_profile p
ON p.user_id = u.user_id
JOIN users_roles_xref x
ON x.user_id = u.user_id
SET num_job_responses = least(j.job_count, 5)
WHERE u.status = 1 AND p.visible = "Y" AND x.role_id = 2000
And explain tells me this:
+----+-------------+---------------+--------+---------------------------------+---------------+---------+----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+--------+---------------------------------+---------------+---------+----------------------+--------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 23008 | |
| 1 | PRIMARY | u | eq_ref | PRIMARY,user_id,status,status_2 | PRIMARY | 4 | j.user_id | 1 | Using where |
| 1 | PRIMARY | p | ref | user_id,visible | user_id | 4 | scoop_jazz.u.user_id | 2 | Using where |
| 1 | PRIMARY | x | ref | index_role_id,index_user_id | index_user_id | 4 | scoop_jazz.u.user_id | 3 | Using where |
| 2 | DERIVED | job_responses | range | date_created | date_created | 4 | NULL | 135417 | Using where; Using temporary; Using filesort |
+----+-------------+---------------+--------+---------------------------------+---------------+---------+----------------------+--------+----------------------------------------------+
I'm having trouble optimizing this query with explain. Any way to do it?

You will want to add an index on job_responses(date_created, user_id).
Then you can drop the current single-column index on date_created.
The most expensive part of the query is the subquery
(select count(*) as job_count, user_id
from job_responses
where date_created > subdate(now(), 30)
group by user_id)
The only two fields of note are user_id and date_created. There is an index on date_created that has been chosen to satisfy date_created in last 30 days. However, it will have to go back to the data pages to retrieve user_id, then group by it.
If you had a composite index, the user_id is available directly from the index. It also covers the single-column index date_created, so you can drop that one.
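To see the covering-index idea in action without a MySQL server, here is a minimal sketch using Python's bundled SQLite as a stand-in. The index name idx_created_user, the sample rows, and the fixed cutoff date (in place of subdate(now(), 30)) are assumptions for illustration; the planner output is SQLite's, not MySQL's, and its wording varies by version.

```python
import sqlite3

# SQLite stands in for MySQL; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_responses ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, date_created TEXT)"
)
conn.executemany(
    "INSERT INTO job_responses (user_id, date_created) VALUES (?, ?)",
    [(1, "2024-01-05"), (1, "2024-01-20"), (2, "2024-01-25"), (2, "2023-06-01")],
)

# Composite index: the range-filtered column first, then the grouped column,
# so both columns the subquery needs can be read from the index alone.
# (idx_created_user is a hypothetical name.)
conn.execute(
    "CREATE INDEX idx_created_user ON job_responses (date_created, user_id)"
)

query = """
    SELECT COUNT(*) AS job_count, user_id
    FROM job_responses
    WHERE date_created > ?
    GROUP BY user_id
"""
cutoff = "2024-01-01"  # assumed fixed cutoff instead of subdate(now(), 30)

# Print whatever plan this SQLite build picks.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (cutoff,)).fetchall()
for row in plan:
    print(row[-1])

counts = sorted(conn.execute(query, (cutoff,)).fetchall())
print(counts)
```

On MySQL the equivalent check is to rerun EXPLAIN after adding the index and look for "Using index" on the job_responses row.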

It ended up being easier and way faster to generate a temporary table, populate it, and then use a join on that table. I was "chunking" the original query, which ends up being very expensive when it has to create and destroy tables created by sub-selects.
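A rough sketch of that temporary-table approach, again using SQLite through Python as a stand-in for MySQL. The temp table name tmp_job_counts, the sample data, and the fixed cutoff date are assumptions; the users_profile and users_roles_xref joins from the question are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        status INTEGER,
        num_job_responses INTEGER DEFAULT 0
    );
    CREATE TABLE job_responses (user_id INTEGER, date_created TEXT);
    INSERT INTO users (user_id, status) VALUES (1, 1), (2, 1), (3, 0);
""")
sample = [(1, "2024-01-10")] * 7 + [(2, "2024-01-12")] * 2 + [(3, "2024-01-15")] * 4
conn.executemany("INSERT INTO job_responses VALUES (?, ?)", sample)

# Step 1: materialize the per-user counts once (hypothetical table name).
conn.execute("""
    CREATE TEMP TABLE tmp_job_counts AS
    SELECT user_id, COUNT(*) AS job_count
    FROM job_responses
    WHERE date_created > '2024-01-01'
    GROUP BY user_id
""")

# Step 2: update from the temp table. SQLite's two-argument MIN() plays the
# role of MySQL's LEAST(); the correlated subquery replaces UPDATE ... JOIN.
conn.execute("""
    UPDATE users
    SET num_job_responses = MIN(
        (SELECT t.job_count FROM tmp_job_counts t WHERE t.user_id = users.user_id),
        5)
    WHERE status = 1
      AND user_id IN (SELECT user_id FROM tmp_job_counts)
""")

result = conn.execute(
    "SELECT user_id, num_job_responses FROM users ORDER BY user_id"
).fetchall()
print(result)
```

In MySQL itself the second step can stay an UPDATE ... JOIN against the temporary table, which can also be given an index on user_id.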

Related

Postgres update column, on conflict ignore this row

I have a table with email and secondary_email. email column has a unique constraint, while secondary_email can be repeated across rows.
I have to write a query to copy secondary_email to email. If there is a conflict, then ignore this row.
This query
UPDATE users SET email = secondary_email
WHERE NOT EXISTS
(SELECT 1 FROM users WHERE email=secondary_email)
still throws the error ERROR: duplicate key value violates unique constraint "users_email_key"
Users Before
+----+-------+-----------------+
| id | email | secondary_email |
+----+-------+-----------------+
| 1 | NULL | NULL |
| 2 | NULL | NULL |
| 3 | NULL | |
| 4 | NULL | e1#example.com |
| 5 | NULL | e1#example.com |
| 6 | NULL | e2#example.com |
+----+-------+-----------------+
Users After
+----+----------------+-----------------+
| id | email | secondary_email |
+----+----------------+-----------------+
| 1 | NULL | NULL |
| 2 | NULL | NULL |
| 3 | NULL | |
| 4 | e1#example.com | e1#example.com |
| 5 | NULL | e1#example.com |
| 6 | e2#example.com | e2#example.com |
+----+----------------+-----------------+
You need table aliases to fix your query:
UPDATE users u
SET email = u.secondary_email
WHERE NOT EXISTS (SELECT 1 FROM users u2 WHERE u2.email = u.secondary_email);
For your overall problem, check for no duplicates within the column as well:
UPDATE users u
SET email = u.secondary_email
FROM (SELECT secondary_email, COUNT(*) as cnt
FROM users u
GROUP BY secondary_email
HAVING COUNT(*) = 1
) s
WHERE s.secondary_email = u.secondary_email AND
NOT EXISTS (SELECT 1 FROM users u2 WHERE u2.email = u.secondary_email);
Or choose the first one:
UPDATE users u
SET email = u.secondary_email
FROM (SELECT u.*,
ROW_NUMBER() OVER (PARTITION BY secondary_email ORDER BY id) as seqnum
FROM users u
) s
WHERE s.id = u.id AND
s.seqnum = 1 AND
NOT EXISTS (SELECT 1 FROM users u2 WHERE u2.email = u.secondary_email);
Note: This will also filter out NULL values which seems like a good idea.
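The "choose the first one" idea can be tried locally with SQLite via Python. This is only a sketch under assumptions: it picks the first row per secondary_email with MIN(id) in a subquery rather than ROW_NUMBER() (which should select the same rows here), the key column is called id as in the question's table, and the sample emails are written with @.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT UNIQUE,
        secondary_email TEXT
    );
    INSERT INTO users (id, email, secondary_email) VALUES
        (1, NULL, NULL),
        (2, NULL, NULL),
        (4, NULL, 'e1@example.com'),
        (5, NULL, 'e1@example.com'),
        (6, NULL, 'e2@example.com');
""")

# Only the lowest id per secondary_email is a candidate, and only if no
# existing row already uses that address as its email.
conn.execute("""
    UPDATE users
    SET email = secondary_email
    WHERE id IN (SELECT MIN(id)
                 FROM users
                 WHERE secondary_email IS NOT NULL
                 GROUP BY secondary_email)
      AND NOT EXISTS (SELECT 1 FROM users u2
                      WHERE u2.email = users.secondary_email)
""")

rows = conn.execute(
    "SELECT id, email, secondary_email FROM users ORDER BY id"
).fetchall()
for r in rows:
    print(r)
```

Row 5 keeps a NULL email because row 4 claims e1@example.com first, which matches the "Users After" table in the question.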

Join Lookup from 1 table to multiple columns

How do I link one table to multiple columns in another table without using multiple JOINs?
Below is my scenario:
I have table User with ID and Name
User
+---------+------------+
| Id | Name |
+---------+------------+
| 1 | John |
| 2 | Mike |
| 3 | Charles |
+---------+------------+
And table Product with multiple columns, but just focus on 2 columns CreateBy And ModifiedBy
+------------+-----------+-------------+
| product_id | CreateBy | ModifiedBy |
+------------+-----------+-------------+
| 1 | 1 | 3 |
| 2 | 1 | 3 |
| 3 | 2 | 3 |
| 4 | 2 | 1 |
| 5 | 2 | 3 |
+------------+-----------+-------------+
With a normal JOIN, I need two joins:
SELECT p.Product_id,
u1.Name AS CreateByName,
u2.Name AS ModifiedByName
FROM Product p
JOIN [User] u1 ON p.CreateBy = u1.Id
JOIN [User] u2 ON p.ModifiedBy = u2.Id
to produce this result:
+------------+---------------+-----------------+
| product_id | CreateByName | ModifiedByName |
+------------+---------------+-----------------+
| 1 | John | Charles |
| 2 | John | Charles |
| 3 | Mike | Charles |
| 4 | Mike | John |
| 5 | Mike | Charles |
+------------+---------------+-----------------+
How do I avoid joining twice?
I'm using MS-SQL, but I'm open to any SQL dialect, out of curiosity.
Your current design/approach is acceptable, I think, and the need for two joins is a function of there being two user ID columns. Each of the two columns requires a separate join.
For fun, here is a table design which you may consider if you really want to have to perform only one join:
+------------+-----------+-------------+
| product_id | user_id | type |
+------------+-----------+-------------+
| 1 | 1 | created |
| 2 | 1 | created |
| 3 | 2 | created |
| 4 | 2 | created |
| 5 | 2 | created |
| 1 | 3 | modified |
| 2 | 3 | modified |
| 3 | 3 | modified |
| 4 | 1 | modified |
| 5 | 3 | modified |
+------------+-----------+-------------+
Now, you can get away with just a single join followed by an aggregation:
SELECT
p.product_id,
MAX(CASE WHEN p.type = 'created' THEN u.Name END) AS CreateByName,
MAX(CASE WHEN p.type = 'modified' THEN u.Name END) AS ModifiedByName
FROM Product p
INNER JOIN user u
ON p.user_id = u.Id
GROUP BY
p.product_id;
Note that I don't recommend this approach at all. It is much cleaner to use your current approach and use two joins. Joins can fairly easily be optimized using one or more indices. The above aggregation approach would probably not perform as well as what you already have.
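For anyone who wants to try the single-join variant anyway, here is a small sketch using SQLite through Python. The long-format table is named product_user here, which is a made-up name; the data is copied from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (Id INTEGER PRIMARY KEY, Name TEXT);
    -- product_user is a hypothetical name for the one-row-per-role design
    CREATE TABLE product_user (product_id INTEGER, user_id INTEGER, type TEXT);
    INSERT INTO user VALUES (1, 'John'), (2, 'Mike'), (3, 'Charles');
    INSERT INTO product_user VALUES
        (1, 1, 'created'), (2, 1, 'created'), (3, 2, 'created'),
        (4, 2, 'created'), (5, 2, 'created'),
        (1, 3, 'modified'), (2, 3, 'modified'), (3, 3, 'modified'),
        (4, 1, 'modified'), (5, 3, 'modified');
""")

# One join, then conditional aggregation pivots the two roles into columns.
names = conn.execute("""
    SELECT pu.product_id,
           MAX(CASE WHEN pu.type = 'created'  THEN u.Name END) AS CreateByName,
           MAX(CASE WHEN pu.type = 'modified' THEN u.Name END) AS ModifiedByName
    FROM product_user pu
    JOIN user u ON pu.user_id = u.Id
    GROUP BY pu.product_id
    ORDER BY pu.product_id
""").fetchall()
for row in names:
    print(row)
```

The output should match the result table in the question, at the cost of a GROUP BY over twice as many rows.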
If you use natural keys instead of surrogates, you won't need to join at all.
I don't know how you tell your products apart in the real world, but for the example I will assume you have a UPC
CREATE TABLE User
(Name VARCHAR(20) PRIMARY KEY);
CREATE TABLE Product
(UPC CHAR(12) PRIMARY KEY,
CreatedBy VARCHAR(20) REFERENCES User(Name),
ModifiedBy VARCHAR(20) REFERENCES User(Name)
);
Now your query is a simple select, and you also enforce uniqueness of your user names as a bonus, and don't need additional indexes.
A join is the best approach, but if you are looking for an alternative you can use inline (correlated scalar) subqueries in the SELECT list.
SELECT P.PRODUCT_ID,
(SELECT [NAME] FROM #USER WHERE ID = CREATED_BY) AS CREATED_BY,
(SELECT [NAME] FROM #USER WHERE ID = MODIFIED_BY) AS MODIFIED_BY
FROM #PRODUCT P

Find number of rows identical on some columns, but different on another column

Say I have the following table:
CREATE TABLE data (
PROJECT_ID VARCHAR,
TASK_ID VARCHAR,
REF_ID VARCHAR,
REF_VALUE VARCHAR
);
I want to identify rows where
PROJECT_ID, REF_ID, REF_VALUE are the same
but TASK_ID are different.
The desired output is a list of TASK_ID_1, TASK_ID_2 and COUNT(*) of such conflicts. So, for example,
DATA
+------------+---------+--------+-----------+
| PROJECT_ID | TASK_ID | REF_ID | REF_VALUE |
+------------+---------+--------+-----------+
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 1 | 2 | 1 | 1 |
| 1 | 2 | 1 | 2 |
+------------+---------+--------+-----------+
OUTPUT
+-----------+-----------+----------+
| TASK_ID_1 | TASK_ID_2 | COUNT(*) |
+-----------+-----------+----------+
| 1 | 2 | 2 |
| 2 | 1 | 2 |
+-----------+-----------+----------+
would mean that there are two entries with TASK_ID == 1 and two entries with TASK_ID == 2 that share the same values for the other three columns. The inherent symmetry in the output is fine.
How would I go about finding this information? I've tried joining the table onto itself and grouping, but this turned up more results for a single task than the table had rows altogether, so it's clearly wrong.
The database used is PostgreSQL, though a solution that applies to most common SQL systems would be preferable.
You want a self join and aggregation:
select d1.task_id as task_id_1, d2.task_id as task_id_2, count(*)
from data d1 join
data d2
on d1.project_id = d2.project_id and
d1.ref_id = d2.ref_id and
d1.ref_value = d2.ref_value and
d1.task_id <> d2.task_id
group by d1.task_id, d2.task_id;
Notes:
Add the condition d1.task_id < d2.task_id if you want each pair to occur only once in the result set.
This does not handle NULL values, although that is easy enough to handle. Use is not distinct from instead of =.
You can also simplify this a bit with the using clause:
select d1.task_id as task_id_1, d2.task_id as task_id_2, count(*)
from data d1 join
data d2
using (project_id, ref_id, ref_value)
where d1.task_id <> d2.task_id
group by d1.task_id, d2.task_id;
You can get an idea of how many rows might be returned by using:
select d.project_id, d.ref_id, d.ref_value, count(distinct d.task_id), count(*)
from data d
group by d.project_id, d.ref_id, d.ref_value;
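As a quick check, the self join can be run against the sample rows from the question. This sketch uses SQLite through Python rather than PostgreSQL; the query itself is plain SQL and should behave the same way there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (
        PROJECT_ID VARCHAR,
        TASK_ID VARCHAR,
        REF_ID VARCHAR,
        REF_VALUE VARCHAR
    );
    INSERT INTO data VALUES
        ('1', '1', '1', '1'),
        ('1', '1', '1', '2'),
        ('1', '2', '1', '1'),
        ('1', '2', '1', '2');
""")

# Pair up rows that agree on the three columns but belong to different tasks,
# then count the pairs per (task, task) combination.
conflicts = conn.execute("""
    SELECT d1.task_id AS task_id_1, d2.task_id AS task_id_2, COUNT(*)
    FROM data d1
    JOIN data d2
      ON d1.project_id = d2.project_id
     AND d1.ref_id = d2.ref_id
     AND d1.ref_value = d2.ref_value
     AND d1.task_id <> d2.task_id
    GROUP BY d1.task_id, d2.task_id
    ORDER BY d1.task_id, d2.task_id
""").fetchall()
print(conflicts)
```

Swapping <> for < in the last join condition leaves only the ('1', '2') row.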
This is how I understand your question. It assumes there are only two tasks for the same combination.
SELECT "PROJECT_ID", "REF_ID", "REF_VALUE",
MIN("TASK_ID") as TASK_ID_1,
MAX("TASK_ID") as TASK_ID_2,
COUNT(*) as cnt
FROM Table1
GROUP BY "PROJECT_ID", "REF_ID", "REF_VALUE"
HAVING MIN("TASK_ID") != MAX("TASK_ID")
-- COUNT(*) > 1 also should work
OUTPUT
I added more columns to make clear which values are shared:
| PROJECT_ID | REF_ID | REF_VALUE | task_id_1 | task_id_2 | cnt |
|------------|--------|-----------|-----------|-----------|-----|
| 1 | 1 | 2 | 1 | 2 | 2 |
| 1 | 1 | 1 | 1 | 2 | 2 |

SQL Performance: Using Union and Subqueries

Hi stackoverflow(My first question!),
We're doing something like an SNS, and got a question about optimizing queries.
Using mysql 5.1, the current table was created with:
CREATE TABLE friends(
user_id BIGINT NOT NULL,
friend_id BIGINT NOT NULL,
PRIMARY KEY (user_id, friend_id)
) ENGINE INNODB;
Sample data is populated like:
INSERT INTO friends VALUES
(1,2),
(1,3),
(1,4),
(1,5),
(2,1),
(2,3),
(2,4),
(3,1),
(3,2),
(4,1),
(4,2),
(5,1),
(5,6),
(6,5),
(7,8),
(8,7);
The business logic: we need to figure out which users are friends or friends of friends for a given user.
The current query for this for a user with user_id=1 is:
SELECT friend_id FROM friends WHERE user_id = 1
UNION
SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
SELECT friend_id FROM friends WHERE user_id = 1
);
The expected result is (order doesn't matter):
2
3
4
5
1
6
As you can see, the above query performs the subquery "SELECT friend_id FROM friends WHERE user_id = 1" twice.
So, here is the question. If performance is your primary concern, how would you change the above query or schema?
Thanks in advance.
In this particular case, you can use a JOIN:
SELECT DISTINCT f2.friend_id
FROM friends AS f1
JOIN friends AS f2 ON f1.friend_id=f2.user_id OR f2.user_id=1
WHERE f1.user_id=1;
Examining each query suggests the JOIN will be about as performant as the UNION in a big-O sense, though perhaps faster by a constant factor. Jasie's query looks like it might be big-O faster.
EXPLAIN SELECT friend_id FROM friends WHERE user_id = 1
UNION
SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
SELECT friend_id FROM friends WHERE user_id = 1
);
+----+--------------------+------------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
| 1 | PRIMARY | friends | ref | PRIMARY | PRIMARY | 8 | const | 4 | Using index |
| 2 | UNION | friends | index | NULL | PRIMARY | 16 | NULL | 16 | Using where; Using index; Using temporary |
| 3 | DEPENDENT SUBQUERY | friends | eq_ref | PRIMARY | PRIMARY | 16 | const,func | 1 | Using index |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------------+------------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
EXPLAIN SELECT DISTINCT f2.friend_id
FROM friends AS f1
JOIN friends AS f2
ON f1.friend_id=f2.user_id OR f2.user_id=1
WHERE f1.user_id=1;
+----+-------------+-------+-------+---------------+---------+---------+-------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+---------------------------------------------+
| 1 | SIMPLE | f1 | ref | PRIMARY | PRIMARY | 8 | const | 4 | Using index; Using temporary |
| 1 | SIMPLE | f2 | index | PRIMARY | PRIMARY | 16 | NULL | 16 | Using where; Using index; Using join buffer |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+---------------------------------------------+
EXPLAIN SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
SELECT friend_id FROM friends WHERE user_id = 1
) OR user_id = 1;
+----+--------------------+---------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
| 1 | PRIMARY | friends | index | PRIMARY | PRIMARY | 16 | NULL | 16 | Using where; Using index; Using temporary |
| 2 | DEPENDENT SUBQUERY | friends | eq_ref | PRIMARY | PRIMARY | 16 | const,func | 1 | Using index |
+----+--------------------+---------+--------+---------------+---------+---------+------------+------+-------------------------------------------+
No need for the UNION. Just include an OR with the user_id of the beginning user:
SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
SELECT friend_id FROM friends WHERE user_id = 1
) OR user_id = 1;
+-----------+
| friend_id |
+-----------+
| 2 |
| 3 |
| 4 |
| 5 |
| 1 |
| 6 |
+-----------+
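To confirm that the three formulations in this thread return the same users, here is a small sketch that loads the sample data into SQLite via Python and compares the result sets. It only checks correctness; it says nothing about how MySQL 5.1 would plan each query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE friends (
        user_id BIGINT NOT NULL,
        friend_id BIGINT NOT NULL,
        PRIMARY KEY (user_id, friend_id)
    )
""")
conn.executemany("INSERT INTO friends VALUES (?, ?)", [
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 3), (2, 4), (3, 1),
    (3, 2), (4, 1), (4, 2), (5, 1), (5, 6), (6, 5), (7, 8), (8, 7),
])

union_sql = """
    SELECT friend_id FROM friends WHERE user_id = 1
    UNION
    SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
        SELECT friend_id FROM friends WHERE user_id = 1)
"""
join_sql = """
    SELECT DISTINCT f2.friend_id
    FROM friends AS f1
    JOIN friends AS f2 ON f1.friend_id = f2.user_id OR f2.user_id = 1
    WHERE f1.user_id = 1
"""
or_sql = """
    SELECT DISTINCT friend_id FROM friends WHERE user_id IN (
        SELECT friend_id FROM friends WHERE user_id = 1
    ) OR user_id = 1
"""

def ids(sql):
    """Run a query and return its first column as a set."""
    return {row[0] for row in conn.execute(sql)}

union_ids, join_ids, or_ids = ids(union_sql), ids(join_sql), ids(or_sql)
print(sorted(union_ids), sorted(join_ids), sorted(or_ids))
```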

help optimizing query (shows strength of two-way relationships between contacts)

i have a contact_relationship table that stores the reported strength of a relationship between one contact and another at a given point in time.
mysql> desc contact_relationship;
+------------------+-----------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+-----------+------+-----+-------------------+-----------------------------+
| relationship_id | int(11) | YES | | NULL | |
| contact_id | int(11) | YES | MUL | NULL | |
| other_contact_id | int(11) | YES | | NULL | |
| strength | int(11) | YES | | NULL | |
| recorded | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+------------------+-----------+------+-----+-------------------+-----------------------------+
now i want to get a list of two-way relationships between contacts (meaning there are two rows, one with contact a specifying a relationship strength with contact b and another with contact b specifying a strength for contact a -- the strength of the two-way relationship is the smaller of those two strength values).
this is the query i've come up with but it is pretty slow:
select
mrcr1.contact_id,
mrcr1.other_contact_id,
case when (mrcr1.strength < mrcr2.strength) then
mrcr1.strength
else
mrcr2.strength
end strength
from (
select
cr1.*
from (
select
contact_id,
other_contact_id,
max(recorded) as max_recorded
from
contact_relationship
group by
contact_id,
other_contact_id
) as cr2
inner join contact_relationship cr1 on
cr1.contact_id = cr2.contact_id
and cr1.other_contact_id = cr2.other_contact_id
and cr1.recorded = cr2.max_recorded
) as mrcr1,
(
select
cr3.*
from (
select
contact_id,
other_contact_id,
max(recorded) as max_recorded
from
contact_relationship
group by
contact_id,
other_contact_id
) as cr4
inner join contact_relationship cr3 on
cr3.contact_id = cr4.contact_id
and cr3.other_contact_id = cr4.other_contact_id
and cr3.recorded = cr4.max_recorded
) as mrcr2
where
mrcr1.contact_id = mrcr2.other_contact_id
and mrcr1.other_contact_id = mrcr2.contact_id
and mrcr1.contact_id != mrcr1.other_contact_id
and mrcr2.contact_id != mrcr2.other_contact_id
and mrcr1.contact_id <= mrcr1.other_contact_id;
anyone have any recommendations of how to speed it up?
note that because a user may specify the strength of his relationship with a particular user more than once, you must only grab the most recent record for each pair of contacts.
update: here is the result of explaining the query...
+----+-------------+----------------------+-------+----------------------------------------------------------------------------------------+------------------------------+---------+-------------------------------------+-------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+----------------------------------------------------------------------------------------+------------------------------+---------+-------------------------------------+-------+--------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 36029 | Using where |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 36029 | Using where; Using join buffer |
| 4 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 36021 | |
| 4 | DERIVED | cr3 | ref | contact_relationship_index_1,contact_relationship_index_2,contact_relationship_index_3 | contact_relationship_index_2 | 10 | cr4.contact_id,cr4.other_contact_id | 1 | Using where |
| 5 | DERIVED | contact_relationship | index | NULL | contact_relationship_index_3 | 14 | NULL | 37973 | Using index |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 36021 | |
| 2 | DERIVED | cr1 | ref | contact_relationship_index_1,contact_relationship_index_2,contact_relationship_index_3 | contact_relationship_index_2 | 10 | cr2.contact_id,cr2.other_contact_id | 1 | Using where |
| 3 | DERIVED | contact_relationship | index | NULL | contact_relationship_index_3 | 14 | NULL | 37973 | Using index |
+----+-------------+----------------------+-------+----------------------------------------------------------------------------------------+------------------------------+---------+-------------------------------------+-------+--------------------------------+
You are losing a lot of time selecting the most recent record. Two options:
1- Change the way you store the data: keep one table with only the most recent record per pair, and another table for the historical records.
2- Use an analytic (window) function to select the most recent record, if your DBMS supports them. Something like
Select first_value(strength) over(partition by contact_id, other_contact_id order by recorded desc)
from contact_relationship
Once you have the most recent row for each pair, I think your query will go a lot faster.
Scorpi0's answer got me to thinking maybe I could use a temp table...
create temporary table mrcr1 (
contact_id int,
other_contact_id int,
strength int,
index mrcr1_index_1 (
contact_id,
other_contact_id
)
) replace as
select
cr1.contact_id,
cr1.other_contact_id,
cr1.strength from (
select
contact_id,
other_contact_id,
max(recorded) as max_recorded
from
contact_relationship
group by
contact_id, other_contact_id
) as cr2
inner join
contact_relationship cr1 on
cr1.contact_id = cr2.contact_id
and cr1.other_contact_id = cr2.other_contact_id
and cr1.recorded = cr2.max_recorded;
which i had to do twice (second time into a temp table named mrcr2) because mysql has a limitation where you can't alias the same temp table twice in one query.
with my two temp tables created my query then becomes:
select
mrcr1.contact_id,
mrcr1.other_contact_id,
case when (mrcr1.strength < mrcr2.strength) then
mrcr1.strength
else
mrcr2.strength
end strength
from
mrcr1,
mrcr2
where
mrcr1.contact_id = mrcr2.other_contact_id
and mrcr1.other_contact_id = mrcr2.contact_id
and mrcr1.contact_id != mrcr1.other_contact_id
and mrcr2.contact_id != mrcr2.other_contact_id
and mrcr1.contact_id <= mrcr1.other_contact_id;
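here is a small self-contained sketch of the same temp-table idea, run against SQLite from Python so it can be tried without a MySQL server. the sample rows and the single temp table name mrcr are made up for illustration. sqlite lets you self-join a temp table, so one copy is enough here, whereas on mysql you would still need the two copies described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact_relationship (
        contact_id INTEGER,
        other_contact_id INTEGER,
        strength INTEGER,
        recorded TEXT
    );
    INSERT INTO contact_relationship VALUES
        (1, 2, 5, '2024-01-01 10:00:00'),
        (1, 2, 3, '2024-02-01 10:00:00'),  -- newer report replaces strength 5
        (2, 1, 4, '2024-01-15 10:00:00'),
        (1, 3, 2, '2024-01-20 10:00:00');  -- one-way only, should not appear

    -- most recent record per (contact_id, other_contact_id) pair
    CREATE TEMP TABLE mrcr AS
    SELECT cr1.contact_id, cr1.other_contact_id, cr1.strength
    FROM (SELECT contact_id, other_contact_id, MAX(recorded) AS max_recorded
          FROM contact_relationship
          GROUP BY contact_id, other_contact_id) AS cr2
    JOIN contact_relationship cr1
      ON cr1.contact_id = cr2.contact_id
     AND cr1.other_contact_id = cr2.other_contact_id
     AND cr1.recorded = cr2.max_recorded;
""")

# two-way strength is the smaller of the two reported values;
# the < condition keeps each pair once and drops self-relationships.
two_way = conn.execute("""
    SELECT a.contact_id, a.other_contact_id,
           MIN(a.strength, b.strength) AS strength
    FROM mrcr a
    JOIN mrcr b
      ON a.contact_id = b.other_contact_id
     AND a.other_contact_id = b.contact_id
    WHERE a.contact_id < a.other_contact_id
""").fetchall()
print(two_way)
```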