Improve query performance for a large join

I am trying to get a list of users who made a pull request on any repo with a specified language.
SELECT distinct(actor_id) as id FROM pull_requests
JOIN (SELECT id FROM repos WHERE language = 'javascript') as res
ON pull_requests.repo_id = res.id
I've been trying to improve the performance of this query. Currently it takes 2sec+ to run.

First thing -- try a semi-join:
SELECT distinct actor_id as id
FROM pull_requests p
where exists (
select null
from repos r
where p.repo_id = r.id and r.language = 'javascript'
)
Secondly -- verify your distinct is necessary based on this change. It probably is in this case, but semi-joins can often eliminate the need for distinct where it's used as a crutch for a 1:many join returning multiple rows -- the exists will not multiply results based on multiple matches in the repos table.
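To see why the exists form does not multiply rows, here is a minimal sketch using Python's built-in sqlite3 module. The sample data is hypothetical: repo 1 is deliberately listed twice to mimic a 1:many match, which would not happen if repos.id is a primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE repos (id INTEGER, language TEXT);
    CREATE TABLE pull_requests (actor_id INTEGER, repo_id INTEGER);
    -- Hypothetical data: repo 1 appears twice to mimic a 1:many match.
    INSERT INTO repos VALUES (1, 'javascript'), (1, 'javascript'), (2, 'python');
    INSERT INTO pull_requests VALUES (10, 1), (11, 2);
""")

# Plain join: one output row per matching (pull_request, repo) pair.
join_rows = conn.execute("""
    SELECT p.actor_id
    FROM pull_requests p
    JOIN repos r ON p.repo_id = r.id
    WHERE r.language = 'javascript'
""").fetchall()

# Semi-join: at most one output row per pull_requests row.
exists_rows = conn.execute("""
    SELECT p.actor_id
    FROM pull_requests p
    WHERE EXISTS (
        SELECT NULL FROM repos r
        WHERE p.repo_id = r.id AND r.language = 'javascript'
    )
""").fetchall()

print(join_rows)
print(exists_rows)
```

Note that an actor with several pull requests still appears several times in the exists result, which is why distinct probably stays in the original query.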

Try this:
SELECT distinct(A.actor_id) as id FROM pull_requests AS A
INNER JOIN repos AS B ON A.repo_id = B.id AND B.language = 'javascript'
You may also need an index on pull_requests.repo_id, on repos.id, or on both, if they are not already indexed.
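A rough way to check whether an index is picked up is to ask the database for its plan. The sketch below uses SQLite through Python's sqlite3 module purely for illustration; the index names and sample rows are made up, the extra index on repos.language is an assumption rather than part of the answer, and other databases have their own EXPLAIN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE repos (id INTEGER PRIMARY KEY, language TEXT);
    CREATE TABLE pull_requests (
        id INTEGER PRIMARY KEY, actor_id INTEGER, repo_id INTEGER);
    INSERT INTO repos VALUES (1, 'javascript'), (2, 'python'), (3, 'javascript');
    INSERT INTO pull_requests (actor_id, repo_id)
        VALUES (10, 1), (10, 3), (11, 2), (12, 3);
    -- Hypothetical index names; the columns come from the join and the filter.
    CREATE INDEX idx_pull_requests_repo_id ON pull_requests (repo_id);
    CREATE INDEX idx_repos_language ON repos (language);
""")

query = """
    SELECT DISTINCT a.actor_id AS id
    FROM pull_requests AS a
    INNER JOIN repos AS b ON a.repo_id = b.id AND b.language = 'javascript'
"""

# EXPLAIN QUERY PLAN reports how SQLite intends to run the query,
# including which index (if any) it chooses for each table.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)

actor_ids = sorted(r[0] for r in conn.execute(query))
print(actor_ids)
```

The plan text varies between SQLite versions, so it is best read by eye rather than parsed.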

Try using IN
SELECT distinct(actor_id) as id
FROM pull_requests
WHERE pull_requests.repo_id IN (SELECT id FROM repos WHERE language = 'javascript')

Related

Remove duplicates from result in sql

I have the following SQL in a Java project:
select distinct * from drivers inner join licenses on drivers.user_id=licenses.issuer_id
inner join users on drivers.user_id=users.id
where (licenses.state='ISSUED' or drivers.status='WAITING')
and users.is_deleted=false
And the result in the database looks like this:
And I would like to get only one result instead of two duplicated results.
How can I do that?

Solution 1 - That's because some of the data has duplicate values. Write the distinct keyword once, followed by only the columns you want, like this:
SELECT DISTINCT id, creation_date, modification_date
FROM YourTable
Solution 2 - Apply distinct only to the ID; once you have the IDs, you can get all the data using an IN query:
select * from yourtable where id in (select distinct id from drivers inner join
licenses
on drivers.user_id=licenses.issuer_id
inner join users on drivers.user_id=users.id
where (licenses.state='ISSUED' or drivers.status='WAITING')
and users.is_deleted=false )

Enumerate the field names in the select, using COALESCE for fields whose value is null.

Usually you don't query distinct with * (all columns), because two rows that share a value in one column but differ in any other column are treated as different rows. So apply distinct only to the columns you care about, then fetch the rest of the data.
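A small illustration of that point, as a sketch with Python's sqlite3 module. The tables are cut down to a few columns and the rows are hypothetical: one driver holding two issued licenses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drivers (user_id INTEGER, status TEXT);
    CREATE TABLE licenses (id INTEGER, issuer_id INTEGER, state TEXT);
    -- Hypothetical data: one driver holding two issued licenses.
    INSERT INTO drivers VALUES (1, 'WAITING');
    INSERT INTO licenses VALUES (100, 1, 'ISSUED'), (101, 1, 'ISSUED');
""")

# DISTINCT * compares every column, including licenses.id.
all_columns = conn.execute("""
    SELECT DISTINCT *
    FROM drivers d
    JOIN licenses l ON d.user_id = l.issuer_id
""").fetchall()

# DISTINCT over the driver columns only collapses the two joined rows.
driver_only = conn.execute("""
    SELECT DISTINCT d.user_id, d.status
    FROM drivers d
    JOIN licenses l ON d.user_id = l.issuer_id
""").fetchall()

print(len(all_columns))
print(driver_only)
```

The two joined rows differ only in licenses.id, which is enough for DISTINCT * to keep both.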

I suspect that you want left joins like this:
select *
from users u left join
drivers d
on d.user_id = u.id and d.status = 'WAITING' left join
licenses l
on d.user_id = l.issuer_id and l.state = 'ISSUED'
where u.is_deleted = false and
(d.user_id is not null or l.issuer_id is not null);

SQL - How to connect an UPDATE with a COUNT-function?

Hi I am quite new to SQL and I was trying to search here and on Tutorial sites but somehow can't get to a solution. My problem is actually simple.
I have 3 tables: tweets, users, bon_results.
In order to get my final .csv-data I need to add some values into the table 'users'. But those values I need to create via a function first. I need to do the following: Each tweet has a favorite_count. I need to SUM it up and GROUP BY user, like:
SELECT user, count(favorite_count) FROM tweets GROUP BY user
The Point is I need to write this into the table 'users' into a column 'favorite_count' and I seriously don't know how to connect these two steps. I tried it via the UPDATE-Statement like this:
UPDATE users
SET favorite_count=COUNT(favorite_count) FROM tweets
WHERE tweets.user=users.user
I know that the part after the "=" is bullshit but I don't know how to get the function COUNT into this.
Advice would be marvelous.

You should always tag the RDBMS you are using in the question.
You can use a correlated subquery in the update in most databases:
update users u
set favorite_count = (
select count(favorite_count)
from tweets t
where t.user = u.user
);
If you don't want correlation, the other solutions are mostly vendor specific.
In SQL Server, you can use:
update u
set u.favorite_count = t.cnt
from users u join (
select
user,
count(favorite_count) as cnt
FROM tweets
GROUP BY user
) t on u.user = t.user;
In MySQL:
update users u join (
select
user,
count(favorite_count) as cnt
FROM tweets
GROUP BY user
) t on u.user = t.user
set u.favorite_count = t.cnt;
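The correlated-subquery form can be tried out in SQLite via Python's sqlite3 module, as sketched below with made-up rows. The column name user is quoted because it is a reserved word in some databases. The second statement is an assumption about intent: the question mentions summing favorite_count, so it swaps COUNT for SUM (with COALESCE so users without tweets get 0).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users ("user" TEXT PRIMARY KEY, favorite_count INTEGER);
    CREATE TABLE tweets ("user" TEXT, favorite_count INTEGER);
    -- Hypothetical sample rows.
    INSERT INTO users VALUES ('alice', NULL), ('bob', NULL), ('carol', NULL);
    INSERT INTO tweets VALUES ('alice', 3), ('alice', 5), ('bob', 7);
""")

select_users = 'SELECT "user", favorite_count FROM users ORDER BY "user"'

# COUNT, as written in the answers: number of tweets per user.
conn.execute("""
    UPDATE users
    SET favorite_count = (
        SELECT COUNT(t.favorite_count)
        FROM tweets AS t
        WHERE t."user" = users."user"
    )
""")
counts = conn.execute(select_users).fetchall()

# SUM variant (assumption): total favorites per user, 0 when there are none.
conn.execute("""
    UPDATE users
    SET favorite_count = (
        SELECT COALESCE(SUM(t.favorite_count), 0)
        FROM tweets AS t
        WHERE t."user" = users."user"
    )
""")
totals = conn.execute(select_users).fetchall()

print(counts)
print(totals)
```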

You were really close to the subquery version:
update users
set favorite_count = (
select count(favorite_count)
from tweets
where tweets.user=users.user
);
Inner join to a derived table (subquery) version :
update u
set favorite_count = t.favorite_count
from users u
inner join (
select user, count(favorite_count) as favorite_count
from tweets
group by user
) as t
on u.user = t.user

You can try this bro.
Update b
SET favorite_count = a.cnt
FROM Users b
JOIN (
SELECT user, COUNT(favorite_count) AS cnt
FROM tweets
GROUP BY user
) a
ON a.user = b.user

Inner Join a Table to Itself

I have a table that uses two identifying columns, let's call them id and userid. ID is unique in every record, and userid is unique to the user but is in many records.
What I need to do is get a record for the User by userid and then join that record to the first record we have for the user. The logic of the query is as follows:
SELECT v1.id, MIN(v2.id) AS entryid, v1.userid
FROM views v1
INNER JOIN views v2
ON v1.userid = v2.userid
I'm hoping that I don't have to join the table to a subquery that handles the min() piece of the code as that seems to be quite slow.

I guess (it's not entirely clear) you want to find for every user, the rows of the table that have minimum id, so one row per user.
In that case, you can use a subquery (a derived table) and join it to the table:
SELECT v.*
FROM views AS v
JOIN
( SELECT userid, MIN(id) AS entryid
FROM views
GROUP BY userid
) AS vm
ON vm.userid = v.userid
AND vm.entryid = v.id ;
The above can also be written using a Common Table Expression (CTE), if you like them:
; WITH vm AS
( SELECT userid, MIN(id) AS entryid
FROM views
GROUP BY userid
)
SELECT v.*
FROM views AS v
JOIN vm
ON vm.userid = v.userid
AND vm.entryid = v.id ;
Both would be quite efficient with an index on (userid, id).
With SQL-Server, you could write this using the ROW_NUMBER() window function:
; WITH viewsRN AS
( SELECT *
, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY id) AS rn
FROM views
)
SELECT * --- skipping the "rn" column
FROM viewsRN
WHERE rn = 1 ;
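The derived-table join can be checked on a small data set. This sketch uses SQLite through Python's sqlite3 module with hypothetical rows and a hypothetical index name. The second query is a variation, not part of the answer: it drops the id condition from the join so that every row is paired with the first id for its user, which is what the question describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE views (id INTEGER PRIMARY KEY, userid INTEGER);
    INSERT INTO views (id, userid) VALUES (1, 7), (2, 8), (3, 7), (4, 8), (5, 9);
    -- The (userid, id) index suggested above; the name is hypothetical.
    CREATE INDEX idx_views_userid_id ON views (userid, id);
""")

# One row per user: the row holding that user's minimum id.
first_rows = conn.execute("""
    SELECT v.id, v.userid
    FROM views AS v
    JOIN (SELECT userid, MIN(id) AS entryid
          FROM views
          GROUP BY userid) AS vm
      ON vm.userid = v.userid AND vm.entryid = v.id
    ORDER BY v.userid
""").fetchall()

# Every row, paired with the first id recorded for the same user.
linked_rows = conn.execute("""
    SELECT v.id, vm.entryid, v.userid
    FROM views AS v
    JOIN (SELECT userid, MIN(id) AS entryid
          FROM views
          GROUP BY userid) AS vm
      ON vm.userid = v.userid
    ORDER BY v.id
""").fetchall()

print(first_rows)
print(linked_rows)
```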

Well, to use the MIN function along with non-aggregate columns, you'd have to group the statement. That's possible with the query you have... (EDIT based on additional info)
SELECT MIN(v2.id) AS entryid, v1.id, v1.userid
FROM views v1
INNER JOIN views v2
ON v1.userid = v2.userid
GROUP BY v1.id, v1.userid
... however if this is just a simple example and you're looking to pull more data with this query, it quickly becomes an unfeasible solution.
What you seem to want is a list of all the user data in this view, with a link on each row leading back to the "first" record that exists for the same user. The above query will get you what you want, but there are much easier ways to determine the first record for each user:
SELECT v1.id, v1.userid
FROM views v1
ORDER BY v1.userid, v1.id
The first record for each unique user is your "entry point". I think I understand why you want to do it the way you specified, and the first query I gave will be reasonably performant, but you'll have to consider whether not having to use the order by clause to get the correct answer is worth it.

edit-1: as pointed out in the comments, this solution also uses a sub-query. However, it does not use aggregate functions, which (depending on the database) might have a huge impact on performance.
This can be achieved without an aggregating sub-query (see below).
Obviously, an index on views.userid is of tremendous value for performance.
SELECT v1.*
FROM views v1
WHERE v1.id = (
SELECT TOP 1 v2.id
FROM views v2
WHERE v2.userid = v1.userid
ORDER BY v2.id ASC
)

Update using Distinct SUM

I have found a few good resources that show I should be able to merge a select query with an update, but I just can't get my head around the correct formatting.
I have a select statement that is getting info for me, and I want to pretty much use those results to Update an account table that matches the accountID in the select query.
Here is the select statement:
SELECT DISTINCT SUM(b.workers)*tt.mealTax as MealCost,b.townID,b.accountID
FROM buildings AS b
INNER JOIN town_tax AS tt ON tt.townID = b.townID
GROUP BY b.townID,b.accountID
So in short I want the above query to be merged with:
UPDATE accounts AS a
SET a.wealth = a.wealth - MealCost
Where MealCost is the result from the select query. I am sure there is a way to put this into one, I just haven't quite been able to connect the dots to get it to run consistently without separating into two queries.

First, you don't need the distinct when you have a group by.
Second, how do you intend to link the two results? The SELECT query is returning multiple rows per account (one for each town). Presumably, the accounts table has only one row per account. Let's say that you wanted the average MealCost for the update.
The select query to get this is:
SELECT accountID, avg(MealCost) as avg_Mealcost
FROM (SELECT SUM(b.workers)*tt.mealTax as MealCost, b.townID, b.accountID
FROM buildings AS b INNER JOIN
town_tax AS tt
ON tt.townID = b.townID
GROUP BY b.townID,b.accountID
) a
GROUP BY accountID
Now, to put this into an update, you can use syntax like the following:
UPDATE accounts
set accounts.wealth = accounts.wealth - asum.avg_mealcost
from (SELECT accountID, avg(MealCost) as avg_Mealcost
FROM (SELECT SUM(b.workers)*tt.mealTax as MealCost, b.townID, b.accountID
FROM buildings AS b INNER JOIN
town_tax AS tt
ON tt.townID = b.townID
GROUP BY b.townID,b.accountID
) a
GROUP BY accountID
) asum
where accounts.accountid = asum.accountid
This uses SQL Server's UPDATE ... FROM syntax. PostgreSQL accepts a similar FROM clause, but the form is not standard SQL, so check whether your database supports it. MySQL puts the joined tables before the "set" and allows an alias on "update accounts".
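For databases without a FROM clause on UPDATE, a correlated subquery is one portable alternative. The sketch below is a swapped-in technique, not the answer's statement: it runs in SQLite via Python's sqlite3 module, uses hypothetical sample data, subtracts the average per-town meal cost as the question's update does, and writes the per-town cost as SUM(b.workers * tt.mealTax), which gives the same number when each town has a single tax rate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (accountID INTEGER PRIMARY KEY, wealth REAL);
    CREATE TABLE town_tax (townID INTEGER PRIMARY KEY, mealTax REAL);
    CREATE TABLE buildings (townID INTEGER, accountID INTEGER, workers INTEGER);
    -- Hypothetical sample rows.
    INSERT INTO accounts VALUES (1, 100.0), (2, 50.0);
    INSERT INTO town_tax VALUES (10, 2.0), (20, 1.0);
    INSERT INTO buildings VALUES (10, 1, 3), (10, 1, 2), (20, 1, 4), (20, 2, 6);
""")

# Inner query: meal cost per (town, account).
# Outer subquery: average of those costs for the account being updated.
conn.execute("""
    UPDATE accounts
    SET wealth = wealth - (
        SELECT AVG(per_town.MealCost)
        FROM (SELECT b.accountID, SUM(b.workers * tt.mealTax) AS MealCost
              FROM buildings AS b
              INNER JOIN town_tax AS tt ON tt.townID = b.townID
              GROUP BY b.townID, b.accountID) AS per_town
        WHERE per_town.accountID = accounts.accountID
    )
    -- Skip accounts without buildings so their wealth is not set to NULL.
    WHERE accountID IN (SELECT accountID FROM buildings)
""")

balances = conn.execute(
    "SELECT accountID, wealth FROM accounts ORDER BY accountID"
).fetchall()
print(balances)
```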

INNER JOIN vs IN

SELECT C.* FROM StockToCategory STC
INNER JOIN Category C ON STC.CategoryID = C.CategoryID
WHERE STC.StockID = #StockID
VS
SELECT * FROM Category
WHERE CategoryID IN
(SELECT CategoryID FROM StockToCategory WHERE StockID = #StockID)
Which is considered the correct (syntactically) and most performant approach and why?
The syntax in the latter example seems more logical to me but my assumption is the JOIN will be faster.
I have looked at the query plans and haven't been able to decipher anything from them.
Query Plan 1
Query Plan 2

The two syntaxes serve different purposes. Using the Join syntax presumes you want something from both the StockToCategory and Category table. If there are multiple entries in the StockToCategory table for each category, the Category table values will be repeated.
Using the IN predicate presumes that you want only items from the Category table whose ID meets some criteria. If a given CategoryId (assuming it is the PK of the Category table) exists multiple times in the StockToCategory table, it will only be returned once.
In your exact example, they will produce the same output; however, IMO the latter syntax makes your intent (only wanting categories) clearer.
Btw, there is a third syntax, which is similar to using IN:
Select ...
From Category
Where Exists (
Select 1
From StockToCategory
Where StockToCategory.CategoryId = Category.CategoryId
And StockToCategory.StockID = #StockId
)
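The difference between the three forms only shows up when the link table holds repeated pairs. A minimal sketch with Python's sqlite3 module, using hypothetical rows in which stock 5 is linked to category 1 twice, and a bound parameter in place of the StockID placeholder:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Category (CategoryID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE StockToCategory (StockID INTEGER, CategoryID INTEGER);
    INSERT INTO Category VALUES (1, 'Tools'), (2, 'Garden');
    -- Hypothetical data: stock 5 is linked to category 1 twice.
    INSERT INTO StockToCategory VALUES (5, 1), (5, 1), (5, 2), (6, 2);
""")
stock_id = 5

join_ids = [r[0] for r in conn.execute("""
    SELECT C.CategoryID FROM StockToCategory STC
    INNER JOIN Category C ON STC.CategoryID = C.CategoryID
    WHERE STC.StockID = ?
    ORDER BY C.CategoryID
""", (stock_id,))]

in_ids = [r[0] for r in conn.execute("""
    SELECT CategoryID FROM Category
    WHERE CategoryID IN
        (SELECT CategoryID FROM StockToCategory WHERE StockID = ?)
    ORDER BY CategoryID
""", (stock_id,))]

exists_ids = [r[0] for r in conn.execute("""
    SELECT CategoryID FROM Category
    WHERE EXISTS (
        SELECT 1 FROM StockToCategory
        WHERE StockToCategory.CategoryID = Category.CategoryID
          AND StockToCategory.StockID = ?
    )
    ORDER BY CategoryID
""", (stock_id,))]

print(join_ids)    # category 1 repeats once per matching link row
print(in_ids)
print(exists_ids)
```

If (StockID, CategoryID) is unique in StockToCategory, all three return the same rows.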

Syntactically (and semantically too) these are both correct. In terms of performance they are effectively equivalent; in fact, I would expect SQL Server to generate the exact same physical plans for these two queries.

I think these are just two ways to specify the same desired result.

For SQLite:
Table device_group_folders contains 10 records.
Table device_groups contains ~100000 records.
INNER JOIN: 31 ms
WITH RECURSIVE select_childs(uuid) AS (
SELECT uuid FROM device_group_folders WHERE uuid = '000B:653D1D5D:00000003'
UNION ALL
SELECT device_group_folders.uuid FROM device_group_folders INNER JOIN select_childs ON parent = select_childs.uuid
) SELECT device_groups.uuid FROM select_childs INNER JOIN device_groups ON device_groups.parent = select_childs.uuid;
WHERE: 31 ms
WITH RECURSIVE select_childs(uuid) AS (
SELECT uuid FROM device_group_folders WHERE uuid = '000B:653D1D5D:00000003'
UNION ALL
SELECT device_group_folders.uuid FROM device_group_folders INNER JOIN select_childs ON parent = select_childs.uuid
) SELECT device_groups.uuid FROM select_childs, device_groups WHERE device_groups.parent = select_childs.uuid;
IN: <1 ms
SELECT device_groups.uuid FROM device_groups WHERE device_groups.parent IN (WITH RECURSIVE select_childs(uuid) AS (
SELECT uuid FROM device_group_folders WHERE uuid = '000B:653D1D5D:00000003'
UNION ALL
SELECT device_group_folders.uuid FROM device_group_folders INNER JOIN select_childs ON parent = select_childs.uuid
) SELECT * FROM select_childs);
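To confirm that the JOIN and IN forms return the same rows before comparing timings, the queries can be run on a tiny hypothetical folder tree. This sketch uses Python's sqlite3 module; the uuid values are invented, and the IN variant keeps the WITH clause at the top level instead of nesting it inside the subquery. Timings are not reproduced here, since they depend on table sizes and indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device_group_folders (uuid TEXT PRIMARY KEY, parent TEXT);
    CREATE TABLE device_groups (uuid TEXT PRIMARY KEY, parent TEXT);
    -- Hypothetical tree: root -> child -> grandchild, plus an unrelated folder.
    INSERT INTO device_group_folders VALUES
        ('root', NULL), ('child', 'root'), ('grandchild', 'child'), ('other', NULL);
    INSERT INTO device_groups VALUES
        ('g1', 'root'), ('g2', 'child'), ('g3', 'grandchild'), ('g4', 'other');
""")

# Recursive CTE collecting the starting folder and all of its descendants.
cte = """
    WITH RECURSIVE select_childs(uuid) AS (
        SELECT uuid FROM device_group_folders WHERE uuid = 'root'
        UNION ALL
        SELECT f.uuid FROM device_group_folders AS f
        INNER JOIN select_childs ON f.parent = select_childs.uuid
    )
"""

join_sql = cte + """
    SELECT device_groups.uuid
    FROM select_childs
    INNER JOIN device_groups ON device_groups.parent = select_childs.uuid
    ORDER BY device_groups.uuid
"""

in_sql = cte + """
    SELECT uuid FROM device_groups
    WHERE parent IN (SELECT uuid FROM select_childs)
    ORDER BY uuid
"""

join_result = [r[0] for r in conn.execute(join_sql)]
in_result = [r[0] for r in conn.execute(in_sql)]
print(join_result)
print(in_result)
```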
