Update of many rows with join extremely slow

I have a table with five relevant fields - id, source, iid, track_hash, alias. I want to group all entries into groups sharing a common track_hash, and then, for each row, save into its alias field the id of the row in its group with the lowest source (ties broken in favor of the highest iid). To do this I wrote the following query:
with best as
    (SELECT id as bid, track_hash FROM
        (SELECT id, track_hash,
                RANK() OVER (
                    PARTITION BY track_hash
                    ORDER BY source asc, iid DESC
                ) rank
         from albums)
     where rank = 1)
select bid, a.* from albums a inner join best
    on a.track_hash = best.track_hash
This takes a completely reasonable 2 seconds on 24k rows. Now, instead of simply seeing this id, I want to actually save it. For this, I used the following very similar query:
with best as
    (SELECT id as bid, track_hash FROM
        (SELECT id, track_hash,
                RANK() OVER (
                    PARTITION BY track_hash
                    ORDER BY source asc, iid DESC
                ) rank
         from albums)
     where rank = 1)
update albums
set alias = bid
FROM albums a inner join best
    on a.track_hash = best.track_hash
However, this one takes anywhere between 1 and 10 minutes, and I really don't understand why. Doesn't the engine have to match every row to its best.id/alias anyway, which is exactly what I'm doing with my update? Why is this happening and what am I doing wrong?
Query plan looks like this:
MATERIALIZE 1
CO-ROUTINE 4
SCAN TABLE albums USING INDEX track_hash_idx
USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
SCAN SUBQUERY 4
SCAN TABLE albums USING COVERING INDEX track_hash_idx
SEARCH SUBQUERY 1 USING AUTOMATIC PARTIAL COVERING INDEX (rank=?)
SEARCH TABLE albums AS a USING COVERING INDEX track_hash_idx (track_hash=?)

You don't need the join to albums (again).
The UPDATE ... FROM syntax already provides an implicit join of albums to best:
UPDATE albums AS a
SET alias = b.bid
FROM best AS b
WHERE a.track_hash = b.track_hash
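For reference, the complete statement would then combine the CTE with that corrected UPDATE, something like the sketch below (note that UPDATE ... FROM requires SQLite 3.33 or later):

with best as
    (SELECT id as bid, track_hash FROM
        (SELECT id, track_hash,
                RANK() OVER (
                    PARTITION BY track_hash
                    ORDER BY source asc, iid DESC
                ) rank
         from albums)
     where rank = 1)
UPDATE albums AS a
SET alias = b.bid
FROM best AS b
WHERE a.track_hash = b.track_hash

This way albums is scanned once as the update target and matched against best, instead of being re-joined to itself.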


Finding a better way to get the most recent status

I have this way of checking the most recent status, by grabbing the related detail record with the max(created_on) date.
It works, but it just seems really clumsy, and I am hoping to find a better way.
select rd.stat_code from rec r, rec_detl rd
where r.id = rd.rec_id
and r.id = 13455478
and rd.id in (
select id from rec_detl rd1
where rd1.rec_id = r.id
and rd1.created_on = (
select max(created_on) from rec_detl rd2
where rd2.rec_id = r.id
));
In that example, the rec_detl table has a stat_code column indicating the status. The rec_detl table has a foreign key that points to the rec table, so each row from the rec table corresponds to multiple rec_detl records.
Basically, a new rec_detl is inserted each time the rec record has its status updated.
So my questions are "Is there some better and less clumsy way of achieving the same thing and grabbing the most recent status?" and "other than the fact this way is sort of clumsy and ugly, is there anything wrong with this approach?"
For a single record you may use the fetch first 1 rows only clause, available since Oracle 12c.
select *
from rec_detl
where rec_id = 13455478
order by created_on desc
fetch first 1 rows only
As long as you do not use any columns from the rec table, there is no need to join anything.
If you need rec columns alongside rec_detl columns for a small number of rows, and there is an index on the rec_detl.rec_id column, this can be adapted to a lateral join. It has to be a small number of rows because this query introduces nested-loop lookups.
select *
from rec
left join lateral(
    select *
    from rec_detl
    where rec.id = rec_detl.rec_id
    order by created_on desc
    fetch first 1 rows only
) q
on 1 = 1
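As a side note, Oracle 12c also supports OUTER APPLY, which expresses the same correlated lookup without the on 1 = 1; a minimal equivalent sketch:

select *
from rec
outer apply (
    select *
    from rec_detl
    where rec_detl.rec_id = rec.id
    order by created_on desc
    fetch first 1 rows only
) q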
For mass processing it would be better to use keep (dense_rank first ...), because it will use a general join (a hash join for large datasets, typically). But it requires listing all the columns explicitly, either aggregated or grouped by:
select
    rec.id
    , max(rec.val) as val
    , max(rec_detl.stat_code) keep(dense_rank first order by created_on desc) as last_status
from rec
left join rec_detl
    on rec.id = rec_detl.rec_id
group by
    rec.id
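An analytic-function alternative for mass processing, as a rough sketch (it avoids the explicit column listing, at the cost of computing a row number for every detail row):

select *
from (
    select rd.*,
           row_number() over (partition by rec_id order by created_on desc) rn
    from rec_detl rd
)
where rn = 1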

How to prevent duplicated records when applying ORDER BY NEWID() to fetch them randomly?

I tried using the solution provided in Return rows in random order to fetch random records in my query. But I have to add NEWID() to the list of columns I want to fetch, or otherwise I will not be able to add ORDER BY NEWID(). Unfortunately, that makes my result set contain duplicate records.
To clarify, this query returns duplicates because NEWID() is among the requested columns:
SELECT distinct top 4
Books.BookID,
Books.Authors,
Books.ShortTitle,
NEWID()
FROM Books
inner join Publishers on Books.PublisherID = Publishers.PublisherID
ORDER BY NEWID()
How can I overcome this issue of not fetching unique records (Here BookID is PK)?
You definitely don't want to add NEWID() to each row; that will undo the DISTINCT. Instead, use GROUP BY with ORDER BY NEWID():
SELECT top 4 b.BookID, b.Authors, b.ShortTitle
FROM Books b inner join
Publishers p
on b.PublisherID = p.PublisherID
GROUP BY b.BookId, b.Authors, B.ShortTitle
ORDER BY NEWID();
It will work fine. You can order by values that are not in the select list.
Or, if you still want to use NEWID(), just build the distinct list first and apply the NEWID() ordering afterwards:
SELECT top 4 a.BookID, a.Authors, a.ShortTitle
FROM
(SELECT distinct
    Books.BookID AS BookID,
    Books.Authors AS Authors,
    Books.ShortTitle AS ShortTitle
FROM Books
inner join Publishers on Books.PublisherID = Publishers.PublisherID) a
ORDER BY NEWID()

Postgres left outer join appears to not be using table indices

Let me know if this should be posted on DBA.stackexchange.com instead...
I have the following query:
SELECT DISTINCT "court_cases".*
FROM "court_cases"
LEFT OUTER JOIN service_of_processes
ON service_of_processes.court_case_id = court_cases.id
LEFT OUTER JOIN jobs
ON jobs.service_of_process_id = service_of_processes.id
WHERE
(jobs.account_id = 250093
OR court_cases.account_id = 250093)
ORDER BY
court_cases.court_date DESC NULLS LAST,
court_cases.id DESC
LIMIT 30
OFFSET 0;
But it takes a good 2-4 seconds to run, and in a web application this is unacceptable for a single query.
I ran EXPLAIN (ANALYZE, BUFFERS) on the query as suggested on the PostgreSQL wiki, and have put the results here: http://explain.depesz.com/s/Yn6
The table definitions for those tables involved in the query is here (including the indexes on foreign key relationships):
http://sqlfiddle.com/#!15/114c6
Is it having issues using the indexes because the WHERE clause is querying from two different tables? What kind of index or change to the query can I make to make this run faster?
These are the current sizes of the tables in question:
PSQL=# select count(*) from service_of_processes;
count
--------
103787
(1 row)
PSQL=# select count(*) from jobs;
count
--------
108995
(1 row)
PSQL=# select count(*) from court_cases;
count
-------
84410
(1 row)
EDIT: I'm on Postgresql 9.3.1, if that matters.
OR clauses can make optimizing a query difficult. One idea is to split the two parts of the query into two separate subqueries. This actually simplifies one of them a lot (the one on court_cases.account_id).
Try this version:
(SELECT cc.*
FROM "court_cases" cc
WHERE cc.account_id = 250093
ORDER BY cc.court_date DESC NULLS LAST,
cc.id DESC
LIMIT 30
) UNION ALL
(SELECT cc.*
FROM "court_cases" cc LEFT OUTER JOIN
service_of_processes sop
ON sop.court_case_id = cc.id LEFT OUTER JOIN
jobs j
ON j.service_of_process_id = sop.id
WHERE (j.account_id = 250093 AND cc.account_id <> 250093)
ORDER BY cc.court_date DESC NULLS LAST, id DESC
LIMIT 30
)
ORDER BY court_date DESC NULLS LAST,
id DESC
LIMIT 30 OFFSET 0;
And add the following indexes:
create index court_cases_accountid_courtdate_id on court_cases(account_id, court_date, id);
create index jobs_accountid_sop on jobs(account_id, service_of_process_id);
Note that the second query uses and cc.account_id <> 250093, which prevents duplicate records. This eliminates the need for distinct or for union (instead of union all).
I'd try modifying the query as follows:
SELECT DISTINCT "court_cases".*
FROM "court_cases"
LEFT OUTER JOIN service_of_processes
ON service_of_processes.court_case_id = court_cases.id
LEFT OUTER JOIN jobs
ON jobs.service_of_process_id = service_of_processes.id and jobs.account_id = 250093
WHERE
(court_cases.account_id = 250093)
ORDER BY
court_cases.court_date DESC NULLS LAST,
court_cases.id DESC
LIMIT 30
OFFSET 0;
I think the issue is that the WHERE filter is not properly decomposed by the query planner/optimizer, which looks like a really strange performance bug.

Inner Join a Table to Itself

I have a table that uses two identifying columns, let's call them id and userid. ID is unique in every record, and userid is unique to the user but is in many records.
What I need to do is get a record for the User by userid and then join that record to the first record we have for the user. The logic of the query is as follows:
SELECT v1.id, MIN(v2.id) AS entryid, v1.userid
FROM views v1
INNER JOIN views v2
ON v1.userid = v2.userid
I'm hoping that I don't have to join the table to a subquery that handles the min() piece of the code as that seems to be quite slow.
I guess (it's not entirely clear) that you want to find, for every user, the row of the table that has the minimum id, so one row per user.
In that case, you can use a subquery (a derived table) and join it to the table:
SELECT v.*
FROM views AS v
JOIN
( SELECT userid, MIN(id) AS entryid
FROM views
GROUP BY userid
) AS vm
ON vm.userid = v.userid
AND vm.entryid = v.id ;
The above can also be written using a Common Table Expression (CTE), if you like them:
; WITH vm AS
( SELECT userid, MIN(id) AS entryid
FROM views
GROUP BY userid
)
SELECT v.*
FROM views AS v
JOIN vm
ON vm.userid = v.userid
AND vm.entryid = v.id ;
Both would be quite efficient with an index on (userid, id).
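A sketch of that suggested index (the name is just illustrative):

create index idx_views_userid_id on views (userid, id);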
With SQL Server, you could write this using the ROW_NUMBER() window function:
; WITH viewsRN AS
( SELECT *
, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY id) AS rn
FROM views
)
SELECT * --- skipping the "rn" column
FROM viewsRN
WHERE rn = 1 ;
Well, to use the MIN function along with non-aggregate columns, you'd have to group the statement. That's possible with the query you have... (EDIT based on additional info)
SELECT MIN(v2.id) AS entryid, v1.id, v1.userid
FROM views v1
INNER JOIN views v2
ON v1.userid = v2.userid
GROUP BY v1.id, v1.userid
... however if this is just a simple example and you're looking to pull more data with this query, it quickly becomes an unfeasible solution.
What you seem to want is a list of all the user data in this view, with a link on each row leading back to the "first" record that exists for the same user. The above query will get you what you want, but there are much easier ways to determine the first record for each user:
SELECT v1.id, v1.userid
FROM views v1
ORDER BY v1.userid, v1.id
The first record for each unique user is your "entry point". I think I understand why you want to do it the way you specified, and the first query I gave will be reasonably performant, but you'll have to decide whether the self-join is worth it compared to simply using an ORDER BY clause to get the same answer.
edit-1: as pointed out in the comments, this solution also uses a sub-query. However, it does not use aggregate functions, which (depending on the database) might have a huge impact on the performance.
It can be done without a sub-query in the FROM clause (see below).
Obviously, an index on views.userid is of tremendous value for the performance.
SELECT v1.*
FROM views v1
WHERE v1.id = (
SELECT TOP 1 v2.id
FROM views v2
WHERE v2.userid = v1.userid
ORDER BY v2.id ASC
)
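A portable variant of the same idea, as a sketch: since the "first" record is simply the one with the minimum id, MIN() can replace TOP 1, which also works on engines without TOP:

SELECT v1.*
FROM views v1
WHERE v1.id = (
    SELECT MIN(v2.id)
    FROM views v2
    WHERE v2.userid = v1.userid
)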

Using SQL(ite) how do I find the lowest unique child for each parent in a one to many relationship during a JOIN?

I have two tables with a many to one relationship which represent lots and bids within an auction system. Each lot can have zero or more bids associated with it. Each bid is associated with exactly one lot.
My table structure (with irrelevant fields removed) boils down to a lot table and a bid table, where each bid row carries an id, the lot it belongs to, and its value.
For one type of auction the winning bid is the lowest unique bid for a given lot.
E.g. if there are four bids for a given lot: [1, 1, 2, 4] the lowest unique bid is 2 (not 1).
So far I have been able to construct a query which will find the lowest unique bid for a single specific lot (assuming the lot ID is 123):
SELECT id, value FROM bid
WHERE lot = 123
AND value = (
    SELECT value FROM bid
    WHERE lot = 123
    GROUP BY value HAVING COUNT(*) = 1
    ORDER BY value
)
This works as I would expect (although I'm not sure it's the most graceful approach).
I would now like to construct a query which will get the lowest unique bids for all lots at once. Essentially I want to perform a JOIN on the two tables where one column is the result of something similar to the above query. I'm at a loss as to how to use the same approach for finding the lowest unique bid in a JOIN though.
Am I on the wrong track with this approach to finding the lowest unique bid? Is there another way I can achieve the same result?
Can anyone help me expand this query into a JOIN?
Is this even possible in SQL or will I have to do it in my application proper?
Thanks in advance.
(I am using SQLite 3.5.9 as found in Android 2.1)
You can use group by with a "having" condition to find the set of bids without duplicate amounts for each lot.
select lotname, amt
from lot inner join bid on lot.id = bid.lotid
group by lotname, amt having count(*) = 1
You can in turn make that query an inline view and select the lowest bid from it for each lot.
select lotname, min(amt)
from
(
select lotname, amt
from lot inner join bid on lot.id = bid.lotid
group by lotname, amt having count(*) = 1
) as X
group by X.lotname
EDIT: Here's how to get the bid id using this approach, using nested inline views:
select bid.id as WinningBidId, Y.lotname, bid.amt
from
bid
join
(
select x.lotid, lotname, min(amt) as TheMinAmt
from
(
select lot.id as lotid, lotname, amt
from lot inner join bid on lot.id = bid.lotid
group by lot.id, lotname, amt
having count(*)=1
) as X
group by x.lotid, x.lotname
) as Y
on Y.lotid = bid.lotid and Y.TheMinAmt = Bid.amt
I think you need some subqueries to get to your desired data:
SELECT [b].[id] AS [BidID], [l].[id] AS [LotID],
[l].[Name] AS [Lot], [b].[value] AS [BidValue]
FROM [bid] [b]
INNER JOIN [lot] [l] ON [b].[lot] = [l].[id]
WHERE [b].[id] =
(SELECT [min].[id]
FROM [bid] [min]
WHERE [min].[lot] = [b].[lot]
AND NOT EXISTS(SELECT *
FROM [bid] [check]
WHERE [check].[lot] = [min].[lot]
AND [check].[value] = [min].[value]
AND [check].[id] <> [min].[id])
ORDER BY [min].[value] ASC
LIMIT 1)
The innermost query (within the EXISTS) checks that there are no other bids on that lot having the same value.
The query in the middle (with the LIMIT) determines the minimum bid among all unique bids on that lot.
The outer query makes this happen for all lots that have bids.
SELECT lot.name, (SELECT MIN(bid.value) FROM bid WHERE bid.lot = lot.ID) AS MinBid
FROM lot INNER JOIN
bid ON lot.ID = bid.lot
If I understand you correctly, this will give you every lot and its smallest bid (though note it is the smallest bid overall, not the smallest unique bid).
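For completeness, here is a minimal sketch of the group-by approach from the first answer, rewritten against the question's own columns (assuming bid(id, lot, value)); it should run on SQLite:

SELECT b.id, b.lot, b.value
FROM bid b
JOIN (
    -- lowest value per lot, among values that appear exactly once in that lot
    SELECT lot, MIN(value) AS win_value
    FROM (
        SELECT lot, value
        FROM bid
        GROUP BY lot, value
        HAVING COUNT(*) = 1
    ) u
    GROUP BY lot
) w ON w.lot = b.lot AND w.win_value = b.value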