Improving Mass Update Query Performance - sql

I'm looking to run some remediation on my database which requires adding some new fields to a table and backfilling them based on specific criteria. There are a LOT of records and I'd prefer the backfill to run relatively fast, as my current query is taking forever to run.
I attempted updating via subqueries, but this doesn't seem to be very performant.
The query below is edited to show the general process (so the syntax might be a little off). Hopefully you can understand what I'm trying to do.
I'd like to update every single record in the accounts table.
I'd like to do this by going through each record and running a number of checks against the ID of that record prior to updating. For any records that don't match up in the join, I just want to set them to 0.
Doing this over the course of a few hundred thousand records seems to take forever. I'm sure there is a faster, more efficient way to go about doing this. Using Postgres.
UPDATE
    accounts
SET
    account_age = a2.account_age
FROM
(
    with dataPoints as (
        select
            account__c.account_id as account_id,
            COALESCE(account__c.account_age, 0) as account_age
        from account__c
        LEFT OUTER join points on points.account_id = account__c.id
        LEFT OUTER join goal__c on goal__c.id = points.goal_id
        group by account__c.account_id, account__c.account_age
    )
    select
        account_id,
        max(dataPoints.account_age) as account_age
    from dataPoints
    group by account_id
) as a2
WHERE accounts.id = a2.account_id;
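The shape I'm aiming for is roughly this (sketch only, re-using the simplified names above and assuming the new account_age column starts out NULL): aggregate once per account, update by joining on the primary key, then zero out the unmatched rows in a second cheap pass.

WITH agg AS (
    SELECT account_id,
           MAX(COALESCE(account_age, 0)) AS account_age
    FROM account__c
    GROUP BY account_id
)
UPDATE accounts
SET account_age = agg.account_age
FROM agg
WHERE accounts.id = agg.account_id;

-- Any account the aggregate didn't touch just gets the default.
UPDATE accounts
SET account_age = 0
WHERE account_age IS NULL;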

Related

Request optimisation

I have two tables. The first lists all the courses (trips) that the buses run:
dbo.Courses_Bus
|ID|ID_Bus|ID_Line|DateHour_Start_Course|DateHour_End_Course|
The second holds all payments made on these buses:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|
The goal is to add the notion of a Line to the Payments table, to get something like this:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|Line|
So I tried to do this :
/** I first added a Line column to the dbo.Payments table**/
UPDATE
Table_A
SET
Table_A.Line = Table_B.ID_Line
FROM
[dbo].[Payments] AS Table_A
INNER JOIN [dbo].[Courses_Bus] AS Table_B
ON Table_A.ID_Bus = Table_B.ID_Bus
AND Table_A.DateHour_Payment BETWEEN Table_B.DateHour_Start_Course AND Table_B.DateHour_End_Course
And this
UPDATE
Table_A
SET
Table_A.Line = Table_B.ID_Line
FROM
[dbo].[Payments] AS Table_A
INNER JOIN (
SELECT
P.*,
CP.ID_Line AS ID_Line
FROM
[dbo].[Payments] AS P
INNER JOIN [dbo].[Courses_Bus] CP ON CP.ID_Bus = P.ID_Bus
AND CP.DateHour_Start_Course <= P.Date
AND CP.DateHour_End_Course >= P.Date
) AS Table_B ON Table_A.ID_Bus = Table_B.ID_Bus
The main problem, apart from the fact that these requests do not seem to work properly, is that each table has several million rows and grows every day. Because of the date/hour filter (mandatory, since a single bus can run on several lines every day), SSMS must compare each row of the second table to all rows of the other table.
So it takes an extremely long time, and that will only get worse every day.
How can I make it work and optimise it ?
Assuming that this is the logic you want:
UPDATE p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course;
To optimize this query, you want an index on Courses_Bus(ID_Bus, DateHour_Start_Course, DateHour_End_Course).
There might be slightly more efficient ways to optimize the query, but your question doesn't have enough information -- is there always exactly one match, for instance?
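Spelled out, that index would be something like this (the index name is just an example):

CREATE INDEX IX_Courses_Bus_Bus_Dates
    ON dbo.Courses_Bus (ID_Bus, DateHour_Start_Course, DateHour_End_Course);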
Another big issue is that updating all the rows is quite expensive. You might find that it is better to do this in loops, one chunk at a time:
UPDATE TOP (10000) p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
WHERE p.Line IS NULL;
Once again, though, this structure depends on all the initial values being NULL and an exact match for all rows.
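A minimal sketch of that loop (again assuming every row that still needs a value has Line set to NULL):

DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (10000) p
    SET p.Line = cb.ID_Line
    FROM [dbo].[Payments] p JOIN
         [dbo].[Courses_Bus] cb
         ON p.ID_Bus = cb.ID_Bus AND
            p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
    WHERE p.Line IS NULL;

    SET @rows = @@ROWCOUNT;  -- stop once no rows are left to update
END;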
Thank you Gordon for your answer.
I have investigated and came up with this query:
MERGE [dbo].[Payments] AS p
USING [dbo].[Courses_Bus] AS cb
ON p.ID_Bus= cb.ID_Bus AND
p.DateHour_Payment>= cb.DateHour_Start_Course AND
p.DateHour_Payment<= cb.DateHour_End_Course
WHEN MATCHED THEN
UPDATE SET p.Line = cb.ID_Line;
as it seems to be the most suitable in an MS SQL environment.
However, it came back with this error:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
I understood this to mean that it finds several rows with identical
[p.ID_Bus= cb.ID_Bus AND
p.DateHour_Payment >= cb.DateHour_Start_Course AND
p.DateHour_Payment <= cb.DateHour_End_Course]
Yes, this is a possible case; however, the ID is different each time.
For example, two blue cards might be beeped at the same time, or there might be a loss of network and the equipment is updated later, putting the beeps at the same time. These are different rows that must be treated separately, and you can obtain, for example:
|ID|ID_Bus|DateHour_Payment|Line|
|56|204|2021-01-01 10:00:00|15|
|82|204|2021-01-01 10:00:00|15|
How can I improve this query so that it takes into account different payment IDs?
I can't figure out how to do this with the help I find online. Maybe this method is not the right one in this context.
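For what it's worth, one pattern that guarantees at most one source row per payment is to rank the candidate courses for each payment and keep only the first; this is a sketch of that idea (the ORDER BY tie-breaker here is arbitrary and would need to be chosen deliberately):

MERGE [dbo].[Payments] AS p
USING (
    SELECT Payment_ID, ID_Line
    FROM (
        SELECT pay.ID AS Payment_ID,
               cb.ID_Line,
               ROW_NUMBER() OVER (PARTITION BY pay.ID
                                  ORDER BY cb.DateHour_Start_Course DESC) AS rn
        FROM [dbo].[Payments] pay
        JOIN [dbo].[Courses_Bus] cb
          ON pay.ID_Bus = cb.ID_Bus
         AND pay.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
    ) ranked
    WHERE rn = 1          -- keep exactly one course per payment
) AS src
ON p.ID = src.Payment_ID
WHEN MATCHED THEN
    UPDATE SET p.Line = src.ID_Line;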

Force unique key in view to avoid merge join

I'm trying to optimize a query. Basically, there are 3 parts to a transaction that can be repeated. I log all communications, but want to get the "freshest" of the 3 parts. The 3 parts are all linked through a single intermediate table (unfortunately) which is what is slowing this whole thing down (too much normalization?).
There is the center of the "star", "Transactions"; then the inner spokes, all represented by "TransactionDetails", which refer to the hub using the "Transactions" primary key; then the outer spokes (PPGDetails, TicketDetails and CompletionDetails), all of which refer to "TransactionDetails" by its primary key.
Each of "PPGDetails", "TicketDetails" and "CompletionDetails" will have exactly one row in "TransactionDetails" that they link to, by primary key. There can be many of each of these pairs of objects per transaction.
So, in order to get the most recent TicketDetails for a transaction, I use this view:
CREATE VIEW [dbo].[TicketTransDetails] AS
select *
from TicketDetails tkd
join (select MAX(TicketDetail_ID) as TicketDetail_ID
from TicketDetails temp1
join TransactionDetails temp2
on temp1.TransactionDetail_ID = temp2.TransactionDetail_ID
group by temp2.Transaction_ID) qq
on tkd.TicketDetail_ID = qq.TicketDetail_ID
join TransactionDetails td
on tkd.TransactionDetail_ID = td.TransactionDetail_ID
GO
The other 2 detail types have similar views.
Then, to get all of the transaction details I want, one row per transaction, I use:
select *
from Transactions t
join CompletionTransDetails cpd
on t.Transaction_ID = cpd.Transaction_ID
left outer join TicketTransDetails tkd
on t.Transaction_ID = tkd.Transaction_ID
left outer join PPGTransDetails ppd
on t.Transaction_ID = ppd.Transaction_ID
where cpd.DateAndTime between '2/1/2017' and '3/1/2017'
It is by design that I want ONLY transactions that have at least 1 "CompletionDetail", but 0 or more "PPGDetail" or "TicketDetail".
This query returns the correct results, but takes 40 seconds to execute, on decent server hardware, and a "Merge Join (Left Outer Join)" immediately before the "SELECT" returns takes 100% of the execution plan time.
If I take out the join to either PPGTransDetails or TicketTransDetails in the final query, it brings the execution time down to ~20 seconds, so a marked improvement, but still doing a Merge Join over a significant number of records (many extraneous, I assume).
When just a single transaction is selected (via where clause), the query only takes about 4 seconds, and the query, then, has a final step of "Nested Loops" which also takes a large portion of the time (96%). I would like this query to take less than a second.
Since the views don't have a primary key, I assume that is causing the Merge Join to proceed. That said, I am having trouble creating a query that emulates this functionality - much less one that is more efficient.
Can anyone help me recognize what I may be missing?
Thanks!
--mobrien118
Edit: Adding more info -
Here is the effective data model:
Essentially, for a single transaction, there can be MANY PPGDetails, TicketDetails and CompletionDetails, but each one will have its own TransactionDetails (they are one-to-one, but not enforced in the model, just in software).
There are currently:
1,619,307 "Transactions"
3,564,518 "TransactionDetails"
512,644 "PPGDetails"
1,471,826 "TicketDetails"
1,580,043 "CompletionDetails"
There are currently no foreign key constraints or indexes set up on these items.
First a quick remark:
which also takes a large portion of the time (96%).
This is a bit of a (common) misconception. The 96% there is an estimate of how many resources that 'block' will need. It by no means indicates that 96% of the time inside the query was spent on it. I've had situations where something that took over half of the query time-wise was attributed virtually no cost.
Additionally, you seem to be assuming that when you query/join to the view, the system will first prepare the data from the view and then later use that result to further 'work out the query'. This is not the case: the system will 'expand' the view and do a 'combined' query, taking everything into account.
For us to understand what's going on you'll need to provide us with the query plan (.sqlplan if you use SqlSentry Plan Explorer), it's that or a full explanation on the table layout, indexes, foreign keys, etc... and a bit of explanation on the data (total rows, expected matches between tables, etc...)
PS: even though everybody seems to be touting 'hash joins' as the solution to everything, nested loops and merge joins often are more efficient.
(trying to understand your queries, is this view equivalent to your view?)
[edit: incorrect view removed to avoid confusion]
Second try: (think I have it right this time)
CREATE VIEW [dbo].[TicketTransDetails] AS
SELECT td.Transaction_ID, tkd.*
FROM TicketDetails tkd
JOIN TransactionDetails td
ON td.TransactionDetail_ID = tkd.TransactionDetail_ID
JOIN (SELECT MAX(TicketDetail_ID) as max_TicketDetail_ID, temp2.Transaction_ID
FROM TicketDetails temp1
JOIN TransactionDetails temp2
ON temp1.TransactionDetail_ID = temp2.TransactionDetail_ID
GROUP BY temp2.Transaction_ID) qq
ON qq.max_TicketDetail_ID = tkd.TicketDetail_ID
AND qq.Transaction_ID = td.Transaction_ID
It might not be any faster when querying the entire table, but it should be when fetching specific records from the Transactions table.
Indexing-wise you probably want a unique index on TicketDetails (TransactionDetail_ID, TicketDetail_ID)
You'll need similar constructs for the other tables of course.
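For TicketDetails that index would be something along these lines (the index name is just an example):

CREATE UNIQUE INDEX IX_TicketDetails_TransDetail_Ticket
    ON TicketDetails (TransactionDetail_ID, TicketDetail_ID);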
Thinking it through a bit further I think this would work too:
CREATE VIEW [dbo].[TicketTransDetails]
AS
SELECT *
FROM (
SELECT td.Transaction_ID,
TicketDetail_ID_rownr = ROW_NUMBER() OVER (PARTITION BY td.Transaction_ID ORDER BY tkd.TicketDetail_ID DESC),
tkd.*
FROM TicketDetails tkd
JOIN TransactionDetails td
ON td.TransactionDetail_ID = tkd.TransactionDetail_ID
) xx
WHERE TicketDetail_ID_rownr = 1 -- we want the "first one from the end" only
It looks quite a bit more readable but I'm not sure it would be faster or not... you'll have to compare timings and query plans.
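Either way, you'd typically query the view with a filter on Transaction_ID so the optimizer can limit the window work to one transaction, e.g. (ID value made up):

SELECT *
FROM TicketTransDetails
WHERE Transaction_ID = 12345;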

Fast Query with Left Join To Get Record With Latest Date

For this problem, I have 2 main existing postgres tables I am working with. The first table is named client, the second table is named task.
A single client can have multiple tasks, each with its own scheduled_date and scheduled_time.
I'm trying to run a query that will return a list of all clients along with the date/time of their latest task.
Currently, my query works and looks something like this...
SELECT
c.*,
(t1.scheduled_date||' '||t1.scheduled_time)::timestamp AS latest_task_datetime
FROM
client c
LEFT JOIN
task t1 ON t1.client_id = c.client_id
LEFT JOIN
task t2 ON t2.client_id = t1.client_id AND (((t1.scheduled_date||' '||t1.scheduled_time)::timestamp < (t2.scheduled_date||' '||t2.scheduled_time)::timestamp) OR ((t1.scheduled_date||' '||t1.scheduled_time)::timestamp = (t2.scheduled_date||' '||t2.scheduled_time)::timestamp AND t1.task_id < t2.task_id));
The problem I'm having is the actual query I am working with deals with a lot more other tables (7+ tables), and every table has a lot of data in them, so because of the two left joins shown above, it is slowing down the execution of the query from 4 seconds to almost 45 seconds, which of course is very bad.
Does anyone know a possible faster way to write this query to run more efficiently?
A question I think you might initially have after seeing this is why I have scheduled_date and scheduled_time as separate columns? Why not have it as just a single timestamp column? The answer to that is this is an existing table that I can't change, at least not easily without requiring a lot of work making the changes in the entire server to support it.
Edit: Not quite the solution, but I just ended up doing it a different way. (See my comment below)
If you want to get multiple columns of information from different tables -- but one row for each client and his/her latest task, then you can use distinct on:
SELECT DISTINCT ON (c.client_id) c.*, t.*
FROM client c LEFT JOIN
task t
ON t.client_id = c.client_id
ORDER BY c.client_id, t.scheduled_date desc, t.scheduled_time desc;
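If you also need the combined date/time as a single column, the same query can expose it directly; this is just a sketch re-using the concatenation from the question (it returns NULL for clients with no tasks):

SELECT DISTINCT ON (c.client_id)
       c.*,
       (t.scheduled_date || ' ' || t.scheduled_time)::timestamp AS latest_task_datetime
FROM client c LEFT JOIN
     task t
     ON t.client_id = c.client_id
ORDER BY c.client_id, t.scheduled_date DESC, t.scheduled_time DESC;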

MySQL statement taking abnormally long

I have two tables in one database with about 50,000 to 70,000 rows. Both are MyISAM. The first, yahooprices, contains SKU codes (column code) for items and pricing (column price). The second table, combined_stock, contains partnumber (same information as code, but sorted differently), price, quantity, and description. Price is currently defined as FLOAT 10,2 and set to 0.00. I am attempting to pull the pricing over from yahooprices (also FLOAT 10,2) to combined_stock using this statement:
UPDATE combined_stock dest LEFT JOIN (
SELECT price, code FROM yahooprices
) src ON dest.partnumber = src.code
SET dest.price = src.price
I know this statement worked because I tried it on a smaller test amount. They have partnumber and code as non-unique indexes. I also tried indexing price on both tables to see if that would speed up the query. Technically it should finish within seconds, but last time I tried running this, it sat there overnight and even then I'm pretty certain it didn't work out. Anyone have any troubleshooting recommendations?
I would suggest some relatively small changes. First, get rid of the subquery. Second, switch to an inner join:
UPDATE combined_stock dest JOIN
yahooprices src
ON dest.partnumber = src.code
SET dest.price = src.price;
Finally create an index on yahooprices(code, price).
You can leave the left outer join if you really want the price to be set to NULL when there is no match.
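Spelled out, the suggested index would be something like this (the index name is just an example):

CREATE INDEX idx_yahooprices_code_price ON yahooprices (code, price);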

How to perform multiple SQL tasks when using SQL within code (in this case vbscript)

I am hitting a brick wall with something I'm trying to do.
I'm trying to perform a complex query and return the results to a vbscript (vbs) record set.
In order to speed up the query I create temporary tables and then use those tables in the main query (a speed boost of around 1200% over just using subqueries).
The problem is that the calling code seems to ignore the main query, only 'seeing' the result of the very first command (i.e. it will return a 'records affected' figure).
For example, given a query like this..
delete from temp
select * into temp from sometable where somefield = somefilter
select sum(someotherfield) from yetanothertable where account in (select * from temp)
The calling code only seems to 'see' the returned result of 'delete from temp'; I can't access the data that the third command is returning.
(Obviously the sql query above is pseudo/fake. the real query is large and it's content not relevant to the question being asked. I need to solve this problem as without being able to use a temporary table the query goes from taking 3 seconds to 6 minutes!)
edit: I know I could get around this by making multiple calls to ADODB.Connection's execute (make the call to empty the temp tables, make the call to create them again, finally make the call to get the data) but I'd rather find an elegant solution/way to avoid this way of doing it.
edit 2: Below is the actual SQL code I've ended up with. Just adding it for the curiosity of people who have replied. It doesn't use the nocount as I'd already settled on a solution which works for me. It is also probably badly written. It evolved over time from something more basic. I could probably improve it myself but as it works and returns data extremely quickly I have stuck with it. (for now)
Here's the SQL.
Here's the code where it's called. My chosen solution is to run the first query into a third temp table, then run a select * on that table from the code, then a delete from the code...
I make no claims about being a 'good' SQL scripter (self-taught via necessity mostly), and the database is not very well designed (a mix of old and new tables; the old tables are not relational and contain numerical values and date values stored as strings).
Here is the original (slow) query...
select
name,
program_name,
sum(handle) + sum(refund) as [Total Sales],
sum(refund) as Refunds,
sum(handle) as [Net Sales],
sum(credit - refund) as Payout,
cast(sum(comm) as money) as commission
from
(select accountnumber,program_name,
cast(credit_amount as money) as credit,cast(refund_amt as money) as refund,handle, handle * (
(select commission from amtotecommissions
where _date = a._date
and pool_type = (case when a.pool_type in ('WP','WS','PS','WPS') then 'WN' else a.pool_type end)
and program_name = a.program_name) / 100) as comm
from amtoteaccountactivity a where _date = '#yy/#mm/#dd' and transaction_type = 'Bet'
and accountnumber not in ('5067788','5096272') /*just to speed the query up a bit. I know these accounts aren't included*/
) a,
ews_db.dbo.amtotetrack t
where (a.accountnumber in (select accountno from ews_db.dbo.get_all_customers where country = 'US')
or a.accountnumber in ('5122483','5092147'))
and t.our_code = a.program_name collate database_default
and t.tracktype = 2
group by name,program_name
I suspect that with the right SQL and indexes you should be able to get equal performance with a single SELECT, however there isn't enough information in the original question to be able to give guidance on that.
I think you'll be best of doing this as a stored procedure and calling that.
CREATE PROCEDURE get_Count
@somefilter int
AS
delete from temp;
select * into temp from sometable where somefield = @somefilter;
select sum(someotherfield) from yetanothertable
where account in (select * from temp);
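Calling it from the script side then reduces to a single statement, e.g. (filter value made up):

EXEC get_Count @somefilter = 42;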
However, rewriting the IN (the way you're using it) as a JOIN will probably fix the performance issue. Use EXPLAIN SELECT to see what's going on and optimise from there. For example, the following
select sum(transactions.value) from transactions
inner join user on transactions.user=user.id where user.name='Some User'
is much quicker than
select sum(transactions.value) from transactions
where user in (SELECT id from user where user.name='Some User')
because the number of rows scanned in the second example will be the entire table, whereas in the first the indexes can be used.
Rev1
Looking at the slow SQL posted, it appears that there are full table scans going on where the SQL states WHERE .. IN, e.g.
where (a.accountnumber in (select accountno from ews_db.dbo.get_all_customers))
The above will pull in lots of records which may not be required. This, together with the other nested table selects, is not allowing the optimiser to pull in only the records that match, as would be the case when using JOIN at the outer level.
When building these type of complex queries I generally start with the inner detail, because we need to have the inner detail so we can perform joins and aggregate operations.
What I mean by this is if you have a typical DB with customers that have orders that create transactions that contain items then I would start with the items and pull in the rest of the detail with joins.
By way of example only I suggest building the query more like the following:
select name,
       program_name,
       SUM(handle) + SUM(refund) AS [Total Sales],
       SUM(refund) AS Refunds,
       SUM(handle) AS [Net Sales],
       SUM(credit - refund) AS Payout,
       CAST(SUM(comm) AS money) AS commission
FROM ews_db.dbo.get_all_customers AS cu
INNER JOIN amtoteaccountactivity AS a ON a.accountnumber = cu.accountno
INNER JOIN ews_db.dbo.amtotetrack AS track ON track.our_code = a.program_name
INNER JOIN amtotecommissions AS co ON co.program_name = a.program_name
WHERE cu.country = 'US'
AND track.tracktype = 2
AND a.transaction_type = 'Bet'
AND a._date = '#yy/#mm/#dd'
AND a.program_name = co.program_name
AND co.pool_type = (case when a.pool_type in ('WP','WS','PS','WPS') then 'WN' else a.pool_type end)
GROUP BY name, program_name, co.commission
NOTE: The above is not functional and is for illustration purposes. I'd need to have the database online to build the real query. I'm hoping you'll get the general idea and build from there.
My top tip for complex queries that don't work is simply to completely start again throwing away what you've already got. Sometimes I will do this three or four times when building a really tricky query.
Always build these queries gradually starting from the most detail and working outwards. Inspect the results at each stage because it helps visualise what the data are.
If you could come to a common data structure for all the selects you could UNION ALL them together with perhaps selecting a constant in each union so you know where the data was coming from - kinda like
select '1',col1,col2,'' from table1
UNION ALL
select '2',col1,col2,col3 from table2
I just solved my original problem (that I came up against again today on a different query) in a slightly hacky way...
Conn.Execute(split(query,";")(0))
set rs = Conn.Execute(split(query,";")(1))
Works perfectly!
Edit: I just noticed that the first comment on my original question also provided a quick fix (SET NOCOUNT ON). I forgot about that, so there are both options. I had tried to get the query working without the temporary table, but I couldn't get anywhere near the same performance as with it.
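For reference, that quick fix applied to the pseudo-query from the start of the question would look roughly like this (sketch only; it uses INSERT rather than SELECT INTO since the staging table already exists, and assumes a column named account in temp):

-- With NOCOUNT on, the DELETE and INSERT produce no "rows affected" results,
-- so the recordset returned to ADODB comes from the final SELECT.
SET NOCOUNT ON;

delete from temp;

insert into temp
select * from sometable where somefield = somefilter;

select sum(someotherfield) from yetanothertable
where account in (select account from temp);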