I need to update a field called FamName in a table called Episode with randomly selected Germanic names from a different table called Surnames, which has a single column called Surname.
To do this, I first added an ID field and a unique nonclustered index to my Surnames table:
ALTER TABLE Surnames
ADD ID INT NOT NULL IDENTITY(1, 1);
GO
CREATE UNIQUE NONCLUSTERED INDEX idx ON Surnames(ID);
GO
I then attempt to update my Episode table via:
UPDATE E
SET E.FamName = S.Surname
FROM Episode AS E
INNER LOOP JOIN Surnames AS S
ON S.ID = (1 + ABS(CRYPT_GEN_RANDOM(8) % (SELECT COUNT(*) FROM Surnames)));
GO
where I am attempting to force the query to 'loop' using the LOOP join hint. Of course, if I don't force the optimizer to loop, I get the same Germanic name for all rows. However, this query strangely reports zero rows affected.
Why is this returning zero affected rows and how can this be amended to work?
Note, I could use a WHILE loop to perform this update, but I want a succinct way of doing this and to find out what I am doing wrong in this particular case.
You cannot (reliably) affect query results with join hints. They are performance hints, not semantic hints, so you are trying to rely on undefined behavior.
Moving the random number computation out of the join condition and into one of the join sources prevents the expression from being treated as a constant:
UPDATE E
SET E.FamName = S.Surname
FROM (
SELECT *, (1 + ABS(CRYPT_GEN_RANDOM(8) % (SELECT COUNT(*) FROM Surnames))) AS SurnameID
FROM Episode AS E
) E
INNER LOOP JOIN Surnames AS S ON S.ID = E.SurnameID
The derived table E adds the computed SurnameID as a new column.
You don't need join hints any longer. I just tested that this works in my specific test case, although I'm not sure whether this is guaranteed to work.
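If you would rather not lean on optimizer behavior at all, a correlated CROSS APPLY is a common alternative. Below is a minimal sketch, assuming Episode has a non-null key column, here called EpisodeID (a hypothetical name); referencing the outer row inside the subquery forces it to be evaluated per row:
UPDATE E
SET E.FamName = R.Surname
FROM Episode AS E
CROSS APPLY (
    -- Referencing E.EpisodeID (hypothetical key) correlates the subquery to the
    -- outer row, so it is re-run per Episode row rather than cached as a constant.
    SELECT TOP (1) S.Surname
    FROM Surnames AS S
    WHERE E.EpisodeID = E.EpisodeID
    ORDER BY NEWID()
) AS R;
Sorting the whole Surnames table by NEWID() once per Episode row is not cheap, but it avoids relying on undefined behavior.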
I have two tables. The first holds all the runs ("courses") that the buses make:
dbo.Courses_Bus
|ID|ID_Bus|ID_Line|DateHour_Start_Course|DateHour_End_Course|
The second holds all the payments made on these buses:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|
The goal is to add the notion of a line to the payments table, to get something like this:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|Line|
So I tried this:
/* I first added a Line column to the dbo.Payments table */
UPDATE
Table_A
SET
Table_A.Line = Table_B.ID_Line
FROM
[dbo].[Payments] AS Table_A
INNER JOIN [dbo].[Courses_Bus] AS Table_B
ON Table_A.ID_Bus = Table_B.ID_Bus
AND Table_A.DateHour_Payment BETWEEN Table_B.DateHour_Start_Course AND Table_B.DateHour_End_Course
And this
UPDATE
Table_A
SET
Table_A.Line = Table_B.ID_Line
FROM
[dbo].[Payments] AS Table_A
INNER JOIN (
SELECT
P.*,
CP.ID_Line AS ID_Line
FROM
[dbo].[Payments] AS P
INNER JOIN [dbo].[Courses_Bus] CP ON CP.ID_Bus = P.ID_Bus
AND CP.DateHour_Start_Course <= P.DateHour_Payment
AND CP.DateHour_End_Course >= P.DateHour_Payment
) AS Table_B ON Table_A.ID_Bus = Table_B.ID_Bus
The main problem, apart from the fact that these queries do not seem to work properly, is that each table has several million rows, growing every day. Because of the date/hour filter (mandatory, since a single bus can run on several lines in a day), SQL Server must compare each row of one table with all the rows of the other.
So it takes an extremely long time, and that will only increase every day.
How can I make it work and optimise it?
Assuming that this is the logic you want:
UPDATE p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course;
To optimize this query, you want an index on Courses_Bus(ID_Bus, DateHour_Start_Course, DateHour_End_Course).
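In T-SQL that would be the following (the index name is just illustrative):
CREATE INDEX IX_Courses_Bus_Bus_Dates
    ON dbo.Courses_Bus (ID_Bus, DateHour_Start_Course, DateHour_End_Course);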
There might be slightly more efficient ways to optimize the query, but your question doesn't have enough information -- is there always exactly one match, for instance?
Another big issue is that updating all the rows is quite expensive. You might find that it is better to do this in loops, one chunk at a time:
UPDATE TOP (10000) p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
WHERE p.Line IS NULL;
Once again, though, this structure depends on all the initial values being NULL and an exact match for all rows.
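A sketch of the driving loop around that chunked update (the batch size of 10,000 is arbitrary):
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (10000) p
    SET p.Line = cb.ID_Line
    FROM [dbo].[Payments] p JOIN
         [dbo].[Courses_Bus] cb
         ON p.ID_Bus = cb.ID_Bus AND
            p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
    WHERE p.Line IS NULL;

    -- Stop once a pass no longer updates any rows.
    SET @rows = @@ROWCOUNT;
END;
Each pass commits separately (under the default autocommit), which keeps the transaction log and lock footprint small.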
Thank you Gordon for your answer.
I have investigated and came up with this query:
MERGE [dbo].[Payments] AS p
USING [dbo].[Courses_Bus] AS cb
ON p.ID_Bus = cb.ID_Bus AND
   p.DateHour_Payment >= cb.DateHour_Start_Course AND
   p.DateHour_Payment <= cb.DateHour_End_Course
WHEN MATCHED THEN
    UPDATE SET p.Line = cb.ID_Line;
It seems to be the most suitable option in an MS SQL environment, but it failed with this error:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
I understand this to mean that it finds several rows with identical
[p.ID_Bus= cb.ID_Bus AND
p.DateHour_Payment >= cb.DateHour_Start_Course AND
p.DateHour_Payment <= cb.DateHour_End_Course]
Yes, this case is possible, even though the ID is different each time. For example, two bank cards may be beeped at the same moment, or a network outage may cause the equipment to upload several beeps with the same timestamp once it is back. These are distinct payments that must be treated separately, and you can get, for example:
|ID|ID_Bus|DateHour_Payment|Line|
----------------------------------
|56|204|2021-01-01 10:00:00|15|
----------------------------------
|82|204|2021-01-01 10:00:00|15|
How can I improve this query so that it takes into account different payment IDs?
I can't figure out how to do this with the help I find online. Maybe this method is not the right one in this context.
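One way to satisfy MERGE's requirement that each target row match at most one source row is to deduplicate the source with ROW_NUMBER(), keeping a single candidate course per payment. A sketch, under the assumption that when a payment falls inside several overlapping courses, any one of them (here the latest-starting one) is acceptable:
MERGE [dbo].[Payments] AS p
USING (
    SELECT PaymentID, ID_Line
    FROM (
        SELECT pay.ID AS PaymentID,
               cb.ID_Line,
               -- Rank the candidate courses for each payment.
               ROW_NUMBER() OVER (PARTITION BY pay.ID
                                  ORDER BY cb.DateHour_Start_Course DESC) AS rn
        FROM [dbo].[Payments] pay
        JOIN [dbo].[Courses_Bus] cb
          ON pay.ID_Bus = cb.ID_Bus
         AND pay.DateHour_Payment BETWEEN cb.DateHour_Start_Course
                                      AND cb.DateHour_End_Course
    ) d
    WHERE rn = 1
) src
ON p.ID = src.PaymentID
WHEN MATCHED THEN
    UPDATE SET p.Line = src.ID_Line;
If the "right" course among the overlapping ones matters, replace the ORDER BY inside ROW_NUMBER() with whatever rule defines it.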
I have a group of entities that each need a record from another table associated with them.
When I try to OUTPUT an Id from the source table to match on, it doesn't work, because you can only OUTPUT columns from the inserted and deleted pseudo-tables.
DECLARE @SignatureGlobalIdsTbl table (ID int,
CompanyBankAccountId int);
INSERT INTO GlobalIds (TypeId)
-- I cannot output cba.Id into the table since it's not from inserted
OUTPUT Inserted.Id,
cba.Id
INTO #SignatureGlobalIdsTbl (ID,
CompanyBankAccountId)
SELECT (@DocumentsGlobalTypeKey)
FROM CompanyBankAccounts cba
INNER JOIN Companies c ON c.CompanyId = cba.CompanyId
WHERE SignatureDocumentId IS NULL
AND (SignatureFile IS NOT NULL
AND SignatureFile != '');
INSERT INTO Documents (DocumentPath,
DocumentType,
DocumentIsExternal,
OwnerGlobalId,
OwnerGlobalTypeID,
DocumentName,
Extension,
GlobalId)
SELECT SignatureFile,
@SignatureDocumentTypeKey,
1,
CompanyGlobalId,
@OwnerGlobalTypeKey,
[dbo].[fnGetFileNameWithoutExtension](SignatureFile),
[dbo].[fnGetFileExtension](SignatureFile),
documentGlobalId
FROM (SELECT c.GlobalId AS CompanyGlobalId,
cba.*,
s.ID AS documentGlobalId
FROM CompanyBankAccounts cba
INNER JOIN Companies c ON c.CompanyId = cba.CompanyId
CROSS JOIN @SignatureGlobalIdsTbl s) info
WHERE SignatureDocumentId IS NULL
AND (SignatureFile IS NOT NULL
AND SignatureFile != '');
I tried to use a CROSS JOIN to pair the rows, but that did not work (it multiplies the rows instead of pairing them). I also tried to OUTPUT a ROW_NUMBER() over some value, but I could not get that stored in the table either.
If I have two separate queries which return the same number of records, how can I pair the records together without creating a Cartesian product?
'When I try to output an Id for the table ... it doesn't work.'
This seems to be because one of the columns you want to OUTPUT is not actually part of the insert. It's an annoying problem and I wish SQL Server would allow us to do it.
Someone may have a much better answer for this than I do, but the way I usually approach this is:
1. Create a temporary table (or similar) of the data I want to insert, with a column for the ID (initially blank).
2. Do an insert of the correct number of rows, and capture the new IDs into another temporary table.
3. Assign the IDs as appropriate within the original temporary table.
4. Go back and update the inserted rows with any additional data needed (though that's probably not needed here, given you're just inserting a constant).
What this does is get the IDs ready for you to use; then you allocate them to your data as needed, then fill in the table with the data. It's relatively simple, although it does do two table hits rather than one.
Also consider doing it all within a transaction to keep the data consistent (though also probably not needed here).
How can I pair the records together?
A cross join unfortunately multiplies the rows (number of rows on left times the number of rows on the right). It is useful in some instances, but possibly not here.
I suggest when you do your inserts above, you get an identifier (e.g., companyID) in your temp table and join on that.
If you don't have a matching record and just want to assign them in order, you can use an answer similar to my answer in another recent question How to update multiple rows in a temp table with multiple values from another table using only one ID common between them?
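For the "assign them in order" case, the usual pattern is to number both row sets with ROW_NUMBER() and join on that. A sketch, where #NewGlobalIds is a hypothetical temp table holding the freshly inserted GlobalIds keys:
WITH NewIds AS (
    -- #NewGlobalIds is hypothetical: the table you captured the new IDs into.
    SELECT ID,
           ROW_NUMBER() OVER (ORDER BY ID) AS rn
    FROM #NewGlobalIds
),
Accounts AS (
    SELECT cba.CompanyBankAccountId,
           ROW_NUMBER() OVER (ORDER BY cba.CompanyBankAccountId) AS rn
    FROM CompanyBankAccounts cba
    WHERE cba.SignatureDocumentId IS NULL
      AND cba.SignatureFile IS NOT NULL
      AND cba.SignatureFile != ''
)
-- Pair row 1 with row 1, row 2 with row 2, and so on.
SELECT n.ID, a.CompanyBankAccountId
FROM NewIds n
JOIN Accounts a ON a.rn = n.rn;
Both sides must use a deterministic ORDER BY and return the same number of rows, or the pairing is meaningless.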
Further notes
I suggest avoiding table variables (e.g., DECLARE @yourtable TABLE) and using temporary tables (CREATE TABLE #yourtable) instead, for performance reasons. If it's only a small number of rows it's OK, but it gets worse as the table gets larger, because SQL Server assumes that a table variable only has 1 row.
In your bottom statement, why is there the SELECT statement in the FROM clause? Couldn't you just get rid of that select statement and have the FROM clause list the tables you want?
I figured out a way to get access to the output, by using a MERGE statement.
DECLARE @LogoGlobalIdsTbl TABLE (ID INT, companyBankAccountID INT)
MERGE GlobalIds
USING
(
SELECT (cba.CompanyBankAccountId)
FROM CompanyBankAccounts cba
INNER JOIN Companies c on c.CompanyId = cba.CompanyId
WHERE cba.LogoDocumentId IS NULL AND (cba.LogoFile IS NOT NULL AND cba.LogoFile != '')
) src ON (1=0)
WHEN NOT MATCHED
THEN INSERT ( TypeId )
VALUES (@DocumentsGlobalTypeKey)
OUTPUT [INSERTED].[Id], src.CompanyBankAccountId
INTO @LogoGlobalIdsTbl;
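For reference, this works because the ON (1=0) predicate sends every source row into the WHEN NOT MATCHED branch, and, unlike INSERT, the OUTPUT clause of a MERGE statement is allowed to reference columns from the source table (src.CompanyBankAccountId here). That is what makes pairing each new Id with its CompanyBankAccountId possible in a single statement.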
I retrieve data by joining multiple tables, as indicated in the image below. Since there is no data in the FK column (EmployeeID) of the Event table, I have to use the CardNo (nvarchar) fields to join the two tables. However, the number of digits in the CardNo fields differs between the Event and Employee tables, so I also have to use SQL Server's RIGHT function, and this makes the query take approximately 10 times longer to execute. Given this situation, what should I do? Can I keep using the CardNo field without changing its data type to int, etc.? (Other problems might appear after changing it, so it would be better to find a solution that does not change the data type.) The execution plan of the query below is also shown.
Query:
; WITH a AS (SELECT emp.EmployeeName, emp.Status, dep.DeptName, job.JobName, emp.CardNo
FROM TEmployee emp
LEFT JOIN TDeptA AS dep ON emp.DeptAID = dep.DeptID
LEFT JOIN TJob AS job ON emp.JobID = job.JobID),
b AS (SELECT eve.EventID, eve.EventTime, eve.CardNo, evt.EventCH, dor.DoorName
FROM TEvent eve LEFT JOIN TEventType AS evt ON eve.EventType = evt.EventID
LEFT JOIN TDoor AS dor ON eve.DoorID = dor.DoorID)
SELECT * FROM b LEFT JOIN a ON RIGHT(a.CardNo, 8) = RIGHT(b.CardNo, 8)
ORDER BY b.EventID ASC
You can add a computed column to your table like this:
ALTER TABLE TEmployee -- Don't start your table names with prefixes, you already know they're tables
ADD CardNoRight8 AS RIGHT(CardNo, 8) PERSISTED
ALTER TABLE TEvent
ADD CardNoRight8 AS RIGHT(CardNo, 8) PERSISTED
CREATE INDEX TEmployee_CardNoRight8_IDX ON TEmployee (CardNoRight8)
CREATE INDEX TEvent_CardNoRight8_IDX ON TEvent (CardNoRight8)
You don't need to persist the column since it already matches the criteria for a computed column to be indexed, but adding the PERSISTED keyword shouldn't hurt and might help the performance of other queries. It will cause a minor performance hit on updates and inserts, but that's probably fine in your case unless you're importing a lot of data (millions of rows) at a time.
The better solution, though, is to make sure that the columns that are supposed to match actually match. If the right 8 characters of the card number are something meaningful, then they shouldn't be part of the card number; they should be another column. If this is an issue where one table uses leading zeroes and the other doesn't, then you should fix that data to be consistent instead of putting together workarounds like this.
This line is what is costing you 86% of the query time:
LEFT JOIN a ON RIGHT(a.CardNo, 8) = RIGHT(b.CardNo, 8)
This is happening because it has to run RIGHT() on those fields for every row and then match them with the other table. This is obviously going to be inefficient.
The most straightforward solution is probably to either remove the RIGHT() entirely or else to re-implement it as a stored column on the table so it doesn't have to be calculated on the fly while the query is running.
While inserting the record, you would then also store the right eight digits of the card number in this field. My original thought was to use a computed column; deterministic computed columns can in fact be indexed (as the other answer shows), so that works too, but a regular column populated at insert time is another option.
; WITH a AS (
SELECT emp.EmployeeName, emp.Status, dep.DeptName, job.JobName, emp.CardNoRightEight
FROM TEmployee emp
LEFT JOIN TDeptA AS dep ON emp.DeptAID = dep.DeptID
LEFT JOIN TJob AS job ON emp.JobID = job.JobID
),
b AS (
SELECT eve.EventID, eve.EventTime, eve.CardNoRightEight, evt.EventCH, dor.DoorName
FROM TEvent eve LEFT JOIN TEventType AS evt ON eve.EventType = evt.EventID
LEFT JOIN TDoor AS dor ON eve.DoorID = dor.DoorID
)
SELECT *
FROM b
LEFT JOIN a ON a.CardNoRightEight = b.CardNoRightEight
ORDER BY b.EventID ASC
This will help you see how to add a calculated column to your database.
create table #temp (test varchar(30))
insert into #temp
values('000456')
alter table #temp
add test2 as right(test, 3) persisted
select * from #temp
The other alternative is to fix the data and the data entry so that both columns have the same data type and contain the same leading zeros (or remove them).
Many thanks for all of your help. With the help of your answers, I first managed to reduce the query execution time from 2 minutes to 1 minute by using computed columns. After that, by creating an index on these columns, I reduced the execution time to 3 seconds. Wow, it is really perfect :)
Here are the steps, posted for those who suffer from a similar problem:
Step I: Add computed columns to the tables (as the CardNo fields are nvarchar, I cast them to int inside the expression so that leading zeros don't affect the match):
ALTER TABLE TEvent ADD CardNoRightEight AS RIGHT(CAST(CardNo AS int), 8)
ALTER TABLE TEmployee ADD CardNoRightEight AS RIGHT(CAST(CardNo AS int), 8)
Step II: Create index for the computed columns in order to execute the query faster:
CREATE INDEX TEmployee_CardNoRightEight_IDX ON TEmployee (CardNoRightEight)
CREATE INDEX TEvent_CardNoRightEight_IDX ON TEvent (CardNoRightEight)
Step III: Update the query to use the computed columns:
; WITH a AS (
SELECT emp.EmployeeName, emp.Status, dep.DeptName, job.JobName, emp.CardNoRightEight --emp.CardNo
FROM TEmployee emp
LEFT JOIN TDeptA AS dep ON emp.DeptAID = dep.DeptID
LEFT JOIN TJob AS job ON emp.JobID = job.JobID
),
b AS (
SELECT eve.EventID, eve.EventTime, evt.EventCH, dor.DoorName, eve.CardNoRightEight --eve.CardNo
FROM TEvent eve
LEFT JOIN TEventType AS evt ON eve.EventType = evt.EventID
LEFT JOIN TDoor AS dor ON eve.DoorID = dor.DoorID)
SELECT * FROM b LEFT JOIN a ON a.CardNoRightEight = b.CardNoRightEight --ON RIGHT(a.CardNo, 8) = RIGHT(b.CardNo, 8)
ORDER BY b.EventID ASC
I have two tables:
table 1: rm_example(customer, weekno, salenum, card_type, ..., imputed)
table 2: rm_dummy(customer, weekno, imputed)
The imputed column in table 1 is NULL in every row.
I want to set the imputed column in table 1 to the value of imputed in table 2 wherever customer and weekno match.
Below is the query I wrote, but it is taking forever to execute:
update rm_example e
set e.imputed =
(select imputed
from rm_dummy d
inner join rm_example e on e.customer=d.customer and e.weekno=d.weekno)...
Is something wrong with the query?
I am working on a remote database using sqldeveloperplus, and we are talking about millions of rows.
MERGE is usually quite a bit faster than an UPDATE with a subquery. The syntax might seem a little weird at first, but you'll get used to it. This assumes rm_example has a primary key column PK:
MERGE INTO rm_example target
USING
(SELECT e.pk as e_pk,
d.imputed
FROM rm_dummy d
INNER JOIN rm_example e ON e.customer=d.customer AND e.weekno=d.weekno) src
ON (target.pk = src.e_pk)
WHEN MATCHED THEN UPDATE
SET target.imputed = src.imputed;
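For comparison, the statement from the question can also be fixed by actually correlating the subquery with the row being updated; as written, it re-joins rm_example inside the subquery instead of referencing the outer row. A sketch, assuming (customer, weekno) matches at most one row in rm_dummy:
UPDATE rm_example e
SET e.imputed = (SELECT d.imputed
                 FROM rm_dummy d
                 WHERE d.customer = e.customer
                   AND d.weekno = e.weekno)
-- Restrict to rows that have a match, so unmatched rows are not overwritten with NULL.
WHERE EXISTS (SELECT 1
              FROM rm_dummy d
              WHERE d.customer = e.customer
                AND d.weekno = e.weekno);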
Not sure if it will be faster than what you have already done, but you can try this:
UPDATE (SELECT e.imputed AS e_imputed, d.imputed AS d_imputed
        FROM rm_example e
        INNER JOIN rm_dummy d ON e.customer = d.customer AND e.weekno = d.weekno)
SET e_imputed = d_imputed;
-- Note: Oracle only allows updating a join view like this when it is
-- key-preserved, i.e. rm_dummy needs a unique constraint on (customer, weekno).
After reading 8 Bulk Update Methods Compared (Oracle), I see this is really a deprecated method and that MERGE syntax should be used instead. That said, depending on your system, this could possibly perform better.
I have two tables in an MS Access 2010 database: TblIndividuals and TblIndividualsUpdates. They have a lot of the same data, but the primary key may not be the same for a given person's record in both tables. So I'm doing a join between the two tables on names and birthdates to see which records correspond. I'm using a left join so that I also get rows for the people who are in TblIndividualsUpdates but not in TblIndividuals. That way I know which records need to be added to TblIndividuals to bring it up to date.
SELECT TblIndividuals.PersonID AS OldID,
TblIndividualsUpdates.PersonID AS UpdateID
FROM TblIndividualsUpdates LEFT JOIN TblIndividuals
ON ( (TblIndividuals.FirstName = TblIndividualsUpdates.FirstName)
and (TblIndividuals.LastName = TblIndividualsUpdates.LastName)
AND (TblIndividuals.DateBorn = TblIndividualsUpdates.DateBorn
or (TblIndividuals.DateBorn is null
and (TblIndividuals.MidName is null and TblIndividualsUpdates.MidName is null
or TblIndividuals.MidName = TblIndividualsUpdates.MidName))));
TblIndividualsUpdates has 4149 rows, but the query returns only 4103 rows. There are about 50 new records in TblIndividualsUpdates, but only 4 rows in the query result where OldID is null.
If I export the data from Access to PostgreSQL and run the same query there, I get all 4149 rows.
Is this a bug in Access? Is there a difference between Access's left join semantics and PostgreSQL's? Is my database corrupted (Compact and Repair doesn't help)?
Here is your ON clause laid out so that the grouping is visible:
ON (
TblIndividuals.FirstName = TblIndividualsUpdates.FirstName
and
TblIndividuals.LastName = TblIndividualsUpdates.LastName
AND (
TblIndividuals.DateBorn = TblIndividualsUpdates.DateBorn
or
(
TblIndividuals.DateBorn is null
and
(
TblIndividuals.MidName is null
and TblIndividualsUpdates.MidName is null
or TblIndividuals.MidName = TblIndividualsUpdates.MidName
)
)
)
);
What I would do is systematically remove the join conditions, all except the first two, until you find where the records drop off. Then you will know where your problem is.
This should never happen. Unless rows are being inserted/deleted in the meantime,
the query:
SELECT *
FROM a LEFT JOIN b
ON whatever ;
should never return fewer rows than:
SELECT *
FROM a ;
If it happens, it's a bug. Are you sure the queries are exactly like this (and you haven't omitted some detail, like a WHERE clause)? Are you sure that the first returns 4103 rows and the second one 4149? You could make another check by changing the * above to COUNT(*).
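Applied to the tables from the question (abbreviating the full ON clause), the check would be:
SELECT COUNT(*) AS JoinedRows
FROM TblIndividualsUpdates LEFT JOIN TblIndividuals
ON ((TblIndividuals.FirstName = TblIndividualsUpdates.FirstName)
AND (TblIndividuals.LastName = TblIndividualsUpdates.LastName));

SELECT COUNT(*) AS BaseRows
FROM TblIndividualsUpdates;
JoinedRows can never be smaller than BaseRows for a LEFT JOIN; if it comes out smaller, something other than plain join semantics is at work.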
Drop any indexes from both tables which include those JOIN fields (FirstName, LastName, and DateBorn). Then see whether you get the expected 4149 rows with this simplified query.
SELECT
i.PersonID AS OldID,
u.PersonID AS UpdateID
FROM
TblIndividualsUpdates AS u
LEFT JOIN TblIndividuals AS i
ON
(
(i.FirstName = u.FirstName)
AND (i.LastName = u.LastName)
AND (i.DateBorn = u.DateBorn)
);
For whatever it is worth, since this seems to be an elusive bug and any additional information could help in resolving it, I have had the same problem.
The query is too big to post here and I don't have the time to reduce it now to something suitable, but I can report what I found. In the below, all joins are left joins.
I was gradually refining and changing my query. It had a derived table in it (D), and the whole thing was made into a derived table (T) and then joined to a last table (L). At one point in its development, no field in T that originated in D participated in the join to L. That is when the problem occurred: the total number of rows mysteriously became less than that of the main table, which should be impossible. As soon as I again let a field from D participate (via T) in the join to L, the number increased to normal again.
It was as if the join condition to D was moved to a WHERE clause when no field in it was participating (via T) in the join to L. But I don't really know what the explanation is.