Fastest way to modify each row in a table - SQL

What's the recommended way of updating a relatively large table (~70 million rows) in order to replace a foreign key column with the id of a different table (indirectly linked through the current key)?
Let's say I have three tables:
Person
  Id long,
  Group_id long --> foreign key to Group table

Group
  Id long
  Device_id long --> foreign key to Device table

Device
  Id long
I would like to update the Person table to have a direct foreign key to the Device table, i.e.:
Person
  Id long,
  Device_Id long --> foreign key to Device table

Device
  Id long
The query would look something like this:
-- replace Group_id with Device_id
update p from Person p
inner join Group g
on g.Id = p.Group_id
set p.Group_id = g.Device_id
I would first drop the FK constraint, and then rename the column afterwards.
Will this work?
Is there a better way?
Can I speed it up? (while this query is running, everything else will be offline, server is UPS backed-up, so I'd like to skip any transactional updates)

It would work if you wrote the UPDATE properly (assuming this is SQL Server)
update p
set p.Group_id = g.Device_id
from Person p
inner join Group g on g.Id = p.Group_id
Apart from that, re-using the column and then renaming it is a really smart move*. I can't think of any smart way to make this any faster, unless you wish to use a WHILE loop and Person.Id markers to break the update into batches.
* - ALTER TABLE DROP COLUMN DOES NOT RECLAIM THE SPACE THE COLUMN TOOK

Drop the indexes on the table you are updating and recreate them after the update is complete.
Drop the constraints on the table you are updating and recreate them appropriately (you are changing the reference, after all) after the update is complete.
Disable the triggers on the table you are updating and re-enable them after the update is complete.
You might want to consider running batches. I personally would create a loop and batch update 10k rows at a time. This seemed to cause the fewest problems on my hardware (running out of disk space, etc). You could order the update and track the PK so you know where you are at. Or create a bit column that is set when a particular record is updated; this method might make it easier overall as you won't need to track the PK at all.
An example of such a loop might look like this:
DECLARE @MinPK BIGINT
DECLARE @MaxPK BIGINT
SET @MinPK = 0
SET @MaxPK = 0

WHILE 1 = 1
BEGIN
    -- Find the upper bound of the next batch
    -- (TOP 3 here only for illustration; use something like TOP 10000 in practice)
    SELECT
        @MaxPK = MAX(a.PK)
    FROM (
        SELECT TOP 3
            PK
        FROM Table
        WHERE PK > @MinPK
        ORDER BY PK ASC
    ) a

    -- No rows left to process
    IF @MaxPK IS NULL
        BREAK

    -- Change this SELECT to an UPDATE over the same PK range
    SELECT
        PK
    FROM Table
    WHERE PK > @MinPK
        AND PK <= @MaxPK

    SET @MinPK = @MaxPK
END
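Applied to the question's Person/Group update, the batched version might look roughly like this (a sketch, assuming Person.Id is the ascending primary key and batching 10,000 rows at a time):

DECLARE @MinId BIGINT = 0
DECLARE @MaxId BIGINT

WHILE 1 = 1
BEGIN
    -- Upper bound of the next 10k-row slice of Person
    SELECT @MaxId = MAX(a.Id)
    FROM (
        SELECT TOP 10000 Id
        FROM Person
        WHERE Id > @MinId
        ORDER BY Id ASC
    ) a

    IF @MaxId IS NULL
        BREAK   -- no rows left

    -- Note: run this once only; after a row is updated its Group_id holds a Device id
    UPDATE p
    SET p.Group_id = g.Device_id
    FROM Person p
    INNER JOIN [Group] g ON g.Id = p.Group_id
    WHERE p.Id > @MinId
      AND p.Id <= @MaxId

    SET @MinId = @MaxId
END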

Your idea won't "work", unless there is only one device per group (which would be ridiculous, so I assume not).
The problem is that you would have to cram many device_id values into one column in the person table - that's why you've got a group table in the first place.


Improve MERGE performance when using big tables

Context
We have a model in which each element has an element kind and from 0 to N features. Each feature belongs to only one element and has a feature name.
This is modeled as the following tables:
ELEMENTS
  elem_id int not null -- PK
  elem_elki_id int not null -- FK to ELEMENT_KINDS
  -- more columns with elements data

ELEMENT_KINDS
  elki_id int not null -- PK
  -- more columns with element kinds data

FEATURES
  feat_id int not null -- PK
  feat_elem_id int not null -- FK to ELEMENTS
  feat_fena_id int not null -- FK to FEATURE_NAMES
  -- more columns with features data

FEATURE_NAMES
  fena_id int not null -- PK
  -- more columns with feature_names data
Requirement
There is a new requirement of replacing the feature names table with a feature kinds table.
There is one (and only one) feature kind for each (element kind, feature name) pair.
The changes in the models were adding a new column and creating a new table:
ALTER TABLE features ADD feat_feki_id int null;
CREATE TABLE FEATURE_KINDS
(
  feki_id int not null, -- PK
  feki_elki_id int not null, -- FK to ELEMENT_KINDS
  feki_fena_id int null, -- FK* to FEATURE_NAMES
  -- more columns with feature kinds data
)
* feki_fena_id is actually a temporary column showing which feature name was used to create each feature kind. After populating feat_feki_id, feki_fena_id should be discarded along with feat_fena_id and the FEATURE_NAMES table.
Problem
After successfully populating the feature kinds table, we are trying to populate the feat_feki_id column using the following query:
MERGE INTO features F
USING
(
SELECT *
FROM elements
INNER JOIN feature_kinds
ON elem_elki_id = feki_elki_id
) EFK
ON
(
F.feat_elem_id = EFK.elem_id AND
F.feat_fena_id = EFK.feki_fena_id
)
WHEN MATCHED THEN
UPDATE SET F.feat_feki_id = EFK.feki_id;
This works in small-scale scenarios with test data, but in production we have ~20 million elements and ~2000 feature_kinds, and it takes about an hour before throwing an ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDOTBS1' error.
Question
Is there any way I could improve the performance of the MERGE so that it works? (Maybe I'm lacking some indexes?)
Is there another alternative to fill up the feat_feki_id column? (We already have tried UPDATE instead of MERGE with similar results)
It's not clear whether there is something wrong going on or whether your undo segments are just too small. Can you do the following statement without getting an ORA-30036?
UPDATE features f SET f.feat_feki_id = 12345;
If that doesn't work, you just need to increase the size of your undo segment. Kludges are available to do the update in chunks, but you really shouldn't have to do that.
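For reference, a chunked kludge along those lines might look roughly like this in PL/SQL, sweeping feat_id in fixed-size ranges and doing the same mapping as the MERGE with a correlated UPDATE (the chunk size is arbitrary):

DECLARE
  v_min   NUMBER;
  v_max   NUMBER;
  c_chunk CONSTANT NUMBER := 1000000;
BEGIN
  SELECT MIN(feat_id), MAX(feat_id) INTO v_min, v_max FROM features;

  WHILE v_min <= v_max LOOP
    -- Update one slice of FEATURES, then commit to release undo
    UPDATE features f
       SET f.feat_feki_id = (SELECT fk.feki_id
                               FROM elements e
                               JOIN feature_kinds fk
                                 ON fk.feki_elki_id = e.elem_elki_id
                              WHERE e.elem_id = f.feat_elem_id
                                AND fk.feki_fena_id = f.feat_fena_id)
     WHERE f.feat_id BETWEEN v_min AND v_min + c_chunk - 1;
    COMMIT;
    v_min := v_min + c_chunk;
  END LOOP;
END;
/

Committing per chunk keeps the undo usage bounded, at the cost of the update no longer being atomic.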
Assuming it's NOT a simple UNDO size issue, one thing you might do is make sure that your MERGE (or UPDATE) is updating rows in the order they appear in your table. Otherwise, you could be revisiting the same blocks over and over, really hurting performance and increasing UNDO usage. I encountered this in a similar operation I had to do a few years ago and I was shocked when I finally figured it out.
To avoid the problem I had, you would want something like this:
MERGE INTO features F
USING
(
SELECT f.feat_id, fk.feki_id
FROM features f
INNER JOIN elements e ON e.elem_id = f.feat_elem_id
INNER JOIN feature_kinds fk ON fk.feki_elki_id = e.elem_elki_id and fk.feki_fena_id = f.feat_fena_id
-- Order by the ROWID of the table you are updating to ensure you are not revisiting the same block over and over
ORDER BY f.rowid
) EFK
ON
(
F.feat_id = EFK.feat_id
)
WHEN MATCHED THEN
UPDATE SET F.feat_feki_id = EFK.feki_id;
I may have gotten your data model wrong, but the key point is to include the FEATURES table in the MERGE query and ORDER BY features.rowid to ensure that the updates happen in row order.

How to delete 3 billion rows from 2 related tables

I have a table with 5 billion rows (table1) and another table with 3 billion rows (table2). These two tables are related; table1 is a child of table2. I have to delete 3 billion rows from table1 and their related rows from table2. I tried using the FORALL method from PL/SQL, but it didn't help much. Then I thought of using an Oracle partitioning strategy. Since I am not a DBA, I would like to know whether partitioning an existing table on the primary key column, for a selected set of ids, is possible. My primary key is a 64-bit auto-generated number.
It is hard to partition the objects online (it can be done using dbms_redefinition), and it is not necessary with the details you gave.
The best idea would be to recreate the objects without the undesired rows.
For example, some simple code would look like:
create table undesired_data as (select undesired rows from table1);
create table table1_new as (select * from table1 where key not in (select key from undesired_data));
create table table2_new as (select * from table2 where key not in (select key from undesired_data));
rename table1 to table1_old;
rename table2 to table2_old;
rename table1_new to table1;
rename table2_new to table2;
recreate constraints;
check if everything is ok;
drop table1_old and table2_old;
This can be done by taking the consumers offline, but the downtime for them would be very small if the scripts are OK (you should test them in a test environment).
Sounds very dubious.
If it is a real use case then you don't delete: you create another table, well defined, including partitioning, and you fill it using insert /*+ append */ into MyNewTable select ....
The most common practice is to define partitions on dates (record create date, event date etc.).
Again, if this is a real use case I strongly recommend that you get real help, rather than seeking advice on the internet and doing it yourself.
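A minimal sketch of that approach, reusing the undesired_data idea from the previous answer (the column names and partition boundaries here are illustrative, not from the question):

-- New, well-defined, partitioned table
CREATE TABLE table1_new
(
  id         NUMBER(19) NOT NULL,
  event_date DATE       NOT NULL
  -- plus the remaining columns of table1
)
PARTITION BY RANGE (event_date)
(
  PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
  PARTITION p2024 VALUES LESS THAN (DATE '2025-01-01'),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);

-- Direct-path insert of only the rows you want to keep
INSERT /*+ APPEND */ INTO table1_new (id, event_date)
SELECT id, event_date
FROM   table1
WHERE  id NOT IN (SELECT key FROM undesired_data);

COMMIT;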

Moving large amounts of data instead of updating it

I have a large table (about 40M rows) with a number of columns that are 0 but need to be NULL instead, so we can better key the data.
I've written scripts that chop the update into chunks of 10,000 records, find the occurrences of columns with zero, and update them to NULL.
Example:
update FooTable
set order_id = case when order_id = 0 then null else order_id end,
person_id = case when person_id = 0 then null else person_id end
WHERE person_id = 0
OR order_id = 0
This works great, but it takes forever.
I'm thinking a better way to do this would be to create a second table, insert the data into it, and then rename it to replace the old table with the zero columns.
The question is: can I do an insert into table2 select from table1 and, in the process, cleanse the data from table1 before it goes in?
You can usually create a new, sanitised table, depending on the actual DB server you are using.
The hard thing is that if there are other tables in the database, you may have issues with foreign keys, indexes, etc which will refer to the original table.
Whether making a new sanitised table will be quicker than updating your existing table is something you can only tell by trying it.
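For example, on SQL Server, assuming a hypothetical FooTable2 with the same structure, something along these lines would cleanse the zeros on the way in:

-- Zeros become NULL as the rows are copied
INSERT INTO FooTable2 (order_id, person_id /*, other columns */)
SELECT
    NULLIF(order_id, 0),
    NULLIF(person_id, 0)
    /*, other columns unchanged */
FROM FooTable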
Dump the PK/clustered key of all the records you want to update into a temp table, then perform the update joining to the temp table. That will ensure the lowest locking level and quickest access. You can also add an identity column to the temp table; then you can loop through and do the updates in batches.
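A rough sketch of that idea against the FooTable example above (assuming a hypothetical id primary key; the batch size is arbitrary):

-- Collect the keys of the rows that need fixing, with an identity column for batching
CREATE TABLE #to_update
(
    batch_id INT IDENTITY(1, 1) PRIMARY KEY,
    foo_id   INT NOT NULL
)

INSERT INTO #to_update (foo_id)
SELECT id
FROM FooTable
WHERE order_id = 0
   OR person_id = 0

DECLARE @i INT = 0
DECLARE @batch INT = 10000

WHILE EXISTS (SELECT 1 FROM #to_update WHERE batch_id > @i)
BEGIN
    UPDATE f
    SET order_id  = NULLIF(f.order_id, 0),
        person_id = NULLIF(f.person_id, 0)
    FROM FooTable f
    INNER JOIN #to_update t ON t.foo_id = f.id
    WHERE t.batch_id > @i
      AND t.batch_id <= @i + @batch

    SET @i = @i + @batch
END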

Suggested techniques for storing multiple versions of SQL row data

I am developing an application that is required to store previous versions of database table rows to maintain a history of changes. I am recording the history in the same table but need the most current data to be accessible by a unique identifier that doesn't change with new versions. I have a few ideas on how this could be done and was just looking for some ideas on the best way of doing this or whether there is any reason not to use one of my ideas:
1. Create a new row for each row version, with a field to indicate which row is the current one. The drawback of this is that the new version has a different primary key and any references to the old version will not return the current version.
2. When data is updated, the old row version is duplicated to a new row, and the new version replaces the old row. The current row can be accessed by the same primary key.
3. Add a second table with only a primary key, and add a column to the other table which is a foreign key to the new table's primary key. Use the same method as described in option 1 for storing multiple versions and create a view which finds the current version by using the new table's primary key.
PeopleSoft uses (used?) "effective dated records". It took a little while to get the hang of it, but it served its purpose. The business key is always extended by an EFFDT column (effective date). So if you had a table EMPLOYEE[EMPLOYEE_ID, SALARY] it would become EMPLOYEE[EMPLOYEE_ID, EFFDT, SALARY].
To retrieve the employee's salary:
SELECT e.salary
FROM employee e
WHERE employee_id = :x
AND effdt = (SELECT MAX(effdt)
FROM employee
WHERE employee_id = :x
AND effdt <= SYSDATE)
An interesting application was future dating records: you could give every employee a 10% increase effective Jan 1 next year, and pre-populate the table a few months beforehand. When SYSDATE crosses Jan 1, the new salary would come into effect. Also, it was good for running historical reports. Instead of using SYSDATE, you plug in a date from the past in order to see the salaries (or exchange rates or whatever) as they would have been reported if run at that time in the past.
In this case, records are never updated or deleted, you just keep adding records with new effective dates. Makes for more verbose queries, but it works and starts becoming (dare I say) normal. There are lots of pages on this, for example: http://peoplesoft.wikidot.com/effective-dates-sequence-status
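For instance, the future-dated raise described above could be loaded with something like this (a sketch against the EMPLOYEE table from the example; the date and percentage are illustrative):

-- Pre-populate next year's 10% raise; it only takes effect once SYSDATE passes Jan 1
INSERT INTO employee (employee_id, effdt, salary)
SELECT e.employee_id,
       DATE '2026-01-01',
       e.salary * 1.10
FROM   employee e
WHERE  e.effdt = (SELECT MAX(effdt)
                  FROM   employee
                  WHERE  employee_id = e.employee_id
                  AND    effdt <= SYSDATE);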
Option #3 is probably best, but if you wanted to keep the data in one table, I suppose you could add a datetime column populated with now() for each new row; then you could at least sort by date desc limit 1.
Overall, though, handling multiple versions needs more information about what you want to do, effectively as much as programmatically.
Have you considered using AutoAudit?
AutoAudit is a SQL Server (2005, 2008) Code-Gen utility that creates
Audit Trail Triggers with:
Created, CreatedBy, Modified, ModifiedBy, and RowVersion (incrementing INT) columns to table
Insert event logged to Audit table
Updates old and new values logged to Audit table
Delete logs all final values to the Audit table
view to reconstruct deleted rows
UDF to reconstruct Row History
Schema Audit Trigger to track schema changes
Re-code-gens triggers when Alter Table changes the table
For me, history tables are always separate, so I would definitely go with that. But why create some complex versioning scheme where you need to look at the current production record? In reporting, this results in nasty unions that are really unnecessary.
Table has a primary key and who cares what else.
TableHist has these columns: an incrementing int/bigint primary key, a history-written date/time, a history-written-by user, a record type (I, U, D for insert, update, delete), the PK from Table as an FK on TableHist, and then all of Table's remaining columns, with the same names, in the TableHist table.
If you create this history table structure and populate it via triggers on Table, you will have all versions of every row in the tables you care about and can easily determine the original record, every change, and the deletion records as well. AND if you are reporting, you only need to use your historical tables to get all of the information you'd like.
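A minimal sketch of that structure for a hypothetical Table(Id, Name), with only the update trigger shown (the insert and delete triggers follow the same pattern):

CREATE TABLE TableHist
(
    HistId      BIGINT IDENTITY(1, 1) PRIMARY KEY,
    HistWritten DATETIME     NOT NULL DEFAULT GETDATE(),
    HistBy      SYSNAME      NOT NULL DEFAULT SUSER_SNAME(),
    RecordType  CHAR(1)      NOT NULL,   -- I, U or D
    Id          INT          NOT NULL,   -- PK of the source row in Table, FK to Table
    Name        VARCHAR(100) NULL        -- remaining columns mirror Table
)
GO

CREATE TRIGGER trg_Table_Update ON [Table]
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Record the new version of each updated row
    INSERT INTO TableHist (RecordType, Id, Name)
    SELECT 'U', i.Id, i.Name
    FROM inserted i
END
GO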
create table table1 (
Id int identity(1,1) primary key,
[Key] varchar(max),
Data varchar(max)
)
go
create view view1 as
with q as (
select [Key], Data, row_number() over (partition by [Key] order by Id desc) as 'r'
from table1
)
select [Key], Data from q where r=1
go
create trigger trigger1 on view1 instead of update, insert as begin
insert into table1
select [Key], Data
from (select distinct [Key], Data from inserted) a
end
go
insert into view1 values
('key1', 'foo')
,('key1', 'bar')
select * from view1
update view1
set Data='updated'
where [Key]='key1'
select * from view1
select * from table1
drop trigger trigger1
drop table table1
drop view view1
Results:
Key   Data
key1  foo

Key   Data
key1  updated

Id  Key   Data
1   key1  bar
2   key1  foo
3   key1  updated
I'm not sure if the distinct is needed.

easiest way to map ids during database refactoring

I have a number of tables with a column called OrderId. I have just done a refactoring and I want to get rid of the Order table; I have a new table called Transaction. I want all tables that have an OrderId column to now have a TransactionId column.
This is complete. I now need to populate the TransactionId column. I already have a mapping between OrderId and TransactionId, so I wanted to see the quickest way to populate that new TransactionId column (should I do this through code, through a SQL query, etc.?).
I have the TransactionId column in the Order table, so I can do a join.
I want a query that says something like this (pseudo SQL)
update childTable CT
set transactionId = MapFromOrderId(CT.OrderId)
any suggestions?
I would do it in SQL code:
UPDATE MT
SET
transaction_id = MAP.transaction_id
FROM
My_Table MT
INNER JOIN My_Map MAP ON
MAP.order_id = MT.order_id
Then check to make sure that every row was mapped:
SELECT
*
FROM
My_Table
WHERE
transaction_id IS NULL
The process is usually:
1. Make sure the database is backed up.
2. Add transactionId to each child table.
3. Populate it based on a join to the mapping table (you did store the mappings between orderId and transactionId in a table?).
4. Make sure you have no blank values.
5. Then create the FK for transactions, drop the FK to the Order table, and then drop the orderId column (see the sketch after this list).
6. Then move to the next table and repeat.
7. Test to make sure everything worked properly.
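A sketch of steps 2 and 5 for one child table (table, column, and constraint names here are illustrative):

-- Step 2: add the new column
ALTER TABLE ChildTable ADD TransactionId INT NULL

-- Step 5: once populated and verified, swap the constraints and drop the old column
ALTER TABLE ChildTable
    ADD CONSTRAINT FK_ChildTable_Transaction
    FOREIGN KEY (TransactionId) REFERENCES [Transaction] (TransactionId)

ALTER TABLE ChildTable DROP CONSTRAINT FK_ChildTable_Order   -- your existing FK name will differ
ALTER TABLE ChildTable DROP COLUMN OrderId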
Definitely I'd do this in a script so it will be easy to port to prod after dev and QA testing.
On prod you need to do this while the database is in single-user mode, to prevent new orders from being added while the process runs.
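On SQL Server that would be along these lines (the database name is illustrative):

-- Kick out other connections and block new ones for the duration of the migration
ALTER DATABASE MyOrdersDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- ... run the migration script here ...

ALTER DATABASE MyOrdersDb SET MULTI_USER;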