Proper way of updating a whole table in Redshift, drop table + create table vs. truncate + insert into table - sql

Currently I have many tables for which I have to update the information they hold, sometimes on a daily or weekly basis. So far, I've been doing this with a combination of DROP TABLE IF EXISTS some_schema.some_table_name; followed by CREATE TABLE some_schema.some_table_name AS ( SELECT ... FROM ... WHERE ...); and I would like to know the "best-practice" or proper way of doing this.
I've read that INSERT operations in Redshift are quite expensive, so I've been avoiding them, but maybe the use of TRUNCATE with INSERT is better than dropping and creating.
How can I confirm which option is better?
I've seen this article from the Redshift docs, but I'm not sure it is the best option, since I might need not only to remove records, but to keep and insert others as well.

If your desire is to completely erase a table and replace the data then the general pattern you are following is fine. However, there are a few things you should be doing to make things safer / better.
There are 3 patterns to do this and one is clearly the lowest performance. These are Delete/Insert, Truncate/Insert, and Drop/Create/Insert. Of these, Delete/Insert is NOT what you want to do from a performance point of view. This process invalidates all the rows in the table (it does not delete them) and adds new valid rows. This doubles the size of the table, wasting space, and needs to be vacuumed. The only upside of this approach is that it doesn't have the downsides of the other approaches, but this only matters in certain situations. Go down this path only if you have to.
Truncate/Insert is fast and maintains the same table id as the original table. Because truncate operates on the blocks of the table (unlinking them) it is fast but there is some small overhead in managing all the block links. Since the table definition is unchanged all DDL stays defined and dependent views can keep pointing to the table. The downside with truncate is that it forces a COMMIT to occur which means that until the table is repopulated with new data other users of the database can see an empty table. This can lead to incorrect results during these windows. Not good.
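For illustration, a minimal sketch of the truncate/insert pattern using the placeholder names from the question (note that the TRUNCATE commits immediately, which is exactly the empty-table window described above):
TRUNCATE some_schema.some_table_name; -- implicit COMMIT: other sessions can see an empty table from here
INSERT INTO some_schema.some_table_name
SELECT ... FROM ... WHERE ...; -- repopulate with the fresh data
ANALYZE some_schema.some_table_name; -- refresh planner statistics after the reload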
Lastly there is Drop/Create/Insert. This approach is marginally (very slightly and only for large tables) faster than truncate in the ideal case. It just throws away the old blocks. There is some additional cost to setting up the new table (of the same name), so truncate and drop are about the same speed unless the table is large. Since Drop can be inside a transaction block the empty table won't be seen by third parties (if done correctly). The downside of this approach is that the old table and new table are entirely different tables (different oids) - they just happen to have the same name. This means that any dependent (regular) views will need to be dropped and recreated as well. Also, since this table is "going away", the commit of the transaction cannot complete until all uses of the table are complete. This becomes a large problem when someone leaves a transaction open in their SQL bench and goes home for the night. Since the table needs to be recreated, your process needs to know the complete and correct DDL for the table.
Hopefully this gives you some idea of when to use these different approaches. Two things I see that could be better in your current code - 1) You are not using a transaction block (as far as I can tell) so there is a window when others will see that the table doesn't exist or is empty. This may or may not be important to you but be aware. 2) "Create Table As" doesn't define the DDL of the table with a performant structure (and possibly gets it wrong). You should always specify your permanent tables fully. Sort and Dist keys matter, as do varchar lengths, data types etc. This is a time bomb waiting to go off.
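On point 2, a hedged illustration of what "specify your permanent tables fully" can look like in Redshift - the columns, key choices and varchar length here are invented, so pick ones that match your own data and query patterns:
CREATE TABLE some_schema.some_table_name (
    id          BIGINT      NOT NULL,
    customer_id BIGINT      NOT NULL,
    event_date  DATE        NOT NULL,
    status      VARCHAR(32)              -- size varchars for the data you actually store
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (event_date);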
Per request for an example of drop/create/insert:
As I mentioned there are lock dependency issues that can arise with this method so I like to use a "swap & drop" approach to this path. This makes the new information visible to users at the "swap" so even if the "drop" gets blocked, things get published on time. This doesn't remove the lock risk as a lock can still prevent the process (session) from completing, it just makes it so that the new data is visible (published) while you hunt down the offender.
(Please note that for transactions to execute properly you need to be sure that extra COMMITs are not being inserted into the process. This can happen with benches that are configured in "autocommit" mode.)
Create table new_table ( ... ) ...; -- make the new table but with a different name (and unique from other tables) than the existing table
Insert into new_table ... ; -- put the desired data into the new table
Analyze new_table; -- to ensure metadata is up to date
Begin; -- start transaction
Alter table perm_table rename to old_table; -- rename existing table
Alter table new_table rename to perm_table; -- complete the swap
Commit; -- publish the new data for all to see but transactions still using the original data can keep doing so
Drop table old_table; -- remove the old data to free up space
Commit;
This process is just one example. Sometimes you want to keep the old versions of the table around for a while (history / error recovery) so you will date stamp the old data and have a separate process to free up the space. This also helps with stray locks clogging up the works - only the clean up process gets stalled. You can also have the recreation of views in the process so that these are updated in the same transaction. And so on.
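For example (the date suffix below is just an example), keeping history means the tail of the swap becomes a rename instead of a drop:
Begin;
Alter table perm_table rename to perm_table_20230101; -- keep the old version around, date-stamped, for history / error recovery
Alter table new_table rename to perm_table;
Commit;
-- a separate clean-up job later drops the date-stamped copies that fall outside your retention window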

I think you will need to use the Update command. I understand that dropping a table is a risky move, as you might lose all of your data from your database.
Update s set
s.Id='whatever you want to update',
s.Name='whatever you want to update',
s.LastName='whatever you want to update',
s.OtherTableColumn='whatever you want to update'
From
some_table_name s
In the above code I assumed your table had four columns (1-Id, 2-Name, 3-LastName, 4-OtherTableColumn). If you have more or fewer columns then I would adjust accordingly.
I would also write an update procedure for this (and each table) so if you need to update somewhat frequently you just use the procedure; I think it's quicker. Below would be my procedure:
Create Proc sp_UpdateSome_table_name
@Id int,
@name nvarchar(255),
@lastname nvarchar(255),
@OtherTableColumn int
AS
BEGIN
Update s Set
s.Name=@name,
s.LastName=@lastname,
s.OtherTableColumn=@OtherTableColumn
From
some_table_name s
Where
s.Id=@Id
END
You want to make sure that each column in your table is defined with the correct data type in the procedure. For example, I assumed above that @Id was int, @name was nvarchar(255), etc. If you want to allow yourself not to enter any data (allowing null) in certain table columns when updating, then after the data type you can write Null; for example, if you write @Id int Null, then you can update it as null; but if you are not sure what this is, simply ignore this sentence for now.
Once you have made sure the above paragraph is satisfied (data types are correct), select the entire procedure and execute it (F5). This will store the procedure.
Then you can run the procedure every time you want to update your table, as shown below:
Exec sp_UpdateSome_table_name 1,'John','Smith',77
If you highlight the above command and execute it (F5), it will update the row which has Id=1, making the name John, the last name Smith and the other column 77, whatever they were before. If there is no row in the table with Id=1 then nothing will be updated.
Keep in mind that the last column assignment does not have a comma after it. The code above is written correctly; I'm just pointing this out because you might put a comma there out of habit.

Related

Updating different fields in different rows

I've tried to ask this question at least once, but I never seem to put it across properly. I really have two questions.
My database has a table called PatientCarePlans
( ID, Name, LastReviewed, LastChanged, PatientID, DateStarted, DateEnded). There are many other fields, but these are the most important.
Every hour, a JSON extract gets a fresh copy of the data for PatientCarePlans, which may or may not be different to the existing records. That data is stored temporarily in PatientCarePlansDump. Unlike other tables which will rarely change, and if they do only one or two fields, with this table there are MANY fields which may now be different. Therefore, rather than simply copy the Dump files to the live table based on whether the record already exists or not, my code does the no doubt wrong thing: I empty out any records from PatientCarePlans from that location, and then copy them all from the Dump table back to the live one. Since I don't know whether or not there are any changes, and there are far too many fields to manually check, I must assume that each record is different in some way or another and act accordingly.
My first question is how best (I have OKish basic knowledge, but this is essentially a useful hobby, and therefore have limited technical / theoretical knowledge) do I ensure that there is minimal disruption to the PatientCarePlans table whilst doing so? At present, my code is:
IF Object_ID('PatientCarePlans') IS NOT NULL
BEGIN
BEGIN TRANSACTION
DELETE FROM [PatientCarePlans] WHERE PatientID IN (SELECT PatientID FROM [Patients] WHERE location = @facility)
COMMIT TRANSACTION
END
ELSE
SELECT TOP 0 * INTO [PatientCarePlans]
FROM [PatientCareplansDUMP]
INSERT INTO [PatientCarePlans] SELECT * FROM [PatientCarePlansDump]
DROP TABLE [PatientCarePlansDUMP]
My second question relates to how this process affects the numerous queries that run on and around the same time as this import. Very often those queries will act as though there are no records in the PatientCarePlans table, which causes obvious problems. I'm vaguely aware of transaction locks etc, but it goes a bit over my head given the hobby status! How can I ensure that a query is executed and results returned whilst this process is taking place? Is there a more efficient or less obstructive method of updating the table, rather than simply removing them and re-adding? I know there are merge and update commands, but none of the examples seem to fit my issue, which only confuses me more!
Apologies for the lack of knowhow, though that of course is why I'm here asking the question.
Thanks
I suggest you do not delete and re-create the table. The DDL script to create the table should be part of your database setup, not part of regular modification scripts.
You are going to want to do the DELETE and INSERT inside a transaction. Preferably you would do this under SERIALIZABLE isolation in order to prevent concurrency issues. (You could instead use a WITH (TABLOCK) hint, which would be less likely to cause a deadlock, but will completely lock the table.)
SET XACT_ABORT, NOCOUNT ON; -- always set XACT_ABORT if you have a transaction
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
DELETE FROM [PatientCarePlans]
WHERE PatientID IN (
SELECT p.PatientID
FROM [Patients] p
WHERE location = @facility
);
INSERT INTO [PatientCarePlans] (YourColumnsHere) -- always specify columns
SELECT YourColumnsHere
FROM [PatientCarePlansDump];
COMMIT;
You could also do this with a single MERGE statement. However, it is complex to write (owing to the need to restrict the set of rows being targeted), it is not usually more performant than separate statements, and it also needs SERIALIZABLE.
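For completeness, a rough sketch of what that MERGE might look like (the column list is abbreviated; the predicate on NOT MATCHED BY SOURCE is what restricts the statement to the one facility):
MERGE [PatientCarePlans] WITH (HOLDLOCK) AS t
USING (
    SELECT d.* -- in practice, list the columns explicitly
    FROM [PatientCarePlansDump] d
    JOIN [Patients] p ON p.PatientID = d.PatientID
    WHERE p.location = @facility
) AS s
ON t.ID = s.ID
WHEN MATCHED THEN
    UPDATE SET t.Name = s.Name,
               t.LastReviewed = s.LastReviewed -- ...and the remaining columns
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, Name, LastReviewed /* , ... */)
    VALUES (s.ID, s.Name, s.LastReviewed /* , ... */)
WHEN NOT MATCHED BY SOURCE
    AND t.PatientID IN (SELECT PatientID FROM [Patients] WHERE location = @facility) THEN
    DELETE;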

Truncate and insert new content into table with the least amount of interruption

Twice a day, I run a heavy query and save the results (40MBs worth of rows) to a table.
I truncate this results table before inserting the new results such that it only ever has the latest query's results in it.
The problem is that while the new results are being written, the table briefly has no data and/or is locked. When that is the case, anyone interacting with the site could experience an interruption. I haven't experienced this yet, but I am looking to mitigate it in the future.
What is the best way to remedy this? Is it proper to write the new results to a table named results_pending, then drop the results table and rename results_pending to results?
Two methods come to mind. One is to swap partitions for the table. To be honest, I haven't done this in SQL Server, but it should work at a low level.
I would normally have all access go through a view. Then, I would create the new day's data in a separate table -- and change the view to point to the new table. The view change is close to "atomic". Well, actually, there is a small period of time when the view might not be available.
Then, at your leisure you can drop the old version of the table.
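A rough sketch of the view approach - the names are placeholders, and CREATE OR ALTER needs SQL Server 2016 SP1 or later (use ALTER VIEW on older versions):
-- build the new results under a throwaway name; nobody is reading this table yet
SELECT ...
INTO dbo.results_20230101_am
FROM ...; -- the heavy query goes here

-- repoint the view (run in its own batch); this is the near-atomic step readers see
CREATE OR ALTER VIEW dbo.results AS
SELECT * FROM dbo.results_20230101_am;

-- later, at your leisure, drop the previous run's table
DROP TABLE dbo.results_20230101_prev;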
TRUNCATE is a DDL operation which causes problems like this. If you are using snapshot isolation with row versioning and want users to either see the old or new data then use a single transaction to DELETE the old records and INSERT the new data.
Another option if a lot of the data doesn't actually change is to UPDATE / INSERT / DELETE only those records that need it and leave unchanged records alone.
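A hedged sketch of that differential approach in T-SQL, assuming the fresh query results land in a staging table first and that the table has a key column; the names results, results_stage, id and payload are all made up:
-- wrap the three statements in one transaction if readers must never see a half-applied sync
UPDATE r
SET    r.payload = s.payload
FROM   dbo.results r
JOIN   dbo.results_stage s ON s.id = r.id
WHERE  r.payload <> s.payload; -- touch only rows that actually changed (add IS NULL handling if payload is nullable)

INSERT INTO dbo.results (id, payload)
SELECT s.id, s.payload
FROM   dbo.results_stage s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.results r WHERE r.id = s.id);

DELETE r
FROM   dbo.results r
WHERE  NOT EXISTS (SELECT 1 FROM dbo.results_stage s WHERE s.id = r.id);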

How to perform data archive in SQL Server with many tables?

Let's say I have a database with many tables in it. I want to perform data archiving on certain tables, that is, create a new table with the same structure (same constraints, indexes, columns, triggers, etc.) and insert specific data into the new table from the old table.
For example, the current table has data from 2008-2017 and I want to move only the data from 2010-2017 into the new table. After that, I can delete the old table and rename the new table with a naming convention similar to the old table.
How should I approach this?
For the sort of clone-rename-drop logic you're talking about, the basics are pretty straightforward. Really the only time this is a good idea is if you have a table with a large amount of data, which you can't afford downtime or blocking on, and you only plan to do this once. The process looks something like this:
Insert all the data from your original table into the clone table
In a single transaction, sp_rename the original table from (for example) myTable to myTable_OLD (just something to distinguish it from the real table). Then sp_rename the clone table from (for example) myTable_CLONE to myTable
Drop myTable_OLD when you're happy everything has worked how you want. If it didn't work how you want, just sp_rename the objects back.
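A minimal sketch of steps 2 and 3, using the example names above:
BEGIN TRAN;
    EXEC sp_rename 'dbo.myTable', 'myTable_OLD'; -- move the original out of the way
    EXEC sp_rename 'dbo.myTable_CLONE', 'myTable'; -- the clone takes its place
COMMIT;
-- once you're happy everything worked:
DROP TABLE dbo.myTable_OLD;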
Couple considerations to think about if you go that route
Identity columns: If your table has any identities on it, you'll have to use SET IDENTITY_INSERT ON, then reseed the identity so it picks up where the old identity left off
Do you have the luxury of blocking the table while you do this? Generally if you need to do this sort of thing, the answer is no. What I find works well is to insert all the rows I need using (nolock), or however you need to do it so the impact of the select from the original table is mitigated. Then, after I've moved 99% of the data, I will then open a transaction, block the original table, insert just the new data that's come in since the bulk of the data movement, then do the sp_rename stuff
That way you don't lock anything for the bulk of the data movement, and you only block the table for the very last bit of data that came into the original table between your original insert and your sp_rename
How you determine what's come in "since you started" will depend on how your table is structured. If you have an identity or a datestamp column, you can probably just pick rows which came in after the max of those fields you moved over. If your table does NOT have something you can easily hook into, you might need to get creative.
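For instance, with a datestamp column the final catch-up insert might look like this (LoadedAt and the column list are hypothetical):
DECLARE @watermark DATETIME = (SELECT MAX(LoadedAt) FROM dbo.myTable_CLONE); -- the newest row already copied

INSERT INTO dbo.myTable_CLONE (Id, LoadedAt /* , ... */)
SELECT Id, LoadedAt /* , ... */
FROM   dbo.myTable
WHERE  LoadedAt > @watermark; -- only the rows that arrived since the bulk copy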
Alternatives
A couple other alternatives that came to mind:
Table Partitioning:
This shards a single table across multiple partitions (which can be managed sort of like individual tables). You can, say, partition your data by year, then when you want to purge the trailing year of data, you "switch out" that partition to a special table which you can then truncate. All those operations are metadata-only, so they're super fast. This also works really well for huge amounts of data where deletes and all their pesky transaction logging aren't feasible
The downside to table partitioning is it's kind of a pain to set up and manage.
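For reference, the switch-out itself is a one-liner once the partitioning is in place - this sketch assumes dbo.myTable is partitioned by year and that dbo.myTable_switch_target is an identically defined, unpartitioned table on the same filegroup:
ALTER TABLE dbo.myTable
SWITCH PARTITION 1 TO dbo.myTable_switch_target; -- metadata-only move of the trailing year

TRUNCATE TABLE dbo.myTable_switch_target; -- the purge itself is now trivial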
Batched Deletes:
If your data isn't too big, you could just do batched deletes on the trailing end of your data. If you can find a way to get clustered index seeks for your deletes, they should be reasonably lightweight. As long as you're not accumulating data faster than you can get rid of it, the benefit of this kind of thing is that you just run it semi-continuously and it nibbles away at the trailing end of your data
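A minimal sketch of such a batched delete (the batch size, date column and cutoff are all placeholders):
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (5000) FROM dbo.myTable
    WHERE  YourDateColumn < '2010-01-01'; -- ideally satisfied by a clustered index seek
    SET @rows = @@ROWCOUNT;
END;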
Snapshot Isolation:
If deletes cause too much blocking, you can also set up something like snapshot isolation, which basically stores historical versions of rows in tempdb. Any query which sets isolation level read committed snapshot will then read those pre-change rows instead of contending for locks on the "real" table. You can then do batched deletes to your heart's content and know that any queries that hit the table will never get blocked by a delete (or any other DML operation) because they'll either read the pre-delete snapshot, or they'll read the post-delete snapshot. They won't wait for an in-process delete to figure out whether it's going to commit or roll back. This is not without its drawbacks, unfortunately. For large data sets, it can put a big burden on tempdb and it too can be a little bit of a black box. It's also going to require buy-in from your DBAs.
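The setup itself is small - these are the database-level switches, with YourDb as a placeholder (both need a quiet moment and DBA sign-off):
ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON; -- plain READ COMMITTED reads now use row versions from tempdb
-- after this, readers see the pre-delete version of rows instead of blocking on your batched deletes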

How to avoid data fragmentation in an "INSERT once, UPDATE once" table?

I have a large number of tables that are "INSERT once", and then read-only after that. i.e.: After the initial INSERT of a record, there are never any UPDATE or DELETE statements. Due to this, the data fragmentation on disk for the tables is minimal.
I'm now considering adding a needs_action boolean field to every table. This field will only ever get updated once, and it will be done on a slow/periodic basis. As a result of MVCC, when VACUUM comes along (on an even slower schedule) after the UPDATE, the table becomes quite fragmented as it clears out the originally inserted tuples, and they get subsequently backfilled by new insertions.
In short: The addition of this single "always updated once" field changes the table from being minimally fragmented by design to being highly fragmented by design.
Is there some method of effectively achieving this single needs_action record flagging that can avoid the resulting table fragmentation?
<And now for some background/supplemental info...>
Some options considered so far...
At the risk of making this question massive (and therefore overlooked?), below are some options that have been considered so far:
Just add the column to each table, do the UPDATE and don't worry about resulting fragmentation until it actually proves to be an issue.
I'm conscious of premature optimization here, but with some of the tables getting large (>1M rows, and even >1B) I'd rather get the design right up front.
Make an independent tracking table (for every table) only containing A) the PK from the master table, and B) the needs_action flag. Create a record in the tracking table with an AFTER INSERT trigger in the master table
This will preserve "INSERT only" minimal fragmentation levels on the master table... at the expense of adding (significant?) up-front write overhead
putting the tracking tables in a separate schema would also neatly separate the functionality from the core tables
Forcing the needs_action field to be a HOT update to avoid tuple replication
Needing an index on WHERE needs_action = TRUE seems to rule out this option, but maybe there's another way to find them quickly?
Using table fillfactor (at 50?) to leave space for the inevitable UPDATE
eg: set fillfactor to 50 to leave room for the UPDATE, and therefore keep it in the same page
But... with only one UPDATE it seems like this will leave the table packing fraction eternally at 50% and take twice the storage space? I don't 100% understand this option yet... still learning.
Find a special/magical field/bit in the master table records that can be twiddled without MVCC impact.
This does not seem to exist in postgres. Even if it did, it would need to be indexed (or have some other quick finding mechanism akin to a WHERE needs_action = TRUE partial index)
Being able to optionally suppress MVCC operation on specific columns seems like it would be nice here (although surely fraught with peril)
Storing needs_action outside of postgres (eg: as a <table_name>:needs_copying list of PKs in redis) to evade fragmentation due to mvcc.
I have concerns about keeping this atomic, though. Maybe using redis_fdw (or some other fdw?) in an AFTER INSERT trigger can keep it atomic? I need to learn more about fdw capabilities... it seems like all the fdws I can find are read-only, though.
Run a fancy view with background defrag/compacting, as described in this fantastic article
Seems like a bit much to do for all tables.
Simply track ids/PKs that need copying in a postgres table
just stash the ids that need action as records in a fast, bare table (eg: no PK), and DELETE the records when the action is completed
similar to RPUSHing to an offline redis list (but definitely ACID)
This seems like the best option at the moment.
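For concreteness, a minimal sketch of what I have in mind for this option, combined with the AFTER INSERT trigger idea from above (the table master_table and its id column are made up):
CREATE TABLE master_needs_action (master_id bigint); -- deliberately bare: no PK, no index

CREATE FUNCTION track_needs_action() RETURNS trigger AS $$
BEGIN
    INSERT INTO master_needs_action (master_id) VALUES (NEW.id);
    RETURN NULL; -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER master_track_needs_action
AFTER INSERT ON master_table
FOR EACH ROW EXECUTE PROCEDURE track_needs_action();

-- the periodic worker consumes the side table and then:
-- DELETE FROM master_needs_action WHERE master_id IN (...the ids just processed...);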
Are there other options to consider?
More on the specific use case driving this...
I'm interested in the general case how to avoid this fragmentation, but here's a bit more on the current use case:
Read performance is much more important than write performance on all tables (but avoiding crazy slow writes is clearly desirable)
Some tables will reach millions of rows. A few may hit billions of rows.
SELECT queries will be across wide table ranges (not just recent data) and can range from single result records to 100k+
Table design can be done from scratch... no need to worry about existing data
PostgreSQL 9.6
I would just lower the fillfactor below the default value of 100.
Depending on the size of the row, use a value like 80 or 90, so that a few new rows will still fit into the block. After the update, the old row can be “pruned” and defragmented by the next transaction so that the space can be reused.
A value of 50 seems way too low. True, this would leave room for all rows in a block being updated at the same time, but that is not your use case, right?
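For example (the table, columns and exact fillfactor are placeholders):
CREATE TABLE my_insert_once_table (
    id           bigint PRIMARY KEY,
    payload      text,
    needs_action boolean NOT NULL DEFAULT true
) WITH (fillfactor = 90); -- leave ~10% of each page free so the single UPDATE can stay in the same page

-- or, for an existing table (affects only newly written pages):
ALTER TABLE my_insert_once_table SET (fillfactor = 90);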
You can try to use inherited tables. This approach does not directly support a PK across the tables, but that might be worked around with triggers.
CREATE TABLE data_parent (a int8, updated bool);
CREATE TABLE data_inserted (CHECK (NOT updated)) INHERITS (data_parent);
CREATE TABLE data_updated (CHECK (updated)) INHERITS (data_parent);
CREATE FUNCTION d_insert () RETURNS TRIGGER AS $$
BEGIN
NEW.updated = false;
INSERT INTO data_inserted VALUES (NEW.*);
RETURN NULL;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
CREATE TRIGGER d_insert BEFORE INSERT ON data_parent FOR EACH ROW EXECUTE PROCEDURE d_insert();
CREATE FUNCTION d_update () RETURNS TRIGGER AS $$
BEGIN
NEW.updated = true;
INSERT INTO data_updated VALUES (NEW.*);
DELETE FROM data_inserted WHERE (data_inserted.*) IN (OLD);
RETURN NULL;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
CREATE TRIGGER d_update BEFORE UPDATE ON data_inserted FOR EACH ROW EXECUTE PROCEDURE d_update();
-- GRANT on d_insert to regular user
-- REVOKE insert / update to regular user on data_inserted/updated
INSERT INTO data_parent (a) VALUES (1);
SELECT * FROM ONLY data_parent;
SELECT * FROM ONLY data_inserted;
SELECT * FROM ONLY data_updated;
INSERT 0 0
a | updated
---+---------
(0 rows)
a | updated
---+---------
1 | f
(1 row)
a | updated
---+---------
(0 rows)
UPDATE data_parent SET updated = true;
SELECT * FROM ONLY data_parent;
SELECT * FROM ONLY data_inserted;
SELECT * FROM ONLY data_updated;
UPDATE 0
a | updated
---+---------
(0 rows)
a | updated
---+---------
(0 rows)
a | updated
---+---------
1 | t
(1 row)

Reverting a database insertion with log files?

I am working on a program that is supposed to insert hundreds of rows to the database per run.
The problem is that if the inserted data turns out to be wrong, how can we recover from that run? Currently I only have a log file (in a format I created), which records the raw data that gets inserted (no metadata nor primary keys). Is there a way to create a log that the database can understand, so that when we want to undo the insertion we can feed the database that log file?
Or, if there is an alternative mechanism for undoing an operation from a program, kindly let me know. Thanks.
The fact that this is only hundreds of rows makes it susceptible to the great-grandmother of all undo mechanisms:
have a table importruns with a row for each run you do. I assume it has an integer auto-increment PK
add a field to your data table that carries the PK of the import run
for insert-only runs, you just need to DELETE FROM sometable WHERE importid=$whatever
If you also have replace/update imports, go one step further
for each data table have a corresponding table, that has one field more: superseededby
for each row you update/replace, place an original copy of the row in this table plus the import id in superseededby
to revert, you now additionally run INSERT INTO originaltable SELECT * FROM superseededtable WHERE superseededby=$whatever (in practice listing the original columns rather than *, since superseededtable has the extra superseededby field)
You can clean up superseededtable for known-good imports, to make sure, storage doesn't grow unlimited.
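A minimal sketch of that bookkeeping in generic SQL (names are illustrative):
CREATE TABLE importruns (
    id         integer PRIMARY KEY, -- auto-increment / identity in your RDBMS of choice
    started_at timestamp NOT NULL
);

ALTER TABLE sometable ADD COLUMN importid integer; -- every imported row carries its run's id

-- reverting an insert-only run (run 42 purely as an example):
DELETE FROM sometable WHERE importid = 42;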
You have several options, depending on when you notice the error.
If you know there is an error with the data, then you can use the transaction API to roll back the changes of the current transaction.
In case you know there was an error only later, then you can create your own log. Make an index identifying the transaction, and add a field to the relevant table where that id would be inserted. This would allow you to identify exactly which transaction it came from. You can also create a stored procedure that deletes rows according to the given transaction id.
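As a rough illustration of both points (the statements, table name and run_id column are placeholders):
-- error caught before commit: just roll back the transaction
BEGIN;
INSERT INTO relevant_table (run_id, some_column) VALUES (42, 'some value');
-- ...validation fails somewhere here...
ROLLBACK; -- nothing from this run is persisted

-- error noticed later: delete by the id you stored with each row
DELETE FROM relevant_table WHERE run_id = 42;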