I am working on a program that is supposed to insert hundreds of rows into the database per run.
The problem is: if the inserted data turns out to be wrong, how can we recover from that run? Currently I only have a log file (in a format I created) that records the raw data that gets inserted (no metadata, no primary keys). Is there a way to create a log that the database itself can understand, so that when we want to undo the insertion we just feed the database that log file?
Or, if there is an alternative mechanism for undoing an operation from a program, kindly let me know. Thanks.
The fact that this is only hundreds of rows makes it susceptible to the great-grandmother of all undo mechanisms:
have a table importruns with a row for each run you do. I assume it has an integer auto-increment PK
add a field to your data table that carries the PK of the import run
for insert-only runs, you just need to DELETE FROM sometable WHERE importid=$whatever
If you also have replace/update imports, go one step further
for each data table, have a corresponding table that has one field more: superseededby
for each row you update/replace, place an original copy of the row in this table plus the import id in superseededby
to revert, you now just have to run INSERT INTO originaltable SELECT * FROM superseededtable WHERE superseededby=$whatever
You can clean up superseededtable for known-good imports, to make sure storage doesn't grow without bound.
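A minimal sketch of this scheme in generic SQL (importruns, sometable, importid, superseededtable and superseededby are the names used above; the other columns and the run id 42 are placeholders):

-- one row per import run (use your dialect's auto-increment for the PK)
CREATE TABLE importruns (
    id         INTEGER PRIMARY KEY,
    started_at TIMESTAMP NOT NULL
);

-- the data table carries the PK of the run that inserted each row
ALTER TABLE sometable ADD COLUMN importid INTEGER REFERENCES importruns(id);

-- undo an insert-only run
DELETE FROM sometable WHERE importid = 42;

-- undo updates/replaces: copy the saved originals back
-- (name the columns explicitly, since superseededtable has the extra superseededby field)
INSERT INTO sometable (col1, col2, importid)
SELECT col1, col2, importid FROM superseededtable WHERE superseededby = 42;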
You have several options, depending on when you notice the error.
If you know the data is wrong before the transaction commits, you can use the transaction API to roll back the changes of the current transaction.
If you only find out about the error later, you can create your own log. Make an id identifying the run/transaction and add a field to the relevant table where that id is stored. This lets you identify exactly which run each row came from. You can also create a stored procedure that deletes rows for a given id.
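A minimal sketch of the first option, rolling back inside the transaction (plain SQL shown here; the exact call depends on your language's database API, and the table and column names are placeholders):

BEGIN TRANSACTION;

INSERT INTO sometable (importid, payload) VALUES (42, 'row 1');
INSERT INTO sometable (importid, payload) VALUES (42, 'row 2');
-- ... hundreds more rows ...

-- bad data detected before committing: nothing is kept
ROLLBACK;
-- otherwise, make the run permanent:
-- COMMIT;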
Currently I have many tables whose contents I have to refresh, sometimes on a daily or weekly basis. So far I've been doing this with a combination of DROP TABLE IF EXISTS some_schema.some_table_name; followed by CREATE TABLE some_schema.some_table_name AS ( SELECT ... FROM ... WHERE ...); and I would like to know the "best-practice" or proper way of doing this.
I've read that INSERT operations in Redshift are quite expensive, so I've been avoiding them, but maybe using TRUNCATE with INSERT is better than dropping and creating.
How can I confirm which option is better?
I've seen this article from the Redshift docs, but I'm not sure it is the best option, since I may need not only to remove records but also to keep some and insert others.
If your desire is to completely erase a table and replace the data, then the general pattern you are following is fine. However, there are a few things you should be doing to make things safer / better.
There are 3 patterns for doing this, and one is clearly the lowest performing. They are Delete/Insert, Truncate/Insert, and Drop/Create/Insert. Of these, Delete/Insert is NOT what you want to do from a performance point of view. This process invalidates all the rows in the table (it does not delete them) and adds new valid rows. This doubles the size of the table, wastes space, and requires a vacuum. The only upside of this approach is that it doesn't have the downsides of the other approaches, but that only matters in certain situations. Go down this path only if you have to.
Truncate/Insert is fast and maintains the same table id as the original table. Because truncate operates on the blocks of the table (unlinking them), it is fast, but there is some small overhead in managing all the block links. Since the table definition is unchanged, all DDL stays defined and dependent views can keep pointing to the table. The downside with truncate is that it forces a COMMIT, which means that until the table is repopulated with new data, other users of the database can see an empty table. This can lead to incorrect results during those windows. Not good.
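For reference, the Truncate/Insert version of the question's refresh looks roughly like this (some_schema.some_table_name comes from the question; the staging source is a placeholder for your SELECT ... FROM ... WHERE ...):

TRUNCATE TABLE some_schema.some_table_name;   -- fast, but forces an implicit COMMIT
INSERT INTO some_schema.some_table_name
SELECT * FROM some_schema.staging_table;      -- placeholder source query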
Lastly there is Drop/Create/Insert. This approach is marginally faster than truncate in the ideal case (very slightly, and only for large tables). It just throws away the old blocks. There is some additional cost to setting up the new table (of the same name), so truncate and drop are about the same speed unless the table is large. Since Drop can be inside a transaction block, the empty table won't be seen by third parties (if done correctly). The downside with this approach is that the old table and the new table are entirely different tables (different oids) - they just happen to have the same name. This means that any dependent (regular) views will need to be dropped and recreated as well. Also, since the old table is "going away", the commit of the transaction cannot complete until all uses of the table are complete. This becomes a large problem when someone leaves a transaction open in their bench and goes home for the night. And since the table needs to be recreated, your process needs to know the complete and correct DDL for the table.
Hopefully this gives you some idea of when to use these different approaches. Two things I see that could be better in your current code: 1) You are not using a transaction block (as far as I can tell), so there is a window when others will see that the table doesn't exist or is empty. This may or may not matter to you, but be aware of it. 2) "CREATE TABLE AS" doesn't define the DDL of the table in a performant structure (and possibly defines it incorrectly). You should always specify your permanent tables fully. Sort and dist keys matter, as do varchar lengths, data types, etc. This is a time bomb waiting to go off.
Per request for an example of drop/create/insert:
As I mentioned, there are lock dependency issues that can arise with this method, so I like to use a "swap & drop" approach. This makes the new information visible to users at the "swap", so even if the "drop" gets blocked, things still get published on time. It doesn't remove the lock risk, since a lock can still prevent the process (session) from completing; it just makes the new data visible (published) while you hunt down the offender.
(Please note that for transactions to execute properly you need to be sure that extra COMMITs are not being inserted into the process. This can happen with benches that are configured in "autocommit" mode.)
Create table new_table ( ... ) ...; -- make the new table but with a different name (and unique from other tables) than the existing table
Insert into new_table ... ; -- put the desired data into the new table
Analyze new_table; -- to ensure metadata is up to date
Begin; -- start transaction
Alter table perm_table rename to old_table; -- rename existing table
Alter table new_table rename to perm_table; -- complete the swap
Commit; -- publish the new data for all to see but transactions still using the original data can keep doing so
Drop table old_table; -- remove the old data to free up space
Commit;
This process is just one example. Sometimes you want to keep the old versions of the table around for a while (history / error recovery) so you will date stamp the old data and have a separate process to free up the space. This also helps with stray locks clogging up the works - only the clean up process gets stalled. You can also have the recreation of views in the process so that these are updated in the same transaction. And so on.
I think you will need to use the Update command. I understand that dropping a table is a risky move, as you might lose all of the data in your database.
Update s set
    s.Id = 'whatever you want to update',
    s.Name = 'whatever you want to update',
    s.LastName = 'whatever you want to update',
    s.OtherTableColumn = 'whatever you want to update'
From
    some_table_name s
In the above code I assumed your table had four columns (1-Id, 2-Name, 3-LastName, 4-OtherTableColumn). If you have more or fewer columns, adjust accordingly.
I would also write an update procedure for this (and for each table), so if you need to update somewhat frequently you can just use the procedure; I think it's quicker. Below is my procedure:
Create Proc sp_UpdateSome_table_name
    @Id int,
    @Name nvarchar(255),
    @LastName nvarchar(255),
    @OtherTableColumn int
AS
BEGIN
    Update s set
        s.Name = @Name,
        s.LastName = @LastName,
        s.OtherTableColumn = @OtherTableColumn
    From
        some_table_name s
    Where
        s.Id = @Id
END
You want to make sure that each column in your table is defined with the correct data type in the procedure. For example, above I assumed that @Id was int, @Name was nvarchar(255), etc. If you want to allow yourself not to pass a value for a parameter when updating, you can give it a default of NULL after the data type; for example, if you write @Id int = NULL you can omit it when calling the procedure. But if you are not sure what this means, simply ignore this sentence for now.
Once you've made sure the above is right (the data types are correct), select the entire procedure and execute it (F5). This stores the procedure.
Then, every time you want to update your table, you call the procedure as shown below:
Exec sp_UpdateSome_table_name 1, 'John', 'Smith', 77
If you highlight the above command and execute it (F5), it will update the row that has Id=1, setting the name to John, the last name to Smith, and the other column to 77, whatever they were before. If there is no row in the table with Id=1, nothing will be updated.
Keep in mind that the last column assignment must not end with a comma. The code above is written correctly; I'm just pointing it out because you might add a trailing comma out of habit.
I need to know when the data in a table was last modified (data inserted, updated, or deleted). The common answer is to extract the data from sys.dm_db_index_usage_stats, but I don't have access to that view, and as I understand it access to it is a server-level permission. Since this is on a corporate shared server, the odds of the IT department granting me access are up there with pigs flying.
Is there a way to get this information, that does not require greatly elevated privileges?
Update #1
Some additional information: what I am doing is caching the contents of this table locally, and I need to know when to update my local cache. This means:
I don't need an actual timestamp, just any value that I can compare against a locally-cached value that tells me "things have changed, update the cache"
If this value changes more often than strictly needed (e.g. it gets reset every time they restart the server, which happens extremely rarely), that's OK; it just means I do an extra cache update that I didn't actually need to do.
Update #2
I should have checked this early on, but I just assumed I would be able to create tables as needed ... and I can't. Permission not granted. So neither of the proposed methods will work :-(
Here's my new thought: I can call
select checksum_agg(binary_checksum(*))
and get a checksum of the entire remote table. Then, if the checksum changes, I know the table has changed and I can update my local cache. I've seen the problems with CHECKSUM, but using HASHBYTES sounds like it would be much more complicated and much slower.
This is working; the problem is that when the checksum changes, I have to reload the entire cache. My table has so many rows that returning a checksum per row takes an unacceptably long time. Is there a way to use the OVER clause to get, say, 10 checksums: the first checksum for the first tenth of the rows, and so on?
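One way to get a handful of partial checksums instead of one per row, sketched here assuming the table is dbo.TheirTable with a unique key column UniqueID (both placeholder names) and 10 buckets:

;WITH per_row AS (
    SELECT binary_checksum(*) AS row_cs,
           NTILE(10) OVER (ORDER BY UniqueID) AS bucket   -- split the rows into 10 contiguous chunks by key
    FROM dbo.TheirTable
)
SELECT bucket, checksum_agg(row_cs) AS bucket_checksum    -- one checksum per chunk; reload only the chunks that changed
FROM per_row
GROUP BY bucket
ORDER BY bucket;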
If you can modify their schema, then you should be able to add a trigger.
Once you add a lastmodified column, you can use this trigger to record the time whenever a record changes:
CREATE trigger [dbo].[TR_TheirTable_Timestamp]
on [dbo].[TheirTable] for update
as
begin
update
dbo.TheirTable
set
lastmodified=getdate()
from
Inserted
where
dbo.TheirTable.UniqueID=Inserted.UniqueID
end
The reason I do it only for update and not insert is that I can see a new record without needing a timestamp, but for a modified record I need something to compare against the time I last picked up that record. If you want it for insert as well, then
on [dbo].[TheirTable] for insert,update
would work
If you just wanted to know when the table was updated, the trigger could instead write to another table with the table name and date, and you wouldn't have to modify their schema at all.
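A sketch of that variant, assuming you are allowed to create one small log table (TableChangeLog and its column names are placeholders):

CREATE TABLE dbo.TableChangeLog (TableName sysname NOT NULL, ChangedAt datetime NOT NULL);
GO
CREATE TRIGGER [dbo].[TR_TheirTable_ChangeLog]
ON [dbo].[TheirTable] FOR INSERT, UPDATE, DELETE
AS
BEGIN
    INSERT INTO dbo.TableChangeLog (TableName, ChangedAt)
    VALUES ('TheirTable', getdate());
END
GO
-- the cache refresher then only needs to poll this value and compare it to the cached one:
SELECT MAX(ChangedAt) FROM dbo.TableChangeLog WHERE TableName = 'TheirTable';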
I need to keep a copy of records right after they are deleted from a table, so they can be recovered in case of accidental deletion.
I am using MS Access. Is there any built in way to do it or will I have to INSERT INTO SELECT before every DELETE?
Doing it for just one table is not a concern. I want something ready to use for any table regardless of its structure, so I don't need to create and configure another recycle-bin table for every table I have in the database, which would be necessary if I wanted the move operations to succeed.
Besides SQL, I can run VBA to accomplish this task.
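For a single table, what I mean by the INSERT INTO ... SELECT before DELETE approach is roughly this (Customers and Customers_Deleted are just example names, the second table having the same structure):

INSERT INTO Customers_Deleted SELECT * FROM Customers WHERE ID = 123;
DELETE FROM Customers WHERE ID = 123;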
EDIT
There are recommendations to add a boolean column that indicates whether the record should be displayed or is archived ("deleted" for my purposes), but that involves changing every table and every query I have written, so it only works for me as a last resort.
What happens when you have cascading deletions, as in all well-designed databases? Your INSERT into a backup table before the DELETE will not solve all the issues you will face, either. Also, copying tables can produce a lot of copies that will increase your database size, so sooner or later you will have to clean that data up.
Wouldn't journaling be a better solution?
I have a problem that I haven't been able to come up with a solution for yet. I have a database (actually thousands of them, at customer sites) that I want to extract data from periodically. I'd like to do a full data extract one time (select * from table), then after that only get rows that have changed.
The challenge is that there aren't any updated date columns in most of the tables that could be used to constrain the SQL query. I can't use a trigger based approach nor change the application that writes to the database since it's another group that develops the app and they are way backed up already.
I may be able to write to the database tables when doing the data extract, but would prefer not to do that. Does anyone have any ideas for how we might be able to do this?
You will have to programmatically mark the records. I see suggestions of an auto-incrementing field, but that will only catch newly inserted records. How will you track updated or deleted records?
If you only want newly inserted records, then an auto-incrementing field will do the job: in subsequent data dumps, grab everything since the last value of the auto-increment field, then record the current value.
If you want updates, the minimum I can see is a last_update field, probably populated by a trigger. If the last_update is later than the last data dump, grab that record. This will get inserts and updates, but not deletes.
You could try something like an 'instead of delete' trigger, if your RDBMS supports it, that NULLs the last_update field. On subsequent data dumps, grab all records where this field is NULL and then delete them. But there would be problems with this (e.g. how to stop the app from seeing them between the logical and the physical delete).
The most foolproof method I can see is a set of history (audit) tables, with each change written to them. Then you select your data dump from there.
By the way, do you only care about knowing that updates have happened? What if 2 (or more) updates have happened? The history table is the only way I can see of capturing that scenario.
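A rough sketch of the last_update idea in SQL Server syntax, assuming a source table dbo.SourceTable keyed by Id (all names and the example date are placeholders):

ALTER TABLE dbo.SourceTable ADD last_update datetime NULL;
GO
CREATE TRIGGER dbo.TR_SourceTable_LastUpdate
ON dbo.SourceTable AFTER INSERT, UPDATE
AS
BEGIN
    UPDATE s
    SET last_update = getdate()
    FROM dbo.SourceTable s
    JOIN inserted i ON i.Id = s.Id;
END
GO
-- each extract grabs everything changed since the previous dump
DECLARE @last_extract datetime = '2024-01-01';   -- the value recorded at the previous dump
SELECT * FROM dbo.SourceTable WHERE last_update > @last_extract;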
This should isolate rows that have changed since your last backup, assuming DestinationTable is a column-for-column copy of SourceTable, including the key fields; if not, you could list out the important fields.
SELECT * FROM SourceTable
EXCEPT
SELECT * FROM DestinationTable
Let's say I want to import all the customers (or all the rows in some other specific table) into some external system. Not all at once, but each one after it has been created in the database. To do that I have to keep a record of all the rows that have already been reported, because I want to find only the ones that have not been reported yet. Is it generally better to add a column for that, or to create some kind of batch-log table?
I'm using MS SQL Server if that is relevant
A simplified example:
select * from Customer where reportedToExternalSystem is null
or
select * from Customer where cus_id not in (select cus_id from integrationBatchLog)
or is there maybe some other way to do this that might be even better? This is the first time I've done something like this, so I don't know the best practice yet.
The simple solution is to add a column that marks the row as imported: a status int (0/1), or an imported date if you want to keep track of when it was imported (a minimal sketch follows the list of limitations below). This solution does have some limitations:
You can only import the row once. Do you need to import the customer again when the record is updated? Are you going to clear the imported flag when the customer is updated?
It causes the row to be locked when you update the row status. Are you sure the application that inserts the customer record will be happy with your code locking the records?
On some systems it causes the entire row to be written to the log for recovery. Depending on the size of the row, this can be a lot of log writing for just one field.
In a highly parallel import system you can have a lot of contention for resources. If one import program is locking the table, think how bad it would be if many import programs are locking the table at the same time.
If the customer data is updated several times between your import polling intervals, you will only see the latest data and will skip over the intermediate updates. This is only an issue if you care about the intermediate updates. For customers you might not care; for order statuses you might care a lot.
You have to modify the table structure. This might not be allowed by the source application due to data/support/political issues.
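For reference, a minimal sketch of that simple approach, reusing the Customer table and reportedToExternalSystem column from the question (getdate() assumes SQL Server):

ALTER TABLE Customer ADD reportedToExternalSystem datetime NULL;

-- export pass: pick up everything not yet reported ...
SELECT * FROM Customer WHERE reportedToExternalSystem IS NULL;

-- ... and mark it once the external system has accepted it
-- (do both steps in one transaction, or mark first and export the marked set, so rows inserted in between aren't skipped)
UPDATE Customer SET reportedToExternalSystem = getdate()
WHERE reportedToExternalSystem IS NULL;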
Besides putting a status column in the table, one technique that works well is to put a trigger on the table and mirror the import data to a second table. You would then 'consume' the data in the second table. This has several advantages:
It keeps the locking issues contained to the second table.
It allows you to process every update to the main table.
You can add an index to the second table that is used to keep track of the update statuses without the issues of changing the main table.
If you delete the rows from the second table (either immediately as they are consumed or after a short audit period), the size of the table/index will be kept to a minimum.
When I use this technique in SQL Server, I put the second table in a separate schema. Since most apps store their tables in dbo, you end up with dbo.Customers and Import.Customers. This helps you keep track of which tables you are importing and keeps you from having to come up with new names for your import tables.
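A minimal sketch of the trigger/mirror technique, using the dbo.Customers / Import.Customers naming from above and assuming Import.Customers already exists with the same columns (make any IDENTITY column a plain int in the mirror so the trigger can copy it):

CREATE SCHEMA Import;
GO
CREATE TRIGGER dbo.TR_Customers_MirrorForImport
ON dbo.Customers AFTER INSERT, UPDATE
AS
BEGIN
    -- every inserted or updated row lands in the mirror for the import job to consume
    INSERT INTO Import.Customers
    SELECT * FROM inserted;
END
GO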
Unless you have a reason to complicate the implementation, go with the simplest solution possible. One important thing to consider is how hard it would be to refactor the simple solution into a more general one, should you ever need it.
In your case I see only one problem with upgrading from a column to a table: if you were ever to need a history of imports. Solution: make the reportedToExternalSystem column a DateTime (or Timestamp) type.
I would use a separate table indicating, say, the import date cross-referenced to the key of the record in the table you're tracking. In other words, a table with 3 columns: auto-increment key, record id from the other table, import date. Something like that. This also covers the case where a record is re-imported later: you'd have a record of all the imports by date.
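Roughly, reusing the integrationBatchLog name from the question (the column names are just examples):

CREATE TABLE integrationBatchLog (
    id          int IDENTITY(1,1) PRIMARY KEY,   -- auto-increment key
    cus_id      int NOT NULL,                    -- key of the Customer row that was imported
    imported_at datetime NOT NULL DEFAULT getdate()
);

-- rows not yet reported, as in the question:
SELECT * FROM Customer
WHERE cus_id NOT IN (SELECT cus_id FROM integrationBatchLog);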
I prefer having a column for the import status. Maintaining a separate log table gets slow as that table grows. I only know SQL Server conceptually, but this approach seems to work. Keep posting!