Updating different fields in different rows - sql

I've tried to ask this question at least once, but I never seem to put it across properly. I really have two questions.
My database has a table called PatientCarePlans
( ID, Name, LastReviewed, LastChanged, PatientID, DateStarted, DateEnded). There are many other fields, but these are the most important.
Every hour, a JSON extract gets a fresh copy of the data for PatientCarePlans, which may or may not be different to the existing records. That data is stored temporarily in PatientCarePlansDump. Unlike other tables which will rarely change, and if they do only one or two fields, with this table there are MANY fields which may now be different. Therefore, rather than simply copy the Dump files to the live table based on whether the record already exists or not, my code does the no doubt wrong thing: I empty out any records from PatientCarePlans from that location, and then copy them all from the Dump table back to the live one. Since I don't know whether or not there are any changes, and there are far too many fields to manually check, I must assume that each record is different in some way or another and act accordingly.
My first question is how best (I have OKish basic knowledge, but this is essentially a useful hobby, and therefore have limited technical / theoretical knowledge) do I ensure that there is minimal disruption to the PatientCarePlans table whilst doing so? At present, my code is:
IF OBJECT_ID('PatientCarePlans') IS NOT NULL
BEGIN
    BEGIN TRANSACTION
    DELETE FROM [PatientCarePlans]
    WHERE PatientID IN (SELECT PatientID FROM [Patients] WHERE location = @facility)
    COMMIT TRANSACTION
END
ELSE
    SELECT TOP 0 * INTO [PatientCarePlans]
    FROM [PatientCareplansDUMP]

INSERT INTO [PatientCarePlans] SELECT * FROM [PatientCarePlansDump]
DROP TABLE [PatientCarePlansDUMP]
My second question relates to how this process affects the numerous queries that run on and around the same time as this import. Very often those queries will act as though there are no records in the PatientCarePlans table, which causes obvious problems. I'm vaguely aware of transaction locks etc, but it goes a bit over my head given the hobby status! How can I ensure that a query is executed and results returned whilst this process is taking place? Is there a more efficient or less obstructive method of updating the table, rather than simply removing them and re-adding? I know there are merge and update commands, but none of the examples seem to fit my issue, which only confuses me more!
Apologies for the lack of knowhow, though that of course is why I'm here asking the question.
Thanks

I suggest you do not delete and re-create the table. The DDL script to create the table should be part of your database setup, not part of regular modification scripts.
You are going to want to do the DELETE and INSERT inside a transaction. Preferably you would do this under SERIALIZABLE isolation in order to prevent concurrency issues. (You could instead use a WITH (TABLOCK) hint, which would be less likely to cause a deadlock, but will lock the whole table.)
SET XACT_ABORT, NOCOUNT ON; -- always set XACT_ABORT if you have a transaction
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;

DELETE FROM [PatientCarePlans]
WHERE PatientID IN (
    SELECT p.PatientID
    FROM [Patients] p
    WHERE p.location = @facility
);

INSERT INTO [PatientCarePlans] (YourColumnsHere) -- always specify columns
SELECT YourColumnsHere
FROM [PatientCarePlansDump];

COMMIT;
You could also do this with a single MERGE statement. However, it is complex to write (owing to the need to restrict the set of rows being targeted), it is not usually more performant than separate statements, and it also needs SERIALIZABLE.
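For what it's worth, a rough sketch of what that MERGE could look like is below. Only a handful of the many columns are shown, the facility filter is applied on both sides (via a CTE for the target) so rows for other locations are left alone, and I'm assuming ID is not an IDENTITY column:

SET XACT_ABORT, NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;

WITH tgt AS (
    -- restrict the target to this facility's rows only,
    -- so WHEN NOT MATCHED BY SOURCE cannot touch other locations
    SELECT cp.*
    FROM [PatientCarePlans] cp
    WHERE cp.PatientID IN (SELECT PatientID FROM [Patients] WHERE location = @facility)
)
MERGE tgt
USING (
    SELECT d.*
    FROM [PatientCarePlansDump] d
    WHERE d.PatientID IN (SELECT PatientID FROM [Patients] WHERE location = @facility)
) AS src
    ON tgt.ID = src.ID
WHEN MATCHED THEN
    UPDATE SET Name         = src.Name,
               LastReviewed = src.LastReviewed,
               LastChanged  = src.LastChanged,
               PatientID    = src.PatientID,
               DateStarted  = src.DateStarted,
               DateEnded    = src.DateEnded
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, Name, LastReviewed, LastChanged, PatientID, DateStarted, DateEnded)
    VALUES (src.ID, src.Name, src.LastReviewed, src.LastChanged, src.PatientID, src.DateStarted, src.DateEnded)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;

COMMIT;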

Related

Proper way of updating a whole table in Redshift, drop table + create table vs. truncate + insert into table

Currently I have many tables for which I have to update the information they hold, sometimes on a daily or weekly basis. So far, I've been doing this with a combination of DROP TABLE IF EXISTS some_schema.some_table_name; followed by CREATE TABLE some_schema.some_table_name AS (SELECT ... FROM ... WHERE ...); and I would like to know what the "best-practice" or proper way of doing so is.
I've read that INSERT operations in Redshift are quite expensive, so I've been avoiding its usage, but maybe the use of TRUNCATE with INSERT is better than dropping and creating.
How can I confirm which option is better?
I've seen this article from the Redshift docs, but I'm not sure it is the best option, since I may need not only to remove records but also to keep some and insert new ones.
If your desire is to completely erase a table and replace the data, then the general pattern you are following is fine. However, there are a few things you should be doing to make things safer / better.
There are 3 patterns to do this and one is clearly the lowest performance. These are Delete/Insert, Truncate/Insert, and Drop/Create/Insert. Of these, Delete/Insert is NOT what you want to do from a performance point of view. This process invalidates all the rows in the table (it does not delete them) and adds new valid rows. This doubles the size of the table, wastes space, and the table needs to be vacuumed. The only upside of this approach is that it doesn't have the downsides of the other two approaches, but this only matters in certain situations. Go down this path only if you have to.
Truncate/Insert is fast and maintains the same table id as the original table. Because truncate operates on the blocks of the table (unlinking them) it is fast, but there is some small overhead in managing all the block links. Since the table definition is unchanged, all DDL stays defined and dependent views can keep pointing to the table. The downside with truncate is that it forces a COMMIT to occur, which means that until the table is repopulated with new data, other users of the database can see an empty table. This can lead to incorrect results during these windows. Not good.
Lastly there is Drop/Create/Insert. This approach is marginally (very slightly, and only for large tables) faster than truncate in the ideal case. It just throws away the old blocks. There is some additional cost to setting up the new table (of the same name), so truncate and drop are about the same speed unless the table is large. Since the Drop can be inside a transaction block, the empty table won't be seen by third parties (if done correctly). The downside with this approach is that the old table and new table are entirely different tables (different oids) - they just happen to have the same name. This means that any dependent (regular) views will need to be dropped and recreated as well. Also, since this table is "going away", the commit of the transaction cannot complete until all uses of the table are complete. This becomes a large problem when someone leaves a transaction open in their workbench and goes home for the night. Since the table needs to be recreated, your process needs to know the complete and correct DDL for the table.
Hopefully this gives you some idea of when to use these different approaches. Two things I see that could be better in your current code: 1) You are not using a transaction block (as far as I can tell), so there is a window when others will see that the table doesn't exist or is empty. This may or may not be important to you, but be aware. 2) "Create table As" doesn't define the DDL of the table in a performant structure (and can get it wrong). You should always specify your permanent tables fully. Sort and Dist keys matter, as do varchar lengths, data types etc. This is a time bomb waiting to go off.
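For illustration, a fully specified permanent table might look something like the sketch below; the columns, types and keys here are made up, so pick the DISTKEY/SORTKEY that match your own join and filter patterns:

CREATE TABLE some_schema.some_table_name (
    id           BIGINT        NOT NULL,
    customer_id  INTEGER       NOT NULL,
    event_ts     TIMESTAMP     NOT NULL,
    status       VARCHAR(16),
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)     -- collocate rows you join on
SORTKEY (event_ts);       -- speed up range filters on the sort column

Loading it is then an explicit INSERT INTO some_schema.some_table_name SELECT ... rather than a CREATE TABLE AS, so the DDL above stays authoritative.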
Per request for an example of drop/create/insert:
As I mentioned, there are lock dependency issues that can arise with this method, so I like to use a "swap & drop" approach to this path. This makes the new information visible to users at the "swap", so even if the "drop" gets blocked, things get published on time. This doesn't remove the lock risk, as a lock can still prevent the process (session) from completing; it just makes it so that the new data is visible (published) while you hunt down the offender.
(Please note that for transactions to execute properly you need to be sure that extra COMMITs are not being inserted into the process. This can happen with benches that are configured in "autocommit" mode.)
Create table new_table ( ... ) ...; -- make the new table but with a different name (and unique from other tables) than the existing table
Insert into new_table ... ; -- put the desired data into the new table
Analyze new_table; -- to ensure metadata is up to date
Begin; -- start transaction
Alter table perm_table rename to old_table; -- rename existing table
Alter table new_table rename to perm_table; -- complete the swap
Commit; -- publish the new data for all to see but transactions still using the original data can keep doing so
Drop table old_table; -- remove the old data to free up space
Commit;
This process is just one example. Sometimes you want to keep the old versions of the table around for a while (history / error recovery) so you will date stamp the old data and have a separate process to free up the space. This also helps with stray locks clogging up the works - only the clean up process gets stalled. You can also have the recreation of views in the process so that these are updated in the same transaction. And so on.
I think you will need to use the Update command. I understand that dropping a table is a risky move, as you might lose all of your data from your database.
Update s set
    s.Id = 'whatever you want to update',
    s.Name = 'whatever you want to update',
    s.LastName = 'whatever you want to update',
    s.OtherTableColumn = 'whatever you want to update'
From
    some_table_name s
In the above code I assumed your table had four columns (1-Id, 2-Name, 3-LastName, 4-OtherTableColumn). If you have more or fewer columns, adjust accordingly.
I would also write an update procedure for this (and each table), so if you need to update somewhat frequently you can just use the procedure; I think it's quicker. Below would be my procedure:
Create Proc sp_UpdateSome_table_name
    @Id int,
    @name nvarchar(255),
    @lastname nvarchar(255),
    @OtherTableColumn int
AS
BEGIN
    Update s set
        s.Name = @name,
        s.LastName = @lastname,
        s.OtherTableColumn = @OtherTableColumn
    From
        some_table_name s
    Where
        s.Id = @Id
END
You want to make sure that each column in your table is defined with the correct data type in the procedure. For example, I assumed above that @Id was int, Name was nvarchar(255) etc. If you want to allow yourself not to pass a value for a certain parameter when updating, then after the data type you can write = NULL; for example, if you write @Id int = NULL, then that parameter defaults to NULL when you leave it out; but if you are not sure what this is, simply ignore this sentence for now.
Once you are sure the paragraph above is satisfied (data types are correct), select the entire procedure and execute it (F5). This will create the procedure.
Then you can call the procedure every time you want to update your table, as shown below:
Exec sp_UpdateSome_table_name 1, N'John', N'Smith', 77
If you highlight the above command and execute it (F5), it will update the row which has Id=1, making the name John, the last name Smith and the other column 77, whatever they were before. If there is no row in the table with Id=1, the update simply affects no rows.
Keep in mind that the last column assignment must not have a trailing comma. The code above is written correctly; I'm just pointing it out as you might put a comma there out of habit.

How to establish read-only-once implement within SAP HANA?

Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to prevent concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or at least I haven't implemented it such that rows are processed only once across all of the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
in in_limit int
,in in_run_group_id varchar(36)
,out ot_result table (
id bigint
,runGroupId varchar(36)
,sourceTableRefId integer
,name nvarchar(22)
,location nvarchar(13)
,regionCode nvarchar(3)
,countryCode nvarchar(3)
)
) language sqlscript sql security definer as
begin
-- insert new records:
insert into "ACOX"."search_result_v4" (
"RUN_GROUP_ID"
,"BEGIN_DATE_TS"
,"SOURCE_TABLE"
,"SOURCE_TABLE_REFID"
)
select
in_run_group_id as "RUN_GROUP_ID"
,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
,'acox.searchCriteria' as "SOURCE_TABLE"
,fp.descriptor_id as "SOURCE_TABLE_REFID"
from
acox.searchCriteria fp
left join "ACOX"."us_state_codes" st
on trim(fp.region) = trim(st.usps)
left outer join "ACOX"."search_result_v4" r
on fp.descriptor_id = r.source_table_refid
where
st.usps is not null
and r.BEGIN_DATE_TS is null
limit :in_limit;
-- select records inserted for return:
ot_result =
select
r.ID id
,r.RUN_GROUP_ID runGroupId
,fp.descriptor_id sourceTableRefId
,fp.merch_name name
,fp.Location location
,st.usps regionCode
,'USA' countryCode
from
acox.searchCriteria fp
left join "ACOX"."us_state_codes" st
on trim(fp.region) = trim(st.usps)
inner join "ACOX"."search_result_v4" r
on fp.descriptor_id = r.source_table_refid
and r.COMPLETE_DATE_TS is null
and r.RUN_GROUP_ID = in_run_group_id
where
st.usps is not null
limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly, but it includes an update statement prior to the insert (sketched below).
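For illustration only, such a claim-by-update step might look roughly like the sketch below; the claimed_by column and the batch size are assumptions, not the actual change:

-- hypothetical: a claimed_by column added to the source table
UPDATE acox.searchCriteria
SET claimed_by = :in_run_group_id
WHERE claimed_by IS NULL
  AND descriptor_id IN (
        SELECT TOP 100 descriptor_id
        FROM acox.searchCriteria
        WHERE claimed_by IS NULL
      );

The subsequent insert/select would then filter on claimed_by = :in_run_group_id instead of relying on the left outer join.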
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing-semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented, does not really matter.
One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs).
In whatever way this is implemented, this is the point in time where this processed status needs to be captured.
COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially being processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB-transaction context, allowing for a guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
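Put together as plain SQL, the pattern looks roughly like the sketch below. The table and column names are assumptions, it must run with autocommit off so that all three steps share one transaction, and a real procedure would capture the selected IDs (e.g. in a table variable) instead of repeating the predicate in step 3:

-- Step 1: claim up to 100 unprocessed rows; the locks block other claimers, not plain readers
SELECT TOP 100 id
FROM source_work_items
WHERE processed = FALSE
FOR UPDATE;

-- Step 2: do the actual processing (call the API, insert into the results table, write logs, ...)

-- Step 3: mark the claimed rows, then COMMIT to persist the work and release the locks
UPDATE source_work_items
SET processed = TRUE
WHERE processed = FALSE;   -- placeholder predicate; use the IDs captured in step 1 here

COMMIT;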
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.

How to attempt to delete records without terminating on error

I have a table that is used in several other tables as a foreign key. If the table is referenced in only one specific table, I want the delete to be allowed and to cascade on delete. However, if there exists references in other tables, the delete should fail.
I want to test the referential integrity of this with my data set by attempting to delete every record. No records should be deleted except for the last one. However, when I attempt to delete every record, it errors (as expected) and terminates the rest of the statement.
How can I write a script that attempts to delete every record in a table and not terminate the statement on the first error?
Kind Regards,
ribald
EDIT:
The reason I would want to do something like this is because the business users have added a lot of duplicate data (ie: search for someone and click "Add As New" instead of "Select"). Now we may have 10 people out there that only have a name and no relation to the other tables. I hope this clarifies any confusion.
I played around with different ideas. Here is the most straightforward way. However, it is pretty costly. Again, this is to attempt to delete unused duplicate data. 1,000 records took 8 minutes. Can anyone think of a more efficient way to do this?
DECLARE @DeletedID Int

DECLARE ItemsToDelete SCROLL CURSOR FOR
SELECT ID FROM ParentTable

OPEN ItemsToDelete
FETCH NEXT FROM ItemsToDelete INTO @DeletedID

WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        --ATTEMPT TO DELETE
        DELETE FROM ParentTable WHERE ID = @DeletedID;
    END TRY
    BEGIN CATCH
        --DO NOTHING
    END CATCH

    --FETCH NEXT ROW
    FETCH NEXT FROM ItemsToDelete INTO @DeletedID
END

CLOSE ItemsToDelete
DEALLOCATE ItemsToDelete
Take this with a grain of salt: I'm not actually a DBA, and have never worked with SQL Server. However:
You're actually up against two different rules here:
Referential constraints
Business-specific (I'm assuming) 'delete-allowed' rules.
It sounds like when the referential constraints (in the child tables) were set up, they were created with the option RESTRICT or NO ACTION. This means that attempting to delete from the parent table will, when child rows are present, cause the failure. But the catch is that you want, on a specific table, to allow deletes and to propagate them (option CASCADE). So, for only those tables where the delete should be propagated, alter the referential constraint to use CASCADE. Otherwise, prevent the delete with the (already present) error.
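For example (constraint, table and column names here are hypothetical), switching one child table's foreign key to cascading deletes could look like this:

-- drop the existing restricting constraint and re-create it with ON DELETE CASCADE
ALTER TABLE T_Child1 DROP CONSTRAINT FK_TChild1_TParent;

ALTER TABLE T_Child1 ADD CONSTRAINT FK_TChild1_TParent
    FOREIGN KEY (ParentID) REFERENCES T_Parent (ID)
    ON DELETE CASCADE;

The other child tables keep their NO ACTION constraints, so deletes that would orphan their rows still fail.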
As for dealing with the 'exceptions' that crop up... here are some ways to deal with them:
Predict them. Write your delete in such a fashion as to not delete something if the key is referenced in any of the child tables (see the sketch after this list). This is obviously not maintainable in the long term, or for a heavily-referenced table.
Catch the exception. I doubt these can be caught 'in-script' (especially if running a single .txt-type file), so you're probably going to have to at least write a stored procedure, or run from a higher-level language. The second problem here is that you can't 'skip' the error and 'continue' the delete (that I'm aware of) - SQL is going to keep failing you every time (...on what amounts to a random row). You'd have to loop through the entire file, by line (or possibly set of lines, dynamically), to be able to ensure the record delete. For a small file (relatively), this may work.
Write a (set) of dynamic statements that uses the information schema tables and lookups to exclude ids included in the 'non-deletable' tables. Interesting in concept, but potentially difficult/expensive.
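As a sketch of the first option, with hypothetical child tables, a set-based delete that only touches unreferenced parent rows could look like this:

-- delete only parent rows that no child table still references
DELETE p
FROM ParentTable AS p
WHERE NOT EXISTS (SELECT 1 FROM ChildTable1 AS c1 WHERE c1.ParentID = p.ID)
  AND NOT EXISTS (SELECT 1 FROM ChildTable2 AS c2 WHERE c2.ParentID = p.ID);

One statement per run, no cursor, and no constraint violations to swallow; the cost is that the list of child tables is hard-coded.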
Well, each statement in SQL is treated as a transaction; therefore, if you try to delete several records and hit an error because of integrity rules, each and every change made by that statement will be rolled back. What you can do is write a query which deletes the data from the referencing table (the table which holds the foreign key, the T_child table in your case) first, and then delete the data from the T_parent table.
In short: first delete the matching records from the T_child table, and then delete the records from the T_parent table, to avoid the transaction failing.
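As a minimal illustration (table and column names are hypothetical):

DECLARE @TargetId int = 1;

-- remove the referencing rows first, then the parent row
DELETE FROM T_child  WHERE ParentID = @TargetId;
DELETE FROM T_parent WHERE ID = @TargetId;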
Hope this helps.
(Correct me if I'm wrong)
There are a few options here. There is a bit of ambiguity in your question, so I want to re-iterate your use case first (and I'll answer your question as I understand it).
Use Case
Database consists of several tables T_Parent, T_Child1, T_Child2, T_Child3. A complete set of data would have records in all 4 tables. Given business requirements, we often end up with partial data that needs to be removed at a later time. For example, T_parent and T_Child2 may get data, but not TC1 and TC3.
I need to be able to check for partial data and if found remove all the partial data (T_Parent and T_Child 2 in my example, but it could be TP and TC3, or other combinations).
@ribald - Is my understanding correct?
Comment on my understanding and I'll write out an answer. If the comments aren't long or clear enough, just edit your question.
you said "cascade" in terms of delete (which means a very specific thing in SQL server), but in your later description it sounds more like you want to delete all the partial data.
"Cascading" is something that is available, but not really something you turn on and off based on conditions of some data.
when you said "dataset" you didn't mean an ADO.NET Dataset, you just meant test data.
I assume you aren't looking for a good way to test, you just want to be sure you have data integrity.

Is bulk insert atomic?

I have a table with an auto increment primary key. I want to insert a bunch of data into it and get the keys for each row without additional queries.
START TRANSACTION;
INSERT INTO table (value) VALUES (x),(y),(z);
SELECT LAST_INSERT_ID() AS last_id;
COMMIT;
Can MySQL guarantee that all the data will be inserted in one continuous, ordered flow, so I can easily compute the ids for each element?
id(z) = last_id;
id(y) = last_id - 1;
id(x) = last_id - 2;
If you begin a transaction and then insert data into a table, that whole table will become locked to that transaction (unless you start playing with transaction isolation levels and/or locking hints).
That is basically the point of transactions: to prevent external operations altering (in any way) that which you are operating on.
This kind of "Table Lock" is the default behaviour in most cases.
In unusual circumstances you can find that the RDBMS has had certain options set that mean the 'normal' default (table locking) behaviour is not what happens for that particular install. If this is the case, you should be able to override the default by specifying that you want a table lock as part of the INSERT statement.
EDIT:
I have a lot of experience on MS-SQL Server and patchy experience on many other RDBMSs. In none of that time have I found any guarantee that an insert will occur in a specific order.
One reason for this is that the SELECT portion of the INSERT may be computed in parallel. That would mean the data to be inserted is ready out-of-order.
Equally where particularly large volumes of data are being inserted, the RDBMS may identify that the new data will span several pages of memory or disk space. Again, this may lead to parallelised operation.
As far as I am aware, MySQL has a ROW_NUMBER() type of function that you could specify in the SELECT query, the result of which you can store in the database. You would then be able to rely upon that field (constructed by you) but not the auto-increment id field (constructed by the RDBMS).
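To illustrate that idea, here is a minimal sketch (MySQL 8.0+, with made-up table and column names) that stores a batch sequence you control next to the auto-increment id:

CREATE TABLE items (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    batch_id  INT NOT NULL,
    batch_seq INT NOT NULL,          -- ordering you construct, independent of the id column
    value     VARCHAR(50) NOT NULL
);

INSERT INTO items (batch_id, batch_seq, value)
SELECT 42, ROW_NUMBER() OVER (ORDER BY v), v
FROM (SELECT 'x' AS v UNION ALL SELECT 'y' UNION ALL SELECT 'z') AS src;

You can then address each row by (batch_id, batch_seq) without caring how the engine handed out the ids.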
As far as I know, this will work in pretty much all SQL engines out there.

How-To delete 8,500,000 Records from one table on sql server

delete activities
where unt_uid is null
would be the fastest way but nobody can access the database / table until this statement has finished so this is a no-go.
I defined a cursor to get this task done during working hours, but the impact on productivity is still too big.
So how to delete these record so that the normal use of this database is guaranteed?
It's a SQL 2005 Server on a 32-bit Win2003. Second question: how long would you estimate this job to take (6 hours or 60 hours)? (Yes, I know that depends on the load, but assume that this is a small-business environment.)
You can do it in chunks. For example, every 10 seconds execute:
delete from activities where activityid in
(select top 1000 activityid from activities where unt_uid is null)
Obviously define the row count (I arbitrarily picked 1000) and interval (I picked 10 seconds) which makes the most sense for your application.
Perhaps instead of deleting the records from your table, you could create a new identical table, insert the records you want to keep, and then rename the tables so the new one replaces the old one. This would still take some time, but the down-time on your site would be pretty minimal (just when swapping the tables)
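A rough sketch of that idea, using the table and column from the question (indexes, constraints and permissions would have to be re-created on the new table before the swap):

-- copy only the rows to keep
SELECT *
INTO activities_keep
FROM activities
WHERE unt_uid IS NOT NULL;

-- swap the tables; the outage is just the duration of the renames
BEGIN TRAN;
EXEC sp_rename 'activities', 'activities_old';
EXEC sp_rename 'activities_keep', 'activities';
COMMIT;

DROP TABLE activities_old;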
Who can access the table will depend on your transaction isolation mode, I'd guess.
However, you're broadly right - lots of deletes is bad, particularly if your where clause means it cannot use an index - this means the database probably won't be able to lock only the rows it needs to delete, so it will end up taking a big lock on the whole table.
My best recommendation would be to redesign your application so you don't need to delete these rows, or possibly any rows.
You can either do this by partitioning the table such that you can simply drop partitions instead, or use the "copy the rows you want to keep then drop the table" recipe suggested by others.
I'd use the "nibbling delete" technique. From http://sqladvice.com/blogs/repeatableread/archive/2005/09/20/12795.aspx:
DECLARE @target int
SET @target = 2000
DECLARE @count int
SET @count = 2000

WHILE @count = 2000 BEGIN
    DELETE FROM myBigTable
    WHERE targetID IN
        (SELECT TOP (@target) targetID
         FROM myBigTable WITH(NOLOCK)
         WHERE something = somethingElse)
    SELECT @count = @@ROWCOUNT
    WAITFOR DELAY '000:00:00.200'
END
I've used it for exactly this type of scenario.
The WAITFOR is important to keep, it allows other queries to do their work in between deletes.
In a small-business environment, it seems odd that you would need to delete that many rows as part of standard operational behavior without affecting any other users. Typically for deletes that large, we're making a new table and using TRUNCATE/INSERT or sp_rename to overwrite the old one.
Having said that, in a special case, one of my monthly processes regularly can delete 200m rows in batches of around 3m at a time if it detects that it needs to re-run the process which generated those 200m rows. But this is a single-user process in a dedicated data warehouse database, and I wouldn't call it a small-business scenario.
I second the answers recommending seeking alternative approaches to your design.
I would create a task for this and schedule it to run during off-peak hours. But I would not suggest deleting in place from the table being used. Move the rows you want to keep to a new table and drop the current table with all the rows you want to delete.