How to exclude one statement from the current SQL transaction for a SQL id generator?

I would like to implement an id generator so that I have unique record identifiers across multiple tables and can assign ids to structures of new records built on the client side.
The usual, standard answer is a GUID, but I want to use int for space efficiency and human readability.
It's ok to have gaps in the id sequence, which will happen with unfinished transactions, lost client connections and so on.
For the implementation I would have a table Counters with a field NextId int and increment that counter whenever an id is requested. I may increment it by more than 1 when I need a range of ids for multiple or bulk inserts.
To avoid locking bottlenecks when updating the Counters table I need to make id requests atomic and outside of any other transaction. So my question is: how do I do that?
It's not a problem at the application level: the application can make one atomic transaction request to get a pool of ids and then use those ids in another, bigger transaction to insert records.
But what do I do if I want to get new ids inside a stored procedure or trigger?
If I wrap that update Counter set NextId=NextId+1 request in a nested transaction (begin tran ... commit tran), it is still not excluded from locking until the outer, big transaction ends.
Is there any way to exclude that one SQL statement from the current transaction, so that its lock is released as soon as the statement ends and it does not participate in the rollback if the outer transaction is rolled back?

You need to use a second connection. You cannot have multiple transactions at once per connection.

Related

Maintain a counter in one table to track INSERT and DELETE in another table

I have a Spring Boot application with Postgres as database and Hibernate for persistence management. There are two tables in the db: STATISTICS and USER. Client machines submit entries to STATISTICS table and there is a 'counter' column in USER table to keep track of number of entries in the other table(increment on insert and decrement on delete).
I started with a basic query to maintain the counter: UPDATE USER u set u.counter = coalesce(u.counter, 0) + <update_by> where u.id = <user_id>. The <update_by> would be negative for deletion.
My questions:
Do I need to obtain explicit locks(something like SELECT...FOR UPDATE) to ensure concurrent updates don't work with stale data and overwrite other update's changes?
Would a database trigger be a better choice to maintain this counter? If yes, do I need to take care of locking for concurrent updates there?
(I need to maintain the counter along with the changes to other table and not do the count by going over the whole table when it is requested)
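PostgreSQL already takes a row-level lock on every row an UPDATE modifies, so concurrent updates of the same row are serialized without an explicit SELECT ... FOR UPDATE. You can use either a trigger or a function; both execute in the same transaction as the calling statement and hold the lock on the updated rows until that transaction ends.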

How to establish read-only-once implement within SAP HANA?

Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to avoid concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or at least I haven't implemented it in a way that guarantees each row is processed only once by any of the processors.
I have a stored proc to handle batch selection:
/*
 * getBatch.sql
 *
 * SYNOPSIS: Retrieve the next set of criteria to be used in a search
 *           request. Use left outer join between input source table
 *           and results table to determine the next set of inputs, and
 *           provide support so that concurrent processes may call this
 *           proc and get their inputs exclusively.
 */
alter procedure "ACOX"."getBatch" (
     in in_limit int
    ,in in_run_group_id varchar(36)
    ,out ot_result table (
         id bigint
        ,runGroupId varchar(36)
        ,sourceTableRefId integer
        ,name nvarchar(22)
        ,location nvarchar(13)
        ,regionCode nvarchar(3)
        ,countryCode nvarchar(3)
    )
) language sqlscript sql security definer as
begin
    -- insert new records:
    insert into "ACOX"."search_result_v4" (
         "RUN_GROUP_ID"
        ,"BEGIN_DATE_TS"
        ,"SOURCE_TABLE"
        ,"SOURCE_TABLE_REFID"
    )
    select
         in_run_group_id as "RUN_GROUP_ID"
        ,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
        ,'acox.searchCriteria' as "SOURCE_TABLE"
        ,fp.descriptor_id as "SOURCE_TABLE_REFID"
    from
        acox.searchCriteria fp
        left join "ACOX"."us_state_codes" st
            on trim(fp.region) = trim(st.usps)
        left outer join "ACOX"."search_result_v4" r
            on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;
    -- select records inserted for return:
    ot_result =
        select
             r.ID id
            ,r.RUN_GROUP_ID runGroupId
            ,fp.descriptor_id sourceTableRefId
            ,fp.merch_name name
            ,fp.Location location
            ,st.usps regionCode
            ,'USA' countryCode
        from
            acox.searchCriteria fp
            left join "ACOX"."us_state_codes" st
                on trim(fp.region) = trim(st.usps)
            inner join "ACOX"."search_result_v4" r
                on fp.descriptor_id = r.source_table_refid
                and r.COMPLETE_DATE_TS is null
                and r.RUN_GROUP_ID = in_run_group_id
        where
            st.usps is not null
        limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly, but it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing-semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
1. SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
2. Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
3. As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented, does not really matter.
One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs).
In whatever way this is implemented, this is the point in time, where this processed status needs to be captured.
4. COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially being processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB-transaction context, allowing for guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
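The four steps can be sketched in a runnable form. This is only an illustration under loud assumptions: it uses Python's built-in sqlite3 module instead of HANA, the tables source, target and processed and the function names are hypothetical, and because SQLite has no SELECT ... FOR UPDATE the sketch substitutes BEGIN IMMEDIATE, which takes a database-wide write lock rather than row locks. The point it shows is the shape of the transaction: claim, process, mark, and commit all happen together.

```python
import sqlite3


def process_batch(conn, worker, limit):
    """Claim up to `limit` unprocessed rows, process and mark them in one transaction."""
    # Step 1 (stand-in): BEGIN IMMEDIATE replaces SELECT ... FOR UPDATE here.
    conn.execute("BEGIN IMMEDIATE")
    try:
        rows = conn.execute(
            "SELECT s.id, s.payload FROM source s "
            "LEFT JOIN processed p ON p.source_id = s.id "
            "WHERE p.source_id IS NULL ORDER BY s.id LIMIT ?",
            (limit,),
        ).fetchall()
        for source_id, payload in rows:
            # Step 2: the actual processing (here: store an upper-cased copy).
            conn.execute(
                "INSERT INTO target (source_id, result) VALUES (?, ?)",
                (source_id, payload.upper()),
            )
            # Step 3: record the processed status in the same transaction.
            conn.execute(
                "INSERT INTO processed (source_id, worker) VALUES (?, ?)",
                (source_id, worker),
            )
        conn.execute("COMMIT")  # Step 4: results and markers become visible together.
    except Exception:
        conn.execute("ROLLBACK")  # Step 4 (failure): nothing is marked or written.
        raise
    return [r[0] for r in rows]


def make_queue_db(n=10):
    # isolation_level=None: the code issues BEGIN/COMMIT/ROLLBACK itself.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(
        "CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT);"
        "CREATE TABLE target (source_id INTEGER PRIMARY KEY, result TEXT);"
        "CREATE TABLE processed (source_id INTEGER PRIMARY KEY, worker TEXT);"
    )
    conn.executemany(
        "INSERT INTO source VALUES (?, ?)",
        [(i, f"row{i}") for i in range(1, n + 1)],
    )
    return conn
```

Calling process_batch repeatedly with different worker names hands out disjoint batches until the source is exhausted. On a database with row-level FOR UPDATE, the claim in step 1 would lock only the selected rows instead of the whole database.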
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.

Prevent other sessions from reading data until I'm finished

I have a table that holds customers from different companies, something like:
CUSTOMER
CUSTOMER_ID
COMPANY_ID
CUSTOMER_NAME
FOO_CODE
When I insert or update a customer I need to calculate a FOO_CODE based on existing ones (within the company).
If I simply do this:
SELECT MAX(FOO_CODE) AS GREATEST_CODE_SO_FAR
FROM CUSTOMER
WHERE COMPANY_ID=:company_id
... then generate the code in the client language (PHP) and finally issue the INSERT/UPDATE, I understand I can face a race condition if another program instance fetches the same GREATEST_CODE_SO_FAR.
Is it possible to issue a row-level lock on the table so other sessions that attempt to read the FOO_CODE column of any customer that belongs to a given company are delayed until I commit or rollback my transaction?
My failed attempts:
This:
SELECT MAX(FOO_CODE)
FROM CUSTOMER
WHERE COMPANY_ID=:company_id
FOR UPDATE
... triggers:
ORA-01786: FOR UPDATE of this query expression is not allowed
This:
SELECT FOO_CODE
FROM CUSTOMER
WHERE COMPANY_ID=:company_id
FOR UPDATE
... retrieves all company rows and does not even prevent other sessions from reading data.
LOCK TABLE... well, documentation barely has any example and I can't figure out the syntax
P.S. It is not an incrementing number, it's an alphanumeric string.
You can't block another session from reading data, as far as I'm aware. One of the differences between Oracle and some other databases is that writers don't block readers.
I'd probably look at this slightly differently. I'm assuming the way you generate the next foo_code is deterministic. If you add a unique index on company_id, foo_code then you can have your application attempt the insert in a loop:
get your current max value
calculate your new code
do the insert
if you don't get a constraint violation, break out of the loop
otherwise continue to the next iteration of the loop and repeat the process
If two sessions attempt this at the same time then the second one will attempt to insert the same foo_code and will get a unique constraint violation. That is trapped and handled nicely and it just tries again; potentially multiple times until it gets a clean insert.
You could have a DB procedure that attempts the insert in a loop, but since you want to generate the new value in PHP then it would make sense for the loop to be in PHP too, attempting a simple insert.
This doesn't necessarily scale well if you have a high volume of inserts and clashes are likely. But if you're expecting simultaneous inserts for the same customer to be rare and just have to handle the odd occasions when it does happen this won't add much overhead.
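A minimal sketch of that retry loop follows. It is written in Python against the built-in sqlite3 module rather than PHP and Oracle, purely so that it is self-contained. The next_code function is a hypothetical stand-in for the real, deterministic code generator, and the table layout and index name are assumptions based on the question.

```python
import sqlite3


def next_code(current_max):
    """Hypothetical deterministic generator: 'A0007' -> 'A0008'."""
    if current_max is None:
        return "A0001"
    return f"{current_max[0]}{int(current_max[1:]) + 1:04d}"


def insert_customer(conn, company_id, name, max_attempts=10):
    for _ in range(max_attempts):
        # Get the current max value for this company.
        (current_max,) = conn.execute(
            "SELECT MAX(foo_code) FROM customer WHERE company_id = ?",
            (company_id,),
        ).fetchone()
        # Calculate the new code.
        code = next_code(current_max)
        try:
            # Do the insert; the unique index rejects a duplicate code.
            conn.execute(
                "INSERT INTO customer (company_id, customer_name, foo_code) "
                "VALUES (?, ?, ?)",
                (company_id, name, code),
            )
            conn.commit()
            return code  # no constraint violation: done
        except sqlite3.IntegrityError:
            conn.rollback()  # another session took this code; recompute and retry
    raise RuntimeError("could not allocate a unique foo_code")


def make_customer_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,"
        " company_id INTEGER, customer_name TEXT, foo_code TEXT)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX customer_code_uq ON customer (company_id, foo_code)"
    )
    return conn
```

The max_attempts bound is a design choice of this sketch, not part of the original suggestion: it turns a pathological run of clashes into an error instead of an endless loop.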

Best implementation of a "counter" table in SQL Server

I'm working with a large SQL Server database, and that's built upon the idea of a counter table for primary key values. Each table has a row in this counter table with the PK name and the next value to be used as a primary key (for that table). Our current method of getting a counter value is something like this:
BEGIN TRAN
UPDATE CounterValue + 1
SELECT Counter Value
COMMIT TRAN
That works mostly well since the process of starting a transaction, then updating the row, locks the row/page/table (the level of locking isn't too important for this topic) until the transaction is committed.
The problem here is that if a transaction is held open for a long period of time, access to that table/page/row is locked for too long. We have situations where hundreds of inserts may occur in a single transaction (which needs access to this counter table).
One attempt to address this problem would be to always use a separate connection from your application that would never hold a transaction open. Access to the table and hence the transaction would be quick, so access to the table is generally available. The problem here is that the use of triggers that may also need access to these counter values makes that a fairly unreasonable rule to have. In other words, we have triggers that also need counter values and those triggers sometimes run in the context of a larger parent transaction.
Another attempt to solve the problem is using a SQL Server app lock to serialize access to the table/row. That's Ok most of the time too, but has downsides. One of the biggest downsides here also involves triggers. Since triggers run in the context of the triggering query, the app lock would be locked until any parent transactions are completed.
So what I'm trying to figure out is a way to serialize access to a row/table that could be run from an application or from a SP / trigger and that would never run in the context of a parent transaction. If a parent transaction rolls back, I don't need the counter value to roll back. Having always available, fast access to a counter value is much more important than losing a few counter values should a parent transaction be rolled back.
I should point out that I completely realize that using GUID values or an identity column would solve a lot of my problems, but as I mentioned, we're talking about a massive system, with massive amounts of data that can't be changed in a reasonable time frame without a lot of pain for our clients (we're talking hundreds of tables with hundreds of millions of rows).
Any thoughts about the best way to implement such a counter table would be appreciated. Remember - access should be always available from many apps, services, triggers and other SPs, with very little blocking.
EDIT - we can assume SQL Server 2005+
The way the system currently works is unscalable. You have noticed that yourself. Here are some solutions in rough order of preference:
Use an IDENTITY column (You can set the IDENTITY property without rebuilding the table. Search the web to see how.)
Use a sequence
Use Hi-Lo ID generation (What's the Hi/Lo algorithm?). In short, consumers of IDs (application instances) check out big ranges of IDs (like 100) in a separate transaction. The overhead of that scheme is very low.
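To make the Hi-Lo option concrete, here is a small sketch, assuming Python with the built-in sqlite3 module as a stand-in for the real database. The counters table, the HiLoAllocator class and the block size are illustrative names and values, not something from the question. The idea it shows: a consumer reserves a whole block of ids in one short transaction on a dedicated connection and then hands them out locally without touching the database.

```python
import sqlite3


class HiLoAllocator:
    """Hands out ids from a locally cached block; hits the database once per block."""

    def __init__(self, counter_conn, name, block_size=100):
        # counter_conn is assumed to be a dedicated autocommit connection, so a
        # reservation never joins (or waits for) a long business transaction.
        self.conn = counter_conn
        self.name = name
        self.block_size = block_size
        self.next_id = 0
        self.limit = 0  # exclusive upper bound of the cached block

    def _reserve_block(self):
        cur = self.conn.cursor()
        cur.execute("BEGIN IMMEDIATE")  # short write transaction
        try:
            cur.execute("SELECT next_id FROM counters WHERE name = ?", (self.name,))
            start = cur.fetchone()[0]
            cur.execute(
                "UPDATE counters SET next_id = next_id + ? WHERE name = ?",
                (self.block_size, self.name),
            )
            cur.execute("COMMIT")
        except Exception:
            cur.execute("ROLLBACK")
            raise
        self.next_id, self.limit = start, start + self.block_size

    def take(self):
        if self.next_id >= self.limit:
            self._reserve_block()
        value = self.next_id
        self.next_id += 1
        return value


def make_counter_db():
    # isolation_level=None: autocommit, explicit BEGIN/COMMIT in the allocator.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE counters (name TEXT PRIMARY KEY, next_id INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO counters VALUES ('customer', 1)")
    return conn
```

Ids left in a block when a consumer shuts down are simply lost, which produces gaps. That is the trade-off for touching the counter row only once per block.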
Working with the constraints from your comment below: You can achieve scalable counter generation even with a single transaction and no application-level changes. This is kind of a last resort measure.
Stripe the counter. For each table, you have 100 counters. The counter N tracks IDs that conform to ID % 100 = N. So each counter tracks 1/100th of all IDs.
When you want to take an ID, you take it from a randomly chosen counter. The chance is good that this counter is not in use by a concurrent transaction. You will have little blocking due to row-level locking in SQL Server.
You initialize counter N to N and increment it by 100. This ensures that all counters generate distinct ID ranges.
Counter 0 generates 0, 100, 200, .... Counter 1 generates 1, 101, 201, .... And so on.
A disadvantage of this is that your IDs now are not sequential. In my opinion, an application should not rely on this anyway because it is not a reliable property.
You can abstract all of this into a single procedure call. The code complexity will actually not be that much bigger. You basically just generate an additional random number and change the increment logic.
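As a rough illustration of the striping scheme, the sketch below uses Python with the built-in sqlite3 module instead of SQL Server. The striped_counters table and the function names are hypothetical. SQLite locks the whole database rather than single rows, so the sketch demonstrates only the id arithmetic, not the reduced blocking that row-level locks would give.

```python
import random
import sqlite3

STRIPES = 100


def create_striped_counter(conn, table_name):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS striped_counters ("
        " table_name TEXT NOT NULL, stripe INTEGER NOT NULL,"
        " next_id INTEGER NOT NULL, PRIMARY KEY (table_name, stripe))"
    )
    # Counter N starts at N, so it only ever produces ids with id % STRIPES == N.
    conn.executemany(
        "INSERT INTO striped_counters VALUES (?, ?, ?)",
        [(table_name, n, n) for n in range(STRIPES)],
    )


def take_id(conn, table_name, rng=random):
    """Take one id from a randomly chosen stripe, inside the caller's transaction."""
    stripe = rng.randrange(STRIPES)
    # Update first so the stripe's row is locked before its value is read.
    conn.execute(
        "UPDATE striped_counters SET next_id = next_id + ? "
        "WHERE table_name = ? AND stripe = ?",
        (STRIPES, table_name, stripe),
    )
    row = conn.execute(
        "SELECT next_id - ? FROM striped_counters "
        "WHERE table_name = ? AND stripe = ?",
        (STRIPES, table_name, stripe),
    ).fetchone()
    return row[0]
```

Because take_id does not commit, it can be called from within a larger transaction. On a database with row-level locking, only the one stripe it touched stays locked until that transaction ends.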
One way is to get and increment the counter value in one statement:
DECLARE @NextKey int
UPDATE Counter
SET @NextKey = NextKey + 1,
    NextKey = @NextKey

Are Transactions Always Atomic?

I'm trying to better understand a nuance of SQL Server transactions.
Say I have a query that updates 1,000 existing rows, updating one of the columns to have the values 1 through 1,000. It's possible to execute this query and, when completed, those rows would not be numbered sequentially. This is because it's possible for another query to modify one of those rows before my query finishes.
On the other hand, if I wrap those updates in a transaction, that guarantees that if any one update fails, I can fail all updates. But does it also mean that those rows would be guaranteed to be sequential when I'm done?
In other words, are transactions always atomic?
But does it also mean that those rows would be guaranteed to be sequential when I'm done?
No. This has nothing to do with transactions, because what you're asking for simply doesn't exist: relational tables have no order, and asking for 'sequential rows' is the wrong question to ask. You can rephrase the question as 'will the 1000 updated rows contain the entire sequence from 1 to 1000, without gaps?' Most likely yes, but the truth of the matter is that there could be gaps, depending on the way you do the updates. Those gaps would not appear because updated rows are modified by someone else after the update and before the commit, but because an update can be a no-op (it will not update any row), which is a common problem of read-modify-write-back updates (the row 'vanishes' between the read and the write-back due to concurrent operations).
To say more precisely whether your code is correct or not, you have to post the exact code you're doing the update with, as well as the exact table structure, including all indexes.
Atomic means the operation(s) within the transaction will either all occur, or none of them will.
If one of the 1,000 statements fails, none of the operations within the transaction will commit. If you instead split the work into smaller transactions -- say blocks of 100 statements -- then an error (say at the 501st statement) only rolls back the block that contains the failing statement; the blocks before it are already committed, and the blocks after it can still commit.
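The all-or-nothing behaviour can be checked with a small experiment. This sketch uses Python's built-in sqlite3 module rather than SQL Server, and the items table, the renumber function and the simulated failure are made up for the illustration.

```python
import sqlite3


def renumber(conn, fail_at=None):
    """Set seq to 1..N in one transaction; roll everything back on a failure."""
    ids = [r[0] for r in conn.execute("SELECT id FROM items ORDER BY id")]
    conn.execute("BEGIN")
    try:
        for seq, row_id in enumerate(ids, start=1):
            if seq == fail_at:
                raise RuntimeError(f"simulated failure at statement {seq}")
            conn.execute("UPDATE items SET seq = ? WHERE id = ?", (seq, row_id))
        conn.execute("COMMIT")
        return True
    except RuntimeError:
        conn.execute("ROLLBACK")  # undoes every update made since BEGIN
        return False


def make_items(n=1000):
    # isolation_level=None: the code controls BEGIN/COMMIT/ROLLBACK explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, seq INTEGER)")
    conn.executemany(
        "INSERT INTO items (id) VALUES (?)", [(i,) for i in range(1, n + 1)]
    )
    return conn
```

With fail_at=501, the 500 updates that already ran are undone by the rollback, so no row ends up with a seq value. Without a failure, all 1,000 rows are numbered.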
But does it also mean that those rows would be guaranteed to be sequential when I'm done?
You'll have to provide more context about what you're doing in a transaction to be "sequential".
The 2 points are unrelated
Sequential
If you insert values 1 to 1000 into some column, they will be sequential when you read them back with a WHERE and ORDER BY that limits you to these 1000 rows. Unless there are duplicates, so you'd need a unique constraint.
If you rely on an IDENTITY, it isn't guaranteed: Do Inserted Records Always Receive Contiguous Identity Values.
Atomicity
All transactions are atomic:
Is neccessary to encapsulate a single merge statement (with insert, delete and update) in a transaction?
SQL Server and connection loss in the middle of a transaction
Does it delete partially if execute a delete statement without transaction?
SQL transactions, like transactions on all database platforms, put the data in isolation to cover the entire ACID acronym (atomic, consistent, isolated and durable). So the answer is yes.
A transaction guarantees atomicity. That is the point.
Your problem is that after you do the insert, they are only "sequential" until the next thing comes along and touches one of the new records.
If another step in your process requires them to still be sequential, then that step, too, needs to be within your original transaction.