Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to prevent concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work; or rather, I haven't implemented it in a way that guarantees rows are processed only once by any of the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
    in  in_limit        int
   ,in  in_run_group_id varchar(36)
   ,out ot_result table (
        id               bigint
       ,runGroupId       varchar(36)
       ,sourceTableRefId integer
       ,name             nvarchar(22)
       ,location         nvarchar(13)
       ,regionCode       nvarchar(3)
       ,countryCode      nvarchar(3)
    )
) language sqlscript sql security definer as
begin
    -- insert new records:
    insert into "ACOX"."search_result_v4" (
        "RUN_GROUP_ID"
       ,"BEGIN_DATE_TS"
       ,"SOURCE_TABLE"
       ,"SOURCE_TABLE_REFID"
    )
    select
        in_run_group_id       as "RUN_GROUP_ID"
       ,CURRENT_TIMESTAMP     as "BEGIN_DATE_TS"
       ,'acox.searchCriteria' as "SOURCE_TABLE"
       ,fp.descriptor_id      as "SOURCE_TABLE_REFID"
    from
        acox.searchCriteria fp
        left join "ACOX"."us_state_codes" st
            on trim(fp.region) = trim(st.usps)
        left outer join "ACOX"."search_result_v4" r
            on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;

    -- select records inserted for return:
    ot_result =
        select
            r.ID             id
           ,r.RUN_GROUP_ID   runGroupId
           ,fp.descriptor_id sourceTableRefId
           ,fp.merch_name    name
           ,fp.Location      location
           ,st.usps          regionCode
           ,'USA'            countryCode
        from
            acox.searchCriteria fp
            left join "ACOX"."us_state_codes" st
                on trim(fp.region) = trim(st.usps)
            inner join "ACOX"."search_result_v4" r
                on fp.descriptor_id = r.source_table_refid
                and r.COMPLETE_DATE_TS is null
                and r.RUN_GROUP_ID = in_run_group_id
        where
            st.usps is not null
        limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly, but it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks, it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented, does not really matter.
One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs).
In whatever way this is implemented, this is the point in time, where this processed status needs to be captured.
COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
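To make the steps concrete, here is a minimal sketch in HANA SQL that reuses the question's tables as source and tracking table. It is only an illustration of the pattern, assuming auto-commit is switched off for the session; in particular, whether a row limit can be combined with FOR UPDATE in a single statement should be checked against the SQL reference of your HANA release.
-- Step 1: select-and-lock the next batch of unprocessed source records.
--         Other sessions can still read these rows, but cannot lock them.
select fp.descriptor_id
from acox.searchCriteria fp
where not exists (
        select 1
        from "ACOX"."search_result_v4" r
        where r.source_table_refid = fp.descriptor_id
      )
limit 100        -- batch size (assumption: LIMIT and FOR UPDATE can be combined)
for update;

-- Step 2: do the processing: call the API, write results, log, ...

-- Step 3: mark the locked records as processed, e.g. insert their keys into
--         "ACOX"."search_result_v4" (or set a processed-flag column).

-- Step 4: make everything durable and release the exclusive locks.
commit;          -- or rollback if anything went wrong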
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially being processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB-transaction context, allowing for guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.
Related
So, I need to process parts of a huge dataset (100,000,000+ records) simultaneously with multiple instances of a processing script running on separate servers. Each instance will process its own chunk of data; no data will be given to more than one instance. I would give records in chunks of 50/100 to each instance. The question is: how to organise the pagination? I think that some sort of global pointer should be stored on the DB side (PostgreSQL). A CURSOR is not of use here because it exists only within a transaction.
The dataset consists of multiple text files stored one line per row and will be queried through a b-tree index. Sample columns: uuid, file_name, line, line_nr, date.
The idea I have is to create a cursors table and store the current cursor value (which will refer to line_nr) for each file_name after every request.
Is that an efficient way to do it, or is there some built-in functionality in PostgreSQL that would allow me to do this?
SELECT FOR UPDATE with SKIP LOCKED. You'll want to create a work queue table with claimed_at/completed_at timestamp columns that the workers update as they complete their work. The other column of the work queue table will be a reference to your dataset table's PK (you probably won't want to use a real foreign key for performance reasons). Then you can use a modified version of this query:
https://stackoverflow.com/a/49403339/16361
We'll use a larger limit in order to allocate chunks of tasks at once. And, instead of deleting, we'll set the claimed_at timestamp and use a filter on claimed_at being null to avoid double-claiming. Your application code would be responsible for the 2nd update that sets the completed_at timestamp (see the follow-up sketch after the query below). As an added advantage, you can query your work queue table's completed_at - claimed_at interval to keep track of how long each task is taking, and you can query for completed_at IS NULL when everything is done to see any rows that caused workers to crash or that otherwise did not complete.
UPDATE work_queue_table SET claimed_at = now()
WHERE dataset_row_uuid IN (
    SELECT dataset_row_uuid
    FROM work_queue_table
    WHERE claimed_at IS NULL
    ORDER BY dataset_row_uuid
    LIMIT 50
    FOR UPDATE SKIP LOCKED
)
RETURNING dataset_row_uuid;
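The second update and the monitoring queries mentioned above might look roughly like this (a sketch; the names follow the example above, and the duration expression is just one way to compute it):
-- mark a task as done once the application has finished processing it
UPDATE work_queue_table
SET completed_at = now()
WHERE dataset_row_uuid = 'the-uuid-just-processed';

-- how long did each completed task take?
SELECT dataset_row_uuid, completed_at - claimed_at AS duration
FROM work_queue_table
WHERE completed_at IS NOT NULL;

-- after the run: rows that were claimed but never finished (e.g. the worker crashed)
SELECT dataset_row_uuid
FROM work_queue_table
WHERE claimed_at IS NOT NULL
  AND completed_at IS NULL;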
The setup of the work queue table could be as simple as this:
CREATE TABLE work_queue_table AS
SELECT uuid AS dataset_row_uuid,
       NULL::timestamp AS claimed_at,
       NULL::timestamp AS completed_at
FROM the_dataset_table;
Though you may need to instead turn it into a regular CREATE and multiple INSERTs that you can run in parallel if this takes too long (I haven't created a huge table like this since PG gained multiprocessing features; it's possible that doesn't actually help anymore).
If it's good/helpful to have tasks from the same file handled by the same worker, you can change the ORDER BY to file_name, line_nr instead (which would mean carrying those columns in the work queue table as well). There are a lot of tweaks you can do for various use cases; hopefully this can get you started.
If the number of clients is constant, each could fetch its next batch with
SELECT ...
FROM atable
WHERE id > previous_id
AND id % number_of_clients = client_no
ORDER BY id
LIMIT 50;
Here previous_id is the maximal id from the previous batch, number_of_clients is the number of clients and client_no is different for each client.
That avoids taking locks if you don't need them.
We face the following situation (Teradata):
Business layer frequently executes long-running queries on Table X_Past UNION ALL Table X_Today.
Table X_Today gets updated frequently, say once every 10 minutes. X_Past only once after midnight (per full-load).
Writing process should not block reading process.
Writing should happen as soon as new data is available.
Proposed approach:
2 "Today" and a "past" table, plus a UNION ALL view that selects from one of them based on the value in a load status table.
X_Today_1
X_Today_0
X_Past
The loading process will load into X_Today_1 and set the active_table value in the load status table to "X_Today_1";
the next time it will load into X_Today_0 and set the active_table value to "X_Today_0";
etc.
The view that is used to select on the table will be built as follows:
select *
from X_Past
UNION ALL
select td1.*
from X_Today_1 td1
   , ( select active_table from LOAD_STATUS ) active_tab1
where active_tab1.active_table = 'X_Today_1'
UNION ALL
select td0.*
from X_Today_0 td0
   , ( select active_table from LOAD_STATUS ) active_tab0
where active_tab0.active_table = 'X_Today_0'
My main questions:
When executing the select, will there be a lock on ALL tables, or only on those that are actually accessed for data? Because of the where clause, data from one of the Today_1/0 tables will always be ignored, and that table should be available for loading.
Do we need any form of locking, or is the default locking mechanism what we want (which I suspect it is)?
Will this work, or am I overlooking something?
It is important that the loading process waits in case the reading process takes longer than 20 minutes and the loader is about to refresh the second table again. The reading process should never really be blocked, except maybe by itself.
Any input is much appreciated...
thank you for your help.
A few comments on your questions:
Depending on the query structure, the Optimizer will try to get the default locks (in this case a READ lock) at different levels -- most likely table or row-hash locks. For example, if you do a SELECT * FROM my_table WHERE PI_column = 'value', you should get a row-hash lock and not a table lock.
Try running an EXPLAIN on your SELECT and see if it gives you any locking info. The Optimizer might be smart enough to determine there are 0 rows in one of the joined tables and reduce the lock requests. If it still locks both tables, see the end of this post for an alternative approach.
Your query written as-is will result in READ locks, which would block any WRITE requests on the tables. If you are worried about locking issues / concurrency, have you thought about using an explicit ACCESS lock? This would allow your SELECT to run without ever having to wait for your write queries to complete. This is called a "dirty read", since there could be other requests still modifying the tables while they are being read, so it may or may not be appropriate depending on your requirements.
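For illustration, an ACCESS lock can be requested with a locking modifier, either baked into a reading view or placed in front of an ad-hoc query; a minimal sketch (the view name X_Today_Read is hypothetical):
-- a view that always reads with an ACCESS ("dirty read") lock,
-- so SELECTs through it never wait for, or block, the loading process
REPLACE VIEW X_Today_Read AS
LOCKING ROW FOR ACCESS
SELECT * FROM X_Today_1;

-- the same modifier can also prefix an ad-hoc SELECT
LOCKING TABLE X_Today_1 FOR ACCESS
SELECT * FROM X_Today_1;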
Your approach seems feasible. You could also do something similar, but instead of having two UNIONs, have a single "X_Today" view that points to the "active" table. After your load process completes, you could re-point the view to the appropriate table as needed via a MACRO call:
-- macros (switch between active / loading)
REPLACE MACRO switch_to_today_table_0 AS (
    REPLACE VIEW X_Today AS SELECT * FROM X_Today_0;
);
REPLACE MACRO switch_to_today_table_1 AS (
    REPLACE VIEW X_Today AS SELECT * FROM X_Today_1;
);
-- SELECT query
SELECT * FROM X_PAST UNION ALL SELECT * FROM X_Today;
-- Write request
MERGE INTO x_today_0...;
-- Switch active "today" table to most recently loaded one
EXEC switch_to_today_table_0;
You'd have to manage which table to write to (or possibly do that using a view too) and which "switch" macro to call within your application.
One thing to think about is that having two physical tables that logically represent the same table (i.e. should have the same data) may potentially allow for situations where one table is missing data and needs to be manually synced.
Also, if you haven't looked at them already, a few ideas to optimize your SELECT queries to run faster: row partitioning, indexes, compression, statistics, primary index selection.
I have an Informatica mapping wherein the soft delete condition is as follows:
If Pk_Src is null and Pk_Tgt is not null, then set active_flag to 'N'.
Now, based on this condition, the mapping evaluated that there are 400k records which need to be updated. It's a simple update, but it is taking more than 3 hours using an Update Strategy transformation.
Appreciate your valuable inputs.
Dex.
How many records are on the table, and how many indexes does active_flag appear in? If active_flag is in many indexes, then you should consider dropping those indexes before the session starts and redefining them after the session ends (a rough sketch follows below). Have you looked through the session log to see which steps are taking time? There may be more besides the update query. Another strategy to try is increasing your commit interval to 500000 (so long as your db undo can withstand that).
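For illustration, the drop-and-recreate suggestion could be wired into the session's pre- and post-SQL roughly like this (the index and table names are made up, and exact DDL syntax varies by target database):
-- pre-session SQL: drop an index that contains active_flag so the updates run faster
DROP INDEX idx_target_active_flag;

-- ... the session performs the 400k updates ...

-- post-session SQL: recreate the index once the session has finished
CREATE INDEX idx_target_active_flag ON target_table (active_flag);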
In general you should expect very slow performance when doing updates/deletes from an Infa mapping: 10-100 times slower than inserts. This is due to the fact that there is no ‘array’-based update/delete, so each update needs to be handled separately and sequentially, and you end up in a situation where 90% of the time is spent sending handshakes back and forth. With inserts, the handshake is only done once per ‘array-size’ records (say 10,000), and the overhead is therefore negligible.
The best solutions I have found so far is:
use pushdown optimization (the drawback is that the mapping needs to be 100% push-downable, which means: no variable ports, both source and target behind the same connection, and much more)
use a stage/apply approach:
In the pre-sql of the session, drop&create a STAGE table with all the keys you wish to delete
Override the table name for your target to point to this STAGE table
Let the mapping do inserts to the STAGE table
In the post-sql do a
Delete from TABLE where Exists (select * from STAGE where TABLE.ID=STAGE.ID)
I hope you can follow me.
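Since the original problem is a soft delete rather than a hard delete, the post-SQL in that stage/apply approach would presumably be an update instead; a hedged sketch reusing the TABLE/STAGE/ID placeholders from above plus the question's active_flag column:
-- post-session SQL for the soft-delete case: flag instead of delete
UPDATE TABLE
SET active_flag = 'N'
WHERE EXISTS (SELECT * FROM STAGE WHERE TABLE.ID = STAGE.ID)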
Here's the situation:
MS SQL 2008 database with table that is updated approximately once a minute.
The table structure is similar to following:
[docID], [warehouseID], [docDate], [docNum], [partID], [partQty]
Typical working cycle:
User starts data exchange from in-house developed system:
BEGIN TRANSACTION
SELECT * FROM t1
WHERE [docDate] BETWEEN &DateStart AND &DateEnd
AND [warehouseID] IN ('w1','w2','w3')
...then the system performs rather long processing of the selected data, generates the list of [docID]s to delete from t1, then goes
DELETE FROM t1 WHERE [docID] IN ('d1','d2','d3',...,'dN')
COMMIT TRANSACTION
Here, the problem is that while the 1st transaction processes the selected data, another one reads it too, and then they both populate the same data into the in-house system.
At first, I added a (TABLOCKX) table hint to the SELECT query. And it worked pretty well until users started to complain about the system's performance.
Then I changed hints to (ROWLOCK, XLOCK, HOLDLOCK), assuming that it would:
exclusively lock...
selected rows (instead of whole table)...
until the end of transaction
But this seems to lock the whole table anyway. I have no access to the database itself, so I can't just analyze these locks (actually, I have no idea yet how to do it, even if I had access).
What I would like to have as a result:
users are able to process data related to different warehouses and dates in parallel
as a result of 1., avoid duplication of downloaded data
Apart from locks, the other solutions I have are (although they both seem clumsy):
Implement a flag in t1, showing that the data is under processing (and then do 'SELECT ... WHERE NOT [flag]')
Divide t1 into two parts: header and details, and apply locks separately.
I believe that I might have misunderstood some concepts with regard to transaction isolation levels and/or table hints and that there is another (better) way.
Please, advise!
You could change the workflow concept.
Instead of deleting records, update them by setting an extra field, Deprecated, from 0 to 1.
And read the data not from the table but from a view where Deprecated = 0.
BEGIN TRANSACTION
SELECT * FROM vT1
WHERE [docDate] BETWEEN &DateStart AND &DateEnd
AND [warehouseID] IN ('w1','w2','w3')
where the vT1 view looks like this:
select *
from t1
where Deprecated = 0
And the deletion will look like this:
UPDATE t1 SET Deprecated = 1 WHERE [docID] IN ('d1','d2','d3',...,'dN')
COMMIT TRANSACTION
Using such a concept you will achieve two goals:
decrease probability of locks
keep a history of movements in the warehouses
Recently I had to deal with a problem that I imagined would be pretty common: given a database table with a large (million+) number of rows to be processed, and various processors running in various machines / threads, how to safely allow each processor instance to get a chunk of work (say 100 items) without interfering with one another?
The reason I am getting a chunk at a time is for performance reasons - I don't want to go to the database for each item.
There are a few approaches - you could associate a token with each processor, and have a SPROC that sets that token against the next [n] available items; perhaps something like:
(note - needs suitable isolation-level; perhaps serializable: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE)
(edited to fix TSQL)
UPDATE TOP (1000) WORK
SET [Owner] = @processor, Expiry = @expiry
OUTPUT INSERTED.Id -- etc
WHERE [Owner] IS NULL
You'd also want a timeout (@expiry) on this, so that when a processor goes down you don't lose work. You'd also need a task to clear the owner on things that are past their Expiry.
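A sketch of what that clean-up task might run periodically, assuming the WORK table and columns from the snippet above:
-- release items whose claim has expired so another processor can pick them up
UPDATE WORK
SET [Owner] = NULL, Expiry = NULL
WHERE [Owner] IS NOT NULL
  AND Expiry < GETDATE();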
You can have a special table to queue work up, where the consumers delete (or mark) work as being handled, or use a middleware queuing solution, like MSMQ or ActiveMQ.
Middleware comes with its own set of problems so, if possible, I'd stick with a special table (keep it as small as possible, hopefully just with an id so the workers can fetch the rest of the information by themselves from the rest of the database and not lock the queue table up for too long).
You'd fill this table up at regular intervals and let processors grab what they need from the top.
Related questions on SQL table queues:
Queue using table
Working out the SQL to query a priority queue table
Related questions on queuing middleware:
Building a high performance and automatically backupped queue
Messaging platform
You didn't say which database server you're using, but there are a couple of options.
MySQL includes an extension to SQL99's UPDATE to limit the number of rows that are updated. You can assign each worker a unique token, update a number of rows, then query to get that worker's batch. Marc used the UPDATE TOP syntax, but didn't specify the database server.
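A rough sketch of that token-based claim in MySQL (work_items, worker_token, and the batch size of 100 are placeholders, not names from the question):
-- claim up to 100 unowned rows for this worker
UPDATE work_items
SET worker_token = 'worker-42'
WHERE worker_token IS NULL
ORDER BY id
LIMIT 100;

-- then fetch the claimed batch
SELECT *
FROM work_items
WHERE worker_token = 'worker-42';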
Another option is to designate a table used for locking. Don't use the same table with the data, since you don't want to lock it for reading. Your lock table likely only needs a single row, with the next ID needing work. A worker locks the table, gets the current ID, increments it by whatever your batch size is, updates the table, then releases the lock. Then it can go query the data table and pull the rows it reserved. This option assumes the data table has a monotonically increasing ID, and isn't very fault-tolerant if a worker dies or otherwise can't finish a batch.
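A sketch of that reservation flow; the single-row batch_cursor table and its next_id column are assumptions for illustration:
-- one-time setup: a single-row table holding the next unreserved ID
-- CREATE TABLE batch_cursor (next_id BIGINT NOT NULL);

BEGIN;
SELECT next_id FROM batch_cursor FOR UPDATE;      -- lock the counter row
UPDATE batch_cursor SET next_id = next_id + 100;  -- reserve a batch of 100 IDs
COMMIT;                                           -- release the lock

-- the worker now owns the IDs in [next_id, next_id + 100) and can fetch them
SELECT *
FROM data_table
WHERE id >= :reserved_start              -- the next_id value read above
  AND id <  :reserved_start + 100;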
Quite similar to this question: SQL Server Process Queue Race Condition
You run a query to assign 100 rows to a given ProcessorID. If you use these locking hints then it's "safe" in the concurrency sense. And it's a single SQL statement with no SET statements needed.
This is taken from the other question:
UPDATE TOP (100)
    foo
SET
    ProcessorID = @PROCID
FROM
    OrderTable foo WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE
    ProcessorID = 0 --Or whatever unassigned is
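After the claim, each processor would presumably read back its batch with an ordinary SELECT on its own ID; a small sketch reusing the names from the snippet above:
-- fetch the rows this processor just claimed
SELECT *
FROM OrderTable
WHERE ProcessorID = @PROCID;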