Looping through a CURSOR in PL/PGSQL without locking the table - sql

I have a simple PL/pgSQL block (Postgres 9.5) that loops over records in a table and conditionally updates some of them.
Here's a simplified example:
DO $$
DECLARE
    -- Define a cursor to loop through records
    my_cool_cursor CURSOR FOR
        SELECT
            u.id    AS user_id,
            u.name  AS user_name,
            u.email AS user_email
        FROM users u;
BEGIN
    FOR rec IN my_cool_cursor LOOP
        -- Simplified example:
        -- If the user's name is 'Anjali', set email to NULL
        IF rec.user_name = 'Anjali' THEN
            BEGIN
                UPDATE users SET email = NULL WHERE id = rec.user_id;
            END;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
I'd like to execute this block directly against my database (from my app, via the console, etc...). I do not want to create a FUNCTION() or stored procedure to do this operation.
The Issue
The issue is that the CURSOR and LOOP create a table-level lock on my users table, since everything between the outer BEGIN...END runs in a transaction. This blocks any other pending queries against it. If users is sufficiently large, this locks it up for several seconds or even minutes.
What I tried
I tried to COMMIT after each UPDATE so that it clears the transaction and the lock periodically. I was surprised to see this error message:
ERROR: cannot begin/end transactions in PL/pgSQL
HINT: Use a BEGIN block with an EXCEPTION clause instead.
I'm not quite sure how this is done. Is it asking me to raise an EXCEPTION to force a COMMIT? I tried reading the documentation on Trapping Errors but it only mentions ROLLBACK, so I don't see any way to COMMIT.
How do I periodically COMMIT a transaction inside the LOOP above?
More generally, is my approach even correct? Is there a better way to loop through records without locking up the table?

1.
In Postgres 9.5, you cannot COMMIT within a PostgreSQL function or DO command at all (in PL/pgSQL or any other PL). The error message you reported is to the point:
ERROR: cannot begin/end transactions in PL/pgSQL
A procedure could do that in Postgres 11 or later. See:
PostgreSQL cannot begin/end transactions in PL/pgSQL
In PostgreSQL, what is the difference between a “Stored Procedure” and other types of functions?
There are limited workarounds to achieve "autonomous transactions" in older versions:
How do I do large non-blocking updates in PostgreSQL?
Does Postgres support nested or autonomous transactions?
Do stored procedures run in database transaction in Postgres?
But you do not need any of this for the presented case.
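For completeness: on Postgres 11 or later the loop itself could commit periodically. A minimal sketch, assuming the same users table as in the question (transaction control works in procedures and in plain DO blocks, as long as there is no surrounding transaction and no EXCEPTION clause in the block):
DO $$
DECLARE
    rec record;
    counter bigint := 0;
BEGIN
    FOR rec IN SELECT id, name FROM users LOOP
        IF rec.name = 'Anjali' THEN
            UPDATE users SET email = NULL WHERE id = rec.id;
            counter := counter + 1;
            IF counter % 1000 = 0 THEN
                COMMIT;  -- Postgres 11+ only: releases row locks taken so far;
                         -- the loop's cursor is silently converted to a holdable cursor
            END IF;
        END IF;
    END LOOP;
END
$$;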
2.
Use a simple UPDATE instead:
UPDATE users
SET    email = NULL
WHERE  name = 'Anjali'                 -- the underlying column (aliased as user_name in the cursor)
AND    email IS DISTINCT FROM NULL;    -- optional improvement
This only locks the rows that are actually updated (with corner-case exceptions). And since it is much faster than a CURSOR over the whole table, the locks are also held very briefly.
The added AND email IS DISTINCT FROM NULL avoids empty updates. Related:
Update a column of a table with a column of another table in PostgreSQL
How do I (or can I) SELECT DISTINCT on multiple columns?
It's rare that explicit cursors are useful in plpgsql functions.

If you want to avoid locking rows for a long time, you could also define a cursor WITH HOLD, for example using the DECLARE SQL statement.
Such cursors can be used across transaction boundaries, so you could COMMIT after a certain number of updates. The price you are paying is that the cursor has to be materialized on the database server.
Since you cannot use transaction statements in functions, you will either have to use a procedure or commit in your application code.
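A rough sketch of that approach, driven from the application or psql rather than from a function (table and column names follow the question; the batch size and the parameter placeholder are assumptions):
BEGIN;
DECLARE my_cool_cursor CURSOR WITH HOLD FOR
    SELECT id FROM users WHERE name = 'Anjali';
COMMIT;   -- WITH HOLD: the result set is materialized and the cursor survives the commit

-- repeat until FETCH returns no rows:
BEGIN;
FETCH 1000 FROM my_cool_cursor;                       -- application collects the returned ids
UPDATE users SET email = NULL WHERE id = ANY ($1);    -- $1 = array of ids from this batch
COMMIT;                                               -- row locks of this batch are released here

CLOSE my_cool_cursor;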

Related

How to nest stored procedures that use one stream in Snowflake?

I have the following architecture:
Main stored procedure main_sproc
Nested stored procedure nested_sproc
The task is to process data from a stream in Snowflake, but we have to do it with a nested approach.
Thus this would be the code:
create or replace procedure nested_sproc()
returns varchar
language sql
as
begin
    begin transaction;
    insert into table1
    select * from stream; -- it will be more complex than this and multiple statements
    commit;
end;

create or replace procedure main_sproc()
returns varchar
language sql
as
begin
    begin transaction;
    insert into table1
    select * from stream; -- it will be more complex than this and multiple statements
    call nested_sproc();
    commit;
end;
When I run call main_sproc(), the first statement goes through, but when it reaches call nested_sproc() the transaction is blocked. I guess it's because both procedures contain a begin transaction and commit, but without them I get an error stating that I need to define the scope of the transaction. I need to deploy the final procedure on a task that runs on a schedule, without having to merge the queries from both procedures, and still be able to process the current data inside the stream.
Is there a way of achieving this?
As per the comments by @NickW, you can't have two processes reading the same stream simultaneously, because running a DML command against it alters the stream: your outer SP selects from the stream and will lock it until the outer SP's transaction completes. You'll need another approach to achieve your end result, e.g. in the outer SP write the stream to a table and use that table in the inner SP (see the sketch below). If you use a temporary table, it is available while either SP is running (they are part of the same session) and is automatically dropped when the session ends; as long as you don't need this data outside of the two SPs, this may be more convenient since there is less maintenance.
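A hedged sketch of that temp-table variant (the stream name my_stream and temp table name stream_snapshot are placeholders; nested_sproc would read from stream_snapshot instead of the stream and would drop its own begin transaction/commit):
create or replace procedure main_sproc()
returns varchar
language sql
as
begin
    -- consume the stream exactly once, outside the explicit transaction
    -- (DDL in Snowflake auto-commits, so it cannot sit inside begin transaction;
    --  note this advances the stream offset when the CTAS commits)
    create or replace temporary table stream_snapshot as
        select * from my_stream;

    begin transaction;
    insert into table1 select * from stream_snapshot;
    call nested_sproc();   -- reads stream_snapshot, not the stream
    commit;
    return 'done';
end;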

Postgres: Save rows from temp table before rollback

I have a main procedure (p_proc_a) in which I create a temp table for logging (tmp_log). In the main procedure I call some other procedures (p_proc_b, p_proc_c). In each of these procedures I insert data into table tmp_log.
How do I save rows from tmp_log into a physical table (log) in case of exception before rollback?
create procedure p_proc_a()
language plpgsql
as $body$
begin
    create temp table tmp_log (log_message text) on commit drop;
    call p_proc_b();
    call p_proc_c();
    insert into log (log_message)
    select log_message from tmp_log;
exception
    when others then
        declare
            v_message_text text;
        begin
            get stacked diagnostics
                v_message_text = message_text;
            insert into log (log_message)
            values (v_message_text);
        end;
end;
$body$;
What is a workaround to save the logs into a table and roll back the changes from p_proc_b and p_proc_c?
That is not possible in PostgreSQL.
The typical workaround is to use dblink to connect to the database itself and write the logs via dblink.
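A minimal sketch of that dblink variant, to be placed inside the exception handler (the loopback connection string is an assumption; adjust dbname/credentials and authentication as needed):
CREATE EXTENSION IF NOT EXISTS dblink;

-- inside the WHEN OTHERS handler: this INSERT runs over a separate connection,
-- in its own transaction, so it survives the rollback of the calling transaction
PERFORM dblink_exec(
    'dbname=' || current_database(),
    format('INSERT INTO log (log_message) VALUES (%L)', v_message_text)
);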
I found three solutions for storing data within a transaction (in my case, for debugging purposes) and still being able to see that data after rolling back the transaction.
My scenario uses the following block, so it may not apply exactly to yours:
DO $$
BEGIN
...
ROLLBACK;
END;
$$;
The first two solutions were suggested to me in the Postgres Slack; the third one I tried and found after talking with them, based on an approach that worked in another database.
Solutions
1 - Using DBLink
You install the dblink extension and connect to another database (which can even be this same database); statements executed over that connection run in their own transaction, so they are not affected by the calling transaction's rollback. (I don't remember the exact steps.)
2 - Using COPY command
Using
COPY (SELECT ...) TO PROGRAM 'psql -c "COPY xyz FROM stdin"'
I have never used it myself; it apparently requires superuser privileges (the command runs a program as the operating-system user of the database server), and I can't say much about how it is used or what its output looks like.
3 - Using Sub-Transactions
Here you use a sub-transaction (I'm not sure that's the correct term; it may be what's called an autonomous transaction) to commit the result you want to keep.
In my case the command looks like this:
I used a temp table, but it seems (I'm not sure) to work with a regular table as well:
CREATE TEMP TABLE IF NOT EXISTS myZone AS
SELECT * FROM public."Zone"
LIMIT 0;

DO $$
BEGIN
    INSERT INTO public."Zone" (...) VALUES (...);
    BEGIN
        INSERT INTO myZone
        SELECT * FROM public."Zone";
        COMMIT;
    END;
    ROLLBACK;
END; $$;

SELECT * FROM myZone;
DROP TABLE myZone;
Don't ask what the purpose of this is; I'm building a test scenario and wanted to track what I had done so far. Since this block does not support returning the result of a SELECT (DQL), I had to do something else, and I wanted a clean report without raising errors.
According to www.techtarget.com:
Autonomous transactions allow a single transaction to be subdivided into multiple commit/rollback transactions, each of which will be tracked for auditing purposes. When an autonomous transaction is called, the original transaction (calling transaction) is temporarily suspended.
(This text was taken from Google's index as it existed on 2022-10-11; the website itself would not open due to an e-mail validation issue.)
Also, this term seems to come from Oracle, which the quoted article relates to.
EDITED:
Removing solution 3 as it won't work
Postgres 11 claims to support autonomous transactions, but it's not what you might expect...
For this functionality Postgres introduced the SAVEPOINT:
SAVEPOINT <name of savepoint>;
<... CODE ...>
RELEASE SAVEPOINT <name of savepoint>;   -- or: ROLLBACK TO SAVEPOINT <name of savepoint>;
Now the issue is:
If you use a nested BEGIN, a COMMIT inside the nested code commits everything, and a ROLLBACK in the outer block then does nothing (it does not roll back anything that happened before the inner COMMIT).
If you use a SAVEPOINT, it can only roll back part of the code; even if you RELEASE the savepoint, a ROLLBACK in the outer block rolls back the savepoint's work too.

Getting newest not handled, updated rows using txid

I have a table in my PostgreSQL database (actually it's multiple tables, but for the sake of simplicity let's assume it's just one) and multiple clients that periodically need to query the table to find changed items. These are updated or inserted items (deleted items are handled by first marking them for deletion and then actually deleting them after a grace period).
Now the obvious solution would be to just keep a “modified” timestamp column for each row, remember it for each client, and then simply fetch the changed ones:
SELECT * FROM the_table WHERE modified > saved_modified_timestamp;
The modified column would then be kept up to date using triggers like:
CREATE FUNCTION update_timestamp()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    NEW.modified = NOW();
    RETURN NEW;
END;
$$;
CREATE TRIGGER update_timestamp_update
BEFORE UPDATE ON the_table
FOR EACH ROW EXECUTE PROCEDURE update_timestamp();
CREATE TRIGGER update_timestamp_insert
BEFORE INSERT ON the_table
FOR EACH ROW EXECUTE PROCEDURE update_timestamp();
The obvious problem here is that NOW() is the time the transaction started. So it might happen that a transaction is not yet committed while the updated rows are being fetched, and when it is committed the timestamp is lower than saved_modified_timestamp, so the update is never registered.
I think I found a solution that would work and I wanted to see if you can find any flaws with this approach.
The basic idea is to use xmin (or rather txid_current()) instead of the timestamp and then when fetching the changes, wrap them in an explicit transaction with REPEATABLE READ and read txid_snapshot() (or rather the three values it contains txid_snapshot_xmin(), txid_snapshot_xmax(), txid_snapshot_xip()) from the transaction.
If I read the Postgres documentation correctly, then all changes made by transactions that are < txid_snapshot_xmax() and not in txid_snapshot_xip() should be returned in that fetch transaction. This information should then be all that is required to get all the updated rows when fetching again. The select would then look like this, with xmin_version replacing the modified column:
SELECT * FROM the_table
WHERE xmin_version >= last_fetch_txid_snapshot_xmax
   OR xmin_version IN last_fetch_txid_snapshot_xip;
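Concretely, each fetch could look something like this (just a sketch; the :last_* placeholders stand for the values saved from the previous fetch):
BEGIN ISOLATION LEVEL REPEATABLE READ;

-- save these three values for the next fetch
SELECT txid_snapshot_xmin(txid_current_snapshot()) AS snap_xmin,
       txid_snapshot_xmax(txid_current_snapshot()) AS snap_xmax;
SELECT txid_snapshot_xip(txid_current_snapshot())  AS snap_xip;   -- returns a set of in-progress txids

SELECT *
FROM the_table
WHERE xmin_version >= :last_snap_xmax
   OR xmin_version = ANY (:last_snap_xip);

COMMIT;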
The triggers would then be simply like this:
CREATE FUNCTION update_xmin_version()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    NEW.xmin_version = txid_current();
    RETURN NEW;
END;
$$;
CREATE TRIGGER update_timestamp_update
BEFORE UPDATE ON the_table
FOR EACH ROW EXECUTE PROCEDURE update_xmin_version();
CREATE TRIGGER update_timestamp_update_insert
BEFORE INSERT ON the_table
FOR EACH ROW EXECUTE PROCEDURE update_xmin_version();
Would this work? Or am I missing something?
Thank you for the clarification about the 64-bit return from txid_current() and how the epoch rolls over. I am sorry I confused that epoch counter with the time epoch.
I cannot see any flaw in your approach but would verify through experimentation that having multiple client sessions concurrently in repeatable read transactions taking the txid_snapshot_xip() snapshot does not cause any problems.
I would not use this method in practice because I assume that the client code will need to solve for handling duplicate reads of the same change (insert/update/delete) as well as periodic reconciliations between the database contents and the client's working set to handle drift due to communication failures or client crashes. Once that code is written, then using now() in the client tracking table, clock_timestamp() in the triggers, and a grace interval overlap when the client pulls changesets would work for use cases I have encountered.
If requirements called for stronger real-time integrity than that, then I would recommend a distributed commit strategy.
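If you go the timestamp route instead, a rough sketch of what I mean (the 5-minute grace interval is just an example value to be tuned):
CREATE OR REPLACE FUNCTION update_timestamp()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    NEW.modified = clock_timestamp();   -- statement time, not transaction-start time
    RETURN NEW;
END;
$$;

-- client side: record the pull time with now(), then re-read a grace window
-- to pick up rows committed late; duplicate reads must be tolerated
SELECT *
FROM the_table
WHERE modified > saved_modified_timestamp - interval '5 minutes';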
Ok, so I've tested it in depth now and haven't found any flaws so far. I had around 30 clients writing and reading to the database at the same time and all of them got consistent updates. So I guess this approach works.

I have an autonomous trigger, but it only executes once in the same session

I have an autonomous trigger, but it only executes once in the same session; after that it does nothing.
CREATE OR REPLACE TRIGGER tdw_insert_unsus
BEFORE INSERT ON unsuscription_fact FOR EACH ROW
DECLARE
    PRAGMA AUTONOMOUS_TRANSACTION;
    v_id_suscription        SUSCRIPTION_FACT.ID_SUSCRIPTION%TYPE;
    v_id_date_suscription   SUSCRIPTION_FACT.ID_DATE_SUSCRIPTION%TYPE;
    v_id_date_unsuscription SUSCRIPTION_FACT.ID_DATE_UNSUSCRIPTION%TYPE;
    v_suscription           DATE;
    v_unsuscription         DATE;
    v_live_time             SUSCRIPTION_FACT.LIVE_TIME%TYPE;
BEGIN
    SELECT id_suscription, id_date_suscription
    INTO   v_id_suscription, v_id_date_suscription
    FROM  (SELECT id_suscription, id_date_suscription
           FROM   suscription_fact
           WHERE  id_mno = :NEW.ID_MNO
           AND    id_provider = :NEW.ID_PROVIDER
           AND    ftp_service_id = :NEW.FTP_SERVICE_ID
           AND    msisdn = :NEW.MSISDN
           AND    id_date_unsuscription IS NULL
           ORDER BY id_date_suscription DESC)
    WHERE  ROWNUM = 1;

    -- calculate time
    v_unsuscription := to_date(:NEW.id_date_unsuscription, 'yyyymmdd');
    v_suscription   := to_date(v_id_date_suscription, 'yyyymmdd');
    v_live_time     := (v_unsuscription - v_suscription);

    UPDATE suscription_fact
    SET    id_date_unsuscription = :NEW.id_date_unsuscription,
           id_time_unsuscription = :NEW.id_time_unsuscription,
           live_time = v_live_time
    WHERE  id_suscription = v_id_suscription;
    COMMIT;
EXCEPTION
    WHEN NO_DATA_FOUND THEN
        ROLLBACK;
END;
/
If I insert values it works well the first or second time, but after that it does not work. However, if I log out of the session and log back in, it again works for the first or second insertion.
What is the problem? I use Oracle 10g.
You're using an autonomous transaction to work around the fact that a trigger can not query its table itself. You've run into the infamous mutating table error and you have found that declaring the trigger as an autonomous transaction makes the error go away.
No luck for you though, this does not solve the problem at all:
First, any transaction logic is lost. You can't rollback the changes on the suscription_fact table, they are committed, while your main transaction is not and could be rolled back. So you've also lost your data integrity.
The trigger can not see the new row because the new row hasn't been committed yet! Since the trigger runs in an independent transaction, it can not see the uncommitted changes made by the main transaction: you will run into completely wrong results.
This is why you should never do any business logic in autonomous transactions. (there are legitimate applications but they are almost entirely limited to logging/debugging).
In your case you should either:
Update your logic so that it does not need to query your table (updating suscription_fact only if the new row is more recent than the old value stored in id_date_unsuscription).
Forget about using business logic in triggers and use a procedure that updates all tables correctly or use a view because here we have a clear case of redundant data.
Use a workaround that actually works (by Tom Kyte).
I would strongly advise using (2) here. Don't use triggers to code business logic. They are hard to write without bugs and harder still to maintain. Using a procedure guarantees that all the relevant code is grouped in one place (a package or a procedure), easy to read and follow and without unforeseen consequences.
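A hedged sketch of what option (2) could look like, with the insert and the update done together in one procedure and one transaction (column names are taken from the trigger above; the parameter list and procedure name are assumptions, since the real table certainly has more columns):
CREATE OR REPLACE PROCEDURE register_unsuscription (
    p_id_mno                IN unsuscription_fact.id_mno%TYPE,
    p_id_provider           IN unsuscription_fact.id_provider%TYPE,
    p_ftp_service_id        IN unsuscription_fact.ftp_service_id%TYPE,
    p_msisdn                IN unsuscription_fact.msisdn%TYPE,
    p_id_date_unsuscription IN unsuscription_fact.id_date_unsuscription%TYPE,
    p_id_time_unsuscription IN unsuscription_fact.id_time_unsuscription%TYPE
) AS
BEGIN
    -- the insert the trigger used to fire on
    INSERT INTO unsuscription_fact
        (id_mno, id_provider, ftp_service_id, msisdn,
         id_date_unsuscription, id_time_unsuscription)
    VALUES
        (p_id_mno, p_id_provider, p_ftp_service_id, p_msisdn,
         p_id_date_unsuscription, p_id_time_unsuscription);

    -- close the most recent open suscription row, in the same transaction
    UPDATE suscription_fact s
    SET    s.id_date_unsuscription = p_id_date_unsuscription,
           s.id_time_unsuscription = p_id_time_unsuscription,
           s.live_time = to_date(p_id_date_unsuscription, 'yyyymmdd')
                         - to_date(s.id_date_suscription, 'yyyymmdd')
    WHERE  s.id_suscription =
           (SELECT id_suscription
            FROM  (SELECT id_suscription
                   FROM   suscription_fact
                   WHERE  id_mno = p_id_mno
                   AND    id_provider = p_id_provider
                   AND    ftp_service_id = p_ftp_service_id
                   AND    msisdn = p_msisdn
                   AND    id_date_unsuscription IS NULL
                   ORDER BY id_date_suscription DESC)
            WHERE  ROWNUM = 1);
    -- the caller commits or rolls back both statements together
END;
/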

Bulk Insert into Oracle database: Which is better: FOR Cursor loop or a simple Select?

Which would be a better option for a bulk insert into an Oracle database?
A FOR Cursor loop like
DECLARE
    CURSOR C1 IS SELECT * FROM FOO;
BEGIN
    FOR C1_REC IN C1 LOOP
        INSERT INTO BAR (A, B, C)
        VALUES (C1_REC.A, C1_REC.B, C1_REC.C);
    END LOOP;
END;
or a simple select, like:
INSERT INTO BAR (A, B, C)
SELECT A, B, C
FROM FOO;
Any specific reason either one would be better ?
I would recommend the SELECT option, because the cursor approach takes longer.
Also, the plain SELECT is much easier to understand for anyone who has to modify your query.
The general rule-of-thumb is, if you can do it using a single SQL statement instead of using PL/SQL, you should. It will usually be more efficient.
However, if you need to add more procedural logic (for some reason), you might need to use PL/SQL, but you should use bulk operations instead of row-by-row processing. (Note: in Oracle 10g and later, your FOR loop will automatically use BULK COLLECT to fetch 100 rows at a time; however your insert statement will still be done row-by-row).
e.g.
DECLARE
    TYPE tA IS TABLE OF FOO.A%TYPE INDEX BY PLS_INTEGER;
    TYPE tB IS TABLE OF FOO.B%TYPE INDEX BY PLS_INTEGER;
    TYPE tC IS TABLE OF FOO.C%TYPE INDEX BY PLS_INTEGER;
    rA tA;
    rB tB;
    rC tC;
BEGIN
    SELECT * BULK COLLECT INTO rA, rB, rC FROM FOO;
    -- (do some procedural logic on the data?)
    FORALL i IN rA.FIRST..rA.LAST
        INSERT INTO BAR (A, B, C)
        VALUES (rA(i), rB(i), rC(i));
END;
The above has the benefit of minimising context switches between SQL and PL/SQL. Oracle 11g also has better support for tables of records so that you don't have to have a separate PL/SQL table for each column.
Also, if the volume of data is very great, it is possible to change the code to process the data in batches.
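For example, a batched version might look like this (a sketch; the 10,000-row batch size and the per-batch commit are assumptions to be tuned):
DECLARE
    CURSOR c_foo IS SELECT A, B, C FROM FOO;
    TYPE tA IS TABLE OF FOO.A%TYPE INDEX BY PLS_INTEGER;
    TYPE tB IS TABLE OF FOO.B%TYPE INDEX BY PLS_INTEGER;
    TYPE tC IS TABLE OF FOO.C%TYPE INDEX BY PLS_INTEGER;
    rA tA;
    rB tB;
    rC tC;
BEGIN
    OPEN c_foo;
    LOOP
        FETCH c_foo BULK COLLECT INTO rA, rB, rC LIMIT 10000;
        EXIT WHEN rA.COUNT = 0;
        FORALL i IN 1 .. rA.COUNT
            INSERT INTO BAR (A, B, C)
            VALUES (rA(i), rB(i), rC(i));
        COMMIT;   -- optional: keeps undo usage per batch small
    END LOOP;
    CLOSE c_foo;
END;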
If your rollback segment/undo segment can accommodate the size of the transaction then option 2 is better. Option 1 is useful if you do not have the rollback capacity needed and have to break the large insert into smaller commits so you don't get rollback/undo segment too small errors.
A simple insert/select like your 2nd option is far preferable. For each insert in the 1st option you require a context switch from pl/sql to sql. Run each with trace/tkprof and examine the results.
If, as Michael mentions, your rollback cannot handle the statement then have your DBA give you more. Disk is cheap, while partial results that come from inserting your data in multiple passes are potentially quite expensive. (There is almost no undo associated with an insert.)
I think one important piece of information is missing from this question:
How many records will you insert?
If it's roughly 1 to 10,000, you should use a single SQL statement (as others said, it is easy to understand and easy to write).
If it's roughly 10,000 to 100,000, you should use a cursor, but add logic to commit every 10,000 records.
If it's roughly 100,000 to millions, you should use BULK COLLECT for better performance.
As you can see by reading the other answers, there are a lot of options available. If you are just doing < 10k rows you should go with the second option.
In short, for approximately > 10k rows all the way up to, say, < 100k, it is kind of a gray area. A lot of old geezers will bark at big rollback segments. But honestly, hardware and software have made amazing progress, to the point where you may be able to get away with option 2 for a lot of records if you only run the code a few times. Otherwise you should probably commit every 1k-10k or so rows. Here is a snippet that I use. I like it because it is short and I don't have to declare a cursor. Plus it gets the implicit bulk fetch of the cursor FOR loop (though, as noted above, the inserts themselves are still row-by-row).
begin
    for r in (select rownum rn, t.* from foo t) loop
        insert into bar (A, B, C) values (r.A, r.B, r.C);
        if mod(r.rn, 1000) = 0 then
            commit;
        end if;
    end loop;
    commit;
end;
I found this link from the oracle site that illustrates the options in more detail.
You can use:
BULK COLLECT along with FORALL, which together are called bulk binding.
The PL/SQL FORALL operator can be many times faster (the often-quoted figure is about 30x) for simple table inserts.
BULK COLLECT and Oracle FORALL together are known as bulk binding. Bulk binds are a PL/SQL technique where, instead of executing multiple individual SELECT, INSERT, UPDATE or DELETE statements to retrieve data from, or store data in, a table, all of the operations are carried out at once, in bulk. This avoids the context switching you get when the PL/SQL engine has to pass control to the SQL engine, then back to the PL/SQL engine, and so on, as you access rows one at a time. To do bulk binds with INSERT, UPDATE, and DELETE statements, you enclose the SQL statement within a PL/SQL FORALL statement. To do bulk binds with SELECT statements, you include the BULK COLLECT clause in the SELECT statement instead of using INTO.
It improves performance.
I do neither for a daily complete reload of data. For example, say I am loading my Denver site. There are other strategies for near-real-time deltas.
I use a CREATE TABLE ... AS SELECT, which I have found to be almost as fast as a bulk load.
For example, below a CREATE TABLE statement is used to stage the data, casting the columns to the needed data types:
CREATE TABLE sales_dataTemp AS
SELECT
    cast(column1 AS date)   AS SALES_QUARTER,
    cast(sales   AS number) AS SALES_IN_MILLIONS,
    ....
FROM TABLE1;
This staging table mirrors my target table's structure exactly; the target table is list-partitioned by site.
I then do a partition swap with the DENVER partition and I have a new data set.
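The swap itself is a single DDL statement, roughly like this (the target table name sales_data, the partition name and the optional clauses are placeholders based on the description above):
ALTER TABLE sales_data
    EXCHANGE PARTITION denver
    WITH TABLE sales_dataTemp
    INCLUDING INDEXES
    WITHOUT VALIDATION;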