Explicitly set ROWNUM in column - sql

I'm trying to split what was a large table update into multiple inserts into working tables. One of the queries needs to use the row number. On an INSERT in Oracle, can I explicitly add the ROWNUM as an explicit column? This is a working table ultimately used in a reporting operation with a nasty OVER (PARTITION BY ...) clause, and having a true row number is helpful.
create table MY_TABLE (
  KEY                 number,
  SOMEVAL             varchar2(30),
  EXPLICIT_ROW_NUMBER number
);

INSERT /*+ PARALLEL(AUTO) */ INTO MY_TABLE (KEY, SOMEVAL, EXPLICIT_ROW_NUMBER)
SELECT /*+ PARALLEL(AUTO) */ KEY, SOMEVAL, ROWNUM
FROM PREVIOUS_VERSION_OF_MY_TABLE;
where PREVIOUS_VERSION_OF_MY_TABLE has both KEY and SOMEVAL columns.
I'd like the rows numbered in the order the inner SELECT produces them: the first row the SELECT would return, had it been run on its own, gets a ROWNUM of 1, and so on. I don't want the numbering reversed or otherwise shuffled.
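(A note on ordering: ROWNUM is assigned in whatever order the possibly-parallel execution plan happens to produce rows, so if a specific order matters, an analytic ROW_NUMBER() with an explicit ORDER BY seems safer. A minimal sketch, where ORDER_COL is a hypothetical column defining the desired order:

INSERT /*+ PARALLEL(AUTO) */ INTO MY_TABLE (KEY, SOMEVAL, EXPLICIT_ROW_NUMBER)
SELECT /*+ PARALLEL(AUTO) */ KEY, SOMEVAL,
       ROW_NUMBER() OVER (ORDER BY ORDER_COL)  -- deterministic, unlike ROWNUM under a parallel plan
FROM PREVIOUS_VERSION_OF_MY_TABLE;)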
The table above has over 80 million records. Originally I used an UPDATE, and when I ran it I got an ORA error saying that I had run out of UNDO space. I no longer have the exact error message.
I'm trying to accomplish with multiple working tables what I would have done with one or more updates. Apparently it is hard or impossible to add UNDO space for this query (our company DB team says) without making me a DBA, or without spending about $100 on a hard drive and attaching it to the instance. So I need to write a harder query to get around this limitation. The goal is to have a session id and timestamps within that session, and for each timestamp within a session (except the last), show the next timestamp in that session. The original query is included below:
update sc_hub_session_activity schat
set session_time_stamp_rank = (
  select /*+ parallel(AUTO) */ order_number
  from (
    select /*+ parallel(AUTO) */
           schat_all.explicit_row_number as explicit_row_number,
           row_number() over (partition by schat_all.session_key
                              order by schat_all.session_key, schat_all.time_stamp) as order_number
    from sc_hub_session_activity schat_all
    where schat_all.session_key = schat.session_key
  ) schat_all_group
  where schat.explicit_row_number = schat_all_group.explicit_row_number
);
commit;
update sc_hub_session_activity schat
set session_next_time_stamp = (
  select /*+ parallel(AUTO) */ time_stamp
  from sc_hub_session_activity schat2
  where schat2.session_time_stamp_rank = schat.session_time_stamp_rank + 1
    and schat2.session_key = schat.session_key
);
commit;
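(As an aside, one way the working-table approach could look: a single CREATE TABLE ... AS SELECT with analytic functions computes both the rank and the next timestamp at once, since LEAD reads the following row's time_stamp directly, and a CTAS generates far less UNDO than two full-table updates. The new table name and the NOLOGGING option are assumptions; the column names follow the queries above:

create table sc_hub_session_activity_work nologging as
select /*+ parallel(auto) */
       schat.session_key,
       schat.time_stamp,
       schat.explicit_row_number,
       row_number() over (partition by schat.session_key
                          order by schat.time_stamp) as session_time_stamp_rank,
       lead(schat.time_stamp) over (partition by schat.session_key
                                    order by schat.time_stamp) as session_next_time_stamp
from sc_hub_session_activity schat;)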

Related

which delete statement is better for deleting millions of rows

I have a table which contains millions of rows.
I want to delete all the data which is over a week old, based on the value of column last_updated.
So here are my two queries,
Approach 1:
Delete from A where to_date(last_updated,'yyyy-mm-dd') < sysdate-7;
Approach 2:
l_lastupdated varchar2(255) := to_char(sysdate-nvl(p_days,7),'YYYY-MM-DD');
insert into B(ID) select ID from A where LASTUPDATED < l_lastupdated;
delete from A where id in (select id from B);
Which one is better considering performance, safety and locking?
Assuming the delete removes a significant fraction of the data (millions of rows), approach three:
create table tmp as
  select * from A
  where to_date(last_updated,'yyyy-mm-dd') >= sysdate-7;  -- keep only the rows you want
drop table A;
rename tmp to A;
https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:2345591157689
Obviously you'll need to copy over all the indexes, grants, etc. But online redefinition can help with this: https://oracle-base.com/articles/11g/online-table-redefinition-enhancements-11gr1
When you get to 12.2, there's another simpler option: a filtered move.
This is an alter table move operation, with an extra clause stating which rows you want to keep:
create table t (
  c1 int
);

insert into t values ( 1 );
insert into t values ( 2 );
commit;

alter table t
  move including rows where c1 > 1;
select * from t;
C1
2
While you're waiting to upgrade to 12.2+, and if you don't want to use the create-as-select method for some reason, then approach 1 is superior:
Both methods delete the same rows from A* => it's the same amount of work to do the delete
Option 1 has one statement; Option 2 has two statements; 2 > 1 => option 2 is more work
*Statement-level consistency means you might get different results running the two processes. Say another session tries to update an old row that your process will remove.
With just the delete, the update will be blocked until the delete finishes. At which point the row's gone, so the update does nothing.
Whereas if you do the insert first, the other session can update & commit the row before the insert completes. So the update "succeeds". But the delete will then remove it! Which can lead to some unhappy customers...
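To sketch that interleaving as a timeline (session labels mine):

-- Session 1: insert into B(ID) select ID from A where LASTUPDATED < l_lastupdated;  -- row X qualifies
-- Session 2: update A set last_updated = <today> where id = X;  -- not blocked: the insert's select doesn't lock rows in A
-- Session 2: commit;  -- row X is now current
-- Session 1: delete from A where id in (select id from B);  -- row X is deleted anyway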
Your stored date format seems suitable for proper sorting, so you could go the other way round and convert sysdate to a string:
--this is false today
select * from dual where '2019-06-05' < to_char(sysdate-7, 'YYYY-MM-DD');
--this is true today
select * from dual where '2019-05-05' < to_char(sysdate-7, 'YYYY-MM-DD');
So it would be:
Delete from A where last_updated < to_char(sysdate-7, 'yyyy-mm-dd');
It has the benefit that an index on last_updated (if there is one) can be used.
It has the disadvantage of relying on string/varchar ordering, which might change, e.g. with NLS settings (if I remember right), so in any case you should do a little testing first...
In the long term, you should of course alter the column to a proper date datatype, but I guess that doesn't help you right now ;)
If you are trying to delete most of the rows in the table, I would advise you go with a different approach, namely:
create table <new table name> as
select *
from <old table name>
where <predicates for the data you want to keep>;
then
drop table <old table name>;
and finally you can rename the new table back to the old table name (in Oracle: rename <new table name> to <old table name>;).
You could always partition the new table (i.e. create the new table with a separate statement containing the partitioning clauses, and then have an insert as select into the new table from the old table).
That way, when you need to delete rows, it's a simple matter of dropping the relevant partition(s).
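A minimal sketch of that idea, assuming you also move to a proper DATE column and weekly interval partitioning (table and column names are hypothetical, and the partition-for date assumes data exists for that week):

create table A_new (
  id           number,
  last_updated date
)
partition by range (last_updated)
interval (numtodsinterval(7, 'DAY'))
(
  partition p0 values less than (date '2019-01-01')
);

insert /*+ append */ into A_new
select id, to_date(last_updated, 'yyyy-mm-dd')
from A;

-- ageing out a week of data then becomes a metadata operation:
alter table A_new drop partition for (date '2019-01-05');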

Postgres: SELECT or INSERT in high concurrent write load DB

We have a DB for which we need a "selsert" (not upsert) function.
The function should take a text value and return the id of the existing row (SELECT), or insert the value and return the id of the new row (INSERT).
There are multiple processes that will need to perform this functionality (selsert).
I have been experimenting with pg_advisory_lock and the ON CONFLICT clause for INSERT, but am still not sure what approach would work best (even after looking at some of the other answers).
So far I have come up with the following:
WITH selected AS (
  SELECT id
  FROM test.body_parts
  WHERE lower(trim(part)) = lower(trim('finger'))
  LIMIT 1
),
inserted AS (
  INSERT INTO test.body_parts (part)
  SELECT trim('finger')
  WHERE NOT EXISTS (SELECT * FROM selected)
  -- ON CONFLICT (lower(trim(part))) DO NOTHING -- not sure if this is needed
  RETURNING id
)
SELECT id, 'inserted' FROM inserted
UNION
SELECT id, 'selected' FROM selected;
Will the above query (within a function) ensure consistency in high-concurrency write workloads?
Are there any other issues I must consider (locking, etc.)?
BTW, I can ensure that there are no duplicate values of (part) by creating a unique index. That is not an issue. What I am after is that the SELECT returns the existing value if another process does the INSERT (I hope I am explaining this right).
The unique index would have the following definition:
CREATE UNIQUE INDEX body_parts_part_ux
ON test.body_parts
USING btree
(lower(trim(part)));
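For reference, the pattern described in "Is SELECT or INSERT in a function prone to race conditions?" (linked further down this page) is a function that loops, letting the unique index absorb races. A sketch assuming the index above exists (the function name is hypothetical):

CREATE OR REPLACE FUNCTION test.selsert_part(p_part text)
RETURNS integer
LANGUAGE plpgsql AS
$$
DECLARE
  v_id integer;
BEGIN
  LOOP
    -- look for an existing row first
    SELECT id INTO v_id
    FROM test.body_parts
    WHERE lower(trim(part)) = lower(trim(p_part));
    IF FOUND THEN
      RETURN v_id;
    END IF;

    -- none found: try to insert; a concurrent winner makes this do nothing
    INSERT INTO test.body_parts (part)
    VALUES (trim(p_part))
    ON CONFLICT (lower(trim(part))) DO NOTHING
    RETURNING id INTO v_id;
    IF v_id IS NOT NULL THEN
      RETURN v_id;
    END IF;
    -- lost the race: loop again and pick up the winner's committed row
  END LOOP;
END;
$$;

The loop matters because under READ COMMITTED a single-statement CTE like the one above can return zero rows when a concurrent insert commits between the snapshot and the conflict check.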

SQL Transaction Set Temporary Value

I am relatively new to databases and SQL, and I am not clear on how or whether transactions may relate to a problem that I am trying to solve. I want to be able to temporarily set a value in a database table, run some query, and then clear out the value that was set, and I don't want any operations outside of the transaction to be able to see or alter the temporary value that was set.
The reason I am doing this is so that I can create predefined views that query certain data depending on variables such as the current user's id. In order for the predefined view to have access to the current user's id, I would save the id into a special table just before querying the view, then delete the id immediately afterward. I don't want to worry about some other user overwriting the current user's id while the transaction is in process. Is this a proper use for a transaction?
I am using H2 if that makes a difference.
SET @tempVar = value;
I don't know if you really need to go through the pain of creating a temp table and setting the value. This seems far simpler.
You can then do: SELECT * FROM TABLE_NAME WHERE COLUMN = @tempVar;
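A note since the question mentions H2: if I remember right, H2 supports these MySQL-style user variables too, and they are scoped to the session, so another user's connection cannot see or overwrite them, which seems to fit the requirement. A minimal sketch (table and column names hypothetical):

SET @current_user_id = 42;   -- visible only to this session
SELECT * FROM orders WHERE user_id = @current_user_id;
SET @current_user_id = NULL; -- clear when done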
I think you want a Procedure or Function. Both can take a parameter as input.
ex.
CREATE PROCEDURE pr_emp
(
  @input INT
)
AS
SELECT *
FROM myTable
WHERE emp_id = @input
ex.
CREATE FUNCTION v_empid (@input INT)
RETURNS TABLE
AS
RETURN
  SELECT * FROM myTABLE WHERE emp_id = @input;
These could let you access information for an empid. For example:
SELECT * FROM v_empid(32)
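And to call the procedure version (the parameter value is hypothetical):

EXEC pr_emp @input = 32;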

Get Id from a conditional INSERT

For a table like this one:
CREATE TABLE Users (
  id   SERIAL PRIMARY KEY,
  name TEXT UNIQUE
);
What would be the correct one-query insert for the following operation:
Given a user name, insert a new record and return the new id. But if the name already exists, just return the id.
I am aware of the new syntax within PostgreSQL 9.5 for ON CONFLICT(column) DO UPDATE/NOTHING, but I can't figure out how, if at all, it can help, given that I need the id to be returned.
It seems that RETURNING id and ON CONFLICT do not belong together.
The UPSERT implementation is hugely complex in order to be safe against concurrent write access. Take a look at this Postgres Wiki that served as a log during initial development. The Postgres hackers decided not to include "excluded" rows in the RETURNING clause for the first release in Postgres 9.5. They might build something in for the next release.
This is the crucial statement in the manual to explain your situation:
The syntax of the RETURNING list is identical to that of the output list of SELECT. Only rows that were successfully inserted or updated will be returned. For example, if a row was locked but not updated because an ON CONFLICT DO UPDATE ... WHERE clause condition was not satisfied, the row will not be returned.
Bold emphasis mine.
For a single row to insert:
Without concurrent write load on the same table
WITH ins AS (
  INSERT INTO users (name)
  VALUES ('new_usr_name')       -- input value
  ON CONFLICT (name) DO NOTHING
  RETURNING users.id
)
SELECT id FROM ins
UNION ALL
SELECT id FROM users            -- 2nd SELECT never executed if INSERT successful
WHERE name = 'new_usr_name'     -- input value a 2nd time
LIMIT 1;
With possible concurrent write load on the table
Consider this instead (for single row INSERT):
Is SELECT or INSERT in a function prone to race conditions?
To insert a set of rows:
How to use RETURNING with ON CONFLICT in PostgreSQL?
How to include excluded rows in RETURNING from INSERT ... ON CONFLICT
All three with very detailed explanation.
For a single row insert and no update:
with i as (
  insert into users (name)
  select 'the name'
  where not exists (
    select 1
    from users
    where name = 'the name'
  )
  returning id
)
select id
from users
where name = 'the name'
union all
select id from i;
The manual, about the primary query and the WITH subqueries:
The primary query and the WITH queries are all (notionally) executed at the same time
Although that sounds to me like "same snapshot", I'm not sure, since I don't know what notionally means in that context.
But there is also:
The sub-statements in WITH are executed concurrently with each other and with the main query. Therefore, when using data-modifying statements in WITH, the order in which the specified updates actually happen is unpredictable. All the statements are executed with the same snapshot
If I understand correctly, that same snapshot bit prevents a race condition. But again I'm not sure whether by all the statements it refers only to the statements in the WITH subqueries, excluding the main query. To avoid any doubt, move the select in the previous query to a WITH subquery:
with s as (
  select id
  from users
  where name = 'the name'
), i as (
  insert into users (name)
  select 'the name'
  where not exists (select 1 from s)
  returning id
)
select id from s
union all
select id from i;

How to return record count in PostgreSQL

I have a query with a limit and an offset. For example:
select * from tbl
limit 10 offset 100;
How can I keep track of the total count of the records, without running a second query like:
select count(*) from tbl;
I think this answers my question, but I need it for PostgreSQL. Any ideas?
I have found a solution and I want to share it. What I do is: I create a temp table from my real table with the filters applied, then I select from the temp table with a limit and offset (no filters, so the performance is good), then select count(*) from the temp table (again no filters), then the other stuff I need, and last I drop the temp table.
select * into tmp_tbl from tbl where [limitations];
select * from tmp_tbl offset 10 limit 10;
select count(*) from tmp_tbl;
select other_stuff from tmp_tbl;
drop table tmp_tbl;
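Another option worth noting alongside this one: a window function can return the total count next to each limited row in a single query, at the cost of counting the whole filtered set. A sketch using the same placeholder filter as above:

select *,
       count(*) over () as total_count  -- total rows before offset/limit
from tbl
where [limitations]
offset 10 limit 10;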
I haven't tried this, but according to the section titled Obtaining the Result Status in the PL/pgSQL documentation, you can use the GET DIAGNOSTICS command to determine the effect of a command:
GET DIAGNOSTICS number_of_rows = ROW_COUNT;
From the documentation:
This command allows retrieval of system status indicators. Each item is a key word identifying a state value to be assigned to the specified variable (which should be of the right data type to receive it). The currently available status items are ROW_COUNT, the number of rows processed by the last SQL command sent down to the SQL engine, and RESULT_OID, the OID of the last row inserted by the most recent SQL command. Note that RESULT_OID is only useful after an INSERT command into a table containing OIDs.
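Note that GET DIAGNOSTICS is PL/pgSQL only, not plain SQL, and ROW_COUNT reflects the rows processed by the last command, so a LIMITed query reports at most the limit, not the table total. A minimal sketch of the mechanics:

DO $$
DECLARE
  n bigint;
BEGIN
  PERFORM * FROM tbl LIMIT 10 OFFSET 100;
  GET DIAGNOSTICS n = ROW_COUNT;  -- rows processed by the last command
  RAISE NOTICE 'rows: %', n;
END
$$;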
It depends on whether you need it from the psql CLI or whether you're accessing the database from something like an HTTP server. I am using Postgres from my Node server with node-postgres. The result set is returned as an array called 'rows' on the result object, so I can just do
console.log(results.rows.length)
to get the row count.