How does SQLite behave when inserting unordered hashes as the rowid?

According to the SQLite documentation on rowid, the data for rowid tables is stored in a B-tree. I've been considering using a hash of my data as the rowid. This means I'd be inserting rows whose rowids are not in increasing order, unlike the default rowid assignment. How will this impact INSERT and SELECT performance, and the layout of data in my table?
If I insert a row which has a large rowid because it's a hash, and then a row with a smaller rowid, what will the table layout look like?

It depends on how you do it.
If you do not define an alias for the rowid column and a VACUUM takes place, then the rowid values will likely be changed (they may be re-assigned).
e.g. :-
DROP TABLE IF EXISTS tablex;
CREATE TABLE IF NOT EXISTS tablex (data TEXT);
INSERT INTO tablex (rowid,data) VALUES(82356476978,'fred'),(55,'mary');
SELECT rowid AS therowid,* FROM tablex;
VACUUM;
SELECT rowid AS therowid,* FROM tablex;
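The first SELECT shows the rowids as inserted (82356476978 and 55); the SELECT after the VACUUM may show different, re-assigned rowids.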
If an alias is defined (a column declared INTEGER PRIMARY KEY), the VACUUM shouldn't be an issue and it's fine to supply your own values.
Of course you have to adhere to the rules and as long as the rules are obeyed, that is that the values are unique integers and are not greater than 9223372036854775807 or less than -9223372036854775808, then it should be fine. Other values would result in a datatype mismatch error.
I don't believe there would be much of an impact upon performance. There could possibly even be an improvement, as there may well be free space in leaves, reducing the need for a more costly split.
e.g. the following :-
DROP TABLE IF EXISTS tabley;
CREATE TABLE IF NOT EXISTS tabley (myrowidalias INTEGER PRIMARY KEY ,data TEXT);
INSERT INTO tabley VALUES(9223372036854775807,'fred'),(-9223372036854775808,'Mary'),(55,'Sue');
SELECT rowid AS therowid,* FROM tabley;
VACUUM;
SELECT rowid AS therowid,* FROM tabley;
-- INSERT INTO tabley VALUES(9223372036854775808,'Sarah'); -- Datatype mismatch
INSERT INTO tabley VALUES(-9223372036854775809,'Bob'); -- Datatype mismatch
SELECT rowid AS therowid,* FROM tabley; -- not run due to above error
The SELECT returns the three rows with their rowids as inserted (note the rowid is retrieved both via rowid and via its alias), and the result after the VACUUM is identical.
With message :-
-- INSERT INTO tabley VALUES(9223372036854775808,'Sarah');
INSERT INTO tabley VALUES(-9223372036854775809,'Bob')
> datatype mismatch
> Time: 0s
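For the hash-as-rowid idea itself, here is a minimal sketch in Python using the standard sqlite3 module. It is not taken from the SQLite documentation: the helper rowid_from_data and the choice of BLAKE2b truncated to 8 bytes are assumptions made purely for illustration. It maps each value to a signed 64-bit integer (the range given above), inserts the rows in whatever order the hashes fall, and checks that the values survive a VACUUM when they are stored in an INTEGER PRIMARY KEY alias.

```python
import hashlib
import sqlite3


def rowid_from_data(data: str) -> int:
    """Hypothetical helper: hash data to a signed 64-bit integer usable as a rowid."""
    digest = hashlib.blake2b(data.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tabley (myrowidalias INTEGER PRIMARY KEY, data TEXT)")

# The hashes arrive in no particular order; SQLite places each row by its key.
for name in ["fred", "mary", "sue"]:
    conn.execute("INSERT INTO tabley VALUES (?, ?)", (rowid_from_data(name), name))
conn.commit()

query = "SELECT rowid, myrowidalias, data FROM tabley ORDER BY rowid"
before = conn.execute(query).fetchall()
conn.execute("VACUUM")  # must run outside a transaction, hence the commit above
after = conn.execute(query).fetchall()
```

Two different values hashing to the same 64-bit integer would break the uniqueness rule and the INSERT would fail with a constraint error, so a real design needs a plan for collisions.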

Related

Get identity of row inserted in Snowflake Datawarehouse

If I have a table with an auto-incrementing ID column, I'd like to be able to insert a row into that table, and get the ID of the row I just created. I know that generally, StackOverflow questions need some sort of code that was attempted or research effort, but I'm not sure where to begin with Snowflake. I've dug through their documentation and I've found nothing for this.
The best I could do so far is try result_scan() and last_query_id(), but these don't give me any relevant information about the row that was inserted, just confirmation that a row was inserted.
I believe what I'm asking for is along the lines of MS SQL Server's SCOPE_IDENTITY() function.
Is there a Snowflake equivalent function for MS SQL Server's SCOPE_IDENTITY()?
EDIT: for the sake of having code in here:
CREATE TABLE my_db..my_table
(
ROWID INT IDENTITY(1,1),
some_number INT,
a_time TIMESTAMP_LTZ(9),
b_time TIMESTAMP_LTZ(9),
more_data VARCHAR(10)
);
INSERT INTO my_db..my_table
(
some_number,
a_time,
more_data
)
VALUES
(1, my_time_value, some_data);
I want to get to that auto-increment ROWID for this row I just inserted.
NOTE: The answer below may be incorrect in some very rare cases; see the UPDATE section below.
Original answer
Snowflake does not provide the equivalent of SCOPE_IDENTITY today.
However, you can exploit Snowflake's time travel to retrieve the maximum value of a column right after a given statement is executed.
Here's an example:
create or replace table x(rid int identity, num int);
insert into x(num) values(7);
insert into x(num) values(9);
-- you can insert rows in a separate transaction now to test it
select max(rid) from x AT(statement=>last_query_id());
----------+
MAX(RID) |
----------+
2 |
----------+
You can also save the last_query_id() into a variable if you want to access it later, e.g.
insert into x(num) values(5);
set qid = last_query_id();
...
select max(rid) from x AT(statement=>$qid);
Note: this will usually be correct, but if the user e.g. inserts a large value into rid manually, it might influence the result of this query.
UPDATE
Note: I realized the code above might, in rare cases, generate an incorrect answer.
Since the execution order of various phases of a query in a distributed system like Snowflake can be non-deterministic, and Snowflake allows concurrent INSERT statements, the following might happen:
Two queries, Q1 and Q2, do a simple single row INSERT, start at roughly the same time
Q1 starts, is a bit ahead
Q2 starts
Q1 creates a row with value 1 from the IDENTITY column
Q2 creates a row with value 2 from the IDENTITY column
Q2 gets ahead of Q1 - this is the key part
Q2 commits, is marked as finished at time T2
Q1 commits, is marked as finished at time T1
Note that T1 is later than T2. Now, when we try to do SELECT ... AT(statement=>Q1), we will see the state as of T1, including all changes from statements before, hence including the value 2 from Q2, which is not what we want.
The way around it could be to add a unique identifier to each INSERT (e.g. from a separate SEQUENCE object), and then use a MAX.
Sorry. Distributed transactions are hard :)
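To illustrate the "unique identifier per INSERT, then MAX" workaround, here is a minimal sketch in Python. It uses the standard sqlite3 module as a local stand-in, not Snowflake, so only the logic carries over; the insert_tag column, the insert_and_get_id helper and the use of a UUID instead of a SEQUENCE are all assumptions made for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE x (rid INTEGER PRIMARY KEY AUTOINCREMENT, num INT, insert_tag TEXT)"
)


def insert_and_get_id(conn, num):
    """Hypothetical helper: tag the INSERT, then look up only rows carrying that tag."""
    tag = uuid.uuid4().hex  # stands in for a value drawn from a separate SEQUENCE
    conn.execute("INSERT INTO x (num, insert_tag) VALUES (?, ?)", (num, tag))
    (rid,) = conn.execute(
        "SELECT MAX(rid) FROM x WHERE insert_tag = ?", (tag,)
    ).fetchone()
    return rid


first = insert_and_get_id(conn, 7)
# A row written by someone else does not affect the lookup, because it has another tag.
conn.execute("INSERT INTO x (num, insert_tag) VALUES (9, 'someone-else')")
second = insert_and_get_id(conn, 5)
```

Because the lookup filters on the tag, rows from other statements cannot leak into the MAX, whichever statement happens to commit first.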
"If I have a table with an auto-incrementing ID column, I'd like to be able to insert a row into that table, and get the ID of the row I just created."
FWIW, here's a slight variation of the current accepted answer (using Snowflake's 'Time Travel' feature) that gives any column values "of the row I just created." It applies to auto-incrementing sequences and more generally to any column configured with a default (e.g. CURRENT_TIMESTAMP() or UUID_STRING()). Further, I believe it avoids any inconsistencies associated with a second query utilizing MAX().
Assuming this table setup:
CREATE TABLE my_db.my_table
(
ROWID INT IDENTITY(1,1),
some_number INT,
a_time TIMESTAMP_LTZ(9),
b_time TIMESTAMP_LTZ(9),
more_data VARCHAR(10)
);
Make sure change tracking, which the CHANGES clause relies on, is enabled for this table with:
ALTER TABLE my_db.my_table SET change_tracking = true;
Perform the INSERT per usual:
INSERT INTO my_db.my_table
(
some_number,
a_time,
more_data
)
VALUES
(1, my_time_value, some_data);
Use the CHANGES clause, with both BEFORE(statement => ...) and END(statement => ...) set to the query ID saved from LAST_QUERY_ID(), to SELECT the rows added to my_table by the previous INSERT statement, with the column values that existed the moment the rows were added, including any defaults:
SET insertQueryId=LAST_QUERY_ID();
SELECT
ROWID,
some_number,
a_time,
b_time,
more_data
FROM my_db.my_table
CHANGES(information => default)
BEFORE(statement => $insertQueryId)
END(statement => $insertQueryId);
For more information on the CHANGES, BEFORE, END clauses see the Snowflake documentation here.

What happens with duplicates when inserting multiple rows?

I am running a Python script that inserts a large amount of data into a Postgres database. I use a single query to perform multiple row inserts:
INSERT INTO table (col1,col2) VALUES ('v1','v2'),('v3','v4') ... etc
I was wondering what would happen if it hits a duplicate key for the insert. Will it stop the entire query and throw an exception? Or will it merely ignore the insert of that specific row and move on?
The INSERT will just insert all rows and nothing special will happen, unless you have some kind of constraint disallowing duplicate / overlapping values (PRIMARY KEY, UNIQUE, CHECK or EXCLUDE constraint) - which you did not mention in your question. But that's what you are probably worried about.
Assuming a UNIQUE or PK constraint on (col1,col2), you are dealing with a textbook UPSERT situation. Many related questions and answers to find here.
Generally, if any constraint is violated, an exception is raised which will roll back not only the statement but the whole transaction, unless it is trapped in a subtransaction, which is possible in a procedural server-side language like plpgsql.
Without concurrent writes
I.e.: No other transactions will try to write to the same table at the same time.
Exclude rows that are already in the table with WHERE NOT EXISTS ... or any other applicable technique:
Select rows which are not present in other table
And don't forget to remove duplicates within the inserted set as well, which would not be excluded by the semi-anti-join WHERE NOT EXISTS ...
One technique to deal with both at once would be EXCEPT:
INSERT INTO tbl (col1, col2)
VALUES
(text 'v1', text 'v2') -- explicit type cast may be needed in 1st row
, ('v3', 'v4')
, ('v3', 'v4') -- beware of dupes in source
EXCEPT SELECT col1, col2 FROM tbl;
EXCEPT without the key word ALL folds duplicate rows in the source. If you know there are no dupes, or you don't want to fold duplicates silently, use EXCEPT ALL (or one of the other techniques). See:
Using EXCEPT clause in PostgreSQL
Generally, if the target table is big, WHERE NOT EXISTS in combination with DISTINCT on the source will probably be faster:
INSERT INTO tbl (col1, col2)
SELECT *
FROM (
SELECT DISTINCT *
FROM (
VALUES
(text 'v1', text 'v2')
, ('v3', 'v4')
, ('v3', 'v4') -- dupes in source
) t(c1, c2)
) t
WHERE NOT EXISTS (
SELECT FROM tbl
WHERE col1 = t.c1 AND col2 = t.c2
);
If there can be many dupes, it pays to fold them in the source first. Else use one subquery less.
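Since the question mentions a Python script, here is a minimal sketch of the WHERE NOT EXISTS plus DISTINCT technique driven from Python. It runs against the standard sqlite3 module as a local stand-in rather than Postgres, and the staging table named incoming is an assumption made for illustration; with a Postgres driver the same two SQL statements should apply, but that is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 TEXT, col2 TEXT, PRIMARY KEY (col1, col2))")
conn.execute("INSERT INTO tbl VALUES ('v1', 'v2')")  # row that already exists

# Source rows: one is already in tbl, one appears twice within the batch.
rows = [("v1", "v2"), ("v3", "v4"), ("v3", "v4")]

# Load the batch into a staging table (hypothetical name) first.
conn.execute("CREATE TEMP TABLE incoming (c1 TEXT, c2 TEXT)")
conn.executemany("INSERT INTO incoming VALUES (?, ?)", rows)

# DISTINCT folds dupes in the source; NOT EXISTS skips rows already in the target.
conn.execute(
    """
    INSERT INTO tbl (col1, col2)
    SELECT DISTINCT c1, c2
    FROM incoming AS t
    WHERE NOT EXISTS (
        SELECT 1 FROM tbl WHERE col1 = t.c1 AND col2 = t.c2
    )
    """
)
conn.commit()

result = conn.execute("SELECT col1, col2 FROM tbl ORDER BY col1").fetchall()
```

As noted above, this is only safe when no other transaction writes to the table at the same time.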
Related:
Select rows which are not present in other table
With concurrent writes
Use the Postgres UPSERT implementation INSERT ... ON CONFLICT ... in Postgres 9.5 or later:
INSERT INTO tbl (col1,col2)
SELECT DISTINCT * -- still can't insert the same row more than once
FROM (
VALUES
(text 'v1', text 'v2')
, ('v3','v4')
, ('v3','v4') -- you still need to fold dupes in source!
) t(c1, c2)
ON CONFLICT DO NOTHING; -- ignores rows with *any* conflict!
Further reading:
How to use RETURNING with ON CONFLICT in PostgreSQL?
How do I insert a row which contains a foreign key?
Documentation:
The manual
The commit page
The Postgres Wiki page
Craig's reference answer for UPSERT problems:
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
Will it stop the entire query and throw an exception? Yes.
To avoid that, see the SO question here, which describes how to keep Postgres from throwing an error for multi-row inserts when some of the inserted keys already exist in the DB.
You should basically do this:
INSERT INTO DBtable
(id, field1)
SELECT 1, 'value'
WHERE
NOT EXISTS (
SELECT id FROM DBtable WHERE id = 1
);
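The behaviour discussed in this thread can be tried out locally with a minimal Python sketch. It uses the standard sqlite3 module as a stand-in, not Postgres: the failed multi-row statement leaves none of its rows behind in both systems, but the transaction handling differs, and the ON CONFLICT DO NOTHING form assumes a bundled SQLite of version 3.24 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 TEXT, col2 TEXT, PRIMARY KEY (col1, col2))")
conn.execute("INSERT INTO tbl VALUES ('v1', 'v2')")
conn.commit()

# A plain multi-row INSERT: the duplicate key makes the whole statement fail.
try:
    conn.execute("INSERT INTO tbl (col1, col2) VALUES ('v3', 'v4'), ('v1', 'v2')")
    failed = False
except sqlite3.IntegrityError:
    failed = True
    conn.rollback()
rows_after_failure = conn.execute(
    "SELECT col1, col2 FROM tbl ORDER BY col1"
).fetchall()

# The same rows with ON CONFLICT DO NOTHING: the duplicate is skipped instead.
conn.execute(
    "INSERT INTO tbl (col1, col2) VALUES ('v3', 'v4'), ('v1', 'v2') "
    "ON CONFLICT DO NOTHING"
)
conn.commit()
rows_after_skip = conn.execute(
    "SELECT col1, col2 FROM tbl ORDER BY col1"
).fetchall()
```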

I want to Rebuild data table from a sorted table

Okay first a little bit of background, I've inherited maintaining a Database on MSSQL 2000.
In the Database there's a massive collection of interconnected tables, through Foreign keys.
What I'm attempting to do is to rebuild each table in a sorted fashion that will eliminate gaps in the IDENT column of the table.
On one table in particular I have the following columns:
RL_ID, RL_FK_RaidID, RL_FK_MemberID, RL_FK_ItemID, RL_ItemValue, RL_Notes, RL_IsUber, RL_IsWishItem, RL_LootModifier, RL_WishItemValue, RL_WeightedLootValue
It uses RL_ID as the IDENT column which currently reports 32620 by using DBCC CHECKIDENT (Table)
There are, however, only 12128 rows of information in this table.
So I tried a simple script to copy all the information in a sorted fashion into a new table:
INSERT INTO Table_1
SELECT RL_ID, RL_FK_RaidID, RL_FK_MemberID, RL_FK_ItemID, RL_ItemValue, RL_Notes, RL_IsUber, RL_IsWishItem, RL_LootModifier, RL_WishItemValue, RL_WeightedLootValue
FROM RaidLoot
ORDER BY RL_ID
Then Delete all the rows from the source table with:
TRUNCATE TABLE RaidLoot
Verify the IDENT is 1 with:
DBCC CHECKIDENT (RaidLoot)
Now copy the Data back into the Original table from Row 1 to the end:
SET IDENTITY_INSERT RaidLoot ON
INSERT INTO RaidLoot (RL_ID, RL_FK_RaidID, RL_FK_MemberID, RL_FK_ItemID, RL_ItemValue, RL_Notes, RL_IsUber, RL_IsWishItem, RL_LootModifier, RL_WishItemValue, RL_WeightedLootValue)
SELECT RL_ID, RL_FK_RaidID, RL_FK_MemberID, RL_FK_ItemID, RL_ItemValue, RL_Notes, RL_IsUber, RL_IsWishItem, RL_LootModifier, RL_WishItemValue, RL_WeightedLootValue
FROM Table_1
ORDER BY RL_ID
SET IDENTITY_INSERT RaidLoot OFF
Now verify that I only have the 12128 rows of data:
DBCC CHECKIDENT (RaidLoot)
(Note: I end up with 32620 again since it never did renumber the RL_ID, it just put them back into the same spots leaving the gaps). So where / how can I get it to Renumber the RL_ID column starting from 1 so that when it writes the data back to the original table I don't have the gaps?
The only other solution I can see is the painful process of manually changing each row's RL_ID in Table_1 before I write it back to the original table. While this isn't impossible, I have another table that has approx 306,000 rows of data but whose IDENT reports 450,123, so I'm hoping there is an easier way to automate the renumbering process.
If you really have to do this (seems like a great waste of time to me), you will have to adjust all of the foreign key references as well.
Consider the strategy of adding a NewID column for each table and populate the new column sequentially. Then you can use this NewID column in the queries needed to adjust the foreign keys. Very messy nonetheless unless you can come up with a consistent pattern to do so.
Since you can query the metadata to determine foreign keys, etc. this is certainly possible, and definitely should be considered seriously if you really do have lots of tables.
ADDED
There is a simple way to populate the NewID column
declare @id int
set @id = 0
update MyTable set NewID=@id, @id=@id+1
It is not obvious that this works, but it does.
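The NewID strategy, including the foreign key adjustment, can be sketched end to end in Python. This is only an illustration run against the standard sqlite3 module, not SQL Server, and the parent and child tables and the new_id and new_parent_id columns are hypothetical names standing in for the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT, new_id INTEGER);
    CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER,
                        new_parent_id INTEGER);
    INSERT INTO parent (id, name) VALUES (3, 'a'), (7, 'b'), (20, 'c');
    INSERT INTO child (id, parent_id) VALUES (1, 7), (2, 20), (3, 7);
    """
)

# Step 1: number the parent rows 1..n in the order of their old ids.
old_ids = [row[0] for row in conn.execute("SELECT id FROM parent ORDER BY id")]
conn.executemany(
    "UPDATE parent SET new_id = ? WHERE id = ?",
    [(new, old) for new, old in enumerate(old_ids, start=1)],
)

# Step 2: translate each foreign key through the old id -> new id mapping.
conn.execute(
    """
    UPDATE child
    SET new_parent_id = (SELECT new_id FROM parent WHERE parent.id = child.parent_id)
    """
)
conn.commit()

parent_map = conn.execute("SELECT id, new_id FROM parent ORDER BY id").fetchall()
child_map = conn.execute("SELECT id, new_parent_id FROM child ORDER BY id").fetchall()
```

Once every referencing table carries its translated key, the old columns can be swapped for the new ones; that final swap is schema-specific and is left out here.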
I don't think it has to do with RL_ID being referenced by other tables in the schema - if I set up a single table test, the identity will always show up as the max number in the identity field:
CREATE TABLE #temp (id INT IDENTITY(1,1), other VARCHAR(1))
INSERT INTO #temp
( other )
VALUES ('a'),('b'),('c'),('d'),('e')
SELECT *
FROM #temp
SELECT *
INTO #holder
FROM #temp
WHERE other = 'C'
TRUNCATE TABLE #temp
SET IDENTITY_INSERT #temp ON
INSERT INTO #temp
( id, other )
SELECT id ,
other
FROM #holder
DBCC CHECKIDENT (#temp)
DROP TABLE #temp
DROP TABLE #holder
So your new identity is 32620 because that is the MAX(RL_ID)

Using User Defined Functions and performance?

I'm using a stored procedure to fetch data and I need to filter dynamically. For example, if I don't want to fetch rows whose id is 5, 10 or 12, I send the ids as a string to the procedure and convert it to a table via a user-defined function. But I must consider performance, so here is an example:
Solution 1:
SELECT *
FROM Customers
WHERE CustomerID NOT IN (SELECT Value
FROM dbo.func_ConvertListToTable('4,6,5,1,2,3,9,222',','));
Solution 2:
CREATE TABLE #tempTable (Value NVARCHAR(4000));
INSERT INTO #tempTable
SELECT Value FROM dbo.func_ConvertListToTable('4,6,5,1,2,3,9,222',',')
SELECT *
FROM BusinessAds
WHERE AdID NOT IN (SELECT Value FROM #tempTable)
DROP TABLE #tempTable
Which solution is better for performance?
You would probably be better off creating the #temp table with a clustered index and appropriate datatype
CREATE TABLE #tempTable (Value int primary key);
INSERT INTO #tempTable
SELECT DISTINCT Value
FROM dbo.func_ConvertListToTable('4,6,5,1,2,3,9,222',',')
You can also put a clustered index on the table returned by the TVF.
As for which is better: SQL Server will assume that the TVF returns 1 row, whereas a statement using the #temp table can be recompiled after the table is populated. You would need to consider whether that 1-row assumption might cause suboptimal query plans when the list is large.

Row number in Sybase tables

Sybase db tables do not have a concept of self-updating row numbers. However, for one of the modules, I require a row number for each row in the database such that max(Column) would always tell me the number of rows in the table.
I thought I'd introduce an int column and keep updating this column to keep track of the row number. However, I'm having problems updating this column when rows are deleted. What SQL should I use in a delete trigger to update this column?
You can easily assign a unique number to each row by using an identity column. The identity can be a numeric or an integer (in ASE12+).
This will almost do what you require. There are certain circumstances in which you will get a gap in the identity sequence. (These are called "identity gaps", the best discussion on them is here). Also deletes will cause gaps in the sequence as you've identified.
Why do you need to use max(col) to get the number of rows in the table, when you could just use count(*)? If you're trying to get the last row from the table, then you can do
select * from table where column = (select max(column) from table).
Regarding the delete trigger to update a manually managed column, I think this would be a potential source of deadlocks, and many performance issues. Imagine you have 1 million rows in your table, and you delete row 1, that's 999999 rows you now have to update to subtract 1 from the id.
Delete trigger
CREATE TRIGGER tigger ON myTable FOR DELETE
AS
update myTable
set id = id - (select count(*) from deleted d where d.id < t.id)
from myTable t
To avoid locking problems
You could add an extra table (which joins to your primary table) like this:
CREATE TABLE rowCounter
(id int, -- foreign key to main table
rownum int)
... and use the rownum field from this table.
If you put the delete trigger on this table then you would hugely reduce the potential for locking problems.
Approximate solution?
Does the table need to keep its rownumbers up to date all the time?
If not, you could have a job which runs every minute or so, which checks for gaps in the rownum, and does an update.
Question: do the rownumbers have to reflect the order in which rows were inserted?
If not, you could do far fewer updates by only updating the most recent rows, "moving" them into gaps.
Leave a comment if you would like me to post any SQL for these ideas.
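As a rough illustration of the "move the most recent rows into gaps" idea, here is a minimal Python sketch. It runs against the standard sqlite3 module rather than Sybase, and the mytable layout and the fill_gaps helper are assumptions made for illustration. It assumes the rownum values are unique.

```python
import sqlite3


def fill_gaps(conn):
    """Hypothetical helper: move the highest-numbered rows into gaps left by deletes."""
    existing = {row[0] for row in conn.execute("SELECT rownum FROM mytable")}
    count = len(existing)
    # Numbers in 1..count that no row currently holds.
    gaps = [n for n in range(1, count + 1) if n not in existing]
    # Rows numbered above count; there are exactly as many of these as gaps.
    movers = sorted((n for n in existing if n > count), reverse=True)
    conn.executemany(
        "UPDATE mytable SET rownum = ? WHERE rownum = ?", list(zip(gaps, movers))
    )
    conn.commit()
    return len(gaps)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (rownum INTEGER, data TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)", [(n, f"r{n}") for n in range(1, 7)]
)
conn.execute("DELETE FROM mytable WHERE rownum IN (2, 3)")

moved = fill_gaps(conn)
max_rownum, row_count = conn.execute(
    "SELECT MAX(rownum), COUNT(*) FROM mytable"
).fetchone()
```

Only as many rows are updated as there are gaps, at the cost that rownum no longer reflects insertion order.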
I'm not sure why you would want to do this. You could experiment with using temporary tables and "select into" with an Identity column like below.
create table test
(
col1 int,
col2 varchar(3)
)
insert into test values (100, "abc")
insert into test values (111, "def")
insert into test values (222, "ghi")
insert into test values (300, "jkl")
insert into test values (400, "mno")
select rank = identity(10), col1 into #t1 from test
select * from #t1
delete from test where col2="ghi"
select rank = identity(10), col1 into #t2 from test
select * from #t2
drop table test
drop table #t1
drop table #t2
This would give you a dynamic id (of sorts)