How to prevent adding identical records to SQL database - sql

I am writing a program that recovers structured data as individual records from a (damaged) file and collects the results into a sqlite database.
The program is invoked several times with slightly different recovery parameters. That leads to recovering often the same, but sometimes different data from the file.
Now, every time I run my program with different parameters, it's supposed to add just the newly (different) found items to the same database.
That means that I need a fast way to tell if each recovered record is already present in the DB or not, in order to add them only if they're not existing in the DB yet.
I understand that for each record I want to add, I could first do a SELECT for all columns to see if there is already a matching record in the DB, and only add the new one if no same is found.
But since I'm adding 10000s of records, doing a SELECT for each of these records feels pretty inefficient (slow) to me.
I wonder if there's a smarter way to handle this? I.e, is there a way I can tell sqlite that I do not want duplicate entries, and so it automatically detects and rejects them? I know about the UNIQUE modifier, but that's not it because it applies to single columns only, doesn't it? I'd need to be able to say that the combination of COL1+COL2+COL3 must be unique. Is there a way to do that?
Note: I never want to update any existing records. I only want to collect a set of different records.
Bonus part - performance
In a classic programming language, I'd use a key-value dictionary where the key is the sum of all a record's values. Similarly, I could calculate a Hash code for each added record and look that hash code up first. If there's no match, then the record is surely not in the DB yet; If there is a match I'd still have to search the DB for any duplicates. That'd surely be faster already, but I still wonder if sqlite can make this more efficient.

Try:
sqlite> create table foo (
...> a int,
...> b int,
...> unique(a, b)
...> );
sqlite>
sqlite> insert into foo values(1, 2);
sqlite> insert into foo values(2, 1);
sqlite> insert into foo values(1, 2);
Error: columns a, b are not unique
sqlite>

You could use UNIQUE column constraint or to declare a multiple columns unique constraint you can use UNIQUE () ON CONFLICT :
CREATE TABLE name ( id int , UNIQUE (col_name1 type , col_name2 type) ON CONFLICT IGNORE )
SQLite has two ways of expressing uniqueness constraints: PRIMARY KEY and UNIQUE. Both of them create an index and so the lookup happens through the created index.

If you do not want to use an SQL approach (as mentioned in other answers) you can do a select for all your data when the program starts, store the data in a dictionary and work with the dictionary do decide which records to insert to your DB.
The benefit of this approach is the single select is much faster than many small selects.
The disadvantage is that it won't work well if you don't have enough memory to store your data in.

Related

Informix select trigger to update column

Is it possible to increase the value of a number in a column with a trigger every time it gets selected? We have special tables where we store the new id and when we update it in the app, it tends to get conflicts before the update happens, even when it all takes less than a second. So I was wondering if it is not possible to set database to increase value after every select on that column? Do not ask me why we do not use autoincrement for ids because I do not know.
Informix provides the SERIAL and BIGSERIAL types (and also SERIAL8, but don't use that) which provide autoincrement support. It also provides SEQUENCES with more sophisticated autoincrements. You should aim to use one of those.
Trying to use a SELECT trigger to update the table being selected from is, at best, fraught with problems about transactions and the like (problems which both the types and sequences carefully avoid).
If your design team needs help making effective use of these, ask a new question outlining what you want to achieve.
Normally, the correct way to proceed is to make the ID column in each table that defines 'something' (the Orders table, the Customer table, …) into a SERIAL column and either not insert a value into the ID column or insert 0 into it. The generated value can be retrieved and used when creating auxilliary information — order items, etc.
Note that you could think about using:
CREATE TABLE xyz_sequence
(
xyz SERIAL NOT NULL PRIMARY KEY
);
and using:
INSERT INTO xyz_sequence VALUES(0);
and then retrieving the inserted value — in Informix ESQL/C, you'd use sqlca.sqlerrd[1], in other languages, other techniques. You can also delete the newly inserted record, or even all the records in the table. You can afford to ignore errors from the DELETE statement; sooner or later, the rows will be deleted. The next value inserted will continue where the prior ones left off.
In a stored procedure, you'd use DBINFO('sqlca.sqlerrd1') to get the inserted value. You'd use DBINFO('bigserial') to get the value if you use a BIGSERIAL type.
I found out possible answer in this question update with return value instead of doing it with select it seems better to return value directly from update as update use locks it should be more safer even when you use multithreading application. But these are just my assumptions. Hopefully it will help someone.

Inserting test data which references another table without hard coding the foreign key values

I'm trying to write a SQL query that will insert test data into two tables, one of which references the other.
Tables are created from something like the following:
CREATE TABLE address (
address_id INTEGER IDENTITY PRIMARY KEY,
...[irrelevant columns]
);
CREATE TABLE member (
...[irrelevant columns],
address INTEGER,
FOREIGN KEY(address) REFERENCES address(address_id)
);
I want ids in both tables to auto increment, so that I can easily insert new rows later without having to look into the table for ids.
I need to insert some test data into both tables, about 25 rows in each. Hardcoding ids for the insert causes issues with inserting new rows later, as the automatic values for the id columns try and start with 1 (which is already in the database). So I need to let the ids be automatically generated, but I also need to know which ids are in the database for inserting test data into the member database - I don't believe the autogenerated ones are guaranteed to be consecutive, so can't assume I can safely hardcode those.
This is test data - I don't care which record I link each member row I am inserting to, only that there is an address record in the address table with that id.
My thoughts for how to do this so far include:
Insert addresses individually, returning the id, then use that to insert an individual member (cons: potentially messy, not sure of the syntax, harder to see expected sets of addresses/members in the test data)
Do the member insert with a SELECT address_id FROM address WHERE [some condition that will only give one row] for the address column (cons: also a bit messy, involves a quite long statement for something I don't care about)
Is there a neater way around this problem?
I particularly wonder if there is a way to either:
Let the auto increment controlling functions be aware of manually inserted id values, or
Get the list of inserted ids from the address table into a variable which I can use values from in turn to insert members.
Ideally, I'd like this to work with as many (irritatingly slightly different) database engines as possible, but I need to support at least postgresql and sqlite - ideally in a single query, although I could have two separate ones. (I have separate ones for creating the tables, the sole difference being INTEGER GENEREATED BY DEFAULT AS IDENTITY instead of just IDENTITY.)
https://www.postgresql.org/docs/8.1/static/functions-sequence.html
Sounds like LASTVAL() is what you're looking for. It was also work in the real world to maintain transactional consistency between multiple selects, as it's scoped to your sessions last insert.

SQLite: Autoincrement and insert or ignore will produce unused Autoincrement Keys

I'm using a table Mail with auto-increment Id and Mail Address. The table is used in 4 other tables and it is mainly used to save storage (String is only saved once and not 4 times). I'm using INSERT OR IGNORE to just blindly add the mail addresses to the table and if it exists ignore the update. This approach is MUCH faster than checking the existence with SELECT ... and do an INSERT if needed.
For every INSERT OR IGNORE the auto-increment, no matter if ignored or done the auto-increment Id is incremented. I one run I have approx. 500k data sets to proceed. So after every run the the last auto-increment key is incremented by 500k. I know there are 2^63-1 possible keys, so a long time to use them all up.
I also tried INSERT OR REPLACE, but this will increment the Id of the dataset on every run of the command, so this is not a solution at all.
Is there a way to prevent this increase of auto-increment key on every INSERT OR IGNORE?
Table Mail Example (replaced with pseudo Addresses)
mIdMail mMail
"1" ""
"7" "mail1#example.com"
"15" "mail2#example.com"
"17" "mail3#example.com"
"19" "mail4#example.com"
"23" "mail5#example.com"
...
Insert Query (Using Java Lib: org.apache.commons.dbutils)
INSERT OR IGNORE
INTO MAIL
( mMail )
VALUES ( ? );
Table Definition
CREATE TABLE IF NOT EXISTS MAIL (
mIdMail INTEGER PRIMARY KEY AUTOINCREMENT,
mMail CHAR(90) UNIQUE
);
To get autoincrementing values without gaps, drop the AUTOINCREMENT keyword. (Yes, you get autoincrementing values even without it.)
Auto-increment keys behave the way they do specifically because the database guarantees their behavior -- regardless of concurrent transactions and transaction failures.
Auto-increment keys have two guarantees:
They are increasing, so later inserts have larger values than earlier ones.
They are guaranteed to be unique.
The mechanism for allocating the keys does not guarantee no gaps. Why not? Because no-gaps would incur a lot more overhead on the database. Basically, each transaction on the table would need to be completely serialized (that is completed and committed) before the next one can take place. Generally, that is a really bad idea from a performance perspective.
Unfortunately, SQLite doesn't have the simplest solution, which is simply to call row_number() on the auto-incremented keys. You could try to implement a gapless auto-increment using triggers, significantly slowing down your application.
My real suggestion is simply to live with the gaps. Accept them. Surrender. That is how the built-in method works, and for good reason. Now design the rest of the database/application keeping this in mind.
I had the same issue, and changing "INSERT OR IGNORE" into "INSERT OR FAIL" solved the problem, so now when it fails the id value doesn't increment.

rebuild/refresh my table's PK list - gap in numbers

I have finished all my changes to a database table in sql server management studio 2012, but now I have a large gap between some values due to editing. Is there a way to keep my data, but re-assign all the ID's from 1 up to my last value?
I would like this cleaned up as I populate dropdownlists with these values and then I make interactions with my database with the assumption that my dropdownlist index and the table's ID match up, which is not the case right now.
My current DB has a large gap from 7 to 28, I would like to shift everything from 28 and up, back down to 8, 9, 10, 11, ect... so that my database has NO gaps from 1 and onward.
If the solution is tricky please give me some steps as I am new to SQL.
Thank you!
Yes, there are any number of ways to "close the gaps" in an auto generated sequence. You say you're new to SQL so I'll assume you're also new to relational concepts. Here is my advice to you: don't do it.
The ID field is a surrogate key. There are several aspects of surrogates one must be mindful of when using them, but the one I want to impress upon you is,
-- A surrogate key is used to make the row unique. Other than the guarantee that
-- the value is unique, no other assumptions may be made concerning the value.
-- In particular, no meaning may be derived from the value as to the contents of
-- the row or the row's relationship to any other row.
You have designed your app with a built-in assumption of the value of the key field (that they will be consecutive). Already it is causing you problems. Do you really want to go through this every time you make changes to the table? And suppose a future feature requires you to filter out some of the choices according to an option the user has selected? Or enable the user to specify the order of the items? Not going to be easy. So what is the solution?
You can create an additional (non-visible) field in the dropdown list that contains the key value. When the user makes a selection, use that index to get the key value of the selection and then go out to the database and get whatever additional data you need. This will work if you populate the list from the entire table or just select a few according to some as yet unknown filtering criteria or change the order in any way.
Viola. You never have this problem again, no matter how often you add and remove rows in the table.
However, on the off chance that you are as stubborn as me (not likely!) or just refuse to listen to the melodious voice of reason and experience, then try this:
Create a new table exactly like the old table, including auto incrementing PK.
Populate the new table using a Select from the old table. You can specify any order you want.
Drop the old table.
Rename the new table to the old table name.
You will have to drop and redefine any FKs from other tables. But this entire process
can be placed in a script because if you do this once, you'll probably do it again.
Now all the values are consecutive. Until you edit the table again...
You should refactor the code for your dropdown list and not the PK of the table.
If you do not agree, you can do one of the following:
Insert another column holding the dropdown's "order of appearance", make a unique index on it and fill this by hand (or programmatically).
Replace the SERIAL with an INT would work, make a unique index on the column and fill this by hand (or programmatically).
Remove the large ids and reseed your serial - the code depending on your DBMS
This happens to me all the time. If you don't have any foreign key constraints then it should be an easy fix.
Remember a DELETE statement will remove the record but keep the identity seed the same. (If I remove id # 5 and #5 was the last record inserted then SQL server still stores the identity seed value at "6").
TRUNCATING the table will reset the identity seed back to it's original value.
INSERT_IDENTITY [TABLE] ON can also be used to insert the correct data in the correct order if tuncating cannot happen.
SELECT *
INTO #tempTable
FROM [TableTryingToFix]
TRUNCATE TABLE [TableTryingToFix];
INSERT INTO [TableTryingToFix] (COL1, COL2, COL3, ETC)
SELECT COL1, COL2, COL2, ETC
FROM #tempTable
ORDER BY oldTableID

How can I get the Primary Key id of a file I just INSERTED?

Earlier today I asked this question which arose from A- My poor planning and B- My complete disregard for the practice of normalizing databases. I spent the last 8 hours reading about normalizing databases and the finer points of JOIN and worked my way through the SQLZoo.com tutorials.
I am enlightened. I understand the purpose of database normalization and how it can suit me. Except that I'm not entirely sure how to execute that vision from a procedural standpoint.
Here's my old vision: 1 table called "files" that held, let's say, a file id and a file url and appropos grade levels for that file.
New vision!: 1 table for "files", 1 table for "grades", and a junction table to mediate.
But that's not my problem. This is a really basic Q that I'm sure has an obvious answer- When I create a record in "files", it gets assigned the incremented primary key automatically (file_id). However, from now on I'm going to need to write that file_id to the other tables as well. Because I don't assign that id manually, how do I know what it is?
If I upload text.doc and it gets file_id 123, how do I know it got 123 in order to write it to "grades" and the junction table? I can't do a max(file_id) because if you have concurrent users, you might nab a different id. I just don't know how to get the file_id value without having manually assigned it.
You may want to use LAST_INSERT_ID() as in the following example:
START TRANSACTION;
INSERT INTO files (file_id, url) VALUES (NULL, 'text.doc');
INSERT INTO grades (file_id, grade) VALUES (LAST_INSERT_ID(), 'some-grade');
COMMIT;
The transaction ensures that the operation remains atomic: This guarantees that either both inserts complete successfully or none at all. This is optional, but it is recommended in order to maintain the integrity of the data.
For LAST_INSERT_ID(), the most
recently generated ID is maintained in
the server on a per-connection basis.
It is not changed by another client.
It is not even changed if you update
another AUTO_INCREMENT column with a
nonmagic value (that is, a value that
is not NULL and not 0).
Using
LAST_INSERT_ID() and AUTO_INCREMENT
columns simultaneously from multiple
clients is perfectly valid. Each
client will receive the last inserted
ID for the last statement that client
executed.
Source and further reading:
MySQL Reference: How to Get the Unique ID for the Last Inserted Row
MySQL Reference: START TRANSACTION, COMMIT, and ROLLBACK Syntax
In PHP to get the automatically generated ID of a MySQL record, use mysqli->insert_id property of your mysqli object.
How are you going to find the entry tomorrow, after your program has forgotten the value of last_insert_id()?
Using a surrogate key is fine, but your table still represents an entity, and you should be able to answer the question: what measurable properties define this particular entity? The set of these properties are the natural key of your table, and even if you use surrogate keys, such a natural key should always exist and you should use it to retrieve information from the table. Use the surrogate key to enforce referential integrity, for indexing purpuses and to make joins easier on the eye. But don't let them escape from the database