FMDatabase: is it safe to leave failing inserts (unique PK key violated)? - objective-c

I'm currently developing an app that needs to synchronize its data with a server. The app uses SQLite heavily, so we moved to raw sqlite and decided to use FMDatabase. Our tables are denormalized and without keys because we do not manage data state; we just gather the data and send it. We update the tables at some interval (fetching new content from the server and inserting it), but sending data to the server is the more frequent case.
Data from the server comes as full records with id values (char-type GUID values), so we made the tables' primary keys consist of those id columns. The thing is, when getting data from the server I only need to insert the new records, and currently I'm inserting everything with those id values. Some records get inserted (the new ones) and some fail due to the unique PK constraint on the id columns. For me that's perfectly fine, apart from the annoying messages in the console (unique key violated), and I'm not sure whether it affects performance. I could loop through the records before inserting, find the existing ones, and insert only the genuinely new records, but that feels like a waste.
Also, I could add an explicit internal int id (faster for the PK index), but our tables are not related by foreign keys, so it would be redundant. Any thoughts on that?

The INSERT statement comes with an optional ON CONFLICT clause, so if you want to suppress the warnings, simply use INSERT OR IGNORE and rows with duplicate PKs will be skipped silently.
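A minimal sketch of what that looks like, assuming a hypothetical records table with a char-type GUID primary key like the one described in the question (the table and column names are illustrative):
-- Table mirroring the setup in the question: a GUID string used as the primary key.
CREATE TABLE IF NOT EXISTS records (
    id      TEXT PRIMARY KEY,   -- GUID coming from the server
    payload TEXT
);
-- INSERT OR IGNORE skips any row whose id already exists, so re-inserting
-- the full batch received from the server is harmless and raises no error.
INSERT OR IGNORE INTO records (id, payload)
VALUES ('8D3C0F6A-1B2C-4D5E-9F00-000000000001', 'some content');
With FMDatabase you would run the same statement through executeUpdate:, and the duplicates no longer surface as errors in the console.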

Related

Inserting test data which references another table without hard coding the foreign key values

I'm trying to write a SQL query that will insert test data into two tables, one of which references the other.
Tables are created from something like the following:
CREATE TABLE address (
    address_id INTEGER IDENTITY PRIMARY KEY,
    ...[irrelevant columns]
);
CREATE TABLE member (
    ...[irrelevant columns],
    address INTEGER,
    FOREIGN KEY(address) REFERENCES address(address_id)
);
I want ids in both tables to auto increment, so that I can easily insert new rows later without having to look into the table for ids.
I need to insert some test data into both tables, about 25 rows in each. Hardcoding ids for the insert causes issues with inserting new rows later, as the automatic values for the id columns try to start at 1 (which is already in the database). So I need to let the ids be generated automatically, but I also need to know which ids are in the database in order to insert test data into the member table - I don't believe the autogenerated ones are guaranteed to be consecutive, so I can't assume I can safely hardcode those.
This is test data - I don't care which record I link each member row I am inserting to, only that there is an address record in the address table with that id.
My thoughts for how to do this so far include:
Insert addresses individually, returning the id, then use that to insert an individual member (cons: potentially messy, not sure of the syntax, harder to see expected sets of addresses/members in the test data)
Do the member insert with a SELECT address_id FROM address WHERE [some condition that will only give one row] for the address column (cons: also a bit messy, involves a quite long statement for something I don't care about)
Is there a neater way around this problem?
I particularly wonder if there is a way to either:
Let the auto increment controlling functions be aware of manually inserted id values, or
Get the list of inserted ids from the address table into a variable which I can use values from in turn to insert members.
Ideally, I'd like this to work with as many (irritatingly slightly different) database engines as possible, but I need to support at least postgresql and sqlite - ideally in a single query, although I could have two separate ones. (I have separate ones for creating the tables, the sole difference being INTEGER GENERATED BY DEFAULT AS IDENTITY instead of just IDENTITY.)
https://www.postgresql.org/docs/8.1/static/functions-sequence.html
Sounds like LASTVAL() is what you're looking for. It will also work in the real world to maintain transactional consistency across multiple inserts, as it's scoped to your session's last insert.
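A minimal sketch of the idea in PostgreSQL, using the address/member tables from the question (the street and name columns are illustrative; in SQLite the equivalent function is last_insert_rowid()):
-- Insert an address, then reuse the id the server just generated
-- for the member row that references it.
INSERT INTO address (street) VALUES ('1 Test Street');
INSERT INTO member (name, address) VALUES ('Test User', LASTVAL());
INSERT INTO address (street) VALUES ('2 Test Street');
INSERT INTO member (name, address) VALUES ('Other User', LASTVAL());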

How to prevent adding identical records to SQL database

I am writing a program that recovers structured data as individual records from a (damaged) file and collects the results into a sqlite database.
The program is invoked several times with slightly different recovery parameters. That leads to recovering often the same, but sometimes different data from the file.
Now, every time I run my program with different parameters, it's supposed to add just the newly found (different) items to the same database.
That means that I need a fast way to tell if each recovered record is already present in the DB or not, in order to add them only if they're not existing in the DB yet.
I understand that for each record I want to add, I could first do a SELECT over all columns to see if there is already a matching record in the DB, and only add the new one if no match is found.
But since I'm adding 10000s of records, doing a SELECT for each of these records feels pretty inefficient (slow) to me.
I wonder if there's a smarter way to handle this? I.e., is there a way I can tell sqlite that I do not want duplicate entries, so that it automatically detects and rejects them? I know about the UNIQUE modifier, but that applies to single columns only, doesn't it? I'd need to be able to say that the combination of COL1+COL2+COL3 must be unique. Is there a way to do that?
Note: I never want to update any existing records. I only want to collect a set of different records.
Bonus part - performance
In a classic programming language, I'd use a key-value dictionary where the key is derived from all of a record's values. Similarly, I could calculate a hash code for each added record and look that hash code up first. If there's no match, the record is surely not in the DB yet; if there is a match, I'd still have to search the DB for actual duplicates. That'd surely be faster already, but I still wonder if sqlite can make this more efficient.
Try:
sqlite> create table foo (
...> a int,
...> b int,
...> unique(a, b)
...> );
sqlite>
sqlite> insert into foo values(1, 2);
sqlite> insert into foo values(2, 1);
sqlite> insert into foo values(1, 2);
Error: columns a, b are not unique
sqlite>
You could use a UNIQUE column constraint, or, to declare a unique constraint over multiple columns, a table-level UNIQUE (...) ON CONFLICT clause:
CREATE TABLE name ( id INT, col_name1 TYPE, col_name2 TYPE, UNIQUE (col_name1, col_name2) ON CONFLICT IGNORE );
SQLite has two ways of expressing uniqueness constraints: PRIMARY KEY and UNIQUE. Both of them create an index, so lookups happen through that index.
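A quick way to see this for yourself is to ask SQLite for the query plan; a sketch with made-up table and column names:
CREATE TABLE parts (
    col1 TEXT,
    col2 TEXT,
    col3 TEXT,
    UNIQUE (col1, col2, col3)   -- creates an implicit index such as sqlite_autoindex_parts_1
);
-- EXPLAIN QUERY PLAN typically reports something like
-- "SEARCH parts USING INDEX sqlite_autoindex_parts_1 (col1=? AND col2=? AND col3=?)"
EXPLAIN QUERY PLAN
SELECT * FROM parts WHERE col1 = 'a' AND col2 = 'b' AND col3 = 'c';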
If you do not want to use an SQL approach (as mentioned in other answers), you can do a select for all your data when the program starts, store the data in a dictionary, and work with the dictionary to decide which records to insert into your DB.
The benefit of this approach is that a single select is much faster than many small selects.
The disadvantage is that it won't work well if you don't have enough memory to store your data in.

How to create a Primary Key on quasi-unique data keys?

I have a nightly SSIS process that exports a TON of data from an AS400 database system. Due to bugs in the AS400 DB software, occasional duplicate keys are inserted into data tables. Every time a new duplicate is added to an AS400 table, it kills my nightly export process. This issue has moved from being a nuisance to a problem.
What I need is an option to insert only unique data. If there are duplicates, select the first encountered row of the duplicate rows. Is there SQL syntax available that could help me do this? I know of the DISTINCT ROW clause, but that doesn't work in my case because the offending rows are not fully identical: only the fields which comprise the PK are duplicated, while the rest of the data differs.
In my case, it is more important for my primary keys to remain unique in my SQL Server DB cache, rather than having a full snapshot of the data. Is there something I can do to force this constraint on the export in SSIS/SQL Server without crashing the process?
EDIT
Let me further clarify my request. What I need is to assure that the data in my exported SQL Server tables maintains the same keys that are maintained in the AS400 data tables. In other words, creating a unique row-count identifier wouldn't work, nor would inserting all of the data without a primary key.
If a bug in the AS400 software allows for mistaken, duplicate PKs, I want to either ignore those rows or, preferably, just select one of the rows with the duplicate key but not both of them.
This selection should probably happen in the SELECT statement of my SSIS project, which connects to the mainframe through an ODBC connection.
I suspect that there may not be a "simple" solution to my problem. I'm hoping, however, that I'm wrong.
Since you are using SSIS, you must be using an OLE DB Source to fetch the data from the AS400 and an OLE DB Destination to insert the data into SQL Server.
Let's assume that you don't have any transformations yet.
Add a Sort transformation after the OLE DB Source. In the Sort transformation, there is a check box option at the bottom to remove duplicate rows based on a given set of column values. Check all the fields, but don't select the primary key that comes from the AS400. This will eliminate the duplicate rows but will still insert the data that you need.
I hope that is what you are looking for.
In SQL Server 2005 and above:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY almost_unique_field ORDER BY id) AS rn
    FROM import_table
) q
WHERE rn = 1
There are several options.
If you use the IGNORE_DUP_KEY option (http://www.sqlservernation.com/home/creating-indexes-with-ignore_dup_key.html) on your primary key, SQL Server will issue a warning and only the duplicate records will fail.
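A sketch of what that can look like when the table is created (the table and column names are placeholders, not from the question):
CREATE TABLE dbo.ImportTarget
(
    PKID   INT NOT NULL,
    Value1 VARCHAR(100) NULL,
    -- Duplicate keys in a multi-row INSERT are dropped with a
    -- "Duplicate key was ignored." warning instead of failing the whole batch.
    CONSTRAINT PK_ImportTarget PRIMARY KEY CLUSTERED (PKID)
        WITH (IGNORE_DUP_KEY = ON)
);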
You can also group/roll-up your data but this can get very expensive. What I mean by that is:
SELECT Id, MAX(value1), MAX(value2), MAX(value3) -- ...and so on for the remaining columns
FROM staging_table
GROUP BY Id
Another option is to add an identity column (and cluster on this for an efficient join later) to your staging table and then create a mapping in a temp table. The mapping table would be:
CREATE TABLE #mapping
(
    RowID INT PRIMARY KEY CLUSTERED,
    PKID  INT
)
-- Keep only the first staging row (lowest identity value) for each PKID.
INSERT INTO #mapping (RowID, PKID)
SELECT MIN(RowID), PKID
FROM staging_table
GROUP BY PKID
INSERT INTO presentation_table
SELECT S.*
FROM staging_table S
INNER JOIN #mapping M
    ON S.RowID = M.RowID
If I understand you correctly, you have duplicated PKs that have different data in the other fields.
First, put the data from the other database into a staging table. I find it easier to research issues with imports (especially large ones) if I do this. Actually I use two staging tables (and for this case I strongly recommend it), one with the raw data and one with only the data I intend to import into my system.
Now you can use an Execute SQL task to grab one of the records for each key (see @Quassnoi's answer for an idea of how to do that; you may need to adjust his query for your situation). Personally, I put an identity column into my staging table so I can identify which is the first or last occurrence of duplicated data. Then put the record you chose for each key into your second staging table. If you are using an exception table, copy the records you are not moving to it, and don't forget a reason code for the exception ("Duplicated key", for instance).
Now that you have only one record per key in a staging table, your next task is to decide what to do about the other data that is not unique. If there are two different business addresses for the same customer, which do you choose? This is a matter of business-rule definition, not strictly speaking SSIS or SQL code. You must define the business rules for how you choose the data when it needs to be merged between two records (what you are doing is the equivalent of a de-duping process). If you are lucky, there is a date field or another way to determine which is the newest or oldest data, and that is the data they want you to use. In that case, once you have selected just one record, the initial transform is done.
More likely, though, you may need different rules for each field to choose the correct value. In that case you write SSIS transforms in a data flow or Execute SQL tasks to pick the correct data and update the staging table.
Once you have the exact record you want to import, run the data flow to move it to the correct production tables.
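For example, if a last-changed date is the deciding rule, the pick-one step can reuse the ROW_NUMBER() pattern shown in an earlier answer (the table and column names here are placeholders):
-- Per duplicated key, keep only the most recently changed staging row.
SELECT PKID, Value1, Value2, LastChanged
FROM (
    SELECT PKID, Value1, Value2, LastChanged,
           ROW_NUMBER() OVER (PARTITION BY PKID ORDER BY LastChanged DESC) AS rn
    FROM staging_table
) q
WHERE rn = 1;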

How can I get the Primary Key id of a file I just INSERTED?

Earlier today I asked this question which arose from A- My poor planning and B- My complete disregard for the practice of normalizing databases. I spent the last 8 hours reading about normalizing databases and the finer points of JOIN and worked my way through the SQLZoo.com tutorials.
I am enlightened. I understand the purpose of database normalization and how it can suit me. Except that I'm not entirely sure how to execute that vision from a procedural standpoint.
Here's my old vision: 1 table called "files" that held, let's say, a file id, a file url, and the appropriate grade levels for that file.
New vision!: 1 table for "files", 1 table for "grades", and a junction table to mediate.
But that's not my problem. This is a really basic Q that I'm sure has an obvious answer- When I create a record in "files", it gets assigned the incremented primary key automatically (file_id). However, from now on I'm going to need to write that file_id to the other tables as well. Because I don't assign that id manually, how do I know what it is?
If I upload text.doc and it gets file_id 123, how do I know it got 123 in order to write it to "grades" and the junction table? I can't do a max(file_id) because if you have concurrent users, you might nab a different id. I just don't know how to get the file_id value without having manually assigned it.
You may want to use LAST_INSERT_ID() as in the following example:
START TRANSACTION;
INSERT INTO files (file_id, url) VALUES (NULL, 'text.doc');
INSERT INTO grades (file_id, grade) VALUES (LAST_INSERT_ID(), 'some-grade');
COMMIT;
The transaction ensures that the operation remains atomic: either both inserts complete successfully or neither does. This is optional, but it is recommended in order to maintain the integrity of the data.
For LAST_INSERT_ID(), the most recently generated ID is maintained in the server on a per-connection basis. It is not changed by another client. It is not even changed if you update another AUTO_INCREMENT column with a nonmagic value (that is, a value that is not NULL and not 0).
Using LAST_INSERT_ID() and AUTO_INCREMENT columns simultaneously from multiple clients is perfectly valid. Each client will receive the last inserted ID for the last statement that client executed.
Source and further reading:
MySQL Reference: How to Get the Unique ID for the Last Inserted Row
MySQL Reference: START TRANSACTION, COMMIT, and ROLLBACK Syntax
In PHP, to get the automatically generated ID of a MySQL record, use the mysqli->insert_id property of your mysqli object.
How are you going to find the entry tomorrow, after your program has forgotten the value of last_insert_id()?
Using a surrogate key is fine, but your table still represents an entity, and you should be able to answer the question: what measurable properties define this particular entity? The set of these properties is the natural key of your table, and even if you use surrogate keys, such a natural key should always exist and you should use it to retrieve information from the table. Use the surrogate key to enforce referential integrity, for indexing purposes, and to make joins easier on the eye. But don't let surrogate keys escape from the database.
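A small sketch of that idea in MySQL, borrowing the files table from the question (the column sizes are illustrative):
CREATE TABLE files (
    file_id INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key: foreign keys, joins, indexing
    url     VARCHAR(255) NOT NULL,
    UNIQUE (url)                             -- natural key: how you find the row again later
);
-- Tomorrow, when last_insert_id() has long been forgotten,
-- the natural key is what locates the record.
SELECT file_id FROM files WHERE url = 'text.doc';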

Fixing DB Inconsistencies - ID Fields

I've inherited a (Microsoft?) SQL database that wasn't very pristine in its original state. There are still some very strange things in it that I'm trying to fix - one of them is inconsistent ID entries.
In the accounts table, each entry has a number called accountID, which is referenced in several other tables (notes, equipment, etc.). The problem is that the numbers, for some random reason, range from about -100000 to +2000000 when there are only about 7000 entries.
Is there any good way to re-number them while changing the corresponding numbers in the other tables? At my disposal I also have ColdFusion, so anything that works with SQL and/or ColdFusion I'll accept.
Surrogate keys are meant to be meaningless, so unless you actually have a database integrity issue (like no foreign key constraints being properly defined) or your identity is approaching the maximum for its datatype, I would leave them alone and go after some other low-hanging fruit that would have more impact.
In this instance, it sounds like "why" is a better question than "how". The OP notes that there is a strange problem that needs to be fixed but doesn't say why it is a problem. Is it causing problems? What positive impact would changing these numbers have? Unless you originally programmed the system and understand precisely why the numbers are in their current state, you are taking quite a risk making changes like this.
I would talk to an accountant (or at least your financial people) before messing in any way with the numbers in the accounts tables if this is a financial app. The table of accounts is very critical to how finances are reported. These IDs may have meaning you don't understand. No one puts in a negative id unless they have a reason. I would under no circumstances change that unless I understood why it was negative to begin with. You could truly screw up your tax reporting or some other process by making an unneeded change.
You could probably disable the foreign key relationships (if you're able to take it offline temporarily) and then update the primary keys using a script. I've used this update script before to change values, and you could pretty easily wrap this code in a cursor to go through the key values in question, one by one, and update the arbitrary value to an incrementing value you're keeping track of.
Check out the script here: http://vyaskn.tripod.com/sql_server_search_and_replace.htm
If you just have a list of tables that use the primary key, you could set up a series of UPDATE statements that run inside your cursor, and then you wouldn't need to use this script (which can be a little slow).
It's worth asking, though, why these values appear out of wack. Does this database have values added and deleted constantly? Are the primary key values really arbitrary, or do they just appear to be, but they really have meaning? Though I'm all for consolidating, you'd have to ensure that there's no purpose to those values.
With ColdFusion this shouldn't be a herculean task, but it will be messy and you'll have to be careful. One method you could use would be to script the database and then generate a brand new, blank table schema. Set the accountID as an identity field in the new database.
Then, using ColdFusion, write a query that will pull all of the old account data and insert it into the new database one row at a time. For each row, let the new database assign a new ID. After each insert, pull the new ID (using either @@IDENTITY or MAX(accountID)) and store the new ID and the old ID together in a temporary table so you know which old IDs belong to which new IDs.
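A rough sketch of that per-row step in T-SQL (the mapping table, the name column, and the literal old ID are illustrative; ColdFusion would supply the real values on the same connection):
-- Illustrative mapping table pairing each old accountID with its new identity value.
CREATE TABLE old_new_map
(
    old_accountID INT NOT NULL,
    new_accountID INT NOT NULL
);
-- Per-row step: insert one account into the new table,
-- then immediately capture the identity the server assigned.
INSERT INTO accounts (name)
VALUES ('Example account');
INSERT INTO old_new_map (old_accountID, new_accountID)
VALUES (-99042, @@IDENTITY);   -- -99042 stands in for the old ID passed from ColdFusion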
Next, repeat the process with each of the child tables. For each old ID, pull its child entries and re-insert them into the new database using the new IDs. If the primary keys on the child tables are fine, you can insert them as-is or let the server assign new ones if they don't matter.
Assigning new IDs in place by disabling relationships temporarily may work, but you might run into trouble if one of the entries is assigned an ID that is already being used by the old data.
Create a new column in the accounts table for your new ID, and new column in each of your related tables to reference the new ID column.
ALTER TABLE accounts
ADD new_accountID int IDENTITY
ALTER TABLE notes
ADD new_accountID int
ALTER TABLE equipment
ADD new_accountID int
Then you can map the new_accountID column on each of your referencing tables to the accounts table.
UPDATE notes
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN notes ON (notes.accountID = accounts.accountID)
UPDATE equipment
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN equipment ON (equipment.accountID = accounts.accountID)
At this point, each table has both accountID with the old keys, and new_accountID with the new keys. From here it should be pretty straightforward.
Break all of the foreign keys on accountID.
On each table, UPDATE [table] SET accountID = new_accountID.
Re-add the foreign keys for accountID.
Drop new_accountID from all of the tables, as it's no longer needed.
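A sketch of those last four steps for the notes table (the constraint name is a placeholder; repeat for each referencing table, and note that the in-place UPDATE on accounts only works if accountID is not itself an identity column):
-- 1. Break the foreign key on accountID.
ALTER TABLE notes DROP CONSTRAINT FK_notes_accounts;
-- 2. Swap in the new key values on each table.
UPDATE accounts SET accountID = new_accountID;
UPDATE notes    SET accountID = new_accountID;
-- 3. Re-add the foreign key for accountID.
ALTER TABLE notes
    ADD CONSTRAINT FK_notes_accounts
    FOREIGN KEY (accountID) REFERENCES accounts (accountID);
-- 4. Drop new_accountID now that it's no longer needed.
ALTER TABLE accounts DROP COLUMN new_accountID;
ALTER TABLE notes    DROP COLUMN new_accountID;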