How to create a Primary Key on quasi-unique data keys? - sql

I have a nightly SSIS process that exports a TON of data from an AS400 database system. Due to bugs in the AS400 DB software, ocassional duplicate keys are inserted into data tables. Every time a new duplicate is added to an AS400 table, it kills my nightly export process. This issue has moved from being a nuisance to a problem.
What I need is to have an option to insert only unique data. If there are duplicates, select the first encountered row of the duplicate rows. Is there SQL Syntax available that could help me do this? I know of the DISTINCT ROW clause but that doesn't work in my case because for most of the offending records, the entirety of the data is non-unique except for the fields which comprise the PK.
In my case, it is more important for my primary keys to remain unique in my SQL Server DB cache, rather than having a full snapshot of data. Is there something I can do to force this constraint on the export in SSIS/SQL Server with out crashing the process?
EDIT
Let me further clarify my request. What I need is to assure that the data in my exported SQL Server tables maintains the same keys that are maintained the AS400 data tables. In other words, creating a unique Row Count identifier wouldn't work, nor would inserting all of the data without a primary key.
If a bug in the AS400 software allows for mistaken, duplicate PKs, I want to either ignore those rows or, preferably, just select one of the rows with the duplicate key but not both of them.
This SELECT statement should probably happen from the SELECT statement in my SSIS project which connects to the mainframe through an ODBC connection.
I suspect that there may not be a "simple" solution to my problem. I'm hoping, however, that I'm wrong.

Since you are using SSIS, you must be using OLE DB Source to fetch the data from AS400 and you will be using OLE DB Destination to insert data into SQL Server.
Let's assume that you don't have any transformations
Add a Sort transformation after the OLE DB Source. In the Sort Transformation, there is a check box option at the bottom to remove duplicate rows based on a give set of column values. Check all the fields but don't select the Primary Key that comes from AS400. This will eliminate the duplicate rows but will insert the data that you still need.
I hope that is what you are looking for.

In SQL Server 2005 and above:
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY almost_unique_field ORDER BY id) rn
FROM import_table
) q
WHERE rn = 1

There are several options.
If you use IGNORE_DUP_KEY (http://www.sqlservernation.com/home/creating-indexes-with-ignore_dup_key.html) option on your primary key, SQL will issue a warning and only the duplicate records will fail.
You can also group/roll-up your data but this can get very expensive. What I mean by that is:
SELECT Id, MAX(value1), MAX(value2), MAX(value3) etc
Another option is to add an identity column (and cluster on this for an efficient join later) to your staging table and then create a mapping in a temp table. The mapping table would be:
CREATE TABLE #mapping
(
RowID INT PRIMARY KEY CLUSTERED,
PKIN INT
)
INSERT INTO #mapping
SELECT PKID, MIN(rowid) FROM staging_table
GROUP BY PKID
INSERT INTO presentation_table
SELECT S.*
FROM Staging_table S
INNER JOIN #mapping M
ON S.RowID = M.RowID

If I understand you correctly, you have duplicated PKs that have different data in the other fields.
First, put the data from the other database into a staging table. I find it easier to research issues with imports (especially large ones) if I do this. Actually I use two staging tables (and for this case I strongly recommend it), one with the raw data and one with only the data I intend to import into my system.
Now you can use and Execute SQL task to grab the one of the records for each key (see #Quassnoi for an idea of how to do that you may need to adjust his query for your situation). Personally I put an identity into my staging table, so I can identify which is the first or last occurance of duplicated data. Then put the record you chose for each key into your second staging table. If you are using an exception table, copy the records you are not moving to it and don't forget a reason code for the exception ("Duplicated key" for instance).
Now that you have only one record per key in a staging table, your next task is to decide what to do about the other data that is not unique. If there are two different business addresses for the same customer, which do you chose? This is a matter of business rules definition not strictly speaking SSIS or SQL code. You must define the business rules for how you chose the data when the data needs to be merged between two records (what you are doing is the equivalent of a de-dupping process). If you are lucky there is a date field or other way to determine which is the newest or oldest data and that is the data they want you to use. In that case once you have selected just one record, you are done the intial transform.
More than likely though you may need different rules for each other field to choose the correct one. In this case you write SSIS transforms in a data flow or Exec SQl tasks to pick the correct data and update the staging table.
Once you have the exact record you want to import, then do the data flow to move to the correct production tables.

Related

What's the best way to create a unique repeatable identification column with rows that are nearly identicle?

I'm writing a stored procedure that links together data from several different relational tables based on the primary key for the main table. This information is being send to a flat database. The stored procedure is going to produce several nearly identical rows where only a single column may be different due to multiple entries in some of the tables that are linked to a single entry in the main table. I need to uniquely identify each row in the stored procedure output but I am unable to use the primary key from the main table since there will be multiple entries for each "key".
I considered taking the approach of using the primary key from the main table followed by each of the columns that may be different in duplicate rows. For example _
However, this approach results in a very long and messy key. I am unable to use a GUID because if any data changes in the relational database the stored procedure is rerun and must update old entries rather than create new ones.
If your purpose is only to have a unique key that is as short as possible and does not relate to anything else, consider just adding ROW_NUMBER() to your select.
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)), othercolumns

How to delete a duplicate record without using primary key

I went for an interview today where they give me technical test on sql. One of them was how to delete duplicate records without a primary key.
For one I can't imagine a table without a primary key. Yes I have read the existing threads on this. Say this happened and needed to be fixed Now. Couldn't I just add to the end of the table a automatically incrementing id then use that to delete the duplicate record?
Can anyone think of a reason why that won't work? I tried it on a simple database I created and I can't see any problems
You've got a couple of options here.
If they don't mind you dropping the table you could SELECT DISTINCT * from the table in question and then INSERT this into a new table, DROPping the old table as you go. This obviously won't be usable in a Production database but can be useful for where someone has mucked up a routine that's populating a data warehouse for example.
Alternatively you could effectively create a temporary index by using the row number as per this answer. That answer shows you how to use the built in row_number() function in SQL server but could be replicated in other RDBMS' (not sure which but MySQL certainly) by declaring a variable called #row_num or equivalent and then using it in your SELECT statement as:
SET #row_num=0;
SELECT #row_num:=#row_num+1 AS row_num, [REMAINING COLUMNS GO HERE]
One of possible options how to do this:
select distinct rows from your table(you can achieve this using group by all columns)
insert result into new table
drop first table
alter second table to name of first one
But this is not always possible in production

If exist update else insert records in SQL Server 2008 table

I have one staging table and want to insert data to Main table, so i want to check while inserting data from staging to Main table, if exists then update the records else insert as new records. Here the issue is both the staging as well as Main table does not have any key column based on which i can compare values.
Is it possible to do without having key columns i.e. primary key on both the tables? if yes, please, suggest me how.
Thanks in advance.
If there is no unique key or set of data within a row to define uniqueness, then no.
The set of data can be a combination of the data in each column, creating a sum of parts which will provide uniqueness; however without exposure to your data you would need to make that decision.
You write the WHERE-clause to include all the fields that make your record unique (ie. the fields that decide whether the record is new or should be updated.)
Take a look at this article (http://blogs.msdn.com/b/miah/archive/2008/02/17/sql-if-exists-update-else-insert.aspx) for hints on how to construct it.
If you are using SQL Server 2008r2, you could also use the MERGE statement - I haven't tried it on tables without keys, so I don't know whether it would work for you.

How to Merge Multiple Database files in SQLite?

I have multiple database files which exist in multiple locations with exactly similar structure. I understand the attach function can be used to connect multiple files to one database connection, however, this treats them as seperate databases. I want to do something like:
SELECT uid, name FROM ALL_DATABASES.Users;
Also,
SELECT uid, name FROM DB1.Users UNION SELECT uid, name FROM DB2.Users ;
is NOT a valid answer because I have an arbitrary number of database files that I need to merge. Lastly, the database files, must stay seperate. anyone know how to accomplish this?
EDIT: an answer gave me the idea: would it be possible to create a view which is a combination of all the different tables? Is it possible to query for all database files and which databases they 'mount' and then use that inside the view query to create the 'master table'?
Because SQLite imposes a limit on the number of databases that can be attached at one time, there is no way to do what you want in a single query.
If the number can be guaranteed to be within SQLite's limit (which violates the definition of "arbitrary"), there's nothing that prevents you from generating a query with the right set of UNIONs at the time you need to execute it.
To support truly arbitrary numbers of tables, your only real option is to create a table in an unrelated database and repeatedly INSERT rows from each candidate:
ATTACH DATABASE '/path/to/candidate/database' AS candidate;
INSERT INTO some_table (uid, name) SELECT uid, name FROM candidate.User;
DETACH DATABASE candidate;
Some cleverness in the schema would take care of this.
You will generally have 2 types of tables: reference tables, and dynamic tables.
Reference tables have the same content across all databases, for example country codes, department codes, etc.
Dynamic data is data that will be unique to each DB, for example time series, sales statistics,etc.
The reference data should be maintained in a master DB, and replicated to the dynamic databases after changes.
The dynamic tables should all have a column for DB_ID, which would be part of a compound primary key, for example your time series might use db_id,measurement_id,time_stamp. You could also use a hash on DB_ID to generate primary keys, use same pk generator for all tables in DB. When merging these from different DBS , the data will be unique.
So you will have 3 types of databases:
Reference master -> replicated to all others
individual dynamic -> replicated to full dynamic
full dynamic -> replicated from reference master and all individual dynamic.
Then, it is up to you how you will do this replication, pseudo-realtime or brute force, truncate and rebuild the full dynamic every day or as needed.

Re copy and re-order table

I am using SQLite. I want to create a new table ordered differently than an existing tables's data. I have tried the following but it does not work in either the Mozzilla SQLite admin tools or SQLite Manager. What am I doing wrong?
INSERT INTO temp (SnippetID, LibraryID,Name, BeforeSelection, AfterSelection, ReplaceSelection, NewDocument, NewDocumentLang, Sort)
SELECT (SnippetID, LibraryID,Name, BeforeSelection, AfterSelection, ReplaceSelection, NewDocument, NewDocumentLang, Sort)
FROM Snippets ORDER BY LibraryID;
Thanks - JZ
My question to you is a simple "Why?". SQL is a relational algebra and data sets only have order when you specify it on extraction.
It makes no sense to talk about the order of your data since your database is free to impose whatever order it likes. Any decent database will not care one bit the order in which records are inserted, only the keys and constraints and other properties that can be used to efficiently store and retrieve the data.
Only when you extract data with a select ... order by does the order have to be fixed. If you want to be able to efficiently extract data ordered by your LibraryID column, simply index it and use an order by clause when extracting the data.
By default the table is "ordered" on the primary key, or the first column in the table, which is often the primary key anyways. when you do
select * from mytable
it uses this default "order". Everything Pax says is true though and I +1 to it.