change ID number to smooth out duplicates in a table

change ID number to smooth out duplicates in a table - sql-server-2000

I have run into this problem that I'm trying to solve: Every day I import new records into a table that have an ID number.
Most of them are new (have never been seen in the system before) but some are coming in again. What I need to do is to append an alpha to the end of the ID number if the number is found in the archive, but only if the data in the row is different from the data in the archive, and this needs to be done sequentially, IE, if 12345 is seen a 2nd time with different data, I change it to 12345A, and if 12345 is seen again, and is again different, I need to change it to 12345B, etc.
Originally I tried using a where loop where it would put all the 'seen again' records in a temp table, and then assign A first time, then delete those, assign B to what's left, delete those, etc., till the temp table was empty, but that hasn't worked out.
Alternately, I've been thinking of trying subqueries as in:
update table
set IDNO= (select max idno from archive) plus 1
Any suggestions?

How about this as an idea? Mind you, this is basically pseudocode so adjust as you see fit.
With "src" as the table that all the data will ultimately be inserted into, and "TMP" as your temporary table.. and this is presuming that the ID column in TMP is a double.
do
update tmp set id = id + 0.01 where id in (select id from src);
until no_rows_changed;
alter table TMP change id into id varchar(255);
update TMP set id = concat(int(id), chr((id - int(id)) * 100 + 64);
insert into SRC select * from tmp;

What happens when you get to 12345Z?
Anyway, change the table structure slightly, here's the recipe:
Drop any indices on ID.
Split ID (apparently varchar) into ID_Num (long int) and ID_Alpha (varchar, not null). Make the default value for ID_Alpha an empty string ('').
So, 12345B (varchar) becomes 12345 (long int) and 'B' (varchar), etc.
Create a unique, ideally clustered, index on columns ID_Num and ID_Alpha.
Make this the primary key. Or, if you must, use an auto-incrementing integer as a pseudo primary key.
Now, when adding new data, finding duplicate ID number's is trivial and the last ID_Alpha can be obtained with a simple max() operation.
Resolving duplicate ID's should now be an easier task, using either a while loop or a cursor (if you must).
But, it should also be possible to avoid the "Row by agonizing row" (RBAR), and use a set-based approach. A few days of reading Jeff Moden articles, should give you ideas in that regard.

Here is my final solution:
update a
set IDnum=b.IDnum
from tempimiportable A inner join
(select * from archivetable
where IDnum in
(select max(IDnum) from archivetable
where IDnum in
(select IDnum from tempimporttable)
group by left(IDnum,7)
)
) b
on b.IDnum like a.IDnum + '%'
WHERE
*row from tempimport table = row from archive table*
to set incoming rows to the same IDnum as old rows, and then
update a
set patient_account_number = case
when len((select max(IDnum) from archive where left(IDnum,7) = left(a.IDnum,7)))= 7 then a.IDnum + 'A'
else left(a.IDnum,7) + char(ascii(right((select max(IDnum) from archive where left(IDnum,7) = left(a.IDnum,7)),1))+1)
end
from tempimporttable a
where not exists ( *select rows from archive table* )
I don't know if anyone wants to delve too far into this, but I appreciate contructive criticism...

Related

update rows one by one with max value + 1 in SQL

here is my situation,
I have 2 tables,
1st table has all records, and it has IDs
2nd table has new records and it doesnt have ID, yet.
I want to generate ID for 2nd table with max(id) + 1 from 1st table.
when i do this, it makes all rows same id number, but i want to make it unique increment number.
e.g
select max(id) from table1 then it gives '997040'
I want to make second table rows like;
id
997041
997042
997043
997044
i think i need to use cursor or whileloop, or both, but i could not create the actual query.
sorry about bad explanation, i am so confused now

Use ROWNUM to generate incrementing row numbers. E.g.:
SELECT someConstant + ROWNUM FROM source.

CREATE TABLE table_name
(
ID int IDENTITY(997041,1) PRIMARY KEY
)
I hope this sql query would work!!
Or refer http://www.w3schools.com/sql/sql_autoincrement.asp

postgresql: Fast way to update the latest inserted row

What is the best way to modify the latest added row without using a temporary table.
E.g. the table structure is
id | text | date
My current approach would be an insert with the postgresql specific command "returning id" so that I can update the table afterwards with
update myTable set date='2013-11-11' where id = lastRow
However I have the feeling that postgresql is not simply using the last row but is iterating through millions of entries until "id = lastRow" is found. How can i directly access the last added row?

update myTable date='2013-11-11' where id IN(
SELECT max(id) FROM myTable
)

Just to add to mvb13's answer (since I don't have enough points to comment directly yet) there is one word missing. Hopefully, this will save someone some time from working out the correct syntax LOL.
update myTable set date='2013-11-11' where id IN(
SELECT max(id) FROM myTable
);

SQL query select from table and group on other column

I'm phrasing the question title poorly as I'm not sure what to call what I'm trying to do but it really should be simple.
I've a link / join table with two ID columns. I want to run a check before saving new rows to the table.
The user can save attributes through a webpage but I need to check that the same combination doesn't exist before saving it. With one record it's easy as obviously you just check if that attributeId is already in the table, if it is don't allow them to save it again.
However, if the user chooses a combination of that attribute and another one then they should be allowed to save it.
Here's an image of what I mean:
So if a user now tried to save an attribute with ID of 1 it will stop them, but I need it to also stop them if they tried ID's of 1, 10 so long as both 1 and 10 had the same productAttributeId.
I'm confusing this in my explanation but I'm hoping the image will clarify what I need to do.
This should be simple so I presume I'm missing something.

If I understand the question properly, you want to prevent the combination of AttributeId and ProductAttributeId from being reused. If that's the case, simply make them a combined primary key, which is by nature UNIQUE.
If that's not feasible, create a stored procedure that runs a query against the join for instances of the AttributeId. If the query returns 0 instances, insert the row.
Here's some light code to present the idea (may need to be modified to work with your database):
SELECT COUNT(1) FROM MyJoinTable WHERE AttributeId = #RequestedID
IF ##ROWCOUNT = 0
BEGIN
INSERT INTO MyJoinTable ...
END

You can control your inserts via a stored procedure. My understanding is that
users can select a combination of Attributes, such as
just 1
1 and 10 together
1,4,5,10 (4 attributes)
These need to enter the table as a single "batch" against a (new?) productAttributeId
So if (1,10) was chosen, this needs to be blocked because 1-2 and 10-2 already exist.
What I suggest
The stored procedure should take the attributes as a single list, e.g. '1,2,3' (comma separated, no spaces, just integers)
You can then use a string splitting UDF or an inline XML trick (as shown below) to break it into rows of a derived table.
Test table
create table attrib (attributeid int, productattributeid int)
insert attrib select 1,1
insert attrib select 1,2
insert attrib select 10,2
Here I use a variable, but you can incorporate as a SP input param
declare #t nvarchar(max) set #t = '1,2,10'
select top(1)
t.productattributeid,
count(t.productattributeid) count_attrib,
count(*) over () count_input
from (select convert(xml,'<a>' + replace(#t,',','</a><a>') + '</a>') x) x
cross apply x.x.nodes('a') n(c)
cross apply (select n.c.value('.','int')) a(attributeid)
left join attrib t on t.attributeid = a.attributeid
group by t.productattributeid
order by countrows desc
Output
productattributeid count_attrib count_input
2 2 3
The 1st column gives you the productattributeid that has the most matches
The 2nd column gives you how many attributes were matched using the same productattributeid
The 3rd column is how many attributes exist in the input
If you compare the last 2 columns and the counts
match - you can use the productattributeid to attach to the product which has all these attributes
don't match - then you need to do an insert to create a new combination

Create a unique primary key (hash) from database columns

I have this table which doesn't have a primary key.
I'm going to insert some records in a new table to analyze them and I'm thinking in creating a new primary key with the values from all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3*31
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? It won't work for me to simply mark all the columns as PK, for what I need to do is to compare them with data from other DB table.
My table has 3 numbers and a date.
EDIT What my problem is
I think a bit more of background is needed. I'm sorry for not providing it before.
I have a database ( dm ) that is being updated everyday from another db ( original source ) . It has records form the past two years.
Last month ( july ) the update process got broken and for a month there was no data being updated into the dm.
I manually create a table with the same structure in my Oracle XE, and I copy the records from the original source into my db ( myxe ) I copied only records from July to create a report needed by the end of the month.
Finally on aug 8 the update process got fixed and the records which have been waiting to be migrated by this automatic process got copied into the database ( from originalsource to dm ).
This process does clean up from the original source the data once it is copied ( into dm ).
Everything look fine, but we have just realize that an amount of the records got lost ( about 25% of july )
So, what I want to do is to use my backup ( myxe ) and insert into the database ( dm ) all those records missing.
The problem here are:
They don't have a well defined PK.
They are in separate databases.
So I thought that If I could create a unique pk from both tables which gave the same number I could tell which were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table#PRODUCTION a , the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
Which returns all the rows that are present in both databases ( the .. union? ) I've got 2,000 records.
What I'm going to do next, is delete them all from the target db and then just insert them all s from my db into the target table
I hope I don't get in something worst : - S : -S

The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique and hence cannot be used safely as a primary key.
Instead I'd recommend using an autoincrementing ID for your primary key.

Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT
UPDATE mytable
SET pk_col = rownum
ALTER TABLE mytable MODIFY pk_col INT NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
or this:
ALTER TABLE mytable ADD pk_col RAW(16)
UPDATE mytable
SET pk_col = SYS_GUID()
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
The latter uses GUID's which are unique across databases, but consume more spaces and are much slower to generate (your INSERT's will be slow)
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
MERGE
INTO mytable v
USING (
SELECT rowid AS rid, rownum AS rn
FROM mytable
ORDER BY
co1l, col2, col3
)
ON (v.rowid = rid)
WHEN MATCHED THEN
UPDATE
SET pk_col = rn
Note that tables should be identical up to a single row (i. e. have same number of rows with same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing in dm, use this one (on dm side)
SELECT *
FROM mytable#myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable#myxe but not in mytable#dm
Note that it will shrink all duplicates if any.

Assuming that you have ensured uniqueness...you can do almost the same thing in SQL. The only problem will be the conversion of the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server, the only difference for Oracle would be in terms of how you handle the date conversion. Moreover, there are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
And, you can combine this with Quassnoi's SET statement to populate the new field as well. Just use the left side of the Join condition logic for the value.

If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns are required to ensure uniqueness (e.g. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data, could you use that?

MySQL -- mark all but 1 matching row

This is similar to this question, but it seems like some of the answers there aren't quite compatible with MySQL (or I'm not doing it right), and I'm having a heck of a time figuring out the changes I need. Apparently my SQL is rustier than I thought it was. I'm also looking to change a column value rather than delete, but I think at least that part is simple...
I have a table like:
rowid SERIAL
fingerprint TEXT
duplicate BOOLEAN
contents TEXT
created_date DATETIME
I want to set duplicate=true for all but the first (by created_date) of each group by fingerprint. It's easy to mark all of the rows with duplicate fingerprints as dupes. The part I'm getting stuck on is keeping the first.
One of the apps that populates the table does bulk loads of data, with multiple workers loading data from different sources, and the workers' data isn't necessarily partitioned by date, so it's a pain to try to mark these all as they come in (the first one inserted isn't necessarily the first one by date). Also, I already have a bunch of data in there I'll need to clean up either way. So I'd rather just have a relatively efficient query I can run after a bulk load to clean up than try to build it into that app.
Thanks!

MySQL needs to be explicitly told if the data you are grouping by is larger than 1024 bytes (see this link for details). So if your data in the fingerprint column is larger than 1024 bytes you should use set the max_sort_length variable (see this link for details about values allowed, and this link about how to set it) to a larger number so that the group by wont silently use only part of your data for grouping.
Once you're certain that MySQL will group your data properly, the following query will set the duplicate flag so that the first fingerprint record has duplicate set to FALSE/0 and any subsequent fingerprint records have duplicate set to TRUE/1:
UPDATE mytable m1
INNER JOIN (SELECT fingerprint
, MIN(rowid) AS minrow
FROM mytable m2
GROUP BY fingerprint) m3
ON m1.fingerprint = m3.fingerprint
SET m1.duplicate = m3.minrow != m1.rowid;
Please keep in mind that this solution does not take NULLs into account and if it is possible for the fingerprint field to be NULL then you would need additional logic to handle that case.

How about a two-step approach, assuming you can go offline during a data load:
Mark every item as duplicate.
Select the earliest row from each group, and clear the duplicate flag.
Not elegant, but gets the job done.

Here's a funny way to do it:
SET #rowid := 0;
UPDATE mytable
SET duplicate = (rowid = #rowid),
rowid = (#rowid:=rowid)
ORDER BY rowid, created_date;
First set a user variable to zero, assuming this is less than any rowid in your table.
Then use the MySQL UPDATE...ORDER BY feature to ensure that the rows are updated in order by rowid, then by created_date.
For each row, if the current rowid is not equal to the user variable #rowid, set duplicate to 0 (false). This will be true only on the first row encountered with a given value for rowid.
Then add a dummy set of rowid to its own value, setting #rowid to that value as a side effect.
As you UPDATE the next row, if it's a duplicate of the previous row, rowid will be equal to the user variable #rowid, and therefore duplicate will be set to 1 (true).
Edit: Now I have tested this, and I corrected a mistake in the line that sets duplicate.

Here's another way to do it, using MySQL's multi-table UPDATE syntax:
UPDATE mytable m1
JOIN mytable m2 ON (m1.rowid = m2.rowid AND m1.created_date < m2.created_date)
SET m2.duplicate = 1;

I don't know the MySQL syntax, but in PLSQL you just do:
UPDATE t1
SET duplicate = 1
FROM MyTable t1
WHERE rowid != (
SELECT TOP 1 rowid FROM MyTable t2
WHERE t2.fingerprint = t1.fingerprint ORDER BY created_date DESC
)
That may have some syntax errors, as I'm just typing off the cuff/not able to test it, but that's the gist of it.
MySQL version (not tested):
UPDATE t1
SET duplicate = 1
FROM MyTable t1
WHERE rowid != (
SELECT rowid FROM MyTable t2
WHERE t2.fingerprint = t1.fingerprint
ORDER BY created_date DESC
LIMIT 1
)

Untested...
UPDATE TheAnonymousTable
SET duplicate = TRUE
WHERE rowid NOT IN
(SELECT rowid
FROM (SELECT MIN(created_date) AS created_date, fingerprint
FROM TheAnonymousTable
GROUP BY fingerprint
) AS M,
TheAnonymousTable AS T
WHERE M.created_date = T.created_date
AND M.fingerprint = T.fingerprint
);
The logic is that the innermost query returns the earliest created_date for each distinct fingerprint as table alias M. The middle query determines the rowid value for each of those rows; it is a nuisance to have to do this (but necessary), and the code assumes that you won't get two records for the same fingerprint and timestamp. This gives you the rowid for the earlist record for each separate fingerprint. Then the outer query (the UPDATE) sets the 'duplicate' flag on all those rows where the rowid is not one of the earliest rows.
Some DBMS may be unhappy about doing (nested) sub-queries on the table being updated.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas