Create a unique primary key (hash) from database columns - sql

I have this table which doesn't have a primary key.
I'm going to insert some records into a new table to analyze them, and I'm thinking of creating a new primary key from the values of all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3 * 31;
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? Simply marking all the columns as the PK won't work for me, because what I need to do is compare them with data from another DB table.
My table has 3 numbers and a date.
EDIT: What my problem is
I think a bit more background is needed. I'm sorry for not providing it before.
I have a database (dm) that is being updated every day from another db (original source). It has records from the past two years.
Last month (July) the update process broke, and for a month no data was being copied into dm.
I manually created a table with the same structure in my Oracle XE, and I copied the records from the original source into my db (myxe). I copied only the records from July, to create a report needed by the end of the month.
Finally, on Aug 8 the update process got fixed and the records which had been waiting to be migrated by this automatic process got copied into the database (from original source to dm).
This process cleans the data out of the original source once it has been copied (into dm).
Everything looked fine, but we have just realized that a portion of the records got lost (about 25% of July's).
So, what I want to do is use my backup (myxe) and insert into the database (dm) all those missing records.
The problems here are:
They don't have a well defined PK.
They are in separate databases.
So I thought that if I could create a unique PK from both tables that gave the same number, I could tell which records were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table@PRODUCTION a, the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
Which returns all the rows that are present in both databases (the intersection, not the union). I've got 2,000 records.
What I'm going to do next is delete them all from the target db and then insert them all from my db into the target table.
I hope I don't get into something worse :-S

The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique, and hence cannot safely be used as a primary key.
Instead I'd recommend using an auto-incrementing ID for your primary key.
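Since you mention Oracle XE, the usual way to get an auto-incrementing ID before Oracle 12c is a sequence, plus a trigger for new rows. A minimal sketch, assuming a table named mytable and a new column pk_col (both names are placeholders):
-- sketch only: sequence-backed surrogate key
CREATE SEQUENCE mytable_seq START WITH 1 INCREMENT BY 1;
ALTER TABLE mytable ADD pk_col INT;
-- backfill existing rows
UPDATE mytable SET pk_col = mytable_seq.NEXTVAL;
ALTER TABLE mytable ADD CONSTRAINT pk_mytable PRIMARY KEY (pk_col);
-- populate the column automatically for future inserts
CREATE OR REPLACE TRIGGER mytable_pk_trg
BEFORE INSERT ON mytable
FOR EACH ROW
BEGIN
  :NEW.pk_col := mytable_seq.NEXTVAL;
END;
/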

Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT
UPDATE mytable
SET pk_col = rownum
ALTER TABLE mytable MODIFY pk_col INT NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
or this:
ALTER TABLE mytable ADD pk_col RAW(16)
UPDATE mytable
SET pk_col = SYS_GUID()
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
The latter uses GUIDs, which are unique across databases, but they consume more space and are much slower to generate (your INSERTs will be slow).
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
MERGE
INTO mytable v
USING (
SELECT rowid AS rid,
       ROW_NUMBER() OVER (ORDER BY col1, col2, col3) AS rn
FROM mytable
) o
ON (v.rowid = o.rid)
WHEN MATCHED THEN
UPDATE
SET pk_col = o.rn
Note that the tables should be identical row for row (i.e. have the same number of rows with the same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing from dm, run this one (on the dm side):
SELECT *
FROM mytable@myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable@myxe but not in the local mytable (in dm).
Note that it will collapse any duplicates, since MINUS is a set operation.
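If you then want to copy the missing rows across, a rough follow-up sketch (assuming the column lists match exactly and the database link really is named myxe) would be:
-- sketch only: insert into dm's table whatever exists only in myxe
INSERT INTO mytable
SELECT * FROM mytable@myxe
MINUS
SELECT * FROM mytable;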

Assuming that you have ensured uniqueness... you can do almost the same thing in SQL. The only problem will be converting the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server; the only difference for Oracle would be how you handle the date conversion. There are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
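For Oracle, a rough sketch of the date part of that expression could use EXTRACT; date_col below is a made-up name standing in for your date column:
-- sketch only: Oracle flavour of the hash expression above
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((EXTRACT(YEAR FROM Table1.date_col) +
  EXTRACT(MONTH FROM Table1.date_col) +
  EXTRACT(DAY FROM Table1.date_col)) * 31)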
You can also combine this with Quassnoi's SET statement to populate the new field. Just use the left-hand side of the join condition for the value.

If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns as are required to ensure uniqueness (i.e. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data; could you use that?
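For what it's worth, Oracle does ship hash functions such as ORA_HASH. A rough sketch using the column names from your EDIT 2 (idle, activity, finishdate), with the usual caveat that hash collisions are still possible, so this identifies candidate matches rather than guaranteeing uniqueness:
-- sketch only: per-row hash of the three numbers and the date
SELECT t.*,
       ORA_HASH(t.idle || '|' || t.activity || '|' ||
                TO_CHAR(t.finishdate, 'YYYYMMDDHH24MISS')) AS row_hash
FROM the_table t;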

Related

Adding Row in existing table (SQL Server 2005)

I want to add another row to my existing table, and I'm a bit hesitant about whether I'm doing the right thing because it might skew the database. My script is below and I would like to hear your thoughts about it.
I want to add another row for 'Jane' to the table, which will have 'SKATING' in the ACT column.
Table: [Emp_table].[ACT].[LIST_EMP]
My script is:
INSERT INTO [Emp_table].[ACT].[LIST_EMP]
([ENTITY],[TYPE],[EMP_COD],[DATE],[LINE_NO],[ACT],[NAME])
VALUES
('REG','EMP','45233','2016-06-20 00:00:00:00','2','SKATING','JANE')
Will this do the trick?
Your statement looks ok. If the database has a problem with it (for example, due to a foreign key constraint violation), it will reject the statement.
If any of the fields in your table are numeric (and not varchar or char), just remove the quotes around the corresponding field. For example, if emp_cod and line_no are int, insert the following values instead:
('REG','EMP',45233,'2016-06-20 00:00:00:00',2,'SKATING','JANE')
Inserting records into a database has always been the most common reason why I've lost a lot of the hair on my head!
SQL is great when it comes to SELECTs or even UPDATEs, but when it comes to INSERTs it's like someone from another planet walked into the SQL standards committee and managed to get their way of doing it into the final SQL standard!
If your table does not have a primary key that automatically gets generated on every insert, then you have to write the duplicate-avoidance logic yourself.
Start by writing a normal SELECT to see if the record(s) you're going to add don't already exist. But as Robert implied, your table may not have a primary key because it looks like a LOG table to me. So insert away!
If it does require a unique record every time, then I strongly suggest you create a primary key for the table, either an auto-generated one or a combination of your existing columns.
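If you go the auto-generated route in SQL Server, a minimal sketch (the column and constraint names here are made up) would be:
-- sketch only: add an identity column and make it the primary key
ALTER TABLE [Emp_table].[ACT].[LIST_EMP]
    ADD LIST_EMP_ID INT IDENTITY(1,1) NOT NULL
    CONSTRAINT PK_LIST_EMP PRIMARY KEY CLUSTERED;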
Assuming the first five combined columns make a unique key, this select will determine if your data you're inserting does not already exist...
SELECT COUNT(*) AS FoundRec FROM [Emp_table].[ACT].[LIST_EMP]
WHERE [ENTITY] = wsEntity AND [TYPE] = wsType AND [EMP_COD] = wsEmpCod AND [DATE] = wsDate AND [LINE_NO] = wsLineno
You will have to replace the wsXXX placeholders with direct values or have them DECLAREd earlier in your script.
If you run this alone and receive a value of 1 or more, then the data already exists in your table, at least for those first 5 columns. A true duplicate test would require you to test EVERY column in your table, but this should give you an idea.
To do it all as one statement, use INSERT ... SELECT with a NOT EXISTS check (a plain INSERT ... VALUES cannot take a WHERE clause):
INSERT INTO [Emp_table].[ACT].[LIST_EMP]
([ENTITY],[TYPE],[EMP_COD],[DATE],[LINE_NO],[ACT],[NAME])
SELECT 'REG','EMP','45233','2016-06-20 00:00:00:00','2','SKATING','JANE'
WHERE NOT EXISTS (SELECT 1 FROM [Emp_table].[ACT].[LIST_EMP]
                  WHERE [ENTITY] = wsEntity AND [TYPE] = wsType AND
                        [EMP_COD] = wsEmpCod AND [DATE] = wsDate AND
                        [LINE_NO] = wsLineno)
Just replace the wsXXX variables with the values you want to insert.
I hope that made sense.

Primary key violation while merging data from another table

I have two tables, TBTC03 and TBTC03Y, with TBTC03Y having two extra columns, EFFDTE and EXPDTE. I have to merge the data from TBTC03 into TBTC03Y with the following logic:
If no matching TC03 entry is found in TC03Y
a new TC03Y record is built with the TC03 data
the Effective Date will default to '01-01-1980'
the Expiration Date will default to '09-30-1995'
I wrote a query for this as:
insert into TBTC03Y (LOB,MAJPERIL,LOSSCAUSE,NUMERICCL,EFFDTE,EXPDTE)
select LOB,MAJPERIL,LOSSCAUSE,NUMERICCL,'0800101' ,'0950930'
from TBTC03 where not EXISTS (select * from TBTC03Y where
TBTC03Y.LOB = TBTC03.LOB AND
TBTC03Y.MAJPERIL = TBTC03.MAJPERIL AND
TBTC03Y.LOSSCAUSE = TBTC03.LOSSCAUSE AND
TBTC03Y.NUMERICCL = TBTC03.NUMERICCL)
The primary key for both the tables is LOB, MAJPERIL and LOSSCAUSE.
However, some TBTC03Y records already contain data with the same primary key.
Firing the above query gives primary key constraint violations on some of the rows.
I am unable to figure out how I can accomplish this.
The issue with the primary key is that you're also including NUMERICCL in the WHERE clause. If you remove it you'll then be inserting only unique data.
You may have to create a separate process, as it appears you have some records in each table that have the same LOB, MAJPERIL and LOSSCAUSE but a different NUMERICCL. I can think of three options here:
You have an issue with the data that needs fixing.
Maybe you want to update this value to match, in which case you're looking at an UPDATE rather than an INSERT INTO (see the sketch after this list).
You need to update your composite primary key to include the column NUMERICCL.
Removing NUMERICCL from the WHERE clause would also correct this.
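For option 2, a rough sketch of that UPDATE (whether overwriting NUMERICCL is actually correct is a business decision, so treat this as an illustration only):
-- sketch only: align NUMERICCL in TBTC03Y with TBTC03 for rows that share the PK
UPDATE TBTC03Y
SET NUMERICCL = (SELECT TBTC03.NUMERICCL
                 FROM TBTC03
                 WHERE TBTC03.LOB = TBTC03Y.LOB
                   AND TBTC03.MAJPERIL = TBTC03Y.MAJPERIL
                   AND TBTC03.LOSSCAUSE = TBTC03Y.LOSSCAUSE)
WHERE EXISTS (SELECT 1
              FROM TBTC03
              WHERE TBTC03.LOB = TBTC03Y.LOB
                AND TBTC03.MAJPERIL = TBTC03Y.MAJPERIL
                AND TBTC03.LOSSCAUSE = TBTC03Y.LOSSCAUSE);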
If the PK for both tables is {LOB, MAJPERIL, LOSSCAUSE}, you should remove TBTC03Y.NUMERICCL = TBTC03.NUMERICCL from your where clause.
Example:
t1 {LOB, MAJPERIL, LOSSCAUSE, NUMERICCL}
    1    1         1          1
t2 {LOB, MAJPERIL, LOSSCAUSE, NUMERICCL}
    1    1         1          2
In t2 there is no row where:
TBTC03Y.LOB = TBTC03.LOB AND
TBTC03Y.MAJPERIL = TBTC03.MAJPERIL AND
TBTC03Y.LOSSCAUSE = TBTC03.LOSSCAUSE AND
TBTC03Y.NUMERICCL = TBTC03.NUMERICCL
But inserting it will obviously violate the PK constraint in t2:
t2 {LOB, MAJPERIL, LOSSCAUSE}
    1    1         1

Bulk updating existing rows in Redshift

This seems like it should be easy, but isn't. I'm migrating a query from MySQL to Redshift of the form:
INSERT INTO table
(...)
VALUES
(...)
ON DUPLICATE KEY UPDATE
value = MIN(value, VALUES(value))
Rows whose primary keys aren't already in the table are just inserted. For primary keys that are already in the table, we update the row's values based on a condition that depends on the existing and new values in the row.
http://docs.aws.amazon.com/redshift/latest/dg/merge-replacing-existing-rows.html does not work, because filter_expression in my case depends on the current entries in the table. I'm currently creating a staging table, inserting into it with a COPY statement and am trying to figure out the best way to merge the staging and real tables.
I'm having to do exactly this for a project right now. The method I'm using involves 3 steps:
1.
Run an update that addresses changed fields (I'm updating whether or not the fields have changed, but you can certainly qualify that):
update table1
set col1=s.col1, col2=s.col2, ...
from stagetable s
where table1.primkey=s.primkey;
2.
Run an insert that addresses new records:
insert into table1
select s.*
from stagetable s
left outer join table1 t on s.primkey=t.primkey
where t.primkey is null;
3.
Mark rows no longer in the source as inactive (our reporting tool uses views that filter inactive records):
update table1
set is_active_flag='N', last_updated=sysdate
where not exists (select 1 from stagetable s where s.primkey=table1.primkey);
It is possible to create a temp table. In Redshift it is better to delete and insert the record.
Check this doc:
http://docs.aws.amazon.com/redshift/latest/dg/merge-replacing-existing-rows.html
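The pattern in that doc boils down to delete-then-insert inside one transaction. A rough sketch, borrowing the table1/stagetable/primkey names from the answer above:
-- sketch only: replace existing rows, then add the rest
begin transaction;
delete from table1
using stagetable s
where table1.primkey = s.primkey;
insert into table1
select * from stagetable;
end transaction;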
Here is a fully working approach for Redshift.
Assumptions:
A. Data is available in S3 in gzip format with '|'-separated columns, and may contain some garbage data (see maxerror below).
B. A sales fact table with two dimension tables, to keep it simple (TIME and SKU; SKU may have many groups and categories).
C. You have a sales table like this:
CREATE TABLE sales (
sku_id int encode zstd,
date_id int encode zstd,
quantity numeric(10,2) encode delta32k
);
1) Create a staging table that should resemble the online table used by your app(s).
CREATE TABLE stg_sales_onetime (
sku_number varchar(255) encode zstd,
time varchar(255) encode zstd,
qty_str varchar(20) encode zstd,
quantity numeric(10,2) encode delta32k,
sku_id int encode zstd,
date_id int encode zstd
);
2) Copy the data from S3 (this could also be done using SSH).
copy stg_sales_onetime (sku_number,time,qty_str) from
's3://<bucket_name>/<full_file_path>' CREDENTIALS 'aws_access_key_id=<your_key>;aws_secret_access_key=<your_secret>' delimiter '|' ignoreheader 1 maxerror as 1000 gzip;
3) This step is optional: if your data is not well formatted, this is your transformation step (e.g. converting the string quantity '12.555654' to the number 12.56).
update stg_sales_onetime set quantity=convert(decimal(10,2),qty_str);
4) Populate the correct IDs from the dimension tables.
update stg_sales_onetime set sku_id=<your_sku_dimension_table>.sku_id from <your_sku_dimension_table> where stg_sales_onetime.sku_number=<your_sku_dimension_table>.sku_number;
update stg_sales_onetime set date_id=<your_time_dimension_table>.date_id from <your_time_dimension_table> where stg_sales_onetime.time=<your_time_dimension_table>.time;
5) Finally, the data is good to go from the staging table into the online sales table.
insert into sales(sku_id,date_id,quantity) select sku_id,date_id,quantity from stg_sales_onetime;

Regarding delete a record

Hi, I have a table which does not have any primary key or unique key.
How can I delete the duplicate records?
Can anyone tell me?
The easiest way would be to copy all of the duplicates into another identical table, delete them all from the original table, then put back the duplicates (just once for each unique one of course) from the temporary table.
For example:
BEGIN TRANSACTION
CREATE TABLE Holding_Table (my_string VARCHAR(20) NOT NULL)
INSERT INTO Holding_Table (my_string)
SELECT my_string
FROM My_Table
GROUP BY my_string
HAVING COUNT(*) > 1
DELETE MT
FROM Holding_Table HT
INNER JOIN My_Table MT ON MT.my_string = HT.my_string
INSERT INTO My_Table (my_string)
SELECT my_string
FROM Holding_Table
DROP TABLE Holding_Table
COMMIT TRANSACTION
This is just a simple example with one column. You would need to adjust it for your table obviously. Then be sure to add a primary key to your table...
You would have to create a primary key first. Then you would be able to run an aggregate query and see how many duplicates there are and delete based off of the new ID. You could then remove the primary key and make another field the primary key if you so desired (or stick with the one you created).
I have done this many times when fixing ancient legacy databases.
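As a rough sketch of that aggregate step (re-using the one-column My_Table example from the answer above), something like this shows how many copies of each value exist:
-- sketch only: list values that appear more than once
SELECT my_string, COUNT(*) AS copies
FROM My_Table
GROUP BY my_string
HAVING COUNT(*) > 1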
If you use: SET ROWCOUNT 1
You can get SQL to delete only a single row, and use whatever technique you prefer to delete the identical rows one at a time.
To revert back to normal behaviour, use: SET ROWCOUNT 0
However, it would be advisable to at least add a column that allows you to uniquely identify each row so that you can avoid this problem in future. The following does the trick:
ALTER TABLE TableName ADD TableName_ID int IDENTITY NOT NULL
Now you can simply: DELETE TableName WHERE TableName_ID = ? for each of your duplicates.
Check this site on support.microsoft.com: Site
It can tell you a lot about how to identify duplicates, etc.
Adding this as another answer since it's a different approach...
You could also add a new column to the table, make that one unique, and then use that to delete all but one of the duplicate rows. For example:
ALTER TABLE My_Table
ADD my_id INT IDENTITY NOT NULL
DELETE
MT1
FROM
My_Table MT1
WHERE EXISTS (
SELECT
*
FROM
My_Table MT2
WHERE
MT2.my_string = MT1.my_string AND
MT2.my_id < MT1.my_id)
ALTER TABLE My_Table
DROP COLUMN my_id

Row number in Sybase tables

Sybase db tables do not have a concept of self-updating row numbers. However, for one of the modules, I require a row number corresponding to each row in the database, such that max(column) would always tell me the number of rows in the table.
I thought I'd introduce an int column and keep updating it to keep track of the row number. However, I'm having problems updating this column in the case of deletes. What SQL should I use in the delete trigger to update this column?
You can easily assign a unique number to each row by using an identity column. The identity can be a numeric or an integer (in ASE12+).
This will almost do what you require. There are certain circumstances in which you will get a gap in the identity sequence (these are called "identity gaps"; the best discussion of them is here). Also, deletes will cause gaps in the sequence, as you've identified.
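A rough sketch of adding such a column in ASE (the column name row_id is made up; ASE assigns values to existing rows when the column is added):
-- sketch only: add an identity column to the existing table
alter table myTable add row_id numeric(10,0) identity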
Why do you need to use max(col) to get the number of rows in the table, when you could just use count(*)? If you're trying to get the last row from the table, then you can do
select * from table where column = (select max(column) from table).
Regarding the delete trigger to update a manually managed column, I think this would be a potential source of deadlocks and many performance issues. Imagine you have 1 million rows in your table and you delete row 1: that's 999,999 rows you now have to update to subtract 1 from the id.
Delete trigger
CREATE TRIGGER tigger ON myTable FOR DELETE
AS
update myTable
set id = id - (select count(*) from deleted d where d.id < t.id)
from myTable t
To avoid locking problems
You could add an extra table (which joins to your primary table) like this:
CREATE TABLE rowCounter
(id int, -- foreign key to main table
rownum int)
... and use the rownum field from this table.
If you put the delete trigger on this table then you would hugely reduce the potential for locking problems.
Approximate solution?
Does the table need to keep its rownumbers up to date all the time?
If not, you could have a job which runs every minute or so, which checks for gaps in the rownum, and does an update.
Question: do the rownumbers have to reflect the order in which rows were inserted?
If not, you could do far fewer updates by only updating the most recent rows, "moving" them into gaps.
Leave a comment if you would like me to post any SQL for these ideas.
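If it helps, here is a rough sketch of the gap check against the rowCounter table sketched above (treat it as a starting point only):
-- sketch only: find the lowest missing row number, if any
select min(r1.rownum) + 1 as first_gap
from rowCounter r1
where not exists (select 1
                  from rowCounter r2
                  where r2.rownum = r1.rownum + 1)
  and r1.rownum < (select max(rownum) from rowCounter)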
I'm not sure why you would want to do this. You could experiment with using temporary tables and "select into" with an Identity column like below.
create table test
(
col1 int,
col2 varchar(3)
)
insert into test values (100, "abc")
insert into test values (111, "def")
insert into test values (222, "ghi")
insert into test values (300, "jkl")
insert into test values (400, "mno")
select rank = identity(10), col1 into #t1 from Test
select * from #t1
delete from test where col2="ghi"
select rank = identity(10), col1 into #t2 from Test
select * from #t2
drop table test
drop table #t1
drop table #t2
This would give you a dynamic id (of sorts).