Snowflake shows duplicates despite a primary key

In my use case, a scheduled job reads a CSV file and writes it to Snowflake.
When this job runs every hour, I see duplicate rows in Snowflake, even though my ID column is declared as a primary key (ALTER TABLE tablename ADD PRIMARY KEY (column1)).
I understand that Snowflake supports defining and maintaining constraints but does not enforce them, except for NOT NULL constraints, which are always enforced. I need help solving this issue.
To elaborate, let's consider this scenario:
Step 1: At 9 AM, insert data from the CSV into Snowflake
ID Customer name Price
1111 John Mathew 10
1112 David Becham 20
Step 2: At 10 AM I get one additional row, so my CSV is now
ID Customer name Price
1111 John Mathew 10
1112 David Becham 20
1113 Hello World 40
Expected result in Snowflake
ID Customer name Price
1111 John Mathew 10
1112 David Becham 20
1113 Hello World 40
What I actually get is duplicates, as shown below
ID Customer name Price
1111 John Mathew 10
1112 David Becham 20
1113 Hello World 40
1111 John Mathew 10
1112 David Becham 20

It would help if you provided your code. It looks like you are updating your existing CSV file, which means Snowflake sees the entire file as a new file to be loaded and loads the whole thing again. If you are just running a COPY INTO command with no downstream logic, then that is exactly what will happen.
Two options:
1) Don't update the CSV file; create a new one containing only the new data. Then the COPY INTO command will work fine.
2) If you are also receiving updates to previous records, run COPY INTO against a temporary table and then MERGE that data into your final table on the primary key, as sketched below.
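Here is a minimal sketch of option 2 in Snowflake SQL. The table, stage and column names (customers, stg_customers, @my_stage, id, customer_name, price) are assumptions; substitute your own:
-- Load the full CSV into a temporary staging table
CREATE OR REPLACE TEMPORARY TABLE stg_customers LIKE customers;
COPY INTO stg_customers
  FROM @my_stage/customers.csv
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Merge the staged rows into the final table on the primary key
MERGE INTO customers t
USING stg_customers s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.customer_name = s.customer_name, t.price = s.price
WHEN NOT MATCHED THEN
  INSERT (id, customer_name, price) VALUES (s.id, s.customer_name, s.price);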

Create another table (a second table) to store the de-duplicated records. The first table receives the data from your source (the CSV). Then create a stream on top of the first table to capture changes, and a task on that stream that merges (inserts/updates) the data into the second table.
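A rough sketch of that pipeline (the names customers_raw, customers_clean, the stream, the task and the load_wh warehouse are all placeholders):
-- Stream captures the rows the CSV load inserts into the first (raw) table
CREATE OR REPLACE STREAM customers_raw_stream ON TABLE customers_raw;

-- Task wakes up hourly and merges the captured changes into the de-duplicated table
CREATE OR REPLACE TASK dedupe_customers
  WAREHOUSE = load_wh
  SCHEDULE = '60 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('CUSTOMERS_RAW_STREAM')
AS
MERGE INTO customers_clean t
USING customers_raw_stream s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.customer_name = s.customer_name, t.price = s.price
WHEN NOT MATCHED THEN
  INSERT (id, customer_name, price) VALUES (s.id, s.customer_name, s.price);

-- Tasks are created suspended; resume it so it starts running
ALTER TASK dedupe_customers RESUME;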

Related

Oracle Delete and Insert

I want to update the categories linked to a row. A row can be linked to multiple categories. To visually represent this, please take a look at the following simplified scenarios:
Object Table
id name
----------
1 Chair
2 Computer
Category Table
id category
--------------
90 Asset
100 Furniture
200 Electronics
300 Garbage
Linking Table
obj cat
---------
1 90
1 100
2 90
2 200
So I have those values in the database right now. But now I decided to update the Chair record to be both 100 | Furniture and 300 | Garbage.
How do I go about doing this efficiently? I know I can delete all of the associated links and then add the new ones, but there must be a more efficient way to do this.
There are a couple of different options you could take:
Update the existing rows if the number of rows in the Linking Table isn't changing. In this case, you could update the 1 90 row to 1 300. The challenge here is ensuring that the number of rows you remove equals the number you plan to insert.
Delete the links that are no longer valid and insert new ones for the rows that are missing (sketched below). In this case the 1 100 row also stays intact, just as in the first option, but without the requirement that the number of rows deleted equal the number inserted.
"Nuke and pave" is what you initially suggested: delete all of the object's links and insert them again from scratch. It works, but it can be overkill.
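For the second option, a sketch in Oracle SQL (the table name linking_table is an assumption; the literal object/category ids come from the example above):
-- Drop the links object 1 should no longer have (it should end up with only 100 and 300)
DELETE FROM linking_table
 WHERE obj = 1
   AND cat NOT IN (100, 300);

-- Add the links object 1 should have but doesn't yet
INSERT INTO linking_table (obj, cat)
SELECT 1, wanted.cat
  FROM (SELECT 100 AS cat FROM dual UNION ALL
        SELECT 300 AS cat FROM dual) wanted
 WHERE NOT EXISTS (SELECT 1
                     FROM linking_table l
                    WHERE l.obj = 1
                      AND l.cat = wanted.cat);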

Merging databases with similar tables structures into one single database

I have two databases with similar data structures. They are basically archived databases containing the same tables, but one has slightly newer data. I want to merge them into a single database.
When I try to merge them, I get an error saying duplicate data cannot be copied, but I want to copy the duplicate data too. I believe this error is mainly due to the primary key constraint.
Can anyone suggest how to combine the two databases without losing the duplicate data?
For example:
Table1:
MemberID Name Class Year
120 Sam B 2005
121 Mark A 2005
122 John A 2005
Table2:
MemberID Name Class Year
120 Sam B 2006
121 Mark A 2006
123 David C 2006
Result table should be:
MemberID Name Class Year
120 Sam B 2005
120 Sam B 2006
121 Mark A 2005
121 Mark A 2006
122 John A 2005
123 David C 2006
Note: memberID is the primary key
Primary keys must be unique. So you need to change the design of the result table so it has some other primary key, such as a field called "ID" that is an identity field (autonumber).
When MemberID is no longer primary, then your insert should work.
You could also add another field to the result table that indicates whether the row is archived, and make the primary key the combination of MemberID and that field. There are many ways to do this, but one way or another MemberID cannot be the primary key of the result table on its own.
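For example, in SQL Server syntax (table name and column types are assumptions), giving the result table its own identity key so duplicate MemberIDs are allowed:
-- Result table with a surrogate identity key; MemberID is just a regular column now
CREATE TABLE MembersMerged (
    ID       INT IDENTITY(1,1) PRIMARY KEY,
    MemberID INT,
    Name     VARCHAR(100),
    Class    CHAR(1),
    [Year]   INT
);

INSERT INTO MembersMerged (MemberID, Name, Class, [Year])
SELECT MemberID, Name, Class, [Year] FROM Table1
UNION ALL
SELECT MemberID, Name, Class, [Year] FROM Table2;
Alternatively, a composite primary key on (MemberID, [Year]) would also work for this data, as long as a given year never contains the same MemberID twice.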

SQL Server 2012 Query to extract subsets of data

I'm trying to normalise (2NF) some data:
Refid | Reason
------|---------
1 | Admission
1 | Advice and Support
1 | Behaviour
As you can see, one person might have multiple reasons, so I need another table in the following format:
Refid | Reason1 | Reason2 | Reason3 | ETC...
------|-----------|--------------------|-----------
1 | Admission | Advice and Support | Behaviour
But I don't know how to write a query that extracts the data and writes it into a new table in this shape. The reasons don't have dates or other criteria that would put them in any particular order; all reasons are assigned at the time of referral.
Thanks for your help. SQL Server 2012.
You are modelling a many-to-many relationship.
You need 3 tables
- One for Reasons (say ReasonID and Reason)
- One for each entity identified by RefID (say RefID and ReferenceOtherData)
- A junction (or intersection) table with the keys (RefID, ReasonID)
This way:
- Multiple reasons can apply to one Ref entity
- Multiple Refs can have the same reason
- Repeated columns are turned into rows
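A sketch of the three tables in SQL Server syntax (all names and column types are assumptions):
CREATE TABLE Reason (
    ReasonID INT IDENTITY(1,1) PRIMARY KEY,
    Reason   VARCHAR(100) NOT NULL
);

CREATE TABLE Referral (
    RefID              INT PRIMARY KEY,
    ReferenceOtherData VARCHAR(200) NULL
);

-- Junction table: one row per (referral, reason) pair, so reasons stay as rows, not columns
CREATE TABLE ReferralReason (
    RefID    INT NOT NULL REFERENCES Referral (RefID),
    ReasonID INT NOT NULL REFERENCES Reason (ReasonID),
    PRIMARY KEY (RefID, ReasonID)
);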

How to merge two identical database data to one?

Two customers are going to merge. They both use my application, each with their own database. In a few weeks they will merge into one organisation, so they want all the data in one database.
The two database structures are identical; the problem is with the data. For example, take the Locations and Persons tables (these are just two tables out of 50):
Database 1:
Locations:
Id Name Adress etc....
1 Location 1
2 Location 2
Persons:
Id LocationId Name etc...
1 1 Alex
2 1 Peter
3 2 Lisa
Database 2:
Locations:
Id Name Adress etc....
1 Location A
2 Location B
Persons:
Id LocationId Name etc...
1 1 Mark
2 2 Ashley
3 1 Ben
We see that a person is related to a location (the LocationId column). Note that I have more tables referring to the Locations and Persons tables.
Each database contains its own locations and persons, but the Ids can be the same. If I want to import everything into DB2, the locations from DB1 should be inserted into DB2 with Ids 3 and 4, the persons from DB1 should get the new Ids 4, 5 and 6, and the LocationId values on those persons have to be changed to the new location Ids (3 and 4).
My plan is to write a query that handles all of this, but I don't know where to begin.
What is the best way (in a query) to renumber the Id fields and cascade the change to the child rows? The databases contain no referential integrity and no foreign keys (foreign keys are NOT defined in the database), and creating foreign keys with cascading is not an option.
I'm using SQL Server 2005.
You say that both customers are using your application, so I assume that it's some kind of "shrink-wrap" software that is used by more customers than just these two, correct?
If so, adding special columns to the tables will probably cause pain in the future: either you would have to maintain a special version of the application for these two customers that can deal with the additional columns, or you would have to introduce the columns into your main codebase, which means that all your other customers would get them as well.
I can think of an easier way to do this without changing any of your tables or adding any columns.
In order for this to work, you need to find out the largest ID that exists in both databases together (no matter in which table or in which database it is).
This may require some copy & paste to get a lot of queries that look like this:
select max(id) as maxlocationid from locations
select max(id) as maxpersonid from persons
-- and so on... (one query for each table)
When you find the largest ID after running the query in both databases, take a number that's larger than that ID, and add it to all IDs in all tables in the second database.
It's very important that the number needs to be larger than the largest ID that already exists in both databases!
It's a bit difficult to explain, so here's an example:
Let's say that the largest ID in any table in both databases is 8000.
Then you run some SQL that adds 10000 to every ID in every table in the second database:
update Locations set Id = Id + 10000
update Persons set Id = Id + 10000, LocationId = LocationId + 10000
-- and so on, for each table
The queries are relatively simple, but this is the most work because you have to build a query like this manually for each table in the database, with the correct names of all the ID columns.
After running the query on the second database, the example data from your question will look like this:
Database 1: (exactly like before)
Locations:
Id Name Adress etc....
1 Location 1
2 Location 2
Persons:
Id LocationId Name etc...
1 1 Alex
2 1 Peter
3 2 Lisa
Database 2:
Locations:
Id Name Adress etc....
10001 Location A
10002 Location B
Persons:
Id LocationId Name etc...
10001 10001 Mark
10002 10002 Ashley
10003 10001 Ben
And that's it! Now you can import the data from one database into the other, without getting any primary key violations at all.
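If both databases sit on the same SQL Server instance, the import itself can then be plain INSERT ... SELECT statements (database, schema and column names below are placeholders):
-- Copy the renumbered rows from the second database into the first
INSERT INTO Database1.dbo.Locations (Id, Name, Adress)
SELECT Id, Name, Adress FROM Database2.dbo.Locations;

INSERT INTO Database1.dbo.Persons (Id, LocationId, Name)
SELECT Id, LocationId, Name FROM Database2.dbo.Persons;
-- ...and so on, one INSERT ... SELECT per table
If the Id columns happen to be identity columns, you would additionally need SET IDENTITY_INSERT ... ON around each statement.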
If this were my problem, I would probably add some columns to the tables in the database I was going to keep. These would be used to store the primary key values from the other database. Then I would insert the records from the other database's tables, using a known placeholder value for the foreign key columns, update them as required, and finally drop the columns I added.
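A rough sketch of that alternative (the OldId/OldLocationId helper columns and all object names are hypothetical, and it assumes both databases are on one SQL Server 2005 instance):
-- Helper columns remember what each imported row was called in the other database
ALTER TABLE Locations ADD OldId INT NULL;
ALTER TABLE Persons   ADD OldLocationId INT NULL;
GO  -- new batch so the new columns are visible below

DECLARE @maxLocId INT, @maxPersonId INT;
SELECT @maxLocId = MAX(Id) FROM Locations;
SELECT @maxPersonId = MAX(Id) FROM Persons;

-- Copy locations with fresh Ids, keeping the old Id for the lookup below
INSERT INTO Locations (Id, Name, OldId)
SELECT @maxLocId + ROW_NUMBER() OVER (ORDER BY Id), Name, Id
FROM Database2.dbo.Locations;

-- Copy persons with fresh Ids and the old LocationId as a known placeholder
INSERT INTO Persons (Id, LocationId, Name, OldLocationId)
SELECT @maxPersonId + ROW_NUMBER() OVER (ORDER BY Id), LocationId, Name, LocationId
FROM Database2.dbo.Persons;

-- Repoint the imported persons at the new location Ids, then drop the helpers
UPDATE p
   SET LocationId = l.Id
  FROM Persons p
  JOIN Locations l ON l.OldId = p.OldLocationId
 WHERE p.OldLocationId IS NOT NULL;

ALTER TABLE Locations DROP COLUMN OldId;
ALTER TABLE Persons   DROP COLUMN OldLocationId;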

Efficient way to update SQL 'relationship' table

Say I have three properly normalised tables. One of people, one of qualifications and one mapping people to qualifications:
People:
id | Name
----------
1 | Alice
2 | Bob
Degrees:
id | Name
---------
1 | PhD
2 | MA
People-to-degrees:
person_id | degree_id
---------------------
1 | 2 # Alice has an MA
2 | 1 # Bob has a PhD
So then I have to update this mapping via my web interface. (I made a mistake. Bob has a BA, not a PhD, and Alice just got her B Eng.)
There are four possible states of these one-to-many relationship mappings:
was true before, should now be false
was false before, should now be true
was true before, should remain true
was false before, should remain false
What I don't want to do is read the values from four checkboxes and then hit the database four times to say "Did Bob have a BA before? Well, he does now", "Did Bob have a PhD before? Because he doesn't any more", and so on.
How do other people address this issue?
I'm curious to see if someone else arrives at the same solution I did.
UPDATE 1: onedaywhen suggests the same thing which occurred to me -- simply delete all the old entries, correct or not, and INSERT new ones.
UPDATE 2: potatopeelings suggests adding some code to the form which stores the original value of the field which can be compared with the new value on submit.
Logically, an UPDATE is a DELETE followed by an INSERT (consider that SQL Server triggers can access logical tables named inserted and deleted but there is no updated table). So you should be able to hit the database only twice i.e. first DELETE all rows (correct or otherwise) for Bob, second INSERT all correct rows for Bob.
If you want to hit the database only once, consider using Standard SQL's MERGE, assuming your DBMS supports it (SQL Server introduced it in 2008).
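A sketch of that single-statement MERGE in SQL Server syntax, using the table and column names from the question. The degree id 3 for Bob's BA is hypothetical (it assumes a BA row was added to the Degrees table):
-- The source lists the degrees person 2 (Bob) should have after the edit
MERGE people_to_degrees AS target
USING (VALUES (2, 3)) AS source (person_id, degree_id)  -- 3 = BA (hypothetical new Degrees row)
   ON target.person_id = source.person_id
  AND target.degree_id = source.degree_id
WHEN NOT MATCHED BY TARGET THEN
    INSERT (person_id, degree_id) VALUES (source.person_id, source.degree_id)
WHEN NOT MATCHED BY SOURCE AND target.person_id = 2 THEN
    DELETE;  -- only Bob's stale rows are removed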
Assuming the UI is a checkbox grid (option 1 in Ismail's comment on the question):
        MA    PhD
Alice   x
Bob           x
where x represents a checked box. I'd have the front-end script send only the changes back to the server, then do the INSERTs and DELETEs on the People-to-degrees table under a single transaction, or use a MERGE (as pointed out in Ismail's link):
BEGIN TRAN
INSERT query
DELETE query
COMMIT
You would pass the INSERT (and DELETE) queries a list of (person ID, degree ID) pairs. For this example, the INSERT query would receive the single pair (2, 2) and the DELETE query the single pair (2, 1), as in the sketch below.
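Concretely, the transaction for this example might look like the following (T-SQL, table and column names as above):
BEGIN TRAN;

-- new (person_id, degree_id) pairs ticked in the grid
INSERT INTO people_to_degrees (person_id, degree_id)
VALUES (2, 2);

-- (person_id, degree_id) pairs that were unticked
DELETE FROM people_to_degrees
 WHERE person_id = 2 AND degree_id = 1;

COMMIT;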