Current primary key is ineffective at preventing duplicates. Does this sound like a good way to rearchitect my tables? - sql

Every so often, I update our research recruitment database with those who responded to our Craigslist ad. Each respondent is given a unique respondentID, which is the primary key.
Sometimes, people respond to these Craigslist ads multiple times. I think we may have duplicate people in our database, which is bad.
I would like to change the primary key of all my recruitment tables from respondentID to Email, which will prevent duplicates and make it easier to look up information. There are probably duplicate email records in my database already, and I need to clean this up if so.
Here's the current architecture for my three recruitment tables:
demographic - contains columns like RespondentID (PK), Email (I want this to be PK), Phone, etc
genre - contains columns like RespondentID (PK), Horror, etc
platform - contains columns like RespondentID (PK), TV, etc.
I want to join all three tables together at some point so we can get a better understanding of someone.
Here are my questions:
How can I eliminate duplicate respondents already in my database? (I can tell if they are duplicates because they will have the same Email value.)
Given my current architecture, how can I transition my database to have Email as the primary key without messing up my data?
After transitioning to a new architecture, what is the process I can use to delete duplicates in my Craigslist ad spreadsheet before I append them to Demo, Genre, and Platform tables?
Here are my ideas about solutions:
Create backup tables. Join the three tables and export the big table to Excel. In Excel, use Data Filtering and Conditional Formatting to find the duplicate entries, and delete them by hand. Unfortunately, I have 20,000 records, which will crash Excel. :( The chief issue is that I don't know how to remove duplicate entries within a table using SQL. (Also, if I have two entries by bobdole@republican.com, one entry should remain.) Can you come up with a smarter solution involving SQL and Access?
After each Email record is unique, I will create new tables with each using Email as the primary key.
When I want to remove duplicates within the data I'd like to import, I should be able to easily do it within Excel. Next, I will use this SQL command to deduplicate between the current database and the incoming data:
DELETE * FROM newParticipantsList
WHERE Email IN (SELECT Email FROM Demo)
I'm going to try to duplicate my current architecture in a small test table in Access and see if I can figure it out. Overall, I don't have much experience with joining tables and removing data in SQL, so it's a little scary.

Maybe I'm just being thick, but why don't you just create a new Identity column in the existing table? You can always remove those records you deem duplicates, but the Identity column is guaranteed to be unique under all circumstances.
It will be up to you to make sure that any new records inserted into the table are not duplicates, by checking the Email column.

To remove duplicates from demographic table you could do something like:
WITH RecordsToKeep AS (
    SELECT MIN(RespondentID) AS RespondentID
    FROM demographic
    GROUP BY Email
)
DELETE demographic
FROM demographic
LEFT JOIN RecordsToKeep ON RecordsToKeep.RespondentID = demographic.RespondentID
WHERE RecordsToKeep.RespondentID IS NULL
This will keep the first record for each email address and delete the rest. You will need to remap the genre and platform tables before you delete the source.
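The remapping step could look roughly like the sketch below. It is not the exact Access or SQL Server syntax: it uses Python's built-in sqlite3 module so it can be run as-is, and the id_map helper table, the sample rows, and the example.com addresses are all made up for illustration. It also assumes that when the kept respondent already has a genre row, the duplicate's row can simply be dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demographic (RespondentID INTEGER PRIMARY KEY, Email TEXT, Phone TEXT);
    CREATE TABLE genre (RespondentID INTEGER PRIMARY KEY, Horror INTEGER);
    INSERT INTO demographic VALUES
        (1, 'a@example.com', '111'),
        (2, 'b@example.com', '222'),
        (3, 'a@example.com', '333');          -- duplicate of respondent 1
    INSERT INTO genre VALUES (2, 0), (3, 1);  -- respondent 1 has no genre row yet
""")

with conn:  # one transaction: commit on success, roll back on error
    # Map every duplicate RespondentID to the lowest ID sharing its Email.
    conn.execute("""
        CREATE TEMP TABLE id_map AS
        SELECT d.RespondentID AS old_id, k.keep_id
        FROM demographic AS d
        JOIN (SELECT Email, MIN(RespondentID) AS keep_id
              FROM demographic
              GROUP BY Email) AS k ON k.Email = d.Email
        WHERE d.RespondentID <> k.keep_id
    """)
    # Remap child rows to the kept ID. OR IGNORE (SQLite-specific) skips rows
    # that would collide with a genre row the kept respondent already has.
    conn.execute("""
        UPDATE OR IGNORE genre
        SET RespondentID = (SELECT keep_id FROM id_map
                            WHERE old_id = genre.RespondentID)
        WHERE RespondentID IN (SELECT old_id FROM id_map)
    """)
    # Drop child rows that could not be remapped, then the duplicate parents.
    conn.execute("DELETE FROM genre WHERE RespondentID IN (SELECT old_id FROM id_map)")
    conn.execute("DELETE FROM demographic WHERE RespondentID IN (SELECT old_id FROM id_map)")

remaining = conn.execute(
    "SELECT RespondentID, Email FROM demographic ORDER BY RespondentID").fetchall()
genre_rows = conn.execute(
    "SELECT RespondentID, Horror FROM genre ORDER BY RespondentID").fetchall()
```

The platform table would get the same UPDATE and DELETE pair. Run it against backup copies first, since the order of the steps matters.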
In terms of what to do in the future, you could get SQL to do all the de-duplicating for you by importing the data into a staging table and then importing only distinct records into the final table when the address isn't already in the demographic table.
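A minimal sketch of that staging-table import, again using Python's sqlite3 module rather than Access so that it runs standalone. The staging table name, the sample rows, and the choice of MIN(Phone) to pick one row per address are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demographic (
        RespondentID INTEGER PRIMARY KEY,   -- surrogate key stays the primary key
        Email TEXT,
        Phone TEXT
    );
    CREATE TABLE staging (Email TEXT, Phone TEXT);  -- hypothetical landing table
    INSERT INTO demographic (Email, Phone) VALUES ('a@example.com', '111');
""")

# Raw spreadsheet rows: one address already in the database, one repeated in the batch.
incoming = [
    ("a@example.com", "999"),
    ("b@example.com", "222"),
    ("b@example.com", "223"),
    ("c@example.com", "333"),
]

with conn:
    conn.executemany("INSERT INTO staging (Email, Phone) VALUES (?, ?)", incoming)
    # GROUP BY collapses repeats within the batch (MIN just picks one phone);
    # NOT EXISTS filters out addresses that demographic already holds.
    conn.execute("""
        INSERT INTO demographic (Email, Phone)
        SELECT s.Email, MIN(s.Phone)
        FROM staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM demographic AS d WHERE d.Email = s.Email)
        GROUP BY s.Email
    """)
    conn.execute("DELETE FROM staging")  # empty the landing table for the next batch

emails = [row[0] for row in conn.execute("SELECT Email FROM demographic ORDER BY Email")]
staged = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
```

Because the filtering happens in the INSERT ... SELECT itself, the spreadsheet never has to be cleaned by hand before it is loaded.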
There is no reason to change the Email Address to be the primary key. Strings aren't great primary keys for a number of reasons. The problem you have isn't with duplicate keys; the problem is how you are inserting the data.

Related

How can I insert multiple data from a foreign key?

I am trying to insert data into a table together with data referenced through a foreign key. For example, after a customer signs up, the data (which includes the id, name, and contact) is inserted into the Customers table, and then the Customer ID is also inserted into the QRCode table, since Customer ID is a foreign key. Now my problem is: how do I include the "name" and the "contact" in the QRCode table? Can anyone suggest what I should use?
This is too long for a comment.
In general, you don't. You just include the CustomerId in the two tables. When the time comes and you need the name or other information, you use a join:
select qr.*, c.* -- or whatever columns you want
from qr join
customers c
on qr.customerid = c.customerid;
In general, you want to avoid storing multiple copies of the same information in different tables -- it bloats the database and makes it hard to maintain.
Let me note that the above is a general rule. There are some cases where you might want to copy data in this situation (say, slowly changing dimensions), but as a general rule, data attributes should be stored in only one table, with joins used to combine data from different tables.
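To make the general rule concrete, here is a small runnable sketch using Python's sqlite3 module. The qrid and code columns and the sample customer are invented for the example; only customerid is shared between the two tables, and the name and contact are fetched through the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customerid INTEGER PRIMARY KEY,
        name TEXT,
        contact TEXT
    );
    CREATE TABLE qr (
        qrid INTEGER PRIMARY KEY,
        customerid INTEGER REFERENCES customers(customerid),
        code TEXT                          -- hypothetical payload column
    );
""")

with conn:
    # Sign-up: the customer row is written once...
    cur = conn.execute(
        "INSERT INTO customers (name, contact) VALUES (?, ?)", ("Ana", "555-0100"))
    # ...and the QR row stores only the generated key, not the name or contact.
    conn.execute(
        "INSERT INTO qr (customerid, code) VALUES (?, ?)", (cur.lastrowid, "QR-001"))

# The name and contact come back through the join when they are needed.
rows = conn.execute("""
    SELECT qr.code, c.name, c.contact
    FROM qr
    JOIN customers AS c ON qr.customerid = c.customerid
""").fetchall()
```

If the customer's contact changes later, only the customers row needs updating, and every joined QR row picks up the new value.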

Using a Delete query on a single table when referencing other tables

I want to run a delete query to remove certain data from a table in a Sharepoint list using an MS Access query. However I want to be sure only to delete from a single list based on the values of another table.
The table is TMainData. It consists solely of number fields that reference the key fields in other tables, such as TPrograms, which has a program name, TContacts, which has the point of contact, or TPositionTitle, which has a title like Site Director.
So a TMainData entry looks something like
ProgramID, which links to TPrograms: 4
ContactID, which links to TContacts: 42
PositionTitle, which links to TPositionTitle: 3
This tells me that the Site Director (TPositionTitle 3) of Anesthesiology (ProgramID 4) is John Smith (ContactID 42).
Here's where it gets tricky: I have a reference under TPrograms to TProgramType. I want to delete all records under TMainData where they link to a certain Program Type, because that program type is going away. HOWEVER... I don't want to delete the program itself (yet), just the lines referencing that program in TMainData.
The "manual" way I see to do this is to run queries that identify what the ProgramIDs are of the programs I want to delete the contacts for, and then use those IDs in a delete query that only references the TMainData query. I'm wondering if there's a way to use referential data, because I may have to be running some ridiculous update queries at a later time that would need this same info.
I dug through https://support.office.com/en-us/article/Use-queries-to-delete-one-or-more-records-from-a-database-A323BF1A-C9B4-4C86-8719-BE58BDF1B10C but it doesn't seem to cover deleting from one table based on values referenced in another table.
You already seem to understand what you need to do to achieve the desired result when you state:
...run queries that identify what the ProgramIDs are of the programs I want to delete the contacts for, and then use those IDs in a delete query that only references the TMainData query.
If I've understood your description correctly, I would suggest something along the lines of:
delete from tmaindata
where
tmaindata.programid in
(
select tprograms.programid
from tprograms
where tprograms.tprogramtype = 'YourProgramType'
)
Always take a backup of your data before running delete queries - there is no undo.
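One way to try this safely is against a throwaway copy, as in the sketch below. It uses Python's sqlite3 module instead of Access, and the column layout, the program type names ('Residency', 'Fellowship'), and the sample rows are assumptions. The same subquery is run as a SELECT first to preview how many rows the DELETE would hit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tprograms (programid INTEGER PRIMARY KEY, programname TEXT, tprogramtype TEXT);
    CREATE TABLE tmaindata (programid INTEGER, contactid INTEGER, positiontitle INTEGER);
    INSERT INTO tprograms VALUES
        (4, 'Anesthesiology', 'Residency'),
        (5, 'Cardiology', 'Fellowship');
    INSERT INTO tmaindata VALUES (4, 42, 3), (5, 43, 3), (5, 44, 1);
""")

retiring_type = "Fellowship"  # hypothetical program type that is going away
subquery = "SELECT programid FROM tprograms WHERE tprogramtype = ?"

# Preview: count the rows the delete would remove before running it.
doomed = conn.execute(
    f"SELECT COUNT(*) FROM tmaindata WHERE programid IN ({subquery})",
    (retiring_type,)).fetchone()[0]

with conn:  # commit on success, roll back on error
    conn.execute(
        f"DELETE FROM tmaindata WHERE programid IN ({subquery})",
        (retiring_type,))

left = conn.execute("SELECT programid, contactid, positiontitle FROM tmaindata").fetchall()
programs = conn.execute("SELECT COUNT(*) FROM tprograms").fetchone()[0]
```

Only tmaindata loses rows; tprograms is read by the subquery but never modified, so the programs themselves stay in place.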

Sqlite what's the right way to make a table holding data for several tables?

I need a table Attachments that stores data for a number of other tables such as Notes and Projects (+many more) with the following properties:
any other table can have many attachments
I frequently need to find all attachments for specific entries of another table (by their primary key in that table)
I've seen in other answers to similar questions that it's best to create the attachments table and then tables like NotesAttachments, ProjectsAttachments, etc. with the Attachment, Notes and Projects IDs as foreign keys. But that looks like complex overengineering to me.
What about directly storing the table Name itself as TEXT column in Attachments and use that name to look for the attachments of one table whenever I need them? So basically the plan is to query for (TableName, ForeignID) to obtain all attachments with integer id ForeignID in table TableName.
Is that problematic, and if so, why?
The Attachments table now has a key consisting of two columns (not a primary key, since one entry can have many attachments); this implies that all lookups search those two columns, and that you need a two-column index for these searches to be efficient.
But multi-column keys are a common feature in SQL databases, and are perfectly fine.
You might optimize the TableName column to store a short value, such as a single character, or a number. But the difference is probably not noticeable unless you have a really large amount of data.
(If you could ensure that the Notes/Projects/etc. use primary keys that are unique over all these tables, then you might be able to avoid storing the table name.)
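A minimal sketch of that layout, run through Python's sqlite3 module. The AttachmentID and FilePath columns, the index name, and the sample rows are hypothetical; the two-column index on (TableName, ForeignID) is the part that matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Attachments (
        AttachmentID INTEGER PRIMARY KEY,
        TableName TEXT NOT NULL,           -- which table owns the attachment
        ForeignID INTEGER NOT NULL,        -- primary key of the owning row
        FilePath TEXT                      -- hypothetical payload column
    );
    -- Two-column index so (TableName, ForeignID) lookups avoid a full scan.
    CREATE INDEX idx_attachments_owner ON Attachments (TableName, ForeignID);
    INSERT INTO Attachments (TableName, ForeignID, FilePath) VALUES
        ('Notes', 1, 'a.pdf'),
        ('Notes', 1, 'b.png'),
        ('Projects', 1, 'plan.docx');
""")

LOOKUP = "SELECT FilePath FROM Attachments WHERE TableName = ? AND ForeignID = ?"

def attachments_for(table_name, foreign_id):
    """Return the file paths attached to one row of one owning table."""
    rows = conn.execute(LOOKUP + " ORDER BY AttachmentID", (table_name, foreign_id))
    return [row[0] for row in rows]

note_files = attachments_for("Notes", 1)

# Ask SQLite how it would run the lookup; the last column describes each step.
plan = conn.execute("EXPLAIN QUERY PLAN " + LOOKUP, ("Notes", 1)).fetchall()
uses_index = any("idx_attachments_owner" in row[-1] for row in plan)
```

Note that SQLite cannot enforce a foreign key constraint on ForeignID in this design, because the referenced table varies per row; keeping the references valid is up to the application.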

Adding record with new foreign key

I have a few tables to store company information in my database, but I want to focus on two of them. One table, Company, contains CompanyID, which is autoincremented, and some other columns that are irrelevant for now. The problem is that companies use different versions of names (e.g. IBM vs. International Business Machines) and I want/need to store them all for further use, so I can't keep names in the Company table. Therefore I have another table, CompanyName, that uses CompanyID as a foreign key (it's a one-to-many relation).
Now, I need to import some new companies, and I have names only. Therefore I want to add them to CompanyName table, but create new records in Company table immediately, so I can put right CompanyID in CompanyName table.
Is it possible with one query? How to approach this problem properly? Do I need to go as far as writing VBA procedure to add records one by one?
I searched Stack and other websites, but I didn't find any solution for my problem, and I can't figure it out myself. I guess it could be done with form and subform, but ultimately I want to put all my queries in macro, so data import would be done automatically.
I'm not database expert, so maybe I just designed it badly, but I didn't figure out another way to cleanly store multiple names of the same entity.
The table structure you set up appears to be a good way to do this. But there's no way to insert records into both tables at the same time. One option is to write two queries that insert records into Company and then into CompanyName. After inserting records into Company, you will need a query that joins the source table to the Company table on a field that uniquely identifies the record besides the autoincrement key. That will allow you to get the key field from Company for use when you insert into CompanyName.
The other option is to write some VBA code that loops through the source data, inserting records into both tables. This would be preferable since it should be more reliable.
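The loop option has roughly the shape shown below. This is a sketch in Python with the sqlite3 module rather than VBA, so the calls differ from what Access would use; the Country and CompanyNameID columns and the sample names are assumptions. Each new Company row's generated key is read back immediately and reused for the CompanyName row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Company (
        CompanyID INTEGER PRIMARY KEY AUTOINCREMENT,
        Country TEXT                       -- stand-in for the "other columns"
    );
    CREATE TABLE CompanyName (
        CompanyNameID INTEGER PRIMARY KEY,
        CompanyID INTEGER NOT NULL REFERENCES Company(CompanyID),
        Name TEXT NOT NULL
    );
""")

new_names = ["IBM", "Acme Corp"]  # hypothetical import list: names only

with conn:  # one transaction, so a failure leaves neither table half-filled
    for name in new_names:
        # Create the parent row first; the database generates CompanyID.
        cur = conn.execute("INSERT INTO Company (Country) VALUES (NULL)")
        # lastrowid is the key generated by the INSERT that just ran.
        conn.execute(
            "INSERT INTO CompanyName (CompanyID, Name) VALUES (?, ?)",
            (cur.lastrowid, name))

pairs = conn.execute("""
    SELECT c.CompanyID, n.Name
    FROM Company AS c
    JOIN CompanyName AS n ON n.CompanyID = c.CompanyID
    ORDER BY c.CompanyID
""").fetchall()
```

Adding an alternative name later is then a single insert into CompanyName with the existing CompanyID.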

MSSQL insert rows into relational tables.

I have two tables, Person and Phones. Many phone numbers can be associated with one person through a foreign key. If I want to add a phone number and map it to a particular person, what should my SQL look like?
In my understanding:
The SQL should be transactional: first I have to insert the person into the Person table, and then insert the phone number into Phones and map it to the row just inserted into Person.
What if row is already exist in one of another table? How should I handle it?
I am looking for a clean and simple solution or SQL example.
Note: I don't have access for creating stored procedures.
If you're inserting a new Person with new Phones, then you would
Insert into the Person table.
Use SCOPE_IDENTITY() to get the ID which was just generated on that insert. (That is the SQL Server function; LAST_INSERT_ID() is its MySQL counterpart.)
Use that ID to insert records into the Phones table.
If you're inserting new Phones for an existing Person, then you would
Select the Person to get its ID if you don't already have it.
Use that ID to insert records into the Phones table.
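Both sequences can be sketched as follows. This uses Python's sqlite3 module rather than SQL Server, so cursor.lastrowid stands in for the last-inserted-ID function named in the steps above; the column names, the two helper functions, and the lookup of a person by name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Phones (
        PhoneID INTEGER PRIMARY KEY,
        PersonID INTEGER NOT NULL REFERENCES Person(PersonID),
        Number TEXT NOT NULL
    );
""")

def add_person_with_phones(name, numbers):
    """New Person plus new Phones, written as a single transaction."""
    with conn:  # commit on success, roll back if any insert raises
        cur = conn.execute("INSERT INTO Person (Name) VALUES (?)", (name,))
        person_id = cur.lastrowid  # ID generated by the insert above
        conn.executemany(
            "INSERT INTO Phones (PersonID, Number) VALUES (?, ?)",
            [(person_id, number) for number in numbers])
    return person_id

def add_phone_to_existing(name, number):
    """New phone for an existing Person: look the ID up, then insert."""
    row = conn.execute(
        "SELECT PersonID FROM Person WHERE Name = ?", (name,)).fetchone()
    if row is None:
        raise LookupError(f"no Person named {name!r}")
    with conn:
        conn.execute(
            "INSERT INTO Phones (PersonID, Number) VALUES (?, ?)", (row[0], number))
    return row[0]

pid = add_person_with_phones("Ana", ["555-0100", "555-0101"])
add_phone_to_existing("Ana", "555-0102")
phone_count = conn.execute(
    "SELECT COUNT(*) FROM Phones WHERE PersonID = ?", (pid,)).fetchone()[0]
```

No stored procedure is involved: each helper just sends plain INSERT and SELECT statements inside a transaction.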
What if row is already exist in one of another table? How should I handle it?
Define "already exists" in this context. What defines uniqueness in your data? In cases like this you may want to consider incorporating that definition of uniqueness into the primary key in that table. (Which can be composed of more than one column.) Otherwise you'll have to SELECT from the table to see if the row already exists. If it does, update it. If it doesn't, insert it. (Or however you want to handle already-existing data logically in your domain.)
Keep in mind that it's easy to go overboard with uniqueness in cases like this. For example, you might be tempted to try to create a many-to-many relationship between these tables so that you can avoid having duplicate phone numbers. In real world scenarios this ends up being a bad idea because it's possible that:
Two people share the same phone number.
One of those two people changes his/her number, but the other one doesn't.
In an overly-normalized scenario, the above events would result in one of the following:
Both users' phone numbers are updated when only one of them actually updates it, resulting in incorrect data for the other user.
You have to write overly-complicated code to check for this scenario and create a new record (disassociating the previous many-to-many relationship), resulting in a lot of unnecessary code and points of failure.