Checking the integrity of the data for an entity


I have three tables STUDENT, DEPARTMENT and COURSE in a University database...
STUDENT has a UID as a Primary key -> which is the UNIQUE ID of the student
DEPARTMENT has Dept_id as a Primary Key -> which is the Dept. number
COURSE has C_id as Primary Key -> which is the Course/subject Id
I need to store marks in a table by relating the primary key of STUDENT, DEPARTMENT and COURSE for each student in each course.
UID Dept_id C_id marks
1 CS CS01 98
1 CS CS02 96
1 ME ME01 88
1 ME ME02 90
The problem is that if I create a marks table like this, the data operator might insert a wrong combination of keys for a student, for example:
UID Dept_id C_id marks
1 CS CS01 98
1 CS CS02 96
1 ME CS01 88 //wrong C_id (course id) inputted by the DBA
1 ME ME02 90
In that case, how can I prevent him from doing this?
Also, is there any other way to store marks for each student? I mean something like:
UID Dept_id CS01 CS02
1 CS 98 96
3 CS 95 92

You should avoid duplicating data in your database if possible:
UID Dept_id C_id marks
1 CS CS01 98
^^ ^^
You could:
Change the course ID to a two column key (department, course number), eg ('CS', '01').
or:
Keep the course name as it is, but put the department ID field in the course table and omit it from your marks table. If you need to calculate the total marks for a specific department you can still do this easily by adding a JOIN to your query.
Your last suggestion seems to be a bad idea. You would need a column in your table for every course and most values would be NULL.
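A minimal sketch of the second option, using Python's sqlite3 module only so that it can be run as-is. The table and column names follow the question, but the exact schema is an assumption: the department lives on the course row, the marks table omits it, and department totals come from a JOIN.

```python
import sqlite3

def department_totals():
    """Total marks per department, derived through a JOIN on course."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript("""
        -- Assumed schema: dept_id is stored once per course, not per mark.
        CREATE TABLE course (
            c_id    TEXT PRIMARY KEY,
            dept_id TEXT NOT NULL
        );
        CREATE TABLE marks (
            uid   INTEGER NOT NULL,
            c_id  TEXT    NOT NULL REFERENCES course(c_id),
            marks INTEGER NOT NULL,
            PRIMARY KEY (uid, c_id)
        );
        INSERT INTO course VALUES
            ('CS01', 'CS'), ('CS02', 'CS'), ('ME01', 'ME'), ('ME02', 'ME');
        INSERT INTO marks VALUES
            (1, 'CS01', 98), (1, 'CS02', 96), (1, 'ME01', 88), (1, 'ME02', 90);
    """)
    rows = conn.execute("""
        SELECT c.dept_id, SUM(m.marks)
        FROM marks AS m
        JOIN course AS c ON c.c_id = m.c_id
        GROUP BY c.dept_id
        ORDER BY c.dept_id
    """).fetchall()
    conn.close()
    return rows

print(department_totals())
```

Because a mark no longer carries its own department, a row such as (1, ME, CS01) cannot be expressed at all.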

I'm not sure why you need the department in this table if the course indicates the department. Thus, why wouldn't your table be:
UID C_id marks
1 CS01 98
1 CS02 96
1 ME01 88
1 ME02 90
What is missing from this table is some indication of time. For example, a student could take the same course twice if they failed it the first time. Thus, you would need additional columns to indicate the semester and year.
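As a rough illustration of adding a time dimension, the sketch below (Python's sqlite3, chosen only to make it runnable) puts hypothetical year and semester columns into the key. The column names and the sample marks are assumptions, not part of the question.

```python
import sqlite3

def retake_demo():
    """Show that a retake in a later semester is allowed, a true duplicate is not."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE marks (
            uid      INTEGER NOT NULL,
            c_id     TEXT    NOT NULL,
            year     INTEGER NOT NULL,   -- hypothetical column
            semester INTEGER NOT NULL,   -- hypothetical column
            marks    INTEGER NOT NULL,
            PRIMARY KEY (uid, c_id, year, semester)
        )
    """)
    # Made-up data: the student fails CS01, then retakes it the next semester.
    conn.execute("INSERT INTO marks VALUES (1, 'CS01', 2012, 1, 35)")
    conn.execute("INSERT INTO marks VALUES (1, 'CS01', 2012, 2, 98)")
    try:
        # Same student, course, year and semester: rejected by the primary key.
        conn.execute("INSERT INTO marks VALUES (1, 'CS01', 2012, 2, 99)")
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True
    count = conn.execute("SELECT COUNT(*) FROM marks").fetchone()[0]
    conn.close()
    return count, duplicate_rejected

print(retake_demo())
```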

Your suggestion would be a nightmare to maintain. You would have to add new columns every time a new course was added to the schedule. It would also be harder to query much of the time.
If you want to make sure that each course is appropriate for the department, you can do that in a trigger (make sure to handle multiple record inserts or updates) or in the application. This still won't prevent all data entry errors (it is possible to pick CS89 when you meant CS98), but it will reduce the amount of error. In this case it is unlikely the data would come from anywhere other than the application, so I'd probably choose to enforce the rules in the application. A pull down list where they chose the department and only the courses for that department showed would do the trick.

You could add foreign key constraints to your tables to ensure that a valid value is entered for student IDs, course IDs and department IDs. You could also add unique constraints to the table to ensure inadvertent duplicates were not created. But in the end you can't prevent incorrect data from being inserted; if you knew it was incorrect, you wouldn't need to ask for it.
Example: 29th February 1957 couldn't be my birthday; 15th July 2025 couldn't be my birthday; 27th September 1974 wasn't my birthday.
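A sketch of what such constraints can and cannot catch, written against SQLite through Python's sqlite3 so it runs anywhere. The schema is an assumption built from the question's column names; the composite foreign key on (dept_id, c_id) is one possible way to tie a course to its department and is not something the question's tables are known to have.

```python
import sqlite3

def try_insert(conn, row):
    """Return True if the marks row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO marks (uid, dept_id, c_id, marks) VALUES (?, ?, ?, ?)", row
        )
        return True
    except sqlite3.IntegrityError:
        return False

def constraint_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript("""
        CREATE TABLE student    (uid INTEGER PRIMARY KEY);
        CREATE TABLE department (dept_id TEXT PRIMARY KEY);
        CREATE TABLE course (
            c_id    TEXT PRIMARY KEY,
            dept_id TEXT NOT NULL REFERENCES department(dept_id),
            UNIQUE (dept_id, c_id)           -- target of the composite FK below
        );
        CREATE TABLE marks (
            uid     INTEGER NOT NULL REFERENCES student(uid),
            dept_id TEXT    NOT NULL,
            c_id    TEXT    NOT NULL,
            marks   INTEGER NOT NULL,
            UNIQUE (uid, c_id),              -- no inadvertent duplicates
            FOREIGN KEY (dept_id, c_id) REFERENCES course(dept_id, c_id)
        );
        INSERT INTO student VALUES (1);
        INSERT INTO department VALUES ('CS'), ('ME');
        INSERT INTO course VALUES ('CS01', 'CS'), ('ME01', 'ME');
    """)
    results = {
        "valid": try_insert(conn, (1, "CS", "CS01", 98)),
        "wrong_department": try_insert(conn, (1, "ME", "CS01", 88)),
        "duplicate": try_insert(conn, (1, "CS", "CS01", 50)),
        "unknown_student": try_insert(conn, (2, "ME", "ME01", 70)),
    }
    conn.close()
    return results

print(constraint_demo())
```

A mark entered as 89 instead of 98 still passes every check here, which is the limit described above.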

Related

How to structure DBT tables with cyclical dependencies

I have one table containing my members.
customer_id name age
1 John 74
2 Sarah 87
Everyday, I get a new table containing the current members.
If a new member has joined, I want to add them.
If a member has left, I want to nullify their name/id
If a current member is still a member then I want to keep them as is.
Imagine that I get a new upload with the following rows
customer_id name age
2 Sarah 87
3 Melvin 23
I then want to generate the table
customer_id name age
Null Null 74
2 Sarah 87
3 Melvin 23
I don't want to nullify anything by mistake and therefore I want to run a few tests on this table before I replace my old one. The way I've done this is by creating a temporary table (let's call it customer_temp). However, I've now created a cyclical dependency since I:
Need to read the live table customer in order to create the customer_temp
Need to replace the live table customer with customer_temp after I've run my tests
Is there any way I can do this using dbt?

Destroying data is tricky. I would avoid that unless it's necessary (e.g., DSAR compliance).
Assuming the new data is loaded into the same table in your database each day, I think this is a perfect candidate for snapshots, with the option for invalidating hard-deleted records. See the docs. This allows you to capture the change history of a table without any data loss.
If you turned on snapshots in the initial state, your snapshot table would look like (assuming the existing records had a timestamp of 1/1):
customer_id  name   age  valid_from  valid_to
1            John   74   1/1/2022
2            Sarah  87   1/1/2022
Then, after the source table was updated, re-running dbt snapshot (today) would create this table:
customer_id  name    age  valid_from  valid_to
1            John    74   1/1/2022    5/12/2022
2            Sarah   87   1/1/2022
3            Melvin  23   5/12/2022
You can create the format you'd like with a simple query:
select
case when valid_to is null then customer_id else null end as customer_id,
case when valid_to is null then name else null end as name,
age
from {{ ref('my_snapshot') }}
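To see what that query returns, here is a small stand-alone check using Python's sqlite3. It is only an approximation: the Jinja ref is replaced by a plain table named my_snapshot, the rows are typed in by hand from the example above, and the valid_from/valid_to column names are taken from this answer rather than from any particular dbt version.

```python
import sqlite3

def current_members_view():
    """Run the CASE query against a hand-built copy of the example snapshot."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE my_snapshot (
            customer_id INTEGER,
            name        TEXT,
            age         INTEGER,
            valid_from  TEXT,
            valid_to    TEXT      -- NULL means the row is still current
        );
        INSERT INTO my_snapshot VALUES
            (1, 'John',   74, '1/1/2022',  '5/12/2022'),
            (2, 'Sarah',  87, '1/1/2022',  NULL),
            (3, 'Melvin', 23, '5/12/2022', NULL);
    """)
    rows = conn.execute("""
        SELECT
            CASE WHEN valid_to IS NULL THEN customer_id ELSE NULL END AS customer_id,
            CASE WHEN valid_to IS NULL THEN name ELSE NULL END AS name,
            age
        FROM my_snapshot
        ORDER BY rowid
    """).fetchall()
    conn.close()
    return rows

for row in current_members_view():
    print(row)
```

The member who left keeps only the age value, while the snapshot table itself still holds the full history.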

insert data into tables where ids need to be equal

I have two tables, customer and order each with two records for 2020. The ******* starred values are what I want to add for FY 2021.
Customer:
ID   FY    Name
1    2020  Tina Smith
2    2020  Bobby Brown
134  2021  Tina Smith***
234  2021  Bobby Brown***
Order
ID   2digitFY  Food     Drink
1    20        Hot Dog  Water
2    20        Burger   Soda
134  21        Hot Dog  Water***
234  21        Burger   Soda ***
I want to add records to both tables that contain the same data as FY 2020/20, just with new sequence numbers and the year 2021/21 (the starred data above). I can't figure out how I would make the new ids equal when they auto generate. Below is code similar to what I have set up (fake data used above).
insert into customer (id, fy, name)
select id, '2021', name
from customer
where fy = '2020';
insert into order (id, 2digitFY, food, drink)
select id, '21', food, drink
from order
where 2digitFY = '20';

I can't figure out how I would make the new ids equal when they auto generate.
If you mean that those columns are primary keys whose values are generated automatically, then you don't have control over them; Oracle does.
I presume that "auto generate" means an identity column whose value is generated automatically. If so, modify it so that it uses the GENERATED BY DEFAULT ON NULL option. That means that if you don't provide an ID value, Oracle will generate it, but if you do provide one, the stored value will be the one you inserted.
Similarly, if you're on 11g or lower (where identity columns didn't exist) and created those values by database triggers, make sure that they fire and populate ID columns only when their values are NULL.
If you do that, then you'll be able to create your own ID values and insert them as you wish.
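The idea can be sketched outside Oracle. The example below uses SQLite through Python's sqlite3, where an INTEGER PRIMARY KEY column behaves in a comparable way: a value is generated when none is supplied, and an explicit value is accepted as given. It is an analogy, not Oracle syntax. The table name orders and the column fy2 are stand-ins for the question's order and 2digitFY, and the sketch assumes each order row shares its id with its customer row.

```python
import sqlite3

def copy_fiscal_year():
    """Copy FY 2020 rows to FY 2021 so that both tables get the same new ids."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, fy TEXT, name TEXT);
        CREATE TABLE orders   (id INTEGER PRIMARY KEY, fy2 TEXT, food TEXT, drink TEXT);
        INSERT INTO customer VALUES (1, '2020', 'Tina Smith'), (2, '2020', 'Bobby Brown');
        INSERT INTO orders VALUES (1, '20', 'Hot Dog', 'Water'), (2, '20', 'Burger', 'Soda');
    """)
    old_rows = conn.execute(
        "SELECT id, name FROM customer WHERE fy = '2020' ORDER BY id"
    ).fetchall()
    for old_id, name in old_rows:
        # No id supplied: the database generates one.
        cur = conn.execute(
            "INSERT INTO customer (fy, name) VALUES ('2021', ?)", (name,)
        )
        new_id = cur.lastrowid
        # Id supplied explicitly: the order copy reuses the customer's new id.
        conn.execute(
            "INSERT INTO orders (id, fy2, food, drink) "
            "SELECT ?, '21', food, drink FROM orders WHERE id = ?",
            (new_id, old_id),
        )
    conn.commit()
    rows = conn.execute("""
        SELECT c.id, c.name, o.food, o.drink
        FROM customer AS c
        JOIN orders AS o ON o.id = c.id
        WHERE c.fy = '2021'
        ORDER BY c.id
    """).fetchall()
    conn.close()
    return rows

print(copy_fiscal_year())
```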

How do I make a query for if value exists in row add a value to another field?

I have a database on access and I want to add a value to a column at the end of each row based on which hospital they are in. This is a separate value. For example - the hospital called "St. James Hospital" has the id of "3" in a separate field. How do I do this using a query rather than manually going through a whole database?
example here

Not the best solution, but you can do something like this:
create table new_table as
select id, case when hospital = "St. James Hospital" then 3 else null end as hospital_id
from old_table
Or, the better option would be to create a table with the columns hospital_name and hospital_id. You can then create a foreign key relationship that will create the mapping for you, and enforce data integrity. A join across the two tables will produce what you want.
Read about this here:
http://net.tutsplus.com/tutorials/databases/sql-for-beginners-part-3-database-relationships/

The answer to your question is a JOIN + UPDATE. I am fairly sure that a quick search would have turned up the question linked below.
Access DB update one table with value from another
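A rough sketch of updating one table from another, run against SQLite via Python's sqlite3. Access would normally express this as an UPDATE with an INNER JOIN; SQLite is used here only so the example is runnable, and it uses a correlated subquery instead. The staff and hospitals tables and their contents are hypothetical.

```python
import sqlite3

def fill_hospital_ids():
    """Copy hospital_id from a lookup table into every matching staff row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical lookup table: one row per hospital.
        CREATE TABLE hospitals (
            hospital_id   INTEGER PRIMARY KEY,
            hospital_name TEXT UNIQUE
        );
        -- Hypothetical main table that currently stores only the name.
        CREATE TABLE staff (
            id          INTEGER PRIMARY KEY,
            hospital    TEXT,
            hospital_id INTEGER
        );
        INSERT INTO hospitals VALUES (1, 'Beaumont'), (3, 'St. James Hospital');
        INSERT INTO staff (id, hospital) VALUES
            (1, 'St. James Hospital'), (2, 'Beaumont'), (3, 'Unknown Clinic');
    """)
    conn.execute("""
        UPDATE staff
        SET hospital_id = (
            SELECT h.hospital_id
            FROM hospitals AS h
            WHERE h.hospital_name = staff.hospital
        )
    """)
    conn.commit()
    rows = conn.execute("SELECT id, hospital_id FROM staff ORDER BY id").fetchall()
    conn.close()
    return rows

print(fill_hospital_ids())
```

Rows whose hospital name has no match in the lookup table end up with a NULL id, which makes misspelled names easy to find afterwards.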

You could do this:
update yourTable
set yourFinalColumnWhateverItsNameIs = {your desired value}
where someColumn = 3
Every row in the table that has a 3 in the someColumn column will then have that final column set to your desired value.
If this isn't what you want, please make your question clearer. Are you trying to put the name of the hospital into this table? If so, that is not a good idea and there are better ways to accomplish that.
Furthermore, if every row with a certain value (3) gets this value, you could simply add it to the other (i.e. Hospitals) table. No need to repeat it everywhere in the table that points back to the Hospitals table.
P.S. Here's an example of what I meant:
Let's say you have two tables
HOSPITALS (id, name, city, state)
BIRTHS (id, hospitalid, babysname, gender, mothersname, fathername)
You could get a baby's city of birth without having to include the City column in the Births table, simply by joining the tables on hospitals.id = births.hospitalid.

After examining your ACCDB file, I suggest you consider setting up the tables differently.
Table Health_Professionals:
ID First Name Second Name Position hospital_id
1 John Doe PI 2
2 Joe Smith Co-PI 1
3 Sarah Johnson Nurse 3
Table Hospitals:
hospital_id Hospital
1 Beaumont
2 St James
3 Letterkenny Hosptial
A key point is to avoid storing both the hospital ID and name in the Health_Professionals table. Store only the ID. When you need to see the name, use the hospital ID to join with the Hospitals table and get the name from there.
A useful side effect of this design is that if anyone ever misspells a hospital name, e.g. "Hosptial", you need to correct that error in only one place. The same holds true whenever a hospital is intentionally renamed.
Based on those tables, the query below returns this result set.
ID Second Name First Name Position hospital_id Hospital
1 Doe John PI 2 St James
3 Johnson Sarah Nurse 3 Letterkenny Hosptial
2 Smith Joe Co-PI 1 Beaumont
SELECT
hp.ID,
hp.[Second Name],
hp.[First Name],
hp.Position,
hp.hospital_id,
h.Hospital
FROM
Health_Professionals AS hp
INNER JOIN Hospitals AS h
ON hp.hospital_id = h.hospital_id
ORDER BY
hp.[Second Name],
hp.[First Name];

UPDATE query that fixes orphaned records

I have an Access database that has two tables related by PK/FK. Unfortunately, the tables have allowed duplicate/redundant records, which has made the database a bit screwy. I am trying to figure out a SQL statement that will fix the problem.
To better explain the problem and goal, I have created example tables to use as reference:
alt text http://img38.imageshack.us/img38/9243/514201074110am.png
You'll notice there are two tables, a Student table and a TestScore table where StudentID is the PK/FK.
The Student table contains duplicate records for students John, Sally, Tommy, and Suzy. In other words the John's with StudentID's 1 and 5 are the same person, Sally 2 and 6 are the same person, and so on.
The TestScore table relates test scores with a student.
Ignoring how/why the Student table allowed duplicates, etc - The goal I'm trying to accomplish is to update the TestScore table so that it replaces the StudentID's that have been disabled with the corresponding enabled StudentID. So, all StudentID's = 1 (John) will be updated to 5; all StudentID's = 2 (Sally) will be updated to 6, and so on. Here's the resultant TestScore table that I'm shooting for (Notice there is no longer any reference to the disabled StudentID's 1-4):
alt text http://img163.imageshack.us/img163/1954/514201091121am.png
Can you think of a query (compatible with MS Access's JET Engine) that can accomplish this goal? Or, maybe, you can offer some tips/perspectives that will point me in the right direction.
Thanks.

The only way to do this is through a series of queries and temporary tables.
First, I would create the following Make Table query that you would use to create a mapping of the bad StudentID to correct StudentID.
Select S1.StudentId As NewStudentId, S2.StudentId As OldStudentId
Into zzStudentMap
From Student As S1
Inner Join Student As S2
On S2.Name = S1.Name
Where S1.Disabled = False
And S2.StudentId <> S1.StudentId
And S2.Disabled = True
Next, you would use that temporary table to update the TestScore table with the correct StudentID.
Update TestScore
Inner Join zzStudentMap
On zzStudentMap.OldStudentId = TestScore.StudentId
Set TestScore.StudentId = zzStudentMap.NewStudentId

The most common technique to identify duplicates in a table is to group by the fields that represent duplicate records:
ID FIRST_NAME LAST_NAME
1 Brian Smith
3 George Smith
25 Brian Smith
In this case we want to remove one of the Brian Smith Records, or in your case, update the ID field so they both have the value of 25 or 1 (completely arbitrary which one to use).
SELECT min(id) AS id, first_name, last_name
FROM example
GROUP BY first_name, last_name
Using min on ID will return:
ID FIRST_NAME LAST_NAME
1 Brian Smith
3 George Smith
If you use max you would get
ID FIRST_NAME LAST_NAME
25 Brian Smith
3 George Smith
I usually use this technique to delete the duplicates, not update them:
DELETE FROM example
WHERE ID NOT IN (SELECT MAX (ID)
FROM example
GROUP BY first_name, last_name)
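When other tables point at the duplicated rows, the references have to be moved before the duplicates are deleted. The sketch below combines the grouping technique with such an update, using SQLite through Python's sqlite3. The table layout and scores are made up, and keeping the highest id per name is an assumption that happens to match the question, where the enabled StudentIDs are the higher ones.

```python
import sqlite3

def merge_duplicates():
    """Repoint scores at the surviving student row, then delete the duplicates."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE student    (student_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE test_score (score_id INTEGER PRIMARY KEY,
                                 student_id INTEGER, score INTEGER);
        INSERT INTO student VALUES (1, 'John'), (2, 'Sally'), (5, 'John'), (6, 'Sally');
        INSERT INTO test_score VALUES (1, 1, 80), (2, 2, 90), (3, 5, 85);
    """)
    # Step 1: each score moves to the highest id that shares the student's name.
    conn.execute("""
        UPDATE test_score
        SET student_id = (
            SELECT MAX(s2.student_id)
            FROM student AS s1
            JOIN student AS s2 ON s2.name = s1.name
            WHERE s1.student_id = test_score.student_id
        )
    """)
    # Step 2: delete every student row that is not the survivor of its group.
    conn.execute("""
        DELETE FROM student
        WHERE student_id NOT IN (
            SELECT MAX(student_id) FROM student GROUP BY name
        )
    """)
    conn.commit()
    scores = conn.execute(
        "SELECT score_id, student_id, score FROM test_score ORDER BY score_id"
    ).fetchall()
    students = [r[0] for r in conn.execute(
        "SELECT student_id FROM student ORDER BY student_id")]
    conn.close()
    return scores, students

print(merge_duplicates())
```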

How to manage "groups" in the database?

I've asked this question here, but I don't think I got my point across.
Let's say I have the following tables (all PK are IDENTITY fields):
People (PersonId (PK), Name, SSN, etc.)
Loans (LoanId (PK), Amount, etc.)
Borrowers (BorrowerId(PK), PersonId, LoanId)
Let's say Mr. Smith got 2 loans in his name, 3 joint loans with his wife, and 1 joint loan with his mistress. For the purposes of the application I want to GROUP people, so that I can easily single out the loans that Mr. Smith took out jointly with his wife.
To accomplish that I added BorrowerGroup table, now I have the following (all PK are IDENTITY fields):
People (PersonId (PK), Name, SSN, etc.)
Loans (LoanId (PK), Amount, BorrowerGroupId, etc.)
BorrowerGroup(GroupId (PK))
Borrowers (BorrowerId(PK), GroupId, PersonId)
Now Mr. Smith is in 3 groups (himself, him and his wife, him and his mistress) and I can easily lookup his activity in any of those groups.
The problems with new design:
The only way to generate a new BorrowerGroup is by inserting MAX(GroupId)+1 with IDENTITY_INSERT ON, which just doesn't feel right. Also, the notion of a table with 1 column is kind of weird.
I'm a firm believer in surrogate keys, and would like to stick to that design if possible.
This application does not care about individuals, the GROUP is treated as an individual
Is there a better way to group people for the purpose of this application?

You could just remove the table BorrowerGroups - it carries no information. This information is already present via the Loans that People share - I just assume you have a PeopleLoans table.
People         Loans            PeopleLoans
-----------    ------------     -----------
1  Smith       6   S1    60     1   6
2  Wife        7   S2    60     1   7
3  Mistress    8   S+W1  74     1   8
               9   S+W2  74     1   9
               10  S+W3  74     1   10
               11  S+M1  89     1   11
                                2   8
                                2   9
                                2   10
                                3   11
So your BorrowerGroups are actually almost the Loans - 6 and 7 with Smith only, 8 to 10 with Smith and Wife, and 11 with Smith and Mistress. So there is no need for BorrowerGroups in the first place, because they are identical to Loans grouped by the involved People.
But it might be quite hard to retrieve this information efficiently, so you could think about adding a GroupId directly to Loans. Ignoring the second column of Loans (just for readability), the third column should represent your groups. They are redundant, so you have to be careful if you change them.
If you find a good way to derive a unique GroupId from the ids of the involved people, you could make it a computed column. If a string would be okay as a group id, you could just order the ids of the people and concatenate them with a separator.
Group 60 with Smith only would get id '1', group 74 would become 1.2, and group 89 would become 1.3. Not that smart, but unique and easy to compute.
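A minimal sketch of that string key in Python, shown outside the database for clarity. The function names are hypothetical, and the pairs are the PeopleLoans rows from the table above; in practice the same rule would have to be implemented wherever the computed column is defined.

```python
from collections import defaultdict

def group_key(person_ids, sep="."):
    """Build a deterministic key from the sorted, de-duplicated member ids."""
    return sep.join(str(pid) for pid in sorted(set(person_ids)))

def loans_by_group(people_loans):
    """Map each group key to the loans held by exactly that set of people."""
    members = defaultdict(set)
    for person_id, loan_id in people_loans:
        members[loan_id].add(person_id)
    groups = defaultdict(list)
    for loan_id in sorted(members):
        groups[group_key(members[loan_id])].append(loan_id)
    return dict(groups)

# (PersonId, LoanId) pairs copied from the PeopleLoans example.
PEOPLE_LOANS = [
    (1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (1, 11),
    (2, 8), (2, 9), (2, 10),
    (3, 11),
]

print(loans_by_group(PEOPLE_LOANS))
```

Sorting the ids before joining them is what makes the key independent of the order in which borrowers were entered.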

use the original schema:
People (PersonId (PK), Name, SSN, etc.)
Loans (LoanId (PK), Amount, etc.)
Borrowers (BorrowerId(PK), PersonId, LoanId)
just query for the data you need (your example to find husband and wife on same loans):
SELECT
l.*
FROM Borrowers b1
INNER JOIN Borrowers b2 ON b1.LoanId=b2.LoanId
INNER JOIN Loans l ON b1.LoanId=l.LoanId
WHERE b1.PersonId=@HusbandID
AND b2.PersonId=@WifeID
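To try the self-join without a SQL Server instance, the sketch below runs an equivalent query on SQLite via Python's sqlite3. The sample rows are borrowed from the PeopleLoans example in another answer, and only the Borrowers table is created because the loan ids are enough to show the result.

```python
import sqlite3

def joint_loans(person_a, person_b):
    """Return the ids of loans on which both people appear as borrowers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Borrowers (
            BorrowerId INTEGER PRIMARY KEY,
            PersonId   INTEGER NOT NULL,
            LoanId     INTEGER NOT NULL
        );
        -- Sample data: 1 = Smith, 2 = wife, 3 = mistress.
        INSERT INTO Borrowers (PersonId, LoanId) VALUES
            (1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (1, 11),
            (2, 8), (2, 9), (2, 10),
            (3, 11);
    """)
    rows = conn.execute("""
        SELECT b1.LoanId
        FROM Borrowers AS b1
        INNER JOIN Borrowers AS b2 ON b1.LoanId = b2.LoanId
        WHERE b1.PersonId = ?
          AND b2.PersonId = ?
        ORDER BY b1.LoanId
    """, (person_a, person_b)).fetchall()
    conn.close()
    return [loan_id for (loan_id,) in rows]

print(joint_loans(1, 2))
```

Note that this returns every loan that includes both people; a loan with a third borrower would also match, so an exact-group lookup would need an extra check on the borrower count.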

The design of the database seems OK. Why do you have to use MAX(GroupId)+1 when you create a new group? Can't you just create the row and then use SCOPE_IDENTITY() to return the new ID?
e.g.
INSERT INTO BorrowerGroup DEFAULT VALUES;
SELECT SCOPE_IDENTITY();
(See this other question)
(edit to SQL courtesy of this question)
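The same pattern can be tried in SQLite through Python's sqlite3, where the cursor's lastrowid plays a role comparable to SCOPE_IDENTITY(). This is an analogy for illustration, not SQL Server code, and the single-column table mirrors the BorrowerGroup table from the question.

```python
import sqlite3

def create_groups(count):
    """Insert empty rows and collect the ids the database generates for them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE BorrowerGroup (GroupId INTEGER PRIMARY KEY)")
    new_ids = []
    for _ in range(count):
        # DEFAULT VALUES lets the database pick the next key; no MAX()+1 needed.
        cur = conn.execute("INSERT INTO BorrowerGroup DEFAULT VALUES")
        new_ids.append(cur.lastrowid)
    conn.commit()
    conn.close()
    return new_ids

print(create_groups(3))
```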

I would do something more like this:
People (PersonId (PK), Name, SSN, etc.)
Loans (LoanId (PK), Amount, BorrowerGroupId, etc.)
BorrowerGroup(BorrowerGroupId (PK))
PersonBelongsToBorrowerGroup(BorrowerGroupId (PK), PersonId (PK))
I got rid of the Borrowers table. Just store the info in the BorrowerGroup table. That's my preference.

The consensus seems to be to omit the BorrowerGroup table, and I have to agree. Using MAX(GroupId)+1 has all sorts of ACID/transaction issues, which is the main reason why IDENTITY fields exist.
That said; the SQL that KM provided looks good. There are any number of ways to get the same results. Joins, sub-selects and so on. The real issue there... is knowing the dataset. Given the explanation you provided the datasets are going to be very small. That also supports removing the BorrowerGroup table.

I would have a group table and then a groupmembers(borrowers) table to accomplish the many-to-many relationship between loans and people. This allows the tracking of data on the group other than just a list of members (I believe someone else made this suggestion?).
CREATE TABLE LoanGroup
(
ID int NOT NULL
, Group_Name char(50) NULL
, Date_Started datetime NULL
, Primary_ContactID int NULL
, Group_Type varchar(25)
)
