Dynamic Tables? - sql

I have a database that has different grades per course (i.e. three homeworks for Course 1, two homeworks for Course 2, ... ,Course N with M homeworks). How should I handle this as far as database design goes?
CourseID HW1 HW2 HW3
1 100 99 100
2 100 75 NULL
EDIT
I guess I need to rephrase my question. As of right now, I have two tables, Course and Homework. Homework points to Course through a foreign key. My question is how do I know how many homeworks will be available for each class?

No, this is not a good design. It's an antipattern that I called Metadata Tribbles. You have to keep adding new columns for each homework, and they propagate out of control.
It's an example of repeating groups, which violates the First Normal Form of relational database design.
Instead, you should create one table for Courses, and another table for Homeworks. Each row in Homeworks references a parent row in Courses.
You asked: "how do I know how many homeworks will be available for each class?"
You'd add rows for each homework, then you can count them as follows:
SELECT CourseId, COUNT(*) AS Num_HW_Per_Course
FROM Homeworks
GROUP BY CourseId
Of course this only counts the homeworks after you have populated the table with rows. So you (or the course designers) need to do that.
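As a minimal sketch of the two-table design described above, the following uses Python's built-in sqlite3 module. The column names (CourseName, HomeworkId, Grade) and the sample rows are assumptions made for illustration; only Courses, Homeworks, CourseId and the counting query come from the answer.

```python
import sqlite3

# Hypothetical schema: one row per homework, each pointing at its course.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Courses (
        CourseId   INTEGER PRIMARY KEY,
        CourseName TEXT NOT NULL
    );
    CREATE TABLE Homeworks (
        HomeworkId INTEGER PRIMARY KEY,
        CourseId   INTEGER NOT NULL REFERENCES Courses(CourseId),
        Grade      INTEGER
    );
    INSERT INTO Courses VALUES (1, 'Course 1'), (2, 'Course 2');
    INSERT INTO Homeworks (CourseId, Grade) VALUES
        (1, 100), (1, 99), (1, 100),   -- three homeworks for course 1
        (2, 100), (2, 75);             -- two homeworks for course 2
""")

# Same counting query as in the answer, with an ORDER BY for stable output.
hw_counts = conn.execute("""
    SELECT CourseId, COUNT(*) AS Num_HW_Per_Course
    FROM Homeworks
    GROUP BY CourseId
    ORDER BY CourseId
""").fetchall()
print(hw_counts)
```

Adding a fourth homework to a course is then just another INSERT; no column has to be added.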

Decompose the table into three different tables. One holds the courses, the second holds the homeworks, and the third connects them and stores the result.
Course:
CourseID CourseName
1 Foo
Homework:
HomeworkID HomeworkName HomeworkDescription
HW1 Bar ...
Result:
CourseID HomeworkID Result
1 HW1 100

Related

SQL Schema for multiple many-to-many relationships

Consider I have 4 tables
persons
companies
groups
and
bills
Now there is a many-to-many relationship between bills/persons and bills/companies and bills/groups.
I see 4 possibilities for a sql schema for this:
variant 1 (multiple relationship tables)
persons_bills
person_id
bill_id
companies_bills
company_id
bill_id
groups_bills
group_id
bill_id
variant 2 (one relationship table with one id set and all others null)
bills_relations
person_id
company_id
group_id
bill_id
with a check that exactly one of person_id, company_id, or group_id is set and the other two are NULL.
variant 3 (one relationship table with string reference to the other table)
bills_relations
bill_id
row_id
row_table
where row_table can have the string values 'person', 'company', or 'group'.
variant 4 (add a supertype table)
persons
id
debtor_id
companies
id
debtor_id
groups
id
debtor_id
debtors
id
bills_debtors
bill_id
debtor_id
Can you recommend one variant?
I think that variant 1 (multiple relationship tables) and variant 4 (add a supertype table) are the most feasible choices here.
Variant 2 is a much less efficient way to store the data, since every row carries two NULLs alongside the one id that is set.
Variant 3 will get you into a lot of trouble when trying to JOIN between bills and one of the other tables, since you won't be able to do it directly. You'll have to first select the table name from the string reference, and then inject it into a second query. Building SQL from strings like this opens the database up to SQL injection attacks, so it is best avoided if possible.
Variant 1 is probably the best out of 1 and 4 in my opinion, since it will require one less JOIN in your queries and hence make them a little simpler. If all the tables are indexed correctly though, I don't think there should be much difference in performance (or space efficiency) between these two.
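To make the "one less JOIN" point concrete, here is a rough sketch of the same lookup under variant 1 and variant 4, using Python's sqlite3. Only persons is modelled; the amount and name columns and the sample rows are assumptions, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bills   (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE debtors (id INTEGER PRIMARY KEY);
    CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT,
                          debtor_id INTEGER REFERENCES debtors(id));

    -- Variant 1: a dedicated link table per entity type.
    CREATE TABLE persons_bills (
        person_id INTEGER REFERENCES persons(id),
        bill_id   INTEGER REFERENCES bills(id),
        PRIMARY KEY (person_id, bill_id)
    );
    -- Variant 4: one link table that goes through the debtors supertype.
    CREATE TABLE bills_debtors (
        bill_id   INTEGER REFERENCES bills(id),
        debtor_id INTEGER REFERENCES debtors(id),
        PRIMARY KEY (bill_id, debtor_id)
    );

    INSERT INTO bills VALUES (1, 50), (2, 70);
    INSERT INTO debtors VALUES (10);
    INSERT INTO persons VALUES (1, 'Alice', 10);
    INSERT INTO persons_bills VALUES (1, 1), (1, 2);
    INSERT INTO bills_debtors VALUES (1, 10), (2, 10);
""")

# Variant 1: persons -> persons_bills -> bills (two joins).
variant1 = conn.execute("""
    SELECT p.name, b.amount
    FROM persons p
    JOIN persons_bills pb ON pb.person_id = p.id
    JOIN bills b          ON b.id = pb.bill_id
    ORDER BY b.id
""").fetchall()

# Variant 4: persons -> debtors -> bills_debtors -> bills (three joins).
variant4 = conn.execute("""
    SELECT p.name, b.amount
    FROM persons p
    JOIN debtors d        ON d.id = p.debtor_id
    JOIN bills_debtors bd ON bd.debtor_id = d.id
    JOIN bills b          ON b.id = bd.bill_id
    ORDER BY b.id
""").fetchall()
print(variant1, variant4)
```

In this sketch the join through debtors is written out explicitly. It could be skipped by matching bd.debtor_id against p.debtor_id directly, which narrows the difference between the two variants further.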

Purpose of Self-Joins

I am learning to program with SQL and have just been introduced to self-joins. I understand how these work, but I don't understand what their purpose is besides a very specific usage, joining an employee table to itself to neatly display employees and their respective managers.
This usage can be demonstrated with the following table:
EmployeeID | Name | ManagerID
------ | ------------- | ------
1 | Sam | 10
2 | Harry | 4
4 | Manager | NULL
10 | AnotherManager| NULL
And the following query:
select worker.employeeID, worker.name, worker.managerID, manager.name
from employee worker join employee manager
on (worker.managerID = manager.employeeID);
Which would return:
1 | Sam | 10 | AnotherManager
2 | Harry | 4 | Manager
Besides this, are there any other circumstances where a self-join would be useful? I can't figure out a scenario where a self-join would need to be performed.
Your example is a good one. Self-joins are useful whenever a table contains a foreign key into itself. An employee has a manager, and the manager is... another employee. So a self-join makes sense there.
Many hierarchies and relationship trees are a good fit for this. For example, you might have a parent organization divided into regions, groups, teams, and offices. Each of those could be stored as an "organization", with a parent id as a column.
Or maybe your business has a referral program, and you want to record which customer referred someone. They are both 'customers', in the same table, but one has a FK link to another one.
Hierarchies that are not a good fit for this would be ones where an entity might have more than one "parent" link. For example, suppose you had Facebook-style data recording every user and friendship links to other users. That could be made to fit in this model, but then you'd need a new "user" row for every friend that a user had, and every row would be a duplicate except for the "relationshipUserID" column or whatever you called it.
In many-to-many relationships, you would probably have a separate "relationship" table, with a "from" and "to" column, and perhaps a column indicating the relationship type.
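A separate relationship table of that kind might look roughly like this. The table and column names (users, relationships, from_user_id, to_user_id, relationship_type) and the sample data are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    -- One row per link, so a user with many friends is still stored once.
    CREATE TABLE relationships (
        from_user_id      INTEGER NOT NULL REFERENCES users(user_id),
        to_user_id        INTEGER NOT NULL REFERENCES users(user_id),
        relationship_type TEXT NOT NULL,
        PRIMARY KEY (from_user_id, to_user_id, relationship_type)
    );
    INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO relationships VALUES
        (1, 2, 'friend'), (1, 3, 'friend'), (2, 3, 'friend');
""")

# The users table is joined to itself through the relationship table.
friends_of_ann = conn.execute("""
    SELECT u2.name
    FROM users u1
    JOIN relationships r ON r.from_user_id = u1.user_id
    JOIN users u2        ON u2.user_id = r.to_user_id
    WHERE u1.name = 'Ann' AND r.relationship_type = 'friend'
    ORDER BY u2.name
""").fetchall()
print(friends_of_ann)
```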
I found self joins most useful in situations like this:
Get all employees that work for the same manager as Sam. (This does not have to be hierarchical, this can also be: Get all employees that work at the same location as Sam)
select e2.employeeID, e2.name
from employee e1 join employee e2
on (e1.managerID = e2.managerID)
where e1.name = 'Sam'
Self-joins are also useful for finding duplicates in a table, but this can be very inefficient.
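A duplicate check with a self-join could be sketched as follows, reusing the employee table from the question. Treating "same name and same manager" as a duplicate is an assumption, and the extra row for a second Sam is made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        employeeID INTEGER PRIMARY KEY,
        name       TEXT,
        managerID  INTEGER
    );
    INSERT INTO employee VALUES
        (1, 'Sam', 10), (2, 'Harry', 4), (3, 'Sam', 10),
        (4, 'Manager', NULL), (10, 'AnotherManager', NULL);
""")

# e1.employeeID < e2.employeeID keeps a row from matching itself and
# reports each duplicate pair only once.
duplicates = conn.execute("""
    SELECT e1.employeeID, e2.employeeID, e1.name
    FROM employee e1
    JOIN employee e2
      ON e1.name = e2.name
     AND e1.managerID = e2.managerID
     AND e1.employeeID < e2.employeeID
""").fetchall()
print(duplicates)
```

Without an index on the compared columns, the join has to compare every row against every other row, which is where the inefficiency comes from.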
There are several great examples of using self-joins here. The one I often use relates to "timetables". I work with timetables in education, but they are relevant in other cases too.
I use self-joins to work out whether two items clash with one another, e.g. a student is scheduled for two lessons which happen at the same time, or a room is double booked. For example:
CREATE TABLE StudentEvents(
StudentId int,
EventId int,
EventDate date,
StartTime time,
EndTime time
);
SELECT
se1.StudentId,
se1.EventDate,
se1.EventId AS Event1Id,
se1.StartTime AS Event1Start,
se1.EndTime AS Event1End,
se2.EventId AS Event2Id,
se2.StartTime AS Event2Start,
se2.EndTime AS Event2End
FROM
StudentEvents se1
JOIN StudentEvents se2 ON
se1.StudentId = se2.StudentId
AND se1.EventDate = se2.EventDate
AND se1.EventId > se2.EventId
--The above line prevents (a) an event being seen as clashing with itself
--and (b) the same pair of events being returned twice, once as (A,B) and once as (B,A)
WHERE
se1.StartTime < se2.EndTime AND
se1.EndTime > se2.StartTime;
Similar logic can be used to find other things in "timetable data", such as a pair of trains it is possible to take from A via B to C.
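The train case could be sketched like this. The Trains table, its columns and the sample timetable are hypothetical; times are stored as zero-padded HH:MM strings so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Trains (
        TrainId     INTEGER PRIMARY KEY,
        Origin      TEXT,
        Destination TEXT,
        DepartTime  TEXT,   -- 'HH:MM'
        ArriveTime  TEXT    -- 'HH:MM'
    );
    INSERT INTO Trains VALUES
        (1, 'A', 'B', '08:00', '09:00'),
        (2, 'B', 'C', '09:15', '10:00'),
        (3, 'B', 'C', '08:30', '09:30'),
        (4, 'A', 'C', '08:00', '11:00');
""")

# Pair each A->B train with every B->C train that leaves after it arrives.
connections = conn.execute("""
    SELECT t1.TrainId, t2.TrainId
    FROM Trains t1
    JOIN Trains t2
      ON t1.Destination = t2.Origin
     AND t1.ArriveTime <= t2.DepartTime
    WHERE t1.Origin = 'A'
      AND t1.Destination = 'B'
      AND t2.Destination = 'C'
""").fetchall()
print(connections)
```

Train 3 is excluded because it leaves B before train 1 arrives there.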
Self joins are useful whenever you want to compare records of the same table against each other. Examples are: Find duplicate addresses, find customers where the delivery address is not the same as the invoice address, compare a total in a daily report (saved as record) with the total of the previous day etc.

how to index a one way relation table?

I'm not a DB guy so this may be a trivial question...
Suppose
1) I have a relation table (I think that's what it's called), student_class, which holds a student_id and a class_id (representing a many-to-many between a student table and a class table)
2) I run various queries that result in a student_id (perhaps among other things), and the results are then "LEFT OUTER JOIN"ed to student_class and LOJed again to the class table to get the associated class information.
3) I do that a lot, but I don't care to find the students from a given class, or any other thing you may think is common to do in the context of students and classes.
4) I have tens of thousands of students but only about 100 classes
5a) 99% of the students are not enrolled in any class (what a great school) and the rest are enrolled in one and only one class
5b) alternatively to 5a, on average, every student is enrolled in about 2 classes
So how many and which of the indices below should I create in the student_class table for this sole purpose, and is the answer different for 5a and 5b?
a. index on student_id
b. index both student_id and class_id
c. index on class_id
I would create one index for each column.
Nothing in your question argues for indexing only class_id. You select values by student_id and by class_id, so I think it's reasonable to index them both.
Additionally, your index needs don't change if more students are enrolled in each class.
You make use of the indexes in the same way in both cases.
And for the number of records you have, the indexes are going to be relatively small.
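One way to check how the indexes are actually used is to ask the database for its query plan. The sketch below does this in SQLite as a stand-in; the index names, the name and title columns, and the query shape are assumptions, and other engines may plan the same query differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE class   (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE student_class (
        student_id INTEGER NOT NULL REFERENCES student(id),
        class_id   INTEGER NOT NULL REFERENCES class(id)
    );
    -- One index per column, as suggested above (index names are made up).
    CREATE INDEX idx_student_class_student ON student_class (student_id);
    CREATE INDEX idx_student_class_class   ON student_class (class_id);
""")

# Start from a student and outer-join to the link table and then to class.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT s.name, c.title
    FROM student s
    LEFT OUTER JOIN student_class sc ON sc.student_id = s.id
    LEFT OUTER JOIN class c          ON c.id = sc.class_id
    WHERE s.id = ?
""", (1,)).fetchall()

# The human-readable description is the fourth column of each plan row.
plan_details = [row[3] for row in plan]
for detail in plan_details:
    print(detail)
```

Running the same check on the real database, with real data volumes, is the more reliable way to decide which of the indexes earn their keep.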

SQL Merging Associated Records

Let's say we have a database with a table that has many other associated tables. If you diagrammed the database, this would be the table at the center with many foreign key relationships spiraling out of it.
To make it more concrete, let's say the two records in this central table are Initech and Contoso. Initech and Contoso are both associated with many other records in associated tables like Employees, AccountingTransactions, etc. Let's say the two merged (Initech bought Contoso) and from a data standpoint, it really is as simple as merging all the records. What's the easiest way to take all of Contoso's related records, make them point to Initech and then delete Contoso?
UPDATE with CASCADE comes tantalizingly close, but it obviously can't work without turning off constraints and then turning them back on (yuck).
Is there a nice generic way to do this without hunting down every single linked table and migrating them one by one? This has to be a common requirement. It's come up in two places in this project and can be summed up with: Entity A needs to control everything Entity B currently controls. How can I make it happen?
Before Merge:
Companies
ID Name
1 Contoso
2 Initech
Employees
ID Name CompanyId
1 Bob 1
2 Ted 2
After Merge:
Companies
ID Name
2 Initech
Employees
ID Name CompanyId
1 Bob 2
2 Ted 2
All my attempts at searching only turned up questions about merging separate databases... so sorry if this has been asked before.
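The exact syntax is vendor-dependent, but in MySQL you can repoint each referencing table with its own UPDATE inside one transaction (Cars and OtherEntity stand in for whatever other tables reference Companies), then delete the old company:
START TRANSACTION;
UPDATE Employees SET CompanyId = 2 WHERE CompanyId = 1;
UPDATE Cars SET CompanyId = 2 WHERE CompanyId = 1;
UPDATE OtherEntity SET CompanyId = 2 WHERE CompanyId = 1;
DELETE FROM Companies WHERE ID = 1;
COMMIT;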
Succinctly, no; there isn't a generic way to do it.
Consider your sample database with tables Companies, Employees, Departments, and AccountingTransactions.
You need to delete one of the company records (because after the merger, you will only record the current state of affairs).
You need to alter the employee records to change the employing company. However, it is quite possible that there is an employee number N in both companies, and one of those (presumably Contoso's) will have to be assigned a new employee number.
You probably face the problem that department 1 in Contoso's data is Engineering, but in Initech's is Finance. So, you need to worry about how you are going to map the department numbers between the two companies, and then you face the problem of assigning Contoso's employees to Initech's departments.
For the historical accounting transactions, you probably have to keep Contoso's historical accounting records in Contoso's name, while some (of the most recent) transactions will need to be migrated to Initech's name. So maybe you won't be deleting the Contoso record from the table of companies after all, but you won't be able to use it to identify any new records.
These are just a small sampling of the reasons why such mappings cannot readily be automated.
No, there's no simple generic way of merging rows and cascading those changes throughout your system. You can script it all - which may be the best way, depending on your scenario - or devise a workaround.
One workaround might be to implement a parenting pattern on your central table (or abstract it to another table). You would then end up with something like
Companies
ID ParentID Name
1 2 Contoso
2 null Initech
or
Companies
ID ParentID Name
1 3 Contoso
2 3 Initech
3 null MegaInitech
and all your queries that join onto this central Companies table now check both ID and ParentID:
SELECT *
FROM Employees
WHERE CompanyId IN (SELECT ID FROM Companies WHERE ID = @id OR ParentID = @id)
Abstract this away to a view or function:
CREATE FUNCTION fn_IsMemberOf
(
@companyId INT,
@parentId INT
)
RETURNS BIT
AS
BEGIN
DECLARE @result BIT = 0
SELECT @result = 1 FROM Companies
WHERE ID = @companyId
AND COALESCE(ParentID, ID) = @parentId
RETURN @result
END
-- Scalar functions must be called with a schema prefix; dbo is assumed here.
-- 2 is Initech, the parent company in the first example above.
SELECT *
FROM Employees
WHERE dbo.fn_IsMemberOf(CompanyId, 2) = 1
(haven't tested this but you get the idea)
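The view flavour of the same idea can be sketched in SQLite via Python. The view name EffectiveCompany and its OwnerId column are hypothetical; the tables and rows follow the first ParentID example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Companies (
        ID       INTEGER PRIMARY KEY,
        ParentID INTEGER REFERENCES Companies(ID),
        Name     TEXT
    );
    CREATE TABLE Employees (
        ID        INTEGER PRIMARY KEY,
        Name      TEXT,
        CompanyId INTEGER REFERENCES Companies(ID)
    );
    INSERT INTO Companies VALUES (2, NULL, 'Initech'), (1, 2, 'Contoso');
    INSERT INTO Employees VALUES (1, 'Bob', 1), (2, 'Ted', 2);

    -- Hypothetical view: maps every company to the company that owns it
    -- (itself when it has no parent).
    CREATE VIEW EffectiveCompany AS
    SELECT ID AS CompanyId, COALESCE(ParentID, ID) AS OwnerId
    FROM Companies;
""")

# Everyone who now belongs to Initech (ID 2), directly or via Contoso.
initech_staff = conn.execute("""
    SELECT e.Name
    FROM Employees e
    JOIN EffectiveCompany ec ON ec.CompanyId = e.CompanyId
    WHERE ec.OwnerId = ?
    ORDER BY e.ID
""", (2,)).fetchall()
print(initech_staff)
```

No Employees row is rewritten here; the merge is recorded only by setting Contoso's ParentID, so it can be undone by clearing that one value. Like the function above, this handles a single level of parenting only.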

Storing & Querying Hierarchical Data with Multiple Parent Nodes

I've been doing quite a bit of searching, but haven't been able to find many resources on the topic. My goal is to store scheduling data like you would find in a Gantt chart. So one example of storing the data might be:
Task Id | Name | Duration
1 Task A 1
2 Task B 3
3 Task C 2
Task Id | Predecessors
1 Null
2 Null
3 1
3 2
Which would have Task C waiting for both Task A and Task B to complete.
So my question is: What is the best way to store this kind of data and efficiently query it? Any good resources for this kind of thing? There is a ton of information about tree structures, but once you add in multiple parents it becomes hard to find info. By the way, I'm working with SQL Server and .NET for this task.
Your problem is related to the concept of relationship cardinality. Every relationship has a cardinality, which expresses how many instances on each side can participate in a single instance of the relationship. As an example, for people (and for most living things, I guess, with rare exceptions), the Parent-Child relationship has a cardinality of 2 to zero-or-many, meaning it takes two parents on the parent side, and there can be zero or many children (or perhaps it should be 2 to one-or-many).
In database design, generally, any relationship that has a one (or a zero-or-one) on one side can be represented easily with just two tables, one for each entity (sometimes only one table is needed, see note **), and a foreign key column in the table on the "many" side that points to the table holding the entity on the "one" side.
In your case you have a many-to-many relationship. (A task can have multiple predecessors, and each predecessor can certainly be the predecessor of multiple tasks.) In this case a third table is needed, where each row represents an association between two tasks, recording that one is the predecessor of the other. Generally, this table is designed to contain only the primary key columns of the two parent tables, and its own primary key is a composite of all of those columns. In your case it simply has two columns, the TaskId and the PredecessorTaskId, and this pair of Ids should be unique within the table, so together they form the composite PK.
When querying, to avoid double counting data columns in the parent tables when there are multiple joins, simply base the query on the parent table. For example, to find the duration of the longest predecessor of each task, assuming your association table is named TaskPredecessor:
Select T.TaskId, Max(P.Duration)
From Task T Join Task P
On P.TaskId In (Select PredecessorTaskId
From TaskPredecessor
Where TaskId = T.TaskId)
Group By T.TaskId
** NOTE. In cases where both entities in the relationship are of the same entity type, they can both be in the same table. The canonical (luv that word) example is an employee table with the many-to-one relationship of Worker to Supervisor. Since the supervisor is also an employee, both workers and supervisors can be in the same [Employee] table, and the relationship can be modeled with a foreign key (called, say, SupervisorId) that points to another row in the same table and contains the Id of the employee record for that employee's supervisor.
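The association table with its composite primary key can be tried out with a small sketch like the one below, using SQLite through Python rather than SQL Server. The data is the Task A/B/C example from the question; the query is a join-based equivalent of the "longest predecessor" lookup, written as an assumption about what is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Task (
        TaskId   INTEGER PRIMARY KEY,
        Name     TEXT,
        Duration INTEGER
    );
    -- Each row says: TaskId must wait for PredecessorTaskId.
    CREATE TABLE TaskPredecessor (
        TaskId            INTEGER NOT NULL REFERENCES Task(TaskId),
        PredecessorTaskId INTEGER NOT NULL REFERENCES Task(TaskId),
        PRIMARY KEY (TaskId, PredecessorTaskId)
    );
    INSERT INTO Task VALUES (1, 'Task A', 1), (2, 'Task B', 3), (3, 'Task C', 2);
    INSERT INTO TaskPredecessor VALUES (3, 1), (3, 2);
""")

# Longest direct predecessor per task.
longest = conn.execute("""
    SELECT tp.TaskId, MAX(p.Duration)
    FROM TaskPredecessor tp
    JOIN Task p ON p.TaskId = tp.PredecessorTaskId
    GROUP BY tp.TaskId
""").fetchall()
print(longest)

# The composite primary key rejects a repeated (task, predecessor) pair.
try:
    conn.execute("INSERT INTO TaskPredecessor VALUES (3, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Tasks without predecessors simply have no rows in TaskPredecessor, so no NULL placeholder rows are needed.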
Use adjacency list model:
chain
task_id predecessor
3 1
3 2
and this query to find all predecessors of the given task:
WITH q AS
(
SELECT predecessor
FROM chain
WHERE task_id = 3
UNION ALL
SELECT c.predecessor
FROM q
JOIN chain c
ON c.task_id = q.predecessor
)
SELECT *
FROM q
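The recursive query can be exercised outside SQL Server with a sketch like this one, which runs it in SQLite through Python. The extra chain rows (tasks 4 and 5) and the helper name all_predecessors are made up for the demonstration. The sketch uses UNION instead of UNION ALL so that a predecessor reachable along two paths is listed once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chain (
        task_id     INTEGER NOT NULL,
        predecessor INTEGER NOT NULL,
        PRIMARY KEY (task_id, predecessor)
    );
    -- 3 waits for 1 and 2; 4 waits for 3; 5 waits for 4 and 3.
    INSERT INTO chain VALUES (3, 1), (3, 2), (4, 3), (5, 4), (5, 3);
""")

def all_predecessors(task_id):
    """Return every direct and indirect predecessor of task_id, sorted."""
    rows = conn.execute("""
        WITH RECURSIVE q(predecessor) AS (
            SELECT predecessor
            FROM chain
            WHERE task_id = ?
            UNION
            SELECT c.predecessor
            FROM q
            JOIN chain c ON c.task_id = q.predecessor
        )
        SELECT predecessor FROM q
    """, (task_id,)).fetchall()
    return sorted(row[0] for row in rows)

print(all_predecessors(5))
print(all_predecessors(3))
```

Neither form guards against a cycle in the data under UNION ALL; with UNION, SQLite stops once no new rows appear.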
To get the longest duration among each task and all of its direct and indirect predecessors:
WITH q AS
(
SELECT task_id AS root_id, task_id, duration
FROM tasks
UNION ALL
SELECT q.root_id, t.task_id, t.duration
FROM q
JOIN chain c
ON c.task_id = q.task_id
JOIN tasks t
ON t.task_id = c.predecessor
)
SELECT root_id AS task_id, MAX(duration)
FROM q
GROUP BY root_id
Check "Hierarchical Weighted total" pattern in "SQL design patterns" book, or "Bill Of Materials" section in "Trees and Hierarchies in SQL".
In a word, graphs feature double aggregation. You do one kind of aggregation along the nodes in each path, and another one across alternative paths. For example, finding the minimal distance between two nodes is a minimum over a summation. The hierarchical weighted total query (aka Bill Of Materials) is a multiplication of the quantities along each path, and a summation across the alternative paths:
with TCAssembly as (
select Part, SubPart, Quantity AS factoredQuantity
from AssemblyEdges
where Part = 'Bicycle'
union all
select te.Part, e.SubPart, e.Quantity * te.factoredQuantity
from TCAssembly te, AssemblyEdges e
where te.SubPart = e.Part
) select SubPart, sum(factoredQuantity) from TCAssembly
group by SubPart
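As a small worked instance of "multiply along each path, sum across paths", the sketch below runs the query in SQLite through Python. The parts and quantities are invented: a bicycle has 2 wheels and 1 frame, each wheel has 32 spokes and 2 bolts, and the frame has 4 bolts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AssemblyEdges (Part TEXT, SubPart TEXT, Quantity INTEGER);
    INSERT INTO AssemblyEdges VALUES
        ('Bicycle', 'Wheel', 2),
        ('Bicycle', 'Frame', 1),
        ('Wheel',   'Spoke', 32),
        ('Wheel',   'Bolt',  2),
        ('Frame',   'Bolt',  4);
""")

rows = conn.execute("""
    WITH RECURSIVE TCAssembly(Part, SubPart, factoredQuantity) AS (
        SELECT Part, SubPart, Quantity
        FROM AssemblyEdges
        WHERE Part = 'Bicycle'
        UNION ALL
        -- Multiply quantities along the path from the bicycle downwards.
        SELECT te.Part, e.SubPart, e.Quantity * te.factoredQuantity
        FROM TCAssembly te
        JOIN AssemblyEdges e ON te.SubPart = e.Part
    )
    -- Sum across the alternative paths that reach the same sub-part.
    SELECT SubPart, SUM(factoredQuantity)
    FROM TCAssembly
    GROUP BY SubPart
""").fetchall()

totals = dict(rows)
print(totals)
```

Bolts are reached along two paths (2 wheels x 2 bolts, and 1 frame x 4 bolts), so their total is 4 + 4 = 8, while spokes come out as 2 x 32 = 64.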