How to convert SQL to Relational Algebra in case of SQL Joins? - sql

I am working on SQL and Relational Algebra these days. And I am stuck on the below questions. I am able to make a SQL for the below questions but somehow my Relational Algebra that I have made doesn't looks right.
Below are my tables-
Employee (EmployeeId, EmployeeName, EmployeeCountry)
Training (TrainingCode, TrainingName, TrainingType, TrainingInstructor)
Outcome (EmployeeId, TrainingCode, Grade)
All the keys are specified with star *.
Below is the question and its SQL query as well which works fine-
Find an Id of the Employee who has taken every training.
SQL Qyery:
SELECT X.EmployeeID
FROM (SELECT EmployeeID, COUNT(*) AS NumClassesTaken
FROM OutCome GROUP BY EmployeeID )
AS X
JOIN (SELECT COUNT(*) AS ClassesAvailable
FROM Training)
AS Y
ON X.NumClassesTaken = Y.ClassesAvailable
I am not able to understand what will be the relational algebra for the above query? Can anyone help me with that?

Relational algebra for:
Find an Id of the Employee who has taken every training.
Actually you need division % operator in relational algebra:
r ÷ s is used when we wish to express queries with “all”:
Example:
Which persons have a bank account at ALL the banks in the country?
Retrieve the name of employees who work on ALL the projects that Jon Smith works on?
Read also this slid for division operator:
You also need query % operator for your query: "Employee who has taken all training".
First list off all Training codes:
Training (TrainingCode, TrainingName, TrainingType, TrainingInstructor)
Primary key is: TrainingCode:
TC = ∏ TrainingCode(Training)
A pair of employeeID and trainingCode: a employee take the training.
ET = ∏ EmployeeId, TrainingCode(Outcome)
Apply % Division operation which gives you desired employee's codes with trainingCode then apply projection to filter out employee code only.
Result = ∏ EmployeeId(ET % TC)
"Fundamentals of Database Systems" is the book I always keep in my hand.
6.3.4 The DIVISION Operation
The DIVISION operation is defined for convenience for dealing with
queries that involves universal quantification or the all
condition. Most RDBMS implementation with SQL as the primary query
language do not directly implement division. SQL has round way of
dealing with the type of query using EXISTS, CONTAINS and NOT EXISTS
key words.
The general DIVISION operation applied to two relations T(Y) = R(Z) %
S(X), where X ⊆ Z and Y = Z - X (and hence Z =
X ∪ Y); that is Y is the set of attributes of R that are not
attributes of S e.g. X = {A}, Z = {A, B} then Y = {B}, B
attribute is not present in relation S.
T(Y) the result of DIVISION is a relation includes a tuple t if
tuple tR appear in relation R with
tR[Y] = t, and with
tR[X] = tS for every tuple in
S. This means that. for a tuple t to appear in the result T of the DIVISION, the value of t must be appear in R in combination with every tuple in S.
I would also like to add that the set of relational algebra operations {σ,∏,⋈,Χ,-} namely Selection, Projection, Join, Cartesian Cross and Minus is a complete set; that is any of the other original relational algebra operation can be expressed as a sequence of operations from this set. Division operation % can also be expressed in the form of ∏, ⋈, and - operations as follows:
T1 <-- ∏Y(R)
T2 <-- ∏Y((S Χ T1) - R)
T3 <-- T1 - T2
To represent your question using basic relational algebraic operation just replace R by Outcome, S by Training and attribute set Y by EmployeeId.
I hope this help.

Related

How to write relational algebra for SQL WHERE column IN?

select age from person where name in (select name from eats where pizza="mushroom")
I am not sure what to write for the "in". How should I solve this?
In this case the sub-select is equivalent to a join:
select age
from person p, eats e
where p.name = e.name and pizza='mushroom'
So you could translate it in:
πage (person p ⋈p.name=e.name (σpizza='mushroom'(eats e)))
Here's my guess. I'm assuming that set membership symbol is part of relational algebra
For base table r, C a column of both r & s, and x an unused name,
select ... from r where C in s
returns the same value as
select ... from r natural join (s) x
The use of in requires that s has one column. The in is true for a row of r exactly when its C equals the value in s. So the where keeps exactly the rows of r whose C equals the value in s. We assumed that s has column C, so the where keeps exactly the rows of r whose C equals the C of the row in r. Those are same rows that are returned by the natural join.
(For an expression like this where-in with C not a column of both r and s then this translation is not applicable. Similarly, the reverse translation is only applicable under certain conditions.)
How useful this particular translation is to you or whether you could simplify it or must complexify it depends on what variants of SQL & "relational algebra" you are using, what limitations you have on input expressions and other translation decisions you have made. If you use very straightforward and general translations then the output is more complex but obviously correct. If you translate using a lot of special case rules and ad hoc simplifications along the way then the output is simpler but the justification that the answer is correct is longer.

Converting a certain SQL query into relational algebra

Doing an assignment for my database course and I want to double check my relational algebra.
The SQL:
SELECT dato, SUM(pris*antall) AS total
FROM produkt, ordre
WHERE ordre.varenr = produkt.varenr
GROUP BY dato
HAVING total >= 10000
The relational algebra:
σ total >= 10000 (
ρ R(dato, total)(
σ ordre.varenr = produkt.varenr (
dato ℑ SUM(pris*antall (produkt x ordre)
)
)
)
Is this correct?
I don't know. And anybody else is not likely to know either.
RA courses typically limit themselves to the selection, projection and join operators. Aggregations are not typically covered by an RA course. There even isn't any standard approach (that I know of) that the RA takes on aggregations.
What is the operator that your course defines for doing aggregations on relations ? What type of value does that operator produce for its result ? A relation ? Something else ? If something else, how does your course explain doing relational restrictions on that result, given that these result values aren't relations, but restriction works only on relations ?
Algebraically, this case starts with a natural join (produkt x ordre).
[The result of] this natural join is subjected to an aggregation operation. Thus this natural join is to appear where you specify the relational input argument to your aggregation operator. The other needed specs for specifying the aggregation are the output attribute names (total), and the way to compute them (SUM(...)). Those might appear in subscript next to your aggregation operator symbol as "annotations", much like the attribute lists on projection and the restriction condition on restriction. But anything concerning this operator is course-specific, because there isn't any agreed-upon standard notation for aggregations, as far as I know.
Then if your aggregation operator is defined to return a relation, you can specify your aggregation result as the input argument to a restriction with condition "total>=10000".

Convert SQL Query to Relational Algebra

I need some help converting an SQL query into relational algebra.
Here is the SQL query:
SELECT * FROM Customer, Appointment
WHERE Appointment.CustomerCode = Customer.CustomerCode
AND Appointment.ServerCode IN
(
SELECT ServerCode FROM Appointment WHERE CustomerCode = '102'
)
;
I'm stuck because of the IN subquery in the above example.
Can anyone demonstrate for me how to express this SQL query in relational algebra?
Many thanks.
EDIT: Here is my proposed solution in relational algebra. Is this correct? Does it reproduce the SQL query?
Scodes ← ΠServerCode(σCustomerCode='102'(Appointment))
Ccodes ← ΠCustomerCode(Appointment ⋉ Scodes)
Result ← (Customer ⋉ Ccodes)
Your SQL code will result in duplicate columns for CustomerCode and the use of SELECT [ALL] is likely to result in duplicate rows. Because the result is not a relation, it cannot be expressed in relational algebra.
These problems are easily fixed in SQL:
SELECT DISTINCT *
FROM Customer NATURAL JOIN Appointment
WHERE Appointment.ServerCode IN
(
SELECT ServerCode FROM Appointment WHERE CustomerCode = '102'
)
;
You didn't specify which relational algebra you are intereted in. Date and Darwen proposed an algebra named A, specified an A language named D, and designed a D language named Tutorial D.
Tutorial D uses operators JOIN for natural join, WHERE for restriction and MATCHING for semijoin, The slight complication is the comparison in SQL:
CustomerCode = '102'
The comparison of a CustomerCode value to a CHAR value in SQL is possible because of implicit coercion. Tutorial D is stricter -- type safe, if you will -- requiring you to overload the equality operator or, more practically, define a selector operator for CHAR, which would typically have the same name as the type.
Therefore, the above (revised) SQL may be written in Tutorial D as:
( Customer JOIN Appointment )
MATCHING ( ( Appointment WHERE CustomerCode = CustomerCode ( '102' ) ) { ServerCode } )
"How do I represent my query in this standard form of RA?"
It's not so much a question of "type of algebra" as it is of "type of notation".
Notation using greek symbols typically uses sigma, the restrict condition in subscript appended to the sigma character, and then the subject of the restriction (the relational expression that is subjected to the restrict condition).
Date avoid that notation, because typesetting and/or creating text using such notations is usually a lot harder than it is using just the western alphabet (a math teacher of mine once told us that math textbooks contain the most errors of all).
σ <cond> (<rel exp>) thus denotes the very same algebra expression as (Date's syntax) "<rel exp> WHERE <cond>".
Similarly, with greek symbols, projection is typically denoted using the letter Pi, with the list of retained attributes in subscript appended to the Pi, and the expression that is the subject of the projection following that.
Π <attr list> (<rel exp>) thus denotes the very same algebra expression as (Date's syntax) "<rel exp> { <attr list> }".
The join family of operators is usually denoted, in "greek" symbols, using (variations of) the Unicode BOWTIE character, or that character consisting of a lowercase letter 'x' surrounded by a full circle (usually used to denote full cartesian product, cross-product, ... whatever your algebra course happens to name it).
Some courses provide a "greek-symbol" notation for rename, using the greek letter Rho. Appended in subscript is the rename list, in the form a1->b1,a2->b2,... Appended after that comes the relational expression that is subjected to the rename. Likewise, Date has a non-greek-symbol equivalent syntax : <rel exp> RENAME a1 AS b1, a2 AS b2 , ...
The important thing is to see that these differences are merely differences in syntactical notation, not "different algebrae".
EDIT
One could imagine that the greek symbols notation would be the way to program relational algebra into an APL engine, Date's syntax would be the way to program relational algebra into a cobol-like or PL/1-like engine (there effectively exists such an engine called Rel), and the way to program relational algebra into an OO-like engine, could look something like relation.NaturalJoin(otherRelation).Matching(yetOtherRelation.Restrict(condition).project(attributesList)).

Clear explanation of the "theta join" in relational algebra?

I'm looking for a clear, basic explanation of the concept of theta join in relational algebra and perhaps an example (using SQL perhaps) to illustrate its usage.
If I understand it correctly, the theta join is a natural join with a condition added in. So, whereas the natural join enforces equality between attributes of the same name (and removes the duplicate?), the theta join does the same thing but adds in a condition. Do I have this right? Any clear explanation, in simple terms (for a non-mathmetician) would be greatly appreciated.
Also (sorry to just throw this in at the end, but its sort of related), could someone explain the importance or idea of cartesian product? I think I'm missing something with regard to the basic concept, because to me it just seems like a restating of a basic fact, i.e that a set of 13 X a set of 4 = 52...
Leaving SQL aside for a moment...
A relational operator takes one or more relations as parameters and results in a relation. Because a relation has no attributes with duplicate names by definition, relational operations theta join and natural join will both "remove the duplicate attributes." [A big problem with posting examples in SQL to explain relation operations, as you requested, is that the result of a SQL query is not a relation because, among other sins, it can have duplicate rows and/or columns.]
The relational Cartesian product operation (results in a relation) differs from set Cartesian product (results in a set of pairs). The word 'Cartesian' isn't particularly helpful here. In fact, Codd called his primitive operator 'product'.
The truly relational language Tutorial D lacks a product operator and product is not a primitive operator in the relational algebra proposed by co-author of Tutorial D, Hugh Darwen**. This is because the natural join of two relations with no attribute names in common results in the same relation as the product of the same two relations i.e. natural join is more general and therefore more useful.
Consider these examples (Tutorial D):
WITH RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } , TUPLE { Y 3 } } AS R1 ,
RELATION { TUPLE { X 1 } , TUPLE { X 2 } } AS R2 :
R1 JOIN R2
returns the product of the relations i.e. degree of two (i.e. two attributes, X and Y) and cardinality of 6 (2 x 3 = 6 tuples).
However,
WITH RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } , TUPLE { Y 3 } } AS R1 ,
RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } } AS R2 :
R1 JOIN R2
returns the natural join of the relations i.e. degree of one (i.e. the set union of the attributes yielding one attribute Y) and cardinality of 2 (i.e. duplicate tuples removed).
I hope the above examples explain why your statement "that a set of 13 X a set of 4 = 52" is not strictly correct.
Similarly, Tutorial D does not include a theta join operator. This is essentially because other operators (e.g. natural join and restriction) make it both unnecessary and not terribly useful. In contrast, Codd's primitive operators included product and restriction which can be used to perform a theta join.
SQL has an explicit product operator named CROSS JOIN which forces the result to be the product even if it entails violating 1NF by creating duplicate columns (attributes). Consider the SQL equivalent to the latter Tutoral D exmaple above:
WITH R1 AS (SELECT * FROM (VALUES (1), (2), (3)) AS T (Y)),
R2 AS (SELECT * FROM (VALUES (1), (2)) AS T (Y))
SELECT *
FROM R1 CROSS JOIN R2;
This returns a table expression with two columns (rather than one attribute) both called Y (!!) and 6 rows i.e. this
SELECT c1 AS Y, c2 AS Y
FROM (VALUES (1, 1),
(2, 1),
(3, 1),
(1, 2),
(2, 2),
(3, 2)
) AS T (c1, c2);
** That is, although there is only one relational model (i.e. Codd's), there can be more than one relational algebra (i.e. Codd's is but one).
You're not quite right - a theta join is a join which may include a condition other than = - in SQL, typically < or >= etc. See TechNet
As for cartesian product (or CROSS JOIN), it is an operation rather than an idea or concept. It's important because sometimes you need to use it! It is a basic fact that set of 13 x set of 4 = 52, and cartesian product is based on this fact.
In my opinion, to make it simple, if you understand equijoin, you suppose should understand theta join. If you change the symbol = (equal) in equijoin to >=, then you already done theta join. However, I think it is quite difficult to see the practicality of using theta join as compared to equijoin since the join cause that we normally use is V.primarykey = C.foreignkey. And if you want to change to theta join, then it may depends on the value, as such you are doing selection.
For natural Join, it is just similar to equijoin, the difference is just that get rid of the redundant attributes. easy!:)
Hope this explanation helps.
All joins can be thought of as beginning with a cross product and then weeding out certain rows. A natural join weeds out all rows in which columns of the same name in the two tables being joine have different values. An equijoin weeds out all rows in which the specified columns have different values. And a theta-join weeds out all rows in which the specified columns do not stand in the specified relationship (<, >, or whatever; in principle it could be is_prefix_of as a relationship beween strings).
Update: Note that outer joins cannot be understood this way, because they synthesize information (that is, nulls) out of nothing.

Databases Relational algebra optimization

I'm trying to do some query optimization; taking an SQL query into relational algebra and optimizing it.
My db tables schemas are as follow:
Hills(MId, Mname, Long, Lat, Height, Rating,... )
Runners(HId, HName, Age, Skill,... )
Runs(MId, CId, Date, Duration)
Where there may be many columns in Runners and Hills.
My SQL query is:
SELECT DISTINCT Runners.HName, Runners.Age
FROM Hills, Runners, Runs
WHERE Runners.HId = Runs.HId AND Runs.MID = Hills.MId AND Height > 1200
So i could start by doing:
π Name, Age(σ Height > 1200 (Hills × Runners × Runs))
Or something like this and then optimizing it with a good choice of joins, but i'm not sure where to start
You could start by using the SQL join notation:
SELECT DISTINCT P.HName, P.Age
FROM Hills AS H
JOIN Runs AS R ON H.MId = R.MId
JOIN Runners AS P ON P.HId = R.HId
WHERE H.Height > 1200
You can then observe that the WHERE condition applies only to Hills, so you could push down the search criterion:
SELECT DISTINCT P.HName, P.Age
FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
JOIN Runs AS R ON H.MId = R.MId
JOIN Runners AS P ON P.HId = R.HId
This is a standard optimization - and one which the SQL optimizer will do automatically. In fact, it probably isn't worth doing much rewriting of the first query shown because the optimizer can deal with it. The other optimization I see as possible is pushing the DISTINCT operation down a level:
SELECT P.HName, P.Age
FROM (SELECT DISTINCT R.HId
FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
JOIN Runs AS R ON H.MId = R.MId
) AS R1
JOIN Runners AS P ON P.HId = R1.HId
This keeps the intermediate result set as small as possible: R1 contains a list of ID-values for the people who have run at least one 1200 metre (or is that 1200 feet?) hill, and can be joined 1:1 with the details in the Runners table. It would be interesting to see whether an optimizer is able to deduce the push-down of the DISTINCT for itself.
Of course, in relational algebra, the DISTINCT operation is done 'automatically' - every result and intermediate result is always a relation with no duplicates.
Given the original 'relational algebra' notation:
π Name, Age(σ Height > 1200 (Hills × Runners × Runs))
This corresponds to the first SQL statement above.
The second SQL statement then corresponds (more or less) to:
π Name, Age ((π MId (σ Height > 1200 (Hills))) × Runners × Runs)
The third SQL statement then corresponds (more or less) to:
π Name, Age ((π HId ((π MId (σ Height > 1200 (Hills))) × Runs)) × Runners)
Where I'm assuming that parentheses force the relational algebra to evaluate expressions in order. I'm not sure that I've got the minimum possible number of parentheses in there, but the ones that are there don't leave much wriggle room for ambiguity.