Convert SQL Query to Relational Algebra - sql

I need some help converting an SQL query into relational algebra.
Here is the SQL query:
SELECT * FROM Customer, Appointment
WHERE Appointment.CustomerCode = Customer.CustomerCode
AND Appointment.ServerCode IN
(
SELECT ServerCode FROM Appointment WHERE CustomerCode = '102'
)
;
I'm stuck because of the IN subquery in the above example.
Can anyone demonstrate for me how to express this SQL query in relational algebra?
Many thanks.
EDIT: Here is my proposed solution in relational algebra. Is this correct? Does it reproduce the SQL query?
Scodes ← ΠServerCode(σCustomerCode='102'(Appointment))
Ccodes ← ΠCustomerCode(Appointment ⋉ Scodes)
Result ← (Customer ⋉ Ccodes)

Your SQL code will result in duplicate columns for CustomerCode and the use of SELECT [ALL] is likely to result in duplicate rows. Because the result is not a relation, it cannot be expressed in relational algebra.
These problems are easily fixed in SQL:
SELECT DISTINCT *
FROM Customer NATURAL JOIN Appointment
WHERE Appointment.ServerCode IN
(
SELECT ServerCode FROM Appointment WHERE CustomerCode = '102'
)
;
You didn't specify which relational algebra you are intereted in. Date and Darwen proposed an algebra named A, specified an A language named D, and designed a D language named Tutorial D.
Tutorial D uses operators JOIN for natural join, WHERE for restriction and MATCHING for semijoin, The slight complication is the comparison in SQL:
CustomerCode = '102'
The comparison of a CustomerCode value to a CHAR value in SQL is possible because of implicit coercion. Tutorial D is stricter -- type safe, if you will -- requiring you to overload the equality operator or, more practically, define a selector operator for CHAR, which would typically have the same name as the type.
Therefore, the above (revised) SQL may be written in Tutorial D as:
( Customer JOIN Appointment )
MATCHING ( ( Appointment WHERE CustomerCode = CustomerCode ( '102' ) ) { ServerCode } )

"How do I represent my query in this standard form of RA?"
It's not so much a question of "type of algebra" as it is of "type of notation".
Notation using greek symbols typically uses sigma, the restrict condition in subscript appended to the sigma character, and then the subject of the restriction (the relational expression that is subjected to the restrict condition).
Date avoid that notation, because typesetting and/or creating text using such notations is usually a lot harder than it is using just the western alphabet (a math teacher of mine once told us that math textbooks contain the most errors of all).
σ <cond> (<rel exp>) thus denotes the very same algebra expression as (Date's syntax) "<rel exp> WHERE <cond>".
Similarly, with greek symbols, projection is typically denoted using the letter Pi, with the list of retained attributes in subscript appended to the Pi, and the expression that is the subject of the projection following that.
Π <attr list> (<rel exp>) thus denotes the very same algebra expression as (Date's syntax) "<rel exp> { <attr list> }".
The join family of operators is usually denoted, in "greek" symbols, using (variations of) the Unicode BOWTIE character, or that character consisting of a lowercase letter 'x' surrounded by a full circle (usually used to denote full cartesian product, cross-product, ... whatever your algebra course happens to name it).
Some courses provide a "greek-symbol" notation for rename, using the greek letter Rho. Appended in subscript is the rename list, in the form a1->b1,a2->b2,... Appended after that comes the relational expression that is subjected to the rename. Likewise, Date has a non-greek-symbol equivalent syntax : <rel exp> RENAME a1 AS b1, a2 AS b2 , ...
The important thing is to see that these differences are merely differences in syntactical notation, not "different algebrae".
EDIT
One could imagine that the greek symbols notation would be the way to program relational algebra into an APL engine, Date's syntax would be the way to program relational algebra into a cobol-like or PL/1-like engine (there effectively exists such an engine called Rel), and the way to program relational algebra into an OO-like engine, could look something like relation.NaturalJoin(otherRelation).Matching(yetOtherRelation.Restrict(condition).project(attributesList)).

Related

I need help returning subtle differences in two text columns from two tables

My current SQL query is:
Select
A.CFCIF#, B.CIFNO, B.SNAME, A.CFSNME
From dbo.tbl_CIF_Master A
Left Join dbo.tbl_Loan_Master B
On A.CFCIF# = B.CIFNO
Where
B.[STATUS] not in (2,8)
--and CONTAINS (A.CFSNME or FORMSOF (THESAURUS, B.SNAME)) --doesn't work.
I'm not an admin so I can't design thesaurus mappings
and B.SNAME LIKE '%' + A.CFSNME + '%' -- works but no results which can't
be accurate
This runs fine but finds no differences as I noted using LIKE. The line using THESAURUS is commented out as I noted... An example of the subtle differences I would find in the two 'name' fields SNAME & CFSNME would be subtly differences like missing a comma before LLC or Robert abbreviated to Rob.
Given the indefinite nature of the differences that you are looking for (which the strict substring matching that you have implemented will not catch), you might consider calculating a similarity metric between the columns and then determine an appropriate critical value of that similarity metric to identify strings that are the same but for subtle differences. See A better similarity ranking algorithm for variable length strings for a similarity metric that you might want to use.

How to represent GROUP BY with HAVING COUNT(*)>1 in relational algebra?

For an exam, I am asked to get the list of clients having more than one rent, both as an SQL query and as an algebraic expression.
For some reasons, the correction doesn't provide the algebraic version.
So now I am left with:
SELECT IdClient, Name, ...
FROM Client
WHERE IdClient IN (
SELECT IdClient
FROM Rental
GROUP BY IdClient
HAVING COUNT(*) > 1
)
I don't know if there is a standard for algebraic notations, therefore:
Π projection
× Cartesian product
⋈ natural join
σ selection
Then I translate as:
Π IdClient, Name, ... (
σ (count(IdClient)>1) (Π Rental) ⋈ (Client ⋈ Rental)
)
But I find no source to prove me right or wrong, especially for:
the logic behind the math
Π Rental seems like a shady business
I saw the use of count() at https://cs.stackexchange.com/questions/29897/use-count-in-relational-algebra and while it isn't used the same way, I couldn't figure out a way to use it without the projection (which I wanted to avoid.)
There are many variants of "relational algebra", differing even on what a relation is. You need to tell us which one you are supposed to use.
Also you don't explain what it means for a pair of RA & SQL queries to "have the form of" or "be the same as" each other. (Earlier versions.) Same result? Or also some kind of parallel structure?
Also you don't explain what "get the list of clients" means. What attributes does the result have?
If you try to write a definition of the count you are trying to use in σ count(IdClient)>1 (...)--what it inputs & what it outputs based on that--you will see that you can't. That kind of count that takes just an attribute does not correspond to a relational operator. It is used in a grouping expression--which you are missing. Such count & group are not actually relational operators, they are non-terminals in so-called relational algebras that are really query languages, designed by SQL apologists, suggesting it is easy to map SQL to relational algebra, but begging the question of how we aggregate in an algebra. Still, maybe that's the sort of "relational algebra" you were told to use.
I saw the use of count() there https://cs.stackexchange.com/questions/29897/use-count-in-relational-algebra
The nature of algebras is that the only sense in which we "use" operators "with" other operators is to pass outputs of operator calls as inputs to other operator calls. (Hence some so-called algebras are not.) In your linked answer, grouping operator G inputs aggregate name count and attribute name name and that affects the output. The answer quotes Database System Concepts, 5th Ed:
G1, G2, ..., Gn G F1(A1), F2(A2), ..., Fm(Am) (E)
where E is any relational-algebra expression; G1, G2, ..., Gn constitute a list of attributes on which to group; each Fi is an aggregate function; and each Ai is an attribute name.
G returns rows with attributes G1, ..., A1, ... where one or more rows with the same G1, ... subrows are in E and each Ai holds the output from aggregating Fi on Ai over those rows.
But the answer when you read & linked it used that definition improperly. (I got it fixed since.) Correct is:
π name (σ phone>1 (name G count(phone) (Person)))
This is clear if you carefully read the definition.
G has misleading syntax. count(phone) is not a call of an operator; it's just a pair of arguments, an aggregate name count & an attribute name phone. Less misleading syntax would be
π name (σ phone>1 (name G count phone (Person)))
One does not need a grouping operator to write your query. That makes it all the more important to know what "relational algebra" means in the exam. It is harder if you can't use a grouping operator.
"Π Rental seems like a shady business" is unclear. You do use projection incorrectly; proper use is π attributes (relation). I guess you are using π in an attempt to involve a grouping operator like G. Re "the logic behind the math" see this.

How to write relational algebra for SQL WHERE column IN?

select age from person where name in (select name from eats where pizza="mushroom")
I am not sure what to write for the "in". How should I solve this?
In this case the sub-select is equivalent to a join:
select age
from person p, eats e
where p.name = e.name and pizza='mushroom'
So you could translate it in:
πage (person p ⋈p.name=e.name (σpizza='mushroom'(eats e)))
Here's my guess. I'm assuming that set membership symbol is part of relational algebra
For base table r, C a column of both r & s, and x an unused name,
select ... from r where C in s
returns the same value as
select ... from r natural join (s) x
The use of in requires that s has one column. The in is true for a row of r exactly when its C equals the value in s. So the where keeps exactly the rows of r whose C equals the value in s. We assumed that s has column C, so the where keeps exactly the rows of r whose C equals the C of the row in r. Those are same rows that are returned by the natural join.
(For an expression like this where-in with C not a column of both r and s then this translation is not applicable. Similarly, the reverse translation is only applicable under certain conditions.)
How useful this particular translation is to you or whether you could simplify it or must complexify it depends on what variants of SQL & "relational algebra" you are using, what limitations you have on input expressions and other translation decisions you have made. If you use very straightforward and general translations then the output is more complex but obviously correct. If you translate using a lot of special case rules and ad hoc simplifications along the way then the output is simpler but the justification that the answer is correct is longer.

Converting a certain SQL query into relational algebra

Doing an assignment for my database course and I want to double check my relational algebra.
The SQL:
SELECT dato, SUM(pris*antall) AS total
FROM produkt, ordre
WHERE ordre.varenr = produkt.varenr
GROUP BY dato
HAVING total >= 10000
The relational algebra:
σ total >= 10000 (
ρ R(dato, total)(
σ ordre.varenr = produkt.varenr (
dato ℑ SUM(pris*antall (produkt x ordre)
)
)
)
Is this correct?
I don't know. And anybody else is not likely to know either.
RA courses typically limit themselves to the selection, projection and join operators. Aggregations are not typically covered by an RA course. There even isn't any standard approach (that I know of) that the RA takes on aggregations.
What is the operator that your course defines for doing aggregations on relations ? What type of value does that operator produce for its result ? A relation ? Something else ? If something else, how does your course explain doing relational restrictions on that result, given that these result values aren't relations, but restriction works only on relations ?
Algebraically, this case starts with a natural join (produkt x ordre).
[The result of] this natural join is subjected to an aggregation operation. Thus this natural join is to appear where you specify the relational input argument to your aggregation operator. The other needed specs for specifying the aggregation are the output attribute names (total), and the way to compute them (SUM(...)). Those might appear in subscript next to your aggregation operator symbol as "annotations", much like the attribute lists on projection and the restriction condition on restriction. But anything concerning this operator is course-specific, because there isn't any agreed-upon standard notation for aggregations, as far as I know.
Then if your aggregation operator is defined to return a relation, you can specify your aggregation result as the input argument to a restriction with condition "total>=10000".

Clear explanation of the "theta join" in relational algebra?

I'm looking for a clear, basic explanation of the concept of theta join in relational algebra and perhaps an example (using SQL perhaps) to illustrate its usage.
If I understand it correctly, the theta join is a natural join with a condition added in. So, whereas the natural join enforces equality between attributes of the same name (and removes the duplicate?), the theta join does the same thing but adds in a condition. Do I have this right? Any clear explanation, in simple terms (for a non-mathmetician) would be greatly appreciated.
Also (sorry to just throw this in at the end, but its sort of related), could someone explain the importance or idea of cartesian product? I think I'm missing something with regard to the basic concept, because to me it just seems like a restating of a basic fact, i.e that a set of 13 X a set of 4 = 52...
Leaving SQL aside for a moment...
A relational operator takes one or more relations as parameters and results in a relation. Because a relation has no attributes with duplicate names by definition, relational operations theta join and natural join will both "remove the duplicate attributes." [A big problem with posting examples in SQL to explain relation operations, as you requested, is that the result of a SQL query is not a relation because, among other sins, it can have duplicate rows and/or columns.]
The relational Cartesian product operation (results in a relation) differs from set Cartesian product (results in a set of pairs). The word 'Cartesian' isn't particularly helpful here. In fact, Codd called his primitive operator 'product'.
The truly relational language Tutorial D lacks a product operator and product is not a primitive operator in the relational algebra proposed by co-author of Tutorial D, Hugh Darwen**. This is because the natural join of two relations with no attribute names in common results in the same relation as the product of the same two relations i.e. natural join is more general and therefore more useful.
Consider these examples (Tutorial D):
WITH RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } , TUPLE { Y 3 } } AS R1 ,
RELATION { TUPLE { X 1 } , TUPLE { X 2 } } AS R2 :
R1 JOIN R2
returns the product of the relations i.e. degree of two (i.e. two attributes, X and Y) and cardinality of 6 (2 x 3 = 6 tuples).
However,
WITH RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } , TUPLE { Y 3 } } AS R1 ,
RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } } AS R2 :
R1 JOIN R2
returns the natural join of the relations i.e. degree of one (i.e. the set union of the attributes yielding one attribute Y) and cardinality of 2 (i.e. duplicate tuples removed).
I hope the above examples explain why your statement "that a set of 13 X a set of 4 = 52" is not strictly correct.
Similarly, Tutorial D does not include a theta join operator. This is essentially because other operators (e.g. natural join and restriction) make it both unnecessary and not terribly useful. In contrast, Codd's primitive operators included product and restriction which can be used to perform a theta join.
SQL has an explicit product operator named CROSS JOIN which forces the result to be the product even if it entails violating 1NF by creating duplicate columns (attributes). Consider the SQL equivalent to the latter Tutoral D exmaple above:
WITH R1 AS (SELECT * FROM (VALUES (1), (2), (3)) AS T (Y)),
R2 AS (SELECT * FROM (VALUES (1), (2)) AS T (Y))
SELECT *
FROM R1 CROSS JOIN R2;
This returns a table expression with two columns (rather than one attribute) both called Y (!!) and 6 rows i.e. this
SELECT c1 AS Y, c2 AS Y
FROM (VALUES (1, 1),
(2, 1),
(3, 1),
(1, 2),
(2, 2),
(3, 2)
) AS T (c1, c2);
** That is, although there is only one relational model (i.e. Codd's), there can be more than one relational algebra (i.e. Codd's is but one).
You're not quite right - a theta join is a join which may include a condition other than = - in SQL, typically < or >= etc. See TechNet
As for cartesian product (or CROSS JOIN), it is an operation rather than an idea or concept. It's important because sometimes you need to use it! It is a basic fact that set of 13 x set of 4 = 52, and cartesian product is based on this fact.
In my opinion, to make it simple, if you understand equijoin, you suppose should understand theta join. If you change the symbol = (equal) in equijoin to >=, then you already done theta join. However, I think it is quite difficult to see the practicality of using theta join as compared to equijoin since the join cause that we normally use is V.primarykey = C.foreignkey. And if you want to change to theta join, then it may depends on the value, as such you are doing selection.
For natural Join, it is just similar to equijoin, the difference is just that get rid of the redundant attributes. easy!:)
Hope this explanation helps.
All joins can be thought of as beginning with a cross product and then weeding out certain rows. A natural join weeds out all rows in which columns of the same name in the two tables being joine have different values. An equijoin weeds out all rows in which the specified columns have different values. And a theta-join weeds out all rows in which the specified columns do not stand in the specified relationship (<, >, or whatever; in principle it could be is_prefix_of as a relationship beween strings).
Update: Note that outer joins cannot be understood this way, because they synthesize information (that is, nulls) out of nothing.