Doing an assignment for my database course and I want to double check my relational algebra.
The SQL:
SELECT dato, SUM(pris*antall) AS total
FROM produkt, ordre
WHERE ordre.varenr = produkt.varenr
GROUP BY dato
HAVING total >= 10000
The relational algebra:
σ total >= 10000 (
ρ R(dato, total)(
σ ordre.varenr = produkt.varenr (
dato ℑ SUM(pris*antall (produkt x ordre)
)
)
)
Is this correct?
I don't know. And anybody else is not likely to know either.
RA courses typically limit themselves to the selection, projection and join operators. Aggregations are not typically covered by an RA course. There even isn't any standard approach (that I know of) that the RA takes on aggregations.
What is the operator that your course defines for doing aggregations on relations ? What type of value does that operator produce for its result ? A relation ? Something else ? If something else, how does your course explain doing relational restrictions on that result, given that these result values aren't relations, but restriction works only on relations ?
Algebraically, this case starts with a natural join (produkt x ordre).
[The result of] this natural join is subjected to an aggregation operation. Thus this natural join is to appear where you specify the relational input argument to your aggregation operator. The other needed specs for specifying the aggregation are the output attribute names (total), and the way to compute them (SUM(...)). Those might appear in subscript next to your aggregation operator symbol as "annotations", much like the attribute lists on projection and the restriction condition on restriction. But anything concerning this operator is course-specific, because there isn't any agreed-upon standard notation for aggregations, as far as I know.
Then if your aggregation operator is defined to return a relation, you can specify your aggregation result as the input argument to a restriction with condition "total>=10000".
Related
A colleague of mine who is generally well-versed in SQL told me that the order of operands in a > or = expression could determine whether or not the expression was sargable. In particular, with a query whose case statement included:
CASE
when (select count(i.id)
from inventory i
inner join orders o on o.idinventory = i.idInventory
where o.idOrder = #order) > 1 THEN 2
ELSE 1
and was told to reverse the order of the operands to the equivalent
CASE
when 1 < (select count(i.id)
from inventory i
inner join orders o on o.idinventory = i.idInventory
where o.idOrder = #order) THEN 2
ELSE 1
for sargability concerns. I found no difference in query plans, though ultimately I made the change for the sake of sticking to team coding standards. Is what my co-worker said true in some cases? Does the order of operands in an expression have potential impact on its execution time? This doesn't mesh with how I understand sargability to work.
For Postgres, the answer is definitely: "No." (sql-server was added later.)
The query planner can flip around left and right operands of an operator as long as a COMMUTATOR is defined, which is the case for all instance of < and >. (Operators are actually defined by the operator itself and their accepted operands.) And the query planner will do so to make an expression "sargable". Related answer with detailed explanation:
Can PostgreSQL index array columns?
It's different for other operators without COMMUTATOR. Example for ~~ (LIKE):
LATERAL JOIN not using trigram index
If you're talking about the most popular modern databases like Microsoft SQL, Oracle, Postgres, MySql, Teradata, the answer is definitely NO.
What is a SARGable query?
A SARGable query is the one that strive to narrow the number of rows a database has to process in order to return you the expected result. What I mean, for example:
Consider this query:
select * from table where column1 <> 'some_value';
Obviously, using an index in this case is useless, because a database most certainly would have to look through all rows in a table to give you expected rows.
But what if we change the operator?
select * from table where column1 = 'some_value';
In this case an index can give good performance and return expected rows almost in a flash.
SARGable operators are: =, <, >, <= ,>=, LIKE (without %), BETWEEN
Non-SARGable operators are: <>, IN, OR
Now, back to your case.
Your problem is simple. You have X and you have Y. X > Y or Y < X - in both cases you have to determine the values of both variables, so this switching gives you nothing.
P.S. Of course, I concede, there could be databases with very poor optimizers where this kind of swithing could play role. But, as I said before, in modern databases you should not worry about it.
I'm a beginner, studying on my own... please help me to clarify something about a query: I am working with a soccer database and trying to answer this question: list all seasons with an avg goal per Match rate of over 1, in Matchs that didn’t end with a draw;
The right query for it is:
select season,round((sum(home_team_goal+away_team_goal) *1.0) /count(id),3) as ratio
from match
where home_team_goal != away_team_goal
group by season
having ratio > 1
I don't understand 2 things about this query:
Why do I *1.0? why is it necessary?
I know that the execution in SQL is by this order:
from
where
group
having
select
So how does this query include: having ratio>1 if the "ratio" is only defined in the "select" which is executed AFTER the HAVING?
Am I confused?
Thanks in advance for the help!
The multiplication is added as a typecast to convert INT to FLOAT because by default sum of ints is int and the division looses decimal places after dividing 2 ints.
HAVING. You can consider HAVING as WHERE but applied to the query results. Imagine the query is executed first without HAVING and then the HAVING condition is applied to result rows leaving only suitable ones.
In you case you first select grouped data and calculate aggregated results and then skip unnecessary results of aggregation.
the *1.0 is used for its ".0" part so that it tells the system to treat the expression as a decimal, and thus not make an integer division which would cut-off the decimal part (eg 1 instead of 1.33).
About the second part: select being at the end just means that the last thing
to be done is showing the data. Hoewever, assigning an alias to a calculated field is being done, you could say, at first priority. Still, I am a bit doubtful; I am almost certain field aliases cannot be used in the where/group by/having in, say, sql server.
There is no order of execution of a SQL query. SQL is a descriptive language not a procedural language. A SQL query describes the result set that the query is producing. The SQL engine can execute it however it likes. In fact, most SQL engines compile the query into a directed acyclic graph, which looks nothing like the original query.
What you are referring to might be better phrased as the "order of interpretation". This is more simply described by simple rules. Column aliases can be used in the ORDER BY clause in any database. They cannot be used in the FROM, WHERE, or GROUP BY clauses. Some databases -- such as SQLite -- allow them to be referenced in the HAVING clause.
As for the * 1.0, it is because some databases -- such as SQLite -- do integer arithmetic. However, the logic that you want is probably more simply expressed as:
round((avg(home_team_goal + away_team_goal * 1.0), 3)
For an exam, I am asked to get the list of clients having more than one rent, both as an SQL query and as an algebraic expression.
For some reasons, the correction doesn't provide the algebraic version.
So now I am left with:
SELECT IdClient, Name, ...
FROM Client
WHERE IdClient IN (
SELECT IdClient
FROM Rental
GROUP BY IdClient
HAVING COUNT(*) > 1
)
I don't know if there is a standard for algebraic notations, therefore:
Π projection
× Cartesian product
⋈ natural join
σ selection
Then I translate as:
Π IdClient, Name, ... (
σ (count(IdClient)>1) (Π Rental) ⋈ (Client ⋈ Rental)
)
But I find no source to prove me right or wrong, especially for:
the logic behind the math
Π Rental seems like a shady business
I saw the use of count() at https://cs.stackexchange.com/questions/29897/use-count-in-relational-algebra and while it isn't used the same way, I couldn't figure out a way to use it without the projection (which I wanted to avoid.)
There are many variants of "relational algebra", differing even on what a relation is. You need to tell us which one you are supposed to use.
Also you don't explain what it means for a pair of RA & SQL queries to "have the form of" or "be the same as" each other. (Earlier versions.) Same result? Or also some kind of parallel structure?
Also you don't explain what "get the list of clients" means. What attributes does the result have?
If you try to write a definition of the count you are trying to use in σ count(IdClient)>1 (...)--what it inputs & what it outputs based on that--you will see that you can't. That kind of count that takes just an attribute does not correspond to a relational operator. It is used in a grouping expression--which you are missing. Such count & group are not actually relational operators, they are non-terminals in so-called relational algebras that are really query languages, designed by SQL apologists, suggesting it is easy to map SQL to relational algebra, but begging the question of how we aggregate in an algebra. Still, maybe that's the sort of "relational algebra" you were told to use.
I saw the use of count() there https://cs.stackexchange.com/questions/29897/use-count-in-relational-algebra
The nature of algebras is that the only sense in which we "use" operators "with" other operators is to pass outputs of operator calls as inputs to other operator calls. (Hence some so-called algebras are not.) In your linked answer, grouping operator G inputs aggregate name count and attribute name name and that affects the output. The answer quotes Database System Concepts, 5th Ed:
G1, G2, ..., Gn G F1(A1), F2(A2), ..., Fm(Am) (E)
where E is any relational-algebra expression; G1, G2, ..., Gn constitute a list of attributes on which to group; each Fi is an aggregate function; and each Ai is an attribute name.
G returns rows with attributes G1, ..., A1, ... where one or more rows with the same G1, ... subrows are in E and each Ai holds the output from aggregating Fi on Ai over those rows.
But the answer when you read & linked it used that definition improperly. (I got it fixed since.) Correct is:
π name (σ phone>1 (name G count(phone) (Person)))
This is clear if you carefully read the definition.
G has misleading syntax. count(phone) is not a call of an operator; it's just a pair of arguments, an aggregate name count & an attribute name phone. Less misleading syntax would be
π name (σ phone>1 (name G count phone (Person)))
One does not need a grouping operator to write your query. That makes it all the more important to know what "relational algebra" means in the exam. It is harder if you can't use a grouping operator.
"Π Rental seems like a shady business" is unclear. You do use projection incorrectly; proper use is π attributes (relation). I guess you are using π in an attempt to involve a grouping operator like G. Re "the logic behind the math" see this.
select age from person where name in (select name from eats where pizza="mushroom")
I am not sure what to write for the "in". How should I solve this?
In this case the sub-select is equivalent to a join:
select age
from person p, eats e
where p.name = e.name and pizza='mushroom'
So you could translate it in:
πage (person p ⋈p.name=e.name (σpizza='mushroom'(eats e)))
Here's my guess. I'm assuming that set membership symbol is part of relational algebra
For base table r, C a column of both r & s, and x an unused name,
select ... from r where C in s
returns the same value as
select ... from r natural join (s) x
The use of in requires that s has one column. The in is true for a row of r exactly when its C equals the value in s. So the where keeps exactly the rows of r whose C equals the value in s. We assumed that s has column C, so the where keeps exactly the rows of r whose C equals the C of the row in r. Those are same rows that are returned by the natural join.
(For an expression like this where-in with C not a column of both r and s then this translation is not applicable. Similarly, the reverse translation is only applicable under certain conditions.)
How useful this particular translation is to you or whether you could simplify it or must complexify it depends on what variants of SQL & "relational algebra" you are using, what limitations you have on input expressions and other translation decisions you have made. If you use very straightforward and general translations then the output is more complex but obviously correct. If you translate using a lot of special case rules and ad hoc simplifications along the way then the output is simpler but the justification that the answer is correct is longer.
I need some help converting an SQL query into relational algebra.
Here is the SQL query:
SELECT * FROM Customer, Appointment
WHERE Appointment.CustomerCode = Customer.CustomerCode
AND Appointment.ServerCode IN
(
SELECT ServerCode FROM Appointment WHERE CustomerCode = '102'
)
;
I'm stuck because of the IN subquery in the above example.
Can anyone demonstrate for me how to express this SQL query in relational algebra?
Many thanks.
EDIT: Here is my proposed solution in relational algebra. Is this correct? Does it reproduce the SQL query?
Scodes ← ΠServerCode(σCustomerCode='102'(Appointment))
Ccodes ← ΠCustomerCode(Appointment ⋉ Scodes)
Result ← (Customer ⋉ Ccodes)
Your SQL code will result in duplicate columns for CustomerCode and the use of SELECT [ALL] is likely to result in duplicate rows. Because the result is not a relation, it cannot be expressed in relational algebra.
These problems are easily fixed in SQL:
SELECT DISTINCT *
FROM Customer NATURAL JOIN Appointment
WHERE Appointment.ServerCode IN
(
SELECT ServerCode FROM Appointment WHERE CustomerCode = '102'
)
;
You didn't specify which relational algebra you are intereted in. Date and Darwen proposed an algebra named A, specified an A language named D, and designed a D language named Tutorial D.
Tutorial D uses operators JOIN for natural join, WHERE for restriction and MATCHING for semijoin, The slight complication is the comparison in SQL:
CustomerCode = '102'
The comparison of a CustomerCode value to a CHAR value in SQL is possible because of implicit coercion. Tutorial D is stricter -- type safe, if you will -- requiring you to overload the equality operator or, more practically, define a selector operator for CHAR, which would typically have the same name as the type.
Therefore, the above (revised) SQL may be written in Tutorial D as:
( Customer JOIN Appointment )
MATCHING ( ( Appointment WHERE CustomerCode = CustomerCode ( '102' ) ) { ServerCode } )
"How do I represent my query in this standard form of RA?"
It's not so much a question of "type of algebra" as it is of "type of notation".
Notation using greek symbols typically uses sigma, the restrict condition in subscript appended to the sigma character, and then the subject of the restriction (the relational expression that is subjected to the restrict condition).
Date avoid that notation, because typesetting and/or creating text using such notations is usually a lot harder than it is using just the western alphabet (a math teacher of mine once told us that math textbooks contain the most errors of all).
σ <cond> (<rel exp>) thus denotes the very same algebra expression as (Date's syntax) "<rel exp> WHERE <cond>".
Similarly, with greek symbols, projection is typically denoted using the letter Pi, with the list of retained attributes in subscript appended to the Pi, and the expression that is the subject of the projection following that.
Π <attr list> (<rel exp>) thus denotes the very same algebra expression as (Date's syntax) "<rel exp> { <attr list> }".
The join family of operators is usually denoted, in "greek" symbols, using (variations of) the Unicode BOWTIE character, or that character consisting of a lowercase letter 'x' surrounded by a full circle (usually used to denote full cartesian product, cross-product, ... whatever your algebra course happens to name it).
Some courses provide a "greek-symbol" notation for rename, using the greek letter Rho. Appended in subscript is the rename list, in the form a1->b1,a2->b2,... Appended after that comes the relational expression that is subjected to the rename. Likewise, Date has a non-greek-symbol equivalent syntax : <rel exp> RENAME a1 AS b1, a2 AS b2 , ...
The important thing is to see that these differences are merely differences in syntactical notation, not "different algebrae".
EDIT
One could imagine that the greek symbols notation would be the way to program relational algebra into an APL engine, Date's syntax would be the way to program relational algebra into a cobol-like or PL/1-like engine (there effectively exists such an engine called Rel), and the way to program relational algebra into an OO-like engine, could look something like relation.NaturalJoin(otherRelation).Matching(yetOtherRelation.Restrict(condition).project(attributesList)).