Let a and b be integer-valued attributes that may be NULL in some tuples. For each of the following conditions that may appear in a WHERE clause, describe exactly the set of (a,b) tuples that satisfy the condition, including the case where a and/or b is null.
(a) a=10 OR b=20
(b) a=10 AND b=20
(c) a<10 OR a>=10
(d) a=b
Kind of confused as to how to approach this problem. What exactly is this asking? It was in my textbook but we are learning about views and ER diagrams. Not sure how this ties into those topics.
For a) there will be two types of tuples:
a = 10 and b can be anything including null
b = 20 and a can be anything including null
From that you should be able to work out b) and c).
d) is interesting. There is only one type of tuple:
a = b and a is not null and b is not null
Null never equals anything, not even itself.
Related
select age from person where name in (select name from eats where pizza="mushroom")
I am not sure what to write for the "in". How should I solve this?
In this case the sub-select is equivalent to a join:
select age
from person p, eats e
where p.name = e.name and pizza='mushroom'
So you could translate it in:
πage (person p ⋈p.name=e.name (σpizza='mushroom'(eats e)))
Here's my guess. I'm assuming that set membership symbol is part of relational algebra
For base table r, C a column of both r & s, and x an unused name,
select ... from r where C in s
returns the same value as
select ... from r natural join (s) x
The use of in requires that s has one column. The in is true for a row of r exactly when its C equals the value in s. So the where keeps exactly the rows of r whose C equals the value in s. We assumed that s has column C, so the where keeps exactly the rows of r whose C equals the C of the row in r. Those are same rows that are returned by the natural join.
(For an expression like this where-in with C not a column of both r and s then this translation is not applicable. Similarly, the reverse translation is only applicable under certain conditions.)
How useful this particular translation is to you or whether you could simplify it or must complexify it depends on what variants of SQL & "relational algebra" you are using, what limitations you have on input expressions and other translation decisions you have made. If you use very straightforward and general translations then the output is more complex but obviously correct. If you translate using a lot of special case rules and ad hoc simplifications along the way then the output is simpler but the justification that the answer is correct is longer.
Suppose I have a table in pig with 3 columns, a , b, c. Now suppose I want to filter the table by b == 4 and then group it by a. I believe that would look something like this.
t1 = my_table; -- the table contains three columns a, b, c
t1_filtered = FILTER t1_filtered by (
b == 4
);
t1_grouped = GROUP t1_filtered by my_table.a;
My question is why can't it look like this:
t1 = my_table; -- the table contains three columns a, b, c
t1_filtered = FILTER t1_filtered by (
b == 4
);
t1_grouped = GROUP t1_filtered by t1_filtered.a;
Why do you have to reference the table before the filter? I'm trying to learn pig and i find myself making this mistake a lot. It seems to me that t1_filtered should equal a table that is just the filtered version of t1. Therefore a simple group should make sense, but i've been told you need to reference the table from before. Does anyone know whats going on behind the scenes and why this makes sense? Also, help naming this question is also appreciated.
The way you have De-referenced(.) is also not correct. This is how it should be.
A = LOAD '/filepath/to/tabledata' using PigStorage(',') as (a:int,b:int,c:int);
B = FILTER A BY a==1;
C = GROUP B BY a;
But your way of dereferencing(.) will also work in some cases. You can only use dot(.) when you are referencing a complex data type like a map,tuple or bag. If we use dot operator to access the normal fields it would expect a scalar output. If it has more than one output then you will get a error something like this.
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1,2,3), 2nd :(2,2,2)
Your way of using the dot operator would work only if the output of your group by has only one output if not you will end up with this error. Relation B is not a complex data type that is the reason we do not use any dereferencing operator in the group by clause.
Hope this answers your question.
I have one maybe stupid question.
Look at the query :
select count(a) as A, count(b) as b, count(a)+count(b) as C
From X
How can I sum up the two columns without repeating the code:
Something like:
select count(a) as A, count(b) as b, A+B as C
From X
For the sake of completeness, using a CTE:
WITH V AS (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
)
SELECT A, B, A + B as C
FROM V
This can easily be handled by making the engine perform only two aggregate functions and a scalar computation. Try this.
SELECT A, B, A + B as C
FROM (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
) T
You may get the two individual counts of a same table and then get the summation of those counts, like bellow
SELECT
(SELECT COUNT(a) FROM X )+
(SELECT COUNT(b) FROM X )
AS C
Let's agree on one point: SQL is not an Object-Oriented language. In fact, when we think of computer languages, we are thinking of procedural languages (you use the language to describe step by step how you want the data to be manipulated). SQL is declarative (you describe the desired result and the system works out how to get it).
When you program in a procedural languages your main concerns are: 1) is this the best algorithm to arrive at the correct result? and 2) do these steps correctly implement the algorithm?
When you program in a declarative language your main concern is: is this the best description of the desired result?
In SQL, most of your effort will be going into correctly forming the filtering criteria (the where clause) and the join criteria (any on clauses). Once that is done correctly, you're pretty much just down to aggregating and formating (if applicable).
The first query you show is perfectly formed. You want the number of all the non-null values in A, the number of all the non-null values in B, and the total of both of those amounts. In some systems, you can even use the second form you show, which does nothing more than abstract away the count(x) text. This is convenient in that if you should have to change a count(x) to sum(x), you only have to make a change in one place rather than two, but it doesn't change the description of the data -- and that is important.
Using a CTE or nested query may allow you to mimic the abstraction not available in some systems, but be careful making cosmetic changes -- changes that do not alter the description of the data. If you look at the execution plan of the two queries as you show them, the CTE and the subquery, in most systems they will probably all be identical. In other words, you've painted your car a different color, but it's still the same car.
But since it now takes you two distinct steps in 4 or 5 lines to explain what it originally took only one step in one line to express, it's rather difficult to defend the notion that you have made an improvement. In fact, I'll bet you can come up with a lot more bullet points explaining why it would be better if you had started with the CTE or subquery and should change them to your original query than the other way around.
I'm not saying that what you are doing is wrong. But in the real world, we are generally short of the spare time to spend on strictly cosmetic changes.
Good morning my beloved sql wizards and sorcerers,
I am wanting to substitute on 3 columns of data across 3 tables. Currently I am using the NVL function, however that is restricted to two columns.
See below for an example:
SELECT ccc.case_id,
NVL (ccvl.descr, ccc.char)) char_val
FROM case_char ccc, char_value ccvl, lookup_value lval1
WHERE
ccvl.descr(+) = ccc.value
AND ccc.value = lval1.descr (+)
AND ccc.case_id IN ('123'))
case_char table
case_id|char |value
123 |email| work_email
124 |issue| tim_
char_value table
char | descr
work_email | complaint mail
tim_ | timeliness
lookup_value table
descr | descrlong
work_email| xxx#blah.com
Essentially what I am trying to do is if there exists a match for case_char.value with lookup_value.descr then display it, if not, then if there exists a match with case_char.value and char_value.char then display it.
I am just trying to return the description for 'issue'from the char_value table, but for 'email' I want to return the descrlong from the lookup_value table (all under the same alias 'char_val').
So my question is, how do I achieve this keeping in mind that I want them to appear under the same alias.
Let me know if you require any further information.
Thanks guys
You could nest NVL:
NVL(a, NVL(b, NVL(c, d))
But even better, use the SQL-standard COALESCE, which does take multiple arguments and also works on non-Oracle systems:
COALESCE(a, b, c, d)
How about using COALESCE:
COALESCE(ccvl.descr, ccc.char)
Better to Use COALESCE(a, b, c, d) because of below reason:
Nested NVL logic can be achieved in single COALESCE(a, b, c, d).
It is SQL standard to use COALESCE.
COALESCE gives better performance in terms, NVL always first calculate both of the queries used and then compare if the first value is null then return a second value. but in COALESCE function it checks one by one and returns response whenever if found a non-null value instead of executing all the used queries.
I'm looking for a clear, basic explanation of the concept of theta join in relational algebra and perhaps an example (using SQL perhaps) to illustrate its usage.
If I understand it correctly, the theta join is a natural join with a condition added in. So, whereas the natural join enforces equality between attributes of the same name (and removes the duplicate?), the theta join does the same thing but adds in a condition. Do I have this right? Any clear explanation, in simple terms (for a non-mathmetician) would be greatly appreciated.
Also (sorry to just throw this in at the end, but its sort of related), could someone explain the importance or idea of cartesian product? I think I'm missing something with regard to the basic concept, because to me it just seems like a restating of a basic fact, i.e that a set of 13 X a set of 4 = 52...
Leaving SQL aside for a moment...
A relational operator takes one or more relations as parameters and results in a relation. Because a relation has no attributes with duplicate names by definition, relational operations theta join and natural join will both "remove the duplicate attributes." [A big problem with posting examples in SQL to explain relation operations, as you requested, is that the result of a SQL query is not a relation because, among other sins, it can have duplicate rows and/or columns.]
The relational Cartesian product operation (results in a relation) differs from set Cartesian product (results in a set of pairs). The word 'Cartesian' isn't particularly helpful here. In fact, Codd called his primitive operator 'product'.
The truly relational language Tutorial D lacks a product operator and product is not a primitive operator in the relational algebra proposed by co-author of Tutorial D, Hugh Darwen**. This is because the natural join of two relations with no attribute names in common results in the same relation as the product of the same two relations i.e. natural join is more general and therefore more useful.
Consider these examples (Tutorial D):
WITH RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } , TUPLE { Y 3 } } AS R1 ,
RELATION { TUPLE { X 1 } , TUPLE { X 2 } } AS R2 :
R1 JOIN R2
returns the product of the relations i.e. degree of two (i.e. two attributes, X and Y) and cardinality of 6 (2 x 3 = 6 tuples).
However,
WITH RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } , TUPLE { Y 3 } } AS R1 ,
RELATION { TUPLE { Y 1 } , TUPLE { Y 2 } } AS R2 :
R1 JOIN R2
returns the natural join of the relations i.e. degree of one (i.e. the set union of the attributes yielding one attribute Y) and cardinality of 2 (i.e. duplicate tuples removed).
I hope the above examples explain why your statement "that a set of 13 X a set of 4 = 52" is not strictly correct.
Similarly, Tutorial D does not include a theta join operator. This is essentially because other operators (e.g. natural join and restriction) make it both unnecessary and not terribly useful. In contrast, Codd's primitive operators included product and restriction which can be used to perform a theta join.
SQL has an explicit product operator named CROSS JOIN which forces the result to be the product even if it entails violating 1NF by creating duplicate columns (attributes). Consider the SQL equivalent to the latter Tutoral D exmaple above:
WITH R1 AS (SELECT * FROM (VALUES (1), (2), (3)) AS T (Y)),
R2 AS (SELECT * FROM (VALUES (1), (2)) AS T (Y))
SELECT *
FROM R1 CROSS JOIN R2;
This returns a table expression with two columns (rather than one attribute) both called Y (!!) and 6 rows i.e. this
SELECT c1 AS Y, c2 AS Y
FROM (VALUES (1, 1),
(2, 1),
(3, 1),
(1, 2),
(2, 2),
(3, 2)
) AS T (c1, c2);
** That is, although there is only one relational model (i.e. Codd's), there can be more than one relational algebra (i.e. Codd's is but one).
You're not quite right - a theta join is a join which may include a condition other than = - in SQL, typically < or >= etc. See TechNet
As for cartesian product (or CROSS JOIN), it is an operation rather than an idea or concept. It's important because sometimes you need to use it! It is a basic fact that set of 13 x set of 4 = 52, and cartesian product is based on this fact.
In my opinion, to make it simple, if you understand equijoin, you suppose should understand theta join. If you change the symbol = (equal) in equijoin to >=, then you already done theta join. However, I think it is quite difficult to see the practicality of using theta join as compared to equijoin since the join cause that we normally use is V.primarykey = C.foreignkey. And if you want to change to theta join, then it may depends on the value, as such you are doing selection.
For natural Join, it is just similar to equijoin, the difference is just that get rid of the redundant attributes. easy!:)
Hope this explanation helps.
All joins can be thought of as beginning with a cross product and then weeding out certain rows. A natural join weeds out all rows in which columns of the same name in the two tables being joine have different values. An equijoin weeds out all rows in which the specified columns have different values. And a theta-join weeds out all rows in which the specified columns do not stand in the specified relationship (<, >, or whatever; in principle it could be is_prefix_of as a relationship beween strings).
Update: Note that outer joins cannot be understood this way, because they synthesize information (that is, nulls) out of nothing.