What is the difference between `::` and `.` in pig? - apache-pig

What is the difference between :: and . in pig?
When do I use one vs the other?
E.g., I know that :: is needed in a join when a field exists in both aliases:
A = foreach (join B by (x), C by (y)) generate B::y as b_y, C::y as c_y;
and I need . when accessing group fields:
A = foreach (group B by (x,y)) generate group.x as x, group.y as y, SUM(B?z) as z;
However, do I pass B::z or B.z to SUM above instead of B?z?

In Pig, :: is used as a disambiguation tool after operations which could possibly create naming collisions. Notably, this happens with JOIN, CROSS, and FLATTEN. Consider two relations, A:{(id:int, name:chararray)} and B:{(id:int, location:chararray)}. If you want to associate names with locations, naturally you would do:
C = JOIN A BY id, B BY id;
Without the disambiguation operator, your schema would be
C:{(id:int, name:chararray, id:int, location:chararray)}
Now you can't tell which field id refers to. To avoid this, Pig will instead do
C:{(A::id:int, A::name:chararray, B::id:int, B::location:chararray)}
Likewise, you could FLATTEN two bags whose tuples have fields with the same name, and they would also collide. So the same operator is used in this case as well. When there is no such conflict, you do not need to use the full name: name is unambiguous here. To simplify C, then, you can do this:
D = FOREACH C GENERATE A::id, name, location;
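As a rough illustration of the naming rule described above, the sketch below models a joined schema in Python. It is not Pig code and not how Pig is implemented; the helper names join_schema and resolve are hypothetical, and the only behaviour assumed is what the answer states: every field gets an alias:: prefix, and a short name works only when it is unambiguous.

```python
def join_schema(left_alias, left_fields, right_alias, right_fields):
    """Return the joined field names, each prefixed with its source alias."""
    return ([f"{left_alias}::{f}" for f in left_fields] +
            [f"{right_alias}::{f}" for f in right_fields])


def resolve(schema, name):
    """Resolve a full or short field name; fail if a short name is ambiguous."""
    if name in schema:
        return name
    matches = [f for f in schema if f.split("::")[-1] == name]
    if len(matches) != 1:
        raise ValueError(f"cannot resolve {name!r}: candidates {matches}")
    return matches[0]


# Schemas of A and B from the example above.
schema_c = join_schema("A", ["id", "name"], "B", ["id", "location"])
print(schema_c)
print(resolve(schema_c, "name"))   # short name is enough here
print(resolve(schema_c, "A::id"))  # id must be written with its prefix
```

In this model, resolve(schema_c, "id") raises an error, which mirrors why A::id has to be spelled out in the FOREACH while name and location do not.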
The . operator, by contrast, projects fields from bags and tuples. If you have a bag b with schema {(x:int, y:int, z:int)}, the projection b.y yields a bag with just the specified field: {(y:int)}. You can project multiple fields at once with parentheses: b.(y,z) yields {(y:int, z:int)}.
When used with tuples, the result is a tuple with just the specified fields. If the tuple t has schema (x:int, y:int, z:int), then t.x is the tuple (x:int) and t.(y,z) is the tuple (y:int, z:int).
To your specific question about SUM: note that SUM, along with the other summary-statistic UDFs, takes a bag as its argument. Therefore, you need to create a bag with just the one field per tuple that you want to sum. Do this with the projection operator (.): SUM(B.z).
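To make the bag projection concrete, here is a small Python model of it. It is only a sketch: a bag is represented as a list of dicts, the sample values are made up, and project and pig_sum are hypothetical stand-ins for the . operator and for SUM.

```python
def project(bag, *fields):
    """Bag projection: keep only the named fields of every tuple in the bag."""
    return [tuple(row[f] for f in fields) for row in bag]


def pig_sum(bag):
    """SUM-like aggregate: add up the single field of each tuple in the bag."""
    return sum(value for (value,) in bag)


# One group as produced by `group B by (x, y)`: the bag B for the key (1, 2).
bag_b = [{"x": 1, "y": 2, "z": 10}, {"x": 1, "y": 2, "z": 5}]

z_only = project(bag_b, "z")      # models B.z
y_and_z = project(bag_b, "y", "z")  # models B.(y,z)
total = pig_sum(z_only)           # models SUM(B.z)
print(z_only, y_and_z, total)
```

The point of the model is that the projection still returns a bag (one single-field tuple per input tuple), which is the shape SUM expects.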

IIRC you get :: as a side effect after some statements. You need not bother about it unless (as you mentioned) the same name exists under two different prefixes.
The . is different in that you are going inside the structure.
group.x as x, group.y as y is equivalent to FLATTEN(group)
SUM(B?z) - here you should do SUM(B.z), to specify that you need a particular field to SUM.

Related

Unnesting structs in BigQuery

What is the correct way to flatten a struct of two arrays in BigQuery? I have a dataset like the one pictured here (the struct.destination and struct.visitors arrays are ordered - i.e. the visitor counts correspond specifically to the destinations in the same row):
I want to reorganize the data so that I have a total visitor count for each unique combination of origins and destinations. Ideally, the end result will look like this:
I tried using UNNEST twice in a row - once on struct.destination and then on struct.visitors, but this produces the wrong result (each destination gets mapped to every value in the array of visitor counts when it should only get mapped to the value in the same row):
SELECT
  origin,
  unnested_destination,
  unnested_visitors
FROM
  dataset.table,
  UNNEST(struct.destination) AS unnested_destination,
  UNNEST(struct.visitors) AS unnested_visitors
You have one struct that is repeated. So, I think you want:
SELECT origin,
s.destination,
s.visitors
FROM dataset.table t CROSS JOIN
UNNEST(t.struct) s;
EDIT:
I see, you have a struct of two arrays. You can do:
SELECT origin, d AS destination, v AS visitors
FROM dataset.table t CROSS JOIN
     UNNEST(struct.destination) d WITH OFFSET nd LEFT JOIN
     UNNEST(struct.visitors) v WITH OFFSET nv
     ON nd = nv
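The difference between unnesting the two arrays independently and pairing them by offset can be sketched in plain Python. This is only a model of the semantics, not BigQuery code; the sample rows are invented (the original data was only shown as an image), and the function names are hypothetical.

```python
from collections import defaultdict

# Made-up rows: each has an origin and a struct of two parallel arrays.
rows = [
    {"origin": "A", "struct": {"destination": ["X", "Y"], "visitors": [3, 4]}},
    {"origin": "A", "struct": {"destination": ["X"], "visitors": [2]}},
]


def cross_unnest(rows):
    """Two independent UNNESTs: every destination meets every visitor count."""
    return [(r["origin"], d, v)
            for r in rows
            for d in r["struct"]["destination"]
            for v in r["struct"]["visitors"]]


def offset_unnest(rows):
    """UNNEST ... WITH OFFSET joined on equal offsets: positions are paired."""
    out = []
    for r in rows:
        dests = r["struct"]["destination"]
        visits = r["struct"]["visitors"]
        for nd, d in enumerate(dests):
            # LEFT JOIN: keep the destination even if no visitor count lines up.
            v = visits[nd] if nd < len(visits) else None
            out.append((r["origin"], d, v))
    return out


def total_by_pair(triples):
    """Sum visitors per (origin, destination), treating a missing count as 0."""
    totals = defaultdict(int)
    for origin, dest, visitors in triples:
        totals[(origin, dest)] += visitors or 0
    return dict(totals)


print(len(cross_unnest(rows)))            # more rows than array positions
print(total_by_pair(offset_unnest(rows)))
```

With the offset pairing, each destination keeps only the visitor count from the same position, so a final GROUP BY origin, destination with SUM gives the totals the question asks for.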
It is difficult to test without the underlying data, so I wrote my own query based on your dataset. As far as I can tell, destination|visitors is not in ARRAY format but in STRUCT format, so you do not need to UNNEST it. Also view this thread please :)
SELECT
origin,
COUNT(struct.destination),
COUNT(struct.visitors)
FROM dataset.table
GROUP BY 1

How to represent GROUP BY with HAVING COUNT(*)>1 in relational algebra?

For an exam, I am asked to get the list of clients having more than one rent, both as an SQL query and as an algebraic expression.
For some reason, the answer key doesn't provide the algebraic version.
So now I am left with:
SELECT IdClient, Name, ...
FROM Client
WHERE IdClient IN (
  SELECT IdClient
  FROM Rental
  GROUP BY IdClient
  HAVING COUNT(*) > 1
)
I don't know if there is a standard for algebraic notations, therefore:
Π projection
× Cartesian product
⋈ natural join
σ selection
Then I translate as:
Π IdClient, Name, ... (
σ (count(IdClient)>1) (Π Rental) ⋈ (Client ⋈ Rental)
)
But I find no source to prove me right or wrong, especially for:
the logic behind the math
Π Rental seems like a shady business
I saw the use of count() at https://cs.stackexchange.com/questions/29897/use-count-in-relational-algebra and while it isn't used the same way, I couldn't figure out a way to use it without the projection (which I wanted to avoid.)
There are many variants of "relational algebra", differing even on what a relation is. You need to tell us which one you are supposed to use.
Also you don't explain what it means for a pair of RA & SQL queries to "have the form of" or "be the same as" each other. (Earlier versions.) Same result? Or also some kind of parallel structure?
Also you don't explain what "get the list of clients" means. What attributes does the result have?
If you try to write a definition of the count you are trying to use in σ count(IdClient)>1 (...)--what it inputs & what it outputs based on that--you will see that you can't. That kind of count that takes just an attribute does not correspond to a relational operator. It is used in a grouping expression--which you are missing. Such count & group are not actually relational operators, they are non-terminals in so-called relational algebras that are really query languages, designed by SQL apologists, suggesting it is easy to map SQL to relational algebra, but begging the question of how we aggregate in an algebra. Still, maybe that's the sort of "relational algebra" you were told to use.
I saw the use of count() there https://cs.stackexchange.com/questions/29897/use-count-in-relational-algebra
The nature of algebras is that the only sense in which we "use" operators "with" other operators is to pass outputs of operator calls as inputs to other operator calls. (Hence some so-called algebras are not.) In your linked answer, grouping operator G inputs aggregate name count and attribute name name and that affects the output. The answer quotes Database System Concepts, 5th Ed:
G1, G2, ..., Gn G F1(A1), F2(A2), ..., Fm(Am) (E)
where E is any relational-algebra expression; G1, G2, ..., Gn constitute a list of attributes on which to group; each Fi is an aggregate function; and each Ai is an attribute name.
G returns rows with attributes G1, ..., A1, ... where one or more rows with the same G1, ... subrows are in E and each Ai holds the output from aggregating Fi on Ai over those rows.
But the answer when you read & linked it used that definition improperly. (I got it fixed since.) Correct is:
π name (σ phone>1 (name G count(phone) (Person)))
This is clear if you carefully read the definition.
G has misleading syntax. count(phone) is not a call of an operator; it's just a pair of arguments, an aggregate name count & an attribute name phone. Less misleading syntax would be
π name (σ phone>1 (name G count phone (Person)))
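To see why the selection is written on phone rather than on count(phone), it may help to trace the expression over a tiny relation. The sketch below models relations as lists of dicts in Python; the Person rows are invented, and group_count, select and project are hypothetical helpers that follow the quoted definition of G only loosely (a single grouping attribute and the count aggregate).

```python
from collections import defaultdict


def group_count(relation, group_attr, counted_attr):
    """Model of `group_attr G count(counted_attr)`: one row per group,
    with the count stored under the name of the counted attribute."""
    counts = defaultdict(int)
    for row in relation:
        counts[row[group_attr]] += 1
    return [{group_attr: key, counted_attr: n} for key, n in counts.items()]


def select(relation, predicate):
    """Selection (sigma): keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attrs):
    """Projection (pi): keep the listed attributes and drop duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


person = [
    {"name": "Ann", "phone": "555-0100"},
    {"name": "Ann", "phone": "555-0101"},
    {"name": "Bob", "phone": "555-0102"},
]

grouped = group_count(person, "name", "phone")
result = project(select(grouped, lambda r: r["phone"] > 1), ["name"])
print(grouped)
print(result)
```

After grouping, the attribute phone holds the count, so the selection compares phone with 1; there is no count left to call at that point.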
One does not need a grouping operator to write your query. That makes it all the more important to know what "relational algebra" means in the exam. It is harder if you can't use a grouping operator.
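One possible way to do without a grouping operator, offered here only as an illustration and not as the answer's own method, is to pair Rental with a renamed copy of itself and keep the pairs that share a client but are different rentals. The sketch assumes Rental has a key attribute, called IdRental here, which the question does not show.

```python
from itertools import product

# Made-up Rental relation; IdRental is an assumed key attribute.
rental = [
    {"IdRental": 1, "IdClient": 10},
    {"IdRental": 2, "IdClient": 10},
    {"IdRental": 3, "IdClient": 20},
]


def clients_with_multiple_rentals(rental):
    """Clients appearing in two distinct rentals, found without any grouping."""
    pairs = product(rental, rental)  # Rental x (renamed copy of Rental)
    matching = [(r1, r2) for r1, r2 in pairs
                if r1["IdClient"] == r2["IdClient"]
                and r1["IdRental"] != r2["IdRental"]]
    # Projection on IdClient; the set removes duplicate rows.
    return sorted({r1["IdClient"] for r1, _ in matching})


print(clients_with_multiple_rentals(rental))
```

Joining the resulting IdClient values back to Client would then give the remaining attributes. Whether this form is acceptable still depends on which variant of relational algebra the exam expects.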
"Π Rental seems like a shady business" is unclear. You do use projection incorrectly; proper use is π attributes (relation). I guess you are using π in an attempt to involve a grouping operator like G. Re "the logic behind the math" see this.

Hive UDF to generate all possible ordered combinations from the list

I am trying to figure out how to write a Hive UDF that takes a list as input and outputs a list of all 2-way ordered combinations of the elements in the list.
Input:
list_variable_b
[5142430,5146974,5141766]
Output:
list_variable_b
[(5142430,5146974),(5146974,5141766),(5142430,5141766)]
So you're asking how to write a UDF that can take an array<bigint> and turn it into an array<struct<int,int>> or array<array<int>>.
It sounds like you want what's called n choose k, which will produce n!/((n-k)! k!) elements.
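As a quick sanity check of the element count, here is a short Python sketch (not Hive code) that enumerates the 2-element combinations of the sample list and compares their number with the binomial coefficient. The function name pairs is hypothetical.

```python
from itertools import combinations
from math import comb


def pairs(values):
    """All 2-element combinations, in the order the elements appear."""
    return list(combinations(values, 2))


ids = [5142430, 5146974, 5141766]
result = pairs(ids)
print(result)
print(len(result) == comb(len(ids), 2))
```

For three elements this gives three pairs, the same pairs as in the expected output of the question, though not necessarily listed in the same order.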
Now, Hive has two kinds of UDFs. The simple kind can only process primitive (non-collection) types. But here you are processing an array, so you'll need a Generic UDF. Generic UDFs can do much more than simple UDFs, but they are also more difficult to write. A good guide on how to do it is here: http://www.baynote.com/2012/11/a-word-from-the-engineers/
Another way would be to use a double LATERAL VIEW with the caveat that all the elements in the array have to be unique for this to work.
If the table is
create table xx ( col array<int>);
such that
select * from xx;
OK
[5142430,5146974,5141766]
Use a double lateral view to take the Cartesian product of the array with itself, then keep only the pairs where one element is smaller than the other:
select a1,b1 from xx
lateral view explode(col) a as a1
lateral view explode(col) b as b1 where a1 < b1;
5142430 5146974
5141766 5142430
5141766 5146974
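The same logic can be mimicked in Python to see where the uniqueness caveat comes from. This is a sketch of the semantics only; lateral_pairs is a hypothetical name, and product stands in for the two exploded views.

```python
from itertools import product


def lateral_pairs(col):
    """Cartesian product of col with itself, keeping only pairs with a1 < b1."""
    return [(a1, b1) for a1, b1 in product(col, col) if a1 < b1]


col = [5142430, 5146974, 5141766]
print(lateral_pairs(col))

# With a repeated element, the pair (1, 1) is never produced
# and the pair (1, 2) is produced twice.
print(lateral_pairs([1, 1, 2]))
```

Because the filter compares values rather than positions, equal elements can never be paired with each other, and each duplicate contributes its own copy of every other pair.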

How do I convert a column to a tuple in PIG

I have a Pig question about converting the columns of a table into tuples so that I can pass them to a UDF. Details follow.
There is a result "C" which looks like following if I do "dump C"
(a1,b1,c1)
(a2,b2,c2)
I want to extract each column as a tuple, as follows:
(a1,a2,a3), (b1,b2,b3), (c1,c2,c3)
and then call a UDF on each possible pair of tuples:
UDF((a1,a2,a3), (b1,b2,b3))
UDF((a1,a2,a3), (c1,c2,c3))
UDF((c1,c2,c3), (b1,b2,b3))
How do I do this in PIG?
You can get all of the values for a given "column" by using GROUP .. ALL and then using bag projection:
-- GROUP ... ALL yields a single row whose bag is named after the relation, C
grpd = GROUP C ALL;
udfs =
    FOREACH grpd
    GENERATE
        UDF(C.a, C.b),
        UDF(C.a, C.c),
        UDF(C.c, C.b);
Note, however, that the values for each column will be stored in bags rather than tuples. This is proper, because relations in Pig do not guarantee that the records are ordered in any particular way. So your UDF should be comparing bags and not rely on the order of the elements.
However, it may be important that you be able to compare values that were originally in the same row; i.e., match up a1 with b1, etc. For this, you will need to write your UDF to take a single bag, with each tuple containing the paired elements a_n and b_n. To do this, use bag projection of two columns:
-- Project two fields at once so values from the same row stay paired
grpd = GROUP C ALL;
udfs =
    FOREACH grpd
    GENERATE
        UDF(C.(a,b)),
        UDF(C.(a,c)),
        UDF(C.(c,b));
Again, the tuples will not necessarily be in order, so you should not rely on their order. Your bag will contain the tuples (a1,b1), (a2,b2), etc.
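The shape of the data that reaches the UDF can be modelled in Python as follows. This is only a sketch: the relation C holds made-up integers, group_all and project imitate GROUP ... ALL and bag projection, and paired_udf is a hypothetical, order-independent stand-in for the real UDF.

```python
def group_all(relation):
    """Model of GROUP ... ALL: a single row holding the key and one bag."""
    return ("all", list(relation))


def project(bag, *fields):
    """Bag projection: keep only the named fields of every tuple in the bag."""
    return [tuple(row[f] for f in fields) for row in bag]


def paired_udf(bag):
    """Hypothetical UDF: sum of products of the paired values.
    It does not depend on the order of the tuples in the bag."""
    return sum(left * right for left, right in bag)


# Made-up relation C with fields a, b, c.
c_rel = [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]
_, bag = group_all(c_rel)

results = [
    paired_udf(project(bag, "a", "b")),
    paired_udf(project(bag, "a", "c")),
    paired_udf(project(bag, "c", "b")),
]
print(results)
```

Reversing the bag before calling paired_udf gives the same values, which is the property a UDF over bags should have, since the order of tuples is not guaranteed.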

Converting a certain SQL query into relational algebra

Doing an assignment for my database course and I want to double check my relational algebra.
The SQL:
SELECT dato, SUM(pris*antall) AS total
FROM produkt, ordre
WHERE ordre.varenr = produkt.varenr
GROUP BY dato
HAVING total >= 10000
The relational algebra:
σ total >= 10000 (
ρ R(dato, total)(
σ ordre.varenr = produkt.varenr (
dato ℑ SUM(pris*antall) (produkt x ordre)
)
)
)
Is this correct?
I don't know. And nobody else is likely to know either.
RA courses typically limit themselves to the selection, projection and join operators. Aggregations are not typically covered by an RA course. There isn't even any standard approach (that I know of) that the RA takes on aggregations.
What is the operator that your course defines for doing aggregations on relations? What type of value does that operator produce for its result? A relation? Something else? If something else, how does your course explain doing relational restrictions on that result, given that these result values aren't relations, but restriction works only on relations?
Algebraically, this case starts with a natural join (produkt ⋈ ordre), which is what the product restricted by ordre.varenr = produkt.varenr amounts to.
[The result of] this natural join is subjected to an aggregation operation. Thus this natural join is to appear where you specify the relational input argument to your aggregation operator. The other needed specs for specifying the aggregation are the output attribute names (total), and the way to compute them (SUM(...)). Those might appear in subscript next to your aggregation operator symbol as "annotations", much like the attribute lists on projection and the restriction condition on restriction. But anything concerning this operator is course-specific, because there isn't any agreed-upon standard notation for aggregations, as far as I know.
Then if your aggregation operator is defined to return a relation, you can specify your aggregation result as the input argument to a restriction with condition "total>=10000".
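The order of operations described here (join, then aggregate, then restrict) can be traced with a small Python model. Everything in it is an assumption made for illustration: the split of attributes between produkt and ordre, the sample rows, and the helper names natural_join, aggregate_sum and restrict. The aggregation is modelled as returning a relation, which is the case in which the final restriction is well defined.

```python
from collections import defaultdict

# Assumed schemas: produkt(varenr, pris), ordre(varenr, dato, antall).
produkt = [{"varenr": 1, "pris": 100}, {"varenr": 2, "pris": 2500}]
ordre = [
    {"varenr": 1, "dato": "2020-01-01", "antall": 20},
    {"varenr": 2, "dato": "2020-01-01", "antall": 4},
    {"varenr": 1, "dato": "2020-01-02", "antall": 5},
]


def natural_join(r, s):
    """Combine rows that agree on all attributes the two relations share."""
    out = []
    for a in r:
        for b in s:
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                out.append({**a, **b})
    return out


def aggregate_sum(relation, group_attr, expr, out_attr):
    """Group by one attribute and sum expr(row); the result is a relation."""
    totals = defaultdict(int)
    for row in relation:
        totals[row[group_attr]] += expr(row)
    return [{group_attr: key, out_attr: value} for key, value in totals.items()]


def restrict(relation, predicate):
    """Restriction (selection): keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


joined = natural_join(produkt, ordre)
per_date = aggregate_sum(joined, "dato", lambda r: r["pris"] * r["antall"], "total")
result = restrict(per_date, lambda r: r["total"] >= 10000)
print(per_date)
print(result)
```

Note that the restriction on total is applied last, to the output of the aggregation, whereas the join condition is handled before any aggregation takes place.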