Pig: apply a FOREACH operator to each element within a bag

Example: I have a relation "class", with a nested bag of students:
class: {teacher_name: chararray,students: {(firstname: chararray, lastname: chararray)}}
I want to perform an operation on each student while leaving the overall structure untouched, i.e., obtain:
class: {teacher_name: chararray,students: {(fullname: chararray)}}
where, for each student, fullname = CONCAT(firstname, lastname).
My understanding is that a nested FOREACH is not the solution here, as it still generates only one record per input tuple, whereas I want something that applies to each item within the bag.
This is pretty easy to do with a UDF, but I wondered whether it is possible in pure Pig Latin.

In Pig 0.10 this is possible without a UDF, because a FOREACH can be nested inside another FOREACH. Here is an example:
inpt = load '~/pig/data/bag_concat.dat' as (k : chararray, c1 : chararray, c2 : chararray);
dump inpt;
1 q w
1 s d
2 q a
2 t y
2 u i
2 o p
bags = group inpt by k;
describe bags;
bags: {group: chararray,inpt: {(k: chararray,c1: chararray,c2: chararray)}}
result = foreach bags {
concat = foreach inpt generate CONCAT(c1, c2); --it will iterate only over the records of the inpt bag
generate group, concat;
};
dump result;
(1,{(qw),(sd)})
(2,{(qa),(ty),(ui),(op)})
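To make the semantics of the nested FOREACH concrete, here is a minimal plain-Python sketch of the same computation. It is only a model of what the Pig script does, not Pig code; the function name concat_within_groups is made up for illustration, and Python lists keep insertion order whereas Pig bags are unordered.

```python
from collections import defaultdict

# Rows copied from the `dump inpt` output above: (k, c1, c2).
rows = [
    ("1", "q", "w"), ("1", "s", "d"),
    ("2", "q", "a"), ("2", "t", "y"), ("2", "u", "i"), ("2", "o", "p"),
]

def concat_within_groups(rows):
    """Group rows by k, then concatenate c1 and c2 for every row in each bag."""
    bags = defaultdict(list)  # plays the role of `group inpt by k`
    for k, c1, c2 in rows:
        bags[k].append((c1, c2))
    # The inner list comprehension plays the role of the nested FOREACH:
    # it runs once per element of the bag, not once per group.
    return {k: [c1 + c2 for c1, c2 in bag] for k, bag in bags.items()}

result = concat_within_groups(rows)
print(result)
```

The outer dictionary still has one entry per group, which mirrors how the outer FOREACH emits one record per input tuple while the inner one reshapes the bag.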

Related

How to check if a tuple contains an element in Apache Pig?

Let's say I have this file:
movie_id,title,genres
95004,Superman/Doomsday (2007),Action|Animation
136297,Mortal Kombat: The Journey Begins (1995),Action
193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi
193573,Love Live! The School Idol Movie (2015),Animation
I load it like this:
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres: chararray);
a = FOREACH movies GENERATE movieId, title, STRSPLIT(genres,'\\|') as genres;
describe a; -- a: {movieId: int,title: chararray,genres: ()}
Example of the results of dump a:
...
(193581,Black Butler: Book of the Atlantic (2017),(Action,Animation,Comedy,Fantasy))
(193583,No Game No Life: Zero (2017),(Animation,Comedy,Fantasy))
...
Now, if I understand correctly, the field genres is of type tuple. The question is: how can I run a query such as "get all the action movies"? I don't know how to check whether a specific element is present in the tuple genres.
I know how to do this with a Python UDF, but I would like to know if it is possible without one. Maybe I should load the file differently.
Thank you for your help.
If you are happy to put the genres into a bag rather than a tuple (I think this is more appropriate, since the number of genres varies from record to record), this can be solved with a nested FOREACH: filter the bag for the specific genre, then test whether the filtered bag is non-empty.
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres: chararray);
moviesSplit = FOREACH movies GENERATE movieId, title, TOKENIZE(genres,'|') as genres;
actionTest = FOREACH moviesSplit {
action = FILTER genres by $0 == 'Action'; -- keep only the matching genre tuples
GENERATE *, action;
};
actionMovies = FILTER actionTest BY NOT IsEmpty(action);
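As a rough illustration of the logic in this answer (split into a bag, filter the bag, keep records whose filtered bag is non-empty), here is a plain-Python sketch run over the sample rows from the question. It is a model of the idea rather than Pig code; the function name action_movies and the naive comma split are assumptions that only hold because the sample titles contain no commas.

```python
# Sample rows copied from the question's file (header line omitted).
lines = [
    "95004,Superman/Doomsday (2007),Action|Animation",
    "136297,Mortal Kombat: The Journey Begins (1995),Action",
    "193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi",
    "193573,Love Live! The School Idol Movie (2015),Animation",
]

def action_movies(lines, wanted="Action"):
    """Return (movie_id, title) for every movie whose genre bag contains `wanted`."""
    selected = []
    for line in lines:
        movie_id, title, genres = line.split(",")    # assumes no commas in titles
        bag = genres.split("|")                      # like TOKENIZE(genres, '|')
        matching = [g for g in bag if g == wanted]   # like the nested FILTER
        if matching:                                 # like NOT IsEmpty(action)
            selected.append((int(movie_id), title))
    return selected

print(action_movies(lines))
```

Comparing whole genre names avoids the false positives a substring test could give if one genre name happened to contain another.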

Filter inner bag by other field (not constant)

I am trying to run the following scenario but fail.
I start with a list of movies, and group it by {year, rating}.
movies = LOAD '/movies_data.csv'
USING PigStorage(',') AS (id:int, name:chararray, year:int, rating:double, duration:int);
grouped = GROUP movies BY (year, rating);
The resulting schema is:
DESCRIBE grouped;
grouped: {group: (year: int,rating: double),movies: {(id: int,name: chararray,
year: int,rating: double,duration: int)}}
Now, for each group, I would like to get a list of movie names that contain the year (which is part of the group key).
So I try the following:
model =
FOREACH grouped {
listNames = DISTINCT movies.name;
listNamesFiltered = FILTER listNames BY name MATCHES group::year;
GENERATE
group.year AS year
,group.rating AS rating
,listNamesFiltered AS listNamesFiltered
,COUNT(listNamesFiltered) AS countNamesFiltered
;};
but fail with the message:
Invalid field projection. Projected field [group::year] does not exist in schema: name:chararray.
Using a constant (like in the following line) works:
listNamesFiltered = FILTER listNames BY name MATCHES '.*2010.*';
results in:
(2010,2.6,{(2010: Moby Dick)},1)
(2010,3.8,{(Saturday Night Live: The 2010s)},1)
Any help will be greatly appreciated.
This seems like it would be a lot easier if you did all the filtering first and then did the GROUP/DISTINCT/COUNT operations.
Data:
1 2010: Moby Dick 2010 2.6 128
2 Saturday Night Live: The 2010s 2010 3.8 127
3 2001: A Space Odyssey 2001 4.0 145
4 Forrest Gump 1994 4.9 334
Query:
movies = LOAD 'movie_data.csv' USING PigStorage(',') AS (id:int,
name:chararray, year:int, rating:double, duration:int);
filtered = FILTER movies BY name MATCHES StringConcat('.*', (chararray)year, '.*');
dump filtered;
Output:
(1,2010: Moby Dick,2010,2.6,128)
(2,Saturday Night Live: The 2010s,2010,3.8,127)
(3,2001: A Space Odyssey,2001,4.0,145)
Then do whatever else you were going to do (COUNT etc ...).
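For readers who want to check the filter logic outside Pig, here is a small plain-Python sketch of the same row-level test on the data above. It is a model, not Pig code, and the helper name name_contains_year is invented for illustration. Pig's MATCHES has to match the whole string, which is why the pattern is wrapped in '.*' on both sides; re.fullmatch is the closest Python analogue.

```python
import re

# Rows copied from the answer's data: (id, name, year, rating, duration).
movies = [
    (1, "2010: Moby Dick", 2010, 2.6, 128),
    (2, "Saturday Night Live: The 2010s", 2010, 3.8, 127),
    (3, "2001: A Space Odyssey", 2001, 4.0, 145),
    (4, "Forrest Gump", 1994, 4.9, 334),
]

def name_contains_year(row):
    """True when the movie name contains the row's own year."""
    _id, name, year, _rating, _duration = row
    pattern = ".*" + str(year) + ".*"  # built per row, like the concatenation in the query
    return re.fullmatch(pattern, name) is not None

filtered = [row for row in movies if name_contains_year(row)]
for row in filtered:
    print(row)
```

Because the pattern is built from fields of the same row, no reference to the group key is needed, which sidesteps the projection error from the question.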

Pig Latin: All pairs within a group - Nested foreach with a cross and filter

I have some grouped data:
glu: (
group:tuple(foo:bytearray, bar:chararray),
bam: bag {
:tuple(foo:bytearray, bar:chararray, pom:Long)
}
)
What I want is to do a nested cross-product to get all pairs of pom, and a filter to reduce to only pairs where the first pom is less than the second pom. Ending up with something like this:
glu: (
group:tuple(foo:bytearray, bar:chararray),
bam: bag {
:tuple(foo:bytearray, bar:chararray, pom1:Long, pom2:Long)
}
)
something like:
glupairs = FOREACH glu {
pairs = CROSS bam, bam;
filtered = FILTER pairs BY (bam1 != bam2) AND (bam1 < bam2);
GENERATE group, filtered;
};
This, of course, does not work. Is there a way to do this? Can I take a cross product of a relation against itself? How can I select the fields afterwards (to do the filter)?
Thanks in advance.
I figured this out by doing:
glupairs = FOREACH glu {
copied = FOREACH bam GENERATE -(-pom); -- Deals with the self cross bug
pairs = CROSS bam, copied;
filtered = FILTER pairs BY (bam.pom != copied.pom) AND (bam.pom < copied.pom);
GENERATE group, filtered;
};
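The intended result (all pairs within a group where the first pom is less than the second) can be modelled in plain Python as follows. This is a sketch of the semantics only, not Pig code; the sample data and the name ordered_pairs are hypothetical. Since p1 < p2 already implies p1 != p2, the sketch uses only the less-than test.

```python
from itertools import product

# Hypothetical grouped data: group key -> bag of pom values.
glu = {
    ("a", "x"): [3, 1, 2],
    ("b", "y"): [5],
}

def ordered_pairs(bag):
    """Cross the bag with itself and keep the pairs where first < second."""
    return [(p1, p2) for p1, p2 in product(bag, bag) if p1 < p2]

glupairs = {group: ordered_pairs(bag) for group, bag in glu.items()}
print(glupairs)
```

A group with a single pom value yields an empty list, which corresponds to an empty bag in the Pig result.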

Pig - removing duplicate tuples from bag

I have the following loaded in a relation with this schema {group: (int,int),A: {(n1: int,n2: int)}}:
((1,1),{(0,1)})
((2,2),{(0,2)})
((3,3),{(3,0)})
((4,2),{(1,3)})
((5,1),{(2,3)})
((5,3),{(1,4)})
((7,3),{(2,5)})
((9,1),{(4,5)})
((10,2),{(4,6)})
((10,4),{(7,3)})
((11,1),{(5,6)})
((11,3),{(4,7)})
((12,4),{(4,8)})
((13,1),{(6,7)})
((19,1),{(10,9),(9,10)})
((,),{(,),(,),(,)})
I would like to extract just the first tuple from each bag, i.e.:
((19,1),{(10,9),(9,10)}) --> (10,9)
Any help is appreciated.
Can you try this?
C = FOREACH B {
top1 = LIMIT A 1; -- keep a single tuple from each bag
GENERATE FLATTEN(top1);
};
Here, B is the name of your grouped relation.

How do I accumulate vectors into a map?

I have an alias A like this:
{cookie: chararray,
keywords: {tuple_of_tokens: (token: chararray)},
weight: double}
where the 2nd and 3rd fields are defined as
keywords = TOKENIZE((chararray)$5,',');
weight = 1.0/(double)SIZE(keywords);
Now I want to do
foreach (group A by cookie) generate
group.cookie as cookie,
???? as keywords;
and keywords should be a map from each keyword to the sum of its weights.
E.g.,
1 k1,k2,k3
1 k2,k4
should turn into
1 {k1:1/3, k2:5/6, k3:1/3, k4:1/2}
I am already using datafu, but I am open to any alternative...
I'd first flatten the keywords so that each (cookie, keyword) pair carries its weight:
A_counts = foreach A generate cookie, flatten(keywords) as keyword, 1.0/SIZE(keywords) as weight;
Then group by both fields and sum the weights:
A_counts_gr = group A_counts by (cookie, keyword);
result = foreach A_counts_gr generate flatten(group) as (cookie, token), SUM(A_counts.weight);
After that, group by cookie again to get a bag like the one you want, and then turn that bag into a map with datafu.
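To check the expected numbers from the question, here is a plain-Python sketch of the same flatten, group, and sum steps. It is a model of the computation, not Pig or datafu code; the function name keyword_weights is made up, and exact fractions are used only so that the result can be compared with 1/3, 5/6, and 1/2 without rounding.

```python
from collections import defaultdict
from fractions import Fraction

# Input rows from the question's example: (cookie, comma-separated keywords).
rows = [("1", "k1,k2,k3"), ("1", "k2,k4")]

def keyword_weights(rows):
    """Map each cookie to {keyword: sum of 1/len(keywords) over its rows}."""
    totals = defaultdict(lambda: defaultdict(Fraction))  # Fraction() is 0
    for cookie, raw in rows:
        keywords = raw.split(",")               # like TOKENIZE(..., ',')
        weight = Fraction(1, len(keywords))     # like 1.0 / SIZE(keywords)
        for keyword in keywords:                # like FLATTEN(keywords)
            totals[cookie][keyword] += weight   # like GROUP ... then SUM
    return {cookie: dict(per_kw) for cookie, per_kw in totals.items()}

weights = keyword_weights(rows)
print(weights)
```

For cookie 1, k2 appears in both rows, so its total is 1/3 + 1/2 = 5/6, matching the expected output in the question.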