pig is null query - apache-pig

I am using Pig 0.10.
I have an outer bag (relation):
grunt>dump e;
(vinyas,(shetty,12),{(12,vinyas),(99,shetty)})
(vas,(shety,12),{(12,vinyas),(33,shetty)})
(fgkyas,(shety,12),{(12,vinyas),(12,shetty)})
(fky,(uhjyt,12),{(,),(,)})
grunt> describe e;
e: {name: chararray,t: (),b: {t: (x: int,y: chararray)}}
grunt> op = filter e by IsEmpty(b) or b is null;
Now op does not return anything. I was actually expecting the last tuple of relation e (i.e. the one with name "fky") to be returned. Can someone please explain this behavior?

IsEmpty checks the size of the DataBag, and that size counts empty tuples as well. The bag in the last row holds two tuples (whose fields are all null), so it is not empty. You need to specify a more rigorous check.
(And b is definitely not NULL - so that would not get what you wanted).
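One possible form of such a check is sketched below. It assumes that a bag should count as "empty" when every tuple in it has only null fields; the aliases e_checked, non_null, matches and op are illustrative names, not part of the original question.

```pig
-- Assumption: a bag is treated as empty when all of its tuples have only null fields.
e_checked = FOREACH e {
    -- keep only the tuples that carry at least one non-null field
    non_null = FILTER b BY x IS NOT NULL OR y IS NOT NULL;
    GENERATE name, t, b, non_null;
};
-- rows whose bag is null, has no tuples, or holds only all-null tuples
matches = FILTER e_checked BY b IS NULL OR IsEmpty(non_null);
-- drop the helper bag again
op = FOREACH matches GENERATE name, t, b;
```

With the data shown in the question, this should keep only the fky row, since its bag holds two tuples whose fields are all null.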

Related

How to check if a tuple contains an element in Apache Pig?

Let's say I have this file:
movie_id,title,genres
95004,Superman/Doomsday (2007),Action|Animation
136297,Mortal Kombat: The Journey Begins (1995),Action
193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi
193573,Love Live! The School Idol Movie (2015),Animation
I load it like this:
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres: chararray);
movies = FOREACH movies GENERATE movieId, title, STRSPLIT(genres,'\\|') as genres;
describe movies; -- movies: {movieId: int,title: chararray,genres: ()}
Example of dump movies results:
...
(193581,Black Butler: Book of the Atlantic (2017),(Action,Animation,Comedy,Fantasy))
(193583,No Game No Life: Zero (2017),(Animation,Comedy,Fantasy))
...
Now, if I understand correctly, the field genres is of type tuple. The question is how I can run a query such as "get all the action movies". I don't know how to check whether a specific element is present in the tuple genres.
I know how to do this with a Python UDF function, but I would like to know if it is possible without one. Maybe I should load the file differently.
Thank you for your help.
If you are happy to put the genres into a bag rather than a tuple (which I think is more appropriate, since the number of genres varies from record to record), this can be solved with a nested FOREACH: filter the bag for the specific genre, then test whether the filtered bag is non-empty.
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres: chararray);
moviesSplit = FOREACH movies GENERATE movieId, title, TOKENIZE(genres,'|') as genres;
actionTest = FOREACH moviesSplit {
    -- keep only the 'Action' entries of this movie's genre bag
    action = FILTER genres BY $0 == 'Action';
    GENERATE *, action;
};
actionMovies = FILTER actionTest BY NOT IsEmpty(action);

Many to many query joins in aqueduct

I have an A -> AB <- B many-to-many relationship between two ManagedObjects (A and B), where AB is the junction table.
When querying A from the database, how do I join the B values onto the AB join objects?
Query<A> query = await Query<A>(context)
..join(set: (a) => a.ab);
It gives me a list of A objects that contain AB join objects, but the AB objects don't include the full B objects, only b.id (none of the other fields from class B).
Cheers
When you call join, a new Query<T> is created and returned from that method, where T is the joined type. So if a.ab is of type AB, Query<A>.join returns a Query<AB> (it is linked to the original query internally).
Since you have a new Query<AB>, you can configure it like any other query, including initiating another join, adding sorting descriptors and where clauses.
There are some stylistic syntax choices to be made. You can condense this query into a one-liner:
final query = Query<A>(context)
..join(set: (a) => a.ab).join(object: (ab) => ab.b);
final results = await query.fetch();
This is OK if the query remains as-is, but as you add more criteria to a query, the difference between the dot operator and the cascade operator becomes harder to track. I often pull the join query into its own variable. (Note that you don't call any execution methods on the join query):
final query = Query<A>(context);
final join = query.join(set: (a) => a.ab)
..join(object: (ab) => ab.b);
final results = await query.fetch();

FILTER ON column from another relation in PIG

Suppose, I have the following data in PIG.
DUMP raw;
(2015-09-15T22:11:00.000-07:00,1)
(2015-09-15T22:12:00.000-07:00,2)
(2015-09-15T23:11:00.000-07:00,3)
(2015-09-16T21:02:00.000-07:00,4)
(2015-09-15T00:02:00.000-07:00,5)
(2015-09-17T08:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,1)
(2015-09-17T19:02:00.000-07:00,1)
DESCRIBE raw;
raw: {process_date: chararray,id: int}
A = GROUP raw BY id;
DESCRIBE A;
A: {group: int,raw: {(process_date: chararray,id: int)}}
DUMP A;
(1,{(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)})
(2,{(2015-09-15T22:12:00.000-07:00,2)})
(3,{(2015-09-15T23:11:00.000-07:00,3)})
(4,{(2015-09-16T21:02:00.000-07:00,4)})
(5,{(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)})
B = FOREACH A {generate raw,MAX(raw.process_date) AS max_date;}
DUMP B;
({(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)},2015-09-17T19:02:00.000-07:00)
({(2015-09-15T22:12:00.000-07:00,2)},2015-09-15T22:12:00.000-07:00)
({(2015-09-15T23:11:00.000-07:00,3)},2015-09-15T23:11:00.000-07:00)
({(2015-09-16T21:02:00.000-07:00,4)},2015-09-16T21:02:00.000-07:00)
({(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)},2015-09-17T09:02:00.000-07:00)
DESCRIBE B;
B: {raw: {(process_date: chararray,id: int)},max_date: chararray}
Now I need to filter raw on process_date eq max_date. I have tried the following, but it's not working:
C = FOREACH B {filtered = FILTER raw BY REGEX_EXTRACT(process_date,'(\\d{4}-\\d{2}-\\d{2})',1) eq REGEX_EXTRACT(max_date,'(\\d{4}-\\d{2}-\\d{2})',1)}
Is there any way to do such filtering? Basically, I need to filter the raw based on latest date.
The exception which I get is:
Invalid field projection. Projected field [max_date] does not exist in schema: process_date:chararray,id:int
Expected output: Latest data corresponding to latest date (not time) for each id
({(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)})
({(2015-09-15T22:12:00.000-07:00,2)})
({(2015-09-15T23:11:00.000-07:00,3)})
({(2015-09-16T21:02:00.000-07:00,4)})
({(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)})
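The exception suggests that the nested FILTER resolves field names against the schema of the bag raw only, so the outer field max_date is not visible inside it. One way around this, sketched here under assumptions, is to compute the latest day per id in a separate relation and join it back onto the records. The aliases with_day, by_id, latest_day, joined, latest, regrouped and result, and the field names proc_day and max_day, are illustrative; the day is taken as the first 10 characters of process_date, which matches the regex in the question but ignores time zone offsets.

```pig
-- Derive the calendar day (first 10 characters) for every record.
with_day = FOREACH raw GENERATE process_date, id, SUBSTRING(process_date, 0, 10) AS proc_day;

-- Latest day per id; MAX on chararray compares strings, which orders yyyy-MM-dd values correctly.
by_id      = GROUP with_day BY id;
latest_day = FOREACH by_id GENERATE group AS id, MAX(with_day.proc_day) AS max_day;

-- Keep only the records whose day equals the latest day for their id.
joined = JOIN with_day BY (id, proc_day), latest_day BY (id, max_day);
latest = FOREACH joined GENERATE with_day::process_date AS process_date, with_day::id AS id;

-- Regroup to get one bag per id, as in the expected output.
regrouped = GROUP latest BY id;
result    = FOREACH regrouped GENERATE latest;
```

The join replaces the nested filter, so no field from the outer relation has to be referenced from inside a bag.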

Pig - removing duplicate tuples from bag

I have the following loaded in a relation with this schema {group: (int,int),A: {(n1: int,n2: int)}}:
((1,1),{(0,1)})
((2,2),{(0,2)})
((3,3),{(3,0)})
((4,2),{(1,3)})
((5,1),{(2,3)})
((5,3),{(1,4)})
((7,3),{(2,5)})
((9,1),{(4,5)})
((10,2),{(4,6)})
((10,4),{(7,3)})
((11,1),{(5,6)})
((11,3),{(4,7)})
((12,4),{(4,8)})
((13,1),{(6,7)})
((19,1),{(10,9),(9,10)})
((,),{(,),(,),(,)})
I would like to extract just the first tuple from each bag, i.e.:
((19,1),{(10,9),(9,10)}) --> (10,9)
Any help is appreciated.
Can you try it like this?
C = FOREACH B {
top1 = LIMIT A 1;
GENERATE FLATTEN((top1));
}
Here, B is the name of your grouped relation.
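Note that bags in Pig are unordered, so LIMIT 1 on its own returns an arbitrary tuple from each bag. If a specific tuple is wanted, the bag can be sorted first. The variant below is a sketch that assumes "first" means the tuple with the largest n1 (ties broken by n2); that ordering is an assumption, not something stated in the question.

```pig
-- Assumption: "first" means largest n1, then largest n2.
C = FOREACH B {
    sorted = ORDER A BY n1 DESC, n2 DESC;  -- fix the order inside each bag
    top1   = LIMIT sorted 1;               -- take the first tuple of the sorted bag
    GENERATE group, FLATTEN(top1);
};
```

Keeping group in the output makes it possible to tell which key each extracted tuple belongs to.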

Pig: apply a FOREACH operator to each element within a bag

Example: I have a relation "class", with a nested bag of students:
class: {teacher_name: chararray,students: {(firstname: chararray, lastname: chararray)}}
I want to perform an operation on each student while leaving the global structure untouched, i.e., obtain:
class: {teacher_name: chararray,students: {(fullname: chararray)}}
where for each student, fullname = CONCAT(firstname, lastname)
My understanding is that a nested FOREACH would not be my solution here, as it still only generates 1 record per input tuple, whereas I want something that would apply within each bag item.
This is pretty easy to do with a UDF, but I wondered whether it's possible in pure Pig Latin.
In Pig 0.10 it is possible without a UDF, as a FOREACH can be nested inside a FOREACH. Here is an example:
inpt = load '~/pig/data/bag_concat.dat' as (k : chararray, c1 : chararray, c2 : chararray);
dump inpt;
1 q w
1 s d
2 q a
2 t y
2 u i
2 o p
bags = group inpt by k;
describe bags;
bags: {group: chararray,inpt: {(k: chararray,c1: chararray,c2: chararray)}}
result = foreach bags {
concat = foreach inpt generate CONCAT(c1, c2); --it will iterate only over the records of the inpt bag
generate group, concat;
};
dump result;
(1,{(qw),(sd)})
(2,{(qa),(ty),(ui),(op)})
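Applied to the relation from the question, the same pattern might look like the sketch below. It assumes the schema given in the question (teacher_name plus a students bag of firstname and lastname); the aliases class_fullnames and names are illustrative.

```pig
-- Assumes: class: {teacher_name: chararray, students: {(firstname: chararray, lastname: chararray)}}
class_fullnames = FOREACH class {
    -- the inner FOREACH iterates over the tuples of this row's students bag
    names = FOREACH students GENERATE CONCAT(firstname, lastname) AS fullname;
    GENERATE teacher_name, names AS students;
};
```

Each input row still produces exactly one output row; only the contents of the bag change.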