How to check if a tuple contains an element in Apache Pig? - apache-pig

Let's say I have this file:
movie_id,title,genres
95004,Superman/Doomsday (2007),Action|Animation
136297,Mortal Kombat: The Journey Begins (1995),Action
193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi
193573,Love Live! The School Idol Movie (2015),Animation
I load it like this:
movies = LOAD 'movies.csv' USING PigStorage(',') AS (moviesId:int, title:chararray, genres: chararray);
movies = FOREACH movies GENERATE movieId, title, STRSPLIT(genres,'\\|') as genres;
describe a; //a: {movieId: int,title: chararray,genres: ()}
Example of dump a results:
...
(193581,Black Butler: Book of the Atlantic (2017),(Action,Animation,Comedy,Fantasy))
(193583,No Game No Life: Zero (2017),(Animation,Comedy,Fantasy))
...
Now, if I undestand correctly, the field genres is of type tuple. The question is how can I do a query such as: "get all the action movies?". I don't know how to check if a specific element is present in the tuple genres.
I know how to do this with a Python UDF function, but I would like to know if it is possible without one. Maybe I should load the file differently.
Thank you for your help.

If you are happy to put the genres into a bag rather than into a tuple (I think this would be more appropriate since the number of genres varies from record to record). This could be then solved with a nested FOREACH by filtering the bag for specific genres then testing to see if the bag is not empty.
movies = LOAD 'movies.csv' USING PigStorage(',') AS (moviesId:int, title:chararray, genres: chararray);
moviesSplit = FOREACH movies GENERATE movieId, title, TOKENIZE(genres,'|') as genres;
actionTest = FOREACH moviesSplit {
action = FILTER genres by $0 == 'Action';
GENERATE *, action;
actionMovies = FILTER actionTest BY NOT IsEmpty(action);

Related

Optimization of SQL query based on attribute (not having specific value) in joined table

My models structure: Movie has_many :captions. Language of the Caption may be “en”, “de”, “fr”...
Problem:
An effective query to select Movies that don’t have Captions with an “en” language.
App that needs above runs on Rails, and for this I’m currently using something like this in Caption model:
def self.ids_of_movies_without_caption_in_en
a = (1..(Movie.last.lp.to_i)).to_a
b = Caption.in_lang("en").collect {|h| h.movie_id }
(a - b)
end
As you can see, I collect id’s (lp) of all movies and then I remove from that array id’s of those movies where Captions have “en” as a language. The outcome is an array of id’s of Movies I need.
Above works, but as you can imagine it’s quite “heavy”. I believe that there is a better (and maybe trivial) approach to it. However, being “fresh” with SQL, I ask for some guidance in writing an efficient query. This runs on PostgreSQL
Implementation in Rails (5.2) would be an additional bonus!
This is the situation: let's say in the database there are 1000 movies, and 4000 captions for those movies. There are of course movies that don't have any captions. Out of those 4000 captions 400 are in "en" language. The query I'm looking for would return 600 movies, where caption in "en" does not exist (including movies with 0 captions).
This is quite easy in SQL. I'm not quite sure what the tables look like, but something like this:
select movie_id
from captions
group by movie_id
having not bool_or(language = 'en');
If you want movies with no captions, then use not exists:
select m.movie_id
from movies m
where not exists (select 1
from captions c
where c.movie_id = m.movie_id and
m.language = 'en'
);

Filter inner bag by other field (not constant)

I am trying to run the following scenario but fail.
I start with a list of movies, and group it by {year, rating}.
movies = LOAD '/movies_data.csv'
USING PigStorage(',') AS (id:int, name:chararray, year:int, rating:double, duration:int);
grouped = GROUP movies BY (year, rating);
The resulting schema is:
DESCRIBE grouped;
grouped: {group: (year: int,rating: double),movies: {(id: int,name: chararray,
year: int,rating: double,duration: int)}}
Now, for each group I would like to get a list of movie names that contain the year (which is the part of the group name).
So I try the following:
model =
FOREACH grouped {
listNames = DISTINCT movies.name;
listNamesFiltered = FILTER listNames BY name MATCHES group::year;
GENERATE
group.year AS year
,group.rating AS rating
,listNamesFiltered AS listNamesFiltered
,COUNT(listNamesFiltered) AS countNamesFiltered
;};
but fail with the message:
Invalid field projection. Projected field [group::year] does not exist in schema: name:chararray.
Using a constant (like in the following line) works:
listNamesFiltered = FILTER listNames BY name MATCHES '.*2010.*';
results in:
(2010,2.6,{(2010: Moby Dick)},1)
(2010,3.8,{(Saturday Night Live: The 2010s)},1)
Any help will be greatly appreciated.
This seems like it would be a lot easier if you did all the filtering and then did all GROUP/DISTINCT/COUNToperations.
Data:
1 2010: Moby Dick 2010 2.6 128
2 Saturday Night Live: The 2010s 2010 3.8 127
3 2001: A Space Odyssey 2001 4.0 145
4 Forrest Gump 1994 4.9 334
Query:
movies = LOAD 'movie_data.csv' USING PigStorage(',') AS (id:int,
name:chararray, year:int, rating:double, duration:int);
filtered = FILTER movies BY name MATCHES StringConcat('.*', (chararray)year, '.*');
dump filtered;
Output:
(1,2010: Moby Dick,2010,2.6,128)
(2,Saturday Night Live: The 2010s,2010,3.8,127)
(3,2001: A Space Odyssey,2001,4.0,145)
Then do whatever else you were going to do (COUNT etc ...).

Pig - removing duplicate tuples from bag

I have the following loaded in a relation with this schema {group: (int,int),A: {(n1: int,n2: int)}}:
((1,1),{(0,1)})
((2,2),{(0,2)})
((3,3),{(3,0)})
((4,2),{(1,3)})
((5,1),{(2,3)})
((5,3),{(1,4)})
((7,3),{(2,5)})
((9,1),{(4,5)})
((10,2),{(4,6)})
((10,4),{(7,3)})
((11,1),{(5,6)})
((11,3),{(4,7)})
((12,4),{(4,8)})
((13,1),{(6,7)})
((19,1),{(10,9),(9,10)})
((,),{(,),(,),(,)})
I would like to extract just the first tuple from each bag, i.e.:
((19,1),{(10,9),(9,10)}) --> (10,9)
Any help is appreciated.
Can you try like this?.
C = FOREACH B {
top1 = LIMIT A 1;
GENERATE FLATTEN((top1));
}
here B is your group relation name.

How do I accumulate vectors into a map?

I have an alias A like this:
{cookie: chararray,
keywords: {tuple_of_tokens: (token: chararray)},
weight: double}
where the 2nd and 3rd fields are defined as
keywords = TOKENIZE((chararray)$5,',');
weight = 1.0/(double)SIZE(keywords);
now I want to do
foreach (group A by cookie) generate
group.cookie as cookie,
???? as keywords;
and keywords should be a map from a keyword into a the sum of weights.
E.g.,
1 k1,k2,k3
1 k2,k4
should turn into
1 {k1:1/3, k2:5/6, k3:1/3, k4:1/2}
I am already using datafu, but I am open to any alternative...
I'd do
A_counts = foreach A generate cookie,flatten(keywords) as keyword,1.0/SIZE(keywords) as weight;
then
A_counts_gr = group A by (cookie,keyword); and
result= foreach A_counts_gr generate flatten(group) as (cookie,token), sum(A_counts_gr.weight);
and then one can group by cookie to get a bag like you want...after grouping by cookie again there will be a bag, than you can turn this bag to a map with datafu...

PIG FILTER relation with next row the same relation

i'm searching for a long time now to solve my problem but nearly found nothing helpful.
Hopefully some of you can give me a tip.
I have a relation A with the following format: username, timestamp, ip
For example:
Harald 2014-02-18T16:14:49.503Z 123.123.123.123
Harald 2014-02-18T16:14:51.503Z 123.123.123.123
Harald 2014-02-18T16:14:55.503Z 321.321.321.321
And i want to find out, who changed his ip adress in less then 5 seconds. So the second and the third row should be interesting.
I want do group the relation by username und want to compare the timestamp of the actuall row with the next row. if the ip adress isnt the same and the timestamp is less then 5 seconds bigger, this should be at the output.
could someone help me with that issue?
regards.
first i want to thank you for your time.
but i actually stuck at the Sessionize part.
this is my data comming in:
aoebcu 2014-02-19T14:23:17.503Z 220.61.65.25
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
jekgru 2014-02-19T14:23:14.503Z 213.56.157.109
zmembx 2014-02-19T14:23:12.503Z 199.188.198.91
qhixcg 2014-02-19T14:23:11.503Z 203.40.104.119
and my code till now looks like this:
hijack_Reduced = FOREACH finalLogs GENERATE ClientUserName, timestamp, OriginalClientIP;
hijack_Filtered = FILTER hijack_Reduced BY OriginalClientIP != '-';
hijack_Sessionized = FOREACH (GROUP hijack_Filtered BY ClientUserName) {
views = ORDER hijack_Filtered BY timestamp;
GENERATE FLATTEN(Sessionize(views)) AS (ClientUserName,timestamp,OriginalClientIP,session_id);
}
but when i run this script, i got the following error Message:
15:36:22 ERROR -
org.apache.pig.tools.pigstats.SimplePigStats.setBackendException(542)
| ERROR 0: Exception while executing [POUserFunc (Name:
POUserFunc(datafu.pig.sessions.Sessionize)[bag] - scope-199 Operator
Key: scope-199) children: null at []]:
java.lang.IllegalArgumentException: Invalid format: "aoebcu"
i already tried a lot, but nothing worked.
do you got an idea?
Regards
While you could write a UDF for this, you can actually make use of the UDFs already available in Apache DataFu to solve this.
My solution involves applying sessionization to the data. Basically you look at consecutive events and assign each event a session ID. If the time elapsed between two events exceeds a specified amount of time, in your case 5 seconds, then the next event gets a new session ID. Otherwise consecutive events get the same session ID. Once each event is assigned its session ID the rest is easy. We group by session ID and look for sessions that have more than one distinct IP address.
I'll walk through my solution.
Suppose you have the following input data. Both Harold and Kumar change their IP addresses. But Harold does it within 5 seconds, while Kumar does not. So the output of our script should just be simply "Harold".
Harold,2014-02-18T16:14:49.503Z,123.123.123.123
Harold,2014-02-18T16:14:51.503Z,123.123.123.123
Harold,2014-02-18T16:14:55.503Z,321.321.321.321
Kumar,2014-02-18T16:14:49.503Z,123.123.123.123
Kumar,2014-02-18T16:14:55.503Z,123.123.123.123
Kumar,2014-02-18T16:15:05.503Z,321.321.321.321
Load the data
data = LOAD 'input' using PigStorage(',')
AS (user:chararray,time:chararray,ip:chararray);
Now define a couple UDFs from DataFu. The Sessionize UDF performs sessionization as I described earlier. The DistinctBy UDF will be used to find the distinct IP addresses within each session.
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1');
Group the data by user, sort by time, and apply the Sessonize UDF. Note that the timestamp must be the first field, as this is what Sessionize expects. This UDF appends a session ID to each tuple.
data = FOREACH data GENERATE time,user,ip;
data_sessionized = FOREACH (GROUP data BY user) {
views = ORDER data BY time;
GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
}
Now that the data is sessionized, we can group by the user and session. I group by user too because I want to spit this value back out. We pass the bag of events into the DistinctBy UDF. Check the documentation of this UDF for a more detailed description. But essentially we will get as many tuples as there are distinct IP addresses per session. Note that I have removed the time from the relation below. This is because 1) it isn't needed, and 2) the DistinctBy in 1.2.0 of DataFu has a bug when handling fields containing dashes, as the time field does.
data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
group.user as user,
SIZE(DistinctBy(data_sessionized)) as distinctIpCount;
Now select all the sessions that had more than one distinct IP address and return the distinct users for these sessions.
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
This produces simply:
Harold
Here is the full source code, which you should be able to paste directly into the DataFu unit tests and run:
/**
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1'); -- distinct by ip
data = LOAD 'input' using PigStorage(',') AS (user:chararray,time:chararray,ip:chararray);
data = FOREACH data GENERATE time,user,ip;
data_sessionized = FOREACH (GROUP data BY user) {
views = ORDER data BY time;
GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
}
data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
group.user as user,
SIZE(DistinctBy(data_sessionized)) as distinctIpCount;
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
STORE data_sessionized INTO 'output';
*/
#Multiline private String sessionizeUserIpTest;
private String[] sessionizeUserIpTestData = new String[] {
"Harold,2014-02-18T16:14:49.503Z,123.123.123.123",
"Harold,2014-02-18T16:14:51.503Z,123.123.123.123",
"Harold,2014-02-18T16:14:55.503Z,321.321.321.321",
"Kumar,2014-02-18T16:14:49.503Z,123.123.123.123",
"Kumar,2014-02-18T16:14:55.503Z,123.123.123.123",
"Kumar,2014-02-18T16:15:05.503Z,321.321.321.321"
};
#Test
public void sessionizeUserIpTest() throws Exception
{
PigTest test = createPigTestFromString(sessionizeUserIpTest);
this.writeLinesToFile("input",
sessionizeUserIpTestData);
List<Tuple> result = this.getLinesForAlias(test, "data_sessionized");
assertEquals(result.size(),1);
assertEquals(result.get(0).get(0),"Harold");
}