Pig - removing duplicate tuples from bag - apache-pig

I have the following loaded in a relation with this schema {group: (int,int),A: {(n1: int,n2: int)}}:
((1,1),{(0,1)})
((2,2),{(0,2)})
((3,3),{(3,0)})
((4,2),{(1,3)})
((5,1),{(2,3)})
((5,3),{(1,4)})
((7,3),{(2,5)})
((9,1),{(4,5)})
((10,2),{(4,6)})
((10,4),{(7,3)})
((11,1),{(5,6)})
((11,3),{(4,7)})
((12,4),{(4,8)})
((13,1),{(6,7)})
((19,1),{(10,9),(9,10)})
((,),{(,),(,),(,)})
I would like to extract just the first tuple from each bag, i.e.:
((19,1),{(10,9),(9,10)}) --> (10,9)
Any help is appreciated.

Can you try it like this?
C = FOREACH B {
    top1 = LIMIT A 1;         -- keep at most one tuple from the bag A
    GENERATE FLATTEN(top1);   -- un-nest the one-element bag into a plain tuple
};
Here, B is the name of your grouped relation.
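To make the effect of `LIMIT A 1` followed by `FLATTEN` concrete, here is a minimal sketch in plain Python (not Pig) that models the grouped relation as a list of (group, bag) pairs. The function name `first_of_each_bag` is made up for illustration. One caveat, as far as I know: Pig bags are unordered, so without an ORDER inside the nested block, which tuple LIMIT returns is not guaranteed; the Python list below only imitates "first".

```python
# Plain-Python model of the grouped relation: (group, bag) pairs.
# Lists stand in for bags; real Pig bags carry no guaranteed order.
B = [
    ((13, 1), [(6, 7)]),
    ((19, 1), [(10, 9), (9, 10)]),
]

def first_of_each_bag(relation):
    """Mimic `LIMIT A 1` then `FLATTEN`: one tuple per non-empty bag."""
    out = []
    for _group, bag in relation:
        out.extend(bag[:1])  # LIMIT 1; an empty bag contributes no row
    return out

C = first_of_each_bag(B)
print(C)  # [(6, 7), (10, 9)]
```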

Related

How to check if a tuple contains an element in Apache Pig?

Let's say I have this file:
movie_id,title,genres
95004,Superman/Doomsday (2007),Action|Animation
136297,Mortal Kombat: The Journey Begins (1995),Action
193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi
193573,Love Live! The School Idol Movie (2015),Animation
I load it like this:
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres:chararray);
a = FOREACH movies GENERATE movieId, title, STRSPLIT(genres,'\\|') as genres;
describe a; -- a: {movieId: int,title: chararray,genres: ()}
Example of dump a results:
...
(193581,Black Butler: Book of the Atlantic (2017),(Action,Animation,Comedy,Fantasy))
(193583,No Game No Life: Zero (2017),(Animation,Comedy,Fantasy))
...
Now, if I understand correctly, the field genres is of type tuple. The question is: how can I run a query such as "get all the action movies"? I don't know how to check whether a specific element is present in the tuple genres.
I know how to do this with a Python UDF function, but I would like to know if it is possible without one. Maybe I should load the file differently.
Thank you for your help.
If you are happy to put the genres into a bag rather than a tuple (which I think is more appropriate, since the number of genres varies from record to record), this can be solved with a nested FOREACH: filter the bag for the specific genre, then test whether the filtered bag is non-empty.
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres:chararray);
moviesSplit = FOREACH movies GENERATE movieId, title, TOKENIZE(genres,'|') as genres;
actionTest = FOREACH moviesSplit {
    action = FILTER genres BY $0 == 'Action';  -- bag of matching genres, possibly empty
    GENERATE *, action;
};
actionMovies = FILTER actionTest BY NOT IsEmpty(action);
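The same three steps (split into a bag, filter the bag, keep rows whose filtered bag is non-empty) can be sketched in plain Python to check the logic. This is only a model of the approach, not Pig; the function name `action_movies` and the `wanted` parameter are illustrative, and the rows are taken from the sample file in the question.

```python
rows = [
    (95004, "Superman/Doomsday (2007)", "Action|Animation"),
    (136297, "Mortal Kombat: The Journey Begins (1995)", "Action"),
    (193573, "Love Live! The School Idol Movie (2015)", "Animation"),
]

def action_movies(movies, wanted="Action"):
    """Keep the movies whose genre list contains `wanted`."""
    result = []
    for movie_id, title, genres in movies:
        bag = genres.split("|")                     # TOKENIZE(genres, '|')
        matches = [g for g in bag if g == wanted]   # nested FILTER on the bag
        if matches:                                 # NOT IsEmpty(action)
            result.append((movie_id, title, bag))
    return result

ids = [m[0] for m in action_movies(rows)]
print(ids)  # [95004, 136297]
```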

Pig Latin: All pairs within a group - Nested foreach with a cross and filter

I have some grouped data:
glu: (
group:tuple(foo:bytearray, bar:chararray),
bam: bag {
:tuple(foo:bytearray, bar:chararray, pom:Long)
}
)
What I want is to do a nested cross-product to get all pairs of pom, and a filter to reduce to only pairs where the first pom is less than the second pom. Ending up with something like this:
glu: (
group:tuple(foo:bytearray, bar:chararray),
bam: bag {
:tuple(foo:bytearray, bar:chararray, pom1:Long, pom2:Long)
}
)
something like:
glupairs = FOREACH glu {
pairs = CROSS bam, bam;
filtered = FILTER pairs BY (bam1 != bam2) AND (bam1 < bam2);
GENERATE group, filtered;
};
This, of course, does not work. Is there a way to do this? Can I take a cross product of a relation against itself? And how can I select the fields afterwards (to do the filter)?
Thanks in advance.
I figured this out by doing:
glupairs = FOREACH glu {
copied = FOREACH bam GENERATE -(-pom); -- Deals with the self cross bug
pairs = CROSS bam, copied;
filtered = FILTER pairs BY (bam.pom != copied.pom) AND (bam.pom < copied.pom);
GENERATE group, filtered;
};
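For reference, the intended result (all pairs within one group where the first pom is less than the second) can be sketched in plain Python. This models the logic only and says nothing about how Pig names the fields after a CROSS; the function name `ordered_pairs` and the sample bag are made up. It also shows that the strict `<` test already excludes equal values, so the separate `!=` check adds nothing.

```python
from itertools import product

def ordered_pairs(bam):
    """bam: list of (foo, bar, pom). Return (foo, bar, pom1, pom2) with pom1 < pom2."""
    pairs = product(bam, bam)  # CROSS of the bag with a copy of itself
    return [
        (foo, bar, p1, p2)
        for (foo, bar, p1), (_foo, _bar, p2) in pairs
        if p1 < p2  # strict '<' already rules out p1 == p2
    ]

bam = [("f", "b", 1), ("f", "b", 2), ("f", "b", 3)]
result = ordered_pairs(bam)
print(result)  # [('f', 'b', 1, 2), ('f', 'b', 1, 3), ('f', 'b', 2, 3)]
```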

How do I accumulate vectors into a map?

I have an alias A like this:
{cookie: chararray,
keywords: {tuple_of_tokens: (token: chararray)},
weight: double}
where the 2nd and 3rd fields are defined as
keywords = TOKENIZE((chararray)$5,',');
weight = 1.0/(double)SIZE(keywords);
now I want to do
foreach (group A by cookie) generate
group.cookie as cookie,
???? as keywords;
and keywords should be a map from each keyword to the sum of its weights.
E.g.,
1 k1,k2,k3
1 k2,k4
should turn into
1 {k1:1/3, k2:5/6, k3:1/3, k4:1/2}
I am already using datafu, but I am open to any alternative...
I'd first flatten the keywords, keeping a weight for each one:
A_counts = foreach A generate cookie, flatten(keywords) as keyword, 1.0/SIZE(keywords) as weight;
Then group by (cookie, keyword) and sum the weights:
A_counts_gr = group A_counts by (cookie, keyword);
result = foreach A_counts_gr generate flatten(group) as (cookie, token), SUM(A_counts.weight);
Grouping result by cookie again gives one bag of (token, weight) tuples per cookie, and you can then turn that bag into a map with datafu.
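As a sanity check on the expected numbers from the question, here is a small plain-Python sketch of the same flatten, group and sum steps. It is a model of the computation, not Pig or datafu; the function name `keyword_weights` is illustrative, exact fractions are used so the sums can be compared directly, and every keyword list is assumed to be non-empty.

```python
from collections import defaultdict
from fractions import Fraction

def keyword_weights(records):
    """records: iterable of (cookie, [keywords]). Returns {cookie: {keyword: weight_sum}}."""
    totals = defaultdict(lambda: defaultdict(Fraction))
    for cookie, keywords in records:
        weight = Fraction(1, len(keywords))  # 1.0 / SIZE(keywords); assumes non-empty list
        for kw in keywords:                  # FLATTEN(keywords)
            totals[cookie][kw] += weight     # GROUP BY (cookie, keyword) + SUM
    return {c: dict(kws) for c, kws in totals.items()}

result = keyword_weights([("1", ["k1", "k2", "k3"]), ("1", ["k2", "k4"])])
print(result["1"]["k2"])  # 5/6
```

Here k2 appears in both records, so its weight is 1/3 + 1/2 = 5/6, matching the example in the question.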

Pig: apply a FOREACH operator to each element within a bag

Example: I have a relation "class", with a nested bag of students:
class: {teacher_name: chararray,students: {(firstname: chararray, lastname: chararray)}}
I want to perform an operation on each student while leaving the global structure untouched, i.e., obtain:
class: {teacher_name: chararray,students: {(fullname: chararray)}}
where for each student, fullname = CONCAT(firstname, lastname)
My understanding is that a nested FOREACH would not be my solution here, as it still only generates 1 record per input tuple, whereas I want something that would apply within each bag item.
This is pretty easy to do with a UDF, but I wondered whether it's possible in pure Pig Latin.
In Pig 0.10 it is possible without a UDF, as a FOREACH can be nested inside a FOREACH. Here is an example:
inpt = load '~/pig/data/bag_concat.dat' as (k : chararray, c1 : chararray, c2 : chararray);
dump inpt;
1 q w
1 s d
2 q a
2 t y
2 u i
2 o p
bags = group inpt by k;
describe bags;
bags: {group: chararray,inpt: {(k: chararray,c1: chararray,c2: chararray)}}
result = foreach bags {
concat = foreach inpt generate CONCAT(c1, c2); --it will iterate only over the records of the inpt bag
generate group, concat;
};
dump result;
(1,{(qw),(sd)})
(2,{(qa),(ty),(ui),(op)})

NHibernate: how to retrieve an entity that "has" all entities with a certain predicate in Criteria

I have an Article with a Set of Category.
How can I query, using the criteria interface, for all Articles that contain all Categories with a certain Id?
This is not an "in", I need exclusively those who have all necessary categories - and others. Partial matches should not come in there.
Currently my code is failing with this desperate attempt:
var c = session.CreateCriteria<Article>("a");
if (categoryKeys.HasItems())
{
c.CreateAlias("a.Categories", "c");
foreach (var key in categoryKeys)
c.Add(Restrictions.Eq("c", key)); //bogus, I know!
}
Use the "IN" restriction, but supplement to ensure that the number of category matches is equal to the count of all the categories you're looking for to make sure that all the categories are matched and not just a subset.
For an example of what I mean, you might want to take a look at this page, especially the "Intersection" query under the "Toxi solution" heading. Replace "bookmarks" with "articles" and "tags" with "categories" to map that back to your specific problem. Here's the SQL that they show there:
SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3
I believe you can also represent this using a subquery that may be easier to represent with the Criteria API
SELECT Article.Id
FROM Article
INNER JOIN (
SELECT ArticleId, count(*) AS MatchingCategories
FROM ArticleCategoryMap
WHERE CategoryId IN (<list of category ids>)
GROUP BY ArticleId
) subquery ON subquery.ArticleId = Article.Id
WHERE subquery.MatchingCategories = <number of category ids in list>
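The "IN plus matching count" idea can be tried out with a small, self-contained sketch using Python's built-in sqlite3 module. The table and column names follow the subquery above; the sample rows and the helper name `articles_with_all` are made up for illustration. The approach assumes each (ArticleId, CategoryId) pair occurs at most once (enforced here by the primary key) and that the list of category ids contains no duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ArticleCategoryMap (ArticleId INTEGER, CategoryId INTEGER,
                                     PRIMARY KEY (ArticleId, CategoryId));
    INSERT INTO ArticleCategoryMap VALUES
        (1, 4), (1, 5), (1, 6),   -- article 1: categories 4 and 5, plus an extra
        (2, 4),                   -- article 2: only category 4
        (3, 5), (3, 7);           -- article 3: categories 5 and 7
""")

def articles_with_all(conn, category_ids):
    """Return ids of articles that have every category in category_ids."""
    placeholders = ",".join("?" * len(category_ids))
    sql = f"""
        SELECT ArticleId
        FROM ArticleCategoryMap
        WHERE CategoryId IN ({placeholders})
        GROUP BY ArticleId
        HAVING COUNT(*) = ?
        ORDER BY ArticleId
    """
    rows = conn.execute(sql, [*category_ids, len(category_ids)]).fetchall()
    return [r[0] for r in rows]

matches = articles_with_all(conn, [4, 5])
print(matches)  # [1]
```

Article 2 matches the IN clause but has only one of the two requested categories, so the HAVING count filters it out, which is the "no partial matches" behaviour the question asks for.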
I'm not 100% sure, but I think query by example may be what you want.
Assuming that Article to Category is a one-to-many relationship and that the Category has a many-to-one property called Article here is a VERY dirty way of doing this (I am really not proud of this but it works)
List<long> catkeys = new List<long>() { 4, 5, 6, 7 };
if (catkeys.Count == 0)
return;
var cr = Session.CreateCriteria<Article>("article")
.CreateCriteria("Categories", "cat0")
.Add(Restrictions.Eq("cat0.Id", catkeys[0]));
if (catkeys.Count > 1)
{
for (int i = 1; i < catkeys.Count; i++)
{
cr = cr.CreateCriteria("Article", "a" + i)
.CreateCriteria("Categories", "cat" + i)
.Add(Restrictions.Eq("cat" + i + ".Id", catkeys[i]));
}
}
var results = cr.List<Article>();
What it does is re-join the relationship over and over again, which guarantees an AND between the category Ids. Expect the query to be very slow, especially if the list of Ids gets big.
I am offering this solution as NOT a recommended way but at least you can have something working while looking for a proper one.