Filter inner bag by other field (not constant) - apache-pig

I am trying to run the following scenario, but it fails.
I start with a list of movies, and group it by {year, rating}.
movies = LOAD '/movies_data.csv'
USING PigStorage(',') AS (id:int, name:chararray, year:int, rating:double, duration:int);
grouped = GROUP movies BY (year, rating);
The resulting schema is:
DESCRIBE grouped;
grouped: {group: (year: int,rating: double),movies: {(id: int,name: chararray,
year: int,rating: double,duration: int)}}
Now, for each group, I would like to get the list of movie names that contain the year (which is part of the group key).
So I try the following:
model =
FOREACH grouped {
listNames = DISTINCT movies.name;
listNamesFiltered = FILTER listNames BY name MATCHES group::year;
GENERATE
group.year AS year
,group.rating AS rating
,listNamesFiltered AS listNamesFiltered
,COUNT(listNamesFiltered) AS countNamesFiltered
;};
but it fails with the message:
Invalid field projection. Projected field [group::year] does not exist in schema: name:chararray.
Using a constant (like in the following line) works:
listNamesFiltered = FILTER listNames BY name MATCHES '.*2010.*';
results in:
(2010,2.6,{(2010: Moby Dick)},1)
(2010,3.8,{(Saturday Night Live: The 2010s)},1)
Any help will be greatly appreciated.

This seems like it would be a lot easier if you did all the filtering first and then did the GROUP/DISTINCT/COUNT operations.
Data:
1,2010: Moby Dick,2010,2.6,128
2,Saturday Night Live: The 2010s,2010,3.8,127
3,2001: A Space Odyssey,2001,4.0,145
4,Forrest Gump,1994,4.9,334
Query:
movies = LOAD 'movie_data.csv' USING PigStorage(',') AS (id:int,
name:chararray, year:int, rating:double, duration:int);
filtered = FILTER movies BY name MATCHES StringConcat('.*', (chararray)year, '.*');
dump filtered;
Output:
(1,2010: Moby Dick,2010,2.6,128)
(2,Saturday Night Live: The 2010s,2010,3.8,127)
(3,2001: A Space Odyssey,2001,4.0,145)
Then do whatever else you were going to do (COUNT etc ...).
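To make the "filter first, then group" idea concrete, here is a minimal Python sketch of the same logic (not Pig code). It assumes the four sample rows above; the helper name `name_contains_year` is made up for illustration.

```python
import re
from collections import defaultdict

# Sample rows mirroring the data above: (id, name, year, rating, duration)
movies = [
    (1, "2010: Moby Dick", 2010, 2.6, 128),
    (2, "Saturday Night Live: The 2010s", 2010, 3.8, 127),
    (3, "2001: A Space Odyssey", 2001, 4.0, 145),
    (4, "Forrest Gump", 1994, 4.9, 334),
]


def name_contains_year(name, year):
    # Pig's MATCHES has to match the whole string, so the pattern is
    # wrapped in '.*' on both sides and checked with fullmatch.
    return re.fullmatch(".*" + str(year) + ".*", name) is not None


# Step 1: row-level filter, where each row is tested against its own year.
filtered = [m for m in movies if name_contains_year(m[1], m[2])]

# Step 2: group the surviving rows by (year, rating), keeping distinct names.
groups = defaultdict(set)
for _id, name, year, rating, _duration in filtered:
    groups[(year, rating)].add(name)

for (year, rating), names in sorted(groups.items()):
    print(year, rating, sorted(names), len(names))
```

Because the filter runs while `year` is still an ordinary field of each row, there is no need to reach the group key from inside the nested block.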

Related

How to check if a tuple contains an element in Apache Pig?

Let's say I have this file:
movie_id,title,genres
95004,Superman/Doomsday (2007),Action|Animation
136297,Mortal Kombat: The Journey Begins (1995),Action
193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi
193573,Love Live! The School Idol Movie (2015),Animation
I load it like this:
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres:chararray);
a = FOREACH movies GENERATE movieId, title, STRSPLIT(genres,'\\|') AS genres;
DESCRIBE a; -- a: {movieId: int,title: chararray,genres: ()}
Example of dump a results:
...
(193581,Black Butler: Book of the Atlantic (2017),(Action,Animation,Comedy,Fantasy))
(193583,No Game No Life: Zero (2017),(Animation,Comedy,Fantasy))
...
Now, if I understand correctly, the field genres is of type tuple. The question is: how can I run a query such as "get all the action movies"? I don't know how to check whether a specific element is present in the tuple genres.
I know how to do this with a Python UDF, but I would like to know if it is possible without one. Maybe I should load the file differently.
Thank you for your help.
If you are happy to put the genres into a bag rather than a tuple (which I think is more appropriate, since the number of genres varies from record to record), this can be solved with a nested FOREACH: filter the bag for the genre you want, then test whether the filtered bag is non-empty.
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres:chararray);
-- TOKENIZE returns a bag with one single-field tuple per genre
moviesSplit = FOREACH movies GENERATE movieId, title, TOKENIZE(genres,'|') AS genres;
actionTest = FOREACH moviesSplit {
    action = FILTER genres BY $0 == 'Action';
    GENERATE *, action;
};
-- keep only the movies whose filtered bag is non-empty
actionMovies = FILTER actionTest BY NOT IsEmpty(action);
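The same "split, filter, test for non-empty" logic can be sketched in plain Python to show what the Pig script is expected to return. This is only an illustration, assuming the four sample rows from the question; `has_genre` is a hypothetical helper name.

```python
# Sample rows from the question: (movieId, title, genres)
rows = [
    (95004, "Superman/Doomsday (2007)", "Action|Animation"),
    (136297, "Mortal Kombat: The Journey Begins (1995)", "Action"),
    (193565, "Gintama: The Movie (2010)", "Action|Animation|Comedy|Sci-Fi"),
    (193573, "Love Live! The School Idol Movie (2015)", "Animation"),
]


def has_genre(genres_field, wanted):
    # Split into individual genres and compare whole elements,
    # so "Sci" would not match "Sci-Fi".
    return wanted in genres_field.split("|")


action_movies = [row for row in rows if has_genre(row[2], "Action")]

for movie_id, title, _genres in action_movies:
    print(movie_id, title)
```

Comparing whole elements rather than searching the raw string avoids accidental substring matches between genre names.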

Searching on pubmed using biopython

I am trying to input over 200 entries into pubmed in order to record the number of articles published by an author and to refine the search by including his/her mentor and institution. I have tried to do this using biopython and xlrd (the code is below), but I am consistently getting 0 results for all three formats of inquiries (1. by name, 2. by name and institution name, and 3. by name and mentor's name). Are there steps of troubleshooting that I can do, or should I use a different format when using the keywords indicated below to search on pubmed?
Example output of the input queries; search_term is a list of lists containing the input queries.
print(*search_term[8:15], sep='\n')
[text:'Andrew Bland', 'Weill Cornell Medical College', text:'David Cutler MD']
[text:'Andy Price', 'University of Alabama at Birmingham School of Medicine', text:'Jason Warem, PhD']
[text:'Bah Chamin', 'University of Texas Southwestern Medical School', text:'Dr. Timothy Hillar']
[text:'Eduo Cera', 'University of Colorado School of Medicine', text:'Dr. Tim']
Code used to generate the input queries above and to search on Pubmed:
Entrez.email = "mollyzhaoe#college.harvard.edu"
for search_term in search_terms[8:55]:
handle = Entrez.egquery(term="{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) ".format(search_term[0]))
handle_1 = Entrez.egquery(term = "{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[2]))
handle_2 = Entrez.egquery(term = "{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[1]))
record = Entrez.read(handle)
record_1 = Entrez.read(handle_1)
record_2 = Entrez.read(handle_2)
pubmed_count = ['','','']
for row in record["eGQueryResult"]:
if row["DbName"] == "pubmed":
pubmed_count[0] = row["Count"]
for row in record_1["eGQueryResult"]:
if row["DbName"] == "pubmed":
pubmed_count[1] = row["Count"]
for row in record_2["eGQueryResult"]:
if row["DbName"] == "pubmed":
pubmed_count[2] = row["Count"]
Check your indentation; it is difficult to tell which part belongs to which loop.
If you want to troubleshoot, try printing the term you pass to egquery, e.g.
print("{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) ".format(search_term[0]))
and paste the output to pubmed and see what you get. Perhaps modify it a bit and see which search term causes the problems.
Your input format is a little bit hard to guess. Print the query and make sure you are getting the right search values.
For the author names, try to get rid of the academic titles. PubMed might confuse them with initials, e.g. "House MD" might be read as Mark David House.
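A minimal sketch of that clean-up step, which builds the query string without contacting PubMed so it can be printed and pasted into the site by hand. The `TITLES` set and the helper names are hypothetical. The `as_text` helper rests on an assumption: the `text:'...'` prefix in the printed input suggests some entries may be spreadsheet cell objects rather than plain strings, in which case their content would need to be read from `.value` before formatting.

```python
import re

# Hypothetical set of titles to drop; extend it to suit your data.
TITLES = {"md", "phd", "dr", "prof"}

# Date filter copied from the query in the question.
DATE_RANGE = "((2010[Date - Publication] : 2017[Date - Publication]))"


def as_text(entry):
    # Assumption: spreadsheet cells expose their content as `.value`;
    # plain strings are passed through unchanged.
    return str(getattr(entry, "value", entry))


def strip_titles(name):
    # Split on commas and whitespace, then drop tokens that are titles.
    tokens = re.split(r"[,\s]+", as_text(name).strip())
    kept = [t for t in tokens if t and t.replace(".", "").lower() not in TITLES]
    return " ".join(kept)


def build_term(author, extra=None):
    # Author first, then the date range, then an optional extra term
    # (institution or mentor name), all joined with AND.
    parts = [strip_titles(author), DATE_RANGE]
    if extra is not None:
        parts.append(as_text(extra))
    return " AND ".join(parts)


print(build_term("Andy Price"))
print(build_term("Andy Price", strip_titles("Jason Warem, PhD")))
```

Printing the term before sending it makes it easy to spot stray prefixes, titles, or punctuation that would explain a count of 0.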

How do I accumulate vectors into a map?

I have an alias A like this:
{cookie: chararray,
keywords: {tuple_of_tokens: (token: chararray)},
weight: double}
where the 2nd and 3rd fields are defined as
keywords = TOKENIZE((chararray)$5,',');
weight = 1.0/(double)SIZE(keywords);
now I want to do
foreach (group A by cookie) generate
group.cookie as cookie,
???? as keywords;
and keywords should be a map from each keyword to the sum of its weights.
E.g.,
1 k1,k2,k3
1 k2,k4
should turn into
1 {k1:1/3, k2:5/6, k3:1/3, k4:1/2}
I am already using datafu, but I am open to any alternative...
I'd do
A_counts = FOREACH A GENERATE cookie, FLATTEN(keywords) AS keyword, 1.0/SIZE(keywords) AS weight;
then
A_counts_gr = GROUP A_counts BY (cookie, keyword);
and
result = FOREACH A_counts_gr GENERATE FLATTEN(group) AS (cookie, keyword), SUM(A_counts.weight) AS weight;
Then group result by cookie again to get a bag of (keyword, weight) tuples per cookie, which you can turn into a map with datafu.
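As a sanity check on the expected numbers, the same accumulation can be sketched in Python (not Pig). It assumes the two sample rows from the question and uses exact fractions so the result can be compared with 1/3, 5/6 and 1/2 directly; `keyword_weights` is a hypothetical name.

```python
from collections import defaultdict
from fractions import Fraction

# Sample input from the question: (cookie, comma-separated keywords)
rows = [("1", "k1,k2,k3"), ("1", "k2,k4")]


def keyword_weights(rows):
    # cookie -> keyword -> accumulated weight, starting from Fraction(0)
    totals = defaultdict(lambda: defaultdict(Fraction))
    for cookie, keyword_field in rows:
        keywords = keyword_field.split(",")
        # Each keyword in a row gets 1 / (number of keywords in that row).
        weight = Fraction(1, len(keywords))
        for keyword in keywords:
            totals[cookie][keyword] += weight
    return {cookie: dict(per_keyword) for cookie, per_keyword in totals.items()}


result = keyword_weights(rows)
for cookie, weights in result.items():
    print(cookie, {k: str(v) for k, v in sorted(weights.items())})
```

The two nested levels correspond to the two grouping steps: first summing per (cookie, keyword), then collecting the sums per cookie.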

Subqueries for each row

Using these tables:
student {'s_id','s_name',...}, class {'c_id','c_name',...}, student2class {'s_id','c_id'}, grades {'s_id','c_id','grade'}
Is it possible to write a query (a nested query?) that puts the class name as a subtitle, followed by all students of that class and their grades, then the next class name as a subtitle, and so on?
The result I need is:
Maths
John .... C
Anna .... B
[...]
Biology
Anna .... C
Jack .... A
[...]
For each row from class, I would have a subquery fetching all data related to that class.
No need of any sub-query. You can get your data this way:
SELECT c_name, s_name, grade
FROM student, class, grades
WHERE student.s_id = grades.s_id and class.c_id = grades.c_id
ORDER BY c_name
The presentation of results depends on your system/tools, as others already said. This is a link to the solution for Microsoft Access:
http://office.microsoft.com/en-001/access-help/create-a-grouped-or-summary-report-HA001159160.aspx
The solution should be implemented in your client-side code, not in the database. From the database you should just fetch plain tabular data (subject, student, grade).
Then convert that recordset to the format you want.
For example, in C# you could convert the recordset into a lookup:
// DataRowCollection is non-generic, so Cast<DataRow>() is needed before ToLookup (requires System.Linq)
var Lookup = DataSet.Tables[0].Rows.Cast<DataRow>().ToLookup(x => x["subjectColumn"]);
Now you can loop through the lookup and format your result:
foreach (var grade in Lookup)
{
subject = grade.Key;
...
}
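The client-side grouping step can also be sketched in Python. This is only an illustration, assuming the query returns rows of (class name, student name, grade) already ordered by class name; the sample rows and the `format_report` helper are made up to match the desired output in the question.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical query result: (c_name, s_name, grade), ordered by c_name
# in the sense that rows of the same class are adjacent.
rows = [
    ("Maths", "John", "C"),
    ("Maths", "Anna", "B"),
    ("Biology", "Anna", "C"),
    ("Biology", "Jack", "A"),
]


def format_report(rows):
    lines = []
    # groupby only merges adjacent rows with the same key, which is why
    # the SQL query needs its ORDER BY c_name.
    for class_name, group in groupby(rows, key=itemgetter(0)):
        lines.append(class_name)  # subtitle line
        for _class_name, student, grade in group:
            lines.append("{0} .... {1}".format(student, grade))
    return lines


report = format_report(rows)
print("\n".join(report))
```

The database returns one flat, ordered table and the presentation layer decides where the subtitles go.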

Pig: apply a FOREACH operator to each element within a bag

Example: I have a relation "class", with a nested bag of students:
class: {teacher_name: chararray,students: {(firstname: chararray, lastname: chararray)}}
I want to perform an operation on each student while leaving the overall structure untouched, i.e., obtain:
class: {teacher_name: chararray,students: {(fullname: chararray)}}
where for each student, fullname = CONCAT(firstname, lastname)
My understanding is that a nested FOREACH would not be my solution here, as it still only generates 1 record per input tuple, whereas I want something that would apply within each bag item.
This is pretty easy to do with a UDF, but I wondered whether it is possible in pure Pig Latin.
In Pig 0.10 and later this is possible without a UDF, because a FOREACH can be nested inside a FOREACH. Here is an example:
inpt = load '~/pig/data/bag_concat.dat' as (k : chararray, c1 : chararray, c2 : chararray);
dump inpt;
1 q w
1 s d
2 q a
2 t y
2 u i
2 o p
bags = group inpt by k;
describe bags;
bags: {group: chararray,inpt: {(k: chararray,c1: chararray,c2: chararray)}}
result = foreach bags {
concat = foreach inpt generate CONCAT(c1, c2); --it will iterate only over the records of the inpt bag
generate group, concat;
};
dump result;
(1,{(qw),(sd)})
(2,{(qa),(ty),(ui),(op)})