How do I combine two pig statements into one? - apache-pig

This two-stage pig processing works:
my_out = foreach (group my_in by id) {
grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
generate
group as id,
CountEach(my_in.domain) as domains,
grouped as grouped;
};
my_out1 = foreach my_out {
keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
generate id, domains, keywords;
};
however, when I combine them:
my_out = foreach (foreach (group my_in by id) {
grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
generate
group as id,
CountEach(my_in.domain) as domains,
grouped as grouped;
}) {
keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
generate id, domains, keywords;
};
I get an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "generate "" at line 1, column 5.
My questions are:
How do I avoid this error?
Does it even make sense what I am trying to do?
Even if I manage to accomplish this, will this save me an MR pass?

In general, Pig's ability to parse complicated nested expressions is unreliable. Another common error when the nesting gets to be too much to handle is ERROR 1000: Error during parsing. Lexical error at line XXXX, column 0. Encountered: <EOF> after : ""
I often try to do this to avoid having to come up with a bunch of names for aliases that have no meaning except as intermediate steps in a computation. But sometimes it's not possible, as you have found out. My guess is that nesting a nested foreach is a no-go. But in your case, it looks like the first nested foreach is not necessary. Try this:
my_out = foreach (foreach (group my_in by id)
generate
group as id,
CountEach(my_in.domain) as domains,
BagGroup(my_in.(keyword,weight),my_in.keyword) as grouped
) {
keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
generate id, domains, keywords;
};
As for your second question, no, this will make no difference to the eventual MR plan. This is purely a matter of Pig parsing your script; the map-reduce logic is unchanged by grouping the commands in this way.

Related

Add array of other records from the same table to each record

My project is a Latin language learning app. My DB has all the words I'm teaching, in the table 'words'. It has the lemma (the main form of the word), along with the definition and other information the user needs to learn.
I show one word at a time for them to guess/remember what it means. The correct word is shown along with some wrong words, like:
What does Romanus mean? Greek - /Roman/ - Phoenician - barbarian
What does domus mean? /house/ - horse - wall - senator
The wrong options are randomly drawn from the same table, and must be from the same part of speech (adjective, noun...) as the correct word; but I am only interested in their lemma. My return value looks like this (some properties omitted):
[
{ lemma: 'Romanus', definition: 'Roman', options: ['Greek', 'Phoenician', 'barbarian'] },
{ lemma: 'domus', definition: 'house', options: ['horse', 'wall', 'senator'] }
]
What I am looking for is a more efficient way of doing it than my current approach, which runs a new query for each word:
// All the necessary requires are here
class Word extends Model {
static async fetch() {
const words = await this.findAll({
limit: 10,
order: [Sequelize.literal('RANDOM()')],
attributes: ['lemma', 'definition'], // also a few other columns I need
});
const wordsWithOptions = await Promise.all(words.map(this.addOptions.bind(this)));
return wordsWithOptions;
}
static async addOptions(word) {
const options = await this.findAll({
order: [Sequelize.literal('RANDOM()')],
limit: 3,
attributes: ['lemma'],
where: {
partOfSpeech: word.dataValues.partOfSpeech,
lemma: { [Op.not]: word.dataValues.lemma },
},
});
return { ...word.dataValues, options: options.map((row) => row.dataValues.lemma) };
}
}
So, is there a way I can do this with raw SQL? How about Sequelize? One thing that still helps me is to give a name to what I'm trying to do, so that I can Google it.
EDIT: I have tried the following and at least got somewhere:
const words = await this.findAll({
limit: 10,
order: [Sequelize.literal('RANDOM()')],
attributes: {
include: [[sequelize.literal(`(
SELECT lemma FROM words AS options
WHERE "partOfSpeech" = "options"."partOfSpeech"
ORDER BY RANDOM() LIMIT 1
)`), 'options']],
},
});
Now, there are two problems with this. First, I only get one option, when I need three; but if the query has LIMIT 3, I get: SequelizeDatabaseError: more than one row returned by a subquery used as an expression.
The second error is that while the code above does return something, it always gives the same word as an option! I thought to remedy that with WHERE "partOfSpeech" = "options"."partOfSpeech", but then I get SequelizeDatabaseError: invalid reference to FROM-clause entry for table "words".
So, how do I tell PostgreSQL "for each row in the result, add a column with an array of three lemmas, WHERE existingRow.partOfSpeech = wordToGoInTheArray.partOfSpeech?"
Revised
Well that seems like a different question and perhaps should be posted that way, but...
The main technique remains the same. JOIN instead of sub-select. The difference being generating the list of lemmas for then piping then into the initial query. In a single this can get nasty.
As single statement (actually this turned out not to be too bad):
select w.lemma, w.defination, string_to_array(string_agg(o.defination,','), ',') as options
from words w
join lateral
(select defination
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma in( select lemma
from words
order by random()
limit 4 --<<< replace with parameter
)
group by w.lemma, w.defination;
The other approach build a small SQL function to randomly select a specified number of lemmas. This selection is the piped into the (renamed) function previous fiddle.
create or replace
function exam_lemma_definition_options(lemma_array_in text[])
returns table (lemma text
,definition text
,option text[]
)
language sql strict
as $$
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(lemma_array_in)
group by w.lemma, w.definition;
$$;
create or replace
function exam_lemmas(num_of_lemmas integer)
returns text[]
language sql
strict
as $$
select string_to_array(string_agg(lemma,','),',')
from (select lemma
from words
order by random()
limit num_of_lemmas
) ll
$$;
Using this approach your calling code reduces to a needs a single SQL statement:
select *
from exam_lemma_definition_options(exam_lemmas(4))
order by lemma;
This permits you to specify the numbers of lemmas to select (in this case 4) limited only by the number of rows in Words table. See revised fiddle.
Original
Instead of using a sub-select to get the option words just JOIN.
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(array['Romanus', 'domus'])
group by w.lemma, w.definition;
See fiddle. Obviously this will not necessary produce the same options as your questions provides due to random() selection. But it will get matching parts of speech. I will leave translation to your source language to you; or you can use the function option and reduce your SQL to a simple "select *".

SQL Query to JSONiq Query

I want to convert an SQL query into a JSONiq Query, is there already an implementation for this, if not, what do I need to know to be able to create a program that can do this ?
I am not aware of an implementation, however, it is technically feasible and straightforward. JSONiq has 90% of its DNA coming from XQuery, which itself was partly designed by people involved in SQL as well.
From a data model perspective, a table is mapped to a collection and each row of the table is mapped to a flat JSON object, i.e., all fields are atomic values, like so:
{
"Name" : "Turing",
"First" : "Alan",
"Job" : "Inventor"
}
Then, the mapping is done by converting SELECT-FROM-WHERE queries to FLWOR expressions, which provide a superset of SQL's functionality.
For example:
SELECT Name, First
FROM people
WHERE Job = "Inventor"
Can be mapped to:
for $person in collection("people")
where $person.job eq "Inventor"
return project($person, ("Name", "First"))
More complicated queries can also be mapped quite straight-forwardly:
SELECT Name, COUNT(*)
FROM people
WHERE Job = "Inventor"
GROUP BY Name
HAVING COUNT(*) >= 2
to:
for $person in collection("people")
where $person.job eq "Inventor"
group by $name := $person.name
where count($person) ge 2
return {
name: $name,
count: count($person)
}
Actually, if for had been called from and return had been called select, and if these keywords were written uppercase, the syntax of JSONiq would be very similar to that of SQL: it's only cosmetics.

PIG ERROR 1070 when Count grouped data

I just want to count how many player for each team in 2011.
It works good when group it with tmID. However, when I try to count the grouped data, the ERROR 1070 comes out.
load_file = load 'Assignment2/basketball_players.csv' using PigStorage(',');
temp = foreach load_file generate
(chararray)$3 AS tmID,
(int)$1 AS year,
(chararray)$0 AS playerID;
fil_data = filter temp by year == 2011;
group_data = group fil_data by tmID;
count_data = foreach group_data generate group, count($1);
dump count_data;
the error message shows below.
<file script.pig, line 8, column 48> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve count using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Could anyone could help me for this problem? THX
COUNT function is case sensitive. Ref : http://pig.apache.org/docs/r0.12.0/func.html#count
Try this :
count_data = foreach group_data generate group, COUNT($1);
instead of $1 would suggest to make use of alias fil_data as its more readable.
All functions need to be in UPPER case. Commands like foreach, generate group by etc can be in either case.

Error 1045 on sum function in pig latin with an int

The following pig latin script:
data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int);
splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h;
groupedIp = group splitDate by h.$1;
a = foreach groupedIp{
added = foreach splitDate generate SUM(size); --
generate added;
};
describe a;
gives me the error:
ERROR 1045:
<file 3.pig, line 10, column 39> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
This error makes me think I need to cast size as an int, but if i describe my groupedIp field, I get the following schema.
groupedIp: {group: bytearray,splitDate: {(size: int,ip: chararray,h: bytearray)}} which indicates that size is an int, and should be able to be used by the sum function.
Am I calling the sum function incorrectly? Let me know if you would like to see any thing else, such as the input file.
SUM operates on a bag as input, but you pass it the field 'size'.
Try to eliminate the nested foreach and use:
a = foreach groupedIp generate SUM(splitDate.size);
Do some dumps of your data. I'll bet some of the stuff in the size column is non-integer, and Pig runs into that and dies. You could also code up your own isInteger udf to check this before the rest of your processing, and throw out any that aren't integers.
SUM, AVG and COUNT are functions that always work on a bag, therefore group the data and then join with the original set like below:
A = load 'nyse_data.txt' as (exchange:chararray, symbol:chararray,date:chararray, pen:float,high:float, low:float, close:float,volume:int, adj_close:float);
G = group A by symbol;
C = foreach G generate group, SUM(A.open);

How to build the correct LINQ query (generate OR instead of AND)

I need some help in building the correct query. I have an Employees table. I need to get a list of all employees, that EENO (Employee ID) contains a string from a supplied array of partial Employee IDs.
When I use this code
// IEnumerable<string> employeeIds is a collection of partial Employee IDs
IQueryable<Employee> query = Employees;
foreach (string id in employeeIds)
{
query = query.Where(e => e.EENO.Contains(id));
}
return query;
I will get something like:
SELECT *
FROM Employees
WHERE EENO LIKE '%1111111%'
AND EENO LIKE '%2222222%'
AND EENO LIKE '%3333333%'
AND EENO LIKE '%4444444%'
Which doesn't make sense.
I need "OR" instead of "AND" in resulting SQL.
Thank you!
UPDATE
This code I wrote using PredicateBuilder works perfectly when I need to include these employees.
var predicate = PredicateBuilder.False<Employee>();
foreach (string id in employeeIds)
{
var temp = id;
predicate = predicate.Or(e => e.EENO.Contains(temp));
}
var query = Employees.Where(predicate);
Now, I need to write an opposite code, to exclude these employees,
here it is but it is not working: the generated SQL is totally weird.
var predicate = PredicateBuilder.False<Employee>();
foreach (string id in employeeIds)
{
var temp = id;
predicate = predicate.And(e => !e.EENO.Contains(temp)); // changed to "And" and "!"
}
var query = Employees.Where(predicate);
return query;
It's supposed to generate SQL Where clause like this one:
WHERE EENO NOT LIKE '%11111%'
AND NOT LIKE '%22222%'
AND NOT LIKE '%33333%'
But it's not happening
The SQL generated is this: http://i.imgur.com/9MDP7.png
Any help is appreciated. Thanks.
Instead of the foreach, just build the IQueryable once:
query = query.Where(e => employeeIds.Contains(e.EENO));
I'd take a look at http://www.albahari.com/nutshell/predicatebuilder.aspx. This has a great way of building Or queries, and is written by the guy that wrote LinqPad. The above link also has examples of usage.
I believe you can use Any():
var query = Employees.Where(emp => employeeIds.Any(id => id.Contains(emp.EENO)));
If you don't want to use a predicate builder, then the only other option is to UNION each of the collections together on an intermediate query:
// IEnumerable<string> employeeIds is a collection of partial Employee IDs
IQueryable<Employee> query = Enumerable.Empty<Employee>().AsQueryable();
foreach (string id in employeeIds)
{
string tempID = id;
query = query.Union(Employees.Where(e => e.EENO.Contains(tempID));
}
return query;
Also keep in mind that closure rules are going to break your predicate and only end up filtering on your last criteria. That's why I have the tempID variable inside the foreach loop.
EDIT: So here's the compendium of all the issues you've run across:
Generate ORs instead of ANDS
Done, using PredicateBuilder.
Only last predicate is being applied
Addressed by assigning a temp variable in your inner loop (due to closure rules)
Exclusion predicates not working
You need to start with the correct base case. When you use ORs, you need to make sure you start with the false case first, that way you only include records where AT LEAST ONE predicate matches (otherwise doesn't return anything). The reason for this is that the base case should just get ignored for purposes of evaluation. In other words false || predicate1 || predicate2 || ... really is just predicate1 || predicate2 || ... because you're looking for at least one true in your list of predicates (and you just need a base to build on). The opposite applies to the AND case. You start with true so that it gets "ignored" for purposes of evaluation, but you still need a base case. In other words: true && predicate1 && ... is the same as predicate1 && .... Hope that addresses your last issue.