Pig - comparing two similar statement : one working, the other not - apache-pig

I am starting to get really annoyed with Pig: the language seems unstable, the documentation is poor, there are not many examples on the internet, and any small change in the code can produce radically different outcomes, from failure to the expected result. Here is another example of that last theme:
grunt> describe actions_by_unite;
actions_by_unite: {
    group: chararray,
    nb_actions_by_unite_and_action: {
        (
            unite: chararray,
            lib_type_action: chararray,
            double
        )
    }
}
-- works:
z = foreach actions_by_unite {
    generate group, SUM(nb_actions_by_unite_and_action.$2);
};
-- doesn't work:
z = foreach actions_by_unite {
    x = SUM(nb_actions_by_unite_and_action.$2);
    generate group, x;
};
-- error:
2015-05-08 14:43:44,712 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 107, column 16> Invalid scalar projection: x : A column needs to be projected from a relation for it to be used as a scalar
Details at logfile: /private/tmp/pig-err.log
And so:
-- doesn't work either:
z = foreach actions_by_unite {
    x = SUM(nb_actions_by_unite_and_action.$2);
    generate group, x.$0;
};
-- error:
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (AC,EMAIL,1.1186133550060547E-4), 2nd :(AC,VISITE,6.25755280560356E-4)
at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:120)
Would anyone know why?
Do you have any good blogs or resources with examples to recommend for mastering this language?
I have the O'Reilly book, but it seems a bit old; I also have 'Agile Data Science' and 'Hadoop: The Definitive Guide', which contain some examples. I found this page really interesting: https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/
Any good videos on Coursera or other suggestions? Do you also have problems with this language, or am I simply dumb?

That particular issue is not Pig being unstable; what you are trying to do is correct in the first approach but wrong in the other two.
When you GROUP BY, you get for each group a bag that contains X tuples. Inside a nested FOREACH you work with one group and its bag per iteration, which means that a SUM in there yields a scalar value: the sum of the bag you are currently working with. Pig works with relations, not scalars, so you cannot assign a scalar value to an alias, which is exactly what the second and third approaches do.
Therefore, the error comes from attempting something like:
A = foreach B {
    x = SUM(bag.$0);
}
However, if you want to emit a scalar for each group, you can do so as long as you never assign the scalar to an alias. That is why it works when you do the sum directly in the GENERATE at the end of the FOREACH: for each group you return a tuple with two values, the group and the sum.
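Using the relation from the question, the working pattern is simply the following (the nested block is not even needed when nothing else happens inside it):
-- compute the sum directly in GENERATE; no alias is ever assigned a scalar
z = foreach actions_by_unite generate
    group, SUM(nb_actions_by_unite_and_action.$2);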

Related

RStudio Error: Unused argument (by = ...) when fitting gam model, and smoothing separately for a factor

I am still a beginner in R. For a project I am trying to fit a GAM on a simple dataset with a time variable and a year. I keep getting an error message that claims an argument is unused, even though I specify it in the code.
The dataset includes a categorical variable "Year" with only two levels, 2020 and 2022. I want to investigate whether there is a peak in the hourly rate of visitors ("H1") in a nature reserve. For each observation period the average time was taken, which is the predictor variable used here ("T"). I want to use a GAM for this, with the smoothing applied separately for the two years.
The following is the line of code that I tried to use
`gam1 <- gam(H1~Year+s(T,by=Year),data = d)`
When I try to run this code, I get the following error message
`Error in s(T, by = Year) : unused argument (by = Year)`
I also tried simply getting rid of the "by" argument
`gam1 <- gam(H1~Year+s(T,Year),data = d)`
This allows me to run the code, but when I try to view the output using summary(gam1), I get
Error in `[<-`(`*tmp*`, snames, 2, value = round(nldf, 1)) : subscript out of bounds
Since I feel like both errors are probably related to the same thing that I'm doing wrong, I decided to combine the question.
Did you load the {mgcv} package or the {gam} package? The latter doesn't have factor by smooths and as such the first error message is what I would expect if you did library("gam") and then tried to fit the model you showed.
To fit the model you showed, you should restart R and try in a clean session:
library("mgcv")
# load your data
# fit model
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)
It could well be that you have both {gam} and {mgcv} loaded, in which case whichever you loaded last will be earlier on the function search path. As both packages have functions gam() and s(), R might just be finding the wrong versions (masking), so you might also try
gam1 <- mgcv::gam(H1 ~ Year + mgcv::s(T, by = Year), data = d)
But you would be better off only loading {mgcv} if you want factor by smooths.
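If both packages are attached, one way around the masking (a sketch, assuming {gam} is currently attached) is to detach it before refitting:
# detach {gam} so mgcv's gam() and s() are found first
detach("package:gam", unload = TRUE)
library("mgcv")
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)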
@Gavin Simpson
I did have both loaded, and I tried using mgcv:: as you suggested. However, I then get the following error:
Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
I am assuming this is simply because it's not actually trying to use the gam() function, but rather it attempts to name something gam1. So I would assume I actually need the 'gam' package before I could do this.
The second line of code also doesn't work. I get the following error:
Error in model.frame.default(formula = H1 ~ Year + mgcv::s(T, by = Year), :
invalid type (list) for variable 'mgcv::s(T, by = Year)'
This happens no matter the order in which I load the two packages. And if I don't load 'gam', I get the error described above.

tdbc::tokenize documentation and use

I'm using tdbc::odbc to connect to a Pervasive (btrieve type) database, and am unable to pass variables to the driver. A short test snippet:
set customer "100000"
set st [pvdb prepare {
    INSERT INTO CUSTOMER_TEMP_EMPTY
    SELECT * FROM CUSTOMER_MASTER
    WHERE CUSTOMER = :customer
}]
$st execute
This returns:
[Pervasive][ODBC Client Interface]Parameter number out of range.
(binding the 'customer' parameter)
It works fine if I replace :customer with "100000". I have tried using a variable with $ and #, and wrapping it in apostrophes, quotes, and braces. I believe tdbc::tokenize is the answer I'm looking for, but the man page gives no useful information on its use. I've experimented with tokenize with no progress at all. Can anyone comment on this?
The tdbc::tokenize command is a helper for writing TDBC drivers. It's used for working out which bound variables are inside an SQL string so that the binding map can be supplied to the low-level driver or, in the case of particularly stupid drivers, string substitutions performed (I hope there are no drivers that need to do this; it'd be annoyingly difficult to get right). The parser knows enough to handle weird cases like things that look like bound variables in strings and comments (those aren't bound variables).
If we feed it (its calling syntax is trivial) the example SQL you've got, we get this result:
{
INSERT INTO CUSTOMER_TEMP_EMPTY
SELECT * FROM CUSTOMER_MASTER
WHERE CUSTOMER = } :customer {
}
That's a list of three items (the last element just has a newline in it) that simplifies processing a lot; each item is either trivially a bound variable or trivially not.
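For instance, a driver might walk that list like this (a sketch, not actual tdbc driver code; the variable names here are made up):
# each element is either a bound variable (starts with :, $ or #) or literal SQL
foreach token [tdbc::tokenize $sql] {
    if {[regexp {^[$#:](.+)$} $token -> name]} {
        lappend boundVars $name   ;# remember the variable for the binding map
    } else {
        append literalSql $token  ;# plain SQL text, passed through unchanged
    }
}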
Other examples (bear in mind in the second case that bound variables may also start with $ or #):
% tdbc::tokenize {':abc' = :abc = ":abc" -- :abc}
{':abc' = } :abc { = ":abc" -- :abc}
% tdbc::tokenize {foo + $bar - #grill}
{foo + } {$bar} { - } #grill
% tdbc::tokenize {foo + :bar + [:grill]}
{foo + } :bar { + [:grill]}
Note that the tokenizer does not fully understand SQL! It makes no attempt to parse the other bits; it's just looking for what is a bound variable.
I've no idea what use the tokenizer could be to you if you're not writing a DB driver.
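For completeness, a sketch of how a bound variable is normally supplied through TDBC (reusing the pvdb handle from the question): execute picks :customer up from a variable of the same name in the caller's scope, or from an explicitly supplied dictionary. This is the documented route, though it appears to be exactly what the Pervasive driver is rejecting here.
set customer "100000"
set st [pvdb prepare {
    INSERT INTO CUSTOMER_TEMP_EMPTY
    SELECT * FROM CUSTOMER_MASTER
    WHERE CUSTOMER = :customer
}]
# bind from the variable in the current scope...
$st execute
# ...or pass the bindings explicitly as a dict
$st execute [dict create customer $customer]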
I still could not get the driver to accept the variable, but looking at your first example of the tokenized return, I came up with:
set customer "100000"
set v [tdbc::tokenize "$customer"]
set query "INSERT INTO CUSTOMER_TEMP_EMPTY SELECT * FROM CUSTOMER_MASTER WHERE CUSTOMER = $v"
set st [pvdb prepare $query]
$st execute
as a test command, and it did indeed successfully pass the query through the driver.

REGEX_EXTRACT error in PIG

I have a CSV file with 3 columns: tweetid, tweet, and userid. However, within the tweet column there are comma-separated values.
E.g. one row of data:
`396124437168537600`,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD 'tweets' USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
In the use case shared, reading the data using PigStorage(',') will result in the last field value, savava143, going missing.
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output : A : Observe that the last field value is missing.
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from a CSV file whose field values contain ',', we can use either CSVExcelStorage or CSVLoader.
Approach 1 : Using CSVExcelStorage
Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input : a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig Script :
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output : A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2 : Using CSVLoader
Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
The script below makes use of CSVLoader(); DUMP A will produce the same output seen earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
The error is that you do not want to FILTER based on a regex but to GENERATE new fields based on a regex. To filter, you need to know whether the line has to be filtered, hence the boolean requirement.
Therefore, you have to use :
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, HOW_MANY_GROUPS_TO_RETURN);
However, as @Murali Rao said, your values are not just comma-separated but CSV (think about how you will handle a comma in a tweet: it is not a field separator, just content).
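Applied to the question's own relation and regex (purely as an illustration; whether that regex captures what you want is a separate question):
-- generate the captured group instead of filtering on it
b = FOREACH a GENERATE REGEX_EXTRACT(f1, '(.*)\\"(.*)', 1);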

What is the ideal way to select rows by projected keys in Groundhog Haskell

I'm trying to create a many-to-many table using Groundhog in Haskell, which basically looks like this if I cut away all the other logic I have:
data FooRow = FooRow {
    fooRowUUID :: UUID
}
deriving instance Show FooRow

data BarRow = BarRow {
    barRowUUID :: UUID
}
deriving instance Show BarRow

data FooToBarRow = FooToBarRow {
    fooToBarRowUUID :: UUID,
    fooToBarRowFoo :: DefaultKey FooRow,
    fooToBarRowBar :: DefaultKey BarRow
}
deriving instance Show FooToBarRow
Now, trying to define operations, I can get and insert all of these records just fine; however, I'm not sure how to go from having a FooRow with its ID to getting all the related BarRows by way of the many-to-many table. Right now I've played with something like this:
getBarsForFoo fooID = do
    barKeys <- project
        (FooToBarRowBarField)
        (FooToBarRowFooField ==. (Foo_FooKey fooID))
    select $ (BarRowUUIDField `in_` barKeys)
However this doesn't typecheck, with the error:
Couldn't match type 'UUID' with 'BarRow'
Inspecting just the results of the project with putStrLn, I can see that the type of barKeys is:
[Bar_BarKey UUID]
but I don't quite understand how to make use of that within my query. I don't see any examples like this in the Groundhog documentation, so I'm hoping someone will be able to put me on the right path here.
I'm quite certain there are more efficient ways to go about this (there are going to be a bunch of underlying queries with this approach), but this does at least get the job done for now.
getBarsForFoo fooID = do
    -- project out the keys of the related bars from the join table
    barKeys <- project
        (FooToBarRowBarField)
        (FooToBarRowFooField ==. (Foo_FooKey fooID))
    -- fetch each BarRow by key; getBy yields a Maybe, so drop the misses
    q <- mapM (getBy) barKeys
    return (catMaybes q :: [BarRow])

USING Filter in a Nested FOREACH in PIG

I have two Pig relations. The first, count_pairs, shows pairs of words and how many times they were seen, e.g. ((car,tire), 4). The second, word_counts, keeps track of how many times each word was seen, e.g. (car, 20). I would like to find the ratio of how many times each pair was seen to how many times just the first word was seen; in this case I would want ((car,tire), 4/20). I tried to write a nested FOREACH to solve this problem:
percent_count_pairs = FOREACH count_pairs {
    denom = FILTER word_counts BY ($0 == count_pairs.pair.word1);
    GENERATE pair, count2 / (double)denom.$1;
};
I keep getting this error:
'Pig script failed to parse:
<file src/cluster.pig, line 27, column 15> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)'
This points to the line with the FILTER.
Googling this error did not lead me to anything helpful. Please help!
(PS: this does work if I take the FILTER line out of the FOREACH...)
After more Googling I came to realize that this is a bug in Pig that will not allow this:
https://issues.apache.org/jira/browse/PIG-1798. I ended up writing my own UDF to filter.
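A common workaround for this limitation, sketched under assumed schemas (word_counts as (word:chararray, count:long); the names word and count are guesses), is to avoid the nested FILTER entirely and JOIN on the first word instead:
-- join each pair with the count of its first word, then divide
joined = JOIN count_pairs BY pair.word1, word_counts BY word;
percent_count_pairs = FOREACH joined GENERATE
    count_pairs::pair AS pair,
    (double)count_pairs::count2 / (double)word_counts::count;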