Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast - apache-pig

I want to compute the sum of a column that contains long values.
I have tried many approaches, but the cast error is still not resolved.
My Pig code:
raw_ds = LOAD '/tmp/bimallik/data/part-r-00098' using PigStorage(',') AS (
d1:chararray, d2:chararray, d3:chararray, d4:chararray, d5:chararray,
d6:chararray, d7:chararray, d8:chararray, d9:chararray );
parsed_ds = FOREACH raw_ds GENERATE d8 as inBytes:long, d9 as outBytes:long;
X = FOREACH parsed_ds GENERATE (long)SUM(parsed_ds.inBytes) AS inBytes;
dump X;
Error snapshot:
2015-11-20 02:16:26,631 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
Details at logfile: /users/bimallik/pig_1448014584395.log
2015-11-20 02:17:03,629 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complet

@ManjunathBallur Thanks for the input.
I have changed my code as below:
<..same as before ...>
A = GROUP parsed_ds by inBytes;
X = FOREACH A GENERATE SUM(parsed_ds.inBytes) as h;
DUMP X;
Now A generates one bag per distinct inBytes value, and X gives the sum within each of those bags, so the result still has multiple rows, whereas I need a single summation value.
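For what it's worth, the usual way to get one total for a whole relation in Pig is to put every row into a single bag with GROUP ... ALL and then apply SUM to that bag. A minimal sketch using the names from the question (the explicit (long) casts and the new relation/field names are my assumptions, since d8 and d9 are loaded as chararray):
parsed_ds = FOREACH raw_ds GENERATE (long)d8 AS inBytes, (long)d9 AS outBytes; -- cast chararray to long before aggregating
grouped = GROUP parsed_ds ALL; -- a single group holding every row
X = FOREACH grouped GENERATE SUM(parsed_ds.inBytes) AS totalInBytes; -- one row, one total
DUMP X;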

In local mode (pig -x local) I was getting the same issue.
I had tried all the solutions available on the internet, but nothing seemed to work for me.
I then switched Pig from local mode to MapReduce mode and tried the same solution, and it worked.
In MapReduce mode all of the solutions seem to work.
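For reference, the execution mode is chosen when Pig is launched, so the same script can be tried in either mode (the script name below is hypothetical):
pig -x local sum_bytes.pig        # run against the local filesystem
pig -x mapreduce sum_bytes.pig    # run on the Hadoop cluster (the default mode)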

Related

Error in data frame creation in R in Spark using as.data.frame

I am trying to convert a SparkDataFrame to an R data frame.
%python
temp_df.createOrReplaceTempView("temp_df_r")
%r
temp_sql = sql("select * from temp_df_r")
temp_r = as.data.frame(temp_sql)
Error in as.data.frame.default(temp_sql) :
cannot coerce class ‘structure("SparkDataFrame", package = "SparkR")’ to a data.frame
Sometimes I get this error; it is still unclear why it happens sometimes and not others.
I need more details. What environment are you using?

CSV file input not working together with set field value step in Pentaho Kettle

I have a very simple Pentaho Kettle transformation that causes a strange error. It consists of reading a field X from a CSV, adding a field Y, setting Y=X, and finally writing the result back to another CSV.
Here you can see the steps and the configuration for them:
You can also download the ktr file from here. The input data is just this:
1
2
3
When I run this transformation, I get this error message:
ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : Unexpected error
ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : org.pentaho.di.core.exception.KettleStepException:
Error writing line
Error writing field content to file
Y Number : There was a data type error: the data type of [B object [[B@b4136a] does not correspond to value meta [Number]
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeRowToFile(TextFileOutput.java:273)
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.processRow(TextFileOutput.java:195)
at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
at java.lang.Thread.run(Unknown Source)
Caused by: org.pentaho.di.core.exception.KettleStepException:
Error writing field content to file
Y Number : There was a data type error: the data type of [B object [[B@b4136a] does not correspond to value meta [Number]
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeField(TextFileOutput.java:435)
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeRowToFile(TextFileOutput.java:249)
... 3 more
Caused by: org.pentaho.di.core.exception.KettleValueException:
Y Number : There was a data type error: the data type of [B object [[B@b4136a] does not correspond to value meta [Number]
at org.pentaho.di.core.row.value.ValueMetaBase.getBinaryString(ValueMetaBase.java:2185)
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.formatField(TextFileOutput.java:290)
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeField(TextFileOutput.java:392)
... 4 more
All of the above lines start with 2015/09/23 12:51:18 - Text file output.0 -, but I edited it out for brevity. I think the relevant, and confusing, part of the error message is this:
Y Number : There was a data type error: the data type of [B object [[B@b4136a] does not correspond to value meta [Number]
Some further notes:
If I bypass the Set field value step by using the lower hop instead, the transformation finishes without errors. This leads me to believe that the Set field value step is causing the problem.
If I replace the CSV file input with a Data Grid step containing the same data (1, 2, 3), everything works just fine.
If I replace the file output step with a Dummy step, the transformation finishes without errors. However, if I preview the Dummy step, it causes a similar error and the field Y has the value <null> on all three rows.
Before I created this MCVE I got the error on all sorts of seemingly random steps, even when there was no file output present, so I do not think this is related to the file output.
If I change the format from Number to Integer, nothing changes. But if I change it to String, the transformation finishes without errors and I get this output:
X;Y
1;[B@49e96951
2;[B@7b016abf
3;[B@1a0760b0
Is this a bug? Am I doing something wrong? How can I make this work?
It's because of lazy conversion; turn it off in the CSV file input step. This is behaving exactly as designed, although admittedly the error and the user experience could be improved.
Lazy conversion must not be used when you need to access the field value inside your transformation, which is exactly what the Set field value step does. The default should probably be off rather than on.
If your field is going straight into a database, then use it and it will be faster.
You can even have "partially lazy" streams, where you use lazy conversion for speed but then add a Select values step to "un-lazify" the fields you need to access, whilst the rest remain lazy.
Cunning, huh?

Apache Pig: ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: can't look backwards more than one token in this stream

I wrote a UDF that returns a string; here is a code sample:
split data into purchased IF ((boolean) (myudf(param)), failed OTHERWISE;
For example, here is what my UDF returns:
split data into purchased IF ((boolean) (retcode == 'SUCCESS')), failed OTHERWISE;
Unfortunately, I get the following error:
Apache Pig: ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: can't look backwards more than one token in this stream
I also tried this:
split data into purchased IF ((boolean) '(retcode == 'SUCCESS')'), failed OTHERWISE;
I get this error:
2015-06-19 10:10:48,330 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 11, column 85> Syntax error, unexpected symbol at or near '250.00'
I also tried this:
split data into purchased IF ((boolean) '(retcode == \'SUCCESS\')'), failed OTHERWISE;
I don't get any error, but I don't get the expected result back either.
Any help with this would be great.
That error is thrown because ANTLR can't correctly parse that statement. Pig should give you a different error showing what the problem is, as it usually does, but it seems the parsing rules for the SPLIT statement don't account for what happens when you try to cast the condition.
The problem here is solved simply by removing that cast to boolean:
split data into purchased IF (retcode == 'SUCCESS'), failed OTHERWISE;
That will work.
Why does the cast make it fail? That I don't know. My guess is that casting the output of an expression counts as two different expressions, since the inner expression has to be resolved before the cast is applied, and the syntax rule for the SPLIT operator does not allow this. I'm not 100% sure, though.
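For completeness, a minimal sketch of the working form, using the field name from the question (the LOAD statement and its schema are my assumptions):
data = LOAD 'input.txt' USING PigStorage(',') AS (retcode:chararray, amount:double); -- assumed input schema
SPLIT data INTO purchased IF (retcode == 'SUCCESS'), failed OTHERWISE; -- no cast around the condition
DUMP purchased;
DUMP failed;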

Pig - trouble with datafu's TransposeTupletoBag

I can't get the example code for LinkedIn DataFu's TransposeTupleToBag (at http://linkedin.github.io/datafu/docs/current/datafu/pig/util/TransposeTupleToBag.html) to work.
register datafu-1.1.0.jar
define Transpose datafu.pig.util.Transpose();
x = LOAD 'input.txt' AS (id:int,val1:int,val2:int,val3:int);
dump x
(1,10,11,12)
y = FOREACH x GENERATE id, Transpose(val1 .. val3);
2013-11-08 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070:
Could not resolve datafu.pig.util.Transpose using imports: [, java.lang.,
org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: pig_1383941559971.log
For some reason, Pig cannot find Transpose(). I'm able to use other DataFu functions, so it's not a problem with the JAR path. I'm using Pig 0.11.1 in local mode.
This turned out to be an easy fix. Either DataFu's documentation was mistyped or the API has been updated; Transpose() should be TransposeTupleToBag() (which makes much more sense, anyway). Just replace
define Transpose datafu.pig.util.Transpose();
with
define Transpose datafu.pig.util.TransposeTupleToBag();
and you're good to go.
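Putting it together, the corrected version of the script from the question would look like this (jar name and input schema are taken from the question):
register datafu-1.1.0.jar
define Transpose datafu.pig.util.TransposeTupleToBag(); -- the class is TransposeTupleToBag, not Transpose
x = LOAD 'input.txt' AS (id:int, val1:int, val2:int, val3:int);
y = FOREACH x GENERATE id, Transpose(val1 .. val3); -- turns columns val1..val3 into a bag of (name, value) tuples
dump y;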

Apache Pig join error 2087 "Found index:0 in multiple LocalRearrange operators"

So I've got two relations:
pageview counts by GUID and URL (pv_counts)
events by the same GUID and URL (ev_counts)
I'm trying to join them with joined_counts = JOIN ev_counts BY ev_site_guid, pv_counts BY pv_site_guid;, but I keep getting this error:
ERROR 2087: Unexpected problem during optimization. Found index:0 in multiple LocalRearrange operators.
I've tried using Pig 10 and Pig 11, but both return the same error.
I've Googled it, but mostly I just come up with the Pig source code, not an explanation of what the error means. I've also tried making sure I don't have any nulls or empty strings in the join keys.
Does anyone have any idea what I'm doing wrong?
Here's the schema and some sample data:
pv_counts
describe pv_counts;
{group::pv_site_guid:chararray, group::pv_hostname:chararray, pv_count:long}
dump pv_counts;
(bSAw-mF-0r4Q-4acwqm_6r,example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,sports.example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,opinion.example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,newsinfo.example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,lifestyle.example-url.com,10)
.... many more pageviews than events ....
(dZiLDGjsGr3O3zacn9QLBk,example-url2.com.com,10)
(dZiLDGjsGr3O3zacn9QLBk,example-url3.com,10)
ev_counts
describe ev_counts;
{group::ev_site_guid:chararray, group::ee_hostname:chararray, ev1count:long, ev2count:long, ev3count:long, ev4count:long, ev5count:long}
dump ev_counts;
(bSAw-mF-0r4Q-4acwqm_6r,example-url.com,29,0,0,0,0)
(bSAw-mF-0r4Q-4acwqm_6r,sports.example-url.com,7,0,0,0,0)
(bSAw-mF-0r4Q-4acwqm_6r,lifestyle.example-url.com,2,0,0,0,0)
.... not as many events as pageviews ....
(dZiLDGjsGr3O3zacn9QLBk,example-url2.com.com,0,0,37,0,0)
(dZiLDGjsGr3O3zacn9QLBk,example-url3.com,0,0,1,0,0)
I can dump the relations just fine in Pig and Grunt.
When I add the following join statement, it gets to the very end and dies:
joined_counts = JOIN ev_counts BY ev_site_guid, pv_counts BY pv_site_guid;
dump joined_counts;
It throws the "ERROR 2087: Unexpected problem during optimization. Found index:0 in multiple LocalRearrange operators." error and an ugly stack trace. I'm relatively new to Pig, so I've never dug into its internals.
If anyone has any tips or things to try, I'd gladly try them. We're running on Cloudera's CDH3U3 (0.20.2).