BigQuery UDF Internal Error - google-bigquery

We have a simple UDF in BigQuery that somehow keeps failing with:
Query Failed
Error: An internal error occurred and the request could not be completed.
The query simply tries to use the UDF to compute a SHA256 hash:
SELECT
input AS title,
input_sha256 AS title_sha256
FROM
SHA256(
SELECT
title AS input
FROM
[bigquery-public-data:hacker_news.stories]
GROUP BY
input
)
LIMIT
1000
The inline UDF is pasted below. However, I cannot post the full UDF here because Stack Overflow complains about too much code in the post; the full UDF can be seen in this gist.
function sha256(row, emit) {
emit(
{
input: row.input,
input_sha256: CryptoJS.SHA256(row.input).toString(CryptoJS.enc.Hex)
}
);
}
bigquery.defineFunction(
'SHA256', // Name of the function exported to SQL
['input'], // Names of input columns
[
{'name': 'input', 'type': 'string'},
{'name': 'input_sha256', 'type': 'string'}
],
sha256 // Reference to JavaScript UDF
);
Not sure if it helps, but the Job-ID is
bigquery:bquijob_7fd3b51c_153c058dc7c
Looks like there is a similar issue at:
https://code.google.com/p/google-bigquery/issues/detail?id=478
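For reference, the hashing logic itself can be exercised locally in Node.js to rule out the UDF body (a sketch assuming the crypto-js npm package, which bundles the same library; the sample title string is made up):
// Quick local check of the CryptoJS call used in the UDF.
const CryptoJS = require('crypto-js');
console.log(CryptoJS.SHA256('Some Hacker News title').toString(CryptoJS.enc.Hex));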

Short answer - this is an issue related to memory allocation that I uncovered via my own testing and fixed today, but it will take a little while to flow out to production.
Slightly longer answer - we just rolled out a fix today for an issue where users were hitting "out of memory" errors when scaling their UDFs up to larger numbers of rows, even though the UDF would succeed on smaller numbers of rows. The queries that were hitting that condition are now running fine on our internal / test trees. However, since public BigQuery hosts have much higher traffic loads, the JavaScript engine that executes the UDFs (V8) behaves somewhat differently in production than it does in internal trees. Specifically, there's a new memory allocation error that some of the previously OOMing jobs are now hitting that we couldn't observe until the queries ran on a fully loaded tree.
It's a minor error with a quick fix, but we'd ideally like to let it flow through our regular testing and QA cycle. This should put the fix in production in about a week, assuming nothing else goes wrong with the candidate. Would that be acceptable for you?

I am re-using the answer box to provide the full query text. It works if the LIMIT 40 below is uncommented.
SELECT input, input_sha256 FROM JS(
(
SELECT title AS input
FROM [bigquery-public-data:hacker_news.stories]
GROUP BY input
//LIMIT 40
),
input,
"[ {'name': 'input', 'type': 'string'}, {'name': 'input_sha256', 'type': 'string'} ] ",
"function(row, emit) {
var CryptoJS=CryptoJS||function(h,s){var f={},g=f.lib={},q=function(){},m=g.Base={extend:function(a){q.prototype=this;var c=new q;a&&c.mixIn(a);c.hasOwnProperty('init')||(c.init=function(){c.$super.init.apply(this,arguments)});c.init.prototype=c;c.$super=this;return c},create:function(){var a=this.extend();a.init.apply(a,arguments);return a},init:function(){},mixIn:function(a){for(var c in a)a.hasOwnProperty(c)&&(this[c]=a[c]);a.hasOwnProperty('toString')&&(this.toString=a.toString)},clone:function(){return this.init.prototype.extend(this)}}, r=g.WordArray=m.extend({init:function(a,c){a=this.words=a||[];this.sigBytes=c!=s?c:4*a.length},toString:function(a){return(a||k).stringify(this)},concat:function(a){var c=this.words,d=a.words,b=this.sigBytes;a=a.sigBytes;this.clamp();if(b%4)for(var e=0;e<a;e++)c[b+e>>>2]|=(d[e>>>2]>>>24-8*(e%4)&255)<<24-8*((b+e)%4);else if(65535<d.length)for(e=0;e<a;e+=4)c[b+e>>>2]=d[e>>>2];else c.push.apply(c,d);this.sigBytes+=a;return this},clamp:function(){var a=this.words,c=this.sigBytes;a[c>>>2]&=4294967295<< 32-8*(c%4);a.length=h.ceil(c/4)},clone:function(){var a=m.clone.call(this);a.words=this.words.slice(0);return a},random:function(a){for(var c=[],d=0;d<a;d+=4)c.push(4294967296*h.random()|0);return new r.init(c,a)}}),l=f.enc={},k=l.Hex={stringify:function(a){var c=a.words;a=a.sigBytes;for(var d=[],b=0;b<a;b++){var e=c[b>>>2]>>>24-8*(b%4)&255;d.push((e>>>4).toString(16));d.push((e&15).toString(16))}return d.join('')},parse:function(a){for(var c=a.length,d=[],b=0;b<c;b+=2)d[b>>>3]|=parseInt(a.substr(b, 2),16)<<24-4*(b%8);return new r.init(d,c/2)}},n=l.Latin1={stringify:function(a){var c=a.words;a=a.sigBytes;for(var d=[],b=0;b<a;b++)d.push(String.fromCharCode(c[b>>>2]>>>24-8*(b%4)&255));return d.join('')},parse:function(a){for(var c=a.length,d=[],b=0;b<c;b++)d[b>>>2]|=(a.charCodeAt(b)&255)<<24-8*(b%4);return new r.init(d,c)}},j=l.Utf8={stringify:function(a){try{return decodeURIComponent(escape(n.stringify(a)))}catch(c){throw Error('Malformed UTF-8 data');}},parse:function(a){return n.parse(unescape(encodeURIComponent(a)))}}, u=g.BufferedBlockAlgorithm=m.extend({reset:function(){this._data=new r.init;this._nDataBytes=0},_append:function(a){'string'==typeof a&&(a=j.parse(a));this._data.concat(a);this._nDataBytes+=a.sigBytes},_process:function(a){var c=this._data,d=c.words,b=c.sigBytes,e=this.blockSize,f=b/(4*e),f=a?h.ceil(f):h.max((f|0)-this._minBufferSize,0);a=f*e;b=h.min(4*a,b);if(a){for(var g=0;g<a;g+=e)this._doProcessBlock(d,g);g=d.splice(0,a);c.sigBytes-=b}return new r.init(g,b)},clone:function(){var a=m.clone.call(this); a._data=this._data.clone();return a},_minBufferSize:0});g.Hasher=u.extend({cfg:m.extend(),init:function(a){this.cfg=this.cfg.extend(a);this.reset()},reset:function(){u.reset.call(this);this._doReset()},update:function(a){this._append(a);this._process();return this},finalize:function(a){a&&this._append(a);return this._doFinalize()},blockSize:16,_createHelper:function(a){return function(c,d){return(new a.init(d)).finalize(c)}},_createHmacHelper:function(a){return function(c,d){return(new t.HMAC.init(a, d)).finalize(c)}}});var t=f.algo={};return f}(Math);
(function(h){for(var s=CryptoJS,f=s.lib,g=f.WordArray,q=f.Hasher,f=s.algo,m=[],r=[],l=function(a){return 4294967296*(a-(a|0))|0},k=2,n=0;64>n;){var j;a:{j=k;for(var u=h.sqrt(j),t=2;t<=u;t++)if(!(j%t)){j=!1;break a}j=!0}j&&(8>n&&(m[n]=l(h.pow(k,0.5))),r[n]=l(h.pow(k,1/3)),n++);k++}var a=[],f=f.SHA256=q.extend({_doReset:function(){this._hash=new g.init(m.slice(0))},_doProcessBlock:function(c,d){for(var b=this._hash.words,e=b[0],f=b[1],g=b[2],j=b[3],h=b[4],m=b[5],n=b[6],q=b[7],p=0;64>p;p++){if(16>p)a[p]= c[d+p]|0;else{var k=a[p-15],l=a[p-2];a[p]=((k<<25|k>>>7)^(k<<14|k>>>18)^k>>>3)+a[p-7]+((l<<15|l>>>17)^(l<<13|l>>>19)^l>>>10)+a[p-16]}k=q+((h<<26|h>>>6)^(h<<21|h>>>11)^(h<<7|h>>>25))+(h&m^~h&n)+r[p]+a[p];l=((e<<30|e>>>2)^(e<<19|e>>>13)^(e<<10|e>>>22))+(e&f^e&g^f&g);q=n;n=m;m=h;h=j+k|0;j=g;g=f;f=e;e=k+l|0}b[0]=b[0]+e|0;b[1]=b[1]+f|0;b[2]=b[2]+g|0;b[3]=b[3]+j|0;b[4]=b[4]+h|0;b[5]=b[5]+m|0;b[6]=b[6]+n|0;b[7]=b[7]+q|0},_doFinalize:function(){var a=this._data,d=a.words,b=8*this._nDataBytes,e=8*a.sigBytes; d[e>>>5]|=128<<24-e%32;d[(e+64>>>9<<4)+14]=h.floor(b/4294967296);d[(e+64>>>9<<4)+15]=b;a.sigBytes=4*d.length;this._process();return this._hash},clone:function(){var a=q.clone.call(this);a._hash=this._hash.clone();return a}});s.SHA256=q._createHelper(f);s.HmacSHA256=q._createHmacHelper(f)})(Math);
(function(){var h=CryptoJS,j=h.lib.WordArray;h.enc.Base64={stringify:function(b){var e=b.words,f=b.sigBytes,c=this._map;b.clamp();b=[];for(var a=0;a<f;a+=3)for(var d=(e[a>>>2]>>>24-8*(a%4)&255)<<16|(e[a+1>>>2]>>>24-8*((a+1)%4)&255)<<8|e[a+2>>>2]>>>24-8*((a+2)%4)&255,g=0;4>g&&a+0.75*g<f;g++)b.push(c.charAt(d>>>6*(3-g)&63));if(e=c.charAt(64))for(;b.length%4;)b.push(e);return b.join('')},parse:function(b){var e=b.length,f=this._map,c=f.charAt(64);c&&(c=b.indexOf(c),-1!=c&&(e=c));for(var c=[],a=0,d=0;d< e;d++)if(d%4){var g=f.indexOf(b.charAt(d-1))<<2*(d%4),h=f.indexOf(b.charAt(d))>>>6-2*(d%4);c[a>>>2]|=(g|h)<<24-8*(a%4);a++}return j.create(c,a)},_map:'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/='}})();
emit( { input: row.input, input_sha256: CryptoJS.SHA256(row.input).toString(CryptoJS.enc.Hex) } );
}"
)

Related

Return inserted id with TypeORM & NestJS raw query: await connection.manager.query(`INSERT INTO

I'm looking to return the id or better yet, all information that was inserted, using a raw query with TypeORM and NestJS. Example as follows:
await connection.manager.query(`INSERT INTO...`)
When assigning the query to a constant and console logging it below, it does not yield any helpful information:
OkPacket {
fieldCount: 0,
affectedRows: 1,
insertId: 0,
serverStatus: 2,
warningCount: 1,
message: '',
protocol41: true,
changedRows: 0
}
As you can see, it returns no pertinent information; the insertId above is obviously incorrect, and it returns this every time, regardless of the actual parameters of the query.
I know with more typical TypeORM queries you can use .return(['name_of_column_you_want_returned']).execute()
and it will return the relevant information just fine. Is there any way to do this with a raw query? Thank you!
tl;dr You're getting the raw mariadb driver response (OkPacket) from the INSERT command, and you'd need a new SELECT query to see the data.
You're using the TypeORM EntityManager, and the docs don't mention a return value. Looking at the source code for query, the return type is any. Since it's a raw query, it probably returns an object based on the type of database you're using rather than having a standard format.
In this case, you're using MariaDb, which returned an OkPacket. Here's the documentation:
https://mariadb.com/kb/en/ok_packet/
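As a minimal sketch of the two-step pattern (the table and column names here are placeholders for illustration, not taken from the question):
// Hypothetical sketch: raw INSERT followed by a separate SELECT.
// "users", "name" and "email" are made-up names.
const ok = await connection.manager.query(
    'INSERT INTO users (name, email) VALUES (?, ?)',
    ['Alice', 'alice@example.com']
);
// ok is the driver's OkPacket; with an AUTO_INCREMENT key, ok.insertId would
// normally hold the generated id. If it is 0, as in the question, read the
// row back by a value you just wrote instead.
const rows = await connection.manager.query(
    'SELECT * FROM users WHERE email = ?',
    ['alice@example.com']
);
console.log(rows[0]);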

InfluxDB 1.8 schema design for industrial application?

I have a node-red-S7PLC link pushing the following data to InfluxDB on a 1.5-second cycle.
msg.payload = {
    name: 'PLCTEST',
    level1_m: msg.payload.a90,       // value payload from PLC passed to InfluxDB
    power1: msg.payload.a93,
    "valvepos_%": msg.payload.a107,
    temp1: msg.payload.a111,
    washer_acidity: msg.payload.a113
    // etc.
};
return msg;
In total there are 130 individual data points, consisting of binary states (alarms, button presses) and measurements (temperature, pressure, flow, ...).
This has been running for a week now as a stress test for DB writes. Writing seems to be fine, but I have noticed that if I go from a 30-minute query window to a 3-hour window for 10 temperature measurements in a Grafana dashboard, the load times start to get annoyingly long, and a 12-hour window is a no-go. I assume this is because all my values are pushed as field keys and field values; without indexes this is straining the database.
The Grafana query inspector gives me 1081 rows per measurement query, so x10 = 10810 rows per dashboard query. But the whole pool InfluxDB has to go through is 130 measurements x 1081 = 140530 rows per 3-hour window.
I would like to get a few pointers on how to optimize the schema. I have the following in mind:
DB: Aplication_nameX
Measurement: Process_metrics,
Tags: Temp,press,flow,%,Level,acidity, Power
Tag_values: CT-xx1...CT-xxn, CP-xx1...CP-xxn, CF-xx1...CF-xxn,....
Fieldkey= Value, fieldvalue= value
Measurement: Alarms_On,
Fieldkey= State, fieldvalue= "true", "false"
Measurement: Binary_ON
Fieldkey: State, fieldvalue= "true", "false"
In Node-RED this would then look something like the following for a few temperatures (I think):
msg.payload = [{
Value: msg.payload.xxx, "value payload from PLC passed to influx"
Value: msg.payload.xxx,
Value: msg.payload.xxx
},
{
Temp:"CT_xx1",
Temp:"CT_xx2",
Temp:"CT_xx2"
}];
return msg;
EDIT: Following Robert's comments.
I read the Influx manuals for a week, plus other samples online, before writing here. Somehow Influx is just different and unique enough from the normal SQL mindset that I do find this unusually difficult. But I did have a few moments of clarity over the weekend.
I think the following would be more appropriate.
DB: Station_name
measurements: Process_metrics,Alarms, Binary.
Tags: "SI_metric"
Values= "Temperature", "Pressure" etc.
Fieldkey: "proces_position"= CT/P/F_xxx.
values= process_values
This should prevent the cardinality going bonkers vs. my original thought.
I think alarms and binary can be left as fieldkey/fieldvalue only and separating them to own measurements should give enough filtering. These are also logged only at state change thus a lot less input to the database than analogs at 1s cycle.
Following my original Node-RED flow code, this would translate into the following batch output function:
msg.payload = [
    {
        measurement: "Process_metrics",
        fields: {
            CT_xx1: msg.payload.xxx,
            CT_xx2: msg.payload.xxx,
            CT_xx3: msg.payload.xxx
        },
        tags: {
            metric: "temperature"
        }
    },
    {
        measurement: "Process_metrics",
        fields: {
            CP_xx1: msg.payload.xxx,
            CP_xx2: msg.payload.xxx,
            CP_xx3: msg.payload.xxx
        },
        tags: {
            metric: "pressure"
        }
    },
    {
        measurement: "Process_metrics",
        fields: {
            CF_xx1: msg.payload.xxx,
            CF_xx2: msg.payload.xxx,
            CF_xx3: msg.payload.xxx
        },
        tags: {
            metric: "flow"
        }
    },
    {
        measurement: "Process_metrics",
        fields: {
            AP_xx1: msg.payload.xxx,
            AP_xx2: msg.payload.xxx,
            AP_xx3: msg.payload.xxx
        },
        tags: {
            metric: "Pumps"
        }
    },
    {
        measurement: "Binary_states",
        fields: {
            Binary1: msg.payload.xxx,
            Binary2: msg.payload.xxx,
            Binary3: msg.payload.xxx
        }
    },
    {
        measurement: "Alarms",
        fields: {
            Alarm1: msg.payload.xxx,
            Alarm2: msg.payload.xxx,
            Alarm3: msg.payload.xxx
        }
    }
];
return msg;
EDIT 2:
Final thoughts after testing my above idea and refining it further.
My second idea did not work as intended. The final step with Grafana variables did not work, because the process data had the needed information in fields rather than tags. This made the Grafana side annoying, with regex queries needed to pull the PLC tag names out of fields and link them to Grafana variable drop-down lists, which again meant running resource-intensive field queries.
I stumbled on a blog post about how to get your mind straight with a TSDB, and the above idea is still too SQL-like an approach to the data for a TSDB. I refined the DB structure some more and seem to have found a compromise between coding time at the different steps (PLC -> NodeRed -> InfluxDB -> Grafana) and query load on the database. RAM usage went from about 1 GB when stressing with writes and queries to 100-300 MB in normal usage testing.
Currently in testing:
A Python script to crunch the PLC-side tags and descriptions from CSV into a copy-pasteable format for Node-RED. Example for extracting the temperature measurements from the CSV and formatting them for Node-RED:
import pandas as pd

file1 = r'C:\\Users\\....pandastestcsv.csv'
df1 = pd.read_csv(file1, sep=';')

# Keep only the rows whose POS column contains a temperature position (CT)
dfCT = df1[df1['POS'].str.contains('CT', regex=False, na=False)]

def my_functionCT(x, y):
    # Build one Node-RED/InfluxDB point definition per CSV row
    return '{measurement:"temperature",fields:{value:msg.payload.' + x + ',},tags:{CT:"' + y + '",},},'

result = [my_functionCT(x, y) for x, y in zip(dfCT['ID'], dfCT['POS'])]
print('\n'.join(result))
The output of this is all the temperature measurements (CT) from the CSV, each in the form {measurement:"temperature",fields:{value:msg.payload.a1,},tags:{CT:"tag description with process position CT_310",},},
This list can be copy-pasted into the Node-RED data link payload to InfluxDB.
InfluxDB:
database: PLCTEST
Measurements: temperature, pressure, flow, pumps, valves, Alarms, on_off....
tag-keys: CT,CP,CF,misc_mes....
tag-field: "PLC description of the tag"
Field-key: value
field-value: "process measurement value from PLC payload"
This keeps the cardinality per measurement in check within reason, and queries can be better targeted to the relevant data without running through the whole DB. RAM and CPU loads are now minor, and jumping from a 1 h to a 12 h query in Grafana loads in seconds without lock-ups.
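To make that final layout concrete, a minimal Node-RED function-node sketch in the same batch format as above might look like this (a sketch only: the tag descriptions and the placeholder payload fields are illustrative, not real PLC addresses):
// One point per metric type, the PLC position/description as a low-cardinality tag,
// and a single numeric field named "value".
msg.payload = [
    {
        measurement: "temperature",
        tags: { CT: "tag description with process position CT_310" },
        fields: { value: msg.payload.xxx }
    },
    {
        measurement: "pressure",
        tags: { CP: "tag description with process position CP_120" },
        fields: { value: msg.payload.xxx }
    }
];
return msg;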
When designing an InfluxDB measurement schema we need to be very careful about selecting tags and fields.
Each tag value creates a separate series, and as the number of tag values increases, the memory requirement of the InfluxDB server grows rapidly.
From the description of the measurement given in the question, I can see that you are keeping high-cardinality values like temperature, pressure etc. as tag values. These values should be kept as fields instead.
If you keep these values as tags, InfluxDB will index them for faster search, and a separate series will be created for each tag value. As the number of tag values increases, the number of series also increases, eventually leading to out-of-memory errors.
Quoting from the InfluxDB documentation:
Tags containing highly variable information like UUIDs, hashes, and
random strings lead to a large number of series in the database, also
known as high series cardinality. High series cardinality is a primary
driver of high memory usage for many database workloads.
Please refer to the InfluxDB documentation on designing a schema for more details:
https://docs.influxdata.com/influxdb/v1.8/concepts/schema_and_data_layout/

MongoDB using $and with slice and a match

I'm using @Query from the Spring Data package and I want to query on the last element of an array in a document.
For example the data structure could be like this:
{
name : 'John',
scores: [10, 12, 14, 16]
},
{
name : 'Mary',
scores: [78, 20, 14]
},
So I've built a query; however, it fails with "error message 'unknown operator: $slice' on server".
The $slice part of the query, when run separately, is fine:
db.getCollection('users').find({}, {scores: { $slice: -1 } })
However, as soon as I combine it with a more complex check, it gives the error mentioned above.
db.getCollection('users').find({"$and":[{ } , {"scores" : { "$slice" : -1}} ,{"scores": "16"}]})
This query should return the list of users whose last score is 16; in my example John would be returned but not Mary.
I've put it into a standard mongo query (to debug things), but ideally I need it to go into a Spring Data @Query construct; they should be fairly similar.
Is there any way of doing this without resorting to hand-cranked Java calls? I don't see much documentation for @Query, other than that it takes standard queries.
As commented on the linked post, that refers to aggregate; how does that work with @Query? Also, one of the main answers there uses $where, which is inefficient.
The general way forward with this problem is unfortunately to change the data. Although @Veeram's response is correct, it means you will not hit indexes. That becomes an issue once you have very large data sets, and you will see ever-increasing response times. $where and $arrayElemAt cannot help you here: they have to evaluate the data per document, and that means a full collection scan. We analysed several queries with these constructs and they all involved a "COLLSCAN".
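For reference, the index-unfriendly workaround looks something like this in the shell (a sketch of the aggregation approach referred to above; the last element is computed per document, so no index on scores can be used):
// Works on MongoDB 3.4+, but the computed field forces a full collection scan.
db.getCollection('users').aggregate([
    { $addFields: { lastScore: { $arrayElemAt: [ "$scores", -1 ] } } },
    { $match: { lastScore: 16 } }
])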
The solution is ideally to create a field that contains the last item, for instance:
{
name : 'John',
scores: [10, 12, 14, 16],
lastScore: 16
},
{
name : 'Mary',
scores: [78, 20, 14],
lastScore: 14
}
You could create a listener to maintain this as follows:
@Component
public class ScoreListener extends AbstractMongoEventListener<Scores>
You then get the ability to sniff the data and make any updates:
@Override
public void onBeforeConvert(BeforeConvertEvent<Scores> event) {
// process any score and set lastScore
}
Don't forget to update your indexes (!):
@CompoundIndex(name = "lastScore", def = "{"
+ "'lastScore': 1"
+ " }")
Although this has the disadvantage of slightly duplicating data, in current Mongo (3.4) this really is the only way of doing it while also having indexes used in the search. The speed differences were dramatic: from nearly a minute of response time down to milliseconds.
In Mongo 3.6 there may be better ways of doing this, but we are fixed on this version, so this has to be our solution.

Error using metrics listed in Google's Metrics and dimensions

I am using this code to query the API:
function getResults(&$analytics, $profileId) {
// Calls the Core Reporting API and queries for the number of sessions
// for the last 30 days.
return $analytics->data_ga->get(
'ga:' . $profileId,
'30daysAgo',
'today',
'ga:sessionCount,ga:sessionDurationBucket,ga:users,ga:percentNewSessions,ga:bounceRate,ga:pageviews');
}
I get this error upon executing the code:
Fatal error: Uncaught exception 'Google_Service_Exception' with
message 'Error calling GET
https://www.googleapis.com/analytics/v3/data/ga?ids=ga%3A114460017&start-date=30daysAgo&end-date=today&metrics=ga%3AsessionCount%2Cga%3AsessionDurationBucket%2Cga%3Ausers%2Cga%3ApercentNewSessions%2Cga%3AbounceRate%2Cga%3Apageviews:
(400) Unknown metric(s): ga:sessionCount, ga:sessionDurationBucket
Has anyone ever experienced this? I do not understand why it does not recognize those metrics when they are listed here:
https://developers.google.com/analytics/devguides/reporting/core/dimsmets#view=detail&group=user&jump=ga_sessioncount
If you look more closely into that documentation you will see that session count is not a metric, it's a dimension. The reason is that you want to be able to do breakdowns of metrics by session count (e.g. "show avg. duration of sessions for users with 3 sessions") and for that you need categorical data.
Even if you overlook the (not particularly distinctive) column heading in the table of contents (ga:sessionCount is in the "dimensions"-column) the fact that the datatype is a string would be a dead giveaway. Metrics are always numbers. Dimensions are always strings, even if they sometimes look like numbers.
Same goes for ga:sessionDurationBucket.
Look at this example from the documentation to see how dimensions are passed into the query via an array that holds optional parameters:
private function queryCoreReportingApi() {
$optParams = array(
'dimensions' => 'ga:source,ga:keyword',
'sort' => '-ga:sessions,ga:source',
'filters' => 'ga:medium==organic',
'max-results' => '25');
return $service->data_ga->get(
TABLE_ID,
'2010-01-01',
'2010-01-15',
'ga:sessions',
$optParams);
}
You'd need to construct a similar $optParams array:
$optParams = array(
    'dimensions' => 'ga:sessionCount,ga:sessionDurationBucket'
);
and pass it to your query:
return $analytics->data_ga->get(
    'ga:' . $profileId,
    '30daysAgo',
    'today',
    'ga:users,ga:percentNewSessions,ga:bounceRate,ga:pageviews',
    $optParams);
}
and remove the dimensions from the list of metrics.
Btw. Google has a wonderful documentation page on the difference between dimensions and metrics and how they are used in the reports.

Pig - comparing two similar statement : one working, the other not

I am beginning to get really annoyed with Pig: the language does not seem very stable, the documentation is poor, there are not that many examples on the internet, and any small change in the code can give radically different results, from failure to the expected result.... Here is another instance of this last theme:
grunt> describe actions_by_unite;
actions_by_unite: {
group: chararray,
nb_actions_by_unite_and_action: {
(
unite: chararray,
lib_type_action: chararray,
double
)
}
}
-- works :
z = foreach actions_by_unite {
generate group, SUM(nb_actions_by_unite_and_action.$2);};
-- doesn't work :
z = foreach actions_by_unite {
x = SUM(nb_actions_by_unite_and_action.$2);
generate group, x;};
-- error :
2015-05-08 14:43:44,712 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 107, column 16> Invalid scalar projection: x : A column needs to be projected from a relation for it to be used as a scalar
Details at logfile: /private/tmp/pig-err.log
And so :
-- doesn't work neither:
z = foreach actions_by_unite { x = SUM(nb_actions_by_unite_and_action.$2);
generate group, x.$0;};
--error :
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (AC,EMAIL,1.1186133550060547E-4), 2nd :(AC,VISITE,6.25755280560356E-4)
at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:120)
Does anyone know why?
Do you have some nice blogs / resources to suggest, with examples, for mastering this language?
I have the O'Reilly book, but it seems a bit old; I also have the 'Agile Data Science' and 'Hadoop: The Definitive Guide' books with some examples in them. I found this page really interesting: https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/
Any good videos on Coursera or other inputs? Do you guys also have problems with this language, or am I simply dumb?....
That particular behaviour is not Pig being unstable; it's because what you are trying to do is correct in the first approach but wrong in the others.
When you do a GROUP BY, you get for each group a bag that contains X tuples. Inside a nested foreach you have, on each iteration, one group with its bag, which means that a SUM in there yields a scalar value: the sum of the bag you are currently working with. Apache Pig does not work with scalars, it works with relations, so you cannot assign a scalar value to an alias, which is exactly what you are doing in the second and third approaches.
Therefore, the error comes from attempting something like:
A = foreach B {
x = SUM(bag.$0);
}
However, if you want to emit a scalar for each of the groups, you can perfectly well do so as long as you never assign the scalar to an alias. That is why it works if you compute the SUM directly in the GENERATE at the end of the foreach: for each group you return a tuple with two values, the group and the sum.