BigQuery UDF memory exceeded error on multiple rows but works fine on single row - google-bigquery

I'm writing a UDF to process Google Analytics data, and getting the "UDF out of memory" error message when I try to process multiple rows. I downloaded the raw data and found the largest record and tried running my UDF query on that, with success. Some of the rows have up to 500 nested hits, and the size of the hit record (by far the largest component of each row of the raw GA data) does seem to have an effect on how many rows I can process before getting the error.
For example, the query
select
  user.ga_user_id,
  ga_session_id,
  ...
from
  temp_ga_processing(
    select
      fullVisitorId,
      visitNumber,
      ...
    from [79689075.ga_sessions_20160201] limit 100)
returns the error, but
from [79689075.ga_sessions_20160201] where totals.hits = 500 limit 1)
does not.
I was under the impression that any memory limitations were per-row? I've tried several techniques, such as setting row = null; before emit(return_dict); (where return_dict is the processed data) but to no avail.
The UDF itself doesn't do anything fancy; I'd paste it here but it's ~45 kB in length. It essentially does a bunch of things along the lines of:
function temp_ga_processing(row, emit) {
  var topic_id = -1;
  var hit_numbers = [];
  var first_page_load_hits = [];
  var return_dict = {};
  return_dict["user"] = {};
  return_dict["user"]["ga_user_id"] = row.fullVisitorId;
  return_dict["ga_session_id"] = row.fullVisitorId.concat("-".concat(row.visitNumber));
  // Build a processed record for each nested hit in the row.
  for (var i = 0; i < row.hits.length; i++) {
    var hit_dict = {};
    hit_dict["page"] = {};
    hit_dict["time"] = row.hits[i].time;
    hit_dict["type"] = row.hits[i].type;
    hit_dict["page"]["engaged_10s"] = false;
    hit_dict["page"]["engaged_30s"] = false;
    hit_dict["page"]["engaged_60s"] = false;
    var add_hit = true;
    for (var j = 0; j < row.hits[i].customMetrics.length; j++) {
      if (row.hits[i].customDimensions[j] != null) {
        if (row.hits[i].customMetrics[j]["index"] == 3) {
          var metrics = {"video_play_time": row.hits[i].customMetrics[j]["value"]};
          hit_dict["metrics"] = metrics;
          metrics = null;
          row.hits[i].customDimensions[j] = null;
        }
      }
    }
    hit_dict["topic"] = {};
    hit_dict["doctor"] = {};
    hit_dict["doctor_location"] = {};
    hit_dict["content"] = {};
    if (row.hits[i].customDimensions != null) {
      for (var j = 0; j < row.hits[i].customDimensions.length; j++) {
        if (row.hits[i].customDimensions[j] != null) {
          if (row.hits[i].customDimensions[j]["index"] == 1) {
            hit_dict["topic"] = {"name": row.hits[i].customDimensions[j]["value"]};
            row.hits[i].customDimensions[j] = null;
            continue;
          }
          if (row.hits[i].customDimensions[j]["index"] == 3) {
            if (row.hits[i].customDimensions[j]["value"].search("doctor") > -1) {
              return_dict["logged_in_as_doctor"] = true;
            }
          }
          // and so on...
        }
      }
    }
    if (row.hits[i]["eventInfo"]["eventCategory"] == "page load time" &&
        row.hits[i]["eventInfo"]["eventLabel"].search("OUTLIER") == -1) {
      var elre = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
      if (elre != null) {
        if (parseInt(elre[0].split(":")[1]) <= 60000) {
          first_page_load_hits.push(parseFloat(row.hits[i].hitNumber));
          if (hit_dict["page"]["page_load"] == null) {
            hit_dict["page"]["page_load"] = {};
          }
          hit_dict["page"]["page_load"]["sample"] = 1;
          var page_load_time_re = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
          if (page_load_time_re != null) {
            hit_dict["page"]["page_load"]["page_load_time"] = parseFloat(page_load_time_re[0].split(':')[1]) / 1000;
          }
        }
        // and so on...
      }
    }
  }
  row = null;
  emit(return_dict);
}
The job ID is realself-main:bquijob_4c30bd3d_152fbfcd7fd

Update Aug 2016: We have pushed out an update that will allow the JavaScript worker to use twice as much RAM. We will continue to monitor jobs that have failed with JS OOM to see if more increases are necessary; in the meantime, please let us know if you have further jobs failing with OOM. Thanks!
Update: This issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generates a data segment that is bigger than our limits, but this was only happening when the UDF runs over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.
#Grayson - I was able to run your job over the entire 20160201 table successfully; the query takes 1-2 minutes to execute. Could you please verify that this works on your side?
We've gotten a few reports of similar issues that seem related to # rows processed. I'm sorry for the trouble; I'll be doing some profiling on our JavaScript runtime to try to find if and where memory is being leaked. Stay tuned for the analysis.
In the meantime, if you're able to isolate any specific rows that cause the error, that would also be very helpful.
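One rough way to narrow that down (a sketch that reuses the table and UDF names from the question; the filter values are only placeholders) is to bisect on something that already seems to correlate with the failure, such as totals.hits, and shrink the limit until the error reproduces on a handful of rows:
select
  user.ga_user_id,
  ga_session_id,
  ...
from
  temp_ga_processing(
    select
      fullVisitorId,
      visitNumber,
      ...
    from [79689075.ga_sessions_20160201]
    where totals.hits > 400
    limit 10)
Tightening or loosening the where clause and the limit should isolate a few specific fullVisitorId values that can then be shared.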

A UDF will fail on anything but very small datasets if it has a lot of if/then levels, such as:
if () {
.... if() {
.........if () {
etc
We had to track down and remove the deepest if/then statement.
But that is not enough. In addition, when you pass the data into the UDF, run a "GROUP EACH BY" on all the variables. This will force BQ to send the output to multiple "workers". Otherwise it will also fail.
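A rough sketch of what that looks like, reusing the UDF and table names from the question above (in a real query, every column passed into the UDF would appear in both the inner select list and the GROUP EACH BY list):
select
  user.ga_user_id,
  ga_session_id
from
  temp_ga_processing(
    select
      fullVisitorId,
      visitNumber
    from [79689075.ga_sessions_20160201]
    group each by fullVisitorId, visitNumber)
The GROUP EACH BY in the inner query is what forces the shuffle, so the rows feeding the UDF are spread across multiple workers instead of piling onto one.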
I've wasted 3 days of my life on this annoying bug. Argh.

I love the concept of parsing my logs in BigQuery, but I've got the same problem; I get
Error: Resources exceeded during query execution.
The Job Id is bigquery-looker:bquijob_260be029_153dd96cfdb, if that at all helps.
I wrote a very basic parser that does a simple match and returns rows. It works just fine on a 10K row data set, but I run out of resources when trying to run it against a 3M row logfile.
Any suggestions for a work around?
Here is the javascript code.
function parseLogRow(row, emit) {
  // Reassemble the raw log line from up to three input columns.
  var r = (row.logrow ? row.logrow : "") +
          (typeof row.l2 !== "undefined" ? " " + row.l2 : "") +
          (row.l3 ? " " + row.l3 : "");
  var ts = null;
  var category = null;
  var user = null;
  var message = null;
  var db = null;
  var found = false;
  if (r) {
    var m = r.match(/^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] :: (.*)/);
    if (m) {
      ts = new Date(m[1]) / 1000;
      category = m[3] || null;
      user = m[4] || null;
      db = m[5] || null;
      message = m[6] || null;
      found = true;
    } else {
      message = r;
      found = false;
    }
  }
  emit({
    ts: ts,
    category: category,
    user: user,
    db: db,
    message: message,
    found: found
  });
}
bigquery.defineFunction(
  'parseLogRow',              // Name of the function exported to SQL
  ['logrow', 'l2', 'l3'],     // Names of input columns
  [                           // Output schema
    {'name': 'ts', 'type': 'timestamp'},
    {'name': 'category', 'type': 'string'},
    {'name': 'user', 'type': 'string'},
    {'name': 'db', 'type': 'string'},
    {'name': 'message', 'type': 'string'},
    {'name': 'found', 'type': 'boolean'}
  ],
  parseLogRow                 // Reference to the JavaScript UDF
);

Related

Filter cached sqlJdbs query in Pentaho CE

I use a sqlJdbs query as a data provider for my CCC controls. I use a geospatial request in my query, which is why I cache my results (Cache=True); otherwise the request takes a long time.
It works fine. However, I have to use parameters in my query to filter the resulting rows:
SELECT ...
FROM ...
WHERE someField IN (${aoi_param})
Is there some way to cache the full set of rows and then apply the WHERE to the cached results, without rebuilding the cache for each set of values in ${aoi_param}?
What is the best practice?
So, I am not really sure that it is the best practice, but I solved my problem this way:
I included aoi_param in the Listeners and Parameters of my chart control.
Then I filtered the data set in Post Fetch:
function f(data) {
  var _aoi_param = this.dashboard.getParameterValue('${p:aoi_param}');
  function isInArray(myValue, myArray) {
    var arrayLength = myArray.length;
    for (var i = 0; i < arrayLength; i++) {
      if (myValue == myArray[i]) return true;
    }
    return false;
  }
  function getFiltered(cdaData, filterArray) {
    var allCdaData = cdaData;
    cdaData = {
      metadata: allCdaData.metadata,
      resultset: allCdaData.resultset.filter(function(row) {
        // 2nd column is an AOI id in my dataset
        return isInArray(row[2], filterArray);
      })
    };
    return cdaData;
  }
  var dataFiltered = getFiltered(data, _aoi_param);
  return dataFiltered;
}
Finally, I excluded WHERE someField IN (${aoi_param}) from the query of my sql over sqlJdbc component.

AJAX filled pulldown shows 5 identical options when there are 5 different options in database

Scratching my head here. I've got a pulldown, and if I query it in the SQL Server Manager query window I get 5 different values (these are sample points for a water system).
However, when the pulldown loads, it shows 5 options that are all the first value. Can someone see something I can't?
I narrowed it down to the code below because I hovered over "results", the final step in my Controller's code, and it showed 5 items all with the same value:
else if ((sampletype == "P") || (sampletype == "T") || (sampletype == "C") || (sampletype == "A"))
{
    var SamplePoints = (from c in _db.tblPWS_WSF_SPID_ISN_Lookup
                        where c.PWS == id && c.WSFStateCode.Substring(0, 1) == "S"
                        select c).ToList();
    if (SamplePoints.Any())
    {
        var listItemsBig = SamplePoints.Select(p => new SelectListItem
        {
            Selected = false,
            Text = p.WSFStateCode.ToString() + ":::" + p.SamplePointID.ToString(),
            Value = p.WSFStateCode.ToString()
        }).ToList();
        results = new JsonResult { Data = listItemsBig };
    }
}
return results;
}
I have had a similar problem in NHibernate; it was caused by how I defined my primary keys/foreign keys in the ORM, leading to a bad join and duplicate values.

Rally API 2 query historical velocity

I am doing some processing across iterations in a release and I want to find out what the team's velocity was at that point in time. Is there any way to use the Lookback API or otherwise get the information for that period?
i.e. the Rally-generated velocity at that time, or manually calculating the last 10 or all-time velocity measures?
So, based on the responses below I have ended up with this code:
_getVelocity: function() {
  this.logger.log("_getVelocity");
  var me = this;
  var deferred = Ext.create('Deft.Deferred');
  Ext.Array.each(this.iterations, function(iteration) {
    iteration.PlanEstimate = 1;
    me.logger.log("Fetching velocity for iteration", iteration.Name);
    var start_date_iso = Rally.util.DateTime.toIsoString(iteration.StartDate);
    var end_date_iso = Rally.util.DateTime.toIsoString(iteration.EndDate);
    var type_filter = Ext.create('Rally.data.lookback.QueryFilter', {
      property: '_TypeHierarchy',
      operator: 'in',
      value: this.show_types
    });
    var date_filter = Ext.create('Rally.data.lookback.QueryFilter', {
      property: '_ValidFrom',
      operator: '>=',
      value: start_date_iso
    }).and(Ext.create('Rally.data.lookback.QueryFilter', {
      property: '_ValidFrom',
      operator: '<=',
      value: end_date_iso
    }));
    var filters = type_filter.and(date_filter);
    me.logger.log("Filter ", filters.toObject());
    Ext.create('Rally.data.lookback.SnapshotStore', {
      autoLoad: true,
      filters: filters,
      fetch: ['FormattedID', 'PlanEstimate', 'ScheduleState'],
      hydrate: ['ScheduleState'],
      listeners: {
        scope: this,
        load: function(store, it_snaps, successful) {
          if (!successful) {
            deferred.reject("There was a problem retrieving changes");
          } else {
            me.logger.log(" Back for ", it_snaps.length, it_snaps);
            deferred.resolve(it_snaps);
          }
        }
      }
    });
  });
  deferred.resolve([]);
  return deferred;
},
The shape of this code, the filters, etc. is lifted from another function in the same app that IS working; however, this one is NOT working. I get the following errors:
Uncaught TypeError: Cannot read property 'Errors' of null
GET https://rally1.rallydev.com/analytics/v2.0/service/rally/workspace/99052282…ScheduleState%22%5D&pagesize=20000&start=0&jsonp=Ext.data.JsonP.callback49 400 (Bad Request)
Since the Lookback API gives historic data, you may query stories at a specific point in time and get the number of points completed at that time. There is no Rally-generated velocity, so this has to be accessed and summed up manually. For example, I have an iteration that started on August 7 and ended on August 14, but if I want data from August 10 I use:
https://rally1.rallydev.com/analytics/v2.0/service/rally/workspace/1234/artifact/snapshot/query.js?find={"Iteration":5678,"_TypeHierarchy":"HierarchicalRequirement",_ValidFrom:{$gte:"2014-08-10",$lte:"2014-08-14"}}&fields=['FormattedID','ScheduleState','PlanEstimate']&hydrate=['ScheduleState']
UPDATE: As far as the code you posted, change
var start_date_iso = Rally.util.DateTime.toIsoString(iteration.StartDate)
var end_date_iso = Rally.util.DateTime.toIsoString(iteration.EndDate);
to
var start_date_iso = Rally.util.DateTime.toIsoString(iteration.get('StartDate'),true);
var end_date_iso = Rally.util.DateTime.toIsoString(iteration.get('EndDate'),true);
I replicated the bad request error with your syntax, and after changing it to this format it worked: iteration.get('StartDate'),true
App source code is in this repo.

Exceeded maximum execution time on Google App script with Google Big Query

How can I extend the execution time within my code below? Essentially, I use Google Apps Script to query data from our BigQuery database and export the data to a Google spreadsheet.
The following is my code:
function Weekly_Metric() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheetName = "Budget";
  var sheet = ss.getSheetByName(sheetName);
  ss.setActiveSheet(sheet);
  var sql = ' bigqueryscript ';
  var results = GSReport.runQueryAsync(sql);
  var resultsValues = GSReport.parseBigQueryAPIResponse(results);
  sheet.clear();
  ss.appendRow(["Label1", "Label2", "Label3"]);
  for (var i = 0; i < resultsValues.length; i++) {
    ss.appendRow(resultsValues[i]);
  }
}
Always reduce the number of calls to Google Apps Script services as much as you can.
In this case, the loop containing appendRow() can be replaced with JavaScript array operations and a single call to setValues().
...
sheet.clear();
var data = [];
data.push(["Label1", "Label2", "Label3"]);
for (var i = 0; i < resultsValues.length; i++) {
  data.push(resultsValues[i]);
}
// Write everything in one call; the (row, column, numRows, numColumns) form of getRange is a Sheet method.
sheet.getRange(1, 1, data.length, data[0].length).setValues(data);
...
Alternatively, if resultsValues is an array of rows already, you only need to add the labels:
...
sheet.clear();
resultsValues.unshift(["Label1", "Label2", "Label3"]);
sheet.getRange(1, 1, resultsValues.length, resultsValues[0].length).setValues(resultsValues);
...
If that doesn't do the trick, then you should look at your GSReport object's methods.

websql use select in to get rows from an array

In WebSQL we can request a certain row like this:
tx.executeSql('SELECT * FROM tblSettings where id = ?', [id], function(tx, rs){
// do stuff with the resultset.
},
function errorHandler(tx, e){
// do something upon error.
console.warn('SQL Error: ', e);
});
However, I know regular SQL and figured I should be able to request
var arr = [1, 2, 3];
tx.executeSql('SELECT * FROM tblSettings where id in (?)', [arr], function(tx, rs){
// do stuff with the resultset.
},
function errorHandler(tx, e){
// do something upon error.
console.warn('SQL Error: ', e);
});
But that gives us no results; the result is always empty. If I changed [arr] to just arr, the SQL would get a variable number of parameters, so I figured it should be [arr]; otherwise it would require us to add a dynamic number of question marks (as many as there are ids in the array).
So, can anyone see what I'm doing wrong?
Apparently, there is no other solution than to manually add a question mark for every item in your array.
This is actually in the spec on w3.org:
var q = "";
for each (var i in labels)
q += (q == "" ? "" : ", ") + "?";
// later to be used as such:
t.executeSql('SELECT id FROM docs WHERE label IN (' + q + ')', labels, function (t, d) {
// do stuff with result...
});
more info here: http://www.w3.org/TR/webdatabase/#introduction (at the end of the introduction)
However, at the moment I have created a helper function that builds such a string for me.
It might be better than the above, might not; I haven't done any performance testing.
This is what I use now:
var createParamString = function(arr) {
  return _(arr).map(function() { return "?"; }).join(',');
};
// when called like this:
createParamString([1, 2, 3, 4, 5]); // >> returns ?,?,?,?,?
This, however, makes use of the Underscore.js library we have in our project.
Good answer. It was interesting to read an explanation in the official documentation.
I see this question was answered in 2012. I tried it in Chrome 37 exactly as recommended, and Chrome complained about the statement (the screenshots of the input data and of the error are not reproduced here).
So it accepts as many question marks as there are input parameters. (Note that although an array is passed, it is treated as one parameter.)
Eventually I came up with this solution:
var activeItemIds = [1, 2, 3];
var q = "";
for (var i = 0; i < activeItemIds.length; i++) {
  q += '"' + activeItemIds[i] + '", ';
}
q = q.substring(0, q.length - 2);
var query = 'SELECT "id" FROM "products" WHERE "id" IN (' + q + ')';
_db.transaction(function (tx) {
  tx.executeSql(query, [], function (tx, results1) {
    console.log(results1);
    debugger;
  }, function (a, b) {
    console.warn(a);
    console.warn(b);
  });
});