Spark Task Serialization error during broadcast within a for loop

I am trying to broadcast a variable within a for loop in Spark. During this process, Spark throws a task not serializable error. If the same variable is broadcast outside the for loop, there is no error. Below is the code snippet that throws the error. Any help is appreciated.
var Final = computedRDD.filter(x => x.Id == uniqueKey(0))
for (partId <- uniqueKey) {
  val FinalBroadcast = sc.broadcast(Final.collect)
  val computeNew = computedRDD.filter(x => x.partId == partId).repartition(executors).mapPartitions(performFinalPass(FinalBroadcast))
  computeNew.collect.forall(x => Final.add(x))
}
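For what it's worth, the usual culprit with this pattern is not the broadcast itself but the closure passed to mapPartitions: if performFinalPass is a method on a driver-side class that also holds the SparkContext (or other non-serializable state), the whole enclosing instance gets dragged into the task and Spark reports it as not serializable. Below is a minimal sketch of a closure-safe layout; the Record case class and the curried shape of performFinalPass are assumptions for illustration, since the question does not show them.
import org.apache.spark.broadcast.Broadcast

// Assumed record type; the real element type of computedRDD is not shown in the question.
case class Record(Id: String, partId: Int)

// Keeping the per-partition function in a standalone serializable object avoids
// capturing the driver-side class (and its SparkContext) in the task closure.
object FinalPass extends Serializable {
  def performFinalPass(finalBroadcast: Broadcast[Array[Record]])
                      (iter: Iterator[Record]): Iterator[Record] = {
    val lookup = finalBroadcast.value  // read the broadcast value inside the task
    iter                               // ... per-partition work against lookup goes here ...
  }
}
With that layout the loop body would call .mapPartitions(FinalPass.performFinalPass(FinalBroadcast)), and only the serializable broadcast handle travels with each task.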

Related

Kotlin: function to stop iteration when predicate satisfied

I'm looking for a function in Kotlin which stops an iteration once a predicate is fulfilled:
val services = listOf(service1, service2)
...
var res: Result? = null
services.stopIfPredicateFulFilled { service ->
    res = service.doSomething()
    res != null
}
While this example is not really nice, since res is overwritten in each iteration, I hope the intention is clear.
forEach doesn't do the job the way I expect it to be done, so I was wondering whether there is anything else.
You can use the functions find { ... } and firstOrNull { ... } (they are equivalent, just named differently). They return the first element satisfying the predicate, or null if there is none, and stop iterating as soon as a match is found.
services.find { service ->
    res = service.doSomething()
    res != null
}

How to pass a map into a UDF in Spark

Here is my problem: I have a map of type Map[Array[String],String], and I want to pass it into a UDF.
Here is my UDF:
def lookup(lookupMap: Map[Array[String],String]) =
  udf((input: Array[String]) => lookupMap.lift(input))
And here is my Map variable:
val srdd = df.rdd.map { row => (
  Array(row.getString(1), row.getString(5), row.getString(8)).map(_.toString),
  row.getString(7)
)}
Here is how I call the function:
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c","d"))
I first got an error about an immutable array, so I changed my array into an immutable type; then I got an error about a type mismatch. I googled a bit, and apparently I can't pass a non-Column type directly into a UDF. Can somebody help? Kudos.
Update: So I did convert everything to a wrapped array. Here is what I did:
val srdd = df.rdd.map{row => (WrappedArray.make[String](Array(row.getString(1),row.getString(5),row.getString(8))),row.getString(7))}
val lookupMap = srdd.collectAsMap()
def lookup(lookupMap:Map[collection.mutable.WrappedArray[String],String]) = udf((input:collection.mutable.WrappedArray[String]) => lookupMap.lift(input))
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c",$"d"))
Now I am having an error like this:
required: Map[scala.collection.mutable.WrappedArray[String],String]
-ksh: Map[scala.collection.mutable.WrappedArray[String],String]: not found [No such file or directory]
I tried to do something like this:
val m = collection.immutable.Map(1->"one",2->"Two")
val n = collection.mutable.Map(m.toSeq: _*)
but then I just got the column type error again.
First, you have to pass a Column as an argument to the UDF. Since you want this argument to be an array, you should use the array function in org.apache.spark.sql.functions, which creates an array Column from a series of other Columns. So the UDF call would be:
lookup(lookupMap)(array($"b",$"c",$"d"))
Now, since array columns are deserialized into mutable.WrappedArray, in order for the map lookup to succeed you'd best make sure that's the type used by your UDF:
def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
  udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))
So altogether:
import spark.implicits._
import org.apache.spark.sql.functions._
// Create an RDD[(mutable.WrappedArray[String], String)]:
val srdd = df.rdd.map { row: Row => (
  mutable.WrappedArray.make[String](Array(row.getString(1), row.getString(5), row.getString(8))),
  row.getString(7)
)}
// collect it into a map (I assume this is what you're doing with srdd...)
val lookupMap: Map[mutable.WrappedArray[String], String] = srdd.collectAsMap()
def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
  udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))
val combinedDF = dftemp.withColumn("a",lookup(lookupMap)(array($"b",$"c",$"d")))
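As a side note (not part of the original answer): capturing lookupMap directly in the UDF closure ships a copy of the whole map with every task. If the map is large, broadcasting it once is usually cheaper. A sketch along the same lines, assuming the same spark session, dftemp and lookupMap as above:
import scala.collection.mutable
import org.apache.spark.sql.functions._

// Broadcast the driver-side map once and read it from inside the UDF.
val lookupBc = spark.sparkContext.broadcast(lookupMap)
val lookupViaBroadcast = udf((input: mutable.WrappedArray[String]) => lookupBc.value.get(input))
val combinedViaBroadcast = dftemp.withColumn("a", lookupViaBroadcast(array($"b", $"c", $"d")))
Returning an Option[String] from the UDF is fine; Spark maps None to a null column value.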
Anna, your code for srdd/lookupMap is of type org.apache.spark.rdd.RDD[(Array[String], String)]:
val srdd = df.rdd.map { row => (
  Array(row.getString(1), row.getString(5), row.getString(8)).map(_.toString),
  row.getString(7)
)}
whereas the lookup method expects a Map as a parameter:
def lookup(lookupMap: Map[Array[String],String]) =
  udf((input: Array[String]) => lookupMap.lift(input))
That is why you are getting the type mismatch error.
To resolve it, first turn srdd from an RDD of tuples into an RDD of Maps, and then convert that RDD into a Map:
val srdd = df.rdd.map { row => Map(
  Array(row.getString(1), row.getString(5), row.getString(8)).map(_.toString) ->
    row.getString(7)
)}
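One caveat to add here (my note, not part of the answer above): even after collapsing this RDD into a single driver-side map, Array[String] keys compare by reference in Scala, so a lookup with a freshly built array will generally miss; that is exactly why the WrappedArray-based answer above keys the map on mutable.WrappedArray. For completeness, a sketch of the collapse step, assuming the srdd defined just above:
// Flatten the RDD of single-entry maps into one driver-side map.
// Note: Array[String] keys use reference equality, so prefer WrappedArray keys
// (as in the answer above) if you need to look entries up by value.
val lookupMap: Map[Array[String], String] =
  srdd.flatMap(_.toSeq).collectAsMap().toMap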

BigQuery UDF memory exceeded error on multiple rows but works fine on single row

I'm writing a UDF to process Google Analytics data, and getting the "UDF out of memory" error message when I try to process multiple rows. I downloaded the raw data and found the largest record and tried running my UDF query on that, with success. Some of the rows have up to 500 nested hits, and the size of the hit record (by far the largest component of each row of the raw GA data) does seem to have an effect on how many rows I can process before getting the error.
For example, the query
select
  user.ga_user_id,
  ga_session_id,
  ...
from
  temp_ga_processing(
    select
      fullVisitorId,
      visitNumber,
      ...
    from [79689075.ga_sessions_20160201] limit 100)
returns the error, but
from [79689075.ga_sessions_20160201] where totals.hits = 500 limit 1)
does not.
I was under the impression that any memory limitations were per-row? I've tried several techniques, such as setting row = null; before emit(return_dict); (where return_dict is the processed data), but to no avail.
The UDF itself doesn't do anything fancy; I'd paste it here but it's ~45 kB in length. It essentially does a bunch of things along the lines of:
function temp_ga_processing(row, emit) {
  topic_id = -1;
  hit_numbers = [];
  first_page_load_hits = [];
  return_dict = {};
  return_dict["user"] = {};
  return_dict["user"]["ga_user_id"] = row.fullVisitorId;
  return_dict["ga_session_id"] = row.fullVisitorId.concat("-".concat(row.visitNumber));
  for(i=0;i<row.hits.length;i++) {
    hit_dict = {};
    hit_dict["page"] = {};
    hit_dict["time"] = row.hits[i].time;
    hit_dict["type"] = row.hits[i].type;
    hit_dict["page"]["engaged_10s"] = false;
    hit_dict["page"]["engaged_30s"] = false;
    hit_dict["page"]["engaged_60s"] = false;
    add_hit = true;
    for(j=0;j<row.hits[i].customMetrics.length;j++) {
      if(row.hits[i].customDimensions[j] != null) {
        if(row.hits[i].customMetrics[j]["index"] == 3) {
          metrics = {"video_play_time": row.hits[i].customMetrics[j]["value"]};
          hit_dict["metrics"] = metrics;
          metrics = null;
          row.hits[i].customDimensions[j] = null;
        }
      }
    }
    hit_dict["topic"] = {};
    hit_dict["doctor"] = {};
    hit_dict["doctor_location"] = {};
    hit_dict["content"] = {};
    if(row.hits[i].customDimensions != null) {
      for(j=0;j<row.hits[i].customDimensions.length;j++) {
        if(row.hits[i].customDimensions[j] != null) {
          if(row.hits[i].customDimensions[j]["index"] == 1) {
            hit_dict["topic"] = {"name": row.hits[i].customDimensions[j]["value"]};
            row.hits[i].customDimensions[j] = null;
            continue;
          }
          if(row.hits[i].customDimensions[j]["index"] == 3) {
            if(row.hits[i].customDimensions[j]["value"].search("doctor") > -1) {
              return_dict["logged_in_as_doctor"] = true;
            }
          }
          // and so on...
        }
      }
    }
    if(row.hits[i]["eventInfo"]["eventCategory"] == "page load time" && row.hits[i]["eventInfo"]["eventLabel"].search("OUTLIER") == -1) {
      elre = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
      if(elre != null) {
        if(parseInt(elre[0].split(":")[1]) <= 60000) {
          first_page_load_hits.push(parseFloat(row.hits[i].hitNumber));
          if(hit_dict["page"]["page_load"] == null) {
            hit_dict["page"]["page_load"] = {};
          }
          hit_dict["page"]["page_load"]["sample"] = 1;
          page_load_time_re = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
          if(page_load_time_re != null) {
            hit_dict["page"]["page_load"]["page_load_time"] = parseFloat(page_load_time_re[0].split(':')[1])/1000;
          }
        }
        // and so on...
      }
    }
  }  // end of loop over row.hits
  row = null;
  emit(return_dict);
}
The job ID is realself-main:bquijob_4c30bd3d_152fbfcd7fd
Update Aug 2016: We have pushed out an update that will allow the JavaScript worker to use twice as much RAM. We will continue to monitor jobs that have failed with JS OOM to see if more increases are necessary; in the meantime, please let us know if you have further jobs failing with OOM. Thanks!
Update: this issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generates a data segment that is bigger than our limits, but this was only happening when the UDF runs over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.
@Grayson - I was able to run your job over the entire 20160201 table successfully; the query takes 1-2 minutes to execute. Could you please verify that this works on your side?
We've gotten a few reports of similar issues that seem related to the number of rows processed. I'm sorry for the trouble; I'll be doing some profiling on our JavaScript runtime to try to find out if and where memory is being leaked. Stay tuned for the analysis.
In the meantime, if you're able to isolate any specific rows that cause the error, that would also be very helpful.
A UDF will fail on anything but very small datasets if it has a lot of if/then levels, such as:
if () {
  if () {
    if () {
      // etc.
We had to track down and remove the deepest if/then statement.
But that is not enough. In addition, when you pass the data into the UDF, run a "GROUP EACH BY" on all the variables. This will force BQ to send the output to multiple "workers". Otherwise it will also fail.
I've wasted 3 days of my life on this annoying bug. Argh.
I love the concept of parsing my logs in BigQuery, but I've got the same problem; I get
Error: Resources exceeded during query execution.
The Job Id is bigquery-looker:bquijob_260be029_153dd96cfdb, if that at all helps.
I wrote a very basic parser that does a simple match and returns rows. It works just fine on a 10K row data set, but I run out of resources when trying to run it against a 3M row logfile.
Any suggestions for a work around?
Here is the javascript code.
function parseLogRow(row, emit) {
  r = (row.logrow ? row.logrow : "") + (typeof row.l2 !== "undefined" ? " " + row.l2 : "") + (row.l3 ? " " + row.l3 : "")
  ts = null
  category = null
  user = null
  message = null
  db = null
  found = false
  if (r) {
    m = r.match(/^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] :: (.*)/)
    if (m) {
      ts = new Date(m[1])/1000
      category = m[3] || null
      user = m[4] || null
      db = m[5] || null
      message = m[6] || null
      found = true
    }
    else {
      message = r
      found = false
    }
  }
  emit({
    ts: ts,
    category: category,
    user: user,
    db: db,
    message: message,
    found: found
  });
}
bigquery.defineFunction(
  'parseLogRow',              // Name of the function exported to SQL
  ['logrow', 'l2', 'l3'],     // Names of input columns
  [
    {'name': 'ts', 'type': 'timestamp'},  // Output schema
    {'name': 'category', 'type': 'string'},
    {'name': 'user', 'type': 'string'},
    {'name': 'db', 'type': 'string'},
    {'name': 'message', 'type': 'string'},
    {'name': 'found', 'type': 'boolean'},
  ],
  parseLogRow                 // Reference to JavaScript UDF
);

NHibernate Linq Query with Projection and Count error

I have the following query:
var list = repositoy.Query<MyClass>.Select(domain => new MyDto()
{
    Id = domain.Id,
    StringComma = string.Join(",", domain.MyList.Select(y => y.Name))
});
That works great:
list.ToList();
But if I try to get the Count I got an exception:
list.Count();
Exception
NHibernate.Hql.Ast.ANTLR.QuerySyntaxException
A recognition error occurred. [.Count[MyDto](.Select[MyClass,MyDto](NHibernate.Linq.NhQueryable`1[MyClass], Quote((domain, ) => (new MyDto()domain.Iddomain.Name.Join(p1, .Select[MyListClass,System.String](domain.MyList, (y, ) => (y.Name), ), ))), ), )]
Any idea how to fix that without using ToList?
The point is that we should NOT call Count() over the projection. So this will work:
var query = repositoy.Query<MyClass>;
var list = query.Select(domain => new MyDto()
{
    Id = domain.Id,
    StringComma = string.Join(",", domain.MyList.Select(y => y.Name))
});
var count = query.Count();
When we use an ICriteria query, the proper syntax would be:
var criteria = ... // criteria, with WHERE, SELECT, ORDER BY...
// HERE cleaned up, just to contain WHERE clause
var totalCountCriteria = CriteriaTransformer.TransformToRowCount(criteria);
So, for Count, use the simplest possible query, i.e. one containing the same JOINs and WHERE part.
If you really don't need the results, but only the count, then you shouldn't even bother writing the .Select() clause. Radim's answer as posted is a good way to both get the results and the count, but if your database supports it, use future queries to execute both in the same roundtrip to the database:
var query = repository.Query<MyClass>;
var list = query.Select(domain => new MyDto()
{
    Id = domain.Id,
    StringComma = string.Join(",", domain.MyList.Select(y => y.Name))
}).ToFuture();
var countFuture = query.ToFutureValue(x => x.Count());
int actualCount = countFuture.Value; // queries are actually executed here
Note that in NH prior to 3.3.3, this would still execute two round-trips (see https://nhibernate.jira.com/browse/NH-3184), but it would work, and if you ever upgrade NH, you get a (minor) performance boost.

How do I do a String.IsNullOrEmpty() test in an NHibernate QueryOver where clause?

I hit a situation today where a field in our legacy db that should never be empty... was empty.
I am using NHibernate 3.2 against this database and the queries that are affected are written in QueryOver.
My current query is this:
return Session
    .QueryOver<FacilityGroup>()
    .Where(fg => fg.Owner.Id == Token.OwnerId &&
                 fg.UserName == Token.UserName)
    .OrderBy(fg => fg.Code).Asc
    .TransformUsing(Transformers.DistinctRootEntity);
I want it to be this:
return Session
    .QueryOver<FacilityGroup>()
    .Where(fg => fg.Owner.Id == Token.OwnerId &&
                 fg.UserName == Token.UserName &&
                 !string.IsNullOrEmpty(fg.Code))
    .OrderBy(fg => fg.Code).Asc
    .TransformUsing(Transformers.DistinctRootEntity);
When I try this I get an exception "Unrecognised method call: System.String:Boolean IsNullOrEmpty(System.String)"
So NHibernate can't translate string.IsNullOrEmpty. Fair enough. However, when I try this:
return Session
    .QueryOver<FacilityGroup>()
    .Where(fg => fg.Owner.Id == Token.OwnerId &&
                 fg.UserName == Token.UserName &&
                 !(fg.Code == null || fg.Code.Trim() == ""))
    .OrderBy(fg => fg.Code).Asc
    .TransformUsing(Transformers.DistinctRootEntity);
I get an InvalidOperationException "variable 'fg' of type 'Domain.Entities.FacilityGroup' referenced from scope '', but it is not defined"
Any thoughts?
Ok... I guess I asked this question too soon. I figured out a way around this.
What I was able to do was invoke the 'trim' function from HQL via a SQL function projection. I ended up writing it as an IQueryOver extension method to keep it flexible. I will post it here in case anyone needs it.
public static class QueriesExtentions
{
    public static IQueryOver<E, F> WhereStringIsNotNullOrEmpty<E, F>(this IQueryOver<E, F> query, Expression<Func<E, object>> propExpression)
    {
        var prop = Projections.Property(propExpression);
        var criteria = Restrictions.Or(
            Restrictions.IsNull(prop),
            Restrictions.Eq(Projections.SqlFunction("trim", NHibernateUtil.String, prop), ""));
        return query.Where(Restrictions.Not(criteria));
    }
}
And here it is in use:
return Session
    .QueryOver<FacilityGroup>()
    .Where(fg => fg.Owner.Id == Token.OwnerId && fg.UserName == Token.UserName)
    .WhereStringIsNotNullOrEmpty(fg => fg.Code)
    .OrderBy(fg => fg.Code).Asc
    .TransformUsing(Transformers.DistinctRootEntity);