scala spark foldLeft on map with array - dataframe

I have a configuration in the form of a map:
val config: Map[String, Array[String]] = Map("column1" -> Array("field1"), "column2" -> Array("field1", "field2"))
I want to use foldLeft to apply this to a dataframe and dynamically filter nested fields using dropFields functions.
I came up with:
val result = config.foldLeft(""){(k, v) =>
df.withColumn( v._1, col(k + v._1).dropFields(v._2: _*))
}
but I am struggling to make foldLeft work. Any help would be appreciated.
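One possible shape for the fold, as a minimal sketch: the DataFrame itself is the accumulator (instead of an empty string), and each map entry is destructured into the column name and the fields to drop. The object name DropNestedFields and the helper dropConfiguredFields are made up for illustration, and the sketch assumes Spark 3.1 or later, where Column.dropFields is available, and that each struct keeps at least one field after the drop.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object DropNestedFields {
  // Same configuration as in the question: struct column -> nested fields to drop.
  val config: Map[String, Array[String]] = Map(
    "column1" -> Array("field1"),
    "column2" -> Array("field1", "field2")
  )

  // Hypothetical helper: folds over the config, starting from the input DataFrame.
  // Each step replaces one struct column with a copy that lacks the listed fields.
  def dropConfiguredFields(df: DataFrame, config: Map[String, Array[String]]): DataFrame =
    config.foldLeft(df) { case (acc, (columnName, fields)) =>
      acc.withColumn(columnName, col(columnName).dropFields(fields: _*))
    }
}
```

Usage would be along the lines of val result = DropNestedFields.dropConfiguredFields(df, DropNestedFields.config).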

Related

Scope DataFrame transformations in Spark

I need to transform some DataFrame rows for which specific flag is set and leave all other rows untouched.
df.withColumn("a", when($"flag".isNotNull, lit(1)).otherwise($"a"))
.withColumn("b", when($"flag".isNotNull, $"b" + 1).otherwise($"b"))
.withColumn("c", when($"flag".isNotNull, concat($"c", lit("++"))).otherwise($"c"))
There might be more columns like that and I am looking for a way to refactor this into something nicer.
I thought about:
df.filter($"flag".isNotNull)
.withColumn("a", lit(1))
.withColumn("b", $"b" + 1)
.withColumn("c", concat($"c", lit("++")))
.union(df.filter($"flag".isNull))
but it scans/recalculates df twice. Even if I cache it, the plan contains the lineage of each branch separately, and I actually chain multiple similar transformations, so the final plan explodes exponentially and crashes.
Would it be possible to implement something like:
df.withScope($"flag".isNotNull) { scoped =>
scoped.withColumn("a", lit(1))
.withColumn("b", $"b" + 1)
.withColumn("c", concat($"c", lit("++")))
}
Using when expressions is ok. You can write something like this:
val updates = Map(
"a" -> lit(1),
"b" -> ($"b" + 1),
"c" -> concat($"c", lit("++"))
)
val df2 = updates.foldLeft(df) { case (acc, (c, v)) =>
acc.withColumn(c, when($"flag".isNotNull, v).otherwise(col(c)))
}
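The same fold can be wrapped to get something close to the withScope syntax asked about. This is only a sketch: ScopedUpdates and the withScope extension method are hypothetical names, not part of Spark, and the updates are passed as (column name, expression) pairs rather than as a block of withColumn calls. Because the fold applies the updates one after another, a later expression sees the columns already rewritten by earlier ones.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}

object ScopedUpdates {
  // Hypothetical extension method: applies each update only where `condition` holds
  // and keeps the existing column value everywhere else.
  implicit class ScopedDataFrame(val df: DataFrame) extends AnyVal {
    def withScope(condition: Column)(updates: (String, Column)*): DataFrame =
      updates.foldLeft(df) { case (acc, (name, value)) =>
        acc.withColumn(name, when(condition, value).otherwise(col(name)))
      }
  }
}
```

With import ScopedUpdates._ in scope, a call could look like df.withScope($"flag".isNotNull)("a" -> lit(1), "b" -> ($"b" + 1)). The plan stays a single chain of projections over df, so nothing is scanned twice.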

Apply a function into grouped dataframe using Scala Spark

I'm trying to do DBSCAN in each group of latitudes and longitudes from users. The implementation of this clustering algorithm was done by irvingc here. I bumped up all dependencies to make the code work properly in my env.
Describing the situation: I have a DataFrame composed of user events; each event has an id, a latitude, and a longitude (the columns are shown in the case class below). I convert the DataFrame to a Dataset so that I can use the groupByKey and mapGroups methods to apply the function to the grouped data. However, the DBSCAN implementation I'm using takes an RDD[linalg.Vector], so I have to transform each group into vectors of lat/lon, and this transformation gives the error SPARK-28702. Can you give some advice on how to handle this issue?
case class StayDataset(objectID: Long, latitude: Double, longitude: Double, timeStart: Long, timeEnd: Long)
var dfs: Array[DataFrame] = Array()
val s = dataset.groupByKey(k => k.objectID).mapGroups{
case(k, iter) => {
val df = POIDetection.groupStayPointsFromUser(k, iter, dataset.sparkSession)
dfs = dfs ++ Array(df)
k
}
}
def groupStayPointsFromUser(k: Long, dataset: Iterator[StayDataset], spark: SparkSession): DataFrame = {
val points = dataset.map(row => Vectors.dense(Array(row.latitude, row.longitude))).toSeq
val rddVector = spark.sparkContext.parallelize(points)
val size = points.length
val model = DBSCAN.train(rddVector, eps = 20, minPoints = (size * 0.18).toInt, maxPointsPerPartition = (size / 4).toInt)
val pointRDD = new PointRDD(model.labeledPoints.map(p => {
val point = POIDetection.geoFactory.createPoint(new Coordinate(p.x, p.y))
point.setUserData(p.cluster.toString())
point
}))
val df = Adapter.toDf(pointRDD, Seq("cluster"), spark)
.select(col("cluster").cast("long"), col("geometry"))
df
}
I think the same problem arises whenever we want to apply something like KNN to grouped data. How can I do that?
I don't fully understand what you want to achieve, but first you need to create an RDD[linalg.Vector]. Assuming you already have a Dataset[StayDataset], map each record to a linalg.Vector on the underlying RDD, which avoids needing an Encoder for linalg.Vector:
val rdd = dataset.rdd.map(rec => linalg.Vectors.dense(rec.latitude, rec.longitude))
and then pass the RDD to your DBSCAN:
DBSCAN.train(rdd, ...)
These steps are needed to get the RDD that train expects.
I think you also need to do some aggregation beforehand. If that is the case, do it on the Dataset you already have, before converting it to an RDD.

Extend groupBy to include multiple aggregations

I implemented a groupBy function that groups columns and applies a particular aggregation, and it works. The issue is that I pass the chosen columns and aggregations as a Map[String,String], which means multiple aggregations cannot be performed on one column, for example sum, mean and max all on the same column.
Below is what works so far:
groupByFunction(input, Map("someSignal" -> "mean"))
def groupByFunction(dataframeDummy: DataFrame,
columnsWithOperation: Map[String,String],
someSession: String = "sessionId",
someSignal: String = "signalName"): DataFrame = {
dataframeDummy
.groupBy(
col(someSession),
col(someSignal)
).agg(columnsWithOperation)
}
Upon looking into it a bit more, the agg function can take a list of columns like below
userData
.groupBy(
window(
(col(timeStampColumnName) / lit(millisSecondsPerSecond)).cast(TimestampType),
timeWindowInS.toString.concat(" seconds")
),
col(sessionColumnName),
col(signalColumnName)
).agg(
mean("physicalSignalValue"),
sum("physicalSignalValue")).show()
So I decided to try to manipulate the input to look like that, below is how I did it:
val signalIdColumn = columnsWithOperation.toSeq.flatMap { case (key, list) => list.map(key -> _) }
val result = signalIdColumn.map(tuple =>
if (tuple._2 == "mean")
mean(tuple._1)
else if (tuple._2 == "sum")
sum(tuple._1)
else if (tuple._2 == "max")
max(tuple._1))
Now I have a list of columns, which is still a problem for the agg function.
I was able to solve it using a sequence of tuples like this Seq[(String, String)] instead of Map[String,String]
def groupByFunction(dataframeDummy: DataFrame,
columnsWithOperation: Seq[(String, String)],
someSession: String = "sessionId",
someSignal: String = "signalName"): DataFrame = {
dataframeDummy
.groupBy(
col(someSession),
col(someSignal)
).agg(columnsWithOperation)
}
and then, using the information from this post:
https://stackoverflow.com/a/34955432/2091294
I passed the sequence to agg as a head element plus varargs:
userData
.groupBy(
col(someSession),
col(someSignal)
).agg(columnsWithOperation.head, columnsWithOperation.tail: _*)
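To make the input handling concrete, here is a small sketch in plain Scala (no Spark needed to run it) of how several operations per column could be flattened into the Seq[(String, String)] shape used above. The object name AggregationPairs and the Map[String, Seq[String]] input shape are assumptions for illustration. Unlike Map keys, a Seq of pairs can repeat the same column name, which is what allows several aggregations on one column.

```scala
object AggregationPairs {
  // Hypothetical input shape: one column name mapped to several aggregation names.
  val columnsWithOperations: Map[String, Seq[String]] =
    Map("physicalSignalValue" -> Seq("mean", "sum", "max"))

  // Flatten into (column, operation) pairs, the shape accepted by
  // agg(aggExpr: (String, String), aggExprs: (String, String)*).
  val pairs: Seq[(String, String)] =
    columnsWithOperations.toSeq.flatMap { case (column, ops) =>
      ops.map(op => column -> op)
    }

  // With Spark in scope, the call would then look like:
  //   df.groupBy(col(someSession), col(someSignal)).agg(pairs.head, pairs.tail: _*)
}
```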

Slick 2 - Update columns in a table and return whole table object

How would you update a few columns in a table while returning the entire updated row when using Slick?
Assuming SomeTables is some TableQuery, you would typically write a query like this if you want to, for example, add an item to the table (and returning the newly added item)
val returnedItem = SomeTables returning SomeTables += someTable
How would you do the same if you want to update an item and get the whole item back? I suspect you would do something like this:
val q = SomeTables.filter(_.id === id).map(x => (x.someColumn,x.anotherColumn)) returning SomeTables
val returnedItem = q.update((3,"test"))
The above code, however, does not work, and I can't find any documentation on how to do this.
Note that I am aware you can just query the item beforehand, update it, and then use copy on the original object; however, this requires a lot of boilerplate (and extra DB round trips as well).
This feature is not supported in Slick (v2 or v3-M1). Although I don't see any specific reason prohibiting its implementation, UPDATE ... RETURNING is not a standard SQL feature (for example, H2 does not support it: http://www.h2database.com/html/grammar.html#update). I'll leave it as an exercise to the reader to explore how one might safely and efficiently emulate the feature for RDBMSes lacking UPDATE ... RETURNING.
When you call "returning" on a scala.slick.lifted.Query, it gives you a JdbcInsertInvokerComponent$ReturningInsertInvokerDef. You'll find no update method, although there is an insertOrUpdate method; however, insertOrUpdate only returns the returning expression result if an insert occurs, None is returned for updates, so no help here.
From this we can conclude that if you want to use the UPDATE ... RETURNING SQL feature, you'll either need to use StaticQuery or roll your own patch to Slick. You can manually write your queries (and re-implement your table projections as GetResult / SetParameter serializers), or you can try this snippet of code:
package com.spingo.slick
import scala.slick.driver.JdbcDriver.simple.{queryToUpdateInvoker, Query}
import scala.slick.driver.JdbcDriver.{updateCompiler, queryCompiler, quoteIdentifier}
import scala.slick.jdbc.{ResultConverter, CompiledMapping, JdbcBackend, JdbcResultConverterDomain, GetResult, SetParameter, StaticQuery => Q}
import scala.slick.util.SQLBuilder
import slick.ast._
object UpdateReturning {
implicit class UpdateReturningInvoker[E, U, C[_]](updateQuery: Query[E, U, C]) {
def updateReturning[A, F](returningQuery: Query[A, F, C], v: U)(implicit session: JdbcBackend#Session): List[F] = {
val ResultSetMapping(_,
CompiledStatement(_, sres: SQLBuilder.Result, _),
CompiledMapping(_updateConverter, _)) = updateCompiler.run(updateQuery.toNode).tree
val returningNode = returningQuery.toNode
val fieldNames = returningNode match {
case Bind(_, _, Pure(Select(_, col), _)) =>
List(col.name)
case Bind(_, _, Pure(ProductNode(children), _)) =>
children map { case Select(_, col) => col.name } toList
case Bind(_, TableExpansion(_, _, TypeMapping(ProductNode(children), _, _)), Pure(Ref(_), _)) =>
children map { case Select(_, col) => col.name } toList
}
implicit val pconv: SetParameter[U] = {
val ResultSetMapping(_, compiled, CompiledMapping(_converter, _)) = updateCompiler.run(updateQuery.toNode).tree
val converter = _converter.asInstanceOf[ResultConverter[JdbcResultConverterDomain, U]]
SetParameter[U] { (value, params) =>
converter.set(value, params.ps)
}
}
implicit val rconv: GetResult[F] = {
val ResultSetMapping(_, compiled, CompiledMapping(_converter, _)) = queryCompiler.run(returningNode).tree
val converter = _converter.asInstanceOf[ResultConverter[JdbcResultConverterDomain, F]]
GetResult[F] { p => converter.read(p.rs) }
}
val fieldsExp = fieldNames map (quoteIdentifier) mkString ", "
val sql = sres.sql + s" RETURNING ${fieldsExp}"
val unboundQuery = Q.query[U, F](sql)
unboundQuery(v).list
}
}
}
I'm certain the above can be improved; I've written it based on my somewhat limited understanding of Slick internals, and it works for me and can leverage the projections / type-mappings you've already defined.
Usage:
import com.spingo.slick.UpdateReturning._
val tq = TableQuery[MyTable]
val st = tq filter(_.id === 1048003) map { e => (e.id, e.costDescription) }
st.updateReturning(tq map (identity), (1048003, Some("such cost")))

why both transform and map methods in scala?

I'm having trouble understanding the difference between / reason for, for example, immutable.Map.transform and immutable.Map.map. It looks like transform won't change the key, but that just seems like a trivial variation of the map method. Am I missing something?
I was expecting to find a method that applied a function to the (key,value) of the map when/if that element was accessed (rather than having to iterate through the map eagerly with the map function). Does such a method exist?
You can do exactly that with mapValues. Here is the explanation from the docs:
def mapValues[C](f: (B) ⇒ C): Map[A, C]
Transforms this map by applying a function to every retrieved value.
f - the function used to transform values of this map.
returns - a map view which maps every key of this map to f(this(key)). The resulting map wraps the original map without copying any elements.
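A small sketch of that lazy behaviour, assuming Scala 2.13 or later, where the view is obtained with .view.mapValues (on 2.12 and earlier, mapValues is called directly on the map). The object name and the call counter are only there for illustration: the function runs each time a value is retrieved, not when the view is created.

```scala
object LazyMapValuesDemo {
  // Counts how many times the mapping function has actually been evaluated.
  var calls = 0

  val prices: Map[String, Int] = Map("a" -> 1, "b" -> 2)

  // Creating the view does not evaluate the function for any element.
  val doubled = prices.view.mapValues { v =>
    calls += 1
    v * 2
  }
}
```

Note that the result is not cached: looking up the same key twice evaluates the function twice.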
edit:
Although extending classes of the collection API is not often a good idea, it could work like this:
class LazilyModifiedMap[A,B,C](underlying: Map[A,B])(f: (A,B) => C) extends Map[A,C] {
def get(key: A) = underlying.get(key).map( x => f(key, x))
def iterator = underlying.iterator.map { case (k,v) => (k, f(k,v)) }
def -(key: A) = iterator.toMap - key
def +[C1 >: C](kv: (A,C1)) = iterator.toMap + kv
}
If you only need the interface of PartialFunction, you can exploit the fact that Map inherits from PartialFunction:
val m = Map(1 -> "foo", 2 -> "bar")
val n = m.andThen(_.reverse)
n(1) // --> oof
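For the first part of the question, the difference between transform and map can be seen side by side in a short sketch (the object and value names are made up for illustration). transform receives the key and the value but can only produce a new value, so the keys are guaranteed to stay the same; map produces whole (key, value) pairs and may therefore change, or even merge, keys.

```scala
object TransformVsMap {
  val m: Map[String, Int] = Map("a" -> 1, "b" -> 2)

  // transform: the function sees (key, value) and returns only the new value.
  val viaTransform: Map[String, Int] = m.transform((_, v) => v * 10)

  // map: the function returns a new (key, value) pair, so keys may change too.
  val viaMap: Map[String, Int] = m.map { case (k, v) => k.toUpperCase -> v * 10 }
}
```

Both are strict: the function is applied to every entry when the new map is built.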