Apply a function into grouped dataframe using Scala Spark - dataframe

I'm trying to do DBSCAN in each group of latitudes and longitudes from users. The implementation of this clustering algorithm was done by irvingc here. I bumped up all dependencies to make the code work properly in my env.
Describing the sistuation: I have a Dataframe which is composed by events from user, each event has an id, a lat, and a long, you can see the columns by this case class. By that, I transform the dataframe to dataset to use the groupbykey and mapgroups methods to apply the function to the grouped data. However, the DBSCAN I'm using receive an RDD[linalg.Vector], so I have to transform the group into Vector of lat/lon, and this transformation gives the error SPARK-28702. Can you give some advice how to handle this issue?
case class StayDataset(objectID: Long, latitude: Double, longitude: Double, timeStart: Long, timeEnd: Long)
var dfs: Array[DataFrame] = Array()
val s = dataset.groupByKey(k => k.objectID).mapGroups{
case(k, iter) => {
POIDetection.groupStayPointsFromUser(k, iter, dataset.sparkSession)
dfs = dfs ++ Array(df)
k
}
}
def groupStayPointsFromUser(k: Long, dataset: Iterator[StayDataset], spark: SparkSession): DataFrame = {
val points = dataset.map(row => Vectors.dense(Array(row.latitude, row.longitude))).toSeq
val rddVector = spark.sparkContext.parallelize(points)
val size = points.length
val model = DBSCAN.train(rddVector, eps = 20, minPoints = (size * 0.18).toInt, maxPointsPerPartition = (size / 4).toInt)
val pointRDD = new PointRDD(model.labeledPoints.map(p => {
val point = POIDetection.geoFactory.createPoint(new Coordinate(p.x, p.y))
point.setUserData(p.cluster.toString())
point
}))
val df = Adapter.toDf(pointRDD, Seq("cluster"), spark)
.select(col("cluster").cast("long"), col("geometry"))
df
}
I think this problem arises when we want to apply a KNN in a grouped data. How to do that?

I don't understand what you want to achieve but first, you need to create a RDD[linalg.Vector], I suppose that you have the dataset of StayDataset already, to retrieve the RDDs, you need to transform the Dataset of StayDataset to linalg.Vector
val dsVector = dataset.transform[linalg.Vector](rec => linalg.Vectors.dense(rec.latitude, rec.longitude))
and then you retrieve the rdd[linalg.Vector]:
val rdd = dsVector.rdd
and you pass the rdd to your DBSCAN:
DBSCAN.train(rdd, ...)
These are necessary for your to get the rdd to do the train.
I think you also need to do some aggregation beforehand. If it is true, you need to manipulate on the dataset you have

Related

scala spark foldLeft on map with array

I have a configuration in a form of a map:
val config = Map[String, Array[String]] = Map("column1" -> Array("field1"), "column2" -> Array("field1","field2"))
I want to use foldLeft to apply this to a dataframe and dynamically filter nested fields using dropFields functions.
I came out with:
val result = config.foldLeft(""){(k, v) =>
df.withColumn( v._1, col(k + v._1).dropFields(v._2: _*))
}
but struggle to make foldleft work, any help would be appreciated.

Extend Groupby to include multiply aggregation

I implemented a groupby function which groups columns based on a particular aggregation successfully. The issue is I am using a argument for chosen columns and aggregation as Map[String,String] which means multiple aggregations cannot be performed on one column. for example sum, mean and max all on one column.
below is what works soo far:
groupByFunction(input, Map("someSignal" -> "mean"))
def groupByFunction(dataframeDummy: DataFrame,
columnsWithOperation: Map[String,String],
someSession: String = "sessionId",
someSignal: String = "signalName"): DataFrame = {
dataframeDummy
.groupBy(
col(someSession),
col(someSignal)
).agg(columnsWithOperation)
}
Upon looking into it a bit more, the agg function can take a list of columns like below
userData
.groupBy(
window(
(col(timeStampColumnName) / lit(millisSecondsPerSecond)).cast(TimestampType),
timeWindowInS.toString.concat(" seconds")
),
col(sessionColumnName),
col(signalColumnName)
).agg(
mean("physicalSignalValue"),
sum("physicalSignalValue")).show()
So I decided to try to manipulate the input to look like that, below is how I did it:
val signalIdColumn = columnsWithOperation.toSeq.flatMap { case (key, list) => list.map(key -> _) }
val result = signalIdColumn.map(tuple =>
if (tuple._2 == "mean")
mean(tuple._1)
else if (tuple._2 == "sum")
sum(tuple._1)
else if (tuple._2 == "max")
max(tuple._1))
Now I have a list of columns, which is still a problem for agg funciton.
I was able to solve it using a sequence of tuples like this Seq[(String, String)] instead of Map[String,String]
def groupByFunction(dataframeDummy: DataFrame,
columnsWithOperation: Seq[(String, String)],
someSession: String = "sessionId",
someSignal: String = "signalName"): DataFrame = {
dataframeDummy
.groupBy(
col(someSession),
col(someSignal)
).agg(columnsWithOperation)
and then with the information
from below post:
https://stackoverflow.com/a/34955432/2091294
userData
.groupBy(
col(someSession),
col(someSignal)
).agg(columnsWithOperation.head, columnsWithOperation.tail: _*)

Extract words from a string in spark hadoop with scala

I was using the code below to extract strings I needed in Spark SQL. But now I am working with more data in Spark Hadoop and I want to extract strings. I tried the same code, but it does not work.
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern
//User Defined function to extract
def toExtract(str: String) = {
val pattern = Pattern.compile("#\\w+")
val tmplst = scala.collection.mutable.ListBuffer.empty[String]
val matcher = pattern.matcher(str)
while (matcher.find()) {
tmplst += matcher.group()
}
tmplst.mkString(",")
}
val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
How do I solve this problem?
First off, you're using Spark not the way its meant to. Your DataFrame isn't partitioned at all. Use:
val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand" ...). That way, each bulk of words will be assigned to a different partition, different JVMs and/or clusters (depending on the total number of partitions and size of data). In your solution, the entire sentence is assigned to a specific partition and thus there's no parallelism nor distribution.
Second, you don't have to use a UDF (try to avoid those in general).
In order to find your regex, you can simply execute:
dataFrame.filter(col("words") rlike "#\\w+")
Hope it helps :-)

Scala/Apache Spark Converting DataFrame column values and type, multiple when otherwise

I have a primary SQL table that I am reading into Spark and modifying to write to CassandraDB. Currently I have a working implementation for converting a gender from 0, 1, 2, 3 (integers) to "Male", "Female", "Trans", etc (Strings). Though the below method does work, it seems very inefficient to make a seperate Array with those mappings into a DataFrame, join it into the main table/DataFrame, then remove, rename, etc.
I have seen:
.withColumn("gender", when(col("gender) === 1, "male").otherwise("female")
that would allow me to continue method chaining on the primary table but have not been able to get it working with more than 2 options. Is there a way to do this? I have around 10 different columns on this table that each need their own custom conversion created. Since this code will be processing TBs of data, is there a less repetitive and more efficient way to accomplish this. Thanks for any help in advance!
case class Gender(tmpid: Int, tmpgender: String)
private def createGenderDf(spark:SparkSession): DataFrame = {
import spark.implicits._
Seq(
Gender(1, "Male"),
Gender(2, "Female"),
Gender(777, "Prefer not to answer")
).toDF
}
private def createPersonsDf(spark: SparkSession): DataFrame = {
val genderDf = createGenderDf(spark)
genderDf.show()
val personsDf: DataFrame = spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "\t")
.load(dataPath + "people.csv")
.withColumnRenamed("ID", "id")
.withColumnRenamed("name_first", "firstname")
val personsDf1: DataFrame = personsDf
.join(genderDf, personsDf("gender") === genderDf("tmpid"), "leftouter")
val personsDf2: DataFrame = personsDf1
.drop("gender")
.drop("tmpid")
.withColumnRenamed("tmpgender", "gender")
}
You can use nested when function which would eliminate your need of creating genderDf, join, drop, rename etc. As for your example you can do the following
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
personsDf.withColumn("gender", when(col("gender") === 1, "male").otherwise(when(col("gender") ===2, "female").otherwise("Prefer not to answer")).cast(StringType))
You can add more when function in the above nested structure and you can repeate the same for other 10 columns as well.

Programmatically adding several columns to Spark DataFrame

I'm using spark with scala.
I have a Dataframe with 3 columns: ID,Time,RawHexdata.
I have a user defined function which takes RawHexData and expands it into X more columns. It is important to state that for each row X is the same (the columns do not vary). However, before I receive the first data, I do not know what the columns are. But once I have the head, I can deduce it.
I would like a second Dataframe with said columns: Id,Time,RawHexData,NewCol1,...,NewCol3.
The "Easiest" method I can think of to do this is:
1. deserialize each row into json (every data tyoe is serializable here)
2. add my new columns,
3. deserialize a new dataframe from the altered json,
However, that seems like a waste, as it involves 2 costly and redundant json serialization steps. I am looking for a cleaner pattern.
Using case-classes, seems like a bad idea, because I don't know the number of columns, or the column names in advance.
What you can do to dynamically extend your DataFrame is to operate on the row RDD which you can obtain by calling dataFrame.rdd. Having a Row instance, you can access the RawHexdata column and parse the contained data. By adding the newly parsed columns to the resulting Row, you've almost solved your problem. The only thing necessary to convert a RDD[Row] back into a DataFrame is to generate the schema data for your new columns. You can do this by collecting a single RawHexdata value on your driver and then extracting the column types.
The following code illustrates this approach.
object App {
case class Person(name: String, age: Int)
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
val dataFrame = input.df
dataFrame.show()
// create the extended rows RDD
val rowRDD = dataFrame.rdd.map{
row =>
val blob = row(1).asInstanceOf[Int]
val newColumns: Seq[Any] = Seq(blob, blob * 2, blob * 3)
Row.fromSeq(row.toSeq.init ++ newColumns)
}
val schema = dataFrame.schema
// we know that the new columns are all integers
val newColumns = StructType{
Seq(new StructField("1", IntegerType), new StructField("2", IntegerType), new StructField("3", IntegerType))
}
val newSchema = StructType(schema.init ++ newColumns)
val newDataFrame = sqlContext.createDataFrame(rowRDD, newSchema)
newDataFrame.show()
}
}
SELECT is your friend solving it without going back to RDD.
case class Entry(Id: String, Time: Long)
val entries = Seq(
Entry("x1", 100L),
Entry("x2", 200L)
)
val newColumns = Seq("NC1", "NC2", "NC3")
val df = spark.createDataFrame(entries)
.select(col("*") +: (newColumns.map(c => lit(null).as(c))): _*)
df.show(false)
+---+----+----+----+----+
|Id |Time|NC1 |NC2 |NC3 |
+---+----+----+----+----+
|x1 |100 |null|null|null|
|x2 |200 |null|null|null|
+---+----+----+----+----+