Scala + Spark: filter a dataset if it contains elements from a list - dataframe

I have a dataset and I want to filter it based on a column.
val test = Seq(
("1", "r2_test"),
("2", "some_other_value"),
("3", "hs_2_card"),
("4", "vsx_np_v2"),
("5", "r2_test"),
("2", "some_other_value2")
).toDF("id", "my_column")
I want to create a function that filters my dataframe based on the elements of a list, using contains on "my_column" (if the column contains part of any string in the list, the row should be kept):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
def filteredElements(df: DataFrame): DataFrame = {
val elements = List("r2", "hs", "np")
df.filter($"my_column".contains(elements))
}
But written like this, it won't work for a list, only for a single element.
How can I adapt it to use my list without having to chain multiple filters?
Below is the expected output when applying the function:
val output = test.transform(filteredElements)
expected =
("1", "r2_test"), // contains "rs"
("3", "hs_2_card"), // contains "hs"
("4", "vsx_np_v2"), // contains "np"
("5", "r2_test"), // contains "r2"

You can do it in one line without a UDF (better for performance and simpler):
df.filter(col("my_column").isNotNull).filter(row => elements.exists(row.getAs[String]("my_column").contains)).show()

One way to solve this would be to use a UDF. I think there should be some way to solve this with Spark SQL functions that I'm not aware of. Anyway, you can define a UDF that tells whether a String contains any of the values in your elements list or not:
import org.apache.spark.sql.functions._
val elements = List("r2", "hs", "np")
val isContainedInList = udf { (value: String) =>
elements.exists(e => value.indexOf(e) != -1)
}
You can use this udf in select, filter, basically anywhere you want:
def filteredElements(df: DataFrame): DataFrame = {
df.filter(isContainedInList($"my_column"))
}
And the result is as expected:
+---+---------+
| id|my_column|
+---+---------+
| 1| r2_test|
| 3|hs_2_card|
| 4|vsx_np_v2|
| 5| r2_test|
+---+---------+
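For reference, the same filter can also be built from plain Column expressions, which avoids the UDF entirely. This is a minimal sketch, assuming the same elements list and column name as above:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
def filteredElements(df: DataFrame): DataFrame = {
  val elements = List("r2", "hs", "np")
  // one boolean Column: my_column contains "r2" OR "hs" OR "np"
  val condition = elements.map(e => col("my_column").contains(e)).reduce(_ || _)
  df.filter(condition)
}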

Related

How do I create a new DataFrame based on an old DataFrame?

I have a csv file, dbname1.table1.csv:
|target            |source       |source_table                              |relation_type|
|------------------|-------------|------------------------------------------|-------------|
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|indirect     |
csv format for this table:
target,source,source_table,relation_type
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,protocol_dttm,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,indirect
Then I create a dataframe by reading it:
val dfDL = spark.read.option("delimiter", ",")
.option("header", true)
.csv(file.getPath.toUri.getPath)
Now I need to create a new dataframe based on dfDL.
The structure of the new dataframe looks like this:
case class DataLink(schema_from: String,
table_from: String,
column_from: String,
link_type: String,
schema_to: String,
table_to: String,
column_to: String)
The information for the fields of the new DataFrame is obtained from a csv file:
pseudocode:
schema_from = source_table.split(".")(0) // Example: custom_cib_ml_stg
table_from = source_table.split(".")(1) // Example: p_overall_part_tend_cust
column_from = source // Example: inn_num
link_type = relation_type // Example: direct
schema_to = "dbname1.table1.csv".split(".")(0) // Example: dbname1
table_to = "dbname1.table1.csv".split(".")(1) // Example: table1
column_to = target // Example: avg_ensure_sum_12m
I need to create a new dataframe, but I can't work it out on my own.
P.S. I need this dataframe to create a json file from it later.
Example JSON:
[{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"inn_num",
"link_type":"direct",
"schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"
},
{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"protocol_dttm",
"link_type":"direct","schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"}
I don't like my current implementation:
def readDLFromHDFS(file: LocatedFileStatus): Array[DataLink] = {
val arrTableName = file.getPath.getName.split("\\.")
val (schemaTo, tableTo) = (arrTableName(0), arrTableName(1))
val dfDL = spark.read.option("delimiter", ",")
.option("header", true)
.csv(file.getPath.toUri.getPath)
//val sourceTable = dfDL.select("source_table").collect().map(value => value.toString().split("."))
dfDL.collect.map(row => DataLink(row.getString(2).split("\\.")(0),
row.getString(2).split("\\.")(1),
row.getString(1),
row.getString(3),
schemaTo,
tableTo,
row.getString(0)))
}
def toJSON(dataLinks: Array[DataLink]): Option[JValue] =
dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)
}
You definitely don't want to collect; that defeats the point of using Spark here. As always with Spark, you have a lot of options. You could use RDDs, but I don't see a need to switch modes here: you just want to apply custom logic to some columns and end up with a dataframe with the resulting columns.
First, define a UDF that you want to apply:
def convert(target: String, source: String, sourceTable: String, relationType: String): DataLink =
  DataLink(
    sourceTable.split("\\.")(0),
    sourceTable.split("\\.")(1),
    source,
    relationType,
    "dbname1.table1.csv".split("\\.")(0),
    "dbname1.table1.csv".split("\\.")(1),
    target)
Then apply this function to all the relevant columns (making sure you wrap it in udf to make it a spark function rather than a plain Scala function) and select the result:
df.select(udf(convert)($"target", $"source", $"source_table", $"relation_type"))
If you want a DataFrame with 7 columns as your result:
df.select(
  split(col("source_table"), "\\.").getItem(0),
  split(col("source_table"), "\\.").getItem(1),
  col("source"),
  col("relation_type"),
  lit("dbname1"),
  lit("table1"),
  col("target")
)
You can also add .as("column_name") to each of these 7 columns.
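For example, with the aliases spelled out (the literal schema/table names are the ones hardcoded from the question's file name):
df.select(
  split(col("source_table"), "\\.").getItem(0).as("schema_from"),
  split(col("source_table"), "\\.").getItem(1).as("table_from"),
  col("source").as("column_from"),
  col("relation_type").as("link_type"),
  lit("dbname1").as("schema_to"),
  lit("table1").as("table_to"),
  col("target").as("column_to")
)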
You can use a Dataset directly.
import spark.implicits._
case class DataLink(schema_from: String,
table_from: String,
column_from: String,
link_type: String,
schema_to: String,
table_to: String,
column_to: String)
val filename = "dbname1.table1.csv"
val df = spark.read.option("header","true").csv("test.csv")
df.show(false)
+------------------+-------------+------------------------------------------+-------------+
|target |source |source_table |relation_type|
+------------------+-------------+------------------------------------------+-------------+
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|indirect |
+------------------+-------------+------------------------------------------+-------------+
df.createOrReplaceTempView("table")
val df2 = spark.sql(s"""
select split(source_table, '[.]')[0] as schema_from
, split(source_table, '[.]')[1] as table_from
, source as column_from
, relation_type as link_type
, split('${filename}', '[.]')[0] as schema_to
, split('${filename}', '[.]')[1] as table_to
, target as column_to
from table
""").as[DataLink]
df2.show()
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
| schema_from| table_from| column_from|link_type|schema_to|table_to| column_to|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
|custom_cib_ml_stg|p_overall_part_te...| inn_num| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...|protocol_dttm| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...| inn_num| indirect| dbname1| table1|avg_ensure_sum_12m|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
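Since the end goal is a JSON file, the resulting Dataset can be written out directly. A small sketch; the output path is hypothetical:
// each row becomes one JSON object per line (JSON Lines format)
df2.write.mode("overwrite").json("/tmp/datalinks")
// or, to inspect the generated JSON strings:
df2.toJSON.show(false)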
My progress...
Now I can create a new DataFrame, but it contains only one column.
val dfDL = spark.read.option("delimiter", ",")
.option("header", true)
.csv(file.getPath.toUri.getPath)
val convertCase = (target: String, source: String, source_table: String, relation_type: String) =>
  DataLink(
    source_table.split("\\.")(0),
    source_table.split("\\.")(1),
    source,
    relation_type,
    schemaTo, // from the enclosing readDLFromHDFS scope
    tableTo,
    target
  )
val udfConvert = udf(convertCase)
val dfForJson = dfDL.select(udfConvert(col("target"),
col("source"),
col("source_table"),
col("relation_type")))

How to write a One to Many Query with Scala Slick which returns something like this `(Model1, Option[Seq[Model2]])`

I know this question has been asked before, but I can't figure it out.
In my data model I have a news model containing an arbitrary number of images.
case class NewsDataModel(
newsId: Option[Long],
name: String,
description: String,
author: String,
creationDateTime: Option[OffsetDateTime]
)
case class Image(
id: Long,
url: String,
newsId: Long
)
Now I want to query my DB to get something like this: (NewsDataModel, Option[Seq[Image]])
My query is currently implemented as follows:
val q = newsTable.joinLeft(imagesTable).on(_.newsId === _.newsId).result
db.run(q)
This evaluates to Future[Seq[(NewsDataModel, Option[Image])]]. I guess the right way to solve this would be to use the groupBy function, but I don't know how to implement it, since this
val q = newsTable.joinLeft(imagesTable).on(_.newsId === _.newsId)
.groupBy(_._1.newsId)
.result
db.run(q)
evaluates to Future[Seq[(Option[Long], Query[(NewsTable, Rep[Option[ImagesTable]]), (NewsDataModel, Option[Image]), Seq])]]
Slick won't automatically create that data structure for you. (I find it helpful to think of Slick in terms of rows and tables and what you can do in portable SQL, and not in terms of "object-relational mappers" or similar).
What you'll want to do is convert the rows into the format you want in Scala, after the database layer. There are many ways you can do that.
Here's one way to do that.
Given this example data...
scala> case class NewsDataModel(newsId: Long)
class NewsDataModel
scala> case class Image(id: Long)
class Image
scala> val results = Seq(
| ( NewsDataModel(1L), Some(Image(1L)) ),
| ( NewsDataModel(1L), Some(Image(10L)) ),
| ( NewsDataModel(1L), None ),
| ( NewsDataModel(2L), None ),
| ( NewsDataModel(3L), Some(Image(3L)) ),
| )
|
val results: Seq[(NewsDataModel, Option[Image])] = List((NewsDataModel(1),Some(Image(1))), (NewsDataModel(1),Some(Image(10))), (NewsDataModel(1),None), (NewsDataModel(2),None), (NewsDataModel(3),Some(Image(3))))
We can group by the key:
scala> val groups = results.groupBy { case (key, values) => key }
val groups: scala.collection.immutable.Map[NewsDataModel,Seq[(NewsDataModel, Option[Image])]] = HashMap(NewsDataModel(3) -> List((NewsDataModel(3),Some(Image(3)))), NewsDataModel(1) -> List((NewsDataModel(1),Some(Image(1))), (NewsDataModel(1),Some(Image(10))), (NewsDataModel(1),None)), NewsDataModel(2) -> List((NewsDataModel(2),None)))
And convert that into something like the type you want:
scala> val flat = groups.map { case (key, seq) => key -> seq.flatMap(_._2) }
val flat: scala.collection.immutable.Map[NewsDataModel,Seq[Image]] = HashMap(NewsDataModel(3) -> List(Image(3)), NewsDataModel(1) -> List(Image(1), Image(10)), NewsDataModel(2) -> List())
That flat result is a map, but you can turn it into (for example) a List with the type signature (close to) the type you want:
scala> flat.toList
val res18: List[(NewsDataModel, Seq[Image])] = List((NewsDataModel(3),List(Image(3))), (NewsDataModel(1),List(Image(1), Image(10))), (NewsDataModel(2),List()))
You can find lots of ways to do that, but the point is you're doing it in Scala, not Slick (SQL). Note, in particular, that the groupBy method I've used is the Scala one from the collection library, not the Slick one (which would be a SQL GROUP BY clause). That is, I'm modifying the result of running the query, not modifying the query itself.
I'd suggest putting whatever conversion you want into a method and then applying it to the Slick action. For example:
def convert(input: Seq[(NewsDataModel, Option[Image])]): Seq[(NewsDataModel, Seq[Image])] =
??? // your implementation here
val action = newsTable.joinLeft(imagesTable).on(_.newsId === _.newsId).result
val convertedAction = action.map(convert)
db.run(convertedAction)
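Putting the REPL steps above together, one possible body for convert (just a sketch of the same grouping; ordering of the groups is not preserved):
def convert(input: Seq[(NewsDataModel, Option[Image])]): Seq[(NewsDataModel, Seq[Image])] =
  input
    .groupBy { case (news, _) => news }                                          // group rows by the news model
    .map { case (news, rows) => news -> rows.flatMap { case (_, img) => img } }  // keep only the defined images
    .toSeq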

Is there a way to create a UDF that accepts an array of two strings and pass those strings as two arguments to a function?

I'm new to Scala so pardon my poor penmanship.
I have a function func1 that accepts two strings and returns a string.
I also have a dataframe df1 that has 2 columns a1 and b1. I'm trying to create a new dataframe df2 with both the columns from df1 (a1 and b1) and a new column c1 that is the output of the function func1. I know I need to use UDF. I don't know how to create a UDF that can accept 2 columns and pass these two as parameters to func1 and return the output string (column c1).
Here are some of the things that I tried:
def func1(str1:String, str2:String) : String = {
//code
return str3;
}
val df1= spark.sql("select * from emp")
.select("a1", "b1").cache()
val df2 = spark.sql("select * from df1")
.withColumn("c1", func1("a1","b1"))
.select("a1", "b1").cache()
But I don't get the results. Please advise. Thanks in advance.
You basically have a syntax problem.
Remember that when you do def func1(str1:String, str2:String) : String = ... func1 refers to a Scala function object, and not a Spark expression.
On the other hand, .withColumn expects a Spark expression as its second argument.
So what happens is that your call to .withColumn("c1", func1("a1","b1")) sends Spark a Scala function object, whereas the Spark API expects a "Spark Expression" (e.g. a column, or operation on columns, such as a User Defined Function (UDF)).
Luckily, it is easy to transform a Scala function into a Spark UDF, generally speaking, by wrapping it in a call to Spark's udf method.
So a working example could go like this:
// A sample dataframe
val dataframe = Seq(("a", "b"), ("c", "d")).toDF("columnA", "columnB")
// An example scala function that actually does something (string concat)
def concat(first: String, second: String) = first+second
// A conversion from scala function to spark UDF :
val concatUDF = udf((first: String, second: String) => concat(first, second))
// A sample execution of the UDF
// note the $ sign, which is short for indicating a column name
dataframe.withColumn("concat", concatUDF($"columnA", $"columnB")).show
+-------+-------+------+
|columnA|columnB|concat|
+-------+-------+------+
| a| b| ab|
| c| d| cd|
+-------+-------+------+
From there on, it should be easy to adapt to your precise function and its arguments.
Here is how you would do it:
scala> val df = Seq(("John","26"),("Bob","31")).toDF("a1","b1")
df: org.apache.spark.sql.DataFrame = [a1: string, b1: string]
scala> df.createOrReplaceTempView("emp")
scala> :paste
// Entering paste mode (ctrl-D to finish)
def func1(str1:String, str2:String) : String = {
val str3 = s" ${str1} is ${str2} years old"
return str3;
}
// Exiting paste mode, now interpreting.
func1: (str1: String, str2: String)String
scala> val my_udf_func1 = udf( func1(_:String,_:String):String )
my_udf_func1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> spark.sql("select * from emp").withColumn("c1", my_udf_func1($"a1",$"b1")).show(false)
2019-01-14 21:08:30 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
+----+---+---------------------+
|a1 |b1 |c1 |
+----+---+---------------------+
|John|26 | John is 26 years old|
|Bob |31 | Bob is 31 years old |
+----+---+---------------------+
scala>
There are two places where you need to correct your code:
After defining the regular function, you need to register it with udf():
val my_udf_func1 = udf( func1(_:String,_:String):String )
When calling the UDF, you should use the $"a1" column syntax, not simply "a1".
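If you literally want the UDF to accept an array of two strings, as in the question title, you can also wrap the two columns with array() and unpack them inside the UDF. A sketch reusing func1, assuming the same emp view and spark-shell imports as above:
val my_udf_arr = udf( (arr: Seq[String]) => func1(arr(0), arr(1)) )
spark.sql("select * from emp")
  .withColumn("c1", my_udf_arr(array($"a1", $"b1")))
  .show(false)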

Scala/Apache Spark Converting DataFrame column values and type, multiple when otherwise

I have a primary SQL table that I am reading into Spark and modifying to write to CassandraDB. Currently I have a working implementation for converting a gender from 0, 1, 2, 3 (integers) to "Male", "Female", "Trans", etc. (Strings). Though the method below does work, it seems very inefficient to build a separate Array with those mappings into a DataFrame, join it to the main table/DataFrame, then drop, rename, etc.
I have seen:
.withColumn("gender", when(col("gender) === 1, "male").otherwise("female")
that would allow me to continue method chaining on the primary table, but I have not been able to get it working with more than 2 options. Is there a way to do this? I have around 10 different columns on this table that each need their own custom conversion created. Since this code will be processing TBs of data, is there a less repetitive and more efficient way to accomplish this? Thanks for any help in advance!
case class Gender(tmpid: Int, tmpgender: String)
private def createGenderDf(spark:SparkSession): DataFrame = {
import spark.implicits._
Seq(
Gender(1, "Male"),
Gender(2, "Female"),
Gender(777, "Prefer not to answer")
).toDF
}
private def createPersonsDf(spark: SparkSession): DataFrame = {
val genderDf = createGenderDf(spark)
genderDf.show()
val personsDf: DataFrame = spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "\t")
.load(dataPath + "people.csv")
.withColumnRenamed("ID", "id")
.withColumnRenamed("name_first", "firstname")
val personsDf1: DataFrame = personsDf
.join(genderDf, personsDf("gender") === genderDf("tmpid"), "leftouter")
val personsDf2: DataFrame = personsDf1
.drop("gender")
.drop("tmpid")
.withColumnRenamed("tmpgender", "gender")
personsDf2
}
You can use the nested when function, which eliminates the need for creating genderDf, joining, dropping, renaming, etc. For your example you can do the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
personsDf.withColumn("gender", when(col("gender") === 1, "male").otherwise(when(col("gender") ===2, "female").otherwise("Prefer not to answer")).cast(StringType))
You can add more when functions to the nested structure above, and you can repeat the same for the other 10 columns as well.
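A slightly flatter equivalent chains when calls instead of nesting them inside otherwise, which stays readable as the number of mappings grows (just a variation on the same idea; the explicit cast can be dropped since the literals are already strings):
personsDf.withColumn("gender",
  when(col("gender") === 1, "Male")
    .when(col("gender") === 2, "Female")
    .otherwise("Prefer not to answer"))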

Programmatically adding several columns to Spark DataFrame

I'm using spark with scala.
I have a Dataframe with 3 columns: ID,Time,RawHexdata.
I have a user defined function which takes RawHexData and expands it into X more columns. It is important to state that for each row X is the same (the columns do not vary). However, before I receive the first data, I do not know what the columns are. But once I have the head, I can deduce it.
I would like a second Dataframe with said columns: Id,Time,RawHexData,NewCol1,...,NewCol3.
The "Easiest" method I can think of to do this is:
1. serialize each row to JSON (every data type is serializable here),
2. add my new columns,
3. read a new dataframe back from the altered JSON.
However, that seems like a waste, as it involves 2 costly and redundant json serialization steps. I am looking for a cleaner pattern.
Using case-classes, seems like a bad idea, because I don't know the number of columns, or the column names in advance.
What you can do to dynamically extend your DataFrame is to operate on the row RDD which you can obtain by calling dataFrame.rdd. Having a Row instance, you can access the RawHexdata column and parse the contained data. By adding the newly parsed columns to the resulting Row, you've almost solved your problem. The only thing necessary to convert a RDD[Row] back into a DataFrame is to generate the schema data for your new columns. You can do this by collecting a single RawHexdata value on your driver and then extracting the column types.
The following code illustrates this approach.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
object App {
case class Person(name: String, age: Int)
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
val dataFrame = input.toDF()
dataFrame.show()
// create the extended rows RDD
val rowRDD = dataFrame.rdd.map{
row =>
val blob = row(1).asInstanceOf[Int]
val newColumns: Seq[Any] = Seq(blob, blob * 2, blob * 3)
Row.fromSeq(row.toSeq.init ++ newColumns)
}
val schema = dataFrame.schema
// we know that the new columns are all integers
val newColumns = StructType{
Seq(new StructField("1", IntegerType), new StructField("2", IntegerType), new StructField("3", IntegerType))
}
val newSchema = StructType(schema.init ++ newColumns)
val newDataFrame = sqlContext.createDataFrame(rowRDD, newSchema)
newDataFrame.show()
}
}
SELECT is your friend; it solves this without going back to the RDD.
case class Entry(Id: String, Time: Long)
val entries = Seq(
Entry("x1", 100L),
Entry("x2", 200L)
)
val newColumns = Seq("NC1", "NC2", "NC3")
val df = spark.createDataFrame(entries)
.select(col("*") +: (newColumns.map(c => lit(null).as(c))): _*)
df.show(false)
+---+----+----+----+----+
|Id |Time|NC1 |NC2 |NC3 |
+---+----+----+----+----+
|x1 |100 |null|null|null|
|x2 |200 |null|null|null|
+---+----+----+----+----+
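If the new columns should actually be derived from RawHexdata instead of being left null, one option that still avoids the RDD round trip is a UDF returning a case class, whose struct result can then be flattened. This is only a sketch against the question's ID/Time/RawHexdata frame; Parsed, parseHex, and the parsing logic are placeholders, not the real decoding:
case class Parsed(nc1: Int, nc2: Int, nc3: Int)
// hypothetical parser; replace with the real RawHexdata decoding
val parseHex = udf((raw: String) => Parsed(raw.length, raw.length * 2, raw.length * 3))
val expanded = df
  .withColumn("parsed", parseHex(col("RawHexdata")))
  .select(col("*"), col("parsed.*")) // keep original columns and flatten the struct
  .drop("parsed")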