How do I create a new DataFrame based on an old DataFrame?

I have a csv file, dbname1.table1.csv:
| target             | source        | source_table                               | relation_type |
|--------------------|---------------|--------------------------------------------|---------------|
| avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | direct        |
| avg_ensure_sum_12m | protocol_dttm | custom_cib_ml_stg.p_overall_part_tend_cust | direct        |
| avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | indirect      |
csv format for this table:
target,source,source_table,relation_type
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,protocol_dttm,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,indirect
Then I create a dataframe by reading it:
val dfDL = spark.read.option("delimiter", ",")
  .option("header", true)
  .csv(file.getPath.toUri.getPath)
Now I need to create a new dataframe based on dfDL.
The structure of the new dataframe looks like this:
case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)
The information for the fields of the new DataFrame is obtained from a csv file:
pseudocode:
schema_from = source_table.split(".")(0) // Example: custom_cib_ml_stg
table_from = source_table.split(".")(1) // Example: p_overall_part_tend_cust
column_from = source // Example: inn_num
link_type = relation_type // Example: direct
schema_to = "dbname1.table1.csv".split(".")(0) // Example: dbname1
table_to = "dbname1.table1.csv".split(".")(1) // Example: table1
column_to = target // Example: avg_ensure_sum_12m
I need to create a new dataframe, but I can't work it out on my own.
P.S. I need this dataframe to create a json file from it later.
Example JSON:
[{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"inn_num",
"link_type":"direct",
"schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"
},
{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"protocol_dttm",
"link_type":"direct","schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"}
I don't like my current implementation:
def readDLFromHDFS(file: LocatedFileStatus): Array[DataLink] = {
  val arrTableName = file.getPath.getName.split("\\.")
  val (schemaTo, tableTo) = (arrTableName(0), arrTableName(1))

  val dfDL = spark.read.option("delimiter", ",")
    .option("header", true)
    .csv(file.getPath.toUri.getPath)

  //val sourceTable = dfDL.select("source_table").collect().map(value => value.toString().split("."))
  dfDL.collect.map(row => DataLink(row.getString(2).split("\\.")(0),
                                   row.getString(2).split("\\.")(1),
                                   row.getString(1),
                                   row.getString(3),
                                   schemaTo,
                                   tableTo,
                                   row.getString(0)))
}

def toJSON(dataLinks: Array[DataLink]): Option[JValue] =
  dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)

You definitely don't want to collect; that defeats the point of using Spark here. As always with Spark, you have a lot of options. You could use RDDs, but I don't see a need to switch between modes here. You just want to apply custom logic to some columns and end up with a DataFrame containing the result.
First, define a UDF that you want to apply:
def convert(target: String, source: String, source_table: String, relation_type: String): DataLink =
  DataLink(source_table.split("\\.")(0),
           source_table.split("\\.")(1),
           source,
           relation_type,
           "dbname1.table1.csv".split("\\.")(0),
           "dbname1.table1.csv".split("\\.")(1),
           target)
Then apply this function to all the relevant columns (making sure you wrap it in udf to make it a spark function rather than a plain Scala function) and select the result:
df.select(udf(convert _)($"target", $"source", $"source_table", $"relation_type"))
If you want a DataFrame with 7 columns as your result:
df.select(
  split(col("source_table"), "\\.").getItem(0),
  split(col("source_table"), "\\.").getItem(1),
  col("source"),
  col("relation_type"),
  lit("dbname1"),
  lit("table1"),
  col("target")
)
You can also add .as("column_name") to each of these 7 columns.
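For reference, a minimal sketch of that aliased version, using the DataLink field names (same expressions as above, only with .as added):
df.select(
  split(col("source_table"), "\\.").getItem(0).as("schema_from"),
  split(col("source_table"), "\\.").getItem(1).as("table_from"),
  col("source").as("column_from"),
  col("relation_type").as("link_type"),
  lit("dbname1").as("schema_to"),
  lit("table1").as("table_to"),
  col("target").as("column_to")
)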

You can use a Dataset directly.
import spark.implicits._
case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)
val filename = "dbname1.table1.csv"
val df = spark.read.option("header","true").csv("test.csv")
df.show(false)
+------------------+-------------+------------------------------------------+-------------+
|target |source |source_table |relation_type|
+------------------+-------------+------------------------------------------+-------------+
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|indirect |
+------------------+-------------+------------------------------------------+-------------+
df.createOrReplaceTempView("table")
val df2 = spark.sql(s"""
select split(source_table, '[.]')[0] as schema_from
, split(source_table, '[.]')[1] as table_from
, source as column_from
, relation_type as link_type
, split('${filename}', '[.]')[0] as schema_to
, split('${filename}', '[.]')[1] as table_to
, target as column_to
from table
""").as[DataLink]
df2.show()
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
| schema_from| table_from| column_from|link_type|schema_to|table_to| column_to|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
|custom_cib_ml_stg|p_overall_part_te...| inn_num| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...|protocol_dttm| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...| inn_num| indirect| dbname1| table1|avg_ensure_sum_12m|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
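Since the stated end goal is a JSON file, the typed df2 above can be written out directly. A minimal sketch (the output path is just a placeholder; note that Spark writes JSON Lines, i.e. one object per line, rather than a single JSON array):
df2.coalesce(1)             // optional: collapse to a single part file
  .write
  .mode("overwrite")
  .json("/tmp/data_links")  // placeholder path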

My progress...
Now I can create a new DataFrame, but it contains only one column.
val dfDL = spark.read.option("delimiter", ",")
  .option("header", true)
  .csv(file.getPath.toUri.getPath)

val convertCase = (target: String, source: String, source_table: String, relation_type: String) =>
  DataLink(
    source_table.split("\\.")(0),
    source_table.split("\\.")(1),
    source,
    relation_type,
    schemaTo,
    tableTo,
    target
  )

val udfConvert = udf(convertCase)

val dfForJson = dfDL.select(udfConvert(col("target"),
                                       col("source"),
                                       col("source_table"),
                                       col("relation_type")))

Related

Scala + Spark: filter a dataset if it contains elements from a list

I have a dataset and I want to filter it based on a column.
val test = Seq(
("1", "r2_test"),
("2", "some_other_value"),
("3", "hs_2_card"),
("4", "vsx_np_v2"),
("5", "r2_test"),
("2", "some_other_value2")
).toDF("id", "my_column")
I want to create a function that filters my dataframe based on the elements of this list, using contains on "my_column" (if the column value contains part of any string in the list, the row must be kept).
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
def filteredElements(df: DataFrame): DataFrame = {
  val elements = List("r2", "hs", "np")
  df.filter($"my_column".contains(elements))
}
But written like this it only works for a single element, not for a list.
How can I adapt it to use my list without having to chain multiple filters?
Below is the expected output when applying the function:
val output = test.transform(filteredElements)
expected =
("1", "r2_test"), // contains "rs"
("3", "hs_2_card"), // contains "hs"
("4", "vsx_np_v2"), // contains "np"
("5", "r2_test"), // contains "r2"
You can do it in one line without udf ( better for performance and simpler ):
df.filter(col("my_column").isNotNull).filter(row => elements.exists(row.getAs[String]("my_column").contains)).show()
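Wrapped into the function from the question, that one-liner might look like this (a sketch; the isNotNull guard just protects the getAs call from null values):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def filteredElements(df: DataFrame): DataFrame = {
  val elements = List("r2", "hs", "np")
  df.filter(col("my_column").isNotNull)
    .filter(row => elements.exists(row.getAs[String]("my_column").contains))
}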
One way to solve this would be to use a UDF. I think there should be some way to solve this with Spark SQL functions that I'm not aware of. Anyway, you can define a udf to tell whether a String contains any of the values in your elements list or not:
import org.apache.spark.sql.functions._
val elements = List("r2", "hs", "np")
val isContainedInList = udf { (value: String) =>
  elements.exists(e => value.indexOf(e) != -1)
}
You can use this udf in select, filter, basically anywhere you want:
def filteredElements(df: DataFrame): DataFrame = {
  df.filter(isContainedInList($"my_column"))
}
And the result is as expected:
+---+---------+
| id|my_column|
+---+---------+
| 1| r2_test|
| 3|hs_2_card|
| 4|vsx_np_v2|
| 5| r2_test|
+---+---------+
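As an aside, the "spark sql functions" variant the answer alludes to can be built by folding the list into a single Column predicate, with no UDF involved. A sketch using the same elements list and the test DataFrame from the question:
import org.apache.spark.sql.functions.{col, lit}

val elements = List("r2", "hs", "np")

// Build one boolean Column: my_column contains "r2" OR "hs" OR "np"
val predicate = elements
  .map(e => col("my_column").contains(e))
  .reduceOption(_ || _)
  .getOrElse(lit(false))

test.filter(predicate).show()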

How to write a One to Many Query with Scala Slick which returns something like this `(Model1, Option[Seq[Model2]])`

I know this question has been asked before, but I can't figure it out.
In my data model I have a news model containing an arbitrary number of images.
case class NewsDataModel(
  newsId: Option[Long],
  name: String,
  description: String,
  author: String,
  creationDateTime: Option[OffsetDateTime]
)

case class Image(
  id: Long,
  url: String,
  newsId: Long
)
Now I want to query my db to get something like this: (NewsDataModel, Option[Seq[Image]])
My query is currently implemented as follows:
val q = newsTable.joinLeft(imagesTable).on(_.newsId === _.newsId).result
db.run(q)
This evaluates to Future[Seq[(NewsDataModel, Option[Image])]]. I guess the right way to solve this would be to use the groupBy function, but I don't know how to implement it, since this
val q = newsTable.joinLeft(imagesTable).on(_.newsId === _.newsId)
  .groupBy(_._1.newsId)
  .result
db.run(q)
evaluates to Future[Seq[(Option[Long], Query[(NewsTable, Rep[Option[ImagesTable]]), (NewsDataModel, Option[Image]), Seq])]]
Slick won't automatically create that data structure for you. (I find it helpful to think of Slick in terms of rows and tables and what you can do in portable SQL, and not in terms of "object-relational mappers" or similar).
What you'll want to do is convert the rows into the format you want in Scala, after the database layer. There are many ways you can do that.
Here's one way to do that.
Given this example data...
scala> case class NewsDataModel(newsId: Long)
class NewsDataModel
scala> case class Image(id: Long)
class Image
scala> val results = Seq(
| ( NewsDataModel(1L), Some(Image(1L)) ),
| ( NewsDataModel(1L), Some(Image(10L)) ),
| ( NewsDataModel(1L), None ),
| ( NewsDataModel(2L), None ),
| ( NewsDataModel(3L), Some(Image(3L)) ),
| )
|
val results: Seq[(NewsDataModel, Option[Image])] = List((NewsDataModel(1),Some(Image(1))), (NewsDataModel(1),Some(Image(10))), (NewsDataModel(1),None), (NewsDataModel(2),None), (NewsDataModel(3),Some(Image(3))))
We can group by the key:
scala> val groups = results.groupBy { case (key, values) => key }
val groups: scala.collection.immutable.Map[NewsDataModel,Seq[(NewsDataModel, Option[Image])]] = HashMap(NewsDataModel(3) -> List((NewsDataModel(3),Some(Image(3)))), NewsDataModel(1) -> List((NewsDataModel(1),Some(Image(1))), (NewsDataModel(1),Some(Image(10))), (NewsDataModel(1),None)), NewsDataModel(2) -> List((NewsDataModel(2),None)))
And convert that into something like the type you want:
scala> val flat = groups.map { case (key, seq) => key -> seq.flatMap(_._2) }
val flat: scala.collection.immutable.Map[NewsDataModel,Seq[Image]] = HashMap(NewsDataModel(3) -> List(Image(3)), NewsDataModel(1) -> List(Image(1), Image(10)), NewsDataModel(2) -> List())
That flat result is a map, but you can turn it into (for example) a List with the type signature (close to) the type you want:
scala> flat.toList
val res18: List[(NewsDataModel, Seq[Image])] = List((NewsDataModel(3),List(Image(3))), (NewsDataModel(1),List(Image(1), Image(10))), (NewsDataModel(2),List()))
You can find lots of ways to do that, but the point is that you're doing it in Scala, not Slick (SQL). Note, in particular, that the groupBy method I've used is the Scala one from the collection library, not the Slick one (which would be a SQL GROUP BY clause). That is, I'm modifying the result of running the query, not modifying the query itself.
I'd suggest putting whatever conversion you want into a method and then applying it to the Slick action. For example:
def convert(input: Seq[(NewsDataModel, Option[Image])]): Seq[(NewsDataModel, Seq[Image])] =
  ??? // your implementation here

val action = newsTable.joinLeft(imagesTable).on(_.newsId === _.newsId).result
val convertedAction = action.map(convert)
db.run(convertedAction)
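For completeness, one possible body for convert, a sketch that follows the same groupBy/flatMap steps as the REPL session above (plain Scala collections, nothing Slick-specific):
def convert(input: Seq[(NewsDataModel, Option[Image])]): Seq[(NewsDataModel, Seq[Image])] =
  input
    .groupBy { case (news, _) => news }                       // group rows by the news item
    .map { case (news, rows) => news -> rows.flatMap(_._2) }  // drop the Nones, keep the images
    .toSeq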

Is there a way to create a UDF that accepts an array of two strings and pass those strings as two arguments to a function?

I'm new to Scala so pardon my poor penmanship.
I have a function func1 that accepts two strings and returns a string.
I also have a dataframe df1 that has 2 columns a1 and b1. I'm trying to create a new dataframe df2 with both the columns from df1 (a1 and b1) and a new column c1 that is the output of the function func1. I know I need to use UDF. I don't know how to create a UDF that can accept 2 columns and pass these two as parameters to func1 and return the output string (column c1).
Here's some of the things that I tried -
def func1(str1: String, str2: String): String = {
  //code
  return str3;
}

val df1 = spark.sql("select * from emp")
  .select("a1", "b1").cache()

val df2 = spark.sql("select * from df1")
  .withColumn("c1", func1("a1", "b1"))
  .select("a1", "b1").cache()
But I don't get the results. Please advise. Thanks in advance.
You basically have a syntax problem.
Remember that when you do def func1(str1:String, str2:String) : String = ... func1 refers to a Scala function object, and not a Spark expression.
On the other hand, .withColumn expects a Spark expression as its second argument.
So what happens is that your call to .withColumn("c1", func1("a1","b1")) sends Spark a Scala function object, whereas the Spark API expects a "Spark Expression" (e.g. a column, or operation on columns, such as a User Defined Function (UDF)).
Luckily, it is easy to transform a Scala function into a Spark UDF, generally speaking, by wrapping it in a call to Spark's udf method.
So a working example looks like this:
// A sample dataframe
val dataframe = Seq(("a", "b"), ("c", "d")).toDF("columnA", "columnB")
// An example scala function that actually does something (string concat)
def concat(first: String, second: String) = first+second
// A conversion from scala function to spark UDF :
val concatUDF = udf((first: String, second: String) => concat(first, second))
// A sample execution of the UDF
// note the $ sign, which is short for indicating a column name
dataframe.withColumn("concat", concatUDF($"columnA", $"columnB")).show
+-------+-------+------+
|columnA|columnB|concat|
+-------+-------+------+
| a| b| ab|
| c| d| cd|
+-------+-------+------+
From there on, it should be easy to adapt to your precise function and its arguments.
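Adapted to the columns in the question, the same pattern might look like this (a sketch; func1 is assumed to be the poster's own two-argument function and df1 the DataFrame with columns a1 and b1):
// Wrap the existing Scala function as a Spark UDF
val func1UDF = udf((str1: String, str2: String) => func1(str1, str2))

// Apply it, keeping the original columns and adding c1
val df2 = df1.withColumn("c1", func1UDF($"a1", $"b1"))
df2.show()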
Here is how you would do it:
scala> val df = Seq(("John","26"),("Bob","31")).toDF("a1","b1")
df: org.apache.spark.sql.DataFrame = [a1: string, b1: string]
scala> df.createOrReplaceTempView("emp")
scala> :paste
// Entering paste mode (ctrl-D to finish)
def func1(str1:String, str2:String) : String = {
val str3 = s" ${str1} is ${str2} years old"
return str3;
}
// Exiting paste mode, now interpreting.
func1: (str1: String, str2: String)String
scala> val my_udf_func1 = udf( func1(_:String,_:String):String )
my_udf_func1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> spark.sql("select * from emp").withColumn("c1", my_udf_func1($"a1",$"b1")).show(false)
2019-01-14 21:08:30 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
+----+---+---------------------+
|a1 |b1 |c1 |
+----+---+---------------------+
|John|26 | John is 26 years old|
|Bob |31 | Bob is 31 years old |
+----+---+---------------------+
scala>
There are two places where you need to correct it:
1. After defining the regular function, you need to register it as a udf:
val my_udf_func1 = udf( func1(_:String,_:String):String )
2. When calling the udf, you should use the $"a1" syntax, not simply "a1".

Programmatically adding several columns to Spark DataFrame

I'm using spark with scala.
I have a Dataframe with 3 columns: ID,Time,RawHexdata.
I have a user defined function which takes RawHexData and expands it into X more columns. It is important to state that for each row X is the same (the columns do not vary). However, before I receive the first data, I do not know what the columns are. But once I have the head, I can deduce it.
I would like a second Dataframe with said columns: Id,Time,RawHexData,NewCol1,...,NewCol3.
The "Easiest" method I can think of to do this is:
1. deserialize each row into json (every data tyoe is serializable here)
2. add my new columns,
3. deserialize a new dataframe from the altered json,
However, that seems like a waste, as it involves 2 costly and redundant json serialization steps. I am looking for a cleaner pattern.
Using case-classes, seems like a bad idea, because I don't know the number of columns, or the column names in advance.
What you can do to dynamically extend your DataFrame is to operate on the row RDD which you can obtain by calling dataFrame.rdd. Having a Row instance, you can access the RawHexdata column and parse the contained data. By adding the newly parsed columns to the resulting Row, you've almost solved your problem. The only thing necessary to convert a RDD[Row] back into a DataFrame is to generate the schema data for your new columns. You can do this by collecting a single RawHexdata value on your driver and then extracting the column types.
The following code illustrates this approach.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object App {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val input = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
    val dataFrame = input.toDF()
    dataFrame.show()

    // create the extended rows RDD
    val rowRDD = dataFrame.rdd.map { row =>
      val blob = row(1).asInstanceOf[Int]
      val newColumns: Seq[Any] = Seq(blob, blob * 2, blob * 3)
      Row.fromSeq(row.toSeq.init ++ newColumns)
    }

    val schema = dataFrame.schema

    // we know that the new columns are all integers
    val newColumns = StructType {
      Seq(new StructField("1", IntegerType), new StructField("2", IntegerType), new StructField("3", IntegerType))
    }

    val newSchema = StructType(schema.init ++ newColumns)
    val newDataFrame = sqlContext.createDataFrame(rowRDD, newSchema)
    newDataFrame.show()
  }
}
SELECT is your friend; it solves this without going back to the RDD.
case class Entry(Id: String, Time: Long)

val entries = Seq(
  Entry("x1", 100L),
  Entry("x2", 200L)
)

val newColumns = Seq("NC1", "NC2", "NC3")

val df = spark.createDataFrame(entries)
  .select(col("*") +: (newColumns.map(c => lit(null).as(c))): _*)
df.show(false)
+---+----+----+----+----+
|Id |Time|NC1 |NC2 |NC3 |
+---+----+----+----+----+
|x1 |100 |null|null|null|
|x2 |200 |null|null|null|
+---+----+----+----+----+

Scala: Using HashMap with a default value

I have a mutable HashMap and would like to use it like a default-dictionary. The obvious method appears to be to use getOrElse and provide the default value each time as a second value. However this seems a little inelegant in my use case since the default value doesn't change.
var x = HashMap(1 -> "b", 2 -> "a", 3 -> "c")
println(x.getOrElse(4, "_"))
println(x.getOrElse(5, "_"))
// And so on...
println(x.getOrElse(10, "_"))
Is there any way to create a HashMap (or similar class) such that attempting to access undefined keys returns a default value set on the creation of the HashMap? I notice that HashMap.default is just set to throw an exception but I wonder if this can be changed...
Wow, I happened to visit this thread exactly one year after I posted my last answer here. :-)
As of Scala 2.9.1, mutable.Map comes with a withDefaultValue method. REPL session:
scala> import collection.mutable
import collection.mutable
scala> mutable.Map[Int, String]().withDefaultValue("")
res18: scala.collection.mutable.Map[Int,String] = Map()
scala> res18(3)
res19: String = ""
Try this:
import collection.mutable.HashMap
val x = new HashMap[Int,String]() { override def default(key:Int) = "-" }
x += (1 -> "b", 2 -> "a", 3 -> "c")
Then:
scala> x(1)
res7: String = b
scala> x(2)
res8: String = a
scala> x(3)
res9: String = c
scala> x(4)
res10: String = -
scala> val x = HashMap(1 -> "b", 2 -> "a", 3 -> "c").withDefaultValue("-")
x: scala.collection.immutable.Map[Int,java.lang.String] = Map((1,b), (2,a), (3,c))
scala> x(3)
res0: java.lang.String = c
scala> x(5)
res1: java.lang.String = -
EDIT:
For mutable.HashMap, you could do the following:
scala> import collection.mutable
import collection.mutable
scala> val x = new mutable.HashMap[Int, String] {
| override def apply(key: Int) = super.get(key) getOrElse "-"
| }
x: scala.collection.mutable.HashMap[Int,String] = Map()
scala> x += (1 -> "a", 2 -> "b", 3 -> "c")
res9: x.type = Map((2,b), (1,a), (3,c))
scala> x(2)
res10: String = b
scala> x(4)
res11: String = -
There might be a better way to do this. Wait for others to respond.
I'm more of a java guy... but if getOrElse is not final, why don't you just extend HashMap and provide something like this:
override def getOrElse(k: Int, default: String) = {
return super.getOrElse(k,"_")
}
Note: syntax is probably screwed up but hopefully you'll get the point
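For what it's worth, a compiling sketch of that idea (note it only changes getOrElse; direct lookups like x(4) still throw unless you also override default or use withDefaultValue, as shown in the earlier answers):
import scala.collection.mutable

// Hard-wire the fallback inside getOrElse, ignoring the caller-supplied default
class UnderscoreMap extends mutable.HashMap[Int, String] {
  override def getOrElse[V1 >: String](key: Int, default: => V1): V1 =
    super.getOrElse(key, "_")
}

val x = new UnderscoreMap
x += (1 -> "b", 2 -> "a", 3 -> "c")
println(x.getOrElse(4, "anything")) // prints "_"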