Apache Spark Scala - apache-spark-sql

Apologies if this is a very obvious question, as I am a newbie in Spark. I have defined a function in Scala and created a UDF, but how can I run the same UDF on multiple columns? It seems withColumn can be used with just one column. Any pointers will be greatly appreciated. Moreover, how can I use lambda functions (Scala's USP) on Datasets/DataFrames?
def stringConvertor(prefix: String): Double = {
  prefix.takeRight(1) match {
    case "k" => prefix.dropRight(1).toDouble * 1000           // e.g. "1.2k" -> 1200
    case "m" => prefix.dropRight(1).toDouble * 1000000        // e.g. "3m"   -> 3000000
    case "n" => prefix.dropRight(2).toDouble * 1000000000     // e.g. "2bn"  -> 2000000000
    case "%" => prefix.dropRight(1).toDouble / 100            // e.g. "15%"  -> 0.15
    case ")" => prefix.dropRight(1).drop(1).toDouble * -1     // e.g. "(7)"  -> -7
    case "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" => prefix.toDouble
    case _ => 0
  }
}
val udfStringtoNumber = udf((s: String) => stringConvertor(s))
val parseString = udf(sp)
val spark = SparkSession.builder
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .master("local")
  .getOrCreate()
val df = spark.read.option("delimiter", "|").option("header", "true").option("inferSchema", "true")
  .csv("FileName.csv")
//df.printSchema()
import spark.implicits._
val newdf = df.withColumn("TradeCCY", parseString($"PriceCCY"))
val newDF = df.withColumn("CAvgVolume", udfStringtoNumber($"AvgVolume"))
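Since withColumn takes one column at a time, the usual trick for applying the same UDF to several columns is to fold over a list of column names. A minimal sketch, where the extra column names are only placeholders for whichever columns actually need converting:
import org.apache.spark.sql.functions.col
// Hypothetical list of columns that should all go through the same UDF.
val columnsToConvert = Seq("AvgVolume", "MarketCap", "ChangePct")
// Each foldLeft step adds one converted column and passes the new DataFrame on.
val converted = columnsToConvert.foldLeft(df) { (acc, name) =>
  acc.withColumn("C" + name, udfStringtoNumber(col(name)))
}
As for lambdas: udf((s: String) => stringConvertor(s)) above already wraps a plain Scala lambda, and typed Datasets accept lambdas directly in operators such as map and filter.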

Related

Scala + Spark: filter a dataset if it contains elements from a list

I have a dataset and I want to filter it based on a column.
val test = Seq(
  ("1", "r2_test"),
  ("2", "some_other_value"),
  ("3", "hs_2_card"),
  ("4", "vsx_np_v2"),
  ("5", "r2_test"),
  ("2", "some_other_value2")
).toDF("id", "my_column")
I want to create a function to filter my dataframe based on the elements of this list, using contains on "my_column" (if the column contains part of a string from the list, the row should pass the filter).
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def filteredElements(df: DataFrame): DataFrame = {
  val elements = List("r2", "hs", "np")
  df.filter($"my_column".contains(elements))
}
But like this, it won't work for a list, just for a single element.
How can I adapt it to use my list without having to do multiple filters?
Below is how the expected output should look when applying the function:
val output = test.transform(filteredElements)
expected =
  ("1", "r2_test"),   // contains "r2"
  ("3", "hs_2_card"), // contains "hs"
  ("4", "vsx_np_v2"), // contains "np"
  ("5", "r2_test"),   // contains "r2"
You can do it in one line without a UDF (better for performance and simpler):
df.filter(col("my_column").isNotNull)
  .filter(row => elements.exists(row.getAs[String]("my_column").contains))
  .show()
One way to solve this would be to use a UDF. I think there should be some way to solve this with Spark SQL functions that I'm not aware of. Anyway, you can define a UDF to tell whether a String contains any of the values in your elements list or not:
import org.apache.spark.sql.functions._

val elements = List("r2", "hs", "np")
val isContainedInList = udf { (value: String) =>
  elements.exists(e => value.indexOf(e) != -1)
}
You can use this udf in select, filter, basically anywhere you want:
def filteredElements(df: DataFrame): DataFrame = {
  df.filter(isContainedInList($"my_column"))
}
And the result is as expected:
+---+---------+
| id|my_column|
+---+---------+
| 1| r2_test|
| 3|hs_2_card|
| 4|vsx_np_v2|
| 5| r2_test|
+---+---------+
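The answer above suspects there is also a UDF-free route via built-in SQL functions. One possible sketch (not from the original answers; the function name filteredElementsNoUdf is made up here) is to turn the list into a single regex and use rlike:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def filteredElementsNoUdf(df: DataFrame): DataFrame = {
  // Builds the pattern "r2|hs|np"; rlike does a regex match, so alternation covers "contains any of".
  val pattern = List("r2", "hs", "np").mkString("|")
  df.filter(col("my_column").rlike(pattern))
}
This relies on the list entries containing no regex metacharacters; otherwise they would need escaping first.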

How do I create a new DataFrame based on an old DataFrame?

I have a csv file, dbname1.table1.csv:
| target             | source        | source_table                                | relation_type |
|--------------------|---------------|---------------------------------------------|---------------|
| avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust  | direct        |
| avg_ensure_sum_12m | protocol_dttm | custom_cib_ml_stg.p_overall_part_tend_cust  | direct        |
| avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust  | indirect      |
csv format for this table:
target,source,source_table,relation_type
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,protocol_dttm,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,indirect
Then I create a dataframe by reading it:
val dfDL = spark.read.option("delimiter", ",")
  .option("header", true)
  .csv(file.getPath.toUri.getPath)
Now I need to create a new dataframe based on dfDL.
The structure of the new dataframe looks like this:
case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)
The information for the fields of the new DataFrame is obtained from a csv file:
pseudocode:
schema_from = source_table.split(".")(0) // Example: custom_cib_ml_stg
table_from = source_table.split(".")(1) // Example: p_overall_part_tend_cust
column_from = source // Example: inn_num
link_type = relation_type // Example: direct
schema_to = "dbname1.table1.csv".split(".")(0) // Example: dbname1
table_to = "dbname1.table1.csv".split(".")(1) // Example: table1
column_to = target // Example: avg_ensure_sum_12m
I need to create this new dataframe, and I can't manage it on my own.
P.S. I need this dataframe to create a json file from it later.
Example JSON:
[{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"inn_num",
"link_type":"direct",
"schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"
},
{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"protocol_dttm",
"link_type":"direct","schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"}
I don't like my current implementation:
def readDLFromHDFS(file: LocatedFileStatus): Array[DataLink] = {
  val arrTableName = file.getPath.getName.split("\\.")
  val (schemaTo, tableTo) = (arrTableName(0), arrTableName(1))
  val dfDL = spark.read.option("delimiter", ",")
    .option("header", true)
    .csv(file.getPath.toUri.getPath)
  //val sourceTable = dfDL.select("source_table").collect().map(value => value.toString().split("."))
  dfDL.collect.map(row => DataLink(row.getString(2).split("\\.")(0),
                                   row.getString(2).split("\\.")(1),
                                   row.getString(1),
                                   row.getString(3),
                                   schemaTo,
                                   tableTo,
                                   row.getString(0)))
}

def toJSON(dataLinks: Array[DataLink]): Option[JValue] =
  dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)
}
You definitely don't want to collect; that defeats the point of using Spark here. As always with Spark, you have a lot of options. You could use RDDs, but I don't see a need to switch modes here. You just want to apply custom logic to some columns and end up with a dataframe containing the resulting columns.
First, define a UDF that you want to apply:
val convert = (target: String, source: String, source_table: String, relation_type: String) =>
  DataLink(source_table.split("\\.")(0),
           source_table.split("\\.")(1),
           source,
           relation_type,
           "dbname1.table1.csv".split("\\.")(0),
           "dbname1.table1.csv".split("\\.")(1),
           target)
Then apply this function to all the relevant columns (making sure you wrap it in udf to make it a spark function rather than a plain Scala function) and select the result:
df.select(udf(convert)($"target", $"source", $"source_table", $"relation_type"))
If you want a DataFrame with 7 columns as your result:
df.select(
  split(col("source_table"), "\\.").getItem(0),
  split(col("source_table"), "\\.").getItem(1),
  col("source"),
  col("relation_type"),
  lit("dbname1"),
  lit("table1"),
  col("target")
)
You can also add .as("column_name") to each of these 7 columns.
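For example, the same select with the aliases spelled out (a sketch; the names simply mirror the DataLink fields) could look like:
df.select(
  split(col("source_table"), "\\.").getItem(0).as("schema_from"),
  split(col("source_table"), "\\.").getItem(1).as("table_from"),
  col("source").as("column_from"),
  col("relation_type").as("link_type"),
  lit("dbname1").as("schema_to"),
  lit("table1").as("table_to"),
  col("target").as("column_to")
)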
You can use a Dataset directly.
import spark.implicits._
case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)
val filename = "dbname1.table1.csv"
val df = spark.read.option("header","true").csv("test.csv")
df.show(false)
+------------------+-------------+------------------------------------------+-------------+
|target |source |source_table |relation_type|
+------------------+-------------+------------------------------------------+-------------+
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|indirect |
+------------------+-------------+------------------------------------------+-------------+
df.createOrReplaceTempView("table")
val df2 = spark.sql(s"""
select split(source_table, '[.]')[0] as schema_from
, split(source_table, '[.]')[1] as table_from
, source as column_from
, relation_type as link_type
, split('${filename}', '[.]')[0] as schema_to
, split('${filename}', '[.]')[1] as table_to
, target as column_to
from table
""").as[DataLink]
df2.show()
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
| schema_from| table_from| column_from|link_type|schema_to|table_to| column_to|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
|custom_cib_ml_stg|p_overall_part_te...| inn_num| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...|protocol_dttm| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...| inn_num| indirect| dbname1| table1|avg_ensure_sum_12m|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
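From there, the JSON file mentioned in the question could be written straight from df2. A sketch, where the output path is only a placeholder:
// One JSON object per row; coalesce(1) keeps the result in a single part file.
df2.coalesce(1)
  .write
  .mode("overwrite")
  .json("output/dbname1_table1_links")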
My progress...
Now I can create the new DataFrame, but it contains only one column.
val dfDL = spark.read.option("delimiter", ",")
  .option("header", true)
  .csv(file.getPath.toUri.getPath)

val convertCase = (target: String, source: String, source_table: String, relation_type: String) =>
  DataLink(
    source_table.split("\\.")(0),
    source_table.split("\\.")(1),
    source,
    relation_type,
    schemaTo,
    tableTo,
    target
  )

val udfConvert = udf(convertCase)

val dfForJson = dfDL.select(udfConvert(col("target"),
                                       col("source"),
                                       col("source_table"),
                                       col("relation_type")))

Return empty expression in Exposed DAO?

I'm trying to conditionally add a part to my SQL query using Exposed's DAO API. My goal is to have:
SELECT * FROM table
WHERE column1 = 1
AND column2 = $value
AND column3 = 3
where the existence of the AND column2 = $value part depends on a filter.
I've tried with:
TableDAO.find {
    Table.column1 eq 1 and (
        when (filter.value) {
            null -> null // Type mismatch. Required: Expression<Boolean>. Found: Op<Boolean>?
            else -> Table.column2 eq filter.value
        }) and (
        Table.column3 eq 3
    )
}.map { it.toModel() }
but I can't find a way to return an empty expression or somehow exclude that part from the query. The only solution I can make work is something like
null -> Table.column2 neq -1
but I feel like there should be a better way.
You'll have to assign your expression to a local variable:
var expr = Table.column1 eq 1
if (filter.value != null) {
    expr = expr and (Table.column2 eq filter.value)
}
expr = expr and (Table.column3 eq 3)
I don't have my IDE in front of me, but this is the general idea. You can try to figure out something clever, but it would make your code unnecessarily complex.

creating sequences in ML

(* EmptySeq is assumed to be declared elsewhere; added here so the snippet is self-contained *)
exception EmptySeq;

datatype 'a seq = Nil | Cons of 'a * (unit -> 'a seq);
fun head (Cons(x, _)) = x
  | head Nil = raise EmptySeq;
fun tail (Cons(_, xf)) = xf()
  | tail Nil = raise EmptySeq;

datatype direction = Back | Forward;
datatype 'a bseq = bNil
                 | bCons of 'a * (direction -> 'a bseq);
fun bHead (bCons(x, _)) = x
  | bHead bNil = raise EmptySeq;
fun bForward (bCons(_, xf)) = xf(Forward)
  | bForward bNil = raise EmptySeq;
fun bBack (bCons(_, xf)) = xf(Back)
  | bBack bNil = raise EmptySeq;
So, after all those definitions, here is what I'm trying to do: I need to take two sequences and make them into one sequence that I can move Forward and Back on.
For example, if sequence 1 is 0 1 2 3 ... and sequence 2 is -1 -2 -3 -4 ..., I get ... -4 -3 -2 -1 0 1 2 3 ....
My current location must always be the first element of the sequence going "up".
This is what I've tried to do:
(*************************************************************)
local
  fun append_aux (Nil, yq) = yq
    | append_aux (Cons(x, xf), yq) = Cons(x, fn () => append_aux(xf(), yq))
  fun append (t, yq) = append_aux(Cons(t, fn () => Nil), yq)
in
  fun seq2bseq (Cons(x, xrev)) (Cons(y, ynorm)) =
        bCons(y, fn Farward => seq2bseq (append(y, Cons(x, xrev))) (ynorm())
                | Back => seq2bseq (xrev()) (append(x, Cons(y, ynorm))))
    | seq2bseq (_) (_) = bNil
end;
but I get an error "match redundant" for the "Back" match and I can't figure out why.

Scala: Using HashMap with a default value

I have a mutable HashMap and would like to use it like a default-dictionary. The obvious method appears to be to use getOrElse and provide the default value each time as the second argument. However, this seems a little inelegant in my use case since the default value doesn't change.
var x = HashMap(1 -> "b", 2 -> "a", 3 -> "c")
println(x.getOrElse(4, "_"))
println(x.getOrElse(5, "_"))
// And so on...
println(x.getOrElse(10, "_"))
Is there any way to create a HashMap (or similar class) such that attempting to access undefined keys returns a default value set on the creation of the HashMap? I notice that HashMap.default is just set to throw an exception but I wonder if this can be changed...
Wow, I happened to visit this thread exactly one year after I posted my last answer here. :-)
As of Scala 2.9.1, mutable.Map comes with a withDefaultValue method. REPL session:
scala> import collection.mutable
import collection.mutable
scala> mutable.Map[Int, String]().withDefaultValue("")
res18: scala.collection.mutable.Map[Int,String] = Map()
scala> res18(3)
res19: String = ""
Try this:
import collection.mutable.HashMap
val x = new HashMap[Int,String]() { override def default(key:Int) = "-" }
x += (1 -> "b", 2 -> "a", 3 -> "c")
Then:
scala> x(1)
res7: String = b
scala> x(2)
res8: String = a
scala> x(3)
res9: String = c
scala> x(4)
res10: String = -
scala> val x = HashMap(1 -> "b", 2 -> "a", 3 -> "c").withDefaultValue("-")
x: scala.collection.immutable.Map[Int,java.lang.String] = Map((1,b), (2,a), (3,c))
scala> x(3)
res0: java.lang.String = c
scala> x(5)
res1: java.lang.String = -
EDIT:
For mutable.HashMap, you could do the following:
scala> import collection.mutable
import collection.mutable
scala> val x = new mutable.HashMap[Int, String] {
| override def apply(key: Int) = super.get(key) getOrElse "-"
| }
x: scala.collection.mutable.HashMap[Int,String] = Map()
scala> x += (1 -> "a", 2 -> "b", 3 -> "c")
res9: x.type = Map((2,b), (1,a), (3,c))
scala> x(2)
res10: String = b
scala> x(4)
res11: String = -
There might be a better way to do this. Wait for others to respond.
I'm more of a Java guy... but if getOrElse is not final, why don't you just extend HashMap and provide something like this:
override def getOrElse(k: Int, default: String) = {
  super.getOrElse(k, "_")
}
Note: syntax is probably screwed up but hopefully you'll get the point