Scala: Using HashMap with a default value

I have a mutable HashMap and would like to use it like a default dictionary. The obvious approach is to use getOrElse and provide the default value as a second argument each time. However, this seems a little inelegant in my use case since the default value never changes.
var x = HashMap(1 -> "b", 2 -> "a", 3 -> "c")
println(x.getOrElse(4, "_"))
println(x.getOrElse(5, "_"))
// And so on...
println(x.getOrElse(10, "_"))
Is there any way to create a HashMap (or similar class) such that attempting to access undefined keys returns a default value set on the creation of the HashMap? I notice that HashMap.default is just set to throw an exception but I wonder if this can be changed...

Wow, I happened to visit this thread exactly one year after I posted my last answer here. :-)
As of Scala 2.9.1, mutable.Map comes with a withDefaultValue method. REPL session:
scala> import collection.mutable
import collection.mutable
scala> mutable.Map[Int, String]().withDefaultValue("")
res18: scala.collection.mutable.Map[Int,String] = Map()
scala> res18(3)
res19: String = ""
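If the default needs to depend on the key, mutable.Map also has withDefault, which takes a function instead of a fixed value; a minimal sketch:
val m = mutable.Map[Int, String]().withDefault(k => "missing-" + k)
m(42)  // "missing-42"; note that missing keys are still not stored in the map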

Try this:
import collection.mutable.HashMap
val x = new HashMap[Int,String]() { override def default(key:Int) = "-" }
x += (1 -> "b", 2 -> "a", 3 -> "c")
Then:
scala> x(1)
res7: String = b
scala> x(2)
res8: String = a
scala> x(3)
res9: String = c
scala> x(4)
res10: String = -

scala> val x = HashMap(1 -> "b", 2 -> "a", 3 -> "c").withDefaultValue("-")
x: scala.collection.immutable.Map[Int,java.lang.String] = Map((1,b), (2,a), (3,c))
scala> x(3)
res0: java.lang.String = c
scala> x(5)
res1: java.lang.String = -
EDIT:
For mutable.HashMap, you could do the following:
scala> import collection.mutable
import collection.mutable
scala> val x = new mutable.HashMap[Int, String] {
| override def apply(key: Int) = super.get(key) getOrElse "-"
| }
x: scala.collection.mutable.HashMap[Int,String] = Map()
scala> x += (1 -> "a", 2 -> "b", 3 -> "c")
res9: x.type = Map((2,b), (1,a), (3,c))
scala> x(2)
res10: String = b
scala> x(4)
res11: String = -
There might be a better way to do this. Wait for others to respond.

I'm more of a Java guy... but if getOrElse is not final, why don't you just extend HashMap and provide something like this:
override def getOrElse(k: Int, default: String) = {
  super.getOrElse(k, "_")
}
Note: syntax is probably screwed up but hopefully you'll get the point
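For reference, a version of that idea that does compile could look like the sketch below (the class name is just illustrative); the override has to keep the library's generic getOrElse signature, and it silently ignores whatever default the caller passes in:
import scala.collection.mutable

class UnderscoreMap extends mutable.HashMap[Int, String] {
  // The caller-supplied default is ignored; "_" is always used instead.
  override def getOrElse[V1 >: String](key: Int, default: => V1): V1 =
    super.getOrElse(key, "_")
}

val m = new UnderscoreMap
m += (1 -> "b")
m.getOrElse(7, "ignored")  // returns "_"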

Related

How to remove a String from a List of Strings in Kotlin?

I have a list of Strings:
val help = listOf("a","b","c")
I want to remove "b" from the list, but not by index, because I will receive the Strings in random order, like this:
val help = listOf("c","a","b")
How can I do this?
You can filter the List using the condition that the item does not equal "b", like this:
fun main(args: Array<String>) {
// example list
val help = listOf("a","b","c")
// item to be dropped / removed
val r = "b"
// print state before
println(help)
// create a new list filtering the source
val helped = help.filter { it != r }.toList()
// print the result
println(helped)
}
Output:
[a, b, c]
[a, c]
Lists by default aren't mutable. You should use mutable lists instead if you want that. Then you can simply do
val help = mutableListOf("a","b","c")
help.remove("b")
or you can do it like this if help really needs to be a non-mutable list
val help = listOf("a","b","c")
val newHelp = help.toMutableList().apply { remove("b") } // remove() itself returns a Boolean, so wrap it in apply to keep the list
Using a filter like in deHaar's answer is also possible.

How do I create a new DataFrame based on an old DataFrame?

I have csv file: dbname1.table1.csv:
|target            |source       |source_table                               |relation_type|
|------------------|-------------|-------------------------------------------|-------------|
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|indirect     |
csv format for this table:
target,source,source_table,relation_type
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,protocol_dttm,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,indirect
Then I create a dataframe by reading it:
val dfDL = spark.read.option("delimiter", ",")
.option("header", true)
.csv(file.getPath.toUri.getPath)
Now I need to create a new dataframe based on dfDL.
The structure of the new dataframe looks like this:
case class DataLink(schema_from: String,
table_from: String,
column_from: String,
link_type: String,
schema_to: String,
table_to: String,
column_to: String)
The information for the fields of the new DataFrame comes from the csv file; in pseudocode:
schema_from = source_table.split(".")(0) // Example: custom_cib_ml_stg
table_from = source_table.split(".")(1) // Example: p_overall_part_tend_cust
column_from = source // Example: inn_num
link_type = relation_type // Example: direct
schema_to = "dbname1.table1.csv".split(".")(0) // Example: dbname1
table_to = "dbname1.table1.csv".split(".")(1) // Example: table1
column_to = target // Example: avg_ensure_sum_12m
I need to create this new dataframe, but I can't manage it on my own.
P.S. I need this dataframe to create a json file from it later.
Example JSON:
[{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"inn_num",
"link_type":"direct",
"schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"
},
{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"protocol_dttm",
"link_type":"direct","schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"}
I don't like my current implementation:
def readDLFromHDFS(file: LocatedFileStatus): Array[DataLink] = {
  val arrTableName = file.getPath.getName.split("\\.")
  val (schemaTo, tableTo) = (arrTableName(0), arrTableName(1))

  val dfDL = spark.read.option("delimiter", ",")
    .option("header", true)
    .csv(file.getPath.toUri.getPath)

  //val sourceTable = dfDL.select("source_table").collect().map(value => value.toString().split("."))

  dfDL.collect.map(row => DataLink(row.getString(2).split("\\.")(0),
                                   row.getString(2).split("\\.")(1),
                                   row.getString(1),
                                   row.getString(3),
                                   schemaTo,
                                   tableTo,
                                   row.getString(0)))
}

def toJSON(dataLinks: Array[DataLink]): Option[JValue] =
  dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)
You definitely don't want to collect; that defeats the point of using Spark here. As always with Spark you have a lot of options. You could use RDDs, but I don't see a need to switch modes here. You just want to apply custom logic to some columns and end up with a dataframe containing just the resulting column.
First, define the function that you want to apply:
def convert(target: String, source: String, source_table: String, relation_type: String): DataLink =
  DataLink(
    source_table.split("\\.")(0),
    source_table.split("\\.")(1),
    source,
    relation_type,
    "dbname1.table1.csv".split("\\.")(0),
    "dbname1.table1.csv".split("\\.")(1),
    target)
Then apply this function to all the relevant columns (making sure you wrap it in udf to make it a spark function rather than a plain Scala function) and select the result:
df.select(udf(convert _)($"target", $"source", $"source_table", $"relation_type"))
If you want a DataFrame with 7 columns as your result:
df.select(
  split(col("source_table"), "\\.").getItem(0),
  split(col("source_table"), "\\.").getItem(1),
  col("source"),
  col("relation_type"),
  lit("dbname1"),
  lit("table1"),
  col("target")
)
You can also add .as("column_name") to each of these 7 columns.
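For instance, a sketch of that same select with aliases on all seven columns (imports shown for completeness; the literal schema and table names follow the question's file name):
import org.apache.spark.sql.functions.{col, lit, split}

val dfLinks = df.select(
  split(col("source_table"), "\\.").getItem(0).as("schema_from"),
  split(col("source_table"), "\\.").getItem(1).as("table_from"),
  col("source").as("column_from"),
  col("relation_type").as("link_type"),
  lit("dbname1").as("schema_to"),
  lit("table1").as("table_to"),
  col("target").as("column_to")
)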
You can use a Dataset directly.
import spark.implicits._
case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)
val filename = "dbname1.table1.csv"
val df = spark.read.option("header","true").csv("test.csv")
df.show(false)
+------------------+-------------+------------------------------------------+-------------+
|target |source |source_table |relation_type|
+------------------+-------------+------------------------------------------+-------------+
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct |
|avg_ensure_sum_12m|inn_num |custom_cib_ml_stg.p_overall_part_tend_cust|indirect |
+------------------+-------------+------------------------------------------+-------------+
df.createOrReplaceTempView("table")
val df2 = spark.sql(s"""
select split(source_table, '[.]')[0] as schema_from
, split(source_table, '[.]')[1] as table_from
, source as column_from
, relation_type as link_type
, split('${filename}', '[.]')[0] as schema_to
, split('${filename}', '[.]')[1] as table_to
, target as column_to
from table
""").as[DataLink]
df2.show()
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
| schema_from| table_from| column_from|link_type|schema_to|table_to| column_to|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
|custom_cib_ml_stg|p_overall_part_te...| inn_num| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...|protocol_dttm| direct| dbname1| table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...| inn_num| indirect| dbname1| table1|avg_ensure_sum_12m|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
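Since the end goal is a JSON file, the typed result can be written out directly with Spark's JSON writer; a minimal sketch (the output path is just an example):
// coalesce(1) produces a single part file; drop it if multiple files are fine
df2.coalesce(1)
  .write
  .mode("overwrite")
  .json("output/dbname1_table1_links")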
My progress...
Now I can create the new DataFrame, but it contains only one column.
val dfDL = spark.read.option("delimiter", ",")
  .option("header", true)
  .csv(file.getPath.toUri.getPath)

val convertCase = (target: String, source: String, source_table: String, relation_type: String) =>
  DataLink(
    source_table.split("\\.")(0),
    source_table.split("\\.")(1),
    source,
    relation_type,
    schemaTo,
    tableTo,
    target
  )

val udfConvert = udf(convertCase)

val dfForJson = dfDL.select(udfConvert(col("target"),
                                       col("source"),
                                       col("source_table"),
                                       col("relation_type")))

Replace multiple chars with multiple chars in string

I am looking for a way to replace multiple different characters with corresponding different characters in Kotlin.
As an example, I am looking for something similar to this function in PHP:
str_replace(["ā", "ē", "ī", "ō", "ū"], ["a","e","i","o","u"], word)
In Kotlin right now I am just calling the same function 5 times (once for every vowel), like this:
var newWord = word.replace("ā", "a")
newWord = newWord.replace("ē", "e")
newWord = newWord.replace("ī", "i")
newWord = newWord.replace("ō", "o")
newWord = newWord.replace("ū", "u")
Which of course might not be the best option, if I have to do this with a list of words and not just one word. Is there a way to do that?
You can maintain a character mapping and replace the required characters by iterating over each character in the word.
val map = mapOf('ā' to 'a', 'ē' to 'e' ......)
val newword = word.map { map.getOrDefault(it, it) }.joinToString("")
If you want to do it for multiple words, you can create an extension function for better readability
fun String.replaceChars(replacement: Map<Char, Char>) =
map { replacement.getOrDefault(it, it) }.joinToString("")
val map = mapOf('ā' to 'a', 'ē' to 'e', .....)
val newword = word.replaceChars(map)
Just adding another way, using zip with a transform function:
val l1 = listOf("ā", "ē", "ī", "ō", "ū")
val l2 = listOf("a", "e", "i", "o", "u")
l1.zip(l2) { a, b -> word = word.replace(a, b) }
l1.zip(l2) will build a List<Pair<String,String>>, which is:
[(ā, a), (ē, e), (ī, i), (ō, o), (ū, u)]
And the transform function { a, b -> word = word.replace(a, b) } gives you access to the items of both lists pairwise (l1 -> a, l2 -> b).

Is there a way to create a UDF that accepts an array of two strings and pass those strings as two arguments to a function?

I'm new to Scala so pardon my poor penmanship.
I have a function func1 that accepts two strings and returns a string.
I also have a dataframe df1 that has 2 columns a1 and b1. I'm trying to create a new dataframe df2 with both the columns from df1 (a1 and b1) and a new column c1 that is the output of the function func1. I know I need to use UDF. I don't know how to create a UDF that can accept 2 columns and pass these two as parameters to func1 and return the output string (column c1).
Here are some of the things that I tried:
def func1(str1: String, str2: String): String = {
  //code
  return str3;
}

val df1 = spark.sql("select * from emp")
  .select("a1", "b1").cache()

val df2 = spark.sql("select * from df1")
  .withColumn("c1", func1("a1","b1"))
  .select("a1", "b1").cache()
But I don't get the results. Please advise. Thanks in advance.
You basically have a syntax problem.
Remember that when you do def func1(str1:String, str2:String) : String = ... func1 refers to a Scala function object, and not a Spark expression.
On the other hand, .withColumn expects a Spark expression as its second argument.
So what happens is that your call to .withColumn("c1", func1("a1","b1")) sends Spark a Scala function object, whereas the Spark API expects a "Spark Expression" (e.g. a column, or operation on columns, such as a User Defined Function (UDF)).
Luckily, it is easy to transform a Scala function into a Spark UDF, generally speaking, by wrapping it in a call to Spark's udf method.
So a working example could look like this:
import spark.implicits._
import org.apache.spark.sql.functions.udf

// A sample dataframe
val dataframe = Seq(("a", "b"), ("c", "d")).toDF("columnA", "columnB")
// An example scala function that actually does something (string concat)
def concat(first: String, second: String) = first + second
// A conversion from scala function to spark UDF:
val concatUDF = udf((first: String, second: String) => concat(first, second))
// A sample execution of the UDF
// note the $ sign, which is short for indicating a column name
dataframe.withColumn("concat", concatUDF($"columnA", $"columnB")).show
+-------+-------+------+
|columnA|columnB|concat|
+-------+-------+------+
| a| b| ab|
| c| d| cd|
+-------+-------+------+
From there on, it should be easy to adapt to your precise function and its arguments.
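Applied to the question's columns, and assuming func1 is defined as in the question, the adaptation could look like this:
import spark.implicits._
import org.apache.spark.sql.functions.udf

// Wrap the existing two-argument function as a UDF and use it on columns a1 and b1.
val func1UDF = udf((s1: String, s2: String) => func1(s1, s2))
val df2 = df1.withColumn("c1", func1UDF($"a1", $"b1"))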
Here is how you would do it:
scala> val df = Seq(("John","26"),("Bob","31")).toDF("a1","b1")
df: org.apache.spark.sql.DataFrame = [a1: string, b1: string]
scala> df.createOrReplaceTempView("emp")
scala> :paste
// Entering paste mode (ctrl-D to finish)
def func1(str1:String, str2:String) : String = {
val str3 = s" ${str1} is ${str2} years old"
return str3;
}
// Exiting paste mode, now interpreting.
func1: (str1: String, str2: String)String
scala> val my_udf_func1 = udf( func1(_:String,_:String):String )
my_udf_func1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> spark.sql("select * from emp").withColumn("c1", my_udf_func1($"a1",$"b1")).show(false)
2019-01-14 21:08:30 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
+----+---+---------------------+
|a1 |b1 |c1 |
+----+---+---------------------+
|John|26 | John is 26 years old|
|Bob |31 | Bob is 31 years old |
+----+---+---------------------+
scala>
Two places where you need to correct it:
First, after defining the regular function, you need to register it with udf():
val my_udf_func1 = udf( func1(_:String,_:String):String )
Second, when calling the UDF you should use the $"a1" column syntax, not simply "a1".

Programmatically adding several columns to Spark DataFrame

I'm using spark with scala.
I have a Dataframe with 3 columns: ID,Time,RawHexdata.
I have a user defined function which takes RawHexData and expands it into X more columns. It is important to state that for each row X is the same (the columns do not vary). However, before I receive the first data, I do not know what the columns are. But once I have the head, I can deduce it.
I would like a second Dataframe with said columns: Id,Time,RawHexData,NewCol1,...,NewCol3.
The "Easiest" method I can think of to do this is:
1. deserialize each row into json (every data tyoe is serializable here)
2. add my new columns,
3. deserialize a new dataframe from the altered json,
However, that seems like a waste, as it involves 2 costly and redundant json serialization steps. I am looking for a cleaner pattern.
Using case classes seems like a bad idea, because I don't know the number of columns or the column names in advance.
What you can do to dynamically extend your DataFrame is to operate on the row RDD which you can obtain by calling dataFrame.rdd. Having a Row instance, you can access the RawHexdata column and parse the contained data. By adding the newly parsed columns to the resulting Row, you've almost solved your problem. The only thing necessary to convert a RDD[Row] back into a DataFrame is to generate the schema data for your new columns. You can do this by collecting a single RawHexdata value on your driver and then extracting the column types.
The following code illustrates this approach.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object App {

  case class Person(name: String, age: Int)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val input = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
    val dataFrame = input.toDF()
    dataFrame.show()

    // create the extended rows RDD
    val rowRDD = dataFrame.rdd.map {
      row =>
        val blob = row(1).asInstanceOf[Int]
        val newColumns: Seq[Any] = Seq(blob, blob * 2, blob * 3)
        Row.fromSeq(row.toSeq.init ++ newColumns)
    }

    val schema = dataFrame.schema

    // we know that the new columns are all integers
    val newColumns = StructType {
      Seq(new StructField("1", IntegerType), new StructField("2", IntegerType), new StructField("3", IntegerType))
    }

    val newSchema = StructType(schema.init ++ newColumns)

    val newDataFrame = sqlContext.createDataFrame(rowRDD, newSchema)
    newDataFrame.show()
  }
}
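As the prose above mentions, the new columns' types can also be derived from a single sample RawHexdata value instead of being hardcoded; a sketch under the assumption that a frame with a RawHexdata string column (here rawDataFrame) exists and that parseHex is the user's parser, stubbed out below:
import org.apache.spark.sql.types._

// Hypothetical stand-in for the real RawHexdata parser: returns (column name, value) pairs.
def parseHex(hex: String): Seq[(String, Any)] =
  Seq("NewCol1" -> hex.length, "NewCol2" -> hex.take(2))

// Derive the new columns' schema from one sample value collected on the driver.
val sample = rawDataFrame.select("RawHexdata").first().getAs[String](0)
val newFields = parseHex(sample).map { case (name, value) =>
  val dataType = value match {
    case _: Int    => IntegerType
    case _: Long   => LongType
    case _: Double => DoubleType
    case _         => StringType
  }
  StructField(name, dataType)
}
val extendedSchema = StructType(rawDataFrame.schema.fields ++ newFields)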
SELECT is your friend; it solves this without going back to RDDs.
import org.apache.spark.sql.functions.{col, lit}

case class Entry(Id: String, Time: Long)

val entries = Seq(
  Entry("x1", 100L),
  Entry("x2", 200L)
)

val newColumns = Seq("NC1", "NC2", "NC3")

val df = spark.createDataFrame(entries)
  .select(col("*") +: (newColumns.map(c => lit(null).as(c))): _*)
df.show(false)
+---+----+----+----+----+
|Id |Time|NC1 |NC2 |NC3 |
+---+----+----+----+----+
|x1 |100 |null|null|null|
|x2 |200 |null|null|null|
+---+----+----+----+----+
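As a follow-up, once the real frame with the RawHexdata column is at hand, those null placeholders can be replaced with derived values; a sketch in which rawDf is an assumed input and the substring offsets are purely illustrative:
import org.apache.spark.sql.functions.{col, substring}

// rawDf is assumed to contain Id, Time and RawHexdata.
val populated = rawDf
  .withColumn("NC1", substring(col("RawHexdata"), 1, 2))
  .withColumn("NC2", substring(col("RawHexdata"), 3, 2))
  .withColumn("NC3", substring(col("RawHexdata"), 5, 2))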