Spark 2.0: A named function inside mapGroups for sql.KeyValueGroupedDataset causes java.io.NotSerializableException - apache-spark-sql

An anonymous function works fine.
The following code sets up the problem:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.appName("demo").getOrCreate()
val sc = sparkSession.sparkContext
import sparkSession.implicits._
case class DemoRow(keyId: Int, evenOddId: Int)
case class EvenOddCountRow(keyId: Int, oddCnt: Int, evenCnt: Int)
val demoDS = sc.parallelize(Seq(
  DemoRow(1, 1),
  DemoRow(1, 2),
  DemoRow(1, 3),
  DemoRow(2, 1),
  DemoRow(2, 2))).toDS()
The output of demoDS.show():
+-----+---------+
|keyId|evenOddId|
+-----+---------+
|    1|        1|
|    1|        2|
|    1|        3|
|    2|        1|
|    2|        2|
+-----+---------+
Using the anonymous function id => id % 2 == 1 inside mapGroups() works fine:
val demoGroup = demoDS.groupByKey(_.keyId).mapGroups((key, iter) => {
  val evenOddIds = iter.map(_.evenOddId).toList
  val (oddIds, evenIds) = evenOddIds.partition(id => id % 2 == 1)
  EvenOddCountRow(key, oddIds.size, evenIds.size)
})
The result of demoGroup.show() is what we expected:
+-----+------+-------+
|keyId|oddCnt|evenCnt|
+-----+------+-------+
|    1|     2|      1|
|    2|     1|      1|
+-----+------+-------+
Now if I define the isOdd function and use it inside mapGroups() as below, it raises an exception:
def isOdd(id: Int) = id % 2 == 1
val demoGroup = demoDS.groupByKey(_.keyId).mapGroups((key, iter) => {
  val evenOddIds = iter.map(_.evenOddId).toList
  val (oddIds, evenIds) = evenOddIds.partition(isOdd)
  EvenOddCountRow(key, oddIds.size, evenIds.size)
})
Caused by: java.io.NotSerializableException: scala.collection.LinearSeqLike$$anon$1
I tried different ways of defining the isOdd function to make it serializable:
val isOdd = (id: Int) => id % 2 == 1 // does not work

case object isOdd extends Function[Int, Boolean] with Serializable {
  def apply(id: Int) = id % 2 == 1
} // still does not work
Did I miss anything, or is something wrong? Thanks in advance!

The following works for me:
object Utils {
  def isOdd(id: Int) = id % 2 == 1
}
And then use:
evenOddIds.partition(Utils.isOdd)
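For reference, combining this fix with the question's mapGroups call gives roughly the following (just the two snippets above put together):
object Utils {
  def isOdd(id: Int) = id % 2 == 1
}

val demoGroup = demoDS.groupByKey(_.keyId).mapGroups((key, iter) => {
  val evenOddIds = iter.map(_.evenOddId).toList
  // helper now lives in a top-level object instead of the enclosing scope
  val (oddIds, evenIds) = evenOddIds.partition(Utils.isOdd)
  EvenOddCountRow(key, oddIds.size, evenIds.size)
})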

Related

Transform list of map to dataframe

I have the following data:
d = Some(List(Map(id -> 1, n -> Hi), Map(id -> 2, n -> Hello)))
I would like to transform it into a dataframe like the following:
+---+-----+
| id|    n|
+---+-----+
|  1|   Hi|
|  2|Hello|
+---+-----+
I tried the following:
import spark.implicits._
val df = d
  .map(m => (m.get("id"), m.get("n")))
  .toDF("id", "n")
But I'm getting:
error: value get is not a member of Any
.map( m => (m.get("id"),m.get("n")))
Your top level here is an Option, and I think that's why you can't handle it with a single map. I managed to do it with something like this:
import spark.implicits._
val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
val data = d.fold(List.empty[(Option[String], Option[String])])(_.map(m => (m.get("id"), m.get("n"))))
val df = data.toDF("id", "n")
df.show()
Output:
+---+-----+
| id|    n|
+---+-----+
|  1|   Hi|
|  2|Hello|
+---+-----+
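As a side note, since the Option wrapper is the only obstacle here, another sketch of the same idea (untested) is to flatten it away before mapping:
import spark.implicits._

val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
// Option[List[Map[...]]] -> List[Map[...]]; a None simply becomes an empty list
val df = d.toList.flatten
  .map(m => (m.get("id"), m.get("n")))
  .toDF("id", "n")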

How to use separate key lists to perform a join between two DataFrames?

I want to join two different DataFrames (dfA and dfB) built as follows:
dfA.show()
+-----+-------+-------+
| id_A| name_A|address|
+-----+-------+-------+
| 1| AAAA| Paris|
| 4| DDDD| Sydney|
+-----+-------+-------+
dfB.show()
+-----+-------+---------+
| id_B| name_B| job|
+-----+-------+---------+
| 1| AAAA| Analyst|
| 2| AERF| Engineer|
| 3| UOPY| Gardener|
| 4| DDDD| Insurer|
+-----+-------+---------+
I need to use the following lists in order to do the join:
val keyListA = List("id_A", "name_A")
val keyListB = List("id_B", "name_B")
A simple solution would be:
val join = dfA.join(dfB,
  dfA("id_A") === dfB("id_B") &&
  dfA("name_A") === dfB("name_B"),
  "left_outer")
Is there a syntax that would allow you to do this join by using the keyListA and keyListB lists?
If you really want to build your join expression from lists of column names:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._
val dfA: DataFrame = ???
val dfB: DataFrame = ???
val keyListA = List("id_A", "name_A", "property1_A", "property2_A", "property3_A")
val keyListB = List("id_B", "name_B", "property1_B", "property2_B", "property3_B")
def joinExprsFrom(keyListA: List[String], keyListB: List[String]): Column =
  keyListA
    .zip(keyListB)
    .map { case (fromA, fromB) => col(fromA) === col(fromB) }
    .reduce((acc, expr) => acc && expr)

dfA.join(
  dfB,
  joinExprsFrom(keyListA, keyListB),
  "left_outer")
You need to make sure keyListA and keyListB are the same size and non-empty.
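A minimal way to enforce that precondition inside the helper (an illustrative addition, not part of the original answer) is a require guard before zipping:
def joinExprsFrom(keyListA: List[String], keyListB: List[String]): Column = {
  // fail fast with a clear message instead of a confusing reduce/zip error
  require(keyListA.nonEmpty && keyListA.size == keyListB.size,
    s"Key lists must be non-empty and the same size, got ${keyListA.size} and ${keyListB.size}")
  keyListA
    .zip(keyListB)
    .map { case (fromA, fromB) => col(fromA) === col(fromB) }
    .reduce((acc, expr) => acc && expr)
}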

Using Spark SQL and Spark DataFrames, how can we find the COLUMN NAME based on the minimum value in a row?

I have a dataframe df with 4 columns:
+-------+-------+-------+-------+
| dist1 | dist2 | dist3 | dist4 |
+-------+-------+-------+-------+
|    42 |    53 |    24 |    17 |
+-------+-------+-------+-------+
The output I want is:
dist4
It seems easy, but I did not find any proper solution using a dataframe or a Spark SQL query.
You may use the least function:
select least(dist1,dist2,dist3,dist4) as min_dist
from yourTable;
For the opposite case, greatest may be used.
EDIT:
To detect the column names, the following may be used to get rows:
select inline(array(struct(42, 'dist1'), struct(53, 'dist2'),
struct(24, 'dist3'), struct(17, 'dist4') ))
42 dist1
53 dist2
24 dist3
17 dist4
and then the min function may be applied to get dist4.
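For completeness, least is also available through the DataFrame API; a sketch against the df from the question (assuming the four dist columns above) would be:
import org.apache.spark.sql.functions.{col, least}

// least() picks the smallest value across the given columns, row by row
val withMin = df.withColumn("min_dist", least(col("dist1"), col("dist2"), col("dist3"), col("dist4")))
withMin.show()
// expected min_dist for the sample row: 17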
Try this,
df.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
// Tag every value with its column name, e.g. 1 in column A becomes "1,A"
val temp_df = df.columns.foldLeft(df) { (acc: DataFrame, colName: String) => acc.withColumn(colName, concat(col(colName), lit("," + colName))) }
// Take the smallest "value,name" string (string ordering) and keep the column name part
val minval = udf((ar: Seq[String]) => ar.min.split(",")(1))
val result = temp_df.withColumn("least", split(concat_ws(":", temp_df.columns.map(col(_)): _*), ":")).withColumn("least_col", minval(col("least")))
result.show
+---+---+---+---+--------------------+---------+
| A| B| C| D| least|least_col|
+---+---+---+---+--------------------+---------+
|1,A|2,B|3,C|4,D|[1,A, 2,B, 3,C, 4,D]| A|
|5,A|4,B|3,C|1,D|[5,A, 4,B, 3,C, 1,D]| D|
+---+---+---+---+--------------------+---------+
The RDD way, without udf()s:
scala> import scala.collection.mutable.WrappedArray
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types.{StructField, StringType}
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.withColumn("arr", array(df.columns.map(col(_)):_*))
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 3 more fields]
scala> val rowarr = df.columns
rowarr: Array[String] = Array(A, B, C, D)
scala> val rdd1 = df2.rdd.map( x=> {val p = x.getAs[WrappedArray[Int]]("arr").toArray; val q=rowarr(p.indexWhere(_==p.min));Row.merge(x,Row(q)) })
rdd1: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[83] at map at <console>:47
scala> spark.createDataFrame(rdd1,df2.schema.add(StructField("mincol",StringType))).show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala>
You can do something like this:
import org.apache.spark.sql.functions._
val cols = df.columns
val u1 = udf((s: Seq[Int]) => cols(s.zipWithIndex.min._2))
df.withColumn("res", u1(array("*")))
You could access the row's schema, retrieve a list of field names from it, look up the row's values by name, and then work it out that way.
See: https://spark.apache.org/docs/2.3.2/api/scala/index.html#org.apache.spark.sql.Row
It would look roughly like this
dataframe.map(
  row => {
    val schema = row.schema
    val fieldNames: List[String] = ??? // extract names from schema
    fieldNames.foldLeft(("", 0))(???) // retrieve each field's value using its name and retain the minimum
  }
)
This would yield a Dataset[String]
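One possible way to fill in those placeholders (an illustrative sketch using minBy instead of foldLeft, assuming the Int-typed df from the earlier answers):
import spark.implicits._ // provides the String encoder for the resulting Dataset

val minColNames = df.map { row =>
  val fieldNames: List[String] = row.schema.fieldNames.toList
  // pair each column name with its value and keep the name of the smallest value
  fieldNames.minBy(name => row.getAs[Int](name))
}
minColNames.show()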

How to generate unique sequence numbers to replace null values in a column of a table in Spark Scala

I am facing difficulty generating unique sequence numbers to replace the null values in a column of a table. The table is obtained by joining two other tables, and the column is the primary key column whose null values are to be replaced with unique sequence values.
I tried using accumulators, but I ran into difficulty when running the program on a multi-node cluster.
val joined=csv2.join(csv,csv2("ACCT_PRDCT_CD")===csv("ACCT_PRDCT_CD"),"left_outer")
joined.filter("ACCT_CO_NO is null").show
val k=joined.withColumn("Acc_flag", when($"ACCT_CO_NO".isNull,0).otherwise($"ACCT_CO_NO"))
var a = 1
def generate(s: Int): Int = {
  if (s == 0) {
    a = a + 1
    return a
  }
  else {
    return s
  }
}
val generateNum = udf(generate(_:Int))
val newjoined=k.withColumn("n",generateNum($"ACC_flag"))
If I understand your requirement correctly, consider using monotonically_increasing_id or RDD's zipWithIndex. To avoid collision, the generated sequence numbers will then be added to a number greater than the maximum column value before replacing the nulls.
import org.apache.spark.sql.functions._
val dfL = Seq(
(1, "a"),
(2, "b"),
(3, "c"),
(4, "d"),
(5, "e"),
(6, "f")
).toDF("c1", "c2")
val dfR = Seq(
(1, 100L),
(2, 200L),
(3, 300L)
).toDF("c1", "c2")
val c2max = dfR.select(max($"c2")).first.getLong(0)
// c2max: Long = 300
val dfJoined = dfL.join(dfR, Seq("c1"), "left").
select(dfL("c1"), dfR("c2"))
METHOD 1: using monotonically_increasing_id
dfJoined.withColumn( "c2x", when(col("c2").isNotNull, col("c2")).
otherwise(monotonically_increasing_id + c2max + 1)
).
show
// +---+----+-----------+
// | c1| c2| c2x|
// +---+----+-----------+
// | 1| 100| 100|
// | 2| 200| 200|
// | 3| 300| 300|
// | 4|null|25769804077|
// | 5|null|34359738669|
// | 6|null|42949673261|
// +---+----+-----------+
Note that the generated sequence numbers aren't necessarily consecutive.
METHOD 2: using RDD's zipWithIndex
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd = dfJoined.rdd.zipWithIndex.
map{ case (row: Row, idx: Long) => Row.fromSeq(row.toSeq :+ idx) }
spark.createDataFrame(rdd,
StructType(dfJoined.schema.fields :+ StructField("idx", LongType))
).
select( $"c1", $"c2",
when(col("c2").isNotNull, col("c2")).otherwise($"idx" + c2max + 1).
as("c2x")
).
show
// +---+----+---+
// | c1| c2|c2x|
// +---+----+---+
// | 1| 100|100|
// | 2| 200|200|
// | 3| 300|300|
// | 4|null|304|
// | 5|null|305|
// | 6|null|306|
// +---+----+---+
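Mapped back onto the column names from the question (an untested sketch; it assumes the joined DataFrame from the question and that ACCT_CO_NO is a Long column):
import org.apache.spark.sql.functions._

// METHOD 1 applied to the asker's join: offset the generated ids by the current maximum
val acctMax = joined.select(max($"ACCT_CO_NO")).first.getLong(0)
val filled = joined.withColumn("ACCT_CO_NO",
  when($"ACCT_CO_NO".isNotNull, $"ACCT_CO_NO")
    .otherwise(monotonically_increasing_id() + acctMax + 1))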

Adding an extra column that represents the difference between the closest difference of a previous column

My scenario might be more easily explained through an example. Say I had the following data:
Type Time
A 1
B 3
A 5
B 9
I want to add an extra column to each row that represents the minimum absolute difference between all rows of the same type. So for the first row, the minimum difference between all times of type A is 4, so the value would be 4 for rows 1 and 3, and likewise 6 for rows 2 and 4.
I am doing this in Spark and Spark SQL, so guidance there would be more useful, but if it needs to be explained through plain SQL, that would be a great help as well.
One possible approach is to use window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, min, abs}
val df = Seq(
("A", -10), ("A", 1), ("A", 5), ("B", 3), ("B", 9)
).toDF("type", "time")
First, let's determine the difference between consecutive rows sorted by time:
// Partition by type and sort by time
val w1 = Window.partitionBy($"Type").orderBy($"Time")
// Difference between this and previous
val diff = $"time" - lag($"time", 1).over(w1)
Then find the minimum over all diffs for a given type:
// Partition by time unordered and take unbounded window
val w2 = Window.partitionBy($"Type").rowsBetween(Long.MinValue, Long.MaxValue)
// Minimum difference over type
val minDiff = min(diff).over(w2)
df.withColumn("min_diff", minDiff).show
// +----+----+--------+
// |type|time|min_diff|
// +----+----+--------+
// | A| -10| 4|
// | A| 1| 4|
// | A| 5| 4|
// | B| 3| 6|
// | B| 9| 6|
// +----+----+--------+
If your goal is to find the minimum distance between the current row and any other row in a group, you can use a similar approach:
import org.apache.spark.sql.functions.{lead, when}
// Diff to previous
val diff_lag = $"time" - lag($"time", 1).over(w1)
// Diff to next
val diff_lead = lead($"time", 1).over(w1) - $"time"
val diffToClosest = when(
  diff_lag < diff_lead || diff_lead.isNull,
  diff_lag
).otherwise(diff_lead)
df.withColumn("diff_to_closest", diffToClosest).show
// +----+----+---------------+
// |type|time|diff_to_closest|
// +----+----+---------------+
// | A| -10| 11|
// | A| 1| 4|
// | A| 5| 4|
// | B| 3| 6|
// | B| 9| 6|
// +----+----+---------------+
Tested in SQL Server 2008:
create table d(
type varchar(25),
Time int
)
insert into d
values ('A',1),
('B',3),
('A',5),
('B',9)
--solution one, calculation in query, might not be smart if dataset is large.
select *
, (select max(time) m from d as i where i.type = o.type) - (select MIN(time) m from d as i where i.type = o.type) dif
from d as o
--or this
select d.*, diftable.dif from d inner join
(select type, MAX(time) - MIN(time) dif
from d group by type ) as diftable on d.type = diftable.type
You should try something like this:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(
  ("A", 1),
  ("B", 3),
  ("A", 5),
  ("B", 9)
))
val df = input.groupByKey().flatMap { case (key, values) =>
  val smallestDiff = values.toList.sorted match {
    case firstMin :: secondMin :: _ => secondMin - firstMin
    case singleVal :: Nil => singleVal // Only one record for some `Type`
  }
  values.map { value =>
    (key, value, smallestDiff)
  }
}.toDF("Type", "Time", "SmallestDiff")
df.show()
Output:
+----+----+------------+
|Type|Time|SmallestDiff|
+----+----+------------+
| A| 1| 4|
| A| 5| 4|
| B| 3| 6|
| B| 9| 6|
+----+----+------------+