Get the column names of the max and second-highest value for each record in a Scala dataframe

I have a dataframe with a lot of columns, but for this example, we can use this one:
`val dfIn = sqlContext.createDataFrame(Seq(("r0", 0, 1, 2, "a0"),("r1", 1, 2, 0, "a1"),("r2", 2, 0, 1, "a2"))).toDF("prev_column", "c0", "c1", "c2", "post_column")`
Obtaining a dataframe like this:
prev_column | c0 | c1 | c2 | post_column
------------|----|----|----|------------
r0          | 0  | 1  | 2  | a0
r1          | 1  | 2  | 0  | a1
r2          | 2  | 0  | 1  | a2
I need the names of the two columns with the highest value for each record.
I want to end up with a dataframe like this, but I cannot figure out how to do it:
prev_column | c0 | c1 | c2 | post_column | first | second
------------|----|----|----|-------------|-------|-------
r0          | 0  | 1  | 2  | a0          | c2    | c1
r1          | 1  | 2  | 0  | a1          | c1    | c0
r2          | 2  | 0  | 1  | a2          | c0    | c2
I saw some answers for pyspark and with NumPy, but none for Scala.

You can do it using .when() and .otherwise() statements, which are similar to if/else statements in other programming languages:
import org.apache.spark.sql.functions.{col, when}
val dfIn = spark.createDataFrame(Seq(
  ("r0", 0, 1, 2, "a0"),
  ("r1", 1, 2, 0, "a1"),
  ("r2", 2, 0, 1, "a2")
)).toDF("prev_column", "c0", "c1", "c2", "post_column")
val resultDf = dfIn.withColumn("first",
  when(col("c0").geq(col("c1")).and(col("c0").geq(col("c2"))), "c0")
    .when(col("c1").geq(col("c0")).and(col("c1").geq(col("c2"))), "c1")
    .otherwise("c2")
).withColumn("second",
  when(col("c0").between(col("c1"), col("c2"))
    .or(col("c0").between(col("c2"), col("c1"))), "c0")
  .when(col("c1").between(col("c0"), col("c2"))
    .or(col("c1").between(col("c2"), col("c0"))), "c1")
  .otherwise("c2")
)
resultDf.show(false)
+-----------+---+---+---+-----------+-----+------+
|prev_column|c0 |c1 |c2 |post_column|first|second|
+-----------+---+---+---+-----------+-----+------+
|r0 |0 |1 |2 |a0 |c2 |c1 |
|r1 |1 |2 |0 |a1 |c1 |c0 |
|r2 |2 |0 |1 |a2 |c0 |c2 |
+-----------+---+---+---+-----------+-----+------+

There are probably multiple ways to achieve this with UDFs. Without UDFs, you can unpivot the dataframe, find the highest value column via a Window operation and then re-group everything together. The code below works for any number of columns that should be ranked; you just need to change colsForRanking accordingly.
First, unpivot (stack) the data:
val colsForRanking = List("c0", "c1", "c2")
val remainingCols = dfIn.columns.filterNot(colsForRanking.contains(_))
val stackString = (colsForRanking.size.toString +: colsForRanking
.flatMap(c => List(s"'$c'", c))).mkString(",") // 3,'c0',c0,'c1',c1,'c2',c2
val df_unpivot = dfIn.select(
dfIn.columns.map(col) :+ expr(s"stack($stackString) as (colName, value)"): _*)
This adds value and colName columns by duplicating the original data for all cNs and putting each cN in an individual row. Next, you need to rank the rows using a Window:
val w = Window.partitionBy(remainingCols.map(col): _*)
  .orderBy($"value".desc) // we don't need _all_ columns for the partition, but we have to make sure the combination is unique
val df_ranked = df_unpivot
  .withColumn("valueRank", row_number.over(w))
  .withColumn("first", when($"valueRank" === 1, $"colName"))
  .withColumn("second", when($"valueRank" === 2, $"colName"))
A when without an otherwise evaluates to NULL when the condition isn't met, which will be useful in the last step.
Finally, you can group everything back together:
val df_result = df_ranked.groupBy(dfIn.columns.map(col): _*)
  .agg(
    first($"first", ignoreNulls = true).alias("first"),
    first($"second", ignoreNulls = true).alias("second")
  )
first() gives you an arbitrary value within each group (as there is no sorting); however, since you made sure that the values you don't want are NULL, it returns exactly the value you want.
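For completeness, the snippets in this answer assume the usual imports and implicits are in scope (a minimal sketch):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._ // col, expr, when, row_number, first, ...
import spark.implicits._ // enables the $"colName" syntax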

Related

Spark - scan data frame based on value

I'm trying to find a column (I do not know the name of the column in advance) based on a value. For example, in the dataframe below, I'd like to know which row has a column containing yellow for Category = A. The thing is, I don't know the column name (colour) in advance, so I can't just do select * where Category = 'A' and colour = 'yellow'. How can I scan the columns and achieve this? Many thanks for your help.
+--------+------+----+
|Category|colour|name|
+--------+------+----+
|A       |blue  |Elmo|
|A       |yellow|Alex|
|B       |desc  |Erin|
+--------+------+----+
You can loop that check over the list of column names. You can also wrap the loop in a function for readability. Note that the per-column checks run in sequence.
from pyspark.sql import functions as F
cols = df.columns
for c in cols:
    cnt = df.where((F.col('Category') == 'A') & (F.col(c) == 'yellow')).count()
    if cnt > 0:
        print(c)
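If you are working from Scala, as in the rest of this thread, a rough equivalent of the same loop might look like this (a sketch; matchingCols is just an illustrative name, and the same caveat applies: one Spark job per column):
import org.apache.spark.sql.functions.col
// Check each column in turn; like the Python loop above, this triggers one job per column.
val matchingCols = df.columns.filter { c =>
  df.filter(col("Category") === "A" && col(c) === "yellow").limit(1).count() > 0
}
matchingCols.foreach(println)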

Using Spark window with more than one partition when there is no obvious partitioning column

Here is the scenario. Assuming I have the following table:
identifier  | line
------------|-----
51169081604 | 2
00034886044 | 22
51168939455 | 52
The challenge is, for every single value of column line, to select the next biggest value of line, which I have accomplished with the following SQL:
SELECT i1.line,i1.identifier,
MAX(i1.line) OVER (
ORDER BY i1.line ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
)AS parent
FROM global_temp.documentIdentifiers i1
The challenge is partially solved, alright; the problem is that when I execute this code on Spark, the performance is terrible. The warning message is very clear about it:
No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Partitioning by any of the two fields does not work, it breaks the result, of course, as every created partition is not aware of the other lines.
Does anyone have any clue on how I can "select the next biggest line" without performance issues?
Thanks
Using your "next" approach AND assuming the data is generated in ascending line order, the following does work in parallel, but if actually faster you can tell me; I do not know your volume of data. In any event you cannot solve just with SQL (%sql).
Here goes:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
case class X(identifier: Long, line: Long) // Too hard to explain, just gets around issues with df --> rdd --> df.
// Gen some more data.
val df = Seq(
(1000000, 23), (1200, 56), (1201, 58), (1202, 60),
(8200, 63), (890000, 67), (990000, 99), (33000, 123),
(33001, 124), (33002, 126), (33009, 132), (33019, 133),
(33029, 134), (33039, 135), (800, 201), (1800, 999),
(1801, 1999), (1802, 2999), (1800444, 9999)
).toDF("identifier", "line")
// Add partition so as to be able to apply parallelism - except for upper boundary record.
val df2 = df.as[X]
  .rdd
  .mapPartitionsWithIndex((index, iter) => {
    iter.map(x => (index, x))
  }).mapValues(v => (v.identifier, v.line)).map(x => (x._1, x._2._1, x._2._2))
  .toDF("part", "identifier", "line")
// Process per partition.
@transient val w = org.apache.spark.sql.expressions.Window.partitionBy("part").orderBy("line")
val df3 = df2.withColumn("next", lead("line", 1, null).over(w))
// Process upper boundary.
val df4 = df3.filter(df3("part") =!= 0).groupBy("part").agg(min("line").as("nxt")).toDF("pt", "nxt")
val df5 = df3.join(df4, (df3("part") === df4("pt") - 1), "outer" )
val df6 = df5.withColumn("next", when(col("next").isNull, col("nxt")).otherwise(col("next"))).select("identifier", "line", "next")
// Display. Sort accordingly.
df6.show(false)
returns:
+----------+----+----+
|identifier|line|next|
+----------+----+----+
|1000000 |23 |56 |
|1200 |56 |58 |
|1201 |58 |60 |
|1202 |60 |63 |
|8200 |63 |67 |
|890000 |67 |99 |
|990000 |99 |123 |
|33000 |123 |124 |
|33001 |124 |126 |
|33002 |126 |132 |
|33009 |132 |133 |
|33019 |133 |134 |
|33029 |134 |135 |
|33039 |135 |201 |
|800 |201 |999 |
|1800 |999 |1999|
|1801 |1999|2999|
|1802 |2999|9999|
|1800444 |9999|null|
+----------+----+----+
You can add additional sorting, etc. This relies on a narrow transformation when adding the partition index. How you load the data may be an issue, and caching is not considered.
If the data is not ordered as stated above, a range partitioning needs to occur first.
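Something like this could do it (a sketch, assuming Spark 2.3+ for repartitionByRange; the partition count of 8 is arbitrary):
// Range-partition by "line" so each partition holds a contiguous, sorted slice,
// then the same mapPartitionsWithIndex logic as above can be applied.
val dfOrdered = df
  .repartitionByRange(8, $"line")
  .sortWithinPartitions($"line")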

Rxjava - How to get the current and the previous item?

How do I use operators so that I always get the previous and the current value? If possible I want to avoid creating state outside the pipe.
- time ->
  1      2      3      4
  |      |      |      |
       Operations
     |      |      |
   (1,2)  (2,3)  (3,4)
Note that every value besides the first and the last one has to appear twice, so a simple buffer won't do.
I thought about combining skip with merge and buffer but merge does not seem to guarantee ordering.
val s = PublishSubject.create<Int>()
s.mergeWith(s.skip(1)).buffer(2).subscribe{i -> print(i)}
s.onNext(1)
s.onNext(2)
s.onNext(3)
s.onNext(4)
output:
[1, 2][2, 3][3, 4]
val o = Observable.just(1,2,3,4)
o.mergeWith(o.skip(1)).buffer(2).subscribe{i -> print(i)}
output:
[1, 2][3, 4][2, 3][4]
(the sole 4 is fine, and expected)
Looks like you can still use buffer:
Observable.just(1, 2, 3, 4)
.buffer(2, 1)
.subscribe { println(it) }
// prints
// [1, 2]
// [2, 3]
// [3, 4]
// [4]

How to filter after group by and aggregate in Spark dataframe?

I have a spark dataframe df with schema as such:
[id:string, label:string, tag:string]
id | label | tag
---|-------|-----
1 | h | null
1 | w | x
1 | v | null
1 | v | x
2 | h | x
3 | h | x
3 | w | x
3 | v | null
3 | v | null
4 | h | null
4 | w | x
5 | w | x
(h, w, v are labels; x can be any non-empty value)
For each id, there is at most one label "h" or "w", but there might be multiple "v". I would like to select all the ids that satisfy the following conditions:
Each id has:
1. one label "h" and its tag = null,
2. one label "w" and its tag != null,
3. at least one label "v" for each id.
I am thinking that I need to create three columns, one checking each of the conditions above, and then group by "id".
val hCheck = (label: String, tag: String) => {if (label=="h" && tag==null) 1 else 0}
val udfHCheck = udf(hCheck)
val wCheck = (label: String, tag: String) => {if (label=="w" && tag!=null) 1 else 0}
val udfWCheck = udf(wCheck)
val vCheck = (label: String) => {if (label==null) 1 else 0}
val udfVCheck = udf(vCheck)
val dfx = df.withColumn("hCheck", udfHCheck(col("label"), col("tag")))
  .withColumn("wCheck", udfWCheck(col("label"), col("tag")))
  .withColumn("vCheck", udfVCheck(col("label")))
  .select("id","hCheck","wCheck","vCheck")
  .groupBy("id")
Somehow I need to group the three columns {"hCheck","wCheck","vCheck"} into vectors like [x,0,0], [0,x,0], [0,0,x], and check whether these vectors contain all three of {[1,0,0],[0,1,0],[0,0,1]}.
I have not been able to solve this problem yet, and there might be a better approach than this one. I hope someone can give me suggestions. Thanks
To convert the three checks to vectors, you can do the following:
val df1 = df.withColumn("hCheck", udfHCheck(col("label"), col("tag")))
.withColumn("wCheck", udfWCheck(col("label"), col("tag")))
.withColumn("vCheck", udfVCheck(col("label")))
.select($"id",array($"hCheck",$"wCheck",$"vCheck").as("vec"))
Next, groupBy returns a grouped object on which you need to perform aggregations. Specifically, to collect all the vectors you should do something like:
.groupBy("id").agg(collect_list($"vec"))
Also, you do not need UDFs for the various checks; you can do it with column semantics. For example, udfHCheck can be written as:
with($"label" == lit("h") && tag.isnull 1).otherwise(0)
BTW, you said you wanted a label 'v' for each id, but in vCheck you just check whether the label is null.
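With that correction, the other two checks can be expressed the same way (a sketch; wCheckCol and vCheckCol are just illustrative names):
val wCheckCol = when($"label" === "w" && $"tag".isNotNull, 1).otherwise(0)
val vCheckCol = when($"label" === "v", 1).otherwise(0) // checks the label itself, per the remark above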
Update: Alternative solution
Upon looking on this question again, I would do something like this:
val grouped = df.groupBy("id", "label").agg(count($"label").as("cnt"), first($"tag").as("tag"))
val filtered1 = grouped.filter($"label" === "v" || $"cnt" === 1)
val filtered2 = filtered1.filter($"label" === "v" || ($"label" === "h" && $"tag".isNull) || ($"label" === "w" && $"tag".isNotNull))
val ids = filtered2.groupBy("id").count.filter($"count" === 3)
The idea is that first we group by BOTH id and label so we have information on each combination. The information we collect is how many rows there are (cnt) and the first tag (it doesn't matter which).
Now we do two filtering steps:
1. We need exactly one h and one w and any number of v, so the first filter keeps those cases.
2. We make sure all the rules are met for each of the cases.
Now we have only the combinations of id and label which match the rules, so for an id to be legal we need exactly three such label instances. This leads to the second groupBy, which simply counts the number of labels which matched the rules; we need exactly three for the id to be legal (i.e. it matched all the rules).
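If you then need the original rows rather than just the ids, a join back onto the source DataFrame is enough (a sketch):
val result = df.join(ids.select("id"), Seq("id"))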

How to apply a custom filtering function on a Spark DataFrame

I have a DataFrame of the form:
A_DF = |id_A: Int|concatCSV: String|
and another one:
B_DF = |id_B: Int|triplet: List[String]|
Examples of concatCSV could look like:
"StringD, StringB, StringF, StringE, StringZ"
"StringA, StringB, StringX, StringY, StringZ"
...
while a triplet is something like:
("StringA", "StringF", "StringZ")
("StringB", "StringU", "StringR")
...
I want to produce the cartesian set of A_DF and B_DF, e.g.;
| id_A: Int | concatCSV: String | id_B: Int | triplet: List[String] |
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 21 | ("StringA", "StringF", "StringZ")|
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 45 | ("StringB", "StringU", "StringR")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 21 | ("StringA", "StringF", "StringZ")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 45 | ("StringB", "StringU", "StringR")|
| ... | | | |
Then I want to keep just the records where at least two of the substrings (e.g. StringA, StringB) from A_DF("concatCSV") appear in B_DF("triplet"), i.e. use filter to exclude those that don't satisfy this condition.
First question is: can I do this without converting the DFs into RDDs?
Second question is: can I ideally do the whole thing in the join step--as a where condition?
I have tried experimenting with something like:
val cartesianRDD = A_DF
.join(B_DF,"right")
.where($"triplet".exists($"concatCSV".contains(_)))
but where cannot be resolved. I tried it with filter instead of where, but still no luck. Also, for some strange reason, the inferred type of cartesianRDD is SchemaRDD and not DataFrame. How did I end up with that? Finally, what I am trying above (the short code I wrote) is incomplete, as it would keep records with just one substring from concatCSV found in triplet.
So, third question is: Should I just change to RDDs and solve it with a custom filtering function?
Finally, last question: Can I use a custom filtering function with DataFrames?
Thanks for the help.
CROSS JOIN is implemented in Hive, so you could first do the cross join using Hive SQL:
A_DF.registerTempTable("a")
B_DF.registerTempTable("b")
// sqlContext should be really a HiveContext
val result = sqlContext.sql("SELECT * FROM a CROSS JOIN b")
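As a side note, on Spark 2.1+ the DataFrame API has a crossJoin method, so the temp tables can be skipped (a sketch):
val result = A_DF.crossJoin(B_DF)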
Then you can filter down to your expected output using two UDFs: one that converts your string to an array of words, and a second one that gives the length of the intersection of the resulting array column and the existing column "triplet":
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}
val splitArr = udf { (s: String) => s.split(",").map(_.trim) }
val commonLen = udf { (a: WrappedArray[String],
                       b: WrappedArray[String]) => a.intersect(b).length }
val temp = result
  .withColumn("concatArr", splitArr(col("concatCSV")))
  .select(col("*"), commonLen(col("triplet"), col("concatArr")).alias("comm"))
  .filter(col("comm") >= 2)
  .drop("comm")
  .drop("concatArr")
temp.show
+----+--------------------+----+--------------------+
|id_A| concatCSV|id_B| triplet|
+----+--------------------+----+--------------------+
| 14|StringD, StringB,...| 21|[StringA, StringF...|
| 18|StringA, StringB,...| 21|[StringA, StringF...|
+----+--------------------+----+--------------------+