SparkSQL: conditional sum using two columns - sql

I hope you can help me with this.
I have a DF as follows:
val df = sc.parallelize(Seq(
  (1, "a", "2014-12-01", "2015-01-01", 100),
  (2, "a", "2014-12-01", "2015-01-02", 150),
  (3, "a", "2014-12-01", "2015-01-03", 120),
  (4, "b", "2015-12-15", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))
I would love to group by prodId and aggregate 'value', summing it over ranges of dates defined by the difference between the 'dateIns' and 'dateTrans' columns. In particular, I would like a way to define a conditional sum that adds up all values within a predefined maximum difference between the above-mentioned columns, i.e. all values that occurred within 10, 20, or 30 days of dateIns ('dateTrans' - 'dateIns' <= 10, 20, 30).
Is there any predefined aggregate function in Spark that allows doing conditional sums? Do you recommend developing an aggregate UDF (if so, any suggestions)?
I'm using PySpark, but I'm very happy to get Scala solutions as well. Thanks a lot!

Let's make your data a little bit more interesting so there are some events in each window:
val df = sc.parallelize(Seq(
  (1, "a", "2014-12-30", "2015-01-01", 100),
  (2, "a", "2014-12-21", "2015-01-02", 150),
  (3, "a", "2014-12-10", "2015-01-03", 120),
  (4, "b", "2014-12-05", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))
What you need is more or less something like this:
import org.apache.spark.sql.functions.{col, datediff, lit, sum, when}

// Find difference in tens of days
val diff = (datediff(col("dateTrans"), col("dateIns")) / 10)
  .cast("integer") * 10

val dfWithDiff = df.withColumn("diff", diff)

val aggregated = dfWithDiff
  .where((col("diff") < 30) && (col("diff") >= 0))
  .groupBy(col("prodId"), col("diff"))
  .agg(sum(col("value")))
And the results:
aggregated.show
// +------+----+----------+
// |prodId|diff|sum(value)|
// +------+----+----------+
// | a| 20| 120|
// | b| 20| 100|
// | a| 0| 100|
// | a| 10| 150|
// +------+----+----------+
where diff is the lower bound of the range (0 -> [0, 10), 10 -> [10, 20), ...). This works in PySpark as well if you remove val and adjust the imports.
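For reference, a direct PySpark translation of the snippet above might look like this (a sketch assuming the same df as in the question, with df_with_diff standing in for dfWithDiff):
from pyspark.sql.functions import col, datediff, sum as sum_

# Bucket the difference between the two dates into tens of days
diff = (datediff(col("dateTrans"), col("dateIns")) / 10).cast("integer") * 10

df_with_diff = df.withColumn("diff", diff)

aggregated = (df_with_diff
    .where((col("diff") < 30) & (col("diff") >= 0))
    .groupBy(col("prodId"), col("diff"))
    .agg(sum_(col("value"))))

aggregated.show()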
Edit (aggregate per column):
val exprs = Seq(0, 10, 20).map(x =>
  sum(when(col("diff") === lit(x), col("value")).otherwise(lit(0)))
    .alias(x.toString))

dfWithDiff.groupBy(col("prodId")).agg(exprs.head, exprs.tail: _*).show
// +------+---+---+---+
// |prodId| 0| 10| 20|
// +------+---+---+---+
// | a|100|150|120|
// | b| 0| 0|100|
// +------+---+---+---+
with Python equivalent:
from pyspark.sql.functions import *

def make_col(x):
    cnd = when(col("diff") == lit(x), col("value")).otherwise(lit(0))
    return sum(cnd).alias(str(x))

exprs = [make_col(x) for x in range(0, 30, 10)]
dfWithDiff.groupBy(col("prodId")).agg(*exprs).show()
## +------+---+---+---+
## |prodId| 0| 10| 20|
## +------+---+---+---+
## | a|100|150|120|
## | b| 0| 0|100|
## +------+---+---+---+
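An alternative worth mentioning: the same per-bucket columns can be produced with a pivot instead of building the expressions by hand. A sketch assuming the same dfWithDiff; passing an explicit value list keeps the pivot from doing an extra pass to collect distinct values:
from pyspark.sql.functions import sum as sum_

(dfWithDiff
    .groupBy("prodId")
    .pivot("diff", [0, 10, 20])  # only the buckets of interest
    .agg(sum_("value"))
    .na.fill(0)                  # pivot leaves missing combinations as null
    .show())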

Related

How to retain the preceding updated row values in PySpark and use it in the next row calculation?

The conditions below need to be applied to the RANK and RANKA columns.
Input table: (shown as an image in the original post; the same data is created in the code below)
Condition for the RANK column:
IF RANK == 0: RANK = previous RANK value + 1
ELSE: RANK = RANK
Condition for the RANKA column:
IF RANKA == 0: RANKA = previous RANKA value + current row's Salary value
ELSE: RANKA = RANKA
Below is a piece of code that I tried.
I created dummy columns named RANK_new and RANKA_new to store the desired outputs of the RANK and RANKA columns after applying the conditions.
Once I get the correct values, I can replace the RANK and RANKA columns with those dummy columns.
# importing necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lit, lag, lead

# function to create a new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("employee_profile.com") \
        .getOrCreate()
    return spk

def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":
    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Data Scientist", 2, 1, 1, 2),
                  (0, "Rishabh", "Software Developer", 5, 2, 0, 3),
                  (0, "Swati", "Data Analyst", 10, 3, 10, 4),
                  (1, "Amar", "Data Analyst", 2, 4, 9, 0),
                  (0, "Arpit", "Android Developer", 3, 5, 0, 0),
                  (0, "Ranjeet", "Python Developer", 4, 6, 0, 0),
                  (0, "Priyanka", "Full Stack Developer", 5, 7, 0, 0)]
    schema = ["Id", "Name", "Job Profile", "Salary", "hi", "RANK", "RANKA"]

    # calling function to create dataframe
    dff = create_df(spark, input_data, schema)

    # Logic for RANK
    df1 = dff.repartition(1)
    df2 = df1.withColumn(
        'RANK_new',
        when(col('RANK') == 0,
             lag(col('RANK') + lit(1)).over(Window.orderBy(col('hi'))))
        .otherwise(col('RANK')))
    df2 = df2.withColumn(
        'RANK_new',
        when((col('RANK') == 0) &
             (lag(col('RANK')).over(Window.orderBy(col('hi'))) == 0),
             lag(col('RANK_new') + lit(1)).over(Window.orderBy(col('hi'))))
        .otherwise(col('RANK_new')))

    # Logic for RANKA
    df2 = df2.withColumn(
        'RANKA_new',
        when(col('RANKA') == 0,
             lag(col("RANKA")).over(Window.orderBy("hi")) + col("Salary"))
        .otherwise(col('RANKA')))

    df2.show()
The issue with this code is that the lag function does not take the updated values of the previous rows.
This could be done with a for loop, but since my data is so huge I need a solution without one.
The desired output was shown as an image in the original post: RANK_new and RANKA_new are the outputs I got for the RANK and RANKA columns after applying the code above, while RANK_desired and RANKA_desired are what should be produced.
You can first create groups to partition by, for both RANK and RANKA. A running sum inside each partition then does the job.
Input:
from pyspark.sql import functions as F, Window as W

input_data = [(1, "Shivansh", "Data Scientist", 2, 1, 1, 2),
              (0, "Rishabh", "Software Developer", 5, 2, 0, 3),
              (0, "Swati", "Data Analyst", 10, 3, 10, 4),
              (1, "Amar", "Data Analyst", 2, 4, 9, 0),
              (0, "Arpit", "Android Developer", 3, 5, 0, 0),
              (0, "Ranjeet", "Python Developer", 4, 6, 0, 0),
              (0, "Priyanka", "Full Stack Developer", 5, 7, 0, 0)]
schema = ["Id", "Name", "Job Profile", "Salary", "hi", "RANK", "RANKA"]
dff = spark.createDataFrame(input_data, schema)
Script:
w0 = W.orderBy('hi')
rank_grp = F.when(F.col('RANK') != 0, 1).otherwise(0)
dff = dff.withColumn('RANK_grp', F.sum(rank_grp).over(w0))
w1 = W.partitionBy('RANK_grp').orderBy('hi')
ranka_grp = F.when(F.col('RANKA') != 0, 1).otherwise(0)
dff = dff.withColumn('RANKA_grp', F.sum(ranka_grp).over(w0))
w2 = W.partitionBy('RANKA_grp').orderBy('hi')
dff = (dff
    .withColumn('RANK_new', F.sum(F.when(F.col('RANK') == 0, 1).otherwise(F.col('RANK'))).over(w1))
    .withColumn('RANKA_new', F.sum(F.when(F.col('RANKA') == 0, F.col('Salary')).otherwise(F.col('RANKA'))).over(w2))
    .drop('RANK_grp', 'RANKA_grp')
)
dff.show()
# +---+--------+--------------------+------+---+----+-----+--------+---------+
# | Id| Name| Job Profile|Salary| hi|RANK|RANKA|RANK_new|RANKA_new|
# +---+--------+--------------------+------+---+----+-----+--------+---------+
# | 1|Shivansh| Data Scientist| 2| 1| 1| 2| 1| 2|
# | 0| Rishabh| Software Developer| 5| 2| 0| 3| 2| 3|
# | 0| Swati| Data Analyst| 10| 3| 10| 4| 10| 4|
# | 1| Amar| Data Analyst| 2| 4| 9| 0| 9| 6|
# | 0| Arpit| Android Developer| 3| 5| 0| 0| 10| 9|
# | 0| Ranjeet| Python Developer| 4| 6| 0| 0| 11| 13|
# | 0|Priyanka|Full Stack Developer| 5| 7| 0| 0| 12| 18|
# +---+--------+--------------------+------+---+----+-----+--------+---------+
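If it helps to see how the grouping works, here is a small sketch that recomputes and displays the intermediate group columns (reusing rank_grp, ranka_grp and w0 from the script above). Each non-zero RANK (or RANKA) starts a new group, and the running sum within that group reproduces the "previous value plus increment" behaviour without a loop:
(dff
    .withColumn('RANK_grp', F.sum(rank_grp).over(w0))
    .withColumn('RANKA_grp', F.sum(ranka_grp).over(w0))
    .select('Name', 'hi', 'RANK', 'RANK_grp', 'RANK_new',
            'RANKA', 'RANKA_grp', 'RANKA_new')
    .show())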

Replacing column values in pyspark by iterating through list

I have a PySpark data frame as:
+---+----+-----+----+
| ID|colA| colB|colC|
+---+----+-----+----+
|ID1|   3| 5.85|  LB|
|ID2|   4|12.67|  RF|
|ID3|   2|20.78| LCM|
|ID4|   1|    2| LWB|
|ID5|   6|    3|  LF|
|ID6|   7|    4|  LM|
|ID7|   8|    5|  RS|
+---+----+-----+----+
My goal is to replace the values in colC, mapping the values LB, LWB, LF to x and so on, as shown below.
x = [LB, LWB, LF]
y = [RF, LCM]
z = [LM, RS]
Currently I'm able to achieve this by replacing each of the values manually, as in the code below:
# Replacing the values LB, LWB, LF with x
df_new = df.withColumn(
    'ColC',
    f.when((f.col('ColC') == 'LB') | (f.col('ColC') == 'LWB') | (f.col('ColC') == 'LF'), 'x')
     .otherwise(df.ColC))
My question is: how can we replace the values of a column (colC in my example) by iterating through the lists (x, y, z) dynamically, all at once, using PySpark? What is the time complexity involved? Also, how can we truncate the decimal values in colB to 1 decimal place?
You can coalesce the when statements if you have many conditions to match. You can also use a dictionary to hold the values to be converted, and construct the when statements dynamically by iterating over the dict. As for rounding to 1 decimal place, you can use round.
import pyspark.sql.functions as F

xyz_dict = {'x': ['LB', 'LWB', 'LF'],
            'y': ['RF', 'LCM'],
            'z': ['LM', 'RS']}

df2 = df.withColumn(
    'colC',
    F.coalesce(*[F.when(F.col('colC').isin(v), k) for (k, v) in xyz_dict.items()])
).withColumn(
    'colB',
    F.round('colB', 1)
)

df2.show()
df2.show()
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1| 3| 5.9| x|
|ID2| 4|12.7| y|
|ID3| 2|20.8| y|
|ID4| 1| 2.0| x|
|ID5| 6| 3.0| x|
|ID6| 7| 4.0| z|
|ID7| 8| 5.0| z|
+---+----+----+----+
You can use replace on the DataFrame to replace the values in colC by passing a dict object for the mappings, and the round function to limit the number of decimals in colB:
from pyspark.sql import functions as F

replacement = {
    "LB": "x", "LWB": "x", "LF": "x",
    "RF": "y", "LCM": "y",
    "LM": "z", "RS": "z"
}

df1 = df.replace(replacement, ["colC"]).withColumn("colB", F.round("colB", 1))
df1.show()
df1.show()
#+---+----+----+----+
#| ID|colA|colB|colC|
#+---+----+----+----+
#|ID1| 3| 5.9| x|
#|ID2| 4|12.7| y|
#|ID3| 2|20.8| y|
#|ID4| 1| 2.0| x|
#|ID5| 6| 3.0| x|
#|ID6| 7| 4.0| z|
#|ID7| 8| 5.0| z|
#+---+----+----+----+
You can also use the isin function:
from pyspark.sql.functions import col, when

x = ['LB', 'LWB', 'LF']
y = ['LCM', 'RF']
z = ['LM', 'RS']

df = df.withColumn(
    'ColC',
    when(col('colC').isin(x), "x")
    .otherwise(when(col('colC').isin(y), "y")
               .otherwise(when(col('colC').isin(z), "z")
                          .otherwise(df.ColC))))
If you have a few lists, each with many values, this approach has lower complexity than blackbishop's answer, but for this particular problem their answer is simpler.
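If you prefer to build that nested when chain programmatically from the lists instead of writing it out by hand, here is a sketch using functools.reduce over the same x, y, z lists (the mapping dict is just illustrative):
from functools import reduce
from pyspark.sql.functions import col, when

mapping = {'x': x, 'y': y, 'z': z}  # label -> list of values to replace

# Wrap one when(...) per label around the original column
new_colC = reduce(
    lambda acc, kv: when(col('colC').isin(kv[1]), kv[0]).otherwise(acc),
    mapping.items(),
    col('colC'))

df = df.withColumn('colC', new_colC)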
You can also try a regular expression, using regexp_replace:
import pyspark.sql.functions as f

replacements = [
    ("(LB)|(LWB)|(LF)", "x"),
    ("(LCM)|(RF)", "y"),
    ("(LM)|(RS)", "z")
]

for x, y in replacements:
    df = df.withColumn("colC", f.regexp_replace("colC", x, y))
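One caveat: regexp_replace matches substrings, so if one of these codes ever appears as part of a longer value it would be partially rewritten. Anchoring each pattern to the whole cell value avoids this; a sketch of the same idea:
import pyspark.sql.functions as f

# Anchor each pattern so only exact cell values are replaced
replacements = [
    (r"^(LB|LWB|LF)$", "x"),
    (r"^(LCM|RF)$", "y"),
    (r"^(LM|RS)$", "z")
]

for pattern, label in replacements:
    df = df.withColumn("colC", f.regexp_replace("colC", pattern, label))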

Is there any method to find number of columns having data in pyspark data frame

I have a PySpark data frame that has 7 columns. I need to add a new column named "sum" that holds, for each row, the count of columns that have data (i.e. are not null). The original post illustrated this with an image of a data frame in which the required answer was highlighted.
This sum can be calculated like this:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([
    (1, "a", "xxx", None, "abc", "xyz", "fgh"),
    (2, "b", None, 3, "abc", "xyz", "fgh"),
    (3, "c", "a23", None, None, "xyz", "fgh")
], ("ID", "flag", "col1", "col2", "col3", "col4", "col5"))

df2 = df.withColumn(
    "sum",
    sum([(~F.isnull(df[col])).cast(IntegerType()) for col in df.columns])
)
df2.show()
+---+----+----+----+----+----+----+---+
| ID|flag|col1|col2|col3|col4|col5|sum|
+---+----+----+----+----+----+----+---+
| 1| a| xxx|null| abc| xyz| fgh| 6|
| 2| b|null| 3| abc| xyz| fgh| 6|
| 3| c| a23|null|null| xyz| fgh| 5|
+---+----+----+----+----+----+----+---+
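If the ID and flag columns should not be counted (which the highlighted example in the original post may have intended), restrict the comprehension to the data columns. A sketch assuming col1 through col5 are the ones of interest:
data_cols = ["col1", "col2", "col3", "col4", "col5"]

df3 = df.withColumn(
    "sum",
    sum([(~F.isnull(df[c])).cast(IntegerType()) for c in data_cols])
)
df3.show()  # sums become 4, 4 and 3 for the three example rows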
Hope this helps!

how to generate unique sequence numbers to replace null values in a column of a table in spark scala

I am facing difficulty in generating unique sequence numbers to replace the null values in a column of a table. The table is obtained after joining two other tables, and the column is the primary key column, whose null values are to be replaced with unique sequence values.
I tried using accumulators, but I am facing difficulty when running the program on a multi-node cluster.
val joined = csv2.join(csv, csv2("ACCT_PRDCT_CD") === csv("ACCT_PRDCT_CD"), "left_outer")

joined.filter("ACCT_CO_NO is null").show

val k = joined.withColumn("Acc_flag", when($"ACCT_CO_NO".isNull, 0).otherwise($"ACCT_CO_NO"))

var a = 1

def generate(s: Int): Int = {
  if (s == 0) {
    a = a + 1
    return a
  }
  else {
    return s
  }
}

val generateNum = udf(generate(_: Int))

val newjoined = k.withColumn("n", generateNum($"Acc_flag"))
If I understand your requirement correctly, consider using monotonically_increasing_id or the RDD's zipWithIndex. To avoid collisions, the generated sequence numbers are added to a number greater than the maximum existing column value before replacing the nulls.
import org.apache.spark.sql.functions._

val dfL = Seq(
  (1, "a"),
  (2, "b"),
  (3, "c"),
  (4, "d"),
  (5, "e"),
  (6, "f")
).toDF("c1", "c2")

val dfR = Seq(
  (1, 100L),
  (2, 200L),
  (3, 300L)
).toDF("c1", "c2")

val c2max = dfR.select(max($"c2")).first.getLong(0)
// c2max: Long = 300

val dfJoined = dfL.join(dfR, Seq("c1"), "left").
  select(dfL("c1"), dfR("c2"))
METHOD 1: using monotonically_increasing_id
dfJoined.withColumn("c2x",
  when(col("c2").isNotNull, col("c2")).
    otherwise(monotonically_increasing_id + c2max + 1)
).show
// +---+----+-----------+
// | c1| c2| c2x|
// +---+----+-----------+
// | 1| 100| 100|
// | 2| 200| 200|
// | 3| 300| 300|
// | 4|null|25769804077|
// | 5|null|34359738669|
// | 6|null|42949673261|
// +---+----+-----------+
Note that the generated sequence numbers aren't necessarily consecutive.
METHOD 2: using RDD's zipWithIndex
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd = dfJoined.rdd.zipWithIndex.
  map{ case (row: Row, idx: Long) => Row.fromSeq(row.toSeq :+ idx) }

spark.createDataFrame(rdd,
  StructType(dfJoined.schema.fields :+ StructField("idx", LongType))
).
  select($"c1", $"c2",
    when(col("c2").isNotNull, col("c2")).otherwise($"idx" + c2max + 1).
      as("c2x")
  ).
  show
// +---+----+---+
// | c1| c2|c2x|
// +---+----+---+
// | 1| 100|100|
// | 2| 200|200|
// | 3| 300|300|
// | 4|null|304|
// | 5|null|305|
// | 6|null|306|
// +---+----+---+
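For PySpark users, a rough equivalent of METHOD 1 (a sketch assuming dfJoined is a PySpark DataFrame with the same c1 and c2 columns as above):
from pyspark.sql import functions as F

# Largest existing c2 value (max ignores nulls)
c2max = dfJoined.agg(F.max("c2")).first()[0]

dfJoined.withColumn(
    "c2x",
    F.when(F.col("c2").isNotNull(), F.col("c2"))
     .otherwise(F.monotonically_increasing_id() + c2max + 1)
).show()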

Adding an extra column that represents the difference between the closest difference of a previous column

My scenario might be more easily explained through an example. Say I had the following data:
Type Time
A 1
B 3
A 5
B 9
I want to add an extra column to each row that represents the minimum absolute difference between all rows of the same type. So for the first row, the minimum difference between all times of type A is 4, so the value would be 4 for rows 1 and 3, and likewise 6 for rows 2 and 4.
I am doing this in Spark and Spark SQL, so guidance there would be most useful, but if it needs to be explained through plain SQL, that would be a great help as well.
One possible approach is to use window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, min, abs}

val df = Seq(
  ("A", -10), ("A", 1), ("A", 5), ("B", 3), ("B", 9)
).toDF("type", "time")
First, let's determine the difference between consecutive rows sorted by time:
// Partition by type and sort by time
val w1 = Window.partitionBy($"Type").orderBy($"Time")
// Difference between this and previous
val diff = $"time" - lag($"time", 1).over(w1)
Then find minimum over all diffs for a given type:
// Partition by time unordered and take unbounded window
val w2 = Window.partitionBy($"Type").rowsBetween(Long.MinValue, Long.MaxValue)
// Minimum difference over type
val minDiff = min(diff).over(w2)
df.withColumn("min_diff", minDiff).show
// +----+----+--------+
// |type|time|min_diff|
// +----+----+--------+
// | A| -10| 4|
// | A| 1| 4|
// | A| 5| 4|
// | B| 3| 6|
// | B| 9| 6|
// +----+----+--------+
If your goal is to find the minimum distance between the current row and any other row in a group, you can use a similar approach:
import org.apache.spark.sql.functions.{lead, when}
// Diff to previous
val diff_lag = $"time" - lag($"time", 1).over(w1)
// Diff to next
val diff_lead = lead($"time", 1).over(w1) - $"time"
val diffToClosest = when(
diff_lag < diff_lead || diff_lead.isNull,
diff_lag
).otherwise(diff_lead)
df.withColumn("diff_to_closest", diffToClosest)
// +----+----+---------------+
// |type|time|diff_to_closest|
// +----+----+---------------+
// | A| -10| 11|
// | A| 1| 4|
// | A| 5| 4|
// | B| 3| 6|
// | B| 9| 6|
// +----+----+---------------+
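For completeness, a PySpark sketch of both ideas, assuming a df with the same type and time columns; the consecutive difference is materialised as a helper column before taking its minimum per type:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w1 = Window.partitionBy("type").orderBy("time")
w2 = Window.partitionBy("type").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)

diff_lag = F.col("time") - F.lag("time", 1).over(w1)
diff_lead = F.lead("time", 1).over(w1) - F.col("time")

diff_to_closest = F.when(
    (diff_lag < diff_lead) | diff_lead.isNull(), diff_lag
).otherwise(diff_lead)

(df
    .withColumn("diff_to_closest", diff_to_closest)
    .withColumn("consecutive_diff", diff_lag)
    .withColumn("min_diff", F.min("consecutive_diff").over(w2))
    .drop("consecutive_diff")
    .show())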
Tested in SQL Server 2008:
create table d(
  type varchar(25),
  Time int
)

insert into d
values ('A', 1),
       ('B', 3),
       ('A', 5),
       ('B', 9)

-- solution one: calculation in the query, might not be smart if the dataset is large
select *
     , (select max(time) from d as i where i.type = o.type)
       - (select min(time) from d as i where i.type = o.type) as dif
from d as o

-- or this
select d.*, diftable.dif
from d
inner join (
  select type, max(time) - min(time) as dif
  from d
  group by type
) as diftable on d.type = diftable.type
You should try something like this:
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val input = sc.parallelize(Seq(
  ("A", 1),
  ("B", 3),
  ("A", 5),
  ("B", 9)
))

val df = input.groupByKey().flatMap { case (key, values) =>
  val smallestDiff = values.toList.sorted match {
    case firstMin :: secondMin :: _ => secondMin - firstMin
    case singleVal :: Nil => singleVal // Only one record for some `Type`
  }
  values.map { value =>
    (key, value, smallestDiff)
  }
}.toDF("Type", "Time", "SmallestDiff")

df.show()
Output:
+----+----+------------+
|Type|Time|SmallestDiff|
+----+----+------------+
| A| 1| 4|
| A| 5| 4|
| B| 3| 6|
| B| 9| 6|
+----+----+------------+