I am trying to execute the below code using PySpark:
join_on = (
    (df_1.C1_PROFIT == df_2.C2_PROFIT) &          # JOIN CONDITION
    (df_1.C1_REVENUE == df_3.C3_REVENUE_BREAK) &  # JOIN CONDITION
    (df_1.C1_LOSS == df_4.C4_TOTAL_LOSS) &        # JOIN CONDITION
    (df_4.TOTAL_YEAR_PROFIT > df_3.TOTAL_GROWTH)  # WHERE CONDITION
)
df = (df_1.alias('a')
      .join(df_2.alias('b'), join_on, 'left')
      .join(df_3.alias('c'), join_on, 'left')
      .join(df_4.alias('d'), join_on, 'left')
      .select(
          *[c for c in df_2.columns if c != 'C2_TARGET'],
          F.expr("nvl2(b.C2_PROFIT, '500', a.C2_TARGET) C2_TARGET")
      )
)
Error after running the query:
'TOTAL_YEAR_PROFIT', 'TOTAL_GROWTH', 'TOTAL_LOSS', 'REVENUE_BREAK'
does not exist in df_1 columns
The original SQL query:
UPDATE (( companyc1
INNER JOIN companyc2
ON company1.c1_profit = company2.c2_profit)
INNER JOIN companyc3
ON company1.c1_revenue = company3.revenue_break)
INNER JOIN companyc4
ON company1.c1_loss = company4.c4_total_loss
SET companyc1.sales = "500"
WHERE (( ( company4.total_year_profit ) > [company3].[total_growth] ))
Can anyone help me find where I am making a mistake?
Your join_on condition references columns of df_3 and df_4 before those DataFrames have been joined, which is why Spark reports that they do not exist. The conditions have to be split, one per join operation, like so:
df = (df_1.alias('a')
      .join(df_2.alias('b'), df_1.C1_PROFIT == df_2.C2_PROFIT, 'left')
      .join(df_3.alias('c'), df_1.C1_REVENUE == df_3.C3_REVENUE_BREAK, 'left')
      .join(df_4.alias('d'), df_1.C1_LOSS == df_4.C4_TOTAL_LOSS, 'left')
      .select(
          *[c for c in df_2.columns if c != 'C2_TARGET'],
          F.expr("nvl2(b.C2_PROFIT, '500', a.C2_TARGET) C2_TARGET")
      ).where("d.TOTAL_YEAR_PROFIT > c.TOTAL_GROWTH")
)
When translating a SQL UPDATE that contains multiple joins, it seems to me that a universally safe approach could involve groupBy, agg and monotonically_increasing_id (to make sure that the row count of the original df will not shrink after the aggregation).
I've made the following tables in MS Access to make sure that the approach I suggest would work the same way in Spark.
Inputs:
Result after the update:
Spark
It seems MS Access aggregates column values, so the following will do the same.
Inputs:
from pyspark.sql import functions as F
df_1 = spark.createDataFrame(
    [(2, 10, 5, 'replace'),
     (2, 10, 5, 'replace'),
     (1, 10, None, 'keep'),
     (2, None, 5, 'keep')],
    ['C1_PROFIT', 'C1_REVENUE', 'C1_LOSS', 'SALES']
)
df_2 = spark.createDataFrame([(1,), (2,), (8,)], ['C2_PROFIT'])
df_3 = spark.createDataFrame([(10, 51), (10, 50)], ['REVENUE_BREAK', 'TOTAL_GROWTH'])
df_4 = spark.createDataFrame([(5, 50), (5, 51),], ['C4_TOTAL_LOSS', 'TOTAL_YEAR_PROFIT'])
Script:
df_1 = df_1.withColumn('_id', F.monotonically_increasing_id())
df = (df_1.alias('a')
      .join(df_2.alias('b'), df_1.C1_PROFIT == df_2.C2_PROFIT, 'left')
      .join(df_3.alias('c'), df_1.C1_REVENUE == df_3.REVENUE_BREAK, 'left')
      .join(df_4.alias('d'), df_1.C1_LOSS == df_4.C4_TOTAL_LOSS, 'left')
      .groupBy(*[c for c in df_1.columns if c != 'SALES'])
      .agg(F.when(F.max('d.total_year_profit') > F.min('c.total_growth'), '500')
            .otherwise(F.first('a.SALES')).alias('SALES'))
      .drop('_id')
)
df.show()
# +---------+----------+-------+-----+
# |C1_PROFIT|C1_REVENUE|C1_LOSS|SALES|
# +---------+----------+-------+-----+
# | 1| 10| null| keep|
# | 2| null| 5| keep|
# | 2| 10| 5| 500|
# | 2| 10| 5| 500|
# +---------+----------+-------+-----+
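Note that _id is part of the groupBy keys, so every original row of df_1 stays its own group and duplicate rows survive the aggregation (the two identical 'replace' rows above); that is exactly why monotonically_increasing_id is added before the joins.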
The below conditions need to be applied to the RANK and RANKA columns.
Input table:
Condition for RANK column:
IF RANK == 0: RANK = previous RANK value + 1
ELSE: RANK = RANK
Condition for RANKA column:
IF RANKA == 0: RANKA = previous RANKA value + current row Salary value
ELSE: RANKA = RANKA
Below is a piece of code that I tried.
I have created dummy columns named RANK_new and RANKA_new to store the desired outputs of the RANK and RANKA columns after applying the conditions.
Once I get the correct values, I can replace the RANK and RANKA columns with those dummy columns.
# importing necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col
# function to create new SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.functions import lag,lead
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("employee_profile.com") \
        .getOrCreate()
    return spk

def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":
    # calling function to create SparkSession
    spark = create_session()
    input_data = [(1, "Shivansh", "Data Scientist", 2, 1, 1, 2),
                  (0, "Rishabh", "Software Developer", 5, 2, 0, 3),
                  (0, "Swati", "Data Analyst", 10, 3, 10, 4),
                  (1, "Amar", "Data Analyst", 2, 4, 9, 0),
                  (0, "Arpit", "Android Developer", 3, 5, 0, 0),
                  (0, "Ranjeet", "Python Developer", 4, 6, 0, 0),
                  (0, "Priyanka", "Full Stack Developer", 5, 7, 0, 0)]
    schema = ["Id", "Name", "Job Profile", "Salary", "hi", "RANK", "RANKA"]
    # calling function to create dataframe
    dff = create_df(spark, input_data, schema)

    # Below: logic for RANK
    df1 = dff.repartition(1)
    df2 = df1.withColumn(
        'RANK_new',
        when(col('RANK') == 0,
             lag(col('RANK') + lit(1)).over(Window.orderBy(col('hi'))))
        .otherwise(col('RANK')))
    df2 = df2.withColumn(
        'RANK_new',
        when((col('RANK') == 0) & (lag(col('RANK')).over(Window.orderBy(col('hi'))) == 0),
             lag(col('RANK_new') + lit(1)).over(Window.orderBy(col('hi'))))
        .otherwise(col('RANK_new')))

    # Below: logic for RANKA
    df2 = df2.withColumn(
        'RANKA_new',
        when(col('RANKA') == 0,
             lag(col("RANKA")).over(Window.orderBy("hi")) + col("Salary"))
        .otherwise(col('RANKA')))
    df2.show()
The issue with this code is that the lag function does not pick up the updated values of the previous rows.
This could be done with a for loop, but since my data is huge, I need a solution without one.
Below is the desired output:
Below is a summarized picture showing the output I got and the desired output.
RANK_new, RANKA_new --> the output I got for the RANK and RANKA columns after applying the above code.
RANK_desired, RANKA_desired --> what is expected to be produced.
You can first create partitioning groups for both RANK and RANKA. Then a running sum inside those partitions should work.
Input
from pyspark.sql import functions as F, Window as W
input_data = [(1, "Shivansh", "Data Scientist", 2,1,1,2),
(0, "Rishabh", "Software Developer", 5,2,0,3),
(0, "Swati", "Data Analyst", 10,3,10,4),
(1, "Amar", "Data Analyst", 2,4,9,0),
(0, "Arpit", "Android Developer", 3,5,0,0),
(0, "Ranjeet", "Python Developer", 4,6,0,0),
(0, "Priyanka", "Full Stack Developer",5,7,0,0)]
schema = ["Id", "Name", "Job Profile", "Salary",'hi','RANK','RANKA']
dff = spark.createDataFrame(input_data, schema)
Script:
w0 = W.orderBy('hi')
rank_grp = F.when(F.col('RANK') != 0, 1).otherwise(0)
dff = dff.withColumn('RANK_grp', F.sum(rank_grp).over(w0))
w1 = W.partitionBy('RANK_grp').orderBy('hi')
ranka_grp = F.when(F.col('RANKA') != 0, 1).otherwise(0)
dff = dff.withColumn('RANKA_grp', F.sum(ranka_grp).over(w0))
w2 = W.partitionBy('RANKA_grp').orderBy('hi')
dff = (dff
       .withColumn('RANK_new', F.sum(F.when(F.col('RANK') == 0, 1).otherwise(F.col('RANK'))).over(w1))
       .withColumn('RANKA_new', F.sum(F.when(F.col('RANKA') == 0, F.col('Salary')).otherwise(F.col('RANKA'))).over(w2))
       .drop('RANK_grp', 'RANKA_grp')
)
dff.show()
# +---+--------+--------------------+------+---+----+-----+--------+---------+
# | Id| Name| Job Profile|Salary| hi|RANK|RANKA|RANK_new|RANKA_new|
# +---+--------+--------------------+------+---+----+-----+--------+---------+
# | 1|Shivansh| Data Scientist| 2| 1| 1| 2| 1| 2|
# | 0| Rishabh| Software Developer| 5| 2| 0| 3| 2| 3|
# | 0| Swati| Data Analyst| 10| 3| 10| 4| 10| 4|
# | 1| Amar| Data Analyst| 2| 4| 9| 0| 9| 6|
# | 0| Arpit| Android Developer| 3| 5| 0| 0| 10| 9|
# | 0| Ranjeet| Python Developer| 4| 6| 0| 0| 11| 13|
# | 0|Priyanka|Full Stack Developer| 5| 7| 0| 0| 12| 18|
# +---+--------+--------------------+------+---+----+-----+--------+---------+
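A small note on why this works: every non-zero RANK (or RANKA) value opens a new group, and the running sum inside that group does the carrying. If you remove the .drop('RANK_grp', 'RANKA_grp') line, you can inspect the helper columns; for this sample input the window sums produce the following (a sketch, values derived by hand from the definitions above):
dff.select('hi', 'RANK', 'RANK_grp', 'RANKA', 'RANKA_grp').orderBy('hi').show()
# RANK_grp:  1, 1, 2, 3, 3, 3, 3   <- a new group starts at every non-zero RANK
# RANKA_grp: 1, 2, 3, 3, 3, 3, 3   <- a new group starts at every non-zero RANKA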
I have the following code
from pyspark.sql.functions import col, count, when
from functools import reduce
df = spark.createDataFrame([ (1,""), (2,None),(3,"c"),(4,"d") ], ['id','name'])
filter1 = col("name").isNull()
filter2 = col("name") == ""
dfresult = df.filter(filter1 | filter2).select(col("id"), when(filter1, "name is null").when(filter2, "name is empty").alias("new_col"))
dfresult.show()
+---+-------------+
| id| new_col|
+---+-------------+
| 1|name is empty|
| 2| name is null|
+---+-------------+
In a scenario with N filters, I am thinking about something like this:
filters = []
filters.append({ "item": filter1, "msg":"name is null"})
filters.append({ "item": filter2, "msg":"name is empty"})
dynamic_filter = reduce(
    lambda x, y: x | y,
    [s['item'] for s in filters]
)
df2 = df.filter(dynamic_filter).select(col("id"), when(filter1, "name is null").when(filter2, "name is empty").alias("new_col"))
df2.show()
How can I build the new_col column in a better way, with a dynamic when?
Simply use functools.reduce as you already did for the filter expression:
from functools import reduce
from pyspark.sql import functions as F
new_col = reduce(
    lambda acc, x: acc.when(x["item"], F.lit(x["msg"])),
    filters,
    F
)
df2 = df.filter(dynamic_filter).select(col("id"), new_col.alias("new_col"))
df2.show()
#+---+-------------+
#| id| new_col|
#+---+-------------+
#| 1|name is empty|
#| 2| name is null|
#+---+-------------+
I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default.
Here is the default Spark behavior.
val numbersDf = Seq(
("123"),
("456"),
(null),
("")
).toDF("numbers")
val lettersDf = Seq(
("123", "abc"),
("456", "def"),
(null, "zzz"),
("", "hhh")
).toDF("numbers", "letters")
val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
Here is the output of joinedDf.show():
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
+-------+-------+
This is the output I would like:
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
| null| zzz|
+-------+-------+
Spark provides a special NULL safe equality operator:
numbersDf
.join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
.drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
numbers_df = sc.parallelize([
("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])
letters_df = sc.parallelize([
("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])
numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
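Since eqNullSafe keeps both join columns, you may want to drop the right-hand copy afterwards; a minimal sketch based on the frames above:
(numbers_df
 .join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
 .drop(letters_df.numbers)
 .show())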
Similarly, you can use %<=>% in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
numbers = c("123", "456", NA, ""),
letters = c("abc", "def", "zzz", "hhh")
))
head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
numbers numbers letters
1 456 456 def
2 <NA> <NA> zzz
3 hhh
4 123 123 abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
This can be used with the DataFrame API as well:
numbersDf.alias("numbers")
.join(lettersDf.alias("letters"))
.where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
val numbers2 = numbersDf.withColumnRenamed("numbers","num1") //rename columns so that we can disambiguate them in the join
val letters2 = lettersDf.withColumnRenamed("numbers","num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull) ,"outer")
joinedDf.select("num1","letters").withColumnRenamed("num1","numbers").show //rename the columns back to the original names
Based on K L's idea, you could use foldLeft to generate join column expression:
def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame =
{
val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
val fullExpr = columns.tail.foldLeft(colExpr) {
(colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
}
leftDF.join(rightDF, fullExpr, joinType)
}
Then you can call this function like so:
aDF.transform(nullSafeJoin(bDF, columns, joinType))
Complementing the other answers: for PySpark < 2.3.0 you have neither Column.eqNullSafe nor IS NOT DISTINCT FROM.
You can still build the <=> operator with a SQL expression and include it in the join, as long as you define aliases for the join queries:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
numbers_df = spark.createDataFrame (["123","456",None,""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame ([("123", "abc"),("456", "def"),(None, "zzz"),("", "hhh") ]).\
toDF("numbers", "letters")
joined_df = numbers_df.alias("numbers").join(letters_df.alias("letters"),
F.expr('numbers.numbers <=> letters.numbers')).\
select('letters.*')
joined_df.show()
+-------+-------+
|numbers|letters|
+-------+-------+
| 456| def|
| null| zzz|
| | hhh|
| 123| abc|
+-------+-------+
Based on timothyzhang's idea one can further improve it by removing duplicate columns:
def dropDuplicateColumns(df: DataFrame, rightDf: DataFrame, cols: Seq[String]): DataFrame
= cols.foldLeft(df)((df, c) => df.drop(rightDf(c)))
def joinTablesWithSafeNulls(rightDF: DataFrame, leftDF: DataFrame, columns: Seq[String], joinType: String): DataFrame =
{
val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
val fullExpr = columns.tail.foldLeft(colExpr) {
(colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
}
val finalDF = leftDF.join(rightDF, fullExpr, joinType)
val filteredDF = dropDuplicateColumns(finalDF, rightDF, columns)
filteredDF
}
Try the following method to include the null rows in the result of the JOIN operation:
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
var columnsExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
columns.drop(1).foreach(column => {
columnsExpr = columnsExpr && (leftDF(column) <=> rightDF(column))
})
var joinedDF: DataFrame = leftDF.join(rightDF, columnsExpr, joinType)
columns.foreach(column => {
joinedDF = joinedDF.drop(leftDF(column))
})
joinedDF
}
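For PySpark (2.3.0 or later, where eqNullSafe is available), the same multi-column null-safe join idea can be sketched with functools.reduce; this is my own sketch rather than a translation of any single answer above, and it assumes columns is a list of the shared join column names:
from functools import reduce
from pyspark.sql import DataFrame

def null_safe_join(left_df: DataFrame, right_df: DataFrame, columns, join_type: str) -> DataFrame:
    # AND together one null-safe comparison per join column
    cond = reduce(
        lambda acc, c: acc & left_df[c].eqNullSafe(right_df[c]),
        columns[1:],
        left_df[columns[0]].eqNullSafe(right_df[columns[0]]),
    )
    joined = left_df.join(right_df, cond, join_type)
    # drop the right-hand copies of the join columns to avoid duplicates
    for c in columns:
        joined = joined.drop(right_df[c])
    return joined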
My scenario might be more easily explained through an example. Say I had the following data:
Type Time
A 1
B 3
A 5
B 9
I want to add an extra column to each row that represents the minimum absolute difference between all times of the same type. So for the first row, the minimum difference between all times of type A is 4, so the value would be 4 for rows 1 and 3, and likewise 6 for rows 2 and 4.
I am doing this in Spark and Spark SQL, so guidance there would be more useful, but if it needs to be explained through plain SQL, that would be a great help as well.
One possible approach is to use window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, min, abs}
val df = Seq(
("A", -10), ("A", 1), ("A", 5), ("B", 3), ("B", 9)
).toDF("type", "time")
First, let's determine the difference between consecutive rows sorted by time:
// Partition by type and sort by time
val w1 = Window.partitionBy($"Type").orderBy($"Time")
// Difference between this and previous
val diff = $"time" - lag($"time", 1).over(w1)
Then find the minimum over all diffs for a given type:
// Partition by time unordered and take unbounded window
val w2 = Window.partitionBy($"Type").rowsBetween(Long.MinValue, Long.MaxValue)
// Minimum difference over type
val minDiff = min(diff).over(w2)
df.withColumn("min_diff", minDiff).show
// +----+----+--------+
// |type|time|min_diff|
// +----+----+--------+
// | A| -10| 4|
// | A| 1| 4|
// | A| 5| 4|
// | B| 3| 6|
// | B| 9| 6|
// +----+----+--------+
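The same first approach in PySpark, computing the consecutive difference as an intermediate column first (a sketch assuming the equivalent DataFrame df):
from pyspark.sql import functions as F, Window as W

w1 = W.partitionBy("type").orderBy("time")
w2 = W.partitionBy("type")  # without orderBy the default frame is the whole partition

# difference between this row and the previous one, then the minimum per type
df2 = df.withColumn("diff", F.col("time") - F.lag("time", 1).over(w1))
df2.withColumn("min_diff", F.min("diff").over(w2)).show()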
If your goal is to find the minimum distance between the current row and any other row in a group, you can use a similar approach:
import org.apache.spark.sql.functions.{lead, when}
// Diff to previous
val diff_lag = $"time" - lag($"time", 1).over(w1)
// Diff to next
val diff_lead = lead($"time", 1).over(w1) - $"time"
val diffToClosest = when(
diff_lag < diff_lead || diff_lead.isNull,
diff_lag
).otherwise(diff_lead)
df.withColumn("diff_to_closest", diffToClosest)
// +----+----+---------------+
// |type|time|diff_to_closest|
// +----+----+---------------+
// | A| -10| 11|
// | A| 1| 4|
// | A| 5| 4|
// | B| 3| 6|
// | B| 9| 6|
// +----+----+---------------+
Tested in SQL Server 2008:
create table d(
type varchar(25),
Time int
)
insert into d
values ('A',1),
('B',3),
('A',5),
('B',9)
--solution one, calculation in query, might not be smart if dataset is large.
select *
, (select max(time) m from d as i where i.type = o.type) - (select MIN(time) m from d as i where i.type = o.type) dif
from d as o
--or this
select d.*, diftable.dif from d inner join
(select type, MAX(time) - MIN(time) dif
from d group by type ) as diftable on d.type = diftable.type
You should try something like this:
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(
("A", 1),
("B", 3),
("A", 5),
("B", 9)
))
val df = input.groupByKey().flatMap { case (key, values) =>
val smallestDiff = values.toList.sorted match {
case firstMin :: secondMin :: _ => secondMin - firstMin
case singleVal :: Nil => singleVal // Only one record for some `Type`
}
values.map { value =>
(key, value, smallestDiff)
}
}.toDF("Type", "Time", "SmallestDiff")
df.show()
Output:
+----+----+------------+
|Type|Time|SmallestDiff|
+----+----+------------+
| A| 1| 4|
| A| 5| 4|
| B| 3| 6|
| B| 9| 6|
+----+----+------------+
I hope you can help me with this.
I have a DF as follows:
val df = sc.parallelize(Seq(
(1, "a", "2014-12-01", "2015-01-01", 100),
(2, "a", "2014-12-01", "2015-01-02", 150),
(3, "a", "2014-12-01", "2015-01-03", 120),
(4, "b", "2015-12-15", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
.withColumn("dateIns", to_date($"dateIns")
.withColumn("dateTrans", to_date($"dateTrans"))
I would love to group by prodId and aggregate 'value', summing it over ranges of dates defined by the difference between the 'dateIns' and 'dateTrans' columns. In particular, I would like a way to define a conditional sum that adds up all values within a predefined maximum difference between the above-mentioned columns, i.e. all values that happened within 10, 20, or 30 days of dateIns ('dateTrans' - 'dateIns' <= 10, 20, 30).
Is there any predefined aggregate function in Spark that allows conditional sums? Do you recommend developing an aggregation UDF (if so, any suggestions)?
I'm using PySpark, but I'm very happy to get Scala solutions as well. Thanks a lot!
Let's make your data a little bit more interesting so there are some events in the window:
val df = sc.parallelize(Seq(
(1, "a", "2014-12-30", "2015-01-01", 100),
(2, "a", "2014-12-21", "2015-01-02", 150),
(3, "a", "2014-12-10", "2015-01-03", 120),
(4, "b", "2014-12-05", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
.withColumn("dateIns", to_date($"dateIns"))
.withColumn("dateTrans", to_date($"dateTrans"))
What you need is more or less something like this:
import org.apache.spark.sql.functions.{col, datediff, lit, sum}
// Find difference in tens of days
val diff = (datediff(col("dateTrans"), col("dateIns")) / 10)
.cast("integer") * 10
val dfWithDiff = df.withColumn("diff", diff)
val aggregated = dfWithDiff
.where((col("diff") < 30) && (col("diff") >= 0))
.groupBy(col("prodId"), col("diff"))
.agg(sum(col("value")))
And the results
aggregated.show
// +------+----+----------+
// |prodId|diff|sum(value)|
// +------+----+----------+
// | a| 20| 120|
// | b| 20| 100|
// | a| 0| 100|
// | a| 10| 150|
// +------+----+----------+
where diff is a lower bound for the range (0 -> [0, 10), 10 -> [10, 20), ...). This will work in PySpark as well if you remove val and adjust imports.
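For example, a PySpark sketch of the same logic:
from pyspark.sql import functions as F

# bucket the day difference into tens of days: 0 -> [0, 10), 10 -> [10, 20), ...
diff = (F.datediff(F.col("dateTrans"), F.col("dateIns")) / 10).cast("integer") * 10

df_with_diff = df.withColumn("diff", diff)

aggregated = (df_with_diff
    .where((F.col("diff") < 30) & (F.col("diff") >= 0))
    .groupBy("prodId", "diff")
    .agg(F.sum("value")))
aggregated.show()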
Edit (aggregate per column):
val exprs = Seq(0, 10, 20).map(x => sum(
when(col("diff") === lit(x), col("value"))
.otherwise(lit(0)))
.alias(x.toString))
dfWithDiff.groupBy(col("prodId")).agg(exprs.head, exprs.tail: _*).show
// +------+---+---+---+
// |prodId| 0| 10| 20|
// +------+---+---+---+
// | a|100|150|120|
// | b| 0| 0|100|
// +------+---+---+---+
with Python equivalent:
from pyspark.sql.functions import *
def make_col(x):
    cnd = when(col("diff") == lit(x), col("value")).otherwise(lit(0))
    return sum(cnd).alias(str(x))
exprs = [make_col(x) for x in range(0, 30, 10)]
dfWithDiff.groupBy(col("prodId")).agg(*exprs).show()
## +------+---+---+---+
## |prodId| 0| 10| 20|
## +------+---+---+---+
## | a|100|150|120|
## | b| 0| 0|100|
## +------+---+---+---+