spark sql left join with comparison in subquery - dataframe

I have the following 2 dataframes:
df_a:
+---+----------+----+
| id|      date|code|
+---+----------+----+
|  1|2021-06-27|   A|
+---+----------+----+
df_b:
+---+----------+----+
| id|      date|code|
+---+----------+----+
|  1|2021-05-19|   A|
|  1|2021-05-31|   B|
|  1|2021-08-27|   C|
+---+----------+----+
I want to use df_b.code to update df_a.code under the following condition:
use the row from df_b whose date is the latest date prior to df_a.date.
So df_a.code should be updated to 'B', since the df_b.date '2021-05-31' is the latest date prior to '2021-06-27'.
I tried:
select a.id, b.code
from df_a left join df_b
on a.id = b.id
and b.date = (select max(b.date) from df_b where id = a.id and date <= a.date)
but I'm getting the error 'Correlated scalar sub-queries can only be used in a Filter/Aggregate/Project and a few commands'.

You can use a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

win = Window.partitionBy(df_a.id).orderBy(df_b.date.desc())

(
    df_a
    .join(df_b, ['id'])
    .filter(df_a.date > df_b.date)
    .withColumn("r", F.row_number().over(win))
    .filter(F.col("r") == 1)
    .select(df_a.id, df_a.date, df_b.code)
).show()
Output:
+---+----------+----+
| id|      date|code|
+---+----------+----+
|  1|2021-06-27|   B|
+---+----------+----+
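If you want to keep the logic in SQL, as in your original attempt, the same row_number idea can be run through spark.sql. This is a minimal sketch, assuming a SparkSession named spark and that both dataframes are registered as temp views:
# A sketch of the same row_number approach in Spark SQL, assuming df_a and df_b
# have been registered as temp views and `spark` is the SparkSession.
df_a.createOrReplaceTempView("df_a")
df_b.createOrReplaceTempView("df_b")

spark.sql("""
    SELECT id, date, code
    FROM (
        SELECT a.id, a.date, b.code,
               ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY b.date DESC) AS r
        FROM df_a a
        JOIN df_b b
          ON a.id = b.id AND b.date < a.date
    ) t
    WHERE r = 1
""").show()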

Another approach is to compute the next date (lead) in df2 first, then join with a BETWEEN condition.
import pyspark.sql.functions as f
from pyspark.sql.window import Window

data1 = [[1, '2021-06-27', 'A']]
data2 = [[1, '2021-05-19', 'A'], [1, '2021-05-31', 'B'], [1, '2021-08-27', 'C']]
cols = ['id', 'date', 'code']

df1 = spark.createDataFrame(data1, cols).withColumn('date', f.col('date').cast('date'))
df2 = spark.createDataFrame(data2, cols).withColumn('date', f.col('date').cast('date'))

w = Window.partitionBy('id').orderBy('date')
df3 = df2.withColumn('date_after', f.lead('date', 1, '2999-12-31').over(w))
df3.show()

df1.alias('a') \
    .join(df3.alias('b'), (f.col('a.id') == f.col('b.id')) & (f.col('a.date').between(f.col('b.date'), f.col('b.date_after'))), 'left') \
    .withColumn('new_code', f.coalesce('b.code', 'a.code')) \
    .select('a.id', 'a.date', 'new_code').toDF('id', 'date', 'code') \
    .show()
+---+----------+----+----------+
| id|      date|code|date_after|
+---+----------+----+----------+
|  1|2021-05-19|   A|2021-05-31|
|  1|2021-05-31|   B|2021-08-27|
|  1|2021-08-27|   C|2999-12-31|
+---+----------+----+----------+
+---+----------+----+
| id|      date|code|
+---+----------+----+
|  1|2021-06-27|   B|
+---+----------+----+

Related

Output multiple summarized lists with KQL

I want to output multiple lists of unique column values with KQL.
For instance for the following table:
A | B | C
1 | x | one
1 | x | two
1 | y | one
I want to output
K | V
A | [1]
B | [x, y]
C | [one, two]
I accomplished this using summarize with make_list and two unions; I've been wondering whether it's possible to accomplish this in the same query without union.
Table
| distinct A
| summarize k="A", v= make_list(A)
union
Table
| distinct B
| summarize k="B", v= make_list(B)
...
If your data set is reasonably sized, you could try using the narrow() plugin: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/narrowplugin
datatable(A:int, B:string, C:string)
[
1, 'x', 'one',
1, 'x', 'two',
1, 'y', 'one',
]
| evaluate narrow()
| summarize make_set(Value) by Column
Column | set_Value
A      | ["1"]
B      | ["x","y"]
C      | ["one","two"]
Alternatively, you could use a combination of pack_all() and mv-apply:
datatable(A:int, B:string, C:string)
[
1, 'x', 'one',
1, 'x', 'two',
1, 'y', 'one',
]
| project p = pack_all()
| mv-apply p on (
extend key = tostring(bag_keys(p)[0])
| project key, value = p[key]
)
| summarize make_set(value) by key
key | set_value
A   | ["1"]
B   | ["x","y"]
C   | ["one","two"]

check first dataframe value startswith any of the second dataframe value

I have two PySpark dataframes, as follows:
df1 = spark.createDataFrame(
    ["yes", "no", "yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

df2 = spark.createDataFrame(
    ["yes", "no"],
    "string"
).toDF("location")
I want to check whether values in the location column of df1 start with values in the location column of df2, and vice versa.
Something like :
df1.select("location").startsWith(df2.location)
Following is the output I am expecting:
+-------------+
|     location|
+-------------+
|          yes|
|           no|
|        yes23|
|          no3|
+-------------+
Using Spark SQL looks the easiest to me:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')

joined = spark.sql("""
    select df1.*
    from df1
    join df2
    on df1.location rlike '^' || df2.location
""")

GROUP BY with overlapping rows in PySpark SQL

The following table was created using Parquet / PySpark, and the objective is to aggregate rows where 1 < count < 5 and rows where 2 < count < 6. Note the row where count is 4.1 falls in both ranges.
+-----+-----+
|count|value|
+-----+-----+
|  1.1|    1|
|  1.2|    2|
|  4.1|    3|
|  5.5|    4|
|  5.6|    5|
|  5.7|    6|
+-----+-----+
Here is code to create and then read the above table as a PySpark DataFrame.
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
from pyspark import SparkContext, SQLContext

# create Parquet DataFrame
pdf = pd.DataFrame({
    'count': [1.1, 1.2, 4.1, 5.5, 5.6, 5.7],
    'value': [1, 2, 3, 4, 5, 6]})
table = pa.Table.from_pandas(pdf)
pq.write_to_dataset(table, r'c:/data/data.parquet')

# read Parquet DataFrame and create view
sc = SparkContext()
sql = SQLContext(sc)
df = sql.read.parquet(r'c:/data/data.parquet')
df.createTempView('data')
The operation can use two separate queries.
q1 = sql.sql("""
SELECT AVG(value) AS va
FROM data
WHERE count > 1
AND count < 5
""")
+---+
| va|
+---+
|2.0|
+---+
and, similarly
q2 = sql.sql("""
SELECT AVG(value) as va
FROM data
WHERE count > 2
AND count < 6
""")
+---+
| va|
+---+
|4.5|
+---+
However, I want to do this in one efficient query.
Here is an approach that does not work because the row where count is 4.1 is included in only one group.
qc = sql.sql("""
SELECT AVG(value) AS va,
(CASE WHEN count > 1 AND count < 5 THEN 1
WHEN count > 2 AND count < 6 THEN 2
ELSE 0 END) AS id
FROM data
GROUP BY id
""")
The above query produces
+---+---+
| va| id|
+---+---+
|2.0|  1|
|5.0|  2|
+---+---+
To be clear the desired result is something more like
+---+---+
| va| id|
+---+---+
|2.0|  1|
|4.5|  2|
+---+---+
The simplest method is probably union all:
SELECT 1, AVG(value) AS va
FROM data
WHERE count > 1 AND count < 5
UNION ALL
SELECT 2, AVG(value) as va
FROM data
WHERE count > 2 AND count < 6;
You can also phrase this as:
select r.id, avg(d.value)
from data d join
(select 1 as lo, 5 as hi, 1 as id union all
select 2 as lo, 6 as hi, 2 as id
) r
on d.count > r.lo and d.count < r.hi
group by r.id;
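If you'd rather stay in the DataFrame API, roughly the same ranges-join can be sketched as below, assuming the df and the SQLContext sql created in the question; the ranges dataframe is just a hypothetical helper holding the two intervals:
# A rough DataFrame-API sketch of the ranges-join idea, assuming `df` and the
# SQLContext `sql` from the question. `ranges` is a hypothetical helper dataframe.
from pyspark.sql import functions as F

ranges = sql.createDataFrame(
    [(1.0, 5.0, 1), (2.0, 6.0, 2)],
    ['lo', 'hi', 'id'])

result = (
    df.join(ranges, (df['count'] > ranges['lo']) & (df['count'] < ranges['hi']))
      .groupBy('id')
      .agg(F.avg('value').alias('va')))
result.show()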

PySpark - how to update Dataframe by using join?

I have a dataframe a:
id,value
1,11
2,22
3,33
And another dataframe b:
id,value
1,123
3,345
I want to update dataframe a with all matching values from b (based on column 'id').
Final dataframe 'c' would be:
id,value
1,123
2,22
3,345
How can I achieve that using dataframe joins (or another approach)?
Tried:
a.join(b, a.id == b.id, "inner").drop(a.value)
Gives (not desired output):
+---+---+-----+
| id| id|value|
+---+---+-----+
|  1|  1|  123|
|  3|  3|  345|
+---+---+-----+
Thanks.
I don't think there is an update functionality. But this should work:
import pyspark.sql.functions as F

df1.join(df2, df1.id == df2.id, "left_outer") \
   .select(df1.id, df2.id, F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))
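A slightly shorter variant of the same idea uses coalesce; this is a minimal sketch, assuming the dataframes a and b from the question (b_value is just an illustrative rename to avoid the duplicate column name):
# A minimal sketch using coalesce, assuming dataframes `a` and `b` from the question;
# `b_value` is a hypothetical rename to avoid the duplicate "value" column.
import pyspark.sql.functions as F

c = (
    a.join(b.withColumnRenamed("value", "b_value"), on="id", how="left")
     .select("id", F.coalesce("b_value", "value").alias("value"))
)
c.show()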

Pyspark DataFrame Conditional groupBy

from pyspark.sql import Row, functions as F

row = Row("UK_1", "UK_2", "Date", "Cat")
agg = ''
agg = 'Cat'

tdf = (sc.parallelize([
    row(1, 1, '12/10/2016', "A"),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(3, 3, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A'),
    row(1, 1, '12/10/2016', "A"),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(None, None, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A')
]).toDF())

# something like this (pseudo-code):
tdf.groupBy(iff(len(agg.strip()) > 0, F.col(agg), )).agg(F.count('*').alias('row_count')).show()
Is there a way to use a column or no column based on some condition in the dataframe groupBy?
You can provide an empty list to groupBy when the condition is not met, which will group by no column:
tdf.groupBy(agg if len(agg) > 0 else []).agg(...)
agg = ''
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---------+
|row_count|
+---------+
|       10|
+---------+
agg = 'Cat'
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---+---------+
|Cat|row_count|
+---+---------+
|  B|        4|
|  A|        6|
+---+---------+