Check whether the values in a column of df1 exist in a column of df2 - dataframe

I have a df1 with column values as below
names
AB
DC
DE
FG
GG
TR
Another df2 as
date names
2022-11-01 AB
2022-11-01 DE
2011-11-01 FG
2022-11-02 DC
2022-11-02 GG
2022-11-02 TR
I want to check whether all the values of the df1 column exist in the df2 names column; if yes, set true, otherwise false, in a new column.
I am able to do it for a single given date using dataframes with a flag column, using when/otherwise to check the flag value. I am not able to run this across many days.

With the pandas API, this should do the trick:
df1["exists"] = df1["names"].isin(df2["names"].unique())

In plain PySpark, you can flag the df1 names with a constant column and left-join it onto df2:
import pyspark.sql.functions as F

df1 = spark.createDataFrame(
    [('AB',), ('DC',), ('DE',), ('FG',), ('GG',), ('TR',)]
).toDF("names")

df2 = spark.createDataFrame(
    [
        ('2022-11-01', 'AB'),
        ('2022-11-01', 'DE'),
        ('2011-11-01', 'FG'),
        ('2022-11-02', 'DC'),
        ('2022-11-02', 'GG'),
        ('2022-11-02', 'TR'),
        ('2022-11-02', 'ZZ'),
    ],
    ["date", "names"]
).withColumn('date', F.col('date').cast('date'))

# every name present in df1 gets True; names missing from df1 end up null after the join
df1 = df1.withColumn('bool', F.lit(True))

df2 \
    .join(df1, on='names', how='left') \
    .fillna(False) \
    .select('date', 'names', 'bool') \
    .show()
# +----------+-----+-----+
# |      date|names| bool|
# +----------+-----+-----+
# |2022-11-01|   AB| true|
# |2022-11-02|   DC| true|
# |2022-11-01|   DE| true|
# |2011-11-01|   FG| true|
# |2022-11-02|   GG| true|
# |2022-11-02|   TR| true|
# |2022-11-02|   ZZ|false|
# +----------+-----+-----+
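If the goal is a single per-date flag saying whether every name from df1 appears in df2 on that date (rather than a per-row flag), one way is to build the full date-by-names grid and then check for gaps. This is only a sketch reusing the df1 and df2 defined above; the column name all_names_present is made up for illustration:
expected = df2.select('date').distinct().crossJoin(df1.select('names'))
observed = df2.withColumn('seen', F.lit(True))

per_date = (expected
    .join(observed, on=['date', 'names'], how='left')
    .groupBy('date')
    # a date passes only if no expected name is left unmatched (seen is null)
    .agg((F.count(F.when(F.col('seen').isNull(), 1)) == 0).alias('all_names_present')))
per_date.show()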


How to convert a 1-row, 4-column dataframe to a 4-row, 2-column dataframe in pyspark or sql

I have a dataframe which returns the output as a single row of counts, one column per category, and I would like to transpose this into one row per category, with the category name and its count as two columns.
Can someone help me understand how to write the pyspark code to achieve this result dynamically? I have tried unpivot in SQL but had no luck.
df = spark.createDataFrame([
    (78, 20, 19, 90),
], ('Machines', 'Books', 'Vehicles', 'Plants'))
Create a new array-of-struct column that combines the column names and values, then use the magic inline function to explode the struct fields. Code below:
from pyspark.sql import functions as F

df.withColumn(
    'tab',
    F.array(*[F.struct(F.lit(x).alias('Fields'), F.col(x).alias('Count')).alias(x) for x in df.columns])
).selectExpr('inline(tab)').show()
+--------+-----+
|  Fields|Count|
+--------+-----+
|Machines|   78|
|   Books|   20|
|Vehicles|   19|
|  Plants|   90|
+--------+-----+
As mentioned in the unpivot-dataframe tutorial, use:
df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
Or to generalise:
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
Full example:
df = spark.createDataFrame(data=[[78,20,19,90]], schema=['Machines','Books','Vehicles','Plants'])
# Hard coded
# df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
# Generalised
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
[Out]:
+--------+-----+
|Fields  |Count|
+--------+-----+
|Machines|78   |
|Books   |20   |
|Vehicles|19   |
|Plants  |90   |
+--------+-----+
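As a side note, if you are on Spark 3.4 or later, DataFrame.unpivot (the DataFrame-API counterpart of stack) can express the same reshape without hand-writing the SQL string; this is only a sketch under that version assumption:
df = spark.createDataFrame(data=[[78, 20, 19, 90]], schema=['Machines', 'Books', 'Vehicles', 'Plants'])
# no id columns are kept here; every column becomes a (Fields, Count) pair
df = df.unpivot([], df.columns, 'Fields', 'Count')
df.show()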

Modify key column to match the join condition

I am working on datasets (having 20k distinct records) and joining two data frames based on an identifier column id_text:
df1.join(df2, df1.id_text == df2.id_text, "inner").select(df1['*'], df2['Name'].alias('DName'))
df1 has the following sample values in the identifier column id_text:
X North
Y South
Z West
Whereas df2 has the following sample values from identifier column id_text:
North X
South Y
West Z
Logically, the different values for id_text are both correct. Hardcoding those values for 10k records is not a feasible solution. Is there any way id_text can be modified in df2 to match df1?
You can use a column expression directly inside the join (it will not create an additional column). In this example, I used regexp_replace to swap the two parts of the string.
from pyspark.sql import functions as F
df1 = spark.createDataFrame([('X North', 1), ('Y South', 1), ('Z West', 1)], ['id_text', 'val1'])
df2 = spark.createDataFrame([('North X', 2), ('South Y', 2), ('West Z', 2)], ['id_text', 'Name'])
# df1                    df2
# +-------+----+         +-------+----+
# |id_text|val1|         |id_text|Name|
# +-------+----+         +-------+----+
# |X North|   1|         |North X|   2|
# |Y South|   1|         |South Y|   2|
# | Z West|   1|         | West Z|   2|
# +-------+----+         +-------+----+
df = (df1
    .join(df2, df1.id_text == F.regexp_replace(df2.id_text, r'(.+) (.+)', '$2 $1'), 'inner')
    .select(df1['*'], df2.Name))
df.show()
# +-------+----+----+
# |id_text|val1|Name|
# +-------+----+----+
# |X North|   1|   2|
# |Y South|   1|   2|
# | Z West|   1|   2|
# +-------+----+----+
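A variant of the same idea, in case the regex feels opaque: split on the space, reverse the token order, and glue the pieces back together with concat_ws. This is only a sketch and assumes the df2 values are simply the df1 values with the word order reversed, as in the sample data:
swapped = F.concat_ws(' ', F.reverse(F.split(df2.id_text, ' ')))

df = (df1
    .join(df2, df1.id_text == swapped, 'inner')
    .select(df1['*'], df2.Name))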

Insert data into a single column, in dictionary format, after concatenating a few columns of data

I want to create a single column in PySpark by concatenating a number of columns, but stored in dictionary format.
I have concatenated the data into a single column, but I am unable to store it in dictionary format.
Please find the attached screenshot for more details.
Let me know if you need more information.
In your current situation, you can use str_to_map:
from pyspark.sql import functions as F
df = spark.createDataFrame([("datatype:0,length:1",)], ['region_validation_check_status'])
df = df.withColumn(
    'region_validation_check_status',
    F.expr("str_to_map(region_validation_check_status, ',')")
)
df.show(truncate=0)
# +------------------------------+
# |region_validation_check_status|
# +------------------------------+
# |{datatype -> 0, length -> 1}  |
# +------------------------------+
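A possible follow-up, in case you then want to read a single entry back out of the map; getItem looks up a map value by key (the alias here is just for display):
df.select(
    F.col('region_validation_check_status').getItem('datatype').alias('datatype')
).show()
# +--------+
# |datatype|
# +--------+
# |       0|
# +--------+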
If you didn't have a string yet, you could do it from column values with to_json and from_json
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])
df.show()
# +---+---+
# |  a|  b|
# +---+---+
# |  1|  2|
# |  3|  4|
# +---+---+
df = df.select(
    F.from_json(F.to_json(F.struct('a', 'b')), 'map<string, int>')
)
df.show()
# +----------------+
# |         entries|
# +----------------+
# |{a -> 1, b -> 2}|
# |{a -> 3, b -> 4}|
# +----------------+
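The same map can also be built without the JSON round trip using create_map, which keeps the original integer value type; a sketch on the same sample data:
df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])
df = df.select(
    # interleave literal keys with the corresponding column values
    F.create_map(F.lit('a'), F.col('a'), F.lit('b'), F.col('b')).alias('entries')
)
df.show()
# +----------------+
# |         entries|
# +----------------+
# |{a -> 1, b -> 2}|
# |{a -> 3, b -> 4}|
# +----------------+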

How to extract the rows where the id value changed in a dataframe?

I'm trying to extract from a dataframe only the rows where the id changes. Let's say we have the following dataframe:
# id Date Value
# 152 12/4 True
# 152 12/4 True
# 152 12/4 True
# 158 12/4 True
# 158 13/4 False
# 158 13/4 False
I want to create a new DataFrame with only the rows where the id changes, plus the previous row:
# id Date Value
# 152 12/4 True
# 158 12/4 True
I tried with lag and window functions but I didn't get a good result. Thanks in advance.
Using lag and lead, here is a solution. As per your requirement, when the id changes, this selects the current row and also the previous row. I modified the test data to cover other scenarios.
from pyspark.sql.window import Window
import pyspark.sql.functions as F

df = spark.createDataFrame([[151, '12/4', True],
                            [152, '12/4', True],
                            [152, '12/4', True],
                            [158, '12/4', True],
                            [158, '12/4', True],
                            [158, '12/4', True]],
                           schema=['id', 'Date', 'Value'])

window = Window.orderBy("id")
df = df.withColumn("prev_id", F.lag(F.col("id")).over(window))
df = df.withColumn("next_id", F.lead(F.col("id")).over(window))

df.filter(
    'id != next_id or id != prev_id'
).drop(
    'prev_id', 'next_id'
).show()
which results in:
+---+----+-----+
| id|Date|Value|
+---+----+-----+
|151|12/4| true|
|152|12/4| true|   (id changed, so this row and the previous one are selected)
|152|12/4| true|
|158|12/4| true|   (id changed, so this row and the previous one are selected)
+---+----+-----+

Pyspark - how to backfill a DataFrame?

How can you do the same thing as df.fillna(method='bfill') for a pandas dataframe with a pyspark.sql.DataFrame?
The pyspark dataframe has the pyspark.sql.DataFrame.fillna method; however, it has no support for a method parameter.
In pandas you can use the following to backfill a time series:
Create data
import pandas as pd
index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]
df = pd.DataFrame({'data': data}, index=index)
Giving
Out[1]:
            data
2017-01-01   1.0
2017-01-02   2.0
2017-01-03   3.0
2017-01-04   NaN
2017-01-05   5.0
Backfill the dataframe
df = df.fillna(method='bfill')
Produces the backfilled frame
Out[2]:
            data
2017-01-01   1.0
2017-01-02   2.0
2017-01-03   3.0
2017-01-04   5.0
2017-01-05   5.0
How can the same thing be done for a pyspark.sql.DataFrame?
The last and first functions, with their ignorenulls=True flags, can be combined with the rowsBetween windowing. If we want to fill backwards, we select the first non-null that is between the current row and the end. If we want to fill forwards, we select the last non-null that is between the beginning and the current row.
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys
df.withColumn(
    'data',
    F.first(
        F.col('data'),
        ignorenulls=True
    ).over(
        W.orderBy('date').rowsBetween(0, sys.maxsize)
    )
)
source on filling in spark: https://towardsdatascience.com/end-to-end-time-series-interpolation-in-pyspark-filling-the-gap-5ccefc6b7fc9
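For completeness, the forward fill described above is symmetric; a sketch of it under the same assumptions (same imports, same df with date and data columns):
df.withColumn(
    'data',
    F.last(
        F.col('data'),
        ignorenulls=True
    ).over(
        # last non-null between the start of the frame and the current row
        W.orderBy('date').rowsBetween(-sys.maxsize, 0)
    )
)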
Actually, backfilling a distributed dataset is not as easy a task as with a pandas (local) dataframe - you cannot be sure that the value to fill with exists in the same partition. I would use crossJoin with windowing, for example for this DF:
df = spark.createDataFrame([
    ('2017-01-01', None),
    ('2017-01-02', 'B'),
    ('2017-01-03', None),
    ('2017-01-04', None),
    ('2017-01-05', 'E'),
    ('2017-01-06', None),
    ('2017-01-07', 'G')], ['date', 'value'])
df.show()
+----------+-----+
|      date|value|
+----------+-----+
|2017-01-01| null|
|2017-01-02|    B|
|2017-01-03| null|
|2017-01-04| null|
|2017-01-05|    E|
|2017-01-06| null|
|2017-01-07|    G|
+----------+-----+
The code would be:
from pyspark.sql.functions import col, row_number, coalesce
from pyspark.sql.window import Window

df.alias('a').crossJoin(df.alias('b')) \
    .where((col('b.date') >= col('a.date')) & (col('a.value').isNotNull() | col('b.value').isNotNull())) \
    .withColumn('rn', row_number().over(Window.partitionBy('a.date').orderBy('b.date'))) \
    .where(col('rn') == 1) \
    .select('a.date', coalesce('a.value', 'b.value').alias('value')) \
    .orderBy('a.date') \
    .show()
+----------+-----+
|      date|value|
+----------+-----+
|2017-01-01|    B|
|2017-01-02|    B|
|2017-01-03|    E|
|2017-01-04|    E|
|2017-01-05|    E|
|2017-01-06|    G|
|2017-01-07|    G|
+----------+-----+