PySpark: coalesce a column using the first non-null and most recent non-null values - dataframe

I have a dataframe in which a column has nulls for the first few and the last few rows. How do I coalesce this column using the first non-null value and the last non-null record?
For example say I have the following dataframe:
What I'd want to produce is the following:
So as you can see the first two rows get populated with 0.6 because that is the first non-null record. The last several rows become 3 because that was the last non-null record.

You can use last() with ignorenulls=True for filling, over a Window for ordering:
import datetime

from pyspark.sql import Row, Window, functions as F

df = sql_context.createDataFrame([
    Row(Month=datetime.date(2021, 1, 1), Rating=None),
    Row(Month=datetime.date(2021, 2, 1), Rating=None),
    Row(Month=datetime.date(2021, 3, 1), Rating=0.6),
    Row(Month=datetime.date(2021, 4, 1), Rating=1.2),
    Row(Month=datetime.date(2021, 5, 1), Rating=1.),
    Row(Month=datetime.date(2021, 6, 1), Rating=None),
    Row(Month=datetime.date(2021, 7, 1), Rating=None),
])

(
    df
    # Forward fill: replace each null with the last non-null value so far
    .withColumn('Rating',
        F.when(F.isnull('Rating'),
               F.last('Rating', ignorenulls=True).over(Window.orderBy('Month')))
         .otherwise(F.col('Rating')))
    # This second pass, in reverse order, is only required for the leading null rows
    .withColumn('Rating',
        F.when(F.isnull('Rating'),
               F.last('Rating', ignorenulls=True).over(Window.orderBy(F.desc('Month'))))
         .otherwise(F.col('Rating')))
    .sort('Month')  # Only required for output formatting
    .show()
)
# Output
+----------+------+
| Month|Rating|
+----------+------+
|2021-01-01| 0.6|
|2021-02-01| 0.6|
|2021-03-01| 0.6|
|2021-04-01| 1.2|
|2021-05-01| 1.0|
|2021-06-01| 1.0|
|2021-07-01| 1.0|
+----------+------+
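The two window passes above perform a forward fill followed by a backward fill. Outside Spark, the same filling logic can be sketched in plain Python (the helper names here are mine for illustration, not part of any API):

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled

def fill_both_ways(values):
    """Forward fill, then backward fill whatever is still None (the leading rows)."""
    forward = forward_fill(values)
    # A backward fill is just a forward fill over the reversed sequence
    return forward_fill(forward[::-1])[::-1]

ratings = [None, None, 0.6, 1.2, 1.0, None, None]
print(fill_both_ways(ratings))  # [0.6, 0.6, 0.6, 1.2, 1.0, 1.0, 1.0]
```

Note that the second pass only changes the leading rows, which is why the answer above calls it "only required for the first rows in the DF".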

Related

Equivalent of `takeWhile` for Spark dataframe

I have a dataframe looking like this:
scala> val df = Seq((1,.5), (2,.3), (3,.9), (4,.0), (5,.6), (6,.0)).toDF("id", "x")
scala> df.show()
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
| 4|0.0|
| 5|0.6|
| 6|0.0|
+---+---+
I would like to take the first rows of the data as long as the x column is nonzero (note that the dataframe is sorted by id so talking about the first rows is relevant). For this given dataframe, it would give something like that:
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
+---+---+
I only kept the first 3 rows, as the 4th row was zero.
For a simple Seq, I can do something like Seq(0.5, 0.3, 0.9, 0.0, 0.6, 0.0).takeWhile(_ != 0.0). So for my dataframe I thought of something like this:
df.takeWhile('x =!= 0.0)
But unfortunately, the takeWhile method is not available for dataframes.
I know that I can transform my dataframe to a Seq to solve my problem, but I would like to avoid gathering all the data to the driver as it will likely crash it.
The take and limit methods let me get the first n rows of a dataframe, but I can't specify a predicate. Is there a simple way to do this?
Can you guarantee that IDs will be in ascending order? New data is not necessarily added in a specific order. If you can guarantee the order, then you can use this query to achieve what you want. It won't perform well on large data sets, but it may be the only way to achieve what you are interested in.
We'll mark every 0 as 1 and everything else as 0, then take a rolling total over the entire data set. Since the running total only increases at a zero, it partitions the data set into the sections between zeros.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

val windowSpec = Window.partitionBy().orderBy("id")

df.select(
    col("id"),
    col("x"),
    sum( // running total, which stays 0 for all rows before the first 0
      when(col("x") === lit(0), lit(1)).otherwise(lit(0)) // mark 0's to help partition the data set
    ).over(windowSpec).as("partition")
  )
  .where(col("partition") === lit(0))
  .show()
+---+---+---------+
| id| x|partition|
+---+---+---------+
| 1|0.5| 0|
| 2|0.3| 0|
| 3|0.9| 0|
+---+---+---------+
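For reference, the semantics the window query reproduces are those of takeWhile on an ordered sequence. In plain Python (which collects everything into memory, the very thing the question wants to avoid at scale) it would be:

```python
from itertools import takewhile

rows = [(1, 0.5), (2, 0.3), (3, 0.9), (4, 0.0), (5, 0.6), (6, 0.0)]

# Keep rows only until the first one whose x value is zero
kept = list(takewhile(lambda row: row[1] != 0.0, rows))
print(kept)  # [(1, 0.5), (2, 0.3), (3, 0.9)]
```

The window-based query achieves the same cutoff without pulling the rows to the driver, at the cost of a full scan under a single unpartitioned window.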

How to add multiple column dynamically based on filter condition

I am trying to create multiple columns dynamically, based on a filter condition, after comparing two data frames with the code below:
source_df
+---+-----+-----+------+
|key|val11|val12|  date|
+---+-----+-----+------+
|abc|  1.1| john|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
dest_df
+---+-----+-----+------+
|key|val11|val12|  date|
+---+-----+-----+------+
|abc|  2.1| jack|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
columns = source_df.columns[1:]
joined_df = source_df\
    .join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    report = joined_df\
        .filter((source_df[column] != dest_df[column]))\
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
The output I expect is:
#Expected
+---+-------------------+-------------------+
|key|difference_in_val11|difference_in_val12|
+---+-------------------+-------------------+
|abc|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-------------------+-------------------+
I get only one column in the result:
#Actual
+---+-------------------+
|key|difference_in_val12|
+---+-------------------+
|abc|[src:john,dst:jack]|
+---+-------------------+
How to generate multiple columns based on filter condition dynamically?
Dataframes are immutable objects, so each withColumn call returns a new dataframe. That means you need to build on the dataframe generated in the previous iteration rather than starting from joined_df every time. Something like below:
from pyspark.sql import functions as F

columns = source_df.columns[1:]
joined_df = source_df\
    .join(dest_df, 'key', 'full')
for column in columns:
    if column != columns[-1]:
        column_name = "difference_in_" + str(column)
        report = joined_df\
            .filter((source_df[column] != dest_df[column]))\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
    else:
        column_name = "difference_in_" + str(column)
        report1 = report\
            .filter((source_df[column] != dest_df[column]))\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))

report1.show()
#report.show()
Output -
+---+-----+-----+-----+-----+-------------------+-------------------+
|key|val11|val12|val11|val12|difference_in_val11|difference_in_val12|
+---+-----+-----+-----+-----+-------------------+-------------------+
|abc| 1.1| john| 2.1| jack| [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-----+-----+-----+-----+-------------------+-------------------+
You could also do this with a union of both dataframes, then collect_list only where the collect_set size is greater than 1; this avoids joining the dataframes:
from pyspark.sql import functions as F

cols = source_df.drop("key").columns

output = (source_df.withColumn("ref", F.lit("src:"))
          .unionByName(dest_df.withColumn("ref", F.lit("dst:")))
          .groupBy("key")
          .agg(*[F.when(F.size(F.collect_set(i)) > 1,
                        F.collect_list(F.concat("ref", i))).alias(i)
                 for i in cols])
          .dropna(subset=cols, how='all'))

output.show()
+---+------------------+--------------------+
|key| val11| val12|
+---+------------------+--------------------+
|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]|
+---+------------------+--------------------+
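Setting the Spark machinery aside, the per-key comparison that both answers implement can be sketched in plain Python (the dict layout here is purely illustrative, not a Spark structure):

```python
source = {"abc": {"val11": "1.1", "val12": "john"},
          "def": {"val11": "3.0", "val12": "dani"}}
dest   = {"abc": {"val11": "2.1", "val12": "jack"},
          "def": {"val11": "3.0", "val12": "dani"}}

def differences(src, dst):
    """For each key, report every column whose value differs as [src:x,dst:y]."""
    report = {}
    for key in src:
        diffs = {f"difference_in_{c}": f"[src:{src[key][c]},dst:{dst[key][c]}]"
                 for c in src[key] if src[key][c] != dst[key][c]}
        if diffs:  # keep only keys that differ somewhere
            report[key] = diffs
    return report

print(differences(source, dest))
# {'abc': {'difference_in_val11': '[src:1.1,dst:2.1]', 'difference_in_val12': '[src:john,dst:jack]'}}
```

The join-based answer computes this per column across the joined rows, while the union-based answer detects a difference as a collect_set of size greater than 1 per key.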

Shift pyspark column value to left by one

I have a pyspark dataframe that looks like this:
+----+----+------+------+
|name| age|height|weight|
+----+----+------+------+
|    |Mike|    20|   6-7|
+----+----+------+------+
As you can see the values and the column names are not aligned. For example, "Mike" should be under the column of "name", instead of age.
How can I shift the values to left by one so it can match the column name?
The ideal dataframe looks like:
+----+---+------+------+
|name|age|height|weight|
+----+---+------+------+
|Mike| 20|   6-0|   160|
+----+---+------+------+
Please note that the above data is just an example. In reality I have more than 200 columns and more than 1M rows of data.
Try .toDF() with new column names, after dropping the name column from the dataframe.
Example:
df=spark.createDataFrame([('','Mike',20,'6-7',160)],['name','age','height','weight'])
df.show()
#+----+----+------+------+---+
#|name| age|height|weight| _5|
#+----+----+------+------+---+
#| |Mike| 20| 6-7|160|
#+----+----+------+------+---+
#select all columns except name
df1=df.select(*[i for i in df.columns if i != 'name'])
drop_col=df.columns.pop()
req_cols=[i for i in df.columns if i != drop_col]
df1.toDF(*req_cols).show()
#+----+---+------+------+
#|name|age|height|weight|
#+----+---+------+------+
#|Mike| 20| 6-7| 160|
#+----+---+------+------+
Using spark.createDataFrame():
cols=['name','age','height','weight']
spark.createDataFrame(df.select(*[i for i in df.columns if i != 'name']).rdd,cols).show()
#+----+---+------+------+
#|name|age|height|weight|
#+----+---+------+------+
#|Mike| 20| 6-7| 160|
#+----+---+------+------+
If you are creating the dataframe while reading a file, define a schema whose first column is a dummy column, then once you have read the data drop that column using the .drop() function.
spark.read.schema(<struct_type schema>).csv(<path>).drop('<dummy_column_name>')
spark.read.option("header","true").csv(<path>).toDF(<columns_list_with dummy_column>).drop('<dummy_column_name>')
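The renaming trick amounts to dropping the first (empty) value and re-pairing the remaining values with the original column names minus the trailing dummy. In plain Python terms (illustration only, not Spark API):

```python
columns = ['name', 'age', 'height', 'weight', '_5']
row = ['', 'Mike', 20, '6-7', 160]

# Drop the first (empty) value, then pair the remaining values
# with the original column names minus the trailing dummy column.
shifted = dict(zip(columns[:-1], row[1:]))
print(shifted)  # {'name': 'Mike', 'age': 20, 'height': '6-7', 'weight': 160}
```

Because only the names move, not the data, this approach stays cheap even with 200+ columns and millions of rows.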

Pyspark number of unique values in dataframe is different compared with Pandas result

I have large dataframe with 4 million rows. One of the columns is a variable called "name".
When I check the number of unique values in Pandas with df['name'].nunique(), I get a different answer than from PySpark's df.select("name").distinct().show() (around 1800 in Pandas versus 350 in PySpark). How can this be? Is this a data partitioning thing?
EDIT:
The record "name" in the dataframe looks like: name-{number}, for example: name-1, name-2, etc.
In Pandas:
df['name'] = df['name'].str.lstrip('name-').astype(int)
df['name'].nunique() # 1800
In Pyspark:
import pyspark.sql.functions as f
df = df.withColumn("name", f.split(df['name'], '\-')[1].cast("int"))
df.select(f.countDistinct("name")).show()
IIUC, it's most likely from non-numeric chars (e.g. a SPACE) in the name column. Pandas will force the type conversion, while with Spark you get NULL. See the example below:
df = spark.createDataFrame([(e,) for e in ['name-1', 'name-22 ', 'name- 3']],['name'])
for PySpark:
import pyspark.sql.functions as f
df.withColumn("name1", f.split(df['name'], '\-')[1].cast("int")).show()
#+--------+-----+
#| name|name1|
#+--------+-----+
#| name-1| 1|
#|name-22 | null|
#| name- 3| null|
#+--------+-----+
for Pandas:
df.toPandas()['name'].str.lstrip('name-').astype(int)
#Out[xxx]:
#0     1
#1    22
#2     3
#Name: name, dtype: int64
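Two plain-Python details explain why the Pandas side succeeds here: str.lstrip strips a set of characters rather than a literal prefix, and int() tolerates surrounding whitespace:

```python
# str.lstrip('name-') removes any leading run of the characters
# {'n', 'a', 'm', 'e', '-'}, not the literal prefix "name-".
assert 'name-22 '.lstrip('name-') == '22 '
assert 'name- 3'.lstrip('name-') == ' 3'

# int() tolerates surrounding whitespace, so pandas' .astype(int)
# succeeds where Spark's string-to-int cast produced NULL above.
assert int('22 ') == 22
assert int(' 3') == 3
```

Trimming the column before casting on the Spark side would make the two counts agree.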

pyspark withColumn, how to vary column name

Is there any way to create/fill columns with pyspark 2.1.0 where the name of the column is the value of a different column?
I tried the following
def createNewColumnsFromValues(dataFrame, colName, targetColName):
    """
    Set value of column colName to targetColName's value
    """
    cols = dataFrame.columns
    #df = dataFrame.withColumn(f.col(colName), f.col(targetColName))
    df = dataFrame.withColumn('x', f.col(targetColName))
    return df
The commented-out line does not work; when calling the method I get the error
TypeError: 'Column' object is not callable
whereas a fixed name (as a string) is no problem. Any idea how to also make the name of the column come from another one, not just the value? I also tried a UDF definition as a workaround, with the same lack of success.
Thanks for help!
Edit:
from pyspark.sql import functions as f
I figured out a solution which scales nicely for the few (or at least not many) distinct values I need columns for. This is necessarily the case anyway, or the number of columns would explode.
def createNewColumnsFromValues(dataFrame, colName, targetCol):
    distinctValues = dataFrame.select(colName).distinct().collect()
    for value in distinctValues:
        dataFrame = dataFrame.withColumn(
            str(value[0]),
            f.when(f.col(colName) == value[0], f.col(targetCol)).otherwise(f.lit(None)))
    return dataFrame
You might want to try the following code:
from pyspark.sql.functions import col, when

test_df = spark.createDataFrame([
    (1, "2", 5, 1), (3, "4", 7, 8),
], ("col1", "col2", "col3", "col4"))

def createNewColumnsFromValues(dataFrame, sourceCol, colName, targetCol):
    """
    Create one column per value in sourceCol, filled from targetCol
    """
    for value in sourceCol:
        dataFrame = dataFrame.withColumn(
            str(value[0]),
            when(col(colName) == value[0], targetCol).otherwise(None))
    return dataFrame

createNewColumnsFromValues(test_df, test_df.select("col4").collect(), "col4", test_df["col3"]).show()
The trick here is to do select("COLUMNNAME").collect() to get a list of the values in the column. Then sourceCol contains this list, which is a list of rows, where each row has a single element, so you can iterate directly through the list and access the element at position 0. A cast to string is necessary to ensure the name of each new column is a string, and the target column supplies the values for each of the individual columns. The result looks like:
+----+----+----+----+----+----+
|col1|col2|col3|col4| 1| 8|
+----+----+----+----+----+----+
| 1| 2| 5| 1| 5|null|
| 3| 4| 7| 8|null| 7|
+----+----+----+----+----+----+
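The effect is essentially a manual pivot: one new column per distinct value of col4, filled from col3 on the matching rows. Stripped of Spark, the core logic is (plain-Python illustration, not Spark API):

```python
rows = [{"col1": 1, "col2": "2", "col3": 5, "col4": 1},
        {"col1": 3, "col2": "4", "col3": 7, "col4": 8}]

# One new column per distinct value of col4, named after that value
# (cast to string) and filled from col3 where col4 matches.
distinct_values = sorted({r["col4"] for r in rows})
for r in rows:
    for v in distinct_values:
        r[str(v)] = r["col3"] if r["col4"] == v else None

print(rows[0])  # {'col1': 1, 'col2': '2', 'col3': 5, 'col4': 1, '1': 5, '8': None}
```

As the edit in the question notes, this only stays manageable while the number of distinct values, and therefore of new columns, is small.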