I have a dataframe where some rows have duplicated IDs but different timestamps, and some rows have duplicated IDs with the same timestamp but with one of the yob or gender columns null. I want to do the following with a groupby:
If the same ID has different timestamps, pick the row with the most recent timestamp.
If the same ID has the same timestamp but either the yob or gender column is null, merge those rows into a single record without nulls. Below I have pasted the input dataframe and the desired output.
Input data
from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("e5882", "null", "M", "AD", "9/14/2021 13:50"),
    ("e5882", "null", "M", "AD", "10/22/2021 13:10"),
    ("5cddf", "null", "M", "ED", "9/9/2021 12:00"),
    ("5cddf", "2010", "null", "ED", "9/9/2021 12:00"),
    ("c3882", "null", "M", "BD", "11/27/2021 5:00"),
    ("c3882", "1975", "null", "BD", "11/27/2021 5:00"),
    ("9297d", "1999", "null", "GF", "10/18/2021 7:00"),
    ("9298e", "1990", "null", "GF", "10/18/2021 7:00")
]).toDF(["ID", "yob", "gender", "country", "timestamp"])
Desired output:
The code I used for this problem does not give the correct result; some of the IDs are missing:
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('ID')
# to obtain the most recent date
df1 = df.withColumn('maxB', F.max('timestamp').over(w)).where(F.col('timestamp') == F.col('maxB')).drop('maxB')
# to merge the null columns based on id
(df1.groupBy('ID').agg(*[F.first(x, ignorenulls=True) for x in df1.columns if x != 'ID'])).show()
Using this input dataframe:
df = spark.createDataFrame([
    ("e5882", None, "M", "AD", "9/14/2021 13:50"),
    ("e5882", None, "M", "AD", "10/22/2021 13:10"),
    ("5cddf", None, "M", "ED", "9/9/2021 12:00"),
    ("5cddf", "2010", None, "ED", "9/9/2021 12:00"),
    ("c3882", None, "M", "BD", "11/27/2021 5:00"),
    ("c3882", "1975", None, "BD", "11/27/2021 5:00"),
    ("9297d", None, "M", "GF", "10/18/2021 7:00"),
    ("9297d", "1999", None, "GF", "10/18/2021 7:00"),
    ("9298e", "1990", None, "GF", "10/18/2021 7:00"),
], ["id", "yob", "gender", "country", "timestamp"])
If the same ID has different timestamps, pick the row with the most recent timestamp.
Use a window ranking function to get the most recent row per id. Since you want to merge rows that share the same timestamp, use dense_rank instead of row_number. But first you need to convert the timestamp strings into TimestampType, otherwise the comparison won't be correct (as strings, '9/9/2021 12:00' > '10/18/2021 7:00').
from pyspark.sql import Window
import pyspark.sql.functions as F

df_most_recent = df.withColumn(
    "timestamp",
    F.to_timestamp("timestamp", "M/d/yyyy H:mm")
).withColumn(
    "rn",
    F.dense_rank().over(Window.partitionBy("id").orderBy(F.desc("timestamp")))
).filter("rn = 1")
If the same ID has the same timestamp but either the yob or gender column is null, merge those rows into a single record without nulls. Below I have pasted the input dataframe and the desired output.
Now that df_most_recent contains one or more rows sharing the same most recent timestamp per id, you can group by id to merge the values of the other columns like this:
result = df_most_recent.groupBy("id").agg(
    *[F.collect_set(c)[0].alias(c) for c in df.columns if c != 'id']
    # or *[F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != 'id']
)
result.show()
#+-----+----+------+-------+-------------------+
#|id |yob |gender|country|timestamp |
#+-----+----+------+-------+-------------------+
#|5cddf|2010|M |ED |2021-09-09 12:00:00|
#|9297d|1999|M |GF |2021-10-18 07:00:00|
#|9298e|1990|null |GF |2021-10-18 07:00:00|
#|c3882|1975|M |BD |2021-11-27 05:00:00|
#|e5882|null|M |AD |2021-10-22 13:10:00|
#+-----+----+------+-------+-------------------+
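Note that collect_set ignores nulls, which is why taking element [0] returns a non-null value whenever one exists; it does not, however, guarantee which value you get if a group happens to contain several distinct non-null values. A minimal alternative sketch (not from the answer above), assuming at most one non-null value per column within each group, is to aggregate with max, which also skips nulls:

result = df_most_recent.groupBy("id").agg(
    *[F.max(c).alias(c) for c in df.columns if c != 'id']  # max ignores nulls and is deterministic for this data
)
result.show()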
Related
The title almost says it already. I have a pyspark.sql.DataFrame with an "ID", "TIMESTAMP", "CONSUMPTION" and "TEMPERATURE" column. I need the "TIMESTAMP" column to be resampled to daily intervals (from 15-minute intervals) and the "CONSUMPTION" and "TEMPERATURE" columns aggregated by summation. However, this needs to be performed for each unique id in the "ID" column. How do I do this?
Efficiency/speed is of importance to me. I have a huge dataframe to start with, which is why I would like to avoid .toPandas() and for loops.
Any help would be greatly appreciated!
The following code will build a spark_df to play around with. The input_spark_df represents the input spark dataframe, and the desired output looks like desired_outcome_spark_df.
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

df_list = []
for unique_id in ['012', '345', '678']:
    date_range = pd.date_range(pd.Timestamp('2022-12-28 00:00'), pd.Timestamp('2022-12-30 23:00'), freq='H')
    df = pd.DataFrame()
    df['TIMESTAMP'] = date_range
    df['ID'] = unique_id
    df['TEMPERATURE'] = np.random.randint(1, 10, df.shape[0])
    df['CONSUMPTION'] = np.random.randint(1, 10, df.shape[0])
    df = df[['ID', 'TIMESTAMP', 'TEMPERATURE', 'CONSUMPTION']]
    df_list.append(df)
pandas_df = pd.concat(df_list)

spark = SparkSession.builder.getOrCreate()
input_spark_df = spark.createDataFrame(pandas_df)
desired_outcome_spark_df = spark.createDataFrame(pandas_df.set_index('TIMESTAMP').groupby('ID').resample('1d').sum().reset_index())
To condense the question: how do I go from input_spark_df to desired_outcome_spark_df as efficiently as possible?
I found the answer to my own question. I first change the timestamp to "date only" using pyspark.sql.functions.to_date, then group by both "ID" and "TIMESTAMP" and perform the aggregation.
from pyspark.sql.functions import col, to_date, sum, avg

# Convert the timestamp to a date, then group by "ID" and "TIMESTAMP"
desired_outcome = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy("ID", 'TIMESTAMP')
    .agg(
        sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg(col('TEMPERATURE')).alias("AVERAGE_DAILY_TEMPERATURE")
    ))
desired_outcome.show()
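If you later need buckets other than whole days (hourly, weekly, ...), a closely related sketch (a variation of mine, assuming the same input_spark_df) is to group on pyspark.sql.functions.window instead of to_date:

from pyspark.sql.functions import window, col, sum, avg

# bucket each ID into fixed 1-day windows; change "1 day" to e.g. "1 hour" for other intervals
bucketed = (input_spark_df
    .groupBy("ID", window(col("TIMESTAMP"), "1 day"))
    .agg(
        sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg(col("TEMPERATURE")).alias("AVERAGE_DAILY_TEMPERATURE")
    )
    .withColumn("TIMESTAMP", col("window.start"))
    .drop("window"))
bucketed.show()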
This particular dataframe is updated daily with the "Customer ID", "status" and the "date" that said update occurred; here is an example:
Some clients receive updates daily, others don't. Some can have their status changed from 'no' to 'yes' and vice versa within a matter of days.
Rows with status "yes" can be fetched with:
df = df \
    .select('id', 'status', 'date') \
    .filter(
        (col('date') >= '2022-10-01') &
        (col('date') <= '2022-10-31') &
        (col('status') == "yes"))
The second selection must have none of the IDs present in the "yes" query. See ID "123" for example: if I only exclude the rows with "yes", I am still counting that client in the "no" part of the query.
I tried using an OVER function to create a flag based on the ID to exclude what I had already selected and then apply a filter, but it does not work; pyspark says the expression is not supported within a window function.
partition = Window.partitionBy("id").orderBy("date")
df = df \
.withColumn("results",
when((col("status") == "approved").over(partition), '0')
.otherwise("1"))
Py4JJavaError: An error occurred while calling o808.withColumn.
: org.apache.spark.sql.AnalysisException: Expression '(result_decisaofinal#8593 = APROVA)' not supported within a window function.;;
Following your comment:
Exactly, only one row for each ID, following the rule: if the ID has any row containing "yes", take the most recent "yes"; otherwise take the most recent "no".
You can do it with a simple row_number window function, partitioning by customer and ordering by status descending (so that 'yes' comes before 'no') and then by date descending:
from datetime import date
from pyspark.sql import Window, functions as F
data = [
    {'customer': 123, 'status': 'no', 'date': date(2022, 10, 25)},
    {'customer': 123, 'status': 'yes', 'date': date(2022, 10, 22)},
    {'customer': 4141, 'status': 'no', 'date': date(2022, 10, 25)},
    {'customer': 4141, 'status': 'no', 'date': date(2022, 10, 22)},
    {'customer': 4141, 'status': 'no', 'date': date(2022, 10, 15)},
    {'customer': 5555, 'status': 'yes', 'date': date(2022, 10, 25)},
    {'customer': 5555, 'status': 'no', 'date': date(2022, 10, 22)},
    {'customer': 5555, 'status': 'no', 'date': date(2022, 10, 15)},
]
df = spark.createDataFrame(data)
part = Window.partitionBy('customer').orderBy(F.col('status').desc(), F.col('date').desc())
df2 = df.withColumn('rn', F.row_number().over(part)).filter('rn=1').drop('rn')
df2.show(truncate=False)
+--------+----------+------+
|customer|date |status|
+--------+----------+------+
|123 |2022-10-22|yes |
|4141 |2022-10-25|no |
|5555 |2022-10-25|yes |
+--------+----------+------+
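One caveat (not part of the answer above): ordering by the raw status string only works because 'yes' sorts after 'no' alphabetically, so descending order puts 'yes' first. If your real labels are different (the error message suggests values like 'APROVA'), a sketch of the same idea with an explicit preference expression would be:

part = Window.partitionBy('customer').orderBy(
    F.when(F.col('status') == 'yes', 0).otherwise(1),  # replace 'yes' with your positive label
    F.col('date').desc()                               # then prefer the most recent date
)
df2 = df.withColumn('rn', F.row_number().over(part)).filter('rn = 1').drop('rn')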
I have one solution which may work, but I am not sure if it's a good solution in terms of time and resources, so if anyone knows how to improve it please leave a comment. For the moment I wasn't able to figure out anything else, but maybe it will be useful for you. I have a feeling that there is some trick I don't know that would do it smarter :D
import datetime
import pyspark.sql.functions as F
x = [(123, "no", datetime.date(2020, 10, 25)),
     (123, "yes", datetime.date(2020, 10, 22)),
     (4141, "no", datetime.date(2020, 10, 25)),
     (4141, "no", datetime.date(2020, 10, 22)),
     (4141, "no", datetime.date(2020, 10, 15)),
     (5555, "yes", datetime.date(2020, 10, 25)),
     (5555, "no", datetime.date(2020, 10, 22)),
     (5555, "no", datetime.date(2020, 10, 15))]
df = spark.createDataFrame(x, schema=['customer_id', 'status', 'date'])

groupedDf = df.groupBy(F.col('customer_id'), F.col('status')).agg(F.max("date").alias("most_recent_date")).cache()
trueDf = groupedDf.filter(F.col('status') == F.lit('yes'))
falseDf = groupedDf.filter(F.col('status') == F.lit('no'))
falseWithNoCorrespondingTrueDf = falseDf.join(trueDf, falseDf.customer_id == trueDf.customer_id, "anti")
finalDf = falseWithNoCorrespondingTrueDf.union(trueDf)
You don't need the separate dataframe variables; I added them to make it more descriptive.
Description step by step:
1. First I group records to get the max date for each customer_id and status.
2. Then I cache the result of the grouping, as I know it will be used two times and I don't want to compute it twice.
3. I split the result of the group by into two parts, one with "yes", the other with "no".
4. I drop the "no" rows which have a corresponding "yes", because according to your logic they are not going to be used.
5. I union the remaining "no" rows with all the "yes" rows, which should give the resulting df you want.
Output from sample job:
+-----------+------+----------------+
|customer_id|status|most_recent_date|
+-----------+------+----------------+
| 4141| no| 2020-10-25|
| 123| yes| 2020-10-22|
| 5555| yes| 2020-10-25|
+-----------+------+----------------+
Using Pandas, I'd like to "groupby" and calculate the mean values for each group of my Dataframe. I do it like this:
dict = {
    "group": ["A", "B", "C", "A", "A", "B", "B", "C", "A"],
    "value": [5, 6, 8, 7, 3, 9, 4, 6, 5]
}
import pandas as pd
df = pd.DataFrame(dict)
print(df)
g = df.groupby([df['group']]).mean()
print(g)
Which gives me:
value
group
A 5.000000
B 6.333333
C 7.000000
However, I'd like to exclude groups which have, let's say, less than 3 entries (so that the mean has somewhat of a value). In this case, it would exclude group "C" from the results. How can I implement this?
Filter the group based on the length and then take the mean.
# overall mean of the rows that remain after dropping small groups
filtered_mean = df.groupby('group').filter(lambda x : len(x) >= 3)['value'].mean()
#if you want the mean group-wise after filtering the required groups
result = df.groupby('group').filter(lambda x : len(x) >= 3).groupby('group').mean().reset_index()
Output:
group value
0 A 5.000000
1 B 6.333333
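A sketch of an alternative (not part of the answer above) that avoids the Python-level lambda inside filter is to build a boolean mask with transform('size') and then group only the surviving rows:

# keep rows whose group has at least 3 entries, then average per group
mask = df.groupby('group')['value'].transform('size') >= 3
result = df[mask].groupby('group').mean().reset_index()
print(result)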
I have the following df:
ID Status Date
1 A 01-09-2020
1 B 03-09-2020
2 A 10-12-2020
2 B -
And would like to convert to this:
ID Status1 Status2 Date1 Date2
1 A B 01-09-2020 03-09-2020
2 A B 10-12-2020 -
I think pivot doesn't apply here since I'm not really aggregating anything. I've managed somewhat by using a group_by where I get the min and max date for each ID and then join the result, but that seems very devious, and it doesn't give me the status columns, for which I can't use the min or max function since they aren't numeric values.
I've tried the following solution (answer 10, as someone suggested): How to pivot a dataframe?, which would look like this:
df.insert(0, 'count', df.groupby('ID').cumcount())
pivot = df.pivot(index='count', columns='ID', values='Status')
but this resulted in the following df:
1 2
A A
B B
I've also tried How to do a transpose a dataframe group by key on pandas?, but this gives me the error
Index contains duplicate entries, cannot reshape
The same happens if I use pd.pivot_table() instead of df.pivot, as someone else suggested in another post.
In my opinion you should first create two auxiliary dataframes: one grouping by ID and getting the first and last Status (if you will always have A and B you could use a pivot table), and another getting the first and last Date. Something like this:
import pandas as pd
df = (
    pd.DataFrame(
        {
            "ID": [1, 1, 2, 2],
            "Status": ["A", "B", "A", "B"],
            "Date": ["01-09-2020", "03-09-2020", "10-12-2020", "11-12-2020"]
        }
    )
    .assign(Date=lambda x: pd.to_datetime(x["Date"]))
)
aux1 = df.groupby("ID").agg(Status1=("Status", "first"), Status2=("Status", "last"))
aux2 = df.groupby("ID").agg(Date1=("Date", "min"), Date2=("Date", "max"))
output = pd.merge(aux1, aux2, left_index=True, right_index=True)
print(output)
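If you would rather get the wide shape in one step without the two auxiliary frames, a sketch (a variation on the cumcount idea already tried in the question, not part of the answer above) is to pivot both columns at once and flatten the resulting column MultiIndex:

wide = (df
    .assign(n=df.groupby("ID").cumcount() + 1)
    .pivot(index="ID", columns="n", values=["Status", "Date"]))
wide.columns = [f"{name}{n}" for name, n in wide.columns]  # ('Status', 1) -> 'Status1', etc.
wide = wide.reset_index()
print(wide)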
If you have a DataFrame with values that appear together, but also have an independent value, like so:
import pandas as pd
df = pd.DataFrame({'address': ["A", "A", "B"], 'balances': [30, 40, 50], 'sessions': ["V", "V", "K"]})
and you'd like to group by address, summing the balances and keeping the associated session:
>>> df.groupby(["address"]).agg({'balances': 'sum', 'sessions': ??? })
{'address': ["A", "B"], 'balances': [70, 50], 'sessions': ["V","K"]}
just take the first or the last in the aggregate:
df.groupby(["address"],as_index=False).agg({'balances': 'sum', 'sessions': 'first'})
address balances sessions
0 A 70 V
1 B 50 K