In a pyspark.sql.dataframe.DataFrame, how to resample the "TIMESTAMP" column to daily intervals, for each unique id in the "ID" column?

The title almost says it already. I have a pyspark.sql.dataframe.DataFrame with "ID", "TIMESTAMP", "CONSUMPTION" and "TEMPERATURE" columns. I need the "TIMESTAMP" column to be resampled to daily intervals (from 15-minute intervals) and the "CONSUMPTION" and "TEMPERATURE" columns aggregated by summation. However, this needs to be performed for each unique id in the "ID" column. How do I do this?
Efficiency/speed is of importance to me. I have a huge dataframe to start with, which is why I would like to avoid .toPandas() and for loops.
Any help would be greatly appreciated!
The following code will build a spark_df to play around with. The input_spark_df represents the input Spark DataFrame, and the desired output is like desired_outcome_spark_df.
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
df_list = []
for unique_id in ['012', '345', '678']:
    date_range = pd.date_range(pd.Timestamp('2022-12-28 00:00'), pd.Timestamp('2022-12-30 23:00'), freq='H')
    df = pd.DataFrame()
    df['TIMESTAMP'] = date_range
    df['ID'] = unique_id
    df['TEMPERATURE'] = np.random.randint(1, 10, df.shape[0])
    df['CONSUMPTION'] = np.random.randint(1, 10, df.shape[0])
    df = df[['ID', 'TIMESTAMP', 'TEMPERATURE', 'CONSUMPTION']]
    df_list.append(df)
pandas_df = pd.concat(df_list)
spark = SparkSession.builder.getOrCreate()
input_spark_df = spark.createDataFrame(pandas_df)
desired_outcome_spark_df = spark.createDataFrame(pandas_df.set_index('TIMESTAMP').groupby('ID').resample('1d').sum().reset_index())
To condense the question: how do I go from input_spark_df to desired_outcome_spark_df as efficiently as possible?

I found the answer to my own question. I first change the timestamp to date-only using pyspark.sql.functions.to_date. Then I group by both "ID" and "TIMESTAMP" and perform the aggregation.
from pyspark.sql.functions import col, to_date, sum, avg

# Convert the timestamp to a date, then group by "ID" and the (now daily) "TIMESTAMP"
desired_outcome = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy('ID', 'TIMESTAMP')
    .agg(
        sum(col('CONSUMPTION')).alias('CUMULATIVE_DAILY_POWER_CONSUMPTION'),
        avg(col('TEMPERATURE')).alias('AVERAGE_DAILY_TEMPERATURE')
    ))
desired_outcome.display()
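Note that desired_outcome_spark_df in the question sums both columns (pandas .resample('1d').sum()), while the snippet above averages TEMPERATURE. If the goal is to reproduce the pandas output exactly, a minimal variation (a sketch reusing the same column names; the variable name matched_outcome is just for illustration) would sum both:

from pyspark.sql.functions import col, to_date, sum

# Sum both columns per ID and day to match the pandas resample('1d').sum() output
matched_outcome = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy('ID', 'TIMESTAMP')
    .agg(
        sum(col('TEMPERATURE')).alias('TEMPERATURE'),
        sum(col('CONSUMPTION')).alias('CONSUMPTION')
    ))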

Related

select top n rows after resampling DatetimeIndex

I need to get top n rows by some value per week (and I have hourly data).
data:
import numpy as np
import pandas as pd
dates = pd.date_range(start='1/1/2020', end='11/1/2020', freq="1H")
values = np.random.randint(20, 100500, len(dates))
some_other_column = np.random.randint(0, 10000000, len(dates))
df = pd.DataFrame({"date": dates, "value": values, "another_column": some_other_column})
My attempt:
resampled = df.set_index("date").resample("W")["value"].nlargest(5).to_frame()
It does give the top 5 rows, but all other columns except date and value are missing, and I want to keep them all (in my dataset I have lots of columns; another_column is here just to show that it goes missing).
The solution I came up with:
resampled.index.names = ["week", "date"]
result = pd.merge(
    resampled.reset_index(),
    df,
    how="left",
    on=["date", "value"]
)
But it all feels wrong; I know there should be a much simpler solution. Any help?
This gave the output I was looking for. Thanks @wwnde.
df["week"] = df["date"].dt.isocalendar().week
df.loc[df.groupby("week")["value"].nlargest(5).index.get_level_values(1), :]
Group by week, and mask the nlargest rows per group:
df.set_index('date', inplace=True)
df[df.groupby(df.index.isocalendar().week)['value'].transform(lambda x: x.isin(x.nlargest(5)))]
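If a simpler pattern is acceptable, a sketch of another common approach (assuming df still has its original "date" column, i.e. before any set_index, and that ISO weeks are fine as the grouping; "week" is a helper column added here) is to sort by value and take the head of each week group, which keeps every column:

top5_per_week = (df.assign(week=df["date"].dt.isocalendar().week)
                   .sort_values("value", ascending=False)
                   .groupby("week")
                   .head(5))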

pandas df columns series

I have a dataframe, and I have done some operations with its columns as follows:
df1=sample_data.sort_values("Population")
df2=df1[(df1.Population > 500000) & (df1.Population < 1000000)]
df3=df2["Avg check"]*df2["Avg Daily Rides Last Week"]/df2["CAC"]
df4=df2["Avg check"]*df2["Avg Daily Rides Last Week"]
([[df3],[df4]])
If I understand right, df3 and df4 are now Series only, not DataFrames. There should be a way to make a new DataFrame with these Series and to plot a scatter. Please advise. Thanks.
I wanted to add an annotation for each point and faced an issue:
df3=df2["Avg check"]*df2["Avg Daily Rides Last Week"]/df2["CAC"]
df4=df2["Avg check"]*df2["Avg Daily Rides Last Week"]
df5=df2["Population"]
df6=df2["city_id"]
sct=plt.scatter(df5,df4,c=df3, cmap="viridis")
plt.xlabel("Population")
plt.ylabel("Avg check x Avg Daily Rides")
for i, txt in enumerate(df6):
    plt.annotate(txt, (df4[i], df5[i]))
plt.colorbar()
plt.show()
I think you can pass both Series to matplotlib.pyplot.scatter:
import matplotlib.pyplot as plt
sc = plt.scatter(df3, df4)
EDIT: Swap df5 and df4, and to select by position use Series.iat:
for i, txt in enumerate(df6):
    plt.annotate(txt, (df5.iat[i], df4.iat[i]))
You can create a DataFrame from Series. Here is how to do it. Simply put both Series in a dictionary:
import pandas as pd

author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']
article = [210, 211, 114, 178]
auth_series = pd.Series(author)
article_series = pd.Series(article)
frame = {'Author': auth_series, 'Article': article_series}
and then create a DataFrame from that dictionary:
result = pd.DataFrame(frame)
The code is from geeksforgeeks.org
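Applied to the Series from the question, the same pattern would look something like this (a sketch; the column names "ratio" and "product" are just placeholders, and df3/df4 share the index of df2, so they align automatically):

scatter_df = pd.DataFrame({"ratio": df3, "product": df4})
scatter_df.plot.scatter(x="ratio", y="product")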

How can I add values from pandas group to new Dataframe after a function?

I am trying to separate a Dataframe into groups, run each group through a function, and have the return value from the first row of each group placed into a new Dataframe.
When I try the code below, I can print out the information I want, but when I try to add it to the new Dataframe, it only shows the values for the last group.
How can I add the values from each group into the new Dataframe?
Thanks,
Here is what I have so far:
import pandas as pd
import numpy as np
#Build random dataframe
df = pd.DataFrame(np.random.randint(0, 40, size=10),
                  columns=["Random"],
                  index=pd.date_range("20200101", freq='6h', periods=10))
df["Random2"] = np.random.randint(70, 100, size=10)
df["Random3"] = 2
df.index = df.index.map(lambda t: t.strftime('%Y-%m-%d'))
df.index.name = 'Date'
df.reset_index(inplace=True)
#Setup groups by date
df = df.groupby(['Date']).apply(lambda x: x.reset_index())
df.drop(["index", "Date"], axis=1, inplace=True)
#Create new dataframe for newValue
df2 = pd.DataFrame(index=(df.index)).unstack()
#random function for an example
def any_func(df):
    df["Value"] = df["Random"] * df["Random2"] / df["Random3"]
    return df["Value"]
#loop by unique group name
for date in df.index.get_level_values('Date').unique():
    #I can print the data I want
    print(any_func(df.loc[date])[0])
    #But when I add it to a new dataframe, it only shows the value from the last group
    df2["newValue"] = any_func(df.loc[date])[0]
df2
Unrelated, but try modifying your any_func to take advantage of vectorized functions where possible.
Now if I understand you correctly:
new_value = df['Random'] * df['Random2'] / df['Random3']
df2['New Value'] = new_value.loc[:, 0]
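Equivalently, one could take the first row of each date group explicitly (a sketch, assuming the grouped df from the question with its ('Date', row) MultiIndex and the "newValue" column name used there):

per_row = df["Random"] * df["Random2"] / df["Random3"]
df2["newValue"] = per_row.groupby(level="Date").first()   # first row of each date group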
This line of code gave me the desired outcome. I just needed to set the index using the "date" variable when I created the column, not when I created the Dataframe.
df2.loc[date, "newValue"] = any_func(df.loc[date])[0]

How to access dask dataframe index value in map_partitions?

I am trying to use dask dataframe map_partitions to apply a function which accesses the value in the dataframe index row-wise and creates a new column.
Below is the code I tried.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(index = ["row0" , "row1","row2","row3","row4"])
df
ddf = dd.from_pandas(df, npartitions=2)
res = ddf.map_partitions(lambda df: df.assign(index_copy= str(df.index)),meta={'index_copy': 'U' })
res.compute()
I am expecting df.index to be the value in the row index, not the entire partition index, which it seems to refer to. From the doc here, this works well for columns but not for the index.
What you want to do is this:
df.index = ['row' + str(x) for x in df.index]
For that, first create your pandas dataframe and then run this code; after that you will have your expected result.
Let me know if this works for you.
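For reference, a minimal sketch of doing this inside map_partitions itself (assuming the goal is simply to copy each row's index value into a column): assign the partition's index directly instead of str(df.index), which stringifies the whole Index object rather than giving one value per row.

res = ddf.map_partitions(
    lambda pdf: pdf.assign(index_copy=pdf.index.astype(str)),  # one string per row
    meta={'index_copy': 'object'}
)
res.compute()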

I am trying to take the 95th quantile of each group of records in a dataframe

I am trying to get the 95th quantile for each group of rows in a dataframe. I have tried:
mdf=mdf.groupby('GroupID').quantile(.95)
but the interpreter returns an error:
ValueError: 'GroupID' is both an index level and a column label, which is ambiguous.
I have three columns and I want the 95th percentile of Utilization for each group:
GroupID, Timestamp, Util
Code below:
#pandas 95th percentile calculator
import pandas as pd
import numpy as np
#pd.set_option('display.max_columns', 8)
cfile = "path"
rfile = "path"
#define columns in corereport dataframe
cdf = pd.read_csv(cfile, skiprows = 1, names = ['ID','Device','Bundle','IsPolled','Status','SpeedIn','SpeedOut','Timestamp','MaxIn','MaxOut'])
#drop specified columns from dataframe
to_drop = ['Device', 'Bundle', 'IsPolled', 'Status', 'SpeedIn', 'SpeedOut']
cdf.drop(to_drop, inplace=True, axis=1)
#define columns in relationship dataframe
rdf = pd.read_csv(rfile, skiprows = 1, names = ['GroupID', 'ID', 'Path', 'LowestBW', 'TotalBW'])
#merge the two dataframes together on the ID field
mdf = pd.merge(cdf, rdf, left_on='ID', right_on='ID', how = 'left')
#print(mdf.head())
#Add a column with the larger of two values of MaxIn and MaxOut for each row
mdf.loc[mdf['MaxIn'] > mdf['MaxOut'], 'Util'] = mdf['MaxIn']
mdf.loc[mdf['MaxIn'] < mdf['MaxOut'], 'Util'] = mdf['MaxOut']
#drop specified columns from data frame
to_drop = ['ID', 'MaxIn', 'MaxOut', 'Path', 'LowestBW', 'TotalBW']
mdf.drop(to_drop, inplace=True, axis=1)
#print(mdf.head().values)
#Group by the GroupID and Timestamp Columns and sum the value in Util
mdf = mdf.groupby(['GroupID', 'Timestamp'])['Util'].sum().reset_index()
#Grouping by GroupID and then sorting ascending
mdf = mdf.groupby(['GroupID']).apply(lambda x: x.sort_values(['Util']))
mdf = mdf.groupby('GroupID').quantile(.95)
#Write new dataframe out to a csv
ofile = 'path'
mdf.to_csv(ofile, encoding='utf-8', index=False)
The problem is here:
mdf = mdf.groupby(['GroupID']).apply(lambda x: x.sort_values(['Util']))
which sets 'GroupID' as index of mdf. Try instead:
mdf = (mdf.groupby(['GroupID'])[['Timestamp', 'Util']]
          .apply(lambda x: x.sort_values(['Util'])))
or
mdf.sort_values(['GroupID', 'Util'], inplace=True)
However, I believe you don't need to sort values for quantile.
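If only the 95th percentile of Util per group is needed, a minimal sketch (assuming the sort step is dropped as suggested above, so GroupID is still an ordinary column) is to select the Util column before calling quantile, which also sidesteps the ambiguity error:

result_95 = mdf.groupby('GroupID')['Util'].quantile(0.95).reset_index()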