The title almost says it already. I have a pyspark.sql.dataframe.DataFrame with an "ID", "TIMESTAMP", "CONSUMPTION" and "TEMPERATURE" column. I need the "TIMESTAMP" column to be resampled to daily intervals (from 15-minute intervals) and the "CONSUMPTION" and "TEMPERATURE" columns to be aggregated by summation. However, this needs to be performed for each unique id in the "ID" column. How do I do this?
Efficiency/speed is important to me. I have a huge dataframe to start with, which is why I would like to avoid .toPandas() and for loops.
Any help would be greatly appreciated!
The following code will build a spark_df to play around with. input_spark_df represents the input Spark dataframe, and the desired output looks like desired_outcome_spark_df.
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
df_list = []
for unique_id in ['012', '345', '678']:
    date_range = pd.date_range(pd.Timestamp('2022-12-28 00:00'), pd.Timestamp('2022-12-30 23:00'), freq='H')
    df = pd.DataFrame()
    df['TIMESTAMP'] = date_range
    df['ID'] = unique_id
    df['TEMPERATURE'] = np.random.randint(1, 10, df.shape[0])
    df['CONSUMPTION'] = np.random.randint(1, 10, df.shape[0])
    df = df[['ID', 'TIMESTAMP', 'TEMPERATURE', 'CONSUMPTION']]
    df_list.append(df)
pandas_df = pd.concat(df_list)
spark = SparkSession.builder.getOrCreate()
input_spark_df = spark.createDataFrame(pandas_df)
desired_outcome_spark_df = spark.createDataFrame(pandas_df.set_index('TIMESTAMP').groupby('ID').resample('1d').sum().reset_index())
To condense the question: how do I go from input_spark_df to desired_outcome_spark_df as efficiently as possible?
I found the answer to my own question. I first change the timestamp to "date only" using pyspark.sql.functions.to_date. Then I group by both "ID" and "TIMESTAMP" and perform the aggregation.
from pyspark.sql.functions import col, to_date, sum, avg

desired_outcome = (input_spark_df
    # truncate the timestamp to a date so each day forms one group
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    # group by "ID" and the (now daily) "TIMESTAMP"
    .groupBy("ID", 'TIMESTAMP')
    .agg(
        sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg(col('TEMPERATURE')).alias("AVERAGE_DAILY_TEMPERATURE")
    ))
desired_outcome.display()
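For completeness, the same daily grouping can also be expressed with pyspark.sql.functions.window, which generalises to buckets other than whole days (e.g. "1 week"). This is only a sketch of an alternative, not the approach above; the column names are the same, everything else is assumed:

from pyspark.sql import functions as F

daily = (input_spark_df
    # bucket each row into a 1-day tumbling window ("window" is a struct with start/end fields)
    .groupBy("ID", F.window("TIMESTAMP", "1 day"))
    .agg(
        F.sum("CONSUMPTION").alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        F.avg("TEMPERATURE").alias("AVERAGE_DAILY_TEMPERATURE")
    )
    # keep the window start as the daily timestamp and drop the struct
    .withColumn("TIMESTAMP", F.col("window.start"))
    .drop("window"))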
I have a series like this:
00:00:08,00:00:24,00:00:27,00:00:36,00:00:36,00:00:37,00:00:42,00:00:43,00:00:44,00:00:47,00:00:54,00:00:57,00:00:57,00:01:09,00:01:16,00:01:18,00:01:21,00:01:25,00:01:26,00:01:33,00:01:33,00:01:33,00:01:38,00:01:44,00:01:45,00:01:53,00:01:57,00:02:01,00:02:03,00:02:19,00:02:20,00:02:33,00:02:33,00:02:34,00:02:48,00:02:50,00:03:12,00:03:21,00:03:23,00:03:24,00:03:28,00:03:34,00:03:34,00:03:35,00:03:38,00:03:39,00:03:40,00:03:40,00:03:42,00:03:42,00:03:48,00:03:49,00:03:54,00:03:55,00:04:03,00:04:06,00:04:07,00:04:10,00:04:11,00:04:16,00:04:21,00:04:26,00:04:27,00:04:27,00:04:28,00:04:30,00:04:33,00:04:41,00:04:49,00:04:50,00:04:51,00:04:54,00:04:55,00:04:59,00:05:16,00:05:16,00:05:27,00:05:34,00:05:37,00:05:46,00:05:50,00:05:53,00:06:07,00:06:16,00:06:24,00:06:25,00:06:26,00:06:30,00:06:38,00:06:38,00:06:42,00:06:44,00:06:46,00:06:53,00:07:00,00:07:00
It is time in HH:MM:SS format (a time series in a dataframe).
I'm interested in finding/visualising the number of data points in (for example) a 10-second window and plotting it as a histogram/bar plot.
import datetime
import pandas as pd

# make it a list
time_series = "00:00:08,00:00:24,00:00:27,00:00:36,00:00:36,00:00:37,00:00:42,00:00:43,00:00:44,00:00:47,00:00:54,00:00:57,00:00:57,00:01:09,00:01:16,00:01:18,00:01:21,00:01:25,00:01:26,00:01:33,00:01:33,00:01:33,00:01:38,00:01:44,00:01:45,00:01:53,00:01:57,00:02:01,00:02:03,00:02:19,00:02:20,00:02:33,00:02:33,00:02:34,00:02:48,00:02:50,00:03:12,00:03:21,00:03:23,00:03:24,00:03:28,00:03:34,00:03:34,00:03:35,00:03:38,00:03:39,00:03:40,00:03:40,00:03:42,00:03:42,00:03:48,00:03:49,00:03:54,00:03:55,00:04:03,00:04:06,00:04:07,00:04:10,00:04:11,00:04:16,00:04:21,00:04:26,00:04:27,00:04:27,00:04:28,00:04:30,00:04:33,00:04:41,00:04:49,00:04:50,00:04:51,00:04:54,00:04:55,00:04:59,00:05:16,00:05:16,00:05:27,00:05:34,00:05:37,00:05:46,00:05:50,00:05:53,00:06:07,00:06:16,00:06:24,00:06:25,00:06:26,00:06:30,00:06:38,00:06:38,00:06:42,00:06:44,00:06:46,00:06:53,00:07:00,00:07:00"
time_series = time_series.split(',')
def histogram_from_time_series(_time_series):
    time_list = []
    for item in _time_series:
        t = item.split(":")
        try:
            # has to be a datetime for pd.Grouper to work
            time_list.append(datetime.datetime(year=1970, month=1, day=1, hour=int(t[-3]), minute=int(t[-2]), second=int(t[-1])))
        except IndexError:
            # no hour part present (MM:SS only)
            time_list.append(datetime.datetime(year=1970, month=1, day=1, minute=int(t[-2]), second=int(t[-1])))
    _dict = {'Timestamp': time_list}  # make a dict from the list
    _df = pd.DataFrame(_dict)  # make a df from the dict
    _df.insert(0, 'A', 1)  # create a column and fill it with int(1)
    # group at the given frequency and sum 'A' (use freq='10S' for 10-second windows)
    # https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
    grouped = _df.groupby(pd.Grouper(key="Timestamp", axis=0, freq='30S')).sum()
    grouped.plot.bar(grid=True, figsize=(9, 9))  # plot the histogram
histogram_from_time_series(time_series)
OK, so I did it myself. Sharing in case someone needs it later.
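If a shorter route is acceptable, pandas can also parse the HH:MM:SS strings as timedeltas and bin them directly with dt.floor, skipping the manual datetime construction. A minimal sketch; the helper name and the 10-second bin width are my own choices, not from the code above:

import pandas as pd

def histogram_from_time_strings(time_strings, freq="10S"):
    # parse "HH:MM:SS" strings as timedeltas, floor them onto fixed-width bins, and count per bin
    t = pd.to_timedelta(pd.Series(time_strings))
    counts = t.dt.floor(freq).value_counts().sort_index()
    counts.plot.bar(grid=True, figsize=(9, 9))
    return counts

histogram_from_time_strings(time_series)

One difference from pd.Grouper: value_counts drops empty bins, so gaps in the data will not show up as zero-height bars unless the result is reindexed over a full pd.timedelta_range.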
TL;DR: How can the following for-loop be adjusted for a faster execution time?
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row : add(row, target_row), axis = 1)
print(time.time()-start)
So I have a dataframe with 30,000 rows and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values in a specific column are 2 for both rows, it should add 1 to the score
if, in a specific column, one row has the value 1 and the other has 2, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
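For concreteness, a tiny worked example of the scoring rule on made-up values (not taken from the actual data):

import numpy as np

target = np.array([2, 2, 1, 0, 2])
other  = np.array([2, 1, 2, 2, 2])
check = target + other                             # [4, 3, 3, 2, 4]
score = np.sum(check == 4) - np.sum(check == 3)    # two (2,2) matches minus two (1,2)/(2,1) pairs = 0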
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I already read "Optimization when using Pandas". Are there further resources you can recommend? Thanks!
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
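As a footnote: the == 4 / == 3 comparisons only encode the rule because the values are limited to 0-2. The same scores can be computed with explicit boolean masks, which keeps the intent readable at a small extra cost; this is a sketch along the same lines as the answer above, not code from it:

A = df.to_numpy()
t = df.loc[target_row_index].to_numpy()

both_two = (A == 2) & (t == 2)                                  # +1 per column where both rows are 2
one_and_two = ((A == 1) & (t == 2)) | ((A == 2) & (t == 1))     # -1 per column where the values are 1 and 2
result = (both_two.sum(axis=1) - one_and_two.sum(axis=1)).tolist()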
I need to get the top n rows by some value per week (I have hourly data).
data:
import numpy as np
import pandas as pd
dates = pd.date_range(start='1/1/2020', end='11/1/2020', freq="1H")
values = np.random.randint(20, 100500, len(dates))
some_other_column = np.random.randint(0, 10000000, len(dates))
df = pd.DataFrame({"date": dates, "value": values, "another_column": some_other_column})
My attempt:
resampled = df.set_index("date").resample("W")["value"].nlargest(5).to_frame()
It does give the top 5 rows, but all other columns except date and value are missing, and I want to keep them all (my dataset has lots of columns; another_column is here just to show that it goes missing).
The solution I came up with:
resampled.index.names = ["week", "date"]
result = pd.merge(
resampled.reset_index(),
df,
how="left",
on=["date", "value"]
)
But it all feels wrong; I know there should be a much simpler solution. Any help?
The output I was looking for. Thanks @wwnde.
df["week"] = df["date"].dt.isocalendar().week
df.loc[df.groupby("week")["value"].nlargest(5).index.get_level_values(1), :]
Group by week, and mask the rows that fall in each week's nlargest:
df.set_index('date', inplace=True)
# keep rows whose value is among the 5 largest within its ISO week
df[df.groupby(df.index.isocalendar().week)['value'].transform(lambda x: x.isin(x.nlargest(5)))]