pandas: calculate time difference per several levels in one column

I have the following dataset and would like to get a result as follows:
The goal is to calculate the duration per "Level" group.
Dataset:
import pandas as pd
from datetime import datetime, date
data = {'Time': ["08:35:00", "08:40:00", "08:45:00", "08:55:00", "08:57:00", "08:59:00"],
        'Level': [250, 250, 250, 200, 200, 200]}
df = pd.DataFrame(data)
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.time
I am able to calculate the difference between two times with the following code:
t1 = df['Time'].iloc[0]
t2 = df['Time'].iloc[1]
c = datetime.combine(date.today(), t2) - datetime.combine(date.today(), t1)
But I am not able to "automate" the calculation per group; the following attempt only works for numeric columns, not for datetime.time objects:
df2 = df.groupby('Level').apply(lambda x: x.Time.max() - x.Time.min())

If you keep the date part of Time, the calculation is a lot easier:
df = pd.DataFrame(data)

# Keep the date part, even though it's meaningless
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M:%S")

def to_string(duration: pd.Timedelta) -> str:
    total = duration.total_seconds()
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02.0f}:{minutes:02.0f}:{seconds:02.0f}"

level = df["Level"]

# CAUTION: avoid calling to_string until the very last step,
# when you need to display your result. There's not many
# calculations you can do with strings.
df["Time"].groupby(level).diff().groupby(level).sum().apply(to_string)

Related

In a pyspark.sql.dataframe.Dataframe, how to resample the "TIMESTAMP" column to daily intervals, for each unique id in the "ID" column?

The title almost says it already. I have a pyspark.sql.dataframe.DataFrame with "ID", "TIMESTAMP", "CONSUMPTION" and "TEMPERATURE" columns. I need the "TIMESTAMP" column to be resampled to daily intervals (from 15min intervals) and the "CONSUMPTION" and "TEMPERATURE" columns aggregated by summation. However, this needs to be performed for each unique id in the "ID" column. How do I do this?
Efficiency/speed is of importance to me. I have a huge dataframe to start with, which is why I would like to avoid .toPandas() and for loops.
Any help would be greatly appreciated!
The following code will build a spark_df to play around with. The input_spark_df represents the input spark dataframe, and the desired output is like desired_outcome_spark_df.
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

df_list = []
for unique_id in ['012', '345', '678']:
    date_range = pd.date_range(pd.Timestamp('2022-12-28 00:00'), pd.Timestamp('2022-12-30 23:00'), freq='H')
    df = pd.DataFrame()
    df['TIMESTAMP'] = date_range
    df['ID'] = unique_id
    df['TEMPERATURE'] = np.random.randint(1, 10, df.shape[0])
    df['CONSUMPTION'] = np.random.randint(1, 10, df.shape[0])
    df = df[['ID', 'TIMESTAMP', 'TEMPERATURE', 'CONSUMPTION']]
    df_list.append(df)
pandas_df = pd.concat(df_list)

spark = SparkSession.builder.getOrCreate()
input_spark_df = spark.createDataFrame(pandas_df)
desired_outcome_spark_df = spark.createDataFrame(
    pandas_df.set_index('TIMESTAMP').groupby('ID').resample('1d').sum().reset_index()
)
To condense the question: how do I go from input_spark_df to desired_outcome_spark_df as efficiently as possible?
I found the answer to my own question. I first change the timestamp to "date only" using pyspark.sql.functions.to_date. Then I group by both "ID" and "TIMESTAMP" and perform the aggregation.
from pyspark.sql.functions import to_date, sum, avg, col

# Truncate the timestamp to a date, then group by "ID" and the date
desired_outcome = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy("ID", 'TIMESTAMP')
    .agg(
        sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg(col('TEMPERATURE')).alias("AVERAGE_DAILY_TEMPERATURE")
    ))
desired_outcome.display()
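For completeness, a hedged sketch that reproduces desired_outcome_spark_df from the setup code exactly (both columns summed, as the pandas resample does), with an explicit ordering for easier comparison:
from pyspark.sql.functions import to_date, col, sum as spark_sum

matched = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy('ID', 'TIMESTAMP')
    .agg(spark_sum('TEMPERATURE').alias('TEMPERATURE'),
         spark_sum('CONSUMPTION').alias('CONSUMPTION'))
    .orderBy('ID', 'TIMESTAMP'))
matched.show()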

Visualising frequency of events from time series

I have a series like this:
00:00:08,00:00:24,00:00:27,00:00:36,00:00:36,00:00:37,00:00:42,00:00:43,00:00:44,00:00:47,00:00:54,00:00:57,00:00:57,00:01:09,00:01:16,00:01:18,00:01:21,00:01:25,00:01:26,00:01:33,00:01:33,00:01:33,00:01:38,00:01:44,00:01:45,00:01:53,00:01:57,00:02:01,00:02:03,00:02:19,00:02:20,00:02:33,00:02:33,00:02:34,00:02:48,00:02:50,00:03:12,00:03:21,00:03:23,00:03:24,00:03:28,00:03:34,00:03:34,00:03:35,00:03:38,00:03:39,00:03:40,00:03:40,00:03:42,00:03:42,00:03:48,00:03:49,00:03:54,00:03:55,00:04:03,00:04:06,00:04:07,00:04:10,00:04:11,00:04:16,00:04:21,00:04:26,00:04:27,00:04:27,00:04:28,00:04:30,00:04:33,00:04:41,00:04:49,00:04:50,00:04:51,00:04:54,00:04:55,00:04:59,00:05:16,00:05:16,00:05:27,00:05:34,00:05:37,00:05:46,00:05:50,00:05:53,00:06:07,00:06:16,00:06:24,00:06:25,00:06:26,00:06:30,00:06:38,00:06:38,00:06:42,00:06:44,00:06:46,00:06:53,00:07:00,00:07:00
These are times in HH:MM:SS format (a Series in a DataFrame).
I'm interested in counting / visualising the number of data points per (for example) 10-second window and plotting it as a histogram bar plot.
import datetime
import pandas as pd

# make it a list
time_series = "00:00:08,00:00:24,00:00:27,00:00:36,00:00:36,00:00:37,00:00:42,00:00:43,00:00:44,00:00:47,00:00:54,00:00:57,00:00:57,00:01:09,00:01:16,00:01:18,00:01:21,00:01:25,00:01:26,00:01:33,00:01:33,00:01:33,00:01:38,00:01:44,00:01:45,00:01:53,00:01:57,00:02:01,00:02:03,00:02:19,00:02:20,00:02:33,00:02:33,00:02:34,00:02:48,00:02:50,00:03:12,00:03:21,00:03:23,00:03:24,00:03:28,00:03:34,00:03:34,00:03:35,00:03:38,00:03:39,00:03:40,00:03:40,00:03:42,00:03:42,00:03:48,00:03:49,00:03:54,00:03:55,00:04:03,00:04:06,00:04:07,00:04:10,00:04:11,00:04:16,00:04:21,00:04:26,00:04:27,00:04:27,00:04:28,00:04:30,00:04:33,00:04:41,00:04:49,00:04:50,00:04:51,00:04:54,00:04:55,00:04:59,00:05:16,00:05:16,00:05:27,00:05:34,00:05:37,00:05:46,00:05:50,00:05:53,00:06:07,00:06:16,00:06:24,00:06:25,00:06:26,00:06:30,00:06:38,00:06:38,00:06:42,00:06:44,00:06:46,00:06:53,00:07:00,00:07:00"
time_series = time_series.split(',')

def histogram_from_time_series(_time_series):
    time_list = []
    for item in _time_series:
        t = item.split(":")
        try:
            # has to be a datetime for pd.Grouper to work
            time_list.append(datetime.datetime(year=1970, month=1, day=1,
                                               hour=int(t[-3]), minute=int(t[-2]), second=int(t[-1])))
        except IndexError:
            # no hour part, only MM:SS
            time_list.append(datetime.datetime(year=1970, month=1, day=1,
                                               minute=int(t[-2]), second=int(t[-1])))
    _dict = {'Timestamp': time_list}  # make a dict from the list
    _df = pd.DataFrame(_dict)  # make a df from the dict
    _df.insert(0, 'A', 1)  # create a column and fill it with int(1)
    # group at the given frequency and sum the 'A' column
    # https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
    grouped = _df.groupby(pd.Grouper(key="Timestamp", freq='30S')).sum()
    grouped.plot.bar(grid=True, figsize=(9, 9))  # plot as a bar chart

histogram_from_time_series(time_series)
Ok so I did it myself. Sharing if someone needs it later.
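For reference, a shorter sketch of the same grouping, assuming the time_series list built above: parse the stamps as Timedeltas and let resample do the bucketing.
import pandas as pd

stamps = pd.to_timedelta(time_series)  # "HH:MM:SS" strings parse directly as Timedeltas
counts = pd.Series(1, index=stamps).resample('30S').sum()  # events per 30-second bucket
counts.plot.bar(grid=True, figsize=(9, 9))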

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time

np.random.seed(0)

# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5

start = time.time()
target_row = df.loc[target_row_index]
result = []

# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row                      # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time() - start)

# Goal: Calculate the list result as efficient as possible

# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)

start = time.time()
q = df.apply(lambda row: add(row, target_row), axis=1)
print(time.time() - start)
So I have a dataframe of size 30'000 and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I already read Optimization when using Pandas; are there further resources you can recommend? Thanks.
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
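As a hedged sketch of the same idea, the scoring rule can also be written directly with boolean masks instead of the sum-equals-4 / sum-equals-3 trick (assuming the cell values stay in {0, 1, 2}):
rows = df.to_numpy()                                 # fresh, unmodified copy
target = df.loc[target_row_index].to_numpy()

both_two = (rows == 2) & (target == 2)               # +1 per column where both values are 2
one_and_two = ((rows == 1) & (target == 2)) | ((rows == 2) & (target == 1))  # -1 per column
scores = (both_two.sum(axis=1) - one_and_two.sum(axis=1)).tolist()
Either way, the heavy lifting happens in NumPy's vectorised comparisons rather than in a Python-level loop.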

select top n rows after resampling DatetimeIndex

I need to get top n rows by some value per week (and I have hourly data).
data:
import numpy as np
import pandas as pd
dates = pd.date_range(start='1/1/2020', end='11/1/2020', freq="1H")
values = np.random.randint(20, 100500, len(dates))
some_other_column = np.random.randint(0, 10000000, len(dates))
df = pd.DataFrame({"date": dates, "value": values, "another_column": some_other_column})
My attempt:
resampled = df.set_index("date").resample("W")["value"].nlargest(5).to_frame()
It does give the top 5 rows, but all columns except date and value are missing, and I want to keep them all (my real dataset has lots of columns; another_column is here just to show that it goes missing).
The solution I came up with:
resampled.index.names = ["week", "date"]
result = pd.merge(
    resampled.reset_index(),
    df,
    how="left",
    on=["date", "value"]
)
But it all feels wrong; I know there should be a much simpler solution. Any help?
The output I was looking for. Thanks #wwnde.
df["week"] = df["date"].dt.isocalendar().week
df.loc[df.groupby("week")["value"].nlargest(5).index.get_level_values(1), :]
Groupby, and mask on nlargest:
df.set_index('date', inplace=True)
# build a per-week boolean mask that is True for the 5 largest values of each week
df[df.groupby(df.index.isocalendar().week)['value'].transform(lambda x: x.isin(x.nlargest(5)))]
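Another option, kept as a sketch, is to sort once and take the first five rows per week; this assumes date is still an ordinary column (i.e. run it before the set_index call above):
top5 = (df.sort_values("value", ascending=False)
          .groupby(pd.Grouper(key="date", freq="W"))
          .head(5))
head keeps every original column, so no merge is needed afterwards.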

How to plot only business hours and weekdays in pandas

I have hourly stock data.
I need a) to format it so that matplotlib ignores weekends and non-business hours and b) an hourly frequency.
The problem:
Currently, the graph looks cramped, and I suspect it is because matplotlib is taking into account 24 hours instead of 8, and 7 days a week instead of business days.
How do I tell pandas to only take into account business hours, Monday through Friday?
How I am graphing the data:
I am looping through a list of price data dataframes, graphing each data frame:
mm = 0
for ii in df:
    Ddate = ii['Date']
    Pprice = ii['Price']
    d = Ddate.to_list()
    p = Pprice.to_list()
    dates = make_dt(d)
    prices = unstring(p)
    plt.figure()
    plt.plot(dates, prices)
    plt.title(stocks[mm])
    plt.grid(True)
    plt.xlabel('Dates')
    plt.ylabel('Prices')
    mm += 1
the graph:
To flag business days, you can use pd.bdate_range as below:
# Add a column flagging whether each row falls on a business day.
# pd.bdate_range needs scalar start/end dates, so apply it row by row.
df['IsBDay'] = df['date'].apply(lambda x: bool(len(pd.bdate_range(x, x))))
Now create a new DF that keeps only the rows where IsBDay is True:
df = df[df['IsBDay']]
Now your DF is ready for plotting.
Hope this helps.
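The idea above only handles business days; as a hedged extension (assuming the 'date' column is a datetime and taking business hours to be 09:00-17:00, which is an assumption, not something from the question), the hours can be filtered the same way before plotting:
df = df[df['date'].dt.dayofweek < 5]          # Monday=0 ... Friday=4
df = df[df['date'].dt.hour.between(9, 16)]    # keep 09:00 up to 16:59
Matplotlib will still leave visual gaps where the removed periods were; plotting against a simple integer index and using the dates as tick labels is a common way to close those gaps.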