Visualising frequency of events from time series - pandas

I have a series like this:
00:00:08,00:00:24,00:00:27,00:00:36,00:00:36,00:00:37,00:00:42,00:00:43,00:00:44,00:00:47,00:00:54,00:00:57,00:00:57,00:01:09,00:01:16,00:01:18,00:01:21,00:01:25,00:01:26,00:01:33,00:01:33,00:01:33,00:01:38,00:01:44,00:01:45,00:01:53,00:01:57,00:02:01,00:02:03,00:02:19,00:02:20,00:02:33,00:02:33,00:02:34,00:02:48,00:02:50,00:03:12,00:03:21,00:03:23,00:03:24,00:03:28,00:03:34,00:03:34,00:03:35,00:03:38,00:03:39,00:03:40,00:03:40,00:03:42,00:03:42,00:03:48,00:03:49,00:03:54,00:03:55,00:04:03,00:04:06,00:04:07,00:04:10,00:04:11,00:04:16,00:04:21,00:04:26,00:04:27,00:04:27,00:04:28,00:04:30,00:04:33,00:04:41,00:04:49,00:04:50,00:04:51,00:04:54,00:04:55,00:04:59,00:05:16,00:05:16,00:05:27,00:05:34,00:05:37,00:05:46,00:05:50,00:05:53,00:06:07,00:06:16,00:06:24,00:06:25,00:06:26,00:06:30,00:06:38,00:06:38,00:06:42,00:06:44,00:06:46,00:06:53,00:07:00,00:07:00
The values are times in HH:MM:SS format, stored as a column in a DataFrame.
I'd like to count the number of data points in each fixed window (for example, 10 seconds) and plot the counts as a bar chart (histogram).

import datetime
import pandas as pd

# make it a list
time_series = "00:00:08,00:00:24,00:00:27,00:00:36,00:00:36,00:00:37,00:00:42,00:00:43,00:00:44,00:00:47,00:00:54,00:00:57,00:00:57,00:01:09,00:01:16,00:01:18,00:01:21,00:01:25,00:01:26,00:01:33,00:01:33,00:01:33,00:01:38,00:01:44,00:01:45,00:01:53,00:01:57,00:02:01,00:02:03,00:02:19,00:02:20,00:02:33,00:02:33,00:02:34,00:02:48,00:02:50,00:03:12,00:03:21,00:03:23,00:03:24,00:03:28,00:03:34,00:03:34,00:03:35,00:03:38,00:03:39,00:03:40,00:03:40,00:03:42,00:03:42,00:03:48,00:03:49,00:03:54,00:03:55,00:04:03,00:04:06,00:04:07,00:04:10,00:04:11,00:04:16,00:04:21,00:04:26,00:04:27,00:04:27,00:04:28,00:04:30,00:04:33,00:04:41,00:04:49,00:04:50,00:04:51,00:04:54,00:04:55,00:04:59,00:05:16,00:05:16,00:05:27,00:05:34,00:05:37,00:05:46,00:05:50,00:05:53,00:06:07,00:06:16,00:06:24,00:06:25,00:06:26,00:06:30,00:06:38,00:06:38,00:06:42,00:06:44,00:06:46,00:06:53,00:07:00,00:07:00"
time_series = time_series.split(',')
def histogram_from_time_series(_time_series):
    time_list = []
    for item in _time_series:
        t = item.split(":")
        try:
            # has to be a datetime for pd.Grouper to work; the date itself is a dummy
            time_list.append(datetime.datetime(year=1970, month=1, day=1, hour=int(t[-3]), minute=int(t[-2]), second=int(t[-1])))
        except IndexError:
            # no hour field (MM:SS only)
            time_list.append(datetime.datetime(year=1970, month=1, day=1, minute=int(t[-2]), second=int(t[-1])))
    _dict = {'Timestamp': time_list}  # make a dict from list
    _df = pd.DataFrame(_dict)  # make df from dict
    _df.insert(0, 'A', 1)  # create a column and fill it with int(1)
    # group in given frequency and sum the 'A'; use '10s' instead of '30s' for 10-second windows
    # https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
    grouped = _df.groupby(pd.Grouper(key="Timestamp", freq='30s')).sum()
    grouped.plot.bar(grid=True, figsize=(9,9))  # plot histogram
histogram_from_time_series(time_series)
OK, so I solved it myself. Sharing in case someone needs it later.
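A shorter route, sketched here under the assumption that every entry is a full HH:MM:SS string: parse the strings with pd.to_timedelta, round each value down to the start of its window, and count. The sample list and the names window, starts and all_windows are illustrative and not part of the code above.

```python
import pandas as pd

# Illustrative subset of the series; assumes every entry is HH:MM:SS
times = ["00:00:08", "00:00:24", "00:00:27", "00:00:36",
         "00:00:36", "00:00:37", "00:01:09"]
window = "10s"  # window length (assumed name)

# Parse to timedeltas and round each one down to the start of its window
starts = pd.to_timedelta(times).floor(window)

# Count the events that fall into each window
counts = pd.Series(starts).value_counts().sort_index()

# Add the empty windows so that gaps show up as zero-height bars
all_windows = pd.timedelta_range(start="0s", end=counts.index.max(), freq=window)
counts = counts.reindex(all_windows, fill_value=0)

print(counts)
```

Calling counts.plot.bar(grid=True) on the result should draw the same kind of bar chart as the function above, with matplotlib installed.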

Related

Interpolate values based in date in pandas

I have the following datasets
import pandas as pd
import numpy as np
df = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
                   sheet_name="Sheet1")
df2 = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
                    sheet_name="Sheet2")
df2.dropna(inplace=True)
The first df consists of pairs of columns, X-Axis Value and Y-Axis Value, where the first holds a date and the second holds a value. For each pair I would like to create rows that share the same date. For instance, the timestamp at df.iloc[0,0] is Timestamp('2020-08-25 23:14:12'). However, the following columns of the same row may hold other dates, each with a different Y-Axis Value associated. The first one in that specific row is X-Axis Value NCVE-064 HPNDE, with a timestamp of 2020-08-25 23:04:12 and an associated Y-Axis Value of 0.952.
What I want to accomplish is to interpolate those values over a time interval, maybe 10 minutes, and then merge the results so that each row has a single date.
For df2 it is more or less the same: interpolate the values over a time interval and add them to the original dataframe. Is there any way to do this?
The trick is to realize that datetimes can be represented as the number of seconds elapsed since some reference time.
Without further context, the hardest part is deciding at which times you want the interpolated values.
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
df = pd.read_excel(
    "https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
    sheet_name="Sheet1",
)
x_columns = [col for col in df.columns if col.startswith("X-Axis")]
# What times do we want to align the columns to?
# You can use anything else here or define equally spaced time points
# or something else.
target_times = df[x_columns].min(axis=1)

def interpolate_column(target_times, x_times, y_values):
    ref_time = x_times.min()
    # For interpolation we need to represent the values as floats. One option is to
    # compute the delta in seconds between a reference time and the "current" time.
    deltas = (x_times - ref_time).dt.total_seconds()
    # repeat for our target times
    target_times_seconds = (target_times - ref_time).dt.total_seconds()
    return interp1d(deltas, y_values, bounds_error=False, fill_value="extrapolate")(target_times_seconds)

output_df = pd.DataFrame()
output_df["Times"] = target_times
output_df["Y-Axis Value NCVE-063 VPNDE"] = interpolate_column(
    target_times,
    df["X-Axis Value NCVE-063 VPNDE"],
    df["Y-Axis Value NCVE-063 VPNDE"],
)
# repeat for the other columns, better in a loop
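To make the "seconds since a reference time" idea concrete, here is a minimal sketch on made-up data that interpolates onto an equally spaced 10-minute grid. Note that it swaps scipy's interp1d for np.interp, which is also linear but holds the edge values constant outside the data range instead of extrapolating. The sample timestamps and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements at irregular times (not taken from the Excel file)
x_times = pd.Series(pd.to_datetime([
    "2020-08-25 23:04:12",
    "2020-08-25 23:14:12",
    "2020-08-25 23:34:12",
]))
y_values = pd.Series([0.0, 1.0, 3.0])

# Equally spaced target times, one every 10 minutes
target_times = pd.date_range("2020-08-25 23:04:12", "2020-08-25 23:34:12", freq="10min")

# Represent every datetime as seconds elapsed since a reference time
ref_time = x_times.min()
x_seconds = (x_times - ref_time).dt.total_seconds().to_numpy()
target_seconds = (target_times - ref_time).total_seconds().to_numpy()

# Linear interpolation; np.interp expects x_seconds to be increasing
interpolated = np.interp(target_seconds, x_seconds, y_values.to_numpy())

result = pd.DataFrame({"Times": target_times, "Y": interpolated})
print(result)
```

Running the same steps once per X-Axis/Y-Axis column pair against a shared target_times would give columns that all line up on the same dates.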

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1]+target_row # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3) # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row : add(row, target_row), axis = 1)
print(time.time()-start)
So I have a dataframe with 30,000 rows and a target row in this dataframe, identified by a given row index. I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, add 1 to the score
if for a specific column the value is 1 in one row and 2 in the other, subtract 1 from the score.
The result is the list of all the scores calculated this way.
As I need to execute this code quite often, I would like to optimize it for performance.
Any help is very much appreciated.
I have already read "Optimization when using Pandas". Are there further resources you can recommend? Thanks
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
import time
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
# (a new array rather than an in-place +=, because to_numpy() may return a view of df's data)
summed = np_arr + target_row
check = np.sum(summed == 4, axis=1) - np.sum(summed == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
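As a cross-check, the scoring rule can also be written directly as boolean masks, which does not depend on the values being limited to 0, 1 and 2 the way the "sum equals 4 or 3" shortcut does. The sketch below uses a smaller, made-up frame (200 rows) and compares the vectorised scores against the original row-by-row loop; the names both_two and one_and_two are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(200, 50)))  # small made-up sample
target_row_index = 5

values = df.to_numpy()
target = values[target_row_index]  # broadcasts against every row below

# +1 where both rows hold a 2
both_two = (values == 2) & (target == 2)
# -1 where one row holds a 1 and the other a 2
one_and_two = ((values == 1) & (target == 2)) | ((values == 2) & (target == 1))
scores = both_two.sum(axis=1) - one_and_two.sum(axis=1)

# Reference: the row-by-row loop from the question
target_row = df.loc[target_row_index]
loop_scores = [
    int(np.sum(row + target_row == 4) - np.sum(row + target_row == 3))
    for _, row in df.iterrows()
]

print(scores.tolist() == loop_scores)
```

Neither array is modified in place here, so df stays untouched and the comparison can be repeated for other target rows.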

Change the stacked bar chart to Stacked Percentage Bar Plot

How can I change this stacked bar chart into a stacked percentage bar plot with percentage labels?
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
df_responses = pd.read_csv('https://raw.githubusercontent.com/eng-aomar/Security_in_practice/main/secuirtyInPractice.csv')
df_new = df_responses.iloc[:,9:21]
image_format = 'svg' # e.g .png, .svg, etc.
# initialize empty dataframe
df2 = pd.DataFrame()
# group by each column counting the size of each category values
for col in df_new:
    grped = df_new.groupby(col).size()
    grped = grped.rename(grped.index.name)
    df2 = df2.merge(grped.to_frame(), how='outer', left_index=True, right_index=True)
# plot the merged dataframe
df2.plot.bar(stacked=True)
plt.show()
Since you already have the absolute counts, you can calculate the percentages yourself, e.g. in a new column of your dataframe, and plot that column instead.
Using sum() and division on the dataframe should get you there quickly.
You might want to have a look at the GeeksForGeeks post that shows how this could be done.
EDIT
I have now adjusted your program so that it gives the result you want (or at least the result I think you want).
Two key functions that I used and you did not are value_counts() and df.transpose(). You might want to read up on those two, as they are helpful in many situations.
import pandas as pd
import matplotlib.pyplot as plt
df_responses = pd.read_csv('https://raw.githubusercontent.com/eng-aomar/Security_in_practice/main/secuirtyInPractice.csv')
df_new = df_responses.iloc[:,9:21]
image_format = 'svg' # e.g .png, .svg, etc.
# collect one Series of percentages per column
percentages = {}
# loop over all columns
for col in df_new.columns:
    # counting occurrences for each value can be done by value_counts()
    val_counts = df_new[col].value_counts()
    # calculate the sum of all categories
    total = val_counts.sum()
    # use value count for each category and divide it by the total count of all categories
    # and multiply by 100 to get nice percent values
    percentages[col] = val_counts / total * 100
# build the dataframe; a category missing from a column shows up as NaN, so replace NaN with 0
df2 = pd.DataFrame(percentages).fillna(0)
# columns and rows need to be transposed in order to get the result we want
df2.transpose().plot.bar(stacked=True)
plt.show()
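The question also asks for percentage labels, which the code above does not draw. Here is a minimal sketch on made-up survey answers: it builds the percentage table with value_counts(normalize=True) and labels each segment with Axes.bar_label, which requires matplotlib 3.4 or newer. The answers frame and the helper name plot_percent_stacked are assumptions for illustration.

```python
import pandas as pd

# Made-up answers standing in for the survey columns
answers = pd.DataFrame({
    "Q1": ["Yes", "No", "Yes", "Yes"],
    "Q2": ["No", "No", "Maybe", "Yes"],
})

# One row per question, one column per category, values in percent
pct = pd.DataFrame({
    col: answers[col].value_counts(normalize=True) * 100
    for col in answers.columns
}).fillna(0).T

print(pct)


def plot_percent_stacked(pct):
    """Draw stacked bars and write the percentage inside each segment."""
    import matplotlib.pyplot as plt  # imported here so the table code runs without it

    ax = pct.plot.bar(stacked=True)
    for container in ax.containers:
        # one container per category; label its segments with their heights
        ax.bar_label(container, fmt="%.0f%%", label_type="center")
    ax.set_ylabel("Percent")
    return ax
```

Calling plot_percent_stacked(pct) and then plt.show() should display the labelled chart. Segments with a value of 0 get a "0%" label as well, which may need filtering for real data.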

Pandas get max delta in a timeseries for a specified period

Given a dataframe with a non-regular time series as an index, I'd like to find the max delta between the values for a period of 10 secs. Here is some code that does the same thing:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
xs = np.cumsum(np.random.rand(200))
# This function is to create a general situation where the max is not always at the end or beginning
ys = xs**1.2 + 10 * np.sin(xs)
plt.plot(xs, ys, '+-')
threshold = 10
xs_thresh_ind = np.zeros_like(xs, dtype=int)
deltas = np.zeros_like(ys)
for i, x in enumerate(xs):
    # Find indices that lie within the time threshold
    period_end_ind = np.argmax(xs > x + threshold)
    # Only operate when the window is wide enough (this can be treated differently)
    if period_end_ind > 0:
        xs_thresh_ind[i] = period_end_ind
        # Find extrema in the period
        period_min = np.min(ys[i:period_end_ind + 1])
        period_max = np.max(ys[i:period_end_ind + 1])
        deltas[i] = period_max - period_min
max_ind_low = np.argmax(deltas)
max_ind_high = xs_thresh_ind[max_ind_low]
max_delta = deltas[max_ind_low]
print(
    'Max delta {:.2f} is in period x[{}]={:.2f},{:.2f} and x[{}]={:.2f},{:.2f}'
    .format(max_delta, max_ind_low, xs[max_ind_low], ys[max_ind_low],
            max_ind_high, xs[max_ind_high], ys[max_ind_high]))
df = pd.DataFrame(ys, index=xs)
OUTPUT:
Max delta 48.76 is in period x[167]=86.10,200.32 and x[189]=96.14,249.09
Is there an efficient pandaic way to achieve something similar?
Create a Series from ys values, indexed by xs - but convert xs to be actual timedelta elements, rather than the float equivalent.
ts = pd.Series(ys, index=pd.to_timedelta(xs, unit="s"))
We want to apply a leading, 10 second window in which we calculate the difference between max and min. Because we want it to be leading, we'll sort the Series in descending order and apply a trailing window.
deltas = ts.sort_index(ascending=False).rolling("10s").agg(lambda s: s.max() - s.min())
Find the maximum delta with deltas[deltas == deltas.max()], which gives
0 days 00:01:26.104797298 48.354851
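meaning a delta of 48.35 was found in the interval [86.1, 96.1). This is slightly below the 48.76 printed by the loop in the question, because that loop also includes the first point beyond the 10-second threshold (x[189]=96.14).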
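The leading-window trick can be checked on a handful of made-up points. This sketch is a variant of the approach above: it computes the rolling max and min separately instead of using a lambda, which gives the same numbers and avoids calling Python code once per window. The sample xs and ys are invented.

```python
import numpy as np
import pandas as pd

# Made-up irregular samples: times in seconds and their values
xs = np.array([0.0, 3.0, 6.0, 12.0, 17.0, 30.0])
ys = np.array([1.0, 0.5, 2.0, 9.0, 4.0, 0.0])
ts = pd.Series(ys, index=pd.to_timedelta(xs, unit="s"))

# Descending index + trailing window = leading window [t, t + 10s)
windows = ts.sort_index(ascending=False).rolling("10s")
deltas = (windows.max() - windows.min()).sort_index()

best_start = deltas.idxmax()
print(best_start, deltas.max())
```

Here the largest spread starts at t = 3 s, where the window covers the values 0.5, 2.0 and 9.0.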

How to graph events on a timeline

I tracked all the movies I watched in 2019 and I want to represent the year on a graph using matplotlib, pyplot or seaborn. I saw a graph by a user who also tracked the movies he watched in a year:
I want a graph like this:
How do I represent each movie as an 'event' on a timeline?
For reference, here is a look at my table.
(sorry if basic)
I've made an assumption (from your comment) that your date column is type str. Here is code that will produce the graph:
Modify your pd.DataFrame object
Firstly, a function to add a column to your dataframe:
def modify_dataframe(df):
    """ Modify dataframe to include new columns """
    df['Month'] = pd.to_datetime(df['Date'], format='%Y-%m-%d').dt.month
    return df
The pd.to_datetime function converts the series df['Date'] to a datetime series; and I'm creating a new column called Month which equates to the month number.
From this column, we can generate X and Y coordinates for your plot.
def get_x_y(df):
    """ Get X and Y coordinates; return tuple """
    series = df['Month'].value_counts().sort_index()
    new_series = series.reindex(range(1,13)).fillna(0).astype(int)
    return new_series.index, new_series.values
This takes in your modified dataframe and creates a series that counts the number of occurrences of each month. If any months are missing, reindex adds them and fillna sets their count to 0. Now you can begin to plot.
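To see what these two helpers produce, here is the same counting step on a few made-up dates. The column name Date follows the answer's assumption, the sample rows are invented, and reindex(..., fill_value=0) is used as a shorter equivalent of reindex followed by fillna(0).

```python
import pandas as pd

# Made-up sample; assumes 'Date' holds strings like '2019-01-05'
df = pd.DataFrame({"Date": ["2019-01-05", "2019-01-20", "2019-03-14", "2019-12-31"]})

# Month number (1 to 12) of each film
months = pd.to_datetime(df["Date"], format="%Y-%m-%d").dt.month

# Films per month, with months that have no films filled in as 0
counts = months.value_counts().sort_index().reindex(range(1, 13), fill_value=0)

print(counts.tolist())  # [2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```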
Plotting the graph
I've created a plot that looks like the desired output you linked.
Firstly, call your functions:
df = modify_dataframe(df)
X, Y = get_x_y(df)
Create the canvas and axis to plot on to.
fig = plt.figure(figsize=(12,5))
ax = fig.add_subplot(1, 1, 1, title='Films watched per month - 2019')
Set the x-ticks and x-label.
ax.set_xticks(range(1,13))
ax.set_xlabel('Month')
Generate x-labels. This will replace the current month int values (i.e. 1, 2, 3...) on the x-axis. The ticks are set first so that each label lines up with its month.
import datetime
xlabels = [datetime.datetime(2019, i, 1).strftime("%B") for i in range(1,13)]
ax.set_xticklabels(xlabels, rotation=45, ha='right')
Set the y-ticks, y-limits, and y-label (Y holds the monthly counts returned by get_x_y).
ax.set_yticks(range(0, max(Y)+2))
ax.set_ylim(0, max(Y)+1)
ax.set_ylabel('Count')
To get your desired output, fill underneath the graph with a block-colour (I've chosen green here but you can change it to something else).
ax.fill_between(X, [0]*len(X), Y, facecolor='green')
ax.plot(X, Y, color="black", linewidth=3, marker="o")
Plot your graph!
plt.show() # or plt.savefig('output.png', format='png')