I'm trying to add an array of time offsets (in seconds, which can be both positive and negative) to a constant timestamp using numpy.
numpy version is 1.19.1, python = 3.7.4
If "offsets" is all positive numbers, things work just fine:
time0 = numpy.datetime64("2007-04-03T15:06:48.032208Z")
offsets = numpy.arange(0, 10)
time = offsets.astype("datetime64[s]")
time2 = time0 + time
But, if offsets includes some negative numbers:
offsets = numpy.arange(-5, 5)
time = offsets.astype("datetime64[s]")
time2 = time0 + time
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.UFuncTypeError: ufunc 'add' cannot use operands with types dtype('<M8[ms]') and dtype('<M8[s]')
How do I deal with an offsets array that can contain both positive and negative numbers?
Any insight appreciated, I'm stumped here.
Catherine
The error tells you that you cannot add two dates (two datetime64 values) together. As you can imagine, adding, say, May 12 and May 19 does not make logical sense. Running your first example produces the same error in my environment, even with only positive values in the offsets array.
Instead, you can convert your offsets values into timedelta values:
import numpy
time0 = numpy.datetime64("2007-04-03T15:06:48.032208Z")
offsets = numpy.arange(0, 10)
time = offsets.astype(numpy.timedelta64(1, "s"))  # cast to timedelta64[s], not datetime64[s]
time2 = time0 + time
print(time2)
# ['2007-04-03T15:06:48.032208' '2007-04-03T15:06:49.032208'
# '2007-04-03T15:06:50.032208' '2007-04-03T15:06:51.032208'
# '2007-04-03T15:06:52.032208' '2007-04-03T15:06:53.032208'
# '2007-04-03T15:06:54.032208' '2007-04-03T15:06:55.032208'
# '2007-04-03T15:06:56.032208' '2007-04-03T15:06:57.032208']
offsets = numpy.arange(-5, 5)
time = offsets.astype(numpy.timedelta64(1, "s"))
time2 = time0 + time
print(time2)
# ['2007-04-03T15:06:43.032208' '2007-04-03T15:06:44.032208'
# '2007-04-03T15:06:45.032208' '2007-04-03T15:06:46.032208'
# '2007-04-03T15:06:47.032208' '2007-04-03T15:06:48.032208'
# '2007-04-03T15:06:49.032208' '2007-04-03T15:06:50.032208'
# '2007-04-03T15:06:51.032208' '2007-04-03T15:06:52.032208']
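Equivalently (a small variation, not from the original answer), you can pass the timedelta dtype as a string; this handles negative offsets just the same:

offsets = numpy.arange(-5, 5)
time = offsets.astype("timedelta64[s]")  # same result as astype(numpy.timedelta64(1, "s"))
time2 = time0 + time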
TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row : add(row, target_row), axis = 1)
print(time.time()-start)
So I have a dataframe of size 30'000 and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I already read Optimization when using Pandas; are there further resources you can recommend? Thanks
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
I have a time series given as sec.nsec (Unix time?) where a signal is either 0 or 1, and I want to plot it so that it shows a square signal. Currently I have the following code:
from matplotlib.pyplot import *
time = ['1633093403.754783918', '1633093403.755350983', '1633093403.760918965', '1633093403.761298577', '1633093403.761340378', '1633093403.761907443']
data = [1, 0, 1, 0, 1, 0]
plot(time, data)
show()
This plots:
Is there any conversion needed for the time before plotting? I cannot use date:time labels, as these points might be only nanoseconds to milliseconds apart.
Thank you.
EDIT: The values in the time list are strings.
To convert Unix timestamp strings to datetime64 you need to first convert to float, and then convert to datetime64 with the correct units:
import numpy as np

time = ['1633093403.754783918', '1633093403.755350983', '1633093403.760918965', '1633093403.761298577', '1633093403.761340378', '1633093403.761907443']
time = (np.asarray(time).astype(float)).astype('datetime64[s]')
print(time.dtype)
print(time)
yields:
datetime64[s]
['2021-10-01T13:03:23' '2021-10-01T13:03:23' '2021-10-01T13:03:23'
 '2021-10-01T13:03:23' '2021-10-01T13:03:23' '2021-10-01T13:03:23']
Note the nanoseconds have been stripped. If you want to keep those...
time = (np.asarray(time).astype(float)*1e9).astype('datetime64[ns]')
yields:
datetime64[ns]
['2021-10-01T13:03:23.754783744' '2021-10-01T13:03:23.755351040'
'2021-10-01T13:03:23.760918784' '2021-10-01T13:03:23.761298688'
'2021-10-01T13:03:23.761340416' '2021-10-01T13:03:23.761907456']
This all works because datetime64 has the same "epoch" (zero point) as Unix timestamps: 1970-01-01T00:00:00.000000.
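As a quick illustrative check of that shared epoch (not part of the original answer):

print(np.datetime64(0, 's'))
# 1970-01-01T00:00:00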
Once you do this conversion, plotting should work fine.
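For example, a minimal sketch of the full conversion and plot; the drawstyle='steps-post' argument is just one way to get the square-signal look asked about (it holds each value until the next timestamp):

import numpy as np
import matplotlib.pyplot as plt

time = ['1633093403.754783918', '1633093403.755350983', '1633093403.760918965',
        '1633093403.761298577', '1633093403.761340378', '1633093403.761907443']
data = [1, 0, 1, 0, 1, 0]

# convert the strings to datetime64[ns] as above
t = (np.asarray(time).astype(float) * 1e9).astype('datetime64[ns]')

plt.plot(t, data, drawstyle='steps-post')
plt.show()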
I have a function which calculates the difference between two dates and then multiplies that by a rate. I would like to use this in a one-off example, but also apply it to a pd.Series in a vectorized fashion for large-scale calculations. Currently it is getting hung up at
(start_date - end_date).days
AttributeError: 'Series' object has no attribute 'days'
pddt = lambda x: pd.to_datetime(x)

def cost(start_date, end_date, cost_per_day):
    start_date = pddt(start_date)
    end_date = pddt(end_date)
    total_days = (end_date - start_date).days
    cost = total_days * cost_per_day
    return cost

a = {'start_date': ['2020-07-01', '2020-07-02'], 'end_date': ['2020-07-04', '2020-07-10'], 'cost_per_day': [2, 1.5]}
df = pd.DataFrame.from_dict(a)
costs = cost(df.start_date, df.end_date, df.cost_per_day)
cost_adhoc = cost('2020-07-15', '2020-07-22', 3)
If I run it with the series I get the following error:
AttributeError: 'Series' object has no attribute 'days'
If I try to correct it by adding .dt.days, then when I only use a single input I get the following error:
AttributeError: 'Timestamp' object has no attribute 'dt'
You can change the function to:
total_days = (end_date-start_date) / np.timedelta64(1, 'D')
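A minimal sketch of the adjusted function (using the DataFrame from the question; dividing by np.timedelta64(1, 'D') works both for a single Timestamp difference and for a whole Series of differences):

import numpy as np
import pandas as pd

def cost(start_date, end_date, cost_per_day):
    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    # dividing a timedelta (scalar or Series) by a one-day timedelta64 gives a float number of days
    total_days = (end_date - start_date) / np.timedelta64(1, 'D')
    return total_days * cost_per_day

df = pd.DataFrame({'start_date': ['2020-07-01', '2020-07-02'],
                   'end_date': ['2020-07-04', '2020-07-10'],
                   'cost_per_day': [2, 1.5]})
print(cost(df.start_date, df.end_date, df.cost_per_day))
# 0     6.0
# 1    12.0
# dtype: float64
print(cost('2020-07-15', '2020-07-22', 3))
# 21.0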
Assuming both variables are datetime objects, the expression (end_date-start_date) gives you a timedelta object [docs]. It holds time difference as days, seconds, and microseconds. To convert that to days for example, you would use (end_date-start_date).total_seconds()/(24*60*60).
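For instance, with plain datetime.datetime objects (a small illustration, not from the original answer):

from datetime import datetime

start_date = datetime(2020, 7, 1)
end_date = datetime(2020, 7, 4, 12)  # 3.5 days later
print((end_date - start_date).days)  # 3, truncated to whole days
print((end_date - start_date).total_seconds() / (24 * 60 * 60))  # 3.5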
For the given question, the goal is to multiply the daily cost by the total number of days. pandas uses a subclass of timedelta (dtype timedelta64[ns] by default) which makes it easy to get the total days (no total_seconds() needed); see frequency conversion. All you need to do is cast the timedelta to dtype timedelta64[D] (D for daily frequency):
import pandas as pd
df = pd.DataFrame({'start_date': ['2020-07-01', '2020-07-02'],
                   'end_date': ['2020-07-04', '2020-07-10'],
                   'cost_per_day': [2, 1.5]})
# make sure dtype is datetime:
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# multiply cost/d with total days: end_date-start_date converted to days
df['total_cost'] = df['cost_per_day'] * (df['end_date']-df['start_date']).astype('timedelta64[D]')
# df['total_cost']
# 0 6.0
# 1 12.0
# Name: total_cost, dtype: float64
Note: you don't need to use a pandas.DataFrame here; working with pandas.Series also does the trick. However, since pandas was created for this kind of operation, it brings a lot of convenience. Especially here, you don't need to do any iteration in Python; it's done for you in fast C code.
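A minimal Series-only sketch of the same calculation (just to illustrate the note above; here .dt.days is used, which also gives the whole-day difference):

import pandas as pd

start = pd.to_datetime(pd.Series(['2020-07-01', '2020-07-02']))
end = pd.to_datetime(pd.Series(['2020-07-04', '2020-07-10']))
cost_per_day = pd.Series([2, 1.5])

total_cost = cost_per_day * (end - start).dt.days
print(total_cost)
# 0     6.0
# 1    12.0
# dtype: float64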
I have a dataset of measured values and their corresponding timestamps in the format hh:mm:ss, where hh can be > 24 h.
For machine learning tasks, the data need to be interpolated since there are multiple measured values with different timestamps, respectively.
For resampling and interpolation, I figured out that the dtype of the index should be in datetime format.
For further data-processing and machine learning tasks, I would need the timedelta format again.
Here is some code:
Res_cont = Res_cont.set_index('t_a')  # t_a is the column of timestamps for the measured variable a from a dataframe
# Then I need to convert to datetime format for resampling and interpolation;
# otherwise the resampled times are not like 00:15:00 but like 00:15:16, for example
Res_cont.index = pd.to_datetime(Res_cont.index)
# First upsample to seconds, then interpolate linearly, and lastly downsample to 15-minute steps
Res_cont = Res_cont.resample('s').interpolate(method='linear').resample('15T').asfreq().dropna()
Res_cont.index = pd.to_timedelta(Res_cont.index)  # Here is where the error occurred
Unfortunately, I get the following error message:
FutureWarning: Passing datetime64-dtype data to TimedeltaIndex is deprecated, will raise a TypeError in a future version
  Res_cont = pd.to_timedelta(Res_cont.index)
So obviously there is a problem with the last row of my provided code. I would like to know how to change this code to prevent a TypeError in a future version. Unfortunately, I don't have any idea how to fix it.
Maybe you can help?
EDIT: Here is some arbitrary sample data:
t_a = ['00:00:26', '00:16:16', '00:25:31', '00:36:14', '25:45:44']
a = [0, 1.3, 2.4, 3.8, 4.9]
Res_cont = pd.Series(data = a, index = t_a)
You can use DatetimeIndex.strftime to convert the output datetimes to HH:MM:SS format:
t_a = ['00:00:26', '00:16:16', '00:25:31', '00:36:14', '00:45:44']
a = [0, 1, 2, 3, 4]
Res_cont = pd.DataFrame({'t_a':t_a,'a':a})
print (Res_cont)
t_a a
0 00:00:26 0
1 00:16:16 1
2 00:25:31 2
3 00:36:14 3
4 00:45:44 4
Res_cont = Res_cont.set_index('t_a')
Res_cont.index = pd.to_datetime(Res_cont.index)
Res_cont=Res_cont.resample('s').interpolate(method='linear').resample('15T').asfreq().dropna()
Res_cont.index = pd.to_timedelta(Res_cont.index.strftime('%H:%M:%S'))
print (Res_cont)
a
00:15:00 0.920000
00:30:00 2.418351
00:45:00 3.922807
I want to build a 2d numpy array from a random distribution so that each of the values in the last column of each row exceeds a threshold.
Here's the working code I have now. Is there a cleaner way to build numpy arrays with an arbitrary condition?
from typing import Callable

import numpy as np

def new_array(
        num_rows: int,
        dist: Callable[[int], np.ndarray],
        min_hours: int) -> np.ndarray:
    # Get the 40th percentile as a reasonable guess for how many samples we need.
    # Use a lower percentile to increase num_cols and avoid looping in most cases.
    p40_val = np.quantile(dist(20), 0.4)
    # Generate at least 10 columns each time.
    num_cols = max(int(min_hours / p40_val), 10)

    def create_starts() -> np.ndarray:
        return dist(num_rows * num_cols).reshape((num_rows, num_cols)).cumsum(axis=1)

    max_iters = 20
    starts = create_starts()
    for _ in range(max_iters):
        if np.min(starts[:, -1]) >= min_hours:
            # All the last columns exceed min_hours.
            break
        last_col_vals = starts[:, -1].repeat(num_cols).reshape(starts.shape)
        next_starts = create_starts() + last_col_vals
        starts = np.append(starts, next_starts, axis=1)
    else:
        # We didn't break out of the for loop, so we hit the max iterations.
        raise AssertionError('Failed to create enough samples to exceed '
                             'sim duration for all columns')

    # Only keep columns up to the column where each value > min_hours.
    mins_per_col = np.min(starts, axis=0)
    cols_exceeding_sim_duration = np.nonzero(mins_per_col > min_hours)[0]
    cols_to_keep = cols_exceeding_sim_duration[0]
    return np.delete(starts, np.s_[cols_to_keep:], axis=1)
new_array(5, lambda size: np.random.normal(3, size=size), 7)
# Example output
array([[1.47584632, 4.04034105, 7.19592256],
[3.10804306, 6.46487043, 9.74177227],
[1.03633165, 2.62430309, 6.92413189],
[3.46100139, 6.53068143, 7.37990547],
[2.70152742, 6.09488369, 9.58376664]])
I simplified several things and replaced them with NumPy's logical indexing. The for loop is now a while loop, and there is no need to handle the error case, since it simply runs until there are enough rows.
Is this still working as you expect?
def new_array(num_rows, dist, min_hours):
    # Get the 40th percentile as a reasonable guess for how many samples we need.
    # Use a lower percentile to increase num_cols and avoid looping in most cases.
    p40_val = np.quantile(dist(20), 0.4)
    # Generate at least 10 columns each time.
    num_cols = max(int(min_hours / p40_val), 10)

    # No need to reshape here, size can be a shape tuple.
    def create_starts() -> np.ndarray:
        return dist((num_rows, num_cols)).cumsum(axis=1)

    # Append to a list and stack it into a NumPy array once at the end.
    # This is faster than numpy.append, whose repeated copying would slow things down here.
    storage = []
    while True:
        starts = create_starts()
        # boolean / logical array
        is_larger = starts[:, -1] >= min_hours
        # Use NumPy boolean indexing to find the rows fitting your condition.
        good_rows = starts[is_larger, :]
        # This can also be an empty array if none are found, but it will be skipped later.
        storage.append(good_rows)
        # Count what is in storage so far; empty arrays are skipped due to their shape (0, x).
        number_of_good_rows = sum([_a.shape[0] for _a in storage])
        print('number_of_good_rows', number_of_good_rows)
        if number_of_good_rows >= num_rows:
            starts = np.vstack(storage)
            print(starts)
            break

    # Only keep columns up to the column where each value > min_hours.
    # Also use logical indexing here.
    is_something = np.logical_not(np.all(starts > min_hours, axis=0))
    return starts[:, is_something]
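A quick usage sketch mirroring the call from the question (using the imports from the question; the printed shapes and values vary with the random draws, and the function also prints its progress):

starts = new_array(5, lambda size: np.random.normal(3, size=size), 7)
print(starts.shape)  # e.g. (5, 3): at least 5 rows, columns trimmed around min_hours
print(starts)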