Pandas timezone - how to check if it is not defined for a whole column?

To check whether a timezone is not defined for the first row of a "timestamp" column in a pandas Series, I can query .tz for a single element with:
import pandas as pd
dates = pd.Series(pd.date_range('2/2/2002', periods=10, freq='M'))
assert dates.iloc[0].tz is None
Is there a way to check whether there are elements where the timezone is defined, or even better, a way to list all the timezones in the whole series without looping through its elements, such as:
dates.iloc[5] = dates.iloc[5].tz_localize('Africa/Abidjan')
dates.iloc[7] = dates.iloc[7].tz_localize('Africa/Banjul')
zones = []
for k in range(dates.shape[0]):
    zones.append(dates.iloc[k].tz)
print(set(zones))
?

You can get the time zone setting of a datetime Series using the dt accessor, i.e. S.dt.tz. This will raise an exception (ValueError or AttributeError, depending on the pandas version) if you have multiple time zones, since the datetime objects are then stored in an object array, as opposed to a datetime64 array when you have only one time zone or none. You can make use of this to get a solution that is a bit more efficient than looping every time:
import pandas as pd
# tzinfo is None:
dates0 = pd.Series(pd.date_range('2/2/2002', periods=10, freq='M'))
# one timezone:
dates1 = dates0.dt.tz_localize('Africa/Abidjan')
# mixed timezones:
dates2 = dates0.copy()
dates2.iloc[5] = dates2.iloc[5].tz_localize('Africa/Abidjan')
dates2.iloc[7] = dates2.iloc[7].tz_localize('Africa/Banjul')
for ds in [dates0, dates1, dates2]:
    try:
        zones = ds.dt.tz
    except (ValueError, AttributeError):
        # mixed time zones -> object dtype, so .dt is not available;
        # the exception type depends on the pandas version
        zones = set(t.tz for t in ds.values)
    print(zones)
# prints
None
Africa/Abidjan
{None, <DstTzInfo 'Africa/Banjul' GMT0:00:00 STD>, <DstTzInfo 'Africa/Abidjan' GMT0:00:00 STD>}

Related

Efficient way to update a single element of a Polars DataFrame?

Polars DataFrame does not currently provide a method to update the value of a single cell. Instead, we have to use the method DataFrame.apply or DataFrame.apply_at_idx, which updates a whole column / Series. This can be very expensive in situations where an algorithm repeatedly updates a few elements of some columns. Why is DataFrame designed in this way? Looking into the code, it seems to me that Series does provide inner mutability via the method Series._get_inner_mut?
As of polars >= 0.15.9, mutation of any data backed by numbers has constant complexity O(1) if the data is not shared; that covers numeric data, dates, and durations.
If the data is shared, we must first copy it so that we become the sole owner.
import polars as pl
import matplotlib.pyplot as plt
from time import time
ts = []
ts_shared = []
clone_times = []
ns = []
for n in [1e3, 1e5, 1e6, 1e7, 1e8]:
    s = pl.zeros(int(n))
    t0 = time()
    # we are the only owner
    # so mutation is inplace
    s[10] = 10
    # time
    t = time() - t0
    # store datapoints
    ts.append(t)
    ns.append(n)
    # clone is free
    t0 = time()
    s2 = s.clone()
    t = time() - t0
    clone_times.append(t)
    # now there are two owners of the memory
    # we write to it so we must copy all the data first
    t0 = time()
    s2[11] = 11
    t = time() - t0
    ts_shared.append(t)
plt.plot(ns, ts_shared, label="writing to shared memory")
plt.plot(ns, ts, label="writing to owned memory")
plt.plot(ns, clone_times, label="clone time")
plt.legend()
In Rust this dispatches to set_at_idx2, but it is not released yet. Note that when using the lazy engine this will all be done implicitly for you, as sketched below.
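As an illustration of what an expression-based, lazy single-cell update can look like (a sketch only; the "idx" helper column and the values are assumptions, not part of the original answer):
import polars as pl
df = pl.DataFrame({"idx": [0, 1, 2, 3], "x": [0.0, 0.0, 0.0, 0.0]})
out = (
    df.lazy()
    .with_columns(
        # overwrite x only where idx == 2, keep the old value elsewhere
        pl.when(pl.col("idx") == 2)
        .then(10.0)
        .otherwise(pl.col("x"))
        .alias("x")
    )
    .collect()
)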

Interpolate values based on date in pandas

I have the following datasets
import pandas as pd
import numpy as np
df = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1")
df2 = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet2")
df2.dropna(inplace = True)
For each group of values on the first df, X-Axis Value and Y-Axis Value, where the first one is the date and the second one is a value, I would like to create rows with the same date. For instance, for df.iloc[0,0] the timestamp is Timestamp('2020-08-25 23:14:12'). However, in the following columns of the same row there may be other dates with a different Y-Axis Value associated, the first one in that specific row being X-Axis Value NCVE-064 HPNDE with a timestamp of 2020-08-25 23:04:12 and an associated Y-Axis Value of 0.952.
What I want to accomplish is to interpolate those values for a time interval, maybe 10 minutes, and then merge those results to have the same date for each row.
For df2 it is more or less the same: interpolate the values in a time interval and add them to the original dataframe. Is there any way to do this?
The trick is to realize that datetimes can be represented as seconds elapsed with respect to some time.
Without further context, the hardest part is deciding at what times you want to have the interpolated values.
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
df = pd.read_excel(
"https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1",
)
x_columns = [col for col in df.columns if col.startswith("X-Axis")]
# What times do we want to align the columns to?
# You can use anything else here or define equally spaced time points
# or something else.
target_times = df[x_columns].min(axis=1)
def interpolate_column(target_times, x_times, y_values):
    ref_time = x_times.min()
    # For interpolation we need to represent the values as floats. One option is to
    # compute the delta in seconds between a reference time and the "current" time.
    deltas = (x_times - ref_time).dt.total_seconds()
    # repeat for our target times
    target_times_seconds = (target_times - ref_time).dt.total_seconds()
    return interp1d(deltas, y_values, bounds_error=False, fill_value="extrapolate")(target_times_seconds)
output_df = pd.DataFrame()
output_df["Times"] = target_times
output_df["Y-Axis Value NCVE-063 VPNDE"] = interpolate_column(
target_times,
df["X-Axis Value NCVE-063 VPNDE"],
df["Y-Axis Value NCVE-063 VPNDE"],
)
# repeat for the other columns, better in a loop
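For instance, that loop might look like this (a sketch, assuming every "X-Axis ..." column has a matching "Y-Axis ..." column with the same suffix):
for x_col in x_columns:
    y_col = x_col.replace("X-Axis", "Y-Axis", 1)
    if y_col in df.columns:
        output_df[y_col] = interpolate_column(target_times, df[x_col], df[y_col])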

How to categorize a range of hours in Pandas?

In my project I am trying to create a new column to categorize records by range of hours. Let me explain: I have a column in the dataframe called 'TowedTime' with time series data, and I want another column that categorizes it by full hour, without minutes. For example, if the value in the 'TowedTime' column is 09:32:10 it should be categorized as 9 AM; if it says 12:45:10 it should be categorized as 12 PM, and so on with all the other values. I've read about the .cut and bins function but I can't get the result I want.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
df = pd.read_excel("Baltimore Towing Division.xlsx",sheet_name="TowingData")
df['Month'] = pd.DatetimeIndex(df['TowedDate']).strftime("%b")
df['Week day'] = pd.DatetimeIndex(df['TowedDate']).strftime("%a")
monthOrder = ['Jan', 'Feb', 'Mar', 'Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
dayOrder = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
pivotHours = pd.pivot_table(df, values='TowedDate', index='TowedTime',
                            columns='Week day',
                            fill_value=0,
                            aggfunc='count',
                            margins=False, margins_name='Total').reindex(dayOrder, axis=1)
print(pivotHours)
First, make sure the type of the column 'TowedTime' is datetime. Second, you can easily extract the hour from this data type.
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S')
df['hour'] = df['TowedTime'].dt.hour
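If you also want the literal '9 AM' / '12 PM' style label described in the question, one option (a sketch building on the hour extracted above, not part of the original answer) is:
# sketch: 9 -> '9 AM', 12 -> '12 PM', 0 -> '12 AM', 13 -> '1 PM'
df['HourLabel'] = df['hour'].apply(lambda h: f"{(h % 12) or 12} {'AM' if h < 12 else 'PM'}")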
Hope it answers your question.
With the help of @Fabien C I was able to solve the problem.
First, I had to check the data type of the values in the 'TowedTime' column with dtypes. I found that they were Objects.
I proceeded to try to convert 'TowedTime' to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.time
Then to create a new column in the df, for only the hours:
df['Hour'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.hour
And the result was this:
You can notice in the image that 'TowedTime' column remains as an object, but the new 'Hour' column correctly returns the hour value.
Originally, the dataset already had the date and time separated into different columns. I think they used some method in Excel to separate date and time, and this caused the time ('TowedTime') to be an object that I could not convert, or at least that's what the dtypes function shows me.
I tried all these Pandas methods for converting the Object to Datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'])
df['TowedTime'] = df['TowedTime'].astype('datetime64[ns]')
df['TowedTime'] = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')
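One possible explanation (an assumption, since the raw file isn't shown): if the Excel import produced datetime.time objects in 'TowedTime', pd.to_datetime cannot convert those directly and the column stays object; converting through strings first usually works:
# sketch, assuming 'TowedTime' holds datetime.time objects after the Excel import
df['TowedTime'] = pd.to_datetime(df['TowedTime'].astype(str), format='%H:%M:%S')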

Multiplying difference between two dates in days by float, vectorized form

I have a function which calculates the difference between two dates and then multiplies that by a rate. I would like to use this in a one-off example, but also apply it to a pd.Series in vectorized form for large-scale calculations. Currently it is getting hung up at
(start_date - end_date).days
AttributeError: 'Series' object has no attribute 'days'
import pandas as pd

pddt = lambda x: pd.to_datetime(x)

def cost(start_date, end_date, cost_per_day):
    start_date = pddt(start_date)
    end_date = pddt(end_date)
    total_days = (end_date - start_date).days
    cost = total_days * cost_per_day
    return cost

a = {'start_date': ['2020-07-01', '2020-07-02'], 'end_date': ['2020-07-04', '2020-07-10'], 'cost_per_day': [2, 1.5]}
df = pd.DataFrame.from_dict(a)
costs = cost(df.start_date, df.end_date, df.cost_per_day)
cost_adhoc = cost('2020-07-15', '2020-07-22', 3)
If I run it with the series I get the following error:
AttributeError: 'Series' object has no attribute 'days'
If I try to correct it by adding .dt.days, then when I only use a single input I get the following error:
AttributeError: 'Timestamp' object has no attribute 'dt'
You can change the function to compute the days as
total_days = (end_date-start_date) / np.timedelta64(1, 'D')
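For example, a sketch of the adjusted function (column names taken from the question) that then handles both a single pair of dates and whole Series:
import numpy as np
import pandas as pd

def cost(start_date, end_date, cost_per_day):
    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    # dividing by a 1-day timedelta gives a float number of days,
    # for a scalar Timedelta as well as for a Series of timedeltas
    total_days = (end_date - start_date) / np.timedelta64(1, 'D')
    return total_days * cost_per_day

df = pd.DataFrame({'start_date': ['2020-07-01', '2020-07-02'],
                   'end_date': ['2020-07-04', '2020-07-10'],
                   'cost_per_day': [2, 1.5]})
cost(df.start_date, df.end_date, df.cost_per_day)  # 0: 6.0, 1: 12.0
cost('2020-07-15', '2020-07-22', 3)                # 21.0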
Assuming both variables are datetime objects, the expression (end_date-start_date) gives you a timedelta object [docs]. It holds time difference as days, seconds, and microseconds. To convert that to days for example, you would use (end_date-start_date).total_seconds()/(24*60*60).
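For illustration, with plain datetime objects:
from datetime import datetime
start = datetime(2020, 7, 15)
end = datetime(2020, 7, 22)
delta = end - start                             # timedelta(days=7)
days = delta.total_seconds() / (24 * 60 * 60)   # 7.0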
For the given question, the goal is to multiply daily costs with the total number of days. pandas uses a subclass of timedelta (timedelta64[ns] by default) which facilitates getting the total days (no total_seconds() needed), see frequency conversion. All you need to do is change the timedelta to dtype timedelta64[D] (D for daily frequency):
import pandas as pd
df = pd.DataFrame({'start_date': ['2020-07-01', '2020-07-02'],
'end_date': ['2020-07-04', '2020-07-10'],
'cost_per_day': [2, 1.5]})
# make sure dtype is datetime:
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# multiply cost/day with total days: end_date-start_date converted to days
# (note: newer pandas versions no longer accept astype('timedelta64[D]');
#  there you can use (df['end_date']-df['start_date']).dt.days instead)
df['total_cost'] = df['cost_per_day'] * (df['end_date']-df['start_date']).astype('timedelta64[D]')
# df['total_cost']
# 0 6.0
# 1 12.0
# Name: total_cost, dtype: float64
Note: you don't need to use a pandas.DataFrame here; working with pandas.Series also does the trick. However, since pandas was created for this kind of operation, it brings a lot of convenience. In particular, you don't need to do any iteration in Python; it's done for you in fast C code.
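As a sketch, the same calculation with plain Series:
import pandas as pd
start = pd.to_datetime(pd.Series(['2020-07-01', '2020-07-02']))
end = pd.to_datetime(pd.Series(['2020-07-04', '2020-07-10']))
cost_per_day = pd.Series([2, 1.5])
total_cost = cost_per_day * (end - start).dt.days   # 0: 6.0, 1: 12.0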

Pandas time conversion not consistent?

I tried to convert a timestamp to a datetime object in a dataframe using the apply method; however, when I convert the datetime object back to a timestamp, it is a different timestamp. Has anyone encountered this issue, and how do I resolve it?
As above
import datetime
import pandas as pd

original_ts = 1564560384
a = pd.DataFrame({"ts": [original_ts]})
a["time"] = a["ts"].apply(lambda x: datetime.datetime.fromtimestamp(x))
a["time"].iloc[0].timestamp()
# outputs 1564545984.0
I expected the output to be the same as the original_ts but it is not
You need to specify that the timezone is None:
import datetime
import pandas as pd
original_ts = 1564560384
a = pd.DataFrame({"ts": [original_ts]})
a["time"] = a["ts"].apply(lambda x: datetime.datetime.fromtimestamp(x, tz=None))
a["time"].iloc[0].timestamp() # 1564560384.0