Pandas Timeseries: Total duration meeting a specific condition - pandas

I have a timeseries
ts = pd.Series(data=[0,1,2,3,4],index=[pd.Timestamp('1991-01-01'),pd.Timestamp('1995-01-01'),pd.Timestamp('1996-01-01'),pd.Timestamp('2010-01-01'),pd.Timestamp('2011-01-01')])
What's the fastest, most readable way to get the total duration for which the value is below 2, assuming each value is valid until the next time step indicates otherwise (no linear interpolation)? I imagine there is probably a pandas function for this.

This seems to be working quite well, however I am still baffled that there does not seem to be a pandas function for this!
import pandas as pd
import numpy as np
ts = pd.Series(data=[0,1,2,3,4],index=[pd.Timestamp('1991-01-01'),pd.Timestamp('1995-01-01'),pd.Timestamp('1996-01-01'),pd.Timestamp('2010-01-01'),pd.Timestamp('2011-01-01')])
# boolean mask over all but the last point: True where the value is below 2
below_2 = (ts < 2).to_numpy()[:-1]
# time each value stays valid = gap to the next timestamp
delta_time = ts.index.to_pydatetime()[1:] - ts.index.to_pydatetime()[:-1]
time_below_2 = np.sum(delta_time[below_2]).total_seconds()
time_above_2 = np.sum(delta_time[~below_2]).total_seconds()
The above function seems to break for certain timeframes. This option is slower, but did not break in any of my tests:
def get_total_duration_above_and_below_value(value, ts):
    # boolean mask: True where the series is at or above value
    above = ts >= value
    time_above_value = 0
    time_below_value = 0
    for i in range(ts.size - 1):
        # each value holds until the next timestamp; durations are in hours
        hours = abs((ts.index[i + 1] - ts.index[i]).total_seconds()) / 3600
        if above.iloc[i]:
            time_above_value += hours
        else:
            time_below_value += hours
    return time_above_value, time_below_value
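A vectorized variant of the same idea, as a minimal sketch: the gap to the next timestamp is treated as the time each value stays valid, and the gaps are summed under a boolean mask. The helper name `duration_where` is hypothetical, not a pandas function.

```python
import pandas as pd

def duration_where(ts, mask):
    """Sum the time each value is held (until the next timestamp) where mask is True."""
    # Gap from each timestamp to the next one; the last point gets NaT (no known duration).
    held_for = ts.index.to_series().diff().shift(-1)
    # sum() skips NaT, so the final point contributes nothing.
    return held_for[mask].sum()

ts = pd.Series(
    data=[0, 1, 2, 3, 4],
    index=pd.to_datetime(['1991-01-01', '1995-01-01', '1996-01-01',
                          '2010-01-01', '2011-01-01']),
)
time_below_2 = duration_where(ts, ts < 2)
time_at_or_above_2 = duration_where(ts, ts >= 2)
```

Both results are `pd.Timedelta` objects, so `.total_seconds()` converts them to seconds if needed.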

Related

Csv file search speedup

I need to build a relief (elevation) profile from coordinates, and I have a CSV file with 12,000,000 lines. Looking up a single height in the CSV file takes about 2 - 2.5 seconds. Converting the CSV to Parquet saved some time: finding one height now takes about 1 - 1.7 seconds. However, I need to build a profile from 500 - 2000 values, which makes the total time very long. The CSV data set may also grow in the future, which would slow the process down even more. Is it possible to reduce the lookup time somehow?
Code example:
import dask.dataframe as dk
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
filename = 'n46_e032_1arc_v3.csv'
df = dk.read_csv(filename)
df.to_parquet('n46_e032_1arc_v3_parquet')
Latitude1y, Longitude1x = 46.6276, 32.5942
Latitude2y, Longitude2x = 46.6451, 32.6781
sec, steps, k = 0.00027778, 1, 11.73
Latitude, Longitude = [Latitude1y], [Longitude1x]
sin, cos = Latitude2y - Latitude1y, Longitude2x - Longitude1x
y, x = Latitude1y, Longitude1x
while Latitude[-1] < Latitude2y and Longitude[-1] < Longitude2x:
    y, x, steps = y + sec * k * sin, x + sec * k * cos, steps + 1
    Latitude.append(y)
    Longitude.append(x)
time_start = time.time()
long, elevation_data = [], []
df2 = dk.read_parquet('n46_e032_1arc_v3_parquet')
# Latitude and Longitude hold exactly `steps` points, so iterate over range(steps)
for i in range(steps):
    elevation_line = df2[(Longitude[i] <= df2['x']) & (df2['x'] <= Longitude[i] + sec) &
                         (Latitude[i] <= df2['y']) & (df2['y'] <= Latitude[i] + sec)].compute()
    elevation = np.asarray(elevation_line.z.tolist())
    if elevation[-1] < 0:
        elevation_data.append(0)
    else:
        elevation_data.append(elevation[-1])
    long.append(30 * i)
plt.bar(long, elevation_data, width = 30)
plt.show()
print(time.time() - time_start)
Here's one way to solve this problem using KD trees. A KD tree is a data structure for doing fast nearest-neighbor searches.
import scipy.spatial
tree = scipy.spatial.KDTree(df[['x', 'y']].values)
elevations = df['z'].values
long, elevation_data = [], []
for i in range(steps):
    lon, lat = Longitude[i], Latitude[i]
    dist, idx = tree.query([lon, lat])
    elevation = elevations[idx]
    if elevation < 0:
        elevation = 0
    elevation_data.append(elevation)
    long.append(30 * i)
Note: if you can make assumptions about the data, like "all of the points in the CSV are equally spaced," faster algorithms are possible.
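As a rough sketch of what such a faster algorithm could look like: if the points really do sit on a regular grid, the row and column of the nearest cell can be computed arithmetically, with no search at all. The origin `x0`, `y0`, the spacing `step`, and the tiny `grid` array below are made-up illustration values, not taken from the CSV in the question.

```python
import numpy as np

# Hypothetical regular grid: grid[row, col] is the elevation at
# (y0 + row * step, x0 + col * step). All values here are illustrative.
x0, y0, step = 32.0, 46.0, 0.25
grid = np.arange(20, dtype=float).reshape(4, 5)

def lookup(lons, lats):
    """Return the elevation of the nearest grid cell for each (lon, lat) pair."""
    cols = np.rint((np.asarray(lons) - x0) / step).astype(int)
    rows = np.rint((np.asarray(lats) - y0) / step).astype(int)
    # Clamp to the grid so points slightly outside map to the border cells.
    cols = np.clip(cols, 0, grid.shape[1] - 1)
    rows = np.clip(rows, 0, grid.shape[0] - 1)
    return grid[rows, cols]

values = lookup([32.0, 32.26, 33.0], [46.0, 46.49, 46.75])
```

All profile points are looked up in one vectorized call, so the cost per point is a few arithmetic operations rather than a scan or a tree query.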
It looks like your data might be on a regular grid. If (and only if) every combination of x and y exists in your data, then it probably makes sense to turn this into a labeled 2D array of points, after which querying the correct position will be very fast.
For this, I'll use xarray, which is essentially pandas for N-dimensional data, and integrates well with dask:
import xarray as xr
# bring the dataframe into memory
df = dk.read_parquet('n46_e032_1arc_v3_parquet').compute()
da = df.set_index(["y", "x"]).z.to_xarray()
# now you can query the nearest points:
desired_lats = xr.DataArray([46.6276, 46.6451], dims=["point"])
desired_lons = xr.DataArray([32.5942, 32.6781], dims=["point"])
subset = da.sel(y=desired_lats, x=desired_lons, method="nearest")
# if you'd like, you can return to pandas:
subset_s = subset.to_series()
# you could do this only once, and save the reshaped array as a zarr store:
ds = da.to_dataset(name="elevation")
ds.to_zarr("n46_e032_1arc_v3.zarr")

Sliding window method over a large range using numpy vectorization

I'm trying to implement a sliding window method for a genomics dataset that I have, over a fairly long range (upwards of 50k nucleotides). My approach so far works fine, but it is fairly slow (taking several seconds per range, and several minutes per range at intervals >150k bp). Here is my code so far:
import numpy as np
import pandas as pd
VectorizedRange = np.arange(Start, End)#Start, End genomic flags on the reference genome
SlidingWindow = np.lib.stride_tricks.sliding_window_view(VectorizedRange, 100)#100 = the window size
GroupedDictFrame = pd.DataFrame({"Bins":GenomeRange})
GroupedDictFrame["ReadCov"] = 0
GroupedDictFrame["ReadSeq"] = [list() for _ in range(len(GroupedDictFrame.index.values))]
GroupedDictFrame.set_index(keys=["Bins"], inplace=True, drop=True)
def Appender(Start, End, Width, Seq):
    AvgCov = 0
    SeqList = []
    if End <= Window[-1]:
        AvgCov += 1
        SeqList.append(Seq)
    elif End > Window[-1]:
        AvgCov += (Window[-1] - Start)/Width
        SeqList.append(Seq[0:(Window[-1] - Start)])
    GroupedDictFrame.loc[Window[0], "ReadCov"] += AvgCov
    GroupedDictFrame.loc[Window[0], "ReadSeq"] = SeqList
for Window in SlidingWindow:
    SubsetBAM = BAMFrame[(
        (BAMFrame["start_coord"]>=Window[0])&
        (BAMFrame["start_coord"]<=Window[-1])
    )].reset_index(drop=True)
    SubsetBAM.apply(
        lambda x: Appender(x.start_coord,
                           x.end_coord,
                           x.width_lis,
                           x.seq_lis), axis=1
    )
I think my vectorization isn't the best, any suggestions for speeding this up?
So I think I figured it out on my own, I'll add my solution in case anyone else faces a similar problem.
Essentially, I stopped subsetting my dataframe containing the small DNA read fragments in the for loop, and did one subset before the loop and converted it to a numpy array.
I removed my function and used numpy.where to do all my logic.
import numpy as np
import pandas as pd
VectorizedRange = np.arange(Start, End)
SlidingWindow = np.lib.stride_tricks.sliding_window_view(VectorizedRange, 100)
GroupedDictFrame = pd.DataFrame({"Bins":GenomeRange})
GroupedDictFrame["ReadCov"] = 0
GroupedDictFrame["ReadSeq"] = [list() for _ in range(len(GroupedDictFrame.index.values))]
GroupedDictFrame.set_index(keys=["Bins"], inplace=True, drop=True)
CoordArray = BAMFrame.loc[:, "start_coord":"end_coord"].to_numpy()
for Window in SlidingWindow:
    ReadCovIn = np.where(((CoordArray[:,1] <= Window[-1]) & (CoordArray[:,0] >= Window[0])), 1, 0)
    ReadCovOut = np.where(((CoordArray[:,1] > Window[-1]) & ((CoordArray[:,0] >= Window[0]) & (CoordArray[:,0] < Window[-1]))),
                          (Window[-1] - CoordArray[:,0])/(CoordArray[:,1] - CoordArray[:,0]), 0)
    GroupedDictFrame.loc[Window[0], "ReadCov"] += np.sum((np.sum(ReadCovIn), np.sum(ReadCovOut)))
I've gotten it down to ~1 second per gene region which is typically about 50kb (so that would mean the SlidingWindow has a shape of (49900,100)), which is pretty good I think!
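The remaining Python loop over windows could in principle be removed too by broadcasting windows against reads. The following is only a sketch of that idea under the same inclusion rules as the `np.where` calls above; the function name `window_coverage` and the toy coordinates are hypothetical, and it assumes every read has `end > start`. It builds a (windows x reads) array, so for very large inputs it would need to be applied in chunks.

```python
import numpy as np

def window_coverage(starts, ends, win_starts, width):
    """Coverage per window: 1 for reads fully inside, a fraction for reads overhanging the end."""
    starts = np.asarray(starts, dtype=float)[None, :]   # shape (1, n_reads)
    ends = np.asarray(ends, dtype=float)[None, :]
    w0 = np.asarray(win_starts, dtype=float)[:, None]   # shape (n_windows, 1)
    w1 = w0 + width - 1                                  # last position, like Window[-1]
    inside = (starts >= w0) & (ends <= w1)
    overhang = (ends > w1) & (starts >= w0) & (starts < w1)
    # Fraction of an overhanging read that lies before the window's last position.
    frac = np.where(overhang, (w1 - starts) / (ends - starts), 0.0)
    return inside.sum(axis=1) + frac.sum(axis=1)

# Toy data: three reads, two windows of width 10 starting at 0 and 2.
cov = window_coverage(starts=[0, 3, 8], ends=[5, 12, 9], win_starts=[0, 2], width=10)
```

The trade-off is memory for speed: the intermediate boolean arrays grow with the product of the number of windows and the number of reads.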

cumsum() on running streak of data

I'm attempting to determine the sum of a column over a running period in which there is a negative gain, i.e. I want to determine the total for a losing streak.
I've set up a column that gives the number of consecutive days of losses (Consecutive Losses), but I also want the total loss over the streak. What I have (Aggregate Consecutive Loss) 1) doesn't work, because it just applies cumsum() without resetting to zero at each streak, and 2) is incorrect, as I should in fact take the Open value at the start of the streak and the Close value at the end.
How can I correctly set up this Aggregate Consecutive Loss value in pandas?
import pandas as pd
import numpy as np
import yfinance as yf
def get( symbols, group_by_ticker=False, **kwargs ):
    if not isinstance(symbols, list):
        symbols = [ symbols, ]
    kwargs['auto_adjust'] = True
    kwargs['prepost'] = True
    kwargs['threads'] = True
    df = None
    if group_by_ticker:
        kwargs['group_by'] = 'ticker'
    df = yf.download( symbols, **kwargs)
    for t in symbols:
        df["Change Percent",t] = df["Close",t].pct_change() * 100
        df["Gain",t] = np.where( df['Change Percent',t] > 0, True, False ).astype('bool')
        a = df['Gain',t] != True
        df['Consecutive Losses',t] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
        x = df['Change Percent',t].where( df['Consecutive Losses',t] > 0 )
        df['Aggregate Consecutive Loss',t] = x.cumsum() - x.cumsum().where(~a).ffill().fillna(0).astype(float)
    return df
data = get( ["DOW", "IDX"], period="6mo")
data[['Change Percent','Gain','Consecutive Losses','Aggregate Consecutive Loss']].head(50)
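One possible way to get a cumulative sum that resets at each streak, shown as a minimal sketch on made-up numbers rather than on the yfinance data: give every streak its own group id (the id increases on each non-losing day) and take a cumulative sum within each group. The series `change` and the variable names are illustrative assumptions. Note that this simply adds up daily percentage changes; it does not implement the Open-to-Close calculation mentioned in the question.

```python
import pandas as pd

# Made-up daily percentage changes (illustrative only).
change = pd.Series([1.0, -2.0, -3.0, 4.0, -1.0, -1.5, -0.5, 2.0])

loss = change <= 0                      # True on losing days
streak_id = (~loss).cumsum()            # increases on every non-losing day
# Zero out non-losing days, then accumulate separately within each streak.
aggregate_loss = change.where(loss, 0.0).groupby(streak_id).cumsum()
```

Each non-losing day starts a new group with a value of 0, so the running total restarts at the beginning of every losing streak.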

Time Difference between Time Period and Instant

I have some time periods (df_A) and some time instants (df_B):
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
# Data
df_A = pd.DataFrame({'A1': [dt.datetime(2017,1,5,9,8), dt.datetime(2017,1,5,9,9), dt.datetime(2017,1,7,9,19), dt.datetime(2017,1,7,9,19), dt.datetime(2017,1,7,9,19), dt.datetime(2017,2,7,9,19), dt.datetime(2017,2,7,9,19)],
'A2': [dt.datetime(2017,1,5,9,9), dt.datetime(2017,1,5,9,12), dt.datetime(2017,1,7,9,26), dt.datetime(2017,1,7,9,20), dt.datetime(2017,1,7,9,21), dt.datetime(2017,2,7,9,23), dt.datetime(2017,2,7,9,25)]})
df_B = pd.DataFrame({ 'B': [dt.datetime(2017,1,6,14,45), dt.datetime(2017,1,4,3,31), dt.datetime(2017,1,7,3,31), dt.datetime(2017,1,7,14,57), dt.datetime(2017,1,9,14,57)]})
I can match these together:
# Define an Extra Margin
M = dt.timedelta(days = 10)
df_A["A1X"] = df_A["A1"] + M
df_A["A2X"] = df_A["A2"] - M
# Match
Bv = df_B.B.values
A1 = df_A.A1X.values
A2 = df_A.A2X.values
i, j = np.where((Bv[:, None] >= A1) & (Bv[:, None] <= A2))
df_C = pd.DataFrame(np.column_stack([df_B.values[i], df_A.values[j]]),
                    columns=df_B.columns.append(df_A.columns))
I would like to find the time difference between each time period and the time instant matched to it. I mean that
if B is between A1 and A2
then dT = 0
I've tried doing it like this:
# Calculate dt
def time(A1,A2,B):
    if df_C["B"] < df_C["A1"]:
        return df_C["A1"].subtract(df_C["B"])
    elif df_C["B"] > df_C["A2"]:
        return df_C["B"].subtract(df_C["A2"])
    else:
        return 0
df_C['dt'] = df_C.apply(time)
I'm getting "ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series"
So, I found two fixes:
You are adding M to the lower value and subtracting from the higher one. Change it to:
df_A['A1X'] = df_A['A1'] - M
df_A['A2X'] = df_A['A2'] + M
You are only passing one row of your dataframe at a time to your time function, so it should be something like:
def time(row):
    if row['B'] < row['A1']:
        return row['A1'] - row['B']
    elif row['B'] > row['A2']:
        return row['B'] - row['A2']
    else:
        # return a zero Timedelta so the column holds one consistent type
        return pd.Timedelta(0)
And then you can call it like this:
df_C['dt'] = df_C.apply(time, axis=1)
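The same rule can also be written without a row-wise `apply`. This is a small sketch on a made-up three-row frame (the column names match the question, the values do not): compute both candidate differences for the whole column and keep whichever is positive, falling back to zero when B lies inside the period.

```python
import pandas as pd

# Illustrative data: one period repeated, with B before, inside, and after it.
df = pd.DataFrame({
    'A1': pd.to_datetime(['2017-01-05 09:00'] * 3),
    'A2': pd.to_datetime(['2017-01-05 10:00'] * 3),
    'B': pd.to_datetime(['2017-01-05 08:00', '2017-01-05 09:30', '2017-01-05 12:00']),
})

zero = pd.Timedelta(0)
before = df['A1'] - df['B']   # positive when B is earlier than the period
after = df['B'] - df['A2']    # positive when B is later than the period
# Keep `before` where B precedes the period, else `after` where B follows it, else zero.
df['dt'] = before.where(before > zero, after.where(after > zero, zero))
```

The result is a timedelta column, which keeps later arithmetic (for example `.dt.total_seconds()`) straightforward.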

matplotlib x-axis ticks dates formatting and locations

I've tried to duplicate graphs originally created with flotr2, for PDF output, using matplotlib. I must say that flotr is way easier to use, but that aside, I'm currently stuck trying to format the dates/times on the x-axis. The desired format is hours:minutes with a tick every 2 hours if the period on the x-axis is less than one day, and year-month-day with a tick every day if the period is longer than one day.
I've read through numerous examples and tried to copy them, but the outcome remains the same: hours:minutes:seconds with a 1 to 3 hour interval, depending on how long the period is.
My code:
colorMap = {
    'speed': '#3388ff',
    'fuel': '#ffaa33',
    'din1': '#3bb200',
    'din2': '#ff3333',
    'satellites': '#bfbfff'
}
otherColors = ['#00A8F0','#C0D800','#CB4B4B','#4DA74D','#9440ED','#800080','#737CA1','#E4317F','#7D0541','#4EE2EC','#6698FF','#437C17','#7FE817','#FBB117']
plotMap = {}
import datetime
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.dates as dates
fig = plt.figure(figsize=(22, 5), dpi = 300, edgecolor='k')
ax1 = fig.add_subplot(111)
realdata = data['data']
# list() so that index/pop/insert also work on Python 3, where keys() returns a view
keys = list(realdata.keys())
if 'speed' in keys:
    speed_index = keys.index('speed')
    keys.pop(speed_index)
    keys.insert(0, 'speed')
i = 0
for key in keys:
    if key not in colorMap.keys():
        color = otherColors[i]
        otherColors.pop(i)
        colorMap[key] = color
        i += 1
label = u'%s' % realdata[keys[0]]['name']
ax1.set_ylabel(label)
plotMap[keys[0]] = {}
plotMap[keys[0]]['label'] = label
first_dates = [ r[0] for r in realdata[keys[0]]['data']]
date_range = first_dates[-1] - first_dates[0]
ax1.xaxis.reset_ticks()
if date_range > datetime.timedelta(days = 1):
    ax1.xaxis.set_major_locator(dates.WeekdayLocator(byweekday = 1, interval=1))
    ax1.xaxis.set_major_formatter(dates.DateFormatter('%Y-%m-%d'))
else:
    ax1.xaxis.set_major_locator(dates.HourLocator(byhour=range(24), interval=2))
    ax1.xaxis.set_major_formatter(dates.DateFormatter('%H:%M'))
ax1.xaxis.grid(True)
plotMap[keys[0]]['plot'] = ax1.plot_date(
dates.date2num(first_dates),
[r[1] for r in realdata[keys[0]]['data']], colorMap[keys[0]], xdate=True)
if len(keys) > 1:
    first = True
    for key in keys[1:]:
        if first:
            ax2 = ax1.twinx()
            ax2.set_ylabel(u'%s' % realdata[key]['name'])
            first = False
        plotMap[key] = {}
        plotMap[key]['label'] = u'%s' % realdata[key]['name']
        plotMap[key]['plot'] = ax2.plot_date(
            dates.date2num([ r[0] for r in realdata[key]['data']]),
            [r[1] for r in realdata[key]['data']], colorMap[key], xdate=True)
# .items() replaces the Python 2 only .iteritems()
plt.legend([value['plot'] for key, value in plotMap.items()], [value['label'] for key, value in plotMap.items()], loc = 2)
plt.savefig(path +"node.png", dpi = 300, bbox_inches='tight')
Could someone point out why I'm not getting the desired results, please?
Edit1:
I moved the formatting block after the plotting and seem to be getting better results now. They are still not the desired results, though. If the period is less than a day, I get ticks every 2 hours (interval=2), but I would like those ticks at even hours rather than odd hours. Is that possible?
if date_range > datetime.timedelta(days = 1):
    xax.set_major_locator(dates.DayLocator(bymonthday=range(1,32), interval=1))
    xax.set_major_formatter(dates.DateFormatter('%Y-%m-%d'))
else:
    xax.set_major_locator(dates.HourLocator(byhour=range(24), interval=2))
    xax.set_major_formatter(dates.DateFormatter('%H:%M'))
Edit2:
This seemed to give me what i wanted:
if date_range > datetime.timedelta(days = 1):
    xax.set_major_locator(dates.DayLocator(bymonthday=range(1,32), interval=1))
    xax.set_major_formatter(dates.DateFormatter('%Y-%m-%d'))
else:
    xax.set_major_locator(dates.HourLocator(byhour=range(0,24,2)))
    xax.set_major_formatter(dates.DateFormatter('%H:%M'))
Alan
You are making this way harder on yourself than you need to. matplotlib can plot directly against datetime objects. I suspect your problem is that you set up the locators and then plot, and the plotting replaces your locators/formatters with the default auto versions. Try moving the locator logic to below the plotting loop.
I think that this could replace a fair chunk of your code:
import datetime
import matplotlib.pyplot as plt
d = datetime.timedelta(minutes=2)
now = datetime.datetime.now()
times = [now + d * j for j in range(500)]
ax = plt.gca() # get the current axes
ax.plot(times, range(500))
xax = ax.get_xaxis() # get the x-axis
adf = xax.get_major_formatter() # get the auto-formatter
adf.scaled[1./24] = '%H:%M' # set the < 1d scale to H:M
adf.scaled[1.0] = '%Y-%m-%d' # set the > 1d < 1m scale to Y-m-d
adf.scaled[30.] = '%Y-%m' # set the > 1m < 1Y scale to Y-m
adf.scaled[365.] = '%Y' # set the > 1y scale to Y
plt.draw()
doc for AutoDateFormatter
I achieved what i wanted by doing this:
if date_range > datetime.timedelta(days = 1):
    xax.set_major_locator(dates.DayLocator(bymonthday=range(1,32), interval=1))
    xax.set_major_formatter(dates.DateFormatter('%Y-%m-%d'))
else:
    xax.set_major_locator(dates.HourLocator(byhour=range(0,24,2)))
    xax.set_major_formatter(dates.DateFormatter('%H:%M'))
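For reference, here is a self-contained sketch of the sub-day case, using made-up timestamps instead of the original `realdata`: the locator and formatter are set after plotting, and `byhour=range(0, 24, 2)` pins the ticks to even hours even though the data starts at 01:30.

```python
import datetime
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, as in the question
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Illustrative data: one point every 10 minutes for about 10 hours.
start = datetime.datetime(2024, 1, 1, 1, 30)
times = [start + datetime.timedelta(minutes=10 * j) for j in range(60)]

fig, ax = plt.subplots()
ax.plot(times, range(len(times)))
# Set the locator/formatter after plotting so they are not replaced by defaults.
ax.xaxis.set_major_locator(mdates.HourLocator(byhour=range(0, 24, 2)))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
fig.canvas.draw()

# Hours at which major ticks were placed.
tick_hours = [mdates.num2date(t).hour for t in ax.get_xticks()]
```

By contrast, `HourLocator(byhour=range(24), interval=2)` takes every second hour counted from wherever the tick sequence starts, which is why the ticks could land on odd hours.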