Sliding window method over a large range using numpy vectorization - pandas

I'm trying to implement a sliding window method for a genomics dataset that I have, over a fairly long range (upwards of 50k nucleotides). My approach so far works, but it is fairly slow (several seconds per range, and several minutes per range at intervals >150k bp). Here is my code so far:
import numpy as np
import pandas as pd

VectorizedRange = np.arange(Start, End)  # Start, End: genomic flags on the reference genome
SlidingWindow = np.lib.stride_tricks.sliding_window_view(VectorizedRange, 100)  # 100 = the window size

# GenomeRange and BAMFrame are defined elsewhere (not shown)
GroupedDictFrame = pd.DataFrame({"Bins": GenomeRange})
GroupedDictFrame["ReadCov"] = 0
GroupedDictFrame["ReadSeq"] = [list() for _ in range(len(GroupedDictFrame.index.values))]
GroupedDictFrame.set_index(keys=["Bins"], inplace=True, drop=True)

def Appender(Start, End, Width, Seq):
    # Window is the current window from the loop below (looked up at call time)
    AvgCov = 0
    SeqList = []
    if End <= Window[-1]:
        AvgCov += 1
        SeqList.append(Seq)
    elif End > Window[-1]:
        AvgCov += (Window[-1] - Start)/Width
        SeqList.append(Seq[0:(Window[-1] - Start)])
    GroupedDictFrame.loc[Window[0], "ReadCov"] += AvgCov
    GroupedDictFrame.loc[Window[0], "ReadSeq"] = SeqList

for Window in SlidingWindow:
    SubsetBAM = BAMFrame[(
        (BAMFrame["start_coord"] >= Window[0]) &
        (BAMFrame["start_coord"] <= Window[-1])
    )].reset_index(drop=True)
    SubsetBAM.apply(
        lambda x: Appender(x.start_coord,
                           x.end_coord,
                           x.width_lis,
                           x.seq_lis), axis=1
    )
I think my vectorization isn't the best; any suggestions for speeding this up?

So I think I figured it out on my own; I'll add my solution in case anyone else faces a similar problem.
Essentially, I stopped subsetting the dataframe containing the small DNA read fragments inside the for loop; instead I do a single subset before the loop and convert it to a numpy array.
I also removed my function and use numpy.where for all the logic.
import numpy as np
import pandas as pd

VectorizedRange = np.arange(Start, End)
SlidingWindow = np.lib.stride_tricks.sliding_window_view(VectorizedRange, 100)

GroupedDictFrame = pd.DataFrame({"Bins": GenomeRange})
GroupedDictFrame["ReadCov"] = 0
GroupedDictFrame["ReadSeq"] = [list() for _ in range(len(GroupedDictFrame.index.values))]
GroupedDictFrame.set_index(keys=["Bins"], inplace=True, drop=True)

# single extraction of the read coordinates, done once before the loop
CoordArray = BAMFrame.loc[:, "start_coord":"end_coord"].to_numpy()

for Window in SlidingWindow:
    ReadCovIn = np.where((CoordArray[:, 1] <= Window[-1]) & (CoordArray[:, 0] >= Window[0]), 1, 0)
    ReadCovOut = np.where((CoordArray[:, 1] > Window[-1]) & ((CoordArray[:, 0] >= Window[0]) & (CoordArray[:, 0] < Window[-1])),
                          (Window[-1] - CoordArray[:, 0]) / (CoordArray[:, 1] - CoordArray[:, 0]), 0)
    GroupedDictFrame.loc[Window[0], "ReadCov"] += np.sum((np.sum(ReadCovIn), np.sum(ReadCovOut)))
I've gotten it down to ~1 second per gene region; a typical region is about 50 kb (so SlidingWindow has a shape of roughly (49900, 100)), which is pretty good I think!
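One further tweak that may help (an untested sketch under the same assumptions as the code above): accumulate the per-window coverage into a plain numpy array and write it back to GroupedDictFrame once, instead of doing a label-based .loc assignment on every iteration.
import numpy as np

# CoordArray, SlidingWindow and GroupedDictFrame are assumed to be defined
# exactly as above; only the write-back strategy changes.
Coverage = np.zeros(len(SlidingWindow))
for i, Window in enumerate(SlidingWindow):
    FullyInside = (CoordArray[:, 0] >= Window[0]) & (CoordArray[:, 1] <= Window[-1])
    StartsInside = (CoordArray[:, 1] > Window[-1]) & (CoordArray[:, 0] >= Window[0]) & (CoordArray[:, 0] < Window[-1])
    Partial = np.where(StartsInside,
                       (Window[-1] - CoordArray[:, 0]) / (CoordArray[:, 1] - CoordArray[:, 0]),
                       0)
    Coverage[i] = FullyInside.sum() + Partial.sum()

# one write-back, keyed on each window's start coordinate
# (assumes every window start is present in the Bins index)
GroupedDictFrame.loc[SlidingWindow[:, 0], "ReadCov"] += Coverage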

Related

CSV file search speedup

I need to build a relief (elevation) profile graph from coordinates. I have a CSV file with 12,000,000 lines, and searching it for a single height takes about 2 - 2.5 seconds. I converted the CSV to Parquet, which saved some time: finding one height now takes about 1 - 1.7 seconds. However, I need to build a profile for 500 - 2000 values, which makes the total time very long. The CSV file may also grow in the future, which will slow this down even more. In this regard, my question is: is it possible to somehow reduce the processing time?
Code example:
import dask.dataframe as dk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

filename = 'n46_e032_1arc_v3.csv'
df = dk.read_csv(filename)
df.to_parquet('n46_e032_1arc_v3_parquet')

Latitude1y, Longitude1x = 46.6276, 32.5942
Latitude2y, Longitude2x = 46.6451, 32.6781
sec, steps, k = 0.00027778, 1, 11.73
Latitude, Longitude = [Latitude1y], [Longitude1x]
sin, cos = Latitude2y - Latitude1y, Longitude2x - Longitude1x
y, x = Latitude1y, Longitude1x
while Latitude[-1] < Latitude2y and Longitude[-1] < Longitude2x:
    y, x, steps = y + sec * k * sin, x + sec * k * cos, steps + 1
    Latitude.append(y)
    Longitude.append(x)

time_start = time.time()
long, elevation_data = [], []
df2 = dk.read_parquet('n46_e032_1arc_v3_parquet')
for i in range(steps + 1):
    elevation_line = df2[(Longitude[i] <= df2['x']) & (df2['x'] <= Longitude[i] + sec) &
                         (Latitude[i] <= df2['y']) & (df2['y'] <= Latitude[i] + sec)].compute()
    elevation = np.asarray(elevation_line.z.tolist())
    if elevation[-1] < 0:
        elevation_data.append(0)
    else:
        elevation_data.append(elevation[-1])
    long.append(30 * i)
plt.bar(long, elevation_data, width=30)
plt.show()
print(time.time() - time_start)
Here's one way to solve this problem using KD trees. A KD tree is a data structure for doing fast nearest-neighbor searches.
import scipy.spatial

# df is assumed to hold the points in memory as a pandas DataFrame here
tree = scipy.spatial.KDTree(df[['x', 'y']].values)
elevations = df['z'].values

long, elevation_data = [], []
for i in range(steps):
    lon, lat = Longitude[i], Latitude[i]
    dist, idx = tree.query([lon, lat])
    elevation = elevations[idx]
    if elevation < 0:
        elevation = 0
    elevation_data.append(elevation)
    long.append(30 * i)
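As a small follow-up (a sketch, not tested against the real data): scipy's KDTree.query also accepts an array of points, so the whole profile can be looked up in one batched call instead of one query per step.
import numpy as np

# assumes tree, elevations, Longitude, Latitude and steps from the snippet above
points = np.column_stack([Longitude[:steps], Latitude[:steps]])
dist, idx = tree.query(points)                      # one batched nearest-neighbour lookup
elevation_data = np.clip(elevations[idx], 0, None)  # clamp negative elevations to 0
long = [30 * i for i in range(steps)]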
Note: if you can make assumptions about the data, like "all of the points in the CSV are equally spaced," faster algorithms are possible.
It looks like your data might be on a regular grid. If (and only if) every combination of x and y exists in your data, then it probably makes sense to turn this into a labeled 2D array of points, after which querying the correct position will be very fast.
For this, I'll use xarray, which is essentially pandas for N-dimensional data, and integrates well with dask:
import xarray as xr

# bring the dataframe into memory
df = dk.read_parquet('n46_e032_1arc_v3_parquet').compute()
da = df.set_index(["y", "x"]).z.to_xarray()

# now you can query the nearest points:
desired_lats = xr.DataArray([46.6276, 46.6451], dims=["point"])
desired_lons = xr.DataArray([32.5942, 32.6781], dims=["point"])
subset = da.sel(y=desired_lats, x=desired_lons, method="nearest")

# if you'd like, you can return to pandas:
subset_s = subset.to_series()

# you could do this only once, and save the reshaped array as a zarr store:
ds = da.to_dataset(name="elevation")
ds.to_zarr("n46_e032_1arc_v3.zarr")
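On later runs the reshaping step can then be skipped entirely by opening the saved store and querying it directly. A sketch, reusing the file name above and the Latitude/Longitude point lists built in the question:
import xarray as xr

ds = xr.open_zarr("n46_e032_1arc_v3.zarr")
profile = ds.elevation.sel(
    y=xr.DataArray(Latitude, dims=["point"]),
    x=xr.DataArray(Longitude, dims=["point"]),
    method="nearest",
).to_series()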

Passing pandas dataframe data to a function and it's not outputting the results

In my code, I am trying to extract data from a CSV file to use in the function, but it doesn't output anything and gives no error. My code works, because I tried it with just numpy arrays as inputs. Not sure why it doesn't work with pandas.
import numpy as np
import pandas as pd
import os

# change the current directory to the directory where the running script file is
os.chdir(os.path.dirname(os.path.abspath(__file__)))

# finding best fit line for y=mx+b by iteration
def gradient_descent(x, y):
    m_iter = b_iter = 1  # starting point
    iteration = 10000
    n = len(x)
    learning_rate = 0.05
    last_mse = 10000
    # take baby steps to reach global minima
    for i in range(iteration):
        y_predicted = m_iter*x + b_iter
        # mse = 1/n*sum([value**2 for value in (y-y_predicted)])  # cost function to minimize
        mse = 1/n*sum((y-y_predicted)**2)  # cost function to minimize
        if (last_mse - mse)/mse < 0.001:
            break
        # recall MSE formula is 1/n*sum((yi-y_predicted)^2), where y_predicted = m*x+b
        # using partial deriv of MSE formula, d/dm and d/db
        dm = -(2/n)*sum(x*(y-y_predicted))
        db = -(2/n)*sum((y-y_predicted))
        # use current predicted value to get the next value for prediction
        # by using learning rate
        m_iter = m_iter - learning_rate*dm
        b_iter = b_iter - learning_rate*db
        print('m is {}, b is {}, cost is {}, iteration {}'.format(m_iter, b_iter, mse, i))
        last_mse = mse

#x = np.array([1,2,3,4,5])
#y = np.array([5,7,8,10,13])
#gradient_descent(x,y)

df = pd.read_csv('Linear_Data.csv')
x = df['Area']
y = df['Price']
gradient_descent(x, y)
"My code works because I tried it with just numpy arrays as inputs. Not sure why it doesn't work with pandas."
Well no, your code also works with pandas dataframes:
df = pd.DataFrame({'Area': [1,2,3,4,5], 'Price': [5,7,8,10,13]})
x = df['Area']
y = df['Price']
gradient_descent(x,y)
The above will give you the same output as with numpy arrays.
Try checking what's in Linear_Data.csv and/or add some print statements in the gradient_descent function to check your assumptions. I would suggest first of all adding a print statement just before the condition with the break statement:
print(last_mse, mse)
if (last_mse - mse)/mse < 0.001:
    break
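If that print shows the very first mse already exceeding the initial last_mse of 10000 (which can easily happen when Area and Price are large, unscaled numbers), the break fires on iteration 0, so the print inside the loop is never reached. One common remedy, sketched below with the column names from the question, is to scale the inputs before calling the function:
import pandas as pd

df = pd.read_csv('Linear_Data.csv')
# standardise both columns so the first mse starts small and the
# learning rate of 0.05 does not blow up the updates
x = (df['Area'] - df['Area'].mean()) / df['Area'].std()
y = (df['Price'] - df['Price'].mean()) / df['Price'].std()
gradient_descent(x, y)
The fitted m and b are then in scaled units and would need to be transformed back to describe the original Area/Price data.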

Is it possible, without using parallelization (Swifter, Parallel), to do this calculation in one go without iterating over the index?

Is it possible, without using parallelization (Swifter, Parallel), to do this calculation in one go without iterating over the index, for example by using the apply function over the whole dataset?
%%time
import random
import pandas as pd

df = pd.DataFrame({'A': random.sample(range(200), 200)})
for j in range(200):
    for i in df.index:
        df.loc[i, 'A_last_{}'.format(j)] = df.loc[(df.index < i) & (df.index >= i - j), 'A'].mean()
%%time
import random
import pandas as pd

df = pd.DataFrame({'A': random.sample(range(200), 200)})
First calculate the sums.
df[1] = df['A'].shift()
for j in range(2, 200):
    df[j] = df[j-1].fillna(0) + df['A'].shift(j)
Then do the division for means and take care of the formatting
df = df.set_index('A')
df.divide(df.columns, axis=1)\
  .fillna(method='ffill', axis=1)\
  .rename(lambda x: f'A_last_{x}', axis=1)\
  .reset_index()
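For comparison, here is an alternative, untested sketch that leans on pandas' built-in rolling windows: for j >= 1, the mean of the previous j values excluding the current row is a shift(1) followed by a rolling mean (the j = 0 column from the original loop would be all NaN and is skipped here).
import random
import pandas as pd

df = pd.DataFrame({'A': random.sample(range(200), 200)})
prev = df['A'].shift(1)  # exclude the current row, as the original loop does
means = {'A_last_{}'.format(j): prev.rolling(j, min_periods=1).mean() for j in range(1, 200)}
df = pd.concat([df, pd.DataFrame(means)], axis=1)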

Pandas Timeseries: Total duration meeting a specific condition

I have a timeseries
ts = pd.Series(data=[0,1,2,3,4],index=[pd.Timestamp('1991-01-01'),pd.Timestamp('1995-01-01'),pd.Timestamp('1996-01-01'),pd.Timestamp('2010-01-01'),pd.Timestamp('2011-01-01')])
What's the fastest, most readable way to get the total duration during which the value is below 2, assuming each value is valid until the next time step indicates otherwise (no linear interpolation)? I imagine there is probably a pandas function for this.
This seems to work quite well; however, I am still baffled that there does not seem to be a pandas function for this!
import pandas as pd
import numpy as np

ts = pd.Series(data=[0, 1, 2, 3, 4],
               index=[pd.Timestamp('1991-01-01'), pd.Timestamp('1995-01-01'), pd.Timestamp('1996-01-01'),
                      pd.Timestamp('2010-01-01'), pd.Timestamp('2011-01-01')])

# making the timeseries binary. 1 = below 2, 0 = not below 2
ts = ts.where(ts >= 2, other=1)
ts = ts.where(ts < 2, other=0)

delta_time = ts.index.to_pydatetime()[1:] - ts.index.to_pydatetime()[:-1]
# note: ts.values here are ints, so the expressions below do fancy (positional)
# indexing rather than boolean masking, which is probably why this version breaks
time_below_2 = np.sum(delta_time[np.invert(ts.values[:-1])]).total_seconds()
time_above_2 = np.sum(delta_time[(ts.values[:-1])]).total_seconds()
The above function seems to break for certain timeframes. This option is slower, but did not break in any of my tests:
def get_total_duration_above_and_below_value(value, ts):
    # making the timeseries binary. 1 = above value, 0 = below value
    ts = ts.where(ts >= value, other=1)
    ts = ts.where(ts < value, other=0)
    time_above_value = 0
    time_below_value = 0
    for i in range(ts.size - 1):
        if ts[i] == 1:
            time_above_value += abs(pd.Timedelta(
                ts.index[i] - ts.index[i + 1]).total_seconds()) / 3600
        else:
            time_below_value += abs(pd.Timedelta(
                ts.index[i] - ts.index[i + 1]).total_seconds()) / 3600
    return time_above_value, time_below_value
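For what it's worth, here is a shorter vectorised take on the same idea (a sketch, using the same "a value holds until the next timestamp" assumption and the original ts from the top of the question): pair each timestamp with the duration until the next one, then sum the durations wherever the condition holds.
import pandas as pd

# duration of the interval that starts at each timestamp (NaT for the last one)
durations = ts.index.to_series().diff().shift(-1)
below = (ts < 2).values

time_below = durations[below].sum()   # a pd.Timedelta; the trailing NaT is skipped
time_above = durations[~below].sum()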

matplotlib x-axis ticks dates formatting and locations

I've tried to duplicate plotted graphs originally created with flotr2 for PDF output with matplotlib. I must say that flotr is way easier to use... but that aside, I'm currently stuck trying to format the dates/times on the x-axis to the desired format: hours:minutes with an interval of every 2 hours if the period on the x-axis is less than one day, and year-month-day format with an interval of one day if the period is longer than one day.
I've read through numerous examples and tried to copy them, but the outcome remains the same: hours:minutes:seconds with a 1 to 3 hour interval depending on how long the period is.
My code:
colorMap = {
    'speed': '#3388ff',
    'fuel': '#ffaa33',
    'din1': '#3bb200',
    'din2': '#ff3333',
    'satellites': '#bfbfff'
}
otherColors = ['#00A8F0','#C0D800','#CB4B4B','#4DA74D','#9440ED','#800080','#737CA1','#E4317F','#7D0541','#4EE2EC','#6698FF','#437C17','#7FE817','#FBB117']
plotMap = {}

import datetime
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.dates as dates

fig = plt.figure(figsize=(22, 5), dpi=300, edgecolor='k')
ax1 = fig.add_subplot(111)

realdata = data['data']
keys = realdata.keys()
if 'speed' in keys:
    speed_index = keys.index('speed')
    keys.pop(speed_index)
    keys.insert(0, 'speed')

i = 0
for key in keys:
    if key not in colorMap.keys():
        color = otherColors[i]
        otherColors.pop(i)
        colorMap[key] = color
        i += 1

label = u'%s' % realdata[keys[0]]['name']
ax1.set_ylabel(label)
plotMap[keys[0]] = {}
plotMap[keys[0]]['label'] = label

first_dates = [r[0] for r in realdata[keys[0]]['data']]
date_range = first_dates[-1] - first_dates[0]

ax1.xaxis.reset_ticks()
if date_range > datetime.timedelta(days=1):
    ax1.xaxis.set_major_locator(dates.WeekdayLocator(byweekday=1, interval=1))
    ax1.xaxis.set_major_formatter(dates.DateFormatter('%Y-%m-%d'))
else:
    ax1.xaxis.set_major_locator(dates.HourLocator(byhour=range(24), interval=2))
    ax1.xaxis.set_major_formatter(dates.DateFormatter('%H:%M'))
ax1.xaxis.grid(True)

plotMap[keys[0]]['plot'] = ax1.plot_date(
    dates.date2num(first_dates),
    [r[1] for r in realdata[keys[0]]['data']], colorMap[keys[0]], xdate=True)

if len(keys) > 1:
    first = True
    for key in keys[1:]:
        if first:
            ax2 = ax1.twinx()
            ax2.set_ylabel(u'%s' % realdata[key]['name'])
            first = False
        plotMap[key] = {}
        plotMap[key]['label'] = u'%s' % realdata[key]['name']
        plotMap[key]['plot'] = ax2.plot_date(
            dates.date2num([r[0] for r in realdata[key]['data']]),
            [r[1] for r in realdata[key]['data']], colorMap[key], xdate=True)

plt.legend([value['plot'] for key, value in plotMap.iteritems()], [value['label'] for key, value in plotMap.iteritems()], loc=2)
plt.savefig(path + "node.png", dpi=300, bbox_inches='tight')
Could someone point out why I'm not getting the desired results, please?
Edit1:
I moved the formatting block after the plotting and seem to be getting better results now. They are still not the desired results, though. If the period is less than a day I get ticks every 2 hours (interval=2), but I wish I could get those ticks at even hours, not odd hours. Is that possible?
if date_range > datetime.timedelta(days=1):
    xax.set_major_locator(dates.DayLocator(bymonthday=range(1, 32), interval=1))
    xax.set_major_formatter(dates.DateFormatter('%Y-%m-%d'))
else:
    xax.set_major_locator(dates.HourLocator(byhour=range(24), interval=2))
    xax.set_major_formatter(dates.DateFormatter('%H:%M'))
Edit2:
This seemed to give me what I wanted:
if date_range > datetime.timedelta(days=1):
    xax.set_major_locator(dates.DayLocator(bymonthday=range(1, 32), interval=1))
    xax.set_major_formatter(dates.DateFormatter('%Y-%m-%d'))
else:
    xax.set_major_locator(dates.HourLocator(byhour=range(0, 24, 2)))
    xax.set_major_formatter(dates.DateFormatter('%H:%M'))
You are making this way harder on yourself than you need to. matplotlib can directly plot against datetime objects. I suspect your problem is that you are setting up the locators, then plotting, and the plotting is replacing your locators/formatters with the default auto versions. Try moving that block of logic about the locators to below the plotting loop.
I think that this could replace a fair chunk of your code:
import datetime
import matplotlib.pyplot as plt

d = datetime.timedelta(minutes=2)
now = datetime.datetime.now()
times = [now + d * j for j in range(500)]

ax = plt.gca()                    # get the current axes
ax.plot(times, range(500))

xax = ax.get_xaxis()              # get the x-axis
adf = xax.get_major_formatter()   # get the auto-formatter

adf.scaled[1. / 24] = '%H:%M'     # set the < 1d scale to H:M
adf.scaled[1.0] = '%Y-%m-%d'      # set the > 1d < 1m scale to Y-m-d
adf.scaled[30.] = '%Y-%m'         # set the > 1m < 1Y scale to Y-m
adf.scaled[365.] = '%Y'           # set the > 1y scale to Y

plt.draw()
doc for AutoDateFormatter
I achieved what I wanted by doing this:
if date_range > datetime.timedelta(days=1):
    xax.set_major_locator(dates.DayLocator(bymonthday=range(1, 32), interval=1))
    xax.set_major_formatter(dates.DateFormatter('%Y-%m-%d'))
else:
    xax.set_major_locator(dates.HourLocator(byhour=range(0, 24, 2)))
    xax.set_major_formatter(dates.DateFormatter('%H:%M'))