Is it possible, without using parallelization (Swifter, Parallel), to perform the calculation all at once instead of iterating over the index? - pandas

Is it possible, without using parallelization (Swifter, Parallel), to perform the calculation all at once instead of iterating over the index, for example by using the "apply" function over the whole dataset?
%%time
import random
import pandas as pd

df = pd.DataFrame({'A': random.sample(range(200), 200)})
for j in range(200):
    for i in df.index:
        df.loc[i, 'A_last_{}'.format(j)] = df.loc[(df.index < i) & (df.index >= i - j), 'A'].mean()

%%time
import random
import pandas as pd

df = pd.DataFrame({'A': random.sample(range(200), 200)})
First calculate the sums.
df[1] = df['A'].shift()
for j in range(2, 200):
    df[j] = df[j-1].fillna(0) + df['A'].shift(j)
Then do the division to get the means and take care of the formatting:
df = df.set_index('A')
df.divide(df.columns, axis=1)\
  .fillna(method='ffill', axis=1)\
  .rename(lambda x: f'A_last_{x}', axis=1)\
  .reset_index()
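A shorter route (my own sketch, not part of the answer above) is to let pandas' built-in rolling windows do the work: shift(1) excludes the current row and min_periods=1 handles the shorter windows at the start of the frame, so each column should match the "mean of the previous j values" produced by the original loop.
import random
import pandas as pd

df = pd.DataFrame({'A': random.sample(range(200), 200)})
prev = df['A'].shift(1)  # previous values only, current row excluded
for j in range(1, 200):
    # mean of at most the last j previous values, i.e. the (index < i) & (index >= i - j) selection
    df['A_last_{}'.format(j)] = prev.rolling(j, min_periods=1).mean()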

Related

Sliding window method over a large range using numpy vectorization

I'm trying to implement a sliding window method for a genomics dataset that I have, over a fairly long range (upwards of 50k nucleotides). My approach so far works fine, but it is fairly slow (taking several seconds per range, and several minutes per range at intervals >150 kb). Here is my code so far:
import numpy as np
import pandas as pd

VectorizedRange = np.arange(Start, End)  # Start, End genomic flags on the reference genome
SlidingWindow = np.lib.stride_tricks.sliding_window_view(VectorizedRange, 100)  # 100 = the window size
GroupedDictFrame = pd.DataFrame({"Bins": GenomeRange})
GroupedDictFrame["ReadCov"] = 0
GroupedDictFrame["ReadSeq"] = [list() for _ in range(len(GroupedDictFrame.index.values))]
GroupedDictFrame.set_index(keys=["Bins"], inplace=True, drop=True)

def Appender(Start, End, Width, Seq):
    AvgCov = 0
    SeqList = []
    if End <= Window[-1]:
        AvgCov += 1
        SeqList.append(Seq)
    elif End > Window[-1]:
        AvgCov += (Window[-1] - Start) / Width
        SeqList.append(Seq[0:(Window[-1] - Start)])
    GroupedDictFrame.loc[Window[0], "ReadCov"] += AvgCov
    GroupedDictFrame.loc[Window[0], "ReadSeq"] = SeqList

for Window in SlidingWindow:
    SubsetBAM = BAMFrame[(
        (BAMFrame["start_coord"] >= Window[0]) &
        (BAMFrame["start_coord"] <= Window[-1])
    )].reset_index(drop=True)
    SubsetBAM.apply(
        lambda x: Appender(x.start_coord,
                           x.end_coord,
                           x.width_lis,
                           x.seq_lis), axis=1
    )
I think my vectorization isn't the best, any suggestions for speeding this up?
So I think I figured it out on my own; I'll add my solution in case anyone else faces a similar problem.
Essentially, I stopped subsetting the dataframe containing the small DNA read fragments inside the for loop; instead I did one subset before the loop and converted it to a numpy array.
I removed my function and used numpy.where to do all my logic.
import numpy as np
import pandas as pd

VectorizedRange = np.arange(Start, End)
SlidingWindow = np.lib.stride_tricks.sliding_window_view(VectorizedRange, 100)
GroupedDictFrame = pd.DataFrame({"Bins": GenomeRange})
GroupedDictFrame["ReadCov"] = 0
GroupedDictFrame["ReadSeq"] = [list() for _ in range(len(GroupedDictFrame.index.values))]
GroupedDictFrame.set_index(keys=["Bins"], inplace=True, drop=True)

CoordArray = BAMFrame.loc[:, "start_coord":"end_coord"].to_numpy()

for Window in SlidingWindow:
    ReadCovIn = np.where(((CoordArray[:, 1] <= Window[-1]) & (CoordArray[:, 0] >= Window[0])), 1, 0)
    ReadCovOut = np.where(((CoordArray[:, 1] > Window[-1]) & ((CoordArray[:, 0] >= Window[0]) & (CoordArray[:, 0] < Window[-1]))),
                          (Window[-1] - CoordArray[:, 0]) / (CoordArray[:, 1] - CoordArray[:, 0]), 0)
    GroupedDictFrame.loc[Window[0], "ReadCov"] += np.sum((np.sum(ReadCovIn), np.sum(ReadCovOut)))
I've gotten it down to ~1 second per gene region, which is typically about 50 kb (so the SlidingWindow has a shape of roughly (49900, 100)), which is pretty good I think!
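To make the pattern easier to try out, here is a minimal, self-contained sketch of the same idea with synthetic coordinates; all names and numbers below are made up for illustration and are not from the original post.
import numpy as np

rng = np.random.default_rng(0)
positions = np.arange(0, 1000)                                      # mock genomic coordinates
windows = np.lib.stride_tricks.sliding_window_view(positions, 50)   # 50 bp windows

starts = rng.integers(0, 900, size=200)                             # mock read start coordinates
ends = starts + rng.integers(10, 80, size=200)                      # mock read end coordinates
coords = np.column_stack([starts, ends])

coverage = np.zeros(len(windows))
for k, w in enumerate(windows):
    fully_inside = (coords[:, 0] >= w[0]) & (coords[:, 1] <= w[-1])   # reads contained in the window
    partial = (coords[:, 0] >= w[0]) & (coords[:, 0] < w[-1]) & (coords[:, 1] > w[-1])
    frac = np.where(partial, (w[-1] - coords[:, 0]) / (coords[:, 1] - coords[:, 0]), 0.0)
    coverage[k] = fully_inside.sum() + frac.sum()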

Remove the requirement to loop through numpy array

Overview
The code below contains a numpy array clusters with values that are compared against each row of a pandas DataFrame using np.where. The SoFunc function returns rows where all conditions are True and takes the clusters array as input.
Question
I can loop through this array to compare each array element against the respective np.where conditions. How do I remove the requirement to loop but still get the same output?
I appreciate that looping through numpy arrays is inefficient and want to improve this code. The actual dataset will be much larger.
Prepare the reproducible mock data
import numpy as np
import pandas as pd

def genMockDataFrame(days, startPrice, colName, startDate, seed=None):
    periods = days * 24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0] = 0
    P = startPrice + np.cumsum(steps)
    P = [round(i, 4) for i in P]
    fxDF = pd.DataFrame({
        'ticker': np.repeat([colName], periods),
        'date': np.tile(pd.date_range(startDate, periods=periods, freq='H'), 1),
        'price': (P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF
def SoFunc(clust):
    # generate mock data
    df = genMockDataFrame(10, 1.1904, 'eurusd', '19/3/2020', seed=157)
    df["Upper_Band"] = 1.1928
    df.loc["2020-03-27", "Upper_Band"] = 1.2118
    df.loc["2020-03-26", "Upper_Band"] = 1.2200
    df["Level"] = np.where((df["High"] >= clust)
                           & (df["Low"] <= clust)
                           & (df["High"] >= df["Upper_Band"]), 1, np.NaN
                           )
    return df.dropna()
Loop through the clusters array
clusters = np.array([1.1929, 1.2118])
l = []
for i in range(len(clusters)):
    l.append(SoFunc(clusters[i]))
pd.concat(l)
Output
              Open    High     Low   Close  Upper_Band  Level
date
2020-03-19  1.1904  1.1937  1.1832  1.1832      1.1928    1.0
2020-03-25  1.1939  1.1939  1.1864  1.1936      1.1928    1.0
2020-03-27  1.2118  1.2144  1.2039  1.2089      1.2118    1.0
(Edited based on #tdy's comment below)
pandas.merge allows you to make len(clusters) copies of your dataframe and then pare it down according to the conditions in your SoFunc function.
The cross merge creates a dataframe with a copy of df for each record in clusters_df. The overall result ought to be faster for large dataframes than the loop-based approach, provided you have enough memory to temporarily accommodate the merged dataframe (if not, the operation may spill over onto page / swap and slow down drastically).
import numpy as np
import pandas as pd

def genMockDataFrame(days, startPrice, colName, startDate, seed=None):
    ''' identical to the example provided '''
    periods = days * 24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0] = 0
    P = startPrice + np.cumsum(steps)
    P = [round(i, 4) for i in P]
    fxDF = pd.DataFrame({
        'ticker': np.repeat([colName], periods),
        'date': np.tile(pd.date_range(startDate, periods=periods, freq='H'), 1),
        'price': (P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF

# create the base dataframe according to the former SoFunc
df = genMockDataFrame(10, 1.1904, 'eurusd', '19/3/2020', seed=157)
df["Upper_Band"] = 1.1928
df.loc["2020-03-27", "Upper_Band"] = 1.2118  # .loc[row, col] avoids chained assignment
df.loc["2020-03-26", "Upper_Band"] = 1.2200

# create a df out of the cluster array
clusters = np.array([1.1929, 1.2118])
clusters_df = pd.DataFrame({"clust": clusters})

# perform the merge, then filter and finally clean up
result_df = (
    pd
    .merge(df.reset_index(), clusters_df, how="cross")  # for each entry in clusters, make a copy of df
    .loc[lambda z: (z.Low <= z.clust) & (z.High >= z.clust) & (z.High >= z.Upper_Band), :]  # filter the copies down
    .drop(columns=["clust"])  # not needed in the result
    .assign(Level=1.0)        # to match your result; not really needed
    .set_index("date")        # bring back the old index
)

print(result_df)
I recommend inspecting just the result of pd.merge(df.reset_index(), clusters_df, how="cross") to see how it works.
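For intuition about the cross merge itself, here is a tiny, throwaway example with made-up frames (assuming pandas >= 1.2, where how="cross" was introduced): every row of the left frame gets paired with every row of the right frame.
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3]})
right = pd.DataFrame({"clust": [10, 20]})

# 3 rows x 2 rows -> 6 rows: each x paired with each clust
print(pd.merge(left, right, how="cross"))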

call functions on a specific level of a numpy ndarray without for loops

Suppose a numpy ndarray arr has shape (100, 100, 5, 5).
The following code works:
result = np.zeros((arr.shape[0], arr.shape[1], 10))
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        v = arr[i, j].flatten()
        hist, bi = np.histogram(v, bins=10, range=(0, 3))
        result[i, j] = hist
but it's slow. Is there a more efficient way to write this, say by avoiding the for loops?
Hmm, I thought apply_along_axis would help, but it doesn't seem to make much of a difference, at least at the problem sizes of interest to you. Maybe there's overhead in myhist.
See the code below.
import numpy as np
import time

low = 0.0
high = 3.0
bins = 10
arrshp = (100, 100, 5, 5)

def myhist(xx):
    out = np.histogram(xx, bins=bins, range=(low, high))
    return out[0]

arr = np.random.uniform(low, high, arrshp)

time1 = time.time()
arr2 = arr.reshape(arrshp[0], arrshp[1], -1)
out_fast = np.apply_along_axis(myhist, -1, arr2)
time2 = time.time()
print('time (secs) fast = ', time2 - time1)

time3 = time.time()
out_slow = np.zeros((arr.shape[0], arr.shape[1], bins), dtype='float64')
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        v = arr[i, j].flatten()
        _hh = np.histogram(v, bins=bins, range=(low, high))
        out_slow[i, j, :] = _hh[0]
time4 = time.time()

print('norm diff = ', np.linalg.norm(out_fast - out_slow))
print('time (secs) slow = ', time4 - time3)
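Since the bins are fixed and equal-width, a fully vectorized route (my own sketch, not part of the answer above) is to compute every element's bin index once and build all the per-cell histograms with a single np.bincount; for in-range data this should agree with np.histogram up to the usual edge conventions.
import numpy as np

low, high, bins = 0.0, 3.0, 10
arr = np.random.uniform(low, high, (100, 100, 5, 5))

flat = arr.reshape(arr.shape[0] * arr.shape[1], -1)                      # (10000, 25)
# map each value to a bin index 0..bins-1 (values equal to `high` go to the last bin)
idx = np.clip(((flat - low) / (high - low) * bins).astype(int), 0, bins - 1)
# offset each row's indices so one big bincount yields per-row histograms
offsets = np.arange(flat.shape[0])[:, None] * bins
counts = np.bincount((idx + offsets).ravel(), minlength=flat.shape[0] * bins)
result = counts.reshape(arr.shape[0], arr.shape[1], bins)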

Speeding up Euclidean Distance in python [duplicate]

How do you optimize this code?
At the moment it is running too slow for the amount of data that goes through this loop. This code runs 1-nearest neighbor. It will predict the label of the training_element based on the p_data_set.
import numpy as np
from scipy.spatial import distance

# [x] , [[x1],[x2],[x3]], [l1, l2, l3]
def prediction(training_element, p_data_set, p_label_set):
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, distance.euclidean(training_element, p))
    minIndex = np.argmin(temp)
    return p_label_set[minIndex]
Use a k-D tree for fast nearest-neighbour lookups, e.g. scipy.spatial.cKDTree:
from scipy.spatial import cKDTree
# I assume that p_data_set is (nsamples, ndims)
tree = cKDTree(p_data_set)
# training_elements is also assumed to be (nsamples, ndims)
dist, idx = tree.query(training_elements, k=1)
predicted_labels = p_label_set[idx]
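As a quick, self-contained illustration (the data and names below are made up, not from the question), the same calls work end-to-end like this:
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
p_data_set = rng.random((5000, 8))            # (nsamples, ndims) reference points
p_label_set = rng.integers(0, 3, size=5000)   # one label per reference point
training_elements = rng.random((10, 8))       # points to classify

tree = cKDTree(p_data_set)
dist, idx = tree.query(training_elements, k=1)  # index of the nearest reference point for each query
predicted_labels = p_label_set[idx]
print(predicted_labels)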
You could use distance.cdist to directly get the distances temp and then use .argmin() to get min-index, like so -
minIndex = distance.cdist(training_element[None],p_data_set).argmin()
Here's an alternative approach using np.einsum -
subs = p_data_set - training_element
minIndex = np.einsum('ij,ij->i',subs,subs).argmin()
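A quick self-contained sanity check (with made-up data, added here for illustration) that the cdist and einsum one-liners pick the same nearest neighbour:
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
p_data_set = rng.random((1000, 16))
training_element = rng.random(16)

idx_cdist = distance.cdist(training_element[None], p_data_set).argmin()
subs = p_data_set - training_element
idx_einsum = np.einsum('ij,ij->i', subs, subs).argmin()
assert idx_cdist == idx_einsum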
Runtime test
Well, I was thinking cKDTree would easily beat cdist, but I guess training_element being a 1D array isn't too heavy for cdist, and I am seeing it beat cKDTree instead by a good 10x+ margin!
Here's the timing results -
In [422]: # Setup arrays
...: p_data_set = np.random.randint(0,9,(40000,100))
...: training_element = np.random.randint(0,9,(100,))
...:
In [423]: def tree_based(p_data_set,training_element): ##ali_m's soln
...: tree = cKDTree(p_data_set)
...: dist, idx = tree.query(training_element, k=1)
...: return idx
...:
...: def einsum_based(p_data_set,training_element):
...: subs = p_data_set - training_element
...: return np.einsum('ij,ij->i',subs,subs).argmin()
...:
In [424]: %timeit tree_based(p_data_set,training_element)
1 loops, best of 3: 210 ms per loop
In [425]: %timeit einsum_based(p_data_set,training_element)
100 loops, best of 3: 17.3 ms per loop
In [426]: %timeit distance.cdist(training_element[None],p_data_set).argmin()
100 loops, best of 3: 14.8 ms per loop
Python can be quite a fast programming language if used properly.
This is my suggestion (faster_prediction):
import numpy as np
import time

def euclidean(a, b):
    return np.linalg.norm(a - b)

def prediction(training_element, p_data_set, p_label_set):
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, euclidean(training_element, p))
    minIndex = np.argmin(temp)
    return p_label_set[minIndex]

def faster_prediction(training_element, p_data_set, p_label_set):
    temp = np.tile(training_element, (p_data_set.shape[0], 1))
    temp = np.sqrt(np.sum((temp - p_data_set)**2, 1))
    minIndex = np.argmin(temp)
    return p_label_set[minIndex]

training_element = [1, 2, 3]
p_data_set = np.random.rand(100000, 3) * 10
p_label_set = np.r_[0:p_data_set.shape[0]]

t1 = time.time()
result_1 = prediction(training_element, p_data_set, p_label_set)
t2 = time.time()

t3 = time.time()
result_2 = faster_prediction(training_element, p_data_set, p_label_set)
t4 = time.time()

print("Execution time 1:", t2 - t1, "value:", result_1)
print("Execution time 2:", t4 - t3, "value:", result_2)
print("Speed up:", (t2 - t1) / (t4 - t3))  # how many times faster the vectorized version is
I get the following result on pretty old laptop:
Execution time 1: 21.6033108234 value: 9819
Execution time 2: 0.0176379680634 value: 9819
Speed up: 1224.81857013
which makes me think I must have made some stupid mistake :)
In the case of very large data, where memory might be an issue, I suggest using Cython or implementing the function in C++ and wrapping it in Python.

dask how to define a custom (time fold) function that operates in parallel and returns a dataframe with a different shape

I am trying to implement a time fold function to be 'map'ed to various partitions of a dask dataframe, which in turn changes the shape of the dataframe in question (or alternatively produces a new dataframe with the altered shape). This is how far I have gotten. The result 'res' returned on compute is a list of 3 delayed objects. When I try to compute each of them in a loop (last two lines of code), this results in a "TypeError: 'DataFrame' object is not callable". After going through the examples for map_partitions, I also tried altering the input DF (in place) in the function with no return value, which causes a similar TypeError with NoneType. What am I missing?
Also, looking at the visualization (attached) I feel like there is a need for reducing the individually computed (folded) partitions into a single DF. How do I do this?
#! /usr/bin/env python
# Start dask scheduler and workers
# dask-scheduler &
# dask-worker --nthreads 1 --nprocs 6 --memory-limit 3GB localhost:8786 --local-directory /dev/shm &
from dask.distributed import Client
from dask.delayed import delayed
import pandas as pd
import numpy as np
import dask.dataframe as dd
import math

foldbucketsecs = 30
periodicitysecs = 15
secsinday = 24 * 60 * 60
chunksizesecs = 60  # 1 minute
numts = 5
start = 1525132800  # 01/05
end = 1525132800 + (3 * 60)  # 3 minutes

c = Client('127.0.0.1:8786')

def fold(df, start, bucket):
    return df

def reduce_folds(df):
    return df

def load(epoch):
    idx = []
    for ts in range(0, chunksizesecs, periodicitysecs):
        idx.append(epoch + ts)
    d = np.random.rand(chunksizesecs/periodicitysecs, numts)
    ts = []
    for i in range(0, numts):
        tsname = "ts_%s" % (i)
        ts.append(tsname)
        gts.append(tsname)
    res = pd.DataFrame(index=idx, data=d, columns=ts, dtype=np.float64)
    res.index = pd.to_datetime(arg=res.index, unit='s')
    return res

gts = []
load(start)
cols = len(gts)
idx1 = pd.DatetimeIndex(start=start, freq=('%sS' % periodicitysecs), end=start+periodicitysecs, dtype='datetime64[s]')
meta = pd.DataFrame(index=idx1[:0], data=[], columns=gts, dtype=np.float64)
dfs = [delayed(load)(fn) for fn in range(start, end, chunksizesecs)]
from_delayed = dd.from_delayed(dfs, meta, 'sorted')
nfolds = int(math.ceil((end - start)/foldbucketsecs))
cprime = nfolds * cols
gtsnew = []
for i in range(0, cprime):
    gtsnew.append("ts_%s,fold=%s" % (i % cols, i / cols))
idx2 = pd.DatetimeIndex(start=start, freq=('%sS' % periodicitysecs), end=start+foldbucketsecs, dtype='datetime64[s]')
meta = pd.DataFrame(index=idx2[:0], data=[], columns=gtsnew, dtype=np.float64)
folded_df = from_delayed.map_partitions(delayed(fold)(from_delayed, start, foldbucketsecs), meta=meta)
result = c.submit(reduce_folds, folded_df)
c.gather(result).visualize(filename='/usr/share/nginx/html/svg/df4.svg')
res = c.gather(result).compute()
for f in res:
    f.compute()
Never mind! It was my fault; instead of wrapping my function in delayed, I simply passed it to the map_partitions call like so, and it worked.
folded_df = from_delayed.map_partitions(fold, start, foldbucketsecs, nfolds, meta=meta)
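For anyone hitting the same wall, here is a minimal, self-contained sketch (my own toy example, not the code above) of map_partitions with a function that returns a differently shaped dataframe per partition; meta describes the output columns and dtypes.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(10), "y": range(10, 20)})
ddf = dd.from_pandas(pdf, npartitions=2)

def summarize(part):
    # one row per partition, i.e. a different shape than the input partition
    return pd.DataFrame({"x_sum": [part.x.sum()], "y_mean": [part.y.mean()]})

meta = pd.DataFrame({"x_sum": pd.Series(dtype="int64"),
                     "y_mean": pd.Series(dtype="float64")})

result = ddf.map_partitions(summarize, meta=meta).compute()
print(result)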