Numpy: regrid by averaging?

I'm trying to regrid a numpy array onto a new grid. In this specific case, I'm trying to regrid a power spectrum onto a logarithmic grid so that the data are evenly spaced logarithmically for plotting purposes.
Doing this with straight interpolation using np.interp results in some of the original data being ignored entirely. Using digitize gets the result I want, but I have to use some ugly loops to get it to work:
xfreq = np.fft.fftfreq(100)[1:50] # only positive, nonzero freqs
psw = np.arange(xfreq.size) # dummy array for MWE
# new logarithmic grid
logfreq = np.logspace(np.log10(np.min(xfreq)), np.log10(np.max(xfreq)), 100)
inds = np.digitize(xfreq,logfreq)
# interpolation: ignores data *but* populates all points
logpsw = np.interp(logfreq, xfreq, psw)
# so average down where available...
logpsw[np.unique(inds)] = [psw[inds==i].mean() for i in np.unique(inds)]
# the new plot
loglog(logfreq, logpsw, linewidth=0.5, color='k')
Is there a nicer way to accomplish this in numpy? I'd be satisfied with just a replacement of the inline loop step.

You can use bincount() twice to calculate the average value of each bin:
logpsw2 = np.interp(logfreq, xfreq, psw)
counts = np.bincount(inds)
mask = counts != 0
logpsw2[mask] = np.bincount(inds, psw)[mask] / counts[mask]
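(As a reminder, np.bincount with a weights argument returns per-bin sums, so dividing by the plain counts gives per-bin means; a tiny illustration:)
np.bincount([0, 0, 1], weights=[2., 4., 3.])   # array([6., 3.]) -- per-bin sums
np.bincount([0, 0, 1])                         # array([2, 1])   -- per-bin counts
# elementwise division gives the per-bin means: array([3., 3.])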
or use unique(inds, return_inverse=True) and bincount() twice:
logpsw4 = np.interp(logfreq, xfreq, psw)
uinds, inv_index = np.unique(inds, return_inverse=True)
logpsw4[uinds] = np.bincount(inv_index, psw) / np.bincount(inv_index)
Or if you use Pandas:
import pandas as pd
logpsw4 = np.interp(logfreq, xfreq, psw)
s = pd.Series(psw).groupby(inds).mean()
logpsw4[s.index] = s.values
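For what it's worth, a quick sanity check (just a sketch) that the bincount approach reproduces the loop-based averages from the question:
logpsw_loop = np.interp(logfreq, xfreq, psw)
logpsw_loop[np.unique(inds)] = [psw[inds == i].mean() for i in np.unique(inds)]
assert np.allclose(logpsw_loop, logpsw2)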

Passing pandas DataFrame data to functions and it's not outputting the results

In my code, I am trying to extract data from a csv file to use in the function, but it doesn't output anything and gives no error. My code works when I pass plain numpy arrays as inputs; I'm not sure why it doesn't work with pandas.
import numpy as np
import pandas as pd
import os
# change the current directory to the directory where the running script file is
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# finding best fit line for y=mx+b by iteration
def gradient_descent(x,y):
    m_iter = b_iter = 1 #starting point
    iteration = 10000
    n = len(x)
    learning_rate = 0.05
    last_mse = 10000
    #take baby steps to reach global minima
    for i in range(iteration):
        y_predicted = m_iter*x + b_iter
        #mse = 1/n*sum([value**2 for value in (y-y_predicted)]) # cost function to minimize
        mse = 1/n*sum((y-y_predicted)**2) # cost function to minimize
        if (last_mse - mse)/mse < 0.001:
            break
        # recall MSE formula is 1/n*sum((yi-y_predicted)^2), where y_predicted = m*x+b
        # using partial deriv of MSE formula, d/dm and d/db
        dm = -(2/n)*sum(x*(y-y_predicted))
        db = -(2/n)*sum((y-y_predicted))
        # use current predicted value to get the next value for prediction
        # by using learning rate
        m_iter = m_iter - learning_rate*dm
        b_iter = b_iter - learning_rate*db
        print('m is {}, b is {}, cost is {}, iteration {}'.format(m_iter,b_iter,mse,i))
        last_mse = mse
#x = np.array([1,2,3,4,5])
#y = np.array([5,7,8,10,13])
#gradient_descent(x,y)
df = pd.read_csv('Linear_Data.csv')
x = df['Area']
y = df['Price']
gradient_descent(x,y)
My code works when I pass plain numpy arrays as inputs, so I'm not sure why it doesn't work with pandas.
Well no, your code also works with pandas dataframes:
df = pd.DataFrame({'Area': [1,2,3,4,5], 'Price': [5,7,8,10,13]})
x = df['Area']
y = df['Price']
gradient_descent(x,y)
The above will give you the same output as with numpy arrays.
Try checking what's in Linear_Data.csv and/or add some print statements in the gradient_descent function just to check your assumptions. I would suggest first of all adding a print statement before the condition with the break statement:
print(last_mse, mse)
if (last_mse - mse)/mse < 0.001:
    break
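For instance, here are a few quick checks on what was actually read from the CSV (a sketch; the column names are taken from the question):
print(df.dtypes)        # are 'Area' and 'Price' numeric?
print(df.head())        # does the data look as expected?
print(df.isna().sum())  # any missing values that would break the arithmetic?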

Idiomatic tensorflow expression of for-loop which considers value of preceding iteration

It might not even be possible, but I would like to express the following code without the for-loop.
tf.scan is prohibitively slow and therefore not a good solution.
I am perfectly happy to accept any answer which gives a solution or an argument why this is not possible.
import tensorflow as tf
import matplotlib.pyplot as plt
# Some data
random_series = tf.reshape(tf.math.cumsum(tf.random.normal([100])),[1,-1])
# a mesh_grid of "co-distance"
random_mesh_gain = 1 - tf.matmul(random_series,tf.math.reciprocal(random_series), True, False)
# lower triangular matrix
random_tri = tf.linalg.band_part(random_mesh_gain, 0, -1)
# some lambda working on a series-type data
drop = 0.5
interval = 0.5  # note: not defined in the original snippet; value assumed here so the example runs
lambda_map = lambda series: tf.math.floor(series/interval)*interval - drop
# apply map
random_img = tf.map_fn(lambda_map, random_tri)
# init preceding and output list sl_full
preceding = - drop * tf.ones(tf.transpose(random_img).shape[0],dtype=tf.float32)
sl_full = []
for a in tf.transpose(random_img):
    prec_a_max = tf.reduce_max(tf.stack([preceding, a]), axis=0)
    preceding = prec_a_max
    sl_full.append(prec_a_max)
# create tensor from list
sl_full = tf.transpose(tf.stack(sl_full))
plt.imshow(sl_full,origin="lower")

discrete numpy array to continuous array

I have some discrete data in an array, such that:
arr = np.array([[1,1,1],[2,2,2],[3,3,3],[2,2,2],[1,1,1]])
whose plot looks like a series of flat, discrete steps (plot omitted here).
I also have an index array, such that each unique value in arr is associated with a unique index value, like:
ind = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5]])
What is the most pythonic way of converting arr from discrete values to continuous values, so that the data ramp smoothly between the discrete points when plotted, i.e. interpolating between the discrete points to make continuous data?
I found a solution to this, if anyone has a similar issue. It is maybe not the most elegant, so modifications are welcome:
def ref_linear_interp(x, y):
    arr = []
    ux = np.unique(x)  # unique x values
    for u in ux:
        idx = y[x==u]
        try:
            min = y[x==u-1][0]
            max = y[x==u][0]
        except:
            min = y[x==u][0]
            max = y[x==u][0]
        try:
            min = y[x==u][0]
            max = y[x==u+1][0]
        except:
            min = y[x==u][0]
            max = y[x==u][0]
        if min==max:
            sub = np.full((len(idx)), min)
            arr.append(sub)
        else:
            sub = np.linspace(min, max, len(idx))
            arr.append(sub)
    return np.concatenate(arr, axis=None).ravel()
y = np.array([[1,1,1],[2,2,2],[3,3,3],[2,2,2],[1,1,1]])
x = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5]])
z = np.arange(1, 16, 1)
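A brief usage sketch with the arrays above (the plotting call is my assumption, not part of the original):
import matplotlib.pyplot as plt
y_cont = ref_linear_interp(x, y)   # one interpolated value per original sample
plt.plot(z, y_cont, marker='*')
plt.show()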
Here is an answer for the symmetric solution that I would expect when reading the question:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# create the data as described
numbers = [1,2,3,2,1]
nblock = 3
df = pd.DataFrame({
    "x": np.arange(nblock*len(numbers)),
    "y": np.repeat(numbers, nblock),
    "label": np.repeat(np.arange(len(numbers)), nblock)
})
Expecting a constant block size of 3, we could use a rolling window:
df['y-smooth'] = df['y'].rolling(nblock, center=True).mean()
# fill NaNs
df['y-smooth'].bfill(inplace=True)
df['y-smooth'].ffill(inplace=True)
plt.plot(df['x'], df['y-smooth'], marker='*')
If the block size is allowed to vary, we could determine the block centers and interpolate piecewise.
centers = df[['x', 'y', 'label']].groupby('label').mean()
df['y-interp'] = np.interp(df['x'], centers['x'], centers['y'])
plt.plot(df['x'], df['y-interp'], marker='*')
Note: You may also try
centers = df[['x', 'y', 'label']].groupby('label').min() to select the left corner of the labelled blocks.

how to make a memory efficient multiple dimension groupby/stack using xarray?

I have a large time series of np.float64 with a 5-min frequency (size is ~2,500,000 ~=24 years).
I'm using Xarray to represent it in-memory and the time-dimension is named 'time'.
I want to group-by 'time.hour' and then 'time.dayofyear' (or vice-versa) and remove both their mean from the time-series.
In order to do that efficiently, I need to reorder the time-series into a new xr.DataArray with the dimensions of ['hour', 'dayofyear', 'rest'].
I wrote a function that plays with the GroupBy objects of Xarray and manages to do just that although it takes a lot of memory to do that...
I have a machine with 32GB RAM and I still get the MemoryError from numpy.
I know the code works because I used it on an hourly re-sampled version of my original time-series. So here's the code:
def time_series_stack(time_da, time_dim='time', grp1='hour', grp2='dayofyear'):
    """Takes a time-series xr.DataArray object and reshapes it using
    grp1 and grp2. Output is an xr.Dataset that includes the reshaped
    DataArray, its datetime-series and the grps."""
    import xarray as xr
    import numpy as np
    import pandas as pd
    # try to infer the freq and put it into attrs for later reconstruction:
    freq = pd.infer_freq(time_da[time_dim].values)
    name = time_da.name
    time_da.attrs['freq'] = freq
    attrs = time_da.attrs
    # drop all NaNs:
    time_da = time_da.dropna(time_dim)
    # group grp1 and concat:
    grp_obj1 = time_da.groupby(time_dim + '.' + grp1)
    s_list = []
    for grp_name, grp_inds in grp_obj1.groups.items():
        da = time_da.isel({time_dim: grp_inds})
        s_list.append(da)
    grps1 = [x for x in grp_obj1.groups.keys()]
    stacked_da = xr.concat(s_list, dim=grp1)
    stacked_da[grp1] = grps1
    # group over the concatenated da and concat again:
    grp_obj2 = stacked_da.groupby(time_dim + '.' + grp2)
    s_list = []
    for grp_name, grp_inds in grp_obj2.groups.items():
        da = stacked_da.isel({time_dim: grp_inds})
        s_list.append(da)
    grps2 = [x for x in grp_obj2.groups.keys()]
    stacked_da = xr.concat(s_list, dim=grp2)
    stacked_da[grp2] = grps2
    # numpy part:
    # first, loop over both dims and drop NaNs, append values and datetimes:
    vals = []
    dts = []
    for i, grp1_val in enumerate(stacked_da[grp1]):
        da = stacked_da.sel({grp1: grp1_val})
        for j, grp2_val in enumerate(da[grp2]):
            val = da.sel({grp2: grp2_val}).dropna(time_dim)
            vals.append(val.values)
            dts.append(val[time_dim].values)
    # second, we get the max of the vals after the second groupby:
    max_size = max([len(x) for x in vals])
    # we fill NaNs and NaT for the remainder of them:
    concat_sizes = [max_size - len(x) for x in vals]
    concat_arrys = [np.empty((x)) * np.nan for x in concat_sizes]
    concat_vals = [np.concatenate(x) for x in list(zip(vals, concat_arrys))]
    # 1970-01-01 is the NaT for this time-series:
    concat_arrys = [np.zeros((x), dtype='datetime64[ns]')
                    for x in concat_sizes]
    concat_dts = [np.concatenate(x) for x in list(zip(dts, concat_arrys))]
    concat_vals = np.array(concat_vals)
    concat_dts = np.array(concat_dts)
    # finally, we reshape them:
    concat_vals = concat_vals.reshape((stacked_da[grp1].shape[0],
                                       stacked_da[grp2].shape[0],
                                       max_size))
    concat_dts = concat_dts.reshape((stacked_da[grp1].shape[0],
                                     stacked_da[grp2].shape[0],
                                     max_size))
    # create a Dataset and DataArrays for them:
    sda = xr.Dataset()
    sda.attrs = attrs
    sda[name] = xr.DataArray(concat_vals, dims=[grp1, grp2, 'rest'])
    sda[time_dim] = xr.DataArray(concat_dts, dims=[grp1, grp2, 'rest'])
    sda[grp1] = grps1
    sda[grp2] = grps2
    sda['rest'] = range(max_size)
    return sda
So for the 2,500,000-item time-series, numpy throws the MemoryError, so I'm guessing this is my memory bottleneck. What can I do to solve this?
Would Dask help me? And if so, how can I implement it?
Like you, I ran it without issue when inputting a small time series (10,000 long). However, when inputting a 100,000-long time series xr.DataArray, the grp_obj2 for loop ran away and used all the memory of the system.
This is what I used to generate the time series xr.DataArray:
import numpy as np
import xarray as xr

n = 10**5
times = np.datetime64('2000-01-01') + np.arange(n) * np.timedelta64(5,'m')
data = np.random.randn(n)
time_da = xr.DataArray(data, name='rand_data', dims=('time'), coords={'time': times})
# time_da.to_netcdf('rand_time_series.nc')
As you point out, Dask would be a way to solve it but I can't see a clear path at the moment...
Typically, the way to approach this kind of problem with Dask would be to:
1. Make the input a dataset read from a file (like NetCDF). This will not load the file in memory but will allow Dask to pull data from disk one chunk at a time.
2. Define all calculations with dask.delayed or dask.futures methods for the entire body of code, up until writing the output. This is what allows Dask to read a small chunk of data, process it, then write it out.
3. Calculate one chunk of work and immediately write the output to a new dataset file. Effectively you end up streaming one chunk of input to one chunk of output at a time (but also threaded/parallelized).
I tried importing Dask and breaking the input time_da xr.DataArray into chunks for Dask to work on but it didn't help. From what I can tell, the line stacked_da = xr.concat(s_list, dim=grp1) forces Dask to make a full copy of stacked_da in memory and much more...
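For reference, a rough sketch of what that chunked-input attempt looked like (the chunk size here is just an assumption):
import xarray as xr
# open the series lazily so Dask pulls it from disk chunk by chunk
time_da = xr.open_dataarray('rand_time_series.nc', chunks={'time': 100000})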
One workaround to this is to write stacked_da to disk then immediately read it again:
##For group1
xr.concat(s_list, dim=grp1).to_netcdf('stacked_da1.nc')
stacked_da = xr.load_dataset('stacked_da1.nc')
stacked_da[grp1] = grps1
##For group2
xr.concat(s_list, dim=grp2).to_netcdf('stacked_da2.nc')
stacked_da = xr.load_dataset('stacked_da2.nc')
stacked_da[grp2] = grps2
However, the file size for stacked_da1.nc is 19MB and stacked_da2.nc gets huge at 6.5GB. This is for time_da with 100,000 elements... so there's clearly something amiss...
Originally, it sounded like you want to subtract the mean of the groups from the time-series data. It looks like the Xarray docs have an example for that: http://xarray.pydata.org/en/stable/groupby.html#grouped-arithmetic
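For reference, a minimal sketch of that grouped-arithmetic pattern, assuming the time_da defined above (this subtracts the group means directly, without any reshaping):
# remove the hourly mean, then the day-of-year mean, from the original series
hourly_anom = time_da.groupby('time.hour') - time_da.groupby('time.hour').mean('time')
anomalies = hourly_anom.groupby('time.dayofyear') - hourly_anom.groupby('time.dayofyear').mean('time')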
The key is to group once, loop over the groups, then group again within each of those groups and append the results to a list.
Next I concat and use pd.MultiIndex.from_product for the groups.
No memory problems and no Dask needed, and it only takes a few seconds to run.
Here's the code, enjoy:
def time_series_stack(time_da, time_dim='time', grp1='hour', grp2='month',
                      plot=True):
    """Takes a time-series xr.DataArray object and reshapes it using
    grp1 and grp2. Output is an xr.Dataset that includes the reshaped
    DataArray, its datetime-series and the grps. Plots the mean also."""
    import xarray as xr
    import pandas as pd
    # try to infer the freq and put it into attrs for later reconstruction:
    freq = pd.infer_freq(time_da[time_dim].values)
    name = time_da.name
    time_da.attrs['freq'] = freq
    attrs = time_da.attrs
    # drop all NaNs:
    time_da = time_da.dropna(time_dim)
    # first grouping:
    grp_obj1 = time_da.groupby(time_dim + '.' + grp1)
    da_list = []
    t_list = []
    for grp1_name, grp1_inds in grp_obj1.groups.items():
        da = time_da.isel({time_dim: grp1_inds})
        # second grouping:
        grp_obj2 = da.groupby(time_dim + '.' + grp2)
        for grp2_name, grp2_inds in grp_obj2.groups.items():
            da2 = da.isel({time_dim: grp2_inds})
            # extract datetimes and rewrite time coord to 'rest':
            times = da2[time_dim]
            times = times.rename({time_dim: 'rest'})
            times.coords['rest'] = range(len(times))
            t_list.append(times)
            da2 = da2.rename({time_dim: 'rest'})
            da2.coords['rest'] = range(len(da2))
            da_list.append(da2)
    # get group keys:
    grps1 = [x for x in grp_obj1.groups.keys()]
    grps2 = [x for x in grp_obj2.groups.keys()]
    # concat and convert to dataset:
    stacked_ds = xr.concat(da_list, dim='all').to_dataset(name=name)
    stacked_ds[time_dim] = xr.concat(t_list, 'all')
    # create a multiindex for the groups:
    mindex = pd.MultiIndex.from_product([grps1, grps2], names=[grp1, grp2])
    stacked_ds.coords['all'] = mindex
    # unstack:
    ds = stacked_ds.unstack('all')
    ds.attrs = attrs
    return ds
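A brief usage sketch, assuming the synthetic time_da generated earlier in this thread:
ds = time_series_stack(time_da, time_dim='time', grp1='hour', grp2='month')
print(ds)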

In numpy, what is the efficient way to find the maximum values and their indices of a 3D ndarray across two axis?

How to find the correlation-peak values and coordinates of a set of 2D cross-correlation functions?
Given a 3D ndarray that contains a set of 2D cross-correlation functions, what is the efficient way to find the maximum (peak) values and their coordinates (x and y indices)?
The code below does the work, but I think it is inefficient.
import numpy as np
import numpy.matlib
ccorr = np.random.rand(7,5,5)
xind = ccorr.argmax(axis=-1)
mccorr = ccorr[np.matlib.repmat(np.arange(0,7)[:,np.newaxis],1,5),np.matlib.repmat(np.arange(0,5)[np.newaxis,:],7,1), xind]
yind = mccorr.argmax(axis=-1)
xind = xind[np.arange(0,7),yind]
values = mccorr[np.arange(0,7),yind]
print("cross-correlation functions (z,y,x)")
print(ccorr)
print("x and y indices of the maximum values")
print(xind,yind)
print("Maximum values")
print(values)
You'll want to flatten the dimensions you're searching over and then use unravel_index and take_along_axis to get the coordinates and values, respectively.
ccorr = np.random.rand(7,5,5)
cc_rav = ccorr.reshape(ccorr.shape[0], -1)
idx = np.argmax(cc_rav, axis = -1)
indices_2d = np.unravel_index(idx, ccorr.shape[1:])
vals = np.take_along_axis(ccorr, indices = indices_2d, axis = 0)
If you're using a numpy version < 1.15:
vals = cc_rav[np.arange(ccorr.shape[0]), idx]
or:
vals = ccorr[np.arange(ccorr.shape[0]),
             indices_2d[0], indices_2d[1]]
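A quick sanity check of the flattened-argmax approach (a sketch reusing the variables above):
# the per-slice peak values should match a plain max over both trailing axes
assert np.allclose(ccorr[np.arange(ccorr.shape[0]), indices_2d[0], indices_2d[1]],
                   ccorr.max(axis=(1, 2)))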