Xarray can compute a weighted rolling mean via the .construct() method, as stated in an answer on SO and also in the docs.
The weighted rolling mean example in the docs doesn't quite look right as it seems to give the same answer as the ordinary rolling mean.
import xarray as xr
import numpy as np
arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
                   dims=('x', 'y'))
arr.rolling(y=3, center=True).mean()
#<xarray.DataArray (x: 3, y: 5)>
#array([[nan, 0.5, 1. , 1.5, nan],
# [nan, 3. , 3.5, 4. , nan],
# [nan, 5.5, 6. , 6.5, nan]])
#Dimensions without coordinates: x, y
weight = xr.DataArray([0.25, 0.5, 0.25], dims=['window'])
arr.rolling(y=3, center=True).construct('window').dot(weight)
#<xarray.DataArray (x: 3, y: 5)>
#array([[nan, 0.5, 1. , 1.5, nan],
# [nan, 3. , 3.5, 4. , nan],
# [nan, 5.5, 6. , 6.5, nan]])
#Dimensions without coordinates: x, y
Here is a simpler example that I would like to get the syntax right on:
da = xr.DataArray(np.arange(1,6), dims='x')
da.rolling(x=3, center=True).mean()
#<xarray.DataArray (x: 5)>
#array([nan, 2., 3., 4., nan])
#Dimensions without coordinates: x
weight = xr.DataArray([0.5, 1, 0.5], dims=['window'])
da.rolling(x=3, center=True).construct('window').dot(weight)
#<xarray.DataArray (x: 5)>
#array([nan, 4., 6., 8., nan])
#Dimensions without coordinates: x
It returns 4, 6, 8. I thought it would do:
((1 x 0.5) + (2 x 1) + (3 x 0.5)) / 3 = 4/3
((2 x 0.5) + (3 x 1) + (4 x 0.5)) / 3 = 2
((3 x 0.5) + (4 x 1) + (5 x 0.5)) / 3 = 8/3
1.33, 2.0, 2.67
In the first example, you use evenly spaced data for arr.
Therefore, the weighted mean (with [0.25, 0.5, 0.25]) will be the same as the simple mean.
If you consider non-linear data, the results differ:
In [50]: arr = xr.DataArray((np.arange(0, 7.5, 0.5)**2).reshape(3, 5),
...: dims=('x', 'y'))
...:
In [51]: arr.rolling(y=3, center=True).mean()
Out[51]:
<xarray.DataArray (x: 3, y: 5)>
array([[ nan, 0.416667, 1.166667, 2.416667, nan],
[ nan, 9.166667, 12.416667, 16.166667, nan],
[ nan, 30.416667, 36.166667, 42.416667, nan]])
Dimensions without coordinates: x, y
In [52]: weight = xr.DataArray([0.25, 0.5, 0.25], dims=['window'])
...: arr.rolling(y=3, center=True).construct('window').dot(weight)
...:
Out[52]:
<xarray.DataArray (x: 3, y: 5)>
array([[ nan, 0.375, 1.125, 2.375, nan],
[ nan, 9.125, 12.375, 16.125, nan],
[ nan, 30.375, 36.125, 42.375, nan]])
Dimensions without coordinates: x, y
For the second example, you use [0.5, 1, 0.5] as weight, the total of which is 2.
Therefore, the first non-nan item will be
(1 x 0.5) + (2 x 1) + (3 x 0.5) = 4
If you want weighted mean, rather than the weighted sum, use [0.25, 0.5, 0.25] instead.
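To make the sum-versus-mean distinction concrete, here is a minimal NumPy-only sketch (not the xarray API itself) that forms the same centered length-3 windows by hand and compares the weighted sum with the weighted mean. Dividing the weights by their total is the only difference; in xarray the analogous step would be passing weight / weight.sum() to .dot(). The variable names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

values = np.arange(1, 6, dtype=float)      # same data as `da` above: 1..5
weights = np.array([0.5, 1.0, 0.5])        # sums to 2, so a plain dot gives a weighted *sum*

windows = sliding_window_view(values, 3)   # rows: [1,2,3], [2,3,4], [3,4,5]

weighted_sum = windows @ weights                       # the 4, 6, 8 reported in the question
weighted_mean = windows @ (weights / weights.sum())    # normalized weights -> weighted mean

print(weighted_sum)
print(weighted_mean)
```

Because this data is evenly spaced, the weighted mean coincides with the ordinary rolling mean (2, 3, 4), just as in the first example.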
Extract data from the given SalaryGender CSV file and store the data from each column in a separate NumPy array
SalaryGender.csv sample data
Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1
Use DataFrame.groupby,
which will create a list where each element is a NumPy array holding one column:
[group.values for i, group in df.groupby(level=0, axis=1)]
If you aren't looking for a list, then use:
for i, group in df.groupby(level=0, axis=1):
    print(group.values)
.....
You can also use DataFrame.items (called DataFrame.iteritems before pandas 2.0, where it was removed):
for i, col in df.items():
    print(col.to_numpy())
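For the original task (one NumPy array per column), a minimal sketch along these lines may be enough. It assumes the file is read with pandas.read_csv; the inline sample text stands in for SalaryGender.csv, and the dictionary name `columns` is just illustrative.

```python
import io
import pandas as pd

# Inline copy of the sample rows; in practice pass the path to SalaryGender.csv instead.
sample = io.StringIO("""Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1""")

df = pd.read_csv(sample)

# One NumPy array per column, keyed by the column name.
columns = {name: df[name].to_numpy() for name in df.columns}

salary = columns["Salary"]   # float array
gender = columns["Gender"]   # integer array
print(salary, gender)
```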
In [199]: txt = """Salary,Gender,Age,PhD
...: 140,1,47,1
...: 30,0,65,1
...: 35.1,0,56,0
...: 30,1,23,0
...: 80,0,53,1"""
We can load your sample as a structured array:
In [203]: data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding=None, names=True)
In [204]: data
Out[204]:
array([(140. , 1, 47, 1), ( 30. , 0, 65, 1), ( 35.1, 0, 56, 0),
( 30. , 1, 23, 0), ( 80. , 0, 53, 1)],
dtype=[('Salary', '<f8'), ('Gender', '<i8'), ('Age', '<i8'), ('PhD', '<i8')])
Each element of the array is a row of the file; field names come from the header line, field dtype is deduced from the data.
Fields can be accessed by name:
In [205]: data['Salary']
Out[205]: array([140. , 30. , 35.1, 30. , 80. ])
In [206]: data['Gender']
Out[206]: array([1, 0, 0, 1, 0])
They can be accessed that way, or assigned to a variable:
salary = data['Salary']
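If every column should end up in its own variable or dictionary entry, the field names of the structured array can be looped over. A small sketch, assuming the same sample text as above (the name `columns` is illustrative):

```python
import numpy as np

txt = """Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1"""

data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',',
                     encoding=None, names=True)

# data.dtype.names holds the header fields; indexing by name returns a 1-D array.
columns = {name: data[name] for name in data.dtype.names}

print(columns['Age'])
```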
You can also use unpack:
In [213]: a,b,c,d = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None, skip_header=1, unpack=True)
In [214]: a
Out[214]: array([140. , 30. , 35.1, 30. , 80. ])
In [215]: b
Out[215]: array([1., 0., 0., 1., 0.])
In [216]: c
Out[216]: array([47., 65., 56., 23., 53.])
In [217]: d
Out[217]: array([1., 1., 0., 0., 1.])
Sometimes it's simpler to load the file one (or selected) column at a time:
In [218]: b = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None, skip_header=1, usecols=[1])
In [219]: b
Out[219]: array([1., 0., 0., 1., 0.])
Please try this:
SG[SG.columns].values
where SG is the DataFrame loaded from your file. The code above gives you the values of all columns as a single 2-D array in one go.
Assume I have a matrix P of size [4, 4] which is partitioned into 4 smaller [2, 2] blocks. How can I efficiently multiply this block matrix by another, smaller (non-partitioned) matrix?
Let's assume our original matrix is:
P = [ 1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4]
which is split into submatrices:
P_1 = [1 1    P_2 = [2 2    P_3 = [3 3    P_4 = [4 4
       1 1]          2 2]          3 3]          4 4]
Now our P is:
P = [P_1 P_2
     P_3 P_4]
In the next step, I want to do element-wise multiplication between P and a smaller matrix whose size equals the number of sub-matrices:
P * [1 0   = [P_1 0   = [1 1 0 0
     0 0]     0   0]     1 1 0 0
                         0 0 0 0
                         0 0 0 0]
You can think of representing your large block matrix in a more efficient way.
For instance, a block matrix
P = [ 1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4]
can be represented using
a = [ 1 0      b = [ 1 1 0 0      p = [ 1 2
      1 0            0 0 1 1 ]          3 4 ]
      0 1
      0 1 ]
as
P = a @ p @ b
(with @ representing matrix multiplication). The matrices a and b encode the block structure of P, and the small p holds the value of each block.
Now, if you want to multiply p element-wise with a small (2x2) matrix q, you simply compute
a @ (p * q) @ b
A simple pytorch example
In [1]: a = torch.tensor([[1., 0], [1., 0], [0., 1], [0, 1]])
In [2]: b = torch.tensor([[1., 1., 0, 0], [0, 0, 1., 1]])
In [3]: p=torch.tensor([[1., 2.], [3., 4.]])
In [4]: q = torch.tensor([[1., 0], [0., 0]])
In [5]: a @ p @ b
Out[5]:
tensor([[1., 1., 2., 2.],
[1., 1., 2., 2.],
[3., 3., 4., 4.],
[3., 3., 4., 4.]])
In [6]: a @ (p*q) @ b
Out[6]:
tensor([[1., 1., 0., 0.],
[1., 1., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
I leave it to you as an exercise how to efficiently produce the "structure" matrices a and b given the sizes of the blocks.
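One possible way to build the structure matrices, shown here as a NumPy sketch rather than the answerer's own solution: repeating the rows (or columns) of an identity matrix gives a and b for uniform block sizes. The helper name `structure_matrices` and its parameters are made up for illustration.

```python
import numpy as np

def structure_matrices(n_block_rows, n_block_cols, block_h, block_w):
    """Return (a, b) such that a @ p @ b expands p into uniform blocks."""
    a = np.repeat(np.eye(n_block_rows), block_h, axis=0)   # (n_block_rows*block_h, n_block_rows)
    b = np.repeat(np.eye(n_block_cols), block_w, axis=1)   # (n_block_cols, n_block_cols*block_w)
    return a, b

p = np.array([[1., 2.], [3., 4.]])
q = np.array([[1., 0.], [0., 0.]])

a, b = structure_matrices(2, 2, 2, 2)
P = a @ p @ b                 # the 4x4 block matrix from the question
masked = a @ (p * q) @ b      # only the P_1 block survives

print(P)
print(masked)
```

For uniform blocks, np.kron(p * q, np.ones((block_h, block_w))) produces the same result without the explicit a and b.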
Following is a general Tensorflow-based solution that works for input matrices p (large) and m (small) of arbitrary shapes as long as the sizes of p are divisible by the sizes of m on both axes.
def block_mul(p, m):
    p_x, p_y = p.shape
    m_x, m_y = m.shape
    m_4d = tf.reshape(m, (m_x, 1, m_y, 1))
    m_broadcasted = tf.broadcast_to(m_4d, (m_x, p_x // m_x, m_y, p_y // m_y))
    mp = tf.reshape(m_broadcasted, (p_x, p_y))
    return p * mp
Test:
import tensorflow as tf
tf.enable_eager_execution()
p = tf.reshape(tf.constant(range(36)), (6, 6))
m = tf.reshape(tf.constant(range(9)), (3, 3))
print(f"p:\n{p}\n")
print(f"m:\n{m}\n")
print(f"block_mul(p, m):\n{block_mul(p, m)}")
Output (Python 3.7.3, Tensorflow 1.13.1):
p:
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]
[30 31 32 33 34 35]]
m:
[[0 1 2]
[3 4 5]
[6 7 8]]
block_mul(p, m):
[[ 0 0 2 3 8 10]
[ 0 0 8 9 20 22]
[ 36 39 56 60 80 85]
[ 54 57 80 84 110 115]
[144 150 182 189 224 232]
[180 186 224 231 272 280]]
Another solution that uses implicit broadcasting is the following:
def block_mul2(p, m):
    p_x, p_y = p.shape
    m_x, m_y = m.shape
    p_4d = tf.reshape(p, (m_x, p_x // m_x, m_y, p_y // m_y))
    m_4d = tf.reshape(m, (m_x, 1, m_y, 1))
    return tf.reshape(p_4d * m_4d, (p_x, p_y))
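The same reshape-and-broadcast idea carries over to NumPy. This is a sketch under the same divisibility assumption, not a drop-in replacement for the TensorFlow version; `block_mul_np` is a hypothetical name.

```python
import numpy as np

def block_mul_np(p, m):
    """Multiply each block of p by the matching scalar in m."""
    p_x, p_y = p.shape
    m_x, m_y = m.shape
    # Split each axis of p into (block index, offset inside block).
    p_4d = p.reshape(m_x, p_x // m_x, m_y, p_y // m_y)
    # m broadcasts over the two "offset" axes.
    return (p_4d * m[:, None, :, None]).reshape(p_x, p_y)

p = np.arange(36).reshape(6, 6)
m = np.arange(9).reshape(3, 3)
result = block_mul_np(p, m)
print(result)
```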
Don't know about the efficient method, but you can try these:
Method 1:
Using torch.cat()
import torch
def multiply(a, b):
    x1 = a[0:2, 0:2]*b[0,0]
    x2 = a[0:2, 2:]*b[0,1]
    x3 = a[2:, 0:2]*b[1,0]
    x4 = a[2:, 2:]*b[1,1]
    return torch.cat((torch.cat((x1, x2), 1), torch.cat((x3, x4), 1)), 0)
a = torch.tensor([[1, 1, 2, 2],[1, 1, 2, 2],[3, 3, 4, 4,],[3, 3, 4, 4]])
b = torch.tensor([[1, 0],[0, 0]])
print(multiply(a, b))
output:
tensor([[1, 1, 0, 0],
[1, 1, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]])
Method 2:
Using torch.nn.functional.pad()
import torch.nn.functional as F
import torch
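def multiply(a, b):
    # Note: this hard-codes the mask for b = [[1, 0], [0, 0]]; it is not a general solution.
    b = F.pad(input=b, pad=(1, 1, 1, 1), mode='constant', value=0)  # 2x2 -> 4x4 with a zero border
    b[0,0] = 1
    b[0,1] = 1
    b[1,0] = 1
    return a*b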
a = torch.tensor([[1, 1, 2, 2],[1, 1, 2, 2],[3, 3, 4, 4,],[3, 3, 4, 4]])
b = torch.tensor([[1, 0],[0, 0]])
print(multiply(a, b))
output:
tensor([[1, 1, 0, 0],
[1, 1, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]])
If the matrices are small, you are probably fine with cat or pad. The solution with factorization is very elegant, as is the one with a block_mul implementation.
Another solution is to turn the 2D block matrix into a 3D volume where each 2D slice is a block (P_1, P_2, P_3, P_4), then use broadcasting to multiply each 2D slice by a scalar, and finally reshape the output. The reshaping is not immediate, but it is doable; it is a port from numpy to pytorch of https://stackoverflow.com/a/16873755/4892874
In Pytorch:
import torch
h = w = 4
x = torch.ones(h, w)
x[:2, 2:] = 2
x[2:, :2] = 3
x[2:, 2:] = 4
# height and width of each block (here 2x2, which also gives 2 blocks per axis)
nrows=2
ncols=2
vol3d = x.reshape(h//nrows, nrows, -1, ncols)
vol3d = vol3d.permute(0, 2, 1, 3).reshape(-1, nrows, ncols)
out = vol3d * torch.Tensor([1, 0, 0, 0])[:, None, None].float()
# reshape to original
n, nrows, ncols = out.shape
out = out.reshape(h//nrows, -1, nrows, ncols)
out = out.permute(0, 2, 1, 3)
out = out.reshape(h, w)
print(out)
tensor([[1., 1., 0., 0.],
[1., 1., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
I haven't benchmarked this against the others, but it doesn't consume additional memory like padding would, and it doesn't do slow operations like concatenation. It also has the advantage of being easy to understand and visualize.
You can generalize it to any situation by playing with h, w, nrows, ncols.
Although the other answer may be a solution, it is not efficient. I came up with another one to tackle the problem, but it is still not perfect: the following implementation needs too much memory when the inputs have 3 or 4 dimensions. For example, for an input of size 20*75*1024*1024, the following calculation needs around 12 GB of RAM.
Here is my implementation:
import tensorflow as tf
tf.enable_eager_execution()
inps = tf.constant([
[1, 1, 1, 1, 2, 2, 2, 2],
[1, 1, 1, 1, 2, 2, 2, 2],
[1, 1, 1, 1, 2, 2, 2, 2],
[1, 1, 1, 1, 2, 2, 2, 2],
[3, 3, 3, 3, 4, 4, 4, 4],
[3, 3, 3, 3, 4, 4, 4, 4],
[3, 3, 3, 3, 4, 4, 4, 4],
[3, 3, 3, 3, 4, 4, 4, 4]])
on_cells = tf.constant([[1, 0, 0, 1]])
on_cells = tf.expand_dims(on_cells, axis=-1)
# replicate the value to block-size (4*4)
on_cells = tf.tile(on_cells, [1, 1, 4 * 4])
# reshape to a format for permutation
on_cells = tf.reshape(on_cells, (1, 2, 2, 4, 4))
# permutation
on_cells = tf.transpose(on_cells, [0, 1, 3, 2, 4])
# reshape
on_cells = tf.reshape(on_cells, [1, 8, 8])
# element-wise operation
print(inps * on_cells)
Output:
tf.Tensor(
[[[1 1 1 1 0 0 0 0]
[1 1 1 1 0 0 0 0]
[1 1 1 1 0 0 0 0]
[1 1 1 1 0 0 0 0]
[0 0 0 0 4 4 4 4]
[0 0 0 0 4 4 4 4]
[0 0 0 0 4 4 4 4]
[0 0 0 0 4 4 4 4]]], shape=(1, 8, 8), dtype=int32)
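A possible way to avoid materialising the full-size mask is to reshape the input instead and let broadcasting apply the small mask, which also works with leading batch dimensions. The NumPy sketch below illustrates the idea; `apply_block_mask` is a hypothetical helper, and whether it reduces memory enough for the 20*75*1024*1024 case has not been measured here.

```python
import numpy as np

def apply_block_mask(x, mask):
    """Scale each block of the last two axes of x by the matching entry of mask."""
    *lead, h, w = x.shape
    n_r, n_c = mask.shape                      # number of blocks per axis
    # View the last two axes as (block row, row in block, block col, col in block).
    x_blocks = x.reshape(*lead, n_r, h // n_r, n_c, w // n_c)
    out = x_blocks * mask[:, None, :, None]    # the mask is never tiled to full size
    return out.reshape(x.shape)

# Same 8x8 input as above: four 4x4 blocks filled with 1, 2, 3, 4.
inps = np.kron(np.array([[1, 2], [3, 4]]), np.ones((4, 4), dtype=int))
on_cells = np.array([[1, 0], [0, 1]])          # keep the top-left and bottom-right blocks

masked = apply_block_mask(inps, on_cells)
print(masked)

batch = np.stack([inps, 2 * inps])             # shape (2, 8, 8): a leading batch axis
masked_batch = apply_block_mask(batch, on_cells)
print(masked_batch.shape)
```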
Code:
shape = np.array([6, 6])
grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
slices = [tuple(slice(box[i], box[i] + 2) for i in range(len(box))) for box in grid]
score = np.zeros((7,7,3))
column = np.random.randn(36, 12) #just for example
column
>> array([[ 0, 1, 2, 3, ... 425, 426, 427, 428, 429, 430, 431]])
column = column.reshape((16, 3, 3, 3))
for i, window in enumerate(slices):
    score[window] += column[i]
score
>> array([[[0.000e+00, 1.000e+00, 2.000e+00],
[3.000e+01, 3.200e+01, 3.400e+01],
[9.000e+01, 9.300e+01, 9.600e+01], ...
[8.280e+02, 8.300e+02, 8.320e+02],
[4.290e+02, 4.300e+02, 4.310e+02]]])
It works, but the last 2 lines take a very long time because they run in a loop. The problem is that the 'grid' variable contains an array of windows, and I don't know how to speed up the process.
Let's simplify the problem a bit - reduce the dimensions, and drop the final size 3 dimension:
In [265]: shape = np.array([4,4])
In [266]: grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
     ...: grid = [tuple(slice(box[i], box[i] + 3) for i in range(len(box))) for box in grid]
In [267]: len(grid)
Out[267]: 16
In [268]: score = np.arange(36).reshape(6,6)
In [269]: X = np.array([score[x] for x in grid]).reshape(4,4,3,3)
In [270]: X
Out[270]:
array([[[[ 0, 1, 2],
[ 6, 7, 8],
[12, 13, 14]],
[[ 1, 2, 3],
[ 7, 8, 9],
[13, 14, 15]],
[[ 2, 3, 4],
[ 8, 9, 10],
[14, 15, 16]],
....
[[21, 22, 23],
[27, 28, 29],
[33, 34, 35]]]])
This is a moving window - one (3,3) array, shift over 1,..., shift down 1, etc.
With as_strided it is possible to construct a view of the array that consists of all these windows, but without actually copying values. Having worked with as_strided before, I was able to construct the equivalent strides as:
In [271]: score.shape
Out[271]: (6, 6)
In [272]: score.strides
Out[272]: (48, 8)
In [273]: ast = np.lib.stride_tricks.as_strided
In [274]: x=ast(score, shape=(4,4,3,3), strides=(48,8,48,8))
In [275]: np.allclose(X,x)
Out[275]: True
This could be extended to your (28,28,3) dimensions, and turned into the summation.
Generating such moving windows has been covered in previous SO questions. And it's also implemented in one of the image processing packages.
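For the summation in the question, one option is to loop over the offsets inside a window instead of over the windows themselves, so the Python loop runs 3*3 times regardless of how many windows there are. The sketch below uses a small consistent setup (4x4 window positions, 3x3 windows, 3 channels) chosen for illustration; it is not taken from the question's exact numbers.

```python
import numpy as np

n_pos, win, channels = 4, 3, 3                       # illustrative sizes
column = np.arange(n_pos * n_pos * win * win * channels, dtype=float)
column = column.reshape(n_pos * n_pos, win, win, channels)

# Reference: add every window one at a time (16 iterations here).
expected = np.zeros((n_pos + win - 1, n_pos + win - 1, channels))
for i in range(n_pos * n_pos):
    r, c = divmod(i, n_pos)                          # window i starts at row r, column c
    expected[r:r + win, c:c + win] += column[i]

# Faster: for each offset (di, dj) inside the window, add all windows at once.
col5 = column.reshape(n_pos, n_pos, win, win, channels)
score = np.zeros_like(expected)
for di in range(win):
    for dj in range(win):
        score[di:di + n_pos, dj:dj + n_pos] += col5[:, :, di, dj]

print(np.allclose(score, expected))
```

Each slice assignment touches every target cell at most once, so the in-place += is safe even though the windows themselves overlap.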
Adaptation for a 3 channel image,
In [45]: arr.shape
Out[45]: (6, 6, 3)
In [46]: arr.strides
Out[46]: (144, 24, 8)
In [47]: arr[:3,:3,0]
Out[47]:
array([[ 0., 1., 2.],
[ 6., 7., 8.],
[12., 13., 14.]])
In [48]: x = ast(arr, shape=(4,4,3,3,3), strides=(144,24,144,24,8))
In [49]: x[0,0,:,:,0]
Out[49]:
array([[ 0., 1., 2.],
[ 6., 7., 8.],
[12., 13., 14.]])
Since we are moving the window by one element at a time, the strides for x are easily derived from the source strides.
For 4x4 windows, just change the shape
x = ast(arr, shape=(3,3,4,4,3), strides=(144,24,144,24,8))
In Efficiently Using Multiple Numpy Slices for Random Image Cropping,
@Divikar suggests using skimage.
With the default step=1, the result is compatible:
In [55]: from skimage.util.shape import view_as_windows
In [63]: y = view_as_windows(arr,(4,4,3))
In [64]: y.shape
Out[64]: (3, 3, 1, 4, 4, 3)
In [69]: np.allclose(x,y[:,:,0])
Out[69]: True
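Newer NumPy releases (1.20 and later) also provide numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of read-only view without hand-written strides. A brief sketch, using a stand-in (6, 6, 3) array since the contents of arr are not shown in full above:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

arr = np.arange(6 * 6 * 3, dtype=float).reshape(6, 6, 3)   # stand-in for the 3-channel image

# Manual strides, derived from arr.strides rather than hard-coded byte counts.
s0, s1, s2 = arr.strides
x = as_strided(arr, shape=(3, 3, 4, 4, 3), strides=(s0, s1, s0, s1, s2))

# Same windows; the window shape covers all three axes, giving an extra length-1 axis.
y = sliding_window_view(arr, (4, 4, 3))

print(y.shape)
print(np.allclose(x, y[:, :, 0]))
```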
I'd like to build a kernel from a list of positions and list of kernel centers. The kernel should be an indicator of the TWO closest centers to each position.
> x = np.array([0.1, .49, 1.9]).reshape((3,1)) # Positions
> c = np.array([-2., 0.1, 0.2, 0.4, 0.5, 2.]) # centers
> print(x)
> print(c)
[[ 0.1 ]
[ 0.49]
[ 1.9 ]]
[-2. 0.1 0.2 0.4 0.5 2. ]
What I'd like to get out is:
array([[ 0, 1, 1, 0, 0, 0], # Index 1,2 closest to 0.1
[ 0, 0, 0, 1, 1, 0], # Index 3,4 closest to 0.49
[ 0, 0, 0, 0, 1, 1]]) # Index 4,5 closest to 1.9
I can get:
> dist = np.abs(x-c)
array([[ 2.1 , 0. , 0.1 , 0.3 , 0.4 , 1.9 ],
[ 2.49, 0.39, 0.29, 0.09, 0.01, 1.51],
[ 3.9 , 1.8 , 1.7 , 1.5 , 1.4 , 0.1 ]])
and:
> np.argsort(dist, axis=1)[:,:2]
array([[1, 2],
[4, 3],
[5, 4]])
Here I have a matrix of column indexes, but I can't see how to use them to set values of those columns in another matrix (using efficient numpy operations).
idx = np.argsort(dist, axis=1)[:,:2]
z = np.zeros(dist.shape)
z[idx]=1 # NOPE
z[idx,:]=1 # NOPE
z[:,idx]=1 # NOPE
One way would be to initialize zeros array and then index with advanced-indexing -
out = np.zeros(dist.shape,dtype=int)
out[np.arange(idx.shape[0])[:,None],idx] = 1
Alternatively, we could play around with dimensions extension to use broadcasting and come up with a one-liner -
out = (idx[...,None] == np.arange(dist.shape[1])).any(1).astype(int)
For performance, I would suggest using np.argpartition to get those indices -
idx = np.argpartition(dist, 2, axis=1)[:,:2]
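Putting the pieces together on the data from the question gives the following runnable sketch. np.put_along_axis is used here as an alternative way to write the ones; it is not part of the answer above.

```python
import numpy as np

x = np.array([0.1, 0.49, 1.9]).reshape((3, 1))       # positions
c = np.array([-2.0, 0.1, 0.2, 0.4, 0.5, 2.0])        # centers

dist = np.abs(x - c)                                  # shape (3, 6)
idx = np.argpartition(dist, 2, axis=1)[:, :2]         # two nearest centers per row, unordered

# Advanced indexing: pair each row number with its column indices.
out = np.zeros(dist.shape, dtype=int)
out[np.arange(idx.shape[0])[:, None], idx] = 1

# Equivalent write using put_along_axis.
out2 = np.zeros(dist.shape, dtype=int)
np.put_along_axis(out2, idx, 1, axis=1)

print(out)
```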
I have this sample Dataset containing worldwide air temperature, and more importantly, a mask land, marking land/non-water areas.
<xarray.Dataset>
Dimensions: (lat: 55, lon: 143, time: 5)
Coordinates:
* time (time) datetime64[ns] 2016-01-01 2016-01-02 2016-01-03 ...
* lat (lat) float64 -52.5 -50.0 -47.5 -45.0 -42.5 -40.0 -37.5 -35.0 ...
* lon (lon) float64 -177.5 -175.0 -172.5 -170.0 -167.5 -165.0 -162.5 ...
land (lat, lon) bool False False False False False False False False ...
Data variables:
airt (time, lat, lon) float64 7.952 7.61 7.389 7.267 7.124 6.989 ...
I can now mask the oceans and plot it
dry_areas = ds.where(ds.land)
dry_areas.airt.plot()
dry_areas looks like this
<xarray.Dataset>
Dimensions: (lat: 55, lon: 143)
Coordinates:
* lat (lat) float64 -52.5 -50.0 -47.5 -45.0 -42.5 -40.0 -37.5 -35.0 ...
* lon (lon) float64 -177.5 -175.0 -172.5 -170.0 -167.5 -165.0 -162.5 ...
land (lat, lon) bool False False False False False False False False ...
Data variables:
airt (lat, lon) float64 nan nan nan nan nan nan nan nan nan nan nan ...
How can I now get the coordinates for all non-nan values?
dry_areas.coords gives me the bounding box and I can't get lat and lon into the (55, 143) shape so I could apply the mask on.
The only working workaround I could find is
dry_areas.to_dataframe().dropna().reset_index()[['lat', 'lon']].values, which does not feel very lean and clean.
I feel this is quite simple; however, I am clearly not a numpy/matrix ninja.
Best solution so far
This is the shortest I could come up with so far:
lon, lat = np.meshgrid(ds.coords['lon'], ds.coords['lat'])
lat_masked = ma.array(lat, mask=dry_areas.airt.fillna(False))
lon_masked = ma.array(lon, mask=dry_areas.airt.fillna(False))
land_coordinates = zip(lat_masked[lat_masked.mask].data, lon_masked[lon_masked.mask].data)
You can use .stack to get an array of coord pairs of the non-null values:
In [31]: da=xr.DataArray(np.arange(20).reshape(5,4))
In [33]: da_nans = da.where(da % 2 == 1)
In [34]: da_nans
Out[34]:
<xarray.DataArray (dim_0: 5, dim_1: 4)>
array([[ nan, 1., nan, 3.],
[ nan, 5., nan, 7.],
[ nan, 9., nan, 11.],
[ nan, 13., nan, 15.],
[ nan, 17., nan, 19.]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3 4
* dim_1 (dim_1) int64 0 1 2 3
In [35]: da_stacked = da_nans.stack(x=['dim_0','dim_1'])
In [36]: da_stacked
Out[36]:
<xarray.DataArray (x: 20)>
array([ nan, 1., nan, 3., nan, 5., nan, 7., nan, 9., nan,
11., nan, 13., nan, 15., nan, 17., nan, 19.])
Coordinates:
* x (x) object (0, 0) (0, 1) (0, 2) (0, 3) (1, 0) (1, 1) (1, 2) ...
In [37]: da_stacked[da_stacked.notnull()]
Out[37]:
<xarray.DataArray (x: 10)>
array([ 1., 3., 5., 7., 9., 11., 13., 15., 17., 19.])
Coordinates:
* x (x) object (0, 1) (0, 3) (1, 1) (1, 3) (2, 1) (2, 3) (3, 1) ...
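When the result should be plain (lat, lon) pairs rather than a stacked DataArray, indexing the 1-D coordinate arrays with the positions of the non-NaN cells is another option. The sketch below uses NumPy only, with made-up coordinates and values; with xarray, the arrays would presumably come from the .values of the corresponding coordinates and data variable.

```python
import numpy as np

lat = np.array([-52.5, -50.0, -47.5])                 # made-up coordinate values
lon = np.array([-177.5, -175.0, -172.5, -170.0])
airt = np.array([[np.nan, 7.9, np.nan, np.nan],
                 [np.nan, np.nan, 6.5, 6.1],
                 [5.0, np.nan, np.nan, np.nan]])      # NaN marks masked (ocean) cells

rows, cols = np.nonzero(~np.isnan(airt))              # positions of the non-NaN cells
land_coordinates = np.column_stack([lat[rows], lon[cols]])

print(land_coordinates)
```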