How to use the scipy.interpolate.interpn function with an xarray (3D) to fill NaN gaps? Current error: "The points in dimension 0 must be strictly ascending"

I am a bit frustrated, as I could not find a solution to my problem, which seems easy to do in R with the package gapfill but is more difficult here in Python.
Coming to my problem: I have an xarray (3D) with the dimensions latitude, longitude and time. What I want is to interpolate NaN values in each raster/array (caused by clouds and other distortions). The NaN values form blocks (due to the clouds) and are sometimes relatively big. My idea is to interpolate not only with the neighbouring pixels of each timestep but also with the timesteps before and after (the assumption is that the pixels some days before and some days after have relatively similar values, as the land coverage does not change that fast). My aim is to do a linear interpolation over time at the same pixel position. (How many timesteps before and after to use is also something I am not sure how to define in the interpn function.)
I found different options to do that; however, none has worked yet. The most promising method I found is the interpolate.interpn function from the scipy package. This function uses a numpy array, not an xarray. My attempt:
import numpy as np
from scipy import interpolate

# change from xarray to numpy
my_array_np = my_array.to_numpy()
# label dimensions (what is done when building a numpy array with meshgrid)
x = my_array_np[0]
y = my_array_np[1]
z = my_array_np[2]
# get index of nan values
nanIndex = np.isnan(my_array_np).nonzero()
nanIndex
# name dimensions of nan values
xc = nanIndex[0]
yc = nanIndex[1]
zc = nanIndex[2]
# For using the scipy interpolate.interpn function:
# points = the regular grid - in my case x, y, z
# values = the data on the regular grid - in my case my array (my_array_np)
# points_nan = the points that are evaluated on the 3D grid - in my case xc, yc, zc
points = (x, y, z)  # dimensions
points_nan = (xc, yc, zc)  # nan dimensions
print(interpolate.interpn(points, my_array_np, points_nan))
What I get now as an error is:
"The points in dimension 0 must be strictly ascending"
Where am I going wrong? Thanks for your help in advance! If you have another solution besides scipy that also solves my problem, I am happy for help!
This is how my array looks:
array([[[      nan,       nan,       nan, ..., 279.64   , 282.16998, 279.66998],
        [277.62   , 277.52   , 277.88   , ..., 281.75998, 281.72   , 281.66   ],
        [277.38   , 277.75   , 277.88998, ..., 281.75998, 281.75998, 280.91998],
        ...,
        [      nan,       nan,       nan, ..., 280.72998, 280.33   , 280.94   ],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan]],

       [[      nan,       nan,       nan, ..., 272.22   , 271.54   , 271.02   ],
        [280.02   , 280.44998, 281.18   , ..., 271.47998, 271.88   , 272.03   ],
        [280.32   , 281.     , 281.27   , ..., 270.83   , 271.58   , 272.03   ],
        ...,
        [      nan,       nan,       nan, ..., 290.34   , 290.25   , 288.365  ],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan]],

       [[      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [276.44998, 276.19998, 276.19   , ...,       nan,       nan,       nan],
        [276.50998, 276.79   , 276.58   , ...,       nan,       nan,       nan],
        ...,
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan]],

       ...,

       [[      nan,       nan,       nan, ..., 276.38998, 276.44   , 275.72998],
        [      nan,       nan,       nan, ..., 276.55   , 276.81   , 276.72998],
        [      nan,       nan,       nan, ..., 279.74   , 277.11   , 276.97   ],
        ...,
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan]],

       [[      nan,       nan,       nan, ..., 277.38   , 278.08   , 277.79   ],
        [279.66998, 280.00998, 283.13   , ..., 277.34   , 277.41998, 277.62   ],
        [      nan, 277.41   , 277.41   , ..., 277.825  , 277.31   , 277.52   ],
        ...,
        [      nan,       nan,       nan, ..., 276.52   ,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan]],

       [[      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        ...,
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan],
        [      nan,       nan,       nan, ...,       nan,       nan,       nan]]], dtype=float32)

interpn cannot be used to fill gaps in a regular grid; it is a fast method for interpolating a full regular grid (with no gaps) to different coordinates.
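For contrast, here is a minimal sketch (dummy values, not your data) of what interpn is designed for: evaluating a complete, gap-free grid at new in-between points:
import numpy as np
from scipy import interpolate

# a complete 5x5 grid with no gaps: vals[i, j] = 5*i + j
pts = (np.arange(5), np.arange(5))
vals = 5.0 * np.arange(5)[:, None] + np.arange(5)[None, :]
# evaluate the grid at an arbitrary point between the nodes
print(interpolate.interpn(pts, vals, [[1.5, 2.5]]))  # [10.]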
To fill missing values with N-dimensional interpolation, use one of the scipy interpolation methods for unstructured N-dimensional data.
Since you're interpolating to a regular grid, I'll demo the use of scipy.interpolate.griddata:
import xarray as xr, pandas as pd, numpy as np, scipy.interpolate

# create dummy data
x = y = z = np.linspace(0, 1, 5)
da = xr.DataArray(
    np.sin(x).reshape(-1, 1, 1) * np.cos(y).reshape(1, -1, 1) + z.reshape(1, 1, -1),
    dims=['x', 'y', 'z'],
    coords=[x, y, z],
)

# randomly fill with NaNs
da = da.where(np.random.random(size=da.shape) > 0.1)
This looks like the following:
In [11]: da
Out[11]:
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ]],
[[0.24740396, 0.49740396, 0.74740396, 0.99740396, nan],
[ nan, 0.48971277, nan, 0.98971277, 1.23971277],
[0.2171174 , 0.4671174 , 0.7171174 , 0.9671174 , 1.2171174 ],
[0.18102272, 0.43102272, 0.68102272, 0.93102272, 1.18102272],
[ nan, 0.38367293, 0.63367293, 0.88367293, 1.13367293]],
[[0.47942554, nan, 0.97942554, 1.22942554, 1.47942554],
[0.46452136, 0.71452136, 0.96452136, 1.21452136, 1.46452136],
[0.42073549, 0.67073549, 0.92073549, 1.17073549, nan],
[0.35079033, 0.60079033, 0.85079033, 1.10079033, 1.35079033],
[ nan, 0.50903472, 0.75903472, nan, 1.25903472]],
[[0.68163876, 0.93163876, 1.18163876, 1.43163876, 1.68163876],
[0.66044826, 0.91044826, nan, 1.41044826, 1.66044826],
[0.59819429, 0.84819429, 1.09819429, 1.34819429, nan],
[ nan, 0.74874749, 0.99874749, 1.24874749, nan],
[0.36829099, 0.61829099, 0.86829099, 1.11829099, 1.36829099]],
[[0.84147098, 1.09147098, nan, 1.59147098, 1.84147098],
[0.81531169, 1.06531169, 1.31531169, 1.56531169, 1.81531169],
[0.73846026, 0.98846026, 1.23846026, 1.48846026, nan],
[0.61569495, 0.86569495, 1.11569495, 1.36569495, 1.61569495],
[0.45464871, 0.70464871, 0.95464871, nan, 1.45464871]]])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
* y (y) float64 0.0 0.25 0.5 0.75 1.0
* z (z) float64 0.0 0.25 0.5 0.75 1.0
To use an unstructured scipy interpolator, you must convert the gridded data with missing values to vectors of 1D points with no missing values:
# ravel all points and find the valid ones
points = da.data.ravel()
valid = ~np.isnan(points)
points_valid = points[valid]

# construct arrays of (x, y, z) points, masked to only include the valid points;
# indexing='ij' keeps the meshgrid axes in the same (x, y, z) order as da.data,
# so the raveled coordinates line up with the raveled values
xx, yy, zz = np.meshgrid(x, y, z, indexing='ij')
xx, yy, zz = xx.ravel(), yy.ravel(), zz.ravel()
xxv = xx[valid]
yyv = yy[valid]
zzv = zz[valid]

# feed these into the interpolator, and also provide the target grid
interpolated = scipy.interpolate.griddata(
    np.stack([xxv, yyv, zzv]).T, points_valid, (xx, yy, zz), method="linear"
)

# reshape to match the original array and replace the DataArray values with
# the interpolated data
da.values = interpolated.reshape(da.shape)
This results in the array being filled:
In [32]: da
Out[32]:
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ]],
[[0.24740396, 0.49740396, 0.74740396, 0.99740396, 1.23971277],
[0.23226068, 0.48971277, 0.73226068, 0.98971277, 1.23971277],
[0.2171174 , 0.4671174 , 0.7171174 , 0.9671174 , 1.2171174 ],
[0.18102272, 0.43102272, 0.68102272, 0.93102272, 1.18102272],
[0.12276366, 0.38367293, 0.63367293, 0.88367293, 1.13367293]],
[[0.47942554, 0.71452136, 0.97942554, 1.22942554, 1.47942554],
[0.46452136, 0.71452136, 0.96452136, 1.21452136, 1.46452136],
[0.42073549, 0.67073549, 0.92073549, 1.17073549, 1.40765584],
[0.35079033, 0.60079033, 0.85079033, 1.10079033, 1.35079033],
[0.24552733, 0.50903472, 0.75903472, 1.00903472, 1.25903472]],
[[0.68163876, 0.93163876, 1.18163876, 1.43163876, 1.68163876],
[0.66044826, 0.91044826, 1.16044826, 1.41044826, 1.66044826],
[0.59819429, 0.84819429, 1.09819429, 1.34819429, 1.57184545],
[0.48324264, 0.74874749, 0.99874749, 1.24874749, 1.48324264],
[0.36829099, 0.61829099, 0.86829099, 1.11829099, 1.36829099]],
[[0.84147098, 1.09147098, 1.34147098, 1.59147098, 1.84147098],
[0.81531169, 1.06531169, 1.31531169, 1.56531169, 1.81531169],
[0.73846026, 0.98846026, 1.23846026, 1.48846026, 1.71550332],
[0.61569495, 0.86569495, 1.11569495, 1.36569495, 1.61569495],
[0.45464871, 0.70464871, 0.95464871, 1.20464871, 1.45464871]]])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
* y (y) float64 0.0 0.25 0.5 0.75 1.0
* z (z) float64 0.0 0.25 0.5 0.75 1.0
Note that this filled the complete array because the convex hull of available points covers the whole array. If this is not the case, you may need a second step using nearest neighbor or fitting a spline to the filled data.
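For example, a second nearest-neighbor pass along these lines (a sketch reusing the arrays built above) would fill whatever the linear pass left as NaN:
# points outside the convex hull are NaN after the linear pass
still_missing = np.isnan(interpolated)
# fill them with the value of the nearest valid point
interpolated[still_missing] = scipy.interpolate.griddata(
    np.stack([xxv, yyv, zzv]).T,
    points_valid,
    (xx[still_missing], yy[still_missing], zz[still_missing]),
    method="nearest",
)
da.values = interpolated.reshape(da.shape)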

Related

JsonSchema Validation Exception

I am trying to do JSON Schema validation in REST Assured by matching the response body with a JSON schema on the class path. I am getting the following error:
io.restassured.module.jsv.JsonSchemaValidationException: com.fasterxml.jackson.core.JsonParseException: Non-standard token 'NaN': enable JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS to allow
at [Source: (StringReader); line: 1, column: 3948]
The response body contains the values "20": NaN, "21": NaN, "22": NaN, "23": NaN, "24": NaN, "25": ""25minutes"", "26": NaN, "27": NaN, "28": NaN, "29": NaN, which causes the problem. How can I avoid this error? Can someone please help?
NaN is not valid JSON. Either have the service serialize missing values as null (or omit them), or enable the non-standard JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS parser feature, as the error message suggests.

How to (correctly) merge 2 Pandas DataFrames and scatter-plot

Thank you for your answers, in advance.
My end goal is to produce a scatter plot with corruption as an explanatory variable (x axis, from a DataFrame 'corr') and inequality as a dependent variable (y axis, from a DataFrame 'inq').
A hint on producing an informative table (DataFrame) by joining these two DataFrames would be much appreciated.
I have a DataFrame 'inq' for country inequality (GINI index) and another one 'corr' for a country corruption index.
from numpy import nan
import pandas as pd

inq = pd.DataFrame(
    {
        "country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
        "1975": {0: nan, 1: nan, 2: nan},
        "1976": {0: nan, 1: nan, 2: nan},
        "2017": {0: nan, 1: 33.2, 2: nan},
        "2018": {0: 51.3, 1: nan, 2: nan},
    }
)
corr = pd.DataFrame(
    {
        "country": {0: "Afghanistan", 1: "Angola", 2: "Albania"},
        "1975": {0: 44.8, 1: 48.1, 2: 75.1},
        "1976": {0: 44.8, 1: 48.1, 2: 75.1},
        "2018": {0: 24.2, 1: 40.4, 2: 28.4},
        "2019": {0: 40.5, 1: 37.6, 2: 35.9},
    }
)
I concatenate and manipulate:
cm = pd.concat([inq, corr], axis=0, keys=["Inequality", "Corruption"]).reset_index(
    level=1, drop=True
)
and get a new DataFrame:
pd.DataFrame(
    {
        "indicator": {0: "Inequality", 1: "Inequality", 2: "Inequality"},
        "country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
        "1967": {0: nan, 1: nan, 2: nan},
        "1969": {0: nan, 1: nan, 2: nan},
        "2018": {0: 51.3, 1: nan, 2: nan},
        "2019": {0: nan, 1: nan, 2: nan},
    }
)
You should concatenate your dataframe in a different way:
df = (pd.concat([inq.set_index('country'),
                 corr.set_index('country')],
                axis=1,
                keys=["Inequality", "Corruption"])
      .stack(level=1)
      )
                  Inequality  Corruption
country
Angola      1975         NaN        48.1
            1976         NaN        48.1
            2018        51.3        40.4
            2019         NaN        37.6
Albania     1975         NaN        75.1
            1976         NaN        75.1
            2017        33.2         NaN
            2018         NaN        28.4
            2019         NaN        35.9
Afghanistan 1975         NaN        44.8
            1976         NaN        44.8
            2018         NaN        24.2
            2019         NaN        40.5
Then to plot:
df.plot.scatter(x='Corruption', y='Inequality')
NB: there is only one point, as most of your data is NaN.
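If you want to check which rows actually contribute points before plotting, a small sketch using the df built above:
# keep only the (country, year) rows where both indicators are present
df_valid = df.dropna()
print(df_valid)
#               Inequality  Corruption
# country
# Angola  2018        51.3        40.4
df_valid.plot.scatter(x='Corruption', y='Inequality')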

xarray: simple weighted rolling mean example using .construct()

Xarray can do a weighted rolling mean via the .construct() method, as stated in an answer on SO here and also in the docs.
The weighted rolling mean example in the docs doesn't quite look right, as it seems to give the same answer as the ordinary rolling mean.
import xarray as xr
import numpy as np

arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
                   dims=('x', 'y'))
arr.rolling(y=3, center=True).mean()
# <xarray.DataArray (x: 3, y: 5)>
# array([[nan, 0.5, 1. , 1.5, nan],
#        [nan, 3. , 3.5, 4. , nan],
#        [nan, 5.5, 6. , 6.5, nan]])
# Dimensions without coordinates: x, y

weight = xr.DataArray([0.25, 0.5, 0.25], dims=['window'])
arr.rolling(y=3, center=True).construct('window').dot(weight)
# <xarray.DataArray (x: 3, y: 5)>
# array([[nan, 0.5, 1. , 1.5, nan],
#        [nan, 3. , 3.5, 4. , nan],
#        [nan, 5.5, 6. , 6.5, nan]])
# Dimensions without coordinates: x, y
Here is a simpler example on which I would like to get the syntax right:
da = xr.DataArray(np.arange(1,6), dims='x')
da.rolling(x=3, center=True).mean()
#<xarray.DataArray (x: 5)>
#array([nan, 2., 3., 4., nan])
#Dimensions without coordinates: x
weight = xr.DataArray([0.5, 1, 0.5], dims=['window'])
da.rolling(x=3, center=True).construct('window').dot(weight)
#<xarray.DataArray (x: 5)>
#array([nan, 4., 6., 8., nan])
#Dimensions without coordinates: x
It returns 4, 6, 8. I thought it would do:
((1 x 0.5) + (2 x 1) + (3 x 0.5)) / 3 = 4/3
((2 x 0.5) + (3 x 1) + (4 x 0.5)) / 3 = 2
((3 x 0.5) + (4 x 1) + (5 x 0.5)) / 3 = 8/3
i.e. 1.33, 2, 2.67
In the first example, you use evenly spaced data for arr.
Therefore, the weighted mean (with [0.25, 0.5, 0.25]) will be the same as the simple mean.
If you consider non-linear data, the result differs:
In [50]: arr = xr.DataArray((np.arange(0, 7.5, 0.5)**2).reshape(3, 5),
    ...:                    dims=('x', 'y'))

In [51]: arr.rolling(y=3, center=True).mean()
Out[51]:
<xarray.DataArray (x: 3, y: 5)>
array([[      nan,  0.416667,  1.166667,  2.416667,       nan],
       [      nan,  9.166667, 12.416667, 16.166667,       nan],
       [      nan, 30.416667, 36.166667, 42.416667,       nan]])
Dimensions without coordinates: x, y

In [52]: weight = xr.DataArray([0.25, 0.5, 0.25], dims=['window'])
    ...: arr.rolling(y=3, center=True).construct('window').dot(weight)
Out[52]:
<xarray.DataArray (x: 3, y: 5)>
array([[   nan,  0.375,  1.125,  2.375,    nan],
       [   nan,  9.125, 12.375, 16.125,    nan],
       [   nan, 30.375, 36.125, 42.375,    nan]])
Dimensions without coordinates: x, y
For the second example, you use [0.5, 1, 0.5] as the weight, which sums to 2.
Therefore, the first non-nan item will be
(1 x 0.5) + (2 x 1) + (3 x 0.5) = 4
If you want the weighted mean rather than the weighted sum, use [0.25, 0.5, 0.25] instead, i.e. weights that sum to 1.
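To confirm, normalizing the weights so they sum to 1 recovers the expected weighted mean (a quick sketch continuing the da example above):
weight = xr.DataArray([0.5, 1, 0.5], dims=['window'])
weight = weight / weight.sum()  # -> [0.25, 0.5, 0.25], sums to 1
da.rolling(x=3, center=True).construct('window').dot(weight)
# <xarray.DataArray (x: 5)>
# array([nan, 2., 3., 4., nan])
# Dimensions without coordinates: x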

How to convert a Numpy array (Rows x Cols) to an array of XYZ coordinates?

I have an input array from a camera (greyscale image) that looks like:
[
[0.5, 0.75, 0.1, 0.6],
[0.3, 0.75, 1.0, 0.9]
]
actual size = 434x512
I need an output which is a list of XYZ coordinates:
i.e. [[x,y,z],[x,y,z],...]
[[0,0,0.5],[1,0,0.75],[2,0,0.1],[3,0,0.6],[0,1,0.3],[1,1,0.75],[2,1,1.0],[3,1,0.9]]
Are there any efficient ways to do this using Numpy?
Here's an approach -
m,n = a.shape
R,C = np.mgrid[:m,:n]
out = np.column_stack((C.ravel(),R.ravel(), a.ravel()))
Sample run -
In [45]: a
Out[45]:
array([[ 0.5 , 0.75, 0.1 , 0.6 ],
[ 0.3 , 0.75, 1. , 0.9 ]])
In [46]: m,n = a.shape
...: R,C = np.mgrid[:m,:n]
...: out = np.column_stack((C.ravel(),R.ravel(), a.ravel()))
...:
In [47]: out
Out[47]:
array([[ 0. , 0. , 0.5 ],
[ 1. , 0. , 0.75],
[ 2. , 0. , 0.1 ],
[ 3. , 0. , 0.6 ],
[ 0. , 1. , 0.3 ],
[ 1. , 1. , 0.75],
[ 2. , 1. , 1. ],
[ 3. , 1. , 0.9 ]])
In [48]: out.tolist() # Convert to list of lists if needed
Out[48]:
[[0.0, 0.0, 0.5],
[1.0, 0.0, 0.75],
[2.0, 0.0, 0.1],
[3.0, 0.0, 0.6],
[0.0, 1.0, 0.3],
[1.0, 1.0, 0.75],
[2.0, 1.0, 1.0],
[3.0, 1.0, 0.9]]
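An equivalent sketch using np.indices, if you prefer it over np.mgrid (same output as above):
# np.indices returns the row and column index grids in one call
R, C = np.indices(a.shape)
out = np.column_stack((C.ravel(), R.ravel(), a.ravel()))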

Get coordinates of non-nan values of xarray Dataset

I have this sample Dataset containing worldwide air temperature, and more importantly, a mask land, marking land/non-water areas.
<xarray.Dataset>
Dimensions: (lat: 55, lon: 143, time: 5)
Coordinates:
* time (time) datetime64[ns] 2016-01-01 2016-01-02 2016-01-03 ...
* lat (lat) float64 -52.5 -50.0 -47.5 -45.0 -42.5 -40.0 -37.5 -35.0 ...
* lon (lon) float64 -177.5 -175.0 -172.5 -170.0 -167.5 -165.0 -162.5 ...
land (lat, lon) bool False False False False False False False False ...
Data variables:
airt (time, lat, lon) float64 7.952 7.61 7.389 7.267 7.124 6.989 ...
I can now mask the oceans and plot it
dry_areas = ds.where(ds.land)
dry_areas.airt.plot()
dry_areas looks like this
<xarray.Dataset>
Dimensions: (lat: 55, lon: 143)
Coordinates:
* lat (lat) float64 -52.5 -50.0 -47.5 -45.0 -42.5 -40.0 -37.5 -35.0 ...
* lon (lon) float64 -177.5 -175.0 -172.5 -170.0 -167.5 -165.0 -162.5 ...
land (lat, lon) bool False False False False False False False False ...
Data variables:
airt (lat, lon) float64 nan nan nan nan nan nan nan nan nan nan nan ...
How can I now get the coordinates of all non-nan values?
dry_areas.coords gives me the bounding box, and I can't get lat and lon into the (55, 143) shape so I could apply the mask to them.
The only working workaround I could find is
dry_areas.to_dataframe().dropna().reset_index()[['lat', 'lon']].values, which does not feel very lean and clean.
I feel this should be quite simple; however, I am clearly not a numpy/matrix ninja.
Best solution so far
This is the shortest I could come up with so far:
import numpy.ma as ma

lon, lat = np.meshgrid(ds.coords['lon'], ds.coords['lat'])
lat_masked = ma.array(lat, mask=dry_areas.airt.fillna(False))
lon_masked = ma.array(lon, mask=dry_areas.airt.fillna(False))
land_coordinates = zip(lat_masked[lat_masked.mask].data, lon_masked[lon_masked.mask].data)
You can use .stack to get an array of coord pairs of the non-null values:
In [31]: da=xr.DataArray(np.arange(20).reshape(5,4))
In [33]: da_nans = da.where(da % 2 == 1)
In [34]: da_nans
Out[34]:
<xarray.DataArray (dim_0: 5, dim_1: 4)>
array([[ nan, 1., nan, 3.],
[ nan, 5., nan, 7.],
[ nan, 9., nan, 11.],
[ nan, 13., nan, 15.],
[ nan, 17., nan, 19.]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3 4
* dim_1 (dim_1) int64 0 1 2 3
In [35]: da_stacked = da_nans.stack(x=['dim_0','dim_1'])
In [36]: da_stacked
Out[36]:
<xarray.DataArray (x: 20)>
array([ nan, 1., nan, 3., nan, 5., nan, 7., nan, 9., nan,
11., nan, 13., nan, 15., nan, 17., nan, 19.])
Coordinates:
* x (x) object (0, 0) (0, 1) (0, 2) (0, 3) (1, 0) (1, 1) (1, 2) ...
In [37]: da_stacked[da_stacked.notnull()]
Out[37]:
<xarray.DataArray (x: 10)>
array([ 1., 3., 5., 7., 9., 11., 13., 15., 17., 19.])
Coordinates:
* x (x) object (0, 1) (0, 3) (1, 1) (1, 3) (2, 1) (2, 3) (3, 1) ...
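To get the coordinate pairs themselves rather than the values, you can then read them off the stacked index — a short sketch, assuming the stacked coordinate is named x as above:
valid = da_stacked[da_stacked.notnull()]
coord_pairs = valid.coords['x'].values
# e.g. array([(0, 1), (0, 3), (1, 1), ...], dtype=object)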