I am a bit frustrated: I could not find a solution to a problem that seems easy in R with the gapfill package but is harder in Python.
My problem: I have a 3D xarray with the dimensions latitude, longitude and time. I want to interpolate the NaN values in each raster/array (caused by clouds and other distortions). The NaN values form blocks (due to the clouds) that are sometimes relatively large. My idea is to interpolate not only from the neighbouring pixels of each timestep but also from the timesteps before and after (the assumption being that a pixel has a relatively similar value some days before and some days after, since the land cover does not change that fast). My aim is a linear interpolation over time at the same pixel position. (I am also unsure how to define, in the interpn function, how many timesteps before and after should be used.)
I found different options for this, but none has worked so far. The most promising one is the interpolate.interpn function from scipy. This function takes a numpy array, not an xarray. My attempt:
import numpy as np
from scipy import interpolate

# change from xarray to numpy
my_array_np = my_array.to_numpy()
# label dimensions (as is done when building a numpy array with meshgrid)
x = my_array_np[0]
y = my_array_np[1]
z = my_array_np[2]
# get index of nan values
nanIndex = np.isnan(my_array_np).nonzero()
nanIndex
# name dimensions of nan values
xc = nanIndex[0]
yc = nanIndex[1]
zc = nanIndex[2]
# For using the scipy interpolate.interpn function:
# points = the regular grid - in my case x, y, z
# values = the data on the regular grid - in my case my array (my_array_np)
# points_nan = the points that are evaluated in the 3D grid - in my case xc, yc, zc
points = (x, y, z)  # dimensions
points_nan = (xc, yc, zc)  # nan dimensions
print(interpolate.interpn(points, my_array_np, points_nan))
The error I get now is:
"The points in dimension 0 must be strictly ascending"
Where am I going wrong? Thanks in advance for your help! If you have other solutions besides scipy that solve my problem, I would be happy to hear them too.
This is how my array looks:
array([[[ nan, nan, nan, ..., 279.64 , 282.16998,
279.66998],
[277.62 , 277.52 , 277.88 , ..., 281.75998, 281.72 ,
281.66 ],
[277.38 , 277.75 , 277.88998, ..., 281.75998, 281.75998,
280.91998],
...,
[ nan, nan, nan, ..., 280.72998, 280.33 ,
280.94 ],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., 272.22 , 271.54 ,
271.02 ],
[280.02 , 280.44998, 281.18 , ..., 271.47998, 271.88 ,
272.03 ],
[280.32 , 281. , 281.27 , ..., 270.83 , 271.58 ,
272.03 ],
...,
[ nan, nan, nan, ..., 290.34 , 290.25 ,
288.365 ],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., nan, nan,
nan],
[276.44998, 276.19998, 276.19 , ..., nan, nan,
nan],
[276.50998, 276.79 , 276.58 , ..., nan, nan,
nan],
...,
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
...,
[[ nan, nan, nan, ..., 276.38998, 276.44 ,
275.72998],
[ nan, nan, nan, ..., 276.55 , 276.81 ,
276.72998],
[ nan, nan, nan, ..., 279.74 , 277.11 ,
276.97 ],
...,
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., 277.38 , 278.08 ,
277.79 ],
[279.66998, 280.00998, 283.13 , ..., 277.34 , 277.41998,
277.62 ],
[ nan, 277.41 , 277.41 , ..., 277.825 , 277.31 ,
277.52 ],
...,
[ nan, nan, nan, ..., 276.52 , nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
...,
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]]], dtype=float32)
interpn cannot be used to fill gaps in a regular grid - interpn is a fast method for interpolating a full regular grid (with no gaps) to different coordinates.
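For the question's stated aim (linear interpolation over time at a fixed pixel position), a per-pixel 1D interpolation may already be enough. The following is a minimal sketch using only np.interp. The function name fill_gaps_along_time, the (time, lat, lon) axis order and the numeric times vector are assumptions made for illustration, not something taken from the question.

```python
import numpy as np


def fill_gaps_along_time(cube, times):
    """Linearly interpolate NaNs along axis 0 (time), one pixel at a time.

    Assumes `cube` has shape (time, lat, lon) and `times` is a 1D,
    strictly increasing array of numeric time stamps.
    """
    filled = np.array(cube, dtype=float, order="C")  # C-contiguous copy
    flat = filled.reshape(filled.shape[0], -1)       # view: one column per pixel
    for j in range(flat.shape[1]):
        series = flat[:, j]
        ok = ~np.isnan(series)
        if ok.all() or ok.sum() < 2:
            continue  # nothing to fill, or too few points to interpolate
        # Gaps before the first or after the last valid time stay NaN.
        series[~ok] = np.interp(times[~ok], times[ok], series[ok],
                                left=np.nan, right=np.nan)
    return filled


# Tiny example: 4 time steps, 1 x 2 pixels.
times = np.arange(4.0)
cube = np.array([[[1.0, np.nan]],
                 [[np.nan, 2.0]],
                 [[3.0, np.nan]],
                 [[4.0, 6.0]]])
filled = fill_gaps_along_time(cube, times)
```

xarray also provides DataArray.interpolate_na(dim="time"), which fills gaps along one dimension directly on the DataArray; its limit and max_gap arguments restrict how far a gap is bridged.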
To fill missing values with N-dimensional interpolation, use one of the scipy interpolation methods for unstructured N-dimensional data.
Since you're interpolating to a regular grid, I'll demo the use of scipy.interpolate.griddata:
import xarray as xr, pandas as pd, numpy as np, scipy.interpolate
# create dummy data
x = y = z = np.linspace(0, 1, 5)
da = xr.DataArray(
np.sin(x).reshape(-1, 1, 1) * np.cos(y).reshape(1, -1, 1) + z.reshape(1, 1, -1),
dims=['x', 'y', 'z'],
coords=[x, y, z],
)
# randomly fill with NaNs
da = da.where(np.random.random(size=da.shape) > 0.1)
This looks like the following
In [11]: da
Out[11]:
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ]],
[[0.24740396, 0.49740396, 0.74740396, 0.99740396, nan],
[ nan, 0.48971277, nan, 0.98971277, 1.23971277],
[0.2171174 , 0.4671174 , 0.7171174 , 0.9671174 , 1.2171174 ],
[0.18102272, 0.43102272, 0.68102272, 0.93102272, 1.18102272],
[ nan, 0.38367293, 0.63367293, 0.88367293, 1.13367293]],
[[0.47942554, nan, 0.97942554, 1.22942554, 1.47942554],
[0.46452136, 0.71452136, 0.96452136, 1.21452136, 1.46452136],
[0.42073549, 0.67073549, 0.92073549, 1.17073549, nan],
[0.35079033, 0.60079033, 0.85079033, 1.10079033, 1.35079033],
[ nan, 0.50903472, 0.75903472, nan, 1.25903472]],
[[0.68163876, 0.93163876, 1.18163876, 1.43163876, 1.68163876],
[0.66044826, 0.91044826, nan, 1.41044826, 1.66044826],
[0.59819429, 0.84819429, 1.09819429, 1.34819429, nan],
[ nan, 0.74874749, 0.99874749, 1.24874749, nan],
[0.36829099, 0.61829099, 0.86829099, 1.11829099, 1.36829099]],
[[0.84147098, 1.09147098, nan, 1.59147098, 1.84147098],
[0.81531169, 1.06531169, 1.31531169, 1.56531169, 1.81531169],
[0.73846026, 0.98846026, 1.23846026, 1.48846026, nan],
[0.61569495, 0.86569495, 1.11569495, 1.36569495, 1.61569495],
[0.45464871, 0.70464871, 0.95464871, nan, 1.45464871]]])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
* y (y) float64 0.0 0.25 0.5 0.75 1.0
* z (z) float64 0.0 0.25 0.5 0.75 1.0
To use an unstructured scipy interpolator, you must convert the gridded data with missing values to vectors of 1D points with no missing values:
# ravel all points and find the valid ones
points = da.data.ravel()
valid = ~np.isnan(points)
points_valid = points[valid]
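# construct arrays of (x, y, z) points, masked to only include the valid points;
# indexing="ij" keeps the axis order of da (x, y, z) so that the raveled
# coordinates line up with the raveled data
xx, yy, zz = np.meshgrid(x, y, z, indexing="ij")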
xx, yy, zz = xx.ravel(), yy.ravel(), zz.ravel()
xxv = xx[valid]
yyv = yy[valid]
zzv = zz[valid]
# feed these into the interpolator, and also provide the target grid
interpolated = scipy.interpolate.griddata(np.stack([xxv, yyv, zzv]).T, points_valid, (xx, yy, zz), method="linear")
# reshape to match the original array and replace the DataArray values with
# the interpolated data
da.values = interpolated.reshape(da.shape)
This results in the array being filled
In [32]: da
Out[32]:
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ]],
[[0.24740396, 0.49740396, 0.74740396, 0.99740396, 1.23971277],
[0.23226068, 0.48971277, 0.73226068, 0.98971277, 1.23971277],
[0.2171174 , 0.4671174 , 0.7171174 , 0.9671174 , 1.2171174 ],
[0.18102272, 0.43102272, 0.68102272, 0.93102272, 1.18102272],
[0.12276366, 0.38367293, 0.63367293, 0.88367293, 1.13367293]],
[[0.47942554, 0.71452136, 0.97942554, 1.22942554, 1.47942554],
[0.46452136, 0.71452136, 0.96452136, 1.21452136, 1.46452136],
[0.42073549, 0.67073549, 0.92073549, 1.17073549, 1.40765584],
[0.35079033, 0.60079033, 0.85079033, 1.10079033, 1.35079033],
[0.24552733, 0.50903472, 0.75903472, 1.00903472, 1.25903472]],
[[0.68163876, 0.93163876, 1.18163876, 1.43163876, 1.68163876],
[0.66044826, 0.91044826, 1.16044826, 1.41044826, 1.66044826],
[0.59819429, 0.84819429, 1.09819429, 1.34819429, 1.57184545],
[0.48324264, 0.74874749, 0.99874749, 1.24874749, 1.48324264],
[0.36829099, 0.61829099, 0.86829099, 1.11829099, 1.36829099]],
[[0.84147098, 1.09147098, 1.34147098, 1.59147098, 1.84147098],
[0.81531169, 1.06531169, 1.31531169, 1.56531169, 1.81531169],
[0.73846026, 0.98846026, 1.23846026, 1.48846026, 1.71550332],
[0.61569495, 0.86569495, 1.11569495, 1.36569495, 1.61569495],
[0.45464871, 0.70464871, 0.95464871, 1.20464871, 1.45464871]]])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
* y (y) float64 0.0 0.25 0.5 0.75 1.0
* z (z) float64 0.0 0.25 0.5 0.75 1.0
Note that this filled the complete array because the convex hull of available points covers the whole array. If this is not the case, you may need a second step using nearest neighbor or fitting a spline to the filled data.
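The nearest-neighbor second step mentioned above could be sketched as follows. This is a small 2D illustration under assumed toy data (a plane with one corner and one interior point removed), not part of the original demo; the variable names are made up for the example.

```python
import numpy as np
from scipy.interpolate import griddata

x = np.linspace(0.0, 1.0, 4)
y = np.linspace(0.0, 1.0, 4)
xx, yy = np.meshgrid(x, y, indexing="ij")
grid = xx + 2.0 * yy   # a plane, so linear interpolation reproduces it
grid[0, 0] = np.nan    # corner: lies outside the convex hull of the valid points
grid[2, 2] = np.nan    # interior gap

valid = ~np.isnan(grid)
pts = np.column_stack([xx[valid], yy[valid]])
vals = grid[valid]

# First pass: linear interpolation, which leaves NaN outside the convex hull.
linear = griddata(pts, vals, (xx, yy), method="linear")
# Second pass: nearest neighbor, used only where the first pass failed.
nearest = griddata(pts, vals, (xx, yy), method="nearest")
filled = np.where(np.isnan(linear), nearest, linear)
```

The interior gap is filled by the linear pass, while the corner is only filled by the nearest-neighbor fallback.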
import pandas as pd
import numpy as np
d = {
'Fruit':['Guava','Orange','Lemon'],
'ID1':[1,2,11],
'ID2':[3,4,12],
'ID3':[5,6,np.nan],
'ID4':[7,8,14],
'ID5':[9,10,np.nan],
'ID6':[11,np.nan,np.nan],
'ID7':[13,np.nan,np.nan],
'ID8':[15,np.nan,np.nan],
'ID9':[17,np.nan,np.nan],
'Category':['Myrtaceae','Citrus','Citrus']
}
df = pd.DataFrame(data = d)
df
How can I convert the above DataFrame to the following dictionary?
Expected output:
{
'Myrtaceae': {'Guava': [1, 3, 5, 7, 9, 11, 13, 15, 17]},
'Citrus': {'Orange': [2, 4, 6, 8, 10, np.nan, np.nan, np.nan, np.nan], 'Lemon': [11, 12, np.nan, 14, np.nan, np.nan, np.nan, np.nan, np.nan]},
}
And how can I convert the dictionary back to a DataFrame?
Use a dict comprehension with groupby:
d = {k: v.set_index('Fruit').T.to_dict('list')
for k, v in df.set_index('Category').groupby(level=0)}
print (d)
{'Citrus': {'Orange': [2.0, 4.0, 6.0, 8.0, 10.0, nan, nan, nan, nan],
'Lemon': [11.0, 12.0, nan, 14.0, nan, nan, nan, nan, nan]},
'Myrtaceae': {'Guava': [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0]}}
Or:
d = {k: v.drop('Category', axis=1).set_index('Fruit').T.to_dict('list')
for k, v in df.groupby('Category')}
And then:
df = (pd.concat({k: pd.DataFrame(v) for k, v in d.items()}, axis=1)
.T
.rename_axis(('Category','Fruit'))
.rename(columns=lambda x: f'ID{x+1}')
.reset_index())
print (df)
Category Fruit ID1 ID2 ID3 ID4 ID5 ID6 ID7 ID8 ID9
0 Citrus Orange 2.0 4.0 6.0 8.0 10.0 NaN NaN NaN NaN
1 Citrus Lemon 11.0 12.0 NaN 14.0 NaN NaN NaN NaN NaN
2 Myrtaceae Guava 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 17.0
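As a cross-check of the round trip, here is a reduced sketch with only two ID columns. The reverse direction uses a plain list of row dicts instead of concat, which is an alternative chosen for illustration, and the names nested, rows and restored are made up here. Note that the integer columns come back as floats because each group is transposed together with NaN-bearing columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Guava", "Orange", "Lemon"],
    "ID1": [1, 2, 11],
    "ID2": [3, 4, np.nan],
    "Category": ["Myrtaceae", "Citrus", "Citrus"],
})

# DataFrame -> nested dict {category: {fruit: [values...]}}
nested = {
    cat: grp.drop(columns="Category").set_index("Fruit").T.to_dict("list")
    for cat, grp in df.groupby("Category")
}

# nested dict -> DataFrame, one row per (category, fruit) pair
rows = [
    {"Category": cat, "Fruit": fruit,
     **{f"ID{i + 1}": v for i, v in enumerate(vals)}}
    for cat, fruits in nested.items()
    for fruit, vals in fruits.items()
]
restored = pd.DataFrame(rows)
```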
I have this sample Dataset containing worldwide air temperature and, more importantly, a mask, land, marking land/non-water areas.
<xarray.Dataset>
Dimensions: (lat: 55, lon: 143, time: 5)
Coordinates:
* time (time) datetime64[ns] 2016-01-01 2016-01-02 2016-01-03 ...
* lat (lat) float64 -52.5 -50.0 -47.5 -45.0 -42.5 -40.0 -37.5 -35.0 ...
* lon (lon) float64 -177.5 -175.0 -172.5 -170.0 -167.5 -165.0 -162.5 ...
land (lat, lon) bool False False False False False False False False ...
Data variables:
airt (time, lat, lon) float64 7.952 7.61 7.389 7.267 7.124 6.989 ...
I can now mask the oceans and plot it
dry_areas = ds.where(ds.land)
dry_areas.airt.plot()
dry_areas looks like this
<xarray.Dataset>
Dimensions: (lat: 55, lon: 143)
Coordinates:
* lat (lat) float64 -52.5 -50.0 -47.5 -45.0 -42.5 -40.0 -37.5 -35.0 ...
* lon (lon) float64 -177.5 -175.0 -172.5 -170.0 -167.5 -165.0 -162.5 ...
land (lat, lon) bool False False False False False False False False ...
Data variables:
airt (lat, lon) float64 nan nan nan nan nan nan nan nan nan nan nan ...
How can I now get the coordinates of all non-NaN values?
dry_areas.coords gives me the bounding box, and I can't get lat and lon into the (55, 143) shape so that I could apply the mask to them.
The only working workaround I could find is
dry_areas.to_dataframe().dropna().reset_index()[['lat', 'lon']].values, which does not feel very lean and clean.
I feel this should be quite simple; however, I am clearly not a numpy/matrix ninja.
Best solution so far
This is the shortest I could come up with so far:
import numpy as np
import numpy.ma as ma

lon, lat = np.meshgrid(ds.coords['lon'], ds.coords['lat'])
lat_masked = ma.array(lat, mask=dry_areas.airt.fillna(False))
lon_masked = ma.array(lon, mask=dry_areas.airt.fillna(False))
land_coordinates = zip(lat_masked[lat_masked.mask].data, lon_masked[lon_masked.mask].data)
You can use .stack to get an array of coord pairs of the non-null values:
In [31]: da=xr.DataArray(np.arange(20).reshape(5,4))
In [33]: da_nans = da.where(da % 2 == 1)
In [34]: da_nans
Out[34]:
<xarray.DataArray (dim_0: 5, dim_1: 4)>
array([[ nan, 1., nan, 3.],
[ nan, 5., nan, 7.],
[ nan, 9., nan, 11.],
[ nan, 13., nan, 15.],
[ nan, 17., nan, 19.]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3 4
* dim_1 (dim_1) int64 0 1 2 3
In [35]: da_stacked = da_nans.stack(x=['dim_0','dim_1'])
In [36]: da_stacked
Out[36]:
<xarray.DataArray (x: 20)>
array([ nan, 1., nan, 3., nan, 5., nan, 7., nan, 9., nan,
11., nan, 13., nan, 15., nan, 17., nan, 19.])
Coordinates:
* x (x) object (0, 0) (0, 1) (0, 2) (0, 3) (1, 0) (1, 1) (1, 2) ...
In [37]: da_stacked[da_stacked.notnull()]
Out[37]:
<xarray.DataArray (x: 10)>
array([ 1., 3., 5., 7., 9., 11., 13., 15., 17., 19.])
Coordinates:
* x (x) object (0, 1) (0, 3) (1, 1) (1, 3) (2, 1) (2, 3) (3, 1) ...
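Applied to a (lat, lon) field like dry_areas.airt from the question, the same idea might look like the sketch below. The small airt array and the stacked dimension name point are made up for illustration, and dropna("point") is used in place of the boolean indexing shown above; both should select the same non-null entries.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for dry_areas.airt: NaN marks masked (ocean) cells.
airt = xr.DataArray(
    np.array([[np.nan, 281.0],
              [280.0, np.nan],
              [np.nan, np.nan]]),
    dims=["lat", "lon"],
    coords={"lat": [-5.0, 0.0, 5.0], "lon": [10.0, 20.0]},
    name="airt",
)

# Collapse (lat, lon) into one dimension and drop the NaN entries.
stacked = airt.stack(point=["lat", "lon"])
valid = stacked.dropna("point")

# One (lat, lon) row per non-NaN cell.
land_coordinates = np.column_stack([valid["lat"].values, valid["lon"].values])
```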
I have a numpy array with shape [t, z, x, y] representing an hourly time series of three-dimensional data. The axes of the array are time, vertical coordinate, horizontal coordinate 1 and horizontal coordinate 2. There is also a t-element list of hourly datetime.datetime timestamps.
I want to calculate the daily mid-day means for each day. This will be an [nday, Z, X, Y] array.
I'm trying to find a pythonic way to do this. I've written something with a bunch of for loops that works but seems slow, inflexible, and verbose.
It appears to me that Pandas is not a solution for me because my time series data are three-dimensional. I'd be happy to be proven wrong.
I've come up with this, using itertools, to find the midday timestamps and group them by date, but I'm coming up short trying to map over the groups to find the means.
import itertools
from datetime import datetime

import numpy as np
import pandas as pd

# create 72 hours of pseudo-data with 3 vertical levels and a 4 by 4
# horizontal grid.
data = np.zeros((72, 3, 4, 4))
t = pd.date_range(datetime(2008, 7, 1), freq='1H', periods=72)
for i in range(data.shape[0]):
    data[i, ...] = i

# find the timestamps that are "midday" in North America. We'll
# define midday as between 15:00 and 23:00 UTC, which is 10:00 EST to
# 15:00 PST.
def is_midday(this_t):
    return 15 <= this_t.hour <= 23

# group the midday timestamps by date
for dt, grp in itertools.groupby(filter(is_midday, t),
                                 key=lambda x: x.date()):
    print('date ' + str(dt))
    for g in grp:
        print(g)

# find means of mid-day data by date
data_list = np.split(data, data.shape[0])
grps = itertools.groupby(filter(is_midday, t),
                         key=lambda x: x.date())
# how to apply map (or something else) to data_list and
# grps? Or somehow split data along axis 0 according to grps?
You can shove pretty much any object into a pandas structure. Normally not recommended, but in this case it might work for you.
Create a Series indexed by time, with each element a 3-d numpy array
In [117]: s = pd.Series([data[i] for i in range(data.shape[0])],index=t)
In [118]: s
Out[118]:
2008-07-01 00:00:00 [[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], ...
2008-07-01 01:00:00 [[[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], ...
2008-07-01 02:00:00 [[[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0], ...
2008-07-01 03:00:00 [[[3.0, 3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0], ...
2008-07-01 04:00:00 [[[4.0, 4.0, 4.0, 4.0], [4.0, 4.0, 4.0, 4.0], ...
2008-07-01 05:00:00 [[[5.0, 5.0, 5.0, 5.0], [5.0, 5.0, 5.0, 5.0], ...
2008-07-01 06:00:00 [[[6.0, 6.0, 6.0, 6.0], [6.0, 6.0, 6.0, 6.0], ...
2008-07-01 07:00:00 [[[7.0, 7.0, 7.0, 7.0], [7.0, 7.0, 7.0, 7.0], ...
2008-07-01 08:00:00 [[[8.0, 8.0, 8.0, 8.0], [8.0, 8.0, 8.0, 8.0], ...
2008-07-01 09:00:00 [[[9.0, 9.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0], ...
2008-07-01 10:00:00 [[[10.0, 10.0, 10.0, 10.0], [10.0, 10.0, 10.0,...
2008-07-01 11:00:00 [[[11.0, 11.0, 11.0, 11.0], [11.0, 11.0, 11.0,...
2008-07-01 12:00:00 [[[12.0, 12.0, 12.0, 12.0], [12.0, 12.0, 12.0,...
2008-07-01 13:00:00 [[[13.0, 13.0, 13.0, 13.0], [13.0, 13.0, 13.0,...
2008-07-01 14:00:00 [[[14.0, 14.0, 14.0, 14.0], [14.0, 14.0, 14.0,...
...
2008-07-03 09:00:00 [[[57.0, 57.0, 57.0, 57.0], [57.0, 57.0, 57.0,...
2008-07-03 10:00:00 [[[58.0, 58.0, 58.0, 58.0], [58.0, 58.0, 58.0,...
2008-07-03 11:00:00 [[[59.0, 59.0, 59.0, 59.0], [59.0, 59.0, 59.0,...
2008-07-03 12:00:00 [[[60.0, 60.0, 60.0, 60.0], [60.0, 60.0, 60.0,...
2008-07-03 13:00:00 [[[61.0, 61.0, 61.0, 61.0], [61.0, 61.0, 61.0,...
2008-07-03 14:00:00 [[[62.0, 62.0, 62.0, 62.0], [62.0, 62.0, 62.0,...
2008-07-03 15:00:00 [[[63.0, 63.0, 63.0, 63.0], [63.0, 63.0, 63.0,...
2008-07-03 16:00:00 [[[64.0, 64.0, 64.0, 64.0], [64.0, 64.0, 64.0,...
2008-07-03 17:00:00 [[[65.0, 65.0, 65.0, 65.0], [65.0, 65.0, 65.0,...
2008-07-03 18:00:00 [[[66.0, 66.0, 66.0, 66.0], [66.0, 66.0, 66.0,...
2008-07-03 19:00:00 [[[67.0, 67.0, 67.0, 67.0], [67.0, 67.0, 67.0,...
2008-07-03 20:00:00 [[[68.0, 68.0, 68.0, 68.0], [68.0, 68.0, 68.0,...
2008-07-03 21:00:00 [[[69.0, 69.0, 69.0, 69.0], [69.0, 69.0, 69.0,...
2008-07-03 22:00:00 [[[70.0, 70.0, 70.0, 70.0], [70.0, 70.0, 70.0,...
2008-07-03 23:00:00 [[[71.0, 71.0, 71.0, 71.0], [71.0, 71.0, 71.0,...
Freq: H, Length: 72
Define your aggregating function. You need to access .values, which returns the objects stored inside the group; concatenating them coerces the result back to an actual numpy array, which you can then aggregate (mean in this case).
In [119]: def f(g,grp):
.....: return np.concatenate(grp.values).mean()
.....:
Since I'm not sure what your end output should look like, I just create a time-based grouper manually (this is essentially a resample). It doesn't do anything with the final results (it's just a list of the aggregated values).
In [121]: [ f(g,grp) for g, grp in s.groupby(pd.Grouper(freq='D')) ]
Out[121]: [11.5, 35.5, 59.5]
You can get reasonably fancy here and, say, return a pandas object (and potentially concat them).
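If the goal is the [nday, Z, X, Y] array of midday means from the question, one alternative that keeps the data in NumPy is a boolean mask on the hour plus a group index per date. This is a sketch under the question's definition of midday (15:00 to 23:00 UTC), not the approach shown above; the names midday, day_idx and daily_means are made up for the example.

```python
import numpy as np
import pandas as pd

# 72 hours of pseudo-data: every value at hour i equals i.
data = np.zeros((72, 3, 4, 4))
data += np.arange(72).reshape(-1, 1, 1, 1)
t = pd.Timestamp("2008-07-01") + pd.to_timedelta(np.arange(72), unit="h")

# Boolean mask of the midday hours.
midday = np.asarray((t.hour >= 15) & (t.hour <= 23))
midday_data = data[midday]

# Integer label per midday timestamp identifying its calendar day.
day_idx, days = pd.factorize(t[midday].normalize())

# Average over the time axis within each day, keeping the (Z, X, Y) axes.
daily_means = np.stack(
    [midday_data[day_idx == k].mean(axis=0) for k in range(len(days))]
)
```

Here daily_means has shape (3, 3, 4, 4), one [Z, X, Y] block per day, and days holds the matching dates.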