Efficient way to create dictionary of symmetric matrix with column-row pair as key, and corresponding value in matrix as value

I want to create a dictionary of the form (row, column): value from a symmetric matrix (like a distance matrix) as depicted below, without taking into account the NaN values or the zeros (the zeros are on the diagonal). The matrix is a pandas DataFrame.
Material 100051 100120 100138 100179 100253 100265 100281
100051 0.0 0.953488 0.959302 0.953488 0.959302 0.953488 0.953488
100120 NaN 0.000000 0.965116 0.953488 0.959302 0.959302 0.959302
100138 NaN NaN 0.000000 0.959302 0.970930 0.970930 0.970930
100179 NaN NaN NaN 0.000000 0.959302 0.953488 0.953488
100253 NaN NaN NaN NaN 0.000000 0.976744 0.976744
... ... ... ... ... ... ... ...
So a dictionary that looks like:
{(100120, 100051): 0.953488, (100138, 100051): 0.959302, ...}
To create the dictionary, you can iterate over both rows and columns like this:
jacsim_values = {}
for i in jacsim_matrix2:
    for j in jacsim_matrix2:
        if jacsim_matrix2[i][j] != 0:
            jacsim_values[i, j] = jacsim_matrix2[i][j]
But I am looking for something more efficient. This takes quite some time for the size of the matrix. However, I could not find how to do so. Is there somebody who can help me out?

IIUC, use DataFrame.stack (keys are (row, column)) or DataFrame.unstack (keys are (column, row)), followed by Series.to_dict:
df.set_index('Material').rename(int, axis=1).unstack().to_dict()
{(100051, 100051): 0.0,
(100051, 100120): nan,
(100051, 100138): nan,
(100051, 100179): nan,
(100051, 100253): nan,
(100120, 100051): 0.9534879999999999,
(100120, 100120): 0.0,
(100120, 100138): nan,
(100120, 100179): nan,
(100120, 100253): nan,
(100138, 100051): 0.9593020000000001,
(100138, 100120): 0.965116,
(100138, 100138): 0.0,
(100138, 100179): nan,
(100138, 100253): nan,
(100179, 100051): 0.9534879999999999,
(100179, 100120): 0.9534879999999999,
(100179, 100138): 0.9593020000000001,
(100179, 100179): 0.0,
(100179, 100253): nan,
(100253, 100051): 0.9593020000000001,
(100253, 100120): 0.9593020000000001,
(100253, 100138): 0.97093,
(100253, 100179): 0.9593020000000001,
(100253, 100253): 0.0,
(100265, 100051): 0.9534879999999999,
(100265, 100120): 0.9593020000000001,
(100265, 100138): 0.97093,
(100265, 100179): 0.9534879999999999,
(100265, 100253): 0.9767440000000001,
(100281, 100051): 0.9534879999999999,
(100281, 100120): 0.9593020000000001,
(100281, 100138): 0.97093,
(100281, 100179): 0.9534879999999999,
(100281, 100253): 0.9767440000000001}
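The output above still contains the NaN entries and the zero diagonal that the question wanted to leave out. A minimal sketch of one way to drop them before calling to_dict, assuming a small made-up frame with the same upper-triangular layout (the three labels and the names `s` and `pairs` are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

# Small stand-in for the matrix in the question (values copied from its first rows).
labels = [100051, 100120, 100138]
df = pd.DataFrame(
    [[0.0, 0.953488, 0.959302],
     [np.nan, 0.0, 0.965116],
     [np.nan, np.nan, 0.0]],
    index=labels,
    columns=labels,
)

# unstack() yields a Series indexed by (column, row) pairs.
s = df.unstack()

# Keep only entries that are neither NaN nor zero, then convert to a dict.
s = s[s.notna() & (s != 0)]
pairs = s.to_dict()
print(pairs)
```

The boolean mask is evaluated on the whole Series at once, so no Python-level loop over the cells is needed.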

Related

How to use scipy.interpolate.interpn function with xarray (3d), to fill nan gaps? Current Error [The points in dimension 0 must be strictly ascending]

I am a bit frustrated as I could not find a solution to my problem, which seems easy to do in R with the package gapfill but is more difficult in Python.
Coming to my problem: I have an xarray (3d) with the dimensions latitude, longitude and time. What I want is to interpolate the NaN values in each raster/array (caused by clouds and other distortions). The NaN values form blocks (due to the clouds) and are sometimes relatively big. My idea is to interpolate not only with the neighbouring pixels of each timestep but also with the timesteps before and after (the assumption is that a pixel has a relatively similar value some days before and some days after, as the land cover does not change that fast). My aim is to do a linear interpolation over time at the same pixel position. (I am also not sure how to define in the interpn function how many timesteps before and after should be used.)
I found different options to do that, but none has worked so far. The most promising method I found is the interpolate.interpn function from the scipy package. This function uses a numpy array, not an xarray. My attempt:
import numpy as np
from scipy import interpolate
# change from xarray to numpy
my_array_np = my_array.to_numpy()
# label dimensions (what is done when building a numpy array with meshgrid)
x = my_array_np[0]
y = my_array_np[1]
z = my_array_np[2]
# get index of nan values
nanIndex = np.isnan(my_array_np).nonzero()
nanIndex
# name dimensions of nan values
xc = nanIndex[0]
yc = nanIndex[1]
zc = nanIndex[2]
# For using the scipy interpolate.interpn function:
# points = the regular grid - in my case x, y, z
# values = the data on the regular grid - in my case my array (my_array_np)
# points_nan = the points to evaluate in the 3D grid - in my case xc, yc, zc
points = (x, y, z)  # dimensions
points_nan = (xc, yc, zc)  # nan dimensions
print(interpolate.interpn(points, my_array_np, points_nan))
The error I get now is:
"The points in dimension 0 must be strictly ascending"
Where am I wrong? Thanks for your help in advance! If you have another solution besides scipy that also solves my problem, I am happy to hear about it!
This is how my array looks:
array([[[ nan, nan, nan, ..., 279.64 , 282.16998,
279.66998],
[277.62 , 277.52 , 277.88 , ..., 281.75998, 281.72 ,
281.66 ],
[277.38 , 277.75 , 277.88998, ..., 281.75998, 281.75998,
280.91998],
...,
[ nan, nan, nan, ..., 280.72998, 280.33 ,
280.94 ],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., 272.22 , 271.54 ,
271.02 ],
[280.02 , 280.44998, 281.18 , ..., 271.47998, 271.88 ,
272.03 ],
[280.32 , 281. , 281.27 , ..., 270.83 , 271.58 ,
272.03 ],
...,
[ nan, nan, nan, ..., 290.34 , 290.25 ,
288.365 ],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., nan, nan,
nan],
[276.44998, 276.19998, 276.19 , ..., nan, nan,
nan],
[276.50998, 276.79 , 276.58 , ..., nan, nan,
nan],
...,
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
...,
[[ nan, nan, nan, ..., 276.38998, 276.44 ,
275.72998],
[ nan, nan, nan, ..., 276.55 , 276.81 ,
276.72998],
[ nan, nan, nan, ..., 279.74 , 277.11 ,
276.97 ],
...,
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., 277.38 , 278.08 ,
277.79 ],
[279.66998, 280.00998, 283.13 , ..., 277.34 , 277.41998,
277.62 ],
[ nan, 277.41 , 277.41 , ..., 277.825 , 277.31 ,
277.52 ],
...,
[ nan, nan, nan, ..., 276.52 , nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]],
[[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
...,
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan],
[ nan, nan, nan, ..., nan, nan,
nan]]], dtype=float32)

interpn cannot be used to fill gaps in a regular grid - interpn is a fast method for interpolating a full regular grid (with no gaps) to different coordinates.
To fill missing values with N-dimensional interpolation, use one of the scipy interpolation methods for unstructured N-dimensional data.
Since you're interpolating to a regular grid, I'll demo the use of scipy.interpolate.griddata:
import xarray as xr, pandas as pd, numpy as np, scipy.interpolate
# create dummy data
x = y = z = np.linspace(0, 1, 5)
da = xr.DataArray(
    np.sin(x).reshape(-1, 1, 1) * np.cos(y).reshape(1, -1, 1) + z.reshape(1, 1, -1),
    dims=['x', 'y', 'z'],
    coords=[x, y, z],
)
# randomly fill with NaNs
da = da.where(np.random.random(size=da.shape) > 0.1)
This looks like the following
In [11]: da
Out[11]:
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ]],
[[0.24740396, 0.49740396, 0.74740396, 0.99740396, nan],
[ nan, 0.48971277, nan, 0.98971277, 1.23971277],
[0.2171174 , 0.4671174 , 0.7171174 , 0.9671174 , 1.2171174 ],
[0.18102272, 0.43102272, 0.68102272, 0.93102272, 1.18102272],
[ nan, 0.38367293, 0.63367293, 0.88367293, 1.13367293]],
[[0.47942554, nan, 0.97942554, 1.22942554, 1.47942554],
[0.46452136, 0.71452136, 0.96452136, 1.21452136, 1.46452136],
[0.42073549, 0.67073549, 0.92073549, 1.17073549, nan],
[0.35079033, 0.60079033, 0.85079033, 1.10079033, 1.35079033],
[ nan, 0.50903472, 0.75903472, nan, 1.25903472]],
[[0.68163876, 0.93163876, 1.18163876, 1.43163876, 1.68163876],
[0.66044826, 0.91044826, nan, 1.41044826, 1.66044826],
[0.59819429, 0.84819429, 1.09819429, 1.34819429, nan],
[ nan, 0.74874749, 0.99874749, 1.24874749, nan],
[0.36829099, 0.61829099, 0.86829099, 1.11829099, 1.36829099]],
[[0.84147098, 1.09147098, nan, 1.59147098, 1.84147098],
[0.81531169, 1.06531169, 1.31531169, 1.56531169, 1.81531169],
[0.73846026, 0.98846026, 1.23846026, 1.48846026, nan],
[0.61569495, 0.86569495, 1.11569495, 1.36569495, 1.61569495],
[0.45464871, 0.70464871, 0.95464871, nan, 1.45464871]]])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
* y (y) float64 0.0 0.25 0.5 0.75 1.0
* z (z) float64 0.0 0.25 0.5 0.75 1.0
To use an unstructured scipy interpolator, you must convert the gridded data with missing values to vectors of 1D points with no missing values:
# ravel all points and find the valid ones
points = da.data.ravel()
valid = ~np.isnan(points)
points_valid = points[valid]
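# construct arrays of (x, y, z) points, masked to only include the valid points
# (indexing="ij" keeps the axis order aligned with da.data; the default "xy"
# swaps the first two axes, which is harmless here only because x and y are identical)
xx, yy, zz = np.meshgrid(x, y, z, indexing="ij")
xx, yy, zz = xx.ravel(), yy.ravel(), zz.ravel()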
xxv = xx[valid]
yyv = yy[valid]
zzv = zz[valid]
# feed these into the interpolator, and also provide the target grid
interpolated = scipy.interpolate.griddata(np.stack([xxv, yyv, zzv]).T, points_valid, (xx, yy, zz), method="linear")
# reshape to match the original array and replace the DataArray values with
# the interpolated data
da.values = interpolated.reshape(da.shape)
This results in the array being filled
In [32]: da
Out[32]:
<xarray.DataArray (x: 5, y: 5, z: 5)>
array([[[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ],
[0. , 0.25 , 0.5 , 0.75 , 1. ]],
[[0.24740396, 0.49740396, 0.74740396, 0.99740396, 1.23971277],
[0.23226068, 0.48971277, 0.73226068, 0.98971277, 1.23971277],
[0.2171174 , 0.4671174 , 0.7171174 , 0.9671174 , 1.2171174 ],
[0.18102272, 0.43102272, 0.68102272, 0.93102272, 1.18102272],
[0.12276366, 0.38367293, 0.63367293, 0.88367293, 1.13367293]],
[[0.47942554, 0.71452136, 0.97942554, 1.22942554, 1.47942554],
[0.46452136, 0.71452136, 0.96452136, 1.21452136, 1.46452136],
[0.42073549, 0.67073549, 0.92073549, 1.17073549, 1.40765584],
[0.35079033, 0.60079033, 0.85079033, 1.10079033, 1.35079033],
[0.24552733, 0.50903472, 0.75903472, 1.00903472, 1.25903472]],
[[0.68163876, 0.93163876, 1.18163876, 1.43163876, 1.68163876],
[0.66044826, 0.91044826, 1.16044826, 1.41044826, 1.66044826],
[0.59819429, 0.84819429, 1.09819429, 1.34819429, 1.57184545],
[0.48324264, 0.74874749, 0.99874749, 1.24874749, 1.48324264],
[0.36829099, 0.61829099, 0.86829099, 1.11829099, 1.36829099]],
[[0.84147098, 1.09147098, 1.34147098, 1.59147098, 1.84147098],
[0.81531169, 1.06531169, 1.31531169, 1.56531169, 1.81531169],
[0.73846026, 0.98846026, 1.23846026, 1.48846026, 1.71550332],
[0.61569495, 0.86569495, 1.11569495, 1.36569495, 1.61569495],
[0.45464871, 0.70464871, 0.95464871, 1.20464871, 1.45464871]]])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
* y (y) float64 0.0 0.25 0.5 0.75 1.0
* z (z) float64 0.0 0.25 0.5 0.75 1.0
Note that this filled the complete array because the convex hull of available points covers the whole array. If this is not the case, you may need a second step using nearest neighbor or fitting a spline to the filled data.
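The second step mentioned above could look roughly like the following. This is only a sketch under assumptions: it uses a small made-up 2D grid, and the names `linear`, `nearest` and `filled` are illustrative. A missing corner lies outside the convex hull of the valid points, so the linear pass leaves it as NaN and a nearest-neighbour pass is used to fill what remains.

```python
import numpy as np
import scipy.interpolate

# Made-up 2D grid holding a simple linear field.
x = y = np.linspace(0, 1, 4)
xx, yy = np.meshgrid(x, y, indexing="ij")
values = xx + 2 * yy
values[0, 0] = np.nan  # corner: outside the convex hull of the valid points
values[1, 2] = np.nan  # interior gap

valid = ~np.isnan(values)
pts_valid = np.column_stack([xx[valid], yy[valid]])

# First pass: linear interpolation; points outside the hull stay NaN.
linear = scipy.interpolate.griddata(pts_valid, values[valid], (xx, yy), method="linear")

# Second pass: nearest neighbour, used only where the linear pass left NaN.
nearest = scipy.interpolate.griddata(pts_valid, values[valid], (xx, yy), method="nearest")
filled = np.where(np.isnan(linear), nearest, linear)
```

Nearest-neighbour filling is crude for large gaps at the edge of the domain, so whether it is acceptable depends on the data.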

How to (correctly) merge 2 Pandas DataFrames and scatter-plot

Thank you for your answers, in advance.
My end goal is to produce a scatter-plot - corruption as an explanatory variable (x axis, from a DataFrame 'corr') and inequality as a dependent variable (y axis, from a DataFrame 'inq').
A hint to produce an informative table (DataFrame) by joining these two Dataframes would be much appreciated
I have a dataframe 'inq' for a country inequality (GINI index) and another one 'corr' for country corruption index.
from numpy import nan
inq = pd.DataFrame(
    {
        "country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
        "1975": {0: nan, 1: nan, 2: nan},
        "1976": {0: nan, 1: nan, 2: nan},
        "2017": {0: nan, 1: 33.2, 2: nan},
        "2018": {0: 51.3, 1: nan, 2: nan},
    }
)
corr = pd.DataFrame(
    {
        "country": {0: "Afghanistan", 1: "Angola", 2: "Albania"},
        "1975": {0: 44.8, 1: 48.1, 2: 75.1},
        "1976": {0: 44.8, 1: 48.1, 2: 75.1},
        "2018": {0: 24.2, 1: 40.4, 2: 28.4},
        "2019": {0: 40.5, 1: 37.6, 2: 35.9},
    }
)
I concatenate and manipulate them
cm = pd.concat([inq, corr], axis=0, keys=["Inequality", "Corruption"]).reset_index(
    level=1, drop=True
)
and get a new DataFrame
pd.DataFrame(
    {
        "indicator": {0: "Inequality", 1: "Inequality", 2: "Inequality"},
        "country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
        "1967": {0: nan, 1: nan, 2: nan},
        "1969": {0: nan, 1: nan, 2: nan},
        "2018": {0: 51.3, 1: nan, 2: nan},
        "2019": {0: nan, 1: nan, 2: nan},
    }
)

You should concatenate your dataframe in a different way:
df = (pd.concat([inq.set_index('country'),
                 corr.set_index('country')],
                axis=1,
                keys=["Inequality", "Corruption"]
                )
        .stack(level=1)
      )
                  Inequality  Corruption
country
Angola      1975         NaN        48.1
            1976         NaN        48.1
            2018        51.3        40.4
            2019         NaN        37.6
Albania     1975         NaN        75.1
            1976         NaN        75.1
            2017        33.2         NaN
            2018         NaN        28.4
            2019         NaN        35.9
Afghanistan 1975         NaN        44.8
            1976         NaN        44.8
            2018         NaN        24.2
            2019         NaN        40.5
Then to plot:
df.plot.scatter(x='Corruption', y='Inequality')
NB. there is only one point as most of your data is NaN
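Since the question title asks about merging, here is a hedged alternative sketch that reaches a similar long table with melt and merge instead of the concat/stack approach above. It uses a subset of the question's data; the names `inq_long`, `corr_long` and `merged` are illustrative. Dropping incomplete rows makes it explicit which country/year pairs would actually appear in the scatter plot.

```python
import pandas as pd
from numpy import nan

# Subset of the data shown in the question.
inq = pd.DataFrame({"country": ["Angola", "Albania"],
                    "2017": [nan, 33.2],
                    "2018": [51.3, nan]})
corr = pd.DataFrame({"country": ["Angola", "Albania"],
                     "2018": [40.4, 28.4],
                     "2019": [37.6, 35.9]})

# Reshape each frame to long form: one row per (country, year).
inq_long = inq.melt(id_vars="country", var_name="year", value_name="Inequality")
corr_long = corr.melt(id_vars="country", var_name="year", value_name="Corruption")

# Join on country and year, then keep only rows where both indicators exist.
merged = inq_long.merge(corr_long, on=["country", "year"], how="inner").dropna()
print(merged)

# Plotting requires matplotlib:
# merged.plot.scatter(x="Corruption", y="Inequality")
```

With this subset only Angola in 2018 has both indicators, which matches the single point noted above.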

How to create a nested dictionary from pandas dataframe and again convert it to dataframe?

import pandas as pd
import numpy as np
d = {
    'Fruit':['Guava','Orange','Lemon'],
    'ID1':[1,2,11],
    'ID2':[3,4,12],
    'ID3':[5,6,np.nan],
    'ID4':[7,8,14],
    'ID5':[9,10,np.nan],
    'ID6':[11,np.nan,np.nan],
    'ID7':[13,np.nan,np.nan],
    'ID8':[15,np.nan,np.nan],
    'ID9':[17,np.nan,np.nan],
    'Category':['Myrtaceae','Citrus','Citrus']
}
df = pd.DataFrame(data = d)
df
How can I convert the above dataframe to the following dictionary?
Expected Output :
{
    'Myrtaceae': {'Guava': [1,3,5,7,9,11,13,15,17]},
    'Citrus': {'Orange': [2,4,6,8,10,np.nan,np.nan,np.nan,np.nan],
               'Lemon': [11,12,np.nan,14,np.nan,np.nan,np.nan,np.nan,np.nan]},
}
How can I then convert the dictionary back to a dataframe?

Use a dict comprehension with groupby:
d = {k: v.set_index('Fruit').T.to_dict('list')
     for k, v in df.set_index('Category').groupby(level=0)}
print (d)
{'Citrus': {'Orange': [2.0, 4.0, 6.0, 8.0, 10.0, nan, nan, nan, nan],
            'Lemon': [11.0, 12.0, nan, 14.0, nan, nan, nan, nan, nan]},
 'Myrtaceae': {'Guava': [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0]}}
Or:
d = {k: v.drop('Category', axis=1).set_index('Fruit').T.to_dict('list')
     for k, v in df.groupby('Category')}
To convert the dictionary back to a DataFrame:
df = (pd.concat({k: pd.DataFrame(v) for k, v in d.items()}, axis=1)
        .T
        .rename_axis(('Category','Fruit'))
        .rename(columns=lambda x: f'ID{x+1}')
        .reset_index())
print (df)
Category Fruit ID1 ID2 ID3 ID4 ID5 ID6 ID7 ID8 ID9
0 Citrus Orange 2.0 4.0 6.0 8.0 10.0 NaN NaN NaN NaN
1 Citrus Lemon 11.0 12.0 NaN 14.0 NaN NaN NaN NaN NaN
2 Myrtaceae Guava 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 17.0

Get coordinates of non-nan values of xarray Dataset

I have this sample Dataset containing worldwide air temperature, and more importantly, a mask land, marking land/non-water areas.
<xarray.Dataset>
Dimensions: (lat: 55, lon: 143, time: 5)
Coordinates:
* time (time) datetime64[ns] 2016-01-01 2016-01-02 2016-01-03 ...
* lat (lat) float64 -52.5 -50.0 -47.5 -45.0 -42.5 -40.0 -37.5 -35.0 ...
* lon (lon) float64 -177.5 -175.0 -172.5 -170.0 -167.5 -165.0 -162.5 ...
land (lat, lon) bool False False False False False False False False ...
Data variables:
airt (time, lat, lon) float64 7.952 7.61 7.389 7.267 7.124 6.989 ...
I can now mask the oceans and plot it
dry_areas = ds.where(ds.land)
dry_areas.airt.plot()
dry_areas looks like this
<xarray.Dataset>
Dimensions: (lat: 55, lon: 143)
Coordinates:
* lat (lat) float64 -52.5 -50.0 -47.5 -45.0 -42.5 -40.0 -37.5 -35.0 ...
* lon (lon) float64 -177.5 -175.0 -172.5 -170.0 -167.5 -165.0 -162.5 ...
land (lat, lon) bool False False False False False False False False ...
Data variables:
airt (lat, lon) float64 nan nan nan nan nan nan nan nan nan nan nan ...
How can I now get the coordinates for all non-nan values?
dry_areas.coords gives me the bounding box and I can't get lat and lon into the (55, 143) shape so I could apply the mask on.
The only working workaround I could find is
dry_areas.to_dataframe().dropna().reset_index()[['lat', 'lon']].values, which does not feel very lean and clean.
I feel this should be quite simple; however, I am clearly not a numpy/matrix ninja.
Best solution so far
This is the shortest I could come up with so far:
lon, lat = np.meshgrid(ds.coords['lon'], ds.coords['lat'])
lat_masked = ma.array(lat, mask=dry_areas.airt.fillna(False))
lon_masked = ma.array(lon, mask=dry_areas.airt.fillna(False))
land_coordinates = zip(lat_masked[lat_masked.mask].data, lon_masked[lon_masked.mask].data)

You can use .stack to get an array of coord pairs of the non-null values:
In [31]: da=xr.DataArray(np.arange(20).reshape(5,4))
In [33]: da_nans = da.where(da % 2 == 1)
In [34]: da_nans
Out[34]:
<xarray.DataArray (dim_0: 5, dim_1: 4)>
array([[ nan, 1., nan, 3.],
[ nan, 5., nan, 7.],
[ nan, 9., nan, 11.],
[ nan, 13., nan, 15.],
[ nan, 17., nan, 19.]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3 4
* dim_1 (dim_1) int64 0 1 2 3
In [35]: da_stacked = da_nans.stack(x=['dim_0','dim_1'])
In [36]: da_stacked
Out[36]:
<xarray.DataArray (x: 20)>
array([ nan, 1., nan, 3., nan, 5., nan, 7., nan, 9., nan,
11., nan, 13., nan, 15., nan, 17., nan, 19.])
Coordinates:
* x (x) object (0, 0) (0, 1) (0, 2) (0, 3) (1, 0) (1, 1) (1, 2) ...
In [37]: da_stacked[da_stacked.notnull()]
Out[37]:
<xarray.DataArray (x: 10)>
array([ 1., 3., 5., 7., 9., 11., 13., 15., 17., 19.])
Coordinates:
* x (x) object (0, 1) (0, 3) (1, 1) (1, 3) (2, 1) (2, 3) (3, 1) ...
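If plain arrays of (lat, lon) pairs are wanted rather than a stacked DataArray, a NumPy-only sketch is shown below. It is an alternative to the .stack approach, not part of the original answer, and it uses made-up stand-in arrays; with the Dataset from the question one would presumably pass dry_areas.airt.values, ds.lat.values and ds.lon.values instead.

```python
import numpy as np

# Made-up stand-ins for the coordinate vectors and the masked 2D field.
lat = np.array([-52.5, -50.0, -47.5])
lon = np.array([-177.5, -175.0, -172.5, -170.0])
airt = np.full((lat.size, lon.size), np.nan)
airt[0, 1] = 7.6
airt[2, 3] = 6.9

# Row/column indices of all non-NaN cells.
ii, jj = np.nonzero(~np.isnan(airt))

# Look up the coordinate values for those indices; one (lat, lon) pair per row.
land_coordinates = np.column_stack([lat[ii], lon[jj]])
print(land_coordinates)
```

Indexing the 1D coordinate vectors with the index arrays avoids building full (lat, lon) meshgrids.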

numpy: aggregate 4D array by groups

I have a numpy array with shape [t, z, x, y] representing an hourly time series of 3-D data. The axes of the array are time, vertical coordinate, horizontal coordinate 1, horizontal coordinate 2. There is also a t-element list of hourly datetime.datetime timestamps.
I want to calculate the daily mid-day means for each day. This will be an [nday, Z, X, Y] array.
I'm trying to find a pythonic way to do this. I've written something with a bunch of for loops that works but seems slow, inflexible, and verbose.
It appears to me that Pandas is not a solution for me because my time series data are three-dimensional. I'd be happy to be proven wrong.
I've come up with this, using itertools, to find mid-day timestamps and group them by date, and now I'm coming up short trying to apply imap to find the means.
from datetime import datetime
import itertools
import numpy as np
import pandas as pd
# create 72 hours of pseudo-data with 3 vertical levels and a 4 by 4
# horizontal grid.
data = np.zeros((72, 3, 4, 4))
t = pd.date_range(datetime(2008,7,1), freq='1H', periods=72)
for i in range(data.shape[0]):
    data[i,...] = i
# find the timestamps that are "midday" in North America. We'll
# define midday as between 15:00 and 23:00 UTC, which is 10:00 EST to
# 15:00 PST.
def is_midday(this_t):
    return ((this_t.hour >= 15) and (this_t.hour <= 23))
# group the midday timestamps by date
for dt, grp in itertools.groupby(filter(is_midday, t),
                                 key=lambda x: x.date()):
    print('date ' + str(dt))
    for g in grp:
        print(g)
# find means of mid-day data by date
data_list = np.split(data, data.shape[0])
grps = itertools.groupby(filter(is_midday, t),
                         key=lambda x: x.date())
# how to apply map (itertools.imap in Python 2) to data_list and
# grps? Or somehow split data along axis 0 according to grps?

You can shove pretty much any object into a pandas structure. Normally not recommended, but in this case it might work for you.
Create a Series indexed by time, with each element a 3-d numpy array:
In [117]: s = pd.Series([data[i] for i in range(data.shape[0])],index=t)
In [118]: s
Out[118]:
2008-07-01 00:00:00 [[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], ...
2008-07-01 01:00:00 [[[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], ...
2008-07-01 02:00:00 [[[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0], ...
2008-07-01 03:00:00 [[[3.0, 3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0], ...
2008-07-01 04:00:00 [[[4.0, 4.0, 4.0, 4.0], [4.0, 4.0, 4.0, 4.0], ...
2008-07-01 05:00:00 [[[5.0, 5.0, 5.0, 5.0], [5.0, 5.0, 5.0, 5.0], ...
2008-07-01 06:00:00 [[[6.0, 6.0, 6.0, 6.0], [6.0, 6.0, 6.0, 6.0], ...
2008-07-01 07:00:00 [[[7.0, 7.0, 7.0, 7.0], [7.0, 7.0, 7.0, 7.0], ...
2008-07-01 08:00:00 [[[8.0, 8.0, 8.0, 8.0], [8.0, 8.0, 8.0, 8.0], ...
2008-07-01 09:00:00 [[[9.0, 9.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0], ...
2008-07-01 10:00:00 [[[10.0, 10.0, 10.0, 10.0], [10.0, 10.0, 10.0,...
2008-07-01 11:00:00 [[[11.0, 11.0, 11.0, 11.0], [11.0, 11.0, 11.0,...
2008-07-01 12:00:00 [[[12.0, 12.0, 12.0, 12.0], [12.0, 12.0, 12.0,...
2008-07-01 13:00:00 [[[13.0, 13.0, 13.0, 13.0], [13.0, 13.0, 13.0,...
2008-07-01 14:00:00 [[[14.0, 14.0, 14.0, 14.0], [14.0, 14.0, 14.0,...
...
2008-07-03 09:00:00 [[[57.0, 57.0, 57.0, 57.0], [57.0, 57.0, 57.0,...
2008-07-03 10:00:00 [[[58.0, 58.0, 58.0, 58.0], [58.0, 58.0, 58.0,...
2008-07-03 11:00:00 [[[59.0, 59.0, 59.0, 59.0], [59.0, 59.0, 59.0,...
2008-07-03 12:00:00 [[[60.0, 60.0, 60.0, 60.0], [60.0, 60.0, 60.0,...
2008-07-03 13:00:00 [[[61.0, 61.0, 61.0, 61.0], [61.0, 61.0, 61.0,...
2008-07-03 14:00:00 [[[62.0, 62.0, 62.0, 62.0], [62.0, 62.0, 62.0,...
2008-07-03 15:00:00 [[[63.0, 63.0, 63.0, 63.0], [63.0, 63.0, 63.0,...
2008-07-03 16:00:00 [[[64.0, 64.0, 64.0, 64.0], [64.0, 64.0, 64.0,...
2008-07-03 17:00:00 [[[65.0, 65.0, 65.0, 65.0], [65.0, 65.0, 65.0,...
2008-07-03 18:00:00 [[[66.0, 66.0, 66.0, 66.0], [66.0, 66.0, 66.0,...
2008-07-03 19:00:00 [[[67.0, 67.0, 67.0, 67.0], [67.0, 67.0, 67.0,...
2008-07-03 20:00:00 [[[68.0, 68.0, 68.0, 68.0], [68.0, 68.0, 68.0,...
2008-07-03 21:00:00 [[[69.0, 69.0, 69.0, 69.0], [69.0, 69.0, 69.0,...
2008-07-03 22:00:00 [[[70.0, 70.0, 70.0, 70.0], [70.0, 70.0, 70.0,...
2008-07-03 23:00:00 [[[71.0, 71.0, 71.0, 71.0], [71.0, 71.0, 71.0,...
Freq: H, Length: 72
Define your aggregating function. Accessing .values returns the underlying objects; concatenating them coerces the result back to an actual numpy array, which is then aggregated (mean in this case):
In [119]: def f(g,grp):
   .....:     return np.concatenate(grp.values).mean()
   .....:
Since I'm not sure what your end output should look like, I just create a time-based grouper manually (this is essentially a resample) and don't do anything with the final results (it's just a list of the aggregated values):
In [121]: [ f(g,grp) for g, grp in s.groupby(pd.Grouper(freq='D')) ]
Out[121]: [11.5, 35.5, 59.5]
You can get reasonably fancy here and, say, return a pandas object (and potentially concat them).
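The function above averages over all hours and returns one scalar per day. A hedged sketch of how the [nday, Z, X, Y] result asked for in the question might be computed directly in NumPy, using the question's 15:00 to 23:00 UTC midday window (the names `midday`, `sel`, `dates` and `daily` are illustrative, and this is an alternative to the Series-of-arrays approach rather than a continuation of it):

```python
import numpy as np
import pandas as pd

# Pseudo-data as in the question: every value at time step i equals i.
data = np.zeros((72, 3, 4, 4))
data += np.arange(72).reshape(-1, 1, 1, 1)
t = pd.date_range("2008-07-01", periods=72, freq=pd.Timedelta(hours=1))

# Boolean mask of the midday hours (15:00 to 23:00 UTC inclusive).
midday = (t.hour >= 15) & (t.hour <= 23)

sel = data[midday]               # shape (n_midday, Z, X, Y)
dates = t[midday].normalize()    # calendar day of each selected time step

# Average over the time axis within each day and stack the daily results.
days = dates.unique()
daily = np.stack([sel[dates == d].mean(axis=0) for d in days])
print(daily.shape)
```

Taking the mean along axis 0 only keeps the vertical and horizontal dimensions intact, so each day yields a full [Z, X, Y] array.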
