Calculating percentile for each gridpoint in xarray - numpy

I am currently using xarray to make probability maps. I want to use a statistical assessment like a “counting” exercise: for all data points in NEU, count how many times both variables jointly exceed their thresholds, i.e. the 1st percentile of the precipitation data and the 99th percentile of the temperature data. The probability (P) of joint occurrence is then simply the number of joint exceedances divided by the number of data points in the dataset.
<xarray.Dataset>
Dimensions: (latitude: 88, longitude: 200, time: 6348)
Coordinates:
* latitude (latitude) float64 49.62 49.88 50.12 50.38 ... 70.88 71.12 71.38
* longitude (longitude) float64 -9.875 -9.625 -9.375 ... 39.38 39.62 39.88
* time (time) datetime64[ns] 1950-06-01 1950-06-02 ... 2018-08-31
Data variables:
rr (time, latitude, longitude) float32 dask.array<chunksize=(6348, 88, 200), meta=np.ndarray>
tx (time, latitude, longitude) float32 dask.array<chunksize=(6348, 88, 200), meta=np.ndarray>
Ellipsis float64 0.0
I want to calculate the percentile of both precipitation and temperature for each gridpoint, which basically means repeating the call below for every gridpoint.
Neu_Precentile=np.nanpercentile(NEU.rr[:,0,0],1)
Can anyone help me out with this problem? I also tried to use xr.apply_ufunc, but unfortunately it didn't work out.

I'm not sure exactly how you want to process the quantiles, but here is a version you may be able to adapt.
Also, I chose to keep the Dataset structure when computing the quantiles, as it shows how to retrieve the values of the outliers, should that ever be relevant (and it is one step away from retrieving the values of valid data points, which likely is).
1. Create some data
coords = ("time", "latitude", "longitude")
sizes = (500, 80, 120)
ds = xr.Dataset(
coords={c: np.arange(s) for c, s in zip(coords, sizes)},
data_vars=dict(
precipitation=(coords, np.random.randn(*sizes)),
temperature=(coords, np.random.randn(*sizes)),
),
)
View of the data:
<xarray.Dataset>
Dimensions: (latitude: 80, longitude: 120, time: 500)
Coordinates:
* time (time) int64 0 1 2 3 ... 496 497 498 499
* latitude (latitude) int64 0 1 2 3 ... 76 77 78 79
* longitude (longitude) int64 0 1 2 3 ... 117 118 119
Data variables:
precipitation (time, latitude, longitude) float64 -1.673 ... -0.3323
temperature (time, latitude, longitude) float64 -0.331 ... -0.03728
2. Compute quantiles
qt_dims = ("latitude", "longitude")
qt_values = (0.1, 0.9)
ds_qt = ds.quantile(qt_values, dim=qt_dims)
The result is still a Dataset, with the analysis dimensions ("latitude", "longitude") reduced away and a new "quantile" dimension added:
<xarray.Dataset>
Dimensions: (quantile: 2, time: 500)
Coordinates:
* time (time) int64 0 1 2 3 ... 496 497 498 499
* quantile (quantile) float64 0.1 0.9
Data variables:
precipitation (quantile, time) float64 -1.305 ... 1.264
temperature (quantile, time) float64 -1.267 ... 1.254
3. Compute outliers co-occurrence
For the locations of outliers:
(edit: use of np.logical_and, more readable than the & operator)
da_outliers_loc = np.logical_and(
    ds.precipitation > ds_qt.precipitation.sel(quantile=qt_values[0]),
    ds.temperature > ds_qt.temperature.sel(quantile=qt_values[1]),
)
The output is a boolean DataArray:
<xarray.DataArray (time: 500, latitude: 80, longitude: 120)>
array([[[False, ...]]])
Coordinates:
* time (time) int64 0 1 2 3 4 ... 496 497 498 499
* latitude (latitude) int64 0 1 2 3 4 ... 75 76 77 78 79
* longitude (longitude) int64 0 1 2 3 ... 116 117 118 119
And if ever the values are relevant:
ds_outliers = ds.where(
    (ds.precipitation > ds_qt.precipitation.sel(quantile=qt_values[0]))
    & (ds.temperature > ds_qt.temperature.sel(quantile=qt_values[1]))
)
4. Count outliers per timestep
outliers_count = da_outliers_loc.sum(dim=qt_dims)
Finally, this yields a DataArray with only a time dimension, whose values are the number of outliers at each timestamp.
<xarray.DataArray (time: 500)>
array([857, ...])
Coordinates:
* time (time) int64 0 1 2 3 4 ... 495 496 497 498 499
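If a fraction is more convenient than a raw count, the count can be normalised by the number of gridpoints. A small follow-up sketch, reusing ds and outliers_count from above:
# fraction of gridpoints flagged as joint outliers at each timestep
n_gridpoints = ds.sizes["latitude"] * ds.sizes["longitude"]
outliers_fraction = outliers_count / n_gridpoints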

np.nanpercentile works on a flattened array by default; here, however, the goal is to reduce only the time dimension, producing a 2D array with the result at each gridpoint. To achieve this, the axis argument of nanpercentile can be used:
np.nanpercentile(NEU.rr, 1, axis=0)
This, however, discards the labeled dimensions and coordinates. apply_ufunc is needed to preserve the dims and coords; note that it does not vectorize the function for you.
xr.apply_ufunc(
    lambda x: np.nanpercentile(x, 1, axis=-1),
    NEU.rr,
    input_core_dims=[["time"]],
)
Note that the axis is now -1 and that we use input_core_dims, which tells apply_ufunc that this dimension will be reduced and also moves it to the last position (hence the -1). For a more detailed explanation of apply_ufunc, this other answer may help.
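Going back to the original question, the per-gridpoint thresholds can then be turned into the joint-exceedance probability map described there. A minimal sketch, assuming "exceedance" means precipitation below its 1st percentile and temperature above its 99th percentile (flip the comparison operators if you mean something else), reusing the NEU dataset:
# per-gridpoint thresholds over time (quantile skips NaNs by default);
# if rr/tx are dask-backed you may first need .load() or .chunk({"time": -1})
rr_p01 = NEU.rr.quantile(0.01, dim="time")
tx_p99 = NEU.tx.quantile(0.99, dim="time")

# boolean array of joint exceedances, shape (time, latitude, longitude)
joint = (NEU.rr < rr_p01) & (NEU.tx > tx_p99)

# probability map: number of joint exceedances divided by number of timesteps
probability = joint.sum(dim="time") / NEU.sizes["time"]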

Related

Apply np.polyfit() to an xarray DataArray containing NaN values

I am working with the ERSST.v5 Dataset, which contains monthly temperature data of the dimensions (time, latitude, longitude). I want to calculate the trend of the temperature per year at each grid point (via a for-loop) and plot it as a function of (latitude,longitude).
I now have a problem applying np.polyfit() to the data, because the DataArray contains NaN values. I tried the approach from the question "numpy.polyfit doesn't handle NaN values", but my indexing doesn't seem to work properly and I'm struggling to find the solution. Here's my code:
import numpy as np
import xarray as xr
#load data
sst_data=xr.open_dataset('sst.mnmean.nc') #ersst.v5 dataset
#define sea surface temperature and calculate annual mean
sst=sst_data.sst[:-1]
annual_sst = sst.groupby('time.year').mean(axis=0) #annual mean sst with dimensions (year, lat, lon)
#longitudes, latitudes
sst_lon=sst_data.variables['lon']
sst_lat=sst_data.variables['lat']
#map lon values to -180..180 range
f = lambda x: ((x+180) % 360) - 180
sst_lon = f(sst_lon)
#rearange data
ind = np.argsort(sst_lon)
annual_sst = annual_sst[:,:,ind] #rearanged annual mean sst
#calculate sst trend at each grid point
year=annual_sst.coords['year'] #define time variable
idx = np.isfinite(annual_sst) #find all finite values
a=np.where(idx[0,:,0]==1)[0]
b=np.where(idx[0,0,:]==1)[0]
#new sst
SST=annual_sst[:,a,b]
for i in range(0, len(SST.coords['lat'])):
    for j in range(0, len(SST.coords['lon'])):
        sst = SST[:, i, j]
        trend = np.polyfit(year, sst, deg=1)
I get this error: LinAlgError: SVD did not converge in Linear Least Squares
I am thankful for any tips/suggestions!
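A common workaround for the "SVD did not converge" error is to mask out the NaNs for each gridpoint before calling np.polyfit. Here is a minimal sketch of that idea, reusing the annual_sst and year variables from the code above (the masking loop itself is my suggestion, not part of the original post):
import numpy as np

# fit a linear trend per gridpoint, skipping NaNs explicitly
nyear, nlat, nlon = annual_sst.shape
trend = np.full((nlat, nlon), np.nan)
for i in range(nlat):
    for j in range(nlon):
        y = np.asarray(annual_sst[:, i, j])
        ok = np.isfinite(y)
        if ok.sum() >= 2:  # need at least two finite points for a line
            trend[i, j] = np.polyfit(np.asarray(year)[ok], y[ok], deg=1)[0]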

Pandas apply function to multiple columns, using value from another dataframe

I have a dataframe with some examples, and another dataframe representing a population. For each numeric column in the examples df, I want to calculate the Cumulative Distribution Function of those values with respect to the population df.
This relies on the column-wise mean and std values from the population df, and I can't find a way to properly refer to these mean and std values in my apply function.
Here is a simplified example of what I'm trying:
The examples:
import numpy as np
import pandas as pd
import scipy.stats

df_test = pd.DataFrame([['Azriel', 45, 76], ['Moses', 23, 34]])
df_test.columns = (['Name', 'Age', 'Weight'])
Name Age Weight
0 Azriel 45 76
1 Moses 23 34
The population:
df_comp = pd.DataFrame([['Mary', 28, 66], ['Joseph', 32, 86], ['Paul', 54, 88]])
df_comp.columns = (['Name', 'Age', 'Weight'])
Name Age Weight
0 Mary 28 66
1 Joseph 32 86
2 Paul 54 88
I am trying to produce the calculation in df_dist:
df_dist = df_test.copy()
numeric_cols = df_comp.select_dtypes(include=[np.number]).columns
mu = df_comp[numeric_cols].mean()
sig = df_comp[numeric_cols].std()
df_dist[numeric_cols] = df_dist[numeric_cols].apply(lambda x: scipy.stats.norm.cdf(x, mu, sig))
The output of df_dist is:
Name Age Weight
0 Azriel 0.691462 0.996679
1 Moses 0.000001 0.000078
The expected output of df_dist (calculated manually):
Age Weight
Azriel 0.6914624613 0.371154197
Moses 0.1419883859 0.00007804441375
You can see that the values for Azriel's Age and Moses's Weight are correct, but the rest are wrong.
I think the mistake is in how I refer to mu and sig in the apply function, when I only want to refer to the single value in mu and sig that corresponds to the column.
I hope that makes sense - can anyone see a solution?
If we look at mu and sig, we see they are series and have values for each numeric column:
>>> mu
Age 38.0
Weight 80.0
dtype: float64
>>> sig
Age 14.000000
Weight 12.165525
dtype: float64
When you apply the CDF function per column, you are using the whole mu and sig series instead of the values specific to that column (so your suspicion is correct!).
The remedy is to use the column's name inside apply and select the matching entries from mu and sig:
df_dist[numeric_cols].apply(lambda x: scipy.stats.norm.cdf(x, mu[x.name], sig[x.name]))
x.name will be e.g. "Age" when the Age column is being processed, and so on.
This gives:
Name Age Weight
0 Azriel 0.691462 0.371154
1 Moses 0.141988 0.000078
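As a side note, the per-column apply can be avoided entirely here, because norm.cdf broadcasts a (rows, columns) block against per-column mu and sig values. A small sketch of that vectorized alternative, reusing df_test, numeric_cols, mu and sig from above:
# vectorized alternative: evaluate the CDF for the whole numeric block at once,
# broadcasting the per-column means and standard deviations
df_dist = df_test.copy()
df_dist[numeric_cols] = scipy.stats.norm.cdf(
    df_test[numeric_cols], mu[numeric_cols], sig[numeric_cols]
)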

Pandas - keep track of value after applying linear regression on other values

I am trying to apply linear regression to a series of variables in my pandas dataframe, excluding player_id, which is only used to keep track of the player being predicted.
print (df.info())
player_id 1601 non-null int64
X1 1601 non-null float64
X2 1601 non-null float64
X3 1601 non-null float64
X4 1601 non-null float64
X5 1601 non-null float64
X6 1601 non-null float64
X7 1601 non-null float64
X8 1601 non-null float64
Y 1601 non-null float64
This is how I try to declare my variables:
from sklearn.linear_model import LinearRegression

df = df[['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'Y']]
X = df.drop(columns=['Y'])
# normalize data
X = X.astype('float32') / 255.
# independent variable
y = df['Y']
# normalize data
y = y.astype('float32') / 255.

model = LinearRegression()
model.fit(X, y)
y_hat = model.predict(X)
The question is: once I have my array of predicted values, how do I track them back to each player_id, so that I know which player each predicted value refers to?
Example:
Which player_id does max(network.predict(X)) refer to?
This works:
for i, value in enumerate(list(y_hat.flatten())):
    print(df.iloc[i]['player_id'])
    df['prediction'].iloc[i] = value.astype('float32')
predict returns the values in the same order as the X you supplied, so the ith value of y_hat is the prediction for the ith row of X.
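A cleaner way to keep the mapping, sketched below, is to store player_id next to the predictions instead of writing them back row by row (df_orig stands for the frame that still contains player_id; the name is only illustrative):
import pandas as pd

# keep player_id alongside the predictions so rows can be looked up directly
results = pd.DataFrame({
    'player_id': df_orig['player_id'].to_numpy(),
    'prediction': y_hat,
})

# e.g. the player with the highest predicted value
best_player_id = results.loc[results['prediction'].idxmax(), 'player_id']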

Double grouping data by bins AND time with pandas

I am trying to bin values from a timeseries (hourly and subhourly temperature values) within a time window.
That is, from original hourly values, I'd like to extract binned values on a daily, weekly or monthly basis.
I have tried to combine groupby+TimeGrouper(" ") with pd.cut, with poor results.
I came across a nice function from this tutorial, which suggests mapping the data (associating each value with its bin range in a new column) and then grouping by that column.
def map_bin(x, bins):
    kwargs = {}
    if x == max(bins):
        kwargs['right'] = True
    bin = bins[np.digitize([x], bins, **kwargs)[0]]
    bin_lower = bins[np.digitize([x], bins, **kwargs)[0] - 1]
    return '[{0}-{1}]'.format(bin_lower, bin)

df['Binned'] = df['temp'].apply(map_bin, bins=freq_bins)
However, applying this function results in an IndexError: index n is out of bounds for axis 0 with size n.
Ideally, I'd like to make this work and apply it to achieve a double grouping at the same time: one by bins and one by a time grouper.
Update:
It appears that my earlier attempt was causing problems because of the double-indexed columns. I have simplified to something that seems to work much better.
import pandas as pd
import numpy as np
xaxis = np.linspace(0,50)
temps = pd.Series(data=xaxis,name='temps')
times = pd.date_range(start='2015-07-15',periods=50,freq='6H')
temps.index = times
bins = [0,10,20,30,40,50]
temps.resample('W').agg(lambda series:pd.value_counts(pd.cut(series,bins),sort=False)).unstack()
This outputs:
(0, 10] (10, 20] (20, 30] (30, 40] (40, 50]
2015-07-19 9 10 0 0 0
2015-07-26 0 0 10 10 8
2015-08-02 0 0 0 0 2
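If you prefer the double grouping to be explicit, the same result can be expressed as a single groupby over both a weekly time grouper and the temperature bins. A small sketch of that variant, reusing the temps series and bins defined above:
# explicit double grouping: one key per week, one key per temperature bin
df = temps.to_frame()
counts = (
    df.groupby([pd.Grouper(freq='W'), pd.cut(df['temps'], bins)])
      .size()
      .unstack()
)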

Pandas subsampling

I have some event data that is measured in time, so the data format looks like
Time(s) Pressure Humidity
0 10 5
0 9.9 5.1
0 10.1 5
1 10 4.9
2 11 6
Here the first column is the time elapsed since the start of the experiment, in seconds. The other two columns are observations. A row is created when certain conditions are true; these conditions are beyond the scope of the discussion here. Each row of three numbers is one data record. Since the lowest granularity of resolution in time here is seconds, you can have two rows with the same timestamp but with different observations; basically these were two distinct events that time could not distinguish.
Now my problem is to roll up the data series by subsampling it, say every 10, 100, or 1000 seconds, so that I get a skimmed data series from the original, higher-granularity one. There are a few ways to decide which row to use; for instance, when subsampling every 10 seconds, you could have multiple rows with a timestamp of 10 seconds. You could take:
1) first row
2) mean of all rows with the same timestamp of 10
3) some other technique
I am looking to do this in pandas; any ideas or pointers on how to start would be much appreciated. Thanks.
Here is a simple example that shows how to perform the requested operations with pandas. It uses data binning to group the samples and resample the data.
import pandas as pd

# Creation of the dataframe
df = pd.DataFrame({
    'Time(s)': [0, 0, 0, 1, 2],
    'Pressure': [10, 9.9, 10.1, 10, 11],
    'Humidity': [5, 5.1, 5, 4.9, 6],
})
# Select time increment
delta_t = 1
timeCol = 'Time(s)'
# Creation of the time bin edges
v = range(df[timeCol].min() - delta_t, df[timeCol].max() + delta_t, delta_t)
# Pandas magic instructions with cut and groupby
df_binned = df.groupby(pd.cut(df[timeCol], v))
# Display the first element of each group
dfFirst = df_binned.head(1)
# Evaluate the mean of each group
dfMean = df_binned.mean()
# Evaluate the median of each group
dfMedian = df_binned.median()
# Find the max of each group
dfMax = df_binned.max()
# Find the min of each group
dfMin = df_binned.min()
Result will look like this for dfFirst
Humidity Pressure Time(s)
Time(s)
(-1, 0] 0 5.0 10 0
(0, 1] 3 4.9 10 1
(1, 2] 4 6.0 11 2
Result will look like this for dfMean
Humidity Pressure Time(s)
Time(s)
(-1, 0] 5.033333 10 0
(0, 1] 4.900000 10 1
(1, 2] 6.000000 11 2
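For larger windows (10 s, 100 s, 1000 s), an alternative worth considering is to turn the seconds column into a TimedeltaIndex and use resample. A small sketch of that idea, assuming a 10-second window and reusing the df defined above:
# resample-based alternative: index by elapsed time, then pick an aggregation
df_t = df.set_index(pd.to_timedelta(df['Time(s)'], unit='s'))

first_per_window = df_t.resample('10s').first()  # option 1: first row per window
mean_per_window = df_t.resample('10s').mean()    # option 2: mean per window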