Apply np.polyfit() to an xarray DataArray containing NaN values - numpy

I am working with the ERSST.v5 Dataset, which contains monthly temperature data of the dimensions (time, latitude, longitude). I want to calculate the trend of the temperature per year at each grid point (via a for-loop) and plot it as a function of (latitude,longitude).
I now have a problem applying np.polyfit() to the data, because the DataArray contains NaN values. I tried the approach from the question "numpy.polyfit doesn't handle NaN values", but my indexing doesn't seem to work properly and I'm struggling to find the solution. Here's my code:
import numpy as np
import xarray as xr
#load data
sst_data=xr.open_dataset('sst.mnmean.nc') #ersst.v5 dataset
#define sea surface temperature and calculate annual mean
sst=sst_data.sst[:-1]
annual_sst = sst.groupby('time.year').mean(axis=0) #annual mean sst with dimensions (year, lat, lon)
#longitudes, latitudes
sst_lon=sst_data.variables['lon']
sst_lat=sst_data.variables['lat']
#map lon values to -180..180 range
f = lambda x: ((x+180) % 360) - 180
sst_lon = f(sst_lon)
#rearrange data
ind = np.argsort(sst_lon)
annual_sst = annual_sst[:,:,ind] #rearranged annual mean sst
#calculate sst trend at each grid point
year=annual_sst.coords['year'] #define time variable
idx = np.isfinite(annual_sst) #find all finite values
a=np.where(idx[0,:,0]==1)[0]
b=np.where(idx[0,0,:]==1)[0]
#new sst
SST=annual_sst[:,a,b]
for i in range(0, len(SST.coords['lat'])):
    for j in range(0, len(SST.coords['lon'])):
        sst = SST[:,i,j]
        trend = np.polyfit(year, sst, deg=1)
I get this error: LinAlgError: SVD did not converge in Linear Least Squares
I am thankful for any tips/suggestions!
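A possible way around the error, sketched under assumptions: np.polyfit cannot fit a series that contains NaN, and the mask built from idx[0,:,0] and idx[0,0,:] only inspects one column and one row of the first year, so land points elsewhere in the grid can still reach the fit. Masking each grid point's series separately avoids that. The sketch below uses a small synthetic (year, lat, lon) array in place of the ERSST.v5 data; trend_per_gridpoint and min_points are hypothetical names, not part of NumPy or xarray.

```python
import numpy as np

def trend_per_gridpoint(years, data, min_points=2):
    """Linear trend (slope per year) at each grid point, ignoring NaNs.

    data has shape (year, lat, lon); points with fewer than
    min_points finite values are left as NaN in the result.
    """
    _, nlat, nlon = data.shape
    slope = np.full((nlat, nlon), np.nan)
    for i in range(nlat):
        for j in range(nlon):
            series = data[:, i, j]
            ok = np.isfinite(series)  # mask for this grid point only
            if ok.sum() >= min_points:
                slope[i, j] = np.polyfit(years[ok], series[ok], deg=1)[0]
    return slope

# Synthetic stand-in for the annual means: a 0.02 K/year warming everywhere
years = np.arange(2000, 2010, dtype=float)
data = 0.02 * (years[:, None, None] - 2000) + np.zeros((10, 3, 4))
data[:, 0, 0] = np.nan  # a "land" point that is NaN in every year
data[3, 1, 1] = np.nan  # a single missing year at an ocean point

trend = trend_per_gridpoint(years, data)
```

With the variables from the question, the call would be along the lines of trend_per_gridpoint(year.values, annual_sst.values); the result is a (lat, lon) array that can be plotted directly.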

Related

How can i create a bubble chart using this data in seaborn?

i have all the data i need to plot in a single row e.g.:
mcc_name year_1 year_2 year_3 year_1_% year_2_% year_3_%
book shop 30000 1500.41 9006.77 NaN -0.4708 -0.60379
i want the x axis to be the values in columns [year_1, year_2, year_3], the y axis to be the values in columns [year_1_%, year_2_%, year_3_%] (pct change), and the size of each bubble proportional to the values in [year_1, year_2, year_3].
sns.scatterplot(data=data_row , x=['year_1', 'year_2', 'year_3'], y=['year_1_%', 'year_2_%', 'year_3_%'], size="pop", legend=False, sizes=(20, 2000))
# show the graph
plt.show()
but i get this error:
ValueError: Length of list vectors must match length of `data` when both are used, but `data` has length 1 and the vector passed to `y` has length 3.
how can i plot??
You need to have your data in long format:
import pandas as pd
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.array([30000,1500.41,9006.77,np.nan,-0.4708,-0.60379]).reshape(1,-1),
                  columns = ['year_1','year_2','year_3','year_1_%','year_2_%','year_3_%'],
                  index = ['mcc_name'])
Usually you can use wide_to_long if your columns are formatted properly, but in this case it may be easier to melt the two groups of columns separately and merge them:
values = df.filter(regex='year_[0-9]$', axis=1).melt(value_name="value",var_name="year")
perc = df.filter(regex='_%', axis=1).melt(value_name="perc",var_name="year")
perc.year = perc.year.str.replace("_%","")
sns.scatterplot(data=values.merge(perc,on="year"),x = "year", y = "perc", size = "value")

Convert multidimensional climate numpy array to Pandas dataframe

I want to convert multidimensional climate data into a pandas DataFrame. The shape of my numpy array is temperature.shape -> (365,100,200) -> ["time", "longitude", "latitude"]. I would like to have the following columns in my pandas DataFrame: columns=["time", "lon", "lat", "temp"].
I tried this code:
df = pd.DataFrame(temperature, columns=['time', 'lat', 'lon', 'temp'])
I got this error:
ValueError: Must pass 2-d input
How can I solve it? I could not find any hint in suggested topics. Thanks.
Pandas expects a 2D array whose rows and columns correspond to those of the final data frame.
It looks like you're trying to unravel the (365,100,200) array into 365*100*200=7,300,000 individual records. This can be done by flattening the array, provided you have the values of each independent quantity along each axis.
For example, unravelling a (3,4,5) shaped 3D array with X, Y and Z dimensions given by the lists/arrays x_index, y_index, z_index, rather than time, longitude, latitude and M replacing temperature:
import numpy as np
import pandas as pd
nx = 3
ny = 4
nz = 5
M = np.ndarray((nx,ny,nz))
for i in range(nx):
    for j in range(ny):
        for k in range(nz):
            M[i,j,k] = (i+j)*k
# constructed nx by ny by nz matrix from function f(x,y,z) = (x+y)*z
x_index = list(range(nx))
y_index = list(range(ny))
z_index = list(range(nz))
# Get arrays/list giving the values of x/y/z
X, Y, Z = np.meshgrid(x_index, y_index, z_index, indexing="ij")  # "ij" keeps the (nx, ny, nz) axis order of M
# Make (3,4,5) arrays of each independent variable
pd.DataFrame({"M=(X+Y)*Z":M.flatten(), "X":X.flatten(), "Y":Y.flatten(), "Z":Z.flatten()})
# Flatten the data and independent variables to make 3*4*5=60 individual records
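An alternative sketch for the temperature case, assuming the coordinate values along each axis are available as 1D arrays: pd.MultiIndex.from_product enumerates the (time, lon, lat) combinations with the last level varying fastest, which is the same order in which ravel() walks a C-ordered array. The small shape and the coordinate values below are made up for illustration; with the real data they would be the actual time, longitude and latitude arrays.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the (time, longitude, latitude) array from the question
n_time, n_lon, n_lat = 4, 3, 2
time = np.arange(n_time)
lon = np.linspace(0.0, 20.0, n_lon)
lat = np.linspace(-10.0, 10.0, n_lat)
temperature = np.arange(n_time * n_lon * n_lat, dtype=float).reshape(n_time, n_lon, n_lat)

# One index entry per array element, in the same order as temperature.ravel()
index = pd.MultiIndex.from_product([time, lon, lat], names=["time", "lon", "lat"])

# reset_index turns the three index levels into ordinary columns
df = pd.Series(temperature.ravel(), index=index, name="temp").reset_index()
```

The level order passed to from_product has to match the axis order of the array, here (time, lon, lat).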

changing range causes a distribution not normal

A post gives some code to plot this figure
import scipy.stats as ss
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-10, 11)
xU, xL = x + 0.5, x - 0.5
prob = ss.norm.cdf(xU, scale = 3) - ss.norm.cdf(xL, scale = 3)
prob = prob / prob.sum() #normalize the probabilities so their sum is 1
nums = np.random.choice(x, size = 10000, p = prob)
plt.hist(nums, bins = len(x))
I modified this line
x = np.arange(-10, 11)
to this line
x = np.arange(10, 31)
I got this figure
How to fix that?
Given what you're asking Python to do, there's no error in this plot: it's a histogram of 10,000 samples from the tail (anything that rounds to between 10 and 30) of a normal distribution with mean 0 and standard deviation 3. Since probabilities drop off steeply in the tail of a normal, it happens that none of the 10,000 exceeded 17, which is why you didn't get the full range up to 30.
If you just want the x-axis of the plot to cover your full intended range, you could add plt.xlim(9.5, 30.5) after plt.hist.
If you want a histogram with support over this entire range, then you'll need to adjust the mean and/or variance of the distribution. For instance, if you specify that your normal distribution has mean 20 rather than mean 0 when you obtain prob, i.e.
prob = ss.norm.cdf(xU, loc=20, scale=3) - ss.norm.cdf(xL, loc=20, scale=3)
then you'll recover a similar-looking histogram, just translated to the right by 20.
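To put a rough number on how thin that tail is, here is a small sketch using only the standard library. It rebuilds the same discretized probabilities from the normal survival function (written with math.erfc) and estimates how many of the 10,000 draws would be expected to land above 17. The helper name discretized_normal_probs is made up for this example.

```python
import math

def discretized_normal_probs(values, loc=0.0, scale=3.0):
    """Probability of each integer bin [v - 0.5, v + 0.5), renormalized to sum to 1."""
    def sf(x):
        # Survival function 1 - cdf of a normal with mean loc and std scale
        return 0.5 * math.erfc((x - loc) / (scale * math.sqrt(2.0)))
    raw = [sf(v - 0.5) - sf(v + 0.5) for v in values]
    total = sum(raw)
    return [p / total for p in raw]

values = list(range(10, 31))  # same integers as np.arange(10, 31)

# Mean 0: the range 10..30 lies entirely in the upper tail
probs = discretized_normal_probs(values)
p_above_17 = sum(p for v, p in zip(values, probs) if v > 17)
expected_above_17 = 10000 * p_above_17  # expected count among 10,000 draws

# Mean 20: the distribution is centred inside the range
shifted = discretized_normal_probs(values, loc=20.0)
most_likely = values[shifted.index(max(shifted))]
```

The expected count above 17 comes out well below one, consistent with the histogram stopping there, while with loc=20 the most likely value sits in the middle of the range.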

resample time series on uniform interval in numpy/scipy?

I have a random variable X sampled at random times T similar to this toy data:
import numpy as np
T = np.random.exponential(size=1000).cumsum()
X = np.random.normal(size=1000)
This timeseries looks like this:
A key point is that the sampling interval is non-uniform: by this I mean that all elements of np.diff(T) are not equal. I need to resample the timeseries T,X on uniform intervals with a specified width dt, meaning (np.diff(T)==dt).all() should return True.
I can resample the timeseries on uniform intervals using scipy.interpolate.interp1d, but this method does not allow me to specify the interval size dt:
from scipy.interpolate import interp1d
T = np.linspace(T.min(),T.max(),T.size) # same range and size with a uniform interval
F = interp1d(T,X,fill_value='extrapolate') # resample the series on uniform interval
X = F(T) # Now it's resampled.
The essential issue is that interp1d does not accept an array T unless T.size==X.size.
Is there another method I can try to resample the time series T,X on uniform intervals of width dt?
dt = ...
from scipy.interpolate import interp1d
F = interp1d(T, X, fill_value='extrapolate')  # interpolant built on the original, non-uniform times T
Tnew = np.arange(T.min(), T.max(), dt)  # uniform grid with spacing dt
Xnew = F(Tnew)
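The same resampling can also be sketched with NumPy alone, since np.interp does linear interpolation onto arbitrary query points as long as the original times are increasing (which holds for a cumulative sum of positive increments). The value dt = 0.5 is an arbitrary choice for illustration, and the names T_uniform and X_uniform are made up here.

```python
import numpy as np

# Toy data as in the question, with a seeded generator for reproducibility
rng = np.random.default_rng(0)
T = rng.exponential(size=1000).cumsum()  # strictly increasing sample times
X = rng.normal(size=1000)

dt = 0.5  # assumed interval width, chosen only for this example

# Uniform grid starting at T.min() with spacing dt, not extending past T.max()
n_steps = int(np.floor((T.max() - T.min()) / dt)) + 1
T_uniform = T.min() + dt * np.arange(n_steps)

# Linear interpolation of X onto the uniform grid
X_uniform = np.interp(T_uniform, T, X)
```

Because of floating-point rounding, a check such as np.allclose(np.diff(T_uniform), dt) is more reliable than testing for exact equality with (np.diff(T_uniform)==dt).all().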

getting counterintuitive results with numpy FFT when calculating mean frequency and ESD Progression

I have some timecourse data that visually appears to have differing levels of high frequency fluctuation. I have plotted the timecourse Data A and B below.
I have used numpy FFT to perform Fourier Transformation as follows:
Fs = 1.0; # sampling rate
Ts = 864000; # sampling interval
t = np.arange(0,Ts,Fs) # time vector
n = 864000 # number of datapoints of timecourse data
k = np.arange(n)
T = n/Fs
frq = k/T # two sides frequency range
frq = frq[:n//2] # one side frequency range (integer division so the slice index is an int)
Y = np.fft.fft(A1)/n # fft computing and normalization of timecourse data (A1)
Y = Y[:n//2]
Z = abs(Y)
##### Calculate Mean Frequency ########
Mean_Frequency = sum((frq*Z))/(sum(Z))
################################make sure the first value doesnt create an issue
Freq = frq[1:]
Z = Z[1:]
max_y = max(Z) # Find the maximum y value
mode_Frequency = Freq[Z.argmax()] # Find the x value corresponding to the maximum y value
###################### Plot figures
fig, ax = plt.subplots(2, 1)
fig.suptitle(str(D[k]), fontsize=14, fontweight='bold')
ax[0].plot(t,A1, 'black')
ax[0].set_xlabel('Time')
ax[0].set_ylabel('Amplitude')
ax[1].plot(frq,abs(Y),'r') # plotting the spectrum
ax[1].set_xscale('log')
ax[1].set_xlabel('Freq (Hz)')
ax[1].set_ylabel('|Y(freq)|')
plt.text(2*mode_Frequency, 0.95*max_y, "Mode Frequency (Hz): "+str(mode_Frequency))
plt.text(2*mode_Frequency, 0.85*max_y, "Mean Frequency (Hz): "+str(Mean_Frequency))
plt.savefig("/home/phoenix/Desktop/Figures/Figure3/FourierGraphs/"+str(D[k])+".png")
plt.close()
The results look like this:
The timecourse data (black) for A on the left appears to me to have much more high frequency noise than the data on the right (B).
Yet the mean frequency is higher for the data on the left.
Is this because I have performed FFT incorrectly, or because I have calculated mean frequency incorrectly or because I need to use a different method to capture the really low frequency noise in timecourse B?
thanks for your time.
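Without the data it is hard to say which of the three explanations applies, but two things in the code above can be checked. The mean frequency is computed before the zero-frequency bin is dropped, so a large mean value in the signal adds weight at 0 Hz and pulls the result down. The weights are also amplitudes rather than power, which can give a broadband noise floor spread over many bins a lot of influence. The sketch below shows one possible variant on synthetic sine waves: it removes the mean first and weights by power. mean_frequency is a hypothetical helper, not a NumPy function, and the test signals are made up.

```python
import numpy as np

def mean_frequency(signal, fs, power=True):
    """Spectrum-weighted mean frequency of a real-valued signal.

    The mean is subtracted first so a constant offset does not add
    weight at 0 Hz. Weights are |Y|**2 if power is True, else |Y|.
    """
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()
    amplitude = np.abs(np.fft.rfft(signal))           # one-sided spectrum
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)  # matching frequencies in Hz
    weights = amplitude**2 if power else amplitude
    return np.sum(freqs * weights) / np.sum(weights)

fs = 1.0  # sampling rate in Hz, as in the question
n = 4096
t = np.arange(n) / fs

# Frequencies chosen to fall exactly on FFT bins, so there is no leakage
slow = np.sin(2 * np.pi * (64 / n) * t)    # 0.015625 Hz
fast = np.sin(2 * np.pi * (1024 / n) * t)  # 0.25 Hz

f_slow = mean_frequency(slow, fs)
f_fast = mean_frequency(fast, fs)
f_slow_offset = mean_frequency(slow + 100.0, fs)  # same signal with a large offset
```

Comparing power=True against power=False on the real timecourses A and B would show how much of the difference comes from the choice of weighting.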