T-test of gridded (3-D) data in Python

T-test of gridded (3-D) data in Python - numpy

I have two sets of gridded (NetCDF) data with dimensions: Time, S-N, W-E and would like to do a t-test of paired values in time across the entire grid with Python. The computation should return decisions across the entire grid that can then be processed into a spy plot aerial overlay.
I am able to process this in MATLAB, after reading the variable, comprising two main steps (as a hint to what is expected in Python):
**[h,p] = ttest2(permute(grid_1,[3 2 1]),permute(griid_2,[3 2 1]))** % [3 2 1] ===> [Time S-N W-E]
**spy(squeeze(h),'k+',1)** % h is 1/0 decision.
Perhaps this would entail array transforms but my knowledge of Python is limited.
A snippet of my gridded data is as follows:
<xarray.DataArray 'WSPD' (Time: 120, south_north: 105, west_east: 120)>
array([[[3.0042849, 3.6635756, 3.5766048, ..., 2.7890186, 3.5537026,
2.4510043],
Coordinates:
XLONG (south_north, west_east) float32 -125.7 -125.7 ... -122.3 -122.2
XLAT (south_north, west_east) float32 55.7 55.7 55.7 ... 56.99 56.99
XTIME (Time) float32 3.003e+04 7.282e+04 ... 1.417e+06 3.742e+06
Time (Time) datetime64[ns] 2010-08-01T08:00:00 ... 2020-07-01T08:00:00

You can perform paired t-test using scipy.stats.ttest_rel. It takes in two numpy arrays (to which xarray DataArrays natively converts) and compares them across an axis. Default is axis=0.
Example
Let's create some sample data
import xarray as xr
from scipy.stats import ttest_rel
import numpy as np
grid_1 = xr.tutorial.open_dataset('air_temperature').load().to_array().squeeze()
grid_2 = grid_1 + np.random.randn(*grid_1.shape) # Normally perturbed grid_1
>>> grid_1
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
variable <U3 'air'
Since we want to compare across time, we need to specify the correct axis. Here time is first, so we do not need to do anything. Otherwise we'd have to either grid_1.transpose('time', 'lat', 'lon') or call ttest_rel with the appropriate axis argument.
Now we can simply perform the paired t-test.
statres, pval = ttest_rel(grid_1, grid_2)
Plotting
Following the details added to the question, here is how to plot a spy plot with the appropriate ticks.
import matplotlib.pyplot as plt
spy1 = np.where(pval < 0.05, 1, 0)
# Plot spy plot
ax = plt.gca()
ax.spy(spy1, markersize=1, color='black') # 'black' markers
# Prepare ticks
x_ticks = np.arange(0, grid_1.coords['lon'].shape[0], 10)
x_tick_labels = grid_1.coords['lon'].isel(lon=x_ticks).values
y_ticks = np.arange(0, grid_1.coords['lat'].shape[0], 10)
y_tick_labels = grid_1.coords['lat'].isel(lat=y_ticks).values
# Set ticks
ax.set_xticks(x_ticks)
ax.set_xticklabels(x_tick_labels)
ax.set_yticks(y_ticks)
ax.set_yticklabels(y_tick_labels)

Related

How to adjust x-axis spacing on distplot? [duplicate]

I am trying to fix how python plots my data.
Say:
x = [0,5,9,10,15]
y = [0,1,2,3,4]
matplotlib.pyplot.plot(x,y)
matplotlib.pyplot.show()
The x axis' ticks are plotted in intervals of 5. Is there a way to make it show intervals of 1?

You could explicitly set where you want to tick marks with plt.xticks:
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
For example,
import numpy as np
import matplotlib.pyplot as plt
x = [0,5,9,10,15]
y = [0,1,2,3,4]
plt.plot(x,y)
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.show()
(np.arange was used rather than Python's range function just in case min(x) and max(x) are floats instead of ints.)
The plt.plot (or ax.plot) function will automatically set default x and y limits. If you wish to keep those limits, and just change the stepsize of the tick marks, then you could use ax.get_xlim() to discover what limits Matplotlib has already set.
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, stepsize))
The default tick formatter should do a decent job rounding the tick values to a sensible number of significant digits. However, if you wish to have more control over the format, you can define your own formatter. For example,
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%0.1f'))
Here's a runnable example:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
x = [0,5,9,10,15]
y = [0,1,2,3,4]
fig, ax = plt.subplots()
ax.plot(x,y)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 0.712123))
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%0.1f'))
plt.show()

Another approach is to set the axis locator:
import matplotlib.ticker as plticker
loc = plticker.MultipleLocator(base=1.0) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
There are several different types of locator depending upon your needs.
Here is a full example:
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
x = [0,5,9,10,15]
y = [0,1,2,3,4]
fig, ax = plt.subplots()
ax.plot(x,y)
loc = plticker.MultipleLocator(base=1.0) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
plt.show()

I like this solution (from the Matplotlib Plotting Cookbook):
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
x = [0,5,9,10,15]
y = [0,1,2,3,4]
tick_spacing = 1
fig, ax = plt.subplots(1,1)
ax.plot(x,y)
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.show()
This solution give you explicit control of the tick spacing via the number given to ticker.MultipleLocater(), allows automatic limit determination, and is easy to read later.

In case anyone is interested in a general one-liner, simply get the current ticks and use it to set the new ticks by sampling every other tick.
ax.set_xticks(ax.get_xticks()[::2])

if you just want to set the spacing a simple one liner with minimal boilerplate:
plt.gca().xaxis.set_major_locator(plt.MultipleLocator(1))
also works easily for minor ticks:
plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(1))
a bit of a mouthfull, but pretty compact

This is a bit hacky, but by far the cleanest/easiest to understand example that I've found to do this. It's from an answer on SO here:
Cleanest way to hide every nth tick label in matplotlib colorbar?
for label in ax.get_xticklabels()[::2]:
label.set_visible(False)
Then you can loop over the labels setting them to visible or not depending on the density you want.
edit: note that sometimes matplotlib sets labels == '', so it might look like a label is not present, when in fact it is and just isn't displaying anything. To make sure you're looping through actual visible labels, you could try:
visible_labels = [lab for lab in ax.get_xticklabels() if lab.get_visible() is True and lab.get_text() != '']
plt.setp(visible_labels[::2], visible=False)

This is an old topic, but I stumble over this every now and then and made this function. It's very convenient:
import matplotlib.pyplot as pp
import numpy as np
def resadjust(ax, xres=None, yres=None):
"""
Send in an axis and I fix the resolution as desired.
"""
if xres:
start, stop = ax.get_xlim()
ticks = np.arange(start, stop + xres, xres)
ax.set_xticks(ticks)
if yres:
start, stop = ax.get_ylim()
ticks = np.arange(start, stop + yres, yres)
ax.set_yticks(ticks)
One caveat of controlling the ticks like this is that one does no longer enjoy the interactive automagic updating of max scale after an added line. Then do
gca().set_ylim(top=new_top) # for example
and run the resadjust function again.

I developed an inelegant solution. Consider that we have the X axis and also a list of labels for each point in X.
Example:
import matplotlib.pyplot as plt
x = [0,1,2,3,4,5]
y = [10,20,15,18,7,19]
xlabels = ['jan','feb','mar','apr','may','jun']
Let's say that I want to show ticks labels only for 'feb' and 'jun'
xlabelsnew = []
for i in xlabels:
if i not in ['feb','jun']:
i = ' '
xlabelsnew.append(i)
else:
xlabelsnew.append(i)
Good, now we have a fake list of labels. First, we plotted the original version.
plt.plot(x,y)
plt.xticks(range(0,len(x)),xlabels,rotation=45)
plt.show()
Now, the modified version.
plt.plot(x,y)
plt.xticks(range(0,len(x)),xlabelsnew,rotation=45)
plt.show()

Pure Python Implementation
Below's a pure python implementation of the desired functionality that handles any numeric series (int or float) with positive, negative, or mixed values and allows for the user to specify the desired step size:
import math
def computeTicks (x, step = 5):
"""
Computes domain with given step encompassing series x
# params
x - Required - A list-like object of integers or floats
step - Optional - Tick frequency
"""
xMax, xMin = math.ceil(max(x)), math.floor(min(x))
dMax, dMin = xMax + abs((xMax % step) - step) + (step if (xMax % step != 0) else 0), xMin - abs((xMin % step))
return range(dMin, dMax, step)
Sample Output
# Negative to Positive
series = [-2, 18, 24, 29, 43]
print(list(computeTicks(series)))
[-5, 0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
# Negative to 0
series = [-30, -14, -10, -9, -3, 0]
print(list(computeTicks(series)))
[-30, -25, -20, -15, -10, -5, 0]
# 0 to Positive
series = [19, 23, 24, 27]
print(list(computeTicks(series)))
[15, 20, 25, 30]
# Floats
series = [1.8, 12.0, 21.2]
print(list(computeTicks(series)))
[0, 5, 10, 15, 20, 25]
# Step – 100
series = [118.3, 293.2, 768.1]
print(list(computeTicks(series, step = 100)))
[100, 200, 300, 400, 500, 600, 700, 800]
Sample Usage
import matplotlib.pyplot as plt
x = [0,5,9,10,15]
y = [0,1,2,3,4]
plt.plot(x,y)
plt.xticks(computeTicks(x))
plt.show()
Notice the x-axis has integer values all evenly spaced by 5, whereas the y-axis has a different interval (the matplotlib default behavior, because the ticks weren't specified).

Generalisable one liner, with only Numpy imported:
ax.set_xticks(np.arange(min(x),max(x),1))
Set in the context of the question:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = [0,5,9,10,15]
y = [0,1,2,3,4]
ax.plot(x,y)
ax.set_xticks(np.arange(min(x),max(x),1))
plt.show()
How it works:
fig, ax = plt.subplots() gives the ax object which contains the axes.
np.arange(min(x),max(x),1) gives an array of interval 1 from the min of x to the max of x. This is the new x ticks that we want.
ax.set_xticks() changes the ticks on the ax object.

xmarks=[i for i in range(1,length+1,1)]
plt.xticks(xmarks)
This worked for me
if you want ticks between [1,5] (1 and 5 inclusive) then replace
length = 5

Since None of the above solutions worked for my usecase, here I provide a solution using None (pun!) which can be adapted to a wide variety of scenarios.
Here is a sample piece of code that produces cluttered ticks on both X and Y axes.
# Note the super cluttered ticks on both X and Y axis.
# inputs
x = np.arange(1, 101)
y = x * np.log(x)
fig = plt.figure() # create figure
ax = fig.add_subplot(111)
ax.plot(x, y)
ax.set_xticks(x) # set xtick values
ax.set_yticks(y) # set ytick values
plt.show()
Now, we clean up the clutter with a new plot that shows only a sparse set of values on both x and y axes as ticks.
# inputs
x = np.arange(1, 101)
y = x * np.log(x)
fig = plt.figure() # create figure
ax = fig.add_subplot(111)
ax.plot(x, y)
ax.set_xticks(x)
ax.set_yticks(y)
# which values need to be shown?
# here, we show every third value from `x` and `y`
show_every = 3
sparse_xticks = [None] * x.shape[0]
sparse_xticks[::show_every] = x[::show_every]
sparse_yticks = [None] * y.shape[0]
sparse_yticks[::show_every] = y[::show_every]
ax.set_xticklabels(sparse_xticks, fontsize=6) # set sparse xtick values
ax.set_yticklabels(sparse_yticks, fontsize=6) # set sparse ytick values
plt.show()
Depending on the usecase, one can adapt the above code simply by changing show_every and using that for sampling tick values for X or Y or both the axes.
If this stepsize based solution doesn't fit, then one can also populate the values of sparse_xticks or sparse_yticks at irregular intervals, if that is what is desired.

You can loop through labels and show or hide those you want:
for i, label in enumerate(ax.get_xticklabels()):
if i % interval != 0:
label.set_visible(False)

Extract x/y data from PolyCollection from fill_between

How can I extract the x/y data from the resulting PolyCollection from a fill_between plot?
polyCollection = ax.fill_between(x,ylo,yhi)
Now how do I get the data back from polyCollection?
For other Collection objects, I use x, y = artist.get_offsets().T, but here that returns just zeros for some reason.
For "Line" type objects, I use x, y = artist.get_xdata(), artist.get_ydata().
(I use this information in a callback to locally auto-zoom the y-axis to fit the data within a certain x-range.)

polyCollection.get_paths() gives a list of paths. In this case a list with one element. From there you can get the vertices as an Nx2 numpy array, and there the x and y:
from matplotlib import pyplot as plt
import numpy as np
N = 20
polyCollection = plt.fill_between(np.arange(0, N),
5 + np.random.normal(size=N).cumsum(),
10 + np.random.normal(size=N).cumsum(), color='lightblue', alpha=0.3)
points = polyCollection.get_paths()[0].vertices
xs = points[:, 0]
ys = points[:, 1]
plt.scatter(xs, ys, marker='o', color='crimson')

Matplotlib plot labels overlap [duplicate]

I am trying to fix how python plots my data.
Say:
x = [0,5,9,10,15]
y = [0,1,2,3,4]
matplotlib.pyplot.plot(x,y)
matplotlib.pyplot.show()
The x axis' ticks are plotted in intervals of 5. Is there a way to make it show intervals of 1?

You could explicitly set where you want to tick marks with plt.xticks:
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
For example,
import numpy as np
import matplotlib.pyplot as plt
x = [0,5,9,10,15]
y = [0,1,2,3,4]
plt.plot(x,y)
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.show()
(np.arange was used rather than Python's range function just in case min(x) and max(x) are floats instead of ints.)
The plt.plot (or ax.plot) function will automatically set default x and y limits. If you wish to keep those limits, and just change the stepsize of the tick marks, then you could use ax.get_xlim() to discover what limits Matplotlib has already set.
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, stepsize))
The default tick formatter should do a decent job rounding the tick values to a sensible number of significant digits. However, if you wish to have more control over the format, you can define your own formatter. For example,
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%0.1f'))
Here's a runnable example:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
x = [0,5,9,10,15]
y = [0,1,2,3,4]
fig, ax = plt.subplots()
ax.plot(x,y)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 0.712123))
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%0.1f'))
plt.show()

Another approach is to set the axis locator:
import matplotlib.ticker as plticker
loc = plticker.MultipleLocator(base=1.0) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
There are several different types of locator depending upon your needs.
Here is a full example:
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
x = [0,5,9,10,15]
y = [0,1,2,3,4]
fig, ax = plt.subplots()
ax.plot(x,y)
loc = plticker.MultipleLocator(base=1.0) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
plt.show()

I like this solution (from the Matplotlib Plotting Cookbook):
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
x = [0,5,9,10,15]
y = [0,1,2,3,4]
tick_spacing = 1
fig, ax = plt.subplots(1,1)
ax.plot(x,y)
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.show()
This solution give you explicit control of the tick spacing via the number given to ticker.MultipleLocater(), allows automatic limit determination, and is easy to read later.

In case anyone is interested in a general one-liner, simply get the current ticks and use it to set the new ticks by sampling every other tick.
ax.set_xticks(ax.get_xticks()[::2])

if you just want to set the spacing a simple one liner with minimal boilerplate:
plt.gca().xaxis.set_major_locator(plt.MultipleLocator(1))
also works easily for minor ticks:
plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(1))
a bit of a mouthfull, but pretty compact

This is a bit hacky, but by far the cleanest/easiest to understand example that I've found to do this. It's from an answer on SO here:
Cleanest way to hide every nth tick label in matplotlib colorbar?
for label in ax.get_xticklabels()[::2]:
label.set_visible(False)
Then you can loop over the labels setting them to visible or not depending on the density you want.
edit: note that sometimes matplotlib sets labels == '', so it might look like a label is not present, when in fact it is and just isn't displaying anything. To make sure you're looping through actual visible labels, you could try:
visible_labels = [lab for lab in ax.get_xticklabels() if lab.get_visible() is True and lab.get_text() != '']
plt.setp(visible_labels[::2], visible=False)

This is an old topic, but I stumble over this every now and then and made this function. It's very convenient:
import matplotlib.pyplot as pp
import numpy as np
def resadjust(ax, xres=None, yres=None):
"""
Send in an axis and I fix the resolution as desired.
"""
if xres:
start, stop = ax.get_xlim()
ticks = np.arange(start, stop + xres, xres)
ax.set_xticks(ticks)
if yres:
start, stop = ax.get_ylim()
ticks = np.arange(start, stop + yres, yres)
ax.set_yticks(ticks)
One caveat of controlling the ticks like this is that one does no longer enjoy the interactive automagic updating of max scale after an added line. Then do
gca().set_ylim(top=new_top) # for example
and run the resadjust function again.

I developed an inelegant solution. Consider that we have the X axis and also a list of labels for each point in X.
Example:
import matplotlib.pyplot as plt
x = [0,1,2,3,4,5]
y = [10,20,15,18,7,19]
xlabels = ['jan','feb','mar','apr','may','jun']
Let's say that I want to show ticks labels only for 'feb' and 'jun'
xlabelsnew = []
for i in xlabels:
if i not in ['feb','jun']:
i = ' '
xlabelsnew.append(i)
else:
xlabelsnew.append(i)
Good, now we have a fake list of labels. First, we plotted the original version.
plt.plot(x,y)
plt.xticks(range(0,len(x)),xlabels,rotation=45)
plt.show()
Now, the modified version.
plt.plot(x,y)
plt.xticks(range(0,len(x)),xlabelsnew,rotation=45)
plt.show()

Pure Python Implementation
Below's a pure python implementation of the desired functionality that handles any numeric series (int or float) with positive, negative, or mixed values and allows for the user to specify the desired step size:
import math
def computeTicks (x, step = 5):
"""
Computes domain with given step encompassing series x
# params
x - Required - A list-like object of integers or floats
step - Optional - Tick frequency
"""
xMax, xMin = math.ceil(max(x)), math.floor(min(x))
dMax, dMin = xMax + abs((xMax % step) - step) + (step if (xMax % step != 0) else 0), xMin - abs((xMin % step))
return range(dMin, dMax, step)
Sample Output
# Negative to Positive
series = [-2, 18, 24, 29, 43]
print(list(computeTicks(series)))
[-5, 0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
# Negative to 0
series = [-30, -14, -10, -9, -3, 0]
print(list(computeTicks(series)))
[-30, -25, -20, -15, -10, -5, 0]
# 0 to Positive
series = [19, 23, 24, 27]
print(list(computeTicks(series)))
[15, 20, 25, 30]
# Floats
series = [1.8, 12.0, 21.2]
print(list(computeTicks(series)))
[0, 5, 10, 15, 20, 25]
# Step – 100
series = [118.3, 293.2, 768.1]
print(list(computeTicks(series, step = 100)))
[100, 200, 300, 400, 500, 600, 700, 800]
Sample Usage
import matplotlib.pyplot as plt
x = [0,5,9,10,15]
y = [0,1,2,3,4]
plt.plot(x,y)
plt.xticks(computeTicks(x))
plt.show()
Notice the x-axis has integer values all evenly spaced by 5, whereas the y-axis has a different interval (the matplotlib default behavior, because the ticks weren't specified).

Generalisable one liner, with only Numpy imported:
ax.set_xticks(np.arange(min(x),max(x),1))
Set in the context of the question:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = [0,5,9,10,15]
y = [0,1,2,3,4]
ax.plot(x,y)
ax.set_xticks(np.arange(min(x),max(x),1))
plt.show()
How it works:
fig, ax = plt.subplots() gives the ax object which contains the axes.
np.arange(min(x),max(x),1) gives an array of interval 1 from the min of x to the max of x. This is the new x ticks that we want.
ax.set_xticks() changes the ticks on the ax object.

xmarks=[i for i in range(1,length+1,1)]
plt.xticks(xmarks)
This worked for me
if you want ticks between [1,5] (1 and 5 inclusive) then replace
length = 5

Since None of the above solutions worked for my usecase, here I provide a solution using None (pun!) which can be adapted to a wide variety of scenarios.
Here is a sample piece of code that produces cluttered ticks on both X and Y axes.
# Note the super cluttered ticks on both X and Y axis.
# inputs
x = np.arange(1, 101)
y = x * np.log(x)
fig = plt.figure() # create figure
ax = fig.add_subplot(111)
ax.plot(x, y)
ax.set_xticks(x) # set xtick values
ax.set_yticks(y) # set ytick values
plt.show()
Now, we clean up the clutter with a new plot that shows only a sparse set of values on both x and y axes as ticks.
# inputs
x = np.arange(1, 101)
y = x * np.log(x)
fig = plt.figure() # create figure
ax = fig.add_subplot(111)
ax.plot(x, y)
ax.set_xticks(x)
ax.set_yticks(y)
# which values need to be shown?
# here, we show every third value from `x` and `y`
show_every = 3
sparse_xticks = [None] * x.shape[0]
sparse_xticks[::show_every] = x[::show_every]
sparse_yticks = [None] * y.shape[0]
sparse_yticks[::show_every] = y[::show_every]
ax.set_xticklabels(sparse_xticks, fontsize=6) # set sparse xtick values
ax.set_yticklabels(sparse_yticks, fontsize=6) # set sparse ytick values
plt.show()
Depending on the usecase, one can adapt the above code simply by changing show_every and using that for sampling tick values for X or Y or both the axes.
If this stepsize based solution doesn't fit, then one can also populate the values of sparse_xticks or sparse_yticks at irregular intervals, if that is what is desired.

You can loop through labels and show or hide those you want:
for i, label in enumerate(ax.get_xticklabels()):
if i % interval != 0:
label.set_visible(False)

Percentile Distribution Graph

Does anyone have an idea how to change X axis scale and ticks to display a percentile distribution like the graph below? This image is from MATLAB, but I want to use Python (via Matplotlib or Seaborn) to generate.
From the pointer by #paulh, I'm a lot closer now. This code
import matplotlib
matplotlib.use('Agg')
import numpy as np
import matplotlib.pyplot as plt
import probscale
import seaborn as sns
clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)
fig, ax = plt.subplots(figsize=(8, 4))
x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)
ax.set_xlim(40, 99.5)
ax.set_xscale('prob')
ax.plot(x, y)
sns.despine(fig=fig)
Generates the following plot (notice the re-distributed X-Axis)
Which I find much more useful than a the standard scale:
I contacted the author of the original graph and they gave me some pointers. It is actually a log scale graph, with x axis reversed and values of [100-val], with manual labeling of the x axis ticks. The code below recreates the original image with the same sample data as the other graphs here.
import matplotlib
matplotlib.use('Agg')
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)
x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)
# Number of intervals to display.
# Later calculations add 2 to this number to pad it to align with the reversed axis
num_intervals = 3
x_values = 1.0 - 1.0/10**np.arange(0,num_intervals+2)
# Start with hard-coded lengths for 0,90,99
# Rest of array generated to display correct number of decimal places as precision increases
lengths = [1,2,2] + [int(v)+1 for v in list(np.arange(3,num_intervals+2))]
# Build the label string by trimming on the calculated lengths and appending %
labels = [str(100*v)[0:l] + "%" for v,l in zip(x_values, lengths)]
fig, ax = plt.subplots(figsize=(8, 4))
ax.set_xscale('log')
plt.gca().invert_xaxis()
# Labels have to be reversed because axis is reversed
ax.xaxis.set_ticklabels( labels[::-1] )
ax.plot([100.0 - v for v in x], y)
ax.grid(True, linewidth=0.5, zorder=5)
ax.grid(True, which='minor', linewidth=0.5, linestyle=':')
sns.despine(fig=fig)
plt.savefig("test.png", dpi=300, format='png')
This is the resulting graph:

These type of graphs are popular in the low-latency community for plotting latency distributions. When dealing with latencies most of the interesting information tends to be in the higher percentiles, so a logarithmic view tends to work better. I've first seen these graphs used in https://github.com/giltene/jHiccup and https://github.com/HdrHistogram/.
The cited graph was generated by the following code
n = ceil(log10(length(values)));
p = 1 - 1./10.^(0:0.01:n);
percentiles = prctile(values, p * 100);
semilogx(1./(1-p), percentiles);
The x-axis was labelled with the code below
labels = cell(n+1, 1);
for i = 1:n+1
labels{i} = getPercentileLabel(i-1);
end
set(gca, 'XTick', 10.^(0:n));
set(gca, 'XTickLabel', labels);
% {'0%' '90%' '99%' '99.9%' '99.99%' '99.999%' '99.999%' '99.9999%'}
function label = getPercentileLabel(i)
switch(i)
case 0
label = '0%';
case 1
label = '90%';
case 2
label = '99%';
otherwise
label = '99.';
for k = 1:i-2
label = [label '9'];
end
label = [label '%'];
end
end

The following Python code uses Pandas to read a csv file that contains a list of recorded latency values (in milliseconds), then it records those latency values (as microseconds) in an HdrHistogram, and saves the HdrHistogram to an hgrm file, that will then be used by Seaborn to plot the latency distribution graph.
import pandas as pd
from hdrh.histogram import HdrHistogram
from hdrh.dump import dump
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import sys
import argparse
# Parse the command line arguments.
parser = argparse.ArgumentParser()
parser.add_argument('csv_file')
parser.add_argument('hgrm_file')
parser.add_argument('png_file')
args = parser.parse_args()
csv_file = args.csv_file
hgrm_file = args.hgrm_file
png_file = args.png_file
# Read the csv file into a Pandas data frame and generate an hgrm file.
csv_df = pd.read_csv(csv_file, index_col=False)
USECS_PER_SEC=1000000
MIN_LATENCY_USECS = 1
MAX_LATENCY_USECS = 24 * 60 * 60 * USECS_PER_SEC # 24 hours
# MAX_LATENCY_USECS = int(csv_df['response-time'].max()) * USECS_PER_SEC # 1 hour
LATENCY_SIGNIFICANT_DIGITS = 5
histogram = HdrHistogram(MIN_LATENCY_USECS, MAX_LATENCY_USECS, LATENCY_SIGNIFICANT_DIGITS)
for latency_sec in csv_df['response-time'].tolist():
histogram.record_value(latency_sec*USECS_PER_SEC)
# histogram.record_corrected_value(latency_sec*USECS_PER_SEC, 10)
TICKS_PER_HALF_DISTANCE=5
histogram.output_percentile_distribution(open(hgrm_file, 'wb'), USECS_PER_SEC, TICKS_PER_HALF_DISTANCE)
# Read the generated hgrm file into a Pandas data frame.
hgrm_df = pd.read_csv(hgrm_file, comment='#', skip_blank_lines=True, sep=r"\s+", engine='python', header=0, names=['Latency', 'Percentile'], usecols=[0, 3])
# Plot the latency distribution using Seaborn and save it as a png file.
sns.set_theme()
sns.set_style("dark")
sns.set_context("paper")
sns.set_color_codes("pastel")
fig, ax = plt.subplots(1,1,figsize=(20,15))
fig.suptitle('Latency Results')
sns.lineplot(x='Percentile', y='Latency', data=hgrm_df, ax=ax)
ax.set_title('Latency Distribution')
ax.set_xlabel('Percentile (%)')
ax.set_ylabel('Latency (seconds)')
ax.set_xscale('log')
ax.set_xticks([1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
ax.set_xticklabels(['0', '90', '99', '99.9', '99.99', '99.999', '99.9999', '99.99999'])
fig.tight_layout()
fig.savefig(png_file)

want to smooth a contour from a masked array

I have a masked array which is used by matplotlib.plt.contourf to project a temperature contour on a glabal map. I was trying to smooth the contour, but unfortunately none of the proposed solutions seems to be able to handle masked array. I tested these solutions:
-scipy.ndimage.gaussian_filter - moving averages
scipy.ndimage.zoom
none of them works(they count in the masked values also). Is there any way I can smooth my contour on maskedArray
I have added this part after trying the proposed 'inpaint' solution and the results were unchanged. here is the code (if it helps)
import Scientific.IO.NetCDF as S
import mpl_toolkits.basemap as bm
import numpy.ma as MA
import numpy as np
import matplotlib.pyplot as plt
import inpaint
def main():
fileobj = S.NetCDFFile('Bias.ANN.tas_A1_1.nc', mode='r')
# take the values
set1 = {'time', 'lat', 'lon'}
set2 = set(fileobj.variables.keys())
set3 = set2 - set1
datadim = set3.pop()
print "******************datadim: "+datadim
data = fileobj.variables[datadim].getValue()[0,:,:]
lon = fileobj.variables['lon'].getValue()
lat = fileobj.variables['lat'].getValue()
fileobj.close()
data, lon = bm.shiftgrid(180.,data, lon,start=False)
data = MA.masked_equal(data, 1.0e20)
#data2 = inpaint.replace_nans(data, 10, 0.25, 2, 'idw')
#- Make 2-D longitude and latitude arrays:
[lon2d, lat2d] =np.meshgrid(lon, lat)
#- Set up map:
mapproj = bm.Basemap(projection='cyl',
llcrnrlat=-90.0, llcrnrlon=-180.00,
urcrnrlat=90.0, urcrnrlon=180.0)
mapproj.drawcoastlines(linewidth=.5)
mapproj.drawmapboundary(fill_color='.8')
#mapproj.drawparallels(N.array([-90, -45, 0, 45, 90]), labels=[1,0,0,0])
#mapproj.drawmeridians(N.array([0, 90, 180, 270, 360]), labels=[0,0,0,1])
lonall, latall = mapproj(lon2d, lat2d)
cmap=plt.cm.Spectral
#- Make a contour plot of the temperature:
mymapf = plt.contourf(lonall, latall, data, 20, cmap=cmap)
#plt.clabel(mymapf, fontsize=12)
plt.title(cmap.name)
plt.colorbar(mymapf, orientation='horizontal')
plt.savefig('sample2.png', dpi=150, edgecolor='red', format='png', bbox_inches='tight', pad_inches=.2)
plt.close()
if __name__ == "__main__":
main()
I am comparing the output from this code (the first figure), with output of the same datafile from Panoply. Zoomin in and looking more precisely it seems like it is not the smoothness problem, but the pyplot model provides one stripe slimmer, or the contours are cut earlier (the outer boundaries shows this clearly, and inner contours are different due to this fact). It makes it to look like that the pyplot model is not as smooth as the Panoply one. how can I get (nearly) the same model? Am I distinguishing it right?

I had similar problem and google pointed me to this: blog post. Basically he's using inpaint algorithm to interpolate missing values and produce valid array for filtering.
The code is at the end of the post, and you can save it to site-packages (or else) and load it as module (i.e. inpaint.py):
import inpaint
filled = inpaint.replace_nans(NANMask, 5, 0.5, 2, 'idw')
I'm happy with the result, and I guess it will suite missing temperature values just fine. There is also next version here: github but code will need some cleaning for general usage as it's part of a project.
For reference, easy use and preservation sake I'll post the code (of initial version) here:
# -*- coding: utf-8 -*-
"""A module for various utilities and helper functions"""
import numpy as np
#cimport numpy as np
#cimport cython
DTYPEf = np.float64
#ctypedef np.float64_t DTYPEf_t
DTYPEi = np.int32
#ctypedef np.int32_t DTYPEi_t
##cython.boundscheck(False) # turn of bounds-checking for entire function
##cython.wraparound(False) # turn of bounds-checking for entire function
def replace_nans(array, max_iter, tol,kernel_size=1,method='localmean'):
"""Replace NaN elements in an array using an iterative image inpainting algorithm.
The algorithm is the following:
1) For each element in the input array, replace it by a weighted average
of the neighbouring elements which are not NaN themselves. The weights depends
of the method type. If ``method=localmean`` weight are equal to 1/( (2*kernel_size+1)**2 -1 )
2) Several iterations are needed if there are adjacent NaN elements.
If this is the case, information is "spread" from the edges of the missing
regions iteratively, until the variation is below a certain threshold.
Parameters
----------
array : 2d np.ndarray
an array containing NaN elements that have to be replaced
max_iter : int
the number of iterations
kernel_size : int
the size of the kernel, default is 1
method : str
the method used to replace invalid values. Valid options are
`localmean`, 'idw'.
Returns
-------
filled : 2d np.ndarray
a copy of the input array, where NaN elements have been replaced.
"""
# cdef int i, j, I, J, it, n, k, l
# cdef int n_invalids
filled = np.empty( [array.shape[0], array.shape[1]], dtype=DTYPEf)
kernel = np.empty( (2*kernel_size+1, 2*kernel_size+1), dtype=DTYPEf )
# cdef np.ndarray[np.int_t, ndim=1] inans
# cdef np.ndarray[np.int_t, ndim=1] jnans
# indices where array is NaN
inans, jnans = np.nonzero( np.isnan(array) )
# number of NaN elements
n_nans = len(inans)
# arrays which contain replaced values to check for convergence
replaced_new = np.zeros( n_nans, dtype=DTYPEf)
replaced_old = np.zeros( n_nans, dtype=DTYPEf)
# depending on kernel type, fill kernel array
if method == 'localmean':
print 'kernel_size', kernel_size
for i in range(2*kernel_size+1):
for j in range(2*kernel_size+1):
kernel[i,j] = 1
print kernel, 'kernel'
elif method == 'idw':
kernel = np.array([[0, 0.5, 0.5, 0.5,0],
[0.5,0.75,0.75,0.75,0.5],
[0.5,0.75,1,0.75,0.5],
[0.5,0.75,0.75,0.5,1],
[0, 0.5, 0.5 ,0.5 ,0]])
print kernel, 'kernel'
else:
raise ValueError( 'method not valid. Should be one of `localmean`.')
# fill new array with input elements
for i in range(array.shape[0]):
for j in range(array.shape[1]):
filled[i,j] = array[i,j]
# make several passes
# until we reach convergence
for it in range(max_iter):
print 'iteration', it
# for each NaN element
for k in range(n_nans):
i = inans[k]
j = jnans[k]
# initialize to zero
filled[i,j] = 0.0
n = 0
# loop over the kernel
for I in range(2*kernel_size+1):
for J in range(2*kernel_size+1):
# if we are not out of the boundaries
if i+I-kernel_size < array.shape[0] and i+I-kernel_size >= 0:
if j+J-kernel_size < array.shape[1] and j+J-kernel_size >= 0:
# if the neighbour element is not NaN itself.
if filled[i+I-kernel_size, j+J-kernel_size] == filled[i+I-kernel_size, j+J-kernel_size] :
# do not sum itself
if I-kernel_size != 0 and J-kernel_size != 0:
# convolve kernel with original array
filled[i,j] = filled[i,j] + filled[i+I-kernel_size, j+J-kernel_size]*kernel[I, J]
n = n + 1*kernel[I,J]
# divide value by effective number of added elements
if n != 0:
filled[i,j] = filled[i,j] / n
replaced_new[k] = filled[i,j]
else:
filled[i,j] = np.nan
# check if mean square difference between values of replaced
#elements is below a certain tolerance
print 'tolerance', np.mean( (replaced_new-replaced_old)**2 )
if np.mean( (replaced_new-replaced_old)**2 ) < tol:
break
else:
for l in range(n_nans):
replaced_old[l] = replaced_new[l]
return filled
def sincinterp(image, x, y, kernel_size=3 ):
"""Re-sample an image at intermediate positions between pixels.
This function uses a cardinal interpolation formula which limits
the loss of information in the resampling process. It uses a limited
number of neighbouring pixels.
The new image :math:`im^+` at fractional locations :math:`x` and :math:`y` is computed as:
.. math::
im^+(x,y) = \sum_{i=-\mathtt{kernel\_size}}^{i=\mathtt{kernel\_size}} \sum_{j=-\mathtt{kernel\_size}}^{j=\mathtt{kernel\_size}} \mathtt{image}(i,j) sin[\pi(i-\mathtt{x})] sin[\pi(j-\mathtt{y})] / \pi(i-\mathtt{x}) / \pi(j-\mathtt{y})
Parameters
----------
image : np.ndarray, dtype np.int32
the image array.
x : two dimensions np.ndarray of floats
an array containing fractional pixel row
positions at which to interpolate the image
y : two dimensions np.ndarray of floats
an array containing fractional pixel column
positions at which to interpolate the image
kernel_size : int
interpolation is performed over a ``(2*kernel_size+1)*(2*kernel_size+1)``
submatrix in the neighbourhood of each interpolation point.
Returns
-------
im : np.ndarray, dtype np.float64
the interpolated value of ``image`` at the points specified
by ``x`` and ``y``
"""
# indices
# cdef int i, j, I, J
# the output array
r = np.zeros( [x.shape[0], x.shape[1]], dtype=DTYPEf)
# fast pi
pi = 3.1419
# for each point of the output array
for I in range(x.shape[0]):
for J in range(x.shape[1]):
#loop over all neighbouring grid points
for i in range( int(x[I,J])-kernel_size, int(x[I,J])+kernel_size+1 ):
for j in range( int(y[I,J])-kernel_size, int(y[I,J])+kernel_size+1 ):
# check that we are in the boundaries
if i >= 0 and i <= image.shape[0] and j >= 0 and j <= image.shape[1]:
if (i-x[I,J]) == 0.0 and (j-y[I,J]) == 0.0:
r[I,J] = r[I,J] + image[i,j]
elif (i-x[I,J]) == 0.0:
r[I,J] = r[I,J] + image[i,j] * np.sin( pi*(j-y[I,J]) )/( pi*(j-y[I,J]) )
elif (j-y[I,J]) == 0.0:
r[I,J] = r[I,J] + image[i,j] * np.sin( pi*(i-x[I,J]) )/( pi*(i-x[I,J]) )
else:
r[I,J] = r[I,J] + image[i,j] * np.sin( pi*(i-x[I,J]) )*np.sin( pi*(j-y[I,J]) )/( pi*pi*(i-x[I,J])*(j-y[I,J]))
return r
#cdef extern from "math.h":
# double sin(double)

A simple smoothing function that works with masked data will solve this. One can then avoid the approaches that involve making up data (ie, interpolating, inpainting, etc); and making up data should always be avoided.
The main issue that arises when smoothing masked data is that for each point, smoothing uses the neighboring values to calculate a new value at a center point, but when those neighbors are masked, the new value for the center point will also become masked due to the rules of masked arrays. Therefore, one needs to do the calculation with unmasked data, and explicitly account for the mask. That's easy to do, and is not in the function smooth below.
from numpy import *
import pylab as plt
# make a grid and a striped mask as test data
N = 100
x = linspace(0, 5, N, endpoint=True)
grid = 2. + 1.*(sin(2*pi*x)[:,newaxis]*sin(2*pi*x)>0.)
m = resize((sin(pi*x)>0), (N,N))
plt.imshow(grid.copy(), cmap='jet', interpolation='nearest')
plt.colorbar()
plt.title('original data')
def smooth(u, mask):
m = ~mask
r = u*m # set all 'masked' points to 0. so they aren't used in the smoothing
a = 4*r[1:-1,1:-1] + r[2:,1:-1] + r[:-2,1:-1] + r[1:-1,2:] + r[1:-1,:-2]
b = 4*m[1:-1,1:-1] + m[2:,1:-1] + m[:-2,1:-1] + m[1:-1,2:] + m[1:-1,:-2] # a divisor that accounts for masked points
b[b==0] = 1. # for avoiding divide by 0 error (region is masked so value doesn't matter)
u[1:-1,1:-1] = a/b
# run the data through the smoothing filter a few times
for i in range(10):
smooth(grid, m)
mg = ma.array(grid, mask=m) # put together the mask and the data
plt.figure()
plt.imshow(mg, cmap='jet', interpolation='nearest')
plt.colorbar()
plt.title('smoothed with mask')
plt.show()
The main point is that at the boundary of the mask, the masked values are not used in the smoothing. (This is also where the grid squares switch values, so it would be clear in the figure if the masked neighboring values were being used.)

We also just had this problem and the astropy package has us covered:
import numpy as np
import matplotlib.pyplot as plt
# Some Axes
x = np.arange(100)
y = np.arange(100)
#Some Interesting Shape
z = np.array(np.outer(np.sin((x+y)/10),np.sin(y/3)),dtype=float)
# some mask
mask = np.outer(np.sin((x+y)/20),np.sin(y/5))**2>.9
# masked data represent noise, so lets put in some trash into the masked points
z[mask] = (np.random.random(size = (100,100))*10)[mask]
# masked data
z_masked = np.ma.masked_array(z, mask)
# "Conventional" filter
filter_kernelsize = 2
import scipy.ndimage
z_filtered_bad = scipy.ndimage.gaussian_filter(z_masked,filter_kernelsize)
# Lets filter it
import astropy.convolution.convolve
from astropy.convolution import Gaussian2DKernel
k = Gaussian2DKernel(1.5)
z_filtered = astropy.convolution.convolve(z_masked, k, boundary='extend')
### Plots:
fig, axes = plt.subplots(2,2)
plt.sca(axes[0,0])
plt.title('Raw Data')
plt.imshow(z)
plt.colorbar()
plt.sca(axes[0,1])
plt.title('Raw Data Masked')
plt.imshow(z_masked)
plt.colorbar()
plt.sca(axes[1,0])
plt.title('ndimage filter (ignores mask)')
plt.imshow(z_filtered_bad)
plt.colorbar()
plt.sca(axes[1,1])
plt.title('astropy filter (uses mask)')
plt.imshow(z_filtered)
plt.colorbar()
plt.tight_layout()
Output plot of the code

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

T-test of gridded (3-D) data in Python - numpy

Related

How to adjust x-axis spacing on distplot? [duplicate]

Extract x/y data from PolyCollection from fill_between

Matplotlib plot labels overlap [duplicate]

Percentile Distribution Graph

want to smooth a contour from a masked array

Categories

Resources