Xarray mask region based on multiple conditions - where-clause

I'm looking at a global netCDF file. I want to set all land points that are within the 60-75 deg N band to zero but keep the ocean points in that band as NaN. As a second step, I want to keep the values on the land points from 60-75 deg N but set all other land points to zero. Ocean values are NaNs. I just can't get my xarray script to do that - here is what I tried:
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
ds = xr.open_dataset('ifle.nc')
ds['Shrub_total'] = ds['Shrub']
shrub_total = ds.Shrub_total
tundra = shrub_total.where((shrub_total!=np.nan)&(shrub_total.Lat>60)&
(shrub_total.Lat<75), 0)
shrub = shrub_total.where((shrub_total!=np.nan)&(shrub_total.Lat<60)&
(shrub_total.Lat>75), 0)
ds['Tundra'] = tundra
ds['Shrub'] = shrub
fig, axes = plt.subplots(ncols=3, figsize=(12,3))
ds['Shrub_total'].isel(Time=0).plot(ax=axes[0])
ds['Tundra'].isel(Time=0).plot(ax=axes[1])
ds['Shrub'].isel(Time=0).plot(ax=axes[2])
plt.show()
This is what it looks like
The left panel is the original data. For the middle one I at least managed to keep the data I wanted - but instead of the two massive violet blocks I wanted the full map with all values outside the selected band set to zero. The right panel was intended to be the 'inverse' of the middle one, but I completely failed there. It feels like this should be such an easy thing to do, but I just can't figure it out!

This appeared to be mostly an issue with the logical side, as well as the method used to deal with the NaNs.
The below seems to work for me:
tundra = shrub_total.where((np.isnan(shrub_total) == True) |
                           ((shrub_total.Lat > 60) & (shrub_total.Lat < 75)), 0)
shrub = shrub_total.where((np.isnan(shrub_total) == True) |
                          ((shrub_total.Lat < 60) | (shrub_total.Lat > 75)), 0)
I changed the shrub condition to an OR statement (we want to keep points that are either below 60 or above 75 deg N - no point can be both!).
We use np.isnan() == True rather than != np.nan: NaN compares unequal to everything, including itself, so shrub_total != np.nan is True everywhere and the mask does nothing useful; you have to test for NaN with np.isnan(). This necessitated further changes to the logic.
Note, I do not use Python, so this may be very hacky, and I'm sure someone else will have a much more elegant and knowledgeable answer, but the question intrigued me so I attempted it :)
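For reference, a slightly more idiomatic sketch of the same masking, assuming the same variable and coordinate names (Shrub, Lat) as in the question; xarray's .isnull() plays the role of np.isnan():
import xarray as xr

ds = xr.open_dataset('ifle.nc')
shrub_total = ds['Shrub']

# The condition is True over ocean (NaN) or inside the 60-75 deg N band;
# wherever it is False, where() substitutes 0.
in_band = (shrub_total.Lat > 60) & (shrub_total.Lat < 75)
tundra = shrub_total.where(shrub_total.isnull() | in_band, 0)   # keep the band, zero the rest
shrub = shrub_total.where(shrub_total.isnull() | ~in_band, 0)   # keep the rest, zero the band

ds['Tundra'] = tundra
ds['Shrub'] = shrub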

Related

Adding descriptive stats to this plot

In pandas/seaborn:
sns.distplot(combo['resubmits'], kde=False, bins=8)
plt.savefig("g1.png")
This makes a very pretty histogram. I want to include a textual "legend" showing the mean, stdev, n, etc. as numbers in a box. You would think this is so common that there's a semi-automatic way to do it, but I can't find one.
There is a feature request for that.
However, note that using matplotlib.pyplot.axvline, you can easily do it yourself for now.
from matplotlib import pyplot as plt
plt.axvline(combo['resubmits'].mean())
This draws a vertical line at the mean. Note that axvline's optional second and third arguments (ymin, ymax) are fractions of the axes height between 0 and 1, not data values, so the default full-height line is usually what you want here.
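If you also want the textual box of statistics the question asks for, here is a minimal sketch using a plain matplotlib text annotation (the combo DataFrame below is a made-up stand-in for the one in the question):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# stand-in for the `combo` DataFrame from the question
combo = pd.DataFrame({'resubmits': np.random.poisson(3, size=200)})

ax = sns.distplot(combo['resubmits'], kde=False, bins=8)
vals = combo['resubmits']
stats = "n = {}\nmean = {:.2f}\nstd = {:.2f}".format(len(vals), vals.mean(), vals.std())
# place the stats in a box in the upper-right corner, in axes coordinates
ax.text(0.95, 0.95, stats, transform=ax.transAxes, ha='right', va='top',
        bbox=dict(boxstyle='round', facecolor='white'))
plt.savefig("g1.png")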

How do I create a bar chart that starts and ends in a certain range

I created a computer model (just for fun) to predict soccer match results. I ran a simulation to predict how many points each team will gain, which gives me a list of simulated results for each team.
I want to plot something like confidence interval, but using bar chart.
I considered the following options:
I considered using matplotlib's candlestick, but this is not Forex price data.
I also considered matplotlib's errorbar, especially since it turns out I can combine a bar graph with error bars, but it's not really what I am aiming for. I am actually aiming for something like Nate Silver's 538 election prediction result.
Nate Silver's is too complex: he colors the distribution and varies the size of the percentages. I just want a simple bar chart that plots over a certain range.
I don't want to resort to bar stacking like shown here.
Matplotlib's barh (or bar) is probably suitable for this:
import numpy as np
import matplotlib.pylab as pl

x_mean = np.array([1, 3, 6])
x_std = np.array([0.3, 1, 0.7])
y = np.array([0, 1, 2])

pl.figure()
pl.barh(y, width=2*x_std, left=x_mean-x_std)
pl.show()
The bars have a horizontal width of 2*x_std and start at x_mean-x_std, so the center denotes the mean value.
It's not very pretty (yet), but it is highly customizable.
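As a sketch of that customization (the team names and axis label are invented for illustration):
import numpy as np
import matplotlib.pylab as pl

x_mean = np.array([1, 3, 6])
x_std = np.array([0.3, 1, 0.7])
y = np.arange(len(x_mean))

pl.figure()
pl.barh(y, width=2*x_std, left=x_mean-x_std, height=0.5,
        color='steelblue', edgecolor='none')
pl.plot(x_mean, y, 'k|', markersize=15)        # mark the mean inside each bar
pl.yticks(y, ['Team A', 'Team B', 'Team C'])   # hypothetical team names
pl.xlabel('Predicted points')
pl.show()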

How to shift x axis labels on a line plot?

I'm using pandas to work with a data set and am trying to use a simple line plot with error bars to show the end results. It's all working great, except that the plot looks funny.
By default, it puts my two data groups at the far left and right of the plot, which obscures the error bars to the point that they're not useful (the error bars in this case are key to interpretation, so I want them plainly visible).
I can fix that problem by setting xlim to open up some space on either end of the x axis so that the error bars are plainly visible, but then the x labels are offset from where the actual x data is.
Here is a simplified example that shows the problem:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df6 = pd.DataFrame([-0.07, 0.08], index=['A', 'B'])
df6.plot(kind='line', linewidth=2, yerr=[[0.1, 0.1], [0.1, 0.1]], elinewidth=2, ecolor='green')
plt.xlim(-0.2, 1.2)  # Make some room at ends to see error bars
plt.show()
I tried to include a plot (image) showing the problem, but I cannot post images yet, having just joined and not having enough reputation points.
What I want to know is: How do I shift these labels over one tick to the right?
Thanks in advance.
Well, it turns out I found a solution, which I will just post here in case anyone else has this same issue in the future.
Basically, it all works better with a line plot if you specify both the ticks and the labels in the same place at the same time. At least that was helpful for me. It forces you to keep the two lists the same length, which seems to make the assignment between ticks and labels better behaved (a simple 1:1 mapping in this case).
So I could fix my problem by including something like this:
plt.xticks([0, 1], ['A','B'] )
right after the xlim statement in code from original question. Now the A and B align perfectly with the place where the data is plotted, not offset from it.
The above solution works, but it is less good-looking since the x grid is now very coarse (a purely aesthetic consideration). I could fix that by using a different xticks statement, such as:
plt.xticks([-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2], ['', 'A', '', '', '', '', 'B', ''])
This gives me a nice-looking grid and the data where I need it, but of course it looks very contrived here. In the actual program I'd find a way to make it less clunky.
Hope that is of some help to fellow seekers....
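Putting the pieces together, a minimal self-contained sketch of the fix, using the same toy data as in the question:
import pandas as pd
import matplotlib.pyplot as plt

df6 = pd.DataFrame([-0.07, 0.08], index=['A', 'B'])
df6.plot(kind='line', linewidth=2, yerr=[[0.1, 0.1], [0.1, 0.1]], elinewidth=2, ecolor='green')

plt.xlim(-0.2, 1.2)             # room at both ends so the error bars are visible
plt.xticks([0, 1], ['A', 'B'])  # pin the tick labels to the positions of the data points
plt.show()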

matplotlib pyplot side-by-side graphics

I'm trying to put two scatterplots side-by-side in the same figure. I'm also using prettyplotlib to make the graphs look a little nicer. Here is the code
fig, ax = ppl.subplots(ncols=2, nrows=1, figsize=(14,6))

for each in ['skimmer','dos','webapp','losstheft','espionage','crimeware','misuse','pos']:
    ypos = df[df['pattern']==each]['ypos_m']
    xpos = df[df['pattern']==each]['xpos_m']
    ax[0] = ppl.scatter(ypos, xpos, label=each)
plt.title("Multi-dimensional Scaling: Manhattan")

for each in ['skimmer','dos','webapp','losstheft','espionage','crimeware','misuse','pos']:
    ypos = df[df['pattern']==each]['ypos_e']
    xpos = df[df['pattern']==each]['xpos_e']
    ax[1] = ppl.scatter(ypos, xpos, label=each)
plt.title("Multi-dimensional Scaling: Euclidean")

plt.show()
I don't get any error when the code runs, but what I end up with is one row with two graphs. One graph is completely empty and not styled by prettyplotlib at all. The right side graphic seems to have both of my scatterplots in it.
I know that ppl.subplots is returning a matplotlib.figure.Figure and a numpy array consisting of two matplotlib.axes.AxesSubplot. But I also admit that I don't quite get how axes and subplotting works. Hopefully it's just a simple mistake somewhere.
I think ax[0] = ppl.scatter(ypos, xpos, label=each) should be ax[0].scatter(ypos, xpos, label=each), and likewise ax[1] = ppl.scatter(...) should be ax[1].scatter(...). Change those and see if your problem gets solved.
I am quite sure the issue is that you are calling ppl.scatter(...), which draws on the current axes - and that is the first of the two axes you generated (the left one) for both loops.
You may also find that in the end the ax list contains two matplotlib.collections.PathCollection objects, not the two axes you might expect.
Since the solution above loses the prettiness of prettyplotlib, an alternative is to change the current working axes by adding plt.sca(ax[0]) or plt.sca(ax[1]) before ppl.scatter(), inside each loop, as sketched below.
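A sketch of the adjusted code with plt.sca (the df built here is a random stand-in for the one in the question, with the same column names):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import prettyplotlib as ppl

patterns = ['skimmer','dos','webapp','losstheft','espionage','crimeware','misuse','pos']

# random stand-in for the df from the question
df = pd.DataFrame({'pattern': np.repeat(patterns, 10),
                   'ypos_m': np.random.randn(80), 'xpos_m': np.random.randn(80),
                   'ypos_e': np.random.randn(80), 'xpos_e': np.random.randn(80)})

fig, ax = ppl.subplots(ncols=2, nrows=1, figsize=(14,6))

plt.sca(ax[0])                      # make the left axes current so ppl.scatter draws on it
for each in patterns:
    sub = df[df['pattern'] == each]
    ppl.scatter(sub['ypos_m'], sub['xpos_m'], label=each)
plt.title("Multi-dimensional Scaling: Manhattan")

plt.sca(ax[1])                      # switch to the right axes
for each in patterns:
    sub = df[df['pattern'] == each]
    ppl.scatter(sub['ypos_e'], sub['xpos_e'], label=each)
plt.title("Multi-dimensional Scaling: Euclidean")

plt.show()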

pandas access axis by user-defined name

I am wondering whether there is any way to access axes of pandas containers (DataFrame, Panel, etc...) by user-defined name instead of integer or "index", "columns", "minor_axis" etc...
For example, with the following data container:
from pandas import DataFrame
from numpy.random import randn

df = DataFrame(randn(3,2), columns=['c1','c2'], index=['i1','i2','i3'])
df.index.name = 'myaxis1'
df.columns.name = 'myaxis2'
I would like to do this:
df.sum(axis='myaxis1')
df.xs('c1', axis='myaxis2') # cross section
Also very useful would be:
df.reshape(['myaxis2','myaxis1'])
(in this case not so relevant, but it could become so if the dimension increases)
The reason is that I work a lot with multi-dimensional arrays of varying dimensions, like "time", "variable", "percentile" etc., and the same piece of code is often applied to objects which can be DataFrame, Panel or even Panel4D, or DataFrame with MultiIndex. For now I often test the shape of the object, or the general settings of the script, in order to know which axis is the relevant one for computing a sum or mean. But I think it would be much more convenient to forget about how the container is implemented in detail (DataFrame, Panel etc.) and simply think about the nature of the problem (say I want to average over time; I do not want to think about whether I am working in "probabilistic" mode with several percentiles, or in "deterministic" mode with a single time series).
Writing this post I have (re)discovered the very useful axes attribute. The above code could be translated into:
nms = [ax.name for ax in df.axes]
axid1 = nms.index('myaxis1')
axid2 = nms.index('myaxis2')
df.sum(axis=axid1)
df.xs('c1', axis=axid2) # cross section
and the "reshape" feature (does not apply to 3-d case though...):
newshape = ['myaxis2','myaxis1']
axid = [nms.index(nm) for nm in newshape]
df.swapaxes(*axid)
Well, I have to admit that I found these solutions while writing this post (and this is already very convenient), but they could be generalized to handle DataFrames (or other containers) with MultiIndex axes, by searching across all axes and labels...
In my opinion it would be a major improvement to the user-friendliness of pandas (ok, forgetting about the actual structure could have a performance cost, but the user worried about performance can be careful in how he/she organizes the data).
What do you think?
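For illustration, the lookup above could be wrapped in a small helper; the function name axis_id is made up, and it relies only on the .axes attribute shown above:
def axis_id(obj, name):
    """Return the positional number of the axis of a pandas object whose .name matches."""
    names = [ax.name for ax in obj.axes]
    return names.index(name)

# usage with the df defined earlier
df.sum(axis=axis_id(df, 'myaxis1'))
df.xs('c1', axis=axis_id(df, 'myaxis2'))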
This is still experimental, but look at this page:
http://pandas.pydata.org/pandas-docs/dev/dsintro.html#panelnd-experimental
import pandas
import numpy as np
from pandas.core import panelnd

MyPanel4D = panelnd.create_nd_panel_factory(
    klass_name='MyPanel4D',
    axis_orders=['axis4', 'axis3', 'axis2', 'axis1'],
    axis_slices={'axis3': 'items',
                 'axis2': 'major_axis',
                 'axis1': 'minor_axis'},
    slicer='Panel',
    stat_axis=2)

mp4d = MyPanel4D(np.random.rand(5,4,3,2))
print mp4d
Results in this
<class 'pandas.core.panelnd.MyPanel4D'>
Dimensions: 5 (axis4) x 4 (axis3) x 3 (axis2) x 2 (axis1)
Axis4 axis: 0 to 4
Axis3 axis: 0 to 3
Axis2 axis: 0 to 2
Axis1 axis: 0 to 1
Here's the caveat: when you slice it like mp4d[0], you are going to get back a Panel, unless you create a hierarchy of custom objects (unfortunately that will need to wait for 0.12-dev for support for 'renaming' Panel/DataFrame; it's non-trivial and there haven't been any requests for it).
So for higher-dimensional objects you can impose your own name structure. The axis aliasing should work like you are suggesting, but I think there are some bugs there.