Change linear interpolation in Matplotlib

I have data that looks like this:
        sv_length    status    right_edge     left_edge
0    (0.999, 3.0]  0.142857  3.000000e+00  9.990000e-01
1      (3.0, 6.0]  0.125000  6.000000e+00  3.000000e+00
2     (6.0, 11.3]  0.153846  1.130000e+01  6.000000e+00
3  (11.3, 18.375]  0.964286  1.837500e+01  1.130000e+01
4  (18.375, 28.0]  0.965517  2.800000e+01  1.837500e+01
When I plot it like so:
sbn.lineplot(x=binned['right_edge'], y=binned['status'])
plt.xlim([1, 100])
plt.xscale("log")
I get a plot (not reproduced here) that clearly uses linear interpolation between the values in "right_edge", i.e. the segments between points are slanted. How can I make the plot use right-most interpolation instead, i.e. extend each value to the left until the previous point is reached, producing horizontal lines?
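One way to get this is a step plot. The following is a minimal sketch, assuming the two relevant columns from the table above are rebuilt by hand into a DataFrame named binned. Matplotlib's step with where="pre" holds each y value constant to the left of its x position, back to the previous point:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the table in the question
binned = pd.DataFrame({
    "right_edge": [3.0, 6.0, 11.3, 18.375, 28.0],
    "status": [0.142857, 0.125000, 0.153846, 0.964286, 0.965517],
})

fig, ax = plt.subplots()
# where="pre": the interval (x[i-1], x[i]] takes the value y[i]
(line,) = ax.step(binned["right_edge"], binned["status"], where="pre")
ax.set_xscale("log")
ax.set_xlim(1, 100)
```

Since seaborn's lineplot forwards extra keyword arguments to Matplotlib's plot, passing drawstyle="steps-pre" to sbn.lineplot should give the same result without switching to ax.step.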

Related

Interpolating lines of a Polygon

Suppose we have 5 (x, y) points that form a closed loop, i.e. a polygon. How can I interpolate or upsample points so that the polygon looks rounder instead of having sharp straight edges between consecutive points? In the image (not reproduced here), what I have is on the left and what I want is on the right.
Simple MATLAB code to draw the polygon:
xv = [0 2 3 2.5 1 0];
yv = [1 0 2 3.5 3.5 1];
plot(xv, yv)
xlim([-1 4])
ylim([-2 5])
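One possible approach, sketched here in Python with NumPy rather than MATLAB, is Chaikin's corner-cutting scheme. It is an assumption that an approximating curve is acceptable: the rounded outline stays inside the original polygon and does not pass through the original vertices. The function name chaikin_closed is made up for this example.

```python
import numpy as np

def chaikin_closed(points, iterations=3):
    """Round a closed polygon by repeated corner cutting (Chaikin's scheme)."""
    pts = np.asarray(points, dtype=float)
    # Drop the duplicated closing vertex, if the loop is given explicitly closed
    if np.allclose(pts[0], pts[-1]):
        pts = pts[:-1]
    for _ in range(iterations):
        nxt = np.roll(pts, -1, axis=0)   # next vertex, wrapping around
        q = 0.75 * pts + 0.25 * nxt      # point 1/4 along each edge
        r = 0.25 * pts + 0.75 * nxt      # point 3/4 along each edge
        new = np.empty((2 * len(pts), 2))
        new[0::2] = q                    # interleave q and r to keep edge order
        new[1::2] = r
        pts = new
    return np.vstack([pts, pts[:1]])     # repeat the first point to close the loop

# Same vertices as the MATLAB snippet above
xv = [0, 2, 3, 2.5, 1, 0]
yv = [1, 0, 2, 3.5, 3.5, 1]
smooth = chaikin_closed(np.column_stack([xv, yv]), iterations=3)
```

Each iteration doubles the number of vertices, so three iterations turn the 5 distinct points into 40. Plotting smooth[:, 0] against smooth[:, 1] draws the rounded outline. If the curve must pass through the original points, a periodic interpolating spline would be the alternative to look at.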

matplotlib subplots do not show the exact x tick labels passed to it as list

I am plotting accuracy against var_smoothing for 4 different values. My values are:
var_smoothing_values
>>
[1e-09, 1e-06, 0.001, 1]
gauss_accuracies
>>
[0.728, 0.8, 0.826, 0.832]
I am using 2 subplots, and on the second one I plot the data like this:
f,ax = plt.subplots(1,2,figsize=(15,5))
ax[1].plot(var_smoothing_values,gauss_accuracies,marker='*',markersize=12)
ax[1].set_ylabel('Accuracy')
ax[1].set_xlabel('var_smoothing values')
ax[1].set_title('Accuracy vs var_smoothing | GaussianNB',size='large')
plt.show()
Calling ax[1].set_xticks(var_smoothing_values) shows only 3 ticks.
How can I show exactly 4 ticks, one for each of my var_smoothing_values?
You need to use a log scale on the x-axis, since your x-values span several orders of magnitude. On a linear axis the smallest values collapse onto nearly the same position, so their ticks cannot be told apart:
ax[1].set_xscale('log')
ax[1].set_xticks(var_smoothing_values)
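Put together with the plotting code from the question, a runnable sketch looks like this. The Agg backend line is only there so the example also runs without a display and is not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

var_smoothing_values = [1e-09, 1e-06, 0.001, 1]
gauss_accuracies = [0.728, 0.8, 0.826, 0.832]

f, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[1].plot(var_smoothing_values, gauss_accuracies, marker='*', markersize=12)
ax[1].set_xscale('log')                 # space the values by order of magnitude
ax[1].set_xticks(var_smoothing_values)  # one major tick per data point
ax[1].set_ylabel('Accuracy')
ax[1].set_xlabel('var_smoothing values')
ax[1].set_title('Accuracy vs var_smoothing | GaussianNB', size='large')
```

The order matters less than it may seem, but setting the scale first keeps the intent clear: the tick positions are then interpreted on the logarithmic axis.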

Pandas, approximating a bar plot for large dataframes

I have a dataframe (around 10k rows) of the following form:
id | voted
123 1.0
12 0.0
215 1.0
362 0.0
...
And I want to bar plot this and look at where the values are mostly 0.0 and where they are mostly 1.0. (the order of indices in the first column is essential, as the dataframe is sorted).
I tried doing a bar plot, but even if I restrict myself to a small subset of the dataframe, the plot is still not readable:
Is there a way to approximate areas that are mostly 1.0 with a single thicker bar, as we do for histograms when we set the number of bins higher or lower?
Since you are looking for an approximation of the vote density over intervals, you could add a moving average:
df['ma'] = df['voted'].rolling(5).mean()
This gives you a running average at every row, which you can plot against the index as a line graph. Where the value is close to 1, you have a group of ids that mostly voted 1.0; where it is close to 0, they mostly voted 0.0.
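A small sketch of the moving average on made-up data, assuming a window of 5 rows as in the line above. The dataframe here is a hypothetical stand-in for the real one.

```python
import pandas as pd

# Hypothetical stand-in for the sorted dataframe: a run of 0.0 votes, then 1.0 votes
df = pd.DataFrame({'voted': [0.0] * 10 + [1.0] * 10})

window = 5  # rows per window; larger values give a smoother curve
df['ma'] = df['voted'].rolling(window).mean()

# The first window - 1 rows have no complete window yet, so they are NaN
print(df['ma'].tolist())
```

For around 10k rows a much larger window (hundreds of rows) is probably more readable. df['ma'].plot() then draws the curve against the index.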

Scatter plot of Multiindex GroupBy()

I'm trying to make a scatter plot of a GroupBy() with Multiindex (http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-with-multiindex). That is, I want to plot one of the labels on the x-axis, another label on the y-axis, and the mean() as the size of each point.
df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean() returns:
Sigma_ang  Epsilon_K
3.4        30           0.647000
           40           0.602071
           50           0.619786
3.6        30           0.646538
           40           0.591833
           50           0.607769
3.8        30           0.616833
           40           0.590714
           50           0.578364
Name: RMSD, dtype: float64
And I'd like to plot something like: plt.scatter(x=Sigma, y=Epsilon, s=RMSD)
What's the best way to do this? I'm having trouble getting the proper Sigma and Epsilon values for each RMSD value.
+1 to Vaishali Garg. Based on his comment, the following works:
df_mean = df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean().reset_index()
plt.scatter(df_mean['Sigma'], df_mean['Epsilon'], s=100.*df_mean['RMSD'])
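To see why reset_index helps, here is a self-contained sketch on made-up data. The RMSD values are hypothetical, and the column names Sigma and Epsilon follow the code above. After reset_index, the group labels become ordinary columns, one row per (Sigma, Epsilon) pair, so they line up with the means.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data: two hypothetical RMSD measurements per (Sigma, Epsilon) pair
df = pd.DataFrame({
    'Sigma':   [3.4, 3.4, 3.4, 3.4, 3.6, 3.6, 3.6, 3.6],
    'Epsilon': [30, 30, 40, 40, 30, 30, 40, 40],
    'RMSD':    [0.60, 0.70, 0.55, 0.65, 0.62, 0.68, 0.50, 0.70],
})

# The MultiIndex levels turn into the columns 'Sigma' and 'Epsilon'
df_mean = df.groupby(['Sigma', 'Epsilon'])['RMSD'].mean().reset_index()

fig, ax = plt.subplots()
# Marker area scales with the mean RMSD; 100 is an arbitrary visual factor
ax.scatter(df_mean['Sigma'], df_mean['Epsilon'], s=100. * df_mean['RMSD'])
```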

Numpy or Pandas function for "x-value-window" means or other stats?

Let's say I have x-y data samples sorted by x-value. I'm going to use Pandas as example, but I would be perfectly happy with a Numpy/Scipy-only solution, of course.
In [24]: pd.set_option('display.max_rows', 10)
In [25]: df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
In [26]: df = df.sort_values('x')
In [27]: df
Out[27]:
x y
13 -3.403818 0.717744
49 -2.688876 1.936267
74 -2.388332 -0.121599
52 -2.185848 0.617896
90 -2.155343 -1.132673
.. ... ...
65 1.736506 -0.170502
0 1.770901 0.520490
60 1.878376 0.206113
63 2.263602 1.112115
33 2.384195 -1.877502
[100 rows x 2 columns]
Now, I want to kind of "window" it or "discretize" it and get statistics on each window. But I don't want to do the Pandas moving-window functions because they define windows by rows. I want to define windows by a span of x-values, thus "x-value-window". Specifically, let's define each x-value-window with 2 parameters:
1. center x-value of each window
   - in this example, let's say I want x = 0.0 + 0.4 * k for all positive or negative k
   - thus -3.2, -2.8, -2.4, ..., 1.6, 2.0, 2.4
2. width of each window
   - in this example, let's say I want W = 0.5
   - thus, the example windows will be [-3.2-0.25, -3.2+0.25], [-2.8-0.25, -2.8+0.25], ..., [2.4-0.25, 2.4+0.25]
   - note that the windows overlap, which is intended
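The window definition above can be written out in NumPy as a quick check. The range of k is an assumption chosen to reproduce the centers listed (-3.2 through 2.4), and the variable names are made up for this sketch.

```python
import numpy as np

step = 0.4   # spacing between window centers
W = 0.5      # full width of each window

# k = -8 ... 6 gives the centers -3.2, -2.8, ..., 2.4
centers = step * np.arange(-8, 7)
lower = centers - W / 2   # left edge of each window
upper = centers + W / 2   # right edge of each window

# Positive where a window overlaps its right-hand neighbour
overlap = upper[:-1] - lower[1:]
```

With these numbers every window overlaps the next one by W - step = 0.1, which is the overlap the list mentions.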
Having thus defined the windows, I would like to ask if there's a function that will produce the following data frame (or numpy array):
x y
-3.2 mean of y-values in x-value-window centered at -3.2
-2.8 mean of y-values in x-value-window centered at -2.8
-2.4 mean of y-values in x-value-window centered at -2.4
... ...
1.6 mean of y-values in x-value-window centered at 1.6
2.0 mean of y-values in x-value-window centered at 2.0
2.4 mean of y-values in x-value-window centered at 2.4
Is there anything that will do this for me? Or do I have to totally roll my own (and probably in a very slow python loop instead of fast numpy or pandas code)?
Extra 1: It would be even better if there's support for weighted windows (such as supported by Pandas's rolling_window function) but of course the weights in this case would not be based on how far the sample's row is from the center row of the window, but rather, how far the sample's x-value is from the center of the x-value-window.
Extra 2: It would be nice if there's support for statistics other than mean on the x-value-windows, e.g. (a) variance of the y-values in each x-value-window or (b) count of the number of samples falling within each x-value-window.
I first create a range of x values centered at zero and spaced k apart. The range is wide enough that the minimum value minus the width and the maximum value plus the width capture all x values.
I then iterate through this range. At each center, I use loc to select the y values whose x lies between the center minus w and the center plus w, and take the mean of the selection. These means make up the result dataframe. Note that x - w to x + w is a window of total width 2 * w; to match the question's definition, where W is the full width, use x - w / 2 and x + w / 2 instead. The output shown was produced with x - w and x + w.
import math
import numpy as np
import pandas as pd

k = .4  # spacing between window centers
w = .5  # each window spans [x - w, x + w]

np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])

# Window centers: multiples of k covering the x values
x_range = np.arange(math.floor((df.x.min() + w) / k) * k,
                    k * (math.ceil((df.x.max() - w) / k) + 1), k)
# Mean of y over the rows whose x falls inside each window
result = pd.DataFrame((df.loc[df.x.between(x - w, x + w), 'y'].mean() for x in x_range),
                      index=x_range, columns=['y_mean'])
result.index.name = 'centered_x'
>>> result
                 y_mean
centered_x
-2.400000e+00  0.653619
-2.000000e+00  0.733606
-1.600000e+00  0.576594
-1.200000e+00  0.150462
-8.000000e-01  0.065884
-4.000000e-01  0.022925
-8.881784e-16  0.211693
 4.000000e-01  0.057527
 8.000000e-01 -0.141970
 1.200000e+00  0.233695
 1.600000e+00  0.203570
 2.000000e+00  0.306409
 2.400000e+00  0.576789
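The Python-level loop can be avoided with NumPy broadcasting, which also makes it easy to return the count and variance asked for in Extra 2. This is a sketch rather than a library function: window_stats is a made-up name, the variance is the population variance, and the window is taken as center - width / 2 to center + width / 2, following the question's definition rather than the x - w to x + w used in the answer's code.

```python
import numpy as np
import pandas as pd

def window_stats(x, y, centers, width):
    """Count, mean and population variance of y per x-value window."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # mask[i, j] is True when sample j lies inside window i
    mask = np.abs(x[None, :] - centers[:, None]) <= width / 2
    count = mask.sum(axis=1)
    with np.errstate(invalid='ignore', divide='ignore'):
        mean = (mask * y).sum(axis=1) / count          # NaN for empty windows
        var = (mask * (y - mean[:, None]) ** 2).sum(axis=1) / count
    return pd.DataFrame({'count': count, 'mean': mean, 'var': var},
                        index=pd.Index(centers, name='center_x'))

# Tiny hand-checkable example
stats = window_stats(x=[0.0, 0.1, 0.3, 1.0], y=[1.0, 3.0, 5.0, 7.0],
                     centers=[0.0, 0.4, 2.0], width=0.5)
print(stats)
```

The mask has one row per window and one column per sample, so memory grows with the product of the two. For 100 samples and a few dozen windows that is negligible. Weighted windows (Extra 1) could be added by replacing the boolean mask with weights computed from the same distance matrix.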