Making ROC curves with results from cross_validate? - matplotlib

I am running 5 fold cross validation with a random forest as such:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
forest = RandomForestClassifier(n_estimators=100, max_depth=8, max_features=6)
cv_results = cross_validate(forest, X, y, cv=5, scoring=scoring)
However, I want to plot the ROC curves for the 5 outputs on one graph. The documentation only provides an example to plot the roc curve with cross validation when specifically using StratifiedKFold cross validation (see documentation here: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py)
I tried tweaking the code to make it work for cross_validate, but to no avail.
How do I make a ROC curve with the 5 results from the cross_validate output being plotted on a single graph?
Thanks in advance

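cross_validate is a model-validation helper rather than a splitter class, so it does not return the per-fold predictions that a ROC curve needs. You need to choose the splitter class that is right for you and loop over its splits yourself. You are probably after KFold. Something like this: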
from sklearn.model_selection import KFold
cv = KFold(n_splits=5)
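To make that concrete, here is a minimal sketch of the manual loop. It assumes a binary target, and it uses synthetic data from make_classification as a stand-in for the asker's X and y. The helper names fold_roc_curves and plot_fold_rocs are made up for illustration. It uses StratifiedKFold rather than KFold because that is the splitter cross_validate falls back to when cv=5 is passed with a classifier, so the folds should correspond to the ones in the question.

```python
import matplotlib.pyplot as plt
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold


def fold_roc_curves(estimator, X, y, n_splits=5):
    """Fit a fresh copy of `estimator` on each fold; return (fpr, tpr, auc) per fold."""
    cv = StratifiedKFold(n_splits=n_splits)
    curves = []
    for train_idx, test_idx in cv.split(X, y):
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        # Probability of the positive class on the held-out fold
        scores = model.predict_proba(X[test_idx])[:, 1]
        fpr, tpr, _ = roc_curve(y[test_idx], scores)
        curves.append((fpr, tpr, auc(fpr, tpr)))
    return curves


def plot_fold_rocs(curves):
    """Draw every fold's ROC curve on a single set of axes."""
    fig, ax = plt.subplots()
    for i, (fpr, tpr, area) in enumerate(curves, start=1):
        ax.plot(fpr, tpr, label=f"fold {i} (AUC = {area:.2f})")
    ax.plot([0, 1], [0, 1], "k--", label="chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend(loc="lower right")
    return fig


# Synthetic stand-in data (assumption: the real X and y are NumPy arrays)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100, max_depth=8, max_features=6, random_state=0
)
curves = fold_roc_curves(forest, X, y)
```

Calling plot_fold_rocs(curves) followed by plt.show() then draws all five curves on one graph.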

Related

Blurry XGBClassifier tree plot

I have trained an XGBClassifier called model and then plotted the tree as follows:
from xgboost import plot_tree
plot_tree(model); plt.show(dpi=1200)
The resulting plot is really blurry:
Does anyone know how to improve the quality of that plot?
I have tried to include dpi=1200 (see code above) but that doesn't make any difference.

Why does keras (SGD) optimizer.minimize() not reach global minimum in this example?

I'm in the process of completing a TensorFlow tutorial via DataCamp and am transcribing/replicating the code examples I am working through in my own Jupyter notebook.
Here are the original instructions from the coding problem :
I'm running the following snippet of code and am not able to arrive at the same result that I am generating within the tutorial, which I have confirmed are the correct values via a connected scatterplot of x vs. loss_function(x) as seen a bit further below.
# imports
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import Variable, keras
def loss_function(x):
    import math
    return 4.0*math.cos(x-1)+np.divide(math.cos(2.0*math.pi*x),x)
# Initialize x_1 and x_2
x_1 = Variable(6.0, np.float32)
x_2 = Variable(0.3, np.float32)
# Define the optimization operation
opt = keras.optimizers.SGD(learning_rate=0.01)
for j in range(100):
    # Perform minimization using the loss function and x_1
    opt.minimize(lambda: loss_function(x_1), var_list=[x_1])
    # Perform minimization using the loss function and x_2
    opt.minimize(lambda: loss_function(x_2), var_list=[x_2])
# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())
I drew a quick connected scatterplot to confirm (successfully) that the loss function I am using reproduces the graph provided by the example (seen in the screenshot above).
# Generate loss_function(x) values for given range of x-values
losses = []
for p in np.linspace(0.1, 6.0, 60):
    losses.append(loss_function(p))
# Define x,y coordinates
x_coordinates = list(np.linspace(0.1, 6.0, 60))
y_coordinates = losses
# Plot
plt.scatter(x_coordinates, y_coordinates)
plt.plot(x_coordinates, y_coordinates)
plt.title('Plot of Input values (x) vs. Losses')
plt.xlabel('x')
plt.ylabel('loss_function(x)')
plt.show()
Here are the resulting global and local minima, respectively, as per the DataCamp environment :
4.38 is the correct global minimum, and 0.42 indeed corresponds to the first local minimum on the graph's RHS (when starting from x_2 = 0.3).
And here are the results from my environment, both of which move opposite the direction that they should be moving towards when seeking to minimize the loss value:
I've spent the better part of the last 90 minutes trying to sort out why my results disagree with those of the DataCamp console / why the optimizer fails to minimize this loss for this simple toy example...?
I appreciate any suggestions that you might have after you've run the provided code in your own environments, many thanks in advance!!!
As it turned out, the difference in outputs arose from the default precision of tf.divide() (vs. np.divide()) and tf.cos() (vs. math.cos()), the operations specified in my transcribed, "custom" definition of loss_function().
loss_function() had been predefined in the body of the tutorial. When I inspected it with the inspect package (using inspect.getsourcelines(loss_function)) in order to redefine it in my own environment, the output did not clearly indicate that tf.divide and tf.cos had been used instead of the NumPy and math counterparts that my version of the code used.
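The actual numerical difference is quite small, but is apparently sufficient to push the optimizer in the opposite direction (away from the two respective minima). Note also that math.cos() returns a plain Python float, so TensorFlow cannot differentiate through that term; this may matter more here than precision does.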
After swapping in tf.divide() and tf.cos() (as seen below), I was able to arrive at the same results as seen in the DataCamp console.
Here is the loss_function code that reproduces the results seen in the console (screenshot):
def loss_function(x):
    import math
    return 4.0*tf.cos(x-1)+tf.divide(tf.cos(2.0*math.pi*x),x)
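As a cross-check that does not depend on TensorFlow, the same update rule can be sketched in plain Python with a hand-derived derivative. This is only an illustration under stated assumptions: loss_grad is derived by hand from the loss above, gradient_descent is a hypothetical helper, and the step count and learning rate are copied from the question. If the derivative is right, the two starting points should drift toward the values quoted from the DataCamp console.

```python
import math


def loss(x):
    """Same loss as above, written with plain Python floats."""
    return 4.0 * math.cos(x - 1.0) + math.cos(2.0 * math.pi * x) / x


def loss_grad(x):
    """Hand-derived d(loss)/dx (product/quotient rule on the second term)."""
    two_pi_x = 2.0 * math.pi * x
    return (
        -4.0 * math.sin(x - 1.0)
        - 2.0 * math.pi * math.sin(two_pi_x) / x
        - math.cos(two_pi_x) / x ** 2
    )


def gradient_descent(x0, learning_rate=0.01, steps=100):
    """Plain gradient descent: x <- x - learning_rate * gradient."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * loss_grad(x)
    return x


x_1 = gradient_descent(6.0)
x_2 = gradient_descent(0.3)
print(x_1, x_2)
```

If a TensorFlow version of the loss moves away from these values, that is a hint that part of the expression is not being differentiated.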

Why curve fit function does not combine all data points. How to get best fit?

I'm not sure how to choose a fitting function. Judging by the trend of the data points, I chose a Poisson distribution as my fitting function. The green curve is quite smooth, but the fitted curve is far away from the first data point at (0, 0.55). I want a smooth curve from the fitting function that also stays close to my actual data points. I tried increasing the number of bins, but I still get the same type of curve. Am I choosing the wrong fitting function, or am I missing something else?
`import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
def Poisson_fit(x, a):
    return a * np.exp(-x)
def Poisson(x):
    return np.exp(-x)
x_data = np.linspace(0, 5, 10)
print("x_data: ", x_data)
# x_data: [0., 0.55555556, 1.11111111, 1.66666667, 2.22222222, 2.77777778,
#          3.33333333, 3.88888889, 4.44444444, 5.]
# x holds the raw samples (not shown in the question)
hist, bin_edges = np.histogram(x, bins=10, density=True)
print("hist: ", hist)
# hist: [5.41041394e-01, 1.42611032e-01, 3.44975130e-02, 7.60221121e-03,
#        1.66115522e-03, 3.26808028e-04, 6.70741368e-05, 1.14168743e-05,
#        5.70843717e-06, 1.42710929e-06]
plt.scatter(x_data, hist, marker='o', color='red')
popt, pcov = optimize.curve_fit(Poisson_fit, x_data, hist)
plt.plot(x_data, Poisson_fit(x_data, *popt), linestyle='--',
         marker='.', color='red', label='Fit')
plt.plot(x_data, Poisson(x_data), marker='.', color='green', label='Poisson')`
#Second Graph(Find best fit)
In the following graph I have fitted two different distributions to the data points. It is hard for me to judge which is the best fit. Should I print the error of each fitting function to judge the best fit?
`perr = np.sqrt(np.diag(pcov))`
If all data-points need to coincide with the interpolating fit, splines (e.g. cubic splines) can be used, generally resulting in a reasonably smooth fit (only generally, because what is "reasonably smooth" depends both on the data and the application).
Example:
import numpy as np
from scipy.interpolate import CubicSpline
import pylab
x_data = np.linspace(0,5,10)
y_data = np.array([5.41041394e-01,1.42611032e-01,3.44975130e-02,
7.60221121e-03,1.66115522e-03,3.26808028e-04,
6.70741368e-05,1.14168743e-05,5.70843717e-06,
1.42710929e-06])
spline = CubicSpline(x_data, y_data)
plot_x = np.linspace(0,5,1000)
pylab.plot(x_data, y_data, 'b*', label='Data')
pylab.plot(plot_x, spline(plot_x), 'k-', label='Spline')
pylab.legend(loc='best')
pylab.show()
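On the second part of the question: perr describes the uncertainty of each fitted parameter, whereas candidate models are more commonly compared through their residuals. The sketch below is one hedged way to do that with the histogram values from the question. The two-parameter model exp_free_rate and the helper fit_and_score are assumptions introduced for illustration, not something the question used.

```python
import numpy as np
from scipy import optimize

x_data = np.linspace(0, 5, 10)
hist = np.array([5.41041394e-01, 1.42611032e-01, 3.44975130e-02,
                 7.60221121e-03, 1.66115522e-03, 3.26808028e-04,
                 6.70741368e-05, 1.14168743e-05, 5.70843717e-06,
                 1.42710929e-06])


def exp_fixed_rate(x, a):
    """The model from the question: amplitude is free, decay rate fixed at 1."""
    return a * np.exp(-x)


def exp_free_rate(x, a, b):
    """Hypothetical alternative: the decay rate b is fitted as well."""
    return a * np.exp(-b * x)


def fit_and_score(model, p0):
    """Fit `model`; return parameters, their 1-sigma errors, and the residual sum of squares."""
    popt, pcov = optimize.curve_fit(model, x_data, hist, p0=p0)
    residuals = hist - model(x_data, *popt)
    ssr = float(np.sum(residuals ** 2))
    perr = np.sqrt(np.diag(pcov))
    return popt, perr, ssr


popt_fixed, perr_fixed, ssr_fixed = fit_and_score(exp_fixed_rate, p0=[0.5])
# Start the larger model at the smaller model's solution (b = 1)
popt_free, perr_free, ssr_free = fit_and_score(exp_free_rate, p0=[popt_fixed[0], 1.0])

print("fixed rate:", popt_fixed, perr_fixed, ssr_fixed)
print("free rate: ", popt_free, perr_free, ssr_free)
```

A smaller residual sum of squares means the curve passes closer to the points. A model with more free parameters can always match or beat a simpler one on this measure, so the comparison is only fair between models of similar complexity.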

Bifurcation diagram using python

I have a simple question. Can we create a bifurcation diagram from any type of equation, or only from the logistic map, i.e.
x[i+1] = r*x[i]*(1-x[i])
What is the main idea behind making a bifurcation diagram? I have been working on this for the last couple of weeks but have made no progress. I just see the same equation mentioned above plotted everywhere. I have a different equation:
x[i+1] = ((a*x[i]*2**n + b) mod 2**n) / 2**n
where mod 2**n indicates the remainder after division by 2**n.
I have to write code to make a bifurcation diagram for the above equation.
I tried varying a, but it does not work.
import matplotlib.pyplot as plt
import numpy as np
def iter_map(a,N):
    x=np.zeros(N)
    x[0]=0.5
    for i in range(N-1):
        d=(a*x[i]*(2**n) +b)%(2**n)
        x[i+1]=d/2**n
    return x
N=5
n=8
a0 = 1.0
a_max= 4
step= 0.001
plt.figure()
for b in [0,1]:
    for a in np.arange(a0, a_max, step):
        x = iter_map(a, 8)
        plt.plot(a*np.ones_like(x),x)
plt.xlabel(r'$a$')
plt.ylabel(r'$x$')
plt.show()
The result does not look as expected
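The usual recipe for a bifurcation diagram applies to any iterated map, not only the logistic one. For each parameter value, iterate the map long enough for the transient to die out, then plot the remaining iterates against the parameter. Below is a minimal sketch of that recipe for the map in the question. The function names, the transient length (200), and the number of retained iterates (100) are assumptions chosen for illustration. The code above keeps only 8 iterates and includes the starting value, which is one possible reason the picture looks unexpected.

```python
import matplotlib.pyplot as plt
import numpy as np


def bifurcation_points(a_values, b, n=8, x0=0.5, n_transient=200, n_keep=100):
    """Iterate x -> ((a*x*2**n + b) mod 2**n) / 2**n for every a at once.

    Returns an array of shape (n_keep, len(a_values)) with the iterates
    that remain after the first n_transient steps are discarded.
    """
    modulus = 2 ** n
    x = np.full_like(a_values, x0, dtype=float)
    kept = np.empty((n_keep, a_values.size))
    for i in range(n_transient + n_keep):
        x = ((a_values * x * modulus + b) % modulus) / modulus
        if i >= n_transient:
            kept[i - n_transient] = x
    return kept


def plot_bifurcation(a_values, kept):
    """Scatter every retained iterate against its parameter value."""
    fig, ax = plt.subplots()
    a_grid = np.broadcast_to(a_values, kept.shape)
    ax.plot(a_grid.ravel(), kept.ravel(), ",k", alpha=0.3)
    ax.set_xlabel(r"$a$")
    ax.set_ylabel(r"$x$")
    return fig


a_values = np.arange(1.0, 4.0, 0.001)
kept = bifurcation_points(a_values, b=0)
```

Calling plot_bifurcation(a_values, kept) and then plt.show() draws the diagram; repeating with b=1 gives the second case from the question.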

Curve fitting a large data set

Right now, I'm trying to fit a curve to a large set of data; there are two arrays, x and y, each with 352 elements. I've fit a polynomial to the data, which works fine:
import numpy as np
import matplotlib.pyplot as plt
coeff=np.polyfit(x, y, 20)
poly=np.poly1d(coeff)
But I need a more accurately optimized curve, so I've been trying to fit a curve with scipy. Here's the code that I have so far:
import numpy as np
import scipy.optimize as sp
coeff=np.polyfit(x, y, 20)
poly=np.poly1d(coeff)
poly_y=poly(x)
def poly_func(x): return poly(x)
param=sp.curve_fit(poly_func, x, y)
But all it returns is this:
ValueError: Unable to determine number of fit parameters.
How can I get this to work? (Or how can I fit a curve to this data?)
Your fit function takes no parameters to fit: its only argument is x, so there is nothing for curve_fit to optimize.
curve_fit uses a non-linear optimizer, which needs an initial guess of the fitting parameters.
If no guess is given, it tries to determine the number of parameters via introspection and sets each initial value to one (something you almost never want). For your function the introspection finds no parameters at all, hence the error.
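A minimal sketch of a model function that curve_fit can work with is shown below. It assumes synthetic data in place of the asker's x and y and a cubic instead of degree 20, to keep the example small and well conditioned. poly_model is a hypothetical name. Because the model takes its coefficients through *coeffs, curve_fit cannot count them by introspection, so p0 must be supplied; the polyfit result is used as the initial guess.

```python
import numpy as np
from scipy import optimize

# Synthetic stand-in for the asker's data: 352 noisy samples of a known cubic
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 352)
true_coeffs = np.array([2.0, -1.0, 0.5, 3.0])
y = np.polyval(true_coeffs, x) + rng.normal(scale=0.05, size=x.size)


def poly_model(x, *coeffs):
    """Polynomial whose coefficients (highest power first) are the fit parameters."""
    return np.polyval(coeffs, x)


# Initial guess from the linear least-squares fit
p0 = np.polyfit(x, y, deg=3)
popt, pcov = optimize.curve_fit(poly_model, x, y, p0=p0)
print(popt)
```

Note that a polynomial is linear in its coefficients, so curve_fit ends up at essentially the same least-squares solution that polyfit already found. A better fit would have to come from a different model function, not from a different optimizer.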