sns.pairplot equivalent for sns.boxplot?

sns.pairplot(X_train,
x_vars = my_dict['int64'],
y_vars = ["SalePrice"])
The above gives me a nice scatterplot of all the X's vs. the y outcome variable. See picture below:
I would like to do the same thing with my categorical variables - plot all the X categorical variables in individual Boxplots vs. the y outcome variable. However, I can only do it for 1 of the categorical variables at a time. Is there a similar sns.pairplot function that lets me plot all the boxplots together at once?
My code is:
sns.boxplot(data = X_train, x = my_dict['object'][0], y = 'SalePrice')
which produces a very nice overview:
This is my my_dict['object'] list:
['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC',
'CentralAir', 'Electrical', 'LowQualFinSF', 'BsmtFullBath',
'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces',
'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'GarageQual',
'GarageCond', 'PavedDrive', '3SsnPorch', 'PoolArea', 'PoolQC', 'Fence',
'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition']
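As far as I know there is no single pairplot-style call for boxplots, but a grid can be built by hand: create the subplots with matplotlib and pass each Axes to sns.boxplot through its ax argument. Below is a minimal sketch under that assumption. The helper name boxplot_grid, the ncols default and the demo data frame are made up for illustration; with the real data you would pass X_train, my_dict['object'] and 'SalePrice'.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def boxplot_grid(df, x_vars, y_var, ncols=3):
    """Draw one boxplot of y_var per column in x_vars on a grid of subplots."""
    nrows = math.ceil(len(x_vars) / ncols)
    fig, axes = plt.subplots(nrows, ncols,
                             figsize=(5 * ncols, 4 * nrows), squeeze=False)
    flat = axes.ravel()
    for ax, col in zip(flat, x_vars):
        sns.boxplot(data=df, x=col, y=y_var, ax=ax)
        ax.tick_params(axis="x", labelrotation=90)
    # hide the unused panels in the last row
    for ax in flat[len(x_vars):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


# Stand-in data (hypothetical); replace with X_train and my_dict['object']
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "MSZoning": rng.choice(["RL", "RM", "FV"], size=120),
    "Street": rng.choice(["Pave", "Grvl"], size=120),
    "LotShape": rng.choice(["Reg", "IR1", "IR2"], size=120),
    "CentralAir": rng.choice(["Y", "N"], size=120),
    "SalePrice": rng.normal(180000, 40000, size=120),
})
cat_cols = ["MSZoning", "Street", "LotShape", "CentralAir"]
fig, axes = boxplot_grid(demo, cat_cols, "SalePrice")
```

With 61 columns the figure gets tall, so it may be easier to call the helper on slices of the list, e.g. my_dict['object'][:12].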

pyplot return unexpected contour paths

I want to use pyplot.contour to extract isolines from 2D data.
My problem is that this method returns unexpected results: when I use levels clearly outside the data range, the contour result still contains paths.
Here is an example reproducing the issue:
import numpy
from matplotlib import pyplot
n = 256
x = numpy.linspace(-3., 3., n)
y = numpy.linspace(-3., 3., n)
X, Y = numpy.meshgrid(x, y)
Z = X * numpy.sinc(X ** 2 + Y ** 2)
levels = [1000]
print(f'data min : {Z.min()}')
print(f'data max : {Z.max()}')
print(f'levels : {levels}')
isolines = pyplot.contour(X, Y, Z, levels, colors='red')
for i, collection in enumerate(isolines.collections):
    npaths = len(collection.get_paths())
    print(f'collection[{i}] has {npaths} paths')
pyplot.show()
Which outputs
data min : -0.47993931267102286
data max : 0.47993931267102286
levels : [1000]
/path/to/issue.py:15: UserWarning: No contour levels were found within the data range.
isolines = pyplot.contour(X, Y, Z, levels, colors='red')
collection[0] has 1 paths
I expected the contour to be empty and not contain 1 path. Am I missing something obvious here?
As of 2023/01/11, it is a bug in matplotlib :
https://github.com/matplotlib/matplotlib/issues/23778
As the fix has not landed yet, my temporary workaround is to detect when levels are outside Z value range, and empty the contour collections in that case.
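quadcontourset = pyplot.contour(X, Y, Z, levels)
# levels may be a plain list, so convert it before building the mask
levels = numpy.asarray(levels)
zmin = numpy.min(Z)
zmax = numpy.max(Z)
inside = (levels > zmin) & (levels < zmax)
levels_in = levels[inside]
if levels_in.size == 0:
    # no requested level lies inside the data range: drop the spurious paths
    quadcontourset.collections.clear()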
I reproduced the issue with matplotlib 3.5.3. It is not fixed in the current 3.6.2 release, but a fix seems to be on track at
https://github.com/matplotlib/matplotlib/pull/24912
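A variation on the workaround above is to filter the levels before calling contour, and to skip the call entirely when nothing is left. This is only a sketch: levels_within_data and safe_contour are hypothetical helper names, not matplotlib functions, and the strict inequalities simply mirror the check used in the workaround.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
from matplotlib import pyplot
import numpy


def levels_within_data(Z, levels):
    """Return only the levels strictly inside the value range of Z."""
    levels = numpy.atleast_1d(numpy.asarray(levels, dtype=float))
    zmin, zmax = numpy.nanmin(Z), numpy.nanmax(Z)
    return levels[(levels > zmin) & (levels < zmax)]


def safe_contour(X, Y, Z, levels, **kwargs):
    """Call pyplot.contour only if at least one level can produce an isoline."""
    valid = levels_within_data(Z, levels)
    if valid.size == 0:
        return None  # nothing to draw, so no spurious path is created
    return pyplot.contour(X, Y, Z, valid, **kwargs)


# Same data as in the question
n = 256
x = numpy.linspace(-3., 3., n)
y = numpy.linspace(-3., 3., n)
X, Y = numpy.meshgrid(x, y)
Z = X * numpy.sinc(X ** 2 + Y ** 2)

empty = safe_contour(X, Y, Z, [1000], colors='red')
found = safe_contour(X, Y, Z, [0.2, 1000], colors='red')
```

Callers then have to handle the None case, which is the price for never touching the contour set's internals.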

I am passing a function through a dataframe in subgroups and I would like to get a separate plot for each subgroup. Now the plots overlap

I want to calculate the prediction bands (upb and lpb) for the subgroups in my dataframe and plot each of them individually. I would like to plot the results in separate plots because the prediction bands currently overlap. I tried to save the output (lpb and upb) in a dataframe (predictionbounds), but I get an error, and I don't think this approach is elegant. How can I get 4 different plots out of the loop?
## 1,2,3 are functions I use in the loop
# 1) linear model
def model(x, b, start):
    return (b*x) + start
# 2) Prediction band
def predband .....
return lpb, upb
# 3) function to calculate CI of fitted parameters
def conf_int:........
return params_ci
### prepare dataset for loop
# group dataset in subgroups. Loops is applied in each subgroup
dfforR2 = dfdata.groupby(["treatment1", "treatment2"])
variables={'treatment1':treatment1, 'treatment2':'treatment2','b':float, 'start':float,
'r_2':float}
results = pd.DataFrame(variables, index=[])
#Here I try to create an empty dataframe so I can save the variables 'lpb' and 'upb'
#bounds={'treatment1':treatment1, 'treatment2':'treatment2', 'lpb':float, 'upb':float}
#predictionbounds=pd.DataFrame(bounds, index=[])
### loop and make fit
for key, g in dfforR2:
    x= np.linspace(0, 2, )
    popt, pcov = curve_fit(model, g['x'], g['y'])
    confint=(conf_int(g['y'], alpha, popt, pcov))
    lpb, upb=predband(x, g['x'], g['y'], popt, model, conf=0.95)
    new_row = {'treatment1':key[0], 'treatment2':key[1], 'slope': popt[0], 'start':popt[1],
               'r_2':r_2}
    results=results.append(new_row, ignore_index=True)
Problem starts below:
    #line below does not work as I get an error:
    #AttributeError: 'dict' object has no attribute 'append'
    #new_bound = {'treatment1':key[0], 'treatment2':key[1], ' lpb': lpb, 'upb':upb}
    #this works but I would like to print different graphs
    plt.fill_between(x, lpb, upb, color = 'grey', alpha = 0.15)
#### Plot manually the output
# construct fitted curve for this treatment
#x= np.linspace(0, 1500, 400)
a = model(x, results.iloc[0,2], results.iloc[0,3])
plt.plot(x, a, color='tab:blue', label='Ctrl_W')
Currently this is what I get:
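One way to get a separate plot per subgroup is to create the Axes up front with plt.subplots and draw on a different one in each loop iteration, while keeping the bands in a plain dict keyed by the group. The sketch below is not the original code: the data are synthetic, np.polyfit stands in for curve_fit, and the band is a crude fit ± 2 residual standard deviations standing in for predband (whose definition is not shown in the question).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for dfdata: 2 x 2 treatment combinations (hypothetical values)
rng = np.random.default_rng(1)
rows = []
for t1 in ("ctrl", "drug"):
    for t2 in ("wet", "dry"):
        xs = rng.uniform(0, 2, 30)
        ys = rng.uniform(0.5, 2.0) * xs + rng.normal(0, 0.2, 30)
        rows.append(pd.DataFrame({"treatment1": t1, "treatment2": t2,
                                  "x": xs, "y": ys}))
dfdata = pd.concat(rows, ignore_index=True)

groups = dfdata.groupby(["treatment1", "treatment2"])
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)
x = np.linspace(0, 2, 50)
bounds = {}  # (treatment1, treatment2) -> (lpb, upb)

for ax, (key, g) in zip(axes.ravel(), groups):
    gx, gy = g["x"].to_numpy(), g["y"].to_numpy()
    slope, start = np.polyfit(gx, gy, 1)        # stand-in for curve_fit(model, ...)
    fit = slope * x + start
    resid_sd = np.std(gy - (slope * gx + start), ddof=2)
    lpb, upb = fit - 2 * resid_sd, fit + 2 * resid_sd  # stand-in for predband(...)
    bounds[key] = (lpb, upb)
    # everything for this subgroup goes on its own Axes
    ax.fill_between(x, lpb, upb, color="grey", alpha=0.15)
    ax.plot(x, fit, color="tab:blue")
    ax.scatter(gx, gy, s=8, color="black")
    ax.set_title(f"{key[0]} / {key[1]}")

fig.tight_layout()
```

Storing the arrays in a dict also sidesteps the 'dict' object has no attribute 'append' error, since entries are added by key assignment rather than by append.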

How to optimize the linear coefficients for numpy arrays in a maximization function?

I have to find the coefficients for three numpy arrays that maximize my evaluation function.
I have a target array called train['target'] and three prediction arrays named array1, array2 and array3.
I want to find the best linear coefficients x, y, z for these three arrays, i.e. the ones that maximize the function
roc_auc_score(train['target'], x*array1 + y*array2 + z*array3)
The above function is at its maximum when the prediction is closest to the target,
i.e. x*array1 + y*array2 + z*array3 should be close to train['target'].
The range is 0 <= x, y, z <= 1.
Basically I am trying to find the weights x, y, z for the three arrays that make
x*array1 + y*array2 + z*array3 closest to train['target'].
Any help in getting this would be appreciated.
I used pulp.LpProblem('Giapetto', pulp.LpMaximize) to do the maximization. It works for plain numbers, integers etc., but it fails when I try it with arrays.
import numpy as np
import pulp
# create the LP object, set up as a maximization problem
prob = pulp.LpProblem('Giapetto', pulp.LpMaximize)
# set up decision variables
x = pulp.LpVariable('x', lowBound=0)
y = pulp.LpVariable('y', lowBound=0)
z = pulp.LpVariable('z', lowBound=0)
score = roc_auc_score(train['target'],x*array1+ y*array2 + z*array3)
prob += score
coef = x+y+z
prob += (coef==1)
# solve the LP using the default solver
optimization_result = prob.solve()
# make sure we got an optimal solution
assert optimization_result == pulp.LpStatusOptimal
# display the results
for var in (x, y, z):
    print('Optimal weekly number of {} to produce: {:1.0f}'.format(var.name, var.value()))
Getting error at the line
score = roc_auc_score(train['target'],x*array1+ y*array2 + z*array3)
TypeError: unsupported operand type(s) for /: 'int' and 'LpVariable'
Can't progress beyond this line when using arrays. Not sure if my approach is correct. Any help in optimizing the function would be appreciated.
When you add sums of array elements to a PuLP model, you have to use built-in PuLP constructs like lpSum to do it -- you can't just add arrays together (as you discovered).
So your score definition should look something like this:
score = pulp.lpSum([train['target'][i] - (x * array1[i] + y * array2[i] + z * array3[i]) for i in arr_ind])
A few notes about this:
[+] You didn't provide the definition of roc_auc_score so I just pretended that it equals the sum of the element-wise difference between the target array and the weighted sum of the other 3 arrays.
[+] I suspect your actual calculation for roc_auc_score is nonlinear; more on this below.
[+] arr_ind is a list of the indices of the arrays, which I created like this:
# build array index
arr_ind = range(len(array1))
[+] You also didn't include the arrays, so I created them like this:
array1 = np.random.rand(10, 1)
array2 = np.random.rand(10, 1)
array3 = np.random.rand(10, 1)
train = {}
train['target'] = np.ones((10, 1))
Here is my complete code, which compiles and executes, though I'm sure it doesn't give you the result you are hoping for, since I just guessed about target and roc_auc_score:
import numpy as np
import pulp
# create the LP object, set up as a maximization problem
prob = pulp.LpProblem('Giapetto', pulp.LpMaximize)
# dummy arrays since arrays weren't in OP code
array1 = np.random.rand(10, 1)
array2 = np.random.rand(10, 1)
array3 = np.random.rand(10, 1)
# build array index
arr_ind = range(len(array1))
# set up decision variables
x = pulp.LpVariable('x', lowBound=0)
y = pulp.LpVariable('y', lowBound=0)
z = pulp.LpVariable('z', lowBound=0)
# dummy roc_auc_score since roc_auc_score wasn't in OP code
train = {}
train['target'] = np.ones((10, 1))
score = pulp.lpSum([train['target'][i] - (x * array1[i] + y * array2[i] + z * array3[i]) for i in arr_ind])
prob += score
coef = x + y + z
prob += coef == 1
# solve the LP using the default solver
optimization_result = prob.solve()
# make sure we got an optimal solution
assert optimization_result == pulp.LpStatusOptimal
# display the results
for var in (x, y, z):
    print('Optimal weekly number of {} to produce: {:1.0f}'.format(var.name, var.value()))
Output:
Optimal weekly number of x to produce: 0
Optimal weekly number of y to produce: 0
Optimal weekly number of z to produce: 1
Process finished with exit code 0
Now, if your roc_auc_score function is nonlinear, you will have additional troubles. I would encourage you to try to formulate the score in a way that is linear, possibly using additional variables (for example, if you want the score to be an absolute value).
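If the score really is an AUC, it depends only on how the blended predictions are ranked, so it is not linear in x, y, z and an LP solver is not a natural fit. With only three weights, a plain grid search over weights that sum to 1 may be enough. The sketch below swaps in that technique instead of PuLP. The auc helper is a small numpy stand-in written for this example (not scikit-learn's roc_auc_score), best_weights and steps are made-up names, and the arrays are random placeholders.

```python
import itertools

import numpy as np


def auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    y_score = np.asarray(y_score, dtype=float).ravel()
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def best_weights(target, preds, steps=20):
    """Grid search over non-negative weights (x, y, z) with x + y + z = 1."""
    best_score, best_w = -1.0, None
    for i, j in itertools.product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        w = np.array([i, j, steps - i - j]) / steps
        blend = sum(wk * p for wk, p in zip(w, preds))
        score = auc(target, blend)
        if score > best_score:
            best_score, best_w = score, w
    return best_w, best_score


# Placeholder data (hypothetical): three noisy predictors of a binary target
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
array1 = target + rng.normal(0, 1.0, size=200)
array2 = target + rng.normal(0, 1.5, size=200)
array3 = target + rng.normal(0, 2.0, size=200)

weights, score = best_weights(target, [array1, array2, array3])
print(weights, score)
```

Fixing the sum to 1 loses nothing here: multiplying all three weights by the same positive constant leaves the ranking, and therefore the AUC, unchanged. The pairwise auc helper needs memory proportional to positives times negatives, so for large arrays a rank-based implementation would be the better choice.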

Visualising an individual 2d graph for all points on a plane

I have an M vs N curve (let's take it to be a sigmoid, for ease of understanding) for given values of the parameters P and Q. I need to visualise the M vs N curves for a range of values of P and Q (assume values between 0 and 1 in steps of 0.1, i.e. 0.1, 0.2, ..., 0.9 for both P and Q).
The only solution that I've found for this problem is a Trellis plot (essentially a matrix of plots). I'd like to know if there is any other method to visualise this sort of 4d(?) relationship besides Trellis plots. Thanks.
I'm not sure I understand what you're hoping for, so let me know if this is on the right track. Below are three examples using R.
The first is indeed a matrix of plots where each panel represents a different value of q and, within each panel, each curve represents a different value of p. The second is a 3D plot which looks at a surface based on three of the variables with the fourth fixed. The third is a Shiny app that creates the same interactive plot as in the second example but also provides a slider that allows you to change p and see how the plot changes. Unfortunately, I'm not sure how to embed the interactive plots in Stackoverflow so I've just provided the code.
I'm not sure if there's an elegant way to look at all four variables at the same time, but maybe someone will come along with additional options.
Matrix of plots for various values of p and q
library(tidyverse)
theme_set(theme_classic())
# Function to plot
my_fun = function(x, p, q) {
  1/(1 + exp(p + q*x))
}
# Parameters
params = expand.grid(p=seq(-2,2,length=6), q=seq(-1,1,length=11))
# x-values to feed to my_fun
x = seq(-10,10,0.1)
# Generate data frame for plotting
dat = map2_df(params$p, params$q, function(p, q) {
  data.frame(p=p, q=q, x, y=my_fun(x, p, q))
})
ggplot(dat, aes(x,y,colour=p, group=p)) +
  geom_line() +
  facet_grid(. ~ q, labeller=label_both) +
  labs(colour="p") +
  scale_colour_gradient(low="red", high="blue") +
  theme(legend.position="bottom")
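For readers working in Python rather than R, roughly the same matrix of plots can be sketched with numpy broadcasting and matplotlib. This is an assumed translation of the idea, not part of the original answer: it evaluates the same function on the same p and q grids, draws one panel per q, and one line per p in each panel.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def my_fun(x, p, q):
    """Same function as in the R example."""
    return 1.0 / (1.0 + np.exp(p + q * x))


p_vals = np.linspace(-2, 2, 6)
q_vals = np.linspace(-1, 1, 11)
x = np.linspace(-10, 10, 201)

# Broadcasting gives the whole family at once: Y[i, j, :] is the curve for p_i, q_j
Y = my_fun(x[None, None, :], p_vals[:, None, None], q_vals[None, :, None])

fig, axes = plt.subplots(1, len(q_vals), figsize=(22, 3), sharey=True)
colours = plt.cm.coolwarm(np.linspace(0, 1, len(p_vals)))
for j, ax in enumerate(axes):
    for i, colour in enumerate(colours):
        ax.plot(x, Y[i, j], color=colour, label=f"p = {p_vals[i]:.1f}")
    ax.set_title(f"q = {q_vals[j]:.1f}")
axes[0].legend(fontsize=6)
fig.tight_layout()
```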
3D plot with one variable fixed
The code below will produce an interactive 3D plot that you can zoom and rotate. I've fixed the value of p and drawn a plot of the y surface for a grid of x and q values.
library(rgl)
x = seq(-10,10,0.1)
q = seq(-1,1,0.01)
y = outer(x, q, function(a, b) 1/(1 + exp(1 + b*a)))
persp3d(x, q, y, col=hcl(240,80,65), specular="grey20",
xlab = "x", ylab = "q", zlab = "y")
I'm not sure how to embed the interactive plot, but here's a static image of one viewing angle:
Shiny app
The code below will create the same plot as above, but with the added ability to vary p with a slider and see how the plot changes.
Open an R script file and paste in the code below. Save it as app.r in its own directory then run the code. Both an rgl window and the Shiny app page with the slider for controlling the value of p should open. Resize the windows as desired and then move the slider to see how the function surface changes for various values of p.
library(shiny)
# Define UI for application that draws an interactive plot
ui <- fluidPage(
  # Application title
  titlePanel("Plot the function 1/(1 + exp(p + q*x))"),
  # Sidebar with a slider input for the value of p
  sidebarLayout(
    sidebarPanel(
      sliderInput("p",
                  "Vary the value of p and see how the plot changes",
                  min = -2,
                  max = 2,
                  value = 1,
                  step=0.2)
    ),
    # Show the plot of the function surface
    mainPanel(
      plotOutput("distPlot")
    )
  )
)
# Define server logic required to draw the plot
server <- function(input, output) {
  output$distPlot <- renderPlot({
    library(rgl)
    x = seq(-10,10,0.1)
    q = seq(-1,1,0.01)
    y = outer(x, q, function(a, b) 1/(1 + exp(input$p + b*a)))
    persp3d(x, q, y, col=hcl(240,50,65), specular="grey20",
            xlab = "x", ylab = "q", zlab = "y")
  })
}
# Run the application
shinyApp(ui = ui, server = server)

Storing plot objects in a list

I asked this question yesterday about storing a plot within an object. I tried implementing the first approach (aware that I did not specify that I was using qplot() in my original question) and noticed that it did not work as expected.
library(ggplot2) # add ggplot2
string = "C:/example.pdf" # Setup pdf
pdf(string,height=6,width=9)
x_range <- range(1,50) # Specify Range
# Create a list to hold the plot objects.
pltList <- list()
pltList[]
for(i in 1 : 16){
  # Organise data
  y = (1:50) * i * 1000 # Get y col
  x = (1:50) # get x col
  y = log(y) # Use natural log
  # Regression
  lm.0 = lm(formula = y ~ x) # make linear model
  inter = summary(lm.0)$coefficients[1,1] # Get intercept
  slop = summary(lm.0)$coefficients[2,1] # Get slope
  # Make plot name
  pltName <- paste( 'a', i, sep = '' )
  # make plot object
  p <- qplot(
    x, y,
    xlab = "Radius [km]",
    ylab = "Services [log]",
    xlim = x_range,
    main = paste("Sample",i)
  ) + geom_abline(intercept = inter, slope = slop, colour = "red", size = 1)
  print(p)
  pltList[[pltName]] = p
}
# close the PDF file
dev.off()
I have used sample numbers in this case so the code runs if it is just copied. I did spend a few hours puzzling over this but I cannot figure out what is going wrong. It writes the first set of pdfs without problem, so I have 16 pdfs with the correct plots.
Then when I use this piece of code:
string = "C:/test_tabloid.pdf"
pdf(string, height = 11, width = 17)
grid.newpage()
pushViewport( viewport( layout = grid.layout(3, 3) ) )
vplayout <- function(x, y){viewport(layout.pos.row = x, layout.pos.col = y)}
counter = 1
# Page 1
for (i in 1:3){
  for (j in 1:3){
    pltName <- paste( 'a', counter, sep = '' )
    print( pltList[[pltName]], vp = vplayout(i,j) )
    counter = counter + 1
  }
}
dev.off()
the result I get is the last linear model line (abline) on every graph, but the data does not change. When I check my list of plots, it seems that all of them become overwritten by the most recent plot (with the exception of the abline object).
A less important secondary question was how to generate a multi-page pdf with several plots on each page, but the main goal of my code was to store the plots in a list that I could access at a later date.
Ok, so if your plot command is changed to
p <- qplot(data = data.frame(x = x, y = y),
x, y,
xlab = "Radius [km]",
ylab = "Services [log]",
xlim = x_range,
ylim = c(0,10),
main = paste("Sample",i)
) + geom_abline(intercept = inter, slope = slop, colour = "red", size = 1)
then everything works as expected. Here's what I suspect is happening (although Hadley could probably clarify things). When ggplot2 "saves" the data, what it actually does is save a data frame, and the names of the parameters. So for the command as I have given it, you get
> summary(pltList[["a1"]])
data: x, y [50x2]
mapping: x = x, y = y
scales: x, y
faceting: facet_grid(. ~ ., FALSE)
-----------------------------------
geom_point:
stat_identity:
position_identity: (width = NULL, height = NULL)
mapping: group = 1
geom_abline: colour = red, size = 1
stat_abline: intercept = 2.55595281266726, slope = 0.05543539319091
position_identity: (width = NULL, height = NULL)
However, if you don't specify a data parameter in qplot, all the variables get evaluated in the current scope, because there is no attached (read: saved) data frame.
data: [0x0]
mapping: x = x, y = y
scales: x, y
faceting: facet_grid(. ~ ., FALSE)
-----------------------------------
geom_point:
stat_identity:
position_identity: (width = NULL, height = NULL)
mapping: group = 1
geom_abline: colour = red, size = 1
stat_abline: intercept = 2.55595281266726, slope = 0.05543539319091
position_identity: (width = NULL, height = NULL)
So when the plot is generated the second time around, rather than using the original values, it uses the current values of x and y.
I think you should use the data argument in qplot, i.e., store your vectors in a data frame.
See Hadley's book, Section 4.4:
The restriction on the data is simple: it must be a data frame. This is restrictive, and unlike other graphics packages in R. Lattice functions can take an optional data frame or use vectors directly from the global environment. ...
The data is stored in the plot object as a copy, not a reference. This has two
important consequences: if your data changes, the plot will not; and ggplot2 objects are entirely self-contained so that they can be save()d to disk and later load()ed and plotted without needing anything else from that session.
There is a bug in your code concerning list subscripting. It should be
pltList[[pltName]]
not
pltList[pltName]
Note:
class(pltList[1])
[1] "list"
pltList[1] is a list containing the first element of pltList.
class(pltList[[1]])
[1] "ggplot"
pltList[[1]] is the first element of pltList.
For your second question: Multi-page pdfs are easy -- see help(pdf):
onefile: logical: if true (the default) allow multiple figures in one
file. If false, generate a file with name containing the
page number for each page. Defaults to ‘TRUE’.
For your main question, I don't understand if you want to store the plot inputs in a list for later processing, or the plot outputs. If it is the latter, I am not sure that plot() returns an object you can store and retrieve.
Another suggestion regarding your second question would be to use either Sweave or Brew as they will give you complete control over how you display your multi-page pdf.