I visualized it as a heatmap to see how the features are correlated, but only one specific feature is not correlated. Why is this happening? - kaggle

Initial feature description:
The values of the name feature are changed to "Mr", "Mrs", "Miss", "Master", and "Other", and the values are converted to 0, 1, 2, 3, 4 respectively. (dtype is int64)
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths=0.2,annot_kws={'size':20})
fig=plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
I visualized it as a heatmap to see the correlation of the Titanic features in Kaggle, but only the feature called Initial does not show the correlation. Why is this?

It looks like there may be two similarly named columns (for example Initial and initial), which suggests a typo, so please check the code.
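One way to narrow this down, sketched with a small made-up frame rather than the actual Titanic data: check whether the column name appears twice under different spellings, what dtype each column has, and whether any column ends up all NaN in the correlation matrix. A non-numeric column cannot be correlated, and a constant column yields NaN, which the heatmap draws as an empty cell. The column names and values below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame; all values are made up.
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass":   [3, 1, 3, 2, 1],
    "Initial":  ["Mr", "Mrs", "Miss", "Master", "Other"],  # still strings
    "initial":  [0, 1, 2, 3, 4],                           # encoded copy under another name
})

# 1. Look for near-duplicate names that differ only in case or whitespace.
normalized = data.columns.str.strip().str.lower()
suspects = list(data.columns[normalized.duplicated(keep=False)])

# 2. Inspect dtypes: object columns cannot take part in a correlation.
non_numeric = list(data.select_dtypes(exclude="number").columns)

# 3. Correlate only the numeric columns; a constant column would be all NaN.
corr = data.select_dtypes(include="number").corr()
all_nan = [c for c in corr.columns if corr[c].isna().all()]

print("possible duplicates:", suspects)
print("non-numeric columns:", non_numeric)
print("all-NaN in corr():", all_nan)
```

If the column you expect shows up in the first or second list, the encoded values were probably written to a different column than the one being plotted.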

Related

Annotating numeric values on grouped bars chart in pyplot

Good evening all,
I have a pd.dataframe called plot_eigen_vecs_df which is of (3,11) dimension, and I am plotting each column value grouped by rows on a bar chart. I am using the following code:
plot_eigen_vecs_df.plot(kind='bar', figsize=(12, 8),
                        title='First 3 PCs factor loadings',
                        xlabel='Evects', legend=True)
The result is this graph:
(image: grouped bar chart titled 'First 3 PCs factor loadings')
I would like to keep the graph (grouped) exactly as it is, but I need to show the numeric value above each bar.
Thank you
I tried the add_label method, but unfortunately I am currently using a version of pyplot which is not the most recent, so .add_label doesn't work for me. Could you please help on the matter?
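A possible workaround that does not depend on the newer helper (presumably Axes.bar_label, which was added in Matplotlib 3.4) is to loop over the bar rectangles in ax.patches and annotate each one. The frame below is a random stand-in with the same (3, 11) shape; its index and column names are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for plot_eigen_vecs_df: 3 rows (PCs) x 11 columns.
rng = np.random.default_rng(0)
plot_eigen_vecs_df = pd.DataFrame(
    rng.uniform(-1, 1, size=(3, 11)),
    index=["PC1", "PC2", "PC3"],
    columns=[f"f{i}" for i in range(11)],
)

ax = plot_eigen_vecs_df.plot(kind="bar", figsize=(12, 8),
                             title="First 3 PCs factor loadings", legend=True)
ax.set_xlabel("Evects")

# Each bar is a Rectangle in ax.patches; put its value at the outer end.
for bar in ax.patches:
    height = bar.get_height()
    ax.annotate(
        f"{height:.2f}",
        xy=(bar.get_x() + bar.get_width() / 2, height),
        xytext=(0, 3 if height >= 0 else -3),  # small offset in points
        textcoords="offset points",
        ha="center",
        va="bottom" if height >= 0 else "top",
        fontsize=7,
    )

plt.savefig("loadings.png")  # or plt.show() in an interactive session
```

The offset flips for negative bars so the label sits below them instead of overlapping the bar.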

make specific data points in scatter plot seaborn more visible [duplicate]

I have a Seaborn scatterplot and am trying to control the plotting order with 'hue_order', but it is not working as I would have expected (I can't get the blue dot to show on top of the gray).
import pandas as pd
import seaborn as sns

x = [1, 2, 3, 1, 2, 3]
cat = ['N', 'Y', 'N', 'N', 'N']  # only 5 labels, so zip() below keeps 5 rows
test = pd.DataFrame(list(zip(x, cat)),
                    columns=['x', 'cat'])
display(test)  # notebook helper
colors = {'N': 'gray', 'Y': 'blue'}
sns.scatterplot(data=test, x='x', y='x',
                hue='cat', hue_order=['Y', 'N'],
                palette=colors)
Flipping the 'hue_order' to hue_order=['N', 'Y', ] doesn't change the plot. How can I get the 'Y' category to plot on top of the 'N' category? My actual data has duplicate x,y ordinates that are differentiated by the category column.
The reason this is happening is that, unlike most plotting functions, scatterplot doesn't (internally) iterate over the hue levels when it's constructing the plot. It draws a single scatterplot and then sets the color of the elements with a vector. It does this so that you don't end up with all of the points from the final hue level on top of all the points from the penultimate hue level on top of all the ... etc. But it means that the scatterplot z-ordering is insensitive to the hue ordering and reflects only the order in the input data.
So you could use your desired hue order to sort the input data:
import numpy as np

hue_order = ["N", "Y"]
colors = {'N': 'gray', 'Y': 'blue'}
sns.scatterplot(
    # Sort rows by their position in hue_order so 'Y' is drawn last
    data=test.sort_values('cat', key=np.vectorize(hue_order.index)),
    x='x', y='x',
    hue='cat', hue_order=hue_order,
    palette=colors, s=100,  # Embiggen the points to see what's happening
)
There may be a more efficient way to do that "sort by list of unique values" built into pandas; I am not sure.
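One candidate for that built-in way, offered as a sketch rather than a definitive answer: make the column an ordered categorical whose categories follow the desired order, then sort on it. The data mirrors the five rows from the question; test_sorted is a name introduced here for illustration.

```python
import pandas as pd

x = [1, 2, 3, 1, 2]
cat = ["N", "Y", "N", "N", "N"]
test = pd.DataFrame({"x": x, "cat": cat})

hue_order = ["N", "Y"]  # the last entry ends up drawn last, i.e. on top

# An ordered categorical sorts by position in hue_order, not alphabetically.
test["cat"] = pd.Categorical(test["cat"], categories=hue_order, ordered=True)
test_sorted = test.sort_values("cat", kind="stable")

print(test_sorted)
```

Passing test_sorted as data to sns.scatterplot should then put the 'Y' point on top, since it is the last row.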
TLDR: Before plotting, sort the data so that the dominant color appears last in the data. Here, it could just be:
test = test.sort_values('cat') # ascending = True
It seems that hue_order doesn't affect the order (or z-order) in which things are plotted. Rather, it affects how colors are assigned. E.g., if you don't specify a specific mapping of categories to colors (i.e. you just use a list of colors or a color palette), this parameter can determine whether 'N' or 'Y' gets the first (and which gets the second) color of the palette. There's an example showing this behavior here in the hue_order section. When you have the dict already linking categories to colors (colors = {'N': 'gray', 'Y': 'blue'}), it seems to just affect the order of labels in the legend, as you probably have seen.
So the key is to make sure the color you want on top is plotted last (and thus "on top"). I would have also assumed the hue_order parameter would do as you expected, but apparently not!

pie chart for each column pandas

I have a dataframe of categorical values, and want to tabulate, then make a pie graph on each column.
I can tabulate my table and create one massive plot, but I do not think this meets my needs, and would prefer a pie graph for each column instead:
df = pd.DataFrame({'a': ['table', 'chair', 'chair', 'lamp', 'bed'],
                   'b': ['lamp', 'candle', 'chair', 'lamp', 'bed'],
                   'c': ['mirror', 'mirror', 'mirror', 'mirror', 'mirror']})
df
df2=df.apply(pd.value_counts).fillna(0)
df2.plot.bar()
display()
I tried making pie plots for each column, but have been struggling the past few hours with:
df2.plot(kind='pie',subplots=True,autopct='%1.1f%%', startangle=270, fontsize=17)
display()
I think I am close, and hopefully someone can help me over the final hurdle: a pie graph for each column that is meaningful and interpretable (a title above each plot naming the column, the legend in an appropriate position), not this bungled mess. Even a pointer to the right documentation would help.
One easy thing to do is to increase the figure size and specify the layout:
df2.plot(kind='pie', subplots=True,
         autopct='%1.1f%%', startangle=270, fontsize=17,
         layout=(2,2), figsize=(10,10))
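If you want a title per column and only the categories that actually occur in that column, another option is to loop over the columns and draw each pie on its own axes. This is a rough sketch, not the only way to do it; the figure size and the title text are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'a': ['table', 'chair', 'chair', 'lamp', 'bed'],
                   'b': ['lamp', 'candle', 'chair', 'lamp', 'bed'],
                   'c': ['mirror', 'mirror', 'mirror', 'mirror', 'mirror']})

# One axes per column, laid out side by side.
fig, axes = plt.subplots(1, len(df.columns), figsize=(15, 5))
for ax, col in zip(axes, df.columns):
    counts = df[col].value_counts()  # tabulate this column only
    ax.pie(counts, labels=counts.index, autopct='%1.1f%%', startangle=270)
    ax.set_title(f"Column {col}")
fig.tight_layout()
fig.savefig("pies.png")  # or plt.show() in an interactive session
```

Counting per column avoids the zero-sized wedges that the combined table with fillna(0) produces.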

FeatureUnion: keep existing features plus add new engineered features (aka transformed columns)

Say I have a dataset with 2 numerical columns and I would like to add a third column which is the product of the two (or some other function of the two existing columns). I can compute the new feature using a ColumnTransformer:
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(X[:,0] - X[:,1]).reshape(-1, 1)), ["colX", "colY"]),
)
(X is a pandas DataFrame, therefore the indexing via column names. Note also the reshape I had to use. Maybe someone has a better idea there.)
As written above I would like to keep the original features (similar to what sklearn.preprocessing.PolynomialFeatures is doing), i.e. use all 3 columns to fit a linear model (or generally use them in an sklearn pipeline). How do I do this?
For example,
df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(X[:,0] - X[:,1]).reshape(-1, 1)), ["colX", "colY"]),
)
tfs.fit_transform(df)
gives
array([[1.        ],
       [1.73205081]])
but I would like to get an array that includes the original columns to pass this on in the pipeline.
The only way I could think of is a FeatureUnion with an identity transformation for the first two columns. Is there a more direct way?
(I would like to make a pipeline rather than change the DataFrame so that I do not forget to make the augmentation when calling model.predict().)
Reading the documentation more carefully I found that it is possible to pass "special-cased strings" to "indicate to drop the columns or to pass them through untransformed, respectively."
So one possibility to achieve my goal is
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(X[:,0] - X[:,1]).reshape(-1, 1)), ["colX", "colY"]),
    ("passthrough", df.columns)
)
yielding
array([[1.        , 3.        , 2.        ],
       [1.73205081, 4.        , 1.        ]])
In the end there is no need for FeatureUnion; ColumnTransformer (or make_column_transformer) alone is enough.
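For completeness, here is a sketch of how this might sit inside a full pipeline, so the engineered column is also computed at predict time. The target y and the two extra rows are made up, sqrt_diff is a helper name introduced here, and np.asarray is used so the function works whether it receives a DataFrame or an array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def sqrt_diff(X):
    """Return sqrt(first column - second column) as a single 2-D column."""
    X = np.asarray(X, dtype=float)
    return np.sqrt(X[:, 0] - X[:, 1]).reshape(-1, 1)


# Made-up data: the first two rows match the example above.
df = pd.DataFrame({"colX": [3, 4, 9, 6], "colY": [2, 1, 5, 2]})
y = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical target

tfs = make_column_transformer(
    (FunctionTransformer(sqrt_diff), ["colX", "colY"]),  # engineered feature
    ("passthrough", ["colX", "colY"]),                   # keep the originals
)

# The transformer is part of the model, so predict() applies it as well.
model = make_pipeline(tfs, LinearRegression())
model.fit(df, y)

features = tfs.fit_transform(df)
print(features)
print(model.predict(df))
```

Using a named function instead of a lambda also keeps the pipeline picklable, which matters if the model is saved to disk.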

How do I create a bar chart that starts and ends in a certain range

I created a computer model (just for fun) to predict soccer match results. I ran a simulation to predict how many points each team will gain, which gives me a list of simulation results for each team.
I want to plot something like a confidence interval, but using a bar chart.
I considered the following options:
I considered using matplotlib's candlestick, but this is not Forex price data.
I also considered using matplotlib's errorbar, especially since it turns out I can combine a bar graph with error bars, but it's not really what I am aiming for. I am actually aiming for something like Nate Silver's 538 election prediction charts.
Nate Silver's chart is too complex: he colors the distribution and varies the size of the percentages. I just want a simple bar chart that spans a certain range.
I don't want to resort to bar stacking like shown here
Matplotlib's barh (or bar) is probably suitable for this:
import numpy as np
import matplotlib.pylab as pl
x_mean = np.array([1, 3, 6 ])
x_std = np.array([0.3, 1, 0.7])
y = np.array([0, 1, 2 ])
pl.figure()
pl.barh(y, width=2*x_std, left=x_mean-x_std)
The bars have a horizontal width of 2*x_std and start at x_mean-x_std, so the center denotes the mean value.
It's not very pretty (yet), but highly customizable:
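As one possible customization, here is a sketch that derives the range from simulation output instead of a mean and standard deviation: each bar runs from the 5th to the 95th percentile, with a tick at the median. The team names, the random "simulation" results and the percentile choice are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up simulation output: points gained per simulated season, per team.
rng = np.random.default_rng(42)
simulations = {
    "Team A": rng.normal(70, 6, size=1000),
    "Team B": rng.normal(55, 9, size=1000),
    "Team C": rng.normal(40, 4, size=1000),
}

teams = list(simulations)
low = np.array([np.percentile(simulations[t], 5) for t in teams])
high = np.array([np.percentile(simulations[t], 95) for t in teams])
median = np.array([np.median(simulations[t]) for t in teams])

fig, ax = plt.subplots(figsize=(8, 3))
y = np.arange(len(teams))
# Each bar starts at the 5th percentile and spans up to the 95th.
bars = ax.barh(y, width=high - low, left=low, color="lightsteelblue")
ax.plot(median, y, "k|", markersize=15)  # mark the median inside each bar
ax.set_yticks(y)
ax.set_yticklabels(teams)
ax.set_xlabel("Points")
fig.tight_layout()
fig.savefig("ranges.png")  # or plt.show() in an interactive session
```

Percentiles are used here because simulated point totals need not be symmetric around the mean; swapping in mean plus or minus one standard deviation only changes how low and high are computed.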