RobustScaler from scikit-learn not behaving properly - dataframe

I wanted to fit my data and cut off the outliers, so I used RobustScaler (with data from here):
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler(quantile_range = (25.0, 75.0))
df_robust = scaler.fit_transform(df)
df_robust = pd.DataFrame(df_robust,columns=df.columns)
But when I plot the box plot,
df_robust.boxplot(figsize=(25,25))
plt.show()
it is clear that some data outside the quantile range are still there:
Have you encountered this problem before?

RobustScaler does not remove outliers. When fitted, it computes a center (the median) and a scale (based on the quantile range) that are robust to outliers. Outliers, however, are later transformed like all other points using those parameters.
In other words, RobustScaler preserves outliers and tries to not let them influence the scaling of the non-outliers.
From the doc:
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range).
So what it does is compute something like this:
import numpy as np

iqr = np.nanpercentile(xs, 75) - np.nanpercentile(xs, 25)
median = np.nanmedian(xs)
and standardize like this (check the source code for exact proportionality constant):
(xs - median) / iqr
There is no step that removes outliers.
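As a quick check, here is a minimal sketch (using a small made-up column, not the question's dataset) showing that the outlier survives the transform and that the result matches (x - median) / IQR:
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Small made-up column with one obvious outlier
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0]})

scaler = RobustScaler(quantile_range=(25.0, 75.0))
scaled = scaler.fit_transform(df)

# Reproduce the transform by hand: (x - median) / IQR
median = np.nanmedian(df["a"])
iqr = np.nanpercentile(df["a"], 75) - np.nanpercentile(df["a"], 25)
manual = (df["a"] - median) / iqr

print(scaled.ravel())     # [-1.  -0.5  0.   0.5  48.5] -- the outlier is still there
print(manual.to_numpy())  # same values: this is scaling, not outlier removal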

Central Limit Theorem: Sample means do not follow a normal distribution

The Problem
Good evening.
I am learning about the Central Limit Theorem. As practice, I ran simulations in an attempt to find the mean of a fair die (I know, a toy problem).
I took 4000 samples, and in each sample I rolled a die 50 times (code at the bottom). For each of these 4000 samples I computed the mean. Then I plotted these 4000 sample means in a histogram (with bin size 0.03) using matplotlib.
Here is the result:
Question
Why aren't the sample means normally distributed given that the conditions for CLT (sample size >= 30) were respected?
Specifically, why does the histogram look like two normal distributions superimposed on top of each other? More intriguingly, why does the "outer" distribution look "discrete" with empty spaces occurring at regular intervals?
It almost seems like the result is off in a systematic way.
All help is greatly appreciated. I am very lost.
Supplementary Code
The code I used to generate the 4000 sample means.
"""
Take multiple samples of dice rolls. For
each sample, compute the sample mean.
With the sample means, plot a histogram.
By the Central Limit Theorem, the sample
means should be normally distributed.
"""
import random

sample_means = []
num_samples = 4000
for i in range(num_samples):
    # Large enough for CLT to hold
    num_rolls = 50
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)
    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)
When num_rolls equals 50, each possible mean will be a fraction with denominator 50. So, in reality, you are looking at a discrete distribution.
To create a histogram of a discrete distribution, the bin boundaries are best placed nicely in-between the values. With a step size of 0.03, some bin boundaries will coincide with the values, putting twice as many values into one bin as into its neighbor. Moreover, due to subtle floating point rounding problems, the result can become unpredictable when values and boundaries coincide.
Here is some code to illustrate what is going on:
from matplotlib import pyplot as plt
import numpy as np
import random
sample_means = []
num_samples = 4000
for i in range(num_samples):
    num_rolls = 50
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)
    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)

fig, axs = plt.subplots(2, 2, figsize=(14, 8))
random_y = np.random.rand(len(sample_means))
for (ax0, ax1), step in zip(axs, [0.03, 0.02]):
    bins = np.arange(3.01, 4, step)
    ax0.hist(sample_means, bins=bins)
    ax0.set_title(f'step={step}')
    ax0.vlines(bins, 0, ax0.get_ylim()[1], ls=':', color='r')  # show the bin boundaries in red
    ax1.scatter(sample_means, random_y, s=1)  # show the sample means with a random y
    ax1.vlines(bins, 0, 1, ls=':', color='r')  # show the bin boundaries in red
    ax1.set_xticks(np.arange(3, 4, 0.02))
    ax1.set_xlim(3.0, 3.3)  # zoom in to better see the bins
    ax1.set_title('bin boundaries between values' if step == 0.02 else 'chaotic bin boundaries')
plt.show()
PS: Note that the code would run much, much faster if it worked entirely with NumPy arrays instead of Python lists.
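For example, a vectorized sketch of the same simulation (not the code above) could look like this:
import numpy as np

rng = np.random.default_rng()
num_samples, num_rolls = 4000, 50

# One (num_samples, num_rolls) array of die rolls, then one mean per row
rolls = rng.integers(1, 7, size=(num_samples, num_rolls))  # high is exclusive, so rolls are 1..6
sample_means = rolls.mean(axis=1)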

Why is the point size different when using sns.lmplot compared to plt.scatter?

I want to make a scatter plot of x and y variables, where the point size depends on a numeric variable and the color of every point depends on a categorical variable.
First, I tried this with plt.scatter:
Graph 1
After that, I tried the same thing using lmplot, but the point size is different compared to the first graph.
I think the two graphs should be equal. Why aren't they?
The point size is different in each graph.
Graph 2
Your question is not very descriptive, but I guess you want to control the size of the markers. Here is more documentation.
Here is a starting point for you.
A numeric variable can also be assigned to size to apply a semantic mapping to the areas of the points:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="size", size="size")
For seaborn scatterplot:
df = sns.load_dataset("anscombe")
sp = sns.scatterplot(x="x", y="y", hue="dataset", data=df)
And to change the size of the points you use the s parameter.
sp = sns.scatterplot(x="x", y="y", hue="dataset", data=df, s=100)
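If the goal is to make the lmplot points match a plt.scatter call, one option (a sketch on the tips dataset, since the question's data is not shown) is to pass the size through scatter_kws, which lmplot forwards to plt.scatter:
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# scatter_kws is passed on to plt.scatter, so the same s gives the same point size
sns.lmplot(data=tips, x="total_bill", y="tip", hue="day",
           scatter_kws={"s": 100}, fit_reg=False)
plt.show()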

How to show min and max values at the end of the axes

I generate plots like below:
from pylab import *
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
rcParams['axes.linewidth'] = 2 # set the value globally
rcParams['font.size'] = 16  # set the value globally
rcParams['font.family'] = ['DejaVu Sans']
rcParams['mathtext.fontset'] = 'stix'
rcParams['legend.fontsize'] = 24
rcParams['axes.prop_cycle'] = cycler(color=['grey','b','g','r','orange'])
rc('lines', linewidth=2, linestyle='-',marker='o')
rcParams['axes.xmargin'] = 0
rcParams['axes.ymargin'] = 0
t = arange(0,21,1)
v = 2.0
s = v*t
plt.figure(figsize=(12, 4))
plt.plot(t, s, label=r'$s=%1.1f\cdot t$' % v)
plt.title(r'Distance versus time plot $s=v\cdot t$')
plt.xlabel('Time $t$, s')
plt.ylabel('Distance $s$, m')
plt.autoscale(enable=True, axis='both', tight=None)
legend(loc='best')
plt.xlim(min(t),max(t))
plt.ylim(min(s),max(s))
plt.grid()
plt.show()
When I change the value t = arange(0,21,1), for example to t = arange(0,20,1), which gives a maximum value of 19.0 on the x axis, that max value disappears from the x axis. The same of course happens with the y axis.
My question is: how can I force matplotlib to always produce plots where the max values sit exactly at the ends of the axes, as they always should for my purposes, or is there an option to choose this behaviour?
Image from my program in Fortran I wrote some years ago
Matplotlib is more efficient than the way I use it, but there should be an option like that (the picture above).
As it is, I always have to check the max and min values in text windows or take additional steps to be sure of them; I would like to read them from the axes instead. Are there such possibilities in matplotlib? If not, I will close the post.
Axes I am thinking about, more or less
I see two ways to solve the problem.
Set the axes automatic limit mode to round numbers
In the rcParams you can do this with
rcParams['axes.autolimit_mode'] = 'round_numbers'
And turn off the manual axis limits set with min and max, i.e. remove these lines:
plt.xlim(min(t),max(t))
plt.ylim(min(s),max(s))
This will produce the image below. The extreme values of the axes are still shown at the nearest "round numbers", but the user can approximately see the data range limits. If you need the exact values to be displayed, see the second solution, which cannot be driven directly from the rcParams.
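Putting the first approach together, a minimal sketch (reusing the t and s from the question, with the manual limits left out) would be:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Let matplotlib expand the limits to the nearest round numbers
rcParams['axes.autolimit_mode'] = 'round_numbers'
rcParams['axes.xmargin'] = 0
rcParams['axes.ymargin'] = 0

t = np.arange(0, 21, 1)
s = 2.0 * t

plt.figure(figsize=(12, 4))
plt.plot(t, s, marker='o')
# no plt.xlim / plt.ylim here -- the autolimit mode picks the ends of the axes
plt.grid()
plt.show()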
or – Manually generate the axis ticks
This solution means explicitly asking for a given number of ticks. I guess there is a way to automate it depending on the axes size, etc., but if you are dealing with roughly the same graph size every time, you can choose a fixed number of ticks manually. This can be done with
plt.xlim(min(t),max(t))
plt.ylim(min(s),max(s))
plt.xticks(np.linspace(t.min(), t.max(), 7))  # arbitrarily chosen
plt.yticks(np.linspace(s.min(), s.max(), 5))  # arbitrarily chosen
This generates the image below, quite similar to your example.

Matplotlib/Seaborn: Boxplot collapses on x axis

I am creating a series of boxplots in order to compare different cancer types with each other (based on 5 categories). For plotting I use seaborn/matplotlib. It works fine for most of the cancer types (see image, right); however, in some of them the x axis collapses slightly (see image, left) or strongly (see image, middle).
https://i.imgur.com/dxLR4B4.png
Looking into how seaborn plots a box/violin plot (https://github.com/mwaskom/seaborn/blob/36964d7ffba3683de2117d25f224f8ebef015298/seaborn/categorical.py, line 961):
violin_data = remove_na(group_data[hue_mask])
I realized that this happens when there are too many NaNs.
Is there any possibility to prevent this collapsing by code only?
I do not want to modify my dataframe (e.g. replace the NaNs by zero).
Below you find my code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

boxp_df = pd.read_csv(pf_in, sep="\t", skip_blank_lines=False)
fig, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(data=boxp_df, ax=ax)
plt.xticks(rotation=-45)
plt.ylabel("label")
plt.tight_layout()
plt.savefig(pf_out)
The output is a differently sized plot per cancer type
(depending on whether any category is completely NaN).
I expect each plot to have the same width.
Update
Trying to use the order parameter as suggested leads to the following output:
https://i.imgur.com/uSm13Qw.png
Maybe this toy example helps?
Cat1 | Cat2 | Cat3 | Cat4 | Cat5
3.93 |      | 0.52 |      |  6.01
3.34 |      | 0.89 |      |  2.89
3.39 |      | 1.96 |      |  4.63
1.59 |      | 3.66 |      |  3.75
2.73 |      | 0.39 |      |  2.87
0.08 |      | 1.25 |      | -0.27
Update
Apparently, the problem is not the data but the length of the title:
https://github.com/matplotlib/matplotlib/issues/4413
Therefore I would close the question.
@Diziet: should I delete it, or might my issue help others?
Sorry for not including the line below in the code example:
ax.set_title("VERY LONG TITLE", fontsize=20)
It's hard to be sure without data to test it with, but I think you can pass the names of your categories/cancers to the order= parameter. This forces seaborn to use/display those, even if they are empty.
for instance:
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips, order=['Thur','Fri','Sat','Freedom Day','Sun','Durin\'s Day'])

Exponential decay curve fitting in numpy and scipy

I'm having a bit of trouble with fitting a curve to some data, but can't work out where I am going wrong.
In the past I have done this with numpy.linalg.lstsq for exponential functions and scipy.optimize.curve_fit for sigmoid functions. This time I wished to create a script that would let me specify various functions, determine parameters and test their fit against the data. While doing this I noticed that Scipy leastsq and Numpy lstsq seem to provide different answers for the same set of data and the same function. The function is simply y = e^(l*x) and is constrained such that y=1 at x=0.
Excel trend line agrees with the Numpy lstsq result, but as Scipy leastsq is able to take any function, it would be good to work out what the problem is.
import scipy.optimize as optimize
import numpy as np
import matplotlib.pyplot as plt
## Sampled data
x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763, 0.003630962, 0.001485394, 0.000495131])
# function
fp = lambda p, x: np.exp(p*x)
# error function
e = lambda p, x, y: (fp(p, x) - y)
# using scipy least squares
l1, s = optimize.leastsq(e, -0.004, args=(x,y))
print(l1)
# [-0.0132281]
# using numpy least squares
l2 = np.linalg.lstsq(np.vstack([x, np.zeros(len(x))]).T,np.log(y))[0][0]
print(l2)
# -0.00313461628963 (same answer as Excel trend line)
# smooth x for plotting
x_ = np.arange(0, x[-1], 0.2)
plt.figure()
plt.plot(x, y, 'rx', x_, fp(l1, x_), 'b-', x_, fp(l2, x_), 'g-')
plt.show()
Edit - additional information
The MWE above includes a small sample of the dataset. When fitting the actual data the scipy.optimize.curve_fit curve presents an R^2 of 0.82, while the numpy.linalg.lstsq curve, which is the same as that calculated by Excel, has an R^2 of 0.41.
You are minimizing different error functions.
When you use numpy.linalg.lstsq, the error function being minimized is
np.sum((np.log(y) - p * x)**2)
while scipy.optimize.leastsq minimizes the function
np.sum((y - np.exp(p * x))**2)
The first case requires a linear dependency between the dependent and independent variables (here, log(y) is linear in x), but the solution is known analytically, while the second can handle any dependency but relies on an iterative method.
On a separate note, I cannot test it right now, but when using numpy.linalg.lstsq you don't need to vstack a row of zeros; the following works as well:
l2 = np.linalg.lstsq(x[:, None], np.log(y))[0][0]
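To make the "known analytically" remark concrete, here is a minimal sketch (using the sample data from the question) of the closed-form solution of the log-space problem, which should reproduce the lstsq / Excel value:
import numpy as np

x = np.array([0, 14, 37, 975, 2013, 2095, 2147], dtype=float)
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763,
              0.003630962, 0.001485394, 0.000495131])

# Minimizing sum((log(y) - p*x)**2) over p gives p = sum(x*log(y)) / sum(x**2)
p_analytic = np.sum(x * np.log(y)) / np.sum(x * x)
print(p_analytic)  # about -0.0031, matching the lstsq result above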
To expound a bit on Jaime's point, any non-linear transformation of the data will lead to a different error function and hence to different solutions. These will also lead to different confidence intervals for the fitting parameters. So you have three possible criteria to use to make a decision: which error you want to minimize, which parameters you want more confidence in, and finally, if you are using the fit to predict some value, which method yields less error in the predicted value of interest. Playing around a bit analytically and in Excel suggests that different kinds of noise in the data (e.g. whether the noise scales the amplitude, affects the time constant, or is additive) lead to different choices of solution.
I'll also add that while this trick "works" for exponential decay to 0, it can't be used in the more general (and common) case of damped exponentials (rising or falling) to values that cannot be assumed to be 0.
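As an illustration of that last point, a direct nonlinear fit (sketched here with scipy.optimize.curve_fit on the question's sample data) can include a non-zero offset explicitly, which the log-linearization trick cannot handle:
import numpy as np
from scipy.optimize import curve_fit

def damped(x, a, k, c):
    # exponential decay towards an offset c instead of towards 0
    return a * np.exp(k * x) + c

x = np.array([0, 14, 37, 975, 2013, 2095, 2147], dtype=float)
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763,
              0.003630962, 0.001485394, 0.000495131])

popt, pcov = curve_fit(damped, x, y, p0=(1.0, -0.004, 0.0))
print(popt)  # fitted a, k, c; no log transform needed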