ValueError: Maximum allowed size exceeded when plotting using seaborn - pandas

I have a dataset with 37 columns and 230k rows.
I am trying to use seaborn to plot a histogram of every column.
I have not yet cleaned my data.
Here is my code:
for i in X.columns:
    plt.figure()
    ax = sns.histplot(data=df, x=i)
I also got this in the traceback:
File C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\function_base.py:135 in linspace
y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta))
Is there any solution for this, please?

It may be due to the size of your dataset, so you can try drawing one histogram at a time.
I think there is an inconsistency in your code: you loop over the columns of the dataframe X but you plot the columns of the dataframe df. It would be more consistent like this:
for i in df.columns:
    plt.figure()
    ax = sns.histplot(data=df, x=i)

I solved the problem by setting the number of bins explicitly. The default is bins="auto", and that was the cause: for a large dataset with high variance, the automatic rule can ask for an enormous number of bins, which is what exceeds the maximum allowed array size.
The code that solved my issue is below:
for i in X.columns:
    plt.figure()
    ax = sns.histplot(data=df, x=i, bins=50)
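To see why an automatic rule can blow up, here is a minimal sketch that estimates the bin count by hand. It assumes the Freedman-Diaconis rule (bin width = 2 * IQR / n^(1/3)), which is one of the estimators NumPy's "auto" setting draws on; the helper name fd_bin_count and the synthetic data are made up for illustration and are not part of the original question.

```python
import numpy as np

def fd_bin_count(values):
    """Estimate how many bins the Freedman-Diaconis rule would ask for."""
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]  # ignore NaN and inf
    q25, q75 = np.percentile(values, [25, 75])
    width = 2.0 * (q75 - q25) / len(values) ** (1.0 / 3.0)
    if width == 0:
        return None  # the rule is undefined when the IQR is zero
    return int(np.ceil((values.max() - values.min()) / width))

# Synthetic column (an assumption): 230k well-behaved values ...
rng = np.random.default_rng(0)
clean = rng.normal(size=230_000)
# ... and the same column with a single extreme, uncleaned outlier.
with_outlier = np.append(clean, 1e9)

clean_bins = fd_bin_count(clean)
outlier_bins = fd_bin_count(with_outlier)
print(clean_bins, outlier_bins)
```

Running a check like this on each column before plotting shows which ones need an explicit bins= value (or cleaning first); a single outlier is enough to push the estimate into the billions.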

Related

Auto-resize Figure in Seaborn

I am looking for some option to automatically resize the figures that I am generating using seaborn (barplots, countplot, boxplot). I am creating all the plots in one shot, but the issue is, in some of the graphs labels & bars are tightly packed because some of the columns have too many categorical values. I am using the below code:
for col in dff.drop(target_col_name, axis=1).columns:
    # low-cardinality columns are treated as categorical
    if (dff[col].nunique() / len(dff[col])) < threshold:
        ax = sns.countplot(x=dff[col], hue=dff[target_col_name])
        ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
        plt.tight_layout()
        plt.show()
        pd.crosstab(index=dff[col],
                    columns=dff[target_col_name], normalize='index').plot.bar()
        plt.tight_layout()
        plt.show()
    elif dff[col].dtype == 'int64' or dff[col].dtype == 'float64':
        sns.boxplot(x=dff[target_col_name], y=dff[col])
One solution is to increase the figsize for all figures, or to use another if condition that targets the columns with more categorical values and increases the size of only those figures.
But I am looking for a more flexible solution so that all the figures get resized automatically based on the information in them.
I have used matplotlib's built-in function figure(), which you can use to alter the size of charts. All you need to do is call it right before the code for your charts.
For instance, plt.figure(figsize=(12,5)) sets the width and height of the chart to 12 and 5 inches respectively.
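To make the size follow the data, one option is to derive figsize from the number of categories in the column. The sketch below is only an illustration: the helper name figsize_for_categories and its default values (0.4 inches per category, a 6 inch minimum width) are assumptions to tune, not anything seaborn or matplotlib provides.

```python
def figsize_for_categories(n_categories, inches_per_category=0.4,
                           min_width=6.0, height=5.0):
    """Return a (width, height) tuple in inches that grows with the category count."""
    # Never go below min_width, so plots with few categories keep a sensible size.
    width = max(min_width, n_categories * inches_per_category)
    return (width, height)

small = figsize_for_categories(4)    # few categories: minimum width applies
large = figsize_for_categories(50)   # many categories: width scales up
print(small, large)
```

Inside the loop from the question this could be used as plt.figure(figsize=figsize_for_categories(dff[col].nunique())) right before the countplot call.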

3d seaborn lmplot using variable marker size

I have a pandas dataframe with three columns (A,B,C). I have drawn a regression line of A vs B using
sns.lmplot(x='A', y='B', data = df, x_bins=10, ci=None)
I am using 10 bins and no confidence interval as I have a large number (~5 million) of datapoints.
I would like to show the value of C on this plot. C has nothing to do with the regression of A against B. I would just like to show C by making the marker size of each bin equal to the average value of C within that bin.
It seems seaborn doesn't have a markersize parameter that can be set equal to a column of the dataframe. Is this even possible?
I came across this Stack Exchange post, which suggests using scatter_kws={"s": 100} to set the marker size. However, when I tried scatter_kws={"s": df['C']} it threw an error.
If this is not possible in seaborn, are there any alternative solutions?
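One possible alternative, sketched here under assumptions, is to do the binning yourself and hand the per-bin means to a plain matplotlib scatter. The helper binned_means is hypothetical, the data are synthetic, and the quantile bins it builds are not necessarily the same bins that lmplot's x_bins produces.

```python
import numpy as np

def binned_means(a, b, c, n_bins=10):
    """Split the rows into quantile bins of A; return mean A, B and C per bin."""
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    # Quantile edges give bins with roughly equal numbers of points.
    edges = np.quantile(a, np.linspace(0.0, 1.0, n_bins + 1))
    # Using only the interior edges makes the labels run from 0 to n_bins - 1.
    labels = np.digitize(a, edges[1:-1])
    rows = []
    for k in range(n_bins):
        mask = labels == k
        if mask.any():  # skip bins that ended up empty
            rows.append((a[mask].mean(), b[mask].mean(), c[mask].mean()))
    return np.array(rows)

# Synthetic stand-in for the dataframe columns A, B, C (an assumption).
rng = np.random.default_rng(0)
A = rng.normal(size=1000)
B = 2.0 * A + rng.normal(size=1000)
C = rng.uniform(10, 200, size=1000)

means = binned_means(A, B, C)
print(means.shape)
```

The three columns of means can then be drawn with plt.scatter(means[:, 0], means[:, 1], s=means[:, 2]), provided the averages of C are positive; otherwise they need rescaling first, since marker areas cannot be negative.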

Changed frequency of ticks in Pandas '.bar' plot, but messed up the actual bars

how's your self-isolation going on?
Mine rocks, as I'm drilling through visualization in Python. Recently, however, I've run into an issue.
I figured that .plot.bar() in Pandas has an uncommon formatting of x-axis (which kinda confirms that I read before I ask). I had price data with monthly frequency, so I applied a fix to display only yearly ticks in a bar chart:
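import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots()
ax.bar(btc_returns.index, btc_returns)
# one major tick per year, labelled with the year only
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))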
Where btc_returns is a Series object with datetime in index.
The output I got was weird. Here are the screenshots of what I expected vs the end result.
I tried to find a solution to this, but no luck. Can you guys please give me a hand? Thanks! Criticism is welcome as always :)
And my solution is like this:
fig, ax = plt.subplots(figsize=(15,7))
ax.bar(btc_returns.index, btc_returns.returns.values, width = 1)
Where btc_returns is a DataFrame with the returns of BTC. I figured that .values makes the bar plot read the datetime input correctly. For the 'missing' bars - their resolution was just way too small, so I set the width to '1'.
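The width point is worth spelling out: on a date axis matplotlib measures bar width in days, so the default of 0.8 is less than a day wide, which is very thin for monthly data. Here is a minimal sketch with made-up monthly values (the dates and returns below are placeholders, not the BTC data) that uses a width of about 20 days together with the yearly ticks from the question.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs anywhere
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# 36 month-start dates and placeholder "returns" (assumed data for illustration).
dates = [dt.date(2018 + m // 12, m % 12 + 1, 1) for m in range(36)]
returns = [((m * 7) % 11) - 5 for m in range(36)]

fig, ax = plt.subplots(figsize=(15, 7))
# Width is in days on a date axis: 20 days fills most of a month.
bars = ax.bar(dates, returns, width=20)
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
fig.savefig("monthly_bars.png")
```

A width of 1 works too, but each bar then covers only one day out of roughly thirty, so the bars stay narrow when the chart is enlarged.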
Using the stock value data from Yahoo Finance: Bitcoin USD
Technically, you can do pd.to_datetime(btc.Date).dt.date at the beginning, but resample won't work, which is why btc_monthly.index.date is done as a second step.
resample can happen over different periods (e.g. 2M = every two months)
Load and transform the data
import pandas as pd
import matplotlib.pyplot as plt
# load data
btc = pd.read_csv('data/BTC-USD.csv')
# Date to datetime
btc.Date = pd.to_datetime(btc.Date)
# calculate daily return %
btc['return'] = ((btc.Close - btc.Close.shift(1))/btc.Close.shift(1))*100
# resample to monthly and aggregate by sum
btc_monthly = btc.resample('M', on='Date').sum()
# set the index to be date only (no time)
btc_monthly.index = btc_monthly.index.date
Plot
btc_monthly.plot(y='return', kind='bar', figsize=(15, 8))
plt.show()
Plot Bimonthly
btc_monthly = btc.resample('2M', on='Date').sum() # instead of 'M'
btc_monthly.index = btc_monthly.index.date
btc_monthly.plot(y='return', kind='bar', figsize=(15, 8), legend=False)
plt.title('Bitcoin USD: Bimonthly % Return')
plt.ylabel('% return')
plt.xlabel('Date')
plt.show()

Matplotlib/Seaborn: Boxplot collapses on x axis

I am creating a series of boxplots in order to compare different cancer types with each other (based on 5 categories). For plotting I use seaborn/matplotlib. It works fine for most of the cancer types (see image right) however in some the x axis collapses slightly (see image left) or strongly (see image middle)
https://i.imgur.com/dxLR4B4.png
Looking at how seaborn draws a box/violin plot in https://github.com/mwaskom/seaborn/blob/36964d7ffba3683de2117d25f224f8ebef015298/seaborn/categorical.py (line 961):
violin_data = remove_na(group_data[hue_mask])
I realized that this happens when there are too many NaNs.
Is there any way to prevent this collapsing through code only?
I do not want to modify my dataframe (for example, by replacing the NaNs with zero).
Below is my code:
boxp_df=pd.read_csv(pf_in,sep="\t",skip_blank_lines=False)
fig, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(data=boxp_df, ax=ax)
plt.xticks(rotation=-45)
plt.ylabel("label")
plt.tight_layout()
plt.savefig(pf_out)
The output is a plot whose width differs per cancer type
(depending on whether any category is completely NaN).
I expect every plot to have the same width.
Update
Trying to use the order parameter as suggested leads to the following output:
https://i.imgur.com/uSm13Qw.png
Maybe this toy example helps?
|Cat1|Cat2|Cat3|Cat4|Cat5
|3.93| |0.52| |6.01
|3.34| |0.89| |2.89
|3.39| |1.96| |4.63
|1.59| |3.66| |3.75
|2.73| |0.39| |2.87
|0.08| |1.25| |-0.27
Update
Apparently, the problem is not the data but the length of the title:
https://github.com/matplotlib/matplotlib/issues/4413
Therefore I would close the question.
@Diziet, should I delete it, or might my issue help others?
Sorry for not including the line below in the code example:
ax.set_title("VERY LONG TITLE", fontsize=20)
It's hard to be sure without data to test it with, but I think you can pass the names of your categories/cancers to the order= parameter. This forces seaborn to use/display those, even if they are empty.
for instance:
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips, order=['Thur','Fri','Sat','Freedom Day','Sun','Durin\'s Day'])

Seaborn time series plotting: a different problem for each function

I'm trying to use seaborn dataframe functionality (e.g. passing column names to x, y and hue plot parameters) for my timeseries (in pandas datetime format) plots.
x should come from a timeseries column (converted from a pd.Series of strings with pd.to_datetime)
y should come from a float column
hue comes from a categorical column that I calculated.
There are multiple streams in the same series that I am trying to separate (and use the hue for separating them visually), and therefore they should not be connected by a line (like in a scatterplot)
I have tried the following plot types, each with a different problem:
sns.scatterplot: gets the plotting right and the labels right but has problems with the x limits, and I could not set them right with plt.xlim() using data.Dates.min and data.Dates.max
sns.lineplot: gets the limits and the labels right, but I could not find a setting to disable the lines between the individual datapoints like in matplotlib. I tried setting the markers and the dashes parameters, to no avail.
sns.stripplot: my last try, plotted the datapoints correctly and got the x limits right but messed up the tick labels
Example input data for easy reproduction:
dates = pd.to_datetime(('2017-11-15',
'2017-11-29',
'2017-12-15',
'2017-12-28',
'2018-01-15',
'2018-01-30',
'2018-02-15',
'2018-02-27',
'2018-03-15',
'2018-03-27',
'2018-04-13',
'2018-04-27',
'2018-05-15',
'2018-05-28',
'2018-06-15',
'2018-06-28',
'2018-07-13',
'2018-07-27'))
values = np.random.randn(len(dates))
clusters = np.random.randint(3, size=len(dates))  # labels 0..2; randint(1, ...) would only ever return 0
D = {'Dates': dates, 'Values': values, 'Clusters': clusters}
data = pd.DataFrame(D)
To each of the functions I am passing the same arguments:
sns.OneOfThePlottingFunctions(x='Dates',
y='Values',
hue='Clusters',
data=data)
plt.show()
So to recap, what I want is a plot that uses seaborn's pandas functionality and plots points (not lines) with correct x limits and readable x labels :)
Any help would be greatly appreciated.
ax = sns.scatterplot(x='Dates', y='Values', hue='Clusters', data=data)
ax.set_xlim(data['Dates'].min(), data['Dates'].max())