How to make pandas show the entire dataframe without cropping it by columns? - pandas

I am trying to represent cubic spline interpolation information for a function f(x) as a dataframe.
When I print it in Spyder, the columns are cut off. When I tried to reproduce the output in JupyterLab, I got the same thing.
When I ran it in IPython from a terminal, I got the desired full dataframe output.
I searched the internet and tried setting pandas options with pd.set_option(), but nothing came of it.
I attach a screenshot with the output in ipython.

In Jupyter you can use:
from IPython.display import display, HTML
and instead of
print(dataframe)
use, anywhere in the notebook:
display(HTML(dataframe.to_html()))
This will create a nice table.
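A minimal sketch of why this shows every column: as far as I can tell, to_html() only truncates when max_rows or max_cols are passed explicitly, so the display options do not crop it. The 30-column frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: 3 rows, 30 columns named col_0 ... col_29.
dataframe = pd.DataFrame(
    np.arange(90).reshape(3, 30),
    columns=[f"col_{i}" for i in range(30)],
)

# No max_rows / max_cols arguments are given, so nothing is truncated.
html = dataframe.to_html()

# Collect any column whose header cell is absent from the generated table.
missing = [c for c in dataframe.columns if f"<th>{c}</th>" not in html]
print(len(html), "characters of HTML; missing columns:", missing)
```

In a notebook, the resulting string is what gets passed to HTML(...) and display(...).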
Unfortunately, this will not work in Spyder. There you can try adjusting the width of the IPython console, as was suggested, but in most cases this makes the output hard to read or unreadable.
After trying the dataframe methods, I found what appears to be a cropping setting.
In Spyder I used:
pd.set_option('expand_frame_repr', False)
print(dataframe)
This setting explains why increasing max_columns didn't help me previously.
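To see the effect in isolation, here is a small sketch that compares the text representation with wrapping on and off. The 20-column frame is invented for the example, and option_context is used so the global settings stay untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that is too wide for an 80-character line.
dataframe = pd.DataFrame(
    np.arange(60).reshape(3, 20),
    columns=[f"col_{i}" for i in range(20)],
)

# Wrapping enabled: pandas splits the columns into several stacked chunks.
with pd.option_context("display.width", 80,
                       "display.max_columns", None,
                       "display.expand_frame_repr", True):
    wrapped = repr(dataframe)

# Wrapping disabled: every column stays in one block of header + rows.
with pd.option_context("display.width", 80,
                       "display.max_columns", None,
                       "display.expand_frame_repr", False):
    one_block = repr(dataframe)

print(len(wrapped.splitlines()), "lines when wrapped")
print(len(one_block.splitlines()), "lines as one block")
```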

You can specify a maximum number of rows or columns, for example pd.set_option('display.max_columns', 1000).
However, you don't have to pick an arbitrary value: use None instead so that frames of any size are shown in full.
For rows, use:
pd.set_option('display.max_rows', None)
And for columns, use:
pd.set_option('display.max_columns', None)
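These options are global and stay in effect for the rest of the session. A short sketch of setting them and then restoring the defaults with pd.reset_option:

```python
import pandas as pd

# Remember the current row limit so it can be compared after the reset.
default_rows = pd.get_option("display.max_rows")

# Remove both limits for the whole session.
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
unlimited = pd.get_option("display.max_rows")

# Go back to the library defaults once the full output is no longer needed.
pd.reset_option("display.max_rows")
pd.reset_option("display.max_columns")
restored = pd.get_option("display.max_rows")

print(default_rows, unlimited, restored)
```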

This is a result of the display width. You can change it with set_option():
pd.set_option('display.width', 1000)  # make it huge
You may also have to raise the maximum number of columns, although pandas should adjust automatically once the width is larger:
pd.set_option('display.max_columns', None)
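A rough way to check what the width option does is to measure the longest line of the printed frame. The frame and the helper longest_line below are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose full header line is well over 80 characters.
dataframe = pd.DataFrame(
    np.arange(60).reshape(3, 20),
    columns=[f"col_{i}" for i in range(20)],
)

def longest_line(width):
    """Return the length of the longest line of repr(dataframe) for a given display.width."""
    with pd.option_context("display.width", width, "display.max_columns", None):
        return max(len(line) for line in repr(dataframe).splitlines())

narrow = longest_line(80)    # columns get wrapped to fit the narrow width
wide = longest_line(1000)    # everything fits on one line per row
print(narrow, wide)
```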

Related

How to increase length of output table or dataframe in Jupyter Notebook?

I am working on the Jupyter notebook and have been facing issues in increasing the length of the output of the Jupyter Notebook. I can see the output as follows:
I tried increasing the default length of the columns in pandas with no success. Can you please help me with it?
If you were using the typical way to view a dataframe in Jupyter (see my puzzlement about your screenshot in my comments on your original post), it would be something like this, adapted from an answer to 'Pretty-print an entire Pandas Series / DataFrame':
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df)
(Note that this works with text-based viewing, too; the answer to 'Pretty-print an entire Pandas Series / DataFrame' uses print(df).)
Adjust 'display.max_colwidth' if you want the entire text of each column to show. Use None for no limit; the older value -1 has been deprecated since pandas 1.0:
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
    display(df)
(If you prefer text like you posted, replace display() with print().)
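A small text-based sketch of the column-width setting, assuming pandas 1.0 or later (where None means no limit). The one-cell frame is invented for the example, and repr() stands in for what print(df) would show.

```python
import pandas as pd

# Hypothetical frame with a single 80-character string in it.
df = pd.DataFrame({"text": ["x" * 80]})

# With a 50-character limit, the cell is shortened and ends in "...".
with pd.option_context("display.max_colwidth", 50):
    truncated = repr(df)

# With no limit, the whole string is shown.
with pd.option_context("display.max_colwidth", None):
    full = repr(df)

print(truncated)
print(full)
```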
Generally with the solutions above the view window in Jupyter will get scrollbars so you can navigate to view all still.
You can also set the number of rows to show to be lower to save space, see example here.
You may also be interested in Pandas dataframe hide index functionality? or Using python / Jupyter Notebook, how to prevent row numbers from printing?.
As pointed out here, setting some global options is covered in the pandas documentation for top-level options.
For display() to work these days you don't need to do anything extra. But if you are using an old Jupyter or it doesn't work, try adding the following near the top of your notebook and running it as a cell first:
from IPython.display import display

Plots from excel with panda and seaborn 'ufunc 'isfinite' not supported for the input types'

I am trying to set up a template for creating plots of my test data. I should say I am pretty new to Python, and I have already googled quite a lot about my question, but what I found did not help. I have an Excel table with data in two columns, which I want to plot against each other. My code looks as follows:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
file = 'C:/Documents/Test/test_file.xlsx'
df1 = pd.read_excel(file, sheet_name='sheet1', header=0, engine="openpyxl")
plt.figure()
sns.lineplot(data=df1, x="eps", y="sigma", sort=False, linewidth=0.8)
The Excel sheet has, as mentioned, a header with eps and sigma as the x and y values. The values that follow are floats; when I check the datatypes with df1.dtypes, the result is 'float64'. Does anyone have an idea what is not working? I get the error 'ufunc 'isfinite' not supported for the input types'.
Plotting data from excel with panda and seaborn against each other and save the image.
This might be a library issue. I've been running into the same problem with example datasets and even a very simple:
sns.lineplot(x=[1], y=[1])
I'll update if I find a solution.
Edit: There seems to be an issue with NumPy that causes this problem in Seaborn. The solution is to downgrade NumPy to 1.23 until 1.24.1 is released.
https://github.com/mwaskom/seaborn/issues/3192
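Separately from the NumPy version problem, the same message also shows up when a column has object dtype, for example numbers read in as text. That is not the case in the question (the dtypes there are float64), but it is a quick thing to rule out. A sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data in which "eps" was read in as text rather than floats.
df1 = pd.DataFrame({"eps": ["0.0", "0.1", "0.2"], "sigma": [0.0, 10.5, 19.8]})

# np.isfinite has no implementation for object arrays, so this raises TypeError.
try:
    np.isfinite(df1["eps"].to_numpy(dtype=object))
    raised = False
except TypeError:
    raised = True

# Convert to a numeric dtype; values that cannot be parsed become NaN.
df1["eps"] = pd.to_numeric(df1["eps"], errors="coerce")
all_finite = bool(np.isfinite(df1["eps"].to_numpy()).all())

print(raised, df1["eps"].dtype, all_finite)
```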

Pycharm injects new line in output, half way through dataset columns

Why would Pycharm put the second half of my dataset on a new line? (see Image) How do I turn this off?
I would like to display my dataset as wide as possible, with no wrapping.
The console attempts to auto-detect the width of the display area, but when that fails it defaults to 80 characters. This behavior can be overridden with:
import pandas as pd
pd.set_option('display.width', 400)
pd.set_option('display.max_columns', 10)
For reference, see this Stack Overflow question:
Getting wider output in PyCharm's built-in console
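One caveat with a fixed display.max_columns of 10: a frame with more columns than that is still shortened, with the middle columns replaced by "...". A sketch with an invented 15-column frame, using option_context so nothing is changed globally:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 15 columns, more than the limit of 10.
df = pd.DataFrame(np.zeros((2, 15)), columns=[f"col_{i}" for i in range(15)])

# Wide display but at most 10 columns: the middle columns are elided.
with pd.option_context("display.width", 400, "display.max_columns", 10):
    limited = repr(df)

# Wide display and no column limit: every column is printed.
with pd.option_context("display.width", 400, "display.max_columns", None):
    complete = repr(df)

print(limited)
print(complete)
```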

How to add plot commands to a figure in more than one cell, but display it only in the end?

I want to do the following in a Jupyter Notebook:
Create a pyplot.figure in a cell;
For each subsequent cells, calculate values and plot them to that same figure without displaying anything;
At the end, in another cell, display the figure with the result of every previous plot command.
Currently, while using %matplotlib notebook, the figure is always displayed right after the cell in which it was created, even though I never call plt.show().
This is not the behavior I want. Instead, I would like to postpone displaying the figure until the last cell, while the figure should of course contain the results of the sequential plot commands called in the cells in between.
You can capture the content of a cell of a jupyter notebook using the magic command %%capture. You can also hide any output of a specific line by putting a ; at the end of it.
Showing the figure can be done by simply typing the variable in which the figure is stored, e.g. fig.
Combining those techniques gives you
Cell 1:
import matplotlib.pyplot as plt
%matplotlib notebook
Cell 2 (%%capture must be the first line of its cell):
%%capture captured
fig, ax = plt.subplots()
Cell 3:
ax.plot([1,2,3]);
Cell 4:
fig # now show the figure
which is probably more understandable in the actual notebook like this:
Also see How to overlay plots from different cells?
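The same build-up-then-show idea can be sketched without notebook magics by creating a Figure object directly instead of going through pyplot, so nothing is shown until the figure is rendered explicitly. This is only an illustration under that assumption: the helper add_line and the data are made up, and how such a figure displays in a notebook depends on the backend in use.

```python
import io
from matplotlib.figure import Figure

# Create the figure without pyplot, so it is not tracked for automatic display.
fig = Figure()
ax = fig.subplots()

def add_line(axes, ys, label):
    """Hypothetical helper standing in for the work done in one notebook cell."""
    axes.plot(ys, label=label)

# "Intermediate cells": each call adds to the same axes without rendering.
add_line(ax, [1, 2, 3], "first step")
add_line(ax, [3, 2, 1], "second step")
ax.legend()

# "Last cell": render the accumulated figure once, here into an in-memory PNG.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
print(len(ax.lines), "lines drawn,", len(png_bytes), "bytes of PNG")
```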

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
To get a better-looking result, you might want to consider using the Seaborn library. It is designed to help with precisely this sort of task:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
EDIT to take into account comment below
If you want to have each of your cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (plus you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you could with pandas directly, by using the positions= parameter of boxplot (see doc). The catch is to generate the correct sequence of positions to place the boxplots where you want them, and you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
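To make the position sequence concrete, here is a sketch using the dataframe from the question. It pairs each (gene_id, cohort) group with its slot, assuming boxplot draws the boxes in the same sorted order that groupby yields; the names group_keys and slots are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "gene_length": [100, 100, 100, 200, 200, 200, 300, 300, 300],
    "gene_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "density": [0.4, 1.1, 1.2, 1.9, 2.0, 2.5, 2.2, 3.0, 3.3],
    "cohort": ["USA", "EUR", "FIJ", "USA", "EUR", "FIJ", "USA", "EUR", "FIJ"],
})

# One x position per gene_id, repeated once for every cohort.
pos = [i for i in range(len(df.gene_id.unique()))
       for _ in range(len(df.cohort.unique()))]

# Group keys in the sorted order produced by iterating over the groupby.
group_keys = [key for key, _ in df.groupby(["gene_id", "cohort"])]

# Map each (gene_id, cohort) pair to the x position its box would get.
slots = dict(zip(group_keys, pos))
print(pos)
print(slots)
```

All three cohorts of a given gene share one position, which is what stacks their boxes on top of each other.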
An alternative would be to use seaborn.swarmplot instead of boxplot. A swarmplot plots every point instead of the summary representation of a boxplot, and you can use the parameter dodge=False (called split in older seaborn releases) to get the points colored by cohort but stacked on top of each other for each gene_id.
sns.swarmplot(data=df,x='gene_id',y='density',hue='cohort', dodge=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.