Using ipysheet, how do I adjust the column width for an index column? - pandas

I am trying to work through the simple ipysheet example below and have not been able to find a way to increase the column width associated with the date index in the resulting sheet in a Jupyter Notebook
import ipysheet as ip
import pandas as pd
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
sheet = ip.from_dataframe(df)
display(sheet)
Resulting Output

I'm having the same problem, and don't see anything in the code that indicates this is possible.
You can see in the ipysheet code that Sheet extends ipywidget.DOMWidget, which in turn extends ipywidget.Widget. You can probably send some custom CSS into the constructor that will help it render the way you want but that's going to take a lot of research.
The workaround is to reset the index on your DataFrame so that dates is just a column.
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df = df.reset_index()
sheet = ip.from_dataframe(df)
Not exactly what you want, but for my purposes it's definitely good enough.

Related

What is the Vaex command for pd.isnull().sum()?

Someone please give me a VAEX alternative for this code:
df_train = vaex.open('../input/ms-malware-hdf5/train.csv.hdf5')
total = df_train.isnull().sum().sort_values(ascending = False)
Vaex does not at this time support counting missing values on a dataframe level, only on an expression (column) level. So you will have to do a bit of work yourself.
Consider the following example:
import vaex
import vaex.ml
import pandas as pd
df = vaex.ml.datasets.load_titanic()
count_na = [] # to count the missing value per column
for col in df.column_names:
count_na.append(df[col].isna().sum().item())
s = pd.Series(data=count_na, index=df.column_names).sort_values(ascending=True)
If you think this is something you might need to use often, it might be worth it to create your own dataframe method following this example.

Can I extract or construct as a Pandas dataframe the table with coefficient values etc. provided by the summary() method in statsmodels?

I have run an OLS model in statsmodels and I would like to have the table in the summary as a Pandas dataframe.
This is what I mean:
I would like the table within the red frame to be constructed / extracted and become a Pandas DataFrame.
My code up to that point was straightforward:
from statsmodels.regression.linear_model import OLS
mod = OLS(endog = coded_design_poly_select.response.values, exog = coded_design_poly_select.iloc[:, :-1].values)
fitted_model = mod.fit()
fitted_model.summary()
What would you suggest?
The fitted_model is in fact a RegressionResults object that stores all the regression results and you can access them via the corresponding methods/attributes.
For what you asked for, I believe the following code would work
data = {'coef': fitted_model.params,
'std err': fitted_model.bse,
't': fitted_model.tvalues,
'P>|t|': fitted_model.pvalues,
'[0.025': fitted_model.conf_int()[0],
'0.975]': fitted_model.conf_int()[1]}
pd.DataFrame(data).round(3)

Convert date/time index of external dataset so that pandas would plot clearly

When you already have time series data set but use internal dtype to index with date/time, you seem to be able to plot the index cleanly as here.
But when I already have data files with columns of date&time in its own format, such as [2009-01-01T00:00], is there a way to have this converted into the object that the plot can read? Currently my plot looks like the following.
Code:
dir = sorted(glob.glob("bsrn_txt_0100/*.txt"))
gen_raw = (pd.read_csv(file, sep='\t', encoding = "utf-8") for file in dir)
gen = pd.concat(gen_raw, ignore_index=True)
gen.drop(gen.columns[[1,2]], axis=1, inplace=True)
#gen['Date/Time'] = gen['Date/Time'][11:] -> cause error, didnt work
filter = gen[gen['Date/Time'].str.endswith('00') | gen['Date/Time'].str.endswith('30')]
filter['rad_tot'] = filter['Direct radiation [W/m**2]'] + filter['Diffuse radiation [W/m**2]']
lis = np.arange(35040) #used the number of rows, checked by printing. THis is for 2009-2010.
plt.xticks(lis, filter['Date/Time'])
plt.plot(lis, filter['rad_tot'], '.')
plt.title('test of generation 2009')
plt.xlabel('Date/Time')
plt.ylabel('radiation total [W/m**2]')
plt.show()
My other approach in mind was to use plotly. Yet again, its main purpose seems to feed in data on the internet. It would be best if I am familiar with all the modules and try for myself, but I am learning as I go to use pandas and matplotlib.
So I would like to ask whether there are anyone who experienced similar issues as I.
I think you need set labels to not visible by loop:
ax = df.plot(...)
spacing = 10
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
if label not in visible:
label.set_visible(False)

Cannot create bar plot with pandas

I am trying to create a bar plot using pandas. I have the following code:
import pandas as pd
indexes = ['Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree']
df = pd.DataFrame({'Q7': [10, 11, 1, 0, 0]}, index=indexes)
df.plot.bar(indexes, df['Q7'].values)
By my reckoning this should work but I get a weird KeyError: 'Strongly agree' thrown at me. I can't figure out why this won't work.
By invoking plot as a Pandas method, you're referring to the data structures of df to make your plot.
The way you have it set up, with index=indexes, your bar plot's x values are stored in df.index. That's why Wen's suggestion in the comments to just use df.plot.bar() will work, as Pandas automatically looks to use df.index as the x-axis in this case.
Alternately, you can specify column names for x and y. In this case, you can move indexes into a column with reset_index() and then call the new index column explicitly:
df.reset_index().plot.bar(x="index", y="Q7")
Either approach will yield the correct plot:

How to Render Math Table Properly in IPython Notebook

The math problem that I'm solving gives different analytical solutions in different scenarios, and I would like to summarize the result in a nice table. IPython Notebook renders the list nicely:
for example:
import sympy
from pandas import DataFrame
from sympy import *
init_printing()
a, b, c, d = symbols('a b c d')
t = [[a/b, b/a], [c/d, d/c]]
t
However, when I summarize the answers into a table using DataFrame, the math cannot be rendered any more:
df = DataFrame(t, index=['Situation 1', 'Situation 2'], columns=['Answer1','Answer2'])
df
"print df.to_latex()" also gives the same result. I also tried "print(latex(t))" but it gives this after compiling in LaTex, which is alright, but I still need to manually convert it to a table:
How should I use DataFrame properly in order to render the math properly? Or is there any other way to export the math result into a table in Latex? Thanks!
Update: 01/25/14
Thanks again to #Jakob for solving the problem. It works perfectly for simple matrices, though there are still some minor problems for more complicated math expressions. But I guess like #asmeurer said, perfection requires an update in IPython and Pandas.
Update: 01/26/14
If I render the result directly, i.e. just print the list, it works fine:
MathJax is currently not able to render tables, hence the most obvious approach (pure latex) does not work.
However, following the advise of #asmeurer you should use an html table and render the cell content as latex. In your case this could be easily achieved by the following intermediate step:
from sympy import latex
tl = map(lambda tc: '$'+latex(tc)+'$',t)
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer'])
df
which gives:
Update:
In case of two dimensional data, the simple map function will not work directly. To cope with this situation the numpy shape, reshape and ravel functions could be used like:
import numpy as np
t = [[a/b, b/a],[a*a,b*b]]
tl=np.reshape(map(lambda tc: '$'+latex(tc)+'$',np.ravel(t)),np.shape(t))
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer 1','Answer 2'])
df
This gives:
Update 2:
Pandas crops cell content if the string length exceeds a certain number. E.g a more complicated expression like
t1 = [a/2+b/2+c/2+d/2]
tl=np.reshape(map(lambda tc: '$'+latex(tc)+'$',np.ravel(t1)),np.shape(t1))
df = DataFrame(tl, index=['Situation 1'], columns=['Answer 1'])
df
gives:
To cope with this issue a pandas package option has to be altered, for details see here. For the present case the max_colwidth has to be changed. The default value is 50, hence let's change it to 100:
import pandas as pd
pd.options.display.max_colwidth=100
df
gives: