How to Render Math Table Properly in IPython Notebook - pandas

The math problem that I'm solving gives different analytical solutions in different scenarios, and I would like to summarize the results in a nice table. IPython Notebook renders a plain list nicely, for example:
import sympy
from pandas import DataFrame
from sympy import *
init_printing()
a, b, c, d = symbols('a b c d')
t = [[a/b, b/a], [c/d, d/c]]
t
However, when I summarize the answers in a table using DataFrame, the math is no longer rendered:
df = DataFrame(t, index=['Situation 1', 'Situation 2'], columns=['Answer1','Answer2'])
df
"print df.to_latex()" also gives the same result. I also tried "print(latex(t))", which gives this after compiling in LaTeX. That is alright, but I would still need to convert it to a table manually:
How should I use DataFrame in order to render the math properly? Or is there any other way to export the math result into a table in LaTeX? Thanks!
Update: 01/25/14
Thanks again to @Jakob for solving the problem. It works perfectly for simple matrices, though there are still some minor problems with more complicated math expressions. But I guess, as @asmeurer said, perfection requires an update in IPython and Pandas.
Update: 01/26/14
If I render the result directly, i.e. just print the list, it works fine:

MathJax is currently not able to render tables, hence the most obvious approach (pure latex) does not work.
However, following the advice of @asmeurer, you should use an HTML table and render the cell content as LaTeX. In your case this can easily be achieved with the following intermediate step:
from sympy import latex
tl = list(map(lambda tc: '$'+latex(tc)+'$', t))  # list() so the result is a concrete list on Python 3
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer'])
df
which gives:
Update:
In the case of two-dimensional data, the simple map function will not work directly. To cope with this situation, the numpy shape, reshape and ravel functions can be used as follows:
import numpy as np
t = [[a/b, b/a],[a*a,b*b]]
tl = np.reshape(list(map(lambda tc: '$'+latex(tc)+'$', np.ravel(t))), np.shape(t))  # list() is needed on Python 3, where map returns an iterator
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer 1','Answer 2'])
df
This gives:
Update 2:
Pandas crops cell content if the string length exceeds a certain number. For example, a more complicated expression like
t1 = [a/2+b/2+c/2+d/2]
tl = np.reshape(list(map(lambda tc: '$'+latex(tc)+'$', np.ravel(t1))), np.shape(t1))
df = DataFrame(tl, index=['Situation 1'], columns=['Answer 1'])
df
gives:
To cope with this issue, a pandas display option has to be changed (see the pandas options documentation for details). In the present case, max_colwidth has to be increased. The default value is 50, so let's change it to 100:
import pandas as pd
pd.options.display.max_colwidth=100
df
gives:
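The same wrapping can also be done without NumPy. The following is only a minimal sketch, assuming the two-by-two example from the question; the helper name wrap_latex is made up for illustration and is not part of SymPy or pandas.

```python
import pandas as pd
from sympy import symbols, latex

a, b, c, d = symbols('a b c d')

def wrap_latex(expr):
    """Hypothetical helper: LaTeX form of a SymPy expression wrapped in $...$ for MathJax."""
    return '$' + latex(expr) + '$'

# Two-dimensional data: one inner list per table row.
t = [[a/b, b/a], [c/d, d/c]]
tl = [[wrap_latex(cell) for cell in row] for row in t]

df = pd.DataFrame(tl, index=['Situation 1', 'Situation 2'],
                  columns=['Answer 1', 'Answer 2'])

# Raise the column-width limit so longer expressions are not cropped.
pd.options.display.max_colwidth = 100
```

A nested list comprehension keeps the row structure, so no reshape step is needed.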

Related

Replacing whole string with part of it using Regex in Python Pandas

I have a table in pdf found on this link: https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf
I am trying to clean the data before doing analysis but I noticed that between 2014-2017 the cigarette data was merged due to error. Instead of two cells per year in a column for Sweden and UK I got one merged which looks something like this: 5393688\r28587000
I would like to update data only for Sweden and get the first value before \r.
So far my code was as follows:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import tabula
cig= pd.DataFrame(tabula.read_pdf(r"https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf", pages ='all')[0])
cig.replace(to_replace='N/A', value=0, inplace=True, regex=True)
cig= cig.replace(',','', regex=True)
After this I tried
df.iloc[26,:].str.replace("('\r').*","")
cig.iloc[26,:] = cig.iloc[26,:].replace("('\r').*","", regex=True)
and
cig.iloc[26,:].replace(to_replace='(?:[0-9]+)([^0-9]{2})([0-9]+)', value='', regex=True)
But none of the above seem to produce desired result and I still have values with similar format i.e. 5393688\r28587000
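Set regex=True and assign the changed subset back to the dataframe. The quotes inside the pattern are matched literally, so leave them out and match the carriage return itself:
cig.iloc[26,:] = cig.iloc[26,:].replace(r"\r.*", "", regex=True)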
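To check a pattern before touching the real table, it can help to try it on a few made-up values. This is a small sketch, assuming the merged cells contain a real carriage return character; the Series below is a hypothetical stand-in, not the data from the PDF.

```python
import pandas as pd

# Hypothetical stand-in for one row of the extracted table.
row = pd.Series(["5393688\r28587000", "1200", "42\r7"])

# r"\r.*" matches the carriage return and everything after it,
# so only the first value in each merged cell survives.
cleaned = row.replace(r"\r.*", "", regex=True)
print(cleaned.tolist())
```

Cells without a carriage return are left unchanged.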

Pandas styling - change font size and format float/apply background gradient

I am building an application that displays stock correlation data in various visual forms, including a matrix with a heatmap applied. My heatmap is created by passing the correlation matrix dataframe into an IPy Widgets Output, so I can display it as part of a VBox later on. I have successfully applied a background gradient and formatted my numbers to 2 decimal places. Can anyone help me edit the function to also reduce the font size? I just want to shrink it a little.
Note: I chose to do this using dataframe styling over matplotlib as I had a number of issues getting the output to display in the way I wanted. I also have a function that downloads the dataframe to excel with the styling applied.
I have tried putting the following line of code at the beginning of my notebook so I can leave it outside of the function, but it seems to get ignored once the dataframe is passed to Output.
pd.options.display.float_format = "{:,.2f}".format
Here is my code sample:
import seaborn as sns
import ipywidgets as ipw
import pandas as pd
import numpy as np
#Sample Data
data = np.random.randint(5,30,size=500)
df = pd.DataFrame(data.reshape((50,10)))
corr = df.corr()
#Function produces dataframe as Output
def output_heatmap_df(df):
    out = ipw.Output()
    with out:
        display(df.style\
            .background_gradient(cmap=sns.diverging_palette(220,10, as_cmap=True),axis=None).format("{:,.2f}"))
    out.layout.width='1600px'
    return out
output_heatmap_df(corr)
In case anyone should come across this, the below code worked for me in the end:
def output_heatmap_df(df):
    out = ipw.Output()
    with out:
        display(df.style\
            .background_gradient(cmap=sns.diverging_palette(220,10, as_cmap=True),axis=None).format("{:,.2f}")
            .set_properties(**{'text-align':'center','font-size':'10px'})
            .set_table_styles([{'selector':'th','props':[('text-align','center'),('font-size','10px')]}])
        )
    out.layout.width='1600px'
    return out

HoloViews: create boxplots for every column in a pandas dataframe

I'm able to create the following boxplot with Pandas pandas.DataFrame.boxplot() method:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1234)
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])
df.plot.box()
plt.show()
However, when I try to do the same using HoloViews' BoxWhisker Element with Bokeh as the backend, it works fine only for a single column:
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
hv.BoxWhisker(
    data=df['Col1'],
    vdims='Col1'
)
But as soon as I try to add just another column, I get the below error:
hv.BoxWhisker(
    data=df[['Col1', 'Col2']]
)
DataError: None of the available storage backends were able to support the supplied data format. PandasInterface raised following error:
unsupported operand type(s) for +: 'NoneType' and 'int'
PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html
I can't tell whether the problem is with the tabular data format that HoloViews expects, or whether I'm not applying the syntax properly.
I would also recommend James Bednar's answer, which uses hvPlot. hvPlot is built on top of HoloViews:
import hvplot.pandas
df.hvplot.box()
However, if you want to do it in HoloViews instead of hvPlot, you would have to melt your data to get all column names in one column, and all values in the other column.
This code works for your sample data:
hv.BoxWhisker(df.melt(), kdims='variable', vdims='value')
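To see why melting helps, here is a short sketch of what df.melt() produces for the sample data from the question. The variable name tidy is only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])

# melt() stacks the four columns into long form: 'variable' holds the
# former column name and 'value' holds the number (10 rows x 4 columns = 40 rows).
tidy = df.melt()
print(tidy.shape)
print(tidy.head())
```

Each row of the melted frame is one observation, so 'variable' can serve as the key dimension and 'value' as the value dimension.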
I'm not sure how to achieve what you want from the native HoloViews BoxWhisker interface, which is set up for tidy data rather than a set of independent columns like that. Meanwhile you can use hvPlot just as you use the native .plot() call:

Can I extract or construct as a Pandas dataframe the table with coefficient values etc. provided by the summary() method in statsmodels?

I have run an OLS model in statsmodels and I would like to have the table in the summary as a Pandas dataframe.
This is what I mean:
I would like the table within the red frame to be constructed / extracted and become a Pandas DataFrame.
My code up to that point was straightforward:
from statsmodels.regression.linear_model import OLS
mod = OLS(endog = coded_design_poly_select.response.values, exog = coded_design_poly_select.iloc[:, :-1].values)
fitted_model = mod.fit()
fitted_model.summary()
What would you suggest?
The fitted_model is in fact a RegressionResults object that stores all the regression results and you can access them via the corresponding methods/attributes.
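For what you asked for, I believe the following code would work:
import numpy as np
import pandas as pd
# conf_int() is a DataFrame for pandas inputs but a plain array for NumPy inputs
# (as in the question, which passes .values); np.asarray handles both cases.
ci = np.asarray(fitted_model.conf_int())
data = {'coef': fitted_model.params,
        'std err': fitted_model.bse,
        't': fitted_model.tvalues,
        'P>|t|': fitted_model.pvalues,
        '[0.025': ci[:, 0],
        '0.975]': ci[:, 1]}
pd.DataFrame(data).round(3)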

How can I make different columns render as different colors in holoviews / hvplot?

I have a pandas dataframe with two columns of time series data. In my actual data, these columns are large enough that the render is unwieldy without datashader. I am attempting to compare events from these two timeseries. However, I need to be able to tell which data point is from which column. A simple functional example is below. How would I get columns A and B to use different color maps?
import numpy as np
import hvplot.pandas
import pandas as pd
A = np.random.randint(10, size=10000)
B = np.random.randint(30, size=10000)
d = {'A':A,'B':B}
df = pd.DataFrame(d)
df.hvplot(kind='scatter',datashade=True, height=500, width=1000, dynspread=False)
You will have to use the count_cat aggregator that counts each category separately, e.g. in the example above that would look like this:
import datashader as ds
df.hvplot(kind='scatter', aggregator=ds.count_cat('Variable'), datashade=True,
          height=500, width=1000)
The 'Variable' here corresponds to the default group_label that hvplot assigns to the columns. If you provided a different group_label you would have to update the aggregator to match. However instead of supplying an aggregator explicitly you can also use the by keyword:
df.hvplot(kind='scatter', by='Variable', datashade=True,
          height=500, width=1000)
Once hvplot 0.3.1 is released you'll also be able to supply an explicit cmap, e.g.:
df.hvplot(kind='scatter', by='Variable', datashade=True,
          height=500, width=1000, cmap={'A': 'red', 'B': 'blue'})