Can I extract or construct as a Pandas dataframe the table with coefficient values etc. provided by the summary() method in statsmodels? - pandas

I have run an OLS model in statsmodels and I would like to have the table in the summary as a Pandas dataframe.
This is what I mean:
I would like the table within the red frame to be constructed / extracted and become a Pandas DataFrame.
My code up to that point was straightforward:
from statsmodels.regression.linear_model import OLS
mod = OLS(endog = coded_design_poly_select.response.values, exog = coded_design_poly_select.iloc[:, :-1].values)
fitted_model = mod.fit()
fitted_model.summary()
What would you suggest?

The fitted_model is in fact a RegressionResults object that stores all the regression results and you can access them via the corresponding methods/attributes.
For what you asked for, I believe the following code would work
data = {'coef': fitted_model.params,
'std err': fitted_model.bse,
't': fitted_model.tvalues,
'P>|t|': fitted_model.pvalues,
'[0.025': fitted_model.conf_int()[0],
'0.975]': fitted_model.conf_int()[1]}
pd.DataFrame(data).round(3)

Related

How to calculate and reshape more than 1 bn items of data into PySpark?

Our use case is to read data from BQ and caculate by using pandas and numpy.reshape to turn it into input for the model, sample code like:
import numpy as np
import pandas as pd
# Source Data
feature = spark.read.format('bigquery') \
.option('table', TABLE_NAME) \
.load()
test = feature.to_pandas_on_spark().sort_values(by = ['col1','col2'], ascending = True).drop(['col1','col3','col5'], axis = 1)
test = (test - test.mean())/(test.std())
row = int(len(test)/100)
row2 = 50
col3 = 100
feature_array = np.reshape(feature_nor.values, (row,row2,col3))
feature.to_pandas_on_spark() will collect all data into driver memory and for small amout of data it can work, but for more than 15 Billion data it can not handle this.
I try to convert to_pandas_on_spark() to spark dataframe so that it can compute in parallell:
sorted_df = feature.sort('sndr_id').sort('date_index').drop('sndr_id').drop('date_index').drop('cal_dt')
mean_df = sorted_df.select(*[f.mean(c).alias(c) for c in sorted_df.columns])
std_df = sorted_df.select(*[f.stddev(c).alias(c) for c in sorted_df.columns])
Since the function is different from the pandas api, so I cannot verify these code and for the last reshape operation(np.reshape(feature_nor.values, (row,row2,col3))) dataframe doesn't support this function, is there a good solution to replace it?
I want to know how to handle 1B data in a efficient way and without memory overflow, including how to use numpy's reshape and pandas's computation operations, any answers will be super helpful!
I would advise not to use pandas or numpy on a dataset of this size, there usually is some Spark function to solve your problem, even firing up a UDF or using pandas on spark comes with a significant performance loss.
What exactly are your reshape criteria?
Maybe pivot helps?

How to save the model comparison dataframe from compare_models() in pycaret?

I want to save the model comparison data frame from compare_models() in pycaret.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
i.e. this data frame as shown above.
Does anyone know how to do that?
The solution is :
df = pull()
by Goosang Yu from the pycaret slack community.
compare_models() returns a pandas dataframe, containing the information of the list of models. Hence, you only need to save a dataframe, which can be for example achieved with best.to_csv(path). If you want to save the object in a different format (pickle, xml, ...), you can refer to pandas i/o documentation.

Exponential moving average on pandas

I was having a bit of trouble making an exponential moving average for a pandas data frame. I managed to make a simple moving average but I'm not sure how I can make one that is exponential. I was wondering if there's a function in pandas or maybe another module that can help with this. Ideally the exponential moving average would be in another column in my data frame. This is my code below:
import pandas as pd
import datetime as dt
import yfinance as yf
#Get initial paramaters
start = dt.date(2020,1,1)
end = dt.date.today()
ticker = 'SPY'
#Get df data
df = yf.download(ticker,start,end,progress=False)
#Make simple moving average
df['SMA'] = df['Adj Close'].rolling(window=75,min_periods=1).mean()
Thanks
Use the ewm method:
df['SMA'] = df['Adj Close'].ewm(span=75, min_periods=1).mean()
NB. check carefully the parameters' documentation as there is no more window, you should use one of com, span, halflife or alpha instead

Using Dask Delayed on Small/Partitioned Dataframes

I am working with time series data that is formatted as each row is a single instance of a ID/time/data. This means that the rows don't correspond 1 to 1 for each ID. Each ID has many rows across time.
I am trying to use dask delayed to have a function run on an entire ID sequence (it makes sense that the operation should be able to run on each individual ID at the same time since they don't affect each other). To do this I am first looping through each of the ID tags, pulling/locating all the data from that ID (with .loc in pandas, so it is a separate "mini" df), then delaying the function call on the mini df, adding a column with the delayed values and adding it to a list of all mini dfs. At the end of the for loop I want to call dask.compute() on all the mini-dfs at once but for some reason the mini df's values are still delayed. Below I will post some pseudocode about what I just tried to explain.
I have a feeling that this may not be the best way to go about this but it's what made sense at the time and I can't understand whats wrong so any help would be very much appreciated.
Here is what I am trying to do:
list_of_mini_dfs = []
for id in big_df:
curr_df = big_df.loc[big_df['id'] == id]
curr_df['new value 1'] = dask.delayed(myfunc)(args1)
curr_df['new value 2'] = dask.delayed(myfunc)(args2) #same func as previous line
list_of_mini_dfs.append(curr_df)
list_of_mini_dfs = dask.delayed(list_of_mini_dfs).compute()
Concat all mini dfs into new big df.
As you can see by the code I have to reach into my big/overall dataframe to pull out each ID's sequence of data since it is interspersed throughout the rows. I want to be able to call a delayed function on that single ID's data and then return the values from the function call into the big/overall dataframe.
Currently this method is not working, when I concat all the mini dataframes back together the two values I have delayed are still delayed, which leads me to think that it is due to the way I am delaying a function within a df and trying to compute the list of dataframes. I just can't see how to fix it.
Hopefully this was relatively clear and thank you for the help.
IIUC you are trying to do a sort of transform using dask.
import pandas as pd
import dask.dataframe as dd
import numpy as np
# generate big_df
dates = pd.date_range(start='2019-01-01',
end='2020-01-01')
l = len(dates)
out = []
for i in range(1000):
df = pd.DataFrame({"ID":[i]*l,
"date": dates,
"data0": np.random.randn(l),
"data1": np.random.randn(l)})
out.append(df)
big_df = pd.concat(out, ignore_index=True)\
.sample(frac=1)\
.reset_index(drop=True)
Now you want to apply your function fun on columns data0 and data1
Pandas
out = big_df.groupby("ID")[["data0","data1"]]\
.apply(fun)\
.reset_index()
df_pd = pd.merge(big_df, out, how="left", on="ID" )
Dask
df = dd.from_pandas(big_df, npartitions=4)
out = df.groupby("ID")[["data0","data1"]]\
.apply(fun, meta={'data0':'f8',
'data1':'f8'})\
.rename(columns={'data0': 'new_values0',
'data1': 'new_values1'})\
.compute() # Here you need to compute otherwise you'll get NaNs
df_dask = dd.merge(df, out,
how="left",
left_on=["ID"],
right_index=True)
The dask version is not necessarily faster than the pandas one. In particular if your df fits in RAM.

How to Render Math Table Properly in IPython Notebook

The math problem that I'm solving gives different analytical solutions in different scenarios, and I would like to summarize the result in a nice table. IPython Notebook renders the list nicely:
for example:
import sympy
from pandas import DataFrame
from sympy import *
init_printing()
a, b, c, d = symbols('a b c d')
t = [[a/b, b/a], [c/d, d/c]]
t
However, when I summarize the answers into a table using DataFrame, the math cannot be rendered any more:
df = DataFrame(t, index=['Situation 1', 'Situation 2'], columns=['Answer1','Answer2'])
df
"print df.to_latex()" also gives the same result. I also tried "print(latex(t))" but it gives this after compiling in LaTex, which is alright, but I still need to manually convert it to a table:
How should I use DataFrame properly in order to render the math properly? Or is there any other way to export the math result into a table in Latex? Thanks!
Update: 01/25/14
Thanks again to #Jakob for solving the problem. It works perfectly for simple matrices, though there are still some minor problems for more complicated math expressions. But I guess like #asmeurer said, perfection requires an update in IPython and Pandas.
Update: 01/26/14
If I render the result directly, i.e. just print the list, it works fine:
MathJax is currently not able to render tables, hence the most obvious approach (pure latex) does not work.
However, following the advise of #asmeurer you should use an html table and render the cell content as latex. In your case this could be easily achieved by the following intermediate step:
from sympy import latex
tl = map(lambda tc: '$'+latex(tc)+'$',t)
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer'])
df
which gives:
Update:
In case of two dimensional data, the simple map function will not work directly. To cope with this situation the numpy shape, reshape and ravel functions could be used like:
import numpy as np
t = [[a/b, b/a],[a*a,b*b]]
tl=np.reshape(map(lambda tc: '$'+latex(tc)+'$',np.ravel(t)),np.shape(t))
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer 1','Answer 2'])
df
This gives:
Update 2:
Pandas crops cell content if the string length exceeds a certain number. E.g a more complicated expression like
t1 = [a/2+b/2+c/2+d/2]
tl=np.reshape(map(lambda tc: '$'+latex(tc)+'$',np.ravel(t1)),np.shape(t1))
df = DataFrame(tl, index=['Situation 1'], columns=['Answer 1'])
df
gives:
To cope with this issue a pandas package option has to be altered, for details see here. For the present case the max_colwidth has to be changed. The default value is 50, hence let's change it to 100:
import pandas as pd
pd.options.display.max_colwidth=100
df
gives: