Vectorizing text from data frame column using pandas

I have a Data Frame which looks like this:
I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
After some research and experimenting, I found out that the fit_transform function returns a scipy.sparse.csr.csr_matrix, which is not callable.
Is there another way to do this?
Thanks!

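There are a number of problems with your code. apply expects a function as its first argument, but you are passing it the result of fit_transform, which is why the error says the csr_matrix is not callable. You do not need apply at all; you probably need something like
allDataVectorized = pd.DataFrame.sparse.from_spmatrix(vectorizerCount.fit_transform(allData['headline_text']))
allData['headline_text'] (with single brackets) is a Series of strings, which is the iterable of documents that fit_transform expects. With double brackets you would get a DataFrame, and iterating over a DataFrame yields its column labels rather than its rows.
fit_transform returns a csr matrix with one row per headline.
pd.DataFrame.sparse.from_spmatrix(...) creates a DataFrame from that csr matrix.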

Related

Understand pandas' applymap argument

I'm trying to highlight specific columns in my dataframe using the guidelines from this post, https://stackoverflow.com/a/41655055/5158984.
My question is on the use of the subset argument. My guess is that it's part of the **kwargs argument. However, the official documentation from Pandas, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html, explains it only vaguely.
So in general, how can I know which keywords I can use whenever I see **kwargs?
Thanks!
It seems that you are confusing pandas.DataFrame.applymap and df.style.applymap (where df is an instance of pd.DataFrame), for which subset stands on its own and is not part of the kwargs arguments.
Here is one way to find out (in your terminal or a Jupyter notebook cell) what the named parameters of this method (or any other Pandas method, for that matter) are:
import pandas as pd
df = pd.DataFrame()
help(df.style.applymap)
# Output
Help on method applymap in module pandas.io.formats.style:
applymap(func: 'Callable', subset: 'Subset | None' = None, **kwargs) -> 'Styler' method of pandas.io.formats.style.Styler instance
    Apply a CSS-styling function elementwise.
    Updates the HTML representation with the result.
    Parameters
    ----------
    func : function
        ``func`` should take a scalar and return a string.
    subset : label, array-like, IndexSlice, optional
        A valid 2d input to `DataFrame.loc[<subset>]`, or, in the case of a 1d input
        or single key, to `DataFrame.loc[:, <subset>]` where the columns are
        prioritised, to limit ``data`` to *before* applying the function.
    **kwargs : dict
        Pass along to ``func``.
    ...
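A programmatic complement to help(...) is the standard-library inspect module. The sketch below separates a method's explicitly named parameters from its **kwargs catch-all. The helper name describe_parameters is hypothetical, and pd.DataFrame.apply is used only as a convenient example of a method that accepts **kwargs.

```python
import inspect

import pandas as pd


def describe_parameters(func):
    """Return (explicit parameter names, name of the **kwargs catch-all or None)."""
    explicit, catch_all = [], None
    for name, param in inspect.signature(func).parameters.items():
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            catch_all = name          # the ** parameter, e.g. "kwargs"
        elif param.kind is not inspect.Parameter.VAR_POSITIONAL:
            explicit.append(name)     # parameters that have their own name
    return explicit, catch_all


# Hypothetical usage: inspect DataFrame.apply (includes "self" because it is unbound).
explicit, catch_all = describe_parameters(pd.DataFrame.apply)
print(explicit)
print(catch_all)
```

Any name that shows up in the explicit list stands on its own. Whatever else you pass by keyword lands in the catch-all and is usually forwarded to the function you supplied, so the valid keywords are those of your own function.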

Can't convert Matrix to DataFrame JULIA

How can I convert a matrix to a DataFrame in Julia?
I have a 10×2 Matrix{Any}, and when I try to convert it to a DataFrame using this:
df2 = convert(DataFrame,Xt2)
I get this error:
MethodError: Cannot `convert` an object of type Matrix{Any} to an object of type DataFrame
Try instead
df2 = DataFrame(Xt2,:auto)
You cannot use convert for this; you can use the DataFrame constructor, but then as the documentation (simply type ? DataFrame in the Julia REPL) will tell you, you need to either provide a vector of column names, or :auto to auto-generate column names.
Tangentially, I would also strongly recommend avoiding Matrix{Any} (or really anything involving Any) for any scenario where performance is at all important.

Can I extract or construct as a Pandas dataframe the table with coefficient values etc. provided by the summary() method in statsmodels?

I have run an OLS model in statsmodels and I would like to have the table in the summary as a Pandas dataframe.
This is what I mean:
I would like the table within the red frame to be constructed / extracted and become a Pandas DataFrame.
My code up to that point was straightforward:
from statsmodels.regression.linear_model import OLS
mod = OLS(endog = coded_design_poly_select.response.values, exog = coded_design_poly_select.iloc[:, :-1].values)
fitted_model = mod.fit()
fitted_model.summary()
What would you suggest?
The fitted_model is in fact a RegressionResults object that stores all the regression results and you can access them via the corresponding methods/attributes.
For what you asked for, I believe the following code would work
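import numpy as np
import pandas as pd
# conf_int() returns an (n, 2) array for NumPy inputs and a DataFrame for pandas inputs;
# np.asarray makes the column selection below work in both cases
ci = np.asarray(fitted_model.conf_int())
data = {'coef': fitted_model.params,
        'std err': fitted_model.bse,
        't': fitted_model.tvalues,
        'P>|t|': fitted_model.pvalues,
        '[0.025': ci[:, 0],   # lower bounds
        '0.975]': ci[:, 1]}   # upper bounds
pd.DataFrame(data).round(3)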
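For reference, a self-contained sketch of the same idea on synthetic data, assuming statsmodels is installed. The data, the seed, and the name coef_table are invented for illustration. Since the model is fitted on NumPy arrays, as in the question, the confidence bounds are taken column-wise from the (n, 2) array returned by conf_int().

```python
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS

# Synthetic data (invented): intercept column plus one regressor.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=50)

fitted_model = OLS(endog=y, exog=X).fit()

# Shape (n_coefficients, 2): column 0 holds lower bounds, column 1 upper bounds.
ci = np.asarray(fitted_model.conf_int())

coef_table = pd.DataFrame({
    'coef': fitted_model.params,
    'std err': fitted_model.bse,
    't': fitted_model.tvalues,
    'P>|t|': fitted_model.pvalues,
    '[0.025': ci[:, 0],
    '0.975]': ci[:, 1],
}).round(3)

print(coef_table)
```

Because the inputs are plain arrays, the rows are numbered 0, 1, ... rather than labelled with variable names. Fitting on a pandas DataFrame instead would carry the column names through to params and the other attributes.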

Holoviz panel will not print pandas dataframe row in Jupyter notebook

I'm trying to recreate the first panel.interact example in the Holoviz tutorial using a Pandas dataframe instead of a Dask dataframe. I get the slider, but the pandas dataframe row does not show.
See the original example at: http://holoviz.org/tutorial/Building_Panels.html
I've tried using Dask as in the Holoviz example. Dask rows print out just fine, which suggests that panel treats Dask dataframe rows differently from Pandas dataframe rows when printing. Here's my minimal code:
import pandas as pd
import panel
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
def select_row(rowno=0):
    row = df.loc[rowno]
    return row
panel.extension()
panel.extension('katex')
panel.interact(select_row, rowno=(0, 5))
I've included a line with the katex extension, because without it, I get a warning that it is needed. Without it, I don't even get the slider.
I can call the select_row(rowno=0) function separately in a Jupyter cell and get a nice printout of the row, so it appears the function is working as it should.
Any help in getting this to work would be most appreciated. Thanks.
Got a solution. With Pandas, loc[rowno:rowno] returns a pandas.core.frame.DataFrame object of length 1 which works fine with panel while loc[rowno] returns a pandas.core.series.Series object which does not work so well. Thus modifying the select_row() function like this makes it all work:
def select_row(rowno=0):
    row = df.loc[rowno:rowno]
    return row
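Still not sure, however, why panel will print out the DataFrame object and not the Series object.
Note: loc slices include the end label, but iloc slices exclude the end position, so if you use iloc you need to add 1 to the end of the slice, i.e., df.iloc[rowno:rowno+1].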
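The difference in return types can be checked without panel. A small sketch using the data from the question (the variable names as_series, as_frame_loc, as_frame_iloc and as_frame_list are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'b', 'c', 'd', 'a', 'b'],
                   'val': [1, 2, 3, 4, 5, 6]})
rowno = 2

as_series = df.loc[rowno]                 # scalar label -> Series
as_frame_loc = df.loc[rowno:rowno]        # label slice, end included -> one-row DataFrame
as_frame_iloc = df.iloc[rowno:rowno + 1]  # position slice, end excluded -> one-row DataFrame
as_frame_list = df.loc[[rowno]]           # list of labels -> one-row DataFrame

print(type(as_series).__name__)
print(type(as_frame_loc).__name__)
```

All three DataFrame variants select the same single row here because the index is the default 0..5 range, where labels and positions coincide.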

Naming variable for median calculation using Numpy

I'm using numpy to get the median. The dataframe has two variables. Is there a way to tell it which variable I want the median for?
np.median(dataframename)
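You must cast your dataframe to a NumPy array. Try this:
import numpy as np
# input data is in dataframename
dataframename = np.asarray(dataframename)
dataframename = dataframename.astype(float)
# axis=0 gives one median per column (variable); without axis, np.median pools all values
np.median(dataframename, axis=0)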
I realized that my data was not in a dataframe. Once I put it in, this worked.
dataframename.loc[:,"var18"].median()
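To make the options concrete, here is a small sketch comparing the pandas and NumPy routes. The values and the second column name var17 are invented; var18 is the column from the question.

```python
import numpy as np
import pandas as pd

# Invented data with two variables.
dataframename = pd.DataFrame({
    'var17': [1.0, 2.0, 3.0, 4.0],
    'var18': [10.0, 20.0, 30.0, 100.0],
})

# Median of one named variable, pandas and NumPy versions.
median_pandas = dataframename.loc[:, 'var18'].median()
median_numpy = np.median(dataframename['var18'].to_numpy())

# One median per column, in column order.
per_column = np.median(dataframename.to_numpy(), axis=0)

# Without axis, all eight values are pooled into a single median.
overall = np.median(dataframename.to_numpy())

print(median_pandas, median_numpy, per_column, overall)
```

Selecting the column by name before taking the median is the least error-prone choice, since it does not depend on column order.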