Square elements of a non-square pandas dataframe - pandas

I want to square the elements of a non-square (m*n) pandas dataframe, but each time I try either of the following
1) np.power(errorR, 2)
2) errorR**2
I get this error:
ValueError: input must be a square array
Is there a good solution for this?

Try df.applymap(lambda x: x**2)

I was using this in the jupyter environment and it worked after restarting the workspace.
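For reference, a minimal sketch of element-wise squaring on a non-square frame. The 2x3 data is made up for illustration; only the name errorR comes from the question. On a plain DataFrame all three forms below work element-wise, so no apply-style call is needed (in newer pandas versions applymap is also deprecated in favour of DataFrame.map).

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 (non-square) frame standing in for errorR.
errorR = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])

squared_operator = errorR ** 2        # element-wise power operator
squared_numpy = np.power(errorR, 2)   # NumPy ufunc, also element-wise
squared_method = errorR.pow(2)        # pandas method form

print(squared_operator)
```

If the same expressions raise "input must be a square array", it is worth checking type(errorR): the object may not be a DataFrame at all, which would also be consistent with the problem disappearing after a restart.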

Related

Pandas apply function is not resolving

Please view the Data Frame by clicking this image
Names=jobs[['Company Name']]
F = lambda x: x.split("\n")
Names.apply(F , axis=1)
AttributeError: 'Series' object has no attribute 'split'
When I run the following code instead, it works. Why am I facing this issue? I have never run into this kind of problem before. PS: unlike my earlier data, I got this data by scraping websites, so I am hoping it has something to do with that.
Names=jobs[['Company Name']]
F = lambda x: x.str.split("\n")
Names.apply(F , axis=1)
When I try it this way:
Ratings = jobs['Company Name'].apply(lambda x:x.split("\n")[1] , axis=1)
I get this error
TypeError: <lambda>() got an unexpected keyword argument 'axis'
You do not need apply here, since str.split is vectorized.
jobs['Company Name'].str.split('\n')
should do the job.
As for why the first attempt failed, it is most likely due to the double brackets in [['Company Name']]. Single brackets collapse the selection to a Series, while double brackets keep the 2-dimensional structure of a DataFrame. DataFrame.apply with axis=1 then passes each row to the function as a Series, which has no split method. See e.g. "Python pandas: Keep selected column as DataFrame instead of Series" for more details.
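The difference between the two selections can be sketched as follows. The two sample values are invented stand-ins for the scraped data (a name and a rating separated by a newline); only the jobs and 'Company Name' names come from the question.

```python
import pandas as pd

# Hypothetical scraped values: company name and rating separated by "\n".
jobs = pd.DataFrame({"Company Name": ["Acme\n4.1", "Globex\n3.8"]})

as_frame = jobs[["Company Name"]]   # double brackets: still a DataFrame
as_series = jobs["Company Name"]    # single brackets: a Series of strings

# Vectorized split on the Series, no apply and no axis argument needed.
parts = as_series.str.split("\n")

# Take the second piece of every split, which is what the
# lambda x: x.split("\n")[1] attempt was after.
ratings = parts.str[1]

print(type(as_frame).__name__, type(as_series).__name__)
print(ratings.tolist())
```

Series.apply works on one value at a time and has no axis parameter, so the extra axis=1 is forwarded to the lambda as a keyword argument, which explains the TypeError in the question.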

Holoviz panel will not print pandas dataframe row in Jupyter notebook

I'm trying to recreate the first panel.interact example in the Holoviz tutorial using a Pandas dataframe instead of a Dask dataframe. I get the slider, but the pandas dataframe row does not show.
See the original example at: http://holoviz.org/tutorial/Building_Panels.html
I've tried using Dask as in the Holoviz example. Dask rows print out just fine, which suggests that panel treats Dask dataframe rows differently from Pandas dataframe rows when printing. Here's my minimal code:
import pandas as pd
import panel
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
def select_row(rowno=0):
    row = df.loc[rowno]
    return row
panel.extension()
panel.extension('katex')
panel.interact(select_row, rowno=(0, 5))
I've included a line with the katex extension, because without it, I get a warning that it is needed. Without it, I don't even get the slider.
I can call the select_row(rowno=0) function separately in a Jupyter cell and get a nice printout of the row, so it appears the function is working as it should.
Any help in getting this to work would be most appreciated. Thanks.
Got a solution. With Pandas, loc[rowno:rowno] returns a pandas.core.frame.DataFrame object of length 1 which works fine with panel while loc[rowno] returns a pandas.core.series.Series object which does not work so well. Thus modifying the select_row() function like this makes it all work:
def select_row(rowno=0):
    row = df.loc[rowno:rowno]
    return row
Still not sure, however, why panel will print out the Dataframe object and not the Series object.
Note: if you use iloc, you have to add 1 to the end of the slice, i.e., df.iloc[rowno:rowno+1], because iloc slices exclude the end point while loc slices include it.
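The return types involved can be checked with pandas alone, without panel. This sketch reuses the frame from the question; the variable names and the list-based df.loc[[rowno]] form are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "c", "d", "a", "b"],
                   "val": [1, 2, 3, 4, 5, 6]})
rowno = 2

row_series = df.loc[rowno]             # scalar label: a Series
row_loc = df.loc[rowno:rowno]          # label slice, end inclusive: 1-row DataFrame
row_iloc = df.iloc[rowno:rowno + 1]    # position slice, end exclusive: 1-row DataFrame
row_list = df.loc[[rowno]]             # list of labels: also a 1-row DataFrame

print(type(row_series).__name__, type(row_loc).__name__)
```

All three DataFrame variants select the same single row, so any of them should be usable as the return value of select_row.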

Naming variable for median calculation using Numpy

I'm using numpy to get the median. The dataframe has two variables. Is there a way to tell it which variable I want the median for?
np.median(dataframename)
You must cast your dataframe to a numpy array. Try this:
#input data in dataframename
dataframename = np.asarray(dataframename)
dataframename = dataframename.astype(float)
np.median(dataframename)
I realized that my data was not in a dataframe. Once I put it in, this worked.
dataframename.loc[:,"var18"].median()
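To make the difference between the two answers concrete, here is a small sketch with invented numbers. The column name var17 is hypothetical; var18 and dataframename come from the thread. np.median on the whole frame's values gives one number over every cell, while selecting a column first gives the median of that variable only.

```python
import numpy as np
import pandas as pd

# Hypothetical two-variable frame.
dataframename = pd.DataFrame({
    "var17": [1.0, 2.0, 3.0, 100.0],
    "var18": [10.0, 20.0, 30.0, 40.0],
})

col_median = dataframename.loc[:, "var18"].median()        # one variable
np_median = np.median(dataframename["var18"].to_numpy())   # same, via NumPy
all_median = np.median(dataframename.to_numpy())           # every cell pooled
per_column = dataframename.median()                        # one median per column

print(col_median, all_median)  # 25.0 15.0
```

One difference to keep in mind: the pandas median method skips missing values by default, whereas np.median returns nan if any are present.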

Dask DataFrame after Apply cannot reindex from a duplicate axis

I am trying to replace the NaN values of item_price with the mean value for each item_id in the following dask dataframe:
all_data['item_price'] = all_data[['item_id','item_price']].groupby('item_id')['item_price'].apply(lambda x: x.fillna(x.mean()))
all_data.head()
Unfortunately I get the following error:
ValueError: cannot reindex from a duplicate axis
Any idea how to avoid this error or any other way to change nan values to mean values for a dask dataframe?
I found a solution to the problem. Fillna along with map can be used instead:
all_data['item_price'] = all_data['item_price'].fillna(
    all_data['item_id'].map(
        all_data.groupby('item_id')['item_price'].mean().compute()
    )
)
This gets rid of the duplicate axis problem. Beware that you have to call compute() on the group means inside map, as in the code above, for it to work without an error.
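The fillna-plus-map idea is easiest to follow on a tiny example. The sketch below uses plain pandas rather than dask (an assumption made to keep it self-contained, so no compute() call appears), and the five rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two items, each with one missing price.
all_data = pd.DataFrame({
    "item_id": [1, 1, 2, 2, 2],
    "item_price": [10.0, np.nan, 4.0, np.nan, 6.0],
})

# Mean price per item_id (NaN values are ignored by mean).
means = all_data.groupby("item_id")["item_price"].mean()

# Map every row's item_id to its group mean, then use that aligned
# Series to fill only the missing prices.
all_data["item_price"] = all_data["item_price"].fillna(
    all_data["item_id"].map(means)
)

print(all_data["item_price"].tolist())  # [10.0, 10.0, 4.0, 5.0, 6.0]
```

Because map returns a Series with the same index as the original column, no reindexing on the group keys takes place, which is the step that failed in the groupby-apply version.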

Vectorizing text from data frame column using pandas

I have a Data Frame which looks like this:
I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
Doing some research and trying changes I found out the fit_transform function returns a scipy.sparse.csr.csr_matrix and that is not callable.
Is there another way to do this?
Thanks!
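There are a number of problems with your code. apply expects a function, but you are passing it the result of fit_transform, which is why you get the "not callable" error. You probably need something like
allDataVectorized = pd.DataFrame(vectorizerCount.fit_transform(allData['headline_text']).toarray())
allData['headline_text'] (with single brackets) is a Series of strings, which is the iterable of documents that fit_transform expects.
fit_transform returns a csr matrix.
.toarray() converts it to a dense numpy array, from which pd.DataFrame(...) creates a DataFrame.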
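A small end-to-end sketch of vectorizing one text column, assuming scikit-learn is installed. The three headlines are invented, and the column labels are rebuilt from the fitted vocabulary_ attribute. Note that the column is passed as a Series (single brackets), so that the vectorizer iterates over the strings themselves.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical headlines standing in for the real data.
allData = pd.DataFrame({"headline_text": [
    "rain floods city streets",
    "city council approves budget",
    "rain delays cricket match",
]})

vectorizerCount = CountVectorizer(stop_words="english")

# Tokenize, build the vocabulary and count: one row per headline.
counts = vectorizerCount.fit_transform(allData["headline_text"])

# vocabulary_ maps each term to its column position; sort terms by position.
columns = sorted(vectorizerCount.vocabulary_, key=vectorizerCount.vocabulary_.get)

# Dense conversion is fine for small data; large corpora are better left sparse.
allDataVectorized = pd.DataFrame(counts.toarray(), columns=columns,
                                 index=allData.index)

print(allDataVectorized)
```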