DataFrame.quantile() doesn't work when used with apply - pandas

I wrote a function to get the middle part of a DataFrame, but the quantile function doesn't work as expected, so I can't get the part of the DataFrame I want.
I'm using pandas 0.23.4.
Here is the function body:
def func(df, key_name, tail):
    df_new = df[df[key_name] > df[key_name].quantile(tail)]
    return df_new
It works fine when I call it directly:
func(clean_video, 'all_vv', 0.1)
But when I apply it like this:
clean_video.apply(func, axis=1, args=('all_vv', 0.1))
I get the error below:
AttributeError: ("'int' object has no attribute 'quantile'", 'occurred at index 0')
Thanks for any help.
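A possible explanation, sketched with made-up data: with axis=1, apply calls func once per row and passes that row as a Series, so df[key_name] is a single number rather than a column, and a number has no quantile method. Since func already operates on the whole DataFrame, calling it directly (or through pipe) should be enough. The sample values in clean_video below are hypothetical.

```python
import pandas as pd


def func(df, key_name, tail):
    # Keep only the rows whose value is above the given quantile of the column.
    return df[df[key_name] > df[key_name].quantile(tail)]


# Hypothetical stand-in for the clean_video frame from the question.
clean_video = pd.DataFrame({"all_vv": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Calling the function on the whole frame works, because df[key_name] is a column.
direct = func(clean_video, "all_vv", 0.1)

# pipe is an equivalent, chainable way to pass the whole frame to func.
piped = clean_video.pipe(func, "all_vv", 0.1)

# What apply(axis=1) hands to func: one row as a Series, so row["all_vv"] is a scalar.
first_row = clean_video.iloc[0]
scalar_has_quantile = hasattr(first_row["all_vv"], "quantile")
```

Here scalar_has_quantile is False, which is the same situation that produces the AttributeError in the question.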

Related

Indexing in Function, Pandas

I work with pandas dataframes and want to create a function with indexing.
def func(L,ra,N):
    df_N = df[T_N].diff()
    df[T_N] = df[T_N].iloc[df_N.index.min():df_N.index.max()]
    df[T_N] = df[T_N].mean()
    Value = [L,df[TR[0]].iloc[1]-(df[T_N[0]].iloc[1]-df[T_N[0]].iloc[1]),1]
    return Value
The line
df[T_N] = df[T_N].iloc[df_N.index.min():df_N.index.max()]
raises the error
TypeError: cannot do positional indexing on Int64Index with these indexers [nan] of type float
Does anyone know how I can avoid that? Is it even possible to do it this way? Similar indexing works on other lines; only this one seems to be a problem.

How to properly tokenize column in pandas?

I am trying to tokenize a dataset of social media comments. I want to tokenize, lemmatize, and remove punctuation and stop words from a pandas column, but I am struggling to do this for each comment. I receive the following error when trying to get tokens:
import pandas as pd
import nltk
...
merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message']), axis=1)
TypeError: expected string or bytes-like object
When I try to tell pandas that I am passing it a string object, I get the following error message:
merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message'].str), axis=1)
AttributeError: 'str' object has no attribute 'str'
What am I doing wrong?
You can use astype to force the column type to string:
merged['Clean_message'] = merged['Clean_message'].astype(str)
If you want to see what is wrong in the original column, you can use
m = merged['Clean_message'].apply(type).ne(str)
out = merged[m]
The out DataFrame contains the rows where the value in the Clean_message column is not a string.
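A small sketch of both steps on made-up data. The sample rows are hypothetical, and str.split() is used as a stand-in tokenizer so the example runs without nltk's tokenizer data; with nltk installed, nltk.tokenize.word_tokenize could be applied to the cleaned column instead. Note that astype(str) alone would turn a missing value into the literal text 'None' or 'nan', so the sketch fills missing values first.

```python
import pandas as pd

# Hypothetical comments column containing a missing value and a number.
merged = pd.DataFrame({"Clean_message": ["hello world", None, 42, "ok"]})

# Diagnostic from the answer: flag rows whose value is not a str.
m = merged["Clean_message"].apply(type).ne(str)
out = merged[m]

# Replace missing values with an empty string, then force everything to str.
merged["Clean_message"] = merged["Clean_message"].fillna("").astype(str)

# Stand-in tokenizer (whitespace split); every value is now a string,
# so a per-value tokenizer no longer hits a non-string object.
merged["message_tokens"] = merged["Clean_message"].str.split()
```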

Pandas apply function is not resolving

Please view the Data Frame by clicking this image
Names=jobs[['Company Name']]
F = lambda x: x.split("\n")
Names.apply(F , axis=1)
AttributeError: 'Series' object has no attribute 'split'
When I run the following code, it works. Why am I facing this issue? I have never run into this kind of problem before. PS: Unlike before, I got this data by scraping websites; I wonder whether that has something to do with it.
Names=jobs[['Company Name']]
F = lambda x: x.str.split("\n")
Names.apply(F , axis=1)
When I try it this way:
Ratings = jobs['Company Name'].apply(lambda x:x.split("\n")[1] , axis=1)
I get this error:
TypeError: <lambda>() got an unexpected keyword argument 'axis'
You do not need apply here; str.split is vectorized:
jobs['Company Name'].str.split('\n')
should do the job.
I cannot tell you exactly why it did not work before, but I suspect it is due to the double brackets in [['Company Name']]. Single brackets collapse the selection to a Series, while double brackets keep the two-dimensional structure of a DataFrame. With axis=1, apply on a DataFrame passes each row to the function as a Series, which has no split method. See e.g. "Python pandas: Keep selected column as DataFrame instead of Series" for more details.
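The single versus double bracket difference can be sketched on made-up data. The company names and the "name, newline, rating" layout are assumptions based on the [1] index in the question, not the actual scraped data.

```python
import pandas as pd

# Hypothetical scraped values: company name and rating separated by a newline.
jobs = pd.DataFrame({"Company Name": ["Acme\n4.1", "Globex\n3.8"]})

as_frame = jobs[["Company Name"]]   # double brackets: still a DataFrame
as_series = jobs["Company Name"]    # single brackets: a Series of strings

# Vectorized split on the Series, no apply needed.
parts = as_series.str.split("\n")
names = parts.str[0]
ratings = parts.str[1]

# Series.apply works element by element and takes no axis argument.
ratings_via_apply = as_series.apply(lambda x: x.split("\n")[1])
```

Series.apply calls the function on each string, which is why the lambda with x.split works there once the axis argument is dropped.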

Holoviz panel will not print pandas dataframe row in Jupyter notebook

I'm trying to recreate the first panel.interact example in the Holoviz tutorial using a Pandas dataframe instead of a Dask dataframe. I get the slider, but the pandas dataframe row does not show.
See the original example at: http://holoviz.org/tutorial/Building_Panels.html
I've tried using Dask as in the Holoviz example. Dask rows print out just fine, which suggests that panel treats Dask dataframe rows differently from pandas dataframe rows when printing. Here's my minimal code:
import pandas as pd
import panel
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
def select_row(rowno=0):
    row = df.loc[rowno]
    return row
panel.extension()
panel.extension('katex')
panel.interact(select_row, rowno=(0, 5))
I've included a line with the katex extension, because without it, I get a warning that it is needed. Without it, I don't even get the slider.
I can call the select_row(rowno=0) function separately in a Jupyter cell and get a nice printout of the row, so it appears the function is working as it should.
Any help in getting this to work would be most appreciated. Thanks.
Got a solution. With pandas, loc[rowno:rowno] returns a pandas.core.frame.DataFrame object of length 1, which works fine with panel, while loc[rowno] returns a pandas.core.series.Series object, which does not work so well. Modifying the select_row() function like this makes it all work:
def select_row(rowno=0):
    row = df.loc[rowno:rowno]
    return row
Still not sure, however, why panel will print the DataFrame object and not the Series object.
Note: if you use iloc, you need to add 1 to the end of the slice, i.e., df.iloc[rowno:rowno+1].
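The difference between these selections can be checked without panel. A small sketch using the df from the question; df.loc[[rowno]] is an additional variant not mentioned above that also returns a one-row DataFrame.

```python
import pandas as pd

l1 = ['a', 'b', 'c', 'd', 'a', 'b']
l2 = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame({'cat': l1, 'val': l2})

rowno = 2

row_series = df.loc[rowno]                 # scalar label: a Series
row_frame = df.loc[rowno:rowno]            # label slice (end inclusive): a DataFrame
row_frame_iloc = df.iloc[rowno:rowno + 1]  # positional slice (end exclusive): a DataFrame
row_frame_list = df.loc[[rowno]]           # list of labels: a DataFrame
```

All three DataFrame variants hold the same single row here because the index is the default 0..5 range, so labels and positions coincide.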

Vectorizing text from data frame column using pandas

I have a Data Frame which looks like this:
I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
After some research and experimenting, I found out that fit_transform returns a scipy.sparse.csr.csr_matrix, which is not callable.
Is there another way to do this?
Thanks!
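There are a number of problems with your code. You probably need something like
allDataVectorized = pd.DataFrame(vectorizerCount.fit_transform(allData['headline_text']).toarray())
allData['headline_text'] (with single brackets) is a Series of strings, which is the iterable of documents that fit_transform expects. With double brackets you would get a DataFrame, and iterating over a DataFrame yields its column names rather than its rows.
fit_transform returns a csr matrix.
.toarray() converts the csr matrix to a dense numpy 2d array, and pd.DataFrame(...) creates a DataFrame from it.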
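A minimal sketch of vectorizing the text column, assuming scikit-learn is installed. The headlines are made up, and the column labels are rebuilt from the fitted vocabulary_ attribute. The column is selected with single brackets so that the vectorizer receives the strings themselves rather than a DataFrame, whose iteration yields column names. For a large corpus, the dense toarray() step may be too memory-hungry, and keeping the sparse matrix is usually preferable.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data in the shape described in the question.
allData = pd.DataFrame({
    "publish_date": [20030219, 20030220, 20030221],
    "headline_text": ["rain hits city", "city council meets", "rain rain returns"],
})

vectorizerCount = CountVectorizer(stop_words="english")

# Fit on the whole column at once; the result is a sparse matrix
# with one row per headline and one column per vocabulary term.
matrix = vectorizerCount.fit_transform(allData["headline_text"])

# vocabulary_ maps each term to its column index; sort terms by that index.
terms = sorted(vectorizerCount.vocabulary_, key=vectorizerCount.vocabulary_.get)

# Dense DataFrame of counts, aligned with the original rows.
allDataVectorized = pd.DataFrame(matrix.toarray(), columns=terms, index=allData.index)
```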