How to properly tokenize a column in pandas?

I am trying to solve a tokenization problem in my dataset of social media comments. I want to tokenize, lemmatize, and remove punctuation and stop words from a pandas column. I am struggling with how to do this for each comment. I get the following error when trying to get the tokens:
import pandas as pd
import nltk
...
merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message']), axis=1)
TypeError: expected string or bytes-like object
When I try to tell pandas that I am passing it a string object, I get the following error message:
merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message'].str), axis=1)
AttributeError: 'str' object has no attribute 'str'
What am I doing wrong?

You can use astype to force the column type to string:
merged['Clean_message'] = merged['Clean_message'].astype(str)
If you want to see what is wrong in the original column, you can use:
m = merged['Clean_message'].apply(type).ne(str)
out = merged[m]
The out dataframe contains the rows where the value in the Clean_message column is not a string.
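Here is a minimal sketch of that check on made-up data, assuming the non-string values are missing entries (NaN). The sample frame and the regex-based simple_tokenize function are hypothetical stand-ins; the latter replaces nltk.tokenize.word_tokenize only so that the snippet runs without downloading NLTK data. Note that astype(str) turns NaN into the literal string 'nan', so filling missing values with an empty string first may be preferable.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data: one row holds NaN (a float), not a string.
merged = pd.DataFrame({"Clean_message": ["Hello world!", np.nan, "Nice post, thanks"]})

# Rows whose value is not a Python str; these are what break the tokenizer.
not_str = merged["Clean_message"].apply(type).ne(str)
bad_rows = merged[not_str]


def simple_tokenize(text):
    # Stand-in for nltk.tokenize.word_tokenize: lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())


# Fill missing values with "" before casting, so NaN does not become "nan".
cleaned = merged["Clean_message"].fillna("").astype(str)
merged["message_tokens"] = cleaned.apply(simple_tokenize)
```

Applying the function to the column itself (rather than merged.apply(..., axis=1)) passes one string at a time, which is what the tokenizer expects.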

Related

Can not infer schema for type when converting pandas dataframe to pyspark dataframe

I am trying to use pyspark.pandas to read an Excel file, and I need to convert the pandas dataframe to a pyspark dataframe.
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True)
pyspark_df = spark.createDataFrame(df)
When I do this, I get the error
TypeError: Can not infer schema for type:
I still get the error even though I tried to specify the dtype for read_excel and to define the schema.
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True, dtype=dtypetest)
pyspark_df = spark.createDataFrame(df, schema)
Could you tell me how to solve it?

TypeError converting from pandas data frame to numpy array

I am getting a TypeError after converting a pandas dataframe to a numpy array (after using pd.get_dummies, or after creating dummy variables from the dataframe with df.apply) if the columns are of mixed types int, str and float.
I do not get the error if the columns are only of mixed types int and str.
Code:
df = pd.DataFrame({'a':[1,2]*2, 'b':['m','f']*2, 'c':[0.2, .1, .3, .5]})
dfd = pd.get_dummies(df, drop_first=True, dtype=int)
dfd.values
Error: TypeError: '<' not supported between instances of 'str' and 'int'
I get the same error with dfd.to_numpy().
Even if I convert the dataframe dfd to int or float values using astype, dfd.to_numpy() still produces the error. I get the error even when I select only the columns that were not changed from df.
Goal:
I am one-hot encoding the categorical features of the dataframe and then want to use SelectKBest with score_func=mutual_info_classif to select some features. The error raised when fitting SelectKBest is the same as the one raised by dfd.to_numpy(), so I assume it occurs when SelectKBest tries to convert the dataframe to a numpy array.
Besides, calling mutual_info_classif directly to get the scores for the features works.
How should I debug it? Thanks.
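One possible way to start debugging, sketched under the assumption that the problem lies in the data rather than in SelectKBest: check the dtype of every column and the Python type of every column label before converting. The snippet reuses the frame from the question; the names non_numeric and label_types are only illustrative. If the conversion fails in your environment, the dtypes or label types printed here are the first thing to compare.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 2, 'b': ['m', 'f'] * 2, 'c': [0.2, .1, .3, .5]})
dfd = pd.get_dummies(df, drop_first=True, dtype=int)

# Any column listed here is not numeric, i.e. a string column survived the encoding.
non_numeric = dfd.select_dtypes(exclude="number").columns.tolist()

# Column labels of mixed types (str and int names) cannot be compared with '<'.
label_types = {type(col).__name__ for col in dfd.columns}

print(dfd.dtypes)
print("non-numeric columns:", non_numeric)
print("column label types:", label_types)

arr = dfd.to_numpy()
print(arr.dtype, arr.shape)
```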

How do I solve this error in my code? TypeError: 'NoneType' object is not subscriptable

How do I eliminate the error below from the lines of code outlined here?
This is the code:
#import the libraries.
import streamlit as st
import pandas as pd
from PIL import Image
#Display the closing price.
st.header(company_name+" Close Price\n")
st.line_chart(df['Close'])
This is the error I am getting:
TypeError: 'NoneType' object is not subscriptable
If your DataFrame contains something like:
    Id   Open  Close
0  AAA  12.15  13.22
1  BBB  24.11  25.11
then df['Close'] retrieves that column, and the result is:
0    13.22
1    25.11
Name: Close, dtype: float64
(the left column contains the index and the right column the values from this column).
But if you run df = None first, then df['Close'] raises exactly the error you described.
So the probable cause is that your code somehow assigned None to df. Maybe you tried to read df from some source and that step ended up assigning None to df.
Note that to get this error, the variable df must exist. Otherwise you would get a different error, namely:
NameError: name 'df' is not defined
How to cope with this: make sure that df contains an actual DataFrame with the "wanted" column.
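A small guard along those lines might look like the sketch below. The helper name get_close and the sample data are hypothetical, not part of the question's code; the idea is only to fail early with a readable message instead of the NoneType error.

```python
import pandas as pd


def get_close(df):
    # Hypothetical helper: check df before subscripting it.
    if df is None:
        raise ValueError("df is None; check the step that was supposed to load it")
    if "Close" not in df.columns:
        raise KeyError("df has no 'Close' column")
    return df["Close"]


# Sample frame matching the example above.
df = pd.DataFrame({"Id": ["AAA", "BBB"], "Open": [12.15, 24.11], "Close": [13.22, 25.11]})
close = get_close(df)
print(close)
```

In the Streamlit script, the result could then be passed on as st.line_chart(get_close(df)).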

koalas Column assignment doesn't support type ndarray

All - I am trying to add a new column to an existing koalas dataframe, but it fails with the error above. The value I am assigning is a numpy array. Am I missing something? This works well with pandas.
import databricks.koalas as ks
from sklearn.datasets import load_iris
iris = load_iris()
df = ks.DataFrame(data=iris.data, columns=iris.feature_names)
# works so far!!
df["target"] = iris.target ## this errors out!
TypeError: Column assignment doesn't support type ndarray
Am I missing anything here?
thanks.
Unfortunately, even df.assign did not solve the problem; I got the same error.
I had to do this:
# allow operations that combine two different koalas objects
ks.set_option('compute.ops_on_diff_frames', True)
# convert target to a koalas series so that it can be assigned to the dataframe as a column
ks_series = ks.Series(iris.target)
df["target"] = ks_series
# restore the default setting
ks.reset_option('compute.ops_on_diff_frames')
My bad:
I misread where and what the issue was. Try the following:
...
df.assign(target=iris.target)
Could you try the following:
...
df = ks.DataFrame(data=iris.data, columns=list(iris.feature_names))
...
Looking at the load_iris documentation, they make a note of converting the returned array into a list.

AttributeError: 'DataFrame' object has no attribute 'DataFrame'

I get the following error when I try to convert the output of a SQL query into a dataframe in a Jupyter notebook. I have already checked other posts on similar topics, but this is a different error. Can someone please explain why this is happening?
Code:
import pandas as pd
k = %sql select * from table1
df = k.DataFrame()
Error: AttributeError: 'DataFrame' object has no attribute 'DataFrame'
k is already a DataFrame object.
Always check with
type(object)
type(k)
This tells you what type of object it is. Based on that, you can convert it further as required.
Just a suggestion: going forward, if you want to convert a variable into a DataFrame, use pd.DataFrame.
In your case that would be df = pd.DataFrame(k), if k were not already a DataFrame object.
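The type check and the conversion can be combined, as in the sketch below. The helper name to_dataframe is hypothetical, and the small frame k only stands in for the result of the %sql query.

```python
import pandas as pd


def to_dataframe(obj):
    # Hypothetical helper: return DataFrames unchanged, wrap anything else.
    if isinstance(obj, pd.DataFrame):
        return obj
    return pd.DataFrame(obj)


k = pd.DataFrame({"id": [1, 2]})  # stands in for the %sql result
df = to_dataframe(k)  # already a DataFrame, returned as-is
rows = to_dataframe([{"id": 1}, {"id": 2}])  # a list of dicts gets converted
print(type(df), type(rows))
```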