TypeError converting from pandas data frame to numpy array - pandas

I get a TypeError when converting a pandas DataFrame to a NumPy array (after using pd.get_dummies, or after creating dummy variables with df.apply) if the columns are of mixed types int, str and float.
I do not get the error when the columns are of mixed types int and str only.
code:
import pandas as pd
df = pd.DataFrame({'a':[1,2]*2, 'b':['m','f']*2, 'c':[0.2, .1, .3, .5]})
dfd = pd.get_dummies(df, drop_first=True, dtype=int)
dfd.values
Error: TypeError: '<' not supported between instances of 'str' and 'int'
dfd.to_numpy() raises the same error.
Even if I convert the dataframe dfd to int or float values using df.astype,
dfd.to_numpy() still raises the error. I get the error even when I select only the columns that were carried over unchanged from df.
Goal:
I am one-hot encoding the categorical features of the dataframe and then want to use SelectKBest with score_func=mutual_info_classif to select some features. Fitting SelectKBest produces the same error as dfd.to_numpy(), so I assume the error occurs when SelectKBest tries to convert the dataframe to a NumPy array.
However, calling mutual_info_classif directly to get the scores for the corresponding features works.
How should I debug it? Thanks.
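One way to start debugging is to look at what the frame actually contains before converting it. The sketch below reuses the example from the question, prints the column dtypes and the Python types of the column labels, and then converts with an explicit float cast. It does not assume a particular cause for the error; the checks are only a starting point, and the names label_types and arr are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 2, 'b': ['m', 'f'] * 2, 'c': [0.2, .1, .3, .5]})
dfd = pd.get_dummies(df, drop_first=True, dtype=int)

# Every column should be numeric after encoding; an 'object' dtype here
# would point at leftover strings in the data.
print(dfd.dtypes)

# Column labels can also be of mixed types (for example str and int),
# so it is worth inspecting them as well.
label_types = {type(label).__name__ for label in dfd.columns}
print(label_types)

# Cast to a single dtype explicitly, then convert.
arr = dfd.astype(float).to_numpy()
print(arr.dtype, arr.shape)
```

If this snippet runs cleanly but the real data still fails, comparing the printed dtypes and label types between the two frames may narrow down where they differ.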
pandas converting to numpy error for mixed types

Related

Can not infer schema for type when converting pandas dataframe to pyspark dataframe

I am trying to use pyspark.pandas to read excel and I need to convert the pandas dataframe to pyspark dataframe.
df = pandas.read_excel(filepath,sheet_name="A", skiprows=12 ,usecols="B:AM",parse_dates=True)
pyspark_df= spark.createDataFrame(df)
When I do this, I get the error:
TypeError: Can not infer schema for type:
Even when I specify the dtype for read_excel and define the schema, I still get the error.
df = pandas.read_excel(filepath,sheet_name="A", skiprows=12 ,usecols="B:AM",parse_dates=True,dtype= dtypetest)
pyspark_df= spark.createDataFrame(df,schema)
Would you tell me how to solve it?

Y-values in Plotly are unordered strings

--Apologies, this is my first Stack Overflow post--
I am importing data from a .csv file using pandas.
With that data, I am trying to generate a plot using Plotly Express.
When I inspect the dataframe's dtypes, they are reported as 'object'.
When I inspect the datatype of the 'PV' values, it is 'str'.
How do I convert the Plotly y values to a float datatype?
I was expecting the y values to be in an ordered array.
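A common approach is to convert the column to numbers in pandas before handing it to Plotly, since string values are treated as categories rather than as a numeric axis. The sketch below is a minimal illustration with made-up CSV content; only the column name 'PV' comes from the question, while the 'time' column, the sample values and the commented-out plotting call are assumptions.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file. The 'error' entry
# forces pandas to read the whole 'PV' column as text.
csv_text = "time,PV\n0,10.5\n1,error\n2,3.25\n3,7.0\n"
df = pd.read_csv(io.StringIO(csv_text))

# Convert the text to numbers; entries that cannot be parsed become NaN
# instead of raising an exception.
df["PV"] = pd.to_numeric(df["PV"], errors="coerce")
print(df["PV"])

# With numeric values, a call such as the following (assumed, not taken from
# the question) would draw a continuous, ordered y axis:
# import plotly.express as px
# px.line(df, x="time", y="PV").show()
```

Using errors="coerce" is a design choice: it keeps the conversion from failing on stray text, at the cost of silently producing NaN, so it is worth counting the missing values afterwards.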

How to properly tokenize column in pandas?

I am trying to solve a tokenization problem in my dataset of social media comments. I want to tokenize, lemmatize, and remove punctuation and stop words from a pandas column, but I am struggling to do this for each comment. I receive the following error when trying to get tokens:
import pandas as pd
import nltk
...
merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message']), axis=1)
TypeError: expected string or bytes-like object
When I am trying to tell pandas that I am passing it a string object, it gives me the following error message:
merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message'].str), axis=1)
AttributeError: 'str' object has no attribute 'str'
What am I doing wrong?
You can use astype to force the column type to string:
merged['Clean_message'] = merged['Clean_message'].astype(str)
If you want to see what is wrong in the original column, you can use:
m = merged['Clean_message'].apply(type).ne(str)
out = merged[m]
The out dataframe contains the rows where the value in the Clean_message column is not a string.
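As an illustration of the mask, here is a minimal sketch on a made-up frame. The sample rows are hypothetical, and str.split is used as a simple stand-in for nltk.tokenize.word_tokenize so the example runs without NLTK data. It filters out the non-string rows instead of casting them, because depending on the pandas version astype(str) can turn missing values into literal text such as 'nan', which would then be tokenized like any other word.

```python
import pandas as pd

# Hypothetical stand-in for the 'merged' frame from the question.
merged = pd.DataFrame({"Clean_message": ["hello world", None, 42, "good day"]})

# True for rows whose value is not a Python str.
m = merged["Clean_message"].apply(type).ne(str)
out = merged[m]
print(out)

# Keep only the real strings, then tokenize them.
# str.split is a placeholder for nltk.tokenize.word_tokenize.
clean = merged.loc[~m].copy()
clean["message_tokens"] = clean["Clean_message"].apply(lambda s: s.split())
print(clean)
```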

Writing data frame with object dtype to HDF5 only works after converting to string

I have a large dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way to cope with this is.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads to the following:
Filling the column with numbers only gives dtype int32, and the data is stored without problems.
Setting one element to abc changes the dtype to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str succeeds, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and whether there is a way to prevent it. The only way I found was to go through all columns, check whether they are dtype('O') and explicitly convert them to str.
Instead of using HDF5, I have found a generic pickling library which seems to be perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")
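If HDF5 is still preferred, the workaround described in the question (checking each column for dtype('O') and converting it to str) can be sketched as follows. The helper name stringify_object_columns and the sample data are hypothetical, and the to_hdf call is left commented out because it needs the optional PyTables dependency.

```python
import pandas as pd


def stringify_object_columns(df):
    """Return a copy of df with every object-dtype column cast to str."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].astype(str)
    return out


# A column mixing integers and text ends up with dtype object.
df = pd.DataFrame({"a": [12345678, "abc", 87654321], "b": [1.0, 2.0, 3.0]})
safe = stringify_object_columns(df)
print(safe.dtypes)

# With PyTables installed, the cleaned frame can be written as in the question:
# safe.to_hdf("df.hdf5", key="data", format="table")
```

Note that the numbers in such a column come back as strings after loading, so this trades type fidelity for a file that can be written.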

I got a TypeError: only length-1 arrays can be converted to Python scalars

I got the error message below. I found some related questions on Stack Overflow and tried their solutions, but they didn't work.
import numpy as np
R=0.9999 #Reflectivity
a=np.arange(0,100000,1,dtype=np.complex)
b=R**(a)
c=np.exp(np.complex(0,a))
Error:
c=np.exp(np.complex(0,a))
TypeError: only length-1 arrays can be converted to Python scalars
The error is in np.complex(0,a). It doesn't expect 'a' to be an array. Compare with:
c = np.exp(np.complex(0,a[0])).
Since a is already an array with complex numbers, can't you directly calculate it? (Although this will result in inf+0.j given the size of the exponent)
c = np.exp(a)
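If the intent of np.complex(0, a) was to build the purely imaginary array 0 + a*j, multiplying by 1j does that elementwise. This is an assumption about what the asker wanted, not something stated in the question. The np.complex alias was also removed in NumPy 1.24, so the sketch below avoids it and starts from a float array.

```python
import numpy as np

R = 0.9999  # Reflectivity
a = np.arange(0, 100000, 1, dtype=float)
b = R ** a

# 1j * a is the array 0 + a*j, so this computes exp(i*a) for every element.
c = np.exp(1j * a)
print(c[:3])
```

Because the exponent is purely imaginary here, every element of c has magnitude 1, so the overflow to inf mentioned above does not occur.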