Y-values in Plotly are unordered strings - plotly-python

--Apologies, this is my first Stack Overflow post--
I am importing data from a .csv file using pandas.
With that data, I am trying to generate a plot using Plotly Express.
When I interrogate the dataframe's dtypes, the column is reported as 'object'.
When I interrogate the type of an individual 'PV' value, it is found to be 'str'.
How do I convert the y values to float so Plotly treats them as numbers?
I was expecting the Y values to be in an ordered array.
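Not part of the original post, but the usual fix is to make the column numeric before plotting. A minimal sketch, assuming the data was read into a dataframe with a string column named 'PV' (the file name and the 'Timestamp' x column are assumptions, not from the post):
import pandas as pd
import plotly.express as px

df = pd.read_csv("data.csv")  # hypothetical file name
# coerce the string column to float; unparseable values become NaN
df["PV"] = pd.to_numeric(df["PV"], errors="coerce")

fig = px.line(df, x="Timestamp", y="PV")  # assumed x column name
fig.show()
Once the dtype is float, Plotly treats the y axis as continuous and orders the values numerically instead of as categorical strings.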

Related

Convert type object column to float

I have a table with a column named "price". This column is of type object: it contains numbers stored as strings, plus NaN and "?" characters. I want to find the mean of this column, but first I have to remove the NaN and "?" values and convert the column to float.
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?
Edit: in pandas 1.3 and earlier, subset must be wrapped in a list/array, e.g. subset=[col]. In version 1.4 and later you can pass a single column name as a string.
You've got a few problems:
df.dropna() takes the axis first and then the subset. The axis says whether to drop rows or columns, and subset says which labels to look at. So you want this to be (I think) df.dropna(axis='rows', subset='price').
Using inplace=True makes the whole call return None, so you have assigned df = None. You don't want that. If you use inplace=True, don't assign the result; the whole line would just be df.dropna(..., inplace=True).
Better yet, don't use inplace=True at all and keep the assignment: df = df.dropna(axis='rows', subset='price'). A corrected version of the full snippet is sketched below.
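Putting those points together, and adding the '?' handling the question describes (my addition, not part of the original answer), a corrected sketch:
import pandas as pd
import numpy as np

df = pd.read_csv('Automobile_data.csv', sep=',')

# treat '?' placeholders as missing, then drop rows with no price
df['price'] = df['price'].replace('?', np.nan)
df = df.dropna(axis='rows', subset=['price'])  # keep the assignment, no inplace=True

df['price'] = df['price'].astype(float)
print(df['price'].mean())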

Polars converters like pandas

Pandas read_csv accepts converters to pre-process each field. This is very useful, especially for int64 validation, mixed date formats, etc. Could you please provide a way to read multiple columns as pl.Utf8 and then cast them to Int64, Float64, Date, etc.?
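For context, the pandas converters pattern the question refers to looks roughly like this (a sketch, not taken from the original thread; the file and column names are made up):
import pandas as pd

# pandas calls the converter on every raw string field of the named column
df = pd.read_csv(
    "data.csv",  # hypothetical file
    converters={"a": lambda s: int(s.lstrip("#"))},
)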
If you need to preprocess a column the way converters do in pandas, you can read that column with the pl.Utf8 dtype and use Polars expressions to process it before casting:
csv = """a,b,c
#12,1,2,
#1,3,4
1,45,5""".encode()
(pl.read_csv(csv, dtypes={"a": pl.Utf8})
.with_column(pl.col("a").str.replace("#", "").cast(pl.Int64))
)
Or, if you want to do the same to multiple columns of that dtype:
csv = """a,b,c,str_col
#12,1#,2foo,
#1,3#,4,bar
1,45#,5,ham""".encode()
pl.read_csv(
file = csv,
).with_columns([
pl.col(pl.Utf8).exclude("str_col").str.replace("#","").cast(pl.Int64),
])
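The same pattern extends to the other dtypes mentioned in the question. A sketch that is not from the original answer (the file contents and column names are invented, and exact argument names may differ slightly between Polars versions):
import polars as pl

csv = """price,when
$1.5,2021-01-01
$2.0,2021-02-01""".encode()

(pl.read_csv(csv, dtypes={"price": pl.Utf8, "when": pl.Utf8})
   .with_columns([
       pl.col("price").str.replace(r"\$", "").cast(pl.Float64),  # strip the currency symbol, then cast
       pl.col("when").str.strptime(pl.Date, "%Y-%m-%d"),         # parse the string into a Date
   ])
)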

TypeError converting from pandas data frame to numpy array

I am getting a TypeError when converting a pandas dataframe to a numpy array (after using pd.get_dummies, or after creating dummy variables from the dataframe with df.apply) if the columns are of mixed types: int, str and float.
I do not get the error if the mix is only int and str.
Code:
import pandas as pd

df = pd.DataFrame({'a': [1, 2]*2, 'b': ['m', 'f']*2, 'c': [0.2, .1, .3, .5]})
dfd = pd.get_dummies(df, drop_first=True, dtype=int)
dfd.values
Error: TypeError: '<' not supported between instances of 'str' and 'int'
I get the error with dfd.to_numpy() too.
Even if I convert dfd to int or float values using df.astype, dfd.to_numpy() still produces the error. I get the error even when selecting only the columns that were not changed from df.
Goal:
I am one-hot encoding the categorical features of the dataframe and then want to use SelectKBest with score_func=mutual_info_classif to select some features. The error produced after fitting SelectKBest is the same as the error produced by dfd.to_numpy(), so I assume the error occurs when SelectKBest tries to convert the dataframe to numpy.
On its own, using mutual_info_classif to get scores for the features works fine.
How should I debug it? Thanks.
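Not an answer from the thread, but one way to narrow down the failure is to try converting the columns one at a time; a hedged debugging sketch, assuming dfd from the snippet above:
import numpy as np

# try converting each column individually to see which ones raise
for col in dfd.columns:
    try:
        np.asarray(dfd[[col]])
    except TypeError as exc:
        print(f"column {col!r} fails: {exc}")
If no single column fails, the problem most likely lies in how the mixed dtypes are combined when the whole frame is converted, rather than in any one column.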

How to box plot a pandas Timestamp series? (Errors with Timestamp type)

I'm using:
Pandas version 0.23.0
Python version 3.6.5
Seaborn version 0.8.1
I'd like a box plot of a column of Timestamp data. My dataframe is not a time series; the index is just an integer, but I have created a column of Timestamp data using:
# create a new column of time stamps corresponding to EVENT_DTM
data['EVENT_DTM_TS'] = pd.to_datetime(data.EVENT_DTM, errors='coerce')
I filter out all NaT values resulting from the coercion:
dt_filtered_time = data[~data.EVENT_DTM_TS.isnull()]
At this point my data looks good, and I can confirm that the type of the EVENT_DTM_TS column is Timestamp with no invalid values.
Finally to generate the single variable box plot I invoke:
ax = sns.boxplot(x=dt_filtered_time.EVENT_DTM_TS)
and get the error:
TypeError: ufunc add cannot use operands with types dtype('M8[ns]') and dtype('M8[ns]')
I've Googled and found:
https://github.com/pandas-dev/pandas/issues/13844
https://github.com/matplotlib/matplotlib/issues/9610
which seemingly indicate issues with data type representations.
I've also seen references to issues with pandas version 0.21.0.
Does anyone have an easy fix suggestion, or do I need to use a different data type to plot the box plot? I'd like to get a single picture of the distribution of the timestamp data.
This is the code I ended up with:
import time
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

@plt.FuncFormatter
def convert_to_date_string(x, pos):
    # x is a Unix timestamp in seconds; render it as YYYY-MM on the axis
    return time.strftime('%Y-%m', time.localtime(x))

plt.figure(figsize=(15, 4))
sns.set(style='whitegrid')
# convert nanosecond datetime64 values to seconds so seaborn can plot them as floats
temp = dt_filtered_time.EVENT_DTM_TS.astype(np.int64) / 1E9
ax = sns.boxplot(x=temp)
ax.xaxis.set_major_formatter(convert_to_date_string)
Here is the result:
Credit goes to ImportanceOfBeingErnest whose comment pointed me towards this solution.

Image in the form of a NumPy array in a cell in a PySpark dataframe

I would like to store an image, represented as a numpy array, in a PySpark dataframe.
When I try this, I get an error saying the data type is not supported.
Looking at the data types supported by PySpark, I don't see numpy, so I'm wondering if there's a way to store the array.
I also tried storing the numpy array as a string, but for some reason the string is truncated and contains "...".
Any suggestions or solutions?
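Not from the original thread, but two common workarounds are to store the pixel values in an ArrayType column (via .tolist()) or to store the raw bytes in a BinaryType column. A minimal sketch, assuming Spark 3.x; the column names and the 4x4 array are purely illustrative:
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, ArrayType, DoubleType, BinaryType
)

spark = SparkSession.builder.getOrCreate()

img = np.random.rand(4, 4)  # stand-in for a real image

schema = StructType([
    StructField("name", StringType()),
    StructField("pixels", ArrayType(DoubleType())),  # flattened pixel values
    StructField("raw", BinaryType()),                # raw bytes; track the shape/dtype separately
])

row = ("img_0", img.flatten().tolist(), bytearray(img.tobytes()))
df = spark.createDataFrame([row], schema=schema)
df.show(truncate=False)

# To reconstruct the array later: np.frombuffer(raw, dtype=np.float64).reshape(4, 4)
Either way, the numpy array is converted to a type Spark understands before it goes into the dataframe, which avoids the "data type not supported" error.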