I have a dask dataframe that has the columns
['ID', 'PERIOD', 'CURRENCY']
where I created PERIOD as
datetime.datetime.strptime('201901', "%Y%m").date()
When I try to save this dataframe using:
dd.to_hdf('table.h5', key='df', append=True, complib='zlib', format='table', data_columns=True)
I get the error:
TypeError: Cannot serialize the column [PERIOD] because its data contents are [date] object dtype
However, when I save the dataframe to CSV or Parquet I don't see any error. I'm using dask version 2.5.2.
Apparently converting to a Unix timestamp works:
time.mktime(datetime.datetime.strptime('201901', "%Y%m").date().timetuple())
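Another option that avoids Unix timestamps, sketched under the assumption that the dataframe is named df: store PERIOD as a proper datetime64 dtype rather than as Python date objects, since PyTables' 'table' format cannot serialize an object-dtype column of dates.

import dask.dataframe as dd

# PERIOD currently holds Python datetime.date objects (object dtype).
# Converting the column to datetime64[ns] gives PyTables something it
# can serialize.
df['PERIOD'] = dd.to_datetime(df['PERIOD'])

df.to_hdf('table.h5', key='df', append=True, complib='zlib',
          format='table', data_columns=True)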
I am trying to use pyspark.pandas to read an Excel file, and I need to convert the pandas dataframe to a pyspark dataframe.
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True)
pyspark_df = spark.createDataFrame(df)
When I do this, I get the error:
TypeError: Can not infer schema for type:
Even though I tried specifying the dtype for read_excel and defining the schema, I still get the error:
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True, dtype=dtypetest)
pyspark_df = spark.createDataFrame(df, schema)
Could you tell me how to solve this?
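One common cause of "Can not infer schema for type" when converting a dataframe read from Excel is missing cells, which arrive as NaN/NaT inside object columns and cannot be mapped to a Spark type. A minimal sketch, assuming that is what is happening here, is to normalize the missing values to None before the conversion:

import pandas as pd

df = pd.read_excel(filepath, sheet_name="A", skiprows=12,
                   usecols="B:AM", parse_dates=True)

# Replace NaN/NaT with None so Spark's schema inference sees proper nulls
# instead of float('nan') values inside object columns.
df = df.where(pd.notnull(df), None)

pyspark_df = spark.createDataFrame(df)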
I am using an external JAR on my Spark cluster to load a DataFrame. I am getting an error when attempting to convert this DataFrame to a pandas DataFrame.
The code is as follows:
from pyspark.sql import DataFrame

jvm = spark._jvm
gateway = jvm.com.<external.package.classname>
data_w = gateway.loadData()
# wrap the JVM DataFrame returned by the gateway in a Python DataFrame
df = DataFrame(data_w, spark)
pandas_df = df.toPandas()
The Spark dataframe df has valid data. However, I am getting an error in the pandas conversion:
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py", line 67, in toPandas
'RuntimeConfig' object has no attribute 'sessionLocalTimeZone'
The Spark dataframe has 2 date columns. Are there any jconf settings I need to add to my Spark context for the conversion?
I can see at https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/pandas/conversion.html that sessionLocalTimeZone is used. Do we need to set this explicitly on the Spark jconf? I am not setting it locally, and there it works fine.
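One plausible cause, assuming PySpark 2.x internals: DataFrame expects a SQLContext as its second argument, and passing the SparkSession instead makes toPandas() look up timezone settings on a RuntimeConfig, which has no sessionLocalTimeZone attribute. A hedged sketch of the workaround (spark._wrapped is an internal attribute that exposes the session's underlying SQLContext, so this may not exist in newer versions):

from pyspark.sql import DataFrame

# Wrap the JVM DataFrame with the SQLContext behind the session rather
# than the SparkSession itself, so toPandas() resolves its timezone
# lookup against an object that actually carries sessionLocalTimeZone.
df = DataFrame(data_w, spark._wrapped)
pandas_df = df.toPandas()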
I am having issues while trying to convert a pandas dataframe into a Spark dataframe in Azure. I have done it in similar ways before and it worked, but I am not sure why it is not working now. FYI, this was a pivot table which I converted into a pandas dataframe using reset_index, but it is still showing an error. The code I used:
# Convert the pandas dataframe to a Spark dataframe
spark_Forecast_report122 = spark.createDataFrame(df_top_six_dup1.astype(str))
# Write the data into a table
spark_Forecast_report122.write.mode("overwrite").saveAsTable("default.spark_Forecast_report122")
# Read the table back and convert it to pandas
sdff_Forecast_report122 = spark.read.table("spark_Forecast_report122")
Forecast_Price_df122 = sdff_Forecast_report122.toPandas()
display(Forecast_Price_df122)
I am attaching the error as an image: Image_1
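Since the error itself is only visible in the attached image, this is a guess: a frequent failure in exactly this pattern (pivot table, reset_index, then saveAsTable) is column names containing characters that the table schema rejects, such as spaces or commas. A hypothetical sketch that sanitizes the names before writing:

import re

# Hypothetical fix, assuming the failure comes from column names produced
# by the pivot/reset_index: replace anything non-alphanumeric with an
# underscore before creating the Spark dataframe.
df_clean = df_top_six_dup1.copy()
df_clean.columns = [re.sub(r'[^0-9a-zA-Z_]', '_', str(c)) for c in df_clean.columns]

spark_Forecast_report122 = spark.createDataFrame(df_clean.astype(str))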
I am getting a TypeError after converting a pandas dataframe to a numpy array (after using pd.get_dummies, or after creating dummy variables from the dataframe using df.apply) when the columns are of the mixed types int, str, and float.
I do not get these errors when using only the mixed types int and str.
Code:

import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 2, 'b': ['m', 'f'] * 2, 'c': [0.2, 0.1, 0.3, 0.5]})
dfd = pd.get_dummies(df, drop_first=True, dtype=int)
dfd.values
Error: TypeError: '<' not supported between instances of 'str' and 'int'
I get the error with dfd.to_numpy() too.
Even if I convert the dataframe dfd to int or float values using df.astype, dfd.to_numpy() still produces the error. I get the error even when selecting only columns that were not changed from df.
Goal:
I am encoding the categorical features of the dataframe with one-hot encoding and then want to use SelectKBest with score_func=mutual_info_classif to select some features. The error produced after fitting SelectKBest is the same as the error produced by dfd.to_numpy(), so I assume the error occurs when SelectKBest tries to convert the dataframe to numpy.
Besides, just using mutual_info_classif to get scores for the corresponding features works.
How should I debug it? Thanks.
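If this is the version-specific pandas issue it resembles, upgrading pandas may be enough; failing that, one hedged workaround is to build the array column by column, which sidesteps the whole-frame dtype negotiation that .values and .to_numpy() perform:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 2, 'b': ['m', 'f'] * 2, 'c': [0.2, 0.1, 0.3, 0.5]})
dfd = pd.get_dummies(df, drop_first=True, dtype=int)

# Every column of dfd is numeric after get_dummies, so converting each
# column separately and stacking yields a clean float64 array.
arr = np.column_stack([dfd[col].to_numpy() for col in dfd.columns])
print(arr.dtype, arr.shape)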
How can I generate a pandas dataframe from an OrderedDict?
I have tried using the DataFrame.from_dict method, but it is not giving me the expected dataframe.
What is the best approach to convert an OrderedDict into a list of dicts?
A bug in pandas meant that the key ordering of OrderedDict objects was not respected when converting to a DataFrame via the from_dict call; this was fixed in pandas 0.11.
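A minimal sketch of the fixed behavior, assuming pandas >= 0.11:

from collections import OrderedDict

import pandas as pd

od = OrderedDict([('b', [1, 2]), ('a', [3, 4])])

# from_dict now preserves the OrderedDict's key order instead of
# sorting the columns alphabetically.
df = pd.DataFrame.from_dict(od)
print(list(df.columns))   # ['b', 'a']

# If a list of dicts is what you need, go through the dataframe:
rows = df.to_dict('records')   # [{'b': 1, 'a': 3}, {'b': 2, 'a': 4}]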