I have a dask dataframe that has the columns
['ID', 'PERIOD', 'CURRENCY']
where I created PERIOD as
datetime.datetime.strptime('201901', "%Y%m").date()
When I try to save this dataframe using:
dd.to_hdf('table.h5', key='df', append=True, complib='zlib', format='table', data_columns=True)
I get the error:
TypeError: Cannot serialize the column [PERIOD] because its data contents are [date] object dtype
However, when I save the dataframe to CSV or Parquet I don't see any error. I'm using dask version 2.5.2.
Apparently converting to a Unix timestamp works:
time.mktime(datetime.datetime.strptime('201901', "%Y%m").date().timetuple())
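Another option that avoids Unix timestamps, sketched under the assumption that the dataframe is named df: store PERIOD as a proper datetime64 dtype rather than as Python date objects, since PyTables' 'table' format cannot serialize an object-dtype column of dates.

import dask.dataframe as dd

# PERIOD currently holds Python datetime.date objects (object dtype).
# Converting the column to datetime64[ns] gives PyTables something it
# can serialize.
df['PERIOD'] = dd.to_datetime(df['PERIOD'])

df.to_hdf('table.h5', key='df', append=True, complib='zlib',
          format='table', data_columns=True)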
I am trying to use pyspark.pandas to read an Excel file, and I need to convert the pandas dataframe to a pyspark dataframe.
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True)
pyspark_df = spark.createDataFrame(df)
When I do this, I get the error:
TypeError: Can not infer schema for type:
Even though I tried specifying the dtype for read_excel and defining the schema, I still get the error:
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True, dtype=dtypetest)
pyspark_df = spark.createDataFrame(df, schema)
Could you tell me how to solve this?
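One common cause of "Can not infer schema for type" when converting a dataframe read from Excel is missing cells, which arrive as NaN/NaT inside object columns and cannot be mapped to a Spark type. A minimal sketch, assuming that is what is happening here, is to normalize the missing values to None before the conversion:

import pandas as pd

df = pd.read_excel(filepath, sheet_name="A", skiprows=12,
                   usecols="B:AM", parse_dates=True)

# Replace NaN/NaT with None so Spark's schema inference sees proper nulls
# instead of float('nan') values inside object columns.
df = df.where(pd.notnull(df), None)

pyspark_df = spark.createDataFrame(df)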
I am using an external JAR on my Spark cluster to load a DataFrame. I am getting an error when attempting to convert this DataFrame to a pandas DataFrame.
The code is as follows:
from pyspark.sql import DataFrame

jvm = spark._jvm
gateway = jvm.com.<external.package.classname>
data_w = gateway.loadData()
# wrap the JVM DataFrame returned by the gateway in a Python DataFrame
df = DataFrame(data_w, spark)
pandas_df = df.toPandas()
The Spark dataframe df has valid data. However, I am getting an error in the pandas conversion:
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py", line 67, in toPandas
'RuntimeConfig' object has no attribute 'sessionLocalTimeZone'
The Spark dataframe has 2 date columns. Are there any jconf settings I need to add to my Spark context for the conversion?
I can see at https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/pandas/conversion.html that sessionLocalTimeZone is used. Do we need to set this explicitly on the Spark jconf? I am not setting it locally, and there it works fine.
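One plausible cause, assuming PySpark 2.x internals: DataFrame expects a SQLContext as its second argument, and passing the SparkSession instead makes toPandas() look up timezone settings on a RuntimeConfig, which has no sessionLocalTimeZone attribute. A hedged sketch of the workaround (spark._wrapped is an internal attribute that exposes the session's underlying SQLContext, so this may not exist in newer versions):

from pyspark.sql import DataFrame

# Wrap the JVM DataFrame with the SQLContext behind the session rather
# than the SparkSession itself, so toPandas() resolves its timezone
# lookup against an object that actually carries sessionLocalTimeZone.
df = DataFrame(data_w, spark._wrapped)
pandas_df = df.toPandas()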
I am having issues while trying to convert a pandas dataframe into a Spark dataframe in Azure. I have done it in similar ways before and it worked, but I am not sure why it is not working now. FYI, this was a pivot table which I converted into a pandas dataframe using reset_index, but it is still showing an error. The code I used:
# Convert the pandas dataframe to a Spark dataframe
spark_Forecast_report122 = spark.createDataFrame(df_top_six_dup1.astype(str))
# Write the data into a table
spark_Forecast_report122.write.mode("overwrite").saveAsTable("default.spark_Forecast_report122")
# Read the table back and convert it to pandas
sdff_Forecast_report122 = spark.read.table("spark_Forecast_report122")
Forecast_Price_df122 = sdff_Forecast_report122.toPandas()
display(Forecast_Price_df122)
I am attaching the error as an image: Image_1
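Since the error itself is only visible in the attached image, this is a guess: a frequent failure in exactly this pattern (pivot table, reset_index, then saveAsTable) is column names containing characters that the table schema rejects, such as spaces or commas. A hypothetical sketch that sanitizes the names before writing:

import re

# Hypothetical fix, assuming the failure comes from column names produced
# by the pivot/reset_index: replace anything non-alphanumeric with an
# underscore before creating the Spark dataframe.
df_clean = df_top_six_dup1.copy()
df_clean.columns = [re.sub(r'[^0-9a-zA-Z_]', '_', str(c)) for c in df_clean.columns]

spark_Forecast_report122 = spark.createDataFrame(df_clean.astype(str))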
I am getting a TypeError after converting a pandas dataframe to a numpy array (after using pd.get_dummies, or after creating dummy variables from the dataframe using df.apply) when the columns are of the mixed types int, str, and float.
I do not get these errors when using only the mixed types int and str.
Code:

import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 2, 'b': ['m', 'f'] * 2, 'c': [0.2, 0.1, 0.3, 0.5]})
dfd = pd.get_dummies(df, drop_first=True, dtype=int)
dfd.values
Error: TypeError: '<' not supported between instances of 'str' and 'int'
I get the error with dfd.to_numpy() too.
Even if I convert the dataframe dfd to int or float values using df.astype, dfd.to_numpy() still produces the error. I get the error even when selecting only columns that were not changed from df.
Goal:
I am encoding the categorical features of the dataframe with one-hot encoding and then want to use SelectKBest with score_func=mutual_info_classif to select some features. The error produced after fitting SelectKBest is the same as the error produced by dfd.to_numpy(), so I assume the error occurs when SelectKBest tries to convert the dataframe to numpy.
Besides, just using mutual_info_classif to get scores for the corresponding features works.
How should I debug it? Thanks.
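If this is the version-specific pandas issue it resembles, upgrading pandas may be enough; failing that, one hedged workaround is to build the array column by column, which sidesteps the whole-frame dtype negotiation that .values and .to_numpy() perform:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 2, 'b': ['m', 'f'] * 2, 'c': [0.2, 0.1, 0.3, 0.5]})
dfd = pd.get_dummies(df, drop_first=True, dtype=int)

# Every column of dfd is numeric after get_dummies, so converting each
# column separately and stacking yields a clean float64 array.
arr = np.column_stack([dfd[col].to_numpy() for col in dfd.columns])
print(arr.dtype, arr.shape)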
How can I generate a pandas dataframe from an OrderedDict?
I have tried using the DataFrame.from_dict method, but it is not giving me the expected dataframe.
What is the best approach to convert an OrderedDict into a list of dicts?
A bug in pandas meant that the key ordering of OrderedDict objects was not respected when converting to a DataFrame via the from_dict call; this was fixed in pandas 0.11.
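A minimal sketch of the fixed behavior, assuming pandas >= 0.11:

from collections import OrderedDict

import pandas as pd

od = OrderedDict([('b', [1, 2]), ('a', [3, 4])])

# from_dict now preserves the OrderedDict's key order instead of
# sorting the columns alphabetically.
df = pd.DataFrame.from_dict(od)
print(list(df.columns))   # ['b', 'a']

# If a list of dicts is what you need, go through the dataframe:
rows = df.to_dict('records')   # [{'b': 1, 'a': 3}, {'b': 2, 'a': 4}]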