Create a Dask dataframe from a dictionary - pandas

I have a dictionary like this:
d = {'Caps': 'cap_list', 'Term': 'unique_tokens', 'LocalFreq': 'local_freq_list','CorpusFreq': 'corpus_freq_list'}
I want to create a Dask dataframe from it. How do I do it? Normally, in Pandas, it can be easily imported into a Pandas df with:
df = pd.DataFrame({'Caps': cap_list, 'Term': unique_tokens, 'LocalFreq': local_freq_list,
'CorpusFreq': corpus_freq_list})
Should I first load into a bag and then convert from bag to ddf?

If your data fits in memory then I encourage you to use Pandas instead of Dask Dataframe.
If for some reason you still want to use Dask dataframe then I would convert things to a Pandas dataframe and then use the dask.dataframe.from_pandas function.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(...)
ddf = dd.from_pandas(df, npartitions=20)
But there are many cases where this will be slower than just using Pandas well.

Related

Is there a way to speed up the conversion of spark dataframe to pandas dataframe?

I tried to convert a Spark dataframe to pandas in a Databricks notebook with PySpark. It takes forever to run. Is there a better way to do this? There are more than 600,000 rows.
df_PD = sparkDF.toPandas()
Can you try changing your import statement and importing the Pandas API for Spark?
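import pyspark.pandas as ps
psdf = sparkDF.pandas_api() # pandas-on-Spark dataframe (Spark 3.3+; to_pandas_on_spark() in 3.2)
df_PD = psdf.to_pandas() # collect to the driver as a plain pandas dataframe, only if still needed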

How can I convert csv to pickle?

I have some csv files which take a long time to load as dataframes into my workspace. Is there a fast and easy tool to convert them to pickle so they load faster?
After you load the data into a dataframe df using Pandas, use the following:
import pandas as pd
df.to_pickle('/Drive Path/df.pkl') # save the dataframe df to df.pkl
df1 = pd.read_pickle('/Drive Path/df.pkl') # load df.pkl back into a dataframe, df1
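As a self-contained sketch of the whole round trip (CSV to dataframe to pickle and back), something like this should work. The file names and sample columns are made up for illustration, and a temporary directory stands in for the real drive path.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"   # hypothetical CSV file
    pkl_path = Path(tmp) / "data.pkl"   # hypothetical pickle file

    # Create a small CSV so the example has something to read.
    pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 1.5, 2.5]}).to_csv(
        csv_path, index=False
    )

    # One-time conversion: parse the CSV, then save the parsed dataframe.
    df = pd.read_csv(csv_path)
    df.to_pickle(pkl_path)

    # Later loads read the pickle and skip CSV parsing entirely.
    df_restored = pd.read_pickle(pkl_path)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.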

How to write unstack and reindex in dask?

I wrote the script in pandas, but for efficiency I need to switch to Dask, and I am not sure how to implement unstack and reindex in Dask.
This is how my pandas script looks:
df_new = df.groupby(['Cars', 'Date'])['Durations'].mean().unstack(fill_value=0).reindex(columns=list_days,index=list_cars,fill_value=0).\
round().reset_index().fillna(0).round()
Typically, the result of a .groupby() aggregation will be small and fit in memory. As shown in https://docs.dask.org/en/latest/dataframe-best-practices.html#reduce-and-then-use-pandas, you can use Dask for the large aggregation, and then pandas for the small in-memory post-processing.
df_new = (
    df.groupby(['Cars', 'Date'])['Durations'].mean()
    .compute() # turn the Dask Series into a pandas Series
    .unstack(fill_value=0)
    .reindex(columns=list_days, index=list_cars, fill_value=0)
    .round().reset_index().fillna(0).round()
)

MATLAB HDF5 to Dask Dataframe Not Supported Yet?

I am pulling a dataset out of a MATLAB mat file which is of HDF5 format as shown below:
matfile = 'C:\\....\\dataStuff.mat'
f = h5py.File(matfile, 'r')
data = f['/' + stuff + '/data/'].value
df = pd.DataFrame(data) # How do I create a Dask DF instead from data?
How do I do the same thing but instead of using Pandas, I create a Dask Dataframe?
The below function gives me an error:
ddf = dd.read_hdf(matfile, 'key')
the HDF5 class H5T_COMPOUND is not supported yet
I could convert the Pandas DF into a Dask DF as shown below, but I would like to skip this step, which takes another 2 minutes, by pulling the HDF5 data directly into a Dask Dataframe as I did with Pandas.
df = dd.from_pandas(df, npartitions=3) # What I don't want to do

How to generate a pandas dataframe from ordereddict?

How can I generate a pandas dataframe from an OrderedDict?
I have tried using the DataFrame.from_dict method, but that is not giving me the expected dataframe.
What is the best approach to convert an OrderedDict into a list of dicts?
A bug in Pandas did not respect the key ordering of OrderedDict objects converted to a DataFrame via the from_dict call. Fixed in Pandas 0.11.
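On a current pandas version, a minimal sketch of both conversions might look like this. The keys and values are invented for illustration, and the keys are deliberately not in alphabetical order so that the preserved ordering is visible.

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical data: keys are intentionally not alphabetical.
od = OrderedDict([("zeta", [1, 2]), ("alpha", [3, 4]), ("mid", [5, 6])])

# Each key becomes a column, in insertion order.
df = pd.DataFrame.from_dict(od)

# One dict per row, if a list of dicts is what is needed.
records = df.to_dict(orient="records")
```

Here the columns come out as zeta, alpha, mid, matching the insertion order of the OrderedDict.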