I wrote the script in pandas, but for efficiency I need to switch to Dask. However, I am not sure how to implement unstack and reindex in Dask.
This is how my pandas script looks:
df_new = (
    df.groupby(['Cars', 'Date'])['Durations'].mean()
      .unstack(fill_value=0)
      .reindex(columns=list_days, index=list_cars, fill_value=0)
      .round().reset_index().fillna(0).round()
)
Typically, the result of a .groupby() aggregation will be small and fit in memory. As shown in https://docs.dask.org/en/latest/dataframe-best-practices.html#reduce-and-then-use-pandas, you can use Dask for the large aggregation, and then pandas for the small in-memory post-processing.
df_new = (
    df.groupby(['Cars', 'Date'])['Durations'].mean()
      .compute()  # turn the Dask result into an in-memory pandas object
      .unstack(fill_value=0)
      .reindex(columns=list_days, index=list_cars, fill_value=0)
      .round().reset_index().fillna(0).round()
)
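For reference, a minimal sketch of how the Dask DataFrame df above might be built in the first place; the file pattern is hypothetical:
import dask.dataframe as dd
# Hypothetical input: many CSV files with Cars, Date and Durations columns
df = dd.read_csv('durations-*.csv', parse_dates=['Date'])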
I have a dictionary like this:
d = {'Caps': 'cap_list', 'Term': 'unique_tokens', 'LocalFreq': 'local_freq_list','CorpusFreq': 'corpus_freq_list'}
I want to create a Dask DataFrame from it. How do I do it? Normally, in pandas, it can easily be turned into a pandas df by:
df = pd.DataFrame({'Caps': cap_list, 'Term': unique_tokens, 'LocalFreq': local_freq_list,
'CorpusFreq': corpus_freq_list})
Should I first load into a bag and then convert from bag to ddf?
If your data fits in memory then I encourage you to use Pandas instead of Dask Dataframe.
If for some reason you still want to use Dask dataframe then I would convert things to a Pandas dataframe and then use the dask.dataframe.from_pandas function.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(...)
ddf = dd.from_pandas(df, npartitions=20)
But there are many cases where this will be slower than just using Pandas well.
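Applied to the dictionary in the question, a minimal sketch might look like this (the list variables are the ones from the question; npartitions=4 is an arbitrary choice):
import dask.dataframe as dd
import pandas as pd
# Build the small pandas DataFrame first, from the lists in the question
pdf = pd.DataFrame({'Caps': cap_list, 'Term': unique_tokens,
                    'LocalFreq': local_freq_list, 'CorpusFreq': corpus_freq_list})
# Then wrap it in a Dask DataFrame; the partition count is arbitrary here
ddf = dd.from_pandas(pdf, npartitions=4)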
I am running the same functionality using Pandas API and Dask API.
I expected Dask API to be faster, but it is not.
Functionality
I cross join two dataframes (pandas and Dask respectively) on a 'grouping' column, and then, for every pair, I compute the Levenshtein distance between two strings.
The results are as expected, but I am concerned about the performance.
Pandas
@timeit
def pd_fuzzy_comparison(df1: DF, df2: DF, bubble_col: str, compare):
    df = df1.merge(df2, on=bubble_col, suffixes=('_internal', '_external'))
    df['score'] = df.apply(lambda d:
        compare(d.company_norm_internal, d.company_norm_external), axis=1)
    return df
Dask
@timeit
def dd_fuzzy_comparison(dd1: dd.DataFrame, dd2: dd.DataFrame, bubble_col: str, compare):
    ddf = dd1.merge(dd2, on=bubble_col, suffixes=('_internal', '_external'))
    ddf['score'] = ddf.apply(
        lambda d: compare(d.company_norm_internal, d.company_norm_external), axis=1)
    return ddf.compute()
Main
import multiprocessing
CORES = multiprocessing.cpu_count()
results = pd_fuzzy_comparison(
    df1=internal_bubbles.copy(),
    df2=external_bubbles.copy(),
    bubble_col='city_norm',
    compare=ratio)

ddata1 = dd.from_pandas(internal_bubbles.copy(), npartitions=CORES)
ddata2 = dd.from_pandas(external_bubbles.copy(), npartitions=CORES)

ddresults = dd_fuzzy_comparison(
    dd1=ddata1.copy(), dd2=ddata2.copy(),
    bubble_col='city_norm',
    compare=ratio)
Output
'pd_fuzzy_comparison' 1122.39 ms
'dd_fuzzy_comparison' 1717.83 ms
What am I missing?
Thanks!
First, Dask isn't always faster than Pandas. If Pandas works for you, you should stick with it.
https://docs.dask.org/en/latest/best-practices.html#start-small
In your particular case you're using the df.apply method, which uses Python for loops, which will be slowish in any case. It is also GIL bound, so you will want to choose a scheduler that uses processes, like the dask.distributed or multiprocessing schedulers.
https://docs.dask.org/en/latest/scheduling.html
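For example, a minimal sketch of selecting a process-based scheduler; ddf stands in for the merged Dask DataFrame from the question, and whether this helps depends on how expensive the comparison function is:
import dask
from dask.distributed import Client
# Option 1: use the multiprocessing scheduler for a single call
result = ddf.compute(scheduler='processes')
# Option 2: make process-based scheduling the default for this session
dask.config.set(scheduler='processes')
# Option 3: start a local distributed cluster (worker processes, one per core by default)
client = Client()
result = ddf.compute()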
I need to create a dask DataFrame from a set of dask Series,
analogously to constructing a pandas DataFrame from lists
pd.DataFrame({'l1': list1, 'l2': list2})
I am not seeing anything in the API. The dask DataFrame constructor is not supposed to be called by users directly and takes a computation graph as its main argument.
In general I agree that it would be nice for the dd.DataFrame constructor to behave like the pd.DataFrame constructor.
If your series have well defined divisions then you might try dask.dataframe.concat with axis=1.
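A minimal sketch of that route, assuming the Series have known, matching divisions (the toy data and names here are placeholders):
import dask.dataframe as dd
import pandas as pd
# Two toy Dask Series sharing the same index, so their divisions line up
s1 = dd.from_pandas(pd.Series([1, 2, 3], name='l1'), npartitions=2)
s2 = dd.from_pandas(pd.Series([4, 5, 6], name='l2'), npartitions=2)
# Align them column-wise into a single Dask DataFrame
df = dd.concat([s1, s2], axis=1)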
You could also try converting one of the series into a DataFrame and then use assignment syntax:
L = ...  # list of series
df = L[0].to_frame()
for s in L[1:]:
    df[s.name] = s
I have a Spark DataFrame parquet file that can be read by Spark as follows:
df = sqlContext.read.parquet('path_to/example.parquet')
df.registerTempTable('temp_table')
I want to slice my dataframe, df, by row (i.e. the equivalent of df.iloc[0:4000], df.iloc[4000:8000], etc. on a pandas dataframe), since I want to convert each small chunk to a pandas dataframe and work on it later. I only know how to sample a random fraction, i.e.
df_sample = df.sample(False, fraction=0.1) # sample 10 % of my data
df_pandas = df_sample.toPandas()
It would be great if there were a method to slice my dataframe df by row. Thanks in advance.
You can use monotonically_increasing_id() to add an ID column to your dataframe and use that to get a working set of any size.
import pyspark.sql.functions as f
# add an index column
df = df.withColumn('id', f.monotonically_increasing_id())
# Sort by index and get first 4000 rows
working_set = df.sort('id').limit(4000)
Then, you can remove the working set from your dataframe using subtract().
# Remove the working set, and use this `df` to get the next working set
df = df.subtract(working_set)
Rinse and repeat until you're done processing all rows. It is not the ideal way to do things, but it works. Alternatively, consider filtering your Spark dataframe down before converting it to pandas.
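Putting the pieces together, a rough sketch of the whole chunking loop; the chunk size of 4000 matches the question, and repeatedly calling count() is expensive, so treat this as an illustration rather than a tuned solution:
import pyspark.sql.functions as f
chunk_size = 4000
# Add the id column once and cache, so the ids stay stable across iterations
df = df.withColumn('id', f.monotonically_increasing_id()).cache()
pandas_chunks = []  # illustrative: collect each chunk as a pandas DataFrame
while df.count() > 0:
    working_set = df.sort('id').limit(chunk_size)
    pandas_chunks.append(working_set.toPandas())  # hand the chunk to pandas
    df = df.subtract(working_set)  # drop the processed rows and continue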
How can I generate a pandas DataFrame from an OrderedDict?
I have tried using the DataFrame.from_dict method, but that is not giving me the expected dataframe.
What is the best approach to convert an OrderedDict into a list of dicts?
This was caused by a bug in pandas that did not respect the key ordering of OrderedDict objects when they were converted to a DataFrame via the from_dict call. It was fixed in pandas 0.11.
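On pandas 0.11 or later, the column order follows the OrderedDict keys; a quick sketch with made-up values:
from collections import OrderedDict
import pandas as pd
d = OrderedDict([('b', [1, 2]), ('a', [3, 4]), ('c', [5, 6])])
df = pd.DataFrame.from_dict(d)
print(df.columns.tolist())  # ['b', 'a', 'c']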