Filtering on the index in a dask dataframe is no faster than filtering on any other column

I'm loading a dask dataframe from a parquet collection of around 1300 parquet files, 250 million entries in total. The dataset is a time series and has a datetime index.
In general I'm pretty happy with query performance considering the size of the dataset (approx. 6-7 min).
One thing I don't understand is why filtering on the index is no faster than filtering on any other column.
What I mean is that whether I take a slice of the ddf on the timestamp (index) or a slice based on any other field, the time it takes to compute the dataframe is basically identical.
Am I doing something wrong? I would have imagined that filtering on the index would be much faster.
I tried setting the index a second time, but it didn't help.
I've checked the suggestions in this post: https://www.coiled.io/blog/dask-set-index-dataframe
import dask
import dask.dataframe as dd

# 'processes' is the current spelling of the old dask.multiprocessing.get scheduler
dask.config.set(scheduler='processes')
ddf = dd.read_parquet(
    data_lake + asset_class,
    engine='pyarrow',
    columns=['Close', 'Open', 'Symbol', 'TradeHigh', 'TradeLow',
             'CloseCumulativeVolume', 'CloseDate', 'CloseCumulativeValue', 'TradeCount'],
)
q1_21 = ddf['2021-01-01':'2021-04-01'].compute()
# Wall time: 2min 59s
df = ddf.loc[ddf['Symbol'].str.contains('MSFT|TSLA')].compute()
# Wall time: 3min 2s
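One thing worth checking (a hedged suggestion rather than a confirmed diagnosis): if dask does not know the divisions of the index, an index slice cannot skip partitions and every file is scanned anyway, which would make the timed slice no faster than a column filter. A minimal sketch, assuming a reasonably recent dask where read_parquet accepts calculate_divisions:
print(ddf.known_divisions)  # False here would explain the identical timings

ddf = dd.read_parquet(
    data_lake + asset_class,
    engine='pyarrow',
    calculate_divisions=True,  # recover min/max index statistics from the parquet metadata
    # plus the same columns=... selection as above
)
q1_21 = ddf.loc['2021-01-01':'2021-04-01'].compute()  # can now prune partitions outside the date range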

Related

Saving a dataframe from Databricks is really slow

I am working on an ML project, and I am now investigating the training data. The data is stored in Databricks and it's taking hours to download a dataset with a few thousand lines. This is the approach I am following:
Build a dataframe by selecting the table values I'm interested in. Time: 0.07 sec.
Simplify the dataframe. Time: 0.08 sec.
Save the dataframe using df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save. Time: 1h and counting.
Do you have any suggestions on how I could speed up the process? Ideally, I would like to repeat these steps a few thousand times.
Here are my commands:
%py
table_name = 'coordinates'
df = spark.sql(f"""select * from main_frame
                   where event_date = '2023-01-18'
                   and session_unique_id = '34eb1a29-aebb-4425-9997-7074e45244e9'""")
df1 = df.select('character.accountid', 'character.health')
df1.coalesce(1).write.format("com.databricks.spark.csv") \
   .option("header", "true") \
   .save("dbfs:/FileStore/user/me/logpos.csv")
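Since the filtered result is only a few thousand rows, one hedged option (this assumes the result really is that small) is to skip the distributed CSV writer entirely: coalesce(1) funnels everything through a single task and still writes a directory of part files, whereas collecting to the driver lets you write one plain CSV.
# Sketch: collect the small result to the driver and write a single CSV
small_df = df1.toPandas()  # acceptable only because the filtered result is a few thousand rows
small_df.to_csv('/dbfs/FileStore/user/me/logpos.csv', index=False)  # /dbfs/ is the local mount of dbfs:/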

Faster returns comparisons in Pandas dataframes?

I have a DataFrame containing 600,000 pairs of IDs. Each ID has return data in a large monthly returns_df. For each of the 600K pairs, I do the following:
I set left and right DataFrames equal to their subsets of returns_df.
I merge the DataFrames to get the months they both have data for.
I compute an absolute distance for each month, sum the results, and run a sigmoid function.
This process is taking ~12 hours because my computer has to create subsets of returns_df each time to compare. Can I substantially speed this up through some sort of vectorized solution or faster filtering?
import math
import pandas as pd

def get_return_similarity(row):
    left = returns_df[returns_df['FundID'] == row.left_side_id]
    right = returns_df[returns_df['FundID'] == row.right_side_id]
    temp = pd.merge(left, right, how='inner', on=['Year', 'Month'])
    if temp.shape[0] < 12:  # return 0 if overlap < 12 months
        return 0
    temp['diff'] = abs(temp['Return_x'] - temp['Return_y'])
    return 1 / math.exp(70 * temp['diff'].sum() / temp['diff'].shape[0])  # scaled sigmoid function

df['return_score'] = df[['left_side_id', 'right_side_id']].apply(get_return_similarity, axis=1)
Thanks in advance for your help! I'm trying to get better with Pandas.
Edit: As suggested, the basic data format is described below:
returns_df: monthly returns with FundID, Year, Month and Return columns.
df (the frame I am running the apply on): one row per pair, with left_side_id and right_side_id columns.
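A possible direction, sketched under the assumption that returns_df has FundID, Year, Month and Return columns as the code above implies (not a drop-in replacement): pivot returns_df once into a wide month-by-fund matrix, so each pair becomes two cheap column lookups instead of two full scans of returns_df.
import numpy as np

wide = returns_df.pivot_table(index=['Year', 'Month'], columns='FundID', values='Return')

def get_return_similarity_fast(row):
    pair = wide[[row.left_side_id, row.right_side_id]].dropna()  # months where both funds have data
    if len(pair) < 12:                                           # require at least 12 overlapping months
        return 0
    mean_abs_diff = (pair.iloc[:, 0] - pair.iloc[:, 1]).abs().mean()
    return 1 / np.exp(70 * mean_abs_diff)

df['return_score'] = df.apply(get_return_similarity_fast, axis=1)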

Pandas pivot_table/groupby taking too long on very large dataframe

I am working on a dataframe of 18 million rows with the following structure:
I need to get a count of the subsystem values for each suite, split by name_heuristic (there are 4 distinct values in that column). So I need an output with one column per name_heuristic value, the suite as index, and counts of subsystems as the values.
I have tried using pivot_table with the following code:
df_table = pd.pivot_table(df, index='suite', columns='name_heuristics', values='subsystem', aggfunc=np.sum)
But even after an HOUR, it is not done computing. What is taking so long and how can I speed it up? I even tried a groupby alternative that has been running for 15 minutes and counting:
df_table = df.groupby(['name_heuristics', 'suite']).agg({'subsystem': np.sum}).unstack(level='name_heuristics').fillna(0)
Any help is greatly appreciated! I have been stuck on this for hours.
It seems pivoting more than one categorical column crashes pandas. My solution to a similar problem was converting the target columns from categorical to object before pivoting:
# step 1
df['col1'] = df['col1'].astype('object')
df['col2'] = df['col2'].astype('object')
# step 2
df_pivot = pandas.pivot_table(df, columns=['col1', 'col2'], index=...
This was independent of dataframe size...
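Applied to the question above, a hedged sketch of the same idea (assuming the columns really are named suite, name_heuristics and subsystem, and that a count rather than a sum is what is wanted):
df['suite'] = df['suite'].astype('object')
df['name_heuristics'] = df['name_heuristics'].astype('object')
df_table = pd.pivot_table(df, index='suite', columns='name_heuristics',
                          values='subsystem', aggfunc='count', fill_value=0)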

pandas: Indexing for thousands of rows in a dataframe

I initially had 100k rows in my dataset. I read the csv using pandas into a dataframe called data. I tried to do a subset selection of 51 rows using .loc. My index labels are numeric values 0, 1, 2, 3, etc. I tried this command:
data = data.loc['0':'50']
But the results were weird: it took all the rows from 0 to 49999. It looks like it is taking rows until the index value starts with 50.
Similarly, I tried this command: new_data = data.loc['0':'19']
and the result was all the rows from 0 to 18999.
Could this be a bug in pandas?
You want to use .iloc in place of .loc, since you are selecting data from the dataframe by numeric position.
For example:
data.iloc[:50, :]
Keep in mind that your indices are numeric, not strings, so querying with a string (as you have done in your OP) attempts a string-wise comparison.
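To make the label/position distinction concrete, a small illustration on a hypothetical 100-row frame (not the asker's data):
import pandas as pd

data = pd.DataFrame({'value': range(100)})
by_position = data.iloc[:51]  # positions 0..50, end excluded -> 51 rows
by_label = data.loc[0:50]     # integer labels 0 through 50, end included -> also 51 rows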

Resampling/interpolating/extrapolating columns of a pandas dataframe

I am interested in knowing how to interpolate/resample/extrapolate the columns of a pandas dataframe for purely numerical and for datetime-type indices. I'd like to do this with either straightforward linear interpolation or spline interpolation.
Consider first a simple pandas data frame that has a numerical index (signifying time) and a couple of columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2), index=np.arange(0,20,2))
print(df)
           0         1
0   0.937961  0.943746
2   1.687854  0.866076
4   0.410656 -0.025926
6  -2.042386  0.956386
8   1.153727 -0.505902
10 -1.546215  0.081702
12  0.922419  0.614947
14  0.865873 -0.014047
16  0.225841 -0.831088
18 -0.048279  0.314828
I would like to resample the columns of this dataframe over some denser grid of time indices, which possibly extends beyond the last time index (thus requiring extrapolation).
Denote the denser grid of indices as, for example:
t = np.arange(0,40,.6)
The interpolate method for a pandas dataframe seems to interpolate only NaNs, and thus requires those new indices (which may or may not coincide with the original indices) to already be part of the dataframe. I guess I could append a dataframe of NaNs at the new indices to the original dataframe (excluding any indices appearing in both the old and new frames), call interpolate, and then drop the original time indices. Or I could do everything in scipy and create a new dataframe at the desired time indices.
Is there a more direct way to do this?
In addition, I'd like to know how to do this same thing when the indices are, in fact, datetimes. That is, when, for example:
df.index = np.array('2015-07-04 02:12:40', dtype=np.datetime64) + np.arange(0,20,2)
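For the numeric-index case, a sketch of one possible approach (a suggestion rather than the canonical answer; it reuses the df and t defined above, and falls back to scipy for genuine extrapolation beyond index 18, since pandas' interpolate at most pads the last known value forward):
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Interpolate on the union of old and new indices, then keep only the new grid.
dense = df.reindex(df.index.union(t)).interpolate(method='index').loc[t]

# For true linear extrapolation past the last original index, hand the data to scipy.
f = interp1d(df.index.values, df.values, axis=0, kind='linear', fill_value='extrapolate')
df_dense = pd.DataFrame(f(t), index=t, columns=df.columns)
For a datetime index, the same union/reindex trick should work with interpolate(method='time'), or the datetimes can be converted to integer nanoseconds before being passed to scipy.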