Saving a dataframe from Databricks is really slow - sql

I am working on an ML project, and I am now investigating the training data. The data is stored in Databricks and it's taking hours to download a dataset with a few thousand lines. This is the approach I am following:
Build a dataframe by selecting the table values I'm interested in. Time: 0.07 sec.
Simplify the dataframe. Time: 0.08 sec.
Save the dataframe using df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save. Time: 1h and counting.
Do you have any suggestions on how I could speed up the process? Ideally, I would like to repeat the steps a few thousand times.
Here are my commands:
%py
table_name = 'coordinates'
df = spark.sql(f"""select * from main_frame where event_date = '2023-01-18' and session_unique_id = '34eb1a29-aebb-4425-9997-7074e45244e9'""")
df1 = df \
.select('character.accountid', 'character.health')
df1.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/user/me/logpos.csv")
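A note on the timings, plus a minimal sketch (not a verified fix, and assuming the table and column names from the query above): Spark evaluates lazily, so the 0.07 s and 0.08 s steps only build a query plan, and the write is where the whole scan, filter and single-task coalesce(1) actually run. Spark's built-in csv writer can stand in for the legacy com.databricks.spark.csv format, and if the plan is to repeat this for a few thousand sessions, a single write partitioned by session id may be much cheaper than launching thousands of separate jobs:
# Sketch only -- spark is the notebook's SparkSession; table and column names
# are taken from the question.
df = spark.sql("""
    select session_unique_id, character.accountid, character.health
    from main_frame
    where event_date = '2023-01-18'
""")

# One job writing every session for the day, one output folder per session,
# using the built-in csv writer instead of com.databricks.spark.csv.
(df.write
   .partitionBy("session_unique_id")
   .option("header", "true")
   .csv("dbfs:/FileStore/user/me/logpos"))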

Related

Filtering on index in dask dataframe is not any faster than filtering on any other column

I'm loading a dask dataframe from a parquet collection of around 1300 parquet files, 250 million entries in total. The dataset is a time series and has a datetime index.
In general I'm pretty happy with the performance of queries considering the size of the dataset (approx. 6-7 min).
One thing I don't understand is why index filtering is no faster than any other column filtering.
What I mean is that if I take a slice of the ddf on timestamp (the index) or if I take a slice based on any other field, the time it takes to compute the dataframe is basically identical.
Am I doing something wrong? I would have imagined that filtering on the index would be much faster.
I tried setting the index a second time, but it didn't help.
I've checked suggestions on this post: https://www.coiled.io/blog/dask-set-index-dataframe
import dask
import dask.multiprocessing
import dask.dataframe as dd

dask.config.set(scheduler=dask.multiprocessing.get)
ddf = dd.read_parquet(data_lake + asset_class, engine='pyarrow',
                      columns=['Close', 'Open', 'Symbol', 'TradeHigh', 'TradeLow', 'CloseCumulativeVolume',
                               'CloseDate', 'CloseCumulativeValue', 'TradeCount'])
q1_21 = ddf['2021-01-01':'2021-04-01'].compute()
#Wall time: 2min 59s
df = ddf.loc[ddf['Symbol'].str.contains('MSFT|TSLA')].compute()
#Wall time: 3min 2s
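One thing that may explain it: index-based slicing in Dask is only faster than a plain column filter when the dataframe's divisions (the index boundaries of each partition) are known, so partitions outside the requested range can be skipped. If ddf.known_divisions is False, .loc has to scan every partition, which matches the timings above. A minimal sketch, assuming a reasonably recent Dask where read_parquet accepts calculate_divisions (older releases used a differently named flag such as gather_statistics):
import dask.dataframe as dd

# Ask dask to work out partition divisions from the parquet metadata
# (calculate_divisions in recent releases; older versions used gather_statistics).
ddf = dd.read_parquet(
    data_lake + asset_class,
    engine='pyarrow',
    columns=['Close', 'Open', 'Symbol', 'TradeHigh', 'TradeLow',
             'CloseCumulativeVolume', 'CloseDate',
             'CloseCumulativeValue', 'TradeCount'],
    calculate_divisions=True,
)

# If this prints False, an index slice cannot skip partitions and will be
# no faster than filtering on an ordinary column.
print(ddf.known_divisions)

# With known divisions, .loc on the datetime index only touches the
# partitions that overlap the requested range.
q1_21 = ddf.loc['2021-01-01':'2021-04-01'].compute()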

using pandas.melt for big dataframe

At some point I use pd.melt to reshape my dataframe. After profiling, this command takes around 7 minutes to run, which is too long for my use case (I am using it in an interactive dashboard).
Are there any ways to improve the running time of the melt function in pandas?
If not, is it possible and a good practice to use a big data package just for this line of code?
pd.melt(change_t, id_vars=['id', 'date'], value_vars=factors, value_name='value')
# factors = list of 20 columns
I've timed melting a test table with 2 id_vars, 20 factors, and 1M rows and it took 22 seconds on my laptop. Is your table similarly sized, or much larger? If it is a huge table, would it be OK to return only part of the melted output to your interactive dashboard? I've put code for that approach below; it took 1.3 seconds to return the first 1000 rows of the melted table.
Timing melting a large test table
import pandas as pd
import numpy as np
import time
id_cols = ['id','date']
n_ids = 1000
n_dates = 100
n_cols = 20
n_rows = 1000000
#Create the test table
df = pd.DataFrame({
    'id': np.random.randint(1, n_ids+1, n_rows),
    'date': np.random.randint(1, n_dates+1, n_rows),
})
factors = []
for c in range(n_cols):
    c_name = 'C{}'.format(c)
    factors.append(c_name)
    df[c_name] = np.random.random(n_rows)
#Melt and time how long it takes
start = time.time()
pd.melt(df, id_vars=['id', 'date'], value_vars=factors, value_name='value')
print('Melting took',time.time()-start,'seconds for',n_rows,'rows')
#Melting took 21.744 seconds for 1000000 rows
Here's a way you can get just the first 1000 melted rows
ret_rows = 1000
start = time.time()
partial_melt_df = pd.DataFrame()
for ks, g in df.groupby(['id', 'date']):
    g_melt = pd.melt(g, id_vars=['id', 'date'], value_vars=factors, value_name='value')
    partial_melt_df = pd.concat((partial_melt_df, g_melt), ignore_index=True)
    if len(partial_melt_df) >= ret_rows:
        partial_melt_df = partial_melt_df.head(ret_rows)
        break
print('Partial melting took',time.time()-start,'seconds to give back',ret_rows,'rows')
#Partial melting took 1.298 seconds to give back 1000 rows
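A small side note on the loop above: growing partial_melt_df with pd.concat inside the loop recopies the accumulated result on every iteration. Collecting the per-group melts in a list and concatenating once at the end is usually a bit cheaper; a sketch under the same test setup:
ret_rows = 1000
pieces = []
rows_so_far = 0
for ks, g in df.groupby(['id', 'date']):
    g_melt = pd.melt(g, id_vars=['id', 'date'], value_vars=factors, value_name='value')
    pieces.append(g_melt)
    rows_so_far += len(g_melt)
    if rows_so_far >= ret_rows:
        break
# Concatenate once at the end instead of on every iteration.
partial_melt_df = pd.concat(pieces, ignore_index=True).head(ret_rows)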

Dask appropriate for my goal? compute() taking very long

I am doing the following in Dask because the df dataframe has 7 million rows and 50 columns, so pandas is extremely slow. However, I might not be using Dask correctly, or Dask might not be appropriate for my goal. I need to do some preprocessing on the df dataframe, which is mainly creating some new columns, and then eventually save it (I am saving to csv, but I have also tried parquet). However, before I save, I believe I have to call compute(), and compute() is taking very long -- I left it running for 3 hours and it still wasn't done. I tried to persist() throughout the calculations, but persist() also took a long time. Is this expected with Dask given the size of my data? Could this be because of the number of partitions (I have 20 logical processors and Dask is using 24 partitions -- I have 128 GB of memory if this helps)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match

from dask_ml.preprocessing import LabelEncoder



df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
df['actual_adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
    df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']

df['stopped_tom'] = 1 * (df['stopped'] > 0)

def func(df):
    df = df.sort_values('region')
    df['first_established'] = 1 * (df['region_d'] == df['region_d'].min())
    df['last_established'] = 1 * (df['region_d'] == df['region_d'].max())
    df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
    df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
    df['time_1'] = df['time_book'].shift(1, fill_value=0)
    df['time_2'] = df['time_book'].shift(-1, fill_value=0)
    df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
    return df

df = df.groupby('country').apply(func).reset_index(drop=True)
df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])

df['Investment'] = df['lor_index'].values * 1000
df = df.compute().to_csv("path")
Saving to csv or parquet will by default trigger computation, so the last line should be:
df = df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
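For completeness, a minimal sketch of the two write patterns; single_file=True is convenient but funnels every partition through one task, so the multi-file form is usually faster:
# One csv per partition, written in parallel:
df.to_csv("path_*.csv")

# A single csv file (slower; partitions are appended one after another):
df.to_csv("path.csv", single_file=True)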
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
If one of the dfs is small enough to fit in memory, then it would be good to keep it as a pandas dataframe; see further tips in the documentation.
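To illustrate that last tip, a minimal sketch assuming data2.csv is the one small enough to hold in memory: a Dask dataframe can be merged directly with an in-memory pandas dataframe, which lets Dask broadcast the small table to each partition instead of shuffling the large one.
import pandas as pd
import dask.dataframe as dd

df1 = dd.read_csv("data1.csv")        # large: stays a dask dataframe
df2_small = pd.read_csv("data2.csv")  # assumed small enough for memory

# Merging a dask dataframe with a pandas dataframe broadcasts the small
# table to every partition, so no expensive shuffle of df1 is needed.
df = df1.merge(df2_small, how='inner', on=['country', 'region'])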

Pandas pivot_table/groupby taking too long on very large dataframe

I am working on a dataframe of 18 million rows whose columns include suite, name_heuristics and subsystem.
I need to get a count of subsystem for each suite, broken down by name_heuristics (there are 4 values for that column). So I need an output with one column per name_heuristics value, suite as the index, and the count of subsystems as the values.
I have tried using pivot_table with the following code:
df_table = pd.pivot_table(df, index='suite', columns='name_heuristics', values='subsystem', aggfunc=np.sum)
But even after an HOUR, it is not done computing. What is taking so long, and how can I speed it up? I even tried a groupby alternative that has been running for 15 minutes and counting:
df_table = df.groupby(['name_heuristics', 'suite']).agg({'subsystem': np.sum}).unstack(level='name_heuristics').fillna(0)
Any help is greatly appreciated! I have been stuck on this for hours.
It seems pivoting more than one categorical column crashes pandas. My solution to a similar problem was converting the categorical target columns to object, using:
step 1
df['col1'] = df['col1'].astype('object')
df['col2'] = df['col2'].astype('object')
step 2
df_pivot = pandas.pivot_table(df, columns=['col1', 'col2'], index=...
This was independent of dataframe size...
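Applied to the columns in the question above (assuming suite and name_heuristics are the categorical ones), the same two steps would look roughly like this; and since the goal there is a count rather than a sum, aggfunc='count' or pd.crosstab may be closer to what is needed:
# Step 1: cast the categorical grouping columns to plain object dtype.
df['suite'] = df['suite'].astype('object')
df['name_heuristics'] = df['name_heuristics'].astype('object')

# Step 2: pivot, counting subsystem entries per suite/name_heuristics pair.
df_table = pd.pivot_table(df, index='suite', columns='name_heuristics',
                          values='subsystem', aggfunc='count', fill_value=0)

# Equivalent count without pivot_table:
df_counts = pd.crosstab(df['suite'], df['name_heuristics'])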

pandas df.resample('D').sum() returns NaN

I've got a pandas dataframe with electricity meter readings (cumulative). The df has a DatetimeIndex with dtype='datetime64[ns]'. When I load the .csv file the dataframe does not contain any NaN values. I need to calculate both the monthly and daily energy generated.
To calculate monthly generation I use dfmonth = df.resample('M').sum(). This works fine.
To calculate daily generation I thought of using dfday = df.resample('D').sum(). This partially works, but it returns NaN for some index dates (no data is missing in the raw file).
Please see the code below. Does anyone know why this happens? Any proposed solutions?
df = pd.read_csv(file)
df = df.set_index(pd.DatetimeIndex(df['Reading Timestamp']))
df=df.rename(columns = {'Energy kWh':'meter', 'Instantaneous Power kW (approx)': 'kW'})
df.drop(df.columns[:10], axis=1, inplace=True) #Delete columns I don't need.
df['kWh'] = df['meter'].sub(df['meter'].shift())
dfmonth = df.resample('M').sum() #This works OK calculating kWh. dfmonth does not contain any NaN.
dfday = df.resample('D').sum() # This returns a total of 8 NaN out of 596 sampled points. Original df has 27929 DatetimeIndex rows
Thank you in advance.
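For anyone chasing a similar symptom, a short check worth running before blaming resample (a sketch, assuming the column names above): list the daily bins that come out as NaN and count how many raw readings actually fall into them, which quickly shows whether the input data itself is the problem.
# Daily bins that ended up NaN.
nan_days = dfday[dfday['kWh'].isna()]
print(nan_days.index)

# How many raw readings land in those days; zero or oddly low counts point
# at the source file rather than at resample().
print(df['kWh'].resample('D').count().loc[nan_days.index])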
A big apology to you all. The .csv I was given and the raw .csv I was checking against are not the same file. The data was somehow corrupted...
I've been banging my head against the wall till now; there is no problem with df.resample('D').sum().
Sorry again, consider this thread solved.