Is Dask appropriate for my goal? compute() taking very long - pandas

I am doing the following in Dask because the df dataframe has 7 million rows and 50 columns, so pandas is extremely slow. However, I might not be using Dask correctly, or Dask might not be appropriate for my goal. I need to do some preprocessing on df, mainly creating some new columns, and then save it (I am saving to csv but have also tried parquet). Before I save, I believe I have to call compute(), and compute() is taking very long: I left it running for 3 hours and it still wasn't done. I tried calling persist() throughout the calculations, but persist() also took a long time. Is this expected with Dask given the size of my data? Could it be because of the number of partitions (I have 20 logical processors and Dask is using 24 partitions; I have 128 GB of memory, if that helps)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match

from dask_ml.preprocessing import LabelEncoder



df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
df['actual_adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
    df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']

df['stopped_tom'] = 1 * (df['stopped'] > 0)

def func(df):
    df = df.sort_values('region')
    df['first_established'] = 1 * (df['region_d']==df['region_d'].min())
    df['last_established'] = 1 * (df['region_d']==df['region_d'].max())
    df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
    df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
    df['time_1'] = df['time_book'].shift(1, fill_value=0)
    df['time_2'] = df['time_book'].shift(-1, fill_value=0)
    df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
    return df

df = df.groupby('country').apply(func).reset_index(drop=True)
df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])

df['Investment'] = df['lor_index'].values * 1000
df = df.compute().to_csv("path")

Saving to csv or parquet triggers computation by default, so you don't need to call compute() first. The last line should be:
df = df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
If one of the dfs is small enough to fit in memory, then it would be good to keep it as a pandas dataframe, see further tips in the documentation.

Related

using pandas.melt for big dataframe

At some point I am using pd.melt to reshape my dataframe. On inspection, this command takes around 7 minutes to run, which is too long for my use case (I am using it in an interactive dashboard).
Are there any ways to improve the running time of the melt function in pandas?
If not, is it possible, and good practice, to use a big data package just for this line of code?
pd.melt(change_t, id_vars=['id', 'date'], value_vars=factors, value_name='value')
factors=list of 20 columns
I've timed melting a test table with 2 id_vars, 20 factors, and 1M rows and it took 22 seconds on my laptop. Is your table similarly sized, or much much larger? If it is a huge table, would it be ok to return only part of the melted output to your interactive dashboard? I put some code for that approach and it took 1.3 seconds to return the first 1000 rows of the melted table.
Timing melting a large test table
import pandas as pd
import numpy as np
import time
id_cols = ['id','date']
n_ids = 1000
n_dates = 100
n_cols = 20
n_rows = 1000000
#Create the test table
df = pd.DataFrame({
'id':np.random.randint(1,n_ids+1,n_rows),
'date':np.random.randint(1,n_dates+1,n_rows),
})
factors = []
for c in range(n_cols):
    c_name = 'C{}'.format(c)
    factors.append(c_name)
    df[c_name] = np.random.random(n_rows)
#Melt and time how long it takes
start = time.time()
pd.melt(df, id_vars=['id', 'date'], value_vars=factors, value_name='value')
print('Melting took',time.time()-start,'seconds for',n_rows,'rows')
#Melting took 21.744 seconds for 1000000 rows
Here's a way you can get just the first 1000 melted rows
ret_rows = 1000
start = time.time()
partial_melt_df = pd.DataFrame()
for ks,g in df.groupby(['id','date']):
    g_melt = pd.melt(g, id_vars=['id', 'date'], value_vars=factors, value_name='value')
    partial_melt_df = pd.concat((partial_melt_df,g_melt), ignore_index=True)
    if len(partial_melt_df) >= ret_rows:
        partial_melt_df = partial_melt_df.head(ret_rows)
        break
print('Partial melting took',time.time()-start,'seconds to give back',ret_rows,'rows')
#Partial melting took 1.298 seconds to give back 1000 rows
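A simpler variant of the same idea, sketched under the assumption that the dashboard can page over rows of the wide table instead of over (id, date) groups. The melt_page helper, its parameters, and the small test table are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def melt_page(wide, id_vars, value_vars, page, rows_per_page=50):
    """Melt only one page of wide rows (hypothetical helper)."""
    start = page * rows_per_page
    chunk = wide.iloc[start:start + rows_per_page]
    return pd.melt(chunk, id_vars=id_vars, value_vars=value_vars, value_name="value")

# Small test table: 2 id columns and 20 factor columns.
rng = np.random.default_rng(0)
factors = [f"C{c}" for c in range(20)]
wide = pd.DataFrame(rng.random((1000, 20)), columns=factors)
wide.insert(0, "date", rng.integers(1, 101, size=1000))
wide.insert(0, "id", rng.integers(1, 1001, size=1000))

# 50 wide rows x 20 factors gives 1000 melted rows.
first_page = melt_page(wide, ["id", "date"], factors, page=0)
print(first_page.shape)
```

Because only a slice of the wide table is melted, the cost depends on rows_per_page rather than on the full table size.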

Fastest way to compute only last row in pandas dataframe

I am trying to find the fastest way to compute the results only for the last row in a dataframe. For some reason, when I do so it is slower than computing the entire dataframe. What am I doing wrong here? What would be the correct way to access only the last two rows and compute their values?
Currently these are my results:
Processing time of add_complete(): 1.333 seconds
Processing time of add_last_row_only(): 1.502 seconds
import numpy as np
import pandas as pd
def add_complete(df):
    df['change_a'] = df['a'].diff()
    df['change_b'] = df['b'].diff()
    df['factor'] = df['change_a'] * df['change_b']
def add_last_row_only(df):
    df.at[df.index[-1], 'change_a_last_row'] = df['a'].iloc[-1] - df['a'].iloc[-2]
    df.at[df.index[-1], 'change_b_last_row'] = df['b'].iloc[-1] - df['b'].iloc[-2]
    df.at[df.index[-1], 'factor_last_row'] = df['change_a_last_row'].iloc[-1] * df['change_b_last_row'].iloc[-1]
def main():
    a = np.arange(200_000_000).reshape(100_000_000, 2)
    df = pd.DataFrame(a, columns=['a', 'b'])
    add_complete(df)
    add_last_row_only(df)
    print(df.tail())
Unless I am missing something, for this kind of operation I would use numpy on the last two rows:
%%timeit
changes = np.diff(df.values[-2:,:],axis=0)
factor = np.prod(changes)
This operation alone takes 21 µs (yes, microseconds).
If I add the insertion into the dataframe, it increases to 511 ms, even when each column is filled with a single value.
I suspect the problem comes from handling a dataframe of around 1.5 GB, which actually doubles in size when the extra columns are inserted.
%%timeit
changes = np.diff(df.values[-2:,:],axis=0)
factor = np.prod(changes)
df['factor']=factor
df['changes_a']=changes[0][0]
df['changes_b']=changes[0][1]
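If only the last-row values are needed, one option is to keep them in a small separate object rather than adding full-length columns to the large frame. The sketch below uses a 10-row stand-in frame; the last_row Series and its labels are illustrative names, not part of the original code.

```python
import numpy as np
import pandas as pd

# Small stand-in for the large frame in the question.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["a", "b"])

# Pull only the last two rows into a 2x2 NumPy array.
last_two = df[["a", "b"]].iloc[-2:].to_numpy()

# Row-wise difference: last row minus the row before it.
change_a, change_b = last_two[1] - last_two[0]

# Store the results in a small Series labelled with the last index value,
# so no new full-length columns are allocated.
last_row = pd.Series(
    {"change_a": change_a, "change_b": change_b, "factor": change_a * change_b},
    name=df.index[-1],
)
print(last_row)
```

This avoids the allocation cost entirely, at the price of keeping the last-row results outside the main dataframe.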

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size?
I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow', compression='snappy') to write to a file, but this results in a file that's about 4GB. I'd instead like this split into many ~100MB files.
I ended up using Dask:
import dask.dataframe as da
ddf = da.from_pandas(df, chunksize=5000000)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
This saves to multiple parquet files inside save_dir, where the number of rows of each sub-DataFrame is the chunksize. Depending on your dtypes and number of columns, you can adjust this to get files to the desired size.
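One rough way to pick that chunksize is to estimate the in-memory bytes per row and divide the target size by it. This is only a sketch: the rows_per_file helper is a hypothetical name, and in-memory size is not the same as the compressed on-disk Parquet size, so the result is a starting point to adjust after inspecting the files actually written.

```python
import numpy as np
import pandas as pd

def rows_per_file(df, target_bytes=100 * 1024**2):
    """Estimate how many rows fit in `target_bytes` of in-memory data (hypothetical helper)."""
    total_bytes = df.memory_usage(index=False, deep=True).sum()
    if len(df) == 0 or total_bytes == 0:
        return 1
    bytes_per_row = total_bytes / len(df)
    return max(1, int(target_bytes // bytes_per_row))

# Four float64 columns -> 32 bytes per row in memory.
sample = pd.DataFrame(np.zeros((1000, 4)), columns=list("abcd"))
chunksize = rows_per_file(sample, target_bytes=3200)
print(chunksize)
```

The returned value could then be passed as the chunksize argument in the from_pandas call shown above.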
One other option is to use the partition_cols option in pyarrow.parquet.write_to_dataset():
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
# df is your dataframe
n_partition = 100
df["partition_idx"] = np.random.choice(range(n_partition), size=df.shape[0])
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, root_path="{path to dir}/", partition_cols=["partition_idx"])
Slice the dataframe and save each chunk to a folder, using just the pandas API (without calling Dask or pyarrow directly).
You can pass extra params to the parquet engine if you wish.
import os
def df_to_parquet(df, target_dir, chunk_size=1000000, **parquet_wargs):
    """Writes pandas DataFrame to parquet format with pyarrow.
    Args:
        df: DataFrame
        target_dir: local directory where parquet files are written to
        chunk_size: number of rows stored in one chunk of parquet file. Defaults to 1000000.
    """
    for i in range(0, len(df), chunk_size):
        slc = df.iloc[i : i + chunk_size]
        chunk = i // chunk_size
        fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
        slc.to_parquet(fname, engine="pyarrow", **parquet_wargs)
Keep each parquet size small, around 128MB. To do this:
import dask.dataframe as dd
# Get number of partitions required for nominal 128MB partition size
# "+ 1" for non full partition
size128MB = int(df.memory_usage().sum()/1e6/128) + 1
# Convert to a Dask DataFrame with that many partitions
ddf = dd.from_pandas(df, npartitions=size128MB)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
# Plain pandas loop: write fixed-size row slices to numbered files
chunk = 200000
i = 0
n = 0
while i < len(all_df):
    j = i + chunk
    print((i, j))
    tmpdf = all_df[i:j]
    tmpdf.to_parquet(path=f"./append_data/part.{n}.parquet", engine='pyarrow', compression='snappy')
    i = j
    n = n + 1

Updating single row is slow with mixed types in pandas

A simple line of code df.iloc[100] = df.iloc[500] gets very slow on a large DataFrame with mixed types due to the fact that pandas copies the entire columns (found it in the source code). What I don't get is why this behaviour is necessary and how to avoid it and force pandas to just update the relevant values if I am sure in advance that the dtypes are the same. When the DF is single-type then the copying doesn't take place and values are modified in-place.
I found a workaround that seems to have the desired effect but it works only on row numbers:
for c in df.columns:
    df[c].array[100] = df[c].array[500]
It is literally 1000x faster than df.iloc[100] = df.iloc[500].
Here is how to reproduce the slowness of assignment:
import string
import itertools
import timeit
import numpy as np
import pandas as pd
data = list(itertools.product(range(200_000), string.ascii_uppercase))
df = pd.DataFrame(data, columns=['i', 'p'])
df['n1'] = np.random.randn(len(df))
df['n2'] = np.random.randn(len(df))
df['n3'] = np.random.randn(len(df))
df['n4'] = np.random.randn(len(df))
print(
    timeit.timeit('df.loc[100] = df.loc[500]', number=100, globals=globals()) / 100
)
df_o = df.copy()
# Remove mixed types
for c in df_o.columns:
    df_o[c] = df_o[c].astype('object')
print(
    timeit.timeit('df_o.loc[100] = df_o.loc[500]', number=100, globals=globals()) / 100
)
This example alone shows a 10x performance difference. I still don't fully understand why assigning a single row is quite slow even with non-mixed types.

pandas gets stuck when trying to read from bigquery

I have a fairly large table in BigQuery (approx. 9M rows) and I would like to read it via pandas.
I've tried the pd.read_gbq() function, which works fine on small tables.
On the large table it gets stuck after 50 seconds or so (the logs show elapsed .. 50s), without giving an error or anything.
My question is how I can read that table using pandas (in chunks?). Any conventions for scaling up these BigQuery reads would be helpful.
EDIT / resolution
Adding to Khan's answer, I ended up implementing chunks, writing 500,000 rows at a time to a file and then reading these files into a dataframe, like so:
def download_gbq_table(self):
    if not os.path.exists(self.tmp_dir):
        os.makedirs(self.tmp_dir)
    increment = 100000
    # One offset per page of `increment` rows
    intervals = list(range(0, self.table_size, increment))
    df = pd.DataFrame()
    for offset in intervals:
        query = f"select * from `<table_name>` limit {increment} offset {offset};"
        logger.info(f"running query: {query}")
        start_time = time.time()
        tmp_df = pd.read_gbq(query,
                             project_id=self.connection_parameters['project_id'],
                             private_key=self.connection_parameters['service_account'],
                             dialect='standard'
                             )
        df = pd.concat([df, tmp_df])
        logger.info(f'time took: {str(round(time.time() - start_time, 2))}')
        # Flush to disk every 500,000 rows, and once more after the last page
        if len(df) % 500000 == 0 or offset == intervals[-1]:
            df.to_csv(os.path.join(self.tmp_dir, f'df_{str(offset + increment)}.csv'))
            df = pd.DataFrame()
def read_df_from_multi_csv(self):
    all_files = glob.glob(os.path.join(self.tmp_dir, "df_*"))
    df_list = []
    for f in all_files:
        start_time = time.time()
        df_list.append(pd.read_csv(f))
        logger.info(f'time took for reading {f}: {str(round(time.time() - start_time, 2))}')
    # Concatenate the frames already read instead of reading every file a second time
    return pd.concat(df_list)
Pandas' read_gbq function currently does not provide a chunksize parameter (even though its counterpart, to_gbq, does).
Anyway, you can solve your problem by adding LIMIT and OFFSET to your SQL query and reading from BigQuery iteratively. Something along the lines of:
import pandas as pd
project_id = "xxxxxxxx"
increment = 100000
# Offsets 0, 100000, ..., 8900000 cover the ~9M rows
offsets = range(0, 9000000, increment)
query_str = "select * from `mydataset.mytable` limit {limit} offset {offset};"
for offset in offsets:
    query = query_str.format(limit=increment, offset=offset)
    df = pd.read_gbq(query, project_id)
    #-- do stuff with your df here..
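If the table size is not known in advance, the same LIMIT/OFFSET idea can be wrapped in a generator that stops at the first short or empty page. This is only a sketch: iter_pages, run_query, and order_by are hypothetical names, and the ORDER BY clause is an assumption added because BigQuery does not guarantee row order without one. In real use run_query could be something like lambda q: pd.read_gbq(q, project_id); here a fake stand-in is used so the example runs offline.

```python
import re
import pandas as pd

def iter_pages(run_query, table, page_size, order_by):
    """Yield one DataFrame per page until a short or empty page is returned."""
    offset = 0
    while True:
        query = (
            f"select * from `{table}` order by {order_by} "
            f"limit {page_size} offset {offset};"
        )
        page = run_query(query)
        if len(page) == 0:
            break
        yield page
        if len(page) < page_size:
            break
        offset += page_size

# Stand-in for BigQuery so the sketch runs offline: it answers the
# limit/offset part of the query from an in-memory frame.
_fake_table = pd.DataFrame({"id": range(250)})

def fake_run_query(query):
    limit, offset = map(int, re.search(r"limit (\d+) offset (\d+)", query).groups())
    return _fake_table.iloc[offset:offset + limit]

page_lengths = [
    len(page)
    for page in iter_pages(fake_run_query, "mydataset.mytable", 100, "id")
]
print(page_lengths)
```

Passing the query runner in as an argument keeps the paging logic testable without a BigQuery connection.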
Not sure if this existed back when the question was originally asked, but you can now use python-bigquery-sqlalchemy to read data from BigQuery, which lets you use the built-in chunking ability of pandas.read_sql(). You just create a SQLAlchemy engine using "bigquery://{project-name}" and pass it as con to pandas.read_sql().
For example:
from sqlalchemy.engine import create_engine
import pandas as pd
read_project = "my-cool-project"
query = f"""
select * from `{read_project}.cool-dataset.cooltable`
"""
bq_engine = create_engine(f"bigquery://{read_project}")
for df in pd.read_sql(query, con=bq_engine, chunksize=100_000):
    # do stuff with df...
    pass
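The chunked iteration pattern itself can be tried locally without BigQuery. The sketch below swaps in an in-memory SQLite database purely as a stand-in; the table contents and the chunk size of 4 are illustrative assumptions.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the remote table.
con = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(10), "value": [i * i for i in range(10)]}).to_sql(
    "cooltable", con, index=False
)

chunk_sizes = []
total = 0
# With chunksize set, read_sql returns an iterator of DataFrames
# instead of one large DataFrame.
for chunk in pd.read_sql("select * from cooltable", con=con, chunksize=4):
    chunk_sizes.append(len(chunk))
    total += int(chunk["value"].sum())

con.close()
print(chunk_sizes, total)
```

Aggregating per chunk, as done with total here, keeps memory use bounded by the chunk size rather than by the table size.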