At some point I am using pd.melt to reshape my dataframe. After inspection, this command is taking around 7 minutes to run, which is too long for my use case (I am using it in an interactive dashboard).
Are there any ways to improve the running time of pandas' melt function?
If not, is it possible, and good practice, to use a big data package just for this line of code?
pd.melt(change_t, id_vars=['id', 'date'], value_vars=factors, value_name='value')
where factors is a list of 20 column names.
I've timed melting a test table with 2 id_vars, 20 factors, and 1M rows, and it took 22 seconds on my laptop. Is your table similarly sized, or much larger? If it is a huge table, would it be OK to return only part of the melted output to your interactive dashboard? I've included code for that approach below; it took 1.3 seconds to return the first 1000 rows of the melted table.
Timing melting a large test table
import pandas as pd
import numpy as np
import time

id_cols = ['id', 'date']
n_ids = 1000
n_dates = 100
n_cols = 20
n_rows = 1000000

# Create the test table
df = pd.DataFrame({
    'id': np.random.randint(1, n_ids + 1, n_rows),
    'date': np.random.randint(1, n_dates + 1, n_rows),
})
factors = []
for c in range(n_cols):
    c_name = 'C{}'.format(c)
    factors.append(c_name)
    df[c_name] = np.random.random(n_rows)

# Melt and time how long it takes
start = time.time()
pd.melt(df, id_vars=['id', 'date'], value_vars=factors, value_name='value')
print('Melting took', time.time() - start, 'seconds for', n_rows, 'rows')
# Melting took 21.744 seconds for 1000000 rows
Here's a way you can get just the first 1000 melted rows:
ret_rows = 1000
start = time.time()
partial_melt_df = pd.DataFrame()
for ks, g in df.groupby(['id', 'date']):
    g_melt = pd.melt(g, id_vars=['id', 'date'], value_vars=factors, value_name='value')
    partial_melt_df = pd.concat((partial_melt_df, g_melt), ignore_index=True)
    if len(partial_melt_df) >= ret_rows:
        partial_melt_df = partial_melt_df.head(ret_rows)
        break
print('Partial melting took', time.time() - start, 'seconds to give back', ret_rows, 'rows')
# Partial melting took 1.298 seconds to give back 1000 rows
I have a dataframe like the one below:
id B C
1 2 3
1 3 4
2 4 2
3 12 32
Finally, I want to write CSV files
1.csv, 2.csv, and 3.csv, each containing all the rows for the corresponding value of the id column.
Can I do this efficiently? I know I can do it with a for loop, but that is time consuming.
According to the pandas documentation, the DataFrame method for writing content to a CSV file is to_csv, and it looks like there is no specific parameter to optimize it for your case, as you can see here.
You can solve this problem in a single O(n) pass, assuming the rows are grouped by ID. You already have the entire DataFrame in memory, and by saving the pieces to individual files you can also free some memory as you split the DataFrame up at each loop step.
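A minimal sketch of that loop (the file naming here is just an assumption based on the question):
import pandas as pd

# df is the full DataFrame already in memory, with an 'id' column
for name, group in df.groupby('id'):
    group.to_csv(f'{name}.csv', index=False)   # writes 1.csv, 2.csv, 3.csv, ...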
As suggested by #Lazyer, you can use multiprocessing:
import pandas as pd
import numpy as np
import multiprocessing as mp
import time

def to_csv(name, df):
    df.to_csv(f'export/{name}.csv', index=False)

if __name__ == '__main__':  # Do not remove this line! Mandatory
    # Setup a minimal reproducible example
    N = 10_000_000
    rng = np.random.default_rng(2022)
    df = pd.DataFrame(rng.integers(1, 10000, (N, 3)),
                      columns=['id', 'B', 'C'])

    # Multi processing
    start = time.time()
    with mp.Pool(mp.cpu_count()) as pool:
        pool.starmap(to_csv, df.groupby('id'))
    end = time.time()
    print(f"[MP] Elapsed time: {end - start:.2f} seconds")

    # Single processing
    start = time.time()
    for name, subdf in df.groupby('id'):
        subdf.to_csv(f'export/{name}.csv', index=False)
    end = time.time()
    print(f"[SP] Elapsed time: {end - start:.2f} seconds")
Test for 10,000,000 records:
[...]$ python mp.py
[MP] Elapsed time: 2.99 seconds
[SP] Elapsed time: 12.97 seconds
I am trying to find the fastest way to compute the results only for the last row in a dataframe. For some reason, when I do so it is slower than computing the entire dataframe. What am I doing wrong here? What would be the correct way to access only the last two rows and compute their values?
Currently these are my results:
Processing time of add_complete(): 1.333 seconds
Processing time of add_last_row_only(): 1.502 seconds
import numpy as np
import pandas as pd

def add_complete(df):
    df['change_a'] = df['a'].diff()
    df['change_b'] = df['b'].diff()
    df['factor'] = df['change_a'] * df['change_b']

def add_last_row_only(df):
    df.at[df.index[-1], 'change_a_last_row'] = df['a'].iloc[-1] - df['a'].iloc[-2]
    df.at[df.index[-1], 'change_b_last_row'] = df['b'].iloc[-1] - df['b'].iloc[-2]
    df.at[df.index[-1], 'factor_last_row'] = df['change_a_last_row'].iloc[-1] * df['change_b_last_row'].iloc[-1]

def main():
    a = np.arange(200_000_000).reshape(100_000_000, 2)
    df = pd.DataFrame(a, columns=['a', 'b'])
    add_complete(df)
    add_last_row_only(df)
    print(df.tail())
Unless I am missing something, for this kind of operation I would use NumPy on just the last two rows:
%%timeit
changes = np.diff(df.values[-2:, :], axis=0)
factor = np.prod(changes)
Just that operation takes 21 µs; yes, microseconds.
If I add the column insertion it increases to 511 ms, even when filling whole columns with the same value.
I suspect the problem comes from handling a DataFrame of around 1.5 GB, which roughly doubles in size when the extra columns are inserted.
%%timeit
changes = np.diff(df.values[-2:, :], axis=0)
factor = np.prod(changes)
df['factor'] = factor
df['changes_a'] = changes[0][0]
df['changes_b'] = changes[0][1]
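If the values are only needed for the last row, one option is to keep them as plain scalars instead of inserting new columns; a rough sketch of that idea, assuming df has the numeric columns 'a' and 'b':
import numpy as np

# Slice the last two rows first, then convert only that small piece to NumPy
last_two = df[['a', 'b']].iloc[-2:].to_numpy()

changes = np.diff(last_two, axis=0)   # change_a and change_b for the last row
factor = changes.prod()               # change_a * change_b

# Keeping scalars avoids allocating three new full-length columns
last_row_results = {'change_a': changes[0, 0],
                    'change_b': changes[0, 1],
                    'factor': factor}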
TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time

np.random.seed(0)

# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5

start = time.time()
target_row = df.loc[target_row_index]
result = []

# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time() - start)
# Goal: Calculate the list result as efficient as possible

# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)

start = time.time()
q = df.apply(lambda row: add(row, target_row), axis=1)
print(time.time() - start)
So I have a dataframe with 30,000 rows and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I already read Optimization when using Pandas; are there further resources you can recommend? Thanks
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
I am doing the following in Dask because the df dataframe has 7 million rows and 50 columns, so pandas is extremely slow. However, I might not be using Dask correctly, or Dask might not be appropriate for my goal. I need to do some preprocessing on the df dataframe, mainly creating some new columns, and then eventually save it (I am saving to CSV, but I have also tried Parquet). Before I save, I believe I have to call compute(), and compute() is taking very long: I left it running for 3 hours and it still wasn't done. I tried to persist() throughout the calculations, but persist() also took a long time. Is this expected with Dask given the size of my data? Could this be because of the number of partitions (I have 20 logical processors and Dask is using 24 partitions; I have 128 GB of memory if this helps too)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match
from dask_ml.preprocessing import LabelEncoder

df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")

df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])

df['actual_adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
    df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']
df['stopped_tom'] = 1 * (df['stopped'] > 0)

def func(df):
    df = df.sort_values('region')
    df['first_established'] = 1 * (df['region_d'] == df['region_d'].min())
    df['last_established'] = 1 * (df['region_d'] == df['region_d'].max())
    df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
    df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
    df['time_1'] = df['time_book'].shift(1, fill_value=0)
    df['time_2'] = df['time_book'].shift(-1, fill_value=0)
    df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
    return df

df = df.groupby('country').apply(func).reset_index(drop=True)

df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])
df['Investment'] = df['lor_index'].values * 1000

df = df.compute().to_csv("path")
Saving to csv or parquet will by default trigger computation, so the last line should be:
df = df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
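If you do want a single output file instead (and the result fits comfortably on one machine), a minimal sketch would be:
# write all partitions into one CSV file instead of one file per partition
df.to_csv("path.csv", single_file=True)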
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
right_on=['country', 'region'])
If one of the dataframes is small enough to fit in memory, then it would be good to keep it as a pandas dataframe; see further tips in the documentation.
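A rough sketch of that idea, assuming data2.csv is the smaller of the two tables:
import dask.dataframe as dd
import pandas as pd

df1 = dd.read_csv("data1.csv")   # large table stays lazy and partitioned
df2 = pd.read_csv("data2.csv")   # small table loaded into memory as plain pandas

# Merging a Dask DataFrame with an in-memory pandas DataFrame lets Dask join
# each partition locally instead of shuffling the large table
df = df1.merge(df2, how='inner', on=['country', 'region'])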
I have a fairly large table in BigQuery (approx. 9M rows) and I would like to read it via pandas.
I've tried the pd.read_gbq() function, which works fine on small tables.
On the large table it gets stuck after 50 seconds or so (the logs show elapsed .. 50s) without giving an error or anything.
My question is: how can I read that table with pandas, perhaps in chunks? Any conventions on scaling up these BigQuery reads would be helpful.
EDIT / resolution
Adding to Khan's answer, I ended up implementing chunks, writing 500,000 rows at a time to a file and then reading those files into a dataframe, like so:
def download_gbq_table(self):
    if not os.path.exists(self.tmp_dir):
        os.makedirs(self.tmp_dir)
    increment = 100000
    intervals = list(range(0, self.table_size, 100000))
    intervals.append(self.table_size - intervals[len(intervals) - 1])
    df = pd.DataFrame()
    for offset in intervals:
        query = f"select * from `<table_name>` limit {increment} offset {offset};"
        logger.info(f"running query: {query}")
        start_time = time.time()
        tmp_df = pd.read_gbq(query,
                             project_id=self.connection_parameters['project_id'],
                             private_key=self.connection_parameters['service_account'],
                             dialect='standard')
        df = pd.concat([df, tmp_df])
        logger.info(f'time took: {str(round(time.time() - start_time, 2))}')
        if len(df) % 500000 == 0:
            df.to_csv(os.path.join(self.tmp_dir, f'df_{str(offset + increment)}.csv'))
            df = pd.DataFrame()

def read_df_from_multi_csv(self):
    all_files = glob.glob(os.path.join(self.tmp_dir, "df_*"))
    df_list = []
    for f in all_files:
        start_time = time.time()
        df_list.append(pd.read_csv(f))
        logger.info(f'time took for reading {f}: {str(round(time.time() - start_time, 2))}')
    return pd.concat(df_list)
Pandas' read_gbq function currently does not provide a chunksize parameter (even though its counterpart to_gbq does).
Anyway, you can solve your problem by adding LIMIT and OFFSET to your SQL query and reading iteratively from BigQuery. Something along the lines of:
import pandas as pd

project_id = "xxxxxxxx"
increment = 100000
chunks = list(range(0, 9000000 + increment, increment))
intervals = [(chunks[i - 1], chunks[i]) for i in range(1, len(chunks))]
query_str = "select * from `mydataset.mytable` limit {limit} offset {start};"

for start, end in intervals:
    query = query_str.format(limit=end - start, start=start)
    df = pd.read_gbq(query, project_id)
    # -- do stuff with your df here..
Not sure if this existed back when the question was originally asked, but now you can use python-bigquery-sqlalchemy (link) to read data from BigQuery, which allows you to use the built-in chunking ability of pandas.read_sql(). You just create a SQLAlchemy connection engine using "bigquery://{project-name}" and pass that to con in pandas.read_sql().
For example:
from sqlalchemy.engine import create_engine
import pandas as pd
read_project = "my-cool-project"
query = f"""
select * from `{read_project}.cool-dataset.cooltable`
"""
bq_engine = create_engine(f"bigquery://{read_project}")
for df in pd.read_sql(query, con=bq_engine, chunksize=100_000):
    # do stuff with df...
    ...