A more efficient way of aggregating a random sample from pandas dataframe and iteratively append the mean of sampled df in an empty dataframe - pandas

I am trying to take out a random sample from my df, take mean of all columns in a single row series, using df_sample.mean(axis =0) and then appending this series to an empty dataframe and I want 1 million such rows. I am getting the result but it is taking too much time to run. Can someone please suggest an efficient way to do this ?
train = pd.DataFrame()
for i in range (1000000):
df_sample = df_2.sample(n=100)
row = df_sample.mean(axis=0)
train = train.append(row,ignore_index=True)

Here's a faster way to do, this would result in 1 million (10 lakh) rows:
Method 1: Sampling using pandas inbuilt
n_times = 1000000
values = [df_2.sample(n=1).mean(axis=0, numeric_only=True) for _ in range (n_times)]
train = pd.DataFrame(values, columns=['mean_col'])
Method 2: Sampling using numpy
def f1():
return np.mean(df_2.values[np.random.randint(0, df.shape[0])])
def f2():
return df_2.iloc[np.random.randint(0, df.shape[0])].mean(axis=0, numeric_only=True)
values = [f1() for _ in range(n_times)]
train = pd.DataFrame(values, columns=['mean_col'])
values = [f2() for _ in range(n_times)]
train = pd.DataFrame(values, columns=['mean_col'])

Related

Fastest way to compute only last row in pandas dataframe

I am trying to find the fastest way to compute the results only for the last row in a dataframe. For some reason, when I do so it is slower than computing the entire dataframe. What am I doing wrong here? What would be the correct way to access only the last two rows and compute their values?
Currently these are my results:
Processing time of add_complete(): 1.333 seconds
Processing time of add_last_row_only(): 1.502 seconds
import numpy as np
import pandas as pd
def add_complete(df):
df['change_a'] = df['a'].diff()
df['change_b'] = df['b'].diff()
df['factor'] = df['change_a'] * df['change_b']
def add_last_row_only(df):
df.at[df.index[-1], 'change_a_last_row'] = df['a'].iloc[-1] - df['a'].iloc[-2]
df.at[df.index[-1], 'change_b_last_row'] = df['b'].iloc[-1] - df['b'].iloc[-2]
df.at[df.index[-1], 'factor_last_row'] = df['change_a_last_row'].iloc[-1] * df['change_b_last_row'].iloc[-1]
def main():
a = np.arange(200_000_000).reshape(100_000_000, 2)
df = pd.DataFrame(a, columns=['a', 'b'])
add_complete(df)
add_last_row_only(df)
print(df.tail())
Unless I am missing something, for this kind of operation I would use numpy on the two last lines:
%%timeit
changes = np.diff(df.values[-2:,:],axis=0)
factor = np.product(changes)
21µs just this operation, yes, microseconds.
If I add insertion it increases to 511ms, even filling all with same value.
I suspect the problem comes from handling around a 1.5Gb dataframe, which actually doubles the size when inserting the two extra columns.
%%timeit
changes = np.diff(df.values[-2:,:],axis=0)
factor = np.product(changes)
df['factor']=factor
df['changes_a']=changes[0][0]
df['changes_b']=changes[0][1]

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
"""
Logic of calculating the variables check and score:
if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
"""
check = row[1]+target_row # row[1] takes 30 microseconds per call
score = np.sum(check == 4) - np.sum(check == 3) # np.sum takes 47 microseconds per call
result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
check = a + b
return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row : add(row, target_row), axis = 1)
print(time.time()-start)
So I have a dataframe of size 30'000 and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I already read Optimization when using Pandas are there further resources you can recommend? Thanks
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on values on the target column value, which is binary : 0/1
I want to extract equal number of rows that have 0's and 1's in the "target" column. I was thinking of using the pandas sampling function but not sure how to declare the equal number of samples I want from both classes for the dataframe based on the target column.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample,
df = pd.DataFrame({'col':np.random.randn(12000), 'target':np.random.randint(low = 0, high = 2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop = True)
new_df.target.value_counts()
1 5000
0 5000
Edit: Use DataFrame.sample
You get similar results using DataFrame.sample
new_df = df.groupby('target').sample(n=5000)
You can use DataFrameGroupBy.sample method as follwing:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
Also found this to be a good method:
df['weights'] = np.where(df['target'] == 1, .5, .5)
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
Change the value of frac depending on the percent of data you want back from the original dataframe.
You will have to run a df0.sample(n=5000) and df1.sample(n=5000) and then combine df0 and df1 into a dfsample dataframe. You can create df0 and df1 by df.filter() with some logic. If you provide sample data I can help you construct that logic.

More efficient Pandas code

I am trying to learn Python and Pandas and coming from VBA I am still caught in the habit of looping through every single cell, but I am looking for ways to operate on entire rows at a time.
Below is my part of my code. I have about 3000 stocks in the columns and about 40 or so data points in the rows saved in a dataframe called df.
I do the same kind of loop as showed to test for multiple criterias based on row values for the stocks in each column. As you see my code uses .ix to loop through the 'cells' in the dataframe.
But I have looked for ways to operate on the entire rows at a time, but have failed every attempt.
This take about 7 minutes for the 3000 stocks (but only about 1 minut or so for 2000 stocks??). But this must be able to run much faster?
def piotrosky():
df_temp = pd.DataFrame(np.nan, index=range(10), columns=df.columns)
#bruger dictionary til rename input så man ikke skal gøre det for hver række
dic={0:'positiveNetIncome',1:'positiveOperatingCF',2:'increasingROA', 3:'QualityOfEarnings',4:'longTermDebtToAssets',
5:'currentRatio', 6:'sharesOutVsSharesLast',7:'increasingGrossM',8:'IncreasingAssetTurnOver', 9:'total' }
df_temp.rename(dic, inplace = True)
r=1
#df is a vector with stocks in the columns and datapoints in the rows
#so I always need to loop across the columns
for i in range(df.shape[1]-1):
#positive net income
if df.ix[2,r]>0:
df_temp.ix[0,r]=1
else:
df_temp.ix[0,r]=0
#positiveOpeCF
if df.ix[3,r]>0:
df_temp.ix[1,r]=1
else:
df_temp.ix[1,r]=0
#Continue with several simular loops
#total
df_temp.ix[9,r]=df_temp.ix[0,r]+df_temp.ix[1,r]+df_temp.ix[2,r]+df_temp.ix[3,r]+ \
df_temp.ix[4,r]+df_temp.ix[5,r]+df_temp.ix[6,r]+df_temp.ix[7,r]+df_temp.ix[8,r]
r=r+1
Edit:
All of the below is done on a dataframe that is the transpose of the one you describe in your post. df.T should produce properly formatted input.
Method:
For conditionals on pandas dataframes, you can use the numpy function np.where:
criteria = {}
# np.where(condition, value_if_true, value_if_false)
criteria['positive_net_income'] = np.where(df[2] > 0, 1, 0)
After you get these numpy arrays, you can construct a dataframe from them,
pd.DataFrame(criteria)
and sum across it
pd.DataFrame(criteria).sum(axis=1)
to get a Series you can add as a column to your initial DataFrame
def piotrosky(df):
criteria = {}
criteria['positive_net_income'] = np.where(df[2] > 0, 1, 0)
criteria['positive_operating_cf'] = np.where(df[3] > 0, 1, 0)
...
return pd.DataFrame(criteria).sum(axis=1)
df['piotrosky_score'] = piotrosky(df)

pandas / numpy arithmetic mean in csv file

I have a csv file which contains 3000 rows and 5 columns, which constantly have more rows appended to it on a weekly basis.
What i'm trying to do is to find the arithmetic mean for the last column for the last 1000 rows, every week. (So when new rows are added to it weekly, it'll just take the average of most recent 1000 rows)
How should I construct the pandas or numpy array to achieve this?
df = pd.read_csv(fds.csv, index_col=False, header=0)
df_1 = df['Results']
#How should I write the next line of codes to get the average for the most 1000 rows?
I'm on a different machine than what my pandas is installed on so I'm going on memory, but I think what you'll want to do is...
df = pd.read_csv(fds.csv, index_col=False, header=0)
df_1 = df['Results']
#Let's pretend your 5th column has a name (header) of `Stuff`
last_thousand = df_1.tail(1000)
np.mean(last_thousand.Stuff)
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
Results will contain the mean for each column within the last 1000 rows. If you want more statistics, you can also use describe():
resutls = df.tail(1000).describe().unstack()
So basically I needed to use the pandas tail function. My Code below works.
df = pd.read_csv(fds.csv, index_col=False, header=0)
df_1 = df['Results']
numpy.average(df_1.tail(1000))