Pandas pivot_table/groupby taking too long on very large dataframe

I am working on a dataframe of 18 million rows with the following structure:
I need a count of subsystems for each suite, broken down by name_heuristics (that column has 4 distinct values). The output should have suite as the index, one column per name_heuristics value, and the subsystem count as the values.
I have tried using pivot_table with the following code:
df_table = pd.pivot_table(df, index='suite', columns='name_heuristics', values='subsystem', aggfunc=np.sum)
But even after an HOUR, it is not done computing. What is taking so long and how can I speed it up? I even tried a groupby alternative that is still running 15 minutes and counting:
df_table = df.groupby(['name_heuristics', 'suite']).agg({'subsystem': np.sum}).unstack(level='name_heuristics').fillna(0)
Any help is greatly appreciated! I have been stuck on this for hours.

It seems that pivoting on more than one categorical column crashes pandas. My solution to a similar problem was to convert the target columns from categorical to object dtype before pivoting:
step 1
df['col1'] = df['col1'].astype('object')
df['col2'] = df['col2'].astype('object')
step 2
df_pivot = pandas.pivot_table(df, columns=['col1', 'col2'], index=...
This was independent of dataframe size...
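Since the question asks for a count rather than a sum, a grouped count followed by unstack may be a cheaper route than pivot_table with np.sum (if subsystem holds strings, summing would concatenate them rather than count them). The following is a minimal sketch on made-up data, not a benchmark; the toy values are assumptions, and only the column names come from the question.

```python
import pandas as pd

# Toy stand-in for the 18-million-row frame; column names follow the question.
df = pd.DataFrame({
    "suite": ["s1", "s1", "s2", "s2", "s2"],
    "name_heuristics": ["h1", "h2", "h1", "h1", "h3"],
    "subsystem": ["a", "b", "c", "d", "e"],
})

# Count non-null subsystem values per (suite, name_heuristics) pair, then
# spread name_heuristics across the columns. observed=True only matters for
# categorical keys: it skips category combinations that never occur.
df_table = (
    df.groupby(["suite", "name_heuristics"], observed=True)["subsystem"]
    .count()
    .unstack("name_heuristics", fill_value=0)
)
print(df_table)
```

If the key columns are categorical, observed=True is likely the more direct alternative to casting them to object, because it avoids building every possible category combination.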

Related

Faster returns comparisons in Pandas dataframes?

I have a DataFrame containing 600,000 pairs of IDs. Each ID has returns data in a large monthly returns_df. For each of the 600K pairs, I do the following:
1. I set left and right DataFrames equal to their subsets of returns_df.
2. I merge the DataFrames to get the months for which both have data.
3. I compute an absolute distance by comparing each month's returns, then sum the results and run a sigmoid function.
This process is taking ~12 hours as my computer has to create subsets of returns_df each time to compare. Can I substantially speed this up through some sort of vectorized solution or faster filtering?
import math
import pandas as pd

def get_return_similarity(row):
    left = returns_df[returns_df['FundID']==row.left_side_id]
    right = returns_df[returns_df['FundID']==row.right_side_id]
    temp = pd.merge(left,right, how='inner', on=['Year','Month'])
    if temp.shape[0]<12: # Return if overlap < 12 months
        return 0
    temp['diff'] = abs(temp['Return_x'] - temp['Return_y'])
    return 1/(math.exp(70*temp['diff'].sum()/(temp['diff'].shape[0]))) #scaled sigmoid function
df['return_score'] = df[['left_side_id','right_side_id']].apply(get_return_similarity,axis=1)
Thanks in advance for your help! Trying to get better with Pandas
Edit: As suggested, the basic data format is below
returns_df
df I am running the apply on:
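One way to avoid filtering returns_df once per pair is to merge all pairs against the returns in bulk and aggregate with a groupby. The sketch below uses toy data and assumes that each (FundID, Year, Month) combination appears at most once in returns_df; the column names are taken from the function above, and everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Toy data: funds 1 and 2 have 12 months of returns, fund 3 only 6.
returns_df = pd.DataFrame({
    "FundID": [1] * 12 + [2] * 12 + [3] * 6,
    "Year": [2020] * 30,
    "Month": list(range(1, 13)) * 2 + list(range(1, 7)),
    "Return": [0.01] * 12 + [0.02] * 12 + [0.01] * 6,
})
df = pd.DataFrame({"left_side_id": [1, 1], "right_side_id": [2, 3]})

# One copy of the returns per side, renamed so the merges line up on the pair IDs.
left = returns_df.rename(columns={"FundID": "left_side_id", "Return": "Return_x"})
right = returns_df.rename(columns={"FundID": "right_side_id", "Return": "Return_y"})

# Inner merges keep only the months in which both funds of a pair have data.
pairs = df[["left_side_id", "right_side_id"]].drop_duplicates()
merged = pairs.merge(left, on="left_side_id").merge(
    right, on=["right_side_id", "Year", "Month"]
)
merged["diff"] = (merged["Return_x"] - merged["Return_y"]).abs()

# Mean absolute difference and number of overlapping months per pair.
stats = (
    merged.groupby(["left_side_id", "right_side_id"])["diff"]
    .agg(["mean", "count"])
    .reset_index()
)
# Same score as the original function: exp(-70 * mean), or 0 below 12 months.
stats["return_score"] = np.where(
    stats["count"] >= 12, np.exp(-70 * stats["mean"]), 0.0
)

# Pairs with no overlapping months are absent from stats; give them 0 as well.
df = df.merge(
    stats[["left_side_id", "right_side_id", "return_score"]],
    on=["left_side_id", "right_side_id"],
    how="left",
)
df["return_score"] = df["return_score"].fillna(0.0)
```

With 600,000 pairs the intermediate merged frame can become large, so processing the pairs in chunks may be necessary to stay within memory.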

pandas : Indexing for thousands of rows in dataframe

I initially had 100k rows in my dataset. I read the csv using pandas into a dataframe called data. I tried to do a subset selection of 51 rows using .loc. My index labels are numeric values 0, 1, 2, 3 etc. I tried using this command -
data = data.loc['0':'50']
But the results were weird, it took all the rows from 0 to 49999, looks like it is taking rows till the index value starts with 50.
Similarly, I tried with this command - new_data = data.loc['0':'19']
and the result was all the rows, starting from 0 till 18999.
Could this be a bug in pandas?

You want to use .iloc in place of .loc, since you are selecting rows by integer position.
For example, to get the first 51 rows:
data.iloc[:51, :]
Keep in mind that your index labels are numeric, not strings, so slicing with strings (as in your question) compares the labels as strings.
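The difference between the two indexers is easy to check on a small frame. This sketch assumes a default integer index, as described in the question; the column name value is made up. Label slices with .loc include the end label, while position slices with .iloc exclude the end position.

```python
import pandas as pd

# Toy frame with the default integer index 0..99.
data = pd.DataFrame({"value": range(100)})

# Label-based slice with integer labels: both endpoints are included.
by_label = data.loc[0:50]

# Position-based slice: the stop position is excluded, so use 51 for 51 rows.
by_position = data.iloc[:51]

print(len(by_label), len(by_position))
```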

counting each value in dataframe

I want to create a plot or graph from time series data.
My dataframe looks like this:
df.head()
I need to count values in df['status'] (there are 4 different values) and df['group_name'] (2 different values) for each day.
So I want a date index and, for each day, a count of how many times each value from df['status'] appears, and likewise for df['group_name']. It should return a Series.

I used spam.groupby('date')['column'].value_counts().unstack().fillna(0).astype(int) and it is working as it should. Thank you all for your help.
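Applied to the two columns from the question, the same pattern might look like the sketch below. The rows are invented for illustration; only the column names date, status and group_name come from the question.

```python
import pandas as pd

# Made-up sample rows using the column names from the question.
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-01", "2021-01-02"],
    "status": ["open", "open", "closed", "open"],
    "group_name": ["g1", "g2", "g1", "g1"],
})

# Per-day counts of each value; fill_value=0 keeps the counts as integers.
status_counts = df.groupby("date")["status"].value_counts().unstack(fill_value=0)
group_counts = df.groupby("date")["group_name"].value_counts().unstack(fill_value=0)

# One table with a date index and one column per distinct value.
daily_counts = pd.concat([status_counts, group_counts], axis=1)
print(daily_counts)
```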

How do I preset the dimensions of my dataframe in pandas?

I am trying to preset the dimensions of my data frame in pandas so that I can have 500 rows by 300 columns. I want to set it before I enter data into the dataframe.
I am working on a project where I need to take a column of data, copy it, shift it one to the right and shift it down by one row.
I am having trouble with the last row being cut off when I shift it down by one row (eg: I started with 23 rows and it remains at 23 rows despite the fact that I shifted down by one and should have 24 rows).
Here is what I have done so far:
bolusCI = pd.DataFrame()
##set index to very high number to accommodate shifting row down by 1
bolusCI = bolus_raw[["Activity (mCi)"]].copy()
activity_copy = bolusCI.shift(1)
activity_copy
pd.concat([bolusCI, activity_copy], axis =1)
Thanks!

There might be a more efficient way to achieve what you are looking to do, but to directly answer your question, you could initialize the DataFrame with fixed dimensions like this:
pd.DataFrame(columns=range(300),index=range(500))

You just need to define the index and columns in the constructor. The simplest way is to use pandas.RangeIndex. It mimics np.arange and range in syntax. You can also pass a name parameter to name it.
df = pd.DataFrame(
    index=pd.RangeIndex(500),
    columns=pd.RangeIndex(300)
)
print(df.shape)
(500, 300)
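For the underlying problem in the question (the last value disappearing after shift), it may be enough to add one empty row before shifting instead of presetting a large frame. This is a minimal sketch that assumes bolus_raw has the default 0..n-1 index; the sample values and the keys labels are made up.

```python
import pandas as pd

# Made-up stand-in for bolus_raw with three rows.
bolus_raw = pd.DataFrame({"Activity (mCi)": [5.0, 4.0, 3.0]})

# Add one extra (empty) row so the shifted copy has somewhere to put the last value.
bolusCI = bolus_raw[["Activity (mCi)"]].reindex(range(len(bolus_raw) + 1))

# shift(1) moves every value down one row; the frame length stays the same.
activity_copy = bolusCI.shift(1)

# Side by side: the original column and the shifted column.
result = pd.concat([bolusCI, activity_copy], axis=1, keys=["original", "shifted"])
print(result)
```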

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id      start_date  end_date    is_current  location  dimensions...
xyz987  2016-03-11  2016-04-02  Expired     CA        lots_of_stuff
xyz987  2016-04-03  2016-04-21  Expired     NaN       lots_of_stuff
xyz987  2016-04-22  NaN         Current     CA        lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). An example is that for previous rows, the location is filled out for the row but it is blank in the next row. I know that the location has not changed but it is capturing it as a unique row because it is blank.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).

It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
This method is referenced in the documentation.

How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
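The grouped object also exposes ffill directly, which avoids calling a Python lambda once per group. In the sketch below the rows are invented; the id, location and dimensions column names follow the question. Selecting the value columns explicitly and assigning the result back keeps the id column in the frame.

```python
import pandas as pd

# Invented rows in the shape described in the question.
df = pd.DataFrame({
    "id": ["xyz987", "xyz987", "xyz987", "abc123", "abc123"],
    "location": ["CA", None, "CA", None, "NY"],
    "dimensions": ["lots_of_stuff", None, None, None, "other_stuff"],
})

# Every column except the grouping key.
cols = [c for c in df.columns if c != "id"]

# Forward fill within each id; values never carry over from one id to the next.
df[cols] = df.groupby("id")[cols].ffill()
print(df)
```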

jreback on GitHub: "this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this."
Source: https://github.com/pandas-dev/pandas/issues/11296
According to jreback's answer, ffill() is not optimized when called on a groupby, but cumsum() is. Try this:
df = df.sort_values('id')
df.ffill() * (1 - df.isnull().astype(int)).groupby('id').cumsum().applymap(lambda x: None if x == 0 else 1)
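A variant of the cumsum idea can be sketched as follows: forward fill the whole frame once, then blank out every cell whose group has not yet seen a non-null value in that column. The mask is grouped by the original id column, because the mask frame itself no longer holds the ids. The data and the cols list are made up, and the approach assumes that rows of the same id are contiguous after sorting.

```python
import pandas as pd

# Made-up data with interleaved ids and missing values.
df = pd.DataFrame({
    "id": ["b", "a", "b", "a", "b"],
    "location": [None, "CA", "NY", None, None],
    "value": [None, 1.0, None, None, 5.0],
})
cols = ["location", "value"]

# A stable sort keeps the original row order inside each id.
df = df.sort_values("id", kind="stable")

# Running count of non-null values per id and column; > 0 means the group
# has already seen a value that a forward fill could legitimately carry.
seen = df[cols].notna().astype(int).groupby(df["id"]).cumsum() > 0

# A global forward fill leaks across id boundaries; where() removes the leaks.
filled = df[cols].ffill().where(seen)
print(filled)
```

Whether this beats a plain grouped ffill depends on the pandas version in use, so it is worth timing both on a sample of the real data.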
