Adding to a multi-indexed pandas DataFrame by specific indices is incredibly slow

I have a rather large data frame (~400M rows). For each row, I calculate which elements in another data frame (~200M rows) I need to increment by 1. This second data frame is indexed by three parameters. So the line I use to do this looks like:
total_counts[i].loc[zipcode,j,g_durs[j][0]:g_durs[j][-1]+1] += 1
The above line works, but it appears to be a horrendous bottleneck and takes almost 1 second to run. Unfortunately, I cannot wait 11 years for this to finish.
Any insights on how I might speed up this operation? Do I need to split up the 200M row data frame into an array of smaller objects? I don't understand exactly what is making this so slow.
Thanks in advance.
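One direction, sketched below with toy data rather than the poster's actual frames: instead of issuing one .loc increment per source row, collect every (zipcode, j, dur) label you would have incremented (expanding each duration slice into its individual labels), count the hits once, and apply them with a single aligned add. The names counts and updates are hypothetical stand-ins for total_counts[i] and the collected labels.
import pandas as pd

# Toy stand-in for total_counts[i]: a Series indexed by three parameters.
idx = pd.MultiIndex.from_product([[10001, 10002], [0, 1], range(5)],
                                 names=['zipcode', 'j', 'dur'])
counts = pd.Series(0, index=idx)

# Hypothetical batch of labels collected while scanning the 400M-row frame.
updates = [(10001, 0, 2), (10001, 0, 3), (10001, 0, 2), (10002, 1, 4)]

# Count each label once, then add everything in a single aligned operation
# instead of thousands of per-row .loc increments.
increments = pd.Series(updates).value_counts()
increments.index = pd.MultiIndex.from_tuples(increments.index,
                                             names=counts.index.names)
counts = counts.add(increments, fill_value=0)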

Related

Opening and concatenating a list of files in a single pandas DataFrame avoiding loops

I've been reading different posts here on this topic but I didn't really manage to find an answer.
I have a folder with 50 files (total size around 70GB). I would like to open them and build a single DataFrame on which to perform computations.
Unfortunately, I run out of memory if I try to concatenate the files immediately after opening them.
I know I can operate on chunks and work on smaller subsets, but these 50 files are already a small portion of the entire dataset.
So I found an alternative that works (the loops below), deleting each list entry as it is consumed, but it obviously takes too much time. I also tried pd.append, but in that case I run out of memory as well.
df_list = []
for file in lst:
    df_list.append(open_file(file))
    #print(file)

# Seed the result with the last frame, then fold the remaining frames in one
# at a time, deleting each from the list as soon as it has been concatenated.
result = df_list[len(df_list)-1]
del df_list[len(df_list)-1]
for i in range(len(df_list)-1, -1, -1):
    print(i)
    result = pd.concat([result, df_list[i]], ignore_index=True)
    del df_list[i]
Although it works, I feel like I'm doing things twice. Furthermore, putting pd.concat in a loop is a very bad idea, since the time increases exponentially (see: Why does concatenation of DataFrames get exponentially slower?).
Does anyone have any suggestion?
Now opening and concatenation takes 75 mins + 105 mins. I hope to reduce this time.
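One direction worth sketching, assuming the files are CSVs (open_file below is a stand-in for whatever the real loader does): shrink each frame as it is loaded by downcasting the 64-bit numeric columns, and call pd.concat exactly once. The dtype shrinking is where the memory saving comes from; the single concat call avoids the quadratic cost of growing result inside a loop.
import pandas as pd

def open_file(path):
    df = pd.read_csv(path)
    # Shrink 64-bit numeric columns to the smallest dtype that fits.
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df

# One concat over all frames; no repeated copying of the growing result.
result = pd.concat((open_file(f) for f in lst), ignore_index=True)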

Multi-process Pandas Groupby Failing to do Anything After Processes Kick Off

I have a fairly large pandas dataframe that is about 600,000 rows by 50 columns. I would like to perform a groupby.agg(custom_function) to get the resulting data. The custom_function returns the first non-null value in the series, or null if all values in the series are null. (My dataframe is hierarchically sorted by data quality: the first occurrence of a unique key has the most accurate data, but if the first occurrence has null data I want to take values from the second occurrence, and so on.)
I have found the basic groupby.agg(custom_function) syntax is slow, so I have implemented multi-processing to speed up the computation. When this code is applied to a dataframe that is ~10,000 rows long, the computation takes a few seconds; however, when I try to use the entirety of the data, the process seems to stall out. Multiple processes kick off, but memory and CPU usage stay about the same and nothing gets done.
Here is the trouble portion of the code:
import concurrent.futures
import tqdm

# Create list of individual dataframes to feed the map/multiprocess function
grouped = combined.groupby(['ID'])
grouped_list = [group for name, group in grouped]
length = len(grouped)

# Multi-process execution of the single-group function
print('\nMulti-Process Pivot:')
with concurrent.futures.ProcessPoolExecutor() as executor:
    with tqdm.tqdm(total=length) as progress:
        futures = []
        for df in grouped_list:
            future = executor.submit(custom_function, df)
            future.add_done_callback(lambda p: progress.update())
            futures.append(future)
        results = []
        for future in futures:
            result = future.result()
            results.append(result)
I think the issue has something to do with the multi-processing (maybe queuing up a job this large is the issue?). I don't understand why a fairly small job creates no issues for this code, but increasing the size of the input data seems to hang it up rather than just execute more slowly. If there is a more efficient way to take the first value in each column per unique ID, I'd be interested to hear it.
Thanks for your help.
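One thing worth checking before reaching for multiprocessing: submitting one task per group means every group DataFrame gets pickled and shipped to a worker process, which for a very large number of groups can swamp the executor before any real work happens. If custom_function really is "first non-null value per column within each ID", pandas' built-in GroupBy.first() already does exactly that (it skips nulls) in a single vectorised pass, along these lines:
# A minimal sketch: 'combined' is the frame from the question, already sorted
# so that the highest-quality row of each ID comes first. GroupBy.first()
# returns the first non-null value per column within each group.
result = combined.groupby('ID', sort=False).first()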

Detect sharp drops in time series data (Pandas)

I have a time series DataFrame which is indexed by "Cycles" and I'd like to know if there's a way to detect sharp prolonged drops.
The data is smoothed using a rolling mean over 10k cycles, and the index is not evenly spaced: the cycle count doesn't always increase by the same amount from row to row.
Index (Cycle)    RawData    DecimatedData
4                9.032      9.363
6                9.721      9.359
10               9.831      9.363
11               8.974      9.352
15               9.143      9.354
Even after decimating there are many "small" drops along the way which I'd like to ignore.
I've tried using df['DecimatedData'].diff() with different windows but it hasn't yielded good results.
I was thinking of keeping a counter that increases as long as .diff() keeps yielding negative values, and flagging a drop once the counter reaches a certain limit. But I think there might be a more elegant way of doing this.
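That counter idea can be expressed in vectorised form. A minimal sketch, where df is the frame shown above and run_limit / drop_limit are assumed thresholds to tune to the data:
s = df['DecimatedData']
falling = s.diff() < 0                   # True where the signal fell vs. the previous row

# run_id changes every time the signal stops falling, so consecutive falling
# rows share an id; run_len counts how deep into a falling run each row is.
run_id = (~falling).cumsum()
run_len = falling.astype(int).groupby(run_id).cumsum()

# Total decline since the start of the current falling run.
run_start = s.where(run_len == 1).ffill()
decline = run_start - s

run_limit = 5       # assumed: at least 5 consecutive falling rows
drop_limit = 0.05   # assumed: at least 0.05 of total decline

df['sharp_drop'] = falling & (run_len >= run_limit) & (decline >= drop_limit)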

Why is renaming columns in pandas so slow?

Given a large data frame (in my case 250M rows and 30 cols), why is it so slow to just change the name of a column?
I am using df.rename(columns={'oldName':'newName'}, inplace=True), so this should not make any copies of the data, yet it is taking over 30 seconds, while I would have expected it to be on the order of milliseconds (it's just replacing one string with another).
I know that's a huge table, more than most people have RAM in their machine (hence I'm not going to add example code either), but still, this shouldn't take any significant amount of time since it's not actually touching any of the data. Why does it take so long, i.e. why is renaming a column doing work proportional to the number of rows of my dataframe?
I don't think inplace=True avoids copying your data. There are some discussions on SO saying it actually does copy and then assigns back. Also see this GitHub issue.
You can just override the columns with:
df.columns = df.columns.to_series().replace({'a':'b'})
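A small illustration of why this sidesteps the rename cost, using hypothetical column names: assigning a new list to df.columns replaces just the column Index object, so it should not scale with the number of rows.
import pandas as pd

# Hypothetical frame; in practice this would be the 250M-row table.
df = pd.DataFrame({'oldName': range(3), 'other': range(3)})

# Build the new labels and swap the Index object in one assignment.
df.columns = ['newName' if c == 'oldName' else c for c in df.columns]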

Pandas join is slow

Edit (Oct 16 2017): I think I found the problem; it seems to be a bug in pandas core. It can't merge/join anything over 145k rows, while 144k rows it can do without an issue. Pandas version 0.20.3, running on Fedora 26.
----Original post----
I have a medium size amount of data to process (about 200k rows with about 40 columns). I've optimised a lot of the code, but the only trouble I have now is joining the columns.
I receive the data in an unfortunate structure and need to extract the data in a certain way, then put it all into a dataframe.
Basically I extract 2 arrays at a time (each 200k rows long). One array is the timestamp, the other array is the values.
Here I create a dataframe, and use the timestamp as the index.
When I extract the second block of data, I do the same and create a new dataframe using the new values + timestamp.
I need to join the two dataframes on the index. The timestamps can be slightly different, so I join with how='outer' to keep the new timestamps. Basically I follow the documentation below.
result = left.join(right, how='outer')
https://pandas.pydata.org/pandas-docs/stable/merging.html#joining-on-index
This however is way too slow. I left it for about 15 mins and it still hadn't finished processing, so I killed the process.
Can anyone help? Any hints/tips?
edit:
It's a work thing, so I can't give out the data, sorry. But it's just two long dataframes, each with a timestamp as the index and a single column for the values.
The code is just as described above.
data_df.join(variable_df, how='outer')
I forgot to answer this. It's not really a bug in pandas.
The timestamp was a nanosecond timestamp, and joining on that index was causing a massive slowdown. Basically it was better to join on a column, which made it all much faster.
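A minimal sketch of that fix, assuming both frames carry their timestamps in an index named 'timestamp' (adjust to the real name): move the index back into an ordinary column and do the outer merge on that column.
import pandas as pd

merged = pd.merge(
    data_df.reset_index(),      # the index becomes a regular 'timestamp' column
    variable_df.reset_index(),
    on='timestamp',
    how='outer',
)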