Multi-process Pandas Groupby Failing to do Anything After Processes Kick Off

I have a fairly large pandas dataframe that is about 600,000 rows by 50 columns. I would like to perform a groupby.agg(custom_function) to get the resulting data. The custom_function takes the first non-null value in the series, or returns null if all values in the series are null. (My dataframe is hierarchically sorted by data quality: the first occurrence of a unique key has the most accurate data, but in the event of null data in the first occurrence I want to take values from the second occurrence, and so on.)
I have found the basic groupby.agg(custom_function) syntax to be slow, so I implemented multiprocessing to speed up the computation. When this code is applied to a dataframe that is ~10,000 rows long, the computation takes a few seconds; however, when I try to use the entirety of the data, the process seems to stall out. Multiple processes kick off, but memory and CPU usage stay about the same and nothing gets done.
Here is the trouble portion of the code:
import concurrent.futures

import tqdm

# Create list of individual dataframes to feed map/multiprocess function
grouped = combined.groupby(['ID'])
grouped_list = [group for name, group in grouped]
length = len(grouped)

# Multi-process execute single pivot function
print('\nMulti-Process Pivot:')
with concurrent.futures.ProcessPoolExecutor() as executor:
    with tqdm.tqdm(total=length) as progress:
        futures = []
        for df in grouped_list:
            future = executor.submit(custom_function, df)
            future.add_done_callback(lambda p: progress.update())
            futures.append(future)
        results = []
        for future in futures:
            result = future.result()
            results.append(result)
I think the issue has something to do with the multi-processing (maybe queuing up a job this large is the issue?). I don't understand why a fairly small job creates no issues for this code, but increasing the size of the input data seems to hang it up rather than just execute more slowly. If there is a more efficient way to take the first value in each column per unique ID, I'd be interested to hear it.
Thanks for your help.
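A minimal sketch of a vectorized alternative, assuming custom_function is exactly "first non-null value per column within each group": pandas' built-in GroupBy.first() already does this, skipping nulls in C-level code, so it may replace both custom_function and the multiprocessing entirely.

import numpy as np
import pandas as pd

# Toy stand-in for 'combined' from the question: rows are sorted by data
# quality, so the first non-null value per ID is the most accurate one.
combined = pd.DataFrame({
    'ID':  [1, 1, 2, 2],
    'val': [np.nan, 10.0, 5.0, 6.0],
    'txt': ['x', 'y', None, 'z'],
})

# GroupBy.first() returns the first non-null entry of each column within
# each group, and NaN if the whole group is null.
result = combined.groupby('ID').first()
print(result)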

Related

Speed-up search in pandas python

I am trying to perform an analysis on some data; however, it needs to run much faster!
These are the steps that I follow. Please recommend any solutions that you think might speed up the processing time.
ts is a datetime object and the "time" column in Data is in epoch time. Note that Data might include up to 500000 records.
Data = pd.DataFrame(RawData) # (RawData is a list of lists)
Data.loc[:, 'time'] = pd.to_datetime(Data.loc[:, 'time'], unit='s')
I find the index of the first row in Data which has a time object greater than my ts as follows:
StartIndex = Data.loc[:, 'time'].searchsorted(ts)
StartIndex is usually very low and is found within a few records from the beginning; however, I have no idea if the size of Data would affect finding this index.
Now we get to the hard part: within Data there is a column called "PDNumber". I have two other variables called Max_1 and Min_1. I have to find the index of the first row in which the "PDNumber" value goes above Max_1 or falls below Min_1. The search starts from StartIndex and runs through the end of the dataframe; whichever condition happens first, the search stops, and the found index is called SecondIndex. Now we have another two variables called Max_2 and Min_2. Again, we have to search the "PDNumber" column, starting from SecondIndex, to find the index of the first row that goes above Max_2 or falls below Min_2; this index is called ThirdIndex.
Right now, I use a for loop that advances an index by 1 in each step until it reaches the SecondIndex, and then a while loop (with a counter) through the rest of the dataframe to find the ThirdIndex.
Any suggestions on speeding up the process time?
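A hedged sketch of a vectorized alternative (the names Data, ts, Max_1/Min_1, Max_2/Min_2 follow the question; the sample values are made up). Instead of looping row by row, build a boolean mask of the threshold crossing and take the position of the first True with np.argmax:

import numpy as np
import pandas as pd

# Made-up stand-ins for the question's inputs.
Data = pd.DataFrame({'time': pd.to_datetime(np.arange(10), unit='s'),
                     'PDNumber': [5, 5, 6, 9, 2, 5, 11, 1, 5, 5]})
ts = pd.Timestamp('1970-01-01 00:00:02')
Max_1, Min_1 = 8, 3
Max_2, Min_2 = 10, 0

StartIndex = Data['time'].searchsorted(ts)

pd_vals = Data['PDNumber'].to_numpy()

# First crossing of the Max_1/Min_1 band at or after StartIndex.
# Caveat: np.argmax returns 0 when no element is True, so real code
# should check crossed.any() first.
crossed = (pd_vals[StartIndex:] > Max_1) | (pd_vals[StartIndex:] < Min_1)
SecondIndex = StartIndex + int(np.argmax(crossed))

# First crossing of the Max_2/Min_2 band at or after SecondIndex.
crossed = (pd_vals[SecondIndex:] > Max_2) | (pd_vals[SecondIndex:] < Min_2)
ThirdIndex = SecondIndex + int(np.argmax(crossed))

print(StartIndex, SecondIndex, ThirdIndex)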

Output Dataframe to CSV File using Repartition and Coalesce

Currently, I am working on a single-node Hadoop setup, and I wrote a job to output a sorted dataframe, with only one partition, to a single CSV file. I discovered several different outcomes when using repartition differently.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks instead of in an overall manner.
Then, I tried discarding the repartition function, but the output was only a part of the records. I realized that without repartition, Spark outputs 200 CSV files instead of 1, even though I am working on a one-partition dataframe.
Thus, what I did next was placing repartition(1), repartition(1, "column of partition"), and repartition(20) before orderBy. Yet the output remained the same, with 200 CSV files.
So I used the coalesce(1) function before orderBy, and the problem was fixed.
I do not understand why working on a single-partition dataframe requires repartition or coalesce, and how the aforesaid processes affect the output. I would be grateful if someone could elaborate a little.
Spark has relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform operations like the sort in your case, Spark triggers something called a shuffle operation:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe into spark.sql.shuffle.partitions partitions.
I also struggled with the same problem and did not find an elegant solution.
Spark generally doesn't have a strong concept of ordered data, because all your data is split across multiple partitions. And every time you call an operation that requires a shuffle, your ordering will be changed.
For this reason, you're better off only sorting your data in Spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger.
As Miroslav points out, your data gets shuffled between partitions every time you trigger what's called a shuffle stage (this includes things like grouping, join, or window operations).
You can set the number of shuffle partitions in the Spark config; the default is 200.
Calling repartition before a group-by operation is kind of pointless, because Spark needs to repartition your data again to execute the groupby.
Coalesce operations sometimes get pushed into the shuffle stage by Spark, so maybe that's why it worked. Either that, or because you called it after the groupby operation.
A good way to understand what's going on with your query is to start using the Spark UI; it's normally available at http://localhost:4040.
More info here: https://spark.apache.org/docs/3.0.0-preview/web-ui.html
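A minimal sketch of the two knobs discussed above (the data and the output path are hypothetical). Lowering spark.sql.shuffle.partitions controls how many partitions a shuffle produces, and coalesce(1) collapses the result into a single output file:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("single-csv-sketch")
         # Shuffles (orderBy, groupBy, joins) produce this many partitions;
         # the default is 200, which is why 200 CSV files appeared.
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())

df = spark.range(1000).withColumnRenamed("id", "value")

# Sort globally, then collapse to one partition so a single, overall-sorted
# CSV file is written. Note this serializes the write through one task and
# will not scale to very large datasets.
(df.orderBy("value")
   .coalesce(1)
   .write.mode("overwrite")
   .csv("/tmp/sorted_single_csv", header=True))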

Is there an idiomatic way to cache Spark dataframes?

I have a large parquet dataset that I am reading with Spark. Once read, I filter for a subset of rows which are used in a number of functions that apply different transformations:
The following is similar but not exact logic to what I'm trying to accomplish:
df = spark.read.parquet(file)
special_rows = df.filter(col('special') > 0)
# Thinking about adding the following line
special_rows.cache()

def f1(df):
    new_df_1 = df.withColumn('foo', lit(0))
    return new_df_1

def f2(df):
    new_df_2 = df.withColumn('foo', lit(1))
    return new_df_2

new_df_1 = f1(special_rows)
new_df_2 = f2(special_rows)
output_df = new_df_1.union(new_df_2)
output_df.write.parquet(location)
Because a number of functions might use this filtered subset of rows, I'd like to cache or persist it in order to potentially improve execution speed / memory consumption. I understand that in the above example, there is no action called until my final write to parquet.
My question is: do I need to insert some sort of call to count(), for example, in order to trigger the caching, or will Spark, during that final write to parquet, be able to see that this dataframe is being used in f1 and f2 and cache it itself?
If yes, is this an idiomatic approach? Does this mean that production, large-scale Spark jobs relying on caching frequently use random operations that pre-emptively force an action on the dataframe, such as a call to count?
Both of your statements,
"there is no action called until my final write to parquet"
and
"Spark during that final write to parquet call will be able to see that this dataframe is being used in f1 and f2 and will cache the dataframe itself",
are correct. If you do output_df.explain(), you will see the query plan, which will confirm what you said.
Thus, there is no need to do special_rows.cache(). Generally, cache is only necessary if you intend to reuse the dataframe after forcing Spark to calculate something, e.g. after write or show. If you see yourself intentionally calling count(), you're probably doing something wrong.
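A small sketch of that check, with a made-up dataframe standing in for the question's parquet read: once .cache() has been called, the plan printed by explain() contains an InMemoryRelation node for special_rows, confirming that both branches of the union would read from the cache.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for spark.read.parquet(file).
df = spark.range(10).withColumn('special', col('id') % 2)

special_rows = df.filter(col('special') > 0).cache()
output_df = (special_rows.withColumn('foo', lit(0))
             .union(special_rows.withColumn('foo', lit(1))))

# Look for InMemoryRelation / InMemoryTableScan in the printed plan.
output_df.explain()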
You might want to repartition after running special_rows = df.filter(col('special') > 0). There can be a large number of empty partitions after running a filtering operation, as explained here.
Computing new_df_1 will cause special_rows to be cached, and the cache will then be reused by new_df_2 in new_df_1.union(new_df_2). That's not necessarily a performance optimization, though. Caching is expensive. I've seen caching slow down a lot of computations, even when it's being used in a textbook manner (i.e. caching a DataFrame that gets reused several times downstream).
Counting does not necessarily ensure the data is cached. Counts avoid scanning rows whenever possible: they'll use the Parquet metadata when they can, which means they don't cache all the data like you might expect.
You can also "cache" data by writing it to disk. Something like this:
df.filter(col('special') > 0).repartition(500).write.parquet("some_path")
special_rows = spark.read.parquet("some_path")
To summarize, yes, the DataFrame will be cached in this example, but it's not necessarily going to make your computation run any faster. It might be better to have no cache or to "cache" by writing data to disk.

Why is renaming columns in pandas so slow?

Given a large data frame (in my case 250M rows and 30 cols), why is it so slow to just change the name of a column?
I am using df.rename(columns={'oldName': 'newName'}, inplace=True), so this should not make any copies of the data, yet it is taking over 30 seconds, while I would have expected it to be on the order of milliseconds (as it's just replacing one string with another).
I know that's a huge table, more than most people have RAM in their machine (hence I'm not going to add example code either), but still, this shouldn't take any significant amount of time, as it's not actually touching any of the data. Why does this take so long, i.e. why is renaming a column doing work proportional to the number of rows of my dataframe?
I don't think inplace=True actually avoids copying your data. There is some discussion on SO saying it does copy and then assigns back; also see this github issue.
You can just override the columns with:
df.columns = df.columns.to_series().replace({'a': 'b'})
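A hedged alternative along the same lines: build the new label list yourself and assign it in one shot ('oldName'/'newName' are the placeholder names from the question), which swaps the Index object without going through rename's internals.

import pandas as pd

df = pd.DataFrame({'oldName': [1, 2], 'other': [3, 4]})

# Replacing the whole Index touches only column labels, never row data.
df.columns = ['newName' if c == 'oldName' else c for c in df.columns]
print(df.columns)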

Apply np.random.rand to groups - optimization issue

I need to optimize a single line of code that will be executed tens of thousands of times during the calculations, and hence timing becomes an issue. It seems simple, but I've really got stuck.
The line is:
df['Random']=df['column'].groupby(level=0).transform(lambda x: np.random.rand())
So I want to assign the same random number to each group and then "ungroup". Since rand() is called many times with this implementation, the code is very inefficient.
Can someone help in vectorizing this?
Try this!
import numpy as np
import pandas as pd

df = pd.DataFrame(np.sort(np.random.randint(2, 5, 50)), columns=['column'])
uniques = df['column'].unique()
final = df.merge(pd.Series(np.random.rand(len(uniques)), index=uniques).to_frame(),
                 left_on='column', right_index=True)
You can store the uniques and then run the last line every time to get new random numbers and join them with df.
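Another hedged sketch, using factorize to broadcast one random draw per group without a merge (this version groups on index level 0, matching the question's level=0; the sample frame is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'column': range(6)},
                  index=pd.Index([1, 1, 2, 2, 3, 3], name='ID'))

# One random number per distinct level-0 value, broadcast back by group code.
codes, uniques = pd.factorize(df.index.get_level_values(0))
df['Random'] = np.random.rand(len(uniques))[codes]
print(df)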