Apply np.random.rand to groups - optimization issue - pandas

I need to optimize a single line of code that will be executed tens of thousands of times during the calculations, so timing becomes an issue. It seems simple, but I really got stuck.
The line is:
df['Random']=df['column'].groupby(level=0).transform(lambda x: np.random.rand())
So I want to assign the same random number to each group and then "ungroup". Since rand() is called many times with this implementation, the code is very inefficient.
Can someone help me vectorize this?

Try this!
import numpy as np
import pandas as pd

df = pd.DataFrame(np.sort(np.random.randint(2, 5, 50)), columns=['column'])
uniques = df['column'].unique()
final = df.merge(pd.Series(np.random.rand(len(uniques)), index=uniques).to_frame(),
                 left_on='column', right_index=True)
You can store uniques and then rerun the last statement every time you need new random numbers joined with df.
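A map-based variant avoids the merge entirely; this is only a sketch in the spirit of the answer above (for groups defined by index level 0, as in the question, the same idea works by mapping df.index.get_level_values(0) instead of the column):
import numpy as np
import pandas as pd

# Same toy frame as above: 50 rows, groups defined by the values in 'column'.
df = pd.DataFrame(np.sort(np.random.randint(2, 5, 50)), columns=['column'])

# Draw one random number per unique group value, then map it back onto every row.
uniques = df['column'].unique()
rand_per_group = pd.Series(np.random.rand(len(uniques)), index=uniques)
df['Random'] = df['column'].map(rand_per_group)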

Related

Opening and concatenating a list of files in a single pandas DataFrame avoiding loops

I've been reading different posts here on this topic, but I didn't really manage to find an answer.
I have a folder with 50 files (total size around 70GB). I would like to open them and create a single DataFrame on which to perform computations.
Unfortunately, I run out of memory if I try to concatenate the files immediately after I open them.
I know I can operate on chunks and work on smaller subsets, but these 50 files are already a small portion of the entire dataset.
Thus, I found an alternative that works (using for loops and deleting list entries at each iteration), but it obviously takes too much time. I also tried "pd.append", but in that case I run out of memory.
df_list = []
for file in lst:
    df_list.append(open_file(file))
    # print(file)

result = df_list[len(df_list) - 1]
del df_list[len(df_list) - 1]
for i in range(len(df_list) - 1, -1, -1):
    print(i)
    result = pd.concat([result, df_list[i]], ignore_index=True)
    del df_list[i]
Although it works, I feel like I'm doing things twice. Furthermore, putting pd.concat in a loop is a very bad idea, since the time increases exponentially (Why does concatenation of DataFrames get exponentially slower?).
Does anyone have any suggestions?
Right now, opening and concatenation take 75 mins + 105 mins; I hope to reduce this time.
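For what it's worth, trimming each file while reading it and concatenating once, over a generator, often fits in memory where repeated concatenation does not. This is only a sketch: the file pattern, column names, and dtypes below are placeholders standing in for the real data, and open_file mirrors the function from the question.
import glob
import pandas as pd

# Placeholder for the list of 50 files from the question.
lst = sorted(glob.glob('data_folder/*.csv'))

def open_file(path):
    # Read only the columns that are actually needed, with explicitly smaller dtypes,
    # to cut per-file memory use ('col_a'/'col_b' and their dtypes are placeholders).
    return pd.read_csv(path,
                       usecols=['col_a', 'col_b'],
                       dtype={'col_a': 'float32', 'col_b': 'int32'})

# No intermediate list of frames: concatenate once over a generator.
result = pd.concat((open_file(f) for f in lst), ignore_index=True)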

Filtering option causing python to hang - how to debug?

I am preprocessing large datasets to get them ready for clustering operations. I have a script that reads the data from CSV and performs various checks for missing data, erroneous values, etc. Until now, everything has worked as expected. However, when I ran the script yesterday, it started to hang on a simple filtering operation. The source data has not changed, but somehow processing can't get past this line. I have isolated the problem by moving the following lines of code to another file, and the same issue is observed:
import pandas as pd
df = pd.read_csv('data1.csv',index_col=0)
# Get list of columns of interest for first check
columns = [col for col in df.columns if 'temp' in col]
# Find indices where any value of a column of interest has a value of 1
indices = list(df[df[columns]==1].dropna(how='all').index)
This previously ran fine, correctly identifying indices with this '1' flag in 'columns'. Now (and with no changes to the code or source data), it hangs on the indices line. I further broke it down to identify the specific problem: df[columns]==1 runs fine, but grabbing the df filtered on this condition (df[df[columns]==1]) is the line that hangs.
How can I troubleshoot what the problem is? Since I had not made any changes when it last worked, I am perplexed. What could possibly be the cause? Thanks in advance for any tips.
EDIT: The below approach seems to be drastically faster and solved the problem:
indices = df[(df[columns]==1).any(1)].index
When tested on a subset of the whole df, it accomplished the task in 0.015 seconds, while the prior method took 15.0 seconds.
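For reference, here is a small self-contained check of the two expressions on synthetic data (the column names below are made up, not the ones in data1.csv):
import numpy as np
import pandas as pd

# Synthetic stand-in for the real CSV: a couple of 'temp' columns plus unrelated ones.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(100_000, 4)),
                  columns=['temp_a', 'temp_b', 'pressure', 'flow'])
columns = [col for col in df.columns if 'temp' in col]

slow = list(df[df[columns] == 1].dropna(how='all').index)    # original approach
fast = list(df[(df[columns] == 1).any(axis=1)].index)        # boolean-mask approach
assert slow == fast  # both select the rows where any 'temp' column equals 1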

Multi-process Pandas Groupby Failing to do Anything After Processes Kick Off

I have a fairly large pandas dataframe that is about 600,000 rows by 50 columns. I would like to perform a groupby.agg(custom_function) to get the resulting data. The custom_function is a function that takes the first non-null value in the series, or returns null if all values in the series are null. (My dataframe is hierarchically sorted by data quality; the first occurrence of a unique key has the most accurate data, but in the event of null data in the first occurrence I want to take values from the second occurrence, and so on.)
I have found the basic groupby.agg(custom_function) syntax is slow, so I have implemented multi-processing to speed up the computation. When this code is applied over a dataframe that is ~10,000 rows long the computation takes a few seconds, however, when I try to use the entirety of the data, the process seems to stall out. Multiple processes kick off, but memory and cpu usage stay about the same and nothing gets done.
Here is the trouble portion of the code:
import concurrent.futures
import tqdm

# Create a list of individual per-ID dataframes to feed the map/multiprocess function
grouped = combined.groupby(['ID'])
grouped_list = [group for name, group in grouped]
length = len(grouped)

# Multi-process execution of the single pivot function
print('\nMulti-Process Pivot:')
with concurrent.futures.ProcessPoolExecutor() as executor:
    with tqdm.tqdm(total=length) as progress:
        futures = []
        for df in grouped_list:
            future = executor.submit(custom_function, df)
            future.add_done_callback(lambda p: progress.update())
            futures.append(future)

        results = []
        for future in futures:
            result = future.result()
            results.append(result)
I think the issue has something to do with the multi-processing (maybe queuing up a job this large is the issue?). I don't understand why a fairly small job creates no issues for this code, but increasing the size of the input data seems to hang it up rather than just execute more slowly. If there is a more efficient way to take the first value in each column per unique ID, I'd be interested to hear it.
Thanks for your help.
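On the last point, it may be worth checking whether the built-in aggregation already does what the custom function does before reaching for multiprocessing. A minimal sketch on a toy frame (the real frame is `combined` with an 'ID' column, as in the code above):
import numpy as np
import pandas as pd

# Toy stand-in for `combined`: two rows per ID, best-quality row first.
combined = pd.DataFrame({
    'ID':   [1, 1, 2, 2],
    'colA': [np.nan, 10.0, 7.0, 8.0],
    'colB': ['x', 'y', None, 'z'],
})

# GroupBy.first() returns, per group, the first non-null value in each column,
# which matches the "first non-null per unique ID" logic without a custom function.
first_non_null = combined.groupby('ID', sort=False).first()
# ID 1 -> colA 10.0, colB 'x'; ID 2 -> colA 7.0, colB 'z'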

Apache Spark: count vs head(1).isEmpty

For a given spark df, I want to know if a certain column has null value or not. The code I had was -
if (df.filter(col(colName).isNull).count() > 0) {//throw exception}
This was taking a long time and was being called 2 times for 1 df, since I was checking 2 columns. Each time it was called, I saw a count job, so 2 jobs for 1 df.
I then changed the code to look like this -
if (!df.filter(col(colName).isNull).head(1).isEmpty) {//throw exception}
With this change, I now see 4 head jobs compared to the 2 count jobs before, increasing the overall time.
Can you experts please help me understand why the number of jobs doubled? The head function should be called only 2 times.
Thanks for your help!
N
Update: added screenshot showing the jobs for both cases. The left side shows the one with count and right side is the head. That's the only line that is different between the 2 runs.
dataframe.head(1) does 2 things -
1. Executes the action behind the dataframe on the executor(s).
2. Collects the 1st row of the result from the executor(s) to the driver.
dataframe.count() does 2 things -
1. Executes the action behind the dataframe on the executor(s). If there are no transformations on the file and the Parquet format is used, then it is basically scanning the statistics of the file(s).
2. Collects the count from the executor(s) to the driver.
Given that the source of the dataframe is a file format that stores statistics and there are no transformations, count() can run faster than head.
I am not 100% sure why there are 2 jobs vs 4. Could you please paste the screenshot?
It's hard to say just by looking at this line of code, but there is one reason why head can take more time. head is a deterministic request: if you have a sort or orderBy anywhere in the plan, it will trigger a shuffle so that the first row returned is always the same. With count you don't need the result ordered, so there is no need for a shuffle, basically a simple map-reduce step. That is probably why your head can take more time.
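As a side note, if the goal is only to detect nulls in two columns, both checks can be folded into a single aggregation so that one job runs per DataFrame. A rough sketch in PySpark (the question uses Scala; the column names and input path here are placeholders):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('data.parquet')  # placeholder input

# Count the nulls of both columns in one aggregation, i.e. a single job.
null_counts = df.select(
    F.sum(F.col('colA').isNull().cast('int')).alias('colA_nulls'),
    F.sum(F.col('colB').isNull().cast('int')).alias('colB_nulls'),
).first()

if null_counts['colA_nulls'] or null_counts['colB_nulls']:
    raise ValueError('Null values found in colA or colB')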

Pig: how to loop through all fields/columns?

I'm new to Pig. I need to do some calculations for all fields/columns in a table. However, I can't find a way to do it by searching online. It would be great if someone here could give some help!
For example: I have a table with 100 fields/columns, most of them numeric. I need to find the average of each field/column; is there an elegant way to do it without repeating AVG(column_xxx) 100 times?
If there were just one or two columns, then I could do
B = GROUP A ALL;
C = FOREACH B GENERATE AVG(A.column_1), AVG(A.column_2);
However, if there are 100 fields, it's really tedious to write AVG 100 times, and it's easy to make errors.
One way I can think of is to embed Pig in Python and use Python to generate such a string and pass it to compile(). However, that still sounds weird even if it works.
Thank you in advance for help!
I don't think there is a nice way to do this with Pig. However, this should work well enough and can be done in 5 minutes:
1. Describe the table (or alias) in question.
2. Copy the output and reorganize it manually into the script part you need (for example with Excel).
3. Finish and store the script.
If you need to cope with columns that can suddenly change etc., there is probably no good way to do it in Pig. Perhaps you could read in all the columns (in R, for example) and do your operation there.
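For completeness, the script-generation idea mentioned in the question can be sketched in a few lines of Python (the relation name A, the input path, and the column names below are placeholders):
# Build the long GENERATE clause programmatically instead of typing AVG(...) 100 times.
columns = [f'column_{i}' for i in range(1, 101)]
schema = ', '.join(f'{c}:double' for c in columns)
avg_exprs = ', '.join(f'AVG(A.{c})' for c in columns)

pig_script = f"""
A = LOAD 'input_data' USING PigStorage(',') AS ({schema});
B = GROUP A ALL;
C = FOREACH B GENERATE {avg_exprs};
STORE C INTO 'averages';
"""
print(pig_script)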