Improving for loop over groupby keys of a pandas dataframe and concatenating - pandas

I am currently in a scenario in which I have a (really inefficient) for loop that processes a pandas dataframe df with a complicated aggregation function (say complicated_function_1) over some groupby keys (collected in groupby_keys), as follows:
final_result = []
for groupby_key in groupby_keys:
    small_df = complicated_function_1(df.loc[df['groupby_key'] == groupby_key])
    final_result.append(small_df)
df_we_want = pd.concat(final_result)
I am trying to see if there is a more efficient way of dealing with this, rather than having to use a for loop, especially when there are many groupby_keys.
Is it possible to convert all of this into a single function so I can just groupby and agg/concatenate or pipe? Or is this procedure doomed to be constrained to the for loop? I have tried multiple combinations but keep getting an assortment of errors (I have not really been able to pin down anything specific).
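For context, the groupby form being asked about would look something like this. This is only a sketch, reusing the names above and assuming complicated_function_1 accepts a sub-dataframe and returns a dataframe:

import pandas as pd

# Group on the key column and apply the per-group function;
# pandas concatenates the per-group results back into one dataframe.
df_we_want = (
    df.groupby('groupby_key', group_keys=False)
      .apply(complicated_function_1)
)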

Related

Pandas dataframe sharing between functions isn't working

I have a script that modifies a pandas dataframe with several concurrent functions (asyncio coroutines). Each function adds rows to the dataframe and it's important that the functions all share the same dataframe object. However, when I add a row with pd.concat a new copy of the dataframe is created. I can tell because each dataframe now has a different memory location as given by id().
As a result the functions no longer share the same object. How can I keep all functions pointed at a common dataframe object?
Note that this issue doesn't arise when I use the append method, but that is being deprecated.
pandas dataframes are efficient because they use contiguous memory blocks, frequently of fundamental types like int and float. You can't just add a row because the dataframe doesn't own the next bit of memory it would have to expand into. Concatenation usually requires that new memory is allocated and data is copied. Once that happens, anything still referring to the original dataframe is pointing at the old object, not the new one.
If you know the final size you want, you can preallocate and fill. Otherwise, you are better off keeping a list of new dataframes and concatenating them all at once. Since these are parallel procedures, they aren't dependent on each other's output, so this may be a feasible option.
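A rough sketch of the list-and-concat-once pattern under those assumptions (the worker coroutine and record values here are illustrative, not from the original code):

import asyncio
import pandas as pd

rows = []                                   # shared list every coroutine appends to

async def worker(record):
    # append a one-row dataframe instead of concatenating into a shared frame
    rows.append(pd.DataFrame([record]))

async def main():
    await asyncio.gather(*(worker({"value": i}) for i in range(3)))

asyncio.run(main())
df = pd.concat(rows, ignore_index=True)     # single concatenation at the end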

How to work on the whole column of a spark dataframe at once while looping columns?

I have a pyspark df. I want to work on a whole column at once, without looping one row at a time.
On the other hand, I do want to loop within each row's array.
I think the best solution would be a pandas UDF, but I couldn't find a way to do it.
I created a pandas UDF which does what I am trying to do, but it's a waste of time since it does the same work row by row rather than all at once.
Here is an example of what I built.
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def hodges(v):
    res = []
    for x in v:
        # half the median of all pairwise sums within the row's array
        res.append(0.5 * np.median([x[i] + x[j]
                                    for i in range(len(x))
                                    for j in range(i + 1, len(x))]))
    return pd.Series(res)
This works perfectly. I send this UDF an array column, built by combining all columns of the df into a single array column.
But you can see that what I am trying to do is loop over each row's array. In this UDF I also loop over the column itself with "for x in v", which I don't want; I want the operation performed on the whole column simultaneously.
I also tried avoiding UDFs entirely by doing it with a withColumn operation in pyspark, but my driver fails. I guess it's some memory problem.
Anyway, I will be happy to hear a pandas UDF approach. I emphasize that my dataframe is a Spark dataframe; I don't want to convert it to pandas, just use a pandas UDF on the Spark df as I did above, except without going one row at a time.
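For what it's worth, one possible way to vectorize inside the pandas UDF is to stack the batch and let numpy broadcast over it. This is only a sketch, assuming every row's combined array has the same length; hodges_vectorized is an illustrative name:

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def hodges_vectorized(v):
    X = np.stack(v.tolist())                 # shape (batch, n): one array per dataframe row
    i, j = np.triu_indices(X.shape[1], k=1)  # all index pairs with i < j
    pair_sums = X[:, i] + X[:, j]            # pairwise sums for every row at once
    return pd.Series(0.5 * np.median(pair_sums, axis=1))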

Concatenate more than 2 dataframes side by side using a for loop

I am new to Pandas and was curious to know if I can merge more than 2 dataframes (generated within a for loop) side by side?
Use the pandas library's concat function.
Here's documentation: https://pandas.pydata.org/docs/reference/api/pandas.concat.html
(Avoid looping over dataframes at all costs. It's much slower than the tools provided to you by pandas. In most cases, there's probably a pandas function for it, or a few you can use together to achieve the same thing.)
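A minimal sketch of that pattern (the loop, build_frame-style pieces, and column names here are illustrative): collect the pieces in a list inside the loop, then concatenate once along the columns.

import pandas as pd

parts = []                                        # collect each piece inside the loop
for i in range(3):                                # stand-in for the real loop
    part = pd.DataFrame({f"col_{i}": range(5)})   # hypothetical per-iteration dataframe
    parts.append(part)

wide = pd.concat(parts, axis=1)                   # axis=1 lines the frames up side by side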

how to merge small dataframes into a large without copy

I have a large pandas dataframe and want to merge a couple of smaller dataframes into it, thus adding more columns. However, there seems to be an implicit copy of the large dataframe after each merge, which I want to avoid. What's the most efficient way to do this? (Note the resulting dataframe will have the same rows; it only grows with more columns.) map seems better, as it keeps the original dataframe, but there is overhead in creating the dictionary, and I'm not sure it works when merging multiple columns into the main one. Or maybe the merge is not deep copying everything internally?
Base case:
id(df) # before merge
df = df.merge(df1[["sid", "col1"]], how="left", on=["sid"])
id(df) # will be different <-- trying to avoid copying df every time a smaller one merged into it
df = df.merge(df2[["sid", "key2", "col2"]], how="left", on=["sid", "key2"])
id(df) # will be different
...
Using map():
d_col1 = {d["sid"]:d["col1"] for d in df1[["sid", "col1"]].to_dict("records")}
df["col1"] = df["sid"].map(d_col1)
id(df) # this is the same object
Some posts referred to dask; I haven't tested that yet.
Here is another way. First, map can be done with a Series; since df1 is already built, I don't know whether it is less efficient than using a dictionary, though.
df["col1"] = df["sid"].map(df1.set_index('sid')['col1'])
Now, with two or more key columns, you can play with the index:
df['col2'] = (
    df2.set_index(['sid', 'key2'])['col2']
       .reindex(pd.MultiIndex.from_frame(df[['sid', 'key2']]))
       .to_numpy()
)

Spark group by several fields several times on the same RDD

My data is stored in csv format, and the headers are given in the column_names variable.
I wrote the following code to read it into a python dictionary RDD
rdd = sc.textFile(hdfs_csv_dir)\
    .map(lambda x: x.split(','))\
    .filter(lambda row: len(row) == len(column_names))\
    .map(lambda row: dict([(column, row[index]) for index, column in enumerate(column_names)]))
Next, I wrote a function that counts the combinations of column values given the column names
import operator

def count_by(rdd, cols=[]):
    '''
    Equivalent to:
    SELECT col1, col2, COUNT(*) FROM MX3 GROUP BY col1, col2;
    But the number of columns can be more than 2
    '''
    counts = rdd.map(lambda x: (','.join([str(x[c]) for c in cols]), 1))\
        .reduceByKey(operator.add)\
        .map(lambda t: t[0].split(',') + [t[1]])\
        .collect()
    return counts
I am running count_by several times, with a lot of different parameters on the same rdd.
What is the best way to optimize the query and make it run faster?
First, you should cache the RDD (by calling cachedRdd = rdd.cache()) before passing it multiple times into count_by, to prevent Spark from loading it from disk for each operation. Operating on a cached RDD means data will be loaded to memory upon first use (first call to count_by), then read from memory for following calls.
You should also consider using Spark DataFrame API instead of the lower-level RDD API, since:
You seem to articulate your intentions using SQL, and the DataFrame API allows you to actually use such a dialect
When using DataFrames, Spark can perform some extra optimizations since it has a better understanding of what you are trying to do, so it can design the best way to achieve it. A SQL-like dialect is declarative: you only say what you want, not how to get it, which gives Spark more freedom to optimize
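As a rough sketch of that suggestion (assuming a SparkSession named spark, and the same hdfs_csv_dir and column_names as above; count_by_df is an illustrative name):

# Read the CSV and name the columns from the known header list
# (assumes the file's column count matches column_names).
df = spark.read.csv(hdfs_csv_dir).toDF(*column_names)
df.cache()                      # keep the parsed data in memory across repeated queries

def count_by_df(df, cols):
    # GROUP BY the requested columns and count, like the SQL in the docstring above
    return df.groupBy(*cols).count().collect()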