Pandas dataframe sharing between functions isn't working - pandas

I have a script that modifies a pandas dataframe with several concurrent functions (asyncio coroutines). Each function adds rows to the dataframe, and it's important that the functions all share the same dataframe. However, when I add a row with pd.concat a new copy of the dataframe is created. I can tell because each dataframe now has a different memory location as given by id().
As a result the functions no longer share the same object. How can I keep all functions pointed at a common dataframe object?
Note that this issue doesn't arise when I use the append method, but that is being deprecated.

pandas dataframes are efficient because they use contiguous memory blocks, frequently of fundamental types like int and float. You can't just add a row, because the dataframe doesn't own the next bit of memory it would have to expand into. Concatenation usually requires that new memory is allocated and data is copied. Once that happens, referrers to the original dataframe are still pointing at the old, unmodified object and never see the new rows.
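A quick way to see this (toy data, not your script):

import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
before = id(df)

# pd.concat builds a brand-new DataFrame; the original object is untouched,
# so anything still holding a reference to the old object never sees the new row.
df = pd.concat([df, pd.DataFrame({"x": [3]})], ignore_index=True)

print(id(df) == before)  # False: the name now points at a different object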
If you know the final size you want, you can preallocate and fill. Otherwise, you are better off keeping a list of new dataframes and concatenating them all at once. Since these are parallel procedures that aren't dependent on each other's output, this may be a feasible option.
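A minimal sketch of the "collect, then concatenate once" idea; the coroutine is stood in for by a plain function and the column names are made up:

import pandas as pd

# Each worker appends its own small DataFrame to a shared list. Lists are
# mutated in place, so every function keeps seeing the same object; the
# final DataFrame is built once at the end.
pieces = []  # shared between the coroutines

def worker(name):  # stand-in for one of the coroutines
    pieces.append(pd.DataFrame({"source": [name], "value": [42]}))

for n in ("a", "b", "c"):
    worker(n)

result = pd.concat(pieces, ignore_index=True)
print(result)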

Related

Improving for loop over groupby keys of a pandas dataframe and concatenating

I am currently in a scenario in which I have a (really inefficient) for loop that attempts to process a pandas dataframe df with complicated aggregation functions (say complicated_function) over some groupby keys (respectively groupby_keys) as follows:
final_result = []
for groupby_key in groupby_keys:
    small_df = complicated_function_1(df.loc[df['groupby_key'] == groupby_key])
    final_result.append(small_df)
df_we_want = pd.concat(final_result)
I am trying to see if there is a more efficient way of dealing with this, rather than having to use a for loop, especially if there are many groupby_keys.
Is it possible to convert all this into a single function for me to just groupby and agg/concatenate or pipe? Or is this procedure doomed to be constrained to the for loop? I have tried multiple combinations but have been getting an assortment of errors (have not really been able to pin down something specific).
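For what it's worth, a sketch of the groupby-based equivalent being asked about; complicated_function_1 and the sample data below are placeholders, not the real workload:

import pandas as pd

# Placeholder for the per-group aggregation: anything that takes a group's
# rows and returns a (smaller) DataFrame works here.
def complicated_function_1(small_df):
    return small_df.assign(total=small_df["val"].cumsum())

df = pd.DataFrame({"groupby_key": ["a", "a", "b"], "val": [1, 2, 3]})

# groupby(...).apply calls the function once per group and concatenates the
# per-group results, which replaces the explicit loop + pd.concat.
df_we_want = df.groupby("groupby_key", group_keys=False).apply(complicated_function_1)
print(df_we_want)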

Memory Allocation in Spark(df)

I have defined a df in Spark, applied some transformations (filter), and stored the result back in the same df. What happens to the memory allocated to the df?
df = rdd.filter1
df = df.filter2
df.filter3
df.filter4
There are two ways to look at it:
A DataFrame is immutable, so every operation you apply to it creates a new DataFrame (with fresh memory allocated). The 'df' name will ultimately point to the DataFrame returned by the last 'filter' operation. Since a new DataFrame object is created every time, your question about a 'change in memory allocation' doesn't really apply.
If you mean that multiple filter operations will reduce the data and the memory required, the answer is 'yes': due to the filter operations the DataFrame partitions will shrink and less memory will be occupied.
Actually, nothing happens until you invoke some action, like collecting data to the driver or writing the DataFrame to a file. All transformations in Spark are lazy.
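A minimal PySpark sketch of that laziness (assumes a local Spark installation; the data and names here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("lazy-demo").getOrCreate()

df = spark.createDataFrame([(i,) for i in range(100)], ["n"])
df = df.filter(df["n"] > 10)   # transformation: nothing executed yet
df = df.filter(df["n"] < 50)   # another transformation, still lazy

print(df.count())              # action: the plan is actually executed here
spark.stop()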

Extracting data as a list from a Pandas dataframe while preserving order

Suppose I have some Pandas dataframe df that has a column called "HEIGHT", among many other columns.
If I issue list(df["HEIGHT"]), then this will give me a list of the items in that column in the exact order in which they were in the dataframe, i.e. ordered by the index of the dataframe.
Is that always the case? The df["HEIGHT"] command will return a Series and list() will convert it to a list. But are these operations always order-preserving? Interestingly, in the book by the Pandas author (!), from my reading so far it is unclear to me when these elementary operations preserve order; is order perhaps always preserved, or is there some simple rule to know when order should be preserved?
The order of elements in a pandas Series (i.e., a column in a pandas DataFrame) will not change unless you do something that makes it change. And the order of a python list is guaranteed to reflect insertion order (SO thread).
So yes, df["HEIGHT"].tolist() (slightly faster than list(df["HEIGHT"])) should always yield a Python list of elements in the same order as the elements in df["HEIGHT"].
Order will always be preserved. When you use the list function, you provide it an iterable, and the list is constructed by iterating over it. For more information on iterators, you might want to read PEP 234 on iterators.
The iteration order is determined by the iterator the object provides. Iterators for a Series are provided by pd.Series.__iter__() (the standard way to expose an iterator for an object, which is what list() and similar consumers look up). For more information on iteration and indexing in Pandas, consider reading the relevant API reference section and the much more in-depth indexing documentation.
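A small illustration (made-up data) of the order guarantee both answers describe; the index is deliberately unsorted to show that the list follows row order, not sorted-index order:

import pandas as pd

df = pd.DataFrame({"HEIGHT": [180, 165, 172]}, index=["c", "a", "b"])

assert df["HEIGHT"].tolist() == [180, 165, 172]
assert list(df["HEIGHT"]) == [180, 165, 172]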

Looping through columns to conduct data manipulations in a data frame

One struggle I have with using Python Pandas is repeating the same coding scheme for a large number of columns. For example, below I am trying to create a new column age_b in a data frame called data. How do I easily loop through a long list (100s or even 1000s) of numeric columns, do the exact same thing, with the newly created column names being the existing name plus a prefix or suffix string such as "_b"?
labels = [1,2,3,4,5]
data['age_b'] = pd.cut(data['age'],bins=5, labels=labels)
In general, I have many simple data frame column manipulations or calculations, and it's easy to write the code. However, so often I want to repeat the same process for dozens of columns, and that's when I get bogged down, because most functions or manipulations work for one column but are not easily repeatable across many columns. It would be nice if someone could suggest a looping code "structure". Thanks!
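One possible looping structure, sketched with made-up data and a placeholder list of columns; the pattern is just the question's pd.cut line wrapped in a loop over column names:

import pandas as pd

labels = [1, 2, 3, 4, 5]

# 'data' and the column list are placeholders standing in for the real frame.
data = pd.DataFrame({"age": range(20, 70, 5), "income": range(10, 110, 10)})
numeric_cols = data.select_dtypes("number").columns

# Apply the same pd.cut recipe to every column, writing each result to a new
# column named <original>_b.
for col in numeric_cols:
    data[f"{col}_b"] = pd.cut(data[col], bins=5, labels=labels)

print(data.head())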

Why does pandas.apply() work differently for Series and DataFrame columns

Apologies if this is a silly question, but I am not quite sure why this behavior is the case, and/or whether I am misunderstanding it. I was trying to create a function for the 'apply' method, and noticed that if you run apply on a series, the series is passed as an np.array, while if you pass the same series within a dataframe of 1 column, it is passed as a series to the (u)func.
This affects the way a simpleton like me writes the function (I prefer iloc indexing to integer-based indexing on the array), so I was wondering whether this is on purpose, or a historical accident?
Thanks,
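For illustration, a small sketch of the default behaviour with a plain Python function (made-up data; note the passing convention can differ again when the function is a NumPy ufunc):

import pandas as pd

s = pd.Series([10, 20, 30], name="x")
df = s.to_frame()

# Series.apply with a plain Python function receives one element (a scalar)
# per call.
print(s.apply(lambda v: type(v).__name__).unique())

# DataFrame.apply (default axis=0) passes each whole column to the function
# as a pandas Series, so .iloc works inside the function.
print(df.apply(lambda col: type(col).__name__))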