how to merge small dataframes into a large one without copying - pandas

I have a large pandas dataframe and want to merge a couple of smaller dataframes into it, thus adding more columns. However, it seems there is an implicit copy of the large dataframe after each merge, which I want to avoid. What's the most efficient way to do this? (Note the resulting dataframe will have the same rows; it only grows with more columns.) map seems better, as it keeps the original dataframe object, but there is overhead in creating the dictionary. I'm also not sure it works when merging multiple columns into the main dataframe. Or maybe merge does not deep-copy everything internally?
Base case:
id(df) # before merge
df = df.merge(df1[["sid", "col1"]], how="left", on=["sid"])
id(df) # will be different <-- trying to avoid copying df every time a smaller one is merged into it
df = df.merge(df2[["sid", "key2", "col2"]], how="left", on=["sid", "key2"])
id(df) # will be different
...
Using map():
d_col1 = {d["sid"]:d["col1"] for d in df1[["sid", "col1"]].to_dict("records")}
df["col1"] = df["sid"].map(d_col1)
id(df) # this is the same object
Some posts referred to Dask; I haven't tested that yet.

Here is another way. The first map can be done with a Series; since df1 is already built, I don't know whether this is less efficient than using a dictionary, though.
df["col1"] = df["sid"].map(df1.set_index('sid')['col1'])
Now with two or more key columns, you can play with the index:
df['col2'] = (
    df2.set_index(['sid','key2'])['col2']
    .reindex(pd.MultiIndex.from_frame(df[['sid','key2']]))
    .to_numpy()
)
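The question also asked whether this works when bringing several columns over at once. The same index-alignment trick extends; here is a minimal sketch, assuming df2 carries a second column col3 (hypothetical) and has unique (sid, key2) pairs:
aligned = (
    df2.set_index(['sid', 'key2'])[['col2', 'col3']]
    .reindex(pd.MultiIndex.from_frame(df[['sid', 'key2']]))
)
# Assign each aligned column back; df keeps the same id()
for col in ['col2', 'col3']:
    df[col] = aligned[col].to_numpy()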

Related

Pandas dataframe sharing between functions isn't working

I have a script that modifies a pandas dataframe with several concurrent functions (asyncio coroutines). Each function adds rows to the dataframe, and it's important that the functions all share the same dataframe. However, when I add a row with pd.concat, a new copy of the dataframe is created. I can tell because each dataframe now has a different memory location as given by id().
As a result, the functions no longer share the same object. How can I keep all functions pointed at a common dataframe object?
Note that this issue doesn't arise when I use the append method, but that is being deprecated.
pandas dataframes are efficient because they use contiguous memory blocks, frequently of fundamental types like int and float. You can't just add a row, because the dataframe doesn't own the next bit of memory it would have to expand into. Concatenation usually requires that new memory is allocated and data is copied. Once that happens, referrers to the original dataframe are still pointing at the old object, not the newly concatenated one.
If you know the final size you want, you can preallocate and fill. Otherwise, you are better off keeping a list of new dataframes and concatenating them all at once. Since these are parallel procedures, they aren't dependent on each other's output, so this may be a feasible option.
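As a minimal sketch of the collect-then-concatenate pattern (the names below are hypothetical, not from the question):
import asyncio
import pandas as pd

results = []  # a plain list is safe to share: appending keeps the same list object

async def worker(n):
    # each coroutine builds its own small dataframe and appends it to the shared list
    results.append(pd.DataFrame({"worker": [n], "value": [n * 10]}))

async def main():
    await asyncio.gather(*(worker(n) for n in range(5)))
    # concatenate once at the end instead of on every insert
    return pd.concat(results, ignore_index=True)

df = asyncio.run(main())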

Select multiple slices in pandas dataframe using a list or another dataframe

I have a massive dataframe with a column "data" from which I need to create multiple smaller dataframes. I want to select df["data"][1:3] and df["data"][4:5] at once, without having to use a for loop and iterate over each slice.
Either a vectorized solution or an .apply solution would make this much faster.
Something like using the dataframe below (called "selection") as the indexing input:
from  to
1     3
4     5
and trying to do something like: df["data"][selection["from"]:selection["to"]]
and it would output df[1:3], df[4:5].
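One possible approach, not from the original thread and just a sketch under the assumption that from/to are positional bounds, is to expand the pairs into one array of positions and index df a single time:
import numpy as np
import pandas as pd

df = pd.DataFrame({"data": range(10)})
selection = pd.DataFrame({"from": [1, 4], "to": [3, 5]})

# one flat array of positions covering every requested slice
positions = np.concatenate(
    [np.arange(f, t) for f, t in zip(selection["from"], selection["to"])]
)
subset = df["data"].iloc[positions]
The comprehension iterates over the rows of selection, not over df, so it stays cheap even when df is massive.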

Improving for loop over groupby keys of a pandas dataframe and concatenating

I am currently in a scenario where I have a (really inefficient) for loop that processes a pandas dataframe df with a complicated aggregation function (say complicated_function_1) over some groupby keys (groupby_keys), as follows:
final_result = []
for groupby_key in groupby_keys:
    small_df = complicated_function_1(df.loc[df['groupby_key'] == groupby_key])
    final_result.append(small_df)
df_we_want = pd.concat(final_result)
I am trying to see if there is a more efficient way of dealing with this, rather than having to use a for loop, especially if there are many groupby_keys.
Is it possible to convert all this into a single function for me to just groupby and agg/concatenate or pipe? Or is this procedure doomed to be constrained to the for loop? I have tried multiple combinations but have been getting an assortment of errors (have not really been able to pin down something specific).
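One direction worth trying (a sketch, not a tested answer) is to let groupby partition the frame once instead of re-filtering df with .loc on every key; apply still calls the function once per group, so the main win is avoiding the repeated full scans:
df_we_want = (
    df.groupby('groupby_key', group_keys=False)
      .apply(complicated_function_1)
)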

Alternative to reset_index().apply() to create new column based off index values

I have a df with a multiindex with 2 levels. One of these levels, age, is used to generate another column, Numeric Age.
Currently, my idea is to reset_index, use apply with age_func which reads row["age"], and then re-set the index, something like...
df = df.reset_index("age")
df["Numeric Age"] = df.apply(age_func, axis=1)
df = df.set_index("age") # ValueError: cannot reindex from a duplicate axis
This strikes me as a bad idea. I'm having a hard time resetting the indices correctly, and I think this is probably a slow way to go.
What is the correct way to make a new column based on the values of one of your indices? Or, if this is the correct way to go, is there a way to re-set the indices such that the df is the exact same as when I started, with the new column added?
We can set a new column using .loc and modify only the rows we need using masks. To pull the correct column values, we also apply the mask on the right-hand side.
First step is to make a mask for the rows to target.
mask_foo = df.index.get_level_values("age") == "foo"
Later we will use .apply(axis=1), so write a function to handle the rows you will have from mask_foo.
def calc_foo_numeric_age(row):
    # The logic here isn't important; the point is we have access to the row
    return row["some_other_column"].split(" ")[0]
And now the .loc magic
df.loc[mask_foo, "Numeric Age"] = df[mask_foo].apply(calc_foo_numeric_age, axis=1)
Repeat the process for the other target index values.
If your situation allows you to use reset_index().apply(axis=1), I recommend that over this approach. I am only doing this because I have other reasons for not wanting to reset_index().
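If Numeric Age depends only on the age level itself rather than on other columns, a simpler route (a sketch, assuming age_func can be rewritten to take a single age label instead of a row) is to map the index level directly, which avoids both reset_index and apply:
df["Numeric Age"] = df.index.get_level_values("age").map(age_func)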

pandas.read_sql read array columns directly into native structures?

Is there any way to get pandas to read a table with array-typed columns directly into native structures? By default, an int[] column ends up as an object column containing Python lists of Python ints. There are ways to convert this into a column of Series, or better, a column with a multi-index, but these are very slow (~10 seconds) for 500M rows. It would be much faster if the data were loaded into that shape originally. I don't want to unroll the array in SQL because I have very many array columns.
url = "postgresql://u:p@host:5432/dname"
engine = sqlalchemy.create_engine(url)
df = pd.read_sql_query("select 1.0 as a, 2.2 as b, array[1,2,3] as c;", engine)
print(df)
print(type(df.loc[0,'c']))     # list
print(type(df.loc[0,'c'][0]))  # int
Does it help if you use read_sql_table instead of read_sql_query? Also, type detection can fail due to missing values; maybe that is the cause.
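If the column still comes back as Python lists, one post-load conversion worth timing (a sketch, assuming every array in c has the same length and no NULLs) is to stack the lists into a single numpy array and rebuild numeric columns from it:
import numpy as np
import pandas as pd

# stack the per-row lists into one (n_rows, array_len) integer array,
# then expand it into separate columns c_0, c_1, ...
c = np.vstack(df['c'].to_numpy())
expanded = pd.DataFrame(c, columns=[f"c_{i}" for i in range(c.shape[1])], index=df.index)
df = pd.concat([df.drop(columns='c'), expanded], axis=1)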