Cannot update multiple rows and columns when new values include a list - pandas

I am selecting multiple rows based on a condition, and updating values in multiple columns. This works unless one of the values is a list.
First, a dataframe:
>>> dummy_data = {'A': ['abc', 'def', 'def'],
...               'B': ['red', 'purple', 'blue'],
...               'C': [25, 94, 57],
...               'D': [False, False, False],
...               'E': [[9,8,12], [36,72,4], [18,3,5]]}
>>> df = pd.DataFrame(dummy_data)
A B C D E
0 abc red 25 False [9, 8, 12]
1 def purple 94 False [36, 72, 4]
2 def blue 57 False [18, 3, 5]
Things that work:
This works to select multiple rows and update multiple columns:
>>> df.loc[df['A'] == 'def', ['B', 'C', 'D']] = ['orange', 42, True]
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 True [36, 72, 4]
2 def orange 42 True [18, 3, 5]
This works to update column E with a new list:
>>> new_list = [1,2,3]
>>> df.loc[df['A'] == 'def', ['E']] = pd.Series([new_list] * len(df))
A B C D E
0 abc red 25 False [9, 8, 12]
1 def purple 94 False [1, 2, 3]
2 def blue 57 False [1, 2, 3]
But how to do both?
I can't figure out an elegant way to combine these approaches.
Attempt 1: This works, but I get the ndarray from ragged nested sequences warning:
>>> new_list = [1,2,3]
>>> updates = ['orange', 42, new_list]
>>> num_rows = df.A.eq('def').sum()
>>> df.loc[df['A'] == 'def', ['B', 'C', 'E']] = [updates] * num_rows
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences ...
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3]
2 def orange 42 False [1, 2, 3]
Attempt 2: This works, but it seems overly complicated:
>>> new_list = [1,2,3]
>>> updates = ['orange', 42, new_list]
>>> num_rows = df.A.eq('def').sum()
>>> df2 = pd.DataFrame([updates] * num_rows)
>>> df.loc[df['A'] == 'def', ['B', 'C', 'E']] = df2[[0, 1, 2]].values
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3]
2 def orange 42 False [1, 2, 3]

You can use pandas.DataFrame to align the new values with the selected columns, with the help of a boolean mask.
mask = df['A'] == 'def'
cols = ['B', 'C', 'D', 'E']
new_list = [1,2,3]
updates = ['orange', 42, True, [new_list] * len(df)]  # repeat the list so column E matches the index length
df.loc[mask, cols] = pd.DataFrame(dict(zip(cols, updates)), index=df.index)
>>> print(df)
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 True [1, 2, 3]
2 def orange 42 True [1, 2, 3]
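To see why this works, it can help to print the intermediate frame: the scalars broadcast across the full index, the list is repeated down column E, and .loc then takes only the masked rows by index alignment:
>>> pd.DataFrame(dict(zip(cols, updates)), index=df.index)
        B   C     D          E
0  orange  42  True  [1, 2, 3]
1  orange  42  True  [1, 2, 3]
2  orange  42  True  [1, 2, 3]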

Create a numpy array with object dtype (reusing updates and num_rows from Attempt 1):
df.loc[df['A'] == 'def', ['B', 'C', 'E']] = np.array([updates] * num_rows, dtype='object')
Output:
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3]
2 def orange 42 False [1, 2, 3]
However, as commented, [updates] * num_rows is a dangerous operation, because every row ends up referencing the same list object. For example, suppose you later append to one of the lists:
df.iloc[-1,-1].append(4)
Then your data becomes (notice the change in row 1 as well):
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3, 4]
2 def orange 42 False [1, 2, 3, 4]
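If that aliasing is a concern, one option (a minimal sketch, not part of the original answer) is to give every row its own copy of the list before assigning:
import numpy as np
new_list = [1,2,3]
num_rows = df.A.eq('def').sum()
# list(new_list) creates an independent copy per row, so appending to one
# cell later no longer changes the other rows
fresh = np.array([['orange', 42, list(new_list)] for _ in range(num_rows)],
                 dtype='object')
df.loc[df['A'] == 'def', ['B', 'C', 'E']] = fresh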

Related

Add a new column if index in other 2 column is the same

I want to add a new index in a new column e when b and c are the same. At the same time, I need to respect the limit sum(d) <= 20: if the total d for rows with the same b and c exceeds 20, a new index should be given.
The example input data is below:
a  b  c   d
0  0  2   9
1  2  1  10
2  1  0   9
3  1  0  11
4  2  1   9
5  0  1  15
6  2  0   9
7  1  0   8
I sort by b and c first to make the comparison easier, but then I get KeyError: 0 from temporary_size += df.loc[df[i], 'd'].
The expected output looks like this:
a  b  c   d  e
5  0  1  15  1
0  0  2   9  2
2  1  0   9  3
3  1  0  11  3
7  1  0   8  4
6  2  0   9  5
1  2  1  10  6
4  2  1   9  6
and here is my code:
import pandas as pd
d = {'a': [0, 1, 2, 3, 4, 5, 6, 7], 'b': [0, 2, 1, 1, 2, 0, 2, 1], 'c': [2, 1, 0, 0, 1, 1, 0, 0], 'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(data=d)
print(df)
df.sort_values(['b', 'c'], ascending=[True, True], inplace=True, ignore_index=True)
e_id = 0
total_size = 20
temporary_size = 0
for i in range(0, len(df.index)-1):
    if df.loc[i, 'b'] == df.loc[i+1, 'b'] and df.loc[i, 'c'] != df.loc[i+1, 'c']:
        temporary_size = temporary_size + df.loc[i, 'd']
        if temporary_size <= total_size:
            df.loc['e', i] = e_id
        else:
            df.loc[i, 'e'] = e_id
            temporary_size = temporary_size + df.loc[i, 'd']
            e_id += 1
    else:
        df.loc[i, 'e'] = e_id
        temporary_size = temporary_size + df.loc[i, 'd']
print(df)
Finally, I can't get the column e into my dataframe. Thanks for any help!
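One likely culprit: df.loc['e', i] has the row and column labels swapped (it should be df.loc[i, 'e']), so column e never fills in properly. For what it's worth, here is a minimal sketch of one way to get the expected output, assuming the rule is that rows sharing the same (b, c) keep one index until their cumulative d would exceed 20, and every new (b, c) pair also starts a new index:
import pandas as pd

d = {'a': [0, 1, 2, 3, 4, 5, 6, 7], 'b': [0, 2, 1, 1, 2, 0, 2, 1],
     'c': [2, 1, 0, 0, 1, 1, 0, 0], 'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(d).sort_values(['b', 'c'], ignore_index=True)

limit = 20
e_id = 0        # running index for column e
running = 0     # cumulative d within the current index group
prev_key = None
e = []
for _, row in df.iterrows():
    key = (row['b'], row['c'])
    # new (b, c) pair, or adding this row would exceed the limit: new index
    if key != prev_key or running + row['d'] > limit:
        e_id += 1
        running = 0
    running += row['d']
    prev_key = key
    e.append(e_id)
df['e'] = e
print(df)
This reproduces the expected e column (1, 2, 3, 3, 4, 5, 6, 6) for the sample data.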

pandas - get difference between two dataframes with same dimensions

How can I get the difference between two pandas dataframes with the same dimensions?
import pandas as pd
df1 = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e'],
    'y': [1, 1, 1, 1, 1],
    'z': [2, 2, 2, 2, 2]})
print(df1)
df2 = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e'],
    'y': [1, 1, 1, 1, 1],
    'z': [3, 3, 3, 3, 3]})
print(df2)
I would like the output delta dataframe to be:
x y z
0 a 0 1
1 b 0 1
2 c 0 1
3 d 0 1
4 e 0 1
Set x as the common index, subtract and reset the index (pandas aligns on the index before any operation):
df2.set_index('x').sub(df1.set_index('x')).reset_index()
x y z
0 a 0 1
1 b 0 1
2 c 0 1
3 d 0 1
4 e 0 1
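Because pandas aligns on the index, this still gives the right result when df2's rows happen to be in a different order; a quick check (the shuffle is only for illustration):
# shuffle df2's rows; the subtraction still matches rows by 'x'
df2_shuffled = df2.sample(frac=1, random_state=0)
print(df2_shuffled.set_index('x').sub(df1.set_index('x')).reset_index())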

Removing 'dominated' rows from a Pandas dataframe (rows with all values lower than the values of any other row)

Edit: changed example df for clarity
I have a dataframe, similar to the one given below (except the real one has a few thousand rows and columns, and values being floats):
df = pd.DataFrame([[6,5,4,3,8], [6,5,4,3,6], [1,1,3,9,5], [0,1,2,7,4], [2,0,0,4,0]])
0 1 2 3 4
0 6 5 4 3 8
1 6 5 4 3 6
2 1 1 3 9 5
3 0 1 2 7 4
4 2 0 0 4 0
From this dataframe, I would like to drop all rows for which all values are lower than or equal to any other row. For this simple example, row 1 and row 3 should be deleted ('dominated' by row 0 and row 2 respectively):
filtered df:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
It would be even better if the approach could take floating point errors into account, since my real dataframe contains floats (i.e. instead of dropping rows where all values are lower/equal, a row should only be dropped when its values are lower by more than a small tolerance, e.g. 0.0001).
My initial idea to tackle this problem was as follows:
Select the first row
Compare the other rows with it using a list comprehension (see below)
Drop all rows that returned True
Repeat for the next row
List comprehension code:
selected_row = df.loc[0]
[(df.loc[r]<=selected_row).all() and (df.loc[r]<selected_row).any() for r in range(len(df))]
[False, True, False, False, False]
This seems hardly efficient however. Any suggestions on how to (efficiently) tackle this problem would be greatly appreciated.
We can try with broadcasting:
import pandas as pd
df = pd.DataFrame([
[6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5],
[0, 1, 2, 7, 4], [2, 0, 0, 4, 0]
])
# Need to ensure only one of each row present since comparing to 1
# there needs to be one and only one of each row
df = df.drop_duplicates()
# Broadcasted comparison explanation below
cmp = (df.values[:, None] <= df.values).all(axis=2).sum(axis=1) == 1
# Filter using the results from the comparison
df = df[cmp]
df:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
Intuition:
Broadcast the comparison operation over the DataFrame:
(df.values[:, None] <= df.values)
[[[ True True True True True]
[ True True True True False]
[False False False True False]
[False False False True False]
[False False False True False]] # df vs [6 5 4 3 8]
[[ True True True True True]
[ True True True True True]
[False False False True False]
[False False False True False]
[False False False True False]] # df vs [6 5 4 3 6]
[[ True True True False True]
[ True True True False True]
[ True True True True True]
[False True False False False]
[ True False False False False]] # df vs [1 1 3 9 5]
[[ True True True False True]
[ True True True False True]
[ True True True True True]
[ True True True True True]
[ True False False False False]] # df vs [0 1 2 7 4]
[[ True True True False True]
[ True True True False True]
[False True True True True]
[False True True True True]
[ True True True True True]]] # df vs [2 0 0 4 0]
Then we can check for all on axis=2:
(df.values[:, None] <= df.values).all(axis=2)
[[ True False False False False] # Rows le [6 5 4 3 8]
[ True True False False False] # Rows le [6 5 4 3 6]
[False False True False False] # Rows le [1 1 3 9 5]
[False False True True False] # Rows le [0 1 2 7 4]
[False False False False True]] # Rows le [2 0 0 4 0]
Then we can use sum to total how many rows are less than or equal to:
(df.values[:, None] <= df.values).all(axis=2).sum(axis=1)
[1 2 1 2 1]
The rows to keep are those where only 1 row (the self-match) is greater than or equal. Because of drop_duplicates there are no duplicate rows in the dataframe, so in each row of this matrix the only True values are the self-match and any rows that dominate it:
(df.values[:, None] <= df.values).all(axis=2).sum(axis=1) == 1
[ True False True False True]
This then becomes the filter for the DataFrame:
df = df[[True, False, True, False, True]]
df:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
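None of this yet handles the floating point tolerance asked about in the question. One reading of that requirement (a row counts as dominated only up to a small eps margin for float noise) can be bolted onto the broadcast comparison; eps and the exact rule here are assumptions, not part of the original answer:
import numpy as np

eps = 1e-4
vals = df.to_numpy(dtype=float)
# row i is treated as dominated by row j if every value of row i is
# at most eps above the corresponding value of row j
le = (vals[:, None] <= vals + eps).all(axis=2)
keep = le.sum(axis=1) == 1  # only the self-match remains
df_filtered = df[keep]
As with the exact version, rows that are equal within eps would dominate each other and both be dropped, so a drop_duplicates-style (or rounding) pass may still be needed first.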
What is the expected proportion of dominant rows?
What is the size of the datasets that you will handle and the available memory?
While a solution like the broadcasting approach is very clever and efficient (vectorized), it will not be able to handle large dataframes as the size of the broadcast will quickly explode the memory limit (a 100,000×10 input array will not run on most computers).
Here is another approach to avoid testing all combinations and computing everything at once in the memory. It is slower due to the loop, but it should be able to handle much larger arrays. It will also run faster when the proportion of dominated rows increases.
In summary, it compares the dataset with the first row, drops the dominated rows, shifts the first row to the end, and starts again until it has done a full loop. As rows get dropped over time, the number of comparisons decreases.
def get_dominants_loop(df):
    from tqdm import tqdm
    seen = []        # keep track of tested rows
    idx = df.index   # initial index
    for i in tqdm(range(len(df)+1)):
        x = idx[0]
        if x in seen:  # done a full loop
            return df.loc[idx]
        seen.append(idx[0])
        # check which rows are dominated and drop them from the index
        idx = (df.loc[idx]-df.loc[x]).le(0).all(axis=1)
        # put tested row at the end
        idx = list(idx[~idx].index)+[x]
To drop the dominated rows:
df = get_dominants_loop(df)
NB. I used tqdm here to have a progress bar; it is not needed for the code to run.
Quick benchmarking in cases where the broadcast approach could not run: under 2 minutes for 100k×10 in a case where most rows are not dominated; 4s when most rows are dominated.
you can try:
df[df.shift(1)[0] >= df[1][0]]
output:
   0  1  2  3  4
1  6  5  4  3  6
2  1  1  3  9  5
You can try something like this:
# Cartesian product
x = np.tile(df, df.shape[0]).reshape(-1, df.shape[1])
y = np.tile(df.T, df.shape[0]).T
# Remove same rows
#dups = np.all(x == y, axis=1)
#x = x[~dups]
#y = y[~dups]
x = np.delete(x, slice(None, None, df.shape[0]+1), axis=0)
y = np.delete(y, slice(None, None, df.shape[0]+1), axis=0)
# Keep dominant rows
m = x[np.all(x >= y, axis=1)]
>>> m
array([[6, 5, 4, 3, 8],
[1, 1, 3, 9, 5]])
# Before removing duplicates
# df1 = pd.DataFrame({'x': x.tolist(), 'y': y.tolist()})
>>> df1
x y
0 [6, 5, 4, 3, 8] [6, 5, 4, 3, 8] # dup
1 [6, 5, 4, 3, 8] [6, 5, 4, 3, 6] # DOMINANT
2 [6, 5, 4, 3, 8] [1, 1, 3, 9, 5]
3 [6, 5, 4, 3, 8] [0, 1, 2, 7, 4]
4 [6, 5, 4, 3, 6] [6, 5, 4, 3, 8]
5 [6, 5, 4, 3, 6] [6, 5, 4, 3, 6] # dup
6 [6, 5, 4, 3, 6] [1, 1, 3, 9, 5]
7 [6, 5, 4, 3, 6] [0, 1, 2, 7, 4]
8 [1, 1, 3, 9, 5] [6, 5, 4, 3, 8]
9 [1, 1, 3, 9, 5] [6, 5, 4, 3, 6]
10 [1, 1, 3, 9, 5] [1, 1, 3, 9, 5] # dup
11 [1, 1, 3, 9, 5] [0, 1, 2, 7, 4] # DOMINANT
12 [0, 1, 2, 7, 4] [6, 5, 4, 3, 8]
13 [0, 1, 2, 7, 4] [6, 5, 4, 3, 6]
14 [0, 1, 2, 7, 4] [1, 1, 3, 9, 5]
15 [0, 1, 2, 7, 4] [0, 1, 2, 7, 4] # dup
Here is a way using df.apply()
m = (pd.concat(df.apply(lambda x: df.ge(x,axis=1),axis=1).tolist(),keys = df.index)
.all(axis=1)
.groupby(level=0)
.sum()
.eq(1))
ndf = df.loc[m]
Output:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0

How to create new list column values from groupby

My goal is to create a new column c_list that contains a list after a groupby (without using merge): df['c_list'] = df.groupby('a').agg({'c': lambda x: list(x)})
df = pd.DataFrame(
    {'a': ['x', 'y', 'y', 'x'],
     'b': [2, 0, 0, 0],
     'c': [8, 2, 5, 6]}
)
df
df
Initial dataframe
a b c
0 x 2 8
1 y 0 2
2 y 0 5
3 x 0 6
Looking for:
a b c d
0 x 2 8 [6, 8]
1 y 0 2 [2, 5]
2 y 0 5 [2, 5]
3 x 0 6 [6, 8]
Try with transform
df['d']=df.groupby('a').c.transform(lambda x : [x.values.tolist()]*len(x))
0 [8, 6]
1 [2, 5]
2 [2, 5]
3 [8, 6]
Name: c, dtype: object
Or
df['d']=df.groupby('a').c.agg(list).reindex(df.a).values
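Another idiom with the same effect, for what it's worth, is to map the aggregated lists back onto the key column, which avoids relying on positional .values alignment:
df['d'] = df['a'].map(df.groupby('a')['c'].agg(list))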

insert value into random row

I have a dataframe as below.
D1 = pd.DataFrame({'a': [15, 22, 107, 120],
                   'b': [25, 21, 95, 110]})
I am trying to randomly insert two zeroes into column 'b' to get the effect below. Each inserted 0 shifts the following rows down by one.
D1 = pd.DataFrame({'a': [15, 22, 107, 120, 0, 0],
                   'b': [0, 25, 21, 0, 95, 110]})
Everything I have seen is about inserting into the whole column as opposed to individual rows.
Here is one potential way to achieve this using numpy.random.randint and numpy.insert:
import numpy as np
n = 2
rand_idx = np.random.randint(0, len(D1), size=n)
# Append 'n' rows of zeroes to D1
D2 = pd.concat([D1, pd.DataFrame(np.zeros((n, D1.shape[1]), dtype=int), columns=D1.columns)], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
# Insert n zeroes into random indices and assign back to column 'b'
D2['b'] = np.insert(D1['b'].values, rand_idx, 0)
print(D2)
a b
0 15 25
1 22 0
2 107 0
3 120 21
4 0 95
5 0 110
Use numpy.insert with set positions - zeroes go at the end of column a (by length of the original DataFrame) and at random positions in column b:
n = 2
new = np.zeros(n, dtype=int)
a = np.insert(D1['a'].values, len(D1), new)
b = np.insert(D1['b'].values, np.random.randint(0, len(D1), size=n), new)
#pandas 0.24+
#a = np.insert(D1['a'].to_numpy(), len(D1), new)
#b = np.insert(D1['b'].to_numpy(), np.random.randint(0, len(D1), size=n), new)
df = pd.DataFrame({'a':a, 'b': b})
print (df)
     a    b
0   15    0
1   22   25
2  107   21
3  120    0
4    0   95
5    0  110
(the exact positions of the zeroes in column b vary from run to run)
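Both answers place the zeroes at random positions, so the output varies between runs. If reproducibility matters, a seeded NumPy generator is one option (e.g. for the rand_idx in the first answer):
rng = np.random.default_rng(0)  # fixed seed for repeatable runs
rand_idx = rng.integers(0, len(D1), size=n)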