Replace a single value in a column of a dataframe with 3 new column values - pandas

I would effectively like to replace a single value in a column (based on a criterion) with 3 new column values. For example:
Ch  Time
A   1 min
A   1 min
B   2 min
B   2 min

A1  A2  A3
1   2   3
2   3   4
1   1   2

B1  B2  B3
1   2   1
1   1   1
In the above table, if Ch == A, I would like to see all values from table 2 repeated twice (and likewise for Ch == B, etc.):
ColA  ColB  ColC
1     2     3
1     3     4
1     1     2
1     2     3
1     3     4
1     1     2
How can I go about doing this efficiently? Thank you in advance for your help!
I tried replacing the value, however I am not sure how to insert the 3 new columns.

IIUC, you can concat and merge:
import pandas as pd

df1 = pd.DataFrame({'Ch': ['A', 'A', 'B', 'B'], 'Time': ['1 min', '1 min', '2 min', '2 min']})
df2 = pd.DataFrame({'A1': [1, 2, 1], 'A2': [2, 3, 1], 'A3': [3, 4, 2]})
df3 = pd.DataFrame({'B1': [1, 1], 'B2': [2, 1], 'B3': [1, 1]})

d = {'A': df2, 'B': df3}

df1.merge(pd.concat({k: v.set_axis(['ColA', 'ColB', 'ColC'], axis=1)
                     for k, v in d.items()}
                    ).droplevel(1),
          left_on='Ch', right_index=True
          )
Output:
Ch Time ColA ColB ColC
0 A 1 min 1 2 3
0 A 1 min 2 3 4
0 A 1 min 1 1 2
1 A 1 min 1 2 3
1 A 1 min 2 3 4
1 A 1 min 1 1 2
2 B 2 min 1 2 1
2 B 2 min 1 1 1
3 B 2 min 1 2 1
3 B 2 min 1 1 1
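If you only need the three new columns as in the expected output, a small follow-up sketch (out here is just a hypothetical name for the merge result above):

out[['ColA', 'ColB', 'ColC']].reset_index(drop=True)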

Related

Iterate over rows and subtract values in pandas df

I have the following table:
ID  Qty_1  Qty_2
A   1      10
A   2      0
A   3      0
B   3      29
B   2      0
B   1      0
I want to iterate within each ID, subtracting Qty_1 from Qty_2 and carrying the result forward to the next row.
The result would be:
ID  Qty_1  Qty_2
A   1      10
A   2      8
A   3      5
B   3      29
B   2      27
B   1      26
Ideally, I would also like the subtraction to start already on the first row each time a new ID appears, and only then continue with the rest:
ID  Qty_1  Qty_2
A   1      9
A   2      7
A   3      4
B   3      26
B   2      24
B   1      23
Either of the two outputs would be fine! Thank you!
First compute the difference between 'Qty_1' and 'Qty_2' row by row, then group by 'ID' and compute cumulative sum:
df['Qty_2'] = df.assign(Qty_2=df['Qty_2'].sub(df['Qty_1'])) \
                .groupby('ID')['Qty_2'].cumsum()
print(df)
# Output:
ID Qty_1 Qty_2
0 A 1 9
1 A 2 7
2 A 3 4
3 B 3 26
4 B 2 24
5 B 1 23
Setup:

import pandas as pd

data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Qty_1': [1, 2, 3, 3, 2, 1],
        'Qty_2': [10, 0, 0, 29, 0, 0]}
df = pd.DataFrame(data)
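If you want the first desired output instead, where the first row of each ID keeps its original Qty_2, one possible sketch applied to the setup frame (the duplicated/where approach is an assumption, not part of the answer above):

first = ~df.duplicated('ID')                                    # True on the first row of each ID
diff = df['Qty_2'].sub(df['Qty_1']).where(~first, df['Qty_2'])  # keep each first row's Qty_2 as-is
df['Qty_2'] = diff.groupby(df['ID']).cumsum()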

pandas sort_values with condition

I have a dataframe that I'd like to sort on cols time and b, where b sort is conditional on value of a. So if a == 1, sort from highest to lowest, and if a == -1, sort from lowest to highest. I would normally do something like df.sort_values(by=['time', 'b']) but I think it sorts b always from lowest to highest.
df = pd.DataFrame({'time': [0, 3, 2, 2, 1], 'a': [1, -1, 1, 1, -1], 'b': [4, 5, 1, 6, 2]})
time a b
0 0 1 4
1 3 -1 5
2 2 1 1
3 2 1 6
4 1 -1 2
Desired output:
time a b
0 0 1 4
1 1 -1 2
2 2 1 6
3 2 1 1
4 3 -1 5
Multiply a and b and use the product as the sorting key; since a is ±1, sorting this key in descending order gives descending b where a == 1 and ascending b where a == -1:
df['sort'] = df['a']*df['b']
df.sort_values(by=['time', 'sort'], ascending=[True, False]).drop('sort', axis=1)
Output:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
Alternative, negating the key so that a plain ascending sort works:
df['sort'] = -df['a']*df['b']
df.sort_values(by=['time', 'sort']).drop('sort', axis=1)
Pass ascending after creating an additional column for sorting:
out = df.assign(key=df.a*df.b).sort_values(['time', 'key'], ascending=[True, False]).drop(columns='key')
Out[59]:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
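The same sign-flipped key also works with numpy's lexsort, a sketch (np.lexsort sorts by its last key first, here time, and breaks ties with -a*b):

import numpy as np
order = np.lexsort((-df['a']*df['b'], df['time']))  # primary key: time, tie-break: -a*b
df.iloc[order]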

Pandas - Add many new columns based on many aggregate functions

Pandas 1.0.5
import pandas as pd
d = pd.DataFrame({
    "card_id": [1, 1, 2, 2, 1, 1, 2, 2],
    "day": [1, 1, 1, 1, 2, 2, 2, 2],
    "amount": [1, 2, 10, 20, 3, 4, 30, 40]
})

# add columns
d['count'] = d.groupby(['card_id', 'day'])["amount"].transform('count')
d['min'] = d.groupby(['card_id', 'day'])["amount"].transform('min')
d['max'] = d.groupby(['card_id', 'day'])["amount"].transform('max')
I would like to change the three transform lines to one line. I tried this:
d['count', 'min', 'max'] = d.groupby(['card_id', 'day'])["amount"].transform('count', 'min', 'max')
Error: "TypeError: count() takes 1 positional argument but 3 were given"
I also tried this:
d[('count', 'min', 'max')] = d.groupby(['card_id', 'day']).agg(
    count=pd.NamedAgg('amount', 'count'),
    min=pd.NamedAgg('amount', 'min'),
    max=pd.NamedAgg('amount', 'max')
)
Error: "TypeError: incompatible index of inserted column with frame index"
Use merge,
d = pd.DataFrame({
    "card_id": [1, 1, 2, 2, 1, 1, 2, 2],
    "day": [1, 1, 1, 1, 2, 2, 2, 2],
    "amount": [1, 2, 10, 20, 3, 4, 30, 40]
})

df_out = d.groupby(['card_id', 'day']).agg(
    count=pd.NamedAgg('amount', 'count'),
    min=pd.NamedAgg('amount', 'min'),
    max=pd.NamedAgg('amount', 'max')
)

d.merge(df_out, left_on=['card_id', 'day'], right_index=True)
Output:
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40
The output of your groupby has a MultiIndex, and the index of this output doesn't match the index of d, hence the error. However, we can join the columns in d to the index of the groupby output using merge with the column names and right_index=True.
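Since df_out is indexed by ['card_id', 'day'], an equivalent sketch is to join on those columns instead of merging:

d.join(df_out, on=['card_id', 'day'])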
You could use the assign function to get the results in one go:
grouping = d.groupby(["card_id", "day"])["amount"]
d.assign(
    count=grouping.transform("count"),
    min=grouping.transform("min"),
    max=grouping.transform("max"),
)
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40

Pairwise minima of elements of a pandas Series

Input:
numbers = pandas.Series([3,5,8,1], index=["A","B","C","D"])
A 3
B 5
C 8
D 1
Expected output (pandas DataFrame):
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
Current (working) solution:
import numpy
import pandas

pairwise_mins = pandas.DataFrame(index=numbers.index)

def calculate_mins(series, index):
    to_return = numpy.minimum(series, series[index])
    return to_return

for col in numbers.index:
    pairwise_mins[col] = calculate_mins(numbers, col)
I suspect there must be a better, shorter, vectorized solution. Could someone help me with it?
Use the outer method that numpy ufuncs provide, here combined with numpy.minimum:
import numpy as np

n = numbers.to_numpy()
np.minimum.outer(n, n)

array([[3, 3, 3, 1],
       [3, 5, 5, 1],
       [3, 5, 8, 1],
       [1, 1, 1, 1]], dtype=int64)
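To get the labelled frame from the expected output, the array can be wrapped back into a DataFrame (a small follow-up sketch):

pd.DataFrame(np.minimum.outer(n, n), index=numbers.index, columns=numbers.index)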
This can be done by broadcasting:
pd.DataFrame(np.where(numbers.values[:, None] < numbers.values,
                      numbers.values[:, None],
                      numbers.values),
             index=numbers.index,
             columns=numbers.index)
Output:
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
Use np.broadcast_to and np.clip:

a = numbers.values
pd.DataFrame(np.broadcast_to(a, (a.size, a.size)).T.clip(max=a),
             columns=numbers.index,
             index=numbers.index)
Out[409]:
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
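The broadcasting and clipping ideas above can also be written as a single np.minimum call on the broadcast values, a sketch equivalent to np.minimum.outer:

n = numbers.to_numpy()
pd.DataFrame(np.minimum(n[:, None], n), index=numbers.index, columns=numbers.index)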

Why does groupby in Pandas place counts under existing column names?

I'm coming from R and do not understand the default groupby behavior in pandas. I create a dataframe and groupby the column 'id' like so:
from pandas import DataFrame

d = {'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]}
df = DataFrame(data=d)
freq = df.groupby('id').count()
When I check the header of the resulting dataframe, all the original columns are there instead of just 'id' and 'freq' (or 'id' and 'count').
list(freq)
Out[117]: ['color', 'size']
When I display the resulting dataframe, the counts have replaced the values for the columns not employed in the count:
freq
Out[114]:
color size
id
1 1 1
2 3 3
3 1 1
4 2 2
I was planning to use groupby and then to filter by the frequency column. Do I need to delete the unused columns and add the frequency column manually? What is the usual approach?
count aggregates all columns of the DataFrame, excluding NaN values. If you need id as a column, use the as_index=False parameter or reset_index():
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 1
1 2 3 3
2 3 1 1
3 4 2 2
So if you add NaNs, there are differences per column:
import numpy as np
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 0
1 2 3 3
2 3 1 1
3 4 2 2
You can specify columns for count:
freq = df.groupby('id', as_index=False)['color'].count()
print (freq)
id color
0 1 1
1 2 3
2 3 1
3 4 2
If you need the count including NaN values:
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
Thanks Bharath for pointing out another solution with value_counts; the differences are explained here:
freq = df['id'].value_counts().rename_axis('id').to_frame('freq').reset_index()
print (freq)
id freq
0 2 3
1 4 2
2 3 1
3 1 1
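Any of these frames can then be filtered on the frequency column, e.g. with the size-based result (a sketch, assuming you want ids that occur more than once):

freq = df.groupby('id').size().reset_index(name='count')
freq[freq['count'] > 1]   # ids 2 and 4 in the sample data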