Copy first of group down and sum total - pre-defined groups - pandas

I have previously asked how to iterate through a prescribed grouping of items and received the following solution:
import pandas as pd

data = [['apple', 1], ['orange', 2], ['pear', 3], ['peach', 4], ['plum', 5], ['grape', 6]]
index_groups = [[0], [1, 2], [3, 4, 5]]
df = pd.DataFrame(data, columns=['Name', 'Number'])
for i in range(len(df)):
    print(df['Number'][i])
     Name  Number
0   apple       1
1  orange       2
2    pear       3
3   peach       4
4    plum       5
5   grape       6
where:
for group in index_groups:
    print(df.loc[group])
gave me just what I needed. Following up on this, I would now like to sum the numbers per group, but also copy the first 'Name' in each group down to the other rows in the group, and then aggregate so there is one line per 'Name'.
In the above example the output I'm seeking would be
     Name  Age
0   apple    1
1  orange    5
2   peach   15
I can append the sums to a list easily enough:
group_sum = []
for group in index_groups:
    group_sum.append(sum(df['Number'].loc[group]))
But I can't get the 'Names' in order to merge with the sums.

You could try:
df_final = pd.DataFrame()
for group in index_groups:
    _df = df.loc[group].copy()  # copy to avoid SettingWithCopyWarning
    _df["Name"] = _df["Name"].iloc[0]
    df_final = pd.concat([df_final, _df])
df_final.groupby("Name").agg(Age=("Number", "sum")).reset_index()
Output:
     Name  Age
0   apple    1
1  orange    5
2   peach   15
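A loop-free alternative (just a sketch, assuming index_groups covers every row exactly once): map each row index to the first 'Name' of its group, then aggregate in a single groupby.
# Build a row-index -> group-leader-name mapping from index_groups
label = {i: df.loc[group[0], 'Name'] for group in index_groups for i in group}
out = (df.assign(Name=df.index.map(label))
         .groupby('Name', sort=False)
         .agg(Age=('Number', 'sum'))
         .reset_index())
print(out)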

Related

groupby to show same row value from other columns

After grouping by the "Mode" column and taking the max and min of "Indicator", how can I get the corresponding "Value" from those same rows to show in the resulting dataframe, like below:
df = pd.read_csv(r'relative.csv')
Grouped = df.groupby('Mode')['Indicator'].agg(['max', 'min'])
print(Grouped)
(From Google it looks like this could be done with some row/column lookup functions, but that seems complicated; could someone help solve it in an easier way? Thank you.)
You can do it in two steps, using groupby with idxmin() and idxmax():
# Create a df with the min values of 'Indicator', renaming 'Value' to 'B'
df_min = df.loc[df.groupby('Mode')['Indicator'].idxmin()].reset_index(drop=True).rename(columns={'Indicator': 'min', 'Value': 'B'})
print(df_min)
#   Mode  min  B
# 0    A    1  6
# 1    B    1  7
# Create a df with the max values of 'Indicator', renaming 'Value' to 'A'
df_max = df.loc[df.groupby('Mode')['Indicator'].idxmax()].reset_index(drop=True).rename(columns={'Indicator': 'max', 'Value': 'A'})
print(df_max)
#   Mode  max  A
# 0    A    3  2
# 1    B    4  3
# Merge the dataframes together (on the shared 'Mode' column)
result = pd.merge(df_min, df_max)
# Reorder the columns to match the expected output
print(result[['Mode', 'max', 'min', 'A', 'B']])
#   Mode  max  min  A  B
# 0    A    3    1  2  6
# 1    B    4    1  3  7
The logic is unclear; there is no real reason to call your columns A/B, since the 6/3 values in them do not come from columns named A/B.
I assume you want to achieve:
(df.groupby('Mode')['Indicator'].agg(['idxmax', 'idxmin'])
   .rename(columns={'idxmin': 'min', 'idxmax': 'max'}).stack()
   .to_frame('x').merge(df, left_on='x', right_index=True)
   .drop(columns=['x', 'Mode']).unstack()
)
Output:
     Indicator     Value
           max min   max min
Mode
A            3   1     2   6
B            4   1     3   7
C           10  10    20  20
Used input:
  Mode  Indicator  Value
0    A          1      6
1    A          2      5
2    A          3      2
3    B          4      3
4    B          3      6
5    B          2      8
6    B          1      7
7    C         10     20
With the dataframe you provided:
import pandas as pd

df = pd.DataFrame(
    {
        "Mode": ["A", "A", "A", "B", "B", "B", "B"],
        "Indicator": [1, 2, 3, 4, 3, 2, 1],
        "Value": [6, 5, 2, 3, 6, 8, 7],
    }
)
new_df = df.groupby("Mode")["Indicator"].agg(["max", "min"])
print(new_df)
# Output
      max  min
Mode
A       3    1
B       4    1
Here is one way to do it with product from the Python standard library's itertools module and the pandas at property:
from itertools import product

for row, (col, func) in product(["A", "B"], [("A", "max"), ("B", "min")]):
    new_df.at[row, col] = df.loc[
        (df["Mode"] == row) & (df["Indicator"] == new_df.loc[row, func]), "Value"
    ].values[0]
new_df = new_df.astype(int)
Then:
print(new_df)
# Output
      max  min  A  B
Mode
A       3    1  2  6
B       4    1  3  7
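A more compact variant of the same idea (a sketch, assuming each group's max/min row is unique) looks up 'Value' at the idxmax/idxmin row positions directly:
g = df.groupby("Mode")["Indicator"]
out = g.agg(["max", "min"])
# idxmax/idxmin return the row labels of the extremes, in the same Mode order as agg
out["A"] = df.loc[g.idxmax(), "Value"].to_numpy()
out["B"] = df.loc[g.idxmin(), "Value"].to_numpy()
print(out)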

Re-define dataframe index with map function

I have a dataframe like this. I want to know how I can apply a map function to its index to rename it into an easier format.
df = pd.DataFrame({'d': [1, 2, 3, 4]}, index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])
df
            d
apple_017   1
orange_054  2
orange_061  3
orange_053  4
There are only two labels in the indices of the dataframe, so it's either apple or orange in this case, and this is how I tried:
df.index = df.index.map(i = "apple" if "apple" in i else "orange")
(Apparently it's not how it works)
Desired output:
        d
apple   1
orange  2
orange  3
orange  4
Appreciate anyone's help and suggestion!
Try via split():
df.index = df.index.str.split('_').str[0]
OR via map():
df.index = df.index.map(lambda x: 'apple' if 'apple' in x else 'orange')
output of df:
        d
apple   1
orange  2
orange  3
orange  4
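If the labels always follow a name_number pattern, a regex variant (a sketch under that assumption) also works:
# Strip the trailing "_<digits>" suffix from every index label
df.index = df.index.str.replace(r'_\d+$', '', regex=True)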

How to flat a string to several columns in pandas?

fruit = pd.DataFrame({'type': ['apple: 1 orange: 2 pear: 3']})
I want to flatten the dataframe and get the format below:
apple  orange  pear
    1       2     3
Thanks
You are making your life extremely difficult if you work with multiple values in a single field. You can use essentially none of the pandas functions, because they all assume the data in a field belongs together and should stay together.
For instance with
In [10]: fruit = pd.Series({'apple': 1, 'orange': 2, 'pear': 3})

In [11]: fruit
Out[11]:
apple     1
orange    2
pear      3
dtype: int64
you could easily transform your data as in
In [14]: fruit.to_frame()
Out[14]:
        0
apple   1
orange  2
pear    3

In [15]: fruit.to_frame().T
Out[15]:
   apple  orange  pear
0      1       2     3
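If you are stuck with the original single-cell string, one way to parse it first (a sketch, assuming the 'name: value' pattern from the question holds throughout) is a plain regex:
import re
import pandas as pd

fruit = pd.DataFrame({'type': ['apple: 1 orange: 2 pear: 3']})
# Pull out every 'name: value' pair and build one column per name
pairs = re.findall(r'(\w+):\s*(\d+)', fruit.loc[0, 'type'])
flat = pd.DataFrame([{name: int(value) for name, value in pairs}])
print(flat)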

How can I create a dataframe column which counts the occurrence of each value in another column?

I am trying to add a column to my dataframe, which will hold a value which represents the number of times a unique value has appeared in another column.
For example, I have the following dataframe:
Date     | Team  | Goals
22.08.20 | Team1 | 4
22.08.20 | Team2 | 3
22.08.20 | Team3 | 1
22.09.20 | Team1 | 4
22.09.20 | Team3 | 5
I would like to add a counter column, which counts how often each team appears:
Date     | Team  | Goals | Count
22.08.20 | Team1 | 4     | 1
22.08.20 | Team2 | 3     | 1
22.08.20 | Team3 | 1     | 1
22.09.20 | Team1 | 4     | 2
22.09.20 | Team3 | 5     | 2
My dataframe is ordered by date, so the teams should appear in the correct order.
Apologies, I'm very new to pandas and Stack Overflow, so please let me know if I should format this question differently. Thanks
TRY:
df['Count'] = df.groupby('Team').cumcount().add(1)
OUTPUT:
       Date   Team  Goals  Count
0  22.08.20  Team1      4      1
1  22.08.20  Team2      3      1
2  22.08.20  Team3      1      1
3  22.09.20  Team1      4      2
4  22.09.20  Team3      5      2
Another answer building upon @Nk03's, with reproducible results:
import pandas as pd
import numpy as np
# Set numpy random seed
np.random.seed(42)
# Create dates array
dates = pd.date_range(start='2021-06-01', periods=10, freq='D')
# Create teams array
teams_names = ['Team 1', 'Team 2', 'Team 3']
teams = [teams_names[i] for i in np.random.randint(0, 3, 10)]
# Create goals array
goals = np.random.randint(1, 6, 10)
# Create DataFrame
data = pd.DataFrame({'Date': dates,
                     'Team': teams,
                     'Goals': goals})
# Cumulative count of teams
data['Count'] = data.groupby('Team').cumcount().add(1)
The output will be:
        Date    Team  Goals  Count
0 2021-06-01  Team 2      3      1
1 2021-06-02  Team 2      1      2
2 2021-06-03  Team 2      4      3
3 2021-06-04  Team 1      2      1
4 2021-06-05  Team 2      4      4
5 2021-06-06  Team 1      2      2
6 2021-06-07  Team 2      2      5
7 2021-06-08  Team 3      4      1
8 2021-06-09  Team 3      5      2
9 2021-06-10  Team 1      2      3
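If instead you want the total number of appearances of each team on every row (rather than a running count), groupby plus transform would do it; a minimal sketch:
# Total appearances of each team, broadcast back to every row
data['Total'] = data.groupby('Team')['Team'].transform('count')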

Check if list cell contains value

Having a dataframe like this:
   month transactions_ids
0      1         [0, 5, 1]
1      2            [7, 4]
2      3    [8, 10, 9, 11]
3      6               [2]
4      9               [3]
For a given transaction_id, I would like to get the month when it took place. Notice that a transaction_id can only be related to one single month.
So for example, given transaction_id = 4, the month would be 2.
I know this can be done in a loop by looking month by month if the transactions_ids related contain the given transaction_id, but I'm wondering if there is any way more efficient than that.
Cheers
The best way in my opinion is to explode your data frame and avoid having Python lists in your cells.
df = df.explode('transactions_ids')
which outputs
   month transactions_ids
0      1                0
0      1                5
0      1                1
1      2                7
1      2                4
2      3                8
2      3               10
2      3                9
2      3               11
3      6                2
4      9                3
Then, simply
id_to_find = 1 # example
df.loc[df.transactions_ids == id_to_find, 'month']
P.S: be aware of the duplicated indexes that explode outputs. In general, it is better to do explode(...).reset_index(drop=True) for most cases to avoid unwanted behavior.
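If you need many lookups, building a reverse mapping once (a sketch, relying on the fact that each id belongs to a single month) avoids rescanning the frame each time:
# One-time reverse index: transaction_id -> month
# (explode is a no-op if the frame was already exploded above)
lookup = df.explode('transactions_ids').set_index('transactions_ids')['month']
print(lookup.loc[4])  # 2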
You can use pandas string methods to find the id in the "list" (it's really just a string as far as pandas is concerned when read in using StringIO):
import pandas as pd
from io import StringIO
data = StringIO("""
month transactions_ids
1 [0,5,1]
2 [7,4]
3 [8,10,9,11]
6 [2]
9 [3]
""")
df = pd.read_csv(data, delim_whitespace=True)
df.loc[df['transactions_ids'].str.contains('4'), 'month']
Note that a bare substring match like '4' would also hit ids such as 14 or 41; a word-boundary regex, e.g. str.contains(r'\b4\b'), is safer.
In case your transactions_ids are real lists, then you can use map to check for membership:
df['transactions_ids'].map(lambda x: 3 in x)
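To actually pull the month out with that boolean mask (a sketch; 4 is just an example id):
target = 4  # hypothetical id to look up
months = df.loc[df['transactions_ids'].map(lambda x: target in x), 'month']
print(months.iloc[0])  # 2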