How to calculate multiple columns from multiple columns in pandas - pandas

I am trying to calculate multiple colums from multiple columns in a pandas dataframe using a function.
The function takes three arguments -a-, -b-, and -c- and and returns three calculated values -sum-, -prod- and -quot-. In my pandas data frame I have three coumns -a-, -b- and and -c- from which I want to calculate the columns -sum-, -prod- and -quot-.
The mapping that I do works only when I have exactly three rows. I do not know what is going wrong, although I expect that it has to do something with selecting the correct axis. Could someone explain what is happening and how I can calculate the values that I would like to have.
Below are the situations that I have tested.
INITIAL VALUES
def sum_prod_quot(a,b,c):
sum = a + b + c
prod = a * b * c
quot = a / b / c
return (sum, prod, quot)
df = pd.DataFrame({ 'a': [20, 100, 18],
'b': [ 5, 10, 3],
'c': [ 2, 10, 6],
'd': [ 1, 2, 3]
})
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
CALCULATION STEPS
Using exactly three rows
When I calculate three columns from this dataframe and using the function function I get:
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
This is exactly the result that I want to have: The sum-column has the sum of the elements in the columns a,b,c; the prod-column has the product of the elements in the columns a,b,c and the quot-column has the quotients of the elements in the columns a,b,c.
Using more than three rows
When I expand the dataframe with one row, I get an error!
The data frame is defined as:
df = pd.DataFrame({ 'a': [20, 100, 18, 40],
'b': [ 5, 10, 3, 10],
'c': [ 2, 10, 6, 4],
'd': [ 1, 2, 3, 4]
})
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
3 40 10 4 4
The call is
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: too many values to unpack (expected 3)
while I would expect an extra row:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
3 40 10 4 4 54.0 1600.0 1.0
Using less than three rows
When I reduce tthe dataframe with one row I get also an error.
The dataframe is defined as:
df = pd.DataFrame({ 'a': [20, 100],
'b': [ 5, 10],
'c': [ 2, 10],
'd': [ 1, 2]
})
df
a b c d
0 20 5 2 1
1 100 10 10 2
The call is
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: need more than 2 values to unpack
while I would expect a row less:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
QUESTIONS
The questions I have:
1) Why do I get these errors?
2) How do I have to modify the call such that I get the desired data frame?
NOTE
In this link a similar question is asked, but the given answer did not work for me.

The answer doesn't seem correct for 3 rows as well. Can you check other values except first row and first column. Looking at the results, product of 20*5*2 is NOT 120, it's 200 and is placed below in sum column. You need to form list in correct way before assigning to new columns. You can try use following to set the new columns:
df['sum'], df['prod'], df['quot'] = zip(*map(sum_prod_quot, df['a'], df['b'], df['c']))
For details follow the link

Related

groupby to show same row value from other columns

After groupby by "Mode" column and take out the value from "indicator" of "max, min", how to let the relative value to show in the same dataframe like below:
df = pd.read_csv(r'relative.csv')
Grouped = df.groupby('Mode')['Indicator'].agg(['max', 'min'])
print(Grouped)
(from google, maybe can use from col_value or row_value function, but seem be more complicated, could someone can help to solve it by easy ways? thank you.)
You can do it in two steps, using groupby and idxmin() or idxmix():
# Create a df with the min values of 'Indicator', renaming the column 'Value' to 'B'
min = df.loc[df.groupby('Mode')['Indicator'].idxmin()].reset_index(drop=True).rename(columns={'Indicator': 'min', 'Value': 'B'})
print(min)
# Mode min B
# 0 A 1 6
# 1 B 1 7
# Create a df with the max values of 'Indicator', renaming the column 'Value' to 'A'
max = df.loc[df.groupby('Mode')['Indicator'].idxmax()].reset_index(drop=True).rename(columns={'Indicator': 'max', 'Value': 'A'})
print(max)
# Mode max A
# 0 A 3 2
# 1 B 4 3
# Merge the dataframes together
result = pd.merge(min, max)
# reorder the columns to match expected output
print(result[['Mode', 'max','min','A', 'B']])
# Mode max min A B
# 0 A 3 1 2 6
# 1 B 4 1 3 7
The logic is unclear, there is no real reason why you would call your columns A/B since the 6/3 values in it are not coming from A/B.
I assume you want to achieve:
(df.groupby('Mode')['Indicator'].agg(['idxmax', 'idxmin'])
.rename(columns={'idxmin': 'min', 'idxmax': 'max'}).stack()
.to_frame('x').merge(df, left_on='x', right_index=True)
.drop(columns=['x', 'Mode']).unstack()
)
Output:
Indicator Value
max min max min
Mode
A 3 1 2 6
B 4 1 3 7
C 10 10 20 20
Used input:
Mode Indicator Value
0 A 1 6
1 A 2 5
2 A 3 2
3 B 4 3
4 B 3 6
5 B 2 8
6 B 1 7
7 C 10 20
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
{
"Mode": ["A", "A", "A", "B", "B", "B", "B"],
"Indicator": [1, 2, 3, 4, 3, 2, 1],
"Value": [6, 5, 2, 3, 6, 8, 7],
}
)
new_df = df.groupby("Mode")["Indicator"].agg(["max", "min"])
print(new_df)
# Output
max min
Mode
A 3 1
B 4 1
Here is one way to do it with product from Python standard library's itertools module and Pandas at property:
from itertools import product
for row, (col, func) in product(["A", "B"], [("A", "max"), ("B", "min")]):
new_df.at[row, col] = df.loc[
(df["Mode"] == row) & (df["Indicator"] == new_df.loc[row, func]), "Value"
].values[0]
new_df = new_df.astype(int)
Then:
print(new_df)
# Output
max min A B
Mode
A 3 1 2 6
B 4 1 3 7

how to generate random numbers that can be summed to a specific value?

I have 2 dataframe as follows:
import pandas as pd
import numpy as np
# Create data set.
dataSet1 = {'id': ['A', 'B', 'C'],
'value' : [9,20,20]}
dataSet2 = {'id' : ['A', 'A','A','B','B','B','C'],
'id_2': [1, 2, 3, 2,3,4,1]}
# Create dataframe with data set and named columns.
df_map1 = pd.DataFrame(dataSet1, columns= ['id', 'value'])
df_map2 = pd.DataFrame(dataSet2, columns= ['id','id_2'])
df_map1
id value
0 A 9
1 B 20
2 C 20
df_map2
id id_2
0 A 1
1 A 2
2 A 3
3 B 2
4 B 3
5 B 4
6 C 1
where id_2 can have dups of id. (namely id_2 is subset of id)
#doing a quick merge, based on id.
df = df_map1.merge(df_map2 ,on=['id'])
id value id_2
0 A 9 1
1 A 9 2
2 A 9 3
3 B 20 2
4 B 20 3
5 B 20 4
6 C 20 1
I can represent what's the relationship between id and id_2 as follows
id_ref = df.groupby('id')['id_2'].apply(list).to_dict()
{'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [1]}
Now, I would like to generate random integer say 0 to 3 put the list (5 elements for exmaple) into the pandas df and explode.
import numpy as np
import random
df['random_value'] = df.apply(lambda _: np.random.randint(0,3, 5), axis=1)
id value id_2 random_value
0 A 9 1 [0, 0, 0, 0, 1]
1 A 9 2 [0, 2, 1, 2, 1]
2 A 9 3 [0, 1, 2, 2, 1]
3 B 20 2 [2, 1, 1, 2, 2]
4 B 20 3 [0, 0, 0, 0, 0]
5 B 20 4 [1, 0, 0, 1, 0]
6 C 20 1 [1, 2, 2, 2, 1]
The condition for generating this random_value list, is that sum of the list has to be equal to 9.
That means, for id : A, if we sum all the elements inside the list, we have total of 13 shown the description below, but what we want is 9:
and same concept for id B and C.. and so on....
is there anyway to achieve this?
# i was looking into multinomial from np.random function... seems this should be the solution but im not sure how to apply this with pandas.
np.random.multinomial(9, np.ones(5)/5, size = 1)[0]
=> array([2,3,3,0,1])
2+3+3+0+1 = 9
ATTEMPT/IDEA ...
given that we have list of id_2. ie) id: A has 3 distinct elements [1,2,3].
so id A is mapped to 3 different elements. so we can get
3 * 5 = 15 ( which will be our long list )
3: length of list
5: create 5 elements of list
hence
list_A = np.random.multinomial(9,np.ones(3*5)/(3*5) ,size = 1)[0]
and then we evenly distribute/split the list.
using this list comprehension:
[list_A [i:i + n] for i in range(0, len(list_A ), n)]
but I am still unsure how to do this dynamically.
The core idea is as you said (about getting 3*5=15 numbers), plus reshaping it into a 2D array with the same number of rows as that id has in the dataframe. The following function does that,
def generate_random_numbers(df):
value = df['value'].iloc[0]
list_len = 5
num_rows = len(df)
num_rand = list_len*num_rows
return pd.Series(
map(list, np.random.multinomial(value, np.ones(num_rand)/num_rand).reshape(num_rows, -1)),
df.index
)
And apply it:
df['random_value'] = df.groupby(['id', 'value'], as_index=False).apply(generate_random_numbers).droplevel(0)

how to find average of a couple of columns and make a new data set?

I have a data set how I can find the average of numerical columns for each 30 rows and make a smaller data set?
For example:
data
A B C
2. 4 f
3 1 d
4 2 p
3 1 l
2 0 g
Lets I wanna find the average of each 2 rows (for simplification).
A B C
(2+3)/2 (4+1)/2 d
(4+3)/2 (2+1)/2 l
2 0 g
Try dividing the index by n to get groups then use groupby aggregate:
n = 2
new_df = df.groupby(df.index // n).agg({'A': 'mean', 'B': 'mean', 'C': 'last'})
new_df:
A B C
0 2.5 2.5 d
1 3.5 1.5 l
2 2.0 0.0 g
Or the aggregation dict can be built programmatically by dtype here object dtypes are aggregated by taking the last value, all other types are aggregated with the mean:
import numpy as np
n = 2
aggregation_d = dict(
zip(df.columns, np.where(df.dtypes != 'object', 'mean', 'last'))
)
new_df = df.groupby(df.index // n).agg(aggregation_d)
aggregation_d:
{'A': 'mean', 'B': 'mean', 'C': 'last'}
new_df:
A B C
0 2.5 2.5 d
1 3.5 1.5 l
2 2.0 0.0 g
Explanation:
df.index // n
Creates a Series of values:
Int64Index([0, 0, 1, 1, 2], dtype='int64')
These values establish a group.
Then the dict of values specifies how to aggregate each group:
{'A': 'mean', 'B': 'mean', 'C': 'last'}
Take the mean of A and B, and the last value of C.

Adding the lower levels of two Pandas MultiIndex columns

I have the following DataFrame:
import pandas as pd
columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'],
['p', 'm', 'p', 'm']])
values = [
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
]
df = pd.DataFrame(values, columns=columns)
n1 n2
p m p m
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Now I want to add another column (n3) to this DataFrame whose lower-level columns p and m should be the sums of the corresponding lower-level columns of n1 and n2:
n1 n2 n3
p m p m p m
0 1 2 3 4 4 6
1 5 6 7 8 12 14
2 9 10 11 12 20 22
Here's the code I came up with:
n3 = df[['n1', 'n2']].sum(axis=1, level=1)
level1 = df.columns.levels[1]
n3.columns = pd.MultiIndex.from_arrays([['n3'] * len(level1), level1])
df = pd.concat([df, n3], axis=1)
This does what I want, but feels very cumbersome compared to code that doesn't use MultiIndex columns:
df['n3'] = df[['n1', 'n2']].sum(axis=1)
My current code also only works for a column MultiIndex consisting of two levels, and I'd be interested in doing this for arbitrary levels.
What's a better way of doing this?
One way to do so with stack and unstack:
new_df = df.stack(level=1)
new_df['n3'] = new_df.sum(axis=1)
new_df.unstack(level=-1)
Output:
n1 n2 n3
m p m p m p
0 2 1 4 3 6 4
1 6 5 8 7 14 12
2 10 9 12 11 22 20
If you build the structure like:
df['n3','p']=1
df['n3','m']=1
then you can write:
df['n3'] = df[['n1', 'n2']].sum(axis=1, level=1)
Here's another way that I just discovered which does not reorder the columns:
# Sum column-wise on level 1
s = df.loc[:, ['n1', 'n2']].sum(axis=1, level=1)
# Prepend a column level
s = pd.concat([s], keys=['n3'], axis=1)
# Add column to DataFrame
df = pd.concat([df, s], axis=1)

Pandas: How to obtain top 2, middle 2 and bottom 2 rows in a each group

Let's say I have a dataframe df as below. To obtain 1st 2 and last 2 in each group I have used groupby.nth
df = pd.DataFrame({'A': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b'],
'B': [1, 2, 3, 4, 5,6,7,8,1, 2, 3, 4, 5,6,7]}, columns=['A', 'B'])
df.groupby('A').nth([0,1,-2,-1])
Result:
B
A
a 1
a 2
a 7
a 8
b 1
b 2
b 6
b 7
I'm not sure how to obtain the middle 2 rows. For example, in group 'A' there are 8 instances so my middle would be 4, 5 (n/2, n/2+1) and group 'B' my middle rows would be 3, 4 (n/2-0.5, n/2+0.5). Any guidance is appreciated.
sacul's answer is nice , Here I just follow your own idea def a customize function
def middle(x):
if len(x) % 2 == 0:
return x.iloc[int(len(x) / 2) - 1:int(len(x) / 2) + 1]
else:
return x.iloc[int((len(x) / 2 - 0.5)) - 1:int(len(x) / 2 + 0.5)]
pd.concat([middle(y) for _ , y in df.groupby('A')])
Out[25]:
A B
3 a 4
4 a 5
10 b 3
11 b 4
You can use iloc to find the n//2 -1 and n//2 indices for each group (// is floor division):
g = df.groupby('A')
g.apply(lambda x: x['B'].iloc[[len(x)//2-1, len(x)//2]])
A
a 3 4
4 5
b 10 3
11 4
Name: B, dtype: int64