how to generate random numbers that can be summed to a specific value? - pandas

I have 2 dataframe as follows:
import pandas as pd
import numpy as np
# Create data set.
dataSet1 = {'id': ['A', 'B', 'C'],
'value' : [9,20,20]}
dataSet2 = {'id' : ['A', 'A','A','B','B','B','C'],
'id_2': [1, 2, 3, 2,3,4,1]}
# Create dataframe with data set and named columns.
df_map1 = pd.DataFrame(dataSet1, columns= ['id', 'value'])
df_map2 = pd.DataFrame(dataSet2, columns= ['id','id_2'])
df_map1
id value
0 A 9
1 B 20
2 C 20
df_map2
id id_2
0 A 1
1 A 2
2 A 3
3 B 2
4 B 3
5 B 4
6 C 1
where id_2 can have dups of id. (namely id_2 is subset of id)
#doing a quick merge, based on id.
df = df_map1.merge(df_map2 ,on=['id'])
id value id_2
0 A 9 1
1 A 9 2
2 A 9 3
3 B 20 2
4 B 20 3
5 B 20 4
6 C 20 1
I can represent what's the relationship between id and id_2 as follows
id_ref = df.groupby('id')['id_2'].apply(list).to_dict()
{'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [1]}
Now, I would like to generate random integer say 0 to 3 put the list (5 elements for exmaple) into the pandas df and explode.
import numpy as np
import random
df['random_value'] = df.apply(lambda _: np.random.randint(0,3, 5), axis=1)
id value id_2 random_value
0 A 9 1 [0, 0, 0, 0, 1]
1 A 9 2 [0, 2, 1, 2, 1]
2 A 9 3 [0, 1, 2, 2, 1]
3 B 20 2 [2, 1, 1, 2, 2]
4 B 20 3 [0, 0, 0, 0, 0]
5 B 20 4 [1, 0, 0, 1, 0]
6 C 20 1 [1, 2, 2, 2, 1]
The condition for generating this random_value list, is that sum of the list has to be equal to 9.
That means, for id : A, if we sum all the elements inside the list, we have total of 13 shown the description below, but what we want is 9:
and same concept for id B and C.. and so on....
is there anyway to achieve this?
# i was looking into multinomial from np.random function... seems this should be the solution but im not sure how to apply this with pandas.
np.random.multinomial(9, np.ones(5)/5, size = 1)[0]
=> array([2,3,3,0,1])
2+3+3+0+1 = 9
ATTEMPT/IDEA ...
given that we have list of id_2. ie) id: A has 3 distinct elements [1,2,3].
so id A is mapped to 3 different elements. so we can get
3 * 5 = 15 ( which will be our long list )
3: length of list
5: create 5 elements of list
hence
list_A = np.random.multinomial(9,np.ones(3*5)/(3*5) ,size = 1)[0]
and then we evenly distribute/split the list.
using this list comprehension:
[list_A [i:i + n] for i in range(0, len(list_A ), n)]
but I am still unsure how to do this dynamically.

The core idea is as you said (about getting 3*5=15 numbers), plus reshaping it into a 2D array with the same number of rows as that id has in the dataframe. The following function does that,
def generate_random_numbers(df):
value = df['value'].iloc[0]
list_len = 5
num_rows = len(df)
num_rand = list_len*num_rows
return pd.Series(
map(list, np.random.multinomial(value, np.ones(num_rand)/num_rand).reshape(num_rows, -1)),
df.index
)
And apply it:
df['random_value'] = df.groupby(['id', 'value'], as_index=False).apply(generate_random_numbers).droplevel(0)

Related

groupby to show same row value from other columns

After groupby by "Mode" column and take out the value from "indicator" of "max, min", how to let the relative value to show in the same dataframe like below:
df = pd.read_csv(r'relative.csv')
Grouped = df.groupby('Mode')['Indicator'].agg(['max', 'min'])
print(Grouped)
(from google, maybe can use from col_value or row_value function, but seem be more complicated, could someone can help to solve it by easy ways? thank you.)
You can do it in two steps, using groupby and idxmin() or idxmix():
# Create a df with the min values of 'Indicator', renaming the column 'Value' to 'B'
min = df.loc[df.groupby('Mode')['Indicator'].idxmin()].reset_index(drop=True).rename(columns={'Indicator': 'min', 'Value': 'B'})
print(min)
# Mode min B
# 0 A 1 6
# 1 B 1 7
# Create a df with the max values of 'Indicator', renaming the column 'Value' to 'A'
max = df.loc[df.groupby('Mode')['Indicator'].idxmax()].reset_index(drop=True).rename(columns={'Indicator': 'max', 'Value': 'A'})
print(max)
# Mode max A
# 0 A 3 2
# 1 B 4 3
# Merge the dataframes together
result = pd.merge(min, max)
# reorder the columns to match expected output
print(result[['Mode', 'max','min','A', 'B']])
# Mode max min A B
# 0 A 3 1 2 6
# 1 B 4 1 3 7
The logic is unclear, there is no real reason why you would call your columns A/B since the 6/3 values in it are not coming from A/B.
I assume you want to achieve:
(df.groupby('Mode')['Indicator'].agg(['idxmax', 'idxmin'])
.rename(columns={'idxmin': 'min', 'idxmax': 'max'}).stack()
.to_frame('x').merge(df, left_on='x', right_index=True)
.drop(columns=['x', 'Mode']).unstack()
)
Output:
Indicator Value
max min max min
Mode
A 3 1 2 6
B 4 1 3 7
C 10 10 20 20
Used input:
Mode Indicator Value
0 A 1 6
1 A 2 5
2 A 3 2
3 B 4 3
4 B 3 6
5 B 2 8
6 B 1 7
7 C 10 20
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
{
"Mode": ["A", "A", "A", "B", "B", "B", "B"],
"Indicator": [1, 2, 3, 4, 3, 2, 1],
"Value": [6, 5, 2, 3, 6, 8, 7],
}
)
new_df = df.groupby("Mode")["Indicator"].agg(["max", "min"])
print(new_df)
# Output
max min
Mode
A 3 1
B 4 1
Here is one way to do it with product from Python standard library's itertools module and Pandas at property:
from itertools import product
for row, (col, func) in product(["A", "B"], [("A", "max"), ("B", "min")]):
new_df.at[row, col] = df.loc[
(df["Mode"] == row) & (df["Indicator"] == new_df.loc[row, func]), "Value"
].values[0]
new_df = new_df.astype(int)
Then:
print(new_df)
# Output
max min A B
Mode
A 3 1 2 6
B 4 1 3 7

Pandas add column with values from two df on partly matching column

I have an easy question most likely but still am stuck on how to solve what I want.
I have a two dataframes which match one column "giftID" and want to create a new column in df1 adding the values from df2 matching the giftID. I tried it with np.where and all different kinds but can't get it working.
df = pd.read_csv('../data/gifts.csv')
trip1 =df[:20].copy()
trip1['TripId']=0
subtours = [list(trip1['GiftId'])] * len(trip1)
trip1['Subtour'] = subtours
trip2 = df[20:41].copy()
#trip2['Subtour'] = [s]*len(trip2)
trip2['TripId']=1
trip2['Subtour'] = subtours = [list(trip2['GiftId'])] * len(trip2)
mini_tour = trip1.append(trip2)
grouped = mini_tour.groupby('TripId')
SA = Simulated_Anealing()
wrw = 0
for name, trip in grouped:
tourId = trip['TripId'].unique()[0]
optimized_trip,wrw_c = SA.simulated_annealing(trip)
wrw += wrw_c
subtours = [optimized_trip]*len(trip)
mask = mini_tour['TripId'] == tourId
mini_tour.loc[mask,'Subtour'] = 0
Input:
df giftID weight
1 A 4
2 B 5
3 C 6
4 D 7
5 E 12
df1 giftID subtour
1 A 1, 3, 4
2 B 1, 3, 4
3 C 1, 3, 4
df2 giftID subtour
1 D 2, 5, 8
2 E 2, 5, 8
Output:
df giftID weight subtour
1 A 4 1, 3, 4
2 B 5 1, 3, 4
3 C 6 1, 3, 4
4 D 7 2, 5, 8
5 E 12 2, 5, 8
Firstly, you can pd.concat, df1 and df2
import pandas pd
df12 = pd.concat([df1,df2],axis=0) # axis = 0 means row wise
Then merge the df12 with your main one:
df_merge = pd.merge(df,df12,how='left',left_on='giftID',right_on='gift')

How to add values to an multiindexed column dataframe

My dataframe is
a b
1 2 1 2
0 0.281045 0.975469 -0.538213 -0.180008
1 0.128696 1.875480 0.247637 -0.047927
I want to insert the matrix to (a,3), (b, 3)
[[1, 1],
[1, 1]]
a b
1 2 3 1 2 3
0 0.281045 0.975469 1. -0.538213 -0.180008 1.
1 0.128696 1.875480 1. 0.247637 -0.047922 1.
It seems like there is no decent way to add value to the multiindex dataframe, Here is the code that I tried:
df[:,:,3] = [[1, 1],
[1, 1]]```
But it didn't work...
You can create new DataFrame with MultiIndex and then append to data by DataFrame.join with sorting MultiIndex:
arr = np.array([[1, 1],[1, 1]])
df1 = pd.DataFrame(arr,
index=df.index,
columns= pd.MultiIndex.from_product([df.columns.levels[0], [3]]))
df = df.join(df1).sort_index(axis=1)
print (df)
a b
1 2 3 1 2 3
0 0.281045 0.975469 1 -0.538213 -0.180008 1
1 0.128696 1.875480 1 0.247637 -0.047927 1

Adding the lower levels of two Pandas MultiIndex columns

I have the following DataFrame:
import pandas as pd
columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'],
['p', 'm', 'p', 'm']])
values = [
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
]
df = pd.DataFrame(values, columns=columns)
n1 n2
p m p m
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Now I want to add another column (n3) to this DataFrame whose lower-level columns p and m should be the sums of the corresponding lower-level columns of n1 and n2:
n1 n2 n3
p m p m p m
0 1 2 3 4 4 6
1 5 6 7 8 12 14
2 9 10 11 12 20 22
Here's the code I came up with:
n3 = df[['n1', 'n2']].sum(axis=1, level=1)
level1 = df.columns.levels[1]
n3.columns = pd.MultiIndex.from_arrays([['n3'] * len(level1), level1])
df = pd.concat([df, n3], axis=1)
This does what I want, but feels very cumbersome compared to code that doesn't use MultiIndex columns:
df['n3'] = df[['n1', 'n2']].sum(axis=1)
My current code also only works for a column MultiIndex consisting of two levels, and I'd be interested in doing this for arbitrary levels.
What's a better way of doing this?
One way to do so with stack and unstack:
new_df = df.stack(level=1)
new_df['n3'] = new_df.sum(axis=1)
new_df.unstack(level=-1)
Output:
n1 n2 n3
m p m p m p
0 2 1 4 3 6 4
1 6 5 8 7 14 12
2 10 9 12 11 22 20
If you build the structure like:
df['n3','p']=1
df['n3','m']=1
then you can write:
df['n3'] = df[['n1', 'n2']].sum(axis=1, level=1)
Here's another way that I just discovered which does not reorder the columns:
# Sum column-wise on level 1
s = df.loc[:, ['n1', 'n2']].sum(axis=1, level=1)
# Prepend a column level
s = pd.concat([s], keys=['n3'], axis=1)
# Add column to DataFrame
df = pd.concat([df, s], axis=1)

How to calculate multiple columns from multiple columns in pandas

I am trying to calculate multiple colums from multiple columns in a pandas dataframe using a function.
The function takes three arguments -a-, -b-, and -c- and and returns three calculated values -sum-, -prod- and -quot-. In my pandas data frame I have three coumns -a-, -b- and and -c- from which I want to calculate the columns -sum-, -prod- and -quot-.
The mapping that I do works only when I have exactly three rows. I do not know what is going wrong, although I expect that it has to do something with selecting the correct axis. Could someone explain what is happening and how I can calculate the values that I would like to have.
Below are the situations that I have tested.
INITIAL VALUES
def sum_prod_quot(a,b,c):
sum = a + b + c
prod = a * b * c
quot = a / b / c
return (sum, prod, quot)
df = pd.DataFrame({ 'a': [20, 100, 18],
'b': [ 5, 10, 3],
'c': [ 2, 10, 6],
'd': [ 1, 2, 3]
})
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
CALCULATION STEPS
Using exactly three rows
When I calculate three columns from this dataframe and using the function function I get:
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
This is exactly the result that I want to have: The sum-column has the sum of the elements in the columns a,b,c; the prod-column has the product of the elements in the columns a,b,c and the quot-column has the quotients of the elements in the columns a,b,c.
Using more than three rows
When I expand the dataframe with one row, I get an error!
The data frame is defined as:
df = pd.DataFrame({ 'a': [20, 100, 18, 40],
'b': [ 5, 10, 3, 10],
'c': [ 2, 10, 6, 4],
'd': [ 1, 2, 3, 4]
})
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
3 40 10 4 4
The call is
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: too many values to unpack (expected 3)
while I would expect an extra row:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
3 40 10 4 4 54.0 1600.0 1.0
Using less than three rows
When I reduce tthe dataframe with one row I get also an error.
The dataframe is defined as:
df = pd.DataFrame({ 'a': [20, 100],
'b': [ 5, 10],
'c': [ 2, 10],
'd': [ 1, 2]
})
df
a b c d
0 20 5 2 1
1 100 10 10 2
The call is
df['sum'], df['prod'], df['quot'] = \
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: need more than 2 values to unpack
while I would expect a row less:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
QUESTIONS
The questions I have:
1) Why do I get these errors?
2) How do I have to modify the call such that I get the desired data frame?
NOTE
In this link a similar question is asked, but the given answer did not work for me.
The answer doesn't seem correct for 3 rows as well. Can you check other values except first row and first column. Looking at the results, product of 20*5*2 is NOT 120, it's 200 and is placed below in sum column. You need to form list in correct way before assigning to new columns. You can try use following to set the new columns:
df['sum'], df['prod'], df['quot'] = zip(*map(sum_prod_quot, df['a'], df['b'], df['c']))
For details follow the link