Pandas: How to obtain the top 2, middle 2 and bottom 2 rows in each group

Let's say I have a dataframe df as below. To obtain the first 2 and last 2 rows in each group, I have used groupby.nth:
df = pd.DataFrame({'A': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b'],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7]}, columns=['A', 'B'])
df.groupby('A').nth([0, 1, -2, -1])
Result:
   B
A
a  1
a  2
a  7
a  8
b  1
b  2
b  6
b  7
I'm not sure how to obtain the middle 2 rows. For example, group 'a' has 8 rows, so its middle rows are 4 and 5 (n/2, n/2 + 1), while group 'b' has 7 rows, so its middle rows are 3 and 4 (n/2 - 0.5, n/2 + 0.5). Any guidance is appreciated.

sacul's answer is nice. Here I just follow your own idea and define a custom function:
def middle(x):
    if len(x) % 2 == 0:
        return x.iloc[int(len(x) / 2) - 1:int(len(x) / 2) + 1]
    else:
        return x.iloc[int(len(x) / 2 - 0.5) - 1:int(len(x) / 2 + 0.5)]

pd.concat([middle(y) for _, y in df.groupby('A')])
Out[25]:
    A  B
3   a  4
4   a  5
10  b  3
11  b  4

You can use iloc to find the n//2 - 1 and n//2 indices for each group (// is floor division):
g = df.groupby('A')
g.apply(lambda x: x['B'].iloc[[len(x)//2 - 1, len(x)//2]])
A
a  3     4
   4     5
b  10    3
   11    4
Name: B, dtype: int64
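To get the top 2, middle 2 and bottom 2 together in one pass, here is a sketch building on the two answers above (six_rows is a hypothetical helper; note that positions n//2 - 1 and n//2 give the desired middle rows for both even and odd group sizes):
import pandas as pd

df = pd.DataFrame({'A': list('aaaaaaaa') + list('bbbbbbb'),
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7]})

def six_rows(g):
    # positions of the top 2, middle 2 and bottom 2 rows
    n = len(g)
    pos = [0, 1, n // 2 - 1, n // 2, n - 2, n - 1]
    return g.iloc[sorted(set(pos))]  # set() drops overlaps in very small groups

df.groupby('A', group_keys=False).apply(six_rows)
For group 'a' this returns B values 1, 2, 4, 5, 7, 8, and for group 'b' it returns 1, 2, 3, 4, 6, 7, matching the per-group results above.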

Related

Groupby to show same row value from other columns

After grouping by the "Mode" column and taking the max and min of the "Indicator" column, how can I get the corresponding "Value" entries to show in the same dataframe, like below:
df = pd.read_csv(r'relative.csv')
Grouped = df.groupby('Mode')['Indicator'].agg(['max', 'min'])
print(Grouped)
(From googling, it seems I could maybe use some col_value or row_value function, but that looks more complicated. Could someone help solve this in an easy way? Thank you.)
You can do it in two steps, using groupby and idxmin() or idxmax():
# Create a df with the min values of 'Indicator', renaming the column 'Value' to 'B'
min = df.loc[df.groupby('Mode')['Indicator'].idxmin()].reset_index(drop=True).rename(columns={'Indicator': 'min', 'Value': 'B'})
print(min)
#   Mode  min  B
# 0    A    1  6
# 1    B    1  7
# Create a df with the max values of 'Indicator', renaming the column 'Value' to 'A'
max = df.loc[df.groupby('Mode')['Indicator'].idxmax()].reset_index(drop=True).rename(columns={'Indicator': 'max', 'Value': 'A'})
print(max)
#   Mode  max  A
# 0    A    3  2
# 1    B    4  3
# Merge the dataframes together
result = pd.merge(min, max)
# Reorder the columns to match the expected output
print(result[['Mode', 'max', 'min', 'A', 'B']])
#   Mode  max  min  A  B
# 0    A    3    1  2  6
# 1    B    4    1  3  7
The logic is unclear; there is no real reason to call your columns A/B, since the values in them are not coming from columns A/B.
I assume you want to achieve:
(df.groupby('Mode')['Indicator'].agg(['idxmax', 'idxmin'])
   .rename(columns={'idxmin': 'min', 'idxmax': 'max'}).stack()
   .to_frame('x').merge(df, left_on='x', right_index=True)
   .drop(columns=['x', 'Mode']).unstack()
)
Output:
     Indicator     Value
           max min   max min
Mode
A            3   1     2   6
B            4   1     3   7
C           10  10    20  20
Used input:
  Mode  Indicator  Value
0    A          1      6
1    A          2      5
2    A          3      2
3    B          4      3
4    B          3      6
5    B          2      8
6    B          1      7
7    C         10     20
With the dataframe you provided:
import pandas as pd

df = pd.DataFrame(
    {
        "Mode": ["A", "A", "A", "B", "B", "B", "B"],
        "Indicator": [1, 2, 3, 4, 3, 2, 1],
        "Value": [6, 5, 2, 3, 6, 8, 7],
    }
)
new_df = df.groupby("Mode")["Indicator"].agg(["max", "min"])
print(new_df)
# Output
      max  min
Mode
A       3    1
B       4    1
Here is one way to do it with product from the Python standard library's itertools module and the pandas at accessor:
from itertools import product

for row, (col, func) in product(["A", "B"], [("A", "max"), ("B", "min")]):
    new_df.at[row, col] = df.loc[
        (df["Mode"] == row) & (df["Indicator"] == new_df.loc[row, func]), "Value"
    ].values[0]
new_df = new_df.astype(int)
Then:
print(new_df)
# Output
      max  min  A  B
Mode
A       3    1  2  6
B       4    1  3  7
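For completeness, a more compact variant (a sketch; it relies on the two groupby calls producing groups in the same Mode order) looks up "Value" directly at each group's idxmax/idxmin row labels:
idx = df.groupby("Mode")["Indicator"].agg(["idxmax", "idxmin"])
out = df.groupby("Mode")["Indicator"].agg(["max", "min"])
# Look up the 'Value' at the rows where each group's max/min occurs
out["A"] = df.loc[idx["idxmax"], "Value"].values
out["B"] = df.loc[idx["idxmin"], "Value"].values
print(out)
#       max  min  A  B
# Mode
# A       3    1  2  6
# B       4    1  3  7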

How to generate random numbers that sum to a specific value?

I have 2 dataframes as follows:
import pandas as pd
import numpy as np

# Create data sets.
dataSet1 = {'id': ['A', 'B', 'C'],
            'value': [9, 20, 20]}
dataSet2 = {'id': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
            'id_2': [1, 2, 3, 2, 3, 4, 1]}
# Create dataframes with the data sets and named columns.
df_map1 = pd.DataFrame(dataSet1, columns=['id', 'value'])
df_map2 = pd.DataFrame(dataSet2, columns=['id', 'id_2'])
df_map1
  id  value
0  A      9
1  B     20
2  C     20
df_map2
  id  id_2
0  A     1
1  A     2
2  A     3
3  B     2
4  B     3
5  B     4
6  C     1
where an id can appear multiple times in df_map2 (i.e. each id maps to a subset of id_2 values).
# Doing a quick merge, based on id.
df = df_map1.merge(df_map2, on=['id'])
  id  value  id_2
0  A      9     1
1  A      9     2
2  A      9     3
3  B     20     2
4  B     20     3
5  B     20     4
6  C     20     1
I can represent the relationship between id and id_2 as follows:
id_ref = df.groupby('id')['id_2'].apply(list).to_dict()
{'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [1]}
Now, I would like to generate lists of random integers (say 0 to 3, with 5 elements per list, for example), put them into the pandas df, and explode them.
import numpy as np
import random

df['random_value'] = df.apply(lambda _: np.random.randint(0, 3, 5), axis=1)
  id  value  id_2     random_value
0  A      9     1  [0, 0, 0, 0, 1]
1  A      9     2  [0, 2, 1, 2, 1]
2  A      9     3  [0, 1, 2, 2, 1]
3  B     20     2  [2, 1, 1, 2, 2]
4  B     20     3  [0, 0, 0, 0, 0]
5  B     20     4  [1, 0, 0, 1, 0]
6  C     20     1  [1, 2, 2, 2, 1]
The condition for generating these random_value lists is that the sum of all elements across a group's lists has to equal that group's value.
That means, for id A, if we sum all the elements inside its lists we currently get a total of 13, as shown above, but what we want is 9; the same concept applies to ids B and C, and so on.
Is there any way to achieve this?
# I was looking into multinomial from the np.random module... it seems this should be
# the solution, but I'm not sure how to apply it with pandas.
np.random.multinomial(9, np.ones(5)/5, size=1)[0]
# => array([2, 3, 3, 0, 1])
# 2 + 3 + 3 + 0 + 1 = 9
ATTEMPT/IDEA ...
Given the lists of id_2 values, id A has 3 distinct elements [1, 2, 3], so id A is mapped to 3 different rows. So we can get
3 * 5 = 15 (which will be our long list)
3: number of rows for the id
5: number of elements per list
hence
list_A = np.random.multinomial(9, np.ones(3*5)/(3*5), size=1)[0]
and then we evenly distribute/split the list using this list comprehension (with n = 5):
[list_A[i:i + n] for i in range(0, len(list_A), n)]
but I am still unsure how to do this dynamically.
The core idea is as you said (about getting 3*5 = 15 numbers), plus reshaping the result into a 2D array with the same number of rows as that id has in the dataframe. The following function does that:
def generate_random_numbers(df):
    value = df['value'].iloc[0]
    list_len = 5
    num_rows = len(df)
    num_rand = list_len * num_rows
    return pd.Series(
        map(list, np.random.multinomial(value, np.ones(num_rand) / num_rand).reshape(num_rows, -1)),
        df.index
    )
And apply it:
df['random_value'] = df.groupby(['id', 'value'], as_index=False).apply(generate_random_numbers).droplevel(0)
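As a quick sanity check (a sketch, not part of the answer above), each id's lists should now add up to that id's value:
# Sum every element of every list per id; this should match df_map1's 'value'
check = df.groupby('id')['random_value'].apply(lambda s: sum(map(sum, s)))
print(check)
# id
# A     9
# B    20
# C    20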

Pandas: add column with values from two dfs on a partly matching column

This is most likely an easy question, but I am still stuck on how to solve it.
I have two dataframes which share the column "giftID", and I want to create a new column in my main df, adding the subtour values from df1/df2 that match each giftID. I tried it with np.where and all kinds of different approaches but can't get it working.
df = pd.read_csv('../data/gifts.csv')
trip1 = df[:20].copy()
trip1['TripId'] = 0
subtours = [list(trip1['GiftId'])] * len(trip1)
trip1['Subtour'] = subtours
trip2 = df[20:41].copy()
#trip2['Subtour'] = [s]*len(trip2)
trip2['TripId'] = 1
trip2['Subtour'] = subtours = [list(trip2['GiftId'])] * len(trip2)
mini_tour = trip1.append(trip2)
grouped = mini_tour.groupby('TripId')
SA = Simulated_Anealing()
wrw = 0
for name, trip in grouped:
    tourId = trip['TripId'].unique()[0]
    optimized_trip, wrw_c = SA.simulated_annealing(trip)
    wrw += wrw_c
    subtours = [optimized_trip] * len(trip)
    mask = mini_tour['TripId'] == tourId
    mini_tour.loc[mask, 'Subtour'] = 0
Input:
df   giftID  weight
1    A       4
2    B       5
3    C       6
4    D       7
5    E       12
df1  giftID  subtour
1    A       1, 3, 4
2    B       1, 3, 4
3    C       1, 3, 4
df2  giftID  subtour
1    D       2, 5, 8
2    E       2, 5, 8
Output:
df   giftID  weight  subtour
1    A       4       1, 3, 4
2    B       5       1, 3, 4
3    C       6       1, 3, 4
4    D       7       2, 5, 8
5    E       12      2, 5, 8
Firstly, you can pd.concat df1 and df2:
import pandas as pd
df12 = pd.concat([df1, df2], axis=0)  # axis=0 means row-wise
Then merge df12 with your main one (both frames share the 'giftID' column):
df_merge = pd.merge(df, df12, how='left', on='giftID')
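Alternatively (a sketch under the same column names), a Series.map lookup avoids the merge entirely:
# Build a giftID -> subtour lookup from the concatenated frames
lookup = df12.set_index('giftID')['subtour']
df['subtour'] = df['giftID'].map(lookup)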

Finding duplicate entries

I am working with the 515k Hotel Reviews dataset from Kaggle. There are 1492 unique hotel names and 1493 unique addresses. So at first it would appear that one (or possibly more) hotel has more than one address. But, if I do a groupby.count on the data, I get 1494 whether I groupby HotelName followed by Address or if I reverse the order.
In order to make this reproducible, hopefully this simplification will suffice:
data = {
    'HotelName': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'B', 'C', 'C'],
    'Address': [1, 2, 3, 4, 1, 2, 3, 4, 2, 2, 2, 3, 5]
}
df = pd.DataFrame(data, columns=['HotelName', 'Address'])
df['HotelName'].unique().shape[0]  # Returns 4
df['Address'].unique().shape[0]  # Returns 5
df.groupby(['Address', 'HotelName']).count().shape[0]  # Returns 6
df.groupby(['HotelName', 'Address']).count().shape[0]  # Returns 6
I would like to find the hotel names that have different addresses. So in my example, I would like to find the A and C along with their addresses (1,2 and 3,5 respectively). That code should be enough for me to also find the addresses that have duplicate hotel names.
Use the nunique groupby aggregator:
>>> n_uniq = df.groupby('HotelName')['Address'].nunique()
>>> n_uniq
HotelName
A 2
B 1
C 2
D 1
Name: Address, dtype: int64
If you want to look at the distinct hotels with more than one address in the original dataframe,
>>> hotels_with_mult_addr = n_uniq.index[n_uniq > 1]
>>> df[df['HotelName'].isin(hotels_with_mult_addr)].drop_duplicates()
HotelName Address
0 A 1
2 C 3
8 A 2
12 C 5
If I understand you correctly, we can check which hotel has more than 1 unique address with groupby.transform('nunique'):
m = df.groupby('HotelName')['Address'].transform('nunique').ne(1)
print(df.loc[m])
HotelName Address
0 A 1
2 C 3
4 A 1
6 C 3
8 A 2
11 C 3
12 C 5
If you want to get a more concise view of what the duplicates are, use groupby.agg(set):
df.loc[m].groupby('HotelName')['Address'].agg(set).reset_index(name='addresses')
HotelName addresses
0 A {1, 2}
1 C {3, 5}
Step by step:
transform('nunique') gives us the number of unique addresses next to each row:
df.groupby('HotelName')['Address'].transform('nunique')
0 2
1 1
2 2
3 1
4 2
5 1
6 2
7 1
8 2
9 1
10 1
11 2
12 2
Name: Address, dtype: int64
Then we check which rows are not equal (ne) to 1 and filter those:
df.groupby('HotelName')['Address'].transform('nunique').ne(1)
0 True
1 False
2 True
3 False
4 True
5 False
6 True
7 False
8 True
9 False
10 False
11 True
12 True
Name: Address, dtype: bool
Groupby didn't do what you expected. After you did the groupby, here is what you got:
HotelName Address
0 A 1
4 A 1
HotelName Address
8 A 2
HotelName Address
1 B 2
5 B 2
9 B 2
10 B 2
HotelName Address
2 C 3
6 C 3
11 C 3
HotelName Address
3 D 4
7 D 4
HotelName Address
12 C 5
There are indeed 6 combinations!
If you want to know the duplication in each group, you should check the row index.
Here is the long way to do it, where rows with dfnew['count'] == 1 are unique:
df = pd.DataFrame(data, columns=['HotelName', 'Address'])
df = df.sort_values(by=['HotelName', 'Address']).reset_index(drop=True)
count = df.groupby(['HotelName', 'Address'])['Address'].count().reset_index(drop=True)
df['rownum'] = df.groupby(['HotelName', 'Address']).cumcount() + 1
dfnew = df[df['rownum'] == 1].reset_index(drop=True).drop(columns='rownum')
dfnew['count'] = count
dfnew
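A shorter equivalent (a sketch starting from the original, unmodified df): drop exact (HotelName, Address) duplicates first, then keep the hotels that still appear more than once:
dedup = df.drop_duplicates(['HotelName', 'Address'])
print(dedup[dedup.duplicated('HotelName', keep=False)])
#    HotelName  Address
# 0          A        1
# 2          C        3
# 8          A        2
# 12         C        5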

Adding the lower levels of two Pandas MultiIndex columns

I have the following DataFrame:
import pandas as pd

columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'],
                                     ['p', 'm', 'p', 'm']])
values = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
df = pd.DataFrame(values, columns=columns)
  n1      n2
   p   m   p   m
0  1   2   3   4
1  5   6   7   8
2  9  10  11  12
Now I want to add another column (n3) to this DataFrame whose lower-level columns p and m should be the sums of the corresponding lower-level columns of n1 and n2:
  n1      n2       n3
   p   m   p   m    p   m
0  1   2   3   4    4   6
1  5   6   7   8   12  14
2  9  10  11  12   20  22
Here's the code I came up with:
n3 = df[['n1', 'n2']].sum(axis=1, level=1)
level1 = df.columns.levels[1]
n3.columns = pd.MultiIndex.from_arrays([['n3'] * len(level1), level1])
df = pd.concat([df, n3], axis=1)
This does what I want, but feels very cumbersome compared to code that doesn't use MultiIndex columns:
df['n3'] = df[['n1', 'n2']].sum(axis=1)
My current code also only works for a column MultiIndex consisting of two levels, and I'd be interested in doing this for arbitrary levels.
What's a better way of doing this?
One way to do so is with stack and unstack:
new_df = df.stack(level=1)
new_df['n3'] = new_df.sum(axis=1)
new_df.unstack(level=-1)
Output:
   n1      n2      n3
    m  p    m   p   m   p
0   2  1    4   3   6   4
1   6  5    8   7  14  12
2  10  9   12  11  22  20
If you build the structure first, like:
df['n3', 'p'] = 1
df['n3', 'm'] = 1
then you can write:
df['n3'] = df[['n1', 'n2']].sum(axis=1, level=1)
Here's another way that I just discovered which does not reorder the columns:
# Sum column-wise on level 1
s = df.loc[:, ['n1', 'n2']].sum(axis=1, level=1)
# Prepend a column level
s = pd.concat([s], keys=['n3'], axis=1)
# Add column to DataFrame
df = pd.concat([df, s], axis=1)
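For an arbitrary number of column levels, here is a sketch that combines the stack/unstack idea with the keys trick above (add_summed_column is a hypothetical helper; as with the stack/unstack answer, the order of the lower-level columns may change):
import pandas as pd

def add_summed_column(df, sources, new_name):
    # Move every column level below the top one into the row index,
    # so the sum becomes a plain column-wise sum over the top-level names
    lower = list(range(1, df.columns.nlevels))
    s = df[sources].stack(lower).sum(axis=1)
    # Move those levels back into the columns
    s = s.unstack(list(range(-len(lower), 0)))
    # Prepend the new top-level name and append to the frame
    s = pd.concat([s], keys=[new_name], axis=1)
    return pd.concat([df, s], axis=1)

df = add_summed_column(df, ['n1', 'n2'], 'n3')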