Pandas add column with values from two df on partly matching column - pandas

This is most likely an easy question, but I am still stuck on how to solve it.
I have two dataframes that share the column "giftID", and I want to create a new column in df1 holding the values from df2 for the matching giftID. I tried np.where and all kinds of other approaches but can't get it working.
df = pd.read_csv('../data/gifts.csv')
trip1 = df[:20].copy()
trip1['TripId'] = 0
subtours = [list(trip1['GiftId'])] * len(trip1)
trip1['Subtour'] = subtours
trip2 = df[20:41].copy()
#trip2['Subtour'] = [s]*len(trip2)
trip2['TripId'] = 1
trip2['Subtour'] = subtours = [list(trip2['GiftId'])] * len(trip2)
mini_tour = trip1.append(trip2)
grouped = mini_tour.groupby('TripId')
SA = Simulated_Anealing()
wrw = 0
for name, trip in grouped:
    tourId = trip['TripId'].unique()[0]
    optimized_trip, wrw_c = SA.simulated_annealing(trip)
    wrw += wrw_c
    subtours = [optimized_trip] * len(trip)
    mask = mini_tour['TripId'] == tourId
    mini_tour.loc[mask, 'Subtour'] = 0
Input:
df
   giftID  weight
1  A       4
2  B       5
3  C       6
4  D       7
5  E       12
df1
   giftID  subtour
1  A       1, 3, 4
2  B       1, 3, 4
3  C       1, 3, 4
df2
   giftID  subtour
1  D       2, 5, 8
2  E       2, 5, 8
Output:
df
   giftID  weight  subtour
1  A       4       1, 3, 4
2  B       5       1, 3, 4
3  C       6       1, 3, 4
4  D       7       2, 5, 8
5  E       12      2, 5, 8

First, you can pd.concat df1 and df2:
import pandas as pd
df12 = pd.concat([df1, df2], axis=0)  # axis=0 means row-wise
Then merge df12 with your main dataframe (the shared column is giftID in both frames):
df_merge = pd.merge(df, df12, how='left', on='giftID')
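A minimal runnable sketch of the same approach, using inline frames that mirror the question's Input section (the data here is typed out by hand rather than read from gifts.csv):
import pandas as pd

# Sample frames mirroring the question's Input/Output tables.
df = pd.DataFrame({'giftID': list('ABCDE'), 'weight': [4, 5, 6, 7, 12]})
df1 = pd.DataFrame({'giftID': list('ABC'), 'subtour': ['1, 3, 4'] * 3})
df2 = pd.DataFrame({'giftID': list('DE'), 'subtour': ['2, 5, 8'] * 2})

# Stack the two lookup frames row-wise, then left-merge onto the main frame.
df12 = pd.concat([df1, df2], axis=0)
df_merge = pd.merge(df, df12, how='left', on='giftID')
print(df_merge)
#   giftID  weight  subtour
# 0      A       4  1, 3, 4
# 1      B       5  1, 3, 4
# 2      C       6  1, 3, 4
# 3      D       7  2, 5, 8
# 4      E      12  2, 5, 8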

Related

How to generate random numbers that sum to a specific value?

I have 2 dataframes as follows:
import pandas as pd
import numpy as np
# Create data set.
dataSet1 = {'id': ['A', 'B', 'C'],
            'value': [9, 20, 20]}
dataSet2 = {'id': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
            'id_2': [1, 2, 3, 2, 3, 4, 1]}
# Create dataframe with data set and named columns.
df_map1 = pd.DataFrame(dataSet1, columns=['id', 'value'])
df_map2 = pd.DataFrame(dataSet2, columns=['id', 'id_2'])
df_map1
id value
0 A 9
1 B 20
2 C 20
df_map2
id id_2
0 A 1
1 A 2
2 A 3
3 B 2
4 B 3
5 B 4
6 C 1
where each id can map to several id_2 values, and id_2 values can be duplicated across ids.
#doing a quick merge, based on id.
df = df_map1.merge(df_map2 ,on=['id'])
id value id_2
0 A 9 1
1 A 9 2
2 A 9 3
3 B 20 2
4 B 20 3
5 B 20 4
6 C 20 1
I can represent the relationship between id and id_2 as follows:
id_ref = df.groupby('id')['id_2'].apply(list).to_dict()
{'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [1]}
Now, I would like to generate lists of random integers (say 0 to 3, with 5 elements per list, for example), put the lists into the pandas df, and explode:
import numpy as np
import random
df['random_value'] = df.apply(lambda _: np.random.randint(0,3, 5), axis=1)
id value id_2 random_value
0 A 9 1 [0, 0, 0, 0, 1]
1 A 9 2 [0, 2, 1, 2, 1]
2 A 9 3 [0, 1, 2, 2, 1]
3 B 20 2 [2, 1, 1, 2, 2]
4 B 20 3 [0, 0, 0, 0, 0]
5 B 20 4 [1, 0, 0, 1, 0]
6 C 20 1 [1, 2, 2, 2, 1]
The condition for generating these random_value lists is that the sum of all the lists belonging to one id has to equal that id's value (9 for A).
That means, for id A: if we sum all the elements across its three lists above, we get a total of 13, but what we want is 9.
The same concept applies for ids B and C, and so on.
Is there any way to achieve this?
# I was looking into multinomial from np.random; this seems like it should be the solution, but I'm not sure how to apply it with pandas.
np.random.multinomial(9, np.ones(5)/5, size = 1)[0]
=> array([2,3,3,0,1])
2+3+3+0+1 = 9
ATTEMPT/IDEA ...
Given that we have the list of id_2 values per id: e.g. id A has 3 distinct elements [1, 2, 3],
so id A is mapped to 3 different rows. Then we can draw
3 * 5 = 15 numbers (which will be our long list)
3: number of rows for that id
5: number of elements per list
Hence
list_A = np.random.multinomial(9, np.ones(3*5)/(3*5), size=1)[0]
and then we evenly split the long list into chunks of n = 5
using this list comprehension:
[list_A[i:i + n] for i in range(0, len(list_A), n)]
But I am still unsure how to do this dynamically for every id.
The core idea is as you said (drawing 3*5 = 15 numbers), plus reshaping the result into a 2D array with the same number of rows as that id has in the dataframe. The following function does that:
def generate_random_numbers(df):
    value = df['value'].iloc[0]
    list_len = 5
    num_rows = len(df)
    num_rand = list_len * num_rows
    return pd.Series(
        map(list, np.random.multinomial(value, np.ones(num_rand) / num_rand).reshape(num_rows, -1)),
        index=df.index,
    )
And apply it:
df['random_value'] = df.groupby(['id', 'value'], as_index=False).apply(generate_random_numbers).droplevel(0)
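As a quick sanity check (a sketch, assuming the merged df built above), the lists belonging to each id should now sum to that id's value:
# Total of every element in every list, per id; should match the 'value' column.
check = df.groupby('id').apply(lambda g: sum(map(sum, g['random_value'])))
print(check)
# id
# A     9
# B    20
# C    20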

Adding new column to an existing dataframe at an arbitrary position [duplicate]

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
Using loc=0 will insert the column at the beginning:
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
If you want a single value for all rows:
df.insert(0, 'name_of_column', '')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0, 'name_of_column', value)
df.insert(loc, column_name, value)
This will work if there is no other column with the same name. If a column with your provided name already exists in the dataframe, it will raise a ValueError.
You can pass the optional parameter allow_duplicates=True to create a new column whose name already exists.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could try to extract the columns as a list, massage it as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line; however, it looks a bit ugly. Maybe a cleaner proposal will come...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple answer to this (only one line).
You can do it after you have added the 'n' column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if your column names are words instead of single letters, wrap the names in an extra pair of parentheses, i.e. pass a tuple to list():
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can use the following 4-line routine whenever you want to create a new column and insert it at a specific column index loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]

Finding duplicate entries

I am working with the 515k Hotel Reviews dataset from Kaggle. There are 1492 unique hotel names and 1493 unique addresses. So at first it would appear that one (or possibly more) hotel has more than one address. But, if I do a groupby.count on the data, I get 1494 whether I groupby HotelName followed by Address or if I reverse the order.
In order to make this reproducible, hopefully this simplification will suffice:
data = {
    'HotelName': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'B', 'C', 'C'],
    'Address': [1, 2, 3, 4, 1, 2, 3, 4, 2, 2, 2, 3, 5]
}
df = pd.DataFrame(data, columns=['HotelName', 'Address'])
df['HotelName'].unique().shape[0] # Returns 4
df['Address'].unique().shape[0] # Returns 5
df.groupby(['Address', 'HotelName']).count().shape[0] # Returns 6
df.groupby(['HotelName', 'Address']).count().shape[0] # Returns 6
I would like to find the hotel names that have different addresses. So in my example, I would like to find the A and C along with their addresses (1,2 and 3,5 respectively). That code should be enough for me to also find the addresses that have duplicate hotel names.
Use the nunique groupby aggregator:
>>> n_uniq = df.groupby('HotelName')['Address'].nunique()
>>> n_uniq
HotelName
A 2
B 1
C 2
D 1
Name: Address, dtype: int64
If you want to look at the distinct hotels with more than one address in the original dataframe,
>>> hotels_with_mult_addr = n_uniq.index[n_uniq > 1]
>>> df[df['HotelName'].isin(hotels_with_mult_addr)].drop_duplicates()
HotelName Address
0 A 1
2 C 3
8 A 2
12 C 5
If I understand you correctly, we can check which hotel has more than 1 unique address with groupby.transform('nunique'):
m = df.groupby('HotelName')['Address'].transform('nunique').ne(1)
print(df.loc[m])
HotelName Address
0 A 1
2 C 3
4 A 1
6 C 3
8 A 2
11 C 3
12 C 5
If you want a more concise view of what the duplicates are, use groupby.agg(set):
df.loc[m].groupby('HotelName')['Address'].agg(set).reset_index(name='addresses')
HotelName addresses
0 A {1, 2}
1 C {3, 5}
Step by step:
transform('nunique') gives us the number of unique addresses next to each row:
df.groupby('HotelName')['Address'].transform('nunique')
0 2
1 1
2 2
3 1
4 2
5 1
6 2
7 1
8 2
9 1
10 1
11 2
12 2
Name: Address, dtype: int64
Then we check which rows are not equal (ne) to 1 and filter those:
df.groupby('HotelName')['Address'].transform('nunique').ne(1)
0 True
1 False
2 True
3 False
4 True
5 False
6 True
7 False
8 True
9 False
10 False
11 True
12 True
Name: Address, dtype: bool
Groupby didn't do what you expected. After the groupby, here is what you got:
HotelName Address
0 A 1
4 A 1
HotelName Address
8 A 2
HotelName Address
1 B 2
5 B 2
9 B 2
10 B 2
HotelName Address
2 C 3
6 C 3
11 C 3
HotelName Address
3 D 4
7 D 4
HotelName Address
12 C 5
There are indeed 6 combinations!
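Those per-group dumps can be reproduced with a loop like this (a sketch, assuming the df from the question):
# Print each (HotelName, Address) group separately.
for _, group in df.groupby(['HotelName', 'Address']):
    print(group)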
If you want to know the duplication in each group, you should check the row index.
Here is the long way to do it, where dfnew['count'] == 1 means the pair is unique:
df = pd.DataFrame(data, columns=['HotelName', 'Address'])
df = df.sort_values(by=['HotelName', 'Address']).reset_index(drop=True)
count = df.groupby(['HotelName', 'Address'])['Address'].count().reset_index(drop=True)
df['rownum'] = df.groupby(['HotelName', 'Address']).cumcount() + 1
dfnew = df[df['rownum'] == 1].reset_index(drop=True).drop(columns='rownum')
dfnew['count'] = count
dfnew
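For comparison, a much shorter route to the same unique pairs plus their counts (a sketch over the same df):
dfnew = df.groupby(['HotelName', 'Address']).size().reset_index(name='count')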

Adding the lower levels of two Pandas MultiIndex columns

I have the following DataFrame:
import pandas as pd
columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'],
                                     ['p', 'm', 'p', 'm']])
values = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
df = pd.DataFrame(values, columns=columns)
n1 n2
p m p m
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Now I want to add another column (n3) to this DataFrame whose lower-level columns p and m should be the sums of the corresponding lower-level columns of n1 and n2:
n1 n2 n3
p m p m p m
0 1 2 3 4 4 6
1 5 6 7 8 12 14
2 9 10 11 12 20 22
Here's the code I came up with:
n3 = df[['n1', 'n2']].sum(axis=1, level=1)
level1 = df.columns.levels[1]
n3.columns = pd.MultiIndex.from_arrays([['n3'] * len(level1), level1])
df = pd.concat([df, n3], axis=1)
This does what I want, but feels very cumbersome compared to code that doesn't use MultiIndex columns:
df['n3'] = df[['n1', 'n2']].sum(axis=1)
My current code also only works for a column MultiIndex consisting of two levels, and I'd be interested in doing this for arbitrary levels.
What's a better way of doing this?
One way to do so with stack and unstack:
new_df = df.stack(level=1)
new_df['n3'] = new_df.sum(axis=1)
new_df.unstack(level=-1)
Output:
n1 n2 n3
m p m p m p
0 2 1 4 3 6 4
1 6 5 8 7 14 12
2 10 9 12 11 22 20
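Note that the round trip through stack/unstack sorts the lower level alphabetically, which is why m now comes before p. If the original p, m order matters, one way to restore it (a sketch using the new_df from above) is:
# Rebuild the wide frame, putting the lower level back in p, m order.
new_df.unstack(level=-1).reindex(columns=['p', 'm'], level=1)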
If you build the structure like:
df['n3', 'p'] = 1
df['n3', 'm'] = 1
then you can write:
df['n3'] = df[['n1', 'n2']].sum(axis=1, level=1)
Here's another way that I just discovered which does not reorder the columns:
# Sum column-wise on level 1
s = df.loc[:, ['n1', 'n2']].sum(axis=1, level=1)
# Prepend a column level
s = pd.concat([s], keys=['n3'], axis=1)
# Add column to DataFrame
df = pd.concat([df, s], axis=1)
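One caveat: in newer pandas versions the level argument of sum(axis=1, level=1) was deprecated and eventually removed, so the level-wise sum itself needs a groupby. A sketch of the same approach without the deprecated call (assuming the df defined in the question):
# Transpose, group by the lower column level (sort=False keeps the p, m order), sum, transpose back.
s = df[['n1', 'n2']].T.groupby(level=1, sort=False).sum().T
s = pd.concat([s], keys=['n3'], axis=1)  # prepend the 'n3' top level
df = pd.concat([df, s], axis=1)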

Pandas: How to obtain the top 2, middle 2 and bottom 2 rows in each group

Let's say I have a dataframe df as below. To obtain the first 2 and last 2 rows in each group, I have used groupby.nth:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b'],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7]}, columns=['A', 'B'])
df.groupby('A').nth([0, 1, -2, -1])
Result:
B
A
a 1
a 2
a 7
a 8
b 1
b 2
b 6
b 7
I'm not sure how to obtain the middle 2 rows. For example, group 'a' has 8 instances, so my middle rows would be 4, 5 (n/2, n/2+1), and group 'b' has 7, so my middle rows would be 3, 4 (n/2-0.5, n/2+0.5). Any guidance is appreciated.
sacul's answer is nice. Here I just follow your own idea and define a custom function:
def middle(x):
    if len(x) % 2 == 0:
        return x.iloc[int(len(x) / 2) - 1:int(len(x) / 2) + 1]
    else:
        return x.iloc[int((len(x) / 2 - 0.5)) - 1:int(len(x) / 2 + 0.5)]
pd.concat([middle(y) for _, y in df.groupby('A')])
Out[25]:
A B
3 a 4
4 a 5
10 b 3
11 b 4
You can use iloc to find the n//2 -1 and n//2 indices for each group (// is floor division):
g = df.groupby('A')
g.apply(lambda x: x['B'].iloc[[len(x)//2-1, len(x)//2]])
A
a 3 4
4 5
b 10 3
11 4
Name: B, dtype: int64
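If the goal is one frame holding the first 2, middle 2, and last 2 rows of every group, the pieces above can be combined; here is a sketch (the helper name ends_and_middle is made up, and the index-based dedup guards against overlap in groups with fewer than 6 rows):
import pandas as pd

def ends_and_middle(g, k=2):
    # First k, middle k, and last k rows of a single group.
    mid = (len(g) - k) // 2
    picked = pd.concat([g.iloc[:k], g.iloc[mid:mid + k], g.iloc[-k:]])
    return picked[~picked.index.duplicated()]

out = df.groupby('A', group_keys=False).apply(ends_and_middle)
print(out)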