Merging two datasets with a partial match - pandas

I want to merge two dataframes, df1 and df2. The shape of df1 is (115, 16) and the shape of df2 is (624402, 23).
df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
'Type': ['01', '03', '04', '02'],
'Amount': ['150', '175', '160', '180'],
'Comment': ['bla', 'bla', 'bla', 'bla']})
print(df1)
Invoice Currency
0 20561 EUR
1 20562 EUR
2 20563 EUR
3 20564 USD
print(df2)
Ref Type Amount Comment
0 20561 01 150 bla
1 INV20562 03 175 bla
2 INV20563BG 04 160 bla
3 20564 02 180 bla
I applied the following code:
df4 = df1.copy()
for i, row in df1.iterrows():
    tmp = df2[df2['Ref'].str.contains(row['Invoice'], na=False)]
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]
print(df4)
It is showing: IndexError: index 0 is out of bounds for axis 0 with size 0

The IndexError occurs when no row matches the invoice. You can check for this and return np.nan (or a different default value) if a matching invoice is not found:
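import numpy as np

df4 = df1.copy()
for i, row in df1.iterrows():
    # rows of df2 whose Ref contains this invoice number
    tmp = df2[df2['Ref'].str.contains(row['Invoice'], na=False)]
    # take the first match, or NaN when nothing matched
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0] if not tmp.empty else np.nan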
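With 115 invoices against roughly 624,000 references, the loop above runs one full str.contains scan per invoice. A possible vectorized alternative is sketched below, assuming that keeping the first matching Ref per invoice is acceptable (as .values[0] does in the loop). Invoice '20565' is a hypothetical unmatched row added only to show the NaN case, and `pattern` and `lookup` are illustrative names.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564', '20565'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Amount': ['150', '175', '160', '180']})

# One regex alternation over all invoice numbers, escaped in case
# an invoice ever contains a regex metacharacter.
pattern = '(' + '|'.join(re.escape(inv) for inv in df1['Invoice']) + ')'

# For each Ref, pull out the invoice number it contains (NaN if none).
lookup = df2.assign(Invoice=df2['Ref'].str.extract(pattern, expand=False))

# Keep only the first Ref per invoice, mirroring .values[0] in the loop.
lookup = lookup.dropna(subset=['Invoice']).drop_duplicates('Invoice')

# A left merge keeps every invoice; unmatched ones get NaN in Amount.
df4 = df1.merge(lookup[['Invoice', 'Amount']], on='Invoice', how='left')
print(df4)
```

One difference from the loop: a Ref that contains several invoice numbers is credited only to the first one the regex finds, so this is not a drop-in replacement if such references occur.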

Related

Merging unequal dataframes with partial string match

I want to merge two data frames. df1 has 115 rows and df2 has 600,000 rows.
df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
'Type': ['01', '03', '04', '02'],
'Amount': ['150', '175', '160', '180'],
'Comment': ['bla', 'bla', 'bla', 'bla']})
print(df1)
Invoice Currency
0 20561 EUR
1 20562 EUR
2 20563 EUR
3 20564 USD
print(df2)
Ref Type Amount Comment
0 20561 01 150 bla
1 INV20562 03 175 bla
2 INV20563BG 04 160 bla
3 20564 02 180 bla
I applied the following code:
compList = '|'.join(df1['Invoice'].tolist())
df2['compMatch'] = df2.Ref.str.contains(compList)
# drop unmatched articles
df2 = df2[df2['compMatch']==True]
df2['content'] = df2['Ref'].str.lower().str.split()
df2['matchedName'] = df2['content'].apply(lambda x: [item for item in x if item in df1['Invoice'].tolist()])
print(df2)
Ref Type Amount Comment compMatch content matchedName
0 20561 01 150 bla True [20561] [20561]
1 INV20562 03 175 bla True [inv20562] []
2 INV20563BG 04 160 bla True [inv20563bg] []
3 20564 02 180 bla True [20564] [20564]
As you can see, matchedName is empty for Ref INV20562 and Ref INV20563BG.
What's wrong with this code? Is there any other solution?
Looks like you want to merge on the digits of the ref:
df2.merge(df1,
left_on=df2['Ref'].str.extract(r'(\d+)', expand=False),
right_on='Invoice', how='left')
Output:
Ref Type Amount Comment Invoice Currency
0 20561 01 150 bla 20561 EUR
1 INV20562 03 175 bla 20562 EUR
2 INV20563BG 04 160 bla 20563 EUR
3 20564 02 180 bla 20564 USD
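As for what went wrong in the original attempt, a small sketch may help. str.split() with no argument splits on whitespace, so each Ref stays a single token, and 'inv20562' is never equal to '20562'. Extracting the digits sidesteps that. The names `refs`, `invoices`, `exact` and `digits` are illustrative only.

```python
import pandas as pd

refs = pd.Series(['20561', 'INV20562', 'INV20563BG', '20564'])
invoices = ['20561', '20562', '20563', '20564']

# Whitespace split: every Ref is one token, so only exact matches survive.
tokens = refs.str.lower().str.split()
exact = tokens.apply(lambda items: [t for t in items if t in invoices])

# Extracting the first run of digits recovers the invoice from every Ref.
digits = refs.str.extract(r'(\d+)', expand=False)

print(exact.tolist())
print(digits.tolist())
```

This assumes each Ref contains exactly one run of digits and that this run is the invoice number; a Ref such as 'INV2020-20562' would need a more specific pattern.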

Assign Random Number between two value conditionally

I have a dataframe:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})
I wish to replace the values in the Age column by integers between two values. So, for example, I wish to replace each value with age range '35-44' by a random integer between 35-44.
I tried:
df.loc[df["AGE"]== '35-44', 'AGE'] = random.randint(35, 44)
But it picks the same value for each row. I would like it to randomly pick a different value for each row.
I get:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['38', '20-34', '38', '38', '45-70', '38']})
But I would like to get something like the following. I don't much care how the values are distributed, as long as they are in the range that I assign:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['36', '20-34', '39', '38', '45-70', '45']})
The code
random.randint(35, 44)
produces a single random value, making the statement analogous to:
df.loc[df["AGE"]== '35-44', 'AGE'] = 38 # some constant
We need a collection of values that is the same length as the values to fill. We can use np.random.randint instead:
import numpy as np
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = np.random.randint(35, 44, m.sum())
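(Series.sum is used to "count" the number of True values in the Series since True is 1 and False is 0. Note that np.random.randint excludes the upper bound, unlike random.randint, so this call draws from 35 to 43; pass 45 as the upper bound if 44 should be possible.)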
df:
Prod Line AGE
0 abc Revenues 40
1 qrt EBT 20-34
2 xyz Expenses 41
3 xam Revenues 35
4 asc EBT 45-70
5 yat Expenses 36
*Reproducible with np.random.seed(26)
Naturally, using the filter on both sides of the expression with apply would also work:
import random
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = df.loc[m, 'AGE'].apply(lambda _: random.randint(35, 44))
df:
Prod Line AGE
0 abc Revenues 36
1 qrt EBT 20-34
2 xyz Expenses 37
3 xam Revenues 43
4 asc EBT 45-70
5 yat Expenses 44
*Reproducible with random.seed(28)
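If every range in the column should be replaced, not just '35-44', the bounds can be parsed from the strings themselves. The following is a sketch rather than part of the answers above: it assumes every AGE value has the form 'low-high', and it uses NumPy's Generator.integers with endpoint=True so that the upper bound is included. The seed 0 and the name `bounds` are arbitrary choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})

# Split 'low-high' into two integer columns: 0 = lower, 1 = upper bound.
bounds = df['AGE'].str.split('-', expand=True).astype(int)

rng = np.random.default_rng(0)  # seeded only to make the run repeatable

# One independent draw per row; endpoint=True makes the upper bound inclusive.
df['AGE'] = rng.integers(bounds[0].to_numpy(), bounds[1].to_numpy(),
                         endpoint=True)
print(df)
```

Because the low and high arguments are arrays, each row gets its own draw from its own range in a single call.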

How to use apply for multiple Pandas dataset columns?

I am trying hard to fill the NaN values in some columns, selected from a previously defined list. The code always takes the else path and never makes the correct modifications...
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': [0.0, np.nan, np.nan, 100],
'C': [20, 0.0002, 10000, np.nan],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
num_cols = ['B', 'C']
fill_mean = lambda col: col.fillna(col.mean()) if col.name in num_cols else col
df1.apply(fill_mean, axis=1)
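You can do this much more simply using
df1.fillna(df1.mean(numeric_only=True))
This fills the NaNs in the numeric columns with the column mean (numeric_only=True skips the string columns A and D, which recent pandas versions otherwise refuse to average):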
A B C D
0 A0 0.0 20.000000 D0
1 A1 50.0 0.000200 D1
2 A2 50.0 10000.000000 D2
3 A3 100.0 3340.000067 D3
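As for why the original apply always took the else path: with axis=1 the function receives one row at a time, so col.name is the row label (0, 1, 2, 3) and is never in num_cols. A minimal sketch of the apply-based version, restricted to the listed columns and using the default axis=0, could look like this (`filled` is an illustrative name):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': [0.0, np.nan, np.nan, 100],
                    'C': [20, 0.0002, 10000, np.nan],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
num_cols = ['B', 'C']

filled = df1.copy()
# Default axis=0: the lambda receives one column at a time,
# so col.mean() is the mean of that column.
filled[num_cols] = filled[num_cols].apply(lambda col: col.fillna(col.mean()))
print(filled)
```

Selecting the columns first also removes the need for the col.name check.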
I am not sure if your desired output is just the mean of all columns (a single row). If that is the case, maybe the solution below could help.
df = df1.select_dtypes(include='float').mean().to_frame().T
df = pd.concat([df, df.reindex(columns = df1.select_dtypes(exclude='float').columns)], axis=1, sort=False)
print(df)
B C A D
0 50.0 3340.000067 NaN NaN

Fill column in df.A based on comparison values in df.A and df.B

So I have this code:
import pandas as pd
import numpy as np
frame1 = {'Season': ['S19', 'S20', 'S21',
'S19', 'S20', 'S21',
'S19', 'S20', 'S21'],
'DateFrom': ['2019-01-01', '2020-01-01', '2021-01-01',
'2019-01-01', '2020-01-01', '2021-01-01',
'2019-01-01', '2020-01-01', '2021-01-01'],
'DateTo': ['2019-12-30', '2020-12-30', '2021-12-30',
'2019-12-30', '2020-12-30', '2021-12-30',
'2019-12-30', '2020-12-30', '2021-12-30'],
'Currency': ['EUR', 'EUR', 'EUR',
'USD', 'USD', 'USD',
'MAD', 'MAD', 'MAD'],
'Rate': [1, 2, 3, 4, 5, 6, 7, 8, 9]
}
df1 = pd.DataFrame(data=frame1)
frame2 = {'Room': ['Double', 'Single', 'SeaView'],
'Season': ['S20', 'S20', 'S19'],
'DateFrom': ['2020-05-01', '2020-07-05', '2019-03-25'],
'Currency': ['EUR', 'MAD', 'USD'],
'Rate': [0, 0, 0]
}
df2 = pd.DataFrame(data=frame2)
df1[['DateFrom', 'DateTo']] = df1[['DateFrom', 'DateTo']].apply(pd.to_datetime)
df2[['DateFrom']] = df2[['DateFrom']].apply(pd.to_datetime)
print(df1.dtypes)
print(df2.dtypes)
df2['Rate'] = np.where((
df2['Season'] == df1['Season'] &
df2['Currency'] == df1['Currency'] &
(df2['DateFrom'] > df1['DateFrom'] & df2['DateFrom'] < df1['DateTo'])
), df1['Rates'], 'MissingData')
print(df2)
What I am trying to achieve is to fill Rate values in df2 with Rate values from df1 based on conditions where:
df2.Season == df1.Season &
df2.Currency == df1.Currency &
df2.DateFrom must be between df1.DateFrom and df1.DateTo
So my result in 'Rate' should be 2, 8, 4.
I was hoping that the code above would work, but it does not. I am getting this error:
"TypeError: unsupported operand type(s) for &: 'str' and 'str'"
Any help on how to make it work will be appreciated.
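The error arises because & binds more tightly than ==, so each comparison would need its own parentheses. Even then, df1 and df2 have different lengths and cannot be compared row by row. You can first merge, then compare: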
out = df1.merge(df2[['Season','Currency','DateFrom']],on=['Season','Currency'],
suffixes=('','_y'))
out = (out[out['DateFrom_y'].between(out['DateFrom'],out['DateTo'])]
.reindex(columns=df1.columns).copy())
print(out)
Season DateFrom DateTo Currency Rate
0 S20 2020-01-01 2020-12-30 EUR 2
1 S19 2019-01-01 2019-12-30 USD 4
2 S20 2020-01-01 2020-12-30 MAD 8
EDIT per comments:
out = df1.merge(df2,on=['Season','Currency'],suffixes=('','_y'))
out = (out[out['DateFrom_y'].between(out['DateFrom'],out['DateTo'])]
.reindex(columns=df2.columns).copy())
Room Season DateFrom Currency Rate
0 Double S20 2020-01-01 EUR 2
1 SeaView S19 2019-01-01 USD 4
2 Single S20 2020-01-01 MAD 8
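If the goal is to write the rates back into df2 while keeping df2's own row order and its own DateFrom values, one option is to carry df2's index through the merge. This is a sketch under the assumption that at most one df1 period matches each df2 row; otherwise the final reindex would fail on duplicate labels. The suffix '_df1' and the names `merged`, `in_range` and `rates` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Season': ['S19', 'S20', 'S21'] * 3,
    'DateFrom': ['2019-01-01', '2020-01-01', '2021-01-01'] * 3,
    'DateTo': ['2019-12-30', '2020-12-30', '2021-12-30'] * 3,
    'Currency': ['EUR'] * 3 + ['USD'] * 3 + ['MAD'] * 3,
    'Rate': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df2 = pd.DataFrame({
    'Room': ['Double', 'Single', 'SeaView'],
    'Season': ['S20', 'S20', 'S19'],
    'DateFrom': ['2020-05-01', '2020-07-05', '2019-03-25'],
    'Currency': ['EUR', 'MAD', 'USD'],
    'Rate': [0, 0, 0]})
df1[['DateFrom', 'DateTo']] = df1[['DateFrom', 'DateTo']].apply(pd.to_datetime)
df2['DateFrom'] = pd.to_datetime(df2['DateFrom'])

# reset_index() stores df2's row labels in an 'index' column so they
# survive the merge; df1's clashing columns get the '_df1' suffix.
merged = df2.reset_index().merge(df1, on=['Season', 'Currency'],
                                 how='left', suffixes=('', '_df1'))

# Keep only pairs where df2's date falls inside df1's period (inclusive).
in_range = merged['DateFrom'].between(merged['DateFrom_df1'], merged['DateTo'])
rates = merged.loc[in_range].set_index('index')['Rate_df1']

# Align back on df2's index; rows without a matching period become NaN.
df2['Rate'] = rates.reindex(df2.index)
print(df2)
```

Rows with no matching period are left as NaN here; replacing them with a string such as 'MissingData' is possible with fillna, at the cost of turning Rate into a mixed-type column.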

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data-frame with a categorical column and a numerical column, with the index set to time data:
df = pd.DataFrame({
'date': [
'2013-03-01', '2013-03-02',
'2013-03-01', '2013-03-02',
'2013-03-01', '2013-03-02'
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
Kind Values
date
2013-03-01 A 1.0
2013-03-02 B 1.5
2013-03-01 A 2.0
2013-03-02 B 3.0
2013-03-01 B 5.0
2013-03-02 B 3.0
My aim is to achieve the below data-frame:
A_count B_count A_Val max B_Val max
date
2013-03-01 2 1 2 5
2013-03-02 0 3 0 3
which also has the time as index. Here, I note that if we use
data = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
Kind
date Kind
2013-03-01 A 2
B 1
2013-03-02 B 3
Use DataFrame.pivot_table, then flatten the resulting MultiIndex in the columns with a list comprehension:
df = pd.DataFrame({
'date': [
'2013-03-01', '2013-03-02',
'2013-03-01', '2013-03-02',
'2013-03-01', '2013-03-02'
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
# setting the index is optional here, because pivot_table takes index='date'
# df = df.set_index('date')
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count','max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
A_count B_count A_max B_max
date
2013-03-01 2.0 1.0 2.0 5.0
2013-03-02 NaN 3.0 NaN 3.0
Another solution, starting again from the original df, with Grouper to resample by days:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='D'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]
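Both approaches leave NaN where a Kind has no rows on a given day, while the desired frame in the question shows 0. With pivot_table, one way to get there is the fill_value parameter. A minimal self-contained sketch (`out` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2013-03-01', '2013-03-02', '2013-03-01',
             '2013-03-02', '2013-03-01', '2013-03-02'],
    'Kind': ['A', 'B', 'A', 'B', 'B', 'B'],
    'Values': [1, 1.5, 2, 3, 5, 3]})
df['date'] = pd.to_datetime(df['date'])

# fill_value=0 replaces the missing (date, Kind) combinations with 0.
out = df.pivot_table(index='date', columns='Kind', values='Values',
                     aggfunc=['count', 'max'], fill_value=0)

# Columns are (aggregation, Kind) pairs; flatten them to 'Kind_aggregation'.
out.columns = [f'{kind}_{agg}' for agg, kind in out.columns]
print(out)
```

For the Grouper variant, the equivalent would be passing fill_value=0 to unstack.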