Fill column in df.A based on comparison values in df.A and df.B - pandas

So I have this code:
import pandas as pd
import numpy as np
frame1 = {'Season': ['S19', 'S20', 'S21',
                     'S19', 'S20', 'S21',
                     'S19', 'S20', 'S21'],
          'DateFrom': ['2019-01-01', '2020-01-01', '2021-01-01',
                       '2019-01-01', '2020-01-01', '2021-01-01',
                       '2019-01-01', '2020-01-01', '2021-01-01'],
          'DateTo': ['2019-12-30', '2020-12-30', '2021-12-30',
                     '2019-12-30', '2020-12-30', '2021-12-30',
                     '2019-12-30', '2020-12-30', '2021-12-30'],
          'Currency': ['EUR', 'EUR', 'EUR',
                       'USD', 'USD', 'USD',
                       'MAD', 'MAD', 'MAD'],
          'Rate': [1, 2, 3, 4, 5, 6, 7, 8, 9]
          }
df1 = pd.DataFrame(data=frame1)
frame2 = {'Room': ['Double', 'Single', 'SeaView'],
          'Season': ['S20', 'S20', 'S19'],
          'DateFrom': ['2020-05-01', '2020-07-05', '2019-03-25'],
          'Currency': ['EUR', 'MAD', 'USD'],
          'Rate': [0, 0, 0]
          }
df2 = pd.DataFrame(data=frame2)
df1[['DateFrom', 'DateTo']] = df1[['DateFrom', 'DateTo']].apply(pd.to_datetime)
df2[['DateFrom']] = df2[['DateFrom']].apply(pd.to_datetime)
print(df1.dtypes)
print(df2.dtypes)
df2['Rate'] = np.where((
    df2['Season'] == df1['Season'] &
    df2['Currency'] == df1['Currency'] &
    (df2['DateFrom'] > df1['DateFrom'] & df2['DateFrom'] < df1['DateTo'])
), df1['Rates'], 'MissingData')
print(df2)
What I am trying to achieve is to fill the Rate values in df2 with the Rate values from df1, based on these conditions:
df2.Season == df1.Season &
df2.Currency == df1.Currency &
df2.DateFrom must be between df1.DateFrom and df1.DateTo
So my result in 'Rate' should be 2, 8, 4.
I was hoping the code above would work, but it doesn't; I am getting this error:
"TypeError: unsupported operand type(s) for &: 'str' and 'str'"
Any help to make it work would be appreciated.

You can first merge then compare:
out = df1.merge(df2[['Season','Currency','DateFrom']],on=['Season','Currency'],
suffixes=('','_y'))
out = (out[out['DateFrom_y'].between(out['DateFrom'],out['DateTo'])]
.reindex(columns=df1.columns).copy())
print(out)
Season DateFrom DateTo Currency Rate
0 S20 2020-01-01 2020-12-30 EUR 2
1 S19 2019-01-01 2019-12-30 USD 4
2 S20 2020-01-01 2020-12-30 MAD 8
EDIT per comments:
out = df1.merge(df2,on=['Season','Currency'],suffixes=('','_y'))
out = (out[out['DateFrom_y'].between(out['DateFrom'],out['DateTo'])]
.reindex(columns=df2.columns).copy())
Room Season DateFrom Currency Rate
0 Double S20 2020-01-01 EUR 2
1 SeaView S19 2019-01-01 USD 4
2 Single S20 2020-01-01 MAD 8
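If df2's shape should be preserved and unmatched rows filled with 'MissingData', as the original np.where attempt intended, a left merge can do that too. A minimal sketch built from the sample frames above (the `_r` suffix is an arbitrary choice):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'Season': ['S19', 'S20', 'S21'] * 3,
    'DateFrom': pd.to_datetime(['2019-01-01', '2020-01-01', '2021-01-01'] * 3),
    'DateTo': pd.to_datetime(['2019-12-30', '2020-12-30', '2021-12-30'] * 3),
    'Currency': ['EUR'] * 3 + ['USD'] * 3 + ['MAD'] * 3,
    'Rate': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
df2 = pd.DataFrame({
    'Room': ['Double', 'Single', 'SeaView'],
    'Season': ['S20', 'S20', 'S19'],
    'DateFrom': pd.to_datetime(['2020-05-01', '2020-07-05', '2019-03-25']),
    'Currency': ['EUR', 'MAD', 'USD'],
})

# left-merge so every df2 row survives; assumes each (Season, Currency)
# pair matches at most one df1 row, as in the sample data
m = df2.merge(df1, on=['Season', 'Currency'], how='left', suffixes=('', '_r'))
in_range = m['DateFrom'].between(m['DateFrom_r'], m['DateTo'])
df2['Rate'] = np.where(in_range, m['Rate'], 'MissingData')
print(df2)
```

Note that np.where coerces the result to strings here (as in the original attempt), so keeping numeric rates would need a different fill value such as np.nan.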

Related

How to plot timeseries bar chart with multiple values per stack/ timestamp from pandas dataframe

value cumsum price
0 2021-02-01 00:00:00 164.6136 164.6136 0.0216
2021-02-01 00:00:00 163.8085 328.4221 0.0215
2021-02-01 00:00:00 163.0466 491.4687 0.0214
2021-02-01 00:00:00 14999.9925 15491.4612 0.0213
1 2021-02-01 00:00:10 164.6136 164.6136 0.0216
... ... ... ...
8634 2021-02-01 23:59:00 14999.9993 14999.9993 0.0221
8635 2021-02-01 23:59:10 14999.9993 14999.9993 0.0221
8636 2021-02-01 23:59:20 14999.9993 14999.9993 0.0221
8637 2021-02-01 23:59:30 0.0000 0.0000 0.0221
2021-02-01 23:59:30 14999.9993 14999.9993 0.0221
My data looks like the above, and I would like to plot a stacked bar chart with multiple values per stack/timestamp.
Can somebody please help me?
You can use the below code to plot the graph.
import matplotlib.pyplot as plt
import pandas as pd

# create data
df = pd.DataFrame([['A', 10, 20, 10, 26], ['A', 10, 40, 10, 26], ['A', 10, 70, 10, 26],
                   ['B', 20, 25, 15, 21], ['B', 20, 45, 15, 21],
                   ['C', 12, 15, 19, 6], ['D', 10, 18, 11, 19]],
                  columns=['Team', 'Round 1', 'Round 2', 'Round 3', 'Round 4'])
# view data
print(df)
# plot the data as stacked bars; pivot needs keyword arguments in pandas >= 2.0
df.pivot(index='Team', columns='Round 2').plot(kind='bar', stacked=True).legend().set_visible(False)
plt.show()
This is what you get as output. Do you expect this?
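The snippet above uses toy Team data rather than the question's frame. With the actual shape (several value rows per timestamp), one way to get a stacked bar per timestamp is to number the rows within each timestamp and pivot; a sketch where the column names timestamp/value are assumptions based on the printout:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for the sketch
import matplotlib.pyplot as plt

# hypothetical mini-frame mirroring the question's data
df = pd.DataFrame({
    "timestamp": ["2021-02-01 00:00:00"] * 3 + ["2021-02-01 00:00:10"],
    "value": [164.6136, 163.8085, 163.0466, 164.6136],
})

# number the rows within each timestamp; each number becomes one stacked segment
df["seg"] = df.groupby("timestamp").cumcount()
pivoted = df.pivot(index="timestamp", columns="seg", values="value")

# one bar per timestamp, stacked from the individual values
pivoted.plot(kind="bar", stacked=True, legend=False)
plt.tight_layout()
```

Each bar's total equals the cumsum column of the question's printout, since the segments are exactly the per-timestamp values.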

Select rows based on multiple columns that are later than a certain date

I have the following dataframe:
import pandas as pd
import numpy as np
np.random.seed(0)
# create three date ranges of 5 consecutive days, offset by one day each
rng = pd.date_range('2021-07-29', periods=5, freq='D')
rng_1 = pd.date_range('2021-07-30', periods=5, freq='D')
rng_2 = pd.date_range('2021-07-31', periods=5, freq='D')
df_status = ['received', 'send', 'received', 'send', 'send']
df = pd.DataFrame({ 'Date': rng, 'Date_1': rng_1, 'Date_2': rng_2, 'status': df_status })
print(df)
I would like to print out all the rows where at least one column contains a date that is equal to or later than 2021-08-01. What would be the most effective way to do this?
I have tried to do this with the following code, however, I get the following error:
start_date = '2022-08-01'
start_date = pd.to_datetime(start_date, format="%Y/%m/%d")
mask = (df['Date'] >= start_date | df['Date_1'] >= start_date | df['Date_3'] >= start_date)
TypeError: unsupported operand type(s) for &: 'Timestamp' and 'DatetimeArray'
Thank you in advance.
Adjusted dataframe:
df = {'sample_received': {1: nan,
2: nan,
3: '2022-08-01 20:31:24',
4: '2022-08-01 20:25:45',
5: '2022-08-01 20:41:22'},
'result_received': {1: '2022-08-01 16:25:33',
2: '2022-08-01 13:25:36',
3: '2022-08-02 09:45:34',
4: '2022-08-02 09:52:59',
5: '2022-08-02 08:22:45'},
'status': {1: 'Approved',
2: 'Approved',
3: 'Approved',
4: 'Approved',
5: 'Approved'}}
Use boolean indexing with any:
df[df.ge('2021-08-01').any(axis=1)]
output:
Date Date_1 Date_2
1 2021-07-30 2021-07-31 2021-08-01
2 2021-07-31 2021-08-01 2021-08-02
3 2021-08-01 2021-08-02 2021-08-03
4 2021-08-02 2021-08-03 2021-08-04
intermediate:
df.ge('2021-08-01').any(axis=1)
0 False
1 True
2 True
3 True
4 True
dtype: bool
Using only the date columns, filtering by name (Date in the column name):
df[df.filter(like='Date').ge('2021-08-01').any(axis=1)]
filtering by type:
df[df.select_dtypes('datetime64[ns]').ge('2021-08-01').any(axis=1)]
You may use any inside apply:
df[df.apply(lambda x: any([x[col] >= pd.to_datetime('2021-08-01') for col in df.columns]), axis=1)]
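For reference, the asker's original mask also works once each comparison is parenthesised; `|` binds tighter than `>=`, which is what triggers the TypeError (using the question's frame; note the columns are Date, Date_1 and Date_2, not Date_3):

```python
import pandas as pd
import numpy as np

np.random.seed(0)
rng = pd.date_range('2021-07-29', periods=5, freq='D')
rng_1 = pd.date_range('2021-07-30', periods=5, freq='D')
rng_2 = pd.date_range('2021-07-31', periods=5, freq='D')
df = pd.DataFrame({'Date': rng, 'Date_1': rng_1, 'Date_2': rng_2})

start_date = pd.to_datetime('2021-08-01')
# parentheses force each comparison to evaluate before the | operator
mask = (df['Date'] >= start_date) | (df['Date_1'] >= start_date) | (df['Date_2'] >= start_date)
print(df[mask])
```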

Merging two dataset with partial match

I want to merge two dataframes, df1 and df2. The shape of df1 is (115, 16) and of df2 is (624402, 23).
df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})
print(df1)
Invoice Currency
0 20561 EUR
1 20562 EUR
2 20563 EUR
3 20564 USD
print(df2)
Ref Type Amount Comment
0 20561 01 150 bla
1 INV20562 03 175 bla
2 INV20563BG 04 160 bla
3 20564 02 180 bla
I applied the following code:
df4 = df1.copy()
for i, row in df1.iterrows():
    tmp = df2[df2['Ref'].str.contains(row['Invoice'], na=False)]
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]
print(df4)
It is showing: IndexError: index 0 is out of bounds for axis 0 with size 0
The IndexError occurs when no row matches the invoice. You can check for this and return np.nan (or a different default value) when a matching invoice is not found:
import numpy as np

df4 = df1.copy()
for i, row in df1.iterrows():
    tmp = df2[df2['Ref'].str.contains(row['Invoice'], na=False)]
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0] if not tmp.empty else np.nan
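On 624k rows the row-by-row str.contains loop will be slow. If the invoice number can be pulled out of Ref with a regex, extracting it once and merging is much faster; a sketch assuming the number is always a 5-digit run, as in the sample:

```python
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Amount': ['150', '175', '160', '180']})

# pull the digit run out of Ref once (assumes a 5-digit invoice number),
# then do an ordinary left merge
df2['Invoice'] = df2['Ref'].str.extract(r'(\d{5})', expand=False)
df4 = df1.merge(df2[['Invoice', 'Amount']], on='Invoice', how='left')
print(df4)
```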

Assign Random Number between two value conditionally

I have a dataframe:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})
I wish to replace the values in the Age column by integers between two values. So, for example, I wish to replace each value with age range '35-44' by a random integer between 35-44.
I tried:
df.loc[df["AGE"]== '35-44', 'AGE'] = random.randint(35, 44)
But it picks the same value for each row. I would like it to randomly pick a different value for each row.
I get:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['38', '20-34', '38', '38', '45-70', '38']})
But I would like to get something like the following. I don't much care about how the values are distributed as long as they are in the range that I assign
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['36', '20-34', '39', '38', '45-70', '45']})
The code
random.randint(35, 44)
produces a single random value, making the statement analogous to:
df.loc[df["AGE"] == '35-44', 'AGE'] = 38  # some constant
We need a collection of values that is the same length as the values to fill. We can use np.random.randint instead; note that its upper bound is exclusive, so matching the inclusive 35-44 range needs high=45:
import numpy as np
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = np.random.randint(35, 45, m.sum())
(Series.sum is used to "count" the number of True values in the mask, since True is 1 and False is 0)
df:
Prod Line AGE
0 abc Revenues 40
1 qrt EBT 20-34
2 xyz Expenses 41
3 xam Revenues 35
4 asc EBT 45-70
5 yat Expenses 36
Naturally, using the filter on both sides of the expression with apply would also work:
import random
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = df.loc[m, 'AGE'].apply(lambda _: random.randint(35, 44))
df:
Prod Line AGE
0 abc Revenues 36
1 qrt EBT 20-34
2 xyz Expenses 37
3 xam Revenues 43
4 asc EBT 45-70
5 yat Expenses 44
*Reproducible with random.seed(28)
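If at some point every range string should be randomised, not only '35-44', the same idea extends by parsing the bounds once per row. A sketch under the assumption that every AGE value is a 'lo-hi' string:

```python
import numpy as np
import pandas as pd

# hypothetical frame: assumes every AGE value has the 'lo-hi' shape
df = pd.DataFrame({'AGE': ['35-44', '20-34', '35-44', '45-70']})

# split 'lo-hi' into integer bounds, then draw one integer per row;
# np.random.randint's upper bound is exclusive, hence the + 1
bounds = df['AGE'].str.split('-', expand=True).astype(int)
df['AGE'] = np.random.randint(bounds[0], bounds[1] + 1)
print(df)
```

np.random.randint broadcasts array-like low/high, so one call draws a different in-range value for each row.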

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data-frame with a categorical column and a numerical one, with the index set to time data:
df = pd.DataFrame({
'date': [
'2013-03-01 ', '2013-03-02 ',
'2013-03-01 ', '2013-03-02',
'2013-03-01 ', '2013-03-02 '
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
Kind Values
date
2013-03-01 A 1.0
2013-03-02 B 1.5
2013-03-01 A 2.0
2013-03-02 B 3.0
2013-03-01 B 5.0
2013-03-02 B 3.0
My aim is to achieve the below data-frame:
A_count B_count A_Val max B_Val max
date
2013-03-01 2 1 2 5
2013-03-02 0 3 0 3
which also has the time as index. Here, I note that if we use
data = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get :
Kind
date Kind
2013-03-01 A 2
B 1
2013-03-02 B 3
Use DataFrame.pivot_table, then flatten the MultiIndex columns with a list comprehension:
df = pd.DataFrame({
'date': [
'2013-03-01 ', '2013-03-02 ',
'2013-03-01 ', '2013-03-02',
'2013-03-01 ', '2013-03-02 '
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
# setting the index beforehand can be omitted
# df = df.set_index('date')
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count','max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
A_count B_count A_max B_max
date
2013-03-01 2.0 1.0 2.0 5.0
2013-03-02 NaN 3.0 NaN 3.0
Another solution with Grouper for resample by days:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='d'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]
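Both variants leave NaN where a Kind is absent on a day; to match the desired frame (zeros and integer counts), they can be finished with fillna and a cast. A self-contained rerun of the pivot_table version:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2013-03-01', '2013-03-02'] * 3),
    'Kind': ['A', 'B', 'A', 'B', 'B', 'B'],
    'Values': [1, 1.5, 2, 3, 5, 3],
})
out = df.pivot_table(index='date', columns='Kind', values='Values',
                     aggfunc=['count', 'max'])
out.columns = [f'{b}_{a}' for a, b in out.columns]
out = out.fillna(0).astype(int)  # 0 instead of NaN, as in the desired output
print(out)
```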