Select rows based on multiple columns that are later than a certain date - pandas

I have the following dataframe:
import pandas as pd
import numpy as np
np.random.seed(0)
# create three date ranges of 5 daily dates, offset from each other by one day
rng = pd.date_range('2021-07-29', periods=5, freq='D')
rng_1 = pd.date_range('2021-07-30', periods=5, freq='D')
rng_2 = pd.date_range('2021-07-31', periods=5, freq='D')
df_status = ['received', 'send', 'received', 'send', 'send']
df = pd.DataFrame({ 'Date': rng, 'Date_1': rng_1, 'Date_2': rng_2, 'status': df_status })
print(df)
I would like to print all rows in which at least one column contains a date equal to or later than 2021-08-01. What would be the most effective way to do this?
I have tried to do this with the following code, however, I get the following error:
start_date = '2021-08-01'
start_date = pd.to_datetime(start_date, format="%Y-%m-%d")
mask = (df['Date'] >= start_date | df['Date_1'] >= start_date | df['Date_2'] >= start_date)
TypeError: unsupported operand type(s) for |: 'Timestamp' and 'DatetimeArray'
Thank you in advance.
Adjusted dataframe:
df = {'sample_received': {1: np.nan,
                          2: np.nan,
                          3: '2022-08-01 20:31:24',
                          4: '2022-08-01 20:25:45',
                          5: '2022-08-01 20:41:22'},
      'result_received': {1: '2022-08-01 16:25:33',
                          2: '2022-08-01 13:25:36',
                          3: '2022-08-02 09:45:34',
                          4: '2022-08-02 09:52:59',
                          5: '2022-08-02 08:22:45'},
      'status': {1: 'Approved',
                 2: 'Approved',
                 3: 'Approved',
                 4: 'Approved',
                 5: 'Approved'}}
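Note that in this adjusted frame the date columns hold strings (with missing values), so, as a minimal sketch, they would need parsing before any comparison; NaT compares as False, so rows missing sample_received simply fall back to result_received:
frame = pd.DataFrame(df)  # `df` is the dict shown above
date_cols = ['sample_received', 'result_received']
frame[date_cols] = frame[date_cols].apply(pd.to_datetime)  # strings / NaN -> datetime64 / NaT
print(frame[frame[date_cols].ge('2022-08-01').any(axis=1)])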

Use boolean indexing with any:
df[df.ge('2021-08-01').any(axis=1)]
output:
        Date     Date_1     Date_2
1 2021-07-30 2021-07-31 2021-08-01
2 2021-07-31 2021-08-01 2021-08-02
3 2021-08-01 2021-08-02 2021-08-03
4 2021-08-02 2021-08-03 2021-08-04
intermediate:
df.ge('2021-08-01').any(axis=1)
0 False
1 True
2 True
3 True
4 True
dtype: bool
using only the date columns (safer here, since the string status column would otherwise be compared lexicographically, which can produce spurious matches)
filtering by name (Date in the column name):
df[df.filter(like='Date').ge('2021-08-01').any(axis=1)]
filtering by type:
df[df.select_dtypes('datetime64[ns]').ge('2021-08-01').any(axis=1)]

You may also use any inside apply, restricting the check to the date columns (comparing the string status column against a Timestamp would raise a TypeError):
df[df.apply(lambda x: any(x[col] >= pd.to_datetime('2021-08-01') for col in ['Date', 'Date_1', 'Date_2']), axis=1)]
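As an aside, the TypeError in the question comes from operator precedence: | binds more tightly than >=, so each comparison needs its own parentheses. A minimal fix of the question's mask:
mask = (df['Date'] >= start_date) | (df['Date_1'] >= start_date) | (df['Date_2'] >= start_date)
df[mask]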

Related

How to convert a dataframe of Timestamps to a numpy list of Timestamps

I have a dataframe like the following:
df = pd.DataFrame(
    {
        "timestamp1": [
            pd.Timestamp("2021-01-01"),
            pd.Timestamp("2021-03-01"),
        ],
        "timestamp2": [
            pd.Timestamp("2022-01-01"),
            pd.Timestamp("2022-03-01"),
        ],
    })
I want to convert this to a list of numpy arrays so I get something like the following:
array([[Timestamp('2021-01-01 00:00:00'),
        Timestamp('2022-01-01 00:00:00')],
       [Timestamp('2021-03-01 00:00:00'),
        Timestamp('2022-03-01 00:00:00')]], dtype=object)
I have tried df.to_numpy() but this doesn't seem to work as each item is a numpy.datetime64 object.
In [176]: df
Out[176]:
timestamp1 timestamp2
0 2021-01-01 2022-01-01
1 2021-03-01 2022-03-01
I don't know much about pd.Timestamp, but it looks like the values are actually stored as datetime64[ns], exactly what to_numpy() returned:
In [179]: df.dtypes
Out[179]:
timestamp1 datetime64[ns]
timestamp2 datetime64[ns]
dtype: object
An individual column, a Series, has a tolist() method:
In [190]: df['timestamp1'].tolist()
Out[190]: [Timestamp('2021-01-01 00:00:00'), Timestamp('2021-03-01 00:00:00')]
That's why @jezrael's answer works:
In [191]: arr = np.array([list(df[x]) for x in df.columns])
In [192]: arr
Out[192]:
array([[Timestamp('2021-01-01 00:00:00'),
Timestamp('2021-03-01 00:00:00')],
[Timestamp('2022-01-01 00:00:00'),
Timestamp('2022-03-01 00:00:00')]], dtype=object)
Once you have an array, you can easily transpose it:
In [193]: arr.T
Out[193]:
array([[Timestamp('2021-01-01 00:00:00'),
Timestamp('2022-01-01 00:00:00')],
[Timestamp('2021-03-01 00:00:00'),
Timestamp('2022-03-01 00:00:00')]], dtype=object)
An individual Timestamp object can be converted/displayed in various ways:
In [196]: x=arr[0,0]
In [197]: type(x)
Out[197]: pandas._libs.tslibs.timestamps.Timestamp
In [198]: x.to_datetime64()
Out[198]: numpy.datetime64('2021-01-01T00:00:00.000000000')
In [199]: x.to_numpy()
Out[199]: numpy.datetime64('2021-01-01T00:00:00.000000000')
In [200]: x.to_pydatetime()
Out[200]: datetime.datetime(2021, 1, 1, 0, 0)
In [201]: print(x)
2021-01-01 00:00:00
In [202]: repr(x)
Out[202]: "Timestamp('2021-01-01 00:00:00')"
Use a list comprehension to convert the columns to lists and then to a numpy array:
print (np.array([list(df[x]) for x in df.columns]))
[[Timestamp('2021-01-01 00:00:00') Timestamp('2021-03-01 00:00:00')]
[Timestamp('2022-01-01 00:00:00') Timestamp('2022-03-01 00:00:00')]]
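Another option that should give the same object array directly, already row-oriented so no transpose is needed: cast to object dtype before converting, which keeps the values as Timestamp objects:
arr = df.astype(object).to_numpy()
# array([[Timestamp('2021-01-01 00:00:00'), Timestamp('2022-01-01 00:00:00')],
#        [Timestamp('2021-03-01 00:00:00'), Timestamp('2022-03-01 00:00:00')]],
#       dtype=object)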

Multiple vectorized condition - Length of values between two data frames not matching

I am trying to perform a rather simple task using vectorized conditions. The sizes of the two dataframes differ, but I still do not understand why that would be an issue.
df1_data = {'In-Person Status': {0: 'No', 1: 'Yes', 2: 'No', 3: 'Yes', 4: 'No', 5: 'Yes'},
            'ID': {0: 5, 1: 45, 2: 22, 3: 34, 4: 46, 5: 184}}
df1 = pd.DataFrame(df1_data)
df2_data = {'Age': {0: 22, 1: 34, 2: 51, 3: 8}, 'ID': {0: 5, 1: 2145, 2: 5022, 3: 34}}
df2 = pd.DataFrame(df2_data)
I am using the following code:
conditions = [
    (df2['ID'].isin(df1['ID'])) & (df1['In-Person Status'] == 'No')
]
value = ['True']
df2['Result'] = np.nan
df2['Result'] = np.select(conditions, value, 'False')
Desired output:
Age    ID Result
 22  0005   True
 34  2145  False
 51  5022  False
  8  0034  False
Although the task might be very simple, I am getting the following error message:
ValueError: Length of values (72610) does not match length of index (1634)
I would very much appreciate any suggestions.
The error occurs because the condition mixes Series from frames of different lengths; the & operation aligns on the union of the two indexes, so the resulting mask does not match df2's length. Instead, we can join the two dfs as suggested in the comments, then drop the rows with NaN in the Age column. The last few lines are optional, to match the format of your desired output.
dfj = df1.join(df2, rsuffix='_left')  # index-aligned join; df2's ID becomes ID_left
conditions = [(dfj['ID'].isin(dfj['ID_left'])) & (dfj['In-Person Status'] == 'No')]
value = [True]
dfj['Result'] = np.select(conditions, value, False)
dfj = dfj.dropna(axis=0, how='any', subset=['Age'])  # keep only rows present in df2
dfj = dfj[['Age', 'ID_left', 'Result']]
dfj.columns = ['Age', 'ID', 'Result']
# the IDs are floats after the join (e.g. 5.0), so zfill(6) plus a slice yields '0005'
dfj['ID'] = dfj['ID'].apply(lambda x: str(x).zfill(6)[0:4])
dfj['Age'] = dfj['Age'].astype(int)
Output:
   Age    ID Result
0   22  0005   True
1   34  2145  False
2   51  5022  False
3    8  0034  False
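For this particular check a join is not strictly required; a minimal alternative sketch builds a set of the df1 IDs whose In-Person Status is 'No' and tests df2 against it, so everything stays on df2's own index and no length mismatch can occur:
no_ids = set(df1.loc[df1['In-Person Status'] == 'No', 'ID'])
df2['Result'] = df2['ID'].isin(no_ids)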

Assign Random Number between two value conditionally

I have a dataframe:
df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})
I wish to replace the values in the AGE column with integers drawn from the stated range. So, for example, each '35-44' value should be replaced by a random integer between 35 and 44.
I tried:
df.loc[df["AGE"]== '35-44', 'AGE'] = random.randint(35, 44)
But it picks the same value for each row. I would like it to randomly pick a different value for each row.
I get:
df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['38', '20-34', '38', '38', '45-70', '38']})
But I would like to get something like the following. I don't much care about how the values are distributed, as long as they are in the range that I assign:
df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['36', '20-34', '39', '38', '45-70', '45']})
The code
random.randint(35, 44)
produces a single random value, making the statement analogous to:
df.loc[df["AGE"]== '35-44', 'AGE'] = 38 # some constant
We need a collection of values that is the same length as the values to fill. We can use np.random.randint instead:
import numpy as np
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = np.random.randint(35, 44, m.sum())
(Series.sum is used to count the number of True values in the Series, since True is 1 and False is 0. Note that np.random.randint's upper bound is exclusive, unlike random.randint's, so pass 45 as the high value if 44 itself should be a possible draw.)
df:
Prod Line AGE
0 abc Revenues 40
1 qrt EBT 20-34
2 xyz Expenses 41
3 xam Revenues 35
4 asc EBT 45-70
5 yat Expenses 36
*Reproducible with np.random.seed(26)
Naturally, using the filter on both sides of the expression with apply would also work:
import random
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = df.loc[m, 'AGE'].apply(lambda _: random.randint(35, 44))
df:
Prod Line AGE
0 abc Revenues 36
1 qrt EBT 20-34
2 xyz Expenses 37
3 xam Revenues 43
4 asc EBT 45-70
5 yat Expenses 44
*Reproducible with random.seed(28)
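If, beyond the question's single '35-44' case, every row should receive a draw from its own range, a hedged sketch starting from the original frame splits the 'lo-hi' strings and feeds the bounds to np.random.randint, which accepts array-valued bounds:
bounds = df['AGE'].str.split('-', expand=True).astype(int)
df['AGE'] = np.random.randint(bounds[0], bounds[1] + 1)  # +1 because the high end is exclusive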

Fill column in df.A based on comparison values in df.A and df.B

So I have this code:
import pandas as pd
import numpy as np
frame1 = {'Season': ['S19', 'S20', 'S21',
                     'S19', 'S20', 'S21',
                     'S19', 'S20', 'S21'],
          'DateFrom': ['2019-01-01', '2020-01-01', '2021-01-01',
                       '2019-01-01', '2020-01-01', '2021-01-01',
                       '2019-01-01', '2020-01-01', '2021-01-01'],
          'DateTo': ['2019-12-30', '2020-12-30', '2021-12-30',
                     '2019-12-30', '2020-12-30', '2021-12-30',
                     '2019-12-30', '2020-12-30', '2021-12-30'],
          'Currency': ['EUR', 'EUR', 'EUR',
                       'USD', 'USD', 'USD',
                       'MAD', 'MAD', 'MAD'],
          'Rate': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df1 = pd.DataFrame(data=frame1)
frame2 = {'Room': ['Double', 'Single', 'SeaView'],
          'Season': ['S20', 'S20', 'S19'],
          'DateFrom': ['2020-05-01', '2020-07-05', '2019-03-25'],
          'Currency': ['EUR', 'MAD', 'USD'],
          'Rate': [0, 0, 0]}
df2 = pd.DataFrame(data=frame2)
df1[['DateFrom', 'DateTo']] = df1[['DateFrom', 'DateTo']].apply(pd.to_datetime)
df2[['DateFrom']] = df2[['DateFrom']].apply(pd.to_datetime)
print(df1.dtypes)
print(df2.dtypes)
df2['Rate'] = np.where((
    df2['Season'] == df1['Season'] &
    df2['Currency'] == df1['Currency'] &
    (df2['DateFrom'] > df1['DateFrom'] & df2['DateFrom'] < df1['DateTo'])
), df1['Rates'], 'MissingData')
print(df2)
What I am trying to achieve is to fill the Rate values in df2 with the Rate values from df1, based on conditions where:
df2.Season == df1.Season &
df2.Currency == df1.Currency &
df2.DateFrom must be between df1.DateFrom and df1.DateTo
So my result in 'Rate' should be 2, 8, 4.
I was hoping that the code above would work, but it doesn't; I am getting this error:
"TypeError: unsupported operand type(s) for &: 'str' and 'str'"
Any help making it work would be appreciated.
You can first merge then compare:
out = df1.merge(df2[['Season','Currency','DateFrom']],on=['Season','Currency'],
suffixes=('','_y'))
out = (out[out['DateFrom_y'].between(out['DateFrom'],out['DateTo'])]
.reindex(columns=df1.columns).copy())
print(out)
  Season   DateFrom     DateTo Currency  Rate
0    S20 2020-01-01 2020-12-30      EUR     2
1    S19 2019-01-01 2019-12-30      USD     4
2    S20 2020-01-01 2020-12-30      MAD     8
EDIT per comments:
out = df1.merge(df2,on=['Season','Currency'],suffixes=('','_y'))
out = (out[out['DateFrom_y'].between(out['DateFrom'],out['DateTo'])]
.reindex(columns=df2.columns).copy())
      Room Season   DateFrom Currency  Rate
0   Double    S20 2020-01-01      EUR     2
1  SeaView    S19 2019-01-01      USD     4
2   Single    S20 2020-01-01      MAD     8
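Closer to the question's np.where intent, with 'MissingData' kept for unmatched rows, a hedged sketch using a left merge (note it can duplicate df2 rows if several df1 rows share a Season/Currency pair):
out = df2.drop(columns='Rate').merge(df1, on=['Season', 'Currency'],
                                     how='left', suffixes=('', '_df1'))
ok = out['DateFrom'].between(out['DateFrom_df1'], out['DateTo'])
out['Rate'] = np.where(ok, out['Rate'], 'MissingData')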

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data-frame with a categorical column and a numerical one, with the index set to time data:
df = pd.DataFrame({
    'date': ['2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02'],
    'Kind': ['A', 'B', 'A', 'B', 'B', 'B'],
    'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
            Kind  Values
date
2013-03-01     A     1.0
2013-03-02     B     1.5
2013-03-01     A     2.0
2013-03-02     B     3.0
2013-03-01     B     5.0
2013-03-02     B     3.0
My aim is to achieve the below data-frame:
            A_count  B_count  A_Val max  B_Val max
date
2013-03-01        2        1          2          5
2013-03-02        0        3          0          3
Which also has the time as index. Here, I note that if we use
data = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
                 Kind
date       Kind
2013-03-01 A        2
           B        1
2013-03-02 B        3
Use DataFrame.pivot_table, flattening the MultiIndex columns with a list comprehension:
df = pd.DataFrame({
    'date': ['2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02'],
    'Kind': ['A', 'B', 'A', 'B', 'B', 'B'],
    'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
# setting the index is not needed here and can be omitted
#df = df.set_index('date')
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count','max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
            A_count  B_count  A_max  B_max
date
2013-03-01      2.0      1.0    2.0    5.0
2013-03-02      NaN      3.0    NaN    3.0
Another solution, with Grouper, resampling by day:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='d'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]
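Both versions leave NaN where a Kind is absent on a given day, while the target frame in the question shows 0 instead, so a final fill bridges the gap (counts stay float after unstacking; cast with astype(int) if integers are wanted):
df = df.fillna(0)
print (df)
            A_count  B_count  A_max  B_max
date
2013-03-01      2.0      1.0    2.0    5.0
2013-03-02      0.0      3.0    0.0    3.0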