Comparing dates of values in 2 dataframes to decide output - pandas

This is DataFrame 1:
Date Serial Number Type
0 2014-12-17 1N4AL2EP8DC270200 New
1 2015-10-28 1N4AL2EP8DC270200 Used
2 2015-01-22 1N4AL3AP1EN239307 New
3 2015-11-22 1N4AL3AP1EN239307 Used
4 2015-05-22 1N4AL3AP1FC235402 New
5 2016-12-02 1N4AL3AP1FC235402 Used
6 2015-01-22 1N4AL3AP2FC213098 New
7 2016-05-13 1N4AL3AP2FC213098 Used
8 2014-05-14 1N4AL3AP3EC132416 New
9 2016-04-07 1N4AL3AP3EC132416 Used
10 2014-05-24 1N4AL3AP5EC316644 New
11 2014-12-18 1N4AL3AP5EC316644 Used
12 2014-12-11 1N4AL3AP6EC322517 New
13 2015-10-04 1N4AL3AP6EC322517 Used
14 2016-06-06 1N4AL3AP6EC322517 Used
...
This is DataFrame 2:
Date Serial Number
0 2014-03-12 5N1AA08C78N611573
1 2014-03-12 JN8AS5MT3EW604277
2 2014-03-12 1N6AF0LX5DN114710
3 2014-03-12 1N4AL3AP8DN447876
4 2014-03-12 JN8AZ1MU8AW021145
5 2014-03-12 JN1AZ4EH0AM500138
6 2014-03-12 JN8AF5MR3BT013548
7 2014-03-12 3N1AB61E17L629049
8 2014-03-12 3N1BC13E87L368844
9 2014-03-13 1N6AD07W95C431183
10 2014-03-13 1N6AA07A25N543180
11 2014-03-13 1N4CL2AP1BC110185
12 2014-03-13 JN8AZ1MW1BW181306
13 2014-03-13 5N1BV28U46N116791
...
This is just a sample, not the entire DataFrame. I need to retrieve the first Date of every Serial Number whose Type is Used in DataFrame 1 (for example, for serial number '1N4AL3AP6EC322517' the Date I'm looking for is 2015-10-04). Then I need to compare this Date with the Date recorded for the same Serial Number in DataFrame 2: if the Date in DataFrame 2 is earlier than the one in DataFrame 1, mark it with 'A', otherwise mark it with 'B'.
I have to do this for over 2000 serial numbers; what's an efficient way to do it?

I think you can use merge_asof:
print (df2)
Date Serial Number
0 2016-03-12 1N4AL3AP6EC322517
1 2013-03-12 1N4AL3AP5EC316644
2 2014-03-12 1N4AL3AP3EC132416
3 2016-08-12 1N4AL3AP2FC213098
4 2014-03-12 JN8AZ1MU8AW021145
#if necessary cast Date columns to datetime
df1.Date = pd.to_datetime(df1.Date)
df2.Date = pd.to_datetime(df2.Date)
#keep the first 'Used' row per Serial Number
df = df1[df1.Type == 'Used'].drop_duplicates(['Serial Number'])
print (df)
Date Serial Number Type
1 2015-10-28 1N4AL2EP8DC270200 Used
3 2015-11-22 1N4AL3AP1EN239307 Used
5 2016-12-02 1N4AL3AP1FC235402 Used
7 2016-05-13 1N4AL3AP2FC213098 Used
9 2016-04-07 1N4AL3AP3EC132416 Used
11 2014-12-18 1N4AL3AP5EC316644 Used
13 2015-10-04 1N4AL3AP6EC322517 Used
#add value B
df2['Mark'] = 'B'
df = pd.merge_asof(df.sort_values(['Date']),
df2.sort_values(['Date']), on='Date', by='Serial Number')
print (df)
Date Serial Number Type Mark
0 2014-12-18 1N4AL3AP5EC316644 Used B
1 2015-10-04 1N4AL3AP6EC322517 Used NaN
2 2015-10-28 1N4AL2EP8DC270200 Used NaN
3 2015-11-22 1N4AL3AP1EN239307 Used NaN
4 2016-04-07 1N4AL3AP3EC132416 Used B
5 2016-05-13 1N4AL3AP2FC213098 Used NaN
6 2016-12-02 1N4AL3AP1FC235402 Used NaN
#add value A
mask = df['Serial Number'].isin(df2['Serial Number'])
df.loc[mask, 'Mark'] = df.loc[mask, 'Mark'].fillna('A')
print (df)
Date Serial Number Type Mark
0 2014-12-18 1N4AL3AP5EC316644 Used B
1 2015-10-04 1N4AL3AP6EC322517 Used A
2 2015-10-28 1N4AL2EP8DC270200 Used NaN
3 2015-11-22 1N4AL3AP1EN239307 Used NaN
4 2016-04-07 1N4AL3AP3EC132416 Used B
5 2016-05-13 1N4AL3AP2FC213098 Used A
6 2016-12-02 1N4AL3AP1FC235402 Used NaN
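An exact merge on Serial Number followed by a direct date comparison is another straightforward option. A minimal sketch with made-up sample data, assuming each serial appears at most once in df2 and labeling rows per the question's wording (df2 Date earlier → 'A'):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Date': pd.to_datetime(['2015-10-04', '2014-12-18', '2016-04-07']),
    'Serial Number': ['1N4AL3AP6EC322517', '1N4AL3AP5EC316644',
                      '1N4AL3AP3EC132416'],
    'Type': ['Used', 'Used', 'Used'],
})
df2 = pd.DataFrame({
    'Date': pd.to_datetime(['2016-03-12', '2013-03-12', '2014-03-12']),
    'Serial Number': ['1N4AL3AP6EC322517', '1N4AL3AP5EC316644',
                      '1N4AL3AP3EC132416'],
})

# first 'Used' date per serial (sort first so drop_duplicates keeps the earliest)
first_used = (df1[df1.Type == 'Used']
              .sort_values('Date')
              .drop_duplicates('Serial Number'))

# exact merge on the serial, then compare the two date columns directly
merged = first_used.merge(df2, on='Serial Number', how='left',
                          suffixes=('_1', '_2'))
# df2 date earlier -> 'A'; otherwise (including no df2 match, NaT) -> 'B'
merged['Mark'] = np.where(merged['Date_2'] < merged['Date_1'], 'A', 'B')
```

With only one row per serial on each side, a plain merge avoids the sort-by-Date requirement that merge_asof imposes.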

Related

Calculate the minimum value of a certain column for all the groups and subtract the value from all the values of a certain column of that group

df = pd.DataFrame([['2018-02-03',42],
                   ['2018-02-03',22],
                   ['2018-02-03',10],
                   ['2018-02-03',32],
                   ['2018-02-03',10],
                   ['2018-02-04',8],
                   ['2018-02-04',2],
                   ['2018-02-04',12],
                   ['2018-02-03',20],
                   ['2018-02-05',30],
                   ['2018-02-05',5],
                   ['2018-02-05',15]])
df.columns = ['date','quantity']
I want to group by date, calculate the minimum value of the 'quantity' column for each group, and subtract that minimum from all the 'quantity' values of that group. The desired output is:
day value
2018-02-03 32 # (because 42-10 = 32; 10 is the minimum for 2018-02-03)
2018-02-03 12
2018-02-03 0
2018-02-03 22
2018-02-03 0
2018-02-04 6
2018-02-04 0
2018-02-04 10
2018-02-03 10
2018-02-05 25
2018-02-05 0
2018-02-05 10
Now, this is what I tried:
df = df.groupby('Date', as_index = True)
datamin = df.groupby('Date')['quantity'].min()
But this creates a DataFrame with the first quantity by Date, and I also do not know how to proceed after this!
Try it via groupby() and transform():
df['value']=df.groupby('date')['quantity'].transform(lambda x:x-x.min())
output of df:
date quantity value
0 2018-02-03 42 32
1 2018-02-03 22 12
2 2018-02-03 10 0
3 2018-02-03 32 22
4 2018-02-03 10 0
5 2018-02-04 8 6
6 2018-02-04 2 0
7 2018-02-04 12 10
8 2018-02-03 20 10
9 2018-02-05 30 25
10 2018-02-05 5 0
11 2018-02-05 15 10
To improve performance, use GroupBy.transform without a lambda function; it is better to subtract the whole column at once:
df['value'] = df['quantity'].sub(df.groupby('date')['quantity'].transform('min'))
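A self-contained sketch of the transform approach, using a few of the sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-02-03', '2018-02-03', '2018-02-04', '2018-02-04'],
    'quantity': [42, 10, 8, 2],
})

# transform('min') broadcasts each group's minimum back to its rows,
# so the subtraction stays fully vectorized
df['value'] = df['quantity'].sub(df.groupby('date')['quantity'].transform('min'))
print(df['value'].tolist())  # [32, 0, 6, 0]
```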

How to convert a column in the pandas dataframe from UTC to datetime64?

I have a column in a dataframe which is currently a string:
'4/17/2015 8:03:45 PM'
I need to create other columns based on this column
['dayofweek','dayofmonth']
I couldn't find a fast solution that works for 60,000,000 rows. Please help.
You can create functions to convert the dates in your DataFrame to the output you need and then add the results as new columns, as below:
df
numbers time
0 1 2019-01-01
1 2 2019-01-02
2 3 2019-01-03
3 4 2019-01-04
4 5 2019-01-05
5 6 2019-01-06
6 7 2019-01-07
7 8 2019-01-08
8 9 2019-01-09
9 10 2019-01-10
10 11 2019-01-11
11 12 2019-01-12
Create the functions to apply to your DataFrame's date column:
def day_of_the_week(value):
    return value.strftime("%A")

def day_of_the_month(value):
    return value.day
Create the new columns by applying the functions to the date column:
df['day_of_the_week'] = df['time'].apply(day_of_the_week)
df['day_of_the_month'] = df['time'].apply(day_of_the_month)
Get the updated DataFrame:
df
numbers time day_of_the_week day_of_the_month
0 1 2019-01-01 Tuesday 1
1 2 2019-01-02 Wednesday 2
2 3 2019-01-03 Thursday 3
3 4 2019-01-04 Friday 4
4 5 2019-01-05 Saturday 5
5 6 2019-01-06 Sunday 6
6 7 2019-01-07 Monday 7
7 8 2019-01-08 Tuesday 8
8 9 2019-01-09 Wednesday 9
9 10 2019-01-10 Thursday 10
10 11 2019-01-11 Friday 11
11 12 2019-01-12 Saturday 12
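For 60,000,000 rows, row-wise apply() will be slow. A vectorized alternative, assuming the strings all follow the '4/17/2015 8:03:45 PM' pattern from the question, is to parse once with an explicit format and then use the .dt accessor:

```python
import pandas as pd

df = pd.DataFrame({'time': ['4/17/2015 8:03:45 PM', '4/18/2015 9:10:00 AM']})

# parse once with an explicit format (much faster than letting pandas guess),
# then derive both columns with the vectorized .dt accessor
df['time'] = pd.to_datetime(df['time'], format='%m/%d/%Y %I:%M:%S %p')
df['dayofweek'] = df['time'].dt.day_name()
df['dayofmonth'] = df['time'].dt.day
```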

what is wrong with pandas date_time?

Following is the code for an example of using the pandas datetime module. As shown in the output, it is not consistent; it mixes up day and month. Am I doing something wrong?
dates = ['20/11/17', '12/02/18', '02/05/18', '10/09/18',
'22/06/17', '12/02/15','19/11/17', '04/09/16',
'12/05/18', '11/04/15', '10/04/17', '13/06/16']
data = pd.DataFrame(data=dates, columns=['date'])
data['date_format'] = pd.to_datetime(dates)
data
Output:
date date_format
0 20/11/17 2017-11-20
1 12/02/18 2018-12-02
2 02/05/18 2018-02-05
3 10/09/18 2018-10-09
4 22/06/17 2017-06-22
5 12/02/15 2015-12-02
6 19/11/17 2017-11-19
7 04/09/16 2016-04-09
8 12/05/18 2018-12-05
9 11/04/15 2015-11-04
10 10/04/17 2017-10-04
11 13/06/16 2016-06-13
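No answer is recorded here, but the likely cause is that pd.to_datetime guesses the format per element and defaults to month-first when a date is ambiguous; strings like 20/11/17 can only be day-first, so the guesses disagree across rows. Passing dayfirst=True, or better an explicit format, makes the parse consistent. A minimal sketch:

```python
import pandas as pd

dates = ['20/11/17', '12/02/18', '02/05/18']

# either hint that the strings are day-first...
parsed = pd.to_datetime(dates, dayfirst=True)
# ...or, more strictly, give the exact format
parsed = pd.to_datetime(dates, format='%d/%m/%y')
```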

Pandas time difference calculation error

I have two time columns in my dataframe: called date1 and date2.
As far as I have always assumed, both are in datetime format. However, I now have to calculate the difference in days between the two, and it doesn't work.
I run the following code to analyse the data:
df['month1'] = pd.DatetimeIndex(df['date1']).month
df['month2'] = pd.DatetimeIndex(df['date2']).month
print(df[["date1", "date2", "month1", "month2"]].head(10))
print(df["date1"].dtype)
print(df["date2"].dtype)
The output is:
date1 date2 month1 month2
0 2016-02-29 2017-01-01 1 1
1 2016-11-08 2017-01-01 1 1
2 2017-11-27 2009-06-01 1 6
3 2015-03-09 2014-07-01 1 7
4 2015-06-02 2014-07-01 1 7
5 2015-09-18 2017-01-01 1 1
6 2017-09-06 2017-07-01 1 7
7 2017-04-15 2009-06-01 1 6
8 2017-08-14 2014-07-01 1 7
9 2017-12-06 2014-07-01 1 7
datetime64[ns]
object
As you can see, the month for date1 is not calculated correctly!
The final operation, which does not work is:
df["date_diff"] = (df["date1"]-df["date2"]).astype('timedelta64[D]')
which leads to the following error:
incompatible type [object] for a datetime/timedelta operation
I first thought it might be due to date2, so I tried:
df["date2_new"] = pd.to_datetime(df['date2'] - 315619200, unit = 's')
leading to:
unsupported operand type(s) for -: 'str' and 'int'
Anyone has an idea what I need to change?
Convert both columns to datetime, then use the .dt accessor with the days attribute:
df[['date1','date2']] = df[['date1','date2']].apply(pd.to_datetime)
df['date_diff'] = (df['date1'] - df['date2']).dt.days
Output:
date1 date2 month1 month2 date_diff
0 2016-02-29 2017-01-01 1 1 -307
1 2016-11-08 2017-01-01 1 1 -54
2 2017-11-27 2009-06-01 1 6 3101
3 2015-03-09 2014-07-01 1 7 251
4 2015-06-02 2014-07-01 1 7 336
5 2015-09-18 2017-01-01 1 1 -471
6 2017-09-06 2017-07-01 1 7 67
7 2017-04-15 2009-06-01 1 6 2875
8 2017-08-14 2014-07-01 1 7 1140
9 2017-12-06 2014-07-01 1 7 1254
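A self-contained sketch of why the conversion matters: date2 has dtype object (strings), and subtracting strings from datetimes raises the error above; converting first makes the subtraction work (sample values taken from the first two rows):

```python
import pandas as pd

df = pd.DataFrame({
    'date1': pd.to_datetime(['2016-02-29', '2016-11-08']),
    'date2': ['2017-01-01', '2017-01-01'],   # strings: dtype is object
})

# convert the object column before subtracting
df['date2'] = pd.to_datetime(df['date2'])
df['date_diff'] = (df['date1'] - df['date2']).dt.days
print(df['date_diff'].tolist())  # [-307, -54]
```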

Subtract day column from date column in pandas data frame

I have two columns in my data frame. One column is a date (df["Start_date"]) and the other is a number of days. I want to subtract the number-of-days column (df["days"]) from the date column.
I was trying something like this:
df["new_date"]=df["Start_date"]-datetime.timedelta(days=df["days"])
I think you need to_timedelta:
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
Sample:
np.random.seed(120)
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10)
df = pd.DataFrame({'Start_date': rng, 'days': np.random.choice(np.arange(10), size=10)})
print (df)
Start_date days
0 2015-02-24 7
1 2015-02-25 0
2 2015-02-26 8
3 2015-02-27 4
4 2015-02-28 1
5 2015-03-01 7
6 2015-03-02 1
7 2015-03-03 3
8 2015-03-04 8
9 2015-03-05 9
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
print (df)
Start_date days new_date
0 2015-02-24 7 2015-02-17
1 2015-02-25 0 2015-02-25
2 2015-02-26 8 2015-02-18
3 2015-02-27 4 2015-02-23
4 2015-02-28 1 2015-02-27
5 2015-03-01 7 2015-02-22
6 2015-03-02 1 2015-03-01
7 2015-03-03 3 2015-02-28
8 2015-03-04 8 2015-02-24
9 2015-03-05 9 2015-02-24