Pandas time difference calculation error

I have two time columns in my dataframe, called date1 and date2.
I have always assumed that both are in datetime format. However, I now have to calculate the difference in days between the two, and it doesn't work.
I run the following code to analyse the data:
df['month1'] = pd.DatetimeIndex(df['date1']).month
df['month2'] = pd.DatetimeIndex(df['date2']).month
print(df[["date1", "date2", "month1", "month2"]].head(10))
print(df["date1"].dtype)
print(df["date2"].dtype)
The output is:
date1 date2 month1 month2
0 2016-02-29 2017-01-01 1 1
1 2016-11-08 2017-01-01 1 1
2 2017-11-27 2009-06-01 1 6
3 2015-03-09 2014-07-01 1 7
4 2015-06-02 2014-07-01 1 7
5 2015-09-18 2017-01-01 1 1
6 2017-09-06 2017-07-01 1 7
7 2017-04-15 2009-06-01 1 6
8 2017-08-14 2014-07-01 1 7
9 2017-12-06 2014-07-01 1 7
datetime64[ns]
object
As you can see, the month for date1 is not calculated correctly!
The final operation, which does not work, is:
df["date_diff"] = (df["date1"]-df["date2"]).astype('timedelta64[D]')
which leads to the following error:
incompatible type [object] for a datetime/timedelta operation
I first thought it might be due to date2, so I tried:
df["date2_new"] = pd.to_datetime(df['date2'] - 315619200, unit = 's')
leading to:
unsupported operand type(s) for -: 'str' and 'int'
Does anyone have an idea what I need to change?

Your date2 column has dtype object (it holds strings), which is why the datetime arithmetic fails. Convert both columns to datetime, then use the .dt accessor with the days attribute:
df[['date1','date2']] = df[['date1','date2']].apply(pd.to_datetime)
df['date_diff'] = (df['date1'] - df['date2']).dt.days
Output:
date1 date2 month1 month2 date_diff
0 2016-02-29 2017-01-01 1 1 -307
1 2016-11-08 2017-01-01 1 1 -54
2 2017-11-27 2009-06-01 1 6 3101
3 2015-03-09 2014-07-01 1 7 251
4 2015-06-02 2014-07-01 1 7 336
5 2015-09-18 2017-01-01 1 1 -471
6 2017-09-06 2017-07-01 1 7 67
7 2017-04-15 2009-06-01 1 6 2875
8 2017-08-14 2014-07-01 1 7 1140
9 2017-12-06 2014-07-01 1 7 1254
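
As a side note, once both columns are real datetimes, the month extraction from the question also works, via the same .dt accessor (a minimal sketch):
# after the pd.to_datetime conversion above, month extraction behaves as expected
df['month1'] = df['date1'].dt.month
df['month2'] = df['date2'].dt.month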

Related

Is there a way of group by month in Pandas starting at specific day number?

I'm trying to group some data by month in Python, but I need the month to start on the 25th of each month. Is there a way to do that in Pandas?
For weeks there is a way of starting on Monday, Tuesday, ... but for months it's always the full month.
pd.Grouper(key='date', freq='M')
You could offset the dates by 24 days and groupby:
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range('2019-01-01', '2019-04-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})
# shift dates back 24 days so that the 25th maps to the 1st
s = df['date'].sub(pd.DateOffset(24))
(df.groupby([s.dt.year, s.dt.month], as_index=False)
.agg({'date':'min', 'val':'sum'})
)
gives
date val
0 2019-01-01 10.120368
1 2019-01-25 14.895363
2 2019-02-25 14.544506
3 2019-03-25 17.228734
4 2019-04-25 3.334160
Another example:
np.random.seed(1)
dates = pd.date_range('2019-01-20', '2019-01-30', freq='D')
df = pd.DataFrame({'date': dates,
                   'val': np.random.uniform(0, 1, len(dates))})
s = df['date'].sub(pd.DateOffset(24))
df['groups'] = df.groupby([s.dt.year, s.dt.month]).cumcount()
gives
date val groups
0 2019-01-20 0.417022 0
1 2019-01-21 0.720324 1
2 2019-01-22 0.000114 2
3 2019-01-23 0.302333 3
4 2019-01-24 0.146756 4
5 2019-01-25 0.092339 0
6 2019-01-26 0.186260 1
7 2019-01-27 0.345561 2
8 2019-01-28 0.396767 3
9 2019-01-29 0.538817 4
10 2019-01-30 0.419195 5
And you can see how the cumcount restarts on day 25.
I prepared the following test DataFrame:
Dat Val
0 2017-03-24 0
1 2017-03-25 0
2 2017-03-26 1
3 2017-03-27 0
4 2017-04-24 0
5 2017-04-25 0
6 2017-05-24 0
7 2017-05-25 2
8 2017-05-26 0
The first step is to compute a "shifted date" column:
df['Dat2'] = df.Dat + pd.DateOffset(days=-24)
The result is:
Dat Val Dat2
0 2017-03-24 0 2017-02-28
1 2017-03-25 0 2017-03-01
2 2017-03-26 1 2017-03-02
3 2017-03-27 0 2017-03-03
4 2017-04-24 0 2017-03-31
5 2017-04-25 0 2017-04-01
6 2017-05-24 0 2017-04-30
7 2017-05-25 2 2017-05-01
8 2017-05-26 0 2017-05-02
As you can see, March dates in Dat2 start only from the original date 2017-03-25,
and so on.
The value of 1 falls in March (by Dat2) and the value of 2 falls in May (also by Dat2).
Then, to compute e.g. a sum by month, we can run:
df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum()
getting:
Val
Dat2
2017-02-01 0
2017-03-01 1
2017-04-01 0
2017-05-01 2
So we have the correct grouping:
1 is in March,
2 is in May.
The advantage over the other answer is that you have all dates on the first
day of a month, of course bearing in mind that e.g. 2017-03-01 in the
result means the period from 2017-03-25 to 2017-04-24 (inclusive).
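If you would rather label each group with the real period start (the 25th) instead of the first of the month, you can shift the index back by the same offset; a small sketch on the same test frame (res is just an illustrative name):
res = df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum()
res.index = res.index + pd.DateOffset(days=24)  # e.g. 2017-03-01 -> 2017-03-25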

Grouping into series based on days since

I need to create a new grouping every time I have a period of more than 60 days since my previous record.
Basically, I need to take the data I have here:
RowNo StartDate StopDate DaysBetween
1 3/21/2017 3/21/2017 14
2 4/4/2017 4/4/2017 14
3 4/18/2017 4/18/2017 14
4 6/23/2017 6/23/2017 66
5 7/5/2017 7/5/2017 12
6 7/19/2017 7/19/2017 14
7 9/27/2017 9/27/2017 70
8 10/24/2017 10/24/2017 27
9 10/31/2017 10/31/2017 7
10 11/14/2017 11/14/2017 14
And turn it into this:
RowNo StartDate StopDate DaysBetween Series
1 3/21/2017 3/21/2017 14 1
2 4/4/2017 4/4/2017 14 1
3 4/18/2017 4/18/2017 14 1
4 6/23/2017 6/23/2017 66 2
5 7/5/2017 7/5/2017 12 2
6 7/19/2017 7/19/2017 14 2
7 9/27/2017 9/27/2017 70 3
8 10/24/2017 10/24/2017 27 3
9 10/31/2017 10/31/2017 7 3
10 11/14/2017 11/14/2017 14 3
Once I have that I'll group by Series and get the min(StartDate) and max(StopDate) for individual durations.
I could do this using a cursor but I'm sure someone much smarter than me has figured out a more elegant solution. Thanks in advance!
You can use the window function SUM() OVER with a conditional flag.
Example
Select *
,Series= 1+sum(case when [DaysBetween]>60 then 1 else 0 end) over (Order by RowNo)
From YourTable
Returns
RowNo StartDate StopDate DaysBetween Series
1 2017-03-21 2017-03-21 14 1
2 2017-04-04 2017-04-04 14 1
3 2017-04-18 2017-04-18 14 1
4 2017-06-23 2017-06-23 66 2
5 2017-07-05 2017-07-05 12 2
6 2017-07-19 2017-07-19 14 2
7 2017-09-27 2017-09-27 70 3
8 2017-10-24 2017-10-24 27 3
9 2017-10-31 2017-10-31 7 3
10 2017-11-14 2017-11-14 14 3
EDIT - SQL Server 2008 version (the windowed SUM with ORDER BY above requires 2012+)
Select A.*
,B.*
From YourTable A
Cross Apply (
Select Series=1+sum( case when [DaysBetween]>60 then 1 else 0 end)
From YourTable
Where RowNo <= A.RowNo
) B
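
For completeness, since most of the other threads here are pandas: the same cumulative-flag idea is one line there (a hedged sketch, assuming the table is already loaded in a DataFrame df ordered by RowNo):
# flag the rows that start a new series (gap > 60 days) and cumulative-sum the flags
df['Series'] = 1 + (df['DaysBetween'] > 60).cumsum()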

PostgreSQL - rank over rows listed in blocks of 0 and 1

I have a table that looks like:
id code date1 date2 block
--------------------------------------------------
20 1234 2017-07-01 2017-07-31 1
15 1234 2017-06-01 2017-06-30 1
13 1234 2017-05-01 2017-05-31 0
11 1234 2017-03-01 2017-03-31 0
9 1234 2017-02-01 2017-02-28 1
8 1234 2017-01-01 2017-01-31 0
7 1234 2016-11-01 2016-11-30 0
6 1234 2016-10-01 2016-10-31 1
2 1234 2016-09-01 2016-09-30 1
I need to rank the rows according to the blocks of 0's and 1's, like:
id code date1 date2 block desired_rank
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 1
13 1234 2017-05-01 2017-05-31 0 2
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 4
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 5
2 1234 2016-09-01 2016-09-30 1 5
I've tried to use rank() and dense_rank(), but the result I end up with is:
id code date1 date2 block dense_rank()
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 2
13 1234 2017-05-01 2017-05-31 0 1
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 3
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 4
2 1234 2016-09-01 2016-09-30 1 5
In the last table, the rank doesn't respect the consecutive runs; it treats all the 1's and all the 0's as one unit each and just counts them in ascending order, starting at the first 1 and the first 0.
My query goes like this:
CREATE TEMP TABLE data (id integer,code text, date1 date, date2 date, block integer);
INSERT INTO data VALUES
(20,'1234', '2017-07-01','2017-07-31',1),
(15,'1234', '2017-06-01','2017-06-30',1),
(13,'1234', '2017-05-01','2017-05-31',0),
(11,'1234', '2017-03-01','2017-03-31',0),
(9, '1234', '2017-02-01','2017-02-28',1),
(8, '1234', '2017-01-01','2017-01-31',0),
(7, '1234', '2016-11-01','2016-11-30',0),
(6, '1234', '2016-10-01','2016-10-31',1),
(2, '1234', '2016-09-01','2016-09-30',1);
SELECT *,dense_rank() OVER (PARTITION BY code,block ORDER BY date2 DESC)
FROM data
ORDER BY date2 DESC;
By the way, the database is PostgreSQL.
I hope there's a workaround... Thanks :)
Edit: Note that the blocks of 0's and 1's aren't of equal length.
There's no way to get this result using a single window function; you need two nested ones:
SELECT *,
Sum(flag) -- now sum the 0/1 to create the "rank"
Over (PARTITION BY code
ORDER BY date2 DESC)
FROM
(
SELECT *,
CASE
WHEN Lag(block) -- check if this is the 1st row of a new block
Over (PARTITION BY code
ORDER BY date2 DESC) = block
THEN 0
ELSE 1
END AS flag
FROM DATA
) AS dt
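
As an aside, the same gaps-and-islands idea carries over to pandas (a hedged sketch, assuming the table sits in a DataFrame df; with several codes you would do this inside a groupby('code')):
df = df.sort_values('date2', ascending=False)
# a new rank starts whenever block differs from the previous row's block
df['desired_rank'] = df['block'].ne(df['block'].shift()).cumsum()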

Subtract day column from date column in pandas data frame

I have two columns in my data frame. One column is a date (df["Start_date"]) and the other is a number of days. I want to subtract the days column (df["days"]) from the date column.
I was trying something like this:
df["new_date"]=df["Start_date"]-datetime.timedelta(days=df["days"])
I think you need to_timedelta: datetime.timedelta expects a scalar, while pd.to_timedelta converts the whole column element-wise:
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
Sample:
import numpy as np
import pandas as pd

np.random.seed(120)
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10)
df = pd.DataFrame({'Start_date': rng, 'days': np.random.choice(np.arange(10), size=10)})
print (df)
Start_date days
0 2015-02-24 7
1 2015-02-25 0
2 2015-02-26 8
3 2015-02-27 4
4 2015-02-28 1
5 2015-03-01 7
6 2015-03-02 1
7 2015-03-03 3
8 2015-03-04 8
9 2015-03-05 9
df["new_date"]=df["Start_date"]-pd.to_timedelta(df["days"], unit='D')
print (df)
Start_date days new_date
0 2015-02-24 7 2015-02-17
1 2015-02-25 0 2015-02-25
2 2015-02-26 8 2015-02-18
3 2015-02-27 4 2015-02-23
4 2015-02-28 1 2015-02-27
5 2015-03-01 7 2015-02-22
6 2015-03-02 1 2015-03-01
7 2015-03-03 3 2015-02-28
8 2015-03-04 8 2015-02-24
9 2015-03-05 9 2015-02-24
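
An equivalent sketch using a plain NumPy timedelta64 vector, in case you prefer that (same result, assuming days holds integers):
import numpy as np
df["new_date"] = df["Start_date"] - df["days"].values * np.timedelta64(1, "D")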

Update a Field/Column based on Current and Previous Record Value

I need assistance with updating a field/column "IsLatest" based on a comparison between the current and previous record. I'm using CTE syntax and I'm able to get the current and previous record, but I'm unable to update the "IsLatest" field/column, which needs to be set based on the "Value" column of the current and previous record.
Example
Current Output
Dates Customer Value IsLatest
2010-01-01 00:00:00.000 1 12 1
Dates Customer Value IsLatest
2010-01-01 00:00:00.000 1 12 0
2010-01-02 00:00:00.000 1 30 1
Dates Customer Value IsLatest
2010-01-01 00:00:00.000 1 12 0
2010-01-02 00:00:00.000 1 30 0
2010-01-03 00:00:00.000 1 13 1
Expected Final Output
Dates Customer Value ValueSetId IsLatest
2010-01-01 00:00:00.000 1 12 12 0
2010-01-01 00:00:00.000 1 12 13 0
2010-01-01 00:00:00.000 1 12 14 0
2010-01-02 00:00:00.000 1 30 12 0
2010-01-02 00:00:00.000 1 30 13 0
2010-01-02 00:00:00.000 1 30 14 0
2010-01-03 00:00:00.000 1 13 12 0
2010-01-03 00:00:00.000 1 13 13 0
2010-01-03 00:00:00.000 1 13 14 0
2010-01-04 00:00:00.000 1 14 12 0
2010-01-04 00:00:00.000 1 14 13 0
2010-01-04 00:00:00.000 1 14 14 1
;WITH a AS
(
    SELECT
        Dates, Customer, Value, ValueSetId,
        row_number() over (partition by customer order by Dates desc, ValueSetId desc) rn
    FROM #Customers
)
SELECT Dates, Customer, Value, ValueSetId,
       case when rn = 1 then 1 else 0 end IsLatest
FROM a
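
For what it's worth, the same row_number logic in pandas (a hypothetical sketch using the question's column names; the question itself is T-SQL):
import pandas as pd
# number rows per customer, newest (Dates, ValueSetId) first; row 0 is the latest
rn = df.sort_values(['Dates', 'ValueSetId'], ascending=False).groupby('Customer').cumcount()
df['IsLatest'] = (rn == 0).astype(int)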