Split dataset by every other day - pandas

I have a dataset that I collected over many days and is indexed by calendar day. Each day has a different number of entries in it. I want to see if the odd days (e.g. day 1, day 3, day 5, etc...) are correlated with the even days (e.g. day 2, day 4, day 6 etc...) and to do this, I have to split my dataset into two.
I can't use day % 2 because I have missing days and weekends in the set that throw it off. I have tried using resample like this:
df_odd = df.resample('2D')
lowest_date = df['date_minus_time'].min()
df_even = df.query('date_minus_time != #lowest_date).resample('2D')
But this insists on aggregating the data by day. I want to keep all the rows so I can perform further operations (e.g. groupby) on the resulting datasets.
How can I create two dataframes, one with all the rows with an "even" date and one with all the rows with an "odd" date with even and odd being relative to first day of my data set?
Here are some example data:
Date var
2018-12-10 1
2018-12-10 0
2018-12-10 1
2018-12-10 0
2018-12-11 1
2018-12-11 1
2018-12-12 0
2018-12-12 1
2018-12-12 1
2018-12-14 1
2018-12-14 0
2018-12-14 1
2018-12-16 1
2018-12-16 1
2018-12-16 1
And the expected output:
df_odd:
Date var
2018-12-10 1
2018-12-10 0
2018-12-10 1
2018-12-10 0
2018-12-12 0
2018-12-12 1
2018-12-12 1
2018-12-16 1
2018-12-16 1
2018-12-16 1
df_even:
Date var
2018-12-11 1
2018-12-11 1
2018-12-14 1
2018-12-14 0
2018-12-14 1

Use pd.Categorical with .codes
num = pd.Categorical(df.Date).codes + 1
df_odd = df[num%2 == 0]
df_even = df[num%2 == 1]
df_odd
Date var
0 2018-12-10 1
1 2018-12-10 0
2 2018-12-10 1
3 2018-12-10 0
6 2018-12-12 0
7 2018-12-12 1
8 2018-12-12 1
12 2018-12-16 1
13 2018-12-16 1
14 2018-12-16 1
df_even
Date var
4 2018-12-11 1
5 2018-12-11 1
9 2018-12-14 1
10 2018-12-14 0
11 2018-12-14 1

Related

Pandas cycling through a period of 28 days using timestamp (e.g.)

I am trying to work out how I could get a cycle of 28 days in my df. What I mean is that I have dataset spanning more than a couple of years with full dates. There doesn't seem to be a stfrtime that looks at a 28 day (lunar month).
How can I have the minimum date be 1 (say for 1st of January 2018) and 28 for (28th of January 2018)
The 29th of January again becomes 1 and 30th of January 2 and so on ...
How can I achieve this cycle?
Are you looking for something like that:
Input data (the worst case):
Date
Date
0 2018-07-12
1 2018-07-13
2 2018-07-14
3 2018-07-14 # <- duplicate day
4 2018-07-15
5 2018-07-30 # <- missing days
6 2018-07-31
7 2018-08-08 # <- end cycle day, missing days
8 2018-08-08 # <- duplicate day
9 2018-08-09 # <- new cycle day
10 2018-08-10
Use np.tile to repeat the period:
df['Day'] = (df['Date'].dt.day_of_year - df['Date'].min().day_of_year) % 28 + 1
Result output:
>>> df
Date Day
0 2018-07-12 1
1 2018-07-13 2
2 2018-07-14 3
3 2018-07-14 3
4 2018-07-15 4
5 2018-07-30 19
6 2018-07-31 20
7 2018-08-08 28
8 2018-08-08 28
9 2018-08-09 1
10 2018-08-10 2

Period and Quarter Sequence

I'm trying to find a way to do a sequence for date periods and quarters(not sure if this is the correct term).
Basically this will help people to navigate dates based on weeks, periods, and quarters once I join this to our sales data. For example, if I just want to know the sales from last week, I could just use WHERE WeekSequence = -1... Another example is, a manager wants to get the sales data for the past quarter, I could just use WHERE QuarterSequence = -1... something like that.
My current table:
WeekStartDate WeekEndDate CurrentWeek Period Quarter WeekSequence
----------------------------------------------------------------------
2020-08-03 2020-08-09 0 2 1 -5
2020-08-10 2020-08-16 0 2 1 -4
2020-08-17 2020-08-23 0 2 1 -3
2020-08-24 2020-08-30 0 2 1 -2
2020-08-31 2020-09-06 0 2 1 -1
2020-09-07 2020-09-13 1 3 1 0
2020-09-14 2020-09-20 0 3 1 1
2020-09-21 2020-09-27 0 3 1 2
2020-09-28 2020-10-04 0 3 1 3
2020-10-05 2020-10-11 0 4 2 4
2020-10-12 2020-10-18 0 4 2 5
What I want it to look like(highlighted):
If I understand correctly, just use window functions:
select t.*,
(period -
max(case when currentweek = 1 then period end) over ()
) as periodsequence,
(quarter -
max(case when currentweek = 1 then quarter end) over ()
) as quartersequence
from t;
You can include this in a view rather than putting it in a table.

Is there a way of group by month in Pandas starting at specific day number?

I'm trying to group by month some data in python, but i need the month to start at the 25 of each month, is there a way to do that in Pandas?
For weeks there is a way of starting on Monday, Tuesday, ... But for months it's always full month.
pd.Grouper(key='date', freq='M')
You could offset the dates by 24 days and groupby:
np.random.seed(1)
dates = pd.date_range('2019-01-01', '2019-04-30', freq='D')
df = pd.DataFrame({'date':dates,
'val': np.random.uniform(0,1,len(dates))})
# for groupby
s = df['date'].sub(pd.DateOffset(24))
(df.groupby([s.dt.year, s.dt.month], as_index=False)
.agg({'date':'min', 'val':'sum'})
)
gives
date val
0 2019-01-01 10.120368
1 2019-01-25 14.895363
2 2019-02-25 14.544506
3 2019-03-25 17.228734
4 2019-04-25 3.334160
Another example:
np.random.seed(1)
dates = pd.date_range('2019-01-20', '2019-01-30', freq='D')
df = pd.DataFrame({'date':dates,
'val': np.random.uniform(0,1,len(dates))})
s = df['date'].sub(pd.DateOffset(24))
df['groups'] = df.groupby([s.dt.year, s.dt.month]).cumcount()
gives
date val groups
0 2019-01-20 0.417022 0
1 2019-01-21 0.720324 1
2 2019-01-22 0.000114 2
3 2019-01-23 0.302333 3
4 2019-01-24 0.146756 4
5 2019-01-25 0.092339 0
6 2019-01-26 0.186260 1
7 2019-01-27 0.345561 2
8 2019-01-28 0.396767 3
9 2019-01-29 0.538817 4
10 2019-01-30 0.419195 5
And you can see the how the cumcount restarts at day 25.
I prepared the following test DataFrame:
Dat Val
0 2017-03-24 0
1 2017-03-25 0
2 2017-03-26 1
3 2017-03-27 0
4 2017-04-24 0
5 2017-04-25 0
6 2017-05-24 0
7 2017-05-25 2
8 2017-05-26 0
The first step is to compute a "shifted date" column:
df['Dat2'] = df.Dat + pd.DateOffset(days=-24)
The result is:
Dat Val Dat2
0 2017-03-24 0 2017-02-28
1 2017-03-25 0 2017-03-01
2 2017-03-26 1 2017-03-02
3 2017-03-27 0 2017-03-03
4 2017-04-24 0 2017-03-31
5 2017-04-25 0 2017-04-01
6 2017-05-24 0 2017-04-30
7 2017-05-25 2 2017-05-01
8 2017-05-26 0 2017-05-02
As you can see, March dates in Dat2 start just from original date 2017-03-25,
and so on.
The value of 1 is in March (Dat2) and the value of 2 is in May (also Dat2).
Then, to compute e.g. a sum by month, we can run:
df.groupby(pd.Grouper(key='Dat2', freq='MS')).sum()
getting:
Val
Dat2
2017-02-01 0
2017-03-01 1
2017-04-01 0
2017-05-01 2
So we have correct groupping:
1 is in March,
2 is in May.
The advantage over the other answer is that you have all dates on the first
day of a month, of course bearing in mind that e.g. 2017-03-01 in the
result means the period from 2017-03-25 to 2017-04-24 (including).

Get all dates for all date ranges in table using SQL Server

I have table dbo.WorkSchedules(Id, From, To) where I store date ranges for work schedules. I want to create a view that will have all dates for all rows of WorkSchedules. Thanks to this I have 1 view with all dates for all schedules.
On web I only found solutions for 1 row like 2 parameters start and end. My issue is different where I have multiple rows with start and end range.
Example:
WorkSchedules
Id | From | To
---+------------+-----------
1 | 2018-01-01 | 2018-01-05
2 | 2018-01-08 | 2018-01-12
Desired result
1 | 2018-01-01
2 | 2018-01-02
3 | 2018-01-03
4 | 2018-01-04
5 | 2018-01-05
6 | 2018-01-08
7 | 2018-01-09
8 | 2018-01-10
9 | 2018-01-11
10| 2018-01-12
If you are regularly dealing with "jobs" and "schedules" then I propose that you need a permanent calendar table (a table where each row is a unique date). You can create rows for dates dynamically but why do this many times when you can do it once and just re-use?
A calendar table, even of several decades, isn't "big" and when indexed they can be very fast as well. You can also store information about holidays and/or fiscal periods etc.
There are many scripts available to produce these tables, here's an answer with 2 scripts on this site: https://stackoverflow.com/a/5635628/2067753
Assuming you use the second (more comprehensive) script, then you can exclude weekends, or other conditions such as holidays, from query results.
Once you have a permanent Calendar table this style of query may be used:
CREATE TABLE WorkSchedules(
Id INTEGER NOT NULL PRIMARY KEY
,[From] DATE NOT NULL
,[To] DATE NOT NULL
);
INSERT INTO WorkSchedules(Id,[From],[To]) VALUES (1,'2018-01-01','2018-01-05');
INSERT INTO WorkSchedules(Id,[From],[To]) VALUES (2,'2018-01-12','2018-01-12');
with range as (
select min(ws.[From]) as dt_from, max(ws.[To]) dt_to
from WorkSchedules as ws
)
select c.*
from calendar as c
inner join range on c.date between range.dt_from and range.dt_to
where c.KindOfDay = 'BANKDAY'
order by c.date
and the result looks like this (note: "News Years Day" has been excluded)
Date Year Quarter Month Week Day DayOfYear Weekday Fiscal_Year Fiscal_Quarter Fiscal_Month KindOfDay Description
---- --------------------- ------ --------- ------- ------ ----- ----------- --------- ------------- ---------------- -------------- ----------- -------------
1 02.01.2018 00:00:00 2018 1 1 1 2 2 2 2018 1 1 BANKDAY NULL
2 03.01.2018 00:00:00 2018 1 1 1 3 3 3 2018 1 1 BANKDAY NULL
3 04.01.2018 00:00:00 2018 1 1 1 4 4 4 2018 1 1 BANKDAY NULL
4 05.01.2018 00:00:00 2018 1 1 1 5 5 5 2018 1 1 BANKDAY NULL
5 08.01.2018 00:00:00 2018 1 1 2 8 8 1 2018 1 1 BANKDAY NULL
6 09.01.2018 00:00:00 2018 1 1 2 9 9 2 2018 1 1 BANKDAY NULL
7 10.01.2018 00:00:00 2018 1 1 2 10 10 3 2018 1 1 BANKDAY NULL
8 11.01.2018 00:00:00 2018 1 1 2 11 11 4 2018 1 1 BANKDAY NULL
9 12.01.2018 00:00:00 2018 1 1 2 12 12 5 2018 1 1 BANKDAY NULL
Without the where clause the full range is:
Date Year Quarter Month Week Day DayOfYear Weekday Fiscal_Year Fiscal_Quarter Fiscal_Month KindOfDay Description
---- --------------------- ------ --------- ------- ------ ----- ----------- --------- ------------- ---------------- -------------- ----------- ----------------
1 01.01.2018 00:00:00 2018 1 1 1 1 1 1 2018 1 1 HOLIDAY New Year's Day
2 02.01.2018 00:00:00 2018 1 1 1 2 2 2 2018 1 1 BANKDAY NULL
3 03.01.2018 00:00:00 2018 1 1 1 3 3 3 2018 1 1 BANKDAY NULL
4 04.01.2018 00:00:00 2018 1 1 1 4 4 4 2018 1 1 BANKDAY NULL
5 05.01.2018 00:00:00 2018 1 1 1 5 5 5 2018 1 1 BANKDAY NULL
6 06.01.2018 00:00:00 2018 1 1 1 6 6 6 2018 1 1 SATURDAY NULL
7 07.01.2018 00:00:00 2018 1 1 1 7 7 7 2018 1 1 SUNDAY NULL
8 08.01.2018 00:00:00 2018 1 1 2 8 8 1 2018 1 1 BANKDAY NULL
9 09.01.2018 00:00:00 2018 1 1 2 9 9 2 2018 1 1 BANKDAY NULL
10 10.01.2018 00:00:00 2018 1 1 2 10 10 3 2018 1 1 BANKDAY NULL
11 11.01.2018 00:00:00 2018 1 1 2 11 11 4 2018 1 1 BANKDAY NULL
12 12.01.2018 00:00:00 2018 1 1 2 12 12 5 2018 1 1 BANKDAY NULL
and weekends and holidays may be excluded using the column KindOfDay
See this as a demonstration (with build of calendar table) here: http://rextester.com/CTSW63441
Ok, I worked this out for you, thinking you mean that you meant 01/08/2018 as a From date in the second row.
/*WorkSchedules
Id| From | To
1 | 2018-01-01 | 2018-01-05
2 | 2018-01-08 | 2018-01-12
*/
--DROP TABLE #WorkSchedules;
CREATE TABLE #WorkSchedules (
ID int,
[DateFrom] DATE,
[DateTo] DATE
)
INSERT INTO #WorkSchedules
SELECT 1, '2018-01-01', '2018-01-05'
UNION
SELECT 2, '2018-01-08', '2018-01-12'
;WITH CTEDATELIMITS AS (
SELECT [DateFrom], [DateTo]
FROM #WorkSchedules
)
,CTEDATES AS
(
SELECT [DateFrom] as [DateResult] FROM CTEDATELIMITS
UNION ALL
SELECT DATEADD(Day, 1, [DateResult]) FROM CTEDATES
JOIN CTEDATELIMITS ON CTEDATES.[DateResult] >= CTEDATELIMITS.[DateFrom]
AND CTEDATES.dateResult < CTEDATELIMITS.[DateTo]
)
SELECT [DateResult] FROM CTEDATES
ORDER BY [DateResult]
You would use a recursive CTE:
with dates as (
select from, to, from as date
from WorkSchedules
union all
select from, to, dateadd(day, 1, date)
from dates
where date < to
)
select row_number() over (order by date), date
from dates;
Note that from and to are reserved words in SQL. They are lousy names for identifiers. I have not escaped them because I assume they are not the actual names of the columns.

Pandas time difference calculation error

I have two time columns in my dataframe: called date1 and date2.
As far as I always assumed, both are in date_time format. However, I now have to calculate the difference in days between the two and it doesn't work.
I run the following code to analyse the data:
df['month1'] = pd.DatetimeIndex(df['date1']).month
df['month2'] = pd.DatetimeIndex(df['date2']).month
print(df[["date1", "date2", "month1", "month2"]].head(10))
print(df["date1"].dtype)
print(df["date2"].dtype)
The output is:
date1 date2 month1 month2
0 2016-02-29 2017-01-01 1 1
1 2016-11-08 2017-01-01 1 1
2 2017-11-27 2009-06-01 1 6
3 2015-03-09 2014-07-01 1 7
4 2015-06-02 2014-07-01 1 7
5 2015-09-18 2017-01-01 1 1
6 2017-09-06 2017-07-01 1 7
7 2017-04-15 2009-06-01 1 6
8 2017-08-14 2014-07-01 1 7
9 2017-12-06 2014-07-01 1 7
datetime64[ns]
object
As you can see, the month for date1 is not calculated correctly!
The final operation, which does not work is:
df["date_diff"] = (df["date1"]-df["date2"]).astype('timedelta64[D]')
which leads to the following error:
incompatible type [object] for a datetime/timedelta operation
I first thought it might be due to date2, so I tried:
df["date2_new"] = pd.to_datetime(df['date2'] - 315619200, unit = 's')
leading to:
unsupported operand type(s) for -: 'str' and 'int'
Anyone has an idea what I need to change?
Use .dt accessor with days attribute:
df[['date1','date2']] = df[['date1','date2']].apply(pd.to_datetime)
df['date_diff'] = (df['date1'] - df['date2']).dt.days
Output:
date1 date2 month1 month2 date_diff
0 2016-02-29 2017-01-01 1 1 -307
1 2016-11-08 2017-01-01 1 1 -54
2 2017-11-27 2009-06-01 1 6 3101
3 2015-03-09 2014-07-01 1 7 251
4 2015-06-02 2014-07-01 1 7 336
5 2015-09-18 2017-01-01 1 1 -471
6 2017-09-06 2017-07-01 1 7 67
7 2017-04-15 2009-06-01 1 6 2875
8 2017-08-14 2014-07-01 1 7 1140
9 2017-12-06 2014-07-01 1 7 1254