Pandas Sort Two Columns with Day of Year Wrap-Around to New Year - pandas

Around the first of each year, my day_of_year sequence wraps around, so the "year" column has to change to the new year when day_of_year == 1. I have not been able to figure out this trick and am not sure how to start, so any help is much appreciated. My data looks like this:
Here is my df1 =
day_of_year year var_1
364 2017 17.71666667
364 2018 5.166666667
364 2019 2
364 2020 1.595833333
364 2021 3.75
364 2022 6.8875
365 2017 14.83333333
365 2018 2.758333333
365 2019 4.108333333
365 2020 5.766666667
365 2021 5.291666667
365 2022 10.58636364
1 2017 2.0125
1 2018 14.0125
1 2019 -0.504166667
1 2020 7.666666667
1 2021 5.520833333
1 2022 1.229166667
2 2017 1.7625
2 2018 15.10416667
2 2019 -0.391666667
2 2020 9.5
2 2021 7.645833333
2 2022 0.9125
After re-formatting, I need it to look like the sorted df below, with "n/a" for any expected data that is missing in a given year. Thank you again.
final df:
day_of_year year var_1
364 2017 17.71666667
365 2017 14.83333333
1 2018 14.0125
2 2018 15.10416667
364 2018 5.166666667
365 2018 2.758333333
1 2019 -0.504166667
2 2019 -0.391666667
364 2019 2
365 2019 4.108333333
1 2020 7.666666667
2 2020 9.5
364 2020 1.595833333
365 2020 5.766666667
1 2021 5.520833333
2 2021 7.645833333
364 2021 3.75
365 2021 5.291666667
1 2022 1.229166667
2 2022 0.9125
364 2022 6.8875
365 2022 10.58636364
n/a n/a n/a
n/a n/a n/a

Why would you change the year based on the day? Just sort by the two columns:
df.sort_values(by=['year', 'day_of_year'])
Output:
day_of_year year var_1
12 1 2017 2.012500
18 2 2017 1.762500
0 364 2017 17.716667
6 365 2017 14.833333
13 1 2018 14.012500
19 2 2018 15.104167
1 364 2018 5.166667
7 365 2018 2.758333
14 1 2019 -0.504167
20 2 2019 -0.391667
2 364 2019 2.000000
8 365 2019 4.108333
15 1 2020 7.666667
21 2 2020 9.500000
3 364 2020 1.595833
9 365 2020 5.766667
16 1 2021 5.520833
22 2 2021 7.645833
4 364 2021 3.750000
10 365 2021 5.291667
17 1 2022 1.229167
23 2 2022 0.912500
5 364 2022 6.887500
11 365 2022 10.586364
If for some reason you really need to fix the year, use a conditional with mask:
(df.assign(year=df['year'].mask(df['day_of_year'].le(2), df['year'].add(1)))
.sort_values(by=['year', 'day_of_year'])
)
Or, if you want to update the years after a change from 365 to a lower day:
(df.assign(year=df['year'].add(df['day_of_year'].diff().lt(0).cumsum()))
.sort_values(by=['year', 'day_of_year'])
)
Output:
day_of_year year var_1
0 364 2017 17.716667
6 365 2017 14.833333
12 1 2018 2.012500
18 2 2018 1.762500
1 364 2018 5.166667
7 365 2018 2.758333
13 1 2019 14.012500
19 2 2019 15.104167
2 364 2019 2.000000
8 365 2019 4.108333
14 1 2020 -0.504167
20 2 2020 -0.391667
3 364 2020 1.595833
9 365 2020 5.766667
15 1 2021 7.666667
21 2 2021 9.500000
4 364 2021 3.750000
10 365 2021 5.291667
16 1 2022 5.520833
22 2 2022 7.645833
5 364 2022 6.887500
11 365 2022 10.586364
17 1 2023 1.229167
23 2 2023 0.912500
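The diff/cumsum variant above can be tried on a small frame. This is only a sketch: the sample values are hypothetical (two years instead of six) and it assumes the rows arrive in the same layout as the question, grouped by day and then by year.

```python
import pandas as pd

# Hypothetical sample in the question's layout: grouped by day, then by year.
df = pd.DataFrame({
    "day_of_year": [364, 364, 365, 365, 1, 1, 2, 2],
    "year": [2017, 2018, 2017, 2018, 2017, 2018, 2017, 2018],
})

# A negative difference between consecutive days marks a wrap from 365 back to 1.
wrapped = df["day_of_year"].diff().lt(0)

# The running count of wraps is the number of years to add to each row.
shifted = (
    df.assign(year=df["year"] + wrapped.cumsum())
      .sort_values(["year", "day_of_year"])
      .reset_index(drop=True)
)
print(shifted)
```

Because the shift depends on row order, the result changes if the frame is sorted differently before the diff is taken.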

I would convert everything to datetime first:
df['ymd'] = pd.to_datetime(df['day_of_year'].astype(str) + '-' + df['year'].astype(str),
format='%j-%Y')
Assigning the result to a column ymd and sorting on it yields the following:
>>> df.sort_values('ymd')
day_of_year year var_1 ymd
12 1 2017 2.012500 2017-01-01
18 2 2017 1.762500 2017-01-02
0 364 2017 17.716667 2017-12-30
6 365 2017 14.833333 2017-12-31
13 1 2018 14.012500 2018-01-01
19 2 2018 15.104167 2018-01-02
1 364 2018 5.166667 2018-12-30
7 365 2018 2.758333 2018-12-31
14 1 2019 -0.504167 2019-01-01
20 2 2019 -0.391667 2019-01-02
2 364 2019 2.000000 2019-12-30
8 365 2019 4.108333 2019-12-31
15 1 2020 7.666667 2020-01-01
21 2 2020 9.500000 2020-01-02
3 364 2020 1.595833 2020-12-29
9 365 2020 5.766667 2020-12-30
16 1 2021 5.520833 2021-01-01
22 2 2021 7.645833 2021-01-02
4 364 2021 3.750000 2021-12-30
10 365 2021 5.291667 2021-12-31
17 1 2022 1.229167 2022-01-01
23 2 2022 0.912500 2022-01-02
5 364 2022 6.887500 2022-12-30
11 365 2022 10.586364 2022-12-31
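An alternative way to build the same kind of date column, shown here as a sketch on hypothetical rows, is to start from January 1st of each year and add day_of_year - 1 days. It avoids string formatting and makes the leap-year effect visible (day 364 of 2020 is December 29th).

```python
import pandas as pd

# Hypothetical, deliberately unsorted rows.
df = pd.DataFrame({
    "day_of_year": [2, 364, 1, 365],
    "year": [2021, 2020, 2020, 2019],
})

# January 1st of each year plus (day_of_year - 1) days gives the calendar date.
df["ymd"] = (
    pd.to_datetime(df["year"].astype(str), format="%Y")
    + pd.to_timedelta(df["day_of_year"] - 1, unit="D")
)

result = df.sort_values("ymd").reset_index(drop=True)
print(result)
```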

Related

Get the last 4 weeks prior to current week of and the same 4 weeks of last year

I have a list of date, fiscal week, and fiscal year:
DATE_VALUE FISCAL_WEEK FISCAL_YEAR_VALUE
14-Dec-20 51 2020
15-Dec-20 51 2020
16-Dec-20 51 2020
17-Dec-20 51 2020
18-Dec-20 51 2020
19-Dec-20 51 2020
20-Dec-20 51 2020
21-Dec-20 52 2020
22-Dec-20 52 2020
23-Dec-20 52 2020
24-Dec-20 52 2020
25-Dec-20 52 2020
26-Dec-20 52 2020
27-Dec-20 52 2020
28-Dec-20 1 2021
29-Dec-20 1 2021
30-Dec-20 1 2021
31-Dec-20 1 2021
1-Jan-21 1 2021
2-Jan-21 1 2021
3-Jan-21 1 2021
4-Jan-21 2 2021
5-Jan-21 2 2021
6-Jan-21 2 2021
7-Jan-21 2 2021
8-Jan-21 2 2021
9-Jan-21 2 2021
10-Jan-21 2 2021
11-Jan-21 3 2021
12-Jan-21 3 2021
13-Jan-21 3 2021
14-Jan-21 3 2021
15-Jan-21 3 2021
16-Jan-21 3 2021
17-Jan-21 3 2021
18-Jan-21 4 2021
19-Jan-21 4 2021
20-Jan-21 4 2021
21-Jan-21 4 2021
22-Jan-21 4 2021
23-Jan-21 4 2021
24-Jan-21 4 2021
20-Dec-21 52 2021
21-Dec-21 52 2021
22-Dec-21 52 2021
23-Dec-21 52 2021
24-Dec-21 52 2021
25-Dec-21 52 2021
26-Dec-21 52 2021
27-Dec-21 53 2021
28-Dec-21 53 2021
29-Dec-21 53 2021
30-Dec-21 53 2021
31-Dec-21 53 2021
1-Jan-22 53 2021
2-Jan-22 53 2021
3-Jan-22 1 2022
4-Jan-22 1 2022
5-Jan-22 1 2022
6-Jan-22 1 2022
7-Jan-22 1 2022
8-Jan-22 1 2022
9-Jan-22 1 2022
10-Jan-22 2 2022
11-Jan-22 2 2022
12-Jan-22 2 2022
13-Jan-22 2 2022
14-Jan-22 2 2022
15-Jan-22 2 2022
16-Jan-22 2 2022
17-Jan-22 3 2022
18-Jan-22 3 2022
19-Jan-22 3 2022
20-Jan-22 3 2022
21-Jan-22 3 2022
22-Jan-22 3 2022
23-Jan-22 3 2022
24-Jan-22 4 2022
25-Jan-22 4 2022
26-Jan-22 4 2022
27-Jan-22 4 2022
28-Jan-22 4 2022
29-Jan-22 4 2022
30-Jan-22 4 2022
I want to pull the last 4 weeks prior to the current week AND the same 4 weeks of the year before. Please see example 1. This works fine when all 4 weeks are within the same year. But when it comes to the beginning of a year when 1 or more weeks are in the current year but the other are in the previous year, I am not able to get the desired output below:
FISCAL_YEAR_VALUE FISCAL_WEEK
2020 51
2020 52
2021 2
2021 1
2021 52
2021 53
2022 1
2022 2
The code I have is below. I am using the date of 21-JAN-22 as an example:
SELECT
FISCAL_YEAR_VALUE,
FISCAL_WEEK
FROM TABLE_NAME
WHERE FISCAL_YEAR_VALUE IN (SELECT *
FROM (WITH T AS (
SELECT DISTINCT FISCAL_YEAR_VALUE
FROM TABLE_NAME
WHERE TRUNC(DATE_VALUE) <= TRUNC(TO_DATE('21-JAN-22'))--TEST DATE
ORDER BY FISCAL_YEAR_VALUE DESC
FETCH NEXT 2 ROWS ONLY
)
SELECT FISCAL_YEAR_VALUE
FROM T ORDER BY FISCAL_YEAR_VALUE
)
)
AND FISCAL_WEEK IN (SELECT *
FROM (WITH T AS (
SELECT DISTINCT FISCAL_WEEK, FISCAL_YEAR_VALUE
FROM TABLE_NAME
WHERE TRUNC(DATE_VALUE) <= TRUNC(TO_DATE('21-JAN-22'))--TEST DATE
ORDER BY FISCAL_YEAR_VALUE DESC, FISCAL_WEEK DESC
OFFSET 1 ROWS
FETCH NEXT 4 ROWS ONLY
)
SELECT FISCAL_WEEK
FROM T ORDER BY FISCAL_YEAR_VALUE, FISCAL_WEEK
)
)
GROUP BY FISCAL_YEAR_VALUE, FISCAL_WEEK
ORDER BY FISCAL_YEAR_VALUE, FISCAL_WEEK
Output of the code is:
FISCAL_YEAR_VALUE FISCAL_WEEK
2021 2
2021 1
2021 52
2021 53
2022 1
2022 2
As you can see, the last 2 weeks of year 2020 are not included. Please see example 2. How can I also include this exception in the code to make it dynamic? Any help would be greatly appreciated!
To find the values this year, you can use:
SELECT DISTINCT fiscal_year_value, fiscal_week
FROM table_name
WHERE date_value < TRUNC(SYSDATE, 'IW')
AND date_value >= TRUNC(SYSDATE, 'IW') - INTERVAL '28' DAY
To find the values from the previous year, take the latest fiscal week found for this year and subtract 1 from its year. The last date_value of that week gives the upper bound for the last fiscal year, and a similar 28-day range can then be applied to it:
WITH this_year (fiscal_year_value, fiscal_week) AS (
SELECT fiscal_year_value, fiscal_week
FROM table_name
WHERE date_value < TRUNC(SYSDATE, 'IW')
AND date_value >= TRUNC(SYSDATE, 'IW') - INTERVAL '28' DAY
),
max_last_year (max_date_value) AS (
SELECT MAX(date_value) + INTERVAL '1' DAY
FROM table_name
WHERE (fiscal_year_value, fiscal_week) IN (
SELECT fiscal_year_value - 1, fiscal_week
FROM this_year
ORDER BY fiscal_year_value DESC, fiscal_week DESC
FETCH FIRST ROW ONLY
)
)
SELECT fiscal_year_value, fiscal_week
FROM this_year
UNION
SELECT t.fiscal_year_value, t.fiscal_week
FROM table_name t
INNER JOIN max_last_year m
ON ( t.date_value < m.max_date_value
AND t.date_value >= m.max_date_value - INTERVAL '28' DAY);
Which, for the sample data:
Create Table table_name(DATE_VALUE DATE, FISCAL_WEEK INT, FISCAL_YEAR_VALUE INT);
INSERT INTO table_name (date_value, fiscal_week, fiscal_year_value)
SELECT DATE '2019-12-30' + LEVEL - 1, CEIL(LEVEL/7), 2020
FROM DUAL
CONNECT BY LEVEL <= 7 * 52
UNION ALL
SELECT DATE '2020-12-28' + LEVEL - 1, CEIL(LEVEL/7), 2021
FROM DUAL
CONNECT BY LEVEL <= 7 * 53
UNION ALL
SELECT DATE '2022-01-03' + LEVEL - 1, CEIL(LEVEL/7), 2022
FROM DUAL
CONNECT BY LEVEL <= 7 * 52;
Outputs:
FISCAL_YEAR_VALUE FISCAL_WEEK
2022 38
2022 39
2022 40
2022 41
2021 38
2021 39
2021 40
2021 41
And if today's date was 2022-01-01, would output:
FISCAL_YEAR_VALUE FISCAL_WEEK
2021 52
2021 53
2022 1
2022 2
2020 51
2020 52
2021 1
2021 2
There may be a simpler method but without any knowledge of how you calculate a fiscal year that is not immediately possible.
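For readers working in pandas rather than SQL, the same idea can be sketched by position in the ordered list of fiscal weeks. This is not a translation of the query above: the helper names fiscal_calendar and weeks_to_compare are hypothetical, the calendar is an assumed stand-in for the sample data (52, 53 and 52 weeks), and the lookup fails if the previous fiscal year has no week with the same number.

```python
import pandas as pd

def fiscal_calendar():
    # Hypothetical calendar mirroring the sample data: one row per day.
    parts = []
    for start, weeks, fy in [("2019-12-30", 52, 2020),
                             ("2020-12-28", 53, 2021),
                             ("2022-01-03", 52, 2022)]:
        dates = pd.date_range(start, periods=7 * weeks, freq="D")
        parts.append(pd.DataFrame({
            "date_value": dates,
            "fiscal_week": [i // 7 + 1 for i in range(len(dates))],
            "fiscal_year_value": fy,
        }))
    return pd.concat(parts, ignore_index=True)

def weeks_to_compare(cal, today, n=4):
    # Ordered list of distinct (fiscal year, fiscal week) pairs.
    weeks = (cal[["fiscal_year_value", "fiscal_week"]]
             .drop_duplicates()
             .sort_values(["fiscal_year_value", "fiscal_week"]))
    keys = list(weeks.itertuples(index=False, name=None))

    # Position of the week that contains the reference date.
    row = cal.loc[cal["date_value"] == pd.Timestamp(today)].iloc[0]
    cur = keys.index((int(row["fiscal_year_value"]), int(row["fiscal_week"])))

    # The n weeks before the current week.
    this_year = keys[cur - n:cur]

    # Same week number one fiscal year earlier, and the n weeks ending there.
    last_fy, last_fw = this_year[-1]
    end = keys.index((last_fy - 1, last_fw))
    last_year = keys[end - n + 1:end + 1]
    return last_year + this_year

result = weeks_to_compare(fiscal_calendar(), "2022-01-21")
print(result)
```

With the reference date 21-JAN-22 used in the question, the list contains the eight (year, week) pairs of the desired output, ordered chronologically.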

Assigning a day, week, and year column in Pandas in one line

I usually have to extract days, weeks and years into separate columns like this:
data['Day'] = data.SALESDATE.dt.isocalendar().day
data['Week'] = data.SALESDATE.dt.isocalendar().week
data['Year'] = data.SALESDATE.dt.isocalendar().year
But is there a way where I can assign all three in one nice line?
data[['Day', 'Week', 'Year']] = ....
For a one-line solution, use DataFrame.join, renaming the columns if necessary:
rng = pd.date_range('2017-04-03', periods=10)
data = pd.DataFrame({'SALESDATE': rng, 'a': range(10)})
data = data.join(data.SALESDATE.dt.isocalendar().rename(columns=lambda x: x.title()))
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
Or change order of list and assign:
data[['Year', 'Week', 'Day']] = data.SALESDATE.dt.isocalendar()
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
If you need the values in a different order, select the columns in that order:
data[['Day', 'Week', 'Year']] = data.SALESDATE.dt.isocalendar()[['day','week','year']]
print (data)
SALESDATE a Day Week Year
0 2017-04-03 0 1 14 2017
1 2017-04-04 1 2 14 2017
2 2017-04-05 2 3 14 2017
3 2017-04-06 3 4 14 2017
4 2017-04-07 4 5 14 2017
5 2017-04-08 5 6 14 2017
6 2017-04-09 6 7 14 2017
7 2017-04-10 7 1 15 2017
8 2017-04-11 8 2 15 2017
9 2017-04-12 9 3 15 2017
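One thing to keep in mind with isocalendar() is that the ISO year can differ from the calendar year around New Year. The sketch below uses hypothetical dates to show the join approach on such a boundary.

```python
import pandas as pd

# Hypothetical dates around a year boundary.
data = pd.DataFrame({
    "SALESDATE": pd.to_datetime(["2020-12-31", "2021-01-01", "2021-01-04"]),
})

# isocalendar() returns the columns year, week and day; title-case them and join.
iso = data["SALESDATE"].dt.isocalendar().rename(columns=str.title)
data = data.join(iso)
print(data)
```

Here 2021-01-01 is reported as ISO year 2020, week 53, so grouping on the Year column does not group by calendar year.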

Segregating data based on last 3 months and this time last year

I need to split my data into two different subsets:
(1) last three months, includes December as current month minus three
(2) current month (December 2019) and current month values from the year before
pDate Name Date Year Month
11/17/2019 12:18 A 2019/11 2019 11
12/23/2018 11:52 B 2018/12 2018 12
12/1/2019 11:42 C 2019/12 2019 12
12/10/2018 14:31 D 2018/12 2018 12
12/14/2018 12:42 E 2018/12 2018 12
10/15/2019 15:19 F 2019/10 2019 10
10/23/2019 10:50 G 2019/10 2019 10
12/2/2018 15:14 H 2018/12 2018 12
I was able to group them based upon their last 3 months values, relatively quick as:
df1 = df.sort_values(by="pDate",ascending=True).set_index("pDate").last("3M")
How do I get a dataframe that contains only December 2019 (the current month) and December 2018?
The idea is to create monthly periods with Series.dt.to_period. You can then subtract from the current period to get past periods and filter with Series.between and boolean indexing:
# changed sample datetimes
df['pDate'] = pd.to_datetime(df['pDate'])
df = df.sort_values(by="pDate")
print (df)
pDate Name Date Year Month
7 2018-12-02 15:14:00 H 2018/12 2018 12
4 2018-12-14 12:42:00 E 2018/12 2018 12
3 2019-10-10 14:31:00 D 2018/12 2018 12
5 2019-10-15 15:19:00 F 2019/10 2019 10
6 2019-10-23 10:50:00 G 2019/10 2019 10
2 2019-11-01 11:42:00 C 2019/12 2019 12
1 2019-12-23 11:52:00 B 2018/12 2018 12
0 2020-01-17 12:18:00 A 2019/11 2019 11
nowp = pd.to_datetime('now').to_period('m')
print (nowp)
2020-01
df['per'] = df['pDate'].dt.to_period('m')
df = df[df['per'].between(nowp-4, nowp-1) | df['per'].eq(nowp-13)]
print (df)
pDate Name Date Year Month per
7 2018-12-02 15:14:00 H 2018/12 2018 12 2018-12
4 2018-12-14 12:42:00 E 2018/12 2018 12 2018-12
3 2019-10-10 14:31:00 D 2018/12 2018 12 2019-10
5 2019-10-15 15:19:00 F 2019/10 2019 10 2019-10
6 2019-10-23 10:50:00 G 2019/10 2019 10 2019-10
2 2019-11-01 11:42:00 C 2019/12 2019 12 2019-11
1 2019-12-23 11:52:00 B 2018/12 2018 12 2019-12
Detail:
print (nowp)
2020-01
print (nowp-1)
2019-12
print (nowp-13)
2018-12
print (nowp-4)
2019-09
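The period arithmetic can be checked with a fixed "current" month instead of 'now', so the result does not depend on when the code runs. This is a sketch on hypothetical rows that reuses the same bounds as the filter above.

```python
import pandas as pd

# Hypothetical rows, one per month of interest.
df = pd.DataFrame({
    "pDate": pd.to_datetime([
        "2018-12-02", "2019-08-20", "2019-09-30", "2019-10-15",
        "2019-11-01", "2019-12-23", "2020-01-17",
    ]),
    "Name": ["H", "Z", "X", "F", "C", "B", "A"],
})

# Fixed reference month (an assumption made for reproducibility).
nowp = pd.Period("2020-01", freq="M")

# Monthly period of each row.
df["per"] = df["pDate"].dt.to_period("M")

# Same bounds as above: nowp-4 .. nowp-1, plus the month 13 periods back.
recent = df["per"].between(nowp - 4, nowp - 1)
year_before = df["per"].eq(nowp - 13)
selected = df[recent | year_before]
print(selected)
```

Subtracting an integer from a Period moves it by that many months, so nowp - 13 is December 2018 when nowp is January 2020.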

Why am I seeing multiple months in the results when I am joining with dim_date

Here is my simple Postgresql Query
SELECT dd.year_actual as yr, sum("Ordered_Amount") from channel_sales cs
JOIN dim_date dd ON cs."date" = dd.date_actual
GROUP BY
dd.year_actual,
cs."Ordered_Amount"
Here is the result below. What I was expecting was a single line result with the year and total amount, but instead it is breaking it down into multiple rows of 2018. I am not sure what I am doing wrong here.
2018 2226
2018 357
2018 616
2018 1074
2018 1422
2018 3080
2018 2106
2018 924
2018 176
2018 580
2018 1587
2018 14350
2018 306
2018 2516
2018 1482
2018 2880
2018 8400
2018 5200
2018 16758
2018 781
2018 135
2018 4056
2018 150
2018 500
2018 2338
2018 3850
2018 1432
2018 1396
2018 1230
2018 274
2018 1494
2018 1068
2018 878
2018 1441
2018 1832
2018 3042
2018 4180
2018 2327
2018 206
2018 426
2018 2090
2018 1003
2018 62499
2018 900
2018 2274
2018 399
2018 1980
2018 278
2018 736
2018 24070
2018 561
2018 648
2018 1256
2018 120
2018 21912
2018 1639
2018 4452
2018 1008
2018 96577
2018 3240
2018 1386
2018 388
2018 260
2018 1080
2018 5525
2018 2672
2018 24674
2018 4392
2018 948
2018 801
2018 658
2018 1908
2018 692
2018 498
2018 630
2018 8999
2018 4056
2018 2990
2018 1745
2018 1280
2018 126
2018 988
2018 422
2018 936
Is it how I am making the join, or am I using the GROUP BY clause in the wrong way? I cannot figure it out for the life of me.
Because you are not grouping by year only. You are also grouping by Ordered_Amount, the same column you sum(). Thus you are effectively summing per year and per distinct Ordered_Amount value. If, say, 2018 has 4 rows with an Ordered_Amount of 100, they would show as 2018, 400, and the same happens for every other distinct amount, i.e.:
2018,100
2018,100
2018,100
2018,100
2018,200
2018,300
2018,300
would be:
2018,400
2018,200
2018,600
Write it as:
SELECT dd.year_actual as yr, sum("Ordered_Amount")
from channel_sales cs
JOIN dim_date dd ON cs."date" = dd.date_actual
GROUP BY
dd.year_actual
Also note that if this is not a 1-to-many or 1-to-1 relation, then sum results would be wrong. To prevent that, you may first do the sum and then join. Depending on table structures and which data is coming from where, a join may not even be needed.

Shrinking multiple rows to one row

I want to shrink multiple rows in a data frame to one row.
for example, if I have a dataframe like this,
name year project_name month week worklogs
Ahkam 2019 Proj1 1 1 10
Ahkam 2019 proj2 1 1 14
Ahkam 2019 proj3 1 2 6
Ahkam 2019 proj4 1 2 14
Naser 2019 Proj1 1 1 7
Naser 2019 proj2 1 1 8
Naser 2019 proj3 1 2 5
Naser 2019 proj4 1 2 3
and my output dataframe should be:
name year project_name month week worklogs
Ahkam 2019 NaN 1 1 24
Ahkam 2019 NaN 1 2 20
Naser 2019 NaN 1 1 15
Naser 2019 NaN 1 2 8
The project_name column can hold anything. The worklogs must be summed per group of (name, year, month, week).
Thanks in advance.
Use DataFrameGroupBy.agg:
df = (df.groupby(['name', 'year', 'month', 'week'], as_index=False)
.agg({'project_name':'first', 'worklogs':'sum'}))
print(df)
name year month week project_name worklogs
0 Ahkam 2019 1 1 Proj1 24
1 Ahkam 2019 1 2 proj3 20
2 Naser 2019 1 1 Proj1 15
3 Naser 2019 1 2 proj3 8
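If project_name should come out as NaN, as in the desired output, one option is to sum only worklogs and add an empty project_name column afterwards. This is a sketch of that variant on the question's sample data, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ahkam"] * 4 + ["Naser"] * 4,
    "year": [2019] * 8,
    "project_name": ["Proj1", "proj2", "proj3", "proj4"] * 2,
    "month": [1] * 8,
    "week": [1, 1, 2, 2, 1, 1, 2, 2],
    "worklogs": [10, 14, 6, 14, 7, 8, 5, 3],
})

# Sum worklogs per (name, year, month, week); project_name is dropped here.
out = df.groupby(["name", "year", "month", "week"], as_index=False)["worklogs"].sum()

# Add project_name back as NaN and restore the original column order.
out["project_name"] = np.nan
out = out[["name", "year", "project_name", "month", "week", "worklogs"]]
print(out)
```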