Shrinking multiple rows to one row - pandas

I want to shrink multiple rows in a data frame to one row.
for example, if I have a dataframe like this,
name year project_name month week worklogs
Ahkam 2019 Proj1 1 1 10
Ahkam 2019 proj2 1 1 14
Ahkam 2019 proj3 1 2 6
Ahkam 2019 proj4 1 2 14
Naser 2019 Proj1 1 1 7
Naser 2019 proj2 1 1 8
Naser 2019 proj3 1 2 5
Naser 2019 proj4 1 2 3
and my output dataframe should be:
name year project_name month week worklogs
Ahkam 2019 NaN 1 1 24
Ahkam 2019 NaN 1 2 20
Naser 2019 NaN 1 1 15
Naser 2019 NaN 1 2 8
The project_name column may be whatever it can be. The worklogs must be added according to grouped columns(name,year,month,week)
Thanks in advance.

Use DataFrameGroupBy.agg:
df = (df.groupby(['name', 'year', 'month', 'week'], as_index=False)
.agg({'project_name':'first', 'worklogs':'sum'}))
print(df)
name year month week project_name worklogs
0 Ahkam 2019 1 1 Proj1 24
1 Ahkam 2019 1 2 proj3 20
2 Naser 2019 1 1 Proj1 15
3 Naser 2019 1 2 proj3 8

Related

Pandas Sort Two Columns with Day of Year Wrap-Around to New Year

I have data that may at certain times of the year around the first of each year, that a day_of_year sequence involves changing the "year" column to the new year when day_of_year ==1. It is a trick that I have not been able to figure out and in some ways not sure how to start so any help here is much appreciated. My data looks like this:
Here is my df1 =
day_of_year year var_1
364 2017 17.71666667
364 2018 5.166666667
364 2019 2
364 2020 1.595833333
364 2021 3.75
364 2022 6.8875
365 2017 14.83333333
365 2018 2.758333333
365 2019 4.108333333
365 2020 5.766666667
365 2021 5.291666667
365 2022 10.58636364
1 2017 2.0125
1 2018 14.0125
1 2019 -0.504166667
1 2020 7.666666667
1 2021 5.520833333
1 2022 1.229166667
2 2017 1.7625
2 2018 15.10416667
2 2019 -0.391666667
2 2020 9.5
2 2021 7.645833333
2 2022 0.9125
And, after the re-formatting, I need it to look like the below sorted df with "n/a" for any missing or expected data in a year that might be missing data. thank you again,
final df:
day_of_year year var_1
364 2017 17.71666667
365 2017 14.83333333
1 2018 14.0125
2 2018 15.10416667
364 2018 5.166666667
365 2018 2.758333333
1 2019 -0.504166667
2 2019 -0.391666667
364 2019 2
365 2019 4.108333333
1 2020 7.666666667
2 2020 9.5
364 2020 1.595833333
365 2020 5.766666667
1 2021 5.520833333
2 2021 7.645833333
364 2021 3.75
365 2021 5.291666667
1 2022 1.229166667
2 2022 0.9125
364 2022 6.8875
365 2022 10.58636364
n/a n/a n/a
n/a n/a n/a
Why would you change the year based on the day? Just sort by the two columns:
df.sort_values(by=['year', 'day_of_year'])
Output:
day_of_year year var_1
12 1 2017 2.012500
18 2 2017 1.762500
0 364 2017 17.716667
6 365 2017 14.833333
13 1 2018 14.012500
19 2 2018 15.104167
1 364 2018 5.166667
7 365 2018 2.758333
14 1 2019 -0.504167
20 2 2019 -0.391667
2 364 2019 2.000000
8 365 2019 4.108333
15 1 2020 7.666667
21 2 2020 9.500000
3 364 2020 1.595833
9 365 2020 5.766667
16 1 2021 5.520833
22 2 2021 7.645833
4 364 2021 3.750000
10 365 2021 5.291667
17 1 2022 1.229167
23 2 2022 0.912500
5 364 2022 6.887500
11 365 2022 10.586364
If for some reason you really need to fix the year, use a conditional with mask:
(df.assign(year=df['year'].mask(df['day_of_year'].le(2), df['year'].add(1)))
.sort_values(by=['year', 'day_of_year'])
)
Or, if you want to update the years after a change from 365 to a lower day:
(df.assign(year=df['year'].add(df['day_of_year'].diff().lt(0).cumsum()))
.sort_values(by=['year', 'day_of_year'])
)
Output:
day_of_year year var_1
0 364 2017 17.716667
6 365 2017 14.833333
12 1 2018 2.012500
18 2 2018 1.762500
1 364 2018 5.166667
7 365 2018 2.758333
13 1 2019 14.012500
19 2 2019 15.104167
2 364 2019 2.000000
8 365 2019 4.108333
14 1 2020 -0.504167
20 2 2020 -0.391667
3 364 2020 1.595833
9 365 2020 5.766667
15 1 2021 7.666667
21 2 2021 9.500000
4 364 2021 3.750000
10 365 2021 5.291667
16 1 2022 5.520833
22 2 2022 7.645833
5 364 2022 6.887500
11 365 2022 10.586364
17 1 2023 1.229167
23 2 2023 0.912500
I would convert everything to date time first. Just run:
pd.to_datetime(df['day_of_year'].astype(str) + '-' + df['year'].astype(str),
format='%j-%Y')
I assign it to column ymd and sort, yielding the following:
>>> df.sort_values('ymd')
day_of_year year var_1 ymd
12 1 2017 2.012500 2017-01-01
18 2 2017 1.762500 2017-01-02
0 364 2017 17.716667 2017-12-30
6 365 2017 14.833333 2017-12-31
13 1 2018 14.012500 2018-01-01
19 2 2018 15.104167 2018-01-02
1 364 2018 5.166667 2018-12-30
7 365 2018 2.758333 2018-12-31
14 1 2019 -0.504167 2019-01-01
20 2 2019 -0.391667 2019-01-02
2 364 2019 2.000000 2019-12-30
8 365 2019 4.108333 2019-12-31
15 1 2020 7.666667 2020-01-01
21 2 2020 9.500000 2020-01-02
3 364 2020 1.595833 2020-12-29
9 365 2020 5.766667 2020-12-30
16 1 2021 5.520833 2021-01-01
22 2 2021 7.645833 2021-01-02
4 364 2021 3.750000 2021-12-30
10 365 2021 5.291667 2021-12-31
17 1 2022 1.229167 2022-01-01
23 2 2022 0.912500 2022-01-02
5 364 2022 6.887500 2022-12-30
11 365 2022 10.586364 2022-12-31

Assigning a day, week, and year column in Pandas in one line

I usually have to extract days, weeks and years into separate columns like this:
data['Day'] = data.SALESDATE.dt.isocalendar().day
data['Week'] = data.SALESDATE.dt.isocalendar().week
data['Year'] = data.SALESDATE.dt.isocalendar().year
But is there a way where I can assign all three in one nice line?
data[['Day', 'Week', 'Year']] = ....
``
For one line solution use DataFrame.join with rename columns if necessary:
rng = pd.date_range('2017-04-03', periods=10)
data = pd.DataFrame({'SALESDATE': rng, 'a': range(10)})
data = data.join(data.SALESDATE.dt.isocalendar().rename(columns=lambda x: x.title()))
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
Or change order of list and assign:
data[['Year', 'Week', 'Day']] = data.SALESDATE.dt.isocalendar()
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
If need changed order of values in list:
data[['Day', 'Week', 'Year']] = data.SALESDATE.dt.isocalendar()[['day','week','year']]
print (data)
SALESDATE a Day Week Year
0 2017-04-03 0 1 14 2017
1 2017-04-04 1 2 14 2017
2 2017-04-05 2 3 14 2017
3 2017-04-06 3 4 14 2017
4 2017-04-07 4 5 14 2017
5 2017-04-08 5 6 14 2017
6 2017-04-09 6 7 14 2017
7 2017-04-10 7 1 15 2017
8 2017-04-11 8 2 15 2017
9 2017-04-12 9 3 15 2017

Segregating data based on last 3 months and this time last year

I need to filter out my data into two different index.
(1) last three months, includes December as current month minus three
(2) current month (December 2019) and current month values from the year before
pDate Name Date Year Month
11/17/2019 12:18 A 2019/11 2019 11
12/23/2018 11:52 B 2018/12 2018 12
12/1/2019 11:42 C 2019/12 2019 12
12/10/2018 14:31 D 2018/12 2018 12
12/14/2018 12:42 E 2018/12 2018 12
10/15/2019 15:19 F 2019/10 2019 10
10/23/2019 10:50 G 2019/10 2019 10
12/2/2018 15:14 H 2018/12 2018 12
I was able to group them based upon their last 3 months values, relatively quick as:
df1 = df.sort_values(by="pDate",ascending=True).set_index("pDate").last("3M")
How do I get a dataframe which maps December 2019 (current month) and December 2018 only.
Idea is create month periods by Series.dt.to_period and then you can subtract values for past periods filtering by Series.between with boolean indexing:
$changed sample datetimes
df['pDate'] = pd.to_datetime(df['pDate'])
df = df.sort_values(by="pDate")
print (df)
pDate Name Date Year Month
7 2018-12-02 15:14:00 H 2018/12 2018 12
4 2018-12-14 12:42:00 E 2018/12 2018 12
3 2019-10-10 14:31:00 D 2018/12 2018 12
5 2019-10-15 15:19:00 F 2019/10 2019 10
6 2019-10-23 10:50:00 G 2019/10 2019 10
2 2019-11-01 11:42:00 C 2019/12 2019 12
1 2019-12-23 11:52:00 B 2018/12 2018 12
0 2020-01-17 12:18:00 A 2019/11 2019 11
nowp = pd.to_datetime('now').to_period('m')
print (nowp)
2020-01
df['per'] = df['pDate'].dt.to_period('m')
df = df[df['per'].between(nowp-4, nowp-1) | df['per'].eq(nowp-13)]
print (df)
pDate Name Date Year Month per
7 2018-12-02 15:14:00 H 2018/12 2018 12 2018-12
4 2018-12-14 12:42:00 E 2018/12 2018 12 2018-12
3 2019-10-10 14:31:00 D 2018/12 2018 12 2019-10
5 2019-10-15 15:19:00 F 2019/10 2019 10 2019-10
6 2019-10-23 10:50:00 G 2019/10 2019 10 2019-10
2 2019-11-01 11:42:00 C 2019/12 2019 12 2019-11
1 2019-12-23 11:52:00 B 2018/12 2018 12 2019-12
Detail:
print (nowp)
2020-01
print (nowp-1)
2019-12
print (nowp-13)
2018-12
print (nowp-4)
2019-09

Get all dates for all date ranges in table using SQL Server

I have table dbo.WorkSchedules(Id, From, To) where I store date ranges for work schedules. I want to create a view that will have all dates for all rows of WorkSchedules. Thanks to this I have 1 view with all dates for all schedules.
On web I only found solutions for 1 row like 2 parameters start and end. My issue is different where I have multiple rows with start and end range.
Example:
WorkSchedules
Id | From | To
---+------------+-----------
1 | 2018-01-01 | 2018-01-05
2 | 2018-01-08 | 2018-01-12
Desired result
1 | 2018-01-01
2 | 2018-01-02
3 | 2018-01-03
4 | 2018-01-04
5 | 2018-01-05
6 | 2018-01-08
7 | 2018-01-09
8 | 2018-01-10
9 | 2018-01-11
10| 2018-01-12
If you are regularly dealing with "jobs" and "schedules" then I propose that you need a permanent calendar table (a table where each row is a unique date). You can create rows for dates dynamically but why do this many times when you can do it once and just re-use?
A calendar table, even of several decades, isn't "big" and when indexed they can be very fast as well. You can also store information about holidays and/or fiscal periods etc.
There are many scripts available to produce these tables, here's an answer with 2 scripts on this site: https://stackoverflow.com/a/5635628/2067753
Assuming you use the second (more comprehensive) script, then you can exclude weekends, or other conditions such as holidays, from query results.
Once you have a permanent Calendar table this style of query may be used:
CREATE TABLE WorkSchedules(
Id INTEGER NOT NULL PRIMARY KEY
,[From] DATE NOT NULL
,[To] DATE NOT NULL
);
INSERT INTO WorkSchedules(Id,[From],[To]) VALUES (1,'2018-01-01','2018-01-05');
INSERT INTO WorkSchedules(Id,[From],[To]) VALUES (2,'2018-01-12','2018-01-12');
with range as (
select min(ws.[From]) as dt_from, max(ws.[To]) dt_to
from WorkSchedules as ws
)
select c.*
from calendar as c
inner join range on c.date between range.dt_from and range.dt_to
where c.KindOfDay = 'BANKDAY'
order by c.date
and the result looks like this (note: "News Years Day" has been excluded)
Date Year Quarter Month Week Day DayOfYear Weekday Fiscal_Year Fiscal_Quarter Fiscal_Month KindOfDay Description
---- --------------------- ------ --------- ------- ------ ----- ----------- --------- ------------- ---------------- -------------- ----------- -------------
1 02.01.2018 00:00:00 2018 1 1 1 2 2 2 2018 1 1 BANKDAY NULL
2 03.01.2018 00:00:00 2018 1 1 1 3 3 3 2018 1 1 BANKDAY NULL
3 04.01.2018 00:00:00 2018 1 1 1 4 4 4 2018 1 1 BANKDAY NULL
4 05.01.2018 00:00:00 2018 1 1 1 5 5 5 2018 1 1 BANKDAY NULL
5 08.01.2018 00:00:00 2018 1 1 2 8 8 1 2018 1 1 BANKDAY NULL
6 09.01.2018 00:00:00 2018 1 1 2 9 9 2 2018 1 1 BANKDAY NULL
7 10.01.2018 00:00:00 2018 1 1 2 10 10 3 2018 1 1 BANKDAY NULL
8 11.01.2018 00:00:00 2018 1 1 2 11 11 4 2018 1 1 BANKDAY NULL
9 12.01.2018 00:00:00 2018 1 1 2 12 12 5 2018 1 1 BANKDAY NULL
Without the where clause the full range is:
Date Year Quarter Month Week Day DayOfYear Weekday Fiscal_Year Fiscal_Quarter Fiscal_Month KindOfDay Description
---- --------------------- ------ --------- ------- ------ ----- ----------- --------- ------------- ---------------- -------------- ----------- ----------------
1 01.01.2018 00:00:00 2018 1 1 1 1 1 1 2018 1 1 HOLIDAY New Year's Day
2 02.01.2018 00:00:00 2018 1 1 1 2 2 2 2018 1 1 BANKDAY NULL
3 03.01.2018 00:00:00 2018 1 1 1 3 3 3 2018 1 1 BANKDAY NULL
4 04.01.2018 00:00:00 2018 1 1 1 4 4 4 2018 1 1 BANKDAY NULL
5 05.01.2018 00:00:00 2018 1 1 1 5 5 5 2018 1 1 BANKDAY NULL
6 06.01.2018 00:00:00 2018 1 1 1 6 6 6 2018 1 1 SATURDAY NULL
7 07.01.2018 00:00:00 2018 1 1 1 7 7 7 2018 1 1 SUNDAY NULL
8 08.01.2018 00:00:00 2018 1 1 2 8 8 1 2018 1 1 BANKDAY NULL
9 09.01.2018 00:00:00 2018 1 1 2 9 9 2 2018 1 1 BANKDAY NULL
10 10.01.2018 00:00:00 2018 1 1 2 10 10 3 2018 1 1 BANKDAY NULL
11 11.01.2018 00:00:00 2018 1 1 2 11 11 4 2018 1 1 BANKDAY NULL
12 12.01.2018 00:00:00 2018 1 1 2 12 12 5 2018 1 1 BANKDAY NULL
and weekends and holidays may be excluded using the column KindOfDay
See this as a demonstration (with build of calendar table) here: http://rextester.com/CTSW63441
Ok, I worked this out for you, thinking you mean that you meant 01/08/2018 as a From date in the second row.
/*WorkSchedules
Id| From | To
1 | 2018-01-01 | 2018-01-05
2 | 2018-01-08 | 2018-01-12
*/
--DROP TABLE #WorkSchedules;
CREATE TABLE #WorkSchedules (
ID int,
[DateFrom] DATE,
[DateTo] DATE
)
INSERT INTO #WorkSchedules
SELECT 1, '2018-01-01', '2018-01-05'
UNION
SELECT 2, '2018-01-08', '2018-01-12'
;WITH CTEDATELIMITS AS (
SELECT [DateFrom], [DateTo]
FROM #WorkSchedules
)
,CTEDATES AS
(
SELECT [DateFrom] as [DateResult] FROM CTEDATELIMITS
UNION ALL
SELECT DATEADD(Day, 1, [DateResult]) FROM CTEDATES
JOIN CTEDATELIMITS ON CTEDATES.[DateResult] >= CTEDATELIMITS.[DateFrom]
AND CTEDATES.dateResult < CTEDATELIMITS.[DateTo]
)
SELECT [DateResult] FROM CTEDATES
ORDER BY [DateResult]
You would use a recursive CTE:
with dates as (
select from, to, from as date
from WorkSchedules
union all
select from, to, dateadd(day, 1, date)
from dates
where date < to
)
select row_number() over (order by date), date
from dates;
Note that from and to are reserved words in SQL. They are lousy names for identifiers. I have not escaped them because I assume they are not the actual names of the columns.

pandas sort by multiple columns

I want to sort the values in column C in ascending order and values in column B in order "April","August","December" and any remaining values e.g NaN in current example. Can anyone help.
before
A B C
0 354.7 April 4
1 278.8 NaN 4
2 283.5 December 2
3 249.6 NaN 2
4 95.5 April 2
5 85.6 August 2
6 55.4 August 4
7 176.5 December 4
8 104.8 August 8
9 278.8 NaN 10
10 238.7 April 8
11 278.8 April 5
12 152 December 8
After :
A B C
0 95.5 April 2
1 85.6 August 2
2 283.5 December 2
3 249.6 NaN 2
4 354.7 April 4
5 55.4 August 4
6 176.5 December 4
7 278.8 NaN 4
8 278.8 April 5
9 238.7 April 8
10 104.8 August 8
11 152 December 8
12 278.8 NaN 10
Is this what you need ?
df.B=pd.Categorical(df.B,['December','April','August'])
df.sort_values(['C','B'])
Out[284]:
A B C
2 283.5 December 2
4 95.5 April 2
5 85.6 August 2
3 249.6 NaN 2
7 176.5 December 4
0 354.7 April 4
6 55.4 August 4
1 278.8 NaN 4
11 278.8 April 5
12 152.0 December 8
10 238.7 April 8
8 104.8 August 8
9 278.8 NaN 10