For the following dataset, which has a three-level MultiIndex (Activity → Facility Size → Year),
my goal is to apply style.highlight_max() and style.highlight_min() to the Employee Count column in each group individually (Large, Medium, and Small). Is there a way to do this while still maintaining the same shape of the dataframe as below, without separating the groups apart?
Facility Count Employee Count
Activity Facility Size Year
Computer Programming Large 2010 59 3830
2011 60 3912
2012 63 4111
2013 66 4273
2014 73 4730
2015 81 5066
2016 85 5347
2017 86 5426
2018 86 7256
Medium 2010 168 1418
2011 179 1509
2012 190 1601
2013 191 1615
2014 211 1779
2015 228 1922
2016 229 1972
2017 233 2017
2018 235 2640
Small 2010 182 431
2011 185 438
2012 194 459
2013 202 478
2014 226 535
2015 226 572
2016 235 604
2017 243 623
2018 244 687
I have tried getting each group separately with .groupby(level=0), but that is not what I want, since it splits the groups into separate dataframes instead of keeping them as one.
Thank you
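One way to do this (a sketch, not from the original thread; it assumes a reasonably recent pandas, and the colors are arbitrary) is to chain highlight_max/highlight_min calls on a single Styler, using subset= to limit each call to one group's rows, so the dataframe keeps its shape:
import pandas as pd

# df is the MultiIndexed frame shown above (Activity -> Facility Size -> Year)
styler = df.style
for _, group in df.groupby(level=['Activity', 'Facility Size']):
    # restrict each highlight to this group's rows and the Employee Count column
    styler = styler.highlight_max(subset=(group.index, 'Employee Count'), color='lightgreen')
    styler = styler.highlight_min(subset=(group.index, 'Employee Count'), color='salmon')
styler  # render in a notebook; the frame is styled as one piece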
I have a data table derived via unstacking an existing dataframe:
Day 0 1 2 3 4 5 6
Hrs
0 223 231 135 122 099 211 217
1 156 564 132 414 156 454 157
2 950 178 121 840 143 648 192
3 025 975 151 185 341 145 888
4 111 264 469 330 671 201 345
-- -- -- -- -- -- -- --
I simply want to change the column titles so that the days of the week are displayed instead of numbers. Something like this:
Day Mon Tue Wed Thu Fri Sat Sun
Hrs
0 223 231 135 122 099 211 217
1 156 564 132 414 156 454 157
2 950 178 121 840 143 648 192
3 025 975 151 185 341 145 888
4 111 264 469 330 671 201 345
-- -- -- -- -- -- -- --
I've tried .rename(columns={'original': 'new', etc}, inplace=True) and other similar functions, none of which have worked.
I also tried going back to the original dataframe and creating a dt.day_name column from the parsed dates, but it came out with the days of the week mixed up.
I'm sure it's a simple fix, but I'm living off nothing but caffeine, so help would be appreciated.
You can try assigning the new labels to df.columns directly:
import pandas as pd

# stand-in frame with the integer columns 0-6 produced by unstacking
df = pd.DataFrame(columns=[0, 1, 2, 3, 4, 5, 6])
df.columns = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
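If .rename seemed to do nothing, a likely cause (an assumption, since the original frame isn't shown) is that the unstacked columns are the integers 0-6, so string keys like '0' never match. Renaming with integer keys works:
# map the integer column labels to day names
df = df.rename(columns=dict(zip(range(7), ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])))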
I'm a newbie in R and I'm trying to translate the following nested query using dplyr:
SELECT * FROM DAT
where concat(code, datcomp) IN
(SELECT concat(code, max(datcomp)) from DAT group by code)
DAT is a data frame containing several hundred columns.
code is a non-unique numeric field.
datcomp is a string like 'YYYY-MM-DDTHH24:MI:SS'.
What I'm trying to do is extract from the data frame the most recent timestamp for each code.
Eg: given
code datcomp
1 1005 2019-06-12T09:13:47
2 1005 2019-06-19T16:15:46
3 1005 2019-06-17T21:46:02
4 1005 2019-06-17T17:52:01
5 1005 2019-06-24T13:10:05
6 1015 2019-05-02T10:33:13
7 1030 2019-06-11T14:58:16
8 1030 2019-06-20T09:50:20
9 2008 2019-05-17T18:43:34
10 2008 2019-05-28T15:16:50
11 3030 2019-05-24T09:51:30
12 3032 2019-05-30T16:40:03
13 3032 2019-05-21T09:34:27
14 3062 2019-05-17T16:10:53
15 3062 2019-06-20T16:45:51
16 3069 2019-07-01T17:54:59
17 3069 2019-07-09T12:39:56
18 3069 2019-07-09T17:45:09
19 3069 2019-07-17T14:31:01
20 3069 2019-06-24T13:42:27
21 3104 2019-06-05T14:47:38
22 3104 2019-05-17T15:18:47
23 3111 2019-06-06T15:52:51
24 3111 2019-07-01T09:50:33
25 3127 2019-04-16T16:04:59
26 3127 2019-05-15T11:49:29
27 3249 2019-06-21T18:24:14
28 3296 2019-07-01T17:44:54
29 3311 2019-06-10T11:05:20
30 3311 2019-06-21T12:11:05
31 3311 2019-06-19T11:36:47
32 3332 2019-05-13T09:38:21
33 3440 2019-06-11T12:53:07
34 3440 2019-05-17T17:40:19
35 3493 2019-04-18T11:18:37
36 5034 2019-06-06T15:24:04
37 5034 2019-05-31T11:39:17
38 5216 2019-05-20T17:16:07
39 5216 2019-05-14T15:08:15
40 5385 2019-05-17T13:19:54
41 5387 2019-05-13T09:33:31
42 5387 2019-05-07T10:49:14
43 5387 2019-05-15T10:38:25
44 5696 2019-06-10T16:16:49
45 5696 2019-06-11T14:47:00
46 5696 2019-06-13T17:10:36
47 6085 2019-05-21T10:15:58
48 6085 2019-06-03T11:22:34
49 6085 2019-05-29T11:25:37
50 6085 2019-05-31T12:52:42
51 6175 2019-05-13T17:17:48
52 6175 2019-05-27T09:58:04
53 6175 2019-05-23T10:32:21
54 6230 2019-06-21T14:28:11
55 6230 2019-06-11T16:00:48
56 6270 2019-05-28T08:57:38
57 6270 2019-05-17T16:17:04
58 10631 2019-05-22T09:46:51
59 10631 2019-07-03T10:41:41
60 10631 2019-06-06T11:52:42
What I need is
code datcomp
1 1005 2019-06-24T13:10:05
2 1015 2019-05-02T10:33:13
3 1030 2019-06-20T09:50:20
4 2008 2019-05-28T15:16:50
5 3030 2019-05-24T09:51:30
6 3032 2019-05-30T16:40:03
7 3062 2019-06-20T16:45:51
8 3069 2019-07-17T14:31:01
9 3104 2019-06-05T14:47:38
10 3111 2019-07-01T09:50:33
11 3127 2019-05-15T11:49:29
12 3249 2019-06-21T18:24:14
13 3296 2019-07-01T17:44:54
14 3311 2019-06-21T12:11:05
15 3332 2019-05-13T09:38:21
16 3440 2019-06-11T12:53:07
17 3493 2019-04-18T11:18:37
18 5034 2019-06-06T15:24:04
19 5216 2019-05-20T17:16:07
20 5385 2019-05-17T13:19:54
21 5387 2019-05-15T10:38:25
22 5696 2019-06-13T17:10:36
23 6085 2019-06-03T11:22:34
24 6175 2019-05-27T09:58:04
25 6230 2019-06-21T14:28:11
26 6270 2019-05-28T08:57:38
27 10631 2019-07-03T10:41:41
Thank you in advance.
A more generalized version: group, then sort so that the row you want comes first, then slice (which also lets you take the nth value from each group as sorted):
dati %>%
  group_by(code) %>%
  arrange(desc(datcomp)) %>%  # ISO-8601 timestamps sort correctly even as strings
  slice(1) %>%                # first row of each group = most recent datcomp
  ungroup()
To be clear, I'm not a developer, I'm just a business analyst trying to achieve something in Access which has stumped me.
I have a table of values as such:
Area Week
232 1
232 2
232 3
232 4
232 5
232 6
232 7
232 8
232 9
232 10
232 11
232 12
232 35
232 36
232 37
232 38
232 39
232 41
232 42
232 43
232 44
232 45
232 46
232 47
232 48
232 49
232 50
232 51
232 52
330 1
330 2
330 3
330 4
330 33
330 34
330 35
330 36
330 37
330 38
330 39
330 40
330 41
330 42
330 43
330 44
330 45
330 47
330 48
330 49
330 50
I would like to write an SQL query in Access that groups each run of consecutive weeks as follows:
Area Code Week Start Week End
232 1 12
232 35 39
232 41 52
330 1 4
330 33 45
330 47 50
However, everything I have read leads me to the ROWNUM() function, which is not native to Access.
I'm OK with general queries in Access, but am not very familiar with SQL.
How can I go about achieving my task?
Thanks
Mike
Use another database! MS Access doesn't have good functionality (in general).
You can do what you want, but it is expensive:
select area, min(week) as week_start, max(week) as week_end
from (select t.*,
             (select count(*)
              from t as t2
              where t2.area = t.area and t2.week <= t.week
             ) as seqnum
      from t
     ) as t
group by area, (week - seqnum);
The correlated subquery is essentially doing row_number().
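To see why grouping by week - seqnum isolates each run of consecutive weeks, here is the same arithmetic sketched in pandas on a made-up mini-table (the names mirror the query above; this is only an illustration, not part of the Access solution):
import pandas as pd

t = pd.DataFrame({'area': [232] * 6, 'week': [1, 2, 3, 35, 36, 38]})

# seqnum = 1, 2, 3, ... per area, assuming rows are sorted by week
t = t.sort_values(['area', 'week'])
t['seqnum'] = t.groupby('area').cumcount() + 1

# week - seqnum is constant within each run of consecutive weeks
t['island'] = t['week'] - t['seqnum']
print(t.groupby(['area', 'island'])['week'].agg(['min', 'max']))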
I have a dataframe that needs to repeat itself.
from io import StringIO
import pandas as pd
audit_trail = StringIO('''
course_id AcademicYear_to months TotalFee
260 2017 24 100
260 2018 12 140
274 2016 36 300
274 2017 24 340
274 2018 12 200
285 2017 24 300
285 2018 12 200
''')
df11 = pd.read_csv(audit_trail, sep=r"\s+")  # the sample is separated by runs of spaces
For course id 260 there are 2 entries, one per year: 2017 and 2018. I need to repeat the years across the month groups.
That gives 2 more rows: 2018 for 24 months and 2017 for 12 months. The final dataframe will look like this...
audit_trail = StringIO('''
course_id AcademicYear_to months TotalFee
260 2017 24 100
260 2018 24 100
260 2017 12 140
260 2018 12 140
274 2016 36 300
274 2017 36 300
274 2018 36 300
274 2016 24 340
274 2017 24 340
274 2018 24 340
274 2016 12 200
274 2017 12 200
274 2018 12 200
285 2017 24 300
285 2018 24 300
285 2017 12 200
285 2018 12 200
''')
df12 = pd.read_csv(audit_trail, sep=r"\s+")
I tried concatenating the dataframe with itself, but that does not solve the problem: I need the years to change, and for 36 months the data needs to be repeated 3 times.
pd.concat([df11, df11])
The groupby object will return the years. I simply need to join each group's years back onto the original dataframe.
df11.groupby('course_id')['AcademicYear_to'].apply(list)
260 [2017, 2018]
274 [2016, 2017, 2018]
285 [2017, 2018]
A simple join can work when the number of records matches the number of years. For example, course id 274 has 48 months and 285 has a duration of 24 months, with 3 and 2 entries respectively. The problem is course id 260, which is a 24-month course but has only 1 entry; the join will not return the second year for that course.
df11=pd.read_csv('https://s3.amazonaws.com/todel162/myso.csv')
df11.course_id.value_counts()
274 3
285 2
260 1
df=df11.merge(df11[['course_id']], on='course_id')
df.course_id.value_counts()
274 9
285 4
260 1
Is it possible to write a query that will also consider the number of months?
The following query will return the records where simple join will not return expected results.
df11=pd.read_csv('https://s3.amazonaws.com/todel162/myso.csv')
df11['m1'] = df11.groupby('course_id').course_id.transform(lambda x: x.count() * 12)
df11.query('m1 != duration_inmonths')
df11.course_id.value_counts()
274 3
285 2
260 1
df=df11.merge(df11[['course_id']], on='course_id')
df.course_id.value_counts()
274 9
285 4
260 1
The expected count in this case is
274 6
285 4
260 2
This is because even though there are 3 years for id 274, the course duration is only 24 months. And even though there is only 1 record for 260, since the duration is 24 months it should return 2 records (one for the current year and one for current_year + 1).
IIUC we can merge df11 to itself:
In [14]: df11.merge(df11[['course_id']], on='course_id')
Out[14]:
course_id AcademicYear_to months TotalFee
0 260 2017 24 100
1 260 2017 24 100
2 260 2018 12 140
3 260 2018 12 140
4 274 2016 36 300
5 274 2016 36 300
6 274 2016 36 300
7 274 2017 24 340
8 274 2017 24 340
9 274 2017 24 340
10 274 2018 12 200
11 274 2018 12 200
12 274 2018 12 200
13 285 2017 24 300
14 285 2017 24 300
15 285 2018 12 200
16 285 2018 12 200
Not Pretty!
def f(x):
    # rebuild the group's (months, year) index as the full cross-product of its levels
    idx = x.index.remove_unused_levels()
    idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
    return x.reindex(idx)  # combinations that were absent appear as NaN

(df11.set_index(['months', 'AcademicYear_to'])
     .groupby('course_id').TotalFee.apply(f)
     .groupby(level=[0, 1]).transform('first')  # fill NaNs with each (course_id, months) group's fee
     .astype(df11.TotalFee.dtype).reset_index())
course_id months AcademicYear_to TotalFee
0 260 24 2017 100
1 260 24 2018 100
2 260 12 2017 140
3 260 12 2018 140
4 274 12 2016 200
5 274 12 2017 200
6 274 12 2018 200
7 274 24 2016 340
8 274 24 2017 340
9 274 24 2018 340
10 274 36 2016 300
11 274 36 2017 300
12 274 36 2018 300
13 285 24 2017 300
14 285 24 2018 300
15 285 12 2017 200
16 285 12 2018 200
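For reference, the asker's own idea of joining each group's years back onto the frame can be written as a self-merge on the deduplicated (course_id, year) pairs. A sketch, assuming df11 as read above; it reproduces df12 up to row order:
years = df11[['course_id', 'AcademicYear_to']].drop_duplicates()

# pair every (months, TotalFee) row of a course with every year seen for that course
df12 = (df11.drop(columns='AcademicYear_to')
            .merge(years, on='course_id')
            [['course_id', 'AcademicYear_to', 'months', 'TotalFee']])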
I have a list of almost a million dates formatted as DD-Mmm-YYYY. I would love to create a calendar heat map using Seaborn's heatmap function to visualize the distribution of dates across the calendar year (regardless of year). I have figured out how to separate Month and Day into their own columns, so that I have:
In [8]: df.head()
Out[8]:
original_date month day
0 05-Sep-2010 Sep 05
1 08-Apr-2010 Apr 08
2 03-Aug-2008 Aug 03
3 03-Feb-2008 Feb 03
4 14-Mar-2008 Mar 14
What can I do to this dataframe to get it into a format that has days of the month as columns, and months as row index? Here's what I'm looking for, but it was done without Pandas, using csv processing and nested dictionaries.
01 02 03 04 05 06 07 08 09 10 ...
Jan 1923 371 341 451 437 332 338 398 403 476 ...
Feb 931 675 891 514 479 452 509 657 507 771 ...
Mar 1370 906 737 594 469 458 524 368 430 2136 ...
Apr 1433 1127 706 791 639 817 584 580 515 757 ...
May 1666 885 884 697 1626 708 809 1053 826 1281 ...
I'd like to do this in Pandas to be able to filter by year, etc.
First I would create a new dataframe that counts occurrences by month and day (ignoring year):
new_df = []
for key, grp in df.groupby(['month', 'day']):
    month, day = key
    new_df.append({
        'month': month,
        'day': day,
        'count': len(grp)  # number of dates falling on this month/day
    })
new_df = pd.DataFrame(new_df)
Then you can pivot this dataframe into the shape you want (the keyword form is required in pandas 2.0+):
new_df.pivot(index='month', columns='day', values='count')
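A shorter route, and one that also fixes the alphabetical ordering that string month labels would otherwise produce, is to count with groupby().size() and unstack. A sketch, assuming the same df as above:
import seaborn as sns

counts = df.groupby(['month', 'day']).size().unstack(fill_value=0)

# reindex so rows follow the calendar rather than alphabetical order
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
counts = counts.reindex(month_order)

sns.heatmap(counts)  # months as rows, days of the month as columns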