Pandas, getting mean and sum with groupby - pandas

I have a data frame, df, which looks like this:
index New Old Map Limit count
1 93 35 54 > 18 1
2 163 93 116 > 18 1
3 134 78 96 > 18 1
4 117 81 93 > 18 1
5 194 108 136 > 18 1
6 125 57 79 <= 18 1
7 66 39 48 > 18 1
8 120 83 95 > 18 1
9 150 98 115 > 18 1
10 149 99 115 > 18 1
11 148 85 106 > 18 1
12 92 55 67 <= 18 1
13 64 24 37 > 18 1
14 84 53 63 > 18 1
15 99 70 79 > 18 1
I need to produce a data frame that looks like this:
Limit <=18 >18
total mean total mean
New xx1 yy1 aa1 bb1
Old xx2 yy2 aa2 bb2
MAP xx3 yy3 aa3 bb3
I tried this without success:
df.groupby('Limit')['New', 'Old', 'MAP'].[sum(), mean()].T
How can I achieve this in pandas?

You can use groupby with agg, then transpose by T and unstack:
print(df[['New', 'Old', 'Map', 'Limit']].groupby('Limit').agg(['sum', 'mean']).T.unstack())
Limit <= 18 > 18
sum mean sum mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
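Edited following a comment; selecting the columns after the groupby looks nicer:
print(df.groupby('Limit')[['New', 'Old', 'Map']].agg(['sum', 'mean']).T.unstack())
And if you need the sum column to be named total:
# renaming through a dict passed to agg no longer works in pandas 1.0+, so rename afterwards
print(df.groupby('Limit')[['New', 'Old', 'Map']]
        .agg(['sum', 'mean'])
        .rename(columns={'sum': 'total'})
        .T
        .unstack())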
Limit <= 18 > 18
total mean total mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
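For reference, here is a self-contained sketch of the same idea that rebuilds the question's data and runs on recent pandas versions. The variable names `agg` and `result` are only illustrative, and the column selection uses a list (`[['New', 'Old', 'Map']]`), which is an assumption about what the asker wants summarized.

```python
import pandas as pd

# Data copied from the question's table.
df = pd.DataFrame({
    "New": [93, 163, 134, 117, 194, 125, 66, 120, 150, 149, 148, 92, 64, 84, 99],
    "Old": [35, 93, 78, 81, 108, 57, 39, 83, 98, 99, 85, 55, 24, 53, 70],
    "Map": [54, 116, 96, 93, 136, 79, 48, 95, 115, 115, 106, 67, 37, 63, 79],
    "Limit": ["> 18"] * 5 + ["<= 18"] + ["> 18"] * 5 + ["<= 18"] + ["> 18"] * 3,
})

# Sum and mean per Limit group; columns become a (variable, statistic) MultiIndex.
agg = df.groupby("Limit")[["New", "Old", "Map"]].agg(["sum", "mean"])

# Rename the statistic level so that "sum" is shown as "total".
agg = agg.rename(columns={"sum": "total"}, level=1)

# Transpose, then move the statistic level into the columns:
# rows are New/Old/Map, columns are (Limit, statistic).
result = agg.T.unstack()
print(result)
```

Individual cells can then be read by label, for example `result.loc["New", ("<= 18", "total")]`.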

Related

pandas df add new column based on proportion of two other columns from another dataframe

I have df1, which has three columns (loadgroup, cartons, blocks) plus the derived cartonsPercent and blocksPercent, like this:
loadgroup              cartons  blocks  cartonsPercent  blocksPercent
1                      2269     14      26%             21%
2                      1168     13      13%             19%
3                      937      8       11%             12%
4                      2753     24      31%             35%
5                      1686     9       19%             13%
total (sum of column)  8813     68      100%            100%
The interpretation is this: in df1, 26% of the cartons, which make up 21% of the blocks, are assigned to loadgroup 1, and so on. We can assume blocks are numbered 1 to 68 and cartons 1 to 8813.
I also have df2, which has cartons and blocks columns too but no loadgroup.
My goal is to assign a loadgroup (also 1-5) to each block of df2 (100 blocks, 29608 cartons in total) while keeping the proportions. For example, in df2, 26% of the cartons and 21% of the blocks should get loadgroup 1, 13% of the cartons and 19% of the blocks should get loadgroup 2, etc.
df2 is like this:
block  cartons
0      533
1      257
2      96
3      104
4      130
5      71
6      68
7      87
8      99
9      51
10     291
11     119
12     274
13     316
14     87
15     149
16     120
17     222
18     100
19     148
20     192
21     188
22     293
23     120
24     224
25     449
26     385
27     395
28     418
29     423
30     244
31     327
32     337
33     249
34     528
35     528
36     494
37     540
38     368
39     533
40     614
41     462
42     350
43     618
44     463
45     552
46     397
47     401
48     397
49     365
50     475
51     379
52     541
53     488
54     383
55     354
56     760
57     327
58     211
59     356
60     552
61     401
62     320
63     368
64     311
65     421
66     458
67     278
68     504
69     385
70     242
71     413
72     246
73     465
74     386
75     231
76     154
77     294
78     275
79     169
80     398
81     227
82     273
83     319
84     177
85     272
86     204
87     139
88     187
89     263
90     90
91     134
92     67
93     115
94     45
95     65
96     40
97     108
98     60
99     102
total: 100 blocks, 29608 cartons
I want to add a loadgroup column to df2 while keeping those proportions as close as possible. How can I do it? Thank you very much for the help.
I don't know how to derive the loadgroup column from both the carton percentages and the block percentages at once, but generating a random loadgroup from either one alone is easy.
Here is what I did. I first generate 100,000 seeds. For each seed, I add a column loadGroup1 drawn from the carton percentages and a column loadGroup2 drawn from the block percentages, compute both resulting percentages, compare them with the df1 percentages, and record the absolute difference. Across the 100,000 seeds, I take the assignment with the minimum difference as my solution, which is sufficient for my job.
But this is not the optimal solution, and I am looking for a quick and easy way to do this. I hope somebody can help.
Here is my code.
import numpy as np
import pandas as pd

# Assumes df1 and df2 already exist, with the percentage columns stored as fractions.
df = pd.DataFrame()  # collects one score row per seed and method
np.random.seed(10000)
seeds = np.random.randint(1, 1000000, size = 100000)
df2.reset_index(inplace = True)  # done once, outside the loop, so 'block' is a regular column
for i in range(46530, 46537):
    print(seeds[i])
    np.random.seed(seeds[i])
    df2['loadGroup1'] = np.random.choice(df1.loadgroup, len(df2), p = df1.CartonsPercent)
    df2['loadGroup2'] = np.random.choice(df1.loadgroup, len(df2), p = df1.blocksPercent)
    # Score the assignment drawn from the carton percentages.
    three = pd.DataFrame(df2.groupby('loadGroup1').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
    three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
    three['blocksPercent'] = three.blocks/three.blocks.sum()
    # The subtraction aligns on the index, so this assumes df1 is indexed by loadgroup.
    four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
    four = four.abs()
    subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['CartonsPercent'], 'AbsDiff':[four.sum().sum()]})
    df = pd.concat([df,subdf])
    # Score the assignment drawn from the block percentages.
    three = pd.DataFrame(df2.groupby('loadGroup2').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
    three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
    three['blocksPercent'] = three.blocks/three.blocks.sum()
    four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
    four = four.abs()
    subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['blocksPercent'], 'AbsDiff':[four.sum().sum()]})
    df = pd.concat([df,subdf])
# Keep the ten seeds with the smallest total absolute difference.
df.sort_values(by = 'AbsDiff', ascending = True, inplace = True)
df = df.head(10)
The first row of df tells me the seed I am looking for; I kept 10 rows just out of curiosity.
Here is my solution.
block  cartons  loadgroup
0      533      4
1      257      1
2      96       4
3      104      4
4      130      4
5      71       2
6      68       1
7      87       4
8      99       4
9      51       4
10     291      4
11     119      2
12     274      2
13     316      4
14     87       4
15     149      5
16     120      3
17     222      2
18     100      2
19     148      2
20     192      3
21     188      4
22     293      1
23     120      2
24     224      4
25     449      1
26     385      5
27     395      3
28     418      1
29     423      4
30     244      5
31     327      1
32     337      5
33     249      4
34     528      1
35     528      1
36     494      5
37     540      3
38     368      2
39     533      4
40     614      5
41     462      4
42     350      5
43     618      4
44     463      2
45     552      1
46     397      3
47     401      3
48     397      1
49     365      1
50     475      4
51     379      1
52     541      1
53     488      2
54     383      2
55     354      1
56     760      5
57     327      4
58     211      2
59     356      5
60     552      4
61     401      1
62     320      1
63     368      3
64     311      3
65     421      2
66     458      5
67     278      4
68     504      5
69     385      4
70     242      4
71     413      1
72     246      2
73     465      5
74     386      4
75     231      1
76     154      4
77     294      4
78     275      1
79     169      4
80     398      4
81     227      4
82     273      1
83     319      3
84     177      4
85     272      5
86     204      3
87     139      1
88     187      4
89     263      4
90     90       4
91     134      4
92     67       3
93     115      3
94     45       2
95     65       2
96     40       4
97     108      2
98     60       2
99     102      1
Here are the summaries.
loadgroup  cartons  blocks  cartonsPercent  blocksPercent
1          7610     22      26%             22%
2          3912     18      13%             18%
3          3429     12      12%             12%
4          9269     35      31%             35%
5          5388     13      18%             13%
It's very close to my target though.
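The seed search described in the question can also be written compactly with NumPy alone. The following is only a sketch under stated assumptions: the target shares are taken from the df1 table, the carton counts are made-up stand-ins for df2, candidates are drawn from the block shares only, and the names `score`, `best_assignment` and `best_score` are hypothetical. It is a random search, not an optimal assignment.

```python
import numpy as np

# Target shares per loadgroup, taken from the df1 table in the question.
groups = np.array([1, 2, 3, 4, 5])
carton_share = np.array([2269, 1168, 937, 2753, 1686]) / 8813
block_share = np.array([14, 13, 8, 24, 9]) / 68

# Stand-in for df2: 100 made-up carton counts (an assumption, not the real data).
cartons = np.random.default_rng(0).integers(40, 760, size=100)

def score(assignment):
    """Sum of absolute deviations from both target share vectors."""
    got_cartons = np.array([cartons[assignment == g].sum() for g in groups]) / cartons.sum()
    got_blocks = np.array([(assignment == g).sum() for g in groups]) / len(cartons)
    return np.abs(got_cartons - carton_share).sum() + np.abs(got_blocks - block_share).sum()

# Try many seeds and keep the random assignment with the smallest deviation.
best_score, best_assignment = np.inf, None
for seed in range(2000):
    rng = np.random.default_rng(seed)
    candidate = rng.choice(groups, size=len(cartons), p=block_share)
    s = score(candidate)
    if s < best_score:
        best_score, best_assignment = s, candidate

print(best_score)
```

Using one `default_rng(seed)` per candidate keeps each draw reproducible from its seed, which mirrors the idea of recording the best seed.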

Reverse the order of the rows by chunks of n rows

Consider the following sequence:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
which produces:
A B C D
0 56 83 99 46
1 40 70 22 51
2 70 9 78 33
3 65 72 79 87
4 0 6 22 73
.. .. .. .. ..
95 35 76 62 97
96 86 85 50 65
97 15 79 82 62
98 21 20 19 32
99 21 0 51 89
I can reverse the sequence with the following command:
df.iloc[::-1]
That gives me the following result:
A B C D
99 21 0 51 89
98 21 20 19 32
97 15 79 82 62
96 86 85 50 65
95 35 76 62 97
.. .. .. .. ..
4 0 6 22 73
3 65 72 79 87
2 70 9 78 33
1 40 70 22 51
0 56 83 99 46
How would I rewrite the code if I wanted to reverse the sequence every nth row, e.g. every 4th row?
IIUC, you want to reverse by chunk (3, 2, 1, 0, 7, 6, 5, 4…):
One option is to use groupby with a custom group:
N = 4
group = df.index//N
# if the index is not a default 0..n-1 range, group by position instead:
# import numpy as np
# group = np.arange(len(df))//N
df.groupby(group).apply(lambda d: d.iloc[::-1]).droplevel(0)
output:
A B C D
3 45 33 73 77
2 91 34 19 68
1 12 25 55 19
0 65 48 17 4
7 99 99 95 9
.. .. .. .. ..
92 89 68 48 67
99 99 28 52 87
98 47 49 21 8
97 80 18 92 5
96 49 12 24 40
[100 rows x 4 columns]
A very fast method, based only on indexing, is to use numpy to generate the positions reversed by chunk (the reshape requires len(df) to be a multiple of N):
import numpy as np
N = 4
idx = np.arange(len(df)).reshape(-1, N)[:, ::-1].ravel()
# array([ 3, 2, 1, 0, 7, 6, 5, 4, 11, ...])
# slice using iloc
df.iloc[idx]
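If the number of rows is not a multiple of N, the reshape above raises an error. A possible variant, sketched here with a hypothetical helper name `reverse_by_chunks`, computes the reversed position arithmetically so that a shorter final chunk is reversed on its own:

```python
import numpy as np
import pandas as pd

def reverse_by_chunks(df, n):
    """Reverse row order inside consecutive chunks of n rows (last chunk may be shorter)."""
    pos = np.arange(len(df))
    chunk_start = (pos // n) * n
    # Last valid position of each chunk, clipped to the end of the frame.
    chunk_end = np.minimum(chunk_start + n, len(df)) - 1
    # Mirror each position inside its own chunk.
    idx = chunk_start + (chunk_end - pos)
    return df.iloc[idx]

df = pd.DataFrame({"A": range(10)})
print(reverse_by_chunks(df, 4).index.tolist())
# [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```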

pandas how to filter and slice with multiple conditions

Using pandas, how do I return the dataframe filtered to rows where 'GEN' is 2 and 'AGE' is 20, excluding the columns named 'GEN' and 'BP'? Thanks in advance :)
AGE GEN BMI BP S1 S2 S3 S4 S5 S6 Y
59 2 32.1 101 157 93.2 38 4 4.8598 87 151
48 1 21.6 87 183 103.2 70 3 3.8918 69 75
72 2 30.5 93 156 93.6 41 4 4.6728 85 141
24 1 25.3 84 198 131.4 40 5 4.8903 89 206
50 1 23 101 192 125.4 52 4 4.2905 80 135
23 1 22.6 89 139 64.8 61 2 4.1897 68 97
20 2 22 90 160 99.6 50 3 3.9512 82 138
66 2 26.2 114 255 185 56 4.5 4.2485 92 63
60 2 32.1 83 179 119.4 42 4 4.4773 94 110
20 1 30 85 180 93.4 43 4 5.3845 88 310
You can do this -
cols = df.columns[~df.columns.isin(['GEN','BP'])]
out=df.loc[(df['GEN'] == 2) & (df['AGE'] == 20),cols]
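Or, with query (column names go unquoted inside the expression, and the columns are selected with [cols]):
out=df.query("GEN == 2 and AGE == 20")[cols]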
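As a quick check, the two approaches can be compared on a small frame. This sketch uses only a few of the question's columns and rows (a subset chosen for brevity) and drops the unwanted columns with `drop(columns=...)`, which is an alternative to building `cols` first:

```python
import pandas as pd

# A reduced version of the question's table (subset of rows and columns).
df = pd.DataFrame({
    "AGE": [59, 48, 72, 20, 20],
    "GEN": [2, 1, 2, 2, 1],
    "BMI": [32.1, 21.6, 30.5, 22.0, 30.0],
    "BP": [101, 87, 93, 90, 85],
    "Y": [151, 75, 141, 138, 310],
})

# Boolean mask, then drop the unwanted columns.
out_loc = df.loc[(df["GEN"] == 2) & (df["AGE"] == 20)].drop(columns=["GEN", "BP"])

# Same filter written as a query expression.
out_query = df.query("GEN == 2 and AGE == 20").drop(columns=["GEN", "BP"])

print(out_loc)
```

Both variants return the single matching row with the AGE, BMI and Y columns.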

SQL Server : create new column category price according to price column

I have a SQL Server table with a column price looking like this:
10
96
64
38
32
103
74
32
67
103
55
28
30
110
79
91
16
71
36
106
89
87
59
41
56
89
68
32
80
47
45
77
64
93
17
88
13
19
83
12
76
99
104
65
83
95
Now my aim is to create a new column giving a category from 1 to 10 to each of those values.
For instance the max value in my column is 110 the min is 10. Max-min = 100. Then if I want to have 10 categories I do 100/10= 10. Therefore here are the ranges:
10-20 1
21-30 2
31-40 3
41-50 4
51-60 5
61-70 6
71-80 7
81-90 8
91-100 9
101-110 10
Desired output:
my new column called cat should look like this:
price cat
-----------------
10 1
96 9
64 6
38 3
32 3
103 10
74 7
32 3
67 6
103 10
55 5
28 2
30 3
110 10
79 7
91 9
16 1
71 7
36 3
106 10
89 8
87 8
59 5
41 4
56 5
89 8
68 6
32 3
80 7
47 4
45 4
77 7
64 6
93 9
17 1
88 8
13 1
19 1
83 8
12 1
76 7
99 9
104 10
65 6
83 8
95 9
Is there a way to do this with T-SQL? Sorry if this question is too easy. I searched the web for a long time, so either the problem is not as simple as I imagine or I entered the wrong keywords.
Yes, almost exactly as you describe the calculation:
select price,
       1 + (price - min_price) * 10 / (max_price - min_price + 1) as decile
from (select price,
             min(price) over () as min_price,
             max(price) over () as max_price
      from t
     ) t;
The 1 + is because you want the values from 1 to 10, rather than 0 to 9.
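The arithmetic in that query can be checked outside the database. Below is a small Python sketch of the same formula, assuming integer prices so that the division truncates as T-SQL integer division does (for non-negative values Python's `//` gives the same result). The function name `price_category` is hypothetical.

```python
def price_category(price, min_price, max_price, n_bins=10):
    """Mirror of 1 + (price - min) * 10 / (max - min + 1) with integer division."""
    return 1 + (price - min_price) * n_bins // (max_price - min_price + 1)

# Boundaries from the question: min is 10, max is 110.
for p in (10, 20, 21, 64, 96, 100, 101, 110):
    print(p, price_category(p, 10, 110))
```

With min 10 and max 110 this reproduces the ranges listed in the question, for example 20 falls in category 1, 21 in category 2, and 110 in category 10.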
Yes - a case statement can do that.
select
    price
    ,case
        when price between 10 and 20 then 1
        when price between 21 and 30 then 2
        when price between 31 and 40 then 3
        when price between 41 and 50 then 4
        when price between 51 and 60 then 5
        when price between 61 and 70 then 6
        when price between 71 and 80 then 7
        when price between 81 and 90 then 8
        when price between 91 and 100 then 9
        when price between 101 and 110 then 10
        else null
    end as cat
from [<enter your table name here>]

How to divide a result set into equal parts?

I have a table new_table
ID PROC_ID DEP_ID OLD_STAFF NEW_STAFF
1 15 43 58 ?
2 19 43 58 ?
3 29 43 58 ?
4 31 43 58 ?
5 35 43 58 ?
6 37 43 58 ?
7 38 43 58 ?
8 39 43 58 ?
9 58 43 58 ?
10 79 43 58 ?
How can I select all proc_ids and update new_staff? For example:
ID PROC_ID DEP_ID OLD_STAFF NEW_STAFF
1 15 43 58 15
2 19 43 58 15
3 29 43 58 15
4 31 43 58 15
5 35 43 58 23
6 37 43 58 23
7 38 43 58 23
8 39 43 58 28
9 58 43 58 28
10 79 43 58 28
staff 15 - 4 proc_ids
staff 23 - 3 proc_ids
staff 28 - 3 proc_ids
staff 58 - is busy
where 15, 23, 28 and 58 are staff in the same department (dep_id)
"how to divide equal parts"
Oracle has a function, ntile() which splits a result set into equal buckets. For instance this query puts your posted data into four buckets:
select id
     , proc_id
     , ntile(4) over (order by id asc) as gen_staff
from new_table;

        ID    PROC_ID  GEN_STAFF
---------- ---------- ----------
         1         15          1
         2         19          1
         3         29          1
         4         31          2
         5         35          2
         6         37          2
         7         38          3
         8         39          3
         9         58          4
        10         79          4

10 rows selected.
This isn't quite the solution you want but you need to clarify your requirements before it's possible to provide a complete answer.
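To illustrate how ntile() sizes its buckets, here is a small Python sketch of the same rule: when the rows do not divide evenly, the earlier buckets each get one extra row. The helper name `ntile` and the staff list are illustrative assumptions; the mapping from bucket number to staff id is one possible reading of the question, not something the question confirms.

```python
def ntile(n_rows, n_buckets):
    """Bucket number (1-based) for each of n_rows ordered rows, as NTILE would assign."""
    base, extra = divmod(n_rows, n_buckets)
    buckets = []
    for b in range(1, n_buckets + 1):
        # The first `extra` buckets receive one additional row.
        size = base + (1 if b <= extra else 0)
        buckets.extend([b] * size)
    return buckets

# Four buckets over ten rows, as in the query above.
print(ntile(10, 4))   # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]

# Three available staff (assumed: 15, 23, 28) over the same ten rows.
staff = [15, 23, 28]
new_staff = [staff[b - 1] for b in ntile(10, 3)]
print(new_staff)      # [15, 15, 15, 15, 23, 23, 23, 28, 28, 28]
```

With three buckets the sizes come out as 4, 3 and 3, which happens to match the example output in the question.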
update new_table
   set new_staff='15'
 where ID in('1','2','3','4');
update new_table
   set new_staff='28'
 where ID in('8','9','10');
update new_table
   set new_staff='23'
 where ID in('5','6','7');
Not sure if this is what you mean.