Evaluation stuck at local optima in genetic programming (DEAP): how to prevent GP from converging on local optima?

I'm trying to do symbolic regression of a geometric model. Most of the time it gets stuck at a fitness score that is not near 0. After some research I found out that the problem is local minima. Some people prioritize diversity of the population over fitness, but that's not what I want.
So I reconfigured algorithms.eaSimple and added a block to it that resets the population when the last n=50 generations have the same fitness.
I don't have any idea other than that, as I'm very new to this.
Is there any better way to do this?
I'm minimizing the fitness: creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
import numpy
from deap import algorithms, tools

def my_eaSimple(population, toolbox, cxpb, mutpb, ngen, stats=None, halloffame: tools.HallOfFame = None,
                verbose=True):
    logbook = tools.Logbook()
    logbook.header = ['gen', 'nevals'] + (stats.fields if stats else [])
    # Evaluate the individuals with an invalid fitness
    invalid_ind = [ind for ind in population if not ind.fitness.valid]
    fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
    for ind, fit in zip(invalid_ind, fitnesses):
        ind.fitness.values = fit
    if halloffame is not None:
        halloffame.update(population)
    record = stats.compile(population) if stats else {}
    logbook.record(gen=0, nevals=len(invalid_ind), **record)
    if verbose:
        print(logbook.stream)
    # Begin the generational process
    gen = 1
    last_few_pop_to_consider = 50
    starting_condition = last_few_pop_to_consider
    # True when the mean of the recent minima is within 0.1 of the oldest one
    is_last_few_fitness_same = lambda stats_array: abs(numpy.mean(stats_array) - stats_array[0]) < 0.1
    while gen < ngen + 1:
        # Select the next generation individuals
        offspring = toolbox.select(population, len(population))
        # Vary the pool of individuals
        offspring = algorithms.varAnd(offspring, toolbox, cxpb, mutpb)
        # Evaluate the individuals with an invalid fitness
        invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
        fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
        for ind, fit in zip(invalid_ind, fitnesses):
            ind.fitness.values = fit
        # Update the hall of fame with the generated individuals
        if halloffame is not None:
            halloffame.update(offspring)
        # Replace the current population by the offspring
        population[:] = offspring
        # Append the current generation statistics to the logbook
        record = stats.compile(population) if stats else {}
        logbook.record(gen=gen, nevals=len(invalid_ind), **record)
        if verbose:
            print(logbook.stream)
        gen += 1
        # stopping criteria
        min_fitness = record['fitness']['min\t']
        # max_fitness = record['fitness']['max\t']
        if min_fitness < 0.1:
            print('Reached desired fitness')
            break
        if gen > starting_condition:
            min_stats = logbook.chapters['fitness'].select('min\t')[-last_few_pop_to_consider:]
            if is_last_few_fitness_same(min_stats):
                print('Defining new population')
                population = toolbox.population(n=500)
                starting_condition = gen + last_few_pop_to_consider
    return population, logbook
Output
gen nevals avg max min std avg max min std
0 500 2.86566e+23 1.41421e+26 112.825 6.31856e+24 10.898 38 3 9.50282
1 451 2.82914e+18 1.41421e+21 90.113 6.31822e+19 6.226 38 1 5.63231
2 458 2.84849e+18 1.41421e+21 89.1206 6.3183e+19 5.602 36 1 5.18417
3 459 4.24902e+14 2.01509e+17 75.1408 9.01321e+15 5.456 35 1 4.05167
4 463 4.23166e+14 2.03904e+17 74.3624 9.11548e+15 6.604 36 1 3.61762
5 462 2.8693e+11 1.25158e+14 65.9366 5.60408e+12 7.464 34 1 3.00478
6 467 2.82843e+18 1.41421e+21 65.9366 6.31823e+19 8.144 37 1 3.51216
7 463 5.40289e+13 2.65992e+16 65.9366 1.1884e+15 8.322 22 1 2.88276
8 450 6.59849e+14 3.29754e+17 59.1286 1.47323e+16 8.744 34 1 3.03685
9 458 1.8128e+11 8.17261e+13 54.4395 3.65075e+12 9.148 23 1 2.69557
10 459 6.59851e+14 3.29754e+17 54.4395 1.47323e+16 9.724 35 1 3.02255
11 458 2.34825e+10 1.41421e+11 54.4395 5.26173e+10 9.842 18 1 2.32057
12 459 3.52996e+11 1.60442e+14 54.4395 7.1693e+12 10.56 33 1 2.63788
13 457 3.81044e+11 1.60442e+14 54.4395 7.18851e+12 11.306 35 1 2.84611
14 457 2.30681e+13 1.15217e+16 54.4395 5.14751e+14 11.724 24 1 2.6495
15 463 2.65947e+10 1.41421e+11 54.4395 5.52515e+10 12.072 29 1 2.63036
16 469 4.54286e+10 9.13693e+12 54.4395 4.10784e+11 12.104 34 1 3.00752
17 461 6.58255e+11 1.74848e+14 54.4395 9.76474e+12 12.738 36 4 3.10956
18 450 2.03669e+10 1.41421e+11 54.4395 4.96374e+10 13.062 30 4 3.01963
19 465 1.75385e+10 2.82843e+11 54.4395 4.74595e+10 13.356 24 1 2.82157
20 458 1.83887e+10 1.41421e+11 54.4395 4.7559e+10 13.282 23 1 3.03949
21 455 3.67899e+10 8.36173e+12 54.4395 4.04044e+11 13.284 34 4 3.03106
22 461 1.36372e+10 1.41422e+11 54.4395 4.16569e+10 13.06 35 3 3.01005
23 471 2.00634e+26 1.00317e+29 54.3658 4.48181e+27 12.798 36 1 3.17698
24 466 2.82843e+18 1.41421e+21 54.3658 6.31823e+19 12.706 36 3 3.07043
25 464 3.00384e+10 8.36174e+12 54.3658 3.75254e+11 12.612 34 5 2.89231
26 474 2.00925e+10 1.41421e+11 54.3658 4.93588e+10 12.594 34 3 2.60253
27 452 2.9528e+11 1.41626e+14 54.3658 6.32694e+12 12.43 25 1 2.49822
28 453 1.23899e+10 1.41421e+11 54.3658 3.98511e+10 12.41 20 5 2.45721
29 456 5.98529e+14 2.99256e+17 54.3658 1.33697e+16 12.57 37 1 2.6346
30 474 1.35672e+13 6.69898e+15 54.3658 2.99297e+14 12.526 35 1 2.94029
31 446 6.92755e+22 3.46377e+25 54.3658 1.5475e+24 12.55 36 1 2.62517
32 462 4.02525e+10 8.16482e+12 54.3658 3.92769e+11 12.764 34 5 2.77061
33 449 1.53268e+13 7.65519e+15 54.3658 3.42007e+14 12.628 35 1 2.76218
34 466 3.13214e+16 1.54388e+19 54.3658 6.89799e+17 12.626 35 1 2.97626
35 464 2.82845e+18 1.41421e+21 54.3658 6.31823e+19 12.806 36 5 2.74597
36 460 2.93493e+11 1.32308e+14 54.3658 5.91505e+12 12.734 35 5 2.88084
37 456 2.93491e+10 8.29826e+12 54.3658 3.72372e+11 12.614 37 1 2.80517
38 449 3.44519e+10 8.16482e+12 54.3658 3.67344e+11 12.742 34 3 2.91881
39 466 1.53217e+13 7.65519e+15 54.3658 3.42008e+14 12.502 35 3 2.70296
40 454 2.82843e+18 1.41421e+21 54.3658 6.31823e+19 12.51 36 1 2.81103
41 453 9.66059e+24 4.68888e+27 54.3658 2.09566e+26 12.554 33 1 2.47691
42 448 2.2287e+10 3.38289e+12 54.3658 1.58629e+11 12.576 26 1 2.50763
43 460 5.47399e+12 2.73042e+15 54.3658 1.21985e+14 12.584 34 1 2.80053
44 460 2.82843e+18 1.41421e+21 54.3658 6.31823e+19 12.692 27 1 2.86516
45 464 2.829e+18 1.41421e+21 54.3658 6.31823e+19 12.57 34 1 3.15549
46 460 2.92607e+11 1.31556e+14 54.3658 5.88776e+12 12.61 37 3 2.78817
47 465 2.82843e+18 1.41421e+21 54.3658 6.31823e+19 12.622 36 1 3.04616
48 461 1.64306e+10 2.97245e+12 54.3658 1.37408e+11 12.468 26 1 2.57856
49 463 1.54834e+10 1.41421e+11 54.3658 4.4029e+10 12.464 20 1 2.4529
50 451 1.59239e+10 1.41421e+11 54.3658 4.44609e+10 12.63 33 1 2.76281
51 455 5.40036e+19 2.70018e+22 54.3658 1.20635e+21 12.78 37 1 2.84668
52 478 2.82843e+18 1.41421e+21 54.3658 6.31823e+19 12.712 36 3 2.84694
53 461 2.78669e+21 1.39193e+24 54.3658 6.21866e+22 12.714 36 1 3.23546
54 471 7.41272e+12 3.70045e+15 54.3658 1.65323e+14 12.336 34 3 2.848
55 465 2.83036e+18 1.41421e+21 54.3658 6.31822e+19 12.74 36 1 3.62662
56 459 2.82843e+18 1.41421e+21 54.3658 6.31823e+19 12.606 29 1 2.60437
57 453 5.98308e+24 2.99154e+27 54.3658 1.33652e+26 12.722 34 1 2.62311
58 460 3.62463e+21 1.8109e+24 54.3658 8.09047e+22 12.65 37 1 2.92361
Defining new population
59 500 5.83025e+48 2.91513e+51 109.953 1.30238e+50 10.846 38 1 8.89889
60 464 2.93632e+15 8.87105e+17 165.988 4.38882e+16 5.778 36 1 4.79173
61 444 5.54852e+19 2.70018e+22 93.5182 1.20674e+21 4.992 37 1 4.648
62 463 4.28647e+14 2.14148e+17 82.0774 9.56741e+15 5.468 34 1 4.34891
63 464 2.82843e+18 1.41421e+21 78.8184 6.31823e+19 6.624 35 1 4.25989
64 453 3.40035e+11 1.60954e+14 68.7629 7.19022e+12 7.356 36 1 3.77694
65 456 5.65762e+18 2.82851e+21 68.7629 1.26368e+20 7.606 35 1 4.15966
66 461 2.82843e+18 1.41421e+21 68.7629 6.31823e+19 7.906 35 1 3.81171
67 447 1.63302e+10 1.41421e+11 68.7629 4.51102e+10 7.802 33 1 3.47258
68 463 6.59552e+14 3.29754e+17 68.7629 1.47323e+16 8.37 34 3 3.80698
69 460 1.53579e+13 7.65512e+15 68.7629 3.42003e+14 8.646 35 1 3.64042
70 461 2.80014e+10 1.41421e+11 68.7629 5.63553e+10 9.212 38 1 3.69582
71 453 1.97446e+11 7.80484e+13 68.7629 3.50764e+12 9.84 34 1 3.74785
72 459 9.98853e+11 1.75397e+14 68.7629 1.25317e+13 10.284 35 3 3.61764
73 453 5.6863e+16 2.84218e+19 68.7629 1.26979e+18 10.796 36 1 3.86864
74 466 2.57445e+10 1.41434e+11 68.7629 5.4564e+10 10.806 35 1 3.2949
75 453 2.82849e+18 1.41421e+21 68.7629 6.31823e+19 10.876 34 1 3.27301
76 433 1.67235e+20 8.36174e+22 68.7629 3.73574e+21 10.868 35 1 2.94051
77 457 3.6663e+21 1.83315e+24 68.7629 8.1899e+22 10.964 37 3 3.21476
78 461 1.80829e+14 9.04015e+16 68.7629 4.03883e+15 10.992 35 3 3.26985
79 450 3.21984e+11 1.41626e+14 68.7629 6.32593e+12 11.17 28 1 2.77941
80 460 2.82843e+18 1.41421e+21 68.7629 6.31823e+19 11.044 35 1 3.25362
81 455 6.46751e+14 2.99308e+17 68.7629 1.34123e+16 11.06 34 1 3.51061
82 463 3.21908e+21 1.60954e+24 68.7629 7.19088e+22 11.112 34 1 3.58433
83 473 2.82843e+18 1.41421e+21 68.7629 6.31823e+19 10.946 38 3 3.70663
84 460 3.14081e+11 1.41626e+14 68.7629 6.32625e+12 10.896 35 1 3.4976
85 456 1.53419e+13 7.65526e+15 68.7629 3.4201e+14 11.156 36 1 3.23661
The population gets reset at the 59th generation, after the minimum fitness has stayed at about 54.4 for 50 generations.
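The stagnation test used in the loop above can be isolated as a small helper, which makes it easier to check on its own. This is only a sketch: the name `is_stagnant` and its `window` and `tol` parameters are made up for illustration and are not part of DEAP. It applies the same rule as the lambda in the question (mean of the recent minima compared with the oldest one).

```python
from statistics import mean

def is_stagnant(min_history, window=50, tol=0.1):
    """Return True when the last `window` best-fitness values are effectively flat.

    Hypothetical helper: compares the mean of the window with its first value,
    as the lambda in the question does.
    """
    if len(min_history) < window:
        return False  # not enough generations recorded yet
    recent = min_history[-window:]
    return abs(mean(recent) - recent[0]) < tol

# Illustrative history: a few improving generations, then a long plateau.
history = [112.8, 90.1, 75.1] + [54.4] * 50
print(is_stagnant(history))
print(is_stagnant(history[:20]))
```

Note that this rule only looks at the distance between the window mean and its first value, so a tighter `tol` or a max-minus-min test may be preferable if small improvements should count as progress.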

Related

pandas df add new column based on proportion of two other columns from another dataframe

I have df1 which has three columns (loadgroup, cartons, blocks) like this
loadgroup             cartons  blocks  cartonsPercent  blocksPercent
1                     2269     14      26%             21%
2                     1168     13      13%             19%
3                     937      8       11%             12%
4                     2753     24      31%             35%
5                     1686     9       19%             13%
total(sum of column)  8813     68      100%            100%
The interpretation is: in df1, 26% of the cartons, which are also 21% of the blocks, are assigned to loadgroup 1, and so on. We can assume blocks are numbered 1 to 68 and cartons 1 to 8813.
I also have df2, which has cartons and blocks columns but no loadgroup.
My goal is to assign a loadgroup (also 1-5) to each row of df2 (100 blocks, 29608 cartons in total) while keeping the proportions: for df2, 26% of cartons and 21% of blocks go to loadgroup 1, 13% of cartons and 19% of blocks go to loadgroup 2, etc.
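To make the goal concrete, the shares in df1 and the corresponding targets for df2 can be worked out as follows. This is a plain-Python sketch using the totals quoted above; the dictionary and function names are illustrative only.

```python
# Counts per loadgroup taken from the df1 table.
df1_cartons = {1: 2269, 2: 1168, 3: 937, 4: 2753, 5: 1686}
df1_blocks = {1: 14, 2: 13, 3: 8, 4: 24, 5: 9}

# Totals of df2 as stated in the question.
DF2_CARTONS, DF2_BLOCKS = 29608, 100

def shares(counts):
    """Fraction of the column total that falls in each loadgroup."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

carton_share = shares(df1_cartons)
block_share = shares(df1_blocks)

# Target number of cartons and blocks per loadgroup in df2.
targets = {
    group: (round(carton_share[group] * DF2_CARTONS),
            round(block_share[group] * DF2_BLOCKS))
    for group in df1_cartons
}

for group, (cartons, blocks) in targets.items():
    print(group, cartons, blocks)
```

Because each target is rounded independently, the rounded targets need not add up exactly to the df2 totals.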
df2 is like this:
block  cartons
0      533
1      257
2      96
3      104
4      130
5      71
6      68
7      87
8      99
9      51
10     291
11     119
12     274
13     316
14     87
15     149
16     120
17     222
18     100
19     148
20     192
21     188
22     293
23     120
24     224
25     449
26     385
27     395
28     418
29     423
30     244
31     327
32     337
33     249
34     528
35     528
36     494
37     540
38     368
39     533
40     614
41     462
42     350
43     618
44     463
45     552
46     397
47     401
48     397
49     365
50     475
51     379
52     541
53     488
54     383
55     354
56     760
57     327
58     211
59     356
60     552
61     401
62     320
63     368
64     311
65     421
66     458
67     278
68     504
69     385
70     242
71     413
72     246
73     465
74     386
75     231
76     154
77     294
78     275
79     169
80     398
81     227
82     273
83     319
84     177
85     272
86     204
87     139
88     187
89     263
90     90
91     134
92     67
93     115
94     45
95     65
96     40
97     108
98     60
99     102
total: 100 blocks, 29608 cartons
I want to add a loadgroup column to df2 and keep those proportions as close as possible. How can I do it? Thank you very much for the help.
I don't know how to derive the loadgroup column from both the cartons percent and the blocks percent, but generating a random loadgroup based on either one alone is easy.
Here is what I did. I first generate 100,000 seeds. For each seed, I add a column loadGroup1 drawn according to the cartons percent and a column loadGroup2 drawn according to the blocks percent, compute both percentages for each, compare them with the df1 percentages, and record the total absolute difference. Across the 100,000 seeds I take the one with the minimum difference as my solution, which is sufficient for my job.
But this is not the optimal solution, and I am looking for a quick and easy way to do this. Hope somebody can help.
Here is my code.
df = pd.DataFrame()
np.random.seed(10000)
seeds = np.random.randint(1, 1000000, size = 100000)
for i in range(46530, 46537):
    print(seeds[i])
    np.random.seed(seeds[i])
    df2['loadGroup1'] = np.random.choice(df1.loadgroup, len(df2), p = df1.CartonsPercent)
    df2['loadGroup2'] = np.random.choice(df1.loadgroup, len(df2), p = df1.blocksPercent)
    df2.reset_index(inplace = True)
    three = pd.DataFrame(df2.groupby('loadGroup1').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
    three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
    three['blocksPercent'] = three.blocks/three.blocks.sum()
    four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
    four = four.abs()
    subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['CartonsPercent'], 'AbsDiff':[four.sum().sum()]})
    df = pd.concat([df,subdf])
    three = pd.DataFrame(df2.groupby('loadGroup2').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
    three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
    three['blocksPercent'] = three.blocks/three.blocks.sum()
    four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
    four = four.abs()
    subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['blocksPercent'], 'AbsDiff':[four.sum().sum()]})
    df = pd.concat([df,subdf])
df.sort_values(by = 'AbsDiff', ascending = True, inplace = True)
df = df.head(10)
The first row of df tells me the seed I am looking for; I kept 10 rows just out of curiosity.
Here is my solution.
block  cartons  loadgroup
0      533      4
1      257      1
2      96       4
3      104      4
4      130      4
5      71       2
6      68       1
7      87       4
8      99       4
9      51       4
10     291      4
11     119      2
12     274      2
13     316      4
14     87       4
15     149      5
16     120      3
17     222      2
18     100      2
19     148      2
20     192      3
21     188      4
22     293      1
23     120      2
24     224      4
25     449      1
26     385      5
27     395      3
28     418      1
29     423      4
30     244      5
31     327      1
32     337      5
33     249      4
34     528      1
35     528      1
36     494      5
37     540      3
38     368      2
39     533      4
40     614      5
41     462      4
42     350      5
43     618      4
44     463      2
45     552      1
46     397      3
47     401      3
48     397      1
49     365      1
50     475      4
51     379      1
52     541      1
53     488      2
54     383      2
55     354      1
56     760      5
57     327      4
58     211      2
59     356      5
60     552      4
61     401      1
62     320      1
63     368      3
64     311      3
65     421      2
66     458      5
67     278      4
68     504      5
69     385      4
70     242      4
71     413      1
72     246      2
73     465      5
74     386      4
75     231      1
76     154      4
77     294      4
78     275      1
79     169      4
80     398      4
81     227      4
82     273      1
83     319      3
84     177      4
85     272      5
86     204      3
87     139      1
88     187      4
89     263      4
90     90       4
91     134      4
92     67       3
93     115      3
94     45       2
95     65       2
96     40       4
97     108      2
98     60       2
99     102      1
Here are the summaries.
loadgroup  cartons  blocks  cartonsPercent  blocksPercent
1          7610     22      26%             22%
2          3912     18      13%             18%
3          3429     12      12%             12%
4          9269     35      31%             35%
5          5388     13      18%             13%
It's very close to my target though.
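The seed search described above boils down to two pieces: a score that sums the absolute gaps between achieved and target shares, and a loop that keeps the best random assignment. The following is a minimal standard-library sketch of that idea, not the code from the question: the function names are hypothetical, the assignment is drawn from the block shares only, and the carton counts are just the first ten rows of df2.

```python
import random

def score(groups, cartons, target_cartons, target_blocks):
    """Sum of absolute gaps between achieved and target shares (lower is better)."""
    total_cartons = sum(cartons)
    n_blocks = len(cartons)
    gap = 0.0
    for group in target_cartons:
        members = [c for c, g in zip(cartons, groups) if g == group]
        gap += abs(sum(members) / total_cartons - target_cartons[group])
        gap += abs(len(members) / n_blocks - target_blocks[group])
    return gap

def random_search(cartons, target_cartons, target_blocks, n_trials=1000, seed=0):
    """Try `n_trials` random assignments and return the best one with its score."""
    rng = random.Random(seed)
    labels = list(target_blocks)
    weights = [target_blocks[g] for g in labels]
    best_groups, best_gap = None, float("inf")
    for _ in range(n_trials):
        groups = rng.choices(labels, weights=weights, k=len(cartons))
        gap = score(groups, cartons, target_cartons, target_blocks)
        if gap < best_gap:
            best_groups, best_gap = groups, gap
    return best_groups, best_gap

# Target shares from df1 and the first ten carton counts of df2.
target_cartons = {1: 0.26, 2: 0.13, 3: 0.11, 4: 0.31, 5: 0.19}
target_blocks = {1: 0.21, 2: 0.19, 3: 0.12, 4: 0.35, 5: 0.13}
cartons = [533, 257, 96, 104, 130, 71, 68, 87, 99, 51]

groups, gap = random_search(cartons, target_cartons, target_blocks)
print(groups, round(gap, 3))
```

Keeping the score in one function makes it easy to swap the random loop for another search strategy later without changing how solutions are compared.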

pandas how to filter and slice with multiple conditions

Using pandas, how do I return a dataframe filtered to rows with value 2 in the 'GEN' column and value 20 in the 'AGE' column, excluding the columns named 'GEN' and 'BP'? Thanks in advance :)
AGE GEN BMI BP S1 S2 S3 S4 S5 S6 Y
59 2 32.1 101 157 93.2 38 4 4.8598 87 151
48 1 21.6 87 183 103.2 70 3 3.8918 69 75
72 2 30.5 93 156 93.6 41 4 4.6728 85 141
24 1 25.3 84 198 131.4 40 5 4.8903 89 206
50 1 23 101 192 125.4 52 4 4.2905 80 135
23 1 22.6 89 139 64.8 61 2 4.1897 68 97
20 2 22 90 160 99.6 50 3 3.9512 82 138
66 2 26.2 114 255 185 56 4.5 4.2485 92 63
60 2 32.1 83 179 119.4 42 4 4.4773 94 110
20 1 30 85 180 93.4 43 4 5.3845 88 310
You can do this:
cols = df.columns[~df.columns.isin(['GEN','BP'])]
out = df.loc[(df['GEN'] == 2) & (df['AGE'] == 20), cols]
or, using query (column names are written without quotes inside the expression):
out = df.query("GEN == 2 and AGE == 20")[cols]
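A quick way to check the row filter and the column exclusion together is to run them on a few rows. The sketch below uses a reduced version of the table from the question (only some columns and rows, chosen for illustration) and drops the unwanted columns with `drop` instead of building `cols`.

```python
import pandas as pd

# Reduced sample of the data in the question.
df = pd.DataFrame({
    "AGE": [59, 48, 20, 20],
    "GEN": [2, 1, 2, 1],
    "BMI": [32.1, 21.6, 22.0, 30.0],
    "BP": [101, 87, 90, 85],
    "Y": [151, 75, 138, 310],
})

# Keep rows with GEN == 2 and AGE == 20, then drop the two columns.
out = df.query("GEN == 2 and AGE == 20").drop(columns=["GEN", "BP"])
print(out)
```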

create new column from divided columns over iteration

I am working with the following code:
url = 'https://raw.githubusercontent.com/dothemathonthatone/maps/master/fertility.csv'
df = pd.read_csv(url)
year regional_schlüssel Aus15 Deu15 Aus16 Deu16 Aus17 Deu17 Aus18 Deu18 ... aus36 aus37 aus38 aus39 aus40 aus41 aus42 aus43 aus44 aus45
0 2000 5111000 0 4 8 25 20 45 56 89 ... 935 862 746 732 792 660 687 663 623 722
1 2000 5113000 1 1 4 14 13 33 19 48 ... 614 602 498 461 521 470 393 411 397 400
2 2000 5114000 0 11 0 5 2 13 7 20 ... 317 278 265 235 259 228 204 173 213 192
3 2000 5116000 0 2 2 7 3 28 13 26 ... 264 217 206 207 197 177 171 146 181 169
4 2000 5117000 0 0 3 1 2 4 4 7 ... 135 129 118 116 128 148 89 110 124 83
I would like to create a new set of columns fertility_deu15, ..., fertility_deu45 and fertility_aus15, ..., fertility_aus45 such that fertility_aus15 = aus15 / Aus15 and fertility_deu15 = deu15 / Deu15, and likewise for every age i in [15, 45]: fertility_ausi = ausi / Ausi and fertility_deui = deui / Deui.
I'm not sure what is up with that data, but we need to fix it to make it numeric. I'll do that while filtering:
numerator = df.filter(regex=r'^[a-z]+\d+$') # Lower case ones
numerator = numerator.apply(pd.to_numeric, errors='coerce') # Fix numbers
denominator = df.filter(regex=r'^[A-Z][a-z]+\d+$').rename(columns=str.lower)
denominator = denominator.apply(pd.to_numeric, errors='coerce')
numerator.div(denominator).add_prefix('fertility_')
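The approach can be tried on a tiny frame to see how the column alignment works. The data below is invented for illustration (one age group, two rows, one non-numeric cell); it is not taken from the CSV.

```python
import pandas as pd

# Made-up sample: capitalised columns are denominators, lower-case ones numerators.
df = pd.DataFrame({
    "year": [2000, 2000],
    "Aus15": [10, 20],
    "Deu15": [40, 50],
    "aus15": [1, 4],
    "deu15": ["2", "."],  # stored as text, with one non-numeric entry
})

# Lower-case name followed by digits, e.g. aus15; 'year' has no digits and is skipped.
numerator = df.filter(regex=r"^[a-z]+\d+$").apply(pd.to_numeric, errors="coerce")

# Capitalised name followed by digits, renamed so the labels match the numerators.
denominator = (
    df.filter(regex=r"^[A-Z][a-z]+\d+$")
    .rename(columns=str.lower)
    .apply(pd.to_numeric, errors="coerce")
)

# div aligns on column labels, so aus15 is divided by Aus15 and deu15 by Deu15.
fertility = numerator.div(denominator).add_prefix("fertility_")
print(fertility)
```

To attach the result to the original frame, `df.join(fertility)` would be one option.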

Find the average in Pig and sort in ascending order

I have a schema with 9 fields and I want to take only two of them (the 6th and 7th, i.e. $5 and $6). I want to calculate the average of $5 and sort $6 in ascending order. How can I do this? Can someone help me?
Input Data:
N368SW 188 170 175 17 -1 MCO MHT 1142
N360SW 100 115 87 -10 5 MCO MSY 550
N626SW 114 115 90 13 14 MCO MSY 550
N252WN 107 115 84 -10 -2 MCO MSY 550
N355SW 104 115 85 -1 10 MCO MSY 550
N405WN 113 110 96 14 11 MCO ORF 655
N456WN 110 110 92 24 24 MCO ORF 655
N743SW 144 155 124 7 18 MCO PHL 861
N276WN 142 150 129 -2 6 MCO PHL 861
N369SW 153 145 134 30 22 MCO PHL 861
N363SW 151 145 137 5 -1 MCO PHL 861
N346SW 141 150 128 51 60 MCO PHL 861
N785SW 131 145 118 -15 -1 MCO PHL 861
N635SW 144 155 127 -6 5 MCO PHL 861
N242WN 298 300 276 68 70 MCO PHX 1848
N439WN 130 140 111 -4 6 MCO PIT 834
N348SW 140 135 124 7 2 MCO PIT 834
N672SW 136 135 122 9 8 MCO PIT 834
N493WN 151 160 136 -9 0 MCO PVD 1073
N380SW 170 155 155 13 -2 MCO PVD 1073
N705SW 164 160 147 6 2 MCO PVD 1073
N233LV 157 160 143 1 4 MCO PVD 1073
N786SW 156 160 139 6 10 MCO PVD 1073
N280WN 160 160 146 1 1 MCO PVD 1073
N282WN 104 95 81 10 1 MCO RDU 534
N694SW 89 100 77 3 14 MCO RDU 534
N266WN 94 95 82 9 10 MCO RDU 534
N218WN 98 100 77 12 14 MCO RDU 534
N355SW 47 50 35 15 18 MCO RSW 133
N388SW 44 45 30 37 38 MCO RSW 133
N786SW 46 50 31 4 8 MCO RSW 133
N707SA 52 50 33 10 8 MCO RSW 133
N795SW 176 185 153 -9 0 MCO SAT 1040
N402WN 176 185 161 4 13 MCO SAT 1040
N690SW 123 130 107 -1 6 MCO SDF 718
N457WN 135 130 105 20 15 MCO SDF 718
N720WN 144 155 131 13 24 MCO STL 880
N775SW 147 160 135 -6 7 MCO STL 880
N291WN 136 155 122 96 115 MCO STL 880
N247WN 144 155 127 43 54 MCO STL 880
N748SW 179 185 159 -4 2 MDW ABQ 1121
N709SW 176 190 158 21 35 MDW ABQ 1121
N325SW 110 105 97 36 31 MDW ALB 717
N305SW 116 110 90 107 101 MDW ALB 717
N403WN 145 165 128 -6 14 MDW AUS 972
N767SW 136 165 125 59 88 MDW AUS 972
N730SW 118 120 100 28 30 MDW BDL 777
I have written the code like this, but it is not working properly:
a = load '/path/to/file' using PigStorage('\t');
b = foreach a generate (int)$5 as field_a:int,(chararray)$6 as field_b:chararray;
c = group b all;
d = foreach c generate b.field_b,AVG(b.field_a);
e = order d by field_b ASC;
dump e;
I am getting an error at the order by:
grunt> a = load '/user/horton/sample_pig_data.txt' using PigStorage('\t');
grunt> b = foreach a generate (int)$5 as fielda:int,(chararray)$6 as fieldb:chararray;
grunt> describe #;
b: {fielda: int,fieldb: chararray}
grunt> c = group b all;
grunt> describe #;
c: {group: chararray,b: {(fielda: int,fieldb: chararray)}}
grunt> d = foreach c generate b.fieldb,AVG(b.fielda);
grunt> e = order d by fieldb ;
2017-01-05 15:51:29,623 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 6, column 15> Invalid field projection. Projected field [fieldb] does not exist in schema: :bag{:tuple(fieldb:chararray)},:double.
Details at logfile: /root/pig_1483631021021.log
I want output like this (not related to the input data):
(({(Bharathi),(Komal),(Archana),(Trupthi),(Preethi),(Rajesh),(siddarth),(Rajiv) },
{ (72) , (83) , (87) , (75) , (93) , (90) , (78) , (89) }),83.375)
If you have found the answer, best practice is to post it so that others referring to this can have a better understanding.

Pandas, getting mean and sum with groupby

I have a data frame, df, which looks like this:
index New Old Map Limit count
1 93 35 54 > 18 1
2 163 93 116 > 18 1
3 134 78 96 > 18 1
4 117 81 93 > 18 1
5 194 108 136 > 18 1
6 125 57 79 <= 18 1
7 66 39 48 > 18 1
8 120 83 95 > 18 1
9 150 98 115 > 18 1
10 149 99 115 > 18 1
11 148 85 106 > 18 1
12 92 55 67 <= 18 1
13 64 24 37 > 18 1
14 84 53 63 > 18 1
15 99 70 79 > 18 1
I need to produce a data frame that looks like this:
Limit <=18 >18
total mean total mean
New xx1 yy1 aa1 bb1
Old xx2 yy2 aa2 bb2
MAP xx3 yy3 aa3 bb3
I tried this without success:
df.groupby('Limit')['New', 'Old', 'MAP'].[sum(), mean()].T
How can I achieve this in pandas?
You can use groupby with agg, then transpose by T and unstack:
print (df[['New', 'Old', 'Map', 'Limit']].groupby('Limit').agg([sum, 'mean']).T.unstack())
Limit <= 18 > 18
sum mean sum mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
Edited following a comment, this looks nicer:
print (df.groupby('Limit')[['New', 'Old', 'Map']].agg(['sum', 'mean']).T.unstack())
And if you need the columns named total:
print (df.groupby('Limit')[['New', 'Old', 'Map']]
    .agg(['sum', 'mean'])
    .rename(columns={'sum': 'total'})
    .T
    .unstack())
Limit <= 18 > 18
total mean total mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
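The reshaping can be checked on a handful of rows. The sketch below takes rows 1, 2, 6 and 12 from the question's table and assumes a recent pandas, where the columns are selected with a list and the `sum` column is renamed to `total` afterwards; the variable name `summary` is illustrative.

```python
import pandas as pd

# Rows 1, 2, 6 and 12 of the table in the question.
df = pd.DataFrame({
    "New": [93, 163, 125, 92],
    "Old": [35, 93, 57, 55],
    "Map": [54, 116, 79, 67],
    "Limit": ["> 18", "> 18", "<= 18", "<= 18"],
})

summary = (
    df.groupby("Limit")[["New", "Old", "Map"]]
    .agg(["sum", "mean"])              # columns become (variable, statistic)
    .rename(columns={"sum": "total"})  # relabel the statistic level
    .T                                 # (variable, statistic) moves to the index
    .unstack()                         # statistic moves under each Limit value
)
print(summary)
```

The row and column order after `unstack` may differ from the printed output above, since `unstack` sorts the labels.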