Getting total coverage considering staggered values in awk

Here is my data structure.
# id1 id2 len start end
# 9 16792 5475 4181 4232
# 11 16792 2317 1086 1137
# 11 32879 2317 8 60
# 11 32858 2317 10 52
# 11 30670 2317 17 63
# 14 12645 532 3 67
# 14 12645 532 158 222
# 14 11879 532 3 223
# 18 23847 644 64 285
# 18 30160 644 98 285
# 18 30160 644 345 477
# 18 30160 644 516 644
I want to compute the coverage of each id1 relative to its length (column len), taking into account the start and end values of all of its entries.
The problem is that multiple entries can overlap, so simply adding up the spans of every entry would overestimate the coverage. Taking only the smallest start and the largest end is not enough either, because there can be gaps where part of the length is not covered at all.
Also, for each interval that is counted I need to add 1 (the coordinates are inclusive) so the whole coverage is accounted for.
My expected result should be something like this
9 = 51 / 5475 = 0.009
11 = 108 / 2317 = 0.047
14 = 221 / 532 = 0.415
18 = 484 / 644 = 0.75
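As a minimal sketch of the interval-merging logic (shown in Python rather than awk purely to illustrate the idea; the file name data.txt and the absence of the leading # in the real input are assumptions):
from collections import defaultdict

intervals = defaultdict(list)   # id1 -> list of (start, end)
length = {}                     # id1 -> len

with open('data.txt') as fh:    # assumed file name
    for line in fh:
        id1, _, ln, start, end = line.split()
        length[id1] = int(ln)
        intervals[id1].append((int(start), int(end)))

for id1, ivs in intervals.items():
    ivs.sort()
    covered = 0
    cur_start, cur_end = ivs[0]
    for start, end in ivs[1:]:
        if start > cur_end + 1:                  # gap: close the current merged interval
            covered += cur_end - cur_start + 1   # +1 because both ends are inclusive
            cur_start, cur_end = start, end
        else:                                    # overlapping or adjacent: extend it
            cur_end = max(cur_end, end)
    covered += cur_end - cur_start + 1           # close the last interval
    print(f"{id1} = {covered} / {length[id1]} = {covered / length[id1]:.3f}")
With the sample data this reproduces the 108 / 2317, 221 / 532, and 484 / 644 figures above.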

Related

Fast way to generate sequences for RNN/LSTM Model from Pandas Dataframe

I have a pandas dataframe which includes 200k+ unique IDs, and for each ID there are 33 time periods with numeric data as well as padded values to ensure the same sequence length for my RNN model. The data is sorted already by ID and time. Overall, the pandas dataframe has just over 8M rows! I'm generating the np array sequences currently, but it takes so long (code below).
I'm wondering if there is a faster or more efficient way to generate them. I'll share a data sample below as well as my current code. Thank you in advance and let me know if there are any follow up questions!
# pandas dataframe example
id time v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 y
a1 1 161 11 137 15 386 15 275 10 422 1 344 3 18 0
a1 2 487 14 77 11 329 7 188 12 174 2 462 14 14 0
a1 3 92 2 12 11 226 20 50 5 313 9 65 19 13 0
… … … … … … … … … … … … … … … …
a1 31 434 6 30 12 216 12 151 18 470 20 414 4 13 0
a1 32 271 2 456 12 100 19 198 12 377 14 205 13 1 0
a1 33 183 15 34 10 499 20 229 6 191 20 145 13 2 0
a2 1 247 10 54 6 115 14 102 9 39 6 34 1 8 0
a2 2 216 19 423 4 205 13 458 12 226 7 264 18 6 0
a2 3 311 7 285 5 147 6 30 17 332 10 116 13 1 0
… … … … … … … … … … … … … … … …
a2 31 124 13 62 11 229 4 242 20 261 6 350 16 8 0
a2 32 359 9 290 2 72 14 478 2 197 14 188 7 11 0
a2 33 410 15 370 18 34 9 387 5 218 9 257 5 1 1
# current sequence generation code
def create_sequence_data(df, num_features=num_features, y_label='y'):
    # num_features defaults to a list of feature column names defined elsewhere in the script
    id_list = df['id'].unique()
    x_sequence = []
    y_local = []
    ids = []
    for idx in id_list:
        # one list of lists (33 timesteps x features) per id
        x_sequence.append(df[df['id'] == idx][num_features].values.tolist())
        y_local.append(df[df['id'] == idx][y_label].max())
        ids.append(idx)
    return x_sequence, y_local, ids

x_seq, y_vals, idx_ls = create_sequence_data(df=df, y_label='y')
This solution actually worked fast enough for me. In the code prior to the one above I was also looping through the rows for each id, but filtering on id alone and then calling values.tolist() outputs a list of lists that is perfect for the sequence modeling.
In terms of time, the total row count across the several dataframes was just over 8M, and I was able to get through each one in under a couple of hours, so not too bad compared to before.
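For reference, a single groupby pass avoids re-filtering the 8M-row frame once per id. This is only a sketch along the same lines as the code above (it assumes the same column names and num_features list):
def create_sequence_data_grouped(df, num_features, y_label='y'):
    x_sequence, y_local, ids = [], [], []
    # groupby(sort=False) keeps the existing id/time ordering and walks the
    # frame once instead of filtering it once per id
    for idx, group in df.groupby('id', sort=False):
        x_sequence.append(group[num_features].values.tolist())
        y_local.append(group[y_label].max())
        ids.append(idx)
    return x_sequence, y_local, ids

# x_seq, y_vals, idx_ls = create_sequence_data_grouped(df, num_features=num_features, y_label='y')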

pandas df add new column based on proportion of two other columns from another dataframe

I have df1, which has loadgroup, cartons, and blocks columns (plus their percentage columns), like this:
loadgroup             cartons  blocks  cartonsPercent  blocksPercent
1                     2269     14      26%             21%
2                     1168     13      13%             19%
3                     937      8       11%             12%
4                     2753     24      31%             35%
5                     1686     9       19%             13%
total (sum of column) 8813     68      100%            100%
The interpretation is like this: 26% of df1's cartons, which is also 21% of its blocks, is assigned to loadgroup 1, and so on. We can assume blocks are 1 to 68 and cartons are 1 to 8813.
I also have df2, which also has cartons and blocks columns but does not have loadgroup.
My goal is to assign loadgroup (1-5 as well) to df2 (100 blocks, 29608 cartons in total) while keeping the proportions; for example, for df2, 26% of the cartons and 21% of the blocks get loadgroup 1, 13% of the cartons and 19% of the blocks get loadgroup 2, and so on.
df2 is like this:
block cartons
0 533
1 257
2 96
3 104
4 130
5 71
6 68
7 87
8 99
9 51
10 291
11 119
12 274
13 316
14 87
15 149
16 120
17 222
18 100
19 148
20 192
21 188
22 293
23 120
24 224
25 449
26 385
27 395
28 418
29 423
30 244
31 327
32 337
33 249
34 528
35 528
36 494
37 540
38 368
39 533
40 614
41 462
42 350
43 618
44 463
45 552
46 397
47 401
48 397
49 365
50 475
51 379
52 541
53 488
54 383
55 354
56 760
57 327
58 211
59 356
60 552
61 401
62 320
63 368
64 311
65 421
66 458
67 278
68 504
69 385
70 242
71 413
72 246
73 465
74 386
75 231
76 154
77 294
78 275
79 169
80 398
81 227
82 273
83 319
84 177
85 272
86 204
87 139
88 187
89 263
90 90
91 134
92 67
93 115
94 45
95 65
96 40
97 108
98 60
99 102
total: 100 blocks, 29608 cartons
I want to add a loadgroup column to df2, keeping those proportions as close as possible. How can I do it? Thank you very much for the help.
I don't know how to pick the loadgroup based on both the cartons percent and the blocks percent, but generating a random loadgroup based on either one of them alone is easy.
Here is what I did. I generated 100,000 seeds first; then, for each seed, I added a column loadgroup1 based on the cartons percent and loadgroup2 based on the blocks percent, calculated both percentages, compared them with the df1 percentages, and recorded the absolute difference. Over those 100,000 seeds I took the one with the minimum difference as my solution, which is sufficient for my job.
But this is not an optimal solution, and I am looking for a quick and easy way to do this. Hope somebody can help.
Here is my code.
df = pd.DataFrame()
np.random.seed(10000)
seeds = np.random.randint(1, 1000000, size = 100000)
for i in range(46530, 46537):
    print(seeds[i])
    np.random.seed(seeds[i])
    df2['loadGroup1'] = np.random.choice(df1.loadgroup, len(df2), p = df1.CartonsPercent)
    df2['loadGroup2'] = np.random.choice(df1.loadgroup, len(df2), p = df1.blocksPercent)
    df2.reset_index(inplace = True)
    three = pd.DataFrame(df2.groupby('loadGroup1').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
    three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
    three['blocksPercent'] = three.blocks/three.blocks.sum()
    four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
    four = four.abs()
    subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['CartonsPercent'], 'AbsDiff':[four.sum().sum()]})
    df = pd.concat([df,subdf])
    three = pd.DataFrame(df2.groupby('loadGroup2').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
    three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
    three['blocksPercent'] = three.blocks/three.blocks.sum()
    four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
    four = four.abs()
    subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['blocksPercent'], 'AbsDiff':[four.sum().sum()]})
    df = pd.concat([df,subdf])
df.sort_values(by = 'AbsDiff', ascending = True, inplace = True)
df = df.head(10)
Actually the first row of df tells me the seed I am looking for; I kept 10 rows just out of curiosity.
Here is my solution.
block cartons loadgroup
0 533 4
1 257 1
2 96 4
3 104 4
4 130 4
5 71 2
6 68 1
7 87 4
8 99 4
9 51 4
10 291 4
11 119 2
12 274 2
13 316 4
14 87 4
15 149 5
16 120 3
17 222 2
18 100 2
19 148 2
20 192 3
21 188 4
22 293 1
23 120 2
24 224 4
25 449 1
26 385 5
27 395 3
28 418 1
29 423 4
30 244 5
31 327 1
32 337 5
33 249 4
34 528 1
35 528 1
36 494 5
37 540 3
38 368 2
39 533 4
40 614 5
41 462 4
42 350 5
43 618 4
44 463 2
45 552 1
46 397 3
47 401 3
48 397 1
49 365 1
50 475 4
51 379 1
52 541 1
53 488 2
54 383 2
55 354 1
56 760 5
57 327 4
58 211 2
59 356 5
60 552 4
61 401 1
62 320 1
63 368 3
64 311 3
65 421 2
66 458 5
67 278 4
68 504 5
69 385 4
70 242 4
71 413 1
72 246 2
73 465 5
74 386 4
75 231 1
76 154 4
77 294 4
78 275 1
79 169 4
80 398 4
81 227 4
82 273 1
83 319 3
84 177 4
85 272 5
86 204 3
87 139 1
88 187 4
89 263 4
90 90 4
91 134 4
92 67 3
93 115 3
94 45 2
95 65 2
96 40 4
97 108 2
98 60 2
99 102 1
Here are the summaries.
loadgroup  cartons  blocks  cartonsPercent  blocksPercent
1          7610     22      26%             22%
2          3912     18      13%             18%
3          3429     12      12%             12%
4          9269     35      31%             35%
5          5388     13      18%             13%
It's very close to my target though.
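For comparison, a deterministic greedy pass is one possible quick-and-easy alternative to the seed search. This is only a sketch (not the code used above); it assumes the df2 column names shown in the tables, and it tracks only the carton share, so the block share is matched only approximately:
import pandas as pd

def greedy_assign(df2, targets):
    # targets: {loadgroup: target carton fraction}, e.g. {1: 0.26, 2: 0.13, ...}
    total = df2['cartons'].sum()
    assigned = {g: 0 for g in targets}                 # cartons assigned so far per loadgroup
    out = pd.Series(0, index=df2.index)
    # hand out the biggest blocks first, each to the loadgroup whose carton
    # share is currently furthest below its target share
    for idx in df2.sort_values('cartons', ascending=False).index:
        g = max(targets, key=lambda k: targets[k] - assigned[k] / total)
        out[idx] = g
        assigned[g] += df2.at[idx, 'cartons']
    return out

# df2['loadgroup'] = greedy_assign(df2, {1: 0.26, 2: 0.13, 3: 0.11, 4: 0.31, 5: 0.19})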

create new column from divided columns over iteration

I am working with the following code:
url = 'https://raw.githubusercontent.com/dothemathonthatone/maps/master/fertility.csv'
df = pd.read_csv(url)
year regional_schlüssel Aus15 Deu15 Aus16 Deu16 Aus17 Deu17 Aus18 Deu18 ... aus36 aus37 aus38 aus39 aus40 aus41 aus42 aus43 aus44 aus45
0 2000 5111000 0 4 8 25 20 45 56 89 ... 935 862 746 732 792 660 687 663 623 722
1 2000 5113000 1 1 4 14 13 33 19 48 ... 614 602 498 461 521 470 393 411 397 400
2 2000 5114000 0 11 0 5 2 13 7 20 ... 317 278 265 235 259 228 204 173 213 192
3 2000 5116000 0 2 2 7 3 28 13 26 ... 264 217 206 207 197 177 171 146 181 169
4 2000 5117000 0 0 3 1 2 4 4 7 ... 135 129 118 116 128 148 89 110 124 83
I would like to create a new set of columns fertility_deu15, ..., fertility_deu45 and fertility_aus15, ..., fertility_aus45 such that fertility_aus15 = aus15 / Aus15 and fertility_deu15 = deu15 / Deu15, and so on for every ausi / Ausi and deui / Deui pair with i in 15-45.
I'm not sure what is up with that data, but we need to fix it to make it numeric. I'll end up doing that while filtering:
numerator = df.filter(regex=r'^[a-z]+\d+$')  # lower-case ones
numerator = numerator.apply(pd.to_numeric, errors='coerce')  # fix numbers
denominator = df.filter(regex=r'^[A-Z][a-z]+\d+$').rename(columns=str.lower)
denominator = denominator.apply(pd.to_numeric, errors='coerce')
numerator.div(denominator).add_prefix('fertility_')
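To keep the new columns next to the originals, the result can be joined back onto df (a small follow-up to the answer above, not part of it):
fertility = numerator.div(denominator).add_prefix('fertility_')
df = df.join(fertility)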

how to filter AGE based on condition if my age column is like below

This is the data given:
Sample_Size Age(years)
7304
2581
4723
1153
2402
1925
1812
356 18 - 24
598 25 - 34
865 35 - 44
1288 45 - 54
1676 55 - 64
2521 65 or older
I was asked to filter the data for Sample_Size greater than 1000 and Age(years) between 40 and 50.
What I did: I created a table and dumped the CSV file into that Hive table. Now I need to filter the data according to the given condition.
Which logic should I apply to filter the column?
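The data is being loaded into a Hive table, but the filtering logic is the same in any tool: split the Age(years) text such as "45 - 54" into numeric bounds and keep rows whose range overlaps 40-50 and whose Sample_Size exceeds 1000. Here is a sketch of that logic in pandas (the file name and the overlap interpretation of "between 40 to 50" are assumptions):
import pandas as pd

df = pd.read_csv('sample.csv')                       # assumed file name

# pull the lower and upper bound out of strings like "45 - 54" or "65 or older"
bounds = df['Age(years)'].str.extract(r'(\d+)\D+(\d+|older)')
df['age_low'] = pd.to_numeric(bounds[0], errors='coerce')
df['age_high'] = pd.to_numeric(bounds[1], errors='coerce').fillna(150)   # "older" -> open-ended

mask = (df['Sample_Size'] > 1000) & (df['age_low'] <= 50) & (df['age_high'] >= 40)
print(df[mask])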

Forming a variable for a graph using result of searching with awk

I'm using Cacti to graph the CPU usage of equipment with 7 modules; the command used shows 12 samples for each module. I need awk to find each module name and then form a variable with the syntax [module]:[12th CPU sample], for example: MSCBC05:47
Below is an extract of the command output mentioned:
ACT AD-46 TIME 141216 1556 MSCBC05
PROCESSOR LOAD DATA
INT PLOAD CALIM OFFDO OFFDI FTCHDO FTCHDI OFFMPH OFFMPL FTCHMPH FTCHMPL
1 46 56250 656 30 656 30 1517 2 1517 2
2 47 56250 659 32 659 32 1448 1 1448 1
3 46 56250 652 22 652 22 1466 1 1466 1
4 47 56250 672 33 672 33 1401 0 1401 0
5 47 56250 674 38 674 38 1446 2 1446 2
6 45 56250 669 22 669 22 1365 1 1365 1
7 45 56250 674 26 674 26 1394 2 1394 2
8 46 56250 664 24 664 24 1396 0 1396 0
9 47 56250 686 24 686 24 1425 2 1425 2
10 47 56250 676 31 676 31 1386 0 1386 0
11 48 56250 702 25 702 25 1414 2 1414 2
12 47 56250 703 31 703 31 1439 2 1439 2
Complete output
https://dl.dropboxusercontent.com/u/33222611/raw_output.txt
I suggest
awk '$1 == "ACT" { sub(/\r/, ""); curmsc = $6 } curmsc != "" && $1 == "12" { print curmsc ":" $2; curmsc = "" }' raw_output.txt
Written more readably, that is
$1 == "ACT" { # In the first line of an ACT block
sub(/\r/, "") # remove the trailing carriage return. Could also use todos or so.
curmsc = $6 # remember MSC
}
curmsc != "" && $1 == "12" { # if we are in such a block and the first token is 12
print curmsc ":" $2 # print the stuff we want to know
curmsc = "" # then flag that we're outside a block
}