Divide a column into periods and print the min and max for each period in awk

I have a data file that contains two columns. One of them varies periodically, and the max and min are different in each period:
a 3
b 4
c 5
d 4
e 3
f 2
g 1
h 2
i 3
j 4
k 5
l 6
m 5
n 4
o 3
p 2
q 1
r 0
s 1
t 2
u 3
We can find that in the 1st period (from a to i): max = 5, min = 1. In the 2nd period (from i to u): max = 6, min = 0.
Using awk, I can only print the max and min of the whole second column; I cannot print the min and max for each period. What I wish to obtain is a result like this:
period min max
1 1 5
2 0 6
Here is what I did:
{
    nb_lignes = 21
    period = 9
    nb_periodes = int(nb_lignes/period)
}
{
    for (j = 0; j <= nb_periodes; j++) {
        if (NR == (1 + period*j)) { max = $2; min = $2 }
        for (i = (period*j); i <= (period*(j+1)); i++) {
            if (NR == i) {
                if ($2 >= max) { max = $2 }
                if ($2 <= min) { min = $2 }
                print "Min: " min, "Max: " max, "Ligne: " NR
            }
        }
    }
}
#END { print "Min: "min,"Max: "max }
However, the result is far from what I am looking for:
Min: 3 Max: 3 Ligne: 1
Min: 3 Max: 4 Ligne: 2
Min: 3 Max: 5 Ligne: 3
Min: 3 Max: 5 Ligne: 4
Min: 3 Max: 5 Ligne: 5
Min: 2 Max: 5 Ligne: 6
Min: 1 Max: 5 Ligne: 7
Min: 1 Max: 5 Ligne: 8
Min: 1 Max: 5 Ligne: 9
Min: 1 Max: 5 Ligne: 9
Min: 4 Max: 4 Ligne: 10
Min: 4 Max: 5 Ligne: 11
Min: 4 Max: 6 Ligne: 12
Min: 4 Max: 6 Ligne: 13
Min: 4 Max: 6 Ligne: 14
Min: 3 Max: 6 Ligne: 15
Min: 2 Max: 6 Ligne: 16
Min: 1 Max: 6 Ligne: 17
Min: 0 Max: 6 Ligne: 18
Min: 0 Max: 6 Ligne: 18
Min: 1 Max: 1 Ligne: 19
Min: 1 Max: 2 Ligne: 20
Min: 1 Max: 3 Ligne: 21
Thank you in advance for your help.

Try something like:
$ awk '
BEGIN{print "period", "min", "max"}
!f{min=$2; max=$2; ++f; next}
{max = ($2>max)?$2:max; min = ($2<min)?$2:min; f++}
f==9{print ++a, min, max; f=0}' file
period min max
1 1 5
2 0 6
When the flag f is not set, assign the second column to the max and min variables and start incrementing the flag.
For each line, check the second column: if it is bigger than our max variable, assign it to max; likewise, if it is smaller than our min variable, assign it to min. Keep incrementing the flag.
Once the flag reaches 9, print the period number and the min and max variables. Reset the flag to 0 and start afresh from the next line.
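Note that with 21 lines and a period of 9, the last 3 lines never bring the flag up to 9, so they are silently dropped. A minimal sketch of the same idea that also prints a trailing partial period (the only change is the extra END rule):
$ awk '
BEGIN{print "period", "min", "max"}
!f{min=$2; max=$2; ++f; next}
{max = ($2>max)?$2:max; min = ($2<min)?$2:min; f++}
f==9{print ++a, min, max; f=0}
END{if (f) print ++a, min, max}' file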

I've started, so I'll finish. I chose to create an array which contains the minimum and maximum for each period:
awk -v period=9 '
BEGIN { print "period", "min", "max" }
NR % period == 1 { ++i }
!(i in min) || $2 < min[i] { min[i] = $2 }
$2 > max[i] { max[i] = $2 }
END { for (i in min) print i, min[i], max[i] }' input
The index i increases every period number of lines (in this case 9). If no value has been set yet or a new minimum/maximum has been found, update the array.
edit: if max[i] has not yet been set then $2 > max[i], so no need to check !max[i].
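One caveat: for (i in min) does not guarantee numeric order in every awk, so if the periods must come out in order, here is a small sketch of the same script with a plain counting loop in the END block (everything else unchanged):
awk -v period=9 '
BEGIN { print "period", "min", "max" }
NR % period == 1 { ++i }
!(i in min) || $2 < min[i] { min[i] = $2 }
$2 > max[i] { max[i] = $2 }
END { for (p = 1; p <= i; p++) print p, min[p], max[p] }' input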

awk 'BEGIN{print "Period","min","max"}
NR%9==1{mi=ma=$2}
{if($2<mi)mi=$2; if($2>ma)ma=$2}
NR%9==0{print ++i,mi,ma}' your_file

Related

How do I get a time delta that is closest to 0 days?

I have the following dataframe:
import pandas as pd

gp_columns = {
'name': ['companyA', 'companyB'],
'firm_ID' : [1, 2],
'timestamp_one' : ['2016-04-01', '2017-09-01']
}
fund_columns = {
'firm_ID': [1, 1, 2, 2, 2],
'department_ID' : [10, 11, 20, 21, 22],
'timestamp_mult' : ['2015-01-01', '2016-03-01', '2016-10-01', '2017-02-01', '2018-11-01'],
'number' : [400, 500, 1000, 3000, 4000]
}
gp_df = pd.DataFrame(gp_columns)
fund_df = pd.DataFrame(fund_columns)
gp_df['timestamp_one'] = pd.to_datetime(gp_df['timestamp_one'])
fund_df['timestamp_mult'] = pd.to_datetime(fund_df['timestamp_mult'])
merged_df = gp_df.merge(fund_df)
merged_df
merged_df_v1 = merged_df.copy()
merged_df_v1['incidence_num'] = merged_df.groupby('firm_ID')['department_ID'] \
                                         .transform('cumcount')
merged_df_v1['incidence_num'] = merged_df_v1['incidence_num'] + 1
merged_df_v1['time_delta'] = merged_df_v1['timestamp_mult'] - merged_df_v1['timestamp_one']
merged_wide = pd.pivot(merged_df_v1, index=['name', 'firm_ID', 'timestamp_one'],
                       columns='incidence_num',
                       values=['department_ID', 'time_delta', 'timestamp_mult', 'number'])
merged_wide.reset_index()
that looks as follows:
My question is how I get a column that contains the minimum time delta (i.e. the one closest to 0). Note that the time delta can be negative or positive, so .abs() alone does not work for me here.
I want a dataframe with this particular output:
You can stack (which removes NaTs) and groupby.first after sorting the rows by absolute value (with the key parameter of sort_values):
df = merged_wide.reset_index()
df['time_delta_min'] = (df['time_delta'].stack()
                        .sort_values(key=abs)
                        .groupby(level=0).first()
                       )
output:
name firm_ID timestamp_one department_ID \
incidence_num 1 2 3
0 companyA 1 2016-04-01 10 11 NaN
1 companyB 2 2017-09-01 20 21 22
time_delta timestamp_mult \
incidence_num 1 2 3 1 2
0 -456 days -31 days NaT 2015-01-01 2016-03-01
1 -335 days -212 days 426 days 2016-10-01 2017-02-01
number time_delta_min
incidence_num 3 1 2 3
0 NaT 400 500 NaN -31 days
1 2018-11-01 1000 3000 4000 -212 days
Alternatively, use a lookup with the column indices of the absolute minima from DataFrame.idxmin:
import numpy as np

idx, cols = pd.factorize(df['time_delta'].abs().idxmin(axis=1))
df['time_delta_min'] = (df['time_delta'].reindex(cols, axis=1)
                        .to_numpy()[np.arange(len(df)), idx])
print(df)

JMeter how to disable summariser.name=summary while running from command prompt

I am running JMeter scripts from the command line. While running, I get this summary after every request. I understood from the documentation that we need to comment out summariser.name=summary or set it to none. I don't want to see this summary; please let me know how to disable it.
00:44:10.785 summary + 6 in 00:00:32 = 0.2/s Avg: 241 Min: 2 Max: 1239 Err: 1 (16.67%) Active: 1 Started: 1 Finished: 0
00:44:10.785 summary = 498 in 00:39:27 = 0.2/s Avg: 126 Min: 0 Max: 2851 Err: 32 (6.43%)
00:44:42.892 summary + 7 in 00:00:31 = 0.2/s Avg: 88 Min: 0 Max: 418 Err: 0 (0.00%) Active: 1 Started: 1 Finished: 0
00:44:42.892 summary = 505 in 00:39:57 = 0.2/s Avg: 126 Min: 0 Max: 2851 Err: 32 (6.34%)
00:45:14.999 summary + 6 in 00:00:31 = 0.2/s Avg: 73 Min: 2 Max: 216 Err: 0 (0.00%) Active: 1 Started: 1 Finished: 0
00:45:14.999 summary = 511 in 00:40:28 = 0.2/s Avg: 125 Min: 0 Max: 2851 Err: 32 (6.26%)
00:45:41.565 summary + 6 in 00:00:31 = 0.2/s Avg: 68 Min: 2 Max: 205 Err: 0 (0.00%) Active: 1 Started: 1 Finished: 0
00:45:41.565 summary = 517 in 00:40:58 = 0.2/s Avg: 125 Min: 0 Max: 2851 Err: 32 (6.19%)
00:46:13.681 summary + 6 in 00:00:31 = 0.2/s Avg: 103 Min: 2 Max: 384 Err: 0 (0.00%) Active: 1 Started: 1 Finished: 0
00:46:13.681 summary = 523 in 00:41:29 = 0.2/s Avg: 124 Min: 0 Max: 2851 Err: 32 (6.12%)
If you don't want to see the summariser output in the console, you can amend your command to
jmeter -Jsummariser.out=false -n -t test.jmx -l result.jtl
To make the change permanent, put this line in the user.properties file: summariser.out=false
If you want to turn off the summariser completely:
Open jmeter.properties file with your favourite text editor
Locate this line
summariser.name=summary
and either comment it out by putting a # character in front of it:
#summariser.name=summary
or just simply delete it
That's it; you won't see the summariser output on the next execution.
More information:
Summariser - Generate Summary Results - configuration
Configuring JMeter
Apache JMeter Properties Customization Guide

For each unique value in one field, turn each unique value of another field into a separate column

I have a file
splice_region_variant,intron_variant A1CF 1
3_prime_UTR_variant A1CF 18
intron_variant A1CF 204
downstream_gene_variant A1CF 22
synonymous_variant A1CF 6
missense_variant A1CF 8
5_prime_UTR_variant A2M 1
stop_gained A2M 1
missense_variant A2M 15
splice_region_variant,intron_variant A2M 2
synonymous_variant A2M 2
upstream_gene_variant A2M 22
intron_variant A2M 308
missense_variant A4GNT 1
intron_variant A4GNT 21
5_prime_UTR_variant A4GNT 3
3_prime_UTR_variant A4GNT 7
This file is sorted by $2
For each unique element in $2, I want to turn each unique element of $1 into a column, with the corresponding value from $3, or 0 if the record is not there, so that I have:
splice_region_variant,intron_variant 3_prime_UTR_variant intron_variant downstream_gene_variant synonymous_variant missense_variant 5_prime_UTR_variant stop_gained upstream_gene_variant
A1CF 1 18 204 22 6 8 0 0 0
A2M 2 0 308 0 2 15 1 1 22
A4GNT 0 7 21 0 0 22 3 0 0
test file:
a x 2
b,c x 4
dd x 3
e,e,t x 5
a b 1
cc b 2
e,e,t b 1
This is what I'm getting:
a b,c dd e,e,t cc
x 5 2 4 3
b 1 2 1
EDIT: This might be doing it, but it doesn't output 0s in blank fields:
awk 'BEGIN {FS = OFS = "\t"}
NR > 1 {data[$2][$1] = $3; blocks[$1]}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    # header
    printf "gene"
    for (block in blocks) {
        printf "%s%s", OFS, block
    }
    print ""
    # data
    for (ts in data) {
        printf "%s", ts
        for (block in blocks) {
            printf "%s%s", OFS, data[ts][block]
        }
        print ""
    }
}' file
modified from https://unix.stackexchange.com/questions/424642/dynamic-transposing-rows-to-columns-using-awk-based-on-row-value
If you want to print 0 if a certain value is absent, you could do something like this:
val = data[ts][block] ? data[ts][block] : 0;
printf "%s%s", OFS, val

AWK - Select lines to print according to score

I have a tab-separated file containing a series of lemmas with associated scores.
The file contains 5 columns; the first column is the lemma and the third is the one that contains the score. What I need to do is print the line as it is when the lemma is not repeated, and print the line with the highest score when the lemma is repeated.
IN
Lemma --- Score --- ---
cserép 06a 55 6 bueno
darázs 05 38 1 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
díj 05 14 89 malo
díj 06a 2 101 malo
díj 06b 2 101 malo
díj 07 90 13 bueno
díj 08a 2 101 malo
díj 08b 2 101 malo
egér 06a 66 5 bueno
fonal 05 12 1 bueno
fonal 07 52 4 bueno
Desired output
Lemma --- Score --- ---
cserép 06a 55 6 bueno
darázs 05 38 1 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
díj 07 90 13 bueno
egér 06a 66 5 bueno
fonal 07 52 4 bueno
This is what I have done, but it only works when the lemma is repeated just once.
BEGIN {
OFS=FS="\t";
flag="";
}
{
id=$1;
if (id != flag)
{
if (line != "")
{
sub("^;","",line);
z=split(line,A,";");
if ((A[3] > A[8]) && (A[8] != ""))
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5];
}
else if ((A[8] > A[3]) && (A[8] != ""))
{
print A[6]"\t"A[7]"\t"A[8]"\t"A[9]"\t"A[10]
}
else
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5];
}
}
delete line;
flag=id;
}
line[$1]=line[$1]";"$2";"$3";"$4";"$5;
}
END {
line=line ";"$1";"$2";"$3";"$4";"$5
sub("^;","",line);
z=split(line,A,";");
if ((A[3] > A[8]) && (A[8] != ""))
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5];
}
else if ((A[8] > A[3]) && (A[8] != ""))
{
print A[6]"\t"A[7]"\t"A[8]"\t"A[9]"\t"A[10]
}
else
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5]
}
}
This one doesn't require the file to be sorted by lemma, but it keeps all the lines to be printed in memory (one for each lemma), so it may not be appropriate for a file with millions of different lemmas.
It also does not respect the order of the original file.
Finally, it assumes that all scores are non-negative! (A variant without that assumption is sketched after the output below.)
$ cat lemma.awk
BEGIN { FS = OFS = "\t" }
NR == 1 { print }
NR > 1 {
if ($3 > score[$1]) {
score[$1] = $3
line[$1] = $0
}
}
END { for (lemma in line) print line[lemma] }
$ awk -f lemma.awk lemma.txt
Lemma --- Score --- ---
cserép 06a 55 6 bueno
díj 07 90 13 bueno
fonal 07 52 4 bueno
darázs 05 38 1 bueno
egér 06a 66 5 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
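If the scores can also be negative, here is a sketch of a variant of the same script (the file name lemma2.awk is just for illustration): an explicit membership test replaces the implicit comparison against an unset (zero) score, and the memory and ordering caveats above still apply.
$ cat lemma2.awk
BEGIN { FS = OFS = "\t" }
NR == 1 { print; next }                       # keep the header line as-is
!($1 in score) || $3 + 0 > score[$1] + 0 {    # first sighting of the lemma, or a higher score
    score[$1] = $3
    line[$1] = $0
}
END { for (lemma in line) print line[lemma] }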
Tested with gnu awk:
prevLemma != $1 {
    if( prevLemma ) {
        print line;
    }
    prevLemma = $1;
    prevScore = $3;
    line = $0;
}
prevLemma == $1 {
    if( prevScore < $3 ) {
        prevScore = $3;
        line = $0;
    }
}
END { print line; }
The assumption is that the file is sorted by lemma.
When the lemma changes (or at the very beginning, when the variable is empty), the lemma, score, and line are saved.
When the lemma changes (or in the END), the line for the previous lemma is printed.
When the current line belongs to the same lemma and has a higher score, the values are saved again.
$ cat tst.awk
$1 != prev { printf "%s", maxLine; maxLine=""; max=$3; prev=$1 }
$3 >= max { max=$3; maxLine=$0 ORS }
END { printf "%s", maxLine }
$ awk -f tst.awk file
Lemma --- Score --- ---
cserép 06a 55 6 bueno
darázs 05 38 1 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
díj 07 90 13 bueno
egér 06a 66 5 bueno
fonal 07 52 4 bueno
Use a script:
if ($1 != $5) print $0
else {
    score[NR] = $3
    print $0
}
Actually, this might be better done with Perl.

How to round times in Xcode

I have been struggling for days trying to solve this puzzle.
I have this code that calculates the time between IN & OUT as decimal hours, where (6 min = 0.1 hr) ~ (60 min = 1.0 hr):
NSUInteger unitFlag = NSCalendarUnitHour | NSCalendarUnitMinute;
NSDateComponents *components = [calendar components:unitFlag
                                           fromDate:self.outT
                                             toDate:self.inT
                                            options:0];
NSInteger hours = [components hour];
NSInteger minutes = [components minute];
if (minutes < 0) (minutes -= 60*-1) && (hours -= 1);
if (hours < 0 && minutes < 0) (hours += 24) && (minutes -= 60*-1);
if (hours < 0 && minutes > 0) (hours += 24) && (minutes = minutes);
if (hours < 0 && minutes == 00) (hours += 24) && (minutes = minutes);
if (minutes > 0) (minutes = (minutes/6));
self.blockDecimalLabel.text = [NSString stringWithFormat:@"%d.%d", (int)hours, (int)minutes];
The green lines show what the code does; what I am looking for is to round the minutes like the blue lines: 1 or 2 minutes round down to the previous decimal hour, and 3, 4, or 5 minutes round up to the next decimal hour.
What I am trying to achieve is:
If the result is 11 minutes, the code returns 0.1, and only after 12 minutes does it return 0.2. What I am trying to do is: if the result is 8, the code returns 0.1, but if it is 9, it rounds to the next decimal, which is 0.2, and so on. The objective is to avoid losing up to 5 minutes in each multiple of 6 in the worst case; doing this, the maximum lost will be 3 minutes on average.
Any input is more than welcome :)
Cheers
Your goals seem incoherent to me. However, I tried this:
let beh = NSDecimalNumberHandler(
    roundingMode: .RoundPlain, scale: 1, raiseOnExactness: false,
    raiseOnOverflow: false, raiseOnUnderflow: false, raiseOnDivideByZero: false
)
for t in 0...60 {
    let div = Double(t)/60.0
    let deci = NSDecimalNumber(double: div)
    let deci2 = deci.decimalNumberByRoundingAccordingToBehavior(beh)
    let result = deci2.doubleValue
    println("min: \(t) deci: \(result)")
}
The output seems pretty much what you are asking for:
min: 0 deci: 0.0
min: 1 deci: 0.0
min: 2 deci: 0.0
min: 3 deci: 0.1
min: 4 deci: 0.1
min: 5 deci: 0.1
min: 6 deci: 0.1
min: 7 deci: 0.1
min: 8 deci: 0.1
min: 9 deci: 0.2
min: 10 deci: 0.2
min: 11 deci: 0.2
min: 12 deci: 0.2
min: 13 deci: 0.2
min: 14 deci: 0.2
min: 15 deci: 0.3
min: 16 deci: 0.3
min: 17 deci: 0.3
min: 18 deci: 0.3
min: 19 deci: 0.3
min: 20 deci: 0.3
min: 21 deci: 0.4
min: 22 deci: 0.4
min: 23 deci: 0.4
min: 24 deci: 0.4
min: 25 deci: 0.4
min: 26 deci: 0.4
min: 27 deci: 0.5
min: 28 deci: 0.5
min: 29 deci: 0.5
min: 30 deci: 0.5
min: 31 deci: 0.5
min: 32 deci: 0.5
min: 33 deci: 0.6
min: 34 deci: 0.6
min: 35 deci: 0.6
min: 36 deci: 0.6
min: 37 deci: 0.6
min: 38 deci: 0.6
min: 39 deci: 0.7
min: 40 deci: 0.7
min: 41 deci: 0.7
min: 42 deci: 0.7
min: 43 deci: 0.7
min: 44 deci: 0.7
min: 45 deci: 0.8
min: 46 deci: 0.8
min: 47 deci: 0.8
min: 48 deci: 0.8
min: 49 deci: 0.8
min: 50 deci: 0.8
min: 51 deci: 0.9
min: 52 deci: 0.9
min: 53 deci: 0.9
min: 54 deci: 0.9
min: 55 deci: 0.9
min: 56 deci: 0.9
min: 57 deci: 1.0
min: 58 deci: 1.0
min: 59 deci: 1.0
min: 60 deci: 1.0