Passing some rows based on a threshold - awk

I am wondering how I can adapt the code below (or add extra calls) to keep only some rows based on a threshold (here, a percentage value). My xyz file:
21.4 7
21.5 141
21.6 4
21.7 43
21.8 26
21.9 133
22 305
22.1 216
22.2 93
22.3 33
22.4 13
22.7 23
22.8 2
22.9 10
23 39
23.1 22
23.2 33
23.3 8
23.4 9
23.5 2
23.6 270
23.7 724
23.8 2349
23.9 2
24 1
24.1 11
24.2 376
24.3 1452
24.4 92
with the following awk call (the file is read twice: the first FNR==NR pass sums the 2nd column, the second pass prints each row's share) I obtain the corresponding percentage of each value as a 3rd column:
awk 'FNR==NR { s+=$2; next; } { printf "%s\t%s\t%s%%\n", $1, $2, 100*$2/s }' xyz xyz | sort -k3 -g
which gives:
24 1 0.0155304%
22.8 2 0.0310607%
23.5 2 0.0310607%
23.9 2 0.0310607%
21.6 4 0.0621214%
21.4 7 0.108713%
23.3 8 0.124243%
23.4 9 0.139773%
22.9 10 0.155304%
24.1 11 0.170834%
22.4 13 0.201895%
23.1 22 0.341668%
22.7 23 0.357198%
21.8 26 0.403789%
22.3 33 0.512502%
23.2 33 0.512502%
23 39 0.605684%
21.7 43 0.667806%
24.4 92 1.42879%
22.2 93 1.44432%
21.9 133 2.06554%
21.5 141 2.18978%
22.1 216 3.35456%
23.6 270 4.1932%
22 305 4.73676%
24.2 376 5.83942%
23.7 724 11.244%
24.3 1452 22.5501%
23.8 2349 36.4808%
So, I want to automagically keep the last N rows whose 3rd-column percentages sum to just over 60%; in the case above that is the last three rows, since 36.4808% + 22.5501% + 11.244% = 70.2749%:
23.7 724 11.244%
24.3 1452 22.5501%
23.8 2349 36.4808%
Any hints are appreciated.

That could be done in a single awk command, but I think this version is shorter:
awk -v OFS="\t" 'FNR==NR { s+=$2; next; } { $3=100*$2/s "%" }1' xyz xyz |
sort -k3 -g |
awk '(t+=$3)>40'
The 40 in the final filter is 100 - 60: once the running total of the ascending-sorted percentages exceeds 40%, the current row and everything after it form the smallest tail summing to at least 60% (awk's numeric coercion quietly ignores the trailing %). This prints out:
23.7 724 11.244%
24.3 1452 22.5501%
23.8 2349 36.4808%
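As a quick sanity check (a hedged helper, not part of the answer itself), you can pipe the filtered rows through one more awk and confirm the kept percentages total just over 60%:
awk -v OFS="\t" 'FNR==NR { s+=$2; next; } { $3=100*$2/s "%" }1' xyz xyz |
sort -k3 -g |
awk '(t+=$3)>40' |
awk '{ t+=$3 } END { printf "%.4f%%\n", t }'
which should report 70.2749% for the sample data.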

Assumptions:
for rows with duplicate values in the 2nd column, precedence goes to rows with a higher row number (via awk/FNR)
One GNU awk idea:
awk '
BEGIN { OFS="\t" }
{ a[$2][FNR]=$1                               # [FNR] needed to distinguish between rows with duplicate values in 2nd column
  s+=$2
}
END { PROCINFO["sorted_in"]="@ind_num_desc"   # sort array by numeric index (descending order)
      for (i in a) {                          # loop through array (sorted by $2 descending)
          for (j in a[i]) {                   # loop through potential duplicate $2 values (sorted by FNR descending)
              pct=100*i/s
              out[--c]=a[i][j] OFS i OFS pct "%"   # build output line, store in out[] array, index= -1, -2, ...
              sum+=pct
              if (sum > 60) break             # if the sum of percentages > 60 then break from loop
          }
          if (sum > 60) break                 # if the sum of percentages > 60 then break from loop
      }
      for (i=c; i<0; i++)                     # print contents of out[] array starting with lowest index and increasing to -1
          print out[i]
}
' xyz
NOTE: requires GNU awk for:
multidimensional arrays (aka array of arrays)
PROCINFO["sorted_in"] support
This generates:
23.7 724 11.244%
24.3 1452 22.5501%
23.8 2349 36.4808%

Related

How to remove unwanted values in data when reading csv file

While reading Pina_Indian_Diabities.csv, some of the values are strings, something like this:
+AC0-5.4128147485
734 2
735 4
736 0
737 8
738 +AC0-5.4128147485
739 1
740 NaN
741 3
742 1
743 9
744 13
745 12
746 1
747 1
Like in row 738, there are such values in other rows and columns as well.
How can I drop them?
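No answer is quoted here, but a minimal awk sketch (assuming whitespace-separated data as shown, and that any row whose value field is not a plain number, such as +AC0-5.4128147485 or NaN, should be dropped; data.txt is a placeholder filename):
awk '$2 ~ /^-?[0-9]+(\.[0-9]+)?$/' data.txt
This keeps only rows whose 2nd field looks like an optionally signed integer or decimal.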

how do I use awk to edit a file

I have a text file like this small example:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 150 151 2 BA
chr10:103909786-103910082 152 153 1 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 294 295 4 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 2932 2933 2 CA
chr10:104573088-104576021 58 59 1 BA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
In this file there are 5 tab-separated columns. The first column is considered the ID; for example, in the first row the whole "chr10:103909786-103910082" is the ID.
1- In the 1st step I would like to filter out rows based on the 4th column:
if the number in the 4th column is less than 10 and the 5th column of the same row is BA, that row will be filtered out; likewise, if the number in the 4th column is less than 5 and the 5th column of the same row is CA, that row will be filtered out.
3- 3rd step:
I want to get the ratio of the numbers in the 4th column. The 1st column contains repeated values that represent the same ID, and I want one ratio per ID, so in the output every ID is repeated only once. Each ID has both BA and CA in the 5th column; for each ID I should get 2 values, for CA and BA separately, and take the ratio CA/BA as the final value for that ID. To get one value for CA, I add up all values in the 4th column that belong to the same ID and are classified as CA, and likewise for BA. The last step is to get the ratio CA/BA per ID. The expected output for the small example would look like this:
1- after filtration:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
2- after summarizing each group (CA and BA):
chr10:103909786-103910082 147 148 35 BA
chr10:103909786-103910082 274 275 35 CA
chr10:104573088-104576021 2925 2926 144 CA
chr10:104573088-104576021 819 820 45 BA
3- the final output(this ratio is made using the values in 4th column):
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
In the above lines, 1 = 35/35 and 3.2 = 144/45.
I am trying to do that in awk:
awk 'ID==$1 {
       if (ID) {
         print ID, a["CA"]/a["BA"]; a["CA"]=a["BA"]=0;
       }
       ID=$1
     }
     $5=="BA" && $4>=10 || $5=="CA" && $4>=5 { a[$5]+=$4 }
     END{ print ID, a["CA"]/a["BA"] }' file.txt
I tried to use the code but did not succeed: it returns one number, in fact the sum of all CA divided by the sum of all BA, but I want to do that per ID and get the ratio per ID. Do you know how to solve the problem and correct the code?
awk '$4 >= 5 && $5 == "CA" { a[$1]+=$4 }    # sum 4th column per ID for CA rows that pass the filter
     $4 >= 10 && $5 == "BA" { b[$1]+=$4 }   # sum 4th column per ID for BA rows that pass the filter
     END{ for ( i in a) print i,a[i]/b[i]}' file   # one CA/BA ratio per ID
output:
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
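If an ID could end up with no BA rows after filtering, b[i] would be zero and the division would be a fatal error in awk; a defensive variant guarding against that (an assumption on my part, the sample data never triggers it):
awk '$4 >= 5 && $5 == "CA" { a[$1]+=$4 }
     $4 >= 10 && $5 == "BA" { b[$1]+=$4 }
     END{ for (i in a) if (b[i] > 0) print i, a[i]/b[i] }' file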

exchange columns based on some conditions

I have a text file with 5 columns. If the number in the 5th column is less than the number in the 3rd column, swap the 4th and 5th columns with the 2nd and 3rd columns. If the number in the 5th column is greater than the number in the 3rd column, leave that line the same.
1EAD A 396 B 311
1F3B A 118 B 171
1F5V A 179 B 171
1G73 A 162 C 121
1BS0 E 138 G 230
Desired output
1EAD B 311 A 396
1F3B A 118 B 171
1F5V B 171 A 179
1G73 C 121 A 162
1BS0 E 138 G 230
$ awk '{ if ($5 >= $3) print $0; else print $1"\t"$4"\t"$5"\t"$2"\t"$3; }' foo.txt
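A variant that swaps the fields in place (a sketch; note the one-liner above leaves unswapped lines with their original spacing, while this rebuilds every line tab-separated):
$ awk -v OFS='\t' '$5 < $3 { t=$2; $2=$4; $4=t; t=$3; $3=$5; $5=t } { $1=$1; print }' foo.txt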

Transpose a column to a line after detecting a pattern

I have this text file format:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 #line 2
37.31 624 #line 3
260 1 #line 4
321 624 #line 5
532 23 #line 6
12 644 #line 7
270 0.0 #line 8
3e-37 1046 #line 9
154 #line 10
I have to detect a line containing 8 columns (line 2), and transpose the second column of the following seven lines (lines 3 - 9) to the end of the 8-column line. And finally, exclude line 10. This pattern repeats throughout a large text file, but it is not frequent (30 times, in a file of 2000 lines). Is it possible to do this using awk?
The edited text file must look like the following text:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046 #line 2
Thank you very much in advance.
awk 'NF == 12 { t = $0; for (i = 1; i <= 7; ++i) { r = getline; if (r < 1) break; t = t "\t" $2; } print t; next; } NF > 12' temp.txt
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046
It automatically prints lines having more than 12 fields.
If it detects a line having exactly 12 fields, it concatenates the second column of the next 7 lines onto it and prints the result.
Any other line is ignored.
(The question counts 8 logical columns, but awk splits on whitespace, and the description and organism name contain spaces, so line 2 has 12 fields.)
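The same one-liner, written out with comments (formatting only, same behavior):
awk 'NF == 12 {                      # a 12-field line starts a block to be merged
         t = $0
         for (i = 1; i <= 7; ++i) {  # pull in the next seven lines
             r = getline
             if (r < 1) break        # stop on end of file or read error
             t = t "\t" $2           # append each line's 2nd field
         }
         print t
         next
     }
     NF > 12' temp.txt               # already-complete lines pass through unchanged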
Edited to only add the second column of the lines with two columns.
I think this does what you want:
awk 'NF >= 8 { a[++i] = $0 } NF == 2 { a[i] = a[i] " " $2 } END { for (j = 1; j <= i; ++j) print a[j] }' file
For lines with at least 8 fields, add a new element to the array a. If the line has exactly 2 fields, append its second field to the current array element. Once the whole file has been processed, go through the array and print all of the lines. (The trailing line 10 has only one field, so it matches neither condition and is dropped automatically.)
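The same approach with comments (formatting only, same behavior):
awk 'NF >= 8 { a[++i] = $0 }           # each complete line starts a new output record
     NF == 2 { a[i] = a[i] " " $2 }    # continuation lines contribute their 2nd field
     END { for (j = 1; j <= i; ++j) print a[j] }' file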
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046

matching columns in two files with different line numbers

This is a rather frequently asked question, but I could not figure it out for my files, so any help will be highly appreciated.
I have two files, I want to compare their first fields and print the common lines into a third file, an example of my files:
file 1:
gene1
gene2
gene3
file 2:
gene1|trans1|12|233|345 45
gene1|trans2|12|342|232 45
gene2|trans2|12|344|343 12
gene2|trans2|12|344|343 45
gene2|trans2|12|344|343 12
gene2|trans3|12|34r|343 325
gene2|trans2|12|344|343 545
gene3|trans4|12|344|333 454
gene3|trans2|12|343|343 545
gene3|trans3|12|344|343 45
gene4|trans2|12|344|343 2112
gene4|trans2|12|344|343 455
file 2 contains more fields. Please note that the first field is not exactly like the one in file 1; only the gene part matches.
The output should look like this:
gene1|trans1|12|233|345 45
gene1|trans2|12|342|232 45
gene2|trans2|12|344|343 12
gene2|trans2|12|344|343 45
gene2|trans2|12|344|343 12
gene2|trans3|12|34r|343 325
gene2|trans2|12|344|343 545
gene3|trans4|12|344|333 454
gene3|trans2|12|343|343 545
gene3|trans3|12|344|343 45
I use this code; it does not give me any error, but it does not give me any output either:
awk '{if (f[$1] != FILENAME) a[$1]++; f[$1] = FILENAME; } END{ for (i in a) if (a[i] > 1) print i; }' file1 file1
Thank you very much.
Something like this? (The FNR==NR block loads file1's gene names into an array; file2 is then split on | and lines whose first field is in the array are printed.)
awk -F\| 'FNR==NR {a[$0]++;next} $1 in a' file1 file2
gene1|trans1|12|233|345 45
gene1|trans2|12|342|232 45
gene2|trans2|12|344|343 12
gene2|trans2|12|344|343 45
gene2|trans2|12|344|343 12
gene2|trans3|12|34r|343 325
gene2|trans2|12|344|343 545
gene3|trans4|12|344|333 454
gene3|trans2|12|343|343 545
gene3|trans3|12|344|343 45
In this example, grep is sufficient:
grep -w -f file1 file2
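If the gene names could contain regex metacharacters, adding -F to treat the patterns as fixed strings is safer (a precaution, not needed for the sample data):
grep -wFf file1 file2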