Transpose a column to a line after detecting a pattern - awk

I have this text file format:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 #line 2
37.31 624 #line 3
260 1 #line 4
321 624 #line 5
532 23 #line 6
12 644 #line 7
270 0.0 #line 8
3e-37 1046 #line 9
154 #line 10
I have to detect a line containing 8 columns (line 2), and transpose the second column of the following seven lines (lines 3 - 9) to the end of the 8-column line. Finally, exclude line 10. This pattern repeats throughout a large text file, but it is not frequent (about 30 times in a file of 2000 lines). Is it possible to do this using awk?
The edited text file must look like the following text:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046 #line 2
Thank you very much in advance.

awk 'NF == 12 { t = $0; for (i = 1; i <= 7; ++i) { r = getline; if (r < 1) break; t = t "\t" $2; } print t; next; } NF > 12' temp.txt
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046
It automatically prints lines having more than 12 fields (the bare NF > 12 pattern at the end).
If it detects a line having exactly 12 fields, it concatenates the second column of each of the following 7 lines onto it and prints the result.
Any other line is ignored.

Edited to only add the second column of the lines with two columns.
I think this does what you want:
awk 'NF >= 8 { a[++i] = $0 } NF == 2 { a[i] = a[i] " " $2 } END { for (j = 1; j <= i; ++j) print a[j] }' file
For lines with 8 or more columns, add a new element to the array a. If the line has 2 columns, append its second field to the current array element. Once the whole file has been processed, go through the array and print all of the lines.
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046

Related

Passing some rows based on a threshold

I am wondering how I can adapt the code below (or add extra calls) to pass some rows based on a threshold (here, a percentage value). My xyz file:
21.4 7
21.5 141
21.6 4
21.7 43
21.8 26
21.9 133
22 305
22.1 216
22.2 93
22.3 33
22.4 13
22.7 23
22.8 2
22.9 10
23 39
23.1 22
23.2 33
23.3 8
23.4 9
23.5 2
23.6 270
23.7 724
23.8 2349
23.9 2
24 1
24.1 11
24.2 376
24.3 1452
24.4 92
with the following awk call I obtain the percentage corresponding to each value in the 2nd column, appended as a 3rd column:
awk 'FNR==NR { s+=$2; next; } { printf "%s\t%s\t%s%%\n", $1, $2, 100*$2/s }' xyz xyz | sort -k3 -g
which gives:
24 1 0.0155304%
22.8 2 0.0310607%
23.5 2 0.0310607%
23.9 2 0.0310607%
21.6 4 0.0621214%
21.4 7 0.108713%
23.3 8 0.124243%
23.4 9 0.139773%
22.9 10 0.155304%
24.1 11 0.170834%
22.4 13 0.201895%
23.1 22 0.341668%
22.7 23 0.357198%
21.8 26 0.403789%
22.3 33 0.512502%
23.2 33 0.512502%
23 39 0.605684%
21.7 43 0.667806%
24.4 92 1.42879%
22.2 93 1.44432%
21.9 133 2.06554%
21.5 141 2.18978%
22.1 216 3.35456%
23.6 270 4.1932%
22 305 4.73676%
24.2 376 5.83942%
23.7 724 11.244%
24.3 1452 22.5501%
23.8 2349 36.4808%
So, I want to automagically select the last N rows whose percentages in the 3rd column sum to just over 60%; in the case above that is 36.4808% + 22.5501% + 11.244% = 70.2749%:
23.7 724 11.244%
24.3 1452 22.5501%
23.8 2349 36.4808%
Any hints are appreciated,
That could be done in a single awk command, but I think this version is shorter (the final filter starts printing once the running total passes 40%, i.e. at the smallest tail whose percentages sum to at least 60%):
awk -v OFS="\t" 'FNR==NR { s+=$2; next; } { $3=100*$2/s "%" }1' xyz xyz |
sort -k3 -g |
awk '(t+=$3)>40'
prints out:
23.7 724 11.244%
24.3 1452 22.5501%
23.8 2349 36.4808%
Assumptions:
for rows with duplicate values in the 2nd column, precedence goes to rows with a higher row number (via awk/FNR)
One GNU awk idea:
awk '
BEGIN { OFS = "\t" }
{
    a[$2][FNR] = $1                 # [FNR] needed to distinguish between rows with duplicate values in 2nd column
    s += $2
}
END {
    PROCINFO["sorted_in"] = "#ind_num_desc"    # sort array traversal by numeric index (descending order)
    for (i in a) {                             # loop through array (sorted by $2 descending)
        for (j in a[i]) {                      # loop through potential duplicate $2 values (sorted by FNR descending)
            pct = 100 * i / s
            out[--c] = a[i][j] OFS i OFS pct "%"   # build output line, store in out[] array; index = -1, -2, ...
            sum += pct
            if (sum > 60) break                # if the sum of percentages > 60 then break from loop
        }
        if (sum > 60) break                    # if the sum of percentages > 60 then break from loop
    }
    for (i = c; i < 0; i++)                    # print out[] starting with the lowest (most negative) index, up to -1
        print out[i]
}
' xyz
NOTE: requires GNU awk for:
multidimensional arrays (aka array of arrays)
PROCINFO["sorted_in"] support
This generates:
23.7 724 11.244%
24.3 1452 22.5501%
23.8 2349 36.4808%

How to remove unwanted values in data when reading csv file

Reading Pina_Indian_Diabities.csv, some of the values are strings, something like this:
+AC0-5.4128147485
734 2
735 4
736 0
737 8
738 +AC0-5.4128147485
739 1
740 NaN
741 3
742 1
743 9
744 13
745 12
746 1
747 1
Like in row 738; there are such values in other rows and columns as well.
How can I drop them?
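A minimal awk sketch (assuming the data really is the two-column whitespace-separated dump shown above, and using a hypothetical file name data.txt) is to keep only the rows whose second field is a plain number:

```shell
# Keep only rows whose 2nd field is a plain (optionally signed, optionally
# decimal) number; rows such as "738 +AC0-5.4128147485" and "740 NaN" are dropped.
awk '$2 ~ /^-?[0-9]+([.][0-9]+)?$/' data.txt
```

If the file is a real comma-separated CSV rather than the whitespace dump shown, add -F',' and adjust the field number to the column you want to validate.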

Extracting lines using two criteria

Hoping somebody can teach me how to do this task.
I am thinking awk might be good for this, but I am a complete beginner.
I have a file like below (tab separated, actual file is much bigger).
Here, the important columns are the second and the ninth (235 and 15 in the first line of the file).
S 235 1365 * 0 * * * 15 1 c81 592
H 235 296 99.7 + 0 0 3I296M1066I 14 1 s15018 1
H 235 719 95.4 + 0 0 174D545M820I 15 1 c2664 10
H 235 764 99.1 + 0 0 55I764M546I 15 1 c6519 4
H 235 792 100 + 0 0 180I792M393I 14 1 c407 107
S 236 1365 * 0 * * * 15 1 c474 152
H 236 279 95 + 0 0 765I279M321I 10-1 1 s7689 1
H 236 301 99.7 - 0 0 908I301M156I 15 1 s8443 1
H 236 563 95.2 - 0 0 728I563M74I 17 1 c1725 12
H 236 97 97.9 - 0 0 732I97M536I 17 1 s11472 1
I would like to extract lines by specifying a value for the ninth column. The second column acts as a pivot: lines that share the same value in the second column are treated as a single set of data, and a set is extracted only if all of its lines have one of the specified values in the ninth column.
So, for example, if I specify "14" and "15" for the ninth column, then the output will be:
S 235 1365 * 0 * * * 15 1 c81 592
H 235 296 99.7 + 0 0 3I296M1066I 14 1 s15018 1
H 235 719 95.4 + 0 0 174D545M820I 15 1 c2664 10
H 235 764 99.1 + 0 0 55I764M546I 15 1 c6519 4
H 235 792 100 + 0 0 180I792M393I 14 1 c407 107
The 6th and 8th lines have "15" in their ninth column, but other lines in that set (second column: 236) have values other than "14" or "15", so I do not want to extract those lines.
$ cat tst.awk
$2 != prevPivot { prtCurrSet() }
$9 !~ /^1[45]$/ { isBadSet = 1 }
{ currSet = currSet $0 ORS; prevPivot = $2 }
END { prtCurrSet() }

function prtCurrSet() {
    if ( !isBadSet ) {
        printf "%s", currSet
    }
    currSet = ""
    isBadSet = 0
}
$ awk -f tst.awk file
S 235 1365 * 0 * * * 15 1 c81 592
H 235 296 99.7 + 0 0 3I296M1066I 14 1 s15018 1
H 235 719 95.4 + 0 0 174D545M820I 15 1 c2664 10
H 235 764 99.1 + 0 0 55I764M546I 15 1 c6519 4
H 235 792 100 + 0 0 180I792M393I 14 1 c407 107
I'm not completely sure about the complete requirement, but based on your expected output, could you please try the following.
awk '$2 == 235 && ($9 == 14 || $9 == 15)' Input_file
Output will be as follows.
S 235 1365 * 0 * * * 15 1 c81 592
H 235 296 99.7 + 0 0 3I296M1066I 14 1 s15018 1
H 235 719 95.4 + 0 0 174D545M820I 15 1 c2664 10
H 235 764 99.1 + 0 0 55I764M546I 15 1 c6519 4
H 235 792 100 + 0 0 180I792M393I 14 1 c407 107
Short awk expression:
awk '$2==235 && $9~/^1[45]$/' file
$9~/^1[45]$/ - ensures that the 9th field matches either 14 or 15
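The one-liners above hard-code $2 == 235, so they only extract a single known set. A sketch that generalizes the set-based tst.awk answer, passing the allowed ninth-column values in as a regex variable (the ok variable name is my own), would extract every qualifying set:

```shell
# Print each run of consecutive lines sharing $2, but only when every line
# in that run has a 9th field matching the regex passed in via "ok".
awk -v ok='^1[45]$' '
$2 != prev { if (!bad) printf "%s", set; set = ""; bad = 0 }
$9 !~ ok   { bad = 1 }
           { set = set $0 ORS; prev = $2 }
END        { if (!bad) printf "%s", set }
' file
```

Like tst.awk, this assumes lines with the same second column are consecutive in the file.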

exchange columns based on some conditions

I have a text file with 5 columns. If the number in the 5th column is less than the one in the 3rd column, swap the 4th and 5th columns with the 2nd and 3rd columns. If the number in the 5th column is greater than or equal to the one in the 3rd column, leave the line as it is.
1EAD A 396 B 311
1F3B A 118 B 171
1F5V A 179 B 171
1G73 A 162 C 121
1BS0 E 138 G 230
Desired output
1EAD B 311 A 396
1F3B A 118 B 171
1F5V B 171 A 179
1G73 C 121 A 162
1BS0 E 138 G 230
$ awk '{ if ($5 >= $3) print $0; else print $1"\t"$4"\t"$5"\t"$2"\t"$3; }' foo.txt
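One nit: the snippet above emits unswapped lines with their original spacing but swapped lines tab-separated. A variant (a sketch, assuming the same foo.txt) that produces uniformly tab-separated output by reassigning fields and letting OFS rebuild every line:

```shell
# Swap the (2nd,3rd) and (4th,5th) column pairs when $5 < $3; the $1 = $1
# in the default action forces awk to rebuild unswapped lines with OFS too.
awk 'BEGIN { OFS = "\t" }
     $5 < $3 { t2 = $2; t3 = $3; $2 = $4; $3 = $5; $4 = t2; $5 = t3 }
     { $1 = $1 } 1' foo.txt
```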

How to number the lines according to a field with awk?

I wonder whether there is a way using awk to number the lines according to a field. For example,
Input
2334 332
2334 546
2334 675
7890 222
7890 134
234 45
.
.
.
Based on the 1st field, I would have the following output
Output
1 2334 332
1 2334 546
1 2334 675
2 7890 222
2 7890 134
3 234 45
.
.
.
I would be grateful for your help.
Cheers,
T
Here's how:
awk '!a[$1]++{c++}{print c, $0}' file
1 2334 332
1 2334 546
1 2334 675
2 7890 222
2 7890 134
3 234 45
awk 'last != $1 { line = line + 1 } { last = $1; print line, $0 }'
This increments the counter whenever the 1st field changes from the previous line, so it assumes equal values appear on consecutive lines; if a value reappears later in the file, it gets a new number.