find match, print first occurrence and continue until the end of the file - awk - awk

I have a pretty large file from which I'd like to extract only the first line of those containing my match and then continuing doing that until the end of the file. Example of input and desired output below
Input
C,4,2,5,6,8,9,5
C,4,5,4,5,4,43,6
S,4,23,567,2,4,5
S,23,4,7,78,8,9,6
S,3,5,67,8,54,56
S,4,8,9,54,3,4,52
E,2,3,213,5,8,44
E,5,7,9,67,89,33
E,54,526,54,43,53
S,9,8,9,7,9,32,4
S,5,6,4,5,67,87,88
S,4,23,5,8,5,7,3
E,4,6,4,8,9,32,23
E,43,7,1,78,9,8,65
Output
S,4,23,567,2,4,5
S,9,8,9,7,9,32,4
The match in my lines is S, which usually comes after a line that starts with either E or C. What I'm struggling with is to tell awk to print only the first line after those with E or C. Another way would be to print the first of the bunch of lines containing S. Any idea??

does this one-liner help?
awk '/^S/&&!i{print;i=!i}!/^S/{i=!i}' file
or more "readable":
awk -v p=1 '/^S/&&p{print;p=0}!/^S/{p=1}' file

You can use sed, like this:
sed -rn '/^(E|C)/{:a;n;/^S/!ba;p}' file

here's a multi liner to enter in a file (e.g. u.awk)
/^[CE]/ {ON=1; next}
/^S/ {if (ON) print}
{ON=0}
then run : "awk -f u.awk inputdatafile"

awk to the rescue!
$ awk '/^[CE]/{p=1} /^S/&&p{p=0;print}' file
S,4,23,567,2,4,5
S,9,8,9,7,9,32,4

$ awk '/^S/{if (!f) print; f=1; next} {print; f=0}' file
C,4,2,5,6,8,9,5
C,4,5,4,5,4,43,6
S,4,23,567,2,4,5
E,2,3,213,5,8,44
E,5,7,9,67,89,33
E,54,526,54,43,53
S,9,8,9,7,9,32,4
E,4,6,4,8,9,32,23
E,43,7,1,78,9,8,65

Related

How can i merge every block of 3 lines together while ignoring lower numbers of consecutive lines?

I have a text file like the following, contains blocks of text, blocks are in multiples of 3 lines or just 1 line:
AAAAAAAAAAAAA
BBBBBBBBBBBBB
CCCCCCCCCCCCC
DDDDDDDDDDDDD
EEEEEEEEEEEEE
FFFFFFFFFFFFF
GGGGGGGGGGGGG
HHHHHHHHHHHHH
IIIIIIIIIIIII
JJJJJJJJJJJJJ
KKKKKKKKKKKKK
LLLLLLLLLLLLL
MMMMMMMMMMMMM
NNNNNNNNNNNNN
OOOOOOOOOOOOO
PPPPPPPPPPPPP
QQQQQQQQQQQQQ
RRRRRRRRRRRRR
SSSSSSSSSSSSS
TTTTTTTTTTTTT
UUUUUUUUUUUUU
VVVVVVVVVVVVV
WWWWWWWWWWWWW
XXXXXXXXXXXXX
YYYYYYYYYYYYY
ZZZZZZZZZZZZZ
1111111111111
I would like to merge every block of 3 consecutive lines together, starting with the first in the block. I want to ignore lines that are in less then a block of 3 consecutive lines.
Characters and lengths of lines are always different. ( i have made the lines the same size in the example so it doesn't look too ugly).
So the output would be
AAAAAAAAAAAAA BBBBBBBBBBBBB CCCCCCCCCCCCC
DDDDDDDDDDDDD EEEEEEEEEEEEE FFFFFFFFFFFFF
GGGGGGGGGGGGG
HHHHHHHHHHHHH IIIIIIIIIIIII JJJJJJJJJJJJJ
KKKKKKKKKKKKK
LLLLLLLLLLLLL MMMMMMMMMMMMM NNNNNNNNNNNNN
OOOOOOOOOOOOO PPPPPPPPPPPPP QQQQQQQQQQQQQ
RRRRRRRRRRRRR SSSSSSSSSSSSS TTTTTTTTTTTTT
UUUUUUUUUUUUU
VVVVVVVVVVVVV WWWWWWWWWWWWW XXXXXXXXXXXXX
YYYYYYYYYYYYY ZZZZZZZZZZZZZ 1111111111111
I have tried to use
xargs -n3
However im not sure how to ignore singular lines
How can i acheive this?
With GNU awk for gensub():
$ awk -v RS= -v ORS='\n\n' '{$1=$1; print gensub(/(([^ ]+ ){2}[^ ]+) /,"\\1\n","g")}' file
AAAAAAAAAAAAA BBBBBBBBBBBBB CCCCCCCCCCCCC
DDDDDDDDDDDDD EEEEEEEEEEEEE FFFFFFFFFFFFF
GGGGGGGGGGGGG
HHHHHHHHHHHHH IIIIIIIIIIIII JJJJJJJJJJJJJ
KKKKKKKKKKKKK
LLLLLLLLLLLLL MMMMMMMMMMMMM NNNNNNNNNNNNN
OOOOOOOOOOOOO PPPPPPPPPPPPP QQQQQQQQQQQQQ
RRRRRRRRRRRRR SSSSSSSSSSSSS TTTTTTTTTTTTT
UUUUUUUUUUUUU
VVVVVVVVVVVVV WWWWWWWWWWWWW XXXXXXXXXXXXX
YYYYYYYYYYYYY ZZZZZZZZZZZZZ 1111111111111
In awk:
$ awk -v FS="\n" -v RS="" '{for(i=1;i<=NF;i+=3)print $i,$(i+1),$(i+2);print ""}' file
Output:
AAAAAAAAAAAAA BBBBBBBBBBBBB CCCCCCCCCCCCC
DDDDDDDDDDDDD EEEEEEEEEEEEE FFFFFFFFFFFFF
GGGGGGGGGGGGG
HHHHHHHHHHHHH IIIIIIIIIIIII JJJJJJJJJJJJJ
...
Update Version that won't leave trailing space:
$ awk -v FS="\n" -v RS="" '{for(i=1;i<=NF;i++)printf "%s%s",$i,(i%3==0||i==NF?ORS:OFS);print ""}' file
Please see discussion on some features in the comments. Thanks to the commentators for the constructive feedback.
Here is a different which will always work:
awk '(NF==0){print rec ORS; rec="";c=0; next}
{rec = rec (c ? (c%3==0 ? ORS : OFS) : "") $0; c++ }
END {print rec}' file
This might work for you (GNU sed):
sed '/\S/{N;/\n\s*$/b;N;//b;s/\n/ /g}' file
If the current line is not empty, append the next line.
If the appended line is not empty, append the next line.
If that line is also not empty, replace the newlines by spaces.
In all other cases print the line(s) as is.
An alternative, that is more programmatic:
sed ':a;N;s/\n/&/2;Ta;/^\s*$/M{P;D};s/\n/ /g' file

Concatenate the sequence to the ID in fasta file

Here is my input file
>OTU1;size=4;
ATTCCGGGTTTACT
ATTCCTTTTATCGA
ATC
>OTU2;size=10;
CGGATCTAGGCGAT
ACT
>OTU3;size=5;
ATTCCCGGGATCTA
ACTTTTC
The expected output file is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
I've tried the code from Remove line breaks in a FASTA file
but this doesn't work for me, and I am not sure how to modify the code from that post...
Any suggestion? Thanks in advance!
Here is another awk script. Using the awk internal parsing mechanism.
awk 'BEGIN{RS=">";OFS="";}NR>1{$1=$1;print ">"$0}' input.txt
Output is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
Explanation:
awk '
BEGIN { # initialize awk internal variables
RS=">"; # set `RS`=record separator to `>`
OFS=""; # set `OFS`=output field separator to empty string.
}
NR>1 { # handle from 2nd record (1st record is empty).
$1=$1; # regenerate the output line
print ">"$0 # print out ">" with computed output line
}' input.txt
$ awk '{printf "%s%s", (/^>/ ? ors : ""), $0; ors=ORS} END{print ""}' file
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
Could you please try following too.
awk -v RS=">" 'NR>1{gsub(/\n/,"");print ">"$0}' Input_file
My original attempt was awk -v RS=">" -v FS="\n" -v OFS="" 'NF>1{$1=$1;print ">"$0}' Input_file but later I saw it is already answered buy dudi boy so written another(first mentioned) one.
Similar to my answer here:
$ awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
(FNR==1){next}
{ name=$1; seq=$0; gsub(/(^[^\n]*|)\n/,"",seq) }
{ print ">" name seq }' file1.fasta file2.fasta file3.fasta ...

Print line modified and the line after using awk

I want to modify lines in a file using awk and print the new lines with the following line.
My file is like this
Name_Name2_ Name3_Name4
ASHRGSJFSJRGDJRG
Name5_Name6_Name7_Name8
ADGTHEGHGTJKLGRTIWRK
I want
Name-Name2
ASHRGSJFSJRGDJRG
Name5-Name6
ADGTHEGHGTJKLGRTIWRK
I have sued awk to modify my file:
awk -F'_' {print $1 "-" $2} file > newfile
but I don't know how to tell to print also the line just after (ABDJRH)
sure is it possible with awk x=NR+1 NR<=x
thanks
Following awk may help you on same.
awk -F"_" '/_/{print $1"-"$2;next} 1' Input_file
assuming your structure in sample (no separation in line with "data" letter )
awk '$0=$1' Input_file
# or with sed
sed 's/[[:space:]].*//' Input_file

Grep part of string after symbol and shuffle columns

I would like to take the number after the - sign and put is as column 2 in my matrix. I know how to grep the string but not how to print it after the text string.
in:
1-967764 GGCTGGTCCGATGGTAGTGGGTTATCAGAACT
3-425354 GCATTGGTGGTTCAGTGGTAGAATTCTCGCC
4-376323 GGCTGGTCCGATGGTAGTGGGTTATCAGAAC
5-221398 GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT
6-180339 TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT
out:
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
awk -F'[[:space:]-]+' '{print $3,$2}' file
Seems like a simple substitution should do the job:
sed -E 's/[0-9]+-([0-9]+)[[:space:]]*(.*)/\2 \1/' file
Capture the parts you're interested in and use them in the replacement.
Alternatively, using awk:
awk 'sub(/^[0-9]+-/, "") { print $2, $1 }' file
Remove the leading digits and - from the start of the line. When this is successful, sub returns true, so the action is performed, printing the second field, followed by the first.
Using regex ( +|-) as field separator:
$ awk -F"( +|-)" '{print $3,$2}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
here is another awk
$ awk 'split($1,a,"-") {print $2,a[2]}' file
awk '{sub(/.-/,"");print $2,$1}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339

AWK - get value between two strings over multiple lines

input.txt:
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
AWK command:
awk '/>block2.*>/' input.txt
Expected output
222222222222222222222
However, AWK is returning nothing. What am I misunderstanding?
Thanks!
If you want to print the line after the line containing >block2, then you could use:
awk '/^>block2$/ { nr=NR+1 } NR == nr { print }'
Track the record number plus 1 when you find the match; when the current record number matches the remembered one, print the current record.
If you want all the lines between the line >block2 and >block3, then you'd use:
awk '/^>block2$/,/^>block3/ {if ($0 !~ /^>block[23]$/) print }'
For all lines between the two markers, if the line doesn't match either marker, print it. The output is the same with the sample data file.
another awk
$ awk 'c&&c--; /^>block2/{c=1}' file
222222222222222222222
c specifies how many lines you want to print after the match. If you want the text between two markers
$ awk '/^>block3/{exit} s; /^>block2/{s=1}' file
222222222222222222222
if there are multiple instances and you want them all, just change exit to s=0
You probably meant:
$ awk '/>/{f=/^>block2$/;next} f' file
222222222222222222222