Move new line character 5 positions downstream in a text (fasta) file - awk

I am trying to transform a text file like this (fasta format):
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
The objective is to move each newline character 5 positions downstream, except on lines starting with >:
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
I would like to use AWK, but I am not sure how to proceed. I am thinking about something similar to this:
awk '{for(i=1;i<=NR;i++){ if($1 ~ /^>/){¿?¿?¿?}}}'
Do you know how I can solve this?

Assumptions:
all data lines are to be expanded to a max of 24 characters
One awk idea:
awk -v width=24 '                              # pass width in as awk variable "width"
function print_sequence() {
    if (sequence)                              # if sequence is not blank
       while (sequence) {                      # while sequence is not blank
             print substr(sequence,1,width)    # print 1st "width" characters
             sequence=substr(sequence,width+1) # remove 1st "width" characters
       }
}
/^>/ { print_sequence()                        # flush previous set of data to stdout
       print                                   # print current input line
       next                                    # process next input line
     }
     { sequence=sequence $1 }                  # append data to our "sequence" variable
END  { print_sequence() }                      # flush last set of data to stdout
' fasta.in > fasta.out
This generates:
$ cat fasta.out
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG

I would do it the following way. Let file.txt contain
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
then
awk 'BEGIN{width=24}/^>/&&x{print x;x=""}/^>/{print;next}{x = x $0}length(x)>=width{print substr(x,1,width);x=substr(x,width+1)}END{print x}' file.txt
gives output
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
Explanation: I set width to 24, the desired number of characters per line. If a > header line is found and something is stored in x, print it and reset x to the empty string; then print the header line itself and go to the next line. For every other line, append its content to x; if the length of x is equal to or greater than width, print the first width characters of x and remove them from x. After processing all lines, print whatever is left in x. Disclaimer: at most one output line is printed per input line, so this solution assumes that no input line is longer than the desired width.
(GNU Awk 5.0.1)
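The one-liner above prints at most one chunk per input line. A minimal sketch of a variant that drains the buffer in a while loop, so input lines longer than the target width are handled too; rewrap is a hypothetical helper name, and it is assumed to read the FASTA data from standard input:

```shell
# rewrap: hypothetical helper; reflows FASTA sequence lines read from stdin
# to the width given as $1, leaving ">" header lines untouched.
rewrap() {
  awk -v width="$1" '
    /^>/ {                                # header line
      if (x != "") print x                # flush the leftover of the previous record
      x = ""
      print
      next
    }
    {
      x = x $0                            # append sequence data to the buffer
      while (length(x) >= width) {        # emit every complete chunk, not only one
        print substr(x, 1, width)
        x = substr(x, width + 1)
      }
    }
    END { if (x != "") print x }          # flush the final partial line, if any
  '
}

# usage (file names are placeholders): rewrap 24 < file.txt > file.out
```

Checking for an empty buffer before printing also avoids a blank line when a sequence length is an exact multiple of the width.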

Yet another approach you could try, using awk's field and record separators:
awk -v width=24 '
BEGIN {
  FS="\n"        # Set the field separator to newline
  RS=">"         # Set the record separator to ">"
  ORS=OFS=""     # Set the output record and field separators to an empty string
}
NR>1 {           # With ">" as the record separator the first record is empty, so skip it
  header=$1      # With "\n" as the field separator, $1 contains the header; save it in a variable
  $1=OFS         # Assign an empty string to $1 so that $0 is rebuilt: it now holds only the body,
                 # with all newlines removed, since OFS == ""
  gsub(".{" width "}", "&" FS)   # Insert a newline (FS) after every "width" characters
  print RS header FS $0 FS       # Print a ">", the header, a newline, the body and a newline
}
' fasta_in > fasta_out

Assuming the line that starts with > is never more than 24 chars long:
$ awk '{printf "%s", (/^>/ ? sep $0 ORS : $0); sep=ORS} END{print ""}' file | fold -w24
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG

Extract first position of a regex match grep

Good morning everyone,
I have a text file containing multiple lines. I want to find a regular pattern inside it and print its position using grep.
For example:
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
I want to find L[any_letter]T in the file and print the position of the L together with the three-letter code. In this case the result would be:
11 LIT
8 LAT
4 LKT
I wrote a grep command, but it doesn't return what I need. The command is:
grep -E -boe "L.T" file.txt
It returns:
11:LIT
21:LAT
30:LKT
Any help would be appreciated!!
Awk suits this better:
awk 'match($0, /L[[:alpha:]]T/) {
print RSTART, substr($0, RSTART, RLENGTH)}' file
11 LIT
8 LAT
4 LKT
This is assuming only one such match per line.
If there can be multiple overlapping matches per line then use:
awk '{
n = 0
while (match($0, /L[[:alpha:]]T/)) {
n += RSTART
print n, substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + 1)
}
}' file
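To see why the loop steps forward by a single character rather than past the whole match: in a string such as LLTT the matches LLT and LTT overlap. A minimal sketch of the same loop working on a copy of the line, so $0 stays intact; find_lxt, line and off are hypothetical names, and [A-Za-z] stands in for [[:alpha:]] on the assumption that only ASCII letters occur:

```shell
# find_lxt: hypothetical helper; reads lines from stdin and prints the
# 1-based position and text of every L<letter>T match, overlapping ones included.
find_lxt() {
  awk '{
    line = $0; off = 0                        # copy of the line, characters dropped so far
    while (match(line, /L[A-Za-z]T/)) {
      print off + RSTART, substr(line, RSTART, RLENGTH)
      off += RSTART                           # RSTART characters are about to be dropped
      line = substr(line, RSTART + 1)         # continue one character after the match start
    }
  }'
}

# usage (file name is a placeholder): find_lxt < file
```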
With your shown samples, please try the following awk code. It uses only index() and substr(), so it should work in any awk.
awk '
{
  line=$0; off=0
  while(ind=index(line,"L")){
    if(substr(line,ind+2,1)=="T" && substr(line,ind+1,1) ~ /[a-zA-Z]/){
      print off+ind,substr(line,ind,3)
    }
    line=substr(line,ind+1)
    off+=ind
  }
}' Input_file
Explanation: Adding detailed explanation for above code.
awk '                                    ##Starting awk program from here.
{
  line=$0; off=0                         ##Copying current line into line and resetting off, the count of characters already removed.
  while(ind=index(line,"L")){            ##Run while loop as long as letter L is found in line (its index is stored into ind variable).
    if(substr(line,ind+2,1)=="T" && substr(line,ind+1,1) ~ /[a-zA-Z]/){  ##Checking condition if character 2 positions after L is T AND character next to L is a letter.
      print off+ind,substr(line,ind,3)   ##Printing position in the original line (off+ind) along with 3 letters from index of L eg:(LIT).
    }
    line=substr(line,ind+1)              ##Keeping only the part of line after the L just examined.
    off+=ind                             ##Adding ind (number of removed characters) to off.
  }
}' Input_file                            ##Mentioning Input_file name here.
Building on the answer of @anubhava, you can also add RSTART + RLENGTH and use that as the start for substr(), which finds multiple non-overlapping matches per line and per word.
On every iteration the while loop replaces the current line with the part that follows the last match, up to the end of the string. The variable pos counts the characters removed so far, so the printed positions refer to the original line.
Note that . in a regex matches any character, not only letters, hence the bracket expression.
awk '{
  pos = 0
  while (match($0, /L[a-zA-Z]T/)) {
    print pos + RSTART, substr($0, RSTART, RLENGTH)
    pos += RSTART + RLENGTH - 1
    $0 = substr($0, RSTART + RLENGTH)
  }
}' file
If file contains
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
ARTGHFRHOPLITLOT LATTELET
LUT
The output is
11 LIT
8 LAT
4 LKT
11 LIT
14 LOT
18 LAT
23 LET
1 LUT

How to print the initials with awk

I have this input text file:
Pedro Paulo da Silva
22 years old
Brazil

Bruce Mackenzie
30 years old
United States of America

Lee Dong In
26 years old
South Korea
The name of the person is always on the first line of the file (and on the first line after each empty line, \n).
I need to produce this output (ignoring everything except the names in the first lines):
PedroPdS
BruceM
LeeDI
I don't know how to do that with awk. I just know that awk '{print $number}' prints column $number, and that's how I'm supposed to grab their names.
I've searched here and found this: sed -e 's/$/ /' -e 's/\([^ ]\)[^ ]* /\1/g' -e 's/^ *//'
But I have to use awk.
Would you please try the following:
awk -v RS="" -F '\n' ' # records are separated on blank lines by setting RS to null
{
n = split($1, b, " ") # split the name on spaces
init = b[1] # the first name
for (i = 2; i <= n; i++) # loop over the remaining
init = init substr(b[i], 1, 1) # append the initial
print init
}' input.txt
Output:
PedroPdS
BruceM
LeeDI
With your shown samples, please try the following.
awk '
!NF{
count=0
next
}
++count==1{
printf("%s%s",$1,NF==1?ORS:"")
for(i=2;i<=NF;i++){
printf("%s%s",substr($i,1,1),i==NF?ORS:"")
}
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
!NF{ ##Checking if line is empty then do following.
count=0 ##Setting count to 0 here.
next ##next will skip all further statements from here.
}
++count==1{ ##Checking condition if count is 1 then do following.
printf("%s%s",$1,NF==1?ORS:"") ##Using printf to print $1 followed by new line OR nothing.
for(i=2;i<=NF;i++){ ##Starting a for loop here.
printf("%s%s",substr($i,1,1),i==NF?ORS:"") ##Using printf to print the 1st character of each field from field 2 to the last field, followed by a new line after the last field.
}
}' Input_file ##Mentioning Input_file name here.
I would use GNU AWK for this task in the following way. Let file.txt contain
Pedro Paulo da Silva
22 years old
Brazil

Bruce Mackenzie
30 years old
United States of America

Lee Dong In
26 years old
South Korea
then
awk '{if(prevline==""){print gensub(/ ([[:alpha:]])[[:alpha:]]+/, "\\1", "g")};prevline=$0}' file.txt
output
PedroPdS
BruceM
LeeDI
Explanation: there are two things to do: first select which lines to print, then turn their content into initials. For the first, I check whether the previous line (prevline) is an empty string. AWK treats a variable that has not been set as an empty string when comparing it with a string, so the condition is met for the first line. After processing each line I set prevline to the line content ($0), so on the next line it holds the previous line. For the conversion into initials I use the gensub function: I instruct AWK to replace each space-letter-letters sequence with just its first letter, and print the changed line.
(tested in gawk 4.2.1)
$ cat input
Pedro Paulo da Silva
22 years old
Brazil

Bruce Mackenzie
30 years old
United States of America

Lee Dong In
26 years old
South Korea
$ awk '!a{ printf "%s", $1;
for( i = 2; i <= NF; i++ ) printf("%c", $i);
printf "\n"; a=1}
/^$/{a=0}' input
PedroPdS
BruceM
LeeDI
You can try this:
awk -F ' ' 'BEGIN {X = 1} NR == X{print $1 substr($2, 1, 1) substr($3, 1, 1) substr($4, 1, 1); X += 4}'
Another potential option is:
awk '/[0-9]/{print p} {p=$1 substr($2, 1, 1) substr($3, 1, 1) substr($4, 1, 1) substr($5, 1, 1)}' file
Another variation with a mix from the existing answers.
awk '{
if (!x) { # Variable x is empty at the start or set to empty line
res=$1 # Set res to field 1
for(i=2; i<=NF;i++) { # Loop the rest of the fields starting at field 2
res = res substr($i, 1, 1) # Concat the first char from each field with res
}
print res
}
x=$0 # Set x variable to the value of the current line
}
' file
Output
PedroPdS
BruceM
LeeDI
With GNU awk in paragraph mode and using gensub() function you can get it:
awk 'BEGIN {RS = ""; FS = "\n"} {print gensub(/([[:space:]])([[:alpha:]]{1})([^[:space:]+])+/,"\\2","g",$1)}' file
PedroPdS
BruceM
LeeDI
Yet another. It turned out a bit like @WilliamPursell's, though (++):
$ awk '!p{for(i=1;i<=NF;i++)printf (i==1?"%s%s":"%c%s"),$i,(i==NF?ORS:"")}{p=NF}' file
Output:
PedroPdS
BruceM
LeeDI
"Explained":
$ awk '
!p { # if previous record empty
for(i=1;i<=NF;i++) # process record for ...
printf (i==1?"%s%s":"%c%s"),$i,(i==NF?ORS:"") # ... output
}
{ p=NF }' file # store field count

Concatenate columns and adds digits awk

I have a csv file:
number1;number2;min_length;max_length
"40";"1801";8;8
"40";"182";8;8
"42";"32";6;8
"42";"4";6;6
"43";"691";9;9
I want the output be:
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
So the new file will consist of:
column_1 = a concatenation of old_column_1 + old_column_2 + a number of "0" equal to (old_column_3 - length of old_column_2)
column_2 = a concatenation of old_column_1 + old_column_2 + a number of "9" equal to (old_column_3 - length of old_column_2)
This applies when min_length = max_length. When min_length is not equal to max_length, I need to take all the possible lengths into account. So for the line "42";"32";6;8 the lengths are 6, 7 and 8.
Also, I need to delete the quotation marks everywhere.
I tried with paste and cut like this:
paste -d ";" <(cut -f1,2 -d ";" < file1) > file2
for the concatenation of the first 2 columns, but I think it's easier with awk. However, I can't figure out how to do it. Any help is appreciated. Thanks!
Edit: I have added column 4 to the input.
You may use this awk:
awk 'function padstr(ch, len, s) {
s = sprintf("%*s", len, "")
gsub(/ /, ch, s)
return s
}
BEGIN {
FS=OFS=";"
}
{
gsub(/"/, "");
for (i=0; i<=($4-$3); i++) {
d = $3 - length($2) + i
print $1 $2 padstr("0", d), $1 $2 padstr("9", d)
}
}' file
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
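If the input really contains the header row shown in the question, that row would also pass through the main block. A minimal sketch that skips it, assuming the header is always the first line and is the only line containing letters; expand_ranges and rep are hypothetical names, and rep builds the padding with a plain loop so the sketch does not rely on the %*s form of sprintf:

```shell
# expand_ranges: hypothetical helper; reads the semicolon-separated file from
# stdin and prints one "low;high" range per allowed length.
expand_ranges() {
  awk '
    function rep(ch, n,    s) {            # return ch repeated n times
      s = ""
      while (n-- > 0) s = s ch
      return s
    }
    BEGIN { FS = OFS = ";" }
    NR == 1 && /[A-Za-z]/ { next }         # assumption: a header row is line 1 and has letters
    {
      gsub(/"/, "")                        # drop the quotation marks
      for (len = $3 + 0; len <= $4 + 0; len++) {
        d = len - length($2)               # number of digits to append
        print $1 $2 rep("0", d), $1 $2 rep("9", d)
      }
    }'
}

# usage (file name is a placeholder): expand_ranges < file
```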
With GNU awk (gensub() is a gawk extension):
awk '
BEGIN{FS = OFS = ";"}            # set field separator and output field separator to ";"
{
  $0 = gensub("\"", "", "g");    # drop double quotes
  s = $1$2;                      # the range prefix
  for (n = $3; n <= $4; n++) {   # every length from min_length to max_length
    l = 10^(n-length($2));       # 10 raised to the number of digits to append
    print s*l, (s+1)*l-1;        # appending k zeros is multiplication by 10^k;
                                 # appending k nines gives s*10^k + (10^k - 1) = (s+1)*10^k - 1
  }
}' input.txt
Explanation inline as comments.

Print all the files which are at maximum depth in a directory

I want to print all the files that are at the maximum depth.
For example, given:
abc/1/2/3/4/r.txt
abc/1/f1.txt
abc/11/22/44/66/77/f2.txt
abc/11/22/44/66/77/f4.txt
abc/11/22/44/66/77/f5.txt
so this would print
abc/11/22/44/66/77/f2.txt
abc/11/22/44/66/77/f4.txt
abc/11/22/44/66/77/f5.txt
I have written this command:
$ cat listoffiles.txt | awk -F "/" ' { if ( NF > x ) { x = NF; y = $0 } }END{ print y }'
but it prints only the first occurrence.
Keep buffering deepest files and discard them whenever the max depth changes. At the end, dump what's in the buffer.
awk -F'/+' 'NF>max{max=NF;delete buf} NF==max{buf[$0]} END{for(f in buf) print f}' file
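The order in which for (f in buf) visits array elements is unspecified in awk, so the paths may come out in a different order than they went in. A minimal sketch of the same buffering idea that keeps input order by using a numerically indexed array; deepest, keep and n are hypothetical names, and the list of paths is assumed to arrive on standard input:

```shell
# deepest: hypothetical helper; prints the paths with the most "/"-separated
# components, in the order they appeared in the input.
deepest() {
  awk -F'/+' '
    NF > max  { max = NF; n = 0 }          # deeper path found: restart the list
    NF == max { keep[++n] = $0 }           # remember paths at the current maximum depth
    END { for (i = 1; i <= n; i++) print keep[i] }
  '
}

# usage (file name is a placeholder): deepest < listoffiles.txt
```

Resetting n instead of deleting the array is enough here, because only the first n entries are printed at the end.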

Awk - Substring comparison

Working native bash code :
while read line
do
a=${line:112:7}
b=${line:123:7}
if [[ $a != "0000000" || $b != "0000000" ]]
then
echo "$line" >> FILE_OT_YHAV
else
echo "$line" >> FILE_OT_NHAV
fi
done <$FILE_IN
I have the following file (it's a dummy). The substrings being checked are both in the 4th field, so never mind the exact numbers.
AAAAAAAAAAAAAA XXXXXX BB CCCCCCC 12312312443430000000
BBBBBBB AXXXXXX CC DDDDDDD 10101010000000000000
CCCCCCCCCC C C QWEQWEE DDD AAAAAAA A12312312312312310000
I'm trying to write an awk script that compares two specific substrings: if either one is not 0000000 it outputs the line into file A, and if both of them are 0000000 it outputs the line into file B. This is the code I have so far:
# Before first line.
BEGIN {
print "Awk Started"
FILE_OT_YHAV="FILE_OT_YHAV.test"
FILE_OT_NHAV="FILE_OT_NHAV.test"
FS=""
}
# For each line of input.
{
fline=$0
# print "length = #" length($0) "#"
print "length = #" length(fline) "#"
print "##" substr($0,112,7) "##" substr($0,123,7) "##"
if ( (substr($0,112,7) != "0000000") || (substr($0,123,7) != "0000000") )
print $0 > FILE_OT_YHAV;
else
print $0 > FILE_OT_NHAV;
}
# After last line.
END {
print "Awk Ended"
}
The problem is that when I run it, it:
a) treats every line as having a different length
b) therefore applies the substrings to different parts of each line (that is why I added the length printing before the if, to check on it).
This is a sample output of the line length awk reads and the different substrings :
Awk Started
length = #130#
## ## ##
length = #136#
##0000000##22016 ##
length = #133#
##0000001##16 ##
length = #129#
##0010220## ##
length = #138#
##0000000##1022016##
length = #136#
##0000000##22016 ##
length = #134#
##0000000##016 ##
length = #137#
##0000000##022016 ##
Is there a reason why awk treats lines of the same length as having a different length? Does it have something to do with the spacing of the input file?
Thanks in advance for any help.
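One detail worth checking when comparing the two versions: bash's ${line:112:7} counts offsets from 0, while awk's substr() counts from 1, so substr($0,112,7) starts one character earlier than the bash loop does. A minimal sketch that mirrors the bash offsets, assuming fixed-width lines of single-byte characters; split_by_substr and its three arguments are hypothetical:

```shell
# split_by_substr: hypothetical helper.
#   $1 = input file, $2 = file for lines with a non-zero substring,
#   $3 = file for lines where both substrings are 0000000
split_by_substr() {
  awk -v yes="$2" -v no="$3" '
    {
      a = substr($0, 113, 7)               # same characters as bash ${line:112:7}
      b = substr($0, 124, 7)               # same characters as bash ${line:123:7}
      if (a != "0000000" || b != "0000000") {
        print > yes
      } else {
        print > no
      }
    }' "$1"
}

# usage: split_by_substr "$FILE_IN" FILE_OT_YHAV.test FILE_OT_NHAV.test
```

This does not explain the differing line lengths reported by length(); it only makes the awk offsets agree with the bash ones.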
After the comments about cleaning the file up with sed, I got this output (and yes, the lines now have different sizes):
1 0M-DM-EM-G M-A.M-E. #DEH M-SM-TM-OM-IM-WM-EM-IM-A M-DM-V/M-DM-T/M-TM-AM-P 01022016 $
2 110000080103M-CM-EM-QM-OM-MM-TM-A M-A. 6M-AM-HM-GM-MM-A 1055801001102 0000120000012001001142 19500000120 0100M-D000000000000000000000001022016 $
3 110000106302M-TM-AM-QM-EM-KM-KM-A 5M-AM-AM-HM-GM-MM-A 1043801001101 0000100000010001001361 19500000100M-IM-SM-O0100M-D000000000000000000000001022016 $
4 110000178902M-JM-AM-QM-AM-CM-IM-AM-MM-MM-G M-KM-EM-KM-AM-S 71M-AM-HM-GM-MM-A 1136101001101 0000130000013001006061 19500000130 0100M-D000000000000000000000001022016 $