Multiple passes with awk and execution order - awk

Two part question:
Part One:
First, I have a sequence AATTCCGG which I want to change to TTAAGGCC. I used gsub to change A to T, C to G, G to C and T to A. Unfortunately awk executes these substitutions sequentially, so I ended up with AAAACCCC. I got around this by using upper and lower case, then converting back to upper case, but I would like to do this in a single step if possible.
example:
echo AATTCCGG | awk '{gsub("A","T",$1);gsub("T","A",$1);gsub("C","G",$1);gsub("G","C",$1);print $0}'
OUTPUT:
AAAACCCC
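For reference, the upper/lower-case workaround mentioned above can be sketched roughly as follows. The helper name complement_two_step is made up for illustration. The idea is only that each replacement is written in lower case, so a later gsub cannot match it again, and toupper() restores the case at the end.

```shell
# Hypothetical helper illustrating the two-step workaround from the question.
complement_two_step() {
  awk '{
    gsub("A", "t"); gsub("T", "a")   # lower-case results are invisible to later gsub calls
    gsub("C", "g"); gsub("G", "c")
    print toupper($0)                # convert everything back to upper case
  }'
}

echo AATTCCGG | complement_two_step
```

With the input above this should print TTAAGGCC.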
Part Two:
Is there a way to get awk to run to the end of a file for one set of instructions before starting a second set? I tried some of the following, but with no success
for the data set
1 A
2 B
3 C
4 D
5 E
I am using the following pipe to get the data I want (Just an example)
awk '{if ($1%2==0)print $1,"E";else print $0}' test | awk '{if ($1%2==0 && $2=="E") print $0}'
I am using a pipe to rerun the program, however I have found that it is quicker if I don't have to rerun the program.

This can be efficiently solved with tr:
$ echo AATTCCGG | tr ATCG TAGC
Regarding part two (this should really be a separate question): awk has no construct that restarts its main loop over the same input, so a pipe is the straightforward way to go.

for part two, try this command:
awk '{if ($1%2==0)print $1,"E"}' test
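If a genuine second pass is needed (one set of rules over the whole file, then another), a commonly used idiom is to name the same file twice on the command line and tell the passes apart with NR==FNR. The following is only a sketch under that assumption. The function name two_pass and the array name mark are invented here, and the NR==FNR test assumes the file is not empty.

```shell
# Hypothetical helper: reads the file named in $1 twice within a single awk process.
two_pass() {
  awk '
    NR == FNR {                           # first pass: true only while reading the first copy
      if ($1 % 2 == 0) mark[$1] = "E"
      next
    }
    ($1 in mark) { print $1, mark[$1] }   # second pass: starts after the whole file was seen once
  ' "$1" "$1"
}
# usage: two_pass test
```

On the five-line data set from the question this should give the same two lines as the original pipe (2 E and 4 E). The file is still read twice, so this mainly saves the second process, not the second read.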

Here is a method I have found for the first part of the question using awk. It uses an array and a for loop.
cat sub.awk
awk '
BEGIN{d["G"]="C";d["C"]="G";d["T"]="A";d["A"]="T";FS="";OFS=""}
{for(i=1;i<(NF+1);i++)
{if($i in d)
$i=d[$i]}
}
{print}'
Input/Output:
ATCG
TAGC

Related

awk - regex to print first matching group

I am trying to get the first matching group based on regex, but it's not printing anything after the second awk command. Not sure what I was doing wrong. Any help is greatly appreciated.
git status | awk 'NR=1' --> Limiting this to print the first line.
On branch TA1692959
git status | awk 'NR=1' | awk '/^On\sbranch\s([\w]*)/{ print $1 }' --> I was trying to get the first word "TA1692959" after "On branch"; this prints nothing.
git status |
{n,m,g}awk 'NR<--NF' FS='^On branch |[^[:alnum:]_].+$' OFS=
TA1241521
If you find yourself passing the data through multiple awk calls then chances are pretty good you can do the same thing with a single awk call, eg:
git status | awk 'NR==1 && /^On branch / {print $3; exit}'
TA1692959
In this case:
there's no need for a regex; otherwise OP should update the question with additional samples showing the need for a regex
the exit is optional and merely allows awk to skip processing the rest of the input stream
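As to why the original attempt printed nothing, three things seem to be going on: NR=1 is an assignment (always true), not the comparison NR==1; \s and \w are not part of POSIX awk regexes, and inside a bracket expression [\w] is not a word class; and $1 is the first field ("On"), not a capture group. If a regex-based version is still wanted, one portable sketch is to strip the prefix with sub() and print what is left. The helper name first_branch is hypothetical.

```shell
# Hypothetical helper: reads `git status`-style text on stdin.
first_branch() {
  # sub() returns 1 only if the prefix was found and removed from $0
  awk 'NR == 1 && sub(/^On branch /, "") { print $1; exit }'
}

printf 'On branch TA1692959\nnothing to commit, working tree clean\n' | first_branch
```

For the sample input this should print TA1692959.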

How to check if a string contains at least one letter different from 4 using bash or awk

How to check that a sequence has at least one letter that is not A, U, C, G characters using awk or bash?
Can it be done without the typical for loop?
Example of sequence:
AUVGAU
If I give this as input, I should get it back, because it contains V.
The input file looks something like this, so I think awk would be better.
>7A0E_1|
AUVGAU
>7A0E_2|
GUCAU
Expected output
>7A0E_1|
AUVGAU
Here is what I tried:
awk '!/^>/ {next}; {getline s}; s !~ /AUGC/ { print $0 "\n" s }' sample
But obviously /AUGC/ is not right... can someone help me with this regex?
I think awk is the tool if you want to conditionally output the > line when the next record contains a character other than A, U, G, or C. You can do that with:
awk '/^>/ {rec=$0; next} /[^AUGC]/ {printf "%s\n%s\n", rec, $0}' sample
In your case that results in:
$ awk '/^>/ {rec=$0; next} /[^AUGC]/ {printf "%s\n%s\n", rec, $0}' sample
>7A0E_1|
AUVGAU
(note: you can use print rec; print instead of printf, but printf above reduced the output to a single call)
Where you ran into trouble was forgetting to save the current record that began with > and then using getline -- which wasn't needed at all.
How to check that a sequence has at least one letter that is not A, U, C, G characters using awk(...)? Can it be done without the typical for loop?
Yes, GNU AWK can do that. Let file.txt content be
AUVGAU
AUCG
(empty line is intentional) then
awk 'BEGIN{FPAT="[^AUCG]"}{print NF>=1}' file.txt
output
1
0
0
Explanation: both solutions count the number of characters that are not one of A, U, C, G. Each such character is treated as constituting a field, and the number of fields (NF) is then checked (>=1). Note that the FPAT solution redefines what a field is; if that is a problem, you might use patsplit instead
awk '{patsplit($0,arr,"[^AUCG]");print length(arr)>=1}' file.txt
(tested in gawk 4.2.1)
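If gawk is not available, the same 1/0 answer can probably be had from a plain regex match, since a line has "at least one other character" exactly when it matches [^AUCG]. A minimal sketch, with the made-up helper name has_other_letter:

```shell
# Hypothetical helper: prints 1 for lines containing a character outside A, U, C, G, else 0.
has_other_letter() {
  awk '{ print ($0 ~ /[^AUCG]/) }'
}

printf 'AUVGAU\nAUCG\n\n' | has_other_letter
```

For the three-line file.txt above (including the empty line) this should print 1, 0, 0.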

grep -v multiple line same time

I would like to filter the lines containing "pattern" and the following 5 lines.
Something like grep -v -A 5 'pattern' myfile.txt with output:
other
other
other
other
other
other
I'm interested in linux shell solutions, grep, awk, sed...
Thx
myfile.txt:
other
other
other
pattern
follow1
follow2
follow3
follow4
follow5
other
other
other
pattern
follow1
follow2
follow3
follow4
follow5
other
other
other
other
other
other
You can use awk:
awk '/pattern/{c=5;next} !(c&&c--)' file
Basically: we decrease the integer c on every line of input and print lines when c is 0. *(see below) Note: c is automatically initialized to 0 by awk on its first use.
When the word pattern is found, we set c to 5, which makes c--<=0 false for 5 lines and keeps awk from printing those lines.
* We could basically use c--<=0 to check whether c is less than or equal to 0. But when there are many(!) lines between the occurrences of the word pattern, c could overflow. To avoid that, oguz ismail suggested implementing the check like this:
!(c&&c--)
This checks whether c is truthy (greater than zero) and only then decrements c. c will never drop below 0 and therefore cannot overflow. Negating the check with !(...) makes awk print the correct lines.
Side-note: Normally you would use the word regexp if you mean a regular expression, not pattern.
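The counter trick can be wrapped so that the regex and the number of trailing lines are parameters. This is only a sketch. The function name skip_after and the variable names re and n are assumptions, not part of the answer above.

```shell
# Hypothetical helper: skip_after REGEX N  (reads stdin)
# Drops every line matching REGEX plus the N lines that follow it.
skip_after() {
  awk -v re="$1" -v n="$2" '
    $0 ~ re { c = n; next }   # matching line: arm the counter and drop the line
    !(c && c--)               # print only while the counter is zero
  '
}

printf 'a\npattern\nf1\nf2\nb\n' | skip_after pattern 2
```

With the small input shown this should print a and b only.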
With GNU sed (should be okay as Linux is mentioned by OP)
sed '/pattern/,+5d' ip.txt
which deletes the lines matching the given regex and 5 lines that follow
I did it using this:
head -$(wc -l myfile.txt | awk '{print $1-5 }') myfile.txt | grep -v "whatever"
which means:
wc -l myfile.txt : how many lines (but it also shows the filename)
awk '{print $1}' : only show the amount of lines
awk '{print $1-5 }' : we don't want the last five lines
head ... : show the first ... lines (which means, leave out the last five)
grep -v "..." : this part you know :-)

Looks for patterns across different lines

I have a file like this (test.txt):
abc
12
34
def
56
abc
ghi
78
def
90
And I would like to search the 78 which is enclosed by "abc\nghi" and "def". Currently, I know I can do this by:
cat test.txt | awk '/abc/,/def/' | awk '/ghi/,/def/'
Is there any better way?
One way is to use flags
$ awk '/ghi/ && p~/abc/{f=1} f; /def/{f=0} {p=$0}' test.txt
ghi
78
def
{p=$0} this will save input line for future use
/ghi/ && p~/abc/{f=1} set flag if current line contains ghi and previous line contains abc
f; print input record as long as flag is set
/def/{f=0} clear the flag if line contains def
If you only want the lines between these two boundaries
$ awk '/ghi/ && p~/abc/{f=1; next} /def/{f=0} f; {p=$0}' ip.txt
78
$ awk '/12/ && p~/abc/{f=1; next} /def/{f=0} f; {p=$0}' ip.txt
34
See also How to select lines between two patterns?
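The flag approach can also be written with exact string comparisons instead of regex matches, which avoids accidental hits on lines that merely contain abc, ghi or def. A sketch under that assumption follows. The function name between and the variable names prev and inside are invented for illustration.

```shell
# Hypothetical helper: between FILE
# Prints the lines after an "abc" line directly followed by a "ghi" line, up to the next "def" line.
between() {
  awk '
    $0 == "ghi" && prev == "abc" { inside = 1; prev = $0; next }   # two-line start marker seen
    $0 == "def"                  { inside = 0 }                    # end marker closes the block
    inside                       { print }
                                 { prev = $0 }                     # remember the line for the next record
  ' "$1"
}
# usage: between test.txt
```

On the test.txt from the question this should print 78.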
This is not really clean, but you can redefine your record separator as a regular expression, abc\nghi\n|\ndef. This, however, creates multiple records, and you need to keep track of which ones sit between the right separators. With GNU awk you can check which text matched RS using RT.
awk 'BEGIN{RS="abc\nghi\n|\ndef"}
(RT~/abc/){s=1; next}
(s==1)&&(RT~/def/){print $0}
{s=0}' file
This does :
set RS to abc\nghi\n or \ndef.
check if the record is found, if RT contains abc you found the first one.
if you found the first one and the next RT contains def, then print.
grep alternative
$ grep -Pazo '(?s)(?<=abc\nghi)(.*)(?=def)' file
but I think awk will be better
You could do this with sed. It's not ideal in that it doesn't actually understand records, but it might work for you...
sed -Ene 'H;${x;s/.*\nabc\nghi\n([0-9]+)\ndef\n.*/\1/;p;}' input.txt
Here's what's basically going on:
H - appends the current line to sed's "hold space"
${ - specifies the start of a series of commands that will be run once we come to the end of the file
x - swaps the hold space with the pattern space, so that future substitutions will work on what was stored using H
s/../../ - analyses the pattern space (which is now multi-line), capturing the data specified in your question, replacing the entire pattern space with the bracketed expression...
p - prints the result.
One important factor here is that the regular expression is ERE, so the -E option is important. If your version of sed uses some other option to enable support for ERE, then use that option instead.
Another consideration is that the regex above assumes Unix-style line endings. If you try to process a text file that was generated on DOS or Windows, the regex may need to be a little different.
awk solution:
awk '/ghi/ && r=="abc"{ f=1; n=NR+1 }f && NR==n{ v=$0 }v && NR==n+1{ print v }{ r=$0 }' file
The output:
78
Bonus GNU awk approach:
awk -v RS= 'match($0,/\nabc\nghi\n(.+)\ndef/,a){ print a[1] }' file

Can I speed up AWK program using NR function

I am using awk to pull data out of a file that has 30M+ records. I know to within a few thousand records where the records I want are. I am curious whether I can cut down the time it takes awk to find the records by giving it a starting point through NR. For example, if my record is more than 25 million lines in, could I use the following:
awk 'BEGIN{NR=25000000}{rest of my script}' in
Would this make awk skip straight to the 25 millionth record and save me the time of scanning every record before it?
For a better example, I am using this awk script in a loop in sh. I need the normal output of the awk script, but I would also like it to pass NR along when it finishes, so that the next iteration of the loop can pick it up when it comes back to this script.
awk -v n=$line -v r=$record 'BEGIN{a=1}$4==n{print $10;a=2}($4!=n&&a==2){(pass NR out to $record);exit}' in
Nope. Let's try it:
$ cat -n file
1 one
2 two
3 three
4 four
$ awk 'BEGIN {NR=2} {print NR, $0}' file
3 one
4 two
5 three
6 four
Are your records fixed length, or do you know the average line length? If yes, then you can use a language that allows you to open a file and seek to a position. Otherwise you have to read all those lines:
awk -v start=25000000 'NR < start {next} {your program here}' file
To maintain your position between runs of the script, I'd use a language like perl: at the end of the run use tell() to output the current position, say to a file; then at the start of the next run, use seek() to pick up where you left off. Add a check that the starting position is less than the current file size, in case the file was truncated.
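The same seek/tell idea can be approximated in plain sh, as a rough sketch: tail -c +N starts reading at byte N, and awk can keep a running byte count to save for the next run. Everything named here (scan_from, the state file, the two-column layout) is hypothetical, and the byte arithmetic assumes single-byte characters and "\n" line endings.

```shell
# Hypothetical helper: scan_from FILE KEY STATEFILE
# Prints column 2 of the consecutive lines whose column 1 equals KEY,
# starting at the byte offset saved in STATEFILE (0 if it is missing or empty).
scan_from() {
  file=$1 key=$2 state=$3
  offset=0
  [ -s "$state" ] && offset=$(cat "$state")
  tail -c +"$((offset + 1))" "$file" |
    awk -v key="$key" -v off="$offset" -v state="$state" '
      $1 == key { print $2; seen = 1; off += length($0) + 1; next }
      seen      { exit }                        # past the block: stop before consuming this line
                { off += length($0) + 1 }       # still before the block: count the bytes, move on
      END       { printf "%.0f\n", off > state }  # remember where the next run should start
    '
}
# usage: scan_from in "$line" offset.state
```

Each call resumes at the first line the previous call did not consume, so earlier records are never re-parsed by awk. The check against a truncated file suggested above is left out of the sketch.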
One way (Using sed), if you know the line numbers
for n in 3 5 8 9 ....
do
sed -n "${n}p" file |awk command
done
or
sed -n "25000,30000p" file |awk command
Records generally have no fixed size, so awk has no choice but to scan the first part of the file, even if only to skip it.
If you want to skip the first part of the input file and you (roughly) know how many bytes to ignore, you can use dd to skip over them, e.g. here assuming a record is 80 bytes wide:
dd if=inputfile bs=25MB skip=80 | awk ...
Finally, you can keep awk from scanning the trailing records by exiting from the awk script once you have passed the end of the interesting zone.
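A minimal sketch of that early exit, assuming the interesting zone is known as a range of line numbers. The function name window and the variable names s and e are made up for illustration.

```shell
# Hypothetical helper: window START END  (reads stdin)
# Prints lines START..END and stops reading as soon as END has been passed.
window() {
  awk -v s="$1" -v e="$2" '
    NR > e  { exit }   # beyond the zone: quit instead of scanning the rest
    NR >= s            # inside the zone: default action prints the line
  '
}

printf '%s\n' 1 2 3 4 5 6 7 8 | window 3 5
```

For the eight-line input shown this should print 3, 4 and 5. The lines before START are still read, only the tail of the file is skipped.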