Edit only specific lines when I find special character with awk - awk

I have this kind of file:
>AX-89948491-minus
CTAACACATTTAGTAGATT
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
When a line starts with ">" and includes "minus", I need to reverse (rev) and translate (tr) the following line. I should get:
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
I would like to go with awk. I tried this, but it does not work:
awk '{if(NR%2==1~/"plus"/){print;getline;print} else if (NR%2==1~/"minus"/){system("echo "$0" | rev | tr ATCGatcg TAGCtagc")} else {print;getline;print}}' file
Any help?

This gnu-awk solution should work for you:
awk '
p {
    cmd = "rev <<< \047" $0 "\047 | tr ATCGatcg TAGCtagc"
    if ((cmd |& getline var) > 0)
        $0 = var
}
{
    p = /^>/ && /-minus/
} 1' file
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
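One caveat, as a hedged aside: <<< is a bash/ksh/zsh here-string, and gawk hands the command string to /bin/sh, so this can fail where sh is a stricter POSIX shell; the command is also never close()d. A sketch of the same idea with a portable command string and an explicit close(), which should then run under any POSIX awk:
awk '
p {
    cmd = "printf \047%s\\n\047 \047" $0 "\047 | rev | tr ATCGatcg TAGCtagc"
    if ((cmd | getline var) > 0)
        $0 = var
    close(cmd)
}
{
    p = /^>/ && /-minus/
} 1' file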

Awk is a tool to manipulate text, not a tool to sequence calls to other tools. The latter is what a shell is for. There are times when you need to call other tools from awk but not when it's simple text manipulation like reversing and translating characters in a string as you want to do.
Using any awk in any shell on every Unix box without spawning a subshell once per target input line to call other Unix tools (including the non-POSIX-defined rev which won't exist on some Unix boxes):
$ cat tst.awk
BEGIN {
    split("ATCGatcg TAGCtagc",tmp)
    for (i=1; i<=length(tmp[1]); i++) {
        tr[substr(tmp[1],i,1)] = substr(tmp[2],i,1)
    }
}
f {
    out = ""
    for (i=1; i<=length($0); i++) {
        char = substr($0,i,1)
        out = (char in tr ? tr[char] : char) out
    }
    $0 = out
    f = 0
}
/^>.*minus/ { f=1 }
{ print }
$ awk -f tst.awk file
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT

I'd use perl, as it has builtin reverse and tr functions:
perl -lpe '
if (/^>/) {$rev = /minus/; next}
if ($rev) {$_ = reverse; tr/ATCGatcg/TAGCtagc/}
' file
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT


How can I use sed to generate an awk file?

How do I write sed commands to generate an awk file?
Here is my problem:
For example, I have a text file, A.txt which contains a word on each line.
app#
#ple
#ol#
The # indicates whether the letters are a prefix, a suffix, or sit in the middle of the word. For example, app# shows that the word starts with 'app', #ple shows that the word ends with 'ple', and #ol# shows that the word has 'ol' in the middle of the word.
I have to generate an awk file from sed commands, which then reads in another file, B.txt (which contains a word on each line), and increments the variables start, end, and middle.
How do I write the sed commands so that, for each line in the text file A.txt, they generate awk code like this:
{ if ($1 ~ /^app/)
    { start++; }
}
For example, if I input the other file, B.txt with these words into the awk script,
application
people
bold
cold
The output would be: start = 1, end = 1, middle = 2.
I'd use ed over sed for this, actually.
A quick script that creates A.awk from A.txt and runs it on B.txt:
#!/bin/sh
ed -s A.txt <<'EOF'
1,$ s!^#\(.*\)#$!$0 ~ /.+\1.+/ { middle++ }!
1,$ s!^#\(.*\)!$0 ~ /\1$/ { end++ }!
1,$ s!^\(.*\)#!$0 ~ /^\1/ { start++ }!
0 a
#!/usr/bin/awk -f
BEGIN { start = end = middle = 0 }
.
$ a
END { printf "start = %d, end = %d, middle = %d\n", start, end, middle }
.
w A.awk
EOF
# awk -f A.awk B.txt would work too, but this demonstrates a self-contained awk script
chmod +x A.awk
./A.awk B.txt
Running it:
$ ./translate.sh
start = 1, end = 1, middle = 2
$ cat A.awk
#!/usr/bin/awk -f
BEGIN { start = end = middle = 0 }
$0 ~ /^app/ { start++ }
$0 ~ /ple$/ { end++ }
$0 ~ /.+ol.+/ { middle++ }
END { printf "start = %d, end = %d, middle = %d\n", start, end, middle }
Note: This assumes that the middle patterns shouldn't match at the start or end of a line.
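For instance, with the /.+\1.+/ form a word that merely starts or ends with the "middle" string is not counted (a quick illustration, not from the original post):
$ printf 'old\nbold\n' | awk '/.+ol.+/ { middle++ } END { print middle+0 }'
1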
But here's an attempt using sed to create A.awk, putting all the sed commands in a file, as trying to do this as a one-liner using -e and getting all the escaping right is not something I feel up to at the moment:
Contents of makeA.sed:
s!^#\(.*\)#$!$0 ~ /.+\1.+/ { middle++ }!
s!^#\(.*\)!$0 ~ /\1$/ { end++ }!
s!^\(.*\)#!$0 ~ /^\1/ { start++ }!
1 i\
#!/usr/bin/awk -f\
BEGIN { start = end = middle = 0 }
$ a\
END { printf "start = %d, end = %d, middle = %d\\n", start, end, middle }
Running it:
$ sed -f makeA.sed A.txt > A.awk
$ awk -f A.awk B.txt
start = 1, end = 1, middle = 2
Off the top of my head, and not tested:
/^#\(.*\)#$/s!!{if ($1 ~ /\1/) {middle++; next}}!
/^#\(.*\)$/s!!{if ($1 ~ /\1$/) {end++; next}}!
/^\(.*\)#$/s!!{if ($1 ~ /^\1/) {start++; next}}!
The construct \(.*\) matches the text between the # markers and saves it in a back-reference, then \1 recalls the back-reference inside the generated awk regex. The empty pattern following the s command (with ! as its delimiter, so the slashes in the replacement don't need escaping) reuses the address pattern that matched the line. The #...# case is handled first, and the addresses are anchored, so a line that has already been rewritten into awk code (it now starts with { and ends with }) cannot match a later substitution. In the generated awk, the next inside the if prevents a word from being counted by more than one rule.
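As a quick sanity check, here is what those three substitutions would emit for the sample A.txt, assuming they are saved in a file called, say, mid.sed (a made-up name); the BEGIN/END lines from the earlier answers would still need to be added around this:
$ sed -f mid.sed A.txt
{if ($1 ~ /^app/) {start++; next}}
{if ($1 ~ /ple$/) {end++; next}}
{if ($1 ~ /ol/) {middle++; next}}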

Can't replace a string with a multi-line string with sed

I'm trying to replace a fixed string ("replaceMe") in a text with multi-line text using sed.
My bash script goes as follows:
content=$(awk '{print $5}' < data.txt | sort | uniq)
target=$(cat install.sh)
text=$(sed "s/replaceMe/$content/" <<< "$target")
echo "${text}"
If content contains one line only, the replacement works, but if it contains several lines I get:
sed: ... unterminated `s' command
I read about "fetching" multi-line content, but I couldn't find anything about inserting a multi-line string.
You'll have more problems than that depending on the contents of data.txt since sed doesn't understand literal strings (see Is it possible to escape regex metacharacters reliably with sed). Just use awk which does:
text="$( awk -v old='replaceMe' '
NR==FNR {
if ( !seen[$5]++ ) {
new = (NR>1 ? new ORS : "") $5
}
next
}
s = index($0,old) { $0 = substr($0,1,s-1) new substr($0,s+length(old)) }
{ print }
' data.txt install.sh )"
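As an aside, the reason sed is risky here: characters such as & and / are special in the replacement text of sed's s command, so a value pulled from data.txt can silently change the result. A contrived illustration (not the question's data):
$ echo 'xx replaceMe yy' | sed 's/replaceMe/Tom & Jerry/'
xx Tom replaceMe Jerry yy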

Redirect input for gawk to a system command

Usually a gawk script processes each line of its stdin. Is it possible to instead specify a system command in the script and process each line of that command's output in the rest of the script?
For example consider the following simple interaction:
$ { echo "abc"; echo "def"; } | gawk '{print NR ":" $0; }'
1:abc
2:def
I would like to get the same output without using the pipe, specifying instead the echo commands as a system command inside the script.
I can of course use the pipe, but that would force me to either use two different scripts or embed the gawk script inside a bash script, and I am trying to avoid that.
UPDATE
The previous example is not quite representative of my use case; this is somewhat closer:
$ { echo "abc"; echo "def"; } | gawk '/d/ {print NR ":" $0; }'
2:def
UPDATE 2
A shell script parallel would be as follows. Without the exec line the script reads from stdin; with the exec it uses the output of the command on that line as its input:
/tmp> cat t.sh
#!/bin/bash
exec 0< <(echo abc; echo def)
while read l; do
    echo "line:" $l
done
/tmp> ./t.sh
line: abc
line: def
From all of your comments, it sounds like what you want is:
$ cat tst.awk
BEGIN {
    if ( ("mktemp" | getline file) > 0 ) {
        system("(echo abc; echo def) > " file)
        ARGV[ARGC++] = file
    }
    close("mktemp")
}
{ print FILENAME, NR, $0 }
END {
    if (file!="") {
        system("rm -f \"" file "\"")
    }
}
$ awk -f tst.awk
/tmp/tmp.ooAfgMNetB 1 abc
/tmp/tmp.ooAfgMNetB 2 def
but honestly, I wouldn't do it. You're munging what the shell is good at (creating/destroying files and processes) with what awk is good at (manipulating text).
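For instance, keeping the process handling in the shell and letting awk just read the result (a minimal sketch using bash process substitution, essentially what your UPDATE 2 script does):
$ gawk '/d/ { print NR ":" $0 }' <(echo abc; echo def)
2:def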
I believe what you're looking for is getline:
awk '{ while ( ("echo abc; echo def" | getline line) > 0){ print line} }' <<< ''
abc
def
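The <<< '' is only there to feed awk a single empty line so the main block runs once; the same loop can live in a BEGIN block instead, so no input is needed at all (a small variation on the above):
awk 'BEGIN { while (("echo abc; echo def" | getline line) > 0) print line }'
abc
def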
Adjusting the answer to your second example:
awk '{ while ( ("echo abc; echo def" | getline line) > 0){ counter++; if ( line ~ /d/){print counter":"line} } }' <<< ''
2:def
Let's break it down:
awk '{
    cmd = "echo abc; echo def"
    # the line below reads the output of cmd into the line variable
    while ( (cmd | getline line) > 0) {
        # we need a counter because NR will not work for us
        counter++;
        # if the line contains the letter d
        if ( line ~ /d/) {
            print counter":"line
        }
    }
}' <<< ''
2:def

awk output format for average

I am computing the average of many values and printing it using awk with the following script.
for j in `ls *.txt`; do
    for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
        echo -n $j $i" "; cat $j | grep $i | awk '{ sum+=$2} END {print sum/NR}'
    done;
    echo ""
done
The problem is that it prints the value as 1.2345e+05, which I do not want; I want it to print rounded values, but I am unable to find where to specify the output format.
EDIT: using {print "average,%3d = ",sum/NR} in place of {print sum/NR} is not helping, because it prints "average,%3d 1.2345e+05".
You need printf instead of simply print; print is a much simpler routine than printf.
for j in *.txt; do
    for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
        awk -v "i=$i" -v "j=$j" '$0 ~ i {sum += $2} END {printf "%s %s average %6d\n", j, i, sum/NR}' "$j"
    done
    echo
done
You don't need ls - a glob will do.
Useless use of cat.
Quote all variables when they are expanded.
It's not necessary to use echo - AWK can do the job.
It's not necessary to use grep - AWK can do the job.
If you're getting numbers like 1.2345e+05 then %6d might be a better format string than %3d. Use printf in order to use format strings - print doesn't support them.
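To illustrate that last point (a standalone example, not your data): print formats non-integral numbers with OFMT, which defaults to %.6g and switches to exponent notation for large values, while printf lets you choose the format:
$ awk 'BEGIN { x = 1234567.89; print x; printf "%d\n", x }'
1.23457e+06
1234567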
The following all-AWK script might do what you're looking for and be quite a bit faster. Without seeing your input data I've made a few assumptions, primarily that the command name being matched is in column 1.
awk '
BEGIN {
    cmdstring = "emptyloop dd cp sleep10 gpid forkbomb gzip bzip2";
    n = split(cmdstring, cmdarray);
    for (i = 1; i <= n; i++) {
        cmds[cmdarray[i]]
    }
}
$1 in cmds {
    sums[$1, FILENAME] += $2;
    counts[$1, FILENAME]++
    files[FILENAME]
}
END {
    for (file in files) {
        for (cmd in cmds) {
            # only print combinations that actually appeared, to avoid dividing by zero
            if (counts[cmd, file])
                printf "%s %s %6d\n", file, cmd, sums[cmd, file]/counts[cmd, file]
        }
    }
}' *.txt

How to run a .awk file?

I am converting a CSV file into a table format, and I wrote an AWK script and saved it as my.awk. Here is my script:
#AWK for test
awk -F , '
BEGIN {
    aa = 0;
}
{
    hdng = "fname,lname,salary,city";
    l1 = length($1);
    l13 = length($13);
    if ((l1 > 2) && (l13 == 0)) {
        fname = substr($1, 2, 1);
        l1 = length($3) - 4;
        lname = substr($3, l1, 4);
        processor = substr($1, 2);
        #printf("%s,%s,%s,%s\n", fname, lname, salary, $0);
    }
    if ($0 ~ ",,,,")
        aa++
    else if ($0 ~ ",fname")
        printf("%s\n", hdng);
    else if ((l1 > 2) && (l13 == 0)) {
        a++;
    }
    else {
        perf = $11;
        if (perf ~/^[0-9\.\" ]+$/)
            type = "num"
        else
            type = "char";
        if (type == "num")
            printf("Mr%s,%s,%s,%s,,N,N,,\n", $0,fname,lname, city);
    }
}
END {
} ' < life.csv > life_out.csv
How can I run this script on a Unix server? I tried to run my.awk using this command:
awk -f my.awk life.csv
The file you give is a shell script, not an awk program. So, try sh my.awk.
If you want to use awk -f my.awk life.csv > life_out.csv, then remove the awk -F , ' line and the closing ' < life.csv > life_out.csv line from the file, and add FS="," in BEGIN.
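A sketch of what the top of my.awk would then look like (the body of the program stays exactly as you wrote it; only the shell wrapper lines are dropped):
BEGIN {
    FS = ","
    aa = 0;
}
# ... the rest of your main { ... } block and END { } unchanged ...
Then run it as:
awk -f my.awk life.csv > life_out.csv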
If you put #!/bin/awk -f on the first line of your AWK script it is easier to run. Plus, editors like Vim and ... will recognize the file as an AWK script and can colorize it. :)
#!/bin/awk -f
BEGIN {} # Begin section
{} # Loop section
END{} # End section
Change the file to be executable by running:
chmod ugo+x ./awk-script
and you can then call your AWK script like this:
$ echo "something" | ./awk-script
Put the part from BEGIN ... END {} inside a file and name it something like my.awk.
And then execute it like below:
awk -f my.awk life.csv >output.txt
Also, I see the field separator is ,. You can set that in the BEGIN block of the .awk file with FS=",".