Modifying a number value in text - awk

I have text coming in as
A1:B2.C3.D4.E5
A2:B7.C10.D0.E9
A0:B1.C9.D4.E8
I would like to change it to
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8
using awk. The first problem is that there are multiple delimiters (":" and "."). The second is how to extract the C value and decrement it by 2.

awk solution:
$ awk -F"." '{$2=substr($2,1,1) substr($2,2)-2}1' OFS="." file
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8

I was wondering whether an awk regexp would do the job, but apparently awk cannot capture groups. This is why I suggest a Perl solution:
$ cat data.txt
A1:B2.C3.D4.E5
A2:B7.C10.D0.E9
A0:B1.C9.D4.E8
$ perl -pe 's/C([0-9]+)/"C" . ($1-2)/ge;' data.txt
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8

Admittedly, I probably would have done this using the substr() function like Guru has shown:
awk 'BEGIN { FS=OFS="." } { $2 = substr($2,1,1) substr($2,2) - 2 }1' file
I also like Aif's answer using Perl, probably just a little more. Shorter is sweeter, isn't it? However, GNU awk can capture patterns. Here's how:
awk 'BEGIN { FS=OFS="." } match($2, /(C)(.*)/, a) { $2 = a[1] a[2] - 2}1' file
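For awks that lack the three-argument match() of GNU awk, a portable sketch can rely on the RSTART and RLENGTH variables that match() sets. The temporary file and the names tmp, out, and num are assumptions made for this illustration; this is one possible approach, not the method of any answer above.

```shell
# Sample input from the question, written to a scratch file (hypothetical path).
tmp=$(mktemp)
printf '%s\n' 'A1:B2.C3.D4.E5' 'A2:B7.C10.D0.E9' 'A0:B1.C9.D4.E8' > "$tmp"

out=$(awk '
match($0, /C[0-9]+/) {
    # RSTART points at "C"; RLENGTH covers "C" plus its digits.
    num = substr($0, RSTART + 1, RLENGTH - 1)
    # Keep everything up to and including "C", then the decremented number, then the rest.
    $0 = substr($0, 1, RSTART) (num - 2) substr($0, RSTART + RLENGTH)
}
{ print }
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Because this works on the whole line, it does not depend on which field the C value is in, so the multiple-delimiter problem does not arise.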

Related

Concatenate the sequence to the ID in fasta file

Here is my input file
>OTU1;size=4;
ATTCCGGGTTTACT
ATTCCTTTTATCGA
ATC
>OTU2;size=10;
CGGATCTAGGCGAT
ACT
>OTU3;size=5;
ATTCCCGGGATCTA
ACTTTTC
The expected output file is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
I've tried the code from Remove line breaks in a FASTA file
but this doesn't work for me, and I am not sure how to modify the code from that post...
Any suggestion? Thanks in advance!
Here is another awk script, using awk's internal record and field parsing.
awk 'BEGIN{RS=">";OFS="";}NR>1{$1=$1;print ">"$0}' input.txt
Output is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
Explanation:
awk '
BEGIN {           # initialize awk internal variables
    RS=">";       # set the record separator `RS` to `>`
    OFS="";       # set the output field separator `OFS` to the empty string
}
NR>1 {            # start from the 2nd record (the 1st record is empty)
    $1=$1;        # rebuild $0 so the fields are rejoined with OFS
    print ">"$0   # print ">" followed by the rebuilt record
}' input.txt
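The `$1=$1` step is what makes awk rejoin the fields with OFS. A minimal sketch on a made-up input line shows the effect; the sample text and the "-" separator are assumptions chosen only to make the rejoin visible.

```shell
# Without the assignment, $0 is printed exactly as it was read.
raw=$(printf 'a b  c\n' | awk 'BEGIN { OFS = "-" } { print }')

# Assigning to any field forces awk to rebuild $0 from its fields, joined by OFS.
out=$(printf 'a b  c\n' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')

printf '%s\n' "$raw" "$out"
```

With OFS set to the empty string, as in the answer above, the same rebuild glues the header and all sequence lines together.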
$ awk '{printf "%s%s", (/^>/ ? ors : ""), $0; ors=ORS} END{print ""}' file
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
You could also try the following.
awk -v RS=">" 'NR>1{gsub(/\n/,"");print ">"$0}' Input_file
My original attempt was awk -v RS=">" -v FS="\n" -v OFS="" 'NF>1{$1=$1;print ">"$0}' Input_file, but then I saw that dudi boy had already posted essentially the same answer, so I wrote the other one shown first.
Similar to my answer here:
$ awk 'BEGIN{RS=">"; FS="\n"}
(FNR==1){next}
{ name=$1; seq=$0; gsub(/(^[^\n]*|)\n/,"",seq) }
{ print ">" name seq }' file1.fasta file2.fasta file3.fasta ...

awk print several substring

I would like to print several substrings with awk.
Here is an example of what I usually do:
awk -v string="ATGC" '{print substr($0,index($0,string),10)}' test.txt > result.txt
This prints 10 characters starting at the position where my string is found.
But the result contains only the first substring, not every occurrence as I expected.
Here an example if I use the string "ATGC" :
test.txt
ATGCATATAAATGCTTTTTTTTT
result.txt
ATGCATATAA
instead of
ATGCATATAA
ATGCTTTTTT
What do I have to add?
I'm sure the answer is easy for you guys!
Thank you for your help.
If you have gawk (gnu awk), you can make use of FPAT:
awk -v FPAT='ATGC.{6}' '{for(i=1;i<=NF;i++)print $i}' file
With your example:
$ awk -v FPAT='ATGC.{6}' '{for(i=1;i<=NF;i++)print $i}' <<<"ATGCATATAAATGCTTTTTTTTT"
ATGCATATAA
ATGCTTTTTT
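If the substrings always sit at the same positions, as in your example, hardcoded offsets also work:
awk '{print substr($0,1,10) ORS substr($0,length($0)-12,10)}' file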
ATGCATATAA
ATGCTTTTTT
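Where FPAT is not available, a loop over index() can find every occurrence in a portable way. The following is a sketch under the assumption that matches should not overlap; the variable names pat, len, rest, and pos are made up for the example.

```shell
out=$(printf '%s\n' 'ATGCATATAAATGCTTTTTTTTT' | awk -v pat='ATGC' -v len=10 '
{
    rest = $0
    # index() returns 0 once pat no longer occurs in the remaining text.
    while ((pos = index(rest, pat)) > 0) {
        print substr(rest, pos, len)
        rest = substr(rest, pos + len)   # continue after the printed chunk
    }
}')

printf '%s\n' "$out"
```

Unlike the FPAT version, this also prints a final chunk shorter than len if the string is found near the end of the line.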

How to print insert in awk loop

How do I print the file name in the loop? I want to print the file name and the average value of column 4 on the same line:
for i in `ls *cov`
do
awk '{sum +=$4;n++}END{print sum/n}' $i
done
I mean I want something like:
awk '{sum +=$4;n++}END{print $i\t sum/n}' $i
You can use bash variables in an awk script using the -v flag:
awk -v file="$i" '{sum +=$4;n++}END{print file "\t" sum/n}' "$i"
But there is also the built-in awk variable FILENAME:
awk '{sum +=$4;n++}END{print FILENAME "\t" sum/n}' "$i"
which is cleaner, since you aren't passing variables around.
Lose the loop (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice) and just use:
awk -v OFS='\t' '{sum+=$4} ENDFILE{print FILENAME, (FNR>0 ? sum/FNR : 0); sum=0}' *cov
The above uses GNU awk for ENDFILE; there are simple tweaks for other awks, but the important things are:
A surrounding shell loop is neither required nor desirable.
The variable n isn't needed since awk has builtin variables.
You have to protect yourself from divide by zero on empty files.
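One possible tweak for awks without ENDFILE is to report the previous file whenever a new one starts, and the last file in END. This is a sketch, not the answerer's code; the scratch directory, the sample data, and the names prev and cnt are assumptions.

```shell
# Two small hypothetical input files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'chr1 0 10 4' 'chr1 10 20 6' > "$dir/a.cov"
printf '%s\n' 'chr2 0 10 10' > "$dir/b.cov"

out=$(cd "$dir" && awk -v OFS='\t' '
FNR == 1 && NR > 1 { print prev, sum / cnt; sum = cnt = 0 }  # a new file began: report the previous one
{ sum += $4; cnt++; prev = FILENAME }
END { if (cnt) print prev, sum / cnt }                       # report the last file, if it had lines
' *cov)

printf '%s\n' "$out"
rm -rf "$dir"
```

An empty file never produces a record, so this version silently skips it instead of printing 0, and the division is only reached when cnt is nonzero.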

Combining awk search with standard awk and awk delimiter

I'm working on a set of data from which I need specific fields as output.
The data looks like this:
/home/oracle/db.log.gz:2013-1-19T00:00:25 <user.info> 1 2013-1-19T00:00:53.911 host_name RT_FLOW [junos#26.1.1.1.2.4 source-address="10.1.2.0" source-port="616" destination-address="100.1.1.2" destination-port="23" service-name="junos-telnet" nat-source-address="20x.2x.1.2" nat-source-port="3546" nat-destination-address="9x.12x.3.0"]
From above I need three things:
(I) - 2013-1-19T00:00:53.911, which is $4
(II) - source-address="10.1.2.0", which is $8, of which I need only 10.1.2.0
(III) - destination-address="100.1.1.2", which is $10, of which I need only 100.1.1.2
I cannot use simple awk like this -> awk '{ print $4 "\t" $8 "\t" $10 }' since there are some fields after "device_name" in the log file that are not present in every log line, so I have to make use of delimiters such as
awk -F 'source-address=' '{print $2}' | awk '{print $1}' -> this gives the source-address IP, which is requirement (II)
I'm not sure how to combine these into one awk search that returns (I), (II) and (III).
Can someone help?
I believe sed is better for this job
sed -r 's/([^ ]+[ ]+){3}([^ ]+).*[ ]+source-address="([^"]+)".*[ ]+destination-address="([^"]+)".*/\2\t\3\t\4/' file
Output:
2013-1-19T00:00:53.911 10.1.2.0 100.1.1.2
What exactly do you want?
1) solve the problem using any (reasonably standard) tool
2) solve this challenge using one instance of awk
3) solve the problem using just awk, no matter how many instances it costs
For the first case, you could parse the line using a scripting language of your choice (mine would be Perl), or do it the hard way using sed and a single big substitution. Or something between the two: use three regexes to get the parts you want.
For the second case, you could adapt any of the former solutions, preferably the sed one. Awk and sed solutions have already been posted.
For the third case, you could just run the obvious awk solutions you mentioned in your question and send the results to a single pipe like { awk …; awk …; awk …; } < file | consumer.
Try this (GNU awk is needed for gensub() and \s):
awk '{print gensub(/.*\s+([0-9]{4}-[0-9]+-[0-9]+T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+).*source-address="([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*destination-address="([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*/, "(I) \\1\n(II) \\2\n(III) \\3", "g"); }' file
Another solution using perl:
perl -lne 'print "(", "I" x ++$c, ") $_" for m/.*?\s+(\d{4}-\d+-\d+T\d{2}:\d{2}:\d{2}\.\d+).*source-address="(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*destination-address="(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*/' file
Output:
(I) 2013-1-19T00:00:53.911
(II) 10.1.2.0
(III) 100.1.1.2
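A single awk instance that does not depend on GNU extensions could look roughly like the sketch below. It assumes the timestamp is always $4, as stated in the question, and that the wanted keys are preceded by a space, which keeps nat-source-address and nat-destination-address from matching. The helper function value() is hypothetical, introduced only for this example.

```shell
line='/home/oracle/db.log.gz:2013-1-19T00:00:25 <user.info> 1 2013-1-19T00:00:53.911 host_name RT_FLOW [junos#26.1.1.1.2.4 source-address="10.1.2.0" source-port="616" destination-address="100.1.1.2" destination-port="23" service-name="junos-telnet" nat-source-address="20x.2x.1.2" nat-source-port="3546" nat-destination-address="9x.12x.3.0"]'

out=$(printf '%s\n' "$line" | awk -v OFS='\t' '
# value(key): return the text between the quotes of the first  key="..."  pair
# that follows a space, or an empty string when the key is absent.
function value(key,    s) {
    if (!match($0, " " key "=\"[^\"]*\"")) return ""
    s = substr($0, RSTART, RLENGTH)
    sub(/^[^"]*"/, "", s)   # drop everything up to the opening quote
    sub(/"$/, "", s)        # drop the closing quote
    return s
}
{ print $4, value("source-address"), value("destination-address") }')

printf '%s\n' "$out"
```

Looking the keys up by name rather than by field number means the optional fields mentioned in the question do not shift the result.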

Unable to match regex in string using awk

I am trying to fetch the lines in which the second part of the line contains a pattern from the first part of the line.
$ cat file.txt
String1 is a big string|big
$ awk -F'|' ' { if ($2 ~ /$1/) { print $0 } } ' file.txt
But it is not working.
I am not able to find out what is the mistake here.
Can someone please help?
Two things: drop the slashes (inside /.../ awk treats $1 as literal regex text, not as the field), and your operands are backwards.
awk -F\| '$1~$2' file.txt
I guess you mean that some word from the first part should appear in the second part. If that is what you want, then:
awk -F'|' '{n=split($1,a," ");for(i=1;i<=n;i++){if($2~a[i])print $0}}' your_file
There are surprisingly many things wrong with your command line:
1) You aren't using the awk condition/action syntax but instead needlessly embedding a condition within an action,
2) You aren't using the default awk action but instead needlessly hand-coding a print $0.
3) You have your RE operands reversed.
4) You are using RE comparison but it looks like you really want to match strings.
You can fix the first 3 of the above by modifying your command to:
awk -F'|' '$1~$2' file.txt
but I think what you really want is "4" which would mean you need to do this instead:
awk -F'|' 'index($1,$2)' file.txt
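The difference between the regex match and the string match shows up as soon as the second part contains a regex metacharacter. A small sketch with two made-up input lines, where the "." in a.c is the metacharacter:

```shell
input=$(printf '%s\n' 'abc is here|a.c' 'a.c is here|a.c')

# Regex comparison: "." matches any character, so both lines are selected.
out_re=$(printf '%s\n' "$input" | awk -F'|' '$1 ~ $2')

# String comparison: only a literal "a.c" in the first part counts.
out_str=$(printf '%s\n' "$input" | awk -F'|' 'index($1, $2)')

printf 'regex:\n%s\nstring:\n%s\n' "$out_re" "$out_str"
```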