Awk command to replace a substring with a specific value

I have a flat file with a phone number in columns 314 through 323. I want to dummy out that field with 1234567890.
For this I tried the commands below, but both throw errors.
awk '{var=substr($0,314,10);gsub("[0-9]","1234567890",$var); print}' final_phone.txt >final_phone.txt1
awk 'var=substr($0,314,10) { $var = "1234567890" }1' final_phone.txt >final_phone.txt1
**fatal: grow_fields_arr: fields_arr: can't allocate 9849885432 bytes of memory (Cannot allocate memory)**
In the first case I tried to assign the substring to a variable and in gsub I wanted to check for numbers pattern and substitute with 1234567890.
In the second case I was trying to assign the value to each of the substring value in each line.
Can someone help me with the syntax here?

@markp-fuso's comment about why your code is generating that specific error message is correct.
All you need is:
awk '{$0=substr($0,1,313) "1234567890" substr($0,324)} 1' file
or if you want to check for numbers first:
awk 'substr($0,314,10) ~ /^[0-9]+$/{$0=substr($0,1,313) "1234567890" substr($0,324)} 1' file
and using variables:
awk '
BEGIN { beg=314; lgth=10; new="1234567890" }
substr($0,beg,lgth) ~ /^[0-9]+$/ {
$0 = substr($0,1,beg-1) new substr($0,beg+lgth)
}
1' file
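To make the splice easier to follow, here is a minimal sketch on a short record. The sample line, the column numbers (6 and 10 instead of 314 and 10) and the shell variable names are assumptions made up for illustration. The last command shows that `$n` in awk means "the field whose number is n", which is presumably why `$var`, with var holding a 10-digit phone number, made awk try to address an enormous field index.

```shell
# Hypothetical 20-character record: a 10-digit number sits in columns 6-15.
line='ABCDE9849885432VWXYZ'

masked=$(printf '%s\n' "$line" |
  awk -v beg=6 -v lgth=10 -v new='1234567890' '
    # only touch the record if the target columns are all digits
    substr($0, beg, lgth) ~ /^[0-9]+$/ {
      # text before the field + dummy value + text after the field
      $0 = substr($0, 1, beg - 1) new substr($0, beg + lgth)
    }
    1')
printf '%s\n' "$masked"

# $n is a field reference: with n=3 this prints the third field.
field=$(echo 'a b c' | awk '{ n = 3; print $n }')
printf '%s\n' "$field"
```

Because the record is rebuilt purely with substr(), no field splitting is involved and the rest of the fixed-width line is left byte-for-byte as it was.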

Related

How to filter the OTU by counts with AWK?

I am trying to filter all the singleton from a fasta file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}' input.fasta > ouput.fasta
but this will remove all the header for each OTU.
Anyone could help me out?
Could you please try following.
awk -F'[=;]' '/^>/{flag=""} $3>=2{flag=1} flag' Input_file
$ awk '/>/{f=/=1;/}!f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}'
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
Store the size from each header line in the prev_sz variable and the whole header line in prev. When the following sequence line arrives and prev_sz is >= 2, print the previous line and the current line. RS is used to print the newline between them.
While all the above methods work, they all assume that the input always looks the same, i.e. the sequence name in your fasta file needs to have the form:
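As a small sketch of how that field separator carves up a header line, the snippet below splits one header from the question on `;` and `=`. The shell variable names are assumptions for illustration only.

```shell
header='>OTU1;size=3;'

# With FS='[;=]' the header splits into: ">OTU1", "size", "3" and an empty
# trailing field (because the line ends in ";").
nf=$(printf '%s\n' "$header" | awk -F'[;=]' '{ print NF }')
size=$(printf '%s\n' "$header" | awk -F'[;=]' '{ print $(NF-1) }')

printf 'fields=%s size=%s\n' "$nf" "$size"
```

The empty trailing field is why the size is read from $(NF-1) rather than $NF.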
>NAME;size=value;
A few solutions can handle a bit more extended sequence-names, but none handle the case where things go a bit more generic, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print sequence where label xxx matches value vvv:
awk '/>/{f = /;xxx=vvv;/}f' file.fasta
Print sequence where label xxx has a numeric value p bigger than q:
awk -v label="xxx" -v limit=q \
'BEGIN{ere=";" label "="}
/>/{ f=0; match($0,ere);value=0+substr($0,RSTART+length(ere)); f=(value>limit)}
f' <file>
In the above, ere is the regular expression we try to match. We use it to find the location of the value attached to label xxx. The substring starting there will have non-numeric characters after its value, but adding 0 to it converts it to a number, dropping everything from the first non-numeric character onward (i.e. 3;label4=value4; is converted to 3). We then check whether the value is bigger than our limit, and print the sequence based on that result.
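A runnable sketch of the label-based filter follows. The sample records extend the question's data with a made-up `region` label and a bare `extra` string, purely to exercise the more generic header form; the guard around match() is an addition, so that headers lacking the label are skipped.

```shell
# Hypothetical input: headers carry extra labels around size=...
fasta=$(printf '%s\n' \
  '>OTU1;size=3;' 'ATTCCCCGGGGGGG' \
  '>OTU2;region=V4;size=1;' 'ATCCGGGACTGATC' \
  '>OTU3;region=V4;size=5;extra;' 'GAACTATCGGGTAA')

kept=$(printf '%s\n' "$fasta" |
  awk -v label="size" -v limit=1 '
    BEGIN { ere = ";" label "=" }
    /^>/ {
      f = 0                                   # default: drop this record
      if (match($0, ere))                     # label present in this header?
        # "5;extra;" + 0 evaluates to 5: awk keeps the leading number
        f = (0 + substr($0, RSTART + RLENGTH) > limit)
    }
    f')                                       # print header and sequence while f is set
printf '%s\n' "$kept"
```

With limit=1 this keeps every record whose size is at least 2, i.e. it drops the singletons.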

awk to parse field if specific value is in another

In the awk below I am trying to split $2 on the _ only if $3 is a specific value (ID). I am reading that parsed value into an array and going to use it as a key in a lookup. The awk does execute, but the entire line 2 (the line with ID in $3) prints, not just the desired part. The print statement is only there to see the result (for testing) and will not be part of the script. Thank you :).
awk
awk -F'\t' '$3=="ID"
f="$(echo $2|cut -d_ -f1,1)"
{
print $f
}' file
file tab-delimited
R_Index locus type
17 chr20:31022959 NON
18 chr11:118353210-chr9:20354877_KMT2A-MLLT3.K8M9 ID
desired
$f = chr11:118353210-chr9:20354877
Not completely clear, could you please try following.
awk '{split($2,array,"_");if(array[2]=="KMT2A-MLLT3.K8M9"){print array[1]}}' Input_file
Or, if you want to change the 2nd field's value and print all lines, try the following.
awk '{split($2,array,"_");if(array[2]=="KMT2A-MLLT3.K8M9"){$2=array[1]}} 1' Input_file
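If the condition really is "ID in $3", as the question describes, a sketch along these lines may be closer to what was asked. It reuses the sample rows from the question; the variable names data, result and part are assumptions for illustration.

```shell
# Sample rows from the question, joined with real tab characters.
data=$(printf 'R_Index\tlocus\ttype\n17\tchr20:31022959\tNON\n18\tchr11:118353210-chr9:20354877_KMT2A-MLLT3.K8M9\tID\n')

result=$(printf '%s\n' "$data" |
  awk -F'\t' '
    $3 == "ID" {               # only rows whose third column is exactly ID
      split($2, part, "_")     # part[1] is the text before the first underscore
      print part[1]
    }')
printf '%s\n' "$result"
```

Doing the split inside awk avoids shelling out to cut, which the original attempt tried to do from within the awk program.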

How can I use awk to insert something in the middle of the word?

I have an input:
This is a test
And I want to insert some letters in the middle of the word, like:
This is a teSOMETHINGst
I know I can define the needed word by $i, but how can I modify the word that way?
I'm trying to do it like that:
{
i=4 # finding somehow
print (substr($i,1,(length($i)/2)) "SOMETHING" substr($i,(length($i)/2),(length($i)/2)))
}
As I'm new to awk I wonder if it is a right way.
This may be what you're looking for:
$ awk 'match($0,/\<test\>/){mid=int(RLENGTH/2); $0=substr($0,RSTART,mid) "SOMETHING" substr($0,RSTART+mid,RLENGTH-mid)} 1'
e.g. some test cases (no pun intended):
$ echo 'This is a test' |
awk 'match($0,/\<test\>/){mid=int(RLENGTH/2); $0=substr($0,RSTART,mid) "SOMETHING" substr($0,RSTART+mid,RLENGTH-mid)} 1'
teSOMETHINGst
$ echo 'These are tests' |
awk 'match($0,/\<tests\>/){mid=int(RLENGTH/2); $0=substr($0,RSTART,mid) "SOMETHING" substr($0,RSTART+mid,RLENGTH-mid)} 1'
teSOMETHINGsts
$ echo 'These contestants are in a test' |
awk 'match($0,/\<test\>/){mid=int(RLENGTH/2); $0=substr($0,RSTART,mid) "SOMETHING" substr($0,RSTART+mid,RLENGTH-mid)} 1'
teSOMETHINGst
Assuming your requirement is to find the column containing test and operate on it, loop over the columns up to NF and match using the regex match operator ~, or for fixed strings use an equality test such as $i == "test"
awk '
{
for(i=1;i<=NF;i++) {
if ($i ~ "test") {
halfLength=(length($i)/2)
$i=(substr($i,1,halfLength) "SOMETHING" substr($i,(halfLength+1),halfLength))
}
}
}1' <<<"This is a test"
This produces the expected output. Note that the substr() call for the second part of the string is substr($i,(halfLength+1),halfLength). The +1, which your attempt was missing, is needed. The substr() result is assigned back to the column containing test, i.e. $i=..
Also, with {..}1, the record is rebuilt from its fields whenever a field is modified; in our case only the column containing the string you wanted changes.
Also note that the whole attempt will fail if the target string contains an odd number of characters or is a substring of another, larger string (the equality operator avoids the latter, but the regex approach does not).
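One possible way around both caveats is sketched below: int() rounds the midpoint down for odd-length words, and == compares the whole field so that contestant is left alone. The helper name insert_mid and its two arguments are hypothetical, introduced only for this example.

```shell
# insert_mid WORD TEXT: hypothetical helper that inserts TEXT into the middle
# of every field equal to WORD, reading records from standard input.
insert_mid() {
  awk -v word="$1" -v ins="$2" '
    {
      for (i = 1; i <= NF; i++)
        if ($i == word) {                     # exact match, not a regex match
          half = int(length($i) / 2)          # round down for odd lengths
          # omitting the third substr() argument takes the rest of the field
          $i = substr($i, 1, half) ins substr($i, half + 1)
        }
    }
    1'
}

even=$(echo 'This is a contestant test' | insert_mid test SOMETHING)
odd=$(echo 'These are tests' | insert_mid tests SOMETHING)
printf '%s\n%s\n' "$even" "$odd"
```

Note that assigning to $i rebuilds the record with single spaces between fields, so runs of whitespace in the input would be collapsed.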
Another one that grew from curiosity to personal vendetta (:
$ echo This is a contestant test |
awk -v s="test" '
BEGIN {
FS=OFS=""
}
{
if(i=match($0, "(^| )" s "( |$)")) { # match over index since regex support
j=(i+length(s)/2+!!(i-1)) # !!(i-1) detect beginning of record
$j="SOMETHING" $j
}
}1'
This is a contestant teSOMETHINGst
Another one using empty separators, mostly to satisfy personal curiosity:
$ echo This is a test |
awk -v s="test" '
BEGIN {
FS=OFS="" # empty separators
}
{
if(i=index($0,s)) { # index finds the beginning of test
j=(i+length(s)/2) # midpoint
$j="SOMETHING" $j # insert string
}
}1' # output
This is a teSOMETHINGst

How can I exclude blank lines with awk?

Question
How can I exclude lines starting with a space character, and that have nothing else on the line? With awk, I want to print the line Need to print, but it's also printing the blank line. How can I exclude it?
Script: test.awk
$0 !~/^start|^#/ {
print "Result : %s",$0
}
Data
# test
start
Need to print
Result
Result : %s
Result : %s Need to print
Use the NF Variable
You aren't really asking about lines that start with a space, you're asking about how to discard blank lines. Pragmatically speaking, blank lines have no fields, so you can use the built-in NF variable to discard lines which don't have at least one field. For example:
$ awk 'NF > 0 && !/^(start|#)/ {print "Result: " $0}' /tmp/corpus
Result: Need to print
You can use:
awk '/^[^[:space:]]/{print "Result : " $0}'
The [^[:space:]] right after the ^ anchor ensures that every printed line starts with a non-whitespace character, so empty lines and lines that begin with a space are skipped.
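For a quick check of the NF-based approach, the sketch below feeds in the question's data plus an empty line and a whitespace-only line (both added here as assumptions about what the real file contains) and keeps only the line that should be printed.

```shell
# Question data plus one empty line and one line holding only spaces.
input=$(printf '# test\nstart\n\n   \nNeed to print\n')

out=$(printf '%s\n' "$input" |
  awk 'NF && !/^(start|#)/ { print "Result : " $0 }')   # NF is 0 for blank lines
printf '%s\n' "$out"
```

NF is zero both for empty lines and for lines containing only whitespace, so one condition covers both cases.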

Awk with a variable

I have an awk command to print out the total number of times "200" occurred in column 26.
awk '$26 ~ /200/{n++}; END {print n+0}' testfile
How do I modify this statement so I can pass 200 as a variable? e.g. if I have a variable $code with a value of 200
Thanks in advance
awk '$26 ~ code {n++} END {print n+0}' code=200 testfile
If a filename on the command line has the form var=val it is treated as a variable
assignment. The variable var will be assigned the value val.
§ Awk Program Execution
awk -v var="$shellVar" '$26~var{n++} END{print n}' file
The line above shows how to pass a shell variable into awk. Some notes on your awk one-liner:
print n+0 only matters when no line matches: n is then still unset and a plain print n outputs an empty string, while n+0 outputs 0. Once n++ has run at least once, n is already a number.
The ; before END is not needed and can be removed.
I copied the 200 check from your code, but it is risky. If $26 holds only a number, consider using 1*$26 == 200 or $26 == "200" instead. A regex match can give the wrong result in this situation, e.g. when the value in $26 is 20200.
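The difference between the regex match and an exact comparison can be seen in a short sketch. Column 1 stands in for $26 here, and the three sample values and the shell variable names are assumptions chosen for illustration.

```shell
code=200
# Three sample values; column 1 plays the role of $26.
rows=$(printf '200\n20200\n404\n')

# Regex match: 20200 also contains "200", so it is counted too.
regex_count=$(printf '%s\n' "$rows" |
  awk -v code="$code" '$1 ~ code { n++ } END { print n+0 }')

# Exact comparison: only the value 200 itself is counted.
exact_count=$(printf '%s\n' "$rows" |
  awk -v code="$code" '$1 == code { n++ } END { print n+0 }')

# No match at all: n is never set, and n+0 still prints 0.
none_count=$(printf '%s\n' "$rows" |
  awk -v code=999 '$1 == code { n++ } END { print n+0 }')

printf 'regex=%s exact=%s none=%s\n' "$regex_count" "$exact_count" "$none_count"
```

Passing the value with -v keeps the awk program in single quotes, so no shell expansion happens inside the awk code itself.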