Using awk array in regular expression - awk

I am trying to search for patterns stored in the awk array "a" (read from file1) in the 4th column of file2. The following command works perfectly:
awk -F" " 'NR==FNR{a[NR]=$1;next}{for (i in a){ if($4 ~ a[i])print $0}}' file1 file2
But if I change the if condition to search at the start of the 4th column of file2, as shown below, it does not work. Any suggestions?
awk -F" " 'NR==FNR{a[NR]=$1;next}{for (i in a){ if($4 ~ "^a[i]" )print $0}}' file1 file2
Can the second command be modified to search for the array element at the start of the 4th column of file2?

Change "^a[i]" to "^"a[i]......

How to print the length of the following line

I would like to modify a file by including the length of the following line, using awk.
My file is like this:
>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-)
ATGCGATGCTAGCTAGCTCGAT
>AAAS:1215614:1215701:-:1::NW_015494524.1:1215614-1215701(-)
ATGCCGCGACGCAGCACCCGACGCGCAG
I am using awk to modify it to have the following format:
>Assembly_AAAS_1_16
ATGTCGATGCTCGATC
>Assembly_AAAS_2_22
ATGCGATGCTAGCTAGCTCGAT
>Assembly_AAAS_3_28
ATGCCGCGACGCAGCACCCGACGCGCAG
I have used awk to modify the first part.
awk -F":" -v i=1 '/>/{print ">Assembly_" $1 "_" val i "_";i++;next} {print length($0)} 1' infile | sed -e "s/_>/_/g" > outfile
I can use print length($0), but how do I print it on the same line?
Thanks
EDIT2: Since the OP has changed the sample data again, adding this code now.
awk -v val="Assembly_AAAS_" '/>/{++i;val=">"val i "_";next} {sub(/ +$/,"");print val length($0) ORS $0}' Input_file
OR
awk -v val="Assembly_AAAS_" '/>/{++i;val=">"val i "_";next} {print val length($1) ORS $0;}' Input_file
The first version removes trailing spaces from the sequence lines of Input_file; if you don't need that, remove the sub(/ +$/,""); part.
EDIT: Changed the solution as per the OP's update.
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{value="\047" val i val1;i++;next} {print value length($0) ORS $0}' Input_file
OR
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '
/>/{ value="\047" val i val1;
i++;
next}
{
print value length($0) ORS $0
}
' Input_file
The following awk may help you with the same.
awk -v i="" -v j=2 '/>/{print "\047>Assembly_GeneName1_"++i"_sizeline"j;j+=2;next} 1' Input_file
2nd solution:
awk -v i=1 -v j=2 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{print "\047" val i val1 j;j+=2;i++;next} 1' Input_file
What you are dealing with is a beautiful example of records which are not lines. awk is a record parser and by default, a record is defined to be a line. With awk you can define a record to be a block of text using the record separator RS.
RS : The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more
than one character, the results are unspecified. If RS is null, then
records are separated by sequences consisting of a <newline> plus one
or more blank lines, leading or trailing blank lines shall not result
in empty records at the beginning or end of the input, and a <newline>
shall always be a field separator, no matter what the value of FS is.
So the goal is to define the record to be
AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
And this can be done by defining RS="\n>". Furthermore, we will use \n as the field separator FS. This way you can get the requested length as length($2) and the count by using the record number NR.
A simple awk script is then:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{$1=">Assembly_AAAS_"NR"_"length($2)}
{print $1,$2}' <file>
This will do exactly what you want.
note: we use print $1,$2 and not print $0 because the last record might have 3 fields (if the last character of the file is a newline), which would give an extra empty line at the end of the output.
If you want to pick the gene name (AAAS) out of $1, you can use substr($1,1,match($1,":")-1). Note that the first record keeps its leading > (it sits at the very start of the file, before any separator), so strip it first. This results in:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{sub(/^>/,"",$1);
 $1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)}
{print $1,$2}' <file>
Finally, be aware that the above solution only counts correctly if there are no blanks in $2; if you want to handle that, you can do this:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{ sub(/^>/,"",$1);
  gsub(/[[:blank:]]/,"",$2);
  $1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)
}
{ print $1,$2 }' <file>

Exact string match in awk

I have a file test.txt with the following lines:
1997 100 500 2010TJ
2010TJXML 16 20 59
I'm using the following awk command to get information only about the string 2010TJ:
awk -v var="2010TJ" '$0 ~ var {print $0}' test.txt
But the code prints both lines. I want to know how to get only the line containing the exact string:
1997 100 500 2010TJ
The string can be in any column of the file.
Several options:
Use a gawk word boundary (not POSIX awk...):
$ gawk '/\<2010TJ\>/' file
An actual space, tab, or whatever separates the columns:
$ awk '/^2010TJ /' file
Or compare the field directly to the string:
$ awk '$1=="2010TJ"' file
You can loop over the fields to test each field if you wish:
$ awk '{for (i=1;i<=NF;i++) if ($i=="2010TJ") {print; next}}' file
Or, given your example of setting a variable, the same options using a variable:
$ gawk -v s=2010TJ '$0~"\\<" s "\\>"'
$ awk -v s=2010TJ '$0~"^" s " "'
$ awk -v s=2010TJ '$1==s'
Note these are not quite the same: the first matches the standalone word 2010TJ anywhere in $0; the second matches only lines that start with that string followed by a space; the third requires the first field to equal the string exactly.
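If the string can be in any column and must match a whole field exactly, the field-loop version also takes a variable (a sketch):
$ awk -v s=2010TJ '{for (i=1;i<=NF;i++) if ($i==s) {print; next}}' file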
Try this (testing only column 1):
awk '$1 == "2010TJ" {print $0}' test.txt
or grep-like (all columns):
gawk '/\<2010TJ\>/ {print $0}' test.txt
Note: \< and \> are word boundaries (GNU awk).
Another awk with a word boundary:
awk '/\y2010TJ\y/' file
Note: \y matches either the beginning or the end of a word (again a GNU awk extension).
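As with the earlier answer, the string can also come from a variable (a sketch, gawk only since \y is a GNU extension):
gawk -v s=2010TJ '$0 ~ "\\y" s "\\y"' file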

Awk variables concatenated with symbols

I am writing a simple bash loop where I use awk to grab lines from a file. The pattern is iterated over in the loop. I can get the program to work fine until I try to add symbols to the variable used in awk for the search pattern.
WORKING PROGRAM (first search term is "cat")
list="cat dog"
for k in $list
do
vark="$k"
awk '/'$vark'/{print RS $0}' RS=\> FILE1 > FILE2
done
But when I try to add the symbols "|", ">", and "_" to the variable, the loop breaks. I have tried multiple separators for the symbols but cannot seem to get awk to correctly integrate the symbols into the variable.
BROKEN PROGRAM (first search term is "|cat>_")
list="cat dog"
for k in $list
do
vark="$k"
varkk="|$k>_"
awk '/'$varkk'/{print RS $0}' RS=\> FILE1 > FILE2
done
Thank you so much for your help!
The correct way to pass a shell/bash variable to awk is the -v option, like -v varname="$shell_var".
From
awk '/'$varkk'/{print RS $0}' RS=\> FILE1 > FILE2
To
awk -v regexp="$varkk" -v RS='>' '$0 ~ regexp{print RS $0}' FILE1 > FILE2
You can also do this, using awk itself
awk '
FNR==NR{ arr[$0]; next }
{ for(i in arr)if($0 ~ i){ print RS $0; next} }
' pattern_file RS='>' FILE1 > FILE2
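As a sketch, the original working loop rewritten with -v (FILE1 and FILE2 as in the question; pat is just a variable name chosen for the sketch):
list="cat dog"
for k in $list
do
    awk -v pat="$k" -v RS='>' '$0 ~ pat {print RS $0}' FILE1 > FILE2
done
If the pattern itself contains regex metacharacters such as |, either escape them or switch from ~ to index($0, pat), which does a literal substring match.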

Using awk how do I combine data in two files and substitute values from the second file to the first file?

Any ideas how to do the following using awk?
Two input files, data.txt and keys.txt:
data.txt contains some data:
A;1
B;2
A;3
keys.txt contains "key;value" pairs ("C" in this example is not part of data.txt, but the awk script should still work):
A;30
B;20
C;10
The output should be as follows:
A;1;30
B;2;20
A;3;30
Hence, each row in data.txt that contains any key from keys.txt should get the corresponding value appended to the row in data.txt.
awk to the rescue!
This assumes the second file has unique keys, unlike the first file (if not, you need to specify what should happen then).
$ awk 'BEGIN {FS=OFS=";"}
NR==FNR {a[$1]=$2; next}
$1 in a {print $0,a[$1]}' file2 file1
A;1;30
B;2;20
A;3;30
ps. note the order of the files: file2 (the keys) must come first so the array is filled before file1 is processed.
awk solution:
awk -F';' 'NR==FNR{a[$1]=$2; next}{if($1 in a) $0=$0 FS a[$1]; print}' file2 file1
The output:
A;1;30
B;2;20
A;3;30
NR==FNR - processing the first file i.e. file2
a[$1]=$2 - storing the value for each key
if($1 in a) $0=$0 FS a[$1] - appending value if first column matches
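If rows whose key is missing from keys.txt should still get a value appended, a small variation could be used (a sketch; "NA" is just an assumed placeholder):
awk -F';' -v OFS=';' 'NR==FNR{a[$1]=$2; next}{print $0, ($1 in a ? a[$1] : "NA")}' keys.txt data.txt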

Removing content of a column based on number of occurences

I have a file (; separated) with data like this:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3;000-200.2
211111121;000-000.1;000-000.2
I would like to remove any $3 that has fewer than 3 occurrences, so the outcome would be like:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
That is, only $3 got deleted, as it had only a single occurrence.
Sadly I am not really sure whether (and thus how) this could be done relatively easily (doing the =COUNTIF matching and manual deletion in Excel feels quite embarrassing).
$ awk -F';' 'NR==FNR{cnt[$3]++;next} cnt[$3]<3{sub(/;[^;]+$/,"")} 1' file file
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
or if you prefer:
$ awk -F';' 'NR==FNR{cnt[$3]++;next} {print (cnt[$3]<3 ? $1 FS $2 : $0)}' file file
This awk one-liner can help; it processes the file twice:
awk -F';' -v OFS=';' 'NR==FNR{a[$3]++;next}a[$3]<3{NF--}7' file file
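Expanded with comments (a sketch; note that relying on decrementing NF to drop the last field is not guaranteed by every awk implementation, although GNU awk handles it as intended):
awk -F';' -v OFS=';' '
  NR==FNR { a[$3]++; next }   # first pass: count the occurrences of each third-column value
  a[$3]<3 { NF-- }            # second pass: drop the last field when its value is rare
  7                           # any non-zero pattern with no action prints the record
' file file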
Though the awk solutions are the best in terms of performance, your goal could also be achieved with something like this:
while IFS=" " read a b;do
if [[ "$a" -lt "3" ]];then
sed -i "s/$b//" b.txt
fi
done <<<"$(cut -d";" -f3 b.txt |sort |uniq -c)"
The operation is based on the output of cut | sort | uniq -c, which counts the occurrences of each third-column value:
$ cut -d";" -f3 b.txt |sort |uniq -c
7 000-000.2
1 000-200.2
3 020-000.8
The above edits the source file in place, so keep a backup for testing.
You can feed the file to awk twice. On the first pass you gather statistics that you use in the second pass:
script.awk
FNR == NR { stats[ $3 ]++
            next
          }
          { if( stats[$3] < 3 ) print $1 FS $2
            else print
          }
Run it like this: awk -F\; -f script.awk yourfile yourfile .
The condition FNR == NR is true during processing of the first filename given to awk. The next statement skips the second block.
Thus the second block is only used for processing the second filename given to awk (which is here the same as the first filename).
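If reading the file twice is inconvenient (for example when the data comes from a pipe), a single-pass sketch that buffers the lines in memory could look like this:
awk -F';' '
  { line[NR] = $0; key[NR] = $3; cnt[$3]++ }               # remember each line and count column 3
  END {
      for (i = 1; i <= NR; i++) {
          if (cnt[key[i]] < 3) sub(/;[^;]+$/, "", line[i]) # drop the last field of rare values
          print line[i]
      }
  }' file
The trade-off is memory: the whole file is held in the arrays until END.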