Awk variables concatenated with symbols - variables

I am writing a simple bash loop where I use awk to grab lines from a file. The search pattern changes on each iteration of the loop. I can get the program to work fine until I try to add symbols to the variable used in awk for the search pattern.
WORKING PROGRAM (first search term is "cat")
list="cat dog"
for k in $list
do
vark="$k"
awk '/'$vark'/{print RS $0}' RS=\> FILE1 > FILE2
done
But when I try to add the symbols "|", ">", and "_" to the variable, the loop breaks. I have tried multiple separators for the symbols but cannot get them integrated into the variable correctly.
BROKEN PROGRAM (first search term is "|cat>_")
list="cat dog"
for k in $list
do
vark="$k"
varkk="|$k>_"
awk '/'$varkk'/{print RS $0}' RS=\> FILE1 > FILE2
done
Thank you so much for your help!

The correct way to pass a shell/bash variable to awk is the -v option, as in -v varname="$shell_var"
From
awk '/'$varkk'/{print RS $0}' RS=\> FILE1 > FILE2
To
awk -v regexp="$varkk" -v RS='>' '$0 ~ regexp{print RS $0}' FILE1 > FILE2
You can also let awk itself read the patterns from a file:
awk '
FNR==NR{ arr[$0]; next }
{ for(i in arr)if($0 ~ i){ print RS $0; next} }
' pattern_file RS='\>' FILE1 >file2
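One caveat with $0 ~ regexp: the variable is interpreted as a regular expression, so a symbol such as | acts as alternation rather than as a literal character. A minimal sketch, assuming made-up sample lines and a hypothetical variable name pat, that contrasts a dynamic-regex match with a literal substring match using index():

```shell
# Hypothetical sample input, invented for illustration.
sample='id1 cat
id2 dog
id3 cat|dog
id4 bird'
pat='cat|dog'

# Dynamic regex: "|" means alternation, so id1, id2 and id3 all match.
regex_hits=$(printf '%s\n' "$sample" | awk -v pat="$pat" '$0 ~ pat')

# Literal substring test: only the line containing the text "cat|dog" matches.
literal_hits=$(printf '%s\n' "$sample" | awk -v pat="$pat" 'index($0, pat)')

printf 'regex match:\n%s\nliteral match:\n%s\n' "$regex_hits" "$literal_hits"
```

If the search text has to remain a regular expression, the metacharacters in it need to be escaped instead.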

Related

How to use awk to find the line starting with a variable

I know 2 things about awk:
1.
PAT='aGeneName'
awk -v var="$PAT" '$3 ~ var {print $0}' file.txt # will print the line where 3rd field includes the variable $PAT
2.
awk '$3 ~ /^aGeneName/' file.txt # will print the line where 3rd field starts with string "aGeneName"
But what I want is the combination of these two: I want to print the line where the 3rd field starts with the variable $PAT, something like
PAT='aGeneName'
awk -v var="$PAT" '$3 ~ /^var/ {print $0}' file.txt # but this is wrong, since variable can't be put into //
One way is like this:
PAT='aGeneName'
awk -v var="$PAT" '$3 ~ "^" var {print $0}' file.txt
The {print $0} can be omitted here; printing the record is the default action.
Another way, when the pattern var is a plain string with no regex metacharacters in it:
PAT='aGeneName'
awk -v var="$PAT" 'index($3, var)==1' file.txt
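The two approaches differ as soon as the variable contains a regex metacharacter. A small sketch, assuming an invented pattern gene.1 and made-up three-column lines, that shows the difference:

```shell
# Hypothetical pattern and data, invented for illustration.
PAT='gene.1'
data='a b gene.1x
a b geneX1y
a b xgene.1'

# Anchored dynamic regex: "." matches any character, so geneX1y matches too.
regex_out=$(printf '%s\n' "$data" | awk -v var="$PAT" '$3 ~ "^" var')

# Literal prefix test: only a field that really starts with "gene.1" matches.
index_out=$(printf '%s\n' "$data" | awk -v var="$PAT" 'index($3, var) == 1')

printf 'regex:\n%s\nindex:\n%s\n' "$regex_out" "$index_out"
```

Neither variant prints the third line, because there the text does not sit at the start of the field.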

How to print the length size of the following line

I would like to modify a file by including the size of following line using awk.
My file is like this:
>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-)
ATGCGATGCTAGCTAGCTCGAT
>AAAS:1215614:1215701:-:1::NW_015494524.1:1215614-1215701(-)
ATGCCGCGACGCAGCACCCGACGCGCAG
I am using awk to modify it to have the following format:
>Assembly_AAAS_1_16
ATGTCGATGCTCGATC
>Assembly_AAAS_2_22
ATGCGATGCTAGCTAGCTCGAT
>Assembly_AAAS_3_28
ATGCCGCGACGCAGCACCCGACGCGCAG
I have used awk to modify the first part.
awk -F":" -v i=1 '/>/{print ">Assembly_" $1 "_" val i "_";i++;next} {print length($0)} 1' infile | sed -e "s/_>/_/g" > outfile
I can use print length($0) but how to print it in the same line?
Thanks
EDIT2: Since the OP has changed the sample data again, I am adding this code.
awk -v val="Assembly_AAAS_" '/>/{++i;hdr=">"val i "_";next} {sub(/ +$/,"");print hdr length($0) ORS $0}' Input_file
OR
awk -v val="Assembly_AAAS_" '/>/{++i;hdr=">"val i "_";next} {print hdr length($1) ORS $0;}' Input_file
The first command strips trailing spaces from each sequence line of Input_file before measuring it; if you don't need that, remove the sub(/ +$/,""); part. The header is built in a separate variable (hdr) so that val is not overwritten and the prefix does not accumulate from one header to the next.
EDIT: Updated solution, following the OP's changes to the question.
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{value="\047" val i val1;i++;next} {print value length($0) ORS $0}' Input_file
OR
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '
/>/{ value="\047" val i val1;
i++;
next}
{
print value length($0) ORS $0
}
' Input_file
The following awk may help you with this.
awk -v i="" -v j=2 '/>/{print "\047>Assembly_GeneName1_"++i"_sizeline"j;j+=2;next} 1' Input_file
Second solution:
awk -v i=1 -v j=2 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{print "\047" val i val1 j;j+=2;i++;next} 1' Input_file
What you are dealing with is a beautiful example of records which are not lines. awk is a record parser and by default, a record is defined to be a line. With awk you can define a record to be a block of text using the record separator RS.
RS : The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more
than one character, the results are unspecified. If RS is null, then
records are separated by sequences consisting of a <newline> plus one
or more blank lines, leading or trailing blank lines shall not result
in empty records at the beginning or end of the input, and a <newline>
shall always be a field separator, no matter what the value of FS is.
So the goal is to define the record to be
AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
This can be done by setting RS="\n>". Furthermore, we will use \n as the field separator FS. This way you can get the requested length as length($2) and the count from the record number NR. Note that a multi-character RS is an extension (gawk supports it); as quoted above, POSIX leaves the result unspecified.
A simple awk script is then:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{$1=">Assembly_AAAS_"NR"_"length($2)}
{print $1,$2}' file
This produces the requested format.
note: we use print $1,$2 and not print $0 because the last record might have 3 fields (if the last character of the file is a newline). Printing $0 would then leave an extra empty line at the end of the output.
If you want to pick the AAAS string out of $1, you can use substr($1,1,match($1,":")-1). The first record still carries the leading ">" of the file, so strip it first:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{sub(/^>/,"",$1); $1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)}
{print $1,$2}' file
Finally, be aware that the above solution only works if there are no spaces in $2. If there may be, strip them first:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{ gsub(/[[:blank:]]/,"",$2); sub(/^>/,"",$1)
$1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)
}
{ print $1,$2 }' file
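Because POSIX leaves a multi-character RS unspecified, a line-based variant may be safer on awks other than gawk. The following is only a sketch, assuming every header line is followed by exactly one sequence line; the variable names name and n are made up here:

```shell
# Two records taken from the sample data in the question.
fasta='>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-)
ATGCGATGCTAGCTAGCTCGAT'

result=$(printf '%s\n' "$fasta" | awk -F: '
  /^>/ { name = substr($1, 2); n++; next }   # header line: keep the gene name, count it
  {
    print ">Assembly_" name "_" n "_" length($0)   # new header with the sequence length
    print                                          # the sequence line itself
  }')

printf '%s\n' "$result"
```

This keeps the default record separator and only remembers the gene name until the sequence line arrives.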

Exact string match in awk

I have a file test.txt with the next lines
1997 100 500 2010TJ
2010TJXML 16 20 59
I'm using the next awk line to get information only about string 2010TJ
awk -v var="2010TJ" '$0 ~ var {print $0}' test.txt
But the code prints both lines. I want to know how to get only the line containing the exact string
1997 100 500 2010TJ
the string can be placed in any column of the file.
Several options:
Use a gawk word boundary (not POSIX awk...):
$ gawk '/\<2010TJ\>/' file
An actual space or tab or what is separating the columns:
$ awk '/^2010TJ /' file
Or compare the field directly to the string:
$ awk '$1=="2010TJ"' file
You can loop over the fields to test each field if you wish:
$ awk '{for (i=1;i<=NF;i++) if ($i=="2010TJ") {print; next}}' file
Or, given your example of setting a variable, those same using a variable:
$ gawk -v s=2010TJ '$0~"\\<" s "\\>"'
$ awk -v s=2010TJ '$0~"^" s " "'
$ awk -v s=2010TJ '$1==s'
Note that the first differs from the second and third. The first matches the standalone word 2010TJ anywhere in $0; the second matches only when the line starts with 2010TJ followed by a space, and the third only when the first field is exactly 2010TJ.
Try this (for testing only column 1) :
awk '$1 == "2010TJ" {print $0}' test.txt
or grep like (all columns) :
gawk '/\<2010TJ\>/ {print $0}' test.txt
Note: \< and \> are word boundaries (a gawk extension).
Another gawk option with a word boundary:
gawk '/\y2010TJ\y/' file
Note: \y matches at either the beginning or the end of a word.
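For an awk without the gawk word-boundary operators, the field loop shown earlier can take the search string from a variable. A minimal sketch using the two sample lines from the question (the variable name s is just an illustration):

```shell
# The two lines of test.txt from the question.
lines='1997 100 500 2010TJ
2010TJXML 16 20 59'

# Compare every field to the string exactly; print the line on the first hit.
exact=$(printf '%s\n' "$lines" |
  awk -v s="2010TJ" '{ for (i = 1; i <= NF; i++) if ($i == s) { print; next } }')

printf '%s\n' "$exact"
```

Since this is a string comparison on whole fields, 2010TJXML does not match, and regex metacharacters in s need no escaping.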

Using awk array in regular expression

I am trying to match patterns stored in the awk array "a" (read from file1) against the 4th column of file2. The following command works perfectly:
awk -F" " 'NR==FNR{a[NR]=$1;next}{for (i in a){ if($4 ~ a[i])print $0}}' file1 file2
But if I change the if condition to match only at the start of the 4th column of file2, as shown below, it does not work. Any suggestions, please?
awk -F" " 'NR==FNR{a[NR]=$1;next}{for (i in a){ if($4 ~ "^a[i]" )print $0}}' file1 file2
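Can the second command be modified to match the array element at the start of the 4th column in file2?
Change "^a[i]" to "^" a[i]. Inside the quotes, a[i] is just literal text; outside them, awk concatenates "^" with the value of the array element.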
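A self-contained sketch of the corrected command, assuming two small made-up files created in a temporary directory (their contents are invented for illustration):

```shell
# Hypothetical file1 (patterns) and file2 (data), created in a scratch directory.
dir=$(mktemp -d)
printf 'abc\nxyz\n' > "$dir/file1"
printf '1 2 3 abc123\n1 2 3 9abc\n1 2 3 xyz\n' > "$dir/file2"

# "^" a[i] concatenates the anchor with the stored pattern before matching.
matches=$(awk 'NR==FNR { a[NR] = $1; next }
  { for (i in a) if ($4 ~ "^" a[i]) { print; break } }' "$dir/file1" "$dir/file2")

printf '%s\n' "$matches"
rm -r "$dir"
```

The line with 9abc is skipped because abc does not sit at the start of its 4th column. The break only prevents a line from being printed twice when several patterns match it.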

What does this awk script mean? `awk -F "|" '{!a[$1]++}{printf RS $1}{print FS $2}' input.txt`

I need to understand the meaning of the code below, and help moving forward.
`awk -F "|" '{!a[$1]++}{printf RS $1}{print FS $2}' input.txt`
My sample input file looks like this:
1|Balaji 1|Kumar 3|India 3|China 3|Australia 1|Dinesh
I need output like this:
1|Balaji|Kumar|Dinesh 3|India|China|Australia
I won't explain the awk line in your question, because it doesn't make much sense:
it creates the array a[] but never uses it
it uses RS and FS incorrectly
Try this one-liner:
awk -F'[| ]' '{for(i=1;i<=NF;i++)if(i%2)a[$i]=a[$i]?a[$i]"|"$(i+1):$(i+1)}
END{for(x in a) printf x"|"a[x]" ";print ""}' file
with your example:
kent$ echo "1|Balaji 1|Kumar 3|India 3|China 3|Australia 1|Dinesh"|awk -F'[| ]' '{for(i=1;i<=NF;i++)if(i%2)a[$i]=a[$i]?a[$i]"|"$(i+1):$(i+1)}END{for(x in a) printf x"|"a[x]" ";print ""}'
1|Balaji|Kumar|Dinesh 3|India|China|Australia
Note that the output ends with a trailing space; it could be removed in the END loop.
Surprisingly, it can be simplified. I am not sure why !a[$1]++ is there; it is redundant:
awk -F "|" '{printf RS $1}{print FS $2}' input.txt
It first prints the record separator, which is a newline, then $1, the first field; then the field separator, which is "|", then the second field $2, and finally a newline (because that statement is print; with printf no newline would be printed).
Based on your comment, the following should work:
awk '{
for(i=1;i<=NF;i++){split($i,a,"|");
b[a[1]]?b[a[1]]=b[a[1]]" "a[2]:b[a[1]]=a[2]
}
for(j in b)printf j"|"b[j]" ";
print"";}' your_file
Changing the record separator makes this data easy to read. It has only one small bug that I do not see how to solve: it prints the output on two lines.
awk -F\| '{a[$1]=a[$1]?a[$1]"|"$2:$2} END{for(i in a) printf i"|"a[i]" "}' RS=" " file
1|Balaji|Kumar|Dinesh
3|India|China|Australia
New version with correct output, thanks to Birei:
awk -F\| '{sub(/\n/,x, $0); a[$1]=a[$1]?a[$1]"|"$2:$2} END{for(i in a) printf i"|"a[i]" "}' RS=" " file
1|Balaji|Kumar|Dinesh 3|India|China|Australia
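All of the variants above iterate with for (x in a), whose traversal order awk does not define, so the groups are not guaranteed to come out as 1 then 3. A sketch that records the order in which keys first appear and avoids the trailing space; the array names vals and order are made up for this example:

```shell
input='1|Balaji 1|Kumar 3|India 3|China 3|Australia 1|Dinesh'

grouped=$(printf '%s\n' "$input" | awk '
{
  for (i = 1; i <= NF; i++) {
    split($i, kv, "|")                     # kv[1] = key, kv[2] = value
    if (kv[1] in vals) {
      vals[kv[1]] = vals[kv[1]] "|" kv[2]  # append to an existing group
    } else {
      order[++n] = kv[1]                   # remember first-seen key order
      vals[kv[1]] = kv[2]
    }
  }
}
END {
  for (j = 1; j <= n; j++)
    printf "%s%s|%s", (j > 1 ? " " : ""), order[j], vals[order[j]]
  print ""
}')

printf '%s\n' "$grouped"
```

The separator is printed before every group except the first, which is why no space is left at the end of the line.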