how to retain header using AWK - awk

How can I print the header from data1 while running this comparison of two files, data1 and data2, matching on column 2? My code only prints data lines. The headers are named differently, so I chose to match by column position.
awk -F, 'FNR==NR {a[$2]=$0; next}; $2 in a {print a[$2]}' /data1 /data2 > /data3.txt

$ awk -F, 'NR==1; FNR==NR{a[$2]=$0; next} ...
will print the first line of the first file. To keep the header line out of the rest of the processing as well, replace NR==1; with NR==1{print; next}.
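A minimal sketch of the complete command, assuming two small comma-separated files. The file contents and the temporary directory are hypothetical, made up only to show the header being kept:

```shell
# Hypothetical sample data: the headers differ and the key is column 2.
tmp=$(mktemp -d)
printf 'id,key,val\n1,k1,a\n2,k2,b\n' > "$tmp/data1"
printf 'num,code\n9,k2\n8,k3\n' > "$tmp/data2"

out=$(awk -F, '
  NR==1   { print; next }        # header of the first file: print it and stop there
  FNR==NR { a[$2]=$0; next }     # rest of the first file: index each line by column 2
  $2 in a { print a[$2] }        # second file: print the matching line of the first file
' "$tmp/data1" "$tmp/data2")
printf '%s\n' "$out"
```

With this sample data the output is the data1 header followed by the single matching row, 2,k2,b.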

Related

How to print the length size of the following line

I would like to modify a file by including the size of following line using awk.
My file is like this:
>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-)
ATGCGATGCTAGCTAGCTCGAT
>AAAS:1215614:1215701:-:1::NW_015494524.1:1215614-1215701(-)
ATGCCGCGACGCAGCACCCGACGCGCAG
I am using awk to modify it to have the following format:
>Assembly_AAAS_1_16
ATGTCGATGCTCGATC
>Assembly_AAAS_2_22
ATGCGATGCTAGCTAGCTCGAT
>Assembly_AAAS_3_28
ATGCCGCGACGCAGCACCCGACGCGCAG
I have used awk to modify the first part.
awk -F":" -v i=1 '/>/{print ">Assembly_" $1 "_" val i "_";i++;next} {print length($0)} 1' infile | sed -e "s/_>/_/g" > outfile
I can use print length($0) but how to print it in the same line?
Thanks
EDIT2: Since the OP has changed the sample data again, adding this code now.
awk -v val="Assembly_AAAS_" '/>/{++i;hdr=">"val i "_";next} {sub(/ +$/,"");print hdr length($0) ORS $0}' Input_file
OR
awk -v val="Assembly_AAAS_" '/>/{++i;hdr=">"val i "_";next} {print hdr length($1) ORS $0;}' Input_file
The header is built in a separate variable (hdr) so that val is not overwritten and the prefix does not accumulate from one header to the next. The first command strips trailing spaces from each line of Input_file; if you don't need that, remove the sub(/ +$/,""); part. The second leaves the line untouched and measures only its first field.
EDIT: Changed the solution as per the OP's update.
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{value="\047" val i val1;i++;next} {print value length($0) ORS $0}' Input_file
OR
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '
/>/{ value="\047" val i val1;
i++;
next}
{
print value length($0) ORS $0
}
' Input_file
The following awk may help with this.
awk -v i="" -v j=2 '/>/{print "\047>Assembly_GeneName1_"++i"_sizeline"j;j+=2;next} 1' Input_file
Solution 2nd:
awk -v i=1 -v j=2 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{print "\047" val i val1 j;j+=2;i++;next} 1' Input_file
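A minimal line-based sketch that also takes the gene name from the header instead of hard-coding it. It assumes every header starts with > and that the name is the text before the first colon; the shortened sample headers are hypothetical:

```shell
out=$(printf '>AAAS:1:2:-\nATGTCGATGCTCGATC\n>AAAS:3:4:-\nATGCGATGCTAGCTAGCTCGAT\n' |
awk -F: '
  /^>/ { name = substr($1, 2); ++i; next }   # header: keep the gene name, count it, print nothing yet
  { print ">Assembly_" name "_" i "_" length($0)   # sequence line: emit the new header first
    print }                                        # then the sequence itself
')
printf '%s\n' "$out"
```

This prints >Assembly_AAAS_1_16 and >Assembly_AAAS_2_22, each followed by its sequence line.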
What you are dealing with is a beautiful example of records which are not lines. awk is a record parser and by default, a record is defined to be a line. With awk you can define a record to be a block of text using the record separator RS.
RS : The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more
than one character, the results are unspecified. If RS is null, then
records are separated by sequences consisting of a <newline> plus one
or more blank lines, leading or trailing blank lines shall not result
in empty records at the beginning or end of the input, and a <newline>
shall always be a field separator, no matter what the value of FS is.
So the goal is to define the record to be
AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
And this can be done by defining RS="\n>": the headers in the file start with >, and a multi-character RS needs an awk such as gawk or mawk that treats it as a regular expression. Furthermore we will use \n as the field separator FS. This way you can get the requested length as length($2) and the count by using the record count NR.
A simple awk script is then:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{$1=">Assembly_AAAS_"NR"_"length($2)}
{print $1,$2}' <file>
This will do exactly what you want.
note: we use print $1,$2 and not print $0 as the last record might have 3 fields (if the last char of the file is a newline). This would imply that you would have an extra empty line at the end of your file.
If you want to pick the AAAS string out of $1 you can use substr($1,1,match($1,":")-1) to pick it up. The first record still carries its leading >, so that is stripped first. This results in this:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{sub(/^>/,"",$1); $1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)}
{print $1,$2}' <file>
Finally, be aware that the above solution only gives the right length if there are no spaces in $2. If there may be, you can strip them first:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{ sub(/^>/,"",$1); gsub(/[[:blank:]]/,"",$2);
$1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)
}
{ print $1,$2 }' <file>
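Since the quoted specification leaves a multi-character RS unspecified, here is a sketch of the same record-based idea with a single-character RS. It assumes > occurs only at the start of header lines; the shortened sample headers are hypothetical:

```shell
out=$(printf '>AAAS:1:2:-\nATGTCGATGCTCGATC\n>AAAS:3:4:-\nATGCGATGCTAGCTAGCTCGAT\n' |
awk 'BEGIN { RS = ">"; FS = "\n" }
  NR > 1 {                         # record 1 is the empty text before the first ">"
    split($1, h, ":")              # h[1] is the gene name from the header
    print ">Assembly_" h[1] "_" (NR - 1) "_" length($2)
    print $2                       # the sequence line
  }')
printf '%s\n' "$out"
```

Because the first record is empty, the counter is NR-1 rather than NR.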

Using awk how do I combine data in two files and substitute values from the second file to the first file?

Any ideas how to do the following using awk?
Two input files, data.txt and keys.txt:
data.txt contains some data:
A;1
B;2
A;3
keys.txt contains "key;value" pairs ("C" is not part of data.txt in this example, but the awk script should still work):
A;30
B;20
C;10
The output should be as follows:
A;1;30
B;2;20
A;3;30
Hence, each row in data.txt that contains any key from keys.txt should get the corresponding value appended to the row in data.txt.
awk to the rescue!
This assumes the second file (keys.txt) has unique keys, unlike the first; if not, you need to specify what should happen then.
$ awk 'BEGIN {FS=OFS=";"}
NR==FNR {a[$1]=$2; next}
$1 in a {print $0,a[$1]}' file2 file1
A;1;30
B;2;20
A;3;30
ps. note the order of files: the keys file (file2) is read first.
awk solution:
awk -F';' 'NR==FNR{a[$1]=$2; next}{if($1 in a) $0=$0 FS a[$1]; print}' file2 file1
The output:
A;1;30
B;2;20
A;3;30
NR==FNR - processing the first file i.e. file2
a[$1]=$2 - storing the value for each key
if($1 in a) $0=$0 FS a[$1] - appending value if first column matches
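The two answers differ for rows whose key is missing from keys.txt: the first drops them, the second prints them unchanged. A small sketch of that difference, using the question's data plus one hypothetical extra row (D;4) that has no key:

```shell
tmp=$(mktemp -d)
printf 'A;1\nB;2\nA;3\nD;4\n' > "$tmp/data.txt"    # D;4 is an added, hypothetical row
printf 'A;30\nB;20\nC;10\n'   > "$tmp/keys.txt"

# First answer: only rows with a known key are printed.
drop=$(awk 'BEGIN {FS=OFS=";"}
  NR==FNR {a[$1]=$2; next}
  $1 in a {print $0,a[$1]}' "$tmp/keys.txt" "$tmp/data.txt")

# Second answer: every row is printed, with the value appended when the key is known.
keep=$(awk -F';' 'NR==FNR{a[$1]=$2; next}{if($1 in a) $0=$0 FS a[$1]; print}' "$tmp/keys.txt" "$tmp/data.txt")

printf '%s\n---\n%s\n' "$drop" "$keep"
```

Which behaviour is wanted depends on whether unmatched rows should survive in the output.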

awk to parse field by using period and output unique digits

I am trying to use awk to split $2 at the first . in the string and output the leading digits, with each header row kept above its block. The current output is close, but both commands seem to be taking $1 as well. Do I need to specify something in the command so that it only prints the digits in $2? Thank you :).
file
R_2016_09_20_12_47
IonXpress_007 16-0001.xxx.xxx
IonXpress_008 16-0002.xxx.xxx
IonXpress_009 16-0003.xxx.xxx
R_2016_09_20_12_46
IonXpress_007 16-0004.xxx.xxx
IonXpress_008 16-0005.xxx.xxx
IonXpress_009 16-0006.xxx.xxx
desired output
R_2016_09_20_12_47
16-0001
16-0002
16-0003
R_2016_09_20_12_46
16-0004
16-0005
16-0006
awk
awk -F. '{print $1}' file
cut
cut -d'.' -f1 file
current output
R_2016_09_20_12_47
IonXpress_007 16-0001
IonXpress_008 16-0002
IonXpress_009 16-0003
R_2016_09_20_12_46
IonXpress_007 16-0004
IonXpress_008 16-0005
IonXpress_009 16-0006
Try this :
% awk -F'[ .]' '{print $2 ? $2 : $1}' file
R_2016_09_20_12_47
16-0001
16-0002
16-0003
R_2016_09_20_12_46
16-0004
16-0005
16-0006
NOTE
I take both space and . as field separators.
I use the ternary operator to print $2 when it is non-empty and $1 (the header row) otherwise.
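An alternative sketch that keeps the default field separator and tests the number of fields instead of the value of $2. It assumes header rows have exactly one field and data rows have two:

```shell
out=$(printf 'R_2016_09_20_12_47\nIonXpress_007 16-0001.xxx.xxx\nR_2016_09_20_12_46\nIonXpress_007 16-0004.xxx.xxx\n' |
awk '
  NF > 1 { sub(/\..*/, "", $2); print $2; next }   # data row: cut $2 at its first "."
  { print }                                        # header row: print unchanged
')
printf '%s\n' "$out"
```

Testing NF avoids treating a $2 that happens to be 0 as a header row.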

Matching two fields between two files AWK

I am trying to match fields 1 and 3 of one file to fields 1 and 2 of another file and print the matching line of the second file. The first file is tab-delimited and the second is comma-delimited. Why do I get an unexpected token error?
file1
1 x 12345 x x x
file2
1,12345,x,x,x
script
awk -F',' FNR==NR{a[$1]=$1,$3; next} ($1,$2 in a) {print}' file1 file2 > output.txt
same idea, but doesn't depend on uniqueness of the first field but the pair instead
$ awk 'NR==FNR{a[$1,$3]; next} ($1,$2) in a' file1 FS=, file2
1,12345,x,x,x
You almost nailed it !
awk 'NR==FNR{first[$1]=$3;next} $1 in first{if(first[$1]==$2){print}}' file1 FS="," file2
Output
1,12345,x,x,x
Notes
Since the two files use different field separators, FS is changed between them on the command line.
This script assumes that the first field of file1 is unique; if it is not, later rows overwrite earlier ones and matches are missed.
See [ switching field separator ] in between files.
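A self-contained sketch of the pair-keyed variant with the separator set explicitly for each file. The extra non-matching rows are hypothetical, and FS='\t' is an assumption that file1 is strictly tab-delimited:

```shell
tmp=$(mktemp -d)
printf '1\tx\t12345\tx\tx\tx\n2\tx\t999\tx\tx\tx\n' > "$tmp/file1"
printf '1,12345,x,x,x\n2,555,x,x,x\n'               > "$tmp/file2"

out=$(awk '
  NR==FNR { seen[$1,$3]; next }   # file1: remember each (field 1, field 3) pair
  ($1,$2) in seen                 # file2: print lines whose (field 1, field 2) pair was seen
' FS='\t' "$tmp/file1" FS=, "$tmp/file2")
printf '%s\n' "$out"
```

Each FS=... operand takes effect just before the file that follows it is opened, so no BEGIN block is needed.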

awk field separator not working for first line

echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk 'FS="_length" {print $1}'
Obtained output:
NODE_1_length_317516_cov_18.568_ID_4005
Expected output:
NODE_1
How is that possible? I'm missing something.
When you are going through lines using Awk, the field separator is interpreted before processing the record. Awk reads the record according to the current values of FS and RS and goes ahead performing the operations you ask it for.
This means that if you set the value of FS while reading a record, this won't have effect for that specific record. Instead, the FS will have effect when reading the next one and so on.
So if you have a file like this:
$ cat file
1,2 3,4
5,6 7,8
And you set the field separator while reading one record, it takes effect from the next line:
$ awk '{FS=","} {print $1}' file
1,2 # FS is still the space!
5
So what you want to do is to set the FS before starting to read the file. That is, set it in the BEGIN block or via parameter:
$ awk 'BEGIN{FS=","} {print $1}' file
1 # now, FS is the comma
5
$ awk -F, '{print $1}' file
1
5
There is also another way: make Awk recompute the full record with {$0=$0}. With this, Awk will take into account the current FS and act accordingly:
$ awk '{FS=","} {$0=$0;print $1}' file
1
5
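The three cases can be compared side by side. A small sketch using the same two-line input, assuming a POSIX-conforming awk such as gawk or mawk (the variable names are only for illustration):

```shell
input='1,2 3,4
5,6 7,8'

late=$(printf '%s\n' "$input"    | awk '{FS=","} {print $1}')          # FS set after record 1 was split
early=$(printf '%s\n' "$input"   | awk 'BEGIN{FS=","} {print $1}')     # FS set before any input is read
resplit=$(printf '%s\n' "$input" | awk '{FS=","; $0=$0; print $1}')    # record re-split with the new FS

printf 'late:\n%s\nearly:\n%s\nresplit:\n%s\n' "$late" "$early" "$resplit"
```

Only the first variant prints 1,2 for the first line; the other two print 1 and 5.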
The awk statement is used incorrectly.
The correct way is:
awk 'BEGIN { FS = "#{delimiter}" } ; { print $1 }'
In your case you can use
awk 'BEGIN { FS = "_length" } ; { print $1 }'
Built-in variables such as FS and ORS should be assigned inside an action, i.e. in one of the following blocks: BEGIN, the action of a condition, or END. For FS to apply to the first record it has to be set in BEGIN.
$ echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk 'BEGIN{FS="_length"} {print $1}'
NODE_1
$
You can also pass the delimiter using -F switch like this:
$ echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk -F "_length" '{print $1}'
NODE_1
$
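To see why the original command seemed to ignore FS, it helps to feed it two lines. In the sketch below the second input line is hypothetical; the assignment FS="_length" is used as a pattern, so it is evaluated only after each record has already been split:

```shell
out=$(printf 'NODE_1_length_317516\nNODE_2_length_1234\n' |
awk 'FS="_length" {print $1}')
# Record 1 was split with the default FS, so $1 is the whole line.
# Record 2 is split with the new FS, so $1 is NODE_2.
printf '%s\n' "$out"
```

The assignment evaluates to a non-empty string, which is why the action still runs for every line.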