I am using the awk command in terminal on my Mac.
I want to print the contents of an already existing file, give a title to each column (which I'll separate using a tab), and then send the output to another file. What line of code would I use to give titles to the columns? I'm hoping to use simple awk commands and, preferably, to complete the task in as few lines as possible.
So far I have tried to use the BEGIN command. (The titles I want to give are first name, second name and score)
BEGIN { print "First Name\tSecond Name\tScore" }
then I want to print the entire contents of the file.
{print} filename.txt
Finally I want to save the output to another file
End{print} filename.txt > output.txt
To do this all together:
awk 'BEGIN {print "First Name\tSecond Name\tScore";}
{print}
End{print}' filename.txt > output.txt
However, this only saved the titles to the output file and not the contents of the original file under the columns.
awk processes files line by line. Before it starts processing the file, you can have it do something: the BEGIN keyword marks a block of code that is executed before any input is read. Likewise, an END block runs after the processing of the last line of the file is complete.
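For example, a minimal illustration of the ordering (somefile.txt here is just a placeholder name):
awk 'BEGIN {print "runs first"} {print "line: " $0} END {print "runs last"}' somefile.txt
"runs first" is printed before any input is read, each line is echoed with a prefix, and "runs last" is printed after the last line has been processed.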
While your code has some superfluous bits in it, like the unnecessary END block, it should still do exactly what you want, assuming you have data in your filename.txt.
A more succinct awk code would be:
awk 'BEGIN {print "First Name\tSecond Name\tScore";}1' filename.txt > output.txt
In action (using commas instead of tabs because it's easier and I'm lazy):
$ echo "1,2,3" > filename.txt
$ awk 'BEGIN {print "c1,c2,c3"}1' filename.txt > output.txt
$ cat output.txt
c1,c2,c3
1,2,3
I would like to get some explanation here!
I'm new to awk and I'd like to know how to insert a line at the top of a file.
This is what I've tried so far
file.txt
content line
another line
awk command
awk 'BEGIN {print "first line" } {print}' file.txt
the output
first line
content line
another line
However, when run with -i inplace it doesn't write to the file and only gives me this output:
first line
I would like to know what I am doing wrong, and if you guys can explain it I'd really appreciate it.
The BEGIN{} block is processed before any files are processed which means any output generated by the BEGIN{} block has nowhere to go but to stdout.
To get the line inserted into the file you need to move the print "first line" into the main body of the script where it can be processed along with the file.
One idea based on inserting the new row while reading the first line of the input file:
$ awk -i inplace 'FNR==1 {print "first line"}1' file.txt
$ cat file.txt
first line
content line
another line
NOTES:
FNR==1 will apply to each file if you happen to feed multiple files to awk, otherwise NR==1 will also suffice if it's just the one input file
the stand-alone 1 is non-zero and thus considered by awk as 'true'; here it applies against all input lines and the default behavior for a 'true' is to pass the input line through to stdout (effectively the same as {print})
The -i inplace includes the built-in inplace.awk include file to emulate sed -i in-place editing but there are some caveats that come from the method used for this emulation.
Because it works by fiddling with the processing of each input file (by using the BEGINFILE pattern), anything printed in the BEGIN selector still goes to the start-up output stream rather than to the "fiddled" output stream. That's because the first input file has yet to begin processing at that point.
So what you'll see is first line going to standard output then the two lines from the input file being printed in-place back to that file. You just may not have realised that last bit since you don't change the lines in the file when you write them back.
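A quick demonstration of that behaviour (demo.txt is just a made-up two-line file):
$ printf 'a\nb\n' > demo.txt
$ awk -i inplace 'BEGIN{print "first line"} {print}' demo.txt
first line
$ cat demo.txt
a
b
The "first line" text lands on the terminal, while demo.txt is rewritten in place with its original, unchanged content.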
This was no doubt a difficult implementation decision for the creators of inplace.awk since in-place editing over multiple files needs to be catered for. The question is: where should output go in BEGIN for in-place editing?
You have a couple of options here.
First, if you know you'll only ever process one file, you can use the normal trick, with BEGIN but without inplace:
awk 'BEGIN {print "first line" } {print}' file.txt > /tmp/$$ && mv /tmp/$$ file.txt
Second, using inplace but not BEGIN, you need to first decide which of the input files it should affect. If you want it to affect all input files, that means something like:
awk -i inplace 'FNR==1 {print "first line";print;next} {print}' file1.txt file2.txt
If you want it to affect just the first input file, use NR rather than FNR (the former never decreases, the latter resets to one for each new input file).
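For example, to insert the line into only the first of several files (file names here are placeholders), something along these lines should work:
awk -i inplace 'NR==1 {print "first line"; print; next} {print}' file1.txt file2.txt
Since NR equals one only for the very first line read across all files, only file1.txt gets the new line; file2.txt is rewritten unchanged.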
Finally, for the case where all files should be affected, you can use the same method that inplace itself uses.
As special patterns like BEGIN are executed in order of definition (and -i comes before processing of your script), simply use BEGINFILE rather than BEGIN, as per the following transcript:
=====
pax@styx:~$ cat xx.001
Original line from xx.001
=====
pax@styx:~$ cat xx.002
Original line from xx.002
=====
pax@styx:~$ cat xx.awk
BEGINFILE {
print "inserted line"
}
{
print
}
=====
pax@styx:~$ awk -i inplace -f xx.awk xx.001 xx.002
=====
pax@styx:~$ cat xx.001
inserted line
Original line from xx.001
=====
pax@styx:~$ cat xx.002
inserted line
Original line from xx.002
The BEGINFILE from inplace will first weave its magic to capture output (per input file), then your BEGINFILE will print to that capture area.
Non-awk version:
I was excited to see the -i inplace option, but it's not available on macOS or the BSDs. So in addition to @paxdiablo's solution of a tmp file, here's how you can prepend lines with /bin/ed.
ed -s file.txt << EOF
1i
first line
.
w
EOF
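For reference: -s suppresses ed's byte-count chatter, 1i starts inserting text before line 1, the lone . ends the inserted text, and w writes the buffer back to file.txt in place.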
What I would do:
$ awk -i inplace 'NR==1{print "first line";print;next} {print}' file ; cat file
first line
original content line 1
original content line 2
I want to add a row number to each line of a file, so I do it like this:
awk '{print $0 "\x03" NR > "/opt/data2/gds_test/test_partly.txt"}' /opt/data2/gds_test/test_partly.txt
I put this command in a shell script file and ran it. After some time it still had not finished, so I killed it, and then I found the source file's size had grown from 1.7G to 242G.
What happened? I am a little confused.
I had tested with a small file on the command line, and the awk command seemed OK.
You're reading from the front of a file at the same time as you're writing onto the end of it. Try this instead:
tmp=$(mktemp)
awk '{print $0 "\x03" NR}' '/opt/data2/gds_test/test_partly.txt' > "$tmp" &&
mv "$tmp" '/opt/data2/gds_test/test_partly.txt'
Yes, I changed it to redirect the result to a tmp file, then delete the original file and rename the tmp file, and it works.
I also just learned that gawk -i inplace can be used.
I have an awk file in which I read each word from a file into an array. There is no print command in it, but after I run it, the whole content of the file is printed.
#!/bin/awk -f
{
for(i=1;i<=NF;i++)
used[$i]=1
}
After I run this awk file like this:
awk 1.awk 2
the whole content of file 2 is printed on the screen, and I am confused.
When I tried the same commands directly from the command line, nothing was printed, so I think there is something wrong with the file or with the way I am running it.
You missed the -f option: awk -f 1.awk 2
What happens is that, instead of the contents of "1.awk" being used as the awk commands, you're providing the literal string 1.awk as the awk program.
You have essentially done this: awk '"1.awk"' 2
And since that is a "true" value, the default action is to print each record of the data contained in file "2".
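To see the difference side by side (assuming 1.awk from the question is in the current directory, and using a throwaway data file):
$ printf 'hello world\nfoo bar\n' > data.txt
$ awk 1.awk data.txt
hello world
foo bar
$ awk -f 1.awk data.txt
$
The first call uses the literal text 1.awk as a (truthy) program, so every record is printed; the second actually runs the commands in 1.awk, which produce no output.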
I have a fasta file that contains two gene sequences, and what I want to do is remove the fasta headers (lines starting with ">"), concatenate the rest of the lines, and output that sequence.
Here is my fasta sequence (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this but I am getting this error
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a python person and I was given this script by someone. What am I doing wrong here?
I realized that I was not clear, so I am pasting the whole code that I got from someone. The input file and desired output remain the same:
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;
If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file that is named using its header, and the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable nor a fixed string: it is not in quotes, and variable names can't contain a dot, so no Awk will accept it as written. Beyond that, BSD Awk does not allow building the file name with string operations directly in the redirection, whereas GNU Awk does.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
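(As an aside, parenthesising the redirection target, as in print $0 > (filename ".fa"), is the usual portable spelling: the parentheses remove the ambiguity for awks that would otherwise reject the bare concatenation.)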
To fix this, you can append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration statement as it is unnecessary. Also, I replaced the sub(...) call with the string function substr, which is clearer and needs fewer steps.
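For reference, substr($0, 2) returns the record from its second character onward, so a header like >Potrs164783 becomes Potrs164783, and ".fa" is then appended to that.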
The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Try this to print the lines that do not start with > joined into one line:
awk '!/^>/{printf $0}' genome.fa > filename.fa
With the line breaks kept:
awk '!/^>/' genome.fa > filename.fa
To create single files named by the headers:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa
I am trying to use awk on OSX to sort a large csv file by the IDs in the first column.
I started with:
awk -F, 'NR>1 {print > ($1 ".sync")}' file.csv
However, the process stopped at ID s_17 with the error:
awk: s_18.sync makes too many open files input record number 37674601,
file file.csv source line number 1
I tried modifying it with this close() statement, but then it only writes the first file:
awk -F, 'NR>1 {print > ($1 ".sync");close($1 ".sync")}' file.csv
Can anyone provide insight on how to close the files after each one, properly, so that the number of open files stays manageable but they all get written?
Because you close the file you need to use the append >> operator so you don't clobber the output files:
$ awk -F, 'NR>1{f=$1".sync";print >> f;close(f)}' file.csv
Check out the manual for the official word on redirection with awk.
Don't sort with awk. AWK is great to format data before sorting. Pipe the output into sort(1) and let it sort the data. That's what sort does, and it does a great job.
Also - which type of sort do you need? Dictionary? Numeric? Do you need to ignore spaces?
example:
sort -t, -n <file
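For example, a sketch along those lines for this data (assuming the id is in column one and the header is on line 1, as in the question):
awk -F, 'NR>1' file.csv | sort -t, -k1,1 > sorted.csv
awk strips the header, then sort orders the remaining rows by the first comma-separated field.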