AWK insert line at top of a file [duplicate] - awk

I would like to get some explanation here!
I'm new to awk and I'd like to know how to insert a line at the top of a file.
This is what I've tried so far:
file.txt
content line
another line
awk command
awk 'BEGIN {print "first line" } {print}' file.txt
the output
first line
content line
another line
however, when run with -i inplace it doesn't write to the file and only gives me this output:
first line
I would like to know what I am doing wrong, and I'd really appreciate an explanation.

The BEGIN{} block is processed before any files are processed, which means any output generated by the BEGIN{} block has nowhere to go but to stdout.
To get the line inserted into the file you need to move the print "first line" into the main body of the script, where it can be processed along with the file.
One idea based on inserting the new row while reading the first line of the input file:
$ awk -i inplace 'FNR==1 {print "first line"}1' file.txt
$ cat file.txt
first line
content line
another line
NOTES:
FNR==1 will apply to each file if you happen to feed multiple files to awk; NR==1 will also suffice if it's just the one input file (see the sketch after these notes)
the stand-alone 1 is non-zero and thus treated by awk as 'true'; here it applies to all input lines, and the default action for a 'true' pattern is to print the input line (effectively the same as {print})
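To see the multi-file difference, here is a quick sketch with two throwaway files (the file names are just for illustration):
$ printf 'aaa\n' > a.txt; printf 'bbb\n' > b.txt
$ awk -i inplace 'FNR==1 {print "first line"}1' a.txt b.txt
$ head a.txt b.txt
==> a.txt <==
first line
aaa

==> b.txt <==
first line
bbb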

The -i inplace option loads the built-in inplace.awk include file to emulate sed -i in-place editing, but there are some caveats that come from the method used for this emulation.
Because it works by fiddling with the processing of each input file (by using the BEGINFILE pattern), anything printed in the BEGIN selector still goes to the start-up output stream rather than to the "fiddled" output stream. That's because the first input file has yet to begin processing at that point.
So what you'll see is first line going to standard output then the two lines from the input file being printed in-place back to that file. You just may not have realised that last bit since you don't change the lines in the file when you write them back.
This was no doubt a difficult implementation decision for the creators of inplace.awk since in-place editing over multiple files needs to be catered for. The question is: where should output go in BEGIN for in-place editing?
You have a couple of options here.
First, if you know you'll only ever process one file, you can use the normal trick, with BEGIN but without inplace:
awk 'BEGIN {print "first line" } {print}' file.txt > /tmp/$$ && mv /tmp/$$ file.txt
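As an aside, if predictable names under /tmp are a concern, a slightly safer variant of the same trick uses mktemp (a sketch; mktemp is available on most modern systems):
tmp=$(mktemp) && awk 'BEGIN {print "first line"} {print}' file.txt > "$tmp" && mv "$tmp" file.txt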
Second, using inplace but not BEGIN, you need to first decide which of the input files it should affect. If you want it to affect all input files, that means something like:
awk -i inplace 'FNR==1 {print "first line";print;next} {print}' file1.txt file2.txt
If you want it to affect just the first input file, use NR rather than FNR (the former never decreases, the latter resets to one for each new input file).
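For example, the first-file-only variant of the command above:
awk -i inplace 'NR==1 {print "first line";print;next} {print}' file1.txt file2.txt
Here NR==1 is true only for the very first line awk reads, so file2.txt is rewritten unchanged.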
Finally, for the case where all files should be affected, you can use the same method that inplace itself uses.
As special patterns like BEGIN are executed in order of definition (and -i comes before processing of your script), simply use BEGINFILE rather than BEGIN, as per the following transcript:
=====
pax@styx:~$ cat xx.001
Original line from xx.001
=====
pax@styx:~$ cat xx.002
Original line from xx.002
=====
pax@styx:~$ cat xx.awk
BEGINFILE {
print "inserted line"
}
{
print
}
=====
pax@styx:~$ awk -i inplace -f xx.awk xx.001 xx.002
=====
pax@styx:~$ cat xx.001
inserted line
Original line from xx.001
=====
pax@styx:~$ cat xx.002
inserted line
Original line from xx.002
The BEGINFILE from inplace will first weave its magic to capture output (per input file), then your BEGINFILE will print to that capture area.

Non-awk version:
I was excited to see the -i inplace option, but it's not available on macOS or the BSDs. So in addition to @paxdiablo's solution of a tmp file, here's how you can prepend lines with /bin/ed.
ed -s file.txt << EOF
1i
first line
.
w
EOF
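Assuming file.txt still holds the original lines from the question, the result matches the awk version:
$ cat file.txt
first line
content line
another line
ed is specified by POSIX and writes the edited buffer back to the file itself, so no temporary file is needed.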

What I would do:
$ awk -i inplace 'NR==1{print "first line";print;next} {print}' file ; cat file
first line
original content line 1
original content line 2

Related

BEGIN and END blocks in awk

I am using the awk command in terminal on my Mac.
I want to print the contents of an already existing file, give a title to each column (separated by tabs), and then send the output to another file. What line of code would I use to give titles to the columns? I'm hoping to use simple awk commands and preferably complete the task in as few lines as possible.
So far I have tried to use the BEGIN command. (The titles I want to give are first name, second name and score.)
BEGIN { print "First Name\tSecond Name\tScore" }
then I want to print the entire contents of the file.
{print} filename.txt
Finally I want to save the output to another file
End{print} filename.txt > output.txt
To do this all together:
awk 'BEGIN {print "First Name\tSecond Name\tScore";}
{print}
End{print}' filename.txt > output.txt
However, this only saved the titles to the output file and not the contents of the original file under the columns.
awk processes files line by line. Before it starts processing a file you can have it do something; we use the BEGIN keyword to mark a block of code that is to be executed before processing. The same goes for END, whose block runs after the processing of every line of the file is complete.
While your code has some superfluous bits in it, like the unnecessary END block, it should still do exactly what you want, assuming you have data in your filename.txt.
A more succinct awk code would be:
awk 'BEGIN {print "First Name\tSecond Name\tScore";}1' filename.txt > output.txt
In action (using commas instead of tabs because it's easier and I'm lazy):
$ echo "1,2,3" > filename.txt
$ awk 'BEGIN {print "c1,c2,c3"}1' filename.txt > output.txt
$ cat output.txt
c1,c2,c3
1,2,3

why awk print file content while there is no print command

I have an awk file in which I read each word from a file into an array. There is no print command in it, but after I run it, the whole content of the file is printed.
#!/bin/awk -f
{
for(i=1;i<=NF;i++)
used[$i]=1
}
After I run this awk file like this:
awk 1.awk 2
the whole content of file 2 is printed on the screen. I am confused.
I tried this directly from the command line and nothing was printed out, so I think there is something wrong with the file or the way I run it.
You missed the -f option: awk -f 1.awk 2
Instead of providing the contents of 1.awk as the awk program, you provided the literal string 1.awk as the awk program.
You have essentially done this: awk '"1.awk"' 2
And since that is a "true" value, the default action is to print each record of the data contained in file "2".
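A quick sketch of the difference, using a throwaway data file (the names here are just for illustration):
$ printf 'one\ntwo\n' > 2
$ awk 1.awk 2      # program is the truthy string "1.awk", so every record is printed
one
two
$ awk -f 1.awk 2   # program is read from the file 1.awk; it never prints, so no output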

Append prefix to first column of a file with awk

I have a couple of hundred files which I want to process with xargs. They all need a fix to their first column.
Therefore I need an awk command to prepend the prefix "ID_" to the first column of a file (except for the first header line). Can anyone help me with this?
Something along the lines of:
gawk -f ';' "{$1='ID_' $1; print $0}" file.csv > file_processed.csv
I am no expert with the command, though. And I would rather have some in-place processing instead of making a copy of each file. Beforehand, I did it in VIM, but then I only had one file.
:%s/^-/ID_/
I hope someone can help me here.
gawk 'BEGIN{FS=";"; OFS=";"} {if(NR>1) $1="ID_"$1; print}' file.csv > file_processed.csv
FS and OFS set the input and output field separators, respectively.
NR>1 checks whether the current line number is larger than 1, so we don't modify the header line.
You can also modify the file in place with the -i inplace option:
gawk -i inplace 'BEGIN{FS=";"; OFS=";"} {if(NR>1) $1="ID_"$1; print}' file.csv
Edit
After the original question was clarified, here's the final version:
gawk -i inplace 'BEGIN{FS=OFS=";"} NR>1{sub(/^-/,"ID_",$2)} 1' file.csv
which substitutes a - at the beginning of the second column with ID_.
The NR>1 action applies to all but the first (header) line, and 1 invokes the default print action.
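A before/after sketch on made-up data (the column layout is just an assumption for illustration):
$ cat file.csv
name;id
foo;-001
bar;-002
$ gawk -i inplace 'BEGIN{FS=OFS=";"} NR>1{sub(/^-/,"ID_",$2)} 1' file.csv
$ cat file.csv
name;id
foo;ID_001
bar;ID_002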
If you just want to do something with the first field, particularly adding a prefix, it is no different from adding the prefix to the whole line.
So awk '$0 = "ID_" $0' file.csv should do the job. If you want to make it "change in place", you can:
awk '$0="ID_"$0' file.csv >/tmp/foo && mv /tmp/foo file.csv
You can also make use of sed:
sed -i 's/^/ID_/' file
The -i flag does the in-place modification.
You mentioned vim and gave the command s/^-/ID_/; that doesn't add the prefix ID_, it replaces a leading - with ID_. They are different things.
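For completeness, the vim substitution that actually prepends the prefix anchors on the start of the line itself:
:%s/^/ID_/
or, to leave a header line alone:
:2,$s/^/ID_/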

awk to store and reset variable from file

Trying to use awk to look up the string from file1 (which is always just one field) in the same line of file2. That is, if row 1 is being used in file1 then only row 1 is used in file2. Since it is possible for the value to be missing, this is a check done to ensure it is there. This is just an idea, so there probably is a better way, but I just wanted to see. Thank you :).
file1
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
desired output for $filename
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
After a bunch of processes are run using $filename, I need to reset that variable with a new one.
file1 (after rerunning some process)
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2 (after rerunning some process)
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
The oldest folder is R_2017_01_13_14_46_04_user_S5-00580-25-Medexome, created on 2017-01-17+06:53:07.3194950000 and analysis done using v1.4 by cmccabe at 01/18/17 06:59:08 AM
desired output for $filename now is (since this value is new)
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
awk
filename=$(awk 'NR==1{print $1}' file1 file2)
You want to check if the last line of file2 contains a string given in file1.
For this, you just have to read that last line and then see if it matches with any of the words in file1.
$ awk 'ENDFILE {line=$0} FNR<NR && line ~ $1' file2 file1
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
This uses:
ENDFILE {line=$0}
after reading a file, $0 contains the last line that was read (well, not always, but I assume you have a modern version of awk, since ENDFILE is a GNU awk extension). With this, we store this last line into line, so that we can use it when reading the next file.
FNR<NR && line ~ $1
while reading file1, check if the given word is present in the stored line. If so, print is automatically triggered.
This uses the FNR<NR trick, where FNR holds the line number within the current file, while NR holds the overall line number. This way, FNR==NR is only true while reading the first file, and FNR<NR from the second file on.
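A quick illustration of the two counters, assuming two hypothetical two-line files:
$ awk '{print FILENAME, NR, FNR}' file1 file2
file1 1 1
file1 2 2
file2 3 1
file2 4 2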
If you only need to check the last line of file2 continuously, you could:
$ awk 'NR==FNR{a[$1];next}{for(i in a)if($0 ~ i) print i}' file1 <(tail -f file2)
Explained:
NR==FNR{a[$1];next} reads the search terms from file1 into the array a
file2 is tail -f'd into awk using process substitution, i.e. awk reads each new record appended to the end of file2, goes through all the search words in a and looks for them in the record, printing the search word if there is a match

splitting a multiple FASTA file into separate files keeping their original names

I am trying to work with an AWK script that was posted earlier on this forum. I am trying to split a large FASTA file containing multiple DNA sequences, into separate FASTA files. I need to separate each sequence into its own FASTA file, and the name of each of the new FASTA files needs to be the name of the DNA sequence from the original, large multifasta file (all the characters after the >).
I tried this script that I found here at stackoverflow:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input
It works well, but the DNA sequence begins directly after the name of the file, with no space. The DNA sequence needs to begin on a new line (regular FASTA format).
I would appreciate any help to solve this.
Thank you!!
Do you mean something like this?
awk '/^>chr/ {OUT=substr($0,2) ".fa";print " ">OUT}; OUT{print >OUT}' your_input
where the new file that is created for each "chromosome/sequence/thing" gets a blank line at the start?
I think this should work:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
(Note that because this appends with >>, output files left over from a previous run should be removed before re-running it.)
Hope this Perl script can help.
#!/usr/bin/perl
open (INFILE, "< your_input.fa")
    or die "Can't open file";
while (<INFILE>) {
    $line = $_;
    chomp $line;
    if ($line =~ /\>/) {    # if the line has a fasta >
        close OUTFILE;
        $new_file = substr($line, 1);
        $new_file .= ".fa";
        open (OUTFILE, ">$new_file")
            or die "Can't open: $new_file $!";
    }
    print OUTFILE "$line\n";
}
close OUTFILE;
The .fa (or .fasta) format looks like:
>ID1
SEQUENCE
>ID2
SEQUENCE
When splitting a fasta file it is actually not desirable to insert a new line character at its top, so Pramod's answer is more appropriate. Additionally, the ID can be matched more generally, on the > character alone. The complete line would then be:
awk '/^>/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
If you don't want to mess up your current directory with all the split files, you can also output into a subdirectory (subdir), which must already exist, since awk will not create it:
awk '/^>/ {OUT="subdir/" substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
awk to split multi-sequence fasta file into separate sequence files
This problem is best approached by considering each sequence (complete with header) a single record and changing awk's default record separator RS (usually a line break) to the unique (one per record) > symbol used to mark the start of a header. As we will want to use the header text as a file name, and as fasta headers cannot contain line breaks, it is also convenient to reset awk's default field separator FS (usually whitespace) to a line break.
Both of these are done in an awk BEGIN block:
BEGIN{RS=">";FS="\n"}
Since the file begins with >, the first record will be empty, and must therefore be ignored to prevent an error caused by trying to write to a file name extracted from an empty record. Thus, the main awk action block is filtered to only process records from record number (NR) 2 onwards. This is achieved by placing a condition before the action block as follows:
NR>1{ ... }
With the record separator set to >, each record is a whole sequence including its header, and each is split into fields at line breaks (because we set the field separator to "\n"). Thus, field 1 ($1) of each record contains the text we wish to use as a filename. Note the record separator (>) is no longer part of any field, so the entire first field can be used to build the filename. In this example, ".fasta" has been appended as a file extension:
fnme=$1 ".fasta";
Next, the fasta header marker ">" is printed, followed by the entire record ($0), to the filename fnme just formed, using awk's > redirection:
print ">" $0 > fnme;
Lastly, the file is closed, to prevent awk exceeding the system limit on the number of open files if many files are to be written (see footnote):
close(fnme);
whole procedure
awk command
awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fasta"; print ">" $0 > fnme; close(fnme);}' example.fasta
Tested on the following mock file named example.fasta:
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
results (terminal commands and output)
$ ls
'DNA sequence 1.fasta' 'DNA sequence 3.fasta'
'DNA sequence 2.fasta' example.fasta
$ cat DNA\ sequence\ 1.fasta
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
$ cat DNA\ sequence\ 2.fasta
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
$ cat DNA\ sequence\ 3.fasta
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
footnote
"To write numerous files, successively, in the same awk program. If the files aren’t closed, eventually awk may exceed a system limit on the number of open files in one process. It is best to close each one when the program has finished writing it."
quoted from https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html