Is it possible to add some text at the beginning of a file in the CLI without making a new file? - awk

I have two files. file1 contains -
hello world
hello bangladesh
and file2 contains -
Dhaka in Bangladesh
Dhaka is capital of Bangladesh
I want the new file2 to be -
hello world
hello bangladesh
Dhaka in Bangladesh
Dhaka is capital of Bangladesh
This can be done with -
cat file1 file2 >> file3
mv file3 file2
But I don't want to create a new file. I guess it may be possible using sed.

Sure it's possible.
printf '%s\n' '0r file1' x | ex file2
This is a POSIX-compliant command using ex, the POSIX-specified non-visual predecessor to vi.
printf is only used here to feed commands to the editor. What printf outputs is:
0r file1
x
x is save and exit.
r is "read in the contents of the named file."
0 specifies the line number after which the read-in text should be placed.
N.B. This answer is fully copied from Is it possible to add some text at beginning of a file in CLI without making new file?
Another solution
There aren't a lot of ways to modify files in place using standard tools. Even if they appear to do so, they may be using temporary files behind the scenes (e.g. GNU sed -i).
ex on the other hand, will do the trick:
ex +'0r file1' +wq file2
ex is a line editor, and vim evolved from it, so these commands may look familiar. 0r filename does the same thing as :0r filename in vim: insert the specified file after the given address (line number). The address here is 0, a kind of virtual line representing the line before line 1, so the file will be inserted before any existing text.
Then we have wq which saves and quits.
That's it.
N.B. This answer is fully copied from https://unix.stackexchange.com/a/414408/176227
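Since the question guesses at sed: GNU sed's e command can be used to prepend one file to another (a GNU-specific sketch; note that sed -i itself quietly writes a temporary file, as noted above):
sed -i '1e cat file1' file2
Here 1e cat file1 runs cat file1 when sed reaches line 1 and emits its output ahead of that line.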

You want to insert two lines at the top of an existing file.
There's no general way to do that. You can append data to the end of an existing file, but there's no way to insert data other than at the end, at least not without rewriting the portion of the file after the insertion point.
Once a file has been created, you can overwrite data within it, but the position of any unmodified data remains the same.
When you use a text editor to edit a file, it performs the updates in memory; when you save the file, it may create a new file with the same name as the old one, or it may create a temporary file and then rename it. Either way, it has to write the entire content, and the original file will be clobbered (though some editors may create a backup copy).
Your method:
cat file1 file2 >> file3
mv file3 file2
is pretty much the way to do it (except that the >> should be >; you don't want to retain the previous contents of file3 if it already exists).
You can use a temporary file name:
cat file1 file2 > tmp$$ && mv tmp$$ file2
tmp$$ uses the current shell's process ID to generate a file name that's almost certain to be unique (so you don't clobber anything). The && means that the mv command won't be executed if the cat command failed, so that you'll still have the original file2 if there was an error.
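If you'd rather not build the name from $$ yourself, mktemp (where available) generates it for you. A sketch, creating the temporary file in the current directory so the final mv stays an atomic rename:
tmp=$(mktemp ./file2.XXXXXX) && cat file1 file2 > "$tmp" && mv "$tmp" file2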

Related

Add a field to the current record before processing in awk

I want to use a large awk script that was designed to take a particular input, for example "city zipcode street housenumber", so $2 is the zipcode, etc.
Now the input is provided to me in a new format. In my example, "city" is now missing; the new format is "zipcode street housenumber" (not for real, just trying to keep the example simple), but I happen to know that the city is a constant for that input (which is why it's not in the dataset). So if I run it through the original script, $2 is now the street, and everything is one field off.
I could first process the input file to prepend the city name to each line (using awk, sed, or whatever) and then run it through the original script, but I would prefer to run only one script that supports both formats. I could add a command-line option that tells it the city, but I don't know how to insert it in front of the current record at the top of the script so that the rest of the script can remain unchanged. It looks like I can change a field, but what I want to do is "shift" the fields right so I can modify $1.
Did I mention I am a complete awk novice? (Perl is my poison.)
I think I fixed my own problem; I'm doing the following (I haven't figured out how to do this conditionally based on a command-line option, but it should be easy to find tutorials for that):
NF += 1;                                  # make room for one more field
for (i = NF; i > 1; --i) $(i) = $(i-1);   # shift every field one position right
$1 = "Vancouver";                         # insert the constant city
I had the loop wrong in my comment above, but the basic idea of manipulating NF and copying fields into each other seems to work.
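For the command-line option part, a sketch of how the city could be passed in with awk's -v (the variable name city is introduced here for illustration; leave it empty to process input that already has the city column):
awk -v city="Vancouver" '
city != "" {                                  # only shift if a city was supplied
    for (i = NF + 1; i > 1; --i) $i = $(i-1)  # shift all fields one to the right
    $1 = city                                 # insert the constant city
}
{ print }                                     # ... rest of the original script goes here
' file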
Something along these lines should do it. First, some test data:
$ cat file
1 2 3 4
2 3 4
The awk:
$ awk -v v=V '{      # define external var
  if (NF==3)         # if record has only three fields
    $0 = v FS $0     # prepend the var to the record
  print $1           # print first field
}' file
Output:
1
V

How to flush output when using inplace editing with awk?

I want to use awk to edit a column of a large file in place. If, for any reason, the process breaks or stops, I don't want to lose the work already done. I've tried to add fflush, but it seems it does not work with inplace.
In order to simulate the desired result, here is a test file with 3 columns. The last column is all zeros.
paste -d '\t' <(seq 1 10) <(seq 11 20) |
awk 'BEGIN {FS="\t"; OFS=FS} {$(NF+1)=0; print}' > testfile
Then I want to replace the values in the last column. In this simple example, I'm just replacing them with the sum of the first and second columns. I'm adding a system sleep so it is possible to abort the script in the middle and see the result.
awk -i inplace 'BEGIN {FS="\t"; OFS=FS} $3==0{$3=$1+$2; print; fflush(); system("sleep 1")}' testfile
If you run the script and abort it (e.g. with ctrl+c) before it ends, the test file is unchanged.
Is it possible to achieve the desired result (get the partial result when the script breaks or stops)? How should I do it?
"In-place" editing is not really. A temporary file holds the output, and replaces the input at the end of the script.
Actual in-place editing would be slow: unless the output is the same length as the input, the file size needs to change, and awk would have to re-write the entire file (everything after the current line, at least) on every buffer flush. Note this caveat from the documentation:
If the program dies prematurely … a temporary file may be left behind.
You could script up some recovery code to merge that temporary file with your input after an abort.
Or, you could adjust your script to only modify one line per run (and simply print every subsequent line, unmodified), and re-run it until there are no changes left to make. This would force awk to re-write the file on every change. It will be slow, but there just isn't any fast way to remove data from the middle of a file.
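A sketch of that "one change per run" idea, assuming GNU awk for -i inplace; the shell loop re-runs the script until a pass changes nothing:
while :; do
    before=$(cksum < testfile)
    # fix only the first remaining $3==0 line; pass everything else through
    awk -i inplace 'BEGIN {FS=OFS="\t"} !done && $3==0 {$3=$1+$2; done=1} {print}' testfile
    after=$(cksum < testfile)
    [ "$before" = "$after" ] && break
done
If a pass is interrupted, only that pass's single change is lost; all earlier passes have already been written to the file.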

Renaming files based on internal text match - keep all content of file

I'm still having trouble figuring out how to preserve the contents of a given file with the following code, which attempts to rename the file based on a specific regex match within it (within a given file there will always be one SMILE followed by 12 digits, e.g. SMILE000123456789).
for f in FILENAMEX_*; do awk '/SMILE[0-9]/ {OUT=$f ".txt"}; OUT {print >OUT}' ${f%.*}; done
This code names the file correctly but simply prints out everything after the match, instead of the entire contents of the file.
The files to be processed don't currently have an extension (and they need one for the next step) because I was using csplit to parse the content out of a larger file.
There are two problems: the first is using a shell variable in your awk program, and the second is the logic of the awk program itself.
To use a shell variable in awk, you can use
awk -v var="$var" '<program>'
and then use just var inside of awk.
For the second problem: if a line doesn't match your pattern and OUT is not set, you don't print the line. After the first line matching the pattern, OUT is set and you print. Since the match might be anywhere in the file, you have to store the lines at least up to the first match.
Here is a version that should work and is pretty close to your approach:
for f in FILENAMEX_*; do
  awk -v f="${f%.*}" '
    /SMILE[0-9]/ && !out {        # first match: build the output name
      out = f ".txt"
      for (i = 1; i < NR; ++i)    # print file so far
        print lines[i] > out
    }
    out   { print > out }         # match has been seen: print
    ! out { lines[NR] = $0 }      # no match yet: store
  ' "$f"
done
You could do some trickery and work with FILENAME or similar to do everything in a single invocation of awk, but since the main purpose is to detect the presence of a pattern in the file, you're much better off using grep -q, which returns an exit status of 0 if the pattern has been found:
for f in FILENAMEX_*; do grep -q 'SMILE[0-9]' "$f" && cp "$f" "${f%.*}".txt; done
Perhaps take a different approach and just do each step separately. In pseudocode:
for all files with some given text
    extract text
    rename file
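One way that pseudocode might look in shell (a sketch; it assumes grep -o is available and reads "rename file" as renaming to the matched text itself):
for f in FILENAMEX_*; do
    id=$(grep -o 'SMILE[0-9]\{12\}' "$f" | head -n 1)  # extract text
    [ -n "$id" ] && mv "$f" "$id.txt"                  # rename file
done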

Using awk on a folder and adding file name to output rows

I should start by thanking you all for all the work you put into the answers on this site. I have spent many hours reading through them but have not found anything fitting my question yet. Hence my own post.
I have a folder with multiple subfolders and txt-files within those. In column 7 of those files there are gene names (I do genetics for a living :)). These are the strings I am trying to extract. In short, I would like to search the whole folder for any rows within any of the files that contain a particular gene name/string. I have been using grep for this, writing something like:
grep -r GENE . > GENE.txt
Simple, but I need to be able to tweak the search further, and it seems that awk is the way to go.
So I tried using awk. I wrote something like this:
awk '$7 == "GENENAME"' FOLDER/* > GENENAME.txt
This works well (and now I can specify that the string has to be in a particular column, which I can't do with grep, right?).
However, in contrast to grep, which writes the file name at the start of each row, I now can't directly see which file each row in my output file comes from (which mostly defeats the point of the search). Adding the name of the origin file to each row seems like something that should absolutely be doable, but I am not able to figure it out.
The files I am searching within change (or rather get more numerous), but otherwise my search will always be for some specific string in column 7 of the same big folder. How can I get this working?
Thank you in advance,
Elisabet E
You can use FNR (the record number within the current file) to print the row number and FILENAME to print the file's name; that way you can see which file and which row each matching line comes from. For instance:
sample.csv:
aaa 123
bbb 456
aaa 789
command:
awk '$1 == "aaa" {print $0, FNR, FILENAME}' sample.csv
The output is:
aaa 123 1 sample.csv
aaa 789 3 sample.csv
Sounds like you're looking for:
awk '$7 == "GENENAME"{print FILENAME, $0}' FOLDER/*
If not then edit your question to clarify with sample input and expected output.
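Since the question mentions subfolders, note that FOLDER/* does not recurse; a find-based variant (a sketch, assuming the data files end in .txt as described) covers the whole tree:
find FOLDER -type f -name '*.txt' -exec awk '$7 == "GENENAME" {print FILENAME, $0}' {} +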

splitting a multiple FASTA file into separate files keeping their original names

I am trying to work with an AWK script that was posted earlier on this forum. I am trying to split a large FASTA file containing multiple DNA sequences, into separate FASTA files. I need to separate each sequence into its own FASTA file, and the name of each of the new FASTA files needs to be the name of the DNA sequence from the original, large multifasta file (all the characters after the >).
I tried this script that I found here at stackoverflow:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input
It works well, but the DNA sequence begins directly after the name of the file, with no space. The DNA sequence needs to begin on a new line (regular FASTA format).
I would appreciate any help to solve this.
Thank you!!
Do you mean something like this?
awk '/^>chr/ {OUT=substr($0,2) ".fa";print " ">OUT}; OUT{print >OUT}' your_input
where the new file that is created for each "chromosome/sequence/thing" gets a blank line at the start?
I think this should work.
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
I hope this Perl script helps.
#!/usr/bin/perl
open (INFILE, "< your_input.fa")
    or die "Can't open file";
while (<INFILE>) {
    $line = $_;
    chomp $line;
    if ($line =~ /\>/) {               # header line: has fasta >
        close OUTFILE;
        $new_file = substr($line, 1);  # strip the leading >
        $new_file .= ".fa";
        open (OUTFILE, ">$new_file")
            or die "Can't open: $new_file $!";
    }
    print OUTFILE "$line\n";
}
close OUTFILE;
The .fa (or .fasta) format looks like:
>ID1
SEQUENCE
>ID2
SEQUENCE
When splitting a FASTA file, it is actually not desirable to insert a newline character at the top of each new file, so Pramod's answer is more appropriate. Additionally, the ID can be defined more generally to match only the > character. The complete line would then be:
awk '/^>/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
If you don't want to mess up your current directory with all the split files you can also output into a subdirectory (subdir):
awk '/^>/ {OUT="subdir/" substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
awk to split multi-sequence fasta file into separate sequence files
This problem is best approached by considering each sequence (complete with header) a single record, and changing awk's default record separator RS (usually a line break) to be the unique (one per record) > symbol used to mark the start of a header. Since we want to use the header text as a file name, and since FASTA headers cannot contain line breaks, it is also convenient to reset awk's default field separator FS (usually whitespace) to be line breaks.
Both of these are done in an awk BEGIN block:
BEGIN{RS=">";FS="\n"}
Since the file begins with >, the first record will be empty, and it must be ignored to prevent an error caused by trying to write to a file name extracted from an empty record. Thus, the main awk action block is filtered to process only records from record number (NR) 2 onward. This is achieved by placing a condition before the action block as follows:
NR>1{ ... }
With the record separator set to >, each record is a whole sequence including its header, and each is split into fields at line breaks (because we set the field separator to "\n"). Thus, field 1 ($1) of each record contains the text we wish to use as the filename. Note that the record separator (>) is no longer part of any field, so the entire first field can be used to build the filename. In this example, ".fasta" has been appended as a file extension:
fnme=$1 ".fasta";
Next, the fasta header marker ">" is printed, followed by the entire record ($0) to the filename fnme just formed, using awk's > redirect:
print ">" $0 > fnme;
Lastly, the file is closed to prevent awk from exceeding the system limit on the number of open files, if many files are to be written (see footnote):
close(fnme);
whole procedure
awk command
awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fasta"; print ">" $0 > fnme; close(fnme);}' example.fasta
Tested on the following mock file named example.fasta:
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
results (terminal commands and output)
$ ls
'DNA sequence 1.fasta' 'DNA sequence 3.fasta'
'DNA sequence 2.fasta' example.fasta
$ cat DNA\ sequence\ 1.fasta
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
$ cat DNA\ sequence\ 2.fasta
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
$ cat DNA\ sequence\ 3.fasta
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
footnote
"To write numerous files, successively, in the same awk program. If the files aren’t closed, eventually awk may exceed a system limit on the number of open files in one process. It is best to close each one when the program has finished writing it."
quoted from https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html