I'm still having trouble preserving the contents of a given file with the following code, which attempts to rename the file based on a specific regex match within it (each file always contains exactly one occurrence of SMILE followed by 12 digits, e.g. SMILE000123456789).
for f in FILENAMEX_*; do awk '/SMILE[0-9]/ {OUT=$f ".txt"}; OUT {print >OUT}' ${f%.*}; done
This code is naming the file correctly but is simply printing out everything after the match instead of the entire contents of the file.
The files to be processed don't currently have an extension (and they need one for the next step) because I was using csplit to parse the content out of a larger file.
There are two problems: the first is using a shell variable in your awk program, and the second is the logic of the awk program itself.
To use a shell variable in awk, you can use
awk -v var="$var" '<program>'
and then use just var inside of awk.
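For instance, a minimal illustration (pattern and somefile are placeholders, not from the question):
pattern='SMILE[0-9]'
awk -v pat="$pattern" '$0 ~ pat { print "match on line " NR }' somefile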
For the second problem: if a line doesn't match your pattern and OUT is not set, you don't print the line. After the first line matching the pattern, OUT is set and you print. Since the match might be anywhere in the file, you have to store the lines at least up to the first match.
Here is a version that should work and is pretty close to your approach:
for f in FILENAMEX_*; do
    awk -v f="${f%.*}" '
        /SMILE[0-9]/ {
            out = f ".txt"
            for (i=1; i<NR; ++i)    # Print the lines stored so far
                print lines[i] > out
        }
        out   { print > out }       # Match has been seen: print
        ! out { lines[NR] = $0 }    # No match yet: store
    ' "$f"
done
You could do some trickery and work with FILENAME or similar to do everything in a single invocation of awk, but since the main purpose is to find the presence of a pattern in the file, you're much better off using grep -q, which returns an exit status of 0 if the pattern has been found:
for f in FILENAMEX_*; do grep -q 'SMILE[0-9]' "$f" && cp "$f" "${f%.*}".txt; done
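For the record, the single-invocation trickery mentioned above might look like the following sketch (GNU awk for nextfile; it assumes file names without whitespace, and since the input files have no extension, appending ".txt" produces the same names as "${f%.*}".txt):
awk '/SMILE[0-9]/ { system("cp " FILENAME " " FILENAME ".txt"); nextfile }' FILENAMEX_*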
Perhaps take a different approach and just do each step separately; i.e., in pseudocode (a shell sketch follows below):
for all files with some given text
extract text
rename file
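In shell, that pseudocode might look like this sketch (it assumes the goal is to name each file after its SMILE identifier, and that your grep supports -o):
for f in FILENAMEX_*; do
    id=$(grep -o 'SMILE[0-9]\{12\}' "$f" | head -n 1)   # extract the matched text
    [ -n "$id" ] && cp "$f" "$id.txt"                   # "rename" by copying with an extension
done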
I am trying to match a string from a file and only print the first line that matches the string. I am able to get the result using grep, but is there a way I can achieve the same output using awk?
# cat file
/dev/sdac
/dev/cciss/c0d0
/dev/cciss/c0d0p1
/dev/cciss/c0d0p2
/dev/cciss/c0d0p1
# grep -wm1 c0d0p1 file
/dev/cciss/c0d0p1
Could you please try the following.
awk '/c0d0p1/{print;exit}' Input_file
Explanation: the script searches each line for the string; when a match is found it prints the line and exits immediately, since there is no need to read the rest of the file. Exiting early also makes it faster.
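Note that grep -w matches the string only as a whole word; to mirror that behavior you could use word-boundary operators, which are a GNU awk extension (a sketch, GNU awk only):
awk '/\<c0d0p1\>/{print;exit}' Input_file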
I'm writing a bash script that translates one file to another, and am encountering an issue.
Whenever the program sees something like this (the ...... parts are not included):
......Mul(-a1+b2-c3...+f+e)......
change it to:
......M(-a1)*M(b2)*M(-c3)*...*M(f)*M(e)......
The number of variables in Mul is unknown, and there can be multiple occurrences of Mul in the file. + and - also appear elsewhere in the file, and each variable can be one or more characters.
I tried grouping in sed, with a group followed by a "*", but it doesn't seem to work because an unknown number of variables has to be replaced.
Here is a sed script that will do it:
:a
s/\(Mul(.[^)]*\)\([+-].\)/\1)*Mul(\2/
ta
s/Mul(+\{0,1\}/M(/g
The trick is to use the test to jump back to the beginning after making a substitution; each pass splits one more variable off (e.g. "Mul(a+b+c)" => "Mul(a+b)*Mul(+c)"), and the final substitution rewrites Mul( as M( and drops any leading +.
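Assuming the commands above are saved in a file named mul.sed (a hypothetical name), the script can be run with:
sed -f mul.sed file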
$ cat tst.awk
match($0,/Mul\([^()]+\)/) {
    tgt = substr($0,RSTART+4,RLENGTH-5)       # the text between the parentheses
    gsub(/[-+][[:alnum:]]+/,"*M(&)",tgt)      # wrap each signed variable in *M(...)
    gsub(/\+/,"",tgt)                         # drop the + signs
    sub(/^\*/,"",tgt)                         # drop the leading *
    print substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
$ awk -f tst.awk file
......M(-a1)*M(b2)*M(-c3)*M(f)*M(e)......
The above was run on this input file:
$ cat file
......Mul(-a1+b2-c3+f+e)......
I am comparing two files using awk. Following is a representation of the first file (file1.txt):
1
15
MRUKLM
GHLKGM
BNUIOK
Following is a representation of the second file (file2.txt):
AGHLKMT
MFBGSJY
GSBDGLM
I want to compare two files based on certain patterns. Moreover, the first line of the output file should contain the total number of lines in the second file, followed by the second and third lines of the first file. Hence, the header of the output file should be as follows:
3(total lines of second file)
15(second line of first file)
MRUKLM(third line of first file)
certain pattern.....
certain pattern....
certain pattern....
I wrote the following code:
vari=$(wc -l file2.txt)|awk -v lin="" 'NR==FNR{if(NR>1 && NR<4)lin=$lin$0;else a[NR]=$0;next}BEGIN{print vari,lin}match($0,/([0-9]*)_(.*)/,c){print a[2*c[1]+2];print a[2*c[1]+3]}' file1.txt file2.txt> output_file.txt
The part of the code that extracts the pattern is working perfectly; however, I could not get any header in the output file. The output I get is as follows:
certain pattern....
certain pattern....
It turns out that I made some mistakes in assigning variables. Following is the updated code (note that the three-argument form of match() used here is a GNU awk extension):
awk -v vari="$(cat file2.txt|wc -l)" 'NR==FNR{if(NR>1 && NR<4)print $0;else a[NR]=$0;next}BEGIN{print vari}match($0,/([0-9]*)_(.*)/,c){print a[2*c[1]+2];print a[2*c[1]+3]}' file1.txt file2.txt > output.txt
It gives the desired output.
I wrote a script to groom my .bash_history file, filtering "uninteresting" commands like ls from the persisted history.
(I know there's the HISTIGNORE variable, but that would also exclude such commands from the current session's in-memory history. I find it useful to have them around within the scope of a single session, but not persisted across sessions.)
The history file can contain multi-line history entries with embedded newlines, so the entries are separated by timestamps. The script takes an input file like:
#1501304269
git stash
#1501304270
ls
#1501304318
ls | while IFS= read line; do
echo 'line is: ' $line
done
and filters out single-line ls, man, and cat commands, producing:
#1501304269
git stash
#1501304318
ls | while IFS= read line; do
echo 'line is: ' $line
done
Note that multi-line entries are unfiltered -- I figure if they're interesting enough to warrant multiple lines, they're worth remembering.
I implemented it in Awk, but I've been reading about Sed's multiline capabilities (N, h, H, x, etc.) and I'd like to try it for this purpose. If nothing else, I'd be curious to compare the two for speed.
Here's the Awk script:
/^#[[:digit:]]{10}$/ {
    timestamp = $0
    histentry = ""
    next
}
$1 ~ /^(ls?|man|cat)$/ {
    if (! timestamp) {
        print
    } else {
        histentry = $0
    }
    next
}
timestamp {
    print timestamp
    timestamp = ""
}
histentry {
    print histentry
    histentry = ""
}
{ print }
Can this be done using Sed?
Sure it can be done with sed. Here is an example using GNU sed's -z option, which lets us deal with the whole file at once instead of working line by line:
sed -rz "s/(#[0-9]{10}\n(cat|ls|man)\n)+(#[0-9]{10}\n|$)/\3/g;" yourfile
If everything works fine and you have a backup of your history file, you might even use GNU sed's -i option for in-place modification.
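For example (a sketch, assuming the usual ~/.bash_history location; the .bak suffix makes GNU sed keep a backup copy of the original):
sed -rz -i.bak "s/(#[0-9]{10}\n(cat|ls|man)\n)+(#[0-9]{10}\n|$)/\3/g" ~/.bash_history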
The -r option enables extended regexps; the -z option is explained in the manual like this:
Treat the input as a set of lines, each terminated by a zero byte
(the ASCII 'NUL' character) instead of a newline. This option can
be used with commands like 'sort -z' and 'find -print0' to process
arbitrary file names.
The basic idea is this: an uninteresting command is preceded and followed by a timestamp (or it is the last line in the file). Note that, unlike the awk script, which tests $1, the regexp only matches lines consisting solely of cat, ls, or man; commands with arguments are left alone.
the timestamp RE #[0-9]{10} is taken from your awk script
(#[0-9]{10}\n(cat|ls|man)\n)+ matches one or more of the uninteresting commands, each preceded by its timestamp
(#[0-9]{10}\n|$) captures the second timestamp into \3 (it is the third pair of parens) for reuse in the replacement part, and the alternation |$ handles the end-of-file case
I am trying to work with an AWK script that was posted earlier on this forum. I am trying to split a large FASTA file containing multiple DNA sequences, into separate FASTA files. I need to separate each sequence into its own FASTA file, and the name of each of the new FASTA files needs to be the name of the DNA sequence from the original, large multifasta file (all the characters after the >).
I tried this script that I found here at stackoverflow:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input
It works well, but the DNA sequence begins directly after the file name, with no space. The DNA sequence needs to begin on a new line (regular FASTA format).
I would appreciate any help to solve this.
Thank you!!
Do you mean something like this?
awk '/^>chr/ {OUT=substr($0,2) ".fa";print " ">OUT}; OUT{print >OUT}' your_input
where the new file that is created for each "chromosome/sequence/thing" gets a blank line at the start?
I think this should work.
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Hope this Perl script helps.
#!/usr/bin/perl
open (INFILE, "< your_input.fa")
    or die "Can't open file";
while (<INFILE>) {
    $line = $_;
    chomp $line;
    if ($line =~ /\>/) {                 # line contains a fasta header '>'
        close OUTFILE;
        $new_file = substr($line,1);     # file name is the header without '>'
        $new_file .= ".fa";
        open (OUTFILE, ">$new_file")
            or die "Can't open: $new_file $!";
    }
    print OUTFILE "$line\n";
}
close OUTFILE;
The .fa (or .fasta) format looks like:
>ID1
SEQUENCE
>ID2
SEQUENCE
When splitting a fasta file, it is actually not desirable to insert a newline character at its top. Therefore Pramod's answer is more appropriate. Additionally, the ID can be defined more generally to match only the > character. Consequently, the complete line would be:
awk '/^>/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
If you don't want to mess up your current directory with all the split files, you can also output into a subdirectory (subdir):
awk '/^>/ {OUT="subdir/" substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
awk to split multi-sequence fasta file into separate sequence files
This problem is best approached by considering each sequence (complete with header) a single record and changing awk's default record separator RS (usually a line break) to the unique (one per record) > symbol used to mark the start of a header. As we will want to use the header text as a file name, and as fasta headers cannot contain line breaks, it is also convenient to reset awk's default field separator FS (usually whitespace) to be line breaks.
Both of these are done in an awk BEGIN block:
BEGIN{RS=">";FS="\n"}
Since the file begins with >, the first record will be empty and must be ignored to prevent an error caused by trying to write to a file name extracted from an empty record. Thus, the main awk action block is restricted to records with record number (NR) greater than 1. This is achieved by placing a condition before the action block as follows:
NR>1{ ... }
With the record separator set to >, each record is a whole sequence including its header, and each is split into fields at line breaks (because we set the field separator to "\n"). Thus, field 1 ($1) of each record contains the text we wish to use as a filename. Note the record separator (>) is no longer part of any field, and so the entire first field can be used to build the filename. In this example, ".fasta" has been appended as a file extension:
fnme=$1 ".fasta";
Next, the fasta header marker ">" is printed, followed by the entire record ($0) to the filename fnme just formed, using awk's > redirect:
print ">" $0 > fnme;
Lastly, the file is closed to prevent awk from exceeding the system limit on the number of open files, if many files are to be written (see footnote):
close(fnme);
whole procedure
awk command
awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fasta"; print ">" $0 > fnme; close(fnme);}' example.fasta
Tested on the following mock file named example.fasta:
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
results (terminal commands and output)
$ ls
'DNA sequence 1.fasta' 'DNA sequence 3.fasta'
'DNA sequence 2.fasta' example.fasta
$ cat DNA\ sequence\ 1.fasta
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
$ cat DNA\ sequence\ 2.fasta
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
$ cat DNA\ sequence\ 3.fasta
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
footnote
"To write numerous files, successively, in the same awk program. If the files aren’t closed, eventually awk may exceed a system limit on the number of open files in one process. It is best to close each one when the program has finished writing it."
quoted from https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html