How to delete text/word or character from a text file? [duplicate] - awk

I'm working with this file of data that looks like this:
text in file
hello random text in file
example text in file
words in file hello
more words in file
hello text in file can be
more text in file
I'm trying to replace all lines that do not contain the string hello with match using sed, so the output would be:
match
hello random text in file
match
words in file hello
match
hello text in file can be
match
I tried using sed '/hello/!d', but that deletes the lines entirely instead of replacing them. I also read that I can negate a match with ! in sed, but I'm not sure how to combine that with a substitution on every non-matching line. If you could give me some direction, I would really appreciate it.

You can do it like this:
$ sed '/hello/!s/.*/match/' infile
match
hello random text in file
match
words in file hello
match
hello text in file can be
match
/hello/! makes sure we're substituting only on lines not containing hello (you had that right), and the substitution then replaces the complete pattern space (.*) with match.

awk to the rescue!
$ awk '!/hello/{$0="match"}1' file
Replace the lines not matching "hello" with "match"; the trailing 1 is an always-true condition whose default action prints every line.

Sed with the c (change) command:
$ sed '/hello/!c match' file
match
hello random text in file
match
words in file hello
match
hello text in file can be
match

Just use awk for clarity, simplicity, etc.:
awk '{print (/hello/ ? $0 : "match")}' file
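A quick check of this one-liner on the first three lines of the question's sample file:

```shell
printf 'text in file\nhello random text in file\nexample text in file\n' > file
awk '{print (/hello/ ? $0 : "match")}' file
```

The non-hello lines come out as match and the hello line is printed unchanged.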

Related

Match a string from a file and print only the first row that matches

I am trying to match a string from a file and print only the first line that matches the string. I am able to get the result using grep, but is there a way I can achieve the same output using awk?
# cat file
/dev/sdac
/dev/cciss/c0d0
/dev/cciss/c0d0p1
/dev/cciss/c0d0p2
/dev/cciss/c0d0p1
# grep -wm1 c0d0p1 file
/dev/cciss/c0d0p1
Could you please try the following:
awk '/c0d0p1/{print;exit}' Input_file
Explanation: search each line for the string; when a match is found, print the line and exit immediately, since there is no need to read the rest of the file. Exiting early also makes the program faster.
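Note that grep -w matches c0d0p1 only as a whole word, while the bare regex /c0d0p1/ would also match inside a longer name such as c0d0p11. If the target is always the last path component (an assumption based on the sample data), a stricter awk check could be:

```shell
printf '/dev/sdac\n/dev/cciss/c0d0\n/dev/cciss/c0d0p1\n/dev/cciss/c0d0p2\n/dev/cciss/c0d0p1\n' > Input_file
# Compare the last /-separated field exactly, then stop at the first hit
awk -F'/' '$NF == "c0d0p1" {print; exit}' Input_file
```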

Delete lines using gawk, awk or sed

Original question
I have a comma delimited .csv file looking like this:
header1,header2,header3
value10,value20,value30
value11,value21,value31
,,
,,
,,
How do I delete the "empty lines" at the end of the CSV? The number of empty lines is not always the same; it can be any number.
And how do I save the modified CSV in a new file?
Question with Thor's edits
I have a comma delimited .csv file looking like this:
header1,header2,header3
value10,value20,value30
value11,value21,value31
[empty line]
[empty line]
[empty line]
How do I delete the "empty lines" at the end of the CSV? The number of empty lines is not always the same; it can be any number.
And how do I save the modified CSV in a new file?
It kind of depends on your definition of an empty line. If it really is empty, as in there is nothing but a newline, using awk you could:
$ awk '/./' file
or, equivalently, awk '!/^$/': if the record contains anything but just a newline (the default RS in awk), print it. If you need the output in another file:
$ awk '/./' file > file2
If your definition of empty can also tolerate spaces or tabs along with the newline:
$ awk 'NF' file
(NF is zero when a record contains nothing but whitespace, so only records with at least one field are printed.)
Update: a-ha, the definition of emptiness boils down to all commas. The OP mentions in the comments that the "empty lines" are always at the end, so once we run into the first empty line (i.e. nothing but commas in the record: ^,+$, or negated, !/[^,]/, sorry about the double negative), we exit.
$ awk '!/[^,]/{exit}1' file
header1,header2,header3
value10,value20,value30
value11,value21,value31
A quick and dirty (but efficient) way of doing it is to pick a character that does not appear in your file, for instance µ. Then just type:
tr '\n' 'µ' < myfile.csv | sed -e 's/[,µ]*$//' | tr 'µ' '\n' > out.csv
Untested, but you can adapt this idea to your own needs. You may also have to add the space character (or tab, etc.) to the bracket expression.
The idea is to replace the end-of-line characters with a temporary µ to get a temporary single-line file, then use a very basic regular expression to delete what you want, and finally restore the end-of-line characters.
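If no safe printable character exists, a control character such as \001 can serve as the placeholder instead. A sketch of the same pipeline (it assumes the data itself never contains \001, and, like the µ version, drops the final newline):

```shell
printf 'header1,header2,header3\nvalue10,value20,value30\n,,\n,,\n' > myfile.csv
# use \001 as the temporary line-joining character instead of µ
sep=$(printf '\001')
tr '\n' "$sep" < myfile.csv | sed "s/[,$sep]*\$//" | tr "$sep" '\n' > out.csv
```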
Use the following:
sed -i '/^$/d' file
Explanation:
^$ : matches a line that contains nothing between its start (^) and end ($)
d : deletes each matched line
-i : edits the file in place, so you don't need to redirect to another file and rename it afterwards.
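If you want to keep the original file and write the cleaned copy elsewhere, as the question asks, drop -i and redirect instead. For this CSV the "empty" lines are actually ,,, so the pattern needs to cover runs of commas as well. A sketch:

```shell
printf 'header1,header2,header3\nvalue10,value20,value30\n,,\n,,\n' > file
# ^,*$ matches truly empty lines and all-comma lines alike
sed '/^,*$/d' file > file2
cat file2
```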
It's not clear from your question, but it sounds like all you need is:
grep '[^,]' file1 > file2
i.e. keep only the lines that contain at least one non-comma character.

splitting a multiple FASTA file into separate files keeping their original names

I am trying to work with an AWK script that was posted earlier on this forum. I am trying to split a large FASTA file containing multiple DNA sequences, into separate FASTA files. I need to separate each sequence into its own FASTA file, and the name of each of the new FASTA files needs to be the name of the DNA sequence from the original, large multifasta file (all the characters after the >).
I tried this script that I found here at stackoverflow:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input
It works well, but the DNA sequence begins directly after the name of the file, with no line break. The DNA sequence needs to begin on a new line (regular FASTA format).
I would appreciate any help to solve this.
Thank you!!
Do you mean something like this?
awk '/^>chr/ {OUT=substr($0,2) ".fa"; print "" > OUT}; OUT{print > OUT}' your_input
where the new file that is created for each "chromosome/sequence/thing" gets a blank line at the start?
I think this should work.
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Hope this Perl script helps.
#!/usr/bin/perl
open (INFILE, "< your_input.fa")
    or die "Can't open file";
while (<INFILE>) {
    $line = $_;
    chomp $line;
    if ($line =~ /^>/) {    # fasta header line
        close OUTFILE;
        $new_file = substr($line, 1);
        $new_file .= ".fa";
        open (OUTFILE, ">$new_file")
            or die "Can't open: $new_file $!";
    }
    print OUTFILE "$line\n";
}
close OUTFILE;
The .fa (or .fasta) format looks like:
>ID1
SEQUENCE
>ID2
SEQUENCE
When splitting a FASTA file it is actually not desirable to insert a newline character at the top of each file. Therefore Pramod's answer is more appropriate. Additionally, the ID can be matched more generally, on the > character alone. The complete line would then be:
awk '/^>/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
If you don't want to mess up your current directory with all the split files you can also output into a subdirectory (subdir):
awk '/^>/ {OUT="subdir/" substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
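One caveat: awk's output redirection does not create missing directories, so subdir must exist before the command runs. A minimal sketch with a mock two-sequence file:

```shell
printf '>seq1\nACGT\n>seq2\nTTGG\n' > Input_File
# create the output directory first; awk's ">>" will not create it
mkdir -p subdir
awk '/^>/ {OUT="subdir/" substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
```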
awk to split multi-sequence fasta file into separate sequence files
This problem is best approached by considering each sequence (complete with header) as a single record, and changing awk's default record separator RS (usually a line break) to the > symbol that marks the start of each header (one per record). As we will want to use the header text as a file name, and as FASTA headers cannot contain line breaks, it is also convenient to reset awk's default field separator FS (usually whitespace) to a line break.
Both of these are done in an awk BEGIN block:
BEGIN{RS=">";FS="\n"}
Since the file begins with >, the first record will be empty and therefore must be ignored to prevent an error caused by trying to write to a file name extracted from an empty record. Thus, the main awk action block is filtered to only process records beginning with record number (NR) 2. This is achieved by placing a condition before the action block as follows:
NR>1{ ... }
with the record separator set to > each record is a whole sequence including its header, and each is split into fields at line breaks (because we set the field separator to "\n"). Thus, field 1 ($1) of each record contains the text we wish to use as filenames. Note the record separator (>) is no longer part of any field and so the entire first field can be used to build the filename. In this example, ".fasta" has been appended as a file extension:
fnme=$1 ".fasta";
Next, the fasta header marker ">" is printed, followed by the entire record ($0) to the filename fnme just formed, using awk's > redirect:
print ">" $0 > fnme;
lastly, the file is closed to prevent awk exceeding the system limit for the number of open files allowed, if many files are to be written (see footnote):
close(fnme);
whole procedure
awk command
awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fasta"; print ">" $0 > fnme; close(fnme);}' example.fasta
Tested on the following mock file named example.fasta:
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
results (terminal commands and output)
$ ls
'DNA sequence 1.fasta' 'DNA sequence 3.fasta'
'DNA sequence 2.fasta' example.fasta
$ cat DNA\ sequence\ 1.fasta
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
$ cat DNA\ sequence\ 2.fasta
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
$ cat DNA\ sequence\ 3.fasta
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
footnote
It may be necessary "to write numerous files, successively, in the same awk program. If the files aren't closed, eventually awk may exceed a system limit on the number of open files in one process. It is best to close each one when the program has finished writing it."
quoted from https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html

Getting substring using ksh script

I'm using a ksh script to determine, with awk, what the delimiter in a file is. I know this delimiter will always be at position 4 on the first line. The issue I'm having is that the character used as the delimiter in a particular file is a *, and so instead of returning * in the variable, the script returns a file list. Here is sample text from my file along with my script:
text in file:
XXX*XX* *XX*XXXXXXX.......
here is roughly what my script looks like (I don't have the script in front of me, but you get the gist):
delimiter=$(awk '{substr $0, 4, 1}' file.txt)
echo ${delimiter} # lists files in the directory (file.txt file1.txt file2.txt ...) instead of *, which is the desired result
Thank you in advance,
Anthony
Birei is right about your problem. But your awk expression isn't restricted to the first line only. You can replace it with:
'NR==1 {print substr($0, 4, 1)}'
Then you can do a simple:
echo "$delimiter"
The shell is expanding the * in the unquoted delimiter variable as a filename glob. You need to quote it to avoid this behaviour:
echo "${delimiter}"
It will print *
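Putting both answers together (the first-line restriction plus the quoting), a sketch using the sample line from the question:

```shell
printf 'XXX*XX* *XX*XXXXXXX\n' > file.txt
# restrict awk to line 1, and quote the expansion so the shell does not glob the *
delimiter=$(awk 'NR==1 {print substr($0, 4, 1)}' file.txt)
echo "$delimiter"
```

This prints *.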

how to replace a pattern with a string depending on part of the pattern?

I have the following problem. I'm interpreting an input file, and now I'm encountering this:
I need to translate %%BLANKx to x spaces.
So, wherever in the input file I find, for example, %%BLANK8, I need to replace it with 8 spaces, %%BLANK10 with 10 spaces, etc.
You can split your string on the %%BLANK tag.
Then read the number at the start of each token and convert it into that many spaces.
Finally, concatenate the tokens into a new string.
perl -pe 's/%%BLANK(\d+)/" " x $1/e' input_file
Try this; it requires GNU awk for the three-argument match(), and I have not tested it exhaustively:
$ awk '/BLANK/{ match($0,/%%BLANK([0-9]+)/,a);s=sprintf("%"a[1]"s","") ; gsub(a[0],s)}1' file
Or Ruby(1.9+)
$ ruby -ne 'print $_.gsub(/%%BLANK(\d+)/){|m|" "*$1.to_i}' file
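The three-argument match() used in the awk answer above is a GNU awk extension. A POSIX-awk sketch of the same idea, using RSTART and RLENGTH instead:

```shell
printf 'foo%%%%BLANK8bar\n' > file
awk '{
  # replace each %%BLANKn with n spaces, using plain 2-argument match()
  while (match($0, /%%BLANK[0-9]+/)) {
    n = substr($0, RSTART + 7, RLENGTH - 7)      # digits after "%%BLANK"
    pad = sprintf("%" n "s", "")                 # n spaces
    $0 = substr($0, 1, RSTART - 1) pad substr($0, RSTART + RLENGTH)
  }
  print
}' file
```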
Using "%%BLANK" as the record separator: whenever a record starts with a number, replace that number with the same number of spaces. (This also relies on GNU awk's three-argument match().)
awk 'BEGIN {RS="%%BLANK"; ORS=""}
{
  MatchFound = match($0, "^[0-9]+", Matched_string)
  if (MatchFound) {
    sub(Matched_string[0], "", $0)
    for (i = 0; i < Matched_string[0]; i++) $0 = " " $0
    print $0
  } else {
    print $0
  }
}' InputFile.txt