I have a large file and would like to remove any lines from the file that contain an exact string listed in another file. However, the string must match exactly (I'm sorry I don't know how to describe this better).
Here is the file:
one#email.com,name,surname,city,state
two#email.com,name,surname,city,state
three#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
And here is the example list to filter with:
one#email.com
three#email.com
The desired output is:
two#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
I have tried to do this using the following:
grep -v -f 2.txt 1.txt > 3.txt
However this produces the output:
two#email.com,name,surname,city,state
I assume it's doing this because "anotherone#email.com" contains "one#email.com". I've searched for a way to include beginning of the line, but not found anything suitable.
I'm open to doing it in something other than grep too; I used grep because I couldn't figure it out any other way.
Assuming that your input file contains three#gmail.com not three#email.com (typo perhaps)
$ grep -vw -f 2.txt 1.txt
two#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
-w, --word-regexp
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]').
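Note that -w only checks word boundaries, and each line of the filter is still treated as a regular expression, so the . in an address matches any character. A minimal sketch, assuming a grep that supports both -w and -F (GNU and BSD grep do) and using the sample data from the question, that adds -F so each filter line is taken as a fixed string:

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' \
    'one#email.com,name,surname,city,state' \
    'two#email.com,name,surname,city,state' \
    'three#email.com,name,surname,city,state' \
    'anotherone#email.com,name,surname,city,state' > "$tmp/1.txt"
printf '%s\n' 'one#email.com' 'three#email.com' > "$tmp/2.txt"

# -F: fixed strings, -w: whole words only, -v: invert, -f: patterns from file
result=$(grep -vwF -f "$tmp/2.txt" "$tmp/1.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

This would still remove a line such as x-one#email.com,..., because - is not a word character; comparing the first field exactly, as in the awk answers, avoids that.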
If you only want to print the lines of the first file whose first field does not appear in the second file, then this should do:
$ cat file
one#email.com,name,surname,city,state
two#email.com,name,surname,city,state
three#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
$ cat filter
one#email.com
three#email.com
awk -F, 'NR==FNR {a[$0]++;next} !($1 in a)' filter file
two#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
For every line in filter, this creates an entry in the array a, keyed by that line, with the value 1
Like a[one#email.com]=1 and a[three#email.com]=1
Then awk tests the first field of each line of file against the array, giving
a[one#email.com]=1
a[two#email.com]=
a[three#email.com]=1
a[anotherone#email.com]=
Then it prints every line of file whose first field is not in the array
two#email.com,name,surname,city,state
anotherone#email.com,name,surname,city,state
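The NR==FNR test is what separates the two files: NR counts records across all input, FNR restarts at 1 for each file, so the two are equal only while the first file is being read. A small sketch with two made-up files (their names and contents are only for illustration) that prints both counters:

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/filter"
printf 'x\ny\nz\n' > "$tmp/file"

# Print the file name and both counters for every input line.
result=$(cd "$tmp" && awk '
    { print FILENAME, NR, FNR, (NR == FNR ? "first file" : "later file") }
' filter file)
printf '%s\n' "$result"
rm -rf "$tmp"
```

One caveat: if the first file is empty, NR==FNR is also true for the second file, so the idiom silently treats the data file as the filter.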
For this particular case -- process first file by building an associative array with filter lines being the index. In subsequent files, test if the given line is not in the array indexes -- the default action of a pattern is to print.
awk -F, -v OFS=, '
BEGIN { split("", m) }
NR==FNR { m[$0] = ""; next }
!($1 in m)
' filter.txt file.txt
But... if we are looking to filter any occurrence of the string anywhere in the line (an unconstrained exact match) we need to do something less clever and more brute-force:
awk '
BEGIN {
split("", m)
n=0
}
NR==FNR {
m[n++] = $0
next
}
{
for (i=0; i<n; ++i) {
if (index($0, m[i]))
next
}
print
}
' filter.txt file.txt
Note that if the filter contains non-printable characters (e.g. non-unix line endings), we would need to deal with them by filtering them out (e.g. with sub(/\r/, "")).
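As a sketch of that clean-up, assuming the filter was saved with Windows (CRLF) line endings and the data file with Unix ones, the carriage return can be stripped before the line is stored as an array index:

```shell
tmp=$(mktemp -d)
# Filter file with CRLF line endings; data file with plain LF.
printf 'one#email.com\r\nthree#email.com\r\n' > "$tmp/filter.txt"
printf '%s\n' \
    'one#email.com,name,surname,city,state' \
    'two#email.com,name,surname,city,state' \
    'three#email.com,name,surname,city,state' \
    'anotherone#email.com,name,surname,city,state' > "$tmp/file.txt"

result=$(awk -F, '
    NR == FNR { sub(/\r$/, ""); m[$0]; next }   # drop a trailing CR, then store
    !($1 in m)                                   # keep lines whose first field is not listed
' "$tmp/filter.txt" "$tmp/file.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Without the sub() call, the stored keys would end in a carriage return and no first field would ever match them.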
From a list of files, I want to find those that contain only one specific string on the first line, with all remaining lines empty, using sed. Can someone help me with this? The text files I want to find have content like this:
line1:Some text
line2:blank line
line3:blank line
line4:blank line
.
.
.
.
.
.
so on
I have tried this command, but it deletes the empty lines. I do not want to delete the empty lines; I want to find the files that consist of a specific string at the top with the rest of the lines empty.
sed -i "/^$/d" "file.txt"
sed is powerful and terse, but fairly unintelligent. GNU awk picks up the slack:
gawk '
FNR==1 && /Some text/ {a[FILENAME]++; next} #1
/./ {delete a[FILENAME]; nextfile} #2
END {for(f in a) print f} #3
' *.txt
#1 If the first line of a file (FNR is the record number within the current file) matches /Some text/ (which you should adjust to match your actual files), record the filename in an array and skip to the next input line.
#2 If a line contains any character, remove the filename from the array and skip the rest of the file. (nextfile is not critical here, but will improve performance at scale.)
#3 After all processing is completed, print all indices in the array.
*.txt should be adjusted to match all the files you wish to test.
If Perl is an option, you could try the following (the \S+ stands for a first line without whitespace; replace it with your actual string):
perl -0777ne 'print "$ARGV\n" if /^\S+(\r?\n)+$/s;' *.txt
sed is for doing simple s/old/new on individual strings, that is all. With GNU awk for nextfile and ENDFILE:
awk '
FNR==1 { if (!/Some text/) nextfile; f=1; next }
/./ { f=0; nextfile }
ENDFILE { if (f) print FILENAME; f=0 }
' *.txt
If you want lines of all-blanks to be considered as empty then change /./ to NF.
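If GNU awk is not available, the same idea can be sketched without nextfile or ENDFILE by keeping a per-file flag in an array. This is only a sketch on three made-up test files (a.txt, b.txt, c.txt are illustrative names); it reads every line of every file, so it will be slower on large inputs:

```shell
tmp=$(mktemp -d)
printf 'Some text\n\n\n' > "$tmp/a.txt"       # wanted: header, then only empty lines
printf 'Some text\n\nmore\n' > "$tmp/b.txt"   # not wanted: a later non-empty line
printf 'Other\n\n' > "$tmp/c.txt"             # not wanted: wrong first line

result=$(cd "$tmp" && awk '
    FNR == 1 { ok[FILENAME] = ($0 ~ /Some text/); next }   # judge the first line
    /./      { ok[FILENAME] = 0 }                           # any later content disqualifies
    END      { for (f in ok) if (ok[f]) print f }
' *.txt | sort)
printf '%s\n' "$result"
rm -rf "$tmp"
```

The sort is there because the order of a for (f in ok) loop is unspecified.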
This might work for you (GNU sed):
sed -znE '/^[^\n]*some string[^\n]*\n(\s*\n)+$/F' file1 file2 file3 ...
This solution slurps in each file and uses pattern matching to identify file names (F) which contain some string in the first line of an otherwise empty file.
File 1.txt:
13002:1:3:6aw:4:g:Dw:S:5342:dsan
13003:5:3s:6s:4:g:D:S:3456:fdsa
13004:16:t3:6:4hh:g:D:S:5342:inef
File 2.txt:
13002:6544
13003:5684
I need to replace the old data in column 9 of 1.txt with the new data from column 2 of 2.txt, if it exists. I think this can be done line by line, as both files have the same column 1 field. 1.txt is 3 GB in size. I have been playing about with awk but can't achieve the following.
I was trying the following:
awk 'NR==FNR{a[$1]=$2;} {$9a[b[2]]}' 1.txt 2.txt
Expected result:
13002:1:3:6aw:4:g:Dw:S:6544:dsan
13003:5:3s:6s:4:g:D:S:5684:fdsa
13004:16:t3:6:4hh:g:D:S:5342:inef
You seem to have a couple of odd typos in your attempt. You want to replace $9 with the value from the array if it is defined. Also, you want to make sure Awk uses colon as separator both on input and output.
awk -F : 'BEGIN { OFS=FS }
NR==FNR{a[$1]=$2; next}
$1 in a {$9 = a[$1] } 1' 2.txt 1.txt
Notice how 2.txt is first, so that NR==FNR is true when you are reading this file, but not when you start reading 1.txt. The next in the first block prevents Awk from executing the second condition while you are reading the first file. And the final 1 is a shorthand for an unconditional print which of course will be executed for every line in the second file, regardless of whether you replaced anything.
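The reason OFS matters is that assigning to a field makes awk rebuild the whole record, joining the fields with OFS, which defaults to a single space. A small sketch on one line of the sample data, with the replacement value hard-coded for illustration:

```shell
line='13002:1:3:6aw:4:g:Dw:S:5342:dsan'

# Without OFS: the rebuilt record is joined with spaces.
without_ofs=$(printf '%s\n' "$line" | awk -F: '{ $9 = "6544" } 1')

# With OFS set to the input separator: colons are preserved.
with_ofs=$(printf '%s\n' "$line" | awk -F: 'BEGIN { OFS = FS } { $9 = "6544" } 1')

printf '%s\n%s\n' "$without_ofs" "$with_ofs"
```

Lines that are not modified are printed untouched, so forgetting OFS would give a file with mixed separators rather than an obviously broken one.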
input.txt:
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
AWK command:
awk '/>block2.*>/' input.txt
Expected output
222222222222222222222
However, AWK is returning nothing. What am I misunderstanding?
Thanks!
If you want to print the line after the line containing >block2, then you could use:
awk '/^>block2$/ { nr=NR+1 } NR == nr { print }'
Track the record number plus 1 when you find the match; when the current record number matches the remembered one, print the current record.
If you want all the lines between the line >block2 and >block3, then you'd use:
awk '/^>block2$/,/^>block3/ {if ($0 !~ /^>block[23]$/) print }'
For all lines between the two markers, if the line doesn't match either marker, print it. The output is the same with the sample data file.
another awk
$ awk 'c&&c--; /^>block2/{c=1}' file
222222222222222222222
c specifies how many lines you want to print after the match. If you want the text between two markers
$ awk '/^>block3/{exit} s; /^>block2/{s=1}' file
222222222222222222222
if there are multiple instances and you want them all, just change exit to s=0
You probably meant:
$ awk '/>/{f=/^>block2$/;next} f' file
222222222222222222222
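The original command prints nothing because awk tests a pattern against one record at a time, and by default a record is a single line, so >block2.*> can never match across lines. One more option, sketched here on shortened sample data, is to make each block a record by setting RS to > and splitting fields on newlines:

```shell
# Each record is now "name\ndata\n"; $1 is the block name, $2 the data line.
result=$(printf '>block1\n111\n>block2\n222\n>block3\n333\n' |
    awk 'BEGIN { RS = ">"; FS = "\n" } $1 == "block2" { print $2 }')
printf '%s\n' "$result"
```

This assumes > never appears inside the data lines; the first record is the empty text before the first >, which simply fails the comparison.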
I have two files. One is a fasta file containing multiple fasta sequences, while the other contains the names of the candidate sequences I want to search for (examples below).
seq.fasta
>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC
name.txt
Clone_23
Clone_27-1
I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.
awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta
The problem is that I can only extract the sequence of first candidate in name.txt, like this
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
Can anyone help fix the one-line awk command above?
If it is ok or even desired to print the name as well, you can simply use grep:
grep -Ff name.txt -A1 seq.fasta
-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
-A1 prints the matching line plus the subsequent line
If the names are not desired in output I would simply pipe to another grep:
above_command | grep -v '>'
An awk solution can look like this:
awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt seq.fasta
Better explained in a multiline version:
# True as long as we are reading the first file, name.txt
NR==FNR {
    # Store the names in the array 'n'
    n[$0]
    next
}
# I use substr() to remove the leading `>` and check if the remaining
# string, which is the name, is a key of `n`. getline retrieves the next line.
# If it succeeds, the condition becomes true and awk will print that line.
substr($0,2) in n && getline
$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
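The answers above assume that every sequence occupies exactly one line. If the sequences may be wrapped over several lines, a flag-based variant is one way to handle it. A sketch with made-up, shortened records (the sequence data and the Clone_234 entry are invented for the test):

```shell
tmp=$(mktemp -d)
printf 'Clone_23\n' > "$tmp/name.txt"
printf '%s\n' '>Clone_18' 'AAA' 'CCC' \
              '>Clone_23' 'GGG' 'TTT' \
              '>Clone_234' 'ACGT' > "$tmp/seq.fasta"

result=$(awk '
    NR == FNR { n[$0]; next }                # remember the wanted names
    /^>/      { p = (substr($0, 2) in n) }   # on each header, decide whether to print
    p                                        # print header and sequence lines while p is set
' "$tmp/name.txt" "$tmp/seq.fasta")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the header is compared as a whole array key, Clone_23 does not also select Clone_234, which a plain grep -F substring match would.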
I have several .seq files containing text.
I want to get a single text file containing :
name_of_the_seq_file1
contents of file 1
name_of_the_seq_file2
contents of file 2
name_of_the_seq_file3
contents of file 3
...
All the files are in the same directory.
Is this possible with awk or something similar? Thanks!
If there can be empty files then you need:
with GNU awk:
awk 'BEGINFILE{print FILENAME}1' *.seq
with other awks (untested):
awk '
FNR==1 {
for (++ARGIND;ARGV[ARGIND]!=FILENAME;ARGIND++) {
print ARGV[ARGIND]
}
print FILENAME
}
{ print }
END {
for (++ARGIND;ARGIND in ARGV;ARGIND++) {
print ARGV[ARGIND]
}
}' *.seq
You can use the following command:
awk 'FNR==1{print FILENAME}1' *.seq
FNR is the record number (which is the line number by default) within the current input file. Each time awk starts to handle another file, FNR==1 is true, and the current filename gets printed through {print FILENAME}.
The trailing 1 is an awk idiom. It always evaluates to true, which makes awk print all lines of input.
Note:
The above solution works only as long as you have no empty files in that folder. Check Ed Morton's great answer which points this out.
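To see the difference, here is a small sketch with three made-up files, one of them empty, comparing the FNR==1 command with a plain shell loop (the loop is simply another option, not taken from the answers here):

```shell
tmp=$(mktemp -d)
printf 'x\n' > "$tmp/a.seq"
: > "$tmp/b.seq"                 # empty file
printf 'y\n' > "$tmp/c.seq"

# awk never sees a first record in b.seq, so its name is not printed.
awk_out=$(cd "$tmp" && awk 'FNR==1{print FILENAME}1' *.seq)

# A shell loop prints every name, then whatever the file contains.
loop_out=$(cd "$tmp" && for f in *.seq; do printf '%s\n' "$f"; cat "$f"; done)

printf 'awk:\n%s\nloop:\n%s\n' "$awk_out" "$loop_out"
rm -rf "$tmp"
```

The loop starts one cat process per file, so for a very large number of files an awk or perl solution is likely to be faster.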
perl -lpe 'print $ARGV if $. == 1; close(ARGV) if eof' *.seq
$. is the line number
$ARGV is the name of the current file
close(ARGV) if eof resets the line number at the end of each file