Removing lines that match patterns from another file - awk

I've got two files (I only show the beginning of each):
patterns.txt
m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
m64071_201130_104452/49
m64071_201130_104452/113
m64071_201130_104452/147
myfile.txt
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I should get an output like this:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I want to create a new file containing the records of myfile.txt whose headers match the lines in patterns.txt; I need to keep the ACTG letters associated with each matched pattern. I use:
for i in $(cat patterns.txt); do
grep -A 1 $i myfile.txt; done > my_newfile.txt
It works, but creating the new file is very slow... The files I work on are fairly large, though not huge (14 MB for patterns.txt and 700 MB for myfile.txt).
I also tried grep -v, because I have another file containing the patterns of myfile.txt that are not present in patterns.txt, but I hit the same slow-filling-file problem.
Any solution would be appreciated.
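For comparison, a grep-only sketch (not from the original thread) avoids the per-pattern loop by feeding all patterns to grep at once with -f; appending a "/" to each pattern keeps the fixed-string match of .../13 from also hitting .../133. Assumes GNU grep; file names follow the question, sample data is truncated.

```shell
# Recreate truncated sample inputs from the question.
printf '%s\n' 'm64071_201130_104452/13' 'm64071_201130_104452/26' > patterns.txt
printf '%s\n' \
  '>m64071_201130_104452/13/ccs' 'ACAGTCGAGCG' \
  '>m64071_201130_104452/16/ccs' 'ACAGTCGAGCG' \
  '>m64071_201130_104452/26/ccs' 'TAGACAATGTA' > myfile.txt

# Anchor each pattern with a trailing "/", read them all at once with -f -,
# take each match plus the following line, and drop grep's "--" group separators.
sed 's|$|/|' patterns.txt | grep -F -f - -A 1 myfile.txt | grep -v '^--$' > my_newfile.txt
```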

With your shown samples, please try the following, written and tested in GNU awk.
awk '
FNR==NR{
arr[$0]
next
}
/^>/{
found=0
match($0,/.*\//)
if((substr($0,RSTART+1,RLENGTH-2)) in arr){
print
found=1
}
next
}
found
' patterns.txt myfile.txt
Explanation: a detailed, commented version of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when patterns.txt is being read.
arr[$0] ##Creating array with index of current line.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found=0 ##Unsetting found here.
match($0,/.*\//) ##Using match to greedily match everything up to the last / in the current line.
if((substr($0,RSTART+1,RLENGTH-2)) in arr){ ##If the matched text, stripped of the leading > and trailing /, is present in arr, then do the following.
print ##Printing current line here.
found=1 ##Setting found to 1 here.
}
next ##next will skip all further statements from here.
}
found ##Printing the line if found is set.
' patterns.txt myfile.txt ##Mentioning Input_file names here.
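A self-contained run of the solution above on the question's (truncated) samples:

```shell
# Recreate truncated sample inputs from the question.
cat > patterns.txt <<'EOF'
m64071_201130_104452/13
m64071_201130_104452/26
EOF
cat > myfile.txt <<'EOF'
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
EOF

awk '
FNR==NR{ arr[$0]; next }           # 1st file: store each pattern as an array key
/^>/{                              # header lines in myfile.txt
  found=0
  match($0, /.*\//)                # greedy match up to the last "/"
  if (substr($0, RSTART+1, RLENGTH-2) in arr) { print; found=1 }
  next
}
found                              # sequence line: print only if its header matched
' patterns.txt myfile.txt > my_newfile.txt
```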

Another awk:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) # if first part (minus ">") is in hash; store result in var tf
print # output the header line
if(getline && tf) # read next record and if previous record was found
print # output
}' patterns myfile
Output:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
Edit: To output the ones not found:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) { # if first part found in hash
getline # consume the next record too
next
}
print # otherwise output
}' patterns myfile
Output:
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
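The inverse variant above can be exercised the same way on the truncated samples (the tf variable is not needed here, since a matched record and its sequence line are simply skipped):

```shell
cat > patterns <<'EOF'
m64071_201130_104452/13
m64071_201130_104452/26
EOF
cat > myfile <<'EOF'
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
EOF

awk -F/ '
NR==FNR { a[$1,$2]; next }           # hash patterns
{
  if ( (substr($1,2), $2) in a ) {   # header matched a pattern:
    getline                          # consume its sequence line too
    next                             # and print neither
  }
  print                              # otherwise output the record
}' patterns myfile > not_found.txt
```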

Related

Remove fields in file A that don't contain matches in file B, line by line

I have a series of paired files, tab separated.
I want to compare line by line each pair and keep in file B only the fields that contain a match with the paired file A.
Example file A:
a b
d c
Example file B:
f>543 h<456 b>536 d>834 v<75345 a>12343
t>4562 c>623 f>3246 h>1345 d<52312
Desired output:
b>536 a>12343
c>623 d<52312
So far I have tried:
Convert file B into a one-field-per-line file:
cat fileB | sed 's/\t/\n/g' > fileB.mod
Grep one string from file A in fileB.mod, print the matching line and the next line, then convert the two-line output back into a single tab-separated line:
cat fileB.mod | grep -A1 (string) | awk '{printf "%s%s",$0,NR%2?"\t":"\n" ; }'
...but this failed, since I realized that the matches can be in a different order in A and B, as in the example above.
I'd appreciate some help as this goes far beyond my bash skills.
With your shown samples, please try the following awk code.
awk '
FNR==NR{
for(i=1;i<=NF;i++){
arr[FNR,$i]
}
next
}
{
val=""
for(i=1;i<=NF;i++){
if((FNR,substr($i,1,1)) in arr){
val=(val?val OFS:"")$i
}
}
print val
}
' filea fileb
Explanation: a detailed, commented version of the above.
awk ' ##Starting awk Program from here.
FNR==NR{ ##Checking condition FNR==NR which will be true when filea is being read.
for(i=1;i<=NF;i++){ ##Traversing through all fields here.
arr[FNR,$i] ##Creating array with index of FNR,current field value here.
}
next ##next will skip all further statements from here.
}
{
val="" ##Nullify val here.
for(i=1;i<=NF;i++){ ##Traversing through all fields here.
if((FNR,substr($i,1,1)) in arr){ ##If (FNR, 1st letter of the field) is present in arr, then do the following.
val=(val?val OFS:"")$i ##Appending the current field to val, OFS-separated, per line.
}
}
print val ##Printing val here.
}
' filea fileb ##Mentioning Input_file names here.

How to use Awk to output multiple consecutive lines

Input/File
A:1111
B:21222
C:33rf33
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
J:efee
Basically I need to select the line that contains 2122 (i.e. line B / line 2)
and the lines from the one whose value starts with 444dct4 (i.e. line D) up to efe989ef (i.e. line I / line 9).
To summarize
Select Line B (contains 2122)
Select Line D (444dct4) till Line I
Desired Output
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Could you please try the following, written and tested with the shown samples in GNU awk. This one also takes care of the case where a line whose 2nd column is 21222 falls inside the 444dct4-to-efe989ef range: it will NOT be re-printed.
awk -F':' '
$2=="21222" && !found{
print
next
}
$2=="444dct4"{
found=1
}
found
$2=="efe989ef"{
found=""
}
' Input_file
Explanation: a detailed, commented version of the above.
awk -F':' ' ##Starting awk program from here and setting field separator as colon here.
$2=="21222" && !found{ ##Checking if 2nd field is 21222 and found is NOT set then try following.
print ##Printing the current line here.
next ##next will skip all further statements from here.
}
$2=="444dct4"{ ##Checking condition if 2nd field is 444dct4 then do following.
found=1 ##Setting found to 1 here.
}
found ##Checking condition if found is SET then print that line.
$2=="efe989ef"{ ##Checking condition if 2nd field is efe989ef then do following.
found="" ##Nullifying found here.
}
' Input_file ##Mentioning Input_file name here.
$ awk -F: '
/2122/ { # line that contains 2122
print
next # to avoid duplicate printing if 2122 also in D-I
}
$2~/^444dct4/,$2~/efe989ef/ # starts with 444dct4 till efe989ef
' file
Output:
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Edit:
One-liner:
$ awk -F: '/2122/{print; next} $2~/^444dct4/,$2~/efe989ef/' file.txt
awk -v str1="2122" -v str2="444dct4" -v str3="efe989ef" 'BEGIN { flag=0 } $0 ~ str1 { print } $0 ~ str2 { flag=1 } $0 ~ str3 { flag=0;print;next } flag' file
For flexibility, the line to find is set as str1, the range start as str2 and the range end as str3. A print flag (flag) is initialized to 0. When 2122 appears in a line, print it. When a line contains 444dct4, set the print flag to 1. When a line contains efe989ef, set the print flag back to 0, print the line and skip to the next record. The flag variable then determines what does and doesn't get printed in between.
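A quick end-to-end run of this flag-based variant on the question's sample input:

```shell
cat > file <<'EOF'
A:1111
B:21222
C:33rf33
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
J:efee
EOF

awk -v str1="2122" -v str2="444dct4" -v str3="efe989ef" '
BEGIN     { flag=0 }
$0 ~ str1 { print }                 # the single line to find
$0 ~ str2 { flag=1 }                # start of the range
$0 ~ str3 { flag=0; print; next }   # end of the range: print it, stop flagging
flag                                # print while inside the range
' file > out.txt
```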

Awk adding a pipe at the end of the first line

I have a little problem with my awk command.
The objective is to add a new column to my CSV :
The header must be "customer_id"
The next rows must be a customer_id from an array
Here is my csv :
email|event_date|id|type|cha|external_id|name|date
abcd@google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13
dcab@google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13
I would like to have this output :
email|event_date|id|type|cha|external_id|name|date|customer_id
abcd@google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab@google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
But when I'm doing the awk I have this result :
awk -v a="$(echo "${customerIdList[@]}")" 'BEGIN{FS=OFS="|"} FNR==1{$(NF+1)="customer_id"} FNR>1{split(a,b," ")} {print $0,b[NR-1]}' test.csv
email|event_date|id|type|cha|external_id|name|date|customer_id|
abcd@google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab@google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
Where customerIdList = (20200 20201)
There is a pipe just after the "customer_id" header and I don't know why :(
Can someone help me ?
Could you please try the following, written and tested with the shown samples. (The trailing pipe in your attempt comes from the final {print $0,b[NR-1]} block, which also runs on the header line: b[NR-1] is empty there, so print still emits the OFS pipe before it.)
awk -v var="${customerIdList[*]}" '
BEGIN{
num=split(var,arr," ")
}
FNR==1{
print $0"|customer_id"
next
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"")
}
1
' Input_file
Explanation: a detailed, commented version of the above.
awk -v var="${customerIdList[*]}" ' ##Starting awk program from here, creating var variable and passing array values to it.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var,arr," ") ##Splitting var into arr with space delimiter.
}
FNR==1{ ##Checking condition if this is first line.
print $0"|customer_id" ##Then printing current line with string here.
next ##next will skip all further statements from here.
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"") ##Checking condition if value of arr with current line number -1 is NOT NULL then add its value to current line with pipe else do nothing.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
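The full pipeline, with the bash array passed in exactly as the answer shows:

```shell
customerIdList=(20200 20201)
cat > test.csv <<'EOF'
email|event_date|id|type|cha|external_id|name|date
abcd@google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13
dcab@google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13
EOF

awk -v var="${customerIdList[*]}" '
BEGIN  { num=split(var, arr, " ") }            # bash array -> awk array
FNR==1 { print $0 "|customer_id"; next }       # header: append column name only
       { $0 = $0 (arr[FNR-1] ? "|" arr[FNR-1] : "") }  # data: append the id, if any
1' test.csv > out.csv
```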
awk -v IdList="${customerIdList[*]}" 'BEGIN { split(IdList,ListId," ") } NR > 1 { $0=$0"|"ListId[NR-1]}1' file
An array needs to be created within awk, so pass the bash array as a space-separated string and then use awk's split function to create the awk array ListId. Then, ignoring the header (NR>1), set each line equal to the line plus a pipe plus element NR-1 of the ListId array.

Filter out FASTA files by specified sequence length in bash

There's a FASTA file assembly.fasta containing contig names and corresponding sequences:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
>contig_3
CGTGATCTCGCCATTCGTGCCG
I want to get only contigs longer than 30 letters and get a new FASTA file assembly.filtered.fasta containing only those long sequences with contig names, in this format:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
Using gnu-awk, you may use this simpler version:
awk -v RS='>[^\n]+\n' 'length() >= 30 {printf "%s", prt $0} {prt = RT}' file
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
A very quick way to achieve what you are after is:
awk -v n=30 '/^>/{ if(l>n) print b; b=$0;l=0;next }
{l+=length;b=b ORS $0}END{if(l>n) print b }' file
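Because the threshold is a variable (-v n=...), the same one-liner adapts: with the question's data and n=100, for example, only contig_2 (143 bases) survives. A runnable sketch:

```shell
cat > assembly.fasta <<'EOF'
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
>contig_3
CGTGATCTCGCCATTCGTGCCG
EOF

# Buffer each contig in b; flush it at the next header (or END) if long enough.
awk -v n=100 '/^>/{ if(l>n) print b; b=$0; l=0; next }
{ l+=length; b=b ORS $0 } END{ if(l>n) print b }' assembly.fasta > filtered.fasta
```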
You might also be interested in BioAwk; it is an adapted version of awk tuned to process FASTA files:
bioawk -c fastx '(length($seq) > 30){print ">" $name ORS $seq}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
Could you please try the following, written and tested with the shown samples.
awk '
/^>/{
if(sign_val && strLen>=30){
print sign_val ORS line
}
strLen=line=""
sign_val=$0
next
}
{
strLen+=length($0)
line=(line?line ORS:"")$0
}
END{
if(sign_val && strLen>=30){
print sign_val ORS line
}
}
' Input_file
Explanation: a detailed, commented version of the above.
awk ' ##Starting awk program from here.
/^>/{ ##Checking condition if line starts from > then do following.
if(sign_val && strLen>=30){ ##If sign_val is set and strLen is at least 30, then do the following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
strLen=line="" ##Nullify variables steLen and line here.
sign_val=$0 ##Setting sign_val to current line here.
next ##next will skip all further statements from here.
}
{
strLen+=length($0) ##Adding the current line's length to the running total.
line=(line?line ORS:"")$0 ##Appending the current line to line, separated by ORS.
}
END{ ##Starting END block of this program from here.
if(sign_val && strLen>=30){ ##If sign_val is set and strLen is at least 30, then do the following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
}
' Input_file ##Mentioning Input_file name here.
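And a runnable check of this solution with the question's threshold of 30, which keeps contig_1 and contig_2 and drops contig_3 (22 bases):

```shell
cat > assembly.fasta <<'EOF'
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
>contig_3
CGTGATCTCGCCATTCGTGCCG
EOF

awk '
/^>/{
  if (sign_val && strLen>=30) { print sign_val ORS line }  # flush previous contig
  strLen=line=""
  sign_val=$0                                              # remember the header
  next
}
{
  strLen += length($0)                                     # accumulate length
  line = (line ? line ORS : "") $0                         # accumulate sequence
}
END{
  if (sign_val && strLen>=30) { print sign_val ORS line }  # flush the last contig
}' assembly.fasta > assembly.filtered.fasta
```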

Print the first column of multiple files using awk

I have 20 files, and I want to print the first column of each file into a different file. I need 20 output files.
I have tried the following command, but it puts all the output into a single file.
awk '{print $1}' /home/gee/SNP_data/20* > out_file
How can I write the output to different files, given 20 input files?
1st solution: Could you please try the following.
awk '
FNR==1{
if(file){
close(file)
}
file="out_file_"FILENAME".txt"
}
{
print $1 > (file)
}
' /home/gee/SNP_data/20*
Explanation: a commented version of the above code.
awk ' ##Starting awk program here.
FNR==1{ ##Checking condition: if FNR==1 then do the following.
if(file){ ##Checking condition if variable file is NOT NULL then do following.
close(file) ##Using close to close the opened output file in backend, to avoid too many opened files error.
} ##Closing BLOCK for if condition.
file="out_file_"FILENAME".txt" ##Setting variable file value to string out_file_ then FILENAME(which is Input_file) and append .txt to it.
} ##Closing BLOCK for condition for FNR==1 here.
{
print $1 > (file) ##Printing first field to variable file here.
}
' /home/gee/SNP_data/20* ##Mentioning Input_file path here to pass files here.
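For instance, running the 1st solution on two small sample files (the file names and contents here are hypothetical stand-ins for the SNP data):

```shell
# Hypothetical two-column sample inputs.
printf '%s\n' 'rs1 A' 'rs2 G' > snp_20a.txt
printf '%s\n' 'rs9 T' 'rs8 C' > snp_20b.txt

awk '
FNR==1{
  if (file) { close(file) }          # close the previous output file
  file="out_file_" FILENAME ".txt"   # one output file per input file
}
{ print $1 > (file) }
' snp_20a.txt snp_20b.txt
```

This writes out_file_snp_20a.txt.txt and out_file_snp_20b.txt.txt, each holding the first column of its input.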
2nd solution: In case you need output files named Output_file_1.txt and so on, then try the following. I have created an awk variable named out_file so that you can change the output files' name prefix as needed.
awk -v out_file="Output_file_" '
FNR==1{
if(file){
close(file)
}
++count
file=out_file count".txt"
}
{
print $1 > (file)
}
' /home/gee/SNP_data/20*
Awk has a built-in redirection operator; you can use it like:
awk '{ print $1 > ("out_" FILENAME) }' /home/gee/SNP_data/20*
or, even better:
awk 'FNR==1 { close(f); f=("out_" FILENAME) } { print $1 > f }' /home/gee/SNP_data/20*
The former is just an example usage of the redirection operator; the latter is how to use it robustly (close() prevents running out of open file descriptors when there are many input files).