Only returning matching patterns against file with grep - awk

I am trying to inversely separate a list of emails against another list using grep, so that only the emails not matching those expressions are returned.
The list of emails looks like:
recruitment#madeup.com
joy#netnoir.net
hello#nom.com
mary#itcouldbereal.ac.uk
thisshouldbe#theonlyone.com
The list of expressions that I am comparing it to is:
recruitment#
netnoir.net
hello#
"\.ac.\b"
I have tried:
grep -vif listofexpressions listofemails
The problems I am facing are
1.) nothing is returned
2.) the .ac. expression is not recognized when it is read from the file, but if I use it directly with
grep "\.ac.\b" filename
then it is.
If I change it to
grep -if listofexpressions listofemails
then most of the expressions that do not need escaping are shown highlighted but the others are shown as well.
My expected output would be
thisshouldbe#theonlyone.com
I am sure this is simple, but after reading the man page of grep and googling, I still cannot work it out.
Thanks

With your shown samples, could you please try following. Written and tested in GNU awk.
awk '
FNR==NR{
arr[$0]
next
}
{
found=""
for(key in arr){
if(index($0,key)){
found=1
next
}
}
if(found==""){
print
}
}
' expressions listemails
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when expressions file is being read.
arr[$0] ##Created arr with index of current line here.
next ##next will skip all further statements from here.
}
{
found="" ##Nullifying found here.
for(key in arr){ ##Going through arr elements here.
if(index($0,key)){ ##Checking by index if key occurs as a literal substring of the current line.
found=1 ##Setting found to 1 here.
next ##next will skip all further statements.
}
}
if(found==""){ ##Checking condition if found is NULL then print that line.
print
}
}
' expressions listemails ##Mentioning Input_files here.
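A note on the grep side: index() does a plain substring comparison, so the awk above treats every line of the expressions file literally, including the regex entry. If grep is preferred, two things in the pattern file are worth checking: surrounding quotes are read as literal characters when patterns come from -f, and an empty line is a pattern that matches every line, which makes -v print nothing. A minimal sketch, assuming GNU grep (\b is a GNU extension) and files recreated from the samples in the question inside a scratch directory:

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'recruitment#madeup.com' 'joy#netnoir.net' 'hello#nom.com' \
    'mary#itcouldbereal.ac.uk' 'thisshouldbe#theonlyone.com' > listofemails

# Patterns are written without surrounding quotes and without blank lines.
printf '%s\n' 'recruitment#' 'netnoir.net' 'hello#' '\.ac.\b' > listofexpressions

# -v inverts the match, -i ignores case, -f reads one pattern per line.
result=$(grep -vif listofexpressions listofemails)
printf '%s\n' "$result"
```

On the sample data this leaves only thisshouldbe#theonlyone.com.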

Related

How to extract data in such a pattern using grep or awk?

I have multiple instances of the following pattern in my document:
Dipole Moment: [D]
X: 1.5279 Y: 0.1415 Z: 0.1694 Total: 1.5438
I want to extract the total dipole moment, so 1.5438. How can I pull this off?
When I throw in grep "Dipole Moment: [D]" filename, I don't get the line after. I am new to these command line interfaces. Any help you can provide would be greatly appreciated.
Could you please try following. Written and tested with shown samples in GNU awk.
awk '/Dipole Moment: \[D\]/{found=1;next} found{print $NF;found=""}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/Dipole Moment: \[D\]/{ ##Checking if line contains Dipole Moment: \[D\] escaped [ and ] here.
found=1 ##Setting found to 1 here.
next ##next will skip all further statements from here.
}
found{ ##Checking condition if found is NOT NULL then do following.
print $NF ##Printing last field of current line here.
found="" ##Nullifying found here.
}
' Input_file ##Mentioning Input_file name here.
Sed alternative:
sed -rn '/^Dipole/{n;s/(^[[:space:]]{5}.*[[:space:]]{5})(.*)(([[:space:]]{5}.*+[:][[:space:]]{5}.*){3})/\2/p}' file
Search for the line beginning with "Dipole" then read the next line. Split this line into three sections based on regular expressions and substitute the line for the second section only, printing the result.
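If the wanted value is always the last whitespace-separated item on the line after the header, a shorter sed form may be enough. This is only a sketch: the file name dipole.txt and the exact column spacing of the data line are assumptions, and it expects no trailing whitespace on that line.

```shell
# Build a small sample file; the spacing of the data line is assumed.
workdir=$(mktemp -d)
printf '%s\n' 'Dipole Moment: [D]' \
    '     X:      1.5279      Y:      0.1415      Z:      0.1694  Total:      1.5438' \
    > "$workdir/dipole.txt"

# On a header match, load the next line (n), delete everything up to and
# including the last whitespace character, and print what remains.
total=$(sed -n '/Dipole Moment: \[D\]/{n;s/.*[[:space:]]//p;}' "$workdir/dipole.txt")
printf '%s\n' "$total"
```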

How to find a match to a partial string and then delete the string from the reference file using awk?

I have a problem that I have been trying to solve, but have not been able to figure out how to do it. I have a reference file that has all of the devices in my inventory by bar code.
Reference file:
PTR10001,PRINTER,SN A
PTR10002,PRINTER,SN B
PTR10003,PRINTER,SN C
MON10001,MONITOR,SN A
MON10002,MONITOR,SN B
MON10003,MONITOR,SN C
CPU10001,COMPUTER,SN A
CPU10002,COMPUTER,SN B
CPU10003,COMPUTER,SN C
What I would like to do is make a file where I only have to put the abbreviation of what I need on it.
File 2 would look like this:
PTR
CPU
MON
MON
The desired output of this would be a file that would tell me what items by barcode that I need to pull off the shelf.
Desired output file:
PTR10001
CPU10001
MON10001
MON10002
As seen in the output, since I cannot have 2 of the same barcode, I need it to look through the reference file and find the first match. After the number is copied to the output file, I would like to remove the number from the reference file so that it doesn't repeat the number.
I have tried several iterations of awk, but have not been able to get the desired output.
The closest that I have gotten is the following code:
awk -F'/' '{ key = substr($1,1,3) } NR==FNR {id[key]=$1; next} key in id { $1=id[key] } { print }' $file1 $file2 > $file3
I am writing this in ksh, and would like to use awk as I think this would be the best answer to the problem.
Thanks for helping me with this.
First solution:
From your detailed description, I assume order doesn't matter, as you want to know what to pull off the shelf. So you could do the opposite, first read file2, count the items, and then go to the shelf and get them.
awk -F, 'FNR==NR{c[$0]++; next} c[substr($1,1,3)]-->0{print $1}' file2 file1
output:
PTR10001
MON10001
MON10002
CPU10001
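The condition c[substr($1,1,3)]-->0 parses as (c[key]--) > 0: the remaining request count for the prefix is compared with zero and then decremented. A spelled-out sketch of the same idea follows; the array name want and the scratch directory are made up for illustration, and unlike the one-liner it decrements only when an item is actually printed, which gives the same output here.

```shell
# Recreate the two sample files from the question.
workdir=$(mktemp -d)
printf '%s\n' 'PTR10001,PRINTER,SN A' 'PTR10002,PRINTER,SN B' 'PTR10003,PRINTER,SN C' \
    'MON10001,MONITOR,SN A' 'MON10002,MONITOR,SN B' 'MON10003,MONITOR,SN C' \
    'CPU10001,COMPUTER,SN A' 'CPU10002,COMPUTER,SN B' 'CPU10003,COMPUTER,SN C' \
    > "$workdir/file1"
printf '%s\n' PTR CPU MON MON > "$workdir/file2"

picked=$(awk -F, '
FNR == NR { want[$0]++; next }   # first file (file2): count requests per prefix
{
    key = substr($1, 1, 3)       # prefix of the barcode in file1
    if (want[key] > 0) {         # still something to pull for this prefix?
        print $1
        want[key]--
    }
}' "$workdir/file2" "$workdir/file1")
printf '%s\n' "$picked"
```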
Second solution:
Your awk is very close to what you want, but you need a second dimension in your array so that the existing ids are not overwritten. We will do it with a pseudo-2-D array (BTW GNU awk has real 2-dimensional arrays): the ids are stored as one comma-separated string like PTR10001,PTR10002,PTR10003, retrieved with split, and the id that was printed is removed from the shelf.
> cat tst.awk
BEGIN { FS="," }
NR==FNR {
key=substr($1,1,3)
ids[key] = (ids[key]? ids[key] "," $1: $1) #append new id.
next
}
$0 in ids {
split(ids[$0], tmp, ",")
print(tmp[1])
ids[$0]=substr(ids[$0],length(tmp[1])+2) #remove from shelf
}
Output
awk -f tst.awk file1 file2
PTR10001
CPU10001
MON10001
MON10002
Here we keep the order of file2 as this is based on the idea you have tried.
Could you please try following, written and tested with shown samples in GNU awk.
awk '
FNR==NR{
iniVal[$0]++
next
}
{
counter=substr($0,1,3)
}
counter in iniVal{
if(++currVal[counter]<=iniVal[counter]){
print $1
if(currVal[counter]==iniVal[counter]){ delete iniVal[counter] }
}
}
' Input_file2 FS="," Input_file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which is true when Input_file2 is being read.
iniVal[$0]++ ##Creating array iniVal with index of current line with increment of 1 each time it comes here.
next ##next will skip all further statements from here.
}
{
counter=substr($0,1,3) ##Creating counter variable which has 1st 3 characters of current line of Input_file1 here.
}
counter in iniVal{ ##Checking if counter is present in iniVal then do following.
if(++currVal[counter]<=iniVal[counter]){ ##Checking if currVal array with index of counter value is lesser than or equal to iniVal then do following.
print $1 ##Printing 1st field of current line here.
if(currVal[counter]==iniVal[counter]){ ##Checking if currVal value is equal to iniVal with index of counter here.
delete iniVal[counter] ##If above condition is TRUE then deleting iniVal entry for counter here.
}
}
}
' Input_file2 FS="," Input_file1 ##Mentioning Input_file names here.

How to skip first line between two patterns in awk?

I have the next script
cat foo.txt | awk '/ERROR/,/INFO/'
With the input of:
FooFoo
ERROR
Foo1
INFO
FooFoo
Now the result is:
ERROR
Foo1
INFO
I am looking for the next result:
Foo1
INFO
How I can make it work?
Thanks for your help
Give this a try:
awk '/ERROR/,/INFO/' foo.txt | tail -n +2
If your input is from a file, you don't need the cat. just awk '...' file
Could you please try following, written and tested with shown samples in GNU awk.
awk '
/ERROR/{
found=1
next
}
found{
val=(val?val ORS:"")$0
}
/INFO/{
print val
val=count=found=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/ERROR/{ ##Checking if line contains ERROR then do following.
found=1 ##Setting found variable here.
next ##next will skip all further statements from here.
}
found{ ##Checking here if found is SET then do following.
val=(val?val ORS:"")$0 ##Creating variable val and keep adding value to it in form of current line.
}
/INFO/{ ##Checking condition if INFO is found in current line then do following.
print val ##Printing val here.
val=count=found="" ##Nullifying val, count and found here.
}
' Input_file ##Mentioning Input_file name here.
Like this:
awk '
seen # a true (1) condition makes awk print the current line
/ERROR/{seen=1} # if the line matches ERROR, assign 1 to the seen flag
/INFO/{seen=0} # if the line matches INFO, assign 0 to the seen flag
' file
Output
Foo1
INFO
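Unlike the tail -n +2 variant, which drops only the very first line of the combined output, the flag approach handles each ERROR..INFO block on its own. A small sketch with two blocks; the extra input lines (Foo2, Tail) are made up for illustration:

```shell
# Two ERROR..INFO blocks; the ERROR line of each block should be skipped.
out=$(printf '%s\n' FooFoo ERROR Foo1 INFO FooFoo ERROR Foo2 INFO Tail |
    awk 'seen; /ERROR/{seen=1} /INFO/{seen=0}')
printf '%s\n' "$out"
```

The print test (seen) runs before the flag is updated, which is why the ERROR line is skipped while the INFO line is still printed.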

Understand the code of Split file to fasta

I understand the pattern matching, but how is the sequence read? The code only matches the pattern ">chr", so how does the sequence end up in the output file?
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Could you please go through following explanation once.
awk ' ##Starting awk program here.
/^>chr/{ ##Checking condition if any line starts with string >chr then do following.
OUT=substr($0,2) ".fa" ##Creating variable OUT whose value is the current line from its 2nd character onwards (the > is dropped) with .fa string concatenated to it.
} ##Closing block for condition ^>chr here.
{
print >> OUT ##Appending current line (header or sequence) to the file named by OUT, which is set above.
close(OUT) ##If we keep lots of files open we will get "Too many files opened" error(s), so closing the file after each write to avoid that error.
}
' Input_File ##Mentioning Input_file here which we are processing through awk.
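To see this on a concrete file: the header line sets OUT, and every line (the header included) is then appended to whichever file OUT currently names, so sequence lines follow their header into the same file. A sketch with a made-up two-record input; sample.fa and its contents are assumptions:

```shell
# Work in a scratch directory so the generated .fa files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '>chr1' 'ACGT' 'TTGA' '>chr2' 'GGCC' > sample.fa

# Same command as in the question, run on the sample input.
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' sample.fa

chr1=$(cat chr1.fa)
chr2=$(cat chr2.fa)
printf 'chr1.fa:\n%s\nchr2.fa:\n%s\n' "$chr1" "$chr2"
```

The >> matters here: because the file is closed after every line, a plain > would truncate it each time it is reopened. For the same reason, running the command twice appends to the files from the earlier run.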

Grepping all strings on the same line from multiple files

I am trying to find a way to grep all names onto one line per file, for 100 files. All names found in each file must appear on the same line.
FILE1
"company":"COMPANY1","companyDisplayName":"CM1","company":"COMPANY2","companyDisplayName":"CM2","company":"COMPANY3","companyDisplayName":"CM3",
FILE2
"company":"COMPANY99","companyDisplayName":"CM99"
The output I actually want is (with the file name as a prefix):
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99
I tried grep -oP '(?<="company":")[^"]*' * but I get results like this:
FILE1:COMPANY1
FILE1:COMPANY2
FILE1:COMPANY3
FILE2:COMPANY99
Could you please try following.
awk -F'[,:]' '
BEGIN{
OFS=","
}
{
for(i=1;i<=NF;i++){
if($i=="\"company\""){
val=(val?val OFS:"")$(i+1)
}
}
gsub(/\"/,"",val)
print FILENAME":"val
val=""
}
' Input_file1 Input_file2
Explanation: Adding explanation for above code.
awk -F'[,:]' ' ##Starting awk program here and setting field separator as colon OR comma here for all lines of Input_file(s).
BEGIN{ ##Starting BEGIN section of awk here.
OFS="," ##Setting OFS as comma here.
} ##Closing BEGIN BLOCK here.
{ ##Starting main BLOCK here.
for(i=1;i<=NF;i++){ ##Starting a for loop which starts from i=1 to till value of NF.
if($i=="\"company\""){ ##Checking condition if field value is equal to "company" then do following.
val=(val?val OFS:"")$(i+1) ##Creating a variable named val and appending the next field (the company name) to it, separated by OFS, each time cursor comes here.
} ##Closing BLOCK for if condition here.
} ##Closing BLOCK for, for loop here.
gsub(/\"/,"",val) ##Using gsub to globally remove all " from variable val here.
print FILENAME":"val ##Printing filename colon and variable val here.
val="" ##Nullifying variable val here.
} ##Closing main BLOCK here.
' Input_file1 Input_file2 ##Mentioning Input_file names here.
Output will be as follows.
Input_file1:COMPANY1,COMPANY2,COMPANY3
Input_file2:COMPANY99
EDIT: Adding a solution in case the OP needs to use grep and wants to build the final output from grep's output (though I would recommend the awk solution itself, since it does NOT need multiple commands or sub-shells).
grep -oP '(?<="company":")[^"]*' * | awk 'BEGIN{FS=":";OFS=","} prev!=$1 && val{print prev":"val;val=""} {val=(val?val OFS:"")$2;prev=$1} END{if(val){print prev":"val}}'
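The awk half of that pipeline can be checked on its own by feeding it the grep output shown in the question. A sketch, assuming the file names contain no colon (since : is used as the field separator) and that lines for the same file arrive next to each other, as grep prints them:

```shell
# Feed the grep output from the question straight into the grouping awk.
grouped=$(printf '%s\n' 'FILE1:COMPANY1' 'FILE1:COMPANY2' 'FILE1:COMPANY3' 'FILE2:COMPANY99' |
    awk 'BEGIN{FS=":";OFS=","} prev!=$1 && val{print prev":"val;val=""} {val=(val?val OFS:"")$2;prev=$1} END{if(val){print prev":"val}}')
printf '%s\n' "$grouped"
```

When the file name changes, the collected names are flushed for the previous file; the END block flushes the last group.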
There are two tools that can take the output of your grep command and reformat it the way you want. First tool is GNU datamash. Second is tsv-summarize from eBay's tsv-utils package (disclaimer: I'm the author). Both tools solve this in similar ways:
$ # The grep output
$ echo $'FILE1:COMPANY1\nFILE1:COMPANY2\nFILE1:COMPANY3\nFILE2:COMPANY99' > grep-output.txt
$ cat grep-output.txt
FILE1:COMPANY1
FILE1:COMPANY2
FILE1:COMPANY3
FILE2:COMPANY99
$ # Using GNU datamash
$ cat grep-output.txt | datamash --field-separator : --group 1 unique 2
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99
$ # Using tsv-summarize
$ cat grep-output.txt | tsv-summarize --delimiter : --group-by 1 --unique-values 2 --values-delimiter ,
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99