How to extract data in such a pattern using grep or awk?

I have multiple instances of the following pattern in my document:
Dipole Moment: [D]
X: 1.5279 Y: 0.1415 Z: 0.1694 Total: 1.5438
I want to extract the total dipole moment, so 1.5438. How can I pull this off?
When I run grep "Dipole Moment: [D]" filename, I get only the matching line, not the line after it. I am new to these command line interfaces. Any help you can provide would be greatly appreciated.

Could you please try the following, written and tested with the shown samples in GNU awk.
awk '/Dipole Moment: \[D\]/{found=1;next} found{print $NF;found=""}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/Dipole Moment: \[D\]/{ ##Checking if line contains Dipole Moment: [D]; the [ and ] are escaped in the regex.
found=1 ##Setting found to 1 here.
next ##next will skip all further statements from here.
}
found{ ##Checking condition if found is NOT NULL then do following.
print $NF ##Printing last field of current line here.
found="" ##Nullifying found here.
}
' Input_file ##Mentioning Input_file name here.
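The solution can be checked end-to-end with the shown samples (Input_file here is just a stand-in name):

```shell
# Recreate the sample data from the question.
printf 'Dipole Moment: [D]\nX: 1.5279 Y: 0.1415 Z: 0.1694 Total: 1.5438\n' > Input_file

# Flag the header line, then print the last field of the following line.
result=$(awk '/Dipole Moment: \[D\]/{found=1;next} found{print $NF;found=""}' Input_file)
echo "$result"   # 1.5438
```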

Sed alternative:
sed -n '/^Dipole Moment/{n;s/.*Total:[[:space:]]*//p}' file
Search for the line beginning with "Dipole Moment", read in the next line, delete everything up to and including "Total:" and its trailing spaces, and print what remains (the total value).
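Since the question asks about grep: GNU grep's -A1 option prints one line of trailing context after each match, which can then be reduced to the value after Total: (a sketch assuming GNU grep and that Total is always the last field):

```shell
printf 'Dipole Moment: [D]\nX: 1.5279 Y: 0.1415 Z: 0.1694 Total: 1.5438\n' > file

# -A1 emits the matching line plus the line after it;
# awk keeps only the last field of the context line.
total=$(grep -A1 'Dipole Moment: \[D\]' file | awk '/Total:/{print $NF}')
echo "$total"   # 1.5438
```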

Related

Only returning matching patterns against file with grep

I am trying to invert-match a list of emails against another list using grep, so that only emails not matching those expressions are returned.
The list of emails looks like:
recruitment#madeup.com
joy#netnoir.net
hello#nom.com
mary#itcouldbereal.ac.uk
thisshouldbe#theonlyone.com
The list of expressions that I am comparing it to is:
recruitment#
netnoir.net
hello#
"\.ac.\b"
I have tried:
grep -vif listofexpressions listofemails
The problems I am facing are
1.) nothing is returned
2.) the .ac. is not recognized when read from the file, but if I use it directly with
grep "\.ac.\b" filename
then it works.
If I change it to
grep -if listofexpressions listofemails
then most of the expressions that do not need escaping are shown highlighted but the others are shown as well.
My expected output would be
thisshouldbe#theonlyone.com
I am sure this is simple but after reading the man page of grep and googling, I still cannot work it out.
Thanks
With your shown samples, could you please try the following. Written and tested in GNU awk.
awk '
FNR==NR{
arr[$0]
next
}
{
found=""
for(key in arr){
if(index($0,key)){
found=1
next
}
}
if(found==""){
print
}
}
' listofexpressions listofemails
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when expressions file is being read.
arr[$0] ##Created arr with index of current line here.
next ##next will skip all further statements from here.
}
{
found="" ##Nullifying found here.
for(key in arr){ ##Going through arr elements here.
if(index($0,key)){ ##Checking if key is a substring of the current line via index().
found=1 ##Setting found to 1 here.
next ##next will skip all further statements.
}
}
if(found==""){ ##Checking condition if found is NULL then print that line.
print
}
}
' listofexpressions listofemails ##Mentioning Input_file names here.
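A quick check of this approach with the shown samples. One caveat: index() does literal substring matching, so the .ac. entry has to be written as a plain substring rather than the quoted regex "\.ac.\b" from the question; the file names below follow the question's:

```shell
# Emails to filter (as shown in the question).
printf 'recruitment#madeup.com\njoy#netnoir.net\nhello#nom.com\nmary#itcouldbereal.ac.uk\nthisshouldbe#theonlyone.com\n' > listofemails

# Literal substrings to exclude on; note plain .ac. instead of a regex.
printf 'recruitment#\nnetnoir.net\nhello#\n.ac.\n' > listofexpressions

kept=$(awk '
FNR==NR{arr[$0]; next}                    # first file: store each expression
{
  for(key in arr) if(index($0,key)) next  # drop lines containing any expression
  print                                   # otherwise keep the line
}' listofexpressions listofemails)
echo "$kept"   # thisshouldbe#theonlyone.com
```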

How to skip first line between two patterns in awk?

I have the next script
cat foo.txt | awk '/ERROR/,/INFO/'
With the input of:
FooFoo
ERROR
Foo1
INFO
FooFoo
Now the result is:
ERROR
Foo1
INFO
I am looking for the next result:
Foo1
INFO
How can I make it work?
Thanks for your help
Give this a try:
awk '/ERROR/,/INFO/' foo.txt | tail -n +2
If your input is from a file, you don't need the cat; just use awk '...' file
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
/ERROR/{
found=1
next
}
found{
val=(val?val ORS:"")$0
}
/INFO/{
print val
val=found=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/ERROR/{ ##Checking if line contains ERROR then do following.
found=1 ##Setting found variable here.
next ##next will skip all further statements from here.
}
found{ ##Checking here if found is SET then do following.
val=(val?val ORS:"")$0 ##Creating variable val and keep adding value to it in form of current line.
}
/INFO/{ ##Checking condition if INFO is found in current line then do following.
print val ##Printing val here.
val=found="" ##Nullifying val and found here so collection stops until the next ERROR.
}
' Input_file ##Mentioning Input_file name here.
Like this:
awk '
seen # a true (1) seen flag makes awk print the current line
/ERROR/{seen=1} # if the line matches ERROR, set the seen flag to 1
/INFO/{seen=0} # if the line matches INFO, reset the seen flag to 0
' file
Output
Foo1
INFO
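A sed equivalent of the range approach (assuming GNU sed): select the ERROR..INFO range, delete the opening ERROR line, and print the rest.

```shell
printf 'FooFoo\nERROR\nFoo1\nINFO\nFooFoo\n' > foo.txt

# -n suppresses default output; within the range, drop the ERROR
# line itself and print everything else up to and including INFO.
out=$(sed -n '/ERROR/,/INFO/{/ERROR/d;p}' foo.txt)
echo "$out"
# Foo1
# INFO
```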

How to filter a field with awk following a pattern

I have a file with that format:
Topic:test_replication PartitionCount:1 ReplicationFactor:3 Configs:retention.ms=604800000,delete.retention.ms=86400000,cleanup.policy=delete,max.message.bytes=1000012,min.insync.replicas=2,retention.bytes=-1
Topic:teste2e_funcional PartitionCount:12 ReplicationFactor:3 Configs:min.cleanable.dirty.ratio=0.00001,delete.retention.ms=86400000,cleanup.policy=delete,min.insync.replicas=2,segment.ms=604800000,retention.bytes=-1
Topic:ticket_dl.replica_cloudera PartitionCount:3 ReplicationFactor:3 Configs:message.downconversion.enable=true,file.delete.delay.ms=60000,segment.ms=604800000,min.compaction.lag.ms=0,retention.bytes=-1,segment.index.bytes=10485760,cleanup.policy=delete,message.timestamp.difference.max.ms=9223372036854775807,segment.jitter.ms=0,preallocate=false,message.timestamp.type=CreateTime,message.format.version=2.2-IV1,segment.bytes=1073741824,max.message.bytes=1000000,unclean.leader.election.enable=false,retention.ms=604800000,flush.ms=9223372036854775807,delete.retention.ms=31536000000,min.insync.replicas=2,flush.messages=9223372036854775807,compression.type=producer,index.interval.bytes=4096,min.cleanable.dirty.ratio=0.5
And I want to have only the value of Topic (e.g. test_replication) and the value of min.insync.replicas (e.g. 2)
I know that it is possible to do with regular expressions, but I don't know how. The problem for me is that min.insync.replicas is not in the same position in every line, so if I use awk's -F option with, for example, a comma, I will get different values for min.insync.replicas.
Could you please try the following.
awk '
match($0,/Topic:[^ ]*/){
topic=substr($0,RSTART+6,RLENGTH-6)
match($0,/min\.insync\.replicas[^,]*/)
print topic,substr($0,RSTART+20,RLENGTH-20)
topic=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/Topic:[^ ]*/){ ##Using match to match regex Topic: up to the first space.
topic=substr($0,RSTART+6,RLENGTH-6) ##Creating variable topic holding the sub-string of the current line from RSTART+6 for RLENGTH-6 characters, i.e. dropping the Topic: prefix.
match($0,/min\.insync\.replicas[^,]*/) ##Using match again to match regex from min up to the next comma.
print topic,substr($0,RSTART+20,RLENGTH-20) ##Printing topic and the sub-string from RSTART+20 for RLENGTH-20 characters, i.e. the value after min.insync.replicas=.
topic="" ##Nullify variable topic here.
}
' Input_file ##Mentioning Input_file name here.
2nd solution: Adding a sed solution here.
sed 's/Topic:\([^ ]*\).*min\.insync\.replicas=\([^,]*\).*/\1 \2/' Input_file
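The sed solution can be sanity-checked against the first sample line:

```shell
line='Topic:test_replication PartitionCount:1 ReplicationFactor:3 Configs:retention.ms=604800000,delete.retention.ms=86400000,cleanup.policy=delete,max.message.bytes=1000012,min.insync.replicas=2,retention.bytes=-1'

# \1 captures the topic name (after "Topic:" up to the first space),
# \2 captures the value after "min.insync.replicas=" up to the next comma.
out=$(printf '%s\n' "$line" | sed 's/Topic:\([^ ]*\).*min\.insync\.replicas=\([^,]*\).*/\1 \2/')
echo "$out"   # test_replication 2
```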
Sorry for the question before. It was very simple:
awk '
match($0,/Topic:[^ ]*/){
topic=substr($0,RSTART+6,RLENGTH-6)
match($0,/min\.insync\.replicas[^,]*/)
mininsync=substr($0,RSTART+20,RLENGTH-20)
match($0,/retention\.ms[^,]*/)
retention=substr($0,RSTART+13,RLENGTH-13)
print topic",",mininsync,","retention
topic=""
}
' Input_file

Understand the code of Split file to fasta

I understand the matching pattern, but the code only matches the pattern ">chr", so how are the sequence lines read, and how do they end up in the output file?
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Could you please go through the following explanation.
awk ' ##Starting awk program here.
/^>chr/{ ##Checking if a line starts with the string >chr; if so do the following.
OUT=substr($0,2) ".fa" ##Creating variable OUT whose value is the substring of the current line from the 2nd character onward (the header without the leading >), with the string .fa concatenated to it.
} ##Closing block for condition ^>chr here.
{
print >> OUT ##Appending the current line (header or sequence) to the file whose name is the value of OUT set above; sequence lines reuse the most recent OUT.
close(OUT) ##If we keep many files open we will get a "too many open files" error, so close each file after writing to avoid that.
}
' Input_File ##Mentioning Input_file here which we are processing through awk.
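A minimal demonstration of the split, assuming each sequence line follows its >chr header:

```shell
# Start clean so >> appends into fresh files.
rm -f chr1.fa chr2.fa
printf '>chr1\nACGT\n>chr2\nTTTT\n' > Input_File

# Header lines set OUT; every line (header and sequence alike) is
# appended to the current OUT file, which is closed right after.
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File

cat chr1.fa
# >chr1
# ACGT
```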

Grepping all strings on the same line from multiple files

Trying to find a way to grep all names from 100 files; all the names found in each file must appear on the same line.
FILE1
"company":"COMPANY1","companyDisplayName":"CM1","company":"COMPANY2","companyDisplayName":"CM2","company":"COMPANY3","companyDisplayName":"CM3",
FILE2
"company":"COMPANY99","companyDisplayName":"CM99"
The output I actually want is the following (include the file name as a prefix):
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99
I tried grep -oP '(?<="company":")[^"]*' * but I get results like this:
FILE1:COMPANY1
FILE1:COMPANY2
FILE1:COMPANY3
FILE2:COMPANY99
Could you please try the following.
awk -F'[,:]' '
BEGIN{
OFS=","
}
{
for(i=1;i<=NF;i++){
if($i=="\"company\""){
val=(val?val OFS:"")$(i+1)
}
}
gsub(/\"/,"",val)
print FILENAME":"val
val=""
}
' Input_file1 Input_file2
Explanation: Adding explanation for above code.
awk -F'[,:]' ' ##Starting awk program here and setting field separator as colon OR comma here for all lines of Input_file(s).
BEGIN{ ##Starting BEGIN section of awk here.
OFS="," ##Setting OFS as comma here.
} ##Closing BEGIN BLOCK here.
{ ##Starting main BLOCK here.
for(i=1;i<=NF;i++){ ##Starting a for loop which starts from i=1 to till value of NF.
if($i=="\"company\""){ ##Checking condition if field value is equal to "company" then do following.
val=(val?val OFS:"")$(i+1) ##Appending the next field (the company value) to variable val, separated by OFS.
} ##Closing BLOCK for if condition here.
} ##Closing BLOCK for, for loop here.
gsub(/\"/,"",val) ##Using gsub to globally remove all " characters from variable val here.
print FILENAME":"val ##Printing filename colon and variable val here.
val="" ##Nullifying variable val here.
} ##Closing main BLOCK here.
' Input_file1 Input_file2 ##Mentioning Input_file names here.
Output will be as follows.
Input_file1:COMPANY1,COMPANY2,COMPANY3
Input_file2:COMPANY99
EDIT: Adding a solution in case OP needs to use grep and wants to get the final output from its output (though I recommend the awk solution itself, since it does not use multiple commands or sub-shells).
grep -oP '(?<="company":")[^"]*' * | awk 'BEGIN{FS=":";OFS=","} prev!=$1 && val{print prev":"val;val=""} {val=(val?val OFS:"")$2;prev=$1} END{if(val){print prev":"val}}'
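A quick run of that pipeline's awk stage against the grep output shown in the question:

```shell
# Simulated grep -oP output: FILENAME:match lines.
out=$(printf 'FILE1:COMPANY1\nFILE1:COMPANY2\nFILE1:COMPANY3\nFILE2:COMPANY99\n' |
awk 'BEGIN{FS=":";OFS=","}
     prev!=$1 && val{print prev":"val; val=""}  # new file name: flush collected values
     {val=(val?val OFS:"")$2; prev=$1}          # collect values per file name
     END{if(val) print prev":"val}')            # flush the last group
echo "$out"
# FILE1:COMPANY1,COMPANY2,COMPANY3
# FILE2:COMPANY99
```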
There are two tools that can take the output of your grep command and reformat it the way you want. First tool is GNU datamash. Second is tsv-summarize from eBay's tsv-utils package (disclaimer: I'm the author). Both tools solve this in similar ways:
$ # The grep output
$ echo $'FILE1:COMPANY1\nFILE1:COMPANY2\nFILE1:COMPANY3\nFILE2:COMPANY99' > grep-output.txt
$ cat grep-output.txt
FILE1:COMPANY1
FILE1:COMPANY2
FILE1:COMPANY3
FILE2:COMPANY99
$ # Using GNU datamash
$ cat grep-output.txt | datamash --field-separator : --group 1 unique 2
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99
$ # Using tsv-summarize
$ cat grep-output.txt | tsv-summarize --delimiter : --group-by 1 --unique-values 2 --values-delimiter ,
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99