How to filter a field with awk following a pattern - awk

I have a file with this format:
Topic:test_replication PartitionCount:1 ReplicationFactor:3 Configs:retention.ms=604800000,delete.retention.ms=86400000,cleanup.policy=delete,max.message.bytes=1000012,min.insync.replicas=2,retention.bytes=-1
Topic:teste2e_funcional PartitionCount:12 ReplicationFactor:3 Configs:min.cleanable.dirty.ratio=0.00001,delete.retention.ms=86400000,cleanup.policy=delete,min.insync.replicas=2,segment.ms=604800000,retention.bytes=-1
Topic:ticket_dl.replica_cloudera PartitionCount:3 ReplicationFactor:3 Configs:message.downconversion.enable=true,file.delete.delay.ms=60000,segment.ms=604800000,min.compaction.lag.ms=0,retention.bytes=-1,segment.index.bytes=10485760,cleanup.policy=delete,message.timestamp.difference.max.ms=9223372036854775807,segment.jitter.ms=0,preallocate=false,message.timestamp.type=CreateTime,message.format.version=2.2-IV1,segment.bytes=1073741824,max.message.bytes=1000000,unclean.leader.election.enable=false,retention.ms=604800000,flush.ms=9223372036854775807,delete.retention.ms=31536000000,min.insync.replicas=2,flush.messages=9223372036854775807,compression.type=producer,index.interval.bytes=4096,min.cleanable.dirty.ratio=0.5
I want only the value of Topic (e.g. test_replication) and the value of min.insync.replicas (e.g. 2).
I know this can be done with a regular expression, but I don't know how. The problem is that min.insync.replicas is not always in the same position, so if I split with awk's -F option on, for example, a comma, the same field number gives me a different setting on each line.

Could you please try the following.
awk '
match($0,/Topic:[^ ]*/){
topic=substr($0,RSTART+6,RLENGTH-6)
match($0,/min\.insync\.replicas[^,]*/)
print topic,substr($0,RSTART+20,RLENGTH-20)
topic=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/Topic:[^ ]*/){ ##Using match function to match regex Topic: up to the next space.
topic=substr($0,RSTART+6,RLENGTH-6) ##Creating variable topic, which is the matched text without its 6-character "Topic:" prefix.
match($0,/min\.insync\.replicas[^,]*/) ##Using match again to match from min.insync.replicas up to the next comma.
print topic,substr($0,RSTART+20,RLENGTH-20) ##Printing topic and the matched text without its 20-character "min.insync.replicas=" prefix.
topic="" ##Nullify variable topic here.
}
' Input_file ##Mentioning Input_file name here.
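To make the RSTART/RLENGTH arithmetic in the explanation concrete, here is a small sketch that prints both variables next to the extracted value. The sample line is shortened from the question, and the shell variable names line and result are only illustrative.

```shell
# Sample line shortened from the question (illustrative data).
line='Topic:test_replication PartitionCount:1 ReplicationFactor:3 Configs:min.insync.replicas=2'

result=$(printf '%s\n' "$line" | awk '
match($0, /Topic:[^ ]*/) {
  # RSTART is the 1-based position where the match begins,
  # RLENGTH is the number of characters that matched.
  # Skipping 6 characters drops the "Topic:" prefix.
  print RSTART, RLENGTH, substr($0, RSTART + 6, RLENGTH - 6)
}')
printf '%s\n' "$result"
```

Here the match starts at position 1 and covers the 22 characters of "Topic:test_replication", so the substring starts at 7 and is 16 characters long.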
2nd solution: Adding a sed solution here.
sed 's/Topic:\([^ ]*\).*min\.insync\.replicas=\([^,]*\).*/\1 \2/' Input_file

Sorry for the earlier questions. It was very simple:
awk '
match($0,/Topic:[^ ]*/){
topic=substr($0,RSTART+6,RLENGTH-6)
match($0,/min\.insync\.replicas[^,]*/)
mininsync=substr($0,RSTART+20,RLENGTH-20)
match($0,/[:,]retention\.ms=[^,]*/) ##Leading [:,] keeps this from matching inside delete.retention.ms.
retention=substr($0,RSTART+14,RLENGTH-14)
print topic","mininsync","retention
topic=""
}
' Input_file
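When several settings are needed, counting prefix lengths gets error-prone, and a bare retention\.ms pattern can also match inside delete.retention.ms. A possible alternative, sketched here under the assumption that every line has a single Configs: field holding comma-separated key=value pairs, is to load the pairs into an array and look the keys up by exact name. The sample lines are shortened from the question, and "-" is an arbitrary placeholder chosen for a missing key.

```shell
result=$(printf '%s\n' \
  'Topic:test_replication PartitionCount:1 ReplicationFactor:3 Configs:retention.ms=604800000,delete.retention.ms=86400000,min.insync.replicas=2' \
  'Topic:teste2e_funcional PartitionCount:12 ReplicationFactor:3 Configs:delete.retention.ms=86400000,min.insync.replicas=2,segment.ms=604800000' |
awk '
{
  topic = ""
  split("", cfg)                                  # portable way to empty the array
  for (i = 1; i <= NF; i++) {
    if (index($i, "Topic:") == 1) topic = substr($i, 7)
    if (index($i, "Configs:") == 1) {
      n = split(substr($i, 9), pair, ",")         # one element per key=value pair
      for (j = 1; j <= n; j++) {
        eq = index(pair[j], "=")
        if (eq > 0) cfg[substr(pair[j], 1, eq - 1)] = substr(pair[j], eq + 1)
      }
    }
  }
  mis = ("min.insync.replicas" in cfg) ? cfg["min.insync.replicas"] : "-"
  ret = ("retention.ms" in cfg) ? cfg["retention.ms"] : "-"
  print topic "," mis "," ret
}')
printf '%s\n' "$result"
```

The second sample line has no retention.ms of its own, so it prints the placeholder instead of picking up the value of delete.retention.ms.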

Related

How to extract data in such a pattern using grep or awk?

I have multiple instances of the following pattern in my document:
Dipole Moment: [D]
X: 1.5279 Y: 0.1415 Z: 0.1694 Total: 1.5438
I want to extract the total dipole moment, so 1.5438. How can I pull this off?
When I throw in grep "Dipole Moment: [D]" filename, I don't get the line after. I am new to these command line interfaces. Any help you can provide would be greatly appreciated.
Could you please try the following. Written and tested with the shown samples in GNU awk.
awk '/Dipole Moment: \[D\]/{found=1;next} found{print $NF;found=""}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/Dipole Moment: \[D\]/{ ##Checking if line contains Dipole Moment: [D], with [ and ] escaped in the regex.
found=1 ##Setting found to 1 here.
next ##next will skip all further statements from here.
}
found{ ##Checking condition if found is NOT NULL then do following.
print $NF ##Printing last field of current line here.
found="" ##Nullifying found here.
}
' Input_file ##Mentioning Input_file name here.
Sed alternative:
sed -n '/Dipole Moment: \[D\]/{n;s/.*Total:[[:space:]]*//p;}' file
Search for the line containing "Dipole Moment: [D]", then read the next line. On that line, delete everything up to and including "Total:" and the whitespace after it, and print what remains.
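Since the question started from grep, here is a sketch of how grep can return the line after the match as well. It assumes a grep that supports the -A option (GNU and BSD grep do, but it is not in POSIX), and the surrounding sample lines are made up for illustration.

```shell
result=$(printf '%s\n' \
  'Some other line' \
  'Dipole Moment: [D]' \
  'X: 1.5279 Y: 0.1415 Z: 0.1694 Total: 1.5438' \
  'Another line' |
grep -A1 'Dipole Moment: \[D\]' |   # -A1 also prints one line after each match
awk '/Total:/ {print $NF}')         # keep only the last field of the Total line
printf '%s\n' "$result"
```

With multiple instances in the file, grep prints a "--" separator between groups, and the /Total:/ filter in awk skips those separators.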

How do I proceed with awk after an if statement

My input:
Jun 26 06:54:33 host dovecot: imap-login: Login: user=<xxx>, method=PLAIN, rip=111.111.111.111, lip=111.111.111.111, mpid=00000, TLS, session=<LVVIgfWodFBZD+3W>
I would like to get the IP of the rip entry with one command.
awk '{ if ($6 == "imap-login:" && match($10,/rip/) ) { print $10 } }'
gives me "rip=78.47.14.44,"
How can I get only the IP?
Could you please try the following, written and tested with the shown samples in GNU awk.
awk 'match($0,/rip[^,]*/){print substr($0,RSTART+4,RLENGTH-4)}' Input_file
OR, as per kvantour sir's suggestion, anchor the match on the preceding separator so that rip is only found as a key of its own:
awk 'match($0,/[:,] rip=[^,]*/){print substr($0,RSTART+6,RLENGTH-6)}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/rip[^,]*/){ ##Using match to match regex from rip up to the next comma in current line; if a match is found, it sets variables RSTART and RLENGTH for the match.
print substr($0,RSTART+4,RLENGTH-4) ##Printing sub-string starting at RSTART+4 with length RLENGTH-4, which skips the 4 characters of rip= and leaves only the IP.
}
' Input_file ##Mentioning Input_file name here.
$ awk -F'[ =,]+' '$6=="imap-login:"{print $13}' file
111.111.111.111
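If you would rather stay close to the original if-based attempt, a possible variant is to loop over the fields after imap-login: and strip the key and the trailing comma from whichever field starts with rip=. This is only a sketch; it assumes whitespace-separated fields as in the sample line.

```shell
line='Jun 26 06:54:33 host dovecot: imap-login: Login: user=<xxx>, method=PLAIN, rip=111.111.111.111, lip=111.111.111.111, mpid=00000, TLS, session=<LVVIgfWodFBZD+3W>'

result=$(printf '%s\n' "$line" | awk '
$6 == "imap-login:" {
  for (i = 7; i <= NF; i++) {
    if ($i ~ /^rip=/) {
      ip = $i
      sub(/^rip=/, "", ip)   # drop the key
      sub(/,$/, "", ip)      # drop the trailing comma
      print ip
      break
    }
  }
}')
printf '%s\n' "$result"
```

Because the field is found by name rather than by number, it keeps working if the position of rip= changes.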

AIX/KSH Extract string from a comma separated line

I want to extract the part "virtual_eth_adapters" from the following comma separated line:
lpar_io_pool_ids=none,max_virtual_slots=300,"virtual_serial_adapters=0/server/1/any//any/1,1/server/1/any//any/1","virtual_scsi_adapters=166/client/1/ibm/166/0,266/client/2/ibm/266/0",virtual_eth_adapters=116/0/263,proc_mode=shared,min_proc_units=0.5,desired_proc_units=2.0,max_proc_units=8.0
I'm using AIX with ksh.
I found a workaround with awk and the -F flag to separate the string with a delimiter and then print the item by its field number. But if the input string changes, the field number may differ...
1st solution: Could you please try the following in case you want to print the string virtual_eth_adapters too in the output.
awk '
match($0,/virtual_eth_adapters[^,]*/){
print substr($0,RSTART,RLENGTH)
}
' Input_file
Output will be as follows.
virtual_eth_adapters=116/0/263
2nd solution: In case you want to print only the value for the string virtual_eth_adapters, then try the following.
awk '
match($0,/virtual_eth_adapters[^,]*/){
print substr($0,RSTART+21,RLENGTH-21)
}
' Input_file
Output will be as follows.
116/0/263
Explanation: Adding explanation for code.
awk ' ##Starting awk program here.
match($0,/virtual_eth_adapters[^,]*/){ ##Using match function of awk here, to match from string virtual_eth_adapters till first occurrence of comma(,)
print substr($0,RSTART,RLENGTH) ##Printing sub-string whose starting value is RSTART and till value of RLENGTH, where RSTART and RLENGTH variables will set once a regex found by above line.
}
' Input_file ##Mentioning Input_file name here.
I use this approach to pull data out of the middle of a line.
awk -F'virtual_eth_adapters=' 'NF>1{split($2,a,",");print a[1]}' file
116/0/263
It's short and easy to learn (no counting or regex needed).
-F'virtual_eth_adapters=' splits the line on virtual_eth_adapters=
NF>1 is true if there is more than one field (the line contains virtual_eth_adapters=)
split($2,a,",") splits the part after the key into array a, separated by ,
print a[1] prints the first element of array a
And one more solution (assuming the position of the string is fixed):
awk -F\, '{print $7}'
If you need only the value, try this:
awk -F\, '{print $7}'|awk -F\= '{print $2}'
It is also possible to get the value this way:
awk -F\, '{split($7,a,"=");print a[2]}'
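If several attributes are needed from the same line, the lookup by name can be wrapped in a small shell function. This is only a sketch: get_attr is a hypothetical helper name, and it assumes the wanted value contains no commas, so it would return only the first part of a quoted multi-value attribute such as virtual_scsi_adapters.

```shell
# get_attr KEY: read lines on stdin and print the value of KEY=... (hypothetical helper).
get_attr() {
  awk -v key="$1" '
  {
    n = split($0, part, ",")
    for (i = 1; i <= n; i++) {
      p = part[i]
      sub(/^"/, "", p)                       # tolerate an opening double quote
      if (index(p, key "=") == 1) {          # the piece must start with KEY=
        print substr(p, length(key) + 2)     # text after the = sign
        exit
      }
    }
  }'
}

line='lpar_io_pool_ids=none,max_virtual_slots=300,"virtual_serial_adapters=0/server/1/any//any/1,1/server/1/any//any/1","virtual_scsi_adapters=166/client/1/ibm/166/0,266/client/2/ibm/266/0",virtual_eth_adapters=116/0/263,proc_mode=shared,min_proc_units=0.5,desired_proc_units=2.0,max_proc_units=8.0'

eth=$(printf '%s\n' "$line" | get_attr virtual_eth_adapters)
minproc=$(printf '%s\n' "$line" | get_attr min_proc_units)
printf '%s\n' "$eth" "$minproc"
```

Requiring the key at the start of a piece means min_proc_units does not accidentally pick up another attribute that merely contains a similar name.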

Understand the code of Split file to fasta

I understand the pattern matching, but the code only matches the pattern ">chr", so how does the sequence itself get written to the output file?
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Could you please go through the following explanation once.
awk ' ##Starting awk program here.
/^>chr/{ ##Checking condition if the line starts with the string >chr, then do following.
OUT=substr($0,2) ".fa" ##Creating variable OUT, whose value is the line from its 2nd character onward (dropping the leading >) with .fa appended.
} ##Closing block for condition ^>chr here.
{ ##This block has no condition, so it runs for every line, header and sequence lines alike.
print >> OUT ##Appending current line to the file named in OUT; OUT keeps the name from the most recent header, so sequence lines go to the same file.
close(OUT) ##If we keep a lot of files open we will get "Too many files opened" error(s), so closing the file after each write to avoid that.
}
' Input_File ##Mentioning Input_file here which we are processing through awk.
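To see the unconditional block at work, here is a sketch on a tiny made-up input. It is a variant rather than the code from the question: it closes a file only when the next header arrives instead of after every line, and it writes into a scratch directory (workdir, created with mktemp -d, an assumption about the environment) so the current directory stays untouched.

```shell
workdir=$(mktemp -d)    # scratch directory for the demo output files

printf '%s\n' '>chr1' 'ACGT' 'TTGA' '>chr2' 'GGCC' |
awk -v dir="$workdir" '
/^>chr/ {
  if (out != "") close(out)            # finished with the previous record
  out = dir "/" substr($0, 2) ".fa"    # e.g. ">chr1" becomes "<dir>/chr1.fa"
}
out != "" { print > out }              # header and sequence lines both land here
'
cat "$workdir/chr1.fa"
```

The sequence lines never match /^>chr/, yet they still reach the file because the second rule runs for every line and the variable out still holds the name set by the last header.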

Find string then print what comes next until another string

Here's my input.file (thousands of lines):
FN545816.1 EMBL CDS 9450 9857 . + 0 ID=cds-CBE01461.1;Parent=gene-CDR20291_3551;Dbxref=EnsemblGenomes-Gn:CDR20291_3551,EnsemblGenomes-Tr:CBE01461,GOA:C9YHF8,InterPro:IPR003594,UniProtKB/TrEMBL:C9YHF8,NCBI_GP:CBE01461.1;Name=CBE01461.1;gbkey=CDS;gene=rsbW;product=anti-sigma-B factor (serine-protein kinase);protein_id=CBE01461.1;transl_table=11
I want to extract only what comes after product= up to the next ;
So, in this case, I want to get "anti-sigma-B factor (serine-protein kinase)"
I tried this:
awk '{for(i=1; i<=NF; i++) if($i~/*product=/) print $(i+1)}' input.file > output.file
but it prints only "factor" (presumably because there's no space between "product=" and "anti-sigma-B"). It doesn't print the rest either.
I tried many previous solutions but none gave what I want.
Thank you.
Could you please try the following.
awk 'match($0,/product=[^;]*/){print substr($0,RSTART+8,RLENGTH-8)}' Input_file
Explanation: Adding explanation for above code too now.
awk ' ##Starting awk program here.
match($0,/product=[^;]*/){ ##Using match function of awk here, giving a regex that matches from the string product= up to the first occurrence of ;
print substr($0,RSTART+8,RLENGTH-8) ##Printing the match without its 8-character product= prefix: the substring starts at RSTART+8 and is RLENGTH-8 characters long. RSTART and RLENGTH are built-in variables set by match(): RSTART is the start position of the match and RLENGTH is the length of the matched text.
}' Input_file ##Mentioning Input_file name here.
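The attempt in the question can also be made to work by splitting on ; instead of whitespace, so that the spaces inside the product text stay within one field. This is a sketch on a shortened copy of the sample line; it assumes product= is never the very first attribute, because the first ;-separated field still carries the leading columns.

```shell
line='FN545816.1 EMBL CDS 9450 9857 . + 0 ID=cds-CBE01461.1;Parent=gene-CDR20291_3551;Name=CBE01461.1;gbkey=CDS;gene=rsbW;product=anti-sigma-B factor (serine-protein kinase);protein_id=CBE01461.1;transl_table=11'

result=$(printf '%s\n' "$line" | awk -F';' '
{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^product=/) {      # field starts with the key
      print substr($i, 9)        # skip the 8 characters of product=
      break
    }
}')
printf '%s\n' "$result"
```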