awk extract of a series of lines

I am stuck trying to find the right way to use awk to extract the versions between "[]" from
Version Repository Repository URL
[1.0.0.44] repo-0 file://test/test-1.0.0.44-features.xml
[1.0.0.21] repo-0 file://test/test-1.0.0.21-features.xml
Are there any quick, efficient one-liners anyone can help with, please?

With awk, using square brackets as the field separators, output field 2 except for record number 1:
awk -F '[][]' 'NR > 1 {print $2}'
Or use grep, whose -o option is useful for extracting substrings (-P needs a grep with PCRE support, such as GNU grep):
grep -oP '(?<=\[)[^]]+'
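As a quick check of the awk one-liner, here is a small sketch that feeds it the sample table from the question. The function name extract_versions is only an illustrative wrapper and not part of the original answer.

```shell
# Hypothetical wrapper: print the text between [ and ] on every line except the header.
extract_versions() {
    # '[][]' is a bracket expression containing ']' and '[',
    # so either bracket splits the record and the version lands in field 2.
    awk -F '[][]' 'NR > 1 {print $2}'
}

printf '%s\n' \
    'Version Repository Repository URL' \
    '[1.0.0.44] repo-0 file://test/test-1.0.0.44-features.xml' \
    '[1.0.0.21] repo-0 file://test/test-1.0.0.21-features.xml' |
    extract_versions
```

With the sample input this should print 1.0.0.44 and 1.0.0.21 on separate lines.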

Related

awk command to print columns with column data

cat file1.txt | awk -F '{print $1 "|~|" $2 "|~|" $3}' > file2.txt
I am using the above command to extract the first three columns from file1 and write them to file2.
But I am only getting the column names and not the column data.
How do I do that?
|~| is the delimiter.
file1.txt contains:
a|~|b|~|c|~|d|~|e
1|~|2|~|3|~|4|~|5
11|~|22|~|33|~|44|~|55
111|~|222|~|333|~|444|~|555
My expected output is:
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333

With your shown samples, please try the following awk code. You need to set the field separator to |~| and remove leading whitespace from the lines, then print the fields.
awk -F'\\|~\\|' -v OFS='|~|' '{sub(/^[[:blank:]]+/,"");print $1,$2,$3}' Input_file
In case you want to keep the spaces (which were in the initial post before the edit), try the following:
awk -F'\\|~\\|' -v OFS='|~|' '{print $1,$2,$3}' Input_file
NOTE: After a chat with the user, it turned out the code was not working because gunzip -c was being used incorrectly: its output was saved into a variable, and the awk program was run on that variable. Correcting that command produced the right file, and the awk program ran fine on it. Adding this as a reference for future readers.
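The note above comes down to streaming the decompressed data straight into awk instead of capturing it in a shell variable first. A minimal sketch of that idea: the wrapper name first_three and the file name file1.txt.gz are assumptions for illustration, and the separator is written as the bracket expression [|]~[|] rather than with backslashes, which is a swapped-in but equivalent way to match a literal |~|.

```shell
# Hypothetical helper: keep the first three |~|-delimited fields of each line on stdin.
first_three() {
    # [|] matches a literal pipe, so no backslash escaping is needed.
    awk -F '[|]~[|]' -v OFS='|~|' '{print $1, $2, $3}'
}

# Plain input, using two of the sample rows:
printf '%s\n' 'a|~|b|~|c|~|d|~|e' '1|~|2|~|3|~|4|~|5' | first_three

# Compressed input would be piped the same way (the file name is an assumption):
# gunzip -c file1.txt.gz | first_three > file2.txt
```

Because the helper only reads standard input, it does not matter whether the lines come from a plain file, from gunzip -c, or from another command.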

One approach would be:
awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
The approach simply replaces every "|~|" with a "," and sets the output field separator to "|~|". Leading whitespace is trimmed with sub(). This assumes the data itself contains no commas.
Example Use/Output
With your data in file1.txt, you would have:
$ awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
Let me know if this is what you intended. You can simply redirect, e.g. > file2.txt to write to the second file.

For such cases, my bash+awk script rcut comes in handy:
rcut -Fd'|~|' -f-3 ip.txt
The -F option enables fixed string input delimiter (which is given using the -d option). And by default, the output field separator will also be same as -d when -F is active. -f-3 is similar to cut syntax to specify first three fields.
For better speed, use hck command:
hck -Ld'|~|' -D'|~|' -f-3 ip.txt
Here, -L enables literal field separator and -D specifies output field separator.
Another benefit is that hck supports -z option to automatically handle common compressed formats based on filename extension (adding this since OP had an issue with compressed input).

Another way:
sed 's/|~|/\t/g' file1.txt | awk '{print $1"|~|"$2"|~|"$3}' > file2.txt
First replace the |~| delimiter with a tab, then use the default awk separator and print the columns you need.

How to AWK print only specific item?

I have a log file that looks like this:
RPT_LINKS=1,T1999
RPT_NUMALINKS=1
RPT_ALINKS=1,1999TK,2135,2009,31462,29467,2560
RPT_TXKEYED=1
RPT_ETXKEYED=0
I have used grep to isolate the line I am interested in with the RPT_ALINKS. In that line I want to know how to use AWK to print only the link that ends with a TK.
I am really close running this:
grep -w 'RPT_ALINKS' stats2.log | awk -F 'TK' '{print FS }'
But I am sure those who are smarter than me already know I am getting only the TK back, how do I get the entire field so that I would get a return of 1999TK?

If there is only a single TK in that line and TK is always at the end of the field:
awk '/RPT_ALINKS/{match($0,/[^=,]*TK/); print substr($0,RSTART,RLENGTH)}'
You can also use a double grep
grep -w 'RPT_ALINKS' stats2.log | grep -wo '[^=,]*TK'
The following sed solution also works nicely:
sed '/RPT_ALINKS/s/\(^.*[,=]\)\([^=,]*TK\)\(,.*\)\?/\2/'

It doesn't get any more elegant
awk -F '=' '$1=="RPT_ALINKS" {n=split($2,array,",")
for(i=1; i<=n; i++)
if (array[i] ~ /TK$/)
{print array[i]}}
' stats2.log
n=split($2,array,","): splits 1,1999TK,2135,2009,31462,29467,2560 on "," into the array named array. n contains the number of array elements, here 7.

Here is a simple solution
awk -F ',|=' '/^RPT_ALINKS/ { for (i=1; i<=NF; i++) if ($i ~ /TK$/) print $i }' stats2.log
It only looks at records that begin with RPT_ALINKS, and there it checks every field. If a field ends with TK, it prints it.

Dang, I was just about to post the double-grep alternative, but got scooped. And all the good awk solutions are taken as well.
Sigh. So here we go in bash, for fun.
$ mapfile a < stats2.log
$ for i in "${a[@]}"; do [[ $i =~ ^RPT_ALINKS=(.+,)*([^,]+TK) ]] && echo "${BASH_REMATCH[2]}"; done
1999TK
This has the disadvantage of running way slower than awk and not using fields. Oh, and it won't handle multiple *TK items on a single line. And like sed, this is processing lines as patterns rather than fields, which saps elegance. And by using mapfile, we limit the size of input you can handle because your whole log is loaded into memory. Of course you don't really need to do that, but if you were going to use a pipe, you'd use a different tool anyway. :-)
Happy Thursday.

With a sed that has -E for EREs, e.g. GNU or OSX/BSD sed:
$ sed -En 's/^RPT_ALINKS=(.*,)?([^,]*TK)(,.*|$)/\2/p' file
1999TK
With GNU awk for the 3rd arg to match():
$ awk 'match($0",",/^RPT_ALINKS=(.*,)?([^,]*TK),.*/,a){print a[2]}' file
1999TK

Instead of looping through the fields, you can use another approach.
This should be faster, since the loop takes time.
awk -F"TK" '/RPT_ALINKS/ {b=split($1,a,",");print a[b]FS}' stats2.log
1999TK
Here the field separator is set to TK, and only lines that contain RPT_ALINKS are processed.
That gives $1=RPT_ALINKS=1,1999 and $2=,2135,2009,31462,29467,2560
The value we want is always whatever follows the last comma in $1.
So split $1 on commas with the split function; b then holds the number of pieces.
Since the value is in the last piece, we print a[b] and append FS, which contains TK.
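To make the intermediate values visible, here is a small sketch that prints them for the sample line. The helper name show_tk_split is hypothetical; the awk logic is the same as in the one-liner above.

```shell
# Hypothetical helper: show how FS=TK and split() carve up each RPT_ALINKS line on stdin.
show_tk_split() {
    awk -F 'TK' '/RPT_ALINKS/ {
        b = split($1, a, ",")      # split the text before the first TK on commas
        printf "$1 = %s\n", $1     # everything before the first TK
        printf "$2 = %s\n", $2     # everything after it
        printf "b  = %d\n", b      # number of comma-separated pieces in $1
        print a[b] FS              # last piece with the separator glued back on
    }'
}

printf '%s\n' 'RPT_ALINKS=1,1999TK,2135,2009,31462,29467,2560' | show_tk_split
```

For the sample line the last line printed is 1999TK. One consequence of this design is that only the entry before the first TK on a line is reported, so a line with several TK entries would need the looping approach instead.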

awk print several substring

I would like to be able to print several substrings via awk.
Here is an example of what I usually do:
awk '{print substr($0,index($0,string),10)}' test.txt > result.txt
This allows me to print 10 letters starting from where my string is found.
But the result is only the first substring, instead of several as I expected.
Here is an example using the string "ATGC":
test.txt
ATGCATATAAATGCTTTTTTTTT
result.txt
ATGCATATAA
instead of
ATGCATATAA
ATGCTTTTTT
What do I have to add?
I'm sure the answer is easy for you guys!
Thank you for your help.

If you have gawk (GNU awk), you can make use of FPAT:
awk -v FPAT='ATGC.{6}' '{for(i=1;i<=NF;i++)print $i}' file
With your example:
$ awk -v FPAT='ATGC.{6}' '{for(i=1;i<=NF;i++)print $i}' <<<"ATGCATATAAATGCTTTTTTTTT"
ATGCATATAA
ATGCTTTTTT
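Where gawk is not available, the same result can be sketched with a loop, which also shows why the original command printed only one substring: index() reports just the first occurrence. The helper name print_windows and its len variable are assumptions for illustration.

```shell
# Hypothetical helper: print a 10-character window for every occurrence of the motif.
# Usage: print_windows MOTIF < input
print_windows() {
    awk -v motif="$1" -v len=10 '{
        rest = $0
        # index() only finds the first hit, so keep searching the remainder.
        while ((pos = index(rest, motif)) > 0) {
            print substr(rest, pos, len)
            rest = substr(rest, pos + length(motif))
        }
    }'
}

printf '%s\n' 'ATGCATATAAATGCTTTTTTTTT' | print_windows ATGC
```

Unlike the FPAT version, this sketch moves on by the length of the motif only, so windows may overlap, and a hit close to the end of the line yields a window shorter than 10 characters.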

For this exact sample only (the offsets are hard-coded):
awk '{print substr($0,1,10) RS substr($0,length -12,10)}' file
ATGCATATAA
ATGCTTTTTT

Using grep-awk and sed in one-row-command result in a "No such file or directory" error

..And I know why:
I have a xml document with lots of information inside. I need to extract what I need and eventually print them on a new file.
The xml (well, part of it.. rows just keeps repeating)
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toMartha/"
errordir="%home%/../../../home/samba/user/Outbound/toMartha/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Martha"
interval="600"
defaults="sender.name=me_myself, receiver.name=Martha"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toJosh/"
errordir="%home%/../../../home/samba/user/Outbound/toJosh/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Josh"
interval="600"
defaults="sender.name=me_myself, receiver.name=Josh"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toPamela/"
errordir="%home%/../../../home/samba/user/Outbound/toPamela/error"
interval="600"
defaults="sender.name=me_myself, receiver.name=Pamela"
sendfilename="true"
mimetype="application/standard"/>
I need to extract the folder after "Outbound" and clean it from quotes or slashes.
Also, I need to exclude the "/error" so I get only 1 result for each of them.
My command is:
grep -o -v "/error" "Outbound/" config.xml | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g" > /tmp/sync_users
The error is: grep: Outbound/: No such file or directory which of course means that I'm giving to grep too many arguments (?) - If i remove the -v "/error" it would work but would print also the names with "/error".
Can someone help me?
EDIT:
As some pointed out in their example (thanks for the time you put in), I'd need to extract these words based on the sample above:
toMartha
toJosh
toPamela

It could be interesting to use sed in this case:
sed -e '\#/Outbound/#!d' -e '\#/error"$#d' -e 's#.*/Outbound/##;s#/\{0,1\}"$##' config.xml
An awk version, assuming (for the final print) that the folder is always exactly one level below Outbound, as shown:
awk -F '/' '$0 !~ /\/Outbound\// || /\/error"$/ {next} {print $(NF-1)}' config.xml

Lose the grep altogether:
$ awk '/outboxdir/{gsub(/^.+Outbound\/|\/" *\r?$/,""); print}' file
toMartha
toJosh
toPamela
/outboxdir/ only processes records that contain outboxdir
gsub removes the unwanted parts of the record
The pattern also removes spaces at the end of the record and handles CRLF line endings in files that originated on Windows

To give grep multiple patterns, they have to be separated by newlines or passed with a pattern option (-e, -f, ...). However, -v inverts the match as a whole; you can't invert only one pattern.
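The effect of -v on a group of patterns is easy to see on a few lines. This is a minimal sketch with sample lines abbreviated from the question's XML (the shortened paths are an assumption, not the real file):

```shell
# Three sample lines modelled on the attributes in the question.
sample='outboxdir="%home%/Outbound/toMartha/"
errordir="%home%/Outbound/toMartha/error"
interval="600"'

# -v negates the whole pattern set: only lines matching NEITHER pattern survive.
printf '%s\n' "$sample" | grep -v -e '/error' -e 'Outbound/'

# To keep the Outbound/ lines but drop the /error lines, use two separate greps.
printf '%s\n' "$sample" | grep 'Outbound/' | grep -v '/error'
```

The first command leaves only the interval line, while the second leaves only the outboxdir line, which is the behaviour the question was after.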
For what you're after you can use PCRE (-P argument) for the lookaround ability:
grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' config.xml
The regex tries to:
match something not a slash at least once, the [^\/]+
preceded by Outbound/ the positive lookbehind (?<=Outbound\/)
and not followed by something ending with /error, the negative lookahead (?!.*\/error)
With your first sample input:
$ grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' test.txt
toMartha
toJosh
toPamela

How about:
grep -i "outbound" your_file | awk -F"Outbound/" '{print $2}' | sed -e 's/error//' -e 's/\/\"//' | uniq
Should work :)

You can use match in gawk with capturing groups in the regex:
awk 'match($0, /^.*\/Outbound\/([^\/]+)\/([^\/]*)\/?"$/, a){
if(a[2]!="error"){print a[1]}
}' config.xml
You get:
toMartha
toJosh
toPamela

grep can accept multiple patterns with the -e option (aka --regexp, even though it can be used with --fixed-strings too, go figure). However, -v (--invert-match) applies to all of the patterns as a group.
Another solution would be to chain two calls to grep:
grep -v "/error" config.xml | grep "Outbound/" | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g"

Combining awk search with standard awk and awk delimiter

I'm working on a set of data from which I need specific fields as output:
The data looks like this:
/home/oracle/db.log.gz:2013-1-19T00:00:25 <user.info> 1 2013-1-19T00:00:53.911 host_name RT_FLOW [junos#26.1.1.1.2.4 source-address="10.1.2.0" source-port="616" destination-address="100.1.1.2" destination-port="23" service-name="junos-telnet" nat-source-address="20x.2x.1.2" nat-source-port="3546" nat-destination-address="9x.12x.3.0"]
From above I need three things:
(I) - 2013-1-19T00:00:53.911 which is $4
(II)- source-address="10.1.2.0" which is $8 of which I need only 10.1.2.0
(III) - destination-address="100.1.1.2" which is $10 of which I need only 100.1.1.2
I cannot use simple awk like this -> awk '{ print $4 \t $8 \t $10 }' since there are some fields after "device_name" in the log file which are not always present in all log lines so I have to make use of delimiters such as
awk -F 'source-address=' '{print $2}' | awk '{print $1}' -> this gives the source-address IP, which is requirement (II)
I'm not sure how to combine these into a single awk search for (I), (II) and (III).
Can someone help?

I believe sed is better for this job
sed -r 's/([^ ]+[ ]+){3}([^ ]+).*[ ]+source-address="([^"]+)".*[ ]+destination-address="([^"]+)".*/\2\t\3\t\4/' file
Output:
2013-1-19T00:00:53.911 10.1.2.0 100.1.1.2

What exactly do you want?
solve the problem using any (reasonably standard) tool
solve this challenge using one instance of awk
solve the problem using just awk, no matter how many instances it costs
For the first case, you could parse the line using the scripting language of your choice (mine would be Perl), or do it the hard way using sed and a single big substitution. Or something in between: use three regexes to get the parts you want.
For the second case, you could adapt any of the former solutions, preferably the sed one. Awk and sed solutions have already been posted.
For the third case, you could just run the obvious awk solutions you mentioned in your question and send the results to a single pipe like { awk …; awk …; awk …; } < file | consumer.

Try doing this (gensub requires GNU awk):
awk '{print gensub(/.*\s+([0-9]{4}-[0-9]+-[0-9]+T[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]+).*source-address="([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*destination-address="([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*/, "(I) \\1\n(II) \\2\n(III) \\3", "g"); }' file
Another solution using perl :
perl -lne 'print "(", "I" x ++$c, ") $_" for m/.*?\s+(\d{4}-\d+-\d+T\d{2}:\d{2}:\d{2}.\d+).*source-address="(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*destination-address="(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*/' file
Outputs :
(I) 2013-1-19T00:00:53.911
(II) 10.1.2.0
(III) 100.1.1.2
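For comparison, a single awk process can also do this without GNU extensions by combining plain field access with match(). This is only a sketch under the assumptions stated in the question: the timestamp is always $4, and each attribute is preceded by a space. The names extract_flow and attr are hypothetical.

```shell
# Hypothetical helper: print timestamp, source address and destination address, tab-separated.
extract_flow() {
    awk '
    # Return the quoted value of attribute key in string s, or "" if it is absent.
    # The leading space in the pattern keeps nat-source-address from matching source-address.
    function attr(s, key,    pat) {
        pat = " " key "=\"[^\"]*\""
        if (!match(s, pat)) return ""
        # Skip the space, the key, the = and the opening quote; drop the closing quote.
        return substr(s, RSTART + length(key) + 3, RLENGTH - length(key) - 4)
    }
    {
        print $4 "\t" attr($0, "source-address") "\t" attr($0, "destination-address")
    }'
}

line='/home/oracle/db.log.gz:2013-1-19T00:00:25 <user.info> 1 2013-1-19T00:00:53.911 host_name RT_FLOW [junos#26.1.1.1.2.4 source-address="10.1.2.0" source-port="616" destination-address="100.1.1.2" destination-port="23" service-name="junos-telnet" nat-source-address="20x.2x.1.2" nat-source-port="3546" nat-destination-address="9x.12x.3.0"]'

printf '%s\n' "$line" | extract_flow
```

Looking attributes up by name rather than by field number means the optional fields that shift $8 and $10 around do not matter, as long as the timestamp stays in $4.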
