isolate similar data from stream - awk

We parse data of the following format -
35953539535393 BG |..|...|REF_DATA^1^Y^|...|...|
35953539535393 B |..|...|REF_DATA_IND^1^B^|...|...|
We need to print the unique values of REF_DATA* that appear in the file, using a script.
So, the output for the above data would be:
REF_DATA^1^Y^
REF_DATA_IND^1^B^
How do we achieve this with grep, sed or awk, using a one-liner script?

This might work for you (GNU sed & sort):
sed '/\n/!s/[^|]*REF_DATA[^|]*/\n&\n/;/^[^|]*REF_DATA/P;D' file | sort -u
Surround the intended strings by newlines, print only those strings on separate lines and sort those lines showing only unique values.

Could you please try following and let me know if this helps you.
awk 'match($0,/REF_DATA[^|]*/){val=substr($0,RSTART,RLENGTH);if(!array[val]++){print val}}' Input_file
Here is the same solution in multi-line form:
awk '
match($0,/REF_DATA[^|]*/){
  val=substr($0,RSTART,RLENGTH)
  if(!array[val]++){
    print val
  }
}' Input_file

Assuming you have GNU grep:
command_to_produce_data | grep -oP '(?<=[|])REF_DATA.+?(?=[|])' | sort -u

awk -F\| '{print $4}' file
REF_DATA^1^Y^
REF_DATA_IND^1^B^
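The -F\| answer above relies on REF_DATA always sitting in field 4, and it prints repeated values as many times as they occur. A minimal sketch that drops both assumptions by scanning every |-separated field and printing each REF_DATA* value once (the function name unique_ref_data and the third sample line are illustrative, not from the original posts):

```shell
# Print each distinct field that starts with REF_DATA, in first-seen order.
# Reads the named files, or standard input when no file is given.
unique_ref_data() {
  awk -F'|' '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^REF_DATA/ && !seen[$i]++)
        print $i
  }' "$@"
}

# Sample run: the third line repeats a value in a different field position.
printf '%s\n' \
  '35953539535393 BG |..|...|REF_DATA^1^Y^|...|...|' \
  '35953539535393 B |..|...|REF_DATA_IND^1^B^|...|...|' \
  '35953539535393 BG |..|REF_DATA^1^Y^|...|...|...|' |
  unique_ref_data
```

Because the seen[] array does the de-duplication, no sort -u is needed and the values come out in the order they first appear.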

Related

awk command to print columns with column data

cat file1.txt | awk -F '{print $1 "|~|" $2 "|~|" $3}' > file2.txt
I am using the above command to extract the first three columns from file1.txt and write them to file2.txt.
But I am only getting the column names and not the column data.
How can I do that?
|~| - is the delimiter.
file1.txt has values as :
a|~|b|~|c|~|d|~|e
1|~|2|~|3|~|4|~|5
11|~|22|~|33|~|44|~|55
111|~|222|~|333|~|444|~|555
My expected output is:
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
With your shown samples, please try following awk code. You need to set field separator to |~| and remove starting space from lines, then print the lines.
awk -F'\\|~\\|' -v OFS='|~|' '{sub(/^[[:blank:]]+/,"");print $1,$2,$3}' Input_file
In case you want to keep the leading spaces (which were in the initial post before the edit), try the following:
awk -F'\\|~\\|' -v OFS='|~|' '{print $1,$2,$3}' Input_file
NOTE: After a chat with the user, it turned out the code was not working because gunzip -c file was being used incorrectly: its output was being saved into a variable, and the awk program was then run on that variable. Correcting that command produced the right file, and the awk program ran fine on it. Adding this as a reference for future readers.
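For the compressed-input case described in the note, one way to avoid the intermediate variable is to stream the decompressed data straight into awk. This is only a sketch: the temporary directory, the .gz file name and the function name first_three are made up for illustration, and the separator is written as [|]~[|], a bracket-expression spelling of the same pattern that needs no backslashes.

```shell
# Hypothetical setup: build a small compressed sample to work with.
tmpdir=$(mktemp -d)
printf '%s\n' 'a|~|b|~|c|~|d|~|e' '1|~|2|~|3|~|4|~|5' |
  gzip -c > "$tmpdir/file1.txt.gz"

# Keep the first three |~|-separated columns of whatever arrives on stdin.
first_three() {
  awk -F'[|]~[|]' -v OFS='|~|' '{print $1, $2, $3}'
}

# Decompress to stdout and pipe directly into awk, with no variable in between.
gunzip -c "$tmpdir/file1.txt.gz" | first_three

rm -rf "$tmpdir"
```

Piping keeps the line structure intact, which is easy to lose when command output is captured in a shell variable and later expanded unquoted.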
One approach would be:
awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
The approach simply replaces every "|~|" with a "," and sets the output field separator to "|~|". All leading whitespace is trimmed with sub(). It assumes the data itself contains no commas.
Example Use/Output
With your data in file1.txt, you would have:
$ awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
Let me know if this is what you intended. You can simply redirect, e.g. > file2.txt to write to the second file.
For such cases, my bash+awk script rcut comes in handy:
rcut -Fd'|~|' -f-3 ip.txt
The -F option enables fixed string input delimiter (which is given using the -d option). And by default, the output field separator will also be same as -d when -F is active. -f-3 is similar to cut syntax to specify first three fields.
For better speed, use hck command:
hck -Ld'|~|' -D'|~|' -f-3 ip.txt
Here, -L enables literal field separator and -D specifies output field separator.
Another benefit is that hck supports -z option to automatically handle common compressed formats based on filename extension (adding this since OP had an issue with compressed input).
Another way:
sed 's/|~|/\t/g' file1.txt | awk '{print $1"|~|"$2"|~|"$3}' > file2.txt
First replace the |~| delimiter with a tab, then let awk split on its default separator and print the columns you need.

awk print several substring

I would like to be able to print several substrings via awk.
Here is an example of what I usually do:
awk '{print substr($0,index($0,string),10)}' test.txt > result.txt
This allows me to print 10 letters starting at the point where my string is found.
But the result is only the first substring, instead of several as I expected.
Here is an example using the string "ATGC":
test.txt
ATGCATATAAATGCTTTTTTTTT
result.txt
ATGCATATAA
instead of
ATGCATATAA
ATGCTTTTTT
What do I have to add?
I'm sure the answer is easy for you guys !
Thank you for your help.
If you have gawk (gnu awk), you can make use of FPAT:
awk -v FPAT='ATGC.{6}' '{for(i=1;i<=NF;i++)print $i}' file
With your example:
$ awk -v FPAT='ATGC.{6}' '{for(i=1;i<=NF;i++)print $i}' <<<"ATGCATATAAATGCTTTTTTTTT"
ATGCATATAA
ATGCTTTTTT
awk '{print substr($0,1,10),RS substr($0,length -12,10)}' file
ATGCATATAA
ATGCTTTTTT
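FPAT is specific to GNU awk, and the substr() answer hard-codes positions for this one sample. A rough portable alternative is to loop with index(), printing a fixed-width window at every occurrence. The function name print_hits and the window variable n are illustrative; note that this sketch resumes searching right after each matched motif, so windows may overlap, which differs from the FPAT behaviour.

```shell
# Print n characters (default 10) starting at every occurrence of the
# search string given as $1. Reads sequence lines from standard input.
print_hits() {
  awk -v s="$1" -v n=10 '{
    rest = $0
    while ((p = index(rest, s)) > 0) {
      print substr(rest, p, n)            # window starting at this hit
      rest = substr(rest, p + length(s))  # continue after the motif
    }
  }'
}

printf '%s\n' 'ATGCATATAAATGCTTTTTTTTT' | print_hits ATGC
```

On the sample line this prints ATGCATATAA and then ATGCTTTTTT, matching the expected result.txt from the question.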

awk sed grep to extract pattern with special characters

I am trying to understand the switches and args in awk and sed.
For instance, to get the value next to nonce in this line from the file response.xml:
WWW-Authenticate: ServiceAuth realm="WinREST", nonce="1828HvF7EfPnRtzSs/h10Q=="
At the suggestion of another member, I use:
nonce=$(sed -nE 's/.*nonce="([^"]+)"/\1/p' response.xml)
To get the number next to the word idOperation in the line below, I was trying:
idOper=$(sed -nE 's/.*idOperation="([^"]+)"/\1/p' zarph_response.xml)
This is the line to extract the number from:
{"reqStatus":{"code":0,"message":"Success","success":true},"idOperation":"185-16-6"}
How do I get the 185-16-6?
And what if the data to extract has no quotes around it,
like the 1 next to operStatus?
{"reqStatus":{"code":0,"message":"Success","success":true},"operStatus":1,"amountReceived":0,"amountDismissed":0,"currency":"EUR"}
The following awk may help you with this.
awk -F"\"" '/idOperation/{print $(NF-1)}' Input_file
2nd solution: the following sed may also help.
sed '/idOperation/s/\(.*:\)"\([^"]*\)\(.*\)/\2/' Input_file
EDIT: In case you want to get the digit after the string operStatus, the following may help (the offset 12 skips the 10 characters of operStatus plus the following ":).
awk 'match($0,/operStatus[^,]*/){print substr($0,RSTART+12,RLENGTH-12)}' Input_file
Using grep perl-style regexes
grep -oP "nonce=\"\K(.*)?(?=\")" Input_file
grep -oP "idOperation\":\"\K(.*)?(?=\")" Input_file
If the input is JSON you can use jq (-r prints the raw string without the surrounding quotes):
jq -r .idOperation Input_file
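The sed attempt in the question fails because the JSON line has the form "idOperation":"..." rather than idOperation="...". A sketch of the same sed idea adapted to that shape, with one variant for quoted values and one for bare integers (the function names are hypothetical, and this regex approach only suits flat, single-line responses like the samples; jq is the more robust choice for real JSON):

```shell
# Value of a quoted string field, e.g. "idOperation":"185-16-6"
json_string_field() {
  sed -nE "s/.*\"$1\":\"([^\"]*)\".*/\1/p"
}

# Value of an unquoted integer field, e.g. "operStatus":1
json_number_field() {
  sed -nE "s/.*\"$1\":(-?[0-9]+).*/\1/p"
}

printf '%s\n' '{"reqStatus":{"code":0,"message":"Success","success":true},"idOperation":"185-16-6"}' |
  json_string_field idOperation

printf '%s\n' '{"reqStatus":{"code":0,"message":"Success","success":true},"operStatus":1,"amountReceived":0,"amountDismissed":0,"currency":"EUR"}' |
  json_number_field operStatus
```

The trailing .* in each expression matters: without it, the text after the captured value would be left in the output.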

strip out value from return using awk

I'm not sure how to strip out the "DST=" from these lines.
Here is my command (it's returning what it should). If there is a more efficient or better way, feel free to criticize.
awk '{print $10}' iptables.log |sort -u
DST=96.7.49.64
DST=96.7.49.65
DST=96.7.50.64
DST=98.27.88.26
DST=98.27.88.28
DST=98.27.88.45
DST=98.27.88.50
As you can see, I need to grab the unique IPs from the iptables log.
Thanks!
If you don't mind the unsorted output, here's a better way using awk:
awk '!a[$10]++ { sub(/DST=/,"",$10); print $10 }' file
or, if you want sorted output, strip the prefix with awk's sub() function and keep the sort -u:
awk '{sub(/DST=/,"",$10); print $10}' iptables.log |sort -u
Update:
Is there any way to key just on DST=, regardless of whether it's at position 10 or 11?
awk '$10~/^DST=/{sub(/DST=/,"",$10); print $10};$11~/^DST=/{sub(/DST=/,"",$11); print $11}' iptables.log | sort -u
OR
awk '{for (i=9;i<13;i++) {
if ($i ~ /^DST=/) { sub(/DST=/, "", $i); print $i}
}
}' iptables.log | sort -u
Note that you can change the range of fields to check and print; I'm testing fields 9-12 just as an example. Variables in awk like $i refer to the i-th field of the current line, just like $1, $9, $87, etc.
As I don't have iptables.log to test with, I can only confirm that the awk syntax doesn't fail. If this doesn't work, please post 2-4 sample lines of simplified data.
IHTH
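The field-range loop can be widened to every field and combined with the array-based de-duplication from the first answer, so that neither the position of DST= nor a separate sort -u is needed. A minimal sketch; the function name unique_dst and the log lines are invented for illustration, since no real iptables.log sample was posted:

```shell
# Print each distinct DST= address once, wherever the field appears.
# Reads the named files, or standard input when no file is given.
unique_dst() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^DST=/) {
        ip = substr($i, 5)          # drop the 4-character "DST=" prefix
        if (!seen[ip]++) print ip
      }
  }' "$@"
}

# Made-up log lines with DST= in different positions.
printf '%s\n' \
  'kernel: IN=eth0 OUT= SRC=10.0.0.5 DST=96.7.49.64 LEN=40' \
  'kernel: IN=eth0 OUT= MAC=00:11 SRC=10.0.0.5 DST=98.27.88.26 LEN=40' \
  'kernel: IN=eth0 OUT= SRC=10.0.0.7 DST=96.7.49.64 LEN=52' |
  unique_dst
```

The output is in first-seen order; append | sort if sorted output matters.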
You could pipe the result of your output through sed to remove the DST= from each line:
awk '{print $10}' iptables.log | sed 's/^DST=//' | sort -u
awk '{split($10,a,"=");b[a[2]];next}END{for(i in b)print i}' iptables.log

How to quote a shell variable in a TCL-expect string

I'm using the following awk command in an expect script to get the gateway for a particular destination
route | grep $dest | awk '{print $2}'
However the expect script does not like the $2 in the above statement.
Does anyone know of an alternative to awk to perform the same function as above? ie. output 2nd column.
You can use cut:
route | grep $dest | cut -d \ -f 2
That uses spaces as the field delimiter and pulls out the second field
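One caveat with cut: it treats every single space as a delimiter, so column-aligned output padded with runs of spaces yields empty fields. A sketch that squeezes the runs first with tr -s, shown as plain shell outside of Expect (the function name second_column and the route-like sample line are made up for illustration):

```shell
# Collapse runs of spaces to one space, then take the second field.
second_column() {
  tr -s ' ' | cut -d ' ' -f 2
}

# Made-up line in the style of column-aligned routing table output.
printf '%s\n' 'default         192.168.1.1     0.0.0.0         UG    0      0        0 eth0' |
  second_column
```

Lines that begin with spaces would still produce an empty first field, which is one reason awk's default whitespace splitting is often the simpler tool here.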
To answer your Expect question, single quotes have no special meaning to the Tcl parser. You need to use braces to protect the body of the awk script:
route | grep $dest | awk {{print $2}}
And as awk can do what grep does, you can get away with one less process:
route | awk -v d=$dest {$0 ~ d {print $2}}
Before switching to another utility, check whether changing the field separator works. See the GNU Awk documentation on field separators.
SED is the best alternative to use. If you don't mind a dependency, Perl should also be sufficient to solve the task
Depending on the structure of your data, you can use either cut, or use sed to do both filtering and printing the second column.
Alternatively, you could use Perl:
perl -lne 'if (/foo/) { my @f = split(/:/); print $f[1]; }'
This will print the second token of each line containing foo, with : as the token separator.