awk: print several substrings

I would like to be able to print several substrings via awk.
Here is an example of what I usually do:
awk -v string="ATGC" '{print substr($0, index($0, string), 10)}' test.txt > result.txt
This allows me to print 10 letters starting from where my string is found.
But the result is only the first substring, instead of several as I expected.
Here is an example using the string "ATGC":
test.txt
ATGCATATAAATGCTTTTTTTTT
result.txt
ATGCATATAA
instead of
ATGCATATAA
ATGCTTTTTT
What do I have to add?
I'm sure the answer is easy for you guys!
Thank you for your help.

If you have gawk (gnu awk), you can make use of FPAT:
awk -v FPAT='ATGC.{6}' '{for(i=1;i<=NF;i++)print $i}' file
With your example:
$ awk -v FPAT='ATGC.{6}' '{for(i=1;i<=NF;i++)print $i}' <<<"ATGCATATAAATGCTTTTTTTTT"
ATGCATATAA
ATGCTTTTTT
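If you don't have gawk, here is a minimal sketch of the same idea in POSIX awk, using index() in a loop (assuming the same string and 10-character width as the example):
awk -v str="ATGC" '{
    s = $0
    while (i = index(s, str)) {    # find each occurrence of the string
        print substr(s, i, 10)     # print 10 characters starting at the match
        s = substr(s, i + 1)       # keep searching after this match
    }
}' test.txt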

awk '{print substr($0,1,10),RS substr($0,length -12,10)}' file
ATGCATATAA
ATGCTTTTTT

Related

How to print specific string from a sentence using awk

I have the following sentence within a file
FQDN=joe.blogs.com.
How can I print the string "joe"?
I have tried awk -F"=" '{print $2}' file
but that returns joe.blogs.com, as "=" is the delimiter.
Is it possible to use 2 delimiters on the same line?
You might use a regular expression as FS. Let the content of file.txt be
FQDN=joe.blogs.com.
then
awk 'BEGIN{FS="[=.]"}{print $2}' file.txt
output
joe
In case you are OK with sed, could you please try the following:
sed 's/.*=\([^.]*\)\..*/\1/' Input_file
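For example:
$ echo 'FQDN=joe.blogs.com.' | sed 's/.*=\([^.]*\)\..*/\1/'
joe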
With GNU grep and its -oP flag, we could try the following too:
grep -oP '(.*=)\K([^.]*)' Input_file
You could use GNU grep:
grep -oP '(?<=FQDN=)[^.]+' file
-oP          only print the match, using a Perl-style regex
(?<=FQDN=)   lookbehind for 'FQDN='
[^.]+        all characters up to a '.'
Or with Perl:
perl -lne 'print $1 if /(?<=FQDN=)([^.]+)/' file
With awk I would probably do:
awk 'BEGIN{FS="[.=]"} /FQDN=/{print $2}' file
Why not keep it simple and pipe awk into awk?
awk -F"=" '{print $2}' file | awk -F"." '{print $1}'
Can I use two field delimiters on one line?
No. You may do further string manipulation as post-processing, or you could use a regex as the field delimiter.
Another option is to use awk's split function:
awk -F= '{ split($2,map,".");print map[1] }' file
Split the second = separated field into the array map using "." as the delimiter. Print the first index of the array.
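With the sample line:
$ echo 'FQDN=joe.blogs.com.' | awk -F= '{ split($2,map,".");print map[1] }'
joe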

How to print only a specific item with awk?

I have a log file that looks like this:
RPT_LINKS=1,T1999
RPT_NUMALINKS=1
RPT_ALINKS=1,1999TK,2135,2009,31462,29467,2560
RPT_TXKEYED=1
RPT_ETXKEYED=0
I have used grep to isolate the line I am interested in, the one with RPT_ALINKS. In that line, I want to use awk to print only the link that ends with TK.
I am really close running this:
grep -w 'RPT_ALINKS' stats2.log | awk -F 'TK' '{print FS }'
But as those who are smarter than me already know, I am getting only the TK back. How do I get the entire field, so that the return would be 1999TK?
If there is only a single TK in that line and TK is always at the end:
awk '/RPT_ALINKS/{match($0,/[^=,]*TK/); print substr($0,RSTART,RLENGTH)}'
You can also use a double grep
grep -w 'RPT_ALINKS' stats2.log | grep -wo '[^=,]*TK'
The following sed solution also works nicely:
sed '/RPT_ALINKS/s/\(^.*[,=]\)\([^=,]*TK\)\(,.*\)\?/\2/'
It doesn't get any more elegant:
awk -F '=' '$1=="RPT_ALINKS" {
    n = split($2, array, ",")
    for (i=1; i<=n; i++)
        if (array[i] ~ /TK$/)
            print array[i]
}' stats2.log
n=split($2,array,","): splits 1,1999TK,2135,2009,31462,29467,2560 on , into the array array. n contains the number of array elements, here 7.
Here is a simple solution
awk -F ',|=' '/^RPT_ALINKS/ { for (i=1; i<=NF; i++) if ($i ~ /TK$/) print $i }' stats2.log
It looks only at the record which begins with RPT_ALINKS, and there it checks every field. If a field ends with TK, it prints it.
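With the sample log:
$ awk -F ',|=' '/^RPT_ALINKS/ { for (i=1; i<=NF; i++) if ($i ~ /TK$/) print $i }' stats2.log
1999TK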
Dang, I was just about to post the double-grep alternative, but got scooped. And all the good awk solutions are taken as well.
Sigh. So here we go in bash, for fun.
$ mapfile a < stats2.log
$ for i in "${a[@]}"; do [[ $i =~ ^RPT_ALINKS=(.+,)*([^,]+TK) ]] && echo "${BASH_REMATCH[2]}"; done
1999TK
This has the disadvantage of running way slower than awk and not using fields. Oh, and it won't handle multiple *TK items on a single line. And like sed, this is processing lines as patterns rather than fields, which saps elegance. And by using mapfile, we limit the size of input you can handle because your whole log is loaded into memory. Of course you don't really need to do that, but if you were going to use a pipe, you'd use a different tool anyway. :-)
Happy Thursday.
With a sed that has -E for EREs, e.g. GNU or OSX/BSD sed:
$ sed -En 's/^RPT_ALINKS=(.*,)?([^,]*TK)(,.*|$)/\2/p' file
1999TK
With GNU awk for the 3rd arg to match():
$ awk 'match($0",",/^RPT_ALINKS=(.*,)?([^,]*TK),.*/,a){print a[2]}' file
1999TK
Instead of looping through the fields, you can use another alternative.
This will be fast; a loop takes time.
awk -F"TK" '/RPT_ALINKS/ {b=split($1,a,",");print a[b]FS}' stats2.log
1999TK
Here you split the line by setting the field separator to TK and search for lines that contain RPT_ALINKS.
That gives $1=RPT_ALINKS=1,1999 and $2=,2135,2009,31462,29467,2560
$1 will always have our value after the last comma.
So split it up with the split function, using the comma; b then contains the number of fields.
Since we know the value is in the last section, we use a[b] and append FS, which contains TK.
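For example:
$ echo 'RPT_ALINKS=1,1999TK,2135,2009,31462,29467,2560' | awk -F"TK" '{b=split($1,a,",");print a[b]FS}'
1999TK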

isolate similar data from stream

We parse data of the following format -
35953539535393 BG |..|...|REF_DATA^1^Y^|...|...|
35953539535393 B |..|...|REF_DATA_IND^1^B^|...|...|
We need to print the unique values of REF_DATA* appearing in the file, using a script.
So,the output of the above data would be :
REF_DATA^1^Y^
REF_DATA_IND^1^B^
How do we achieve this using grep, sed or awk, with a one-liner?
This might work for you (GNU sed & sort):
sed '/\n/!s/[^|]*REF_DATA[^|]*/\n&\n/;/^[^|]*REF_DATA/P;D' file | sort -u
Surround the intended strings by newlines, print only those strings on separate lines and sort those lines showing only unique values.
Could you please try the following and let me know if it helps you.
awk 'match($0,/REF_DATA[^|]*/){val=substr($0,RSTART,RLENGTH);if(!array[val]++){print val}}' Input_file
Adding a non-one-liner form of the solution too now.
awk '
match($0,/REF_DATA[^|]*/){
    val = substr($0, RSTART, RLENGTH)
    if (!array[val]++) {
        print val
    }
}' Input_file
Assuming you have GNU grep:
command_to_produce_data | grep -oP '(?<=[|])REF_DATA.+?(?=[|])' | sort -u
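With the sample data in a file:
$ grep -oP '(?<=[|])REF_DATA.+?(?=[|])' file | sort -u
REF_DATA^1^Y^
REF_DATA_IND^1^B^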
awk -F\| '{print $4}' file
REF_DATA^1^Y^
REF_DATA_IND^1^B^
This assumes REF_DATA is always in the 4th |-separated field; pipe through sort -u if duplicates can occur.

strip out value from return using awk

I'm not sure how to strip out the "DST=" from these lines.
Here is my command (it returns what it should); if there is a more efficient or better way, feel free to criticize.
awk '{print $10}' iptables.log |sort -u
DST=96.7.49.64
DST=96.7.49.65
DST=96.7.50.64
DST=98.27.88.26
DST=98.27.88.28
DST=98.27.88.45
DST=98.27.88.50
As you can see, I need to grab unique ip's from iptable log.
Thanks!
If you don't mind the unsorted output, here's a better way using awk:
awk '!a[$10]++ { sub(/DST=/,"",$10); print $10 }' file
Or you can keep it all in one process and use awk's equivalent sub() function, i.e.:
awk '{sub(/DST=/,"",$10); print $10}' iptables.log |sort -u
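For instance, with a made-up iptables-style line (the field positions here are an assumption; adjust $10 to your format):
$ echo 'Jan  1 00:00:00 fw kernel: IN=eth0 OUT= MAC=00:11 SRC=10.0.0.1 DST=96.7.49.64' | awk '{sub(/DST=/,"",$10); print $10}'
96.7.49.64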
Update:
Is there any way to key just on DST=, regardless of whether it's at field 10 or 11?
awk '$10~/^DST=/{sub(/DST=/,"",$10); print $10};$11~/^DST=/{sub(/DST=/,"",$11); print $11}' iptables.log | sort -u
OR
awk '{for (i=9; i<13; i++) {
         if ($i ~ /^DST=/) { sub(/DST=/, "", $i); print $i }
     }
}' iptables.log | sort -u
Note that here you can change the range of fields to check and print; I'm testing fields 9-12 just as an example. Variables in awk like $i refer to the i-th field of the current line, just like $1, $9, $87, etc.
As I don't have iptables.log to test with, I can't test it beyond confirming that the awk syntax doesn't fail. If this doesn't work, please post 2-4 sample lines of simplified data.
IHTH
You could pipe the result of your output through sed to remove the DST= from each line:
awk '{print $10}' iptables.log | sed 's/^DST=//' | sort -u
awk '{split($10,a,"="); b[a[2]]} END{for (i in b) print i}' iptables.log
This splits the 10th field on "=" and collects the values as array keys, which dedupes them without needing sort.

Using each line of awk output as grep pattern

I want to find every line of a file that contains any of the strings held in a column of a different file.
I have tried
grep "$(awk '{ print $1 }' file1.txt)" file2.txt
but that just outputs file2.txt in its entirety.
I know I've done this before with a pattern I found on this site, but I can't find that question anymore.
I see in the OP's comment that maybe the question is no longer a question. However, the following slight modification will handle the blank line situation. Just add a check to make sure the line has at least one field:
grep "$(awk '{if (NF > 0) print $1}' file1)" file2
And if the file with the patterns is simply a set of patterns per line, then a much simpler version of it is:
grep -f file1 file2
That causes grep to use the lines in file1 as the patterns.
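If the entries in file1 are literal strings rather than regular expressions, here is a sketch of the same idea that feeds awk's output to grep as fixed-string patterns on stdin (file names assumed from the question):
awk '{ print $1 }' file1.txt | grep -Ff - file2.txt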
There is no need to use grep when you have awk:
awk 'FNR==NR && NF {a[$1]; next} {for (s in a) if (index($0, s)) {print; break}}' file1.txt file2.txt
This reads the first column of file1.txt into an array, then prints each line of file2.txt that contains any of those strings.
grep -f <(awk '{ print $1 }' file1.txt) file2.txt > file.txt