awk - regex to print first matching group

I am trying to get the first matching group based on a regex, but the second awk command prints nothing. I'm not sure what I am doing wrong. Any help is greatly appreciated.
git status | awk 'NR=1' --> Limiting this to print the first line.
On branch TA1692959
git status | awk 'NR=1' | awk '/^On\sbranch\s([\w]*)/{ print $1 }' --> I was trying to get the first word "TA1692959" after "On branch", but this prints nothing.

git status |
{n,m,g}awk 'NR<--NF' FS='^On branch |[^[:alnum:]_].+$' OFS=
TA1241521

If you find yourself passing the data through multiple awk calls then chances are pretty good you can do the same thing with a single awk call, eg:
git status | awk 'NR==1 && /^On branch / {print $3; exit}'
TA1692959
In this case:
there's no need for a regex; otherwise OP should update the question with additional samples showing the need for a regex
the exit is optional and merely allows awk to skip processing the rest of the input stream
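If a regex capture really is needed, two things in the original attempt get in the way: $1 in awk is the first whitespace-separated field rather than the first capture group, and \s and \w are GNU extensions that other awks may not understand. A minimal sketch using match() with RSTART and RLENGTH, which should work in any POSIX awk; the input lines are hard-coded here as a stand-in for real git status output:

```shell
# Sample lines hard-coded for illustration; in practice they would come from `git status`.
branch=$(printf '%s\n' 'On branch TA1692959' 'nothing to commit' |
  awk 'NR == 1 && match($0, /^On branch [A-Za-z0-9_]+/) {
         # RSTART and RLENGTH describe the whole match; skip the 10-character "On branch " prefix
         print substr($0, RSTART + 10, RLENGTH - 10)
         exit
       }')
printf '%s\n' "$branch"
```

With GNU awk specifically, match() also accepts an array as a third argument that receives the capture groups.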

Related

How do I obtain a specific row with the cut command?

Background
I have a file, named yeet.d, that looks like this
JET_FUEL = /steel/beams
ABC_DEF = /michael/jackson
....50 rows later....
SHIA_LEBEOUF = /just/do/it
....73 rows later....
GIVE_FOOD = /very/hungry
NEVER_GONNA = /give/you/up
I am familiar with the -f and -d options of the cut command. The -f option specifies which field(s) to extract, while the -d option specifies the delimiter.
Problem
I want this output returned using the cut command.
/just/do/it
From what I know, this is part of the command I want to enter, since I want the values to the right of the equals sign, with the equals sign as the delimiter:
cut -f2 -d= yeet.d
However, this would return:
/steel/beams
/michael/jackson
....50 rows later....
/just/do/it
....73 rows later....
/very/hungry
/give/you/up
Which is more than what I want.
Question
How do I use the cut command to return only /just/do/it and nothing else from the situation above? This is different from "How to get second last field from a cut command" because I want to select a row within a large file, not just one near the end or the beginning.

This looks like it would be easier to express with awk...
# awk -v _s="${_string}" '$3 == _s {print $3}' "${_path}"
## Above could be a more _scriptable_ form of the example below
awk -v _search="/just/do/it" '$3 == _search {print $3}' <<'EOF'
JET_FULE = /steal/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
EOF
## Either way, output should be similar to
## /just/do/it
-v _something="Some Thing" bit allows for passing Bash variables to awk
$3 == _search bit tells awk to match only when column 3 is equal to the search string
To search for a sub-string within a line one can use $0 ~ _search
{print $3} bit tells awk to print column 3 for any matches
And the <<'EOF' bit tells Bash to not expand anything within the opening and closing EOF tags
... however, the above will still output duplicate matches, eg. if yeet.d somehow contained...
JET_FULE = /steal/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
AGAIN = /just/do/it
... there'd be two /just/do/it lines output by awk.
Quickest way around that would be to pipe | to head -1, but the better way would be to tell awk to exit after it's been told to print...
_string='/just/do/it'
_path='yeet.d'
awk -v _s="${_string}" '$3 == _s {print $3; exit}' "${_path}"
... though that now assumes that only the first match is wanted. Obtaining the nth match is possible, but it is outside the scope of the question as of the last time I read it.
Updates
To trigger awk on the first column while printing the third column and exiting after the first match, the command may look like...
_string='SHIA_LEBEOUF'
_path='yeet.d'
awk -v _s="${_string}" '$1 == _s {print $3; exit}' "${_path}"
... and generalize even further...
_string='^SHIA_LEBEOUF '
_path='yeet.d'
awk -v _s="${_string}" '$0 ~ _s {print $3; exit}' "${_path}"
... because awk totally gets regular expressions, mostly.
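For the nth match mentioned earlier, one possible sketch is to count matches and exit at the wanted one. The _n variable and the four-line sample input are hypothetical and only illustrate the idea; the line number and key are printed so the matches can be told apart:

```shell
_n=2   # hypothetical: which match to report
nth=$(printf '%s\n' \
        'JET_FUEL = /steel/beams' \
        'SHIA_LEBEOUF = /just/do/it' \
        'NEVER_GONNA = /give/you/up' \
        'AGAIN = /just/do/it' |
  awk -v _s='/just/do/it' -v _n="$_n" '
    # count is incremented only on matching lines; stop at the _n-th one
    $3 == _s && ++count == _n { print NR ": " $1; exit }
  ')
printf '%s\n' "$nth"
```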

It depends on how you want to identify the desired line.
You could identify it by the line number. In this case you can use sed
cut -f2 -d= yeet.d | sed '53q;d'
This extracts the 53rd line.
Or you could identify it by a keyword. In this case use grep
cut -f2 -d= yeet.d | grep just
This extracts all lines containing the word just.
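If the row is identified by its key, the filtering can also come first so that cut only ever sees the wanted line. A minimal sketch, assuming exactly one space on each side of the equals sign and using three hard-coded lines as a stand-in for yeet.d:

```shell
# Three sample lines stand in for yeet.d.
value=$(printf '%s\n' \
          'JET_FUEL = /steel/beams' \
          'SHIA_LEBEOUF = /just/do/it' \
          'NEVER_GONNA = /give/you/up' |
  grep '^SHIA_LEBEOUF ' |   # anchor on the key so partial matches elsewhere are ignored
  cut -d' ' -f3)            # with a space delimiter the fields are: key, "=", value
printf '%s\n' "$value"
```

Splitting on the space rather than on = avoids the leading space that cut -d= -f2 would leave in front of the value.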

Looks for patterns across different lines

I have a file like this (test.txt):
abc
12
34
def
56
abc
ghi
78
def
90
And I would like to search for the 78 which is enclosed by "abc\nghi" and "def". Currently, I know I can do this by:
cat test.txt | awk '/abc/,/def/' | awk '/ghi/,/def/'
Is there any better way?

One way is to use flags
$ awk '/ghi/ && p~/abc/{f=1} f; /def/{f=0} {p=$0}' test.txt
ghi
78
def
{p=$0} this will save input line for future use
/ghi/ && p~/abc/{f=1} set flag if current line contains ghi and previous line contains abc
f; print input record as long as flag is set
/def/{f=0} clear the flag if line contains def
If you only want the lines between these two boundaries:
$ awk '/ghi/ && p~/abc/{f=1; next} /def/{f=0} f; {p=$0}' test.txt
78
$ awk '/12/ && p~/abc/{f=1; next} /def/{f=0} f; {p=$0}' test.txt
34
See also How to select lines between two patterns?

This is not really clean, but you can redefine the record separator as a regular expression: abc\nghi\n|\ndef. This creates multiple records, however, so you need to keep track of which ones sit between the right separators. With GNU awk you can check which separator text actually terminated a record using RT.
awk 'BEGIN{RS="abc\nghi\n|\ndef"}
(RT~/abc/){s=1; next}
(s==1)&&(RT~/def/){print $0}
{s=0}' file
This does the following:
set RS to abc\nghi\n or \ndef.
if the RT that ended the current record contains abc, the opening marker was found: set the flag and skip to the next record.
if the flag is set and the RT that ended this record contains def, print the record.
otherwise, clear the flag.

grep alternative
$ grep -Pazo '(?s)(?<=abc\nghi)(.*)(?=def)' file
but I think awk will be better

You could do this with sed. It's not ideal in that it doesn't actually understand records, but it might work for you...
sed -Ene 'H;${x;s/.*\nabc\nghi\n([0-9]+)\ndef\n.*/\1/;p;}' input.txt
Here's what's basically going on:
H - appends the current line to sed's "hold space"
${ - specifies the start of a series of commands that will be run once we come to the end of the file
x - swaps the hold space with the pattern space, so that future substitutions will work on what was stored using H
s/../../ - analyses the pattern space (which is now multi-line), capturing the data specified in your question, replacing the entire pattern space with the bracketed expression...
p - prints the result.
One important factor here is that the regular expression is ERE, so the -E option is important. If your version of sed uses some other option to enable support for ERE, then use that option instead.
Another consideration is that the regex above assumes Unix-style line endings. If you try to process a text file that was generated on DOS or Windows, the regex may need to be a little different.
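For DOS or Windows input, one option is to leave the regex alone and strip the carriage returns first. A minimal sketch, assuming a sed that accepts -E and using inline sample data instead of input.txt:

```shell
# The sample data uses \r\n line endings; tr removes every carriage return before sed runs.
result=$(printf 'abc\r\nghi\r\n78\r\ndef\r\n90\r\n' |
  tr -d '\r' |
  sed -En 'H;${x;s/.*\nabc\nghi\n([0-9]+)\ndef\n.*/\1/;p;}')
printf '%s\n' "$result"
```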

awk solution:
awk '/ghi/ && r=="abc"{ f=1; n=NR+1 }f && NR==n{ v=$0 }v && NR==n+1{ print v }{ r=$0 }' file
The output:
78
Bonus GNU awk approach:
awk -v RS= 'match($0,/\nabc\nghi\n(.+)\ndef/,a){ print a[1] }' file

Enclosing a single quote in Awk

I currently have this line of code, in which a number needs to be increased by one every time I run this script. I would like to use awk to increment the third field (570).
'set t 570'
I currently have this to change the code, but the output is missing the closing quotation mark. I would also like this to act only on this specific line (above), but I am unsure where to place the syntax that awk uses to do that.
awk '/set t /{$3+=1} 1' file.gs >file.tmp && mv file.tmp file.gs
Thank you very much for your input.

Use sub() to perform a replacement on the string itself:
$ awk '/set t/ {sub($3+0,$3+1,$3)} 1' file
'set t 571'
This looks for the value in $3 and replaces it with itself +1. To avoid replacing all of $3 and making sure the quote persists in the string, we say $3+0 so that it evaluates to just the number, not the quote:
$ echo "'set t 570'" | awk '{print $3}'
570'
$ echo "'set t 570'" | awk '{print $3+0}'
570
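Note that sub() replaces only the first match, and here it is restricted to $3, so the same number appearing elsewhere on the line is left untouched. Assigning to $3 does make awk rebuild the line with single spaces between fields, so any other spacing on that line is lost.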
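To act only on that specific line, the pattern can be tightened to match the whole quoted statement. A minimal sketch, assuming the line consists of exactly 'set t N'; the q variable is a hypothetical helper that passes the single quote into awk so no shell quoting tricks are needed inside the program:

```shell
q="'"   # hypothetical helper: the single quote, handed to awk with -v
result=$(printf '%s\n' "'set t 570'" "'set x 570'" |
  awk -v q="$q" '
    # only the line that is exactly: quote, "set t ", digits, quote
    $0 ~ ("^" q "set t [0-9]+" q "$") { sub(/[0-9]+/, $3 + 1, $3) }
    1
  ')
printf '%s\n' "$result"
```

On a real file, the output would be redirected to a temporary file and moved into place, as in the question.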

Multiple passes with awk and execution order

Two part question:
Part One:
First, I have a sequence AATTCCGG which I want to change to TTAAGGCC. I used gsub to change A to T, C to G, G to C and T to A. Unfortunately, awk executes these substitutions sequentially, so I ended up with AAAACCCC. I got around this by using upper and lower case, then converting back to upper case values, but I would like to do this in a single step if possible.
example:
echo AATTCCGG | awk '{gsub("A","T",$1);gsub("T","A",$1);gsub("C","G",$1);gsub("G","C",$1);print $0}'
OUTPUT:
AAAACCCC
Part Two:
Is there a way to get awk to run to the end of a file for one set of instructions before starting a second set? I tried some of the following, but with no success
for the data set
1 A
2 B
3 C
4 D
5 E
I am using the following pipe to get the data I want (Just an example)
awk '{if ($1%2==0)print $1,"E";else print $0}' test | awk '{if ($1%2==0 && $2=="E") print $0}'
I am using a pipe to rerun the program, however I have found that it is quicker if I don't have to rerun the program.

This can be efficiently solved with tr:
$ echo AATTCCGG | tr ATCG TAGC
Regarding part two (this should be a different question, really): awk rules see each input record only once per pass, so a pipe is the simplest way to go.

For part two, try this command:
awk '{if ($1%2==0)print $1,"E"}' test
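When a second set of rules genuinely needs information from the whole file, a common idiom is to name the file twice on the command line and tell the passes apart with NR == FNR. A minimal sketch under that assumption; the temporary file stands in for test from the question, and the "of total" column is only there to show that the first pass finished before the second began:

```shell
# Temporary sample file standing in for "test" from the question.
tmp=$(mktemp)
printf '%s\n' '1 A' '2 B' '3 C' '4 D' '5 E' > "$tmp"

# Pass 1 (NR == FNR): only count the records.
# Pass 2: print the even-numbered rows together with the total from pass 1.
two_pass=$(awk 'NR == FNR { total++; next }
                $1 % 2 == 0 { print $1, "E", "of", total }' "$tmp" "$tmp")
rm -f "$tmp"
printf '%s\n' "$two_pass"
```

This reads the file twice, so it does not work on piped input; in that case the lines can be stored in an array and processed in an END block instead.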

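Here is a method I have found for the first part of the question using awk. It uses an array and a for loop. Setting FS to the empty string makes each character its own field in GNU awk and mawk; POSIX leaves that behavior unspecified.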
cat sub.awk
awk '
BEGIN{d["G"]="C";d["C"]="G";d["T"]="A";d["A"]="T";FS="";OFS=""}
{for(i=1;i<(NF+1);i++)
{if($i in d)
$i=d[$i]}
}
{print}'
Input/Output:
ATCG
TAGC
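A variant that does not depend on how an empty FS is treated walks the string with substr() instead. This is a sketch of the same lookup-table idea, not a replacement for the tr answer:

```shell
complement=$(echo AATTCCGG | awk '
  BEGIN { d["A"] = "T"; d["T"] = "A"; d["C"] = "G"; d["G"] = "C" }
  {
    out = ""
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      # each character is looked up once, so earlier swaps cannot be undone by later ones
      out = out ((c in d) ? d[c] : c)
    }
    print out
  }')
printf '%s\n' "$complement"
```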

How to quote a shell variable in a TCL-expect string

I'm using the following awk command in an expect script to get the gateway for a particular destination
route | grep $dest | awk '{print $2}'
However the expect script does not like the $2 in the above statement.
Does anyone know of an alternative to awk to perform the same function as above? ie. output 2nd column.

You can use cut:
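route | grep $dest | cut -d \  -f 2
That uses a single space (written as a backslash followed by a space) as the field delimiter and pulls out the second field. Note that cut treats every space as a separator, so columns padded with several spaces produce empty fields.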

To answer your Expect question, single quotes have no special meaning to the Tcl parser. You need to use braces to protect the body of the awk script:
route | grep $dest | awk {{print $2}}
And as awk can do what grep does, you can get away with one less process:
route | awk -v d=$dest {$0 ~ d {print $2}}

Before switching to another utility, check whether changing the field separator works. The GNU Awk manual documents field separators.

SED is the best alternative to use. If you don't mind a dependency, Perl should also be sufficient to solve the task

Depending on the structure of your data, you can use either cut, or use sed to do both filtering and printing the second column.

Alternatively, you could use Perl:
perl -ne 'if(/foo/) { @_ = split(/:/); print $_[1]; }'
This will print second token of each line containing foo, with : as token separator.
