How to extract something occuring after some common paths? - awk

I want to filter out anything that occurs after some common paths. Example, Print out the next word that occurs every pytests/ OR after src/
for "src/cs-test/test_bugcheck_0001.py"
awk -F"/" '{print $2}' works
for "metadata/pytests/ipa-cert.yaml"
awk -F"/pytest/" '{print $2}' | awk -F"." '{print $1}' works
But I want to have these in one awk statement.
metadata/pytests/ipa-cert.yaml
src/cs-test/test_bugcheck_0001.py
Expected result:
ipa-cert
cs-test

I suggest using
sed -E 's,^(.*/pytests/|[^/]+/)([^/.]+).*,\2,' file > newfile
See the online sed demo and the regex demo (not proof).
POSIX ERE pattern details
^ - start of line
(.*/pytests/|[^/]+/) - Group 1: either of the two alternatives:
.*/pytests/ - any 0+ chars as many as possible and then /pytests/ string
| - or
[^/]+/ - a negated bracket expression matching 1+ chars other than / and then a /
([^/.]+) - Group 2: a negated bracket expression matching 1 or more chars other than / and .
.* - any 0 or more chars up to the line end.
The , chars are used as delimiters in the sed command so as not to overescape the pattern that has many / chars.

Simple substitutions on individual strings is what sed is designed to do. With GNU or OSX/BSD sed for -E:
$ sed -E 's:(^|.*/)(pytests|src)/([^/.]+).*:\3:' file
ipa-cert
cs-test
or if you really want to use awk for some reason then with GNU awk for gensub():
$ awk '{print gensub(/(^|.*\/)(pytests|src)\/([^/.]+).*/,"\\3",1)}' file
ipa-cert
cs-test
and with any awk:
$ awk 'match($0,/(^|.*\/)(pytests|src)\/[^/.]+/){$0=substr($0,1,RLENGTH); sub(/.*\//,"")} 1' file
ipa-cert
cs-test

Related

How to extract string from a file in bash

I have a file called DB_create.sql which has this line
CREATE DATABASE testrepo;
I want to extract only testrepo from this. So I've tried
cat DB_create.sql | awk '{print $3}'
This gives me testrepo;
I need only testrepo. How do I get this ?
With your shown samples, please try following.
awk -F'[ ;]' '{print $(NF-1)}' DB_create.sql
OR
awk -F'[ ;]' '{print $3}' DB_create.sql
OR without setting any field separators try:
awk '{sub(/;$/,"");print $3}' DB_create.sql
Simple explanation would be: making field separator as space OR semi colon and then printing 2nd last field($NF-1) which is required by OP here. Also you need not to use cat command with awk because awk can read Input_file by itself.
Using gnu awk, you can set record separator as ; + line break:
awk -v RS=';\r?\n' '{print $3}' file.sql
testrepo
Or using any POSIX awk, just do a call to sub to strip trailing ;:
awk '{sub(/;$/, "", $3); print $3}' file.sql
testrepo
You can use
awk -F'[;[:space:]]+' '{print $3}' DB_create.sql
where the field separator is set to a [;[:space:]]+ regex that matches one or more occurrences of ; or/and whitespace chars. Then, Field 3 will contain the string you need without the semi-colon.
More pattern details:
[ - start of a bracket expression
; - a ; char
[:space:] - any whitespace char
] - end of the bracket expression
+ - a POSIX ERE one or more occurrences quantifier.
See the online demo.
Use your own code but adding the function sub():
cat DB_create.sql | awk '{sub(/;$/, "",$3);print $3}'
Although it's better not using cat. Here you can see why: Comparison of cat pipe awk operation to awk command on a file
So better this way:
awk '{sub(/;$/, "",$3);print $3}' file

Recognising backslash in awk field separator

Input is
AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999
awk script sets field separator to "\x0D" (I have tried with and without escaping the backslash.
awk script is
BEGIN {FS="\\x0D"}
{print NF}
It should output 3 because there are 2 occurrences of the field separator but it outputs 1 which indicates it is not being recognized.
There are 2 ways to provide a regexp in awk - a static regexp (aka regexp literal) written as /regexp/ and a dynamic regexp (aka computed regexp) written as "regexp" and used in a regexp context. A field separator is just a regexp with some additional behavior so lets just consider regexps in general to explain what's going on in your example.
The split() function takes a field separator (a regexp for our purposes) as it's third argument so it provides a good test bed:
Using a static regexp:
$ awk '{print split($0,a,/\x0D/)}' file
1
The \ above is escaping the x, it's not a literal \. For that you need to escape the \ itself:
$ awk '{print split($0,a,/\\x0D/)}' file
3
What if we used a dynamic regexp instead of the above static regexp?
$ awk '{print split($0,a,"\x0D")}' file
1
$ awk '{print split($0,a,"\\x0D")}' file
1
$ awk '{print split($0,a,"\\\x0D")}' file
' is not a known regexp operator FNR=1) warning: regexp escape sequence `\
1
$ awk '{print split($0,a,"\\\\x0D")}' file
3
The behavior above is because awk first parses the string to convert it into a regexp (using up one layer of escape chars) and then parses it a second time when using it as a regexp (using up a second layer of escape chars).
Unfortunately when you specify a FS there is no option to specify it as a literal regexp, it's always specified using a string and thus is a dynamic regexp and so needs an extra layer of escaping:
$ awk -v FS='\x0D' '{print NF}' file
1
$ awk -v FS='\\x0D' '{print NF}' file
1
$ awk -v FS='\\\x0D' '{print NF}' file
' is not a known regexp operatorence `\
1
$ awk -v FS='\\\\x0D' '{print NF}' file
3
Now - what if you were using the wrong type of quotes in the shell part of the script, i.e. " instead of '? Then you introduce even more pain because now you're inviting the shell to also parse the string even before awk gets to see and parse it twice:
$ awk -v FS="\\\\x0D" '{print NF}' file
1
$ awk -v FS="\\\\\x0D" '{print NF}' file
' is not a known regexp operatorence `\
1
$ awk -v FS="\\\\\\x0D" '{print NF}' file
' is not a known regexp operatorence `\
1
$ awk -v FS="\\\\\\\x0D" '{print NF}' file
3
That's different from the case where the double quotes are using inside awk because that's all wrapped inside single quotes and so protected from the shell already:
$ awk 'BEGIN{FS="\\\\x0D"} {print NF}' file
3
So - in the shell always use the most restrictive quotes (' over " over none) unless you have a very specific reason not to, and when using regexps or field separators always use literal /.../ rather than dynamic "...", again unless you have a very specific reason not to.
The odd, truncated looking error message above are because of the \rs the tool is trying to print due to the escape sequence we're providing, they're really all warning: regexp escape sequence '\^M' is not a known regexp operator
You need two backslashes for a literal backslash since \ is an escape character:
$ echo 'AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999' |
awk 'BEGIN{ FS="\\\\x0D" } { print NF }'
3

AWK that reads up to the /

I have the following lines of text :
170311 005201 0433 DE(N) itemhandling itemAddBarCodeData: Barcode(1/1) <0157357069/OK> ##[ti=7672,
170311 005323 0433 DE(N) itemhandling itemAddBarCodeData: Barcode(1/1) </NOREAD> ##[ti=7672,
I have the following script :
grep "itemAddBarCodeData" %myItemHandling% | gawk -F "[<>]+" -v OFS=, "{for(i=1;i<=NF;++i){if($i~/Barcode/){print substr($1,5,2)substr($1,3,2)substr($1,1,2),substr($1,8,6),$(i+1)}}}" > %myOutputPath%%myFilename%
What I need is a script that reads only the /NOREAD and the /OK so the output is like :
11-03-17,00:52:01,NOREAD
11-03-17,00:53:23,OK
any help would be greatly appreciated
Thanks
Complex gawk approach:
awk -F"[ />]" '{patsplit($1, a, /[0-9]{2}/); patsplit($2, b, /[0-9]{2}/);
printf("%s-%s-%s,%s:%s:%s,%s\n",a[3],a[2],a[1],b[1],b[2],b[3],$10)}' inpufile
The output:
11-03-17,00:52:01,OK
11-03-17,00:53:23,NOREAD
-F"[ />]" - "composite" field separator
patsplit(string, array [, fieldpat [, steps ] ])
Divide string into pieces defined by fieldpat and store the pieces in array and
the separator strings in the seps array.
You can use this following script:
script.awk
/\/[A-Z]+>/ { match($1"-"$2,/(..)(..)(..)-(..)(..)(..)/,ts)
dt=mktime( sprintf("20%s %s %s %s %s %s",
ts[1], ts[2], ts[3],
ts[4], ts[5], ts[6]) )
dtd = strftime( "%d-%m-%y", dt )
dts = strftime( "%H:%M:%S", dt )
match ( $0, /\/[A-Z]+>/) # set RSTART and RLENGTH
print dtd, dts, substr( $0, RSTART+1, RLENGTH-2)
}
Run it like this: awk -v OFS=, -f script.awk yourfile
The important part is the second match function call, which matches
a string of capital letters [A_Z]
preceded by a /
followed by a >.
It should match the OK and NOREAD case and not the Barcode(1/1).
The variables
RSTART and
RLENGTH
are set by the match function, we have to correct them by +1 and -2, because the match RE included / and >.
The first match, mktime, strftime and the sprintf function call are another way the format the date and time. The time functions are GNU AWK extensions.
Regular awk version:
awk '
{
d=$1$2
gsub(/../,"& ",d)
split(d,T)
split($8,R,"[/>]")
printf "%s-%s-%s,%s:%s:%s,%s\n",T[3],T[2],T[1],T[4],T[5],T[6],R[2]
}
' file
With script in file:
script.awk:
{
d=$1$2
gsub(/../,"& ",d)
split(d,T)
split($8,R,"[/>]")
printf "%s-%s-%s,%s:%s:%s,%s\n",T[3],T[2],T[1],T[4],T[5],T[6],R[2]
}
awk -f script.awk file
crammed on one line..
awk '{d=$1$2; gsub(/../,"& ",d); split(d,T); split($8,R,"[/>]"); printf "%s-%s-%s,%s:%s:%s,%s\n",T[3],T[2],T[1],T[4],T[5],T[6],R[2]}' file
You don't need grep when you're using awk. With GNU awk for gensub():
$ awk '/itemAddBarCodeData/{print gensub(/(..)(..)(..) (..)(..)(..).*\/([^>]+).*/,"\\3-\\2-\\1,\\4:\\5:\\6,\\7",1)}' file
11-03-17,00:52:01,OK
11-03-17,00:53:23,NOREAD
Here's a pragmatic combination of awk and sed that is conceptually relatively simple:
On Linux and BSD/macOS:
awk -F'[ />]' -v OFS=, '/itemAddBarCodeData/ {print $1, $2, $10}' file |
sed -E 's/^(..)(..)(..),(..)(..)(..)/\3-\2-\1,\4:\5:\6/'
On a Windows system, invoked from cmd.exe, different quoting and line continuation rules apply (assumes the presence of ported GNU utilities):
awk -F"[ />]" -v OFS=, "/itemAddBarCodeData/ {print $1, $2, $10}" file ^
| sed -E "s/^(..)(..)(..),(..)(..)(..)/\3-\2-\1,\4:\5:\6/"
Note how:
"..." strings rather than '...' strings must be used to protect the embedded content from interpretation by the shell
Unlike with "..." on Unix, $ has no special meaning to cmd.exe, so it can be used as-is.
^ as the very last character on a line serves as the explicit line-continuation character, and the line must be broken before the | (whereas on Unix a line ending in | is implicitly continued).
This is only used for readability here; of course, you can place your command on a single line.

Why I can't use as delimiter in awk the string "?B?"

By running the following I am getting as a result the string "utf-8"
I thought that with this command I would had string "tralala" returned
echo "=?utf-8?B?tralala" | awk -F "?B?" '{print $2 }'
Why is that?
What delimiter should I use in order to get the string "tralala" ?
? is a regex metacharacter that means zero or one matches of the preceding atom. (I'm surprised awk didn't complain about the one at the start but .)
Try echo "=?utf-8?B?tralala" | awk -F '\\?B\\?' '{print $2 }' instead.
Awk delimiters are NOT strings, they are "Field Separators" (hence the variable named FS) which are a type of Extended Regular Expression with some additional features (e.g. a single blank char as the field separator when not inside square brackets means separate by all chains of contiguous white space and ignore leading and trailing white space on each record).
The difference between a string, a regular expression, and a field separator are very important to be aware of. You sometimes also see the word "pattern" used - do not use that term, it has no (or too many possible) meaning.
A ? is an RE metacharacter so you need to tell awk not to treat it as such in your case by either of these methods:
$ echo "=?utf-8?B?tralala" | awk -F '[?]B[?]' '{print $2}'
tralala
$ echo "=?utf-8?B?tralala" | awk -F '\\?B\\?' '{print $2}'
tralala
You don't strictly need to do that for the first ? as it's metacharacter functionality is not applicable when it's the first char in an RE:
$ echo "=?utf-8?B?tralala" | awk -F '?B[?]' '{print $2}'
tralala
$ echo "=?utf-8?B?tralala" | awk -F '?B\\?' '{print $2}'
tralala
but IMHO it's best to do it anyway for clarity and future-proofing.

awk to transpose lines of a text file

A .csv file that has lines like this:
20111205 010016287,1.236220,1.236440
It needs to read like this:
20111205 01:00:16.287,1.236220,1.236440
How do I do this in awk? Experimenting, I got this far. I need to do it in two passes I think. One sub to read the date&time field, and the next to change it.
awk -F, '{print;x=$1;sub(/.*=/,"",$1);}' data.csv
Use that awk command:
echo "20111205 010016287,1.236220,1.236440" | \
awk -F[\ \,] '{printf "%s %s:%s:%s.%s,%s,%s\n", \
$1,substr($2,1,2),substr($2,3,2),substr($2,5,2),substr($2,7,3),$3,$4}'
Explanation:
-F[\ \,]: sets the delimiter to space and ,
printf "%s %s:%s:%s.%s,%s,%s\n": format the output
substr($2,0,3): cuts the second firls ($2) in the desired pieces
Or use that sed command:
echo "20111205 010016287,1.236220,1.236440" | \
sed 's/\([0-9]\{8\}\) \([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{3\}\)/\1 \2:\3:\4.\5/g'
Explanation:
[0-9]\{8\}: first match a 8-digit pattern and save it as \1
[0-9]\{2\}...: after a space match 3 times a 2-digit pattern and save them to \2, \3 and \4
[0-9]\{3\}: and at last match 3-digit pattern and save it as \5
\1 \2:\3:\4.\5: format the output
sed is better suited to this job since it's a simple substitution on single lines:
$ sed -r 's/( ..)(..)(..)/\1:\2:\3./' file
20111205 01:00:16.287,1.236220,1.236440
but if you prefer here's GNU awk with gensub():
$ awk '{print gensub(/( ..)(..)(..)/,"\\1:\\2:\\3.","")}' file
20111205 01:00:16.287,1.236220,1.236440