Awk: retrieve lines based on a range of columns of a file - awk

I want to retrieve the lines that contain only space characters in the 16 characters starting at character 83, and ignore all lines that have any other character/string/integer in this range.
Here is my file.txt:
7653903747235209876401 HGFDKJKK 98765435475237 caJHGFDSQ200 00779999 654321000704 2014100812204898764513165432
7653903747235209854311 KJH 98765435475280 lkjUIHJ100808442700001298765432 654321009999 2014100812204898764513165432
7653903747235209854311 BBB 98765435475280 lkjUIHJ100808442700001298765432 654321009999 2014100812204898764513165432
7653903747235209876401 GHJUYTHH 98765435475237 caJHGFDSQ200 00779999 654321000704 2014100812204898764513165432
Here is my code; I want to add this condition to it:
#!/bin/sh
var='^20141008'
awk -v var=$var '$1~/[01]1$/ && $7 ~ var' file.txt

You could add another regex match to your current awk line:
$ awk -v var="$var" '$1~/[01]1$/ && $7 ~ var && substr($0,83,16) ~ /^ +$/' file.txt
The check is that the substring containing 16 characters starting from character 83 matches the pattern. The pattern ensures that only spaces occur between the start and end of the string.
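As a quick sanity check of the substr() test, here is a minimal sketch on synthetic data. The two records are made up for illustration (82 filler characters followed by 16 characters that are either all spaces or not); they are not the asker's real file.txt.

```shell
# Hypothetical 100-character test records (not the real file.txt):
# "blank" has only spaces in columns 83-98, "dirty" has an X at column 90.
filler=$(printf '%82s' '' | tr ' ' 'A')
blank="${filler}$(printf '%16s' '')ZZ"
dirty="${filler}$(printf '%7s' '')X$(printf '%8s' '')ZZ"

# Keep only the records whose columns 83-98 consist entirely of spaces.
result=$(printf '%s\n%s\n' "$blank" "$dirty" |
  awk 'substr($0, 83, 16) ~ /^ +$/ { print "record " NR " passes" }')
printf '%s\n' "$result"
```

Only the first record should be reported, because the X in the second one makes the /^ +$/ test fail.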

It could be:
awk -v var="$var" '$1~/[01]1$/ && $7~var && /^.{82} {16}/' file.txt
Actually, referring to $7 can cause problems, as it seems that this is not a whitespace-separated list but a COBOL-like fixed-width list. So I may reinterpret it as a single pattern using absolute positions:
var=20141008
awk -v var="$var" 'match($0,"^.{20}[01]1.{60} {16}.{38}"var)' file.txt
or, a little shorter:
var=20141008
awk '/^.{20}[01]1.{60} {16}.{38}'"$var"'/' file.txt
In older gawk (before 4.0), the --posix or --re-interval option must be added to enable interval expressions such as {16}.
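If interval expressions are not available, the same absolute-position checks can be sketched with substr() instead. The records below are synthetic, and the column positions (21-22, 83-98, 137-144) are simply read off the pattern above; pad is a hypothetical helper for building test data.

```shell
# pad N C: print N copies of the character C (helper for building test data).
pad() { printf "%${1}s" '' | tr ' ' "$2"; }

var=20141008
spaces=$(printf '%16s' '')
# Hypothetical fixed-width records: cols 21-22 key, 83-98 blank, 137-144 date.
good="$(pad 20 x)11$(pad 60 x)${spaces}$(pad 38 x)${var}tail"
bad="$(pad 20 x)21$(pad 60 x)${spaces}$(pad 38 x)${var}tail"

# Three position checks, equivalent to ^.{20}[01]1.{60} {16}.{38} followed by var.
result=$(printf '%s\n%s\n' "$good" "$bad" | awk -v var="$var" '
  substr($0, 21, 2)  ~ /^[01]1$/ &&
  substr($0, 83, 16) ~ /^ +$/    &&
  substr($0, 137, 8) == var { print "record " NR " matches" }')
printf '%s\n' "$result"
```

The second record is rejected because its key in columns 21-22 is 21, which does not match [01]1.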

Related

Chain awk regex matches like grep

I am trying to use awk to select/remove data based on cell entries in a CSV file.
How do I chain Awk commands to build up complex searches like I have done with grep? I plan to use Awk to select rows based on matching criteria in cells in multiple columns, not just the first column as in this example.
Test data
123,line1
123a,line2
abc,line3
G-123,line4
G-123a,line5
Separate Awk statements with intermediate files
awk '$1 !~ /^[[:digit:]]/ {print $0}' file.txt > output1.txt
awk '$1 !~ /^G-[[:digit:]]/ {print $0}' output1.txt > output2.txt
mv output2.txt output.txt
cat output.txt
Chained or multi-line grep version (I think limited to first column only)
grep -v \
-e "^[[:digit:]]" \
-e "^G-[[:digit:]]" \
file.txt > output.txt
cat output.txt
How can I rewrite the Awk command to avoid the intermediate files?
Generally, in awk there are boolean operators available (it's better than grep! :) )
awk '/match1/ || /match2/' file
awk '(/match1/ || /match2/ ) && /match3/' file
and so on ...
In your example you could use something like:
awk -F, '$1 ~ /^[[:digit:]]/ || $1 ~ /^G-[[:digit:]]/' input >> output
Note: This is just an example of how to use boolean operators. The regular expression itself could also have been used here to express the alternative match:
awk -F, '$1 ~ /^(G-)?[[:digit:]]/' input >> output
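To illustrate how || and && combine with parentheses, here is a small sketch run against the test data from the question. The extra "ends with a" condition is made up for the demonstration and is not part of the original problem.

```shell
data='123,line1
123a,line2
abc,line3
G-123,line4
G-123a,line5'

# Rows whose first cell starts with a digit or with G-, AND ends with "a".
result=$(printf '%s\n' "$data" |
  awk -F, '($1 ~ /^[[:digit:]]/ || $1 ~ /^G-/) && $1 ~ /a$/')
printf '%s\n' "$result"
```

Without the parentheses, && would bind tighter than ||, so the a$ condition would apply only to the G- branch.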
In your awk commands and example, awk regards each line of file.txt as having only one field because you have not defined FS, so the default whitespace field separator is used.
With that said, you can easily AND your two pattern matches together like this:
awk '($1 !~ /^[[:digit:]]/) && ($1 !~ /^G-[[:digit:]]/) {print $0}' file.txt
To make awk use comma as a field separator, you can define it in a BEGIN block. In this example, the output should be just line3
awk 'BEGIN {FS=","} ($1 !~ /^[[:digit:]]/) && ($1 !~ /^G-[[:digit:]]/) {print $2}' file.txt
I would suggest the literal translation of that grep command in awk is
awk '
/^[[:digit:]]/ {next}
/^G-[[:digit:]]/ {next}
{print}
' file.txt
But you have several examples of how to write it more concisely.
You can use
awk '$1 !~ /^(G-)?[[:digit:]]/' file.txt > output.txt
awk tries to find the following in field 1:
^ - start of string
(G-)? - an optional G- char sequence (note the regex flavor in awk is POSIX ERE, so (...) denotes a group and ? denotes a zero-or-one quantifier)
[[:digit:]] - a digit.
If a match is found, the record (= line) is not printed. Otherwise, the line is printed.
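A small sketch of that decision for each first cell of the question's test data. The "dropped"/"kept" labels are only for illustration; the real command prints the kept lines and nothing else.

```shell
data='123,line1
123a,line2
abc,line3
G-123,line4
G-123a,line5'

# Label each first cell according to whether the regex matches it.
result=$(printf '%s\n' "$data" |
  awk -F, '{ print $1 ": " (($1 ~ /^(G-)?[[:digit:]]/) ? "dropped" : "kept") }')
printf '%s\n' "$result"
```

Only abc should be labelled as kept, which agrees with the expected output line3 mentioned earlier.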
To stick to your question, I would use:
awk '$1 !~ /^[[:digit:]]/ && $1 !~ /^G-[[:digit:]]/' file.txt > output.txt
But I like @Wiktor Stribiżew's regex approach!
With your shown samples, this can also be done with grep in a single regexp, without chaining several patterns. I am adding this solution in case you or anyone else needs it.
grep -v -E '^(G-)?[[:digit:]]' Input_file
Explanation: grep's -v option omits lines that match the pattern, and -E enables ERE (extended regular expressions). The regex ^(G-)?[[:digit:]] matches lines that start with a digit, optionally preceded by G-, so those lines are not printed.

Testing in awk for data with square brackets [syslog]

I have a text file like this:
File1 [test]
File1 sgfg
File1 fdgsfg
File1 [rsyslog]
File1 moredata
File1 MAX_EVENTS = 256
File1 fgsfg
File1 [other]
File1 Not this
File2 [syslog]
File2 extra
File2 MAX_EVENTS = 12
With awk I would like to match field $2 when it contains [syslog].
For example, this works:
awk '$2~/\[syslog\]/' file
But I would like to define the pattern in advance using a variable.
These do not work:
awk -v var="[syslog]" '$2~var' file
awk -v var="\[syslog\]" '$2~var' file
awk -v var="syslog" '{test="["var"]"} $2~test' file
This works, since both sub calls must succeed and the text must match, but it is complicated :)
awk -v var="syslog" 'sub(/^\[/,"",$2) && sub(/\]/,"",$2) && $2==var' file
Working cases:
$ awk -v var='[syslog]' 'index($2, var)' file
File2 [syslog]
$ awk -v var='syslog' '$2~"\\[" var "\\]"' file
File2 [syslog]
$ awk -v var='[[]syslog[]]' '$2~var' file
File2 [syslog]
Basically take care of the escaping, or don't use regex matching.
As Ed kindly mentioned in the comment, ] alone does not need to be escaped:
awk -v var='syslog' '$2~"\\[" var "]"' file
awk -v var='[[]syslog]' '$2~var' file
You didn't say if you wanted a full or partial match, or if you wanted a string or regexp match, so here are some options:
Full string match:
awk -v var='[syslog]' '$2 == var' file
Partial string match:
awk -v var='[syslog]' 'index($2,var)' file
Full regexp match:
awk -v var='[[]syslog]' '$2 ~ "^"var"$"' file
Partial regexp match:
awk -v var='[[]syslog]' '$2 ~ var' file
There are of course, many other ways to do that too including escaping regexp metachars within the awk script to make them literal, specifying the string between [...] in the var then adding them in the awk script, matching just at the start or end of the field, etc.
See How do I find the text that matches a pattern? for more info on the different kinds of matching and Is it possible to escape regex metacharacters reliably with sed (applies to awk too) for how to escape regexp metachars to make them be treated as literal.
How about something like this?
awk -v var="[syslog]" '$2 == var' my_file
A bit of explanation. If you don't need regular expression matching you can just use == operator which compares strings literally.
Your "Not working" examples weren't working because:
1. The regular expression is not correct. [syslog] is a bracket expression that matches a single character, any of s, y, l, o, g.
2. The escaping is not correct; var="\\\\[syslog\\\\]" would have worked. But awk should have warned you about this with the message awk: warning: escape sequence '\[' treated as plain '['.
3. This builds the same string [syslog] inside the script, so it fails for the same reason as the first example.
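The bracket-expression pitfall can be seen in a short sketch. The four lines below are a shortened, illustrative version of the question's file.

```shell
data='File1 [test]
File1 moredata
File2 [syslog]
File2 extra'

# Regex match: [syslog] is a bracket expression, so any $2 that contains
# one of the characters s, y, l, o, g matches.
regex_hits=$(printf '%s\n' "$data" | awk -v var='[syslog]' '$2 ~ var')
# String comparison: only the literal text [syslog] matches.
exact_hits=$(printf '%s\n' "$data" | awk -v var='[syslog]' '$2 == var')
printf 'regex:\n%s\nexact:\n%s\n' "$regex_hits" "$exact_hits"
```

The regex version also reports [test] and moredata, while the string comparison reports only the [syslog] line.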

How to extract number with awk in quotes after equal sign

I have something like this in my parameters:
config_version = "1.2.3"
I am trying to get 1.2.3 without quotes with awk command, is it possible ?
how I get quoted number:
awk '/config_version =/ {print $3}' params.txt
output: "1.2.3"
desired: 1.2.3
Find the line with the right label, strip the quotes from the value, and print it.
$ awk '$1=="config_version"{gsub(/"/,"",$NF); print $NF}' file
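A minimal sketch of how this might be used to capture the value in a shell variable. The input is fed through printf instead of a file, the name line is made up for the demo, and the added exit (stop after the first match) is an assumption about what is wanted.

```shell
# Hypothetical parameter lines; only config_version comes from the question.
version=$(printf 'name = "demo"\nconfig_version = "1.2.3"\n' |
  awk '$1 == "config_version" { gsub(/"/, "", $NF); print $NF; exit }')
printf 'version is %s\n' "$version"
```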
And also with awk:
$ echo 'config_version = "1.2.3"' | awk -F'=' '{gsub(/"/,"",$2);print $2}'
1.2.3
I'd use gsub to remove leading and trailing "s:
$ awk '{gsub(/^"|"$/,"",$3);print $3}'
The obligatory (or, perhaps "one of", rather than "the". There are lots of ways to do this!) sed solution:
sed -n '/^config_version *= */{y/"/ /; s///p;}'
Note that this leaves a trailing space in the result.
Use grep:
echo 'config_version = "1.2.3"' | grep -Po 'config_version\s+=\s+"\K[^"]+'
1.2.3
Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
\K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.
SEE ALSO:
grep manual
perlre - Perl regular expressions
You might set FS so that " is treated as part of the field separator. Let the content of file.txt be:
config_version = "1.2.3"
then
awk 'BEGIN{FS="[ \"]+"}/config_version =/{print $3}' file.txt
output
1.2.3
Explanation: I instructed AWK to treat any non-empty string consisting of spaces, ", or a combination thereof as the field separator. If you want to know more about FS and the other built-in variables, I suggest reading 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR.
(tested in gawk 4.2.1)
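To see how that separator splits the line, the fields can be listed one per line. This is only a sketch for inspection; the bracketed output format is made up for readability.

```shell
# Print every field with its index so the effect of FS is visible.
fields=$(printf 'config_version = "1.2.3"\n' |
  awk 'BEGIN { FS = "[ \"]+" } { for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

The version number should appear as field 3. Because the closing quote also counts as a separator, an empty field after it is to be expected.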
Using gnu awk, you might also use a pattern with a capture group and print the group 1 value using m[1]
awk 'match($0, /config_version = "([^"]+)"/, m) {print m[1]}' file
If there should be digits optionally followed by a dot and digits:
awk 'match($0, /config_version = "([0-9]+(\.[0-9]+)*)"/, m) {print m[1]}' file
Output
1.2.3
There are three ways to do it (these should work in mawk as well as gawk). The clean way (\042 is the octal escape for the double quote "):
awk 'BEGIN { FS = "\042" } $1 ~ /config_version *= *$/ {print $2}' params.txt
I specify $1 ~ on the off chance that the phrase shows up AFTER the version number in badly formatted data. The trailing " *" allows for the space between = and the opening quote. A more extreme version asks FS to do all the work:
awk 'BEGIN { FS = "(^[ \t]*config_version[ \t]*=[ \t]*\042|\042.*$)"
} NF==3 {print $2}' params.txt
Here I let FS gobble up the rest of the record, from left to right, so NF==3 ensures that only this exact scenario matches. And finally, a purist approach:
awk 'BEGIN { FS = "(^[ \t]*config_version[ \t]*=[ \t]*\042|\042.*$)" ;
OFS = "" } NF == 3 { $1 = $1; print }' params.txt

How do I obtain a specific row with the cut command?

Background
I have a file, named yeet.d, that looks like this
JET_FUEL = /steel/beams
ABC_DEF = /michael/jackson
....50 rows later....
SHIA_LEBEOUF = /just/do/it
....73 rows later....
GIVE_FOOD = /very/hungry
NEVER_GONNA = /give/you/up
I am familiar with the -f and -d options of the cut command. The -f option lets you specify which column(s) to extract, while the -d option lets you specify the delimiter.
Problem
I want this output returned using the cut command.
/just/do/it
From what I know, this is part of the command I want to enter:
cut -f2 -d= yeet.d
Given that I want the values to the right of the equals sign, with the equals sign as the delimiter. However this would return:
/steel/beams
/michael/jackson
....50 rows later....
/just/do/it
....73 rows later....
/very/hungry
/give/you/up
Which is more than what I want.
Question
How do I use the cut command to return only /just/do/it and nothing else from the situation above? This is different from How to get second last field from a cut command because I want to select a row within a large file, not just near from the end or the beginning.
This looks like it would be easier to express with awk...
# awk -v _s="${_string}" '$3 == _s {print $3}' "${_path}"
## The above could be a more _scriptable_ form of the example below
awk -v _search="/just/do/it" '$3 == _search {print $3}' <<'EOF'
JET_FULE = /steal/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
EOF
## Either way, output should be similar to
## /just/do/it
-v _something="Some Thing" bit allows for passing Bash variables to awk
$3 == _search bit tells awk to match only when column 3 is equal to the search string
To search for a sub-string within a line one can use $0 ~ _search
{print $3} bit tells awk to print column 3 for any matches
And the <<'EOF' bit tells Bash to not expand anything within the opening and closing EOF tags
... however, the above will still output duplicate matches, eg. if yeet.d somehow contained...
JET_FULE = /steal/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
AGAIN = /just/do/it
... there'd be two /just/do/it lines output by awk.
Quickest way around that would be to pipe | to head -1, but the better way would be to tell awk to exit after it's been told to print...
_string='/just/do/it'
_path='yeet.d'
awk -v _s="${_string}" '$3 == _s {print $3; exit}' "${_path}"
... though that now assumes that only the first match is wanted; obtaining the nth is possible, though outside the scope of the question as of the last time I read it.
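For completeness, picking the nth match could be sketched with a counter. The variable names n and seen are illustrative, and printing the first column (rather than the third) is done here only to show which row was selected.

```shell
data='JET_FUEL = /steel/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
AGAIN = /just/do/it'

# Count the rows whose third column equals the search string and print
# the key of the n-th such row, then stop reading.
result=$(printf '%s\n' "$data" |
  awk -v _s='/just/do/it' -v n=2 '$3 == _s && ++seen == n { print $1; exit }')
printf '%s\n' "$result"
```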
Updates
To trip awk on the first column while printing the third column and exiting after the first match may look like...
_string='SHIA_LEBEOUF'
_path='yeet.d'
awk -v _s="${_string}" '$1 == _s {print $3; exit}' "${_path}"
... and generalize even further...
_string='^SHIA_LEBEOUF '
_path='yeet.d'
awk -v _s="${_string}" '$0 ~ _s {print $3; exit}' "${_path}"
... because awk totally gets regular expressions, mostly.
It depends on how you want to identify the desired line.
You could identify it by the line number. In this case you can use sed
cut -f2 -d= yeet.d | sed '53q;d'
This extracts the 53rd line.
Or you could identify it by a keyword. In this case use grep
cut -f2 -d= yeet.d | grep just
This extracts all lines containing the string just.
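If the row should be chosen by the name on the left of the equals sign, one possible sketch is to let grep pick the row before cut extracts the column. It assumes exactly one space on each side of the = sign, as in the sample; the data is a shortened version of yeet.d.

```shell
data='JET_FUEL = /steel/beams
SHIA_LEBEOUF = /just/do/it
GIVE_FOOD = /very/hungry'

# cut cannot select rows by itself, so grep picks the row by its key first;
# splitting on a single space then makes the path the third field.
result=$(printf '%s\n' "$data" | grep '^SHIA_LEBEOUF ' | cut -d' ' -f3)
printf '%s\n' "$result"
```

Splitting on the space rather than on = avoids the leading space that -d= -f2 would leave in front of the path.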

Exact string match in awk

I have a file test.txt with the following lines
1997 100 500 2010TJ
2010TJXML 16 20 59
I'm using the following awk line to get information only about the string 2010TJ
awk -v var="2010TJ" '$0 ~ var {print $0}' test.txt
But the code prints both lines. I want to know how to get only the line containing the exact string
1997 100 500 2010TJ
The string can be placed in any column of the file.
Several options:
Use a gawk word boundary (not POSIX awk...):
$ gawk '/\<2010TJ\>/' file
An actual space or tab, or whatever separates the columns:
$ awk '/^2010TJ /' file
Or compare the field directly to the string:
$ awk '$1=="2010TJ"' file
You can loop over the fields to test each field if you wish:
$ awk '{for (i=1;i<=NF;i++) if ($i=="2010TJ") {print; next}}' file
Or, given your example of setting a variable, the same commands using a variable:
$ gawk -v s=2010TJ '$0~"\\<" s "\\>"' file
$ awk -v s=2010TJ '$0~"^" s " "' file
$ awk -v s=2010TJ '$1==s' file
Note that the first differs from the second and third. The first matches the standalone string 2010TJ anywhere in $0; the second and third match only when the line starts with that string as its first column.
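Since the string may sit in any column, the field loop can also take its search string from a variable. Here is a small sketch on the two lines from the question.

```shell
data='1997 100 500 2010TJ
2010TJXML 16 20 59'

# Print a line as soon as one of its fields is exactly equal to s.
result=$(printf '%s\n' "$data" |
  awk -v s=2010TJ '{ for (i = 1; i <= NF; i++) if ($i == s) { print; next } }')
printf '%s\n' "$result"
```

The second line is skipped because its first field is 2010TJXML, which is not equal to 2010TJ.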
Try this (for testing only column 1) :
awk '$1 == "2010TJ" {print $0}' test.txt
or grep like (all columns) :
gawk '/\<2010TJ\>/ {print $0}' test.txt
Note: \< and \> are word boundaries (a GNU awk extension).
Another awk with a word boundary (also gawk only):
gawk '/\y2010TJ\y/' file
Note: \y matches at either the beginning or the end of a word.
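Where gawk is not available, a rough stand-in for the word-boundary operators is to require whitespace or a line edge on both sides. This sketch assumes that the columns are separated by spaces or tabs and that s contains no regex metacharacters. Unlike \y, it does not treat punctuation as a boundary.

```shell
data='1997 100 500 2010TJ
2010TJXML 16 20 59'

# s must be preceded by start-of-line or blank, and followed by blank or end-of-line.
result=$(printf '%s\n' "$data" |
  awk -v s=2010TJ '$0 ~ ("(^|[ \t])" s "([ \t]|$)")')
printf '%s\n' "$result"
```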