Add a number by subtracting an existing number by awk

I would like to convert
Title Page/4,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
Contents/16,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
to
Title Page 1/4,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
Contents 13/16,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
The rule is to subtract 3 from the number following / and insert the result, preceded by a space, in front of the /.
I tried to do that with awk.
awk -F',/' '{gsub(/\//, ($2-10) + "\/"}' myfile
but it doesn't work. Why not? Thanks.

A slight modification to your attempt produces the desired output:
$ awk -F'[,/]' '{sub(/\//, " " ($2-3) "/") }1' file
Title Page 1/4,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
Contents 13/16,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
-F is used to specify the input field separator. I have changed it to a bracket expression which matches commas and slashes, which means that the second field $2 contains the number that follows the slash. As you are only interested in making a single substitution in each record, I have used sub rather than gsub. Note that in awk, strings are concatenated simply by writing them next to each other; + is numeric addition, so you shouldn't use it to join strings.
Awk programs are structured like condition { action }. If no condition is specified, the action block is always run. If no action is specified, the default action is { print }, which prints the record. In the above script, 1 is used to print the record, as it is the simplest expression that evaluates to true.
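To illustrate the remark about + versus concatenation, here is a minimal sketch. The function name concat_vs_plus and the sample value 16 are made up for the illustration; only standard awk behaviour is assumed.

```shell
# Hypothetical helper; it only wraps the awk call so it can be reused.
concat_vs_plus() {
    awk 'BEGIN {
        n = 16
        sum    = (n - 3) + "/"   # numeric addition: "/" converts to 0
        joined = (n - 3) "/"     # juxtaposition concatenates the two values
        print sum                # 13
        print joined             # 13/
    }'
}
concat_vs_plus
```

The first line of output shows why the attempt with + loses the slash: the string "/" is treated as the number 0.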

Related

filter unique parameters from file

I have a file that contains URLs plus params like the following:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
https://example.com/endpoint/endpoint2/?param1=123
https://example.com/endpoint/endpoint2/
https://example.com/endpoint/endpoint5/"//i.example.com/00/s/Nzk5WDEwMjQ=/z/47IAAOSwBu5hXIKF
and I need to keep only URLs with unique params.
The desired output:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
I managed to keep only URLs with params using grep:
grep -E '(\?[a-zA-Z0-9]{1,9}\=)'
but I need to filter the params at the same time, so I tried awk with the same regex,
but it gives an error:
awk '{sub(\?[a-zA-Z0-9]{1,9}\=)} !seen[$0]++'
Update:
I am sorry for editing the desired output, but when I tried the scripts I figured out that there is a lot of garbage in my file that needs to be filtered too.
I tried @James Brown's answer with some editing and it looks good except for the last line, which unfortunately it does not filter:
awk -F '?|&' '$2&&!a[$2]++'
To be clearer about why that output is good for me:
it chose the 1st line because it has at least param1,
the 2nd line because it has at least param3,
the 3rd line because it has at least param2.
The comparison method here is to pick each line by its unique parameter, whether or not that parameter is joined to others with the & character.
Edited version after the requirements changed somewhat:
$ awk -F? '{ # ? as field delimiter
split($2,b,/&/) # split at & to get whats between ? and &
if(b[1]!=""&&!a[b[1]]++) # no ? means no $2
print
}' file
Output as expected. Original answer was:
A short one:
$ awk -F? '$2&&!a[$2]++' file
Explained: Split records at ? (-F?) and if there is a second field ($2) and (&&) it is unique this far by counting the instances of the parameters in the array a (!a[$2]++), output it.
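As a rough sketch of how the $2 && !a[$2]++ test behaves, the snippet below runs it on four made-up URLs (the paths and parameters are invented for the illustration, and first_by_query is a hypothetical wrapper name).

```shell
# Hypothetical wrapper around the one-liner from the answer above.
first_by_query() {
    # $2 is everything after the first "?"; a[$2]++ is 0 (false) only the
    # first time a given query string is seen, so !a[$2]++ is true once.
    awk -F'?' '$2 && !a[$2]++'
}

printf '%s\n' \
    'https://example.com/a/?p=1' \
    'https://example.com/b/?p=1' \
    'https://example.com/c/' \
    'https://example.com/d/?q=2' | first_by_query
```

Only the /a/ and /d/ lines are printed: /b/ repeats the query string p=1, and /c/ has no ? and therefore an empty $2.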
EDIT: Following solution may help when query string has ? as well as & present in it and we want to consider both of them for removing duplicates.
awk '
/\?/{
match($0,/\?[^&]*/)
val=substr($0,RSTART,RLENGTH)
match($0,/&.*/)
if(!seen[val]++ && !seen[substr($0,RSTART,RLENGTH)]++){
print
}
}' Input_file
2nd solution: (Following solution may help when we don't have & parameters in query string) With your shown samples, please try following awk program.
awk 'match($0,/\?.*$/) && !seen[substr($0,RSTART,RLENGTH)]++' Input_file
Or the above can be shortened as follows (as per Ed's suggestion). The parentheses around the assignment are needed because = binds less tightly than &&:
awk '(s=index($0,"?")) && !seen[substr($0,s)]++' Input_file
Explanation: match() finds everything from the first ? to the end of the line. The second condition, joined with &&, keeps a line only the first time that matched text is seen, so only lines with unique query strings are printed.
With gnu awk, you could also match the url till the first occurrence of the question mark, and then capture what follows using your initial pattern for the first parameter ([a-zA-Z0-9]{1,9}=[^&]+) followed by matching any character except the &
Then you can use the !seen[$0]++ part with the value of capture group 1.
awk '
match($0, /https?:\/\/[^?]+\?([a-zA-Z0-9]{1,9}=[^&]+)/, arr) && !seen[arr[1]]++
' file
Output
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
Using awk you can check that the string starts with the protocol and contains a question mark.
Then to get the first parameter only, you can split on ? and & and use the second part of the split for seen
awk '
/^https?:\/\/[^?]*\?/ && split($0, arr, /[?&]/) > 1 && !seen[arr[2]]++
' file

Retrieve matched regex record-separator using Gnu AWK

Using AWK, I am processing a text file by splitting it into multiple records. As a record separator RS I use a regular expression. Is there a way to obtain the found record separator as RS only represents the regex string?
Example:
BEGIN { RS="a[0-9]*. "; ORS="\n-----\n"}
/foo/ {print $0 RS;}
END {}
input file:
a1. Hello
this
is foo
a2. hello
this
is bar
a3. Hello
this
is foo
output:
Hello
this
is foo
a[0-9]*.
-----
Hello
this
is foo
a[0-9]*.
-----
As you see, the output is printing RS as a string representing the regular expression, but not printing the actual value.
How can I retrieve the actual matched value of the record separator?
expected output:
Hello
this
is foo
a1
-----
Hello
this
is foo
a3
-----
In POSIX compliant AWK, the record separator RS is only a single character, hence it is easy to print it back, like this:
awk 'BEGIN{RS="a"}{print $0 RS}'
GNU AWK, on the other hand, does not limit RS to be a one-character string but allows it to be any regular expression. In this case, it becomes a bit more tricky to use the above AWK because RS is a regular expression and not a string.
To this end, GNU AWK introduced the variable RT which represents nothing more than the found record separator. When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.
So naively, one could update your AWK program as:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 RT}
Unfortunately, RT is set to the separator found after the current record, and it seems the OP wants the one found before the current record. Hence you can introduce a new variable pRT, which could be read as "previous record separator found".
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT}
and as Shaki Siegal pointed out in the comments, you still have to update pRT to remove the final space and dot:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT;sub(/[.] $/,"",pRT)}
note: The original RS of the OP (RS="a[0-9]*. ") has been updated for improved matching to RS="a[0-9]+[.] ". This ensures that at least one digit follows the a and that it is followed by a literal dot.
If, as the original example indicates, the record separator always appears at the beginning of a line, RS should be slightly modified into RS="(^|\n)a[0-9]+[.] ". Dito's comment also made various excellent points. So if the string a[0-9]+. always appears at the beginning of a line, you need to process a bit more:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ {
if (RT ~ /^$/ && NR != 2) pRT = substr(pRT,2)
print $0 pRT
}
{pRT=RT;sub(/[.] $/,"",pRT)}
Here, we added a correction to fix the last record.
If there are more than two AWK records (the first record is always empty), you need to remove the first newline character from pRT; otherwise you include an extra newline caused by the last record, which ends with a newline (in contrast to all others).
If there are only two AWK records (one effective record in the text), you should not apply this correction, as the first RT does not start with a newline.
The final improvement is done by realising that we always remove the initial newline in pRT if it is there, so we can merge it all in a single gsub:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ { print $0 pRT }
{pRT=RT;gsub(/^\n|[.] $/,"",pRT)}
RS: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text.
The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.
ORS: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.
RT: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
source: GNU AWK manual
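The null-string case mentioned in the manual excerpt (records separated by runs of blank lines) works in POSIX awk as well, so it can be tried without gawk. A minimal sketch, with a made-up function name and made-up input:

```shell
# Hypothetical helper: prints the record number and first word of each
# blank-line-separated paragraph.
first_word_per_paragraph() {
    awk 'BEGIN { RS = "" } { print NR ": " $1 }'
}

# Two paragraphs separated by a run of blank lines.
printf 'alpha\nbeta\n\n\ngamma\ndelta\n' | first_word_per_paragraph
```

This prints "1: alpha" and "2: gamma": each paragraph is one record, and within it newlines also act as field separators.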
This might work for you (GNU sed):
sed -rn '/^a[0-9]+\.\s/{:a;x;/foo/{s/^(a[0-9]+\.)\s*(.*)/\2\n\1\n-----/p;$d};x;h;b};H;$ba' file
Gather up lines starting from one that begins with an., where n is an integer (for example a1.). If the gathered lines contain the word foo, make the required substitution and print the result; otherwise do nothing.
Apology: When I began the solution the question was tagged sed.
When a line beginning with an. is encountered, that line replaces whatever was in the hold space. Before it does, however, the hold space is checked: if it contains the word foo (i.e. a collection already exists and meets the requirement), the lines are formatted as required and printed. Other lines are appended to the hold space. End-of-file is a special case that has to be handled in the same way as a line beginning with an.; this is allowed for by the addition of the goto label :a.
With GNU awk, which you're already using for multi-char RS, the builtin variable that contains the string that matched the RS regexp is RT.
We need to fix your RS setting, though. You need a regexp for RS that matches either a<integer><dot><blank> at the start of a line ((^|\n)a[0-9]+[.] ) or a newline on its own at the end of the file (\n$), so that the last record in the file is parsed the same as all the rest. Below is how to write that. Note that RT will start with a newline for all except the very first match in the file, so we need to strip that leading newline from RT to get the actual identifier we want to print for each record:
$ cat tst.awk
BEGIN {
RS = "(^|\n)a[0-9]+[.] |\n$"
ORS = "\n-----\n"
}
/foo/ { print $0 "\n" id }
{ id = gensub(/^\n|[.] /,"","g",RT) }
Here's what it does given this input which includes more rainy-day cases than are present in the question (you should test other proposed solutions against this):
input:
$ cat file
a1. Hello
this
is foo bat man
a2. hello
this
is bar
a3. Hello
this is a7. just fine
is foo
output:
$ awk -f tst.awk file
Hello
this
is foo bat man
a1
-----
Hello
this is a7. just fine
is foo
a3
-----
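For readers without GNU awk, here is a sketch of a different technique that avoids RS and RT altogether: read the file line by line in plain POSIX awk and remember the identifier of the record being collected. It is not what the answers above do. The names sample_input, foo_records, flush, rec and id are hypothetical, and the input is the sample from the question.

```shell
# Hypothetical helper that reproduces the question's sample input.
sample_input() {
    printf 'a1. Hello\nthis\nis foo\na2. hello\nthis\nis bar\na3. Hello\nthis\nis foo\n'
}

# Line-oriented alternative: no regexp RS, no RT.
foo_records() {
    awk '
    # Print the collected record and its identifier if it mentions foo.
    function flush() {
        if (rec ~ /foo/) printf "%s\n%s\n-----\n", rec, id
    }
    # A line starting with a<integer><dot><blank> opens a new record.
    /^a[0-9]+[.] / {
        flush()
        id = $1; sub(/[.]$/, "", id)            # "a1." becomes "a1"
        rec = $0; sub(/^a[0-9]+[.] /, "", rec)  # drop the separator text
        next
    }
    # Any other line is appended to the current record.
    { rec = rec "\n" $0 }
    END { flush() }
    '
}

sample_input | foo_records
```

On the question's input this prints the a1 and a3 records in the expected format. It assumes the file starts with a separator line; lines before the first separator would be collected with a leading newline.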

Filter fields with multiple delimiters

I've done extensive searching for a solution but can't quite find what I need. Have a file like this:
aaa|bbb|ccc|ddd~eee^fff^ggg|hhh|iii
111|222|333|444~555^666^777|888|999
AAA|BBB|CCC||EEE|FFF
What I want to do is use awk or something else to return lines from this file with a change to field 4(pipe delimited). Field 4 has a tilde and caret as delimiters which is where I'm struggling. We want the lines returned as this:
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
If field 4 is empty, it's returned as is. But when field 4 has multiple values, we want the first value right after the tilde returned only.
awk -F "[|^~]" 'BEGIN{OFS="|"}NF==6{print} NF==9{print $1,$2,$3,$5,$8,$9}' tmp.txt
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
use a regular expression as your delimiter
count the fields to decide what to do
set the output delimiter to pipe
$ awk -F'|' '{sub(/^[^~]*~/, "", $4); sub(/\^.*/, "", $4)} 1' OFS='|' file
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
This approach makes no assumption about the contents of fields other than field 4. The other fields may, for example, contain ~ or ^ characters and that will not affect the results.
How it works
-F'|'
This sets the field delimiter on input to |.
sub(/^[^~]*~/, "", $4)
If field 4 contains a ~, this removes the first ~ and everything before the first ~.
sub(/\^.*/, "", $4)
If field 4 contains ^, this removes the first ^ and everything after it.
1
This is awk's cryptic shorthand for print-the-line.
OFS='|'
This sets the field separator on output to |.
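The role of OFS is easiest to see by changing it. When a field is modified, awk rebuilds the record by joining the fields with OFS, so leaving OFS at its default turns every | into a space. A small sketch, with a hypothetical helper name and a made-up three-field line:

```shell
# Hypothetical helper: $1 is the output field separator to use.
trim_field2() {
    # Cut field 2 at the first "~"; the change makes awk rebuild the
    # record, joining the fields with OFS.
    awk -F'|' -v OFS="$1" '{ sub(/~.*/, "", $2) } 1'
}

echo 'a|b~c|d' | trim_field2 ' '   # default-like OFS: a b d
echo 'a|b~c|d' | trim_field2 '|'   # OFS set to |:     a|b|d
```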

Why does "1" in awk print the current line?

In this answer,
awk '$2=="no"{$3="N/A"}1' file
was accepted. Note the 1 at the end of the AWK script. In the comments, the author of the answer said
[1 is] a cryptic way to display the current line.
I'm puzzled. How does that work?
In awk, since 1 always evaluates to true, it triggers the default action {print $0}, which prints the current line stored in $0.
So, awk '$2=="no"{$3="N/A"}1' file is equivalent to and shorthand of
awk '$2=="no"{$3="N/A"} {print $0}' file
Again $0 is default argument to print, so you could also write
awk '$2=="no"{$3="N/A"} {print}' file
In fact, you could also use any non-zero number, or any condition which always evaluates to true, in place of 1.
The documentation says
In an awk rule, either the pattern or the action can be omitted, but not both. If the pattern is omitted, then the action is performed for every input line. If the action is omitted, the default action is to print all lines that match the pattern.
So, it treats 1 as pattern with no action. The default action is to print the line.
Even if you have a couple of rules, like in
awk '
in_net {
if (/^\s+bindIp:/) {
print " bindIp: 0.0.0.0"
next
} else if (/^\s*(#.*)?$/) {
in_net = 0
}
}
/^net:/ {
in_net = 1
}
1
' /etc/mongod.conf
You still need the 1, since the default action is triggered only by a rule that has no action.
AWK works on the principle of condition, then action: if a condition is true, the action attached to it is executed.
Writing 1 supplies a condition that is always true, and since no action is given, awk's default action, print, is performed.
That is why 1 is used as a shortcut.
I thought I’d add an answer that explains how this shorthand works in terms of the POSIX specification for awk:
Basic description:
An awk program is composed of pairs of the form:
pattern { action }
Missing action:
Either the pattern or the action (including the enclosing brace characters) can be omitted.
A missing pattern shall match any record of input, and a missing action shall be equivalent to:
{ print }
Description of pattern
A pattern is any valid expression
Description of Expression patterns:
An expression pattern shall be evaluated as if it were an expression in a
Boolean context. If the result is true, the pattern shall be considered to
match, and the associated action (if any) shall be executed.
Boolean context:
When an expression is used in a Boolean context, if it has a numeric value,
a value of zero shall be treated as false and any other value shall be
treated as true. Otherwise, a string value of the null string shall be
treated as false and any other value shall be treated as true.
In the example of awk '$2=="no"{$3="N/A"}1', the pattern of the first pair is $2=="no" with a corresponding action of $3="N/A". This leaves 1 by itself as the next “pair” (pattern without a corresponding action).
Instead of 1, this lone expression pattern could be any non-zero numeric value or non-empty string, e.g.,
awk 9999
awk '"string"'
The awk 1 short-hand is fine when typing one-liners in an interactive shell. On the other hand, when writing scripts, I prefer my code to be more maintainable and readable for others by using the more explicit awk '{ print }'.
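The Boolean-context rule quoted above can be checked directly. The sketch below counts how many of two input lines each lone pattern lets through; count_printed is a hypothetical helper name.

```shell
# Hypothetical helper: $1 is a complete awk program consisting of a lone
# pattern; the function reports how many of two input lines it prints.
count_printed() {
    printf 'one\ntwo\n' | awk "$1" | wc -l | tr -d ' '
}

count_printed 1            # 2: non-zero number, true
count_printed 9999         # 2: non-zero number, true
count_printed '"string"'   # 2: non-empty string, true
count_printed 0            # 0: zero is false
count_printed '""'         # 0: the null string is false
```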

where to put brackets in awk

Hello everyone. I am very confused about the brackets {} in awk. For example, I have written this code:
{
FNR == 3 { print $1 " age is " $2 }
}
but it gave me an error on the outer brackets and no error on the brackets around the print statement. Why is that? Also, in the following code
{
s = $1
d = $2
no = $1 + $2
{print no}
}
when I remove the outer brackets, my arguments are displayed as many times as there are lines. Why is that? I am very confused, kindly help me.
Thanks
An awk script consists of commands. Each command has a pattern and an action:
pattern1 { action1 }
pattern2 { action2 }
For each line in the input, awk tests each pattern and performs the corresponding action when the pattern is true.
The pattern can be omitted, in which case it is taken as always true and the action is performed for each line. Similarly, the action can be omitted, in which case it is taken as a print; this lets you easily use awk to select lines without changing the lines.
With this structure in mind, we can interpret the given examples. The first one is a single action that is applied to every line, but the action isn't well formed, because a pattern { action } pair cannot appear inside an action. If you remove the outer brackets, it becomes a distinct pattern and action, both of which are correctly constructed.
The second example is also applied to every line. It takes the first two (whitespace separated) fields from the line, adds them as numbers, and prints the result. Removing the outer brackets gives you three patterns without corresponding actions, and an action without a pattern. Each pattern is an assignment whose value is usually true, so its implicit print is usually invoked and the whole line is printed. Similarly, the action is always invoked, printing the value of no.
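The difference can be reproduced with a cut-down version of the second example. This is a sketch: the function names are hypothetical and the input line "1 2" is made up.

```shell
# Inside braces: the assignment is a statement in an action.
braces_demo() {
    awk '{ no = $1 + $2; print no }'
}

# Without the outer braces: the assignment becomes a pattern with no
# action, so when its value is non-zero the whole line is printed too.
no_braces_demo() {
    awk 'no = $1 + $2
         { print no }'
}

echo '1 2' | braces_demo      # 3
echo '1 2' | no_braces_demo   # "1 2", then 3
```

With three assignment lines, as in the question, the input line would be printed three times before the sum, which matches the repeated output the question describes.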