For gawk, how to set FS and RS in the same command as an awk script?

I have an awk command that returns the duplicates in an input stream with
awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}'
However, I want to change the field separator characters and record separator characters before I do that. The command I use for that is
FS='\n' RS='\n\n'
Yet I'm having trouble making that happen. Is there a way to effectively combine these two commands into one? Piping one to the other doesn't seem to work either.

The action of a BEGIN rule is executed before any input is read:
awk 'BEGIN{FS="\n";RS="\n\n"}{a[$0]++}END{for (i in a)if (a[i]>1)print i;}'
or you can specify them using command line options like:
awk -F '\n' -v RS='\n\n' '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}'
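For example (a quick sketch, with a hypothetical file.txt of blank-line-separated two-line records):
$ printf 'foo\nbar\n\nfoo\nbar\n\nbaz\n' > file.txt
$ awk 'BEGIN{FS="\n";RS="\n\n"}{a[$0]++}END{for (i in a)if (a[i]>1)print i}' file.txt
foo
bar
The two-line record foo/bar occurs twice in the input, so it is reported (once) as a duplicate.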

Why can the AWK program's FS variable be specified with the -F flag of the gawk (or other awk) interpreter/command?

Let me explain: AWK is a programming language and gawk is one of many interpreters for AWK. gawk interprets/executes/runs the AWK program given to it. So why can the FS (field separator) variable be specified with gawk's -F flag? I find it kind of unnatural... and how does it technically do that?
My best guess as to "why" is convenience: FS is probably the most used/manipulated awk variable, so having a short option to set it is helpful.
Consider
awk -F, '...' file.csv
# vs
awk 'BEGIN {FS=","} ...' file.csv
"How does it technically do that" -- see https://git.savannah.gnu.org/cgit/gawk.git/tree/main.c#n1586
Historically, -F was already implemented in gawk v1.01, so it would have existed in whatever legacy awk gawk was based on.
Additionally, the POSIX specification mandates -F.
So why can the FS (field separator) variable be specified with gawk's -F flag?
The awk man page notes that:
Command line variable assignment is most useful for dynamically assigning values to the variables AWK uses to control how input is broken into fields and records. It is also useful for controlling state if multiple passes are needed over a single data file.
So -F comes in handy when the field separator is not etched in stone but rather computed dynamically, as -F lets you use a bash variable easily. Imagine that you were tasked with developing part of a bash script which should output the last field of each line of file.txt, using the character stored in the variable sep as the separator; then you could do that the following way:
awk -F "${sep}" '{print $NF}' file.txt
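For instance (a quick sketch, assuming sep holds a comma and file.txt is comma-separated):
$ sep=','
$ printf 'a,b,c\nd,e\n' > file.txt
$ awk -F "${sep}" '{print $NF}' file.txt
c
e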
As for "I find it kind of unnatural": this depends on what you have used before. A cut user who wants to get the 3rd column from a CSV file might do it the following way:
cut -d , -f 3 file.csv

Remove field-internal newlines in CSV file

I tried different awk methods to achieve this, but since I don't really understand how awk works, I didn't succeed.
So, I have a (large) CSV file that contains multi-line entries such as this:
"99999";"xyz";"text
that has
multiple newlines";"fdx";"xyz"
I need to get rid of those extra newlines in between the quotes.
Since every line ends with a double quote followed by a newline, I thought I could create a command that replaces all newlines except the ones that are preceded by a double quote.
How would I do that?
Chances are all you need is this, using GNU awk for multi-char RS:
awk -v RS='\r\n' '{gsub(/\n/," ")}1' file
since your input is probably a CSV exported from a Windows tool like Excel and so has \r\n "line" endings but individual \ns for newlines within fields.
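For instance, feeding the sample above with \r\n line endings through it (a sketch; printf makes the \r explicit):
$ printf '"99999";"xyz";"text\nthat has\nmultiple newlines";"fdx";"xyz"\r\n' | awk -v RS='\r\n' '{gsub(/\n/," ")}1'
"99999";"xyz";"text that has multiple newlines";"fdx";"xyz"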
Alternatively, again using GNU awk for multi-char RS and RT:
$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file
"99999";"xyz";"text that has multiple newlines";"fdx";"xyz"
or if you want all the chains of newlines compressed to single blanks:
$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n+/," ",RT); print $0 RT}' file
"99999";"xyz";"text that has multiple newlines";"fdx";"xyz"
If you need anything else, including being able to identify and use the individual fields on each input "line", see What's the most robust way to efficiently parse CSV using awk?.

awk set command line options in script

I'm curious about how to set command-line options in an awk script, like -F for the field separator. I tried to write the shebang line like
#!/usr/bin/awk -F ":" -f
and get the following error:
awk: 1: unexpected character '.'
For this example, I can make do with
BEGIN {FS=":"}
but I still want to know a way to set all those options. Thanks in advance.
EDIT:
Let's use another example that should be easy to test.
inputfile:
1
2
3
4
test.awk:
#!/usr/bin/awk -d -f
{num += $1}
END { print num}
Running
/usr/bin/awk -d -f test.awk inputfile
will print 10 and generate a file called awkvars.out with some awk global variables in it.
but
./test.awk inputfile
will fail with
awk: cmd. line:1: ./test.awk
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: ./test.awk
awk: cmd. line:1: ^ unterminated regexp
If I remove '-d' from the shebang line,
./test.awk inputfile
will normally output 10.
My question is whether there is a way to write "-d" in the test.awk file to generate the awkvars.out file.
Answering the OP's question, beyond the setting of FS:
Short Answer: you cannot pass multiple options with '#!'; the shebang line carries at most one argument, and that single slot is already needed for -f, so you are out of luck.
Long Answer:
When using a shebang (#!), there is a limit of a single argument, which is passed to the named program as its 1st argument, followed by the path of the script itself. So in general:
#! /path/to/prog arg1
input-1
input-2
will execute /path/to/prog arg1 /path/to/script; note the script is handed over by path as the next argument, not on stdin. This is an oversimplification, the actual rules are more complex, see https://unix.stackexchange.com/questions/87560/does-the-shebang-determine-the-shell-which-runs-the-script
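A quick way to see the single-argument rule on Linux (where everything after the interpreter path is passed as one argument; demo is a throwaway file name):
$ printf '#!/bin/echo one two\n' > demo && chmod +x demo
$ ./demo
one two ./demo
/bin/echo received "one two" as a single first argument, followed by the script's own path.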
Given this limitation of one argument, when executing awk the only valid and required option is '-f', which tells awk that the next argument (the script path appended by the kernel) is the program file. You can bundle a few other options that do NOT take an argument in front of it, for example -P for POSIX mode or -c for traditional mode (e.g., '#!/usr/bin/awk -Pf' will force POSIX behavior).
As far as I can tell, all the 'interesting' options (setting FS, RS, ORS, ...) take an argument of their own and would need to be separated from the '-f' by a space, making it impossible to embed them in the shebang line, other than using 'BEGIN { ... }' or similar in the script.
Bottom line, trying #! /usr/bin/awk -f -F, passes the single argument '-f -F,' to awk, which parses it as -f with the file name ' -F,', i.e. awk will look for a program file named ' -F,'. Usually not very useful, and it will not set the FS.
Let's say the following is our Input_file, which we are going to use for all of the solutions mentioned here.
cat Input_file
a,b,c,d
ab,c
1st way of setting the field separator: the first, simple way is setting the FS value in the BEGIN section of the awk program file. The following is our .awk file.
cat file1.awk
BEGIN{
FS=","
}
{
print $1"..."$2
}
Now when we run the code, the following output is produced:
/usr/local/bin/awk -f file1.awk Input_file
a...b
ab...c
2nd way of setting the field separator: the 2nd way is to pass the FS value before the Input_file name, as follows.
/usr/local/bin/awk -f file.awk FS="," Input_file
Example: the following is the file.awk file which contains the awk code.
cat file.awk
{
print $1".."$2
}
Now when we run the awk file with the awk -f ... command, the result is as follows.
/usr/local/bin/awk -f file.awk FS="," Input_file
a..b
ab..c
This means the above program picks up , as the field separator.
3rd way of setting the field separator: we can set the field separator for awk -f programs just like we do for usual awk programs, using the -F',' option as follows.
/usr/local/bin/awk -F',' -f file.awk Input_file
a..b
ab..c
4th way of setting the field separator: we can pass the field separator as a variable by using the -v option on the command line while running the file.awk script as follows.
/usr/local/bin/awk -v FS=',' -f file.awk Input_file
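which again produces:
a..b
ab..c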
Never use a shebang to call awk, as it robs you of the ability to separate shell arguments into awk arguments and awk variables, and to do anything else that's better done in shell (e.g. arg parsing with getopts) before calling awk. Just call awk from inside your shell script.
Also, don't name your shell script test.awk: it's a shell script, and the fact that it's implemented in awk is irrelevant. There's no reason to create a file that you sometimes call as awk file (to have awk interpret it) and other times as just file (to have the shell interpret it).
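A minimal sketch of that advice (a hypothetical wrapper named sum.sh, reusing the OP's summing program):
#!/usr/bin/env bash
# shell-level work goes here: getopts, argument validation, etc.
# then hand the text processing to awk
awk '{num += $1} END {print num}' "$@"
Then ./sum.sh inputfile prints 10, and any extra options can be parsed in the shell before awk ever runs.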

Append prefix to first column of a file with awk

I have a couple hundred files which I want to process with xargs. They all need a fix to their first column.
Therefore I need an awk command to add the prefix "ID_" to the first column of each file (except for the first header line). Can anyone help me with this?
Something along the line:
gawk -f ';' "{$1='ID_' $1; print $0}" file.csv > file_processed.csv
I am no expert with the command, though. And I would rather have some in-place processing instead of making a copy of each file. Previously, I did it in Vim, but then I only had one file.
:%s/^-/ID_/
I hope someone can help me here.
gawk 'BEGIN{FS=";"; OFS=";"} {if(NR>1) $1="ID_"$1; print}' file.csv > file_processed.csv
FS and OFS set the input and output field separators, respectively.
NR>1 checks whether current line number is larger than 1, so we don't modify the header line.
You can also modify the file in place with -i inplace option:
gawk -i inplace 'BEGIN{FS=";"; OFS=";"} {if(NR>1) $1="ID_"$1; print}' file.csv
Edit
After elaborating the original question, here's the final version:
gawk -i inplace 'BEGIN{FS=OFS=";"} NR>1{sub(/^-/,"ID_",$2)} 1' file.csv
which substitutes a - at the beginning of the second column with ID_.
The NR>1 action applies to all but the first (header) line. 1 invokes the default print action.
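For example (a sketch, assuming a hypothetical file.csv whose second column has a leading -; -i inplace dropped so the result prints to stdout):
$ cat file.csv
id;name;value
1;-foo;10
2;-bar;20
$ gawk 'BEGIN{FS=OFS=";"} NR>1{sub(/^-/,"ID_",$2)} 1' file.csv
id;name;value
1;ID_foo;10
2;ID_bar;20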
If you just want to do something, particularly adding a prefix, on the first field, it is no different from adding the prefix to the whole line.
So you can just run awk '$0="ID_"$0' file.csv and it should do the job. If you want to make it change in place, you can:
awk '$0="ID_"$0' file.csv >/tmp/foo && mv /tmp/foo file.csv
You can also make use of sed:
sed -i 's/^/ID_/' file
The -i does the in-place modification.
You mentioned Vim and gave the s/^-/ID_/ command; note it doesn't add the prefix ID_, it replaces a leading - with ID_, which is different.

Using shell variables in gensub in Awk

I tried to answer a question asked here
How do I sed only matched grep line
as below
awk -F":" -v M="Mary Jane" -v A="Runs" -v S="Sleeps" '{OFS=":" ; print gensub(/(M):(A)/,"\\1;S","g")}'
but it did not work, so I am guessing gensub is not able to recognize the shell variables M, A and S passed to awk. Is there a way to use shell variables in gensub in awk?
What you're experiencing is not because you're using gensub(), it's to do with regexps. There are regexp literals, e.g. $0~/foo/, and then there are dynamic regexps which are strings that are converted to regexps when they are evaluated, e.g. $0~"foo" or {var="foo"} $0~var. If you want to use a variable in ANY regexp context then you need a dynamic regexp and so that is the syntax you need to use, not the regexp literal syntax. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps.
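A minimal illustration of the difference (throwaway input):
$ echo 'foo' | awk -v var='foo' '$0 ~ /var/{print "literal /var/ matched"} $0 ~ var{print "dynamic var matched"}'
dynamic var matched
The literal /var/ looks for the three characters v-a-r; only the dynamic form uses the variable's value.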
The correct syntax for what you wrote would be:
awk -F":" -v M="Mary Jane" -v A="Runs" -v S="Sleeps" '{OFS=":" ; print gensub("("M"):"A,"\\1;"S,"g")}'
but that has some semantic issues: e.g. it allows partial matches; it sets OFS on every line instead of once, when OFS is never even used; it almost certainly unnecessarily uses "g" instead of 1; and why use regexps at all when you're really trying to do exact matches on strings?
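For completeness, a hedged sketch of the exact-string alternative hinted at there (assuming input lines whose first two :-separated fields should equal M and A exactly):
$ printf 'Mary Jane:Runs\nJohn:Runs\n' | awk -F':' -v M="Mary Jane" -v A="Runs" -v S="Sleeps" '$1==M && $2==A{print $1 ";" S; next} 1'
Mary Jane;Sleeps
John:Runs
String comparison sidesteps regexp metacharacter issues in the variables entirely.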