awk unable to remove new line character in a specific column

awk unable to remove new line character in a specific column - awk

I'm trying to remove newline character in the first column in csv file using awk but it doesn't seems to work
Sample File:
"This
is a test
","Something","Something"
"This is
another
test","something","something"
"One
more
test","something","something"
The command i'm using is
awk -F, '{gsub("\n","",$1); print}' sample
The output doesn't remove the new line character
I'm looking for a solution using awk not sed or perl
Can someone please help?
The output required is,
"This is a test","something","something"
"This is another test","something","something"
"One more test","something","something"

Assuming what you have is a CSV exported from Excel or some other windows tool (since that's what it looks like) and so it has \r\n line endings, all you need is this with GNU awk for multi-char RS:
$ awk -v RS='\r\n' -F'\n' '{$1=$1}1' file
"This is a test ","Something","Something"
"This is another test","something","something"
"One more test","something","something"
Otherwise with GNU awk for multi-char RS this would work for the sample you posted:
$ awk -v RS='"\\s+("|$)' -F'\n' '{$1=$1; gsub(/^"?|"?$/,"\"")}1' file
"This is a test ","Something","Something"
"This is another test","something","something"
"One more test","something","something"

Related

Remove field-internal newlines in CSV file

I tried different awk methods to achieve this, but since I don't really understand how awk works, I didn't succeed.
So, I have a - large - csv-file that contains multi-line entries such as this:
"99999";"xyz";"text
that has
multiple newlines";"fdx";"xyz"
I need to get rid of those extra newlines in between the quotes.
Since every line ends with a double quote, followed by a newline, I thought I could create a command that replaces all newlines, except the ones that are prepended by a double-quote.
How would I do that?

Chances are all you need is this, using GNU awk for multi-char RS:
awk -v RS='\r\n' '{gsub(/\n/," ")}1' file
since your input is probably a CSV exported from a Windows tool like Excel and so has \r\n "line" endings but individual \ns for newlines within fields.
Alternatively, again using GNU awk for multi-char RS and RT:
$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file
"99999";"xyz";"text that has multiple newlines";"fdx";"xyz"
or if you want all the chains of newlines compressed to single blanks:
$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n+/," ",RT); print $0 RT}' file
"99999";"xyz";"text that has multiple newlines";"fdx";"xyz"
If you need anything else, including being able to identify and use the individual fields on each input "line", see What's the most robust way to efficiently parse CSV using awk?.

awk command to print columns with colum data

cat file1.txt | awk -F '{print $1 "|~|" $2 "|~|" $3}' > file2.txt
I am using above command to filter first three columns from file1 and put into file.
But only getting the column names and not the column data.
How to do that?
|~| - is the delimiter.
file1.txt has values as :
a|~|b|~|c|~|d|~|e
1|~|2|~|3|~|4|~|5
11|~|22|~|33|~|44|~|55
111|~|222|~|333|~|444|~|555
my expedted output is :
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333

With your shown samples, please try following awk code. You need to set field separator to |~| and remove starting space from lines, then print the lines.
awk -F'\\|~\\|' -v OFS='|~|' '{sub(/^[[:blank:]]+/,"");print $1,$2,$3}' Input_file
In case you want to keep spaces(which was in initial post before edit) then try following:
awk -F'\\|~\\|' -v OFS='|~|' '{print $1,$2,$3}' Input_file
NOTE: Had a chat with user in room and got to know why this code was not working for user because of gunzip -c file was being used wrongly, its output was being saved into a variable on which user was running awk program, so correcting that command generated right file and awk program ran fine on it. Adding this as a reference for future readers.

One approach would be:
awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
The approach simply replaces all "|~|" with a "," setting the output file separator to "|~|". All leading whitespace is trimmed with sub().
Example Use/Output
With your data in file1.txt, you would have:
$ awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
Let me know if this is what you intended. You can simply redirect, e.g. > file2.txt to write to the second file.

For such cases, my bash+awk script rcut comes in handy:
rcut -Fd'|~|' -f-3 ip.txt
The -F option enables fixed string input delimiter (which is given using the -d option). And by default, the output field separator will also be same as -d when -F is active. -f-3 is similar to cut syntax to specify first three fields.
For better speed, use hck command:
hck -Ld'|~|' -D'|~|' -f-3 ip.txt
Here, -L enables literal field separator and -D specifies output field separator.
Another benefit is that hck supports -z option to automatically handle common compressed formats based on filename extension (adding this since OP had an issue with compressed input).

Another way:
sed 's/|~|/\t/g' file1.txt | awk '{print $1"|~|"$2"|~|"$3}' > file2.txt
First replace the |~| delimiter, and use the default awk separator, then print columns what you need.

Use awk to interpret }{ as RS and output with ORS }\n{

I have data that looks like this:
{"anonymousId":"abc123",{"hello":"world"}}{"anonymousId":"abc456",{"hi": "again"}}
It's as if you took a newline-delimited json file and removed all the newlines.
I'm trying to use awk to convert it to to ndjson.
That is, my expected output is this:
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
I don't want to load the entire file into memory (which is why I'm not using sed), so my thought is I should use }{ as row separator. Then, I figure if I use }\n{ as ORS I should get my desired output.
So I tried this:
cat my-file.txt | awk -v RS="}{" -v ORS="}\n{" '{$1=$1}1'
But it doesn't work!
Here's the output I get:
{"anonymousId":"abc123",{"hello":"world"}
{}
{{"anonymousId":"abc456",{"hi": "again"}
{}
{}
{
Apart from the constraint of not loading the entire file into memory, I don't care what bash command is used, but my thinking is awk will be the way. E.g. if tr supported multi-character expressions, that would be fine with me.
Please help me understand why this isn't working as expected and what I need to change.
Thanks!
Update
Following the answers given, will add some learnings.
The TLDR is don't use a macOS if you need to do trickier things like this.
For one this doen't work on mac: echo -e "a\nb\nc\nd\ne\n" | head -n -2; it complains about illegal line parameter, but this is valid on a linux system.
The other problem was the way awk was working on my (mac) system.
My awk command was close to correct.
On linux it produces this output:
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}}
{
So I just have to find a way to trim the trailing }\n{ (and as pointed out in the answer, the {$1=$1} is not necessary).
But all of those extraneous newlines were due to the screwy implementation of awk on my system ( It wasn't gawk and i'm not sure what it was ).

Doing $1=$1 inside awk -v RS='}{' -v ORS='}\n{' '{$1=$1}1' file isn't useful - it tells awk to recompile the current record replacing all chains of white space with blanks but you the only white space in your example is the \n at the end of the file and there's no point converting that to a blank. So your script can be reduced to:
awk -v RS='}{' -v ORS='}\n{' '1' file
but RS='}{' means different things to different awk variants.
Use of a multi-char RS with GNU awk (and probably a couple of others now) means that the RS is treated as a regexp to separate the records:
$ awk -v RS='}{' -v ORS='}\n{' '1' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
}
{$
Note the extra }\n{ added at the end because there is no }{ at the end of your input and so the end of input itself indicates the end of a record and so gets replaced with the ORS value.
Use of a multi-char RS with a POSIX awk means that the 2nd and subsequent chars in the RS get ignored and the first char is taken as the RS, hence the output you reported seeing in your question:
$ awk --posix -v RS='}{' -v ORS='}\n{' '1' file
{"anonymousId":"abc123",{"hello":"world"}
{}
{{"anonymousId":"abc456",{"hi": "again"}
{}
{
}
{$
where every } alone gets treated as matching RS and so gets replaced by ORS.
So you are not using an awk that supports multi-char RS. Your choices are to install one (preferably gawk) and do:
$ awk -v RS='}[{\n]' '{ORS=gensub(/}{/,"}\n{",1,RT)} 1' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
otherwise do something like this with any awk:
$ awk --posix -v RS='{' -v ORS= '{print pfx $0; pfx=(/}$/ ? "\n" : "") RS}' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
In the gawk solution above we define the RS as '}[{\n]' to say that the records mid-line are terminated by }{ but the record at the end of the line is terminated by }\n. So RT holds }{ for every record except the last one on the line which is }\n if your line ends with \n or NULL otherwise and so we just have to set ORS to be RT but with }{ converted to }\n{ for those records where RT has that value, otherwise ORS just gets set to }\n when RT has that value or NULL if your input didn't have a terminating \n.
An alternative gawk solution that I think I might actually prefer would be:
$ awk -v RS='}{' -v ORS='}\n{' 'NR>1{print prev} {prev=$0} END{printf "%s",prev}' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
EDIT: original answer for posterity before I noticed the OP said they don't want to read the whole file into memory:
Simple substitutions on individual strings like this is what sed is best at:
$ sed 's/}{/}\n{/g' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
otherwise with any awk:
$ awk '{gsub(/}{/,"}\n{")} 1' file
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}

using record separator will create an extra delimiter at the end of the file, since it's static we can just remove it afterwards
$ echo '{"anonymousId":"abc123",{"hello":"world"}}{"anonymousId":"abc456",{"hi": "again"}}' |
awk -v RS='}{' -v ORS='}\n{' 1 | head -n -2
{"anonymousId":"abc123",{"hello":"world"}}
{"anonymousId":"abc456",{"hi": "again"}}
if you don't have gawk for multi-char RS support, you can have this workaround
$ echo ... |
awk -v RS='}' 'NF{printf "%s", $0 RS} !NF{print RS}' | head -n -2
there will be an extra RS, which will be trimmed afterwards.

Split a line after single quote into a newline

I have a text file like this
'ABC&\]D' 'DEFGH' 'QUEMKLEMEL' 'SSBJ|!KFFL'
and would like to convert it to
'ABC&\]D'
'DEFGH'
'QUEMKLEMEL'
'SSBJ|!KFFL'
to split the textfile at the single quotes into a newline.
The original issue is to convert a bash script from windows (the EOL characters to Unix format). I tried
sed 's/\r//' input >output
but keep getting the error when I submit the script
sbatch: error: This does not look like a batch script.
It is a long text file from windows that was transferred to linux but does not work because of the EOL characters. I tried other options to convert it so thought can subset the issue, and try it this way.

Using GNU awk for it:
$ gawk -v FPAT="([^ ]+)|('[^']+')" -v OFS="\n" '{$1=$1}1' file
Output
'ABC&\]D'
'DEFGH'
'QUEMKLEMEL'
'SSBJ|!KFFL'

Using gnu sed you can search for "' " and replace with \n:
sed "s/' /'\n/g" file
'ABC&\]D'
'DEFGH'
'QUEMKLEMEL'
'SSBJ|!KFFL'
Or using POSIX awk:
awk '{gsub(/\x27 /, "\x27\n")} 1' file
'ABC&\]D'
'DEFGH'
'QUEMKLEMEL'
'SSBJ|!KFFL'
or this one:
awk -F "' " -v OFS="'\n" '{$1=$1} 1' file

1st solution: Using gsub function of awk.
awk '{gsub(/\x27 \x27/,"\x27\n\x27")} 1' Input_file
2nd solution: Could you please try following, written and tested with shown samples.
awk '{for(i=1;i<=NF;i++){print $i}}' Input_file

awk set command line options in script

I'm curious about how to set command-line options in awk script, like -F for field separator. I try to write the shebang line like
#!/usr/bin/awk -F ":" -f
and get the following error:
awk: 1: unexpected character '.'
For this example, I can do with
BEGIN {FS=":"}
but I still want to know a way to set all those options. Thanks in advance.
EDIT:
let's use another example that should be easy to test.
inputfile:
1
2
3
4
test.awk:
#!/usr/bin/awk -d -f
{num += $1}
END { print num}
run
/usr/bin/awk -d -f test.awk inputfile
will get 10 and generate a file called awkvars.out with some awk global variables in it.
but
./test.awk inputfile
will get
awk: cmd. line:1: ./test.awk
awk: cmd. line:1: ^ syntax error
awk: cmd. line:1: ./test.awk
awk: cmd. line:1: ^ unterminated regexp
if I remove '-d' from shebang line,
./test.awk inputfile
will normally output 10.
My question is that whether there is a way to write "-d" in test.awk file to generate awkvars.out file?

Answering for the OP question, beyond the setting of FS.
Short Answer: you can not use multiple options with '#!', and since you need to tell awk to read the program from stdin (-f-), you are out of luck.
Long Answer:
When using shebang (#!), there is a limit of single argument (which is passed to the named programs as the 1st argument. So in general:
#! /path/to/prog arg1
input-1
input-2
Will execute /path/to/prog arg1, with the content of the file (including the leading shebang) available as stdin. This is oversimplification, actual rules are more complex., see https://unix.stackexchange.com/questions/87560/does-the-shebang-determine-the-shell-which-runs-the-script
Given this limitation of one argument, when executing awk, the only valid and required parameter is '-f', which indicates that the awk programs is provided on STDIN. You can prepend few other options that do NOT take any argument, for example 'traditional' (e.g., '-Pf-' will force POSIX behavior).
As much as I can tell, all the 'interesting' options (setting FS, RS, ORS, ...) need to be separated from the '-f-' with a space, making it impossible to embed them into the command line, other then using the 'BEGIN { ... }' or similar in the script.
Bottom line, trying #! /usr/bin/awk -f- -F, will attempt to look for program is the same as awk -f' -F', and will look for a file named '- -F`. Usually not very useful, and will not set the FS.

Let's say following is our Input_file, which we are going to use for all mentioned solutions here.
cat Input_file
a,b,c,d
ab,c
1st way of setting Field separator: 1st simple way will be setting FS value in BEGIN section of awk program file. Following is our .awk file.
cat file1.awk
BEGIN{
FS=","
}
{
print $1"..."$2
}
Now when we run the code following output will come:
/usr/local/bin/awk -f file1.awk Input_file
a...b
ab...c
2nd way of setting field separator: 2nd way will be pass FS value before reading Input_file like as follows.
/usr/local/bin/awk -f file.awk FS="," Input_file
Example: Now following is the file.awk file which has awk code.
cat file.awk
{
print $1".."$2
}
Now when we run awk file with awk -f .. command as follows will be result.
/usr/local/bin/awk -f file.awk FS="," Input_file
a..b
ab..c
Which means it is picking up the field separator as , in this above program.
3rd way of setting field separator: We can set field separator in awk -f programs like how we do for usual awk programs using -F',' option as follows.
/usr/local/bin/awk -F',' -f file.awk Input_file
a..b
ab..c
4th way of setting field separator: We could mention field separator as a variable by using -v option on command line while running file.awk script as follows.
/usr/local/bin/awk -v FS=',' -f file.awk Input_file

Never use a shebang to call awk as it robs you of the ability to separate shell arguments into awk arguments and awk variables and do anything else that's better done in shell (e.g. arg parsing with getopts) before calling awk. Just call awk from inside your shell script.
Also, don't name your shell script test.awk as it's a shell script. The fact it's implemented in awk is irrelevant. There's no reason to create a file that you sometimes call as awk file to have awk interpret and other times as just file to have the shell interpret.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

awk unable to remove new line character in a specific column - awk

Related

Remove field-internal newlines in CSV file

awk command to print columns with colum data

Use awk to interpret }{ as RS and output with ORS }\n{

Split a line after single quote into a newline

awk set command line options in script

Categories

Resources