AIX (no GNU sed/awk) join line if it does NOT have control M or \r character at the end - awk

I am looking for a way to join lines that do not end with a control-M character. AIX has its standard awk and sed utilities, but not the GNU versions.
The issue: we get a file from a third party, generated on Windows. The file has a ^M (i.e. \r) character at the end of each line, except for some lines in which the data in a field itself contains a \n character. Hence we need to join the lines that are broken by this extra \n character.
Data example:
col1|col2|col3|col4|col5|^M
a1|a2|a3|a4|a5|^M
b1|b2|b3|b
4|b5|^M
c1|c2|c3|c4|c5|^M
Expected output:
col1|col2|col3|col4|col5|^M
a1|a2|a3|a4|a5|^M
b1|b2|b3|b4|b5|^M
c1|c2|c3|c4|c5|^M
Thank you in advance for any help.

Just for the record, perl handles transformations of \n, \r, etc. really well, without the restrictions of non-GNU sed; in fact, perl -pe can replace sed directly.
So this operation worked fine on BSD:
$ echo -ne "abc\r\ndef\nijk\r\nlmn\r\n" |cat -vte
abc^M$
def$
ijk^M$
lmn^M$
$ echo -ne "abc\r\ndef\nijk\r\nlmn\r\n" |perl -pe "s/\r\n/\0/g;s/\n//g;s/\0/\r\n/g" |cat -vte
abc^M$
defijk^M$
lmn^M$

A literal carriage return can be entered by typing ^V (Ctrl-V) followed by the "Return" key; it is shown as ^M in the script below.
The following sed(1) script loops over lines that do not end in a carriage return, removing the undesired line feeds:
sed '
:label
/^M$/! {
N
s/\n//
blabel
}'
As one line:
sed -e ':l' -e '/^M$/!{N;s/\n//;bl' -e '}'
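Since the question title asks for awk, the same idea can also be sketched in awk. This is a minimal sketch, assuming the local awk accepts \r inside a regular expression (POSIX describes this, but it has not been verified on AIX here); the function name join_broken_lines is made up for illustration.

```shell
# join_broken_lines is a hypothetical helper name, not part of the original answers.
join_broken_lines() {
    awk '
        # A line ending in CR is complete: print it together with any held fragment.
        /\r$/ { print buf $0; buf = ""; next }
        # Otherwise the line was cut by a stray LF: hold it and keep reading.
        { buf = buf $0 }
        # Flush a fragment left over at end of input, if any.
        END { if (buf != "") print buf }
    ' "$@"
}

# Usage with a shortened version of the sample data (cat -v shows CR as ^M):
printf 'a1|a2|a3|a4|a5|\r\nb1|b2|b3|b\n4|b5|\r\nc1|c2|c3|c4|c5|\r\n' | join_broken_lines | cat -v
```

Unlike the sed loop, this keeps only the current fragment in a variable and needs no literal carriage return typed into the script.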


replace strings between two patterns

I would like to replace (using sed/awk/tr) all the strings between CleanAgrobacterium and _gene by ZZZ in my file A.nwk:
(((CleanAgrobacterium_fabrum_str__C58_DE0068_Scaffold_Proteins_gene-FS783_RS12830:0,CleanAgrobacterium_fabrum_str__C58_DE0067_Scaffold_Proteins_gene-FS653_RS12825:0):0.056789,(CleanAgrobacterium_fabrum_GV2260_Complete_Genome_Proteins_gene-EML4058_RS17445:0,(CleanAgrobacterium_fabrum_1D1416_Chromosome_Proteins_gene-NQG32_RS17500:0,(CleanAgrobacterium_fabrum_PDC82_Contig_Proteins_gene-BLT49_RS14090:0,(CleanAgrobacterium_fabrum_N3394_Scaffold_Proteins_gene-G6L76_RS17395:0,(CleanAgrobacterium_fabrum_12D13_Complete_Genome_Proteins_gene-At12D13_RS18010:0,(CleanAgrobacterium_fabrum_Bi46_Contig_Proteins_gene-LQ162_RS02700:0,(CleanAgrobacterium_fabrum_ARqua1_Scaffold_Proteins_gene-HI842_RS18310:0,(CleanAgrobacterium_fabrum_N4094_Scaffold_Proteins_gene-G6L42_RS17400:0,(CleanAgrobacterium_fabrum_GV3101__pMP90_Complete_Genome_Proteins_gene-EML485_RS17435:0,(CleanAgrobacterium_fabrum_Kin001_Complete_Genome_Proteins_gene-FY134_RS17775:0,(CleanAgrobacterium_fabrum_LBA645_Complete_Genome_Proteins_gene-KXJ62_RS17445:0,(CleanAgrobacterium_fabrum_Di1525a_Scaffold_Proteins_gene-G6L89_RS17735:0,(CleanAgrobacterium_fabrum_NFIX02_Scaffold_Proteins_gene-BLR22_RS16795:0,(CleanAgrobacterium_fabrum_Arqua_Contig_Proteins_gene-EXN51_RS19140:0,(CleanAgrobacterium_fabrum_str__J-07_J-07_Scaffold_Proteins_gene-AGR8A_RS20015:0,CleanAgrobacterium_fabrum_1D132_Complete_Genome_Proteins_gene-At1D132_RS18580:0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0,(CleanAgrobacterium_fabrum_EHA105_Complete_Genome_Proteins_gene-EML540_RS17455:0,(CleanAgrobacterium_fabrum_RIT-As-3_Contig_Proteins_gene-ORG40_RS11815:0,(CleanAgrobacterium_fabrum_2788_Contig_Proteins_gene-G6L39_RS17590:0,(CleanAgrobacterium_fabrum_BG5_Complete_Genome_Proteins_gene-F3P66_RS17495:0,(CleanAgrobacterium_fabrum_Bi05_Contig_Proteins_gene-LQV40_RS07170:0,(CleanAgrobacterium_fabrum_str__C58_C58_Complete_Genome_Proteins_gene-ATU_RS17440:0,CleanAgrobacterium_fabrum_NFIX01_Scaffold_Proteins_gene-BMY00_RS16800:0):0):0):0):0):0):0);
sed "/CleanAgrobacterium/,/gene-/d" A.nwk
Instead of using a range, you could make the pattern more specific for the example data matching 1 or more alphanumeric chars or - or _ in between using [[:alnum:]_-]\+ and replace the match(es) with zzz
sed "s/CleanAgrobacterium[[:alnum:]_-]\+_gene/zzz/g" A.nwk
Output
(((zzz-FS783_RS12830:0,zzz-FS653_RS12825:0):0.056789,(zzz-EML4058_RS17445:0,(zzz-NQG32_RS17500:0,(zzz-BLT49_RS14090:0,(zzz-G6L76_RS17395:0,(zzz-At12D13_RS18010:0,(zzz-LQ162_RS02700:0,(zzz-HI842_RS18310:0,(zzz-G6L42_RS17400:0,(zzz-EML485_RS17435:0,(zzz-FY134_RS17775:0,(zzz-KXJ62_RS17445:0,(zzz-G6L89_RS17735:0,(zzz-BLR22_RS16795:0,(zzz-EXN51_RS19140:0,(zzz-AGR8A_RS20015:0,zzz-At1D132_RS18580:0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0,(zzz-EML540_RS17455:0,(zzz-ORG40_RS11815:0,(zzz-G6L39_RS17590:0,(zzz-F3P66_RS17495:0,(zzz-LQV40_RS07170:0,(zzz-ATU_RS17440:0,zzz-BMY00_RS16800:0):0):0):0):0):0):0);
This replaces all the text between CleanAgrobacterium and _gene by ZZZ:
sed -E 's/(CleanAgrobacterium).*(_gene)/\1ZZZ\2/g' A.nwk
But the result is probably not what you would expect, because .* is greedy and matches up to the last _gene on the line. I assume you want ungreedy matching of the text in between (.*?). For that, use perl:
perl -pe 's/(CleanAgrobacterium).*?(_gene)/$1ZZZ$2/g' A.nwk
This might work for you (GNU sed):
sed -E 's/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g' file
Append a newline to CleanAgrobacterium and prepend a newline to gene-.
Replace everything that is not a newline between the desired words.
Remove any introduced newlines.
N.B. This does not cater for matches on separate lines. In this case use something like:
sed -E 'H;1h;$!d;x
s/\n/###NEWLINE%%%/g
s/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g
s/###NEWLINE%%%/\n/g' file
This slurps the whole file into memory, replaces all newlines by a unique string, then applies the first solution and tidies up afterwards.
Try this:
sed 's/gene-/gene-\n/g' < A.nwk | sed 's/CleanAgrobacterium.*_gene-/CleanAgrobacteriumZZZ_gene-/g' | sed -n ':a;N;$!ba;s/\n//g;p' > output.txt
Works with GNU sed 4.9 on Linux.
Yet another sed solution. Using START, END, THIS and THAT as more readable stand-ins for your actual strings, it replaces every THIS between START and END with THAT, turning "fooSTARTTHISENDfooSTARTTHISENDfoo" into "fooSTARTTHATENDfooSTARTTHATENDfoo".
$ sed -E 's/(CleanAgrobacterium)([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?_gene/\1ZZZ_gene/g' file
The solution is non-greedy. It relies on the capturing group (CleanAgrobacterium), its backreference \1, and the expression between it and _gene
([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?
which matches only text that does not contain _gene and gets replaced by ZZZ. Because that expression is full of nested groups, _gene is written literally in the replacement: a trailing (_gene) group would be group 15, out of reach of sed's \1 to \9, and \2 would refer to the first nested group instead. You could use the same regex in, for example, GNU awk's gensub(), which supports backreferences:
$ gawk '{print gensub(/(CleanAgrobacterium)([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?_gene/,"\\1ZZZ_gene","g",$0)}' file
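For awks without gensub(), a non-greedy replacement can also be done with plain string functions instead of a regex. This is a minimal sketch, not taken from the answers above; the function name replace_between and the variable names start, stop and repl are made up for illustration.

```shell
# replace_between is a hypothetical helper; start/stop/repl are illustrative names.
replace_between() {
    awk -v start="CleanAgrobacterium" -v stop="_gene" -v repl="ZZZ" '
    {
        out = ""; rest = $0
        # Walk the line left to right, one start...stop pair at a time.
        while ((s = index(rest, start)) > 0) {
            tail = substr(rest, s + length(start))    # text after the start marker
            e = index(tail, stop)                     # FIRST stop marker: non-greedy
            if (e == 0) break                         # unpaired start: leave as is
            out = out substr(rest, 1, s - 1) start repl stop
            rest = substr(tail, e + length(stop))     # continue after the stop marker
        }
        print out rest
    }' "$@"
}

printf '%s\n' '(CleanAgrobacterium_fabrum_C58_gene-FS783:0,CleanAgrobacterium_fabrum_GV2260_gene-EML4058:0);' | replace_between
```

Because index() always finds the first occurrence, the match is naturally the shortest one, and the markers are treated as literal strings, so no regex escaping is needed.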

Can someone explain this awk quirk? [duplicate]

The intent of this question is to provide an answer to the daily questions whose answer is "you have DOS line endings" so we can simply close them as duplicates of this one without repeating the same answers ad nauseam.
NOTE: This is NOT a duplicate of any existing question. The intent of this Q&A is not just to provide a "run this tool" answer but also to explain the issue such that we can just point anyone with a related question here and they will find a clear explanation of why they were pointed here as well as the tool to run to solve their problem. I spent hours reading all of the existing Q&A and they are all lacking in the explanation of the issue, alternative tools that can be used to solve it, and/or the pros/cons/caveats of the possible solutions. Also some of them have accepted answers that are just plain dangerous and should never be used.
Now back to the typical question that would result in a referral here:
I have a file containing 1 line:
what isgoingon
and when I print it using this awk script to reverse the order of the fields:
awk '{print $2, $1}' file
instead of seeing the output I expect:
isgoingon what
I get the field that should be at the end of the line appear at the start of the line, overwriting some text at the start of the line:
whatngon
or I get the output split onto 2 lines:
isgoingon
what
What could the problem be and how do I fix it?
The problem is that your input file uses DOS line endings of CRLF instead of UNIX line endings of just LF and you are running a UNIX tool on it so the CR remains part of the data being operated on by the UNIX tool. CR is commonly denoted by \r and can be seen as a control-M (^M) when you run cat -vE on the file while LF is \n and appears as $ with cat -vE.
So your input file wasn't really just:
what isgoingon
it was actually:
what isgoingon\r\n
as you can see with cat -v:
$ cat -vE file
what isgoingon^M$
and od -c:
$ od -c file
0000000 w h a t i s g o i n g o n \r \n
0000020
so when you run a UNIX tool like awk (which treats \n as the line ending) on the file, the \n is consumed by the act of reading the line, but that leaves the 2 fields as:
<what> <isgoingon\r>
Note the \r at the end of the second field. \r means Carriage Return which is literally an instruction to return the cursor to the start of the line so when you do:
print $2, $1
awk will print isgoingon and then will return the cursor to the start of the line before printing what which is why the what appears to overwrite the start of isgoingon.
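To see the stray \r without relying on how the terminal moves the cursor, the fields can be printed with the carriage return made visible. This is a small sketch for illustration; show_fields is a hypothetical helper name, and the ^M notation simply mimics cat -v.

```shell
# show_fields is a hypothetical helper: prints each field in angle brackets,
# with any carriage return shown as ^M (the notation cat -v uses).
show_fields() {
    awk '{
        for (i = 1; i <= NF; i++) {
            f = $i
            gsub(/\r/, "^M", f)
            printf "<%s>%s", f, (i < NF ? " " : "\n")
        }
    }' "$@"
}

printf 'what isgoingon\r\n' | show_fields   # <what> <isgoingon^M>
```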
To fix the problem, do any of these:
dos2unix file
sed 's/\r$//' file
awk '{sub(/\r$/,"")}1' file
perl -pe 's/\r$//' file
Apparently dos2unix is aka fromdos in some UNIX variants (e.g. Ubuntu).
Be careful if you decide to use tr -d '\r' as is often suggested as that will delete all \rs in your file, not just those at the end of each line.
Note that GNU awk will let you parse files that have DOS line endings by simply setting RS appropriately:
gawk -v RS='\r\n' '...' file
but other awks will not allow that as POSIX only requires awks to support a single character RS and most other awks will quietly truncate RS='\r\n' to RS='\r'. You may need to add -v BINMODE=3 for gawk to even see the \rs though as the underlying C primitives will strip them on some platforms, e.g. cygwin.
One thing to watch out for is that CSVs created by Windows tools like Excel will use CRLF as the line endings but can have LFs embedded inside a specific field of the CSV, e.g.:
"field1","field2.1
field2.2","field3"
is really:
"field1","field2.1\nfield2.2","field3"\r\n
so if you just convert \r\ns to \ns then you can no longer tell linefeeds within fields from linefeeds as line endings so if you want to do that I recommend converting all of the intra-field linefeeds to something else first, e.g. this would convert all intra-field LFs to tabs and convert all line ending CRLFs to LFs:
gawk -v RS='\r\n' '{gsub(/\n/,"\t")}1' file
Doing similar without GNU awk left as an exercise but with other awks it involves combining lines that do not end in CR as they're read.
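One possible way to do that exercise is sketched below: physical lines are accumulated until one ends in CR, which marks the real end of the record. This is a sketch under the assumption that the awk in use accepts \r in a regex; the function name crlf_records_to_lf is made up for illustration.

```shell
# crlf_records_to_lf is a hypothetical helper name.
crlf_records_to_lf() {
    awk '
        # Append this physical line to the current record, tab-separated.
        { rec = (n++ ? rec "\t" : "") $0 }
        # A trailing CR marks the true end of the record: strip it and print.
        /\r$/ { sub(/\r$/, "", rec); print rec; n = 0 }
        # Print a final fragment that never saw its CR, if any.
        END { if (n) print rec }
    ' "$@"
}

printf '"field1","field2.1\nfield2.2","field3"\r\n' | crlf_records_to_lf | od -c
```

As in the gawk version, embedded LFs become tabs and each CRLF becomes a plain LF.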
Also note that though CR is part of the [[:space:]] POSIX character class, it is not one of the whitespace characters included as separating fields when the default FS of " " is used, whose whitespace characters are only tab, blank, and newline. This can lead to confusing results if your input can have blanks before CRLF:
$ printf 'x y \n'
x y
$ printf 'x y \n' | awk '{print $NF}'
y
$
$ printf 'x y \r\n'
x y
$ printf 'x y \r\n' | awk '{print $NF}'
$
That's because trailing field separator white space is ignored at the beginning/end of a line that has LF line endings, but \r is the final field on a line with CRLF line endings if the character before it was whitespace:
$ printf 'x y \r\n' | awk '{print $NF}' | cat -Ev
^M$
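Stripping the trailing CR before using the fields avoids this, because assigning to the record with sub() makes awk split it into fields again. A minimal sketch, with last_field as a made-up helper name:

```shell
last_field() {
    # Removing the trailing CR from the record makes awk re-split it into fields.
    awk '{ sub(/\r$/, "") } { print $NF }' "$@"
}

printf 'x y \r\n' | last_field   # y
```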
You can use the \R shorthand in PCRE for files with unknown line endings. There are even more line endings to consider with Unicode or other platforms. The \R form is recommended by the Unicode Consortium to represent all forms of a generic newline.
So if you have an 'extra' \r, you can find and remove it with the substitution s/\R$/\n/, which normalizes any combination of line endings into \n. Alternatively, you can use s/\R/\n/g to capture any notion of 'line ending' and standardize it into a \n character.
Given:
$ printf "what\risgoingon\r\n" > file
$ od -c file
0000000 w h a t \r i s g o i n g o n \r \n
0000020
Perl and Ruby and most flavors of PCRE implement \R combined with the end of string assertion $ (end of line in multi-line mode):
$ perl -pe 's/\R$/\n/' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
$ ruby -pe '$_.sub!(/\R$/,"\n")' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
(Note the \r between the two words is correctly left alone)
If you do not have \R you can use the equivalent of (?>\r\n|\v) in PCRE.
With straight POSIX tools, your best bet is likely awk like so:
$ awk '{sub(/\r$/,"")} 1' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
Things that kinda work (but know your limitations):
tr deletes all \r even if used in another context (granted the use of \r is rare, and XML processing requires that \r be deleted, so tr is a great solution):
$ tr -d "\r" < file | od -c
0000000 w h a t i s g o i n g o n \n
0000016
GNU sed works, but not POSIX sed since \r and \x0D are not supported on POSIX.
GNU sed only:
$ sed 's/\x0D$//' file | od -c # also sed 's/\r$//'
0000000 w h a t \r i s g o i n g o n \n
0000017
The Unicode Regular Expression Guide is probably the best bet for a definitive treatment of what a "newline" is.
Run dos2unix. While you can manipulate the line endings with code you wrote yourself, there are utilities which exist in the Linux / Unix world which already do this for you.
If on a Fedora system dnf install dos2unix will put the dos2unix tool in place (should it not be installed).
There is a similar dos2unix deb package available for Debian based systems.
From a programming point of view, the conversion is simple. Search all the characters in a file for the sequence \r\n and replace it with \n.
This means there are dozens of ways to convert from DOS to Unix using nearly every tool imaginable. One simple way is to use the command tr where you simply replace \r with nothing!
tr -d '\r' < infile > outfile

How to extract the final word of a sentence

For a given text file I'd like to extract the final word in every sentence to a space-delimited text file. It would be acceptable to have a few errors for words like Mr. and Dr., so I don't need to try to achieve that level of precision.
I was thinking I could do this with Sed and Awk, but it's been too long since I've worked with them and I don't remember where to begin. Help?
(Output example: For the previous two paragraphs, I'd like to see this):
file Mr Dr precision begin Help
Using this regex:
([[:alpha:]]+)[.!?]
Grep can do this:
$ echo "$txt" | grep -o -E '([[:alpha:]]+)[.!?]'
file.
Mr.
Dr.
precision.
begin.
Help?
Then if you want only the words, a second time through:
$ echo "$txt" | grep -o -E '([[:alpha:]]+)[.!?]' | grep -o -E '[[:alpha:]]+'
file
Mr
Dr
precision
begin
Help
In awk, same regex:
$ echo "$txt" | awk '/[[:alpha:]]+[.!?]/{for(i=1;i<=NF;i++) if($i~/[[:alpha:]]+[.!?]/) print $i}'
Perl, same regex, allows capture groups and maybe a little more direct syntax:
$ echo "$txt" | perl -ne 'print "$1 " while /([[:alpha:]]+)[.!?]/g'
file Mr Dr precision begin Help
And with Perl, it is easier to refine the regex to be more discriminating about the words captured:
echo "$txt" | perl -ne 'print "$1 " while /([[:alpha:]]+)(?=[.!?](?:(?:\s+[[:upper:]])|(?:\s*\z)))/g'
file precision begin Help
gawk:
$ gawk -v ORS=' ' -v RS='[.?!]' '{print $NF}' w.txt
file Mr Dr precision begin Help
(Note that plain awk does not support assigning a regular expression to RS.)
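With an awk that lacks regex RS, the same result can be approximated by scanning each line with match(). This is a minimal sketch, not from the original answers: last_words is a made-up name, and [A-Za-z] is used instead of [[:alpha:]] on the assumption that some older awks do not support POSIX character classes.

```shell
# last_words is a hypothetical helper name.
last_words() {
    awk '
    {
        line = $0
        # Repeatedly find a word followed by a sentence-end marker.
        while (match(line, /[A-Za-z]+[.!?]/)) {
            # RLENGTH - 1 drops the trailing punctuation mark.
            printf "%s%s", sep, substr(line, RSTART, RLENGTH - 1)
            sep = " "
            line = substr(line, RSTART + RLENGTH)
        }
    }
    END { if (sep != "") print "" }   # finish the output line
    ' "$@"
}

printf 'I like the file. Ask Mr. Smith, please! Help?\n' | last_words   # file Mr please Help
```

Like the grep version, this only looks within a single line, so a sentence whose last word and punctuation are split across lines is not handled.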
This might work for you (GNU sed):
sed -r 's/^[^.?!]*\b(\w+)[.?!]/\1\n/;/\n/!d;P;D' file
That gives one word per line; for a single line, pipe the output through paste:
sed -r 's/^[^.?!]*\b(\w+)[.?!]/\1\n/;/\n/!d;P;D' file | paste -sd' '
For another solution just using sed:
sed -r 'H;$!d;x;s/\n//g;s/\b(\w+)[.?!]/\n\1\n/g;/\n/!d;s/[^\n]*\n([^\n]*)\n/ \1/g;s/.//' file
Easy in Perl:
perl -ne 'print "$1 " while /(\w+)[.!?]/g'
-n reads the input line by line.
\w matches a "word character".
\w+ matches one or more word characters.
[.!?] matches any of the sentence-end markers.
/g stands for "globally" - it remembers where the last match occurred and tries to match after it.

How do I replace newlines with their escape sequence using awk?

I'm asking for an awk solution because the sed solution sed ':a;N;$!ba;s/\n/ /g' does NOT get the last newline character usually present in files. My problem is that if I have a stream of data, every single newline character INCLUDING the one usually present at the end of outputs generated by programs like echo needs to be replaced with \n (raw). tr does not work because it cannot substitute in strings (which \\n is), only individual characters.
As example, upon piping:
echo -e "This is\na test."
to awk, I ought to get:
This is\na test.\n
in return.
$ echo -e "This is\na test." | awk -v ORS='\\n' '1'
This is\na test.\n$
The $ at the end of the output is my prompt.
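This works because awk strips the record separator from each input line and then writes the output record separator (ORS) after every record it prints. A small sketch of one consequence, using printf instead of echo -e for portability (escape_newlines is a made-up name): a final record that lacks a trailing newline still gets the \n appended.

```shell
escape_newlines() { awk -v ORS='\\n' '1' "$@"; }

printf 'This is\na test.\n' | escape_newlines; echo   # This is\na test.\n
printf 'no final newline' | escape_newlines; echo     # no final newline\n
```

The extra echo only ends the output line so that the shell prompt starts on a fresh line.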

How to work with literal square bracket using awk and foreach iterations

I have a file named mapstring. My script is not working because of the [ characters in my patterns. Please help me find a solution.
Content of mapstring
BC1 bc1
BC2 bc2
BAD_BIT[0] badl0
BAD_BIT[1] badlleftnr
I am working with the following script to replace patterns in the file testfile
Content of script
foreach cel (`cat mapstring |awk '{print $1}'`)
echo $cel
grep -wq $cel testfile
if( $status == 0 ) then
set var2 = `grep -w $cel rajeshmap |awk '{print $2}'`
sed -i "s% ${cel} % ${var2} %g" testfile
endif
end
Content of testfile
rajesh jain BAD_BIT[0] 1234 BAD_BIT[1000]
jain rajesh DA[0] snps
raj jain CLK stm
That's because square brackets are reserved in sed's basic regex syntax.
You'll have to escape them (and any other special characters in fact) using backslashes (i.e. \[) before using them later in your script; this can itself be done with sed, e.g.:
sed -re 's/(\[|\])/\\\1/g'
(note that using extended regexes in sed (-r) can make this easier).
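As an illustration of the escaping step with basic regexes, the sketch below escapes one key from mapstring and then uses it in a substitution on a line from testfile. The function name escape_bre is made up, and the set of escaped characters is an assumption that covers the common basic-regex metacharacters only.

```shell
# escape_bre is a hypothetical helper: backslash-escapes ] [ \ . * ^ $
escape_bre() {
    printf '%s\n' "$1" | sed 's/[][\\.*^$]/\\&/g'
}

pat=$(escape_bre 'BAD_BIT[0]')
printf '%s\n' "$pat"   # BAD_BIT\[0\]

# The escaped key now matches literally; BAD_BIT[1000] is left alone.
printf '%s\n' 'rajesh jain BAD_BIT[0] 1234 BAD_BIT[1000]' | sed "s% $pat % badl0 %g"
```

If the keys could contain the % delimiter, that character would need escaping as well.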
Your script is rather inefficient anyhow. You can simply get rid of csh entirely (along with the useless cat and the other stylistic problems), and do this with two connected sed scripts.
sed 's/[][*\\%]/\\&/g;s/\([^ ]*\) *\(.*\)/s%\1%\2%g/' mapstring |
sed -i -f - testfile
This is assuming your sed can accept a script on standard input (-f -) and that your sed dialect does not understand any additional special characters which need to be escaped.
#!/bin/ksh
# or sh
sed 's/[[\\$^&.+*]/\\&/g' mapstring | while read -r OldCel NewCel
do
echo ${OldCel}
sed -i "/${OldCel}/ {
s/.*/ & /;s% ${OldCel} % ${NewCel} %g;s/.\\(.*\\)./\\1/
}" testfile
done
This pre-escapes your cel values before they are used in the sed expression (you could add other special characters, such as { ( ), if they occur in your data and depending on the options given to sed).
Try something like this (I cannot test it, as no GNU sed is available here).
Following the good remark from @tripleee, this needs a shell other than the one used in the question, so the script has been adapted accordingly.