How to replace \n with a comma for certain lines on CLI - awk

I have the following text file.
# This is a test 1
  "watch"
  "autoconf"

# This is another line 2
  "coreutils"
  "binutils"
  "screen"

# This is another line 3
  "bash"
  "emacs"
  "nano"
  "bison"

# This is another line 4
  "libressl"
  "python"
  "rsync"
  "unzip"
  "vim"
I want to change this to the following:
# This is a test 1
watch, autoconf
# This is another line 2
coreutils, binutils, screen
# This is another line 3
bash, emacs, nano, bison
# This is another line 4
libressl, python, rsync, unzip, vim
Remove the leading white space, remove the quotes, and replace the newlines with commas.
So far I got this:
$ cat in.txt | sed 's/"//g' | sed 's/^[[:space:]]*//'> out.txt
# This is a test 1
watch
autoconf
# This is another line 2
coreutils
binutils
screen
# This is another line 3
bash
emacs
nano
bison
...
I'm not sure how to replace a new line with a comma. I tried the following.
# no change (sed works line by line, so the pattern space never contains a \n for this substitution to match)
$ cat in.txt | sed 's/"//g' | sed 's/^[[:space:]]*//'| sed 's/\n/,/g'> out.txt
# changed all new lines
$ cat in.txt | sed 's/"//g' | sed 's/^[[:space:]]*//'| sed -z 's/\n/,/g'> out.txt
$ cat out.txt
# This is a test 1,watch,autoconf,,# This is another line 2,coreutils,binutils,screen,,# This is another line 3,bash,emacs,nano,bison,,# This is another line 4,libressl,python,rsync,unzip,vim
How can I achieve this?

This might work for you (GNU sed):
sed -E 's/^\s*//;/^"(\S*)"/{s//\1/;H;$!d};x;s/.//;s/\n/, /gp;$d;z;x' file
Strip off white space at the front of all lines.
Strip out double quotes and append those words to the hold space.
Otherwise, switch to the hold space, delete the first introduced newline, replace all other newlines with a comma and a space, print the result, and then switch back to the pattern space and print that.

Here's an awk version. Notice that we set the record separator RS to the empty string. This tells awk to treat each block separated by an empty line as a single record. Then by setting the field separator with -F to a newline, each line in the block becomes a single field in that record.
Then it's just a matter of brute-forcing our way through the fields of each record, using sub or gsub to remove leading spaces and quotation marks, and using printf to avoid a newline when we don't want one and printing a comma instead.
$ awk -v RS="" -F'\n' '{
sub(/^[[:space:]]*/, "", $1);
print $1;
sep="";
for (i=2; i<=NF; ++i) {
gsub(/[[:space:]]*"/, "", $i);
printf "%s%s", sep, $i;
sep=", "
}
print "\n"
}' file
Output:
# This is a test 1
watch, autoconf
# This is another line 2
coreutils, binutils, screen
# This is another line 3
bash, emacs, nano, bison
# This is another line 4
libressl, python, rsync, unzip, vim

A one-liner using GNU sed:
sed -Ez 's/\n[[:blank:]]*"?/\n/g; s/"\n([^\n])/, \1/g; s/"//g' file
or, using multiline techniques with standard sed:
sed '
s/^[[:blank:]]*//
/^".*"$/{
s/.//
s/.$//
:a
$b
N
s/\n[[:blank:]]*"\(.*\)"$/, \1/
ta
}' file

With your shown samples only, could you please try the following. Written and tested in GNU awk.
awk '
BEGIN{
OFS=", "
}
NF{
gsub(/"|^ +| +$/,"")
}
/^#/ || !NF{
if(value){
print first ORS value
}
first=$0
value=""
if(!NF){ print }
next
}
{
value=(value?value OFS:"")$0
}
END{
if(value){
print first ORS value
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
OFS=", " ##Setting OFS as comma space here.
}
NF{ ##Checking condition if line is NOT empty do following.
gsub(/"|^ +| +$/,"") ##Globally substituting " OR starting/ending spaces with NULL here.
}
/^#/ || !NF{ ##Checking condition if line starts from # OR line is NULL then do following.
if(value){ ##Checking condition if value is NOT NULL then do following.
print first ORS value ##Printing first ORS value values here.
}
first=$0 ##Setting first to current line here.
value="" ##Nullifying value here.
if(!NF){ print } ##Checking condition if line is empty then simply print it.
next ##next will skip all further statements from here.
}
{
value=(value?value OFS:"")$0 ##Creating value here and keep on adding current line value to it.
}
END{ ##Starting END block of this program from here.
if(value){ ##Checking condition if value is NOT NULL then do following.
print first ORS value ##Printing first ORS value values here.
}
}
' Input_file ##Mentioning Input_file name here.

Using any POSIX awk in any shell on every Unix box:
$ awk -v RS= -v ORS='\n\n' -F'[[:blank:]]*\n[[:blank:]]*' -v OFS=', ' '{
    gsub(/^[[:blank:]]*|"/,"")
    printf "%s\n", $1
    for (i=2;i<=NF;i++) {
        printf "%s%s", $i, (i<NF ? OFS : ORS)
    }
}' file
# This is a test 1
watch, autoconf
# This is another line 2
coreutils, binutils, screen
# This is another line 3
bash, emacs, nano, bison
# This is another line 4
libressl, python, rsync, unzip, vim


add a line between matching pattern - unix

I want to insert "123" below madguy-xyz- line in "module xyz".
There are multiple modules having similar lines. But i want to add it in only "module xyz".
module abc
njkenjkfvsfd
madguy-xyz-mafdvnskjfvn
enfvjkesn
endmodule
module xyz
njkenjkfvsfd
madguy-xyz-mafdvnskjfvn
enfvjkesn
endmodule
This is the code I tried, but it doesn't work:
sed -i "/module xyz/,/^endmodule/{/madguy-xyz-/a 123}" <file_name>
This is the error I got:
sed: -e expression #1, char 0: unmatched `{'
This might work for you (GNU sed):
sed '/module xyz/{:a;n;/madguy-xyz-/!ba;p;s/\S.*/123/}' file
For a line containing module xyz, continue printing lines until one containing madguy-xyz-.
Print this line too and then replace it with 123.
Another alternative solution:
sed '/module/h;G;/madguy-xyz.*\nmodule xyz/{P;s/\S.*/123/};P;d' file
Store any module line in the hold space.
Append the module line to each line.
If the first line contains madguy-xyz- and the second module xyz, print the first line and then replace its contents with 123.
Print the first line and delete the whole.
With your shown samples, please try the following.
awk '1; /^endmodule$/{found=""};/^module xyz$/{found=1} found && /^ +madguy-xyz-/{print "123"} ' Input_file
Once you are happy with the results of the above command, then to save the output into Input_file itself, try the following:
awk '1;/^endmodule$/{found=""} /^module xyz$/{found=1} found && /^ +madguy-xyz-/{print "123"} ' Input_file > temp && mv temp Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
1; ##Printing current line here.
/^endmodule$/{found=""} ##Nullifying found when the line is endmodule.
/^module xyz$/{ ##Checking condition if line contains module xyz then do following.
found=1 ##Setting found to 1 here.
}
found && /^ +madguy-xyz-/{ ##Checking if found is SET and line contains madguy-xyz- then do following.
print "123" ##Printing 123 here.
}
' Input_file ##Mentioning Input_file name here.
NOTE: In case your line has exactly the value module xyz, then change the /module xyz/ condition above to $0=="module xyz".
With GNU sed I suggest:
sed -i -e "/module xyz/,/^endmodule/{/madguy-xyz-/a 123" -e "}" file
The a command takes everything up to the end of the expression as the text to append, so the closing } has to go into a separate -e expression; this is also why your original attempt failed with the unmatched `{' error.
Using any POSIX awk in any shell on every Unix box, the following will work for the sunny day case in your question and all rainy day cases such as the ones I mentioned in my comment and more:
$ cat tst.awk
{ print }
$1 == "endmodule" {
inMod = 0
}
inMod && (index($1,old) == 1) {
sub(/[^[:space:]].*/,"")
print $0 new
}
($1 == "module") && ($2 == mod) {
inMod = 1
}
$ awk -v mod='xyz' -v old='madguy-xyz-' -v new='123' -f tst.awk file
module abc
njkenjkfvsfd
madguy-xyz-mafdvnskjfvn
enfvjkesn
endmodule
module xyz
njkenjkfvsfd
madguy-xyz-mafdvnskjfvn
123
enfvjkesn
endmodule

awk to extract days from line

I have the following csv file
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
238013750030646-2;;"Default";"2020-10-03 00:40:36";;"closing line";0;89320;283940640
238013750030646-2;;"something-else";"2020-10-04 00:40:36";;"started";0;0;283940640
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640
I'm trying to store each line in a specific file matching the date from each line, with the date being in the 4th column of each line, so the first line ("2020-10-01 00:40:36") should be in output-01.csv, the second line in output-03.csv, etc.
This awk command
awk -F";|-" -vOFS='\t' '{print > "output-"$7".csv"}' testing.csv
half works but fails on line 3 because of the - in the 3rd column, and on line 4 because of the space in the 3rd column; this produces output-10.csv
Is there a way to run the awk command twice? Then I could extract the date using the ; separator and then split using -.
Using gawk takes care of an unsorted file too:
awk 'match($0,/([0-9]{4})-([0-9]{2})-([0-9]{2})/,arr){
file=sprintf("output-%s.csv",arr[3]);
if(!seen[file]++){
print >file;
next
}
}{
print >>file;
close(file);
}' infile
Explanation:
awk 'match($0,/([0-9]{4})-([0-9]{2})-([0-9]{2})/,arr){ # match for regex
file=sprintf("output-%s.csv",arr[3]); # file variable using array arr value, 3rd index
if(!seen[file]++){ # if not seen file name before in array seen
print >file; # print content to file
next # go to next line
}
}{
print >>file; # append content to file
close(file); # close file
}' infile
Try this:
$ awk -F';' -v OFS='\t' '{split($4,a,/[- ]/); file = "output-"a[3]".csv";
$1=$1; print > file; close(file)}' testing.csv
split($4,a,/[- ]/) this will split 4th field further based on space or - characters, saved in array a
file = "output-"a[3]".csv" output filename
$1=$1 since there's no other command changing the contents of the input line, this is needed to rebuild the input line, otherwise OFS will not be applied (see the short demo after this list)
print > file print input line to required file
close(file) calling close is useful to avoid keeping too many files open at once
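As a quick aside on the $1=$1 step (hypothetical one-liners, not from the original answer): awk only rebuilds the record with OFS once some field has been assigned.
$ echo '1;2;3' | awk -F';' -v OFS='\t' '{print}'
1;2;3
$ echo '1;2;3' | awk -F';' -v OFS='\t' '{$1=$1; print}'
1	2	3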
You can also use file = "output-" substr($4,10,2) ".csv" instead of split if the 4th column is consistent as shown in the sample.
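For instance, the full command with that substr variant might look like this (a sketch only, assuming the quoted timestamp always starts at the second character of the 4th field so that characters 10-11 hold the day):
awk -F';' -v OFS='\t' '{file = "output-" substr($4,10,2) ".csv"; $1=$1; print > file; close(file)}' testing.csv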
With your shown samples, please try the following, written and tested in GNU awk.
awk '
match($0,/[0-9]{4}(-[0-9]{2}){2}/){
outputFile="output-" substr($0,RSTART+8,RLENGTH-8)".csv"
print >> (outputFile)
close(outputFile)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{4}(-[0-9]{2}){2}/){ ##using match function to match yyyy-mm-dd here in line.
outputFile="output-" substr($0,RSTART+8,RLENGTH-8)".csv" ##Building the output file name from the day part of the matched regex sub-string here.
print >> (outputFile) ##Printing current line into outputFile here.
close(outputFile) ##Closing output file to avoid too many files opened error.
}
' Input_file ##Mentioning Input_file name here.
To do this efficiently you should sort on the key field first:
awk -F';' '{print $4, NR, $0}' file |
sort -k1,1 -k3,3n |
awk '
{ curr=$1; sub(/([^ ]+ ){3}/,"") }   # remember the date key, then strip the three decoration fields ($4 is two space-separated words, plus NR)
curr != prev { close(out); out="output-" (++c) ".csv"; prev=curr }
{ print > out }
'
$ head output*.csv
==> output-1.csv <==
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
==> output-2.csv <==
238013750030646-2;;"Default";"2020-10-03 00:40:36";;"closing line";0;89320;283940640
==> output-3.csv <==
238013750030646-2;;"something-else";"2020-10-04 00:40:36";;"started";0;0;283940640
==> output-4.csv <==
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640
The above is the common decorate/sort/undecorate idiom and will work using any awk+sort in any shell on every Unix box. See the many similar examples on this site for an explanation.

How can I extract using sed or awk between newlines after a specific pattern?

I'd like to check whether there are other bash commands I could use to print the range of IPs under #Hiko in my hosts file, as alternatives to the sed, tail and head pipeline below, which I figured out to get what I needed.
I'm just curious and keen on learning more about bash; I hope I can gain more knowledge from the community.
:D
$ sed -n '/#Hiko/,/#Pico/p' /etc/hosts | tail -n +3 | head -n -2
/etc/hosts
#Tito

192.168.1.21
192.168.1.119

#Hiko

192.168.1.243
192.168.1.125
192.168.1.94
192.168.1.24
192.168.1.242

#Pico

192.168.1.23
192.168.1.93
192.168.1.121
1st solution: With your shown samples, could you please try the following. Written and tested in GNU awk.
awk -v RS= '/#Pico/{exit} /#Hiko/{found=1;next} found' Input_file
Explanation:
awk -v RS= ' ##Starting awk program from here.
/#Pico/{ ##Checking condition if line has #Pico then do following.
exit ##exiting from program.
}
/#Hiko/{ ##Checking condition if line has #Hiko is present in line.
found=1 ##Setting found to 1 here.
next ##next will skip all further statements from here.
}
found ##Checking condition if found is SET then print the line.
' Input_file ##mentioning Input_file name here.
2nd solution: Without using RS, try the following.
awk '/#Pico/{exit} /#Hiko/{found=1;next} NF && found' Input_file
3rd solution: You could look for the record #Hiko and then print its next record and exit, as per your shown samples.
awk -v RS= '/#Hiko/{found=1;next} found{print;exit}' Input_file
NOTE: All these solutions check whether the string #Hiko or #Pico is present anywhere in the record; in case you want to match the exact string, change the /#Hiko/ and /#Pico/ parts above to /^#Hiko$/ and /^#Pico$/ respectively.
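For example, the 1st solution with that change applied would be:
awk -v RS= '/^#Pico$/{exit} /^#Hiko$/{found=1;next} found' Input_file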
With sed (checked with GNU sed, syntax might differ for other implementations)
$ sed -n '/#Hiko/{n; :a n; /^$/q; p; ba}' /etc/hosts
192.168.1.243
192.168.1.125
192.168.1.94
192.168.1.24
192.168.1.242
-n turn off automatic printing of pattern space
/#Hiko/ if line contains #Hiko
n get next line (assuming there's always an empty line)
:a label a
n get next line (using n will overwrite any previous content in the pattern space, so only single line content is present in this case)
/^$/q if the current line is empty, quit
p print the current line
ba branch to label a
You can use
awk -v RS= '/^#Hiko$/{getline;print;exit}' file
awk -v RS= '$0 == "#Hiko"{getline;print;exit}' file
Which means:
RS= - make awk read the file paragraph by paragraph
/^#Hiko$/ or $0 == "#Hiko" - finds a paragraph that is equal to #Hiko
{getline;print;exit} - gets the next paragraph, prints it and exits.
See the online demo.
You may use:
awk -v RS= 'p && NR == p + 1; $1 == "#Hiko" {p = NR}' /etc/hosts
192.168.1.243
192.168.1.125
192.168.1.94
192.168.1.24
192.168.1.242
This might work for you (GNU sed):
sed -n '/^#/h;G;/^[0-9].*\n#Hiko/P' file
Copy the header to the hold buffer.
Append the hold buffer to each line.
If the line begins with a digit and contains the required header, print the first line in the pattern space.

Concatenate the sequence to the ID in fasta file

Here is my input file
>OTU1;size=4;
ATTCCGGGTTTACT
ATTCCTTTTATCGA
ATC
>OTU2;size=10;
CGGATCTAGGCGAT
ACT
>OTU3;size=5;
ATTCCCGGGATCTA
ACTTTTC
The expected output file is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
I've tried the code from Remove line breaks in a FASTA file
but this doesn't work for me, and I am not sure how to modify the code from that post...
Any suggestion? Thanks in advance!
Here is another awk script, using awk's internal parsing mechanism.
awk 'BEGIN{RS=">";OFS="";}NR>1{$1=$1;print ">"$0}' input.txt
Output is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
Explanation:
awk '
BEGIN { # initialize awk internal variables
RS=">"; # set `RS`=record separator to `>`
OFS=""; # set `OFS`=output field separator to empty string.
}
NR>1 { # handle from 2nd record (1st record is empty).
$1=$1; # regenerate the output line
print ">"$0 # print out ">" with computed output line
}' input.txt
$ awk '{printf "%s%s", (/^>/ ? ors : ""), $0; ors=ORS} END{print ""}' file
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
Could you please try the following too.
awk -v RS=">" 'NR>1{gsub(/\n/,"");print ">"$0}' Input_file
My original attempt was awk -v RS=">" -v FS="\n" -v OFS="" 'NF>1{$1=$1;print ">"$0}' Input_file but later I saw it had already been answered by dudi boy, so I wrote another (the first mentioned) one.
Similar to my answer here:
$ awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
(FNR==1){next}
{ name=$1; seq=$0; gsub(/(^[^\n]*|)\n/,"",seq) }
{ print ">" name seq }' file1.fasta file2.fasta file3.fasta ...

How to print the length size of the following line

I would like to modify a file by including the size of the following line, using awk.
My file is like this:
>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-)
ATGCGATGCTAGCTAGCTCGAT
>AAAS:1215614:1215701:-:1::NW_015494524.1:1215614-1215701(-)
ATGCCGCGACGCAGCACCCGACGCGCAG
I am using awk to modify it to have the following format:
>Assembly_AAAS_1_16
ATGTCGATGCTCGATC
>Assembly_AAAS_2_22
ATGCGATGCTAGCTAGCTCGAT
>Assembly_AAAS_3_28
ATGCCGCGACGCAGCACCCGACGCGCAG
I have used awk to modify the first part.
awk -F":" -v i=1 '/>/{print ">Assembly_" $1 "_" val i "_";i++;next} {print length($0)} 1' infile | sed -e "s/_>/_/g" > outfile
I can use print length($0) but how to print it in the same line?
Thanks
EDIT2: Since the OP has changed the sample data again, adding this code now.
awk -v val="Assembly_AAAS_" '/>/{++i;value=">" val i "_";next} {sub(/ +$/,"");print value length($0) ORS $0}' Input_file
OR
awk -v val="Assembly_AAAS_" '/>/{++i;value=">" val i "_";next} {print value length($1) ORS $0;}' Input_file
The above will remove trailing spaces from the lines of Input_file; in case you don't need that, remove the sub(/ +$/,""); part from the above code please.
EDIT: As per the OP's change, updated solution now.
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{value="\047" val i val1;i++;next} {print value length($0) ORS $0}' Input_file
OR
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '
/>/{ value="\047" val i val1;
i++;
next}
{
print value length($0) ORS $0
}
' Input_file
The following awk may help you with the same.
awk -v i="" -v j=2 '/>/{print "\047>Assembly_GeneName1_"++i"_sizeline"j;j+=2;next} 1' Input_file
Solution 2nd:
awk -v i=1 -v j=2 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{print "\047" val i val1 j;j+=2;i++;next} 1' Input_file
What you are dealing with is a beautiful example of records which are not lines. awk is a record parser and by default, a record is defined to be a line. With awk you can define a record to be a block of text using the record separator RS.
RS : The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more
than one character, the results are unspecified. If RS is null, then
records are separated by sequences consisting of a <newline> plus one
or more blank lines, leading or trailing blank lines shall not result
in empty records at the beginning or end of the input, and a <newline>
shall always be a field separator, no matter what the value of FS is.
So the goal is to define the record to be
AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
And this can be done by defining RS="\n>" (note that a multi-character RS is a GNU awk extension; POSIX, as quoted above, leaves it unspecified). Furthermore we will use \n as the field separator FS. This way you can get the requested length as length($2) and the count by using the record count NR.
A simple awk script is then:
awk 'BEGIN{RS="\n<"; FS=OFS="\n"}
{$1=">Assembly_AAAS_"NR"_"length($2)}
{print $1,$2}' <file>
This will do exactly what you want.
note: we use print $1,$2 and not print $0 as the last record might have 3 fields (if the last char of the file is a newline). This would imply that you would have an extra empty line at the end of your file.
If you want to pick the AAAS string out of $1, you can use substr($1,1,match($1,":")-1). This results in:
awk 'BEGIN{RS="\n<"; FS=OFS="\n"}
{$1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)}
{print $1,$2}' <file>
Finally, be aware that the above solution only works if there are no spaces in $2; if you want to handle that, you can do this:
awk 'BEGIN{RS="\n<"; FS=OFS="\n"}
{ gsub(/[[:blank:]]/,"",$2);
$1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)
}
{ print $1,$2 }' <file>