How to convert only certain columns of whitespace into tab? - awk

I know I can use sed 's/[[:blank:]]/,/g' to convert blank spaces in my file into commas (or anything else of my choosing), but is there a way to convert only the first 5 instances of whitespace into commas?
This is because my last column has a lot of information written out, so it is annoying when sed converts all the spaces in that column into commas.
Sample input file:
sample1 gi|11| 123 33 97.23 This is a sentence
sample2 gi|22| 234 33 97.05 Keep these spaces
And the output I was looking for:
sample1,gi|11|,123,33,97.23,This is a sentence
sample2,gi|22|,234,33,97.05,Keep these spaces
Only the first 5 runs of whitespace are changed to commas.

With GNU awk for the 3rd arg to match():
$ awk '{ match($0,/((\S+\s+){5})(.*)/,a); gsub(/\s+/,",",a[1]); print a[1] a[3] }' file
sample1,gi|11|,123,33,97.23,This is a sentence
sample2,gi|22|,234,33,97.05,Keep these spaces
but I'd recommend you actually produce valid CSV (i.e. CSV that conforms to RFC 4180), such as can be read by MS-Excel and other tools, since "This is a sentence" (and possibly other fields) can presumably include commas and double quotes:
$ awk '{
gsub(/"/,"\"\"");
match($0,/((\S+\s+){5})(.*)/,a)
gsub(/\s+/,"\",\"",a[1])
print "\"" a[1] a[3] "\""
}' file
"sample1","gi|11|","123","33","97.23","This is a sentence"
"sample2","gi|22|","234","33","97.05","Keep these spaces"
For example given this input:
$ cat file
sample1 gi|11| 123 33 97.23 This is a sentence
a,b,sample2 gi|22| 234 33 97.05 This is, "typically", a sentence
The output from the first script is not valid CSV:
$ awk '{ match($0,/((\S+\s+){5})(.*)/,a); gsub(/\s+/,",",a[1]); print a[1] a[3] }' file
sample1,gi|11|,123,33,97.23,This is a sentence
a,b,sample2,gi|22|,234,33,97.05,This is, "typically", a sentence
while the output from the 2nd script IS valid CSV:
$ awk '{ gsub(/"/,"\"\""); match($0,/((\S+\s+){5})(.*)/,a); gsub(/\s+/,"\",\"",a[1]); print "\"" a[1] a[3] "\"" }' file
"sample1","gi|11|","123","33","97.23","This is a sentence"
"a,b,sample2","gi|22|","234","33","97.05","This is, ""typically"", a sentence"

perl's split can limit the number of fields with its third argument:
$ perl -lnE 'say join ",", split(" ",$_,6)' file
sample1,gi|11|,123,33,97.23,This is a sentence
sample2,gi|22|,234,33,97.05,Keep these spaces
If fields might require quoting:
perl -lnE 'say join ",", map { s/"/""/g || y/,// ? "\"$_\"" : $_ } split(" ",$_,6)
' file
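For instance, on the comma-and-quote input shown under cat file above, this quotes only the fields that actually need it (output illustrative, assuming that input):
$ perl -lnE 'say join ",", map { s/"/""/g || y/,// ? "\"$_\"" : $_ } split(" ",$_,6)' file
sample1,gi|11|,123,33,97.23,This is a sentence
"a,b,sample2",gi|22|,234,33,97.05,"This is, ""typically"", a sentence"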

Ruby has a str.split that can take a limit:
ruby -ne 'puts $_.split(/\s+/,6).join(",")' file
sample1,gi|11|,123,33,97.23,This is a sentence
sample2,gi|22|,234,33,97.05,Keep these spaces
As does Perl:
perl -lnE 'say join ",", split /\s+/,$_,6 ' file
# same

This might work for you (GNU sed):
sed -E 's/\s+/&\n/5;h;s/\n.*//;s/\s+/,/g;G;s/\n.*\n//' file
Append a newline to the 5th occurrence of a group of whitespace.
Make a copy of the amended line in the hold space.
Remove the section from the inserted newline to the end of the line.
Translate groups of whitespace into commas.
Append the copy.
Remove the section between newlines.
Thus the first five groups of whitespace are converted to commas and the remaining groups are untouched.
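As a rough hand trace (not actual sed output) on the first sample line, the pattern space goes through these states, with \n marking the inserted newline:
sample1 gi|11| 123 33 97.23 \nThis is a sentence   (after s/\s+/&\n/5; h then saves a copy)
sample1,gi|11|,123,33,97.23,   (after s/\n.*// and s/\s+/,/g)
sample1,gi|11|,123,33,97.23,\nsample1 gi|11| 123 33 97.23 \nThis is a sentence   (after G)
sample1,gi|11|,123,33,97.23,This is a sentence   (after s/\n.*\n//)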

Here is a way to do it with 3 sed commands. This requires GNU sed, which supports the /ng (e.g. /6g) pattern flag; it applies the substitution only from the nth occurrence onwards. Also note: this method will permanently compress multiple spaces in the last column.
sed 's/ \+/␣/6g' | sed 's/ \+/,/g' | sed 's/␣/ /g'
Another variation: Do the multiple space compression as a separate step with tr -s ' '. This may be more readable.
tr -s ' ' | sed 's/ /␣/6g' | sed 's/ /,/g' | sed 's/␣/ /g'
Another variation: Compress all whitespaces, not just spaces with \s
sed 's/\s\+/␣/6g' | sed 's/\s\+/,/g' | sed 's/␣/ /g'
Explanation:
The first step "protects" the spaces from the 6th occurrence on, by converting them to a special character. I've used ␣ here (unicode U+2423), but you could use any character that doesn't exist in the source data, such as \x00, {space}, etc.
sed 's/ \+/␣/6g'
The second step converts the remaining spaces to commas.
sed 's/ \+/,/g'
The third step converts the "protected" spaces back to spaces.
sed 's/␣/ /g'
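For example, testing on the first sample line:
$ echo 'sample1 gi|11| 123 33 97.23 This is a sentence' | sed 's/ \+/␣/6g' | sed 's/ \+/,/g' | sed 's/␣/ /g'
sample1,gi|11|,123,33,97.23,This is a sentence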

Convert multiple lines to a line separated by brackets and "|"

I have the following data in multiple lines:
1
2
3
4
5
6
7
8
9
10
I want to convert them to lines separated by "|" and "()":
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|10
I made a mistake, I'm sorry. I want to convert them to lines separated by "|" and "()" like this:
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
What I have tried is:
seq 10 | sed -r 's/(.*)/(\1)/'|paste -sd"|"
What's the best unix one-liner to do that?
This might work for you (GNU sed):
sed 's/.*/(&)/;H;1h;$!d;x;s/\n/|/g' file
Surround each line by parens.
Append all lines to the hold space except for the first line which replaces the hold space.
Delete all lines except the last.
On the last line, swap to the hold space and replace all newlines by |'s.
N.B. When a line is deleted no further commands are invoked and the command cycle begins again. That is why the last two commands are only executed on the last line of the file.
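For a quick sanity check on a smaller input:
$ seq 3 | sed 's/.*/(&)/;H;1h;$!d;x;s/\n/|/g'
(1)|(2)|(3)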
Alternative:
sed -z 's/\n$//;s/.*/(&)/mg;y/\n/|/' file
With your shown samples, please try the following awk code. This should work in any version of awk.
awk -v OFS="|" '{val=(val?val OFS:"") "("$0")"} END{print val}' Input_file
Using GNU sed
$ sed -Ez ':a;s/([0-9]+)\n/(\1)|/;ta;s/\|$/\n/' input_file
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
Here is another simple awk command:
awk 'NR>1 {printf "%s|", p} {p="(" $0 ")"} END {print p}' file
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
Here it is:
sed -z 's/^/(/;s/\n/)|(/g;s/|($//' your_input
where -z allows you to treat the whole file as a single string with embedded \ns.
In detail, the sed script above consists of 3 commands separated by ;s:
s/^/(/ inserts a ( at the beginning of the whole file,
s/\n/)|(/g changes every \n to )|(;
s/|($// removes the trailing |( resulting from the \n at EOF, which is likely present in your file since you are on Linux.
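For example (the result is printed without a trailing newline, since -z treats the whole input as a single record):
$ seq 10 | sed -z 's/^/(/;s/\n/)|(/g;s/|($//'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)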
With perl:
$ seq 10 | perl -pe 's/.*/($&)/; s/\n/|/ if !eof'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
s/.*/($&)/ to surround input lines with ()
s/\n/|/ if !eof will change newline to | except for the last input line.
Here's a solution with paste (just for fun):
$ seq 10 | paste -d'()' /dev/null - /dev/null | paste -sd'|'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
Using any awk:
$ seq 10 | awk '{printf "%s(%s)", sep, $0; sep="|"} END{print ""}'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)

Extracting and rearranging columns

I read from stdin lines which contain fields. The field delimiter is a semicolon. There are no specific quoting characters in the input (i.e. the fields themselves can't contain semicolons or newline characters). The number of input fields is unknown, but it is at least 4.
The output is supposed to be a similar file, consisting of the fields from 2 to the end, but field 2 and 3 reversed in order.
I'm using zsh.
I came up with a solution, but find it clumsy. In particular, I could not think of anything specific to zsh which would help me here, so basically I reverted to awk. This is my approach:
awk -F ';' '{printf("%s", $3 ";" $2); for(i=4;i<=NF;i++) printf(";%s", $i); print "" }' <input_file >output_file
The first printf takes care of the two reversed fields, and then I use an explicit loop to write out the remaining fields. Is there a way in awk (or gawk) to print a range of fields in a single command? Or did I miss some incredibly clever feature in zsh that could make my life simpler?
UPDATE: Example input data
a;bb;c;D;e;fff
gg;h;ii;jj;kk;l;m;n
Should produce the output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Using any awk in any shell on every Unix box:
$ awk 'BEGIN{FS=OFS=";"} {t=$3; $3=$2; $2=t; sub(/[^;]*;/,"")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With GNU awk you could try the following code. It uses GNU awk's match() function with the regex ^[^;]*;([^;]*;)([^;]*;)(.*)$ to capture the values as required: this creates 3 capturing groups whose values are stored in the array named arr (a GNU awk feature), and the program then prints them in the required order.
awk 'match($0,/^[^;]*;([^;]*;)([^;]*;)(.*)$/,arr){
print arr[2] arr[1] arr[3]
}
' Input_file
If perl is accepted, it provides a join() function to join elements on a delimiter. In awk, though, you'd have to do the joining explicitly (which isn't complex, just more lines of code; see the sketch below).
perl -F';' -nlae '$t = $F[2]; $F[2] = $F[1]; $F[1] = $t; print join(";", @F[1..$#F])' file
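For comparison, here is a rough sketch of that explicit approach in awk (not part of the original answer): swap the two fields, then build the joined string by hand:
awk -F';' '{
  t = $3; $3 = $2; $2 = t            # swap fields 2 and 3
  out = $2                           # start the output at field 2
  for (i = 3; i <= NF; i++)          # append the remaining fields with ";"
    out = out ";" $i
  print out
}' file
On the sample input this prints c;bb;D;e;fff and ii;h;jj;kk;l;m;n, the same as the perl version.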
With sed, perl, hck and rcut (my own script):
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# can also use: perl -F';' -lape '$_ = join ";", @F[2,1,3..$#F]' ip.txt
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# -d and -D specifies input/output separators
$ hck -d';' -D';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# syntax similar to cut, but output field order can be different
$ rcut -d';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Note that the sed version will preserve input lines with fewer than 3 fields.
$ cat ip.txt
1;2;3
apple;fig
abc
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
3;2
apple;fig
abc
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
3;2
;fig
;
Another awk variant:
awk 'BEGIN{FS=OFS=";"} {$1=$3; $3=""; sub(/;;/, ";")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With GNU awk and gensub, switching the position of 2 capture groups:
awk '{print gensub(/^[^;]*;([^;]*);([^;]*)/, "\\2;\\1", 1)}' file
The pattern matches
^ Start of string
[^;]*; Negated character class, match optional chars other than ; and then match ;
([^;]*);([^;]*) 2 capture groups, both capturing chars other than ; and match ; in between
Output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
awk '{print $3, $0}' {,O}FS=\; < file | cut -d\; -f1,3,5-
This uses awk to prepend the third column, then pipes to cut to extract the desired columns.
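The {,O}FS=\; part relies on the shell's brace expansion; written out in full, the pipeline is equivalent to:
awk '{print $3, $0}' FS=';' OFS=';' < file | cut -d';' -f1,3,5-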
Here is one way to do it using only zsh:
rearrange() {
local -a lines=(${(@f)$(</dev/stdin)})
for line in $lines; do
local -a flds=(${(s.;.)line})
print $flds[3]';'$flds[2]';'${(j.;.)flds[4,-1]}
done
}
The same idea in a single line. This may not be an improvement over your awk script:
for l in ${(@f)$(<&0)}; print ${${(A)i::=${(s.;.)l}}[3]}\;$i[2]\;${(j.;.)i:3}
Some of the pieces:
$(</dev/stdin) - read from stdin using pseudo-device.
$(<&0) - another way to read from stdin.
(f) - parameter expansion flag to split by newlines.
(@) - treat the split result as an array.
(s.;.) - split by semicolon.
$flds[3] - expands to the third array element.
$flds[4,-1] - fourth, fifth, etc. array elements.
$i:3 - ksh-style array slice for fourth, fifth ... elements.
Mixing styles like this can be confusing, even if it is slightly shorter.
(j.;.) - join array by semicolon.
i::= - assign the result of the expansion to the variable i.
This lets us use the semicolon-split fields later.
(A)i::= - the (A) flag ensures i is an array.
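A quick usage sketch, assuming the rearrange function above has been loaded in a zsh session:
$ printf 'a;bb;c;D;e;fff\ngg;h;ii;jj;kk;l;m;n\n' | rearrange
c;bb;D;e;fff
ii;h;jj;kk;l;m;n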

Only print first and second word of each line to output with sed

I want to clean up a pattern file for later use, so only the first and second word (or number) are relevant.
I have this:
pattern.txt
# This is a test pattern
some_variable one # placeholder which replaces a variable
some_other_var 2 # other variable to replace
# Some random comment in between
different_var "hello" # this will also replace a placeholder but with a string
# And after some empty lines:
var_after_newlines 18 # some variable after a lot of newlines
{{hello}} " this is just a string surrounded by space "
{bello} "this is just a string"#and this is a comment
cello "#string with a comment in it"#and a comment
To which I apply:
sed -nE '/^\s*#/d;/^\s*$/d;s/^\s*([^\s]+)\s+([^\s]+).*$/\1 \2/p' pattern.txt > output.txt
it should clean out comment lines starting with # -> works
it should clean out empty lines (or lines with whitespace characters) -> works
it should replace every line with its first and second word separated by one (1) space character -> doesn't work. Compare:
output.txt
Expectation:
some_variable one
some_other_var 2
different_var "hello"
var_after_newlines 18
{{hello}} " this is just a string surrounded by space "
{bello} "this is just a string"
cello "#string with a comment in it"
Reality:
different_var "hello" # thi
var_after_newline
{{hello}} " thi
{bello} "thi
cello "#
What am I missing?
EDIT:
As @Ed Morton pointed out, it would make sense to also include the following cases: strings with spaces, strings with spaces before and after quotation marks, comments within strings, and comments right after the quotation mark. The accepted answer's sed solution works fine with all of these.
Completely based on your shown samples only, this could easily be done with awk. Written and tested with GNU awk; it should work with any awk.
awk '{sub(/\r$/,"")} NF && !/^#/{print $1,$2}' Input_file
Explanation: simply checking 2 conditions here. 1st: NF, which makes sure the line is NOT empty. 2nd: the line does NOT start with #. If both hold, print the 1st and 2nd columns of the current line.
With sed: please try the following in GNU sed.
sed -E 's/\r$//;/^#/d;/^\s*$/d;s/^ +//;s/([^ ]*) +([^ ]*).*/\1 \2/' Input_file
Or, as per Ed sir's comments, use the following:
sed -E 's/\r$//; /^#/d; /^\s*$/d; s/^\s+//; s/(\S*)\s+(\S*).*/\1 \2/' Input_file
Sample output is as follows for both above solutions:
some_variable one
some_other_var 2
different_var "hello"
var_after_newlines 18
In GNU sed
sed -E '/^\s*(#.*)?$/d; s/^\s*(\S+)\s+(\S+).*/\1 \2/' pattern.txt
Update after the comments:
sed -E '/^\s*(#.*)?$/d; s/^\s*(\S+)\s+("[^"]*"|\S+).*/\1 \2/' pattern.txt
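Run against the full sample file from the question, this updated command handles the quoted strings and embedded # characters as expected (output illustrative):
$ sed -E '/^\s*(#.*)?$/d; s/^\s*(\S+)\s+("[^"]*"|\S+).*/\1 \2/' pattern.txt
some_variable one
some_other_var 2
different_var "hello"
var_after_newlines 18
{{hello}} " this is just a string surrounded by space "
{bello} "this is just a string"
cello "#string with a comment in it"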
Version that should work with most any sed:
$ sed 's/^[[:space:]]*//; s/#.*//; /^$/d; s/^\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}\([^[:space:]]\{1,\}\).*/\1 \2/' pattern.txt
some_variable one
some_other_var 2
different_var "hello"
var_after_newlines 18

How to delete top and last non empty lines of the file

I want to delete the first and last non-empty lines of the file.
Example:
cat test.txt
//blank_line
abc
def
xyz
//blank_line
qwe
mnp
//blank_line
Then output should be:
def
xyz
//blank_line
qwe
I have tried the command
sed "$(awk '/./{line=NR} END{print line}' test.txt)d" test.txt
to remove the last non-empty line. Here there are two commands, (1) sed and (2) awk, but I want to do it with a single command.
Reading the whole file into memory at once, with GNU sed for -E and -z:
$ sed -Ez 's/^\s*\S+\n//; s/\n\s*\S+\s*$/\n/' test.txt
def
xyz
qwe
or with GNU awk for multi-char RS:
$ awk -v RS='^$' '{gsub(/^\s*\S+\n|\n\S+\s*$/,"")} 1' test.txt
def
xyz
qwe
Both GNU tools accept \s and \S as shorthand for [[:space:]] and [^[:space:]] respectively, and GNU sed accepts the non-POSIX \n escape as meaning newline.
This is a double pass method:
awk '(NR==FNR) { if(NF) {t=FNR;if(!h) h=FNR}; next}
(h<FNR && FNR<t)' file file
The integers h and t keep track of the head and the tail. With this test, lines containing only blanks also count as empty. If you want only truly empty lines to be treated as empty, replace if(NF) with if(length($0)!=0).
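A quick check against the test.txt from the question (note the interior blank line is preserved, as in the expected output):
$ awk '(NR==FNR) { if(NF) {t=FNR;if(!h) h=FNR}; next} (h<FNR && FNR<t)' test.txt test.txt
def
xyz

qwe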
This one reads everything into memory and does a simple replace at the end:
$ awk '{b=b RS $0}
END{ sub(/^[[:blank:]\n]*[^\n]+\n/,"",b);
sub(/\n[^\n]+[[:blank:]\n]*$/,"",b);
print b }' file
A single-pass, fast and relatively memory-efficient approach utilising a buffer:
awk 'f {
    if (NF) {              # a later non-empty line arrived, so it is now safe
        printf "%s", buf   # to flush everything buffered so far
        buf = ""
    }
    buf = (buf $0 ORS)     # buffer the current line; whatever is still buffered at
    next                   # EOF (the last non-empty line plus trailing blanks) is never printed
}
NF {                       # first non-empty line: mark it as seen but do not print it
    f = 1
}' file
Here is a golfed version of @kvantour's solution
$ awk 'NR==(n=FNR){e=!NF?e:n;b=!b?e:b}b<n&&n<e' file{,}
This might work for you (GNU sed):
sed -E '0,/\S/d;H;$!d;x;s/.(.*)\n.*\S.*/\1/' file
Use a range to delete up to and including the first line containing a non-space character. Then copy the remains of the file into the hold space and, at the end of the file, use substitution to remove the last line containing a non-space character and any empty lines from there to the end of the file.
Alternative:
sed '0,/\S/d' file | tac | sed '0,/\S/d' | tac

awk to transpose lines of a text file

A .csv file that has lines like this:
20111205 010016287,1.236220,1.236440
It needs to read like this:
20111205 01:00:16.287,1.236220,1.236440
How do I do this in awk? Experimenting, I got this far. I think I need to do it in two passes: one sub() to read the date & time field, and the next to change it.
awk -F, '{print;x=$1;sub(/.*=/,"",$1);}' data.csv
Use this awk command:
echo "20111205 010016287,1.236220,1.236440" | \
awk -F[\ \,] '{printf "%s %s:%s:%s.%s,%s,%s\n", \
$1,substr($2,1,2),substr($2,3,2),substr($2,5,2),substr($2,7,3),$3,$4}'
Explanation:
-F[\ \,]: sets the delimiter to space and ,
printf "%s %s:%s:%s.%s,%s,%s\n": format the output
substr($2,1,2), ...: cuts the second field ($2) into the desired pieces
Or use this sed command:
echo "20111205 010016287,1.236220,1.236440" | \
sed 's/\([0-9]\{8\}\) \([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{3\}\)/\1 \2:\3:\4.\5/g'
Explanation:
[0-9]\{8\}: first match an 8-digit pattern and save it as \1
[0-9]\{2\}...: after a space, match a 2-digit pattern 3 times and save them as \2, \3 and \4
[0-9]\{3\}: finally, match a 3-digit pattern and save it as \5
\1 \2:\3:\4.\5: format the output
sed is better suited to this job since it's a simple substitution on single lines:
$ sed -r 's/( ..)(..)(..)/\1:\2:\3./' file
20111205 01:00:16.287,1.236220,1.236440
but if you prefer, here's GNU awk with gensub():
$ awk '{print gensub(/( ..)(..)(..)/,"\\1:\\2:\\3.",1)}' file
20111205 01:00:16.287,1.236220,1.236440