Convert data format using awk?

There is a file which contains data in a 'n*1' format:
1
2
3
4
5
6
Is there any way to convert it to a 'n*3' format like:
1,2,3
4,5,6
via awk rather than using a for loop?
I really have no idea how to approach this. Any help or keyword is appreciated.

Using awk
$ awk '{printf "%s%s",$0,(NR%3==0?ORS:",")}' File
1,2,3
4,5,6
The command printf "%s%s",$0,(NR%3==0?ORS:",") tells awk to print two strings. The first is $0 which is the current line. The second string is NR%3==0?ORS:"," which is either ORS the output record separator (if the line number is a multiple of three) or else , for all other line numbers.
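The same idea can be generalised to any group size. The following is only a sketch: the function name group_lines and the variable n are made up for illustration and are not part of the answer above. It writes the comma before each item rather than after it, so an input whose line count is not a multiple of n still ends cleanly.

```shell
# group_lines N: join every N input lines into one comma-separated line.
# (hypothetical helper; N defaults to 3)
group_lines() {
  awk -v n="${1:-3}" '
    {
      # no separator before the first item of each group
      sep = ((NR - 1) % n == 0) ? "" : ","
      printf "%s%s", sep, $0
      # close the group after every n-th line
      if (NR % n == 0) printf "%s", ORS
    }
    # terminate a trailing, incomplete group
    END { if (NR % n) printf "%s", ORS }
  '
}
```

For example, group_lines 3 < File gives the n*3 layout asked for, and group_lines 2 < File would give pairs.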
Using sed
$ sed 'N;N;s/\n/,/g' File
1,2,3
4,5,6
By default, sed reads in each line from the file one by one. N tells sed to read in another line, appending the line to the current one, separated by a newline. N;N tells sed to do that twice so that we have a total of three lines in the pattern space. s/\n/,/g tells sed to replace those two separator newlines with commas. The result is then printed.
The above assumes that we are using GNU sed. With minor modifications, this can be made to work with BSD/OSX sed.
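One portability concern (an assumption about which modification is meant) is the behaviour of N on the last line: POSIX lets sed quit without printing the pattern space when there is no next line to read, whereas GNU sed prints it. Guarding each N with $! avoids depending on that. This is a sketch, and the function name join3 is hypothetical.

```shell
# join3: join every three lines with commas without relying on what
# N does at end of input. (hypothetical helper)
join3() {
  # $!N = append the next line only if this is not the last line
  sed '$!N;$!N;s/\n/,/g'
}
```

With five input lines this prints 1,2,3 and then 4,5 instead of possibly dropping the incomplete group.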

The simplest approach is the paste command:
paste -d, - - - <file
The output:
1,2,3
4,5,6

The following may help:
xargs -n3 < Input_file | sed 's/ /,/g'

Try this:
awk 'NR%3==0{print;next}{printf "%s,",$0}' file
or decomposed :
NR%3==0 # condition, modulo 3 == 0
{print;next} # then print the line and skip to the next input line
{printf "%s,",$0} # otherwise printf the current line followed by a comma, without a newline

$ awk '{ORS=(NR%3?",":RS)}1' file
1,2,3
4,5,6

Related

Convert multiple lines to a line separated by brackets and "|"

I have the following data in multiple lines:
1
2
3
4
5
6
7
8
9
10
I want to convert them to lines separated by "|" and "()":
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|10
I made a mistake. I'm sorry,I want to convert them to lines separated by "|" and "()":
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
What I have tried is:
seq 10 | sed -r 's/(.*)/(\1)/'|paste -sd"|"
What's the best unix one-liner to do that?
This might work for you (GNU sed):
sed 's/.*/(&)/;H;1h;$!d;x;s/\n/|/g' file
Surround each line by parens.
Append all lines to the hold space except for the first line which replaces the hold space.
Delete all lines except the last.
On the last line, swap to the hold space and replace all newlines by |'s.
N.B. When a line is deleted no further commands are invoked and the command cycle begins again. That is why the last two commands are only executed on the last line of the file.
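The steps above can also be written as a multi-line sed script with one command per line, which some people find easier to follow. This is just the same one-liner spread out; the function name join_parens is made up for illustration.

```shell
# join_parens: wrap each input line in () and join all lines with |.
# (hypothetical wrapper around the one-liner above)
join_parens() {
  sed '
    # surround the current line with parentheses
    s/.*/(&)/
    # append it to the hold space (this adds a leading newline)
    H
    # on line 1, overwrite the hold space instead
    1h
    # every line but the last: delete and start the next cycle
    $!d
    # last line only: fetch the accumulated text from the hold space
    x
    # and turn the embedded newlines into pipes
    s/\n/|/g
  '
}
```

Usage would be something like seq 10 | join_parens.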
Alternative:
sed -z 's/\n$//;s/.*/(&)/mg;y/\n/|/' file
With your shown samples please try following awk code. This should work in any version of awk.
awk -v OFS="|" '{val=(val?val OFS:"") "("$0")"} END{print val}' Input_file
Using GNU sed
$ sed -Ez ':a;s/([0-9]+)\n/(\1)|/;ta;s/\|$/\n/' input_file
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
Here is another simple awk command:
awk 'NR>1 {printf "%s|", p} {p="(" $0 ")"} END {print p}' file
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
Here it is:
sed -z 's/^/(/;s/\n/)|(/g;s/|($//' your_input
where -z allows you to treat the whole file as a single string with embedded \ns.
In detail, the sed script above consists of 3 commands separated by ;s:
s/^/(/ inserts a ( at the beginning of the whole file,
s/\n/)|(/g changes every \n to )|(;
s/|($// removes the trailing |( resulting from the \n at EOF, that is likely in your file since you are on linux.
With perl:
$ seq 10 | perl -pe 's/.*/($&)/; s/\n/|/ if !eof'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
s/.*/($&)/ to surround input lines with ()
s/\n/|/ if !eof will change newline to | except for the last input line.
Here's a solution with paste (just for fun):
$ seq 10 | paste -d'()' /dev/null - /dev/null | paste -sd'|'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)
Using any awk:
$ seq 10 | awk '{printf "%s(%s)", sep, $0; sep="|"} END{print ""}'
(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(10)

Extracting and rearranging columns

I read lines from stdin which contain fields. The field delimiter is a semicolon. There are no specific quoting characters in the input (i.e. the fields themselves can't contain semicolons or newline characters). The number of input fields is unknown, but it is at least 4.
The output is supposed to be a similar file, consisting of the fields from 2 to the end, but field 2 and 3 reversed in order.
I'm using zsh.
I came up with a solution, but find it clumsy. In particular, I could not think of anything specific to zsh which would help me here, so basically I reverted to awk. This is my approach:
awk -F ';' '{printf("%s", $3 ";" $2); for(i=4;i<=NF;i++) printf(";%s", $i); print "" }' <input_file >output_file
The first printf takes care of the two reversed fields, and then I use an explicit loop to write out the remaining fields. Is there a possibility in awk (or gawk) to print a range of fields in a single command? Or did I miss some incredibly clever feature in zsh, which could make my life simpler?
UPDATE: Example input data
a;bb;c;D;e;fff
gg;h;ii;jj;kk;l;m;n
Should produce the output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Using any awk in any shell on every Unix box:
$ awk 'BEGIN{FS=OFS=";"} {t=$3; $3=$2; $2=t; sub(/[^;]*;/,"")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With GNU awk you could try the following code. It uses GNU awk's match function with the regex ^[^;]*;([^;]*;)([^;]*;)(.*)$, which creates 3 capturing groups. Their values are stored in the array named arr (the array argument to match is a GNU awk extension), and the program then prints them in the required order.
awk 'match($0,/^[^;]*;([^;]*;)([^;]*;)(.*)$/,arr){
print arr[2] arr[1] arr[3]
}
' Input_file
If perl is accepted, it provides a join() function to join elements on a delimiter. In awk though you'd have to explicitly define one (which isn't complex, just more lines of code)
perl -F';' -nlae '$t = @F[2]; @F[2] = @F[1]; $F[1] = $t; print join(";", @F[1..$#F])' file
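For comparison, here is roughly what such a user-defined join could look like in awk. It is a sketch only: the names swap_fields and join_range are invented for this example, and it assumes at least 4 fields per line, as stated in the question.

```shell
# swap_fields: print fields 3, 2, then 4..NF of ;-separated input.
# (hypothetical helper; assumes every line has at least 4 fields)
swap_fields() {
  awk -F';' '
    # join_range: concatenate fields first..last with ";" in between.
    # i and out are local variables (extra parameters by awk convention).
    function join_range(first, last,    i, out) {
      out = $first
      for (i = first + 1; i <= last; i++)
        out = out ";" $i
      return out
    }
    { print $3 ";" $2 ";" join_range(4, NF) }
  '
}
```

It would be used as swap_fields <input_file >output_file. The loop is still there, but it is hidden behind a reusable function.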
With sed, perl, hck and rcut (my own script):
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# can also use: perl -F';' -lape '$_ = join ";", @F[2,1,3..$#F]' ip.txt
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# -d and -D specifies input/output separators
$ hck -d';' -D';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# syntax similar to cut, but output field order can be different
$ rcut -d';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Note that the sed version leaves input lines with fewer than 3 fields unchanged.
$ cat ip.txt
1;2;3
apple;fig
abc
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
3;2
apple;fig
abc
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
3;2
;fig
;
Another awk variant:
awk 'BEGIN{FS=OFS=";"} {$1=$3; $3=""; sub(/;;/, ";")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With gnu awk and gensub switching the position of 2 capture groups:
awk '{print gensub(/^[^;]*;([^;]*);([^;]*)/, "\\2;\\1", 1)}' file
The pattern matches
^ Start of string
[^;]*; Negated character class, match optional chars other than ; and then match ;
([^;]*);([^;]*) 2 capture groups, both capturing chars other than ; and match ; in between
Output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
awk '{print $3, $0}' {,O}FS=\; < file | cut -d\; -f1,3,5-
This uses awk to prepend the third column, then pipes to cut to extract the desired columns.
Here is one way to do it using only zsh:
rearrange() {
local -a lines=(${(@f)$(</dev/stdin)})
for line in $lines; do
local -a flds=(${(s.;.)line})
print $flds[3]';'$flds[2]';'${(j.;.)flds[4,-1]}
done
}
The same idea in a single line. This may not be an improvement over your awk script:
for l in ${(@f)$(<&0)}; print ${${(A)i::=${(s.;.)l}}[3]}\;$i[2]\;${(j.;.)i:3}
Some of the pieces:
$(</dev/stdin) - read from stdin using pseudo-device.
$(<&0) - another way to read from stdin.
(f) - parameter expansion flag to split by newlines.
(@) - treat split as an array.
(s.;.) - split by semicolon.
$flds[3] - expands to the third array element.
$flds[4,-1] - fourth, fifth, etc. array elements.
$i:3 - ksh-style array slice for fourth, fifth ... elements.
Mixing styles like this can be confusing, even if it is slightly shorter.
(j.;.) - join array by semicolon.
i::= - assign the result of the expansion to the variable i.
This lets us use the semicolon-split fields later.
(A)i::= - the (A) flag ensures i is an array.

awk to remove 5th column from N column with fixed delimiter

I have a file with N columns.
I want to remove the 5th column counting from the last of the N columns.
Delimiter is "|"
I tested with simple example as shown below:
bash-3.2$ echo "1|2|3|4|5|6|7|8" | nawk -F\| '{print $(NF-4)}'
4
Expecting result:
1|2|3|5|6|7|8
How should I change my command to get the desired output?
If I understand you correctly, you want to use something like this:
sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/'
This matches a pipe character \| followed by any number of non-pipe characters [^|]*, then captures 4 more of the same pattern ((\|[^|]*){4}). The $ at the end matches the end of the line. The first part of the match (i.e. the fifth field from the end) is dropped.
Testing it out:
$ sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/' <<<"1|2|3|4|5|6|7"
1|2|4|5|6|7
You could achieve the same thing using GNU awk with gensub but I think that sed is the right tool for the job in this case.
If your version of sed doesn't support extended regex syntax with -E, you can modify it slightly:
sed 's/|[^|]*\(\(|[^|]*\)\{4\}\)$/\1/'
In basic mode, pipes are interpreted literally, but parentheses for capture groups and curly braces need to be escaped.
AWK is your friend :
Sample Input
A|B|C|D|E|F|G|H|I
A|B|C|D|E|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|D|O|R|Q|U|I
A|B|C|D|E|F|G|H|I|E|O|Q
A|B|C|D|E|F|G|H|I|X
A|B|C|D|E|F|G|H|I|J|K|L
Script
awk 'BEGIN{FS="|";OFS="|"}
{$(NF-5)="";sub(/\|\|/,"|");print}' file
Sample Output
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
What we did here
As you are aware, awk has special variables that store each field in the record, ranging from $1, $2 up to $(NF).
Excluding the 5th column from the last is as simple as:
Emptying the column, i.e. $(NF-5)=""
Removing from the record the consecutive || formed by the above step, i.e. sub(/\|\|/,"|")
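One caveat with the empty-then-collapse trick: sub() replaces the first || in the record, so a line that already contains an empty field further left would lose that one instead. A loop that rebuilds the record avoids this. This is only a sketch; the function name drop_from_end and the variable k are invented here, and k counts from the end starting at 1 (so k=5 matches the question's 1|2|3|5|6|7|8, while k=6 matches the $(NF-5) script above).

```shell
# drop_from_end K: remove the K-th field counted from the end of each
# |-separated line. (hypothetical helper; K defaults to 5)
drop_from_end() {
  awk -F'|' -v OFS='|' -v k="${1:-5}" '
    {
      target = NF - k + 1            # index of the field to drop
      if (target < 1) { print; next } # too few fields: leave line as is
      out = ""; sep = ""
      for (i = 1; i <= NF; i++) {
        if (i == target) continue     # skip the unwanted field
        out = out sep $i
        sep = OFS
      }
      print out
    }
  '
}
```

For example, echo "1|2|3|4|5|6|7|8" | drop_from_end 5 gives the result expected in the question.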
Another alternative, using @sjsam's input file:
$ rev file | cut -d'|' --complement -f6 | rev
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
I'm not sure whether you want the 5th or the 6th field from the end, but it's easy to adjust.
Thanks for the help and guidance.
Below is what I tested:
bash-3.2$ echo "1|2|3|4|5|6|7|8|9" | nawk 'BEGIN{FS="|";OFS="|"} {$(NF-4)="!";print}' | sed 's/|!//'
Output: 1|2|3|4|6|7|8|9
I tested it further on the file that I extracted from the system, and it worked fine.

awk to transpose lines of a text file

A .csv file that has lines like this:
20111205 010016287,1.236220,1.236440
It needs to read like this:
20111205 01:00:16.287,1.236220,1.236440
How do I do this in awk? Experimenting, I got this far. I need to do it in two passes I think. One sub to read the date&time field, and the next to change it.
awk -F, '{print;x=$1;sub(/.*=/,"",$1);}' data.csv
Use that awk command:
echo "20111205 010016287,1.236220,1.236440" | \
awk -F[\ \,] '{printf "%s %s:%s:%s.%s,%s,%s\n", \
$1,substr($2,1,2),substr($2,3,2),substr($2,5,2),substr($2,7,3),$3,$4}'
Explanation:
-F[\ \,]: sets the delimiter to space and ,
printf "%s %s:%s:%s.%s,%s,%s\n": format the output
substr($2,1,2), substr($2,3,2), ...: cut the second field ($2) into the desired pieces (substr positions start at 1)
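A variant of the same substr approach that keeps the comma as the only field separator, so it does not depend on the number of columns after the timestamp. This is a sketch; the function name format_timestamp is made up, and it assumes the time part is always 9 digits (HHMMSSmmm) as in the sample line.

```shell
# format_timestamp: rewrite "YYYYMMDD HHMMSSmmm" in the first CSV column
# as "YYYYMMDD HH:MM:SS.mmm". (hypothetical helper)
format_timestamp() {
  awk -F, -v OFS=, '
    {
      split($1, dt, " ")              # dt[1] = date, dt[2] = time digits
      t = dt[2]
      if (length(t) != 9) { print; next }  # unexpected shape: pass through
      $1 = dt[1] " " substr(t, 1, 2) ":" substr(t, 3, 2) ":" \
           substr(t, 5, 2) "." substr(t, 7)
      print                           # record is rebuilt with OFS=","
    }
  '
}
```

It would be run as format_timestamp < data.csv.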
Or use that sed command:
echo "20111205 010016287,1.236220,1.236440" | \
sed 's/\([0-9]\{8\}\) \([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{3\}\)/\1 \2:\3:\4.\5/g'
Explanation:
[0-9]\{8\}: first match a 8-digit pattern and save it as \1
[0-9]\{2\}...: after a space match 3 times a 2-digit pattern and save them to \2, \3 and \4
[0-9]\{3\}: and at last match 3-digit pattern and save it as \5
\1 \2:\3:\4.\5: format the output
sed is better suited to this job since it's a simple substitution on single lines:
$ sed -r 's/( ..)(..)(..)/\1:\2:\3./' file
20111205 01:00:16.287,1.236220,1.236440
but if you prefer here's GNU awk with gensub():
$ awk '{print gensub(/( ..)(..)(..)/,"\\1:\\2:\\3.",1)}' file
20111205 01:00:16.287,1.236220,1.236440

printing lines using awk which have more than 3 space-delimited words?

I have a file containing lines of sentences. I want to print all lines that have more than 3 words. Words are separated by whitespace.
How could I do this with awk?
Use awk like this:
awk 'NF>3' file
GNU sed
sed -E '/\s*(\S+\s+){3}\S+/!d' file
The variable NF indicates the number of fields on the current input line.
awk 'NF>3' file
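If the threshold needs to vary, it can be passed in as a variable. A small sketch, assuming the default field separator (runs of blanks and tabs); the function name lines_with_more_words and the variable min are hypothetical.

```shell
# lines_with_more_words N: print lines with more than N
# whitespace-separated words. (hypothetical helper; N defaults to 3)
lines_with_more_words() {
  # a bare pattern with no action prints the matching lines
  awk -v min="${1:-3}" 'NF > min'
}
```

For example, lines_with_more_words 3 < file behaves like the command above.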