Union "tables" with awk - awk

I have multiple "tables" in a file, such as:
col1, col2, col3, col4
1, 2, 3, 4
5, 6, 7, 8
col2, col3, col5
10, 11, 12
13, 14, 15
And I would like to collapse these 2 tables to:
col1, col2, col3, col4, col5
1 , 2 , 3 , 4 ,
5 , 6 , 7 , 8 ,
, 10 , 11 , , 12
, 13 , 14 , , 15
(Note: extra whitespace left just to make things easier to understand)
This would seem to require at least 2 passes, one to collect the full list of columns, and another one to create the output table. Is it possible to do this with awk? If not, what other tool would you recommend?

Give this a try:
Code:
$ cat s.awk
# First pass (NR==FNR): find the highest column number (assumes single-digit column numbers).
NR==FNR{
    if (match($1, /^col/))
        maxIndex = (substr($NF,4,1) > maxIndex) ? substr($NF,4,1) : maxIndex
    next
}
# Second pass: print the full header once.
FNR==1{
    for (i=1; i<=maxIndex; i++)
        header = (i==maxIndex) ? header "col" i : header "col" i ", "
    print header
}
# Header line of a table: map each column number to its field position.
/^col[1-9]/{
    for (i in places)
        delete places[i]
    for (i=1; i<=NF; i++){
        n = substr($i,4,1)
        places[n] = i
    }
}
# Data line: emit each known column's field, or an empty cell.
/^[0-9]/{
    s = ""
    for (i=1; i<=maxIndex; i++)
        s = (i in places) ? s $places[i] " " : s ", "
    print s
}
Call with (the file is passed twice, so the NR==FNR block sees the first pass and can collect the highest column number before any output is produced):
awk -f s.awk file file | column -t
Output:
col1, col2, col3, col4, col5
1, 2, 3, 4 ,
5, 6, 7, 8 ,
, 10, 11, , 12
, 13, 14, , 15
HTH Chris

The code assumes that the tables are separated by empty lines:
awk -F', *' '
END {
    for (i = 0; ++i <= c;)
        printf "%s", (cols[i] (i < c ? OFS : RS))
    for (i = 0; ++i <= n;)
        for (j = 0; ++j <= c;)
            printf "%s", (vals[i, cols[j]] (j < c ? OFS : RS))
}
!NF {
    fnr = NR + 1; next
}
NR == 1 || NR == fnr {
    for (i = 0; ++i <= NF;) {
        _[$i]++ || cols[++c] = $i
        idx[i] = $i
    }
    next
}
{
    ++n; for (i = 0; ++i <= NF;)
        vals[n, idx[i]] = $i
}' OFS=', ' tables
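A side note on the `_[$i]++ || cols[++c] = $i` line: `_[$i]++` returns the value before the increment, so it is 0 (false) only the first time a column name is seen, and only then does the short-circuit || run the assignment that appends the name to cols. The same idiom drives the classic dedup one-liner:
$ printf 'a\nb\na\nc\n' | awk '!seen[$0]++'
a
b
c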
If you have the tables in separate files:
awk -F', *' '
END {
    for (i = 0; ++i <= c;)
        printf "%s", (cols[i] (i < c ? OFS : RS))
    for (i = 0; ++i <= n;)
        for (j = 0; ++j <= c;)
            printf "%s", (vals[i, cols[j]] (j < c ? OFS : RS))
}
FNR == 1 {
    for (i = 0; ++i <= NF;) {
        _[$i]++ || cols[++c] = $i
        idx[i] = $i
    }
    next
}
{
    ++n; for (i = 0; ++i <= NF;)
        vals[n, idx[i]] = $i
}' OFS=', ' file1 file2 [.. filen]

Here's a one-pass perl solution. It assumes there is at least one blank line between each table in the file.
perl -00 -ne '
    BEGIN {
        %column2idx = ();
        @idx2column = ();
        $lineno = 0;
        @lines = ();
    }
    chomp;
    @rows = split /\n/;
    @field_map = ();
    @F = split /, /, $rows[0];
    for ($i = 0; $i < @F; $i++) {
        if (not exists $column2idx{$F[$i]}) {
            $idx = @idx2column;
            $column2idx{$F[$i]} = $idx;
            $idx2column[$idx] = $F[$i];
        }
        $field_map[$i] = $column2idx{$F[$i]};
    }
    for ($i = 1; $i < @rows; $i++) {
        @{$lines[$lineno]} = ();
        @F = split /, /, $rows[$i];
        for ($j = 0; $j < @F; $j++) {
            $lines[$lineno][$field_map[$j]] = $F[$j];
        }
        $lineno++;
    }
    END {
        $ncols = @idx2column;
        print join(", ", @idx2column), "\n";
        foreach $row (@lines) {
            @row = ();
            for ($i = 0; $i < $ncols; $i++) {
                push @row, $row->[$i];
            }
            print join(", ", @row), "\n";
        }
    }
' tables | column -t
output
col1, col2, col3, col4, col5
1, 2, 3, 4,
5, 6, 7, 8,
, 10, 11, , 12
, 13, 14, , 15


AWK new line sorting

I have a script that sorts numbers:
{
if ($1 <= 9) xd++
else if ($1 > 9 && $1 <= 19) xd1++
else if ($1 > 19 && $1 <= 29) xd2++
else if ($1 > 29 && $1 <= 39) xd3++
else if ($1 > 39 && $1 <= 49) xd4++
else if ($1 > 49 && $1 <= 59) xd5++
else if ($1 > 59 && $1 <= 69) xd6++
else if ($1 > 69 && $1 <= 79) xd7++
else if ($1 > 79 && $1 <= 89) xd8++
else if ($1 > 89 && $1 <= 99) xd9++
else if ($1 == 100) xd10++
} END {
print "0-9 : "xd, "10-19 : " xd1, "20-29 : " xd2, "30-39 : " xd3, "40-49 : " xd4, "50-59 : " xd5, "60-69 : " xd6, "70-79 : " xd7, "80-89 : " xd8, "90-99 : " xd9, "100 : " xd10
}
output:
$ cat xd1 | awk -f script.awk
0-9 : 16 10-19 : 4 20-29 : 30-39 : 2 40-49 : 1 50-59 : 1 60-69 : 1 70-79 : 1 80-89 : 1 90-99 : 1 100 : 2
How can I make each range print on a new line?
like this:
0-9 : 16
10-19 : 4
20-29 :
30-39 : 2
Printing with \n doesn't work.
Additionally: the 0-9 range holds 16 numbers; how can I also show each count as a row of "+" signs,
like this:
0-9 : 16 ++++++++++++++++
10-19 : 4 ++++
20-29 :
30-39 : 2 ++
thank you in advance
If we rewrite the current code to use an array to keep track of counts, we can then use a simple for loop to print the results on individual lines, eg:
{ if ($1 <= 9) xd[0]++
else if ($1 <= 19) xd[1]++
else if ($1 <= 29) xd[2]++
else if ($1 <= 39) xd[3]++
else if ($1 <= 49) xd[4]++
else if ($1 <= 59) xd[5]++
else if ($1 <= 69) xd[6]++
else if ($1 <= 79) xd[7]++
else if ($1 <= 89) xd[8]++
else if ($1 <= 99) xd[9]++
else xd[10]++
}
END { for (i=0;i<=9;i++)
print (i*10) "-" (i*10)+9, ":", xd[i]
print "100 :", xd[10]
}
At this point we could also replace the 1st part of the script with a comparable for loop, eg:
{ for (i=0;i<=9;i++)
if ($1 <= (i*10)+9) {
xd[i]++
next
}
xd[10]++
}
END { for (i=0;i<=9;i++)
print (i*10) "-" (i*10)+9, ":", xd[i]
print "100 :", xd[10]
}
As for the additional requirement to print a variable number of + at the end of each line, we can add a function (prt()) to generate them:
function prt(n ,x) {
x=""
if (n) {
x=sprintf("%*s",n," ")
gsub(/ /,"+",x)
}
return x
}
{ for (i=0;i<=9;i++)
if ($1 <= (i*10)+9) {
xd[i]++
next
}
xd[10]++
}
END { for (i=0;i<=9;i++)
print (i*10) "-" (i*10)+9, ":", xd[i], prt(xd[i])
print "100 :", xd[10], prt(xd[10])
}
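For reference, the %*s trick inside prt(): the * takes the field width from the next argument, so sprintf("%*s", n, " ") yields a string of n spaces, which gsub() then turns into n plus signs. A quick check in gawk:
$ awk 'BEGIN { s = sprintf("%*s", 5, " "); gsub(/ /, "+", s); print s }'
+++++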
How can I make each range print on a new line?
Inform GNU AWK that you want OFS (the output field separator) to be a newline. Consider the following simple example:
awk 'BEGIN{x=1;y=2;z=3}END{print "x is " x, "y is " y, "z is " z}' emptyfile
gives output
x is 1 y is 2 z is 3
whilst
awk 'BEGIN{OFS="\n";x=1;y=2;z=3}END{print "x is " x, "y is " y, "z is " z}' emptyfile
gives output
x is 1
y is 2
z is 3
Explanation: the OFS value (default: a single space) is used to join the comma-separated arguments of print. If you want to know more about OFS, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR.
(tested in gawk 4.2.1)
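Applied to the script above, a minimal sketch (assuming the same xd, xd1, ... xd10 variables as in the question):
END {
    OFS = "\n"      # print's comma-separated arguments are now joined with newlines
    print "0-9 : " xd, "10-19 : " xd1, "20-29 : " xd2, "30-39 : " xd3,
          "40-49 : " xd4, "50-59 : " xd5, "60-69 : " xd6, "70-79 : " xd7,
          "80-89 : " xd8, "90-99 : " xd9, "100 : " xd10
}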
You don't need to hard-code ten buckets like that:
jot -r 300 1 169 | mawk '
BEGIN { _+=(_+=_^=_<_)*_*_ } { ++___[_<(__=int(($!!_)/_))?_:__] }
END {
____ = sprintf("%*s", NR, _)
gsub(".","+",____)
for(__=_-_;__<=_;__++) {
printf(" [%3.f %-6s] : %5.f %.*s\n",__*_,+__==+_?"+ "\
: " , " __*_--+_++, ___[__], ___[__], ____) } }'
[ 0 , 9 ] : 16 ++++++++++++++++
[ 10 , 19 ] : 17 +++++++++++++++++
[ 20 , 29 ] : 16 ++++++++++++++++
[ 30 , 39 ] : 19 +++++++++++++++++++
[ 40 , 49 ] : 14 ++++++++++++++
[ 50 , 59 ] : 18 ++++++++++++++++++
[ 60 , 69 ] : 18 ++++++++++++++++++
[ 70 , 79 ] : 16 ++++++++++++++++
[ 80 , 89 ] : 20 ++++++++++++++++++++
[ 90 , 99 ] : 19 +++++++++++++++++++
[100 + ] : 127 ++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++
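For comparison, a more readable sketch of the same no-hard-coding idea (assuming, as in the question, integer input from 0 to 100; the bucket and label names are mine):
awk '
{ b = ($1 >= 100) ? 10 : int($1 / 10); cnt[b]++ }   # bucket 0 holds 0-9, ..., bucket 10 holds 100+
END {
    for (b = 0; b <= 10; b++) {
        if (b == 10) label = "100"
        else label = b * 10 "-" (b * 10 + 9)
        bar = ""
        for (j = 0; j < cnt[b]; j++) bar = bar "+"
        printf "%-5s : %s %s\n", label, cnt[b], bar
    }
}' file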

filter lines in file 2 by whether their position falls within a range in columns 2 and 3 of file 1 for the same chr

I have two files, shown below, with chr values from 1 to 22. I would like to use chr, pos1 and pos2 from file 1 to keep only the lines in file 2 whose pos1 value lies between pos1 and pos2 in file 1 for the same chr, as shown below.
file 1
chr pos1 pos2
1 2389078 2489001
1 2800001 3023010
1 2567898 2708901
3 5647956 6356191
4 5668887 6757869
file 2 :
chr pos1
1 2460067
1 2389080
3 5508907
output file:
chr pos1
1 2460067
1 2389080
I have tried to run some similar solutions such as the one below:
awk 'NR==FNR{ start[$1] = $2; end[$1] = $3; next } (FNR==1) || ( ($1 in start) && ($2 >= start[$1]) && ($2 <= end[$1]) ) ' file1 file2
However, this only prints the first line. How can this be improved?
Because the chr values in file1 are duplicated, the start and end entries in your script are overwritten by later rows with the same chr. Would you please try:
awk '
# check if "pos" is between pos1 and pos2 of any stored range for chromosome "ch"
function inrange(ch, pos,   i) {
    for (i in chr) {
        if (chr[i] == ch && pos >= start[i] && pos <= end[i]) return 1
    }
    return 0 # no match
}
NR==FNR {
    if (NR > 1) { # skip the header line
        chr[++n] = $1 # "n" works as an id
        start[n] = $2
        end[n] = $3
    }
    next
}
FNR==1 || inrange($1, $2) # print the header and the in-range lines of file2
' file1 file2
Output:
chr pos1
1 2460067
1 2389080
If the output order doesn't matter then using GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {
chrpos[$1][$2]
next
}
$1 in chrpos {
for (pos in chrpos[$1]) {
if ( ($2 <= pos) && (pos <= $3) ) {
print $1, pos
delete chrpos[$1][pos]
}
}
}
$ awk -f tst.awk file2 file1
chr pos1
1 2389080
1 2460067
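If you do need a deterministic output order, gawk (only) lets for-in loops iterate in sorted order via PROCINFO["sorted_in"]; for example, adding this to tst.awk visits each chr's positions in numeric order:
BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }   # gawk-only: numeric, ascending index order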

matching non-unique values to unique values

I have data which looks like this
1 3
1 2
1 9
5 4
4 6
5 6
5 8
5 9
4 2
I would like the output to be
1 3,2,9
5 4,6,8,9
4 6,2
This is just sample data but my original one has lots more values.
This worked. It basically creates a hash table, using the first column as the key and appending the remaining columns of each line as the value:
awk '{line="";for (i = 2; i <= NF; i++) line = line $i ", "; table[$1]=table[$1] line;} END {for (key in table) print key " => " table[key];}' trial.txt
OUTPUT
4 => 6, 2
5 => 4, 6, 8, 9
1 => 3, 2, 9
I'd write
awk -v OFS=, '
{
key = $1
$1 = ""
values[key] = values[key] $0
}
END {
for (key in values) {
sub(/^,/, "", values[key])
print key " " values[key]
}
}
' file
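Worth noting how this works: assigning to $1 forces awk to rebuild $0 with OFS, so with OFS="," the line "1 3" becomes ",3"; the values string therefore accumulates comma-prefixed entries, and the single leading comma is stripped in END. A standalone demo:
$ echo '1 3' | awk -v OFS=, '{ $1 = ""; print $0 }'
,3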
If you want only the unique values for each key (requires GNU awk for multi-dimensional arrays)
gawk -v OFS=, '
{ for (i=2; i<=NF; i++) values[$1][$i] = i }
END {
for (key in values) {
printf "%s ", key
sep = ""
for (val in values[key]) {
printf "%s%s", sep, val
sep = ","
}
print ""
}
}
' file
or perl
perl -lane '
$key = shift @F;
$values{$key}{$_} = 1 for @F;
} END {
$, = " ";
print $_, join(",", keys %{$values{$_}}) for keys %values;
' file
If you're not concerned with the order of the keys, I think this is the idiomatic awk solution:
$ awk '{a[$1]=($1 in a?a[$1]",":"") $2}
END{for(k in a) print k,a[k]}' file |
column -t
4 6,2
5 4,6,8,9
1 3,2,9
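If the first-seen order of the keys matters (the sample output lists 1, 5, 4 in input order), a minimal sketch that records that order in a second array:
awk '
!($1 in a) { order[++n] = $1 }                      # remember keys in first-seen order
{ a[$1] = ($1 in a ? a[$1] "," : "") $2 }
END { for (i = 1; i <= n; i++) print order[i], a[order[i]] }
' file | column -t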

how to change a dynamic pattern in awk?

The data is something like:
"1||2""3""2||3""5""4||3""6""43""4||4||3""4||3", 43 ,"4||3""43""3||4||4||3"
I've tried this myself:
BEGIN {
FPAT = "(\"[^\"]+\")|([ ])"
}
{
print "NF = ", NF
for (i = 1; i <= NF; i++) {
printf("$%d = <%s>\n", i, $i)
}
}
but the problem is that it gives me output like:
$ gawk -f prog4.awk data1.txt
NF = 18
$1 = <"1||2">
$2 = <"3">
$3 = <"2||3">
$4 = <"5">
$5 = <"4||3">
$6 = <"6">
$7 = <"43">
$8 = <"4||4||3">
$9 = <"4||3">
$10 = <,>
$11 = < >
$12 = <4>
$13 = <3>
$14 = < >
$15 = <,>
$16 = <"4||3">
$17 = <"43">
$18 = <"3||4||4||3">
As you can see, in $10 to $15 each and every character is taken as a separate field. Help appreciated.
Let's try approaching this a different way - if the following is not what you are looking for, please tell us in what way(s) it differs from your desired output and why:
$ cat tst.awk
BEGIN { FPAT="\"[^\"]+\"" }
{
for (i=1; i<=NF; i++) {
print i, "<" $i ">"
}
}
$
$ gawk -f tst.awk file
1 <"1||2">
2 <"3">
3 <"2||3">
4 <"5">
5 <"4||3">
6 <"6">
7 <"43">
8 <"4||4||3">
9 <"4||3">
10 <"4||3">
11 <"43">
12 <"3||4||4||3">

How to subtract milliseconds with awk

I'm trying to create an awk script that subtracts times, to millisecond precision, between pairs of records. For example:
On the command line I might do this:
Input:
06:20:00.120
06:20:00.361
06:20:15.205
06:20:15.431
06:20:35.073
06:20:36.190
06:20:59.604
06:21:00.514
06:21:25.145
06:21:26.125
Command:
awk '{ if ( ( NR % 2 ) == 0 ) { printf("%s\n",$0) } else { printf("%s ",$0) } }' input
I'll obtain this:
06:20:00.120 06:20:00.361
06:20:15.205 06:20:15.431
06:20:35.073 06:20:36.190
06:20:59.604 06:21:00.514
06:21:25.145 06:21:26.125
To subtract the milliseconds properly:
awk '{ if ( ( NR % 2 ) == 0 ) { printf("%s\n",$0) } else { printf("%s ",$0) } }' input| awk -F':| ' '{print $3, $6}'
And to avoid negative numbers:
awk '{if ($2<$1) sub(/00/, "60",$2); print $0}'
awk '{$3=($2-$1); print $3}'
The goal is to get this:
Call 1 0.241 ms
Call 2 0.226 ms
Call 3 1.117 ms
Call 4 0.91 ms
Call 5 0.98 ms
And finally, an average.
I can perform each of these steps command by command, but I don't know how to put them together into one script. Please help.
Using awk:
awk '
BEGIN { cmd = "date +%s.%N -d " }
NR%2 {
cmd $0 | getline var1;
next
}
{
cmd $0 | getline var2;
var3 = var2 - var1;
print "Call " ++i, var3 " ms"
}
' file
Call 1 0.241 ms
Call 2 0.226 ms
Call 3 1.117 ms
Call 4 0.91 ms
Call 5 0.98 ms
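One caveat about this answer (my tweak, not part of the original): every `cmd $0 | getline` opens a pipe keyed by the full command string and never closes it, so a long input can exhaust file descriptors. Closing each pipe after reading avoids that:
BEGIN { cmd = "date +%s.%N -d " }
NR%2 {
    cmd $0 | getline var1
    close(cmd $0)          # release the pipe before spawning the next one
    next
}
{
    cmd $0 | getline var2
    close(cmd $0)
    var3 = var2 - var1
    print "Call " ++i, var3 " ms"
}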
One way using awk:
Content of script.awk:
## For every input line.
{
## Convert formatted times to milliseconds.
t1 = to_ms( $0 )
getline
t2 = to_ms( $0 )
## Calculate the difference between both times in milliseconds.
tr = (t1 >= t2) ? t1 - t2 : t2 - t1
## Print to output with time converted to a readable format.
printf "Call %d %s ms\n", ++cont, to_time( tr )
}
## Convert a time in the format hh:mm:ss.mmm to milliseconds.
function to_ms(time, time_ms, time_arr)
{
split( time, time_arr, /:|\./ )
time_ms = ( time_arr[1] * 3600 + time_arr[2] * 60 + time_arr[3] ) * 1000 + time_arr[4]
return time_ms
}
## Convert a time in milliseconds to the format hh:mm:ss.mmm. If 'hours' or 'minutes'
## are 0, don't print them.
function to_time(i_ms, time)
{
ms = int( i_ms % 1000 )
s = int( i_ms / 1000 )
h = int( s / 3600 )
s = s % 3600
m = int( s / 60 )
s = s % 60
# time = (h != 0 ? h ":" : "") (m != 0 ? m ":" : "") s "." ms
time = (h != 0 ? h ":" : "") (m != 0 ? m ":" : "") s "." sprintf( "%03d", ms )
return time
}
Run the script:
awk -f script.awk infile
Result:
Call 1 0.241 ms
Call 2 0.226 ms
Call 3 1.117 ms
Call 4 0.910 ms
Call 5 0.980 ms
If you're not tied to awk:
to_epoch() { date -d "$1" "+%s.%N"; }
count=0
paste - - < input |
while read t1 t2; do
((count++))
diff=$(printf "%s-%s\n" $(to_epoch "$t2") $(to_epoch "$t1") | bc -l)
printf "Call %d %5.3f ms\n" $count $diff
done
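None of the answers computes the requested average; a minimal sketch (assuming the "Call N X ms" output format above, e.g. from the second answer's script.awk):
awk -f script.awk infile |
awk '{ sum += $3; n++ } END { if (n) printf "Average %.3f ms\n", sum / n }'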