Transpose till "n"th column into row: - awk

I would like to transpose a column into rows, starting a new row after every 3rd value.
Input.txt
Name
Age
Place
aa
22
xx
bb
33
yy
cc
44
zz
....
....
Desired Output
Name,Age,Place
aa,22,xx
bb,33,yy
cc,44,zz
I have tried the command below, but it is incomplete:
awk '
{
for(c = 1; c <= NR; c++) { a[c]=$c }
}
END {
for(r = 1; r <= NR; r++) {
for(t = 1; t <= 3; t++) {
printf("%s ", a[c])
}
print ","
}
}' Input.txt
Looking for your suggestions...

There are many good tools for this.
awk
$ awk 'ORS=NR%3?",":RS' file
Name,Age,Place
aa,22,xx
bb,33,yy
cc,44,zz
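It sets the output record separator (ORS) to , whenever the line number is not a multiple of 3, and to RS (a newline) otherwise. The assigned value is always a non-empty string, so the condition is true and awk prints every line. This way, it joins every group of 3 lines.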
More info in Idiomatic awk.
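The one-liner above relies on awk's default print action and hard-codes the group size. A more explicit sketch of the same idea might look like this; the function name join_every and the variable n are illustrative and not part of the answer. It writes the separator before each line instead of after it, so a final incomplete group does not end in a dangling comma.

```shell
# join_every N: join each run of N input lines into one comma-separated row.
# (join_every and n are illustrative names, not part of the original answer.)
join_every() {
    awk -v n="$1" '
        {
            # First line of a group: start a new row (nothing before row 1).
            # Any other line: separate it from the previous one with a comma.
            sep = ((NR - 1) % n == 0) ? (NR > 1 ? "\n" : "") : ","
            printf "%s%s", sep, $0
        }
        END { if (NR) print "" }   # terminate the last row
    '
}

printf '%s\n' Name Age Place aa 22 xx bb | join_every 3
```

With the seven sample values, the last row contains only bb.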
xargs
$ xargs -n3 <file
Name Age Place
aa 22 xx
bb 33 yy
cc 44 zz
This gets the input in blocks of X items, defined by -n X. Then you can replace spaces with commas with tr or sed.
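The xargs and tr steps can be combined in one pipeline. This is only a sketch: the function name rows_of_three is made up, and it assumes that no item contains spaces, quotes or backslashes, because xargs interprets those itself.

```shell
# rows_of_three: group whitespace-separated input into rows of 3, comma-joined.
# Assumes no item contains spaces, quotes or backslashes, since xargs
# interprets those itself. (rows_of_three is an illustrative name.)
rows_of_three() {
    xargs -n3 | tr ' ' ','
}

printf '%s\n' Name Age Place aa 22 xx | rows_of_three
```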
paste
$ paste -d"," - - - <file
Name,Age,Place
aa,22,xx
bb,33,yy
cc,44,zz
This joins every 3 input lines, using , as the delimiter.
Regarding transpose itself, I wrote a snippet a while ago in Using bash to sort data horizontally:
transpose () {
awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
END {for (i=1; i<=max; i++)
{for (j=1; j<=NR; j++)
printf "%s%s", a[i,j], (j<NR?OFS:ORS)
}
}'
}
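For reference, here is how the transpose function behaves on a small matrix. The function is repeated so the example is self-contained, and the three-row input is made up for illustration. Note that it performs a true row/column transpose, which is a different operation from reshaping a single column into rows of three.

```shell
# Same function as above, repeated so this example runs on its own.
transpose () {
    awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
         END {for (i=1; i<=max; i++)
                  for (j=1; j<=NR; j++)
                      printf "%s%s", a[i,j], (j<NR?OFS:ORS)
         }'
}

# Made-up input: one attribute per row. Output has one attribute per column.
printf 'Name aa bb\nAge 22 33\nPlace xx yy\n' | transpose
```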

Related

Counting unique occurrences in each column

I have a file with several columns like $2$3 (until $32) as in
A refdevhet devdevhomo
B refdevhet refdevhet
C refrefhomo refdevhet
D devrefhet refdevhet
I need to count the occurrences of each unique element in each column separately,
so that I have
refdevhet 2 3
refrefhomo 1 0
devrefhet 1 0
devdevhomo 0 1
I tried several variations of
awk 'BEGIN {
FS=OFS="\t"
}
{
for(i=1; i<=32; i++) a[$i]++
}
END {
for (i in a) print i, a[i]
}' file
but instead it's printing the cumulative sum of occurrences of unique elements across the selected fields.
Here is a solution:
BEGIN {
FS=OFS="\t"
}
{
if (NF>mxf) mxf = NF;
for(i=1; i<=NF; i++) {ws[$i]=1; c[$i,i]++}
}
END {
for (w in ws) {
printf "%s", w
for (i=1;i<=mxf;i++) printf "%s%d", OFS, c[w,i];
print ""
}
}
Note that this solution is general: it also takes the first column into consideration. To omit the first column, change i=1 to i=2 in both places.
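A runnable sketch of that variant, with the first column skipped, could look like the following. The wrapper name count_by_column is an assumption, and the output is piped through sort only because the order of a for (w in ws) loop is unspecified in awk.

```shell
# count_by_column: per-column counts of each distinct value, skipping column 1.
# (count_by_column is an illustrative wrapper name.)
count_by_column() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        if (NF > mxf) mxf = NF
        for (i = 2; i <= NF; i++) { ws[$i] = 1; c[$i, i]++ }
    }
    END {
        for (w in ws) {
            printf "%s", w
            # %d prints 0 for value/column pairs that were never seen
            for (i = 2; i <= mxf; i++) printf "%s%d", OFS, c[w, i]
            print ""
        }
    }' "$@" | LC_ALL=C sort
}

printf 'A\trefdevhet\tdevdevhomo\nB\trefdevhet\trefdevhet\n' | count_by_column
```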
In addition to @Andriy's good answer, with GNU awk you can use a 2-dimensional array:
gawk '
{for (i=2; i<=NF; i++) count[$i][i]++}
END {
for (word in count) {
printf "%s", word
for (i=2; i<=NF; i++) printf "%s%d", OFS, count[word][i]
print ""
}
}
' file | column -t
I'm assuming here that each line has the same number of fields as the last line.

Sort rows in csv file without header & first column

I've a CSV file containing records like below.
id,h1,h2,h3,h4,h5,h6,h7
101,zebra,1,papa,4,dog,3,apple
102,2,yahoo,5,kangaroo,7,ape
I want to sort the fields within each row of this file, leaving the header and the first column untouched. My output should look like this.
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
I tried the awk below but don't know how to exclude the header and the first column.
awk -F"," ' {
s=""
for(i=1; i<=NF; i++) { a[i]=$i; }
for(i=1; i<=NF; i++)
{
for(j = i+1; j<=NF; j++)
{
if (a[i] >= a[j])
{
temp = a[j];
a[j] = a[i];
a[i] = temp;
}
}
}
for(i=1; i<=NF; i++){ s = s","a[i]; }
print s
}
' file
Thanks
If perl is okay:
$ perl -F, -lane 'print join ",", $.==1 ? @F : ($F[0], sort @F[1..$#F])' ip.txt
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
-F, to indicate , as input field separator, results saved in @F array
See https://perldoc.perl.org/perlrun#Command-Switches for details on other options
join "," to use , as output field separator
$.==1 ? @F for first line, print as is
($F[0], sort @F[1..$#F]) for other lines, get first field and sorted output of other fields
.. is range operator, $#F will give index of last field
you can also use (shift @F, sort @F) instead of ($F[0], sort @F[1..$#F])
For given header, sorting first line would work too, so this can simplify logic required
$ # can also use: perl -F, -lane 'print join ",", shift @F, sort @F'
$ perl -F, -lane 'print join ",", $F[0], sort @F[1..$#F]' ip.txt
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
$ # can also use: ruby -F, -lane 'print [$F.shift, $F.sort] * ","'
$ ruby -F, -lane 'print [$F[0], $F.drop(1).sort] * ","' ip.txt
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
If you have gawk, use asort:
awk -v OFS="," 'NR>1{split($0, a, ",");
$1=a[1];
delete a[1];
n = asort(a, b);
for (i = 1; i <= n; i++){ $(i+1)=b[i]}};
1' file.csv
This splits every row except the first into array a, using , as the separator.
It then assigns the first element of a to the first field of the row and deletes that element from a.
Finally, a is sorted into b, the sorted values are assigned to the fields from the second column onwards, and the row is printed.
You can just use the asort() function in awk for your requirement and start sorting from the second line onwards. The solution is GNU awk specific because of the asort() and length(array) functions:
awk 'NR==1{ print; next }
NR>1 { finalStr=""
arrayLength=""
delete b
split( $0, a, "," )
for( i = 2; i <= length(a); i++ )
b[arrayLength++] = a[i]
asort( b )
for( i = 1; i <= arrayLength ; i++ )
finalStr = (finalStr)?(finalStr","b[i]):(b[i])
printf( "%s", a[1]","finalStr )
printf( "\n" );
}' file
The idea is first we split the entire line with a , delimiter into the array a from which we get the elements from the 2nd field onwards to a new array b. We sort those elements in this new array and append the first column element when we print it finally.
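If gawk is not available, the same result can be sketched in plain awk with a small insertion sort over fields 2..NF. This is an alternative to the asort() answers above, not a restatement of them, and the function name sort_row_fields is illustrative. It relies on awk's usual comparison rules: two numeric-looking fields compare as numbers, anything else compares as strings.

```shell
# sort_row_fields: sort fields 2..NF of every data row; header and id stay put.
# (sort_row_fields is an illustrative name; plain awk, no GNU extensions.)
sort_row_fields() {
    LC_ALL=C awk -F, -v OFS=, '
    NR == 1 { print; next }            # header passes through untouched
    {
        # insertion sort over fields 2..NF; field 1 (the id) is never moved
        for (i = 3; i <= NF; i++) {
            v = $i
            for (j = i - 1; j >= 2 && $j > v; j--) $(j + 1) = $j
            $(j + 1) = v
        }
        print
    }' "$@"
}

printf 'id,h1,h2,h3\n101,zebra,1,papa\n' | sort_row_fields
```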

extract info from a tag using awk

I have a multi-column file and I want to extract some info from column 71.
I want to extract by tag, where the value can be anything; for example, I want to extract just AC=* and AF=*, whatever their values are.
I found a similar question and gave it a try, but it didn't work:
Extract columns with values matching a specific pattern
Column 71 looks like this:
AC=14511;AC_AFR=382;AC_AMR=1177;AC_Adj=14343;AC_EAS=5;AC_FIN=427;AC_Het=11813;AC_Hom=1265;AC_NFE=11027;AC_OTH=97;AC_SAS=1228;AF=0.137;AN=106198;AN_AFR=8190;AN_AMR=10424;AN_Adj=99264;AN_EAS=7068;AN_FIN=6414;AN_NFE=51090;AN_OTH=658;AN_SAS=15420;BaseQRankSum=1.73;ClippingRankSum=-1.460e-01;DB;DP=1268322;FS=0.000;GQ_MEAN=190.24;GQ_STDDEV=319.67;Het_AFR=358;Het_AMR=1049;Het_EAS=5;Het_FIN=399;Het_NFE=8799;Het_OTH=83;Het_SAS=1120;Hom_AFR=12;Hom_AMR=64;Hom_EAS=0;Hom_FIN=14;Hom_NFE=1114;Hom_OTH=7;Hom_SAS=54;InbreedingCoeff=0.0478;MQ=60.00;MQ0=0;MQRankSum=0.037;NCC=270;POSITIVE_TRAIN_SITE;QD=21.41;ReadPosRankSum=0.212;VQSLOD=4.79;culprit=MQ;DP_HIST=30|3209|1539|1494|30007|7938|4130|2038|1310|612|334|185|97|60|31|25|9|11|7|33,0|66|339|1048|2096|2665|2626|1832|1210|584|323|179|89|54|31|22|7|9|4|15;GQ_HIST=84|66|56|82|3299|568|617|403|250|319|436|310|28566|2937|827|834|451|186|217|12591,15|15|13|16|25|11|22|28|18|38|52|31|65|76|39|83|93|65|97|12397;CSQ=T|ENSG00000186868|ENST00000334239|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11502.1|ENSP00000334886|TAU_HUMAN|B4DSE3_HUMAN|UPI0000000C16||||2/8||ENST00000334239.8:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000570299|Transcript|intron_variant&non_coding_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|processed_transcript||||||||||2/6||ENST00000570299.1:n.262-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000340799|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000340438|TAU_HUMAN||UPI000004EEE6||||3/10||ENST00000340799.5:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000262410|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000262410|TAU_HUMAN||UPI0000EE80B7||||4/13||ENST00000262410.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000446361|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protei
n_coding||CCDS11500.1|ENSP00000408975|TAU_HUMAN||UPI000004EEE5||||2/9||ENST00000446361.3:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000574436|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000460965|TAU_HUMAN||UPI000002D754||||3/10||ENST00000574436.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571987|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000458742|TAU_HUMAN||UPI0000EE80B7||||3/12||ENST00000571987.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000415613|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45715.1|ENSP00000410838|TAU_HUMAN||UPI0001AE66E9||||3/13||ENST00000415613.2:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571311|Transcript|intron_variant&NMD_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|nonsense_mediated_decay|||ENSP00000460048||I3L2Z2_HUMAN|UPI00025A2E6E||||4/4||ENST00000571311.1:c.*176-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000535772|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000443028|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||4/10||ENST00000535772.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000576518|Transcript|stop_gained|5499|7|3|K/*|Aag/Tag|rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000458621||I3L170_HUMAN&B4DSE3_HUMAN|UPI0001639A7C|||1/7|||ENST00000576518.1:c.7A>T|ENSP00000458621.1:p.Lys3Ter|T:0.1171|||||||||15792962|||||POSITION:0.00682261208576998&ANN_ORF:-255.6993&MAX_ORF:-255.6993|PHYLOCSF_WEAK|ANC_ALLELE|LC,T|ENSG00000186868|ENST00000420682|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000413056|TAU_HUMAN||UPI000004EEE6||||2/9||ENST00000420682.2:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000572440|Transcri
pt|non_coding_transcript_exon_variant&non_coding_transcript_variant|2790|||||rs754512|1||1|MAPT|HGNC|6893|retained_intron|||||||||1/1|||ENST00000572440.1:n.2790A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000351559|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000303214|TAU_HUMAN||UPI000002D754||||4/11||ENST00000351559.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000344290|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|YES|CCDS45715.1|ENSP00000340820|TAU_HUMAN||UPI0001AE66E9||||4/14||ENST00000344290.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000347967|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000302706|TAU_HUMAN|B4DSE3_HUMAN|UPI0000173D91||||4/10||ENST00000347967.5:c.32-100A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000431008|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000389250|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||3/9||ENST00000431008.3:c.308-94A>T||T:0.1171|||||||||15792962||||||||
The code that i tried:
awk '{
for (i = 1; i <= NF; i++) {
if ($i ~ /AC|AF/) {
printf "%s %s ", $i, $(i + 1)
}
}
print ""
}'
I keep getting a syntax error.
Desired output:
AC=14511;AF=0.137
Whenever you have name=value pairs, it's usually simplest to first create an array that maps names to values (n2v[] below) and then you can just access the values by their names.
$ cat file
AC=1;AC_AFR=2;AF=3 AC=4;AC_AFR=5;AF=6
$ cat tst.awk
{
delete n2v
split($2,tmp,/[;=]/)
for (i=1; i in tmp; i+=2) {
n2v[tmp[i]] = tmp[i+1]
}
prt("AC")
prt("AF")
}
function prt(name) { print name, "=", n2v[name] }
$ awk -f tst.awk file
AC = 4
AF = 6
Just change $2 to $71 for your real input.
Something like this should do it:
$ awk '{split($71,a,";");for(i in a )if(a[i]~/^AF/) print a[i]}' foo
AF=0.137
You split the field $71 on ; and loop through the resulting array looking for the desired match. For multiple matches, use switch (GNU awk only):
$ awk '{
b="";                   # reset the buffer for every record
n=split($71,a,";");
for(i=1;i<=n;i++)       # indexed loop keeps the original tag order
switch(a[i]) {
case /^AF=/:
b=b a[i] OFS;
break;
case /^AC=/:
b=b a[i] OFS;
break
}
sub(/.$/,"\n",b);       # replace the trailing OFS with a newline
printf "%s", b          # never use data as the printf format
}' foo
AC=14511 AF=0.137
EDIT: Now it buffers output to a variable and prints it in the end. You can control the separator with OFS.
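To get exactly the AC=...;AF=... form asked for in the question, the name-to-value idea can be wrapped in a small helper. The following is a sketch in plain awk; the function name extract_tags and its two arguments (field number, comma-separated tag list) are assumptions made for illustration. Splitting each pair at the first = keeps values that themselves contain = intact.

```shell
# extract_tags FIELD TAGS: print the requested name=value pairs from one
# ;-separated field, in the order the tags were requested.
# (extract_tags and its arguments are illustrative, not from the answers.)
extract_tags() {
    awk -v col="$1" -v tags="$2" '
    BEGIN { nt = split(tags, want, ",") }
    {
        split("", val)                       # portable way to clear an array
        n = split($col, kv, ";")
        for (i = 1; i <= n; i++) {
            eq = index(kv[i], "=")           # flags without "=" (e.g. DB) are skipped
            if (eq) val[substr(kv[i], 1, eq - 1)] = substr(kv[i], eq + 1)
        }
        out = ""
        for (t = 1; t <= nt; t++)
            if (want[t] in val)
                out = (out == "" ? "" : out ";") want[t] "=" val[want[t]]
        print out
    }'
}

printf 'chr17 AC=14511;AC_AFR=382;AF=0.137;DB\n' | extract_tags 2 AC,AF
```

For the real input the first argument would be 71 instead of 2.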

array over non-existing indices in awk

Sorry for the verbose question, it boils down to a very simple problem.
Assume there are n text files each containing one column of strings (denominating groups) and one of integers (denominating the values of instances within these groups):
# filename xxyz.log
a 5
a 6
b 10
b 15
c 101
c 100
#filename xyzz.log
a 3
a 5
c 116
c 128
Note that while the length of both columns within any given file is always identical it differs between files. Furthermore, not all files contain the same range of groups (the first one contains groups a, b, c, while the second one only contains groups a and c). In awk one could calculate the average of column 2 for each string in column 1 within each file separately and output the results with the following code:
NAMES=$(ls|grep .log|awk -F'.' '{print $1}');
for q in $NAMES;
do
gawk -F' ' -v y=$q 'BEGIN {print "param", y}
{sum1[$1] += $2; N[$1]++}
END {for (key in sum1) {
avg1 = sum1[key] / N[key];
printf "%s %f\n", key, avg1;
} }' $q.log | sort > $q.mean;
done;
However, for the reasons mentioned above, the length of the resulting .mean files differs between files. For each .log file I'd like to output a .mean file listing the entire range of groups (a-d) in the first column and the corresponding mean value or empty spaces in the second column, depending on whether this category is present in the .log file. I've tried the following code (given without $NAMES for brevity):
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
{sum[$1] += $2; N[$1]++}
END {for (i in arr) {
if (i in sum) {
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;}
else {
printf "%s %s\n" i, "";}
}}' xxyz.log > xxyz.mean;
but it returns the following error:
awk: (FILENAME=myfile FNR=7) fatal: not enough arguments to satisfy format string
`%s %s
'
^ ran out for this one
Any suggestions would be highly appreciated.
Will you ever have explicit zeroes or negative numbers in the log files? I'm going to assume not.
The first line of your second script doesn't do what you wanted:
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
This assigns "a" to arr[""] (because a is a variable not previously used, so as a subscript it is the empty string), then "b" to the same element (because b is also a variable not previously used), then "c", then "d". Clearly, not what you had in mind. This (untested) code should do the job you need as long as you know that there are just the four groups. If you don't know the groups a priori, you need a more complex program (it can be done, but it is harder).
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0) N[i] = 1 # Divide by zero protection
avg = sum[i] / N[i];
printf "%s %f\n", i, avg;
}
}' xxyz.log > xxyz.mean;
This will print a zero average for the missing groups. If you prefer, you can do:
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0)
printf("%s\n", i);
else {
avg = sum[i] / N[i];
printf "%s %f\n", i, avg;
}
}
}' xxyz.log > xxyz.mean;
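The subscript problem in the question's BEGIN block is easy to reproduce in isolation. The short demonstration below (the function name unset_subscript_demo is made up) shows that unquoted, unset variables all land on the same element, while a quoted subscript creates a separate key.

```shell
# unset_subscript_demo: show that arr[a] and arr[b] hit the same element
# when a and b are unset variables. (Illustrative name.)
unset_subscript_demo() {
    awk 'BEGIN {
        arr[a] = "a"; arr[b] = "b"      # a and b are unset, so both mean arr[""]
        arr["c"] = "c"                  # a quoted subscript is a real key
        n = 0
        for (k in arr) n++
        print n, arr[""], arr["c"]
    }'
}

unset_subscript_demo    # prints: 2 b c
```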
For each .log file I'd like to output a .mean file listing the entire
range of groups (a-d) in the first column and the corresponding mean
value or empty spaces in the second column depending on whether this
category is present in the .log file.
Not purely an awk solution, but you can get all the groups with this.
awk '{print $1}' *.log | sort -u > groups
After you calculate the means, you can then join the groups file. Let's say the means for your second input file look like this temporary, intermediate file. (I called it xyzz.tmp.)
a 4
c 122
Join the groups, preserving all the values from the groups file.
$ join -a1 groups xyzz.tmp > xyzz.mean
$ cat xyzz.mean
a 4
b
c 122
Here's my take on the problem. Run like:
./script.sh
Contents of script.sh:
array=($(awk '!a[$1]++ { print $1 }' *.log))
readarray -t sorted < <(for i in "${array[@]}"; do echo "$i"; done | sort)
for i in *.log; do
for j in "${sorted[@]}"; do
awk -v var=$j '
{
sum[$1]+=$2
cnt[$1]++
}
END {
print var, (var in cnt ? sum[var]/cnt[var] : "")
}
' "$i" >> "${i/.log/.main}"
done
done
Results of grep . *.main:
xxyz.main:a 5.5
xxyz.main:b 12.5
xxyz.main:c 100.5
xyzz.main:a 4
xyzz.main:b
xyzz.main:c 122
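The script above starts one awk process per group per file. As a rough alternative, a single awk pass over all the files can collect the full set of groups and then print every file/group combination, leaving the mean blank where a group is missing. This is a sketch under assumptions: the function name group_means is made up, it prints "file group mean" lines to standard output instead of writing .mean files, and it skips lines with fewer than two fields.

```shell
# group_means FILE...: one line per (file, group); the mean is omitted when
# the group does not occur in that file. (group_means is an illustrative name.)
group_means() {
    awk '
    NF >= 2 {
        sum[FILENAME, $1] += $2
        cnt[FILENAME, $1]++
        groups[$1]                      # a bare reference records the key
        files[FILENAME]
    }
    END {
        for (f in files)
            for (g in groups)
                if ((f, g) in cnt) print f, g, sum[f, g] / cnt[f, g]
                else               print f, g
    }' "$@" | LC_ALL=C sort             # for-in order is unspecified, so sort
}
```

Called as group_means *.log, it lists every group once per file.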
Here is a pure awk answer:
find . -maxdepth 1 -name "*.log" -print0 |
xargs -0 awk '{SUBSEP=" ";sum[FILENAME,$1]+=$2;cnt[FILENAME,$1]+=1;next}
END{for(i in sum)print i, sum[i], cnt[i], sum[i]/cnt[i]}'
Easy enough to push this into a file --

missing field and extra space after using for loop in awk

I need to use an awk script to extract some information from a file.
I have a title line which has 11 field and I split it to an array called titleList.
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
After finding the right line, I need to print its fields preceded by their titles. For example, if the result is:
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
I must print it in this way:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18
Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
I use a for loop to manage it:
for (i=0 ;i<=NF ;i++)
{
printf "%s %s %s %s",titleList[i],":",$i," "
}
Everything looks good except the result, which has two problems:
first, there is extra space around each colon, and second, the last field of the searched line is missing:
Student Number : 92839342 Name : Robert Bloomingdale Lab1 : 9 Lab2 : 26
Lab3:18 Lab4 : 22 Lab5 : 9 Lab6 : 12 Exam1 : 25 Exam2 : 39 Final
What should I do?
Is there any problem with the \n at the end of the search result?
You can correct the amount of extra whitespace between fields by correcting the printf statement:
awk -F ":" 'NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
Contents of file.txt:
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
Results:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18 Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
EDIT:
Also, you're missing the last value because the file you're working with probably has Windows line endings. To fix this, run dos2unix file.txt before running your awk code. Alternatively, you can set awk's record separator so that it understands these line endings:
awk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
EDIT:
The above requires GNU awk because of the multi-character RS. Also, split() splits on FS by default, so there is no need to pass it as an argument; it's common to use "next" rather than specifying opposite conditions; and it's common to use print "" instead of printf "\n", so that you use the ORS setting rather than hard-coding its value in output statements. So, the above should be tweaked to:
gawk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array); next } { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; print "" }' file.txt
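If GNU awk is not available, a possible portable variant is to strip a trailing carriage return from each record instead of changing RS. The sketch below uses a made-up function name, label_fields. It also starts the loop at field 1 (in awk, $0 is the whole record, not a field) and writes the space between pairs rather than after each one, so the line has no trailing space.

```shell
# label_fields: print each data row as "title:value" pairs, tolerating CRLF.
# (label_fields is an illustrative name; plain awk, no GNU extensions.)
label_fields() {
    awk -F: '
    { sub(/\r$/, "") }                      # drop a trailing CR; no-op on Unix files
    NR == 1 { split($0, title); next }      # header row supplies the titles
    {
        for (i = 1; i <= NF; i++)
            printf "%s%s:%s", (i > 1 ? " " : ""), title[i], $i
        print ""
    }' "$@"
}

printf 'Student Number:Name:Lab1\r\n92839342:Robert Bloomingdale:9\r\n' | label_fields
```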