awk sum column based on matching header pattern

I have a file with ~10,000 columns and ~20,000 rows in the following format
Id A_1 A_2 A_3 B_2 B_5 C_1
T1 0 1 1 6 1 0
T2 1 1 1 0 0 1
T3 2 0 3 1 1 5
T4 1 1 1 2 3 1
In the header row, the 1st column is the id. From the 2nd column onward, the headers are sample names in the format sampleName_batch#. Now, I want to add up all the values for each id based on the sampleName, and have the sampleName and the summed value in the output. My expected output is
Id A B C
T1 2 7 0
T2 3 0 1
T3 5 2 5
T4 3 5 1
I have come across this answer https://unix.stackexchange.com/questions/569615/combine-columns-in-one-file-by-matching-header but I don't know how to modify the whole header row.
Thanks

I am trying to edit the solution mentioned in the OP's post from the cross site; it is a small tweak and all other lines are kept exactly as they are. I am nowhere near "THE Ed Morton" in awk knowledge, so, humbly taking his permission (I hope he is ok with it), I am editing his great solution from the cross site. Could you please try the following.
awk '
NR==1 {
    # map each input column to an output column named by its sample prefix
    for (inFldNr=2; inFldNr<=NF; inFldNr++) {
        sub(/_.*/,"",$inFldNr)
        fldName = $inFldNr
        if ( !(fldName in fldName2outFldNr) ) {
            outFldNr2name[++numOutFlds] = fldName
            fldName2outFldNr[fldName] = numOutFlds
        }
        outFldNr = fldName2outFldNr[fldName]
        out2inFldNrs[outFldNr,++numInFlds[outFldNr]] = inFldNr
    }
    # print the new, de-duplicated header
    printf "%s%s", $1, OFS
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        outFldName = outFldNr2name[outFldNr]
        printf "%s%s", outFldName, (outFldNr<numOutFlds ? OFS : ORS)
    }
    next
}
{
    # for each data row, sum the input columns feeding each output column
    printf "%s%s", $1, OFS
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        sum = 0
        for (inFldIdx=1; inFldIdx<=numInFlds[outFldNr]; inFldIdx++) {
            inFldNr = out2inFldNrs[outFldNr,inFldIdx]
            sum += $inFldNr
        }
        printf "%s%s", sum, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
' Input_file
What is added to Ed's already existing code:
A sub() call to substitute everything from _ onward in each field of the first line (the header).
Removed \t as the delimiter since the OP's samples are NOT tab-delimited.
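For reference, running the above on the shown sample reproduces the OP's expected output:
Id A B C
T1 2 7 0
T2 3 0 1
T3 5 2 5
T4 3 5 1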

Related

Return ranges of one column based on group number of a second column

I have a file with two columns separated by a single tab or a single space (either way is ok). The first column is sorted in ascending order. The second column can take three different values (0, 1 or 2). So, taking the example below:
col1 col2
15295557 2
15295594 2
15295834 2
15295937 1
15295959 1
15302817 1
15303844 0
15303848 0
15303851 0
15303860 0
15304062 0
15313455 2
15314748 2
15320909 2
15320945 2
What I would like is to group the first column in ranges based on the number in the second column. The desired output would be something like this:
col1 col2 col3
15295557 15295834 2
15295937 15302817 1
15303844 15304062 0
15313455 15320945 2
I believe awk or sed could do the trick easily, but my skills are really limited...
Any help would be much appreciated!
Thanks!
You may try this awk:
awk 'BEGIN{FS=OFS="\t"} p2 != $2 {if (NR>1) print start, p1, p2; start = $1} {p1 = $1; p2 = $2} END{print start, p1, p2}' file
15295557 15295834 2
15295937 15302817 1
15303844 15304062 0
15313455 15320945 2
An expanded form:
awk '
BEGIN {FS=OFS="\t"}
p2 != $2 {
    if (NR > 1)
        print start, p1, p2
    start = $1
}
{
    p1 = $1
    p2 = $2
}
END {
    print start, p1, p2
}' file
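Since the question allows either a single tab or a single space as the separator, a variant for the space-separated (or mixed) case simply drops the BEGIN block and relies on awk's default whitespace splitting; note the output fields will then be space-separated:
awk 'p2 != $2 {if (NR>1) print start, p1, p2; start = $1} {p1 = $1; p2 = $2} END{print start, p1, p2}' file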

Create new columns in specific positions by row-wise summing other columns

I would like to transform this file by adding two columns, each the result of summing a value to another column. I would like these new columns to be located next to the corresponding summed column.
A B C
2000 33000 2
2000 153000 1
2000 178000 1
2000 225000 1
2000 252000 1
I would like to get the following data
A A1 B B1 C
2000 2999 33000 33999 2
2000 2999 153000 153999 1
2000 2999 178000 178999 1
2000 2999 225000 225999 1
2000 2999 252000 252999 1
I have found how to add a value to a column: awk '{{$2 += 999}; print $0}' myFile, but this transforms the second column instead of creating a new one. In addition, I am not sure how to insert the new columns at the desired positions.
Thanks!
awk '{
    # shift columns right to make room, note - from the back!
    # (assigning $(NF+1) on the first iteration grows the record by one field)
    for (i = NF; i >= 2; --i) {
        $(i + 1) = $i
    }
    # fill the gap: the new second column is the first column plus 999
    $2 = $1 + 999
    # print it
    print
}
' myfile
And similarly for the 4th column: shift from column 4 onward and set $4 = $3 + 999 (after the first insertion, the original B sits in column 3).
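For what it's worth, a minimal sketch doing both insertions in a single pass over the data rows (header handling omitted here; the next answer shows one way to handle it):
awk 'NR > 1 {
    # make room: move columns 3..NF two places to the right
    for (i = NF; i >= 3; --i) $(i + 2) = $i
    $4 = $2 + 999   # B1 goes right after B
    $3 = $2         # the original B moves from column 2 to column 3
    $2 = $1 + 999   # A1 goes right after A
    print
}' myfile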
Sample-specific answer: could you please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR==1{
    $1=$1 OFS "A1"
    $2=$2 OFS "B1"
    print
    next
}
{
    $1=$1 OFS $1+999
    $2=$2 OFS $2+999
}
1
' Input_file | column -t
Generic solution: adding a more generic solution where we need NOT write field logic for each field; just give the field numbers (comma-separated) in the variable fieldsChange, and even their headings will be taken care of. The variable valAdd holds the value to add when creating the new columns.
awk -v valAdd="999" -v fieldsChange="1,2" '
BEGIN{
num=split(fieldsChange,arr,",")
for(i=1;i<=num;i++){ value[arr[i]] }
}
FNR==1{
for(i=1;i<=NF;i++) {
if(i in value) { $i=$i OFS $i"1" }
}
}
FNR>1{
for(i=1;i<=NF;i++) {
if(i in value) { $i=$i OFS $i+valAdd }
}
}
1
' Input_file | column -t
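With the shown sample, both the sample-specific and the generic commands should print something like this (column -t aligns the fields):
A     A1    B       B1      C
2000  2999  33000   33999   2
2000  2999  153000  153999  1
2000  2999  178000  178999  1
2000  2999  225000  225999  1
2000  2999  252000  252999  1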

How do I compare alphanumeric characters in non-sequential order?

Currently I am using an awk script which compares numbers in non-sequential order and prints the difference. It works pretty well for numbers, but if I have alphanumeric characters it doesn't seem to work well.
In its current state, apart from simply comparing the numbers, it does 2 things additionally:
It accounts for the zeros before a number or character and compares the absolute values only, ignoring zeros before a number or character
If the same number or character occurs multiple times in both files, it outputs the additional occurrence
I just want the script to work well for alphanumeric characters as well, as currently it only seems to work with plain numbers. Can someone please edit the script to produce the desired output while also considering the above 2 conditions?
Current script
awk '{k=$0+0}
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}' file1 file2
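(For reference: awk's string-to-number coercion keeps only the leading numeric prefix, so $0+0 collapses distinct keys, e.g.
$ echo '2B' | awk '{print $0+0}'
2
$ echo '13C' | awk '{print $0+0}'
13
which is why "2B", "02B" and plain "2" all collide, and why the script mishandles alphanumerics.)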
Example below
cat file1
1
01
001
8
2B
12
13C
027B
0027B
cat file2
1
2
08
12
13C
02B
9
27B
Expected output/result
1
1
2
9
27B
Explanation of expected output
In file1 : "1" , "01" , "001" evaluates to 1 * 3 times
In file 2 : "1" is present only once
Hence "1" is present twice in result ( 3-1 times )
"2" and "9" are exclusively present in file2 , So obviously both simply form part of output
In file1 : '027B" , "0027B" evaluates to 27B * 2 times
In file 2 - "27B" is present only once
Hence '27B" is present once in result ( 2 -1 times )
Explanation of matched items ( ones not forming part of expected output )
"8" from file1 ( line 4 )is matched with "08" from file2 ( line 3)
"12" from file1 ( line 6) is matched with "12" from file2 ( line 4)
"13C" from file1 (line 7 ) is matched with "13C" from file2 ( line 5 )
"2B" from file1 ( line 5 ) is matched with "02B" from file2 ( line 6 )
Lastly the order of items in expected output should be in ascending order like shown in my above example, lets say if the eg above had 3 in expected output it should read vertically as 1 1 2 3 9 27B
It should be enough to remove leading zeros when forming the key (with a special case for zero values like 0000):
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}
$ awk -f a.awk file1 file2
2
9
27B
1
1
RE-EDIT
If you just want the values sorted numerically, pipe into sort:
$ awk -f a.awk file1 file2 | sort -n
1
1
2
3
4
5
9
27B
To output in the order as found in file2, you can remember the order in another array and then do all the printing in the END block. This version will output the values in the order of file2, with any values only in file1 printed last.
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
{ b[FNR] = k }
!(k in a && a[k]--) { a[k] = 1 }
END {
    for (i=1; i<=FNR; ++i) {
        k = b[i]
        while (a[k]-->0) print k
    }
    for (k in a) {
        while (a[k]-->0) print k
    }
}
$ awk -f a.awk file1 file2
1
1
2
9
27B
3
4
5
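Alternatively, if ascending order is what you want and GNU awk is available, the sorting can happen inside awk instead of piping to sort. A sketch, assuming gawk's PROCINFO["sorted_in"] array-traversal control (gawk-only, not POSIX awk); it defers all printing to the END block:
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR { a[k]++; next }
!(k in a && a[k]--) { a[k] = 1 }   # remember file2-only values for END instead of printing now
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # iterate keys in ascending numeric order
    for (k in a) while (a[k]-- > 0) print k
}
With the shown files this prints 1, 1, 2, 9, 27B, since gawk compares the indices numerically ("27B" sorts as 27).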

Average of a given number of rows

I want to get the average of a certain number of rows; in this case, that number is dictated by the second column:
-1 1 22.776109913596883 0.19607208141710716
-1 1 4.2985901827923954 1.0388892840309705
-1 1 4.642271812306717 0.96197712195674756
-1 2 2.8032298255711794 1.5930763994471333
-1 2 2.9358628368936479 1.5211062387604053
-1 2 4.9987168801017106 0.8933811184867273
1 4 2.6211673161014915 1.7037291934441456
1 4 4.483831056393683 0.99596956735821618
1 4 9.7189442154485732 0.4594901646050486
The expected output would be
-1 1 0.732313
-1 2 1.33585
1 4 1.05306
I have done
awk '{sum+=$4} (NR%3)==0 {print $1,$2,sum/3;sum=0;}' test
which works, but I would like to (somehow) generalize (NR%3)==0 so that awk realizes when the value of the second column has changed and therefore knows a new average needs to be calculated. For instance, the first three rows have 1 as the value in the second column, so once 1 changes to 2, that means it's a new average that needs to be calculated.
Does this make sense?
Try something like:
awk '{sum[$2] += $4; count[$2] += 1; }
END { for (k in sum) { print k " " sum[k]/count[k]; } }'
Not tested, but that is the idea...
With this method, the whole computation is printed at the end; that may not be what you want if the input is some infinite stream, but going by your example it should be fine.
If you want to keep the first column as well, you can do it with the same system, as sketched below.
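For example, a minimal sketch of that same system keyed on both leading columns, so the first column survives into the output (pipe through sort -n if a stable display order matters):
awk '{ key = $1 " " $2; sum[key] += $4; count[key]++ }
END { for (k in sum) print k, sum[k]/count[k] }' test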
you can also try this:
awk '{array[$1" "$2]+=$4; count[$1" "$2]++} END { for (i in array) {print i" " array[i]/count[i]}}' test | sort -n
Test:
$ awk '{array[$1" "$2]+=$4; count[$1" "$2]++} END { for (i in array) {print i" " array[i]/count[i]}}' test | sort -n
-1 1 0.732313
-1 2 1.33585
1 4 1.05306

awk to count, sum and get unique values: improve command

Based on the 2nd column, I would like to print the count of line items, the sum of the 3rd column, and the number of unique values of the first column. I have around 100 InputTest files and they are not sorted.
I am using the 3 commands below to achieve the desired output, and would like to know the simplest way to do it...
InputTest*.txt
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
ghi,zz,10,sss
Step#1:
cat InputTest*.txt | awk -F, '{key=$2;++a[key];b[key]=b[key]+$3} END {for(i in a) print i","a[i]","b[i]}'
Op#1
xx,4,40
yy,4,60
zz,1,10
Step#2
awk -F ',' '{print $1,$2}' InputTest*.txt | sort | uniq >Op_UniqTest2.txt
Op#2
abc xx
abc yy
def xx
def yy
ghi zz
Step#3
awk '{print $2}' Op_UniqTest2.txt | sort | uniq -c
Op#3
2 xx
2 yy
1 zz
Desired Output:
xx,4,40,2
yy,4,60,2
zz,1,10,1
Looking for suggestions!!!
BEGIN { FS = OFS = "," }
{ ++lines[$2]; if (!seen[$2,$1]++) ++diff[$2]; count[$2]+=$3 }
END { for(i in lines) print i, lines[i], count[i], diff[i] }
lines tracks the number of occurrences of each value in column 2
seen records unique combinations of the second and first columns, incrementing diff[$2] whenever a new combination is found. The ++ after seen[$2,$1] means the condition is only true the first time the combination is seen, as seen[$2,$1] is then incremented to 1 and !seen[$2,$1] becomes false.
count keeps a total of the third column
$ awk -f avn.awk file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Using awk:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Set the input and output field separators to , in the BEGIN block. We use the array keys to identify and count key occurrences; the sum array keeps the running sum for each key; count lets us keep track of the number of unique column-1 values for each column-2 value.
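Since the question mentions ~100 unsorted InputTest files: both scripts accumulate everything in arrays and only print in the END block, so they can be pointed at all the files in one go, with no cat, sort or uniq needed:
$ awk -f avn.awk InputTest*.txt
xx,4,40,2
yy,4,60,2
zz,1,10,1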