AWK - sum particular fields after match - awk

I have a txt file that is tens to hundreds of lines long, and I need to sum a particular field in each line (and output the result) if a preceding field matches.
Here is an example dataset:
Sample4;6a0f64d2;size=1;,Sample4;f1cb4733a;size=6;,Sample3;aa44410feb29210c1156;size=2;
Sample2;5b91bef2329bd87f4c7;size=2;,Sample1;909cd4e2940f328b3;size=2;
The structure is
<sample ID>;<random id>;size=<numeric>;, then the next entry. There could be hundreds of entries in a line (this is just a small example).
Basically, I want to sum the "size" numbers for each entry across a line (entries separated by ','), but only for those entries that match a particular sample identifier (e.g. Sample4).
So, if we want to match just the 'Sample4' entries, the script would produce this:
awk '{some-code for sample4}' example.txt
7
0
Because the entries with 'Sample4' add up to 7 in line 1, but line 2 has no matching Sample4 entries.
This could be done for each sample ID or, ideally, for all sample IDs provided in a list (perhaps a simple file, one sample ID per line), which would then output the sums for each row, with each sample ID having its own column. E.g. for the example file above, the script would produce:
Sample1 Sample2 Sample3 Sample4
0 0 2 7
2 2 0 0
Any ideas on how to get started?
Thanks!

Another awk:
awk -F';' '{for(i=1;i<NF-1;i+=3)          # walk the <id>;<rand>;size=<n> triplets
              {split($(i+2),e,"=");       # e[2] is the numeric size value
               sub(/,/,"",$i);            # strip the leading comma before the id
               header[$i];                # remember every id seen (for the header)
               a[$i,NR]+=e[2]}}           # per-line sum, keyed by (id, line number)
END {for(h in header) printf "%s", h OFS;
     print "";
     for(i=1;i<=NR;i++)
        {for(h in header) printf "%s", a[h,i]+0 OFS;
         print ""}}' file | column -t
Sample1 Sample2 Sample3 Sample4
0 0 2 7
2 2 0 0
P.S. The order of the columns is not guaranteed, since for (h in header) iterates in an unspecified order.
Explanation
To simplify parsing, I used ; as the delimiter and removed the , before the names. Using that structure, each line's per-name sums are accumulated in the multi-dimensional array a, while the header array separately keeps track of all names seen. Once the lines are consumed, the END block prints the header and, for each line, the value for each name (or 0 if missing). column -t pretty-prints the result.
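For the simpler single-ID case the question starts with (just the Sample4 sums, one number per line), a minimal sketch of the same parsing idea, with the wanted ID passed in via -v (the variable name id is arbitrary):

```shell
# Sum the "size" values for a single sample ID, one total per input line.
# The ID to match is passed in with -v (here Sample4, as in the question).
awk -F';' -v id=Sample4 '
{
    s = 0
    for (i = 1; i < NF-1; i += 3) {     # entries are <id>;<random>;size=<n>; triplets
        name = $i
        sub(/^,/, "", name)             # entries after the first carry a leading comma
        split($(i+2), e, "=")           # e[2] is the numeric size
        if (name == id) s += e[2]
    }
    print s                             # 0 when the line has no matching entries
}' example.txt
```

With the two example lines above this prints 7 and then 0.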

If I am understanding this correctly, you can do:
$ awk '{split($0,samp,/,/)               # entries are comma-separated
for (i=1; i in samp; i++){
    sub(/;$/, "", samp[i])               # drop the trailing ;
    split(samp[i], fields, /;/)          # fields[1]=id, fields[3]="size=<n>"
    split(fields[3], ns, /=/)            # ns[2] is the numeric size
    data[fields[1]]+=ns[2]
}
printf "For line %s:\n", NR
for (e in data)
    print e, data[e]
split("", data)                          # clear the sums before the next line
}' file
Prints:
For line 1:
Sample3 2
Sample4 7
For line 2:
Sample1 2
Sample2 2


AWK select rows where all columns are equal

I have a file with tab-separated values where the number of columns is not known a priori. In other words, the number of columns is consistent within a file, but different files have different numbers of columns. The first column is a key; the other columns are arbitrary values.
I need to filter out the rows where the values are not all the same. For example, assuming that the number of columns is 4, I need to keep the first 2 rows and filter out the third:
1 A A A
2 B B B
3 C D C
I'm planning to use AWK for this purpose, but I don't know how to deal with the fact that the number of columns is unknown. The case of the known number of columns is simple, this is a solution for 4 columns:
$2 == $3 && $3 == $4 {print}
How can I generalize the solution for arbitrary number of columns?
If you can guarantee that no field contains regex-active characters, that the first field never matches the second, and that there is no blank line in the input:
awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}' file
Note that this solution is designed for this specific case and less extendable than others.
Another slight twist on the approach. In your case you know you want to compare fields 2-4, so you can simply loop from i=3 to NF checking $i != $(i-1); if any pair differs, skip to the next record with next, e.g.
awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1'
Example Use/Output
With your data in file.txt:
$ awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1' file.txt
1 A A A
2 B B B
You could try the following. It compares all columns from the 2nd to the last and checks whether every element is equal. If they are all the same, it prints the line.
awk '{for(i=3;i<=NF;i++){if($(i-1)==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
OR (hard-coding $2 in the code: since $2==$3 and $3==$4 implies $2==$3==$4, we intentionally compare each field against $2 rather than against its previous field):
awk '{for(i=3;i<=NF;i++){if($2==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
I'd use a counter t with an initial value of 2, adding 1 each time $i == $(i+1) as i iterates from 2 to NF-1. The line is printed only if t==NF is true:
awk -F'\t' '{t=2;for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF' file.txt
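As a quick check, the same counter idea can be run against the sample rows with default whitespace splitting (dropping -F'\t', since the example data shown here is space-separated):

```shell
# Default whitespace splitting; t starts at 2 and gains 1 for every
# adjacent pair $i == $(i+1), so t reaches NF only when all of $2..$NF match.
awk '{ t = 2; for (i = 2; i < NF; i++) t += ($i == $(i+1)) } t == NF' file.txt
```

which keeps the first two rows and drops `3 C D C`.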
Here is a generalisation of the problem:
Select all lines where a set of columns have the same value: c1 c2 c3 c4 ..., where ci can be any number:
Assume we want to select the columns: 2 3 4 11 15
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if ($(a[i])!=$(a[1])) next}1' file
A bit more robust, in case a line might not contain all fields:
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if (a[i] <= NF) if ($(a[i])!=$(a[1])) next}1' file

How to duplicate every word in the first line of a file

How can I duplicate every word in the header of a file?
I have a dataframe looking like this:
ID sample1 sample2 ...
123 1 0 1 2 ...
...
I want to duplicate every column header in the file such that after splitting the data at the space, each of them will have a header.
Desired output:
ID sample1 sample1 sample2 sample2 ...
123 1 0 1 2 ...
...
I tried to use sed:
sed -e '1s/*./& &/g' file.in
but it only appends the duplicated content at the end of the line.
Thanks
Another option with awk is to simply use string concatenation to duplicate each field from 2 on. For example using a 3-space separator (and your input file with the ellipses in place), you could do:
$ awk 'FNR == 1 { for (i = 2; i <= NF; i++) $i = " " $i " " $i }1' file
ID sample1 sample1 sample2 sample2 ... ...
123 1 0 1 2 ...
...
The essential part of the expression is simply setting $i = " " $i " " $i to duplicate the field.
Using sed with extended regular expressions, you could do:
sed -r '1 s/\s+\w+/& &/g' file
ID sample1 sample1 sample2 sample2 ...
123 1 0 1 2 ...
...
Limiting the substitution to line 1, you match one or more separator characters \s+ followed by one or more word characters \w+, and replace the match with itself twice: & &.
You can do the same thing a bit more crudely with basic regular expressions using:
sed '1 s/[ \t][ \t]*[^ \t][^ \t]*/& &/g' file
Here you match one or more spaces or tabs followed by one or more non-space, non-tab characters. (Same output, but it also duplicates the ellipses in the first line.)
Something like this:
awk 'NR==1 {printf "%s ",$1;for (i=2; i<=NF; i++) printf "%s %s ", $i,$i;print "";next}1' file
ID sample1 sample1 sample2 sample2 ... ...
123 1 0 1 2 ...
...
In line 1, it duplicates every word except the first.
Using TAB as separator
awk 'NR==1 {printf "%s\t",$1;for (i=2; i<=NF; i++) printf "%s\t%s\t", $i,$i;print "";next} {$1=$1} 1' OFS="\t" file
ID sample1 sample1 sample2 sample2 ... ...
123 1 0 1 2 ...
...
This might work for you (GNU sed):
sed -E 's/\s{2,}/\t/g;1h;1d;2{H;s/\t/& /g;G;s/^\S+([^\n]*\n)(\S+)/\2\1/;:a;s/\t \S+([^\n]*\n(\t\S+))/\2\t\1/;s/\t(\t[^\n]*\n)\t\S+/\1/;ta;s/\t\n\t\S+//};y/ /\t/' file
Replace all runs of 2 or more consecutive spaces by tabs. Copy the header to the hold space and delete it. Append the second line to the hold space and prepend a space after each tab in the second line. Append the first and second lines to the second line. The first line in the pattern space is used as a template for the headings. The first column is special (ID) and is copied non-iteratively. All other headings are replaced iteratively until there are no further headings. The last tab of the first line and the remainder of the second line (the last column of the headings) are removed. All subsequent spaces are replaced by tabs.
N.B. All columns will be tab delimited, if space delimited is preferred, replace the last command by y/\t/ /.
I assume you actually meant '1s/.*/& &/g' rather than '1s/*./& &/g'?
In that case, remember that .* is greedy, so it will match the whole line. You want to match each word on the line:
sed -e '1s/\w\+/& &/g'
Looking at the example, it seems that we don't want the first word (ID) to be doubled like the rest; only the words with preceding whitespace:
sed -e '1s/ \+\w\+/&&/g'
Output:
ID sample1 sample1 sample2 sample2 ...
123 1 0 1 2 ...

Adding numbers of a field

I have a text file with multiple rows of either two or four columns. In a two-column row, the 1st column is an id and the 2nd is a number; in a four-column row, the 1st and 2nd columns are ids and the 3rd and 4th are numbers. In four-column rows, the 2nd and 4th columns can hold multiple entries separated by commas. If a row has only two columns I want to print it as is; if it has four columns I want to print only the 1st column's id and, in the second column, the sum of all the numbers present in the 3rd and 4th columns for that row.
Input
CG AT,AA,CA 17 1,1,1
GT 14
TB AC,TC,TA,GG,TT,AR,NN,NM,AB,AT,TT,TC,CA,BB,GT,AT,XT,MT,NA,TT 552 6,1,1,2,2,1,2,1,5,3,4,1,2,1,1,1,3,4,5,4
TT CG,GT,TA,GB 105 3,4,1,3
Expected Output
CG 20
GT 14
TB 602
TT 116
If there are no leading spaces in the actual file, use $1 instead of $2.
$ awk -F '[ ,]+' '{for(i=1; i<=NF; i++) s+=$i; print $2, s; s=0}' <<EOF
CG AT,AA,CA 17 1,1,1
GT 14
TB AC,TC,TA,GG,TT,AR,NN,NM,AB,AT,TT,TC,CA,BB,GT,AT,XT,MT,NA,TT 552 6,1,1,2,2,1,2,1,5,3,4,1,2,1,1,1,3,4,5,4
TT CG,GT,TA,GB 105 3,4,1,3
EOF
CG 20
GT 14
TB 602
TT 116
-F '[ ,]+' means "fields are delimited by one or more spaces or commas".
There is no condition associated with the {action}, so it will be performed on every line.
NF is the Number of Fields, and $X refers to the Xth field.
Non-numeric strings evaluate to 0 in arithmetic, so we can simply add every field together to get the sum.
After we print the first non-blank field and our sum, we reset the sum for the next line.
Here is a solution coded to follow your description as closely as possible (with no field-splitting tricks, so it's easy to reason about):
awk '
NF == 2 {
print $1, $2
next
}
NF == 4 {
N = split($4, f, /,/)
for (i = 1; i <= N; ++i)
$3 += f[i]
print $1, $3
}'
I noticed though that your input section contains leading spaces. If leading spaces are actually present (and are irrelevant), we can add a leading { sub(/^ +/, "") } to the script.
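Putting that together, a sketch of the full script with the leading-space trim added (same logic as above, only the sub() line is new):

```shell
awk '
{ sub(/^ +/, "") }                   # strip the irrelevant leading spaces first
NF == 2 {                            # two columns: print the row as is
    print $1, $2
    next
}
NF == 4 {                            # four columns: fold the comma list of $4 into $3
    N = split($4, f, /,/)
    for (i = 1; i <= N; ++i)
        $3 += f[i]
    print $1, $3                     # id and the combined sum
}' file
```

With the input above this prints CG 20, GT 14, TB 602, TT 116.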

awk with empty field in columns

Here my file.dat
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
Running awk '{print $2}' file.dat gives:
A
2
4
7
U
But I would like to keep the empty field:
A

4

U
How to do it?
I must add that:
between columns 1 and 2 there are 3 whitespaces as the field separator
between columns 2 and 3, and between columns 3 and 4, one whitespace
So in column 2 two fields are missing (lines 2 and 4), and in column 4 two fields are missing (lines 3 and 5).
If this isn't all you need:
$ awk -F'[ ]' '{print $4}' file
A

4

U
then edit your question to provide a more truly representative example and clearer requirements.
If the input is fixed-width columns, you can use substr to extract the slice you want. I have assumed that you want a single character at index 5:
awk '{ print(substr($0,5,1)) }' file
Your awk code is missing a field separator.
Your example file doesn't clearly show what that field separator is.
From observation your file appears to have 5 columns.
You need to determine what your field separator is first.
This example code expects \t, i.e. <TAB>, as the field separator.
awk -F'\t' '{print $3}' OFS='\t' file.dat
This outputs the 3rd column of the file. -F'\t' is the 'read in' field separator, and OFS='\t' is the 'write out' one.
A
4
U
For GNU awk. It processes the file twice. On the first pass it examines all records, noting which character positions only ever contain a space, and treats the contiguous all-space runs as separator strings while building up the FIELDWIDTHS variable. On the second pass it uses that variable for fixed-width processing of the data.
The a[i] entries get values 0/1, and with this input h (the header) becomes 100010101, which leads to FIELDWIDTHS="4 2 2 1":
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
|   | | |
100010101   - while(match(h,/10*/))
\  /|/|/|
  4 2 2 1
Script:
$ awk '
NR==FNR {
for(i=1;i<=length;i++) # all record chars
a[i]=((a[i]!~/^(0|)$/) || substr($0,i,1)!=" ") # keep track of all space places
if(--i>m)
m=i # max record length...
next
}
BEGINFILE {
if(NR!=0) { # only do this once
for(i=1;i<=m;i++) # ... used here
h=h a[i] # h=100010101
while(match(h,/10*/)) { # build FIELDWIDTHS
FIELDWIDTHS=FIELDWIDTHS " " RLENGTH # qnd
h=substr(h,RSTART+RLENGTH)
}
}
}
{
print $2 # and output
}' file file
And output:
A

4

U
You need to trim off the space from the fields, though.
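One way to do that trim, sticking with the portable substr() approach from earlier (assuming, as there, that column 2 is the 2-character slice starting at index 5):

```shell
# Extract the fixed-width slice for column 2 and strip its trailing padding.
awk '{ f = substr($0, 5, 2); sub(/ +$/, "", f); print f }' file.dat
```

Empty columns then come out as genuinely empty lines.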

looking to compare against column number 3 of row number 3 using awk

I'm looking to compare against column number 3 of row number 3 using awk.
input:
uniqueid 22618
remoteid remote1
established 1302
output:
22618
Tried:
awk '{ if(established > 1000) print 22618}'
I suggest:
awk '$1=="uniqueid" {uid=$2}; $1=="established" {est=$2}; est>1000 {print uid}' file
Output:
22618
If column 1 contains uniqueid, save the value of column 2 in variable uid.
If column 1 contains established, save the value of column 2 in variable est.
If the value in est is larger than 1000, print the value in uid.
To compare against column number 3 of row number 3 using awk, you need to specify the record (NR==3) and the field ($2, most likely, not $3):
$ awk 'NR==3 && $2 > 1000{ print 22618 }' file
22618
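If hard-coding 22618 is undesirable, the id can also be captured from row 1 rather than written into the program (a sketch assuming the same three-line layout):

```shell
# Save column 2 of row 1, then compare column 2 of row 3 against 1000
# and print the saved id if it is larger.
awk 'NR == 1 { uid = $2 } NR == 3 && $2 > 1000 { print uid }' file
```

For the sample input this prints 22618.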