Adding numbers of a field - awk

I have a text file with multiple rows of either two or four columns. In two-column rows the 1st column is an id and the 2nd is a number. In four-column rows the 1st and 2nd columns are ids and the 3rd and 4th are numbers, and the 2nd and 4th columns can hold multiple entries separated by commas. Rows with only two columns should be printed as they are; for four-column rows I want to print only the 1st column id and, in the second column, the sum of all the numbers present in that row's 3rd and 4th columns.
Input
CG AT,AA,CA 17 1,1,1
GT 14
TB AC,TC,TA,GG,TT,AR,NN,NM,AB,AT,TT,TC,CA,BB,GT,AT,XT,MT,NA,TT 552 6,1,1,2,2,1,2,1,5,3,4,1,2,1,1,1,3,4,5,4
TT CG,GT,TA,GB 105 3,4,1,3
Expected Output
CG 20
GT 14
TB 602
TT 116

$ awk -F '[ ,]+' '{for(i=1; i<=NF; i++) s+=$i; print $1, s; s=0}' <<EOF
CG AT,AA,CA 17 1,1,1
GT 14
TB AC,TC,TA,GG,TT,AR,NN,NM,AB,AT,TT,TC,CA,BB,GT,AT,XT,MT,NA,TT 552 6,1,1,2,2,1,2,1,5,3,4,1,2,1,1,1,3,4,5,4
TT CG,GT,TA,GB 105 3,4,1,3
EOF
CG 20
GT 14
TB 602
TT 116
-F '[ ,]+' means "fields are delimited by one or more spaces or commas".
There is no condition associated with the {action}, so it will be performed on every line.
NF is the Number of Fields, and $X refers to the Xth field.
Strings that don't look like numbers evaluate to 0 in a numeric context, so we can simply add every field together; only the real numbers contribute to the sum.
After we print the first field and our sum, we reset the sum for the next line. (If your actual file has leading spaces, the delimiter at the start produces an empty first field that shifts everything right, so print $2 instead of $1.)
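To see that numeric coercion in isolation, here is a throwaway demo (the values are illustrative):
$ awk 'BEGIN { print "CG" + 0, "17" + 5, "1,1,1" + 0 }'
0 22 1
A purely alphabetic string contributes 0, a numeric string behaves like its number, and a mixed string contributes only its leading numeric prefix.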

Here is a solution coded to follow your instruction as closely as possible (with no field-splitting tricks so that it's easy to reason about):
awk '
NF == 2 {
    print $1, $2
    next
}
NF == 4 {
    N = split($4, f, /,/)
    for (i = 1; i <= N; ++i)
        $3 += f[i]
    print $1, $3
}'
I noticed, though, that your input section contains leading spaces. If they are actually present (and irrelevant), we can add a leading { sub(/^ +/, "") } rule to the script, as in the sketch below.
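Put together, a minimal sketch of the full script with that guard in place (assuming the leading spaces carry no meaning):
awk '
{ sub(/^ +/, "") }                # drop irrelevant leading spaces before the NF tests
NF == 2 {
    print $1, $2
    next
}
NF == 4 {
    N = split($4, f, /,/)         # split the comma list in the 4th column
    for (i = 1; i <= N; ++i)
        $3 += f[i]                # fold each number into the numeric 3rd column
    print $1, $3
}' file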

Related

how to keep newline(s) when selecting a given column with awk

Suppose I have a file like this (disclaimer: this is not fixed; I can have more than 7 rows and more than 4 columns):
R H A 23
S E A 45
T E A 34
U   A 35
Y T A 35
O E A 353
J G B 23
I want the output to select the second column if the third column is A, but keeping the newline or whitespace character.
output should be:
HEE TE
I tried this:
awk '{if ($3=="A") print $2}' file | awk 'BEGIN{ORS = ""}{print $1}'
But this gives:
HEETE%
Which has a weird % and is missing the space.
You may use this gnu-awk solution using FIELDWIDTHS:
awk 'BEGIN{ FIELDWIDTHS = "1 1 1 1 1 1 *" } $5 == "A" {s = s $3}
END {print s}' file
HEE TE
awk splits each record using the width values provided in the FIELDWIDTHS variable.
1 1 1 1 1 1 * means each of the first 6 columns is a single character wide and the remaining text is filled into the 7th column. Since there is a space after each value, $2, $4 and $6 each hold a single space while $1, $3 and $5 hold the values from the input.
$5 == "A" {s = s $3}: Here we are checking if $5 is A and if that condition is true then we keep appending value of $3 in a variable s. In the END block we just print variable s.
Without using fixed width parsing, awk will treat A in 4th row as $2.
Alternatively, if we let the spaces be part of the column values, use:
awk '
BEGIN{ FIELDWIDTHS = "2 2 2 *" }
$3 == "A " {s = s substr($2,1,1)}
END {print s}
' file
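If gawk (and therefore FIELDWIDTHS) is not available, a portable sketch using substr over the same one-character layout:
awk 'substr($0,5,1) == "A" { s = s substr($0,3,1) } END { print s }' file
HEE TE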

AWK select rows where all columns are equal

I have a file with tab-separated values where the number of columns is not known a priori. In other words the number of columns is consistent within a file but different files have different number of columns. The first column is a key, the other columns are some arbitrary values.
I need to filter out the rows where the values are not all the same. For example, assuming that the number of columns is 4, I need to keep the first 2 rows and filter out the 3rd:
1 A A A
2 B B B
3 C D C
I'm planning to use AWK for this purpose, but I don't know how to deal with the fact that the number of columns is unknown. The case of the known number of columns is simple, this is a solution for 4 columns:
$2 == $3 && $3 == $4 {print}
How can I generalize the solution for arbitrary number of columns?
If you can guarantee that no field contains regex-active characters, that the first field never matches the second, and that there are no blank lines in the input:
awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}' file
Note that this solution is designed for this specific case and is less extensible than the others.
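For example, with the sample rows in file.txt:
$ awk '{tmp=$0; gsub($2,"")} NF==1{print tmp}' file.txt
1 A A A
2 B B B
Removing every occurrence of $2 from the line leaves only the key field when all values matched (NF==1), and tmp preserves the original line for printing.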
Another slight twist on the approach. In your case you know you want to compare fields 2-4, so you can simply loop from i=3 to NF comparing $i with $(i-1); if any pair differs, skip printing and get the next record, e.g.
awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1'
Example Use/Output
With your data in file.txt:
$ awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1' file.txt
1 A A A
2 B B B
Could you please try the following. It compares all columns from the 2nd to the last and checks whether every element is equal; if they are all the same, it prints the line.
awk '{for(i=3;i<=NF;i++){if($(i-1)==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
OR (hard-coding $2 in the comparison: since $2==$3 and $3==$4 implies $2==$3==$4, we can intentionally compare everything against $2 rather than against the previous field via i-1):
awk '{for(i=3;i<=NF;i++){if($2==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
I'd use a counter t with an initial value of 2 to add up the number of times $i == $(i+1), where i iterates from 2 to NF-1, and print the line only if t==NF is true:
awk -F'\t' '{t=2;for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF' file.txt
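Each comparison $i==$(i+1) evaluates to 1 or 0, so t counts the matches; starting at 2, t reaches NF exactly when all NF-2 comparisons succeed. With the sample rows in file.txt:
$ awk -F'\t' '{t=2;for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF' file.txt
1 A A A
2 B B B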
Here is a generalisation of the problem:
Select all lines where a set of columns have the same value: c1 c2 c3 c4 ..., where ci can be any number:
Assume we want to select the columns: 2 3 4 11 15
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if ($(a[i])!=$(a[1])) next}1' file
A bit more robust, in case a line might not contain all fields:
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if (a[i] <= NF) if ($(a[i])!=$(a[1])) next}1' file
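Applied to the sample above, where the columns to compare are simply 2 3 4:
$ awk 'BEGIN{n=split("2 3 4",a)}
{for(i=2;i<=n;++i) if ($(a[i])!=$(a[1])) next}1' file
1 A A A
2 B B B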

Using awk, how to merge lines which are duplicates based on multiple columns and substitute the average for another column

This is a variant on
Using awk how do I print all lines containing duplicates of specific columns?
Input:
a;3;c;1
a;6;b;2
a;5;c;1
Output:
a;4;c;1
a;6;b;2
Hence, all lines which have duplicate values in columns 1, 3 and 4 should be merged into one line, printing the average of column 2 in column 2. All the lines which don't have duplicates (according to columns 1, 3 and 4) should be printed as they are.
gawk approach:
awk -F";" '{a[$1,$3,$4]+=$2; ++c[$1,$3,$4]}END{OFS=";"; for(i in a){
split(i, sep, SUBSEP); print sep[1],a[i]/c[i],sep[2],sep[3]}}' file
The output:
a;6;b;2
a;4;c;1
a[$1,$3,$4]+=$2 - group lines by the same 1st, 3rd and 4th fields, accumulating the 2nd field's value
++c[$1,$3,$4] - count the number of grouped records
split(i, sep, SUBSEP) - split the compound key into an array containing the 1st, 3rd and 4th field values
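SUBSEP is awk's built-in subscript separator ("\034" by default) that glues comma-style subscripts such as a[$1,$3,$4] into a single string key, which is why split can take the key apart again:
$ awk 'BEGIN { k = "a" SUBSEP "c" SUBSEP "1"; n = split(k, p, SUBSEP); print n, p[1], p[2], p[3] }'
3 a c 1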
Give this one liner a try:
awk -F';' '{k=$1 FS $3 FS $4;t[k]++;a[k]+=($2-a[k])/t[k]}
END{for(x in a){sub(FS,FS a[x]"&",x);print x}}' file
it maintains a running average as the value of a hashtable, keyed by fields 1, 3 and 4 (the incremental form a[k]+=($2-a[k])/t[k] stays correct even when a key occurs more than twice);
after all lines are processed, it inserts the calculated average into the 2nd field position.
Note that the order of lines in output may be different from the input.
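The sub() call is what performs the insertion: it replaces the first FS in the key with FS, then the average, then the matched separator itself (&). In isolation:
$ awk 'BEGIN { x = "a;c;1"; avg = 4; sub(/;/, ";" avg "&", x); print x }'
a;4;c;1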
an indirect approach
swap12() { awk 'BEGIN{FS=OFS=";"} {t=$1;$1=$2;$2=t}1' "$@"; }
(Using "$@" rather than "$1" lets the function read stdin when called with no file argument, as in the final pipeline stage below.)
swap12 file |
awk 'BEGIN {FS=OFS=";"}
{k=$2 FS $3 FS $4; a[k]+=$1; c[k]++}
END {for(k in a) print a[k]/c[k],k}' |
swap12
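With the sample input the pipeline produces (in arbitrary order, since for (k in a) guarantees none):
a;4;c;1
a;6;b;2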

How to get a field by counting columns (number of characters)

I have a logfile.txt and I want to address field $4 by character column rather than by field number, because the fields are separated by space characters and field 2 ($2) may contain values separated by spaces. I want to count lines, but I don't know how to specify $4 without causing a problem when field 2 ($2) contains a space character.
here is my file:
KJKJJ1KLJKJKJ928482711 PIEJHHKIA 87166188177633 AJHHHH77760 00666667 876876800874 2014100898798789979879877770
KJKJJ1KLJKJKJ928482711 HKHG 81882776553868 HGHALJLKA700 00876763 216897879879 2014100898798789979879877770
KJKJJ1KLJKJKJ928482711 UUT UGGT 81762665356426 HGJHGHJG661557008 00778787 268767860704 2014100898798789979879877770
KJKJJ1KLJKJKJ9284827kj ARTH HGG 08276255534867 HGJHGHJG661557008 00876767 212668767684 2014100898798789979879877770
here is the code :
awk 'END { OFS="\t"; for (k in c) print c[k],"\t"k,"\t"f[k] } { k = $4; c[k]++; f[k]=substr($0,137,8) }' logfile.txt
I want to count based on field $4, but to specify this field in the code we must go by character position (substr($0, ..., ...)).
The output should be:
1 20141008 AJHHHH77760
1 20141008 HGHALJLKA700
2 20141008 HGJHGHJG661557008
If your records are composed of fixed-width fields you can use cut(1):
% cut -c1-22,23-42,43-62,... --output-delimiter=, file | sed 's/, */,/g' > file.csv
% awk -F, '{your_code}' file.csv
please write a range for each of your fixed width fields, in place of the ... ellipsis.
I have written ranges only for the first three, lazy me.
If you don't want to bother with an intermediate file, just use a | pipe.
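If gawk is available, the same fixed-width split can also be done inside awk with FIELDWIDTHS (used earlier on this page), with no intermediate file. A sketch with illustrative widths, to be adjusted to the real layout:
gawk '
BEGIN { FIELDWIDTHS = "23 20 20 18 9 13 *" }   # widths are guesses for illustration
{
    key = $4
    gsub(/ +/, "", key)            # strip the padding spaces inside the fixed field
    c[key]++
    f[key] = substr($0, 137, 8)    # the date, by character position as before
}
END { for (k in c) print c[k], f[k], k }' logfile.txt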

Selecting a field after a string using awk

I'm very new to awk having just been introduced to it over the weekend.
I have a question that I'm hoping someone may be able to help me with.
How would one select a field that follows a specific string?
How would I expand this code to select more than one field following a specific string?
As an example, for any given line in my text file I have something like
2 of 10 19/4/2014 school name random text distance 800m more random text time 2:20:22 winner someonefast.
Some attributes are very consistent so I can easily extract these fields. For example 2, 10 and the date. However, there is often a lot of variable text before the next field that I wish to extract. Hence the question. Using awk can I extract the next field following a string? For example I'm interested in the fields following the /distance/ or /time/ string in combination with $1, $3, $4, $5.
Your help will be greatly appreciated.
Andy
Using awk you can select the field following a string. Here is an example:
echo '2 of 10 19/4/2014 school name random text distance 800m more random text time 2:20:22 winner someonefast.' |
awk '{
for(i=1; i<=NF; i++) {
if ( i ~ /^[1345]$/ ) {
extract = (extract ? extract FS $i : $i)
}
if ( $i ~ /distance|time/ ) {
extract = (extract ? extract FS $(i+1): $(i+1))
}
}
print extract
}'
2 10 19/4/2014 school 800m 2:20:22
What we are doing here is basically allowing awk to split on the default delimiter. We create a for loop to iterate over all fields. NF stores the number of fields for a given line, so we start from 1 and go all the way to the end.
In our first conditional block, we just inspect the field number. If it is 1 or 3 or 4 or 5, we create a variable called extract which concatenates the values of these fields separated by the field separator.
In our second conditional block, we check if the value of the field is either distance or time. If it is, we again append to our variable, but this time we take $(i+1) instead of the current value, i.e. the value of the next field: the field that follows the specific string.
When you have name = value situations like you do here, it's best to create an array that maps the names to the values and then just print the values for the names you're interested in, e.g.:
$ awk '{for (i=1;i<=NF;i++) v[$i]=$(i+1); print $1, $3, $4, $5, v["distance"], v["time"]}' file
2 10 19/4/2014 school 800m 2:20:22
Basic:
awk '{
for (i = 6; i <= NF; ++i) {
if ($i == "distance") distance = $(i + 1)
if ($i == "time") time = $(i + 1)
}
print $1, $3, $4, $5, distance, time
}' file
Output:
2 10 19/4/2014 school 800m 2:20:22
But this is not enough to capture the rest of the school name after $5, which is still significant text; you would need another condition for that.
The better solution is to use another delimiter besides spaces, such as tabs, and set \t as FS.
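For instance, if every attribute occupied exactly one tab-separated field (a hypothetical layout: position, date, school name, distance, time, winner), the multi-word school name would stay in a single field and no searching would be needed:
awk -F'\t' -v OFS='\t' '{ print $1, $2, $4, $5 }' file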