count lines together with summing a specific column - awk

I would very much appreciate it if someone could help with this task. I am hoping to do this with awk, but if there is a better strategy than awk, I would also like to know.
This is the infile:
S,0,3118,*,0,*,*,*,10-2,c645,5
H,0,648,99.2,+,0,0,250I648M2220I,10-2,c4204,1
H,0,597,99.2,+,0,0,314I597M2207I,10-2,c4022,1
S,1,2488,*,0,*,*,*,10-2,c17,4
H,1,798,97.4,+,0,0,1407I798M283I,10-2,c232,2
H,1,796,98,+,0,0,628I796M1064I,10-2,c67,1
H,1,751,97.5,-,0,0,668I144M3D290MD313M1073I,10-2,c115,1
H,1,792,98.4,+,0,0,628I792M1068I,10-2,c380,1
S,2,2437,*,0,*,*,*,10-2,c102,7
S,3,2218,*,0,*,*,*,10-2,c1081,10
H,3,928,99.2,-,0,0,3D925M1293I,10-2,c986,3
The outfile that I would like to have is:
outfile
0,3,7
1,5,9
2,1,7
3,2,13
So, for lines where the second column of the infile is the same, count the number of lines (second column of the outfile) while also summing the values of the last column (third column of the outfile).
I tried as
awk -F',' '{a[$2] += $11}; END{for(c in a) print c, a[c]}' < infile
but I do not know how to count the number of lines at the same time.

A simple Awk command, with logic to hash on the $2 value twice: once to track the count of occurrences of the second-column value and once for the sum of the last field. It goes like
awk 'BEGIN{FS=OFS=","}{unique[$2]++; uniqueSum[$2]+=$NF}END{for (i in unique) print i,unique[i],uniqueSum[i]}' file
which will get you the output you need. The part BEGIN{FS=OFS=","} takes care of setting the input and output field separators to ,, and {unique[$2]++; uniqueSum[$2]+=$NF} hashes the count of the unique ID value from $2 and the running sum of the last column. The END clause is run after all the lines are processed, so we print each unique ID, its count and its total sum to get the output you need:
0,3,7
1,5,9
2,1,7
3,2,13
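Note that for (i in unique) iterates in an order awk does not guarantee; here the keys happen to come out sorted, but if you want to be sure of numeric order you can pipe the result through sort, for example:
awk 'BEGIN{FS=OFS=","}{unique[$2]++; uniqueSum[$2]+=$NF}END{for (i in unique) print i,unique[i],uniqueSum[i]}' file | sort -t, -k1,1n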

Related

comparing subtracting and printing using awk

I have a file which looks like:
4.97911047 1.00000000 631.000000 369.343907
4.98065923 0.999492004 632.000000 369.771568
5.70441060 0.480057937 974.000000 642.686561
5.70448527 0.479704863 975.000000 641.643578
8.23090986 0.310811710 2020.00000 331.182895
8.23096067 0.312290865 2021.00000 331.188128
14.8077297 0.357914635 3181.00000 449.390996
14.8077541 0.352977613 3180.00000 449.377675
I want to subtract the consecutive numbers in the first column. If the result is less than or equal to some threshold, say 0.01 in absolute value, it should print the corresponding data range from the 3rd column.
For example, the difference in the 2nd and 1st consecutive data of the 1st column is
4.98065923-4.97911047=0.00154876.
Thus it should print
"631-632"
Then the difference in the 3rd and 2nd consecutive data of the 1st column is
5.70441060-4.98065923=0.723751
Thus it should not print anything. Again, the difference in the 4th and 3rd consecutive data of the 1st column is
5.70448527-5.70441060=7.467e-05
Thus it should print
"974-975"...
in this way. The final output should be like:
631-632
974-975
2020-2021
3181-3180
Note: If the difference is less than 0.01 for 3 or more consecutive numbers in the 1st column, then it should print the total range, like "631-635" say.
The best I could do until now is to use an awk command to compute the corresponding differences:
awk 'NR > 1 { print $0 - prev } { prev = $0 }' < filename
Any help?
Regards
Sitangshu
I would use GNU AWK for this task in the following way. Let file.txt content be
4.97911047 1.00000000 631.000000 369.343907
4.98065923 0.999492004 632.000000 369.771568
5.70441060 0.480057937 974.000000 642.686561
5.70448527 0.479704863 975.000000 641.643578
8.23090986 0.310811710 2020.00000 331.182895
8.23096067 0.312290865 2021.00000 331.188128
14.8077297 0.357914635 3181.00000 449.390996
14.8077541 0.352977613 3180.00000 449.377675
then
awk 'NR>1&&($1-prev[1])<=0.01{printf "%d-%d\n",prev[3],$3}{split($0,prev)}' file.txt
gives output
631-632
974-975
2020-2021
3181-3180
Explanation: For each line I save under prev not the whole line (a string) but rather an array, using the split function to do so, so that I can access, for example, the 3rd field of the previous line as prev[3]. For each line after the 1st one where the difference between the 1st field of the current line and the 1st field of the previous line is less than or equal to 0.01, I printf the 3rd field of the previous line and the 3rd field of the current line, using %d to get the values formatted as integers.
(tested in gawk 4.2.1)
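The note about merging 3 or more consecutive close values into one total range is not handled above. A minimal sketch of one way to do it (my own variant, assuming the 1st column is sorted in ascending order) is to keep a "run" open while consecutive differences stay within the threshold and only print a range when the run closes:
awk '
NR > 1 && ($1 - prev1) <= 0.01 {        # still within the threshold: extend the run
    if (!inrun) start = prev3           # a new run begins at the previous 3rd field
    inrun = 1; end = $3
    prev1 = $1; prev3 = $3
    next
}
{
    if (inrun) printf "%d-%d\n", start, end   # a run just ended: print its range
    inrun = 0
    prev1 = $1; prev3 = $3
}
END { if (inrun) printf "%d-%d\n", start, end }
' file.txt
With file.txt above this prints the same four ranges; if 3 or more consecutive values were within 0.01 of each other, it would print a single start-end range for the whole run instead.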

Bash one-liner for calculating the average of a specific row of numbers

I just began learning bash.
Trying to figure out how to convert a two-liner into a one-liner using bash.
The first line of code:
searches the first column of input.txt for the word KEYWORD,
captures every number in this KEYWORD row from column 2 until the last column,
dumps all these numbers into the values.txt file, placing each number on a new line.
The second line of code calculates the average value of all the numbers in the first column of values.txt, then prints out this value.
awk '{if($1=="KEYWORD") for(i=2;i<=NF;i++) print $i}' input.txt > values.txt
awk 'BEGIN{sum=0; counter=0}{sum+=$1; counter+=1}END{print sum/counter}' values.txt
How do I create a one liner from this?
Something like
awk '
BEGIN { count = sum = 0 }
$1 == "KEYWORD" {
for (n = 2; n <= NF; n++) sum += $n
count += NF - 1
}
END { print sum/count }' input.txt
Just keep track of the sum and total count of numbers in the first (and only) pass through the file and then average them at the end instead of printing a value for each matching line.
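For example, with a hypothetical input.txt containing a single KEYWORD row (the numbers here are made up purely to illustrate):
$ cat input.txt
KEYWORD 2 4 6
other 1 1 1
$ awk 'BEGIN { count = sum = 0 }
       $1 == "KEYWORD" { for (n = 2; n <= NF; n++) sum += $n; count += NF - 1 }
       END { print sum/count }' input.txt
4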
After reviewing this problem with several people and learning some new bash/awk shortcuts, the code below appears to be the shortest answer.
awk '/KEYWORD/{for(n=3;n<=NF;n++)sum+=$n;print sum/(NF-2)}' input.txt
This code searches the input file for the row containing "KEYWORD".
Then it sums up all the fields from the 3rd column to the last column.
Then it prints out the average value of all those numbers (i.e. the mean).
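One caveat: /KEYWORD/ matches the word anywhere on the line, and this loop starts at column 3, whereas the original two-liner matched only the first field and summed from column 2. A one-liner that stays closer to the original logic (just a sketch of that variant) would be:
awk '$1=="KEYWORD"{for(n=2;n<=NF;n++)sum+=$n; print sum/(NF-1)}' input.txt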

AWK select rows where all columns are equal

I have a file with tab-separated values where the number of columns is not known a priori. In other words, the number of columns is consistent within a file, but different files have different numbers of columns. The first column is a key, the other columns are some arbitrary values.
I need to filter out the rows where the values are not all the same. For example, assuming that the number of columns is 4, I need to keep the first 2 rows and filter out the 3rd:
1 A A A
2 B B B
3 C D C
I'm planning to use AWK for this purpose, but I don't know how to deal with the fact that the number of columns is unknown. The case of the known number of columns is simple, this is a solution for 4 columns:
$2 == $3 && $3 == $4 {print}
How can I generalize the solution for arbitrary number of columns?
If you can guarantee that no field contains regex-active chars, the first field never matches the second, and there is no blank line in the input:
awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}' file
Note that this solution is designed for this specific case and less extendable than others.
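For example, with the sample data from the question (assuming whitespace-separated input), this should give:
$ awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}' file
1 A A A
2 B B B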
Another slight twist on the approach. In your case you know you want to compare fields 2-4, so you can simply loop from i=3 to i<=NF checking whether $i!=$(i-1), and if any pair differs, don't print and get the next record, e.g.
awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1'
Example Use/Output
With your data in file.txt:
$ awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1' file.txt
1 A A A
2 B B B
Could you please try the following. This will compare all columns from the 2nd column to the last column and check whether every element is equal. If they are all the same it will print the line.
awk '{for(i=3;i<=NF;i++){if($(i-1)==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
OR (hard-coding $2 in the code, since if $2==$3 AND $3==$4 it means $2==$3==$4, so intentionally comparing against $2 rather than having i-1 fetch the previous field):
awk '{for(i=3;i<=NF;i++){if($2==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
I'd use a counter t with an initial value of 2 and add to it the number of times $i == $(i+1), where i iterates from 2 to NF-1. Print the line only if t==NF is true:
awk -F'\t' '{t=2;for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF' file.txt
Here is a generalisation of the problem:
Select all lines where a set of columns have the same value: c1 c2 c3 c4 ..., where ci can be any number:
Assume we want to select the columns: 2 3 4 11 15
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if ($(a[i])!=$(a[1])) next}1' file
A bit more robust, in case a line might not contain all fields:
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if (a[i] <= NF) if ($(a[i])!=$(a[1])) next}1' file
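For instance, run against the 4-column sample from the question, the robust version simply skips columns 11 and 15 (since a[i] <= NF is false for them) and keeps only the rows whose remaining selected columns match:
$ awk 'BEGIN{n=split("2 3 4 11 15",a)}
       {for(i=2;i<=n;++i) if (a[i] <= NF) if ($(a[i])!=$(a[1])) next}1' file
1 A A A
2 B B B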

Print rows whose last field is negative

The last column of my file contains both negative and positive numbers:
a, b, -1
c, d, 2
e, f, -3
I need to extract the lines whose last field contains a negative number. Currently, I am using the following:
awk '/-/{print}' in.csv>out.csv
The above fails if '-' appears in other columns. I wonder if there is a way to test the last field in each row to see if it is negative and then extract the line.
Just tell awk to do...
awk -F, '$NF < 0' file
This sets the field separator to the comma (which looks like what you need) and then checks whether $NF is less than 0. And what is $NF? The last field, since NF contains the number of fields and $i refers to field number i.
The line is then printed because a true condition triggers the default awk action, which is to print the current record.
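The bare condition is equivalent to writing the print action explicitly, and you can redirect to a file as in the question, for example:
awk -F, '$NF < 0 {print}' in.csv > out.csv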

find duplicate in first field, then combine text from second field of duplicate lines

I have file.csv with two fields similar to this:
text,something
more,somethingelse
text,another
foo,bar
I sort the file so that everything in the first field is in order so that all the duplicates in the first column are grouped together.
foo,bar
more,somethingelse
text,something
text,another
What I need to do but can't figure out is to move the text in the second field to the same line as the duplicate in the first field, separated by a ";". It doesn't matter in what order the second-field values are combined. I just want the output to be something like this:
foo,bar
more,somethingelse
text,something; another
I've tried this but it doesn't work. Not surprising since I'm just learning awk.
sort file.csv | awk 'BEGIN{last = ""; value = 0;} {if ($1 == last) {print $0, "; value";}}'
I wanted 'last' to hold the value of the first field of the previous line and 'value' to hold the value of the second field of the previous line. But I couldn't figure out how to make that work.
Is it possible to do this with a shell script? Thanks for any input.
This should work without the need for sort:
awk -F, '{
lines[$1] = (lines[$1] ? lines[$1] "; " $2 : $0)
}
END {
for (line in lines) print lines[line]
}' file
more,somethingelse
text,something; another
foo,bar
Set the input field separator to ,.
Check if column 1 exists in our lines array. If it does, append the second column, separated by ;.
If column 1 is not present in our array, assign the entire line as the value.
In the END block, iterate through our array and print the values.
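As with any for (key in array) loop in awk, the output order is unspecified, which is why the rows above come out in a different order than the input. If you want the result sorted by the first field, one simple option is to pipe the output through sort, e.g.:
awk -F, '{ lines[$1] = (lines[$1] ? lines[$1] "; " $2 : $0) } END { for (line in lines) print lines[line] }' file | sort -t, -k1,1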