I have a file which looks like:
4.97911047 1.00000000 631.000000 369.343907
4.98065923 0.999492004 632.000000 369.771568
5.70441060 0.480057937 974.000000 642.686561
5.70448527 0.479704863 975.000000 641.643578
8.23090986 0.310811710 2020.00000 331.182895
8.23096067 0.312290865 2021.00000 331.188128
14.8077297 0.357914635 3181.00000 449.390996
14.8077541 0.352977613 3180.00000 449.377675
I want to subtract consecutive numbers in the first column. If the absolute value of the difference is less than or equal to some threshold, say 0.01, it should print the corresponding data range from the 3rd column.
For example, the difference in the 2nd and 1st consecutive data of the 1st column is
4.98065923-4.97911047=0.00154876.
Thus it should print
"631-632"
Then the difference in the 3rd and 2nd consecutive data of the 1st column is
5.70441060-4.98065923=0.723751
Thus it should not print anything. Again, the difference in the 4th and 3rd consecutive data of the 1st column is
5.70448527-5.70441060=7.467e-05
Thus it should print
"974-975"...
and so on. The final output should look like:
631-632
974-975
2020-2021
3181-3180
Note: If the difference is less than 0.01 for 3 or more consecutive numbers of the 1st column, then it should also print the total range, say "631-635".
The best I could do so far is to use an awk command to compute the corresponding differences:
awk 'NR > 1 { print $0 - prev } { prev = $0 }' < filename
Any help?
Regards
Sitangshu
I would use GNU AWK for this task in the following way. Let the content of file.txt be
4.97911047 1.00000000 631.000000 369.343907
4.98065923 0.999492004 632.000000 369.771568
5.70441060 0.480057937 974.000000 642.686561
5.70448527 0.479704863 975.000000 641.643578
8.23090986 0.310811710 2020.00000 331.182895
8.23096067 0.312290865 2021.00000 331.188128
14.8077297 0.357914635 3181.00000 449.390996
14.8077541 0.352977613 3180.00000 449.377675
then
awk 'NR>1&&($1-prev[1])<=0.01{printf "%d-%d\n",prev[3],$3}{split($0,prev)}' file.txt
gives output
631-632
974-975
2020-2021
3181-3180
Explanation: For each line I save under prev not the whole line (a string) but an array, using the split function, so that I can access, for example, the 3rd field of the previous line as prev[3]. For each line after the 1st one where the difference between the 1st field of the current line and the 1st field of the previous line is less than or equal to 0.01, I printf the 3rd field of the previous line and the 3rd field of the current line, using %d to format each value as an integer.
(tested in gawk 4.2.1)
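The one-liner above prints one pair per close neighbour, so it does not merge 3 or more consecutive close rows into a single range as the question's note asks, and it does not take the absolute value of the difference. A possible sketch that handles both is below. The function name print_close_ranges, the helper absval and the optional second argument for the threshold are my own assumptions, not part of the original answer.

```shell
# Hypothetical helper: print "first-last" ranges of column 3 for every run of
# rows whose consecutive column-1 differences are within a threshold.
# usage: print_close_ranges file.txt [threshold]
print_close_ranges() {
  awk -v tol="${2:-0.01}" '
    function absval(x) { return x < 0 ? -x : x }
    NR > 1 {
      if (absval($1 - prev1) <= tol) {
        # close to the previous row: open a run there unless one is already open
        if (!in_run) { start = prev3; in_run = 1 }
      } else if (in_run) {
        # the run ended at the previous row
        printf "%d-%d\n", start, prev3
        in_run = 0
      }
    }
    { prev1 = $1; prev3 = $3 }
    # flush a run that reaches the last line of the file
    END { if (in_run) printf "%d-%d\n", start, prev3 }
  ' "$1"
}
```

For the sample file.txt this should give the same four lines as before, since no run there is longer than two rows.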
I just began learning bash.
Trying to figure out how to convert a two-liner into a one liner using bash.
The First Line of code...
searches the first column of the input.txt for the word - KEYWORD.
captures every number in this KEYWORD row from column2 until the last column.
dumps all these numbers into the values.txt file placing each number on a new line.
The second line of code calculates the average of all the numbers in the first column of values.txt and then prints that value.
awk '{if($1=="KEYWORD") for(i=2;i<=NF;i++) print $i}' input.txt > values.txt
awk 'BEGIN{sum=0; counter=0}{sum+=$1; counter+=1}END{print sum/counter}' values.txt
How do I create a one liner from this?
Something like
awk '
BEGIN { count = sum = 0 }
$1 == "KEYWORD" {
for (n = 2; n <= NF; n++) sum += $n
count += NF - 1
}
END { print sum/count }' input.txt
Just keep track of the sum and total count of numbers in the first (and only) pass through the file and then average them at the end instead of printing a value for each matching line.
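If no row matches, count stays 0 and the division in END fails. A minimal sketch of the same single-pass idea with a guard for that case follows. The function name keyword_mean and passing the keyword in with -v are assumptions of mine for illustration.

```shell
# Hypothetical wrapper: mean of fields 2..NF over all rows whose first field
# equals the given keyword; exits with status 1 if no row matches.
# usage: keyword_mean KEYWORD input.txt
keyword_mean() {
  awk -v key="$1" '
    $1 == key {
      for (n = 2; n <= NF; n++) sum += $n   # add every number on the row
      count += NF - 1                       # and remember how many were added
    }
    END {
      if (!count) exit 1                    # nothing matched: avoid dividing by zero
      print sum / count
    }
  ' "$2"
}
```

For an input with the rows "KEYWORD 1 2 3" and "KEYWORD 4 6" this prints 3.2, the mean of the five numbers.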
After reviewing this problem with several people and learning some new bash/awk shortcuts, the code below appears to be the shortest answer.
awk '/KEYWORD/{for(n=3;n<=NF;n++)sum+=$n;print sum/(NF-2)}' input.txt
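This code searches the input file for the row containing "KEYWORD".
Then it sums up all the fields from the 3rd column to the last column.
Then it prints the average of all those numbers (i.e. the mean).
Note that it prints one value per matching row and never resets sum, so the result is a plain mean only when exactly one row matches.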
I would very much appreciate it if someone could help with this task. I am hoping to do this with awk, but if there is a better strategy than awk, I would also like to know.
This is infile,
S,0,3118,*,0,*,*,*,10-2,c645,5
H,0,648,99.2,+,0,0,250I648M2220I,10-2,c4204,1
H,0,597,99.2,+,0,0,314I597M2207I,10-2,c4022,1
S,1,2488,*,0,*,*,*,10-2,c17,4
H,1,798,97.4,+,0,0,1407I798M283I,10-2,c232,2
H,1,796,98,+,0,0,628I796M1064I,10-2,c67,1
H,1,751,97.5,-,0,0,668I144M3D290MD313M1073I,10-2,c115,1
H,1,792,98.4,+,0,0,628I792M1068I,10-2,c380,1
S,2,2437,*,0,*,*,*,10-2,c102,7
S,3,2218,*,0,*,*,*,10-2,c1081,10
H,3,928,99.2,-,0,0,3D925M1293I,10-2,c986,3
The outfile I would like to have is:
outfile
0,3,7
1,5,9
2,1,7
3,2,13
So, for rows that share the same value in the second column of infile, count the number of lines (second column of the outfile) and sum the values of the last column (third column of the outfile).
I tried as
awk -F',' '{a[$2] += $11}; END{for(c in a) print c, a[c]}' < infile
but I do not know how to count the number of lines at the same time.
A simple Awk command that hashes on the $2 value will do: one array tracks the number of occurrences of each second-column value, and another tracks the sum of the last field. It goes like
awk 'BEGIN{FS=OFS=","}{unique[$2]++; uniqueSum[$2]+=$NF}END{for (i in unique) print i,unique[i],uniqueSum[i]}' file
which will get you the output you need. The part BEGIN{FS=OFS=","} sets both the input and output field separators to a comma, and {unique[$2]++; uniqueSum[$2]+=$NF} keeps, for each unique ID value in $2, its count and the running sum of the last column. The END clause runs after all the lines are processed, so there we print each unique value, its count and its total sum to get the output you need.
0,3,7
1,5,9
2,1,7
3,2,13
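One thing to keep in mind: the order in which for (i in unique) visits the keys is not specified by awk, so the rows may not come out sorted as shown. If the order matters, a possible variant is to remember each key the first time it appears. This is only a sketch, and the function name count_and_sum and the arrays order, count and total are names I made up.

```shell
# Hypothetical helper: per value of column 2, print the value, the number of
# lines, and the sum of the last column, in order of first appearance.
# usage: count_and_sum infile
count_and_sum() {
  awk 'BEGIN { FS = OFS = "," }
    !($2 in count) { order[++n] = $2 }        # first time this key is seen
    { count[$2]++; total[$2] += $NF }         # line count and running sum
    END {
      for (i = 1; i <= n; i++)
        print order[i], count[order[i]], total[order[i]]
    }
  ' "$1"
}
```

Alternatively, piping the original command's output through sort -t, -k1,1n gives numerically sorted keys.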
I have a 2-column TSV into which I need to insert a new first column, using part of the value in column 2.
What I have:
fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
What I want:
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
I want to pull everything between "fastq/" and the first period and print that as the new first column.
$ awk -F'[/.]' '{printf "%s\t%s\n",$2,$0}' file
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
How it works
awk implicitly loops over all input lines.
-F'[/.]'
This tells awk to use any occurrence of / or . as a field separator. This means that, for your input, the string you are looking for will be the second field.
printf "%s\t%s\n",$2,$0
This tells awk to print the second field ($2), followed by a tab (\t), followed by the input line ($0), followed by a newline character (\n).
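The -F'[/.]' approach relies on the wanted string always being the second field, i.e. on exactly one slash coming before it. If the directory part could be deeper, a possible alternative is to keep the tab as the separator and trim a copy of the first column instead. This is a sketch under the assumption that the file really is tab-separated; the function name add_sample_column and the variable id are hypothetical.

```shell
# Hypothetical helper: prepend the sample ID (the text between the last slash
# and the first following period of column 1) as a new tab-separated column.
# usage: add_sample_column file
add_sample_column() {
  awk 'BEGIN { FS = OFS = "\t" }
    {
      id = $1
      sub(/^.*\//, "", id)   # drop everything up to and including the last slash
      sub(/\..*$/, "", id)   # drop everything from the first period onwards
      print id, $0
    }
  ' "$1"
}
```

Because only the copy in id is modified, $0 is printed exactly as it was read.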
echo "45" | awk 'BEGIN{FS=""}{for (i=1;i<=NF;i++)x+=$i}END{print x}'
I want to know how this works. What specifically do awk's FS and NF do here?
FS is the field separator. Setting it to "" (the empty string) means that every single character will be a separate field. So in your case you've got two fields: 4, and 5.
NF is the number of fields in a given record. In your case, that's 2. So i ranges from 1 to 2, which means that $i takes the values 4 and 5.
So this AWK script iterates over the characters and prints their sum — in this case 9.
These are built-in variables. FS is the field separator; setting it to the empty string splits every character into its own field. NF is the number of fields produced by splitting on FS, so in this case it is the number of characters, 2. So the script splits the input into characters ("4", "5"), iterates over both of them while adding their values up, and prints the result.
http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/
FS is the field separator. Normally fields are separated by whitespace, but when you set FS to the null string, each character of the input line is a separate field.
NF is the number of fields in the current input line. Since each character is a field, in this case it's the number of characters.
The for loop then iterates over each character on the line, adding it to x. So this is adding the value of each digit in input; for 45 it adds 4+5 and prints 9.
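As far as I know, splitting into single characters with an empty FS works in gawk and several other awks, but POSIX leaves the behaviour of a null FS unspecified. A sketch that does the same digit sum without relying on it, using substr, could look like this (the function name digit_sum is just for illustration):

```shell
# Hypothetical helper: sum the digits of its argument without relying on FS="".
# usage: digit_sum 45
digit_sum() {
  printf '%s\n' "$1" | awk '{
    total = 0
    # walk the line one character at a time; each digit is added as a number
    for (i = 1; i <= length($0); i++) total += substr($0, i, 1)
    print total
  }'
}
```

Calling digit_sum 45 prints 9, the same result as the original one-liner.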