awk - moving average in blocks of an ASCII file
I have a big ascii-file that looks like this:
12,3,0.12,965.814
11,3,0.22,4313.2
14,3,0.42,7586.22
17,4,0,0
11,4,0,0
15,4,0,0
13,4,0,0
17,4,0,0
11,4,0,0
18,3,0.12,2764.86
12,3,0.22,2058.3
11,3,0.42,2929.62
10,4,0,0
10,4,0,0
14,4,0,0
12,4,0,0
19,3,0.12,1920.64
20,3,0.22,1721.51
12,3,0.42,1841.55
11,4,0,0
15,4,0,0
19,4,0,0
11,4,0,0
13,4,0,0
17,3,0.12,2738.99
12,3,0.22,1719.3
18,3,0.42,3757.72
.
.
.
I want to calculate a selective moving average over three values with awk. The selection is based on the second and the third column.
A moving average should be calculated only for lines where the second column is 3.
Then I would like to calculate three separate moving averages, selected by the third column (which contains the same values in the same order in every "block").
The moving average itself is calculated over the fourth column.
I would like to output the whole line belonging to the second value of each moving-average window, with the fourth column replaced by the result.
I know that sounds very complicated, so I will give an example of what I want to calculate and also the desired result:
(965.814+2764.86+1920.64)/3 = 1883.77
and output the result together with line 10:
18,3,0.12,1883.77
Then continue with the second, eleventh and eighteenth line...
The end result for my data example shall look like this:
18,3,0.12,1883.77
12,3,0.22,2697.67
11,3,0.42,4119.13
19,3,0.12,2474.83
20,3,0.22,1833.04
12,3,0.42,2842.96
I tried to calculate the moving average with the following awk code, but I think I designed the script wrong, because awk reports a syntax error for every "$2 == 3".
BEGIN { FS="," ; OFS = "," }
$2 == 3 {
a; b; c; d; e; f = 0
line1 = $0; a = $3; b = $4; getline
line2 = $0; c = $3; d = $4; getline
line3 = $0; e = $3; f = $4
$2 == 3 {
line11 = $0; a = $3; b += $4; getline
line22 = $0; c = $3; d += $4; getline
line33 = $0; e = $3; f += $4
$2 == 3 {
line111 = $0; a = $3; b += $4; getline
line222 = $0; c = $3; d += $4; getline
line333 = $0; e = $3; f += $4
}
}
$0 = line11; $3 = a; $4 = b/3; print
$0 = line22; $3 = c; $4 = d/3; print
$0 = line33; $3 = e; $4 = f/3
}
{print}
Can you help me understand how to correct my script (I think I am missing something about the philosophy of awk), or should I start a completely new script because there is an easier solution out there? ;-)
I also tried another idea:
BEGIN { FS="," ; OFS = "," }
i=0;
do {
i++;
a; b; c; d; e; f = 0
$2 == 3 {
line1 = $0; a = $3; b += $4; getline
line2 = $0; c = $3; d += $4; getline
line3 = $0; e = $3; f += $4
}while(i<3)
$0 = line1; $3 = a; $4 = b/3; print
$0 = line2; $3 = c; $4 = d/3; print
$0 = line3; $3 = e; $4 = f/3
}
{print}
This one also does not work; awk gives me two syntax errors (one at the "do" and the other after the "$2 == 3").
I changed and tried a lot in both scripts, and at some point they ran without errors but did not give the desired output at all, so I think there must be a more general problem.
I hope you can help me, that would be really nice!
Normalize your input
If you normalize your input using the right tools, the task of finding a solution becomes far easier.
My idea is to use awk to select the records where $2==3 and then use sort to group the data on the numerical value of the third column.
% echo '12,3,0.12,965.814
11,3,0.22,4313.2
14,3,0.42,7586.22
17,4,0,0
11,4,0,0
15,4,0,0
13,4,0,0
17,4,0,0
11,4,0,0
18,3,0.12,2764.86
12,3,0.22,2058.3
11,3,0.42,2929.62
10,4,0,0
10,4,0,0
14,4,0,0
12,4,0,0
19,3,0.12,1920.64
20,3,0.22,1721.51
12,3,0.42,1841.55
11,4,0,0
15,4,0,0
19,4,0,0
11,4,0,0
13,4,0,0
17,3,0.12,2738.99
12,3,0.22,1719.3
18,3,0.42,3757.72' | \
awk -F, '$2==3' | \
sort --field-separator=, --key=3,3 --numeric-sort --stable
12,3,0.12,965.814
18,3,0.12,2764.86
19,3,0.12,1920.64
17,3,0.12,2738.99
11,3,0.22,4313.2
12,3,0.22,2058.3
20,3,0.22,1721.51
12,3,0.22,1719.3
14,3,0.42,7586.22
11,3,0.42,2929.62
12,3,0.42,1841.55
18,3,0.42,3757.72
%
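A short aside in case the sort options look unfamiliar: --key=3,3 --numeric-sort compares only the third comma-separated field, and --stable keeps lines with equal keys in their input order, which is what preserves the block order inside each group. A quick illustration on the 0.12 group only (YOUR_FILE stands for your actual file, as below):

# with --stable (-s), equal keys keep input order, so the 0.12 values stay
# in block order: 965.814, 2764.86, 1920.64, 2738.99
awk -F, '$2==3 && $3==0.12' YOUR_FILE | sort -t, -k3,3n -s
# without -s, GNU sort falls back to comparing whole lines when the keys are
# equal, so these lines would come out in lexical order (12..., 17..., 18...,
# 19...) and the running mean would be computed over the wrong triples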
Reason on normalized input
As you can see, the situation is now much clearer and we can try to design an algorithm to output a 3-element running mean.
% awk -F, '$2==3' YOUR_FILE | \
sort --field-separator=, --key=3,3 --numeric-sort --stable | \
awk -F, '
$3!=prev {prev=$3
c=0
s[1]=0;s[2]=0;s[3]=0}
{old=new
new=$0
c = c+1; i = (c-1)%3+1; s[i] = $4
if(c>2)print old FS (s[1]+s[2]+s[3])/3}'
18,3,0.12,2764.86,1883.77
19,3,0.12,1920.64,2474.83
12,3,0.22,2058.3,2697.67
20,3,0.22,1721.51,1833.04
11,3,0.42,2929.62,4119.13
12,3,0.42,1841.55,2842.96
Oops, I forgot your requirement about SUBSTITUTING $4 with the running mean; I will come up with a solution unless you're faster than me...
Edit: change the line
{old=new
to
{split(new,old,",")
and change the line
if(c>2)print old FS (s[1]+s[2]+s[3])/3}'
to
if(c>2) print old[1] FS old[2] FS old[3] FS (s[1]+s[2]+s[3])/3}'
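For convenience, here is the whole pipeline with that edit already applied (a consolidated sketch, checked only against the sample data above; YOUR_FILE is the placeholder used earlier):

awk -F, '$2==3' YOUR_FILE | \
sort --field-separator=, --key=3,3 --numeric-sort --stable | \
awk -F, '
$3!=prev {prev=$3                    # new $3 group: reset the window
          c=0
          s[1]=0;s[2]=0;s[3]=0}
{split(new,old,",")                  # old[] now holds the fields of the previous record
 new=$0
 c = c+1; i = (c-1)%3+1; s[i] = $4   # circular buffer of the last three $4 values
 if(c>2) print old[1] FS old[2] FS old[3] FS (s[1]+s[2]+s[3])/3}'

On the sample data this prints the six lines you asked for, with $4 replaced by the running mean, but grouped by the third column (both 0.12 results first, then 0.22, then 0.42) rather than in the original block order.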
Related
Concatenating array elements into one string in a for loop using awk
I am working on a variant call format (VCF) file, and I will try to show you guys what I am trying to do.

Input:

1 877803 838425 GC G
1 878077 966631 C CCACGG

Output:

1 877803 838425 C -
1 878077 966631 - CACGG

In summary, I am trying to delete the first letters of the longer strings. And here is my code:

awk 'BEGIN { OFS="\t" }
/#/ {next}
{
    m = split($4, a, //)
    n = split($5, b, //)
    x = "-"
    delete y
    if (m>n){
        for (i = n+1; i <= m; i++) {
            y = sprintf("%s", a[i])
        }
        print $1, $2, $3, y, x
    } else if (n>m){
        for (j = m+1; i <= n; i++) {
            y = sprintf("%s", b[j])   ## Problem here
        }
        print $1, $2, $3, x, y
    }
}' input.vcf > output.vcf

But I am getting the following error in line 15, not even in line 9:

awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context

I don't know how to concatenate array elements into one string using awk. I will be very happy if you can help me. Merry X-Mas!
You may try this awk:

awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file

1 877803 838425 C -
1 878077 966631 - CACGG

More readable form:

awk -v OFS="\t" 'function trim(s) {
    return (length(s) == 1 ? "-" : substr(s, 2))
}
{
    $4 = trim($4)
    $5 = trim($5)
} 1' file
You can use awk's substr function to process the 4th and 5th space-delimited fields:

awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file

If the string from position 2 onwards in field 4 is equal to "", set field 4 to "-"; otherwise, set field 4 to the extract of the field from position 2 to the end of the field. Do the same with field 5. Print the lines, modified or not, with the shorthand 1.
get the statistics in a text file using awk
I have a text file like this small example:

>chr10:101370300-101370301
A
>chr10:101370288-101370289
A
>chr10:101370289-101370290
G
>chr10:101471626-101471627
g
>chr10:101471865-101471866
g
>chr10:101471605-101471606
a
>chr10:101471606-101471607
g
>chr10:101471681-101471682

As you see, below every line that starts with ">" I have a letter. These letters are A, G, T or C. In my results I would like to get their frequency as a percentage. Here is a small example of the expected output:

A = 28.57
G = 14.29
g = 42.85
a = 14.29

I am trying to do that in awk using:

awk 'if $1 == "G", num=+1 { a[$1]+=num/"G" } if $1 == "G", num=+1 { a[$1]+=num/"C" } if $1 == "G", num=+1 { a[$1]+=num/"T" } if $1 == "G", num=+1 { a[$1]+=num/"A" } ' infile.txt > outfile.txt

but it does not return what I want. Do you know how to fix it?
Awk solution:

awk '/^[a-zA-Z]/{ a[$1]++; cnt++ } END{ for (i in a) printf "%s = %.2f\n", i, a[i]*100/cnt }' file.txt

/^[a-zA-Z]/ - on encountering records that start with a letter [a-zA-Z]:
a[$1]++ - accumulate occurrences of each item (letter)
cnt++ - count the total number of items (letters)

The output:

A = 28.57
a = 14.29
G = 14.29
g = 42.86
Your sample is in contradiction with your comment (every line starting with ">" has no letter on my display, so I assume it's a copy/paste translation error).

awk '{C[$NF]++;S+=0.01} END{ for( c in C ) printf( "%s = %2.2f\n", c, C[c]/S)}' infile.txt > outfile.txt

If the letters really are on the line below, as in the sample, add 'NF==1' as the first part of the awk code.
awk match and find mismatch between files and output results
In the below awk I am using $5, $7 and $8 of file1 to search $3, $5 and $6 of file2. The header row is skipped and it then outputs a new file with what lines match and, if they do not match, what file the match is missing from. When I search for one match, use 3 fields for the key for the lookup, and do not skip the header, I get the current output. I apologize for the long post and file examples, just trying to include everything to help get this working. Thank you :).

file1

Index Chromosomal Position Gene Inheritance Start End Ref Alt Func.refGene
98 48719928 FBN1 AD 48719928 48719929 AT - exonic
101 48807637 FBN1 AD 48807637 48807637 C T exonic

file2

R_Index Chr Start End Ref Alt Func.IDP.refGene
36 chr15 48719928 48719929 AT - exonic
37 chr15 48719928 48719928 A G exonic
38 chr15 48807637 48807637 C T exonic

awk

awk -F'\t' '
NR == FNR {
    A[$25]; A[$26]; A[$27]
    next
}
{ B[$3]; B[$5]; B[$6] }
END {
    print "Match" OFS=","
    for ( k in A ) { if ( k && k in B ) printf "%s ", k }
    print "Missing from file1" OFS=","
    for ( k in B ) { if ( ! ( k in A ) ) printf "%s ", k }
    print "Missing from file2" OFS=","
    for ( k in A ) { if ( ! ( k in B ) ) printf "%s ", k }
}
' file1 file2 > list

current output

Match
Missing from file1
A C Ref 48807637 Alt Start T G - AT 48719928
Missing from file2

desired output

Match
48719928 AT -, 48807637 C T
Missing from file1
48719928 A G
Missing from file2
You misunderstand awk syntax and are confusing awk with shell. When you wrote:

A[$25] [$26] [$27]

you probably meant:

A[$25]; A[$26]; A[$27]

(and similarly for B[]) and when you wrote:

IFS=

since IFS is a shell variable, not an awk one, you maybe meant FS= BUT since you're doing that in the END section and not calling split() and so not doing anything that would use FS idk what you were hoping to achieve with that. Maybe you meant:

OFS=

BUT you aren't doing anything that would use OFS and your desired output isn't comma-separated so idk what you'd be hoping to achieve with that either. If that's not enough info for you to solve your problem yourself then reduce your example to something with 10 columns or less so we don't have to read a lot of irrelevant info to help you.
Program 1

This works, except the output format is different from what you request:

awk 'FNR==1 { next }
     FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8 }
     FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
     END {
         print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
         print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
         print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
     }' file1 file2

Output 1

Match:
48807637 C T
48719928 AT -
Missing in file1:
48719928 A G
Missing in file2:

Program 2

If you must have each set of values in a category comma-separated on a single line, then:

awk 'FNR==1 { next }
     FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8 }
     FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
     END {
         printf "Match"
         pad = " "
         for (k in file1) {
             if (k in file2) {
                 printf "%s%s", pad, file1[k]
                 pad = ", "
             }
         }
         print ""
         printf "Missing in file1"
         pad = " "
         for (k in file2) {
             if (!(k in file1)) {
                 printf "%s%s", pad, file2[k]
                 pad = ", "
             }
         }
         print ""
         printf "Missing in file2"
         pad = " "
         for (k in file1) {
             if (!(k in file2)) {
                 printf "%s%s", pad, file1[k]
                 pad = ", "
             }
         }
         print ""
     }' file1 file2

The code is a little bigger, but the format used exacerbates the difference. The change is all in the END block; the other code is unchanged. The sequences of actions in the END block no longer fit comfortably on a single line, so they're spread out for readability. You can apply a liberal smattering of semicolons and concatenate the lines to shrink the apparent size of the program if you desire. It's tempting to try a function for the printing, but the conditions just make it too tricky to be worthwhile, I think — but I'm open to persuasion otherwise.

Output 2

Match 48807637 C T, 48719928 AT -
Missing in file1 48719928 A G
Missing in file2

This output will be a lot harder to parse than the one shown first, so doing anything automatically with it will be tricky. While there are only 3 entries to worry about, the line length isn't an issue. If you get to 3 million entries, the lines become very long and unmanageable.
add a new column to the file based on another file
I have two files, file1 and file2, as shown below. file1 has two columns and file2 has one column. I want to add a second column to file2 based on file1. How can I do this with awk?

file1

2WPN B
2WUS A
2X83 A
2XFG A
2XQR C

file2

2WPN_1
2WPN_2
2WPN_3
2WUS
2X83
2XFG_1
2XFG_2
2XQR

Desired Output

2WPN_1 B
2WPN_2 B
2WPN_3 B
2WUS A
2X83 A
2XFG_1 A
2XFG_2 A
2XQR C

Your help would be appreciated.
awk -v OFS='\t' 'FNR == NR { a[$1] = $2; next } { t = $1; sub(/_.*$/, "", t); print $1, a[t] }' file1 file2

Or:

awk 'FNR == NR { a[$1] = $2; next } { t = $1; sub(/_.*$/, "", t); printf "%s\t%s\n", $1, a[t] }' file1 file2

Output:

2WPN_1 B
2WPN_2 B
2WPN_3 B
2WUS A
2X83 A
2XFG_1 A
2XFG_2 A
2XQR C

You may pass the output to column -t to keep it uniform with spaces and not tabs.
awk improve command - Count & Sum
I would like to get your suggestions to improve this command and remove unwanted work to avoid time consumption. I am trying to find CountOfLines and SumOf$6 grouped by $2, substr($3,4,6), substr($4,4,6), $10, $8, $6. The gzipped input file contains around 300 million rows.

Input.gz

2067,0,09-MAY-12.04:05:14,09-MAY-12.04:05:14,21-MAR-16,600,INR,RO312,20120321_1C,K1,,32
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2104,5,13-JAN-13.01:01:38,,13-JAN-17,4150,INR,RO113,CD1301_RC50_B1_20130113,K2,,21

I am using the below command and it works fine.

zcat Input.gz | awk -F"," '{OFS=","; print $2,substr($3,4,6),substr($4,4,6),$10,$8,$6}' | \
awk -F"," 'BEGIN {count=0; sum=0; OFS=","} {key=$0; a[key]++;b[key]=b[key]+$6} \
END {for (i in a) print i,a[i],b[i]}' >Output.txt

Output.txt

0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150

Any suggestions to improve the above command are welcome.
This seems more efficient:

zcat Input.gz | awk -F, '{key=$2","substr($3,4,6)","substr($4,4,6)","$10","$8","$6;++a[key];b[key]=b[key]+$6}END{for(i in a)print i","a[i]","b[i]}'

Output:

0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150

Uncondensed form:

zcat Input.gz | awk -F, '{
    key = $2 "," substr($3, 4, 6) "," substr($4, 4, 6) "," $10 "," $8 "," $6
    ++a[key]
    b[key] = b[key] + $6
}
END {
    for (i in a)
        print i "," a[i] "," b[i]
}'
You can do this with one awk invocation by redefining the fields according to the first awk script, i.e. something like this:

$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8

No need to change $6 as that is the same field. Now if you base the key on the new fields, the second script will work almost unaltered. Here is how I would write it, moving the code into a script file for better readability and maintainability:

zcat Input.gz | awk -f parse.awk

Where parse.awk contains:

BEGIN { FS = OFS = "," }
{
    $1 = $2
    $2 = substr($3, 4, 6)
    $3 = substr($4, 4, 6)
    $4 = $10
    $5 = $8
    key = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6
    a[key]++
    b[key] += $6
}
END {
    for (i in a)
        print i, a[i], b[i]
}

You can of course still run it as a one-liner, but it will look more cryptic:

zcat Input.gz | awk '{ key = $2 FS substr($3,4,6) FS substr($4,4,6) FS $10 FS $8 FS $6; a[key]++; b[key]+=$6 } END { for (i in a) print i,a[i],b[i] }' FS=, OFS=,

Output in both cases:

0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150