Count and sum column list - awk
Not 100% sure how to do this. What I have does not add up.
awk -F, '{array[$1]+=$2} END { for (i in array) {print i array[i] }}' gaaa
Here is an example of gaaa:
acic 4
acgic 56
acpdc 183
acic 1677
acpicvp
acsis 23
hidr 4
hidr 1133
aggr 24
Desired result would be:
acic 1681
acgic 56
acpdc 183
acpicvp
acsis 23
hidr 1137
aggr 24
You have set the field separator to a comma but there is no comma in your data. You want:
$ awk '{array[$1]+=$2}END{for (i in array) print i,array[i]}' gaaa
acsis 23
aggr 24
acgic 56
acpdc 183
hidr 1137
acpicvp 0
acic 1681
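Note that `for (i in array)` visits keys in an unspecified order, so the summed lines above may come out in any order. A small sketch (recreating a cut-down gaaa inline, purely for the demo) pipes the result through sort to get a deterministic listing:

```shell
# Recreate a cut-down version of the sample file for the demo
printf '%s\n' 'acic 4' 'acgic 56' 'acic 1677' 'hidr 4' 'hidr 1133' > gaaa
# for (i in array) iterates in an unspecified order, so pipe through
# sort to fix the order (here: numerically by the sum, descending)
summed=$(awk '{array[$1]+=$2} END {for (i in array) print i, array[i]}' gaaa | sort -k2,2nr)
echo "$summed"
```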
Related
Using awk to select rows with a specific value in column greater than x
I tried to use awk to select all rows with a value greater than 98 in the third column. In the output, only lines between 98 and 98.99... were selected; lines with a value above 98.99 were not. I would like to extract all lines with a value greater than 98, including 99, 100 and so on. Here is my code and my input format:

for i in *input.file; do awk '$3>98' $i >{i/input./output.}; done

A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20

Expected output:

A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
Okay, if you have a series of files, *input.file, and you want to select the lines where $3 > 98 and then write them to a file with the same prefix but output.file as the rest of the filename, you can use:

awk '$3 > 98 {
    match(FILENAME, /input.file$/)
    print $0 > substr(FILENAME, 1, RSTART-1) "output.file"
}' *input.file

This uses match to find the index where input.file begins, then uses substr to take the part of the filename before that index and appends "output.file" to the substring for the final output filename. match() sets RSTART to the index where input.file begins in the current filename, which substr then uses to truncate the filename at that point. See GNU awk String Functions for complete details.

For example, if you had input files:

$ ls -1 *input.file
v1input.file
v2input.file

both with your example content:

$ cat v1input.file
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20

running the awk command above would result in two output files:

$ ls -1 *output.file
v1output.file
v2output.file

containing the records where the third field was greater than 98:

$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
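For completeness: the shell loop in the original question also fails at the shell level, because the parameter expansion is missing its `$` and quotes. A sketch of the repaired loop, using the same *input.file naming (the POSIX `${i%input.file}output.file` form is used below; the bash-only `"${i/input./output.}"` also works once the `$` and quotes are added):

```shell
# Sample input file, named to match the question's *input.file pattern
printf '%s\n' 'A chr11 98.80 83 1 0 2 84' \
              'B chr7 95.45 22 1 0 40 61' \
              'D chr6 100.00 20 0 0 1 20' > v1input.file
# The original loop redirected to "{i/input./output.}" (a literal
# string); strip the "input.file" suffix and append "output.file"
for i in *input.file; do
  awk '$3 > 98' "$i" > "${i%input.file}output.file"
done
result=$(cat v1output.file)
echo "$result"
```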
Awk script displaying incorrect output
I'm facing an issue in awk script - I need to generate a report containing the lowest, highest and average score for each assignment in the data file. The name of the assignment is located in column 3. Input data is: Student,Catehory,Assignment,Score,Possible Chelsey,Homework,H01,90,100 Chelsey,Homework,H02,89,100 Chelsey,Homework,H03,77,100 Chelsey,Homework,H04,80,100 Chelsey,Homework,H05,82,100 Chelsey,Homework,H06,84,100 Chelsey,Homework,H07,86,100 Chelsey,Lab,L01,91,100 Chelsey,Lab,L02,100,100 Chelsey,Lab,L03,100,100 Chelsey,Lab,L04,100,100 Chelsey,Lab,L05,96,100 Chelsey,Lab,L06,80,100 Chelsey,Lab,L07,81,100 Chelsey,Quiz,Q01,100,100 Chelsey,Quiz,Q02,100,100 Chelsey,Quiz,Q03,98,100 Chelsey,Quiz,Q04,93,100 Chelsey,Quiz,Q05,99,100 Chelsey,Quiz,Q06,88,100 Chelsey,Quiz,Q07,100,100 Chelsey,Final,FINAL,82,100 Chelsey,Survey,WS,5,5 Sam,Homework,H01,19,100 Sam,Homework,H02,82,100 Sam,Homework,H03,95,100 Sam,Homework,H04,46,100 Sam,Homework,H05,82,100 Sam,Homework,H06,97,100 Sam,Homework,H07,52,100 Sam,Lab,L01,41,100 Sam,Lab,L02,85,100 Sam,Lab,L03,99,100 Sam,Lab,L04,99,100 Sam,Lab,L05,0,100 Sam,Lab,L06,0,100 Sam,Lab,L07,0,100 Sam,Quiz,Q01,91,100 Sam,Quiz,Q02,85,100 Sam,Quiz,Q03,33,100 Sam,Quiz,Q04,64,100 Sam,Quiz,Q05,54,100 Sam,Quiz,Q06,95,100 Sam,Quiz,Q07,68,100 Sam,Final,FINAL,58,100 Sam,Survey,WS,5,5 Andrew,Homework,H01,25,100 Andrew,Homework,H02,47,100 Andrew,Homework,H03,85,100 Andrew,Homework,H04,65,100 Andrew,Homework,H05,54,100 Andrew,Homework,H06,58,100 Andrew,Homework,H07,52,100 Andrew,Lab,L01,87,100 Andrew,Lab,L02,45,100 Andrew,Lab,L03,92,100 Andrew,Lab,L04,48,100 Andrew,Lab,L05,42,100 Andrew,Lab,L06,99,100 Andrew,Lab,L07,86,100 Andrew,Quiz,Q01,25,100 Andrew,Quiz,Q02,84,100 Andrew,Quiz,Q03,59,100 Andrew,Quiz,Q04,93,100 Andrew,Quiz,Q05,85,100 Andrew,Quiz,Q06,94,100 Andrew,Quiz,Q07,58,100 Andrew,Final,FINAL,99,100 Andrew,Survey,WS,5,5 Ava,Homework,H01,55,100 Ava,Homework,H02,95,100 Ava,Homework,H03,84,100 Ava,Homework,H04,74,100 Ava,Homework,H05,95,100 
Ava,Homework,H06,84,100 Ava,Homework,H07,55,100 Ava,Lab,L01,66,100 Ava,Lab,L02,77,100 Ava,Lab,L03,88,100 Ava,Lab,L04,99,100 Ava,Lab,L05,55,100 Ava,Lab,L06,66,100 Ava,Lab,L07,77,100 Ava,Quiz,Q01,88,100 Ava,Quiz,Q02,99,100 Ava,Quiz,Q03,44,100 Ava,Quiz,Q04,55,100 Ava,Quiz,Q05,66,100 Ava,Quiz,Q06,77,100 Ava,Quiz,Q07,88,100 Ava,Final,FINAL,99,100 Ava,Survey,WS,5,5 Shane,Homework,H01,50,100 Shane,Homework,H02,60,100 Shane,Homework,H03,70,100 Shane,Homework,H04,60,100 Shane,Homework,H05,70,100 Shane,Homework,H06,80,100 Shane,Homework,H07,90,100 Shane,Lab,L01,90,100 Shane,Lab,L02,0,100 Shane,Lab,L03,100,100 Shane,Lab,L04,50,100 Shane,Lab,L05,40,100 Shane,Lab,L06,60,100 Shane,Lab,L07,80,100 Shane,Quiz,Q01,70,100 Shane,Quiz,Q02,90,100 Shane,Quiz,Q03,100,100 Shane,Quiz,Q04,100,100 Shane,Quiz,Q05,80,100 Shane,Quiz,Q06,80,100 Shane,Quiz,Q07,80,100 Shane,Final,FINAL,90,100 Shane,Survey,WS,5,5

awk script:

BEGIN { FS=" *\\, *" }
FNR>1 {
    min[$3]=(!($3 in min) || min[$3]> $4 )? $4 : min[$3]
    max[$3]=(max[$3]> $4)? max[$3] : $4
    cnt[$3]++
    sum[$3]+=$4
}
END {
    print "Name\tLow\tHigh\tAverage"
    for (i in cnt)
        printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}

Expected sample output:

Name Low High Average
Q06 77 95 86.80
L05 40 96 46.60
WS 5 5 5
Q07 58 100 78.80
L06 60 99 61
L07 77 86 64.80

When I run the script, I get a "Low" of 0 for all assignments, which is not correct. Where am I going wrong? Please guide.
You can certainly do this with awk, but since you tagged this scripting as well, I'm assuming other tools are an option. For this sort of gathering of statistics on groups present in the data, GNU datamash often reduces the job to a simple one-liner. For example:

$ (echo Name,Low,High,Average; datamash --header-in -s -t, -g3 min 4 max 4 mean 4 < input.csv) | tr , '\t'
Name Low High Average
FINAL 58 99 85.6
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2
H04 46 80 65
H05 54 95 76.6
H06 58 97 80.6
H07 52 90 67
L01 41 91 75
L02 0 100 61.4
L03 88 100 95.8
L04 48 100 79.2
L05 0 96 46.6
L06 0 99 61
L07 0 86 64.8
Q01 25 100 74.8
Q02 84 100 91.6
Q03 33 100 66.8
Q04 55 100 81
Q05 54 99 76.8
Q06 77 95 86.8
Q07 58 100 78.8
WS 5 5 5

This says that for each group with the same value in the 3rd column (-g3, plus -s to sort the input, a requirement of the tool) of simple CSV input (-t,) with a header (--header-in), display the minimum, maximum, and mean of the 4th column. It's all given a new header and piped to tr to turn the commas into tabs.
Your code works as-is with GNU awk. However, running it with the -t option to warn about non-portable constructs gives:

awk: foo.awk:6: warning: old awk does not support the keyword `in' except after `for'
awk: foo.awk:2: warning: old awk does not support regexps as value of `FS'

And running the script with a different implementation of awk (mawk in my case) does give 0's for the Low column. So, some tweaks to the script:

BEGIN { FS="," }
FNR>1 {
    min[$3]=(cnt[$3] == 0 || min[$3]> $4 )? $4 : min[$3]
    max[$3]=(max[$3]> $4)? max[$3] : $4
    cnt[$3]++
    sum[$3]+=$4
}
END {
    print "Name\tLow\tHigh\tAverage"
    PROCINFO["sorted_in"] = "#ind_str_asc" # gawk-ism for pretty output; ignored on other awks
    for (i in cnt)
        printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}

and it works as expected on that other awk too. The changes:

Using a simple comma as the field separator instead of a regex.

Changing the min conditional to assign the current value the first time an assignment is seen, by checking whether cnt[$3] is equal to 0 (which it will be the first time, because that value is incremented on a later line), or whether the current min is greater than this value.
another similar approach

$ awk -F, 'NR==1 {print "name","low","high","average"; next}
    {k=$3; sum[k]+=$4; count[k]++}
    !(k in min) {min[k]=max[k]=$4}
    min[k]>$4 {min[k]=$4}
    max[k]<$4 {max[k]=$4}
    END {for(k in min) print k,min[k],max[k],sum[k]/count[k]}' file | column -t
name  low  high  average
Q06   77   95    86.8
L05   0    96    46.6
WS    5    5     5
Q07   58   100   78.8
L06   0    99    61
L07   0    86    64.8
H01   19   90    47.8
H02   47   95    74.6
H03   70   95    82.2
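The fix shared by both answers is seeding min[] the first time a key is seen, rather than letting it start at awk's implicit 0. A stripped-down sketch of just that idiom (made-up key and values, not from the question's data):

```shell
# Tiny made-up input: one key with values 5, 3, 9
printf '%s\n' 'x,5' 'x,3' 'x,9' > scores.csv
# Seed min/max on first sight of a key; without the !(k in min) guard,
# min[k] would start out as the implicit 0 and never be beaten
range=$(awk -F, '!($1 in min) { min[$1] = max[$1] = $2 }
     $2 < min[$1] { min[$1] = $2 }
     $2 > max[$1] { max[$1] = $2 }
     END { for (k in min) print k, min[k], max[k] }' scores.csv)
echo "$range"
```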
join the contents of files into a new file
I have some text files as shown below. I would like to join the contents of these files into one.

file A
>AXC
145
146
147
>SDF
 1
 8
 67
>FGH

file B
>AXC
>SDF
12
65
>FGH
123
156
190

Desired output, new file:

>AXC
145
146
147
>SDF
 1
 8
 67
12
65
>FGH
123
156
190

Your help would be appreciated!
awk '
    /^>/ { key=$0; if (!seen[key]++) keys[++numKeys] = key; next }
    { vals[key] = vals[key] ORS $0 }
    END {
        for (keyNr=1; keyNr<=numKeys; keyNr++) { key = keys[keyNr]; print key vals[key] }
    }
' fileA fileB
>AXC
145
146
147
>SDF
1
8
67
12
65
>FGH
123
156
190

If you really want the leading white space added to the ">SDF" values from fileA, tell us why that's the case for that one but not ">AXC" so we can code an appropriate solution.
A bit shorter than Ed's answer:

awk '/^>/{a=$0;next}{x[a]=x[a]$0"\n"}END{for(i in x)printf"%s\n%s",i,x[i]}' fileA fileB

Blocks will be printed in an unspecified order.
RS=">" separates records on the > character. OFS="\n" puts each number on its own line. a[i]=a[i] $0 appends the fields into the array, indexed by the first field. rt=RT is for adding the > character back to the index (RT is a GNU awk extension).

$ awk 'BEGIN{ RS=">"; OFS="\n" } {i=rt $1; $1=""; a[i]=a[i] $0; rt=RT; next} END { for (i in a) {print i a[i] }}' d6 d5
>SDF
12
65
1
8
67
>FGH
123
156
190
>AXC
145
146
147
compare between two columns and subtract them
my question: I have one file

344 0
465 1
729 2
777 3
676 4
862 5
766 0
937 1
980 2
837 3
936 5

I need to compare each pair (zero with zero, one with one, and so on). If the value exists (any value of column two should exist two times), subtract: 766-344, 937-465, and so on. If it does not exist, like the fourth value, do nothing (4 exists one time, so do nothing). The output:

422
472
251
060
074

I also need to add an index, for example:

1 422
2 472
3 251
4 060
5 074

Finally I need to add this code as part of a tcl script, or a function of a tcl program. I have a tcl script containing awk functions like this:

set awkCBR0 {
    {
        if ($1 == "r" && $6 == 280) {
            print $2, i >> "cbr0.q";
            i +=1 ;
        }
    }
}
exec rm -f cbr0.q
exec touch cbr0.q
exec awk $awkCBR0 cbr.trq

thanks
Try this:

awk 'a[$2]{printf "%d %03d\n",++x,$1-a[$2];next}{a[$2]=$1}' file

Output:

$ awk 'a[$2]{printf "%d %03d\n",++x,$1-a[$2];next}{a[$2]=$1}' file
1 422
2 472
3 251
4 060
5 074

I will leave it for you to add it to the tcl function.
Obtaining "consensus" results from two different files using awk
I have file1 as a result of a first operation; it has the following structure:

201 12 0.298231 8.8942
206 13 -0.079795 0.6367
101 34 0.86348 0.7456
301 15 0.215355 4.6378
303 16 0.244734 5.9895

and file2 as a result of a different operation, with the same type of structure. File2 sample:

204 60 -0.246038 6.0535
304 83 -0.246209 6.0619
101 34 -0.456629 6.0826
211 36 -0.247003 6.1011
305 83 -0.247134 6.1075
206 46 -0.247485 6.1249
210 39 -0.248066 6.1537
107 41 -0.248201 6.1603
102 20 -0.248542 6.1773

I would like to select the field 1 and 2 pairs that have a field 3 value higher than a threshold in file1 (0.8), then, for those selected pairs, keep the ones that also have a field 3 value higher than another threshold in file2 (abs(x)=0.4). Note that although file1 and file2 have the same structure, the field 1 and 2 values are not the same (not the same number of lines, etc.). Can you do this with awk?

Desired output:

101 34
If you combine awk with unix commands you can do the following:

sort file1.txt > sorted1.txt
sort file2.txt > sorted2.txt

Sorting will allow you to use join on the first field (which I assume is unique). After the join, field 3 of file1 is $3 and field 3 of file2 is $6. Using awk you can write the following:

join sorted1.txt sorted2.txt | awk 'function abs(value){return (value<0?-value:value)} $3 >= 0.8 && abs($6) >= 0.4 {print $1"\t"$2}'

(Note the condition must guard the print; an unconditional {print $1"\t"$2} block would emit every joined line.) In essence, in the awk you first write a function to deal with absolute values, then you simply print fields 1 and 2 for the lines meeting the criteria you detailed on $3 and $6 (formerly field 3 of file1 and file2 respectively).

Hope this helps...
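Since the question asks whether this can be done in awk alone, here is a two-pass sketch that skips the sort/join step entirely: remember the qualifying $1/$2 pairs from file1, then print the pairs that also clear the file2 threshold (the sample data below is cut down from the question):

```shell
# Cut-down copies of the two files from the question
printf '%s\n' '201 12 0.298231 8.8942' '101 34 0.86348 0.7456' > file1
printf '%s\n' '204 60 -0.246038 6.0535' '101 34 -0.456629 6.0826' > file2
# NR==FNR is true only while reading the first file on the command
# line; "keep" records the pairs whose $3 clears the 0.8 threshold
hits=$(awk 'function abs(v) { return v < 0 ? -v : v }
     NR == FNR { if ($3 > 0.8) keep[$1 " " $2] = 1; next }
     ($1 " " $2) in keep && abs($3) >= 0.4 { print $1, $2 }' file1 file2)
echo "$hits"
```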