Count occurrences of a value in multiple fields independently (awk)

I have seen numerous posts that achieve this task for individual fields, but I am struggling to apply it to multiple fields separately.
input:
group1|apple|orange|lemon
group1|apple|kiwi|banana
group1|orange|cherry|lemon
group1|apple|orange|pear
(The real file has many more fields, so I need to use a loop to process each field.)
output:
Field|Fruit|Count
2|apple|3
2|orange|1
3|orange|2
3|kiwi|1
3|cherry|1
4|lemon|2
4|banana|1
4|pear|1
What I tried so far, but it returns a combined count across all fields instead of per-field counts:
awk '
BEGIN { FS = OFS = "|"; print "Field|Fruit|Count" }
{
    for (i = 2; i <= NF; i++) {
        a[$i] = $i
        count[$i]++
    }
}
END {
    for (j in count) print j OFS count[j]
}'

Use the field number as part of the key in the count array.
awk '
BEGIN { FS = OFS = "|"; print "Field|Fruit|Count" }
{
    for (i = 2; i <= NF; i++) {
        count[i OFS $i]++
    }
}
END {
    for (j in count) {
        print j, count[j]
    }
}'
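
Note that for (j in count) visits keys in an unspecified order, so the rows may not come out grouped by field as in the desired output. If GNU awk is available (an assumption, not something the question requires), you can ask for a sorted traversal via PROCINFO — a minimal sketch, where input stands for your data file:
gawk '
BEGIN { FS = OFS = "|"; print "Field|Fruit|Count" }
{
    for (i = 2; i <= NF; i++) count[i OFS $i]++
}
END {
    # gawk-only: iterate indices in string order; note that "10|..." would
    # sort before "2|..." if the file had ten or more fields
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (j in count) print j, count[j]
}' input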

Related

Counting unique occurrences in each column

I have a file with several columns, like $2 and $3 (up to $32), as in
A refdevhet devdevhomo
B refdevhet refdevhet
C refrefhomo refdevhet
D devrefhet refdevhet
I need to count the occurrences of each unique element in each column separately, so that I have
refdevhet 2 3
refrefhomo 1 0
devrefhet 1 0
devdevhomo 0 1
I tried several variations of
awk 'BEGIN {
    FS = OFS = "\t"
}
{
    for (i = 1; i <= 32; i++) a[$i]++
}
END {
    for (i in a) print i, a[i]
}' file
but instead it's printing the cumulative sum of occurrences of unique elements across the selected fields.
Here is a solution:
BEGIN {
    FS = OFS = "\t"
}
{
    if (NF > mxf) mxf = NF
    for (i = 1; i <= NF; i++) { ws[$i] = 1; c[$i, i]++ }
}
END {
    for (w in ws) {
        printf "%s", w
        for (i = 1; i <= mxf; i++) printf "%s%d", OFS, c[w, i]
        print ""
    }
}
Notice that this solution is general: it takes the first column into consideration as well. To omit the first column, change i=1 to i=2 in both places.
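For the sample file, with both loops started at i=2, the output matches the desired table (up to row order, since for (w in ws) is unordered):
refdevhet	2	3
refrefhomo	1	0
devrefhet	1	0
devdevhomo	0	1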
In addition to @Andriy's good answer, with GNU awk you can use a true two-dimensional array:
gawk '
{ for (i = 2; i <= NF; i++) count[$i][i]++ }
END {
    for (word in count) {
        printf "%s", word
        for (i = 2; i <= NF; i++) printf "%s%d", OFS, count[word][i]
        print ""
    }
}
' file | column -t
I'm assuming here that each line has the same number of fields as the last line.
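If that doesn't hold, the assumption can be dropped by tracking the maximum NF seen — a sketch, still gawk-specific because of the true multidimensional array:
gawk '
{
    if (NF > maxnf) maxnf = NF
    for (i = 2; i <= NF; i++) count[$i][i]++
}
END {
    for (word in count) {
        printf "%s", word
        # print 0 for columns where this word never appeared
        for (i = 2; i <= maxnf; i++)
            printf "%s%d", OFS, (i in count[word]) ? count[word][i] : 0
        print ""
    }
}
' file | column -t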

Awk how to negate a condition

I'm trying to compute some stuff in awk and, at the end, print the results in the order of the input. For each line, I check whether it has already been seen. If not, I add it to the seen array and also store it in an order array.
{
    if (! $0 in seen) {
        seen[$0] = 1
        order[o++] = $0
    }
}
END {
    for (i = 0; i < o; i++)
        printf "%s\n", order[i]
}
You can try it with
printf 'a\nb\na\nc\nb\na\n' | awk script_above
It prints nothing. If I print the variable o at the end, it shows that its value is still 0. What am I doing wrong?
You just need to add parens to get the right operator precedence*:
# a.awk
{
    if (!($0 in seen)) {
        seen[$0] = 1
        order[o++] = $0
    }
}
END {
    for (i = 0; i < o; i++)
        printf "%s\n", order[i]
}
Test:
$ awk -f a.awk file
a
b
c
* (The unary ! binds more tightly than the in operator: https://www.gnu.org/software/gawk/manual/html_node/Precedence.html)
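You can see the misparse directly: because in binds loosely, !$0 in seen tests whether the key 0 (or 1) exists in the array, not whether the current line does:
$ echo a | awk '{ seen["x"] = 1
                  u = !$0 in seen        # parsed as (!$0) in seen: tests key 0
                  v = !($0 in seen)      # the intended membership test
                  print u, v }'
0 1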
What you are trying to do is the shell way of thinking; awk has an idiomatic way to check whether an element is already in an array. Try the following:
printf 'a\nb\na\nc\nb\na\n' | awk '
!seen[$0]++ {
order[o++] = $0
}
END {
for (i=0; i<o; i++)
printf "%s\n", order[i]
}'
Here !seen[$0]++ is true only the first time a given line is seen: the test checks whether seen[$0] is still zero (the line has not appeared before), and the post-increment ++ then bumps the counter, so the same condition is false for every later occurrence of that line.
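Tracing the sample input a b a c b a makes the mechanics concrete:
$0   seen[$0] before test   !seen[$0]   action
a    0 (unset)              true        stored in order
b    0 (unset)              true        stored in order
a    1                      false       skipped
c    0 (unset)              true        stored in order
b    1                      false       skipped
a    2                      false       skipped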

Awk: average of every n rows in each column

"Using awk to bin values in a list of numbers" provide a solution to average each set of 3 points in a column using awk.
How can it be extended to an arbitrary number of columns while maintaining the format? For example:
2457135.564106 13.249116 13.140903 0.003615 0.003440
2457135.564604 13.250833 13.139971 0.003619 0.003438
2457135.565067 13.247932 13.135975 0.003614 0.003432
2457135.565576 13.256441 13.146996 0.003628 0.003449
2457135.566039 13.266003 13.159108 0.003644 0.003469
2457135.566514 13.271724 13.163555 0.003654 0.003476
2457135.567011 13.276248 13.166179 0.003661 0.003480
2457135.567474 13.274198 13.165396 0.003658 0.003479
2457135.567983 13.267855 13.156620 0.003647 0.003465
2457135.568446 13.263761 13.152515 0.003640 0.003458
Averaging values every 5 lines should output something like
2457135.564916 13.253240 13.143976 0.003622 0.003444
2457135.567324 13.270918 13.161303 0.003652 0.003472
where the first result is the average of lines 1-5, and the second is the average of lines 6-10.
The accepted answer to Using awk to bin values in a list of numbers is:
awk '{sum+=$1} NR%3==0 {print sum/3; sum=0}' inFile
The obvious extension to average all the columns is:
awk 'BEGIN { N = 3 }
     { for (i = 1; i <= NF; i++) sum[i] += $i }
     NR % N == 0 {
         for (i = 1; i <= NF; i++) {
             printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
             sum[i] = 0
         }
     }' inFile
The extra flexibility here is that if you want to group blocks of 5 rows, you simply change one occurrence of 3 into 5. This ignores blocks of up to N-1 rows at the end of the file; if you want, you can add an END block that prints a suitable average when NR % N != 0, as sketched below.
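Such an END block might look like this — a sketch that reuses the same sum array and divides by the size of the leftover block:
END {
    r = NR % N
    if (r > 0)
        for (i = 1; i <= NF; i++) {   # NF still holds the last line's field count here
            printf("%.6f%s", sum[i]/r, (i == NF) ? "\n" : " ")
            sum[i] = 0
        }
}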
For the sample input data, the output I got from the main script above (with N = 3) was:
2457135.564592 13.249294 13.138950 0.003616 0.003437
2457135.566043 13.264723 13.156553 0.003642 0.003465
2457135.567489 13.272767 13.162732 0.003655 0.003475
You can make the code much more complex if you want to analyze what the output formats should be. I've simply used %.6f to ensure 6 decimal places.
If you want N to be a command-line parameter, you can use the -v option to relay the variable setting to awk:
awk -v N="${variable:-3}" \
    '{ for (i = 1; i <= NF; i++) sum[i] += $i }
     NR % N == 0 {
         for (i = 1; i <= NF; i++) {
             printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
             sum[i] = 0
         }
     }' inFile
When invoked with $variable set to 5, the output generated from the sample data is:
2457135.565078 13.254065 13.144591 0.003624 0.003446
2457135.567486 13.270757 13.160853 0.003652 0.003472

How to find which columns match a value in awk

I am testing awk and had this thought. We know that
raja#badfox:~/Desktop/trails$ cat num.txt
1 2 3 4
1 2 3 4
4 1 2 31
raja#badfox:~/Desktop/trails$ awk '{ if ($1 == '4') print $0}' num.txt
4 1 2 31
raja#badfox:~/Desktop/trails$
So the command checks for 4 in the 1st column of num.txt.
Now I want the output to account for the 4 in column 4 as well. For example, if I have 100 columns of information, I want to find out which columns contain the term I am searching for.
From the example above, searching for 4, I want column 1 and column 4 as the output.
If you are trying to find the rows which contain your search item (in this case, the value 4), and you want a count of how many such values appear in the row (as well as the row's data), then you need something like:
awk '{ count = 0
       for (i = 1; i <= NF; i++) if ($i == 4) count++
       if (count) print count ": " $0   # prefix the line with its match count
     }'
That doesn't quite fit onto one SO line.
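Run against the sample num.txt above, each line happens to contain exactly one 4, so it prints:
$ awk '{ count = 0
         for (i = 1; i <= NF; i++) if ($i == 4) count++
         if (count) print count ": " $0
       }' num.txt
1: 1 2 3 4
1: 1 2 3 4
1: 4 1 2 31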
If you merely want to identify which columns contain the search value, then you can use:
awk '{ for (i = 1; i <= NF; i++) if ($i == 4) column[i] = 1 }
END { for (i in column) print i }'
This sets the (associative) array element column[i] to 1 for each column that contains the search value, 4. The loop at the end prints the column numbers that contain 4 in an indeterminate (not sorted) order. GNU awk includes sort functions (asort, asorti); POSIX awk does not. If sorted order is crucial, then consider:
awk 'BEGIN { max = 0 }
     { for (i = 1; i <= NF; i++) if ($i == 4) { column[i] = 1; if (i > max) max = i } }
     END { for (i = 1; i <= max; i++) if (column[i] == 1) print i }'
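With GNU awk, the max-tracking can be replaced by the asorti function mentioned above — a sketch, assuming gawk 4.0 or later for the three-argument form; for num.txt it prints 1 and 4:
gawk '{ for (i = 1; i <= NF; i++) if ($i == 4) column[i] = 1 }
      END {
          n = asorti(column, sorted, "@ind_num_asc")   # sort indices numerically
          for (k = 1; k <= n; k++) print sorted[k]
      }' num.txt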
You are looking for the NF variable. It's the number of fields in the line.
Here's an example of how to use it:
{
    if (NF == 8) {
        print $3, $8
    } else if (NF == 9) {
        print $3, $9
    }
}
Or inside a loop:
# This will print the line (once) if any field has the value 4
{
    for (i = 1; i <= NF; i++) {
        if ($i == 4) {
            print $0
            break   # stop at the first match so a line with several 4s prints only once
        }
    }
}
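Wrapped into a complete command and run against num.txt (every line there contains a 4):
$ awk '{ for (i = 1; i <= NF; i++) if ($i == 4) { print; break } }' num.txt
1 2 3 4
1 2 3 4
4 1 2 31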

Awk Iterate through several Arrays in a for loop

I have created an awk program to go through the columns of a file, count each distinct word, and then output the totals into separate files:
awk -F"$delim" '{Field_Arr1[$1]++; Field_Arr2[$2]++; Field_Arr3[$3]++; Field_Arr4[$4]++}
END {
    # output fields
    out_field1 = "top_field1"
    out_field2 = "top_field2"
    out_field3 = "top_field3"
    out_field4 = "top_field4"
    for (i = 1; i <= NF; i++)
    {
        for (element in Field_Arr$i)
        {
            print element "\t" Field_Arr$i[element] >> out_field$i
        }
    }
}' inputfile
but I don't know the appropriate syntax to make the for loop iterate through Field_Arr1, Field_Arr2, Field_Arr3, and Field_Arr4.
I have tried using: i, $i, ${i}, {i}, "$i", and "i".
Am I taking the wrong approach, or is there a way to make Field_Arr$i expand to Field_Arr1..4?
Thanks for the advice.
awk variables don't work that way; you'll have to do them individually by name, or use fake multidimensional arrays and parse out the components, something along the lines of:
{ Field_Arr[1, $1]++; Field_Arr[2, $2]++; Field_Arr[3, $3]++; Field_Arr[4, $4]++ }
END {
    for (elt in Field_Arr) {
        split(elt, ec, SUBSEP)
        print ec[2] "\t" Field_Arr[elt] >> ("top_field" ec[1])
    }
}
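A quick end-to-end demonstration with a made-up two-column comma-separated input (the line order inside each output file may vary, since for (elt in ...) is unordered):
$ printf 'a,x\nb,x\na,y\n' | awk -F',' '
    { Field_Arr[1, $1]++; Field_Arr[2, $2]++ }
    END {
        for (elt in Field_Arr) {
            split(elt, ec, SUBSEP)   # ec[1] = field number, ec[2] = word
            print ec[2] "\t" Field_Arr[elt] >> ("top_field" ec[1])
        }
    }'
$ cat top_field1
a	2
b	1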
To count the frequencies for each column (3 in my example), try this:
# Print one frequency block: a title line, then word/count pairs
function p_array(t, a,   i) {   # i is a local loop variable
    print t
    for (i in a) {
        print i, a[i]
    }
}
{
    c1[$1]++   # frequencies for column 1
    c2[$2]++   # frequencies for column 2
    c3[$3]++   # frequencies for column 3
}
END {
    p_array("1st col", c1)
    p_array("2nd col", c2)
    p_array("3rd col", c3)
}
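For instance, saved as freq.awk (a hypothetical file name) and fed three columns, it prints one frequency block per column; within each block the order is unspecified:
$ printf 'a x 1\nb x 2\na y 1\n' | awk -f freq.awk
1st col
a 2
b 1
2nd col
x 2
y 1
3rd col
1 2
2 1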