Counting the number of lines in each column - awk

Is it possible to count the number of items in each column of a file? For example, I've been trying to use awk to separate columns on the semicolon, select each column individually and use the wc command to count all occurrences within that column.
With the command below I am trying to find the number of items in column 3 without counting blank entries. Unfortunately, this command just counts every line in the file. I could move the column to a different file and count that file, but is there a quicker way of going about this?
awk -F ';' '{print $3}' file.txt | wc -l
Data file format
; 1 ; 2 ; 3 ; 4 ; 5 ; 6 ;
; 3 ; 4 ; 5 ; 6 ; ; 4 ;
; ; 3 ; 5 ; 6 ; 9 ; 8 ;
; 1 ; 6 ; 3 ; ; ; 4 ;
; 2 ; 3 ; ; 3 ; ; 5 ;
Example output wanted
Column 1 = 4 (i.e. the four non-blank values 1, 3, 1, 2)
Column 2 = 5
Column 3 = 4
Column 4 = 4
Column 5 = 2
Column 6 = 5

Keep separate counts for each field using an array, then print the totals when you're done:
$ awk -F' *; *' '{ for (i = 2; i < NF; ++i) if ($i != "") ++count[i] }
END { for (i = 2; i < NF; ++i) print "Column", i-1, "=", count[i] }' file
Column 1 = 4
Column 2 = 5
Column 3 = 4
Column 4 = 4
Column 5 = 2
Column 6 = 5
Set the field separator to consume the semicolons as well as any surrounding spaces.
Loop through each field (except the first and last ones, which are always empty) and increment a counter for non-empty fields.
It would be tempting to use if ($i), but this would fail for a column containing a 0.
Print the counts in the END block, offsetting by -1 to start from 1 instead of 2.
One assumption made here is that the number of columns in each line is uniform throughout the file, so that NF from the last line can safely be used in the END block.
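If that assumption does not hold and the number of columns varies from line to line, a small variation (an untested sketch, not part of the original answer) can track the widest row seen instead of relying on NF in the END block:
awk -F' *; *' '{
    if (NF > maxnf) maxnf = NF          # remember the widest row seen
    for (i = 2; i < NF; ++i)
        if ($i != "") ++count[i]        # count non-empty cells per column
}
END {
    for (i = 2; i < maxnf; ++i)
        print "Column", i-1, "=", count[i]+0
}' file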
A slight variation, using a simpler field separator:
$ awk -F';' '{ for (i = 2; i < NF; ++i) count[i] += ($i ~ /[^ ]/) }
END { for (i = 2; i < NF; ++i) print "Column", i-1, "=", count[i] }' file
$i ~ /[^ ]/ evaluates to 1 if there are any non-space characters in the ith field, and 0 otherwise.
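Coming back to the original single-column attempt: if only one column is needed, filtering out blank cells before counting also works. A rough sketch, assuming the spacing shown in the sample data (with a plain ';' separator, column 3 is awk's field 4 because of the leading semicolon):
$ awk -F';' '$4 ~ /[^ ]/' file.txt | wc -l
4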

Related

How to use awk to sort records in a csv file

I would like to use a shell script (awk) to sort records in a CSV file. Here is what my file looks like:
A B C D E F G H I J K L M N
1 14000 37.1 425.9 1 12687 1 425 2
2 14000 41.0 4280 1 4292 1 4266.1 1 425.9 1 425 1 180 1
3 14000 37.1 425.9 1 12687 1 425 2
4 14000 192.1 180 1 12687 1 425.9 1 425 1 90 1
Sorting within each row, in descending order of power.
Skipping columns A & B
Columns C & D is a pair.
Columns E & F is a pair.
Columns G & H is a pair.
Columns I & J is a pair.
Columns K & L is a pair.
Columns M & N is a pair.
The number of columns may be more or fewer, depending on the results I get, but they will always come in pairs.
For example:
Looking at row 1:
A B C D E F G H
14000 37.1 425.9 1 12687 1 425 2
The output of row 1 will be:
14000 37.1 12687 1 425.9 1 425 2
Column E (12687) is bigger than column C (425.9), so move columns E & F to columns C & D.
I've thought about storing the pairs in a string array first and then sorting, but I have no idea how to implement it. Can anyone give me ideas?
[Update]
I've solved the problem. Here is my solution:
#!/bin/bash
[ -f columnsOneTwo.csv ] && rm -fv columnsOneTwo.csv
[ -f eachpair.csv ] && rm -fv eachpair.csv
[ -f sorted.csv ] && rm -fv sorted.csv
filename="matched.csv"
linenum=1
while IFS='' read -r line || [[ -n "$line" ]]
do
    #print the content in line
    echo "$line"
    #calculate how many columns in each line
    columns=`awk -F ',' -v v=$linenum 'NR==v{print NF}' matched.csv`
    echo $line | cut -d ',' -f1-2 > columnsOneTwo.csv
    getPairs=`echo $line | cut -d ',' -f3-`
    #get each pair from getPairs
    #start from the third column
    for i in `seq 1 2 $((columns-2))`
    do
        getEachPair=`echo $getPairs | cut -d ',' -f$i-$((i+1))` #C&D, E&F ...
        echo $i ":" $getEachPair
        #Deal with each pair, store in eachpair.csv
        echo $getEachPair >> eachpair.csv
    done
    #Integrate into a single file (sorted.csv)
    cat columnsOneTwo.csv eachpair.csv | sort -n -k1nr,1 -t ',' | tr '\n' ',' >> sorted.csv
    echo "" >> sorted.csv
    linenum=$((linenum+1))
    rm -rfv eachpair.csv #remove old each pair file
done < $filename
That script executes an awful lot of commands to achieve what a single awk command can achieve. The awk program below doesn't use a fancy sort algorithm, but it can easily be improved.
This is the program file (script.awk):
{
    #print NR ": [", $0, "]"
    f12 = $1 FS $2
    n = 0
    for (i = 3; i < NF; i += 2)
    {
        p1[n] = $i
        p2[n++] = $(i+1)
    }
    #for (i = 0; i < n; i++)
    #    print p1[i], p2[i]
    for (i = 0; i < n - 1; i++)
    {
        for (j = i + 1; j < n; j++)
        {
            if (p1[i] < p1[j] || (p1[i] == p1[j] && p2[i] < p2[j]))
            {
                t = p1[i]
                p1[i] = p1[j]
                p1[j] = t
                t = p2[i]
                p2[i] = p2[j]
                p2[j] = t
            }
        }
    }
    #printf "%d: [%s", NR, f12
    printf "%s", f12
    for (i = 0; i < n; i++)
        printf "%s%s%s%s", FS, p1[i], FS, p2[i]
    print ""
    #print "]"
}
This is the data file (data). It is the data shown in the question converted back to a comma-separated format.
14000,37.1,425.9,1,12687,1,425,2
14000,41.0,4280,1,4292,1,4266.1,1,425.9,1,425,1,180,1
14000,37.1,425.9,1,12687,1,425,2
14000,192.1,180,1,12687,1,425.9,1,425,1,90,1
This is the command executed and the output:
$ awk -F, -f script.awk data
14000,37.1,12687,1,425.9,1,425,2
14000,41.0,4292,1,4280,1,4266.1,1,425.9,1,425,1,180,1
14000,37.1,12687,1,425.9,1,425,2
14000,192.1,12687,1,425.9,1,425,1,180,1,90,1
$
For each line of input, the script captures fields A & B into f12. It saves the remaining pairs into the p1 and p2 arrays. It then sorts the arrays in parallel, in descending order of values in p1 and (in case of ties on p1) descending order of p2. It then prints the result line. The comment lines show the debug printing used during development.
The script has no error checking. It blithely assumes that the data is correctly formatted and uses only numeric data in each field.
The code shown in the question runs many commands for each line of input. That will inevitably be slower (probably by an order of magnitude or more) than the single awk script. Casual testing with the script in the question showed it took around 0.155 to 0.160s to process the 4 lines of data. Equally casual testing with the single awk script showed it took around 0.003 to 0.004s to process the same 4 lines of data. That's close to 40 times faster. Granted, there's quite a lot of verbose debugging output in the script in the question. Nevertheless, the results are indicative. I also tested a file with 40 lines of data (10 copies of the 4 lines shown). The bash script took between 1.4 and 1.5 seconds to run; the awk script took between 0.004 and 0.005 seconds to run. YMMV!
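As a footnote to the "can easily be improved" remark: with GNU awk, the hand-rolled swap sort could be replaced by gawk's sorted array traversal. This is only a sketch, not part of the tested script above; it assumes gawk 4.0 or later for PROCINFO["sorted_in"] with a user-defined comparison function:
# sort_pairs.gawk -- gawk-only sketch; run as: gawk -f sort_pairs.gawk data
function cmp_pairs(i1, v1, i2, v2,    a, b) {
    split(v1, a, SUBSEP)
    split(v2, b, SUBSEP)
    if (a[1] + 0 != b[1] + 0)
        return (b[1] + 0) - (a[1] + 0)   # descending on the first of each pair
    return (b[2] + 0) - (a[2] + 0)       # tie-break: descending on the second
}
BEGIN { FS = OFS = ","; PROCINFO["sorted_in"] = "cmp_pairs" }
{
    out = $1 OFS $2
    delete pair
    for (i = 3; i < NF; i += 2)
        pair[i] = $i SUBSEP $(i+1)       # keep each value/count pair together
    for (k in pair) {                    # traversal order comes from cmp_pairs
        split(pair[k], p, SUBSEP)
        out = out OFS p[1] OFS p[2]
    }
    print out
}
For the sample data this should produce the same output as script.awk, though the explicit nested-loop sort above has the advantage of working in any POSIX awk.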

How do I compare alphanumeric characters in non-sequential order?

Currently I am using an awk script which compares numbers in non-sequential order and prints the difference. It works pretty well for numbers, but if I have alphanumeric values it doesn't seem to work well.
In its current state, apart from simply comparing the numbers, it does two additional things:
It accounts for leading zeros before a number or value and compares only the absolute values, ignoring those leading zeros.
If the same number or value occurs multiple times in both files, it outputs the additional occurrences.
I just want the script to work for alphanumeric values as well, since currently it only seems to work with plain numbers. Can someone please edit the script to produce the desired output while also preserving the above two behaviours?
Current script
awk '{k=$0+0}
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}' file1 file2
Example below
cat file1
1
01
001
8
2B
12
13C
027B
0027B
cat file2
1
2
08
12
13C
02B
9
27B
Expected output/result
1
1
2
9
27B
Explanation of expected output
In file1: "1", "01", "001" all evaluate to 1, so 1 occurs 3 times
In file2: "1" is present only once
Hence "1" is present twice in the result (3 - 1 times)
"2" and "9" are present only in file2, so both simply form part of the output
In file1: "027B", "0027B" both evaluate to 27B, so 27B occurs 2 times
In file2: "27B" is present only once
Hence "27B" is present once in the result (2 - 1 times)
Explanation of matched items (ones not forming part of the expected output)
"8" from file1 (line 4) is matched with "08" from file2 (line 3)
"12" from file1 (line 6) is matched with "12" from file2 (line 4)
"13C" from file1 (line 7) is matched with "13C" from file2 (line 5)
"2B" from file1 (line 5) is matched with "02B" from file2 (line 6)
Lastly, the items in the expected output should be in ascending order, as shown in my example above; if, say, the example also had a 3 in the expected output, it should read vertically as 1 1 2 3 9 27B
It should be enough to remove leading zeros when forming the key (with a special case for zero values like 0000):
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}
$ awk -f a.awk file1 file2
2
9
27B
1
1
RE-EDIT
If you just want the values sorted numerically, pipe into sort:
$ awk -f a.awk file1 file2 | sort -n
1
1
2
3
4
5
9
27B
To output the values in the order found in file2, remember that order in another array and do all the printing in the END block. This version prints the values in file2 order, with any values present only in file1 printed last.
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
{ b[FNR] = k }
!(k in a && a[k]--) { a[k] = 1 }
END {
for (i=1; i<=FNR; ++i) {
k = b[i]
while(a[k]-->0) print k
}
for (k in a) {
while(a[k]-->0) print k
}
}
$ awk -f a.awk file1 file2
1
1
2
9
27B
3
4
5
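As a side note, the leading-zero normalisation used in both versions can be pulled into a small helper, which makes the intent a little clearer. This is just a sketch of the first version rewritten that way (the helper name norm is mine, not from the answer):
function norm(s) {
    sub(/^0+/, "", s)              # strip leading zeros: "0027B" -> "27B", "08" -> "8"
    return (s == "" ? "0" : s)     # an all-zero value such as "0000" becomes "0"
}
NR==FNR { a[norm($0)]++; next }    # file1: count normalised keys
{ k = norm($0) }
!(k in a && a[k]-->0);             # file2: print lines with no remaining match in file1
END { for (k in a) while (a[k]-->0) print k }   # leftover file1 values
It is run the same way: awk -f a.awk file1 file2.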

Calculate average and write it to another file [duplicate]

I have a list of students with ID and marks, and I need to make another one with their average marks. main_list:
#name surname student_index_number course_group_id lecturer_id list_of_marks
athos musketeer 1 1 1 3,4,5,3.5
porthos musketeer 2 1 1 2,5,3.5
aramis musketeer 3 2 2 2,1,4,5
And I have this script
awk '{ n = split($6, a, ","); total=0; for (v in a) total += a[v]; print total / n }' main_list
But I don't want to just print it; I want to write it to another file with the average marks. The final content should be like this (average_list):
athos musketeer 1 1 1 3.875
porthos musketeer 2 1 1 3.5
aramis musketeer 3 2 2 3
Could you please try the following.
while read first second third fourth fifth sixth
do
    if [[ "$first" =~ (^#) ]]
    then
        continue
    fi
    count="${sixth//[^,]}"
    val=$(echo "(${#count}+1)" | bc)
    new_val=$(echo "scale=2; (${sixth//,/+})/$val" | bc)
    echo "$first $second $third $fourth $fifth $new_val"
done < "Input_file" > "Output_file"
Building on your own attempt, try the following:
awk '{ n = split($6, a, ","); total=0; for (v in a) total += a[v]; print $1,$2,$3,$4,$5,total / n }' Input_file > "Output_file"
With awk:
awk '{n=split($NF,array,","); NF--; sum=0; for(i=1; i<=n; i++){sum+=array[i]} print $0,sum/n}' file
Split the last field ($NF) on , into an array (array); n contains the number of elements. Reduce the number of columns in the current line by one (NF--). Add up the array contents with a for loop, then output the rest of the current line ($0) followed by the result (sum/n).
Output:
athos musketeer 1 1 1 3.875
porthos musketeer 2 1 1 3.5
aramis musketeer 3 2 2 3
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
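One detail to watch: the sample main_list begins with a commented header line (#name surname ...). If that line is present in the real file, a small guard keeps it out of the averages. A sketch building on the last answer (like that answer, it assumes an awk where decrementing NF rebuilds the record):
awk '/^#/ { next }                                 # skip the commented header line
     { n = split($NF, array, ",")
       sum = 0
       for (i = 1; i <= n; i++) sum += array[i]
       NF--                                        # drop the raw marks column
       print $0, sum/n }' main_list > average_list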

Use awk to find all columns which contain values above and below specified numbers?

I would like an Awk command where I can search a large file for columns which contain numbers both below 3 and above 5. It also needs to skip the first column.
e.g. for the following file
1 2 6 2
2 1 7 3
3 2 5 4
4 2 8 7
5 2 6 8
6 1 9 9
In this case, only column 4 is a match, as it is the only column with values above 5 and below 3 (except for column 1, which we skip).
Currently, I have this code:
awk '{for (i=2; i<=NF; i++) {if ($i < 3 && $i > 5) {print i}}}'
But this only reads one row at a time (so never makes a match). I want to search all of the rows, but I am unable to work out how this is done.
Ideally the output would simply be the column number. So for this example, simply '4'.
Many thanks.
Could you please try the following and let me know if this helps you.
awk '{for(i=1;i<=NF;i++){if($i<3){col[i]++};if($i>5){col1[i]++}}} END{for(j in col){if(col[j]>=1 && col1[j]>=1){print j}}}' Input_file
If you want to start searching from the second column, change i=1 to i=2 in the above code.
EDIT: Adding a non-one-liner form of the solution too now.
awk '
{
    for (i=1; i<=NF; i++) {
        if ($i<3) { col[i]++ }
        if ($i>5) { col1[i]++ }
    }
}
END {
    for (j in col) {
        if (col[j]>=1 && col1[j]>=1) { print j }
    }
}' Input_file
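For the sample input above saved as Input_file, and with the loop starting at i=2 as suggested, this should print just the matching column number:
$ awk '{for(i=2;i<=NF;i++){if($i<3){col[i]++};if($i>5){col1[i]++}}} END{for(j in col){if(col[j]>=1 && col1[j]>=1){print j}}}' Input_file
4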

What happens when you delete an array element in awk?

I wrote the following code:
awk -F"\t" '{
a[1]=1; a[2]=2; a[3]=3; a[4]=4; a[5]=5;
delete a[4];
print "len", length(a);
for( i =1; i<=length(a); i++)
print i"\t"a[i]
for( i in a)
print i"\t"a[i]
}' -
And the output is:
len 4
1 1
2 2
3 3
4
5 5
4
5 5
1 1
2 2
3 3
My question is: as I have deleted the 4th element and the length of array a has become 4, why are there still 5 elements when I print the array, with the value of the 4th element blank? Does that indicate that delete only deletes the value and the corresponding index remains?
Remove the middle for loop and you'll see what's happening:
$ echo x | awk -F"\t" '{
a[1]=1; a[2]=2; a[3]=3; a[4]=4; a[5]=5;
delete a[4];
print "len", length(a);
for( i in a)
print i"\t"a[i]
}'
len 4
2 2
3 3
5 5
1 1
The delete is working as you expect, removing the array element with index 4 and leaving 4 elements with indices 1, 2, 3, and 5. (Even though you are using numeric indices, it's still an associative array, and the old a[5] does not become accessible as a[4]; it's still a[5].)
The reason you're seeing five elements in your example is the middle for loop:
for( i =1; i<=length(a); i++)
print i"\t"a[i]
By simply referring to a[4] in the above print statement, you are recreating an element of the a array with that index having an empty value.
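A quick way to see this happen is to check the array again after the indexed loop: merely reading a[4] brings it back as an empty element and length(a) returns to 5, whereas the in operator tests membership without creating anything. A small sketch in the same style as above:
$ echo x | awk '{
    a[1]=1; a[2]=2; a[3]=3; a[4]=4; a[5]=5
    delete a[4]
    print "len before:", length(a)        # 4
    for (i = 1; i <= 5; i++) x = a[i]     # reading a[4] silently re-creates it
    print "len after:", length(a)         # 5 again
    if (4 in a) print "a[4] exists, value is \"" a[4] "\""
}'
len before: 4
len after: 5
a[4] exists, value is ""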