Counting and matching process - awk

I have a matching problem with awk :(
I want to count the occurrences of each value in the first column of main.file, and if a value occurs more than 2 times, print the first and second columns of every line containing it.
main.file
1725009 7211378
3353866 11601802
3353866 8719104
724973 3353866
3353866 7211378
For example, "3353866" appears 3 times in the first column, so output.file should look like this:
output.file
3353866 11601802
3353866 8719104
3353866 7211378
How can I do this in awk?

If you mean items with at least 3 occurrences, you can collect occurrence counts in one array and the matching lines, as a newline-delimited string, in another.
awk '{o[$1]++; v[$1]=v[$1] "\n" $0}
END{for(k in o){if(o[k]<3)continue;
print substr(v[k],2)}}' main.file
Untested, not at my computer. The output order will be essentially random; you'll need another variable to keep track of line numbers if you require the order to be stable.
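For example, an order-stable variant (equally untested) could store each line under its input line number and filter in the END block:
awk '{o[$1]++; line[NR]=$0; k[NR]=$1}
END{for(n=1; n<=NR; n++) if(o[k[n]] >= 3) print line[n]}' main.file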
This would be somewhat less hackish in Perl or Python, where a hash/dict can contain a structured value, such as a list.

Another approach is to run through the file twice: it's a little bit slower, but the code is very neat:
awk '
NR==FNR {count[$1]++; next}   # 1st pass: count occurrences of each key
count[$1] > 2 {print}         # 2nd pass: print lines whose key occurred 3+ times
' main.file main.file
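On the sample main.file, this produces exactly the desired output, in the original input order:
$ awk 'NR==FNR {count[$1]++; next} count[$1] > 2 {print}' main.file main.file
3353866 11601802
3353866 8719104
3353866 7211378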

awk '{store[$1"-"lines[$1]] = $0; lines[$1]++;}
END {for (l in store) {
split(l, pair, "-"); if (lines[pair[1]] > 2) { print store[l] } } }'

One approach is to track all the records seen, the corresponding key $1 for each record, and how often each key occurs. Once you've recorded those for all the lines, you can then iterate through all the records stored, only printing those for which the count of the key is greater than two.
awk '{
    record[NR] = $0;
    key[$0] = $1;
    count[$1]++
}
END {
    for (n = 1; n <= NR; n++) {
        if (count[key[record[n]]] > 2) {
            print record[n]
        }
    }
}'

Sort first, and then use awk to print only when the 1st field occurs 3 times or more:
sort -n your_file | awk 'prev == $1 {count++; p0=p1; p1=p2; p2=$2}
prev != $1 {prev=$1; count=1; p2=$2}
count == 3 {print $1 " " p0; print $1 " " p1; print $1 " " p2}
count > 3 {print $1 " " $2}'
This keeps awk from using too much memory on a big input file.

Based on how the question looks and the Ray Toal edit, I'm guessing you mean based on count, so something like this works:
awk '!y[$1] {y[$1] = 1} x[$1] {if(y[$1]==1) {y[$1]=2; print $1, x[$1]}; print} {x[$1] = $2}'


Filter out rows from column A based on values in column B

I have a CSV file that contains two columns. First column is a list of all subscribers and second column is a list of subscribers who need to be excluded from a mailing:
all,exclusions
alice#example.com,charles#example.com
bill#example.com,alice#example.com
charles#example.com
daisy#example.com
esther#example.com
I need to end up with an output of all subscribers from first column who are not listed in the second column. The desired output is something like this:
bill#example.com
daisy#example.com
esther#example.com
So far all I have is this:
awk -F, '(NR>1) {if($1!=$2) {print}}' subs.csv
This of course will only filter out the rows when there are matching values in both columns on the same row.
Thanks for any help.
With an array. I assume that there are no duplicates in the first column.
awk -F ',' 'NR>1{
array[$1]++; array[$2]--
}
END{
for(i in array){ if(array[i]==1){ print i } }
}' file
As one line:
awk -F ',' 'NR>1{ array[$1]++; array[$2]-- } END{for(i in array){ if(array[i]==1){ print i } } }' file
Output:
esther#example.com
daisy#example.com
bill#example.com
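To see why this works, here is a small diagnostic variant (my sketch, not part of the original answer) that prints every counter; only entries ending at exactly 1 are kept. Rows with an empty second field decrement an empty-string key, which never reaches 1 and is therefore harmless:
$ awk -F ',' 'NR>1{ array[$1]++; array[$2]-- } END{ for(i in array){ print "<" i ">", array[i] } }' file
<esther#example.com> 1
<daisy#example.com> 1
<bill#example.com> 1
<alice#example.com> 0
<charles#example.com> 0
<> -3
(The order of keys is unspecified.)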
For completeness, here is a version that removes excluded entries even when values are repeated.
Data
$ cat file
all,exclusions
alice#example.com,charles#example.com
bill#example.com,alice#example.com
charles#example.com
daisy#example.com
daisy#example.com,alice#example.com
daisy#example.com,charles#example.com
daisy#example.com
esther#example.com
esther#example.com
alice#example.com
$ awk -F ',' 'NR>1 && NF==1{ all[$1]++ }
NR>1 && NF==2{ all[$1]++; excl[$2]++ }
END{ for(i in excl){ all[i]=0 };
for(i in all){ if(all[i]>=1){ print i } } }' file
esther#example.com
daisy#example.com
bill#example.com
With two arrays. The first field $1 (the list of all subscribers) is used as the index of an array called a; the second field $2 (the subscribers to exclude) is used as the index of an array called b. In the END part, for (i in a) if (!(i in b)) print i then prints the subscribers from the first column who are not listed in the second:
awk -v FS=',' '
NR > 1 {a[$1];b[$2]}
END{for (i in a) if (!(i in b)) print i}
' file
esther#example.com
daisy#example.com
bill#example.com
Or using the continue statement, which causes the next iteration of the loop to begin:
awk -v FS=',' '
NR > 1 {a[$1];b[$2]}
END{for (i in a) if (i in b) continue;else print i}
' file
esther#example.com
daisy#example.com
bill#example.com

selecting columns in awk discarding corresponding header

How do I properly select columns in awk after some processing? My file:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but expected is (note B is now the third column as we added linenumber as first):
linenumber;B
1;6
2;5
3;2
To get your expected output, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$2
}' file
Output:
linenumber;B
1;6
2;5
3;2
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And if you want only two columns in your output, why do you print three values (FNR-1, $1 and $3)? Finally, your output field separators are spaces instead of the expected ; simply because you did not set the output field separator (OFS). You can set it with a command-line variable assignment (OFS=\;), as shown in the second and third versions below, with the -v option (-v OFS=\;), or in a BEGIN block (BEGIN {OFS=";"}), as you wish (there are differences between these 3 methods, but they don't matter here).
[EDIT]: see a generic solution at the end.
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3
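Invoked with cols='2' instead, the same script reproduces the question's desired output:
$ awk -F\; -v cols='2' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;B
1;6
2;5
3;2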

awk: first, split a line into separate lines; second, use those new lines as a new input

Let's say I have this line:
foo|bar|foobar
I want to split it at | and then use those 3 new lines as the input for further processing (let's say, replace bar with xxx).
Sure, I can pipe two awk instances, like this:
echo "foo|bar|foobar" | awk '{gsub(/\|/, "\n"); print}' | awk '/bar/ {gsub(/bar/, "xxx"); print}'
But how can I achieve this in one script? First do one operation on some input, and then treat the result as the new input for the second operation?
I tried something like this:
echo "foo|bar|foobar" | awk -v c=0 '{
{
gsub(/\|/, "\n");
sprintf("%s", $0);
}
{
if ($0 ~ /bar/) {
c+=1;
gsub(/bar/, "xxx");
print c;
print
}
}
}'
Which results in this:
1
foo
xxx
fooxxx
And thanks to the counter c, it's absolutely obvious that the subsequent if doesn't treat the multi-line input it receives as several new records but instead just as one multi-line record.
Thus, my question is: how to tell awk to treat this new multi-line record it receives as many single-line records?
The desired output in this very example should be something like this if I'm correct:
1
xxx
2
fooxxx
But this is just an example, the question is more about the mechanics of such a transition.
I would suggest an alternative approach using split(), where you split the elements on the delimiter into an array and iterate over its fields, instead of working on a single multi-line string.
echo "foo|bar|foobar" |\
awk '{
count = 0
n = split($0, arr, "|")
for ( i = 1; i <= n; i++ )
{
if ( arr[i] ~ /bar/ )
{
count += sub(/bar/, "xxx", arr[i])
print count
print arr[i]
}
}
}'
Also, you don't need an explicit increment of the count variable: sub() returns the number of substitutions made on the source string, so you can just add that to the existing value of count.
As one more level of optimization, you can get rid of the ~ match in the if condition and use the sub() function directly there:
if ( sub(/bar/, "xxx", arr[i]) )
{
count++
print count
print arr[i]
}
If you set the record separator (RS) to the pipe character, you almost get the desired effect, e.g.:
echo 'foo|bar|foobar' | awk -v RS='|' 1
Output:
foo
bar
foobar
(an empty line)
Except that a new-line character becomes part of the last field, so there is an extra line at the end of the output. You can work around this by either including a new-line in the RS variable, making it less portable, or avoid sending new-lines to awk.
For example using the less portable way:
echo 'foo|bar|foobar' | awk -v RS='\\||\n' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz
Note that the empty record at the end is ignored.
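And the more portable route of not sending a new-line to awk at all, for instance by using printf instead of echo:
printf 'foo|bar|foobar' | awk -v RS='|' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz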
With GNU awk:
$ awk -v RS='[|\n]' 'gsub(/bar/,"xxx"){print ++c ORS $0}' file
1
xxx
2
fooxxx
With any awk:
$ awk -F'|' '{c=0; for (i=1;i<=NF;i++) if ( gsub(/bar/,"xxx",$i) ) print ++c ORS $i }' file
1
xxx
2
fooxxx

Print every nth column of a file

I have a rather big file with 255 comma-separated columns and I need to print only every third column.
I was trying something like this
awk '{ for (i=0;i<=NF;i+=3) print $i }' file
but that doesn't seem to be the solution, since it prints everything as one long column. Can anybody help? Thanks
Here is one way to do this.
The script prog.awk:
BEGIN {FS = ","} # field separator
{for (i = 1; i <= NF; i += 3) printf ("%s%c", $i, i + 3 <= NF ? "," : "\n");}
Invocation:
awk -f prog.awk <input.csv >output.csv
Example input.csv:
1,2,3,4,5,6,7,8,9,10
11,12,13,14,15,16,17,18,19,20
Example output.csv:
1,4,7,10
11,14,17,20
It behaves like that because, by default, awk splits fields on whitespace. You have to tell it to split them with commas, using either the FS variable or the -F switch. Besides that, the first field is number one; $0 is the whole line, so also change the initial value of the for loop:
awk -F',' '{ for (i=1;i<=NF;i+=3) print $i }' file
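Note that print $i still writes each selected field on its own line. To keep the selected columns on one row per input record, as in the first answer, a minimal sketch that joins them with commas:
awk -F',' '{ out = $1; for (i = 4; i <= NF; i += 3) out = out "," $i; print out }' file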

awk script for finding smallest value from column

I am a beginner in AWK, so please help me to learn it. I have a text file named snd and its values are
1 0 141
1 2 223
1 3 250
1 4 280
I want to print the entire row when the third column value is minimum.
This should do it:
awk 'NR == 1 {line = $0; min = $3}
NR > 1 && $3 < min {line = $0; min = $3}
END{print line}' file.txt
EDIT:
What this does is:
Remember the 1st line and its 3rd field.
For the other lines, if the 3rd field is smaller than the min found so far, remember the line and its 3rd field.
At the end of the script, print the line.
Note that the test NR > 1 can be skipped, as for the 1st line, $3 < min will be false. If you know that the 3rd column is always negative, you can also skip the NR == 1 ... test, as min's value at the beginning of the script is zero and any negative value is below it.
EDIT2:
This is shorter:
awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
You don't need awk to do what you want. Use sort
sort -nk 3 file.txt | head -n 1
Results:
1 0 141
I think sort is an excellent answer, unless for some reason what you're looking for is the awk logic to do this in a larger script, or you want to avoid the extra pipes, or the purpose of this question is to learn more about awk.
$ awk 'NR==1{x=$3;line=$0} $3<x{line=$0} END{print line}' snd
Broken out into pieces, this is:
NR==1 {x=$3;line=$0} -- On the first line, set an initial value for comparison and store the line.
$3<x{line=$0} -- On each line, compare the third field against our stored value, and if the condition is true, store the line. (We could make this run only on NR>1, but it doesn't matter.)
END{print line} -- At the end of our input, print whatever line we've stored.
You should read man awk to learn about any parts of this that don't make sense.
A short answer for this would be:
sort -k3,3n temp|head -1
since you have asked for awk:
awk '{if(min>$3||NR==1){min=$3;a[$3]=$0}}END{print a[min]}' your_file
But I always prefer the shorter one.
For calculating the smallest value in any column, let's say the last column:
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}'
This will only print the smallest value of the column.
If the complete line is needed, it is better to use sort:
sort -n -t [delimiter] -k[column] [file name] | head -1
awk -F ";" '(NR==1){a=$NF;b=$0} {a=$NF<a?$NF:a;b=$NF>a?b:$0} END {print b}' filename
This will print the line with the smallest value; if several lines tie for the smallest value, the last such line wins.
awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
We use || a[$1]=="" because the first time a given field-1 value is encountered, a[$1] is null, and the comparison a[$1]>$2 alone would never store that first value.
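For instance, with a hypothetical two-column input, the one-liner prints each field-1 value followed by the minimum and maximum of its field-2 values (key order is unspecified):
$ cat input_file
a,5
a,3
b,7
b,9
$ awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
a,3,5
b,7,9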