awk to count, sum and print unique values - improve command

I would like to print, keyed on the 2nd column: the count of line items, the sum of the 3rd column, and the number of unique values in the first column. I have around 100 InputTest files and they are not sorted.
I am using the 3 commands below to achieve the desired output; I would like to know the simplest way to do this.
InputTest*.txt
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
ghi,zz,10,sss
Step#1:
cat InputTest*.txt | awk -F, '{key=$2;++a[key];b[key]=b[key]+$3} END {for(i in a) print i","a[i]","b[i]}'
Op#1
xx,4,40
yy,4,60
zz,1,10
Step#2
awk -F ',' '{print $1,$2}' InputTest*.txt | sort | uniq >Op_UniqTest2.txt
Op#2
abc xx
abc yy
def xx
def yy
ghi zz
Step#3
awk '{print $2}' Op_UniqTest2.txt | sort | uniq -c
Op#3
2 xx
2 yy
1 zz
Desired Output:
xx,4,40,2
yy,4,60,2
zz,1,10,1
Looking for suggestions!

BEGIN { FS = OFS = "," }
{ ++lines[$2]; if (!seen[$2,$1]++) ++diff[$2]; count[$2]+=$3 }
END { for(i in lines) print i, lines[i], count[i], diff[i] }
lines tracks the number of occurrences of each value in column 2
seen records unique combinations of the second and first columns, incrementing diff[$2] whenever a new combination is found. The ++ after seen[$2,$1] means the condition is only true the first time a combination is seen: after that, seen[$2,$1] is at least 1, so !seen[$2,$1] is false.
count keeps a total of the third column
$ awk -f avn.awk file
xx,4,40,2
yy,4,60,2
zz,1,10,1
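If the !seen[$2,$1]++ idiom is unfamiliar, it is the standard awk trick for acting only on the first occurrence of a key; a minimal standalone illustration (here just de-duplicating whole lines):
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c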

Using awk:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Set the input and output field separator to , in the BEGIN block. We use the keys array to identify and count the keys, the sum array to keep the sum of column 3 for each key, and the count array to keep track of how many unique column 1 values go with each column 2 value.
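Since the question mentions around 100 unsorted InputTest*.txt files, note that neither approach needs cat or a pre-sort; awk accepts the glob directly, e.g.:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' InputTest*.txt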

awk: first, split a line into separate lines; second, use those new lines as a new input

Let's say I have this line:
foo|bar|foobar
I want to split it at | and then use those 3 new lines as the input for further processing (let's say, replace bar with xxx).
Sure, I can pipe two awk instances, like this:
echo "foo|bar|foobar" | awk '{gsub(/\|/, "\n"); print}' | awk '/bar/ {gsub(/bar/, "xxx"); print}'
But how can I achieve this in one script? First, do one operation on some input, and then treat the result as the new input for the second operation?
I tried something like this:
echo "foo|bar|foobar" | awk -v c=0 '{
{
gsub(/\|/, "\n");
sprintf("%s", $0);
}
{
if ($0 ~ /bar/) {
c+=1;
gsub(/bar/, "xxx");
print c;
print
}
}
}'
Which results in this:
1
foo
xxx
fooxxx
And thanks to the counter c, it's absolutely obvious that the subsequent if doesn't treat the multi-line input it receives as several new records but instead just as one multi-line record.
Thus, my question is: how to tell awk to treat this new multi-line record it receives as many single-line records?
The desired output in this very example should be something like this if I'm correct:
1
xxx
2
fooxxx
But this is just an example, the question is more about the mechanics of such a transition.
I would suggest an alternative approach using split(), where you split the line on the delimiter into an array and iterate over its elements, instead of working on a single multi-line string.
echo "foo|bar|foobar" |\
awk '{
count = 0
n = split($0, arr, "|")
for ( i = 1; i <= n; i++ )
{
if ( arr[i] ~ /bar/ )
{
count += sub(/bar/, "xxx", arr[i])
print count
print arr[i]
}
}
}'
Also, you don't need an explicit increment of the count variable: sub() returns the number of substitutions made on the source string, so you can just add its return value to count.
As one more level of optimization, you can get rid of the ~ match in the if condition and use the sub() function directly there:
if ( sub(/bar/, "xxx", arr[i]) )
{
count++
print count
print arr[i]
}
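Folding that back in, the complete command might look like this (a sketch of the same logic as above):
echo "foo|bar|foobar" |\
awk '{
    count = 0
    n = split($0, arr, "|")
    for ( i = 1; i <= n; i++ )
        if ( sub(/bar/, "xxx", arr[i]) ) {   # substitute and test in one step
            count++
            print count
            print arr[i]
        }
}'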
If you set the record separator (RS) to the pipe character, you almost get the desired effect, e.g.:
echo 'foo|bar|foobar' | awk -v RS='|' 1
Output:
foo
bar
foobar
(empty line)
Except that a newline character becomes part of the last record, so there is an extra line at the end of the output. You can work around this either by including a newline in the RS variable, making it less portable, or by avoiding sending newlines to awk at all.
For example using the less portable way:
echo 'foo|bar|foobar' | awk -v RS='\\||\n' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz
Note that the empty record at the end is ignored.
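And the more portable workaround is simply not to send a trailing newline to awk in the first place, e.g. with printf instead of echo (a small sketch of that variant):
$ printf '%s' 'foo|bar|foobar' | awk -v RS='|' '{ sub(/bar/, "baz") } 1'
foo
baz
foobaz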
With GNU awk:
$ awk -v RS='[|\n]' 'gsub(/bar/,"xxx"){print ++c ORS $0}' file
1
xxx
2
fooxxx
With any awk:
$ awk -F'|' '{c=0; for (i=1;i<=NF;i++) if ( gsub(/bar/,"xxx",$i) ) print ++c ORS $i }' file
1
xxx
2
fooxxx

awk count unique occurrences and print other columns

I have the following piece of code:
awk '{h[$1]++}; END { for(k in h) print k, h[k]}' ${infile} >> ${outfile2}
Which does part of what I want: printing out the unique values and also counting how many times these unique values have occurred. Now I want to print out the 2nd and 3rd columns as well for each unique value. For some reason the following does not seem to work:
awk '{h[$1]++}; END { for(k in h) print k, $2, $3, h[k]}' ${infile} >> ${outfile2}
awk '{h[$1]++}; END { for(k in h) print k, h[$2], h[$3], h[k]}' ${infile} >> ${outfile2}
The first prints out the last record's 2nd and 3rd columns for every key, whereas the second prints out nothing except k and h[k].
${infile} would look like:
20600 33.8318 -111.9286 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
29400 33.9455 -113.5430 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
The desired output would be:
20600, 33.8318, -111.9286, 3
30900, 33.3979, -111.8140, 2
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
You were close, and you can do it all in awk. If you store the count keyed on field 1 and also want fields 2 and 3 available in END for output, you need to store fields 2 and 3 in arrays indexed by field 1 (or whatever field you are keeping count of) as well. For example you could do:
awk -v OFS=', ' '
{ h[$1]++; i[$1]=$2; j[$1]=$3 }
END {
for (a in h)
print a, i[a], j[a], h[a]
}
' infile
Where h[$1] holds the count of the number of times field 1 is seen, indexing the array with field 1. i[$1]=$2 captures field 2 indexed by field 1, and then j[$1]=$3 captures field 3 indexed by field 1.
Then within END all that is needed is to output field 1 (a, the index of h), i[a] (field 2), j[a] (field 3), and finally h[a], the count of the number of times field 1 was seen.
Example Use/Output
Using your example data, you can just copy/middle-mouse-paste the code at the terminal with the correct filename, e.g.
$ awk -v OFS=', ' '
> { h[$1]++; i[$1]=$2; j[$1]=$3 }
> END {
> for (a in h)
> print a, i[a], j[a], h[a]
> }
> ' infile
20600, 33.8318, -111.9286, 3
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
30900, 33.3979, -111.8140, 2
Which provides the output desired. If you want each output line built directly from fields 1, 2 and 3 as in your example, you can use string concatenation to make fields 1, 2 & 3 the index of the array and then output the index and count (bear in mind that for (i in a) visits elements in unspecified order, so pipe through sort if a specific order matters), e.g.
$ awk '{a[$1", "$2", "$3]++}END{for(i in a) print i ", " a[i]}' infile
20600, 33.8318, -111.9286, 3
30600, 33.4461, -111.7876, 2
29400, 33.9455, -113.5430, 1
30900, 33.3979, -111.8140, 2
Look things over and let me know if you have further questions.
GNU datamash is a very handy tool for working on groups of columnar data in files that makes this trivial to do.
Assuming your file uses tabs to separate columns like it appears to:
$ datamash -s --output-delimiter=, -g 1,2,3 count 3 < input.tsv
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
Though it's not much more complicated in awk, using a multi dimensional array:
$ awk 'BEGIN { OFS=SUBSEP="," }
{ group[$1,$2,$3]++ }
END { for (g in group) print g, group[g] }' input.tsv
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
20600,33.8318,-111.9286,3
30900,33.3979,-111.8140,2
If you want sorted output instead of arbitrary order for this one, then if you're using GNU awk, add PROCINFO["sorted_in"] = "@ind_str_asc" in the BEGIN block, or otherwise pipe the output through sort.
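For example, with GNU awk the sorted variant could look like this (same script as above, just with the traversal order pinned; note the leading @ in the sorted_in value):
$ awk 'BEGIN { OFS = SUBSEP = ","; PROCINFO["sorted_in"] = "@ind_str_asc" }
{ group[$1,$2,$3]++ }
END { for (g in group) print g, group[g] }' input.tsv
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2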
You can also get the same effect by pipelining a bunch of utilities (including awk and uniq):
$ sort -k1,3n input.tsv | cut -f1-3 | uniq -c | awk -v OFS=, '{ print $2, $3, $4, $1 }'
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2

awk/sed remove duplicates and merge permuted columns

I have the following file:
ABC MNH 1
UHR LOI 2
QWE LOI 3
MNH ABC 4
PUQ LOI 5
MNH ABC 6
QWE LOI 7
LOI UHR 8
I want to remove all duplicates (based on the first two columns; e.g., row 6 is a duplicate of row 4). I also want to merge entries where columns 1 and 2 are permuted (e.g., rows 1 and 4). This means the list should result in:
ABC MNH 1 4
UHR LOI 2 8
QWE LOI 3
PUQ LOI 5
However, this file is huge. About 2-3 TB. Can this be done with awk/sed?
I don't understand why what you posted is your expected output, so you may have to massage this, but IMHO this is the right way to approach the problem: only sort has to store the multi-TB input internally (and sort is designed to do that, with paging etc.), while the awk scripts just process one line at a time and keep very little in memory:
$ cat tst.sh
#!/bin/env bash
awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -k1,2 |
awk '
{ curr = $1 OFS $2 }
prev != curr {
if ( NR>1 ) {
print rec
}
rec = $0
sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)
prev = curr
next
}
{ rec = rec OFS $NF }
END { print rec }
'
$ ./tst.sh file
ABC MNH 1 4 6
PUQ LOI 5
QWE LOI 3 7
LOI UHR 8 2
An alternative implementation after discussion with @kvantour in the comments below (requires GNU sort for -s, stable sort):
$ cat tst.sh
#!/bin/env bash
awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -s -k1,2 |
awk '
{ curr = $1 OFS $2 }
prev != curr {
if ( NR>1 ) {
print rec
}
rec = $0
sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)
sub(/[[:space:]]+[^[:space:]]+$/,"",rec)
delete seen
prev = curr
}
!seen[$3,$4]++ { rec = rec OFS $NF }
END { print rec }
'
$ ./tst.sh file
ABC MNH 1 4
PUQ LOI 5
QWE LOI 3
UHR LOI 2 8
The always helpful GNU datamash to the rescue!
$ sort -k1,2 -u input.txt |
awk -v OFS="\t" '$2 < $1 { tmp = $1; $1 = $2; $2 = tmp } { print $1, $2, $3 }' |
sort -k1,2 |
datamash groupby 1,2 collapse 3 |
tr ',' ' '
ABC MNH 1 4
LOI PUQ 5
LOI QWE 3
LOI UHR 2 8
Broken down, this:
Sorts the input file based on the first two columns and removes duplicates.
If the second column is less than the first column, swaps the two (So MNH ABC 6 becomes ABC MNH 6), and outputs tab-separated columns (Which is what datamash works with by default).
Sorts that so all the transformed rows are in order (But this time keeping duplicates).
Uses datamash to produce a single line for all the duplicate first two columns, with a comma-separated list of the values of the third columns as the third column of the output (Like ABC MNH 1,4)
Turns those commas into spaces.
Most memory-efficient solutions will require the data to be sorted, and while the sort program is quite good at doing that, it'll still use a bunch of temporary files so you'll need 2-3 or so terabytes of free disk space.
If you're going to be doing a lot of stuff with the same data, it's probably worth sorting it once and reusing that file instead of sorting it every time as the first step of a pipeline:
$ sort -k1,2 -u input.txt > unique_sorted.txt
$ awk ... unique_sorted.txt | ...
If there's enough duplicates and enough RAM that it's feasible to hold the results in memory, it can be done in one pass through the input file removing duplicates as it goes and then iterating through all the remaining pairs of values:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my %keys;
while (<>) {
chomp;
my ($col1, $col2, $col3) = split ' ';
$keys{$col1}{$col2} = $col3 unless exists $keys{$col1}{$col2};
}
$, = " ";
while (my ($col1, $sub) = each %keys) {
while (my ($col2, $col3) = each %$sub) {
next unless defined $col3;
if ($col1 lt $col2 && exists $keys{$col2}{$col1}) {
$col3 .= " $keys{$col2}{$col1}";
$keys{$col2}{$col1} = undef;
} elsif ($col2 lt $col1 && exists $keys{$col2}{$col1}) {
next;
}
say $col1, $col2, $col3;
}
}
This produces output in arbitrary unsorted order for efficiency's sake.
And an approach using sqlite (Also requires lots of extra free disk space, and that the columns are separated by tabs, not arbitrary whitespace):
#!/bin/sh
input="$1"
sqlite3 -batch -noheader -list temp.db 2>/dev/null <<EOF
.separator \t
PRAGMA page_size = 8096; -- Make sure the database can grow big enough
CREATE TABLE data(col1, col2, col3, PRIMARY KEY(col1, col2)) WITHOUT ROWID;
.import "$input" data
SELECT col1, col2, group_concat(col3, ' ')
FROM (
SELECT col1, col2, col3 FROM data WHERE col1 < col2
UNION ALL
SELECT col2, col1, col3 FROM data WHERE col2 < col1
)
GROUP BY col1, col2
ORDER BY col1, col2;
EOF
rm -f temp.db
If your first two columns only ever hold 3-character codes, you have at most 26^6 possible combinations for the first two columns. This is very easy to handle in awk.
{ key1=$1$2; key2=$2$1 }
(key1 in a) { next } # duplicate :> skip
(key2 in a) { print $2,$1,a[key2],$3 } # permutation :> print
{ a[key1]=$3 } # store value
This, however, will only print the permutations and, as requested, at most 2 elements per pair. As a consequence, the array a will contain both key1 and the permuted key key2 when a permutation is found, and otherwise only key1.
This can be cleaned up with a second array that keeps track of whether a permutation has already been printed. Call it b. This way you can eliminate 2 elements from a while keeping track of one element in b:
{ key1=$1$2; key2=$2$1 }
(key1 in b) || (key2 in b) { next } # permutation printed, is duplicate
(key1 in a) { next } # only duplicate, no permutation found
(key2 in a) { # permutation found
print $2,$1,a[key2],$3 # - print
delete a[key1] # - delete keys from a
delete a[key2]
b[key1] # - store key in b
next # - skip the rest
}
{ a[key1]=$3 }
END { for (k in a) { print substr(k,1,3), substr(k,4,3), a[k] } }
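Saved as, say, permute.awk (the name is just for illustration) and run on the sample file, this should reproduce the requested merge: the permuted pairs are printed as they are found, and the leftovers come out of the END loop in arbitrary array order (so the last two lines may appear in either order):
$ awk -f permute.awk file
ABC MNH 1 4
UHR LOI 2 8
QWE LOI 3
PUQ LOI 5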

awk split and drop column but something wrong with substr part

I have awk code to split a file by columns and print the output to new file names.
awk -F"|" 'NR==1 {h=substr($0, index($0,$5)); next}
{file= path ""$1""$2"_"$3"_"$4"_03042017.csv"; print (a[file]++?"": "DM9 03042017" ORS h ORS) substr($0, index($0,$5)) > file}
END{for(file in a) print "EOF " a[file] > file}'
I use substr($0, index($0,$5)) so the new output only contains the data from the fifth column onwards. It works fine except when the input data has the same value in more than one column.
For example,
product | ID | Branch | Office | Type | ....
ABC | 12 | KL | GH | Z | ....
For the above example, the code works well because the column values are all different.
product | ID | Branch | Office | Type | ....
ABC | 12 | KK | KK | Z | ....
But if I have input like the second example, where the third and fourth columns hold the same value, the code doesn't work well. Instead of getting output starting at the fifth column onwards, I get the result starting at the third column onwards.
So I suspect that because the third and fourth column values are the same, the extraction starts at the third column because of how I used substr.
Can anyone help me with this? Sorry for the long post; I'd appreciate any ideas. Thank you.
If the structure is fixed like in your sample (fixed field layout):
awk -F '[[:blank:]]*[|][[:blank:]]*' -v path="./" '
NR==1 {
for( i=1;i<5;i++) $i = ""
h = $0; sub(/^[[:blank:]|]+/,"", h)
next
}
{
file= path $1 $2 "_" $3 "_" $4 "_03042017.csv"
# blank out the first 4 fields
for( i=1;i<5;i++) $i = ""
# strip the leading spaces/pipes left behind
Cleaned = $0; sub( /^[[:blank:]|]+/, "", Cleaned)
print ( a[file]++ ? "" : "DM9 03042017" ORS h ORS ) Cleaned > file
}
END {
for(file in a) { print "EOF " a[file] > file }
}
' YourFile
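As an aside, the reason the original substr()/index() approach misfires is that index($0, $5) returns the position of the first occurrence of $5's text anywhere in the line, so if the same text already appears in an earlier column the extracted substring starts too early. A contrived illustration of the mechanism (using $4 here just to make the collision obvious):
$ echo 'ABC|12|KK|KK|Z|rest' | awk -F'|' '{ print substr($0, index($0, $4)) }'
KK|KK|Z|rest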

Word Occurrences Script

I'm putting together a script that will count the occurrences of words in text documents.
{
$0 = tolower($0)
for ( i = 1; i <= NF; i++ )
freq[$i]++
}
BEGIN { printf "%-20s %-6s\n", "Word", "Count"}
END {
sort = "sort -k 2nr"
for (word in freq)
printf "%-20s %-6s\n", word, freq[word] | sort
close(sort)
}
It works fine so far, but I'd like to make a couple tweaks/additions:
I'm having a hard time displaying the array index number; I tried freq[$i], which just spat 0's back at me
Is there any way to eliminate the whitespace (spaces) from the word count?
You do not need to code your own loop to scan the fields; just set RS so that each word becomes its own record. E.g. RS="[^A-Za-z]+" treats every run of characters that are not letters as the record separator.
$ echo 'Hello world! I am happy123...' | awk 'BEGIN{RS="[^A-Za-z]+"}$0'
Hello
world
I
am
happy
The lone $0 pattern selects only non-empty records.
Maybe you want to allow digits in words; just adapt RS to your needs.
So what's left?
Transform to lowercase, count, print sorted results.
File wfreq.awk:
BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
printf "%-20s %6s\n", "Word", "Count"
sort = "sort -k 2nr"
for(word in counts)
printf "%-20s %6s\n",word,counts[word] | sort
close(sort)
}
Example run (only the top 10 lines of output, to avoid spamming the answer):
$ awk -f wfreq.awk /etc/motd | head
Word Count
the 5
debian 3
linux 3
are 2
bpo 2
gnu 2
in 2
with 2
absolutely 1
But now for something not really completely different...
To sort by a different field, just adapt the sort = "sort ..." options.
I don't use asort() because not every awk has this extension.
File wfreq2.awk:
BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
printf "%-20s %6s\n", "Word", "Count"
sort = "sort -k 1"
for(word in counts)
printf "%-20s %6s\n",word,counts[word] | sort
close(sort)
}
Example run (only the top 10 lines of output, to avoid spamming the answer):
$ awk -f wfreq2.awk /etc/motd | head
Word Count
absolutely 1
amd 1
applicable 1
are 2
bpo 2
by 1
comes 1
copyright 1
darkstar 1
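As a gawk-only aside, the external sort can be dropped entirely by pinning the traversal order with PROCINFO; a rough sketch of the descending-count variant (assumes GNU awk):
BEGIN { RS = "[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END {
    printf "%-20s %6s\n", "Word", "Count"
    PROCINFO["sorted_in"] = "@val_num_desc"   # gawk extension: iterate by value, numeric, descending
    for (word in counts)
        printf "%-20s %6s\n", word, counts[word]
}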