awk histogram in buckets

Suppose I have the following file:
1 a
1 b
1 a
1 c
1 a
2 a
2 d
2 a
2 d
I want a histogram within each bucket. For example, with a bucket size of 1 the output would be:
a 3
b 1
c 1
a 2
d 2
With a bucket size of 2 we get:
a 5
b 1
c 1
d 2
I want to do this with awk and I am stuck.
Here is my code:
awk '
{A[$1]} count [$2]++
{for(i in A) {print i,A[i]}
}' test
Any help?
Thanks,
Amir.

Edit: adding a size_of_bucket variable.
awk -v "size_of_bucket=2" '
{
bucket = int(($1-1)/size_of_bucket);
A[bucket","$2]++;
}
END {
for (i in A) {
print i, A[i];
}
}
' test
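The answer above prints keys such as 0,a in whatever order the for-in loop happens to yield them. A minimal sketch that splits the key back into bucket and label and sorts the result, assuming the first column holds integers starting at 1 as in the question (bucket_hist is a name made up for this example):

```shell
# bucket_hist SIZE: read "value label" lines on stdin and print
# "bucket label count", sorted by bucket number and then by label.
# bucket_hist is a hypothetical helper, not part of the original answer.
bucket_hist() {
  awk -v size_of_bucket="$1" '
    {
      bucket = int(($1 - 1) / size_of_bucket)   # values 1..size land in bucket 0
      count[bucket, $2]++                        # composite key joined by SUBSEP
    }
    END {
      for (key in count) {
        split(key, part, SUBSEP)                 # part[1] = bucket, part[2] = label
        print part[1], part[2], count[key]
      }
    }' | sort -k1,1n -k2,2
}

# The sample data from the question, with a bucket size of 2
printf '%s\n' '1 a' '1 b' '1 a' '1 c' '1 a' '2 a' '2 d' '2 a' '2 d' | bucket_hist 2
```

Buckets are numbered from 0 here, so a bucket size of 1 reports the question's groups as buckets 0 and 1.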

Related

For each unique occurrence in field, print sum corresponding numerical field and number of occurrences/counts

I have a file
a x 0 3
a x 0 1
b,c x 4 4
dd x 3 5
dd x 2 5
e,e,t x 5 7
a b 1 9
cc b 2 1
cc b 1 1
e,e,t b 1 2
e,e,t b 1 2
e,e,t b 1 2
For each unique combination of $1 and $2, I want to print the sums of $3 and $4 and the number of occurrences (counts).
So that I have
a x 0 4 0 2
b,c x 4 4 1 1
dd x 5 10 2 2
e,e,t x 5 7 1 1
a b 1 9 1 1
cc b 3 2 2 2
e,e,t b 3 6 3 3
I am using
awk -F"\t" '{for(n=2;n<=NF; ++n) a[$1 OFS $2][n]+=$n}
END {for(i in a) {
printf "%s", i
for (n=3; n<=4; ++n) printf "\t%s", a[i][n], a[i][n]++
printf "\n" }}' file
but it's only printing the sums, not the counts
The actual file has many columns: the keys are $4$6$7$8 and the numerical columns are $9-$13
You may use this awk:
cat sum.awk
BEGIN {
FS = OFS = "\t" # set input/output FS to tab
}
{
k = $1 OFS $2 # create key using $1 tab $2
if (!(k in map3)) # if k is not in map3 save it in an ordered array
ord[++n] = k
map3[k] += $3 # sum of $3 in array map3 using key as k
$3 > 0 && ++fq3[k] # frequency of $3 if it is > 0
map4[k] += $4 # sum of $4 in array map4 using key as k
$4 > 0 && ++fq4[k] # frequency of $4 if it is > 0
}
END {
for(i=1; i<=n; ++i) # print everything by looping through ord array
print ord[i], map3[ord[i]], map4[ord[i]], fq3[ord[i]]+0, fq4[ord[i]]+0
}
Then use it as:
awk -f sum.awk file
a x 0 4 0 2
b,c x 4 4 1 1
dd x 5 10 2 2
e,e,t x 5 7 1 1
a b 1 9 1 1
cc b 3 2 2 2
e,e,t b 3 6 3 3
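Since the question mentions that the real file has more numeric columns, here is a rough sketch of the same idea with the column range passed in instead of one array per column. It assumes tab-separated input and keeps the key as $1 and $2; the function name sum_and_count and its two arguments are invented for illustration:

```shell
# sum_and_count FIRST LAST: for each key ($1, $2) print the sums of columns
# FIRST..LAST followed by the number of values > 0 in each of those columns.
# Keys are printed in order of first appearance.
sum_and_count() {
  awk -v first="$1" -v last="$2" '
    BEGIN { FS = OFS = "\t" }
    {
      key = $1 OFS $2                        # adapt for other key columns
      if (!(key in seen)) { seen[key]; order[++n] = key }
      for (c = first; c <= last; c++) {
        sum[key, c] += $c
        if ($c > 0) cnt[key, c]++
      }
    }
    END {
      for (i = 1; i <= n; i++) {
        key = order[i]
        line = key
        for (c = first; c <= last; c++) line = line OFS sum[key, c]
        for (c = first; c <= last; c++) line = line OFS (cnt[key, c] + 0)
        print line
      }
    }'
}

# Four rows taken from the sample file, tab-separated
printf 'a\tx\t0\t3\na\tx\t0\t1\ndd\tx\t3\t5\ndd\tx\t2\t5\n' | sum_and_count 3 4
```

For the layout described in the question, the key line would presumably become $4 OFS $6 OFS $7 OFS $8 and the call sum_and_count 9 13.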

How to compare pairs of columns in awk?

I have the following dataset from a pairwise analysis (the first row are just the sample ids):
A B A C
1 1 1 0
1 2 1 1
1 0 1 2
I wish to compare the values in fields 1 and 2, then fields 3 and 4, and print the row number (NR) every time I see a 1 and 2 combination in the pair I am examining.
For example for pairs A and B, I would want the output:
A B 2
For pairs A and C, I would want the output:
A C 3
I would want to proceed row by row so I would likely need the code to include:
for i in {1..3}; do
awk 'NR=="'${i}'" {code}'
done
But I have no idea how to proceed in a pairwise fashion (i.e. compare field 1 and field 2 and then field 3 and field 4 etc...).
How can I do this?
It's hard to say with such a minimal example but this MAY be what you want:
$ cat tst.awk
FNR==1 {
for (i=1;i<=NF;i++) {
name[i] = $i
}
next
}
{
for (i=1;i<NF;i+=2) {
if ( ($i == 1) && ($(i+1) == 2) ) {
print name[i], name[i+1], NR-1
}
}
}
$ awk -f tst.awk file
A B 2
A C 3
You certainly should only run the script once; there's no need to run awk once per row. It isn't yet entirely clear how you want multiple matches printed. However, if you're working a line at a time, then the output probably comes a line at a time.
Working on that basis, then:
awk 'NR == 1 { for (i = 1; i < NF; i += 2)
{ cols[(i+1)/2,1] = $i; cols[(i+1)/2,2] = $(i+1); }
next
}
{ for (i = 1; i < NF; i += 2)
{ if ($i == 1 && $(i+1) == 2)
print cols[(i+1)/2,1], cols[(i+1)/2,2], NR - 1
}
}' file
The NR == 1 block of code captures the headings so they can be used in the main printing code. There are plenty of other ways to store the information too. The other block of code looks at the data lines, checks whether each pair of fields contains 1 and 2, and prints the headings and row number if there is a match. Because NF will be an even number, but the loops count on the odd numbers, the < comparison is OK. Often in awk, you use for (i = 1; i <= NF; i++) with a single increment and then <= is required for correct behaviour.
For your minimal data set, this produces:
A B 2
A C 3
For this larger data set:
A B A C
1 1 1 0
1 2 1 1
1 0 1 2
1 2 4 2
5 3 1 9
7 0 3 2
1 2 1 0
9 0 1 2
1 2 3 2
the code produces:
A B 2
A C 3
A B 4
A B 7
A C 8
A B 9
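If the pair of values to look for may change, the same loop can take them as variables. This is only a sketch under that assumption; find_pairs and its two arguments are hypothetical names:

```shell
# find_pairs LEFT RIGHT: print "name1 name2 row" for every adjacent column
# pair (1,2), (3,4), ... whose values equal LEFT and RIGHT on a data row.
find_pairs() {
  awk -v left="$1" -v right="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }   # header row
    {
      for (i = 1; i < NF; i += 2)            # step over the pairs
        if ($i == left && $(i + 1) == right)
          print name[i], name[i + 1], NR - 1 # NR - 1 skips the header
    }'
}

printf '%s\n' 'A B A C' '1 1 1 0' '1 2 1 1' '1 0 1 2' | find_pairs 1 2
```

Calling find_pairs 1 0 on the same input would report the pairs holding 1 and 0 instead.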

How to do a join using awk

Here is my Input file
Identifier Relation
A 1
A 2
A 3
B 2
B 3
C 1
C 2
C 3
I want to join this file to itself based on the "Relation" field.
Sample Output file
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
I used the following awk script:
awk 'NR==FNR {a[NR]=$0; next} { for (k in a) if (a[k]~$2) print a[k],$0}' input input > output
However, I had to do another awk step to delete lines which did a join with itself i.e, A 1 A 1 ; B 2 B 2 etc.
The second issue with this file is it prints both directions of the join, thus
A 1 C 1 is printed along with C 1 A 1 on another line.
Both lines express the same relation, and I would NOT like to include both. I want to see just one or the other, i.e. "A 1 C 1" or "C 1 A 1", not both.
Any suggestions/directions are highly appreciated.
An alternative solution using awk with support from join and sort:
$ join -j 2 <(sort -k2 -k1,1 file){,} |
awk '$2!=$3 && !($3 FS $2 in a){a[$2 FS $3]; print $2,$1,$3,$1}'
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
This creates the cross product, then eliminates the diagonal and one of each pair of symmetrical entries.
Here is an awk-only solution:
awk 'NR>1{ar[$2]=(ar[$2]$1);}\
END{ for(key in ar){\
for(i=1; i<length(ar[key]); i++) {\
for(j=i+1; j<length(ar[key])+1; j++) {\
print substr(ar[key],i,1), key, substr(ar[key],j,1), key;\
}\
}\
}}' infile
Each number in the second column of the input serves as a key of an awk array. The value of the corresponding array element is the sequence of first-column letters seen with that number (e.g., ar[2]="ABC").
Then we build all two-letter combinations for each sequence (e.g., "ABC" gives "AB", "AC" and "BC").
Output:
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Note:
If a number occurs only once, no output is generated for this number.
The order of output depends on the order of input (the letters are not sorted). For example, if C 1 appeared before A 1 in the input, then ar[1]="CA" and the first output line would be C 1 A 1
First line of input is ignored due to NR>1
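Because the script above walks the stored string with substr, it only works when every identifier is a single character. A possible variation, sketched here under the assumption that identifiers contain no spaces, keeps a space-separated list per relation and splits it in END (self_join is a made-up function name):

```shell
# self_join: read "Identifier Relation" lines (with a header) on stdin and
# print every unordered pair of identifiers that share a relation.
self_join() {
  awk '
    NR > 1 {
      if ($2 in ids) {
        ids[$2] = ids[$2] " " $1     # append to the list for this relation
      } else {
        ids[$2] = $1
        order[++n] = $2              # remember first-seen order of relations
      }
    }
    END {
      for (r = 1; r <= n; r++) {
        key = order[r]
        m = split(ids[key], id, " ")
        for (i = 1; i < m; i++)
          for (j = i + 1; j <= m; j++)
            print id[i], key, id[j], key
      }
    }'
}

printf '%s\n' 'Identifier Relation' 'A 1' 'A 2' 'A 3' 'B 2' 'B 3' 'C 1' 'C 2' 'C 3' | self_join
```

Relations are printed in the order they first appear, so the output no longer depends on the traversal order of for (key in ar).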
There is surely a solution with awk only, but I'm going to propose a solution using awk and sort because I think it's quite simple and does not require storing the entire file content in awk variables. The idea is as follows:
rewrite the input file so that the "relation" field is first (A 1 -> 1 A)
use sort -n to put together all lines with same "relation"
use awk to combine consecutive lines having the same "relation"
That would translate to something like:
awk '{print $2 " " $1}' input | sort -n |
awk '{if ($1==lastsel)printf " "; else if(lastsel) printf "\n"; lastsel=$1; printf "%s %s", $2, $1;}END{if(lastsel)printf"\n"}'
A 1 C 1
A 2 B 2 C 2
A 3 B 3 C 3
EDIT: If you want only one i-j relation per line:
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel{rel=$1;item=$2;next;} {printf "%s %s %s %s\n", item, rel, $2, $1;}'
A 1 C 1
A 2 B 2
A 2 C 2
A 3 B 3
A 3 C 3
Note the following limitations with this solution:
In case a given n has only one entry, nothing will be output (no output such as D 1)
All relations always have the lexicographically first item in the first column (e.g. A 2 C 2 but never B 2 C 2)

matching rows and fields from two files

I would like to match the record number in one file with the same field number in another file:
file1:
1
3
5
4
3
1
5
file2:
A B C D E F G
H I J J K L M
N O P Q R S T
I would like to use the record numbers corresponding to 5 in the first file to obtain the corresponding fields in the second file. Desired output:
C G
J M
P T
So far, I've done:
awk '{ if ($1=="5") print NR }' file1 > temp
for i in $(cat temp); do
awk '{ print $"'${i}'" }' file2
done
But get the output:
C
J
P
G
M
T
I would like to have this in the format of the desired output above, but can't get it to work. Perhaps printf or an awk for-loop might work, but I have had no success.
Thank you all.
awk 'NR==FNR{if($1==5)a[NR];next}{for(i in a){printf "%s ", $i}print ""}' file1 file2
C G
J M
P T
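The one-liner relies on for (i in a) visiting the stored line numbers in ascending order, which awk does not guarantee. A sketch that records the matching line numbers in an indexed list instead, so the columns come out in file order and without a trailing space (pick_columns and the temporary file names are invented for this example):

```shell
# pick_columns VALUE FILE1 FILE2: find the line numbers in FILE1 whose first
# field equals VALUE, then print those field numbers from every row of FILE2.
pick_columns() {
  awk -v want="$1" '
    NR == FNR { if ($1 == want) col[++n] = FNR; next }   # first file only
    {
      line = ""
      for (i = 1; i <= n; i++)
        line = line (i > 1 ? " " : "") $(col[i])
      print line
    }' "$2" "$3"
}

# Recreate the sample files from the question in a scratch directory
tmp=$(mktemp -d)
printf '%s\n' 1 3 5 4 3 1 5 > "$tmp/file1"
printf '%s\n' 'A B C D E F G' 'H I J J K L M' 'N O P Q R S T' > "$tmp/file2"
pick_columns 5 "$tmp/file1" "$tmp/file2"
rm -r "$tmp"
```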

Using an array in AWK when working with two files

I have two files that I merged on a key using the code below.
file1
-------------------------------
1 a t p bbb
2 b c f aaa
3 d y u bbb
2 b c f aaa
2 u g t ccc
2 b j h ccc
file2
--------------------------------
1 11 bbb
2 22 ccc
3 33 aaa
4 44 aaa
I merged these two files on the key using this code:
awk 'NR==FNR{a[$3]=$0;next;}{for(x in a){if(x==$5) print $1,$2,$3,$4,a[x]};
My question is how I can save $2 of file2 in a variable or array and print it again after a[x].
My desired result is :
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
As you can see, the first 7 columns are the result of my merge code. I need to add the last column (field 2 of a[x]) to my result.
Important:
My next question: if I have a .awk file, how can I use shell features such as a pipe (| column -t) or redirecting the result to a file (awk... > result.txt)? I always use these at the command prompt. Can I use them inside the code in my .awk file?
Simply add all of file2 to an array, and use split to hold the bits you want:
awk 'FNR==NR { two[$0]++; next } { for (i in two) { split(i, one); if (one[3] == $NF) print $1,$2,$3,$4, i, one[2] } }' file2 file1
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Regarding your last question; you can also add 'pipes' and 'writes' inside of your awk. Here's an example of a pipe to column -t:
Contents of script.awk:
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
Run like: awk -f script.awk file2 file1
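The 'writes' mentioned above work the same way as the pipe: print can redirect to a file name from inside the awk program. A small sketch, with the output path passed in through -v (save_columns and the choice of columns are assumptions made for the example):

```shell
# save_columns OUTFILE: write fields 1 and 5 of every stdin line to OUTFILE
# from inside awk, rather than redirecting the whole awk command in the shell.
save_columns() {
  awk -v out="$1" '
    { print $1, $5 > out }   # ">" truncates on first use, then keeps appending
    END { close(out) }       # flush and release the file
  '
}

result=$(mktemp)
printf '%s\n' '1 a t p bbb' '2 b c f aaa' | save_columns "$result"
cat "$result"
rm -f "$result"
```

Inside a .awk file the same line could use a literal name such as print $1, $5 > "result.txt".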
EDIT:
Add the following to your shell script:
results=$(awk '
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
' "$1" "$2")
echo "$results"
Run like:
./script.sh file2.txt file1.txt
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Your current script is:
awk 'NR==FNR { a[$3]=$0; next }
{ for (x in a) { if (x==$5) print $1,$2,$3,$4,a[x] } }'
(Actually, the original is missing the second close brace for the second pattern/action pair.)
It seems that you process file2 before you process file1.
You shouldn't need the loop in the second block. And you can make life easier for yourself by using the splitting in the first phase to keep the values you need:
awk 'NR==FNR { c1[$3] = $1; c2[$3] = $2; next }
{ print $1, $2, $3, $4, c1[$5], c2[$5], $5, c2[$5] }' file2 file1
You can upgrade that to check whether c1[$5] and c2[$5] are defined, presumably skipping the row if they are not.
Given your input files, the output is:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Give or take column spacing, that's what was requested, except that when several rows of file2 share a key (here aaa), only the last one is kept. Column spacing can be fixed by using printf instead of print, or setting OFS to tab, or ...
The c1 and c2 notations for columns 1 and 2 are OK for two columns. If you need more, then you should probably use the 2D array notation:
awk 'NR==FNR { for (i = 1; i <= NF; i++) col[i,$3] = $i; next }
{ print $1, $2, $3, $4, col[1,$5], col[2,$5], $5, col[2,$5] }' file2 file1
This produces the same output as before.
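The single-lookup versions above keep only the last file2 row for a repeated key such as aaa, whereas the desired output in the question lists both. One way to keep every row without looping over the whole array is to number the rows per key. This is a sketch only; merge_by_key and the array names are made up, and rows of file1 with no matching key are skipped:

```shell
# merge_by_key LOOKUP DATA: LOOKUP has the key in column 3, DATA in column 5.
# For each DATA row, print one output line per LOOKUP row with the same key.
merge_by_key() {
  awk '
    NR == FNR {                       # first file: store every row per key
      cnt[$3]++
      row[$3, cnt[$3]] = $0
      val[$3, cnt[$3]] = $2
      next
    }
    ($5 in cnt) {                     # second file: skip keys never seen
      for (i = 1; i <= cnt[$5]; i++)
        print $1, $2, $3, $4, row[$5, i], val[$5, i]
    }' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' '1 11 bbb' '2 22 ccc' '3 33 aaa' '4 44 aaa' > "$tmp/file2"
printf '%s\n' '1 a t p bbb' '2 b c f aaa' '2 u g t ccc' > "$tmp/file1"
merge_by_key "$tmp/file2" "$tmp/file1"
rm -r "$tmp"
```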
To achieve what you ask, append the second field to the whole line when processing the first file on the command line (file2), with a[$3]=$0 OFS $2. For your second question, awk has a variable that separates fields in the output, OFS; assign a tab to it and experiment. Your script would look like:
awk '
BEGIN { OFS = "\t"; }
NR==FNR{
a[$3]=$0 OFS $2;
next;
}
{
for(x in a){
if(x==$5) print $1,$2,$3,$4,a[x]
}
}
' file2 file1
That yields:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22