extract columns from multiple text files with awk

I am trying to extract column1 based on the values of column2. I would like to print the values of column1 only if column2 is greater than 5 and less than or equal to 30.
I also need to print the total count of each column1 value in the resulting output. How can I do this with awk across multiple text files?
A sample text file is shown below.
col1 col2
aa 25
bb 4
cc 6
dd 23
aa 30
The output would be
aa
cc
dd
aa
Total number of aa is 2
Total number of cc is 1
Total number of dd is 1

Something like this to get you started:
{
    # keep the row when column 2 is greater than 5 and at most 30
    if ($2 <= 30 && $2 > 5) {
        print $1          # print the selected column 1 value
        tot[$1] += 1      # count how many times each value was printed
    }
}
END {
    for (i in tot) {
        print "Total number of", i, "is", tot[i]
    }
}
Output:
$ awk -f i.awk input
aa
cc
dd
aa
Total number of aa is 2
Total number of cc is 1
Total number of dd is 1
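Since awk treats every file named on the command line as one continuous input stream, the same script works unchanged across several files, and the END totals then cover all of them; for example (the file names below are only placeholders):
$ awk -f i.awk file1.txt file2.txt file3.txt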

Related

AWK program that can read a second file either from a file specified on the command line or from data received via a pipe

I have an AWK program that does a join of two files, file1 and file2. The files are joined based on a set of columns. I placed the AWK program into a bash script that I named join.sh. See below. Here is an example of how the script is executed:
./join.sh '1,2,3,4' '2,3,4,5' file1 file2
That says this: Do a join of file1 and file2, using columns (fields) 1,2,3,4 of file1 and columns (fields) 2,3,4,5 of file2.
That works great.
Now what I would like to do is to filter file2 and pipe the results to the join tool:
./fetch.sh ident file2 | ./join.sh '1,2,3,4' '2,3,4,5' file1
fetch.sh is a bash script containing an AWK program that fetches the rows in file2 with primary key ident and outputs to stdout the rows that were fetched.
Unfortunately, that pipeline is not working. I get no results.
Recap: I want the join program to be able to read the second file either from a file that I specify on the command line or from data received via a pipe. How to do that?
Here is my bash script, named join.sh
#!/bin/bash
awk -v f1cols=$1 -v f2cols=$2 '
BEGIN { FS=OFS="\t"
m=split(f1cols,f1,",")
n=split(f2cols,f2,",")
}
{ sub(/\r$/, "") }
NR == 1 { b[0] = $0 }
(NR == FNR) && (NR > 1) { idx2=$(f2[1])
for (i=2;i<=n;i++)
idx2=idx2 $(f2[i])
a[idx2] = $0
next
}
(NR != FNR) && (FNR == 1) { print $0, b[0] }
FNR > 1 { idx1=$(f1[1])
for (i=2;i<=m;i++)
idx1=idx1 $(f1[i])
for (idx1 in a)
print $0, a[idx1]
}' $3 $4
I'm not sure if this is 'correct' as you haven't provided any example input and expected output, but does using - to signify stdin work for your use-case? awk treats a file operand of - as standard input, so the piped data stands in for the second file. E.g.
cat file1
1 2 3 4
AA BB CC DD
AA EE FF GG
cat file2
1 2 3 4
AA ZZ YY XX
AA 11 22 33
./join.sh '1' '1' file1 file2
1 2 3 4 1 2 3 4
AA ZZ YY XX AA BB CC DD
AA ZZ YY XX AA EE FF GG
AA 11 22 33 AA BB CC DD
AA 11 22 33 AA EE FF GG
cat file2 | ./join.sh '1' '1' file1 -
1 2 3 4 1 2 3 4
AA ZZ YY XX AA BB CC DD
AA ZZ YY XX AA EE FF GG
AA 11 22 33 AA BB CC DD
AA 11 22 33 AA EE FF GG
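If your shell is bash, process substitution would be another way to hand join.sh the filtered rows as its second "file" without a temporary file; a sketch, reusing the fetch.sh invocation from the question:
./join.sh '1,2,3,4' '2,3,4,5' file1 <(./fetch.sh ident file2)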
be able to read (...) from data received via a pipe
GNU AWK does support using getline from a pipe; consider the following simple example:
awk 'BEGIN{cmd="seq 7";while((cmd | getline) > 0){print $1*7};close(cmd)}' emptyfile
gives output
7
14
21
28
35
42
49
Explanation: I process the output of the seq 7 command (the numbers from 1 to 7 inclusive, each on a separate line); the body of the while loop is executed for each line of seq 7's output, and the fields are set just as they are during normal processing.
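As a side note, the cmd | getline var form is the one to reach for when you want each piped line stored in a variable without disturbing $0, NF or the current fields; a minimal sketch:
awk 'BEGIN{cmd="seq 3"; while ((cmd | getline line) > 0) print "got:", line; close(cmd)}'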

Count number of occurrences of a number larger than x in every row

I have a file with multiple rows and 26 columns. I want to count the number of occurrences of values that are higher than 0 (I guess "different from 0" would also be valid) in each row, excluding the first two columns. The file looks like this:
X Y Sample1 Sample2 Sample3 .... Sample24
a a1 0 7 0 0
b a2 2 8 0 0
c a3 0 3 15 3
d d3 0 0 0 0
I would like to have an output file like this:
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
awk or sed would be good.
I saw a similar question but in that case the columns were summed and the desired output was different.
awk 'NR==1{printf "X\tY\tResult%s",ORS}     # print the header
NR>1{
  count=0;                                  # initialize the count for each row to zero
  for(i=3;i<=NF;i++){                       # iterate from field 3 to the end; NF is the number of fields
    if($i>0){                               # $i expands to $3, $4 and so on, i.e. the sample fields
      count++;                              # increment when the condition is true
    }
  };
  printf "%s\t%s\t%s%s",$1,$2,count,ORS     # for each row, print the output line
}' file
should do that
another awk
$ awk '{if(NR==1) c="Result";
else for(i=3;i<=NF;i++) c+=($i>0);
print $1,$2,c; c=0}' file | column -t
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
$ awk '{print $1, $2, (NR>1 ? gsub(/ [1-9]/,"") : "Result")}' file
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
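The last one-liner works because gsub returns the number of substitutions it made, and / [1-9]/ matches a space followed by a non-zero leading digit, so the return value is exactly the count of values greater than 0 (assuming the first two columns are non-numeric and the numbers have no leading zeros or signs). A quick illustration on one sample row:
$ echo 'c a3 0 3 15 3' | awk '{print gsub(/ [1-9]/,"")}'
3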

use awk for printing selected rows

I have a text file and I want to print only selected rows of it. Below is the dummy format of the text file:
Name Sub Marks percentage
A AB 50 50
Name Sub Marks percentage
b AB 50 50
Name Sub Marks percentage
c AB 50 50
Name Sub Marks percentage
d AB 50 50
I need the output as follows (I don't need the heading before every record, and I need only 3 columns, omitting "Marks"):
Name Sub percentage
A AB 50
b AB 50
c AB 50
d AB 50
Please suggest an awk command with which I can achieve this. Thanks for your support.
You can use:
awk '(NR == 1) || ((NR % 2) == 0) {print $1" "$2" "$4}' inputFile
This will print columns one, two and four but only if the record number is one or even. The results are:
Name Sub percentage
A AB 50
b AB 50
c AB 50
d AB 50
If you want it nicely formatted, you can use printf instead:
awk '(NR == 1) || ((NR % 2) == 0) {printf "%-10s %-10s %10s\n", $1, $2, $4}' inputFile
Name Sub percentage
A AB 50
b AB 50
c AB 50
d AB 50
awk solution:
awk 'NR==1 || !(NR%2){ print $1,$2,$4 }' OFS='\t' file
NR==1 || !(NR%2) - consider only the 1st line and every even-numbered line
OFS='\t' - set the output field separator to a tab
The output:
Name Sub percentage
A AB 50
b AB 50
c AB 50
d AB 50
In case the input file has a slightly different format, the above solutions will fail. For example:
Name Sub Marks percentage
A AB 50 50
Name Sub Marks percentage
b AB 50 50
c AB 50 50
Name Sub Marks percentage
d AB 50 50
In such a case, something like this will remove the repeated headers wherever they appear:
$ awk '$0!=h;NR==1{h=$0}' file1
Name Sub Marks percentage
A AB 50 50
b AB 50 50
c AB 50 50
d AB 50 50
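If you also want to drop the Marks column in that variant, the same idea extends naturally; here is a sketch (the first rule prints columns 1, 2 and 4 of every line that differs from the remembered header, and the second rule remembers the first line as that header), which on the second sample gives:
$ awk '$0!=h{print $1,$2,$4} NR==1{h=$0}' file1
Name Sub percentage
A AB 50
b AB 50
c AB 50
d AB 50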

average of specific rows in a file

I have 6 rows in a file. I need to find the average of only specific rows; the other lines should be left as they are. The average should be calculated for A1 and A2, and for B1 and B2.
Input:
A1 1 1 2
A2 5 6 1
A3 1 1 1
B1 10 12 12
B2 10 12 10
B3 100 200 300
Output:
A1A2 3 3.5 1.5
A3 1 1 1
B1B2 10 12 11
B3 100 200 300
EDIT: There are n columns in total
awk to the rescue!
$ awk '/[AB][12]/{a=substr($1,1,1);
k=a"1"a"2";
c1[k]+=$2; c2[k]+=$3; c3[k]+=$4; n[k]++; next}
1;
END{for(k in c1)
print k, c1[k]/n[k], c2[k]/n[k], c3[k]/n[k]}' file | sort | column -t
A1A2 3 3.5 1.5
A3 1 1 1
B1B2 10 12 11
B3 100 200 300
Pattern-match the grouped rows, create a key, and accumulate the sum of each field and the count of rows per key; print unmatched rows as they are; when done, print the averaged rows. Since the order is not preserved, sort and pipe to column for easy formatting.
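Since the edit says there are n columns in total, here is a sketch of the same idea generalized to any number of numeric columns (assumptions: whitespace-separated fields and the same A1/A2, B1/B2 grouping as above):
awk '/^[AB][12] /{a=substr($1,1,1); k=a"1"a"2"
       for(i=2;i<=NF;i++) sum[k,i]+=$i      # accumulate every numeric column per group
       n[k]++; nf=NF; next}
     1                                      # pass other lines through unchanged
     END{for(k in n){line=k
           for(i=2;i<=nf;i++) line=line OFS sum[k,i]/n[k]
           print line}}' file | sort | column -t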
$ cat tst.awk
$1 ~ /^[AB]1$/ { for (i=2;i<=NF;i++) val[$1,i]=$i; next }
$1 ~ /^[AB]2$/ { p=$1; sub(2,1,p); $1=p $1; for (i=2;i<=NF;i++) $i=($i + val[p,i])/2 }
{ print }
$ awk -f tst.awk file | column -t
A1A2 3 3.5 1.5
A3 1 1 1
B1B2 10 12 11
B3 100 200 300

Awk script to loop and perform mathematical operations

I have a bash and awk script that I use to extract data from a text file.
However, it is too slow with large datasets and doesn't work perfectly. I believe it is possible to write all of my bash loops as one awk command, and I am asking for help with this.
cat dummy_list
AAA
AAAA
AAAAA
cat dummy_table
13 19 AAA 69 96 "ID-999" 34
23 42 AAA 12 19 "ID-999" 64
53 79 AAA 43 58 "ID-482" 36
13 43 AAA 12 15 "ID-492" 75
23 90 AAA 45 87 "ID-492" 34
12 41 AAAA 76 79 "ID-923" 23
19 58 AAAA 15 87 "ID-923" 75
10 40 AAAA 18 82 "ID-482" 23
11 18 AAAA 18 82 "ID-482" 52
15 19 AAAA 18 82 "ID-482" 62
59 69 AAAA 10 18 "ID-482" 83
78 89 AAAA 32 41 "ID-983" 24
23 53 AAAAA 78 99 "ID-916" 82
What I want from this table:
For every dummy_list item (AAA, AAAA or AAAAA), extract how many different times an ID range was mentioned (by this I mean unique combinations of columns 4+5+6, like 69 96 "ID-999"). There are duplicate ranges (like 18 82 "ID-482") and I have to discard them.
My script looks like this:
while read a; do
awk -v VAR="$a" '($3==VAR) {print $4"\t"$5"\t"$6}' dummy_table |
sort -u |
cut -f 3 |
sort |
uniq -c |
awk '{print $1}' |
tr '\n' ' ' |
awk -v VAR="$a" '{print VAR"\t"$0}'
done < dummy_list
AAA 1 2 2
AAAA 2 2 1
AAAAA 1
It's the same as saying that for AAA, "ID-482" is mentioned once, "ID-492" twice and "ID-999" twice.
This is the output I want.
For every dummy_list item, get the average number of times it is mentioned with the same ID. For example, AAA occurs twice with "ID-999", once with "ID-482" and twice with "ID-492", so it's (2+1+2)/3 = 1.66.
My script looks like this:
while read a ; do
ID_TIMES=$(awk -v VAR="$a" '($3==VAR) {print $6}' dummy_table |
sort -u |
wc -l) &&
awk -v VAR="$a" '($3==VAR) {print $6}' dummy_table |
sort |
uniq -c |
awk -v VAR="$ID_TIMES" '{sum+=$1} END {print sum/VAR}'
done < dummy_list
AAA 1.666
AAAA 2.333
AAAAA 1
For every dummy_list item, extract the ID range and calculate the proportion between columns.
For example, for AAA's "ID-999":
RANGE1 = sum of $5-$4 = (96-69) + (19-12) = 34
RANGE2 = sum of $7 = 34 + 64 = 98
then RANGE2*100/RANGE1 = 288
For the output like this:
AAA 288 240 242
....
AAAAA 390
I wasn't able to write such a script by myself, as I got stuck with the two variables $RANGE1 and $RANGE2.
If it is possible, it would be great to discard duplicate ranges like 18 82 "ID-482" in this step as well.
I believe that all three of these operations can be calculated with a single awk command, and I feel desperate about my scripts. I really hope that someone can help me with this.
You can try this.
file a.awk:
BEGIN {
    # read the list of items
    while ( ( getline < "dummy_list" ) > 0 )
    {
        items[$1] = 0
    }
}
{
    # count the unique ids per item
    key = $3 SUBSEP $6
    if ( ! ( key in ids ) && ( $3 in items ) )
    {
        unique_ids[$3] += 1
    }
    # count the amount of duplication
    ids[$3,$6] += 1
    # accumulate the range parameters
    range1[$3,$6] += $5 - $4
    range2[$3,$6] += $7
}
END {
    for ( item in items )
    {
        print "--- item = " item " ---\n"
        for ( key in ids )
        {
            split ( key, s, SUBSEP );
            if ( s[1] != item ) continue;
            range = range2[key] * 100 / range1[key]
            # awk division is already floating point, so no conversion function is needed here
            average[item] += ids[key] / unique_ids[item];
            print "id = " s[2] "\tamount of dup = " ids[key] " range = " int ( range )
        }
        print "\naverage = " average[item] "\n"
    }
}
run:
awk -f a.awk dummy_table
output:
--- item = AAAA ---
id = "ID-983" amount of dup = 1 range = 266
id = "ID-923" amount of dup = 2 range = 130
id = "ID-482" amount of dup = 4 range = 110
average = 2.33333
--- item = AAAAA ---
id = "ID-916" amount of dup = 1 range = 390
average = 1
--- item = AAA ---
id = "ID-999" amount of dup = 2 range = 288
id = "ID-482" amount of dup = 1 range = 240
id = "ID-492" amount of dup = 2 range = 242
average = 1.66667
There is one thing I can't understand: how you got 225 for "ID-482" and item AAA in question #3.
RANGE2 * 100 / RANGE1 = 36 * 100 / ( 58 - 43 ) = 240.
Are you sure that your example in question #3 is correct?
Only a partial answer, but here is a one-liner solution for your first problem (note that the count[$3][$6] arrays-of-arrays syntax requires GNU awk):
awk -F' ' '{group[$3]++;ind[$6]++};{count[$3][$6]+=1}; END{for (i in group){for (j in ind) if(count[i][j] > 0) print i, j, count[i][j]}}' dummy_variable.txt
Output:
AAA "ID-482" 1
AAA "ID-999" 2
AAA "ID-492" 2
AAAA "ID-923" 2
AAAA "ID-482" 4
AAAA "ID-983" 1
AAAAA "ID-916" 1
It is then fairly trivial to use this output to calculate the answer to your second question.
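For what it's worth, here is a sketch of problem #1 done in a single pass with duplicate ranges discarded (assumptions: the dummy_list/dummy_table names from the question and whitespace-separated fields; the output format is item, ID, count rather than the one-line-per-item layout shown in the question):
awk 'NR==FNR { items[$1]=1; next }                 # first file: remember the wanted items
     $3 in items {
         range = $4 SUBSEP $5 SUBSEP $6            # the whole range is the uniqueness key
         if (!(($3, range) in seen)) {             # skip duplicate ranges such as 18 82 "ID-482"
             seen[$3, range] = 1
             cnt[$3, $6]++                         # per-item, per-ID counter
         }
     }
     END { for (k in cnt) { split(k, p, SUBSEP); print p[1], p[2], cnt[k] } }' dummy_list dummy_table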