Selecting columns using specific patterns then finding sum and ratio - awk

I want to calculate the sum and ratio values from the data below. (The actual data contains more than 200,000 columns and 45,000 rows (lines).)
For clarity, I have given only a simple data sample.
#Frame BMR_42#O22 BMR_49#O13 BMR_59#O13 BMR_23#O26 BMR_10#O13 BMR_61#O26 BMR_23#O25
1 1 1 0 1 1 1 1
2 0 1 0 0 1 1 0
3 1 1 1 0 0 1 1
4 1 1 0 0 1 0 1
5 0 0 0 0 0 0 0
6 1 0 1 1 0 1 0
7 1 1 1 1 0 0 0
8 1 1 1 0 0 0 0
9 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0
The columns need to be selected by a certain criterion: I only consider the columns whose header contains "#O13". Below I have given the selected columns from the above example.
BMR_49#O13 BMR_59#O13 BMR_10#O13
1 0 1
1 0 1
1 1 0
1 0 1
0 0 0
0 1 0
1 1 0
1 1 0
1 1 1
0 0 0
From the selected columns, I want to calculate:
1) the sum of all the "1"s. In this example the value is 16.
2) the number of rows containing at least one "1". In the above example there are 8 such rows.
Lastly,
3) the ratio of the total of all "1"s to the total number of rows containing a "1".
That is: (total of all "1"s) / (total rows with an occurrence of "1").
Example: 16/8 = 2
As a start, I tried this command to select only the columns with "#O13":
awk '{for (i=1;i<=NF;i++) if (i~/#O13/); print ""}' $file2
Although this runs, it doesn't print any values (it matches the regex against the field index i rather than the field value $i, and the if has an empty body).

This should do:
awk 'NR==1{for (i=1;i<=NF;i++) if ($i~/#O13/) a[i];next} {f=0;for (i in a) if ($i) {s++;f++};if (f) r++} END {print "number of 1="s"\nrows with 1="r"\nratio="s/r}' file
number of 1=16
rows with 1=8
ratio=2
A more readable version:
awk '
NR==1{
    for (i=1;i<=NF;i++)
        if ($i~/#O13/)
            a[i]        # remember which column numbers to process
    next
}
{
    f=0
    for (i in a)
        if ($i=="1") {
            s++         # running total of all 1s
            f++         # 1s seen in this row
        }
    if (f) r++          # count rows containing at least one 1
}
END {
    print "number of 1="s \
          "\nrows with 1="r \
          "\nratio="s/r
}
' file
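Note that if the input contained no rows with a 1, r would stay zero and s/r would abort with a division-by-zero error. A guarded END block (a minimal sketch, assuming a message is preferable to an error in that case):
END {
    if (r) print "number of 1="s "\nrows with 1="r "\nratio="s/r
    else print "no rows contained a 1"
}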

Related

replace values in column matching based on 2 columns

I have a file f1 which looks like this (it has 1651 lines):
fam0110 G110 0 0 0 1 T G
fam6106 G6106 0 0 0 2 T T
fam1000 G1000 0 0 0 2 T T
...
and I have a file f2 which looks like this (it also has 1651 lines):
fam1000 G1000 1 1
fam6106 G6106 1 1
fam0110 G110 2 2
...
I would like to replace the 6th column in f1 with the 3rd column of f2, so that the rows match by the 1st and 2nd columns.
the output would look like this:
fam0110 G110 0 0 0 2 T G
fam6106 G6106 0 0 0 1 T T
fam1000 G1000 0 0 0 1 T T
I tried to do it with:
awk 'FNR==NR{a[NR]=$3;next}{$6=a[FNR]}1' pheno_laser2 chr9.plink.ped > chr9.new.ped
but this doesn't work because the lines are not sorted in the same way, so I need to match by the values in the 1st and 2nd columns in both files.
Please advise
You have to use only the first two fields as the hash key, since you want to match on them, not on the line number or anything else.
awk 'FNR==NR{a[$1, $2]=$3;next} {$6=a[$1, $2]}1' file2 file1
Testing with your examples:
fam0110 G110 0 0 0 2 T G
fam6106 G6106 0 0 0 1 T T
fam1000 G1000 0 0 0 1 T T
Note that it would print an empty 6th field for any non-matching line; I assume this is OK.
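If that is not OK, a small variation (a sketch) assigns the 6th field only when the key exists in f2, leaving non-matching lines unchanged:
awk 'FNR==NR{a[$1, $2]=$3; next} ($1, $2) in a{$6=a[$1, $2]} 1' file2 file1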

awk: take entries from file and add values in between

I have the following file:
2 some
5 some
8 some
10 thing
15 thing
19 thing
Now I want to end up with output where, for "some", the rows numbered 2, 5, and 8 contain a 1 and every other row contains a 0. It doesn't matter how many rows there are. This means for "some":
0
1
0
0
1
0
0
1
0
0
and for "thing"
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
1
0
Is this possible in a quick way with awk? I mean with something like:
awk '{for(i=1;i<=10;i++) entries[$i]=0 for(f=0;<=NF;f++) entries[$f]=1' testfile.txt
Another awk; the output terminates at the last index:
awk -v key='thing' '$2==key{while(++c<$1) print 0; print 1}' file
To add some extra 0s after the last 1, add END{while(i++<3) print 0}.
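Combined into one command (the count 3 is an arbitrary number of trailing zeros; adjust as needed):
awk -v key='thing' '$2==key{while(++c<$1) print 0; print 1} END{while(i++<3) print 0}' file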
Something like this seems to work for producing the "some" data:
$ cat file1
2 some
5 some
8 some
10 thing
15 thing
19 thing
$ awk 'max<$1 && $2=="some"{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
0
1
0
0
1
0
0
1
Similarly, this one works for the "thing" data:
$ awk 'max<$1 && $2=="thing"{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
Alternatively, as mentioned by glennjackman in the comments, we could use an external variable to select between "some" and "thing":
$ awk -v word="some" 'max<$1 && $2==word{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
# for thing just apply awk -v word="thing"
You can parameterize it from the shell by putting the word in a variable:
$ w="some"   # selectable / set by shell, by script, etc.
$ awk -v word="$w" 'max<$1 && $2==word{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
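To produce the output for both words in one go, you can loop in the shell (a sketch; the output file names are just examples):
for w in some thing; do
    awk -v word="$w" 'max<$1 && $2==word{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1 > "$w.txt"
done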
perl:
perl -lanE '
    push @{ $idx{$F[1]} }, $F[0] - 1;  # subtract 1 because we are working with
                                       # (zero-based) array indices
    $max = $F[0];                      # I assume the input is sorted by column 1
} END {
    $, = "\n";
    for $word (keys %idx) {
        # create a $max-sized array filled with zeroes
        @a = (0) x $max;
        # then, populate the entries which should be 1
        @a[ @{ $idx{$word} } ] = (1) x @{ $idx{$word} };
        say $word, @a;
    }
' file | pr -2T -s | nl -v0
0 thing some
1 0 0
2 0 1
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 0
10 1 0
11 0 0
12 0 0
13 0 0
14 0 0
15 1 0
16 0 0
17 0 0
18 0 0
19 1 0

delete many empty spaces between columns and make only one white-space between columns

I have a file with more than 2500 columns. The columns are separated by a tab or by several white-spaces.
The data format in the file is as below:
1    1    0
1    1    0
0    1    0
1    0    1
1    0    0
1    1    1
1    0    1
I want to delete the tabs or runs of white-space between the columns and leave only a single space between the columns, as below.
1 1 0
1 1 0
0 1 0
1 0 1
1 0 0
1 1 1
1 0 1
How do I delete the extra spaces?
This should do:
awk '{$1=$1}1' file
1 1 0
1 1 0
0 1 0
1 0 1
1 0 0
1 1 1
1 0 1
Assigning $1=$1 forces awk to rebuild the record, which squeezes every run of spaces and tabs into a single space; the trailing 1 is a true condition whose default action is to print the line.
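For example, a quick check (generating tab- and multi-space-separated sample input with printf):
$ printf '1\t\t1   0\n0\t1\t0\n' | awk '{$1=$1}1'
1 1 0
0 1 0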
With sed:
sed 's/[[:space:]]\+/ /g' filename
Alternatively with tr (note that tr reads standard input, so the file must be redirected):
tr -s '[:blank:]' ' ' < filename

Calculating ratio value within a line which contains binary numbers "0" & "1"

I have a data file which contains more than 2000 lines and 45001 columns.
The first column is actually a "string" which explains the data type.
Starting from column #2, up to column #45001, the data is represented as "1" or "0".
For example, the pattern of data in a line is
(0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0)
The total number of data points is 25. Within this data line, there are 5 sub-groups made up only of "1"s, e.g. (11 111 1111 1 111). The "0"s in between the sub-groups are treated as delimiters. The total of all "1"s is 13.
I would like to calculate the ratio of
(total of all "1"s / total number of sub-groups made up only of "1"s)
That is
(13/5).
I tried this code to calculate the total of all "1"s:
awk -F '0' '{print NF}' < inputfile.in
This gives the value 13.
But I don't know how to go further from here to calculate the ratio that I want.
I don't know how to find the number of sub-groups within each line because the number of occurrences of "1"s and "0"s is random.
I wish to get some kind help to sort out this problem. Any help is appreciated in advance.
It is not clear to me from the description what the format of the input file is. Assume the input looks like:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
To count up the number of ones and the number of groups of ones and take their ratio:
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; print s1/s2}' file
2.6
Update: Handling all zeros
Suppose one of the lines in the file has all zeros:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
For the second line, both sums are zero, which would lead to a divide-by-zero error. We can avoid that by adding an if statement which prints the ratio if one exists, or 0/0 if it doesn't:
if (s2>0)print s1/s2; else print s1"/"s2
The complete code is now:
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; if (s2>0)print s1/s2; else print s1"/"s2}' file
2.6
0/0
How it works
The code uses three variables. f is a flag which is true (1) if we are currently in a group of ones and false (0) otherwise. s1 is the number of ones on the line. s2 is the number of groups of ones on the line.
f=0;s1=0;s2=0
At the beginning of each line, we initialize the variables.
for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}
We loop over each field on the line starting with field 2. If the field contains a 1, we increment counter s1. If the field is 1 and is the start of a new group, we increment s2.
if (s2>0)print s1/s2; else print s1"/"s2
If we encountered at least one one, we print the ratio s1/s2. Otherwise, we print 0/0.
Here is an awk that does what you need:
cat file
data 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
data 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
data 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
data 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BMR_10#O24-BMR_6#O13-H13 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
data 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1
awk '{$1="";$0="0 "$0" 0";t=split($0,b,"1")-1;gsub(/ +/,"");n=split($0,a,"[^1]+")-2;print (n?t/n:0)}' file
2.6
0
25
11
5.5
3
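For readability, here is the same one-liner spelled out with comments (the logic is unchanged):
awk '{
    $1 = ""                          # drop the label column
    $0 = "0 " $0 " 0"                # pad so every run of 1s is delimited on both sides
    t = split($0, b, "1") - 1        # t = number of "1" characters on the line
    gsub(/ +/, "")                   # squeeze out all spaces
    n = split($0, a, "[^1]+") - 2    # n = number of runs of consecutive 1s
    print (n ? t/n : 0)              # ratio, or 0 when the line has no 1s
}' file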

Selecting first nth rows by groups using AWK

I have the following file with 4 fields. There are 3 groups in field 2, and the 4th field consists of 0s and 1s.
The first field is just the index.
I would like to use AWK to do the following task:
Select the first 3 rows of group 1 (Note that group 1 has only 2 rows). The number of rows is based on the number of 1's found in the 4th field times 3.
Select the first 6 rows of group 2. The number of rows is based on the number of 1's found in the 4th field times 3.
Select the first 9 rows of group 3. The number of rows is based on the number of 1's found in the 4th field times 3.
So 17 rows are selected for the output file.
Thank you for your help.
Input
1 1 TN1148 1
2 1 S52689 0
3 2 TA2081 1
4 2 TA2592 1
5 2 TA4011 0
6 2 TA4246 0
7 2 TA4275 0
8 2 TB0159 0
9 2 TB0392 0
10 3 TB0454 1
11 3 TB0496 1
12 3 TB1181 1
13 3 TC0027 0
14 3 TC1340 0
15 3 TC2247 0
16 3 TC3094 0
17 3 TD0106 0
18 3 TD1146 0
19 3 TD1796 0
20 3 TD3587 0
Output
1 1 TN1148 1
2 1 S52689 0
3 2 TA2081 1
4 2 TA2592 1
5 2 TA4011 0
6 2 TA4246 0
7 2 TA4275 0
8 2 TB0159 0
10 3 TB0454 1
11 3 TB0496 1
12 3 TB1181 1
13 3 TC0027 0
14 3 TC1340 0
15 3 TC2247 0
16 3 TC3094 0
17 3 TD0106 0
18 3 TD1146 0
The key to this awk program is to pass the input file in twice: once to count how many rows you want from each group, and once to print them.
awk '
NR == FNR {wanted_rows[$2] += 3*$4; next}   # 1st pass: rows wanted per group = 3 x (number of 1s)
--wanted_rows[$2] >= 0 {print}              # 2nd pass: print while the group quota lasts
' input_file.txt input_file.txt
#!/usr/bin/awk -f
# by Dennis Williamson - 2010-12-02
# for http://stackoverflow.com/questions/4334167/selecting-first-nth-rows-by-groups-using-awk
$2 == prev {                 # still inside the current group
    count += $4
    groupcount++
    array[idx++] = $0
}
$2 != prev {                 # a new group starts
    if (NR > 1) {
        for (i=0; i<count*3; i++) {      # print up to 3 x (number of 1s) rows,
            if (i == groupcount) break   # but no more rows than the group has
            print array[i]
        }
    }
    prev = $2
    count = $4               # start the count of 1s from this row's 4th field
    groupcount = 1
    split("", array)         # delete the array
    idx = 0
    array[idx++] = $0
}
END {                        # flush the buffered last group
    for (i=0; i<count*3; i++) {
        if (i == groupcount) break
        print array[i]
    }
}
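Save the script as, say, select_rows.awk (a name chosen here for illustration), make it executable, and run it on the input file:
chmod +x select_rows.awk
./select_rows.awk input_file.txt
or, without the executable bit:
awk -f select_rows.awk input_file.txt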