Calculating a ratio within a line that contains the binary values "0" and "1" - awk

I have a data file which contains more than 2000 lines and 45001 columns.
The first column is a string that describes the data type.
From column #2 up to column #45001, the data is represented as "1" or "0".
For example, the pattern of data in a line is
(0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0)
This line holds 25 values. Within it there are 5 sub-groups made up only of "1"s, i.e. (11 111 1111 1 111). The "0"s between the sub-groups act as delimiters. The total number of "1"s is 13.
I would like to calculate the ratio
(total number of "1"s / number of sub-groups made up only of "1"s)
That is
(13/5).
I tried this code to calculate the total number of "1"s:
awk -F '0' '{print NF}' < inputfile.in
This gives the value 13.
But I don't know how to go further from here to calculate the ratio that I want.
I don't know how to find the number of sub-groups within each line because the "1"s and "0"s occur at random.
I would appreciate any help with this problem.
Thanks in advance.

It is not clear to me from the description what the format of the input file is. Assume the input looks like:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
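To count the ones and the groups of ones and take their ratio (the loop starts at field 2 because, per the question, field 1 is a label; in this label-free sample that only skips a leading 0, which does not change the result):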
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; print s1/s2}' file
2.6
Update: Handling all zeros
Suppose one of the lines in the file has all zeros:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
For the second line, both sums are zero, which would lead to a division-by-zero error. We can avoid that by adding an if statement which prints the ratio if one exists, or 0/0 if it doesn't:
if (s2>0)print s1/s2; else print s1"/"s2
The complete code is now:
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; if (s2>0)print s1/s2; else print s1"/"s2}' file
2.6
0/0
How it works
The code uses three variables. f is a flag which is true (1) if we are currently in a group of ones and false (0) otherwise. s1 is the number of ones on the line. s2 is the number of groups of ones on the line.
f=0;s1=0;s2=0
At the beginning of each line, we initialize the variables.
for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}
We loop over each field on the line starting with field 2. If the field contains a 1, we increment counter s1. If the field is 1 and is the start of a new group, we increment s2.
if (s2>0)print s1/s2; else print s1"/"s2
If we encountered at least one group of ones, we print the ratio s1/s2. Otherwise, we print 0/0.
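If the one-liner is hard to follow, the same flag-based counting can be sketched in Python. This is only an illustration and not part of the answer above: the function name ones_per_group and the label "data" in the sample line are made up here, and the comments map each variable back to the awk one it mirrors.

```python
def ones_per_group(fields):
    """Return (ones, groups) for a sequence of "0"/"1" strings."""
    ones = 0          # mirrors s1: how many ones were seen
    groups = 0        # mirrors s2: how many runs of ones were seen
    in_group = False  # mirrors f: are we inside a run of ones?
    for value in fields:
        is_one = value == "1"
        if is_one:
            ones += 1
            if not in_group:
                groups += 1  # a one right after a zero starts a new run
        in_group = is_one
    return ones, groups


# Hypothetical sample line: a label in field 1, then the 25 values.
line = "data 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0"
ones, groups = ones_per_group(line.split()[1:])  # skip the label
print(ones / groups if groups else "0/0")
```

As in the awk version, a line without any ones is reported as 0/0 instead of being divided.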

Here is an awk that does what you need:
cat file
data 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
data 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
data 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
data 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BMR_10#O24-BMR_6#O13-H13 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
data 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1
awk '{$1="";$0="0 "$0" 0";t=split($0,b,"1")-1;gsub(/ +/,"");n=split($0,a,"[^1]+")-2;print (n?t/n:0)}' file
2.6
0
25
11
5.5
3
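The command above comes without an explanation, so here is a rough Python sketch of the same idea, which is to count the ones and then count the runs of ones by splitting the line. It is an approximation of the technique, not a translation: it uses a regular expression instead of split, and the name ratio_by_runs is invented for this example. The label is dropped first because it may itself contain the digits 0 and 1, which is also why the awk version blanks $1.

```python
import re


def ratio_by_runs(line):
    """Ones per run of ones for one line; 0 when the line has no ones."""
    # Drop the label in field 1, then glue the remaining digits together.
    bits = "".join(line.split()[1:])
    # Each match is one uninterrupted run of ones.
    runs = re.findall(r"1+", bits)
    total = sum(len(run) for run in runs)
    return total / len(runs) if runs else 0


lines = [
    "data 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0",
    "data 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
    "BMR_10#O24-BMR_6#O13-H13 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1",
]
for line in lines:
    print(ratio_by_runs(line))
```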

Related

SQL: Is there a way I can find whether a value is within a specific index range of another value?

I have two columns filled with mostly 0's and a few 1's. Whenever a 1 occurs in the first column, I want to check whether a 1 occurs in the second column within a range of 5 rows from that index. For example, let's say a 1 occurs in column 1, row 83; then I would like to return TRUE if one or more 1's occur in column 2, rows 83-88, and FALSE if not. Examples are listed in the code block. I want to count the number of TRUE and FALSE occurrences.
TRUE:
0 0
0 0
0 0
1 1
0 0
0 0
0 0
0 0
0 0
0 0
TRUE:
0 0
0 0
0 0
1 0
0 0
0 0
0 1
0 1
0 0
0 0
FALSE:
0 0
0 0
0 1
1 0
0 0
0 0
0 0
0 0
0 0
0 1
I have no idea where to begin, so I do not have any code to start with :(
Kind regards,
Kai
Assuming you have an ordering column, you can use window functions. The frame below covers the current row and the 4 rows after it (5 rows in total); adjust it if your range is different:
select (case when count(*) = 0 then 'false' else 'true' end)
from (select t.*,
             max(col2) over (order by <ordering column>
                             rows between current row and 4 following
                            ) as max_col2_5
      from t
     ) t
where col1 = 1 and max_col2_5 = 1;
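To make the window logic concrete, here is a small Python sketch that counts TRUE and FALSE occurrences separately, which is what the question asks for. It assumes the rows are already in order and uses the same window as the query (the current row plus the 4 that follow); the function name count_hits and the window parameter are made up for this illustration.

```python
def count_hits(col1, col2, window=5):
    """Return (true_count, false_count) over every 1 in col1.

    A 1 in col1 counts as TRUE when col2 holds a 1 somewhere in the
    `window` rows starting at the same row, and as FALSE otherwise.
    """
    hits = misses = 0
    for i, flag in enumerate(col1):
        if flag != 1:
            continue
        # The slice is simply shorter near the end of the data,
        # like a window frame that runs off the last row.
        if any(col2[i:i + window]):
            hits += 1
        else:
            misses += 1
    return hits, misses


# Second TRUE example and the FALSE example from the question.
col1 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
true_case = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
false_case = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
print(count_hits(col1, true_case))
print(count_hits(col1, false_case))
```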

How do I grab rows surrounding a flagged value?

I'm starting with a table like this:
code new_code_flag
abc123 0
xyz456 0
wer098 1
jio234 0
bcx190 0
eiw157 0
nzi123 0
epj676 0
ere654 0
yru493 1
ale674 0
I want to grab the 2 records before and 2 records after each value where "new_code_flag"=1. I want my output to look like this:
code new_code_flag
abc123 0
xyz456 0
wer098 1
jio234 0
bcx190 0
epj676 0
ere654 0
yru493 1
ale674 0
Any help on how to do this in SQL or SAS?
SQL tables represent unordered sets. Hence, in SQL you need to have a column that specifies the ordering. Assuming you do, you can do something like:
with t as (
      select t.*, row_number() over (order by ?) as seqnum
      from tbl t
     )
select t.*
from t
where exists (select 1
              from t t2
              where t2.new_code_flag = 1 and
                    t.seqnum between t2.seqnum - 2 and t2.seqnum + 2
             );
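The effect of the exists condition can be checked with a short Python sketch. This is an illustration only: it treats the list position as the ordering column, and the name rows_near_flag and its before/after parameters are invented here.

```python
def rows_near_flag(rows, before=2, after=2):
    """Keep every row within `before`/`after` positions of a flagged row."""
    flagged = [i for i, (_, flag) in enumerate(rows) if flag == 1]
    keep = set()
    for i in flagged:
        # Clamp the range so it never runs past either end of the data.
        keep.update(range(max(0, i - before), min(len(rows), i + after + 1)))
    return [rows[i] for i in sorted(keep)]


rows = [
    ("abc123", 0), ("xyz456", 0), ("wer098", 1), ("jio234", 0),
    ("bcx190", 0), ("eiw157", 0), ("nzi123", 0), ("epj676", 0),
    ("ere654", 0), ("yru493", 1), ("ale674", 0),
]
for code, flag in rows_near_flag(rows):
    print(code, flag)
```

Collecting positions in a set means a row that sits near two flagged rows is still returned only once.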
You could create two lag and two lead copies of the flag variable and then test if any of the 5 variables are 1 (true).
data have;
input code $ flag ;
cards;
abc123 0
xyz456 0
wer098 1
jio234 0
bcx190 0
eiw157 0
nzi123 0
epj676 0
ere654 0
yru493 1
ale674 0
;
data want ;
set have ;
set have(keep=flag rename=(flag=lead1_flag) firstobs=2) have(drop=_all_ obs=1);
set have(keep=flag rename=(flag=lead2_flag) firstobs=3) have(drop=_all_ obs=2);
lag1_flag=lag1(flag);
lag2_flag=lag2(flag);
if lag1_flag or lag2_flag or flag or lead1_flag or lead2_flag ;
run;
Results
Obs  code    flag  lead1_flag  lead2_flag  lag1_flag  lag2_flag
  1  abc123     0           0           1          .          .
  2  xyz456     0           1           0          0          .
  3  wer098     1           0           0          0          0
  4  jio234     0           0           0          1          0
  5  bcx190     0           0           0          0          1
  6  epj676     0           0           1          0          0
  7  ere654     0           1           0          0          0
  8  yru493     1           0           .          0          0
  9  ale674     0           .           .          1          0
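Alternatively, a single data step can merge the data with a copy of the flag read two observations ahead and use a counter to decide which rows to keep:
data want(drop=_: i);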
merge have have(keep=flag firstobs=3 rename=(flag=_flag));
if flag or _flag then i=1;
if 0<i<=3 then do;
output;
i+1;
end;
else delete;
run;

Pandas iterate max value of a variable length slice in a series

Let's assume I have a Pandas DataFrame as follows:
import pandas as pd
idx = ['2003-01-02', '2003-01-03', '2003-01-06', '2003-01-07',
'2003-01-08', '2003-01-09', '2003-01-10', '2003-01-13',
'2003-01-14', '2003-01-15', '2003-01-16', '2003-01-17',
'2003-01-21', '2003-01-22', '2003-01-23', '2003-01-24',
'2003-01-27']
a = pd.DataFrame([1,2,0,0,1,2,3,0,0,0,1,2,3,4,5,0,1],
columns = ['original'], index = pd.to_datetime(idx))
I am trying to get the max of each slice of that DataFrame that lies between two zeros.
In this example I would get:
a['result'] = [0,2,0,0,0,0,3,0,0,0,0,0,0,0,5,0,1]
that is:
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
1) Find the zeros.
2) Use cumsum to label the groups.
3) Mask the zeros into their own group, -1.
4) Find the location of the max in each group with idxmax.
5) Drop the entry for group -1, since that one only holds the zeros.
6) Take a.original at the max locations found, reindex, and fill with zeros.
m = a.original.eq(0)
g = a.original.groupby(m.cumsum().mask(m, -1))
i = g.idxmax().drop(-1)
a.assign(result=a.loc[i, 'original'].reindex(a.index, fill_value=0))
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
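To see the grouping logic without pandas, here is a plain-Python sketch of the same idea using itertools.groupby. It is not the answer's method, only an equivalent for checking the result by hand; the name max_per_run is made up, and like idxmax it keeps the first maximum when a run has ties.

```python
from itertools import groupby


def max_per_run(values):
    """Put each non-zero run's max at its own position, zeros elsewhere."""
    result = [0] * len(values)
    pos = 0  # index in `values` where the current run starts
    for is_zero, run in groupby(values, key=lambda v: v == 0):
        run = list(run)
        if not is_zero:
            # Offset of the first maximum inside this run.
            best = max(range(len(run)), key=run.__getitem__)
            result[pos + best] = run[best]
        pos += len(run)
    return result


original = [1, 2, 0, 0, 1, 2, 3, 0, 0, 0, 1, 2, 3, 4, 5, 0, 1]
print(max_per_run(original))
```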

Delete extra whitespace between columns and leave only one space between columns

I have a file with more than 2500 columns. The columns are separated by tabs or by several spaces.
The data format in the file is as below:
1 1 0
1 1 0
0 1 0
1 0 1
1 0 0
1 1 1
1 0 1
I want to remove the tabs and extra spaces between the columns and leave only one space between them, as below.
1 1 0
1 1 0
0 1 0
1 0 1
1 0 0
1 1 1
1 0 1
How do I delete the extra spaces?
This should do:
awk '{$1=$1}1' file
1 1 0
1 1 0
0 1 0
1 0 1
1 0 0
1 1 1
1 0 1
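Assigning $1=$1 forces awk to rebuild the line from its fields, joined by the output field separator (a single space by default), which collapses every run of spaces and tabs. The trailing 1 is an always-true pattern, so the rebuilt line is printed.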
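The same split-and-rejoin idea can be sketched in Python, in case it helps to see it spelled out. This is only an analogy to what awk does with its default field separator; the function name squeeze is made up for the example.

```python
def squeeze(line):
    # split() with no argument cuts on any run of spaces or tabs and
    # ignores leading and trailing whitespace; join then puts exactly
    # one space back between the fields.
    return " ".join(line.split())


print(squeeze("1\t\t1     0"))  # prints: 1 1 0
```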
With sed (the \+ used here is a GNU extension):
sed 's/[[:space:]]\+/ /g' filename
Alternatively with tr:
tr -s '[:blank:]' ' ' < filename

Selecting columns using specific patterns then finding sum and ratio

I want to calculate the sum and ratio values from data below. (The actual data contains more than 200,000 columns and 45000 rows (lines)).
For clarity, I have given only a simplified sample of the data.
#Frame BMR_42#O22 BMR_49#O13 BMR_59#O13 BMR_23#O26 BMR_10#O13 BMR_61#O26 BMR_23#O25
1 1 1 0 1 1 1 1
2 0 1 0 0 1 1 0
3 1 1 1 0 0 1 1
4 1 1 0 0 1 0 1
5 0 0 0 0 0 0 0
6 1 0 1 1 0 1 0
7 1 1 1 1 0 0 0
8 1 1 1 0 0 0 0
9 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0
The columns need to be selected by a certain criterion.
I only consider the columns whose names contain "#O13". Below are the columns selected from the example above.
BMR_49#O13 BMR_59#O13 BMR_10#O13
1 0 1
1 0 1
1 1 0
1 0 1
0 0 0
0 1 0
1 1 0
1 1 0
1 1 1
0 0 0
From the selected columns, I want to calculate:
1) the sum of all the "1"s. In this example the value is 16.
2) the total number of rows containing at least one "1". In the example above there are 8 such rows.
and lastly,
3) the ratio of the total number of "1"s to the number of rows containing a "1".
That is: (total number of "1"s)/(number of rows with at least one "1").
Example: 16/8
As a start, I tried this command to select only the columns with "#O13":
awk '{for (i=1;i<=NF;i++) if (i~/#O13/); print ""}' $file2
This runs, but it does not show the values.
This should do:
awk 'NR==1{for (i=1;i<=NF;i++) if ($i~/#O13/) a[i];next} {f=0;for (i in a) if ($i) {s++;f++};if (f) r++} END {print "number of 1="s"\nrows with 1="r"\nratio="s/r}' file
number of 1=16
rows with 1=8
ratio=2
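A more readable version:
awk '
# Header line: remember the index of every column whose name matches #O13
NR==1 {
    for (i=1;i<=NF;i++)
        if ($i~/#O13/)
            a[i]
    next
}
# Data lines: s counts all ones, r counts rows with at least one
{
    f=0
    for (i in a)
        if ($i=="1") {
            s++
            f++
        }
    if (f) r++
}
END {
    print "number of 1="s \
          "\nrows with 1="r \
          "\nratio="s/r
}
' file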
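For comparison, the same three numbers can be computed with a short Python sketch. It is an illustration under a few assumptions: the whole input is held as a list of lines, the function name summarize and its pattern parameter are made up, and the ratio is returned as None when no row contains a 1 (the awk version would divide by zero in that case).

```python
def summarize(lines, pattern="#O13"):
    """Return (ones, rows_with_a_one, ratio) over the matching columns."""
    header = lines[0].split()
    # Indexes of the columns whose header contains the pattern.
    cols = [i for i, name in enumerate(header) if pattern in name]
    total = rows = 0
    for line in lines[1:]:
        fields = line.split()
        hits = sum(1 for i in cols if fields[i] == "1")
        total += hits
        if hits:
            rows += 1
    ratio = total / rows if rows else None
    return total, rows, ratio


data = """#Frame BMR_42#O22 BMR_49#O13 BMR_59#O13 BMR_23#O26 BMR_10#O13 BMR_61#O26 BMR_23#O25
1 1 1 0 1 1 1 1
2 0 1 0 0 1 1 0
3 1 1 1 0 0 1 1
4 1 1 0 0 1 0 1
5 0 0 0 0 0 0 0
6 1 0 1 1 0 1 0
7 1 1 1 1 0 0 0
8 1 1 1 0 0 0 0
9 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0""".splitlines()

print(summarize(data))
```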