Counting number of zeros in a row, adding count to new column [closed] - awk

I have a tab-delimited table that looks like this:
chr1 100 110 + 2 3 0 8 6
chr1 150 200 + 1 4 0 2 0
chr1 200 220 + 1 4 2 0 0
chr1 250 260 + 4 2 6 1 3
I would like to count how many zeros are in columns 5-9 and add that number to column 10:
chr1 100 110 + 2 3 0 8 6 1
chr1 150 200 + 1 4 0 2 0 2
chr1 200 220 + 1 4 2 0 0 2
chr1 250 260 + 4 2 6 1 3 0
Ultimately, the goal is to subset only those rows with no more than 4 zeros (at least 2 columns being non-zero). I know how to do this subset with awk but I don't know how to count the zeros in those columns. If there is a simpler way to just require that at least two columns be non-zero between columns 5-9 that would be ideal.

rethab's answer perfectly covers your first requirement (adding an extra column). This answers your second requirement (print only lines with fewer than 4 zeros). With awk (tested with GNU awk), simply count the non-zero fields between field 5 and field 9 (variable nz), and print the line only if that count is greater than or equal to 2:
$ cat foo.txt
chr1 100 110 + 2 3 0 8 6
chr1 150 200 + 1 4 0 2 0
chr1 250 260 + 0 0 0 1 0
chr1 200 220 + 1 4 2 0 0
chr1 250 260 + 4 2 6 1 3
$ awk '{nz=0; for(i=5;i<=9;i++) nz+=($i!=0)} nz>=2' foo.txt
chr1 100 110 + 2 3 0 8 6
chr1 150 200 + 1 4 0 2 0
chr1 200 220 + 1 4 2 0 0
chr1 250 260 + 4 2 6 1 3

This script counts the zeros and appends them as the last column:
awk '{
cnt=0
for (i=5;i<=9;i++) {
cnt+=($i==0)
}
print $0, cnt
}' inputs.txt
note that $i==0 yields 1 if the condition is true and 0 if not. Therefore, this can be used as the increment for the counter.
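The same trick carries over to Python, where True counts as 1 in arithmetic. A small sketch of the zero count, using the sample rows from the question (field positions assumed to match awk's $5..$9):

```python
def count_zeros(line):
    # awk's $5..$9 are Python indices 4..8 after splitting on whitespace
    return sum(f == "0" for f in line.split()[4:9])

rows = [
    "chr1 100 110 + 2 3 0 8 6",
    "chr1 150 200 + 1 4 0 2 0",
    "chr1 200 220 + 1 4 2 0 0",
    "chr1 250 260 + 4 2 6 1 3",
]
for r in rows:
    print(r, count_zeros(r))  # appends the zero count, like the awk script
```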

You can use gsub, which returns the number of substitutions made per line (here, in the string s), and then print that number:
awk '{s=$5$6$7$8$9;x=gsub(/0/,"&",s);print $0, x}' file
chr1 100 110 + 2 3 0 8 6 1
chr1 150 200 + 1 4 0 2 0 2
chr1 200 220 + 1 4 2 0 0 2
chr1 250 260 + 4 2 6 1 3 0
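One caveat not visible in the single-digit sample data: gsub counts every 0 character in the concatenated string, so a field like 10 would inflate the count. Comparing whole fields avoids this; a quick illustration in Python:

```python
def zeros_by_substring(fields):
    # mimics the gsub approach: count '0' characters in the concatenation
    return "".join(fields).count("0")

def zeros_by_field(fields):
    # counts only fields that are exactly "0"
    return sum(f == "0" for f in fields)

single = ["2", "3", "0", "8", "6"]
multi = ["10", "3", "0", "8", "6"]   # "10" contains a 0 digit but is non-zero

print(zeros_by_substring(single), zeros_by_field(single))  # 1 1
print(zeros_by_substring(multi), zeros_by_field(multi))    # 2 1
```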

Related

Separate float number awk

I am looking for a way to separate the awk output into two values.
Here is an array with values
0 0.0
1 0.1
3 0.3
4 0.4
5 1.0
The output is printed with:
printf(" %d %.1f\n", n, arr[n])
My question is how to get these values:
0 0 0
1 0 1
2 0 3
4 0 4
5 1 0
by separating the float number inside the printf call.
You may use this awk with sub:
awk '{arr[$1]=$2}
END {for (n in arr) {sub(/\./, OFS, arr[n]); print n, arr[n]}}' file
0 0 0
1 0 1
2 0 3
4 0 4
5 1 0
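The same split can be sketched in Python: replacing only the first '.' reproduces awk's sub(/\./, OFS, arr[n]):

```python
def split_float(v):
    # replace the first '.' with a space, like sub(/\./, OFS, ...) in awk
    return v.replace(".", " ", 1)

values = {0: "0.0", 1: "0.1", 3: "0.3", 4: "0.4", 5: "1.0"}
for n in sorted(values):
    print(n, split_float(values[n]))
```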

How to transfer (sum up) counts from a set of ranges to larger ranges that enclose them?

I am working with sequencing data, but I think the problem applies to different range-value datatypes.
I want to combine several experiments of read counts (values) from a set of DNA regions that have a start and an end position (ranges) into summed-up counts for another set of DNA regions, which generally encompass many of the primary regions. Like in the following example:
Giving the following table A with ranges and counts:
feature start end count1 count2 count3
gene1 1 10 100 30 22
gene2 15 40 20 10 6
gene3 50 70 40 11 7
gene4 100 150 23 15 9
and the following table B (with new ranges):
feature start end
range1 1 45
range2 55 160
I would like to get the following count table with the new ranges:
feature start end count1 count2 count3
range1 1 45 120 40 28
range2 55 160 63 26 16
Just to simplify: if there is at least some overlap (at least a fraction of a feature in table A is contained in a feature in table B), its counts should be added up. Any idea of a tool available that does this, or a script in Perl, Python, or R? I am counting the sequencing reads with bedtools multicov, but as far as I have searched there is no other functionality doing what I want. Any idea?
Thanks.
We can do this by:
1. Creating an artificial key column
2. Performing an outer join (m x n)
3. Filtering on the start OR end value being between our ranges
4. Using pandas.DataFrame.groupby on feature and summing the count columns
5. Finally, concatenating the output to df2 to get the desired output
import pandas as pd

df1['key'] = 'A'
df2['key'] = 'A'
df3 = pd.merge(df1, df2, on='key', how='outer')
df4 = df3[(df3.start_x.between(df3.start_y, df3.end_y)) | (df3.end_x.between(df3.start_y, df3.end_y))]
df5 = df4.groupby('feature_y').agg({'count1': 'sum',
                                    'count2': 'sum',
                                    'count3': 'sum'}).reset_index()
df_final = pd.concat([df2.drop(['key'], axis=1), df5.drop(['feature_y'], axis=1)], axis=1)
Output:
print(df_final)
feature start end count1 count2 count3
0 range1 1 45 120 40 28
1 range2 55 160 63 26 16
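For completeness, the same steps as one self-contained script, with df1 and df2 built from the sample tables (only the sample data is assumed):

```python
import pandas as pd

df1 = pd.DataFrame({
    "feature": ["gene1", "gene2", "gene3", "gene4"],
    "start": [1, 15, 50, 100], "end": [10, 40, 70, 150],
    "count1": [100, 20, 40, 23], "count2": [30, 10, 11, 15],
    "count3": [22, 6, 7, 9],
})
df2 = pd.DataFrame({
    "feature": ["range1", "range2"],
    "start": [1, 55], "end": [45, 160],
})

# cross join via a constant key, keep rows where a gene's start or end
# falls inside the range, then sum the counts per range
df1["key"] = "A"
df2["key"] = "A"
m = pd.merge(df1, df2, on="key", how="outer")
m = m[m.start_x.between(m.start_y, m.end_y) | m.end_x.between(m.start_y, m.end_y)]
sums = m.groupby("feature_y")[["count1", "count2", "count3"]].sum().reset_index()
out = pd.concat([df2.drop(columns="key"), sums.drop(columns="feature_y")], axis=1)
print(out)
```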
You can use apply() and pd.concat() with a custom function, where a corresponds to your first dataframe and b corresponds to your second dataframe:
def find_englobed(x):
    englobed = a[(a['start'].between(x['start'], x['end'])) | (a['end'].between(x['start'], x['end']))]
    return englobed[['count1', 'count2', 'count3']].sum()

pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)
Yields:
feature start end count1 count2 count3
0 range1 1 45 120 40 28
1 range2 55 160 63 26 16
If it can help somebody: based on @rahlf23's answer, I modified it to make it more general, considering that, on one side, there can be more counting columns, and that, besides the range, it is also important to be on the right chromosome.
So if table "a" is:
feature Chromosome start end count1 count2 count3
gene1 Chr1 1 10 100 30 22
gene2 Chr1 15 40 20 10 6
gene3 Chr1 50 70 40 11 7
gene4 Chr1 100 150 23 15 9
gene5 Chr2 5 30 24 17 2
gene5 Chr2 40 80 4 28 16
and table "b" is:
feature Chromosome start end
range1 Chr1 1 45
range2 Chr1 55 160
range3 Chr2 10 90
range4 Chr2 100 200
with the following Python script:
import pandas as pd

def find_englobed(x):
    englobed = a[(a['Chromosome'] == x['Chromosome']) & (a['start'].between(x['start'], x['end']) | a['end'].between(x['start'], x['end']))]
    return englobed[list(a.columns[4:])].sum()

pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)
Now with a['Chromosome'] == x['Chromosome'] I require them to be on the same chromosome, and with list(a.columns[4:]) I take all the columns from the 5th to the last, independent of the number of count columns.
I obtain the following result:
feature Chromosome start end count1 count2 count3
range1 Chr1 1 45 120.0 40.0 28.0
range2 Chr1 55 160 63.0 26.0 16.0
range3 Chr2 10 90 28.0 45.0 18.0
range4 Chr2 100 200 0.0 0.0 0.0
I am not sure why the obtained counts come out as floating point numbers. Any comment?
If you are doing genomics in pandas you might want to look into pyranges:
import pyranges as pr
c = """feature Chromosome Start End count1 count2 count3
gene1 Chr1 1 10 100 30 22
gene2 Chr1 15 40 20 10 6
gene3 Chr1 50 70 40 11 7
gene4 Chr1 100 150 23 15 9
gene5 Chr2 5 30 24 17 2
gene5 Chr2 40 80 4 28 16
"""
c2 = """feature Chromosome Start End
range1 Chr1 1 45
range2 Chr1 55 160
range3 Chr2 10 90
range4 Chr2 100 200 """
gr, gr2 = pr.from_string(c), pr.from_string(c2)
j = gr2.join(gr).drop(like="_b")
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# | feature | Chromosome | Start | End | count1 | count2 | count3 |
# | (object) | (category) | (int32) | (int32) | (int64) | (int64) | (int64) |
# |------------+--------------+-----------+-----------+-----------+-----------+-----------|
# | range1 | Chr1 | 1 | 45 | 100 | 30 | 22 |
# | range1 | Chr1 | 1 | 45 | 20 | 10 | 6 |
# | range2 | Chr1 | 55 | 160 | 40 | 11 | 7 |
# | range2 | Chr1 | 55 | 160 | 23 | 15 | 9 |
# | range3 | Chr2 | 10 | 90 | 24 | 17 | 2 |
# | range3 | Chr2 | 10 | 90 | 4 | 28 | 16 |
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# Unstranded PyRanges object has 6 rows and 7 columns from 2 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
df = j.df
fs = {"Chromosome": "first", "Start": "first", "End": "first",
      "count1": "sum", "count2": "sum", "count3": "sum"}
result = df.groupby("feature").agg(fs)
# Chromosome Start End count1 count2 count3
# feature
# range1 Chr1 1 45 120 40 28
# range2 Chr1 55 160 63 26 16
# range3 Chr2 10 90 28 45 18

Count with group by with numpy

I have a large array with a shape in excess of (1000000, 200). I would like to count the occurrences of the items in the last column (:, -1). I can do this in pandas with a smaller list:
distribution = mylist.groupby('var1').count()
However, I do not have labels on any of my dimensions, so I am unsure of how to reference them.
Edit:
print of pandas sample data;
0 1 2 3 4 ... 204 205 206 207 208
0 1 1 Random 1 4 12 ... 8 -14860 0 -5.0000 43.065233
1 1 1 Random 2 3 2 ... 8 -92993 -1 -1.0000 43.057945
2 1 1 Random 3 13 3 ... 8 -62907 1 -2.0000 43.070335
3 1 1 Random 3 13 3 ... 8 -62907 -1 -2.0000 43.070335
4 1 1 Random 4 4 2 ... 8 -38673 -1 0.0000 43.057945
5 1 1 Book 1 3 9 ... 8 -82339 -1 0.0000 43.059402
... ... ... ... .. .. ... .. ... .. ... ...
11795132 292 1 Random 5 12 2 ... 8 -69229 -1 0.0000 12.839051
11795133 292 1 Book 2 7 10 ... 8 -60664 -1 0.0000 46.823615
11795134 292 1 Random 2 9 4 ... 8 -78754 1 -2.0000 11.762521
11795135 292 1 Random 2 9 4 ... 8 -78754 -1 -2.0000 11.762521
11795136 292 1 Random 1 7 5 ... 8 -76275 -1 0.0000 41.839286
I want a few different counts and summaries, so I plan to do them one at a time with:
mylist = input_list.values
mylist = mylist[:, -1]
mylist = mylist.astype(int)  # astype returns a new array, so assign the result
Expected output;
11 2
12 1
41 1
43 6
46 1
iloc enables you to reference a column without using labels:
distribution = input_list.groupby(input_list.iloc[:, -1]).count()
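A minimal sketch of that call on an unlabeled frame (the sample values here are made up):

```python
import pandas as pd

# unlabeled frame: columns are just 0..2
df = pd.DataFrame([[1, "a", 43], [2, "b", 43], [3, "c", 11]])

# group on the positional last column and count rows per distinct value
distribution = df.groupby(df.iloc[:, -1]).count()
print(distribution)
```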

awk cumulative sum in one dimension

Good afternoon,
I would like to compute a cumulative sum for each column and each line in awk.
My input file is:
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
And I would like, per column:
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
6 12 16 18
And I would like, per line:
1 3 5 9 9
2 7 13 20 20
2 5 11 16 16
1 3 4 6 6
I did the sum per column as :
awk '{ for (i=1; i<=NF; ++i) sum[i] += $i}; END { for (i in sum) printf "%s ", sum[i]; printf "\n"; }' test.txt # sum
And per line:
awk '
BEGIN {FS=OFS=" "}
{
sum=0; n=0
for(i=1;i<=NF;i++)
{sum+=$i; ++n}
print $0,"sum:"sum,"count:"n,"avg:"sum/n
}' test.txt
But I would like to print all the lines and columns.
Do you have an idea?
It looks like you have all the correct information available, all you are missing is the printout statements.
Is this what you are looking for?
accumulated sum of the columns:
% cat foo
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
% awk '{ for (i=1; i<=NF; ++i) {sum[i]+=$i; $i=sum[i] }; print $0}' foo
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
accumulated sum of the rows:
% cat foo
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
% awk '{ sum=0; for (i=1; i<=NF; ++i) {sum+=$i; $i=sum }; print $0}' foo
1 3 6 10
2 7 13 20
2 5 11 16
1 3 4 6
Both of these make use of the following:
each variable has value 0 by default (when used numerically)
I replace the field $i with the running sum
I reprint the full line with print $0
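For comparison, both accumulations are one-liners in numpy, using cumsum along either axis:

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [2, 5, 6, 7],
              [2, 3, 6, 5],
              [1, 2, 1, 2]])

print(np.cumsum(a, axis=0))  # running sum down each column
print(np.cumsum(a, axis=1))  # running sum across each row
```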
row sums with repeated last element
$ awk '{s=0; for(i=1;i<=NF;i++) $i=s+=$i; $i=s}1' file
1 3 6 10 10
2 7 13 20 20
2 5 11 16 16
1 3 4 6 6
$i=s sets the field at index i (which the loop has incremented to NF+1) to the sum, and the trailing 1 prints the line with that extra field.
columns sums with repeated last row
$ awk '{for(i=1;i<=NF;i++) c[i]=$i+=c[i]}1; END{print}' file
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
6 12 16 18
END{print} repeats the last row
P.S. Your math seems to be wrong for the row sums.

AWK - removal of the same fields on the basis of the "$1"

I have a file1:
6
3
6
9
2
6
This command prints the result:
awk 'NR==1{a=$1};$0!=a' file1
3
9
2
Now I have file2:
6 1 2 3 4 5
3 3 4 4 4 6
6 5 2 2 5 1
9 1 3 5 4 1
2 5 6 4 8 5
6 1 5 2 3 1
I want to do the same thing, but with file2. I want to print out the result:
3 3 4 4 5 6
9 5 3 2 8 1
2 5 6 5 3 1
5 4 1
2
I want to do it in awk. Thank you for your help.
AWK is not really suited for what you are trying to do, since it is made for processing rows one at a time, while you are trying to shift numbers up and down between different rows. That said, this monster should do what you want:
awk '
NR==1 { nc=NF; for (i=1; i<=nc; i++) a[i]=$i }
{
    for (i=1; i<=nc; i++)
        if ($i != a[i]) {
            v[m[i]++, i] = $i
            if (m[i] > nl) nl = m[i]
        }
}
END {
    for (l=0; l<nl; l++) {
        for (i=1; i<=nc; i++)
            if (l < m[i]) printf("%d ", v[l, i])
            else printf(" ")
        printf("\n")
    }
}'
If, on the other hand, your matrix of numbers had been transposed, this task would have been far simpler:
awk '{for(i=2;i<=NF;i++)if($i!=$1)printf(" %d",$i);printf("\n")}'
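For reference, the column-wise filtering that the one-liner performs (drop from each column every value equal to that column's first value, then re-emit the shortened columns as rows) is easier to follow in Python:

```python
text = """6 1 2 3 4 5
3 3 4 4 4 6
6 5 2 2 5 1
9 1 3 5 4 1
2 5 6 4 8 5
6 1 5 2 3 1"""

cols = list(zip(*(line.split() for line in text.splitlines())))
# per column, keep the values that differ from the column's first value
kept = [[v for v in col if v != col[0]] for col in cols]

for r in range(max(len(k) for k in kept)):
    print(" ".join(k[r] if r < len(k) else " " for k in kept))
```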