How to transfer (sum up) the counts from a set of ranges to ranges that encompass those ranges? - pandas

I am working with sequencing data, but I think the problem applies to other range-value data types as well.
I want to combine read counts (values) from several experiments, measured over a set of DNA regions that each have a start and end position (ranges), into summed counts for another set of DNA regions, each of which generally encompasses several of the primary regions, as in the following example.
Given the following table A with ranges and counts:
feature start end count1 count2 count3
gene1 1 10 100 30 22
gene2 15 40 20 10 6
gene3 50 70 40 11 7
gene4 100 150 23 15 9
and the following table B (with new ranges):
feature start end
range1 1 45
range2 55 160
I would like to get the following count table with the new ranges:
feature start end count1 count2 count3
range1 1 45 120 40 28
range2 55 160 63 26 16
To keep it simple: if there is any overlap at all (at least a fraction of a feature in table A is contained in a feature in table B), its counts should be added. Is there an existing tool that does this, or a script in Perl, Python or R? I am counting the sequencing reads with bedtools multicov, but as far as I could find, it has no functionality that does what I want.
Thanks.

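We can do this by:
Creating an artificial key column in both dataframes
Merging on that key, which pairs every row of df1 with every row of df2 (an m x n cross join)
Filtering on the start OR end value of the table A feature being between the start and end of the table B range
Using pandas.DataFrame.groupby on the table B feature and summing the count columns
Finally concatenating the result to df2 to get the desired output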
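import pandas as pd
# df1 is table A and df2 is table B from the question
df1['key'] = 'A'
df2['key'] = 'A'
# every row of df1 is paired with every row of df2
df3 = pd.merge(df1, df2, on='key', how='outer')
# keep pairs where the start or the end of the table A feature falls inside the table B range
df4 = df3[(df3.start_x.between(df3.start_y, df3.end_y)) | (df3.end_x.between(df3.start_y, df3.end_y))]
df5 = df4.groupby('feature_y').agg({'count1': 'sum',
                                    'count2': 'sum',
                                    'count3': 'sum'}).reset_index()
df_final = pd.concat([df2.drop(['key'], axis=1), df5.drop(['feature_y'], axis=1)], axis=1)
Output: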
print(df_final)
feature start end count1 count2 count3
0 range1 1 45 120 40 28
1 range2 55 160 63 26 16
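The start-or-end filter in this answer only keeps a pair when one endpoint of the table A feature lies inside the table B range, so a feature that completely spans a range would be skipped, and the final concat relies on every range having at least one match. A minimal sketch of a more general overlap test, assuming pandas 1.2 or later for how="cross" and closed intervals as in the question (the names count_cols, pairs, hits and sums are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({
    "feature": ["gene1", "gene2", "gene3", "gene4"],
    "start": [1, 15, 50, 100],
    "end": [10, 40, 70, 150],
    "count1": [100, 20, 40, 23],
    "count2": [30, 10, 11, 15],
    "count3": [22, 6, 7, 9],
})
df2 = pd.DataFrame({
    "feature": ["range1", "range2"],
    "start": [1, 55],
    "end": [45, 160],
})
count_cols = ["count1", "count2", "count3"]

# Pair every feature of table A with every range of table B.
pairs = df1.merge(df2, how="cross", suffixes=("_a", "_b"))

# Two closed intervals overlap when each one starts before the other ends.
hits = pairs[(pairs["start_a"] <= pairs["end_b"]) & (pairs["end_a"] >= pairs["start_b"])]

# Sum the counts per table B range, then attach them by name (not by position).
sums = (hits.groupby("feature_b")[count_cols].sum()
            .reset_index()
            .rename(columns={"feature_b": "feature"}))
result = df2.merge(sums, on="feature", how="left")

# Ranges without any overlapping feature get 0 instead of NaN.
result[count_cols] = result[count_cols].fillna(0).astype(int)
print(result)
```

The cross join produces m x n rows, so on large tables an interval-aware tool such as the pyranges approach further down may scale better.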

You can use apply() and pd.concat() with a custom function where a corresponds to your first dataframe and b corresponds to your second dataframe:
def find_englobed(x):
    englobed = a[(a['start'].between(x['start'], x['end'])) | (a['end'].between(x['start'], x['end']))]
    return englobed[['count1','count2','count3']].sum()
pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)
Yields:
feature start end count1 count2 count3
0 range1 1 45 120 40 28
1 range2 55 160 63 26 16

In case it helps somebody: based on rahlf23's answer, I made it more general, since there can be more count columns and since, besides the range, the features also need to be on the same chromosome.
So if table "a" is:
feature Chromosome start end count1 count2 count3
gene1 Chr1 1 10 100 30 22
gene2 Chr1 15 40 20 10 6
gene3 Chr1 50 70 40 11 7
gene4 Chr1 100 150 23 15 9
gene5 Chr2 5 30 24 17 2
gene5 Chr2 40 80 4 28 16
and table "b" is:
feature Chromosome start end
range1 Chr1 1 45
range2 Chr1 55 160
range3 Chr2 10 90
range4 Chr2 100 200
with the following python script:
import pandas as pd
def find_englobed(x):
    englobed = a[(a['Chromosome'] == x['Chromosome']) & (a['start'].between(x['start'], x['end']) | (a['end'].between(x['start'], x['end'])))]
    return englobed[list(a.columns[4:])].sum()
pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)
The condition a['Chromosome'] == x['Chromosome'] requires the features to be on the same chromosome, and list(a.columns[4:]) selects every column from the 5th to the last, so the script does not depend on the number of count columns.
I obtain the following result:
feature Chromosome start end count1 count2 count3
range1 Chr1 1 45 120.0 40.0 28.0
range2 Chr1 55 160 63.0 26.0 16.0
range3 Chr2 10 90 28.0 45.0 18.0
range4 Chr2 100 200 0.0 0.0 0.0
I am not sure why the resulting counts are floating-point numbers. Any comments?
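Whatever the cause of the floats, the summed columns can be cast back to integers after the apply. A small sketch under the assumption that all columns from the 5th onwards hold integer counts (the names same_chrom, touches, sums and result are only illustrative):

```python
import pandas as pd

a = pd.DataFrame({
    "feature": ["gene1", "gene2", "gene3", "gene4", "gene5", "gene5"],
    "Chromosome": ["Chr1", "Chr1", "Chr1", "Chr1", "Chr2", "Chr2"],
    "start": [1, 15, 50, 100, 5, 40],
    "end": [10, 40, 70, 150, 30, 80],
    "count1": [100, 20, 40, 23, 24, 4],
    "count2": [30, 10, 11, 15, 17, 28],
    "count3": [22, 6, 7, 9, 2, 16],
})
b = pd.DataFrame({
    "feature": ["range1", "range2", "range3", "range4"],
    "Chromosome": ["Chr1", "Chr1", "Chr2", "Chr2"],
    "start": [1, 55, 10, 100],
    "end": [45, 160, 90, 200],
})

def find_englobed(x):
    # Same filter as in the script above, split into two named masks.
    same_chrom = a["Chromosome"] == x["Chromosome"]
    touches = a["start"].between(x["start"], x["end"]) | a["end"].between(x["start"], x["end"])
    return a.loc[same_chrom & touches, a.columns[4:]].sum()

# Cast the summed counts to integers before attaching them to table b.
sums = b.apply(find_englobed, axis=1).astype(int)
result = pd.concat([b, sums], axis=1)
print(result)
```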

If you are doing genomics in pandas you might want to look into pyranges:
import pyranges as pr
c = """feature Chromosome Start End count1 count2 count3
gene1 Chr1 1 10 100 30 22
gene2 Chr1 15 40 20 10 6
gene3 Chr1 50 70 40 11 7
gene4 Chr1 100 150 23 15 9
gene5 Chr2 5 30 24 17 2
gene5 Chr2 40 80 4 28 16
"""
c2 = """feature Chromosome Start End
range1 Chr1 1 45
range2 Chr1 55 160
range3 Chr2 10 90
range4 Chr2 100 200 """
gr, gr2 = pr.from_string(c), pr.from_string(c2)
j = gr2.join(gr).drop(like="_b")
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# | feature | Chromosome | Start | End | count1 | count2 | count3 |
# | (object) | (category) | (int32) | (int32) | (int64) | (int64) | (int64) |
# |------------+--------------+-----------+-----------+-----------+-----------+-----------|
# | range1 | Chr1 | 1 | 45 | 100 | 30 | 22 |
# | range1 | Chr1 | 1 | 45 | 20 | 10 | 6 |
# | range2 | Chr1 | 55 | 160 | 40 | 11 | 7 |
# | range2 | Chr1 | 55 | 160 | 23 | 15 | 9 |
# | range3 | Chr2 | 10 | 90 | 24 | 17 | 2 |
# | range3 | Chr2 | 10 | 90 | 4 | 28 | 16 |
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# Unstranded PyRanges object has 6 rows and 7 columns from 2 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
df = j.df
fs = {"Chromosome": "first", "Start": "first", "End": "first",
      "count1": "sum", "count2": "sum", "count3": "sum"}
result = df.groupby("feature").agg(fs)
# Chromosome Start End count1 count2 count3
# feature
# range1 Chr1 1 45 120 40 28
# range2 Chr1 55 160 63 26 16
# range3 Chr2 10 90 28 45 18

Related

iteration calculation based on another dataframe

How can I do an iterative calculation that produces df2, shown below as the desired output?
Any reference links would be appreciated. Many thanks for helping.
df1
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
df2 :
a b c
0 1 0 5 >> values from df1
1 19 18 9 >> values from (df1.iloc[1] * 2) + (df2.iloc[0] * 1)
2 23 22 25 >> values from (df1.iloc[2] * 2) + (df2.iloc[1] * 1)
3 35 28 25 >> values from (df1.iloc[3] * 2) + (df2.iloc[2] * 1)
4 47 30 39 >> values from (df1.iloc[4] * 2) + (df2.iloc[3] * 1)
IIUC, you can try:
df2 = df1.mul(2).cumsum().sub(df1.iloc[0])
Output:
a b c
0 1 0 5
1 19 18 9
2 23 22 25
3 35 28 25
4 47 30 39
more complex operation
If you want out[n] = x[n]*2 + out[n-1]*3, where out[n-1] is the previously computed value, you need to iterate:
def process(s):
    out = [s[0]]
    for x in s[1:]:
        out.append(x*2+out[-1]*3)
    return out
df1.apply(process)
Output:
a b c
0 1 0 5
1 21 18 19
2 67 58 73
3 213 180 219
4 651 542 671
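Both calculations in this answer are instances of the recurrence out[n] = a*x[n] + b*out[n-1] with out[0] = x[0]. A minimal sketch using itertools.accumulate from the standard library; the helper name recur and its parameters a and b are hypothetical, chosen only for this illustration:

```python
from itertools import accumulate

import pandas as pd

df1 = pd.DataFrame({"a": [1, 9, 2, 6, 6],
                    "b": [0, 9, 2, 3, 1],
                    "c": [5, 2, 8, 0, 7]})

def recur(s, a=2, b=1):
    """Return out with out[0] = s[0] and out[n] = a*s[n] + b*out[n-1]."""
    values = s.tolist()
    # accumulate feeds the previous result (prev) and the next input (x) to the lambda.
    out = accumulate(values[1:], lambda prev, x: a * x + b * prev, initial=values[0])
    return pd.Series(list(out), index=s.index)

simple = df1.apply(recur)         # a=2, b=1: the cumsum case
weighted = df1.apply(recur, b=3)  # a=2, b=3: the "more complex operation" case
print(simple)
print(weighted)
```

With b=1 the recurrence collapses to a running sum, which is why the cumsum one-liner works for the first case but not for the second.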

Python - Sort column ascending - using groupby

The following code:
import pandas as pd
df_original = pd.DataFrame({
    'race_num': [1,1,1,2,2,2,2,3,3],
    'race_position': [2,3,0,1,0,0,2,3,0],
    'percentage_place': [77,55,88,50,34,56,99,12,75]
})
Gives an output of:
race_num race_position percentage_place
1 2 77
1 3 55
1 0 88
2 1 50
2 0 34
2 0 56
2 2 99
3 3 12
3 0 75
I need to manipulate this dataframe to keep the race_num grouped but sort percentage_place in ascending order, with race_position staying aligned with its original percentage_place.
Desired output is:
race_num race_position percentage_place
1 0 88
1 2 77
1 3 55
2 2 99
2 0 56
2 1 50
2 0 34
3 0 75
3 3 12
My attempt is:
df_new = df_original.groupby(['race_num','race_position'])['percentage_place'].nlargest().reset_index()
Thank you in advance.
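Look into sort_values. Note that the desired output has percentage_place in descending order within each race_num, hence ascending=[True, False]: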
In [137]: df_original.sort_values(['race_num', 'percentage_place'], ascending=[True, False])
Out[137]:
race_num race_position percentage_place
2 1 0 88
0 1 2 77
1 1 3 55
6 2 2 99
5 2 0 56
3 2 1 50
4 2 0 34
8 3 0 75
7 3 3 12
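A self-contained version of that call, sketched with ignore_index=True (available since pandas 1.0) so that the sorted frame gets a fresh 0..n-1 index instead of keeping the original row labels; the name df_sorted is only illustrative:

```python
import pandas as pd

df_original = pd.DataFrame({
    "race_num": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "race_position": [2, 3, 0, 1, 0, 0, 2, 3, 0],
    "percentage_place": [77, 55, 88, 50, 34, 56, 99, 12, 75],
})

# race_num ascending keeps the groups together; percentage_place descending within each group.
df_sorted = df_original.sort_values(
    ["race_num", "percentage_place"],
    ascending=[True, False],
    ignore_index=True,
)
print(df_sorted)
```

Because whole rows are reordered, race_position stays attached to its percentage_place without any extra work.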

How to take the second-to-last value per group

For each group, I would like to keep the second-to-last value, as indicated below.
ID number
1 50
1 49
1 48
1 45
2 47
2 40
2 31
3 60
3 51
Example output
1 48
2 40
3 60
One liner:
df[df[::-1].groupby('ID').cumcount()[::-1]==1]
Output:
ID number
2 1 48
5 2 40
7 3 60
Use Groupby.nth with -2 :
df.groupby('ID')['number'].nth(-2)
[out]
ID
1 48
2 40
3 60
Name: number, dtype: int64
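A variant of the one-liner that avoids the double reversal, sketched with the ascending=False option of cumcount, which numbers the rows of each group from the end (0 for the last row, 1 for the one before it). Groups with fewer than two rows simply produce no row. The names from_end and second_to_last are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "number": [50, 49, 48, 45, 47, 40, 31, 60, 51],
})

# Position of each row counted from the end of its group: last row is 0.
from_end = df.groupby("ID").cumcount(ascending=False)

# Keep the rows that sit one step before the end of their group.
second_to_last = df[from_end == 1]
print(second_to_last)
```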

Winsorize within groups of dataframe

I have a dataframe like this:
df = pd.DataFrame([[1,2],
[1,4],
[1,5],
[2,65],
[2,34],
[2,23],
[2,45]], columns = ['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
You could use scipy's implementation of winsorize:
from scipy.stats.mstats import winsorize
df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01,0.01]))
Output
>>> df
label score score_winsor
0 1 2 2
1 1 4 4
2 1 5 5
3 2 65 65
4 2 34 34
5 2 23 23
6 2 45 45
This works:
import numpy as np
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
label score score_winsor
0 1 2 2.04
1 1 4 4.00
2 1 5 4.98
3 2 65 64.40
4 2 34 34.00
5 2 23 23.33
6 2 45 45.00
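The attempt in the question fails because the built-in min and max try to reduce a comparison on the whole Series to a single truth value and raise a ValueError, whereas an element-wise operation is needed. A sketch of an equivalent formulation with Series.clip, which bounds each value between its group's 1% and 99% quantiles (offered as an alternative spelling, not as what either answer uses):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [1, 5], [2, 65], [2, 34], [2, 23], [2, 45]],
                  columns=["label", "score"])

# Clip every score to the [1% quantile, 99% quantile] interval of its own group.
df["score_winsor"] = df.groupby("label")["score"].transform(
    lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99))
)
print(df.to_string())
```

Note that this clips to interpolated quantiles, so the limits are generally not values from the data, unlike scipy's winsorize, which replaces extremes with existing observations.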

sorting python pandas dataframe by matching values of two different columns and calculate the mean

I have a text file that, once imported as a pandas dataframe, looks like:
a b c d e
index
0 18 1 1 -30.47 0.746
1 19 1 1 -30.47 0.751
2 20 1 1 -30.47 0.801
3 33 2 1 -30.47 1.451
4 34 2 1 -30.47 1.534
5 35 2 1 -30.47 1.551
6 49 3 1 -30.47 2.297
7 50 3 1 -30.47 2.301
8 51 3 1 -30.47 2.351
9 64 4 1 -30.47 3.001
10 65 4 1 -30.47 3.085
11 66 4 1 -30.47 3.101
12 346 1 2 -28.47 0.601
13 347 1 2 -20 0.682
14 348 1 2 -28.47 0.701
15 362 2 2 -28.47 1.445
16 363 2 2 -28.47 1.451
17 364 2 2 -28.47 1.501
18 377 3 2 -28.47 2.151
19 378 3 2 -28.47 2.233
20 379 3 2 -28.47 2.251
21 392 4 2 -28.47 2.901
22 393 4 2 -28.47 2.996
23 394 4 2 -28.47 3.001
24 675 1 3 -25 0.596
25 676 1 3 -26 0.601
26 677 1 3 -22 0.651
27 690 2 3 -26.47 1.301
28 691 2 3 -26.47 1.384
29 692 2 3 -26.47 1.401
30 705 3 3 -26.47 2.051
31 706 3 3 -26.47 2.147
32 707 3 3 -26.47 2.151
33 721 4 3 -26.47 2.851
34 722 4 3 -26.47 2.935
35 723 4 3 -26.47 2.951
I have been trying to reorganize the dataframe as follows: for each value in column two, for example value 1, there are multiple corresponding values in columns three and four. For example,
value 1 (col two) corresponds to: value 1 (col three), -30.47 (col four);
value 1 (col three), -30.47 (col four); value 1 (col three), -30.47 (col four) ... value 3 (col three), -25 (col four); value 3 (col three), -26 (col four); value 3 (col three), -22 (col four)
and so on. I would like to create a new dataframe where, for value 1, there are three corresponding columns, 1, 2 and 3, each containing the mean of the three matching values of the original column four. The output should look like:
col 1, col 2, col 3, col 4
1 mean(-30.47,-30.47,-30.47) mean(-28.47,-20,-28.47) mean(-25,-26,-22)
The output should contain all the values for col 1, in this case 1, 2, 3 and 4 (a 4x3 table). I am not an expert in Python and have no idea how to approach this task beyond matching values in pairs. Any help is more than welcome!
IIUC:
df.groupby([2,3])[4].mean().reset_index(name='Mean').pivot(columns=3,index=2,values='Mean')
Output:
3 1 2 3
2
1 -30.47 -25.646667 -24.333333
2 -30.47 -28.470000 -26.470000
3 -30.47 -28.470000 -26.470000
4 -30.47 -28.470000 -26.470000
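The one-liner above refers to the columns by the integer labels 2, 3 and 4. With the letter headers shown in the question (b, c and d for those three columns), the same table can be sketched with pivot_table; the frame below rebuilds only columns b, c and d from the question's data, and the name table is illustrative:

```python
import pandas as pd

# Columns b, c and d of the question's table, in row order.
b = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4] * 3
c = [1] * 12 + [2] * 12 + [3] * 12
d = ([-30.47] * 12
     + [-28.47, -20, -28.47] + [-28.47] * 9
     + [-25, -26, -22] + [-26.47] * 9)
df = pd.DataFrame({"b": b, "c": c, "d": d})

# One row per value of b, one column per value of c, each cell the mean of d.
table = df.pivot_table(index="b", columns="c", values="d", aggfunc="mean")
print(table)
```

pivot_table does the grouping and the reshaping in one step, which here is equivalent to the groupby, mean and pivot chain.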