select rows based on rows in a second dataframe - pandas

I have two dfs and I'm looking for a way to select (and count) rows of df1 based on rows in df2.
This is my df1:
Chromosome Start position End position Reference Variant reads \
0 chr1 109419841 109419841 C T 1
1 chr1 197008365 197008365 C T 1
variation reads % variation gDNA nomencl \
0 1 100 Chr1(GRCh37):g.109419841C>T
1 1 100 Chr1(GRCh37):g.197008365C>T
cDNA nomencl ... exon transcript ID inheritance \
0 NM_013296.4:c.-258C>T ... 2 NM_013296.4 Autosomal recessive
1 NM_001994.2:c.*143G>A ... UTR NM_001994.2 Autosomal recessive
test type Phenotype male coverage male ratio covered \
0 Unknown Deafness, autosomal recessief 0 0
1 Unknown Factor 13 deficientie 0 0
female coverage female ratio covered ratio M:F
0 1 1 0.0
1 1 1 0.0
df1 has these columns:
Chromosome 10561 non-null object
Start position 10561 non-null int64
End position 10561 non-null int64
Reference 10415 non-null object
Variant 10536 non-null object
reads 10561 non-null int64
variation reads 10561 non-null int64
% variation 10561 non-null int64
gDNA nomencl 10561 non-null object
cDNA nomencl 10446 non-null object
protein nomencl 9997 non-null object
classification 10561 non-null object
status 10561 non-null object
gene 10560 non-null object
Sanger sequencing list 10561 non-null object
exon 10502 non-null object
transcript ID 10460 non-null object
inheritance 8259 non-null object
test type 10561 non-null object
Phenotype 10380 non-null object
male coverage 10561 non-null int64
male ratio covered 10561 non-null int64
female coverage 10561 non-null int64
female ratio covered 10561 non-null int64
and this is df2:
Chromosome Startposition Endposition Bases Meancoverage \
0 chr1 11073785 11074022 27831.0 117.927966
1 chr1 11076901 11077064 11803.0 72.411043
Mediancoverage Ratiocovered>10X Ratiocovered>20X Genename Componentnr \
0 97.0 1.0 1.0 TARDBP 1
1 76.0 1.0 1.0 TARDBP 2
PositionGenes PositionGenome Position
0 TARDBP.1 chr1.11073785-11074022 comp.1_chr1.11073785-11074022
1 TARDBP.2 chr1.11076901-11077064 comp.2_chr1.11076901-11077064
I want to select all rows from df1 for which there is a row in df2 with:
the same value for 'Chromosome'
df1['Start position'] >= df2.Startposition
df1['End position'] <= df2.Endposition
If these three criteria are met in the same row of df2, I want to select the corresponding row in df1.
I already fused the three columns 'Chromosome', 'Startposition' and 'Endposition' into 'PositionGenome' in order to write a lambda function, but couldn't come up with anything.
Thus, I hope you can help me ...

A short update: in the end I solved the problem with unix bedtools -wb. Still, I would be glad if someone could come up with a Python-based solution.
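For reference, a minimal pandas sketch of one possible join-and-filter approach (this is not from the original thread; it assumes the per-chromosome merge of df1 and df2 fits in memory):
merged = df1.merge(df2, on='Chromosome', how='inner')
# keep df1 rows whose interval lies inside a df2 interval on the same chromosome
inside = merged[(merged['Start position'] >= merged['Startposition'])
                & (merged['End position'] <= merged['Endposition'])]
# a df1 row may match several df2 intervals, so drop duplicates before counting
result = inside.drop_duplicates(subset=['Chromosome', 'Start position', 'End position'])
print(len(result))  # number of df1 rows that fall inside some df2 interval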

Related

KeyError raised when trying to delete an existing column

RangeIndex: 381732 entries, 0 to 381731
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 381732 non-null int64
1 tweet_id 378731 non-null float64
2 time 378731 non-null object
3 tweet 378731 non-null object
4 retweet_count 336647 non-null float64
5 Unnamed: 0.1 336647 non-null float64
6 User 3001 non-null object
7 Date_Created 3001 non-null object
8 Source of Tweet 3001 non-null object
9 Tweet 3001 non-null object
dtypes: float64(3), int64(1), object(6)
memory usage: 29.1+ MB
df = df.drop(['Unnamed: 0','Unnamed: 0.1','User','Date_Created','Source of Tweet'],axis =1)
df.head()
I wrote this code to drop unwanted columns from my dataframe, but I am encountering a KeyError: not found in axis.
KeyError: "['Unnamed: 0', 'Unnamed: 0.1', 'User', 'Date_Created', 'Source of Tweet'] not found in axis"
For debugging purposes, try:
cols_to_drop = ['Unnamed: 0','Unnamed: 0.1','User','Date_Created','Source of Tweet']
df = df[[col for col in df.columns if not col in cols_to_drop]]
and check the remaining columns using df.info()
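If the KeyError only shows up because the cell is re-run after the columns were already dropped, a small sketch using drop's errors='ignore' option (standard pandas) avoids it:
cols_to_drop = ['Unnamed: 0', 'Unnamed: 0.1', 'User', 'Date_Created', 'Source of Tweet']
# errors='ignore' silently skips labels that are not present instead of raising a KeyError
df = df.drop(columns=cols_to_drop, errors='ignore')
df.info()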

Pandas - get values on a graph using quantile

I have this df_players:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TableIndex 739 non-null object
1 PlayerID 739 non-null int64
2 GameWeek 739 non-null int64
3 Date 739 non-null object
4 Points 739 non-null int64
5 Price 739 non-null float64
6 BPS 739 non-null int64
7 SelectedBy 739 non-null int64
8 NetTransfersIn 739 non-null int64
9 MinutesPlayed 739 non-null float64
10 CleanSheet 739 non-null float64
11 Saves 739 non-null float64
12 PlayersBasicID 739 non-null int64
13 PlayerCode 739 non-null object
14 FirstName 739 non-null object
15 WebName 739 non-null object
16 Team 739 non-null object
17 Position 739 non-null object
18 CommentName 739 non-null object
And I'm using this function, with quantile() (its value is passed via the variable 'cut'), to plot the distribution of players:
def jointplot(X, Y, week=None, title=None,
              positions=None, height=6,
              xlim=None, ylim=None, cut=0.015,
              color=CB91_Blue, levels=30, bw=0.5, top_rows=100000):
    if positions == None:
        positions = ['GKP','DEF','MID','FWD']
    # Check if week is given as a list
    if week == None:
        week = list(range(max(df_players['GameWeek'])))
    if type(week) != list:
        week = [week]
    df_played = df_players.loc[(df_players['MinutesPlayed'] >= 45)
                               & (df_players['GameWeek'].isin(week))
                               & (df_players['Position'].isin(positions))].head(top_rows)
    if xlim == None:
        xlim = (df_played[X].quantile(cut),
                df_played[X].quantile(1-cut))
    if ylim == None:
        ylim = (df_played[Y].quantile(cut),
                df_played[Y].quantile(1-cut))
    sns.jointplot(X, Y, data=df_played,
                  kind="kde", xlim=xlim, ylim=ylim,
                  color=color, n_levels=levels,
                  height=height, bw=bw);
    plt.suptitle(title, fontsize=18);
    plt.show()
call:
jointplot('Price', 'Points', positions=['FWD'],
color=color_list[3], title='Forwards')
this plots the joint KDE of Price vs Points (figure not reproduced here), where:
xlim = (4.5, 11.892999999999995)
ylim = (1.0, 13.0)
As I understand it, these x and y limits let me zoom into an area of data points using the quantile range (cut, 1-cut).
QUESTION
Now I would like to get the 'WebName' of the players within a certain area, like so:
After plotting I can choose a target area and define its range, roughly, by passing xlim and ylim:
jointplot('Price', 'Points', positions=['FWD'],
xlim=(5.5, 7.0), ylim=(11.5, 13.0),
color=color_list[3], title='Forwards')
which zooms in on the area marked in red above.
But how can I get the players' names inside that area?
You can just select the portion of the players dataframe based on the bounds in the plot:
selected = df_players[
(df_players.Points >= points_lbound)
& (df_players.Points <= points_ubound)
& (df_players.Price >= price_lbound)
& (df_players.Price <= price_ubound)
]
The list of names would then be selected.WebName (the column is called 'WebName' in df_players).
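A short usage sketch, using as bounds the same xlim/ylim values passed to the zoomed-in call above:
# bounds read off the zoomed-in jointplot
price_lbound, price_ubound = 5.5, 7.0
points_lbound, points_ubound = 11.5, 13.0

selected = df_players[
    (df_players.Points >= points_lbound)
    & (df_players.Points <= points_ubound)
    & (df_players.Price >= price_lbound)
    & (df_players.Price <= price_ubound)
]
print(selected.WebName.unique())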

What is the difference between total = df.isnull().sum(), percent1 = df.count() and percent = df.isnull().count()?

Can anyone tell the difference between total = df.isnull().sum(), percent1 = df.count() and percent = df.isnull().count()? Ideally df.isnull().count() should give the count of only the null values, but it is giving the count of all values. Can anyone help me understand this?
Below is the code where the variable total comes out as the count of null values only, percent1 as the count of non-null values only, and percent as the count of all values, null or not.
total= df.isnull().sum().sort_values(ascending=False)
percent1= df.count()#helps to get all the non null values count
percent= df.isnull().count()
print(total)
print(percent1)
print(percent)
The definition of count according to the docs is:
Count non-NA cells for each column or row.
Using isnull (or isna) turns your dataframe df, whatever dtypes it holds, into a boolean dataframe: True where df originally had NaN and False otherwise. There are no NaN left in that boolean dataframe, so count on df.isnull() returns the number of rows of df. With an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(4), 'b': [1, np.nan, 3, np.nan]})
print (df)
a b
0 0 1.0
1 1 NaN
2 2 3.0
3 3 NaN
if you use count on this dataframe you get:
print (df.count())
a 4
b 2 # only 2 non-NA values here, because 2 of the 4 values in column b are NaN, as defined above
dtype: int64
but if you use isnull on it you get
print (df.isnull())
a b
0 False False
1 False True #was nan in column b in df
2 False False
3 False True
Here there are no NaN any more, so the result of count is the number of rows for both columns:
print (df.isnull().count())
a 4
b 4 #no more nan in df.isnull()
dtype: int64
But because True is equal to 1 and False to 0, the sum method adds one for each True in df.isnull(), i.e. for each NaN originally in df:
print (df.isnull().sum())
a 0 # because only False in column a of df.isnull()
b 2 # because you have two True in df.isnull() in column b
dtype: int64
Finally, you can see the relation like this:
(df.count()+df.isnull().sum())==df.isnull().count()
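A quick check of that relation on the example frame built above:
# non-NA counts plus NaN counts equal the total row count per column
print((df.count() + df.isnull().sum()) == df.isnull().count())
# a    True
# b    True
# dtype: bool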

calculate the mean of one row according to its label

Calculate the mean of the values in one row according to its label:
import pandas as pd

A = [1,2,3,4,5,6,7,8,9,10]
B = [0,0,0,0,0,1,1,1,1,1]
Result = pd.DataFrame(data=[A, B])
I want the output to be: 0 -> 3; 1 -> 8
pandas has the groupby function, but I don't know how to implement this. Thanks
This is a simple groupby problem ...
Result=Result.T
Result.groupby(Result[1])[0].mean()
Out[372]:
1
0 3
1 8
Name: 0, dtype: int64
Firstly, it sounds like you want to label the index:
In [11]: Result = pd.DataFrame(data=[A, B], index=['A', 'B'])
In [12]: Result
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 1 2 3 4 5 6 7 8 9 10
B 0 0 0 0 0 1 1 1 1 1
If the index was unique you wouldn't have to do any groupby, just take the mean of each row (that's the axis=1):
In [13]: Result.mean(axis=1)
Out[13]:
A 5.5
B 0.5
dtype: float64
However, if you had multiple rows with the same label, then you'd need to groupby:
In [21]: Result2 = pd.DataFrame(data=[A, A, B], index=['A', 'A', 'B'])
In [22]: Result2.mean(axis=1)
Out[22]:
A 5.5
A 5.5
B 0.5
dtype: float64
Note: the duplicate rows here happen to have the same mean because I lazily reused the same row contents; in general we'd want to take the mean of those means:
In [23]: Result2.mean(axis=1).groupby(level=0).mean()
Out[23]:
A 5.5
B 0.5
dtype: float64
Note: .groupby(level=0) groups the rows which have the same index label.
You're making it difficult for yourself by constructing the dataframe so that the values you want to average and the values you want as labels end up as different rows.
Option 1
groupby
This deals with the data as presented in the dataframe Result:
Result.loc[0].groupby(Result.loc[1]).mean()
1
0 3
1 8
Name: 0, dtype: int64
Option 2
Overkill using np.bincount; it is this simple because your grouping values are 0 and 1 (I'd have a solution even if they weren't, but this keeps it short).
I wanted to use the raw lists A and B:
pd.Series(np.bincount(B, A) / np.bincount(B))
0 3.0
1 8.0
dtype: float64
Option 3
Construct a series instead of a dataframe.
Again using the raw lists A and B:
pd.Series(A, B).mean(level=0)
0 3
1 8
dtype: int64
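One aside not in the original answers: in recent pandas versions the level= argument to Series.mean has been removed, so an equivalent sketch of Option 3 would group explicitly by the index labels:
import pandas as pd

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# build a Series whose index is the label list B, then group by those labels
print(pd.Series(A, index=B).groupby(level=0).mean())
# 0    3.0
# 1    8.0
# dtype: float64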

Pandas .count() puts in first row name out of nowhere?

I have a pandas dataframe, where the first row is called school and the last row is called passed, and it has only numbers 1 and 0.
I simply wanted to count how often 1 or 0 occurs in that row.
I went with:
n_passed = df[df.passed==1].count()
The funny thing is, it gives me the correct number, but it also outputs 'school', for a reason that is beyond me.
school 265
Can anyone shed some light on this?
IIUC you mean columns, not rows: passed and school are columns. Then you can use value_counts with the column passed:
print df
school aa bb passed
0 1 0 1 1
1 0 1 0 0
2 1 1 0 1
3 0 0 1 1
n_passed1 = df.passed[df.passed==1].value_counts()
print n_passed1
1 3
Name: passed, dtype: int64
n_passed0 = df.passed[df.passed==0].value_counts()
print n_passed0
0 1
Name: passed, dtype: int64
But I think the best is use:
n_passed1 = df.passed.value_counts()
print n_passed1
1 3
0 1
Name: passed, dtype: int64
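For completeness, a small sketch (reusing the example frame printed in the answer above) of why the original call also prints 'school': DataFrame.count() counts non-NA cells per column, so filtering the frame and then calling count() returns one number per column rather than a single value:
import pandas as pd

df = pd.DataFrame({'school': [1, 0, 1, 0],
                   'aa':     [0, 1, 1, 0],
                   'bb':     [1, 0, 0, 1],
                   'passed': [1, 0, 1, 1]})

# count() works column-wise, hence entries for 'school', 'aa', 'bb' and 'passed'
print(df[df.passed == 1].count())

# counting just the one column gives a single number
print((df.passed == 1).sum())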