Pandas - get values on a graph using quantile

I have this df_players:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TableIndex 739 non-null object
1 PlayerID 739 non-null int64
2 GameWeek 739 non-null int64
3 Date 739 non-null object
4 Points 739 non-null int64
5 Price 739 non-null float64
6 BPS 739 non-null int64
7 SelectedBy 739 non-null int64
8 NetTransfersIn 739 non-null int64
9 MinutesPlayed 739 non-null float64
10 CleanSheet 739 non-null float64
11 Saves 739 non-null float64
12 PlayersBasicID 739 non-null int64
13 PlayerCode 739 non-null object
14 FirstName 739 non-null object
15 WebName 739 non-null object
16 Team 739 non-null object
17 Position 739 non-null object
18 CommentName 739 non-null object
And I'm using this function, with quantile() (value passed by variable 'cut'), to plot the distribution of players:
def jointplot(X, Y, week=None, title=None,
              positions=None, height=6,
              xlim=None, ylim=None, cut=0.015,
              color=CB91_Blue, levels=30, bw=0.5, top_rows=100000):
    if positions is None:
        positions = ['GKP', 'DEF', 'MID', 'FWD']
    # Check if week is given as a list
    if week is None:
        week = list(range(max(df_players['GameWeek'])))
    if type(week) != list:
        week = [week]
    df_played = df_players.loc[(df_players['MinutesPlayed'] >= 45)
                               & (df_players['GameWeek'].isin(week))
                               & (df_players['Position'].isin(positions))].head(top_rows)
    if xlim is None:
        xlim = (df_played[X].quantile(cut),
                df_played[X].quantile(1 - cut))
    if ylim is None:
        ylim = (df_played[Y].quantile(cut),
                df_played[Y].quantile(1 - cut))
    sns.jointplot(X, Y, data=df_played,
                  kind="kde", xlim=xlim, ylim=ylim,
                  color=color, n_levels=levels,
                  height=height, bw=bw)
    plt.suptitle(title, fontsize=18)
    plt.show()
call:
jointplot('Price', 'Points', positions=['FWD'],
color=color_list[3], title='Forwards')
this plots the distribution (figure omitted), where:
xlim = (4.5, 11.892999999999995)
ylim = (1.0, 13.0)
As I understand it, these x and y limits let me zoom into an area of data points by using the quantile range (cut, 1 - cut).
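For instance (a quick check, assuming df_played is the frame filtered inside the function), the default cut=0.015 trims the extreme 1.5% tails on each side:
cut = 0.015
print(df_played['Price'].quantile(cut), df_played['Price'].quantile(1 - cut))
# -> 4.5 11.892999999999995, matching the xlim reported above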
QUESTION
Now I would like to get player 'WebName' for players within a certain area, like so:
After plotting, I can choose a target area (marked above) and define its range, roughly, by passing xlim and ylim:
jointplot('Price', 'Points', positions=['FWD'],
          xlim=(5.5, 7.0), ylim=(11.5, 13.0),
          color=color_list[3], title='Forwards')
which zooms in on the area marked in red above.
But how can I get players names inside that area?

You can just select the portion of the players dataframe based on the bounds in the plot:
selected = df_players[
    (df_players.Points >= points_lbound)
    & (df_players.Points <= points_ubound)
    & (df_players.Price >= price_lbound)
    & (df_players.Price <= price_ubound)
]
The list of names is then selected.WebName
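For example, to list the forwards inside the zoomed area from the call above (a minimal sketch; the bounds mirror the xlim/ylim passed to jointplot, and the filter repeats the one used for plotting):
# repeat the filter used when plotting, then apply the plot bounds
df_played = df_players[(df_players['MinutesPlayed'] >= 45)
                       & (df_players['Position'].isin(['FWD']))]
in_area = df_played[df_played['Price'].between(5.5, 7.0)
                    & df_played['Points'].between(11.5, 13.0)]
print(in_area['WebName'].tolist())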

Related

Key Error Raise when trying to delete an existing column

RangeIndex: 381732 entries, 0 to 381731
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 381732 non-null int64
1 tweet_id 378731 non-null float64
2 time 378731 non-null object
3 tweet 378731 non-null object
4 retweet_count 336647 non-null float64
5 Unnamed: 0.1 336647 non-null float64
6 User 3001 non-null object
7 Date_Created 3001 non-null object
8 Source of Tweet 3001 non-null object
9 Tweet 3001 non-null object
dtypes: float64(3), int64(1), object(6)
memory usage: 29.1+ MB
df = df.drop(['Unnamed: 0','Unnamed: 0.1','User','Date_Created','Source of Tweet'],axis =1)
df.head()
I wrote this code to drop unwanted columns from my dataframe, but I am encountering a KeyError ("not found in axis"):
KeyError: "['Unnamed: 0', 'Unnamed: 0.1', 'User', 'Date_Created', 'Source of Tweet'] not found in axis"
For debugging purposes, try:
cols_to_drop = ['Unnamed: 0', 'Unnamed: 0.1', 'User', 'Date_Created', 'Source of Tweet']
df = df[[col for col in df.columns if col not in cols_to_drop]]
and check the remaining columns using df.info()
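Alternatively (a sketch, assuming the error comes from some of these columns already having been dropped in an earlier run of the cell), drop can be told to skip missing labels:
cols_to_drop = ['Unnamed: 0', 'Unnamed: 0.1', 'User', 'Date_Created', 'Source of Tweet']
# errors='ignore' skips any label that is not present, so re-running is safe
df = df.drop(columns=cols_to_drop, errors='ignore')
df.info()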

should I log transform my model for Linear Regression analysis

I have a dataset of boston houses with the following features
<class 'pandas.core.frame.DataFrame'>
Int64Index: 414 entries, 1 to 414
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X2 house age 414 non-null float64
1 X3 distance to the nearest MRT station 414 non-null float64
2 X4 number of convenience stores 414 non-null int64
3 X5 latitude 414 non-null float64
4 X6 longitude 414 non-null float64
5 Y house price of unit area 414 non-null float64
dtypes: float64(5), int64(1)
standard deviation is:
X2 house age 11.392485
X3 distance to the nearest MRT station 1262.109595
X4 number of convenience stores 2.945562
X5 latitude 0.012410
X6 longitude 0.015347
Y house price of unit area 13.606488
dtype: float64
I calculated the skew of the prices and got a value of 0.599.
After log-transforming the data, I got a value of -0.7064.
My question is: should I continue working with the log-transformed dataset, or is the transformation unnecessary? And when should I even consider a log transform in my data analysis?
Whether or not to use a log transformation depends entirely on what fits your data better. Just measure the performance of both models (log-transformed and not) and see which one has the better performance metrics.
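For instance (a minimal sketch, assuming the DataFrame is named df with the columns listed above), you could fit both variants and compare R^2 on a held-out split:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Y house price of unit area'])
y = df['Y house price of unit area']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# raw target
raw = LinearRegression().fit(X_train, y_train)
print('raw R^2:', r2_score(y_test, raw.predict(X_test)))

# log-transformed target: fit on log(y), invert with exp before scoring
logm = LinearRegression().fit(X_train, np.log(y_train))
print('log R^2:', r2_score(y_test, np.exp(logm.predict(X_test))))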

How to make a graph plotting monthly data over many years in pandas

I have 11 years worth of hourly ozone concentration data.
There are 11 csv files containing ozone concentrations at every hour of every day.
I was able to read all of the files in and convert the index from date to datetime.
For my graph:
I calculated the maximum daily 8-hour average and then averaged those values over each month.
My new dataframe (df3) has:
a datetime index, which consists of the last day of the month for each month of the year over the 12 years.
It also has a column including the average MDA8 values.
I want make 3 separate scatter plots for the months of April, May, and June. (x axis = year, y axis = average MDA8 for the month)
However, I am getting stuck on how to call these individual months and plot the yearly data.
Minimal sample
site,date,start_hour,value,variable,units,quality,prelim,name
3135,2010-01-01,0,13.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,1,5.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,2,11.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,3,17.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,5,16.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
Here's a link to find similar CSV data https://www.arb.ca.gov/aqmis2/aqdselect.php?tab=hourly
I've attached some code below:
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt

path = "C:/Users/blah"
for f in glob.glob(os.path.join(path, "*.csv")):
    df = pd.read_csv(f, header=0, index_col='date')
    df.dropna(axis=0, how="all", subset=['start_hour', 'variable'], inplace=True)
    df = df.iloc[0:]
    df.index = pd.to_datetime(df.index)  # converting date to datetime
    df['start_hour'] = pd.to_timedelta(df['start_hour'], unit='h')
    df['datetime'] = df.index + df['start_hour']
    df.set_index('datetime', inplace=True)
    df2 = df.value.rolling('8H', min_periods=6).mean()
    df2.index -= pd.DateOffset(hours=3)
    df2 = df2.resample('D').max()
    df2.index.name = 'timestamp'
The problem occurs below:
df3 = df2.groupby(pd.Grouper(freq='M')).mean()
df4 = df3[df3.index.month.isin([4, 5, 6])]
if df4 == True:
    plt.plot(df3.index, df3.values)
    print(df4)
whenever I do this, I get a message saying "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
When I try this code with df4.any() == True:, it plots all of the months except April-June and it plots all values in the same plot. I want different plots for each month.
I've also tried adding the following and removing the previous if statement:
df5 = df4.index.year.isin([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019])
if df5.all() == True:
    plt.plot(df4.index, df4.values)
However, this gives me an image like the one below (figure omitted).
Again, I want to make a separate scatterplot for each month, although this is closer to what I want. Any help would be appreciated, thanks.
EDIT
In addition, I have 2020 data, which only extends to the month of July. I don't think this is going to affect my graph, but I just wanted to mention it.
Ideally, I want it to look something like this (figure omitted), but with a different point for each year, for the individual month of April.
df.index -= pd.DateOffset(hours=3) has been removed because it is potentially problematic:
The first hours of each month would land in the previous month
The first hours of each day would land in the previous day
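A small illustration of that issue (hypothetical timestamps):
import pandas as pd
idx = pd.DatetimeIndex(['2011-05-01 00:00', '2011-05-01 02:00'])
print(idx - pd.DateOffset(hours=3))
# DatetimeIndex(['2011-04-30 21:00:00', '2011-04-30 23:00:00'], dtype='datetime64[ns]', freq=None)
# -> the first hours of May land in April (and in the previous day)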
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import date
from pandas.tseries.offsets import MonthEnd
# set the path to the files
p = Path('/PythonProjects/stack_overflow/data/ozone/')
# list of files
files = list(p.glob('OZONE*.csv'))
# create a dataframe from the files - all years all data
df = pd.concat([pd.read_csv(file) for file in files])
# format the dataframe
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df = df[df.month.isin([4, 5, 6])].copy() # filter the dataframe - only April, May, June
df.set_index('datetime', inplace = True)
# calculate the 8-hour rolling mean
df['r_mean'] = df.value.rolling('8H', min_periods=6).mean()
# determine max value per day
r_mean_daily_max = df.groupby(['year', 'month', 'day'], as_index=False)['r_mean'].max()
# calculate the mean from the daily max
mda8 = r_mean_daily_max.groupby(['year', 'month'], as_index=False)['r_mean'].mean()
# add a new datetime column with the date as the end of the month
mda8['datetime'] = pd.to_datetime(mda8.year.astype(str) + mda8.month.astype(str), format='%Y%m') + MonthEnd(1)
df.info() & .head() before any processing
<class 'pandas.core.frame.DataFrame'>
Int64Index: 78204 entries, 0 to 4663
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 78204 non-null int64
1 date 78204 non-null object
2 start_hour 78204 non-null int64
3 value 78204 non-null float64
4 variable 78204 non-null object
5 units 78204 non-null object
6 quality 4664 non-null float64
7 prelim 4664 non-null object
8 name 78204 non-null object
dtypes: float64(2), int64(2), object(5)
memory usage: 6.0+ MB
site date start_hour value variable units quality prelim name
0 3135 2011-01-01 0 14.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
1 3135 2011-01-01 1 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
2 3135 2011-01-01 2 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
3 3135 2011-01-01 3 25.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
4 3135 2011-01-01 5 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
df.info() & .head() after processing
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20708 entries, 2011-04-01 00:00:00 to 2020-06-30 23:00:00
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 20708 non-null int64
1 value 20708 non-null float64
2 variable 20708 non-null object
3 units 20708 non-null object
4 quality 2086 non-null float64
5 prelim 2086 non-null object
6 name 20708 non-null object
7 month 20708 non-null int64
8 day 20708 non-null int64
9 year 20708 non-null int64
10 r_mean 20475 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 1.9+ MB
site value variable units quality prelim name month day year r_mean
datetime
2011-04-01 00:00:00 3135 13.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 01:00:00 3135 29.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 02:00:00 3135 31.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 03:00:00 3135 28.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 05:00:00 3135 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
r_mean_daily_max.info() and .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 910 entries, 0 to 909
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 910 non-null int64
1 month 910 non-null int64
2 day 910 non-null int64
3 r_mean 910 non-null float64
dtypes: float64(1), int64(3)
memory usage: 35.5 KB
year month day r_mean
0 2011 4 1 44.125
1 2011 4 2 43.500
2 2011 4 3 42.000
3 2011 4 4 49.625
4 2011 4 5 45.500
mda8.info() & .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 30 non-null int64
1 month 30 non-null int64
2 r_mean 30 non-null float64
3 datetime 30 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 1.2 KB
year month r_mean datetime
0 2011 4 49.808135 2011-04-30
1 2011 5 55.225806 2011-05-31
2 2011 6 58.162302 2011-06-30
3 2012 4 45.865278 2012-04-30
4 2012 5 61.061828 2012-05-31
mda8
plot 1
sns.lineplot(mda8.datetime, mda8.r_mean, marker='o')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plot 2
# create color mapping based on all unique values of year
years = mda8.year.unique()
colors = sns.color_palette('husl', n_colors=len(years)) # get a number of colors
cmap = dict(zip(years, colors)) # zip values to colors
for g, d in mda8.groupby('year'):
    sns.lineplot(d.datetime, d.r_mean, marker='o', hue=g, palette=cmap)
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)
plot 3
sns.barplot(x='month', y='r_mean', data=mda8, hue='year')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.title('MDA8: April - June')
plt.ylabel('mda8 (ppb)')
plt.show()
plot 4
for month in mda8.month.unique():
    data = mda8[mda8.month == month]  # filter the data for a specific month
    plt.figure()  # create a new figure for each month
    sns.lineplot(data.datetime, data.r_mean, marker='o')
    plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
    plt.title(f'Month: {month}')
    plt.ylabel('MDA8: PPB')
    plt.xlabel('Year')
There will be one plot per month
plot 5
for month in mda8.month.unique():
    data = mda8[mda8.month == month]
    sns.lineplot(data.datetime, data.r_mean, marker='o', label=month)
plt.legend(title='Month')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.ylabel('MDA8: PPB')
plt.xlabel('Year')
Addressing "I want to make 3 separate scatter plots for the months of April, May, and June":
The main issue is that this data can't be plotted with a datetime axis.
The objective is to plot each day on the axis, with each figure showing a different month.
Lineplot
It's kind of busy
A custom color map has been used because there aren't enough colors in the standard palette to give each year a unique color
# create color mapping based on all unique values of year
years = df.index.year.unique()
colors = sns.color_palette('husl', n_colors=len(years))  # get a number of colors
cmap = dict(zip(years, colors))  # zip values to colors

for k, v in df.groupby('month'):  # group the dataframe by month
    plt.figure(figsize=(16, 10))
    for year in v.index.year.unique():  # within the month, plot each year
        data = v[v.index.year == year]
        sns.lineplot(data.index.day, data.r_mean, err_style=None, hue=year, palette=cmap)
    plt.xlim(0, 33)
    plt.xticks(range(1, 32))
    plt.title(f'Month: {k}')
    plt.xlabel('Day of Month')
    plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)
    plt.show()
Here's April (figure omitted); the other two figures look similar.
Barplot
for k, v in df.groupby('month'):  # group the dataframe by month
    plt.figure(figsize=(10, 20))
    sns.barplot(x=v.r_mean, y=v.day, ci=None, orient='h', hue=v.index.year)
    plt.title(f'Month: {k}')
    plt.ylabel('Day of Month')
    plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)
    plt.show()

convert csv import via pandas to separate columns

I have a csv file that came into pandas like this:
csv file:
Date,Numbers,Extra, NaN
05/17/2002,15 18 25 33 47,30,
Pandas input:
df = pd.read_csv('/Users/owner/Downloads/file.csv')
#s = Series('05/17/2002', '15 18 25 33 47')
#s.str.partition(' ')
Output
Date Numbers Extra
<bound method NDFrame.head of Draw Date Winning Numbers Extra NaN
05/17/2002 15 18 25 33 47 30 NaN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1718 entries, 0 to 1717
Data columns (total 4 columns):
Date 1718 non-null object
Numbers 1718 non-null object
Extra 1718 non-null int64
NaN 815 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 53.8+ KB
How do I convert the non-null objects into two columns:
one that is a date
one that is a list
It doesn't seem to recognize split, .str, or the headings.
Thanks
I think you want this. It specifies column 0 as a date column, and a converter for column 1:
>>> df = pd.read_csv('file.csv',parse_dates=[0],converters={1:str.split})
>>> df
Date Numbers Extra NaN
0 2002-05-17 [15, 18, 25, 33, 47] 30
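If you also want the numbers as integers rather than strings (a small follow-up sketch), convert each list after reading:
# str.split leaves each entry as a list of strings; map them to ints
df['Numbers'] = df['Numbers'].map(lambda nums: [int(n) for n in nums])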

select rows based on rows in second column

I have two dfs and looking for an way to select (and count) rows of df1 based on rows in df2.
This is my df1:
Chromosome Start position End position Reference Variant reads \
0 chr1 109419841 109419841 C T 1
1 chr1 197008365 197008365 C T 1
variation reads % variation gDNA nomencl \
0 1 100 Chr1(GRCh37):g.109419841C>T
1 1 100 Chr1(GRCh37):g.197008365C>T
cDNA nomencl ... exon transcript ID inheritance \
0 NM_013296.4:c.-258C>T ... 2 NM_013296.4 Autosomal recessive
1 NM_001994.2:c.*143G>A ... UTR NM_001994.2 Autosomal recessive
test type Phenotype male coverage male ratio covered \
0 Unknown Deafness, autosomal recessief 0 0
1 Unknown Factor 13 deficientie 0 0
female coverage female ratio covered ratio M:F
0 1 1 0.0
1 1 1 0.0
df1 has these columns:
Chromosome 10561 non-null object
Start position 10561 non-null int64
End position 10561 non-null int64
Reference 10415 non-null object
Variant 10536 non-null object
reads 10561 non-null int64
variation reads 10561 non-null int64
% variation 10561 non-null int64
gDNA nomencl 10561 non-null object
cDNA nomencl 10446 non-null object
protein nomencl 9997 non-null object
classification 10561 non-null object
status 10561 non-null object
gene 10560 non-null object
Sanger sequencing list 10561 non-null object
exon 10502 non-null object
transcript ID 10460 non-null object
inheritance 8259 non-null object
test type 10561 non-null object
Phenotype 10380 non-null object
male coverage 10561 non-null int64
male ratio covered 10561 non-null int64
female coverage 10561 non-null int64
female ratio covered 10561 non-null int64
and this is df2:
Chromosome Startposition Endposition Bases Meancoverage \
0 chr1 11073785 11074022 27831.0 117.927966
1 chr1 11076901 11077064 11803.0 72.411043
Mediancoverage Ratiocovered>10X Ratiocovered>20X Genename Componentnr \
0 97.0 1.0 1.0 TARDBP 1
1 76.0 1.0 1.0 TARDBP 2
PositionGenes PositionGenome Position
0 TARDBP.1 chr1.11073785-11074022 comp.1_chr1.11073785-11074022
1 TARDBP.2 chr1.11076901-11077064 comp.2_chr1.11076901-11077064
I want to select all rows from df1 for which there is a row in df2 with:
the same value for 'Chromosome'
df1['Start position'] >= df2.Startposition
df1['End position'] <= df2.Endposition
If these three criteria are met in the same row of df2, I want to select the corresponding row in df1.
I already fused the three columns 'Chromosome', 'Startposition' and 'Endposition' into 'PositionGenome' in order to write a lambda function, but I couldn't come up with anything.
I hope you can help me.
A short update: in the end I solved the problem with unix bedtools -wb. Still, I would be glad if someone could come up with a Python-based solution.
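For reference, a pandas-based sketch of that selection (an assumed approach, not the bedtools one: merge on Chromosome, which pairs every df1 row with every df2 interval on the same chromosome, then filter on the interval bounds; fine as long as both frames fit in memory):
# pair every df1 row with every df2 interval on the same chromosome
merged = df1.merge(df2[['Chromosome', 'Startposition', 'Endposition']],
                   on='Chromosome')
# keep pairs where the df1 interval lies inside the df2 interval
mask = ((merged['Start position'] >= merged['Startposition'])
        & (merged['End position'] <= merged['Endposition']))
selected = merged.loc[mask, df1.columns].drop_duplicates()
print(len(selected))  # number of selected df1 rows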