How to loop through a pandas DataFrame to make boxplots at once

I have a data frame like this with thousands of entries, and I want to make box plots to check for outliers in my data.
HR      O2Sat   Temp    SBP     DBP     Resp
110.9   102.5   57.21   165.2   64.0    15.2
97.0    95.0    38.72   98.0    72.0    19.0
89.0    99.0    45.02   112.0   62.5    22.0
90.0    95.0    36.7    175.0   105.0   30.0
103.0   88.5    37.47   122.0   104.0   24.5
I am using the seaborn library to make the boxplots, but I have to write a separate line of code for each of the six columns, like this:
import seaborn as sns
sns.boxplot(y = 'HR', data = box_df_1)
sns.boxplot(y = 'O2Sat', data = box_df_1)
sns.boxplot(y = 'Temp', data = box_df_1)
sns.boxplot(y = 'SBP', data = box_df_1)
sns.boxplot(y = 'DBP', data = box_df_1)
sns.boxplot(y = 'Resp', data = box_df_1)
Can someone show me how to use a loop so that seaborn makes all the boxplots at once and I don't have to write a separate line of code for each column?
Regards,
Huzaifa

Create a list of the columns:
cols = ['HR', 'O2Sat', 'Temp', 'SBP', 'DBP', 'Resp']
Then iterate over the list:
import seaborn as sns
import matplotlib.pyplot as plt

for item in cols:
    sns.boxplot(x=box_df_1[item])
    plt.show()
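If you would rather see all six plots in a single figure instead of one window per column, a small variant with matplotlib subplots works too. This is just a sketch, assuming the same box_df_1 frame and cols list as above:

import seaborn as sns
import matplotlib.pyplot as plt

cols = ['HR', 'O2Sat', 'Temp', 'SBP', 'DBP', 'Resp']
fig, axes = plt.subplots(2, 3, figsize=(12, 6))  # one axis per column
for ax, col in zip(axes.flat, cols):
    sns.boxplot(y=box_df_1[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()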

Related

grouper day and cumsum speed

I have the following df:
I want to group this df on the first column (ID) and on the second column (key), and from there build a cumsum for each day. The cumsum should be on the last column (Speed).
I tried this with the following code:
import pandas as pd

df = pd.read_csv('df.csv')
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S')
df = df.sort_values(['ID', 'key'])
grouped = df.groupby(['ID', 'key'])
test = pd.DataFrame()
test2 = pd.DataFrame()
for name, group in grouped:
    test = group.groupby(pd.Grouper(key='Time', freq='1d'))['Speed'].cumsum()
    test = test.reset_index()
    test['ID'] = name[0]
    test['key'] = name[1]
    test2 = test2.append(test)
But the result seems off: there are more rows than the 5 I expect, i.e. one row per day with the cumsum for each ID and key.
Does anyone see the reason for my problem?
Thanks in advance.
Friendly reminder: it's useful to include a runnable example.
import pandas as pd

data = [{"cid": 33613, "key": 14855, "ts": 1550577600000, "value": 50.0},
        {"cid": 33613, "key": 14855, "ts": 1550579340000, "value": 50.0},
        {"cid": 33613, "key": 14855, "ts": 1550584800000, "value": 50.0},
        {"cid": 33613, "key": 14855, "ts": 1550682000000, "value": 50.0},
        {"cid": 33613, "key": 14855, "ts": 1550685900000, "value": 50.0},
        {"cid": 33613, "key": 14855, "ts": 1550773380000, "value": 50.0},
        {"cid": 33613, "key": 14855, "ts": 1550858400000, "value": 50.0},
        {"cid": 33613, "key": 14855, "ts": 1550941200000, "value": 25.0},
        {"cid": 33613, "key": 14855, "ts": 1550978400000, "value": 50.0}]
df = pd.DataFrame(data)
df['ts'] = pd.to_datetime(df['ts'], unit='ms')
I believe what you need can be accomplished as follows:
df.set_index('ts').groupby(['cid', 'key'])['value'].resample('D').sum().cumsum()
Result:
cid key ts
33613 14855 2019-02-19 150.0
2019-02-20 250.0
2019-02-21 300.0
2019-02-22 350.0
2019-02-23 375.0
2019-02-24 425.0
Name: value, dtype: float64
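One caveat: the trailing .cumsum() above runs over the whole resampled series, so with more than one (cid, key) group it would keep accumulating across group boundaries. A sketch of a per-group variant, assuming the same df as in the runnable example:

daily = df.set_index('ts').groupby(['cid', 'key'])['value'].resample('D').sum()
# restart the running total at each (cid, key) boundary
per_group = daily.groupby(level=['cid', 'key']).cumsum()
print(per_group)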

Pandas to mark both if cell value is a substring of another

I have a column with both short and full forms of people's names, and I want to unify them when one name is part of the other; e.g. for "James.J" and "James.Jones", I want to tag them both as "James.J".
import pandas as pd

data = {'Name': ["Amelia.Smith",
                 "Lucas.M",
                 "James.J",
                 "Elijah.Brown",
                 "Amelia.S",
                 "James.Jones",
                 "Benjamin.Johnson"]}
df = pd.DataFrame(data)
I can't figure out how to do it in pandas, so all I have is an xlrd approach that scores similarity with SequenceMatcher (and sorts manually in Excel):
import xlrd
from xlrd import open_workbook, cellname
import xlwt
from xlutils.copy import copy
from difflib import SequenceMatcher

workbook = xlrd.open_workbook("C:\\TEM\\input.xlsx")
old_sheet = workbook.sheet_by_name("Sheet1")
wb = copy(workbook)
sheet = wb.get_sheet(0)
for row_index in range(0, old_sheet.nrows):
    current = old_sheet.cell(row_index, 0).value
    previous = old_sheet.cell(row_index - 1, 0).value
    sro = SequenceMatcher(None, current.lower(), previous.lower(), autojunk=True).ratio()
    if sro > 0.7:
        sheet.write(row_index, 1, previous)
        sheet.write(row_index - 1, 1, previous)
wb.save("C:\\TEM\\output.xls")
What's the nice pandas way to do it? Thank you.
Using pandas, you can make use of str.split and .map with some boolean conditions to identify the dupes:
df1 = df['Name'].str.split('.', expand=True).rename(columns={0: 'FName', 1: 'LName'})

df2 = (df1.loc[df1['FName'].duplicated(keep=False)]
          .assign(ky=df['Name'].str.len())
          .sort_values('ky')
          .drop_duplicates(subset=['FName'], keep='first')
          .drop(columns='ky'))

df['NewName'] = df1['FName'].map(df2.assign(newName=df2.agg('.'.join, axis=1))
                                    .set_index('FName')['newName'])
print(df)
Name NewName
0 Amelia.Smith Amelia.S
1 Lucas.M NaN
2 James.J James.J
3 Elijah.Brown NaN
4 Amelia.S Amelia.S
5 James.Jones James.J
6 Benjamin.Johnson NaN
Here is an example of using apply with a custom function. For small dfs this should be fine, but it will not scale well for large dfs. A more sophisticated data structure for memo would be a reasonable place to start improving performance without degrading readability too much:
df = df.sort_values("Name")

def short_name(row, col="Name", memo=[]):
    # the mutable default argument deliberately persists across calls,
    # acting as a memo of names already seen
    name = row[col]
    for m_name in memo:
        if name.startswith(m_name):
            return m_name
    memo.append(name)
    return name

df["short_name"] = df.apply(short_name, axis=1)
df = df.sort_index()
output:
Name short_name
0 Amelia.Smith Amelia.S
1 Lucas.M Lucas.M
2 James.J James.J
3 Elijah.Brown Elijah.Brown
4 Amelia.S Amelia.S
5 James.Jones James.J
6 Benjamin.Johnson Benjamin.Johnson
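If performance ever becomes a concern, one possible refinement of the memo idea (a sketch, not the answer's exact method) is to sort the names first: lexicographic order places every extended name directly after its prefix, so a single pass tracking the current root is enough:

import pandas as pd

data = {'Name': ["Amelia.Smith", "Lucas.M", "James.J", "Elijah.Brown",
                 "Amelia.S", "James.Jones", "Benjamin.Johnson"]}
df = pd.DataFrame(data)

canonical = {}
root = None
for name in sorted(df['Name']):
    if root is not None and name.startswith(root):
        canonical[name] = root  # extension of the current root name
    else:
        root = name             # start a new prefix group
        canonical[name] = name
df['short_name'] = df['Name'].map(canonical)
print(df)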

Convert a dict to a DataFrame in pandas

I am using the following code:
import pandas as pd
from yahoofinancials import YahooFinancials
mutual_funds = ['PRLAX', 'QASGX', 'HISFX']
yahoo_financials_mutualfunds = YahooFinancials(mutual_funds)
daily_mutualfund_prices = yahoo_financials_mutualfunds.get_historical_price_data('2015-01-01', '2021-01-30', 'daily')
I get a dictionary as the output. I would like a pandas DataFrame with the columns date, PRLAX, QASGX, HISFX, where date holds the formatted_date and each ticker column holds its Open price.
What you can do is this:
df = pd.DataFrame({
    a: {x['formatted_date']: x['adjclose'] for x in daily_mutualfund_prices[a]['prices']}
    for a in mutual_funds
})
which gives:
PRLAX QASGX HISFX
2015-01-02 19.694817 17.877445 11.852874
2015-01-05 19.203604 17.606575 11.665626
2015-01-06 19.444574 17.316357 11.450289
2015-01-07 19.963596 17.616247 11.525190
2015-01-08 20.260176 18.003208 11.665626
... ... ... ...
2021-01-25 21.799999 33.700001 14.350000
2021-01-26 22.000000 33.139999 14.090000
2021-01-27 21.620001 32.000000 13.590000
2021-01-28 22.120001 32.360001 13.990000
2021-01-29 21.379999 31.709999 13.590000
[1530 rows x 3 columns]
or any other of the values in the dict.
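Since the question asked for the Open price rather than the adjusted close, swapping the dictionary key should be enough; assuming each entry in the prices list also carries an 'open' field alongside 'adjclose' (as yahoofinancials historical output does), a minimal variant looks like this:

# same comprehension, but pulling 'open' instead of 'adjclose'
df_open = pd.DataFrame({
    a: {x['formatted_date']: x['open'] for x in daily_mutualfund_prices[a]['prices']}
    for a in mutual_funds
})
df_open.index.name = 'date'  # label the index as requested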

Multi-index dataframe split and stack

When I download data from yfinance, I get 8 columns (Open, High, Low, etc.) per ticker. Since I am downloading 15 tickers, I end up with 120 columns plus an index column (date), all laid out side by side. See Image 1.
Instead of having that many columns across 2 levels, I want just the 8 unique columns, plus one new column that identifies the ticker. See Image 2.
Image 1: Current Form
Image 1 but in raw text:
Adj Close ... Volume
DANHOS13.MX FCFE18.MX FHIPO14.MX FIBRAHD15.MX FIBRAMQ12.MX FIBRAPL14.MX FIHO12.MX FINN13.MX FMTY14.MX FNOVA17.MX ... FIBRAPL14.MX FIHO12.MX FINN13.MX FMTY14.MX FNOVA17.MX FPLUS16.MX FSHOP13.MX FUNO11.MX FVIA16.MX TERRA13.MX
Date
2015-01-02 26.065336 NaN 18.526043 NaN 16.337654 18.520781 14.683501 11.301384 9.247743 NaN ... 338697 189552 148064 57 NaN NaN 212451 2649823 NaN 1111343
2015-01-05 24.670488 NaN 18.436762 NaN 15.857328 17.859756 13.795850 11.071105 9.209846 NaN ... 449555 364819 244594 19330 NaN NaN 491587 3317923 NaN 1255128
Image 2: Desired outcome
The code I'm applying is:
import datetime as dt
import yfinance as yf

start = dt.datetime(2015, 1, 1)
end = dt.datetime.now()
df = yf.download("FUNO11.MX FIBRAMQ12.MX FIHO12.MX DANHOS13.MX FINN13.MX FSHOP13.MX TERRA13.MX FMTY14.MX FIBRAPL14.MX FHIPO14.MX FIBRAHD15.MX FPLUS16.MX FVIA16.MX FNOVA17.MX FCFE18.MX",
                 start=start,
                 end=end,
                 group_by='Ticker',
                 actions=True)
I will download the data a little differently:
import pandas as pd
import yfinance as yf
from datetime import datetime as dt

start = dt(2015, 1, 1)
end = dt.now()
symbols = ["FUNO11.MX", "FIBRAMQ12.MX", "FIHO12.MX", "DANHOS13.MX", "FINN13.MX", "FSHOP13.MX", "TERRA13.MX", "FMTY14.MX",
           "FIBRAPL14.MX", "FHIPO14.MX", "FIBRAHD15.MX", "FPLUS16.MX", "FVIA16.MX", "FNOVA17.MX", "FCFE18.MX"]
data = yf.download(symbols, start=start, end=end, actions=True)
And then
Option 1:
def reshaper(symb, dframe):
    df = dframe.unstack().reset_index()
    df.columns = ['variable', 'symbol', 'Date', 'Value']
    df = (df.loc[df.symbol == symb, ['Date', 'variable', 'Value']]
            .pivot_table(index='Date', columns='variable', values='Value')
            .reset_index())
    df.columns.name = ''
    df['Ticker'] = symb
    return df

# DataFrame.append was removed in pandas 2.0; collect the pieces and concat instead
h = pd.concat([reshaper(s, data) for s in symbols], ignore_index=True)
h
Option 2: For a one-liner, you could do this:
data.stack().reset_index().rename(columns={'level_1':'Ticker'})
A slightly simpler version stacks both column index levels (measure and ticker) into the row index to get long-form tidy data, then unstacks the measure level back into columns, keeping ticker and date as the indices:
import yfinance as yf
symbols = ["FUNO11.MX", "FIBRAMQ12.MX", "FIHO12.MX", "DANHOS13.MX",
"FINN13.MX", "FSHOP13.MX", "TERRA13.MX", "FMTY14.MX",
"FIBRAPL14.MX", "FHIPO14.MX", "FIBRAHD15.MX", "FPLUS16.MX",
"FVIA16.MX", "FNOVA17.MX", "FCFE18.MX"]
data = yf.download(symbols, start='2015-01-01', end='2020-11-15', actions=True)
data_reshape = data.stack(level=[0, 1]).unstack(1)
data_reshape.index = data_reshape.index.set_names(['ticker'], level=[1])
data_reshape.head()
Adj Close Close Dividends High \
Date ticker
2015-01-02 DANHOS13.MX 26.065336 37.000000 0.0 37.400002
FHIPO14.MX 18.526043 24.900000 0.0 24.900000
FIBRAMQ12.MX 16.337654 24.490000 0.0 25.110001
FIBRAPL14.MX 18.520781 26.740801 0.0 27.118500
FIHO12.MX 14.683501 21.670000 0.0 22.190001
Low Open Stock Splits Volume
Date ticker
2015-01-02 DANHOS13.MX 36.330002 36.330002 0.0 82849.0
FHIPO14.MX 24.900000 24.900000 0.0 94007.0
FIBRAMQ12.MX 24.350000 24.990000 0.0 1172917.0
FIBRAPL14.MX 26.343100 26.750700 0.0 338697.0
FIHO12.MX 21.209999 22.120001 0.0 189552.0
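With the reshaped frame above, pulling out all rows for a single ticker is then a one-line cross-section (DANHOS13.MX used here purely as an example):

# cross-section on the 'ticker' level of the MultiIndex
danhos = data_reshape.xs('DANHOS13.MX', level='ticker')
danhos.head()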

Set color limits for matplotlib colormap

I made a function that returns the hex color code for each value of a data series, as follows:
from matplotlib import cm, colors
def get_color(series_data, cmap='Reds'):
    color_map = cm.get_cmap(cmap, 20)
    f = lambda x: colors.rgb2hex(color_map(x / series_data.max())[:3])
    return series_data.apply(f)
The cm.get_cmap(cmap, 20) call generates a matplotlib.colors.LinearSegmentedColormap object, and dividing by series_data.max() maps the data onto it, so the colors span from the low end of the colormap up to the series maximum.
What I cannot see is how to define the color limits for the data to be evaluated. For instance, what if I wanted constant color limits, with 0 as the minimum and 100 as the maximum? How could I do that within my function?
I tried replacing series_data.max() with 100 to control the color at the maximum, but I couldn't control the minimum (cmin).
The argument passed to color_map needs to be scaled to the [0, 1] range. For instance, if the minimum (maximum) color value should be reached at the value lo (hi):
from matplotlib import cm, colors
import numpy as np
import pandas as pd

def get_color(series_data, cmap='Reds', lo=None, hi=None):
    if lo is None:
        lo = series_data.min()
    if hi is None:
        hi = series_data.max()
    if lo == hi:
        raise Exception('Invalid range.')
    color_map = cm.get_cmap(cmap, 20)
    f = lambda x: colors.rgb2hex(color_map((x - lo) / (hi - lo))[:3])
    return series_data.apply(f)

s = pd.Series(np.linspace(0, 3, 16))
colz = get_color(s, lo=1, hi=2)
for x, c in zip(s, colz):
    print('{:.2f} {}'.format(x, c))
The sample output is
0.00 #fff5f0
0.20 #fff5f0
0.40 #fff5f0
0.60 #fff5f0
0.80 #fff5f0
1.00 #fff5f0
1.20 #fdc7b0
1.40 #fc8363
1.60 #ed392b
1.80 #af1117
2.00 #67000d
2.20 #67000d
2.40 #67000d
2.60 #67000d
2.80 #67000d
3.00 #67000d
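An alternative worth mentioning (not part of the answer above, just a sketch) is to let matplotlib do the scaling with matplotlib.colors.Normalize, which also clips out-of-range values for you:

from matplotlib import cm, colors
import numpy as np
import pandas as pd

def get_color(series_data, cmap='Reds', lo=0, hi=100):
    # Normalize maps [lo, hi] onto [0, 1]; clip=True pins values outside the range
    norm = colors.Normalize(vmin=lo, vmax=hi, clip=True)
    color_map = cm.get_cmap(cmap, 20)
    f = lambda x: colors.rgb2hex(color_map(float(norm(x)))[:3])
    return series_data.apply(f)

s = pd.Series(np.linspace(0, 3, 16))
print(get_color(s, lo=1, hi=2).tolist())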