Dataset
storeid,revenue,profit
101,779183,281257
101,144829,838451
101,766465,757565
101,353297,261071
101,1615461,275760
101,246731,949229
101,951518,301016
101,444669,430583
Code
import pandas as pd
import seaborn as sns
dummies = pd.read_csv('1.csv')
dummies.sort_values(by=['revenue'], inplace=True)
fea = dummies[['storeid']]
lab = dummies[['revenue']]
param = 'revenue'
qv1 = lab[param].quantile(0.25)
qv2 = lab[param].quantile(0.5)
qv3 = lab[param].quantile(0.75)
qv_limit = 1.5 * (qv3 - qv1)  # 1.5 * IQR
un_outliers_mask = (lab[param] > qv3 + qv_limit) | (lab[param] < qv1 - qv_limit)
un_outliers_data = lab[param][un_outliers_mask]
un_outliers_name = fea[un_outliers_mask]
un_outliers_data
# 41      54437
# 44      89269
# 40    1942989
# 6     1951518
dummies.boxplot(by='storeid', column=['revenue'], grid=False)
The un_outliers_data output contains both higher and lower outliers, but the plot displays only the higher ones.
Why is my graph only showing the higher outliers?
un_outliers_data holds the global outliers, i.e. the quartiles are computed over the complete data in the dummies dataframe. But your box plot groups the data by storeid and then calculates the median, quartiles, etc. for each subset separately.
You will see the required outliers (un_outliers_data) if you just do dummies['revenue'].plot(kind='box')
Example:
Consider the small dataset below:
store id,revenue
101, 10
102, 190
103, 200
104, 210
105, 300
It should be clear that revenue = 10 & 300 are outliers, but they are not outliers if you look at the data for store id 101 & 105 respectively.
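To see the difference directly, here is a minimal runnable sketch (the frame below recreates the toy data inline, with a hypothetical column name storeid) contrasting the grouped box plot with a single box plot over the whole column:
import pandas as pd
import matplotlib.pyplot as plt
toy = pd.DataFrame({'storeid': [101, 102, 103, 104, 105],
                    'revenue': [10, 190, 200, 210, 300]})
# Grouped: one box per store, each computed from a single point -> no outliers
toy.boxplot(by='storeid', column=['revenue'], grid=False)
# Global: one box over the whole column -> 10 and 300 appear as fliers
plt.figure()
toy['revenue'].plot(kind='box')
plt.show()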
Related
I'd like to calculate an average of a column using pandas based on different numbers in another column.
I have two columns, A and B. I'd like an extra column showing the average of B for rows where A falls in each range: >= 0 and < 20, >= 20 and < 40, >= 40 and < 60, >= 60 and < 80, >= 80 and < 100, and so on. 100 as a maximum is just an example; the ranges should extend up to the maximum of column A, which could be 20000.
I have tried using if statements, but that only works for a limited number of values. What if my maximum is 20000 and I want the average of B over ranges of width 5 in A?
Use cut + groupby.transform:
bins = [0, 20, 40, 60, 80, 101]
df['C'] = df['B'].groupby(pd.cut(df['A'], bins=bins, right=False)).transform('mean')
If you want to generate the bins programmatically:
import numpy as np
MAX = 100
STEP = 20
bins = np.arange(0, MAX+1, STEP)
bins[-1] += 1
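Putting it together, here is a minimal runnable sketch with a made-up df (the values are illustrative only); note that right=False makes each bin closed on the left, which matches the >= / < ranges in the question:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [3, 15, 22, 37, 45, 81],
                   'B': [10, 20, 30, 40, 50, 60]})
MAX, STEP = 100, 20
bins = np.arange(0, MAX + 1, STEP)
bins[-1] += 1  # let the last bin include MAX itself
# each row gets the mean of B over all rows whose A falls in the same bin
df['C'] = df['B'].groupby(pd.cut(df['A'], bins=bins, right=False)).transform('mean')
print(df['C'].tolist())  # [15.0, 15.0, 35.0, 35.0, 50.0, 60.0]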
I have a data frame with, among other things, a user id and an age. I need to produce a bar chart of the number of users that fall within ranges of ages. What's throwing me is that there is really no upper bound for the age range. The specific ranges I'm trying to plot are age <= 25, 25 < age <= 75 and age > 75.
I'm relatively new to Pandas and plotting, and I'm sure this is a simple thing for more experienced data wranglers. Any assistance would be greatly appreciated.
You'll need to use the pandas.cut method to do this, and you can supply custom bins and labels!
from pandas import DataFrame, cut
from numpy.random import default_rng
from numpy import arange
from matplotlib.pyplot import show
# Make some dummy data
rng = default_rng(0)
df = DataFrame({'id': arange(100), 'age': rng.normal(50, scale=20, size=100).clip(min=0)})
print(df.head())
   id        age
0   0  52.514604
1   1  47.357903
2   2  62.808453
3   3  52.098002
4   4  39.286613
# Use pandas.cut to bin all of the ages & assign
# these bins to a new column to demonstrate how it works
## bins are (0-25], (25-75], (75-inf]; right edges are included by default
df['bin'] = cut(df['age'], [0, 25, 75, float('inf')], labels=['under 25', '25 up to 75', '75 or older'])
print(df.head())
   id        age          bin
0   0  52.514604  25 up to 75
1   1  47.357903  25 up to 75
2   2  62.808453  25 up to 75
3   3  52.098002  25 up to 75
4   4  39.286613  25 up to 75
# Get the value_counts of those bins and plot!
df['bin'].value_counts().sort_index().plot.bar()
show()
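One boundary detail worth a quick check: with pandas' default right=True, each bin is closed on the right, so an age of exactly 25 lands in the first bucket, which matches the requested age <= 25. A small sketch:
from pandas import cut
edges = [0, 25, 75, float('inf')]
labels = ['under 25', '25 up to 75', '75 or older']
print(list(cut([25.0, 25.1, 75.0, 76.0], edges, labels=labels)))
# ['under 25', '25 up to 75', '25 up to 75', '75 or older']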
When using boxplot from matplotlib.pyplot, the quartile values are calculated including the median. Can this be changed to NOT include the median?
For example, consider the ordered data set
2, 3, 4, 5, 6, 7, 8
If the median is NOT included, then Q1=3 and Q3=7. However, boxplot includes the median value, i.e. 5, and generates the figure below
Is it possible to change this behavior and NOT include the median in the calculation of the quartiles? This should correspond to Method 1 as described on the Wikipedia page Quartile. The code to generate the figure is listed below.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
data = [2, 3, 4, 5, 6, 7, 8]
fig = plt.figure(figsize=(6,1))
ax = fig.add_axes([0.1,0.25,0.8,0.8])
bp = ax.boxplot(data, '',
                vert=False,
                positions=[0.4],
                widths=[0.3])
ax.set_xlim([0,9])
ax.set_ylim([0,1])
ax.xaxis.set_major_locator(MultipleLocator(1))
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.yaxis.set_ticks([])
ax.grid(which='major',axis='x',lw=0.1)
plt.show()
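For reference, the quartiles the two methods produce for this data set can be checked quickly; a sketch using numpy, whose default percentile interpolation agrees with what boxplot draws:
import numpy as np
data = [2, 3, 4, 5, 6, 7, 8]
print(np.percentile(data, [25, 75]))             # [3.5 6.5] -> Method 2 (boxplot)
print(np.median(data[:3]), np.median(data[4:]))  # 3.0 7.0   -> Method 1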
The question is motivated by the fact that several educational resources around the internet do not calculate the quartiles the same way as matplotlib's boxplot does by default. For example, in the online course "Statistics and probability" from Khan Academy, the quartiles are calculated as described in Method 1 on the Wikipedia page Quartile, while boxplot employs Method 2.
Consider an example from Khan Academy's course "Statistics and probability", section "Comparing range and interquartile range (IQR)". The daily high temperatures recorded in Paradise, MI over 7 days were 16, 24, 26, 26, 26, 27, and 28 degrees Celsius. Describe the data with a boxplot and calculate the IQR.
The result of using the default settings in boxplot and that presented by Prof. Khan are very different; see the figure below.
The IQR found by matplotlib is 1.5, while that calculated by Prof. Khan is 3.
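Both numbers can be reproduced with a quick sketch (Method 1 computed by hand around the median at index 3 of the sorted data):
import numpy as np
temps = [16, 24, 26, 26, 26, 27, 28]
q1, q3 = np.percentile(temps, [25, 75])
print(q3 - q1)                                      # 1.5 (matplotlib, Method 2)
print(np.median(temps[4:]) - np.median(temps[:3]))  # 3.0 (Prof. Khan, Method 1)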
As pointed out in the comments by @JohanC, boxplot cannot directly be configured to follow Method 1; it requires a customized function. Therefore, neglecting the calculation of outliers, I updated the code to calculate the quartiles according to Method 1 and thus be comparable with the Khan Academy course. The code is listed below; it is not very pythonic, so suggestions are welcome.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
from matplotlib.ticker import MultipleLocator
def median(x):
    """
    x - input: a list of numbers
    Returns the midpoint number: for a list with an odd
    number of elements, e.g. [1, 2, 3, 4, 5], returns 3;
    for a list with an even number of elements, the mean of the
    two middle values is returned, e.g. [1, 2, 3, 4] returns 2.5.
    """
    if len(x) & 1:
        # Odd number of elements in the list, e.g. x = [1, 2, 3] returns 2
        index_middle = (len(x) - 1) // 2
        median = x[index_middle]
    else:
        # Even number of elements in the list, e.g. x = [-1, 2] returns 0.5
        index_lower = len(x) // 2 - 1
        index_upper = len(x) // 2
        median = (x[index_lower] + x[index_upper]) / 2
    return median
def method_1_quartiles(x):
    """
    x - a list of numbers
    Returns Q1, Q2, Q3 according to Method 1: the median is
    excluded from both halves when the list length is odd.
    """
    x.sort()
    N = len(x)
    if N & 1:
        # Odd number of elements
        index_middle = (N - 1) // 2
        lower = x[0:index_middle]      # up to but not including the median
        upper = x[index_middle + 1:]   # everything after the median
        Q1 = median(lower)
        Q2 = x[index_middle]
        Q3 = median(upper)
    else:
        # Even number of elements
        index_lower = N // 2
        lower = x[0:index_lower]
        upper = x[index_lower:]
        Q1 = median(lower)
        Q2 = (x[index_lower - 1] + x[index_lower]) / 2
        Q3 = median(upper)
    return Q1, Q2, Q3
data = [16, 24, 26, 26, 26, 27, 28]
fig = plt.figure(figsize=(6, 1))
ax = fig.add_axes([0.1, 0.25, 0.8, 0.8])
# Start from matplotlib's default (Method 2) stats
stats = cbook.boxplot_stats(data)[0]
Q1_default = stats['q1']
Q3_default = stats['q3']
# Extend the whiskers to the data extremes (outliers are neglected)
stats['whislo'] = min(data)
stats['whishi'] = max(data)
IQR_default = Q3_default - Q1_default
# Overwrite the quartiles with the Method 1 values
Q1, Q2, Q3 = method_1_quartiles(data)
IQR = Q3 - Q1
stats['q1'] = Q1
stats['q3'] = Q3
print(f"IQR: {IQR}")
ax.bxp([stats], vert=False, manage_ticks=False, widths=[0.3],
       positions=[0.4], showfliers=False)
ax.set_xlim([15,30])
ax.set_ylim([0,1])
ax.xaxis.set_major_locator(MultipleLocator(1))
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.yaxis.set_ticks([])
ax.grid(which='major',axis='x',lw=0.1)
plt.show()
The graph generated is shown below.
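As for the request for suggestions: a more compact Method 1 helper could be built on numpy's median. This is a hypothetical alternative sketch, not part of the tested code above:
import numpy as np
def method_1_quartiles_np(x):
    # Method 1: exclude the median from both halves when the length is odd
    x = np.sort(np.asarray(x))
    n = len(x)
    half = n // 2
    lower = x[:half]
    upper = x[half + 1:] if n % 2 else x[half:]
    return np.median(lower), np.median(x), np.median(upper)
print(method_1_quartiles_np([16, 24, 26, 26, 26, 27, 28]))  # (24.0, 26.0, 27.0)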
I am trying to create a function for stratified sampling which takes in a dataframe created using the faker module along with strata, sample size and a random seed. For the sample size, I want the number of samples in each strata to vary based on user input. This is my code for creating the data:
import pandas as pd
import numpy as np
import random as rn  # generating random numbers
from faker import Faker
fake = Faker()
frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": fake.random_number(1)} for x in range(100)])
# check for and remove duplicates from enum area (should be unique)
# before any further analysis
mask = frame_fake.duplicated('enum_area', keep='last')
duplicates = frame_fake[mask]
# print(duplicates)
# drop all except last
frame_fake = frame_fake.drop_duplicates('enum_area', keep='last').sort_values(by='enum_area', ascending=True)
# reset index to have them sequentially after sorting by enum_area and
# drop the old index column
frame_fake = frame_fake.reset_index().drop('index',axis=1)
frame_fake
This is the code for sampling:
def stratified_custom(data, strata, sample_size, seed=None):
    # we sample a given number of enum areas in each stratum/region;
    # groupby strata and use the transform method with 'count' to get
    # the strata sizes
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # map the input sample size to each stratum
    data['strat_sample_size'] = data[strata].map(sample_size)
    # groupby strata, get the sample size per stratum, cast to int and
    # reset the index
    smp_size = data.groupby(strata)['strat_sample_size'].unique().astype(int).reset_index()
    # groupby strata and select a sample per stratum based on the sample
    # size for that stratum
    sample = (data.groupby(strata, group_keys=False)
                  .apply(lambda x: x.sample(smp_size, random_state=seed)))
    # probability of inclusion
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample
# pass in strata and sample sizes as a dict (key: stratum, value: size)
s_size = {1: 7, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 8}
(stratified_custom(data=frame_fake, strata='region', sample_size=s_size,
                   seed=99).sort_values(by=['region', 'enum_area'], ascending=True))
I however receive this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
I can't figure out what this error is talking about. Any help is appreciated.
After much research, I stumbled upon this post https://stackoverflow.com/a/58794577/14198137 and implemented it in my code to sample not only with varying sample sizes but also with fixed ones, using the same function. Here is my code for the data:
import pandas as pd
import numpy as np
import random as rn
from faker import Faker
Faker.seed(99)
fake = Faker()
frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": rn.randint(1, 2)} for x in range(100)])
frame_fake = frame_fake.drop_duplicates('enum_area',keep='last').sort_values(by='enum_area',ascending=True)
frame_fake = frame_fake.reset_index().drop('index',axis=1)
Here is the updated code for stratified sampling which now works.
def stratified_custom(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    try:
        # sample_size is a dict of per-stratum sizes
        data['strat_sample_size'] = data[strata].map(sample_size)
        smp_size = data.set_index(strata)['strat_sample_size'].to_dict()
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(smp_size[x.name], random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
    except TypeError:
        # sample_size is a fixed integer (Series.map raises TypeError on it)
        data['strat_sample_size'] = sample_size
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(sample_size, random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
s_size = {1: 3, 2: 9, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 8}
variablesize = (stratified_custom(data=frame_fake,strata='region',sample_size=s_size, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
variablesize
fixedsize = (stratified_custom(data=frame_fake,strata='region',sample_size=3, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
fixedsize
The output of variable sample size:
    region  district  enum_area  ...  strat_size  strat_sample_size  inclusion_prob
5        1        60      14737  ...           5                  3             0.6
26       1        42      34017  ...           5                  3             0.6
68       1        31      72092  ...           5                  3             0.6
0        2        65      10566  ...          10                  9             0.9
15       2        22      25560  ...          10                  9             0.9
The output of fixed sample size:
    region  district  enum_area  ...  strat_size  strat_sample_size  inclusion_prob
5        1        60      14737  ...           5                  3             0.6
26       1        42      34017  ...           5                  3             0.6
68       1        31      72092  ...           5                  3             0.6
38       2        74      48408  ...          10                  3             0.3
43       2        15      56365  ...          10                  3             0.3
I was wondering, however, whether there is a better way of achieving this?
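One possible simplification, sketched below (untested against the full pipeline): dispatch explicitly on the type of sample_size instead of relying on try/except, so the dict and fixed-integer cases share one code path:
def stratified_custom2(data, strata, sample_size, seed=None):
    # accept either an int (fixed size) or a dict of per-stratum sizes
    data = data.copy()
    sizes = (sample_size if isinstance(sample_size, dict)
             else {name: sample_size for name in data[strata].unique()})
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    data['strat_sample_size'] = data[strata].map(sizes)
    sample = (data.groupby(strata, group_keys=False)
                  .apply(lambda x: x.sample(sizes[x.name], random_state=seed)))
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample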
I am trying to bin values from a timeseries (hourly and subhourly temperature values) within a time window.
That is, from original hourly values, I'd like to extract binned values on a daily, weekly or monthly basis.
I have tried to combine groupby+TimeGrouper(" ") with pd.cut, with poor results.
I came across a nice function from this tutorial, which suggests mapping the data (associating each value with its bin range in a new column) and then grouping according to that.
import numpy as np

def map_bin(x, bins):
    kwargs = {}
    if x == max(bins):
        kwargs['right'] = True
    bin = bins[np.digitize([x], bins, **kwargs)[0]]
    bin_lower = bins[np.digitize([x], bins, **kwargs)[0] - 1]
    return '[{0}-{1}]'.format(bin_lower, bin)

df['Binned'] = df['temp'].apply(map_bin, bins=freq_bins)
However, applying this function results in an IndexError: index n is out of bounds for axis 0 with size n. (This likely means some values fall beyond the last bin edge: np.digitize then returns len(bins), which is out of range as an index into bins.)
Ideally, I'd like to make this work and apply it to achieve a double grouping at the same time: one by bins and one by timegrouper.
Update:
It appears that my earlier attempt was causing problems because of the double-indexed columns. I have simplified to something that seems to work much better.
import pandas as pd
import numpy as np
xaxis = np.linspace(0, 50)
temps = pd.Series(data=xaxis, name='temps')
times = pd.date_range(start='2015-07-15', periods=50, freq='6H')
temps.index = times
bins = [0, 10, 20, 30, 40, 50]
temps.resample('W').agg(lambda series: pd.value_counts(pd.cut(series, bins), sort=False)).unstack()
This outputs:
            (0, 10]  (10, 20]  (20, 30]  (30, 40]  (40, 50]
2015-07-19        9        10         0         0         0
2015-07-26        0         0        10        10         8
2015-08-02        0         0         0         0         2
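For the double grouping mentioned at the start (bins by time period in one step), pd.crosstab can likely produce the same table directly; a sketch assuming the temps series and bins defined above:
import pandas as pd
# rows: weekly periods; columns: temperature bins; cells: counts
print(pd.crosstab(temps.index.to_period('W'), pd.cut(temps, bins)))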