How can I create a bar chart with ranges of values - pandas

I have a data frame with, among other things, a user id and an age. I need to produce a bar chart of the number of users that fall within ranges of ages. What's throwing me is that there is really no upper bound for the age range. The specific ranges I'm trying to plot are age <= 25, 25 < age <= 75, and age > 75.
I'm relatively new to Pandas and plotting, and I'm sure this is a simple thing for more experienced data wranglers. Any assistance would be greatly appreciated.

You'll need to use the pandas.cut function to do this, and you can supply custom bins and labels!
from pandas import DataFrame, cut
from numpy.random import default_rng
from numpy import arange
from matplotlib.pyplot import show
# Make some dummy data
rng = default_rng(0)
df = DataFrame({'id': arange(100), 'age': rng.normal(50, scale=20, size=100).clip(min=0)})
print(df.head())
   id        age
0   0  52.514604
1   1  47.357903
2   2  62.808453
3   3  52.098002
4   4  39.286613
# Use pandas.cut to bin all of the ages & assign
# these bins to a new column to demonstrate how it works
## bins are (0, 25], (25, 75], (75, inf), matching age <= 25, 25 < age <= 75 and age > 75
df['bin'] = cut(df['age'], [0, 25, 75, float('inf')], labels=['25 or under', '25 up to 75', 'over 75'])
print(df.head())
   id        age          bin
0   0  52.514604  25 up to 75
1   1  47.357903  25 up to 75
2   2  62.808453  25 up to 75
3   3  52.098002  25 up to 75
4   4  39.286613  25 up to 75
# Get the value_counts of those bins and plot!
df['bin'].value_counts().sort_index().plot.bar()
show()
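One caveat: because the bins are right-closed, an age of exactly 0 falls outside (0, 25] and comes back as NaN. If you want those rows counted in the first bin, a small tweak is cut's include_lowest parameter (a sketch, using the same data as above):
# Same binning as before, but the first interval is closed on both ends,
# i.e. [0, 25], so ages of exactly 0 are kept rather than dropped as NaN.
df['bin'] = cut(df['age'], [0, 25, 75, float('inf')],
                labels=['25 or under', '25 up to 75', 'over 75'],
                include_lowest=True)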

Related

Pandas split ages by group

I'm quite new with pandas and need a bit of help. I have a column with ages and need to make these groups:
Young people: age ≤ 30
Middle-aged people: 30 < age ≤ 60
Old people: 60 < age
Here is the code, but it gives me an error:
def get_num_people_by_age_category(dataframe):
    young, middle_aged, old = (0, 0, 0)
    dataframe["age"] = pd.cut(x=dataframe['age'], bins=[30, 31, 60, 61], labels=["young", "middle_aged", "old"])
    return young, middle_aged, old
ages = get_num_people_by_age_category(dataframe)
print(dataframe)
The code below gets the age groups using pd.cut().
# Import libraries
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'age': [1, 20, 30, 31, 50, 60, 61, 80, 90]  # np.random.randint(1,100,50)
})
# Function: Copy-pasted from question and modified
def get_num_people_by_age_category(df):
    df["age_group"] = pd.cut(x=df['age'], bins=[0, 30, 60, 100], labels=["young", "middle_aged", "old"])
    return df
# Call function
df = get_num_people_by_age_category(df)
Output
print(df)
   age    age_group
0    1        young
1   20        young
2   30        young
3   31  middle_aged
4   50  middle_aged
5   60  middle_aged
6   61          old
7   80          old
8   90          old
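The question's function is also meant to return the three counts rather than the dataframe. One possible sketch, building on the age_group column created above (note the question's "old" group has no upper bound, so you may prefer float('inf') to 100 as the last bin edge):
# value_counts(sort=False) keeps the category (bin) order,
# so the labels come back in the order they were defined.
counts = df["age_group"].value_counts(sort=False)
young, middle_aged, old = counts["young"], counts["middle_aged"], counts["old"]
print(young, middle_aged, old)
# 3 3 3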

Why are my lower outliers not showing in the box plot

Dataset
storeid,revenue,profit
101,779183,281257
101,144829,838451
101,766465,757565
101,353297,261071
101,1615461,275760
101,246731,949229
101,951518,301016
101,444669,430583
Code
import pandas as pd
import seaborn as sns
dummies = pd.read_csv('1.csv')
dummies.sort_values(by=['revenue'], inplace=True)
fea = dummies[['storeid']]
lab = dummies[['revenue']]
param = 'revenue'
qv1 = lab[param].quantile(0.25)
qv2 = lab[param].quantile(0.5)
qv3 = lab[param].quantile(0.75)
qv_limit = 1.5 * (qv3 - qv1)
un_outliers_mask = (lab[param] > qv3 + qv_limit) | (lab[param] < qv1 - qv_limit)
un_outliers_data = lab[param][un_outliers_mask]
un_outliers_name = fea[un_outliers_mask]
un_outliers_data
# 41      54437
# 44      89269
# 40    1942989
# 6     1951518
dummies.boxplot(by='storeid', column=['revenue'], grid=False)
The output of un_outliers_data contains both the higher and the lower outliers, but in the plot only the higher ones are displayed.
Why is my graph only displaying the higher outliers?
un_outliers_data holds the global outliers, i.e. the quartiles are computed over the complete data in the dummies dataframe. But your box plot first groups the data by storeid and then calculates the median, percentiles, etc. separately for each subset of the data.
You will see the expected outliers (un_outliers_data) if you plot a single box over all the data: dummies['revenue'].plot(kind='box')
Example:
Consider the below small dataset:
store id,revenue
101, 10
102, 190
103, 200
104, 210
105, 300
It should be clear that revenue = 10 and 300 are outliers for the dataset as a whole, but they are not outliers if you look at the data for store ids 101 and 105 respectively.
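A minimal sketch reproducing this with the small dataset above (the DataFrame and column names here just restate the illustrative data):
import pandas as pd
small = pd.DataFrame({'storeid': [101, 102, 103, 104, 105],
                      'revenue': [10, 190, 200, 210, 300]})
# One box over all revenues: 10 and 300 show up as outliers.
small['revenue'].plot(kind='box')
# One box per store: each group holds a single value, so no outliers appear.
small.boxplot(by='storeid', column=['revenue'], grid=False)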

Add column of 0.7 quantile based on groupby

I have a df with a date index and a column called scores. I want to keep the df as it is but add a column that gives the 0.7 quantile of the scores for that day. The quantile method needs to be midpoint, and the result should be rounded to the nearest whole number.
I've outlined one approach you could take, below.
Note that to round a value to the nearest whole number you should use Python's built-in round() function. See round() in the Python documentation for details.
import pandas as pd
import numpy as np
# set random seed for reproducibility
np.random.seed(748)
# initialize base example dataframe
df = pd.DataFrame({"date":np.arange(10),
"score":np.random.uniform(size=10)})
duplicate_dates = np.random.choice(df.index, 5)
df_dup = pd.DataFrame({"date":np.random.choice(df.index, 5),
"score":np.random.uniform(size=5)})
# finish compiling example data
df = pd.concat([df, df_dup], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
# calculate 0.7 quantile result with specified parameters
result = df.groupby("date").quantile(q=0.7, interpolation='midpoint')
# print resulting dataframe
# contains one unique 0.7 quantile value per date
print(result)
"""
0.7 score
date
0 0.585087
1 0.476404
2 0.426252
3 0.363376
4 0.165013
5 0.927199
6 0.575510
7 0.576636
8 0.831572
9 0.932183
"""
# to apply the resulting quantile information to
# a new column in our original dataframe `df`
# we can apply a dictionary to our "date" column
# create dictionary
mapping = result.to_dict()["score"]
# apply to `df` to produce desired new column
df["quantile_0.7"] = [mapping[x] for x in df["date"]]
print(df)
"""
date score quantile_0.7
0 0 0.920895 0.585087
1 1 0.476404 0.476404
2 2 0.380771 0.426252
3 3 0.363376 0.363376
4 4 0.165013 0.165013
5 5 0.927199 0.927199
6 6 0.340008 0.575510
7 7 0.695818 0.576636
8 8 0.831572 0.831572
9 9 0.932183 0.932183
10 7 0.457455 0.576636
11 6 0.650666 0.575510
12 6 0.500353 0.575510
13 0 0.249280 0.585087
14 2 0.471733 0.426252
"""

Double grouping data by bins AND time with pandas

I am trying to bin values from a timeseries (hourly and subhourly temperature values) within a time window.
That is, from the original hourly values, I'd like to extract binned values on a daily, weekly or monthly basis.
I have tried to combine groupby + TimeGrouper(" ") with pd.cut, with poor results.
I came across a nice function from this tutorial, which suggests mapping the data (associating each value with its bin range in a new column) and then grouping by that.
def map_bin(x, bins):
    kwargs = {}
    if x == max(bins):
        kwargs['right'] = True
    bin = bins[np.digitize([x], bins, **kwargs)[0]]
    bin_lower = bins[np.digitize([x], bins, **kwargs)[0] - 1]
    return '[{0}-{1}]'.format(bin_lower, bin)

df['Binned'] = df['temp'].apply(map_bin, bins=freq_bins)
However, applying this function results in an IndexError: index n is out of bounds for axis 0 with size n (np.digitize returns len(bins) for values above the last bin edge, which then indexes past the end of bins).
Ideally, I'd like to make this work and apply it to achieve a double grouping at the same time: one by bins and one by the time grouper.
Update:
It appears that my earlier attempt was causing problems because of the double-indexed columns. I have simplified it to something that seems to work much better.
import pandas as pd
import numpy as np
xaxis = np.linspace(0, 50)
temps = pd.Series(data=xaxis, name='temps')
times = pd.date_range(start='2015-07-15', periods=50, freq='6H')
temps.index = times
bins = [0, 10, 20, 30, 40, 50]
temps.resample('W').agg(lambda s: pd.cut(s, bins).value_counts(sort=False)).unstack()
This outputs:
            (0, 10]  (10, 20]  (20, 30]  (30, 40]  (40, 50]
2015-07-19        9        10         0         0         0
2015-07-26        0         0        10        10         8
2015-08-02        0         0         0         0         2
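The same pattern covers the other windows mentioned in the question; only the resample frequency changes (a sketch reusing the temps Series and bins defined above):
# 'D' (daily) and 'MS' (month start) are standard pandas offset aliases.
daily = temps.resample('D').agg(lambda s: pd.cut(s, bins).value_counts(sort=False)).unstack()
monthly = temps.resample('MS').agg(lambda s: pd.cut(s, bins).value_counts(sort=False)).unstack()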

How to efficiently remove duplicate rows from a DataFrame

I'm dealing with a very large Data Frame and I'm using pandas to do the analysis.
The data frame is structured as follows
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
   Source  Target  Weight
0       0   25846       1
1       0    1916       1
2   25846       0       1
3       0    4748       1
4       0   16856       1
The issue is that I want to remove all the "duplicates", in the sense that if I already have a row with a given Source and Target, I do not want the reversed pair to appear on another row.
For instance, rows 0 and 2 are "duplicates" in this sense and only one of them should be retained.
A simple way to get rid of all the "duplicates" is
for index, row in df.iterrows():
    df = df[~((df.Source == row.Target) & (df.Target == row.Source))]
However, this approach is horribly slow since my data frame has about 3 million rows. Do you think there's a better way of doing this?
Create two temp columns to hold minimum(df.Source, df.Target) and maximum(df.Source, df.Target), and then flag repeated pairs with the duplicated() method:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 5, (20, 2)), columns=["Source", "Target"])
df["T1"] = np.minimum(df.Source, df.Target)
df["T2"] = np.maximum(df.Source, df.Target)
df[~df[["T1", "T2"]].duplicated()]
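As a small follow-up, the helper columns can be dropped again once the filter has been applied:
# Keep the first of each unordered (Source, Target) pair,
# then discard the temporary T1/T2 columns.
dedup = df[~df[["T1", "T2"]].duplicated()].drop(columns=["T1", "T2"])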
No need (as usual) to use a loop with a dataframe. Use the Series.isin method:
So start with this:
import pandas

df = pandas.DataFrame({
    'src': [0, 0, 25, 0, 0],
    'tgt': [25, 12, 0, 85, 363]
})
print(df)
   src  tgt
0    0   25
1    0   12
2   25    0
3    0   85
4    0  363
Then select all of the rows where src does not appear in tgt (and vice versa):
df[~(df['src'].isin(df['tgt']) & df['tgt'].isin(df['src']))]
   src  tgt
1    0   12
3    0   85
4    0  363
Your Source and Target values appear to be mutually exclusive (i.e. you can have one, but not both). Why not add them together (e.g. 25846 + 0) to get a unique identifier? You can then delete the unneeded Target column (reducing memory) and drop the duplicates. In the event your weights are not the same, the first one is kept by default.
df.Source += df.Target
df.drop('Target', axis=1, inplace=True)
df.drop_duplicates(inplace=True)
>>> df
   Source  Weight
0   25846       1
1    1916       1
3    4748       1
4   16856       1
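If you want to verify that mutual-exclusivity assumption before relying on it, a quick guard (a sketch, to be run before the transformation above):
# Every row should have a zero in Source or Target for the trick to be safe.
assert ((df.Source == 0) | (df.Target == 0)).all()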