DataFrame: split a column into separate columns with averages by date - pandas

There is a dataframe:
import pandas as pd

d = {'date': ['2020-02-01', '2020-02-01', '2020-02-01', '2020-02-01', '2020-02-02', '2020-02-02', '2020-02-02'],
     'type': ['Bird', 'Dog', 'Cat', 'Bird', 'Dog', 'Cat', 'Bird'],
     'weight': [1, 2, 3, 4, 5, 6, 7]}
df = pd.DataFrame(d)
I would like to split the "type" column by its values and get the columns Bird, Dog, Cat. The values in these columns should be the average weight of the birds, dogs, etc. on the same date.
To get something like this:
date        bird  dog  cat
2020-02-01  ...   ...  ...
2020-02-02  ...   ...  ...
I started trying groupby but can't figure it out. Maybe split the dataframe by the values in the "type" column and merge the resulting dataframes again?

Use pivot_table with a mean aggregation to combine values that share the same index/column:
out = df.pivot_table(index='date', columns='type', values='weight', aggfunc='mean') \
.rename_axis(columns=None).reset_index()
print(out)
# Output:
         date  Bird  Cat  Dog
0  2020-02-01   2.5  3.0  2.0
1  2020-02-02   7.0  6.0  5.0
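Since you started with groupby: an equivalent route (just a sketch of the same aggregation, expressed with groupby + unstack instead of pivot_table) is:
out = (df.groupby(['date', 'type'])['weight'].mean()  # average weight per date/type
         .unstack()                                    # spread the types into columns
         .rename_axis(columns=None)
         .reset_index())
Both produce the same table here; pivot_table is essentially a convenience wrapper around this groupby + unstack pattern.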

Related

Python: plotting a combination of observations taken on the same day

I have the following dataset:
id  test  date
1   A     2000-01-01
1   B     2000-01-01
1   C     2000-01-08
2   A     2000-01-01
2   A     2000-01-01
2   B     2000-01-08
3   A     2000-01-01
3   C     2000-01-01
3   B     2000-01-08
4   A     2000-01-01
4   B     2000-01-01
4   C     2000-01-01
5   A     2000-01-01
5   B     2000-01-01
5   C     2000-01-01
I would love to create a matrix figure with the count of how many individuals got a test taken on the same day.
For example: for one individual (id=1) tests A and B were taken on the same day; for one individual (id=3) tests A and C were taken on the same day; and for two individuals (id=4 and 5) all three tests were taken on the same day.
So far I am doing the following:
df_tests = df.groupby(['id', 'date']).value_counts().reset_index(name='count')
df_tests_unique = df_tests[df_tests.duplicated(subset=['id', 'date'], keep=False)]
df_tests_unique = df_tests_unique[["id", "date", "test"]]
So the only thing left is to count the number of times the different tests occur within the same date.
Thanks for the fun exercise :) Below is a possible solution. I created a numpy array and plotted it with seaborn. Note that it is rather hardcoded for the case where there are only tests A, B, C, but I'm sure you will be able to generalize it. Also, seaborn's default colour scheme is the opposite of what you intended, but that is easily fixable as well. Hope this helps!
(The resulting plot from the script, a seaborn heatmap, is not reproduced here.)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    'test': ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
    'date': ['2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01', '2000-01-01', '2000-01-08',
             '2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01', '2000-01-01', '2000-01-01',
             '2000-01-01', '2000-01-01', '2000-01-01']
})

# count tests per (id, date)
df_tests = df.groupby(['id', 'date']).value_counts().reset_index(name='count')

# keep only (id, date) pairs with more than one test, concatenate the test letters,
# then turn each pattern (e.g. 'AB') into a 0/1 vector over ['A', 'B', 'C']
df_test_with_patterns = (
    df_tests[df_tests.duplicated(subset=['id', 'date'], keep=False)]
    .groupby(['id', 'date'])
    .agg({'test': 'sum'})
    .reset_index().groupby('test').count().reset_index()
    .assign(pattern=lambda df: df.test.apply(lambda tst: [1 if x in tst else 0 for x in ['A', 'B', 'C']]))
)

pattern_mat = np.vstack(df_test_with_patterns.pattern.values.tolist())
ax = sns.heatmap(pattern_mat, xticklabels=['A', 'B', 'C'], yticklabels=df_test_with_patterns.id.values)
ax.set(xlabel='Test Type', ylabel='# of individuals that took the tests on a single day')
plt.show()
Building on Erap's answer, this works too, and may be slightly faster:
out = pd.get_dummies(df.set_index(['date', 'id'], drop=True).sort_index()).groupby(level=[0,1]).sum()
and then iterate through the different dates to get the different charts
for i in out.index.levels[0]:
    d = out.loc[i]
    plt.figure()
    plt.title(f'test for date {i}')
    sns.heatmap(d.gt(0))

Generating a separate column that stores weighted average per group

It shouldn't be that hard, but I can't get past this problem.
Imagine I have a long-format dataframe with some data and want to calculate the weighted average of the score per person, weighted by the manager weight, and keep it as a separate variable, 'w_mean_m'.
df['w_mean_m'] = df.groupby('person')['score'].transform(lambda x: np.average(x['score'], weights=x['manager_weight']))
throws an error and I have no idea how to fix it.
Because GroupBy.transform works on each column separately, it cannot select multiple columns at once, so use GroupBy.apply with Series.map for the new column instead:
s = (df.groupby('contact')
       .apply(lambda x: np.average(x['score'], weights=x['manager_weight'])))
df['w_mean_m'] = df['contact'].map(s)
One hack is possible with GroupBy.transform: select the weights by each group's (unique) index:
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights=df.loc[x.index, "manager_weight"])
df['w_mean_m1'] = df.groupby('contact')['score'].transform(f)
print (df)
    manager_weight  score contact  w_mean_m1
0              1.0      1       a   1.282609
1              1.1      1       a   1.282609
2              1.2      1       a   1.282609
3              1.3      2       a   1.282609
4              1.4      2       b   2.355556
5              1.5      2       b   2.355556
6              1.6      3       b   2.355556
7              1.7      3       c   3.770270
8              1.8      4       c   3.770270
9              1.9      4       c   3.770270
10             2.0      4       c   3.770270
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "manager_weight": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0],
        "score": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
        "contact": ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
    })
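As a quick sanity check of the numbers above (a hand computation, not part of the answer), the value for contact 'a' can be reproduced directly with np.average:
import numpy as np
# contact 'a': scores [1, 1, 1, 2] with weights [1.0, 1.1, 1.2, 1.3]
np.average([1, 1, 1, 2], weights=[1.0, 1.1, 1.2, 1.3])  # 5.9 / 4.6 ≈ 1.282609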

Get the values of a multiple-index column

Let's assume we have a DataFrame df with N rows:
| multiple-index     | ordinary columns   |
| I_A, I_B, I_C, I_D | C_A, C_B, C_C, C_D |
How can we extract all N values of the I_B index column? df.index gives us all combinations of I_A...I_D, but that is not what we need. Of course, we could iterate over it, but that would cost performance; there must be an easier, more straightforward way?
Thank you for your time.
UPDATE
E.g., we have df generated by:
data = {
    "animal": ["cat", "dog", "parrot", "hamster"],
    "size": ["big", "big", "small", "small"],
    "feet": [4, 4, 2, 4]
}
multi = pd.DataFrame(data)
multi.set_index(["size", "feet"], inplace=True)
and which is:
             animal
size  feet
big   4         cat
      4         dog
small 2      parrot
      4     hamster
Its index is:
MultiIndex([( 'big', 4),
( 'big', 4),
('small', 2),
('small', 4)],
names=['size', 'feet'])
from which we would like to get all sizes:
['big', 'big', 'small', 'small']
How can we do that?
I think you're looking for MultiIndex.get_level_values:
multi.index.get_level_values('size')
Output: Index(['big', 'big', 'small', 'small'], dtype='object', name='size')
Or as list:
multi.index.get_level_values('size').to_list()
Output: ['big', 'big', 'small', 'small']
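get_level_values also accepts the level's integer position, which helps when a level is unnamed; here level 0 is 'size':
multi.index.get_level_values(0).to_list()
# ['big', 'big', 'small', 'small']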

How do I do an average plus count of a column using pandas data frame?

This code looks really stupid but this is a basic representation of the problem I've been dealing with all day - I have 3 columns, type, day and month. I'd like to count the number of dogs/cats by day, and then average it out over the month.
import numpy as np
import pandas as pd
data = {'Type':['Dog', 'Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Cat'], 'Day':[1, 1, 2, 2, 3, 3, 4, 4], 'Month': [1, 1, 1, 1, 2, 2, 2, 2]}
newDF = pd.DataFrame(data)
Which creates a dataframe that looks like this:
Type | Day | Month
------------------
Dog  |  1  |  1
Cat  |  1  |  1
Cat  |  2  |  1
Cat  |  2  |  1
Dog  |  3  |  2
Dog  |  3  |  2
Dog  |  4  |  2
Cat  |  4  |  2
What I'm trying to do here is create a table below showing this:
Type | Month1 | Month2
------------------------
Dog | 1 | 1.5
Cat | 1.5 | 1
So basically, I just want to use some combination of pivot table or groupby to create a pivot_table containing the count of number of cats / dogs per day, and then average that out over the month. For some reason, I just can't manage to figure it out. Can someone smart enough with pandas please help? Thank you!
Two groupbys + unstack
(newDF.groupby(['Type', 'Day', 'Month']).size()
      .groupby(level=[0, 2]).mean()
      .unstack()
      .add_prefix('Month').rename_axis(None, axis=1))
Output:
      Month1  Month2
Type
Cat      1.5     1.0
Dog      1.0     1.5
Just a groupby combined with an unstack and mean:
newDF.groupby(newDF.columns.tolist()) \
     .size() \
     .unstack(level='Day') \
     .mean(axis=1) \
     .unstack(level='Month')
Output:
Month    1    2
Type
Cat    1.5  1.0
Dog    1.0  1.5
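Another possible route (a sketch; it assumes you are happy to materialize the per-day counts first) is value_counts followed by pivot_table:
# count animals per (Type, Day, Month), then average those counts per month
counts = newDF.value_counts(['Type', 'Day', 'Month']).rename('n').reset_index()
out = (counts.pivot_table(index='Type', columns='Month', values='n', aggfunc='mean')
             .rename_axis(columns=None)
             .add_prefix('Month'))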

pandas dataframe generic column values

Is there a consistent way of getting pandas column values with DF['ColName'], including index columns? If 'ColName' is an index column, you get a KeyError.
It is very inconvenient to have to determine, every time, whether the column being passed in is an index column or not, and then handle it differently.
Thank you.
Consider the dataframe df:
df = pd.DataFrame(
    dict(
        A=[1, 2, 3],
        B=[4, 5, 6],
        C=['x', 'y', 'z'],
    ),
    pd.MultiIndex.from_tuples(
        [
            ('cat', 'red'),
            ('dog', 'blue'),
            ('bird', 'yellow')
        ],
        names=['species', 'color']
    )
)
print(df)
                A  B  C
species color
cat     red     1  4  x
dog     blue    2  5  y
bird    yellow  3  6  z
You can always refer to levels of the index in the same way you'd refer to columns if you reset_index() first.
Grab column 'A'
df.reset_index()['A']
0    1
1    2
2    3
Name: A, dtype: int64
Grab 'color' without reset_index()
df['color']
> KeyError
With reset_index()
df.reset_index()['color']
0       red
1      blue
2    yellow
Name: color, dtype: object
This doesn't come without its downside. That index was potentially useful to have for column 'A':
df['A']
species  color
cat      red       1
dog      blue      2
bird     yellow    3
Name: A, dtype: int64
The index is automatically aligned with the values of column 'A', which was the whole point of it being the index.
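If you want a single accessor that works for both ordinary columns and index levels, one option is a small helper (a sketch; get_values is a made-up name, not a pandas API):
def get_values(frame, name):
    # Return the values for `name`, whether it is an ordinary column or an index level.
    if name in frame.columns:
        return frame[name]
    if name in frame.index.names:
        return pd.Series(frame.index.get_level_values(name), index=frame.index, name=name)
    raise KeyError(name)

get_values(df, 'A')      # the ordinary column, index kept
get_values(df, 'color')  # the index level, no reset_index() needed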