I have the following dataset:
id test date
1 A 2000-01-01
1 B 2000-01-01
1 C 2000-01-08
2 A 2000-01-01
2 A 2000-01-01
2 B 2000-01-08
3 A 2000-01-01
3 C 2000-01-01
3 B 2000-01-08
4 A 2000-01-01
4 B 2000-01-01
4 C 2000-01-01
5 A 2000-01-01
5 B 2000-01-01
5 C 2000-01-01
I would love to create a matrix figure with the count of how many individuals had tests taken on the same day.
For example:
From the data above, for one individual (id = 1) tests A and B were taken on the same day; for one individual (id = 3) tests A and C were taken on the same day; and for two individuals (id = 4 and 5) all three tests were taken on the same day.
So far I am doing the following:
df_tests = df.groupby(['id', 'date']).value_counts().reset_index(name='count')
df_tests_unique = df_tests[df_tests.duplicated(subset=['id', 'date'], keep=False)]
df_tests_unique = df_tests_unique[["id", "date", "test"]]
So the only thing left is to count the number of times the different tests occur within the same date.
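For reference, one possible sketch of that last counting step (using only the columns above; the combos and combo_counts names are just illustrative):
combos = (df.drop_duplicates()                      # id=2 has test A twice on the same day
            .groupby(['id', 'date'])['test']
            .agg(lambda s: ''.join(sorted(s))))     # e.g. 'AB', 'AC', 'ABC'
combo_counts = combos[combos.str.len() > 1].value_counts()
print(combo_counts)
# ABC    2
# AB     1
# AC     1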
Thanks for the fun exercise :) Given below is a possible solution. I created a numpy array and plotted it using seaborn. Note that it's quite hardcoded for the case where there are only tests A, B, C, but I'm sure you will be able to generalize that. Also, the default color scheme of seaborn gives the opposite colors from what you intended, but that's easily fixable as well. Hope I helped!
Below is the script that produces the plot:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({
'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
'test': ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
'date': ['2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01']
})
df_tests = df.groupby(['id', 'date']).value_counts().reset_index(name='count')
# keep only the (id, date) pairs with more than one test, concatenate the test letters
# into a pattern string (e.g. 'AB'), count how many (id, date) rows share each pattern,
# and turn each pattern into a 0/1 indicator vector over ['A', 'B', 'C']
df_test_with_patterns = (df_tests[df_tests.duplicated(subset=['id', 'date'], keep=False)]
                         .groupby(['id', 'date'])
                         .agg({'test': 'sum'})
                         .reset_index().groupby('test').count().reset_index()
                         .assign(pattern=lambda df: df.test.apply(lambda tst: [1 if x in tst else 0 for x in ['A', 'B', 'C']]))
                         )
pattern_mat = np.vstack(df_test_with_patterns.pattern.values.tolist())
ax = sns.heatmap(pattern_mat, xticklabels=['A', 'B', 'C'], yticklabels=df_test_with_patterns.id.values)
ax.set(xlabel='Test Type', ylabel='# of individuals that took the tests on a single day')
plt.show()
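If you want to remove the hardcoded ['A', 'B', 'C'] mentioned above, a possible sketch (assuming single-letter test names, as in the example) is to derive the labels from the data and reuse them for both the pattern vectors and the tick labels:
tests = sorted(df['test'].unique())  # ['A', 'B', 'C'] here, but derived instead of hardcoded
df_test_with_patterns = df_test_with_patterns.assign(
    pattern=lambda d: d.test.apply(lambda tst: [1 if x in tst else 0 for x in tests]))
pattern_mat = np.vstack(df_test_with_patterns.pattern.values.tolist())
ax = sns.heatmap(pattern_mat, xticklabels=tests, yticklabels=df_test_with_patterns.id.values)
ax.set(xlabel='Test Type', ylabel='# of individuals that took the tests on a single day')
plt.show()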
Building on Erap's answer, this works too, and may be slightly faster:
out = (pd.get_dummies(df.set_index(['date', 'id'], drop=True).sort_index())
         .groupby(level=[0, 1])
         .sum())
and then iterate over the different dates to get a chart per date:
for i in out.index.levels[0]:
    d = out.loc[i]
    plt.figure()
    plt.title(f'test for date {i}')
    sns.heatmap(d.gt(0))
It shouldn't be that hard, but I can't get past this problem.
Imagine I have a long-format dataframe with some data and want to calculate a weighted average of score per person, weighted by the manager weight, and keep it as a separate variable, 'w_mean_m'.
df['w_mean_m'] = df.groupby('person')['score'].transform(lambda x: np.average(x['score'], weights=x['manager_weight']))
This throws an error, and I have no idea how to fix it.
Because GroupBy.transform works with each column separately, it is not possible to select multiple columns, so GroupBy.apply is used with Series.map for the new column:
s = (df.groupby('contact')
       .apply(lambda x: np.average(x['score'], weights=x['manager_weight'])))
df['w_mean_m'] = df['contact'].map(s)
One hack is possible with the weight values selected by the unique index:
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights=df.loc[x.index, "manager_weight"])
df['w_mean_m1'] = df.groupby('contact')['score'].transform(f)
print (df)
manager_weight score contact w_mean_m1
0 1.0 1 a 1.282609
1 1.1 1 a 1.282609
2 1.2 1 a 1.282609
3 1.3 2 a 1.282609
4 1.4 2 b 2.355556
5 1.5 2 b 2.355556
6 1.6 3 b 2.355556
7 1.7 3 c 3.770270
8 1.8 4 c 3.770270
9 1.9 4 c 3.770270
10 2.0 4 c 3.770270
Setup:
df = pd.DataFrame(
{
"manager_weight": [1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0],
"score": [1,1,1,2,2,2,3,3,4,4,4],
"contact": ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
})
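As a quick sanity check of the printed numbers (a small sketch, just redoing the weighted mean for contact 'a' by hand):
# contact 'a' has scores [1, 1, 1, 2] with weights [1.0, 1.1, 1.2, 1.3]:
# (1*1.0 + 1*1.1 + 1*1.2 + 2*1.3) / (1.0 + 1.1 + 1.2 + 1.3) = 5.9 / 4.6
print(np.average([1, 1, 1, 2], weights=[1.0, 1.1, 1.2, 1.3]))  # -> 1.282609...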
I would like to group a dataframe by a column's appearance pattern (not necessarily the same order, but with no repeats).
For example, below, rows (0, 1, 2) of the x column form one group and rows (3, 4, 5) form another. The elements of each group may not be the same (and need not be in the same order), but no element repeats within a group.
#+begin_src python :results output
import pandas as pd
df = pd.DataFrame({
'x': ['a', 'b', 'c', 'c', 'b', 'a'],
'y': [1, 2, 3, 4, 3, 1]})
print(df)
#+end_src
#+RESULTS:
: x y
: 0 a 1
: 1 b 2
: 2 c 3
: 3 c 4
: 4 b 3
: 5 a 1
Try with cumcount; the output can serve as the group number for you:
df.groupby('x').cumcount()
Out[81]:
0 0
1 0
2 0
3 1
4 1
5 1
dtype: int64
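A sketch of actually using that output as the grouper (the name g is just illustrative):
g = df.groupby('x').cumcount()
for key, block in df.groupby(g):
    print(f'group {key}:')
    print(block)
# group 0 contains rows 0, 1, 2 and group 1 contains rows 3, 4, 5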
The comments have made me realize that this is actually a far broader question about how the on keyword works in .resample. I left the old question below for reference, but I think the question is much broader.
Here's a reproducible example; I would expect the first two statements to give the same results, and the second two statements to give the same results. They don't.
get_df = lambda : pd.DataFrame( {'DATETIME' : pd.to_datetime(['2018-01-01 11:25:00', '2018-01-01 11:50:00', '2018-01-03 10:30:00'
, '2018-01-04 10:25:00']*2),
'GROUP' : ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'FILTER' : [True, True, True, True, False, False, True, True],
'X' : [1, 2, 3, 4, 5, 6, 7, 8]} )
df = get_df()
df = df.set_index('DATETIME')
df.groupby('GROUP').resample('D').X.sum()
# Returns
# -------
# GROUP DATETIME
# A 2018-01-01 3
# 2018-01-02 0
# 2018-01-03 3
# 2018-01-04 4
# B 2018-01-01 11
# 2018-01-02 0
# 2018-01-03 7
# 2018-01-04 8
# Name: X, dtype: int64
df = get_df()
df.groupby('GROUP').resample('D', on = 'DATETIME').X.sum()
# Returns
# -------
# GROUP DATETIME
# A 2018-01-01 10
# B 2018-01-03 11
# 2018-01-04 15
# Name: X, dtype: int64
df = get_df()
df = df.set_index('DATETIME')
df[df.FILTER].groupby('GROUP').resample('D').X.sum()
# Returns
# -------
# GROUP DATETIME
# A 2018-01-01 3
# 2018-01-02 0
# 2018-01-03 3
# 2018-01-04 4
# B 2018-01-03 7
# 2018-01-04 8
# Name: X, dtype: int64
df = get_df()
df[df.FILTER].groupby('GROUP').resample('D', on = 'DATETIME').X.sum()
# Error
# -----
# IndexError: index 6 is out of bounds for size 6
Any thoughts?
Original question
I'm trying to do a groupby followed by a resample in pandas. This works if the date is in the df's index, but NOT if it is in a column and I supply the "on" keyword in the resample.
Python 3.7.1 and Pandas 0.24.2
Set up the dataframe:
df = pd.DataFrame( {'DATETIME' : pd.to_datetime(['2018-01-01 11:25:00', '2018-01-01 11:50:00', '2018-01-03 10:30:00'
, '2018-01-04 10:25:00', '2018-01-03 10:30:00', '2018-01-04 10:25:00']),
'GROUP' : ['A', 'A', 'A', 'A', 'B', 'B'],
'X' : [1, 2, 3, 4, 5, 6]} )
Then run this:
df[df.GROUP == 'B'].groupby('GROUP').resample('D', on = 'DATETIME').X.sum()
And I get this error: IndexError: index 4 is out of bounds for size 2
If, however, I first index by the date:
df = df.set_index('DATETIME')
df[df.GROUP == 'B'].groupby('GROUP').resample('D').X.sum()
It works fine.
Any ideas?
You need to use "apply" with a custom function and let pandas adapt itself to the output.
def my_func(grouped):
    my_sum = grouped.resample('D', on='DATETIME').X.sum()
    return my_sum
Now call this function on your groupby object:
df[df.GROUP == 'B'].groupby("GROUP").apply(my_func)
You get:
#Output
DATETIME 2018-01-03 00:00:00 2018-01-04 00:00:00
GROUP
B 5 6
What you did is ambiguous: pandas expects a series of 2 elements because group B has 2 elements, but you are trying to obtain a dataframe like the one above.
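If you prefer the long shape of the index-based version, a possible follow-up sketch is to stack the result:
out = df[df.GROUP == 'B'].groupby("GROUP").apply(my_func)
print(out.stack())
# GROUP  DATETIME
# B      2018-01-03    5
#        2018-01-04    6
# dtype: int64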
I had a similar situation with my resampling. You have to run the following sequence of setting and resetting the index in order to make the index error go away:
df = df.set_index('order_date')
df.reset_index(inplace=True)
The line of code below will return an error if you do not run the code above:
df.groupby('Ship To #').resample('MS', on='order_date').product.sum()
Hope it works.
I am trying to apply groupby -> mean to the first n-1 rows of each group and then assign that mean to the n-th row in pandas. Here is my current code and the desired output. It takes a long time to run, and I wonder whether anyone knows how to optimize it.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B', 'C'],
'vals': [2, 3, 4, 5, 6, 7]})
# current solution
for h in df['id'].unique():
    h_df = df[df['id'] == h]
    indices = h_df.index
    size = h_df.shape[0]
    last_index = indices[size - 1]
    if size == 1:
        df.iloc[last_index, df.columns.get_loc('vals')] = np.nan
        continue
    exclude_last = h_df[:size - 1]
    avg = (exclude_last.groupby('id')['vals'].mean()).values[0]
    df.iloc[last_index, df.columns.get_loc('vals')] = avg
# output
# id vals
# A 2
# A 3
# A 2.5 => (2+3) / 2
# B 5
# B 5 => (5/1)
# C np.nan
There's no reason to iterate over the unique values, select the groups, and do another groupby. All of that can be done by the groupby itself:
In [1]: def mean_head(group):
...: group.vals.iloc[-1] = group.vals.iloc[:-1].mean()
...: return group
...:
In [2]: df.groupby("id").apply(mean_head)
Out[2]:
id vals
0 A 2.0
1 A 3.0
2 A 2.5
3 B 5.0
4 B 5.0
5 C NaN
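If the apply is still too slow on a large frame, a fully vectorized sketch (starting again from the original df; the names mean_excl_last and is_last are just illustrative) uses the identity 'mean of the other rows = (group sum - own value) / (group size - 1)' evaluated on each group's last row:
g = df.groupby('id')['vals']
mean_excl_last = (g.transform('sum') - df['vals']) / (g.transform('size') - 1)  # NaN for size-1 groups
is_last = df.groupby('id').cumcount(ascending=False).eq(0)  # True on each group's last row
df['vals'] = df['vals'].where(~is_last, mean_excl_last)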
I don't know why I'm struggling so hard with this one. I'm trying to do the Excel equivalent of an AVERAGEIFS calculation across a pandas dataframe.
I have the following:
df = pd.DataFrame(rng.rand(1000, 7), columns=['1/31/2019', '2/28/2019', '3/31/2019', '4/30/2019', '5/31/2019', '6/30/2019', '7/31/2019'])
I also have a column:
df['Doc_Number'] = ['A', 'B', 'C', 'B', 'C', 'B', 'A', 'A', 'D', 'G', 'G', 'D', 'G', 'B' ...]
I want to do the Excel equivalent of AVERAGEIFS on Doc_Number for each column of the df while maintaining the structure of the dataframe. So in each column I'd calculate the mean for each value of df['Doc_Number'] ('A', 'B', 'C', ...), but I'd still keep the 1,000 rows, and I'd apply the calculation to each individual column ('1/31/2019', '2/28/2019', '3/31/2019', ...).
For a single column, I would do something like:
df['AverageIfs'] = df.groupby('Doc_Number')['1/31/2019'].transform('mean')
But how would you apply the calculation to each column of the df? In reality, I have many more columns to apply it across.
I'm a complete amateur so thanks for putting up with my questions.
You can remove ['1/31/2019'] after the groupby to process all columns into a new DataFrame, change the column names with add_suffix, and add them to the original by join:
# simplified df for easy checking of the output
np.random.seed(123)
df = pd.DataFrame(np.random.rand(14, 2), columns=['1/31/2019', '2/28/2019'])
df['Doc_Number'] = ['A', 'B', 'C', 'B', 'C', 'B', 'A', 'A', 'D', 'G', 'G', 'D', 'G', 'B']
print (df)
1/31/2019 2/28/2019 Doc_Number
0 0.696469 0.286139 A
1 0.226851 0.551315 B
2 0.719469 0.423106 C
3 0.980764 0.684830 B
4 0.480932 0.392118 C
5 0.343178 0.729050 B
6 0.438572 0.059678 A
7 0.398044 0.737995 A
8 0.182492 0.175452 D
9 0.531551 0.531828 G
10 0.634401 0.849432 G
11 0.724455 0.611024 D
12 0.722443 0.322959 G
13 0.361789 0.228263 B
df = df.join(df.groupby('Doc_Number').transform('mean').add_suffix('_mean'))
print (df)
1/31/2019 2/28/2019 Doc_Number 1/31/2019_mean 2/28/2019_mean
0 0.696469 0.286139 A 0.511029 0.361271
1 0.226851 0.551315 B 0.478146 0.548364
2 0.719469 0.423106 C 0.600200 0.407612
3 0.980764 0.684830 B 0.478146 0.548364
4 0.480932 0.392118 C 0.600200 0.407612
5 0.343178 0.729050 B 0.478146 0.548364
6 0.438572 0.059678 A 0.511029 0.361271
7 0.398044 0.737995 A 0.511029 0.361271
8 0.182492 0.175452 D 0.453474 0.393238
9 0.531551 0.531828 G 0.629465 0.568073
10 0.634401 0.849432 G 0.629465 0.568073
11 0.724455 0.611024 D 0.453474 0.393238
12 0.722443 0.322959 G 0.629465 0.568073
13 0.361789 0.228263 B 0.478146 0.548364