Pandas sum over multiple columns after group by - pandas

If I have a data set where the columns are something like:
Day Column2 Column3 Column4......Column100
Is there a better way to do something like the below?
grouped_df = df.groupby('Day').agg({
'Column2': lambda x : sum(x),
'Column3': lambda x : sum(x),
'Column4': lambda x : sum(x),
..........
'Column100': lambda x : sum(x)})
What I have works, but I am wondering if there is a more elegant solution.
Thank You

You can try df.groupby('Day').sum() just like what MaxU said.

you can do it this way:
In [17]: df
Out[17]:
a b c d e Day
0 7 5 4 9 4 2016-01-01
1 2 1 5 4 5 2014-01-01
2 2 8 8 6 9 2014-01-01
3 1 4 4 3 7 2015-01-01
4 5 6 7 9 5 2016-01-01
5 3 6 0 8 7 2015-01-01
6 7 4 4 5 5 2014-01-01
7 1 1 0 1 6 2015-01-01
8 7 8 9 8 3 2015-01-01
9 8 5 5 2 8 2015-01-01
10 6 1 3 0 3 2014-01-01
11 1 8 2 7 2 2016-01-01
12 2 5 2 5 1 2016-01-01
13 1 2 3 2 2 2016-01-01
14 7 4 9 5 2 2016-01-01
15 4 0 8 9 5 2015-01-01
16 8 5 8 9 7 2015-01-01
17 6 7 9 5 4 2016-01-01
18 7 4 2 3 2 2016-01-01
19 2 7 8 6 8 2015-01-01
In [18]: cols = df.columns
In [19]: cols[1:]
Out[19]: Index(['b', 'c', 'd', 'e', 'Day'], dtype='object')
In [20]: df.loc[:, cols[1:]].groupby('Day').sum()
Out[20]:
b c d e
Day
2014-01-01 14 20 15 22
2015-01-01 36 42 46 51
2016-01-01 41 38 45 22
Setup for the sample DataFrame:
import numpy as np
import pandas as pd
rows = 20
df = pd.DataFrame(np.random.randint(0, 10, size=(rows, 5)), columns=list('abcde'))
dates = [pd.to_datetime(d) for d in ['2016-01-01','2015-01-01','2014-01-01']]
df['Day'] = np.random.choice(dates, len(df))
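As a quick check on the answers above, here is a minimal sketch comparing the per-column lambda aggregation from the question with the plain groupby('Day').sum() call. The small frame and its values are made up for illustration; only the Day/ColumnN naming comes from the question.

```python
import pandas as pd

# Hypothetical sample data: one key column and several numeric columns.
df = pd.DataFrame({
    'Day': ['2014-01-01', '2015-01-01', '2014-01-01', '2015-01-01'],
    'Column2': [1, 2, 3, 4],
    'Column3': [10, 20, 30, 40],
    'Column4': [5, 5, 5, 5],
})

# Verbose version from the question: one lambda per column,
# built here with a dict comprehension instead of typing each entry.
value_cols = [c for c in df.columns if c != 'Day']
verbose = df.groupby('Day').agg({c: (lambda x: sum(x)) for c in value_cols})

# Short version: sum every remaining column within each group.
short = df.groupby('Day').sum()

print(short)
```

Both frames hold the same sums, so the dictionary of lambdas is not needed when every column gets the same aggregation.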

Related

Pandas: How to extract data that has been grouped by

Here is an example code to demonstrate my problem:
import numpy as np
import pandas as pd
np.random.seed(10)
df = pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=list('xy'))
df
x y
0 9 4
1 0 1
2 9 0
3 1 8
4 9 0
... ... ...
95 0 4
96 6 4
97 9 8
98 0 7
99 1 7
groups = df.groupby(['x'])
groups.size()
x
0 11
1 12
2 15
3 13
4 14
5 5
6 6
7 9
8 5
9 10
dtype: int64
How can I access the x-values as a column and the aggregated y-values as a second column to plot x versus y?
Two options.
Use reset_index():
groups = df.groupby(['x']).size().reset_index(name='size')
Add as_index=False to groupby:
groups = df.groupby(['x'], as_index=False).size()
Output for both:
>>> groups
x size
0 0 16
1 1 9
2 2 9
3 3 5
4 4 7
5 5 10
6 6 10
7 7 7
8 8 12
9 9 15
IIUC, use as_index=False:
groups = df.groupby(['x'], as_index=False)
out = groups.size()
out.plot(x='x', y='size')
If you only want to plot, you can also keep the x as index:
df.groupby(['x']).size().plot()
output:
x size
0 0 16
1 1 9
2 2 9
3 3 5
4 4 7
5 5 10
6 6 10
7 7 7
8 8 12
9 9 15
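The two options can be compared side by side. This is a minimal sketch that reuses the question's setup; it assumes pandas 1.1 or later, where size() on a groupby created with as_index=False returns a DataFrame with a size column.

```python
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=list('xy'))

# Option 1: count rows per group, then move the group key out of the index.
counts_a = df.groupby('x').size().reset_index(name='size')

# Option 2: keep the group key as a regular column from the start
# (assumption: pandas >= 1.1, where this returns a DataFrame).
counts_b = df.groupby('x', as_index=False).size()

print(counts_a.columns.tolist())
print(counts_b.columns.tolist())
```

Either result has x and size as ordinary columns, so it can be passed to plot(x='x', y='size').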

Stack multiple columns into single column while maintaining other columns in Pandas?

Given pandas multiple columns as below
cl_a cl_b cl_c cl_d cl_e
0 1 a 5 6 20
1 2 b 4 7 21
2 3 c 3 8 22
3 4 d 2 9 23
4 5 e 1 10 24
I would like to stack the columns cl_c, cl_d and cl_e into a single column named ax, while keeping the columns cl_a and cl_b:
cl_a,cl_b,ax,from_col
1,a,5,cl_c
2,b,4,cl_c
3,c,3,cl_c
4,d,2,cl_c
5,e,1,cl_c
1,a,6,cl_d
2,b,7,cl_d
3,c,8,cl_d
4,d,9,cl_d
5,e,10,cl_d
1,a,20,cl_e
2,b,21,cl_e
3,c,22,cl_e
4,d,23,cl_e
5,e,24,cl_e
So far, the following code does the job
df = pd.DataFrame ( {'cl_a': [1,2,3,4,5], 'cl_b': ['a','b','c','d','e'],
'cl_c': [5,4,3,2,1],'cl_d': [6,7,8,9,10],
'cl_e': [20,21,22,23,24]})
frames = []
for col_name in ['cl_c','cl_d','cl_e']:
    frames.append(df[['cl_a', 'cl_b', col_name]].rename(columns={col_name: "ax"}))
df_new = pd.concat(frames)  # DataFrame.append was removed in pandas 2.0
However, I am curious whether there is a pandas built-in approach that can do the trick.
Edit:
After Quong's answer, I realised that I also need another column (from_col) beside ax. The from_col column records the name of the column each ax value came from.
Yes, it's called melt:
df.melt(['cl_a','cl_b'], value_name='ax').drop(columns='variable')
Output:
cl_a cl_b ax
0 1 a 5
1 2 b 4
2 3 c 3
3 4 d 2
4 5 e 1
5 1 a 6
6 2 b 7
7 3 c 8
8 4 d 9
9 5 e 10
10 1 a 20
11 2 b 21
12 3 c 22
13 4 d 23
14 5 e 24
Or equivalently set_index().stack():
(df.set_index(['cl_a','cl_b']).stack()
.reset_index(level=-1, drop=True)
.reset_index(name='ax')
)
with a slightly different output:
cl_a cl_b ax
0 1 a 5
1 1 a 6
2 1 a 20
3 2 b 4
4 2 b 7
5 2 b 21
6 3 c 3
7 3 c 8
8 3 c 22
9 4 d 2
10 4 d 9
11 4 d 23
12 5 e 1
13 5 e 10
14 5 e 24
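The edit to the question also asks for a from_col column, which the answers above drop. A minimal sketch of one way to keep it, assuming the question's sample frame: melt stores the original column name in a "variable" column, and var_name renames it.

```python
import pandas as pd

df = pd.DataFrame({'cl_a': [1, 2, 3, 4, 5], 'cl_b': ['a', 'b', 'c', 'd', 'e'],
                   'cl_c': [5, 4, 3, 2, 1], 'cl_d': [6, 7, 8, 9, 10],
                   'cl_e': [20, 21, 22, 23, 24]})

# Keep cl_a and cl_b as identifiers; stack the remaining columns into 'ax'
# and record each value's source column in 'from_col'.
long_df = df.melt(id_vars=['cl_a', 'cl_b'], var_name='from_col', value_name='ax')

# Reorder to match the layout requested in the question.
long_df = long_df[['cl_a', 'cl_b', 'ax', 'from_col']]

print(long_df.head())
```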

Pandas covid dataframe: get cases per day per county

I am reading covid-19 data from https://ti.saude.rs.gov.br/covid19/download , and I would like to:
select only rows where 'MUNICIPIO' column has value of 'SÃO LOURENÇO DO SUL';
then sort by column 'DATA_CONFIRMACAO';
then count rows in each group, getting a timeseries where "each point is the number of cases per day";
then plot with x-axis being date, and y-axis being count;
I tried this, without success:
import matplotlib.pyplot as plt
import pandas as pd
# Index(['COD_IBGE', 'MUNICIPIO', 'COD_REGIAO_COVID', 'REGIAO_COVID', 'SEXO',
# 'FAIXAETARIA', 'CRITERIO', 'DATA_CONFIRMACAO', 'DATA_SINTOMAS',
# 'DATA_EVOLUCAO', 'EVOLUCAO', 'HOSPITALIZADO', 'FEBRE', 'TOSSE',
# 'GARGANTA', 'DISPNEIA', 'OUTROS', 'CONDICOES', 'GESTANTE',
# 'DATA_INCLUSAO_OBITO', 'DATA_EVOLUCAO_ESTIMADA', 'RACA_COR',
# 'ETNIA_INDIGENA', 'PROFISSIONAL_SAUDE', 'BAIRRO', 'HOSPITALIZACAO_SRAG',
# 'FONTE_INFORMACAO', 'PAIS_NASCIMENTO', 'PES_PRIV_LIBERDADE'],
url = "https://ti.saude.rs.gov.br/covid19/download"
data = pd.read_csv('covid-rs.csv', delimiter=';')
result = data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL'].groupby('DATA_CONFIRMACAO').count()
print(result)
Output is:
COD_IBGE MUNICIPIO COD_REGIAO_COVID REGIAO_COVID SEXO FAIXAETARIA CRITERIO ... ETNIA_INDIGENA PROFISSIONAL_SAUDE BAIRRO HOSPITALIZACAO_SRAG FONTE_INFORMACAO PAIS_NASCIMENTO PES_PRIV_LIBERDADE
DATA_CONFIRMACAO ...
01/07/2020 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2
01/09/2020 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2
01/12/2020 24 24 24 24 24 24 24 ... 24 24 24 24 24 24 24
02/07/2020 3 3 3 3 3 3 3 ... 3 3 3 3 3 3 3
02/09/2020 5 5 5 5 5 5 5 ... 5 5 5 5 5 5 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
30/11/2020 20 20 20 20 20 20 20 ... 20 20 19 20 20 20 20
31/03/2020 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
31/07/2020 5 5 5 5 5 5 5 ... 5 5 5 5 5 5 5
31/08/2020 7 7 7 7 7 7 7 ... 7 7 7 7 7 7 7
31/10/2020 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
[129 rows x 28 columns]
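The dates are read as strings, so groupby sorts them alphabetically (01/07/2020, 01/09/2020, 01/12/2020, ...) rather than chronologically. Convert them to datetime when reading the file and groupby will sort them by date. You also get better-looking x-ticks when plotting.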
url = "https://ti.saude.rs.gov.br/covid19/download"
data = pd.read_csv('covid-rs.csv', delimiter=';',
parse_dates=['DATA_CONFIRMACAO'],
dayfirst=True)
result = data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL'].groupby('DATA_CONFIRMACAO').count()
print(result)
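Note that count() still returns one count per remaining column. If a single "cases per day" series is wanted, size() is one option. The sketch below uses a tiny made-up frame in place of the downloaded CSV; only the two column names and the dd/mm/yyyy date format are taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for the real file: same column names, made-up rows.
data = pd.DataFrame({
    'MUNICIPIO': ['SÃO LOURENÇO DO SUL', 'PELOTAS', 'SÃO LOURENÇO DO SUL',
                  'SÃO LOURENÇO DO SUL', 'SÃO LOURENÇO DO SUL'],
    'DATA_CONFIRMACAO': ['01/12/2020', '01/12/2020', '30/11/2020',
                         '01/12/2020', '31/10/2020'],
})

# Parse the day-first date strings into real datetimes.
data['DATA_CONFIRMACAO'] = pd.to_datetime(data['DATA_CONFIRMACAO'], format='%d/%m/%Y')

# Filter to one municipality, then count rows per confirmation date.
town = data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL']
cases_per_day = town.groupby('DATA_CONFIRMACAO').size()

print(cases_per_day)
```

The result is a Series indexed by date in chronological order, so cases_per_day.plot() should put the date on the x-axis and the count on the y-axis.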

How to multiply dataframe columns with dataframe column in pandas?

I want to multiply all columns of one dataframe by a single column of another dataframe.
I have two dataframes as shown here:
Dataframe A (columns a to d) and dataframe B (column e):
a b c d e
3 4 4 4 2
3 3 3 3 3
3 3 3 3 4
and I want to multiply A by B.
Multiplication result should be like this:
a b c d
6 8 8 8
9 9 9 9
12 12 12 12
I tried just * multiplication but got a wrong result.
Thank you in advance!
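A plain A * B aligns on both index and column labels, so columns that do not match produce NaN. Use B.values or B.to_numpy(), which return a NumPy array, so the multiplication broadcasts by position instead.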
Ex.:
>>> A
a b c d
0 3 4 4 4
1 3 3 3 3
2 3 3 3 3
>>> B
c
0 2
1 3
2 4
>>> A * B.values
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Just another variation on @Dishin's excellent answer:
You can use the pandas mul method to multiply A by B, by taking B's column as a Series and multiplying along the index:
A.mul(B.iloc[:,0],axis='index')
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Use DataFrame.mul with Series by selecting e column:
df = A.mul(B['e'], axis=0)
print (df)
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
I think you are looking for the mul function. Here is the code:
df = pd.DataFrame([[3, 4, 4, 4],[3, 3, 3, 3],[3, 3, 3, 3]])
val = [2,3,4]
df.mul(val, axis = 0)
Here are the results:
0 1 2 3
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Ignore the indices.
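To see why the plain * gives a wrong result and how the suggestions above differ, here is a minimal sketch. It assumes A has columns a to d and B has a single column named e, as in the question.

```python
import pandas as pd

A = pd.DataFrame({'a': [3, 3, 3], 'b': [4, 3, 3], 'c': [4, 3, 3], 'd': [4, 3, 3]})
B = pd.DataFrame({'e': [2, 3, 4]})

# DataFrame * DataFrame aligns on column labels. A and B share no label,
# so every cell of the result is NaN.
aligned = A * B

# Multiplying by the underlying (3, 1) array broadcasts across A's columns.
by_array = A * B.values

# mul with a Series and axis=0 matches the Series to A's row index.
by_mul = A.mul(B['e'], axis=0)

print(by_mul)
```

The array version ignores labels entirely, while mul still aligns on the row index, which matters if A and B are not in the same row order.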

How to count each x entries and mark the occurrence of this sequence with a value in a pandas dataframe?

I want to create a column C (based on B) that numbers each consecutive block of 4 entries in B (or in the dataframe in general). I have the following pandas data frame:
A B
1 100
2 102
3 103
4 104
5 105
6 106
7 108
8 109
9 110
10 112
11 113
12 115
13 116
14 118
15 120
16 121
I want to create the following column C:
A C
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 4
This column C should count each series of 4 entries of the dataframe.
Thanks in advance.
Use:
df['C'] = df.index // 4 + 1
Given that you have a fairly simple dataframe, it is reasonable to assume that it has the default index, which is a RangeIndex object.
In your example it would look like this:
df.index
#RangeIndex(start=0, stop=16, step=1)
The values of this index are the following:
df.index.values
#array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=int64)
Converting such array into your desired output is performed using the formula:
x // 4 + 1
Where // is the operator used for floor division.
A general solution is to create a numpy array with np.arange, then apply integer division by 4 and add 1, because Python counts from 0:
df['C'] = np.arange(len(df)) // 4 + 1
print (df)
A B C
0 1 100 1
1 2 102 1
2 3 103 1
3 4 104 1
4 5 105 2
5 6 106 2
6 7 108 2
7 8 109 2
8 9 110 3
9 10 112 3
10 11 113 3
11 12 115 3
12 13 116 4
13 14 118 4
14 15 120 4
15 16 121 4
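The difference between the two answers shows up once the index is no longer 0, 1, 2, .... A minimal sketch, using a hypothetical filter on the question's data to produce such an index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 17),
                   'B': [100, 102, 103, 104, 105, 106, 108, 109,
                         110, 112, 113, 115, 116, 118, 120, 121]})

# Hypothetical filter (keep even B values): rows are dropped,
# so the remaining index labels have gaps.
subset = df[df['B'] % 2 == 0].copy()

# Index-based version: block numbers follow the original labels.
subset['by_index'] = subset.index // 4 + 1

# Position-based version: block numbers follow the current row order.
subset['by_position'] = np.arange(len(subset)) // 4 + 1

print(subset)
```

With the gapped index, by_index no longer gives blocks of exactly four rows, while by_position does, which is why the np.arange form is the more general one.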