Pandas covid dataframe: get cases per day per county

I am reading covid-19 data from https://ti.saude.rs.gov.br/covid19/download , and I would like to:
select only rows where 'MUNICIPIO' column has value of 'SÃO LOURENÇO DO SUL';
then sort by column 'DATA_CONFIRMACAO';
then count rows in each group, getting a timeseries where "each point is the number of cases per day";
then plot with x-axis being date, and y-axis being count;
I tried this, without success:
import matplotlib.pyplot as plt
import pandas as pd
# Index(['COD_IBGE', 'MUNICIPIO', 'COD_REGIAO_COVID', 'REGIAO_COVID', 'SEXO',
#        'FAIXAETARIA', 'CRITERIO', 'DATA_CONFIRMACAO', 'DATA_SINTOMAS',
#        'DATA_EVOLUCAO', 'EVOLUCAO', 'HOSPITALIZADO', 'FEBRE', 'TOSSE',
#        'GARGANTA', 'DISPNEIA', 'OUTROS', 'CONDICOES', 'GESTANTE',
#        'DATA_INCLUSAO_OBITO', 'DATA_EVOLUCAO_ESTIMADA', 'RACA_COR',
#        'ETNIA_INDIGENA', 'PROFISSIONAL_SAUDE', 'BAIRRO', 'HOSPITALIZACAO_SRAG',
#        'FONTE_INFORMACAO', 'PAIS_NASCIMENTO', 'PES_PRIV_LIBERDADE'],
url = "https://ti.saude.rs.gov.br/covid19/download"
data = pd.read_csv('covid-rs.csv', delimiter=';')
result = data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL'].groupby('DATA_CONFIRMACAO').count()
print(result)
Output is:
COD_IBGE MUNICIPIO COD_REGIAO_COVID REGIAO_COVID SEXO FAIXAETARIA CRITERIO ... ETNIA_INDIGENA PROFISSIONAL_SAUDE BAIRRO HOSPITALIZACAO_SRAG FONTE_INFORMACAO PAIS_NASCIMENTO PES_PRIV_LIBERDADE
DATA_CONFIRMACAO ...
01/07/2020 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2
01/09/2020 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2
01/12/2020 24 24 24 24 24 24 24 ... 24 24 24 24 24 24 24
02/07/2020 3 3 3 3 3 3 3 ... 3 3 3 3 3 3 3
02/09/2020 5 5 5 5 5 5 5 ... 5 5 5 5 5 5 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
30/11/2020 20 20 20 20 20 20 20 ... 20 20 19 20 20 20 20
31/03/2020 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
31/07/2020 5 5 5 5 5 5 5 ... 5 5 5 5 5 5 5
31/08/2020 7 7 7 7 7 7 7 ... 7 7 7 7 7 7 7
31/10/2020 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
[129 rows x 28 columns]

Try converting your dates to datetime type; then groupby will sort your dates automatically. Plus, you will get better-looking x-ticks.
url = "https://ti.saude.rs.gov.br/covid19/download"
data = pd.read_csv('covid-rs.csv', delimiter=';',
                   parse_dates=['DATA_CONFIRMACAO'],
                   dayfirst=True)
result = data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL'].groupby('DATA_CONFIRMACAO').count()
print(result)
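The plotting step from the question can then be finished with size(), which returns a single count per day instead of a per-column count. A minimal sketch, assuming the CSV has already been downloaded to covid-rs.csv as above:

import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('covid-rs.csv', delimiter=';',
                   parse_dates=['DATA_CONFIRMACAO'],
                   dayfirst=True)

# size() yields one row count per group, i.e. cases per day
cases_per_day = (data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL']
                 .groupby('DATA_CONFIRMACAO')
                 .size())

cases_per_day.plot()  # the datetime index becomes the x-axis
plt.xlabel('Date')
plt.ylabel('Cases per day')
plt.show()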

Related

Dataframe within a Dataframe - to create new column

For the following dataframe:
import pandas as pd
df = pd.DataFrame({'list_A': [3, 3, 3, 3, 3,
                              2, 2, 2, 2, 2, 2, 2,
                              4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]})
How can 'list_A' be manipulated to give 'list_B'?
Desired output:
    list_A  list_B
0        3       1
1        3       1
2        3       1
3        3       0
4        2       1
5        2       1
6        2       0
7        2       0
8        4       1
9        4       1
10       4       1
11       4       1
12       4       0
13       4       0
14       4       0
15       4       0
16       4       0
As you can see, if list_A has the number 3, then the first 3 values of list_B are 1, after which list_B changes to 0 until list_A changes value again.
Use GroupBy.cumcount, which numbers the rows within each list_A group starting from 0; a row gets a 1 while that counter is still below the group's own value:
df['list_B'] = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)
print(df)
Output
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 3 0
5 2 1
6 2 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 0
12 4 1
13 4 1
14 4 1
15 4 1
16 4 0
17 4 0
18 4 0
19 4 0
20 4 0
21 4 0
22 4 0
23 4 0
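To see the mechanics, the intermediate counter can be inspected directly; a quick sketch rebuilding the question's frame:

import pandas as pd

df = pd.DataFrame({'list_A': [3, 3, 3, 3, 3,
                              2, 2, 2, 2, 2, 2, 2,
                              4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]})

# cumcount() numbers rows within each list_A group:
# 0..4 for the 3s, 0..6 for the 2s, 0..11 for the 4s
print(df.groupby('list_A').cumcount().tolist())
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]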
EDIT: If the same value can appear in more than one non-adjacent run, group by consecutive blocks instead, so the counter restarts at each run:
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
df['list_B'] = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)
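To illustrate why the block grouping matters, here is a quick sketch with a made-up series in which the value 3 occurs in two separate runs; shift/ne/cumsum gives each run its own label, so cumcount restarts per run:

import pandas as pd

s = pd.Series([3, 3, 2, 2, 3, 3])
blocks = s.ne(s.shift()).cumsum()
print(blocks.tolist())  # [1, 1, 2, 2, 3, 3] -- the second run of 3s is its own group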

Group counts in new column

I want a new column "group_count" that shows, for each row, in how many groups in total the attribute occurs.
Group Attribute group_count
0 1 10 4
1 1 10 4
2 1 10 4
3 2 10 4
4 2 20 1
5 3 30 1
6 3 10 4
7 4 10 4
I tried to group by Group and Attribute and then transform using count:
df["group_count"] = df.groupby(["Group", "Attribute"])["Attribute"].transform("count")
Group Attribute group_count
0 1 10 3
1 1 10 3
2 1 10 3
3 2 10 1
4 2 20 1
5 3 30 1
6 3 10 1
7 4 10 1
But it doesn't work: it counts the rows within each (Group, Attribute) pair instead.
Use df.drop_duplicates(['Group','Attribute']) to get the unique Attribute values per Group, then group by Attribute to count the Groups, and finally map the result back with the original Attribute column.
m=df.drop_duplicates(['Group','Attribute'])
df['group_count']=df['Attribute'].map(m.groupby('Attribute')['Group'].count())
print(df)
Group Attribute group_count
0 1 10 4
1 1 10 4
2 1 10 4
3 2 10 4
4 2 20 1
5 3 30 1
6 3 10 4
7 4 10 4
Use DataFrameGroupBy.nunique with transform:
df['group_count1'] = df.groupby('Attribute')['Group'].transform('nunique')
print (df)
Group Attribute group_count group_count1
0 1 10 4 4
1 1 10 4 4
2 1 10 4 4
3 2 10 4 4
4 2 20 1 1
5 3 30 1 1
6 3 10 4 4
7 4 10 4 4
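For reference, a self-contained reproduction of the sample frame (column names taken from the tables above) that can be used to check either answer:

import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 2, 2, 3, 3, 4],
                   'Attribute': [10, 10, 10, 10, 20, 30, 10, 10]})

# nunique counts the distinct Groups per Attribute, and transform
# broadcasts that count back onto every row
df['group_count'] = df.groupby('Attribute')['Group'].transform('nunique')
print(df)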

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
    return min(r)

tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Is r a list or a single element?
You can use Series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
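Note that rank() returns floats and, with the default method='average', gives tied prices the mean of their positions. If distinct integer positions are wanted (an assumption about how ties should be handled), method='first' breaks ties by order of appearance:

df['price_position'] = (df.groupby('srch_id')['price']
                          .rank(method='first')
                          .astype(int))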
Alternatively, sort by price and number the rows within each group:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Out[1907]:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2

Count with group by with numpy

I have a large array with a shape in excess of (1000000, 200). I would like to count the occurrences of the items in the last column (:, -1). I can do this in pandas with a smaller frame:
distribution = mylist.groupby('var1').count()
However, I do not have labels on any of my dimensions, so I am unsure how to reference them.
Edit:
A print of the pandas sample data:
0 1 2 3 4 ... 204 205 206 207 208
0 1 1 Random 1 4 12 ... 8 -14860 0 -5.0000 43.065233
1 1 1 Random 2 3 2 ... 8 -92993 -1 -1.0000 43.057945
2 1 1 Random 3 13 3 ... 8 -62907 1 -2.0000 43.070335
3 1 1 Random 3 13 3 ... 8 -62907 -1 -2.0000 43.070335
4 1 1 Random 4 4 2 ... 8 -38673 -1 0.0000 43.057945
5 1 1 Book 1 3 9 ... 8 -82339 -1 0.0000 43.059402
... ... ... ... .. .. ... .. ... .. ... ...
11795132 292 1 Random 5 12 2 ... 8 -69229 -1 0.0000 12.839051
11795133 292 1 Book 2 7 10 ... 8 -60664 -1 0.0000 46.823615
11795134 292 1 Random 2 9 4 ... 8 -78754 1 -2.0000 11.762521
11795135 292 1 Random 2 9 4 ... 8 -78754 -1 -2.0000 11.762521
11795136 292 1 Random 1 7 5 ... 8 -76275 -1 0.0000 41.839286
I want a few different counts and summaries, so I plan to do them one at a time with:
mylist = input_list.values
mylist = mylist[:, -1]
mylist = mylist.astype(int)
Expected output:
11 2
12 1
41 1
43 6
46 1
iloc enables you to reference a column without using labels:
distribution = input_list.groupby(input_list.iloc[:, -1]).count()
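Since the question is about staying in plain numpy, np.unique with return_counts does the same job directly on the unlabeled array. A sketch, with a small stand-in for input_list.values:

import numpy as np

# stand-in for input_list.values; only the last column matters here
mylist = np.array([[1, 4, 43.06],
                   [1, 3, 43.05],
                   [2, 13, 43.07],
                   [2, 4, 11.76]])

# count occurrences of each (truncated) value in the last column
values, counts = np.unique(mylist[:, -1].astype(int), return_counts=True)
for v, c in zip(values, counts):
    print(v, c)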

Pandas count values inside dataframe

I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count the values so as to make a df like this:
total
1 2
3 2
4 1
5 2
8 2
Is this possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=1)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
   level_1
0
1        2
3        2
4        1
5        2
8        2
Another alternative is to use stack, followed by value_counts, then convert the result to a frame and finally sort the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique(..., return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
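If labelled output matching the other answers is wanted, the same pair of arrays can be wrapped up with names (the labels here are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 5], 'B': [8, 4, 8], 'C': [3, 3, 1]})

# unique values and their counts over the whole frame
values, counts = np.unique(df, return_counts=True)
print(pd.DataFrame({'total': counts}, index=values))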