Count with group by with numpy

I have a large array with a shape in excess of (1000000, 200). I would like to count the occurrences of each value in the last column (:, -1). With a smaller dataset I can do this in pandas:
distribution = mylist.groupby('var1').count()
However, none of my dimensions are labelled, so I am unsure how to reference the column.
Edit:
print of pandas sample data;
0 1 2 3 4 ... 204 205 206 207 208
0 1 1 Random 1 4 12 ... 8 -14860 0 -5.0000 43.065233
1 1 1 Random 2 3 2 ... 8 -92993 -1 -1.0000 43.057945
2 1 1 Random 3 13 3 ... 8 -62907 1 -2.0000 43.070335
3 1 1 Random 3 13 3 ... 8 -62907 -1 -2.0000 43.070335
4 1 1 Random 4 4 2 ... 8 -38673 -1 0.0000 43.057945
5 1 1 Book 1 3 9 ... 8 -82339 -1 0.0000 43.059402
... ... ... ... .. .. ... .. ... .. ... ...
11795132 292 1 Random 5 12 2 ... 8 -69229 -1 0.0000 12.839051
11795133 292 1 Book 2 7 10 ... 8 -60664 -1 0.0000 46.823615
11795134 292 1 Random 2 9 4 ... 8 -78754 1 -2.0000 11.762521
11795135 292 1 Random 2 9 4 ... 8 -78754 -1 -2.0000 11.762521
11795136 292 1 Random 1 7 5 ... 8 -76275 -1 0.0000 41.839286
I want a few different counts and summaries, so I plan to do them one at a time, starting with:
mylist = input_list.values
mylist = mylist[:, -1]
mylist = mylist.astype(int)
Expected output;
11 2
12 1
41 1
43 6
46 1

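iloc lets you reference a column by position, without using labels:
distribution = input_list.groupby(input_list.iloc[:, -1]).count()
Note that count() returns the number of non-null entries in every column for each group; use size() instead if you only need one count per group. To group by the integer part, as in the expected output, group by input_list.iloc[:, -1].astype(int).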
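For the NumPy route that the question title asks about, a minimal sketch could look like the following. The small frame is a hypothetical stand-in for the real data (only the last-column values are taken from the sample rows in the question), and truncating with astype(int) is assumed to be the intended rounding, since it reproduces the expected output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the unlabeled frame; the last column holds the
# values printed in the question's sample rows.
input_list = pd.DataFrame({
    0: [1, 1, 1, 1, 1, 1, 292, 292, 292, 292, 292],
    1: ["Random", "Random", "Random", "Random", "Random", "Book",
        "Random", "Book", "Random", "Random", "Random"],
    2: [43.065233, 43.057945, 43.070335, 43.070335, 43.057945, 43.059402,
        12.839051, 46.823615, 11.762521, 11.762521, 41.839286],
})

# Select the last column by position and truncate to integers.
last = input_list.iloc[:, -1].to_numpy().astype(int)

# np.unique returns the sorted distinct values and how often each occurs.
values, counts = np.unique(last, return_counts=True)
distribution = pd.Series(counts, index=values)
print(distribution)
```

Because np.unique works on a single one-dimensional array, this avoids building per-column counts for all 200 columns.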

Related

Pandas: I want to slice the data and shuffle the slices to generate some synthetic data

I want to help our data science work by generating some synthetic data, since we don't have enough labelled data. I want to cut the rows at random positions where the y column is 0, without cutting through a sequence of 1s.
After cutting, I want to shuffle the slices and build a new DataFrame.
Ideally there would be parameters to adjust the maximum and minimum length of a sequence to cut, the number of cuts, and so on.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT, because it cuts through a sequence of 1s:
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...

You can use:
import numpy as np
# number of cuts
N = 3
# pick N distinct index labels at random among the rows where y == 0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
# label the slices: each chosen row starts a new group (membership test + cumulative sum)
arr = df.index.isin(idx).cumsum()
# shuffle the unique group ids
u = np.unique(arr)
np.random.shuffle(u)
# reorder the DataFrame by the shuffled group ids
df = df.set_index(arr).loc[u].reset_index(drop=True)
print(df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1
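The same idea can be wrapped in a reusable function with a seed, which makes the result reproducible. This is only a sketch: the function name shuffle_slices and its parameters n_cuts and seed are hypothetical, and it does not implement the minimum or maximum slice length the question mentions.

```python
import numpy as np
import pandas as pd

def shuffle_slices(df, n_cuts, seed=None):
    """Cut df before n_cuts randomly chosen rows with y == 0, then shuffle the slices."""
    rng = np.random.default_rng(seed)
    # Positions of rows where a cut is allowed (y == 0).
    zero_pos = np.flatnonzero(df["y"].to_numpy() == 0)
    cut_pos = rng.choice(zero_pos, size=n_cuts, replace=False)
    # Each chosen row starts a new slice; the cumulative sum labels the slices.
    is_cut = np.zeros(len(df), dtype=bool)
    is_cut[cut_pos] = True
    group_ids = is_cut.cumsum()
    slices = [part for _, part in df.groupby(group_ids)]
    # Reassemble the slices in a random order.
    order = rng.permutation(len(slices))
    return pd.concat([slices[i] for i in order], ignore_index=True)

df = pd.DataFrame({
    "ts": range(13),
    "v1": [100, 120, 80, 5, 2, 100, 200, 1234, 12, 40, 200, 300, 0.5],
    "y": [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
})
out = shuffle_slices(df, n_cuts=3, seed=0)
print(out)
```

Since a slice can only start at a row with y == 0, a cut never falls between two consecutive 1s.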

Dataframe within a Dataframe - to create new column

For the following dataframe:
import pandas as pd
df=pd.DataFrame({'list_A':[3,3,3,3,3,\
2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,4,4,4]})
How can 'list_A' be manipulated to give 'list_B'?
Desired output:
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 2 1
5 2 1
6 2 0
7 2 0
8 4 1
9 4 1
10 4 1
11 4 1
12 4 0
13 4 0
14 4 0
15 4 0
16 4 0
As you can see, if list_A has the value 3, then the first 3 values of list_B are 1 and the remaining values of list_B are 0, until list_A changes value again.

Use GroupBy.cumcount to number the rows within each group, then compare that counter with list_A:
df['list_B'] = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)
print(df)
Output
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 3 0
5 2 1
6 2 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 0
12 4 1
13 4 1
14 4 1
15 4 1
16 4 0
17 4 0
18 4 0
19 4 0
20 4 0
21 4 0
22 4 0
23 4 0
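EDIT: if the same value can appear in more than one separate run, group by consecutive blocks instead of by value: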
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
df['list_B'] = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)
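To see why the two versions can differ, here is a small sketch on a hypothetical column in which the value 2 appears in two separate runs. Grouping by value keeps counting across both runs, while grouping by consecutive blocks restarts the counter at every change.

```python
import pandas as pd

# Hypothetical data: the value 2 occurs in two separate runs.
df = pd.DataFrame({"list_A": [2, 2, 2, 3, 3, 3, 3, 2, 2, 2]})

# Counter per distinct value: the second run of 2s continues at 3, 4, 5.
by_value = df["list_A"].gt(df.groupby("list_A").cumcount()).astype(int)

# Counter per consecutive block: a new block starts whenever the value changes.
blocks = df["list_A"].ne(df["list_A"].shift()).cumsum()
by_block = df["list_A"].gt(df.groupby(blocks).cumcount()).astype(int)

print(pd.DataFrame({"list_A": df["list_A"], "by_value": by_value, "by_block": by_block}))
```

With the data from the question both versions give the same result, because each value occurs in a single run there.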

Pandas covid dataframe: get cases per day per county

I am reading covid-19 data from https://ti.saude.rs.gov.br/covid19/download , and I would like to:
select only rows where 'MUNICIPIO' column has value of 'SÃO LOURENÇO DO SUL';
then sort by column 'DATA_CONFIRMACAO';
then count rows in each group, getting a timeseries where "each point is the number of cases per day";
then plot with x-axis being date, and y-axis being count;
I tried this, without success:
import matplotlib.pyplot as plt
import pandas as pd
# Index(['COD_IBGE', 'MUNICIPIO', 'COD_REGIAO_COVID', 'REGIAO_COVID', 'SEXO',
# 'FAIXAETARIA', 'CRITERIO', 'DATA_CONFIRMACAO', 'DATA_SINTOMAS',
# 'DATA_EVOLUCAO', 'EVOLUCAO', 'HOSPITALIZADO', 'FEBRE', 'TOSSE',
# 'GARGANTA', 'DISPNEIA', 'OUTROS', 'CONDICOES', 'GESTANTE',
# 'DATA_INCLUSAO_OBITO', 'DATA_EVOLUCAO_ESTIMADA', 'RACA_COR',
# 'ETNIA_INDIGENA', 'PROFISSIONAL_SAUDE', 'BAIRRO', 'HOSPITALIZACAO_SRAG',
# 'FONTE_INFORMACAO', 'PAIS_NASCIMENTO', 'PES_PRIV_LIBERDADE'],
url = "https://ti.saude.rs.gov.br/covid19/download"
data = pd.read_csv('covid-rs.csv', delimiter=';')
result = data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL'].groupby('DATA_CONFIRMACAO').count()
print(result)
Output is:
COD_IBGE MUNICIPIO COD_REGIAO_COVID REGIAO_COVID SEXO FAIXAETARIA CRITERIO ... ETNIA_INDIGENA PROFISSIONAL_SAUDE BAIRRO HOSPITALIZACAO_SRAG FONTE_INFORMACAO PAIS_NASCIMENTO PES_PRIV_LIBERDADE
DATA_CONFIRMACAO ...
01/07/2020 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2
01/09/2020 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2
01/12/2020 24 24 24 24 24 24 24 ... 24 24 24 24 24 24 24
02/07/2020 3 3 3 3 3 3 3 ... 3 3 3 3 3 3 3
02/09/2020 5 5 5 5 5 5 5 ... 5 5 5 5 5 5 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
30/11/2020 20 20 20 20 20 20 20 ... 20 20 19 20 20 20 20
31/03/2020 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
31/07/2020 5 5 5 5 5 5 5 ... 5 5 5 5 5 5 5
31/08/2020 7 7 7 7 7 7 7 ... 7 7 7 7 7 7 7
31/10/2020 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
[129 rows x 28 columns]

Try converting your dates to datetime type; groupby will then sort the groups chronologically rather than as strings. You also get better-looking x-ticks.
url = "https://ti.saude.rs.gov.br/covid19/download"
data = pd.read_csv('covid-rs.csv', delimiter=';',
                   parse_dates=['DATA_CONFIRMACAO'],
                   dayfirst=True)
result = data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL'].groupby('DATA_CONFIRMACAO').count()
print(result)
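Since count() repeats the same number in every column, a single series of cases per day may be easier to plot. The sketch below uses size() for that; the few rows are hypothetical stand-ins for covid-rs.csv (including the 'PELOTAS' row), and the day-first date format is assumed from the output shown in the question.

```python
import pandas as pd

# Hypothetical rows standing in for the real CSV file.
data = pd.DataFrame({
    "MUNICIPIO": ["SÃO LOURENÇO DO SUL"] * 4 + ["PELOTAS"],
    "DATA_CONFIRMACAO": ["01/07/2020", "31/03/2020", "01/07/2020",
                         "02/07/2020", "01/07/2020"],
})

# Parse day-first date strings so the groups sort chronologically.
data["DATA_CONFIRMACAO"] = pd.to_datetime(data["DATA_CONFIRMACAO"], format="%d/%m/%Y")

# Keep one municipality, then count rows per confirmation date.
town = data[data["MUNICIPIO"] == "SÃO LOURENÇO DO SUL"]
cases_per_day = town.groupby("DATA_CONFIRMACAO").size()
print(cases_per_day)
```

Calling cases_per_day.plot() afterwards should give the date on the x-axis and the count on the y-axis, provided matplotlib is installed.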

Pandas: Calculate percentage of column for each class

I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate percentage of 0/1's for each class, so for example the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480

Divide the Sum column by the per-class total. GroupBy.transform returns a Series with the same length as the original DataFrame, filled with the aggregated values:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64
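A self-contained version of the above, as a sketch that rebuilds the sample frame from the question and checks that the shares within each class add up to 1:

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "Class": [1, 1, 2, 2, 3, 3],
    "Boolean": [0, 1, 0, 1, 0, 1],
    "Sum": [10, 20, 15, 25, 52, 48],
})

# transform('sum') broadcasts each class total back onto its rows.
class_total = df.groupby("Class")["Sum"].transform("sum")
df["%"] = df["Sum"].div(class_total)

# Sanity check: the shares within each class should add up to 1.
print(df.groupby("Class")["%"].sum())
```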

Set value from another dataframe

Having a data frame exex as
EXEX I J
1 702 2 3
2 3112 2 4
3 1360 2 5
4 702 3 2
5 221 3 5
6 591 3 11
7 3112 4 2
8 394 4 5
9 3416 4 11
10 1360 5 2
11 221 5 3
12 394 5 4
13 108 5 11
14 591 11 3
15 3416 11 4
16 108 11 5
is there a more efficient pandas approach to set the entries of an existing all-zero DataFrame df to the values exex.EXEX, where exex.I gives the index label and exex.J gives the column label? Is there a way to update the data by specifying the label instead of the row position? If the labels change, the row positions would differ and could lead to an erroneous result.
My current approach:
df = pd.DataFrame(0, index=range(1, 908), columns=range(1, 908))
for index, row in exex.iterrows():
    # DataFrame.set_value was removed in pandas 1.0; .at is the label-based scalar setter
    df.at[row['I'], row['J']] = row['EXEX']

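Assign to df.values, using positions (label minus 1, since both axes run from 1). This relies on df.values being a writable view of the frame's data, which is not guaranteed in general: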
df.values[exex.I.values - 1, exex.J.values - 1] = exex.EXEX.values
print(df.iloc[:5, :5])
1 2 3 4 5
1 0 0 0 0 0
2 0 0 702 3112 1360
3 0 702 0 0 221
4 0 3112 0 0 394
5 0 1360 221 394 0
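For the label-based variant the question asks about, one possible sketch is to translate labels to positions with Index.get_indexer and write into a copy of the data. The 11x11 frame here is a smaller stand-in for the 907x907 one, and the KeyError check is an added assumption about how missing labels should be handled.

```python
import pandas as pd

# Data from the question.
exex = pd.DataFrame({
    "EXEX": [702, 3112, 1360, 702, 221, 591, 3112, 394,
             3416, 1360, 221, 394, 108, 591, 3416, 108],
    "I": [2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 11, 11, 11],
    "J": [3, 4, 5, 2, 5, 11, 2, 5, 11, 2, 3, 4, 11, 3, 4, 5],
})
# Smaller stand-in for the 907x907 frame of zeros.
df = pd.DataFrame(0, index=range(1, 12), columns=range(1, 12))

# Translate labels to positions; get_indexer returns -1 for unknown labels.
rows = df.index.get_indexer(exex["I"])
cols = df.columns.get_indexer(exex["J"])
if (rows < 0).any() or (cols < 0).any():
    raise KeyError("exex refers to a label that is not in df")

# Write into a copy of the data, then rebuild the frame with the same labels.
filled = df.to_numpy(copy=True)
filled[rows, cols] = exex["EXEX"].to_numpy()
df = pd.DataFrame(filled, index=df.index, columns=df.columns)
print(df.loc[1:5, 1:5])
```

Because the lookup goes through the index and column labels, the result stays correct if the labels are reordered or no longer start at 1.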
