I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However, I can't figure out how to handle the argument that .transform() passes to my function:
def k(r):
    return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Is r a list or a single element?
You can use series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
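The ranks above come back as floats because rank() averages tied values by default. If integer positions like those in the question are wanted, one option is to break ties by order of appearance and cast the result. This is a minimal sketch that rebuilds the question's data by hand; the choice of method="first" is an assumption about how ties should be handled.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"srch_id": [1, 1, 1, 3, 3, 3],
                   "price": [30, 20, 25, 15, 102, 39]})

# method="first" gives tied prices distinct ranks in order of appearance,
# so every rank is a whole number and the cast to int is safe
df["price_position"] = (
    df.groupby("srch_id")["price"].rank(method="first").astype(int)
)
print(df)
```

If tied prices should share a position instead, method="min" or method="dense" also yield whole numbers.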
Another option is cumcount on the sorted frame:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Out[1907]:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2
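On the question of what r is inside the function passed to .transform(): as far as I can tell, on a grouped column it is called once per group with that group's values as a Series, neither a plain list nor a single element. The sketch below records the type it sees; the seen_types list is a helper added only for illustration.

```python
import pandas as pd

train = pd.DataFrame({"srch_id": [1, 1, 1, 3, 3, 3],
                      "price": [30, 20, 25, 15, 102, 39]})

seen_types = []  # illustration only: remember what transform hands to k

def k(r):
    seen_types.append(type(r).__name__)
    return r.min()  # a scalar result is broadcast to every row of the group

train["min"] = train.groupby("srch_id")["price"].transform(k)
print(seen_types)
print(train)
```

Because the function returns one scalar per group, transform repeats that value for each row in the group, which is why the new column has the same length as the original frame.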
I want to generate synthetic data for a data science task because we don't have enough labelled data. I want to cut the rows at random positions where the y column is 0, without cutting through a sequence of 1s.
After cutting, I want to shuffle those slices and build a new DataFrame.
Ideally there would be parameters to adjust the maximum and minimum length of a slice, the number of cuts, and so on.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
import numpy as np

# number of cuts
N = 3
# pick N random index labels among the rows where y == 0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
# mark the chosen rows and take the cumulative sum to label the slices
arr = df.index.isin(idx).cumsum()
# shuffle the unique slice labels
u = np.unique(arr)
np.random.shuffle(u)
# reorder the DataFrame slice by slice
df = df.set_index(arr).loc[u].reset_index(drop=True)
print(df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1
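The same idea can be wrapped in a function so that the number of cuts and the random seed become parameters. This is only a sketch: the name shuffle_cuts and its n_cuts and seed arguments are hypothetical, and it does not enforce a minimum or maximum slice length. Each cut is placed immediately before a row where y is 0, so a run of 1s is never split.

```python
import numpy as np
import pandas as pd

def shuffle_cuts(df, n_cuts, seed=None):
    """Cut df before randomly chosen y == 0 rows and shuffle the slices."""
    rng = np.random.default_rng(seed)
    zero_pos = np.flatnonzero(df["y"].to_numpy() == 0)
    n_cuts = min(n_cuts, len(zero_pos))  # cannot cut more often than there are zeros
    cut_pos = rng.choice(zero_pos, size=n_cuts, replace=False)
    # True at each row that starts a new slice; cumsum turns that into slice labels
    starts = np.zeros(len(df), dtype=bool)
    starts[cut_pos] = True
    labels = starts.cumsum()
    order = rng.permutation(np.unique(labels))
    return pd.concat([df[labels == g] for g in order], ignore_index=True)

# Sample data copied from the question
df = pd.DataFrame({
    "ts": range(13),
    "v1": [100, 120, 80, 5, 2, 100, 200, 1234, 12, 40, 200, 300, 0.5],
    "y": [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
})
out = shuffle_cuts(df, n_cuts=3, seed=0)
print(out)
```

Passing a seed makes the shuffle reproducible, which can help when comparing models trained on different synthetic sets.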
I am reading covid-19 data from https://ti.saude.rs.gov.br/covid19/download , and I would like to:
select only rows where 'MUNICIPIO' column has value of 'SÃO LOURENÇO DO SUL';
then sort by column 'DATA_CONFIRMACAO';
then count rows in each group, getting a timeseries where "each point is the number of cases per day";
then plot with x-axis being date, and y-axis being count;
I tried this, without success:
import matplotlib.pyplot as plt
import pandas as pd
# Index(['COD_IBGE', 'MUNICIPIO', 'COD_REGIAO_COVID', 'REGIAO_COVID', 'SEXO',
# 'FAIXAETARIA', 'CRITERIO', 'DATA_CONFIRMACAO', 'DATA_SINTOMAS',
# 'DATA_EVOLUCAO', 'EVOLUCAO', 'HOSPITALIZADO', 'FEBRE', 'TOSSE',
# 'GARGANTA', 'DISPNEIA', 'OUTROS', 'CONDICOES', 'GESTANTE',
# 'DATA_INCLUSAO_OBITO', 'DATA_EVOLUCAO_ESTIMADA', 'RACA_COR',
# 'ETNIA_INDIGENA', 'PROFISSIONAL_SAUDE', 'BAIRRO', 'HOSPITALIZACAO_SRAG',
# 'FONTE_INFORMACAO', 'PAIS_NASCIMENTO', 'PES_PRIV_LIBERDADE'],
url = "https://ti.saude.rs.gov.br/covid19/download"
data = pd.read_csv('covid-rs.csv', delimiter=';')
result = data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL'].groupby('DATA_CONFIRMACAO').count()
print(result)
Output is:
COD_IBGE MUNICIPIO COD_REGIAO_COVID REGIAO_COVID SEXO FAIXAETARIA CRITERIO ... ETNIA_INDIGENA PROFISSIONAL_SAUDE BAIRRO HOSPITALIZACAO_SRAG FONTE_INFORMACAO PAIS_NASCIMENTO PES_PRIV_LIBERDADE
DATA_CONFIRMACAO ...
01/07/2020 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2
01/09/2020 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2
01/12/2020 24 24 24 24 24 24 24 ... 24 24 24 24 24 24 24
02/07/2020 3 3 3 3 3 3 3 ... 3 3 3 3 3 3 3
02/09/2020 5 5 5 5 5 5 5 ... 5 5 5 5 5 5 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
30/11/2020 20 20 20 20 20 20 20 ... 20 20 19 20 20 20 20
31/03/2020 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
31/07/2020 5 5 5 5 5 5 5 ... 5 5 5 5 5 5 5
31/08/2020 7 7 7 7 7 7 7 ... 7 7 7 7 7 7 7
31/10/2020 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
[129 rows x 28 columns]
Try converting your dates to datetime type; groupby will then sort them chronologically instead of as strings. You also get better-looking x-ticks.
url = "https://ti.saude.rs.gov.br/covid19/download"
data = pd.read_csv('covid-rs.csv', delimiter=';',
                   parse_dates=['DATA_CONFIRMACAO'],
                   dayfirst=True)
result = data[data['MUNICIPIO'] == 'SÃO LOURENÇO DO SUL'].groupby('DATA_CONFIRMACAO').count()
print(result)
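Note that .count() returns one count per column, which is why the output in the question has 28 columns. For a single cases-per-day series, .size() on the grouped frame may be closer to what is wanted. The sketch below uses a few made-up rows rather than the real download, so the values are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the real CSV (illustration only)
data = pd.DataFrame({
    "MUNICIPIO": ["SÃO LOURENÇO DO SUL", "SÃO LOURENÇO DO SUL", "PELOTAS",
                  "SÃO LOURENÇO DO SUL", "SÃO LOURENÇO DO SUL"],
    "DATA_CONFIRMACAO": ["01/07/2020", "31/03/2020", "01/07/2020",
                         "01/07/2020", "02/07/2020"],
})

# Parse day-first dates so they sort chronologically, not alphabetically
data["DATA_CONFIRMACAO"] = pd.to_datetime(data["DATA_CONFIRMACAO"],
                                          format="%d/%m/%Y")

town = data[data["MUNICIPIO"] == "SÃO LOURENÇO DO SUL"]
daily = town.groupby("DATA_CONFIRMACAO").size()  # one number per date
print(daily)
```

With matplotlib installed, calling daily.plot() should then draw the dates on the x-axis and the counts on the y-axis.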
I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate percentage of 0/1's for each class, so for example the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480
Divide the Sum column by the per-class totals from GroupBy.transform, which returns a Series of the same length as the original DataFrame, filled with the aggregated values:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64
I have the following dataframe:
df = pd.DataFrame([[1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3],['A','B','B','B','C','D','D','E','A','C','C','C','A','B','B','B','B','D','E'], [18,25,47,27,31,55,13,19,73,55,58,14,2,46,33,35,24,60,7]]).T
df.columns = ['Brand_ID','Category','Price']
Brand_ID Category Price
0 1 A 18
1 1 B 25
2 1 B 47
3 1 B 27
4 1 C 31
5 1 D 55
6 1 D 13
7 1 E 19
8 2 A 73
9 2 C 55
10 2 C 58
11 2 C 14
12 3 A 2
13 3 B 46
14 3 B 33
15 3 B 35
16 3 B 24
17 3 D 60
18 3 E 7
What I need to do is group by Brand_ID and Category and count (similar to the first part of this question). However, I need to write the output into a different column depending on the category. So my output should look as follows:
Brand_ID Category_A Category_B Category_C Category_D Category_E
0 1 1 3 1 2 1
1 2 1 0 3 0 0
2 3 1 4 0 1 1
Is there a way to do this directly with pandas?
Try:
df.groupby(['Brand_ID','Category'])['Price'].count()\
.unstack(fill_value=0)\
.add_prefix('Category_')\
.reset_index()\
.rename_axis([None], axis=1)
Output
Brand_ID Category_A Category_B Category_C Category_D Category_E
0 1 1 3 1 2 1
1 2 1 0 3 0 0
2 3 1 4 0 1 1
OR
pd.crosstab(df.Brand_ID, df.Category)\
.add_prefix('Category_')\
.reset_index()\
.rename_axis([None], axis=1)
You're describing a pivot_table:
df.pivot_table(index='Brand_ID', columns='Category', aggfunc='size', fill_value=0)
Output:
Category A B C D E
Brand_ID
1 1 3 1 2 1
2 1 0 3 0 0
3 1 4 0 1 1
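As a quick check that the groupby/unstack and crosstab approaches agree, the sketch below rebuilds the question's data as a column-wise DataFrame (an assumption: the original was built with a transpose, which leaves object dtypes) and compares the two count tables.

```python
import pandas as pd

# Data copied from the question, built column-wise so dtypes stay numeric
df = pd.DataFrame({
    "Brand_ID": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3],
    "Category": list("ABBBCDDEACCCABBBBDE"),
    "Price": [18, 25, 47, 27, 31, 55, 13, 19, 73, 55, 58, 14,
              2, 46, 33, 35, 24, 60, 7],
})

# Count rows per (brand, category), then spread categories into columns
by_groupby = df.groupby(["Brand_ID", "Category"]).size().unstack(fill_value=0)

# crosstab computes the same frequency table in one call
by_crosstab = pd.crosstab(df["Brand_ID"], df["Category"])

print(by_groupby.values.tolist() == by_crosstab.values.tolist())
```

Either table can then be passed through add_prefix('Category_') and reset_index() to get the flat layout asked for in the question.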
I have a table like the one below. I would like to get this data to SSRS (Grouped by LineID and Product and Column as Hour) to show only those rows where HourCount > 0 for every LineID and Product.
LineID Product Hour HourCount
3 A 0 0
3 A 1 0
3 A 2 0
3 A 3 0
3 A 4 0
3 A 5 0
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
4 B 0 0
4 B 1 0
4 B 2 0
4 B 3 0
4 B 4 0
4 B 5 0
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
Basically I would like this table to look like this before it's in SSRS:
LineID Product Hour HourCount
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
So display a Product for a line only if any of its Hours has an HourCount higher than 0.
Is there a query that could give me these results, or should I play with the display settings in SSRS?
Something like this should work:
with NonZero as
(
  select *
    , GroupZeroCount = sum(HourCount) over (partition by LineID, Product)
  from HourTable
)
select LineID
  , Product
  , [Hour]
  , HourCount
from NonZero
where GroupZeroCount > 0
SQL Fiddle with demo.
You could certainly do something similar in SSRS, but it's much easier and more intuitive to apply at the T-SQL level.
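To try the window-sum filter without a SQL Server instance, the sketch below runs an adapted version against an in-memory SQLite database from Python. The adaptation is an assumption, not the original T-SQL: SQLite does not accept the alias = expression form, so the sum is aliased with "as" and named GroupTotal here, and only a handful of rows from the question are loaded. It also assumes a SQLite build recent enough to support window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table HourTable (LineID int, Product text, Hour int, HourCount int)"
)
# A small subset of the question's rows
rows = [
    (3, "A", 0, 0), (3, "A", 1, 0),    # all zero: should be dropped
    (3, "B", 0, 65), (3, "B", 1, 56),  # all non-zero: kept
    (5, "B", 0, 0), (5, "B", 2, 45),   # mixed: both rows kept
]
conn.executemany("insert into HourTable values (?, ?, ?, ?)", rows)

query = """
with NonZero as (
    select *, sum(HourCount) over (partition by LineID, Product) as GroupTotal
    from HourTable
)
select LineID, Product, Hour, HourCount
from NonZero
where GroupTotal > 0
order by LineID, Product, Hour
"""
kept = conn.execute(query).fetchall()
print(kept)
```

The zero-count row for line 5, product B survives because the filter is applied to the group total, not to each row.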
I think you are looking for
SELECT LineID, Product, Hour, Count(Hour) AS HourCount
FROM abc
GROUP BY LineID, Product, Hour HAVING Count(Hour) > 0