I have a dataset that looks something like the following:
df = pd.DataFrame({"Date":['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05','2021-01-06','2021-01-07'],'Value':[0,0,14,0,0,0,9]})
df['Date']=pd.to_datetime(df['Date'])
df
Date Value
2021-01-01 0
2021-01-02 0
2021-01-03 14
2021-01-04 0
2021-01-05 0
2021-01-06 0
2021-01-07 9
I know the zero days are due to a lack of reporting, so each row with a value represents that day plus the sum of the values from the preceding unreported days. I want to randomly distribute each reported value backwards over those days, for example:
df2 = pd.DataFrame({"Date":['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05','2021-01-06','2021-01-07'],'Value':[2,8,4,3,1,4,1]})
df2['Date']=pd.to_datetime(df2['Date'])
df2
Date Value
2021-01-01 2
2021-01-02 8
2021-01-03 4
2021-01-04 3
2021-01-05 1
2021-01-06 4
2021-01-07 1
(The local 'totals' on 2021-01-03 and 2021-01-07 remain the same)
I know part of the problem is that the intervals of missing/present data aren't consistent...
Any ideas on how to get this done? All advice appreciated.
You can create a group running up to each non-zero Value with ne (not equal) to 0, shift to keep the non-zero value in the right group, and cumsum.
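To see what that key looks like on the example data (just an illustration of the grouping, not part of the final solution):
group_key = df['Value'].ne(0).shift(1, fill_value=False).cumsum()
print(group_key.tolist())  # [0, 0, 0, 1, 1, 1, 1] -> rows 0-2 (ending in 14) and rows 3-6 (ending in 9)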
Then, to split each group into random parts summing to the group's non-zero value, you can refer to this question for example.
Using numpy.random.multinomial as in this answer, you get:
import numpy as np
np.random.seed(1)
df['new_value'] = (
    df.groupby(df['Value'].ne(0).shift(1, fill_value=False).cumsum())
      ['Value'].apply(lambda x: np.random.multinomial(x.max(), [1./len(x)]*len(x)))
      .explode()    # create a Series with one value per row
      .to_numpy()   # index alignment is not possible, so assign by position
)
print(df)
Date Value new_value
0 2021-01-01 0 4
1 2021-01-02 0 6
2 2021-01-03 14 4
3 2021-01-04 0 0
4 2021-01-05 0 2
5 2021-01-06 0 2
6 2021-01-07 9 5
Or you can use this answer, which seems a bit more popular:
import random
def constrained_sum_sample_pos(n, total):
    """Return a randomly chosen list of n positive integers summing to total.
    Each such list is equally likely to occur."""
    dividers = sorted(random.sample(range(1, total), n - 1))
    return [a - b for a, b in zip(dividers + [total], [0] + dividers)]
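As a quick check that the helper does what it says (the exact split varies from run to run):
parts = constrained_sum_sample_pos(3, 14)
print(parts, sum(parts))  # three positive integers that always sum to 14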
df['new_value_r'] = (
    df.groupby(df['Value'].ne(0).shift(1, fill_value=False).cumsum())
      ['Value'].apply(lambda x: constrained_sum_sample_pos(len(x), x.max()))
      .explode()
      .to_numpy()
)
print(df)
Date Value new_value new_value_r
0 2021-01-01 0 4 5
1 2021-01-02 0 6 2
2 2021-01-03 14 4 7
3 2021-01-04 0 0 1
4 2021-01-05 0 2 2
5 2021-01-06 0 2 5
6 2021-01-07 9 5 1
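A quick sanity check, reusing the same group key as above: the redistributed values in each group still sum to the original non-zero total.
key = df['Value'].ne(0).shift(1, fill_value=False).cumsum()
print(df[['Value', 'new_value', 'new_value_r']].astype(int).groupby(key).sum())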
This code works but seems so hairy. Is there a better way to drop 100 rows from a dataframe, starting from the row where a certain value criterion is met?
In my case, I want to find the next row where the value in column_name is < 21000, then drop that row and the next 100 rows of the dataframe.
pd.drop(pd[(pd.index >= pd.loc[pd[column_name] < 21000].index[0])][:100].index, inplace=True)
The index is timedate values.
Given jch's example df, plus a datetime index:
A
2021-01-01 a
2021-01-04 b
2021-01-07 c
2021-01-10 d
2021-01-13 e
2021-01-16 f
2021-01-19 g
2021-01-22 h
2021-01-25 i
2021-01-28 j
For example, let's drop 3 values: 'e' and the two rows after it:
i = df.A.eq('e').argmax()
df = df.drop(df.index[i:i+3])
print(df)
Output:
A
2021-01-01 a
2021-01-04 b
2021-01-07 c
2021-01-10 d
2021-01-22 h
2021-01-25 i
2021-01-28 j
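Carried back to the question's setup (a sketch only; it assumes the frame is named df rather than pd, and drops 100 rows in total starting at the first match):
mask = df[column_name] < 21000          # column_name holds the column's name, as in the question
if mask.any():                          # guard: argmax would return 0 when nothing matches
    i = mask.argmax()                   # positional index of the first matching row
    df = df.drop(df.index[i:i + 100])   # that row plus the 99 after it (use i + 101 for the match plus 100 more)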
Thinking of it from the other direction, you could just keep everything except the block you want to drop. The example below does that, but drops 3 instead of 100.
df = pd.DataFrame({'A':list('abcdefghij')})
print(df)
A
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
Execute
r = df['A'].eq('d').argmax()
pd.concat([df.iloc[:r],df.iloc[r+3:]])
Result
A
0 a
1 b
2 c
6 g
7 h
8 i
9 j
I have a dataframe:
# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7,8])
df['ID'] = [1,1,1,1,2,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=8, freq="M")
df['status'] = ['b','a','b','c','a','d','d','b']
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date status
0 2 2010-08-31 b
1 2 2010-07-31 d
2 2 2010-06-30 d
3 2 2010-05-31 a
4 1 2010-04-30 c
5 1 2010-03-31 b
6 1 2010-02-28 a
7 1 2010-01-31 b
I would like to get the cumulative most frequent status for column status for each ID. This is what I would expect:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN
Interpretation:
for 2010-01-31 the value is NaN because there is no status value in the past. The same holds for 2010-05-31.
for 2010-03-31 the most frequent statuses in the past are a and b (a tie). Therefore we take the most recent one, which is a.
How would you do it?
You can first make a DataFrame with ID and election_date as its index, and one-hot-encoded status values, then calculate cumsum.
We want to pick the most recent status if there is a tie in counts, so I add a small number (less than 1) to the cumulative count of the current row's status; that way, when we later apply idxmax, ties are broken in favour of the most recent status.
After finding the most frequent cumulative status with idxmax we can merge with the original DataFrame:
# make one-hot-encoded status dataframe
z = (df
     .groupby(['ID', 'election_date', 'status'])
     .size().unstack().fillna(0))
# break ties to choose most recent
z = z.groupby(level=0).cumsum() + (z * 1e-4)
# shift by 1 row, since we only count previous status occurrences
z = z.groupby(level=0).shift()
# merge
df.merge(z.idxmax(axis=1).to_frame('cum_most_freq_status').reset_index())
Output:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN
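To see the tie-break in action, print the intermediate frame z before the merge (a quick inspection aid):
# After the shift, the row indexed (1, 2010-03-31) holds one 'a' and one 'b' from
# the past, with the 1e-4 bonus sitting on 'a' (the most recent status),
# which is why idxmax returns 'a' for that date.
print(z.round(4))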
Imagine I have a table like
ID  Date
1   2021-01-01
1   2021-01-05
1   2021-01-17
1   2021-02-01
1   2021-02-18
1   2021-02-28
1   2021-03-30
2   2021-01-01
2   2021-01-14
2   2021-02-15
I want to select all the data in this table, but add a new column with an Event_ID. An event is defined as all rows with the same ID that fall within a 15-day time frame. The catch is that the time frame should roll forward: in the first 3 rows, row 2 is within 15 days of row 1 (so they belong to the same event), and row 3 is within 15 days of row 2 (though more than 15 days from row 1) and should still be added to the same event. (Note: the real table is not ordered like the example; it is sorted here only for convenience.)
The output should be
ID  Date        Event_ID
1   2021-01-01  1
1   2021-01-05  1
1   2021-01-17  1
1   2021-02-01  1
1   2021-02-18  2
1   2021-02-28  2
1   2021-03-30  3
2   2021-01-01  4
2   2021-01-14  4
2   2021-02-15  5
I can also do it in R with data.table (depending on efficiency/performance)
Here is one data.table approach in R:
library(data.table)
#Change to data.table
setDT(df)
#Order the dataset
setorder(df, ID, Date)
#Flag the start of a new event: the first row of each ID, or a gap of more than 15 days
df[, greater_than_15 := c(TRUE, diff(Date) > 15), ID]
#Take cumulative sum to create consecutive event id.
df[, Event_ID := cumsum(greater_than_15)]
df
# ID Date greater_than_15 Event_ID
# 1: 1 2021-01-01 TRUE 1
# 2: 1 2021-01-05 FALSE 1
# 3: 1 2021-01-17 FALSE 1
# 4: 1 2021-02-01 FALSE 1
# 5: 1 2021-02-18 TRUE 2
# 6: 1 2021-02-28 FALSE 2
# 7: 1 2021-03-30 TRUE 3
# 8: 2 2021-01-01 TRUE 4
# 9: 2 2021-01-14 FALSE 4
#10: 2 2021-02-15 TRUE 5
data
df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
Date = structure(c(18628, 18632, 18644, 18659, 18676, 18686, 18716,
18628, 18641, 18673), class = "Date")),
row.names = c(NA, -10L), class = "data.frame")
An R solution using a dplyr approach together with the rleid function from data.table:
library(dplyr)
library(data.table)
df %>%
  group_by(ID) %>%
  mutate(Date = as.Date(Date)) %>%   # make sure Date is of class Date
  arrange(ID, Date) %>%              # order the rows by ID and Date
  mutate(Event = if_else(is.na(Date - lag(Date)), Date - Date, Date - lag(Date)),
         Event = paste(ID, cumsum(if_else(Event > 15, 1, 0)), sep = "_")) %>%
  ungroup() %>%                      # the event numbers are not to be created group-wise
  mutate(Event = rleid(Event))
# A tibble: 10 x 3
      ID Date       Event
   <dbl> <date>     <int>
 1     1 2021-01-01     1
 2     1 2021-01-05     1
 3     1 2021-01-17     1
 4     1 2021-02-01     1
 5     1 2021-02-18     2
 6     1 2021-02-28     2
 7     1 2021-03-30     3
 8     2 2021-01-01     4
 9     2 2021-01-14     4
10     2 2021-02-15     5
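For completeness, here is how the same gap-then-cumsum idea could look in pandas (a sketch, assuming the table is loaded into a DataFrame df with an ID column and a datetime Date column):
import pandas as pd

# Sort within ID, flag gaps of more than 15 days, and start a new event at every
# flagged row and at the first row of each ID; a global cumsum then yields
# consecutive Event_IDs across IDs, as in the expected output.
df = df.sort_values(['ID', 'Date'])
gap = df.groupby('ID')['Date'].diff().gt(pd.Timedelta(days=15))
new_event = gap | df['ID'].ne(df['ID'].shift())
df['Event_ID'] = new_event.cumsum()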
I have a DataFrame, df, like:
id date
a 2019-07-11
a 2019-07-16
b 2018-04-01
c 2019-08-10
c 2019-07-11
c 2018-05-15
I want to add a count column that shows how many rows with the same id exist in the data with a date before the date of that row. Meaning:
id date count
a 2019-07-11 0
a 2019-07-16 1
b 2018-04-01 0
c 2019-08-10 2
c 2019-07-11 1
c 2018-05-15 0
If you believe it is easier in SQL and know how to do it, that works for me too.
Do this:
In [1688]: df.sort_values('date').groupby('id').cumcount()
Out[1688]:
2 0
5 0
0 0
4 1
1 1
3 2
dtype: int64
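To attach it as the count column, assign it back to df; index alignment puts each value on its original row:
df['count'] = df.sort_values('date').groupby('id').cumcount()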
Here is my dataframe:
my_df = pd.DataFrame({'group':['a','a', 'a','b','b'], 'date':['2017-01-02', '2017-01-02','2017-03-01', '2018-02-05', '2018-04-06']})
my_df['date']= pd.to_datetime(my_df['date'], format = '%Y-%m-%d')
I would like to add a rank per group, where equal values are assigned the same rank.
Here is what I would like as output:
date group rank
0 2017-01-02 a 1
1 2017-01-02 a 1
2 2017-03-01 a 2
3 2018-02-05 b 1
4 2018-04-06 b 2
I guess I can do it by grouping, ranking, and joining back to the original dataframe, but I wonder if there is a faster way to do it.
Just use rank with method='dense':
my_df.groupby(['group'])['date'].rank(method ='dense')
Out[6]:
0 1.0
1 1.0
2 2.0
3 1.0
4 2.0
Name: date, dtype: float64
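Assign it back and cast to int to match the desired output exactly:
my_df['rank'] = my_df.groupby(['group'])['date'].rank(method='dense').astype(int)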
You could use transform with factorize:
my_df['group_rank'] = my_df.groupby(['group'])['date'].transform(lambda x: x.factorize()[0])
>>> my_df
date group group_rank
0 2017-01-02 a 0
1 2017-01-02 a 0
2 2017-03-01 a 1
3 2018-02-05 b 0
4 2018-04-06 b 1
If you add + 1 to the end of that, you will get ranks of 1 and 2 as in your desired output, but I thought this might not be important (since the values are properly binned together in any case).
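For reference, a sketch of the +1 variant; note that factorize numbers values by order of first appearance within each group, so this only matches a dense rank when the dates are already sorted within each group (as they are here):
my_df['group_rank'] = my_df.groupby(['group'])['date'].transform(lambda x: x.factorize()[0]) + 1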