Pandas: Add new column with several values to groupby dataframe

For my dataframe, I want to add a new column for every single unique value in another column. The new column consists of several datetime entries, and every unique value of the other column should get all of them.
Example:
Original Df:
ID
1
2
3
New Column DF:
Date
2015/01/01
2015/02/01
2015/03/01
Resulting Df:
ID Date
1  2015/01/01
1  2015/02/01
1  2015/03/01
2  2015/01/01
2  2015/02/01
2  2015/03/01
3  2015/01/01
3  2015/02/01
3  2015/03/01
I tried to stick to this solution: https://stackoverflow.com/a/12394122/3856569
But it gives me the following error: Length of values does not match length of index
Does anyone have a simple solution for this? Thanks a lot!
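For reference, the operation being asked for is a cross join between the IDs and the dates. On pandas 1.2 or newer, merge supports this directly; a minimal sketch of the desired result (assuming a recent pandas):

import pandas as pd

ids = pd.DataFrame({'ID': [1, 2, 3]})
dates = pd.DataFrame({'Date': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01'])})
# how='cross' pairs every ID with every date (requires pandas >= 1.2)
ids.merge(dates, how='cross')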

UPDATE: replicating ids 6 times:
In [172]: %paste
data = """\
id
1
2
3
"""
df = pd.read_csv(io.StringIO(data))
# repeat each ID 6 times
df = pd.DataFrame(df['id'].tolist()*6, columns=['id'])
start_date = pd.to_datetime('2015-01-01')
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
.transform(lambda x: pd.date_range(start_date,
freq='1D',
periods=len(x)))
df.sort_values(by=['id','date'])
## -- End pasted text --
Out[172]:
id date
0 1 2015-01-01
3 1 2015-01-02
6 1 2015-01-03
9 1 2015-01-04
12 1 2015-01-05
15 1 2015-01-06
1 2 2015-01-01
4 2 2015-01-02
7 2 2015-01-03
10 2 2015-01-04
13 2 2015-01-05
16 2 2015-01-06
2 3 2015-01-01
5 3 2015-01-02
8 3 2015-01-03
11 3 2015-01-04
14 3 2015-01-05
17 3 2015-01-06
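The same per-group date sequence can also be built without transform, by offsetting the start date with each row's position inside its group; a sketch, assuming a fixed daily frequency:

# cumcount() numbers the rows within each id group 0, 1, 2, ...
df['date'] = start_date + pd.to_timedelta(df.groupby('id').cumcount(), unit='D')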
Older, more generic answer:
Prepare a sample DF:
start_date = pd.to_datetime('2015-01-01')
data = """\
id
1
2
2
3
1
2
3
2
1
"""
df = pd.read_csv(io.StringIO(data))
In [200]: df
Out[200]:
id
0 1
1 2
2 2
3 3
4 1
5 2
6 3
7 2
8 1
Solution:
In [201]: %paste
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
              .transform(lambda x: pd.date_range(start_date,
                                                 freq='1D',
                                                 periods=len(x)))
## -- End pasted text --
In [202]: df
Out[202]:
id date
0 1 2015-01-01
1 2 2015-01-01
2 2 2015-01-02
3 3 2015-01-01
4 1 2015-01-02
5 2 2015-01-03
6 3 2015-01-02
7 2 2015-01-04
8 1 2015-01-03
Sorted:
In [203]: df.sort_values(by='id')
Out[203]:
id date
0 1 2015-01-01
4 1 2015-01-02
8 1 2015-01-03
1 2 2015-01-01
2 2 2015-01-02
5 2 2015-01-03
7 2 2015-01-04
3 3 2015-01-01
6 3 2015-01-02

A rather straightforward numpy approach, making use of repeat and tile:
import numpy as np
import pandas as pd
N = 3 # arbitrary number of IDs/dates
ID = np.arange(N) + 1
dates = pd.date_range('20160101', periods=N)
df = pd.DataFrame({'ID': np.repeat(ID, N),
                   'dates': np.tile(dates, N)})
Resulting DataFrame:
In [1]: df
Out[1]:
ID dates
0 1 2016-01-01
1 1 2016-01-02
2 1 2016-01-03
3 2 2016-01-01
4 2 2016-01-02
5 2 2016-01-03
6 3 2016-01-01
7 3 2016-01-02
8 3 2016-01-03
Update
Assuming you already have a DataFrame of IDs, as pointed out by MaxU, you can tile the IDs and repeat the dates:
df = pd.DataFrame({'ID': np.tile(df['ID'], N),
                   'dates': np.repeat(dates, N)})
# now df needs sorting
df = df.sort_values(by=['ID', 'dates'])
Resulting DataFrame:
In [5]: df
Out[5]:
ID dates
0 1 2016-01-01
3 1 2016-01-02
6 1 2016-01-03
1 2 2016-01-01
4 2 2016-01-02
7 2 2016-01-03
2 3 2016-01-01
5 3 2016-01-02
8 3 2016-01-03
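Either construction can also be written with pd.MultiIndex.from_product, which enumerates every (ID, date) pair already in sorted order; a sketch under the same setup:

import numpy as np
import pandas as pd

N = 3
ID = np.arange(N) + 1
dates = pd.date_range('20160101', periods=N)
# from_product builds the full cross product; to_frame turns it into columns
df = pd.MultiIndex.from_product([ID, dates], names=['ID', 'dates']).to_frame(index=False)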

Related

Mumbojumbo .rolling() .max() .groupby() combination in python pandas

I am looking to do a "rolling" .max() / .min() of column B, grouped by date (the column A values). The trick is that it should start again on every row, so I can't use something like df['MAX'] = df['B'].rolling(10).max().shift(-9) (because it needs to end where the group ends, and every group can have a different number of rows), nor a plain groupby on column A. I need a rolling max/min that starts on each row and ends where its group ends: for row 1, column C is the max of rows 1-4 in column B; for row 2, the max of rows 2-4; for row 3, the max of rows 3-4; for row 4, the max of row 4 alone, and so on. Hope it makes sense; columns C and D are the desired results. Thank you all in advance.
A B C(max) D(min)
1 2016-01-01 0 7 0
2 2016-01-01 7 7 3
3 2016-01-01 3 4 3
4 2016-01-01 4 4 4
5 2016-01-02 2 5 1
6 2016-01-02 5 5 1
7 2016-01-02 1 1 1
8 2016-01-03 1 4 1
9 2016-01-03 3 4 2
10 2016-01-03 4 4 2
11 2016-01-03 2 2 2
# reverse each group, take the running max/min, then reverse back,
# so every row gets the max/min from itself to the end of its group
df['C_max'] = df.groupby('A')['B'].transform(lambda x: x[::-1].cummax()[::-1])
df['D_min'] = df.groupby('A')['B'].transform(lambda x: x[::-1].cummin()[::-1])
A B C(max) D(min) C_max D_min
1 2016-01-01 0 7 0 7 0
2 2016-01-01 7 7 3 7 3
3 2016-01-01 3 4 3 4 3
4 2016-01-01 4 4 4 4 4
5 2016-01-02 2 5 1 5 1
6 2016-01-02 5 5 1 5 1
7 2016-01-02 1 1 1 1 1
8 2016-01-03 1 4 1 4 1
9 2016-01-03 3 4 2 4 2
10 2016-01-03 4 4 2 4 2
11 2016-01-03 2 2 2 2 2
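The same trick can be written without a Python-level lambda by reversing the whole frame once and taking grouped cumulative extrema; a sketch, assuming the rows are already ordered within each group:

# in the reversed frame, a cumulative max/min per group is exactly
# "max/min from this row to the end of its group" in the original order;
# assignment aligns on the index, so no need to re-reverse the result
df_rev = df[::-1]
df['C_max'] = df_rev.groupby('A')['B'].cummax()
df['D_min'] = df_rev.groupby('A')['B'].cummin()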

Create new Event_ID based on ID with sliding window on date column

Imagine I have a table like

ID Date
1  2021-01-01
1  2021-01-05
1  2021-01-17
1  2021-02-01
1  2021-02-18
1  2021-02-28
1  2021-03-30
2  2021-01-01
2  2021-01-14
2  2021-02-15
I want to select all data in this table, but create a new column with a new Event_ID. An event is defined as all the rows with the same ID within a time frame of 15 days. The catch is that the time frame slides, as in the first three rows: row 2 is within 15 days of row 1 (so they belong to the same event), and row 3 is within 15 days of row 2 (but further apart from row 1), yet I want it added to the same event. (Note: the real table is not ordered like the example; that was just for convenience.)
The output should be

ID Date       Event_ID
1  2021-01-01 1
1  2021-01-05 1
1  2021-01-17 1
1  2021-02-01 1
1  2021-02-18 2
1  2021-02-28 2
1  2021-03-30 3
2  2021-01-01 4
2  2021-01-14 4
2  2021-02-15 5
I can also do it in R with data.table (depending on efficiency/performance)
Here is one data.table approach in R :
library(data.table)
#Change to data.table
setDT(df)
#Order the dataset
setorder(df, ID, Date)
#Set flag to TRUE/FALSE if difference is greater than 15
df[, greater_than_15 := c(TRUE, diff(Date) > 15), ID]
#Take cumulative sum to create consecutive event id.
df[, Event_ID := cumsum(greater_than_15)]
df
# ID Date greater_than_15 Event_ID
# 1: 1 2021-01-01 TRUE 1
# 2: 1 2021-01-05 FALSE 1
# 3: 1 2021-01-17 FALSE 1
# 4: 1 2021-02-01 FALSE 1
# 5: 1 2021-02-18 TRUE 2
# 6: 1 2021-02-28 FALSE 2
# 7: 1 2021-03-30 TRUE 3
# 8: 2 2021-01-01 TRUE 4
# 9: 2 2021-01-14 FALSE 4
#10: 2 2021-02-15 TRUE 5
data
df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
Date = structure(c(18628, 18632, 18644, 18659, 18676, 18686, 18716,
18628, 18641, 18673), class = "Date")),
row.names = c(NA, -10L), class = "data.frame")
An R solution may use the dplyr approach together with the rleid function from data.table:
library(dplyr)
library(data.table)
df %>% group_by(ID) %>%
mutate(Date = as.Date(Date)) %>% #mutating Date column as Date
arrange(ID, Date) %>% #arranging the rows in order
mutate(Event = if_else(is.na(Date - lag(Date)), Date - Date, Date - lag(Date)),
Event = paste(ID, cumsum(if_else(Event > 15, 1, 0)), sep = "_")) %>%
ungroup() %>% #since the event numbers are not to be created group-wise
mutate(Event = rleid(Event))
# A tibble: 10 x 3
ID Date Event
<dbl> <date> <int>
1 1 2021-01-01 1
2 1 2021-01-05 1
3 1 2021-01-17 1
4 1 2021-02-01 1
5 1 2021-02-18 2
6 1 2021-02-28 2
7 1 2021-03-30 3
8 2 2021-01-01 4
9 2 2021-01-14 4
10 2 2021-02-15 5
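The same cumulative-flag logic ports to pandas as well; a sketch, assuming the column names from the question:

import pandas as pd

df = pd.DataFrame({
    'ID': [1]*7 + [2]*3,
    'Date': pd.to_datetime(['2021-01-01', '2021-01-05', '2021-01-17',
                            '2021-02-01', '2021-02-18', '2021-02-28',
                            '2021-03-30', '2021-01-01', '2021-01-14',
                            '2021-02-15']),
}).sort_values(['ID', 'Date'])

# a new event starts at the first row of each ID, or whenever the gap
# to the previous row within the same ID exceeds 15 days
gap = df.groupby('ID')['Date'].diff().dt.days
df['Event_ID'] = (gap.isna() | (gap > 15)).cumsum()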

Repeat pandas rows based on content of a list

I have a large pandas dataframe df as:
Col1 Col2
2 4
3 5
I have a large list as:
['2020-08-01', '2021-09-01', '2021-11-01']
I am trying to achieve the following:
Col1 Col2 StartDate
2 4 8/1/2020
3 5 8/1/2020
2 4 9/1/2021
3 5 9/1/2021
2 4 11/1/2021
3 5 11/1/2021
Basically, tile the dataframe df while adding the elements of the list as a new column. I am not sure how to approach this.
Let's use a list comprehension with assign and pd.concat:
l = ['2020-08-01', '2021-09-01', '2021-11-01']
pd.concat([df.assign(startDate=i) for i in l], ignore_index=True)
Output:
Col1 Col2 startDate
0 2 4 2020-08-01
1 3 5 2020-08-01
2 2 4 2021-09-01
3 3 5 2021-09-01
4 2 4 2021-11-01
5 3 5 2021-11-01
You can try a combination of np.tile and np.repeat:
df.loc[np.tile(df.index, len(l))].assign(StartDate=np.repeat(l, len(df)))
Output:
Col1 Col2 StartDate
0 2 4 2020-08-01
1 3 5 2020-08-01
0 2 4 2021-09-01
1 3 5 2021-09-01
0 2 4 2021-11-01
1 3 5 2021-11-01
You can also cross join using merge after creating a df from the list:
l = ['2020-08-01', '2021-09-01', '2021-11-01']
(df.assign(k=1).merge(pd.DataFrame({'StartDate': l, 'k': 1}), on='k')
   .sort_values('StartDate').drop(columns='k'))
Col1 Col2 StartDate
0 2 4 2020-08-01
3 3 5 2020-08-01
1 2 4 2021-09-01
4 3 5 2021-09-01
2 2 4 2021-11-01
5 3 5 2021-11-01
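On pandas 1.2 or newer the dummy k key is unnecessary: merge(how='cross') performs the cross join directly (a sketch):

df.merge(pd.DataFrame({'StartDate': l}), how='cross').sort_values('StartDate')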
I would use concat:
df = pd.DataFrame({'col1': [2,3], 'col2': [4, 5]})
dict_dfs = {k: df for k in ['2020-08-01', '2021-09-01', '2021-11-01']}
pd.concat(dict_dfs)
Then you can rename and clean the index.
col1 col2
2020-08-01 0 2 4
1 3 5
2021-09-01 0 2 4
1 3 5
2021-11-01 0 2 4
1 3 5
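The rename-and-clean step mentioned above might look like this (a sketch; the column name is assumed):

out = pd.concat(dict_dfs)                # dict keys become the outer index level
out.index.names = ['StartDate', None]    # name that level
out = out.reset_index(level='StartDate').reset_index(drop=True)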
I may do itertools; notice the ordering can be done with sort_values based on column 1:
import itertools
df = pd.DataFrame([*itertools.product(df.index, l)]).set_index(0).join(df)
1 Col1 Col2
0 2020-08-01 2 4
0 2021-09-01 2 4
0 2021-11-01 2 4
1 2020-08-01 3 5
1 2021-09-01 3 5
1 2021-11-01 3 5

Select the last value in time after multiple groupings

I want to group by 'name' first, then resample by day and select the last value of each 'name' every day.
I got some ideas from here: pandas - how to organised dataframe based on date and assign new values to column
I tried this, but I can't succeed. Is there any good way?
df = df.groupby(df['name']).resample('D',on='Timestamp').apply(['last'])
eg:
import pandas as pd
N = 9
rng = pd.date_range('2011-01-01', periods=N, freq='15S')
df = pd.DataFrame({'Timestamp': rng, 'name': ['A','A','B','B','B','B','C','C','C'],
                   'value': [1, 2, 3, 2, 3, 1, 3, 4, 3], 'Temp': range(N)})
[out]:
Timestamp name value Temp
0 2011-01-01 00:00:00 A 1 0
1 2011-01-01 00:00:15 A 2 1
2 2011-01-01 00:00:30 B 3 2
3 2011-01-01 00:00:45 B 2 3
4 2011-01-01 00:01:00 B 3 4
5 2011-01-01 00:01:15 B 1 5
6 2011-01-01 00:01:30 C 3 6
7 2011-01-01 00:01:45 C 4 7
8 2011-01-01 00:02:00 C 3 8
I want to get these:
[out]:
Timestamp name value Temp
1 2011-01-01 00:00:15 A 2 1
5 2011-01-01 00:01:15 B 1 5
8 2011-01-01 00:02:00 C 3 8
IIUC
df.groupby('name').tail(1)
Out[25]:
Temp Timestamp name value
1 1 2011-01-01 00:00:15 A 2
5 5 2011-01-01 00:01:15 B 1
8 8 2011-01-01 00:02:00 C 3
Or
df.drop_duplicates('name',keep='last')
Out[26]:
Temp Timestamp name value
1 1 2011-01-01 00:00:15 A 2
5 5 2011-01-01 00:01:15 B 1
8 8 2011-01-01 00:02:00 C 3
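One caveat: both tail(1) and drop_duplicates(..., keep='last') pick the physically last row per name, so if the frame is not already sorted by time, sort it first (a sketch):

# make "last" mean "latest in time" regardless of the input row order
df.sort_values('Timestamp').groupby('name').tail(1)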
If you need the last values per day and per name, use GroupBy.tail with Grouper:
df1 = df.groupby([pd.Grouper(freq='D', key='Timestamp'), 'name']).tail(1)
print (df1)
Timestamp name value Temp
1 2011-01-01 00:00:15 A 2 1
5 2011-01-01 00:01:15 B 1 5
8 2011-01-01 00:02:00 C 3 8
Or convert values of Timestamp to dates by Series.dt.date:
df2 = df.groupby([df['Timestamp'].dt.date, 'name']).tail(1)
print (df2)
Timestamp name value Temp
1 2011-01-01 00:00:15 A 2 1
5 2011-01-01 00:01:15 B 1 5
8 2011-01-01 00:02:00 C 3 8
There are also alternatives with Series.dt.normalize:
df2 = df.groupby([df['Timestamp'].dt.normalize(), 'name']).tail(1)
Or Series.dt.floor:
df2 = df.groupby([df['Timestamp'].dt.floor('D'), 'name']).tail(1)

Pandas sum over multiple columns after group by

If I have a data set where the columns are something like:
Day Column2 Column3 Column4......Column100
Is there a better way to do something like the below?
grouped_df = df.groupby('Day').agg({
    'Column2': lambda x: sum(x),
    'Column3': lambda x: sum(x),
    'Column4': lambda x: sum(x),
    ..........
    'Column100': lambda x: sum(x)})
What I have works, but I'm wondering if there is a more elegant solution.
Thank you!
You can try df.groupby('Day').sum(), just like MaxU said.
You can do it this way:
In [17]: df
Out[17]:
a b c d e Day
0 7 5 4 9 4 2016-01-01
1 2 1 5 4 5 2014-01-01
2 2 8 8 6 9 2014-01-01
3 1 4 4 3 7 2015-01-01
4 5 6 7 9 5 2016-01-01
5 3 6 0 8 7 2015-01-01
6 7 4 4 5 5 2014-01-01
7 1 1 0 1 6 2015-01-01
8 7 8 9 8 3 2015-01-01
9 8 5 5 2 8 2015-01-01
10 6 1 3 0 3 2014-01-01
11 1 8 2 7 2 2016-01-01
12 2 5 2 5 1 2016-01-01
13 1 2 3 2 2 2016-01-01
14 7 4 9 5 2 2016-01-01
15 4 0 8 9 5 2015-01-01
16 8 5 8 9 7 2015-01-01
17 6 7 9 5 4 2016-01-01
18 7 4 2 3 2 2016-01-01
19 2 7 8 6 8 2015-01-01
In [18]: cols = df.columns
In [19]: cols[1:]
Out[19]: Index(['b', 'c', 'd', 'e', 'Day'], dtype='object')
In [20]: df.loc[:, cols[1:]].groupby('Day').sum()
Out[20]:
b c d e
Day
2014-01-01 14 20 15 22
2015-01-01 36 42 46 51
2016-01-01 41 38 45 22
Setup for the sample DF:
rows = 20
df = pd.DataFrame(np.random.randint(0, 10, size=(rows, 5)), columns=list('abcde'))
dates = [pd.to_datetime(d) for d in ['2016-01-01','2015-01-01','2014-01-01']]
df['Day'] = np.random.choice(dates, len(df))
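If the goal is simply to sum every numeric column per Day without listing them, a shorter sketch on modern pandas:

# numeric_only restricts the sum to numeric columns instead of raising
df.groupby('Day').sum(numeric_only=True)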