Select the last value in time after multiple groupings - pandas

I want to group by ‘name’ first, then resample by day and select the last value of each ‘name’ on each day.
I got some ideas from here: pandas - how to organised dataframe based on date and assign new values to column
I tried the following, but couldn't get it to work. Is there a good way to do this?
df = df.groupby(df['name']).resample('D', on='Timestamp').apply(['last'])
e.g.:
import pandas as pd

N = 9
rng = pd.date_range('2011-01-01', periods=N, freq='15S')
df = pd.DataFrame({'Timestamp': rng,
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, 2, 3, 2, 3, 1, 3, 4, 3],
                   'Temp': range(N)})
[out]:
Timestamp name value Temp
0 2011-01-01 00:00:00 A 1 0
1 2011-01-01 00:00:15 A 2 1
2 2011-01-01 00:00:30 B 3 2
3 2011-01-01 00:00:45 B 2 3
4 2011-01-01 00:01:00 B 3 4
5 2011-01-01 00:01:15 B 1 5
6 2011-01-01 00:01:30 C 3 6
7 2011-01-01 00:01:45 C 4 7
8 2011-01-01 00:02:00 C 3 8
I want to get this:
[out]:
Timestamp name value Temp
1 2011-01-01 00:00:15 A 2 1
5 2011-01-01 00:01:15 B 1 5
8 2011-01-01 00:02:00 C 3 8

IIUC
df.groupby('name').tail(1)
Out[25]:
Temp Timestamp name value
1 1 2011-01-01 00:00:15 A 2
5 5 2011-01-01 00:01:15 B 1
8 8 2011-01-01 00:02:00 C 3
Or
df.drop_duplicates('name', keep='last')
Out[26]:
Temp Timestamp name value
1 1 2011-01-01 00:00:15 A 2
5 5 2011-01-01 00:01:15 B 1
8 8 2011-01-01 00:02:00 C 3
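Note that both of the above keep the physically last row per name; if the frame is not guaranteed to be in chronological order, sorting first is safer (a minimal sketch, assuming Timestamp may be unsorted):
# take the chronologically last row per name, regardless of input order
df.sort_values('Timestamp').groupby('name').tail(1)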

If you need the last values per day and per name, use GroupBy.tail with pd.Grouper:
df1 = df.groupby([pd.Grouper(freq='D', key='Timestamp'), 'name']).tail(1)
print(df1)
Timestamp name value Temp
1 2011-01-01 00:00:15 A 2 1
5 2011-01-01 00:01:15 B 1 5
8 2011-01-01 00:02:00 C 3 8
Or convert the Timestamp values to dates with Series.dt.date:
df2 = df.groupby([df['Timestamp'].dt.date, 'name']).tail(1)
print(df2)
Timestamp name value Temp
1 2011-01-01 00:00:15 A 2 1
5 2011-01-01 00:01:15 B 1 5
8 2011-01-01 00:02:00 C 3 8
There are also alternatives with Series.dt.normalize:
df2 = df.groupby([df['Timestamp'].dt.normalize(), 'name']).tail(1)
Or Series.dt.floor:
df2 = df.groupby([df['Timestamp'].dt.floor('D'), 'name']).tail(1)
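For completeness, the resample route the question attempted can be made to work; a sketch, assuming a recent pandas version, where replacing apply(['last']) with last() is the usual fix:
out = df.groupby('name').resample('D', on='Timestamp').last()
# out is indexed by (name, day); note resample also emits all-NaN rows for
# empty days inside a name's date range, which the tail(1) answers avoid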

Related

Repeat pandas rows based on content of a list

I have a large pandas dataframe df as:
Col1 Col2
2 4
3 5
I have a large list as:
['2020-08-01', '2021-09-01', '2021-11-01']
I am trying to achieve the following:
Col1 Col2 StartDate
2 4 8/1/2020
3 5 8/1/2020
2 4 9/1/2021
3 5 9/1/2021
2 4 11/1/2021
3 5 11/1/2021
Basically, tile the dataframe df while adding the elements of the list as a new column. I am not sure how to approach this.
Let's use a list comprehension with assign and pd.concat:
l = ['2020-08-01', '2021-09-01', '2021-11-01']
pd.concat([df.assign(StartDate=i) for i in l], ignore_index=True)
Output:
Col1 Col2 StartDate
0 2 4 2020-08-01
1 3 5 2020-08-01
2 2 4 2021-09-01
3 3 5 2021-09-01
4 2 4 2021-11-01
5 3 5 2021-11-01
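The list entries are strings; if StartDate should hold real datetimes (an assumption based on the date formatting in the desired output), a conversion afterwards does it:
res = pd.concat([df.assign(StartDate=i) for i in l], ignore_index=True)
# parse the string labels into true datetime64 values
res['StartDate'] = pd.to_datetime(res['StartDate'])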
You can try a combination of np.tile and np.repeat:
lst = ['2020-08-01', '2021-09-01', '2021-11-01']
df.loc[np.tile(df.index, len(lst))].assign(StartDate=np.repeat(lst, len(df)))
Output:
Col1 Col2 StartDate
0 2 4 2020-08-01
1 3 5 2020-08-01
0 2 4 2021-09-01
1 3 5 2021-09-01
0 2 4 2021-11-01
1 3 5 2021-11-01
You can also cross join using merge after creating a DataFrame from the list:
l = ['2020-08-01', '2021-09-01', '2021-11-01']
(df.assign(k=1)
   .merge(pd.DataFrame({'StartDate': l, 'k': 1}), on='k')
   .sort_values('StartDate')
   .drop(columns='k'))
Col1 Col2 StartDate
0 2 4 2020-08-01
3 3 5 2020-08-01
1 2 4 2021-09-01
4 3 5 2021-09-01
2 2 4 2021-11-01
5 3 5 2021-11-01
I would use concat:
df = pd.DataFrame({'col1': [2,3], 'col2': [4, 5]})
dict_dfs = {k: df for k in ['2020-08-01', '2021-09-01', '2021-11-01']}
pd.concat(dict_dfs)
Then you can rename and clean the index.
col1 col2
2020-08-01 0 2 4
1 3 5
2021-09-01 0 2 4
1 3 5
2021-11-01 0 2 4
1 3 5
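A sketch of that cleanup: name the key level of the MultiIndex and flatten it back into a column (the column name StartDate is an assumption taken from the question):
out = (pd.concat(dict_dfs)
         .rename_axis(['StartDate', None])   # name the dict-key level
         .reset_index(level='StartDate')     # move it into a column
         .reset_index(drop=True))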
I may do it with itertools; note the requested row order can be recovered with sort_values on the date column (see the sketch after the output):
import itertools
df = pd.DataFrame([*itertools.product(df.index, l)]).set_index(0).join(df)
1 Col1 Col2
0 2020-08-01 2 4
0 2021-09-01 2 4
0 2021-11-01 2 4
1 2020-08-01 3 5
1 2021-09-01 3 5
1 2021-11-01 3 5
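A sketch of the cleanup alluded to above: give the product's columns real names up front, then sort on the date column to recover the requested order (the names idx and StartDate are illustrative):
out = (pd.DataFrame([*itertools.product(df.index, l)], columns=['idx', 'StartDate'])
         .set_index('idx')
         .join(df)                  # attach Col1/Col2 by original row index
         .sort_values('StartDate')
         .reset_index(drop=True))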

How to create a new column of conditional count in a Pandas' DataFrame

I have a DataFrame, df, like:
id date
a 2019-07-11
a 2019-07-16
b 2018-04-01
c 2019-08-10
c 2019-07-11
c 2018-05-15
I want to add a count column that shows, for each row, how many rows with the same id have a date earlier than that row's date. Meaning:
id date count
a 2019-07-11 0
a 2019-07-16 1
b 2018-04-01 0
c 2019-08-10 2
c 2019-07-11 1
c 2018-05-15 0
If you believe it is easier in SQL and know how to do it, that works for me too.
Do this:
In [1688]: df.sort_values('date').groupby('id').cumcount()
Out[1688]:
2 0
5 0
0 0
4 1
1 1
3 2
dtype: int64
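To attach that as the requested count column, the result can be assigned straight back, since cumcount keeps the original index labels (a minimal sketch of that step):
# index alignment restores the original row order on assignment
df['count'] = df.sort_values('date').groupby('id').cumcount()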

Using column values from 1 df to another pandas

[IN] df
[OUT]:
customer_id Order_date Status
1 2015-01-16 R
1 2015-01-19 G
2 2014-12-21 R
2 2015-01-10 G
1 2015-01-10 B
3 2018-01-18 Y
3 2017-03-04 Y
4 2019-11-05 B
4 2010-01-01 G
3 2019-02-03 U
3 2020-01-01 R
3 2018-01-01 R
Code to extract the customer IDs where the count of order transactions is at least 3:
[IN]
df22 = (df.groupby('customer_id')['order_date'].nunique()
          .loc[lambda x: x >= 3]
          .reset_index()
          .rename(columns={'order_date': 'Count_Order_Date'}))
[OUT]
Customer_id Count_Order_Dates
1 3
3 5
Output I want:
I want to use the IDs obtained from the code above to filter the original dataframe df, so I need the output as follows:
[OUT]
customer_id Order_date Status
1 2015-01-16 R
1 2015-01-19 G
1 2015-01-10 B
3 2018-01-18 Y
3 2017-03-04 Y
3 2019-02-03 U
3 2020-01-01 R
3 2018-01-01 R
So in the output only IDs 1 and 3 are reflected (the ones with at least 3 unique order dates).
What I have tried so far (which has failed):
df[df['customer_id'].isin(df22['customer_id'])]
I feel it failed because df['customer_id'].nunique() and df22['customer_id'].nunique() return different values.
It was a simple error: I had forgotten to reassign the result back to df.
So doing
df = df[df['customer_id'].isin(df22['customer_id'])]
solved my problem.
Thanks @YOandBEN_W for pointing it out.
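For reference, a transform-based filter reaches the same result in one step, without building df22 first; a sketch, assuming the column is spelled order_date as in the snippet above:
# keep only customers with at least 3 distinct order dates
df[df.groupby('customer_id')['order_date'].transform('nunique') >= 3]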

Pandas sum over multiple columns after group by

If I have a data set where the columns are something like:
Day Column2 Column3 Column4......Column100
Is there a better way to do something like the below?
grouped_df = df.groupby('Day').agg({
    'Column2': lambda x: sum(x),
    'Column3': lambda x: sum(x),
    'Column4': lambda x: sum(x),
    ...
    'Column100': lambda x: sum(x)})
What I have works, but I am wondering if there is a more elegant solution.
Thank you
You can try df.groupby('Day').sum(), just like MaxU said.
you can do it this way:
In [17]: df
Out[17]:
a b c d e Day
0 7 5 4 9 4 2016-01-01
1 2 1 5 4 5 2014-01-01
2 2 8 8 6 9 2014-01-01
3 1 4 4 3 7 2015-01-01
4 5 6 7 9 5 2016-01-01
5 3 6 0 8 7 2015-01-01
6 7 4 4 5 5 2014-01-01
7 1 1 0 1 6 2015-01-01
8 7 8 9 8 3 2015-01-01
9 8 5 5 2 8 2015-01-01
10 6 1 3 0 3 2014-01-01
11 1 8 2 7 2 2016-01-01
12 2 5 2 5 1 2016-01-01
13 1 2 3 2 2 2016-01-01
14 7 4 9 5 2 2016-01-01
15 4 0 8 9 5 2015-01-01
16 8 5 8 9 7 2015-01-01
17 6 7 9 5 4 2016-01-01
18 7 4 2 3 2 2016-01-01
19 2 7 8 6 8 2015-01-01
In [18]: cols = df.columns
In [19]: cols[1:]
Out[19]: Index(['b', 'c', 'd', 'e', 'Day'], dtype='object')
In [20]: df.loc[:, cols[1:]].groupby('Day').sum()  # .ix is removed in modern pandas; .loc works here
Out[20]:
b c d e
Day
2014-01-01 14 20 15 22
2015-01-01 36 42 46 51
2016-01-01 41 38 45 22
setup sample DF:
import numpy as np
import pandas as pd

rows = 20
df = pd.DataFrame(np.random.randint(0, 10, size=(rows, 5)), columns=list('abcde'))
dates = [pd.to_datetime(d) for d in ['2016-01-01', '2015-01-01', '2014-01-01']]
df['Day'] = np.random.choice(dates, len(df))
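If only a subset of columns should be summed, selecting them before the groupby keeps the intent of the original agg() without one lambda per column; a sketch, where value_cols is a hypothetical helper list:
# sum every column except the grouping key
value_cols = [c for c in df.columns if c != 'Day']
df.groupby('Day')[value_cols].sum()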

Pandas: Add new column with several values to groupby dataframe

For my dataframe, I want to add a new date column such that every unique value in the ID column receives the same set of datetime entries.
Example:
Original Df:
ID
1
2
3
New Column DF:
Date
2015/01/01
2015/02/01
2015/03/01
Resulting Df:
ID Date
1 2015/01/01
2015/02/01
2015/03/01
2 2015/01/01
2015/02/01
2015/03/01
3 2015/01/01
2015/02/01
2015/03/01
I tried to stick to this solution: https://stackoverflow.com/a/12394122/3856569
But it gives me the following error: Length of values does not match length of index
Anyone has a simple solution to do that? Thanks a lot!
UPDATE: replicating ids 6 times:
In [172]: %paste
data = """\
id
1
2
3
"""
import io
df = pd.read_csv(io.StringIO(data))
# repeat each ID 6 times
df = pd.DataFrame(df['id'].tolist()*6, columns=['id'])
start_date = pd.to_datetime('2015-01-01')
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
               .transform(lambda x: pd.date_range(start_date,
                                                  freq='1D',
                                                  periods=len(x)))
df.sort_values(by=['id','date'])
## -- End pasted text --
Out[172]:
id date
0 1 2015-01-01
3 1 2015-01-02
6 1 2015-01-03
9 1 2015-01-04
12 1 2015-01-05
15 1 2015-01-06
1 2 2015-01-01
4 2 2015-01-02
7 2 2015-01-03
10 2 2015-01-04
13 2 2015-01-05
16 2 2015-01-06
2 3 2015-01-01
5 3 2015-01-02
8 3 2015-01-03
11 3 2015-01-04
14 3 2015-01-05
17 3 2015-01-06
OLD more generic answer:
prepare sample DF:
start_date = pd.to_datetime('2015-01-01')
data = """\
id
1
2
2
3
1
2
3
2
1
"""
df = pd.read_csv(io.StringIO(data))
In [200]: df
Out[200]:
id
0 1
1 2
2 2
3 3
4 1
5 2
6 3
7 2
8 1
Solution:
In [201]: %paste
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
               .transform(lambda x: pd.date_range(start_date,
                                                  freq='1D',
                                                  periods=len(x)))
## -- End pasted text --
In [202]: df
Out[202]:
id date
0 1 2015-01-01
1 2 2015-01-01
2 2 2015-01-02
3 3 2015-01-01
4 1 2015-01-02
5 2 2015-01-03
6 3 2015-01-02
7 2 2015-01-04
8 1 2015-01-03
Sorted:
In [203]: df.sort_values(by='id')
Out[203]:
id date
0 1 2015-01-01
4 1 2015-01-02
8 1 2015-01-03
1 2 2015-01-01
2 2 2015-01-02
5 2 2015-01-03
7 2 2015-01-04
3 3 2015-01-01
6 3 2015-01-02
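An alternative sketch for the same per-id enumeration, avoiding transform entirely: offset the start date by each row's running count within its id, which is what the date_range construction effectively does:
# day offset = number of earlier rows sharing this id
df['date'] = start_date + pd.to_timedelta(df.groupby('id').cumcount(), unit='D')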
A rather straightforward numpy approach, making use of repeat and tile:
import numpy as np
import pandas as pd
N = 3 # arbitrary number of IDs/dates
ID = np.arange(N) + 1
dates = pd.date_range('20160101', periods=N)
df = pd.DataFrame({'ID': np.repeat(ID, N),
                   'dates': np.tile(dates, N)})
Resulting DataFrame:
In [1]: df
Out[1]:
ID dates
0 1 2016-01-01
1 1 2016-01-02
2 1 2016-01-03
3 2 2016-01-01
4 2 2016-01-02
5 2 2016-01-03
6 3 2016-01-01
7 3 2016-01-02
8 3 2016-01-03
Update
Assuming you already have a DataFrame of IDs, as pointed out by MaxU, you can repeat each ID once per date and tile the dates:
df = pd.DataFrame({'ID': np.repeat(df['ID'].values, N),
                   'dates': np.tile(dates, len(df))})
# rows come out already grouped by ID; sort only if another order is needed
df = df.sort_values(by=['ID', 'dates'])
Resulting DataFrame:
In [5]: df
Out[5]:
   ID      dates
0   1 2016-01-01
1   1 2016-01-02
2   1 2016-01-03
3   2 2016-01-01
4   2 2016-01-02
5   2 2016-01-03
6   3 2016-01-01
7   3 2016-01-02
8   3 2016-01-03
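For completeness, a sketch of the same ID x date grid built directly with pd.MultiIndex.from_product, which avoids hand-rolling repeat/tile:
import numpy as np
import pandas as pd

N = 3
ID = np.arange(N) + 1
dates = pd.date_range('20160101', periods=N)
# the Cartesian product of IDs and dates, already in sorted order
df = (pd.MultiIndex.from_product([ID, dates], names=['ID', 'dates'])
        .to_frame(index=False))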