Using column values from one DataFrame in another - pandas

[IN] df
[OUT]:
customer_id order_date status
1 2015-01-16 R
1 2015-01-19 G
2 2014-12-21 R
2 2015-01-10 G
1 2015-01-10 B
3 2018-01-18 Y
3 2017-03-04 Y
4 2019-11-05 B
4 2010-01-01 G
3 2019-02-03 U
3 2020-01-01 R
3 2018-01-01 R
Code to extract customer IDs where the count of unique order dates is at least 3:
[IN]
df22 = (df.groupby('customer_id')['order_date'].nunique()
          .loc[lambda x: x >= 3]
          .reset_index()
          .rename(columns={'order_date': 'Count_Order_Date'}))
[OUT]
customer_id Count_Order_Date
1 3
3 5
Output I want:
I want to use the IDs obtained from the code above to filter the original dataframe df, so I need the output as follows:
[OUT]
customer_id order_date status
1 2015-01-16 R
1 2015-01-19 G
1 2015-01-10 B
3 2018-01-18 Y
3 2017-03-04 Y
3 2019-02-03 U
3 2020-01-01 R
3 2018-01-01 R
So in the output only IDs 1 and 3 are kept (the ones with at least 3 unique order dates).
What I have tried so far (which has failed):
df[df['customer_id'].isin(df22['customer_id'])]
The reason it has failed, I feel, is that df['customer_id'].nunique() and df22['customer_id'].nunique() return different values.

It was a simple error. I had forgotten to assign the filtered result back to df.
So doing
df = df[df['customer_id'].isin(df22['customer_id'])]
solved my problem.
Thanks @YOandBEN_W for pointing it out.
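For completeness, the same filtering can be done in a single step with groupby().filter(), skipping the intermediate df22 (a sketch, assuming the columns are named customer_id and order_date as above):
# keep only customers with at least 3 distinct order dates
df = df.groupby('customer_id').filter(lambda g: g['order_date'].nunique() >= 3)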

Related

way to drop values starting at a given index in pandas

This code works but seems so hairy. Is there a better way to drop 100 rows from a dataframe, starting from the row where a certain value criterion is met?
In my case, I want to find the next row where the value in column_name is < 21000, then drop that row and the next 100 rows in the dataframe.
pd.drop(pd[(pd.index >= pd.loc[pd[column_name] < 21000].index[0])][:100].index, inplace=True)
The index is datetime values.
Given jch's example df, plus a datetime index:
A
2021-01-01 a
2021-01-04 b
2021-01-07 c
2021-01-10 d
2021-01-13 e
2021-01-16 f
2021-01-19 g
2021-01-22 h
2021-01-25 i
2021-01-28 j
Here, let's drop 3 values: 'e' and the two rows after it:
i = df.A.eq('e').argmax()
df = df.drop(df.index[i:i+3])
print(df)
Output:
A
2021-01-01 a
2021-01-04 b
2021-01-07 c
2021-01-10 d
2021-01-22 h
2021-01-25 i
2021-01-28 j
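Applied back to the original question, the same pattern works with the < 21000 condition (a rough sketch, assuming df is the frame called pd in the question and column_name holds the column's name):
# position of the first row where the value drops below 21000
# (note: argmax returns 0 if no row matches the condition)
i = (df[column_name] < 21000).argmax()
# drop that row and the 100 rows after it; adjust the slice if you want exactly 100 rows in total
df = df.drop(df.index[i:i + 101])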
Thinking of it from the other direction, you could just keep everything around the block you want to drop. The example below does that, but 'drops' 3 rows instead of 100.
df = pd.DataFrame({'A':list('abcdefghij')})
print(df)
A
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
Execute
r = df['A'].eq('d').argmax()
pd.concat([df.iloc[:r],df.iloc[r+3:]])
Result
A
0 a
1 b
2 c
6 g
7 h
8 i
9 j

Get the cumulative most frequent status for a specific column in a pandas dataframe

I have a dataframe:
# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7,8])
df['ID'] = [1,1,1,1,2,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=8, freq="M")
df['status'] = ['b','a','b','c','a','d','d','b']
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date status
0 2 2010-08-31 b
1 2 2010-07-31 d
2 2 2010-06-30 d
3 2 2010-05-31 a
4 1 2010-04-30 c
5 1 2010-03-31 b
6 1 2010-02-28 a
7 1 2010-01-31 b
I would like to get the cumulative most frequent status for column status for each ID. This is what I would expect:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN
Interpretation:
For 2010-01-31 the value is NaN because there was no status value in the past. The same holds for 2010-05-31.
For 2010-03-31 the most frequent statuses in the past were a and b (a tie). Therefore we take the most recent one, which was a.
How would you do it?
You can first make a DataFrame with ID and election_date as its index, and one-hot-encoded status values, then calculate cumsum.
We want to pick the most recent status if there is a tie in counts, so I'm adding a small number (less than 1) to the cumsum for the status of the current row; when we later apply idxmax, it will pick the most recent status in case of a tie.
After finding the most frequent cumulative status with idxmax we can merge with the original DataFrame:
# make one-hot-encoded status dataframe
z = (df
     .groupby(['ID', 'election_date', 'status'])
     .size().unstack().fillna(0))
# break ties to choose most recent
z = z.groupby(level=0).cumsum() + (z * 1e-4)
# shift by 1 row, since we only count previous status occurrences
z = z.groupby(level=0).shift()
# merge
df.merge(z.idxmax(axis=1).to_frame('cum_most_freq_status').reset_index())
Output:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN
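To see why the 1e-4 tie-break picks the most recent status, here is the slice of z for ID 1 just before the shift (values worked out by hand from the sample data; columns are the one-hot statuses):
election_date a b c d
2010-01-31 0.0000 1.0001 0.0 0.0
2010-02-28 1.0001 1.0000 0.0 0.0
2010-03-31 1.0000 2.0001 0.0 0.0
2010-04-30 1.0000 2.0000 1.0001 0.0
After the shift, the 2010-03-31 row sees a=1.0001 vs b=1.0000, so idxmax resolves the a/b tie in favour of a, the more recently seen status.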

How to create a new column of conditional count in a Pandas DataFrame

I have a DataFrame, df, like:
id date
a 2019-07-11
a 2019-07-16
b 2018-04-01
c 2019-08-10
c 2019-07-11
c 2018-05-15
I want to add a count column that shows how many rows with the same id exist in the data with a date earlier than the date of that row. Meaning:
id date count
a 2019-07-11 0
a 2019-07-16 1
b 2018-04-01 0
c 2019-08-10 2
c 2019-07-11 1
c 2018-05-15 0
If you believe it is easier in SQL and know how to do it, that works for me too.
Do this:
In [1688]: df.sort_values('date').groupby('id').cumcount()
Out[1688]:
2 0
5 0
0 0
4 1
1 1
3 2
dtype: int64
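To attach this as the count column from the desired output, the result can simply be assigned back; since it keeps the original index, pandas aligns it to the original row order:
df['count'] = df.sort_values('date').groupby('id').cumcount()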

Pandas groupby and rank - same rank for duplicates

Here is my dataframe:
my_df = pd.DataFrame({'group':['a','a', 'a','b','b'], 'date':['2017-01-02', '2017-01-02','2017-03-01', '2018-02-05', '2018-04-06']})
my_df['date']= pd.to_datetime(my_df['date'], format = '%Y-%m-%d')
I would like to add a rank per group, where equal values are assigned the same rank.
Here is what I would like as output:
date group rank
0 2017-01-02 a 1
1 2017-01-02 a 1
2 2017-03-01 a 2
3 2018-02-05 b 1
4 2018-04-06 b 2
I guess I could do it by grouping and ranking and then joining back to the original dataframe, but I wonder if there is a faster way to do it.
Just use rank with method='dense':
my_df.groupby(['group'])['date'].rank(method='dense')
Out[6]:
0 1.0
1 1.0
2 2.0
3 1.0
4 2.0
Name: date, dtype: float64
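To get the integer rank column shown in the desired output, the same expression can be assigned back and cast (a small sketch):
my_df['rank'] = my_df.groupby('group')['date'].rank(method='dense').astype(int)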
You could use transform with factorize:
my_df['group_rank'] = my_df.groupby(['group'])['date'].transform(lambda x: x.factorize()[0])
>>> my_df
date group group_rank
0 2017-01-02 a 0
1 2017-01-02 a 0
2 2017-03-01 a 1
3 2018-02-05 b 0
4 2018-04-06 b 1
If you add + 1 to the end of that, it will give ranks of 1 and 2 as in your desired output, but I thought this might not be important (since the values are properly binned together in any case).
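For example (reusing the group_rank column from above):
my_df['group_rank'] = my_df.groupby(['group'])['date'].transform(lambda x: x.factorize()[0] + 1)
Note that factorize numbers values by order of first appearance rather than by sorted order, so this matches a dense rank only when the dates are already sorted within each group, as they are here.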

Pandas: Add new column with several values to groupby dataframe

For my dataframe, I want to add a new date column such that every unique value in another column gets the same set of datetime entries.
Example:
Original Df:
ID
1
2
3
New Column DF:
Date
2015/01/01
2015/02/01
2015/03/01
Resulting Df:
ID Date
1 2015/01/01
2015/02/01
2015/03/01
2 2015/01/01
2015/02/01
2015/03/01
3 2015/01/01
2015/02/01
2015/03/01
I tried to stick to this solution: https://stackoverflow.com/a/12394122/3856569
But it gives me the following error: Length of values does not match length of index
Does anyone have a simple solution for this? Thanks a lot!
UPDATE: replicating ids 6 times:
In [172]: %paste
data = """\
id
1
2
3
"""
df = pd.read_csv(io.StringIO(data))
# repeat each ID 6 times
df = pd.DataFrame(df['id'].tolist()*6, columns=['id'])
start_date = pd.to_datetime('2015-01-01')
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
               .transform(lambda x: pd.date_range(start_date,
                                                  freq='1D',
                                                  periods=len(x)))
df.sort_values(by=['id','date'])
## -- End pasted text --
Out[172]:
id date
0 1 2015-01-01
3 1 2015-01-02
6 1 2015-01-03
9 1 2015-01-04
12 1 2015-01-05
15 1 2015-01-06
1 2 2015-01-01
4 2 2015-01-02
7 2 2015-01-03
10 2 2015-01-04
13 2 2015-01-05
16 2 2015-01-06
2 3 2015-01-01
5 3 2015-01-02
8 3 2015-01-03
11 3 2015-01-04
14 3 2015-01-05
17 3 2015-01-06
OLD more generic answer:
prepare sample DF:
start_date = pd.to_datetime('2015-01-01')
data = """\
id
1
2
2
3
1
2
3
2
1
"""
df = pd.read_csv(io.StringIO(data))
In [200]: df
Out[200]:
id
0 1
1 2
2 2
3 3
4 1
5 2
6 3
7 2
8 1
Solution:
In [201]: %paste
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
               .transform(lambda x: pd.date_range(start_date,
                                                  freq='1D',
                                                  periods=len(x)))
## -- End pasted text --
In [202]: df
Out[202]:
id date
0 1 2015-01-01
1 2 2015-01-01
2 2 2015-01-02
3 3 2015-01-01
4 1 2015-01-02
5 2 2015-01-03
6 3 2015-01-02
7 2 2015-01-04
8 1 2015-01-03
Sorted:
In [203]: df.sort_values(by='id')
Out[203]:
id date
0 1 2015-01-01
4 1 2015-01-02
8 1 2015-01-03
1 2 2015-01-01
2 2 2015-01-02
5 2 2015-01-03
7 2 2015-01-04
3 3 2015-01-01
6 3 2015-01-02
A rather straightforward numpy approach, making use of repeat and tile:
import numpy as np
import pandas as pd
N = 3 # arbitrary number of IDs/dates
ID = np.arange(N) + 1
dates = pd.date_range('20160101', periods=N)
df = pd.DataFrame({'ID': np.repeat(ID, N),
                   'dates': np.tile(dates, N)})
Resulting DataFrame:
In [1]: df
Out[1]:
ID dates
0 1 2016-01-01
1 1 2016-01-02
2 1 2016-01-03
3 2 2016-01-01
4 2 2016-01-02
5 2 2016-01-03
6 3 2016-01-01
7 3 2016-01-02
8 3 2016-01-03
Update
Assuming you already have a DataFrame of IDs, as pointed out by MaxU, you can tile the IDs and repeat the dates:
df = pd.DataFrame({'ID': np.tile(df['ID'], N),
                   'dates': np.repeat(dates, N)})
# now df needs sorting
df = df.sort_values(by=['ID', 'dates'])
Resulting DataFrame:
In [5]: df
Out[5]:
ID dates
0 1 2016-01-01
3 1 2016-01-02
6 1 2016-01-03
1 2 2016-01-01
4 2 2016-01-02
7 2 2016-01-03
2 3 2016-01-01
5 3 2016-01-02
8 3 2016-01-03
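As an aside, on pandas 1.2 or newer the same ID x date expansion can also be written as a cross merge (a sketch using the example values from the question):
import pandas as pd

ids = pd.DataFrame({'ID': [1, 2, 3]})
dates = pd.DataFrame({'Date': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01'])})

# every ID paired with every date
result = ids.merge(dates, how='cross')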