Pandas group by date and get count while removing duplicates

I have a data frame that looks like this:
maid date hour count
0 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 13 2
1 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 15 1
2 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13 23 14
3 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-14 0 1
4 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11 14 2
5 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13 7 1
I am trying to get a count of maids for each date in such a way that if a maid appears on day 1, it is not counted on any subsequent day. For example, 0589b8a3-9d33-4db4-b94a-834cc8f46106 is present on both the 13th and the 14th. I want to include that maid in the count for the 13th but not for the 14th, since it is already counted on the 13th.
I have written the following code and it works for small data frames:
import pandas as pd

df = pd.read_csv('/home/ubuntu/uniqueSiteId.csv')
umaids = []
tdf = []
df['date'] = pd.to_datetime(df.date)
df = df.sort_values('date')
df = df[['maid', 'date']]
df = df.drop_duplicates(['maid', 'date'])
dts = df['date'].unique()
for dt in dts:
    if not umaids:
        df1 = df[df['date'] == dt]
        k = df1['maid'].unique()
        umaids.extend(k)
        dff = df1
        fdf = df1.values.tolist()
    elif umaids:
        dfs = df[df['date'] == dt]
        df2 = dfs[~dfs['maid'].isin(umaids)]
        umaids.extend(df2['maid'].unique())
        sdf = df2.values.tolist()
        tdf.append(sdf)
ftdf = [item for t in tdf for item in t]
ndf = fdf + ftdf
ndf = pd.DataFrame(ndf, columns=['maid', 'date'])
print(ndf)
Since I have thousands of data frames, and my data frame often has more than a million rows, the above takes a long time to run. Is there a better way to do this?
The expected output is this:
maid date
0 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11
1 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13
2 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13
3 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14

As per the discussion in the comments, the solution is quite simple: sort the dataframe by date and then drop duplicates by maid only. This keeps the first occurrence of each maid, which is also its first occurrence in time since we sorted by date. Then do the groupby as usual.
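A minimal sketch of that approach, assuming the same CSV layout as in the question (a maid column and a date column):

import pandas as pd

df = pd.read_csv('/home/ubuntu/uniqueSiteId.csv')
df['date'] = pd.to_datetime(df['date'])

# sort by date, then keep only each maid's first (earliest) row
first_seen = df.sort_values('date').drop_duplicates('maid', keep='first')

# one row per maid with the date it first appeared
print(first_seen[['maid', 'date']])

# count of maids first seen on each date
print(first_seen.groupby('date')['maid'].count())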

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I managed to change the name of the column, but I could not do both at the same time.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (litres per 100 km = 235.15 / mpg):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
import pandas as pd

df = pd.DataFrame({'Mpg': [18, 17, 19, 21, 16, 15]})
cc = 235.214583  # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc / x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.

How do you iterate through a data frame based on the value in a row

I have a data frame which I am trying to iterate through, not based on time but on an increase of 10, for example:
Column A  Column B
12:05     1
13:05     6
14:05     11
15:05     16
In this case it would return a new data frame with the rows whose Column B values are 1 and 11. How am I able to do this? The different methods that I have tried, such as asfreq and resample, don't seem to work; they say invalid frequency. I think that is because this is not time based. What function allows me to do this based on a numerical value such as 10 or 7 rather than on time? I don't want every nth row, but a row every time the column value changes by 10 from the last selected value, e.g. 1 then 11; and if the next values were 12, 15, 17, 21, the next selected value would be 21.
Here is one way to do it:
# do a remainder division, and choose rows where remainder is zero
# offset by the first value, to make calculation simpler
first_val = df.loc[0]['Column B']
df.loc[((df['Column B'] - first_val) % 10).eq(0)]
Column A Column B
0 12:05 1
2 14:05 11
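The modulo trick above covers the sample data; for the stricter rule in the question (select a row each time the value has grown by at least 10 since the last selected row), a plain loop is a simple, if unvectorised, sketch:

selected = [df.index[0]]                   # always keep the first row
last_val = df.loc[df.index[0], 'Column B']
for idx, val in df['Column B'].items():
    if val - last_val >= 10:               # grown by 10 or more since the last kept row
        selected.append(idx)
        last_val = val
print(df.loc[selected])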

Pandas - Take value n month before

I am working with datetime. Is there any way to get the value of n months before?
For example, the data look like:
import numpy as np
import pandas as pd

dft = pd.DataFrame(
    np.random.randn(100, 1),
    columns=["A"],
    index=pd.date_range("20130101", periods=100, freq="M"),
)
dft
Then:
For every July of each year, we take the value of December of the previous year and apply it through June of the next year.
For the other months (from August of this year to June of next year), we take the value of the previous month.
For example, the value from Jul-2000 to Jun-2001 will be the same and equal to the value of Dec-1999.
What I've been trying to do is:
dft['B'] = np.where(dft.index.month == 7,
                    dft['A'].shift(7, freq='M'),
                    dft['A'].shift(1, freq='M'))
However, the result is simply a copy of column A, and I don't know why. But when I tried a single line of code:
dft['C'] = dft['A'].shift(7, freq='M')
then everything is shifted as expected. I don't know what the issue is here.
The issue is index alignment. shift with freq='M' moves the index labels rather than the positions of the values, but numpy.where converts everything to plain arrays, so the index is lost and the values are assigned back positionally, which is why you simply get a copy of column A.
Use pandas' where or mask instead; everything will remain a Series and the index will be preserved:
dft['B'] = (dft['A'].shift(1, freq='M')
                    .mask(dft.index.month == 7, dft['A'].shift(7, freq='M'))
           )
output:
A B
2013-01-31 -2.202668 NaN
2013-02-28 0.878792 -2.202668
2013-03-31 -0.982540 0.878792
2013-04-30 0.119029 -0.982540
2013-05-31 -0.119644 0.119029
2013-06-30 -1.038124 -0.119644
2013-07-31 0.177794 -1.038124
2013-08-31 0.206593 -2.202668 <- correct
2013-09-30 0.188426 0.206593
2013-10-31 0.764086 0.188426
... ... ...
2020-12-31 1.382249 -1.413214
2021-01-31 -0.303696 1.382249
2021-02-28 -1.622287 -0.303696
2021-03-31 -0.763898 -1.622287
2021-04-30 0.420844 -0.763898
[100 rows x 2 columns]
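As a side note (not part of the answer above), a tiny illustration of the index-loss point:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range("2013-01-31", periods=3, freq="M"))

# numpy.where returns a plain ndarray, so the DatetimeIndex (and any
# alignment coming from shift(..., freq='M')) is thrown away:
print(type(np.where(s.index.month == 2, s, 0)))   # <class 'numpy.ndarray'>

# Series.mask keeps the index, so later assignment aligns on it:
print(type(s.mask(s.index.month == 2, 0)))        # <class 'pandas.core.series.Series'>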

How to use double groupby in Pandas and filter based on if condition?

I have a data frame called df that looks like this in Pandas:
id amt date seq
SB 450,000,000 2020-05-11 1
OM 430,000,000 2020-05-11 1
SB 450,000,000 2020-05-12 1
OM 450,000,000 2020-05-12 1
OM 130,000,000 2020-05-12 2
I need to find the value in amt for each id for each day. The issue is that on some days there are multiple cycles, as indicated by seq.
If there are 2 cycles (i.e. seq=2) for any one day, I need to take the value where seq=2 for that id on that day and drop any values for seq=1 with the same day and id. On some days there is only 1 cycle for a given id, and on those days I can just stick with the value where seq=1.
My goal is to group by day in Pandas and then again by id, then apply an if statement: if the seq column contains a 2 for that id and day, filter that groupby object down to the row where seq=2. The end result would be a data frame containing only the seq=2 rows for day/id combinations with multiple cycles, and the seq=1 rows for day/id combinations with only one cycle.
So far I have tried:
for day in df.groupby(df['date']):
    for id in day[1].groupby(['id']):
        if 2 in id[1]['seq']:
            id[1] = id[1].apply(lambda g: g[g['seq'] == 2])
Which gives me:
KeyError: 'seq'
and I have also tried:
for day in df.groupby(df['date']):
    for id in day[1].groupby(['id']):
        id = list(id)
        if 2 in id[1]['seq']:
            id[1] = id[1][id[1]['seq'] == 2]
Which runs fine but then doesn't actually change or do anything to df (the same number of rows remain).
Can anyone help me with how I can accomplish this?
Thank you in advance!
You can do this if you groupby date + id, then get the indices of the rows where seq is at its maximum for those groupings. Once you have those indices, you can slice back into the original dataframe to get your desired subset:
max_seq_indices = df.groupby(["date", "id"])["seq"].idxmax()
print(max_seq_indices)
date        id
2020-05-11  OM    1
            SB    0
2020-05-12  OM    4
            SB    2
Name: seq, dtype: int64
Looking at the values of this Series, you can see that we have a maximum seq for ["2020-05-11", "OM"] at row 1. Likewise, there is a maximum seq for ["2020-05-11", "SB"] at row 0. And so on. If we use this to slice back into our original dataframe, we end up with a subset that you described in your question:
new_df = df.loc[max_seq_indices]
print(new_df)
id amt date seq
1 OM 430,000,000 2020-05-11 1
0 SB 450,000,000 2020-05-11 1
4 OM 130,000,000 2020-05-12 2
2 SB 450,000,000 2020-05-12 1
This approach will encounter issues if you have a seq greater than 2 but only want the rows where seq is 2. However, if that is the case, leave a comment and I can update my answer with a more robust (but probably more complex) solution.
You can also work with a sorted dataframe like:
df.sort_values(['date', 'id', 'seq'], inplace=True)
Then you can use groupby to take just the last of each group:
df.reset_index(drop=True).groupby(['date', 'id'])['amt'].agg('last')
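If you want to keep the whole rows rather than just amt, a possible variation (a sketch, not taken from the answers above) is to drop duplicates after sorting:

df_sorted = df.sort_values(['date', 'id', 'seq'])
# keep='last' retains the row with the highest seq for each date/id pair
latest = df_sorted.drop_duplicates(['date', 'id'], keep='last')
print(latest)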

Pandas series group by calculate percentage

I have a data frame. I have grouped a column status by date using
y = news_dataframe.groupby(by=[news_dataframe['date'].dt.date,news_dataframe['status']])['status'].count()
and my output is --
date        status         count
2019-05-29  selected          24
            rejected auto    243
            waiting          109
            no action       1363
2019-05-30  selected          28
            rejected auto    188
            waiting          132
            no action       1249
            repeat             3
2019-05-31  selected          13
            rejected auto      8
            waiting           23
            no action        137
            repeat             2
            source             1
Name: reasonForReject, dtype: int64
Now I want to calculate the percentage of each status, grouped by date. How can I achieve this with a pandas dataframe?
Compute two different groupbys and divide one by the other:
y_numerator = news_dataframe.groupby(by=[news_dataframe['date'].dt.date,news_dataframe['status']])['status'].count()
y_denominator = news_dataframe.groupby(by=news_dataframe['date'].dt.date)['status'].count()
y=y_numerator/y_denominator
I guess that's the shortest:
news_dataframe['date'] = news_dataframe['date'].dt.date
news_dataframe.groupby(['date','status'])['status'].count()/news_dataframe.groupby(['date'])['status'].count()
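Another option (just a sketch, not from the answers above): if you already have the counts Series y from the first groupby, you can divide each count by its date's total directly on the index level:

# y is the MultiIndex (date, status) counts Series built earlier;
# transform('sum') over level 0 (the date) gives each row its date total
pct = y / y.groupby(level=0).transform('sum') * 100
print(pct)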
Try this:
# forward-fill the date that is blank on consecutive rows
df = df.ffill()
df.columns = ['date', 'status', 'count']
# getting the total value of count per date
df1 = df.groupby(['date'])['count'].sum().reset_index()
# renaming it to total as it is the sum
df1.columns = ['date', 'total']
# now join the tables to get the total and the actual value together
df2 = df.merge(df1, on=['date'])
# calculate the percentage
df2['percentage'] = (df2['count'] / df2['total']) * 100
If you need a one-liner, it's:
df['percentage'] = (df.ffill()['count'] / df.ffill().groupby(['date']).sum().reset_index().rename(columns={'count': 'total'}).merge(df, on=['date'])['total']) * 100