This code works but seems so hairy. Is there a better way to drop 100 rows from a dataframe starting from the row where a certain value criterion is met?
In my case, I want to find the first row where the value in column_name is < 21000, then drop 100 rows starting from that row.
df.drop(df[(df.index >= df.loc[df[column_name] < 21000].index[0])][:100].index, inplace=True)
The index consists of datetime values.
Given jch's example df, plus a datetime index:
A
2021-01-01 a
2021-01-04 b
2021-01-07 c
2021-01-10 d
2021-01-13 e
2021-01-16 f
2021-01-19 g
2021-01-22 h
2021-01-25 i
2021-01-28 j
To demonstrate, let's drop 3 values: 'e' and the two values after it:
i = df.A.eq('e').argmax()
df = df.drop(df.index[i:i+3])
print(df)
Output:
A
2021-01-01 a
2021-01-04 b
2021-01-07 c
2021-01-10 d
2021-01-22 h
2021-01-25 i
2021-01-28 j
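Applied to the original question, the same pattern would look something like this (a sketch: column_name stands for your actual column name, and 100 rows are dropped starting from the first match):
i = (df[column_name] < 21000).argmax()   # position of the first row below 21000
df = df.drop(df.index[i:i+100])          # drop that row and the 99 rows after it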
Thinking of it from the other direction, you could just keep everything except the block you want to drop. The example below does that, but 'drops' 3 rows instead of 100.
df = pd.DataFrame({'A':list('abcdefghij')})
print(df)
A
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
Execute
r = df['A'].eq('d').argmax()
pd.concat([df.iloc[:r],df.iloc[r+3:]])
Result
A
0 a
1 b
2 c
6 g
7 h
8 i
9 j
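For the original question, that translates to something like (a sketch, with column_name standing in for your actual column):
r = (df[column_name] < 21000).argmax()
df = pd.concat([df.iloc[:r], df.iloc[r+100:]])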
I have a dataframe that looks like this:
df1
company 2022-03-14 00:00:00 2022-03-15 00:00:00 2022-03-16 00:00:00
a 1 1 2
b 1 1 1
c 1 0 1
d 1 2 2
I have another dataframe that looks like this:
df2
company numbers present
a NaN
b NaN
c NaN
d NaN
I want to populate the 'numbers present' column in df2 with dates from df1, where the number of rows for each date equals the value appearing under that datetime column header. The final output df3 should look like this:
df3
company numbers present
a 2022-03-14
a 2022-03-15
a 2022-03-16
a 2022-03-16
b 2022-03-14
b 2022-03-15
b 2022-03-16
c 2022-03-14
c 2022-03-16
d 2022-03-14
d 2022-03-15
d 2022-03-15
d 2022-03-16
d 2022-03-16
Use df.melt and df.index.repeat:
new_df = (
    df.melt(id_vars='company')
      .pipe(lambda x: x.loc[x.index.repeat(x.value)])
      .drop('value', axis=1)
      .sort_values('company')
      .reset_index(drop=True)
)
Output:
>>> new_df
company variable
0 a 2022-03-14 00:00:00
1 a 2022-03-15 00:00:00
2 a 2022-03-16 00:00:00
3 a 2022-03-16 00:00:00
4 b 2022-03-14 00:00:00
5 b 2022-03-15 00:00:00
6 b 2022-03-16 00:00:00
7 c 2022-03-14 00:00:00
8 c 2022-03-16 00:00:00
9 d 2022-03-14 00:00:00
10 d 2022-03-15 00:00:00
11 d 2022-03-15 00:00:00
12 d 2022-03-16 00:00:00
13 d 2022-03-16 00:00:00
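If you want the column named as in df3, you can rename it afterwards (a small follow-up; 'numbers present' is the name requested in the question):
new_df = new_df.rename(columns={'variable': 'numbers present'})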
I have a dataset that looks something like the following:
df = pd.DataFrame({"Date":['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05','2021-01-06','2021-01-07'],'Value':[0,0,14,0,0,0,9]})
df['Date']=pd.to_datetime(df['Date'])
df
Date Value
2021-01-01 0
2021-01-02 0
2021-01-03 14
2021-01-04 0
2021-01-05 0
2021-01-06 0
2021-01-07 9
I know the missing data is due to a lack of reporting, so each row with a value represents that day plus the sum of the values from the preceding missing days. I want to randomly distribute those values backwards over the missing days, as in the example below:
df2 = pd.DataFrame({"Date":['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05','2021-01-06','2021-01-07'],'Value':[2,8,4,3,1,4,1]})
df2['Date']=pd.to_datetime(df2['Date'])
df2
Date Value
2021-01-01 2
2021-01-02 8
2021-01-03 4
2021-01-04 3
2021-01-05 1
2021-01-06 4
2021-01-07 1
(The local 'totals' on 2021-01-03 and 2021-01-07 remain the same)
I know part of the problem is that the intervals of missing/present data aren't consistent...
Any ideas on how to get this done? All advice appreciated.
You can create groups that each end at a non-zero Value: compare Value with ne (not equal) to 0, shift by one so the non-zero value stays in the right group, and take the cumsum (the resulting key is sketched below).
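For the example df, that grouping key evaluates as follows (just an illustration of the expression used in the code below):
key = df['Value'].ne(0).shift(1, fill_value=False).cumsum()
print(key.tolist())  # [0, 0, 0, 1, 1, 1, 1] -> each group ends at a non-zero Value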
Then, to split each group into parts of different lengths that sum to the group's non-zero value, you can refer to this question for example.
Using numpy.random.multinomial like in this answer, you get:
import numpy as np
np.random.seed(1)
df['new_value'] = (
df.groupby(df['Value'].ne(0).shift(1,fill_value=False).cumsum())
['Value'].apply(lambda x: np.random.multinomial(x.max(), [1./len(x)]*len(x)))
.explode() # create a Series
.to_numpy() # because index alignment not possible
)
print(df)
Date Value new_value
0 2021-01-01 0 4
1 2021-01-02 0 6
2 2021-01-03 14 4
3 2021-01-04 0 0
4 2021-01-05 0 2
5 2021-01-06 0 2
6 2021-01-07 9 5
Alternatively, you can use this answer, which seems a bit more popular:
import random
def constrained_sum_sample_pos(n, total):
    """Return a randomly chosen list of n positive integers summing to total.
    Each such list is equally likely to occur."""
    dividers = sorted(random.sample(range(1, total), n - 1))
    return [a - b for a, b in zip(dividers + [total], [0] + dividers)]
df['new_value_r'] = (
df.groupby(df['Value'].ne(0).shift(1,fill_value=False).cumsum())
['Value'].apply(lambda x: constrained_sum_sample_pos(len(x), x.max()))
.explode()
.to_numpy()
)
print(df)
Date Value new_value new_value_r
0 2021-01-01 0 4 5
1 2021-01-02 0 6 2
2 2021-01-03 14 4 7
3 2021-01-04 0 0 1
4 2021-01-05 0 2 2
5 2021-01-06 0 2 5
6 2021-01-07 9 5 1
Assume that we have a table like this:
Date        User  Item
2021-01-01  A     X
2021-01-05  A     Y
2021-01-11  A     Z
2021-01-01  B     X
2021-01-16  B     Y
2021-01-01  C     X
2021-01-02  C     Y
2021-01-03  C     Z
2021-01-10  D     X
2021-01-15  D     Y
I want to add a sequence number per user, ordered by date, during the query itself, not by modifying the table. For each user, the item with the earliest date should get sequence number 1, and the numbering should start over for each new user. I want to retrieve data like this:
Date        User  Item  Sequence
2021-01-01  A     X     1
2021-01-05  A     Y     2
2021-01-11  A     Z     3
2021-01-01  B     X     1
2021-01-16  B     Y     2
2021-01-01  C     X     1
2021-01-02  C     Y     2
2021-01-03  C     Z     3
2021-01-10  D     X     1
2021-01-15  D     Y     2
Is it possible? Can I retrieve data like that?
Thanks!
This is what row_number() does:
select t.*,
row_number() over (partition by user order by date) as seqnum
from t;
I have a dataframe:
# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7,8])
df['ID'] = [1,1,1,1,2,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=8, freq="M")
df['status'] = ['b','a','b','c','a','d','d','b']
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date status
0 2 2010-08-31 b
1 2 2010-07-31 d
2 2 2010-06-30 d
3 2 2010-05-31 a
4 1 2010-04-30 c
5 1 2010-03-31 b
6 1 2010-02-28 a
7 1 2010-01-31 b
I would like to get the cumulative most frequent status for column status for each ID. This is what I would expect:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN
Interpretation:
For 2010-01-31 the value is NaN because there was no status value in the past. The same holds for 2010-05-31.
For 2010-03-31 the most frequent statuses in the past were a and b (one occurrence each). In case of a tie we take the most recent one, which was a.
How would you do it?
You can first make a DataFrame with ID and election_date as its index, and one-hot-encoded status values, then calculate cumsum.
We want to pick the most recent status if there is a tie in counts, so I'm adding a small number (less than 1) to cumsum for the current status, so when we apply idxmax it will pick up the most recent status in case there's a tie.
After finding the most frequent cumulative status with idxmax we can merge with the original DataFrame:
# make one-hot-encoded status dataframe
z = (df
.groupby(['ID', 'election_date', 'status'])
.size().unstack().fillna(0))
# break ties to choose most recent
z = z.groupby(level=0).cumsum() + (z * 1e-4)
# shift by 1 row, since we only count previous status occurrences
z = z.groupby(level=0).shift()
# merge
df.merge(z.idxmax(axis=1).to_frame('cum_most_freq_status').reset_index())
Output:
ID election_date status cum_most_freq_status
0 2 2010-08-31 b d
1 2 2010-07-31 d d
2 2 2010-06-30 d a
3 2 2010-05-31 a NaN
4 1 2010-04-30 c b
5 1 2010-03-31 b a
6 1 2010-02-28 a b
7 1 2010-01-31 b NaN
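To make the idxmax step easier to follow, this is roughly what z looks like for the example data right after fillna(0), before the cumsum, tie-break and shift (shown only as an illustration):
status                a    b    c    d
ID election_date
1  2010-01-31       0.0  1.0  0.0  0.0
   2010-02-28       1.0  0.0  0.0  0.0
   2010-03-31       0.0  1.0  0.0  0.0
   2010-04-30       0.0  0.0  1.0  0.0
2  2010-05-31       1.0  0.0  0.0  0.0
   2010-06-30       0.0  0.0  0.0  1.0
   2010-07-31       0.0  0.0  0.0  1.0
   2010-08-31       0.0  1.0  0.0  0.0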
[IN] df
[OUT]:
customer_id Order_date Status
1 2015-01-16 R
1 2015-01-19 G
2 2014-12-21 R
2 2015-01-10 G
1 2015-01-10 B
3 2018-01-18 Y
3 2017-03-04 Y
4 2019-11-05 B
4 2010-01-01 G
3 2019-02-03 U
3 2020-01-01 R
3 2018-01-01 R
Code to extract Customer_IDs where the count of order transactions is at least 3:
[IN]
df22 = (df.groupby('customer_id')['order_date'].nunique()
          .loc[lambda x: x >= 3]
          .reset_index()
          .rename(columns={'order_date': 'Count_Order_Date'}))
[OUT]
customer_id Count_Order_Date
1 3
3 5
Output I want:
I want to use the IDs obtained from the code above to filter the original dataframe df, so I need the output as follows:
[OUT]
customer_id Order_date Status
1 2015-01-16 R
1 2015-01-19 G
1 2015-01-10 B
3 2018-01-18 Y
3 2017-03-04 Y
3 2019-02-03 U
3 2020-01-01 R
3 2018-01-01 R
So in the output only IDs 1 and 3 are reflected (the ones with at least 3 unique order dates).
What I have tried so far (which has failed):
df[df['customer_id'].isin(df22['customer_id'])]
I feel it has failed because when I do df['customer_id'].nunique() and df22['customer_id'].nunique(), the values are different.
It was a simple error. I had forgotten to reassign the result of df[df['customer_id'].isin(df22['customer_id'])] back to df.
So doing
df = df[df['customer_id'].isin(df22['customer_id'])]
solved my problem.
Thanks #YOandBEN_W for pointing it out.
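As a side note, one way to avoid the intermediate df22 entirely is to filter with a groupby transform (a sketch, using the same lowercase column names as the code above):
# keep only customers with at least 3 unique order dates
df = df[df.groupby('customer_id')['order_date'].transform('nunique') >= 3]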