auto increment inside group - pandas

I have a dataframe:
import pandas as pd

df = pd.DataFrame.from_dict({
    'product': ('a', 'a', 'a', 'a', 'c', 'b', 'b', 'b'),
    'sales': ('-', '-', 'hot_price', 'hot_price', '-', 'min_price', 'min_price', 'min_price'),
    'price': (100, 100, 50, 50, 90, 70, 70, 70),
    'dt': ('2020-01-01 00:00:00', '2020-01-01 00:05:00', '2020-01-01 00:07:00', '2020-01-01 00:10:00', '2020-01-01 00:13:00', '2020-01-01 00:15:00', '2020-01-01 00:19:00', '2020-01-01 00:21:00')
})
product sales price dt
0 a - 100 2020-01-01 00:00:00
1 a - 100 2020-01-01 00:05:00
2 a hot_price 50 2020-01-01 00:07:00
3 a hot_price 50 2020-01-01 00:10:00
4 c - 90 2020-01-01 00:13:00
5 b min_price 70 2020-01-01 00:15:00
6 b min_price 70 2020-01-01 00:19:00
7 b min_price 70 2020-01-01 00:21:00
I need the following output:
product sales price dt unique_group
0 a - 100 2020-01-01 00:00:00 0
1 a - 100 2020-01-01 00:05:00 0
2 a hot_price 50 2020-01-01 00:07:00 1
3 a hot_price 50 2020-01-01 00:10:00 1
4 c - 90 2020-01-01 00:13:00 2
5 b min_price 70 2020-01-01 00:15:00 3
6 b min_price 70 2020-01-01 00:19:00 3
7 b min_price 70 2020-01-01 00:21:00 3
Here is how I currently do it:
unique_group = 0
df['unique_group'] = unique_group
for i in range(1, len(df)):
    current, prev = df.loc[i], df.loc[i - 1]
    if not all([
        current['product'] == prev['product'],
        current['sales'] == prev['sales'],
        current['price'] == prev['price'],
    ]):
        unique_group += 1
    df.loc[i, 'unique_group'] = unique_group
Is it possible to do this without iteration? I tried cumsum(), shift(), ngroup(), and drop_duplicates(), but without success.

IIUC, GroupBy.ngroup:
df['unique_group'] = df.groupby(['product', 'sales', 'price'], sort=False).ngroup()
print(df)
product sales price dt unique_group
0 a - 100 2020-01-01 00:00:00 0
1 a - 100 2020-01-01 00:05:00 0
2 a hot_price 50 2020-01-01 00:07:00 1
3 a hot_price 50 2020-01-01 00:10:00 1
4 c - 90 2020-01-01 00:13:00 2
5 b min_price 70 2020-01-01 00:15:00 3
6 b min_price 70 2020-01-01 00:19:00 3
7 b min_price 70 2020-01-01 00:21:00 3
This works either way, even if the dataframe is not sorted: with sort=False, ngroup numbers the groups in order of first appearance.
Another approach, which works when the dataframe is ordered:
cols = ['product','sales','price']
df['unique_group'] = df[cols].ne(df[cols].shift()).any(axis=1).cumsum().sub(1)
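To see why this works, here are the intermediate values for the sample frame (a sketch):
cols = ['product', 'sales', 'price']
changed = df[cols].ne(df[cols].shift()).any(axis=1)  # True at the start of each new run
# changed:                 [True, False, True, False, True, True, False, False]
# changed.cumsum().sub(1): [0, 0, 1, 1, 2, 3, 3, 3]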

Another option which might be a bit faster than groupby:
df['unique_group'] = (~df.duplicated(['product','sales','price'])).cumsum() - 1
Output:
product sales price dt unique_group
0 a - 100 2020-01-01 00:00:00 0
1 a - 100 2020-01-01 00:05:00 0
2 a hot_price 50 2020-01-01 00:07:00 1
3 a hot_price 50 2020-01-01 00:10:00 1
4 c - 90 2020-01-01 00:13:00 2
5 b min_price 70 2020-01-01 00:15:00 3
6 b min_price 70 2020-01-01 00:19:00 3
7 b min_price 70 2020-01-01 00:21:00 3
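Note that df.duplicated(cols) flags every row whose combination appeared earlier anywhere in the frame, so this matches the loop's per-run numbering only when identical combinations sit next to each other, as they do here. The intermediates for the sample frame (a sketch):
first_seen = ~df.duplicated(['product', 'sales', 'price'])  # True on the first occurrence of each combination
# first_seen:          [True, False, True, False, True, True, False, False]
# first_seen.cumsum(): [1, 1, 2, 2, 3, 4, 4, 4]  -> subtract 1 to get the group ids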


Pandas groupby issue after melt: bug?

Python version 3.8.12
pandas 1.4.1
Given the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': [1000] * 4,
    'date': ['2022-01-01'] * 4,
    'ts': pd.date_range('2022-01-01', freq='5min', periods=4),  # 5-minute intervals
    'A': np.random.randint(1, 6, size=4),
    'B': np.random.rand(4)
})
that looks like this:
     id        date                   ts  A          B
0  1000  2022-01-01  2022-01-01 00:00:00  4    0.98019
1  1000  2022-01-01  2022-01-01 00:05:00  3    0.82021
2  1000  2022-01-01  2022-01-01 00:10:00  4   0.549684
3  1000  2022-01-01  2022-01-01 00:15:00  5  0.0818311
I transposed the columns A and B with pandas melt:
melted = df.melt(
    id_vars=['id', 'date', 'ts'],
    value_vars=['A', 'B'],
    var_name='label',
    value_name='value',
    ignore_index=True
)
that looks like this:
     id        date                   ts label      value
0  1000  2022-01-01  2022-01-01 00:00:00     A          4
1  1000  2022-01-01  2022-01-01 00:05:00     A          3
2  1000  2022-01-01  2022-01-01 00:10:00     A          4
3  1000  2022-01-01  2022-01-01 00:15:00     A          5
4  1000  2022-01-01  2022-01-01 00:00:00     B    0.98019
5  1000  2022-01-01  2022-01-01 00:05:00     B    0.82021
6  1000  2022-01-01  2022-01-01 00:10:00     B   0.549684
7  1000  2022-01-01  2022-01-01 00:15:00     B  0.0818311
Then I group by and try to select the first group:
melted.groupby(['id', 'date']).first()
that gives me this:
ts label value
id date
1000 2022-01-01 2022-01-01 A 4.0
but I would expect this output instead:
ts A B
id date
1000 2022-01-01 2022-01-01 00:00:00 4 0.980190
2022-01-01 2022-01-01 00:05:00 3 0.820210
2022-01-01 2022-01-01 00:10:00 4 0.549684
2022-01-01 2022-01-01 00:15:00 5 0.081831
What am I not getting? Or is this a bug? Also, why is the ts column converted to a date?
My bad! I thought first would return the first group, but it actually returns the first element of each group, as stated in the documentation for the pandas aggregation functions. Sorry folks, I was doing this late at night and could not think straight :/
To select the first group, I needed to use the get_group method.
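For example (a sketch against the melted frame above; the key is the (id, date) tuple):
melted.groupby(['id', 'date']).get_group((1000, '2022-01-01'))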

how to transpose an m x n dataframe into k x 2 form in pandas

I have an m x n dataframe such as the one below.
date1 amt1 date2 amt2
2021-01-02 120 1991-01-02 90
2021-01-03 100 1991-01-03 95
2021-01-04 110 1991-01-04 95
....
Is there any way to transpose it into a k x 2 dataframe like...
date amt
2021-01-02 120
2021-01-03 100
2021-01-04 110
...
1991-01-02 90
1991-01-03 95
1991-01-04 95
...
This can be done easily with reshape, although the rows come out in a slightly different order:
pd.DataFrame(df.to_numpy().reshape(-1, 2), columns=['date', 'amt'])
Output:
date amt
0 2021-01-02 120
1 1991-01-02 90
2 2021-01-03 100
3 1991-01-03 95
4 2021-01-04 110
5 1991-01-04 95
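If you would rather keep all of the first column pair's rows before the second pair's, one option is to stack the pairs explicitly (a sketch; the set_axis renaming is just one way to align the column names):
pd.concat(
    [df[['date1', 'amt1']].set_axis(['date', 'amt'], axis=1),
     df[['date2', 'amt2']].set_axis(['date', 'amt'], axis=1)],
    ignore_index=True
)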
reset_index, then use pd.wide_to_long:
df.reset_index(inplace=True)
pd.wide_to_long(df, stubnames=['date', 'amt'], i=['index'], j='id').reset_index(drop=True)
date amt
0 2021-01-02 120
1 2021-01-03 100
2 2021-01-04 110
3 1991-01-02 90
4 1991-01-03 95
5 1991-01-04 95

convert data from wide to long with sequential dates in postgresql

I have a data frame with dates like below:
id start_date end_date product supply_per_day
1 2020-03-01 2020-03-01 A 10
1 2020-03-01 2020-03-01 B 10
1 2020-03-01 2020-03-02 A 5
2 2020-02-28 2020-03-02 A 10
2 2020-03-01 2020-03-03 B 4
2 2020-03-02 2020-03-05 A 5
I want to make this data wide-to-long, like:
id date product supply_per_day
1 2020-03-01 A 10
1 2020-03-01 B 10
1 2020-03-01 A 5
1 2020-03-02 A 5
2 2020-02-28 A 10
2 2020-02-29 A 10
2 2020-03-01 A 10
2 2020-03-02 A 10
2 2020-03-01 B 4
2 2020-03-02 B 4
2 2020-03-03 B 4
2 2020-03-02 A 5
2 2020-03-03 A 5
2 2020-03-04 A 5
2 2020-03-05 A 5
Please give me some ideas.
For Oracle 12c and later, you can use:
SELECT t.id,
       d.dt,
       t.product,
       t.supply_per_day
FROM   table_name t
       OUTER APPLY (
         SELECT start_date + LEVEL - 1 AS dt
         FROM   DUAL
         CONNECT BY start_date + LEVEL - 1 <= end_date
       ) d
Which, for the sample data:
CREATE TABLE table_name ( id, start_date, end_date, product, supply_per_day ) AS
SELECT 1, DATE '2020-03-01', DATE '2020-03-01', 'A', 10 FROM DUAL UNION ALL
SELECT 1, DATE '2020-03-01', DATE '2020-03-01', 'B', 10 FROM DUAL UNION ALL
SELECT 1, DATE '2020-03-01', DATE '2020-03-02', 'A', 5 FROM DUAL UNION ALL
SELECT 2, DATE '2020-02-28', DATE '2020-03-02', 'A', 10 FROM DUAL UNION ALL
SELECT 2, DATE '2020-03-01', DATE '2020-03-03', 'B', 4 FROM DUAL UNION ALL
SELECT 2, DATE '2020-03-02', DATE '2020-03-05', 'A', 5 FROM DUAL;
Outputs:
ID  DT                   PRODUCT  SUPPLY_PER_DAY
1   2020-03-01 00:00:00  A        10
1   2020-03-01 00:00:00  B        10
1   2020-03-01 00:00:00  A        5
1   2020-03-02 00:00:00  A        5
2   2020-02-28 00:00:00  A        10
2   2020-02-29 00:00:00  A        10
2   2020-03-01 00:00:00  A        10
2   2020-03-02 00:00:00  A        10
2   2020-03-01 00:00:00  B        4
2   2020-03-02 00:00:00  B        4
2   2020-03-03 00:00:00  B        4
2   2020-03-02 00:00:00  A        5
2   2020-03-03 00:00:00  A        5
2   2020-03-04 00:00:00  A        5
2   2020-03-05 00:00:00  A        5
In Postgres you can use generate_series() for this:
select t.id, g.day::date as date, t.product, t.supply_per_day
from the_table t
cross join generate_series(t.start_date, t.end_date, interval '1 day') as g(day)
order by t.id, g.day
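Since the question mentions a data frame, here is the same explode-by-day reshape in pandas for comparison (a sketch, assuming the sample rows live in a dataframe df with datetime start_date and end_date columns):
import pandas as pd

# build the per-row run of days, then explode it into one row per day
df['date'] = df.apply(lambda r: pd.date_range(r['start_date'], r['end_date'], freq='D'), axis=1)
out = df.explode('date').drop(columns=['start_date', 'end_date'])
out = out[['id', 'date', 'product', 'supply_per_day']]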

Pandas groupby time and ID and aggregate

I am trying to calculate the sum of payments made in the 2nd half of the year minus the sum of payments made in the 1st half.
This is how the data may look:
ID date payment
1 1/1/2020 10
1 1/2/2020 11
1 1/3/2020 10
1 1/4/2020 10
1 1/5/2020 11
1 1/6/2020 10
1 1/7/2020 10
1 1/8/2020 11
1 1/9/2020 10
1 1/10/2020 32
1 1/11/2020 10
1 1/12/2020 12
2 1/1/2020 10
2 1/2/2020 10
2 1/3/2020 41
2 1/4/2020 10
2 1/5/2020 53
2 1/6/2020 10
2 1/7/2020 10
2 1/8/2020 44
2 1/9/2020 10
2 1/10/2020 2
2 1/11/2020 9
2 1/12/2020 5
I convert the date column to pandas datetime:
df.date = df.date.astype(str).str.slice(0, 10)
df.date = pd.to_datetime(df.date)
print(df.date.min(), df.date.max())
output: 2020-01-01 00:00:00 2020-12-01 00:00:00
Then I create time points and separate dataframes for the 1st and 2nd half of the year:
from datetime import datetime
from dateutil.relativedelta import relativedelta

observation_date = '2020-12-31'
observation_date = datetime.strptime(observation_date, '%Y-%m-%d')
observation_date = observation_date.date()
observation_date = pd.Timestamp(observation_date)
print(observation_date)
mo6_ago = observation_date - relativedelta(months=6)
mo6_ago = pd.Timestamp(mo6_ago)
print(mo6_ago)
mo6_ago_plus1 = observation_date - relativedelta(months=6) + relativedelta(days=1)
mo6_ago_plus1 = pd.Timestamp(mo6_ago_plus1)
print(mo6_ago_plus1)
mo12_ago = observation_date - relativedelta(months=12) + relativedelta(days=1)
mo12_ago = pd.Timestamp(mo12_ago)
print(mo12_ago)
output:
2020-12-31 00:00:00
2020-06-30 00:00:00
2020-07-01 00:00:00
2020-01-01 00:00:00
mask = (df['date'] >= mo12_ago) & (df['date'] <= mo6_ago)
first_half = df.loc[mask]
first_half = first_half[['ID','date','payment']]
print(first_half.date.min(),first_half.date.max())
output: 2020-01-01 00:00:00 2020-06-01 00:00:00
mask = (df['date'] >= mo6_ago_plus1) & (df['date'] <= observation_date)
sec_half = df.loc[mask]
sec_half = sec_half[['ID','date','payment']]
print(sec_half.date.min(),sec_half.date.max())
output: 2020-07-01 00:00:00 2020-12-01 00:00:00
Then I group and sum each half of the year and merge them into one dataframe:
sum_first_half = first_half.groupby(['ID'])['payment'].sum().reset_index()
sum_first_half = sum_first_half.rename(columns = {'payment':'payment_first_half'})
sum_sec_half = sec_half.groupby(['ID'])['payment'].sum().reset_index()
sum_sec_half = sum_sec_half.rename(columns = {'payment':'payment_sec_half'})
df_new = pd.merge(sum_first_half, sum_sec_half, how='outer', on='ID')
Finally, I subtract the two columns:
df_new['sec_minus_first'] = df_new['payment_sec_half'] -df_new['payment_first_half']
ID payment_first_half payment_sec_half sec_minus_first
1 62 85 23
2 134 80 -54
Is there a faster and more memory-efficient way of doing this?
Using datetime:
from datetime import datetime as dt
Convert date column to datetime:
df["date"] = pd.to_datetime(df["date"])
Split on a date of your choice, group by ID, sum each half, then subtract the halves:
df.loc[df['date'] >= dt(2020, 7, 1)].groupby("ID").sum() - df.loc[df['date'] < dt(2020, 7, 1)].groupby("ID").sum()
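A variant that avoids the two separate groupbys (a sketch; the half-year flag and the column names are assumptions, not part of the original answer):
import numpy as np

# label each row by which half of 2020 it falls in, then pivot the per-half sums
half = np.where(df['date'] < dt(2020, 7, 1), 'first_half', 'sec_half')
sums = df.groupby(['ID', half])['payment'].sum().unstack()
sums['sec_minus_first'] = sums['sec_half'] - sums['first_half']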

Calculate number of days from date time column to a specific date - pandas

I have a df as shown below.
df:
ID open_date limit
1 2020-06-03 100
1 2020-06-23 500
1 2019-06-29 300
1 2018-06-29 400
From the above I would like to calculate a column named age_in_days.
age_in_days is the number of days from open_date to 2020-06-30.
Expected output
ID open_date limit age_in_days
1 2020-06-03 100 27
1 2020-06-23 500 7
1 2019-06-29 300 367
1 2018-06-29 400 732
Make sure open_date is of datetime dtype, then subtract it from 2020-06-30:
df['open_date'] = pd.to_datetime(df.open_date)
df['age_in_days'] = (pd.Timestamp('2020-06-30') - df.open_date).dt.days
Out[209]:
ID open_date limit age_in_days
0 1 2020-06-03 100 27
1 1 2020-06-23 500 7
2 1 2019-06-29 300 367
3 1 2018-06-29 400 732