DataFrame transformation: from multiple columns to one [duplicate] - pandas

This question already has answers here: How do I melt a pandas dataframe? (3 answers)
How can I transform, with Pandas & NumPy, this DataFrame:
into a DataFrame like:
    | Name  | Year | Nb
----+-------+------+------
0   | A     | 2021 | 5.0
1   | A     | 2020 | 4.0
2   | A     | 2019 | 10.0
3   | A     | 2018 | 4.0
4   | A     | 2017 | 4.0
...
k   | A-Jay | 2021 | 5.0
k+1 | A-Jay | 2020 | 6.0
...
l+i | A.J.  | 2019 | 3.0
m   | Aaban | 2021 | 4.0
m+1 | Aaban | 2020 | 4.0
...
?

Here's a way (probably the most elegant) using melt():
out = (df
       .melt(id_vars='Name', var_name='Year', value_name='Nb')
       .dropna()
       .sort_values(['Name', 'Year'], ascending=[True, False])
       .reset_index(drop=True))
Here's another way, this one using stack():
out = (df
       .set_index('Name')
       .stack()
       .reset_index()
       .rename(columns={'level_1': 'Year', 0: 'Nb'})
       .sort_values(['Name', 'Year'], ascending=[True, False])
       .reset_index(drop=True))
Sample input:
  Name  2021  2022  2023  2024
0    a   NaN   4.0  None   0.0
1    b   2.0   NaN  None   NaN
2    c   3.0   6.0  None   0.0
Output:
  Name  Year   Nb
0    a  2024  0.0
1    a  2022  4.0
2    b  2021  2.0
3    c  2024  0.0
4    c  2022  6.0
5    c  2021  3.0
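For reference, here is a minimal way to build the sample input above (a sketch; the all-None 2023 column is an assumption read off the display):
import numpy as np
import pandas as pd

# Wide-format sample input: one column per year, NaN/None where no value exists
df = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    '2021': [np.nan, 2.0, 3.0],
    '2022': [4.0, np.nan, 6.0],
    '2023': [None, None, None],
    '2024': [0.0, np.nan, 0.0],
})
Either pipeline above, applied to this frame, produces the long-format output shown.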

Related

How to calculate a metric for current week/current year vs same week number/last year with a window function?

I'm calculating some metrics and trying to achieve this with a window function. First I calculated the WEEKLY_PROFIT delta for the current week/CURRENT year vs the last week/CURRENT year.
How can I apply a window function to calculate WEEKLY_PROFIT for the current week/CURRENT year vs the same week number/LAST year?
SELECT TR_YR, TR_WEEK, WEEKLY_PROFIT,
       COALESCE((WEEKLY_PROFIT / NULLIF(LAG(WEEKLY_PROFIT, 1, 0)
           OVER (PARTITION BY TR_YR ORDER BY TR_WEEK), 0)) - 1, 0) AS DELTA_PROFIT_WEEKLY_VS_CY
FROM base_metrics
GROUP BY TR_YR, TR_WEEK
The table after calculating DELTA_PROFIT_WEEKLY_VS_CY:
| TR_YR | TR_WEEK | WEEKLY_PROFIT | DELTA_PROFIT_WEEKLY_VS_CY |
|-------|---------|---------------|---------------------------|
| 2019  |       1 |         400.0 |                       0.0 |
| 2020  |       1 |        1000.0 |                       0.0 |
| 2020  |       2 |        1500.0 |                       0.5 |
| 2020  |       3 |         700.0 |                     -0.53 |
Here is what I expect after calculating DELTA_PROFIT_WEEKLY_VS_LY (WEEKLY_PROFIT for current week/CURRENT year vs same week number/LAST year):
| TR_YR | TR_WEEK | WEEKLY_PROFIT | DELTA_PROFIT_WEEKLY_VS_CY | DELTA_PROFIT_WEEKLY_VS_LY |
|-------|---------|---------------|---------------------------|---------------------------|
| 2019  |       1 |         400.0 |                       0.0 |                       0.0 |
| 2020  |       1 |        1000.0 |                       0.0 |                       1.5 |
| 2020  |       2 |        1500.0 |                       0.5 |                       0.0 |
| 2020  |       3 |         700.0 |                     -0.53 |                       0.0 |
I feel this is easier done with a left join. You can handle NULLs with COALESCE later if you wish:
select a.tr_yr,
       a.tr_week,
       a.weekly_profit,
       a.weekly_profit / c.weekly_profit as profit_wow,
       a.weekly_profit / b.weekly_profit as profit_yoy
from t a
left join t b on a.tr_yr = b.tr_yr + 1 and a.tr_week = b.tr_week
left join t c on a.tr_yr = c.tr_yr and a.tr_week = c.tr_week + 1
order by a.tr_yr, a.tr_week;
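For readers coming from the pandas questions on this page, here is a rough pandas sketch of the same self-join idea (the frame t and its sample values are assumptions taken from the tables above):
import pandas as pd

# Sample data matching the tables above (assumed)
t = pd.DataFrame({
    'tr_yr':         [2019, 2020, 2020, 2020],
    'tr_week':       [1, 1, 2, 3],
    'weekly_profit': [400.0, 1000.0, 1500.0, 700.0],
})

# Shift the join keys instead of the values: year+1 matches "last year,
# same week"; week+1 matches "same year, previous week".
ly = t.assign(tr_yr=t['tr_yr'] + 1).rename(columns={'weekly_profit': 'profit_ly'})
lw = t.assign(tr_week=t['tr_week'] + 1).rename(columns={'weekly_profit': 'profit_lw'})

out = (t.merge(ly, on=['tr_yr', 'tr_week'], how='left')
        .merge(lw, on=['tr_yr', 'tr_week'], how='left'))
# The "- 1" turns the ratio into the delta convention used in the question
out['profit_yoy'] = out['weekly_profit'] / out['profit_ly'] - 1
out['profit_wow'] = out['weekly_profit'] / out['profit_lw'] - 1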
Your code looks pretty good, except for the GROUP BY and PARTITION BY clauses:
SELECT TR_YR, TR_WEEK, WEEKLY_PROFIT,
       COALESCE((WEEKLY_PROFIT / NULLIF(LAG(WEEKLY_PROFIT, 1, 0)
           OVER (PARTITION BY TR_WEEK ORDER BY TR_YR), 0)) - 1, 0) AS DELTA_PROFIT_WEEKLY_VS_CY
FROM base_metrics;

Get the difference of two columns with an offset/rolling/shift of 1

Stupid question: I have two columns A and B and would like to create new_col, which is the difference between the current B and the previous A. "Previous" means the row before the current row. How can this be achieved (maybe even with a variable offset)?
Target:
df
| A | B  | new_col    |
|---|----|------------|
| 1 | 2  | NaN (or 2) |
| 3 | 4  | 3          |
| 5 | 10 | 7          |
Pseudo code:
new_col[0] = B[0] - 0
new_col[1] = B[1] - A[0]
new_col[2] = B[2] - A[1]
Use Series.shift:
df['new_col'] = df['B'] - df['A'].shift()
   A   B  new_col
0  1   2      NaN
1  3   4      3.0
2  5  10      7.0
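For a variable offset, pass the number of rows to shift (a small sketch; shift(n) looks n rows back and leaves the first n entries NaN):
n = 2  # offset in rows; n=1 reproduces the default behaviour above
df['new_col'] = df['B'] - df['A'].shift(n)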

How to get value in a column based on condition in pandas data frame pivot table?

I have a MySQL table as shown below:
ID     article  price    promo_price  delivery_days  stock  received_on
17591  03D/6H   3082.00  1716.21      30             0      2019-03-20
29315  03D/6H   3082.00  1716.21      26             0      2019-03-24
47796  03D/6H   3082.00  1716.21      24             0      2019-03-25
22016  L1620S   685.00   384.81       0              3      2019-03-20
35043  L1620S   685.00   384.81       0              2      2019-03-24
53731  L1620S   685.00   384.81       0              2      2019-03-25
I created a pivot table to monitor the stock data.
md = df.pivot_table(
    values='stock',
    index=['article', 'price', 'promo_price', 'delivery_days'],
    columns='received_on',
    aggfunc=np.sum)
dates = md.columns.tolist()
dates.sort(reverse=True)
md = md[dates]
This is the result:
+---------------------------------+--------------+--------------+--------------+
|                                 | 2019-03-25   | 2019-03-24   | 2019-03-20   |
|---------------------------------+--------------+--------------+--------------|
| ('03D/6H', 3082.0, 1716.21, 24) | 0            | nan          | nan          |
| ('03D/6H', 3082.0, 1716.21, 26) | nan          | 0            | nan          |
| ('03D/6H', 3082.0, 1716.21, 30) | nan          | nan          | 0            |
| ('L1620S', 685.0, 384.81, 0)    | 2            | 2            | 3            |
+---------------------------------+--------------+--------------+--------------+
How do I filter the rows to get the price, promo_price and delivery_days of an article based on the most recent stock received date?
For example: I want the stock info for all the days, but the price, promo_price and delivery_days of only 2019-03-25, as shown below:
+---------------------------------+--------------+--------------+--------------+
|                                 | 2019-03-25   | 2019-03-24   | 2019-03-20   |
|---------------------------------+--------------+--------------+--------------|
| ('03D/6H', 3082.0, 1716.21, 24) | 0            | nan          | nan          |
| ('L1620S', 685.0, 384.81, 0)    | 2            | 2            | 3            |
+---------------------------------+--------------+--------------+--------------+
EDIT:
If there is no change in price, promo_price and delivery_days, I get the result as expected. But if any of the values change, I get multiple rows for the same article. Article L1620S comes out as expected, but article 03D/6H results in three rows.
You can use:
df['received_on'] = pd.to_datetime(df['received_on'])
md = df.pivot_table(
    values='stock',
    index=['article', 'price', 'promo_price', 'delivery_days'],
    columns='received_on',
    aggfunc=np.sum)
# sort columns in descending order
md = md.sort_index(axis=1, ascending=False)
# remove rows that are missing in the first (most recent) column
md = md.dropna(subset=[md.columns[0]])
# another solution:
# md = md[md.iloc[:, 0].notna()]
print(md)
received_on 2019-03-25 2019-03-24 2019-03-20
article price promo_price delivery_days
03D/6H 3082.0 1716.21 24 0.0 NaN NaN
L1620S 685.0 384.81 0 2.0 2.0 3.0
EDIT: First filter by the first index level, then select the first row by position:
md = md.sort_index(axis=1, ascending=False)
idx = pd.IndexSlice
md1 = md.loc[idx['03D/6H', :, :], :].iloc[[0]]
print(md1)
received_on 2019-03-25 2019-03-24 2019-03-20
article price promo_price delivery_days
03D/6H 3082.0 1716.21 24 0.0 NaN NaN
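If you need the most recent row for every article at once, rather than hard-coding '03D/6H', one possible generalization is a groupby on the article level (a sketch; it assumes the row you want comes first within each article group after the sort):
# take the first row per article from the MultiIndex level 'article'
md1 = md.groupby(level='article').head(1)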

pandas pivot onto values

Given a dataframe
import pandas as pd

df = pd.DataFrame([[1, 11, 0], [1, 12, 1], [2, 21, 0], [2, 22, 1]])
df.columns = ['Key', 'Value', 'PivotOn']
pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
The pivot gives me columns 0 and 1 from the column 'PivotOn'. But I would like to always pivot onto the values 0, 1 and 2, even if there is no row with PivotOn = 2 (just produce NaN for it).
I cannot modify the original dataframe, so I'd want something like:
pivoted = df.pivot(index='Key', columns=[0, 1, 2], values='Value')
where it always produces the 3 columns 0, 1 and 2, with column 2 filled with NaNs.
Assume PivotOn has three unique values 0, 1, 2
df = pd.DataFrame([[1, 11, 0], [1, 12, 1], [2, 21, 0], [2, 22, 2]])
df.columns = ['Key', 'Value', 'PivotOn']
df
+---+-----+-------+---------+
| | Key | Value | PivotOn |
+---+-----+-------+---------+
| 0 | 1 | 11 | 0 |
| 1 | 1 | 12 | 1 |
| 2 | 2 | 21 | 0 |
| 3 | 2 | 22 | 2 |
+---+-----+-------+---------+
And say you need to include columns 2, 3 and 4 (2 may or may not be present in the original df, so this generalizes). Then:
import numpy as np
import pandas as pd

expected = {2, 3, 4}
res = list(expected - set(df.PivotOn.unique()))
if res:  # at least one expected value is missing from PivotOn
    new_df = pd.DataFrame({'Key': np.nan, 'Value': np.nan, 'PivotOn': res},
                          index=range(df.shape[0], df.shape[0] + len(res)))
    ndf = pd.concat([df, new_df], sort=False)
    pivoted = ndf.pivot(index='Key', columns='PivotOn', values='Value').dropna(how='all')
else:
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
pivoted
+---------+------+------+------+-----+-----+
| PivotOn | 0 | 1 | 2 | 3 | 4 |
+---------+------+------+------+-----+-----+
| Key | | | | | |
| 1.0 | 11.0 | 12.0 | NaN | NaN | NaN |
| 2.0 | 21.0 | NaN | 22.0 | NaN | NaN |
+---------+------+------+------+-----+-----+
You might try this if all you need is a column 2 filled with NaN when it does not exist in your dataframe:
import numpy as np

def no_col_2(df):
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
    if 2 not in df['PivotOn'].values:  # check the column's values, not the index
        pivoted[2] = np.nan
    return pivoted

pivoted = no_col_2(df)
print(pivoted)
PivotOn 0 1 2
Key
1 11 12 NaN
2 21 22 NaN
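A shorter alternative is to reindex the pivoted columns (a sketch using DataFrame.reindex, which inserts any missing columns filled with NaN):
# reindex guarantees columns 0, 1 and 2 exist, filling absent ones with NaN
pivoted = (df.pivot(index='Key', columns='PivotOn', values='Value')
             .reindex(columns=[0, 1, 2]))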

How to keep the last value in Pandas without removing the rows

I am working on a dataset in which I want to attribute the last action of a user to a certain goal. In the process I arrive at the table below.
table
date | action_id | u_id | goal
2016-01-08 | CUID22 | 586758 | 'Goal#1'
2017-03-04 | CUID45 | 586758 | 'Goal#1'
2018-09-01 | CUID30 | 586758 | 'Goal#1'
How can I remove/replace the first two u_id and goal values, whilst keeping the rows, to arrive at the table below?
table
date | action_id | u_id | goal
2016-01-08 | CUID22 | NaN | NaN
2017-03-04 | CUID45 | NaN | NaN
2018-09-01 | CUID30 | 586758 | 'Goal#1'
I believe you need duplicated:
cols = ['u_id','goal']
df.loc[df.duplicated(cols, keep='last'), cols] = np.nan
Or:
cols = ['u_id','goal']
df[cols] = df[cols].mask(df.duplicated(cols, keep='last'))
print(df)
         date action_id      u_id    goal
0  2016-01-08    CUID22       NaN     NaN
1  2017-03-04    CUID45       NaN     NaN
2  2018-09-01    CUID30  586758.0  Goal#1
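A minimal end-to-end sketch with the sample data from the question (values taken from the table above):
import pandas as pd

df = pd.DataFrame({
    'date': ['2016-01-08', '2017-03-04', '2018-09-01'],
    'action_id': ['CUID22', 'CUID45', 'CUID30'],
    'u_id': [586758, 586758, 586758],
    'goal': ['Goal#1', 'Goal#1', 'Goal#1'],
})

cols = ['u_id', 'goal']
# keep='last' marks every duplicate except the last occurrence,
# so all but the final row of each (u_id, goal) pair get masked to NaN
df[cols] = df[cols].mask(df.duplicated(cols, keep='last'))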