I have a pandas dataframe that looks like this:
ID date num
1 2018-03-28 3
1 2018-03-29 1
1 2018-03-30 4
1 2018-04-04 1
2 2018-04-03 1
2 2018-04-04 6
2 2018-04-10 3
2 2018-04-11 4
Created by the following code:
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2], 'date': ['2018-03-28',
'2018-03-29', '2018-03-30', '2018-04-04', '2018-04-03', '2018-04-04',
'2018-04-10', '2018-04-11'], 'num': [3,1,4,1,1,6,3,4]})
What I would like is to create a new column called 'maxnum' that is filled with the maximum value of num per ID for the date that is on that row and all earlier dates. This column would look like this:
ID date maxnum num
1 2018-03-28 3 3
1 2018-03-29 3 1
1 2018-03-30 4 4
1 2018-04-04 4 1
2 2018-04-03 1 1
2 2018-04-04 6 6
2 2018-04-10 6 3
2 2018-04-11 6 4
Does anyone know how I can program this column correctly and efficiently?
Thanks in advance!
Using cummax (assuming your dataframe is already ordered by date; if not, run the commented lines first):
#df.date=pd.to_datetime(df.date)
#df=df.sort_values('date')
df.groupby('ID').num.cummax()
Out[258]:
0 3
1 3
2 4
3 4
4 1
5 6
6 6
7 6
Name: num, dtype: int64
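To fill the requested maxnum column, assign the result back to the dataframe:
df['maxnum'] = df.groupby('ID').num.cummax()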
I have a data frame with three columns A, B, C, taking numeric values in {1,2}, {6,7}, and {11,12} respectively. I would like to see the following: for what fraction of the observed (A,B) pairs do we have both observations with C=11 and observations with C=12?
I start by entering the dataframe:
df = pd.DataFrame({"A": [1, 2, 1, 1, 2, 1, 1, 2], "B": [6,7,7,6,7,6,6,6], "C": [11,12,11,11,12,12,11,12]})
--------
A B C
0 1 6 11
1 2 7 12
2 1 7 11
3 1 6 11
4 2 7 12
5 1 6 12
6 1 6 11
7 2 6 12
Then I think I need to use groupby. I run:
g = df.groupby(["A", "B"])
g.C.value_counts()
-----------
A B C
1 6 11 3
12 1
7 11 1
2 6 12 1
7 12 2
Name: C, dtype: int64
This shows that we have one pair of (A,B) for which we have both a C=11 and a C=12, and 3 pairs of (A,B) for which we only have either C=11 or C=12. So I would like pandas to tell me that we have 25% of (A,B) pairs for which C takes both values and 75% for which it takes only one value.
How can I accomplish this? I would like to do so for a big data frame where I can't just eyeball it from the value_counts--this small dataframe is just to illustrate.
Thanks!
Pass normalize=True to value_counts:
out = df.groupby(["A", "B"]).C.value_counts(normalize=True)
Out[791]:
A B C
1 6 11 0.75
12 0.25
7 11 1.00
2 6 12 1.00
7 12 1.00
Name: C, dtype: float64
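If you want the 25% / 75% split itself rather than the per-pair proportions, one possible sketch is to count the distinct C values per (A,B) pair with nunique:
# True where an (A, B) pair has both C=11 and C=12
both = df.groupby(["A", "B"])["C"].nunique().eq(2)
both.mean()     # 0.25 -> fraction of pairs with both values
(~both).mean()  # 0.75 -> fraction of pairs with only one value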
I am looking to do a "rolling" .max() and .min() of column B, grouped by date (the values in column A). The trick is that the window should restart on every row and run to the end of that row's group, so I can't use something like df['MAX'] = df['B'].rolling(10).max().shift(-9) (the window must end where the group ends, and every group can have a different number of rows), and a plain groupby on column A is not enough either. Concretely: for row 1, column C is the max of rows 1-4 of column B; for row 2 it is the max of rows 2-4; for row 3 the max of rows 3-4; for row 4 the max of row 4 alone, and so on. I hope that makes sense - columns C and D are the desired results. Thank you all in advance.
A B C(max) D(min)
1 2016-01-01 0 7 0
2 2016-01-01 7 7 3
3 2016-01-01 3 4 3
4 2016-01-01 4 4 4
5 2016-01-02 2 5 1
6 2016-01-02 5 5 1
7 2016-01-02 1 1 1
8 2016-01-03 1 4 1
9 2016-01-03 3 4 2
10 2016-01-03 4 4 2
11 2016-01-03 2 2 2
# Reverse each group, take the running max/min, then reverse back so the
# cumulative statistic runs from each row to the end of its group
df['C_max'] = df.groupby('A')['B'].transform(lambda x: x[::-1].cummax()[::-1])
df['D_min'] = df.groupby('A')['B'].transform(lambda x: x[::-1].cummin()[::-1])
A B C(max) D(min) C_max D_min
1 2016-01-01 0 7 0 7 0
2 2016-01-01 7 7 3 7 3
3 2016-01-01 3 4 3 4 3
4 2016-01-01 4 4 4 4 4
5 2016-01-02 2 5 1 5 1
6 2016-01-02 5 5 1 5 1
7 2016-01-02 1 1 1 1 1
8 2016-01-03 1 4 1 4 1
9 2016-01-03 3 4 2 4 2
10 2016-01-03 4 4 2 4 2
11 2016-01-03 2 2 2 2 2
I have a large pandas dataframe df as:
Col1 Col2
2 4
3 5
I have a large list as:
['2020-08-01', '2021-09-01', '2021-11-01']
I am trying to achieve the following:
Col1 Col2 StartDate
2 4 8/1/2020
3 5 8/1/2020
2 4 9/1/2021
3 5 9/1/2021
2 4 11/1/2021
3 5 11/1/2021
Basically, tile the dataframe df while adding the elements of the list as a new column. I am not sure how to approach this.
Let's use a list comprehension with assign and pd.concat:
l = ['2020-08-01', '2021-09-01', '2021-11-01']
pd.concat([df.assign(startDate=i) for i in l], ignore_index=True)
Output:
Col1 Col2 startDate
0 2 4 2020-08-01
1 3 5 2020-08-01
2 2 4 2021-09-01
3 3 5 2021-09-01
4 2 4 2021-11-01
5 3 5 2021-11-01
You can try a combination of np.tile and np.repeat:
import numpy as np

lst = ['2020-08-01', '2021-09-01', '2021-11-01']
df.loc[np.tile(df.index, len(lst))].assign(StartDate=np.repeat(lst, len(df)))
Output:
Col1 Col2 StartDate
0 2 4 2020-08-01
1 3 5 2020-08-01
0 2 4 2021-09-01
1 3 5 2021-09-01
0 2 4 2021-11-01
1 3 5 2021-11-01
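If you prefer a fresh 0..n index like in the first answer's output, chain on reset_index:
(df.loc[np.tile(df.index, len(lst))]
   .assign(StartDate=np.repeat(lst, len(df)))
   .reset_index(drop=True))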
You can also cross join using merge after creating a df from the list:
l = ['2020-08-01', '2021-09-01', '2021-11-01']
(df.assign(k=1)
   .merge(pd.DataFrame({'StartDate': l, 'k': 1}), on='k')
   .sort_values('StartDate')
   .drop(columns='k'))
Col1 Col2 StartDate
0 2 4 2020-08-01
3 3 5 2020-08-01
1 2 4 2021-09-01
4 3 5 2021-09-01
2 2 4 2021-11-01
5 3 5 2021-11-01
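On pandas 1.2 or newer, the dummy k key can be avoided entirely with how='cross', a variant of the same cross join:
(df.merge(pd.DataFrame({'StartDate': l}), how='cross')
   .sort_values('StartDate'))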
I would use concat:
df = pd.DataFrame({'col1': [2,3], 'col2': [4, 5]})
dict_dfs = {k: df for k in ['2020-08-01', '2021-09-01', '2021-11-01']}
pd.concat(dict_dfs)
Then you can rename and clean the index (see the sketch after the output below).
col1 col2
2020-08-01 0 2 4
1 3 5
2021-09-01 0 2 4
1 3 5
2021-11-01 0 2 4
1 3 5
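For example, the cleanup could look like this (one possible sketch; rename_axis names the key level before it is moved into a column):
out = (pd.concat(dict_dfs)
         .rename_axis(['StartDate', None])   # name the outer (key) level
         .reset_index(level='StartDate')     # move the keys into a column
         .reset_index(drop=True))            # renumber the remaining index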
You can also use itertools; note that the desired order can then be restored with sort_values on column 1:
import itertools
df = pd.DataFrame([*itertools.product(df.index, l)]).set_index(0).join(df)
1 Col1 Col2
0 2020-08-01 2 4
0 2021-09-01 2 4
0 2021-11-01 2 4
1 2020-08-01 3 5
1 2021-09-01 3 5
1 2021-11-01 3 5
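To recover the order and column names from the question, a small cleanup sketch (column 1 holds the dates):
df = (df.sort_values(1)
        .rename(columns={1: 'StartDate'})
        .reset_index(drop=True))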
I have a dataframe:
np.random.seed(1)
df1 = pd.DataFrame({'day':[3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
'price':np.random.randint(1,30,11)})
day item price
0 3 1 6
1 4 1 12
2 4 2 13
3 4 2 9
4 5 1 10
5 5 2 12
6 5 3 6
7 5 3 16
8 5 4 1
9 6 3 17
10 6 4 2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day item
3 1 6
4 1 12
2 11
5 1 10
2 12
3 11
4 1
6 3 17
4 2
Name: price, dtype: int64
I want to take the trend from the groupby series and put it back into the dataframe's price column. The new price is the variation of each item's price with respect to the previous day's mean price:
day item price
0 3 1 nan
1 4 1 6
2 4 2 nan
3 4 2 nan
4 5 1 -2
5 5 2 1
6 5 3 nan
7 5 3 nan
8 5 4 nan
9 6 3 6
10 6 4 1
Please help me code the last step. A one- or two-line solution would be most helpful. As the actual dataframe is huge, I would like to avoid iteration.
Hope this helps!
#get the average values
mean_df=df1.groupby(['day','item'])['price'].mean().reset_index()
#rename columns
mean_df.columns=['day','item','average_price']
#sort by day and item in ascending order
mean_df=mean_df.sort_values(by=['day','item'])
#shift the mean price within each item to get the previous day's value
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
#combine with original df
df1=pd.merge(df1,mean_df,on=['day','item'])
#replace the price by the difference from the previous day's mean
df1['price']=df1['price']-df1['shifted_average_price']
#drop unwanted columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)
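Since a one- or two-liner was requested: roughly the same logic fits in a few lines (a sketch on the same df1 as above; join aligns the shifted means back onto the original rows):
# mean price per (day, item), then the previous day's mean within each item
m = df1.groupby(['day', 'item'])['price'].mean()
prev = m.groupby(level='item').shift(1).rename('prev_mean')
# align the previous means onto the original rows and take the difference
df1 = df1.join(prev, on=['day', 'item'])
df1['price'] = df1['price'] - df1.pop('prev_mean')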
In pandas I have a dataframe, produced with unstack(), that looks as follows:
mean median std
0 1 2 3 0 1 2 3 0 1 2 3
-------------------------------------------------------------------------------
2019-08-31 2 3 6 4 3 3 2 3 0.3 0.4 2 3
Before the unstack(), the frame is:
mean median std
--------------------------------------------
2019-08-31 0 2 3 0.3
1 3 3 0.4
2 6 2 2
3 4 3 3
2019-09-01 0
Which command can I use to remove the first level of the column index (mean / median / std) and make the frame look like this:
0 1 2 3 0 1 2 3 0 1 2 3
-------------------------------------------------------------------------------
2019-08-31 2 3 6 4 3 3 2 3 0.3 0.4 2 3
Start from your original instruction:
df = df.unstack()
This instruction unstacks level=-1, i.e. the last level of the row MultiIndex, and moves it into the column index.
Then run:
df.columns = df.columns.droplevel()
which drops level=0, i.e. the top level of the column MultiIndex (the mean / median / std labels).
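Equivalently, on pandas 0.24+ the two steps can be chained, because DataFrame.droplevel accepts an axis argument:
df = df.unstack().droplevel(0, axis=1)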