Repeat pandas rows based on content of a list - pandas

I have a large pandas dataframe df as:
Col1 Col2
2 4
3 5
I have a large list as:
['2020-08-01', '2021-09-01', '2021-11-01']
I am trying to achieve the following:
Col1 Col2 StartDate
2 4 8/1/2020
3 5 8/1/2020
2 4 9/1/2021
3 5 9/1/2021
2 4 11/1/2021
3 5 11/1/2021
Basically, I want to tile the dataframe df while adding the elements of the list as a new column. I am not sure how to approach this.

Let's use a list comprehension with assign and pd.concat:
l = ['2020-08-01', '2021-09-01', '2021-11-01']
pd.concat([df.assign(StartDate=i) for i in l], ignore_index=True)
Output:
Col1 Col2 StartDate
0 2 4 2020-08-01
1 3 5 2020-08-01
2 2 4 2021-09-01
3 3 5 2021-09-01
4 2 4 2021-11-01
5 3 5 2021-11-01
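If you need real datetime values rather than strings (e.g. to display them as 8/1/2020), you can convert afterwards; a small sketch, assuming the result is stored in out:
out = pd.concat([df.assign(StartDate=i) for i in l], ignore_index=True)
out['StartDate'] = pd.to_datetime(out['StartDate'])  # strings -> datetime64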

You can try a combination of np.tile and np.repeat:
import numpy as np

lst = ['2020-08-01', '2021-09-01', '2021-11-01']
df.loc[np.tile(df.index, len(lst))].assign(StartDate=np.repeat(lst, len(df)))
Output:
Col1 Col2 StartDate
0 2 4 2020-08-01
1 3 5 2020-08-01
0 2 4 2021-09-01
1 3 5 2021-09-01
0 2 4 2021-11-01
1 3 5 2021-11-01
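Note that the original index labels repeat (0, 1, 0, 1, ...); if you want a fresh RangeIndex, a small sketch:
(df.loc[np.tile(df.index, len(lst))]
 .assign(StartDate=np.repeat(lst, len(df)))
 .reset_index(drop=True))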

You can also cross join using merge after creating a df from the list:
l = ['2020-08-01', '2021-09-01', '2021-11-01']
(df.assign(k=1).merge(pd.DataFrame({'StartDate': l, 'k': 1}), on='k')
 .sort_values('StartDate').drop(columns='k'))
Col1 Col2 StartDate
0 2 4 2020-08-01
3 3 5 2020-08-01
1 2 4 2021-09-01
4 3 5 2021-09-01
2 2 4 2021-11-01
5 3 5 2021-11-01
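On pandas 1.2+ you can skip the dummy key entirely with how='cross' (same idea, sketched):
df.merge(pd.DataFrame({'StartDate': l}), how='cross').sort_values('StartDate')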

I would use concat:
df = pd.DataFrame({'col1': [2,3], 'col2': [4, 5]})
dict_dfs = {k: df for k in ['2020-08-01', '2021-09-01', '2021-11-01']}
pd.concat(dict_dfs)
Then you can rename and clean the index, as sketched after the output below.
col1 col2
2020-08-01 0 2 4
1 3 5
2021-09-01 0 2 4
1 3 5
2021-11-01 0 2 4
1 3 5
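A minimal sketch of that cleanup (the level names below are my own choice):
out = pd.concat(dict_dfs)
out = out.rename_axis(['StartDate', None]).reset_index(level=0)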

You could also use itertools.product; the rows can then be reordered with sort_values on the date column (see the sketch after the output):
import itertools
df = pd.DataFrame([*itertools.product(df.index, l)]).set_index(0).join(df)
1 Col1 Col2
0 2020-08-01 2 4
0 2021-09-01 2 4
0 2021-11-01 2 4
1 2020-08-01 3 5
1 2021-09-01 3 5
1 2021-11-01 3 5
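A sketch of that reordering; itertools.product leaves the date column labeled 1, so it helps to rename it first:
df = df.rename(columns={1: 'StartDate'}).sort_values('StartDate').reset_index(drop=True)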

Related

Mumbojumbo .rolling() .max() .groupby() combination in python pandas

I am looking to do a "rolling" .max()/.min() of column B grouped by date (the column A values). The trick is that it should restart on every row, so I cannot use anything like df['MAX'] = df['B'].rolling(10).max().shift(-9) (it needs to end where the group ends, and every group can have a different number of rows), nor a simple groupby on column A (I need the rolling max/min to start at each row and end where that group ends: for row 1, column C is the max of rows 1-4 of column B; for row 2, the max of rows 2-4; for row 3, the max of rows 3-4; for row 4, the max of row 4 alone, and so on). I hope that makes sense; columns C and D are the desired results. Thank you all in advance.
A B C(max) D(min)
1 2016-01-01 0 7 0
2 2016-01-01 7 7 3
3 2016-01-01 3 4 3
4 2016-01-01 4 4 4
5 2016-01-02 2 5 1
6 2016-01-02 5 5 1
7 2016-01-02 1 1 1
8 2016-01-03 1 4 1
9 2016-01-03 3 4 2
10 2016-01-03 4 4 2
11 2016-01-03 2 2 2
# Reverse each group, take the cumulative max/min, then reverse back:
# this turns "max so far" into "max from this row to the end of the group".
df['C_max'] = df.groupby('A')['B'].transform(lambda x: x[::-1].cummax()[::-1])
df['D_min'] = df.groupby('A')['B'].transform(lambda x: x[::-1].cummin()[::-1])
A B C(max) D(min) C_max D_min
1 2016-01-01 0 7 0 7 0
2 2016-01-01 7 7 3 7 3
3 2016-01-01 3 4 3 4 3
4 2016-01-01 4 4 4 4 4
5 2016-01-02 2 5 1 5 1
6 2016-01-02 5 5 1 5 1
7 2016-01-02 1 1 1 1 1
8 2016-01-03 1 4 1 4 1
9 2016-01-03 3 4 2 4 2
10 2016-01-03 4 4 2 4 2
11 2016-01-03 2 2 2 2 2

How to unstack the first row index of the MultiIndex in Pandas

In pandas I have a dataframe, produced with unstack(), that looks like this:
mean median std
0 1 2 3 0 1 2 3 0 1 2 3
-------------------------------------------------------------------------------
2019-08-31 2 3 6 4 3 3 2 3 0.3 0.4 2 3
Before the unstack(), the frame is:
mean median std
--------------------------------------------
2019-08-31 0 2 3 0.3
1 3 3 0.4
2 6 2 2
3 4 3 3
2019-09-01 0
Which unstack() call can I use to remove the first row index and make the frame look like this:
0 1 2 3 0 1 2 3 0 1 2 3
-------------------------------------------------------------------------------
2019-08-31 2 3 6 4 3 3 2 3 0.3 0.4 2 3
Start from your original instruction:
df = df.unstack()
This unstacks level=-1, i.e. the last level of the row MultiIndex,
and moves it into the column index.
Then run:
df.columns = df.columns.droplevel()
which drops level=0, i.e. the top level of the column MultiIndex
(the mean / median / std labels).
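A minimal end-to-end sketch, with a small frame built to mirror the shapes above (the values are illustrative):
import pandas as pd

idx = pd.MultiIndex.from_product([['2019-08-31'], [0, 1, 2, 3]])
df = pd.DataFrame({'mean': [2, 3, 6, 4],
                   'median': [3, 3, 2, 3],
                   'std': [0.3, 0.4, 2, 3]}, index=idx)

df = df.unstack()                    # inner row level -> columns
df.columns = df.columns.droplevel()  # drop the mean/median/std level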

Pandas: Obtaining the maximum of a column based on other column values

I have a pandas dataframe that looks like this:
ID date num
1 2018-03-28 3
1 2018-03-29 1
1 2018-03-30 4
1 2018-04-04 1
2 2018-04-03 1
2 2018-04-04 6
2 2018-04-10 3
2 2018-04-11 4
Created by the following code:
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2], 'date': ['2018-03-28',
'2018-03-29', '2018-03-30', '2018-04-04', '2018-04-03', '2018-04-04',
'2018-04-10', '2018-04-11'], 'num': [3,1,4,1,1,6,3,4]})
What I would like is to create a new column called 'maxnum' that is filled with the maximum value of num per ID for the date that is on that row and all earlier dates. This column would look like this:
ID date maxnum num
1 2018-03-28 3 3
1 2018-03-29 3 1
1 2018-03-30 4 4
1 2018-04-04 4 1
2 2018-04-03 1 1
2 2018-04-04 6 6
2 2018-04-10 6 3
2 2018-04-11 6 4
Does anyone know how I can program this column correctly and efficiently?
Thanks in advance!
Using cummax (assuming your dataframe is already ordered by date; if not, run the commented lines first):
#df.date=pd.to_datetime(df.date)
#df=df.sort_values('date')
df.groupby('ID').num.cummax()
Out[258]:
0 3
1 3
2 4
3 4
4 1
5 6
6 6
7 6
Name: num, dtype: int64
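To store the result as the requested column, assign it back (a one-line sketch):
df['maxnum'] = df.groupby('ID').num.cummax()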

Pandas count values inside dataframe

I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count the values so to make df like this:
total
1 2
3 2
4 1
5 2
8 2
is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=True)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
level_1
0
1 2
3 2
4 1
5 2
8 2
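To get the requested total header from this route, you could rename afterwards (a small sketch):
df1.groupby(0).count().rename(columns={'level_1': 'total'}).rename_axis(None)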
Another alternative is to use stack followed by value_counts, then convert the result to a frame and finally sort the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique(..., return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
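The columns come out labeled 0 and 1; a small sketch to name them and index by value:
(pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)),
              columns=['value', 'total'])
 .set_index('value'))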

Pandas: Add new column with several values to groupby dataframe

For my dataframe, I want to add a new column for every unique value in another column. The new column consists of several datetime entries that every unique value of the other column should receive.
Example:
Original Df:
ID
1
2
3
New Column DF:
Date
2015/01/01
2015/02/01
2015/03/01
Resulting Df:
ID Date
1 2015/01/01
2015/02/01
2015/03/01
2 2015/01/01
2015/02/01
2015/03/01
3 2015/01/01
2015/02/01
2015/03/01
I tried to stick to this solution: https://stackoverflow.com/a/12394122/3856569
But it gives me the following error: Length of values does not match length of index
Anyone has a simple solution to do that? Thanks a lot!
UPDATE: replicating ids 6 times:
In [172]: %paste
import io
import pandas as pd

data = """\
id
1
2
3
"""
df = pd.read_csv(io.StringIO(data))
# repeat each ID 6 times
df = pd.DataFrame(df['id'].tolist()*6, columns=['id'])
start_date = pd.to_datetime('2015-01-01')
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
             .transform(lambda x: pd.date_range(start_date,
                                                freq='1D',
                                                periods=len(x)))
df.sort_values(by=['id','date'])
## -- End pasted text --
Out[172]:
id date
0 1 2015-01-01
3 1 2015-01-02
6 1 2015-01-03
9 1 2015-01-04
12 1 2015-01-05
15 1 2015-01-06
1 2 2015-01-01
4 2 2015-01-02
7 2 2015-01-03
10 2 2015-01-04
13 2 2015-01-05
16 2 2015-01-06
2 3 2015-01-01
5 3 2015-01-02
8 3 2015-01-03
11 3 2015-01-04
14 3 2015-01-05
17 3 2015-01-06
OLD more generic answer:
prepare sample DF:
start_date = pd.to_datetime('2015-01-01')
data = """\
id
1
2
2
3
1
2
3
2
1
"""
df = pd.read_csv(io.StringIO(data))
In [200]: df
Out[200]:
id
0 1
1 2
2 2
3 3
4 1
5 2
6 3
7 2
8 1
Solution:
In [201]: %paste
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
             .transform(lambda x: pd.date_range(start_date,
                                                freq='1D',
                                                periods=len(x)))
## -- End pasted text --
In [202]: df
Out[202]:
id date
0 1 2015-01-01
1 2 2015-01-01
2 2 2015-01-02
3 3 2015-01-01
4 1 2015-01-02
5 2 2015-01-03
6 3 2015-01-02
7 2 2015-01-04
8 1 2015-01-03
Sorted:
In [203]: df.sort_values(by='id')
Out[203]:
id date
0 1 2015-01-01
4 1 2015-01-02
8 1 2015-01-03
1 2 2015-01-01
2 2 2015-01-02
5 2 2015-01-03
7 2 2015-01-04
3 3 2015-01-01
6 3 2015-01-02
A rather straightforward numpy approach, making use of repeat and tile:
import numpy as np
import pandas as pd
N = 3 # arbitrary number of IDs/dates
ID = np.arange(N) + 1
dates = pd.date_range('20160101', periods=N)
df = pd.DataFrame({'ID': np.repeat(ID, N),
                   'dates': np.tile(dates, N)})
Resulting DataFrame:
In [1]: df
Out[1]:
ID dates
0 1 2016-01-01
1 1 2016-01-02
2 1 2016-01-03
3 2 2016-01-01
4 2 2016-01-02
5 2 2016-01-03
6 3 2016-01-01
7 3 2016-01-02
8 3 2016-01-03
Update
Assuming you already have a DataFrame of IDs, as pointed out by MaxU, you can tile the IDs and repeat the dates so that every ID gets each date:
df = pd.DataFrame({'ID': np.tile(df['ID'], N),
                   'dates': np.repeat(dates, N)})
# now df needs sorting
df = df.sort_values(by=['ID', 'dates'])
Resulting DataFrame:
In [5]: df
Out[5]:
ID dates
0 1 2016-01-01
3 1 2016-01-02
6 1 2016-01-03
1 2 2016-01-01
4 2 2016-01-02
7 2 2016-01-03
2 3 2016-01-01
5 3 2016-01-02
8 3 2016-01-03