How to repeat a dataframe - python - pandas

I have a simple CSV dataframe as follows:
Date,Data
2000-01-31,9
2000-02-28,8
2000-03-31,7
2000-04-30,6
2000-05-31,5
2000-06-30,4
2000-07-31,3
2000-08-31,2
2000-09-30,1
2000-10-31,0
2000-11-30,11
2000-12-31,12
I would like to repeat this dataframe over 10 years, with the year stamp changing accordingly, as follows:
Date,Data
2000-01-31,9
2000-02-28,8
2000-03-31,7
2000-04-30,6
2000-05-31,5
2000-06-30,4
2000-07-31,3
2000-08-31,2
2000-09-30,1
2000-10-31,0
2000-11-30,11
2000-12-31,12
2001-01-31,9
2001-02-28,8
2001-03-31,7
2001-04-30,6
2001-05-31,5
2001-06-30,4
2001-07-31,3
2001-08-31,2
2001-09-30,1
2001-10-31,0
2001-11-30,11
2001-12-31,12
....
How can I do that?

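You can just use concat:
n=2
Newdf=pd.concat([df]*n,keys=range(n))
# the 'Y' timedelta unit is ambiguous and rejected by current pandas, so shift each copy by whole calendar years instead
Newdf['Date']=[d+pd.DateOffset(years=int(k)) for d,k in zip(pd.to_datetime(Newdf['Date']),Newdf.index.get_level_values(0))]
Newdf.reset_index(level=0,drop=True,inplace=True)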

Try:
df1 = pd.concat([df] * 10)
date_fix = pd.date_range(start='2000-01-31', freq='M', periods=len(df1))
df1['Date'] = date_fix
df1
[out]
Date Data
0 2000-01-31 9
1 2000-02-29 8
2 2000-03-31 7
3 2000-04-30 6
4 2000-05-31 5
5 2000-06-30 4
6 2000-07-31 3
... ... ...
5 2009-06-30 4
6 2009-07-31 3
7 2009-08-31 2
8 2009-09-30 1
9 2009-10-31 0
10 2009-11-30 11
11 2009-12-31 12
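A minimal, self-contained sketch of the repeat-and-shift idea, assuming the Date column is parsed as datetime and that each copy should keep the original day of the month (so 2000-02-28 becomes 2001-02-28, as in the question). The helper name repeat_years is hypothetical and only used for illustration.

```python
import io
import pandas as pd

csv_text = """Date,Data
2000-01-31,9
2000-02-28,8
2000-03-31,7
2000-04-30,6
2000-05-31,5
2000-06-30,4
2000-07-31,3
2000-08-31,2
2000-09-30,1
2000-10-31,0
2000-11-30,11
2000-12-31,12
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])


def repeat_years(frame, n_years):
    """Stack n_years copies of frame; copy k has its dates moved forward by k years."""
    copies = [
        frame.assign(Date=frame["Date"] + pd.DateOffset(years=k))
        for k in range(n_years)
    ]
    # ignore_index gives a clean 0..N-1 index instead of repeating 0..11
    return pd.concat(copies, ignore_index=True)


out = repeat_years(df, 10)
print(out.head(14))
```

Unlike the date_range approach, this keeps the day stamps exactly as given rather than snapping them to calendar month ends.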

Related

Pick row with key GROUP_FILENAME and add a new column with column name

I have a data frame which looks like this
GROUP_FIELD_NAME:BKR_ID
GROUP_FIELD_VALUE:T80
GROUP_FIELD_NAME:GROUP_OFFSET
GROUP_FIELD_VALUE:0
GROUP_FIELD_NAME:GROUP_LENGTH
GROUP_FIELD_VALUE:0
GROUP_FIELD_NAME:FIRM_ID
GROUP_FIELD_VALUE:KIZEM
GROUP_FILENAME:000000018.pdf
GROUP_FIELD_NAME:BKR_ID
GROUP_FIELD_VALUE:T80
GROUP_FIELD_VALUE:P
GROUP_FIELD_NAME:FI_ID
GROUP_FIELD_VALUE:
GROUP_FIELD_NAME:RUN_DTE
GROUP_FIELD_VALUE:20220208
GROUP_FIELD_NAME:FIRM_ID
GROUP_FIELD_VALUE:KIZEM
GROUP_FILENAME:000000019.pdf
It has three keys: GROUP_FIELD_NAME, GROUP_FIELD_VALUE and GROUP_FILENAME.
I am expecting a data frame with three columns: GROUP_FIELD_NAME, GROUP_FIELD_VALUE and GROUP_FILENAME.
You can use:
(df['col'].str.extract('GROUP_FILENAME:(.*)|([^:]+):(.*)')
   .set_axis(['GROUP_FILENAME', 'var', 'val'], axis=1)
   .assign(GROUP_FILENAME=lambda d: d['GROUP_FILENAME'].bfill(),
           n=lambda d: d.groupby(['GROUP_FILENAME', 'var']).cumcount()
          )
   .dropna(subset=['var'])
   .pivot(index=['GROUP_FILENAME', 'n'], columns='var', values='val')
   .droplevel(1).rename_axis(columns=None)
   .reset_index('GROUP_FILENAME')
)
Output:
GROUP_FILENAME GROUP_FIELD_NAME GROUP_FIELD_VALUE
0 000000018.pdf BKR_ID T80
1 000000018.pdf GROUP_OFFSET 0
2 000000018.pdf GROUP_LENGTH 0
3 000000018.pdf FIRM_ID KIZEM
4 000000019.pdf BKR_ID T80
5 000000019.pdf FI_ID P
6 000000019.pdf RUN_DTE
7 000000019.pdf FIRM_ID 20220208
8 000000019.pdf NaN KIZEM
Used input:
col
0 GROUP_FIELD_NAME:BKR_ID
1 GROUP_FIELD_VALUE:T80
2 GROUP_FIELD_NAME:GROUP_OFFSET
3 GROUP_FIELD_VALUE:0
4 GROUP_FIELD_NAME:GROUP_LENGTH
5 GROUP_FIELD_VALUE:0
6 GROUP_FIELD_NAME:FIRM_ID
7 GROUP_FIELD_VALUE:KIZEM
8 GROUP_FILENAME:000000018.pdf
9 GROUP_FIELD_NAME:BKR_ID
10 GROUP_FIELD_VALUE:T80
11 GROUP_FIELD_VALUE:P
12 GROUP_FIELD_NAME:FI_ID
13 GROUP_FIELD_VALUE:
14 GROUP_FIELD_NAME:RUN_DTE
15 GROUP_FIELD_VALUE:20220208
16 GROUP_FIELD_NAME:FIRM_ID
17 GROUP_FIELD_VALUE:KIZEM
18 GROUP_FILENAME:000000019.pdf
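To see why the chain above starts with extract and bfill, here is a small sketch of just those first steps on a shortened, made-up input (the six rows below are illustrative, not the full data). Because each GROUP_FILENAME line comes after the fields it belongs to, a backward fill is what attaches the file name to the preceding rows.

```python
import pandas as pd

# Shortened illustrative input: two groups, each closed by its file name
df = pd.DataFrame({"col": [
    "GROUP_FIELD_NAME:BKR_ID",
    "GROUP_FIELD_VALUE:T80",
    "GROUP_FILENAME:000000018.pdf",
    "GROUP_FIELD_NAME:FIRM_ID",
    "GROUP_FIELD_VALUE:KIZEM",
    "GROUP_FILENAME:000000019.pdf",
]})

# First alternative captures the file name; second captures key and value
parts = df["col"].str.extract(r"GROUP_FILENAME:(.*)|([^:]+):(.*)")
parts.columns = ["GROUP_FILENAME", "var", "val"]

# The file name is only present on the closing row of each group,
# so fill it backwards onto the rows above it
parts["GROUP_FILENAME"] = parts["GROUP_FILENAME"].bfill()
print(parts)
```

Rows where var is missing are the file-name lines themselves, which is why the full answer drops them with dropna(subset=['var']) before pivoting.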

Pandas, Replace values of a column with a variable (negative) if it is less than that variable, else keep the values as is

say:
m = 170000 , v = -(m/100)
{'01-09-2021': 631, '02-09-2021': -442, '08-09-2021': 6, '09-09-2021': 1528, '13-09-2021': 2042, '14-09-2021': 1098, '15-09-2021': -2092, '16-09-2021': -6718, '20-09-2021': -595, '22-09-2021': 268, '23-09-2021': -2464, '28-09-2021': 611, '29-09-2021': -1700, '30-09-2021': 4392}
I want to replace values in column 'Final' with v if the value is less than v, else keep the original value. I tried numpy.where, df.loc etc., but they didn't work.
You can use clip:
df['Final'] = df['Final'].clip(-1700)
print(df)
# Output:
Date Final
0 01-09-2021 631
1 02-09-2021 -442
2 08-09-2021 6
3 09-09-2021 1528
4 13-09-2021 2042
5 14-09-2021 1098
6 15-09-2021 -1700
7 16-09-2021 -1700
8 20-09-2021 -595
9 22-09-2021 268
10 23-09-2021 -1700
11 28-09-2021 611
12 29-09-2021 -1700
13 30-09-2021 4392
Or the classical np.where:
df['Final'] = np.where(df['Final'] < -1700, -1700, df['Final'])
Setup:
df = pd.DataFrame({'Date': d.keys(), 'Final': d.values()})
You can try:
df.loc[df['Final']<v, 'Final'] = v
Output:
Date Final
0 01-09-2021 631
1 02-09-2021 -442
2 08-09-2021 6
3 09-09-2021 1528
4 13-09-2021 2042
5 14-09-2021 1098
6 15-09-2021 -1700
7 16-09-2021 -1700
8 20-09-2021 -595
9 22-09-2021 268
10 23-09-2021 -1700
11 28-09-2021 611
12 29-09-2021 -1700
13 30-09-2021 4392
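As a rough check that the three suggestions agree, the sketch below builds the frame from the dict in the question, computes v from m instead of hard-coding -1700, and compares the results. The variable names clipped, where and masked are only illustrative.

```python
import numpy as np
import pandas as pd

m = 170000
v = -(m / 100)  # -1700.0, a float because of true division

d = {'01-09-2021': 631, '02-09-2021': -442, '08-09-2021': 6,
     '09-09-2021': 1528, '13-09-2021': 2042, '14-09-2021': 1098,
     '15-09-2021': -2092, '16-09-2021': -6718, '20-09-2021': -595,
     '22-09-2021': 268, '23-09-2021': -2464, '28-09-2021': 611,
     '29-09-2021': -1700, '30-09-2021': 4392}
df = pd.DataFrame({'Date': list(d.keys()), 'Final': list(d.values())})

# 1) clip: everything below v is raised to v
clipped = df['Final'].clip(lower=v)

# 2) np.where: pick v where the condition holds, the original value otherwise
where = pd.Series(np.where(df['Final'] < v, v, df['Final']), index=df.index)

# 3) boolean mask with .loc, done on a float copy so v fits the dtype
masked = df['Final'].astype(float)
masked.loc[masked < v] = v

print((clipped == where).all(), (clipped == masked).all())
```

Values exactly equal to v (such as the entry for 29-09-2021) are left untouched by all three, since the comparison is strictly less than.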

Find Max Gradient by Row in For Loop Pandas

I have a df of 15 x 4 and I'm trying to compute the maximum gradient in a North (N) minus South (S) direction for each row using a "S" and "N" value for each min or max in the rows below. I'm not sure that this is the best pythonic way to do this. My df "ms" looks like this:
minSlats minNlats maxSlats maxNlats
0 57839.4 54917.0 57962.6 56979.9
0 57763.2 55656.7 58120.0 57766.0
0 57905.2 54968.6 58014.3 57031.6
0 57796.0 54810.2 57969.0 56848.2
0 57820.5 55156.4 58019.5 57273.2
0 57542.7 54330.6 58057.6 56145.1
0 57829.8 54755.4 57978.8 56777.5
0 57796.0 54810.2 57969.0 56848.2
0 57639.4 54286.6 58087.6 56140.1
0 57653.3 56182.7 57996.5 57975.8
0 57665.1 56048.3 58069.7 58031.4
0 57559.9 57121.3 57890.8 58043.0
0 57689.7 55155.5 57959.4 56440.8
0 57649.4 56076.5 58043.0 58037.4
0 57603.9 56290.0 57959.8 57993.9
My loop structure looks like this:
J = len(ms)
grad = pd.DataFrame()
for i in range(J):
    if ms.maxSlats.iloc[i] > ms.maxNlats.iloc[i]:
        gr = ( ms.maxSlats.iloc[i] - ms.minNlats.iloc[i] ) * -1
        grad[gr] = [i+1, i]
    elif ms.maxNlats.iloc[i] > ms.maxSlats.iloc[i]:
        gr = ms.maxNlats.iloc[i] - ms.minSlats.iloc[i]
        grad[gr] = [i+1, i]
grad = grad.T # need to transpose
print(grad)
I obtain the correct answer but I'm wondering if there is a cleaner way to do this to obtain the same answer below:
grad.T
Out[317]:
0 1
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-3158.8 8 7
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
thank you,
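Use np.where to compute the gradient for all rows at once, then keep only the last row for each duplicated index value (the loop overwrites earlier columns with the same gradient, so this reproduces its behavior).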
grad = np.where(ms.maxSlats > ms.maxNlats, (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
df = pd.DataFrame({'A': pd.RangeIndex(1, len(ms)+1),
                   'B': pd.RangeIndex(0, len(ms))},
                  index=grad)
df = df[~df.index.duplicated(keep='last')]
>>> df
A B
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3158.8 8 7
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
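The vectorized gradient can be tried on its own with a few rows copied from the question's table. This is only a sketch: it keeps the gradient as a regular column rather than as the index, which is a design choice made here and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Three rows taken from the question's data (rows 1, 2 and 12)
ms = pd.DataFrame({
    "minSlats": [57839.4, 57763.2, 57559.9],
    "minNlats": [54917.0, 55656.7, 57121.3],
    "maxSlats": [57962.6, 58120.0, 57890.8],
    "maxNlats": [56979.9, 57766.0, 58043.0],
})

# South maximum larger: negative gradient from maxS down to minN.
# Otherwise: positive gradient from minS up to maxN.
grad = np.where(
    ms["maxSlats"] > ms["maxNlats"],
    -(ms["maxSlats"] - ms["minNlats"]),
    ms["maxNlats"] - ms["minSlats"],
)

result = ms.assign(grad=np.round(grad, 1))
print(result)
```

One difference from the loop: when maxSlats equals maxNlats the loop skips the row, whereas np.where sends it to the second branch.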

cumulative product for specific groups of observations in pandas

I have a dataset of the following type
Date ID window var
0 1998-01-28 X -5 8.500e-03
1 1998-01-28 Y -5 1.518e-02
2 1998-01-29 X -4 8.005e-03
3 1998-01-29 Y -4 7.905e-03
4 1998-01-30 X -3 -5.497e-03
... ... ... ...
3339 2016-12-19 Y 3 -4.365e-04
3340 2016-12-20 X 4 3.628e-03
3341 2016-12-20 Y 4 6.608e-03
3342 2016-12-21 X 5 -2.467e-03
3343 2016-12-21 Y 5 -2.651e-03
My aim is to calculate the cumulative product of the variable var according to the variable window. For every date, I have identified a window of 5 days around that date (the variable window goes from -5 to 5). Now I want to calculate the cumulative product within the window that belongs to a specific date. For example, the first date (1998-01-28) has a window value of -5 and thus represents the starting point for the calculation of the cumprod. I want a new variable called cumprod which is exactly var on the date on which window is -5, then the product of the values of var at -5 and -4, and so on until window is equal to 5. This defines the value of cumprod for the first group of dates, where every group is defined by consecutive dates such that window starts at -5 and ends at 5. I then want to repeat this for every group of dates. I will therefore obtain something like
Date ID window var cumprod
0 1998-01-28 X -5 8.500e-03 8.500e-03
1 1998-01-28 Y -5 1.518e-02 1.518e-02
2 1998-01-29 X -4 8.005e-03 6.80425e-05
3 1998-01-29 Y -4 7.905e-03 0.00011999790000000002
4 1998-01-30 X -3 -5.497e-03
... ... ... ...
3339 2016-12-19 Y 3 -4.365e-04
3340 2016-12-20 X 4 3.628e-03
3341 2016-12-20 Y 4 6.608e-03
3342 2016-12-21 X 5 -2.467e-03
3343 2016-12-21 Y 5 -2.651e-03
where I gave an example of cumprod for the first 2 dates.
How could I achieve this? I was thinking of finding a way to attach an identifier to every group of dates and then running some sort of cumprod() method using .groupby(group_identifier). I can't think of how to do it though. Would it be possible to simplify it by using a rolling function on window? Any other kind of approach is very welcome.
I suggest the following:
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({"Date": pd.date_range("1998-01-28", freq="D", periods=22),
                   "window": np.concatenate([np.arange(-5,6,1),np.arange(-5,6,1)]),
                   "var": np.random.randint(1,10,22)
                   })
My df is similar to yours:
Date window var
0 1998-01-28 -5 3
1 1998-01-29 -4 3
2 1998-01-30 -3 7
3 1998-01-31 -2 2
4 1998-02-01 -1 4
5 1998-02-02 0 7
6 1998-02-03 1 2
7 1998-02-04 2 1
8 1998-02-05 3 2
9 1998-02-06 4 1
10 1998-02-07 5 1
11 1998-02-08 -5 4
12 1998-02-09 -4 5
Then I create a grouping variable and transform var using cumprod:
# My df is already sorted by Date given the way I created it,
# but I add this to make sure yours is sorted by date
df = df.sort_values("Date")
df["group"] = (df["window"] == -5).cumsum()
df = pd.concat([df, df.groupby("group")["var"].transform("cumprod")], axis=1)
And the result is :
Date window var group var
0 1998-01-28 -5 3 1 3
1 1998-01-29 -4 3 1 9
2 1998-01-30 -3 7 1 63
3 1998-01-31 -2 2 1 126
4 1998-02-01 -1 4 1 504
5 1998-02-02 0 7 1 3528
6 1998-02-03 1 2 1 7056
7 1998-02-04 2 1 1 7056
8 1998-02-05 3 2 1 14112
9 1998-02-06 4 1 1 14112
10 1998-02-07 5 1 1 14112
11 1998-02-08 -5 4 2 4
12 1998-02-09 -4 5 2 20
13 1998-02-10 -3 1 2 20
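The question's data also has an ID column (X and Y), which the example above leaves out. A possible extension, sketched here on made-up numbers, is to count window starts separately per ID and then take the cumulative product within each (ID, block) pair. Treating each ID independently is an assumption about what the asker wants, and the column name block is hypothetical.

```python
import pandas as pd

# Tiny illustrative frame: two IDs interleaved, second window starts at row 4
df = pd.DataFrame({
    "ID":     ["X", "Y", "X", "Y", "X", "Y"],
    "window": [-5, -5, -4, -4, -5, -5],
    "var":    [2, 3, 4, 5, 6, 7],
})

# A new block begins wherever window == -5; count those starts per ID
starts = (df["window"] == -5).astype(int)
df["block"] = starts.groupby(df["ID"]).cumsum()

# Cumulative product of var inside each (ID, block) group,
# stored under its own name to avoid a second column called "var"
df["cumprod"] = df.groupby(["ID", "block"])["var"].cumprod()
print(df)
```

As in the answer, this relies on the rows being sorted by date within each ID before the blocks are numbered.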

How to substitute a column in a pandas dataframe with a series?

Let's have a dataframe df and a series s1 in pandas
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000,1000))
s1 = pd.Series(range(0,10000))
How can I modify df so that column 42 becomes equal to s1?
How can I modify df so that the columns between 42 and 442 become equal to s1?
I would like to know the simplest way to do that but also a way to do that in place.
I think you first need a Series with the same length as the DataFrame, here 20:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(20,10))
#print (df)
s1 = pd.Series(range(0,20))
#print (s1)
#set column by Series
df[8] = s1
#set Series to range of columns
cols = df.loc[:, 3:5].columns
df[cols] = pd.concat([s1] * len(cols), axis=1)
print (df)
0 1 2 3 4 5 6 7 8 9
0 -0.668129 -0.498210 0.618576 0 0 0 0.301966 0.449483 0 -0.315231
1 -2.015971 -1.130231 -1.111846 1 1 1 1.915676 0.920348 1 1.157552
2 -0.106208 -0.088752 -0.971485 2 2 2 -0.366948 -0.301085 2 1.141635
3 -1.309529 -0.274381 0.864837 3 3 3 0.670294 0.086347 3 -1.212503
4 0.120359 -0.358880 1.199936 4 4 4 0.389167 1.201631 4 0.445432
5 -1.031109 0.067133 -1.213451 5 5 5 -0.636896 0.013802 5 1.726135
6 -0.491877 0.254206 -0.268168 6 6 6 0.671070 -0.633645 6 1.813671
7 0.080433 -0.882443 1.152671 7 7 7 0.249225 1.385407 7 1.010374
8 0.307274 0.806150 0.071719 8 8 8 1.133853 -0.789922 8 -0.286098
9 -0.767206 1.094445 1.603907 9 9 9 0.083149 2.322640 9 0.396845
10 -0.740018 -0.853377 -2.039522 10 10 10 0.764962 -0.472048 10 -0.071255
11 -0.238565 1.077573 2.143252 11 11 11 1.542892 2.572560 11 -0.803516
12 -0.139521 -0.992107 -0.892619 12 12 12 0.259612 -0.661760 12 -1.508976
13 -1.077001 0.381962 0.205388 13 13 13 -0.023986 -1.293080 13 1.846402
14 -0.714792 -0.728496 -0.127079 14 14 14 0.606065 -2.320500 14 -0.992798
15 -0.127113 -0.563313 -0.101387 15 15 15 0.647325 -0.816023 15 -0.309938
16 -1.151304 -1.673719 0.074930 16 16 16 -0.392157 0.736714 16 1.142983
17 -1.247396 -0.471524 1.173713 17 17 17 -0.005391 0.426134 17 0.781832
18 -0.325111 0.579248 0.040363 18 18 18 0.361926 0.036871 18 0.581314
19 -1.057501 -1.814500 0.109628 19 19 19 -1.738658 -0.061883 19 0.989456
Timings
Other solutions are shown below, but the concat solution seems to be the fastest:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(1000,1000))
#print (df)
s1 = pd.Series(range(0,1000))
#print (s1)
#set column by Series
df[8] = s1
#set Series to range of columns
cols = df.loc[:, 42:442].columns
print (df)
In [310]: %timeit df[cols] = np.broadcast_to(s1.values[:, np.newaxis], (len(df),len(cols)))
1 loop, best of 3: 202 ms per loop
In [311]: %timeit df[cols] = np.repeat(s1.values[:, np.newaxis], len(cols), axis=1)
1 loop, best of 3: 208 ms per loop
In [312]: %timeit df[cols] = np.array([s1.values]*len(cols)).transpose()
10 loops, best of 3: 175 ms per loop
In [313]: %timeit df[cols] = pd.concat([s1] * len(cols), axis=1)
10 loops, best of 3: 53.8 ms per loop
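For reference, a compact, runnable sketch of the two assignments discussed above (one column, then a label range of columns), using the NumPy repeat variant from the timings. The small shapes and the seeded generator are only there to keep the example quick and reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((20, 10)))
s1 = pd.Series(range(20))

# Single column: the Series is aligned to df on the index
df[8] = s1

# Range of columns: .loc slicing by label is inclusive, so this is 3, 4 and 5
cols = df.loc[:, 3:5].columns

# Build a (rows x len(cols)) block in which every column is a copy of s1
block = np.repeat(s1.to_numpy()[:, np.newaxis], len(cols), axis=1)
df[cols] = block

print(df.head())
```

Note that assigning a Series matches rows by index label, while assigning a NumPy array is purely positional, so the array variants assume df and s1 are in the same row order.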