Insert items from MultiIndexed dataframe into regular dataframe based on time - pandas

I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex. The first level goes from 0 to 23 (hours) and the second level goes from 0 to 55 in steps of 5 (minutes). In other words, it holds one value per 5-minute increment of the day.
5min_Ret
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
...
I want to create a column in the original dataframe, ES, that holds the appropriate '5min_Ret' value for each row's hour/5-minute combination.
I've tried multiple things: looping over rows, looking for some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output adds a new column called '5min_ret' to the original dataframe, in which each row gets the value for the matching hour/5-minute pair from the smaller dataframe:
Price Day Hour num_obs med abs_med Ret 5min_ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx

I think one way is to use merge on hour and minute. First create a column 'min' in ES from the DatetimeIndex:
ES['min'] = ES.index.minute
Now you can merge with your MultiIndexed dataframe containing the column '5min_Ret' (which I named df_multi):
ES = ES.merge(df_multi.reset_index(), left_on=['Hour', 'min'],
              right_on=['level_0', 'level_1'], how='left')
Here you merge 'Hour' and 'min' from ES against 'level_0' and 'level_1', the columns that reset_index creates from the MultiIndex of df_multi; how='left' keeps every row of ES.
You should get a new column in ES named '5min_Ret' with the values you are looking for. You can drop the column 'min' if you don't need it anymore with ES = ES.drop('min', axis=1).
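For completeness, here is a minimal end-to-end sketch. One caveat: merging on columns discards ES's DatetimeIndex, so the sketch resets it first and restores it afterwards (assuming the index is named 'Date', as in the question):

import pandas as pd

# a minimal sketch, assuming ES has a DatetimeIndex named 'Date' plus a 'Hour'
# column, and df_multi has a two-level integer index (hour, minute)
ES['min'] = ES.index.minute
ES = (ES.reset_index()                      # keep 'Date' as a column during the merge
        .merge(df_multi.reset_index(),
               left_on=['Hour', 'min'],
               right_on=['level_0', 'level_1'],
               how='left')
        .set_index('Date')                  # restore the DatetimeIndex
        .drop(columns=['min', 'level_0', 'level_1']))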

Related

How to concatenate a dataframe to a multiindex main dataframe along columns

I have tried a few answers but was not able to get the desired result in my case.
I am working with stock data.
I have a list: ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
For every stock in the list I get an output which contains trades and related info. It looks something like this:
BUY SELL profits rel_profits
0 2004-01-13 2004-01-27 -44.200012 -0.094606
1 2004-02-05 2004-02-16 18.000000 0.044776
2 2005-03-08 2005-03-11 25.000000 0.048077
3 2005-03-31 2005-04-01 13.000000 0.025641
4 2005-10-11 2005-10-26 -20.400024 -0.025342
5 2005-10-31 2005-11-04 67.000000 0.095578
6 2006-05-22 2006-06-05 -55.100098 -0.046693
7 2007-03-06 2007-03-14 3.000000 0.001884
8 2007-03-19 2007-03-28 41.500000 0.028222
9 2007-07-31 2007-08-14 69.949951 0.038224
10 2008-01-24 2008-02-05 25.000000 0.013055
11 2009-11-04 2009-11-05 50.000000 0.031250
12 2010-12-10 2010-12-15 63.949951 0.018612
13 2011-02-02 2011-02-15 -53.050049 -0.015543
14 2011-09-30 2011-10-07 74.799805 0.018181
15 2015-12-09 2015-12-18 -215.049805 -0.019523
16 2016-01-18 2016-02-01 -475.000000 -0.046005
17 2016-11-16 2016-11-30 -1217.500000 -0.096877
18 2018-03-26 2018-04-02 0.250000 0.000013
19 2018-05-22 2018-05-25 250.000000 0.012626
20 2018-06-05 2018-06-12 101.849609 0.005361
21 2018-09-25 2018-10-10 -2150.000000 -0.090717
22 2021-01-27 2021-02-03 500.150391 0.024638
23 2021-06-30 2021-07-07 393.000000 0.016038
24 2021-08-12 2021-08-13 840.000000 0.035279
25 NaN NaN -1693.850281 0.995277
# note: every dataframe will have a last row with NaN values in the BUY and SELL columns
# each dataframe has a different number of rows
Now I tried to add an extra level of index to this dataframe like this (where symbol is the name of the stock from the given list, e.g. for 3MINDIA.NS.csv the symbol is 3MINDIA):
trades.columns = pd.MultiIndex.from_product([[symbol], trades.columns])
After this I tried to concatenate each trades dataframe generated in the loop to a main dataframe using:
result_df = pd.concat([result_df, trades], axis=1)
# I am trying to do this so that whenever I call result_df[symbol]
# I can see the trade dates for that particular symbol.
But I get a result_df with a lot of NaN values, because each trades dataframe has a different number of rows.
Is there any way I can combine the trades dataframes along the columns, with the stock symbol as the higher-level index, without getting all the NaN values in my result_df?
(screenshot of the resulting result_df omitted)
So I found a way to get what I wanted.
First I added this code inside the loop:
trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
After this I used pd.concat again on result_df and trades:
# desired result
result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)
And BAM!!! This is exactly what I wanted
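Putting it all together, here is a sketch of the full loop; load_trades() is a hypothetical helper standing in for whatever produces each per-stock trades dataframe shown above:

import pandas as pd

files = ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
result_df = pd.DataFrame()
for fname in files:
    symbol = fname.split('.')[0]             # '3MINDIA.NS.csv' -> '3MINDIA'
    trades = load_trades(fname)              # hypothetical helper producing the trades dataframe
    # add the symbol as an outer row-index level named 'Stocks'...
    trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
    # ...and stack along rows, so differing row counts never create NaNs
    result_df = pd.concat([result_df, trades], axis=0)

result_df.loc['3MINDIA'] then returns just that stock's trades.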

SUM in dataframe of rows that have the same date and ADD new column

My code starts this way: it takes data from the Italian open-data vaccination dataset (URL in the code below), and I want to extract all the rows where "fascia_anagrafica" equals "20-29". In Italian, "fascia_anagrafica" means "age range". That was relatively simple, as you see below, and I dropped some unimportant columns.
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/italia/covid19-opendata-vaccini/master/dati/somministrazioni-vaccini-latest.csv')
df = df[df["fascia_anagrafica"] == "20-29"]
df01 = df.drop(columns=["fornitore", "area", "sesso_maschile", "sesso_femminile",
                        "seconda_dose", "pregressa_infezione", "dose_aggiuntiva",
                        "codice_NUTS1", "codice_NUTS2", "codice_regione_ISTAT", "nome_area"])
Now the dataframe looks like this (screenshot omitted): for every date there is the "20-29" age range, and every row has a value "prima_dose", which stands for "first dose".
Now the problem:
If you take the date "2020-12-27", you will notice that it is repeated about 20 times (with 20 different values), since Italy has 21 regions; the same applies to the other dates. Unfortunately there are not always 21 rows, because some regions did not report values on some days, so the dataframe is NOT periodic.
I want to add a column to the dataframe that holds, for each row, the sum of all the "prima_dose" values sharing that row's date. An example:
Date          prima_dose   sum_column
2020-8-9      1            13    <- 13 = 1+3+4+5, the total for 2020-8-9
2020-8-9      3            13
2020-8-9      4            13
2020-8-9      5            13
2020-8-10     2            8     <- 8 = 2+5+1, the total for 2020-8-10
2020-8-10     5            8
2020-8-10     1            8
thanks!
If you just want to sum all the values of 'prima_dose' for each date and get the result in a new dataframe, you could use groupby.sum():
result = df01.groupby('data_somministrazione')['prima_dose'].sum().reset_index()
Prints:
>>> result
data_somministrazione prima_dose
0 2020-12-27 700
1 2020-12-28 171
2 2020-12-29 87
3 2020-12-30 486
4 2020-12-31 2425
.. ... ...
289 2021-10-12 11583
290 2021-10-13 12532
291 2021-10-14 15347
292 2021-10-15 13689
293 2021-10-16 9293
[294 rows x 2 columns]
This changes the structure of your current dataframe and returns a unique row per date.
If you want to add a new column to your existing dataframe without altering its structure, you should use groupby.transform():
df01['prima_dose_per_date'] = df01.groupby('data_somministrazione')['prima_dose'].transform('sum')
Prints:
>>> df01
data_somministrazione fascia_anagrafica prima_dose prima_dose_per_date
0 2020-12-27 20-29 2 700
7 2020-12-27 20-29 9 700
12 2020-12-27 20-29 60 700
17 2020-12-27 20-29 59 700
23 2020-12-27 20-29 139 700
... ... ... ...
138475 2021-10-16 20-29 533 9293
138484 2021-10-16 20-29 112 9293
138493 2021-10-16 20-29 0 9293
138502 2021-10-16 20-29 529 9293
138515 2021-10-16 20-29 0 9293
[15595 rows x 4 columns]
This will keep the current structure of your dataframe and return a new column with the sum of prima_dose per each date.
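As a quick sanity check, here is a self-contained demo of the transform approach on the toy numbers from the question:

import pandas as pd

# the toy example from the question: four rows on 2020-8-9, three on 2020-8-10
df01 = pd.DataFrame({
    'data_somministrazione': ['2020-8-9'] * 4 + ['2020-8-10'] * 3,
    'prima_dose': [1, 3, 4, 5, 2, 5, 1],
})
# transform('sum') broadcasts each date's total back onto that date's rows
df01['sum_column'] = df01.groupby('data_somministrazione')['prima_dose'].transform('sum')
print(df01)   # 13 on every 2020-8-9 row, 8 on every 2020-8-10 row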

How could I remove duplicates if duplicates mean less than 30 days apart?

(using sql or pandas)
I want to delete records if the date difference between two records is less than 30 days, but the first record of each ID must remain.
#example
ROW ID DATE
1 A 2020-01-01 -- first
2 A 2020-01-03
3 A 2020-01-31
4 A 2020-02-05
5 A 2020-02-28
6 A 2020-03-09
7 B 2020-03-06 -- first
8 B 2020-05-07
9 B 2020-06-02
#expected results
ROW ID DATE
1 A 2020-01-01
4 A 2020-02-05
6 A 2020-03-09
7 B 2020-03-06
8 B 2020-05-07
ROWS 2 and 3 are within 30 days of ROW 1
ROW 5 is within 30 days of ROW 4
ROW 9 is within 30 days of ROW 8
This task cannot be solved with vectorized methods alone. The cause is that after a row is recognized as a duplicate, that row "does not count" when you check further rows. E.g. after the rows 2020-01-03 and 2020-01-31 are deleted (as "too close" to the previous kept row), the 2020-02-05 row should be kept, because the distance to the previous kept row (2020-01-01) is now big enough.
So I came up with a solution based on a "function with memory":
def isDupl(elem):
    if isDupl.prev is None:
        isDupl.prev = elem
        return False
    dDiff = (elem - isDupl.prev).days
    rv = dDiff <= 30
    if not rv:
        isDupl.prev = elem
    return rv
This function should be invoked for each DATE in the current group (with the same ID), but before that, isDupl.prev must be set to None.
So the function to apply to each group of rows is:
def isDuplGrp(grp):
    isDupl.prev = None
    return grp.DATE.apply(isDupl)
And to get the expected result, run:
df[~(df.groupby('ID').apply(isDuplGrp).reset_index(level=0, drop=True))]
(you may save it back to df).
The result is:
ROW ID DATE
0 1 A 2020-01-01
3 4 A 2020-02-05
5 6 A 2020-03-09
6 7 B 2020-03-06
7 8 B 2020-05-07
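To make this reproducible, here is a small driver that reconstructs the example data (a sketch; it assumes the two functions above are already defined):

import pandas as pd

df = pd.DataFrame({
    'ROW': range(1, 10),
    'ID': ['A'] * 6 + ['B'] * 3,
    'DATE': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-31',
                            '2020-02-05', '2020-02-28', '2020-03-09',
                            '2020-03-06', '2020-05-07', '2020-06-02']),
})
# build the per-group "is duplicate" mask, drop the group level, then invert it
mask = df.groupby('ID').apply(isDuplGrp).reset_index(level=0, drop=True)
print(df[~mask])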
And finally, a remark about the other solution:
It contains rows:
3 4 A 2020-02-05
4 5 A 2020-02-28
which are only 23 days apart, so this solution is wrong.
The same pertains to rows:
5 A 2020-02-28
6 A 2020-03-09
which are also too close in time.
You can try this:
Convert DATE to datetime64
Get the first date of each group: df.groupby('ID')['DATE'].transform('first')
Filter to keep only dates at least 30 days after the first date
Append the first date of each group back to the dataframe
Code:
df['DATE'] = pd.to_datetime(df['DATE'])
df1 = df[(df['DATE'] - df.groupby('ID')['DATE'].transform('first')) >= pd.Timedelta(30, unit='D')]
df1 = pd.concat([df1, df.groupby('ID', as_index=False).agg('first')]).sort_values(by=['ID', 'DATE'])
print(df1)
ROW ID DATE
0 1 A 2020-01-01
2 3 A 2020-01-31
3 4 A 2020-02-05
4 5 A 2020-02-28
5 6 A 2020-03-09
1 7 B 2020-03-06
7 8 B 2020-05-07
8 9 B 2020-06-02

How to calculate time delta in pandas dataframe?

ip app device os channel click_time is_attributed
0 83230 3 1 33 888 2017-11-06 14:32:21 0
1 17357 3 1 19 379 2017-11-06 14:33:34 0
2 35810 3 1 13 379 2017-11-06 14:34:12 0
3 45745 14 1 33 888 2017-11-06 14:34:52 0
4 161007 3 1 13 379 2017-11-06 14:35:08 0
Here is the dataframe, and I want to add a column that holds the time delta (in seconds) since the previous row matching a specified condition.
For example, take os-channel as the identifier: the timedelta in line 3 (os=33 & channel=888) should be the gap since the record where os=33 & channel=888 was last seen, which is line 0. So the timedelta should be the gap between 2017-11-06 14:34:52 and 2017-11-06 14:32:21. If there is no earlier os=33 & channel=888 row, the outcome should be NaN.
So how can I realize this in pandas?
Assuming click_time is already datetime:
df.groupby(["os", "channel"]).click_time.diff()
To create a new column:
df.assign(click_diff=df.groupby(["os", "channel"]).click_time.diff())
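Since the question asks for seconds, here is a runnable sketch on the sample rows, with a .dt.total_seconds() conversion added:

import pandas as pd

df = pd.DataFrame({
    'os':      [33, 19, 13, 33, 13],
    'channel': [888, 379, 379, 888, 379],
    'click_time': pd.to_datetime(['2017-11-06 14:32:21', '2017-11-06 14:33:34',
                                  '2017-11-06 14:34:12', '2017-11-06 14:34:52',
                                  '2017-11-06 14:35:08']),
})
# diff() within each (os, channel) group; first occurrences give NaT -> NaN
df['click_diff'] = df.groupby(['os', 'channel'])['click_time'].diff().dt.total_seconds()
print(df)   # line 3 gets 151.0 seconds (14:34:52 - 14:32:21)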

How do I sort a column by targeting a specific number within each cell?

I would like to use pandas to sort a specific column by date (more specifically, by the year). However, the year is buried within a bunch of other numbers. How do I target just the two digits that I need?
In the example below, I want to sort this column by the year digits [16, 14, 15, ...] rather than by all the numbers in the row.
3/18/16 11:46
6/19/14 14:58
7/27/15 14:22
8/3/15 12:59
2/20/13 12:33
9/27/16 12:08
7/27/15 14:22
Given a dataframe like this,
date
0 3/18/16
1 6/19/14
2 7/27/15
3 8/3/15
4 2/20/13
5 9/27/16
6 7/27/15
You can convert the date column to datetime format and then sort.
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by = 'date')
The resulting dataframe
date
4 2013-02-20
1 2014-06-19
2 2015-07-27
6 2015-07-27
3 2015-08-03
0 2016-03-18
5 2016-09-27
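If the values still carry the time part, as in the question, the same approach works; here is a sketch with an explicit format string (assuming month/day/two-digit-year) to avoid ambiguous parsing:

import pandas as pd

df = pd.DataFrame({'date': ['3/18/16 11:46', '6/19/14 14:58', '7/27/15 14:22',
                            '8/3/15 12:59', '2/20/13 12:33', '9/27/16 12:08',
                            '7/27/15 14:22']})
# an explicit format avoids day-first/month-first ambiguity (assumed M/D/YY here)
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y %H:%M')
df = df.sort_values(by='date')
print(df)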